
Massive data loss in MySQL 5.5 with master-master replication.

If you put two MySQL servers into master<->master replication and then change the port of one of the servers, data magically disappears from the first master server.

I’ve been able to reproduce this (twice), both with a mixed (MySQL 5.1 <-> MySQL 5.5) setup and with a pure Ubuntu 12.04, MySQL 5.5 <-> MySQL 5.5 setup.

This issue has been reported to MySQL as bug #66260.

Look:

Note: in the example below, only a script inserting data was running,
and this was a database in a virtual machine dedicated to this test, so there is no
chance of unexpected DELETE commands.

mysql> select count(*) from prva;
+----------+
| count(*) |
+----------+
|    12683 |
+----------+
1 row in set (0.01 sec)
mysql> stop slave;
Query OK, 0 rows affected (0.23 sec)
mysql> select count(*) from prva;
+----------+
| count(*) |
+----------+
|    12689 |
+----------+
1 row in set (0.01 sec)
mysql> change master to master_port=3369;
Query OK, 0 rows affected (0.23 sec)
mysql> select count(*) from prva;
+----------+
| count(*) |
+----------+
|    12706 |
+----------+
1 row in set (0.00 sec)
mysql> start slave;
Query OK, 0 rows affected (0.01 sec)
mysql> select count(*) from prva;
+----------+
| count(*) |
+----------+
|      155 |
+----------+
1 row in set (0.00 sec)

Learn how to reproduce it after the fold:

How to reproduce:

Set up mysql on the first system, create /etc/mysql/conf.d/replication.cnf

[mysqld] 
server-id = 10
auto_increment_offset = 1
auto_increment_increment = 10
log_bin = mysql-bin
log-slave-updates
replicate-same-server-id = 0
binlog-ignore-db=mysql
replicate-do-db = testis 
replicate-do-db = testaltv

Create a table for testing:

CREATE TABLE `prva` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `stamp` datetime DEFAULT NULL,
 `local` varchar(32) DEFAULT NULL,
 `host` varchar(32) DEFAULT NULL,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Create a script that will continuously write data to the database:

#!/usr/bin/perl -w
use strict;
use DBI;

my $dsn      = "DBI:mysql:database=testis;host=localhost";
my $user     = 'root';
my $password = '';
my $host     = `hostname`; chomp $host;

# Die on any DB error instead of failing silently
my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });
my $sth = $dbh->prepare("insert into prva values (NULL, now(), ?, ?)");

# Insert one row per second, tagged with the local time and hostname
while (1) {
 $sth->execute(scalar localtime(time), $host);
 sleep 1;
}

You can leave it running for some time, or not; it doesn’t seem to matter.
Now, set up the second host, and establish master-slave replication between them.

Create a mysql server, put the following in /etc/mysql/conf.d/replication.cnf
[mysqld]
server-id=20
auto_increment_offset = 2 
auto_increment_increment= 10
log-bin = mysql-bin
# Also log queries received from our master (for circular replication)
log-slave-updates
replicate-same-server-id = 0

Transfer the database content to “prime” the slave (id:20):

on id:10

 create user repl@'192.168.1.20' identified by 'a';
 grant replication slave on *.* to repl@'192.168.1.20';
 mysqldump --opt --single-transaction --master-data --databases testis testaltv > /tmp/dump20120808.sql

on id:20 (after copying /tmp/dump20120808.sql over from id:10)

 mysql> change master to master_host='192.168.1.10', master_user='repl', master_password='a';
 $ mysql < dump20120808.sql   (from the shell; the dump carries the binlog coordinates thanks to --master-data)
 mysql> start slave;

Verify that the slave replication is running correctly.
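The usual way to verify this (plain MySQL, nothing specific to this setup) is SHOW SLAVE STATUS on id:20; both replication threads should say Yes:

```sql
-- On id:20
SHOW SLAVE STATUS\G
-- Expect, among other fields:
--   Slave_IO_Running: Yes
--   Slave_SQL_Running: Yes
--   Seconds_Behind_Master: 0 (or a small number)
```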

Leave it running for a while, long enough that you will see an obvious difference between the table size when the replication was set up and when the bug occurred.

Now, establish master/master replication between the two servers:

On id:20, do

create user repl2@'192.168.1.10' identified by 'a';
grant replication slave on *.* to repl2@'192.168.1.10';
flush tables with read lock;
show master status;
unlock tables;
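The file name and position reported by SHOW MASTER STATUS are what the next step on id:10 needs. The output looks roughly like this (the file name and position shown are illustrative placeholders, yours will differ):

```sql
mysql> show master status;
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.000003 |      107 |              |                  |
+------------------+----------+--------------+------------------+
```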

On id:10 do

change master to master_host='192.168.1.20', master_user='repl2', master_password='a', master_log_file='<File from show master status>', master_log_pos=<Position from show master status>;
start slave;

Verify that the master/master replication is now working. You can now start the insert script on id:20, and you should see the rows appear on the id:10 side.

Again, leave the replication to run for a while, so the difference in the number of entries is obvious when it happens.

Now, say that you got orders from management to change the default port on id:20.

On id:20 edit /etc/mysql/my.cnf and change both port entries to 3369.

Restart the mysql service on id:20. This will break the replication: it will still proceed 10->20, but not the other way around, because id:10 cannot reconnect to id:20 on the new port.
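At this point SHOW SLAVE STATUS on id:10 should reflect the broken link; the IO thread keeps retrying the old port (the exact error text varies by version):

```sql
-- On id:10
SHOW SLAVE STATUS\G
-- Expect something like:
--   Slave_IO_Running: Connecting
--   Slave_SQL_Running: Yes
--   Last_IO_Error: error connecting to master 'repl2@192.168.1.20:3306' ...
```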

Now, on id:10, 
mysql> select count(*) from prva;
+----------+
| count(*) |
+----------+
|    12683 |
+----------+
1 row in set (0.01 sec)
mysql> stop slave;
Query OK, 0 rows affected (0.23 sec)
mysql> select count(*) from prva;
+----------+
| count(*) |
+----------+
|    12689 |
+----------+
1 row in set (0.01 sec)
mysql> change master to master_port=3369;
Query OK, 0 rows affected (0.23 sec)
mysql> select count(*) from prva;
+----------+
| count(*) |
+----------+
|    12706 |
+----------+
1 row in set (0.00 sec)
mysql> start slave;
Query OK, 0 rows affected (0.01 sec)
mysql> select count(*) from prva;
+----------+
| count(*) |
+----------+
|      155 |
+----------+
1 row in set (0.00 sec)

BOOM. Data written since replication was set up DISAPPEARS from server id:10. (Some?) of the data that was written to id:20 is now present in the id:10 database (it arrived via replication). Replication 20->10 is working, but replication 10->20 has likely stopped with a “duplicate key” error.
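If the trigger is that CHANGE MASTER TO with only master_port silently resets the replication coordinates, then a plausible way to change the port more safely is to note the current coordinates first and restate them explicitly. I have not verified that this dodges the bug; the file name and position below are placeholders, use the values from your own SHOW SLAVE STATUS:

```sql
-- On id:10, before touching the port: note where we are
STOP SLAVE;
SHOW SLAVE STATUS\G   -- note Relay_Master_Log_File and Exec_Master_Log_Pos
-- Change the port AND restate the coordinates explicitly
CHANGE MASTER TO master_port=3369,
  master_log_file='mysql-bin.000003',
  master_log_pos=107;
START SLAVE;
```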

 
