Possible Mongo data corruption after returning a secondary to the replica set
I am trying to understand the source of some data corruption that occurred around the same time a secondary was returned to a replica set.
We have a production replica set with 4 nodes: 3 data-carrying nodes and an arbiter.
I took a secondary (call it X) out of the production replica set and used it to seed a new test replica set for some performance benchmarking. After seeding the new replica set, I put X back into the production replica set. Within about 10 hours we had complaints from customers that they had lost around 2 days of data. X had been out of production for 2 days as well. So we are wondering if re-introducing X caused some data reversion.
The timings line up very closely and we haven't been able to find any plausible alternative theory - hence this post.
The odd thing is that only some Mongo collections were "reverted". Our database seems to be a mix of the primary and X.
In more detail this is what I did:
1. Ran rs.remove(X) on the production primary to take X out of the production replica set.
2. Restarted X as a standalone by removing the replica set name from its mongod.conf.
3. On X, connected to the local database and ran db.dropDatabase() to clean out the production replica set info.
4. Restarted X with mongod.conf but with a new replica set name.
5. Initiated the new test replica set; X became the primary in the new replica set.
6. When the benchmarking was finished, ran rs.stepDown() on X and rs.remove(X) to take it out of the test replica set.
7. Restarted X as a standalone again (mongod.conf with no replica set name) and dropped the local database.
8. Restarted X with mongod.conf but with the production replica set name.
9. Ran rs.add(X) to add X back into the production replica set.

To clarify - no new data was added to X when it was the primary in the test replica set. A rough mongo shell sketch of these steps follows.
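Host names, ports and the test replica set name in this sketch are placeholders rather than our real values, and the restarts in between are done by editing mongod.conf and restarting mongod:

// 1. On the production primary: remove X from the production replica set
rs.remove("X.example.internal:27017")

// 2-3. On X, restarted as a standalone: drop the old replica set metadata
db.getSiblingDB("local").dropDatabase()

// 4-5. On X, restarted with the new replica set name (e.g. "perfTest"):
//      initiate the test replica set with X as its first member / primary
rs.initiate({ _id: "perfTest", members: [ { _id: 0, host: "X.example.internal:27017" } ] })

// 6. After benchmarking, on the test replica set
rs.stepDown()
rs.remove("X.example.internal:27017")

// 7. On X, standalone again: drop the local database once more
db.getSiblingDB("local").dropDatabase()

// 8-9. On the production primary, once X is back up with the production replica set name
rs.add("X.example.internal:27017")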
Here's some info which might be relevant:
All nodes use the MMAPv1 storage engine and run MongoDB 3.2.7.
After X was removed from the production replica set, its /etc/hosts entry for the production primary accidentally got deleted. X was able to communicate directly with the other secondary and the arbiter, but not with the primary, and there were lots of heartbeat error logs.
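For illustration, the missing line was the kind of entry shown below; the host names and addresses here are made up, not our real ones:

# /etc/hosts on X -- the first line is the one that accidentally got deleted
10.0.0.10   prod-primary
10.0.0.11   prod-secondary
10.0.0.12   prod-arbiter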
I found these logs, which seem to indicate that X's data got dropped when it re-entered the production replica set:
2017-01-13T10:00:59.497+0000 I REPL     [ReplicationExecutor] syncing from: (other secondary)
2017-01-13T10:00:59.552+0000 I REPL     [rsSync] initial sync drop all databases 
2017-01-13T10:00:59.554+0000 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 3 
2017-01-13T10:00:59.588+0000 I JOURNAL  [rsSync] journalCleanup... 
2017-01-13T10:00:59.588+0000 I JOURNAL  [rsSync] removeJournalFiles
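To put those lines in context, one thing I can still check is whether the remaining members' oplog window was shorter than the ~2 days X was out; if so, a full initial sync (dropping X's databases and recopying from the sync source) is what I'd expect when it was re-added. In the shell, on the primary or the other secondary:

// Prints the configured oplog size and the time range the oplog currently covers
db.printReplicationInfo()

// On the primary: how far behind each secondary's last applied op is
db.printSlaveReplicationInfo()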
Prior to all this, developers had also been reporting that the primary was sometimes unresponsive under higher load. These are some errors from the reactivemongo driver:
No primary node is available!
The primary is unavailable, is there a network problem?
not authorized for query on [db]:[collection]
The nodes are on AWS: the primary runs on an m3.xlarge, the secondaries on m3.large instances, and the arbiter on an m3.medium.
About 30 hours after we got customer complaints, our replica set held an election and X became the primary. These are the logs:
2017-01-15T16:00:33.332+0000 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms 
2017-01-15T16:00:33.333+0000 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected 
2017-01-15T16:00:33.347+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election 
2017-01-15T16:00:33.370+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 2 
2017-01-15T16:00:33.370+0000 I REPL     [ReplicationExecutor] transition to PRIMARY 
2017-01-15T16:00:33.502+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted
This happened before I realized the /etc/hosts file was broken on X.
I also found a lot of these errors in the logs when replicating one very large collection (260 million documents):
2017-01-13T13:01:35.576+0000 E REPL     [repl writer worker 9] update of non-mod failed: { ts: Timestamp 1484301755000|10, t: 1, h: -7625794279778931676, v: 2, op: "u", ns: ...
This is a different collection from the one that got corrupted, though.
