I recently encountered a strange issue in Rails which questioned basic Relational Database principles. I spent a sleepless night until I was able to get to the root cause of the issue. 

The Problem and the pain point

The problem is pretty straightforward. A Rake task that generates an email with the documents count mentioned at two places. Ideally, the documents count at both places ought to be the same – but for some reason, it was different!

This problem is painful because it occurs intermittently and is hard to reproduce. The problem with intermittency is that there is always some valid theory behind it, but catching or realizing that theory is very painful. So, was there something in this case too?

How did we get to the root cause?

To figure out the root cause and fix it, I had to start somewhere. So, I thought of running some basic checks. I checked the Rake task’s log file which is  an output of my executed Email jobs. Things looked fine there – as the tasks were completed in under 90 seconds as expected. The next step was to look at the production logs. The logs as expected had 30 insert statements – Check. They also had a read statement for the insert statements before and it was a typical count(*) query. The problem occurred at this point. The count(*) was only 4 instead of 30! However, another count(*) within the code returned 30 as expected.

The check performed on the production logs revealed that this problem resided not within the Rails layer but within the production database setup. So, I routed my energy towards in that direction.

The production database environment is a Master-Slave configuration with the Master taking Writes and Reads and Slave configured to take only the Reads. Both nodes are load balanced via a PG Pool server. My first thought was to investigate the PG Pool, but I felt that it would not be of much use as all that the PG Pool can do is route the traffic. So, I spent time on reading two types of the Master – Slave replication configurations –  synchronous replication and streaming replication. Further research in this area revealed the root cause!

ReplicationSynchronous vs. Streaming replication

Let us assume that there are two databases – A and B.  The database A is a R/W Master and database B is an R-only Slave. When an insert or update command is issued, the command will write that entry in the database A because it is configured for write. If database A returns upon ensuring that all the slaves have this write – it is a Synchronous or 2-Safe replication. If database A does not wait for this step and instead, it acknowledges that it wrote successfully and later, it streams that value to B – it is a Streaming replication.

Both types of replication have obvious pros and cons. The Streaming replication is best suited for Raw Speed and is also a very good configuration for large number of writes. Even though the Synchronous replication is not as fast as the Streaming replication, it provides 100% consistency. Our production database was in the Streaming replication mode! The 30 inserts happened so fast in database A that even before it could stream those to B, the count query intervened and read the half baked data from database B, in terms of millisecond speed!

How did we fix it?

My team quickly isolated all the cron jobs to run in a dedicated node and pointed the database directly to the Master database server skipping the PG Pool in the process. In a single database configuration, the concept of Streaming or Synchronous Replication is not applicable. Hope this helps!

Cheers!
Braga