I work on a high-volume Rails application that provides a REST API for indexing and searching science metadata. We chose JRuby to implement it because of the easy deployment to our Tomcat application server and our existing experience in the Java ecosystem. JRuby’s threading and performance characteristics didn’t hurt either. The application went live last January and has indexed and searched about 100 million documents since then. We started noticing some odd behavior a few weeks after the initial deployment. We store our XML data GZIP’d in a BLOB column in Oracle, and we would occasionally get the error
ORA-22990: LOB Locators cannot span transactions. Then we would find that a row had been inserted into the table with an empty BLOB, even though this sort of error should cause a rollback and no data should be inserted. This error can happen if a commit occurs after executing an insert statement but before streaming the BLOB data to Oracle, or if autocommit is set to true on the JDBC connection. Nothing in the oracle-enhanced adapter code suggested how either could happen, and all local testing showed we were always inside a transaction with autocommit set to false.
The Mystery Thickens
We continued to see other issues. Another time, one of our Tomcat instances failed on any request that tried to start a new transaction. The error was due to invalid SQL:
SAVEPOINT active_record_-1. Restarting Tomcat fixed the issue. Other seemingly random problems included a lock being held for hours on a row in the Oracle database, which prevented other instances from completing background indexing tasks. Recently, we received reports from partners that some of our data was incorrect. They had sent data replacements as PUTs to our REST API. The PUTs had completed successfully, which we verified via logs, but the actual data in the database was old. We investigated and found additional records that were missing or out of date. Though this represented a small fraction of the data we receive and index, it was still concerning that otherwise successful transactions were losing data.
A Common Explanation
Ever since the Oracle LOB errors started, I had suspected that some problem was causing connections to be shared across concurrent requests on the REST API. This would be bad news because neither the ActiveRecord connection object nor the underlying JDBC connection is thread-safe; they’re meant to be used by a single thread at a time. The only way I could reproduce the LOB error was by manipulating the ActiveRecord Oracle Enhanced adapter code to commit before it streamed the BLOB data. If multiple threads were using the same connection object, one thread’s commit could occur at just the right time to cause the error. We only saw the LOB error about six times over a period of four months, but with hundreds of thousands of requests it seems reasonable that a few of them would hit the jackpot if connections were being shared.
The -1 in the SAVEPOINT SQL came from the @open_transactions instance variable in the ActiveRecord connection adapter. The number of open transactions increases and decreases as transactions are started and committed. It’s a count because transactions can be nested; the number of open transactions represents the level of nesting. It made no sense for @open_transactions to be -1. It was another weird, random error that could be explained by multiple threads manipulating the value at runtime, since the methods incrementing and decrementing it weren’t written to be thread-safe.
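The hazard is easy to reproduce in miniature. The sketch below uses hypothetical names, not the actual adapter code: a counter shared across threads has to be guarded, because += is a read-modify-write sequence. With the Mutex the count always returns to zero; without it, interleaved threads can drive it to nonsense values like -1.

```ruby
# Minimal sketch of the @open_transactions hazard: a counter shared by
# threads must be guarded by a lock, because += is not atomic.
class TransactionCounter
  attr_reader :open_transactions

  def initialize
    @open_transactions = 0
    @lock = Mutex.new
  end

  # Simulates beginning a (possibly nested) transaction.
  def increment
    @lock.synchronize { @open_transactions += 1 }
  end

  # Simulates committing or rolling back one nesting level.
  def decrement
    @lock.synchronize { @open_transactions -= 1 }
  end
end

counter = TransactionCounter.new
threads = 10.times.map do
  Thread.new { 1_000.times { counter.increment; counter.decrement } }
end
threads.each(&:join)
counter.open_transactions # => 0 with the lock; can drift without it
```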
The PUT requests that mysteriously lost data could also be explained by multiple threads using the same connection. Suppose two PUT requests, A and B, were using the same connection object, with A starting just before B. A inserts data, then B inserts data. A encounters an error and issues a rollback, which rolls back both A’s and B’s data. B then issues a commit. The commit succeeds but is meaningless: it commits an empty transaction, which raises no error. The user who sent request B would receive no error, and it would look as if the data had been committed rather than rolled back.
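A toy simulation (hypothetical classes, not ActiveRecord’s internals) makes the interleaving concrete: writes on a shared connection are buffered until commit, A’s rollback silently discards B’s buffered insert, and B’s commit then succeeds while committing nothing.

```ruby
# Toy model of a shared transactional connection: inserts are buffered
# until commit; rollback discards everything buffered so far.
class SharedConnection
  attr_reader :committed

  def initialize
    @pending = []
    @committed = []
  end

  def insert(row)
    @pending << row
  end

  def rollback
    @pending.clear
  end

  def commit
    @committed.concat(@pending)
    @pending.clear
  end
end

conn = SharedConnection.new
conn.insert("A's row")   # request A inserts
conn.insert("B's row")   # request B inserts on the SAME connection
conn.rollback            # A hits an error and rolls back -- B's row goes too
conn.commit              # B "successfully" commits an empty transaction

conn.committed # => [] -- B's data is gone, yet B saw no error
```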
Digging into ActiveRecord
With all of these issues, there had to be a problem in ActiveRecord, the oracle-enhanced gem, or our Tomcat connection pooling. We had been using Tomcat-level connection pooling for years without any issues, so ActiveRecord or oracle-enhanced seemed the likely culprit. Blaming your framework is a little like blaming the compiler, but this time it actually was the framework. We were using Rails 3.0.13; we hadn’t upgraded this particular project to 3.2 but had been keeping up with security patches. Looking through the ActiveRecord code, we found that the ConnectionPool class wasn’t accessing a hash called @reserved_connections in a thread-safe way. @reserved_connections maps thread object ids to the connection each thread is using, and Hash isn’t thread-safe in JRuby, even though every thread uses a unique key to access it.
We compared the ConnectionPool code in 3.0.13 to master and found, among many other changes, that @reserved_connections access was now synchronized. The synchronization fixes had gone into Rails 3.2 but not into 3.1 or 3.0. We could immediately fix the problem with a patch to the ConnectionPool class that synchronized the methods manipulating @reserved_connections, and we also started work on upgrading from 3.0.13 to the latest Rails 3.2 release. We were lucky that we could fix all the incorrect data: we had left the previous system running as a fallback in case something went wrong in the newer REST API, and it was capturing the same changes. This was fortuitous, as it saved us from having to ask the users of our system to please resend their data.
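The essence of the patch was serializing access to the shared hash. Here is a minimal sketch of the idea, simplified and with illustrative names: the real Rails 3.2 ConnectionPool does considerably more, but every read and write of the per-thread hash goes through a lock.

```ruby
# Sketch of a pool that guards its per-thread hash with a Mutex,
# the essence of the synchronization fix we backported.
class SimplePool
  def initialize
    @reserved_connections = {}   # thread object_id => connection
    @lock = Mutex.new
    @next_id = 0
  end

  # Returns the calling thread's connection, creating one if needed.
  # All access to @reserved_connections happens under the lock.
  def connection
    @lock.synchronize do
      @reserved_connections[Thread.current.object_id] ||= checkout_new
    end
  end

  def release_connection
    @lock.synchronize { @reserved_connections.delete(Thread.current.object_id) }
  end

  def size
    @lock.synchronize { @reserved_connections.size }
  end

  private

  # Stand-in for checking a real connection out of an underlying pool.
  def checkout_new
    @next_id += 1
  end
end

pool = SimplePool.new
threads = 20.times.map { Thread.new { pool.connection } }
conns = threads.map(&:value)
conns.uniq.size # => 20 -- every thread got its own connection
```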
In a recent talk at Clojure West, Allen Rohner said:
JRuby theoretically supports multithreading. Rails 3 theoretically supports multithreading. In practice we couldn’t get that to work. In Rails 3 the core libraries support multithreading. Many people don’t use JRuby so they don’t test that their own gems work.
I don’t agree completely with that, as more gems are now tested against JRuby with services like Travis CI. But I can’t completely disagree with it either. These types of bugs show up because the original implementor wasn’t considering JRuby during the implementation. The code runs on JRuby and passes unit tests, but isn’t thread-safe in a JRuby environment. That’s obvious from the original ActiveRecord ConnectionPool code, which for many versions has proudly claimed to be “completely thread-safe, and will ensure that a connection cannot be used by two threads at the same time”. This was true in the older versions on MRI, which is multithreaded but, thanks to the GIL, doesn’t suffer the same problems with concurrent access to shared variables; it was definitely not true when run on JRuby. We’re at a point right now where JRuby users will run into issues like this with gems that were originally written for MRI. The community is doing the right thing here: many people are using JRuby and reporting and fixing issues in popular gems like ActiveRecord, and the Rails team seems responsive about including these fixes in the master branch and pushing out releases quickly. We discovered this issue on our own, but if we had kept up with the latest Rails releases we would have gotten the fix several months earlier.