I've been using Lucene for the better part of two years, from initial experimentation, to prototyping, to a production application. It's an impressive library, and it has come a long way in the past couple of years.
When I first started playing around with it, the version was 2.1 and the search times were so much faster than what we were getting from the tool we were trying to use at the time (Oracle Text). The first test was indexing a monster dataset and searching it quickly. It passed with flying colors!
Next was to add in record-level access control. Easy and extremely fast.
Next was to add in all the other data needed for our application. That was a little harder, considering that we have close to 150 fields in our index and are well into the billions of records (growing every day).
The problem was that we needed more memory, and there was no extra money for more servers (or upgrades). So there we were, stuck. I decided to start poking around with VisualVM to see whether there were any places in our application, or in Lucene itself, where we could save some memory.
We had already disabled norms on all our fields (we really didn't need norms for our data, nor did we have the resources for them). I took a long look at all the fields we were indexing to see if there were any we didn't need, but we really did need them all. Then I stumbled across the TermInfosReader class in Lucene.
This is where Lucene really gets its speed, but it also uses quite a bit of memory to do it. And this is where I wrote my first Lucene patch.
There is a bunch of stuff in TermInfosReader, but the big memory hogs are three arrays: one of Terms, one of TermInfos, and one of longs (file pointers).
Basically, Lucene takes a given Term and does a binary search across the Terms array (which by default contains every 128th Term in the index) to find where on disk the exact Term it needs lives. There's a little more going on in the class than that, but that's basically what it's doing.
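To make that concrete, here's a minimal, self-contained sketch of the idea. This is my illustration, not the actual Lucene source: in Lucene the first array holds Term objects (each wrapping two Strings, field and text) and the second holds TermInfo objects; here plain Strings and a stub class stand in.

```java
// Simplified sketch of the three arrays and the binary search over the
// sampled terms -- illustrative only, not the real TermInfosReader.
class TermIndexSketch {
    static class TermInfoStub { int docFreq; long freqPointer, proxPointer; }

    final String[] indexTerms;       // every 128th term, in sorted order
    final TermInfoStub[] indexInfos; // per-term stats for each sampled term
    final long[] indexPointers;      // where each sampled term lives in the term dictionary file

    TermIndexSketch(String[] terms, TermInfoStub[] infos, long[] pointers) {
        this.indexTerms = terms;
        this.indexInfos = infos;
        this.indexPointers = pointers;
    }

    // Binary search the sampled terms to find which on-disk block must
    // contain the requested term; the caller then scans from that point.
    int getIndexOffset(String term) {
        int lo = 0, hi = indexTerms.length - 1;
        while (hi >= lo) {
            int mid = (lo + hi) >>> 1;
            int cmp = term.compareTo(indexTerms[mid]);
            if (cmp < 0)      hi = mid - 1;
            else if (cmp > 0) lo = mid + 1;
            else              return mid;
        }
        return Math.max(hi, 0); // nearest sampled term <= the target
    }
}
```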
So, I started this patch with the need to save memory. How in the world do you do that in Java when everything is already in basic arrays and everything is needed in memory? Well, the savings have to come from somewhere else: references. References are a hidden cost in Java; every single reference in a 32-bit JVM costs you 4 bytes, and in a 64-bit JVM it's 8 (assuming you don't have compressed references).
Let's count the references.
- Terms length * 3: 1 reference for each Term and 2 references for the two Strings inside it
- TermInfos length * 1: 1 reference for each TermInfo
- the long array: just 1 reference total, since its contents are primitives
So, let's talk numbers. If you have a billion terms in your index, that's 125 MB of memory just for the references (1,000,000,000 / 128 * (3 + 1 references) * 4 bytes for every ref). In a 64-bit JVM that doubles to 250 MB. And that's not counting the object overhead for every one of those Term and TermInfo objects. Wow, that's a lot!
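If you want to sanity-check that arithmetic, here's the same back-of-the-envelope math as a tiny Java snippet (the billion-term index is hypothetical, and the reference sizes assume no compressed oops):

```java
// Back-of-the-envelope estimate of memory spent purely on references
// in the default term index.
public class RefCostEstimate {
    public static void main(String[] args) {
        long terms = 1_000_000_000L; // total terms in the (hypothetical) index
        long interval = 128;         // every 128th term is sampled into memory
        long refsPerEntry = 3 + 1;   // Term + its 2 Strings, plus the TermInfo

        long refs = (terms / interval) * refsPerEntry;
        System.out.printf("32-bit JVM: %d MB%n", refs * 4 / 1_000_000); // ~125 MB
        System.out.printf("64-bit JVM: %d MB%n", refs * 8 / 1_000_000); // ~250 MB
    }
}
```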
So I decided to remove nearly all of those references by using a byte array and an int array as an offset index.
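Here's a minimal sketch of what that layout looks like, assuming the terms are packed as UTF-8 bytes. This is my illustration of the approach, not the patch itself; the real code also has to pack the per-term stats and match Lucene's exact term ordering.

```java
import java.nio.charset.StandardCharsets;

// "No references" layout: every sampled term's bytes packed into one byte[],
// with an int[] of offsets as the index. One binary search over the offsets
// replaces the binary search over a Term[].
class PackedTermIndex {
    private final byte[] termBytes; // all sampled terms, concatenated as UTF-8
    private final int[] offsets;    // offsets[i]..offsets[i+1] = bytes of term i

    PackedTermIndex(String[] sortedTerms) {
        offsets = new int[sortedTerms.length + 1];
        byte[][] encoded = new byte[sortedTerms.length][];
        int size = 0;
        for (int i = 0; i < sortedTerms.length; i++) {
            encoded[i] = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
            offsets[i] = size;
            size += encoded[i].length;
        }
        offsets[sortedTerms.length] = size;
        termBytes = new byte[size];
        for (int i = 0; i < encoded.length; i++) {
            System.arraycopy(encoded[i], 0, termBytes, offsets[i], encoded[i].length);
        }
    }

    // Same binary search as before, but comparing raw bytes in place:
    // returns the position of the nearest sampled term <= the target.
    int indexOf(String term) {
        byte[] target = term.getBytes(StandardCharsets.UTF_8);
        int lo = 0, hi = offsets.length - 2;
        while (hi >= lo) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(target, mid);
            if (cmp < 0)      hi = mid - 1;
            else if (cmp > 0) lo = mid + 1;
            else              return mid;
        }
        return Math.max(hi, 0);
    }

    // Unsigned byte-by-byte comparison against term i, with no objects allocated.
    private int compare(byte[] target, int i) {
        int start = offsets[i], len = offsets[i + 1] - start;
        int n = Math.min(target.length, len);
        for (int k = 0; k < n; k++) {
            int diff = (target[k] & 0xFF) - (termBytes[start + k] & 0xFF);
            if (diff != 0) return diff;
        }
        return target.length - len;
    }
}
```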
The results were impressive!
Given a 6.2 GB index with 1,010,000 documents and 179,822,683 terms, the default implementation uses 292,235,512 bytes just to get the index usable.
My no-ref implementation of the same index uses only 49,849,744 bytes to get the index usable. That's 17% of the original size, an 83% savings!
And the best part is that it loads the segments into memory faster, so those real-time updates will be online sooner. Run-time performance is slightly faster as well. But the huge performance savings are in garbage collection: full GCs are over 7 times faster on my MacBook Pro. Wow!
I think the results speak for themselves, and I hope the Lucene folks will accept my patch; that way I won't have to keep re-patching each new version after the fact. Also, removing references can be great, but the code required to do it while maintaining the same level of performance is ugly! So don't try this at home!