What Happens To Lucene When You Go Big... Part 1

My current project has some unique searching requirements.

Requirements

  • Fuzzy searching is a must (Soundex, Levenshtein, etc.)

  • Has to be fast, a must with any searching solution

  • Has to provide access control

  • Full data load indexing needs to be completed in a reasonable amount of time

  • Scoring needs to be a custom implementation

  • Needs to run on a predetermined environment, meaning that new hardware purchases are not going to happen any time soon

  • And last but not least is ability do all these things on a dataset that exceeds a billion records

So we have had a lot of constraints to deal with, the hardest one by far is the last one.

The Data

  • 1 billion plus records

  • Over 30 million unique terms

Indexing and Searching Server Specs

  • 20 CPUs

  • 32 Gig of ram

  • Dedicated SAN storage

First Searching Experiences

After getting the index built in multiple partitions, I fired up a simple Lucene console to do some simple searches with a Lucene multi searcher. Ran out of memory with 2 Gig heap, tried the maximum heap size for the 32 bit JVM we were using, 3.3 Gig, and that ran out of memory as well. So, initial tries to just run one search were unsuccessful.

Then we installed a 64-bit JVM and tried an 8 Gig heap, and it worked! I could run searches and after the first couple of warm up searches it was getting 20 - 80 ms responses on single term searches. Great, but then we tried a Fuzzy search, which uses a Levenshtein algorithm to calculate matches, 2 minutes 45 seconds, this was unacceptable.

Next we wrote our own Levenshtein Lucene query and got the 2 minutes plus search down to about one second. We found that the built in Lucene Fuzzy query was taking 85-95% of the time to find the terms to search. Then after those terms were found the actual search with those expanded terms only took a second to two depending on how many terms were found. So we replaced the built in Fuzzy query with a custom one that gets near instantaneous results on Levenshtein fuzzy matches. Problem solved.

Indexing Time

After our initial proof of concept was complete, we needed to improve the indexing time down to something more reasonable. The index creation from scratch was taking 36 - 48 hours to build with 20 CPUs running at 100% utilization. Which means that the machine was indexing about 9,000 records a second. Not bad for Lucene 2.2, but not that great.

First we stopped merging the indexes after we created them, that by itself was taking about 12 hours. At this point we also started searching these multiple indexes in parallel, and we are seeing modest increases in query performance.

Second, we upgraded to Lucene 2.3, this provided a huge increase in indexing speed. Our index creation time went from 36 - 48 hours (depending on if we merged indexes or not) down to 3-4 hours. The indexing process is now indexing around 125,000+ records a second. Huge improvement, if you haven't upgraded to 2.3, you should!

Current Development

We are in the process of adding access control to Lucene as well as adding new custom queries and scoring. So far Lucene has performed better than any of the competition that it has come up against, and with it's price point it seems to have won acceptance on our project.

In upcoming parts I will go into more details about the technical solutions that we have developed to solve these problems, as well others that I haven't mentioned yet.

Comments

  1. Jo:

    I have the same performance problem with the fuzzy queries. How did you implement your own fuzzy queries? N-Gram, Prefix, Query Exmpansion, other? I would thank you for a useful hint.

    1. Aaron McCurry:

      It's still the same Levinstein fuzzy algorithm that is built into Lucene by default. It's just implemented as an in memory lookup, instead of the built-in terms scanner implementation.

  2. Jim:

    What similarity threshold do you use for your Levinstein based fuzzy search? Do your queries get considerably slower when the similarity threshold goes below 70%?

  3. Aaron:

    The lowest threshold we allow is 50%, but we allow users to select what threshold they want.

  4. Jo:

    Hi Aaron.

    Thanks for you answer. I am just back to the problem with Levenshtein performance. What do you mean with 'in memory lookup'? As far as I see it, the calculation of the Lev distances and the ammount of calculations takes the time. Could you please describe the solution? Many thanks in advance.

  5. Aaron:

    Basically it is a tree structure of all the terms characters, and then it uses a proprietary algorithm to perform the levenshtein match. That's all I can really describe at this level, however I may release an open source implementation of it at some point.

  6. Kevin:

    Hi Aaron-

    I would be very interested in seeing this open source implementation. Is this something you're still planning?

  7. Aaron:

    I may but it is a custom solution for a customer of ours. So I may need to get their permission before implementing an open source solution.

  8. Fuad Efendi:

    Hi Aaron,I implemented BK-Tree structure to calculate distance (Levenshtein), http://en.wikipedia.org/wiki/B...Lucene 2.9.1No any improvement.Finally, I added some trace messages to org.apache.lucene.search.FuzzyQuery.rewrite(...) method. And I found that... similarity takes only 10% of overall processing of a query.For instance, query product_name:glass~0.5 in average is processed in 500ms, and BK-Tree or any other "similarity" (including classic one) takes only 50ms.At the same time, query product_name:glas* takes 50ms in average.It means performance bottleneck is somewhere else."We found that the built in Lucene Fuzzy query was taking 85-95% of the time to find the terms to search."

    • I can't reproduce it in Lucene 2.9.1; in my tests, it is only 10%, without dependency on "similarity" implementation.
  9. Aaron:

    Interesting, I would have to ask how many terms are in your index then?Because I know the biggest problem we had at the time I wrote this was the number of terms in the index was huge and iterating over all of them and calculating the edit distance for each one was the slow part. Also when I wrote this, we were still running Lucene 2.4.

  10. Fuad Efendi:

    Aaron,I finally found what was wrong... I tried load-stress with at least 32 concurrent threads, then increased, but it was wrong... Fuzzy Search is CPU-bound, and 16 threads can kill the server. That was my mistake during tests.Proper tests show up to 20x performance boost (with new faster "distance" algo).200000 terms, billion documents.I contributed code to http://issues.apache.org/jira/... Looks like
    http://www.catalysoft.com/arti... is 5x times faster and logically more appropriate than Levenstein algo for many "text" use cases

  11. Aaron:

    Interesting, thanks for the info! I will check it out.Aaron

  12. Fuad Efendi:

    Have you seen https://issues.apache.org/jira... - interesting...

  13. Aaron:

    Thanks, yeah I have looked in to it, although I haven't seen an implementation yet. But maybe I've missed a few posts.

New Comment