After struggling with large FULLTEXT indexes in MySQL, Solr came to the rescue: 16 million records ingested in 20 minutes – wow!
One small gotcha was the security plugin classes, which have obviously moved since the documentation was written (see the fix at the end of the post).
For web apps I live off MySQL, albeit nowadays often wrapped with my own NoSQLite libraries to do Mongo-style databases over the LAMP stack. I’d also recently had a successful experience using MySQL FULLTEXT indices with a smaller database (tens of thousands of records) for the HCI Book search. So when I wanted to index the 16 million book titles, with their author names, from OpenLibrary, I thought I might as well have a go.
For some MySQL table types, the normal recommendation used to be to insert records without an index and add the index later. However, in the past I have had a very bad experience with this approach as there doesn’t appear to be a way to tell MySQL to go easy with this process – I recall the disk being absolutely thrashed and Fiona having to restart the web server 🙁
Happily, Ernie Souhrada reports that for MyISAM tables incremental inserts with an index are no worse than a bulk insert followed by adding the index. So I went ahead and set off a script adding batches of 10,000 records at a time, with small gaps ‘just in case’. The ‘just in case’ was definitely the case: 16 hours later I’d barely managed a million records, and MySQL was getting slower and slower.
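For the curious, the batch-insert script was roughly the following shape – a sketch only, assuming the mysql-connector-python driver, with the table name, column names and file path invented for illustration:

import csv
import time
import mysql.connector  # pip install mysql-connector-python

BATCH = 10_000

def batches(path, size):
    """Yield lists of (title, author) tuples read from a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        batch = []
        for row in reader:
            if len(row) >= 2:
                batch.append((row[0], row[1]))
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch

conn = mysql.connector.connect(user="books", password="secret", database="openlibrary")
cur = conn.cursor()

# MyISAM table 'books' already has its FULLTEXT index on (title, author)
for rows in batches("ol_titles.tsv", BATCH):
    cur.executemany("INSERT INTO books (title, author) VALUES (%s, %s)", rows)
    conn.commit()
    time.sleep(2)  # the small 'just in case' gap between batches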
I cut my losses and tried an upload without the FULLTEXT index: 20 minutes later, that was fine … but there was no way I dared run that ‘CREATE FULLTEXT’!
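For comparison, the bulk route that did work looked roughly like this – again a sketch with an invented table name and file path, and the never-dared index build left commented out (LOAD DATA LOCAL also needs local_infile enabled on the server):

import mysql.connector

conn = mysql.connector.connect(user="books", password="secret",
                               database="openlibrary", allow_local_infile=True)
cur = conn.cursor()

# Fast: plain rows straight into an unindexed table.
cur.execute("""
    LOAD DATA LOCAL INFILE 'ol_titles.tsv'
    INTO TABLE books
    FIELDS TERMINATED BY '\\t'
    (title, author)
""")
conn.commit()

# Slow and scary: the step I never dared to run on 16 million rows.
# cur.execute("ALTER TABLE books ADD FULLTEXT INDEX ft_title_author (title, author)")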
In my heart I knew that Lucene/Solr was the right way to go. These are designed for search-engine performance, but I dreaded the pain of installing and coming up to speed with yet another system that might not end up any better in the end.
However, I bit the bullet, and my dread was utterly unfounded. Fiona got the right version of Java running, and within half an hour of downloading Solr I had it up and running with one of the examples. I then tried experimental ingests with small chunks of the data: 1,000 records, 10,000 records, 100,000 records, a million records … Solr lapped it up, utterly painless. The only fix I needed was to my tab-separated records, which had quote characters that needed mangling.
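The ‘mangling’ was nothing clever – something along these lines (the exact substitution and the file names are just my illustrative choice; the point is to stop Solr’s CSV loader treating stray double quotes as encapsulators):

def mangle_quotes(src, dst):
    """Crude pre-pass: swap double quotes for apostrophes so the TSV parses cleanly."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line.replace('"', "'"))

mangle_quotes("ol_titles.tsv", "ol_titles_clean.tsv")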
So, a quick split into million-record chunks (I couldn’t bring myself to do a single multi-gigabyte POST … but maybe that would have been OK!), set the ingest going, and 20 minutes later – hey presto, 16 million full-text-indexed records 🙂 I then realised I’d forgotten to give fieldnames, so the ingest had taken the first record’s values as a header line. No problem: just clear the index and re-ingest … at 20 minutes for the whole thing, who cares!
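The ingest itself amounts to little more than POSTing each chunk at Solr’s CSV update handler, this time remembering the fieldnames. A rough sketch, assuming a core called ‘openlibrary’, title/author fields and the requests library:

import glob
import requests

SOLR_UPDATE = "http://localhost:8983/solr/openlibrary/update"
PARAMS = {
    "separator": "\t",             # URL-encoded to %09 for us by requests
    "fieldnames": "title,author",  # so the first line is not treated as a header
    "header": "false",
}

chunks = sorted(glob.glob("chunks/ol_titles_clean_*.tsv"))
for i, chunk in enumerate(chunks, start=1):
    with open(chunk, "rb") as f:
        params = dict(PARAMS, commit="true" if i == len(chunks) else "false")
        resp = requests.post(SOLR_UPDATE, params=params, data=f,
                             headers={"Content-Type": "application/csv"})
        resp.raise_for_status()
    print(f"chunk {i}/{len(chunks)} ingested")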
As noted, there was one slight gotcha. The Securing Solr section of the Solr Reference Guide explains how to set up the security.json file. This kept failing until I realised Solr couldn’t find the classes solr.BasicAuthPlugin and solr.RuleBasedAuthorizationPlugin (solr.log is your friend!). After a bit of listing the contents of jars, I found that these now live in org.apache.solr.security. I also found that the JSON parser struggled a little with indents … I think maybe tab characters, but after explicitly selecting them and re-typing spaces – yay! – I have a fully secured Solr instance with 16 million book titles – wow 🙂
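In case it saves someone else the hunt, the ‘listing the contents of jars’ was essentially this kind of thing – the Solr install path is a guess, so point it at wherever your Solr’s jars actually live:

import glob
import zipfile

JAR_DIR = "/opt/solr/server/solr-webapp/webapp/WEB-INF/lib"  # adjust to your install

for jar in glob.glob(JAR_DIR + "/*.jar"):
    with zipfile.ZipFile(jar) as z:
        for name in z.namelist():
            if name.endswith(("BasicAuthPlugin.class", "RuleBasedAuthorizationPlugin.class")):
                # e.g. org/apache/solr/security/BasicAuthPlugin.class -> org.apache.solr.security.BasicAuthPlugin
                print(jar, "->", name[:-len(".class")].replace("/", "."))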
This is my final security.json file (actual credentials obscured, of course!):
{ "authentication":{ "blockUnknown": true, "class":"org.apache.solr.security.BasicAuthPlugin", "credentials":{ "tom":"blabbityblabbityblabbityblabbityblabbityblo= blabbityblabbityblabbityblabbityblabbityblo=", "dick":"blabbityblabbityblabbityblabbityblabbityblo= blabbityblabbityblabbityblabbityblabbityblo=", "harry":"blabbityblabbityblabbityblabbityblabbityblo= blabbityblabbityblabbityblabbityblabbityblo="}, }, "authorization":{"class":"org.apache.solr.security.RuleBasedAuthorizationPlugin"} }