A Database or a Search Engine?
Many, many years ago, I made an internet search engine. I was working at a company where we used Rebol in some projects, and I had fun creating a crawler that looked for pages on the internet containing the word rebol. It collected the contents of those pages and put them into a MySQL database with full-text indexing.
This was «The RIX»: a search engine written in Rebol, indexing Rebol sites only. I remember feeling a bit proud when Carl Sassenrath mentioned it on his blog. He wrote that it was smart:
you can ask it to search not only for standard HTML documents about REBOL but also script files by the contents of their headers. If you want to search for all known documents by, say, “Ladislav”, you can click the author option. Other options include script title, name, purpose, and the entire header.
My spider found this information, indexed it and put it into the MySQL database. Although the Rebol community was small and the number of websites limited, a huge number of documents found their way into the database. But I didn't do much programming in Rebol after this, and I left the thing running without thinking too much about it, since, after all, it worked!
It worked with MySQL. Carl Sassenrath had already made the Rebol script header a place for all sorts of information, so when my spider found a script, it extracted this information into separate database columns: title, date, name, version, author, examples and comments, to mention a few. Each of these got its own column, and the complete script (and/or the complete web page) went into the full-text indexed column.
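For illustration, here is a rough sketch of what such a table might have looked like. The table name pages and the column names are my guesses for this article, not the original schema, and FULLTEXT indexes of that era required the MyISAM engine.

```python
# A hedged sketch of the old layout: one column per REBOL header field plus a
# FULLTEXT-indexed column holding the complete script or page. Table and
# column names are assumptions, not the original schema.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="rix",
                               password="secret", database="rix")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        id            INT AUTO_INCREMENT PRIMARY KEY,
        url           VARCHAR(512),
        title         VARCHAR(255),
        author        VARCHAR(255),
        date          DATE,
        version       VARCHAR(32),
        fulltext_body MEDIUMTEXT,
        FULLTEXT KEY ft_body (fulltext_body)
    ) ENGINE=MyISAM
""")
conn.commit()
```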
Five years later, my search engine was mentioned again, but this time because it was slow. And when I checked, it indeed was. It did contain a lot of information, but Google delivered the same information a lot quicker. And with my free web hosting plan, I could no longer spend bandwidth on crawling and on serving (slow) searches, so I put the Rebol indexer to sleep.
Since then, MySQL's full-text indexing has probably evolved a lot. But so have many other products, and one of the tools that has emerged is Solr. If I were to write the Rebol indexer today, I would use Solr. In fact, the search capabilities I had constructed through various MySQL SELECT statements, and that Carl Sassenrath referred to as smartness, largely coincide with what Solr offers: using wildcards, combining different data types, picking only certain fields and so on, much as in regular SQL statements. And, just as important, Solr is fast.
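To make the comparison concrete, here is a minimal sketch of the kind of field-aware queries Solr answers through its standard /select endpoint. The core name rix and the field names (author, title, purpose, url) are assumptions for this article, and the Solr instance is assumed to run locally on the default port.

```python
import requests

SELECT = "http://localhost:8983/solr/rix/select"

# Search one specific field, much like the old "author" option:
r = requests.get(SELECT, params={"q": 'author:"Ladislav"',
                                 "fl": "title,url", "rows": 10})
print(r.json()["response"]["numFound"])

# Wildcards and several fields combined in a single query:
r = requests.get(SELECT, params={"q": "title:pars* AND purpose:network",
                                 "fl": "title,author,url"})
for doc in r.json()["response"]["docs"]:
    print(doc.get("title"), "-", doc.get("url"))
```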
In fact, Solr is not a database, it's a document base. When using Solr, we don't think in terms of rows and columns; instead we say that Solr stores documents with different fields in them. The reason Solr is fast is that its default indexing is bottom-up: every word of every document is put into an index, and searching the document base is really searching that index. So instead of going through database rows and looking for a search term in every column according to its type, we go directly to the index and find our search terms there. We can worry about the different data types in the different fields (if we need to) after we have retrieved the relevant documents.
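The index in question is an inverted index. The following toy example has nothing to do with Solr's actual internals; it only illustrates the idea that every word of every document points back to the documents containing it, so a search becomes a lookup rather than a scan.

```python
from collections import defaultdict

docs = {
    1: "rebol script for parsing html",
    2: "a crawler written in rebol",
    3: "mysql full text indexing notes",
}

# Build the inverted index: word -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

print(index["rebol"])   # {1, 2} -- a direct lookup, no row-by-row scanning
```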
An interesting thing is that Solr is also very quick to set up. You don't have to scrutinize your data and decide on the different (column) data types. To start, you can import a complete dataset into Solr with no schema, and the different types of data will be inferred. You are then free to go on with the inferred data types for the different fields, or you can make whatever changes you need. But you get to index hundreds of thousands of documents first, if you like, and that in itself may be a great help when you examine your data afterwards!
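As a sketch of how little setup this takes: assuming a local Solr with a core named rix created from the default, data-driven configset, the documents below can be indexed without defining any schema first, and the Schema API will show which type Solr guessed for each new field. The field names and values are invented.

```python
import requests

SOLR = "http://localhost:8983/solr/rix"

docs = [
    {"id": "1", "author": "Ladislav", "date": "2005-06-12T00:00:00Z",
     "text": "parse rules for tokenising REBOL scripts"},
    {"id": "2", "author": "Carl", "version": 1.3,
     "text": "notes on the REBOL script header"},
]

# Index with no schema defined up front; commit right away for the example.
requests.post(f"{SOLR}/update?commit=true", json=docs)

# Ask the Schema API which field types were inferred.
for field in requests.get(f"{SOLR}/schema/fields").json()["fields"]:
    print(field["name"], "->", field["type"])
```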
So if I were to create the Rebol indexer again today, how would I go about it? Let's say I still had the old MySQL database available. To get everything into Solr, I would convert it to JSON. I could pull out all the columns and map them to appropriately named JSON fields. But the very simplest way would be to pull out only a single column: the full-text one (in addition to the URL). This would be much quicker. Then I'd push it into Solr. It would be indexed, the search would work, and that could be it! It would already give good hits and be much faster than the old MySQL setup.
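A sketch of that simplest path, under the same assumptions as above and with the guessed table and column names from earlier (a pages table with url and fulltext_body columns), could look like this:

```python
import mysql.connector
import requests

conn = mysql.connector.connect(host="localhost", user="rix",
                               password="secret", database="rix")
cur = conn.cursor()
cur.execute("SELECT url, fulltext_body FROM pages")

# One Solr document per row: just the URL and the full text.
docs = [{"id": url, "url": url, "text": body} for url, body in cur]

# For hundreds of thousands of rows, batching the POSTs would be wiser;
# a single request keeps the sketch short.
requests.post("http://localhost:8983/solr/rix/update?commit=true", json=docs)
```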
A slightly better solution would be to divide the documents into two piles: one for scripts, the other for everything else. Documents without scripts could remain a single text field; for scripts, I could pick out all the fields in the script header and place them in correspondingly named JSON fields. During the import into Solr, these fields would be categorized and automatically assigned the correct type. We would have the same smartness possibilities as in the old search engine, but it would be much faster, both in terms of setup effort and in terms of delivering search results.
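A sketch of the two-pile variant: the helper below is hypothetical and stands in for whatever code parses a script's REBOL header, but it shows the shape of the documents that would be sent to Solr.

```python
def to_solr_doc(url, body, header=None):
    """Build a Solr document; header is a parsed REBOL script header, if any."""
    doc = {"id": url, "url": url, "text": body}
    if header:  # it's a script: copy the header fields into named JSON fields
        doc.update({
            "title":   header.get("Title"),
            "author":  header.get("Author"),
            "date":    header.get("Date"),
            "version": header.get("Version"),
            "purpose": header.get("Purpose"),
        })
    return doc

# A plain web page becomes a single text field ...
page_doc = to_solr_doc("http://example.org/intro.html",
                       "An introduction to REBOL ...")

# ... while a script also carries its header fields.
script_doc = to_solr_doc("http://example.org/ftp.r", "REBOL [ ... ] ...",
                         header={"Title": "Tiny FTP client",
                                 "Author": "Ladislav",
                                 "Date": "2005-06-12T00:00:00Z"})
```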
When databases are said to be fast, it's actually true, but you still have to run carefully structured SELECTs against them. Searches in Solr aren't as demanding. When we say that Solr is fast, it's simply because Solr is fast!