Is a NoSQL database an alternative to a search engine?
I have been thinking about the question “Is a NoSQL database an alternative to a search engine?”, and I think I have found an answer.
Let’s talk about some terms first.
NoSQL – Not only SQL – meaning that a NoSQL database differs from an RDBMS in some way.
IR – Information Retrieval – the science of searching for documents and their metadata, and of retrieving them.
Let us list the features of MongoDB and Lucene and then compare them.
- MongoDB is a document-based database with the following features (reference: http://www.mongodb.org/ ):
  - Document-oriented storage
  - Full Index Support
  - Replication and High Availability
  - Auto-Sharding
  - Querying
  - Fast In-Place Updates
  - Map/Reduce
  - GridFS
  - Commercial Support
- Lucene features (reference: http://lucene.apache.org/java/docs/features.html ):
  - Scalable, High-Performance Indexing (which is actually quite fast)
  - Powerful, Accurate and Efficient Search Algorithms
    - ranked searching: best results returned first
    - many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
    - fielded searching (e.g., title, author, contents)
    - date-range searching
    - sorting by any field
    - multiple-index searching with merged results
    - allows simultaneous update and searching
  - Cross Platform
A NoSQL database is preferable when the database needs to be scalable and highly available, with fast query results. However, it does not completely solve the problem of Information Retrieval.
Search (Information Retrieval) isn’t just about grabbing any documents that match. If you want your search results to have any relevance at all, you are going to need something along the lines of TF-IDF, phrase matching (words in a sequence score higher), or any number of other IR techniques to improve search precision.
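To make the difference between "matching" and "ranking" concrete, here is a minimal sketch of TF-IDF scoring in plain Python. It is not Lucene's actual scoring formula; the function names, the whitespace tokenizer, and the `+ 1.0` smoothing term are assumptions made for illustration only.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do far more."""
    return text.lower().split()

def tf_idf_search(docs, query):
    """Return (score, doc_id) pairs, best match first, for docs matching any query term."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    results = []
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                # Rare terms weigh more; the + 1.0 is an illustrative smoothing choice.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            results.append((score, doc_id))
    return sorted(results, reverse=True)

docs = {
    "a": "mongodb is a document database",
    "b": "lucene is a search library for full text search",
    "c": "solr is a search server built on lucene",
}
ranked = tf_idf_search(docs, "lucene search")
```

A plain "does it match?" query would return documents `b` and `c` in no particular order. With scoring, `b` comes first because the query terms make up a larger share of that document.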
A NoSQL database such as MongoDB does not, at least at the time of writing, provide relevance-based search results, which is one key point to consider. I think this is the biggest factor when choosing between a NoSQL database and a search engine framework.
Another alternative is to couple a database with a search engine. For example:
- couchdb-lucene provides such an integration between CouchDB and Lucene
- Solr provides integration with RDBMSes (such as MySQL) and uses Lucene as its search library.
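The coupling idea can be sketched in a few lines of Python. This is not how couchdb-lucene or Solr work internally; `IndexedStore` is a hypothetical toy class in which a dict stands in for the database and an inverted index stands in for the search engine, kept in sync on every write.

```python
from collections import defaultdict

class IndexedStore:
    """Toy document store that keeps a separate inverted index in sync on every write."""

    def __init__(self):
        self.documents = {}            # stands in for the database
        self.index = defaultdict(set)  # stands in for the search engine: term -> doc ids

    def put(self, doc_id, text):
        # Remove stale postings if the document is being overwritten.
        old = self.documents.get(doc_id)
        if old is not None:
            for term in set(old.lower().split()):
                self.index[term].discard(doc_id)
        self.documents[doc_id] = text
        for term in set(text.lower().split()):
            self.index[term].add(doc_id)

    def search(self, term):
        """Look up ids in the index, then fetch the full documents from the store."""
        ids = sorted(self.index.get(term.lower(), set()))
        return [(doc_id, self.documents[doc_id]) for doc_id in ids]

store = IndexedStore()
store.put(1, "hello world")
store.put(2, "hello lucene")
hits = store.search("hello")
```

The database remains the source of truth, and the index only answers "which ids match?". The hard part in a real system is keeping the two consistent when writes fail halfway, which this sketch ignores.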
That’s all for now.
Comments and suggestions are welcome.
mdasif 4:02 am on March 1, 2011
In my point of view, non-relational databases and Lucene try to solve different parts of the information retrieval problem. Lucene is an awesome implementation of indexing and ranking algorithms, while the problem that non-relational databases try to solve is storing and fetching data at internet scale, in terabytes and petabytes.
I am not sure, but how does Lucene scale? Say I want to store all tweets and run a query; doing that on one machine is not possible. Does Lucene run on multiple machines and parallelize the work without me writing code to shard the tweet data and run multiple instances of Lucene?
As soon as we start talking about multiple instances, hardware failure recovery and consistency across nodes come into the picture, and non-relational databases handle that part very well. They allow you to run custom queries efficiently but do not rank results well.
tuxdna 7:26 am on March 2, 2011
Lucene supports “multiple-index searching with merged results”, so you would need to split your index into multiple chunks and spread them across systems. I believe that if you are using Solr, it will handle this for you, provided you have configured it ( http://wiki.apache.org/solr/DistributedSearch ).
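As a rough illustration of "multiple-index searching with merged results", here is a sketch in plain Python. It does not use Lucene or Solr APIs; `search_shard` and `distributed_search` are hypothetical names, and the score is a simple term count chosen so that scores from different shards are directly comparable.

```python
import heapq
from itertools import islice

def search_shard(shard, term):
    """Return (score, doc_id) hits from one shard, best first. Score is a toy term count."""
    hits = []
    for doc_id, text in shard.items():
        count = text.lower().split().count(term)
        if count:
            hits.append((count, doc_id))
    return sorted(hits, reverse=True)

def distributed_search(shards, term, limit=10):
    """Query every shard, then merge the already-sorted hit lists into one ranking."""
    per_shard = [search_shard(shard, term) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return list(islice(merged, limit))

shards = [
    {"t1": "lucene lucene search", "t2": "mongodb"},
    {"t3": "lucene lucene lucene", "t4": "lucene"},
]
top = distributed_search(shards, "lucene", limit=3)
```

Each shard could live on a different machine and be queried in parallel; only the small sorted hit lists need to travel to the node that merges them. With scores based on per-shard statistics (such as TF-IDF), the scores may not be directly comparable across shards, which is one of the things a distributed search layer has to account for.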
As an aside, I just came across this article ( http://www.linuxforu.com/developers/up-close-and-personal-with-nosql/ ) about NoSQL databases, which is quite concise and informative.
Shamail 12:48 pm on August 15, 2011
It’s not always true that NoSQL alone solves the problem of scaling.
To actually scale out, you’ll need to denormalize your data. Databases are just a way to represent it.
One can very well represent denormalized data in MySQL, PostgreSQL, or MongoDB/Cassandra.
Consider, say, designing the Twitter timeline.
If your data is normalized, then the standard way to display a user’s timeline will be:
1. Select the users that this user follows.
2. Join them with the status_message table.
3. Sort the result by time.
But if the data is denormalized, you’ll introduce a redundant table called timeline, defined as:
timeline ( user, timestamp, statusmessage )
This holds the timeline of each user and is redundantly filled with the status updates of everyone that user follows.
So to solve the same problem, I’ll end up just calling
SELECT * FROM timeline WHERE user = me;
–
In the second case, whenever a person I am following posts a status update, an INSERT is made into my timeline. At read time this is far cheaper than the JOIN that you’ll have to do in the first case.
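The two approaches can be compared side by side with Python’s built-in sqlite3 module. This is a sketch under assumptions: the `follows` table, the column names, and the `post_status` helper are made up for illustration and do not appear in the discussion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (follower TEXT, followee TEXT);
    CREATE TABLE status_message (author TEXT, ts INTEGER, message TEXT);
    CREATE TABLE timeline (username TEXT, ts INTEGER, message TEXT);
""")
conn.executemany("INSERT INTO follows VALUES (?, ?)",
                 [("me", "alice"), ("me", "bob"), ("carol", "alice")])

def post_status(author, ts, message):
    """Store the status, then fan it out to the timeline of every follower."""
    conn.execute("INSERT INTO status_message VALUES (?, ?, ?)", (author, ts, message))
    conn.execute(
        "INSERT INTO timeline (username, ts, message) "
        "SELECT follower, ?, ? FROM follows WHERE followee = ?",
        (ts, message, author))

post_status("alice", 1, "hello")
post_status("bob", 2, "hi there")
post_status("dave", 3, "nobody follows me")

# Normalized read: a join at query time.
joined = conn.execute(
    "SELECT s.ts, s.message FROM follows f "
    "JOIN status_message s ON s.author = f.followee "
    "WHERE f.follower = ? ORDER BY s.ts DESC", ("me",)).fetchall()

# Denormalized read: a single-table lookup.
precomputed = conn.execute(
    "SELECT ts, message FROM timeline WHERE username = ? ORDER BY ts DESC",
    ("me",)).fetchall()
```

Both reads return the same rows. The trade-off is that the denormalized version writes one timeline row per follower on every status update, so the work moves from read time to write time.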
—–
The second most important principle is lazy referencing.
Consider a Django model:
from django.db import models
class User(models.Model):
    pincode = models.ForeignKey(Pincode, on_delete=models.CASCADE)
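So, internally, when you access user.pincode, the ORM fetches the related row for you (with a JOIN if the queryset used select_related()), and the two tables stay coupled through the foreign key.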
Why not design it like this?
class User(models.Model):
    pincodeid = models.IntegerField()  # or an ObjectId field, if your MongoDB backend provides one
And call:
pinid = user.pincodeid
pincode = Pincode.objects.get(id=pinid)
This avoids the cost of a join.
And do note that here the tables can live on different systems altogether, because they are not coupled!
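Here is a sketch of the same idea without Django, using two separate in-memory SQLite databases to stand in for two different systems. The table names, sample data, and helper functions are assumptions made for illustration.

```python
import sqlite3

# Two independent databases stand in for two separate systems.
users_db = sqlite3.connect(":memory:")
pincodes_db = sqlite3.connect(":memory:")

users_db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, pincode_id INTEGER)")
pincodes_db.execute(
    "CREATE TABLE pincodes (id INTEGER PRIMARY KEY, code TEXT)")

pincodes_db.execute("INSERT INTO pincodes VALUES (7, '110001')")
users_db.execute("INSERT INTO users VALUES (1, 'asha', 7)")

def get_user(user_id):
    """Fetch the user row only; the pincode store is not touched."""
    return users_db.execute(
        "SELECT id, name, pincode_id FROM users WHERE id = ?", (user_id,)).fetchone()

def get_pincode(pincode_id):
    """Resolve the reference on demand with a primary-key lookup in the other store."""
    row = pincodes_db.execute(
        "SELECT code FROM pincodes WHERE id = ?", (pincode_id,)).fetchone()
    return row[0] if row else None

_, name, pincode_id = get_user(1)
code = get_pincode(pincode_id)
```

The price of this decoupling is that no database enforces the reference any more: the application has to cope with a `pincode_id` that points at nothing, which is why `get_pincode` returns `None` for an unknown id.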
–
Yes, indeed, MongoDB or any other NoSQL database helps you achieve the above design principles easily. But there is no good reason why you can’t do the same in an ordinary RDBMS.
–
Lastly, Lucene solves a completely different problem. While you are at it, try Whoosh (if you want a pure-Python thingy) or Solr (if you want HTTP-API-like stuff and don’t want to mess with the J-thingy).
The use case of Lucene is full-text search, whereas the problem of scaling is a completely different one, IMHO.