Updates from February, 2011

  • tuxdna 10:03 am on February 27, 2011

    Is a NoSQL database an alternative to a search engine?

    I have been thinking about this question: “Is a NoSQL database an alternative to a search engine?”. I think I just found an answer.

    Let’s define some terms first.
    NoSQL – Not only SQL – meaning that a NoSQL database differs from an RDBMS in some way.
    IR – information retrieval – the science of searching for documents, for information within them, and for metadata about them.

    Let us list the features of MongoDB and Lucene and then compare the two.

    MongoDB is a document-oriented database with the following features ( reference http://www.mongodb.org/ ):

    • Document-oriented storage
    • Full Index Support
    • Replication and High Availability
    • Auto-Sharding
    • Querying
    • Fast In-Place Updates
    • Map/Reduce
    • GridFS
    • Commercial Support

    Lucene features ( reference http://lucene.apache.org/java/docs/features.html ):

    • Scalable, High-Performance Indexing ( which is actually quite fast )
    • Powerful, Accurate and Efficient Search Algorithms
        • ranked searching: best results returned first
        • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
        • fielded searching (e.g., title, author, contents)
        • date-range searching
        • sorting by any field
        • multiple-index searching with merged results
        • allows simultaneous update and searching
    • Cross Platform

    A NoSQL database is preferable when the data store needs to be scalable and highly available and to return query results quickly. However, it doesn’t completely solve the problem of information retrieval.

    Search (information retrieval) isn’t just about grabbing any documents that match. If you want your search results to have any relevance at all, you’re going to need something along the lines of TF-IDF, phrase matching (words in sequence score higher), or any number of other IR techniques to improve search precision.
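
To make the relevance point concrete, here is a minimal TF-IDF ranking sketch in plain Python. It is only an illustration: the whitespace tokenizer, the `rank` helper and the sample documents are my own assumptions, and Lucene's real scoring is considerably more elaborate than this.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def rank(query, docs):
    """Return indices of matching documents, best TF-IDF score first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, idx))
    # Highest score first; ties keep the original document order.
    ranked = sorted(scores, key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in ranked if score > 0]


# Hypothetical sample documents.
docs = [
    "lucene is a search library",
    "mongodb is a document database",
    "search search search with lucene",
]
print(rank("lucene search", docs))
```

A plain "does it match?" query would return the first and third documents in storage order; the scored version puts the third one first because the query terms are more frequent in it.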

    NoSQL databases such as MongoDB don’t provide relevance-ranked search results (as of this writing), which is one key point to consider. I think this is the biggest factor when choosing between a NoSQL database and a search engine framework.

    Another alternative is to couple a database with a search engine to achieve both goals. For example:

    • couchdb-lucene integrates CouchDB with Lucene
    • Solr integrates with RDBMSes ( such as MySQL ) and uses Lucene as its search library.

    That’s all for now.
    Comments and suggestions are welcome :)

     
    • mdasif 4:02 am on March 1, 2011

      In my POV, non-relational DBs and Lucene try to solve different parts of the information retrieval problem. Lucene is an awesome implementation of indexing and ranking algorithms, while non-relational DBs try to solve the problem of storing and fetching data at internet scale, in terabytes and petabytes.
      I am not sure, but how does Lucene scale? I mean, if I want to, let’s say, store all tweets and run a query, doing that on one machine is not possible. Does Lucene run on multiple machines and parallelize the work without me writing code to shard the tweet data and run multiple instances of Lucene?

      As soon as we start talking about multiple instances, hardware failure recovery and consistency across nodes come into the picture, and non-relational DBs do that part very well. They allow you to run custom queries efficiently but do not rank well.

    • tuxdna 7:26 am on March 2, 2011

      Lucene supports “multiple-index searching with merged results”, so you would need to split your index into multiple chunks spread across systems. I believe that if you are using Solr, it will handle this for you, provided you have configured it ( http://wiki.apache.org/solr/DistributedSearch ).
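
The "merged results" idea can be sketched without any search library at all. The snippet below is not Lucene or Solr code; `merge_shard_results` and the sample hits are hypothetical, and it assumes every shard returns its hits already sorted by descending score, with scores that are comparable across shards.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, limit):
    """Merge per-shard hit lists (each sorted by descending score) into one top list."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


# Hypothetical (score, document id) hits from two index chunks.
shard_a = [(0.9, "doc-a1"), (0.4, "doc-a2")]
shard_b = [(0.7, "doc-b1"), (0.5, "doc-b2"), (0.1, "doc-b3")]

top = merge_shard_results([shard_a, shard_b], limit=3)
print(top)
```

The comparable-scores assumption is the tricky part in practice, since term statistics such as document frequency differ from one chunk to another.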

      As an aside, I just came across this article ( http://www.linuxforu.com/developers/up-close-and-personal-with-nosql/ ) about NoSQL databases, which is quite concise and informative.

    • Shamail 12:48 pm on August 15, 2011

      It’s not always true that NoSQL alone solves the problem of scaling.

      To actually scale out, you’ll need to denormalize your data. DBs are just a way to represent it.
      One can very well represent denormalized data in MySQL, PostgreSQL, or Mongo/Cassandra.

      Consider, say, designing the Twitter timeline.
      If your data is normalized, then to display a user’s timeline, the standard approach would be:

      1. Select the users that this user follows.
      2. Join them with the status_message table.
      3. Sort the result by time.

      But if the data is denormalized, you introduce a redundant table called timeline, defined as:
      table ( user, timestamp, statusmessage )

      This holds the timeline of each user and is redundantly filled with the status updates of everyone that user follows.
      So to solve the same problem, I’ll end up just calling
      select * from timeline where user = me order by timestamp;


      In the second case, whenever a person I am following posts a status update, an INSERT is made into my timeline. Reading the timeline is then far cheaper than the JOIN you’d have to do in the first case, at the price of extra writes and storage.
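
Both read paths can be tried side by side with an in-memory SQLite database. This is only a sketch: the `follows` table, the helper names and the sample rows are assumptions made for the illustration, and only the `timeline ( user, timestamp, statusmessage )` layout comes from the comment above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (follower TEXT, followee TEXT);
    CREATE TABLE status_message (author TEXT, timestamp INTEGER, statusmessage TEXT);
    -- Denormalized, redundant copy: one row per (reader, status update).
    CREATE TABLE timeline (user TEXT, timestamp INTEGER, statusmessage TEXT);
""")


def follow(follower, followee):
    conn.execute("INSERT INTO follows VALUES (?, ?)", (follower, followee))


def post_status(author, timestamp, message):
    """Store the status, then copy it into the timeline of every follower."""
    conn.execute("INSERT INTO status_message VALUES (?, ?, ?)",
                 (author, timestamp, message))
    conn.execute(
        "INSERT INTO timeline SELECT follower, ?, ? FROM follows WHERE followee = ?",
        (timestamp, message, author),
    )


def timeline_normalized(user):
    """Read path 1: join follows with status_message, then sort by time."""
    rows = conn.execute(
        "SELECT s.statusmessage FROM follows f "
        "JOIN status_message s ON s.author = f.followee "
        "WHERE f.follower = ? ORDER BY s.timestamp DESC", (user,))
    return [row[0] for row in rows]


def timeline_denormalized(user):
    """Read path 2: a single-table lookup on the redundant timeline table."""
    rows = conn.execute(
        "SELECT statusmessage FROM timeline WHERE user = ? ORDER BY timestamp DESC",
        (user,))
    return [row[0] for row in rows]


follow("me", "alice")
follow("me", "bob")
post_status("alice", 1, "hello")
post_status("bob", 2, "world")
post_status("carol", 3, "not followed")

print(timeline_normalized("me"))
print(timeline_denormalized("me"))
```

Both functions return the same messages; the difference is where the work happens. The denormalized version pays on every status update (one INSERT per follower), the normalized one pays on every read.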

      -----

      The second most important principle is lazy referencing.

      Consider a Django model:

      class User(models.Model):
          pincode = models.ForeignKey(Pincode)

      So, internally, user.pincode is resolved through the foreign key: a join takes place if the query uses select_related() (by default, Django instead fetches the related row lazily with a separate query on first access).

      Why not design it like this?

      class User(models.Model):
          pincodeid = models.ObjectIdField()  # or IntegerField

      And call:

      pinid = user.pincodeid
      pincode = Pincode.objects.get(id=pinid)

      This avoids the cost of the join.

      And do note that here the tables can live on different systems altogether, because they are not coupled!
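
The "different systems" point can be sketched outside Django with two separate SQLite databases standing in for two machines. The table layouts, the sample rows and the `get_pincode_for` helper are hypothetical; the only idea taken from the comment is storing a plain id and resolving it with a second key lookup.

```python
import sqlite3

# Two independent databases stand in for two different systems.
user_db = sqlite3.connect(":memory:")
pincode_db = sqlite3.connect(":memory:")

user_db.execute(
    "CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, pincodeid INTEGER)")
pincode_db.execute(
    "CREATE TABLE pincode (id INTEGER PRIMARY KEY, code TEXT)")

pincode_db.execute("INSERT INTO pincode VALUES (7, '110001')")
user_db.execute("INSERT INTO user VALUES (1, 'tux', 7)")


def get_pincode_for(user_id):
    """Resolve the reference with two key lookups instead of one join."""
    row = user_db.execute(
        "SELECT pincodeid FROM user WHERE id = ?", (user_id,)).fetchone()
    if row is None:
        return None
    pin = pincode_db.execute(
        "SELECT code FROM pincode WHERE id = ?", (row[0],)).fetchone()
    # No foreign-key constraint exists, so a dangling id has to be handled here.
    return pin[0] if pin else None


print(get_pincode_for(1))
```

The trade-off is that the database no longer enforces referential integrity; the application has to cope with ids that point nowhere.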

      Yes, indeed, Mongo or any other NoSQL store helps you achieve the above design principles easily. But there is no good reason you can’t do the same in a plain RDBMS.

      Lastly, Lucene solves a completely different problem. While you are at it, try Whoosh (if you want a pure-Python thingy) or Solr (if you want HTTP-API-like stuff and don’t want to mess with the J-thingy).
      The use case of Lucene is full-text search, whereas the problem of scaling is completely different, IMHO.

  • tuxdna 9:33 am on February 25, 2011

    PyLucene on Fedora 14 

    I couldn’t install pylucene simply by running yum install pylucene. Neither easy_install pylucene nor pip-python install pylucene worked either, so I had to build it myself. The steps are listed below:

    A. Install JCC

    $ JCC_JDK=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64 pip-python install jcc

    B. Download pylucene

    wget -c http://apache.mirrors.pair.com//lucene/pylucene/pylucene-2.4.1-1-src.tar.gz
    tar zxf pylucene-2.4.1-1-src.tar.gz
    cd pylucene-2.4.1-1

    C. Build and install ( reference http://lucene.apache.org/pylucene/documentation/install.html )

    1. pushd jcc
    2. edit setup.py to match your environment
    3. JCC_JDK=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64 python setup.py build
    4. sudo python setup.py install
    5. popd
    6. edit Makefile to match your environment

    I had to update the Makefile for Fedora 14:

    # Linux (Fedora 14, Python 2.7, OpenJDK 1.6, setuptools 0.6.14)

    PREFIX_PYTHON=/usr
    ANT=ant
    PYTHON=$(PREFIX_PYTHON)/bin/python
    JCC=$(PYTHON) -m jcc --shared
    NUM_FILES=2

    D. Continue building

    7. make
    8. sudo make install
    9. make test (look for failures)

    The last step, make test, reported some failures.
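
Independently of make test, a quick way to see whether the build is usable at all is to load the module from Python. This is a small smoke-test sketch rather than part of the official install steps; `check_pylucene` is a hypothetical helper, and the `initVM(CLASSPATH)` call and `VERSION` attribute follow the idiom used by the PyLucene sample scripts of this era, so they may differ in other releases.

```python
def check_pylucene():
    """Return a short status string describing whether PyLucene can be loaded."""
    try:
        import lucene
    except ImportError as exc:
        return "PyLucene is not importable: %s" % exc
    # initVM starts the embedded JVM; it has to run before any Lucene class is used.
    lucene.initVM(lucene.CLASSPATH)
    return "PyLucene loaded, Lucene version %s" % lucene.VERSION


print(check_pylucene())
```

If the import fails, the JCC or pylucene install step most likely did not land in the Python that is being run.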

     