On building large-scale data processing systems

I was reading a few blog posts about distributed, large-scale data processing, both batch and real-time. The move is definitely towards real-time now (here and here).

In this blog post I am only going to mention the things that I have come across so far. I would like to learn more.

All the buzz around large scale data processing, in some way or the other, seems to be inspired by papers published by Google or the systems they built. It is just fascinating!

And then there are technologies that just stand out on their own.

  • Apache Lucene – Search Engine Library
  • Apache Mahout – Scalable Machine Learning and Data Mining
  • Storm – Distributed, fault-tolerant realtime computation
  • Spark – Lightning fast cluster computing

The technologies are all there. The time is right. The pieces have to be connected and made to work like one.

Distributed file systems:

HDFS is just one of the many options available. Since it is a user-space file system, other such file systems also fit the criteria, viz. MapRFS, GlusterFS, Tahoe-LAFS and more.

Batch Processing:

Apache Hadoop is pretty much the most widely used tool here. However, many of the industry players have already moved on.

Realtime Processing:

There are many tools to choose from here: Storm, Spark, Esper, S4, and HStreaming.

Machine learning, Data Mining and Analytics:

There are so many free and open source tools to pick from: R, Python's SciPy package, Weka, Apache UIMA, Apache Mahout.

Hardware:

Finally, we need lots of hardware to run this infrastructure on.

Comments and suggestions welcome 🙂
