On building a large-scale data processing system
Well, in this blog post I am only going to mention the things that I have come across so far. I would like to learn more.
All the buzz around large-scale data processing seems, in some way or another, to be inspired by papers published by Google or the systems they built. It is just fascinating!
- Apache HDFS – A distributed user-space filesystem based on the Google File System
- Apache Hadoop – A MapReduce implementation (see the word-count sketch after this list)
- Apache Drill – An implementation of Dremel; I explored this one in an earlier blog post.
- Apache Hama – A BSP implementation inspired by Pregel (large-scale graph computing at Google)
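To make the MapReduce model concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop 2 Java API. The class names and command-line paths are just illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups values by key, so we just sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir, e.g. on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir, must not exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The combiner is worth noting: it runs the reducer logic locally on each mapper's output, which cuts down the data shuffled across the network before the real reduce phase.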
And then there are technologies that just stand out on their own.
- Apache Lucene – Search Engine Library
- Apache Mahout – Scalable Machine Learning and Data Mining
- Storm – Distributed, fault-tolerant realtime computation
- Spark – Lightning fast cluster computing
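For contrast, here is the same word count sketched against Spark's Java API (recent releases; the master URL and paths are placeholders). Where the Hadoop job above writes map output to disk between phases, Spark keeps the intermediate data in memory as RDDs, which is where the "lightning fast" claim comes from:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("word count")
                .setMaster("local[*]"); // local sketch; a cluster URL would go here in production
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load a text file into a resilient distributed dataset (RDD).
        JavaRDD<String> lines = sc.textFile(args[0]); // input path is a placeholder

        // The same word-count logic as the MapReduce job above, expressed as
        // chained in-memory transformations instead of separate job phases.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile(args[1]); // output directory is a placeholder
        sc.stop();
    }
}
```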
The technologies are all there, and the time is right. The pieces just have to be connected and made to work as one.
Distributed file systems:
Apache Hadoop is pretty much the tool that is most widely used here. However, many industry players have already moved on.
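To get a feel for the abstraction, here is a small hypothetical sketch that writes and reads a file through Hadoop's `FileSystem` API. The NameNode address is an assumption and would normally come from core-site.xml:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; usually picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS transparently splits it into blocks and
        // replicates them across DataNodes.
        Path path = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello, distributed world\n");
        }

        // Read it back through the same filesystem-like API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```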
Machine learning, Data Mining and Analytics:
Finally, we need lots of hardware to run this infrastructure on.
Comments and suggestions welcome 🙂