
  • tuxdna 8:03 am on February 18, 2013
    Tags: emacs

    Set up for learning Scala with Emacs 

    Obviously, the first step is to install the Scala language.

    $ sudo aptitude install scala
    OR
    $ sudo yum install scala
    

    Then I ran my first Scala “Hello world!” program from the command line.

    Setting up Scala mode for Emacs was a bit of a pain, so I merged the old scala-mode and the latest one into my repo. Here are the simple steps to set up scala-mode for Emacs.

    $ cd ~/.emacs.d/
    $ git clone git://github.com/tuxdna/scala-mode.git
    $ cd scala-mode
    $ make
    

    Now add the following startup code to your ~/.emacs file:

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;; START: Scala mode
    (add-to-list 'load-path "/home/tuxdna/.emacs.d/scala-mode")
    (require 'scala-mode-auto)
    ;; END: Scala mode
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    

    And you are done with the setup!

     
  • tuxdna 1:49 pm on February 15, 2013
    Tags: distributed, realtime   

    On building large scale data processing system 

    I was reading a few blog posts about distributed, large-scale data processing, whether batch or real-time, and the move is definitely towards real-time now (here and here).

    In this blog post I am only going to mention the things that I have come across so far. I would like to learn more.

    All the buzz around large scale data processing, in some way or the other, seems to be inspired by papers published by Google or the systems they built. It is just fascinating!

    And then there are technologies that just stand out on their own.

    • Apache Lucene – Search Engine Library
    • Apache Mahout – Scalable Machine Learning and Data Mining
    • Storm – Distributed, fault-tolerant realtime computation
    • Spark – Lightning fast cluster computing

    The technologies are all there. The time is right. The pieces have to be connected and made to work like one.

    Distributed file systems:

    HDFS is just one of the many options available. Since it is a user-space file system, other such filesystems also fit the criteria, viz. MapRFS, GlusterFS, Tahoe-LAFS and more.

    Batch Processing:

    Apache Hadoop is pretty much the most widely used tool. However, many of the industry players have already moved on.

    Realtime Processing:

    There are many tools to choose from here: Storm, Spark, Esper, S4, and HStreaming.

    Machine learning, Data Mining and Analytics:

    There are so many Free and Open Source tools available to pick from: R, Python scipy package, Weka, Apache UIMA, Apache Mahout.

    Hardware:

    Finally we need lots of hardware to run this infrastructure on.

    Comments and suggestions welcome :)

     
  • tuxdna 1:05 pm on February 4, 2013
    Tags: lucene, solr, tika

    Indexing the documents stored in a database using Apache Solr and Apache Tika 

    Indexing the documents stored in a database

    Outline:

    • Set up a MySQL database [1] containing documents (PDF/DOC/HTML etc.).
    • Set up Apache Solr / Tika.
    • Import the documents just by hitting an import URL.

    (NOTE: Also check the update note at the end of this post. )

    These steps were done on my machine running Fedora 17. The commands can easily be converted for other distributions.

    Setup MySQL database with documents

    Install MySQL Server:

    # yum install mysql-server
    # service mysqld start
    

    Also install the Java library for connecting to MySQL (Solr will need it later):

    # yum install -y mysql-connector-java
    

    Set up a MySQL database [1] for storing binary files:

    CREATE DATABASE binary_files;
    
    CREATE TABLE tbl_files (
     id_files tinyint(3) unsigned NOT NULL auto_increment,
     bin_data longblob NOT NULL,
     description tinytext NOT NULL,
     filename varchar(255) NOT NULL,
     filesize integer NOT NULL,
     filetype varchar(255) NOT NULL,
     PRIMARY KEY (id_files)
    );
    
    GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER
     ON binary_files.*
     TO binary_user@localhost
     IDENTIFIED BY 'binary_password';
    

    Now let’s create a Ruby script to populate the database with documents. We need a Ruby-MySQL driver [2][3] and a MIME detection library [4].

    # yum install ruby-mysql
    # yum install rubygem-mime-types
    

    Here is the script [5]: https://gist.github.com/4706365#file-insert-mysql-rb

    Let’s add some documents to the database:
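    The gist is the script I actually used. As a rough sketch of what such a script has to assemble before the INSERT, the snippet below builds one row for tbl_files from a file on disk. The helper name file_row and the small MIME_BY_EXT table are illustrative assumptions (the real setup relies on a MIME detection library [4]), and the database call itself is left out.

```ruby
# Tiny extension => MIME type table, standing in for a real MIME library.
# Only the three sample formats from this post are covered (assumption).
MIME_BY_EXT = {
  '.pdf'  => 'application/pdf',
  '.doc'  => 'application/msword',
  '.html' => 'text/html'
}.freeze

# Build a hash whose keys match the columns of tbl_files
# (id_files is auto-incremented by MySQL, so it is not set here).
def file_row(path, description)
  data = File.binread(path) # read raw bytes, no encoding conversion
  {
    bin_data: data,
    description: description,
    filename: File.basename(path),
    filesize: data.bytesize,
    filetype: MIME_BY_EXT.fetch(File.extname(path).downcase,
                                'application/octet-stream')
  }
end
```

    The resulting hash can then be bound to a parameterised INSERT with whichever MySQL driver is installed.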

    $ mkdir sample-docs
    $ cd sample-docs
    $ wget http://www.manning.com/lowagie/sample-ch03_Lowagie.pdf
    $ wget http://www.xmlw.ie/aboutxml/wordsample2.doc
    $ wget http://www.columbia.edu/~fdc/sample.html
    $ cd ..
    $ for f in sample-docs/* ; do DESC=`basename "$f" | tr ' ' '-' `; ruby insert-mysql.rb "$f" "$DESC"; done
    

    If you get the error “Lost connection to MySQL server during query” when running MySQL insert/update queries, consider raising your MySQL server limits [6]. Update the limit in /etc/my.cnf:

    max_allowed_packet=32M
    

    Set up Apache Solr with Apache Tika integration

    $ wget -c http://apache.techartifact.com/mirror/lucene/solr/3.6.2/apache-solr-3.6.2.tgz
    $ tar zxf apache-solr-3.6.2.tgz
    $ cd apache-solr-3.6.2
    $ cd example/
    

    Here you will see an example of data import from an HSQL database, but we want to work with MySQL, so we create a new configuration. (This is easier to follow if you have read the README files in the Apache Solr package.)

    $ cp -r example-DIH/ dih-mysql/
    $ cd dih-mysql/
    $ rm -rf hsqldb/
    

    Remove everything except db/

    $ cd solr/
    $ rm -rf solr mail rss tika
    $ ln -s /usr/share/java/mysql-connector-java.jar db/lib/
    

    Now the directory structure should look something like this:

    $ find  dih-mysql/ -type d
    dih-mysql/
    dih-mysql/solr
    dih-mysql/solr/db
    dih-mysql/solr/db/conf
    dih-mysql/solr/db/conf/xslt
    dih-mysql/solr/db/lib
    dih-mysql/solr/db/data
    dih-mysql/solr/db/data/index
    

    Let’s now update the Solr configuration. The important part is making sure that the Tika content parser libraries are listed in the configuration file, as shown below:

    Configuration file: dih-mysql/solr/db/conf/solrconfig.xml
    Source: https://gist.github.com/4706517#file-solrconfig-xml

    We just added libraries to parse the content ( to avoid ClassNotFound errors ).

      <lib dir="../../../../contrib/extraction/lib/" regex="tika-core-\d.*\.jar" />
      <lib dir="../../../../contrib/extraction/lib/" regex="tika-parsers-\d.*\.jar" />
      <lib dir="../../../../contrib/extraction/lib/" regex=".*\.jar" />
    

    Configuration file: dih-mysql/solr/db/conf/schema.xml
    Source: https://gist.github.com/4706509#file-schema-xml

    Add the relevant fields which will be indexed along with the binary content ( PDF/DOC/HTML etc. )

    Configuration file: dih-mysql/solr/db/conf/db-data-config.xml
    Source: https://gist.github.com/4706528#file-db-data-config-xml

    We configured the database/table/columns from which to fetch the content to be indexed.

    I recommend going through the official documentation [7].

    Now we are all set with the configuration. It’s time to index the documents:

    Index the documents

    Start the Solr server. Notice how we are specifying the configuration path:

    $ cd apache-solr-3.6.2/example
    $ java -Dsolr.solr.home="./dih-mysql/solr/" -jar start.jar
    

    And invoke the indexer by hitting this URL: http://localhost:8983/solr/db/dataimport?command=full-import
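    The same request can be scripted. This is a minimal sketch, assuming the host, port and core name used in this post; dataimport_uri and run_dataimport are hypothetical helper names, not part of Solr.

```ruby
require 'net/http'
require 'uri'

# Build the DataImportHandler URL for a given command.
# Defaults mirror the setup in this post (localhost:8983, core "db").
def dataimport_uri(command, host: 'localhost', port: 8983, core: 'db')
  URI::HTTP.build(host: host, port: port,
                  path: "/solr/#{core}/dataimport",
                  query: URI.encode_www_form(command: command))
end

# Send the request and return the response body as a string.
# Needs a running Solr instance, so it is not called here.
def run_dataimport(command)
  Net::HTTP.get(dataimport_uri(command))
end
```

    Calling run_dataimport('full-import') starts the import; the handler also accepts command=status, which is handy for checking progress from a script.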

    The steps that I described worked for me just fine. I hope this helps in resolving the issues faced by others [8] and [9].

    UPDATE:

    I had to add an extra entity field in both schema.xml and db-data-config.xml to make the indexing work. Perhaps there is some other problem with my configuration. I don’t understand why this works (and why the officially documented approach [7] doesn’t), but this is the workaround I figured out.

    Update in schema.xml

       <field name="text2" type="text" indexed="false" stored="false"/>
    

    Update in db-data-config.xml

    	<entity dataSource="fieldReader"
    		processor="TikaEntityProcessor"
    		dataField="root.bin_data" format="text">
              <field column="text2" name="body" />
    	</entity>
    

    References:
    [1] http://onlamp.com/pub/a/php/2000/09/15/php_mysql.html
    [2] http://www.tmtm.org/en/ruby/mysql/
    [3] http://zetcode.com/db/mysqlrubytutorial/
    [4] http://stackoverflow.com/questions/2082293/get-mimetype-from-filename
    [5] https://gist.github.com/4706365#file-insert-mysql-rb
    [6] http://stackoverflow.com/questions/5688403/how-to-check-and-set-max-allowed-packet-mysql-variable
    [7] http://lucidworks.lucidimagination.com/display/lweug/Indexing+Binary+Data+Stored+in+a+Database
    [8] http://stackoverflow.com/questions/11339840/indexing-binary-files-from-database-issue-no-errors/14638909#14638909
    [9] http://stackoverflow.com/questions/14671461/tika-fetches-the-binary-content-stored-in-database-but-does-not-indexes-it

     
  • tuxdna 11:09 am on January 6, 2013

    JCallTracer: Tool to generate Sequence Diagrams for Java programs 

    For some time now I have been working on a project called JCallTracer. I had a simple problem at hand: generate Sequence Diagrams for a program written in Java. I searched for such a tool but couldn’t find anything that was Open Source and worked on Linux. The closest I could find was Java Call Tracer. This tool was designed for Windows users and didn’t compile on Linux. I fixed that, but it was apparently designed for Java programs with a small memory footprint, and it always crashed my system for bigger Java programs. The reason is that it kept the whole call graph in memory and also made heavy use of realloc(), which made it very slow.

    My problem wasn’t solved. Hence the tool ( still in the making ):

    JCallTracer is a tool to generate Sequence Diagrams for big, multi-threaded Java applications. Large Java applications themselves use a lot of RAM, leaving little space for a JVMTI agent to maintain its own data structures to trace the calls.

    This tool works (it is experimental at the moment) by offloading much of the internal data structures to disk using sequential writes. Sequential writes are fast!
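    To illustrate the idea only (this is not code from JCallTracer, which is a native JVMTI agent), here is a small Ruby sketch of an append-only trace log. TraceLog and each_event are hypothetical names, and the tab-separated line format is an assumption made for the example.

```ruby
# Append-only trace log: every method entry/exit becomes one line on disk,
# so nothing about the call graph has to be kept in memory while tracing.
class TraceLog
  def initialize(path)
    @file = File.open(path, 'a') # append mode => purely sequential writes
  end

  # kind is :enter or :exit; thread_id and method_name are plain strings.
  def record(kind, thread_id, method_name)
    @file.puts([kind, thread_id, method_name].join("\t"))
  end

  def close
    @file.close
  end
end

# Replay the log in order, yielding one event at a time, so a diagram
# generator can stream through it instead of loading everything at once.
def each_event(path)
  File.foreach(path) do |line|
    kind, thread_id, method_name = line.chomp.split("\t", 3)
    yield kind.to_sym, thread_id, method_name
  end
end
```

    The trade-off is that the diagram has to be built in a second pass over the file, but the traced process never pays for a growing in-memory structure.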

    Parts of this project are based on the code from Java Call Tracer.

    Project repository: https://github.com/tuxdna/jcalltracer

    I would like to know your opinion.

     
  • tuxdna 6:25 am on December 24, 2012

    C 2011 Standard and current FOSS implementations 

    Today I learned that a new C standard was released in 2011. There is a detailed Dr. Dobb’s article on the subject. So far I haven’t come across any Open Source compiler that fully implements the C11 features; Clang and GCC are yet to fully support this standard.

    Clang has added support for anonymous structs and anonymous unions:

    Clang 3.1 adds support for anonymous structs and anonymous unions, added in the latest ISO C standard. Use -std=c11 or -std=gnu11 to enable support for the new language standard. The new C11 features are backwards-compatible and are available as an extension in all language modes.

    GCC has implemented C11 in parts as mentioned in its documentation below:

    A fourth version of the C standard, known as C11, was published in 2011 as ISO/IEC 9899:2011. GCC has limited incomplete support for parts of this standard, enabled with -std=c11 or -std=iso9899:2011. (While in development, drafts of this standard version were referred to as C1X.)

    This means that not only the compiler but also the C library needs additions, which makes Glibc a candidate for new code: see Sourceware Bug 14092 for adding C11 threads.

    Way to go!

    References:

    http://stackoverflow.com/questions/9804594/compilers-that-support-c11

    http://llvm.org/releases/3.1/docs/ClangReleaseNotes.html#cchanges

    http://gcc.gnu.org/onlinedocs/gcc/Standards.html

     
  • tuxdna 7:12 pm on December 11, 2012
    Tags: apache, incubator   

    Apache Incubator projects 

    I was going through a list of Apache Incubator projects and I found a few really interesting projects, primarily because I could immediately relate them to some functionality I could readily use.

    However, I have to say that the layout of the Apache Incubator projects page makes this a daunting task: you have to visit each and every project link to find out which technology or domain a project name relates to. If, instead of a matrix of project names, there were a simple list with Project Name, Technologies, Domain etc., it would be far easier to identify the relevant projects.

    Here is the list of ones I found interesting:

    OpenMeetings

    Openmeetings provides video conferencing, instant messaging, white board, collaborative document editing and other groupware tools using API functions of the Red5 Streaming Server for Remoting and Streaming.

    If you have used remote collaboration tools (e.g. Elluminate) with chat, screen sharing, audio/video conferencing etc., you will realize how useful OpenMeetings could be.

    Apache Drill

    Apache Drill (incubating) is a distributed system for interactive analysis of large-scale datasets, based on Google’s Dremel. Basically something very similar to Google’s BigQuery.

    It also has a discussion blog.

    My understanding is that there are three kinds of processing we generally do on big-data:

    1) batch processing i.e. analysis using Map-Reduce like frameworks

    2) realtime processing i.e. process the data as it arrives using Storm for example

    3) drill down processing i.e. find a needle in a haystack or a fine-grained search using a complex SQL query

    Apache Drill would provide the last functionality (3).

    I hope both of these projects will be good to watch out for.

     
  • tuxdna 6:20 am on December 6, 2012
    Tags: memory, process

    Memory consumption by a .so file for a running process 

    I wanted to know how much memory the C++ standard library consumes in a process running on Linux. I could not find a straightforward way, so I wrote a small script to do exactly that.

    Script Location: https://gist.github.com/4215536

    How to use?


    $ wget https://raw.github.com/gist/4215536/6ae899f454fd72ba3b6202724e15f855f80e33b3/mem-usage.rb
    $ ruby ./mem-usage.rb /proc/5952/maps | grep libstd
    /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16: 2988.0 KBs

    In the above example, 5952 is the PID of Thunderbird mail client and C++ standard library consumes 2988 KB of memory for this process.
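    For reference, the core idea can be sketched in a few lines of Ruby. This is not the gist itself; mapped_kb_by_file is a hypothetical helper that sums the address ranges listed in /proc/PID/maps per backing file, and it assumes the usual six-column layout of that file.

```ruby
# Sum the size of every file-backed mapping in the text of /proc/PID/maps.
# Returns a Hash of pathname => total size in KB.
def mapped_kb_by_file(maps_text)
  totals = Hash.new(0.0)
  maps_text.each_line do |line|
    # Columns: address range, perms, offset, device, inode, pathname.
    range, _perms, _offset, _dev, _inode, path = line.split(' ', 6)
    next if path.nil?
    path = path.strip
    # Skip anonymous mappings and pseudo entries such as [heap] or [stack].
    next if path.empty? || path.start_with?('[')
    first, last = range.split('-').map { |hex| hex.to_i(16) }
    totals[path] += (last - first) / 1024.0
  end
  totals
end
```

    For example, mapped_kb_by_file(File.read('/proc/5952/maps')) returns a hash in which the libstdc++ entry holds the total for that library. Only file-backed mappings are counted, not memory the library allocates dynamically.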

     
    • Siddhesh 1:08 pm on December 6, 2012

      It’s actually quite easy to do:

      pmap $(pgrep cat) | grep -v "\[" | awk 'BEGIN{prev=""; size = 0;} {if (prev == "" || prev == $4) {size=size + $2} else { printf("%s: %dK\n", prev, size); size = $2}; prev = $4}'

      I’m sure there’s an easier way to do this. Also, you’re only seeing the static sections allocated for the library, not stuff that the library dynamically allocates – that is in the anonymous mappings or in the heap.

      • tuxdna 1:29 pm on December 6, 2012

        @Siddhesh: You are correct. For my purpose I was only concerned about how much more memory would be needed if I link to C++ standard library.

        I think there should be a nice tool to do such an analysis.

    • spinningmatt 1:25 pm on December 6, 2012

      $ pmap $(pidof thunderbird)

      00000032d3400000 916K r-x-- /usr/lib64/libstdc++.so.6.0.17
      00000032d34e5000 2044K ----- /usr/lib64/libstdc++.so.6.0.17
      00000032d36e4000 32K r---- /usr/lib64/libstdc++.so.6.0.17
      00000032d36ec000 8K rw--- /usr/lib64/libstdc++.so.6.0.17
      00000032d36ee000 84K rw--- [ anon ]

      Often the dynamic [anon] sections follow their named library. However, I don’t know if that’s intended behavior or a happy accident.

      • tuxdna 1:27 pm on December 6, 2012

        Wow! I didn’t know about pmap at all. :)

      • Siddhesh 1:34 pm on December 6, 2012

        > Often the dynamic [anon] sections follow their named library.

        No, they don’t. There are no guarantees about the location of dynamically allocated address space.

  • tuxdna 5:45 am on November 8, 2012
    Tags: ubuntu xscreensaver   

    Lots of screen savers on Ubuntu:

    $ sudo aptitude install xscreensaver-data-extra xscreensaver-screensaver-bsod xscreensaver-screensaver-dizzy xscreensaver-screensaver-webcollage
    
     
  • tuxdna 5:33 am on October 3, 2012
    Tags: atom, jdom, rome tools, rss, xml

    XML, RSS, ATOM and Java 

    I was searching for ways to generate XML feeds (Atom / RSS) using Java. It appeared to be a trivial task, but it’s not. There are so many Java libraries capable of reading and writing XML that evaluating them became a daunting task. After a bit of experimentation I settled on JDOM, which is very simple to use.

    On top of that, ROME tools make it even easier to read and write feeds using Java. There is a very nice tutorial available for the ROME tools. Not just that, ROME has a bunch of modules so that you can add more information to your feeds.

     
    • sarabjeet 9:59 am on October 10, 2012

      I really like that you are giving information on core and advanced Java concepts. Being enrolled at http://www.wiziq.com/course/1779-core-and-advance-java-concepts I found your information very helpful indeed. Thanks for it.

    • modern Management systems 1:30 am on May 3, 2013

      Appreciating the time and energy you put into your site and detailed information you provide.
      It’s great to come across a blog every once in a while that isn’t the same unwanted rehashed material.
      Excellent read! I’ve saved your site and I’m including your RSS feeds to my Google
      account.

  • tuxdna 7:52 am on May 13, 2012

    Tzinga – An energy drink startup in India 

    An energy drink startup in India.

    In India, we mostly consume only aerated drinks. Energy drinks are rare to see, whether you look at the urban or the rural landscape. So far I have come across very few energy drinks: RedBull, Tzinga, Cloud9, Rio etc. A friend of mine asked me to try an energy drink called Tzinga.

    [Image: all three Tzinga drinks]

    I received a package with all the three flavors ( shown in the picture above ). What’s my take on this?

    The cost factor: It is really cheap, at only Rs. 20 per packet.
    The kick: All of these contain caffeine equivalent to 1 cup of coffee.

    The three flavors one by one, in the order I liked them.
    LEMON MINT: This is the best flavor in my opinion, very smooth and a bit tangy. My friends and I liked it.
    TROPICAL TRIP: This again is a very smooth flavor. My friends and I liked it.
    MANGO STRAWBERRY: This one I didn’t like as it tasted a little bit bitter.

    In sum, I would recommend Tzinga to all the energy drink fans.

     
    • tuxdna 8:39 am on May 13, 2012

      No, it’s not.

    • debayan 8:47 pm on May 13, 2012

      What is it that actually keeps you awake? Just the caffeine?

      • tuxdna 1:20 pm on May 17, 2012

        Yes, the caffeine pretty much does the job. Also I like it to be chilled.

    • runa 4:16 am on May 17, 2012

      Where did you get it from?

      • tuxdna 1:20 pm on May 17, 2012

        It is not sold in Maharashtra as of now. I got it delivered from Gurgaon.
