Recent Updates

  • tuxdna 8:50 pm on April 27, 2014

    I have moved to a new blog location http://tuxdna.in/blog/archives/

     
    • Johng269 6:31 pm on May 28, 2014

      Hi, neat post. There is a problem with your website in Internet Explorer; could you check this? IE is still the market leader, and a large portion of people will miss your magnificent writing because of this problem.

  • tuxdna 8:54 pm on February 3, 2014

    A simple Scala parser to parse 44GB Wikipedia XML Dump

    I had to parse a Wikipedia XML dump (a 44GB XML file, uncompressed). The XML dump is available here, and I have also created a smaller sample file to run this code against: sample wiki.xml file.

    Below is an event-based XML parser built on Scala’s XMLEventReader:

    package xml
    
    import scala.io.Source
    import scala.xml.pull._
    import scala.collection.mutable.ArrayBuffer
    import java.io.File
    import java.io.FileOutputStream
    import scala.xml.XML
    
    object wikipedia extends App {
    
      val xmlFile = args(0)
      val outputLocation = new File(args(1))
    
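      // Wikipedia dumps are UTF-8; read them as such instead of relying on the platform default
      val xml = new XMLEventReader(Source.fromFile(xmlFile, "UTF-8"))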
    
      var insidePage = false
      var buf = ArrayBuffer[String]()
      for (event <- xml) {
        event match {
          case EvElemStart(_, "page", _, _) => {
            insidePage = true
            val tag = "<page>"
            buf += tag
          }
          case EvElemEnd(_, "page") => {
            val tag = "</page>"
            buf += tag
            insidePage = false
    
            writePage(buf)
            buf.clear
          }
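          // Note: only the tag name is copied; attributes are dropped
          case EvElemStart(_, tag, _, _) => {
            if (insidePage) {
              buf += ("<" + tag + ">")
            }
          }
          case EvElemEnd(_, tag) => {
            if (insidePage) {
              buf += ("</" + tag + ">")
            }
          }
          case EvText(t) => {
            if (insidePage) {
              buf += (t)
            }
          }
          // Entity references such as &lt; and &amp; arrive as separate events;
          // re-emit them, otherwise those characters are silently lost
          case EvEntityRef(entity) => {
            if (insidePage) {
              buf += ("&" + entity + ";")
            }
          }
          case _ => // ignore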
        }
      }
    
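      def writePage(buf: ArrayBuffer[String]) = {
        val s = buf.mkString
        // Parse the collected page so the page id can be looked up
        val x = XML.loadString(s)
        // The direct <id> child of <page> is the page id
        val pageId = (x \ "id").head.text
        val f = new File(outputLocation, pageId + ".xml")
        println("writing to: " + f.getAbsolutePath())
        val out = new FileOutputStream(f)
        try {
          // Write as UTF-8 explicitly rather than the platform default charset
          out.write(s.getBytes("UTF-8"))
        } finally {
          out.close()
        }
      }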
    
    }
    

    Find this code snippet on GitHub

    Let’s see how long it takes to process all the Wikipedia pages in the 44GB XML dump.

    It took roughly 7 hours 30 minutes. That’s not bad:

    $ time sbt "run-main xml.wikipedia enwiki-20140102-pages-articles-multistream.xml wiki-pages/"
    
    [success] Total time: 26918 s, completed Feb 4, 2014 9:56:38 AM
    
    real	448m41.888s
    user	82m47.594s
    sys	192m46.238s
    

    And it generated 14128976 XML files:

    $ ls wiki-pages/ | wc -l
    14128976
    $ du -sh wiki-pages/ 
    80G	wiki-pages/
    

    As you can see, the 44GB uncompressed XML file ended up occupying 80GB of storage once split into separate pages. That is something to work on.
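
    One plausible contributor to the growth, offered as an assumption rather than something measured in this post: most filesystems allocate space in whole blocks, so every small file occupies at least one block. The sketch below computes a rough lower bound, assuming a 4096-byte block size (the block size is not stated in the post, and the helper name min_allocated_gib is made up for this example).

```shell
# min_allocated_gib FILES BLOCK_BYTES
# Lower bound on allocated space, in whole GiB, if every file takes at least one block.
min_allocated_gib() {
  echo $(( $1 * $2 / 1024 / 1024 / 1024 ))
}

# 14128976 files (from the `wc -l` output above), assumed 4096-byte blocks
min_allocated_gib 14128976 4096
```

    Under that assumption the files need about 53 GiB before any page content larger than one block is counted. Comparing `du -sh --apparent-size wiki-pages/` (sum of file sizes, GNU du) with plain `du -sh wiki-pages/` (allocated blocks) would show how much of the 80G is allocation overhead.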


     
    • Monserrate 9:03 pm on March 5, 2014

      Hi there, it’s me. I also visit this website on a regular basis; it is in fact nice, and the visitors really share fastidious thoughts.

    • Will Sargent 5:36 pm on April 12, 2014

      You might want to try Nux or Aalto if you need to stream lots of XML fast. Nux especially is designed for large datasets.

  • tuxdna 4:31 am on November 10, 2013

    JMILUG Meetup – 9th November 2013

    Attendees:

    • Hammad Haleem
    • Saleem Ansari
    • Pankaj Sharma
    • Safiyat Reza
    • Umar Ahmad
    • Vivek Gupta
    • Sawood Alam
    • Vipul Nayyar
    • Amit Shah

    There wasn’t a pre-defined agenda, so the discussion took its own course. We discussed many things:

    • Sawood Alam shared the work he is doing in his research group at Old Dominion University.
    • Vivek Gupta shared what kind of challenging problems he is working on.
    • Vipul Nayyar shared his GSoC experience with the RTEMS project.
    • There was a discussion about patents, patent laws, usage of search engines etc.
    • We talked about how to approach further studies and programming competitions, along with some general chit-chat.

    In this meetup some members met after a very long time, so it was only a get-together. No tasks were assigned in particular.

     
  • tuxdna 8:25 pm on August 25, 2013

    Attending ScalaTraits 2013 event in New Delhi

    First of all, I was surprised that an event specifically targeted towards Scala was happening in India, and very fortunately in New Delhi itself. I attended the ScalaTraits 2013 event in New Delhi a couple of days back.

    The event was put together by Knoldus, a company specializing exclusively in Scala and related technologies.

    Goodies at ScalaTraits 2013

    The agenda was as follows:

    Introductory talk

    This talk was an introduction to the Scala ecosystem as a whole by Vikas Hazrati. He discussed some history, which Scala technologies are currently popular and which companies are using them.

    Kick start to Scala by Sanjeev Kumar

    This was a 3 hour session on some core concepts in Scala, followed by setting up Scala IDE bundle and using Scala Worksheets to try out some cool examples.

    Some of the topics I remember right away are Functional Programming, Equational Reasoning, Functional Language features ( functions are first-class values, and the language encourages immutability ), every statement has a return value ( and a type ), a compound expression has a return type as well, Type inference, Classes and Objects, Class Inheritance, Default constructor, Predef object, Case classes, Functional Objects ( objects that do not have mutable state ), File processing etc.

    Kick start to Play by Neeklanth Sachdeva

    This was a 3 hour session in which we learnt how Play is an MVC web framework ( quite a lot like Ruby on Rails actually ). We set up a Play development environment, created a sample Play app with database connectivity, some routes, and a basic HTML view. Then we deployed it on Heroku.

    All in all, it was a good learning experience with on-the-spot hands-on exercises. Other than that, the venue was very nice, with great food and a pleasant team at work. Bravo!

     
  • tuxdna 7:54 pm on August 25, 2013

    Deleting lots of spam content on a Drupal website

    Deleting Spam on FUDCON.in website

    After the FUDCon Pune event in 2011, the website has been running as is. Just a couple of days back, I noticed that a lot of spam had accumulated on it. This is content that is not displayed on the website unless you know its URL. I located the last known sane activity and began estimating how much spam content I had to delete.

    Here I use Drush and a simple PHP script.

    $ drush sql-cli
    
    mysql> select unix_timestamp('2012-02-28 19:57:11 +0530') from dual;
    +---------------------------------------------+
    | unix_timestamp('2012-02-28 19:57:11 +0530') |
    +---------------------------------------------+
    |                                  1330487831 | 
    +---------------------------------------------+
    
    mysql> SELECT count(*) FROM node AS n WHERE n.type = 'session' and  n.created > 1330487831  ;
    +----------+
    | count(*) |
    +----------+
    |    22110 | 
    +----------+
    1 row in set (0.22 sec)
    

    Twenty-two thousand plus entries! That is far too much content to delete from the Admin UI.
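
    A side note on the timestamp, as a sketch that assumes GNU date is available: the epoch value returned above decodes to 03:57:11 UTC on February 29, which suggests that the +0530 suffix was not applied and the literal was read in the server’s own time zone. GNU date honours the offset, so it can serve as a cross-check. The function names below are made up for this illustration.

```shell
# to_epoch "YYYY-MM-DD HH:MM:SS +ZZZZ": print Unix time, honouring the offset (GNU date)
to_epoch() {
  date -d "$1" +%s
}

# from_epoch SECONDS: print the UTC wall-clock time for a Unix timestamp (GNU date)
from_epoch() {
  date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

to_epoch '2012-02-28 19:57:11 +0530'   # epoch for 19:57:11 at UTC+05:30
from_epoch 1330487831                  # what the value from the query corresponds to in UTC
```

    With the offset honoured the cutoff comes out as 1330439231, which is 13.5 hours earlier than the value shown in the query output.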

    A simple solution was to script it.

    $ cat delete_spam.php
    <?php
      require_once './includes/bootstrap.inc';
      drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);
    
      global $user;
      $original_user = $user;
      $user = user_load(1);
    
      echo $user->uid . " " . $user->mail;
      echo "\n";
    
      // $aquery= db_query("SELECT nid FROM {node} AS n WHERE n.type = 'session' and n.created > 1330487831");
      $aquery= db_query("SELECT nid FROM {node} AS n WHERE n.type = 'session' and n.nid >= 315");
      while ($row = db_fetch_object($aquery)) {
        // node_delete($row->nid);
        $nid = $row->nid;
        $node = node_load(array("nid" => $nid));
        echo "Deleting " . $nid . ": " . $node->title . "\n" ;
        node_delete($nid);
      }
    
      $user = $original_user;
    
    ?>
    

    Now we can execute this script:

    $ drush php-script delete_spam.php | tee delete_spam.out
    $ wc -l delete_spam.out
    22110 delete_spam.out
    

    Now, all 22110 entries were deleted!

    The next step is to clear the old cache as well:

    $ drush cc
    Enter a number to choose which cache to clear.
     [0]  :  Cancel         
     [1]  :  all            
     [2]  :  drush          
     [3]  :  theme-registry 
     [4]  :  menu           
     [5]  :  css-js         
     [6]  :  block          
     [7]  :  module-list    
     [8]  :  theme-list     
     [9]  :  nodeaccess     
    
    1
    'all' cache was cleared                   [success]
    

    That’s how I deleted so much spam content quickly.

     
  • tuxdna 2:04 pm on August 7, 2013
    Tags: rdesktop

    Remote Desktop from a Linux client machine

    Connecting to a Remote Desktop from a Linux machine is easy. Invoke the following command:

    rdesktop -r sound:local -r clipboard:CLIPBOARD -z -g '80%' -a 15 -u user.name -p - -d MYDOMAIN remote.hostname.com
    

    The above command does the following:

    • Forwards remote sound to the local machine
    • Enables clipboard sharing
    • Uses compression
    • Sets the remote desktop size to 80% of the local machine’s screen
    • Uses 15-bit color depth on the remote desktop
    • Logs in with user name user.name and a password read from STDIN
    • Connects to remote.hostname.com in the domain MYDOMAIN

    That’s it!

    EDIT: Updated the explanation in the order of CLI options to the rdesktop command.

     
    • chuck 3:09 pm on August 7, 2013

      Thanks for the tip. One little favor to ask of you though. When you explain what the command does, can you put the explanation in the order of the command? For example, the color depth option is near the middle of the command line sequence but at the end of your explanation list.

    • tuxdna 7:40 pm on August 7, 2013

      @chuck:

      Thank you so much for your feedback. I updated the post as per your suggestion. :-)

    • tuxdna 7:06 am on August 19, 2013

      I also like to fit the rdesktop window into the available space in the screen. For example, if I want to fit to the size of Gnome Terminal, I use xwininfo:

      $ xwininfo

      xwininfo: Please select the window about which you
      would like information by clicking the
      mouse in that window.

      xwininfo: Window id: 0x3800004 "/bin/bash"

      Absolute upper-left X: 0
      Absolute upper-left Y: 47
      Relative upper-left X: 0
      Relative upper-left Y: 22
      Width: 1280
      Height: 936
      Depth: 32
      Visual: 0x67
      Visual Class: TrueColor
      Border width: 0
      Class: InputOutput
      Colormap: 0x3800003 (not installed)
      Bit Gravity State: NorthWestGravity
      Window Gravity State: NorthWestGravity
      Backing Store State: NotUseful
      Save Under State: no
      Map State: IsViewable
      Override Redirect State: no
      Corners: +0+47 -0+47 -0-41 +0-41
      -geometry 157x53+0+25

      and then specify the geometry as

      $ rdesktop -g "1280x936"
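
      The Width and Height lines can also be picked out automatically. This is a small sketch that assumes the xwininfo output format shown above; the function name is hypothetical.

```shell
# geometry_from_xwininfo: read xwininfo output on stdin, print WIDTHxHEIGHT
geometry_from_xwininfo() {
  awk '/^ *Width:/  { w = $2 }
       /^ *Height:/ { h = $2 }
       END { if (w != "" && h != "") print w "x" h }'
}

# Sample input in the same shape as the output above
printf '  Width: 1280\n  Height: 936\n' | geometry_from_xwininfo
```

      In a real X session this could be combined as `rdesktop -g "$(xwininfo | geometry_from_xwininfo)" remote.hostname.com`, clicking the window to measure when xwininfo asks for it.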

    • tuxdna 9:23 am on August 19, 2013

      The rdesktop clipboard wasn’t working well on Ubuntu. As it turns out, it’s actually a problem with the remote Windows machine. Here is the solution:

      http://wyding.blogspot.com/2011/11/clipborad-not-work-using-ubuntu-for.html

      • kill rdpclip.exe ( on remote machine )
      • exit rdesktop
      • rerun rdesktop ( this launches rdpclip.exe automatically )
  • tuxdna 9:46 pm on July 30, 2013

    Continued: Juniper Networks VPN from Fedora 64bit

    This post is the continuation of my earlier post about Juniper VPN.

    In the earlier post, I connected to the VPN using a login/password/certificate combination. Now I have also managed to use the ncui tool for the connection, which is based on a cookie value and a certificate. I wasn’t able to connect to this configuration using the method in my previous post.

    $ ./ncui -h vpn.example.com -c DSID="YOUR_DSID_COOKIE" -f vpn.example.com-cert.der
    Password: <ENTER SUDO PASSWORD HERE>
    

    Here, you first need to log in to your VPN domain from a web browser. Once you do that, you need to obtain two things:

    • DSID cookie value. This can be easily obtained from a web browser. You only need to look up the relevant cookie value in preferences or through “View Cookies” in your web browser.
    • The server certificate in DER format, which is the file passed with -f in the command above.

    For more detailed information I have listed the installation steps in a gist. Search for “Alternative method using ncui” at the bottom of this gist.
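
    The ncui command above expects the certificate as a DER file. How that file was produced is not shown in this post, so the following is only a sketch of one common approach with the openssl command-line tool: converting a PEM certificate to DER. The function name and the file names are placeholders.

```shell
# pem_to_der IN.pem OUT.der: convert an X.509 certificate from PEM to DER encoding
pem_to_der() {
  openssl x509 -in "$1" -inform PEM -outform DER -out "$2"
}

# Example (assumes cert.pem exists; both file names are placeholders):
# pem_to_der cert.pem vpn.example.com-cert.der
```

    A server’s certificate can typically be fetched in PEM form first with `openssl s_client -connect vpn.example.com:443 </dev/null | openssl x509 -outform PEM > cert.pem`.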

    Also I came across two notable tools for Juniper VPN:

    • jvpn: a nice tool which automates many of the steps.

    Again, I got rid of Windows dependency! :-)


     
  • tuxdna 1:50 am on July 21, 2013
    Tags: openstack

    Setting up OpenStack on Fedora 19 is a lot of work

    I wanted to experiment with creating a Fedora 19 compute node on Fedora 19 + OpenStack. However, it seems there are a bunch of issues which need to be fixed. The issues and solutions are already recorded by many people.

    I list the highlights:

    • MySQL Server in Fedora 19 is actually MariaDB Server
    • Keystone log file needs to be chowned to keystone:keystone
    • Fedora 19 doesn’t have the kvm.modules file at the expected location
    • I saw at least one error due to SELinux

    I am recording the errors, commands and references in the following gist on github.com: openstack-fedora19.md

    EDIT:

    Finally, I managed to complete the setup. Now it’s time to launch some VMs!

     
  • tuxdna 1:25 pm on July 20, 2013

    Emacs fullscreen and Fedora 19

    I had been using fullscreen.el for a long time, but it no longer seems to work on Fedora 19 / GNOME 3.8.1.

    What to do? Here are the steps I followed for now.

    First install wmctrl:

    $ sudo yum install wmctrl
    

    Now add the following code to your .emacs configuration file:

    (defun switch-full-screen ()
      "Toggle fullscreen for the active window using wmctrl."
      (interactive)
      (shell-command "wmctrl -r :ACTIVE: -b toggle,fullscreen"))

    (global-set-key [f11] 'switch-full-screen)


    Restart Emacs and press F11. That’s it!

     
    • jmt 2:45 pm on July 20, 2013

      Emacs doesn’t even go full screen using “emacs -fs” from the shell.

    • jmt 2:50 pm on July 20, 2013

      Seems to be a Gnome 3 and/or GTK3 problem since this works fine on the Mate Desktop.

      Thanks for the workaround.

      • tuxdna 6:36 pm on July 20, 2013

        @jmt: Yes, it doesn’t. I too tried “emacs -fs”. Surely, something to do with GTK3/GNOME3.

        Thanks :-)

    • Alexander Kahl 8:17 am on July 21, 2013

      Just what I needed, thanks a lot!

    • Dag 12:28 pm on July 28, 2013

      There’s a configurable keyboard shortcut for “Toggle fullscreen mode” in GNOME Settings that works with most applications/windows.

  • tuxdna 10:36 pm on July 19, 2013
    Tags: ghostscript, ocr, pdf, tesseract

    Extract Text from a multi-page PDF with only Images

    Sometimes there are only images in a PDF. In such cases you cannot select text to copy / paste or just for reference.

    To extract text from an image or a PDF containing only images, I used the Tesseract OCR Engine and Ghostscript. I am running Fedora 19 at the moment, however these steps should apply to an older version of Fedora or Ubuntu. ( I believe this can be done on Windows as well ). Both Tesseract and Ghostscript are free software.

    First, install both Tesseract and Ghostscript on Fedora:

    $ sudo yum install -y ghostscript tesseract
    

    Now go to the folder where your PDF is located ( assuming that it is named as story.pdf ):

    $ cd ~/Downloads/
    

    Next, extract each page from PDF as a PNG. For this I used Ghostscript. Note the resolution ( -r300 ):

    $ ghostscript -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 -sOutputFile="page%03d".png story.pdf
    $ ls page*.png
    page001.png
    page002.png
    ...
    

    Once we have a PNG for each page, we can use the OCR software to extract text:

    $ for f in page*.png ; do tesseract "$f" "$f.out"; done
    $ ls page*.out.txt
    page001.png.out.txt
    page002.png.out.txt
    ...
    

    So, now we have all the text from the images in text files. Tesseract works quite well for OCR. Obviously it can’t read drawings or misprinted characters well, but it is still quite accurate.
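
    To read the result as one document, the per-page files can be joined in page order. This is a minimal sketch; combine_pages and the output name story.txt are made up for this example, and the zero-padded page numbers are what keep the shell glob in the right order.

```shell
# combine_pages OUT FILE...: concatenate per-page text files into OUT,
# writing a header line with the source file name before each page
combine_pages() {
  out="$1"
  shift
  : > "$out"
  for f in "$@"; do
    printf '==== %s ====\n' "$f" >> "$out"
    cat "$f" >> "$out"
  done
}

# Example: combine_pages story.txt page*.png.out.txt
```

    The header lines make it easy to trace a passage back to its page image if the OCR output needs checking.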

    I hope it is helpful for you.


     
    • Nana111 3:29 am on December 25, 2013

      Hi there,
      Thanks for sharing. It is really helpful for me. I am looking for a method for extracting pages from PDF files. I have tried to do that using this PDF program:
      http://www.rasteredge.com/how-to/csharp-imaging/pdf-extract-pages/
      But it does not work on my computer. I don’t know why.
      Thanks for your answers. I would like to know if there is a free trial of your program. Thanks a lot.

    • tuxdna 1:37 pm on January 2, 2014

      @Nana111 All the tools I mentioned in this post are free. You can try them on your own. For this you would specifically need Fedora ( a GNU/Linux distribution ) which you can download and install from here: https://fedoraproject.org/get-fedora

      Once you do that you can install the software as I explained above. Let me know where you get stuck with it, if at all.
