A simple Scala parser for the 44GB Wikipedia XML dump

I had to parse a Wikipedia XML dump (a 44GB XML file uncompressed). The XML dump is available here, and I have also created a smaller sample file to run this code against: sample wiki.xml file.
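For context, every article in the dump is wrapped in a <page> element; trimmed down, each one looks roughly like this (the exact set of child elements follows the MediaWiki export schema, and the values here are placeholders):

<page>
  <title>...</title>
  <ns>0</ns>
  <id>...</id>
  <revision>
    <id>...</id>
    <text xml:space="preserve">...wikitext...</text>
  </revision>
</page>

The parser only needs to know where each page starts and ends, plus the first <id> directly under <page> to name the output file.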

Below is the event-based XML parser using Scala’s XMLEventReader:

package xml

import scala.io.Source
import scala.xml.pull._
import scala.collection.mutable.ArrayBuffer
import java.io.File
import java.io.FileOutputStream
import scala.xml.XML

object wikipedia extends App {

  val xmlFile = args(0)
  val outputLocation = new File(args(1))

  // Stream the dump one event at a time instead of loading 44GB into memory
  val xml = new XMLEventReader(Source.fromFile(xmlFile))

  var insidePage = false
  val buf = ArrayBuffer[String]()
  for (event <- xml) {
    event match {
      case EvElemStart(_, "page", _, _) =>
        insidePage = true
        buf += "<page>"
      case EvElemEnd(_, "page") =>
        buf += "</page>"
        insidePage = false
        writePage(buf)
        buf.clear()
      // NOTE: attributes are dropped when tags are re-emitted
      case EvElemStart(_, tag, _, _) if insidePage =>
        buf += ("<" + tag + ">")
      case EvElemEnd(_, tag) if insidePage =>
        buf += ("</" + tag + ">")
      case EvText(t) if insidePage =>
        buf += t
      case EvEntityRef(entity) if insidePage =>
        // re-emit entity references (e.g. &amp;) so the page stays valid XML
        buf += ("&" + entity + ";")
      case _ => // ignore everything outside <page> elements
    }
  }

  // Reassemble the buffered events into a document and write it out,
  // named after the first <id> directly under <page>
  def writePage(buf: ArrayBuffer[String]): Unit = {
    val s = buf.mkString
    val x = XML.loadString(s)
    val pageId = (x \ "id")(0).text
    val f = new File(outputLocation, pageId + ".xml")
    println("writing to: " + f.getAbsolutePath())
    val out = new FileOutputStream(f)
    out.write(s.getBytes())
    out.close()
  }

}

Find this code snippet on Github
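As a quick sanity check, here is a minimal sketch that loads one of the generated per-page files back in and pulls out a couple of fields. The file name wiki-pages/12.xml is hypothetical; substitute any page id the run actually produced:

import scala.xml.XML

object readPage extends App {
  // load a single extracted page (the file name here is hypothetical)
  val page = XML.loadFile("wiki-pages/12.xml")
  val title = (page \ "title").text
  val wikitext = (page \ "revision" \ "text").text
  println(title + " (" + wikitext.length + " chars of wikitext)")
}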

Let’s see how long it takes to process all the Wikipedia pages in the 44GB XML dump.

It took roughly 7 hours 30 minutes, which works out to about 1.7MB of input processed per second. That’s not bad:

$ time sbt "run-main xml.wikipedia enwiki-20140102-pages-articles-multistream.xml wiki-pages/"

[success] Total time: 26918 s, completed Feb 4, 2014 9:56:38 AM

real	448m41.888s
user	82m47.594s
sys	192m46.238s

And it generated 14,128,976 XML files:

$ ls wiki-pages/ | wc -l
14128976
$ du -sh wiki-pages/ 
80G	wiki-pages/

As you can see, the 44GB uncompressed XML file got split up into 80GB of total storage across the separate page files. With over 14 million files averaging only about 3KB of content each, most of that overhead likely comes from the filesystem rounding every small file up to a full allocation block. That’s something to be worked on; one possible approach is sketched below.
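A hedged sketch of that idea: instead of one file per page, hash each page id into a bounded number of bucket files so millions of tiny files don’t each get rounded up to a full block. The bucketWriter object, its file naming, and the bucket count are all hypothetical, and this assumes downstream code can read files containing many concatenated <page> elements:

import java.io.{File, FileWriter}

object bucketWriter {
  val outputLocation = new File("wiki-pages") // assumed output directory

  // Append the page's XML to one of `buckets` shared files chosen by
  // hashing the page id, rather than creating a new file per page.
  def appendPage(pageId: String, pageXml: String, buckets: Int = 1024): Unit = {
    val bucket = (pageId.hashCode & Int.MaxValue) % buckets // non-negative bucket index
    val out = new FileWriter(new File(outputLocation, "bucket-" + bucket + ".xml"), true) // true = append
    try { out.write(pageXml); out.write('\n') } finally out.close()
  }
}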

References:

First steps with Scala: XML pull parsing

Scala finding elements in big (30MB) xml files
