Creating XML for each of the pages using xml.wikipedia

Saleem Ansari edited this page May 21, 2014 · 3 revisions

I had to parse a Wikipedia XML dump (a 44GB XML file when uncompressed). The XML dump is available here. I have also created a smaller sample file for running this code: the sample wiki.xml file.

Things to note:

  • This code does not require anything (such as a Hadoop cluster) other than Scala, SBT, and the Wikipedia XML dump in uncompressed format.
  • The output is written to the directory given as the second argument (see below).
  • Each Wikipedia article is stored in a separate file.
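
To make the one-file-per-article idea concrete, here is a minimal sketch of a streaming splitter. It is not the xml.wikipedia code from this repository: it uses the JDK's StAX event API (javax.xml.stream) as a stand-in, and the names SplitPagesSketch and split are hypothetical. It assumes that each article is a <page> element, that the first <id> directly under <page> is the page id, and that output files are named after that id (which is what the file names in the sample run below suggest).

```scala
import java.io.{File, InputStream, PrintWriter, StringWriter}
import javax.xml.stream.{XMLEventWriter, XMLInputFactory, XMLOutputFactory}

// Hypothetical sketch, not the repository's xml.wikipedia implementation.
object SplitPagesSketch {

  /** Streams `in` and writes every <page> element to `<outDir>/<page id>.xml`.
    * Only one page is buffered in memory at a time.
    * Returns the number of pages written. */
  def split(in: InputStream, outDir: File): Int = {
    outDir.mkdirs()
    val events = XMLInputFactory.newInstance().createXMLEventReader(in)
    val outputFactory = XMLOutputFactory.newInstance()

    var buffer: StringWriter = null   // serialized form of the current page
    var writer: XMLEventWriter = null // null while we are outside any <page>
    var depth = 0                     // element nesting depth inside the current <page>
    var inPageId = false              // true while inside the page-level <id>
    val pageId = new StringBuilder
    var count = 0

    while (events.hasNext) {
      val ev = events.nextEvent()
      if (writer == null) {
        // Outside a page: ignore everything until the next <page> starts.
        if (ev.isStartElement && ev.asStartElement.getName.getLocalPart == "page") {
          buffer = new StringWriter
          writer = outputFactory.createXMLEventWriter(buffer)
          writer.add(ev)
          depth = 1
          pageId.clear()
        }
      } else {
        writer.add(ev) // copy the event into the current page's buffer
        if (ev.isStartElement) {
          depth += 1
          // Assumption: the first <id> directly under <page> is the page id;
          // revision and contributor ids are nested deeper and are skipped.
          inPageId = depth == 2 && pageId.isEmpty &&
            ev.asStartElement.getName.getLocalPart == "id"
        } else if (ev.isCharacters && inPageId) {
          pageId.append(ev.asCharacters.getData)
        } else if (ev.isEndElement) {
          inPageId = false
          depth -= 1
          if (depth == 0) { // reached the matching </page>
            writer.flush()
            writer.close()
            val out = new PrintWriter(new File(outDir, pageId.toString.trim + ".xml"), "UTF-8")
            try out.write(buffer.toString) finally out.close()
            writer = null
            count += 1
          }
        }
      }
    }
    events.close()
    count
  }
}
```

With the sample file, a call such as SplitPagesSketch.split(new java.io.FileInputStream("wiki.xml"), new File("output-pages")) would be the equivalent of the run shown below. Because the input is read as a stream of events, the whole dump never has to fit in memory.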

An example with wiki.xml (the sample file mentioned above):

$ time sbt "run-main xml.wikipedia wiki.xml output-pages"
[info] Loading project definition from /home/saleem/projects/scala/scala-snippets/project
[info] Set current project to scala-snippets (in build file:/home/saleem/projects/scala/scala-snippets/)
[info] Compiling 1 Scala source to /home/saleem/projects/scala/scala-snippets/target/scala-2.10/classes...
[info] Running xml.wikipedia wiki.xml output-pages
Creating output directory: /home/saleem/projects/scala/scala-snippets/output-pages
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/10.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/12.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/13.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/14.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/15.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/18.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/19.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/20.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/21.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/23.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/24.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/25.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/27.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/29.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/30.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/35.xml
[success] Total time: 7 s, completed 21 May, 2014 4:13:28 PM

real	0m10.135s
user	0m18.763s
sys	0m0.593s

Let's see how long it takes to process all the Wikipedia pages in the 44GB XML dump.

It took roughly 7 hours 30 minutes. That's not bad:

$ time sbt "run-main xml.wikipedia enwiki-20140102-pages-articles-multistream.xml wiki-pages"
[success] Total time: 26918 s, completed Feb 4, 2014 9:56:38 AM
real	448m41.888s
user	82m47.594s
sys	192m46.238s

And it generated 14128976 XML files:

$ ls wiki-pages/ | wc -l
14128976
$ du -sh wiki-pages/
80G wiki-pages/

As you can see, the 44GB uncompressed XML file was split into separate pages that take up 80GB of storage in total. That's something to be worked on.
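
One possible contributor to the growth is block rounding: du reports allocated disk space, and on many filesystems every non-empty file occupies a whole number of blocks, so millions of small files can take noticeably more space than their contents. The sketch below is one way to check how much of the 80GB that explains. The names DiskOverheadSketch, allocated and totals are hypothetical, and the 4096-byte block size is an assumption (the real value depends on the filesystem, and some filesystems store very small files differently).

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper for estimating block-rounding overhead.
object DiskOverheadSketch {

  /** Bytes occupied by a file of `length` bytes if space is allocated in
    * whole blocks of `blockSize` bytes (assumed 4096 by default). */
  def allocated(length: Long, blockSize: Long = 4096L): Long =
    ((length + blockSize - 1) / blockSize) * blockSize

  /** Returns (sum of file sizes, sum of block-rounded sizes) for the
    * regular files directly inside `dir`. The directory is streamed, so
    * millions of entries are not loaded into memory at once. */
  def totals(dir: Path, blockSize: Long = 4096L): (Long, Long) = {
    var apparent = 0L
    var onDisk = 0L
    val stream = Files.newDirectoryStream(dir)
    try {
      val it = stream.iterator()
      while (it.hasNext) {
        val p = it.next()
        if (Files.isRegularFile(p)) {
          val len = Files.size(p)
          apparent += len
          onDisk += allocated(len, blockSize)
        }
      }
    } finally stream.close()
    (apparent, onDisk)
  }
}
```

Calling DiskOverheadSketch.totals(java.nio.file.Paths.get("wiki-pages")) would return both totals. If the first number is close to 44GB and the second close to 80GB, most of the difference is rounding rather than extra content, and packing or compressing the pages would be the natural next step.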