Creating XML for each of the pages using xml.wikipedia
Saleem Ansari edited this page May 21, 2014
I had to parse a Wikipedia XML dump (a 44 GB XML file, uncompressed). The XML dump is available here. I have also created a smaller sample file to run this code -- sample wiki.xml file.
Things to note:
- This code doesn't require anything (such as a Hadoop cluster) other than Scala, SBT, and the Wikipedia XML dump in uncompressed format.
- The output is generated in a directory that is provided as the second argument (see below).
- Each Wikipedia article is stored in a separate file.
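The xml.wikipedia source itself isn't reproduced on this page, but the core idea is streaming: the 44 GB file is never loaded whole; a pull parser walks the `<page>` elements one at a time. Here is a minimal sketch of that approach using the JDK's built-in StAX parser -- note that `WikiPages`, `pageIndex`, and the simplified dump structure are illustrative, not the project's actual code:

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants => C}

object WikiPages {
  /** Stream over a MediaWiki dump and return (pageId, title) for each
    * <page>, without ever holding the whole document in memory. */
  def pageIndex(xml: String): List[(String, String)] = {
    val r = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    var pages = List.empty[(String, String)]
    var title: String = null
    var id: String = null
    var depth = 0 // mediawiki = 1, page = 2, page's title/id = 3
    while (r.hasNext) {
      r.next() match {
        case C.START_ELEMENT =>
          depth += 1
          r.getLocalName match {
            // getElementText consumes the matching end tag, so undo the
            // depth increment for these leaf elements.
            case "title" if depth == 3               => title = r.getElementText; depth -= 1
            case "id"    if depth == 3 && id == null => id = r.getElementText; depth -= 1
            case _                                   => ()
          }
        case C.END_ELEMENT =>
          if (r.getLocalName == "page") { pages = (id, title) :: pages; id = null; title = null }
          depth -= 1
        case _ => () // character data, whitespace, comments
      }
    }
    pages.reverse
  }

  def main(args: Array[String]): Unit = {
    val sample =
      "<mediawiki><page><title>Accessibility</title><id>10</id>" +
      "<revision><id>99</id><text>...</text></revision></page></mediawiki>"
    println(pageIndex(sample)) // List((10,Accessibility))
  }
}
```

Writing each full page to its own file, as the program on this page does, is the same loop with one addition: buffer the raw events between `<page>` and `</page>` and flush them to `<id>.xml` on the closing tag.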
An example with wiki.xml (the sample file mentioned above):
$ time sbt "run-main xml.wikipedia wiki.xml output-pages"
[info] Loading project definition from /home/saleem/projects/scala/scala-snippets/project
[info] Set current project to scala-snippets (in build file:/home/saleem/projects/scala/scala-snippets/)
[info] Compiling 1 Scala source to /home/saleem/projects/scala/scala-snippets/target/scala-2.10/classes...
[info] Running xml.wikipedia wiki.xml output-pages
Creating output directory: /home/saleem/projects/scala/scala-snippets/output-pages
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/10.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/12.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/13.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/14.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/15.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/18.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/19.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/20.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/21.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/23.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/24.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/25.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/27.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/29.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/30.xml
writing to: /home/saleem/projects/scala/scala-snippets/output-pages/35.xml
[success] Total time: 7 s, completed 21 May, 2014 4:13:28 PM
real 0m10.135s
user 0m18.763s
sys 0m0.593s
Let's see how long it takes to process all the Wikipedia pages in the 44 GB XML dump.
It took roughly 7 hours 30 minutes. That's not bad:
$ time sbt "run-main xml.wikipedia enwiki-20140102-pages-articles-multistream.xml wiki-pages"
[success] Total time: 26918 s, completed Feb 4, 2014 9:56:38 AM
real 448m41.888s
user 82m47.594s
sys 192m46.238s
And it generated 14128976 XML files:
$ ls wiki-pages/ | wc -l
14128976
$ du -sh wiki-pages/
80G wiki-pages/
As you can see, the 44 GB uncompressed XML file got split into 80 GB of total storage across the separate page files -- most likely because each of the ~14 million small files occupies at least one filesystem block, so a lot of space is lost to block rounding. That's something to be worked on.
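A back-of-the-envelope check makes the blow-up plausible. Assuming a typical 4 KiB filesystem block size (an assumption -- the actual block size isn't measured on this page), the average page is smaller than one block, so block rounding alone accounts for tens of gigabytes:

```scala
object BlockOverhead {
  def main(args: Array[String]): Unit = {
    val files     = 14128976L // page files reported by `ls | wc -l` above
    val blockSize = 4096L     // assumed ext4 block size, not measured
    val dumpGiB   = 44.0      // uncompressed dump size

    // Average bytes per page in the original dump.
    val avgBytes = dumpGiB * 1024 * 1024 * 1024 / files
    // Every file occupies at least one block, so block rounding alone costs:
    val floorGiB = files * blockSize / math.pow(2, 30)

    println(f"average page: $avgBytes%.0f bytes, one-block floor: $floorGiB%.1f GiB")
    // prints: average page: 3344 bytes, one-block floor: 53.9 GiB
  }
}
```

On top of that per-block floor come inodes, directory entries for 14 million names in one directory, and pages larger than one block, which together get close to the observed 80 GB.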