
Spark And Python Trying To Parse Wikipedia Using Gensim

Based on my previous question, Spark and Python use custom file format/generator as input for RDD, I think that I should be able to parse basically any input with sc.textFile() and the

Solution 1:

I usually work with Spark in Scala. Nevertheless, here are my thoughts:

When you load a file via sc.textFile, you get a collection of lines that is distributed across your Spark workers. Given the XML format of the Wikipedia dump, one line does not necessarily correspond to a parsable XML item, and that is why you are getting this problem.

i.e:

Line 1:  <item>
Line 2:  <title> blabla </title><subitem>
Line 3:  </subitem>
Line 4:  </item>

If you try to parse each line on its own, it will spit out exceptions like the ones you got.
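You can reproduce this failure mode without Spark at all. Below is a minimal sketch (using Python's standard xml.etree module, and a made-up four-line fragment like the one above) showing that the record parses fine as a whole but that every individual line raises a parse error:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record spread over several lines, as in a wiki dump.
lines = [
    "<item>",
    "<title> blabla </title><subitem>",
    "</subitem>",
    "</item>",
]

# Parsing the whole record at once works:
record = ET.fromstring("".join(lines))
print(record.tag)  # -> item

# Parsing each line on its own fails, because no single line
# is a complete, well-formed XML document:
failures = 0
for line in lines:
    try:
        ET.fromstring(line)
    except ET.ParseError:
        failures += 1
print(failures)  # -> 4 (every line fails)
```

This is exactly what happens inside a map over sc.textFile: each worker sees isolated lines, not whole records.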

I often have to mess around with Wikipedia dumps, so the first thing I do is transform the dump into a "readable version" which is easily digested by Spark, i.e. one line per article entry. Once you have it in that shape you can easily feed it into Spark and do all kinds of processing. The transformation doesn't take many resources.
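The preprocessing step can be sketched with Python's streaming XML parser. This is a simplified illustration on a toy stand-in for a dump (real MediaWiki exports use namespaced tags and a richer schema, so the tag names here are assumptions), but the idea is the same: stream over page elements and emit one line per article.

```python
import io
import xml.etree.ElementTree as ET

# Toy stand-in for a wiki dump. A real dump is a multi-GB file with
# namespaced tags like {http://www.mediawiki.org/xml/export-0.10/}page.
dump = io.BytesIO(b"""<mediawiki>
  <page><title>A</title><text>first article</text></page>
  <page><title>B</title><text>second
article</text></page>
</mediawiki>""")

out_lines = []
# iterparse streams the file instead of loading it all into memory.
for event, elem in ET.iterparse(dump, events=("end",)):
    if elem.tag == "page":
        title = elem.findtext("title")
        text = elem.findtext("text") or ""
        # Collapse internal newlines so each article is exactly one line.
        out_lines.append(title + "\t" + " ".join(text.split()))
        elem.clear()  # release the element; essential for large dumps

for line in out_lines:
    print(line)
```

Once written out, a file in this shape loads cleanly with sc.textFile, and something like .map(lambda l: l.split("\t", 1)) gives you (title, text) records to process.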

Take a look at ReadableWiki: https://github.com/idio/wiki2vec
