
Spark And Python Trying To Parse Wikipedia Using Gensim

Based on my previous question, Spark and Python use custom file format/generator as input for RDD, I think that I should be able to parse basically any input with sc.textFile() and the

Solution 1:

I usually work with Spark in Scala. Nevertheless, here are my thoughts:

When you load a file via sc.textFile, you get a collection of lines that is distributed across your Spark workers. Given the XML format of the Wikipedia dump, one line does not necessarily correspond to a parsable XML item, and that is why you are getting this problem.

i.e:

Line 1:  <item>
Line 2:  <title> blabla </title><subitem>
Line 3:  </subitem>
Line 4:  </item>

If you try to parse each line on its own, it will spit out exceptions like the ones you got.
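You can reproduce this failure mode without Spark at all. Below is a minimal sketch (using Python's standard xml.etree module, and a made-up four-line fragment like the one above) showing that the record parses fine as a whole but that every individual line raises a parse error:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record spread over several lines, as in a wiki dump.
lines = [
    "<item>",
    "<title> blabla </title><subitem>",
    "</subitem>",
    "</item>",
]

# Parsing the whole record at once works:
record = ET.fromstring("".join(lines))
print(record.tag)  # -> item

# Parsing each line on its own fails, because no single line
# is a complete, well-formed XML document:
failures = 0
for line in lines:
    try:
        ET.fromstring(line)
    except ET.ParseError:
        failures += 1
print(failures)  # -> 4 (every line fails)
```

This is exactly what happens inside a map over sc.textFile: each worker sees isolated lines, not whole records.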

I often have to mess around with Wikipedia dumps, so the first thing I do is transform the dump into a "readable version" which is easily digested by Spark, i.e. one line per article entry. Once you have it in that shape you can easily feed it into Spark and do all kinds of processing. The transformation doesn't take many resources.
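The preprocessing step can be sketched with Python's streaming XML parser. This is a simplified illustration on a toy stand-in for a dump (real MediaWiki exports use namespaced tags and a richer schema, so the tag names here are assumptions), but the idea is the same: stream over page elements and emit one line per article.

```python
import io
import xml.etree.ElementTree as ET

# Toy stand-in for a wiki dump. A real dump is a multi-GB file with
# namespaced tags like {http://www.mediawiki.org/xml/export-0.10/}page.
dump = io.BytesIO(b"""<mediawiki>
  <page><title>A</title><text>first article</text></page>
  <page><title>B</title><text>second
article</text></page>
</mediawiki>""")

out_lines = []
# iterparse streams the file instead of loading it all into memory.
for event, elem in ET.iterparse(dump, events=("end",)):
    if elem.tag == "page":
        title = elem.findtext("title")
        text = elem.findtext("text") or ""
        # Collapse internal newlines so each article is exactly one line.
        out_lines.append(title + "\t" + " ".join(text.split()))
        elem.clear()  # release the element; essential for large dumps

for line in out_lines:
    print(line)
```

Once written out, a file in this shape loads cleanly with sc.textFile, and something like .map(lambda l: l.split("\t", 1)) gives you (title, text) records to process.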

Take a look at ReadableWiki: https://github.com/idio/wiki2vec
