Spark And Python Trying To Parse Wikipedia Using Gensim
Solution 1:
I usually work with Spark in Scala; nevertheless, here are my thoughts:
When you load a file via sc.textFile, Spark treats it as a collection of lines distributed across your workers. Given Wikipedia's XML format, a single line does not necessarily correspond to a complete, parsable XML element, which is why you are getting this problem.
i.e.:
Line 1: <item>
Line 2: <title> blabla </title><subitem>
Line 3: </subitem>
Line 4: </item>
If you try to parse each line on its own, it will spit out exceptions like the ones you got.
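A minimal sketch of this failure mode, using a hypothetical multi-line record (the real Wikipedia dump uses <page> elements, but the principle is the same): parsing each line in isolation fails, while parsing the whole record succeeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical record spanning several lines, like the example above.
record = """<item>
<title> blabla </title><subitem>
</subitem>
</item>"""

# Parsing line by line, as a naive per-line Spark map would do:
failures = 0
for line in record.splitlines():
    try:
        ET.fromstring(line)
    except ET.ParseError:
        failures += 1  # e.g. "<item>" alone is not well-formed XML

# Parsing the complete record at once works fine:
root = ET.fromstring(record)
print(failures, root.tag)
```

Every one of the four lines fails on its own (unclosed tag, junk after the document element, or a stray closing tag), while the full record parses cleanly.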
I usually have to mess around with Wikipedia dumps, so the first thing I do is transform the dump into a "readable version" that is easily digested by Spark, i.e. one line per article entry. Once you have it in that shape you can easily feed it into Spark and do all kinds of processing. The transformation itself doesn't take many resources.
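A rough sketch of that preprocessing step, assuming a tiny in-memory dump; the file layout, tag names, and tab-separated output format here are illustrative (the real dump is namespaced and far too large to parse in one piece):

```python
import re
import xml.etree.ElementTree as ET

# Toy stand-in for a Wikipedia XML dump with multi-line article bodies.
dump = """<mediawiki>
<page><title>A</title><revision><text>first
article body</text></revision></page>
<page><title>B</title><revision><text>second
article body</text></revision></page>
</mediawiki>"""

lines = []
root = ET.fromstring(dump)
for page in root.iter("page"):
    title = page.findtext("title")
    text = page.findtext("./revision/text") or ""
    # Collapse internal whitespace so each article occupies exactly one line.
    lines.append(title + "\t" + re.sub(r"\s+", " ", text))

for line in lines:
    print(line)
```

Written to a file, each entry becomes one record for sc.textFile, so a map over the RDD sees one whole article at a time. For a real dump you would stream instead of calling fromstring on the whole file (e.g. ET.iterparse, or gensim's WikiCorpus, which already knows the dump format).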
Take a look at ReadableWiki: https://github.com/idio/wiki2vec