
Parse Large XML With lxml

I am trying to get my script working. So far it hasn't managed to output anything. This is my test.xml [...] tag present, because that only applies to tags without a namespace.

You are instead looking for the '{http://www.mediawiki.org/xml/export-0.8/}page' element, which contains a '{http://www.mediawiki.org/xml/export-0.8/}ns' element.
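
To see why a bare 'page' never matches, it can help to print the tag names lxml actually reports. The following is a minimal sketch, assuming a tiny inline document that stands in for test.xml (the sample content is made up for illustration; only the namespace URI comes from the answer above):

```python
from lxml import etree

# Hypothetical stand-in for test.xml; not the asker's real file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>MediaWiki:Category</title>
    <ns>0</ns>
  </page>
</mediawiki>"""

root = etree.fromstring(SAMPLE)
page = root[0]  # first child element of <mediawiki>

# lxml reports namespaced tags in "{uri}localname" form.
print(page.tag)

# etree.QName splits the qualified name into its two parts.
qname = etree.QName(page)
print(qname.namespace, qname.localname)

# A bare name only matches elements without a namespace, so it finds nothing here.
bare = root.find('page')
# The fully qualified name matches the element.
qualified = root.find('{http://www.mediawiki.org/xml/export-0.8/}page')
print(bare is None, qualified is not None)
```

The default namespace declared on the root element applies to every descendant, which is why each tag carries the URI even though the file never writes a prefix.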

Many lxml methods do let you specify a namespace map to make matching easier, but the iterparse() method is not one of them, unfortunately.
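
For the methods that do accept a namespace map, such as find(), findtext() and xpath(), the map is passed through the namespaces keyword. A minimal sketch, assuming the same kind of made-up sample document and an arbitrary prefix 'mw' chosen only for this example:

```python
from lxml import etree

# Hypothetical stand-in for test.xml; not the asker's real file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>MediaWiki:Category</title>
    <ns>0</ns>
  </page>
</mediawiki>"""

# The prefix is arbitrary; only the URI has to match the document.
NSMAP = {'mw': 'http://www.mediawiki.org/xml/export-0.8/'}

root = etree.fromstring(SAMPLE)

# find() and findtext() resolve the prefix through the namespaces mapping.
page = root.find('mw:page', namespaces=NSMAP)
title = page.findtext('mw:title', namespaces=NSMAP)

# xpath() takes the same mapping and returns a list of matching text nodes.
ns_values = page.xpath('mw:ns/text()', namespaces=NSMAP)

print(title, ns_values)
```

This only helps once you already hold an element; the tag filter on iterparse() still needs the full '{uri}name' form.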

The following .iterparse() call certainly processes the right page tags:

context = etree.iterparse('test.xml', events=('end',), tag='{http://www.mediawiki.org/xml/export-0.8/}page')

but you'll need to use .find() to get the ns and title tags on the page element, or use xpath() calls to get the text directly:

def process_element(elem):
    # True when the page's ns child has a text value numerically equal to 0
    if elem.xpath("./*[local-name()='ns']/text()=0"):
        print(elem.xpath("./*[local-name()='title']/text()")[0])

which, for your input example, prints:

>>> fast_iter(context, process_element)
MediaWiki:Category
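
For completeness, here is a self-contained sketch of the .find()-style alternative mentioned above. The fast_iter() helper is not shown in this answer, so the version below is an assumption based on the commonly used clear-as-you-go pattern. The sample document, the NS constant and the titles list are likewise made up for illustration, and io.BytesIO stands in for the real test.xml file:

```python
import io
from lxml import etree

# Namespace prefix in "{uri}" form, used to build qualified tag names.
NS = '{http://www.mediawiki.org/xml/export-0.8/}'

# Hypothetical stand-in for test.xml with one main-namespace page and one other.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>MediaWiki:Category</title>
    <ns>0</ns>
  </page>
  <page>
    <title>Talk:Example</title>
    <ns>1</ns>
  </page>
</mediawiki>"""

titles = []


def fast_iter(context, func):
    """Call func on each element, then free it to keep memory use flat."""
    for _event, elem in context:
        func(elem)
        # Drop the element's children and text once it has been processed.
        elem.clear()
        # Remove already-processed siblings still referenced by the parent.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def process_element(elem):
    # findtext() with a fully qualified name returns the child's text or None.
    if elem.findtext(NS + 'ns') == '0':
        titles.append(elem.findtext(NS + 'title'))


context = etree.iterparse(io.BytesIO(SAMPLE), events=('end',), tag=NS + 'page')
fast_iter(context, process_element)
print(titles)
```

Comparing findtext() against the string '0' is an exact text match, whereas the XPath expression above compares numerically; for well-formed export files the two should agree.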
