Extracting .docx Data, Images And Structure

Good day SO, I have a task where I need to extract specific parts of a document template (For automation purposes). While I am able to traverse, and know the current position, of t

Solution 1:

To extract the paragraph and heading structure, you can use the built-in objects in python-docx:

from docx import Document

document = Document('demo.docx')
text = []   # text of each non-empty paragraph
style = []  # style name of the same paragraph, e.g. 'Heading 1'
for x in document.paragraphs:
    if x.text != '':
        style.append(x.style.name)
        text.append(x.text)

x.style.name gives you the name of the style applied to each paragraph (for example 'Heading 1' or 'Normal'), so you can tell headings from body text.
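
Once you have the two lists, you can rebuild a rough outline by grouping each paragraph under the most recent heading. This is only a sketch: group_by_heading is a made-up helper, the sample lists are invented, and it assumes the template uses the built-in 'Title' and 'Heading N' styles rather than custom ones.

```python
def group_by_heading(styles, texts):
    """Group paragraph texts under the most recent heading paragraph."""
    sections = []
    current = {"heading": None, "body": []}
    for style_name, paragraph_text in zip(styles, texts):
        # Assumption: headings use the built-in 'Title' / 'Heading N' styles.
        if style_name == "Title" or style_name.startswith("Heading"):
            if current["heading"] is not None or current["body"]:
                sections.append(current)
            current = {"heading": paragraph_text, "body": []}
        else:
            current["body"].append(paragraph_text)
    # Keep the last open section, unless nothing was collected at all.
    if current["heading"] is not None or current["body"]:
        sections.append(current)
    return sections


# Invented sample data standing in for the `style` and `text` lists above.
sample_style = ["Heading 1", "Normal", "Normal", "Heading 2", "Normal"]
sample_text = ["Intro", "First para", "Second para", "Details", "Third para"]
sections = group_by_heading(sample_style, sample_text)
```

Text that appears before the first heading ends up in a section whose heading is None.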

The paragraph objects in python-docx don't tell you where images sit in the document. For that, you need to parse the XML. To see which elements the XML contains, print their tags:

# iter() walks every element in document order (getiterator() is deprecated)
for elem in document.element.iter():
    print(elem.tag)
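
The tags come out in the long {namespace}name form, so the raw printout is hard to scan. A small sketch that counts tags by their local name may help; local_name and summarize_tags are made-up helpers, and the XML fragment is hand-written so the example runs with the standard library alone. With a real file you would pass document.element.iter() instead.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix from a qualified tag name."""
    return tag.rsplit("}", 1)[-1]


def summarize_tags(elements):
    """Count elements by local tag name, skipping comments and the like."""
    return Counter(local_name(e.tag) for e in elements if isinstance(e.tag, str))


W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Hand-written stand-in for a document body: one text run, one drawing.
sample = ET.fromstring(
    f'<w:body xmlns:w="{W}">'
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "<w:p><w:r><w:drawing/></w:r></w:p>"
    "</w:body>"
)
counts = summarize_tags(sample.iter())
```

A non-zero count for drawing is a quick sign that the document contains images worth looking for.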

Let me know if you need anything else.

To extract each image's name and its position relative to the text, use this:

# Namespace prefixes as they appear in element tags
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PIC = '{http://schemas.openxmlformats.org/drawingml/2006/picture}'

tags = []
text = []
for t in document.element.iter():
    if t.tag == PIC + 'cNvPr':
        # Non-visual picture properties; 'name' holds the picture's name
        print('Picture Found: ', t.attrib['name'])
        tags.append('Picture')
        text.append(t.attrib['name'])
    elif t.tag in (W + 'r', W + 't', W + 'drawing') and t.text:
        tags.append('text')
        text.append(t.text)

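Because both lists are in document order, you can find a picture's position by looking at the entries before and after it in the text list, and check what each entry is in the tags list.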
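
A minimal sketch of that lookup, assuming the two lists are built as above ('Picture' entries hold the picture name, 'text' entries hold a piece of text). picture_context is a made-up helper and the sample lists are invented.

```python
def picture_context(tags, text):
    """For each picture, return its name plus the nearest text before and after."""
    results = []
    for i, tag in enumerate(tags):
        if tag != "Picture":
            continue
        # Walk backwards, then forwards, to the closest 'text' entry.
        before = next(
            (text[j] for j in range(i - 1, -1, -1) if tags[j] == "text"), None
        )
        after = next(
            (text[j] for j in range(i + 1, len(tags)) if tags[j] == "text"), None
        )
        results.append({"name": text[i], "before": before, "after": after})
    return results


# Invented sample data in the same shape as the `tags` and `text` lists.
sample_tags = ["text", "Picture", "text", "Picture"]
sample_text = ["The logo is shown here.", "Picture 1", "Next section", "Picture 2"]
context = picture_context(sample_tags, sample_text)
```

A picture at the very start or end of the document gets None for the missing neighbour.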
