Extracting .docx Data, Images And Structure
Good day SO, I have a task where I need to extract specific parts of a document template (For automation purposes). While I am able to traverse, and know the current position, of t
Solution 1:
To extract the structure of paragraphs and headings, you can use the built-in objects in python-docx. For example:
from docx import Document

document = Document('demo.docx')
text = []
style = []
for x in document.paragraphs:
    # Skip empty paragraphs; record the style name and text of the rest.
    if x.text != '':
        style.append(x.style.name)
        text.append(x.text)
x.style.name gives you the name of the style applied to each paragraph, so you can tell headings from body text.
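As a rough sketch of how the style and text lists could be turned into a document outline, the helper below groups body paragraphs under the most recent heading. The function name build_outline is hypothetical, and it assumes the document uses Word's default English style names ("Title", "Heading 1", "Heading 2", and so on); a custom template may use different names.

```python
import re

def build_outline(styles, texts):
    """Group body paragraphs under the most recent heading.

    Assumption: headings use the default style names "Title" or
    "Heading N". Any other style name is treated as body text.
    """
    outline = []
    current = {"heading": None, "level": 0, "body": []}
    for style_name, paragraph_text in zip(styles, texts):
        match = re.fullmatch(r"Heading (\d+)", style_name)
        if match or style_name == "Title":
            # Close the previous section before starting a new one.
            if current["heading"] is not None or current["body"]:
                outline.append(current)
            level = int(match.group(1)) if match else 0
            current = {"heading": paragraph_text, "level": level, "body": []}
        else:
            current["body"].append(paragraph_text)
    # Keep the last section if it holds anything.
    if current["heading"] is not None or current["body"]:
        outline.append(current)
    return outline
```

Calling build_outline(style, text) with the two lists collected above returns one dictionary per section, each with its heading text, heading level, and body paragraphs.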
python-docx's paragraph objects don't tell you where images sit in the document. For that, you need to parse the XML. Check the XML output with:
# iter() replaces the deprecated getiterator()
for elem in document.element.iter():
    print(elem.tag)
Let me know if you need anything else.
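If you also need the image files themselves, one option that does not depend on python-docx is to read the .docx as a ZIP archive, since embedded images are stored under word/media/ inside it. This is a minimal sketch using only the standard library; list_embedded_images is a hypothetical helper name, and the picture names found in the XML may differ from the file names in the archive.

```python
import zipfile

def list_embedded_images(docx_path):
    """Return {archive name: size in bytes} for files under word/media/.

    docx_path may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return {
            info.filename: info.file_size
            for info in archive.infolist()
            if info.filename.startswith("word/media/") and not info.is_dir()
        }
```

For example, list_embedded_images('demo.docx') lists the media files in the same demo.docx used above; archive.read(name) inside the with block would return the raw bytes of one of them.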
To extract each image's name and its position relative to the surrounding text, use this:
# Namespace prefixes for WordprocessingML and DrawingML picture elements.
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PIC = '{http://schemas.openxmlformats.org/drawingml/2006/picture}'

tags = []
text = []
# document is the Document object loaded earlier.
for t in document.element.iter():
    if t.tag in [W + 'r', W + 't', PIC + 'cNvPr', W + 'drawing']:
        if t.tag == PIC + 'cNvPr':
            # pic:cNvPr carries the picture's name attribute.
            print('Picture Found: ', t.attrib['name'])
            tags.append('Picture')
            text.append(t.attrib['name'])
        elif t.text:
            # Text content, in document order.
            tags.append('text')
            text.append(t.text)
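The tags and text lists are parallel and in document order, so for any picture you can look at the previous and next entries in text, and at their kinds in tags, to see what surrounds it.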
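One way to do that lookup, sketched under the assumption that tags and text were built as above (with 'Picture' and 'text' as the only tag values), is the hypothetical helper below. It returns the nearest text entry before and after each picture, or None when there is none.

```python
def picture_context(tags, text):
    """For each picture, return its name and the nearest text before and after it."""
    results = []
    for i, tag in enumerate(tags):
        if tag != "Picture":
            continue
        # Walk backwards to the closest preceding text entry, if any.
        before = next(
            (text[j] for j in range(i - 1, -1, -1) if tags[j] == "text"), None
        )
        # Walk forwards to the closest following text entry, if any.
        after = next(
            (text[j] for j in range(i + 1, len(tags)) if tags[j] == "text"), None
        )
        results.append({"name": text[i], "before": before, "after": after})
    return results
```

Because adjacent pictures are skipped when searching, two pictures in a row both report the same surrounding text.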