Reading .docx Files In Python To Find Strikethrough, Bullets And Other Formats
Can anyone help me identify, in Python using python-docx, whether a paragraph in a .docx file contains text that is formatted with strikethrough (i.e. it appears but is crossed out), or
Solution 1:
For strikethrough, you can just modify your example like so:
from docx import Document

document = Document(r'C:\stuff\Document.docx')
for p in document.paragraphs:
    for run in p.runs:
        # strike is True only when set directly on the run; None means inherited
        if run.font.strike:
            print("STRIKE: " + run.text)
See the API docs for the Font object for more fun stuff you can check.
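As a rough sketch of what "more fun stuff" could look like, the same loop can test several Font attributes at once. The helper names (`FONT_FLAGS`, `active_flags`, `report_formatting`) are made up for this example; the attribute names are the ones python-docx exposes on `run.font`. Note that these flags are tri-state (True, False, or None for "inherited"), so this only catches formatting applied directly to a run, not formatting that comes from a style.

```python
# Font attributes on run.font that python-docx exposes as tri-state flags:
# True, False, or None (None means the value is inherited from the style).
FONT_FLAGS = ("bold", "italic", "underline", "strike",
              "double_strike", "subscript", "superscript")


def active_flags(run):
    """Return the names of the flags that are switched on directly on this run."""
    return [name for name in FONT_FLAGS if getattr(run.font, name)]


def report_formatting(path):
    """Print every run in the document that has at least one flag switched on."""
    # Imported here so that active_flags() can be used without python-docx installed.
    from docx import Document

    for paragraph in Document(path).paragraphs:
        for run in paragraph.runs:
            flags = active_flags(run)
            if flags:
                print(", ".join(flags) + ": " + run.text)
```

Calling `report_formatting(r'C:\stuff\Document.docx')` would then print one line per formatted run, prefixed with the flags that were found.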
Solution 2:
You can use a native .docx parser such as python-docx, rather than converting the file to HTML and using an HTML parser. Per the python-docx docs:
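from docx import Document
from docx.enum.style import WD_STYLE_TYPE

document = Document(r'C:\stuff\Document.docx')
styles = document.styles
paragraph_styles = [
    s for s in styles if s.type == WD_STYLE_TYPE.PARAGRAPH
]
# This walks the styles defined in the document, not the paragraphs that use them
for style in paragraph_styles:
    if style.name == 'List Bullet':
        print("I'm a bullet")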
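To find out which paragraphs are actually bullets, one option is to look at each paragraph's own style rather than the document's style list. This is only a sketch: `is_bullet` and `bullet_paragraphs` are hypothetical helper names, and matching on the "List Bullet" prefix (which also covers built-in names such as "List Bullet 2") is an assumption about how the document was styled. Bullets applied through direct numbering on some other style will not be detected this way.

```python
def is_bullet(paragraph, prefix="List Bullet"):
    """Return True if the paragraph's style name starts with the given prefix."""
    style = paragraph.style
    return style is not None and style.name.startswith(prefix)


def bullet_paragraphs(path):
    """Return the text of every paragraph whose style looks like a bullet style."""
    # Imported here so that is_bullet() can be used without python-docx installed.
    from docx import Document

    return [p.text for p in Document(path).paragraphs if is_bullet(p)]
```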
Solution 3:
Following on from mkrieger1's suggestion, I would suggest using Pandoc to convert the .docx to .html and parsing the document from there.
Installing Pandoc takes about the same effort as installing python-docx, and the conversion from .docx to .html worked like a charm. In the .html, the structure of the document I am parsing and all of its formatting elements are clear and thus easy to work with.
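For illustration, here is a minimal sketch of that approach using only the standard library plus the pandoc command-line tool. It assumes pandoc is on the PATH and that it renders strikethrough as `<del>` and list items as `<li>` in its HTML output; `docx_to_html` and `FormatFinder` are names invented for this example.

```python
import subprocess
from html.parser import HTMLParser


def docx_to_html(path):
    """Convert a .docx file to an HTML string by calling the pandoc CLI."""
    result = subprocess.run(
        ["pandoc", "--from", "docx", "--to", "html", path],
        check=True, capture_output=True, encoding="utf-8",
    )
    return result.stdout


class FormatFinder(HTMLParser):
    """Collect text found inside <del> (strikethrough) and <li> (list item) tags."""

    def __init__(self):
        super().__init__()
        self.depth = {"del": 0, "li": 0}  # how many of each tag are currently open
        self.struck = []
        self.list_items = []

    def handle_starttag(self, tag, attrs):
        if tag in self.depth:
            self.depth[tag] += 1

    def handle_endtag(self, tag):
        if tag in self.depth and self.depth[tag] > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self.depth["del"]:
            self.struck.append(text)
        if self.depth["li"]:
            self.list_items.append(text)
```

To use it, create a `FormatFinder`, call `finder.feed(docx_to_html(r'C:\stuff\Document.docx'))`, then read `finder.struck` and `finder.list_items`.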