
Reading .docx Files In Python To Find Strikethrough, Bullets And Other Formats

Can anyone help me identify, in Python using python-docx, if a paragraph in a .docx file contains text that is formatted with strikethrough (i.e. it appears but is crossed out), or

Solution 1:

For strikethrough, you can just modify your example like so:

from docx import Document

document = Document(r'C:\stuff\Document.docx')
for p in document.paragraphs:
    for run in p.runs:
        # run.font.strike is True only when strikethrough is set directly on the run
        if run.font.strike:
            print("STRIKE: " + run.text)

See the API docs for the Font object for more fun stuff you can check.
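As a rough sketch of what else the Font object lets you check, the helper below collects the on/off properties that are explicitly switched on for a run. The names `TOGGLE_ATTRS`, `active_formats` and `report` are made up for this example; the attributes themselves (`bold`, `italic`, `strike`, `double_strike`, `all_caps`, `small_caps`, `subscript`, `superscript`, `underline`) are Font properties in python-docx. Note the assumption that only direct formatting matters: these properties return None when the value is inherited from a style, so formatting applied through a style is not reported here.

```python
# On/off Font properties in python-docx; each holds True, False, or None (inherited).
TOGGLE_ATTRS = ("bold", "italic", "strike", "double_strike",
                "all_caps", "small_caps", "subscript", "superscript")


def active_formats(font):
    """Return the names of the properties explicitly switched on for this font."""
    found = [name for name in TOGGLE_ATTRS if getattr(font, name, None) is True]
    # underline may be True or a WD_UNDERLINE member; None and False count as "off" here.
    if getattr(font, "underline", None) not in (None, False):
        found.append("underline")
    return found


def report(path):
    """Print every run that carries at least one direct format (hypothetical helper)."""
    from docx import Document  # imported here so active_formats has no hard dependency
    for p in Document(path).paragraphs:
        for run in p.runs:
            formats = active_formats(run.font)
            if formats:
                print(", ".join(formats) + ": " + run.text)
```

Calling `report(r'C:\stuff\Document.docx')` would then print one line per formatted run, with the format names in front of the run text.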

Solution 2:

You can stay with a native .docx parser (python-docx) rather than converting the file to HTML and using an HTML parser. Per the python-docx docs:

from docx import Document
from docx.enum.style import WD_STYLE_TYPE

document = Document(r'C:\stuff\Document.docx')
styles = document.styles
paragraph_styles = [
    s for s in styles if s.type == WD_STYLE_TYPE.PARAGRAPH
]
for style in paragraph_styles:
    if style.name == 'List Bullet':
        print("I'm a bullet")
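The loop above walks the paragraph styles defined in the document, not the paragraphs themselves. To find which paragraphs are bulleted, one option is to check each paragraph's own style name. This is only a sketch, with hypothetical helper names (`is_bullet_paragraph`, `bullet_texts`), and it assumes the bullets come from Word's built-in 'List Bullet' styles ('List Bullet', 'List Bullet 2', ...). Bullets applied as direct numbering on another style, such as 'List Paragraph', would not be detected this way.

```python
def is_bullet_paragraph(paragraph):
    """True if the paragraph's style name starts with 'List Bullet'."""
    style = paragraph.style
    name = getattr(style, "name", None)
    return name is not None and name.startswith("List Bullet")


def bullet_texts(path):
    """Return the text of every paragraph that uses a 'List Bullet' style."""
    from docx import Document  # imported here so the check above has no hard dependency
    return [p.text for p in Document(path).paragraphs if is_bullet_paragraph(p)]
```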

Solution 3:

Following mkrieger1's suggestion, I would use Pandoc to convert the .docx to .html and parse the document from there.

Installing Pandoc takes about the same effort as installing python-docx, and the conversion from .docx to .html worked like a charm. In the .html output, the structure of the document I am parsing and all of its formatting elements are clear and thus easy to work with.
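A minimal sketch of that route, under a few assumptions: the `pandoc` executable is on the PATH, its HTML output marks struck-out text with `<del>` and bullet items with `<li>` inside `<ul>` (worth confirming on your own files), and the standard-library `html.parser` is enough for the job. The names `docx_to_html` and `FormatFinder` are invented for this example.

```python
import subprocess
from html.parser import HTMLParser


def docx_to_html(docx_path):
    """Run the pandoc CLI and return the converted HTML as a string."""
    result = subprocess.run(
        ["pandoc", "-f", "docx", "-t", "html", docx_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


class FormatFinder(HTMLParser):
    """Collect the text of <del> elements and of <li> items inside <ul> lists."""

    def __init__(self):
        super().__init__()
        self.struck = []         # text of each outermost <del> element
        self.bullets = []        # text of each <li> whose parent list is a <ul>
        self._lists = []         # stack of open list tags ("ul" or "ol")
        self._open_items = []    # index into self.bullets per open <li>, None for <ol> items
        self._del_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._lists.append(tag)
        elif tag == "li":
            if self._lists and self._lists[-1] == "ul":
                self.bullets.append("")
                self._open_items.append(len(self.bullets) - 1)
            else:
                self._open_items.append(None)
        elif tag == "del":
            self._del_depth += 1
            if self._del_depth == 1:
                self.struck.append("")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._lists:
            self._lists.pop()
        elif tag == "li" and self._open_items:
            self._open_items.pop()
        elif tag == "del" and self._del_depth:
            self._del_depth -= 1

    def handle_data(self, data):
        if self._del_depth:
            self.struck[-1] += data
        if self._open_items and self._open_items[-1] is not None:
            self.bullets[self._open_items[-1]] += data
```

To use it, feed the converted HTML to the parser, for example `finder = FormatFinder(); finder.feed(docx_to_html(r'C:\stuff\Document.docx'))`, then read `finder.struck` and `finder.bullets`.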
