How To Identify Likely Broken PDF Pages Before Extracting Their Text?
Solution 1:
A quick bash function based on pdftotext
Here is a full (rewritten) function scanning for bad pages:
#!/bin/bash
findBadPages() {
    local line opt progress=true usage="Usage: ${FUNCNAME[0]} [-f first page]"
    usage+=' [-l last page] [-m min bad/page] [-q (quiet)]'
    local -a pdftxt=(pdftotext -layout - -)
    local -ia badpages=()
    local -i page=1 limit=10 OPTIND
    while getopts "ql:f:m:" opt; do
        case $opt in
            f ) pdftxt+=(-f $OPTARG); page=$OPTARG ;;
            l ) pdftxt+=(-l $OPTARG) ;;
            m ) limit=$OPTARG ;;
            q ) progress=false ;;
            * ) printf >&2 '%s ERROR: Unknown option!\n%s\n' \
                    "${FUNCNAME[0]}" "$usage"; return 1 ;;
        esac
    done
    shift $((OPTIND-1))
    # Strip every expected character, then count what is left on each page.
    # pdftotext separates pages with a form-feed character.
    while IFS= read -r line; do
        [ "$line" = $'\f' ] && page+=1 && $progress && printf %d\\r $page
        ((${#line} > 1 )) && badpages[page]+=${#line}
    done < <(
        tr -d '0-9a-zA-Z\047"()[]{}<>,-./+?!$&@#:;%$=_ºÁÃÇÔàáâãçéêíóôõú– ' < <(
            "${pdftxt[@]}" <"$1"
        )
    )
    # Report every page whose count of strange characters exceeds the limit.
    for page in ${!badpages[@]}; do
        (( ${badpages[page]} > limit )) && {
            $progress && printf "There are %d strange characters in page %d\n" \
                ${badpages[page]} $page || echo $page ;}
    done
}
Now, running it:
findBadPages DJE_3254_I_18062021\(1\).pdf
There are 2237 strange characters in page 23
There are 258 strange characters in page 24
There are 20 strange characters in page 32
findBadPages -m 100 -f 40 -l 100 DJE_3254_I_18062021.pdf
There are 623 strange characters in page 80
There are 1068 strange characters in page 81
There are 1258 strange characters in page 82
There are 1269 strange characters in page 83
There are 1245 strange characters in page 84
There are 256 strange characters in page 85
findBadPages DJE_3254_III_18062021.pdf
There are 11 strange characters in page 125
There are 635 strange characters in page 145
findBadPages -qm100 DJE_3254_III_18062021.pdf
145
findBadPages -h
/bin/bash: illegal option -- h
findBadPages ERROR: Unknown option!
Usage: findBadPages [-f first page] [-l last page] [-m min bad/page] [-q (quiet)]
Usage:
findBadPages [-f INTEGER] [-l INTEGER] [-m INTEGER] [-q] <pdf file>
Where
-f lets you specify the first page,
-l the last page,
-m the minimum number of strange characters per page required to report it, and
-q suppresses the page-number display during the scan and prints only the bad page numbers.
Note:
The string used by tr -d,
0-9a-zA-Z\047"()[]{}<>,-./:;%$=_ºÁÃÇÔàáâãçéêíóôõú–
was built by sorting the characters actually used in your PDF files, so it may not match documents in another language! Adding some more accented characters or other printable characters that are missing could become necessary in future uses.
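If you need to rebuild that character set for your own documents, a small sketch along these lines can list the distinct characters a known-good file actually uses (pdftotext on the PATH and the file name good_sample.pdf are assumptions here):
import subprocess

# Extract the text of a known-good PDF and print its distinct characters,
# sorted, so the tr -d set above can be adapted to other languages.
text = subprocess.run(
    ["pdftotext", "-layout", "good_sample.pdf", "-"],
    capture_output=True, text=True).stdout
print("".join(sorted(set(text))))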
Solution 2:
@mkl may be onto a method with the suggestion to use a word-dictionary search.
I have tried different methods to find a simple way of detecting the two bad pages in your smallest (third) example, and I suspect any of them can easily be defeated by a page that mixes good and bad text; so this is not a complete answer, and it almost certainly needs more passes to refine.
We have to accept, since you are asking the question, that each PDF is of unknown quality but still needs to be processed. So, hoping that the majority of pages are good, we run through the burst stage blindly.
Very common words contain the three-letter syllable "est", so if we search the burst files we see it is missing from page 23 and page 24; those are thus good candidates for corruption.
Likewise, for the 855-page file you say page 146 is a problem (confirmed by my previous search method for fully corrupted pages, i.e. those containing ���; just that one is corrupt), but now we can easily see more in the first 40 pages.
OCR is certainly also needed for pages 4, 5, 8, 9, 10 and 35, including those that are image only (35 is a very odd page ? a background image ?).
But we get a false positive for 2 pages, 19 and 33 (they do have text, but no "est" or "EST"), and pages 20, 32 and 38 have "EST", so the search needs to be case insensitive.
So, using a case-insensitive search without modification, we should get a confidence of 95% (2 wrong in 40, and they need OCR anyway). I did not test deeply why lowercase "est" only returned 275 of the total 855 pages, unless a very high percentage later in the file are images needing OCR.
I had previously suggested searching the third, 6054-page file much faster by looking for "???????", which gives a more erratic result to work with, but does show that the corruption covers ALL text pages from 25 to 85.
So where does that lead us?
In reality it would be very rare to find someone deliberately typing "???".
Corrupt pages usually contain "???" or "���".
In Portuguese, a corrupt page is less likely to contain a (case-insensitive) "est".
Partly corrupt pages may contain a large image needing OCR, with or without "est".
A few corrupt pages will be none of the above.
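As a rough sketch of those heuristics (not a definitive implementation), the following flags pages that contain replacement characters or runs of question marks, or whose text never contains a case-insensitive "est". It assumes pdftotext is installed and reuses a file name from the question:
import re
import subprocess

def suspect_pages(pdf_path, first, last):
    # Flag pages containing replacement characters, runs of '?',
    # or no case-insensitive "est" at all.
    bad = []
    for page in range(first, last + 1):
        # Extract a single page as text; the final "-" sends output to stdout.
        out = subprocess.run(
            ["pdftotext", "-layout", "-f", str(page), "-l", str(page), pdf_path, "-"],
            capture_output=True, text=True).stdout
        has_junk = "\ufffd" in out or re.search(r"\?{3,}", out)
        has_est = re.search("est", out, re.IGNORECASE)
        if has_junk or (out.strip() and not has_est):
            bad.append(page)
    return bad

print(suspect_pages("DJE_3254_I_18062021.pdf", 1, 40))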
Solution 3:
Since this is also (or mainly) a performance problem, I would suggest modifying your code into a more multi-threaded (or multi-process) solution, or simply using GNU Parallel.
There is a pretty nice article about it -> link
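If your pipeline is in Python, here is a minimal sketch of the same idea using multiprocessing to check pages in parallel (the character set is adapted from Solution 1, and the file name, page range and threshold are assumptions):
import subprocess
from multiprocessing import Pool

# Characters considered "normal", adapted from Solution 1's tr -d string.
ALLOWED = set(
    "0123456789abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "'\"()[]{}<>,-./:;%$=_ºÁÃÇÔàáâãçéêíóôõú \t\n\f"
)

def count_strange(args):
    pdf_path, page = args
    # Extract one page of text and count characters outside the expected set.
    out = subprocess.run(
        ["pdftotext", "-layout", "-f", str(page), "-l", str(page), pdf_path, "-"],
        capture_output=True, text=True).stdout
    return page, sum(1 for ch in out if ch not in ALLOWED)

if __name__ == "__main__":
    pdf, pages = "DJE_3254_I_18062021.pdf", range(1, 101)
    with Pool() as pool:
        for page, strange in pool.map(count_strange, [(pdf, p) for p in pages]):
            if strange > 10:
                print(f"There are {strange} strange characters in page {page}")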
Solution 4:
Try using another module to extract the text correctly; I suggest PyPDF2.
Here is a function which should fix the issue:
import PyPDF2

def extract_text(filename, page_number):
    # Returns the text content of a given page
    pdf_file_object = open(filename, 'rb')
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_object)
    # page_number - 1 below because PyPDF2 counts pages from 0
    page_object = pdf_reader.getPage(page_number - 1)
    text = page_object.extractText()
    pdf_file_object.close()
    return text
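For instance, here is a sketch of how extract_text could be combined with a simple check to flag likely-broken pages (the helper name and file name are assumptions, and these are the older PyPDF2 1.x/2.x method names used above):
from PyPDF2 import PdfFileReader

def likely_broken_pages(filename):
    # Pages whose extracted text is empty or contains the Unicode replacement character.
    with open(filename, 'rb') as f:
        num_pages = PdfFileReader(f).getNumPages()
    broken = []
    for page_number in range(1, num_pages + 1):
        text = extract_text(filename, page_number)
        if not text.strip() or '\ufffd' in text:
            broken.append(page_number)
    return broken

print(likely_broken_pages("DJE_3254_I_18062021.pdf"))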
By the way, PyPDF2 isn't preinstalled with Python. To install it, make sure pip is available (it very likely already is) and run 'pip install PyPDF2' from the command line.