If you are at all familiar with Python, you will know that one of its nice features is how simple it is to write scripts. The other great thing about Python is its considerable collection of modules that do the heavy lifting for you. In this example I want to share a script I wrote that uses PyPDF2, a Python library for getting information such as metadata and page contents out of PDF files. I hope you find it useful and that it helps you on your path to writing Python.
A principle I tried to follow with this script was to keep the code as dynamic as possible. If there is a way to pull fields such as Title, Author, and Subject, I want to fetch them by enumerating whatever is there (reflection) rather than hard-coding each one, as in author = author. This way I don't miss anything and the code is less brittle.
Unfortunately PyPDF2 has some limitations. As far as I could tell, the XMP metadata is not enumerable, so I couldn't use reflection on it. This is a real drawback considering that newer PDF versions use XMP. If you know of a more robust Python library for handling PDFs, or you can fix my code, let me know!
Let's get to the code. Below I have broken it up into numbered sections, each with a short explanation.
First, the imports. The important one is PyPDF2; the rest come from the standard library.
# standard library modules used by the sections below
import argparse, os, re, sys, time
from stat import *
from PyPDF2 import PdfFileReader
Second, let's set up some command-line arguments.
# command line inputs
parser = argparse.ArgumentParser(description='Process some input arguments.')
parser.add_argument('-f', '--file', type=str, help='filename to parse', required=True)
# store_true makes -g a simple switch; type=bool would treat any non-empty string, even "False", as True
parser.add_argument('-g', '--GrabText', action='store_true', help='extract text. Example: -g')
args = parser.parse_args()
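One thing to watch with boolean options in argparse: type=bool converts any non-empty string to True, so passing "-g False" still switches the option on. The sketch below shows the difference. It is written for Python 3, and build_parser is a hypothetical helper used only for the comparison, not part of the script.

```python
import argparse


def build_parser(use_store_true):
    """Hypothetical helper: build a parser with one of the two flag styles."""
    parser = argparse.ArgumentParser(description="boolean flag comparison")
    if use_store_true:
        # -g takes no value; present means True, absent means False
        parser.add_argument("-g", "--GrabText", action="store_true")
    else:
        # -g takes a value that is passed through bool()
        parser.add_argument("-g", "--GrabText", type=bool, default=False)
    return parser


naive = build_parser(use_store_true=False)
fixed = build_parser(use_store_true=True)

# bool("False") is True because the string is not empty
naive_value = naive.parse_args(["-g", "False"]).GrabText
fixed_on = fixed.parse_args(["-g"]).GrabText
fixed_off = fixed.parse_args([]).GrabText

print(naive_value, fixed_on, fixed_off)
```

With a switch-style flag there is no value to misread, which is why it is usually the safer choice for on/off options.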
Third, make sure the file exists and print some information about it to the command line.
# check that the path exists and is a regular file
if not os.path.isfile(args.file): sys.exit()
if os.path.isdir(args.file): sys.exit()
fo = open(args.file, "rb")
# read the file header to get the PDF version information
head = fo.read(8)
# PDF header file type verification
if "%PDF" not in head:
    print "\nNot a supported document format."
    # stop here, since the PDF parsing below would fail anyway
    sys.exit(1)
# print filename
print "File:\t\t%s" % args.file
# print PDF header
print "Header:\t\t%s" % head
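The eight bytes read above hold the PDF header, which has the form %PDF-x.y, so the version number can be picked out directly. Here is a minimal sketch for Python 3, where a file opened in binary mode yields bytes rather than str. The read_pdf_version function and the PDF_HEADER pattern are names made up for this illustration.

```python
import re

# A PDF file starts with "%PDF-" followed by a version such as 1.4
PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def read_pdf_version(path):
    """Return the version string from the file header, or None if it is missing."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    match = PDF_HEADER.match(head)
    return match.group(1).decode("ascii") if match else None
```

Returning None instead of exiting leaves the decision about unsupported files to the caller.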
Fourth, DocumentInfo. Print the document information dynamically! This is the author, title, creation date, and so on. We have to encode the values as UTF-8 to deal with Unicode characters. Let's hope it's just UTF-8!
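# open the PDF and fetch its document information dictionary
input1 = PdfFileReader(fo)
meta = input1.getDocumentInfo()
# iterate DocumentInfo()
print "Document info"
for (key, value) in meta.iteritems():
    print re.sub("/", "", key) + ":\t", value.encode('utf-8')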
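The same dynamic idea can be tried without a PDF at hand. The sketch below, for Python 3, walks a plain dictionary whose keys look like the /Title and /Author entries the loop above prints. The format_info function and the sample values are made up for illustration, and it strips only the leading slash, where the script removes every slash in the key.

```python
def format_info(meta):
    """Return one 'Key:<tab>value' line per entry, with the leading '/' removed."""
    lines = []
    for key, value in meta.items():
        lines.append("%s:\t%s" % (key.lstrip("/"), value))
    return lines


# made-up sample values standing in for a real document information dictionary
sample = {"/Title": "Quarterly Report", "/Author": "Zoë"}
for line in format_info(sample):
    print(line)
```

Because nothing in the function names a specific field, any extra entry a document carries is printed without changing the code.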
Fifth, XMP information. This is the annoying part. Since any one of these elements may not exist, we have to test for each with hasattr(); otherwise we get an AttributeError and everything breaks.
print "\nXMP info"
# get the PDF XMP information if it exists
xmpinfo = input1.getXmpMetadata()
if hasattr(xmpinfo, 'dc_contributor'): print 'dc_contributor', xmpinfo.dc_contributor
if hasattr(xmpinfo, 'dc_identifier'): print 'dc_identifier', xmpinfo.dc_identifier
if hasattr(xmpinfo, 'dc_date'): print 'dc_date', xmpinfo.dc_date
if hasattr(xmpinfo, 'dc_source'): print 'dc_source', xmpinfo.dc_source
if hasattr(xmpinfo, 'dc_subject'): print 'dc_subject', xmpinfo.dc_subject
if hasattr(xmpinfo, 'xmp_modifyDate'): print 'xmp_modifyDate', xmpinfo.xmp_modifyDate
if hasattr(xmpinfo, 'xmp_metadataDate'): print 'xmp_metadataDate', xmpinfo.xmp_metadataDate
if hasattr(xmpinfo, 'xmpmm_documentId'): print 'xmpmm_documentId', xmpinfo.xmpmm_documentId
if hasattr(xmpinfo, 'xmpmm_instanceId'): print 'xmpmm_instanceId', xmpinfo.xmpmm_instanceId
if hasattr(xmpinfo, 'pdf_keywords'): print 'pdf_keywords', xmpinfo.pdf_keywords
if hasattr(xmpinfo, 'pdf_pdfversion'): print 'pdf_pdfversion', xmpinfo.pdf_pdfversion
# guard the publisher loop too, so a missing attribute does not raise
if hasattr(xmpinfo, 'dc_publisher'):
    for y in xmpinfo.dc_publisher:
        print "Publisher:\t" + y
Sixth, let's dump the file's MAC times (modified, accessed, and changed or created) from the filesystem. This isn't critical, but I wanted to include it since I may need it.
print "\nFilesystem MAC times"
(mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime) = os.stat(args.file)
# time.strftime("%d/%m/%Y %H:%M:%S", time.localtime(ctime))
# note: ctime is the creation time on Windows but the last metadata change on Unix
print "Ctime:\t\t%s" % time.ctime(ctime)
print "Last Mod:\t%s" % time.ctime(mtime)
print "Last Access:\t%s" % time.ctime(atime)
#print "Size:\t%s" % os.path.getsize(args.file)
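The result of os.stat() also exposes named fields, which avoids unpacking ten values to use four of them. A minimal sketch for Python 3 follows; the mac_times function and its dictionary keys are names chosen for this illustration.

```python
import os
import time


def mac_times(path):
    """Collect the timestamps and size of a file using named stat fields."""
    st = os.stat(path)
    return {
        "modified": time.ctime(st.st_mtime),
        "accessed": time.ctime(st.st_atime),
        # st_ctime: creation time on Windows, last metadata change on Unix
        "changed_or_created": time.ctime(st.st_ctime),
        "size": st.st_size,
    }
```

Calling mac_times(args.file) would return the values the section above prints, plus the size in bytes.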
Seventh, this section gets the number of pages in the document and extracts the text from each one in a readable form that you can search later if you want. I don't like the way the PyPDF2 documentation suggests doing it, so I changed it up here. I want to keep some formatting, and this does that.
# get the page text contents if we can, and only when -g was given
if args.GrabText:
    content = ""
    # iterate pages
    for i in range(0, input1.getNumPages()):
        # extract text from page and add to content
        content += input1.getPage(i).extractText() + "\n"
    #content = " ".join(content.replace(u"\xa0", " ").strip().split())
    print content.encode("ascii", "ignore")
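The commented-out line above flattens all whitespace, line breaks included, into single spaces. If the goal is to keep some formatting, a middle path is to tidy each line separately and leave the line breaks alone. This is a minimal sketch for Python 3; tidy_text is a made-up name and the sample string is invented.

```python
def tidy_text(content):
    """Collapse runs of whitespace inside each line while keeping line breaks."""
    lines = []
    # replace non-breaking spaces with ordinary ones, then work line by line
    for line in content.replace("\xa0", " ").splitlines():
        lines.append(" ".join(line.split()))
    return "\n".join(lines)


sample = "Hello\xa0 world \nsecond   line\n"  # invented text with messy spacing
print(tidy_text(sample))
```

The page boundaries added by the loop above survive this step, since each "\n" is kept as a line break.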
Wrapping it up
This is a fun example of how you can pull metadata and text from a PDF file using Python. Hopefully you found this helpful and will post your own code online to help the community. You can download the full script “pyMetaDive_PDF.py” here.