Python XPath Namespaces and No Matches

Tags: python

Scenario: You’re running XPath expressions on a document or element parsed with Python’s ElementTree module, and you can’t find the expected results. It’s like there are no elements in the tree. Weirder still, iterating over all elements proves that there are elements in the tree.

TL;DR: Namespaces. Use them in your queries.
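Here's the symptom in miniature, as a rough sketch rather than the real data. The SAMPLE document below is a made-up stand-in that only borrows the spreadsheet namespace URI; the element layout is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document; not the actual government data.
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet>
    <Table>
      <Row><Cell><Data>1983</Data></Cell></Row>
      <Row><Cell><Data>1984</Data></Cell></Row>
    </Table>
  </Worksheet>
</Workbook>"""

root = ET.fromstring(SAMPLE)

# The bare name matches nothing, because every tag carries the namespace.
unqualified = root.findall('.//Row')

# Iterating shows the real tag names, in ElementTree's {uri}local form.
tags = sorted({element.tag for element in root.iter()})

print(len(unqualified))
for tag in tags:
    print(tag)
```

The query comes back empty, while the printed tags all start with the namespace URI in curly braces. That prefix is the part the query has to include.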

Python’s ElementTree module makes XML processing pretty tolerable, most of the time. I don’t like digging through XML documents, but I don’t hate it when I’m using ElementTree and XPath to do the dirty work.

My latest project takes government data provided in Excel spreadsheets and loads it into a database, where I can then serve it up as JSON and JavaScript visualizations. The files looked like Excel 2010 spreadsheets on the surface, but they weren’t vanilla. They were generated with SAS and uploaded as raw XML documents:

$ file coalpublic1983.xls
coalpublic1983.xls: XML document text

That’s weird for an Excel spreadsheet. Normally they’re zip files with the XML tucked inside along with various metadata files. Here’s that same file after opening and saving in LibreOffice:

$ file coalpublic1983.xlsx
coalpublic1983.xlsx: Zip archive data, at least v2.0 to extract

Since they weren’t proper Excel spreadsheets, openpyxl couldn’t read them. Since they were XML, ElementTree could, but the findall method returned no elements. A bit of sleuthing showed that ElementTree stores each element’s tag with its XML namespace URI in front, wrapped in curly braces, so rather than

# Matches nothing: the bare name carries no namespace
for row in tree.findall('.//Row'):
    pass

I did this instead:

# Qualify the tag with its namespace URI in curly braces
for row in tree.findall('.//{urn:schemas-microsoft-com:office:spreadsheet}Row'):
    pass

Just rolls right off the tongue, doesn’t it?
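If the full URI gets tiresome, findall also accepts a mapping from prefixes to namespace URIs, so the query can use a short prefix instead. Below is a minimal sketch of that approach; the `ss` prefix, the NS name, and the SAMPLE document are all my own choices for illustration, not anything taken from the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document; not the actual government data.
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet>
    <Table>
      <Row><Cell><Data>1983</Data></Cell></Row>
      <Row><Cell><Data>1984</Data></Cell></Row>
    </Table>
  </Worksheet>
</Workbook>"""

# The prefix is arbitrary; only the URI has to match the document.
NS = {'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

root = ET.fromstring(SAMPLE)

# 'ss:Row' expands to '{urn:schemas-microsoft-com:office:spreadsheet}Row'.
rows = root.findall('.//ss:Row', NS)

# Every step of a path needs the prefix, not just the first one.
values = [row.findtext('ss:Cell/ss:Data', namespaces=NS) for row in rows]

print(values)
```

The prefix in the query doesn’t have to match whatever prefix the document itself uses, since ElementTree only compares the expanded URIs. On Python 3.8 and later, `.//{*}Row` should also work as a wildcard that matches Row in any namespace, which is handy for a quick look but less precise.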