52.3 XML: ElementTree Parsing and Building

The eXtensible Markup Language (XML) provides a hierarchical, self-descriptive format for data serialization. While numerous parsing approaches exist, the xml.etree.ElementTree module in Python’s standard library offers a compact, “Pythonic” interface for both parsing existing XML documents and programmatically constructing new ones. Its name derives from its core abstraction: an XML document is treated as a tree of Element objects, where each element has a tag, attributes, text content, and a list of child elements.

The Element Object: The Fundamental Building Block

Every element in an XML document is represented by an Element object. Understanding its properties is crucial. The tag is a string identifying the element’s type (e.g., 'book', 'author'). The attrib property is a dictionary holding the element’s attributes (e.g., {'id': '101'}). The text attribute contains the text directly inside the element, before its first child element. The tail attribute, often a source of confusion, contains text that comes after the element’s closing tag but before the next tag, whether that is the next sibling’s opening tag or the parent’s closing tag. Both text and tail are None when no such text is present. Elements behave like lists of their children, supporting iteration, indexing, len(), and methods like append() and insert().

import xml.etree.ElementTree as ET

# Create a root element
root = ET.Element('library', {'type': 'fiction'})

# Create a child element and set its text
book = ET.SubElement(root, 'book')
book.set('id', '123')
title = ET.SubElement(book, 'title')
title.text = 'The Hobbit'

# Print the rough XML to see the structure
ET.dump(root)

Output:

<library type="fiction"><book id="123"><title>The Hobbit</title></book></library>
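
The difference between text and tail described above is easiest to see with mixed content, where text and child elements are interleaved. The following is a minimal sketch; the <p>, <b>, and <i> tags are arbitrary names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Mixed content: text and child elements interleaved inside <p>
p = ET.fromstring('<p>Hello <b>bold</b> world<i>!</i> bye</p>')

# Children are accessed like list items
b, i = p[0], p[1]

print(repr(p.text))   # 'Hello '   text before the first child
print(repr(b.text))   # 'bold'     text inside <b>
print(repr(b.tail))   # ' world'   text after </b>, before <i>
print(repr(i.tail))   # ' bye'     text after </i>, before </p>
print(len(p))         # 2          number of direct children

# itertext() yields text and tail strings in document order
full_text = ''.join(p.itertext())
print(full_text)      # Hello bold world! bye
```

Note that ' world' belongs to the <b> element's tail, not to the parent's text, so code that reads only p.text would miss it.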

Parsing XML Documents

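ElementTree provides two primary functions for parsing: parse() for files and fromstring() for strings already in memory. They differ in what they return: parse() returns an ElementTree object, which represents the whole document and whose root element you obtain with .getroot(), while fromstring() returns the root Element directly. A common pitfall is forgetting that find() and findall(), when given a plain tag name, only search direct children of the current element, not the entire subtree. For deep searches, prefix the path with './/' (for example, findall('.//title')), or use the .iter() method for simple iteration over all descendants. The iterfind() method accepts the same limited XPath expressions as findall() but returns an iterator instead of a list.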

xml_data = '''
<library>
  <book id="101">
    <title>Python Crash Course</title>
    <author>Eric Matthes</author>
  </book>
  <book id="102">
    <title>Fluent Python</title>
    <author>Luciano Ramalho</author>
  </book>
</library>'''

# Parse from a string
root = ET.fromstring(xml_data)

# Find the first 'book' element *directly under* root
first_book = root.find('book')
print('First book ID:', first_book.get('id'))  # Output: First book ID: 101

# Find ALL 'title' elements anywhere in the tree (iter() walks the whole subtree)
for title in root.iter('title'):
    print('Title:', title.text)
# Output:
# Title: Python Crash Course
# Title: Fluent Python
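
To make the search-depth behavior concrete, here is a small sketch comparing a plain tag name with a './/' path. The <shelf> wrapper element is a hypothetical addition, introduced only to place one book deeper in the tree, and io.StringIO stands in for a real file passed to parse().

```python
import io
import xml.etree.ElementTree as ET

doc = '''<library>
  <shelf>
    <book id="101"><title>Python Crash Course</title></book>
  </shelf>
  <book id="102"><title>Fluent Python</title></book>
</library>'''

# parse() accepts a filename or a file object; StringIO stands in for a file
tree = ET.parse(io.StringIO(doc))
root = tree.getroot()

direct = root.findall('book')        # direct children of root only
deep = root.findall('.//book')       # any depth below root
via_iter = list(root.iter('book'))   # any depth, in document order

direct_ids = [b.get('id') for b in direct]
deep_ids = [b.get('id') for b in deep]
print(direct_ids)   # ['102']
print(deep_ids)     # ['101', '102']

# Attribute predicates are part of the supported XPath subset
title = root.find(".//book[@id='101']/title")
print(title.text)   # Python Crash Course
```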

Navigating and Modifying the Tree

Once parsed, the tree is fully mutable. You can navigate using list indices (root[0] for the first child), iterate over children (for child in root:), or use methods to find specific elements. Modifying the tree is straightforward: change .text or .attrib, add new children with append() or insert(), and remove children with remove().

# Continue from previous parsing example
for book in root.findall('book'):
    # Add a new child element to each book
    rating = ET.SubElement(book, 'rating')
    rating.text = '5'

    # Modify an existing element's text
    if book.get('id') == '102':
        author_elem = book.find('author')
        author_elem.text = 'Luciano Ramalho (Updated)'

ET.dump(root)
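
Removal and positional insertion are not shown in the listing above, so the following sketch illustrates them on a small, made-up document. It assumes the usual precaution of looping over the list returned by findall() rather than over the element itself, since removing children while iterating directly over the parent can skip items.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<library>'
    '<book id="101"><title>A</title></book>'
    '<book id="102"><title>B</title></book>'
    '<book id="103"><title>C</title></book>'
    '</library>'
)

# findall() returns a list, so the loop runs over a snapshot and
# removing children does not disturb the iteration
for book in root.findall('book'):
    if book.get('id') == '102':
        root.remove(book)   # remove() must be called on the direct parent

# insert() places a new child at a given index, like list.insert()
new_book = ET.Element('book', {'id': '100'})
root.insert(0, new_book)

ids = [book.get('id') for book in root]
print(ids)                  # ['100', '101', '103']
print(root[-1].get('id'))   # 103
```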

Building XML Documents from Scratch

Constructing a new document involves creating a root Element and building its hierarchy using the Element() constructor and the SubElement() helper function, which creates an element and automatically appends it to a parent. This is far safer and more readable than manually concatenating strings, as it automatically handles XML escaping for special characters like <, >, and &, preventing malformed XML or injection vulnerabilities.

# Build a catalog
root = ET.Element('catalog')
for product_id in [2001, 2002]:
    product = ET.SubElement(root, 'product', {'id': str(product_id)})
    name = ET.SubElement(product, 'name')
    name.text = f'Widget Model #{product_id}'
    # This text would break a string-concatenation approach
    description = ET.SubElement(product, 'description')
    description.text = 'Standard size: 5" x 3" <and> safe for kids & pets'

# Create an ElementTree object for the whole document
tree = ET.ElementTree(root)

# Indent the tree for readable output (ET.indent requires Python 3.9+),
# then write it to a file with an XML declaration
ET.indent(tree, space='  ')
tree.write('catalog.xml', encoding='utf-8', xml_declaration=True)

The resulting catalog.xml will be correctly escaped:

<?xml version='1.0' encoding='utf-8'?>
<catalog>
  <product id="2001">
    <name>Widget Model #2001</name>
    <description>Standard size: 5" x 3" &lt;and&gt; safe for kids &amp; pets</description>
  </product>
  ...
</catalog>
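
The escaping can also be checked in memory, without writing a file. The sketch below uses ET.tostring() with encoding='unicode' to get a str, then parses the result back to confirm that the original values survive the round trip. The element name 'item' and the attribute 'label' are illustrative only.

```python
import xml.etree.ElementTree as ET

item = ET.Element('item', {'label': 'size "L" & up'})
item.text = 'a < b && c > d'

# encoding='unicode' returns a str instead of bytes
xml_text = ET.tostring(item, encoding='unicode')
print(xml_text)

# Special characters are replaced by entity references on output...
print('&lt;' in xml_text)     # True
print('&amp;' in xml_text)    # True

# ...and restored to the original characters when parsed again
restored = ET.fromstring(xml_text)
print(restored.text)          # a < b && c > d
print(restored.get('label'))  # size "L" & up
```

Escaping is applied to attribute values as well as to text, which is why setting values through the API is preferable to formatting them into a string by hand.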

Namespaces: A Common Pitfall

XML namespaces, declared with xmlns attributes, are essential for avoiding tag name conflicts but complicate searching in ElementTree. A tag in a namespace is not stored under its local name like 'author'; it is stored as a fully qualified name in Clark notation, with the namespace URI in braces followed by the local name, as in '{http://example.com/books}author'. When searching documents with namespaces, a plain local name passed to find() or findall() will not match. You must either use the fully qualified names or, more readably, define a dictionary that maps prefixes to namespace URIs and pass it to the search methods.

ns_xml = '''
<books xmlns:bk="http://example.com/books">
  <bk:book>
    <bk:title>XML Guide</bk:title>
  </bk:book>
</books>'''
root = ET.fromstring(ns_xml)

# This will NOT find anything because of the namespace
print('Find with local name:', root.find('book'))  # Output: Find with local name: None

# Define a namespace dictionary and use it in the search
namespaces = {'bk': 'http://example.com/books'}
found_book = root.find('bk:book', namespaces)
found_title = found_book.find('bk:title', namespaces)
print('Title with namespace:', found_title.text)  # Output: Title with namespace: XML Guide

# Alternatively, use the full Clark notation (ugly but explicit)
full_tag_name = '{http://example.com/books}title'
print('Title with full tag:', root.find(f'*/{full_tag_name}').text)
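
A default namespace (xmlns="..." with no prefix) is a frequent variant of the same pitfall, because the tags in the source carry no visible prefix at all. The following sketch reuses the example URI from above with a hypothetical <feed> root. It assumes Python 3.8 or later for the {*} wildcard.

```python
import xml.etree.ElementTree as ET

doc = '''<feed xmlns="http://example.com/books">
  <book><title>XML Guide</title></book>
</feed>'''
root = ET.fromstring(doc)

# Every tag is qualified, even though the source shows no prefix
print(root.tag)     # {http://example.com/books}feed

# The prefix in the dictionary is local to the search; it does not
# have to match any prefix used in the document
ns = {'bk': 'http://example.com/books'}
title = root.find('bk:book/bk:title', ns)
print(title.text)   # XML Guide

# Python 3.8+: {*} matches the tag in any namespace
wild = root.find('.//{*}title')
print(wild is title)   # True

# register_namespace() controls the prefix used when serializing
# (it is a global setting for the module)
ET.register_namespace('bk', 'http://example.com/books')
out = ET.tostring(root, encoding='unicode')
print(out.startswith('<bk:feed'))   # True
```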

Best Practices and Security Warning

When writing XML, specify encoding='utf-8' and xml_declaration=True in the write() method so that consumers can detect the encoding reliably. ElementTree does not indent its output by default; for human-readable output, call ET.indent() on the tree before writing (Python 3.9 and later), or serialize with ET.tostring() and pass the result through xml.dom.minidom.parseString(...).toprettyxml(). Most critically, do not parse XML from untrusted sources with the standard ElementTree parser without precautions. The Python documentation warns that the module is not secure against maliciously constructed data; depending on the version of the underlying Expat library, it can be vulnerable to entity-expansion attacks known as XML bombs (e.g., the Billion Laughs attack), which cause denial of service by exhausting system memory. For untrusted data, use the third-party defusedxml package, which provides hardened drop-in replacements for the standard parsers.
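
As a rough illustration of the pretty-printing options, the sketch below produces indented output in two ways: by re-parsing the serialized tree with xml.dom.minidom, and by calling ET.indent(), which assumes Python 3.9 or later. The small catalog tree is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

root = ET.Element('catalog')
product = ET.SubElement(root, 'product', {'id': '2001'})
ET.SubElement(product, 'name').text = 'Widget'

# Option 1: serialize to a string, re-parse with minidom, pretty-print.
# toprettyxml() is a minidom method, so the tree must be converted first.
raw = ET.tostring(root, encoding='unicode')
pretty_minidom = minidom.parseString(raw).toprettyxml(indent='  ')

# Option 2 (Python 3.9+): ET.indent() inserts whitespace into the tree in place
ET.indent(root, space='  ')
pretty_et = ET.tostring(root, encoding='unicode')
print(pretty_et)
```

ET.indent() modifies the text and tail of the elements in the tree, so apply it only when the added whitespace is not significant to the consumer of the document.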