56.4 urllib.parse: Parsing and Building URLs
The urllib.parse module provides functions for parsing Uniform Resource Locators (URLs) into their component parts and, conversely, for constructing URLs from those components. This is fundamental to web programming, because URLs are the primary means of addressing resources on the internet. The module largely follows the generic URI syntax of RFC 3986, although it retains some behavior from earlier RFCs for backward compatibility, and its parsing functions do not validate their input. Its operations break a URL string into its constituent elements (scheme, netloc, path, parameters, query, and fragment) and assemble them back together. This lets developers manipulate URLs programmatically, a common requirement in web scraping, API clients, and request handling.
The urlparse() and urlsplit() Functions
The urlparse() function is the workhorse for deconstructing a URL. It returns a named tuple, ParseResult, which behaves like a tuple for backward compatibility but allows for more readable code through named attributes.
from urllib.parse import urlparse
url = "https://www.example.com:8080/path/to/page;param?query=arg#fragment"
parsed = urlparse(url)
print(parsed.scheme) # 'https'
print(parsed.netloc) # 'www.example.com:8080'
print(parsed.path) # '/path/to/page'
print(parsed.params) # 'param'
print(parsed.query) # 'query=arg'
print(parsed.fragment) # 'fragment'
print(parsed.hostname) # 'www.example.com' (convenience attribute)
print(parsed.port) # 8080 (convenience attribute)
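Because ParseResult is a named tuple, the same result can be indexed or unpacked like a plain tuple. The short sketch below illustrates this together with a few of the convenience attributes; the URL and credentials are made up for the example.

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials and a mixed-case host, for illustration only
result = urlparse("https://User:secret@Example.COM/docs/index.html?lang=en")

# Tuple-style access still works: the six fields unpack in a fixed order
scheme, netloc, path, params, query, fragment = result
print(result[0] == scheme)   # True
print(netloc)                # 'User:secret@Example.COM' (kept exactly as written)

# Convenience attributes are derived from netloc
print(result.hostname)       # 'example.com' (lowercased)
print(result.username)       # 'User'
print(result.port)           # None, because the URL names no port
```

Note that hostname is normalized to lowercase while netloc preserves the original spelling, so compare hosts through hostname rather than netloc.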
The urlsplit() function is nearly identical but does not split the parameters from the URL path. This is a subtle but important distinction. Parameters (the part after a semicolon ; in the path) are an obsolete feature of URLs and are rarely used in modern web development. urlsplit() returns a SplitResult tuple, which lacks the params attribute, combining the parameters into the path.
from urllib.parse import urlsplit
split_url = urlsplit(url)
print(split_url.path) # '/path/to/page;param' # Note the params are part of the path
Why this matters: For most modern URLs, urlsplit() is sufficient and slightly faster. Use urlparse() only if you explicitly need to handle the obsolete parameters component.
The urlunparse() and urlunsplit() Functions
These functions perform the inverse operation of urlparse() and urlsplit(), respectively. They construct a URL string from the individual components. Use the function that matches how you parsed: urlunparse() expects exactly six components and urlunsplit() expects exactly five. Passing a six-item ParseResult to urlunsplit() does not quietly fold the params into the path; it raises a ValueError because the number of components is wrong.
from urllib.parse import urlunparse, urlunsplit
# Reconstruct from a ParseResult tuple (6 components)
reconstructed_from_parse = urlunparse(parsed)
print(reconstructed_from_parse) # same string as the original url
# Reconstruct from a tuple of 6 components
new_components = ('https', 'www.new.org', '/new/path', '', 'a=1&b=2', '')
new_url = urlunparse(new_components)
print(new_url) # 'https://www.new.org/new/path?a=1&b=2'
# Reconstruct from a SplitResult tuple (5 components)
reconstructed_from_split = urlunsplit(split_url)
print(reconstructed_from_split) # same string as the original url
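A common use of the unparse functions is to change one component of an existing URL and rebuild the string. The sketch below is one way to do that, assuming a made-up search URL; it relies on _replace(), the standard named tuple method, and on geturl(), which the result objects provide.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical URL used only for illustration
original = "https://www.example.com/search?q=old#results"
parts = urlsplit(original)

# _replace() returns a new SplitResult; the original tuple is unchanged
updated = parts._replace(query="q=new", fragment="")

new_url = urlunsplit(updated)
print(new_url)                      # 'https://www.example.com/search?q=new'

# geturl() on a result object gives the same string as the unsplit function
print(parts.geturl() == original)   # True
```

Empty components are simply omitted from the output, which is why no trailing # appears once the fragment is cleared.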
The urljoin() Function: Resolving Relative URLs
One of the most common and error-prone tasks in web development is resolving a relative URL against a base URL. Manually concatenating strings is fraught with pitfalls regarding trailing slashes and parent directories (..). The urljoin() function handles these cases by following the reference-resolution rules of RFC 3986.
from urllib.parse import urljoin
base = "https://www.base.com/section/page.html"
print(urljoin(base, "another.html")) # https://www.base.com/section/another.html
print(urljoin(base, "/absolute/path.html")) # https://www.base.com/absolute/path.html
print(urljoin(base, "../parent.html")) # https://www.base.com/parent.html
print(urljoin(base, "https://absolute.com/full")) # https://absolute.com/full # Base is ignored
Why this is critical: Without urljoin(), a simple script that follows links from a page could easily break. Naive concatenation produces malformed URLs such as https://www.base.com/section/page.htmlanother.html, and hand-written handling of .. or leading slashes tends to miss cases, leading to 404 errors.
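Two details of urljoin() are worth seeing in code: the effect of a trailing slash on the base, and scheme-relative references. The sketch below also includes a small helper, resolve_links(), which is a hypothetical name introduced here to show how a list of href values might be resolved against the page they came from.

```python
from urllib.parse import urljoin

# Without a trailing slash, the last path segment of the base is replaced
print(urljoin("https://www.base.com/section", "another.html"))
# https://www.base.com/another.html

# With a trailing slash, the base is treated as a directory
print(urljoin("https://www.base.com/section/", "another.html"))
# https://www.base.com/section/another.html

# A scheme-relative reference inherits only the scheme of the base
print(urljoin("https://www.base.com/section/", "//cdn.example.org/lib.js"))
# https://cdn.example.org/lib.js


def resolve_links(page_url, hrefs):
    """Resolve each href found on a page against that page's own URL."""
    return [urljoin(page_url, href) for href in hrefs]


links = resolve_links(
    "https://www.base.com/section/page.html",
    ["another.html", "/absolute/path.html", "../parent.html", "#top"],
)
for link in links:
    print(link)
```

The last entry, a bare fragment, resolves to the page URL itself with #top appended.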
Parsing and Building Query Strings with parse_qs() and parse_qsl()
The query string (the part after the ?) is often a string of key-value pairs. urllib.parse provides functions to parse this string into more useful data structures.
parse_qs() returns a dictionary where each key maps to a list of values. This is because query strings can have multiple values for the same key (e.g., ?color=red&color=blue).
parse_qsl() returns a list of tuples, preserving the order of the parameters.
from urllib.parse import parse_qs, parse_qsl
query_string = "name=Alice&age=30&job=engineer&job=consultant"
dict_result = parse_qs(query_string)
list_result = parse_qsl(query_string)
print(dict_result) # {'name': ['Alice'], 'age': ['30'], 'job': ['engineer', 'consultant']}
print(list_result) # [('name', 'Alice'), ('age', '30'), ('job', 'engineer'), ('job', 'consultant')]
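The parsing functions also decode percent-escapes and, by default, drop parameters with empty values. For the building direction, the module offers urlencode(). The sketch below, using a made-up search URL, shows both behaviors and a round trip from query string to data structure and back.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit

# Hypothetical URL used only for illustration
search_url = "https://www.example.com/search?q=blue+%26+green&tag=a&tag=b&page="
query = urlsplit(search_url).query

# Values are percent-decoded and '+' becomes a space;
# the empty 'page' parameter is dropped by default
params = parse_qs(query)
print(params)        # {'q': ['blue & green'], 'tag': ['a', 'b']}

# keep_blank_values=True retains parameters with empty values
with_blanks = parse_qs(query, keep_blank_values=True)
print(with_blanks)   # {'q': ['blue & green'], 'tag': ['a', 'b'], 'page': ['']}

# Building: doseq=True expands each list into repeated key=value pairs
rebuilt = urlencode(params, doseq=True)
print(rebuilt)       # 'q=blue+%26+green&tag=a&tag=b'

# A list of pairs from parse_qsl() can be passed to urlencode() directly
pairs = parse_qsl(query, keep_blank_values=True)
rebuilt_pairs = urlencode(pairs)
print(rebuilt_pairs) # 'q=blue+%26+green&tag=a&tag=b&page='
```

Without doseq=True, urlencode() would encode the string form of each list rather than its items, so the flag matters whenever the input comes from parse_qs().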
The quote(), quote_plus(), unquote(), and unquote_plus() Functions
URLs can only contain a limited set of characters from the ASCII set. Any other character must be “percent-encoded”: this includes characters outside the ASCII range (like é or 字), spaces, a literal %, and reserved characters such as & when they appear as data rather than as delimiters. A space, for example, becomes %20, or + in form-encoded query strings.
quote(): Replaces special characters with their %XX escape sequence. By default it leaves / unescaped, so use it for URL parts such as paths.
quote_plus(): Similar to quote(), but encodes spaces as + signs and also escapes /. This is the standard encoding for the query string portion of a URL.
unquote() and unquote_plus(): Perform the reverse operations.
from urllib.parse import quote, quote_plus, unquote, unquote_plus
path_segment = "file with spaces&special=.txt"
query_value = "blue & green socks"
print(quote(path_segment)) # 'file%20with%20spaces%26special%3D.txt'
print(quote_plus(query_value)) # 'blue+%26+green+socks'
print(unquote('blue+%26+green')) # 'blue+&+green' # Note the literal '+'
print(unquote_plus('blue+%26+green')) # 'blue & green' # Correctly decodes '+' to space
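Best Practice and Pitfall: The most common mistake is using the wrong function for the wrong part of the URL. quote() leaves / unescaped by default and writes spaces as %20, which suits paths. quote_plus() escapes / and writes spaces as +, matching the form encoding that parse_qs() and most server frameworks expect in query strings. Bugs usually surface on the decoding side: unquote() leaves a + in a query value untouched, as shown above, so form-encoded data must be decoded with unquote_plus(). Use quote_plus() for individual query string values.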
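The difference between the encoders is easiest to see side by side on the same input. The sketch below also assembles a complete URL from separately encoded parts; the host, path segments, and parameter names are assumptions chosen for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, urlencode

name = "reports/2024 Q1.pdf"

print(quote(name))            # 'reports/2024%20Q1.pdf'   ('/' kept by default)
print(quote(name, safe=""))   # 'reports%2F2024%20Q1.pdf' (nothing treated as safe)
print(quote_plus(name))       # 'reports%2F2024+Q1.pdf'   ('/' escaped, space as '+')

# Non-ASCII characters are encoded as their UTF-8 bytes
print(quote("café"))          # 'caf%C3%A9'
print(unquote("%E5%AD%97"))   # '字'

# Assemble a URL: quote each path segment on its own, urlencode() the query
base = "https://www.example.com"  # hypothetical host
segments = ["docs", "My File.txt"]
path = "/" + "/".join(quote(segment, safe="") for segment in segments)
query = urlencode({"q": "blue & green socks", "page": 2})
full_url = f"{base}{path}?{query}"
print(full_url)
# https://www.example.com/docs/My%20File.txt?q=blue+%26+green+socks&page=2
```

Encoding each path segment with safe="" keeps a / inside a file name from being mistaken for a path separator, while urlencode() applies quote_plus() to every key and value for you.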