51.5 Globbing and Finding Files with pathlib
The pathlib module, introduced in Python 3.4, provides an object-oriented approach to filesystem paths and includes methods for finding files through globbing. Unlike the older glob.glob() function, which returns a list of strings, Path.glob() returns a generator that yields Path objects. The results can be consumed lazily, without first building a full list, and they plug directly into the rest of the pathlib API.
The Path.glob() and Path.rglob() Methods
The primary tools for finding files in pathlib are the glob() and rglob() methods. Both methods use the familiar globbing patterns, where * matches any number of characters, ? matches a single character, and [] denotes a character set.
Unless the pattern contains **, glob() does not recurse: each pattern component matches exactly one path level, so '*.txt' matches only entries directly inside the directory that the Path object points to. For a recursive search that traverses all subdirectories, use rglob(pattern), which is equivalent to glob('**/' + pattern) but often more readable.
from pathlib import Path

# Create a Path object pointing to the current directory
current_dir = Path('.')

# Find all .txt files in the current directory (non-recursive)
for txt_file in current_dir.glob('*.txt'):
    print(txt_file)

# Find all .py files recursively in the current directory and all subdirectories
for py_file in current_dir.rglob('*.py'):
    print(py_file)

# Equivalent recursive search using glob with '**'
for py_file in current_dir.glob('**/*.py'):
    print(py_file)
A key behavioral detail is that these methods return lazy iterators, not lists. When searching a directory tree with thousands of files, you can process each result as it is produced instead of holding every match in memory at once. If you need a list, convert the result explicitly (list(my_path.glob('*.py'))), but be mindful of the memory cost on large filesystems. Note also that the iterator can be consumed only once; call glob() again if you need a second pass.
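The lazy behavior described above can be sketched as follows. This is a minimal illustration run against a temporary directory; the helper name first_matches and the note*.txt file names are hypothetical and chosen only for the demonstration.

```python
import tempfile
from collections.abc import Iterator
from itertools import islice
from pathlib import Path


def first_matches(root, pattern, limit):
    """Return at most `limit` paths under `root` that match `pattern`."""
    # islice stops pulling from the iterator once `limit` results are seen
    return list(islice(root.glob(pattern), limit))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i in range(5):
        (root / f"note{i}.txt").touch()

    matches = root.glob("*.txt")
    # The result is an iterator, not a list
    is_iterator = isinstance(matches, Iterator) and not isinstance(matches, list)
    first_two = first_matches(root, "*.txt", 2)

print(is_iterator)     # True
print(len(first_two))  # 2
```

Taking only the first few results this way avoids materializing every match, which is the main practical benefit of the iterator-based interface.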
Using the ** Pattern for Recursion and Advanced Matching
The double-asterisk (**) pattern matches the starting directory and all of its subdirectories, recursively; in other words, it stands for zero or more directory levels. Used as the final component of a pattern, it returns only directories in Python versions before 3.13, and both files and directories from 3.13 onward. Combined with other components, it allows for precise recursive searches. Because ** also matches zero levels, a pattern such as '**/*.py' returns matches in the starting directory itself, which can be surprising if you expected only nested results.
from pathlib import Path

# Find any file named 'config.ini' anywhere within the project directory
project_path = Path('/path/to/project')
config_files = project_path.rglob('config.ini')

# A more complex pattern: find all .jpg files in 'images' and in any of its
# subdirectories. Because '**' matches zero or more levels, files directly
# inside 'images' are included
# (use 'images/*/**/*.jpg' to skip files directly inside 'images')
for img in project_path.glob('images/**/*.jpg'):
    print(img)

# A common pitfall: this matches '.git' in the current dir AND all '.git' directories recursively
all_git_dirs = project_path.glob('**/.git')
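The zero-or-more behavior of ** is easy to check on a small tree. The sketch below builds a hypothetical layout in a temporary directory (the images/2024 folder and the file names are assumptions made for illustration) and compares three patterns.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: one image directly in images/, one nested deeper
    (root / "images" / "2024").mkdir(parents=True)
    (root / "images" / "cover.jpg").touch()
    (root / "images" / "2024" / "beach.jpg").touch()

    def rel(paths):
        """Return sorted POSIX-style paths relative to root."""
        return sorted(p.relative_to(root).as_posix() for p in paths)

    # '**' matches zero or more directories, so cover.jpg is included
    with_top = rel(root.glob("images/**/*.jpg"))
    # Requiring one directory level with '*' before '**' skips images/ itself
    nested_only = rel(root.glob("images/*/**/*.jpg"))
    # rglob(pattern) gives the same results as glob('**/' + pattern)
    same = rel(root.rglob("*.jpg")) == rel(root.glob("**/*.jpg"))

print(with_top)     # ['images/2024/beach.jpg', 'images/cover.jpg']
print(nested_only)  # ['images/2024/beach.jpg']
print(same)         # True
```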
Filtering Results and Combining with Other Methods
Since glob() returns Path objects, you can chain them with other Path methods to create robust, expressive file-finding operations. This is a significant advantage over string-based globbing. You can filter results based on file properties, such as whether the path is a file or a directory, its size, or its modification time.
from datetime import datetime, timedelta
from pathlib import Path

# Find all Python source files (regular files only, not directories named *.py)
python_modules = [p for p in Path.cwd().rglob('*.py') if p.is_file()]

# Find all PDF files larger than 1 MiB
large_pdfs = [p for p in Path('~/Documents').expanduser().glob('**/*.pdf')
              if p.is_file() and p.stat().st_size > 1024 * 1024]

# Find all files modified in the last 7 days
one_week_ago = datetime.now() - timedelta(days=7)
recent_files = [p for p in Path.cwd().rglob('*')
                if p.is_file() and p.stat().st_mtime > one_week_ago.timestamp()]
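When several properties are checked per file, one stat() call can serve all of them, and wrapping it in try/except keeps the search going if a file disappears or cannot be read. The following is a sketch under those assumptions; the function name iter_files_modified_since and the .log file names are hypothetical.

```python
import os
import stat
import tempfile
import time
from pathlib import Path


def iter_files_modified_since(root, pattern, cutoff_timestamp):
    """Yield regular files under root matching pattern with mtime >= cutoff."""
    for path in root.rglob(pattern):
        try:
            info = path.stat()  # one stat call serves both checks below
        except OSError:
            continue  # the file vanished or is unreadable; skip it
        if stat.S_ISREG(info.st_mode) and info.st_mtime >= cutoff_timestamp:
            yield path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old, new = root / "old.log", root / "new.log"
    old.touch()
    new.touch()

    now = time.time()
    ten_days_ago = now - 10 * 24 * 3600
    os.utime(old, (ten_days_ago, ten_days_ago))  # backdate old.log

    cutoff = now - 7 * 24 * 3600
    recent_names = [p.name for p in iter_files_modified_since(root, "*.log", cutoff)]

print(recent_names)  # ['new.log']
```

Writing the filter as a generator function keeps it lazy, so it composes with the iterator returned by rglob() without building intermediate lists.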
Common Pitfalls and Best Practices
Case Sensitivity: By default, pattern matching follows the platform's rules: it is typically case-sensitive on POSIX systems such as Linux and case-insensitive on Windows. A pattern of *.JPG will therefore not find .jpg files on a Linux system. Since Python 3.12, glob() and rglob() accept a case_sensitive keyword argument that overrides the default; on older versions, filter on path.suffix.lower() instead.
Hidden Files (Dotfiles): Unlike the shell and glob.glob(), pathlib does not treat names beginning with a dot (.) specially: the * pattern matches .gitignore or .venv just like any other name. Wildcards never match the special . and .. entries. To skip hidden entries, filter the results yourself, for example by checking whether any path component starts with a dot.
Directory Separators: Always use forward slashes (/) in your glob patterns, even on Windows. The pathlib module automatically handles the translation to the Windows backslash separator, making your code portable.
Error Handling: The glob() method is generally forgiving. If a directory within the search path cannot be accessed (due to permission errors, for example), it will typically skip that directory and continue without raising an exception, though the exact behavior depends on the platform and Python version. Calls you make on the results, such as p.stat(), can still raise OSError, so mission-critical code should add explicit error handling inside its loops.
Performance with rglob('*'): Using rglob('*') or glob('**/*') can be slow on filesystems with deeply nested directories and a huge number of files, because it must visit every entry in the tree. Use the most specific pattern and the narrowest starting directory you can to limit the search scope.
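Portable handling of letter case and dot-prefixed names can be done as a post-filter on the results rather than in the pattern. The sketch below assumes you want suffixes matched regardless of case and any path with a dot-prefixed component skipped by default; the helper find_by_suffix and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_by_suffix(root, suffixes, include_hidden=False):
    """Yield files under root whose suffix matches, ignoring letter case."""
    wanted = {s.lower() for s in suffixes}
    for path in root.rglob("*"):
        rel_parts = path.relative_to(root).parts
        # Skip anything inside, or named as, a dot-prefixed entry
        if not include_hidden and any(part.startswith(".") for part in rel_parts):
            continue
        if path.suffix.lower() in wanted and path.is_file():
            yield path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".cache").mkdir()
    for name in ("a.jpg", "B.JPG", ".hidden.jpg", ".cache/c.jpg", "notes.txt"):
        (root / name).touch()

    visible = sorted(p.name for p in find_by_suffix(root, [".jpg"]))
    with_hidden = sorted(
        p.name for p in find_by_suffix(root, [".jpg"], include_hidden=True)
    )

print(visible)           # ['B.JPG', 'a.jpg']
print(len(with_hidden))  # 4
```

Filtering after the fact costs a full traversal, so it is best combined with a narrow starting directory.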