82.7 Input Validation: Preventing Injection Attacks
Right, let’s talk about input validation. This is where we stop being polite and start getting real. Many software vulnerabilities aren’t born from complex zero-day exploits; they’re born from a simple, almost naive trust that the user will send us exactly what we expect. They won’t. They’ll send you ' OR '1'='1'-- because some blog post from 2003 told them to. Your job is to treat every single byte of input from the outside world (users, APIs, files, network requests, even the system clock) as hostile until proven otherwise. This isn’t paranoia; it’s the default setting for a professional.
The granddaddy of all input validation failures is the injection attack. The concept is laughably simple: an attacker tricks your application into interpreting their malicious data as part of your code. It’s like someone yelling “AND I NOW DECLARE MYSELF THE KING!” in the middle of your carefully planned parliamentary procedure, and your program just goes, “Well, I guess we have a new king now. All hail the king!”
The SQL Injection Classic (And How to Kill It)
The classic example is SQL Injection. You’ve probably seen the scary demo. You have a login form that builds a query like this:
# This is TERRIBLE. Never do this.
username = get_form_data('username')
password = get_form_data('password')
sql = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
If I enter admin'-- as the username, that query becomes:
SELECT * FROM users WHERE username = 'admin'--' AND password = ''
Congratulations, I just logged in as admin: the -- starts a SQL comment, so the database never sees the password check. This is the equivalent of leaving your house key under the doormat with a sign that says “KEY UNDER HERE.”
The solution isn’t to try to escape the quotes or filter out dashes. That’s a whack-a-mole game you will lose. The solution is parameterized queries (also called prepared statements). This mechanism sends the SQL code and the data separately, so they can never be confused.
# This is the way.
import sqlite3
conn = sqlite3.connect('app.db')
cursor = conn.cursor()
username = get_form_data('username')
password = get_form_data('password')
# The ? are placeholders. The database engine knows they are data, not code.
# (A real system would compare a password hash, never plaintext.)
sql = "SELECT * FROM users WHERE username = ? AND password = ?"
cursor.execute(sql, (username, password))  # Values are bound separately; nothing is spliced into the SQL.
user = cursor.fetchone()
Why does this work? Because the database driver receives the query structure (SELECT ... WHERE username = ? ...) and the data ('admin\'--', 'hunter2') separately. It knows the data is always data: even if it contains SQL keywords, quotes, or semicolons, it is treated as a harmless string. One caveat: placeholders bind values only, so table and column names cannot be parameterized and must come from a fixed list you control. Use your language’s database library’s parameterized queries. Always. No exceptions.
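To see the difference rather than take it on faith, here is a minimal sketch that runs both query styles against a throwaway in-memory SQLite database. The helper names (build_demo_db, login_vulnerable, login_parameterized) and the table layout are made up for this demo, and the plaintext password column only mirrors the example above; a real system would store password hashes.

```python
import sqlite3

def build_demo_db():
    # In-memory database with a single hypothetical user.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES (?, ?)", ("admin", "hunter2"))
    return conn

def login_vulnerable(conn, username, password):
    # String interpolation: the input becomes part of the SQL text.
    sql = (f"SELECT username FROM users "
           f"WHERE username = '{username}' AND password = '{password}'")
    return conn.execute(sql).fetchone()

def login_parameterized(conn, username, password):
    # Placeholders: the input is bound as a value, never parsed as SQL.
    sql = "SELECT username FROM users WHERE username = ? AND password = ?"
    return conn.execute(sql, (username, password)).fetchone()

conn = build_demo_db()
payload = "admin'--"
print(login_vulnerable(conn, payload, ""))     # a row comes back: login bypassed
print(login_parameterized(conn, payload, ""))  # None: no user is literally named admin'--
```

The parameterized version still logs in the real user with the real password; it simply refuses to let the username rewrite the query.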
Beyond SQL: Command Injection
SQL isn’t the only language you can inject into. What if your app uses user input to build a system command? Let’s say you have a feature that pings a host provided by the user.
# This is also TERRIBLE. A monument to bad ideas.
import os
hostname = get_form_data('hostname')
os.system(f"ping -c 4 {hostname}") # Oh dear.
What if I set hostname to google.com; rm -rf /? You just asked your server to run ping -c 4 google.com; rm -rf /. The semicolon ends the first command, and the shell happily executes the second one. This is command injection, and it’s often game over.
The fix? Avoid building shell commands with user input at all. Use language-specific APIs that allow you to pass the command and arguments separately.
# A much safer approach.
import subprocess
hostname = get_form_data('hostname')
# The key is 'shell=False' (the default) and passing a list of arguments.
# The subprocess module handles the execution without involving the shell.
try:
    # Validate the input first! Is it a valid hostname?
    if not is_valid_hostname(hostname):  # You need to write this function.
        raise ValueError("Invalid hostname")
    # Now run the command safely.
    result = subprocess.run(['ping', '-c', '4', hostname],
                            capture_output=True,
                            text=True,
                            timeout=30)  # Always set a timeout!
    print(result.stdout)
except subprocess.TimeoutExpired:
    print("Ping request timed out.")
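This works because subprocess.run with a list of arguments doesn’t invoke the shell. It calls the ping executable directly and passes hostname as a single argument. Even if hostname were google.com; rm -rf /, it would be treated as one long, invalid hostname string for the ping command, not as a shell instruction. The validation step still matters: even without a shell, a value that begins with a dash (such as -f) could be read by ping as an option rather than a host.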
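The is_valid_hostname function is left as an exercise above, so here is one possible sketch, not the definitive version. It assumes you only want to accept plain DNS-style names (and dotted IPv4 addresses, which happen to fit the same shape): labels of 1 to 63 letters, digits, or hyphens, no leading or trailing hyphen, and at most 253 characters overall. IPv6 literals and names with a trailing dot are rejected under these assumptions.

```python
import re

# One DNS label: 1-63 characters, letters/digits/hyphens,
# and it may not start or end with a hyphen.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_valid_hostname(hostname):
    # Reject non-strings, empty strings, and anything over the DNS length limit.
    if not isinstance(hostname, str) or not 1 <= len(hostname) <= 253:
        return False
    # Every dot-separated label must match the allowlisted shape in full.
    return all(_LABEL.fullmatch(label) for label in hostname.split("."))

print(is_valid_hostname("google.com"))
print(is_valid_hostname("google.com; rm -rf /"))
print(is_valid_hostname("-f"))
```

Because no label may start with a hyphen, this check also shuts out values that ping would otherwise parse as command-line options.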
The Validation Mindset: Allowlisting vs. Blocklisting
Your default strategy should always be allowlisting (accepting only what is known to be good), not blocklisting (rejecting what is known to be bad). The list of bad things is infinite and always changing. The list of valid inputs for a specific field is usually finite and knowable.
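As a small illustration of allowlisting, consider a sort option that ends up in an ORDER BY clause. SQL placeholders can bind values but not column names, so the only safe move is to check the input against a fixed set. The names below (ALLOWED_SORT_COLUMNS, build_order_by, and the users columns) are hypothetical, chosen just for this sketch.

```python
# The complete set of columns a caller may sort by. Anything else is rejected.
ALLOWED_SORT_COLUMNS = {"username", "created_at", "email"}

def build_order_by(sort_column):
    # Allowlist check: membership in a known-good set, not a hunt for bad characters.
    if sort_column not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"Unsupported sort column: {sort_column!r}")
    # Safe to interpolate only because the value is now one of our own constants.
    return f"SELECT username FROM users ORDER BY {sort_column}"

print(build_order_by("email"))
```

Notice that the function never has to reason about quotes, semicolons, or comments. It only has to know its own three column names.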
Need a phone number? Don’t try to filter out SQL fragments. Use a regular expression (regex) to validate it against the pattern of a valid phone number and reject anything that doesn’t match.
import re

def validate_phone_number(input_string):
    # A simple example: North American format. Adjust for your needs.
    pattern = r'\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
    # fullmatch requires the whole string to match; with re.match, a '$' anchor
    # would still accept a trailing newline.
    if re.fullmatch(pattern, input_string):
        return input_string  # Maybe sanitize by removing non-digits for storage
    else:
        raise ValueError("Invalid phone number format")
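The "sanitize for storage" idea in that comment could look something like the sketch below. It is one possible approach, assuming you want to store a bare ten-digit national number; normalize_phone_number and PHONE_PATTERN are illustrative names. The re.ASCII flag is there because, by default, \d in Python also matches non-ASCII Unicode digits, which is probably not what a North American phone field should accept.

```python
import re

# Same North American shape as above, compiled with re.ASCII so that
# \d means 0-9 only and \s means ASCII whitespace only.
PHONE_PATTERN = re.compile(
    r'\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', re.ASCII)

def normalize_phone_number(input_string):
    # Validate first: the entire string must fit the allowlisted pattern.
    if not PHONE_PATTERN.fullmatch(input_string):
        raise ValueError("Invalid phone number format")
    # Then strip everything that is not a digit.
    digits = re.sub(r"\D", "", input_string)
    # Keep the last ten digits, dropping the optional leading country code 1.
    return digits[-10:]

print(normalize_phone_number("(555) 123-4567"))
print(normalize_phone_number("+1 555-123-4567"))
```

Validating before normalizing matters here: stripping characters first would quietly turn garbage into something that merely looks valid.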
For more complex data like XML or JSON, don’t roll your own parser. Use a well-established, secure library that doesn’t have features like external entity expansion enabled by default (looking at you, old XML libraries).
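For JSON, that usually means letting the standard library do the parsing and then checking the shape of the result yourself. The sketch below assumes a hypothetical signup payload with exactly two fields, username and age; the function name and the limits are illustrative, not a recommendation for any particular schema.

```python
import json

def parse_signup_payload(raw_text):
    # Step 1: let the standard library parse; never hand-roll this part.
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError("Body is not valid JSON") from exc
    # Step 2: allowlist the overall shape: an object with exactly these keys.
    if not isinstance(data, dict) or set(data) != {"username", "age"}:
        raise ValueError("Expected an object with exactly 'username' and 'age'")
    username, age = data["username"], data["age"]
    # Step 3: allowlist each field's type and range.
    if not isinstance(username, str) or not 1 <= len(username) <= 32:
        raise ValueError("username must be a string of 1-32 characters")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, int) or not 0 <= age <= 150:
        raise ValueError("age must be an integer between 0 and 150")
    return {"username": username, "age": age}

print(parse_signup_payload('{"username": "alice", "age": 30}'))
```

Well-formed JSON is not the same as valid input: the parser only tells you the syntax is fine, and the checks afterwards decide whether the content is acceptable.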
The core principle is this: define exactly what “good” input looks like for your specific context, validate ruthlessly against that definition as early as possible, and then—and only then—trust the data enough to use it. It’s the digital equivalent of a good bouncer: they have a list, and if you’re not on it, you don’t get in, no matter how convincing your story about being friends with the owner.