4.1 Data Collection: Surveys, Sensors, Web Scraping, Synthetic Generation

Alright, let’s talk about getting data. This is where your project goes from a neat idea on a whiteboard to a messy, complicated reality. And that’s a good thing. Real things are messy. Your job is to be the adult in the room who figures out how to handle that mess without just sweeping it under the rug.

The first rule of data collection is simple: garbage in, garbage out. You can have the most sophisticated neural network ever conceived by humanity, but if you train it on junk, it will become a masterful, high-performance producer of more sophisticated junk. So let’s get our hands dirty.

Surveys: The Art of Not Lying to Yourself

Surveys are the psychological profile of your dataset. They tell you what people say they do, not what they actually do. The gap between those two things is where bad models go to die.

The biggest pitfall here is bias, and it sneaks in everywhere. Leading questions? Bias. “Don’t you agree our new product is fantastic?” is not a question; it’s a plea for validation. Sampling bias? Massive. If you only survey users on your website between 9 and 5 on weekdays, you’ve systematically under-sampled anyone with a conventional day job. Congratulations, your data now over-represents students and the unemployed. Self-selection (voluntary response) bias? You bet. The people who feel most strongly (usually the very angry or the very delighted) are the most likely to respond.

Best practice? Keep it simple, stupid (KISS principle). Ask unambiguous questions. Use Likert scales consistently (e.g., always have 1 be “Strongly Disagree” and 5 be “Strongly Agree”). Pilot your survey on five people first. You’ll immediately spot the questions everyone interprets wrong. And for the love of all that is holy, never make a question required if “Prefer not to say” is a valid and ethical answer.
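Consistency is easier to keep if the scale lives in exactly one place in your code. The sketch below is a minimal illustration, not a prescribed method: the label wording, the `encode` helper, and the `reverse` flag (for negatively worded items) are all assumptions made for the example. Opt-out answers such as “Prefer not to say” are kept as missing values rather than forced onto the scale.

```python
# Hypothetical 5-point scale: 1 is always "Strongly Disagree", 5 is always "Strongly Agree"
LIKERT = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}


def encode(label, reverse=False):
    """Map a label to its 1-5 code; return None for skipped or opt-out answers."""
    score = LIKERT.get(label)
    if score is None:
        return None  # e.g. "Prefer not to say" stays missing
    # Reverse-score negatively worded items so that 5 always means agreement
    # with the underlying idea being measured
    return 6 - score if reverse else score


responses = ["Agree", "Prefer not to say", "Strongly Disagree"]
codes = [encode(r) for r in responses]
print(codes)  # [4, None, 1]
```

Keeping the mapping in a single dictionary means a mislabeled scale is a one-line fix instead of a hunt through every analysis script.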

Sensors: The Real World is Noisy and Brutal

Sensor data is the opposite of surveys: it’s what is, not what people say. But the real world is a hostile environment for clean data. Temperature sensors drift over time. Accelerometers in phones are jostled in pockets. Cameras get dirt on their lenses.

The raw signal is often useless. You need to clean it, and that starts with understanding the physical properties of your sensor. For a simple temperature reading, you might apply a rolling median filter to smooth out absurd, short-lived spikes without losing the overall trend.

import pandas as pd
import numpy as np

# Simulate a temperature sensor reading with some random noise and a sudden, absurd spike
time_index = pd.date_range(start='2023-01-01', periods=200, freq='min')
temperature = np.sin(2 * np.pi * np.arange(200) / 50) + 15  # A nice smooth wave around 15°C
temperature[50] = 100  # A sensor glitch or a bird pooped on it

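# Create a Series and smooth it with a rolling median
sensor_series = pd.Series(temperature, index=time_index)
# Window of 5 readings, centered; min_periods=1 avoids NaN at the two edges
smoothed_series = sensor_series.rolling(window=5, center=True, min_periods=1).median()

# Show raw vs. smoothed values around the glitch at position 50
print(pd.DataFrame({'raw': sensor_series, 'smoothed': smoothed_series}).iloc[48:53])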

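Why a median filter and not a mean? Because a median is robust to outliers. That 100° spike drags a 5-point mean up to about 32°, but the median of the same window is still a perfectly ordinary reading, which is exactly what we want. The caveat: this only works while the glitch is shorter than half the window; a spike lasting three or more consecutive readings would pass straight through a 5-point median. Calibration is also key. Knowing that your sensor consistently reads 2 degrees too high is a fixable problem. Not knowing it is a catastrophe.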
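To make the mean-versus-median point concrete, here is a small sketch on a deliberately flat signal, so the arithmetic is easy to check by hand. The 2-degree offset (`OFFSET_C`) is an assumed value standing in for whatever a real calibration run against a reference thermometer would give you.

```python
import numpy as np
import pandas as pd

# A flat 15 °C signal with one glitch in the middle
readings = pd.Series(np.full(11, 15.0))
readings[5] = 100.0

rolling = readings.rolling(window=5, center=True)
mean_at_spike = rolling.mean()[5]      # (4 * 15 + 100) / 5, about 32
median_at_spike = rolling.median()[5]  # middle of [15, 15, 15, 15, 100] is 15
print(f"mean: {mean_at_spike:.1f}, median: {median_at_spike:.1f}")

# Assumed calibration result: this sensor reads 2 °C too high
OFFSET_C = 2.0
smoothed = readings.rolling(window=5, center=True, min_periods=1).median()
calibrated = smoothed - OFFSET_C
print(f"calibrated value at the glitch: {calibrated.iloc[5]:.1f}")
```

A constant offset is the simplest possible calibration model. Real sensors may need a gain term or a temperature-dependent correction, but the principle is the same: measure the error, then subtract it.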

Web Scraping: Be a Good Citizen of the Internet

Web scraping is the digital equivalent of panning for gold. There’s value in those rivers of HTML, but you can also really mess up the ecosystem if you’re not careful.

First, the ethical part: check robots.txt. It’s a file at the root of a website (e.g., example.com/robots.txt) that tells automated agents which paths they may and may not crawl. It is advisory, not enforced, which is exactly why ignoring it is bad form. Second, pace your requests. Hammering a server with 100 requests per second is a good way to get your IP banned and potentially cause a denial-of-service. Use delays. Be polite.
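Python’s standard library can do the robots.txt check for you. The sketch below parses a made-up robots.txt from a string so that it runs without touching the network; the rules, the `/private/` path, and the bot name are assumptions for illustration. Against a live site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "MyResearchBot"
allowed_public = parser.can_fetch(AGENT, "https://example.com/listings")
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/report")
print(allowed_public, allowed_private)

# Honor the site's requested delay if it states one; otherwise fall back to 1 second
delay = parser.crawl_delay(AGENT) or 1.0
print(f"Wait {delay} seconds between requests")
```

Run the check once per site at startup, then pass `delay` to `time.sleep` between requests.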

Technically, it’s about fetching pages and parsing HTML, which is a notoriously messy format. Use robust libraries: requests to fetch the page and BeautifulSoup to parse it.

import requests
from bs4 import BeautifulSoup
import time

url = 'https://example.com/listings'

# Always set a user-agent header that identifies your bot responsibly
headers = {'User-Agent': 'MyResearchBot/1.0 (contact@myemail.com)'}

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raises an exception for bad HTTP status codes (4xx or 5xx)

    soup = BeautifulSoup(response.text, 'html.parser')
    listings = soup.find_all('div', class_='listing')  # Find all divs with class 'listing'

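    for listing in listings:
        heading = listing.find('h2')
        if heading is None:
            continue  # Skip listings without a title instead of crashing
        title = heading.get_text(strip=True)
        print(title)
        # ... extract more data ...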

except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")

# Be nice. Wait a second before potentially making another request.
time.sleep(1)

The try/except block and timeout are crucial. The internet is unreliable. Servers go down, connections hang. Your scraper should handle these gracefully, not just crash. And always, always assume the website’s structure will change next Tuesday, breaking your entire script. It’s not a matter of if, but when.
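One common way to handle flaky connections gracefully is to retry with an increasing delay. The helper below is a sketch under stated assumptions, not part of requests or any other library: `fetch_with_retries`, its parameters, and the `flaky_fetch` stand-in are all hypothetical names. In a real scraper, `fetch` would wrap the `requests.get` call and `retry_on` would be `requests.exceptions.RequestException`.

```python
import time


def fetch_with_retries(fetch, attempts=3, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on as exc:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller see the error
            wait = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait:.0f}s")
            sleep(wait)


# Stand-in for a network call that fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "<html>ok</html>"


waits = []  # Record the delays instead of actually sleeping in this demo
result = fetch_with_retries(flaky_fetch, retry_on=(ConnectionError,), sleep=waits.append)
print(result, waits)  # <html>ok</html> [1.0, 2.0]
```

Passing `sleep` in as a parameter keeps the demo instant and makes the helper easy to test; in production you would leave it at the default `time.sleep`.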

Synthetic Data: Building a Perfect, Imaginary World

Sometimes, the data you need doesn’t exist. Or it’s too sensitive to use. Or you need to test an edge case that happens once in a million events. Enter synthetic data.

This isn’t about just making up random numbers. Good synthetic data mimics the statistical properties and relationships of the real thing. For quick experiments, you might use a library like scikit-learn to create a dataset with a known, controlled structure for testing a classifier.

from sklearn.datasets import make_classification

# Generate a synthetic dataset with 1000 samples and 5 features:
# 2 informative, 1 redundant (a linear combination of the informative ones), and 2 pure noise.
# This creates a dataset that has a real, but artificial, structure.
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=2,
    n_redundant=1,
    n_clusters_per_class=1,
    random_state=42  # So the result is reproducible
)

# X is now our feature matrix, y are the labels.
print(f"Feature matrix shape: {X.shape}")
print(f"First few labels: {y[:10]}")

The huge advantage here is control and privacy. The huge danger is that your synthetic data is too perfect. It might lack the subtle, weird correlations and noise of the real world, meaning a model that performs flawlessly on synthetic data will face-plant the moment it sees reality. Use it for initial development, stress-testing, and when privacy is paramount, but always validate on a small set of real data before you declare victory.
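When you do have a small real sample, one simple way to “mimic the statistical properties” is to estimate its mean and covariance and draw new rows from a matching Gaussian. The sketch below assumes that approach. The “real” sample here is itself simulated, purely so the example runs on its own, and the two columns (height, weight) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a small real sample (simulated here): height in cm, weight in kg
real = rng.multivariate_normal(
    mean=[170.0, 70.0],
    cov=[[60.0, 40.0], [40.0, 90.0]],
    size=500,
)

# Estimate the sample's mean vector and covariance matrix...
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# ...and draw a larger synthetic dataset with the same first and second moments
synthetic = rng.multivariate_normal(mu, sigma, size=5000)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real corr: {real_corr:.2f}, synthetic corr: {synth_corr:.2f}")
```

This is also a neat demonstration of the danger described above: a Gaussian fit reproduces means and linear correlations and nothing else. Skew, heavy tails, and nonlinear relationships in the real data simply vanish.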