Preprocessing | mikePietsch.com

12.9 Embedded Methods: LASSO and Tree Feature Importance

Right, so you’ve got your data, you’ve thrown a bunch of features at the wall, and now you’re wondering which ones are actually sticking. But you’re not making spaghetti here; you’re trying to build a damn suspension bridge. This is where embedded methods come in: they’re the smart, multitasking construction crew that builds the bridge and tells you which steel beams are load-bearing and which are just for show. They perform feature selection as part of the model training process itself. No separate step. Efficient. I like it.
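To make "selection as part of training" concrete, here is a minimal LASSO sketch in plain Python. It assumes centered inputs, fits no intercept, and uses textbook coordinate descent with soft-thresholding; the toy data and the `alpha` value are made up for illustration, and in practice you would reach for a library implementation.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; anything inside [-alpha, alpha] becomes exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cycling over coordinates.

    Assumes X and y are already centered, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0  # feature j against the residual that ignores feature j
            z = 0.0    # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w


# Toy, centered data: y is roughly 3 * x0, with only a faint trace of x1.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [-3.1, -2.9, 2.9, 3.1]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
```

The penalty drives the weight on the weak feature to exactly zero, so the fitted model doubles as the feature selector: whatever survives in `selected` is a load-bearing beam.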

12.8 Wrapper Methods: RFE and Sequential Feature Selection

Alright, let’s talk about wrapper methods. You’ve probably been eyeballing your dataset, wondering which features are the real MVPs and which are just dead weight. Filter methods (like correlation scores) are a good first date, but they don’t tell you how features actually behave in a relationship with your specific model. That’s where wrapper methods come in. They’re more demanding—they actually train the model over and over to see which subset of features makes it perform best. It’s computationally expensive, like a high-maintenance partner, but you get a much clearer picture of what works.
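A rough sketch of that "train over and over" loop, using sequential forward selection. The model here is a deliberately tiny stand-in (leave-one-out 1-nearest-neighbour accuracy) and the data is invented; both are assumptions for illustration, not a recommendation.

```python
def loo_1nn_accuracy(X, y, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier using only `cols`."""
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if i == j:
                continue
            d = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def forward_select(X, y, k):
    """Greedily add the feature that most improves the score until k are chosen."""
    selected, remaining = [], list(range(len(X[0])))
    while len(selected) < k:
        best = max(remaining, key=lambda c: loo_1nn_accuracy(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Column 0 separates the classes; column 1 is noise.
X = [[0.0, 5], [0.1, 1], [0.2, 9], [1.0, 2], [1.1, 8], [1.2, 4]]
y = [0, 0, 0, 1, 1, 1]

chosen = forward_select(X, y, k=1)
```

Every candidate subset costs a full round of model evaluation, which is exactly where the computational expense comes from.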

12.7 Filter Methods: Correlation, Chi-Squared, Mutual Information

Right, let’s talk about filtering features. This is where we get to play the role of a bouncer at a club, deciding which variables get past the velvet rope and into your model. The goal is simple: quickly and ruthlessly eliminate the weak, the redundant, and the downright useless before we even think about training. It’s a pre-screening process, and it’s gloriously computationally cheap. Filter methods work by looking at the intrinsic properties of the data, judging each feature on its own individual statistical merit. They don’t care about your specific model algorithm (a Random Forest, a Logistic Regression, etc.). This is both their greatest strength and their most significant weakness. They’re fast and model-agnostic, but they’re also completely oblivious to feature interactions. They’re judging the solo artists, not how well they might play in a band.
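As a minimal illustration of the bouncer at work, the sketch below scores each feature on its own against the target using Pearson correlation and ranks by absolute value. The feature names and numbers are invented for the example.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical features, each judged solo against the target.
features = {
    "signal": [1, 2, 3, 4, 5, 6],
    "inverse": [6, 5, 4, 3, 2, 1],
    "noise": [5, 1, 4, 2, 6, 3],
}
target = [2, 4, 6, 8, 10, 12]

scores = {name: pearson(col, target) for name, col in features.items()}
ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

No model is trained anywhere in this loop, which is why it is cheap, and also why it cannot see two features that only matter in combination.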

12.6 Text Features: TF-IDF, CountVectorizer, Embeddings

Right, let’s talk about turning words into numbers, because your model is a glorified calculator and it doesn’t speak Shakespeare. It speaks vectors. Our job is to translate the messy, beautiful chaos of human language into a tidy spreadsheet of numbers it can actually crunch. We’ve got three main tools for this, and I’ll be honest with you: they range from “simple but surprisingly effective” to “black magic that works suspiciously well.”
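Here is a bare-bones sketch of the first two tools, counts and TF-IDF, in plain Python. It assumes whitespace tokenization and the plain textbook weighting idf = log(N / df); real libraries typically apply smoothing and normalization on top, so their numbers will differ.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the dog barked"]
tokenized = [doc.split() for doc in docs]

# One column per distinct word, in a fixed order.
vocab = sorted({token for doc in tokenized for token in doc})

# Count vectors: how often each vocabulary word appears in each document.
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency and (unsmoothed) inverse document frequency.
df = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log(len(docs) / df[term]) for term in vocab}

# TF-IDF: raw count scaled by how rare the word is across documents.
tfidf = [[count * idf[term] for count, term in zip(row, vocab)] for row in counts]
```

A word like "the" appears in every document, so its weight drops to zero, while a rare word like "cat" ends up carrying most of the signal for its document.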

12.5 Date and Time Feature Extraction

Right, let’s talk about dates and times. Your model doesn’t understand that “January 1st, 2023” is a Sunday, comes after a Saturday, and is a national holiday. It just sees a string or, heaven forbid, an integer. Our job is to translate the rich, contextual information hidden in a timestamp into a language your algorithm can actually use. This isn’t just data cleaning; it’s data archaeology. We’re excavating meaning. The first and most critical rule: never, ever store or use your datetime as a raw string. You’re just asking for pain. The moment you get a new data source with a slightly different format ('01-Jan-2023' vs. '2023/01/01'), your entire pipeline grinds to a halt. Your first line of defense is to parse it into a proper datetime object immediately. In Python, that means datetime.datetime.
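A small sketch of that first line of defense, using only the standard library. The list of accepted formats and the helper names `parse_date` and `date_features` are assumptions for illustration; note that `%b` matches English month abbreviations only under an English or C locale.

```python
from datetime import datetime

# Hypothetical list of formats we expect from upstream sources.
KNOWN_FORMATS = ("%Y/%m/%d", "%d-%b-%Y", "%Y-%m-%d")


def parse_date(raw):
    """Try each known format in turn; fail loudly if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def date_features(dt):
    """Pull out a few calendar features a model can use directly."""
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_week": dt.weekday(),  # Monday=0 ... Sunday=6
        "is_weekend": dt.weekday() >= 5,
    }


a = parse_date("01-Jan-2023")
b = parse_date("2023/01/01")
features = date_features(a)
```

Both strings collapse to the same datetime object, and everything downstream works from that object rather than from whatever format the source happened to use.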

12.4 Binning, Bucketing, and Quantile Transformation

Alright, let’s talk about making your continuous data behave. You’ve got a column like ‘age’ or ‘income’—a stream of endless, unique numbers. Throwing that raw into some models is like handing a toddler a spreadsheet and asking for a regression analysis. It’s messy, it’s inefficient, and frankly, it’s a bit rude to the algorithm. Many models, especially tree-based ones, don’t need this. But for linear models, or if you suspect a non-linear relationship, we need to impose some order. Enter binning, bucketing, and their more sophisticated cousin, the quantile transformation.
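To see the difference between equal-width bins and quantile bins, here is a small standard-library sketch. The ages are made up, and the helper names are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def equal_width_edges(values, n_bins):
    """Interior cut points that split the value range into equally wide bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def quantile_edges(values, n_bins):
    """Interior cut points chosen so each bin holds roughly the same number of rows."""
    return quantiles(values, n=n_bins, method="inclusive")


def assign_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


ages = [18, 22, 25, 31, 38, 45, 52, 90]

width_bins = assign_bins(ages, equal_width_edges(ages, 4))
quantile_bins = assign_bins(ages, quantile_edges(ages, 4))
```

With a single 90-year-old in the data, the equal-width bins crowd most people into the first two buckets and leave one empty, while the quantile bins put two people in each.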

12.3 Polynomial and Interaction Features

Right, let’s talk about making your data more… interesting. You’ve got your nice, neat, linear features. They’re fine. They’re polite. But the real world isn’t polite; it’s messy, curved, and full of relationships where two things together create a third, unexpected thing. That’s where polynomial and interaction features come in. They’re how we take our vanilla dataset and give it a shot of espresso, teaching our linear models to see the world in more than just straight lines.
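A minimal sketch of a degree-2 expansion in plain Python: the original features, plus every square and every pairwise product. The function name is hypothetical, and library versions usually add a bias column as well.

```python
from itertools import combinations_with_replacement


def degree2_features(row):
    """Return the original features followed by all squares and pairwise products."""
    expanded = list(row)
    for i, j in combinations_with_replacement(range(len(row)), 2):
        expanded.append(row[i] * row[j])
    return expanded


# Two inputs [a, b] become [a, b, a*a, a*b, b*b].
expanded = degree2_features([2.0, 3.0])
```

A linear model fitted on the expanded row can now bend (through the squares) and capture "these two together" effects (through the cross term), even though the model itself is still linear in its inputs.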

12.2 Encoding Categorical Variables: One-Hot, Ordinal, Target Encoding

Alright, let’s talk about turning your messy, non-numeric categories into something a model can actually digest. Most machine learning algorithms are, at their heart, just glorified calculators. They love numbers. They dream in matrices. They have no idea what to do with a “red,” “blue,” or “green.” Our job is to translate that categorical gibberish into a numerical dialect they understand, and we’ve got a few primary methods for that. Choose wisely, because this is one of the highest-leverage decisions you’ll make in a project.
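Two of those methods fit in a few lines of plain Python. This is a sketch with hypothetical helper names and toy data; target encoding is left out because doing it safely needs a train/validation split.

```python
def one_hot(values):
    """One 0/1 column per category, with no implied order between categories."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def ordinal_encode(values, order):
    """Map categories to integers following an explicit, meaningful order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


colors = ["red", "blue", "green", "blue"]
categories, color_matrix = one_hot(colors)

sizes = ["small", "large", "medium"]
size_codes = ordinal_encode(sizes, order=["small", "medium", "large"])
```

Colors get one-hot columns because "red" is not greater than "blue"; sizes get integers because the order genuinely means something.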

12.1 Domain-Driven Feature Creation

Alright, let’s get our hands dirty. You’ve got your raw data, and it’s… fine. It’s a start. But if you want your model to do more than just mediocre guesswork, you need to feed it something better. That’s where domain-driven feature creation comes in. This isn’t about blindly applying one-hot encoding and calling it a day. This is the art of using your brain—your understanding of the problem space—to create features that scream the important patterns to your model. It’s the single biggest lever you have to improve performance, and frankly, it’s where the real fun is.

12. Feature Engineering and Selection

4.9 Data Versioning: DVC and LakeFS

Right, let’s talk about the one thing that separates a data science project from a weekend of frantic, soul-crushing hacking: version control. But not for your code. For your data. You’ve been there. You’ve trained a model, gotten a great result, and then… the data changes. A new source, a corrected column, a fresh batch from the client. Suddenly, your brilliant model is a useless pile of matrix multiplication, and you have no idea which version of training_data_final_v2_USE_THIS_one.csv was the one that actually worked. This is why we version data. It’s not just a nice-to-have; it’s your project’s lifeline.

4.8 Data Ethics: Bias in Datasets, Consent, and Privacy

Alright, let’s get our hands dirty with the part of data science nobody puts on their recruitment brochures: ethics. You’re not just building models; you’re building systems that impact real people’s lives, jobs, and freedoms. Screw this up, and you’re not just a bad engineer; you’re a liability. So, let’s do it right.

The Ghost in the Machine: Recognizing Bias in Datasets

Bias isn’t some gremlin that jumps into your dataset; it’s baked in from the start. It’s a reflection of historical and social inequities. Think of it like this: if you train a model to recognize CEOs using a dataset of Fortune 500 company photos, your model will become brilliantly, flawlessly accurate at identifying older white men in suits. It learned the bias perfectly. The technical term for this garbage-in-garbage-out phenomenon is “sample bias,” and it’s everywhere.

4.7 Class Imbalance: Oversampling, Undersampling, and SMOTE

Right, so you’ve built your model, you’re feeling pretty good, and then… it predicts everything as the majority class. 99% accuracy! Fantastic! Except it’s completely useless because you’re trying to find the one fraudulent transaction in a sea of legitimate ones. Welcome to the wonderfully frustrating world of class imbalance. It’s the single biggest party pooper for classification models. They’re desperate to minimize error, and the easiest way to do that is to just always guess the most common outcome. Lazy little things.
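Here is the trap in numbers, followed by a sketch of the simplest fix, random oversampling. The 990-to-10 split is invented toy data and `random_oversample` is a hypothetical helper; SMOTE goes further by synthesizing new minority points instead of repeating existing ones.

```python
import random
from collections import Counter

labels = [0] * 990 + [1] * 10          # 1 = fraud (toy data, 1% positive)
always_majority = [0] * len(labels)    # the "lazy" model: always predict 0

accuracy = sum(p == y for p, y in zip(always_majority, labels)) / len(labels)
caught = sum(p == 1 and y == 1 for p, y in zip(always_majority, labels))
recall = caught / sum(labels)          # share of fraud cases actually found


def random_oversample(labels, seed=0):
    """Return row indices with minority classes resampled up to the majority count."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = max(len(idx) for idx in by_class.values())
    out = []
    for idx in by_class.values():
        out.extend(idx)                                   # keep every original row
        out.extend(rng.choices(idx, k=target - len(idx))) # pad with repeats
    return out


balanced = random_oversample(labels)
balanced_counts = Counter(labels[i] for i in balanced)
```

The lazy model scores 99% accuracy and catches zero fraud. Resample only the training split; duplicating rows before splitting lets copies of the same row land on both sides of the split.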

4.6 Train/Validation/Test Split: Preventing Data Leakage

Right, let’s talk about splitting your data. This is the part where we build the reality-distortion field that lets your model think it’s a genius, while we secretly know the truth: it’s just really good at memorizing the answers to a test it’s already seen. Our job is to prevent that. We’re going to lock the final exam away in a vault until the very end, and we’re going to be ruthless about it.
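A minimal sketch of a three-way split in plain Python. The 70/15/15 proportions, the seed, and the function name are assumptions for illustration, and it presumes rows are independent (time series and grouped data need a different strategy).

```python
import random


def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then carve off test and validation; the rest is training data."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = list(range(100))
train, val, test = three_way_split(rows)
```

Anything you fit (scalers, imputers, encoders, the model itself) should see `train` only; `val` guides tuning, and `test` stays in the vault until the very end.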

4.5 Data Normalization and Standardization

Right, let’s talk about making your data play nice in the sandbox. You’ve collected your numbers, and they’re a mess. One feature is in the millions, another is a decimal between zero and one, and a third is… well, you’re not even sure what unit it’s in. If you feed this glorious disaster directly into most machine learning models, the model will treat the feature with the larger numerical range (the millions) as if it’s the most important thing in the universe. It’s not. It’s just louder. Our job is to make sure each feature gets to speak in a normal, indoor voice so the algorithm can actually listen to the content of what they’re saying, not just who’s shouting the loudest. This is the entire point of normalization and standardization.
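The two standard fixes, sketched in plain Python: min-max normalization squeezes values into [0, 1], and standardization rescales them to zero mean and unit variance. The revenue figures are invented for the example.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale to the [0, 1] range; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


revenue = [1_000_000, 2_000_000, 3_000_000, 5_000_000]
scaled = min_max_scale(revenue)
z_scores = standardize(revenue)
```

In a real pipeline, compute the min, max, mean, and standard deviation on the training split only and reuse them for validation and test data.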

4.4 Outlier Detection and Treatment

Alright, let’s talk about outliers. You’ve got your beautiful, clean dataset, you run a quick describe(), and boom, there it is. max: 4,289,302. The 75th percentile is 82. Your data has a data goblin. That’s an outlier. It’s a data point that’s so far removed from its peers it makes you question reality, your data collection methods, and sometimes, your life choices. These little monsters aren’t just statistical nuisances; they’re the wrecking balls of your analysis. Throw one into a linear regression, and it’ll pull the entire line of best fit towards its own bizarre reality, like a black hole warping spacetime. A simple average? Forget about it. They can single-handedly skew your results into something completely meaningless. Your job is to find them, understand them, and then decide their fate. Do you rehabilitate them? Or do you… well, you know.
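One common way to find the goblin is the interquartile range (IQR) rule: flag anything more than 1.5 IQRs outside the middle 50% of the data. A standard-library sketch with made-up numbers follows; the multiplier `k=1.5` is the usual convention, not a law.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [70, 75, 78, 80, 82, 85, 4_289_302]
outliers = iqr_outliers(data)
```

Flagging is only the first step. Whether the point is a typo to fix, a genuine extreme to keep, or noise to cap still takes a human looking at where it came from.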

4.3 Handling Missing Values: Imputation Strategies

Alright, let’s get our hands dirty. Missing data isn’t an if, it’s a when. You’ll find NaN, None, NA, or just a suspicious-looking empty string staring back at you from the dataset, and your first instinct might be to just drop those rows. Resist it. That’s the data science equivalent of throwing away a puzzle because a single piece is missing. It’s lazy, and it can introduce massive bias into your model. Your model will learn from the data you give it, and if you’ve systematically removed all the records where, say, income was missing (which might correlate with a certain demographic), congratulations, you’ve just built a biased model. So, we’re going to impute—a fancy word for “make an educated guess.”
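The simplest educated guess is the median, sketched below in plain Python. It assumes missing values show up as `None`; the companion flag column is a common trick so the model can still learn from the fact that a value was missing.

```python
from statistics import median


def impute_median(values):
    """Fill None entries with the median of the observed values, and flag them."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing


incomes = [42_000, None, 58_000, 61_000, None]
imputed, was_missing = impute_median(incomes)
```

No rows were dropped, and the `was_missing` flags preserve the pattern of missingness instead of silently erasing it.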

4.2 Exploratory Data Analysis (EDA): Understanding Before Modeling

Right, you’ve got your data. Your first instinct is probably to throw it into the nearest machine learning model and see what sticks. Resist that urge. That’s how you end up with a model that’s spectacularly, hilariously wrong because it learned that “number of ice cream cones sold” is the primary predictor of “homicide rate.” You and I both know the lurking variable is summer heat, but the model doesn’t. It’s just a fancy pattern-matching machine, and without your guidance, it will find the dumbest patterns imaginable.

4.1 Data Collection: Surveys, Sensors, Web Scraping, Synthetic Generation

Alright, let’s talk about getting data. This is where your project goes from a neat idea on a whiteboard to a messy, complicated reality. And that’s a good thing. Real things are messy. Your job is to be the adult in the room who figures out how to handle that mess without just sweeping it under the rug. The first rule of data collection is simple: garbage in, garbage out. You can have the most sophisticated neural network ever conceived by humanity, but if you train it on junk, it will become a masterful, high-performance producer of more sophisticated junk. So let’s get our hands dirty.