Notes on

Hands-on Machine Learning With Scikit-Learn, Keras, and TensorFlow

by Aurélien Géron



The author has provided a fantastic GitHub repository containing notebooks for each of the chapters and more. I’d highly recommend you also take a look there!

As I wrote in my review: this is a fantastic book.
If you’re new to AI/ML, I can highly recommend it.
Note that TensorFlow has become less popular in the industry. Around the time I reached neural networks, I switched over to using PyTorch while following this book.
Using PyTorch while reading this book is also good practice: you can’t just copy over the code and passively follow along.

You can find some of my code here: chhoumann/ml.

Chapter 1 The Machine Learning Landscape

What Is Machine Learning?

The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample). The part of a machine learning system that learns and makes predictions is called a model.

Training set = examples used.
Training instance / sample = each training example.
Model = what makes the predictions.

Why Use Machine Learning?

Why even use machine learning?

A key advantage of machine learning over traditional programming is easily illustrated by considering a spam filter (which was also one of the first uses!).
Without machine learning, here are the steps you’d likely need to take:

  1. Study the problem
  2. Write rules
  3. Evaluate
  4. Launch / analyze errors

But since the problem is hard, step 2 is incredibly time-consuming, and you’d likely end up with a long list of complex rules.

This is what ML can do for you. Swap out step 2 with “train ML model” on some data, and it’ll handle the classification of spam for you!
And if the spammers continuously adapt their methods, you simply automate the training, evaluation, launch, and data updates. Then you’ll keep blocking their ‘improved’ spam mails without ever having to write new rules by hand.

But the model itself can also be inspected to gain a better understanding of the problem. A spam filter would be able to reveal unsuspected correlations or new trends. Digging into large amounts of data to discover hidden patterns is called data mining.

So, ML is great for:

  • Problems where existing solutions require a lot of fine-tuning or long lists of rules.
  • Complex problems where traditional approaches don’t give good solutions – perhaps ML techniques can.
  • Fluctuating environments, because the ML system can be retrained on new data, and therefore always be kept up to date.
  • Getting insights about complex problems and large amounts of data.

Types of Machine Learning Systems

There are many types, so we classify them in broad categories based on the following criteria:

  • How they are supervised during training (supervised, unsupervised, semi-supervised, self-supervised, and others)
  • Whether they can learn incrementally on the fly (Online Learning vs. Offline Learning)
  • Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model (instance-based vs. model-based learning)

These aren’t exclusive, and can be combined in any way you want.
E.g. a state-of-the-art (SotA) spam filter might learn on the fly using a deep neural network model trained on human-provided examples of spam and ham. So it would be an online, model-based, supervised learning system.

The words target and label are generally treated as synonyms in supervised learning, but target is more common in regression tasks and label is more common in classification tasks. Moreover, features are sometimes called predictors or attributes. These terms may refer to individual samples (e.g., “this car’s mileage feature is equal to 15,000”) or to all samples (e.g., “the mileage feature is strongly correlated with price”).

  • Attribute = data type (e.g. “Mileage”)
  • Feature has several meanings, but often means attribute + its value (e.g. “Mileage = 15,000”).

Visualization algorithms are also good examples of unsupervised learning: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted

Visualization algorithms take unlabeled data and give a 2D or 3D representation that is easy to plot.
They try to keep structure, so you can understand how the data is organized.

A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be strongly correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.

Dimensionality reduction: simplify data without losing too much information.
For example, you can merge several correlated features into one. This is called feature extraction.

Yet another important unsupervised task is anomaly detection—for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm.

The system is shown mostly normal instances during training, so it learns to recognize them; then, when it sees a new instance, it can tell whether it looks like a normal one or whether it is likely an anomaly (see Figure 1-10).

A very similar task is novelty detection: it aims to detect new instances that look different from all instances in the training set.

This requires having a very “clean” training set, devoid of any instance that you would like the algorithm to detect

Anomaly detection seeks to flag instances that look different from the (mostly normal) instances seen during training. Novelty detection is similar, but it assumes a perfectly clean training set and tries to detect new instances that look different from everything in it.
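A minimal sketch of anomaly detection with scikit-learn’s IsolationForest (my own illustration; the book introduces the concept without code here):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal_data = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # mostly "normal" instances
new_points = np.array([[0.1, -0.2], [8.0, 8.0]])              # one normal-looking, one obvious outlier

detector = IsolationForest(random_state=42).fit(normal_data)
print(detector.predict(new_points))  # 1 = looks normal, -1 = likely an anomaly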

Finally, another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes.

For example, suppose you own a supermarket. Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak.

Thus, you may want to place these items close to one another.

Association rule learning is about finding relations between attributes.

E.g. you’d use it on sales logs to find that people who bought X and Y also buy Z. So you place those items close to each other.

Semi-supervised learning

Where you deal with partially labeled data.

Most of these algorithms are combinations of unsupervised & supervised algorithms.

This is what Google Photos does when you upload images. It recognizes person A in some photos, person B in others (that’s the unsupervised part: it clusters). Then you tell it who those people are – one label per person, and it’ll name everyone in every photo.

Self-supervised learning

Another approach to machine learning involves actually generating a fully labeled dataset from a fully unlabeled one.
Again, once the whole dataset is labeled, any supervised learning algorithm can be used.
This approach is called self-supervised learning.

For example, if you have a large dataset of unlabeled images, you can randomly mask a small part of each image and then train a model to recover the original image (Figure 1-12).
During training, the masked images are used as the inputs to the model, and the original images are used as the labels.

Self-supervised learning is generating a fully labeled dataset from a fully unlabeled one.

For example, given a large dataset of unlabeled images, you might mask a random small part of each image and train a model to recover the original image. The inputs are the masked images and the original images are used as labels.
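A minimal sketch (my own, with hypothetical shapes) of turning unlabeled images into (input, label) pairs by masking a random patch:

import numpy as np

rng = np.random.default_rng(0)
images = rng.random((32, 64, 64, 3))  # pretend batch of unlabeled 64x64 RGB images

def mask_random_patch(image, patch=16):
    corrupted = image.copy()
    top = rng.integers(0, image.shape[0] - patch)
    left = rng.integers(0, image.shape[1] - patch)
    corrupted[top:top + patch, left:left + patch, :] = 0.0  # zero out a random square
    return corrupted

inputs = np.stack([mask_random_patch(img) for img in images])  # model inputs (masked images)
labels = images                                                # training targets (original images)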

The model you get here can be useful, but is often not the final goal. You might want to tweak and fine-tune the model for a slightly different task.

So following the example before, if you really want a pet species classification model:

  • Have large dataset of unlabeled photos of pets
  • Train image-repairing model with self-supervised learning
  • When it performs well, it should be able to distinguish pet species – e.g., it knows not to add a dog’s face to an image of a cat.
  • Assuming your model’s architecture allows it (most NN architectures do), tweak the model so it predicts pet species instead of repairing images.
  • Fine-tune the model on a labeled dataset – it already knows the difference, but it needs to learn the mapping between the species it knows and the labels we expect.

Transferring knowledge from one task to another is called transfer learning, and it’s one of the most important techniques in machine learning today, especially when using deep neural networks (i.e., neural networks composed of many layers of neurons).
We will discuss this in detail in Part II.

Transfer learning = transferring knowledge from one task to another. Great with deep neural networks.

Reinforcement learning

Reinforcement learning is a very different beast.

The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as shown in Figure 1-13).

It must then learn by itself what is the best strategy, called a policy, to get the most reward over time.

A policy defines what action the agent should choose when it is in a given situation.

Reinforcement learning agents observe the environment, select and perform actions, and get rewards in return (or penalties, which are just negative rewards)

It learns by itself what the best strategy is, which is called a policy, to maximize reward over time.
Policies define what actions the agent should take in given situations.

  1. Observe
  2. Select action using policy
  3. Take action
  4. Get reward or penalty
  5. Update policy (learning step)
  6. Iterate until optimal policy is found
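A tiny, self-contained sketch of that loop on a toy two-armed bandit (my own illustration, not from the book); the “policy” is just an estimated average reward per action, chosen epsilon-greedily:

import numpy as np

rng = np.random.default_rng(42)
true_reward = [0.3, 0.7]      # hidden payout probability of each action
value_estimate = [0.0, 0.0]   # the agent's learned action values (its policy)
counts = [0, 0]

for step in range(1000):
    # observe / select an action with an epsilon-greedy policy
    action = rng.integers(2) if rng.random() < 0.1 else int(np.argmax(value_estimate))
    # take the action and get a reward (1) or nothing (0)
    reward = 1.0 if rng.random() < true_reward[action] else 0.0
    # update the policy (learning step): running average of observed rewards
    counts[action] += 1
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print(value_estimate)  # should approach [0.3, 0.7], so the agent learns to prefer action 1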

A good example is DeepMind’s AlphaGo, which beat the Go world champion Ke Jie in May 2017.

Batch Versus Online Learning

Batch Learning: the system cannot learn incrementally, so it must be trained on all available data. This takes a lot of time and computing resources, so it’s typically done offline. It may even be impossible to use batch learning if the dataset is too large. And if your system needs to learn autonomously and has limited resources (e.g. a Mars rover), it can’t carry around lots of training data and spend resources training for hours a day. In such cases, you’d want the incremental learning capabilities of an online learning system.

Offline learning would be where the system is trained first, then launched into production where it runs without learning anymore – it just applies what it has learned.

Online learning is where you train the system incrementally by feeding it data sequentially, either individually or in small groups called mini-batches. Each step is fast and cheap, so the system can learn on the fly. That is, it can learn while being deployed.

Online learning algorithms can be used to train models on huge datasets that cannot fit in one machine’s main memory (this is out-of-core learning, which is usually done offline, i.e. not on the live system, so it’s really more like incremental learning).
The algorithm loads part of the data, runs a training step on that data, and repeats until it’s finished with all the data.
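Here’s a minimal sketch of that loop, using scikit-learn’s SGDRegressor and its partial_fit() method on synthetic placeholder chunks (my own illustration, not the book’s code):

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01)  # eta0 plays the role of the learning rate
rng = np.random.default_rng(42)

for _ in range(100):                        # pretend each iteration loads one chunk from disk
    X_chunk = rng.random((32, 3))
    y_chunk = X_chunk @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=32)
    model.partial_fit(X_chunk, y_chunk)     # one fast, cheap training step per chunk

print(model.coef_)  # learned coefficients, moving toward [2.0, -1.0, 0.5]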

Online learning systems have an important parameter that determines how fast they adapt to changing data. This is called the learning rate. If it’s high, the system rapidly adapts to new data, and therefore quickly forgets old data. If it’s low, the system has more inertia: it learns more slowly, but is also less sensitive to noise or outliers.

Unfortunately, a model’s performance tends to decay slowly over time, simply because the world continues to evolve while the model remains unchanged.

This phenomenon is often called model rot or data drift.

The solution is to regularly retrain the model on up-to-date data.

How often you need to do that depends on the use case: if the model classifies pictures of cats and dogs, its performance will decay very slowly, but if the model deals with fast-evolving systems, for example making predictions on the financial market, then it is likely to decay quite fast.

A consequence of using batch (offline) learning is that the model’s performance decays over time due to the world evolving, while the model stays the same.

This is called model rot or data drift.
The solution is to regularly retrain the model on up-to-date data. How often depends on the use case.

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate.

If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data (and you don’t want a spam filter to flag only the latest kinds of spam it was shown).

Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points (outliers).

What learning rate is, and why it matters.

Having a high learning rate means your system learns quickly and adapts quickly to new data, but also quickly forgets old data.
Setting a low learning rate means the system has more inertia (learns slowly), but it’s also less sensitive to noise or outliers.

A big challenge with online learning is that if bad data is fed to the system, the system’s performance will decline, possibly quickly (depending on the data quality and learning rate).

If it’s a live system, your clients will notice.

For example, bad data could come from a bug (e.g., a malfunctioning sensor on a robot), or it could come from someone trying to game the system (e.g., spamming a search engine to try to rank high in search results).

To reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance.

You may also want to monitor the input data and react to abnormal data; for example, using an anomaly detection algorithm.

A consequence of using online learning is that, if bad data is fed to the system, the system’s performance suffers.

This can happen quickly, depending on data quality and learning rate.

If your customers rely on this system, they’ll notice.

It could come from spam, malfunctioning sensors, etc.

Monitor your system and switch learning off if you see a drop in performance. You can also monitor the input data and react to abnormal data.

Instance-Based Versus Model-Based Learning

We also categorize ML systems by how they generalize, that is, how they make predictions on examples they haven’t seen before.

By convention, the Greek letter θ (theta) is frequently used to represent model parameters.

Nice to know: θ represents model parameters.

Instance-based learning

This is where your system learns some set of known points, and generalizes new data by comparing them to the known data using a similarity measure.

Model-based learning and a typical machine learning workflow

This is where you generalize from examples by building a model of them, and then using that model to make predictions.

How can you know which values will make your model perform best?

To answer this question, you need to specify a performance measure.

You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is.

For linear regression problems, people typically use a cost function that measures the distance between the linear model’s predictions and the training examples; the objective is to minimize this distance.

This is where the linear regression algorithm comes in: you feed it your training examples, and it finds the parameters that make the linear model fit best to your data.

This is called training the model.

You can define a utility function (or fitness function) that measures how good your model is, or a cost function that measures how bad it is.

When doing linear regression problems, people usually use cost functions that measure the distance between the linear model’s predictions and the training examples (and try to minimize it).

You train your Linear Regression model by giving it training examples, such that it can find the parameters (weights, biases) that it needs to best fit the data.

Linear regression, of course, is model-based. You could instead use k-nearest neighbors (kNN) to make it instance-based: given a new input, it finds the closest known instances and predicts based on those.
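A minimal sketch of the contrast on toy data (my own example, not the book’s life-satisfaction dataset); the two models expose the same fit/predict API but generalize differently:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one feature
y = np.array([1.2, 1.9, 3.2, 3.9, 5.1])             # targets

model_based = LinearRegression().fit(X, y)                     # learns parameters (slope, intercept)
instance_based = KNeighborsRegressor(n_neighbors=2).fit(X, y)  # memorizes the training instances

X_new = [[2.5]]
print(model_based.predict(X_new))     # prediction from the fitted line
print(instance_based.predict(X_new))  # average of the 2 nearest neighbors' targets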

you applied the model to make predictions on new cases (this is called inference)

Inference = applying the model to make predictions on new cases

Main Challenges of Machine Learning

Can be summarized and classified as two problems:

  1. Bad algorithm
  2. Bad data

Insufficient Quantity of Training Data

The Unreasonable Effectiveness of Data

This paper by Peter Norvig et al. showed that, for some problems, more data matters more than better algorithms. But of course, getting more data can be hard, so we shouldn’t abandon improving algorithms just yet.

Nonrepresentative Training Data

If your data isn’t representative of the cases you want to generalize to, you’re in for a bad time.

It is crucial to use a training set that is representative of the cases you want to generalize to.

This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed.

This is called sampling bias.

In your quest to get representative cases, you’re struck with more problems. If you sample too little, you’ll have to deal with sampling noise, which is nonrepresentative data purely by chance.
But even very large samples can be nonrepresentative if the sampling method is flawed; this is sampling bias.

Poor-Quality Data

Solve this by fixing errors, discarding outliers, either ignoring or filling in missing values, etc.

Irrelevant Features

A big part of ML projects is Feature Engineering, which is to come up with good features to train on. The process involves:

  • Feature selection, which is where you select which features to train on.
  • Feature extraction, which is where you combine existing features to produce a more useful one (e.g., via dimensionality reduction).
  • Creating new features by gathering new data.

Overfitting the Training Data

This is an example of ‘bad algorithms’.

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data.
Here are possible solutions:

  • Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.
  • Gather more training data.
  • Reduce the noise in the training data (e.g., fix data errors and remove outliers).

Solutions to overfitting:

  • Simplify the model by selecting one with fewer parameters
  • Get more training data
  • Reduce noise in data - fix data errors, remove outliers, etc.

Constraining a model to reduce overfitting is called regularization. If a model has two parameters, it gives the algorithm two degrees of freedom to adapt the model to the training data.

The amount of regularization to apply during learning can be controlled by a hyperparameter.

A hyperparameter is a parameter of a learning algorithm (not of the model).

As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training.

If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution.

Tuning hyperparameters is an important part of building a machine learning system.

The amount of regularization to apply during learning is controlled by a hyperparameter.
That is a parameter of a learning algorithm. It’s set prior to training.
If you set it to a large value, you’ll get an almost flat model (almost certainly no overfitting, but probably a worse solution).

Tuning these is a big part of building an ML system.
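A minimal sketch of the effect described above, using scikit-learn’s Ridge (regularized linear regression) on toy data of my own; alpha is the regularization hyperparameter:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.random((20, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.3, size=20)

for alpha in (0.01, 1.0, 1000.0):
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)  # a huge alpha pushes the slope toward 0, i.e. an almost flat model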

Underfitting the Training Data

This is the opposite of overfitting. Solve it by:

  • Selecting a more powerful model, with more parameters.
  • Using better features (doing feature engineering).
  • Reducing the constraints on the model, e.g. by reducing the regularization hyperparameter.

Testing and Validating

This is very important. You don’t just want to train a model and hope it’s good at generalizing new cases. You want to evaluate it.

This is done by trying it on new cases.
And instead of just deploying your model to production, you split your data set into a training set and a test set.

The error rate on new cases is the generalization error (or out-of-sample error).
Your error value tells you your model’s performance on instances it has never seen before.

If your training error is low, but generalization error is high, you’re overfitting.

Hyperparameter Tuning and Model Selection

How do you decide between models?
This is what you do in Model Selection.

You can train the models you’re thinking about and compare their performance.

And now that you’ve selected a model, you’ll want to do some regularization to avoid overfitting. So you do Hyperparameter Tuning.
For this, you could train 100 different models using 100 different values for the hyperparameter.

But if you do so, be careful that you don’t just aim for low error on the test set. Then you’re just finding the best fit for that set.

A solution to this problem is holdout validation: you hold out part of the training set to evaluate several candidate models and select the best one.
The held out set is the validation set (or development set / dev set).
Basically, you train multiple candidate models on the reduced training set (the full training set minus the held-out part) and compare them on the validation set.
After finding the best-performing model, you train it on the full training set (including the validation set). And then you evaluate that model to get an estimate of the generalization error.
But this is not so good when the validation set is too small.
You can use Cross Validation to solve some of the issues with this approach, but it can be slow / resource intensive.
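Here’s a minimal, self-contained sketch of holdout validation (my own illustration on synthetic data): split off a validation set, pick the best candidate on it, retrain the winner on train + validation, and only touch the test set once at the end.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=42)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=42)

candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}
val_errors = {name: mean_squared_error(y_valid, model.fit(X_train, y_train).predict(X_valid))
              for name, model in candidates.items()}

best_name = min(val_errors, key=val_errors.get)                       # model selection on the validation set
best_model = candidates[best_name].fit(X_train_full, y_train_full)    # retrain on train + validation
test_error = mean_squared_error(y_test, best_model.predict(X_test))   # estimate of the generalization error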

Chapter 2 End-to-End Machine Learning Project

Here are the main steps we will walk through:

  1. Look at the big picture.
  2. Get the data.
  3. Explore and visualize the data to gain insights.
  4. Prepare the data for machine learning algorithms.
  5. Select a model and train it.
  6. Fine-tune your model.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.

These are the steps you go through in a real machine learning project.
Here’s an awesome checklist.

You also generally want to be working with real data, not some artificial dataset.

Look at the Big Picture

Frame the Problem

When you start ML projects, the first step when looking at the big picture should be to identify and frame the problem.

  • What’s the business objective?
  • What is the current solution?

And then to frame the problem:

  • Is it supervised, unsupervised, self-supervised, or reinforcement learning?
  • Is it classification, regression, or something else?
  • Online or batch learning?

Multiple regression problems are where you use multiple features to predict.
Univariate regression problems are where you only try to predict one value ‘per unit’. The opposite is a multivariate regression problem (predicting multiple values per unit).

If you have a huge dataset for batch learning, you can split the learning across multiple servers using the MapReduce technique (or do online learning).

Signals = a piece of information fed to a ML system is often called a signal due to Claude Shannon’s information theory, which he developed at Bell Labs to improve telecommunications. He theorized that you want a high signal-to-noise ratio.

Select a Performance Measure

And now you select a performance measure.

A typical one for regression is Root Mean Square Error (RMSE).
This gives an idea of how much error the system typically makes in its predictions, with higher weight given to larger errors.

RMSE is generally preferred for regression tasks, but you might prefer to use another function.

There’s also Mean Absolute Error (MAE). This is great if there are many outliers.

Both RMSE and MAE are ways to measure the distance between two vectors: the prediction vector, and the target value vector.
There are various distance measures (or norms) possible:

  • Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm. This is called the $\ell_2$ norm, denoted $\lVert \cdot \rVert_2$ (or just $\lVert \cdot \rVert$).
  • Computing the sum of absolutes (MAE) corresponds to the $\ell_1$ norm, denoted $\lVert \cdot \rVert_1$. It’s often called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
  • More generally, the $\ell_k$ norm of a vector $\mathbf{v}$ with $n$ elements is defined as $\lVert \mathbf{v} \rVert_k = (|v_1|^k + |v_2|^k + \dots + |v_n|^k)^{1/k}$. $\ell_0$ gives the number of non-zero elements in the vector, and $\ell_\infty$ gives the maximum absolute value in the vector.

The higher the norm index, the more it focuses on large values and neglects small ones. This is why RMSE is more sensitive to outliers than MAE. But when outliers are rare, RMSE performs well and is preferred.

Both the RMSE and the MAE are ways to measure the distance between two vectors

RMSE and MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values.
You can use various distance measures (or norms).

RMSE corresponds to the Euclidean norm, also called the $\ell_2$ norm.
MAE corresponds to the $\ell_1$ norm, sometimes called the Manhattan norm, as it measures the distance between two points in a city if you can only travel along orthogonal city blocks.

Generally, the $\ell_k$ norm of a vector $\mathbf{v}$ containing $n$ elements is defined as $\lVert \mathbf{v} \rVert_k = (|v_1|^k + \dots + |v_n|^k)^{1/k}$.

$\ell_0$ gives the number of nonzero elements in the vector, and $\ell_\infty$ gives the maximum absolute value in the vector.

The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.

That’s a fantastic explanation.

RMSE is more sensitive to outliers than MAE. It has a higher norm index, and focuses on large values & neglects small ones. So when outliers are rare, RMSE is good - and preferred.

Notations

There are various common notations in ML.

The RMSE equation uses several of them.

  • $m$ is the number of instances in the dataset you’re measuring the RMSE on. For example, for a set with 2,000 samples, $m = 2000$.
  • $\mathbf{x}^{(i)}$ is a vector of all the feature values (excluding the label) of the i’th instance in the dataset, and $y^{(i)}$ is its label (the desired output for that instance).
    • Say we have a housing dataset where the first district is located at longitude $-118.29°$, latitude $33.91°$, and has 1,416 inhabitants with a median income of \$38,372; its median house value is \$156,400, which is the value you want to predict.
    • Then $\mathbf{x}^{(1)} = (-118.29,\ 33.91,\ 1416,\ 38372)^T$ and $y^{(1)} = 156400$.
  • $\mathbf{X}$ is a matrix with all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the i’th row is equal to the transpose of $\mathbf{x}^{(i)}$, denoted $(\mathbf{x}^{(i)})^T$.
  • $h$ is the system’s prediction function (also called the hypothesis). Given an instance’s feature vector $\mathbf{x}^{(i)}$, it outputs a predicted value $\hat{y}^{(i)} = h(\mathbf{x}^{(i)})$ for that instance.
  • $\mathrm{RMSE}(\mathbf{X}, h)$ is the cost function measured on the set of examples using your hypothesis $h$.
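Putting the notation together, the two error measures from the previous section can be written as (my reconstruction, consistent with the book’s definitions):

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}} \qquad \mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$$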

If you imagine a table:

  • $\mathbf{X}$ is the entire table’s rows’ values, except the label column’s values.
  • $\mathbf{x}^{(i)}$ is a row in the table (except it doesn’t have the label column value).

Lowercase italics for scalar values and function names.
Lowercase bold for vectors.
Uppercase bold for matrices.

Multiple regression = using multiple features to predict
Univariate regression = predicting only one feature value per item
Multivariate regression = predicting multiple feature values per item

Check the Assumptions

It’s important that you check your assumptions.
List all of them and verify the ones made (by you or others).

Get the Data

Once you’ve fetched the data from the relevant data store, start exploring it. For example with df.head() or df.info().

You can also explore categorical features with df["feature_name"].value_counts().

You might also use df.describe(), which generates descriptive statistics.

  • Count
  • Mean
  • Standard Deviation – measures how dispersed the values are.
  • Percentiles (25th, 50th, 75th by default)
  • Min
  • Max
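A quick sketch of that first pass over the data, assuming the book’s housing DataFrame with its ocean_proximity column:

housing.head()                              # first few rows
housing.info()                              # column dtypes and non-null counts
housing["ocean_proximity"].value_counts()   # distribution of a categorical feature
housing.describe()                          # count, mean, std, min, percentiles, max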

Another way is to plot a histogram over each numerical attribute.

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(12, 8))
plt.show()

You might notice skew in the histograms. Skewed right means the values extend farther to the right of the median than to the left. Skew can make it harder for some ML algorithms to detect patterns. You might try to transform these skewed distributions into more symmetrical, bell-shaped ones, e.g. by computing the logarithm or square root.
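A minimal sketch of shrinking a right skew with a log transform (assuming a positive, skewed feature such as population in the housing data):

import numpy as np
import matplotlib.pyplot as plt

housing["log_population"] = np.log(housing["population"])
housing[["population", "log_population"]].hist(bins=50, figsize=(12, 4))
plt.show()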

Create a Test Set

After doing some brief initial exploration, create a test set, put it aside, and never look at it.

The reason is to avoid data snooping bias. Your brain is great at detecting patterns, and if you stumble upon an interesting one in the test set, you may base your modeling decisions on it, which biases your evaluation.

Random sampling

Creating a test set can be as simple as picking instances at random (20% of the dataset usually) and setting them aside. However, you need to be careful that the dataset is split the same way each time. You could do this by setting a random seed, but that method breaks when you get an updated dataset.

Here’s the naive way:

import numpy as np

def shuffle_and_split_data(data, test_ratio: float):
    assert 0 < test_ratio <= 1.0, "Invalid ratio: {}".format(test_ratio)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    # iloc is integer-location based indexing for selecting by position.
    return data.iloc[train_indices], data.iloc[test_indices]

A common solution is to use each instance’s identifier, if it has one, to decide whether or not it should go in the test set. You might also generate an identifier by computing a hash. Then put the instance in the test set if the hash is lower than, say, 20% of the maximum hash value.

Here’s how:

from zlib import crc32

def is_id_in_test_set(identifier, test_ratio: float):
    return crc32(np.int64(identifier)) < test_ratio * 2**32 # type: ignore

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return (
        # `loc` selects rows and cols by label or boolean array, `~` is bitwise negation (inverts bools)
        data.loc[~in_test_set], # selects all rows where corresponding value in `in_test_set` is False.
        data.loc[in_test_set]  # selects all rows where value in `in_test_set` is True.
    )

housing_with_id = housing.reset_index()  # adds `index` column
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, 'index')

But the issue with using the row index as identifier is that you need to ensure new data always gets appended to the end of the dataset and that no row ever gets deleted. Otherwise, use the most stable features to build a unique identifier.

Just use Scikit-Learn’s train_test_split for random sampling:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

But be careful of sampling bias when doing random sampling on small datasets.
You can use stratified sampling instead to prevent this.

Stratified sampling

This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population. If the people running the survey used purely random sampling, there would be about a 10.7% chance of sampling a skewed test set with less than 48.5% female or more than 53.5% female participants. Either way, the survey results would likely be quite biased.

You can do stratified sampling, which is where you divide the population into homogeneous subgroups and sample the right number of instances from each, so that the test set represents the overall data. Each subgroup is called a stratum (plural: strata).

For example, say the US population is 51.1% female and 48.9% male. You’d divide it into female and male strata, and then take 511 females and 489 males if you want a set of 1,000 people.

If you did random sampling instead, you could end up with a skewed test set, and therefore a biased result.

Stratified sampling can be done on continuous numerical attributes by creating a categorical attribute.

You first look at the distribution of the values, and then you see where the potential subgroups are. If many values are closely grouped, you might want to keep the number of strata small. It’s important to get a sufficient number of instances for each stratum, or the estimate of a stratum’s importance may be biased.
So: not too many strata, and each stratum should be large enough.

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1,2,3,4,5])

This uses pd.cut() to create an income category attribute with 5 categories, labeled from 1 to 5.

And then we can use Scikit-Learn’s StratifiedShuffleSplit to generate stratified splits:

from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
strat_splits = []
for train_index, test_index in splitter.split(housing, housing["income_cat"]):
    strat_train_set_n = housing.iloc[train_index]
    strat_test_set_n = housing.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])

There are various splitters in sklearn.model_selection that use various strategies. Common between them is a split() method that returns an iterator over different training/test splits of the same data. That is, it yields the train/test indices.

But you might want to just use this:

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

And at this point, you should simply drop the categorical feature you created.
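For example, you can drop the helper column from both splits like this (essentially what the book does):

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)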

Explore and Visualize the Data to Gain Insights

Look for correlations

If the training set is large, you should probably explore a sample of it only.
Should also make a copy: data = df.copy() or something.

If you are looking at a smaller dataset, you can compute the standard correlation coefficient, called Pearson’s r, between every pair of attributes. This correlation coefficient ranges from -1 to 1. Close to 1 means strong positive correlation. Close to -1 means strong negative correlation. Close to 0 means no linear correlation.

corr_matrix = housing.corr(numeric_only=True)  # numeric_only skips non-numeric columns (required in newer pandas versions)

corr_matrix["median_house_value"].sort_values(ascending=False)

This only captures linear relationships (e.g. x goes up, y generally goes up/down) - not nonlinear (e.g. as x approaches 0, y generally goes up).

You can also check for correlation with a scatter matrix, which plots every numerical attribute against every other numerical attribute. It also creates a histogram of each numerical attribute’s values on the main diagonal. The main diagonal would otherwise be a bunch of straight lines, which isn’t useful.

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()

You may identify clear correlations from such a matrix.

  • Clear trend
  • Points not too dispersed

You may also identify other quirks in the data that could lead to worse predictions.

Experiment with Attribute Combinations

You should try out creating features to see if you can find correlations from combined features.
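A minimal sketch along the lines of the book’s example: create ratio features and check whether they correlate better with the target.

housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))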

Prepare the Data for Machine Learning Algorithms

It is a very good idea to create functions to clean the data for you.

  • It’s easier when you get more data
  • You build a library of cleaning functions that you can reuse
  • These can be used in your live system to transform new data before feeding it to your algorithm

Clean the Data

Most ML algorithms don’t do well with missing features. You can

  • Get rid of instances with missing values
  • Get rid of the entire attribute with missing values
  • Set missing values to some value through imputation, where the missing value would be e.g. the mean, zero, median, etc.

Get rid of rows with missing values:

data.dropna(subset=["feature"])

Get rid of the entire attribute with missing values

data.drop("attribute", axis=1)

Set the values to some value
This is for the median:

median = data["attribute"].median()
data["attribute"].fillna(median, inplace=True)

If you do this, remember to do it on the test set as well.

Scikit-Learn provides SimpleImputer to help take care of missing values:

from sklearn.impute import SimpleImputer

# There's also "mean", "most_frequent", "constant" with fill_value, etc.
imputer = SimpleImputer(strategy="median")
# median obv. only works on numerical attributes
housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num) # fit imputer instance to training data, i.e. compute the median of each attribute & store it in the `statistics_` attribute.
print(imputer.statistics_)

X = imputer.transform(housing_num)

## Now it's a NumPy array. Convert back to DataFrame:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

There are also other, more powerful, imputers:

  • KNNImputer replaces each missing value with the mean of the k-nearest neighbors’ values for that feature. Distance is based on all available features.
  • IterativeImputer trains a regression model per feature and predicts the missing values based on all other available features. Then trains the model again on the updated data and repeats this several times, improving the model & replacement values each time.
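A minimal sketch of both (same fit_transform pattern as the SimpleImputer above); note that IterativeImputer must be explicitly enabled:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - enables IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer

knn_imputer = KNNImputer(n_neighbors=5)                 # missing value = mean of the 5 nearest neighbors
iter_imputer = IterativeImputer(max_iter=10, random_state=42)
housing_num_knn = knn_imputer.fit_transform(housing_num)
housing_num_iter = iter_imputer.fit_transform(housing_num)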

Handling Text and Categorical Attributes

You can convert categorical text attributes to numbers with Scikit-Learn’s OrdinalEncoder class. This is like converting them to an enum: each categorical value gets represented by a number instead.
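A minimal sketch, assuming the book’s ocean_proximity column:

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing[["ocean_proximity"]])
print(ordinal_encoder.categories_)  # the learned category-to-integer mapping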

Just keep in mind that some ML algorithms assume that two nearby values are more similar than distant values.
This might not be the case.

A common solution is to create a binary attribute per category, where its values are 1 if the instance is in that category, and 0 if not.
This is called One-Hot Encoding because only one attribute will be 1 (hot) while the others are 0 (cold).
The new attributes are called dummy attributes.

Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors:

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
data_cat_1hot = cat_encoder.fit_transform(data_cat) # data_cat is your categorical attribute(s)

This gives a SciPy sparse matrix. Such a matrix stores only the location of the non-zero elements (because storing all the 0s would be wasteful).
You can use it like a normal 2D array.
You can convert it to a (dense) NumPy array by calling toarray().

Keep in mind that if a categorical attribute has a large number of possible categories, one-hot encoding will result in a ton of input features. This can degrade performance and slow down training.
You can replace categorical features with useful numerical ones that are related. E.g. replace country code with population & GDP per capita.
You might also use an encoder provided by the category_encoders package on GitHub.
Or when doing NNs, replace each category with a learnable, low-dimensional vector called an embedding. Then each category’s representation is learned during training. This is an example of representation learning (later in the book).

Feature Scaling and Transformation

Very important because ML algorithms don’t do well when numerical attributes have different scales.

There are two ways to ensure they have the same scale:

  • Min-max scaling
  • Standardization

Min-max scaling is often called normalization.

It’s rather simple: for each attribute, values are shifted and rescaled, so they end up ranging from 0 to 1.

Done by subtracting the min value and dividing by the difference between the min and the max.

Use Scikit-Learn’s MinMaxScaler for it.
You can set the feature range to something else than 0 to 1, e.g. -1 to 1. Neural Networks work best with zero-mean inputs, so a range of -1 to 1 is preferable.

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

Standardization:

  • First subtract the mean value (standardized values have a zero mean)
  • Divide results by the standard deviation (so standardized values have a standard deviation = 1)

This doesn’t restrict values to a specific range, and it is much less affected by outliers. If min-max scaling saw a large outlier, it would squish all the other values down to almost nothing, but standardization wouldn’t be much affected.
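In formulas (standard definitions, not quoted from the book), min-max scaling and standardization of a value $x$ are:

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad z = \frac{x - \mu}{\sigma}$$

where $x_{\min}$, $x_{\max}$, $\mu$, and $\sigma$ are the feature’s minimum, maximum, mean, and standard deviation computed on the training set.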

You can use Scikit-Learn’s StandardScaler for standardization.

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
# can set `with_mean` hyperparameter to False so it only divides by std, not subtract mean. Then you can scale a sparse matrix without converting it to a dense matrix first.
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

heavy tail (i.e., when values far from the mean are not exponentially rare)

Heavy tail = when values far from the mean are not exponentially rare.

If a feature distribution has a heavy tail, min-max scaling and standardization will squash the values into a small range.
This isn’t great - most ML algorithms don’t like this.
So before you scale it, transform the feature to shrink the heavy tail.

Deal with it by transforming it to shrink the heavy tail & making the distribution roughly symmetrical. Here are some methods if the heavy tail is to the right (for positive features):

  • Replace it by its square root, or
  • Raise the feature to a power between 0 and 1

If the feature distribution has a really long and heavy tail (like a power law distribution), then replace it with its logarithm.

Can also deal with heavy-tailed features by bucketizing the feature. Chop the distribution into roughly equal-sized buckets & replace each feature value with the index of the bucket it belongs to.

Then you get an almost uniform distribution.
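A minimal sketch of bucketizing with pandas, using quantile-based bins so the buckets are roughly equal-sized (my own illustration):

housing["income_bucket"] = pd.qcut(housing["median_income"], q=10, labels=False)  # bucket indices 0-9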

When a feature has a multimodal distribution (i.e., with two or more clear peaks, called modes), such as the housing_median_age feature, it can also be helpful to bucketize it, but this time treating the bucket IDs as categories, rather than as numerical values. This means that the bucket indices must be encoded, for example using a OneHotEncoder (so you usually don’t want to use too many buckets). This approach will allow the regression model to more easily learn different rules for different ranges of this feature value. For example, perhaps houses built around 35 years ago have a peculiar style that fell out of fashion, and therefore they’re cheaper than their age alone would suggest.
Another approach to transforming multimodal distributions is to add a feature for each of the modes (at least the main ones), representing the similarity between the housing median age and that particular mode. The similarity measure is typically computed using a radial basis function (RBF)—any function that depends only on the distance between the input value and a fixed point. The most commonly used RBF is the Gaussian RBF, whose output value decays exponentially as the input value moves away from the fixed point. For example, the Gaussian RBF similarity between the housing age x and 35 is given by the equation exp(–γ(x – 35)²). The hyperparameter γ (gamma) determines how quickly the similarity measure decays as x moves away from 35. Using Scikit-Learn’s rbf_kernel() function, you can create a new Gaussian RBF feature measuring the similarity between the housing median age and 35:

Multimodal distribution = two or more clear peaks, called modes.

To deal with such distributions, you can bucketize it, but treat the bucket ID as categories rather than numerical values. So do encoding (e.g. 1hot).

You can also transform the multimodal distribution by adding a feature for each of the (main) modes that represents the similarity between the feature and that particular mode.

This is typically computed using a radial basis function (RBF), which is any function that depends only on the distance between the input value and a fixed point.

The most commonly used RBF is Gaussian RBF, whose output values decay exponentially as the input values move away from the fixed point.

from sklearn.metrics.pairwise import rbf_kernel

age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)

So far we’ve only looked at the input features, but the target values may also need to be transformed.

You might find your target values may need to be transformed:

from sklearn.linear_model import LinearRegression

# You might need to scale the targets. Here's an example of a scaling + 'unscaling'
target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())

model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)
some_new_data = housing[["median_income"]].iloc[:5] # pretend new data

scaled_predictions = model.predict(some_new_data)
predictions = target_scaler.inverse_transform(scaled_predictions)

But here’s a simpler way:

from sklearn.compose import TransformedTargetRegressor

model = TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler())
model.fit(housing[["median_income"]], housing_labels)
predictions = model.predict(some_new_data)

Custom Transformers

You can implement your own custom Scikit-Learn transformers. This is useful for custom transformations, cleanup operations, or combining specific attributes.

Add BaseEstimator and TransformerMixin as superclasses to your custom class, for example.

For transformations that don’t require any training, you can just write a function that takes a NumPy array as input and outputs the transformed array.

Custom transformer that doesn’t require training:

from sklearn.preprocessing import FunctionTransformer

# Inverse is optional - useful for when you plan to use it in e.g. a TransformedTargetRegressor.
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])

from sklearn.metrics.pairwise import rbf_kernel
# Can also specify `kw_args` to give hyperparameters.
rbf_transformer = FunctionTransformer(rbf_kernel, kw_args=dict(Y=[[35.]], gamma=0.1))
age_simil_35 = rbf_transformer.transform(housing[["housing_median_age"]])

For example, here’s a custom transformer that acts much like the StandardScaler:

You can create custom transformers that can train. It needs fit(), transform(), and fit_transform(). But you can get fit_transform() for free by adding TransformerMixin as base. You don’t otherwise really need to inherit from any classes. But for this one, we add BaseEstimator to get get_params() and set_params().

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):  # no *args or **kwargs
        self.with_mean = with_mean

    def fit(self, X, y=None):  # y is required even though we don't use it
        X = check_array(X)  # check X is array with finite floats
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]  # every estimator stores this in fit()
        return self # always need to do this - fluent pattern
    
    def transform(self, X):
        check_is_fitted(self)  # looks for learned attributes (with trailing _)
        
        X = check_array(X)
        assert self.n_features_in_ == X.shape[1]
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_

This isn’t 100% complete, as all estimators should set feature_names_in_ in fit() when they’re passed a DataFrame. Also, all transformers should provide a get_feature_names_out() method, and inverse_transform() when their transformation can be reversed.

A custom transformer can (and often does) use other estimators in its implementation. For example, the following code demonstrates a custom transformer that uses a KMeans clusterer in the fit() method to identify the main clusters in the training data, and then uses rbf_kernel() in the transform() method to measure how similar each sample is to each cluster center:

from sklearn.cluster import KMeans

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state
        
    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self
    
    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)
    
    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

This finds cluster similarity with kmeans and RBF.

k-means is a clustering algorithm that locates clusters in data. It searches for n_clusters clusters. After training, you’ll find cluster centers in cluster_centers_.

cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]], sample_weight=housing_labels)

Transformation Pipelines

Many data transformation steps are usually needed, and they must be executed in the right order.
Scikit-Learn has the Pipeline class to help with such sequences.

Note: If you’re using a Jupyter notebook and use sklearn, you can use sklearn.set_config(display="diagram") to visualize all Scikit-Learn’s estimators as interactive diagrams. Very useful for visualizing pipelines.

The pipeline exposes the same methods as the final estimator. If the final estimator is a transformer, you’ll have a transform() method, etc.

For example:

from sklearn.pipeline import Pipeline

## Here we have to name the steps. Names can be anything, as long as they are unique and don't have double underscores.
## Estimators must all be transformers, i.e. have `fit_transform`, except the last one - which can be anything (transformer, predictor, etc.)
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler())
])

from sklearn.pipeline import make_pipeline
## Don't want to name the transformers? This uses the names of the transformers' classes.
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# The pipeline exposes the same methods as the final estimator.
# StandardScaler is a transformer, so the pipeline acts like one.

# transformed (numerical) dataset = return val of pipeline on dataset's numerical attributes
housing_num_prepared = num_pipeline.fit_transform(housing_num)

df_housing_num_prepared = pd.DataFrame(
    housing_num_prepared, columns=num_pipeline.get_feature_names_out(), index=housing_num.index
)

But it’d be better to handle both categorical and numerical columns together. For this, we can use the ColumnTransformer.

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)  # instead of writing every numerical attribute...
cat_attribs = ["ocean_proximity"]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore")
)

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs)
])

Instead of using a transformer, you can specify string “drop” if you want columns to get dropped, or “passthrough” if you want them to be left untouched.
By default, remaining columns get dropped. But you can set remainder hyperparameter to any transformer - or "passthrough" if you want them handled differently.

This is easier than ColumnTransformer instantiation and listing all columns:

from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object))
)

The code that builds the pipeline to do all of this should look familiar to you by now:

Here’s an example of a larger pipeline:

from sklearn.compose import make_column_selector

def column_ratio(X):
    return X[:, [0]] / X[:, [1]]


def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]


def ratio_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        StandardScaler(),
    )


log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler(),
)

cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1.0, random_state=42)
default_num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

preprocessing = ColumnTransformer(
    [
        # Ratio of bedrooms / rooms
        ("bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
        # Ratio of rooms / households
        ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
        # Ratio of population / households
        ("people_per_house", ratio_pipeline(), ["population", "households"]),
        # Logarithm of these features to handle long tails (transform to uniform/Gaussian distributions)
        (
            "log",
            log_pipeline,
            [
                "total_bedrooms",
                "total_rooms",
                "population",
                "households",
                "median_income",
            ],
        ),
        # Cluster similarity with k-means & RBF
        ("geo", cluster_simil, ["latitude", "longitude"]),
        # Categorical attributes get 1hot encoded
        ("cat", cat_pipeline, make_column_selector(dtype_include=object)), # type: ignore
    ],
    # remaining numerical gets imputed by median and standardized
    remainder=default_num_pipeline,
)

Select and Train a Model

So now we have

  • Explored the data
  • Sampled a training set and a test set
  • Written a preprocessing pipeline to automatically clean & prepare the data for an ML algorithm

Train and Evaluate on the Training Set

We can start out with a basic linear regression model.

from sklearn.linear_model import LinearRegression

lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(housing, housing_labels)

housing_predictions = lin_reg.predict(housing)
housing_predictions[:5].round(-2)  # -2 means rounded to nearest hundred
> array([243700., 372400., 128800., 94400., 328300.])

""" BUT! It's off by quite a lot! """
housing_labels[:5].values
> array([458300., 483800., 101700., 96100., 361800.])

from sklearn.metrics import mean_squared_error

lin_rmse = mean_squared_error(housing_labels, housing_predictions, squared=False)
lin_rmse
> 68687.89176590017 # Off by roughly $68,700!

As we can see, the Linear Regression model is underfitting the data. So we can either

  • Select a more powerful model
  • Get better features
  • Reduce the constraints – which we can’t do because the model isn’t regularized.

Let’s try a more complex model.

from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(housing, housing_labels)

housing_predictions = tree_reg.predict(housing)
tree_rmse = mean_squared_error(housing_labels, housing_predictions, squared=False)
tree_rmse
> 0.0

Which is likely to have badly overfit the data.
But we can evaluate it – see next section.

Better Evaluation Using Cross-Validation

You can use Scikit-Learn’s k-fold cross validation feature.

from sklearn.model_selection import cross_val_score

tree_rmses = -cross_val_score(tree_reg, housing, housing_labels, scoring="neg_root_mean_squared_error", cv=10)

pd.Series(tree_rmses).describe()
> count       10.000000
> mean     66880.208011
> std       2049.481815
> min      63649.536493
> 25%      65429.433745
> 50%      66801.953094
> 75%      68229.934454
> max      70094.778246
> dtype: float64

But do note that cross_val_score() expects a utility function (greater is better) rather than a cost function (lower is better). That’s why we use negative RMSE as the scoring metric and put a - in front to flip the sign back.

k-fold cross validation splits the training set into n nonoverlapping subsets called folds, then trains and evaluates the model n times (here 10, since cv=10), picking a different fold for evaluation each time and using the remaining n-1 folds for training.

Since the decision tree model hasn’t worked too well, let’s try an Ensemble model – the Random Forest!

from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing, RandomForestRegressor(random_state=42))
forest_rmses = -cross_val_score(forest_reg, housing, housing_labels, scoring="neg_root_mean_squared_error", cv=10)

pd.Series(forest_rmses).describe()

> count       10.000000
> mean     47030.511142
> std       1029.358881
> min      45458.112527
> 25%      46474.122490
> 50%      46967.596354
> 75%      47332.463998
> max      49227.030610
> dtype: float64

It’s better, but likely still overfitting: if you train a single Random Forest and measure the RMSE on the training set itself, you get around 17,474 – far below the CV scores above, which indicates overfitting (the CV results are the ones to trust).
It’s a good idea to find 2-5 promising models by trying out multiple potential approaches at this stage.

Fine-Tune Your Model

Now you have a list of promising models, and you want to fine-tune them.

One way is by fiddling with the hyperparameters manually. This is tedious, though.
So instead, have Scikit-Learn’s GridSearchCV do it for you!

Just tell it which hyperparameters to experiment with, and which values to try, and it’ll evaluate them for you.

# This will search for the best combination of hyperparameter values for the random forest regressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42))
])

param_grid = [
    # The 'preprocessing__geo__n_clusters' string refers to a hyperparameter of an estimator in the pipeline, no matter how deeply nested it is.
    # When Scikit-Learn sees this string, it splits it at the double underscores: it looks for an
    # estimator named 'preprocessing' in the pipeline and finds the preprocessing ColumnTransformer.
    # Then it looks for a transformer named 'geo' inside that, and finds the ClusterSimilarity transformer.
    # Finally it finds that transformer's 'n_clusters' hyperparameter.
    {'preprocessing__geo__n_clusters': [5, 8, 10], 'random_forest__max_features': [4, 6, 8]},
    {'preprocessing__geo__n_clusters': [10, 15], 'random_forest__max_features': [6, 8, 10]}
]

grid_search = GridSearchCV(full_pipeline, param_grid=param_grid, cv=3, scoring='neg_root_mean_squared_error')
grid_search.fit(housing, housing_labels)

By using a Scikit-Learn pipeline for the preprocessing, you can also do hyperparameter tuning on the preprocessing hyperparameters – along with the model hyperparameters. They often interact. In this case, increasing n_clusters might require increasing max_features as well.
If fitting the pipeline transformers is computationally expensive, set the pipeline’s memory hyperparameter to the path of a caching directory. It’ll then save the fitted transformer after first fitting the pipeline. If you fit the pipeline again with the same hyperparameters, it’ll load the cached transformers.
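A rough sketch of the caching idea (the cache directory name here is just an example):

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

cached_pipeline = Pipeline(
    [
        ("preprocessing", preprocessing),
        ("random_forest", RandomForestRegressor(random_state=42)),
    ],
    memory="my_pipeline_cache",  # hypothetical directory: fitted transformers get cached here
)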

Best estimator can be accessed via grid_search.best_estimator_.
If you initialize with refit=True (default), then once it finds the best estimator using CV, it’ll retrain on the whole training set.

If you don’t know what value a hyperparameter should have, try out consecutive powers of 10.

To get the best parameters, use:

grid_search.best_params_

And to get the best estimator:

grid_search.best_estimator_

And you can get the scores:

cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)

# rmse = -score
test_score_columns = [col for col in cv_res.columns if 'test_score' in col and not col == "std_test_score"]

for col in test_score_columns:
    cv_res[col] = -cv_res[col]

cv_res.head()

Similar to Grid Search, there’s also Randomized Search.

Grid search is good when looking at relatively few combinations. When the hyperparameter search space is large, it’s preferable to use RandomizedSearchCV.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'preprocessing__geo__n_clusters': randint(low=3, high=50),
    'random_forest__max_features': randint(low=2, high=20)
}

rnd_search = RandomizedSearchCV(
    full_pipeline, param_distributions=param_distribs, n_iter=10, cv=3,
    scoring='neg_root_mean_squared_error', random_state=42
)

rnd_search.fit(housing, housing_labels)

There’s also HalvingRandomSearchCV and HalvingGridSearchCV hyperparameter search classes.
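These use successive halving: many candidates are first evaluated with limited resources, and only the most promising ones survive to be evaluated with more. A rough sketch, reusing full_pipeline and param_distribs from above (note that Scikit-Learn still requires the explicit experimental-enable import):

from sklearn.experimental import enable_halving_search_cv  # noqa: F401  (enables the class below)
from sklearn.model_selection import HalvingRandomSearchCV

halving_search = HalvingRandomSearchCV(
    full_pipeline, param_distributions=param_distribs,
    scoring="neg_root_mean_squared_error", random_state=42,
)
halving_search.fit(housing, housing_labels)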

Bonus section: how to choose the sampling distribution for a hyperparameter

  • scipy.stats.randint(a, b+1): for hyperparameters with discrete values that range from a to b, and all values in that range seem equally likely.
  • scipy.stats.uniform(a, b): this is very similar, but for continuous hyperparameters.
  • scipy.stats.geom(1 / scale): for discrete values, when you want to sample roughly in a given scale. E.g., with scale=1000 most samples will be in this ballpark, but ~10% of all samples will be < 100 and ~10% will be > 2300.
  • scipy.stats.expon(scale): this is the continuous equivalent of geom. Just set scale to the most likely value.
  • scipy.stats.loguniform(a, b): when you have almost no idea what the optimal hyperparameter value’s scale is. If you set a=0.01 and b=100, then you’re just as likely to sample a value between 0.01 and 0.1 as a value between 10 and 100.
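A quick way to get a feel for the distributions listed above is to draw a few samples from each (the parameter values below are arbitrary):

from scipy.stats import expon, geom, loguniform, randint, uniform

print(randint(3, 50 + 1).rvs(5, random_state=42))     # discrete, uniform over 3..50
print(uniform(1, 9).rvs(5, random_state=42))          # continuous, uniform over [1, 10]
print(geom(1 / 1000).rvs(5, random_state=42))         # discrete, mostly around the 1,000 scale
print(expon(scale=1000).rvs(5, random_state=42))      # continuous equivalent of geom
print(loguniform(0.01, 100).rvs(5, random_state=42))  # log-uniform: every scale equally likely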

Ensemble Methods

Another way to make your system perform better is to combine the models that work best. We call the group an “ensemble,” and it usually performs better than the best individual model – for example, Random Forests perform better than the individual Decision Trees they are built from.

Analyzing the Best Models and Their Errors

It can be a good idea to analyze your best models.

For example, you can check the importance scores for each of the features, and potentially drop ‘useless’ ones.

final_model = rnd_search.best_estimator_
feature_importances = final_model["random_forest"].feature_importances_
feature_importances.round(2)

# This shows us the importance scores in descending order with their corresponding attribute names
sorted(
    zip(feature_importances, final_model["preprocessing"].get_feature_names_out()),
    reverse=True,
)

In fact, the sklearn.feature_selection.SelectFromModel transformer can automatically drop the least useful features.
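For example, here’s a rough sketch (the threshold value is illustrative) of how SelectFromModel could be slotted in between the preprocessing and the final model:

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

selector_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    # drop features whose importance (from this auxiliary forest) falls below the threshold
    ("selector", SelectFromModel(RandomForestRegressor(random_state=42), threshold=0.005)),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
selector_pipeline.fit(housing, housing_labels)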

You might also look at the specific errors your system makes and try to understand why – and what could fix it (more features, removing uninformative ones, cleaning outliers, etc.).

You might also spend time creating subsets of your validation set to check that your model performs well on all categories of data you’re working with. If it performs well generally, but poorly on one, it might cause more harm than good.

Evaluate Your System on the Test Set

Now, run the system on your test set.
When running your pipeline, just transform - do not fit!

# Getting the RMSE for the test set. Just getting predictors and labels from test set and running
# the final model to transform the data & make predictions, then evaluating them
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

final_predictions = final_model.predict(X_test)

final_rmse = mean_squared_error(y_test, final_predictions, squared=False)
print(final_rmse)

You can also compute a 95% confidence interval for the generalization error.

To explain it simply: when creating a model, the goal is for the model to generalize well. That is, it should be able to make accurate predictions on new, unseen data.
The generalization error is a way to quantify how well the model is expected to perform on this new data.

A 95% confidence interval for the generalization error is a range of values constructed so that, if we repeated the whole experiment many times (drawing new data and recomputing the interval each time), about 95% of those intervals would contain the true generalization error.
Loosely speaking: if we could somehow create 100 identical universes and train and evaluate our model in each one, we’d expect the computed interval to capture the true generalization error in roughly 95 of them.

So the 95% confidence interval gives us a range where we expect the true value to lie 95% of the time. For the generalization error, it’s giving us a range where we believe our model’s performance on new data will fall.
It’s a way of capturing the uncertainty we have about our model’s ability to make accurate predictions.

# Here we're calculating the 95% confidence interval for the generalization error
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

Launch, Monitor, and Maintain Your System

It’s often a good idea to save every model you experiment with so that you can come back easily to any model you want.
You may also save the cross-validation scores and perhaps the actual predictions on the validation set.
This will allow you to easily compare scores across model types, and compare the types of errors they make.

Concretely, save the hyperparameters, the trained parameters, and the cross-validation scores. That makes it easy to compare scores across model types.

You can save models like this:

import joblib

joblib.dump(my_model, "my_model.pkl")
# and later ... - just make sure to have all the dependencies loaded/imported as well.
my_model_loaded = joblib.load("my_model.pkl")

Once you’re ready to launch:

  • Get your system ready for production: plug in the input data sources and write tests.
  • Write monitoring code to monitor live performance and trigger alerts when it drops.
  • Evaluate system performance by sampling its predictions and evaluating them manually.
  • Evaluate system input data quality.
  • It’s generally good to train your models on a regular basis with fresh data. Try to automate this.

You can deploy your model to the cloud, e.g. Google’s Vertex AI. Just upload it to Google Cloud Storage (GCS), go to Vertex AI, and create a new model version by pointing it to the GCS file.
Then you get a simple web service that handles load balancing and scaling.

Chapter 3 Classification

Training a Binary Classifier

from sklearn.datasets import fetch_openml

# This is a dict with the data and some metadata.
# 'data' has the input data, 'target' has the labels.
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist['data'], mnist['target']
X.shape
> (70000, 784)

import matplotlib.pyplot as plt

def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap='binary')
    plt.axis('off')
    
some_digit = X[0]
plot_digit(some_digit)
plt.show() # shows digit

TRAIN_SIZE = 60_000
X_train, X_test, y_train, y_test = (
    X[:TRAIN_SIZE],
    X[TRAIN_SIZE:],
    y[:TRAIN_SIZE],
    y[TRAIN_SIZE:],
)

This gives us the training set (60k images) and the test set (10k images).
The training set is already shuffled, which is great because this guarantees that all cross-validation folds will be similar (you don’t want one fold to be missing some digits). Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the dataset ensures that this won’t happen.

But shuffling is a poor idea if you’re dealing with e.g. time series data.

Let’s train a binary classifier for the number 5.

y_train_5 = (y_train == '5') # True for all 5s, False for all other digits.
y_test_5 = (y_test == '5')

We’ll use a Stochastic Gradient Descent classifier.
This classifier is capable of handling large datasets efficiently.
It deals with training instances independently, one at a time.
This makes SGD well suited for online learning.
SGD relies on randomness during training (hence the name “stochastic”).
If you want reproducible results, you should set the random_state parameter.
This is useful for testing and debugging.

SGDClassifier is just a linear model. It assigns a weight per class to each pixel, and when it sees a new image, it sums up the weighted pixel intensities to get a score for each class.

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

sgd_clf.predict([some_digit])

Performance Measures

There are a ton of performance measures for classifiers. It’s generally harder to measure the performance of a classifier than a regressor.

Measuring Accuracy Using Cross-Validation

We could use Cross Validation.

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

And even if we get a 95% accuracy, there’s no reason to get excited. If you trained a dummy classifier that just classifies every image in the most frequent class (here: the negative class, that is non-5), you’d get over 90%. About 10% of the images are 5s, so just guessing ‘not 5’ makes you right 90% of the time.
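You can check this baseline yourself with a dummy classifier that always predicts the most frequent class – a minimal sketch:

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

dummy_clf = DummyClassifier(strategy="most_frequent")  # always predicts "not 5"
cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy")  # roughly 90% accuracy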

It’s generally not the best idea to use accuracy as a measure for classifiers. A much better way is Confusion Matrix.

Confusion Matrices

With a confusion matrix, the idea is to count the number of times instances of class A are classified as class B, for all A/B pairs.
And you can find the number of times the classifier confused 5s with 3s by looking at row #5, column #3 of the matrix.

You can use confusion_matrix from sklearn.metrics to make one.

Instead of using the test set at this time, we can use cross_val_predict! This does k-fold cross validation and returns the predictions made on each test fold. So you get clean predictions for each instance in the training set – that is, out-of-sample predictions (the model makes predictions on data it never saw during training).

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
cm = confusion_matrix(y_train_5, y_train_pred)

Each row represents an actual class, and each column represents a predicted class.

False positives are also called type I errors. Likewise, false negatives are Type II errors.

Confusion matrices give lots of information. You might prefer a more concise metric.

📊 Precision refers to the fraction of true positive instances among the instances that the model predicted as positive. It’s a measure of how many of the positive predictions were actually correct.
Formula: $\text{precision} = \frac{TP}{TP + FP}$, where $TP$ is the number of true positives and $FP$ the number of false positives.

You’d actually get perfect precision by making a classifier that always makes negative predictions, except for one single positive prediction on the instance it’s most confident about. If that prediction is correct, it would have a precision of 100%.
That isn’t very useful, though. That’s why we often pair precision with recall.

🎯 Recall is the fraction of true positive instances among the actual positive instances. It measures how many of the actual positives were captured by the model.
So if there are, say, 6 actual fives, but the model only captured 5 of them, we’d have a recall of $5/6 \approx 83\%$.
Formula: $\text{recall} = \frac{TP}{TP + FN}$, where $FN$ is the number of false negatives.

Precision and Recall

You can get precision and recall from Scikit-Learn: from sklearn.metrics import precision_score, recall_score.

We can combine precision and recall into a single metric called the $F_1$ score. This can help you compare two classifiers.
The $F_1$ score is the harmonic mean of precision and recall: $F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$. It gives a balanced measure between them and ranges from 0 to 1, where 1 is the best value.
The regular mean treats all values equally, while the harmonic mean gives much more weight to low values. So the classifier only gets a high $F_1$ score if both precision and recall are high.

Or just use from sklearn.metrics import f1_score.
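A quick usage sketch with the cross-validated predictions from above:

from sklearn.metrics import precision_score, recall_score, f1_score

precision_score(y_train_5, y_train_pred)  # TP / (TP + FP)
recall_score(y_train_5, y_train_pred)     # TP / (TP + FN)
f1_score(y_train_5, y_train_pred)         # harmonic mean of precision and recall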

You don’t always need to maximize both precision and recall. Increasing precision reduces recall, and vice versa. This is the precision/recall tradeoff.

It’s also important to keep in mind that sometimes the cost of false positives and false negatives are different. This would be easy to imagine in a healthcare scenario, where you’d rather avoid the machine saying a patient doesn’t have some disease, when they actually do (false negative).

Since you often do need to make the precision/recall tradeoff, you need to know what is most important.
Another example is detecting videos that are safe for kids. You’d rather be on the safe side (high precision), even though that might mean rejecting many otherwise good videos (low recall). You don’t want really bad videos to slip through!

Precision/Recall Tradeoff

This is how the SGDClassifier makes its classification decisions:
For each instance, it computes a score based on a decision function. If that score is greater than a threshold value, it assigns the instance to the positive class; otherwise, it assigns it to the negative class.

Scikit-Learn doesn’t allow you to touch the threshold value (by default, it’s 0). But it gives you access to the decision function, so you can call that instead of predict to get a score for each instance, and then use any threshold you want:

y_scores = sgd_clf.decision_function([some_digit])
print(y_scores)

# and you can manually set threshold:
threshold = 8000
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)

And you can decide which threshold to use like this:
First, get scores of all instances in the training set.

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")

With these scores, compute precision and recall for all possible thresholds with precision_recall_curve():

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

You can then plot precision and recall as functions of the threshold value. This can also help you choose.

def plot_precision_vs_recall(precisions, recalls, thresholds, threshold):
    """Plot precision vs. recall for a given threshold."""
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.vlines(threshold, 0, 1.0, "k", "dotted", label="Threshold")

    ## Beautify
    idx = (thresholds >= threshold).argmax()  # first index ≥ threshold
    plt.plot(thresholds[idx], precisions[idx], "bo")
    plt.plot(thresholds[idx], recalls[idx], "go")
    plt.axis([-50000, 50000, 0, 1])
    plt.grid()
    plt.xlabel("Threshold")
    plt.legend(loc="center right")
    ## End beautify
    
    plt.show()

You can also plot precision against recall. This will help select a good precision/recall tradeoff:

plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall curve")
plt.show()

If you want 90% precision, you can search for the lowest threshold that gives at least 90% precision.
Use argmax() for this, which returns the first index of the maximum value (here, the first True value):

idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]

Which you could then use as your new threshold.

As you can see, you can reach practically any precision you want – but you need to keep recall in mind: a classifier that makes a single correct positive prediction and ignores all the remaining instances has 100% precision and close to 0% recall, which isn’t useful.

The ROC Curve

The Receiver Operating Characteristic (ROC) curve can also be used with binary classifiers to help assess performance.
It’s similar to plotting the precision vs. recall curve, but instead of that, it plots the true positive rate (recall) against the false positive rate (FPR).
FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to $1 - \text{TNR}$, where the true negative rate (TNR) is the ratio of negative instances that are correctly classified as negative. TNR is also called specificity.
So the ROC curve plots sensitivity (recall) versus $1 - \text{specificity}$.

A high area under the ROC curve (AUC: area under the curve) indicates better model performance. The metric summarizes the trade-off between the TPR and FPR for different threshold values.

The y-axis would be TPR (Recall) and the x-axis is FPR (Fall-Out).

This is how you plot it with Scikit-Learn and matplotlib:

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

# from the book
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1])                                    # Not shown in the book
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16) # Not shown
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    # Not shown
    plt.grid(True)                                            # Not shown

plt.figure(figsize=(20, 12))
plot_roc_curve(fpr, tpr)

And as we can see, the higher the recall, the more false positives the classifier produces.

We also measure the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, whereas a completely random one has a ROC AUC of 0.5.

Compute with Scikit-Learn:

from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)

You should use Precision vs. Recall curve (PR curve) when the positive class is rare, or when you care more about false positives than false negatives, and otherwise, you should use the ROC curve.

At the end of this chapter, we trained a Random Forest classifier:

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_proba_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")

y_proba_forest[:2] # first two instances. First column is probability of negative class, second column is probability of positive class. These are estimated probabilities.
> array([[0.11, 0.89], [0.99, 0.01]])

y_scores_forest = y_proba_forest[:, 1] # score = probability of positive class
precisions_forest, recalls_forest, thresholds_forest = precision_recall_curve(
    y_train_5, y_scores_forest
)

plt.plot(recalls_forest, precisions_forest, linewidth=2, label="Random Forest")
plt.plot(recalls, precisions, "--", linewidth=2, label="SGD")

plt.show() # shows PR curves for both random forest classifier and SGD. Evidently shows that the RF classifier is better as the AUC is greater, and the PR curve is closer to the top-right corner.

from sklearn.metrics import roc_auc_score, f1_score
y_train_pred_forest = y_proba_forest[:, 1] > 0.5 # positive probability > 50%
print(f"F_1={f1_score(y_train_5, y_train_pred_forest)}")
print(f"ROC AUC={roc_auc_score(y_train_5, y_scores_forest)}")
> F_1=0.9242275142688446
> ROC AUC=0.9983436731328145

Multiclass Classification

Multiclass classifiers, or multinomial classifiers, can distinguish between more than two classes.
While some Scikit-Learn classifiers (e.g. Random Forest, Logistic Regression, and naive Bayes classifiers) support multiple classes directly, others are strictly binary classifiers (e.g. SGDClassifier and SVC).
But there are strategies to do multiclass classification using multiple binary classifiers (e.g. one-versus-all and one-versus-one)

Here’s an example of the One-versus-All (OvA) strategy (also called one-versus-the-rest):
Say you wish to predict digits from 0-9. Then you could train 10 binary classifiers – one for each digit. When you want to classify an image, get the decision score from each classifier for that image and choose the class whose classifier outputs the highest score.

There’s also the One-versus-One (OvO) strategy.
Same situation: you want to predict digits 0-9. Train a binary classifier for every pair of digits, e.g., (0, 1), (0, 2), … (1, 2)…, and so on.
If there are $N$ classes, you train $N \times (N - 1) / 2$ classifiers.
To make predictions, you run through every classifier and see which class wins the most ‘duels’.
Of course, each classifier only needs to be trained on the part of the training set for the two classes it must distinguish between. This is the main advantage.
That is, it’s good for algorithms that don’t scale well with the size of the training set.

SVM classifiers scale poorly with the size of the training set, so for those algorithms, OvO is preferred. Otherwise, it’s better to do OvA (for the other types of algorithms).

Scikit-Learn will automatically detect when you try to use a binary classification algorithm for a multiclass classification task, and runs OvA (or OvO for SVM classifiers).

from sklearn.svm import SVC

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000]) # use a subset of the training set to speed up training time

svm_clf.predict([some_digit])
> array(['5'], dtype=object)

some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores.round(2)
> array([[ 3.79, 0.73, 6.06, 8.3 , -0.29, 9.3 , 1.75, 2.77, 7.21, 4.82]])
# Above: scores for each class. Highest score is for 5, which could be found with `argmax`. You can do:
class_id = some_digit_scores.argmax()
svm_clf.classes_[class_id]

Under the hood, Scikit-Learn used OvO: it trained 45 binary classifiers, got their decision scores for the image, and selected the class that won the most duels.

And for methods that directly support multiclass prediction tasks (e.g., random forests), Scikit-Learn doesn’t need to do OvA or OvO.
You can force Scikit-Learn to use OvO or OvR with the OneVsOneClassifier or OneVsRestClassifier classes. Create an instance and pass a classifier to its constructor (it doesn’t even need to be a binary classifier):

from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(SVC(random_state=42))
ovr_clf.fit(X_train[:2000], y_train[:2000])

ovr_clf.predict([some_digit])

We do seem to get better results when the data has been scaled:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype("float64"))
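For instance, you might compare the cross-validated accuracy before and after scaling (a sketch, reusing sgd_clf from earlier):

from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")  # typically noticeably higher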

Error Analysis

We can analyze the types of errors our model makes.

First, look at the confusion matrix.
Make predictions with cross_val_predict() and call confusion_matrix().

It’s probably better to visualize it:

from sklearn.metrics import ConfusionMatrixDisplay

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred)
plt.show()

If most images are on the main diagonal, it’s pretty good.

It’s important to normalize the confusion matrix by dividing each value by the total number of images in the corresponding (true) class (i.e. divide by the row’s sum). Do that by setting normalize="true". You can also specify values_format=".0%" to show percentages with no decimals.

We can focus the plot on the errors – make them stand out more – by putting zero weight on the correct predictions.

sample_weight = (y_train_pred != y_train)

ConfusionMatrixDisplay.from_predictions(
    y_train, y_train_pred, sample_weight=sample_weight,
	# normalize="pred" would be to normalize by column. "true" is by row.
    normalize="true", values_format=".0%"
)
plt.show()

First compute the row-normalized confusion matrix, then fill its diagonal with zeros to keep only the errors, and plot the result:

cm = confusion_matrix(y_train, y_train_pred)
norm_conf_mx = cm / cm.sum(axis=1, keepdims=True)  # normalize each row by the true-class total
np.fill_diagonal(norm_conf_mx, 0)                  # zero out the correct predictions
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

And now you can see which errors the classifier makes.
Rows represent actual classes, and columns represent predicted classes.
If the column for a class is bright, many images get misclassified as it. And if the row for a class is bright, it often gets misclassified.
Just remember not to misinterpret the percentages in this diagram. Since we’ve excluded the correct predictions, 36% in row #7 col #9 doesn’t mean 36% of all images of 7s were misclassified as 9s, but that 36% of errors the model made on images of 7s were misclassifications as 9s.

What do you do about the errors? You might gather more training data for the digits that get misclassified.
Or you can add some new features that help the classifier. For example, make an algorithm to count the number of closed loops (8 has 2, 6 has 1, 5 has 0, etc.).
Or you can preprocess the images with, for example, Scikit-Image, Pillow, or OpenCV to make patterns stand out more.

Multilabel Classification

What if you want your classifier to output multiple classes for each instance? E.g., recognizing multiple things in one image.

Such a classifier – one that outputs multiple binary tags – is called a multilabel classification system.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= '7')
y_train_odd = (y_train.astype('int8') % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

knn_clf.predict([some_digit])
> array([[False, True]])

The above code creates a y_multilabel array containing two target labels for each digit image. The first indicates whether it’s large (7, 8, or 9), and the second indicates whether it’s odd. Then we create a KNeighborsClassifier, which supports multilabel classification (not all classifiers do), and train this model using the multiple targets array. some_digit is ‘5’, and it is indeed not large (False) and is odd (True).

There are various ways to evaluate a multilabel classifier. The right metric depends on your project.
One approach is to measure the $F_1$ score for each individual label (or another binary classifier metric) and compute the average score.

Below we compute the average (macro) $F_1$ score across all labels.

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
> 0.976410265560605

If not all labels are equally important (e.g. if you have many more pictures of class X than of classes Y and Z), you can give more weight to the $F_1$ score on X. One option is to give each label a weight equal to its support (the number of instances with that target label): just set average="weighted" when calling f1_score().

To use a classifier that doesn’t natively support multilabel classification, e.g. SVC, you could train one model per label. This may have a hard time capturing dependencies between labels. To solve that, the models can be organized in a chain: when a model makes a prediction, it uses the input features plus all the predictions of the models that come before it in the chain. Scikit-Learn has a ClassifierChain class that does this. It uses the true labels for training by default, feeding each model the appropriate labels depending on its position in the chain.
If you set the cv hyperparameter, it’ll use cross-validation to get “clean” (out-of-sample) predictions from each trained model for every instance in the training set, and these are used to train all models later in the chain.

from sklearn.multioutput import ClassifierChain

chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
chain_clf.fit(X_train[:2000], y_multilabel[:2000])

chain_clf.predict([some_digit])
> array([[0., 1.]])

Multioutput Classification

Multioutput classification, or multioutput-multiclass classification is a generalization of multilabel classification where each label can be multiclass. That is, each label can have more than two possible values.

Multioutput systems aren’t limited to classification tasks. You could have a system that outputs multiple labels per instance, including both class labels and value labels.

The book gives an example where we build a system that removes noise from images. It takes a noisy digit image and outputs a clean digit image, represented as an array of pixel intensities (like MNIST images).
The output is multilabel (one label per pixel) and each label can have multiple values (pixel intensity ranges from 0 to 255). So it’s a multioutput classification system.

Defining data set – adding noise to the MNIST data

np.random.seed(42)
# generates random integers from 0 to 100 (exclusive) with shape (len(X_train), 784)
# recall that X = mnist['data'], X has shape (70000, 784)!
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise

y_train_mod = X_train
y_test_mod = X_test

plot_digit(X_test_mod[0])
plt.show() # shows noisy 7

Training

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])
plot_digit(clean_digit)
plt.show() # shows clean 7

Chapter 4 Training Models

This chapter starts by looking at two ways to train a Linear Regression model:

  • Using a “closed-form” equation (an equation composed only of a finite number of constants, variables, and standard operations such as addition, multiplication, and exponentiation – no infinite sums, limits, integrals, etc.) that directly computes the model parameters that best fit the model to the training set (i.e. minimize the cost function over the training set)
  • Using an iterative optimization algorithm (Gradient Descent) that gradually tweaks model parameters to minimize the cost function over the training set, eventually converging on the same set of parameters as the above method. There are multiple variants of gradient descent (batch GD, mini-batch GD, and Stochastic Gradient Descent).

After that, we look at polynomial regression. This is a more complex model, but it can fit nonlinear datasets. Since it has more parameters than linreg, it is more prone to overfitting. We’ll see how this is the case with learning curves, and then we look at Regularization techniques.

Finally, we look at two more models commonly used for classification: logistic regression and softmax regression.

Linear Regression

Generally, a Linear model makes a prediction by computing a weighted sum of the input features, plus a constant called the bias term (or intercept term):

$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

where $\hat{y}$ is the predicted value, $n$ is the number of features, $x_i$ is the $i$’th feature value, and $\theta_j$ is the $j$’th model parameter, including the bias term $\theta_0$ and the feature weights $\theta_1, \dots, \theta_n$.

This can be written more concisely in the vectorized form:

$\hat{y} = h_\theta(\mathbf{x}) = \theta \cdot \mathbf{x}$

where $h_\theta$ is the hypothesis function, using model parameters $\theta$; $\theta$ is the model’s parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$; $\mathbf{x}$ is the instance’s feature vector, containing $x_0$ to $x_n$, with $x_0$ always equal to 1; and $\theta \cdot \mathbf{x}$ is the dot product of the two vectors, equal to $\theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$.

So how do we train the model?
Training a model means setting its parameters so the model best fits the training set.
So how do we measure how well it fits?
The most common performance measure for regression models is the Root Mean Square Error (RMSE), so we need to find the value of $\theta$ that minimizes the RMSE.
In practice, it’s simpler to minimize the Mean Squared Error (MSE), and it leads to the same result (the value that minimizes a positive function also minimizes its square root).

The MSE of a linear regression hypothesis $h_\theta$ on a training set $\mathbf{X}$ is found with:

$\text{MSE}(\mathbf{X}, h_\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T \mathbf{x}^{(i)} - y^{(i)} \right)^2$

The Normal Equation

There’s a closed-form solution (a mathematical equation that gives the result directly) for finding the value of $\theta$ that minimizes the MSE. This is the Normal equation:

$\hat{\theta} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$

where $\hat{\theta}$ is the value of $\theta$ that minimizes the cost function, and $\mathbf{y}$ is the vector of target values containing $y^{(1)}$ to $y^{(m)}$.

Now let’s try it out.

We’ll generate a linear dataset with the function:

import numpy as np
from sklearn.preprocessing import add_dummy_feature

np.random.seed(42)

m = 100  # number of instances
X = 2 * np.random.rand(m, 1) # column vector
y = 4 + 3 * X + np.random.randn(m, 1) # column vector

X_b = add_dummy_feature(X) # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
theta_best
> array([[4.21509616], [2.77011339]])

Getting $\theta_0 = 4$ and $\theta_1 = 3$ would have been perfect, but due to the noise it is impossible to recover the exact parameters of the original function.

We can use theta_best to make predictions.

X_new = np.array([[0], [2]])
X_new_b = add_dummy_feature(X_new)
y_predict = X_new_b @ theta_best
y_predict
> array([[4.21509616], [9.75532293]])

We can also do linear regression with Scikit-Learn using the LinearRegression class. This class is based on the np.linalg.lstsq() function (the name stands for “least squares”).

As can be seen, they separate the bias term (intercept_) from the feature weights (coef_).

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
> (array([4.21509616]), array([[2.77011339]]))

lin_reg.predict(X_new)
> array([[4.21509616], [9.75532293]])

Let’s try using np.linalg.lstsq() (Least Squares).

theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
theta_best_svd
> array([[4.21509616], [2.77011339]])

The function computes $\hat{\theta} = \mathbf{X}^{+} \mathbf{y}$, where $\mathbf{X}^{+}$ is the pseudoinverse of $\mathbf{X}$ (specifically, the Moore-Penrose inverse). You can use np.linalg.pinv to compute the pseudoinverse directly:

np.linalg.pinv(X_b) @ y # Moore-Penrose pseudoinverse
> array([[4.21509616], [2.77011339]])

The pseudoinverse is computed using Singular Value Decomposition (SVD) to decompose the training set matrix $\mathbf{X}$ into the matrix multiplication of three matrices, $\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$. The pseudoinverse is then computed as $\mathbf{X}^{+} = \mathbf{V} \boldsymbol{\Sigma}^{+} \mathbf{U}^T$. To compute $\boldsymbol{\Sigma}^{+}$, the algorithm takes $\boldsymbol{\Sigma}$ and sets all values smaller than a tiny threshold to zero, then replaces all the nonzero values with their inverse, and finally transposes the resulting matrix.
This is more efficient than computing the Normal equation, and it handles edge cases (the Normal equation may not work if the matrix $\mathbf{X}^T \mathbf{X}$ is singular, i.e. not invertible).
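Here’s a small numpy sketch of that procedure, building the pseudoinverse from the SVD of X_b (the threshold value is illustrative; np.linalg.pinv derives its own from the data):

import numpy as np

U, s, Vt = np.linalg.svd(X_b, full_matrices=False)  # X = U Σ Vᵀ (thin SVD)
s_inv = np.zeros_like(s)
s_inv[s > 1e-10] = 1 / s[s > 1e-10]   # invert the nonzero singular values, zero out tiny ones
X_pinv = Vt.T @ np.diag(s_inv) @ U.T  # X⁺ = V Σ⁺ Uᵀ
X_pinv @ y                            # same result as np.linalg.pinv(X_b) @ y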

Gradient Descent

This is a generic optimization algorithm capable of finding optimal solutions to various problems.
The idea is to tweak parameters iteratively to minimize a cost function.

The algorithm measures the local gradient of the error function w.r.t. the parameter vector $\theta$, and it goes in the direction of descending gradient. When the gradient is 0, you’ve reached a minimum.

In practice, you start by filling $\theta$ with random values (this is called random initialization). Then you improve it gradually, each step attempting to decrease the cost function (e.g. MSE), until the algorithm converges to a minimum.

The size of the steps is determined by the learning rate hyperparameter. If it’s too small, the algorithm will take a long time to converge.
But if it’s too large, you may overshoot the minimum and end up with larger and larger errors – the algorithm may even diverge.

Not all cost functions look like a nice, regular U shape. There can be aspects that make converging on the minimum difficult – for example, local minima vs. the global minimum.
Random initialization may start you off in an area that leads to a local minimum. If there happens to be a steep hill between it and the global minimum, you won’t get there.
Likewise, if you start far from the minimum, you may make slow but steady improvements – and if you stop early, you won’t reach the global minimum.

Convex functions help quite a bit.
For example, the MSE cost function for a linear regression model is a convex function.
Convex means that if you pick any two points on the curve, the line segment joining them is never below the curve.
This implies there are no local minima, only one global minimum.

MSE is also continuous with a slope that never changes abruptly.

These two characteristics mean that gradient descent is guaranteed to approach the global minimum arbitrarily closely (if you wait long enough and the learning rate is not too high).

When using gradient descent, ensure all features have a similar scale; otherwise it will take much longer to converge on the minimum.

Batch Gradient Descent

To implement Gradient Descent, you need to compute the gradient of the cost function w.r.t. each model parameter $\theta_j$. That is, you calculate how much the cost function will change if you change $\theta_j$ just a little bit. This is called a partial derivative.

This computes the partial derivative of the Mean Squared Error (MSE) w.r.t. parameter $\theta_j$, noted $\frac{\partial}{\partial \theta_j} \text{MSE}(\theta)$:

$\frac{\partial}{\partial \theta_j} \text{MSE}(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left( \theta^T \mathbf{x}^{(i)} - y^{(i)} \right) x_j^{(i)}$

But instead of computing the partial derivatives individually, you can compute them all in one go.
The gradient vector, $\nabla_\theta \text{MSE}(\theta)$, contains all the partial derivatives of the cost function – one for each model parameter:

$\nabla_\theta \text{MSE}(\theta) = \frac{2}{m} \mathbf{X}^T \left( \mathbf{X} \theta - \mathbf{y} \right)$

The algorithm is called batch gradient descent because we do calculations over the whole batch of training data.
This is also why it’s slow on large training sets.
Gradient descent scales well with the number of features – it’s faster than the Normal equation or Singular Value Decomposition (SVD) when training a Linear Regression model.

Given the gradient vector, which points ‘uphill’, you go in the opposite direction to go ‘downhill’. This means subtracting $\nabla_\theta \text{MSE}(\theta)$ from $\theta$. This is where the learning rate $\eta$ comes into play: multiply the gradient vector by $\eta$ to determine the size of the ‘downhill’ step: $\theta^{(\text{next step})} = \theta - \eta \nabla_\theta \text{MSE}(\theta)$.

Here’s an implementation:

eta = 0.1  # learning rate
n_epochs = 1000
m = len(X_b)  # number of instances

np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

theta
> array([[4.21509616], [2.77011339]])

You can use grid search to find a good learning rate.
Finding a good number of epochs is harder: too few and you’ll stop before reaching a good solution; too many and you’ll waste compute and time. A simple solution is early stopping – set a very large number of epochs but interrupt the algorithm when the gradient vector becomes tiny, i.e. when its norm drops below some number $\epsilon$ (called the tolerance), which happens when gradient descent has (almost) reached the minimum.
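A minimal sketch of that tolerance check, applied to the batch GD loop above (the epsilon value is illustrative):

import numpy as np

eta = 0.1
n_epochs = 100_000   # deliberately huge
epsilon = 1e-7       # tolerance
m = len(X_b)

np.random.seed(42)
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    if np.linalg.norm(gradients) < epsilon:
        break        # gradient vector is tiny: we're (almost) at the minimum
    theta = theta - eta * gradients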

Stochastic Gradient Descent

Batch gradient descent uses the whole training set to compute gradients at every step. This is slow on large training sets.

Stochastic Gradient Descent instead picks a random instance in the training set at every step and computes the gradients based on only that instance.
This makes it faster, given there’s less data to manipulate each iteration, and it also makes it possible to train on huge datasets, as only one instance needs to be in memory at each iteration.
Stochastic Gradient Descent can be implemented as an out-of-core algorithm.

However, due to the stochastic nature of the algorithm, it’s much less regular than batch gradient descent. The cost function will bounce up and down, rather than decreasing steadily as it would with a more deterministic approach.
Over time, it’ll get very close to the minimum, but will never settle there – it’ll continue to bounce around. That also means you won’t find the optimal solution, only a good enough one.

This can be a plus: you may jump out of local minima on irregular cost functions. So there’s a better chance of finding the global minimum than batch gradient descent.

Given the duality of being able to escape local minima, but never settling at the minimum, it can be useful to gradually reduce the learning rate.
The steps start out large, which helps make quick progress and escape local minima, and then they get smaller and smaller, allowing the algorithm to settle at the global minimum.
This process is similar to simulated annealing, which is an algorithm inspired by the process of annealing from metallurgy, where molten metal is slowly cooled down.
The function used to determine the learning rate at each iteration is called the learning schedule. If you reduce the learning rate too quickly, you may get stuck in a local minimum, or end up frozen halfway to the minimum. If it’s reduced too slowly, you may jump around the minimum for a long time, and end up with a suboptimal solution if you end too early.

Implementation

n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters
m = len(X_b)

def learning_schedule(t):
    return t0 / (t + t1)

np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization

for epoch in range(n_epochs):
    for iteration in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T @ (xi @ theta - yi)  # we don't divide by m because we only use one instance
        eta = learning_schedule(epoch * m + iteration)
        theta = theta - eta * gradients

theta
> array([[4.21076011], [2.74856079]])

By convention, we iterate by rounds of m iterations; each round is called an epoch.

When using SGD, the training instances must be independent and identically distributed (IID) to ensure the parameters get pulled towards the global optimum on average. You can ensure this by shuffling the instances during training (pick each instance randomly, or shuffle the training set at the beginning of each epoch). If you don’t shuffle – for example if the instances are sorted by label – SGD will start by optimizing for one label, then the next, and so on, and it won’t settle close to the global minimum.

Linear Regression with SGD using Scikit-Learn

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(
    max_iter=1000, # runs for 1k epochs
    tol=1e-5, # or until loss drops by less than tol...
    n_iter_no_change=100,  # during 100 epochs
    eta0=0.01,  # starts with this learning rate
    penalty=None,  # no regularization
    random_state=42
)

sgd_reg.fit(X, y.ravel())  # ravel() flattens the array: fit expects a 1D array

sgd_reg.intercept_, sgd_reg.coef_
> (array([4.21278812]), array([2.77270267]))

Mini-Batch Gradient Descent

At each step, instead of computing the gradients based on the full training set (batch GD), or based on just one instance (SGD), this method computes the gradients on small random sets of instances called mini-batches.

The advantage of mini-batch GD over SGD is the performance boost you can get from hardware optimization of matrix operations, esp. when using GPUs.

Mini-batch GD will get closer to the minimum than SGD, because the algorithm’s progress in parameter space is less erratic – but it may be harder for it to escape local minima.
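A rough mini-batch GD sketch (reusing X_b, y, and learning_schedule from the SGD example above; the batch size is arbitrary):

import numpy as np

n_epochs = 50
batch_size = 20
m = len(X_b)

np.random.seed(42)
theta = np.random.randn(2, 1)  # random initialization

t = 0
for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)              # reshuffle the training set each epoch
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        t += 1
        xi = X_shuffled[start:start + batch_size]
        yi = y_shuffled[start:start + batch_size]
        gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)  # gradient on the mini-batch only
        theta = theta - learning_schedule(t) * gradients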

Polynomial Regression

You can use a linear model to fit nonlinear data.
A simple way is adding powers of each feature as new features, then training a linear model on this extended set of features.
This is called Polynomial regression.

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0], X_poly[0]
> (array([-0.75275929]), array([-0.75275929, 0.56664654]))

We generate a nonlinear and noisy dataset based on a Quadratic Equation.
Then we use PolynomialFeatures from Scikit-Learn to transform the training data, adding the square (second-degree Polynomial) of each feature in the training set as a new feature.
X_poly contains the original features of X plus the square of the (single) feature.
We can use a LinearRegression model now:

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_
> (array([1.78134581]), array([[0.93366893, 0.56456263]]))

The model estimates $\hat{y} = 0.56\, x^2 + 0.93\, x + 1.78$, when in fact the original function was $y = 0.5\, x^2 + 1.0\, x + 2.0 + \text{Gaussian noise}$.

When there are multiple features, polynomial regression can find relationships between them. Plain linear models cannot do this. This is possible because PolynomialFeatures also adds all combinations of features up to the given degree.
For example, given two features $a$ and $b$, PolynomialFeatures with degree=3 would not only add the features $a^2$, $a^3$, $b^2$, and $b^3$, but also the combinations $ab$, $a^2 b$, and $a b^2$.

Given degree $d$, PolynomialFeatures transforms an array with $n$ features into an array with $\frac{(n + d)!}{d!\, n!}$ features, where $n!$ is the factorial of $n$. Be aware of this combinatorial explosion of the number of features.

Learning Curves

High-degree polynomial regression will likely fit the training data better than plain Linear Regression – but it may well be overfitting.

You could use Cross Validation to check if your model generalizes. If it does well on training data, but poorly on CV metrics, then it’s overfitting. If it does poorly on both, it’s underfitting.

But you can also use Learning Curves.
These are plots of the model’s training error and validation error as a function of the training iterations.
Evaluate the model at regular intervals during training on both train and validation sets, and plot the results.
If you can’t train the model incrementally (via partial_fit() or warm_start), then train it several times on gradually larger subsets of the training set.

Scikit-Learn has learning_curve() to help.
It trains and evaluates the model via Cross Validation.
By default, it retrains the model on growing subsets of training data, but if the model supports incremental learning, use exploit_incremental_learning=True in the function parameters.
It returns training set sizes for each eval & training + validation scores for each size, and for each cv fold.

from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(0.01, 1.0, 50), cv=5,
    scoring="neg_mean_squared_error"
)
train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)

plt.plot(train_sizes, train_errors, "r-+", label="Training set", linewidth=2)
plt.plot(train_sizes, valid_errors, "b-", label="Validation set", linewidth=3)
plt.show()

This model is underfitting.
Looking at training error: In the beginning, there are few enough instances such that it can fit them perfectly. However, since the data is not linear, and it’s noisy, it becomes increasingly evident that the model is unable to fit it. That’s why the training error goes up and finally reaches a plateau.
Looking at validation error: In the beginning, the model cannot generalize, as there are too few samples. Then, as it gets exposed to more samples, the validation error goes down. But it gets stuck at the same plateau as train, as the model doesn’t fit the data well.

from sklearn.pipeline import make_pipeline

polynomial_regression = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    LinearRegression()
)

train_sizes, train_scores, valid_scores = learning_curve(
    polynomial_regression, X, y, train_sizes=np.linspace(0.01, 1.0, 50), cv=5,
    scoring="neg_mean_squared_error"
)

train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)

plt.figure(figsize=(6, 4))
plt.plot(train_sizes, train_errors, "r-+", linewidth=2, label="train")
plt.plot(train_sizes, valid_errors, "b-", linewidth=3, label="valid")
plt.legend(loc="upper right")
plt.xlabel("Training set size")
plt.ylabel("RMSE")
plt.grid()
plt.axis([0, 80, 0, 2.5])
plt.show()

Now, using a 10th-degree polynomial model on the same data seems to help.
The error on the training data is noticeably lower. There is a gap between the curves: the model performs better on train than validation, which is an indicator of Overfitting.
However, using a larger dataset, the two curves would get closer. A way to improve an Overfitting model is to give it more data until the validation error reaches the training error.

Regularized Linear Models

You can reduce Overfitting through Regularization (constraining the model). The fewer degrees of freedom it has, the harder for it to overfit the data.

A simple way to regularize a polynomial model is to reduce the degree of the polynomial.

For a linear model, you do it by constraining the weights of the model.
In this part, we look at three ways to constrain the weights:

  • Ridge Regression
  • Lasso Regression
  • Elastic Net Regression

Ridge Regression

Ridge regression (aka Tikhonov regularization) is a regularized version of Linear Regression:
a regularization term equal to $\frac{\alpha}{m} \sum_{i=1}^{n} \theta_i^2$ is added to the MSE.
This forces the algorithm to not only fit the data, but also keep the model weights as small as possible.
The regularization term should only be added to the cost function during training. You’d use the standard RMSE/MSE afterwards to evaluate model performance.

The hyperparameter $\alpha$ controls how much to regularize the model.
If $\alpha = 0$, it’s just Linear Regression. If $\alpha$ is very large, all weights end up very close to zero, and you’d get a flat line through the data’s mean.

Ridge regression cost function:

$J(\theta) = \text{MSE}(\theta) + \frac{\alpha}{m} \sum_{i=1}^{n} \theta_i^2$

The bias term $\theta_0$ is not regularized (the sum starts at $i = 1$, not 0).

If we define $\mathbf{w}$ as the vector of feature weights ($\theta_1$ to $\theta_n$), then the regularization term is equal to $\frac{\alpha}{m} \left( \lVert \mathbf{w} \rVert_2 \right)^2$, where $\lVert \mathbf{w} \rVert_2$ is the $\ell_2$ (Euclidean) norm of the weight vector.

For batch gradient descent, add $\frac{2 \alpha \mathbf{w}}{m}$ to the part of the MSE gradient vector that corresponds to the feature weights, without adding anything to the gradient of the bias term.

Ridge regression closed-form solution
You can perform ridge regression either with gradient descent or by computing a closed-form equation:

$\hat{\theta} = \left( \mathbf{X}^T \mathbf{X} + \alpha \mathbf{A} \right)^{-1} \mathbf{X}^T \mathbf{y}$

where $\mathbf{A}$ is the $(n + 1) \times (n + 1)$ identity matrix, except with a 0 in the top-left cell, corresponding to the bias term.

You can use Scikit-Learn’s Ridge class, which here uses a matrix factorization technique by André-Louis Cholesky:

from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=0.1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
> array([[1.5325833]])

And Stochastic Gradient Descent:

sgd_reg = SGDRegressor(
	penalty="l2",  # sets regularization type. "l2" means we add regularization to the MSE cost function equal to alpha times the square of the $\ell_2$ norm. This is like ridge regression, except no division by m - so we pass alpha=0.1/m to get the same as Ridge(alpha=0.1).
	alpha=0.1/m,
	tol=None,
	max_iter=1000,
	eta0=0.01,
	random_state=42
)
sgd_reg.fit(X, y.ravel())  # ravel because fit expects 1d targets
sgd_reg.predict([[1.5]])
> array([1.55302613])

Lasso Regression

LASSO stands for least absolute shrinkage and selection operator regression.
This is also a regularized Linear Regression: just like ridge regression, it adds a regularization term to the cost function, but it uses the $\ell_1$ norm of the weight vector instead of the square of the $\ell_2$ norm.

The $\ell_1$ norm is multiplied by $2\alpha$ (whereas the $\ell_2$ norm was multiplied by $\alpha / m$ in ridge regression); these factors are chosen so that the optimal $\alpha$ value is independent of the training set size.

Lasso regression cost function:

$J(\theta) = \text{MSE}(\theta) + 2 \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert$

Lasso regression tends to eliminate the weights of the least important features (setting them to 0).
That is, it automatically does feature selection and outputs a sparse model with few nonzero weights.

To keep Gradient Descent from bouncing around the optimum at the end, you’ll need to gradually reduce the learning rate. While it’ll keep bouncing around the optimum, the steps get smaller and smaller, meaning it’ll converge.

The lasso cost function is not differentiable at $\theta_i = 0$ (for $i = 1, 2, \dots, n$), but you can still use gradient descent by switching to a subgradient vector whenever any $\theta_i = 0$, e.g. $g(\theta) = \nabla_\theta \text{MSE}(\theta) + 2\alpha\, \operatorname{sign}(\theta)$, where $\operatorname{sign}(\theta_i)$ is $-1$, $0$, or $+1$ depending on the sign of $\theta_i$.

from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
> array([1.53788174])

Could also use SGDRegressor(penalty="l1", alpha=0.1).

Elastic Net Regression

Elastic net is a middle ground between ridge regression and lasso regression.
The regularization term is a weighted sum of both their terms, controlled by the mix ratio $r$: with $r = 0$ it’s equivalent to ridge regression, and with $r = 1$ it’s equivalent to lasso.

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
> array([1.54333232])

Back to overall regularization
When to use what:

  • Always have at least a little regularization: avoid plain linear regression
  • Ridge is a good default.
  • If you suspect only a few features are useful, prefer lasso or elastic net, because they tend to reduce the useless features’ weights to zero. Between those two, prefer elastic net, as lasso can behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.

Early Stopping

You can regularize iterative learning algorithms (like Gradient Descent) by stopping when the validation error reaches a minimum.

This way, you capture the model right before it starts to overfit – the best version of that model.
Geoffrey Hinton called it a “beautiful free lunch”.

For gradient descent methods whose curves are not smooth (stochastic, mini-batch), it can be hard to know if you’ve reached a minimum or not. You can therefore stop only after it’s been above the minimum for a while — when you’re confident the model won’t get better — and then roll back parameters to a point where the validation error was at a minimum.

The basic approach is to track the best validation error in a separate variable and use deepcopy from copy to keep a copy of the best model seen so far.
You'd use partial_fit (or similar) to train incrementally, one epoch at a time.
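
A rough sketch of that approach (not the book's exact code): it assumes X_train, y_train, X_valid, y_valid already exist and have been preprocessed, and the hyperparameters are just placeholders.

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(penalty=None, eta0=0.002, random_state=42)  # no penalty; early stopping is the regularizer
n_epochs = 500
best_valid_error = float("inf")
best_model = None

for epoch in range(n_epochs):
    sgd_reg.partial_fit(X_train, y_train.ravel())  # one incremental training pass
    y_valid_pred = sgd_reg.predict(X_valid)
    valid_error = mean_squared_error(y_valid, y_valid_pred)
    if valid_error < best_valid_error:     # new best validation error
        best_valid_error = valid_error
        best_model = deepcopy(sgd_reg)     # keep a copy of the model at its best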

Logistic Regression

Logistic Regression (or logit regression) is often used to estimate the probability that an instance belongs to a particular class. If the estimated probability is over some threshold (typically 50%), the model predicts that the instance belongs to that class (the positive class); otherwise, it predicts that it does not (the negative class).

Estimating Probabilities

Logistic regression works by computing a weighted sum of the input features (plus a bias term), just like linear regression. Instead of outputting the result directly, it outputs the logistic of the result:

Logistic regression model estimated probability (vectorized form): $\hat{p} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \sigma\!\left(\boldsymbol{\theta}^\intercal \mathbf{x}\right)$

The logistic, noted $\sigma(\cdot)$, is a sigmoid (S-shaped) function that outputs a number between 0 and 1: $\sigma(t) = \frac{1}{1 + e^{-t}}$.

Given the probability that an instance belongs to the positive class, it’ll make the prediction by checking against a threshold value.

The input to the $\sigma$ function is often denoted $t$ and is often called the logit. This is because the logit function, $\operatorname{logit}(p) = \log\!\left(p / (1 - p)\right)$, is the inverse of the logistic function: if you compute the logit of the estimated probability $\hat{p}$, you get $t$ back. The logit is also called the log-odds because it's the log of the ratio between the estimated probability for the positive class and the estimated probability for the negative class.
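
As a quick numeric sanity check (my own toy example, not from the book), the logit of the logistic of $t$ gives $t$ back:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

t = 2.0
p = sigmoid(t)        # ≈ 0.8808, the logistic of t
np.log(p / (1 - p))   # the logit (log-odds) of p gives back t = 2.0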

Training and Cost Function

The objective of training is to set the parameter vector $\boldsymbol{\theta}$ so the model estimates high probabilities for positive instances ($y = 1$) and low probabilities for negative instances ($y = 0$).
This is captured in the cost function for a single instance $\mathbf{x}$:

Cost function of a single training instance: $c(\boldsymbol{\theta}) = -\log(\hat{p})$ if $y = 1$, and $c(\boldsymbol{\theta}) = -\log(1 - \hat{p})$ if $y = 0$

The cost will be large if the model estimates a probability close to 0 for a positive instance, since $-\log(t)$ grows very large as $t$ approaches 0.
It will also be large if the model estimates a probability close to 1 for a negative instance.
On the other hand, $-\log(t)$ is close to 0 when $t$ is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance.

The cost function over the whole training set is the average cost over all training instances:

Logistic regression cost function (log loss): $J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\!\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - \hat{p}^{(i)}\right) \right]$

It can be shown with Bayesian inference that minimizing this loss results in the model with the maximum likelihood of being optimal, assuming the instances follow a Gaussian distribution around the mean of their class.
That is the assumption you make when you use log loss; the more wrong it is, the more biased the model will be.
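
For intuition, here's a tiny NumPy check (made-up labels and probabilities) showing that the hand-computed average matches sklearn.metrics.log_loss:

import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])                # hypothetical true labels
p_hat = np.array([0.9, 0.2, 0.6, 0.95])   # estimated probabilities for the positive class

# Average per-instance cost, computed by hand:
manual = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
manual, log_loss(y, p_hat)  # both ≈ 0.22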

There’s no known closed-form equation to compute the value of $\boldsymbol{\theta}$ that minimizes this cost function. But it is convex, so Gradient Descent or other optimization algorithms will find the global minimum.

Partial derivatives of the logistic cost function: $\frac{\partial}{\partial \theta_j} J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \left( \sigma\!\left(\boldsymbol{\theta}^\intercal \mathbf{x}^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$

For each instance, it computes the prediction error and multiplies it by the feature value, then computes the average over all training instances. Once we have the gradient vector with all partial derivatives, we can use it in the batch gradient descent algorithm.

For stochastic GD, use one instance at a time. For mini-batch GD, use a mini-batch at a time.
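
As a small usage sketch (toy data I made up, not from the notes above), Scikit-Learn's LogisticRegression minimizes this log loss for you:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature binary problem
rng = np.random.default_rng(42)
X = rng.uniform(0, 3, size=(100, 1))   # e.g. petal widths in cm
y = (X[:, 0] > 1.5).astype(int)        # positive class when the feature exceeds 1.5

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X, y)

log_reg.predict_proba([[1.7]])   # estimated probabilities for classes 0 and 1
log_reg.predict([[1.7]])         # prediction using the default 50% threshold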

Softmax Regression

We can generalize the logistic regression model to support multiple classes directly without having to train and combine multiple binary classifiers (One-versus-One or One-versus-the-Rest). This is Softmax Regression, or multinomial logistic regression (Multinomial Classifier).

When given an instance $\mathbf{x}$, the softmax regression model first computes a score $s_k(\mathbf{x}) = \left(\boldsymbol{\theta}^{(k)}\right)^\intercal \mathbf{x}$ for each class $k$, then estimates the probability of each class by applying the softmax function to the scores.

Each class has its own dedicated parameter vector $\boldsymbol{\theta}^{(k)}$, and these are typically stored as rows in a parameter matrix $\boldsymbol{\Theta}$.

Once the score of every class has been computed for the instance $\mathbf{x}$, you can estimate the probability $\hat{p}_k$ that the instance belongs to class $k$ by running the scores through the softmax function: it computes the exponential of every score, then normalizes them by dividing by the sum of all the exponentials. The scores are generally called logits or log-odds (although they are actually unnormalized log-odds):

$\hat{p}_k = \sigma\!\left(\mathbf{s}(\mathbf{x})\right)_k = \frac{\exp\!\left(s_k(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp\!\left(s_j(\mathbf{x})\right)}$

where $K$ is the number of classes, $\mathbf{s}(\mathbf{x})$ is a vector containing the scores of each class for the instance $\mathbf{x}$, and $\sigma\!\left(\mathbf{s}(\mathbf{x})\right)_k$ is the estimated probability that the instance belongs to class $k$, given the scores of each class for that instance.

The prediction is the class with the highest estimated probability, which is simply the class with the highest score:

$\hat{y} = \underset{k}{\operatorname{argmax}}\ \sigma\!\left(\mathbf{s}(\mathbf{x})\right)_k = \underset{k}{\operatorname{argmax}}\ s_k(\mathbf{x})$

The argmax operator returns the value of a variable that maximizes a function. Here, it returns the value of $k$ that maximizes the estimated probability $\sigma\!\left(\mathbf{s}(\mathbf{x})\right)_k$.
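
To make the computation concrete, here's a tiny NumPy sketch (made-up scores, not the book's code) that applies the softmax to a score vector and picks the argmax:

import numpy as np

# Hypothetical scores s(x) for K = 3 classes, for one instance x
scores = np.array([2.0, 1.0, 0.1])

exp_scores = np.exp(scores)
probas = exp_scores / exp_scores.sum()  # softmax: exponentiate, then normalize
print(probas)           # ≈ [0.659, 0.242, 0.099], sums to 1
print(probas.argmax())  # 0 -> the class with the highest score wins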

Softmax regression is multiclass, not multioutput! So it can only be used with mutually exclusive classes – you can't use it to recognize multiple people in one photo.
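
For what it's worth, Scikit-Learn's LogisticRegression trains a softmax (multinomial) model when given more than two classes and the default lbfgs solver. A minimal sketch on the iris petal features, with an illustrative C value (C is the inverse of the regularization strength):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 2:]   # petal length and petal width only
y = iris.target        # 3 classes: setosa, versicolor, virginica

softmax_reg = LogisticRegression(C=30, random_state=42)
softmax_reg.fit(X, y)

softmax_reg.predict([[5, 2]])         # predicted class for a 5 cm x 2 cm petal
softmax_reg.predict_proba([[5, 2]])   # estimated probability for each class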

How to train
The objective is to get a model that estimates a high probability for the target class (and, consequently, low probabilities for the other classes).
Minimizing the cost function below, called cross entropy, does just that, because it penalizes the model when it estimates a low probability for the target class.
Cross Entropy cost function: