Notes on

Hands-on Machine Learning With Scikit-Learn, Keras, and TensorFlow

by Aurélien Géron



The author has provided a fantastic GitHub repository containing notebooks for each of the chapters and more. I’d highly recommend you also take a look there!

As I wrote in my review: this is a fantastic book.
If you’re new to AI/ML, I can highly recommend it.
Note that TensorFlow has become less popular in the industry. Around the time I reached neural networks, I switched over to using PyTorch while following this book.
Using PyTorch while reading this book is also good practice: you can’t just copy over the code and passively follow along.

You can find some of my code here: chhoumann/ml.

Chapter 1 The Machine Learning Landscape

What Is Machine Learning?

The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample). The part of a machine learning system that learns and makes predictions is called a model.

Training set = examples used.
Training instance / sample = each training example.
Model = what makes the predictions.

Why Use Machine Learning?

Why even use machine learning?

A key advantage of machine learning over traditional programming is easily illustrated by considering a spam filter (which was also one of the first uses!).
Without machine learning, here are the steps you’d likely need to take:

  1. Study the problem
  2. Write rules
  3. Evaluate
  4. Launch / analyze errors

But since the problem is hard, step 2 is incredibly time-consuming, and you’d likely end up with a long list of complex rules.

This is what ML can do for you. Switch out step 2 with “train an ML model” on some data, and it’ll handle the spam classification for you!
And if the spammers continuously learn and improve their methods, you simply automate the training, evaluation, launching, and data updates. Then you’ll keep blocking their ‘improved’ spam mails without ever having to write new rules yourself.

But the model itself can also be inspected to gain a better understanding of the problem. A spam filter would be able to reveal unsuspected correlations or new trends. Digging into large amounts of data to discover hidden patterns is Data mining.

So, ML is great for:

  • Problems where existing solutions require a lot of fine-tuning or long lists of rules.
  • Complex problems where traditional approaches don’t give good solutions – perhaps ML techniques can.
  • Fluctuating environments, because the ML system can be retrained on new data, and therefore always be kept up to date.
  • Getting insights about complex problems and large amounts of data.

Types of Machine Learning Systems

There are many types, so we classify them in broad categories based on the following criteria:

  • How they are supervised during training (supervised, unsupervised, semi-supervised, self-supervised, and others)
  • Whether they can learn incrementally on the fly (Online Learning vs. Offline Learning)
  • Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model (instance-based vs. model-based learning)

These aren’t exclusive, and can be combined in any way you want.
E.g. a SotA spam filter might learn on the fly using a deep neural network model trained using human-provided examples of spam/ham. So it would be an online, model-based, supervised learning system.

The words target and label are generally treated as synonyms in supervised learning, but target is more common in regression tasks and label is more common in classification tasks. Moreover, features are sometimes called predictors or attributes. These terms may refer to individual samples (e.g., “this car’s mileage feature is equal to 15,000”) or to all samples (e.g., “the mileage feature is strongly correlated with price”).

  • Attribute = data type (e.g. “Mileage”)
  • Feature has several meanings, but often means attribute + its value (e.g. “Mileage = 15,000”).

Visualization algorithms are also good examples of unsupervised learning: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted

Visualization algorithms take unlabeled data and give a 2D or 3D representation that is easy to plot.
They try to keep structure, so you can understand how the data is organized.

A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be strongly correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.

Dimensionality reduction: simplify data without losing too much information.
For example, you can merge several correlated features into one. This is called feature extraction.
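A common dimensionality-reduction algorithm is PCA (covered later in the book). Here’s a minimal sketch on a small toy dataset (my own example, not the book’s):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 4 correlated numerical features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # merged into 2 new features
print(pca.explained_variance_ratio_)  # how much information each new feature keeps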

Yet another important unsupervised task is anomaly detection—for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm.

The system is shown mostly normal instances during training, so it learns to recognize them; then, when it sees a new instance, it can tell whether it looks like a normal one or whether it is likely an anomaly (see Figure 1-10).

A very similar task is novelty detection: it aims to detect new instances that look different from all instances in the training set.

This requires having a very “clean” training set, devoid of any instance that you would like the algorithm to detect

Anomaly detection seeks to find outliers; the training data is mostly normal instances, possibly with a few outliers mixed in. Novelty detection seeks to flag new instances that look different from everything in the training set, so it needs a very clean training set with no outliers at all.
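As a small illustration of anomaly detection (my own sketch, not the book’s code at this point), scikit-learn’s IsolationForest can flag outliers in unlabeled data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # mostly "normal" instances
outliers = rng.uniform(low=-8, high=8, size=(10, 2))     # a few anomalies
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.02, random_state=42)
detector.fit(X)
labels = detector.predict(X)  # +1 = looks normal, -1 = flagged as anomaly
print((labels == -1).sum(), "instances flagged as anomalies")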

Finally, another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes.

For example, suppose you own a supermarket. Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak.

Thus, you may want to place these items close to one another.

Association rule learning is about finding relations between attributes.

E.g. you’d use it on sales logs to find that people who bought X and Y also buy Z. So you place those items close to each other.

Semi-supervised learning

Where you deal with partially labeled data.

Most of these algorithms are combinations of unsupervised & supervised algorithms.

This is what Google Photos does when you upload images. It recognizes person A in some photos, person B in others (that’s the unsupervised part: it clusters). Then you tell it who those people are – one label per person, and it’ll name everyone in every photo.

Self-supervised learning

Another approach to machine learning involves actually generating a fully labeled dataset from a fully unlabeled one.
Again, once the whole dataset is labeled, any supervised learning algorithm can be used.
This approach is called self-supervised learning.

For example, if you have a large dataset of unlabeled images, you can randomly mask a small part of each image and then train a model to recover the original image (Figure 1-12).
During training, the masked images are used as the inputs to the model, and the original images are used as the labels.

Self-supervised learning is generating a fully labeled dataset from a fully unlabeled one.

For example, given a large dataset of unlabeled images, you might mask a random small part of each image and train a model to recover the original image. The inputs are the masked images and the original images are used as labels.

The model you get here can be useful, but is often not the final goal. You might want to tweak and fine-tune the model for a slightly different task.

So following the example before, if you really want a pet species classification model:

  • Have large dataset of unlabeled photos of pets
  • Train image-repairing model with self-supervised learning
  • When it performs well, it should be able to distinguish pet species – e.g., it knows not to add a dog’s face to an image of a cat.
  • Assuming your model’s architecture allows it (most NN architectures do), tweak the model so it predicts pet species instead of repairing images.
  • Fine-tune the model on a labeled dataset – it already knows the difference, but it needs to learn the mapping between the species it knows and the labels we expect.

Transferring knowledge from one task to another is called transfer learning, and it’s one of the most important techniques in machine learning today, especially when using deep neural networks (i.e., neural networks composed of many layers of neurons).
We will discuss this in detail in Part II.

Transfer learning = transferring knowledge from one task to another. Great with deep neural networks.

Reinforcement learning

Reinforcement learning is a very different beast.

The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as shown in Figure 1-13).

It must then learn by itself what is the best strategy, called a policy, to get the most reward over time.

A policy defines what action the agent should choose when it is in a given situation.

Reinforcement learning agents observe the environment, select and perform actions, and get rewards in return (or penalties, which are just negative rewards)

It learns by itself what the best strategy is, which is called a policy, to maximize reward over time.
Policies define what actions the agent should take in given situations.

  1. Observe
  2. Select action using policy
  3. Take action
  4. Get reward or penalty
  5. Update policy (learning step)
  6. Iterate until optimal policy is found
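To make the loop above concrete, here’s a minimal tabular Q-learning sketch on a made-up “corridor” environment (my own toy example, not from the book):

import numpy as np

# Toy 1D corridor: states 0..4, start at 0, reward +1 for reaching state 4.
# Actions: 0 = move left, 1 = move right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # the policy is derived from this table
alpha, gamma, epsilon = 0.2, 0.9, 0.3
rng = np.random.default_rng(0)

for episode in range(300):
    state = 0
    while state != n_states - 1:
        # select an action using an epsilon-greedy policy
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        # take the action, observe the reward and the next state
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # update the policy (learning step: Q-learning update)
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))  # should prefer "move right" (action 1) in every non-terminal state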

A good example is DeepMind’s AlphaGo, which beat the world champion Ke Jie at the game of Go in May 2017.

Batch Versus Online Learning

Batch Learning: the system cannot learn incrementally, so it must be trained on all available data. This takes a lot of time and computing resources, so it’s typically done offline. It may even be impossible to use batch learning if the dataset is too large. And if your system needs to learn autonomously and has limited resources (e.g. a Mars rover), it can’t carry around lots of training data and spend resources training for hours a day. In such a case, you’d want the incremental learning capabilities of an online learning system.

Offline learning would be where the system is trained first, then launched into production where it runs without learning anymore – it just applies what it has learned.

Online learning is where you train the system incrementally by feeding it data sequentially, either individually or in small groups called mini-batches. Each step is fast and cheap, so the system can learn on the fly. That is, it can learn while being deployed.

Online learning algorithms can be used to train models on huge datasets that cannot fit in one machine’s main memory (this is out-of-core learning – which is usually done offline… that is, not on the live system. So it’s more like incremental learning.).
The algorithm loads part of the data, runs a training step on that data, and repeats until it’s finished with all the data.
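A minimal sketch of this incremental style using scikit-learn’s SGDRegressor and its partial_fit() method, with synthetic mini-batches standing in for data that arrives over time (my own example):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=42)

# Pretend the data arrives in mini-batches that don't all fit in memory at once.
for _ in range(100):
    X_batch = rng.uniform(-1, 1, size=(32, 3))
    y_batch = X_batch @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=32)
    model.partial_fit(X_batch, y_batch)  # one incremental training step per batch

print(model.coef_)  # should approach [2, -1, 0.5]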

Online learning systems have an important parameter used to determine how fast they adapt to changing data. This is called the learning rate. If it’s high, the system rapidly adapts to new data – and therefore quickly forgets old data. If it’s low, the system will have more inertia – it’ll learn more slowly, but also be less sensitive to noise or outliers.

Unfortunately, a model’s performance tends to decay slowly over time, simply because the world continues to evolve while the model remains unchanged.

This phenomenon is often called model rot or data drift.

The solution is to regularly retrain the model on up-to-date data.

How often you need to do that depends on the use case: if the model classifies pictures of cats and dogs, its performance will decay very slowly, but if the model deals with fast-evolving systems, for example making predictions on the financial market, then it is likely to decay quite fast.

A consequence of using batch (offline) learning is that the model’s performance decays over time due to the world evolving, while the model stays the same.

This is called model rot or data drift.
The solution is to regularly retrain the model on up-to-date data. How often depends on the use case.

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate.

If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data (and you don’t want a spam filter to flag only the latest kinds of spam it was shown).

Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points (outliers).

What learning rate is, and why it matters.

Having a high learning rate means your system learns quickly and adapts quickly to new data, but also quickly forgets old data.
Setting a low learning rate means the system has more inertia (learns slowly), but it’s also less sensitive to noise or outliers.

A big challenge with online learning is that if bad data is fed to the system, the system’s performance will decline, possibly quickly (depending on the data quality and learning rate).

If it’s a live system, your clients will notice.

For example, bad data could come from a bug (e.g., a malfunctioning sensor on a robot), or it could come from someone trying to game the system (e.g., spamming a search engine to try to rank high in search results).

To reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance.

You may also want to monitor the input data and react to abnormal data; for example, using an anomaly detection algorithm.

A consequence of using online learning is that, if bad data is fed to the system, the system’s performance suffers.

This can happen quickly, depending on data quality and learning rate.

If your customers rely on this system, they’ll notice.

It could come from spam, malfunctioning sensors, etc.

Monitor your system and switch learning off if you see a drop in performance. You can also monitor the input data and react to abnormal data.

Instance-Based Versus Model-Based Learning

We also categorize ML systems by how they generalize. That is, how they make predictions on examples they haven’t seen before.

By convention, the Greek letter θ (theta) is frequently used to represent model parameters.

Nice to know: θ represents model parameters.

Instance-based learning

This is where your system learns some set of known points, and generalizes new data by comparing them to the known data using a similarity measure.

Model-based learning and a typical machine learning workflow

This is where you generalize from examples by building a model of them, and then using that model to make predictions.

How can you know which values will make your model perform best?

To answer this question, you need to specify a performance measure.

You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is.

For linear regression problems, people typically use a cost function that measures the distance between the linear model’s predictions and the training examples; the objective is to minimize this distance.

This is where the linear regression algorithm comes in: you feed it your training examples, and it finds the parameters that make the linear model fit best to your data.

This is called training the model.

You can define a utility function (or fitness function) that measures how good your model is, or a cost function that measures how bad it is.

When doing linear regression problems, people usually use cost functions that measure the distance between the linear model’s predictions and the training examples (and try to minimize it).

You train your Linear Regression model by giving it training examples, such that it can find the parameters (weights, biases) that it needs to best fit the data.

Linear Regression, of course, is model based. You could also do kNN to make it instance based (so it finds the instances it’s closest to and predicts based on those, given the new input).
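A tiny sketch of the difference on made-up data (my own example): LinearRegression learns parameters from the data (model-based), while KNeighborsRegressor just memorizes the training instances and compares new points to them (instance-based):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Tiny made-up dataset: one feature, one target.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

model_based = LinearRegression().fit(X, y)                      # learns parameters (slope, intercept)
instance_based = KNeighborsRegressor(n_neighbors=2).fit(X, y)   # memorizes the training instances

X_new = [[2.5]]
print(model_based.predict(X_new))     # prediction from the fitted line
print(instance_based.predict(X_new))  # average of the 2 nearest neighbors' targets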

you applied the model to make predictions on new cases (this is called inference)

Inference = applying the model to make predictions on new cases

Main Challenges of Machine Learning

Can be summarized and classified as two problems:

  1. Bad algorithm
  2. Bad data

Insufficient Quantity of Training Data

The Unreasonable Effectiveness of Data

This paper by Peter Norvig et al. shows that more data > better algorithms. But of course, getting more data can be hard, so we shouldn’t abandon improving algorithms just yet.

Nonrepresentative Training Data

If your data isn’t representative of the cases you want to generalize to, you’re in for a bad time.

It is crucial to use a training set that is representative of the cases you want to generalize to.

This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed.

This is called sampling bias.

In your quest to get representative cases, you run into more problems. If your sample is too small, you’ll have to deal with sampling noise, which is nonrepresentative data purely by chance.
But even very large samples can be nonrepresentative if the sampling method is flawed – that’s sampling bias.

Poor-Quality Data

Solve this by fixing errors, discarding outliers, either ignoring or filling in missing values, etc.

Irrelevant Features

A big part of ML projects is Feature Engineering, which is to come up with good features to train on. The process involves:

  • Feature selection, which is where you select which features to train on.
  • Feature extraction, which is where you combine existing features to produce a more useful one (e.g., via dimensionality reduction).
  • Creating new features by gathering new data.

Overfitting the Training Data

This is an example of ‘bad algorithms’.

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data.
Here are possible solutions:

  • Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.
  • Gather more training data.
  • Reduce the noise in the training data (e.g., fix data errors and remove outliers).

Solutions to overfitting:

  • Simplify the model by selecting one with fewer parameters
  • Get more training data
  • Reduce noise in data - fix data errors, remove outliers, etc.

Constraining a model to reduce overfitting is called regularization. If a model has two parameters, it gives the algorithm two degrees of freedom to adapt the model to the training data.

The amount of regularization to apply during learning can be controlled by a hyperparameter.

A hyperparameter is a parameter of a learning algorithm (not of the model).

As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training.

If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution.

Tuning hyperparameters is an important part of building a machine learning system.

The amount of regularization to apply during learning is controlled by a hyperparameter.
That is a parameter of a learning algorithm. It’s set prior to training.
If you set it to a large value, you’ll get an almost flat model (almost certainly no overfitting, but probably a worse solution).

Tuning these is a big part of building an ML system.
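For example (a minimal sketch with made-up data, using Ridge regression, where alpha is the regularization hyperparameter set before training):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(20, 1))
y = 3 * X.ravel() + rng.normal(0, 2, size=20)

# alpha is set before training and stays constant during training.
for alpha in (0.01, 1.0, 1e6):
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>9}: slope={model.coef_[0]:.3f}")
# A huge alpha pushes the slope toward zero, i.e. an almost flat model.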

Underfitting the Training Data

This is the opposite of overfitting. Solve it by

  • Selecting a more powerful model, with more parameters.
  • Use better features (do feature engineering).
  • Reduce constraints on the model - e.g. reducing the regularization hyperparameter.

Testing and Validating

This is very important. You don’t just want to train a model and hope it’s good at generalizing new cases. You want to evaluate it.

This is done by trying it on new cases.
And instead of just deploying your model to production, you split your data set into a training set and a test set.

The error rate on new cases is the generalization error (or out-of-sample error).
Your error value tells you your model’s performance on instances it has never seen before.

If your training error is low, but generalization error is high, you’re overfitting.

Hyperparameter Tuning and Model Selection

How do you decide between models?
This is what you do in Model Selection.

You can train the models you’re thinking about and compare their performance.

And now that you’ve selected a model, you’ll want to do some regularization to avoid overfitting. So you do Hyperparameter Tuning.
For this, you could train 100 different models using 100 different values for the hyperparameter.

But if you do so, be careful that you don’t just aim for low error on the test set. Then you’re just finding the best fit for that set.

A solution to this problem is holdout validation: you hold out part of the training set to evaluate several candidate models and select the best one.
The held out set is the validation set (or development set / dev set).
Basically, you train multiple candidate models (with different hyperparameters) on the reduced training set (the full training set minus the validation set) and compare them on the held-out validation set.
After finding the best-performing model, you train it on the full training set (including the validation set), and then you evaluate that final model on the test set to get an estimate of the generalization error.
But this is not so good when the validation set is too small.
You can use Cross Validation to solve some of the issues with this approach, but it can be slow / resource intensive.
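Here’s a minimal holdout-validation sketch using synthetic data (my own example, not the book’s code):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset here.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=42)

# Train candidate models on the reduced training set, compare them on the validation set.
candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_valid, model.predict(X_valid), squared=False)

best_name = min(scores, key=scores.get)
# Retrain the winner on the full training set (training + validation)...
best_model = candidates[best_name].fit(X_train_full, y_train_full)
# ...and only now estimate the generalization error on the held-out test set.
test_rmse = mean_squared_error(y_test, best_model.predict(X_test), squared=False)
print(best_name, test_rmse)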

Chapter 2 End-to-End Machine Learning Project

Here are the main steps we will walk through:

  1. Look at the big picture.
  2. Get the data.
  3. Explore and visualize the data to gain insights.
  4. Prepare the data for machine learning algorithms.
  5. Select a model and train it.
  6. Fine-tune your model.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.

These are the steps you go through in a real machine learning project.
Here’s an awesome checklist.

You also generally want to be working with real data, not some artificial dataset.

Look at the Big Picture

Frame the Problem

When you start ML projects, the first step when looking at the big picture should be to identify and frame the problem.

  • What’s the business objective?
  • What is the current solution?

And then to frame the problem:

  • Is it supervised, unsupervised, self-supervised, or reinforcement learning?
  • Is it classification, regression, or something else?
  • Online or batch learning?

Multiple regression problems are where you use multiple features to predict.
Univariate regression problems are where you only try to predict one value ‘per unit’. The opposite is a multivariate regression problem (predicting multiple values per unit).

If you have a huge dataset for batch learning, you can split the learning across multiple servers using the MapReduce technique (or do online learning).

Signal = a piece of information fed to an ML system. The term comes from Claude Shannon’s information theory, which he developed at Bell Labs to improve telecommunications; the idea is that you want a high signal-to-noise ratio.

Select a Performance Measure

And now you select a performance measure.

A typical one for regression is Root Mean Square Error (RMSE).
This gives an idea of how much error the system typically makes in its predictions, with higher weight given to larger errors.

RMSE is generally preferred for regression tasks, but you might prefer to use another function.

There’s also Mean Absolute Error (MAE). This is great if there are many outliers.

Both RMSE and MAE are ways to measure the distance between two vectors: the prediction vector, and the target value vector.
There are various distance measures (or norms) possible:

  • Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm. This is called the $\ell_2$ norm, denoted $\|\cdot\|_2$ (or just $\|\cdot\|$).
  • Computing the sum of absolutes (MAE) corresponds to the $\ell_1$ norm, denoted $\|\cdot\|_1$. It’s often called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
  • More generally, the $\ell_k$ norm of a vector $\mathbf{v}$ with $n$ elements is defined as $\|\mathbf{v}\|_k = \left(|v_1|^k + |v_2|^k + \cdots + |v_n|^k\right)^{1/k}$. $\ell_0$ gives the number of non-zero elements in the vector, and $\ell_\infty$ gives the maximum absolute value in the vector.

The higher the norm index, the more it focuses on large values and neglects small ones. This is why RMSE is more sensitive to outliers than MAE. But when outliers are rare, RMSE performs well and is preferred.
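A quick numerical illustration of that sensitivity, using a made-up error vector with one outlier:

import numpy as np

errors = np.array([1.0, 1.0, 1.0, 10.0])  # one large outlier error

mae = np.abs(errors).mean()           # l1-style average: 3.25
rmse = np.sqrt((errors ** 2).mean())  # l2-style average: ~5.07
print(mae, rmse)                      # the outlier pulls RMSE up much more than MAE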

Both the RMSE and the MAE are ways to measure the distance between two vectors

RMSE and MAE are both ways to measure the distance between two vectors: the vector of predictions and the vector of target values.
You can use various distance measures (or norms).

RMSE corresponds to the Euclidean norm. It’s also called the $\ell_2$ norm, denoted $\|\cdot\|_2$ (or just $\|\cdot\|$).
MAE corresponds to the $\ell_1$ norm, denoted $\|\cdot\|_1$, sometimes called the Manhattan norm, as it measures the distance between two points in a city if you can only travel along orthogonal city blocks.

Generally, the $\ell_k$ norm of a vector $\mathbf{v}$ containing $n$ elements is defined as $\|\mathbf{v}\|_k = \left(|v_1|^k + |v_2|^k + \cdots + |v_n|^k\right)^{1/k}$.

$\ell_0$ gives the number of nonzero elements in the vector, and $\ell_\infty$ gives the maximum absolute value in the vector.

The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.

That’s a fantastic explanation.

RMSE is more sensitive to outliers than MAE. It has a higher norm index, and focuses on large values & neglects small ones. So when outliers are rare, RMSE is good - and preferred.

Notations

There are various common notations in ML.

RMSE shows some.

  • $m$ is the number of instances in the dataset you’re measuring the RMSE on. For example, for a set with 2000 samples, $m = 2000$.
  • $\mathbf{x}^{(i)}$ is a vector of all the feature values (excluding the label) of the i’th instance in the dataset, and $y^{(i)}$ is its label (the desired output for that instance).
    • Say we have a housing dataset where the first district is located at a given longitude and latitude, and has 1,416 inhabitants with a given median income. You want to predict the district’s median housing value, which is $156,400.
    • Then $\mathbf{x}^{(1)}$ is the vector of those feature values (longitude, latitude, population, median income), and $y^{(1)} = 156{,}400$.
  • $\mathbf{X}$ is a matrix with all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the i’th row is equal to the transpose of $\mathbf{x}^{(i)}$, denoted $(\mathbf{x}^{(i)})^T$.
  • $h$ is the system’s prediction function (also called the hypothesis). Given an instance’s feature vector $\mathbf{x}^{(i)}$, it outputs a predicted value $\hat{y}^{(i)} = h(\mathbf{x}^{(i)})$ for that instance.
  • $\mathrm{RMSE}(\mathbf{X}, h)$ is the cost function measured on the set of examples using your hypothesis $h$.
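Putting the notation together, the RMSE cost function is:

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right)^{2}}$$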

If you imagine a table:

  • $\mathbf{X}$ is the entire table’s rows’ values, except the label column’s values.
  • $\mathbf{x}^{(i)}$ is a row in the table (except it doesn’t have the label column value).

Lowercase italics for scalar values and function names.
Lowercase bold for vectors.
Uppercase bold for matrices.

Multiple regression = using multiple features to predict
Univariate regression = predicting only one feature value per item
Multivariate regression = predicting multiple feature values per item

Check the Assumptions

It’s important that you check your assumptions.
List all of them and verify the ones made (by you or others).

Get the Data

Once you’ve fetched the data from the relevant data store, start exploring it. For example with df.head() or df.info().

You can also explore categorical features with df["feature_name"].value_counts().

You might also use df.describe(), which generates descriptive statistics.

  • Count
  • Mean
  • Standard Deviation – measures how dispersed the values are.
  • Percentiles (25th, 50th, 75th by default)
  • Min
  • Max

Another way is to plot a histogram over each numerical attribute.

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(12, 8))
plt.show()

You might notice skew in the histograms. A right skew means the values extend much farther to the right of the median than to the left. Skew can make it harder for some ML algorithms to detect patterns, so you might try to transform these features into more symmetrical, bell-shaped distributions, e.g. by computing the logarithm or square root.

Create a Test Set

After doing some brief initial exploration, create a test set, put it aside, and never look at it.

The reason you do that is to avoid data snooping bias. Your brain is great at detecting patterns, and if you stumble upon an interesting one in the test set, you may base your model choices on it, ending up with an overly optimistic estimate of the generalization error.

Random sampling

Creating a test set can be as simple as picking instances at random (20% of the dataset usually) and setting them aside. However, you need to be careful that the dataset is split the same way each time. You could do this by setting a random seed, but that method breaks when you get an updated dataset.

Here’s the naive way:

import numpy as np

def shuffle_and_split_data(data, test_ratio: float):
    assert 0 < test_ratio <= 1.0, "Invalid ratio: {}".format(test_ratio)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    # iloc is integer-location based indexing for selecting by position.
    return data.iloc[train_indices], data.iloc[test_indices]

A common solution is to use each instance’s identifier (if it has one) to decide whether or not it should go in the test set. You might also generate an identifier by computing a hash, and then put the instance in the test set if the hash is lower than 20% of the maximum hash value.

Here’s how:

from zlib import crc32

def is_id_in_test_set(identifier, test_ratio: float):
    return crc32(np.int64(identifier)) < test_ratio * 2**32 # type: ignore

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return (
        # `loc` selects rows and cols by label or boolean array, `~` is bitwise negation (inverts bools)
        data.loc[~in_test_set], # selects all rows where corresponding value in `in_test_set` is False.
        data.loc[in_test_set]  # selects all rows where value in `in_test_set` is True.
    )

housing_with_id = housing.reset_index()  # adds `index` column
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, 'index')

But the issue with using the row index as the identifier is that you need to ensure new data only gets appended to the end of the dataset and that no rows are ever deleted. If you can’t guarantee that, use the most stable features to build a unique identifier instead.

Just use Scikit-Learn’s train_test_split for random sampling:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

But be careful of sampling bias when doing random sampling on small datasets.
You can use stratified sampling instead to prevent this.

Stratified sampling

This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population. If the people running the survey used purely random sampling, there would be about a 10.7% chance of sampling a skewed test set with less than 48.5% female or more than 53.5% female participants. Either way, the survey results would likely be quite biased.

You can do stratified sampling, which is where you divide the population into subgroups (each called a stratum) and sample from each so that the test set represents the overall population.

For example, the US population is 51.1% female and 48.9% male. You’d want to divide it into female and male strata, and then take 511 females and 489 males if you want a set of 1,000 people.

If you did random sampling, you could end up with a skewed dataset. This would mean a biased result.

Stratified sampling can be done on continuous numerical attributes by creating a categorical attribute.

You first look at the distribution of the values to see where the potential subgroups are. If many values are closely grouped, you might want to keep the number of strata small. It’s also important to get a sufficient number of instances in each stratum, or the estimate of a stratum’s importance may be biased.
So: not too many strata, and each stratum should be large enough.

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1,2,3,4,5])

This uses pd.cut() to create an income category attribute with 5 categories, labeled from 1 to 5.

And then we can use Scikit-Learn’s StratifiedShuffleSplit to generate stratified splits:

from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
strat_splits = []
for train_index, test_index in splitter.split(housing, housing["income_cat"]):
    strat_train_set_n = housing.iloc[train_index]
    strat_test_set_n = housing.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])

There are various splitters in sklearn.model_selection that use various strategies. Common between them is a split() method that returns an iterator over different training/test splits of the same data. That is, it yields the train/test indices.

But you might want to just use this:

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

And at this point, you should simply drop the categorical feature you created.

Explore and Visualize the Data to Gain Insights

Look for correlations

If the training set is large, you should probably explore a sample of it only.
Should also make a copy: data = df.copy() or something.

If you are looking at a smaller dataset, you can compute the standard correlation coefficient, called Pearson’s r, between every pair of attributes. This correlation coefficient ranges from -1 to 1. Close to 1 means strong positive correlation. Close to -1 means strong negative correlation. Close to 0 means no linear correlation.

corr_matrix = housing.corr()

corr_matrix["median_house_value"].sort_values(ascending=False)

This only captures linear relationships (e.g. x goes up, y generally goes up/down) - not nonlinear (e.g. as x approaches 0, y generally goes up).

You can also check for correlation with a scatter matrix, which plots every numerical attribute against every other numerical attribute. It also creates a histogram of each numerical attribute’s values on the main diagonal. The main diagonal would otherwise be a bunch of straight lines, which isn’t useful.

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()

You may identify clear correlations from such a matrix.

  • Clear trend
  • Points not too dispersed

You may also identify other quirks in the data that could lead to worse predictions.

Experiment with Attribute Combinations

You should try out creating features to see if you can find correlations from combined features.
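For instance, with the housing columns used elsewhere in these notes (a sketch; the exact combinations are up to you):

housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

# Check whether the new combined features correlate better with the target.
corr_matrix = housing.corr(numeric_only=True)  # numeric_only avoids errors from text columns in recent pandas
print(corr_matrix["median_house_value"].sort_values(ascending=False))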

Prepare the Data for Machine Learning Algorithms

It is a very good idea to create functions to clean the data for you.

  • It’s easier when you get more data
  • You build a library of cleaning functions that you can reuse
  • These can be used in your live system to transform new data before feeding it to your algorithm

Clean the Data

Most ML algorithms don’t do well with missing features. You can

  • Get rid of instances with missing values
  • Get rid of the entire attribute with missing values
  • Set missing values to some value through imputation, where the missing value would be e.g. the mean, zero, median, etc.

Get rid of rows with missing values:

data.dropna(subset=["feature"])

Get rid of the entire attribute with missing values

data.drop("attribute", axis=1)

Set the values to some value
This is for the median:

median = data["attribute"].median()
data["attribute"].fillna(median, inplace=True)

If you do this, remember to do it on the test set as well.

Scikit-Learn provides SimpleImputer to help take care of missing values:

from sklearn.impute import SimpleImputer

# There's also "mean", "most_frequent", "constant" with fill_value, etc.
imputer = SimpleImputer(strategy="median")
# median obv. only works on numerical attributes
housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num) # fit imputer instance to training data, i.e. compute the median of each attribute & store it in the `statistics_` attribute.
print(imputer.statistics_)

X = imputer.transform(housing_num)

## Now it's a NumPy array. Convert back to DataFrame:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

There are also other, more powerful, imputers:

  • KNNImputer replaces each missing value with the mean of the k-nearest neighbors’ values for that feature. Distance is based on all available features.
  • IterativeImputer trains a regression model per feature and predicts the missing values based on all other available features. Then trains the model again on the updated data and repeats this several times, improving the model & replacement values each time.
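Both live in sklearn.impute; here’s a minimal usage sketch, reusing housing_num from the code above:

from sklearn.impute import KNNImputer
# IterativeImputer is still experimental, so it has to be enabled explicitly:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

knn_imputer = KNNImputer(n_neighbors=5)
housing_knn = knn_imputer.fit_transform(housing_num)

iter_imputer = IterativeImputer(max_iter=10, random_state=42)
housing_iter = iter_imputer.fit_transform(housing_num)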

Handling Text and Categorical Attributes

You can convert categorical text attributes to numbers with Scikit-Learn’s OrdinalEncoder class. This is like converting them to an enum – each categorical value gets represented by a number instead.
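For example (a sketch using the ocean_proximity column of the housing data):

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing[["ocean_proximity"]])
print(ordinal_encoder.categories_)  # the learned category-to-number mapping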

Just keep in mind that some ML algorithms assume that two nearby values are more similar than distant values.
This might not be the case.

A common solution is to create a binary attribute per category, where its values are 1 if the instance is in that category, and 0 if not.
This is called One-Hot Encoding because only one attribute will be 1 (hot) while the others are 0 (cold).
The new attributes are called dummy attributes.

Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors:

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
data_cat_1hot = cat_encoder.fit_transform(data_cat) # data_cat is your categorical attribute(s)

This gives a SciPy sparse matrix. Such a matrix stores only the location of the non-zero elements (because storing all the 0s would be wasteful).
You can use it like a normal 2D array.
You can convert it to a (dense) NumPy array by calling toarray().

Keep in mind that if a categorical attribute has a large number of possible categories, one-hot encoding will result in a ton of input features. This can degrade performance and slow down training.
You can replace categorical features with useful numerical ones that are related. E.g. replace country code with population & GDP per capita.
You might also use an encoder provided by the category_encoders package on GitHub.
Or when doing NNs, replace each category with a learnable, low-dimensional vector called an embedding. Then each category’s representation is learned during training. This is an example of representation learning (later in the book).

Feature Scaling and Transformation

Very important because ML algorithms don’t do well when numerical attributes have different scales.

There are two ways to ensure they have the same scale:

  • Min-max scaling
  • Standardization

Min-max scaling is often called normalization.

It’s rather simple: for each attribute, values are shifted and rescaled, so they end up ranging from 0 to 1.

Done by subtracting the min value and dividing by the difference between the min and the max.

Use Scikit-Learn’s MinMaxScaler for it.
You can set the feature range to something other than 0 to 1, e.g. -1 to 1. Neural networks work best with zero-mean inputs, so a range of -1 to 1 is preferable there.

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

Standardization:

  • First subtract the mean value (standardized values have a zero mean)
  • Divide results by the standard deviation (so standardized values have a standard deviation = 1)

This doesn’t restrict values to a specific range, and it is much less affected by outliers. If min-max scaling saw a large outlier, it would squish all the other values down to almost nothing, whereas standardization wouldn’t be much affected.

You can use Scikit-Learn’s StandardScaler for standardization.

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
# can set `with_mean` hyperparameter to False so it only divides by std, not subtract mean. Then you can scale a sparse matrix without converting it to a dense matrix first.
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

heavy tail (i.e., when values far from the mean are not exponentially rare)

Heavy tail = when values far from the mean are not exponentially rare.

If a feature distribution has a heavy tail, min-max scaling and standardization will squash the values into a small range.
This isn’t great - most ML algorithms don’t like this.
So before you scale it, transform the feature to shrink the heavy tail.

Deal with it by transforming it to shrink the heavy tail & making the distribution roughly symmetrical. Here are some methods if the heavy tail is to the right (for positive features):

  • Replace it by its square root, or
  • Raise the feature to a power between 0 and 1

If the feature distribution has a really long and heavy tail (like a power law distribution), then replace it with its logarithm.

Can also deal with heavy-tailed features by bucketizing the feature. Chop the distribution into roughly equal-sized buckets & replace each feature value with the index of the bucket it belongs to.

Then you get an almost uniform distribution.
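A small sketch of bucketizing with pandas (made-up heavy-tailed values; pd.qcut uses quantiles, so the buckets come out roughly equal-sized):

import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(42).pareto(a=2.0, size=1000))  # hypothetical heavy-tailed feature
bucket_ids = pd.qcut(values, q=10, labels=False)  # replace each value with its bucket index (0-9)
print(bucket_ids.value_counts().sort_index())     # ~100 instances per bucket: roughly uniform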

When a feature has a multimodal distribution (i.e., with two or more clear peaks, called modes), such as the housing_median_age feature, it can also be helpful to bucketize it, but this time treating the bucket IDs as categories, rather than as numerical values. This means that the bucket indices must be encoded, for example using a OneHotEncoder (so you usually don’t want to use too many buckets). This approach will allow the regression model to more easily learn different rules for different ranges of this feature value. For example, perhaps houses built around 35 years ago have a peculiar style that fell out of fashion, and therefore they’re cheaper than their age alone would suggest.
Another approach to transforming multimodal distributions is to add a feature for each of the modes (at least the main ones), representing the similarity between the housing median age and that particular mode. The similarity measure is typically computed using a radial basis function (RBF)—any function that depends only on the distance between the input value and a fixed point. The most commonly used RBF is the Gaussian RBF, whose output value decays exponentially as the input value moves away from the fixed point. For example, the Gaussian RBF similarity between the housing age x and 35 is given by the equation exp(–γ(x – 35)²). The hyperparameter γ (gamma) determines how quickly the similarity measure decays as x moves away from 35. Using Scikit-Learn’s rbf_kernel() function, you can create a new Gaussian RBF feature measuring the similarity between the housing median age and 35:

Multimodal distribution = two or more clear peaks, called modes.

To deal with such distributions, you can bucketize them, but treat the bucket IDs as categories rather than numerical values, so they need encoding (e.g. one-hot).

You can also transform the multimodal distribution by adding a feature for each of the (main) modes that represents the similarity between the feature and that particular mode.

This is typically computed using a radial basis function (RBF), which is any function that depends only on the distance between the input value and a fixed point.

The most commonly used RBF is Gaussian RBF, whose output values decay exponentially as the input values move away from the fixed point.

from sklearn.metrics.pairwise import rbf_kernel

age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)

So far we’ve only looked at the input features, but the target values may also need to be transformed.

You might find your target values may need to be transformed:

from sklearn.linear_model import LinearRegression

# You might need to scale the targets. Here's an example of a scaling + 'unscaling'
target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())

model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)
some_new_data = housing[["median_income"]].iloc[:5] # pretend new data

scaled_predictions = model.predict(some_new_data)
predictions = target_scaler.inverse_transform(scaled_predictions)

But here’s a simpler way:

from sklearn.compose import TransformedTargetRegressor

model = TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler())
model.fit(housing[["median_income"]], housing_labels)
predictions = model.predict(some_new_data)

Custom Transformers

You can implement your own custom Scikit-Learn transformers. This is useful for custom transformations, cleanup operations, or combining specific attributes.

Add BaseEstimator and TransformerMixin as superclasses to your custom class, for example.

For transformations that don’t require any training, you can just write a function that takes a NumPy array as input and outputs the transformed array.

Custom transformer that doesn’t require training:

from sklearn.preprocessing import FunctionTransformer

# Inverse is optional - useful for when you plan to use it in e.g. a TransformedTargetRegressor.
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])

from sklearn.metrics.pairwise import rbf_kernel
# Can also specify `kw_args` to give hyperparameters.
rbf_transformer = FunctionTransformer(rbf_kernel, kw_args=dict(Y=[[35.]], gamma=0.1))
age_simil_35 = rbf_transformer.transform(housing[["housing_median_age"]])

For example, here’s a custom transformer that acts much like the StandardScaler:

You can create custom transformers that can train. It needs fit(), transform(), and fit_transform(). But you can get fit_transform() for free by adding TransformerMixin as base. You don’t otherwise really need to inherit from any classes. But for this one, we add BaseEstimator to get get_params() and set_params().

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):  # no *args or **kwargs
        self.with_mean = with_mean

    def fit(self, X, y=None):  # y is required even though we don't use it
        X = check_array(X)  # check X is array with finite floats
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]  # every estimator stores this in fit()
        return self # always need to do this - fluent pattern
    
    def transform(self, X):
        check_is_fitted(self)  # looks for learned attributes (with trailing _)
        
        X = check_array(X)
        assert self.n_features_in_ == X.shape[1]
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_

This isn’t 100% complete, as all estimators should set feature_names_in_ in fit() when they’re passed a DataFrame. Also, all transformers should provide a get_feature_names_out() method, and inverse_transform() when their transformation can be reversed.

A custom transformer can (and often does) use other estimators in its implementation. For example, the following code demonstrates a custom transformer that uses a KMeans clusterer in the fit() method to identify the main clusters in the training data, and then uses rbf_kernel() in the transform() method to measure how similar each sample is to each cluster center:

from sklearn.cluster import KMeans

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state
        
    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self
    
    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)
    
    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

This finds cluster similarity with kmeans and RBF.

k-means is a clustering algorithm that locates clusters in data. It searches for n_clusters clusters. After training, you’ll find cluster centers in cluster_centers_.

cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]], sample_weight=housing_labels)

Transformation Pipelines

Many data transformation steps are usually needed, and they need to run in the right order.
Scikit-Learn has the Pipeline class to help with such sequences of transformations.

Note: If you’re using a Jupyter notebook and use sklearn, you can use sklearn.set_config(display="diagram") to visualize all Scikit-Learn’s estimators as interactive diagrams. Very useful for visualizing pipelines.

The pipeline exposes the same methods as the final estimator. If the final estimator is a transformer, you’ll have a transform() method, etc.

For example:

from sklearn.pipeline import Pipeline

## Here we have to name the steps. Names can be anything, as long as they are unique and don't have double underscores.
## Estimators must all be transformers, i.e. have `fit_transform`, except the last one - which can be anything (transformer, predictor, etc.)
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler())
])

from sklearn.pipeline import make_pipeline
## Don't want to name the transformers? This uses the names of the transformers' classes.
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# The pipeline exposes the same methods as the final estimator.
# StandardScaler is a transformer, so the pipeline acts like one.

# transformed (numerical) dataset = return val of pipeline on dataset's numerical attributes
housing_num_prepared = num_pipeline.fit_transform(housing_num)

df_housing_num_prepared = pd.DataFrame(
    housing_num_prepared, columns=num_pipeline.get_feature_names_out(), index=housing_num.index
)

But it’d be better to handle both categorical and numerical columns together. For this, we can use the ColumnTransformer.

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)  # instead of writing every numerical attribute...
cat_attribs = ["ocean_proximity"]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore")
)

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs)
])

Instead of using a transformer, you can specify string “drop” if you want columns to get dropped, or “passthrough” if you want them to be left untouched.
By default, remaining columns get dropped. But you can set remainder hyperparameter to any transformer - or "passthrough" if you want them handled differently.

This is easier than ColumnTransformer instantiation and listing all columns:

from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object))
)

The code that builds the pipeline to do all of this should look familiar to you by now:

Here’s an example of a larger pipeline:

from sklearn.compose import make_column_selector

def column_ratio(X):
    return X[:, [0]] / X[:, [1]]


def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]


def ratio_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        StandardScaler(),
    )


log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler(),
)

cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1.0, random_state=42)
default_num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

preprocessing = ColumnTransformer(
    [
        # Ratio of bedrooms / rooms
        ("bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
        # Ratio of rooms / households
        ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
        # Ratio of population / households
        ("people_per_house", ratio_pipeline(), ["population", "households"]),
        # Logarithm of these features to handle long tails (transform to uniform/Gaussian distributions)
        (
            "log",
            log_pipeline,
            [
                "total_bedrooms",
                "total_rooms",
                "population",
                "households",
                "median_income",
            ],
        ),
        # Cluster similarity with k-means & RBF
        ("geo", cluster_simil, ["latitude", "longitude"]),
        # Categorical attributes get 1hot encoded
        ("cat", cat_pipeline, make_column_selector(dtype_include=object)), # type: ignore
    ],
    # remaining numerical gets imputed by median and standardized
    remainder=default_num_pipeline,
)

Select and Train a Model

So now we have

  • Explored the data
  • Sampled a training set and a test set
  • Written a preprocessing pipeline to automatically clean & prepare the data for an ML algorithm

Train and Evaluate on the Training Set

We can start out with a basic linear regression model.

from sklearn.linear_model import LinearRegression

lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(housing, housing_labels)

housing_predictions = lin_reg.predict(housing)
housing_predictions[:5].round(-2)  # -2 means rounded to nearest hundred
> array([243700., 372400., 128800., 94400., 328300.])

""" BUT! It's off by quite a lot! """
housing_labels[:5].values
> array([458300., 483800., 101700., 96100., 361800.])

from sklearn.metrics import mean_squared_error

lin_rmse = mean_squared_error(housing_labels, housing_predictions, squared=False)
lin_rmse
> 68687.89176590017 # A typical prediction error of about $68,688!

As we can see, the Linear Regression model is underfitting the data. So we can either

  • Select a more powerful model
  • Get better features
  • Reduce the constraints – which we can’t do because the model isn’t regularized.

Let’s try a more complex model.

from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(housing, housing_labels)

housing_predictions = tree_reg.predict(housing)
tree_rmse = mean_squared_error(housing_labels, housing_predictions, squared=False)
tree_rmse
> 0.0

Which is likely to have badly overfit the data.
But we can evaluate it – see next section.

Better Evaluation Using Cross-Validation

You can use Scikit-Learn’s k-fold cross validation feature.

from sklearn.model_selection import cross_val_score

tree_rmses = -cross_val_score(tree_reg, housing, housing_labels, scoring="neg_root_mean_squared_error", cv=10)

pd.Series(tree_rmses).describe()
> count       10.000000
> mean     66880.208011
> std       2049.481815
> min      63649.536493
> 25%      65429.433745
> 50%      66801.953094
> 75%      68229.934454
> max      70094.778246
> dtype: float64

But do note that cross_val_score expects a utility function (greater is better) rather than a cost function (lower is better), so the scoring is negative RMSE. That's why we put a - in front, to flip the sign back.

k-fold cross-validation splits the training set into k nonoverlapping subsets called folds, then trains and evaluates the model k times (here 10, since cv=10), picking a different fold for evaluation each time and training on the remaining k-1 folds.

Since the decision tree model hasn’t worked too well, let’s try an Ensemble model – the Random Forest!

from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing, RandomForestRegressor(random_state=42))
forest_rmses = -cross_val_score(forest_reg, housing, housing_labels, scoring="neg_root_mean_squared_error", cv=10)

pd.Series(forest_rmses).describe()

> count       10.000000
> mean     47030.511142
> std       1029.358881
> min      45458.112527
> 25%      46474.122490
> 50%      46967.596354
> 75%      47332.463998
> max      49227.030610
> dtype: float64

It's better, but likely still overfitting: if you train a single Random Forest and measure the RMSE on the training set itself, you get about 17,474, far below the cross-validation RMSE above. Such a large gap between training error and validation error means the model is overfitting (the CV scores themselves remain a fair estimate of generalization error).
It’s a good idea to find 2-5 promising models by trying out multiple potential approaches at this stage.

Fine-Tune Your Model

Now you have a list of promising models, and you want to fine-tune them.

One way is by fiddling with the hyperparameters manually. This is tedious, though.
So instead, have Scikit-Learn’s GridSearchCV do it for you!

Just tell it which hyperparameters to experiment with, and which values to try, and it’ll evaluate them for you.

# This will search for the best combination of hyperparameter values for the random forest regressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42))
])

param_grid = [
    # The 'preprocessing...' means we refer to the hyperparameter of the estimator in the pipeline, no matter how deeply nested it is.
    # When Scikit-Learn sees the string, it'll split the string at double underscores, looking for an
    # estimator named 'preprocessing' in the pipeline and finds the preprocessing ColumnTransformer.
    # Then it looks for a transformer 'geo' inside that, and finds the ClusterSimilarity transformer.
    # Then it finds that transformer's 'n_clusters' hyperparameter.
    {'preprocessing__geo__n_clusters': [5, 8, 10], 'random_forest__max_features': [4, 6, 8]},
    {'preprocessing__geo__n_clusters': [10, 15], 'random_forest__max_features': [6, 8, 10]}
]

grid_search = GridSearchCV(full_pipeline, param_grid=param_grid, cv=3, scoring='neg_root_mean_squared_error')
grid_search.fit(housing, housing_labels)

By using a Scikit-Learn pipeline for the preprocessing, you can also do hyperparameter tuning on the preprocessing hyperparameters – along with the model hyperparameters. They often interact. In this case, increasing n_clusters might require increasing max_features as well.
If fitting the pipeline transformers is computationally expensive, set the pipeline’s memory hyperparameter to the path of a caching directory. It’ll then save the fitted transformer after first fitting the pipeline. If you fit the pipeline again with the same hyperparameters, it’ll load the cached transformers.
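
Here's a minimal sketch of what that caching could look like; the cache directory name "preprocessing_cache" is just an illustrative choice:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# Fitted transformers get stored in this directory and are reused when the
# pipeline is refit with the same hyperparameters (directory name is assumed).
cached_pipeline = Pipeline(
    [
        ("preprocessing", preprocessing),
        ("random_forest", RandomForestRegressor(random_state=42)),
    ],
    memory="preprocessing_cache",
)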

Best estimator can be accessed via grid_search.best_estimator_.
If you initialize with refit=True (default), then once it finds the best estimator using CV, it’ll retrain on the whole training set.

If you don’t know what value a hyperparameter should have, try out consecutive powers of 10.

To get the best parameters, use:

grid_search.best_params_

And to get the best estimator:

grid_search.best_estimator_

And you can get the scores:

cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)

# rmse = -score
test_score_columns = [col for col in cv_res.columns if 'test_score' in col and col not in ("std_test_score", "rank_test_score")]

for col in test_score_columns:
    cv_res[col] = -cv_res[col]

cv_res.head()

Similar to Grid Search, there’s also Randomized Search.

Grid search is good when looking at relatively few combinations. When the hyperparameter search space is large, it’s preferable to use RandomizedSearchCV.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'preprocessing__geo__n_clusters': randint(low=3, high=50),
    'random_forest__max_features': randint(low=2, high=20)
}

rnd_search = RandomizedSearchCV(
    full_pipeline, param_distributions=param_distribs, n_iter=10, cv=3,
    scoring='neg_root_mean_squared_error', random_state=42
)

rnd_search.fit(housing, housing_labels)

There’s also HalvingRandomSearchCV and HalvingGridSearchCV hyperparameter search classes.

Bonus section: how to choose the sampling distribution for a hyperparameter

  • scipy.stats.randint(a, b+1): for hyperparameters with discrete values that range from a to b, and all values in that range seem equally likely.
  • scipy.stats.uniform(a, b): this is very similar, but for continuous hyperparameters.
  • scipy.stats.geom(1 / scale): for discrete values, when you want to sample roughly in a given scale. E.g., with scale=1000 most samples will be in this ballpark, but ~10% of all samples will be < 100 and ~10% will be > 2300.
  • scipy.stats.expon(scale): this is the continuous equivalent of geom. Just set scale to the most likely value.
  • scipy.stats.loguniform(a, b): when you have almost no idea what the optimal hyperparameter value’s scale is. If you set a=0.01 and b=100, then you’re just as likely to sample a value between 0.01 and 0.1 as a value between 10 and 100.
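
To get a feel for the scales these produce, here's a small sketch that just draws a few samples from each distribution (the specific parameters are arbitrary):

from scipy.stats import randint, uniform, geom, expon, loguniform

# Discrete values from 10 to 100, all equally likely
print(randint(10, 100 + 1).rvs(size=5, random_state=42))
# Continuous values between 0 and 1, all equally likely
print(uniform(0, 1).rvs(size=5, random_state=42))
# Discrete values, roughly on the 1,000 scale
print(geom(1 / 1000).rvs(size=5, random_state=42))
# Continuous values, most likely around 1,000
print(expon(scale=1000).rvs(size=5, random_state=42))
# Scale unknown: samples span several orders of magnitude
print(loguniform(0.01, 100).rvs(size=5, random_state=42))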

Ensemble Methods

Another way to make your system perform better is by combining methods that work the best. We call the group an “ensemble,” and it usually performs better than an individual model.

For example, Random Forests usually perform better than the individual Decision Trees they are built from.

Analyzing the Best Models and Their Errors

It can be a good idea to analyze your best models.

For example, you can check the importance scores for each of the features, and potentially drop ‘useless’ ones.

final_model = rnd_search.best_estimator_
feature_importances = final_model["random_forest"].feature_importances_
feature_importances.round(2)

# This shows us the importance scores in descending order with their corresponding attribute names
sorted(
    zip(feature_importances, final_model["preprocessing"].get_feature_names_out()),
    reverse=True,
)

In fact, the sklearn.feature_selection.SelectFromModel transformer can automatically drop the least useful features.
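
For instance, here's a rough sketch of wrapping it in a pipeline (the "median" threshold and the final LinearRegression are just illustrative assumptions):

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# SelectFromModel fits the random forest, then keeps only the features whose
# importance is above the threshold (here the median importance).
selector_pipeline = make_pipeline(
    preprocessing,
    SelectFromModel(RandomForestRegressor(random_state=42), threshold="median"),
    LinearRegression(),
)
selector_pipeline.fit(housing, housing_labels)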

You might also look at the specific errors your system makes and try to understand why – and what could fix it (more features, removing uninformative ones, cleaning outliers, etc.).

You might also spend time creating subsets of your validation set to check that your model performs well on all categories of data you’re working with. If it performs well generally, but poorly on one, it might cause more harm than good.

Evaluate Your System on the Test Set

Now, run the system on your test set.
When running your pipeline, just transform - do not fit!

# Getting the RMSE for the test set. Just getting predictors and labels from test set and running
# the final model to transform the data & make predictions, then evaluating them
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

final_predictions = final_model.predict(X_test)

final_rmse = mean_squared_error(y_test, final_predictions, squared=False)
print(final_rmse)

You can also compute a 95% confidence interval for the generalization error.

To explain it simply: when creating a model, the goal is for the model to generalize well. That is, it should be able to make accurate predictions on new, unseen data.
The generalization error is a way to quantify how well the model is expected to perform on this new data.

95% confidence interval for the generalization error is a range of values that, based on the data & model, contains the true generalization error 95% of the time.
That is, if we could somehow create 100 identical universes & train our model in each one, we’d expect the generalization error to fall within this interval in 95 out of 100 of those.

So the 95% confidence interval gives us a range where we expect the true value to lie 95% of the time. For the generalization error, it’s giving us a range where we believe our model’s performance on new data will fall.
It’s a way of capturing the uncertainty we have about our model’s ability to make accurate predictions.

# Here we're calculating the 95% confidence interval for the generalization error
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

Launch, Monitor, and Maintain Your System

It’s often a good idea to save every model you experiment with so that you can come back easily to any model you want.
You may also save the cross-validation scores and perhaps the actual predictions on the validation set.
This will allow you to easily compare scores across model types, and compare the types of errors they make.

In practice, save the hyperparameters, the trained parameters, and the cross-validation scores together, so comparing across model types stays easy.

You can save models like this:

import joblib

joblib.dump(my_model, "my_model.pkl")
# and later ... - just make sure to have all the dependencies loaded/imported as well.
my_model_loaded = joblib.load("my_model.pkl")

  • Get your system ready for production: plug input data sources in and write tests.
  • Write monitoring code to monitor live performance and trigger alerts when it drops.
  • Evaluate system performance by sampling its predictions and evaluating them manually.
  • Evaluate system input data quality.
  • It’s generally good to train your models on a regular basis with fresh data. Try to automate this.

You can deploy your model to the cloud, e.g. Google’s Vertex AI. Just upload it to Google Cloud Storage (GCS), go to Vertex AI, and create a new model version by pointing it to the GCS file.
Then you get a simple web service that handles load balancing and scaling.

Chapter 3 Classification

Training a Binary Classifier

from sklearn.datasets import fetch_openml

# This is a dict with the data and some metadata.
# 'data' has the input data, 'target' has the labels.
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist['data'], mnist['target']
X.shape
> (70000, 784)

import matplotlib.pyplot as plt

def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap='binary')
    plt.axis('off')
    
some_digit = X[0]
plot_digit(some_digit)
plt.show() # shows digit

TRAIN_SIZE = 60_000
X_train, X_test, y_train, y_test = (
    X[:TRAIN_SIZE],
    X[TRAIN_SIZE:],
    y[:TRAIN_SIZE],
    y[TRAIN_SIZE:],
)

The training set contains 60,000 images and the test set contains 10,000 images.
The training set is already shuffled, which is great because this guarantees that all cross-validation folds will be similar (you don’t want one fold to be missing some digits). Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the dataset ensures that this won’t happen.

But shuffling is a poor idea if you’re dealing with e.g. time series data.

Let’s train a binary classifier for the number 5.

y_train_5 = (y_train == '5') # True for all 5s, False for all other digits.
y_test_5 = (y_test == '5')

We’ll use a Stochastic Gradient Descent classifier.
This classifier is capable of handling large datasets efficiently.
It deals with training instances independently, one at a time.
This makes SGD well suited for online learning.
SGD relies on randomness during training (hence the name “stochastic”).
If you want reproducible results, you should set the random_state parameter.
This is useful for testing and debugging.

SGDClassifier is just a linear model. It assigns a weight per class to each pixel, and when it sees a new image, it sums up the weighted pixel intensities to get a score for each class.

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

sgd_clf.predict([some_digit])

Performance Measures

There are a ton of performance measures for classifiers. It’s generally harder to measure the performance of a classifier than a regressor.

Measuring Accuracy Using Cross-Validation

We could use Cross Validation.

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

And even if we get a 95% accuracy, there’s no reason to get excited. If you trained a dummy classifier that just classifies every image in the most frequent class (here: the negative class, that is non-5), you’d get over 90%. About 10% of the images are 5s, so just guessing ‘not 5’ makes you right 90% of the time.
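
Here's a quick sketch of that sanity check using Scikit-Learn's DummyClassifier:

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Always predicts the most frequent class ("not 5"), yet scores ~90% accuracy.
dummy_clf = DummyClassifier(strategy="most_frequent")
cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy")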

It’s generally not the best idea to use accuracy as a measure for classifiers. A much better way is Confusion Matrix.

Confusion Matrices

With a confusion matrix, the idea is to count the number of times instances of class A are classified as class B, for all A/B pairs.
And you can find the number of times the classifier confused images of 5s with 3s by looking at row #5, column #3 of the matrix.

You can use confusion_matrix from sklearn.metrics to make one.

Instead of using the test set at this time, we can use cross_val_predict! This does k-fold cross validation and returns the predictions made on each test fold. So you get clean predictions for each instance in the training set – that is, out-of-sample predictions (the model makes predictions on data it never saw during training).

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
cm = confusion_matrix(y_train_5, y_train_pred)

Each row represents an actual class, and each column represents a predicted class.

False positives are also called type I errors. Likewise, false negatives are Type II errors.

Confusion matrices give lots of information. You might prefer a more concise metric.

📊 Precision refers to the fraction of true positive instances among the instances that the model predicted as positive. It’s a measure of how many of the positive predictions were actually correct.
Formula: $\text{precision} = \frac{TP}{TP + FP}$, where $TP$ is the number of true positives and $FP$ is the number of false positives.

You’d actually get perfect precision by making a classifier that always makes negative predictions, except for one single positive prediction on the instance it’s most confident about. If that prediction is correct, it would have a precision of 100%.
That isn’t very useful, though. That’s why we often pair precision with recall.

🎯 Recall is the fraction of true positive instances among the actual positive instances. It measures how many of the actual positives were captured by the model.
So if there are, say, 6 actual fives, but the model only captured 5 of them, we'd have a recall of $5/6 \approx 83\%$.
Formula: $\text{recall} = \frac{TP}{TP + FN}$, where $FN$ is the number of false negatives.

Precision and Recall

You can get precision and recall from Scikit-Learn: from sklearn.metrics import precision_score, recall_score.
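
For example, using the out-of-sample predictions computed above:

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # TP / (TP + FP)
recall_score(y_train_5, y_train_pred)     # TP / (TP + FN)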

We can combine precision and recall into a single metric called the $F_1$ score. This can help you compare two classifiers.
The $F_1$ score is the harmonic mean of precision and recall: $F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$. It gives a balanced measure between them and ranges from 0 to 1, where 1 is the best value.
The regular mean treats all values equally, while the harmonic mean gives much more weight to low values. So the classifier only gets a high $F_1$ score if both precision and recall are high.

Or just use from sklearn.metrics import f1_score.

You don’t always need to maximize both precision and recall. Increasing precision reduces recall, and vice versa. This is the precision/recall tradeoff.

It’s also important to keep in mind that sometimes the cost of false positives and false negatives are different. This would be easy to imagine in a healthcare scenario, where you’d rather avoid the machine saying a patient doesn’t have some disease, when they actually do (false negative).

Since you often do need to make the precision/recall tradeoff, you need to know what is most important.
Another example is detecting videos that are safe for kids. You’d rather be on the safe side (high precision), even though that might mean rejecting many otherwise good videos (low recall). You don’t want really bad videos to slip through!

Precision/Recall Tradeoff

This is how the SGDClassifier makes its classification decisions:
For each instance, it computes a score based on a decision function. If that score is greater than a threshold, it assigns the instance to the positive class; otherwise, it assigns it to the negative class.

Scikit-Learn doesn't let you set the threshold directly (predict() effectively uses a threshold of 0). But it gives you access to the decision function, so you can call that instead of predict to get a score for each instance, and then use any threshold you want:

y_scores = sgd_clf.decision_function([some_digit])
print(y_scores)

# and you can manually set threshold:
threshold = 8000
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)

And you can decide which threshold to use like this:
First, get scores of all instances in the training set.

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")

With these scores, compute precision and recall for all possible thresholds with precision_recall_curve():

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

You can then plot precision and recall as functions of the threshold value. This can also help you choose.

def plot_precision_vs_recall(precisions, recalls, thresholds, threshold):
    """Plot precision and recall as functions of the decision threshold."""
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.vlines(threshold, 0, 1.0, "k", "dotted", label="Threshold")

    ## Beautify
    idx = (thresholds >= threshold).argmax()  # first index ≥ threshold
    plt.plot(thresholds[idx], precisions[idx], "bo")
    plt.plot(thresholds[idx], recalls[idx], "go")
    plt.axis([-50000, 50000, 0, 1])
    plt.grid()
    plt.xlabel("Threshold")
    plt.legend(loc="center right")
    ## End beautify
    
    plt.show()
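
For instance, you could call it with some threshold of interest (the value 3,000 here is arbitrary, just for illustration):

plot_precision_vs_recall(precisions, recalls, thresholds, threshold=3000)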

You can also plot precision against recall. This will help select a good precision/recall tradeoff:

plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall curve")
plt.show()

If you want 90% precision, you can search for the lowest threshold that gives at least 90% precision.
Use argmax() for this, which returns the first index of the maximum value (here, the first True value):

idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]

Which you could then use as your new threshold.
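
To actually make predictions at roughly 90% precision, you could compare the scores against that threshold (a small sketch):

from sklearn.metrics import precision_score, recall_score

y_train_pred_90 = (y_scores >= threshold_for_90_precision)
precision_score(y_train_5, y_train_pred_90)  # should be at least 0.90
recall_score(y_train_5, y_train_pred_90)     # shows how much recall this costs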

As you can see, it's easy to reach practically any precision you want: just set the threshold high enough. But a high-precision classifier isn't very useful if its recall is too low; you could get a single prediction right (100% precision) while ignoring almost all of the actual positives.

The ROC Curve

The Receiver Operating Characteristic (ROC) curve can also be used with binary classifiers to help assess performance.
It’s similar to plotting the precision vs. recall curve, but instead of that, it plots the true positive rate (recall) against the false positive rate (FPR).
FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to $1 - \text{TNR}$, where the true negative rate (TNR) is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity.
So the ROC curve plots sensitivity (recall) versus $1 - \text{specificity}$.

A high area under the ROC curve (AUC: area under curve) indicates better model performance. The metric summarizes the trade-off between the TPR and FPR across different threshold values.

The y-axis would be TPR (Recall) and the x-axis is FPR (Fall-Out).

This is how you plot it with Scikit-Learn and matplotlib:

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

# from the book
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1])                                    # Not shown in the book
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16) # Not shown
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    # Not shown
    plt.grid(True)                                            # Not shown

plt.figure(figsize=(20, 12))
plot_roc_curve(fpr, tpr)

And as we can see, the higher the recall, the more false positives the classifier produces.

We also measure the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, whereas a purely random one has a ROC AUC of 0.5.

Compute with Scikit-Learn:

from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)

You should use Precision vs. Recall curve (PR curve) when the positive class is rare, or when you care more about false positives than false negatives, and otherwise, you should use the ROC curve.

At the end of this chapter, we trained a Random Forest classifier:

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_proba_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")

y_proba_forest[:2] # first two instances. First column is probability of negative class, second column is probability of positive class. These are estimated probabilities.
> array([[0.11, 0.89], [0.99, 0.01]])

y_scores_forest = y_proba_forest[:, 1] # score = probability of positive class
precisions_forest, recalls_forest, thresholds_forest = precision_recall_curve(
    y_train_5, y_scores_forest
)

plt.plot(recalls_forest, precisions_forest, linewidth=2, label="Random Forest")
plt.plot(recalls, precisions, "--", linewidth=2, label="SGD")

plt.show() # shows PR curves for both random forest classifier and SGD. Evidently shows that the RF classifier is better as the AUC is greater, and the PR curve is closer to the top-right corner.

from sklearn.metrics import roc_auc_score, f1_score
y_train_pred_forest = y_proba_forest[:, 1] > 0.5 # positive probability > 50%
print(f"F_1={f1_score(y_train_5, y_train_pred_forest)}")
print(f"ROC AUC={roc_auc_score(y_train_5, y_scores_forest)}")
> F_1=0.9242275142688446
> ROC AUC=0.9983436731328145

Multiclass Classification

Multiclass classifiers, or multinomial classifiers, can distinguish between two or more classes.
While some Scikit-Learn classifiers (e.g. Random Forest classifiers, Logistic Regression, and naive Bayes classifiers) support multiple classes directly, others are strictly binary classifiers (e.g. SGDClassifier and SVC).
But there are strategies to do multiclass classification using multiple binary classifiers (e.g. one-versus-all and one-versus-one)

Here’s an example of the One-versus-All (OvA) strategy (also called one-versus-the-rest):
Say you wish to predict digits from 0-9. Then you could train 10 binary classifiers, one for each digit. When you want to classify an image, get the decision score from each classifier for that image and choose the class whose classifier outputs the highest score.

There’s also the One-versus-One (OvO) strategy.
Same situation: you want to predict digits 0-9. Train a binary classifier for every pair of digits, e.g., (0, 1), (0, 2), … (1, 2)…, and so on.
If there are $N$ classes, you train $N \times (N - 1) / 2$ classifiers.
To make predictions, you run through every classifier and see which class wins the most ‘duels’.
Of course, each classifier only needs to be trained on the part of the training set for the two classes it must distinguish between. This is the main advantage.
That is, it’s good for algorithms that don’t scale well with the size of the training set.

SVM classifiers scale poorly with the size of the training set, so for those algorithms, OvO is preferred. Otherwise, it’s better to do OvA (for the other types of algorithms).

Scikit-Learn will automatically detect when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (or OvO for SVM classifiers).

from sklearn.svm import SVC

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000]) # use a subset of the training set to speed up training time

svm_clf.predict([some_digit])
> array(['5'], dtype=object)

some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores.round(2)
> array([[ 3.79, 0.73, 6.06, 8.3 , -0.29, 9.3 , 1.75, 2.77, 7.21, 4.82]])
# Above: scores for each class. Highest score is for 5, which could be found with `argmax`. You can do:
class_id = some_digit_scores.argmax()
svm_clf.classes_[class_id]

Under the hood, Scikit-Learn used the OvO strategy: it trained 45 binary classifiers (one per pair of digits), got their decision scores for the image, and selected the class that won the most duels. The decision_function() output above shows one aggregated score per class, which is why there are 10 values.

And for methods that directly support multiclass prediction tasks (e.g., random forests), Scikit-Learn doesn’t need to do OvA or OvO.
You can force Scikit-Learn to use OvO or OvR with the OneVsOneClassifier or OneVsRestClassifier classes. Just create an instance and pass a classifier to its constructor (it doesn't even have to be a binary classifier):

from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(SVC(random_state=42))
ovr_clf.fit(X_train[:2000], y_train[:2000])

ovr_clf.predict([some_digit])

We do seem to get better results when the data has been scaled:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype("float64"))

Error Analysis

We can analyze the types of errors our model makes.

First, look at the confusion matrix.
Make predictions with cross_val_predict() and call confusion_matrix().

It’s probably better to visualize it:

from sklearn.metrics import ConfusionMatrixDisplay

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred)
plt.show()

If most images are on the main diagonal, it’s pretty good.

It’s important to normalize the confusion matrix by dividing each value by the total number of images in the corresponding (true) class (i.e. divide by the row’s sum). Do that by setting normalize="true". You can also specify values_format=".0%" to show percentages with no decimals.

We can focus the plot on the errors – make them stand out more – by putting zero weight on the correct predictions.

sample_weight = (y_train_pred != y_train)

ConfusionMatrixDisplay.from_predictions(
    y_train, y_train_pred, sample_weight=sample_weight,
	# normalize="pred" would be to normalize by column. "true" is by row.
    normalize="true", values_format=".0%"
)
plt.show()

And fill the diagonal with zeros to keep only the errors, and plot the result (here norm_conf_mx is the row-normalized confusion matrix):

norm_conf_mx = confusion_matrix(y_train, y_train_pred, normalize="true")
np.fill_diagonal(norm_conf_mx, 0)  # zero out the correct predictions
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

And now you can see which errors the classifier makes.
Rows represent actual classes, and columns represent predicted classes.
If the column for a class is bright, many images get misclassified as it. And if the row for a class is bright, it often gets misclassified.
Just remember not to misinterpret the percentages in this diagram. Since we’ve excluded the correct predictions, 36% in row #7 col #9 doesn’t mean 36% of all images of 7s were misclassified as 9s, but that 36% of errors the model made on images of 7s were misclassifications as 9s.

What do you do about the errors? You might want to get more data to help misclassification.
Or you can add some new features that help the classifier. For example, make an algorithm to count the number of closed loops (8 has 2, 6 has 1, 5 has 0, etc.).
Or you can preprocess the images with, for example, Scikit-Image, Pillow, or OpenCV to make patterns stand out more.

Multilabel Classification

What if you want your classifier to output multiple classes for each instance? E.g., recognizing multiple things in one image.

Such a classifier – one that outputs multiple binary tags – is called a multilabel classification system.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= '7')
y_train_odd = (y_train.astype('int8') % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

knn_clf.predict([some_digit])
> array([[False, True]])

The above code creates a y_multilabel array containing two target labels for each digit image. The first indicates whether it’s large (7, 8, or 9), and the second indicates whether it’s odd. Then we create a KNeighborsClassifier, which supports multilabel classification (not all classifiers do), and train this model using the multiple targets array. some_digit is ‘5’, and it is indeed not large (False) and is odd (True).

There are various ways to evaluate a multilabel classifier. The right metric depends on your project.
One approach is to measure the $F_1$ score for each individual label (or use another binary classifier metric) and compute the average score.

Below we compute the average $F_1$ score across all labels.

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
> 0.976410265560605

If not all labels are equally important (e.g. if you have many more pictures of X than of Y or Z), you can give more weight to the $F_1$ score on X. One option is to give each label a weight equal to its support (the number of instances with that target label): just set average="weighted" when calling f1_score().
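
For example:

from sklearn.metrics import f1_score

# Each label's F1 is weighted by its support (number of true instances of that label).
f1_score(y_multilabel, y_train_knn_pred, average="weighted")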

To use a classifier that doesn't natively support multilabel classification, e.g. SVC, you could train one model per label. That may have a hard time capturing dependencies between labels. To solve that, the models can be organized in a chain: when a model makes a prediction, it uses the input features plus all the predictions of the models that come before it in the chain. Scikit-Learn has a ClassifierChain class that does that. By default it uses the true labels for training, feeding each model the appropriate labels depending on its position in the chain.
If you set the cv hyperparameter, it’ll use cross-validation to get “clean” (out-of-sample) predictions from each trained model for every instance in the training set, and these are used to train all models later in the chain.

from sklearn.multioutput import ClassifierChain

chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
chain_clf.fit(X_train[:2000], y_multilabel[:2000])

chain_clf.predict([some_digit])
> array([[0., 1.]])

Multioutput Classification

Multioutput classification, or multioutput-multiclass classification is a generalization of multilabel classification where each label can be multiclass. That is, each label can have more than two possible values.

Multioutput systems aren’t limited to classification tasks. You could have a system that outputs multiple labels per instance, including both class labels and value labels.

The book gives an example where we build a system that removes noise from images. It takes a noisy digit image and outputs a clean digit image, represented as an array of pixel intensities (like MNIST images).
The output is multilabel (one label per pixel) and each label can have multiple values (pixel intensity ranges from 0 to 255). So it’s a multioutput classification system.

Defining data set – adding noise to the MNIST data

np.random.seed(42)
# generates random integers from 0 to 100 (exclusive) with shape (len(X_train), 784)
# recall that X = mnist['data'], X has shape (70000, 784)!
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise

y_train_mod = X_train
y_test_mod = X_test

plot_digit(X_test_mod[0])
plt.show() # shows noisy 7

Training

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])
plot_digit(clean_digit)
plt.show() # shows clean 7

Chapter 4 Training Models

This chapter starts by looking at two ways to train a Linear Regression model:

  • Using a “closed-form” equation (an equation composed only of a finite number of constants, variables, and standard operations such as addition, multiplication, and exponentiation – no infinite sums, limits, integrals, etc.) that directly computes the model parameters that best fit the model to the training set (i.e. that minimize the cost function over the training set)
  • Using an iterative optimization algorithm (Gradient Descent) that gradually tweaks model parameters to minimize the cost function over the training set, eventually converging on the same set of parameters as the above method. There are multiple variants of gradient descent (batch GD, mini-batch GD, and Stochastic Gradient Descent).

After that, we look at polynomial regression. This is a more complex model, but it can fit nonlinear datasets. Since it has more parameters than linreg, it is more prone to overfitting. We’ll see how this is the case with learning curves, and then we look at Regularization techniques.

Finally, we look at two more models commonly used for classification: logistic regression and softmax regression.

Linear Regression

Generally, a Linear model makes a prediction by computing a weighted sum of the input features, plus a constant called the bias term (or intercept term):

$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

where $\hat{y}$ is the predicted value, $n$ is the number of features, $x_i$ is the $i$-th feature value, and $\theta_j$ is the $j$-th model parameter, including the bias term $\theta_0$ and the feature weights $\theta_1, \dots, \theta_n$.

This can be written more concisely in the vectorized form:

$\hat{y} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta} \cdot \mathbf{x}$

where $h_{\boldsymbol{\theta}}$ is the hypothesis function, using model parameters $\boldsymbol{\theta}$; $\boldsymbol{\theta}$ is the model's parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$; $\mathbf{x}$ is the instance's feature vector, containing $x_0$ to $x_n$, with $x_0$ always equal to 1; and $\boldsymbol{\theta} \cdot \mathbf{x}$ is the dot product of the two vectors, equal to $\theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$.

So how do we train the model?
Training a model means setting its parameters so the model best fits the training set.
So how do we measure how well it fits?
The most common performance measure for regression models is the Root Mean Square Error (RMSE), so we need to find the value of $\boldsymbol{\theta}$ that minimizes the RMSE.
In practice, it's simpler to minimize the Mean Squared Error (MSE), and it leads to the same result (the value that minimizes a positive function also minimizes its square root).

The MSE of a linear regression hypothesis $h_{\boldsymbol{\theta}}$ on training set $\mathbf{X}$ is:

$\text{MSE}(\mathbf{X}, h_{\boldsymbol{\theta}}) = \frac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^\top \mathbf{x}^{(i)} - y^{(i)} \right)^2$

The Normal Equation

There's a closed-form solution (a mathematical equation that gives the result directly) for finding the value of $\boldsymbol{\theta}$ that minimizes the MSE. This is the Normal equation:

$\hat{\boldsymbol{\theta}} = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{y}$

where $\hat{\boldsymbol{\theta}}$ is the value of $\boldsymbol{\theta}$ that minimizes the cost function, and $\mathbf{y}$ is the vector of target values containing $y^{(1)}$ to $y^{(m)}$.

Now let’s try it out.

We'll generate a linear dataset based on the function $y = 4 + 3 x_1 + \text{Gaussian noise}$:

import numpy as np
from sklearn.preprocessing import add_dummy_feature

np.random.seed(42)

m = 100  # number of instances
X = 2 * np.random.rand(m, 1) # column vector
y = 4 + 3 * X + np.random.randn(m, 1) # column vector

X_b = add_dummy_feature(X) # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
theta_best
> array([[4.21509616], [2.77011339]])

Getting $\theta_0 = 4$ and $\theta_1 = 3$ would have been perfect, but due to the noise it is impossible to recover the exact parameters of the original function.

We can use theta_best to make predictions.

X_new = np.array([[0], [2]])
X_new_b = add_dummy_feature(X_new)
y_predict = X_new_b @ theta_best
y_predict
> array([[4.21509616], [9.75532293]])

We can also do linear regression with Scikit-Learn using the LinearRegression class. This class is based on the np.linalg.lstsq() function (the name stands for “least squares”).

As can be seen, they separate the bias term (intercept_) from the feature weights (coef_).

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
> (array([4.21509616]), array([[2.77011339]]))

lin_reg.predict(X_new)
> array([[4.21509616], [9.75532293]])

Let’s try using np.linalg.lstsq() (Least Squares).

theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
theta_best_svd
> array([[4.21509616], [2.77011339]])

The function computes $\hat{\boldsymbol{\theta}} = \mathbf{X}^{+} \mathbf{y}$, where $\mathbf{X}^{+}$ is the pseudoinverse of $\mathbf{X}$ (specifically, the Moore-Penrose inverse). You can use np.linalg.pinv to compute the pseudoinverse directly:

np.linalg.pinv(X_b) @ y # Moore-Penrose pseudoinverse
> array([[4.21509616], [2.77011339]])

The pseudoinverse is computed using Singular Value Decomposition (SVD), which decomposes the training set matrix $\mathbf{X}$ into the matrix multiplication $\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$. The pseudoinverse is then $\mathbf{X}^{+} = \mathbf{V} \boldsymbol{\Sigma}^{+} \mathbf{U}^\top$. To compute $\boldsymbol{\Sigma}^{+}$, the algorithm takes $\boldsymbol{\Sigma}$, sets to zero all values smaller than a tiny threshold, replaces all the nonzero values with their inverse, and finally transposes the resulting matrix.
This is more efficient than computing the Normal equation, and it handles edge cases: the Normal equation may not work if $\mathbf{X}^\top \mathbf{X}$ is singular (not invertible), whereas the pseudoinverse is always defined.

Gradient Descent

This is a generic optimization algorithm capable of finding optimal solutions to various problems.
The idea is to tweak parameters iteratively to minimize a cost function.

The algorithm measures the local gradient of the error function with respect to the parameter vector $\boldsymbol{\theta}$, and it goes in the direction of the descending gradient. Once the gradient is zero, you've reached a minimum.

In practice, you start by filling $\boldsymbol{\theta}$ with random values (this is called random initialization). Then you improve it gradually, each step attempting to decrease the cost function (e.g. the MSE), until the algorithm converges to a minimum.

The size of the steps is determined by the learning rate hyperparameter. If it's too small, the algorithm will take a long time to converge.
But if it's too large, you may overshoot the minimum and end up on the other side, possibly even higher up than you started, which can make the algorithm diverge.

Not all cost functions look like nice, regular bowls. There can be holes, ridges, and plateaus that make convergence to the minimum difficult, e.g. local minima vs. the global minimum.
Random initialization may start you off in an area that leads you to a local minimum. If there happens to be a steep hill between it and the global minimum, you won't get there.
Likewise, if you start far from the minimum, you may make slow but steady improvements, and if you stop early, you won't reach the global minimum.

Convex functions help quite a bit.
For example, the MSE cost function for a linear regression model is a convex function.
Convex means that if you pick any two points on the curve, the line segment joining them is never below the curve.
This implies there are no local minima, just one global minimum.

MSE is also continuous with a slope that never changes abruptly.

These two characteristics mean that gradient descent is guaranteed to approach the global minimum arbitrarily closely (if you wait long enough and the learning rate is not too high).

When using gradient descent, ensure all features have a similar scale; otherwise it will take much longer to converge on the minimum.

Batch Gradient Descent

To implement Gradient Descent, you need to compute the gradient of the cost function with respect to each model parameter $\theta_j$. That is, you calculate how much the cost function will change if you change $\theta_j$ just a bit. This is called the partial derivative.

The partial derivative of the Mean Squared Error (MSE) with respect to parameter $\theta_j$, noted $\frac{\partial}{\partial \theta_j} \text{MSE}(\boldsymbol{\theta})$, is:

$\frac{\partial}{\partial \theta_j} \text{MSE}(\boldsymbol{\theta}) = \frac{2}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^\top \mathbf{x}^{(i)} - y^{(i)} \right) x_j^{(i)}$

But instead of computing the partial derivatives individually, you can compute them all in one go.
The gradient vector, $\nabla_{\boldsymbol{\theta}} \text{MSE}(\boldsymbol{\theta})$, contains all the partial derivatives of the cost function – one for each model parameter:

$\nabla_{\boldsymbol{\theta}} \text{MSE}(\boldsymbol{\theta}) = \frac{2}{m} \mathbf{X}^\top \left( \mathbf{X} \boldsymbol{\theta} - \mathbf{y} \right)$

The algorithm is called batch gradient descent because we do calculations over the whole batch of training data.
This is also why it’s slow on large training sets.
Gradient descent scales well with the number of features – it’s faster than the Normal equation or Singular Value Decomposition (SVD) when training a Linear Regression model.

Given the gradient vector, which points ‘uphill’, you go in the opposite direction to go ‘downhill’. This means subtracting $\nabla_{\boldsymbol{\theta}} \text{MSE}(\boldsymbol{\theta})$ from $\boldsymbol{\theta}$. This is where the learning rate $\eta$ comes into play: we multiply the gradient vector by $\eta$ to determine the size of the ‘downhill’ step:

$\boldsymbol{\theta}^{(\text{next step})} = \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \text{MSE}(\boldsymbol{\theta})$

Here’s an implementation:

eta = 0.1  # learning rate
n_epochs = 1000
m = len(X_b)  # number of instances

np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

theta
> array([[4.21509616], [2.77011339]])

Can use Grid Search to find a good learning rate.
Finding a good number of epochs to run for is hard: too few and you'll stop before reaching a good solution; too many and you'll waste compute and time. A simple solution is to set a very large number of epochs but interrupt the algorithm when the gradient vector becomes tiny, i.e. when its norm becomes smaller than some value $\epsilon$ (called the tolerance), which happens when gradient descent has (almost) reached the minimum.
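
Here's a minimal sketch of that interruption criterion, applied to the batch gradient descent loop above (the tolerance value is just an assumption):

eta = 0.1
n_epochs = 100_000  # deliberately huge
tolerance = 1e-7    # assumed tolerance on the gradient norm
m = len(X_b)

np.random.seed(42)
theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    if np.linalg.norm(gradients) < tolerance:  # gradient vector is tiny: stop
        break
    theta = theta - eta * gradients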

Stochastic Gradient Descent

Batch gradient descent uses the whole training set to compute gradients at every step. This is slow on large training sets.

Stochastic Gradient Descent instead picks a random instance from the training set at every step and computes the gradients based only on that single instance.
This makes it faster, given there’s less data to manipulate each iteration, and it also makes it possible to train on huge datasets, as only one instance needs to be in memory at each iteration.
Stochastic Gradient Descent can be implemented as an out-of-core algorithm.

However, due to its stochastic nature, this algorithm is much less regular than batch gradient descent: the cost function will bounce up and down, decreasing only on average rather than steadily.
Over time, it’ll get very close to the minimum, but will never settle there – it’ll continue to bounce around. That also means you won’t find the optimal solution, only a good enough one.

This can be a plus: you may jump out of local minima on irregular cost functions. So there’s a better chance of finding the global minimum than batch gradient descent.

Given the duality of being able to escape local minima, but never settling at the minimum, it can be useful to gradually reduce the learning rate.
The steps start out large, which helps make quick progress and escape local minima, and then they get smaller and smaller, allowing the algorithm to settle at the global minimum.
This process is similar to simulated annealing, which is an algorithm inspired by the process of annealing from metallurgy, where molten metal is slowly cooled down.
The function used to determine the learning rate at each iteration is called the learning schedule. If you reduce the learning rate too quickly, you may get stuck in a local minimum, or end up frozen halfway to the minimum. If it’s reduced too slowly, you may jump around the minimum for a long time, and end up with a suboptimal solution if you end too early.

Implementation

n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters
m = len(X_b)

def learning_schedule(t):
    return t0 / (t + t1)

np.random.seed(42)
theta = np.random.randn(2, 1) # random initialization

for epoch in range(n_epochs):
    for iteration in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T @ (xi @ theta - yi)  # we don't divide by m because we only use one instance
        eta = learning_schedule(epoch * m + iteration)
        theta = theta - eta * gradients

theta
> array([[4.21076011], [2.74856079]])

By convention we iterate in rounds of $m$ iterations; each round is called an epoch.

When using SGD, the training instances must be independent and identically distributed (IID) to ensure the parameters get pulled toward the global optimum on average. You can ensure this by shuffling the instances during training (pick each instance randomly, or shuffle the training set at the beginning of each epoch). If you don't shuffle, e.g. if the instances are sorted by label, then SGD will start by optimizing for one label, then the next, and so on, and it won't settle close to the global minimum.

Linear Regression with SGD using Scikit-Learn

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(
    max_iter=1000, # runs for 1k epochs
    tol=1e-5, # or until loss drops by less than tol...
    n_iter_no_change=100,  # during 100 epochs
    eta0=0.01,  # starts with this learning rate
    penalty=None,  # no regularization
    random_state=42
)

sgd_reg.fit(X, y.ravel())  # ravel() flattens the array: fit expects a 1D array

sgd_reg.intercept_, sgd_reg.coef_
> (array([4.21278812]), array([2.77270267]))

Mini-Batch Gradient Descent

At each step, instead of computing the gradients based on the full training set (batch GD), or based on just one instance (SGD), this method computes the gradients on small random sets of instances called mini-batches.

The advantage of mini-batch GD over SGD is the performance boost you can get from hardware optimization of matrix operations, esp. when using GPUs.

Will get closer to minimum than SGD because the algorithm’s progress in parameter space is less erratic. But may be harder to escape local minima.
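
Here's a minimal sketch of mini-batch gradient descent on the same toy dataset (the batch size is an arbitrary choice; it reuses the learning_schedule() from above):

n_epochs = 50
minibatch_size = 20  # assumed batch size
m = len(X_b)

np.random.seed(42)
theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    # Shuffle the training set at the start of each epoch
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled, y_shuffled = X_b[shuffled_indices], y[shuffled_indices]
    for start in range(0, m, minibatch_size):
        xi = X_b_shuffled[start:start + minibatch_size]
        yi = y_shuffled[start:start + minibatch_size]
        # Average the gradients over the mini-batch
        gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + start)
        theta = theta - eta * gradients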

Polynomial Regression

You can use a linear model to fit nonlinear data.
A simple way is adding powers of each feature as new features, then training a linear model on this extended set of features.
This is called Polynomial regression.

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0], X_poly[0]
> (array([-0.75275929]), array([-0.75275929, 0.56664654]))

We generate a nonlinear and noisy dataset based on a Quadratic Equation.
Then we use PolynomialFeatures from Scikit-Learn to transform the training data, adding the square (second-degree Polynomial) of each feature in the training set as a new feature.
X_poly contains the original features of X plus the square of the (single) feature.
We can use a LinearRegression model now:

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_
> (array([1.78134581]), array([[0.93366893, 0.56456263]]))

The model estimates $\hat{y} = 0.56 x_1^2 + 0.93 x_1 + 1.78$, where the original function was $y = 0.5 x_1^2 + 1.0 x_1 + 2.0 + \text{Gaussian noise}$.

When there are multiple features, polynomial regression can find relationships between them. Plain linear models cannot do this. This is possible because PolynomialFeatures also adds all combinations of features up to the given degree.
For example, given two features $a$ and $b$, PolynomialFeatures with degree=3 would not only add the features $a^2$, $a^3$, $b^2$, and $b^3$, but also the combinations $ab$, $a^2 b$, and $a b^2$.

Given degree $d$, PolynomialFeatures transforms an array with $n$ features into an array with $\frac{(n + d)!}{d!\, n!}$ features, where $n!$ is the factorial of $n$. It's important to be aware of this combinatorial explosion of the number of features.

Learning Curves

High-degree polynomial regression will likely fit data better than plain Linear Regression. This can be Overfitting.

You could use Cross Validation to check if your model generalizes. If it does well on training data, but poorly on CV metrics, then it’s overfitting. If it does poorly on both, it’s underfitting.

But you can also use Learning Curves.
These are plots of the model’s training error and validation error as a function of the training iterations.
Evaluate the model at regular intervals during training on both train and validation sets, and plot the results.
If you can’t train the model incrementally (via partial_fit() or warm_start), then train it several times on gradually larger subsets of the training set.

Scikit-Learn has learning_curve() to help.
It trains and evaluates the model via Cross Validation.
By default, it retrains the model on growing subsets of training data, but if the model supports incremental learning, use exploit_incremental_learning=True in the function parameters.
It returns training set sizes for each eval & training + validation scores for each size, and for each cv fold.

from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(0.01, 1.0, 50), cv=5,
    scoring="neg_mean_squared_error"
)
train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)

plt.plot(train_sizes, train_errors, "r-+", label="Training set", linewidth=2)
plt.plot(train_sizes, valid_errors, "b-", label="Validation set", linewidth=3)
plt.show()

This model is underfitting.
Looking at training error: In the beginning, there are few enough instances such that it can fit them perfectly. However, since the data is not linear, and it’s noisy, it becomes increasingly evident that the model is unable to fit it. That’s why the training error goes up and finally reaches a plateau.
Looking at validation error: In the beginning, the model cannot generalize, as there are too few samples. Then, as it gets exposed to more samples, the validation error goes down. But it gets stuck at the same plateau as train, as the model doesn’t fit the data well.

from sklearn.pipeline import make_pipeline

polynomial_regression = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    LinearRegression()
)

train_sizes, train_scores, valid_scores = learning_curve(
    polynomial_regression, X, y, train_sizes=np.linspace(0.01, 1.0, 50), cv=5,
    scoring="neg_mean_squared_error"
)

train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)

plt.figure(figsize=(6, 4))
plt.plot(train_sizes, train_errors, "r-+", linewidth=2, label="train")
plt.plot(train_sizes, valid_errors, "b-", linewidth=3, label="valid")
plt.legend(loc="upper right")
plt.xlabel("Training set size")
plt.ylabel("RMSE")
plt.grid()
plt.axis([0, 80, 0, 2.5])
plt.show()

Now, using a 10th-degree polynomial model on the same data seems to help.
The error on the training data is noticeably lower. There is a gap between the curves: the model performs better on train than validation, which is an indicator of Overfitting.
However, using a larger dataset, the two curves would get closer. A way to improve an Overfitting model is to give it more data until the validation error reaches the training error.

Regularized Linear Models

You can reduce Overfitting through Regularization (constraining the model). The fewer degrees of freedom it has, the harder for it to overfit the data.

A simple way to regularize a polynomial model is to reduce the number of Polynomial degrees.

For a linear model, you do it by constraining the weights of the model.
In this part, we look at three ways to constrain the weights:

  • Ridge Regression
  • Lasso Regression
  • Elastic Net Regression

Ridge Regression

Also known as Tikhonov regularization, this is a regularized version of Linear Regression.
A regularization term equal to $\frac{\alpha}{m} \sum_{i=1}^{n} \theta_i^2$ is added to the MSE.
This forces the algorithm to not only fit the data, but also keep the model weights as small as possible.
The regularization term should only be added to the cost function during training. You'd use standard RMSE/MSE afterwards to evaluate model performance.

The hyperparameter $\alpha$ controls how much to regularize the model.
If $\alpha = 0$, it's just Linear Regression. If $\alpha$ is very large, all weights end up very close to zero, and you'd get a flat line through the data's mean.

Ridge regression cost function:

$J(\boldsymbol{\theta}) = \text{MSE}(\boldsymbol{\theta}) + \frac{\alpha}{m} \sum_{i=1}^{n} \theta_i^2$

The bias term $\theta_0$ is not regularized (the sum starts at $i = 1$, not $0$).

If we define $\mathbf{w}$ as the vector of feature weights ($\theta_1$ to $\theta_n$), the regularization term is equal to $\frac{\alpha}{m} \left( \lVert \mathbf{w} \rVert_2 \right)^2$, where $\lVert \mathbf{w} \rVert_2$ is the $\ell_2$ norm of the weight vector.

For batch gradient descent, add $\frac{2\alpha}{m} \mathbf{w}$ to the part of the MSE gradient vector that corresponds to the feature weights, without adding anything to the gradient of the bias term.

Ridge regression closed-form solution
Ridge regression can be performed either with gradient descent or by computing a closed-form equation:

$\hat{\boldsymbol{\theta}} = \left( \mathbf{X}^\top \mathbf{X} + \alpha \mathbf{A} \right)^{-1} \mathbf{X}^\top \mathbf{y}$

where $\mathbf{A}$ is the $(n + 1) \times (n + 1)$ identity matrix, except with a 0 in the top-left cell, corresponding to the bias term.

We can use Scikit-Learn, which uses a matrix factorization technique by André-Louis Cholesky:

from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=0.1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
> array([[1.5325833]])

And Stochastic Gradient Descent:

sgd_reg = SGDRegressor(
	penalty="l2",  # sets regularization type. "l2" means we add regularization to the MSE cost function equal to alpha times the square of the $\ell_2$ norm. This is like ridge regression, except no division by m - so we pass alpha=0.1/m to get the same as Ridge(alpha=0.1).
	alpha=0.1/m,
	tol=None,
	max_iter=1000,
	eta0=0.01,
	random_state=42
)
sgd_reg.fit(X, y.ravel())  # ravel because fit expects 1d targets
sgd_reg.predict([[1.5]])
> array([1.55302613])

Lasso Regression

LASSO stands for least absolute shrinkage and selection operator.
This is also a regularized version of Linear Regression, and just like ridge regression, it adds a regularization term to the cost function. However, it uses the $\ell_1$ norm of the weight vector instead of the square of the $\ell_2$ norm.

We multiply the $\ell_1$ norm by $2\alpha$ (whereas the $\ell_2$ norm was multiplied by $\frac{\alpha}{m}$ in ridge regression); these factors are chosen so that the optimal $\alpha$ value is independent of the training set size.

Lasso regression cost function:

$J(\boldsymbol{\theta}) = \text{MSE}(\boldsymbol{\theta}) + 2\alpha \sum_{i=1}^{n} \lvert \theta_i \rvert$

Lasso regression tends to eliminate the weights of the least important features (setting them to 0).
That is, it automatically does feature selection and outputs a sparse model with few nonzero weights.

To keep Gradient Descent from bouncing around the optimum at the end, you’ll need to gradually reduce the learning rate. While it’ll keep bouncing around the optimum, the steps get smaller and smaller, meaning it’ll converge.

The lasso cost function is not differentiable at $\theta_i = 0$ (for $i = 1, 2, \dots, n$), but you can still use gradient descent by using a subgradient vector instead whenever any $\theta_i = 0$, e.g. $g(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \text{MSE}(\boldsymbol{\theta}) + 2\alpha\, \operatorname{sign}(\boldsymbol{\theta})$, where $\operatorname{sign}(\theta_i)$ is $-1$ if $\theta_i < 0$, $0$ if $\theta_i = 0$, and $+1$ if $\theta_i > 0$.

from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
> array([1.53788174])

Could also use SGDRegressor(penalty="l1", alpha=0.1).

Elastic Net Regression

Like middle ground between ridge regression and lasso regression.
The regularization term is a weighted sum of ridge's and lasso's regularization terms, controlled by the mix ratio $r$: when $r = 0$, elastic net is equivalent to ridge regression, and when $r = 1$, it is equivalent to lasso regression.

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
> array([1.54333232])

Back to overall regularization
When to use what:

  • It's almost always preferable to have at least a little regularization, so avoid plain linear regression.
  • Ridge is a good default.
  • If you suspect only a few features are useful, prefer lasso or elastic net, because they tend to reduce the useless features' weights to zero. Between the two, prefer elastic net, as lasso can behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.

Early Stopping

You can regularize iterative learning algorithms (like Gradient Descent) by stopping when the validation error reaches a minimum.

This way, you capture the model right before it starts to overfit, getting the best version of it.
Geoffrey Hinton called early stopping a “beautiful free lunch”.

For gradient descent methods whose curves are not smooth (stochastic, mini-batch), it can be hard to know if you’ve reached a minimum or not. You can therefore stop only after it’s been above the minimum for a while — when you’re confident the model won’t get better — and then roll back parameters to a point where the validation error was at a minimum.

The basic approach is to use deepcopy from copy to copy the best model, given by the best validation error (so you track that with a separate variable).
And you’d want to use partial_fit (or similar) to do incremental learning.

Logistic Regression

Logistic Regression (or logit regression) is often used to estimate the probability that an instance belongs to a particular class. If over some threshold, it predicts that it belongs to the given class (or not).

Estimating Probabilities

Logistic regression works by computing a weighted sum of the input features (plus a bias term), just like linear regression. Instead of outputting the result directly, it outputs the logistic of the result:

Logistic regression model estimated probability (vectorized form): $\hat{p} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \sigma\!\left(\boldsymbol{\theta}^{\top}\mathbf{x}\right)$

The logistic, noted $\sigma(\cdot)$, is a sigmoid (S-shaped) function that outputs a number between 0 and 1: $\sigma(t) = \dfrac{1}{1 + \exp(-t)}$.

Given the probability that an instance belongs to the positive class, it’ll make the prediction by checking against a threshold value.

The input to the $\sigma(\cdot)$ function is often denoted $t$ and is often called the logit. The name comes from the fact that the logit function, $\operatorname{logit}(p) = \log\!\left(p / (1 - p)\right)$, is the inverse of the logistic function: if you compute the logit of the estimated probability $\hat{p}$, you get $t$. The logit is also called the log-odds because it’s the log of the ratio between the estimated probability for the positive class and the estimated probability for the negative class.

Training and Cost Function

The objective of training is to set the parameter vector $\boldsymbol{\theta}$ so the model estimates high probabilities for positive instances ($y = 1$) and low probabilities for negative instances ($y = 0$).
This is captured in the cost function for a single training instance $\mathbf{x}$:

Cost function of a single training instance: $c(\boldsymbol{\theta}) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1 - \hat{p}) & \text{if } y = 0 \end{cases}$

The cost will be large if the model estimates a probability close to 0 for a positive instance, as $-\log(t)$ grows very large when $t$ approaches 0.
It will also be large if the model estimates a probability close to 1 for a negative instance.
On the other hand, $-\log(t)$ is close to 0 when $t$ is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance.

The cost function over the whole training set is the average cost over all training instances:
Logistic regression cost function (log loss): $J(\boldsymbol{\theta}) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\!\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - \hat{p}^{(i)}\right) \right]$

It can be shown with Bayesian inference that minimizing this loss results in the model with the maximum likelihood of being optimal, assuming the instances follow a Gaussian distribution around the mean of their class.
This is the implicit assumption you make when using log loss – the more wrong it is, the more biased the model will be.

There’s no known closed-form equation to compute the value of $\boldsymbol{\theta}$ that minimizes this cost function. But it is convex, so Gradient Descent or other optimization algorithms will find the global minimum.

Partial derivatives of the logistic cost function: $\dfrac{\partial}{\partial \theta_j} J(\boldsymbol{\theta}) = \dfrac{1}{m} \sum_{i=1}^{m} \left( \sigma\!\left(\boldsymbol{\theta}^{\top}\mathbf{x}^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$

For each instance, it computes the prediction error and multiplies it by the feature value, then computes the average over all training instances. Once we have the gradient vector with all partial derivatives, we can use it in the batch gradient descent algorithm.

For stochastic GD, use one instance at a time. For mini-batch GD, use a mini-batch at a time.
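
As a quick illustration (along the lines of the book’s iris example; the exact feature choice and probe values below are mine), you can let Scikit-Learn’s LogisticRegression minimize this cost function for you:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.data[["petal width (cm)"]].values
y = (iris.target == 2)  # is it Iris virginica?

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
log_reg.predict_proba([[1.7], [2.5]])  # estimated [P(not virginica), P(virginica)] per flower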

Softmax Regression

We can generalize the logistic regression model to support multiple classes directly without having to train and combine multiple binary classifiers (One-versus-One or One-versus-the-Rest). This is Softmax Regression, or multinomial logistic regression (Multinomial Classifier).

When given an instance $\mathbf{x}$, the softmax regression model first computes a score $s_k(\mathbf{x}) = \left(\boldsymbol{\theta}^{(k)}\right)^{\top}\mathbf{x}$ for each class $k$, then estimates the probability of each class by applying the softmax function to the scores.

Each class $k$ has its own dedicated parameter vector $\boldsymbol{\theta}^{(k)}$, and these are typically stored as rows in a parameter matrix $\boldsymbol{\Theta}$.

Once the scores have been computed for each class for the instance $\mathbf{x}$, we estimate the probability $\hat{p}_k$ that the instance belongs to class $k$ by running the scores through the softmax function: it computes the exponential of every score, then normalizes them by dividing by the sum of all the exponentials. The scores are usually called logits or log-odds (although they are actually unnormalized log-odds).

Softmax function: $\hat{p}_k = \sigma\!\left(\mathbf{s}(\mathbf{x})\right)_k = \dfrac{\exp\!\left(s_k(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp\!\left(s_j(\mathbf{x})\right)}$

where $K$ is the number of classes, $\mathbf{s}(\mathbf{x})$ is a vector containing the scores of each class for the instance $\mathbf{x}$, and $\sigma\!\left(\mathbf{s}(\mathbf{x})\right)_k$ is the estimated probability that the instance belongs to class $k$ given the scores of each class for that instance.

The prediction is the class with the highest estimated probability (which is simply the class with the highest score):

$\hat{y} = \operatorname*{argmax}_{k} \; \sigma\!\left(\mathbf{s}(\mathbf{x})\right)_k = \operatorname*{argmax}_{k} \; s_k(\mathbf{x})$

The argmax operator returns the value of a variable that maximizes a function. Here, it’s the value of $k$ that maximizes the estimated probability $\sigma\!\left(\mathbf{s}(\mathbf{x})\right)_k$.

Softmax regression is multiclass, not multioutput! So can only be used with mutually exclusive classes – no predicting multiple people in a photo.

How to train
The objective is to get a model that estimates a high probability for the target class and low probabilities for the other classes.
Minimizing the cost function below (cross entropy) does just that, because it penalizes the model when it estimates a low probability for the target class.
Cross entropy cost function:

$J(\boldsymbol{\Theta}) = -\dfrac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\!\left(\hat{p}_k^{(i)}\right)$

where $y_k^{(i)}$ is the target probability that the $i$th instance belongs to class $k$. Generally it’s either 1 or 0, depending on whether the instance belongs to the class or not.

When there are just two classes ($K = 2$), this cost function is equivalent to the logistic regression cost function (log loss).

Cross entropy gradient vector for class $k$: $\nabla_{\boldsymbol{\theta}^{(k)}} J(\boldsymbol{\Theta}) = \dfrac{1}{m} \sum_{i=1}^{m} \left( \hat{p}_k^{(i)} - y_k^{(i)} \right) \mathbf{x}^{(i)}$

Now compute the gradient vector for each class, use Gradient Descent (or other optimizer) to find the parameter matrix minimizing the cost function.
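
Scikit-Learn’s LogisticRegression switches to softmax regression automatically when trained on more than two classes (with a solver that supports it, like the default lbfgs). A small sketch on the iris data; the C value and probe point are just illustrative choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris.target

softmax_reg = LogisticRegression(C=30, random_state=42)
softmax_reg.fit(X, y)
softmax_reg.predict([[5, 2]])                  # predicted class for a 5 cm x 2 cm petal
softmax_reg.predict_proba([[5, 2]]).round(2)   # estimated probability per class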

Chapter 5 Support Vector Machines

  • Can do linear or nonlinear classification, regression, or novelty detection.
  • Shine with small to medium datasets (hundreds to thousands of instances), especially for classification.
    • Don’t scale well with very large datasets.

Linear SVM Classification

The book explains this by analogy, making it harder to capture.

Imagine two distinct classes in a space. Since they are distinct, their clusters don’t overlap. Now imagine a straight line that is drawn between the two clusters, representing a decision boundary. Since we can draw this line, the classes are linearly separable.
An SVM classifier will not only separate the two classes, but also stay as far away from the closest training instances as possible.
This is analogous to a street. Imagine the decision boundary line, and two parallel lines to it, that are placed exactly on the closest training instances in each cluster to the decision boundary line. This places the decision boundary line in the middle, kind of like the stripes in the middle of a road.
An SVM classifier tries to fit the widest possible street between the classes. This is called large margin classification.
Adding training instances outside this ‘street’ won’t affect the decision boundary. It’s fully determined (“supported”) by the instances on the edge (the closest training instances). These instances are called support vectors.

SVMs are sensitive to feature scales, so make sure you scale accordingly.

Soft Margin Classification

Saying that instances must strictly be ‘off the street’ and on the correct side is hard margin classification.
This only works if the data is linearly separable. And it’s sensitive to outliers.

We can avoid these issues with a more flexible model.
The goal is to keep the street as large as possible & limit margin violations (instances in the middle of the street, or on the wrong side). This is soft margin classification.

Scikit-Learn’s SVM classes let you control this trade-off with the regularization hyperparameter C (LinearSVC applies an $\ell_2$ penalty by default). Reducing C gives you a wider street, but likely also more margin violations. A lower C means less risk of overfitting, but if you reduce it too much, the model underfits.

##### Train
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2)  # Iris virginica

svm_clf = make_pipeline(StandardScaler(),
                        LinearSVC(C=1, dual=True, random_state=42))
svm_clf.fit(X, y)

##### Predict
X_new = [[5.5, 1.7], [5.0, 1.5]]
svm_clf.predict(X_new)
> array([ True, False])  # first plant is an Iris virginica; second isn't.

##### Scores
svm_clf.decision_function(X_new)
> array([ 0.66163411, -0.22036063])  # scores used

The scores we see above are used to make the predictions. They measure the signed (positive/negative sign) distance between each instance and the decision boundary.
When the score is greater than 0, the instance is predicted to be of the positive class. When less than zero, it belongs to the negative class. If exactly zero, it’s more ambiguous: it is located on the decision boundary. This is what signed means.

For Scikit-Learn: LinearSVC doesn’t have a predict_proba() method to estimate class probabilities.
If you use SVC instead and set its probability hyperparameter to True, the model fits an extra model at the end of training to map the SVM decision function scores to estimated probabilities. Under the hood this uses 5-fold cross-validation to generate out-of-sample predictions for every instance in the training set, then trains a logistic regression model on them, which slows down training. After that you can use predict_proba() and predict_log_proba().
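
A quick sketch of that (reusing the iris features X, y and the X_new instances from the code above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

prob_svm_clf = make_pipeline(StandardScaler(),
                             SVC(probability=True, random_state=42))
prob_svm_clf.fit(X, y)                 # slower than without probability=True
prob_svm_clf.predict_proba(X_new)      # per-class probability estimates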

Nonlinear SVM Classification

Many datasets aren’t even close to being linearly separable.
An approach to handling those datasets is adding more features (like Polynomial Regression). This can result in a linearly separable dataset.

To do this in Sklearn, you could create a pipeline with a PolynomialFeatures transformer, followed by a StandardScaler and a LinearSVC classifier.

from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, max_iter=10_000, dual=True, random_state=42)
)
polynomial_svm_clf.fit(X, y)

Next come some kernels that can help you train better models more effectively.
You should generally try the linear kernel first: LinearSVC is much faster than SVC(kernel="linear"), especially on large training sets. If the training set isn’t too large, you should also try a kernelized SVM, starting with the Gaussian RBF kernel, which often works well.
If you have time to spare, you can then experiment with a few other kernels using hyperparameter search, especially if some kernel is well suited to your training set’s structure.

Polynomial Kernel

Just adding more polynomial features is simple, and can work with all kinds of ML algorithms, but at a low polynomial degree, this method can’t deal with very complex datasets. And with a high polynomial degree, it creates a ton of features, making the model slow.

When using SVMs, you can use a technique called the kernel trick.
This lets you get the same result as if you had added many polynomial features, even with a very high degree, without actually having to add them. So no combinatorial explosion of the number of features.
This is what the Sklearn SVC class uses.

from sklearn.svm import SVC

# Train SVM classifier using 3rd degree polynomial kernel.
poly_kernel_svm_clf = make_pipeline(StandardScaler(),
						SVC(kernel="poly", degree=3, coef0=1, C=5))

poly_kernel_svm_clf.fit(X, y)

If the model is overfitting, reduce the polynomial degree. If underfitting, increase it.
coef0 controls how much the model is influenced by high-degree terms vs. low-degree terms.

Similarity Features

You can also tackle nonlinear problems by adding features computed with a similarity function, which measures how much each instance resembles a particular landmark.

The book’s figure selects two landmarks, at $x_1 = -2$ and $x_1 = 1$, on a 1D dataset. The similarity function is the Gaussian Radial Basis Function (RBF) with $\gamma = 0.3$, i.e. $\phi_{\gamma}(\mathbf{x}, \boldsymbol{\ell}) = \exp\!\left(-\gamma \lVert \mathbf{x} - \boldsymbol{\ell} \rVert^2\right)$: a bell-shaped function varying from 0 (very far from the landmark) to 1 (at the landmark), visible as the dotted curves in the figure.

Let’s go by example. Take the instance $x_1 = -1$: it’s located at a distance of 1 from the first landmark and 2 from the second. Its new features are $x_2 = \exp(-0.3 \times 1^2) \approx 0.74$ and $x_3 = \exp(-0.3 \times 2^2) \approx 0.30$. The transformed dataset (shown on the right of the figure) is now linearly separable.

Where to place the landmarks:
Simplest approach is to create a landmark at the location of every instance in the dataset. This creates many dimensions, increasing the chances that the transformed training set will be linearly separable.
But a training set with $m$ instances and $n$ features gets transformed into a training set with $m$ instances and $m$ features (assuming you drop the original features).
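
A tiny NumPy sketch of this transformation, using the landmark locations and $\gamma$ from the example above (the helper function is mine, not the book’s):

import numpy as np

def gaussian_rbf(X, landmark, gamma):
    # similarity of each instance to a single landmark
    return np.exp(-gamma * np.linalg.norm(X - landmark, axis=1) ** 2)

X_1d = np.array([[-1.0]])                        # the example instance x1 = -1
landmarks = [np.array([-2.0]), np.array([1.0])]
gamma = 0.3

np.column_stack([gaussian_rbf(X_1d, lm, gamma) for lm in landmarks])
> array([[0.74081822, 0.30119421]])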

Gaussian RBF Kernel

The similarity measures method can be computationally expensive when computing the new features, especially for large datasets.

You can once again use the kernel trick, making it possible to obtain a similar result as if you had added many similarity features, but without actually doing so.

rbf_kernel_svm_clf = make_pipeline(StandardScaler(),
   SVC(kernel="rbf", gamma=5, C=0.001)
)

rbf_kernel_svm_clf.fit(X, y)

gamma represents $\gamma$ in the Gaussian RBF similarity function. Increasing it makes the bell-shaped curve narrower, shrinking each instance’s range of influence. This makes the decision boundary more irregular, meaning it’ll wiggle around individual instances.
A small gamma value makes the bell-shaped curve wider, so instances have a larger range of influence and the decision boundary becomes smoother. In other words, $\gamma$ acts like a regularization hyperparameter: reduce it if the model is overfitting, and increase it if it’s underfitting.

SVM Classes and Computational Complexity

Sklearn info:

  • LinearSVC
    • Based on liblinear
    • Doesn’t support kernel trick, but scales almost linearly with number of training instances and number of features
    • Training time is roughly $O(m \times n)$, but it takes longer if you require very high precision (controlled by the tolerance hyperparameter tol, also noted $\epsilon$ – the default is usually fine).
  • SVC
    • Based on libsvm
    • Supports kernel trick
    • Training time is usually between $O(m^2 \times n)$ and $O(m^3 \times n)$. So it’s slow on large training sets, making it better suited for small to medium-sized nonlinear sets.
    • Scales well with number of features & esp. sparse features.
  • SGDClassifier
    • Does large margin classification by default
    • Uses SGD for training, meaning it uses little memory and can do incremental learning.
    • Scales very well: computational complexity is $O(m \times n)$.

SVM Regression

SVMs can be used for regression by tweaking the objective.

Now we aren’t trying to fit the largest possible street between two classes while limiting margin violations, but rather trying to fit as many instances as possible on the street while limiting margin violations (instances that end up off the street).

We control the width of the street with the hyperparameter $\epsilon$.
Reducing $\epsilon$ increases the number of support vectors, which regularizes the model.
The model is said to be $\epsilon$-insensitive, meaning that adding more training instances within the margin doesn’t affect the model’s predictions.

from sklearn.svm import SVR

# extra code – these 3 lines generate a simple quadratic dataset
np.random.seed(42)
X = 2 * np.random.rand(50, 1) - 1
y = 0.2 + 0.1 * X[:, 0] + 0.5 * X[:, 0] ** 2 + np.random.randn(50) / 10

svm_poly_reg = make_pipeline(StandardScaler(),
                             SVR(kernel="poly", degree=2, C=0.01, epsilon=0.1))
svm_poly_reg.fit(X, y)

You can use a kernelized SVM model to handle nonlinear regression tasks.

from sklearn.svm import SVR

svm_poly_reg2 = make_pipeline(StandardScaler(),
                             SVR(kernel="poly", degree=2, C=100))
svm_poly_reg2.fit(X, y)

Sklearn: SVR is analogous to SVC and LinearSVR is analogous to LinearSVC.

Under the Hood of Linear SVM Classifiers

They predict the class of a new instance $\mathbf{x}$ by first computing the decision function $\boldsymbol{\theta}^{\top}\mathbf{x} = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$, where $x_0$ is the bias feature (always equal to 1).

If this is positive, the predicted class is the positive class (1), and otherwise it’s the negative class (0) – just like Logistic Regression.

Training requires finding the weight vector $\mathbf{w}$ and bias term $b$ that make the margin (street) as wide as possible while limiting the number of margin violations (recall, this is the objective we stated earlier). To make the margin wider, we need to make $\lVert\mathbf{w}\rVert$ smaller. The bias term has no effect on the size of the margin; it just shifts it around.

Since we want to avoid margin violations, we need the decision function to be greater than 1 for all positive training instances and lower than −1 for all negative training instances.
If we define $t^{(i)} = -1$ for negative instances (when $y^{(i)} = 0$) and $t^{(i)} = 1$ for positive instances (when $y^{(i)} = 1$), then we can write this constraint as $t^{(i)}\left(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b\right) \geq 1$ for all instances.

This leads to the hard margin linear SVM classifier objective, expressed as the following constrained optimization problem:

$\min_{\mathbf{w},\, b} \;\; \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} \qquad \text{subject to} \qquad t^{(i)}\left(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b\right) \geq 1 \;\; \text{for } i = 1, 2, \dots, m$

The reason we minimize $\frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}$, which is equal to $\frac{1}{2}\lVert\mathbf{w}\rVert^2$, rather than $\lVert\mathbf{w}\rVert$, is that it has a nice, simple derivative (just $\mathbf{w}$), while $\lVert\mathbf{w}\rVert$ is not differentiable at $\mathbf{w} = \mathbf{0}$.
Optimization algorithms often work better on differentiable functions.

Getting to a soft margin objective requires introducing a slack variable $\zeta^{(i)} \geq 0$ for each instance: $\zeta^{(i)}$ measures how much the $i$th instance is allowed to violate the margin.

Now we have two conflicting objectives:

  • Make the slack variables as small as possible to reduce margin violations
  • Make $\frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}$ as small as possible to increase the margin

This is where the C hyperparameter comes in, allowing us to define the trade-off between the two objectives — leading us to…

Soft margin linear SVM classifier objective:

$\min_{\mathbf{w},\, b,\, \zeta} \;\; \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} + C \sum_{i=1}^{m} \zeta^{(i)} \qquad \text{subject to} \qquad t^{(i)}\left(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b\right) \geq 1 - \zeta^{(i)} \;\text{ and }\; \zeta^{(i)} \geq 0 \;\; \text{for } i = 1, 2, \dots, m$

Both the hard and soft margin problems are convex quadratic optimization problems with linear constraints. These problems are known as quadratic programming (QP) problems.

Using a QP solver is one way to train an SVM. Another is to use Gradient Descent to minimize the hinge loss or squared hinge loss.
Squared hinge loss is more sensitive to outliers, but tends to converge faster if the dataset is clean. It’s also what Sklearn’s LinearSVC uses, whereas SGDClassifier uses the hinge loss.

When an instance of the positive class ($t^{(i)} = 1$) is off the street and on the positive side, the decision function $s^{(i)} = \mathbf{w}^{\top}\mathbf{x}^{(i)} + b$ is $\geq 1$, so the hinge loss $\max\!\left(0, 1 - t^{(i)} s^{(i)}\right)$ is 0.
Conversely, when an instance of the negative class ($t^{(i)} = -1$) is off the street and on the negative side, $s^{(i)} \leq -1$, and the loss is 0 as well.
The further an instance is from the correct side of the margin, the higher the loss: it grows linearly for the hinge loss and quadratically for the squared hinge loss (which is what makes the latter more sensitive to outliers).

Implementation of Linear SVC using Gradient Descent from the book

from sklearn.base import BaseEstimator

class MyLinearSVC(BaseEstimator):
    def __init__(self, C=1, eta0=1, eta_d=10000, n_epochs=1000,
                 random_state=None):
        self.C = C
        self.eta0 = eta0
        self.n_epochs = n_epochs
        self.random_state = random_state
        self.eta_d = eta_d

    def eta(self, epoch):
        return self.eta0 / (epoch + self.eta_d)
        
    def fit(self, X, y):
        # Random initialization
        if self.random_state:
            np.random.seed(self.random_state)
        w = np.random.randn(X.shape[1], 1)  # n feature weights
        b = 0

        t = np.array(y, dtype=np.float64).reshape(-1, 1) * 2 - 1
        X_t = X * t
        self.Js = []

        # Training
        for epoch in range(self.n_epochs):
            support_vectors_idx = (X_t.dot(w) + t * b < 1).ravel()
            X_t_sv = X_t[support_vectors_idx]
            t_sv = t[support_vectors_idx]

            J = 1/2 * (w * w).sum() + self.C * ((1 - X_t_sv.dot(w)).sum() - b * t_sv.sum())
            self.Js.append(J)

            w_gradient_vector = w - self.C * X_t_sv.sum(axis=0).reshape(-1, 1)
            b_derivative = -self.C * t_sv.sum()
                
            w = w - self.eta(epoch) * w_gradient_vector
            b = b - self.eta(epoch) * b_derivative
            

        self.intercept_ = np.array([b])
        self.coef_ = np.array([w])
        support_vectors_idx = (X_t.dot(w) + t * b < 1).ravel()
        self.support_vectors_ = X[support_vectors_idx]
        return self

    def decision_function(self, X):
        return X.dot(self.coef_[0]) + self.intercept_[0]

    def predict(self, X):
        return self.decision_function(X) >= 0

The Dual Problem

This is another way to train a linear SVM classifier: solving the dual problem.

Given a constrained optimization problem (known as the primal problem), we can express a different (but closely related) problem called its dual problem.
This doesn’t always give the same solution (typically just a lower bound), but it can. This is the case for the SVM problem: both the dual problem and primal problem have the same solution here.

Dual form of the linear SVM objective:

$\min_{\boldsymbol{\alpha}} \;\; \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} \, \mathbf{x}^{(i)\top}\mathbf{x}^{(j)} \;-\; \sum_{i=1}^{m} \alpha^{(i)} \qquad \text{subject to} \qquad \alpha^{(i)} \geq 0 \;\; \text{for } i = 1, 2, \dots, m$

Using a QP solver, you can find the vector $\hat{\boldsymbol{\alpha}}$ that minimizes this equation. Then you can use the equations below to compute the $\hat{\mathbf{w}}$ and $\hat{b}$ that minimize the primal problem ($n_s$ is the number of support vectors).

From the dual solution to the primal solution:

$\hat{\mathbf{w}} = \sum_{i=1}^{m} \hat{\alpha}^{(i)} t^{(i)} \mathbf{x}^{(i)}, \qquad \hat{b} = \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( t^{(i)} - \hat{\mathbf{w}}^{\top}\mathbf{x}^{(i)} \right)$

The dual problem is faster to solve than the primal one when the number of training instances is smaller than the number of features. More importantly, it makes the kernel trick possible.

Kernelized SVMs

Say you want to apply a second-degree Polynomial transformation to a two-dimensional training set, then train a linear SVM classifier on the transformed set.

Here’s the second-degree polynomial mapping function you want to apply:

$\phi(\mathbf{x}) = \phi\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$

The transformation results in a 3D vector, not a 2D vector.
What if we apply this second-degree polynomial mapping to two 2D vectors $\mathbf{a}$ and $\mathbf{b}$ and compute the dot product of the transformed vectors?

Note: the dot product is often written $\mathbf{a} \cdot \mathbf{b}$, but in ML, vectors are usually represented as column vectors, so the dot product is written $\mathbf{a}^{\top}\mathbf{b}$.

Kernel trick for a second-degree polynomial mapping:

$\phi(\mathbf{a})^{\top}\phi(\mathbf{b}) = \begin{pmatrix} a_1^2 \\ \sqrt{2}\,a_1 a_2 \\ a_2^2 \end{pmatrix}^{\!\top} \begin{pmatrix} b_1^2 \\ \sqrt{2}\,b_1 b_2 \\ b_2^2 \end{pmatrix} = a_1^2 b_1^2 + 2 a_1 b_1 a_2 b_2 + a_2^2 b_2^2 = \left(a_1 b_1 + a_2 b_2\right)^2 = \left(\mathbf{a}^{\top}\mathbf{b}\right)^2$

So the dot product of the transformed vectors is equal to the square of the dot product of the original vectors: $\phi(\mathbf{a})^{\top}\phi(\mathbf{b}) = \left(\mathbf{a}^{\top}\mathbf{b}\right)^2$.

The main idea: if you apply the transformation $\phi$ to all training instances, then the dual problem contains the dot product $\phi\!\left(\mathbf{x}^{(i)}\right)^{\top} \phi\!\left(\mathbf{x}^{(j)}\right)$. But if $\phi$ is the second-degree polynomial transformation defined above, you can replace this dot product of transformed vectors simply by $\left(\mathbf{x}^{(i)\top}\mathbf{x}^{(j)}\right)^2$. You don’t need to transform the training instances at all; just replace the dot product by its square in the dual form of the linear SVM objective.
The result is strictly the same as if you’d transformed the training set and fitted a linear SVM, but this trick makes the whole process more computationally efficient.

The function $K(\mathbf{a}, \mathbf{b}) = \left(\mathbf{a}^{\top}\mathbf{b}\right)^2$ is a second-degree polynomial kernel.
In ML, a kernel is a function capable of computing the dot product $\phi(\mathbf{a})^{\top}\phi(\mathbf{b})$ based only on the original vectors $\mathbf{a}$ and $\mathbf{b}$, without having to compute (or even know about) the transformation $\phi$.
Some commonly used kernels:

  • Linear: $K(\mathbf{a}, \mathbf{b}) = \mathbf{a}^{\top}\mathbf{b}$
  • Polynomial: $K(\mathbf{a}, \mathbf{b}) = \left(\gamma\, \mathbf{a}^{\top}\mathbf{b} + r\right)^d$
  • Gaussian RBF: $K(\mathbf{a}, \mathbf{b}) = \exp\!\left(-\gamma \lVert \mathbf{a} - \mathbf{b} \rVert^2\right)$
  • Sigmoid: $K(\mathbf{a}, \mathbf{b}) = \tanh\!\left(\gamma\, \mathbf{a}^{\top}\mathbf{b} + r\right)$

Mercer’s theorem says that if a kernel function $K(\mathbf{a}, \mathbf{b})$ respects a few conditions (Mercer’s conditions), then there exists a function $\phi$ that maps $\mathbf{a}$ and $\mathbf{b}$ into another space (possibly of much higher dimension) such that $K(\mathbf{a}, \mathbf{b}) = \phi(\mathbf{a})^{\top}\phi(\mathbf{b})$.
You can use $K$ as a kernel because you know $\phi$ exists, even if you don’t know what it is. For the Gaussian RBF kernel, it can be shown that $\phi$ maps each training instance to an infinite-dimensional space, so it’s rather nice that you don’t have to do the mapping.
Some kernels (e.g. the sigmoid kernel) don’t respect all of Mercer’s conditions, yet they generally work well in practice.

How do we then make predictions?
After using the kernel trick, the solution includes $\phi\!\left(\mathbf{x}^{(i)}\right)$. In fact, $\hat{\mathbf{w}}$ must have the same number of dimensions as $\phi\!\left(\mathbf{x}^{(i)}\right)$, which may be huge or even infinite, so you can’t compute it. Instead, you can plug the formula for $\hat{\mathbf{w}}$ into the decision function for a new instance $\mathbf{x}^{(n)}$, and you get an equation with only dot products between input vectors:

$h_{\hat{\mathbf{w}}, \hat{b}}\!\left(\phi(\mathbf{x}^{(n)})\right) = \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \hat{\alpha}^{(i)} t^{(i)} \, K\!\left(\mathbf{x}^{(i)}, \mathbf{x}^{(n)}\right) + \hat{b}$

Since $\hat{\alpha}^{(i)} \neq 0$ only for support vectors, making predictions involves computing the dot product of the new input vector with only the support vectors, not all the training instances. The same trick is used to compute the bias term $\hat{b}$:

$\hat{b} = \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( t^{(i)} - \sum_{\substack{j=1 \\ \hat{\alpha}^{(j)} > 0}}^{m} \hat{\alpha}^{(j)} t^{(j)} K\!\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) \right)$

Chapter 6 Decision Trees

Training and Visualizing a Decision Tree

Interestingly, you can use GraphViz to visualize the decision trees you make:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris(as_frame=True)
X_iris = iris.data[["petal length (cm)", "petal width (cm)"]].values
y_iris = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_iris, y_iris)

from pathlib import Path

from sklearn.tree import export_graphviz

IMAGES_PATH = Path()  # placeholder: set this to wherever you want the .dot file saved

export_graphviz(
        tree_clf,
        out_file=str(IMAGES_PATH / "iris_tree.dot"),
        feature_names=["petal length (cm)", "petal width (cm)"],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )

from graphviz import Source

Source.from_file(IMAGES_PATH / "iris_tree.dot")

Convert to png with the GraphViz CLI tool:

dot -Tpng {IMAGES_PATH / "iris_tree.dot"} -o {IMAGES_PATH / "iris_tree.png"}

Root node is both the root and a split node. Left (orange) is a leaf node. Right (white) is a split node. Then that node’s children are both leaf nodes. In the image, the leftmost arrow after split nodes represents True and the rightmost arrow represents False.

Making Predictions

Using the above tree to make decisions. Say you have an iris flower:

  • Start at root node: is flower petal length < 2.45 cm?
    • If yes, you’re at a leaf node, and it’s likely an iris setosa.
    • If no, move to right child node (depth 1): is the petal width smaller than 1.75 cm?
      • If yes, it’s likely an iris versicolor (depth 2, left)
      • If not, it’s likely an iris virginica (depth 2, right)

The samples attribute of a node counts how many training instances the node applies to.
value says how many instances of each class this node applies to.

gini measures the node’s Gini impurity.
A node is pure (gini=0) if all training instances it applies to belong to the same class.
That means, if a node only applies to training instances of a single class, the node is pure, and its Gini impurity is 0.

You can compute the Gini impurity $G_i$ of the $i$th node as follows:

$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$

where $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i$th node.
So for a node covering 63 training instances, 13 of which belong to a third class, you plug the three class ratios into this formula.
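
A quick worked example with made-up class counts (not necessarily the ones from the book’s figure): a node covering 54 instances, split 0/49/5 across three classes:

p = [0 / 54, 49 / 54, 5 / 54]
gini = 1 - sum(ratio ** 2 for ratio in p)
print(round(gini, 3))
> 0.168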

Estimating Class Probabilities

You can use a decision tree to measure the probability that an instance belongs to a particular class.
To do so, it traverses the tree to find the leaf node for the instance, then returns the ratio of training instances of class $k$ in that leaf node.

Use predict_proba in Sklearn.
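
For example, with the tree trained above (the flower measurements below are just an arbitrary query):

tree_clf.predict_proba([[5, 1.5]]).round(3)  # class probabilities from the matching leaf
tree_clf.predict([[5, 1.5]])                 # the majority class of that leaf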

The CART Training Algorithm

The Classification and Regression Tree (CART) algorithm is used to train decision trees (called “growing trees”).

It works as follows:
Split the training set into two subsets using a single feature $k$ and a threshold $t_k$ (e.g., “is the petal length ≤ 2.45 cm?”).

It chooses $k$ and $t_k$ by searching for the pair $(k, t_k)$ that produces the purest subsets, weighted by their size.
Here’s the CART cost function for classification:

$J(k, t_k) = \dfrac{m_{\text{left}}}{m}\, G_{\text{left}} + \dfrac{m_{\text{right}}}{m}\, G_{\text{right}}$

where $G_{\text{left/right}}$ measures the impurity of the left/right subset, and $m_{\text{left/right}}$ is the number of instances in the left/right subset.

Once it has successfully split the training set in two, it splits the subsets in the same way, recursively, until it reaches the maximum depth defined by the max_depth hyperparameter (or it can’t find a split that reduces impurity).

There are some more Hyperparameters to control additional stopping conditions:

  • min_samples_split
  • min_samples_leaf
  • min_weight_fraction_leaf
  • max_leaf_nodes

This algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each subsequent level. Finding the optimal tree is known to be an NP-complete problem, requiring $O(\exp(m))$ time, so we settle for a reasonably good solution.

Computational Complexity

  • Making predictions requires traversing the decision tree from the root to a leaf.
    • Decision trees are approximately balanced, so traversing one means going through roughly $O(\log_2(m))$ nodes, where $\log_2$ is the binary logarithm ($\log_2(m) = \log(m)/\log(2)$).
  • Each node only requires checking the value of one feature, so the overall prediction complexity is $O(\log_2(m))$, independent of the number of features – which is fast.
  • The training algorithm compares all features (or fewer if max_features is set) on all samples at each node, giving a training complexity of $O(n \times m \log_2(m))$.

Gini Impurity or Entropy

DecisionTreeClassifier uses Gini Impurity by default.

Entropy

You can make it use Entropy by setting criterion.
Entropy as a concept comes from thermodynamics, where it measures molecular disorder: entropy approaches zero when molecules are still and well-ordered.
From thermodynamics, it spread to other fields, like Shannon’s information theory (to measure avg. information content of a message: zero when all messages are identical).

In machine learning, Entropy is often used to measure impurity. The entropy of a set is zero when it contains instances of only one class.
So we can use it to measure entropy for the nodes in our decision tree.

Entropy is defined as:

$H_i = -\sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{n} p_{i,k} \log_2\!\left(p_{i,k}\right)$

where $H_i$ is the entropy of the $i$th node and $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i$th node.

Plugging a node’s class ratios into this formula gives its entropy (classes with a ratio of 0 are skipped).
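
Using the same made-up 0/49/5 split of 54 instances as in the Gini example (the zero-ratio class is skipped, since $0 \log_2 0$ is taken as 0):

import math

p = [49 / 54, 5 / 54]
entropy = -sum(ratio * math.log2(ratio) for ratio in p)
print(round(entropy, 3))
> 0.445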

So which one?

Most of the time it doesn’t matter. Gini impurity is faster to compute.
When they differ, Gini impurity tends to isolate the most frequent class in its own branch, whereas entropy tends to produce slightly more balanced trees.

Regularization Hyperparameters

Decision trees make few assumptions about the training data.
If you don’t constrain them, they’ll likely overfit.

This kind of model is called a nonparametric model, because the number of parameters isn’t determined prior to training, so the model structure can stick to the data.
A parametric model, on the other hand, has a preset number of parameters, limiting its degree of freedom, which, in turn, reduces the risk of overfitting. For example, a linear model.

To avoid overfitting, you need to regularize. You can tune these hyperparameters:

  • max_depth controls the maximum depth of the decision tree. By default it’s None, meaning unlimited depth.
  • max_features
  • min_samples_split
  • min_samples_leaf
  • min_weight_fraction_leaf.

Increase min_* or reduce max_* hyperparameters to regularize.
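
For instance, a sketch in the spirit of the book’s moons example (the dataset parameters and min_samples_leaf value here are my choices):

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=150, noise=0.2, random_state=42)

tree_unrestricted = DecisionTreeClassifier(random_state=42)                      # no regularization
tree_regularized = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)   # regularized
tree_unrestricted.fit(X_moons, y_moons)
tree_regularized.fit(X_moons, y_moons)
# the regularized tree will typically generalize better to new samples from the same distribution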

Regression

Decision trees can also do regression tasks.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X_quad = np.random.rand(200, 1) - 0.5  # just one, random feature as input
y_quad = X_quad ** 2 + 0.025 * np.random.randn(200, 1)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)

To predict, it traverses the tree until it gets to a leaf node; traversal just means repeatedly checking whether a specific feature value is lower than the node’s learned threshold.
The prediction value at the leaf is the average target value of the training instances associated with the node. This assumes the nodes’ associated training instances are similar enough that their average target value is a reasonable prediction.

You can basically imagine that the tree splits the feature space into regions, each represented by a leaf node. The boundaries of the regions are determined by the sequence of splits (decisions at internal nodes) leading to each leaf.
In each of these regions, it’ll make a constant prediction, which is the average of the associated training instances.

The splits the algorithm makes support the assumption mentioned above: they’re chosen so that most training instances end up as close as possible to their region’s predicted value.

CART works as before, but instead of trying to split to minimize impurity, it tries to split the training set to minimize MSE.

CART cost function for regression:

$J(k, t_k) = \dfrac{m_{\text{left}}}{m}\,\mathrm{MSE}_{\text{left}} + \dfrac{m_{\text{right}}}{m}\,\mathrm{MSE}_{\text{right}}$

where $\mathrm{MSE}_{\text{node}} = \dfrac{1}{m_{\text{node}}} \sum_{i \in \text{node}} \left( \hat{y}_{\text{node}} - y^{(i)} \right)^2$ and $\hat{y}_{\text{node}} = \dfrac{1}{m_{\text{node}}} \sum_{i \in \text{node}} y^{(i)}$.

You should also make sure you regularize when using decision trees for regression tasks.

Sensitivity to Axis Orientation

Decision trees love orthogonal decision boundaries, which is where all splits are perpendicular to an axis. This can lead to some bad fits.
You can deal with some of those by scaling the data, and then applying a PCA transformation. That will rotate the data in a way that reduces correlation between features, which often, but not always, makes things easier for trees.

Decision tree pipeline with scaling and PCA rotation

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pca_pipeline = make_pipeline(StandardScaler(), PCA())
X_iris_rotated = pca_pipeline.fit_transform(X_iris)
tree_clf_pca = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf_pca.fit(X_iris_rotated, y_iris)

Decision Trees Have a High Variance

Small changes to hyperparameters or data can produce very different models.
Even retraining the same decision tree on the same data can produce very different models.

By averaging predictions over many trees, you can reduce variance. This ensemble is called a Random Forest.

Chapter 7 Ensemble Learning and Random Forests

By aggregating the predictions from a group of predictors, like classifiers or regressors, you often get better predictions than from the best individual predictor.
This group is called an ensemble.

If you train a group of decision tree classifiers, each on a different random subset of the training data, and you use their collective predictions to predict a class, then you have a random forest.

You’d often use ensemble methods near the end of a project, when you’ve already built some good predictors, combining them into an even better one.

Voting Classifiers

If you have a bunch of classifiers, you can aggregate the predictions of each one, such that the class that gets the most votes is the ensemble’s prediction.
This is a hard voting classifier.

Often, this gets higher accuracy than the best classifier in the ensemble.
Even with each classifier being a weak learner (only does a little better than random guessing), the ensemble can still be a strong learner (gets high accuracy) – as long as there are a sufficient number of weak learners in the ensemble, and they are sufficiently diverse.

Say you have a biased coin: 51% heads, 49% tails.
Toss it 1000 times, and you’ll likely have around 510 heads and 490 tails.
That is, you have a majority of heads.
The probability of getting a majority of heads after 1000 tosses is close to 75%.
The more you toss the coin, the higher the probability of getting majority heads. This is due to the law of large numbers.

Carrying this analogy back to ensembles: imagine 1,000 classifiers in an ensemble. Even if each is only right 51% of the time, simply predicting the majority-voted class can get you up to about 75% accuracy.
But this is only true if they are perfectly independent and make uncorrelated errors. It’s more likely they’ll make the same type of errors – they were trained on the same data. This reduces the accuracy a bit.

Ensembles are best when the predictors are as independent as possible. One way to achieve this is training them using different algorithms.

Creating a VotingClassifier ensemble:

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('log_reg', LogisticRegression(random_state=42)),
        ('rnd_forest', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)

voting_clf.fit(X_train, y_train)

When fitting a VotingClassifier, it clones each estimator and fits each clone.
To find each classifier’s accuracy on the test set:

for name, clf in voting_clf.named_estimators_.items():  # the fitted clones (note the trailing underscore)
	print(name, "=", clf.score(X_test, y_test))

> log_reg = 0.864
> rnd_forest = 0.896
> svc = 0.896

When you predict(), the VotingClassifier does hard voting.

voting_clf.score(X_test, y_test)
> 0.912

Notice how the accuracy is higher than the individual models.

You can also do soft voting, which is where it predicts the class with the highest class probability, averaged over all individual classifiers. This is only possible if all classifiers can estimate class probabilities.
This often gets better performance than hard voting because it gives more weight to highly confident votes.

voting_clf.voting = "soft"
# SVC doesn't estimate class probabilities by default, so set this hyperparameter to true. Then it uses cross-validation to estimate class probabilities.
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
> 0.92

And as we see, it improves the accuracy.

Bagging and Pasting

Instead of using different training algorithms to get different classifiers, you can just train the same algorithm on different random subsets of the training set.

Bagging is to do just that, with replacement. “With replacement” means that you can possibly train on the same data multiple times. You have a pool of possible data to take from, and each time you take something out, you also put it back in.
Bagging is short for bootstrap aggregating, where bootstrap refers to its meaning in statistics: resampling with replacement.
In short: bagging is the process of training the same algorithm on different random subsets of the training set with replacement.

When this sampling method is done without replacement, it’s pasting. So the same training instances cannot be redrawn from the pool.

Both methods involve training several predictors on different random samples of the training set.

Predictions are made by the ensemble by aggregating the predictions of all predictors.
It’s often the most frequent prediction that’s used for classification (aggregation function is the statistical mode), whereas it’s the average for regression.

Each individual predictor has higher bias than if it were trained on the original training set (so without random sampling), but the aggregation process reduces both bias and variance.

A nice feature of these methods is that they scale well: training can be done in parallel (across CPU cores, servers), and so can predictions.

Bagging and Pasting in Scikit-Learn

Train ensemble of 500 decision tree classifiers, each on 100 instances randomly sampled with replacement:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, n_jobs=-1, random_state=42)
		   
bag_clf.fit(X_train, y_train)

n_jobs=-1 tells Sklearn to use all available CPU cores.

To train with pasting, set bootstrap=False.

Bagging is generally preferred as it often results in better models.

Out-of-Bag Evaluation

When you do bagging, some instances may be sampled multiple times, and others not at all. With the BaggingClassifier defaults, roughly 63% of the training instances are sampled for each predictor, and about 37% are not.
Training instances that are not sampled are called out-of-bag (OOB) instances.

Since they weren’t used for training, you can use the OOB instances to evaluate, meaning you won’t need a validation set.
Given enough estimators, it’s highly likely that each instance in the training set is an OOB instance for several estimators. So you can use them to make a fair ensemble prediction for that instance.

Set oob_score=True on BaggingClassifier to get an automatic OOB evaluation after training. Get the resulting score with the oob_score_ attribute.
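
For example (reusing the moons training set from the bagging example above):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_                   # OOB accuracy estimate, no validation set needed
bag_clf.oob_decision_function_[:3]   # OOB class probabilities for the first 3 instances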

Random Patches and Random Subspaces

BaggingClassifier can also sample features. This means each predictor would be trained on a random subset of input features.

This is good for high-dimensional input as it can speed up training.

Sampling both training instances and features is called the random patches method.
Keeping all training instances (bootstrap=False, max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features < 1.0) is called the random subspaces method.

Sampling features leads to more predictor diversity, so you get a bit more bias for lower variance.

Random Forests

This is an ensemble of decision trees.

  • The trees are typically trained via bagging.
  • max_samples is typically the size of the training set.

RandomForestClassifier or RandomForestRegressor.

The algorithm introduces randomness when growing trees.
Instead of looking for the best feature when splitting a node, it looks for the best feature among a random subset of features (for classification, the default subset size is $\sqrt{n}$, where $n$ is the total number of features).
This gives greater tree diversity.
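
For example (reusing the moons training set from earlier; max_leaf_nodes here is an illustrative choice):

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)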

Extra-Trees

When growing a tree in a random forest, at each node only a random subset of the features is considered for splitting.
You can make the trees even more random by using random thresholds for each feature (instead of searching for the best possible threshold). That’s extra-trees.
Set splitter="random" on DecisionTreeClassifier.

A forest of these kinds of trees is an extremely randomized trees ensemble (short: extra-trees).
This trades more bias for lower variance. And the extra-tree classifiers are faster to train due to opting to use a random threshold, instead of finding the best one.

Use ExtraTreesClassifier.

Feature Importance

You can use random forests to measure the relative importance of each feature.
They do that by looking at how much the tree nodes that use that feature reduces impurity on average, across all trees in the forest (by weighted average, each node’s weight being the amount of training instances associated with it).

Check feature_importances_ after training.
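
For example, on the iris dataset (a sketch of the kind of check the book does; exact scores will vary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2), name)  # petal features typically dominate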

Boosting

Boosting was originally called hypothesis boosting.
It refers to any ensemble method that combines several weak learners into a strong learner.

The idea is to train predictors sequentially, each one trying to correct its predecessor.
The most popular approaches are AdaBoost (adaptive boosting) and gradient boosting.

AdaBoost

As mentioned, the idea in boosting is to train predictors sequentially, each one trying to correct its predecessor.
AdaBoost makes the succeeding predictors pay more attention to the training instances that the predecessors underfit.
For example, it might first train a base classifier and use it to make predictions. Then it increases the relative weight of misclassified training instances.
Next it will train a second classifier using the updated weights. Then it’ll repeat the process of increasing the weights, making predictions, and retraining.

The ensemble makes predictions where the predictors have different weights depending on their overall accuracy on the weighted training set.

A drawback of this method of sequential learning is that you cannot parallelize training. Each predictor can only be trained after the previous one has been both trained and evaluated.

The algorithm itself works as follows.
Each instance weight $w^{(i)}$ is initially set to $\frac{1}{m}$.
When the first predictor is trained, we calculate its weighted error rate $r_1$ on the training set (in general, for the $j$th predictor):

$r_j = \dfrac{\sum_{\substack{i=1,\; \hat{y}_j^{(i)} \neq y^{(i)}}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}$

where $\hat{y}_j^{(i)}$ is the $j$th predictor’s prediction for the $i$th instance.

We then compute the predictor’s weight $\alpha_j$:

$\alpha_j = \eta \log\dfrac{1 - r_j}{r_j}$

where $\eta$ is the learning rate hyperparameter (defaults to 1).

The more accurate the predictor is, the higher its weight will be.
If it guesses randomly, the weight is close to zero. If it’s most often wrong, its weight is negative.

Now the algorithm updates the instance weights, boosting the weights of the misclassified instances: for $i = 1, 2, \dots, m$,

$w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases}$

After which it normalizes all the instance weights, i.e., divides each by $\sum_{i=1}^{m} w^{(i)}$.

It then trains a new predictor with the updated weights, and the process is repeated until the desired number of predictors have been created – or a perfect predictor is found.

To make predictions, AdaBoost computes the predictions of all the predictors and weighs them by the predictor weights $\alpha_j$. The predicted class is the one that receives the majority of weighted votes:

$\hat{y}(\mathbf{x}) = \operatorname*{argmax}_{k} \sum_{\substack{j=1 \\ \hat{y}_j(\mathbf{x}) = k}}^{N} \alpha_j$

where $N$ is the number of predictors.

Scikit-Learn uses Stagewise Additive Modeling using a Multiclass Exponential loss function (SAMME), which is a multiclass version of AdaBoost. When there are two classes, they’re equivalent. If the predictors can estimate class probabilities, Sklearn uses SAMME.R (R stands for “Real”), which relies on the class probabilities instead of predictions and which generally performs better.

Training an AdaBoost classifier on 30 decision stumps:

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
	DecisionTreeClassifier(max_depth=1),
	n_estimators=30,
	learning_rate=0.5,
	random_state=42
)

ada_clf.fit(X_train, y_train)

A decision stump is a decision tree with a max depth of 1 – it has one decision node and two leaf nodes.

If your AdaBoost ensemble overfits, reduce the number of estimators, or increase regularization of the base estimator.

Gradient Boosting

Gradient Boosting also works by sequentially adding predictors to the ensemble, where each one corrects its predecessor.

Instead of modifying instance weights at every iteration (AdaBoost), this fits the new predictor to the residual errors made by the previous predictor.

Gradient Tree Boosting / Gradient Boosted Regression Trees (GBRT) uses decision trees as the base predictors:
The dataset in the example below is a noisy quadratic (with Gaussian random noise).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Generate noisy quadratic dataset
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

# GBRT
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2) # train on residuals made by the first predictor

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3) # train on residuals made by the second predictor

# 'Test' set
X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
> array([0.49484029, 0.04021166, 0.75026781])

But in practice, you’d use Scikit-Learn’s GradientBoostingRegressor to train GBRT ensembles (or GradientBoostingClassifier for classification).

The learning rate scales the contribution of each tree. If it’s low (e.g. 0.05), you’ll need more trees in the ensemble, but predictions usually generalize better. This is a Regularization technique called Shrinkage.

Instead of doing cross-validation to find the optimal number of trees, you can use the n_iter_no_change hyperparameter to automatically stop adding more trees during training if the last X trees didn’t help (where X is the int you assigned to it). This is basically Early Stopping, but it tolerates no progress for a few iterations before stopping.
If you set it too low, it might stop too early and thus underfit. But if too high, your model might overfit.
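
A sketch combining shrinkage and n_iter_no_change, reusing the noisy quadratic X, y from above (the hyperparameter values are illustrative):

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, learning_rate=0.05,
                                 n_estimators=500, n_iter_no_change=10,
                                 random_state=42)
gbrt.fit(X, y)
gbrt.n_estimators_  # the number of trees actually kept after early stopping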

Using the subsample hyperparameter sets the fraction of training instances to use for training each tree: subsample=0.25 means each tree is trained on 25% of training instances, selected randomly. This is stochastic gradient boosting.

Some good optimized gradient boosting implementations outside Scikit-Learn: XGBoost, LightGBM, and CatBoost.

Histogram-Based Gradient Boosting

Histogram-Based Gradient Boosting (HGB) is another GBRT implementation. It’s optimized for large datasets.

It works by binning input features, replacing them with integers.
This reduces the number of possible thresholds the training algorithm needs to evaluate.
And working with integers lets you use faster and more memory-efficient data structures.
Plus, due to the way bins are built, there’s no need for sorting the features when training each tree.

Computational complexity is $O(b \times m)$, where $b$ is the number of bins and $m$ is the number of training instances.
It can be more than 100x faster than GBRT on large datasets, but binning can cause a precision loss. This does act as a regularizer, and help reduce overfitting (or cause underfitting).
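
Basic usage mirrors the other ensembles; a minimal sketch reusing the noisy quadratic X, y and X_new from earlier (it also enables early stopping automatically on large datasets):

from sklearn.ensemble import HistGradientBoostingRegressor

hgb_reg = HistGradientBoostingRegressor(random_state=42)
hgb_reg.fit(X, y)
hgb_reg.predict(X_new)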

Stacking

Short for stacked generalization.

Instead of using trivial functions (e.g. hard voting) to aggregate predictions of all predictors, you train a model to do the aggregation.
This model is called a blender, or meta learner. It takes the predictions from the ensemble models and makes the final predictions.

To train the blender, first build the blending training set.
Can use cross_val_predict() on all predictors in the ensemble to get out-of-sample predictions for each instance in the original training set, and then use them as input features to train the blender. Copy the targets from the original training set.

The blending training set will have one input feature per predictor.
Once you’ve trained the blender, then retrain the base predictors one last time on the full training set.

You can even train several blenders like that. For example, one linear regression, one random forest regressor, and then get a whole layer of blenders… and add another blender on top of that. This could even improve performance – at the cost of training time and complexity.

Scikit-Learn has StackingClassifier and StackingRegressor you can use.
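
A sketch close to the book’s usage, reusing the moons split from earlier (the choice of base models and blender follows the voting example above):

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42)),
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5  # folds used to build the blending training set from out-of-sample predictions
)
stacking_clf.fit(X_train, y_train)
stacking_clf.score(X_test, y_test)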

Chapter 8 Dimensionality Reduction

Some problems involve data with thousands or even millions of features per training instance. This makes training extremely slow. It can even make it a lot harder to find a good solution – this is the curse of dimensionality.

In most problems you can reduce the number of features considerably, thus making the problem more tractable. For example, by dropping uninformative features or merging highly correlated ones.
Of course, reducing dimensions causes some information loss.

Dimensionality Reduction is also very useful for data visualization.
By reducing to two or three dimensions it’s possible to plot a condensed view of a high-dimensional dataset on a graph, which can lead to insights from visual inspection (e.g. seeing clusters).

This chapter talks about the curse of dimensionality.
Then two main approaches to Dimensionality Reduction:

  • Projection
  • Manifold learning

Then goes through three main Dimensionality Reduction techniques:

  • PCA
  • Random projection
  • Locally linear embedding

The Curse of Dimensionality

Things behave differently in high-dimensional space.
There’s only about a 0.4% chance that a random point in a unit square (1×1) lies less than 0.001 from a border. That is, it’s very unlikely that a random point is extreme along any dimension. But in a 10,000-dimensional hypercube, the probability is greater than 99.999999%.
Most points in a high-dimensional hypercube are very close to the border.

There’s a ton of space in high-dimensional space. Therefore, points are (on average) much further apart.
High-dimensional datasets are at risk of being very sparse. Most training instances are likely to be far away from any other instance, so our predictions are less reliable – we’ll need to extrapolate more.

The more dimensions the training set has, the greater the risk of overfitting it.

One solution could be to just get enough data to reach sufficient density of training instances. But in practice, the amount of data needed grows exponentially with number of dimensions.
With just 100 features, you’d need more training instances than atoms in the observable universe for the instances to be within 0.1 of each other on average, assuming they’re spread uniformly across dimensions.

Main Approaches for Dimensionality Reduction

Projection

In most real problems, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.
This is because instances are not spread out uniformly across all dimensions. Many features are constant, and others are highly correlated.

By projecting the training instances onto a subspace, you can reduce dimensions.

Manifold learning

Projection isn’t always the best approach. In many cases, the subspace may twist and turn.

By projecting such data onto a plane, you’d squash the different layers. You’d probably rather want to ‘unroll’ the data.

A 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space.
Generally, a $d$-dimensional manifold is a part of an $n$-dimensional space (where $d < n$) that locally resembles a $d$-dimensional hyperplane.

Many Dimensionality Reduction algorithms work by modeling the manifold on which the training instances lie; this is called manifold learning. It relies on the manifold assumption (or, also called manifold hypothesis), which says that most real-world high-dimensional datasets lie close to a much-lower dimensional manifold – this is often empirically observed.
This is often accompanied by another implicit assumption: that the task at hand (e.g. classification or regression) is simpler if expressed in the lower-dimensional space of the manifold.
But that doesn’t always hold. The new representation can make the data even more complex. Therefore, it all depends on the dataset.

PCA

Principal Component Analysis (PCA) is the most popular Dimensionality Reduction algorithm.

For very high-dimensional datasets, PCA can be too slow. E.g. datasets with tens of thousands of features or more. In this case you might want to use Random Projection.

Approach: First identify the hyperplane that is closest to the data. Then project data onto it.

Preserving the Variance

Before projecting, we need to find the right hyperplane.

We want to select the hyperplane that preserves the maximum amount of variance, as it’ll likely lead to less information loss compared to the other possible projections.

Using that criterion is also justified because the chosen axis minimizes the mean squared distance between the original dataset and its projection onto that axis.

Principal Components

Now PCA has found the axis that accounts for the largest amount of variance in the training set.

PCA then finds a second axis, which is orthogonal to the first one, which accounts for the largest amount of remaining variance.

And in higher-dimensional datasets, it’d find a third axis, orthogonal to the two other ones, and a fourth, fifth, and so on.
As many axes as the number of dimensions in the dataset.

The $i$th axis is called the $i$th principal component (PC) of the data.

For each of these principal components, PCA finds a zero-centered unit vector pointing in the direction of the PC. Since two opposing unit vectors lie on the same axis, the direction of the unit vectors returned by PCA is not stable. Altering the dataset and running PCA again might yield different results – the unit vectors may point in the opposite direction as the original. But they’ll generally be on the same axes. Sometimes a pair of unit vectors may rotate or swap (if variance along those two axes are close) but the plane generally remains the same.

Finding the principal components of a dataset can be done with Singular Value Decomposition (SVD): decompose the training set matrix $\mathbf{X}$ into $\mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^{\top}$, where $\mathbf{V}$ contains the unit vectors that define all the principal components you’re looking for.

import numpy as np

X = [...]  # create small 3D dataset
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt[0]  # component 1
c2 = Vt[1]

PCA assumes the dataset is centered around the origin.
But Sklearn’s PCA classes take care of centering.
If you implement PCA yourself, remember to handle centering.

Projecting Down to d Dimensions

Now that you have the principal components, you can reduce the dimensionality of the dataset down to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components.
Using this hyperplane ensures the projection preserves as much variance as possible.

To perform the projection and obtain the reduced dataset $\mathbf{X}_{d\text{-proj}}$, compute the matrix multiplication of the training set matrix $\mathbf{X}$ by the matrix $\mathbf{W}_d$, defined as the matrix containing the first $d$ columns of $\mathbf{V}$: $\mathbf{X}_{d\text{-proj}} = \mathbf{X}\,\mathbf{W}_d$

# Project onto plane defined by first two principal components
W2 = Vt[:2].T
X2D = X_centered @ W2

You can use Sklearn:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

The components_ attribute holds the transpose of $\mathbf{W}_d$: it contains one row for each of the first $d$ principal components.

Explained Variance Ratio

It’s useful to know the explained variance ratio (sklearn’s PCA has it: explained_variance_ratio_).

This ratio indicates the proportion of the dataset’s variance that lies along each principal component.

pca.explained_variance_ratio_
> array([0.7578477 , 0.15186921])

That says 76% of the dataset’s variance lies along the first PC and about 15% lies along the second PC – so ~9% for the third, which allows us to assume it doesn’t carry a lot of information.

Choosing the Right Number of Dimensions

Don’t arbitrarily choose a number of dimensions to reduce down to.

It would be better & simpler to choose the number of dimensions that add up to a sufficiently large proportion of the variance (e.g. 95%).
A simple way to do so:

pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

For data visualization you’d use 2-3.

Another option is to plot the explained variance as a function of the number of dimensions (i.e., plot the cumsum). There’ll usually be an elbow in the curve where the explained variance stops growing fast. You’d pick a dimensionality around that elbow, where the curve stops growing rapidly but the explained variance is still high.

And if you use Dimensionality Reduction as a preprocessing step for a Supervised Learning task, you could tune the number of dimensions like any other hyperparameter (e.g. randomized search, grid search, etc.).
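
As a hedged sketch of that last point (the estimator, the component range, and the X_train/y_train names are illustrative, and the dataset is assumed to have well over 80 features):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

clf = make_pipeline(PCA(random_state=42), RandomForestClassifier(random_state=42))
param_distrib = {"pca__n_components": np.arange(10, 80)}
rnd_search = RandomizedSearchCV(clf, param_distrib, n_iter=10, cv=3, random_state=42)
rnd_search.fit(X_train, y_train)
rnd_search.best_params_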

PCA for Compression

The dataset takes less space after Dimensionality Reduction. So it can be used for compression.

And you can decompress by applying the inverse transformation of the PCA projection. You won’t get the original data back – the projection lost a bit of information – but you’ll get close.
We can measure this with the mean squared distance between the original data and reconstructed data (compressed and then decompressed) – also called the reconstruction error.

PCA inverse transformation: $X_{\text{recovered}} = X_{d\text{-proj}} W_d^{\mathsf{T}}$
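
A quick sketch of compression and decompression with Scikit-Learn (assuming an MNIST-like X_train; the 154 components are illustrative):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)          # compress
X_recovered = pca.inverse_transform(X_reduced)  # decompress (approximately)

# reconstruction error: mean squared distance between original and recovered data
reconstruction_error = np.mean(np.sum((X_train - X_recovered) ** 2, axis=1))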

Randomized PCA

Randomized PCA quickly finds an approximation of the first principal components.

Computational complexity: $O(m \times d^2) + O(d^3)$, instead of $O(m \times n^2) + O(n^3)$ for the full SVD approach, meaning it’s much faster than full SVD when $d$ is much smaller than $n$.

Use it with Sklearn:

rnd_pca = PCA(n_components=154, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)

By default, svd_solver is "auto": Scikit-Learn uses the randomized algorithm if $\max(m, n) > 500$ and n_components is an int smaller than 80% of $\min(m, n)$. Otherwise, it’ll do the full SVD approach. To force full SVD, set svd_solver="full".

Incremental PCA

A problem with the other PCA approaches is that they need the entire training set in memory for the algorithm to run.

But Incremental PCA (IPCA) algorithms can be used, allowing you to split the dataset into mini-batches and feed them in one at a time.
Very useful for applying PCA online.

Sklearn: IncrementalPCA in sklearn.decomposition.
You can use either NumPy’s array_split() or NumPy’s memmap class (allows you to manipulate a large array stored in a binary file on the disk, as if it were in memory. It only loads the data as needed.).
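
A minimal sketch of the mini-batch approach (assuming a large X_train; the number of components and batches are illustrative):

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)  # feed one mini-batch at a time

X_reduced = inc_pca.transform(X_train)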

Random Projection

This approach projects data to a lower-dimensional space using a random linear projection.

Turns out, such a random projection is very likely to preserve distances well (demonstrated by William B. Johnson and Joram Lindenstrauss).

The more dimensions you drop, the more information is lost. So how many to drop?
Johnson and Lindenstrauss came up with an equation that determines the minimum number of dimensions to preserve to ensure (with high probability) that distances won’t change by more than a given tolerance.

Example
Given a dataset with $m = 5{,}000$ instances with $n = 20{,}000$ features each, and you don’t want the squared distance between any two instances to change by more than $\varepsilon = 10\%$, then you should project the data down to $d$ dimensions with $d \ge 4 \log(m) \,/\, (\tfrac{1}{2}\varepsilon^2 - \tfrac{1}{3}\varepsilon^3)$, which is 7,300 dimensions.

Sklearn has johnson_lindenstrauss_min_dim via sklearn.random_projection.

Now we generate a random matrix $P$ of shape $[d, n]$, where each item is sampled randomly from a Gaussian distribution with mean 0 and variance $1/d$, and use it to project a dataset from $n$ dimensions down to $d$.

from sklearn.random_projection import johnson_lindenstrauss_min_dim

m, eps = 5_000, 0.1
d = johnson_lindenstrauss_min_dim(m, eps=eps)

n = 20_000
np.random.seed(42)
P = np.random.randn(d, n) / np.sqrt(d)  # std dev = sqrt of variance

X = np.random.randn(m, n)  # fake dataset
X_reduced = X @ P.T

# or use this to do the same
from sklearn.random_projection import GaussianRandomProjection

gaussian_rnd_proj = GaussianRandomProjection(eps=eps, random_state=42)
X_reduced = gaussian_rnd_proj.fit_transform(X)

There’s also SparseRandomProjection in Sklearn. The difference is that the random matrix is sparse, which means it uses much less memory. It’s also faster. So it’s generally preferable, especially for large or sparse datasets.

To perform the inverse transformation: compute pseudoinverse of the components matrix with Scipy’s pinv() and multiply the reduced data by the transpose of the pseudo-inverse. This can be slow for large component matrices.
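
A sketch of that inverse transformation, reusing the gaussian_rnd_proj and X_reduced from above:

from scipy.linalg import pinv

components_pinv = pinv(gaussian_rnd_proj.components_)  # pseudo-inverse of the components matrix
X_recovered = X_reduced @ components_pinv.T            # approximate reconstruction of X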

Locally Linear Embedding (LLE)

This is a nonlinear Dimensionality Reduction (NLDR) technique. It’s a manifold technique that doesn’t rely on projections, unlike PCA and random projection.

It works by first measuring how each training instance linearly relates to its nearest neighbors, and then looks for a low-dimensional representation of the training set where these local relationships are best preserved.

For each training instance $\mathbf{x}^{(i)}$, the algorithm identifies its $k$ nearest neighbors (KNN), then tries to reconstruct $\mathbf{x}^{(i)}$ as a linear function of these neighbors. That is, it tries to find weights $w_{i,j}$ such that the squared distance between $\mathbf{x}^{(i)}$ and $\sum_{j=1}^{m} w_{i,j}\,\mathbf{x}^{(j)}$ is as small as possible, assuming $w_{i,j} = 0$ if $\mathbf{x}^{(j)}$ is not one of the $k$ nearest neighbors of $\mathbf{x}^{(i)}$.

So the first step is the constrained optimization problem below, where $W$ is the weight matrix containing all the weights $w_{i,j}$ (LLE step 1: linearly modeling local relationships):

$$\hat{W} = \underset{W}{\operatorname{argmin}} \sum_{i=1}^{m} \left\lVert \mathbf{x}^{(i)} - \sum_{j=1}^{m} w_{i,j}\,\mathbf{x}^{(j)} \right\rVert^2 \quad \text{subject to} \quad w_{i,j} = 0 \text{ if } \mathbf{x}^{(j)} \text{ is not a neighbor of } \mathbf{x}^{(i)}, \;\; \sum_{j=1}^{m} w_{i,j} = 1 \text{ for all } i$$

After this, the weight matrix encodes the local linear relationships between the training instances.

The second step is mapping these training instances into a $d$-dimensional space (where $d < n$) while preserving these local relationships as much as possible.

If $\mathbf{z}^{(i)}$ is the image of $\mathbf{x}^{(i)}$ in this $d$-dimensional space, we want the squared distance between $\mathbf{z}^{(i)}$ and $\sum_{j=1}^{m} \hat{w}_{i,j}\,\mathbf{z}^{(j)}$ to be as small as possible.
This leads to the unconstrained optimization problem below. It looks similar to the first step, but instead of keeping the instances fixed and finding optimal weights, we do the reverse: keep the weights fixed and find the optimal positions of all the instances’ images in the low-dimensional space. $Z$ is the matrix containing all $\mathbf{z}^{(i)}$.

LLE step 2 (reducing dimensionality while preserving relationships): $$\hat{Z} = \underset{Z}{\operatorname{argmin}} \sum_{i=1}^{m} \left\lVert \mathbf{z}^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\,\mathbf{z}^{(j)} \right\rVert^2$$

While this is more complex than the projection techniques, it’s able to construct much better low-dimensional representations, especially if the data is nonlinear.
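
A small sketch using Scikit-Learn’s LocallyLinearEmbedding on a toy nonlinear dataset (the Swiss roll); the hyperparameters are illustrative:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_swiss)  # 2D embedding that 'unrolls' the Swiss roll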

Other dimensionality reduction techniques

Good for the Dimensionality Reduction note.

  • Multidimensional scaling (MDS) reduces dimensionality while trying to preserve the distances between instances.
  • Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances between the instances.
    • Geodesic distance between two nodes is the number of nodes on the shortest path between them.
  • t-distributed stochastic neighbor embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart.
    • Mostly used for visualization (particularly visualizing clusters of instances in high-dimensional space).
  • Linear discriminant analysis (LDA) is a linear classification algorithm that learns the most discriminative axes between the classes during training. These are then used to define a hyperplane onto which to project the data. This approach keeps classes as far apart as possible, so it’s good to do before running another classification algorithm.

Chapter 9 Unsupervised Learning Techniques

Most applications of ML are based on Supervised Learning, but most data is unlabeled.

Yann LeCun: “if intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake.”

We are yet to realize the potential of unsupervised learning.

The most common Unsupervised Learning task is Dimensionality Reduction. We also have:

  • Clustering: group similar instances in clusters.
  • Anomaly Detection (aka. outlier detection): learn what “normal” data looks like and use that to detect abnormal instances.
  • Density estimation: estimating the probability density function (PDF) of the random process that generated the dataset.

Clustering Algorithms: k-means and DBSCAN

What a cluster is depends on context.
Different algorithms capture different kinds of clusters. Some look for instances centered around a particular point (called a centroid), while others look for continuous regions of densely packed instances. Some algorithms are hierarchical, meaning they look for clusters of clusters.

k-means

k-means is a simple algorithm that can do clustering quickly and efficiently.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# extra code – the exact arguments of make_blobs() are not important
blob_centers = np.array([[ 0.2,  2.3], [-1.5 ,  2.3], [-2.8,  1.8],
                         [-2.8,  2.8], [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers, cluster_std=blob_std,
                  random_state=7)

k = 5 # obvious from looking at the data
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)

Instances are assigned to one of the $k$ clusters.
An instance’s label is the index of the cluster – not to be confused with class labels in classification, which are used as targets.

You can find the centroids that the algorithm found: kmeans.cluster_centers_.
And you can assign new instances to the cluster whose centroid is closest with kmeans.predict.

Some mislabeling can occur, especially near the boundaries between clusters.
k-means doesn’t behave well when the clusters have different diameters – it only cares about the distance to the centroid when assigning an instance to a cluster.

Hard clustering is where we assign each instance to a single cluster.
Soft clustering is where we give each instance a score per cluster. This score could be the distance between the instance and the centroid or a similarity score (or affinity).
For example, you could use kmeans.transform, which measures the distance from each instance passed to the function to each centroid.
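
A quick sketch of both, reusing the fitted kmeans model (the new points are hypothetical):

import numpy as np

X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
kmeans.predict(X_new)             # hard clustering: index of the closest centroid
kmeans.transform(X_new).round(2)  # soft clustering: distance to each of the k centroids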

Centroid initialization methods

If you know roughly where the centroid should be, set init hyperparameter to an array of the centroids and n_init=1.

Can also run the algorithm multiple times with different random initializations and keep the best one. Use n_init. It uses the model’s inertia to measure which is best.

Inertia is the sum of squared distances between the instances and their closest centroids.
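
A minimal sketch of both ideas, reusing the blobs dataset X (the centroid guesses are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
kmeans_init = KMeans(n_clusters=5, init=good_init, n_init=1, random_state=42)
kmeans_init.fit(X)
kmeans_init.inertia_  # sum of squared distances to the closest centroids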

David Arthur and Sergei Vassilvitskii proposed k-means++ in 2006, where they introduced a smarter initialization step which tends to select centroids that are distant from one another, making the algorithm less likely to converge on a suboptimal solution. This comes at the cost of some additional computation, but it’s well worth it.
Here’s how k-means++ works:

  1. Take one centroid $c^{(1)}$, chosen uniformly at random from the dataset.
  2. Take a new centroid $c^{(i)}$, choosing an instance $\mathbf{x}^{(i)}$ with probability $D(\mathbf{x}^{(i)})^2 \,/\, \sum_{j=1}^{m} D(\mathbf{x}^{(j)})^2$, where $D(\mathbf{x}^{(i)})$ is the distance between the instance $\mathbf{x}^{(i)}$ and the closest centroid that was already chosen.
    1. This probability distribution ensures that instances farther away from already chosen centroids are much more likely to be selected as centroids.
  3. Repeat step 2 until all $k$ centroids have been chosen.

This is what KMeans in Scikit-Learn uses.

Accelerated k-means and mini-batch k-means

Accelerated k-means by Elkan doesn’t always accelerate, and can even slow down training.

Sculley proposed mini-batch k-means. Instead of using the full dataset, use mini-batches. For each iteration, just move the centroids slightly. This is a good bit faster and lets you cluster huge datasets that don’t fit in memory (can use memmap with MiniBatchKMeans).
The inertia is generally slightly worse for mini-batch k-means than regular.
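
A sketch with Scikit-Learn, reusing the blobs dataset X:

from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5, n_init=3, random_state=42)
minibatch_kmeans.fit(X)
minibatch_kmeans.inertia_  # typically slightly worse than regular k-means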

Finding the optimal number of clusters

You won’t always know the number of clusters just from looking at the data.
The result can be quite bad if you set $k$ to the wrong value.

Using inertia as a measure to pick $k$ isn’t good: it keeps getting lower as you increase $k$.
This is somewhat obvious: the more clusters there are, the closer each instance is to its closest centroid, which lowers inertia.

You can use the silhouette score, which is the mean silhouette coefficient over all the instances.
It’s equal to $(b - a) / \max(a, b)$, where $a$ is the mean distance to the other instances in the same cluster (the mean intra-cluster distance), and $b$ is the mean nearest-cluster distance (the mean distance to the instances of the next closest cluster, defined as the one that minimizes $b$, excluding the instance’s own cluster).
The silhouette coefficient varies between -1 and +1. Close to 1 means the instance is well inside its own cluster, and far from other clusters, close to 0 means it’s close to a cluster boundary, and close to -1 means it may have been assigned to the wrong cluster.
Calculate it with silhouette_score(X, kmeans.labels_), after you’ve trained the kmeans model.
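
For example, a sketch comparing the silhouette score for a few values of k (reusing the blobs dataset X):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {
    k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X))
    for k in range(2, 8)
}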

It’s even better to plot every instance’s silhouette coefficient, sorted by the clusters they’re assigned to, and by the value of the coefficient. This is a silhouette diagram.
The shape of each cluster’s silhouette is like a knife’s blade. The height of the shape indicates the number of instances in the cluster, and the width represents the sorted silhouette coefficients of the instances in the cluster – wider is better.
You can put a straight line on the diagram at the mean silhouette score value. When most instances in a cluster have a lower coefficient than the mean score, the cluster is bad. Visually, this is represented by the ‘knife blade’ not extending beyond the line.

Limits of k-means

  • Have to run it several times to avoid suboptimal solutions
  • Need to specify number of clusters
  • Doesn’t behave well when clusters have varying sizes, densities, or have nonspherical shapes
  • You need to scale input features before running k-means (can help make clusters spherical)

DBSCAN

This algorithm is based on local density estimation, which allows it to identify clusters of arbitrary shapes.

DBSCAN stands for density-based spatial clustering of applications with noise.
It defines clusters as continuous regions of high density.

It’s a simple but powerful algorithm that can identify any number of clusters of any shape. It’s also robust to outliers. But it can struggle if the density varies across clusters, or if there’s no sufficiently low-density region around some clusters. The computational complexity isn’t great, either: roughly $O(m^2)$, so it doesn’t scale well to large datasets.

Hierarchical DBSCAN (HDBSCAN) is usually better than DBSCAN at finding clusters of varying densities.

Adapted these notes from one of my Uni courses (DIS) and added details from the book:

  • DBSCAN—Density-Based Spatial Clustering of Applications with Noise
    • Clusters together points that are closely packed together, and separates points that are far apart.
    • Concepts
      • Eps — user-specified parameter, ε (eps), specifying the radius of the neighborhood considered around each object.
      • MinPts — minimum number of points in a cluster. Sometimes denoted min_samples.
      • Eps-neighborhood — the space within radius ε centered at an object.
      • Core point — a point whose Eps-neighborhood is dense enough (number of objects within ε ≥ MinPts).
      • Directly density-reachable — a point p is directly density-reachable from another point q if the distance between them is ≤ Eps and q is a core point.
      • Density-reachable — a point p is density-reachable from a point q if there is a path between them consisting only of core points, each directly density-reachable from the previous.
      • Density-connected — two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o.
    • Process
      • Select point at random from dataset
        • Get all the points within distance ε (eps) of this point. This region is the instance’s ε-neighborhood.
      • If Eps-neighborhood has at least MinPts objects
        • The instance is then considered a core instance.
          • Core instances are instances located in dense regions
        • Create new cluster and add points within neighborhood to it
          • So all instances in the neighborhood of a core instance belong to the same cluster.
          • The neighborhood can include other core instances, so a long sequence of neighboring core instances forms a cluster
        • If an instance isn’t a core instance and doesn’t have one in its neighborhood, it’s an anomaly
      • Repeat for each point in cluster, adding if points are found to be in it
      • Repeat for all points in dataset
        • Any point not in a dense region is noise
      • Repeat for all clusters
        • Clusters with less than MinPts points are noise.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.20, min_samples=5)
dbscan.fit(X)

# There isn't any predict function on DBSCAN.
# So use a classification algorithm of your choice to make predictions for new samples

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

Instances with a cluster index of -1 are considered anomalies.
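
As a usage sketch, the KNN trained above can then assign new instances to DBSCAN’s clusters (the new points are hypothetical; note that the KNN only saw core samples, so it never predicts -1 itself):

import numpy as np

X_new = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]])
knn.predict(X_new)        # predicted cluster index for each new instance
knn.predict_proba(X_new)  # estimated probability per cluster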

Other Clustering Algorithms

  • Agglomerative clustering
  • Balanced iterative reducing and clustering using hierarchies (BIRCH)
  • Mean-shift
  • Affinity propagation
  • Spectral clustering

Gaussian Mixtures

These can be used for density estimation, clustering, and anomaly detection.

A Gaussian mixture model (GMM) is a probabilistic model which assumes the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.
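
A minimal sketch with Scikit-Learn (reusing the blobs dataset X; the number of components is illustrative):

from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=5, n_init=10, random_state=42)
gm.fit(X)
gm.weights_, gm.means_, gm.covariances_  # estimated mixture parameters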

Other Algorithms for Anomaly and Novelty Detection

  • Fast-MCD (minimum covariance determinant)
  • Isolation forest
  • Local outlier factor (LOF)
  • One-class SVM
  • PCA & other Dimensionality Reduction techniques with inverse_transform()

Chapter 10 Introduction to Artificial Neural Networks with Keras

ANN stands for artificial neural network. ANNs are at the core of Deep Learning.

From Biological to Artificial Neurons

The Perceptron

Invented in 1957 by Frank Rosenblatt.
It’s based on a slightly different artificial neuron called a threshold logic unit (TLU), or linear threshold unit (LTU).

A TLU first computes a linear function of its inputs and then applies a step function to the result.

Inputs and outputs are numbers, and each input connection is associated with a weight.

A TLU would first compute a linear function of its inputs: $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w}^{\mathsf{T}} \mathbf{x} + b$

And then it applies a step activation function to the result: $h_{\mathbf{w},b}(\mathbf{x}) = \operatorname{step}(z)$.
The model parameters are the input weights $\mathbf{w}$ and the bias term $b$.

The most commonly used Activation Function in Perceptrons is the Heaviside step function. Can also use Sign.

Heaviside step activation function: $\operatorname{heaviside}(z) = 0$ if $z < 0$, and $1$ if $z \ge 0$

Sign activation function: $\operatorname{sgn}(z) = -1$ if $z < 0$, $0$ if $z = 0$, and $+1$ if $z > 0$

You generally add an extra bias feature (always equal to 1), typically represented by a bias neuron, which always outputs 1.

A perceptron has one or more TLUs organized in a single layer, where every TLU is connected to every input. That’s called a fully connected layer (or, dense layer) — all neurons in a layer are connected to every neuron in the previous layer.
The inputs would be the input layer. And after the layer of TLUs, we have the output layer, as they produce the final output.

You can compute the output of a fully connected layer with the following formula: $h_{W,b}(X) = \phi(XW + b)$

  • $X$ is the matrix of input features. It has one row per instance and one column per feature.
  • $W$ is the weight matrix, containing all the connection weights. It has one row per input and one column per neuron.
  • $b$ is the bias vector. It has one bias term per neuron.
  • $\phi$ is the activation function.

Computing $h_{W,b}(X)$ first multiplies $X$ by $W$, resulting in a matrix with one row per instance and one column per neuron. Then it adds the vector $b$ to every row of that matrix, adding each bias term to the corresponding neuron’s output, for every instance. Finally, $\phi$ is applied elementwise to each item in the resulting matrix.

You train perceptrons by updating their weights.
The learning rule for updating the connection weight between the $i$th input and the $j$th neuron is given by: $w_{i,j}^{(\text{next step})} = w_{i,j} + \eta\,(y_j - \hat{y}_j)\,x_i$

  • $w_{i,j}$: The current weight of the connection from the $i$th input to the $j$th output neuron.
  • $\eta$: The learning rate, a small positive constant that determines the size of the weight updates.
  • $y_j$: The target output for the $j$th neuron.
  • $\hat{y}_j$: The actual output of the $j$th neuron given the current inputs and weights.
  • $x_i$: The value of the $i$th input.

Perceptrons are fed one training instance at a time, and for each instance, it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction.

The way it reinforces this is through the term $\eta\,(y_j - \hat{y}_j)\,x_i$. The term $(y_j - \hat{y}_j)$ represents the error between the target output and the actual output of the $j$th neuron. Depending on this difference, the weight update can go in one of two directions:

  1. If the prediction is correct ($\hat{y}_j = y_j$), then $y_j - \hat{y}_j = 0$, and the weight does not change. This means that when a prediction is correct, the perceptron “leaves well enough alone.”
  2. If the prediction is wrong, there are two scenarios:
    • If the actual output $\hat{y}_j$ is less than the target $y_j$, meaning the perceptron was too conservative and did not activate when it should have, the error term is positive. This leads to an increase in the weight $w_{i,j}$, making it more likely that the neuron will activate the next time it encounters similar inputs.
    • Conversely, if $\hat{y}_j$ is greater than $y_j$, meaning the perceptron activated when it should not have, the error term is negative, leading to a decrease in $w_{i,j}$, making such erroneous activation less likely in the future.

# This wasn't part of the book. I just needed to implement it to better understand.
import logging

import numpy as np
from numpy.typing import NDArray

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())


def step_function(x: float):
    """
    Step function for binary classification.
    y_hat = 1 if x > 0 else 0
    """
    return 1 if x > 0 else 0


class Perceptron:
    def __init__(self, n_inputs: int, learning_rate=0.1):
        self.weights = np.random.rand(n_inputs + 1)  # random initialization is fine for this example
        self.learning_rate = learning_rate

    def predict(self, x: NDArray):
        """
        Calculate the weighted sum of inputs and weights and apply the step function to get the predicted output.
        z = Wx + b
        where W is the weights (matrix), x is the input (vec), and b is the bias (vec).
        """
        weighted_sum = np.dot(x, self.weights[1:]) + self.weights[0]
        return step_function(weighted_sum)

    def train(self, x: NDArray, y: NDArray):
        """
        1. Initialize weights with random values (see __init__)
        2. For each input, calculate the weighted sum of inputs and weights (predict)
        3. Apply the step function to the weighted sum to get the predicted output (predict)
        4. Calculate the error (y - y_hat)
        5. Update the weights using the formula:
        W^{next step} = W + η(y - y_hat)x
        6. Update the bias using the formula:
        b^{next step} = b + η(y - y_hat)
        7. Repeat until done
        """
        y_hat = self.predict(x)
        error = y - y_hat
        # update weights
        self.weights[1:] += self.learning_rate * error * x
        # update bias
        self.weights[0] += self.learning_rate * error


def generate_and_gate_data():
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])
    return X, y


def generate_or_gate_data():
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 1])
    return X, y


def generate_nand_gate_data():
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([1, 1, 1, 0])
    return X, y


def generate_nor_gate_data():
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([1, 0, 0, 0])
    return X, y


def run(X: NDArray, y: NDArray, epochs=10):
    logger.debug("Training Perceptron with X: %s and y: %s", X, y)
    perceptron = Perceptron(n_inputs=X.shape[1])
    logger.debug("Perceptron initialized\n\n")

    for i in range(epochs):
        logger.debug("EPOCH: %s", i)
        for inputs, target in zip(X, y):
            perceptron.train(inputs, target)

        logger.debug("Weights after epoch %s: %s\n\n", i, perceptron.weights)

    y_pred = np.array([perceptron.predict(inputs) for inputs in X])
    logger.info("Accuracy: %s", np.mean(y_pred == y))
    logger.info("Done")


if __name__ == "__main__":
    for gate in [generate_and_gate_data, generate_or_gate_data, generate_nand_gate_data, generate_nor_gate_data]:
        logger.info("Running perceptron for %s", gate.__name__)
        X, y = gate()
        run(X, y, 20)
        logger.info("\n\n")

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Generate binary classification dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    logger.info("Running perceptron for binary classification")
    run(X_train, y_train, 20)

# This perceptron gets accuracy = 100% for all the gates, and 79.3% for the random set.

In practice, you’d want to use Scikit-Learn’s Perceptron.

The decision boundary of each output neuron is linear, so perceptrons can’t learn complex patterns (data must be linearly separable). But if the data is linearly separable, the algorithm converges to a solution — this is called the perceptron convergence theorem, and was demonstrated by Rosenblatt.

However, perceptrons aren’t capable of solving some trivial problems – like the XOR classification problem.

Try running the perceptron above with generate_xor_gate_data. You’ll get an accuracy of 50% — just as good as random guessing.

def generate_xor_gate_data():
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])
    return X, y

These kinds of limitations can be solved by stacking multiple perceptrons.
The resulting ANN is called a Multilayer Perceptron (MLP).
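
For instance (a sketch, not from the book; the hyperparameters are illustrative), Scikit-Learn’s MLPClassifier with a single small hidden layer can fit XOR:

from sklearn.neural_network import MLPClassifier

X_xor, y_xor = generate_xor_gate_data()
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=42)
mlp.fit(X_xor, y_xor)
mlp.score(X_xor, y_xor)  # should typically reach 1.0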

The Multilayer Perceptron and Backpropagation

MLP stands for multilayer perceptron. An MLP has one input layer, one or more layers of TLUs called hidden layers, and a final layer of TLUs called the output layer.

When ANNs contain a deep stack of hidden layers, it’s a deep neural network (DNN).

Some history on training
Early 1960s, researchers looked at using gradient descent to train NNs.
However, this requires computing the gradients of the model’s error w.r.t the model parameters – it wasn’t clear how to do so efficiently with such a complex model at the time.
But in 1970, Seppo Linnainmaa (in his master’s thesis!) presented a technique to compute all the gradients automatically and efficiently: reverse-mode automatic differentiation.
In just a forward and backward pass through the network, it can compute the gradients of the NN’s error w.r.t every model parameter.
That is, it can find how each connection weight and bias should be tweaked to reduce error.
You use these gradients for the Gradient Descent step.
By repeatedly computing the gradients automatically and taking a GD step, the NN’s error gradually drops until it reaches a minimum.
The combination of reverse-mode autodiff and GD is called Backpropagation.

You can use Back Propagation on all kinds of computational graphs, not just neural networks. It took a while before it was actually used to train NNs.

In 1985, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper analyzing how backprop let NNs learn useful internal representations. This popularized backprop in the field.

How backpropagation works
It does one mini-batch at a time (e.g. with 32 instances each) and goes through the full training set multiple times (each one called an epoch).

Forward pass
Each batch enters through the input layer.
Then it computes the output of all neurons in the first hidden layer, for each instance in the mini-batch.
The result of that is passed to the next layer, where it computes the output of that layer, and passes it on, and so on, until reaching the final layer (the output layer).
You can think of this like computing predictions in the network, but you keep the intermediate results, as they’re needed for the backward pass.

Then it measures the network’s output error (using a loss function).
Then compute how much each output bias and each connection to the output layer contributed to the error (using the chain rule from Calculus).

Backward pass
Then it measures how much each of these error contributions came from each connection in the layer below (using the chain rule), working backward until it reaches the input layer.
The reverse pass it’s doing here measures the error gradient across all connection weights and biases in the network by propagating the error gradient backward throughout the network.

Finally it does a GD step to tweak all connection weights in the network using the error gradients it just computed.

Note: Need to do random initialization, as otherwise the model would treat each neuron the same, and they remain identical, which would be as good as just having one neuron per layer.

So: Make predictions for a minibatch while keeping the intermediate results (forward pass) → measure error → go through each layer in reverse to measure the error contribution from each parameter (reverse pass) → tweak connection weights and biases to reduce error (gradient descent step).

But for backprop to work, Rumelhart et al. replaced the step function in MLPs with the logistic function (sigmoid function).
This was necessary because the step function only has flat segments, meaning there are no gradients to work with, which isn’t ideal as GD can’t move on a flat surface.
The Sigmoid Activation Function has a well-defined nonzero Derivative everywhere, so GD can make progress at every step.
Tanh and ReLU also work well – with ReLU seemingly being the best of all.

We need activation functions.
If you chain several linear transformations, you get a linear transformation.
And if you don’t add nonlinearity between layers, then every deep stack of layers is equivalent to a single layer, which can’t solve very complex problems.
But a large enough DNN with nonlinear activations can theoretically approximate any continuous function (and that’s really what we’re doing in NNs: approximating continuous functions).

Regression MLPs

To predict a single value, use a single output neuron. The output is the predicted value.
For multivariate regression, you need one output neuron per output dimension.

MLPRegressor for univariate analysis

from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target, random_state=42
)

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42
)

mlp_reg = MLPRegressor(hidden_layer_sizes=[50, 50, 50], random_state=42)
pipeline = make_pipeline(StandardScaler(), mlp_reg)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_valid)
rmse = mean_squared_error(y_valid, y_pred, squared=False)

This doesn’t use any activation function for the output layer.
If you want the output to always be positive, use ReLU there (or softplus).
And if you want the output to fall within a given range, use sigmoid or Hyperbolic Tangent, and scale targets to the appropriate range.
But MLPRegressor doesn’t support activation functions in the output layer – for that, you’d likely use PyTorch, Keras, or TensorFlow. In practice, though, you don’t really use activation functions in the output layer for regression tasks anyway.

In regression tasks, the goal is to predict a continuous numerical value. Using an activation function in the output layer of a regression neural network can limit the range of values the network can predict.
For example, if you use a sigmoid activation function in the output layer, the predicted values will be squashed between 0 and 1. This is suitable for binary classification tasks but not for regression, where the target values can be any real number.
Instead, for regression tasks, the output layer typically has a single neuron without any activation function applied to it. This allows the network to output any real number as the predicted value.

You may want to use Huber loss if there are (many) outliers in your data.

Classification MLPs

For binary classification, you just need a single output neuron with a Sigmoid Activation Function: the output is between 0 and 1, which can be interpreted as the estimated probability of the positive class.

You can also use MLPs for Multilabel Classification.
Then you’d need one output neuron per label, all using the Sigmoid Activation Function.

For Multiclass Classification, you’ll need one output neuron for each class, and you should use the Softmax activation function for the whole output layer. This ensures all estimated probabilities are between 0 and 1, and that they add up to 1.

For the loss function, it’s generally a good idea to use cross-entropy (aka. x-entropy, or Log loss).

In Scikit-Learn, we have MLPClassifier, which is like MLPRegressor, except it minimizes cross-entropy rather than MSE.
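
A quick sketch (the dataset and hyperparameters are illustrative, not from the book’s example here):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(X_iris, y_iris, random_state=42)
mlp_clf = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=[50], max_iter=1000, random_state=42))
mlp_clf.fit(X_train_i, y_train_i)
mlp_clf.score(X_test_i, y_test_i)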

Implementing MLPs with Keras

Building an Image Classifier Using the Sequential API

You can use different initialization methods in Keras by setting kernel_initializer. Kernel is another name for the matrix of connection weights.

When using the SGD optimizer, it’s important to tune the learning rate.

import tensorflow as tf

fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
# Already shuffled and split: 60k training, 10k test
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
# Use 5000 samples for validation
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]

# The dataset is represented as a 28x28 array, not a 1D array of size 784.
# And pixel intensities are represented as integers (0-255) rather than floats (0.0 to 255.0).

X_train.shape
> (55000, 28, 28)

X_train.dtype
> dtype('uint8')

# Scale pixel intensities down to the 0-1 range and convert them to floats
X_train, X_valid, X_test = X_train / 255.0, X_valid / 255.0, X_test / 255.0

class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

# random seed for reproducibility: random weights of hidden layers and output layer will be the same
tf.random.set_seed(42)

# Create Sequential model: simple NN with single stack of layers connected sequentially. Called the sequential API.
model = tf.keras.Sequential([
    # Input layer defining the input shape (28x28)
    # Flatten the input images (28x28) to a 1D array (784)
    tf.keras.layers.Flatten(input_shape=[28, 28]), # if we don't specify input_shape, it will be inferred the first time we pass some data through the model or call the build() method
    # Dense hidden layer with 300 neurons, using ReLU activation function
    # Manages its own weights and biases - weight matrix contains all connection between neurons and their inputs, and has bias vector w. 1 bias per neuron
    # The number of parameters is 784*300 + 300 = 235500
    tf.keras.layers.Dense(300, activation="relu"),
    # Dense hidden layer with 100 neurons, using ReLU activation function
    tf.keras.layers.Dense(100, activation="relu"),
    # Dense output layer with 10 neurons (1 per class), using softmax activation function
    tf.keras.layers.Dense(10, activation="softmax")
])

model.summary() # if run in notebook, this gives a nice summary
tf.keras.utils.plot_model(model, "my_fashion_mnist_model.png", show_shapes=True) # visualize input/output dimensions across layers

# must compile the model to specify loss function & optimizer
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="sgd",
    metrics=["accuracy"]
)

history = model.fit(
    X_train, y_train, epochs=30, 
    validation_data=(X_valid, y_valid)
)

We use “sparse_categorical_crossentropy” loss because we have sparse labels (i.e., for each instance, there is just a target class index, from 0 to 9 in this case), and the classes are exclusive.
If instead we had one target probability per class for each instance (such as one-hot vectors, e.g. [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.] to represent class 3), then we would need to use the “categorical_crossentropy” loss instead.
If we were doing binary classification (with one or more binary labels), then we would use the “sigmoid” (i.e., logistic) activation function in the output layer instead of the “softmax” activation function, and we would use the “binary_crossentropy” loss.

Learning curves are pretty easy to plot:

import matplotlib.pyplot as plt
import pandas as pd

pd.DataFrame(history.history).plot(
    figsize=(8, 5), xlim=[0, 29], ylim=[0,1], grid=True,
    xlabel="Epoch", title="Learning curves", style=["r--", "r--.", "b-", "b-*"]
)

plt.show()

If model performance isn’t great, try tuning learning rate. Or try another optimizer.
And if that doesn’t help, try tuning model hyperparameters like number of layers, number of neurons per layer, types of activation functions used for each hidden layer, or modifying e.g. batch size.

Evaluation

model.evaluate(X_test, y_test)
# First element is loss, second is accuracy (88.7%)
> [0.34440889954566956, 0.8873999714851379]

Prediction

# We're just using first 3 instances from the test set
X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)
> array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.99],
         [0.  , 0.  , 1.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
         [0.  , 1.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ]],
        dtype=float32)

y_pred = np.argmax(model.predict(X_new), axis=-1)
y_pred
> array([9, 2, 1])

np.array(class_names)[y_pred]
> array(['Ankle boot', 'Pullover', 'Trouser'], dtype='<U11')

Building a Regression MLP using the Sequential API

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

tf.random.set_seed(42)

# Normalization layer: scales input features to have a mean of 0 and a standard deviation of 1
norm_layer = tf.keras.layers.Normalization(input_shape=X_train.shape[1:])

model = tf.keras.Sequential([
    norm_layer,
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1)
])

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])

# Learns the input statistics (mean and standard deviation)
norm_layer.adapt(X_train)

history = model.fit(X_train, y_train, epochs=30, validation_data=(X_valid, y_valid))

mse_test, rmse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]
y_pred = model.predict(X_new)
print(y_pred)

mse_test, rmse_test
> (0.2832980453968048, 0.5322574973106384)

Building Complex Models using the Functional API

Gives example of Wide & Deep neural network (Heng-Tze Cheng et al.).
The architecture connects all or part of the inputs directly to the output layer.
So the NN can learn both deep patterns (using deep path) and simple rules (using short path).
A regular MLP would force all data through the full stack of layers, thereby distorting simple patterns from the transformations throughout.

# create layers
normalization_layer = tf.keras.layers.Normalization()  # used to standardize input features
hidden_layer1 = tf.keras.layers.Dense(30, activation="relu")
hidden_layer2 = tf.keras.layers.Dense(30, activation="relu")
concat_layer = tf.keras.layers.Concatenate()
output_layer = tf.keras.layers.Dense(1)

# use the layers: from input to output
input_ = tf.keras.layers.Input(shape=X_train.shape[1:]) # input object - specifies the kind of input the model will get
normalized = normalization_layer(input_)
hidden1 = hidden_layer1(normalized)
hidden2 = hidden_layer2(hidden1)
concat = concat_layer([normalized, hidden2]) # concatenate the (normalized) input and the output of the second hidden layer
output = output_layer(concat)

model = tf.keras.Model(inputs=[input_], outputs=[output])
# Can use `model.summary()` here

# Train & evaluate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
normalization_layer.adapt(X_train)
history = model.fit(
    X_train, y_train, epochs=20, validation_data=(X_valid, y_valid)
)
mse_test = model.evaluate(X_test, y_test)
y_pred = model.predict(X_new)

The functional API — as seen above — doesn’t compute any actual output, it just defines the model’s architecture. It’s like defining a function: it doesn’t do anything until you call it.

You can also send a subset of the features through the wide path and a different subset (potentially overlapping) through the deep path.
This is done by having multiple inputs.

In this example, we will send five features through the wide path (features 0 to 4), and six features through the deep path (features 2 to 7).
The inputs are named because we have two input layers.

input_wide = tf.keras.layers.Input(shape=[5], name="input_wide") # features 0-4
input_deep = tf.keras.layers.Input(shape=[6], name="input_deep") # features 2-7
norm_layer_wide = tf.keras.layers.Normalization()
norm_layer_deep = tf.keras.layers.Normalization()
norm_wide = norm_layer_wide(input_wide)
norm_deep = norm_layer_deep(input_deep)
hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
concat = tf.keras.layers.concatenate([norm_wide, hidden2])
output = tf.keras.layers.Dense(1)(concat)

model = tf.keras.Model(inputs=[input_wide, input_deep], outputs=[output])

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])

X_train_wide, X_train_deep = X_train[:, :5], X_train[:, 2:]
X_valid_wide, X_valid_deep = X_valid[:, :5], X_valid[:, 2:]
X_test_wide, X_test_deep = X_test[:, :5], X_test[:, 2:]
X_new_wide, X_new_deep = X_test_wide[:3], X_test_deep[:3]

norm_layer_wide.adapt(X_train_wide)
norm_layer_deep.adapt(X_train_deep)
history = model.fit(
    {"input_wide": X_train_wide, "input_deep": X_train_deep}, 
    y_train, epochs=20, 
    validation_data=((X_valid_wide, X_valid_deep), y_valid)
)
mse_test = model.evaluate((X_test_wide, X_test_deep), y_test)
y_pred = model.predict((X_new_wide, X_new_deep))

You may also want to have multiple outputs. For example

  • for tasks that require e.g. both regression and classification, like locating and classifying objects in images
  • for performing multiple, independent, tasks based on the same data with the same neural network. This can actually lead to better results in general because the NN will learn features that are useful across tasks
  • for regularization, e.g. adding an auxiliary output in the NN architecture to ensure the underlying part of the network learns something useful on its own, instead of relying on the rest of the network

# Same as before
input_wide = tf.keras.layers.Input(shape=[5], name="input_wide") # features 0-4
input_deep = tf.keras.layers.Input(shape=[6], name="input_deep") # features 2-7
norm_layer_wide = tf.keras.layers.Normalization()
norm_layer_deep = tf.keras.layers.Normalization()
norm_wide = norm_layer_wide(input_wide)
norm_deep = norm_layer_deep(input_deep)
hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
concat = tf.keras.layers.concatenate([norm_wide, hidden2])

# Add names & new aux_output
output = tf.keras.layers.Dense(1, name="output")(concat)
aux_output = tf.keras.layers.Dense(1, name="aux_output")(hidden2)

model = tf.keras.Model(inputs=[input_wide, input_deep], outputs=[output, aux_output])

# Training & eval
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(
    loss={"output": "mse", "aux_output": "mse"},
    loss_weights={"output": 0.9, "aux_output": 0.1},
    optimizer=optimizer, metrics=["RootMeanSquaredError"]
)

X_train_wide, X_train_deep = X_train[:, :5], X_train[:, 2:]
X_valid_wide, X_valid_deep = X_valid[:, :5], X_valid[:, 2:]
X_test_wide, X_test_deep = X_test[:, :5], X_test[:, 2:]
X_new_wide, X_new_deep = X_test_wide[:3], X_test_deep[:3]

norm_layer_wide.adapt(X_train_wide)
norm_layer_deep.adapt(X_train_deep)
history = model.fit(
    {"input_wide": X_train_wide, "input_deep": X_train_deep},
    {"output": y_train, "aux_output": y_train}, epochs=20,
    validation_data=(
        {"input_wide": X_valid_wide, "input_deep": X_valid_deep},
        {"output": y_valid, "aux_output": y_valid}
    )
)
eval_results = model.evaluate(
    {"input_wide": X_test_wide, "input_deep": X_test_deep},
    {"output": y_test, "aux_output": y_test}
)
weighted_sum_of_losses, main_loss, aux_loss, main_rmse, aux_rmse = eval_results
y_pred_tuple = model.predict({"input_wide": X_new_wide, "input_deep": X_new_deep})
y_pred = dict(zip(model.output_names, y_pred_tuple))

Using the Subclassing API to Build Dynamic Models

The sequential and functional APIs are declarative, which has some nice benefits. They are easy to save and clone, it’s easy to see and analyze their structure, and the framework can infer shapes and check types early (before passing data through!).
Since the model is a static graph of layers, it’s easy to debug.

But some models aren’t just static. They involve loops, varying shapes, conditional branching, and other dynamic behavior.

class WideAndDeepModel(tf.keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)  # handles standard args (e.g., name)
        self.norm_layer_wide = tf.keras.layers.Normalization()
        self.norm_layer_deep = tf.keras.layers.Normalization()
        self.hidden1 = tf.keras.layers.Dense(units, activation=activation)
        self.hidden2 = tf.keras.layers.Dense(units, activation=activation)
        self.main_output = tf.keras.layers.Dense(1)
        self.aux_output = tf.keras.layers.Dense(1)
        
    def call(self, inputs):
        input_wide, input_deep = inputs
        norm_wide = self.norm_layer_wide(input_wide)
        norm_deep = self.norm_layer_deep(input_deep)
        hidden1 = self.hidden1(norm_deep)
        hidden2 = self.hidden2(hidden1)
        concat = tf.keras.layers.concatenate([norm_wide, hidden2])
        main_output = self.main_output(concat)
        aux_output = self.aux_output(hidden2)
        return main_output, aux_output

model = WideAndDeepModel(30, activation="relu", name="wide_and_deep")

The code above shows the exact same Wide and Deep model as before, except using the subclassing API.
The idea is that you can do mostly anything you want in call.
You could use branching logic, loops, low-level TensorFlow operations, and so on.

This also means your model architecture is hidden in call, so it cannot easily be inspected or cloned. And summary() only gives a list of layers, not how they are connected. And Keras can’t check types nor shapes ahead of time (because of the freedom you have within call).

Keras models can also be used just like regular layers, so you can combine them to build complex architectures.

Saving and Restoring a Model

model.save("name", save_format="tf")

save_format="tf" saves the model with TensorFlow’s Saved-Model format.

You can load the model and use it again:

model = tf.keras.models.load_model("my_keras_model")
y_pred_main, y_pred_aux = model.predict((X_new_wide, X_new_deep))

Use save_weights() and load_weights() to save and load only parameter values. This is faster and uses less disk space, so it’s good for model checkpoints during training.
If you’re training a large model, you should save checkpoints regularly (in case of crashes, etc.). To do so, you use callbacks.
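
A minimal sketch (the file name is made up):

model.save_weights("my_weights")
# ...later, after rebuilding the same architecture:
model.load_weights("my_weights")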

Using Callbacks

fit() accepts a callbacks argument letting you specify a list of objects that Keras calls before and after training, before and after each epoch, and before and after processing each batch.

# save checkpoints of your model at regular intervals
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "my_checkpoints", save_weights_only=True,
    # if using a validation set, only save when performance on it is the best so far
    save_best_only=True
)

# interrupt training when no progress on validation set for (patience) epochs
early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    patience=10,
    # roll back to the best model at the end of training
    restore_best_weights=True
)

history = model.fit([...], callbacks=[checkpoint_cb, early_stopping_cb])

You can write custom callbacks by creating a class that extends tf.keras.callbacks.Callback.
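
For example, a small custom callback (a sketch; it assumes fit() was given validation_data) that prints the ratio between validation and training loss at the end of each epoch:

class PrintValTrainRatioCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        ratio = logs["val_loss"] / logs["loss"]
        print(f"Epoch={epoch}, val/train loss ratio={ratio:.2f}")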

Using TensorBoard for Visualization

This is a great tool for visualization, comparison, analysis, and even profiling.

Your program needs to output the data you want to visualize to binary log files called “event files”. Each data record is a ‘summary’. The TensorBoard server monitors the log directory and picks up the changes so it can visualize them.
Generally you want to point the TB server to a root log dir and have your program write to a different subdir each time it runs. This allows TensorBoard to visualize and compare multiple runs.

Here’s one way to make the log folders & define a TensorBoard callback:

from pathlib import Path
from time import strftime

def get_run_logdir(root_logdir="my_logs"):
    return Path(root_logdir) / strftime("run_%Y_%m_%d-%H_%M_%S")

run_logdir = get_run_logdir()

tensorboard_cb = tf.keras.callbacks.TensorBoard(
    run_logdir, 
    # profile the network batches 100 and 200 during the first epoch
    # 100 & 200 to allow for 'warm up'
    profile_batch=(100, 200)
)
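
Then pass the callback to fit() so Keras writes the event files to run_logdir (a sketch, assuming a simple single-input model and the usual X_train/X_valid splits):

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid),
                    callbacks=[tensorboard_cb])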

I couldn’t get the VS Code TensorBoard integration to work, so I ran tensorboard serve --logdir my_logs. Then it’s accessible from localhost:6006.

Fine-Tuning Neural Network Hyperparameters

With so many hyperparameters that you can modify, how do you know which combination is best for your task?

You could convert your Keras model to a Scikit-Learn estimator and use GridSearchCV or RandomizedSearchCV by using the KerasRegressor and KerasClassifier from the SciKeras library.
But even better: use the Keras Tuner (keras-tuner) library for hyperparameter tuning of Keras models.

import keras_tuner as kt

# Builds and compiles an MLP to classify Fashion MNIST images
def build_model(hp: kt.HyperParameters):
    n_hidden = hp.Int("n_hidden", min_value=0, max_value=8, default=2)
    n_neurons = hp.Int("n_neurons", min_value=16, max_value=256)
    learning_rate = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2, sampling="log")
    optimizer = hp.Choice("optimizer", ["sgd", "adam"])
    if optimizer == "sgd":
        optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    else:
        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten())
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
    return model

The above code uses the hp object to define hyperparameters. For each one (e.g. n_hidden), it checks whether a hyperparameter with that name is already registered: if so, it returns its current value; otherwise, it registers the hyperparameter with the given range and returns a value from that range.

Then you can use this code to do a random search:

random_search_tuner = kt.RandomSearch(
    build_model, objective="val_accuracy", max_trials=5,
    overwrite=True, directory="my_fashion_mnist",
    project_name="my_rnd_search", seed=42
)

random_search_tuner.search(
    X_train, y_train, epochs=10, validation_data=(X_valid, y_valid)
)

top_3_best_models = random_search_tuner.get_best_models(num_models=3)
best_model = top_3_best_models[0]

top3_best_hps = random_search_tuner.get_best_hyperparameters(num_trials=3)
best_hps = top3_best_hps[0]

# Each tuner is guided by an 'oracle'
best_trial = random_search_tuner.oracle.get_best_trials(num_trials=1)[0]
best_trial.summary() # gives summary of hyperparameters

best_trial.metrics.get_last_value("val_accuracy") # get metrics

# Continue training
best_model.fit(X_train_full, y_train_full, epochs=10)
test_loss, test_accuracy = best_model.evaluate(X_test, y_test)

If you want to fine-tune data preprocessing hyperparameters or model.fit() parameters, you’ll want to subclass kt.HyperModel instead of writing a build_model() function.
Your class should have a build() and fit() method: build() does the same as build_model(), and fit() takes a HyperParameters object and a compiled model as argument, and all the model.fit() arguments, and fits the model & returns the History object.

For example:

class MyClassificationHyperModel(kt.HyperModel):
    def build(self, hp):
        return build_model(hp)
    
    def fit(self, hp, model, X, y, **kwargs):
        if hp.Boolean("normalize"):
            norm_layer = tf.keras.layers.Normalization()
            X = norm_layer(X)
        return model.fit(X, y, **kwargs)


hyperband_tuner = kt.Hyperband(
    MyClassificationHyperModel(), objective="val_accuracy", max_epochs=10,
    factor=3, directory="my_fashion_mnist", project_name="my_hyperband",
    seed=42
)

root_dir = Path(hyperband_tuner.project_dir) / "tensorboard"
tensorboard_cb = tf.keras.callbacks.TensorBoard(root_dir)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=2)
hyperband_tuner.search(
    X_train, y_train, epochs=10, validation_data=(X_valid, y_valid),
    callbacks=[tensorboard_cb, early_stopping_cb]
)

# Run `tensorboard serve --logdir my_fashion_mnist/my_hyperband/tensorboard/`

The HyperBand tuner is similar to HalvingRandomSearchCV: it starts by training many models for a few epochs, then eliminates the worst models, and keeps the top 1/factor ones, repeating until 1 is left.

The kt.BayesianOptimization tuner is an alternative to HyperBand.

The term AutoML refers to any system that takes care of a large part of the ML workflow.

Hyperparameter Tuning is an active area of research. Recently, evolutionary algorithms seem to have shown promise.

Number of Hidden Layers

For many problems, start with one or two hidden layers.
An MLP with just one hidden layer can theoretically model even the most complex functions, given enough neurons.
For example, you can reach above 97% accuracy on MNIST with one hidden layer and a few hundred neurons, and about 98% using two hidden layers.

On more complex problems, ramp up the number of hidden layers until you start overfitting the training set.
Very complex tasks (like large image classification, speech recognition) typically require lots of layers (can even be hundreds, but not fully connected).
When faced with such complex tasks, it’s usually easier to reuse parts of a pretrained state-of-the-art neural network that does well on a similar task to yours.

For complex problems, deep NNs have higher parameter efficiency than shallow ones.
They can model complex functions using exponentially fewer neurons, thereby getting better performance with the same amount of training data.
Deep NNs take advantage of the common hierarchical structure of real-world data because the lower hidden layers model low-level structures (e.g. line segments of shapes, orientations), while intermediate hidden layers combine the low-level structures to model intermediate-level structures (e.g. squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model the high-level structures (e.g. faces).

For example, let’s look at what a deep neural network for image classification may learn in its layers.

  • Lower hidden layers: detect edges, simple patterns like line segments, edges at various orientations, basic textures. E.g. a neuron might activate strongly in response to vertical lines in a portion of an image, while another might respond to horizontal lines.
  • Intermediate hidden layers: starts to combine the simple patterns detected by the lower layers into more complex structures, e.g. squares, circles, parts of objects like a wheel of a car, the petal of a flower.
  • Highest hidden layers: aggregate the intermediate structures into even more complex features, e.g. entire objects or complex scenes. They might combine features like eyes, noses, and mouths into a ‘face’.
  • Output layer: takes the high-level structures from the highest hidden layers and combines them into a final prediction or decision. Like identifying a face as belonging to a particular person, in the classification case.

This hierarchical architecture (e.g. a mouth, a nose, and two ears make up a face) helps DNNs converge faster to a good solution and improves their ability to generalize to new datasets.
If you have a DNN that can recognize faces, but want to recognize hairstyles, you can kickstart training by reusing the lower layers of the first network. You’d initialize the weights and biases of the first few layers of the new neural net to those of the previous neural net. Then your new model wouldn’t have to learn all the low-level structures from scratch. This is transfer learning.

Number of Neurons per Hidden Layer

The number of neurons in the input and output layers depends on the type of input and output your task requires.
For example, I’ve built a neural network that takes 6144 spectral ranges as its inputs and outputs 8 oxide concentration percentages.

It’s common to size the hidden layers to form a pyramid: fewer and fewer at each layer.
The idea there is that many low-level features can coalesce into far fewer high-level features (recalling the hierarchical analogy made in the previous section).
This has been largely abandoned, though. Using the same number of neurons in all hidden layers seems to perform as well, or better (and there’d be fewer hyperparameters to tune).
However, depending on the dataset, making the first hidden layer larger than the others can help.

Like with the number of layers, you can try increasing the number of neurons until the network starts overfitting.
Or start large and use early stopping and other regularization techniques to prevent too much overfitting.
This intuitively seems better: if a layer doesn’t have enough neurons, it can’t preserve all useful information from the inputs, so some gets lost. That information won’t get recovered in subsequent layers.

Generally: prefer increasing the number of layers over increasing the number of neurons per layer – you’ll get more bang for your buck.

Learning Rate, Batch Size, and Other Hyperparameters

Learning rate
Generally, the optimal rate is half of the maximum rate (the rate above which the training algorithm diverges).

One way to find a good one is to start with a very low rate (e.g. $10^{-5}$), train the model for a few hundred iterations, and gradually increase the rate up to a very large value (e.g. 10). Do so by multiplying the rate by a constant factor at each iteration (e.g. $(10 / 10^{-5})^{1/500}$ when doing 500 iterations).
Plot the loss as a function of the learning rate. It should drop at first, but when the learning rate gets too high, the loss shoots back up. The optimal rate is a bit lower than the point where the loss starts to climb (typically about 10x lower).
Then reinitialize your model and train it with that learning rate.
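Here’s a rough sketch of that idea as a Keras callback (not the book’s exact code; the factor value and the use of tf.keras.backend.get_value/set_value are assumptions that work in TF 2.x):

import tensorflow as tf

class ExponentialLearningRate(tf.keras.callbacks.Callback):
    # Multiplies the learning rate by a constant factor after each batch
    # and records (learning rate, loss) pairs for plotting.
    def __init__(self, factor):
        super().__init__()
        self.factor = factor  # e.g. (10 / 1e-5) ** (1 / 500) to go from 1e-5 to 10 over 500 iterations
        self.rates, self.losses = [], []

    def on_train_batch_end(self, batch, logs=None):
        lr = tf.keras.backend.get_value(self.model.optimizer.learning_rate)
        self.rates.append(lr)
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, lr * self.factor)

Compile the model with a tiny learning rate (e.g. SGD(learning_rate=1e-5)), fit for one epoch with this callback, then plot losses against rates on a log scale and pick a rate roughly 10x below the point where the loss shoots up.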

Note that optimal learning rate depends on other hyperparameters (esp. batch size). If you modify other hyperparameters, modify the learning rate as well.

Optimizer
Pick something better than Mini-Batch Gradient Descent. Book explores more later.

Batch size
This will have a significant impact on your model’s performance and training time.

Large batch sizes work well with GPUs – they can process them efficiently.
So many researchers recommend using the max batch size you can fit in GPU RAM.

But large batch sizes often lead to training instabilities.

Some think small batches (2–32) are best, while others say very large ones (up to 8,192) work well when combined with techniques like learning rate warmup (starting small, then ramping up).

You can start by trying a large batch size with learning rate warmup, and fall back to a small batch size if training is unstable or the final performance is disappointing.

Activation function
Generally, ReLU is a good default for hidden layers.

Number of iterations
Just use Early Stopping.

Chapter 11 Training Deep Neural Networks

The Vanishing/Exploding Gradient Problems

Problem: gradients grow ever smaller or larger when flowing backward through a deep neural net; this makes the lower layers hard to train.
When you do backpropagation, the gradients usually get smaller and smaller as the algorithm goes down to the lower layers, meaning their connection weights are left basically unchanged.
This is the vanishing gradients problem.
Sometimes, the opposite happens: the gradients get bigger and bigger, until layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem.

That is, deep neural networks suffer from unstable gradients, which means different layers may learn at very different speeds.
(This is actually one reason why DNNs were abandoned in the early 2000s)

Glorot and Bengio found some suspects as to why this was happening.
This included the combination of the sigmoid activation function and the weight initialization scheme that was popular at the time (a normal distribution with mean 0 and standard deviation 1).
They showed that with this combination, the variance of each layer’s outputs is much greater than the variance of its inputs, so the variance keeps increasing layer after layer until the activation function saturates in the top layers.
You can see this with the sigmoid activation function: for large inputs (both positive and negative), the function saturates at 0 or 1, with derivatives close to 0, so there isn’t much gradient to propagate back through the network.

Glorot and He Initialization

Glorot and Bengio proposed a way to alleviate the unstable gradients problem.
Signal should flow properly in both forward and reverse directions during Backpropagation.
The signal shouldn’t die out, nor explode and saturate.
So we need the variance of the outputs of each layer to be equal to the variance of its inputs. And the gradients should have equal variance before and after flowing through a layer in reverse-mode.

Here’s a nice analogy:

if you set a microphone amplifier’s knob too close to zero, people won’t hear your voice, but if you set it too close to the max, your voice will be saturated and people won’t understand what you are saying. Now imagine a chain of such amplifiers: they all need to be set properly in order for your voice to come out loud and clear at the end of the chain. Your voice has to come out of each amplifier at the same amplitude as it came in.

We can’t guarantee both cases of equal variance, unless the layer has an equal number of inputs and outputs.

The number of inputs to a layer/neuron is called its ‘fan-in’, and the number of outputs is called its ‘fan-out’.

The authors (Glorot and Bengio) proposed ‘Xavier Initialization’ or ‘Glorot initialization’.
We have $fan_{avg} = (fan_{in} + fan_{out}) / 2$.

To do Glorot initialization, when using the Sigmoid Activation Function:

  • Normal distribution with mean 0 and variance $\sigma^2 = \frac{1}{fan_{avg}}$
  • Or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$

Replace $fan_{avg}$ with $fan_{in}$ and you get Yann LeCun’s LeCun initialization.
Of course, this is equivalent to Glorot initialization when $fan_{in} = fan_{out}$.

Using Glorot initialization can speed up training a lot.

There are similar strategies for other activation functions. They differ only by the scale of the variance and whether they use $fan_{avg}$ or $fan_{in}$.

For the uniform distribution, just compute $r = \sqrt{3\sigma^2}$.

It is advised to use the following initializations for the given activation functions:

  • For no activation function, tanh, sigmoid, or softmax, use Glorot initialization.
    • Normal distribution with variance $\sigma^2 = \frac{1}{fan_{avg}}$.
  • For ReLU, Leaky ReLU, ELU, GELU, Swish, or Mish, use He initialization.
    • Normal distribution with $\sigma^2 = \frac{2}{fan_{in}}$.
  • For SELU, use LeCun initialization.
    • Normal distribution with $\sigma^2 = \frac{1}{fan_{in}}$.

So you create either a normal distribution or a uniform distribution with the appropriate statistics, and then you sample from that to initialize your weights.

Keras can do this for you. Just set the kernel_initializer argument. Or use a VarianceScaling initializer.

import tensorflow as tf
from tensorflow.keras.initializers import GlorotUniform, GlorotNormal, HeUniform, HeNormal, LecunUniform, LecunNormal

# use any of them here; He initialization is the recommended pairing with ReLU:
layer = tf.keras.layers.Dense(128, activation="relu", kernel_initializer=HeNormal())
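If you want a variant that isn’t a built-in preset (e.g. He initialization based on $fan_{avg}$ rather than $fan_{in}$, with a uniform distribution), a VarianceScaling initializer like the following should do it:

he_avg_init = tf.keras.initializers.VarianceScaling(scale=2., mode="fan_avg", distribution="uniform")
layer = tf.keras.layers.Dense(128, activation="relu", kernel_initializer=he_avg_init)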

Better Activation Functions

Problems with unstable gradients were in part due to a poor choice of activation function.
ReLU, for example, performs better in deep neural networks than sigmoid. It doesn’t saturate for positive values and is fast to compute.
But it isn’t perfect: it suffers from ‘dying ReLUs’, where some neurons “die” during training, meaning they just output 0. This can happen when a neuron’s weights get tweaked in such a way that the input of the ReLU function (the weighted sum of the neuron’s inputs plus the bias term) is negative for all training instances.
And gradient descent doesn’t affect the neuron anymore when it dies, as the gradients of ReLU is zero when its input is negative.

To solve that, you can use a variant of ReLU, like Leaky ReLU.

Leaky ReLU

$\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$. The hyperparameter $\alpha$ defines how much the function “leaks”: it’s the slope of the function for $z < 0$.
Having a slope $\alpha > 0$ for $z < 0$ ensures that leaky ReLU never dies (neurons can go into a long coma, but they have a chance to eventually wake up).

Bing Xu et al. found that leaky ReLU variants always outperformed the strict ReLU, and that $\alpha = 0.2$ (a huge leak) seemed to perform better than $\alpha = 0.01$ (a small leak).
Randomized leaky ReLU (RReLU, where $\alpha$ is picked randomly from a given range during training) also performed well, and it can act as a regularizer.
Parametric leaky ReLU (PReLU) strongly outperformed ReLU on large image datasets, but can overfit on smaller datasets. PReLU learns $\alpha$ during training, so it’s a parameter, not a hyperparameter.

Use He initialization for these variants.
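A minimal sketch in Keras (the $\alpha$ value is just the ‘huge leak’ mentioned above; in TF 2.x the argument is named alpha):

leaky_relu = tf.keras.layers.LeakyReLU(alpha=0.2)  # slope of the function for z < 0
dense = tf.keras.layers.Dense(50, kernel_initializer="he_normal")  # no activation here
# then stack them, e.g.: model.add(dense); model.add(leaky_relu)
# PReLU is available as tf.keras.layers.PReLU()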

ReLU and the aforementioned variants are not smooth functions. That is, their derivatives abruptly change at $z = 0$. This can make gradient descent bounce around the optimum and slow down convergence.
There are some smooth variants: ELU and SELU.

ELU and SELU

Exponential Linear Unit (ELU)
Djork-Arné Clevert et al. proposed the exponential linear unit (ELU) that outperformed other ReLU variants: faster training time, the network performed better on the training set.

$\text{ELU}_\alpha(z) = \alpha(\exp(z) - 1)$ if $z < 0$, and $z$ if $z \ge 0$.
The hyperparameter $\alpha$ defines the opposite of the value that the ELU function approaches when $z$ is a large negative number. It’s usually 1, but can be tweaked.

So it:

  • takes negative values when $z < 0$, allowing the unit to have an avg. output closer to 0, which helps alleviate vanishing gradients.
  • has a nonzero gradient for $z < 0$, helping avoid dead neurons.
  • is smooth everywhere if $\alpha = 1$, even around $z = 0$. This helps speed up GD, as it doesn’t bounce around as much to the left and right of $z = 0$.

Use He initialization.

Drawback of ELU: slower to compute than ReLU and other variants (due to exp).

Scale Exponential Linear Unit (SELU)
Later, a paper by Günter Klambauer et al. introduced scaled ELU (SELU).
SELU is ELU scaled by a factor $\lambda \approx 1.05$, using $\alpha \approx 1.67$.
They showed that if you build a NN exclusively of a stack of dense layers, all using SELU, then the network self-normalizes. That means the output of each layer tends to preserve a mean of 0 and a stdev of 1 during training (solving vanishing/exploding gradients).

But there are a few conditions for that to happen: input features must be standardized (mean 0, stdev 1), hidden layer weights must be initialized with LeCun normal initialization, and you can’t use regularization techniques like $\ell_1$ or $\ell_2$, max-norm, batch norm, or regular dropout.

The self-normalizing property is only guaranteed with plain MLPs, not e.g. RNNs or networks with skip connections.

Due to these constraints, SELU didn’t get a lot of traction.
And GELU, Swish, and Mish seem to outperform it consistently on most tasks.

GELU, Swish, and Mish

Gaussian Error Linear Unit
Dan Hendrycks and Kevin Gimpel introduced GELU. It’s like a smooth variant of ReLU.

$\text{GELU}(z) = z\,\Phi(z)$, where $\Phi$ is the standard Gaussian cumulative distribution function (CDF): $\Phi(z)$ corresponds to the probability that a value sampled randomly from a normal distribution of mean 0 and variance 1 is lower than $z$.

It’s quite computationally intensive. So while it’s good, it has an extra cost in that regard.

Swish
Swish activation function: $\text{Swish}(z) = z\,\sigma(z)$, i.e. the input times the sigmoid of the input.

Prajit Ramachandran et al. rediscovered the sigmoid linear unit (SiLU) and called it Swish. SiLU was originally proposed by the authors of GELU, in the GELU paper.
However, Ramachandran et al. found Swish to outperform even GELU.
They later generalized it by adding a hyperparameter $\beta$ to scale the sigmoid function’s input: $\text{Swish}_\beta(z) = z\,\sigma(\beta z)$.

GELU is actually approximately equal to generalized Swish with $\beta \approx 1.702$.

You can also tune $\beta$ – or even make it trainable and let gradient descent optimize it (like with PReLU). This may lead to overfitting, though.

Parameterized Swish

Mish
Mish activation function: $\text{Mish}(z) = z \tanh(\text{softplus}(z))$,

where $\text{softplus}(z) = \ln(1 + \exp(z))$.

Mish was introduced by Diganta Misra. They found that it generally outperformed other activation functions, even GELU and Swish (by a tiny margin).
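In recent TF/Keras versions, GELU and Swish can be selected by name (Mish may require a newer release or a custom activation function, depending on your install); a quick sketch:

gelu_layer = tf.keras.layers.Dense(100, activation="gelu", kernel_initializer="he_normal")
swish_layer = tf.keras.layers.Dense(100, activation="swish", kernel_initializer="he_normal")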

OK, so which activation function to use?

  • ReLU is a good default for simple tasks (can be as good as the variants, is fast to compute, there are ReLU-specific optimizations).
  • Swish is probably a better default for more complex tasks. Or try parameterized Swish with a learnable $\beta$ parameter for very complex tasks.
  • Mish can give even better results, but is more computationally intensive.
  • If runtime latency matters, prefer leaky ReLU or parametrized leaky ReLU.
  • For deep MLPs, you can try SELU (but beware constraints).

Batch Normalization

Here’s another way we can solve the unstable gradients problem.

He initialization and ReLU (or a variant) can help alleviate vanishing/exploding gradients at the beginning of training, but won’t guarantee that it doesn’t come back later.

Batch Normalization (BN) by Sergey Ioffe and Christian Szegedy addresses this problem.

Add an operation in the model before / after the activation function of each layer which zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer (one for scaling, one for shifting).
This lets the model learn the optimal scale and mean of each of the layer’s inputs.

Often, putting a BN layer as the first layer of your NN means you won’t have to standardize your training set.

To zero-center and normalize, it estimates each input’s mean and standard deviation by evaluating it over the current mini-batch.

The batch normalization algorithm:

1. $\mu_B = \dfrac{1}{m_B} \sum_{i=1}^{m_B} \mathbf{x}^{(i)}$
2. $\sigma_B^2 = \dfrac{1}{m_B} \sum_{i=1}^{m_B} (\mathbf{x}^{(i)} - \mu_B)^2$
3. $\hat{\mathbf{x}}^{(i)} = \dfrac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$
4. $\mathbf{z}^{(i)} = \gamma \otimes \hat{\mathbf{x}}^{(i)} + \beta$

where:

  • $\mu_B$ is the vector of input means, evaluated over the whole mini-batch $B$, and contains one mean per input.
  • $m_B$ is the number of instances in the mini-batch.
  • $\sigma_B$ is the vector of input standard deviations (evaluated over the whole mini-batch), containing one per input.
  • $\hat{\mathbf{x}}^{(i)}$ is the vector of zero-centered and normalized inputs for instance $i$.
  • $\varepsilon$ is there to avoid division by zero and to ensure the gradients don’t get too large. It’s a small number – typically $10^{-5}$ – called the smoothing term.
  • $\gamma$ is the output scale parameter vector for the layer, and contains one scale parameter per input.
  • $\otimes$ means element-wise multiplication.
  • $\beta$ is the output shift (offset) parameter vector for the layer, and contains one offset parameter per input. So each input is shifted by its corresponding shift parameter.
  • $\mathbf{z}^{(i)}$ is the output of the batch-norm operation: the rescaled, shifted version of the inputs.

So, (1) calculates the input means, (2) calculates the input standard deviations, (3) zero-centers and normalizes the inputs, and (4) produces the final output: the centered and normalized inputs, rescaled and shifted.

Batch Normalization adds a runtime penalty that makes the NN slower at predicting due to needing extra computations at each layer. But you can often fuse the BN layer with the previous layer after training by updating the previous layer’s weights and biases so it directly produces outputs of the appropriate scale and offset.
There are tools to do this for you.

While training may be slower, this is usually counterbalanced by faster convergence.

To use BN, add a BatchNormalization layer before or after each hidden layer’s activation function. You can also add one as the first layer, but a Normalization layer usually works just as well there.
Each BN layer adds four parameters per input ($\gamma$, $\beta$, $\mu$, $\sigma$); the last two are the moving averages, which are not trained by backpropagation. So if the associated layer takes 784 inputs, BN adds 4 × 784 = 3,136 parameters.

# Adding BN after activation functions
model = tf.keras.Sequential([
	tf.keras.layers.Flatten(input_shape=[28, 28]),
	tf.keras.layers.BatchNormalization(),
	tf.keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
	tf.keras.layers.BatchNormalization(),
	tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
	tf.keras.layers.BatchNormalization(),
	tf.keras.layers.Dense(10, activation="softmax"),
])

# Adding BN before activation functions
# Setting use_bias=False because BN layers include one offset parameter per input
# And dropping the first BN layer to avoid sandwiching the first hidden layer between two BN layers
model = tf.keras.Sequential([
	tf.keras.layers.Flatten(input_shape=[28, 28]),
	tf.keras.layers.Dense(300, use_bias=False, kernel_initializer="he_normal"),
	tf.keras.layers.BatchNormalization(),
	tf.keras.layers.Activation("relu"),
	tf.keras.layers.Dense(100, use_bias=False, kernel_initializer="he_normal"),
	tf.keras.layers.BatchNormalization(),
	tf.keras.layers.Activation("relu"),
	tf.keras.layers.Dense(10, activation="softmax"),
])

You may need to tweak BatchNormalization’s momentum. Default is usually fine, though.
Momentum is used by the BN layer when updating the exponential moving averages. Given a new value $\mathbf{v}$ (a new vector of input means or standard deviations, computed over the current batch), the layer updates the running average $\hat{\mathbf{v}}$ with:

$\hat{\mathbf{v}} \leftarrow \hat{\mathbf{v}} \times \text{momentum} + \mathbf{v} \times (1 - \text{momentum})$

A good momentum value is typically close to 1 (0.9, 0.99, 0.999, etc.): use more 9s for larger datasets and for smaller mini-batches.

axis is another hyperparameter. It determines which axis to normalize. Default is -1 (last axis), meaning it uses the means and standard deviations that are computed across the other axes.

Gradient Clipping

This is a technique to stabilize gradients during training, which helps mitigate the exploding gradients problem. We clip the gradients during backpropagation so they never exceed some threshold.
It is the components of the gradient vector that get clipped.

Generally used in RNNs where Batch Normalization is tricky.

Use in TensorFlow by setting the clipvalue argument on your optimizer.
This may change the orientation of the gradient vector. If you have a gradient vector [0.9, 100], and you clip to between -1 and 1, you’ll get [0.9, 1.0]. While this works well in practice, the orientation is now very different.
If you want to ensure gradient clipping doesn’t change the direction of the gradient vector, use clipnorm instead, which clips the whole gradient if its norm is greater than the threshold, preserving the orientation. But with the same [0.9, 100] gradient vector and clipnorm=1.0, the result is roughly [0.009, 0.99996]: the 0.9 component is essentially wiped out because 100 is so much larger.
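A minimal sketch of both options (the threshold 1.0 is illustrative):

# Clip each gradient component to the range [-1.0, 1.0] (may change the gradient's direction):
optimizer = tf.keras.optimizers.SGD(clipvalue=1.0)

# Or clip the whole gradient vector if its norm exceeds 1.0 (preserves the direction):
optimizer = tf.keras.optimizers.SGD(clipnorm=1.0)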

Reusing Pretrained Layers

Problem: Not having enough training data for a large network, or it’s too costly to label.

It’s a good idea to try to find existing neural networks that work on tasks similar to yours.
If you do, you can generally reuse most of its layers (except the top ones).
This is called transfer learning: it speeds up training and requires much less data.

If the input dimensions aren’t the same, you’ll need to add a preprocessing step to resize the inputs to the pretrained network’s expected size.

Transfer learning works best when the inputs have similar low-level features.

You’d most likely want to replace the output layer. It’s unlikely to be useful for your task, and probably won’t have the right number of outputs.
And similar for the upper hidden layers (closer to output). The high-level features they’ve learned are likely not as useful to your task as layers that are trained specifically for it.
So finding the appropriate number of layers to reuse is important. The more similar the tasks, the more you can reuse. You could even reuse all but the output layer.

How to reuse

  • Start by freezing the reused layers (making their weights non-trainable, so GD won’t modify them), then train your model and see how it performs.
  • Then unfreeze 1-2 of the top layers to let Backpropagation update them and see if performance improves.
    • The more training data you have, the more layers you can unfreeze.
    • You might also want to reduce learning rate when you unfreeze reused layers to avoid wrecking the fine-tuned weights.
  • If you still don’t get good performance, and have little training data, try dropping the top hidden layers and freezing the remaining hidden layers.
  • If you have lots of data, try replacing the top hidden layers instead of dropping them, and adding more hidden layers.
  • Iterate until you find the right number of layers to reuse.

Transfer Learning with Keras

So, imagine you have some model A that has already been trained and shows good performance.
You want to do some different, yet similar task, and you trained a model B for that.
Say model B has the same architecture as model A.

Sidenote: when using Dense layers in a NN that takes images as inputs, only patterns that occur at the same location can be reused. Convolutional layers transfer better, as learned patterns can appear anywhere in the image.

We can reuse parts of model A:

# Load the model you'd like to reuse layers of
model_A = tf.keras.models.load_model("my_model_A")

# We clone the model so `model_B_on_A` doesn't share layers with model A - otherwise, updating B_on_A would update A.
model_A_clone = tf.keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights()) # Important! Cloning doesn't set weights.

model_B_on_A = tf.keras.Sequential(model_A_clone.layers[:-1]) # all but the last layer (output layer)
model_B_on_A.add(tf.keras.layers.Dense(1, activation="sigmoid"))

Training model_B_on_A now could wreck the reused weights because the new output layer was randomly initialized, and will therefore make large errors at first, so there’d be large error gradients.
Instead, we freeze the reused layers during the first few epochs. This gives the new layer some time to learn reasonable weights.

# Make reused layers untrainable / freeze
for layer in model_B_on_A.layers[:-1]:
	layer.trainable = False

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
# Remember to always compile your model after freezing/unfreezing layers!
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])

# Train for a few epochs:
history = model_B_on_A.fit(
	X_train_B, y_train_B, epochs=4,
	validation_data=(X_valid_B, y_valid_B)
)

# Make reused layers trainable / unfreeze
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model_B_on_A.fit(
	X_train_B, y_train_B, epochs=16,
	validation_data=(X_valid_B, y_valid_B)
)

In this part, the author gave a nice example of “torturing the data until it confesses.”
They tuned the configuration until one gave a nice performance boost.
It’s important to be suspicious when a (new) paper looks too positive: maybe the new technique doesn’t actually help, or even reduces performance. The authors may just have tried many variants and reported only the best results (which they could have found by sheer luck).

Transfer learning doesn’t work great with small dense networks.
This could be because they learn few patterns, or only learn very specific patterns that aren’t useful for other tasks.

It does, however, work great with deep Convolutional Neural Networks (CNNs), which tend to learn feature detectors that are much more general, especially in lower layers.

Unsupervised Pretraining

What if you have a complex task but not much labeled training data, and you can’t find a pretrained model?

Try to get more labeled training data.
If you can’t, try unsupervised pretraining.

If you get enough unlabeled training data, you can use it to train an unsupervised model (like autoencoder or GAN). Then you can reuse the lower layers of the autoencoder or lower layers of the GAN’s discriminator, add the output layer for your task on top, and fine-tune the final network using supervised learning (with your labeled samples).

Pretraining on an Auxiliary Task

If you don’t have much labeled training data, you can train a neural network on an auxiliary task, for which you can easily get/generate labeled training data.
Then reuse the lower layers of that network on the actual task. The lower layers will learn feature detectors that would likely be reusable by the second NN.

For example: recognizing faces.
Say you only have few pictures of each individual.
You could gather a lot of pictures of random people on the web, train a NN to detect whether two pictures feature the same person.
That NN would learn good feature detectors for faces, so if you reuse the lower layers, you could use it for your classifier with your small set of training data.

For NLP, you could download a corpus of millions of text documents and generate labeled data from it. E.g. randomly mask out some words & train a model to predict the missing words.
If this model reaches good performance, it’ll likely know a lot about language, so you could reuse it for your actual NLP task and fine-tune on your labeled data.

Faster Optimizers

Problem: Slow training.

Some other strategies to speed up training (and getting better solutions):

  • Using a good initialization strategy for connection weights
  • Using a good activation function
  • Batch Normalization
  • Reusing parts of pretrained networks

You can also get a speed boost from using a faster optimizer than regular Gradient Descent.

Some of the most popular optimization algorithms include:

  • Momentum
  • Nesterov accelerated gradient
  • AdaGrad
  • RMSProp
  • Adam (& variants)

Momentum

Momentum Optimization was proposed by Boris Polyak.
The main idea is like the momentum we know from the physical world.
Imagine a bowling ball rolling down a gentle slope on a smooth surface. It starts out slowly, but will quickly pick up momentum until it reaches terminal velocity (if there’s friction or air resistance).

It’s faster than regular Gradient Descent at reaching the minimum.
Gradient Descent takes small steps when the slope is gentle and big steps when the slope is steep, but won’t ever pick up speed.

Gradient Descent updates the weights $\theta$ by subtracting the gradient of the cost function w.r.t. the weights ($\nabla_\theta J(\theta)$), multiplied by the learning rate $\eta$: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$.

Momentum optimization cares about what the previous gradients were. At each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the momentum vector $\mathbf{m}$. So the gradient is used as an acceleration, not a speed.
And to simulate friction (preventing the momentum from growing too large), we have the hyperparameter $\beta$, called the momentum, which should be set between 0 (high friction) and 1 (no friction). A typical value is 0.9.

Momentum algorithm:
1. $\mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_\theta J(\theta)$
2. $\theta \leftarrow \theta + \mathbf{m}$

Because of the momentum, the optimizer may overshoot, come back, overshoot, and keep doing so for a bit until it stabilizes at the minimum. This is why we have friction.

Nesterov Accelerated Gradient

Yurii Nesterov proposed a variant to momentum optimization.
This is almost always faster.

Nesterov Accelerated Gradient (NAG), or Nesterov momentum optimization, measures the gradient of the cost function not at the local position $\theta$, but slightly ahead in the direction of the momentum, at $\theta + \beta\mathbf{m}$.

NAG algorithm:
1. $\mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_\theta J(\theta + \beta\mathbf{m})$
2. $\theta \leftarrow \theta + \mathbf{m}$

This works because in general the momentum vector points in the right direction (towards the optimum), so it’s slightly more accurate to use the gradient measured a bit farther in that direction than the gradient at the original position.
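A minimal sketch of both optimizers in Keras (the learning rate value is illustrative):

momentum_optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
nesterov_optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)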

AdaGrad

AdaGrad algorithm:

1. $\mathbf{s} \leftarrow \mathbf{s} + \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta\, \nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \varepsilon}$

$\otimes$ is element-wise multiplication; $\oslash$ is element-wise division.

The first step accumulates the square of the gradients into the vector $\mathbf{s}$.
This vectorized form is equivalent to computing $s_i \leftarrow s_i + (\partial J(\theta)/\partial \theta_i)^2$ for each element $s_i$ of the vector $\mathbf{s}$.
Each $s_i$ accumulates the squares of the partial derivative of the cost function w.r.t. parameter $\theta_i$.
If the cost function is steep along the $i$th dimension, $s_i$ will get larger and larger at each iteration.

The second step is like gradient descent, but the gradient vector is scaled down by a factor of $\sqrt{\mathbf{s} + \varepsilon}$.
$\varepsilon$ is a smoothing term to avoid division by zero (typically $10^{-10}$).
This vectorized form is equivalent to simultaneously computing $\theta_i \leftarrow \theta_i - \eta\, (\partial J(\theta)/\partial \theta_i) / \sqrt{s_i + \varepsilon}$ for all parameters $\theta_i$.

So, this algorithm decays the learning rate, and does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate.
It helps point the resulting updates more directly toward the global optimum, and it requires much less tuning of the learning rate hyperparameter $\eta$.

AdaGrad often does well for simple quadratic problems, but often stops too early when training neural networks. The learning rate gets scaled down too much, so the algorithm stops before reaching the global optimum.
So avoid using it for deep neural networks.

RMSProp

AdaGrad may slow down too fast and never converge to the global optimum.
RMSProp fixes that by accumulating only the gradients from the most recent iterations (not all gradients since starting training).
It does so by using exponential decay in the first step.

RMSProp algorithm:

1. $\mathbf{s} \leftarrow \rho \mathbf{s} + (1 - \rho)\, \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta\, \nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \varepsilon}$

$\otimes$ is element-wise multiplication; $\oslash$ is element-wise division.

The decay rate $\rho$ is typically set to 0.9.

This algorithm was created by Geoffrey Hinton and Tijmen Tieleman. They did not write a paper, and the algorithm was presented by Hinton in his Coursera class on neural networks.

This was the preferred optimization algorithm until Adam.
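A minimal sketch (rho is the decay rate $\rho$ described above):

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)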

Adam

Adam stands for adaptive moment estimation.
It combines the ideas from momentum optimization and RMSProp.

  • It keeps track of an exponentially decaying average of past gradients (like momentum optimizers).
  • And it also keeps track of an exponentially decaying average of past squared gradients (like RMSProp).

These are estimations of the mean and (uncentered) variance of the gradients.
We often call mean the first moment and variance the second moment. Hence adaptive moment estimation.

Adam algorithm:

1. $\mathbf{m} \leftarrow \beta_1 \mathbf{m} - (1 - \beta_1)\, \nabla_\theta J(\theta)$
2. $\mathbf{s} \leftarrow \beta_2 \mathbf{s} + (1 - \beta_2)\, \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
3. $\hat{\mathbf{m}} \leftarrow \dfrac{\mathbf{m}}{1 - \beta_1^t}$
4. $\hat{\mathbf{s}} \leftarrow \dfrac{\mathbf{s}}{1 - \beta_2^t}$
5. $\theta \leftarrow \theta + \eta\, \hat{\mathbf{m}} \oslash \sqrt{\hat{\mathbf{s}} + \varepsilon}$

$\otimes$ is element-wise multiplication; $\oslash$ is element-wise division; $t$ represents the iteration number, starting at 1.

Steps 1, 2, and 5 show the similarity to momentum optimization and RMSProp:
$\beta_1$ corresponds to $\beta$ in momentum optimization, and $\beta_2$ corresponds to $\rho$ in RMSProp.
But step 1 computes an exponentially decaying average rather than an exponentially decaying sum (these are equivalent except for a constant factor – the decaying average is just $1 - \beta_1$ times the decaying sum).

Steps 3 and 4: since $\mathbf{m}$ and $\mathbf{s}$ are initialized at 0, they’ll be biased toward 0 at the beginning of training, so these steps help boost $\mathbf{m}$ and $\mathbf{s}$ early in training.

The momentum decay hyperparameter $\beta_1$ is usually initialized to 0.9.
The scaling decay hyperparameter $\beta_2$ is usually initialized to 0.999.
The smoothing term $\varepsilon$ is usually initialized to a tiny number, such as $10^{-7}$.

Because Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate hyperparameter $\eta$. You can often just use the default of $\eta = 0.001$.
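A minimal sketch with the usual defaults spelled out:

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)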

Some noteworthy variants of Adam: AdaMax, Nadam, AdamW.

AdaMax

This was actually also introduced in the Adam paper.

In step 2 of the Adam algorithm, it accumulates the squares of the gradients in $\mathbf{s}$ (with greater weight for more recent gradients).
In step 5, if we ignore $\varepsilon$ and steps 3 and 4 (which are just technical details), it scales down the parameter updates by the square root of $\mathbf{s}$.
In short: Adam scales down the parameter updates by the $\ell_2$ norm of the time-decayed gradients.

And AdaMax replaces the $\ell_2$ norm with the $\ell_\infty$ norm (a fancy way of saying the max), meaning it

  • replaces step 2 of Adam with $\mathbf{s} \leftarrow \max(\beta_2 \mathbf{s}, |\nabla_\theta J(\theta)|)$,
  • drops step 4, and
  • in step 5, scales down the gradient updates by a factor of $\mathbf{s}$ (the max of the absolute value of the time-decayed gradients).

This can make AdaMax more stable than Adam, but it depends on the dataset. Adam generally performs better.

Nadam

This is Adam optimization plus the Nesterov trick. Converges faster than Adam.
Nadam generally outperforms Adam but is sometimes outperformed by RMSProp (found by Timothy Dozat).

AdamW

This integrates a Regularization technique into Adam called weight decay.

Weight decay reduces the size of the model’s weights at each training iteration by multiplying them by a decay factor (e.g. 0.99).

You might want to tune the weight decay hyperparameter.
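A minimal sketch, assuming a TF version recent enough to ship AdamW in tf.keras.optimizers (older versions had it in an experimental namespace or in TensorFlow Addons); the weight_decay value is illustrative:

optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=1e-4)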

For general note on optimization:
Adaptive optimization methods (e.g. RMSProp, Adam, AdaMax, Nadam, and AdamW) are often great, but may have trouble creating solutions that generalize well on some datasets. So try NAG – and keep up with research for better methods.

The optimization techniques discussed here rely on first-order partial derivatives (Jacobians). Some algorithms are based on second-order partial derivatives (Hessians – the partial derivatives of the Jacobians). But these are hard to apply to deep NNs: there are $n^2$ Hessian terms per output (where $n$ is the number of parameters), versus just $n$ Jacobian terms per output. So they often don’t fit in memory and are too slow to compute.

The methods also produce dense models. Most parameters will therefore be nonzero.
If you need fast models at runtime (or ones that take less memory), you may want a sparse model.
You could do that by training a model and then getting rid of tiny weights by setting them to zero, but this often doesn’t lead to a very sparse model, and can degrade model performance.
Instead, use strong Regularization during training to push the optimizer to zero out as many weights as it can.
There are other solutions, of course.

Optimizer comparison

| Class | Convergence speed | Convergence quality |
| --- | --- | --- |
| SGD | Bad | Good |
| SGD(momentum=...) | Average | Good |
| SGD(momentum=..., nesterov=True) | Average | Good |
| Adagrad | Good | Bad (stops too early) |
| RMSprop | Good | Average/Good |
| Adam | Good | Average/Good |
| AdaMax | Good | Average/Good |
| Nadam | Good | Average/Good |
| AdamW | Good | Average/Good |

Learning Rate Scheduling

Problem: Models with millions of parameters severely risk overfitting the training set, especially when there aren’t enough training instances or when the data is too noisy.

One way to find a good learning rate is training a model for a few hundred iterations, exponentially increasing the LR from a very small value to a very large value and looking at the learning curve, then picking a LR slightly lower than the one where the learning curve starts shooting back up. Then reinitialize the model with that learning rate.

But we can do better than constant learning rates. Start with a large learning rate, then reduce it once training stops making fast progress. This can help us reach a good solution faster than with an optimal constant learning rate.

Learning schedules are strategies to reduce the learning rate during training.

  • Power scheduling: set the learning rate to a function of the iteration number $t$: $\eta(t) = \eta_0 / (1 + t/s)^c$.
    • Hyperparameters: the initial learning rate $\eta_0$, the power $c$ (typically 1), and the steps $s$. All of these require tuning (less so $c$).
    • The learning rate drops at each step. After $s$ steps, it’s down to $\eta_0 / 2$; after $s$ more steps, it’s down to $\eta_0 / 3$, and so on – it drops quickly at first, then more and more slowly.
  • Exponential scheduling: set the learning rate to $\eta(t) = \eta_0 \cdot 0.1^{t/s}$.
    • The learning rate gradually drops by a factor of 10 every $s$ steps ($t$ is the iteration number).
  • Piecewise constant scheduling: use a constant learning rate for a number of epochs, then a smaller rate for another number of epochs, and so on.
    • Can work well, but requires fiddling around to find the right sequence of learning rates and how long to use each.
  • Performance scheduling: measure the validation error every $N$ steps and reduce the learning rate by a factor of $\lambda$ when the error stops dropping.
  • 1cycle scheduling: starts by increasing the initial learning rate $\eta_0$ linearly up to $\eta_1$ halfway through training, then decreases it linearly back down to $\eta_0$ during the second half, finishing the last few epochs by dropping the rate by several orders of magnitude (still linearly).
    • Introduced by Leslie Smith in 2018.
    • The maximum learning rate $\eta_1$ is chosen the same way you’d search for an optimal learning rate; $\eta_0$ is usually about 10x lower.
    • When using momentum, start with a high momentum (e.g. 0.95), drop it linearly to a lower value (e.g. 0.85) during the first half of training, bring it back up to the maximum during the second half, and finish the last few epochs at that maximum.
    • Can speed up training considerably and reach better performance.

Andrew Senior et al. found that, when using momentum optimization, both performance scheduling and exponential scheduling perform well (they favored exponential scheduling).

The 1cycle approach seems to perform even better.
But all three (exponential, performance, 1cycle) are good.
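Two of these are easy to sketch in Keras (the values are illustrative, not the book’s exact code):

# Exponential scheduling as an optimizer schedule: the rate is multiplied by
# decay_rate every decay_steps steps.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=20_000, decay_rate=0.1)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

# Performance scheduling as a callback: halve the learning rate when the
# validation loss hasn't improved for 5 epochs.
reduce_lr_cb = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)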

Avoiding Overfitting Through Regularization

Deep neural nets typically have tens of thousands (sometimes millions or billions) of parameters. This gives them a high degree of freedom, meaning they can fit a huge variety of complex datasets. But it also makes them prone to overfitting, so we use regularization to prevent that.

l1 and l2 Regularization

You can use $\ell_2$ Regularization to constrain a NN’s connection weights, and/or $\ell_1$ Regularization if you want a sparse model (many weights equal to 0).
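A minimal sketch of applying $\ell_2$ regularization to a layer’s connection weights (the 0.01 factor is illustrative; use regularizers.l1 for sparsity, or l1_l2 for both):

layer = tf.keras.layers.Dense(
    100, activation="relu", kernel_initializer="he_normal",
    kernel_regularizer=tf.keras.regularizers.l2(0.01))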

Dropout

At each training iteration, a random subset of all neurons (except those in the output layer) across layers get dropped out, and will output 0 for that iteration.

This is a popular regularization technique proposed by Geoffrey Hinton in 2012 (further detailed in 2014 by Nitish Srivastava et al.). It’s been very successful, giving even state-of-the-art networks a 1%-2% accuracy boost!

At every training step, every neuron (including input neurons, but excluding output neurons) has a probability $p$ of being temporarily dropped out: it will be ignored during this training step, but may be active during the next step. When training is done, neurons don’t get dropped anymore.

$p$ is called the dropout rate. It’s typically between 10% and 50%: closer to 20–30% in RNNs and closer to 40–50% in CNNs.

This seemingly has a few benefits. Neurons trained with dropout can’t co-adapt with their neighboring neurons. That means they have to be as useful as possible on their own.
They can’t rely too much on just a few input neurons – they’ll have to pay attention to each of their input neurons. So the neurons become less sensitive to slight changes in inputs, and you get a more robust network that generalizes better.

In practice, you can usually apply dropout only to the neurons in the top one to three layers (excluding the output layer).
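A minimal sketch: a 20% dropout rate applied after each of the top hidden layers (the architecture itself is illustrative):

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])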

It’s important that we divide the connection weights by the keep probability ($1 - p$) during training, where the ‘keep probability’ is the probability that a neuron is kept active (not dropped).
When neurons are dropped, its output becomes zero. But during testing/inference, all neurons are kept active (no dropout). To compensate for this difference and maintain the expected output of the network, we scale the connection weights. By dividing the weights by the keep probability during training, we ensure the expected output of a neuron remains the same whether Dropout is applied or not – the expected output of the network remains consistent between training and testing, even as dropout isn’t applied during testing.

Because dropout is active during training only, the training loss and validation loss can be misleading. Your model can overfit the training set yet have similar training & validation losses, so evaluate the training loss without dropout after training.

If your model is overfitting, try increasing dropout rate. Decrease if it’s underfitting.
You may also want to increase it for larger layers and decrease it for smaller ones.
Some neural network architectures only use dropout after the last hidden layer – you can try that.

Dropout tends to slow down convergence, but usually leads to a better model when properly tuned.

If you’re using SELU to get a self-normalizing network, you should regularize with alpha dropout. Alpha dropout is a variant of dropout that preserves the mean and standard deviation of its inputs.

Monte Carlo (MC) Dropout

MC Dropout can boost the performance of any trained dropout model without having to retrain or modify it. And it provides better uncertainty estimates.

Here’s the full implementation of MC dropout:

import numpy as np

# Using Keras
y_probas = np.stack(
	# model() is similar to model.predict() but returns a tensor
	# instead of a NumPy array, and can take the `training` argument.
	# Set `training=True` to ensure Dropout layer is active (so all predictions are slightly different) as we make 100 predictions.
	[model(X_test, training=True) for sample in range(100)]
)
y_proba = y_probas.mean(axis=0)

Above, we make 100 predictions. Each call to model() returns a matrix with one row per instance and one column per class. Let’s call the number of instances $m$ and the number of classes $n$: we get a matrix of shape [$m$, $n$]. We stack 100 of these, so y_probas is a 3D array of shape [100, $m$, $n$]. Once we average over the first dimension (axis=0), we get y_proba, an array of shape [$m$, $n$], just like we’d get from a single prediction.

If you used layers like Batch Normalization (layers that behave in a special way during training), don’t force training like above!
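One common way to handle that (a sketch of the usual pattern, not necessarily the book’s exact code) is to subclass Dropout so that only the dropout layers stay active, while everything else runs in inference mode:

class MCDropout(tf.keras.layers.Dropout):
    def call(self, inputs, **kwargs):
        # Force dropout to stay active even at inference time.
        return super().call(inputs, training=True)

You’d then build (or rebuild) the model using MCDropout layers in place of Dropout, and call it normally, without training=True.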

This process, averaging over multiple predictions with dropout on, gives a Monte Carlo estimate that is generally more reliable than a single prediction with dropout off.
That is, the reliability of the model’s probability estimates is improved.

You can also check the Standard Deviation of y_probas to see how much variance there is in the probability estimates. This is useful for e.g. risk-sensitive systems, where you’d like to know how much uncertainty there is with the given prediction.

Number of Monte Carlo samples to use is a hyperparameter we can tweak. We used 100 here. The higher it is, the more accurate the predictions and their uncertainty estimates are – but inference time also increases. And there’s a ceiling to how much improvement you’ll see as you increase it.

Max-Norm Regularization

For each neuron, constrain the weights $\mathbf{w}$ of the incoming connections such that $\lVert \mathbf{w} \rVert_2 \le r$, where $r$ is the max-norm hyperparameter and $\lVert \cdot \rVert_2$ is the $\ell_2$ norm.

This doesn’t add a regularization loss term to the loss function. Instead, it’s implemented by computing $\lVert \mathbf{w} \rVert_2$ after each training step and rescaling $\mathbf{w}$ if needed ($\mathbf{w} \leftarrow \mathbf{w}\, r / \lVert \mathbf{w} \rVert_2$).

Reducing $r$ increases the amount of regularization and helps reduce overfitting.
Max-norm can also help against unstable gradients (if not using Batch Normalization).

The axis you apply max-norm to needs to be adjusted depending on what you’re doing.

  • Dense layers usually have weights of shape [number of inputs, number of neurons], so axis=0 means the max-norm constraint is applied independently to each neuron’s weight vector.
  • Convolutional layers require you to set the constraint axis as well. Usually axis=[0, 1, 2].
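A minimal sketch for a dense layer (the max-norm value 1.0 is illustrative):

dense = tf.keras.layers.Dense(
    100, activation="relu", kernel_initializer="he_normal",
    kernel_constraint=tf.keras.constraints.max_norm(1., axis=0))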

Summary and Practical Guidelines

There’s no clear consensus. These are just guidelines.

Default deep neural network configuration:

| Hyperparameter | Default value |
| --- | --- |
| Kernel initializer | He initialization |
| Activation function | ReLU if shallow; Swish if deep |
| Normalization | None if shallow; Batch Normalization if deep |
| Regularization | Early Stopping; weight decay if needed |
| Optimizer | Nesterov accelerated gradients (NAG) or AdamW |
| Learning rate scheduler | Performance scheduling or 1cycle |

Deep neural network configuration for a self-normalizing net (e.g. if it’s a simple stack of dense layers):

| Hyperparameter | Default value |
| --- | --- |
| Kernel initializer | LeCun initialization |
| Activation function | SELU |
| Normalization | None (self-normalization) |
| Regularization | Alpha dropout if needed |
| Optimizer | Nesterov accelerated gradients (NAG) |
| Learning rate scheduler | Performance scheduling or 1cycle |

Remember to normalize input features.
And try to reuse parts of a pretrained neural net if you can, or use unsupervised pretraining if you have lots of unlabeled data, or use pretraining on an auxiliary task if you have lots of labeled data for a similar task.

Exceptions

  • If you need a sparse model: you can use $\ell_1$ Regularization (and optionally zero out tiny weights after training). If you need an even sparser model, try e.g. the TensorFlow Model Optimization Toolkit (this breaks self-normalization, so use the default config in that case).
  • If you need a low-latency model (very fast predictions): try using fewer layers, a fast activation function (e.g. ReLU or leaky ReLU), and fold Batch Normalization layers into the previous layers after training.
    • A sparse model also helps.
    • And reducing float precision from 32 bits to 16 or 8 bits.
  • For risk-sensitive applications (or you don’t care about latency): can use MC dropout to boost performance and get more reliable predictions + uncertainty estimates.

Chapter 12 Custom Models and Training with Tensorflow

A Quick Tour of TensorFlow

TF is a library for numerical computation.

It

  • has a core similar to NumPy, but with GPU support
  • supports distributed computing
  • has a JIT compiler that extracts the computation graph from a Python function, optimizes it (prunes unused nodes, for example), and runs it efficiently (e.g. running independent ops in parallel)
  • can export computation graphs
  • implements Reverse-Mode Autodiff and various optimizers

There’s the high-level Keras API, but TF also supports data loading (tf.data, tf.io, ...), image processing operations (tf.image), signal processing ops (tf.signal), etc.

Each operation (op) is implemented using efficient C++ code. This also means you could write your own operations using the C++ API.
Many of these have multiple implementations called kernels, where each kernel is dedicated to a specific device type (CPU, GPU, TPU, etc.).

GPUs can speed up computations by splitting them into many smaller chunks and running them in parallel across many GPU threads.

TPUs are faster, as they are built specifically for deep learning operations.
TPUs are custom ASIC chips.

Some alternatives to TensorFlow:

  • Facebook’s PyTorch is very popular in academia
  • Google’s JAX is gaining momentum

Using TensorFlow like NumPy

Tensors in TensorFlow are similar to NumPy’s ndarray: usually a multidimensional array, but can also hold a scalar.

Tensors and Operations

Can create one with tf.constant(). They have a shape and dtype.
If it holds a scalar, the shape is empty.
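A quick sketch:

import tensorflow as tf

t = tf.constant([[1., 2., 3.], [4., 5., 6.]])  # a 2x3 matrix
t.shape   # TensorShape([2, 3])
t.dtype   # tf.float32
tf.constant(42)  # a scalar: its shape is ()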

Tensors and NumPy

You can convert an ndarray to a tensor and vice versa.

Variables

Tensor values are immutable, so we can’t use regular tensors to implement e.g. weights in neural nets (they can’t get tweaked by backprop).
Instead use a tf.Variable. It’s like a tensor, but mutable.
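A quick sketch:

v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v.assign(2 * v)       # replace the whole value in place
v[0, 1].assign(42.)   # update a single cell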

Custom Loss Function

# use vectorized implementation & use tf only to benefit from its graph optimization
def huber_fn(y_true, y_pred):
	error = y_true - y_pred
	is_small_error = tf.abs(error) < 1
	squared_loss = tf.square(error) / 2
	linear_loss = tf.abs(error) - 0.5
	return tf.where(is_small_error, squared_loss, linear_loss)

# ...
model.compile(loss=huber_fn, optimizer=...)
model.fit(...)

Saving and Loading Models That Contain Custom Components

When loading a model containing custom objects, map the names to the objects via a dictionary and pass it to custom_objects.
You can also decorate your custom functions with keras.utils.register_keras_serializable.

If you’re using e.g. a custom loss function, you can implement it as a subclass of tf.keras.losses.Loss and implement its get_config() method. This lets you save the loss’s configuration (e.g. a threshold) along with the model, so when loading you only have to map the class in custom_objects, without instantiating it with the config values again. When saving the model, Keras calls get_config() and saves the config; when you load it, Keras calls from_config().
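A sketch of that idea, using a Huber-style loss with a configurable threshold (the class name and threshold value are illustrative):

class HuberLoss(tf.keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)

    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)

    def get_config(self):
        # Saved with the model, so the threshold is restored on load.
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

When loading, you’d pass custom_objects={"HuberLoss": HuberLoss} and Keras restores the threshold from the saved config.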

Custom Metrics

Streaming metrics (or stateful metrics), like precision, are gradually updated, batch after batch.
Create your own by subclassing tf.keras.metrics.Metric.

Custom Layers

Useful for defining your own layers, or if you want to use an architecture that has lots of repetition, wherein a block of layers is repeated often (and you want to treat each block as a single layer).

Layers with no weights (e.g. Flatten, ReLU) can be created simply by writing a function and wrapping it in tf.keras.layers.Lambda.
Then you can use it like any other layer.
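A quick sketch: a layer with no weights that computes the exponential of its inputs:

exponential_layer = tf.keras.layers.Lambda(lambda x: tf.exp(x))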

To build a custom stateful layer (a layer with weights), subclass tf.keras.layers.Layer.

import tensorflow as tf


class MyDense(tf.keras.layers.Layer):
    def __init__(self, units: int, activation=None, **kwargs):
        """
        Initialize the layer.

        Parameters:
        units (int): Number of units in the layer.
        activation (optional): Activation function to use. If None, no activation is applied. Callable function or string.
        **kwargs: Additional keyword arguments for the layer. Like name, dtype, input_shape, trainable, etc.
        """
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, batch_input_shape: tf.TensorShape) -> None:
        """
        Build the layer's weights.
        Usually called the first time the layer is used.

        Parameters:
        batch_input_shape (tf.TensorShape): The shape of the input batch.
        """
        self.kernel = self.add_weight(
            # batch_input_shape[-1] is the number of features in the input,
            # corresponding to the number of neurons in the previous layer
            name="kernel", shape=[batch_input_shape[-1], self.units], initializer="glorot_normal"
        )
        self.bias = self.add_weight(name="bias", shape=[self.units], initializer="zeros")

    def call(self, X: tf.Tensor) -> tf.Tensor:
        """
        Forward pass for the layer.

        Parameters:
        X (tf.Tensor): Input tensor.

        Returns:
        tf.Tensor: Output tensor.
        """
        assert self.activation is not None, "Activation function is None"
        return self.activation(X @ self.kernel + self.bias)

    def get_config(self):
        """
        Returns the configuration of the layer.

        This method collects the configuration of the layer, such as the number of units and the activation function,
        and returns it as a dictionary. This is useful for saving the model or for creating a new layer with the same configuration.

        Returns:
            dict: A dictionary containing the layer's configuration.
        """
        base_config = super().get_config()
        return {**base_config, "units": self.units, "activation": tf.keras.activations.serialize(self.activation)}

If your layer has behavior only during training (e.g. Batch Normalization, Dropout), you should add a Boolean training argument to call.

Custom Models

Subclass tf.keras.Model, create layers and variables in the constructor, implement the call() method to specify model behavior.

Custom layers can contain other layers. Keras will automatically detect that the attribute you save them in (e.g. self.hidden) contains trackable objects (layers in this case) and automatically adds them to the layer’s list of variables.

To be able to save your custom models, implement a get_config method. Or use save_weights() and load_weights().

Losses and Metrics Based on Model Internals

Sometimes you want to define losses based on parts of your model, like the weights or activations of hidden layers. Useful for e.g. regularization or to monitor internal aspects of the model.

Define a custom loss based on model internals by:

  • Computing it based on the part of the model you want
  • Pass the result to add_loss()

You can also add a custom metric with add_metric().

Computing Gradients Using Autodiff

Consider the function $f(w_1, w_2) = 3w_1^2 + 2w_1 w_2$. Its partial derivative with regard to $w_1$ is $6w_1 + 2w_2$.
Likewise, its partial derivative with regard to $w_2$ is $2w_1$.

At the point $(w_1, w_2) = (5, 3)$, these partial derivatives are 36 and 10, respectively. This gives us the gradient vector $(36, 10)$.

Doing this kind of analytical computation of partial derivatives by hand is virtually impossible for a neural network: the function is much more complex, and there’s a very large number of parameters.

Instead, we use Reverse-Mode Autodiff.

TensorFlow supports this through tf.Variable and tf.GradientTape. GradientTape is a context that automatically records every operation involving a variable, and can therefore compute the gradient of the result of the operation with regard to the variables you’re tracking.

You can force tape to watch any tensors with tape.watch(...).
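A minimal sketch, reusing the $f(w_1, w_2) = 3w_1^2 + 2w_1 w_2$ example from above:

w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    z = 3 * w1 ** 2 + 2 * w1 * w2

gradients = tape.gradient(z, [w1, w2])  # [36.0, 10.0]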

Usually, the gradient tape is used to compute the gradients of a single value (e.g. loss) w.r.t a set of values (e.g. model parameters).
If you try to compute the gradients of a vector (e.g. a vector with multiple losses), it’ll compute the gradients of the vector’s sum.
If you want the individual gradients, that is, the gradients of each loss w.r.t the model parameters, you’ll need to call the tape’s jacobian() method to perform Reverse-Mode Autodiff once for each loss in the vector.
You could also compute second-order partial derivatives (Hessians).

You can stop gradients from backpropagating through some part of your network: use tf.stop_gradient(), which returns inputs (like identity function) during the forward pass, but doesn’t let gradients through during Backpropagation.

Custom Training Loops

Prefer using fit() over implementing your own training loop.

TensorFlow Functions and Graphs

TensorFlow’s automatic graph generation can speed up your custom code and make it portable to any platform supported by TensorFlow.

You can define a function with some computation(s) and use tf.function(...) to have TensorFlow analyze it and generate a computation graph. Use the @tf.function decorator.

TF will optimize the computation graph (e.g. prune unused nodes, simplify expressions, etc.).
You can set jit_compile=True in tf.function to have TF use accelerated linear algebra (XLA) to compile dedicated kernels for your graph (and often fuse multiple operations). This can make your function much faster & consume less memory.
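A quick sketch:

@tf.function
def tf_cube(x):
    return x ** 3

tf_cube(tf.constant(2.0))  # returns a tensor equal to 8.0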

When you write a custom loss function, metric, layer, or something else that you use in your Keras model, Keras automatically converts it to a TF function.

TF Function Rules

I’ve noted some of the rules in a condensed form below.

Prefer TensorFlow operations over NumPy operations. In general, if you call any external library (or even the standard library), the call will only run during tracing; it won’t be part of the computation graph.

Use for i in tf.range(x) over for i in range(x). TensorFlow only captures for loops that iterate over a tensor or Dataset.

Use vectorized implementations over loops.

Chapter 13 Loading and Preprocessing Data with TensorFlow

TensorFlow has its own data loading and preprocessing API, so you don’t necessarily need to use Pandas or, e.g., Scikit-Learn.

The tf.data API

tf.data revolves around tf.data.Dataset. Dataset represents a sequence of data items.
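A quick sketch of building, shuffling, and batching a Dataset from an in-memory tensor (the values are illustrative):

X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(3)
for batch in dataset:
    print(batch)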

Keras Preprocessing Layers

Encoding Categorical Features Using Embeddings

An embedding is a dense representation of some higher-dimensional data (e.g. a category, or a word in a vocabulary). If we have 50,000 possible categories, one-hot encoding would produce a 50,000-dimensional sparse vector. An embedding, in comparison, would be a small dense vector, e.g. with just 100 dimensions.

In DL, embeddings are usually initialized randomly and then trained by gradient descent, along with other model parameters.

Because embeddings are trainable, they’ll improve during training.
Embeddings that represent similar categories will be pushed closer together by Gradient Descent, while those dissimilar will be further from each other in the embedding space.

The better the representation, the easier it is for the neural network to make accurate predictions, so training tends to make embeddings useful representations of the categories.
This is representation learning.

Often, embeddings can be reused for other tasks. For example, word embeddings (embeddings of individual words). When working on some NLP task, it’s often better to reuse pretrained word embeddings than training your own.

Tomas Mikolov et al. published a paper describing an efficient technique for learning word embeddings with neural networks. Their technique significantly outperformed previous approaches.
They could learn embeddings on a very large corpus of text by training a neural network to predict the words near any given word. This gave great embeddings.
Synonyms had very close embeddings, and semantically related words ended up clustered together.

But it isn’t just about proximity. Word embeddings are organized along meaningful axes in embedding space. There’s the famous example of computing King - Man + Woman (meaning you add and subtract the embedding vectors of the words), and the result being very close to the embedding of Queen. There are analogous examples.

Keras has an Embedding layer, which wraps an embedding matrix. Such a matrix has one row per category and one column per embedding dimension.
To convert a category ID to an embedding, it looks up and returns the row that corresponds to that category.
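A minimal sketch matching the numbers above (50,000 categories, 100-dimensional embeddings):

embedding_layer = tf.keras.layers.Embedding(input_dim=50_000, output_dim=100)
embedding_layer(tf.constant([3, 42, 7]))  # three category IDs -> a tensor of shape (3, 100)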

Image Preprocessing Layers

  • tf.keras.layers.Resizing to resize images to the desired size.
  • tf.keras.layers.Rescaling to rescale pixel values.
  • tf.keras.layers.CenterCrop to crop images, keeping only a center patch of the desired height and width.

Chapter 14 Deep Computer Vision Using Convolutional Neural Networks

CNNs emerged from studying the brain’s visual cortex.

Object detection = classifying multiple objects in an image and placing bounding boxes around them.

Semantic segmentation = classifying each pixel according to the class of the object it belongs to.

The Architecture of the Visual Cortex

David H. Hubel and Torsten Wiesel performed some experiments leading to insights into the structure of the visual cortex:
Many neurons in the visual cortex have a small local receptive field.
That is, they only react to visual stimuli located in a limited region of the visual field.
These fields can overlap. Together, they tile the whole visual field.

Some neurons react only to images of horizontal lines; others only to lines with different orientations. Two neurons could have the same receptive field, but react to different line orientations.
Some neurons have larger receptive fields and react to more complex patterns that are combinations of low level patterns.
This led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons.
The author showed an illustration that each neuron is connected only to nearby neurons from the previous layer in a neural network, showing this idea.

This kind of architecture can detect all sorts of complex patterns in any area of the visual field.

These studies led to the ‘neocognitron’, which gradually evolved into what we now call CNNs. We reached a milestone in 1998, when Yann LeCun introduced LeNet-5, along with convolutional layers and pooling layers.

Deep neural networks with fully connected layers aren’t well-suited for image data due to the huge number of parameters it requires. A 100x100 image has 10,000 pixels. If the first layer has just 1,000 neurons (not a lot!), then that means a total of 10 million connections for just the first layer.
CNNs solve this by using partially connected layers and weight sharing.

Convolutional Layers

These are the building blocks of CNNs.

Neurons in the first convolutional layer aren’t connected to every pixel in the input image, but only to pixels in their receptive fields.
And each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer.

In a CNN, each layer is represented in 2D, making it easier to match neurons with their corresponding inputs. Other multilayer Neural Networks have layers composed of a long line of neurons: we’d have to flatten images to 1D before feeding them to the neural network.

More formally:
A neuron located in row $i$, column $j$ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$, columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field.

For a layer to have the same height and width as the previous layer, it’s common to add zeros around the input. This is called zero padding.

We can also connect a large input layer to a much smaller layer by spacing out the receptive fields. This reduces computational complexity.
We have a horizontal or vertical step size from one receptive field to the next, called the stride. The stride can be different or the same in each direction.

Filters

We can represent a neuron’s weights as a small image the size of the receptive field.

The image below shows two possible sets of weights (filters). We also call them convolution kernels, or just kernels.
The first filter is a black square with a vertical white line in the middle, represented as a matrix of 0s except for the central column, which is filled with 1s.
Neurons with these weights ignore everything in their receptive field except the central vertical line (as all inputs are multiplied by 0, except the ones in the central vertical line).
The second filter is similar, but is a horizontal white line in the middle.

A layer full of neurons using the same filter outputs a feature map, which highlights areas in an image that activate the filter the most.

During training, the Convolutional Neural Network automatically learns the most useful filters for its task, and the layers above learn to combine them into more complex patterns.

Stacking Multiple Feature Maps

A convolutional layer can have multiple filters, and outputs one feature map per filter.
It has one neuron per pixel in each feature map, and all neurons within a given feature map share the same parameters (same kernel and bias term). Neurons in different feature maps use different parameters.
A neuron’s receptive field extends across all the feature maps of the previous layer.

So a convolutional layer applies multiple trainable filters to its inputs, which makes it capable of detecting multiple features anywhere in its inputs.

All neurons in a feature map sharing parameters means we have a dramatically reduced number of parameters in the model
When it learns to recognize a pattern in one location, it can recognize it in any other location. This is in contrast to fully connected neural networks, which can only recognize a pattern in the location where they learned it.

Input images also have multiple sublayers: one per color channel. Typically RGB.
Grayscale images have just one channel. Some have more (e.g. some images capture additional light frequencies, like infrared).

A neuron in row $i$, column $j$ of the feature map $k$ in a given convolutional layer $l$ is connected to the outputs of the neurons in the previous layer $l - 1$, located in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, across all feature maps (in layer $l - 1$).
Within a layer, all neurons in the same row and column but in different feature maps are connected to the outputs of the same neurons in the previous layer.

Computing the output of a neuron in a convolutional layer:

$z_{i,j,k} = b_k + \sum_{u=0}^{f_h - 1} \sum_{v=0}^{f_w - 1} \sum_{k'=0}^{f_{n'} - 1} x_{i',j',k'} \cdot w_{u,v,k',k}$ with $i' = i \times s_h + u$ and $j' = j \times s_w + v$

That just calculates the weighted sum of all the inputs plus the bias term. We have:

  • $z_{i,j,k}$ is the output of the neuron located in row $i$, column $j$ in feature map $k$ of the convolutional layer (layer $l$).
  • $s_h$ and $s_w$ are the vertical and horizontal strides, $f_h$ and $f_w$ are the height and width of the receptive field, and $f_{n'}$ is the number of feature maps in the previous layer (layer $l - 1$).
  • $x_{i',j',k'}$ is the output of the neuron in layer $l - 1$, row $i'$, column $j'$, feature map $k'$ (or channel $k'$ if the previous layer is the input layer).
  • $b_k$ is the bias term for feature map $k$ (in layer $l$). Analogous to a knob that tweaks the brightness of the feature map.
  • $w_{u,v,k',k}$ is the connection weight between any neuron in feature map $k$ of layer $l$ and its input located at row $u$, column $v$ (relative to the neuron’s receptive field) and feature map $k'$.

Implementing Convolutional Layers with Keras

My code will look a little weird from here. I can’t make TensorFlow work with my GPU, so I’m using Keras 3 with a PyTorch backend.

import numpy as np
import os 

os.environ["KERAS_BACKEND"] = "torch"

import keras
import torch
import matplotlib.pyplot as plt

from sklearn.datasets import load_sample_images

images = load_sample_images()["images"]
images = keras.layers.CenterCrop(height=70, width=120)(images)
images = keras.layers.Rescaling(scale=1 / 255)(images)

What are the dimensions?

  • 2x sample images
  • each image is 70x120 (see CenterCrop - originals were 427x640)
  • 3 color channels (RGB)
images.shape
> torch.Size([2, 70, 120, 3])

2D convolutional layer: we create 32 filters, each of size 7x7 (kernel_size=7)
For 2D convolutional layers, “2D” refers to the number of spatial dimensions (height and width).

torch.manual_seed(42)
np.random.seed(42)

conv_layer = keras.layers.Conv2D(filters=32, kernel_size=7)
fmaps = conv_layer(images)
fmaps.shape
> torch.Size([2, 64, 114, 32])

Notice that the shape is different.

After setting filters=32, we get 32 output feature maps. So instead of the intensity of red, green, or blue in each pixel, we now have 32 different feature maps. These represent the intensity of each feature at each location.

We’ll also see that the height and width have shrunk by 6 pixels. Conv2D doesn’t use any zero-padding by default, so we lose a few pixels on the sides of the output feature maps, depending on the size of the filters. Because the kernel size is 7, we lose 6 pixels horizontally and 6 pixels vertically (3 on each of the 4 sides).

We could set padding="same", such that the inputs are padded with enough 0s on all sides that the output feature maps have the same size as the inputs.
If stride > 1, then no matter if you set padding="same" or not, the output feature maps will have different sizes. For example, if you set stride=2, then the output feature maps will have half the height and width of the inputs.
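
A quick sketch of how padding and strides affect the output shape, reusing the images tensor from above (the shapes in the comments are what I’d expect given the 70x120 inputs):

conv_same = keras.layers.Conv2D(filters=32, kernel_size=7, padding="same")
print(conv_same(images).shape)     # (2, 70, 120, 32): same spatial size as the input

conv_strided = keras.layers.Conv2D(filters=32, kernel_size=7, padding="same", strides=2)
print(conv_strided(images).shape)  # (2, 35, 60, 32): roughly half the height and width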

Randomly generated filters typically act like edge detectors. This is nice because it’s useful for image processing, and that’s the type of filters a convolutional layer usually starts with.
During training, it’ll learn improved filters to recognize useful patterns for the task.

We can get the kernels and bias weights with conv_layer.get_weights(). kernels.shape is (7, 7, 3, 32): [kernel_height, kernel_width, input_channels, output_channels].
The biases array is 1D with shape [output_channels]. The number of output channels = number of output feature maps = number of filters.
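
For example:

kernels, biases = conv_layer.get_weights()
print(kernels.shape)  # (7, 7, 3, 32): [kernel_height, kernel_width, input_channels, output_channels]
print(biases.shape)   # (32,): one bias per output feature map, i.e. per filter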

The height and width of the input images do not appear in the kernel’s shape. Because all neurons in the output feature maps share the same weights, we can feed images of any size into the layer, as long as they are at least as large as the kernels and have the right number of channels.

You generally want to specify an Activation Function (e.g. ReLU) and a kernel initializer (e.g. He Initialization) when creating a Conv2D layer. This is done for the same reason as with Dense layers: a convolutional layer performs a linear operation, so stacking multiple of them without any activation function would be the same as having a single convolutional layer. The layers won’t learn anything really complex.

Memory Requirements

CNNs (the convolutional layers) take a lot of RAM.
Especially during training, as the reverse pass of Backpropagation requires all intermediate values from the forward pass.

A convolutional layer with 200 filters of size $5 \times 5$, stride 1, “same” padding:
Given a $150 \times 100$ RGB input image (so 3 channels), the number of parameters is $(5 \times 5 \times 3 + 1) \times 200 = 15{,}200$. The $+1$ is due to the bias terms.
But each of the 200 feature maps contains $150 \times 100$ neurons, and each of these needs to compute a weighted sum of its $5 \times 5 \times 3 = 75$ inputs, so we get a total of about 225 million float multiplications.
And if the feature maps are represented with 32-bit floats, the convolutional layer’s output would take up $200 \times 150 \times 100 \times 32 = 96$ million bits (12 MB) of RAM for just one instance. That would be 1.2 GB for 100 instances.

But during inference we can release the RAM occupied by one layer as soon as the next layer has been computed.
It’s during training that the issues are most severe.

If you get crashes due to out-of-memory errors, try one or more:

  • reducing mini-batch size
  • reducing dimensionality using a larger stride
  • removing a few layers
  • using 16-bit floats instead of 32-bit floats
  • distributing the CNN across multiple devices

Pooling Layers

We use pooling layers to subsample (shrink) the input image in order to reduce the computational load, memory usage, and number of parameters (reduce risk of Overfitting).

Like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field.
You define the size, stride, and padding type, as before.

But a pooling neuron has no weights. It just aggregates the inputs using an aggregation function like max or mean. A max pooling layer is the most common type. Pooling kernels are analogous to stateless sliding windows.

Max pooling also introduces some level of invariance to small translations.
In this context, invariance refers to the ability of a system to remain unaffected by certain changes or transformations in the input data.
Inserting a max pooling layer every few layers in a CNN allows us to get some level of translation invariance at a larger scale.
Max pooling also offers a small amount of rotational invariance and a slight scale invariance.

Max pooling is destructive. Even a small kernel with a small stride drops a lot of information. Invariance isn’t always desired, either (e.g. in semantic segmentation).

Implementing Pooling Layers with Keras

# 2D Max pooling layer with a 2x2 kernel size
# You can do average pooling by using `AvgPool2D` instead.
max_pool = keras.layers.MaxPool2D(pool_size=2)

Average pooling works like max pooling but computes the mean instead of the max. They aren’t as popular, though. Max pooling seems to perform better.
Intuitively, average pooling loses less information than max pooling, so why does it perform worse? Max pooling preserves only the strongest feature and drops the meaningless ones, so the next layers get a cleaner signal to work with. It also offers stronger translation invariance and requires slightly less compute.

While not common, we can do max/mean pooling along the depth dimension instead of the spatial dimensions. This lets the CNN learn to be invariant to various features, e.g. learn multiple filters, each detecting a different rotation of the same pattern. Or learn to be invariant to anything else, like thickness, brightness, skew, color, etc.
Depthwise max pooling would ensure the output is the same regardless of rotation.

class DepthPool(keras.layers.Layer):
    """
    Depthwise pooling layer.

    Reshapes inputs to split channels into groups of desired size (pool_size).
    Then applies max pooling across each group.
    Assumes stride = pool size.
    """
    def __init__(self, pool_size=2, **kwargs):
        super().__init__(**kwargs)
        self.pool_size = pool_size

    def call(self, inputs):
        shape = inputs.shape
        n_channels = shape[-1]
        n_channel_groups = n_channels // self.pool_size
        new_shape = shape[:-1] + (n_channel_groups, self.pool_size)
        return inputs.view(*new_shape).max(dim=-1)[0]

fmaps = np.random.rand(2, 70, 120, 60)

with torch.no_grad():
    output = torch.nn.functional.max_pool2d(torch.tensor(fmaps), kernel_size=(1, 3), stride=(1, 3), padding=0)

output.shape
> torch.Size([2, 70, 120, 20])

np.allclose(DepthPool(pool_size=3).forward(torch.tensor(fmaps)), output)
> True

You’ll also see a global average pooling layer often in modern architectures.

global_avg_pool = keras.layers.GlobalAvgPool2D()

This computes the mean of each entire feature map, so it just outputs a single number per feature map and per instance.
While this is very destructive, it can be useful just before the output layer.

CNN Architectures

Typically, CNN architectures stack some convolutional layers (each generally followed by a ReLU layer), then a pooling layer, then some more convolutional layers (+ReLU), and so on.
The image should get smaller and smaller as it goes through the network, but also typically deeper and deeper (i.e., with more feature maps) due to the convolutional layers.
At the top of the stack, we have a feedforward neural network consisting of a few fully connected layers (+ReLUs). The final layer outputs the prediction (e.g., a softmax layer to output estimated class probabilities, or a linear layer to output continuous values for regression).

Be careful not to use convolutional kernels that are too large.
Instead of using a convolutional layer with a $5 \times 5$ kernel, stack two layers with $3 \times 3$ kernels. This uses fewer parameters, requires fewer computations, and generally performs better.
The first convolutional layer can usually have a large kernel, though (e.g., $5 \times 5$), usually with a stride of 2 or more. This reduces the spatial dimensions of the image without losing too much information.
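
A quick back-of-the-envelope check of the parameter savings (ignoring biases, and assuming a hypothetical 64 channels in and out):

c = 64
params_5x5 = 5 * 5 * c * c            # one 5x5 convolutional layer
params_two_3x3 = 2 * (3 * 3 * c * c)  # two stacked 3x3 convolutional layers
print(params_5x5, params_two_3x3)     # 102400 vs 73728: ~28% fewer parameters, plus an extra nonlinearity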

Below is a CNN that reaches about 91% accuracy on Fashion MNIST. Not state of the art, but not bad either.

from functools import partial

# Define 'partial' function for the default convolutional layer
DefaultConv2D = partial(
    keras.layers.Conv2D, kernel_size=3, padding="same", activation="relu", kernel_initializer="he_normal"
)

# Create Sequential model
model = keras.Sequential(
    [
        # Images are 28x28 pixels, with 1 color channel (grayscale).
        keras.layers.Input(shape=[28, 28, 1]),
        # 64 filters (7x7 kernel), which is fairly large.
        # Uses default stride of 1 (input imgs are not very large).
        DefaultConv2D(filters=64, kernel_size=7),
        # Reduces size of feature maps by 2x due to default pool size of 2.
        keras.layers.MaxPool2D(),
        # Repeat, but with: 128 filters (3x3 kernel)
        DefaultConv2D(filters=128),
        DefaultConv2D(filters=128),
        keras.layers.MaxPool2D(),
        # Repeat, but with: 256 filters (3x3 kernel)
        DefaultConv2D(filters=256),
        DefaultConv2D(filters=256),
        keras.layers.MaxPool2D(),
        # Now for the fully connected network.
        # Two hidden dense layers & a dense output layer.
        # We use softmax for the output layer, since this is a multiclass classification task.
        # We also use dropout to reduce overfitting.
        keras.layers.Flatten(),
        keras.layers.Dense(units=128, activation="relu", kernel_initializer="he_normal"),
        keras.layers.Dropout(rate=0.5),
        keras.layers.Dense(units=64, activation="relu", kernel_initializer="he_normal"),
        keras.layers.Dropout(rate=0.5),
        keras.layers.Dense(units=10, activation="softmax"),
    ]
)

model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))
score = model.evaluate(X_test, y_test)

Notice that the number of filters doubles as we go deeper into the network.
It makes sense for it to grow, since the number of low-level features is rather low, but there are many ways to combine them into higher-level features.
It’s common to double the number of filters after each pooling layer: since the pooling layer divides each spatial dimension by a factor of 2, we can double the number of feature maps without fear of exploding parameters, memory usage, or computational load.

LeNet-5

Created by Yann LeCun in 1998.

AlexNet

By Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.
Won ILSVRC 2012.

It’s larger & deeper than LeNet-5. It was also the first architecture to stack convolutional layers directly on top of one another, instead of always placing a pooling layer on top of each convolutional layer.

For Regularization, they used dropout (50%) and data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

They also used a competitive normalization step immediately after the ReLU step of some of the lower convolutional layers, called local response normalization (LRN).

GoogLeNet

Developed by Christian Szegedy et al. from Google Research.
Won ILSVRC 2014.
It’s deeper than previous CNNs.
It uses subnetworks called inception modules, allowing it to use parameters much more efficiently than previous architectures. So it actually has ~10x fewer parameters than AlexNet.

In these inception modules, some convolutional layers have 1x1 kernels. Why?

  • They can’t capture spatial patterns, but they can capture patterns along the depth dimension (i.e., across channels)
  • They output fewer feature maps than their inputs, so they’re bottleneck layers (they reduce dimensionality).
    • Cuts computational cost & number of parameters. They get faster training and improved generalization.
  • Some of these are also connected to a successive convolutional layer (so we have pairs), which act like a single powerful convolutional layer.

The whole inception module can be thought of as a convolutional layer on steroids.

I’ve attempted implementing the original GoogLeNet architecture using Keras with a PyTorch backend:

from keras.models import Model
from keras.layers import Input, Conv2D, MaxPool2D, GlobalAvgPool2D, Dense, Dropout, concatenate, Layer
import torch.nn.functional as F


def inception_module(input, f1s1, f3x3_reduce, f3x3, f5x5_reduce, f5x5, fpool):
    # paths from left -> right
    # branch1: conv 1x1, stride=1, padding=same
    # branch2: conv 1x1, stride=1, padding=same -> conv 3x3, stride=1, padding=same
    # branch3: conv 1x1, stride=1, padding=same -> conv 5x5, stride=1, padding=same
    # branch4: max pool 3x3, stride=1, padding=same -> conv 1x1, stride=1, padding=same
    branch1 = Conv2D(filters=f1s1, kernel_size=(1, 1), padding="same", strides=1, activation="relu")(input)

    branch2 = Conv2D(filters=f3x3_reduce, kernel_size=(1, 1), padding="same", strides=1, activation="relu")(input)
    branch2 = Conv2D(filters=f3x3, kernel_size=(3, 3), padding="same", strides=1, activation="relu")(branch2)

    branch3 = Conv2D(filters=f5x5_reduce, kernel_size=(1, 1), padding="same", strides=1, activation="relu")(input)
    branch3 = Conv2D(filters=f5x5, kernel_size=(5, 5), padding="same", strides=1, activation="relu")(branch3)

    branch4 = MaxPool2D(pool_size=(3, 3), strides=1, padding="same")(input)
    branch4 = Conv2D(filters=fpool, kernel_size=(1, 1), padding="same", strides=1, activation="relu")(branch4)

    output = concatenate([branch1, branch2, branch3, branch4], axis=-1)
    return output


class LocalResponseNormalization(Layer):
    def __init__(self, size=5, alpha=1e-4, beta=0.75, k=2.0, **kwargs):
        super(LocalResponseNormalization, self).__init__(**kwargs)
        self.size = size
        self.alpha = alpha
        self.beta = beta
        self.k = k

    def call(self, inputs):
        square = torch.pow(inputs, 2)
        scale = self.k + self.alpha * F.avg_pool2d(square, self.size, stride=1, padding=self.size // 2)
        scale = torch.pow(scale, self.beta)
        return inputs / scale


def GoogLeNet():
    input = Input(shape=(224, 224, 3))

    x = Conv2D(filters=64, kernel_size=7, padding="same", strides=2, activation="relu")(input)
    x = MaxPool2D(pool_size=3, strides=2, padding="same")(x)
    x = LocalResponseNormalization()(x)
    x = Conv2D(filters=64, kernel_size=1, padding="same", strides=1, activation="relu")(x)
    x = Conv2D(filters=192, kernel_size=3, padding="same", strides=1, activation="relu")(x)
    x = LocalResponseNormalization()(x)
    x = MaxPool2D(pool_size=3, strides=2, padding="same")(x)

    x = inception_module(x, 64, 96, 128, 16, 32, 32)
    x = inception_module(x, 128, 128, 192, 32, 96, 64)
    x = MaxPool2D(pool_size=3, strides=2, padding="same")(x)

    x = inception_module(x, 192, 96, 208, 16, 48, 64)
    x = inception_module(x, 160, 112, 224, 24, 64, 64)
    x = inception_module(x, 128, 128, 256, 24, 64, 64)
    x = inception_module(x, 112, 144, 288, 32, 64, 64)
    x = inception_module(x, 256, 160, 320, 32, 128, 128)
    x = MaxPool2D(pool_size=3, strides=2, padding="same")(x)

    x = inception_module(x, 256, 160, 320, 32, 128, 128)
    x = inception_module(x, 384, 192, 384, 48, 128, 128)

    x = GlobalAvgPool2D()(x)
    x = Dropout(0.4)(x)
    x = Dense(units=1000, activation="softmax")(x)

    model = Model(inputs=input, outputs=x, name="GoogLeNet")
    return model


model = GoogLeNet()
model.summary()

VGGNet

By Karen Simonyan and Andrew Zisserman from the Visual Geometry Group (VGG) at Oxford University.
Was runner-up in ILSVRC 2014.

ResNet

By Kaiming He et al.
Won ILSVRC 2015.

The networks are just getting deeper (but with fewer and fewer parameters!).
The winning variant had 152 layers.

ResNet-34 is ResNet with 34 layers (only counting convolutional layers and the fully connected layer), containing 3 RUs that output 64 feature maps, 4 RUs with 128 maps, 6 with 256 maps, and 3 with 512 maps.

They were able to train this deep a network by using skip connections (or, shortcut connections), where the signal feeding into a layer is also added to the output of a layer located higher up in the stack.

When training a neural network, you are trying to model a target function $h(\mathbf{x})$. If you add the input to the output of the network (i.e., you add a skip connection), the network is forced to model $f(\mathbf{x}) = h(\mathbf{x}) - \mathbf{x}$ rather than $h(\mathbf{x})$.
This is residual learning.

A neural network with a skip connection can easily model the identity function.
When a regular network is initialized, its weights are close to zero, so the network just outputs values close to zero.
Add a skip connection and the network basically outputs a copy of its inputs (recall, the skip connection adds the input to the output of the later layers).
So if the target function is close to the identity function, the network will do a good job right away, without any training.

Adding many skip connections, the network can make progress even if several layers haven’t started learning. Some layers may block Backpropagation, which skip connections can help alleviate – the signal can make its way across the whole network. A deep residual network can be seen as a stack of residual units (RUs), where each is a small Neural Network with a skip connection.

Xception

Stands for Extreme Inception, and is a variant of the GoogLeNet architecture proposed by François Chollet (author of Keras).

It merges the ideas of GoogLeNet and ResNet (like Inception v4), but replaces the inception modules with a special type of layer called depthwise separable convolution layer (short: separable convolution layer).
While regular convolutional layers use filters that try to simultaneously capture spatial patterns (e.g., an oval) and cross-channel patterns (e.g., mouth + nose + eyes = face), a separable convolutional layer makes the strong assumption that spatial patterns and cross-channel patterns can be modeled separately.
So it is composed of two parts. First part applies a single spatial filter to each input feature map. The second part looks exclusively for cross-channel patterns (it’s just a regular convolutional layer with 1x1 filters).

Separable convolutional layers only have one spatial filter per input channel, so avoid using them after layers with too few channels (e.g. input layer).

Separable convolutional layers seem to often perform better. They use fewer parameters, less memory, and fewer computations than regular convolutional layers. Consider using them as default, except after layers with few channels.
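
Swapping one in is straightforward in Keras. A rough sketch of the parameter savings, assuming a hypothetical 64-channel input:

regular = keras.layers.Conv2D(filters=128, kernel_size=3, padding="same", activation="relu")
separable = keras.layers.SeparableConv2D(filters=128, kernel_size=3, padding="same", activation="relu")
# For a 64-channel input, the regular layer has 3*3*64*128 = 73,728 kernel weights, while the
# separable layer has 3*3*64 (depthwise) + 1*1*64*128 (pointwise) = 8,768.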

SENet

Stands for Squeeze-and-Excitation Network (SENet).
Won ILSVRC 2017.

Extends architectures like inception networks & ResNets, and boosts their performance.
Extended versions of inception networks and ResNets are called SE-Inception and SE-ResNet, respectively.
The boost comes from SENet adding a small neural network, called an SE block, to every inception module or residual unit in the original architecture.

SE blocks analyze the output of the unit they’re attached to, focusing on the depth dimension, and learn which features are usually most active together.
They use that to recalibrate the feature maps. For example, if features X, Y, and Z usually appear together, and the block sees strong X and Y but only a little Z, it will boost Z (or, more precisely, it reduces the irrelevant feature maps). This helps resolve ambiguity, in case Z would otherwise be confused with something else.

SE blocks are composed of 3 layers:

  • a global average pooling layer: computes mean activation for each feature map, e.g. if input contains 256 feature maps, this outputs 256 numbers representing the overall level of response for each filter.
  • a hidden dense layer with a Relu Activation Function: this is the ‘squeeze’ – often has 16x fewer neurons than feature maps, so the 256 numbers would get compressed into a small vector, basically an Embedding of the distribution of feature responses. This layer forces the block to learn a general representation of the feature combinations.
  • a dense output layer using the Sigmoid Activation Function: takes the embedding and outputs a recalibration vector with one number per feature map (e.g., 256), each between 0 and 1. Then the feature maps are multiplied by the recalibration vector, so irrelevant features (low recalibration scores) get scaled down while relevant features (close to 1) are left alone.
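
A minimal sketch of such an SE block using the Keras functional API (assuming 256 feature maps and the 16x reduction mentioned above; the helper name is mine):

def se_block(feature_maps, n_filters=256, ratio=16):
    # Squeeze: one number per feature map
    squeeze = keras.layers.GlobalAvgPool2D()(feature_maps)
    # Excite: compress into a small embedding, then output one 0-1 weight per feature map
    excite = keras.layers.Dense(n_filters // ratio, activation="relu")(squeeze)
    excite = keras.layers.Dense(n_filters, activation="sigmoid")(excite)
    # Recalibrate: scale each feature map by its weight
    excite = keras.layers.Reshape((1, 1, n_filters))(excite)
    return keras.layers.Multiply()([feature_maps, excite])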

Other Noteworthy Architectures

  • ResNeXt
  • DenseNet
  • MobileNet
  • CSPNet
  • EfficientNet
    • The authors of EfficientNet proposed a method to scale any CNN efficiently by jointly increasing depth (number of layers), width (number of filters per layer), and resolution (size of input images) in a principled way. Called compound scaling.
    • They used neural architecture search to find a good architecture for a scaled-down version of ImageNet (smaller and fewer images) and then used compound scaling to create larger and larger versions of the architecture.
    • EfficientNet vastly outperformed all existing models when it came out, and EfficientNets are still among the best.

Compound scaling is based on a logarithmic measure of the compute budget, $\phi$: if your compute budget doubles, $\phi$ increases by 1.
So the number of floating-point operations available for training is proportional to $2^\phi$.
Your CNN architecture’s depth, width, and resolution should scale as $\alpha^\phi$, $\beta^\phi$, and $\gamma^\phi$, respectively.
The factors $\alpha$, $\beta$, and $\gamma$ must be greater than 1, and $\alpha \cdot \beta^2 \cdot \gamma^2$ should be close to 2.
The optimal values depend on the CNN’s architecture.
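
A tiny numerical sketch, using the coefficients I believe were reported for the baseline model in the EfficientNet paper (roughly alpha=1.2, beta=1.1, gamma=1.15; treat these values as assumptions):

alpha, beta, gamma = 1.2, 1.1, 1.15
phi = 3  # compute budget of roughly 2**3 = 8x the baseline

print(alpha ** phi)                # ~1.73x deeper
print(beta ** phi)                 # ~1.33x wider
print(gamma ** phi)                # ~1.52x higher input resolution
print(alpha * beta**2 * gamma**2)  # ~1.92, i.e. close to the required 2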

Choosing the Right CNN Architecture

Depends on what matters: accuracy, model size (can matter depending on where to deploy), inference speed on CPU/GPU, and such.

Implementing a ResNet-34 CNN Using Keras

Usually you’d want to load a pretrained network instead of implementing your own.

from functools import partial

DefaultConv2D = partial(
    keras.layers.Conv2D, kernel_size=3, strides=1, padding="same", kernel_initializer="he_normal", use_bias=False
)


class ResidualUnit(keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = keras.activations.get(activation)
        self.main_layers = [
            DefaultConv2D(filters, strides=strides),
            keras.layers.BatchNormalization(),
            self.activation,
            DefaultConv2D(filters),
            keras.layers.BatchNormalization(),
        ]
        # Need skip layers for the shortcut connection if strides > 1
        # as we'll be adding the input to the output of the main layers
        # and the shape of the input will be different from the output without this
        self.skip_layers = (
            [
                DefaultConv2D(filters, kernel_size=1, strides=strides),
                keras.layers.BatchNormalization(),
            ]
            if strides > 1
            else []
        )

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)

        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)

        return self.activation(Z + skip_Z)

model = keras.Sequential([
    keras.layers.Input(shape=[224, 224, 3]),
    DefaultConv2D(64, kernel_size=7, strides=2),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
])

prev_filters = 64
# First 3 RUs have 64 filters, next 4 have 128, etc...
# We set stride=1 when number of filters is the same as in the previous RU, otherwise we set it to 2.
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2
    model.add(ResidualUnit(filters, strides=strides))
    prev_filters = filters
    
model.add(keras.layers.GlobalAvgPool2D())
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(10, activation="softmax"))

model.summary()

Using Pretrained Models from Keras

keras.applications has the model implementations and weights that you can just download and reuse.
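
For example, a sketch of loading a pretrained ResNet-50 and running it on the sample images from earlier (the exact preprocessing chain is my assumption and depends on the chosen model):

model = keras.applications.ResNet50(weights="imagenet")

# ResNet-50 expects 224x224 inputs, preprocessed with its own preprocess_input function.
resized = keras.layers.Resizing(height=224, width=224, crop_to_aspect_ratio=True)(images)
inputs = keras.applications.resnet50.preprocess_input(resized * 255)  # undo the earlier 1/255 rescaling

y_proba = model.predict(inputs)
top_k = keras.applications.resnet50.decode_predictions(y_proba, top=3)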

Classification and Localization

This is about finding an object and putting a bounding box on it.
There’s a lot of labeling to do, which can take time.
But once you have the bounding boxes for each image in your dataset, you need to create a dataset whose items are batches of preprocessed images along with their class labels and bounding boxes, i.e. items of the form (images, (class_labels, bounding_boxes)).

The boxes should be normalized so the horizontal and vertical coordinates, as well as the height and width all range from 0 to 1.
It’s also common to predict the square root of the height and width rather than the height and width directly, so e.g., a 10-pixel error for a large bounding box isn’t penalized as much as a 10-pixel error for a small bounding box.

MSE is often fine as a cost function to train the model, but it’s not great as a metric for how well it can predict bounding boxes. Use intersection over union (IoU) here. It measures the overlap between the predicted bounding box and the target bounding box, divided by the area of their union.
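
A minimal IoU helper (my own sketch, with boxes given as [x_min, y_min, x_max, y_max]):

def iou(box_a, box_b):
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143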

Object Detection

This refers to the task of classifying and localizing multiple objects in an image.

There are many object detection models openly available online, like YOLOv5, Single Shot MultiBox Detector (SSD), Faster R-CNN, EfficientDet.

We used to take a CNN that was trained to classify and locate a single object ~centered in the image, then slide it across the image and make predictions at each step. It was trained to not only make class predictions and a bounding box, but also an objectness score (estimated probability the image does contain an object centered near the middle).
This is rather simple, but it’d often detect the same object multiple times, at slightly different positions.
So we used postprocessing to get rid of unnecessary bounding boxes. We used non-max suppression for that: remove all bounding boxes where objectness score < some threshold → find remaining bounding box with highest objectness score, get rid of all other remaining boxes that overlap a lot with it (e.g. measure by IoU > 60%) → repeat step 2 until no more boxes to remove.
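
A rough sketch of those steps in plain Python (reusing the iou() helper sketched above; the names and thresholds are mine):

def non_max_suppression(boxes, scores, objectness_threshold=0.5, iou_threshold=0.6):
    # 1. Drop boxes with a low objectness score
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= objectness_threshold]
    candidates.sort(key=lambda sb: sb[0], reverse=True)
    kept = []
    # 2-3. Keep the highest-scoring box, drop boxes that overlap it too much, repeat
    while candidates:
        _, best = candidates.pop(0)
        kept.append(best)
        candidates = [(s, b) for s, b in candidates if iou(best, b) <= iou_threshold]
    return kept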

This sliding-CNN approach works pretty well, but it requires running the CNN many times, making it too slow.
You can slide a CNN across an image much faster using a fully convolutional network (FCN).

Fully Convolutional Networks

Introduced by Jonathan Long et al., for semantic segmentation (classifying every pixel in an image according to the class of the object it belongs to).

FCNs are unique because they replace the fully connected (dense) layers of standard CNNs with convolutional layers.
This is great because convolutional layers will process images of any size, while dense layers expect a specific input size. It does expect a certain number of channels, though.
Because FCNs only contain convolutional layers (and pooling layers), they can be trained & executed on images of any size.

You can actually just copy the weights from trained dense layers to convolutional layers.
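
A sketch of what that weight copy looks like (hypothetical sizes: a dense layer over a flattened 7x7x64 volume becomes an equivalent Conv2D with 7x7 "valid" kernels):

dense = keras.layers.Dense(units=10)
dense.build(input_shape=(None, 7 * 7 * 64))
w, b = dense.get_weights()                   # w.shape == (3136, 10)

conv = keras.layers.Conv2D(filters=10, kernel_size=7, padding="valid")
conv.build(input_shape=(None, 7, 7, 64))
conv.set_weights([w.reshape(7, 7, 64, 10), b])  # same computation, but now it can slide over larger inputs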

You Only Look Once

YOLO is a fast and accurate object detection architecture proposed by Joseph Redmon in 2015. It can run in real-time on a video.

There’s been many successors to the original (e.g. YOLOv2, YOLOv3, YOLO9000, YOLOv4, YOLOv5, PP-YOLO…).

Object Tracking

This is a step up from object detection. Instead of only detecting the presence of objects, we now also track them over time.
This is challenging, as objects can move, change size, their appearance can change (turn around, move to different lighting, etc.), or they may be temporarily occluded by other objects, and so on.

One object tracking system is DeepSORT, which combines classical algorithms and deep learning. It uses Kalman filters to estimate the most likely current position of an object, given prior detections, and assuming objects tend to move at a constant speed. It uses deep learning to measure resemblance between new detections and existing tracked objects. And it uses the Hungarian algorithm to map new detections to existing tracked objects (or new ones).

GitHub - theAIGuysCode/yolov4-deepsort: Object tracking implemented with YOLOv4, DeepSort, and TensorFlow.

Semantic Segmentation

Here, we classify each pixel by the class of the object it belongs to (e.g., road, car, pedestrian, building, etc.). This also means we don’t distinguish objects of the same class.

In instance segmentation, we do distinguish between objects of the same class.

Generally on Computer Vision:
Most state of the art models are based on Convolutional Neural Networks. But as of 2020, transformers have entered the space.

Some areas of research:

  • Adversarial learning: attempting to make networks more resistant to images designed to fool it
  • Explainability: understanding why the network makes a certain decision
  • Realistic image generation
  • Single-shot learning: a system that can recognize an object after seeing it just once
  • Predicting the next video frames
  • Combining text and image tasks (Multi-modal e.g. LLMs?)

Chapter 15 Processing Sequences using RNNs and CNNs

RNNs can work on sequences of arbitrary lengths, rather than fixed-size inputs.

Chapter is:

  • RNNs
    • How to train RNNs using backpropagation through time
  • Time Series Forecasting
    • With ARMA
    • And other models
  • Issues RNNs face
    • Unstable Gradients
    • Very limited short-term memory (fixed by LSTM and GRU)

For small sequences, regular DNNs can work.
For very long sequences (like audio or text), CNNs can work well, too.

Recurrent Neurons and Layers

A recurrent neural network is like a feedforward NN, except it also has connections pointing backward.

The simplest possible RNN would be one neuron receiving inputs, producing an output, and sending that output back to itself.
At time step $t$, also called a frame, the recurrent neuron receives the input $\mathbf{x}_{(t)}$ as well as its own output from the previous time step, $\hat{y}_{(t-1)}$. At the first time step, there is no previous output, so it is generally set to 0.

In a layer of recurrent neurons, at each time step $t$, every neuron receives both the input vector $\mathbf{x}_{(t)}$ and the output vector from the previous time step, $\hat{\mathbf{y}}_{(t-1)}$.

Each recurrent neuron has two sets of weights:

  • one for the inputs $\mathbf{x}_{(t)}$, called $\mathbf{w}_x$, and
  • one for the outputs of the previous time step $\hat{\mathbf{y}}_{(t-1)}$, called $\mathbf{w}_{\hat{y}}$

When we consider the whole recurrent layer, we keep all those weight vectors in two weight matrices, $\mathbf{W}_x$ and $\mathbf{W}_{\hat{y}}$.

The output vector for the recurrent layer is computed as follows, with $\mathbf{b}$ being the bias vector and $\phi$ being the activation function:

$\hat{\mathbf{y}}_{(t)} = \phi(\mathbf{W}_x^\top \mathbf{x}_{(t)} + \mathbf{W}_{\hat{y}}^\top \hat{\mathbf{y}}_{(t-1)} + \mathbf{b})$

We can compute the recurrent layer’s output in one shot for an entire mini-batch by putting all the inputs at time step $t$ in an input matrix $\mathbf{X}_{(t)}$:

$\hat{\mathbf{Y}}_{(t)} = \phi(\mathbf{X}_{(t)} \mathbf{W}_x + \hat{\mathbf{Y}}_{(t-1)} \mathbf{W}_{\hat{y}} + \mathbf{b}) = \phi([\mathbf{X}_{(t)} \; \hat{\mathbf{Y}}_{(t-1)}] \, \mathbf{W} + \mathbf{b})$

where

  • $\hat{\mathbf{Y}}_{(t)}$ is an $m \times n_\text{neurons}$ matrix containing the layer’s outputs at time step $t$ for each instance in the mini-batch. $m$ is the number of instances in the mini-batch and $n_\text{neurons}$ is the number of neurons.
  • $\mathbf{X}_{(t)}$ is an $m \times n_\text{inputs}$ matrix with the inputs for all instances. $n_\text{inputs}$ is the number of input features.
  • $\mathbf{W}_x$ is an $n_\text{inputs} \times n_\text{neurons}$ matrix containing the connection weights for the inputs of the current time step.
  • $\mathbf{W}_{\hat{y}}$ is an $n_\text{neurons} \times n_\text{neurons}$ matrix containing the connection weights for the outputs of the previous time step.
  • $\mathbf{b}$ is a vector of size $n_\text{neurons}$ containing each neuron’s bias term.
  • The weight matrices $\mathbf{W}_x$ and $\mathbf{W}_{\hat{y}}$ are often concatenated vertically into a single weight matrix $\mathbf{W}$ of shape $(n_\text{inputs} + n_\text{neurons}) \times n_\text{neurons}$.
  • $[\mathbf{X}_{(t)} \; \hat{\mathbf{Y}}_{(t-1)}]$ is the horizontal concatenation of the two matrices.
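
A minimal NumPy sketch of that per-time-step computation, with made-up sizes:

import numpy as np

rng = np.random.default_rng(42)
m, n_inputs, n_neurons = 4, 3, 5           # mini-batch size, input features, recurrent neurons

Wx = rng.normal(size=(n_inputs, n_neurons))
Wy = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

X_t = rng.normal(size=(m, n_inputs))       # inputs at time step t
Y_prev = np.zeros((m, n_neurons))          # outputs at t-1 (all zeros at the first time step)

Y_t = np.tanh(X_t @ Wx + Y_prev @ Wy + b)  # the layer's outputs at time step t
print(Y_t.shape)                           # (4, 5)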

Memory Cells

RNNs have a form of memory, since the output of a recurrent neuron at time step $t$ is a function of all the inputs from previous time steps.
A neural network can have a part that preserves some state across time steps, called a memory cell (or just cell).

A single recurrent neuron, or a layer of recurrent neurons, is a very basic cell, capable of learning only short patterns.

A cell’s state at time step $t$, denoted $\mathbf{h}_{(t)}$ (h for hidden), is a function of some inputs at that time step and its state at the previous time step: $\mathbf{h}_{(t)} = f(\mathbf{x}_{(t)}, \mathbf{h}_{(t-1)})$.

Its output at time step $t$, denoted $\hat{\mathbf{y}}_{(t)}$, is also a function of the previous state and the current inputs.

Input and Output Sequences

  • Sequence-to-sequence networks can take a sequence of inputs and produce a sequence of outputs.
    • Useful for e.g. time-series forecasting.
  • Sequence-to-vector networks can take a sequence of inputs and ignore all outputs except the last one.
    • Useful for e.g. sentiment analysis. Feed the network a sequence of words corresponding to a movie review & it’ll output a sentiment score.
  • Vector-to-sequence networks can take the same input vector over and over again, and at each time step it outputs a sequence.
    • E.g. useful for image captioning. Give it an image or the output of a CNN & the output could be a caption for it.

You can also have a sequence-to-vector network (called an encoder) followed by a vector-to-sequence network (called a decoder).
This could be used for e.g. translating between languages. Feeding the network a sentence in one language, and the encoder converts it to a single vector representation, and then the decoder decodes it into a sentence in another language.
This model is called an encoder-decoder and works better than trying to translate on the fly with a single sequence-to-sequence RNN.

Training RNNs

The trick is to unroll it through time and use regular Back Propagation.
This strategy is called backpropagation through time (BPTT).

  1. First forward pass through the unrolled network
  2. Evaluate output sequences using a loss function
  3. Propagate gradients of loss function backward through the unrolled network
  4. Perform gradient descent step to update parameters

Forecasting a Time Series

Time series: data with values at different time steps, usually at regular intervals.

Multivariate time series: there are multiple values per time step.
Univariate time series: there is only a single value per time step.

Forecasting (predicting future values) is a typical task.
Can also do imputation (filling in missing past values), classification, anomaly detection, etc.

If you can see a similar pattern (clearly) repeated every week, then that would indicate weekly seasonality.

Naive forecasting: copying past values to make the forecast.
It’s often a good baseline.

You can lag a time series, meaning you shift its values toward the right.
And if you plot the difference between the original series and the lagged series, you’re differencing.
Differencing can remove trend and seasonality from a time series. It’s easier to study a stationary time series (the statistical properties remain constant over time – no seasonality or trends). And you can easily turn them back by re-adding the values you subtracted.
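
A tiny pandas sketch (a hypothetical daily series with a weekly pattern on top of a linear trend): 7-day differencing removes the weekly pattern and leaves only the constant trend increment.

import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=28, freq="D")
weekly = np.tile([0, 1, 3, 2, 5, 8, 4], 4)              # repeating weekly pattern
s = pd.Series(np.arange(28) * 0.5 + weekly, index=idx)  # trend + seasonality

diffed = s.diff(7).dropna()              # y(t) - y(t-7)
print(diffed.unique())                   # [3.5]: only the trend's weekly increment remains
restored = diffed + s.shift(7).dropna()  # add back the subtracted values to recover the original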

When a time series is correlated with a lagged version of itself, it’s what we call autocorrelated.

MAE, MAPE, and MSE are common metrics to evaluate forecasts.

The ARMA Model Family

Autoregressive moving average (ARMA) model. Developed by Herman Wold in the 1930s.

It computes forecasts using a weighted sum of lagged values and corrects these forecasts by adding a moving average.
The moving average component is computed using a weighted sum of the last few forecast errors.

$\hat{y}_{(t)} = \sum_{i=1}^{p} \alpha_i \, y_{(t-i)} + \sum_{i=1}^{q} \theta_i \, \epsilon_{(t-i)}$ with $\epsilon_{(t)} = y_{(t)} - \hat{y}_{(t)}$.

  • $\hat{y}_{(t)}$ is the model’s forecast for time step $t$
  • $y_{(t)}$ is the time series’ value at time step $t$
  • The first sum is the weighted sum of the past $p$ values of the time series, using the learned weights $\alpha_i$. $p$ is a hyperparameter that determines how far back into the past to look. This sum is the autoregressive component (regression based on past values).
  • The second sum is the weighted sum over the past $q$ forecast errors $\epsilon_{(t-i)}$, using the learned weights $\theta_i$. $q$ is a hyperparameter. This sum is the moving average component of the model.

This model assumes the time series is stationary.

Generally, running $d$ consecutive rounds of differencing computes an approximation of the $d$th-order derivative of the time series, eliminating polynomial trends up to degree $d$. The hyperparameter $d$ is called the order of integration.

This is central to the AutoRegressive Integrated Moving Average model, introduced in 1970 by George Box and Gwilym Jenkins.

The ARIMA model runs $d$ rounds of differencing to make the time series more stationary, and then applies a regular ARMA model.
When forecasting, it uses the ARMA model and adds back the terms that were subtracted by differencing.

There’s also seasonal ARIMA, SARIMA. It models the time series in the same way as ARIMA, but also models a seasonal component for a given frequency (e.g., weekly) using the same ARIMA approach. It has 7 hyperparameters: $p$, $d$, $q$, $P$, $D$, $Q$, and $s$. The lowercase $p$, $d$, $q$ hyperparameters are the same as in ARIMA. The uppercase $P$, $D$, $Q$ hyperparameters model the seasonal pattern, and the period of that pattern is denoted $s$. So $P$, $D$, $Q$ are used to model the time series at $t - s$, $t - 2s$, $t - 3s$, and so on.

Good $p$, $q$, $P$, and $Q$ values are usually small (0 to 2, sometimes up to 5 or 6). $d$ and $D$ are usually 0 or 1, sometimes 2. $s$ is just the main seasonal pattern’s period, e.g. 7 for weekly data.
You can also analyze the autocorrelation function (ACF) and partial autocorrelation function (PACF), or minimize the AIC or BIC metrics, to select good hyperparameters.
Grid search is a good place to start.
Just be careful you don’t overfit.
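
As a starting point, a sketch using statsmodels (reusing the hypothetical daily series s from the differencing example above; the chosen orders are just illustrative):

from statsmodels.tsa.arima.model import ARIMA

# p=2, d=0, q=1, plus a weekly seasonal component with P=0, D=1, Q=1, s=7
model = ARIMA(s, order=(2, 0, 1), seasonal_order=(0, 1, 1, 7))
results = model.fit()
forecast = results.forecast(steps=7)  # forecast the next 7 days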

Forecasting Using a Deep RNN

Stacking multiple layers of cells = deep RNN.

deep_model = keras.Sequential([
    # First SimpleRNN layer: sequence-to-sequence
    # 32 units, returns sequences for next layer, input shape is [None, 1]
    # None allows variable sequence length, 1 is for single feature
    keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
    
    # Second SimpleRNN layer: sequence-to-sequence
    # 32 units, returns sequences for next layer
    keras.layers.SimpleRNN(32, return_sequences=True),
    
    # Third SimpleRNN layer: sequence-to-vector
    # 32 units, does not return sequences (only last output)
    keras.layers.SimpleRNN(32),
    
    # Final Dense layer: like vector-to-vector
    # 1 unit for single output prediction
    keras.layers.Dense(1)
])

This model architecture allows for processing sequences, extracting features through multiple RNN layers, and making a final prediction with the Dense layer.
The dense layer is important because the tanh activation function in the recurrent layer can only output values between -1 and 1, but the time series contains values from 0 to ~1.4.

Forecasting Using a Sequence-to-Sequence Model

You can forecast more than one step into the future either by predicting iteratively: predict $\hat{y}_{(t+1)}$, use that prediction as part of the input to predict $\hat{y}_{(t+2)}$, and so on.

Or you can use a sequence-to-sequence model. To get the forecast for the next e.g. 10 steps.

You can even combine the two approaches.
E.g. by continuously getting the next 10, using those as input to get the following 10.

Handling Long Sequences

RNNs do not perform well on long time series or sequences.

To train it for long sequences, we have to run it over many time steps, making the unrolled RNN very deep. Then it can suffer from Unstable Gradients.
And when an RNN processes a long sequence, it will gradually forget the first inputs in the sequence.

Fighting the Unstable Gradients Problem

Similar tricks used for DNNs work for RNNs: good parameter initialization, faster optimizers, dropout, and so on.

Nonsaturating activation functions (like ReLU) don’t help much. They can even make the RNN more unstable during training. Due to their nonsaturating nature, outputs could easily explode because the same weights are used at every time step. Nonsaturating activation functions don’t prevent that.

Can use smaller learning rate. Or just use a saturating activation function (e.g. tanh).

Gradients can also explode. You can use Gradient Clipping for that.

Batch Normalization can’t really be used efficiently. It isn’t great.
You can add it to a memory cell so it’ll be applied at every time step, but the same BN layer will be used at each time step, with the same parameters, regardless of the actual scale and offset of the inputs and hidden state. This doesn’t work well.

Layer Normalization works better with RNNs.

Tackling the Short-Term Memory Problem

RNNs lose information at each time step due to the transformations the data goes through.
This means that it has trouble with long sequences – it won’t ‘remember’ the first inputs (RNN state basically won’t have much trace of them).

Various types of cells with long-term memory have been introduced to tackle the problem. They work great, so basic cells are not used much anymore.

LSTM is one such cell. It’ll converge faster during training and it’ll detect longer-term patterns.
It works like a regular cell, but its state is split into two vectors, $\mathbf{h}_{(t)}$ and $\mathbf{c}_{(t)}$ (c for cell).
The key idea is that the network can learn what to store in the long-term state, what to throw away, and what to read from it.

The GRU cell is a simplified version of the LSTM cell, and it seems to perform just as well. Here, both state vectors are merged into a single vector $\mathbf{h}_{(t)}$.
A single gate controller $z_{(t)}$ controls both the forget gate and the input gate.
There is no output gate; the full state vector is output at every time step. However, there is a gate controller $r_{(t)}$ that controls which part of the previous state is shown to the main layer ($g_{(t)}$).

While LSTM and GRU cells work great – they can tackle much longer sequences than simple RNNs – they still have limited short-term memory, and have a hard time learning long-term patterns.
We can solve that by shortening the input sequences (e.g. using 1D convolutional layers).

Since convolutional layers are great at detecting local patterns within data, a 1D convolutional layer can scan through the sequence to identify patterns across short spans of time (local context).
For example, a 1D convolution with a kernel size of 3 might detect patterns involving three consecutive time steps.
If you use a stride greater than 1, or a pooling layer, the length of the sequence can be reduced, effectively creating a condensed version of the input sequence.
This helps the model focus on the most relevant parts of the sequence and reduces the burden on LSTM or GRU cells to maintain long-term dependencies.
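
A minimal sketch of that idea (hypothetical univariate input; the Conv1D layer with stride 2 roughly halves the sequence length before the GRU sees it):

conv_gru_model = keras.Sequential([
    keras.layers.Input(shape=[None, 1]),
    keras.layers.Conv1D(filters=32, kernel_size=4, strides=2, activation="relu"),
    keras.layers.GRU(32),    # sequence-to-vector over the shortened sequence
    keras.layers.Dense(1),   # predict the next value
])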

Chapter 16 Natural Language Processing with RNNs and Attention

Natural Language Processing tasks include:

  • Text classification
  • Translation
  • Summarization
  • Question answering

and more, of course.

RNNs are common for NLP.
Character RNNs (or, char-RNNs) are trained to predict the next character in a sentence.

Stateless RNNs learn on random portions of text at each iteration, without any information on the rest of the text.
Stateful RNNs preserve the hidden state between training iterations & continue reading where it left off, allowing it to learn longer patterns.

RNNs can be used to build an encoder-decoder for neural machine translation (NMT).

Attention mechanisms learn to select the part of the input that the rest of the model should focus on at each time step.
Attention could be used to boost the performance of an RNN-based encoder-decoder model, but the RNNs can effectively be dropped, and then you have the wildly successful Transformer model (attention is all you need).

Generating Shakespearean Text Using a Character RNN

Training took too long with the code that was provided, so I ended up rewriting it to PyTorch.
The implementation in the book would take 1h or more to run. The following only took my PC ~4 minutes.

And it generated OK text:

to be or not to be here,
i repent the clifford of a bah what flies,
that was he look their blood, mark me the senately to be now?

prospero:
they be the measure so sure:
that last she hath as it cannot be help,
with this grave bole you that stay and brawn he fall,
i understand a princes, all the love of my devil
a man of time again to she were sweet man;
it is with the duke of york water repetly forth to be dist.
import numpy as np
import os

os.environ["KERAS_BACKEND"] = "torch"

import keras
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from tqdm import tqdm

shakespeare_url = "https://homl.info/shakespeare"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

"""
Map all characters to an integer, starting at 2.
TextVectorization reserves 0 for padding tokens and 1 for unknown characters.
"""
text_vec_layer = keras.layers.TextVectorization(split="character", standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0].cpu().numpy()

encoded = torch.tensor(encoded, dtype=torch.long) - 2  # don't need tokens 0 and 1
n_tokens = text_vec_layer.vocabulary_size() - 2 # number of distinct chars
dataset_size = len(encoded)

# Convert the entire dataset to a single tensor of sequences
sequence_length = 100
stride = 1
sequences = encoded.unfold(0, sequence_length + 1, stride)

# Split the data
train_size = int(0.9 * len(sequences))
val_size = int(0.05 * len(sequences))
test_size = len(sequences) - train_size - val_size
train_data, val_data, test_data = torch.utils.data.random_split(sequences, [train_size, val_size, test_size])

# Create DataLoaders
batch_size = 1024
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=4)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False, pin_memory=True, num_workers=4)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False, pin_memory=True, num_workers=4)

# Define the model
class ShakespeareModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        output, _ = self.gru(embedded)
        return self.fc(output)

# Initialize the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ShakespeareModel(n_tokens, embedding_dim=16, hidden_dim=128).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
        inputs, targets = batch[:, :-1].to(device), batch[:, 1:].to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs.view(-1, n_tokens), targets.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            inputs, targets = batch[:, :-1].to(device), batch[:, 1:].to(device)
            outputs = model(inputs)
            val_loss += criterion(outputs.view(-1, n_tokens), targets.view(-1)).item()
    
    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {total_loss/len(train_loader):.4f}, Val Loss: {val_loss/len(val_loader):.4f}")

torch.save(model.state_dict(), "shakespeare_model.pth")

def generate_text(model, start_text, num_generate=50, temperature=1.0):
    # Convert start text to tensor
    input_sequence = text_vec_layer([start_text.lower()])[0].cpu().numpy() - 2
    input_sequence = torch.tensor(input_sequence, dtype=torch.long).unsqueeze(0).to(device)

    model.eval()
    generated_text = start_text
    
    with torch.no_grad():
        for _ in range(num_generate):
            # Get the last 'sequence_length' characters
            input_sequence = input_sequence[:, -sequence_length:]
            
            # Generate prediction
            output = model(input_sequence)
            
            # Apply temperature
            output = output[:, -1, :] / temperature
            probabilities = torch.nn.functional.softmax(output, dim=-1)
            
            # Sample from the distribution
            next_char_index = torch.multinomial(probabilities, 1).item()
            
            # Convert back to character and append to generated text
            next_char = text_vec_layer.get_vocabulary()[next_char_index + 2]  # +2 because we subtracted 2 earlier
            generated_text += next_char
            
            # Update input sequence
            input_sequence = torch.cat([input_sequence, torch.tensor([[next_char_index]], device=device)], dim=1)
    
    return generated_text

# Load the trained model
model.load_state_dict(torch.load("shakespeare_model.pth"))
model.to(device)

# Generate text
start_text = "to be or not to b"
generated_text = generate_text(model, start_text, num_generate=50, temperature=0.7)
print(generated_text)

Generating Fake Shakespearean Text

Greedy decoding: feeding the model input, having it predict the most likely next token, add it to the end of the input, then give the extended input to the model, and so on.
That often leads to repetition.

Better to sample the next token randomly, with a probability equal to the estimated probability.
Use tf.random.categorical(), which samples random class indices, given the class log probabilities (logits).

You can divide the logits by a number called the temperature to have more control over the diversity of the generated text. When the temperature is close to 0, it favors high-probability characters, and a high temperature gives all characters an equal probability.

To generate more convincing text:
You can sample only from the top characters. Or only from the smallest set of top characters whose total probability exceeds some threshold (nucleus sampling).
Or you can use beam search.
Or you can add more GRU layers and more neurons per layer, train for longer, and add regularization.
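Here's a minimal PyTorch sketch of top-k and nucleus (top-p) sampling, assuming logits is a 1D tensor of unnormalized scores over the vocabulary (the function name and arguments are my own, not from the book):

import torch

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    logits = logits / temperature
    if top_k is not None:
        # Keep only the top_k highest-scoring tokens
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = torch.where(logits < kth_best, torch.full_like(logits, float("-inf")), logits)
    if top_p is not None:
        # Nucleus sampling: keep the smallest set of tokens whose cumulative probability >= top_p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        drop = cum_probs > top_p
        drop[1:] = drop[:-1].clone()  # shift right so the token that crosses the threshold is kept
        drop[0] = False               # always keep the most likely token
        sorted_logits[drop] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()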

Stateful RNN

In stateless RNNs, the model starts each training iteration with a hidden state full of zeros, then updates it at each time step. After the last time step, it throws it away, as it isn’t needed anymore.

We can instruct it to preserve the final state after processing a training batch, and use it as the initial state for the next training batch.
This way, it can learn long-term patterns, despite only backpropagating through short sequences.
That’s called a stateful RNN.

These only make sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off.
So we need sequential and nonoverlapping input sequences.
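A minimal PyTorch sketch of the idea, assuming the batches are sequential and non-overlapping and that embedding, gru, fc, criterion, optimizer, and n_tokens are defined as in the model above (these names are assumptions on my part):

hidden = None  # zero state at the start of each epoch
for batch in sequential_loader:  # hypothetical loader yielding sequential, non-overlapping batches
    inputs, targets = batch[:, :-1].to(device), batch[:, 1:].to(device)
    optimizer.zero_grad()
    output, hidden = gru(embedding(inputs), hidden)  # reuse the previous batch's final state
    hidden = hidden.detach()  # keep the state, but cut the gradient history (truncated BPTT)
    loss = criterion(fc(output).reshape(-1, n_tokens), targets.reshape(-1))
    loss.backward()
    optimizer.step()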

Sentiment Analysis

Section does sentiment analysis on the IMDb reviews dataset, the “Hello, World!” of NLP—just as image classification on the MNIST dataset is of Computer Vision.

Tokenizing words isn’t as trivial as just separating on spaces. Different languages have different conventions.
There are approaches to tokenize and detokenize words at the subword level.
One such technique is byte pair encoding (BPE), which works by splitting the whole training set into individual characters (including spaces) and then repeatedly merging the most frequent adjacent pairs until the vocabulary reaches the desired size.
Paper: [1508.07909] Neural Machine Translation of Rare Words with Subword Units
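To make the merge procedure concrete, here's a toy BPE sketch (loosely based on the paper's reference code; it ignores some symbol-boundary edge cases):

from collections import Counter

def most_frequent_pair(vocab):
    # vocab maps space-separated symbol sequences (words) to their corpus frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, vocab):
    # Merge every occurrence of the pair into a single new symbol
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):  # perform 10 merges
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(pair, vocab)
    print(pair)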

Subword regularization can improve accuracy and robustness by introducing randomness in tokenization during training.
Paper: [1804.10959] Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Implementation: GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.
SentencePiece paper: [1808.06226] SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Consider the following network for sentiment analysis on the IMDb reviews dataset.

vocab_size = 1000
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))

embed_size = 128
tf.random.set_seed(42)
model = tf.keras.Sequential([
	text_vec_layer,
	tf.keras.layers.Embedding(vocab_size, embed_size),
	tf.keras.layers.GRU(embed_size),
	tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=2)

Quick explainer on TextVectorization: you can either pass a vocabulary or let it learn one from training data via adapt(). It’ll then analyze the dataset, determine the frequency of individual string values, and create a vocabulary from them (src). As part of the processing, it’ll standardize the given text examples and split them into substrings. When it encodes sentences, it encodes unknown words as 1s and pads any sequence shorter than the longest one with 0s.

Here, the TextVectorization layer separates tokens using spaces. We’re setting a vocabulary size of 1000 tokens, which includes the most frequent 998 words + a padding token and a token for unknown words.

The TextVectorization layer takes the text and maps it to (integer) word IDs. Then the Embedding layer maps those IDs to embeddings.
We use a GRU and a dense layer with a single neuron & a sigmoid activation function—we’re doing binary classification.

The model isn’t great.
Since the reviews have varying lengths, the shorter ones are padded with the padding token when they’re converted to sequences of token IDs.
Since most sequences aren’t as long as the longest one, they’ll end up with many padding tokens.
And given we’re using GRU—while better than a simple RNN—it forgets what the review was about after going through so many padding tokens.

The excessive padding can cause the model to focus too much on these padding tokens, which do not carry meaningful information, leading to the model’s short-term memory being overwhelmed. Consequently, the model might struggle to retain important information from the actual input sequences, resulting in poor performance.

A solution to that is to feed the model with batches of equal-length sentences.
Or to make the RNN ignore padding tokens, which can be done with masking.

Masking

With Keras, just add mask_zero=True to the Embedding layer when creating it.
Then the padding tokens will be ignored by all downstream layers.
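For instance, the sentiment model above with masking enabled would look like this (a sketch; same variables as before):

model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True),  # padding token (ID 0) gets masked
    tf.keras.layers.GRU(embed_size),
    tf.keras.layers.Dense(1, activation="sigmoid")
])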

It works by having the Embedding layer create a mask tensor equal to tf.math.not_equal(inputs, 0). This is a Boolean tensor with the same shape as the inputs, and is False anywhere the token IDs are 0, and true otherwise.
This mask tensor is automatically propagated by the model to the next layer, where it is passed to that layer’s call() method via its mask argument.

The mask will continue to propagate through the layers given that the layer’s supports_masking attribute is True.

Note that a recurrent layer’s supports_masking attribute is True when return_sequences=True; when return_sequences=False, supports_masking is False (the layer still uses the mask, but since it outputs a single vector, there’s no sequence left to mask).

Reusing Pretrained Embedding and Language Models

There are a ton of pretrained word embeddings you can use, like Google’s Word2vec, Stanford’s GloVe, and Facebook’s FastText.

The approach has its limits, though. A word will have a single representation, no matter the context.
To address this, Matthew Peters introduced Embeddings from Language Models (ELMo). Instead of just using pretrained embeddings, you use part of a pretrained language model.

Reusing pretrained language models in your own models is the norm today.
For example, you can build a classifier based on the Universal Sentence Encoder, built by a team of Google researchers. Add two Dense layers on top of it, and you can reach ~90% accuracy for sentiment analysis on the IMDb dataset—which is really good.

An Encoder-Decoder Network for Neural Machine Translation

A simple NMT model to translate English sentences to Spanish:

  • Encoder is fed English sentences as input.
  • Decoder outputs the Spanish translations.

The Spanish translations are also given to the decoder during training, except they’re shifted back by one step. So it’s given as input the word it should have output at the previous step.
That’s called teacher forcing, a technique that speeds up training and improves model performance.
For the very first word, the decoder is given a start-of-sequence (SOS) token, and the decoder is expected to end the sentence with an end-of-sequence (EOS) token.

Each word is initially represented by its ID. Then an Embedding layer returns the word embedding. Those are then fed to the encoder and the decoder.

At each step, the decoder outputs a score for each word in the output vocabulary (i.e., Spanish), then the Softmax Activation Function turns those into probabilities.
The word with the highest probability is output.
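A minimal sketch of how teacher forcing shows up in the data pipeline, assuming spanish_ids holds tokenized target sentences that already start with an SOS token and end with an EOS token (hypothetical variable names):

decoder_inputs = spanish_ids[:, :-1]   # what the decoder is fed (starts with SOS)
decoder_targets = spanish_ids[:, 1:]   # what it should predict (ends with EOS)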

Bidirectional RNNs

At each time step, the recurrent layer only looks at past and present inputs before generating its output. It’s causal – it can’t look into the future. This makes sense for Time Series, but not so much for text classification or for an encoder of a seq2seq model. There you’d like to look ahead at the next words before encoding a given word.

Run two recurrent layers on the same inputs. One reads the words from left to right and the other reads them from right to left. Then combine their outputs at each time step (typically concatenate). That’s what bidirectional recurrent layers do.
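In PyTorch this is a one-liner sketch: two GRUs read the sequence in opposite directions and their outputs are concatenated at each time step, so the output dimensionality doubles.

import torch.nn as nn

bigru = nn.GRU(input_size=128, hidden_size=128, batch_first=True, bidirectional=True)
# output shape: [batch, seq_len, 2 * 128]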

Beam search: allow the model to go back and fix mistakes it made earlier by keeping track of a short list of the $k$ most promising sentences (e.g., the top 3), and at each decoder step trying to extend each of them by one word, keeping only the $k$ most likely resulting sentences. The number $k$ is called the beam width.

At the first decoder step, the model outputs an estimated probability for each possible first word in the translated sentence. E.g. A has 75%, B has 3%, and C has 1%.
So that’s our list so far.
Then we use the model to find the next word for each sentence.
In the first sentence (A), perhaps the model outputs conditional probability 36% for X, 32% for Y, and 16% for Z.
And for the second sequence (B), it may be some other conditional probabilities for some other words.
The probabilities are conditional: e.g., given that the sequence starts with B, the probability of the next word being W might be 50%. And so on.
You’ll get one probability per word in the vocabulary for each sentence, and we do this once per sentence in the beam (so three times here).

So then we compute the probabilities of each of the (at this point) two-word sentences.
This is done by multiplying the estimated conditional probability of each word by the estimated probability of the sentence it completes.
So if A had probability 75%, and X has conditional probability 36% given A, then AX will have probability 75% × 36% = 27%.
When we’ve done this for each sentence, we keep the top 3.
And then we repeat the same process.
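A toy sketch of this procedure, where next_token_logprobs is a hypothetical function returning a dict of {token: log-probability} for the next token given a partial sequence (multiplying probabilities corresponds to adding log probabilities):

def beam_search(start_token, next_token_logprobs, beam_width=3, n_steps=10):
    beams = [(0.0, [start_token])]  # (cumulative log-probability, token sequence)
    for _ in range(n_steps):
        candidates = []
        for score, seq in beams:
            for token, logp in next_token_logprobs(seq).items():
                # log P(seq + token) = log P(seq) + log P(token | seq)
                candidates.append((score + logp, seq + [token]))
        # Keep only the beam_width most likely sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams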

Attention Mechanisms

Introduced by Bahdanau et al. in 2014.

The technique allows the decoder to focus on the appropriate words (as encoded by the encoder) at each time step.
This means the short-term memory of RNNs has much less impact—the representations don’t need to be carried as far before being used.

Bahdanau attention is used in seq2seq models.
It allows the decoder to selectively focus on different (relevant) parts of the input sequence (from the encoder) when generating each word in the output sequence.

It works through an alignment model, which is a small neural net that calculates a score (or energy) for each output from the encoder. This score reflects how well each part of the input sequence aligns with the current state of the decoder. The scores are processed through a Softmax layer to produce the attention weights, which indicate the significance of each encoder output for the current step of decoding.

The attention weights are used to create a weighted sum of the encoder outputs, guiding the decoder to generate the most appropriate word in the output sequence.

If the input sequence is $n$ words long, and assuming the output sequence is about as long, the model needs to compute about $n^2$ weights. This is usually okay, because even long sentences don’t have thousands of words.

There’s also Luong attention, or multiplicative attention.
Since the goal of the alignment model is to measure similarity between one of the encoder’s outputs and the decoder’s previous hidden state, Minh-Thang Luong et al. proposed to simply compute the dot product of the two vectors. This is often a good similarity measure, and modern hardware can compute it efficiently.
This requires both vectors to have the same dimensionality.
The dot product gives a score, all scores (at a given decoder time step) goes through a softmax layer to give the final weights.

They also proposed to just use the decoder’s hidden state at the current time step, rather than the previous time step, and then use the output of the attention mechanism directly to compute the decoder’s predictions, instead of using it to compute the decoder’s hidden state.

And they proposed a variant of the dot product mechanism where the encoder outputs first go through a fully connected layer (with no bias term) before the dot products are computed. This is called the “general” dot product approach.

These methods perform better than concatenative (additive) attention, so it isn’t used much anymore.

Attention mechanisms can be summarized as:

$\tilde{\mathbf{h}}_{(t)} = \sum_i \alpha_{(t,i)} \mathbf{y}_{(i)}$

with

$\alpha_{(t,i)} = \dfrac{\exp\left(e_{(t,i)}\right)}{\sum_{i'} \exp\left(e_{(t,i')}\right)}$

and

$e_{(t,i)} = \begin{cases} \mathbf{h}_{(t)}^\top \mathbf{y}_{(i)} & \text{dot} \\ \mathbf{h}_{(t)}^\top \mathbf{W} \mathbf{y}_{(i)} & \text{general} \\ \mathbf{v}^\top \tanh\left(\mathbf{W}\left[\mathbf{h}_{(t)}; \mathbf{y}_{(i)}\right]\right) & \text{concat} \end{cases}$

The attention layer provides a way to focus the attention of the model on part of the inputs.
But you can also think of it as acting as a differentiable memory retrieval mechanism.

Attention Is All You Need: The Original Transformer Architecture

Both the encoder and decoder contain modules that are stacked $N$ times. In Attention Is All You Need, $N = 6$. The final outputs of the whole encoder stack are fed to the decoder at each of these $N$ levels.

Encoder uses multi-head attention to update each word representation by attending to all other words in the same sentence. Essentially, this gives context to words, to make them make sense in the sentence.

The decoder’s masked multi-head attention layer does the same, except it doesn’t attend to words located after the word it’s processing. It’s causal.

The decoder’s upper MHA layer is where the decoder uses cross-attention.

The positional encodings are dense vectors that represent the position of each word in the sentence. The positional encoding is added to the word embedding of the word in each sentence.
This is necessary because all layers in transformers ignore word positions. Without positional encodings, you could shuffle the input sequences, and it would just shuffle the output sequences the same way. The order of words matters, so we give positional information to the transformer.

The first two arrows going into each MHA layer represent the keys and values. The third arrow represents the queries.
In self-attention layers, all three are equal to the word representations output by the previous layer, while in the decoder’s upper attention layer, the keys and values are equal to the encoder’s final word representations, and the queries are equal to the word representations output by the previous layer.

Positional Encodings

This is a dense vector that encodes the position of a word within a sentence. The positional encoding is added to the word embedding of the word in the sentence.

In Attention Is All You Need, they didn’t use trainable positional encodings. They used fixed encodings based on the sine and cosine functions at different frequencies.

This can give the same performance as trainable positional embeddings & can extend to arbitrarily long sentences without adding any parameters to the model.
Since they used oscillating functions (sine & cosine), the model can learn relative positions.
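A sketch of the fixed sine/cosine encodings (assuming an even d_model), which you add to the word embeddings:

import torch

def positional_encoding(max_len, d_model):
    # PE[p, 2i]   = sin(p / 10000^(2i / d_model))
    # PE[p, 2i+1] = cos(p / 10000^(2i / d_model))
    positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # [max_len, 1]
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                  # [d_model / 2]
    angle_rates = 1.0 / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * angle_rates)
    pe[:, 1::2] = torch.cos(positions * angle_rates)
    return pe  # usage: embeddings + pe[:seq_len]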

Multi-head attention

First, how does the scaled dot-product attention layer work?
Because that’s what multi-head attention is based on.

Scaled dot-product attention in vectorized form:

$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\dfrac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_{\text{keys}}}}\right)\mathbf{V}$

It’s the same as Luong attention, except for the scaling factor (see the sketch after the list below).

  • $\mathbf{Q}$ is a matrix with one row per query. Its shape is $[n_{\text{queries}}, d_{\text{keys}}]$, where $n_{\text{queries}}$ is the number of queries and $d_{\text{keys}}$ is the number of dimensions of each query and each key.
  • $\mathbf{K}$ is a matrix containing one row per key. Its shape is $[n_{\text{keys}}, d_{\text{keys}}]$, where $n_{\text{keys}}$ is the number of keys and values.
  • $\mathbf{V}$ is a matrix with one row per value. Its shape is $[n_{\text{keys}}, d_{\text{values}}]$, where $d_{\text{values}}$ is the number of dimensions of each value.
  • The shape of $\mathbf{Q}\mathbf{K}^\top$ is $[n_{\text{queries}}, n_{\text{keys}}]$: it contains one similarity score for each query/key pair.
    • The input sequence can’t be too long, to prevent this matrix from becoming huge.
    • The output of the softmax function has the same shape, but all rows sum up to 1.
    • The final output has shape $[n_{\text{queries}}, d_{\text{values}}]$: one row per query, each row representing the query result (a weighted sum of the values).
  • The scaling factor $1/\sqrt{d_{\text{keys}}}$ scales down the similarity scores to avoid saturating the softmax function, which would lead to tiny gradients.
  • You can mask out some key/value pairs by adding a very large negative value to the corresponding similarity scores, just before computing softmax – useful in masked multi-head attention.
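Here’s a small PyTorch sketch of the equation above (recent PyTorch versions also ship a built-in torch.nn.functional.scaled_dot_product_attention):

import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: [n_queries, d_keys], K: [n_keys, d_keys], V: [n_keys, d_values]
    d_keys = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_keys ** 0.5   # [n_queries, n_keys]
    if mask is not None:
        # Mask out some key/value pairs by adding a very large negative value before the softmax
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # [n_queries, d_values]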

So, a multi-head attention layer will have a bunch of ($h$) scaled dot-product attention layers. Each of these is preceded by a linear transformation of the values, keys, and queries (i.e., a time-distributed dense layer with no activation function). All outputs are concatenated and they go through a final linear transformation (again, time-distributed).

Word representations encode many different characteristics of the words. If you only used a single scaled dot-product attention layer, we’d only be able to query all of these characteristics in one shot.

Using multiple different linear transformations of values, keys, and queries allows the model to apply many different projections of the word representation into different subspaces, each focusing on a subset of the word’s characteristics.
Then the scaled dot-product attention layer implements the lookup phase. And we concatenate all the results and project them back to the original space.

An Avalanche of Transformer Models

GPT paper by Radford et al.:

  • Used a transformer-like architecture
  • Pretrained a large but simple architecture composed of a stack of 12 transformer modules using only masked multi-head attention layers
  • Trained on a large dataset, using the autoregressive technique of just predicting the next token
    • Which is a form of Self-supervised Learning
  • Fine-tuned on various language tasks

Google’s BERT:

  • Self-supervised pretraining on a large corpus
  • Similar architecture to GPT but with nonmasked multi-head attention layers only—so the model is bidirectional
    • Hence the B in BERT (Bidirectional Encoder Representations from Transformers)
  • Proposed two pretraining tasks:
    • Masked language model (MLM):
      • Each word in a sentence has a 15% probability of being masked & the model is trained to predict the masked words
      • More precisely, each selected word has an 80% chance of being masked, a 10% chance of being replaced by a random word (to reduce the discrepancy between pretraining and fine-tuning, since the model won’t see <mask> tokens during fine-tuning), and a 10% chance of being left alone (to bias the model toward the correct answer)
    • Next sentence prediction (NSP):
      • Model is trained to predict whether two sentences are consecutive or not.
      • Later research showed this was not as important as initially thought. So not used in later architectures
  • Model is trained on those tasks simultaneously
  • After the pretraining phase on a large corpus of text, they fine-tuned the model for various tasks

GPT-2

  • Similar architecture to GPT, but larger (1.5b params)
  • Could perform zero-shot learning, meaning it could achieve good performance on many tasks without fine-tuning
  • This kind of started the race to bigger and bigger models

But with the pursuit of bigger models, researchers are finding ways to downsize transformers and make them more data-efficient.

For example, we can train models using distillation, which means transferring knowledge from a teacher model to a student one. The student is often much smaller than the teacher model.
Typically done by using the teacher’s predicted probabilities for each training instance as targets for the student.
This often works better than training the student from scratch on the same dataset as the teacher.
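A minimal sketch of a distillation loss, assuming student_logits and teacher_logits come from the same batch (the temperature softens both distributions; the T² factor keeps gradient magnitudes comparable across temperatures):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Push the student's softened distribution toward the teacher's (KL divergence)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2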

Vision Transformers

One of the first applications of attention beyond NMT was generating image captions using visual attention (Show, Attend and Tell): a CNN first processes the image and outputs some feature maps, then a decoder RNN with attention generates the caption, one word at a time.

At each decoder time step (i.e., each word) the decoder uses the attention model to focus on just the right part of the image.

In October 2020, Google researchers published a paper introducing a fully transformer-based vision model, a vision transformer (ViT).
The idea is to chop the image into little 16×16 squares and treat the sequence of squares as if it were a sequence of word representations.

The squares are first flattened into 16 × 16 × 3 = 768-dimensional vectors (3 for the RGB channels). The vectors then go through a linear layer that transforms them but retains their dimensionality.
The resulting sequence of vectors can then be treated just like a sequence of word embeddings.
So add positional embeddings & pass the result to a transformer.
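A sketch of the patchify-and-project step for 224×224 RGB images, using 16×16 patches (so 14 × 14 = 196 patches of 768 values each):

import torch

def patchify(images, patch_size=16):
    # images: [batch, 3, H, W] -> [batch, n_patches, 3 * patch_size * patch_size]
    b, c, h, w = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)

linear_proj = torch.nn.Linear(3 * 16 * 16, 3 * 16 * 16)  # transforms but keeps the dimensionality
tokens = linear_proj(patchify(torch.rand(1, 3, 224, 224)))  # shape: [1, 196, 768]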

Term: inductive bias, which is an implicit assumption made by the model, due to its architecture. E.g., linear models implicitly assume the data is linear. CNNs implicitly assume that patterns learned in one location will likely be useful in other locations as well. RNNs implicitly assume that inputs are ordered, and that recent tokens are more important than older ones. The more correct inductive biases a model has, the less training data the model requires. But if they are wrong, it may perform poorly even when trained on large datasets.

Then, two months later, Facebook released DeiT (data-efficient image transformers).

And in March 2021, DeepMind released a paper introducing the Perceiver architecture (a multimodal transformer!). Transformers could previously only handle short sequences because of the performance and RAM bottleneck in the attention layers, so audio or video was excluded, and images had to be sequences of patches, rather than sequences of pixels.
The bottleneck is due to self-attention, where every token must attend to every other token. If the input sequence has $n$ tokens, the attention layer must compute an $n \times n$ matrix, which can be huge if $n$ is large.
The Perceiver solves this by gradually improving a fairly short latent representation of the inputs (composed of $M$ tokens—usually just a few hundred).
The model uses cross-attention layers, feeding them the latent representation as the queries, and the inputs as the values.
This only requires computing an $M \times n$ matrix, so the computational complexity is linear with regard to $n$, instead of quadratic.
If all goes well, the latent representation captures everything that matters in the inputs.
You can also share weights between consecutive cross-attention layers, which effectively makes the Perceiver an RNN. The shared cross-attention layers can be seen as the same memory cell at different time steps, and the latent representation corresponds to the cell’s context vector. The same inputs are repeatedly fed to the memory cell at every time step.

A month later, Mathilde Caron et al. introduced DINO: ViT trained entirely without labels, using self-supervision, capable of high-accuracy Semantic Segmentation.
Model is duplicated during training–one network is the teacher, the other is the student.
Gradient Descent only affects the student. The teacher’s weights are just an exponential moving average of the student’s weights.
Student is trained to match the teacher’s predictions. Since they’re almost the same model, this is called self-distillation.
At each training step, the input images are augmented in different ways for student and teacher, so they don’t see the same images. But their predictions must match, so they have to come up with high-level representations.
To prevent mode collapse (where both student and teacher always output the same thing, ignoring inputs), DINO keeps track of a moving average of the teacher’s outputs and tweaks the teacher’s predictions to ensure they remain centered on zero, on average.
DINO also forces the teacher to have high confidence in its predictions (this is called sharpening).

In 2021, Google researchers found out how to scale ViTs up or down, depending on the amount of data (Scaling Vision Transformers).

In March 2022, Mitchell Wortsman et al. showed that it’s possible to first train multiple transformers, then average their weights to create a new, improved model (Model Soups paper).

And now, building large multimodal models is popular. These are often capable of zero-shot or few-shot learning.
E.g. OpenAI’s CLIP proposed a large transformer model pretrained to match captions with images, allowing it to learn great image representations, and the model can then be used directly for tasks like image classification using simple text prompts.
Later, OpenAI announced DALL-E. Then DALL-E 2.

In April 2022, DeepMind released the Flamingo paper, introducing a family of models. Later, GATO, which can be used as a policy for a RL agent.

Chapter 17 Autoencoders, GANs, and Diffusion Models

Autoencoders are ANNs that can learn dense representations of the input data, called latent representations or codings, without supervision.
The codings usually have much lower dimensionality than the input data—making autoencoders great for Dimensionality Reduction.
They also act as feature detectors.
Can be used for unsupervised pretraining for deep NNs.
Some are generative, meaning they are capable of randomly generating new data that looks similar to the training data.

Autoencoders learn to copy their inputs to their outputs. Sounds trivial, but constraining the network in various ways can make it difficult, preventing it from simply copying inputs to outputs. This forces it to learn efficient ways to represent the data.
Codings are byproducts of the autoencoder learning the identity function under some constraints.

GANs can also generate data. They’re widely used for super resolution (increasing image resolution—enhance), colorization, image editing, etc.
These are composed of two neural networks: a generator that tries to generate data that looks like the training data, and a discriminator that tries to tell real data from fake data.
They essentially compete against each other during training.
Adversarial training (training competing neural networks) is a very interesting idea.
On a personal note, I recall this as being one of the most fascinating things when I first heard about it many years ago.

And a new addition to generative learning is diffusion models.
A denoising diffusion probabilistic model (DDPM) is trained to remove a tiny bit of noise from an image.
If you then take an image entirely full of Gaussian noise and repeatedly run the diffusion model on that image, a high-quality image will gradually emerge (will be similar to training data).

All these are unsupervised, they all learn latent representations, and they can all be used as generative models.

Efficient Data Representations

Identifying patterns can make data easier to remember.

For example, consider this sequence of numbers:

0 1 1 2 3 5 8 13 21 34 55

Does it look easy to remember? Perhaps not at a first glance. But once you know that it’s The Fibonacci Sequence, you understand the underlying pattern, and won’t have to memorize all the individual numbers.

Similarly, expert chess players can memorize the positions of all the pieces in a game by looking at the board for five seconds. Most other people would be challenged by this.
However, if the pieces were placed randomly, the expert players couldn’t memorize them. It’s not that they have a better memory; it’s just that they see chess patterns more easily, thanks to their experience with the game.
This helps them store information efficiently.

An autoencoder looks at the inputs, converts them to an efficient latent representation, and outputs something that looks very close to the inputs.
It’s always composed of two parts: an encoder (or recognition network) that converts the inputs to a latent representation, followed by a decoder (or generative network) that converts the internal representation to the outputs.
The number of neurons in the output layer must be equal to the number of inputs.

The outputs are often called reconstructions because the autoencoder tries to reconstruct the inputs.
The cost function contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs.

An autoencoder is undercomplete if the internal representation has a lower dimensionality than the input data. This kind of autoencoder can’t simply copy its inputs to the codings, yet it must find a way to output a copy of its inputs. So it has to learn the most important features in the input data, and drop the unimportant ones.

An overcomplete autoencoder is one where the coding layer is as large as (or larger than) the inputs.

Performing PCA with an Undercomplete Linear Autoencoder

If an autoencoder uses only linear activations and the cost function is the MSE, it ends up performing PCA.

Here’s a simple linear autoencoder implemented with PyTorch to perform PCA on a 3D dataset, projecting it to 2D:

import torch
import torch.nn as nn
import torch.optim as optim
from collections import OrderedDict

model = nn.Sequential(OrderedDict([
    ('encoder', nn.Linear(3, 2)),
    ('decoder', nn.Linear(2, 3))
]))

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.5)

n_epochs = 500

for epoch in range(n_epochs):
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, X_train)
    loss.backward()
    optimizer.step()

    if (epoch+1) % 50 == 0:
        print(f'Epoch [{epoch+1}/{n_epochs}], Loss: {loss.item():.4f}')

Stacked Autoencoders

Autoencoders can have multiple layers. Then they’re called stacked autoencoders, or deep autoencoders.
More layers help the autoencoder learn more complex codings.

Be careful not to make it too powerful. E.g. the encoder may just learn to map each input to a single arbitrary number & the decoder learns the reverse mapping. This kind of autoencoder will reconstruct perfectly but won’t have learned any useful data representation, so it won’t generalize well.

The architecture is generally symmetrical with regard to the central hidden layer (the coding layer). It’ll basically look like a sandwich.

Here’s a stacked autoencoder that’s able to reconstruct MNIST images to a somewhat recognizable state:

class StackedAutoencoder(nn.Module):
    def __init__(self):
        super(StackedAutoencoder, self).__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 100),  # Input size: 28x28 flattened to 784
            nn.ReLU(),
            nn.Linear(100, 30),
            nn.ReLU()
        )

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(30, 100),
            nn.ReLU(),
            nn.Linear(100, 28 * 28),  # Output size: 784 (28x28)
            nn.Unflatten(1, (28, 28))  # Reshape back to 28x28
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

model = StackedAutoencoder()
criterion = nn.MSELoss()
optimizer = optim.NAdam(model.parameters(), lr=0.001)

n_epochs = 30
for epoch in range(n_epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        inputs = batch[0].view(-1, 28, 28)
        outputs = model(inputs)
        loss = criterion(outputs, inputs)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    # Validation
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_valid)
        val_loss = criterion(val_outputs, X_valid)
    
    avg_loss = total_loss / len(train_loader)
    print(f'Epoch [{epoch + 1}/{n_epochs}], Loss: {avg_loss:.4f}, Validation Loss: {val_loss.item():.4f}')

You can get some pretty cool visualizations of the processed data.

Unsupervised Pretraining Using Stacked Autoencoders

If you have a large dataset, but most of it is unlabeled, you can first train a stacked autoencoder using all the data, then reuse the lower layers to create a Neural Network for your actual task and train it using the labeled data.
This is unsupervised pretraining.

Just train an autoencoder using all the training data (unlabeled + labeled) & reuse its encoder layers to create a new neural network.

Tying Weights

When the autoencoder is neatly symmetrical, it’s common to tie the weights of the decoder layers to the weights of the encoder layers. This halves the number of weights in the model, speeds up training, and limits the risk of overfitting.

If the autoencoder has a total of $N$ layers (not counting the input layer), and $\mathbf{W}_L$ represents the connection weights of the $L$th layer, then the decoder layer weights can be defined as $\mathbf{W}_{N-L+1} = \mathbf{W}_L^\top$, with $L = 1, 2, \dots, N/2$.
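A minimal PyTorch sketch of weight tying for the MNIST-sized stacked autoencoder above (the decoder reuses the transposed encoder weights but keeps its own biases):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    def __init__(self, n_inputs=784, n_hidden=100, n_codings=30):
        super().__init__()
        self.enc1 = nn.Linear(n_inputs, n_hidden)
        self.enc2 = nn.Linear(n_hidden, n_codings)
        self.dec1_bias = nn.Parameter(torch.zeros(n_hidden))
        self.dec2_bias = nn.Parameter(torch.zeros(n_inputs))

    def forward(self, x):
        h = torch.relu(self.enc1(x))
        codings = torch.relu(self.enc2(h))
        # Decoder layer L reuses the weights of encoder layer N - L + 1, transposed
        h = torch.relu(F.linear(codings, self.enc2.weight.t(), self.dec1_bias))
        return F.linear(h, self.enc1.weight.t(), self.dec2_bias)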

Training One Autoencoder at a Time

Instead of training the whole stacked autoencoder in one go, you can train one shallow autoencoder at a time, then stack all of them into a single stacked autoencoder.

This is not used as much anymore, but you may encounter it in papers talking about “greedy layerwise training.”

First phase: first autoencoder learns to reconstruct the inputs. Then encode the whole training set using the first autoencoder to get a new, compressed training set.
Second phase: train second autoencoder on the new dataset.

And so on. At the end, you build a big sandwich using all the autoencoders: first stack the hidden layers of each autoencoder, then the output layers in reverse order.

Convolutional Autoencoders

To build an autoencoder for images, you’ll want to build a convolutional autoencoder.

The encoder is a regular CNN composed of convolutional layers and pooling layers.
It typically reduces the spatial dimensionality of the inputs (i.e., height and width) while increasing depth (i.e., number of feature maps).

The decoder must do the reverse (upscaling the image and reducing depth back to the original dimensions). For this, you can use transpose convolutional layers (or combine upsampling layers with convolutional layers).
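A minimal sketch for 28×28 grayscale images (e.g., MNIST), where the decoder mirrors the encoder with transpose convolutions:

import torch.nn as nn

conv_encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(),
)
conv_decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # 7x7 -> 14x14
    nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1),   # 14x14 -> 28x28
    nn.Sigmoid(),
)
conv_ae = nn.Sequential(conv_encoder, conv_decoder)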

You can also create autoencoders with other architecture types, like RNNs.

Denoising Autoencoders

You can also force autoencoders to learn useful features by adding noise to the inputs, training it to recover the original, noise-free inputs.

The noise can be pure Gaussian noise added to the inputs. Or it can be randomly switched-off inputs, like Dropout.

This would be implemented as a regular stacked autoencoder with an additional Dropout layer applied to the encoder’s inputs. Or use a Gaussian noise layer.
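For example, the encoder of the earlier stacked autoencoder could become the following sketch (the reconstruction target stays the clean image):

import torch.nn as nn

denoising_encoder = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(0.5),  # randomly switches off half of the input pixels during training
    nn.Linear(28 * 28, 100),
    nn.ReLU(),
    nn.Linear(100, 30),
    nn.ReLU(),
)
# Alternatively, add Gaussian noise instead: noisy_x = x + 0.1 * torch.randn_like(x)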

Sparse Autoencoders

Sparsity is another constraint that leads to good feature extraction.
By adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer.
E.g. it may only have on average 5% significantly active neurons in the coding layer.
This often makes each neuron in the coding layer represent a useful feature—if you can only make few bets, you better make them good.

One way is to use the sigmoid activation function in the coding layer to constrain the codings to values between 0 and 1, use a large coding layer (e.g., 300 units), and add ℓ1 regularization to the coding layer’s activations.

An approach that often gives better results is to measure the actual sparsity of the coding layer at each training iteration, and penalize the model when the measured sparsity differs from the target sparsity.
Done by computing the average activation of each neuron in the coding layer, over the whole training batch.
You don’t want the batch size to be too small, or the mean won’t be accurate.
Once you have the mean activation per neuron, you want to penalize the neurons that are too active, or not active enough. Do this by adding a sparsity loss to the cost function.
In practice, a good approach is to use the Kullback-Leibler (KL) divergence.

Kullback-Leibler divergence:
Given two discrete probability distributions $P$ and $Q$, the KL divergence between these distributions, noted $D_{\mathrm{KL}}(P \parallel Q)$, is computed as:

$D_{\mathrm{KL}}(P \parallel Q) = \sum_i P(i) \log \dfrac{P(i)}{Q(i)}$

We want to measure the divergence between the target probability $p$ that a neuron in the coding layer will activate and the actual probability $q$ (i.e., the mean activation over the training batch).
So the sparsity loss becomes the KL divergence between the target sparsity $p$ and the actual sparsity $q$:

$D_{\mathrm{KL}}(p \parallel q) = p \log \dfrac{p}{q} + (1 - p) \log \dfrac{1 - p}{1 - q}$
Once we have the sparsity loss for each neuron in the coding layer, we sum up these losses & add the result to the cost function.
You can control the relative importance of the sparsity loss and reconstruction loss by multiplying the sparsity loss by a sparsity weight hyperparameter.
If the weight is too high, the model sticks closely to the target sparsity, but may not reconstruct the inputs properly, making the model useless. If too low, it’ll mostly ignore the sparsity objective and won’t learn any interesting features.
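A sketch of the KL-based sparsity penalty, assuming codings holds the coding layer's activations for one batch (values in [0, 1], e.g. after a sigmoid):

import torch

def kl_sparsity_loss(codings, target_sparsity=0.05, eps=1e-10):
    q = codings.mean(dim=0)  # mean activation of each coding neuron over the batch
    p = target_sparsity
    kl = p * torch.log(p / (q + eps)) + (1 - p) * torch.log((1 - p) / (1 - q + eps))
    return kl.sum()

# total_loss = reconstruction_loss + sparsity_weight * kl_sparsity_loss(codings)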

Variational Autoencoders

The Variational Autoencoder (VAE) is now one of the most popular variants of autoencoders.

They are probabilistic autoencoders: their outputs are partly determined by chance, even after training. And they are generative autoencoders, meaning they can generate new instances that look as if they were sampled from the training set.

Variational autoencoders perform variational Bayesian inference—an efficient way of carrying out approximate Bayesian inference.

Bayesian inference means updating a probability distribution based on new data, using equations derived from Bayes’ Theorem. The original distribution is called the prior, while the updated distribution is called the posterior.
In this case, we want to find a good approximation of the data distribution, so we can sample from it.

VAEs, of course, follow the general structure of autoencoders with an encoder followed by a decoder.
However, instead of directly producing a coding for a given input, the encoder produces a mean coding $\boldsymbol{\mu}$ and a standard deviation $\boldsymbol{\sigma}$.
The actual coding is sampled randomly from a Gaussian distribution with mean $\boldsymbol{\mu}$ and standard deviation $\boldsymbol{\sigma}$.
After that, the decoder decodes the sampled coding normally.

So: produce $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$, sample a coding randomly, decode the coding.

During training, the cost function pushes the codings to gradually migrate within the coding space (also called latent space) to end up looking like a cloud of Gaussian points.

After training, you can generate a new instance by sampling a random coding from the Gaussian distribution and then decoding it.

The cost function has two parts:

  • The reconstruction loss that pushes the autoencoder to reproduce its inputs (can use MSE for this)
  • The latent loss that pushes the autoencoder to have codings that look as though they were sampled from a simple Gaussian distribution. This is the KL divergence between the target distribution (i.e., the Gaussian distribution) and the actual distribution for the codings.

VAE’s latent loss:

$\mathcal{L} = -\dfrac{1}{2} \sum_{i=1}^{n} \left[1 + \log\left(\sigma_i^2\right) - \sigma_i^2 - \mu_i^2\right]$

Here, $\mathcal{L}$ is the latent loss, $n$ is the codings’ dimensionality, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the $i$th component of the codings.
The vectors $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are output by the encoder.

It’s common to make the encoder output $\gamma = \log\left(\sigma^2\right)$ rather than $\sigma$.
This approach is more numerically stable & speeds up training.
The latent loss can then be rewritten as:

$\mathcal{L} = -\dfrac{1}{2} \sum_{i=1}^{n} \left[1 + \gamma_i - \exp\left(\gamma_i\right) - \mu_i^2\right]$

Here’s my implementation of VAEs in PyTorch for MNIST:

class Sampling(nn.Module):
    def __init__(self):
        super(Sampling, self).__init__()

    def forward(self, inputs):
        mean, log_var = inputs
        eps = torch.randn_like(log_var)
        return mean + torch.exp(0.5 * log_var) * eps

class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super(Encoder, self).__init__()
        self.seq = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=input_dim, out_features=hidden_dim),
            nn.ReLU(),
            nn.Linear(in_features=hidden_dim, out_features=hidden_dim),
            nn.ReLU(),
        )
        self.fc_mean = nn.Linear(in_features=hidden_dim, out_features=latent_dim)
        self.fc_log_var = nn.Linear(in_features=hidden_dim, out_features=latent_dim)
        self.codings_fc = Sampling()
    
    def forward(self, x):
        x = self.seq(x)
        mean = self.fc_mean(x)
        log_var = self.fc_log_var(x)
        z = self.codings_fc((mean, log_var))
        return [mean, log_var, z]

class Decoder(nn.Module):
    def __init__(self, latent_dim, hidden_dim, output_dim):
        super(Decoder, self).__init__()
        self.seq = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
            # Using sigmoid to output in the range [0, 1] because we're
            # dealing with images (for MNIST)
            nn.Sigmoid()
        )

    def forward(self, z):
        return self.seq(z)


class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super(VAE, self).__init__()
        self.encoder = Encoder(input_dim, hidden_dim, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim, input_dim)

    def forward(self, x):
        mean, log_var, z = self.encoder(x)
        recon_x = self.decoder(z)
        return recon_x, mean, log_var

def vae_loss(recon_x, x, mean, log_var):
    recon_loss = nn.functional.mse_loss(recon_x, x, reduction="sum")
    kl_div = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
    return recon_loss + kl_div

input_dim = 28 * 28
hidden_dim = 150
latent_dim = 10
model = VAE(input_dim, hidden_dim, latent_dim)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# just a very basic training loop
# without early stopping, checkpoints, validation, etc.
n_epochs = 30
for epoch in range(n_epochs):
    model.train()
    train_loss = 0
    for batch in train_loader:
        x = batch[0].view(batch[0].size(0), -1).to(device)  # flatten images so their shape matches the decoder's flat output
        optimizer.zero_grad()
        recon_batch, mean, log_var = model(x)
        loss = vae_loss(recon_batch, x, mean, log_var)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    avg_train_loss = train_loss / len(train_loader)
    print(f'Epoch [{epoch + 1}/{n_epochs}], Train Loss: {avg_train_loss:.4f}')

Generative Adversarial Networks

Proposed by Ian Goodfellow et al. in 2014 (paper: Generative Adversarial Nets).

It’s composed of two networks: a generator and a discriminator.

Generator: takes a random distribution as input (usually Gaussian) and outputs some data (e.g., an image). The random input can be thought of as latent representations (i.e., codings) of the image to be generated

Discriminator: Takes either fake images from the generator or a real image from the training set as input, and must guess whether the input image is fake or real.

During training: discriminator tries to tell fake images from real images, while the generator tries to produce images that look real enough to trick the discriminator.
Since the networks have different objectives, the GAN can’t be trained like a regular neural network.
So training is divided into two phases:

  1. Train the discriminator.
    1. Sample a batch of real images from the training set and complete it with an equal number of fake images produced by the generator.
    2. Labels are 0 for fake images and 1 for real.
    3. Train discriminator on this labeled batch for one step, using binary cross-entropy loss.
    4. Backpropagation only optimizes the weights of the discriminator in this phase.
  2. Train the generator.
    1. Use it to produce another batch of fake images.
    2. Use the discriminator to tell whether the images are fake or real.
    3. This time, we don’t add real images to the batch, but set all labels to 1 (real). We want the generator to produce images that the discriminator will wrongly believe to be real.
    4. The discriminator’s weights have to be frozen during this step, so Backpropagation only affects the weights of the generator.

The generator doesn’t actually see any real images, yet gradually learns to produce convincing fake images. It only gets the gradients flowing back from the discriminator.
The better the discriminator gets, the more information about the real images is contained in the secondhand gradients, which helps the generator progress.
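A minimal sketch of one training step with both phases, assuming generator, discriminator, their optimizers, and codings_size already exist (and that the discriminator ends with a sigmoid):

import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_train_step(real_images):
    batch_size = real_images.size(0)

    # Phase 1: train the discriminator on real (label 1) + fake (label 0) images
    fake_images = generator(torch.randn(batch_size, codings_size)).detach()  # detach: don't backprop into the generator
    disc_optimizer.zero_grad()
    loss_real = bce(discriminator(real_images), torch.ones(batch_size, 1))
    loss_fake = bce(discriminator(fake_images), torch.zeros(batch_size, 1))
    (loss_real + loss_fake).backward()
    disc_optimizer.step()

    # Phase 2: train the generator; no real images, all labels set to 1
    gen_optimizer.zero_grad()
    gen_loss = bce(discriminator(generator(torch.randn(batch_size, codings_size))), torch.ones(batch_size, 1))
    gen_loss.backward()   # gradients flow back through the discriminator, but...
    gen_optimizer.step()  # ...only the generator's weights are updated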

The Difficulties of Training GANs

The generator and discriminator are playing a Zero Sum Game, constantly trying to outsmart each other. As training advances, the game may end up in a Nash Equilibrium.

The authors of the GAN paper showed that a GAN can only reach a single Nash Equilibrium: when the generator produces perfectly realistic images, and the discriminator is forced to guess (50/50).
So just train the GAN for long enough, and it’ll eventually get there, right?
Unfortunately, no. Nothing guarantees that the equilibrium will ever be reached.

The biggest challenge is mode collapse, wherein the generator’s outputs gradually become less diverse.
This can happen if the generator becomes really good at producing outputs of one kind, but then as it continues to specialize in that direction, it’ll forget how to produce anything else. And since it’ll just produce that kind of output, the discriminator only sees that, and therefore forgets how to discriminate fake outputs of other classes.
When the discriminator eventually learns to tell that class’s fakes from real images, the generator is forced to move on to another class.
So it becomes a cycle of specialization, failure, and then moving into another class.

Since the generator and discriminator are constantly pushing against each other, their parameters may end up oscillating and becoming unstable. So training may just diverge for no apparent reason, due to these instabilities.
GANs are sensitive to hyperparameters.

Some popular techniques to deal with this:

  • Experience replay
  • Mini-batch discrimination

Deep Convolutional GANs

These used to be the state of the art.

Deep Convolutional GANs (DCGANs) were proposed by Alec Radford et al. in Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.

They proposed the following guidelines for building stable convolutional GANs:

  • Replace any pooling layers with strided convolutions (in the discriminator) and transposed convolutions (in the generator)
  • Use Batch Normalization in both the generator and discriminator, except in the generator’s output layer & discriminator’s input layer
  • Remove fully connected hidden layers for deeper architectures
  • Use ReLU in the generator in all layers except the output layer (which should use tanh)
  • Use Leaky ReLU in the discriminator for all layers

These won’t always work. May need to experiment with different hyperparameters. Just changing the seed may work.

Progressive Growing of GANs

Nvidia researchers Tero Karras et al. proposed generating small images at the beginning of training, then gradually adding convolutional layers to both the generator and discriminator to produce larger and larger images.
This resembles greedy layer-wise training of stacked autoencoders.
The extra layers get added at the end of the generator and at the beginning of the discriminator, and previously trained layers remain trainable.

They also introduced other techniques to increase the diversity of the outputs (to avoid mode collapse) and make training more stable:

  • Mini-batch standard deviation layer
  • Equalized learning rate
  • Pixelwise normalization layer

StyleGANs

Proposed by Nvidia researchers Tero Karras et al. in 2018 (paper: A Style-Based Generator Architecture for Generative Adversarial Networks).

They used style transfer techniques in the generator to ensure the generated images have the same local structure as the training images, at every scale, which improves the quality of the generated images.

A StyleGAN generator is composed of two networks:

  • Mapping network: maps codings to style vectors.
    • An 8-layer MLP that maps the latent representations $\mathbf{z}$ to a vector $\mathbf{w}$. This vector is then sent through multiple affine transformations, which are dense layers with no activation functions. These produce multiple style vectors, which control the style of the generated image at different levels (from fine-grained textures like hair color to high-level features like adult or child).
  • Synthesis network: generates the images.
    • Has a learned constant input (it’s constant after training, but gets tweaked during training).
    • Processes inputs through multiple convolutional and upsampling layers.
    • Adds noise to the inputs and to all outputs of the convolutional layers (before activation function).
    • Each noise layer is followed by an adaptive instance normalization (AdaIN) layer. An AdaIN layer standardizes each feature map independently (subtract the feature map’s mean & divide by its standard deviation), then uses the style vector to determine the scale and offset of each feature map (the style vector contains one scale and one bias term per feature map). See the sketch after this list.
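A sketch of what an AdaIN layer computes (my own simplified version, not the official StyleGAN code):

import torch

def adain(feature_maps, style_scale, style_bias, eps=1e-8):
    # feature_maps: [batch, channels, H, W]; style_scale / style_bias: [batch, channels]
    mean = feature_maps.mean(dim=(2, 3), keepdim=True)
    std = feature_maps.std(dim=(2, 3), keepdim=True)
    normalized = (feature_maps - mean) / (std + eps)  # standardize each feature map independently
    return style_scale[:, :, None, None] * normalized + style_bias[:, :, None, None]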

Adding noise independently of codings is important.
Some parts of an image are somewhat random (e.g., the position of a freckle). In GANs, this randomness had to come from either the codings or noise produced by the generator itself.
If it came from the codings, the generator had to dedicate a significant portion of the codings’ representational power to store noise, which is wasteful.
If the generator produced the noise, it may not look convincing, leading to more visual artifacts. And part of the weights would be dedicated to generating noise, which is also wasteful.
These issues are avoided by adding extra noise inputs.

StyleGAN uses a technique called mixing regularization (or style mixing), where a percentage of the generated images are produced using two different codings.

Diffusion Models

Denoising diffusion probabilistic models (DDPMs) were introduced in 2020 by Jonathan Ho et al., and later improved in 2021 by OpenAI researchers Alex Nichol and Prafulla Dhariwal.
DDPMs are easier to train than GANs & the images are more diverse and of higher quality. But they take a long time to generate the images.

DDPM steps:

  1. Start with an initial image $\mathbf{x}_0$
  2. Forward process
    1. Add Gaussian noise to the image in $T$ steps (e.g., $T = 1{,}000$)
    2. At each step $t$, add noise with mean 0 and variance $\beta_t$
    3. Noise is isotropic, meaning it’s independent for each pixel
    4. Result: $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T$ (increasingly noisy versions, until the image is completely hidden by noise)

In the improved DDPM paper, $T = 4{,}000$, and the variance schedule was tweaked to change more slowly at the beginning and at the end.

The pixel values get rescaled slightly at each step, by a factor of $\sqrt{1 - \beta_t}$, to ensure the mean of the pixel values gradually approaches 0 & that the variance gradually converges to 1.

The forward process is summarized by this equation:

$q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) = \mathcal{N}\left(\sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right)$

It defines the probability distribution $q$ of $\mathbf{x}_t$ given $\mathbf{x}_{t-1}$ as a Gaussian distribution with mean $\mathbf{x}_{t-1}$ times the scaling factor $\sqrt{1 - \beta_t}$, and with a covariance matrix equal to $\beta_t \mathbf{I}$.

There’s a shortcut for the forward process: you can sample an image $\mathbf{x}_t$ given $\mathbf{x}_0$ without having to compute the steps in between. Since the sum of multiple Gaussian distributions is also a Gaussian distribution, all the noise can be added in one shot:

$q\left(\mathbf{x}_t \mid \mathbf{x}_0\right) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; \left(1 - \bar{\alpha}_t\right)\mathbf{I}\right), \quad \text{where } \bar{\alpha}_t = \prod_{s=1}^{t} \left(1 - \beta_s\right)$

But we don’t just want the version of the image that’s drowned out by noise; we want to create new images. So we train a model to do the reverse process: going from $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$, which we use to remove a bit of noise at a time until all the noise is gone.
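A sketch of the shortcut, assuming beta is the variance schedule (a 1D tensor of length T) and x0 a batch of images:

import torch

def sample_xt(x0, t, beta):
    alpha_bar = torch.cumprod(1.0 - beta, dim=0)  # alpha_bar[t] = product over s <= t of (1 - beta_s)
    noise = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise  # one-shot sample of x_t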

Latent diffusion models (Robin Rombach, Andreas Blattmann, et al.) have the diffusion process take place in latent space, rather than in pixel space. This speeds up image generation and reduces training time and cost. And the quality of the generated images is great.

And in August 2022, the powerful latent diffusion model Stable Diffusion was open sourced.

Chapter 18 Reinforcement Learning

Learning to Optimize Rewards

In RL, software Agents make observations and take actions within an environment, and in return they receive rewards from the environment.
Its objective is to learn to act in a way that will maximize its expected rewards over time.

The algorithm the agent uses to determine its actions is called its policy.

The policy could be a neural network taking observations as inputs and outputting the action to take.
It can be anything you can think of.
Doesn’t have to be deterministic.
Sometimes you don’t even have to consider the environment.

Policies involving randomness are called stochastic policies.

Policy parameters are the adjustable values that define an agent’s behavior.
Policy search is the process of finding the best policy by exploring different parameter combinations.
Policy space refers to all possible policies an agent can adopt.

You can explore policy space with various methods, like:

  • Genetic algorithms are optimization methods inspired by biological evolution, using concepts like selection, crossover, and mutation to find good solutions.
  • Optimization techniques are mathematical methods used to find the best solution to a problem, often by minimizing or maximizing a specific function.
    • Policy gradients (PG): a method that directly optimizes the policy function. It works by estimating the gradient of the expected reward with respect to the policy parameters and then updating these parameters in the direction that increases the expected reward. This approach allows for learning in continuous action spaces and can handle high-dimensional problems effectively.

Introduction to OpenAI Gym

You need an environment to train agents in.

Generally, you want a simulated environment at least for bootstrap training.
You can use e.g. PyBullet or MuJoCo for 3D physics simulation.
OpenAI Gym is a toolkit that provides various simulated environments that you can use to train agents, compare them or develop new RL algorithms.
Note: OpenAI Gym isn’t maintained anymore. Use Gymnasium.
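The basic Gymnasium API, for reference (a random policy, just to show the reset/step loop):

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
for _ in range(200):
    action = env.action_space.sample()  # random action
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()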

Neural Network Policies

Creating a NN policy for the cart pole classical control problem. There are two possible actions, so we only need one output neuron. It’ll output the probability $p$ of action 0 (left); the probability of action 1 (right) is then $1 - p$.

Why pick a random action based on probabilities rather than the action with the highest score?
To let the agent find the right balance between exploring new actions and exploiting the actions known to work well. This is the exploration/exploitation dilemma, and is central to RL.

In this environment, past actions & observations can be ignored: each observation contains the environment’s full state. If there were a hidden state, you may need to consider past actions & observations.

Evaluating Actions: The Credit Assignment Problem

Credit assignment problem: when the agent gets a reward, it’s hard for it to know which actions should get credited (or blamed) for it.

A common strategy to tackle this problem is to evaluate an action based on the sum of all the rewards that come after it.
This is usually done by applying a discount factor, $\gamma$ (gamma), at each step.
This sum of discounted rewards is called the action’s return.
Discount factors typically vary from 0.9 to 0.99. With a discount factor of 0.95, rewards 13 steps into the future count roughly half as much as immediate rewards (since $0.95^{13} \approx 0.5$).

A good action can be followed by several bad actions that cause the pole to fall, so the good action gets a low return.
We want to estimate how much better or worse an action is, compared to the other possible actions, on average. This is called action advantage.
So we run many episodes and normalize all action returns, by subtracting the mean and dividing by the std. dev. Then we can reasonably assume that actions with a negative advantage were bad while actions with a positive one were good.

Policy Gradients

PG algorithms optimize the parameters of a policy by following the gradients toward higher rewards. They directly try to optimize the policy to increase rewards.

A popular class of these algorithms is REINFORCE algorithms, introduced by Ronald Williams in 1992.

A common variant:

  1. Let NN play the game several times. At each step, compute the gradients that would make the chosen action even more likely—but don’t apply them yet.
  2. When you’ve run several episodes, compute each action’s advantage.
  3. Apply gradients
    1. If an action’s advantage is positive, it’s probably good, and you want to apply the gradients to make the action even more likely to be chosen in the future.
    2. If its advantage is negative, the action is probably bad, and you want to apply the negative gradients to make the action less likely in the future.
    3. Solution: multiply each gradient vector by the corresponding action’s advantage.
  4. Compute the mean of all resulting gradient vectors & use it to perform a Gradient Descent step.

Here’s an alternative implementation in PyTorch.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="rgb_array")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

learning_rate = 0.01
gamma = 0.99  # Discount factor for future rewards

input_dim = env.observation_space.shape[0]
hidden_dim = 128
output_dim = env.action_space.n

model = nn.Sequential(
    nn.Linear(input_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, output_dim),
    nn.Softmax(dim=-1)
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

def select_action(model, state):
    state = torch.from_numpy(state).float()
    probs = model(state)
    action = np.random.choice(len(probs.detach().numpy()), p=probs.detach().numpy())
    return action, probs[action]

def compute_discounted_rewards(rewards, gamma):
    discounted_rewards = []
    # R is the cumulative discounted reward
    R = 0
    for r in reversed(rewards):
        R = r + gamma * R
        discounted_rewards.insert(0, R)
    return discounted_rewards

def train(env, model, optimizer, num_episodes=1000):
    for episode in range(num_episodes):
        state, info = env.reset()
        log_probs = []
        rewards = []
        for t in range(1000):  # max steps per episode
            action, prob = select_action(model, state)
            state, reward, done, truncated, info = env.step(action)
            log_probs.append(torch.log(prob))  # store the log probability of the chosen action
            rewards.append(reward)
            if done or truncated:
                break

        discounted_rewards = compute_discounted_rewards(rewards, gamma)
        discounted_rewards = torch.tensor(discounted_rewards)

        # Normalize (add small constant to ensure no div by zero)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)

        policy_gradient = []
        for log_prob, reward in zip(log_probs, discounted_rewards):
            # Multiply the log probability by the corresponding normalized return (the advantage)
            # This scales the gradient by how good or bad the action turned out to be
            policy_gradient.append(-log_prob * reward)
        policy_gradient = torch.stack(policy_gradient).sum()  # result is total gradient for the episode

        optimizer.zero_grad()
        policy_gradient.backward()
        optimizer.step()

        if episode % 100 == 0:
            print(f"Episode {episode}\tTotal Reward: {sum(rewards)}")

Softmax is used to get the Probability Distribution for the actions.

The discount factor, $\gamma$ (gamma in the code), determines how much future rewards are valued relative to immediate rewards.

The model outputs a probability distribution over the possible actions, which we use to randomly select an action. We also return the log probability of the selected action, which is useful for computing the gradients later.
The action is sampled according to the probabilities. This ensures exploration of random actions based on their likelihood (recall: exploitation vs. exploration).

Cumulative discounted rewards: accumulates future rewards from the current time step until the end of the episode. Calculated by summing up the rewards obtained in the future, discounted by $\gamma$ at each step. The idea is that rewards obtained in the future are worth less than immediate rewards.
Calculated by backward iteration through the rewards obtained during the episode. For each reward $r_t$ at time step $t$, $R$ is updated: $R \leftarrow r_t + \gamma \cdot R$.

So, say we have the sequence of rewards $r_1, r_2, r_3, r_4$ with discount factor $\gamma$. We start with $R = 0$. Then the backward calculation is:

  • $t = 4$: $R = r_4$
  • $t = 3$: $R = r_3 + \gamma r_4$
  • $t = 2$: $R = r_2 + \gamma r_3 + \gamma^2 r_4$
  • $t = 1$: $R = r_1 + \gamma r_2 + \gamma^2 r_3 + \gamma^3 r_4$

We normalize the discounted rewards after computing them, so they have a mean of 0 and a standard deviation of 1, which reduces variance and stabilizes training. We also add a small constant (1e-9) to avoid division by zero:

$\hat{R}_t = \dfrac{R_t - \mu}{\sigma + \epsilon}$

where $\hat{R}_t$ is the normalized reward, $\mu$ is the mean of the discounted rewards, $\sigma$ is the standard deviation, and $\epsilon$ is a small constant.

In short:
The policy network outputs action probabilities, which are sampled to choose actions during the game. Rewards are discounted and normalized to compute the advantage of each action, and this advantage is used to scale the gradients during backpropagation. Finally, these gradients are used to update the policy network (aim is to increase likelihood of actions that lead to higher rewards).

The algorithm above is rather sample inefficient—it has to explore the game for a long time before it can make significant progress. This is because it has to run multiple episodes to compute the advantage of each action.

I see log_prob rather often and wanted to explain to myself why we use it.
First, they are the natural logarithm of the probabilities. Given probability $p$, its log probability is $\log(p)$.

There seem to be multiple reasons to use log probabilities over raw probabilities, but the important one I can see is this:
Probabilities are often small numbers. When you multiply many small probabilities together (often done for sequences of events, like Hidden Markov Models or calculating the likelihood of a sequence of actions in RL), the product can become extremely small, leading to underflow issues where the result is too close to zero for the computer to represent accurately.
By taking the logarithm, multiplication of probabilities turns into addition:

$\log(p_1 \cdot p_2 \cdots p_n) = \log(p_1) + \log(p_2) + \cdots + \log(p_n)$

This is more numerically stable than multiplication of small probabilities.
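A quick numerical illustration of the underflow argument (with arbitrary small probabilities):

import numpy as np

probs = np.full(1000, 1e-3)   # 1,000 events, each with probability 0.001

print(np.prod(probs))         # 0.0 -> the product underflows to zero
print(np.log(probs).sum())    # about -6907.76 -> the sum of logs is still representable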

Markov Decision Processes

In the early 20th century, Andrey Markov studied stochastic processes with no memory, called Markov chains.

Such a process has a fixed number of states, and it randomly evolves from one state to another at each step.
The probability for it to evolve from a state $s$ to a state $s'$ is fixed, and it depends only on the pair $(s, s')$, not on past states (hence: it has no memory).

A state is called a terminal state if the process will remain there forever.

Markov decision processes were first described by Richard Bellman in the 1950s.
They look like Markov chains, but at each step, an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action.
And some state transitions return some reward (positive or negative), and the agent’s goal is to find a policy that maximizes reward over time.

Bellman found a way to estimate the optimal state value of any state $s$, denoted $V^*(s)$.
It’s the sum of all discounted future rewards the agent can expect on average after it reaches the state, assuming it acts optimally.
If the agent acts optimally, the Bellman optimality equation applies: the optimal value of the current state is equal to the reward the agent gets on average after taking one optimal action, plus the expected optimal value of all possible next states that this action can lead to.

$V^*(s) = \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma \cdot V^*(s')\big] \quad \text{for all } s$

  • $T(s, a, s')$ is the transition probability from state $s$ to state $s'$, given that the agent chose action $a$.
  • $R(s, a, s')$ is the reward the agent gets when it goes from state $s$ to state $s'$, given that the agent chose action $a$.
  • $\gamma$ is the discount factor.

This leads to the value iteration algorithm, which can precisely estimate the optimal state value of every possible state.
Start by initializing all state value estimates to 0, then iteratively update them using the algorithm.
Given enough time, the estimates are guaranteed to converge to the optimal state values, which corresponds to the optimal policy.

The value iteration algorithm:

$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma \cdot V_k(s')\big] \quad \text{for all } s$

where $V_k(s)$ is the estimated value of state $s$ at the $k$-th iteration.
It uses Dynamic Programming.

But this doesn’t give us the optimal policy for the agent.
Bellman found a similar algorithm to estimate the optimal state-action values, called Q-values (quality values).
The optimal Q-value of the state-action pair $(s, a)$, noted $Q^*(s, a)$, is the sum of discounted future rewards the agent can expect on average after it reaches the state $s$ and chooses action $a$, but before it sees the outcome of this action (assuming it acts optimally after it).

Start by initializing all Q-value estimates to zero. Then update them with the Q-value iteration algorithm:

$Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma \cdot \max_{a'} Q_k(s', a')\big] \quad \text{for all } (s, a)$

When you have the optimal Q-values, the optimal policy (denoted $\pi^*$) is defined as follows.
When the agent is in state $s$, it should choose the action with the highest Q-value for that state:

$\pi^*(s) = \underset{a}{\operatorname{argmax}}\, Q^*(s, a)$

Code example from the book:

# shape=[s, a, s']
transition_probabilities = [
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None]
]
# Find transition probability of going from s2 to s0 after playing action a1:
# transition_probabilities[2][1][0]

# shape=[s, a, s']
rewards = [
    [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
    [[0, 0, 0], [+40, 0, 0], [0, 0, 0]]
]
# Find reward when going from state s2 to s0 after action a1:
# rewards[2][1][0] (it's 40)

possible_actions = [[0, 1, 2], [0, 2], [1]]
# Find possible actions in state 2: possible_actions[2] (only a1)


Q_values = np.full((3, 3), -np.inf)  # -np.inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q_values[state, actions] = 0.0  # for all possible actions

gamma = 0.90  # discount factor

for iteration in range(50):
    Q_prev = Q_values.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q_values[s, a] = np.sum([
                    transition_probabilities[s][a][sp]
                    * (rewards[s][a][sp] + gamma * Q_prev[sp].max())
                for sp in range(3)])

Q_values.argmax(axis=1)  # optimal action for each state

So the optimal policy for that Markov decision process (using $\gamma = 0.90$) is to choose action $a_0$ when in state $s_0$, action $a_0$ when in state $s_1$, and action $a_1$ when in state $s_2$ (the only possible action there).

Temporal Difference Learning

We can often model RL problems with discrete actions as MDPs, but the agent doesn’t initially know the transition probabilities $T(s, a, s')$ nor the rewards $R(s, a, s')$. It has to experience each state and transition at least once to know the rewards, and it must experience them multiple times to get reasonable estimates of the transition probabilities.

The temporal difference (TD) learning algorithm takes into account the fact that the agent only has partial knowledge of the MDP.
We assume the agent initially knows only the possible states and actions, and nothing more. Then it uses an exploration policy (e.g. could be a purely random policy) to explore the MDP. As it explores, the TD learning algorithm updates the estimates of the state values based on the transitions & rewards that are actually observed.

The TD learning algorithm:

$V_{k+1}(s) \leftarrow (1 - \alpha) V_k(s) + \alpha \big(r + \gamma \cdot V_k(s')\big)$

Or, equivalently:

$V_{k+1}(s) \leftarrow V_k(s) + \alpha \cdot \delta_k(s, r, s')$

with $\delta_k(s, r, s') = r + \gamma \cdot V_k(s') - V_k(s)$.

$\alpha$ is the learning rate (e.g., 0.01), $r + \gamma \cdot V_k(s')$ is called the TD target, and $\delta_k(s, r, s')$ is called the TD error.

You can also write the first form by using the notation $a \xleftarrow{\alpha} b$, which means $a_{k+1} \leftarrow (1 - \alpha) \cdot a_k + \alpha \cdot b_k$.
So we can rewrite it as:

$V(s) \xleftarrow{\alpha} r + \gamma \cdot V(s')$
For each state $s$, the algorithm keeps track of a running average of the immediate rewards the agent gets upon leaving that state, plus the rewards it expects to get later, assuming it acts optimally.
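Below is a rough sketch of this running-average idea in code. It reuses the step() and exploration_policy() helpers from the Q-learning example in the next section (so it assumes the same toy MDP); the variable names and learning-rate schedule values are assumptions:

import numpy as np

V_values = np.zeros(3)  # one value estimate per state
alpha0 = 0.05           # initial learning rate
decay = 0.005           # learning rate decay
gamma = 0.90            # discount factor
state = 0

for iteration in range(10_000):
    action = exploration_policy(state)
    next_state, reward = step(state, action)
    alpha = alpha0 / (1 + iteration * decay)
    td_target = reward + gamma * V_values[next_state]
    td_error = td_target - V_values[state]
    V_values[state] += alpha * td_error   # running average of the TD target
    state = next_state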

Q-Learning

This is an adaptation of the Q-value iteration algorithm to the situation where transition probabilities and the rewards are initially unknown.

Q-learning works by watching an agent play (e.g., randomly) and gradually improving its estimates of the Q-values.
When it has accurate Q-value estimates (or close to), then the optimal policy is just choosing the action with the highest Q-value (i.e., the greedy policy).

Q-learning algorithm:

$Q(s, a) \xleftarrow{\alpha} r + \gamma \cdot \max_{a'} Q(s', a')$

For each state-action pair $(s, a)$, the algorithm keeps track of a running average of the rewards $r$ the agent gets upon leaving state $s$ with action $a$, plus the sum of discounted future rewards it expects to get.
To estimate this sum, we take the maximum of the Q-value estimates for the next state $s'$, since we assume the target policy acts optimally from then on.

Code example from the book:

# Simulating an agent moving around in the environment
def step(state, action):
    probas = transition_probabilities[state][action]
    next_state = np.random.choice([0, 1, 2], p=probas)
    reward = rewards[state][action][next_state]
    return next_state, reward

# Using a random policy (because state space is small)
def exploration_policy(state):
    return np.random.choice(possible_actions[state])

# Initialize Q-values
np.random.seed(42)
Q_values = np.full((3, 3), -np.inf)
for state, actions in enumerate(possible_actions):
    Q_values[state][actions] = 0

# using power scheduling
alpha0 = 0.05  # initial learning rate
decay = 0.005  # learning rate decay
gamma = 0.90  # discount factor
state = 0  # initial state

for iteration in range(10_000):
    action = exploration_policy(state)
    next_state, reward = step(state, action)
    next_value = Q_values[next_state].max()  # greedy policy at the next step
    alpha = alpha0 / (1 + iteration * decay)
    Q_values[state, action] *= 1 - alpha
    Q_values[state, action] += alpha * (reward + gamma * next_value)
    state = next_state

This uses Learning Rate Scheduling (power scheduling): the learning rate decays as alpha0 / (1 + iteration * decay).

The algorithm converges to the optimal Q-values, but it takes many iterations, and potentially a lot of hyperparameter tuning.

The Q-learning algorithm is called an off-policy algorithm because the policy being trained is not necessarily the one being executed during training (here, the random exploration policy).
As opposed to the policy gradient algorithm, which is an on-policy algorithm.

Exploration Policies

Q-learning only works if the exploration policy explores the MDP thoroughly enough.

A purely random policy will visit all states and transitions many times, but it’ll take a long time.

A better option is the $\varepsilon$-greedy policy. At each step it acts randomly with probability $\varepsilon$, or greedily with probability $1 - \varepsilon$ (i.e., choosing the action with the highest Q-value).
The advantage over the random policy is that it will spend more and more time exploring the interesting parts of the environment, while still exploring other regions.
It’s common to start with a high value for $\varepsilon$ (e.g., 1.0) and gradually reduce it (e.g., down to 0.05).
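A minimal sketch of an $\varepsilon$-greedy policy, reusing Q_values and possible_actions from the Q-learning code above (the decay schedule in the comment is an assumption):

def epsilon_greedy_policy(state, epsilon):
    # Explore with probability epsilon: pick a random possible action
    if np.random.rand() < epsilon:
        return np.random.choice(possible_actions[state])
    # Exploit otherwise: impossible actions are -inf, so argmax skips them
    return int(np.argmax(Q_values[state]))

# Example schedule: start fully random, decay toward 0.05
# epsilon = max(1.0 - episode / 500, 0.05)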

Or you can encourage the exploration policy to try actions it hasn’t tried much before. This can be implemented as a bonus added to the Q-value estimates.
Q-learning using an exploration function:

$Q(s, a) \xleftarrow{\alpha} r + \gamma \cdot \max_{a'} f\big(Q(s', a'), N(s', a')\big)$

  • $N(s', a')$ counts the number of times the action $a'$ was chosen in state $s'$.
  • $f(Q, N)$ is an exploration function, such as $f(Q, N) = Q + \kappa / (1 + N)$, where $\kappa$ is a curiosity hyperparameter that measures how much the agent is attracted to the unknown.
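A small sketch of such an exploration function and how it could slot into the Q-learning loop above (the value of the curiosity hyperparameter is an assumption):

kappa = 0.5                  # curiosity hyperparameter (assumed value)
N_counts = np.zeros((3, 3))  # how many times each (state, action) pair was tried

def exploration_f(Q, N):
    # Boost the estimated value of rarely tried actions
    return Q + kappa / (1 + N)

# Inside the Q-learning loop, the greedy part of the target becomes:
# next_value = np.max(exploration_f(Q_values[next_state], N_counts[next_state]))
# ...and after each step: N_counts[state, action] += 1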

Approximate Q-Learning and Deep Q-Learning

Q-learning does not scale well to large (or even medium) MDPs with many states and actions.

The solution is to find a function $Q_{\boldsymbol{\theta}}(s, a)$ that approximates the Q-value of any state-action pair $(s, a)$ using a manageable number of parameters (given by the parameter vector $\boldsymbol{\theta}$).
This is approximate Q-learning.

It was previously recommended to use a linear combination of handcrafted features extracted from the state to estimate Q-values, but DeepMind showed that using a DNN can work much better, especially for complex problems, and it doesn’t require Feature Engineering.
A DNN used to estimate Q-values is called a deep Q-network (DQN), and using a DQN for approximate Q-learning is called deep Q-learning.

How to train a DQN
The approximate Q-value $Q_{\boldsymbol{\theta}}(s, a)$ should be as close as possible to the reward $r$ that we actually observe after playing action $a$ in state $s$, plus the discounted value of playing optimally from then on. We know this from Bellman’s equations.

To estimate the sum of future discounted rewards, execute the DQN on the next state $s'$, for all possible actions $a'$. We get an approximate future Q-value for each possible action.
Then pick the highest (since we assume optimal play) and discount it. This gives an estimate of the sum of future discounted rewards.
By summing the reward $r$ and the future discounted value estimate, we get a target Q-value $y(s, a)$ for the state-action pair $(s, a)$:

$y(s, a) = r + \gamma \cdot \max_{a'} Q_{\boldsymbol{\theta}}(s', a')$

With this target Q-value, we can run a training step using a Gradient Descent algorithm.
Minimize the squared error between the estimated Q-value $Q_{\boldsymbol{\theta}}(s, a)$ and the target Q-value $y(s, a)$, or use the Huber loss to reduce the algorithm’s sensitivity to large errors.
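Here’s a minimal PyTorch sketch of one such training step, assuming a replay buffer has already produced a batch of transitions as tensors (actions as int64, dones as floats); the names are assumptions, not the book’s code:

import torch
import torch.nn.functional as F

def dqn_training_step(dqn, optimizer, states, actions, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Q-values for all actions in the next states; take the max (assume optimal play)
        next_q = dqn(next_states).max(dim=1).values
        # Target: observed reward plus discounted future value (zero if the episode ended)
        targets = rewards + gamma * next_q * (1.0 - dones)

    # Q-values the network currently predicts for the actions that were actually taken
    q_values = dqn(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Huber loss (smooth L1) reduces sensitivity to large errors
    loss = F.smooth_l1_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()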

Catastrophic forgetting is a challenge virtually all RL algorithms face. As the agent explores the environment, it updates its policy, but what it learns in one part of the environment may break what it learned earlier in other parts.
The experiences are quite correlated, and the learning environment keeps changing. This is not ideal for Gradient Descent.
Increasing the size of the replay buffer helps, and tuning the learning rate may help too. But RL is just hard: training is often unstable, and you may need to try a lot of different hyperparameter values.

Deep Q-Learning Variants

These can stabilize and speed up training:

  • Fixed Q-value Targets
  • Double DQN
  • Prioritized Experience Replay (PER)
  • Dueling DQN
  • AlphaGo: uses a variant of Monte Carlo tree search (MCTS) based on deep neural networks. AlphaZero generalized the algorithm, so it could play other games than Go. MuZero continued to improve the algorithm.
  • Actor-critic algorithms: a family of RL algorithms that combine policy gradients with deep Q-networks.
    • DQN is trained normally, by learning from the agent’s experiences.
    • The policy net learns by the agent (actor) relying on the action values estimated by the DQN (critic).
  • Asynchronous advantage actor-critic (A3C): a variant of actor-critic wherein multiple agents learn in parallel, exploring different copies of the environment. Periodically (but asynchronously), each agent pushes weight updates to the master network, and pulls in the latest weights as well.
  • Advantage actor-critic (A2C): a variant of A3C that removes the asynchronicity. All model updates are synchronous, so gradient updates are performed over larger batches, which lets the model better utilize the GPU.
  • Soft actor-critic (SAC): a variant of actor-critic. Learns not only rewards, but also to maximize the entropy of its actions. Meaning, it tries to be as unpredictable as possible, while still maximizing rewards.
  • Proximal policy optimization (PPO): is based on A2C, but clips the loss function to avoid excessively large weight updates (as they often lead to training instabilities). It’s a simplification of the trust region policy optimization (TRPO) algorithm.
  • Curiosity-based exploration: ignore rewards, just make the agent extremely curious to explore the environment.
  • Open-ended learning (OEL): train agents capable of endlessly learning new and interesting tasks. (not achieved as of book publishing)

Chapter 19 Training and Deploying TensorFlow Models at Scale

Deploying a Model to a Mobile or Embedded Device

Can’t just push massive models to mobile/embedded devices. May not fit in the device, or use too much RAM or CPU power, and so on. Also drains battery.
So you need a lightweight and efficient model, ideally without sacrificing too much accuracy.

There are tools like TFLite to help you:

  • Reduce model size
  • Reduce amount of computations needed for predictions
  • Adapt the model to device-specific constraints

One way to reduce model size is to use smaller bit-widths. E.g., using half-floats (16 bits) instead of regular floats (32 bits). Then the model size shrinks by a factor of 2, generally at a small accuracy drop. Training is faster & you use ~half the GPU RAM.
But you can even quantize down to fixed-point, 8-bit integers.
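As a hedged sketch, post-training float16 quantization with the TFLite converter might look roughly like this (model is assumed to be an already-trained Keras model):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # store weights as half-floats
tflite_model = converter.convert()

with open("model_f16.tflite", "wb") as f:
    f.write(tflite_model)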

The problem with quantization is that you lose a bit of accuracy. If this drop is too severe, you may need quantization-aware training (adding fake quantization operations to the model so it can learn to ignore the quantization noise during training, making the final weights more robust to quantization).

Using GPUs to Speed Up Computations

Getting Your Own GPUs

Can make sense financially.
But you need to take the time to find the right GPU.
Tim Dettmers wrote a good blog post on the subject.

Some parameters that matter:

  • GPU RAM, e.g., you typically need at least 10 GB for image processing or NLP
  • Number of cores
  • Bandwidth (how fast you can send data in/out of the GPU)
  • Cooling system
  • And more.

Managing the GPU RAM

You can split your GPU into two or more logical devices, which is useful e.g., if you have one GPU but want to test a multi-GPU program.

Split GPU #0 into two logical devices, each with 2 GiB RAM:

# Run this right after importing TensorFlow
physical_gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    physical_gpus[0],
    [
        tf.config.LogicalDeviceConfiguration(memory_limit=2048),
        tf.config.LogicalDeviceConfiguration(memory_limit=2048)
    ]
)
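After this, listing the logical devices should show two logical GPUs for that physical GPU (a quick sanity check, assuming the split succeeded):

logical_gpus = tf.config.list_logical_devices("GPU")
print(len(logical_gpus))  # expect 2 for the split GPU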

Placing Operations and Variables on Devices

Generally, you’d want to place data preprocessing operations on the CPU and neural network operations on the GPU.

Avoid unnecessary data transfers in/out of the GPU as they have limited communication bandwidth.

Training Models Across Multiple Devices

Two main approaches to training a single model across multiple devices:

  • Model parallelism (model is split across devices)
  • Data parallelism (model is replicated across every device; each replica is trained on a subset of the data)

Model Parallelism

Requires chopping the model into separate chunks and running each chunk on a different device.

This is rather tricky, and its effectiveness depends on the architecture of your Neural Network. Usually not much to be gained for fully connected networks here. Splitting by layers is bad because each layer uses the input from the previous layer. Doing a vertical split isn’t great either because each half of the next layer requires the output of both halves, so there’ll be cross-device communication, which is likely to cancel out speed benefits.

Partially connected neural networks can handle it, though. E.g. Convolutional Neural Networks.

Deep RNNs can be split across multiple GPUs.
Split horizontally by placing each layer on a different device, and feed the network with an input sequence. At the first time step only one device is active (working on the sequence’s first value); at the second step two are active (the 2nd layer handles the 1st layer’s output for the first value, while the 1st layer handles the second value). When the signal has propagated to the output layer, all devices are active simultaneously.
But in practice, running a stack of LSTMs on a single GPU is much faster.

Data Parallelism

This is generally simpler and more efficient than model parallelism.

Replicate the neural network on every device and run each training step simultaneously on all replicas, using a different mini-batch for each.
The gradients computed by each replica are then averaged, and the result is used to update the model parameters.
This is data parallelism (sometimes called single program, multiple data, or SPMD).

Mirrored strategy

  • Mirror all model parameters across all the GPUs. Always apply the same parameter updates on every GPU. Replicas will be perfectly identical.
  • Can use an AllReduce algorithm to efficiently compute the mean of all gradients from all GPUs and distribute.
    • AllReduce algorithm: class of algorithms where multiple nodes collaborate to efficiently perform a reduce operation while ensuring all nodes obtain the final result.
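A minimal sketch of what the mirrored strategy looks like with Keras (the model and hyperparameters here are placeholders, not the book’s example):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f"Number of replicas: {strategy.num_replicas_in_sync}")

with strategy.scope():
    # The model and optimizer must be created inside the strategy's scope
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=[8]),
        tf.keras.layers.Dense(1)
    ])
    model.compile(loss="mse", optimizer="adam")

# model.fit(X_train, y_train, epochs=10, batch_size=64)  # each batch is split across replicas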

Centralized parameters

  • Store model parameters outside the GPU devices doing the computations (workers), e.g. on the CPU. Can place them on one or more CPU-only servers called parameter servers, whose job is to host & update the parameters.

Mirrored imposes synchronous weight updates. Centralized allows either synchronous or asynchronous updates.

Synchronous updates: aggregator waits until all gradients are available before computing avg. gradients & passes to optimizer, which updates model parameters.
This also means that a replica that has finished computing its gradients must wait for the parameters to be updated before proceeding to the next mini-batch. Some devices can be slower than others, making the process as slow as the slowest device.

Asynchronous updates: whenever a replica finishes computing gradients, they’re immediately used to update the model parameters. So no aggregation & no synchronization. Replicas are independent. This runs more training steps per minute. Parameters still do need to be copied to every device at every step, but it happens at different times for each replica.

The Google Brain team found (in 2016) that synchronous updates with a few spare replicas were more efficient than asynchronous updates: it converges faster and produces a better model. But this is an active area of research.

These updates require communicating model parameters from the parameter server to every replica at the start of each training step, and the gradients in the other direction at the end of each training step. At some point, further GPUs will no longer improve performance due to time spent moving data in and out of GPU RAM (and across the network in distributed setups). Adding more GPUs at this point will just worsen the bandwidth saturation and even slow down training.

Appendix B Autodiff

We have the function:

$f(x, y) = x^2 y + y + 2$

We need the partial derivatives $\partial f / \partial x$ and $\partial f / \partial y$.
This is usually done to perform gradient descent (or another optimization algorithm).

Can either:

  • Work out the derivatives by hand (or symbolically, e.g. with sympy) // manual differentiation
  • Finite difference approximation
  • Use autograd to calculate the derivatives // automatic differentiation

But there are a few ways to do autodiff:

  • Forward mode (propagates derivatives from the inputs to the outputs; one pass per input)
  • Reverse mode (propagates derivatives from the outputs back to the inputs; one pass per output)

Manual Differentiation

Pick up a piece of paper and use calculus to derive the appropriate equation.
This gets incredibly tedious.

Finite Difference Approximation

Unfortunately, this is imprecise and slow.

def f(x, y):
    return x**2 * y + y + 2


def derivative(f, x, y, x_eps, y_eps):
    return (f(x + x_eps, y + y_eps) - f(x, y)) / (x_eps + y_eps)


df_dx = derivative(f, 3, 4, 0.00001, 0)
df_dy = derivative(f, 3, 4, 0, 0.00001)

print(df_dx, df_dy)
> 24.000039999805264 10.000000000331966

Forward-Mode Autodiff

Algo goes through computation graph from inputs to outputs (hence ‘forward’).
Starts by getting partial derivatives of leaf nodes.
Then uses chain rule to calculate derivatives of other nodes.

Forward-Mode takes one computation graph and produces another.
This is called symbolic differentiation.
A nice byproduct of this is that we can reuse the output computation graph to calculate the derivatives of the given function for any value of $x$ and $y$.
And we can run it again on the output graph to get second-order derivatives (and so on).
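For example, a symbolic take on the same function with sympy (just to illustrate symbolic differentiation; not the book’s code):

import sympy

x, y = sympy.symbols("x y")
f = x**2 * y + y + 2

print(sympy.diff(f, x))                     # 2*x*y
print(sympy.diff(f, y))                     # x**2 + 1
print(sympy.diff(f, x).subs({x: 3, y: 4}))  # 24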

But we can also do forward-mode autodiff without creating a graph (so numerically, not symbolically) by computing intermediate results on the fly. Can use dual numbers for this.

The major flaw of forward-mode autodiff is that it’s not efficient for functions with many inputs. Not great for deep learning, where there are so many parameters.
This is where reverse-mode autodiff comes in.

Reverse-Mode Autodiff

Goes through graph in forward direction to compute values of each node, and then does a second, reverse pass to compute all the partial derivatives.

We gradually go through the graph in reverse, computing the partial derivatives of the function, w.r.t each consecutive node, until we reach the inputs.
This uses the chain rule, and is called reverse accumulation.

Reverse-mode autodiff is efficient for functions with many inputs, but not so much for many outputs.
It requires only one forward pass & one reverse pass per output to compute all the partial derivatives for all outputs, w.r.t the inputs.
When we train neural nets, there’s only one output (the loss), but many inputs.
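A tiny reverse-mode example with PyTorch’s autograd, on the same function as before; the gradients match the finite difference approximation above ($\partial f/\partial x = 2xy = 24$ and $\partial f/\partial y = x^2 + 1 = 10$ at $(3, 4)$):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

f = x**2 * y + y + 2  # forward pass: builds the computation graph
f.backward()          # reverse pass: propagates gradients back to the inputs

print(x.grad)  # tensor(24.)
print(y.grad)  # tensor(10.)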

It can also handle functions that aren’t entirely differentiable, as long as you only ask it to compute the partial derivatives at points where the function is differentiable.

Footnotes

  1. For a given point, you can think of this as an intermediate vector between the gradient vectors around that point.

  2. Named after the first author of the original paper

  3. Also: additive or concatenative attention. It’s called concatenative because it concatenates the encoder’s output with the decoder’s previous hidden state before passing them through the Dense layer to compute the alignment scores.
