In this lab, you will learn about modern convolutional architectures and use your knowledge to implement a simple but effective convnet called "squeezenet".

This lab includes the necessary theoretical explanations about convolutional neural networks and is a good starting point for developers learning about deep learning.

This lab is Part 4 of the "Keras on TPU" series. You can do them in the following order or independently.

What you'll learn

Feedback

If you see something amiss in this code lab, please tell us. Feedback can be provided through GitHub issues [feedback link].

This lab uses Google Colaboratory and requires no setup on your part. You can run it from a Chromebook. You can open this sample notebook and run through a couple of cells to familiarize yourself with Colaboratory.

Welcome to Colab.ipynb

Select a TPU backend

In the Colab menu, select Runtime > Change runtime type and then select TPU. In this code lab you will use a powerful TPU (Tensor Processing Unit) backend for hardware-accelerated training. Connection to the runtime will happen automatically on first execution, or you can use the "Connect" button in the upper-right corner.

Notebook execution

Execute cells one at a time by clicking on a cell and using Shift-ENTER. You can also run the entire notebook with Runtime > Run all.

Authentication

Most code lab notebooks will ask you to authenticate with your Google account on first execution. This allows the Colab backend to access any cloud resources where logged-in access is necessary. Watch out for the prompt in "Colab auth" cells.

Table of contents

All notebooks have a table of contents. You can open it using the black arrow on the left.

Hidden cells

Some cells will only show their title. This is a Colab-specific notebook feature. You can double-click on them to see the code inside, but it is usually not very interesting: typically support or visualization functions. You still need to run these cells for the functions inside to be defined.

In a nutshell

The code for training a model on TPU in Keras is:

tpu = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
strategy = tf.contrib.tpu.TPUDistributionStrategy(tpu)
tpu_model = tf.contrib.tpu.keras_to_tpu_model(model, strategy=strategy)

tpu_model.fit(get_training_dataset,
              steps_per_epoch=TRAIN_STEPS, epochs=EPOCHS,
              validation_data=get_validation_dataset, validation_steps=VALID_STEPS)

We will use TPUs today to build and optimize a flower classifier at interactive speeds (minutes per training run).

Why TPUs ?

Modern GPUs are organized around programmable "cores", a very flexible architecture that allows them to handle a variety of tasks such as 3D rendering, deep learning, physical simulations, etc. TPUs on the other hand pair a classic vector processor with a dedicated matrix multiply unit and excel at any task where large matrix multiplications dominate, such as neural networks.

Illustration: a dense neural network layer as a matrix multiplication, with a batch of eight images processed through the neural network at once. Please run through one line x column multiplication to verify that it is indeed doing a weighted sum of all the pixel values of an image. Convolutional layers can be represented as matrix multiplications too although it's a bit more complicated (explanation here, in section 1).

The hardware

MXU and VPU

A TPU v2 core is made of a Matrix Multiply Unit (MXU) which runs matrix multiplications and a Vector Processing Unit (VPU) for all other tasks such as activations, softmax, etc. The VPU handles float32 and int32 computations. The MXU on the other hand operates in a mixed precision 16-32 bit floating point format.

Mixed precision floating point and bfloat16

The MXU computes matrix multiplications using bfloat16 inputs and float32 outputs. Intermediate accumulations are performed in float32 precision.

Neural network training is typically resistant to the noise introduced by a reduced floating point precision. There are cases where noise even helps the optimizer converge. 16-bit floating point precision has traditionally been used to accelerate computations but float16 and float32 formats have very different ranges. Reducing the precision from float32 to float16 usually results in over and underflows. Solutions exist but additional work is typically required to make float16 work.

That is why Google introduced the bfloat16 format in TPUs. bfloat16 is a truncated float32 with exactly the same exponent bits and range as float32. This, added to the fact that TPUs compute matrix multiplications in mixed precision with bfloat16 inputs but float32 outputs, means that, typically, no code changes are necessary to benefit from the performance gains of reduced precision.
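To see why the range matters, here is a small sketch (written in eager execution, TensorFlow 2.x style, so the printed values are visible; the numbers are illustrative) comparing what happens to a large activation value in float16 versus bfloat16:

import tensorflow as tf

# A value that is perfectly ordinary for float32 ...
x = tf.constant(3.0e20, dtype=tf.float32)

# ... overflows to infinity in float16 (max ~6.5e4) but survives in bfloat16,
# which keeps the full float32 exponent range (max ~3.4e38) at reduced precision.
print(tf.cast(x, tf.float16))   # inf
print(tf.cast(x, tf.bfloat16))  # ~3.0e20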

Systolic array

The MXU implements matrix multiplications in hardware using a so-called "systolic array" architecture in which data elements flow through an array of hardware computation units. (In medicine, "systolic" refers to heart contractions and blood flow, here to the flow of data.)

The basic element of a matrix multiplication is a dot product between a line from one matrix and a column from the other matrix (see illustration at the top of this section). For a matrix multiplication Y=X*W, one element of the result would be:

Y[2,0] = X[2,0]*W[0,0] + X[2,1]*W[1,0] + X[2,2]*W[2,0] + ... + X[2,n]*W[n,0]
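You can check this with a few lines of NumPy (the shapes below are made up for the example):

import numpy as np

X = np.random.rand(8, 6)   # a batch of 8 flattened "images" of 6 values each
W = np.random.rand(6, 4)   # weights: 6 inputs, 4 neurons
Y = X @ W                  # the dense layer as a matrix multiplication

# one element of the result is a dot product of a line of X with a column of W
y20 = sum(X[2, k] * W[k, 0] for k in range(6))
print(np.allclose(Y[2, 0], y20))  # True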

On a GPU, one would program this dot product into a GPU "core" and then execute it on as many "cores" as are available in parallel to try and compute every value of the resulting matrix at once. If the resulting matrix is 128x128, that would require 128x128=16K "cores" to be available, which is typically not possible. The largest GPUs have around 4000 cores. A TPU on the other hand uses the bare minimum of hardware for the compute units in the MXU: just bfloat16 x bfloat16 => float32 multiply-accumulators, nothing else. These are so small that a TPU can implement 16K of them in a 128x128 MXU and process this matrix multiplication in one go.

Illustration: the MXU systolic array. The compute elements are multiply-accumulators. The values of one matrix are loaded into the array (red dots). Values of the other matrix flow through the array (grey dots). Vertical lines propagate the values up. Horizontal lines propagate partial sums. It is left as an exercise to the user to verify that as the data flows through the array, you get the result of the matrix multiplication coming out of the right side.

In addition to that, while the dot products are being computed in an MXU, intermediate sums just flow between adjacent compute units. They do not need to be stored and retrieved to/from memory or even a register file. The end result is that the TPU systolic array architecture has a significant density and power advantage, as well as a non-negligible speed advantage over a GPU, when computing matrix multiplications.

Cloud TPU

When you request one "Cloud TPU v2" on Google Cloud Platform, you get a virtual machine (VM) which has a PCI-attached TPU board. The TPU board has four dual-core TPU chips. Each TPU core features a VPU (Vector Processing Unit) and a 128x128 MXU (MatriX multiply Unit). This "Cloud TPU" is then usually connected through the network to the VM that requested it. So the full picture looks like this:

Illustration: your VM with a network-attached "Cloud TPU" accelerator. "The Cloud TPU" itself is made of a VM with a PCI-attached TPU board with four dual-core TPU chips on it.

TPU pods

In Google's data centers, TPUs are connected to a high-performance computing (HPC) interconnect which can make them appear as one very large accelerator. Google calls them pods and they can encompass up to 512 TPU v2 cores. TPU v3 pods are even more powerful.

Illustration: a TPU v3 pod. TPU boards and racks connected through HPC interconnect.

During training, gradients are exchanged between TPU cores using the all-reduce algorithm (good explanation of all-reduce here). The model being trained can take advantage of the hardware by training on large batch sizes.

Illustration: synchronization of gradients during training using the all-reduce algorithm on Google TPU's 2-D toroidal mesh HPC network.

The software

Large batch size training

The ideal batch size for TPUs is 128 data items per TPU core but the hardware can already show good utilization from 8 data items per TPU core. Remember that one Cloud TPU has 8 cores.

In this code lab, we will be using the Keras API. In Keras, the batch size automatically becomes the per-core batch size when running on TPU. It is not something you need to adjust in your code, but under the hood, you will be training with an 8 times larger batch size.
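For example, sticking with the recommended per-core batch size (a sketch; BATCH_SIZE is whatever constant your notebook defines):

BATCH_SIZE = 128   # what you pass to Keras: the per-core batch size on TPU
TPU_CORES = 8      # one Cloud TPU v2 has 8 cores

global_batch_size = BATCH_SIZE * TPU_CORES  # 1024 images processed per training step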

For additional performance tips see the TPU Performance Guide. For very large batch sizes, special care might be needed in some models, see LARSOptimizer for more details.

Under the hood: XLA

Tensorflow programs define computation graphs. The TPU does not directly run Python code, it runs the computation graph defined by your Tensorflow program. Under the hood, a compiler called XLA (Accelerated Linear Algebra) transforms the Tensorflow graph of computation nodes into TPU machine code. This compiler also performs many advanced optimizations on your code and your memory layout. The compilation happens automatically as work is sent to the TPU. You do not have to include XLA in your build chain explicitly.

Illustration: to run on TPU, the computation graph defined by your Tensorflow program is first translated to an XLA (Accelerated Linear Algebra) representation, then compiled by XLA into TPU machine code.

Using TPUs in Keras

TPUs are supported through the Keras API as of Tensorflow 1.12. Keras support is limited to 8 cores or one Cloud TPU for now. Here is an example:

tpu = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
strategy = tf.contrib.tpu.TPUDistributionStrategy(tpu)
tpu_model = tf.contrib.tpu.keras_to_tpu_model(model, strategy=strategy)

tpu_model.fit(get_training_dataset,
              steps_per_epoch=TRAIN_STEPS, epochs=EPOCHS,
              validation_data=get_validation_dataset, validation_steps=VALID_STEPS)

In this code snippet, TPUClusterResolver locates the Cloud TPU over the network, TPUDistributionStrategy distributes the work across its cores, and keras_to_tpu_model converts your Keras model into an equivalent model that runs on the TPU. Notice that fit is given functions that return the training and validation datasets.

Common TPU porting tasks

Using TPUs with Estimator API

Porting an Estimator model to the TPUEstimator API is more involved but also allows additional flexibility and enables support for TPU pods. The documentation describing the process is here and you can find a commented before/after TPUEstimator porting example here:

In a nutshell

If all the terms in bold in the next paragraph are already known to you, you can move to the next exercise. If you are just starting in deep learning then welcome, and please read on.

For models built as a sequence of layers Keras offers the Sequential API. For example, an image classifier using three dense layers can be written in Keras as:

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[192, 192, 3]),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(5, activation='softmax') # classifying into 5 classes
])

# this configures the training of the model. Keras calls it "compiling" the model.
model.compile(
  optimizer='adam',
  loss='categorical_crossentropy',
  metrics=['accuracy']) # % of correct answers

# train the model
model.fit(dataset, ... )

Dense neural network

This is the simplest neural network for classifying images. It is made of "neurons" arranged in layers. The first layer processes input data and feeds its outputs into other layers. It is called "dense" because each neuron is connected to all the neurons in the previous layer.

You can feed an image into such a network by flattening the RGB values of all of its pixels into a long vector and using it as inputs. It is not the best technique for image recognition but we will improve on it later.

Neurons, activations, RELU

A "neuron" computes a weighted sum of all of its inputs, adds a value called "bias" and feeds the result through a so called "activation function". The weights and bias are unknown at first. They will be initialized at random and "learned" by training the neural network on lots of known data.

The most popular activation function is called RELU for Rectified Linear Unit. It is a very simple function as you can see on the graph above.

Softmax activation

The network above ends with a 5-neuron layer because we are classifying flowers into 5 categories (rose, tulip, dandelion, daisy, sunflower). Neurons in intermediate layers are activated using the classic RELU activation function. In the last layer though, we want to compute numbers between 0 and 1 representing the probability of this flower being a rose, a tulip and so on. For this, we will use an activation function called "softmax".

Applying softmax on a vector is done by taking the exponential of each element and then normalising the vector, typically using the L1 norm (sum of absolute values) so that the values add up to 1 and can be interpreted as probabilities.
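In code, softmax looks like this (a NumPy sketch; production implementations also subtract the largest element before exponentiating, for numerical stability):

import numpy as np

def softmax(logits):
    exps = np.exp(logits)        # exponential of each element
    return exps / np.sum(exps)   # normalise so that the values add up to 1

print(softmax(np.array([2.0, 1.0, 0.1, -1.0, 0.5])))  # five "probabilities" summing to 1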

Cross-entropy loss

Now that our neural network produces predictions from input images, we need to measure how good they are, i.e. the distance between what the network tells us and the correct answers, often called "labels". Remember that we have correct labels for all the images in the dataset.

Any distance would work, but for classification problems the so-called "cross-entropy distance" is the most effective. We will call this our error or "loss" function.
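For one-hot encoded labels and softmax predictions, the cross-entropy is the negative sum of label times log(prediction). Here is a minimal NumPy sketch (Keras computes this for you when you specify loss='categorical_crossentropy'):

import numpy as np

def cross_entropy(one_hot_labels, predictions):
    return -np.sum(one_hot_labels * np.log(predictions))

labels = np.array([0, 0, 1, 0, 0])            # correct class: the third one
preds  = np.array([0.1, 0.1, 0.6, 0.1, 0.1])  # softmax output of the network
print(cross_entropy(labels, preds))           # -log(0.6), roughly 0.51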

Gradient descent

"Training" the neural network actually means using training images and labels to adjust weights and biases so as to minimise the cross-entropy loss function. Here is how it works.

The cross-entropy is a function of weights, biases, pixels of the training image and its known class.

If we compute the partial derivatives of the cross-entropy with respect to all the weights and all the biases we obtain a "gradient", computed for a given image, label, and present value of weights and biases. Remember that we can have millions of weights and biases so computing the gradient sounds like a lot of work. Fortunately, Tensorflow does it for us. The mathematical property of a gradient is that it points "up". Since we want to go where the cross-entropy is low, we go in the opposite direction. We update weights and biases by a fraction of the gradient. We then do the same thing again and again using the next batches of training images and labels, in a training loop. Hopefully, this converges to a place where the cross-entropy is minimal although nothing guarantees that this minimum is unique.
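The update rule itself is one line per weight. Here is a sketch (Tensorflow optimizers apply this for you; the learning rate value is illustrative):

LEARNING_RATE = 0.01  # the "fraction of the gradient" mentioned above

def gradient_descent_step(weights, gradients):
    # move each weight against its gradient, i.e. "downhill" on the loss surface
    return [w - LEARNING_RATE * g for w, g in zip(weights, gradients)]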

Illustration: gradient descent.

Mini-batching and momentum

You can compute your gradient on just one example image and update the weights and biases immediately, but doing so on a batch of, for example, 128 images gives a gradient that better represents the constraints imposed by different example images and is therefore likely to converge towards the solution faster. The size of the mini-batch is an adjustable parameter.

This technique, sometimes called "stochastic gradient descent", has another, more pragmatic benefit: working with batches also means working with larger matrices and these are usually easier to optimise on GPUs and TPUs.

The convergence can still be a little chaotic though and it can even stop if the gradient vector is all zeros. Does that mean that we have found a minimum? Not always. A gradient component can be zero at a minimum or at a maximum. With a gradient vector of millions of elements, if they are all zeros, the probability that every zero corresponds to a minimum and none of them to a maximum is pretty small. In a space of many dimensions, saddle points are pretty common and we do not want to stop at them.

Illustration: a saddle point. The gradient is 0 but it is not a minimum in all directions. (Image attribution Wikimedia: By Nicoguaro - Own work, CC BY 3.0)

The solution is to add some momentum to the optimization algorithm so that it can sail past saddle points without stopping.
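In Keras, momentum is simply a parameter of the optimizer. For example (a sketch; this codelab uses 'adam', which also incorporates momentum-like behaviour, and 'model' stands for the model defined earlier):

import tensorflow as tf

# plain SGD with momentum: the accumulated velocity carries the optimizer past saddle points
optimizer = tf.keras.optimizers.SGD(momentum=0.9)

model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])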

Glossary

batch or mini-batch: training is always performed on batches of training data and labels. Doing so helps the algorithm converge. The "batch" dimension is typically the first dimension of data tensors. For example a tensor of shape [100, 192, 192, 3] contains 100 images of 192x192 pixels with three values per pixel (RGB).

cross-entropy loss: a special loss function often used in classifiers.

dense layer: a layer of neurons where each neuron is connected to all the neurons in the previous layer.

features: the inputs of a neural network are sometimes called "features". The art of figuring out which parts of a dataset (or combinations of parts) to feed into a neural network to get good predictions is called "feature engineering".

labels: another name for "classes" or correct answers in a supervised classification problem

learning rate: fraction of the gradient by which weights and biases are updated at each iteration of the training loop.

logits: the outputs of a layer of neurons before the activation function is applied are called "logits". The term comes from the "logistic function" a.k.a. the "sigmoid function" which used to be the most popular activation function. "Neuron outputs before logistic function" was shortened to "logits".

loss: the error function comparing neural network outputs to the correct answers

neuron: computes the weighted sum of its inputs, adds a bias and feeds the result through an activation function.

one-hot encoding: class 3 out of 5 is encoded as a vector of 5 elements, all zeros except the 3rd one which is 1.

relu: rectified linear unit. A popular activation function for neurons.

sigmoid: another activation function that used to be popular and is still useful in special cases.

softmax: a special activation function that acts on a vector, increases the difference between the largest component and all others, and also normalizes the vector to have a sum of 1 so that it can be interpreted as a vector of probabilities. Used as the last step in classifiers.

tensor: A "tensor" is like a matrix but with an arbitrary number of dimensions. A 1-dimensional tensor is a vector. A 2-dimensions tensor is a matrix. And then you can have tensors with 3, 4, 5 or more dimensions.

In a nutshell

If all the terms in bold in the next paragraph are already known to you, you can move to the next exercise. If you are just starting with convolutional neural networks, please read on.


Illustration: filtering an image with two successive filters made of 4x4x3=48 learnable weights each.

This is how a simple convolutional neural network looks in Keras:

model = tf.keras.Sequential([
  # input: images of size 192x192x3 pixels (the 3 stands for the RGB channels)
  tf.keras.layers.Conv2D(kernel_size=3, filters=24, padding='same', activation='relu', input_shape=[192, 192, 3]),
  tf.keras.layers.Conv2D(kernel_size=3, filters=24, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=2),
  tf.keras.layers.Conv2D(kernel_size=3, filters=12, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=2),
  tf.keras.layers.Conv2D(kernel_size=3, filters=6, padding='same', activation='relu'),
  tf.keras.layers.Flatten(),
  # classifying into 5 categories
  tf.keras.layers.Dense(5, activation='softmax')
])

model.compile(
  optimizer='adam',
  loss='categorical_crossentropy',
  metrics=['accuracy'])

Convolutional neural nets 101

In a layer of a convolutional network, one "neuron" does a weighted sum of the pixels just above it, across a small region of the image only. It adds a bias and feeds the sum through an activation function, just as a neuron in a regular dense layer would. This operation is then repeated across the entire image using the same weights. Remember that in dense layers, each neuron had its own weights. Here, a single "patch" of weights slides across the image in both directions (a "convolution"). The output has as many values as there are pixels in the image (some padding is necessary at the edges though). It is a filtering operation, using a filter of 4x4x3=48 weights.

However, 48 weights will not be enough. To add more degrees of freedom, we repeat the same operation with a new set of weights. This produces a new set of filter outputs. Let's call it a "channel" of outputs by analogy with the R,G,B channels in the input image.


The two (or more) sets of weights can be combined into one tensor by adding a new dimension. This gives us the generic shape of the weights tensor for a convolutional layer. Since the number of input and output channels are parameters, we can start stacking and chaining convolutional layers.

Illustration: a convolutional neural network transforms "cubes" of data into other "cubes" of data.
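You can check the generic shape of this weights tensor directly on a Keras layer (a sketch; the layer is built by hand here just to inspect its weights):

import tensorflow as tf

# the weights of a convolutional layer form a single 4D tensor of shape
# [kernel_height, kernel_width, input_channels, output_channels]
conv = tf.keras.layers.Conv2D(kernel_size=4, filters=2, padding='same')
conv.build([None, 192, 192, 3])   # 3 input channels (RGB)
print(conv.kernel.shape)          # (4, 4, 3, 2): two filters of 4x4x3=48 weights each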

Strided convolutions, max pooling

By performing the convolutions with a stride of 2 or 3, we can also shrink the resulting data cube in its horizontal dimensions. There are two common ways of doing this: strided convolutions, where the filter slides across the image with a stride larger than 1, and max pooling, where a sliding window (typically 2x2) keeps only the maximum value it sees.

Illustration: sliding the computing window by 3 pixels results in fewer output values. Strided convolutions or max pooling (max on a 2x2 window sliding by a stride of 2) are a way of shrinking the data cube in the horizontal dimensions.

Convolutional classifier

Finally, we attach a classification head by flattening the last data cube and feeding it through a dense, softmax-activated layer. A typical convolutional classifier can look like this:

Illustration: an image classifier using convolutional and softmax layers. It uses 3x3 and 1x1 filters. The maxpool layers take the max of groups of 2x2 data points. The classification head is implemented with a dense layer with softmax activation.

In Keras

The convolutional stack illustrated above can be written in Keras like this:

model = tf.keras.Sequential([
  # input: images of size 192x192x3 pixels (the 3 stands for the RGB channels)
  tf.keras.layers.Conv2D(kernel_size=3, filters=32, padding='same', activation='relu', input_shape=[192, 192, 3]),
  tf.keras.layers.Conv2D(kernel_size=1, filters=32, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=2),
  tf.keras.layers.Conv2D(kernel_size=3, filters=32, padding='same', activation='relu'),
  tf.keras.layers.Conv2D(kernel_size=1, filters=32, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=2),
  tf.keras.layers.Conv2D(kernel_size=3, filters=32, padding='same', activation='relu'),
  tf.keras.layers.Conv2D(kernel_size=1, filters=32, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=2),
  tf.keras.layers.Conv2D(kernel_size=3, filters=32, padding='same', activation='relu'),
  tf.keras.layers.Conv2D(kernel_size=1, filters=32, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=2),
  tf.keras.layers.Conv2D(kernel_size=3, filters=16, padding='same', activation='relu'),
  tf.keras.layers.Conv2D(kernel_size=1, filters=8, padding='same', activation='relu'),
  tf.keras.layers.Flatten(),
  # classifying into 5 categories
  tf.keras.layers.Dense(5, activation='softmax')
])

model.compile(
  optimizer='adam',
  loss='categorical_crossentropy',
  metrics=['accuracy'])

The padding parameter in convolutional layers can have two values: 'same', which pads the input with zeros so that the output has the same horizontal dimensions as the input, and 'valid', which uses no padding so that the output is smaller than the input by kernel_size-1 pixels in each dimension.
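You can see the difference in the output shapes (a quick sketch):

import tensorflow as tf

x = tf.keras.layers.Input(shape=[192, 192, 3])
same_out  = tf.keras.layers.Conv2D(kernel_size=3, filters=8, padding='same')(x)
valid_out = tf.keras.layers.Conv2D(kernel_size=3, filters=8, padding='valid')(x)

print(same_out.shape)   # (None, 192, 192, 8): zero-padded, spatial size preserved
print(valid_out.shape)  # (None, 190, 190, 8): no padding, the output is slightly smaller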

In a nutshell

Illustration: a convolutional "module". What is best at this point ? A max-pool layer followed by a 1x1 convolutional layer or a different combination of layers ? Try them all, concatenate the results and let the network decide. On the right: the "inception" convolutional architecture using such modules.

In Keras, to create models where the data flow can branch in and out, you have to use the "functional" model style. Here is an example:

l = tf.keras.layers # syntax shortcut

x = l.Input(shape=[192, 192, 3]) # input image
y = l.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')(x)

# module start: branch out
y1 = l.Conv2D(filters=32, kernel_size=1, padding='same', activation='relu')(y)
y3 = l.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')(y)
y = l.concatenate([y1, y3]) # output now has 64 channels
# module end: concatenation

# many more layers ...

# Create the model by specifying the input and output tensors.
# Keras layers track their connections automatically so that's all that's needed.
z = l.Dense(5, activation='softmax')(y)
model = tf.keras.Model(x, z)

Other cheap tricks

Small 3x3 filters

In this illustration, you see the result of two consecutive 3x3 filters. Try to trace back which data points contributed to the result: these two consecutive 3x3 filters compute some combination of a 5x5 region. It is not exactly the same combination that a 5x5 filter would compute but it is worth trying because two consecutive 3x3 filters are cheaper than a single 5x5 filter.
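Counting the weights per input/output channel pair makes the saving explicit (ignoring biases):

two_3x3_filters = 2 * (3 * 3)   # 18 weights, covering a 5x5 region of the input
one_5x5_filter  = 5 * 5         # 25 weights for the same 5x5 receptive field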

1x1 convolutions ?

In mathematical terms, a "1x1" convolution is a multiplication by a constant, not a very useful concept. In convolutional neural networks however, remember that the filter is applied to a data cube, not just a 2D image. Therefore, a "1x1" filter computes a weighted sum of a 1x1 column of data (see illustration) and as you slide it across the data, you will obtain a linear combination of the channels of the input. This is actually useful. If you think of the channels as the results of individual filtering operations, for example a filter for "pointy ears", another one for "whiskers" and a third one for "slit eyes" then a "1x1" convolutional layer will be computing multiple possible linear combinations of these features, which might be useful when looking for a "cat". On top of that, 1x1 layers use fewer weights.
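Counting weights again shows why 1x1 layers are cheap. For example, going from 64 input channels to 32 output channels (numbers chosen for illustration, biases ignored):

in_channels, out_channels = 64, 32

weights_1x1 = 1 * 1 * in_channels * out_channels   # 2048 weights: a per-position
                                                   # linear combination of the channels
weights_3x3 = 3 * 3 * in_channels * out_channels   # 18432 weights, 9x more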

A simple way of putting these ideas together has been showcased in the "Squeezenet" paper. The authors suggest a very simple convolutional module design, using only 1x1 and 3x3 convolutional layers.

Illustration: squeezenet architecture based on "fire modules". They alternate a 1x1 layer that "squeezes" the incoming data in the vertical dimension followed by two parallel 1x1 and 3x3 convolutional layers that "expand" the depth of the data again.

Continue in your previous notebook and build a squeezenet-inspired convolutional neural network. You will have to change the model code to the Keras "functional style".

HANDS-ON: Keras_Flowers_TPU (playground).ipynb

Squeezenet architectures to try

It will be useful for this exercise to define a helper function for a squeezenet module:

def fire(x, squeeze, expand):
  y = l.Conv2D(filters=squeeze, kernel_size=1, padding='same', activation='relu')(x)
  y1 = l.Conv2D(filters=expand//2, kernel_size=1, padding='same', activation='relu')(y)
  y3 = l.Conv2D(filters=expand//2, kernel_size=3, padding='same', activation='relu')(y)
  return tf.keras.layers.concatenate([y1, y3])

# this is to make it behave similarly to other Keras layers
def fire_module(squeeze, expand):
  return lambda x: fire(x, squeeze, expand)

# usage:
x = l.Input(shape=[192, 192, 3])
y = fire_module(squeeze=24, expand=48)(x) # typically, squeeze is less than expand
y = fire_module(squeeze=32, expand=64)(y)
...
model = tf.keras.Model(x, y)

Here are a couple of architectures you can try:

Little squeeze: 6 layers, global average pooling

x = l.Input(shape=[*IMAGE_SIZE, 3])

# Squeezenet's fire modules alternating with max-pooling layers
y = fire_module(squeeze=25, expand=50)(x)
y = l.MaxPooling2D(pool_size=2)(y)
y = fire_module(squeeze=25, expand=50)(y)
y = l.MaxPooling2D(pool_size=2)(y)
y = fire_module(squeeze=25, expand=50)(y)
y = l.MaxPooling2D(pool_size=2)(y)

# classification head with cheap global average pooling to 50 numbers
# (each channel is averaged to one number), followed by dense softmax layer.
y = l.GlobalAveragePooling2D()(y)
y = l.Dense(5, activation='softmax')(y)

This one is simple but not so great; it tops out at 65% accuracy.

Squeeze-dense: 8 layers, dense classification head

x = l.Input(shape=[*IMAGE_SIZE, 3])

# Starting directly with a 3x3 layer instead of the 1x1 layer that starts a fire
# module. Not sure if doing a 1x1 convolution (=linear combination) of the RGB
# channels of the input image is useful.
y = l.Conv2D(kernel_size=3, filters=40, padding='same', activation='relu')(x)

# Alternating max-pooling and fire modules with increasing filter count.
y = fire_module(squeeze=25, expand=50)(y)
y = l.MaxPooling2D(pool_size=2)(y)
y = fire_module(squeeze=30, expand=60)(y)
y = l.MaxPooling2D(pool_size=2)(y)
y = fire_module(squeeze=40, expand=80)(y)
y = l.MaxPooling2D(pool_size=2)(y)

# final 1x1 conv layer to bring the channel count to a reasonable 10 channels
y = l.Conv2D(kernel_size=1, filters=10, padding='same', activation='relu')(y)

# flatten 24x24x10 data cube to 24x24x10=5760 long vector and end on a fairly large
# dense layer. Notice that it accounts for half of the weights of the entire network.
y = l.Flatten()(y)
y = l.Dense(5, activation='softmax')(y)

This one goes to 75% accuracy. For the flowers dataset, a final dense layer seems to be working better than global average pooling.

Squeeze it as fast as you can: quick downsampling, 6 layers, global average pooling

x = l.Input(shape=[*IMAGE_SIZE, 3])

# Are all the 192x192 pixels of the image useful for recognizing flowers ?
# Let's downsample heavily with a 6x6 filter applied every 2 pixels (output is 96x96)
# and a max-pooling layer right after (output is now 48x48).
y = l.Conv2D(kernel_size=6, filters=42, padding='same', activation='relu', strides=2)(x)
y = l.MaxPooling2D(pool_size=2)(y)

# only 4 layers worth of fire modules, let's see if it is enough
y = fire_module(squeeze=24, expand=60)(y)
y = l.MaxPooling2D(pool_size=2)(y)
y = fire_module(squeeze=27, expand=90)(y)
y = l.MaxPooling2D(pool_size=2)(y)

# Global average pooling by the book: one last conv layer to bring the number of
# channels down to 5, average them, apply softmax activation on the results directly.
# No dense layer at all.
y = l.Conv2D(kernel_size=1, filters=5, padding='same', activation='relu')(y)
y = l.GlobalAveragePooling2D()(y)
y = l.Activation('softmax')(y)

This model trains in 3 seconds per epoch on TPU and amazingly, it still achieves 70% accuracy. Downsampling the input image aggressively seems to work for the flowers dataset.

Squeeze it to 90%

The convnet from the previous chapter achieved 75% accuracy and transfer learning from the first chapter took us to 85% accuracy. Can you beat them ?

A word about overfitting

Sometimes, you will see training curves like this:

The validation accuracy stalls and the validation loss goes up instead of going down. This is usually called "overfitting". It happens when the optimization work that is being done on the training dataset is no longer useful for examples outside of the training dataset. Various regularization techniques such as "dropout" or "batch normalization" can be used to address this, but this is a topic for another code lab. For now, just restart the training. On the Flowers dataset, the random weights initializations are sufficient to get the network to converge on most runs.

Solution

Here is the solution notebook. You can use it if you are stuck.

Keras_Flowers_TPU_squeezenet.ipynb

What we've covered

Please take a moment to go through this checklist in your head.

You have built your first modern convolutional neural network and trained it to 90%+ accuracy, iterating on successive training runs in only minutes thanks to TPUs. This concludes the four "Keras on TPU" codelabs:

TPUs in practice

TPUs and GPUs are available on Cloud AI Platform:

Finally, we love feedback. Please tell us if you see something amiss in this lab or if you think it should be improved. Feedback can be provided through GitHub issues [feedback link].


The author: Martin Görner
Twitter: @martin_gorner

www.tensorflow.org