One place for hosting & domains


      Introduction to PyTorch: Build a Neural Network to Recognize Handwritten Digits

      The author selected the Code 2040 to receive a donation as part of the Write for DOnations program.


      Machine learning is a field of computer science that finds patterns in data. As of 2021, machine learning practitioners use these patterns to detect lanes for self-driving cars; train a robot hand to solve a Rubik’s cube; or generate images of dubious artistic taste. As machine learning models grow more accurate and performant, we see increasing adoption in mainstream applications and products.

      Deep learning is a subset of machine learning that focuses on particularly complex models, termed neural networks. In later, advanced DigitalOcean articles (like this tutorial on building an Atari bot), we will formally define what “complex” means. Neural networks are the highly accurate and hype-inducing modern-day models your hear about, with applications across a wide range of tasks. In this tutorial, you will focus on one specific task called object recognition, or image classification. Given an image of a handwritten digit, your model will predict which digit is shown.

      You will build, train, and evaluate deep neural networks in PyTorch, a framework developed by Facebook AI Research for deep learning. When compared to other deep learning frameworks, like Tensorflow, PyTorch is a beginner-friendly framework with debugging features that aid in the building process. It’s also highly customizable for advanced users, with researchers and practitioners using it across companies like Facebook and Tesla. By the end of this tutorial, you will be able to:

      • Build, train, and evaluate a deep neural network in PyTorch
      • Understand the risks of applying deep learning

      While you won’t need prior experience in practical deep learning or PyTorch to follow along with this tutorial, we’ll assume some familiarity with machine learning terms and concepts such as training and testing, features and labels, optimization, and evaluation. You can learn more about these concepts in An Introduction to Machine Learning.


      To complete this tutorial, you will need a local development environment for Python 3 with at least 1GB of RAM. You can follow How to Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.

      Step 1 — Creating Your Project and Installing Dependencies

      Let’s create a workspace for this project and install the dependencies you’ll need. You’ll call your workspace pytorch:

      Navigate to the pytorch directory:

      Then create a new virtual environment for the project:

      Activate your environment:

      • source pytorch/bin/activate

      Then install PyTorch. On macOS, install PyTorch with the following command:

      • python -m pip install torch==1.4.0 torchvision==0.5.0

      On Linux and Windows, use the following commands for a CPU-only build:

      • pip install torch==1.4.0+cpu torchvision==0.5.0+cpu -f
      • pip install torchvision

      With the dependencies installed, you will now build your first neural network.

      Step 2 — Building a “Hello World” Neural Network

      In this step, you will build your first neural network and train it. You will learn about two sub-libraries in Pytorch, torch.nn for neural network operations and torch.optim for neural network optimizers. To understand what an “optimizer” is, you will also learn about an algorithm called gradient descent. Throughout this tutorial, you will use the following five steps to build and train models:

      1. Build a computation graph
      2. Set up optimizers
      3. Set up criterion
      4. Set up data
      5. Train the model

      In this first section of the tutorial, you will build a small model with a manageable dataset. Start by creating a new file, using nano or your favorite text editor:

      • nano

      You will now write a short 18-line snippet that trains a small model. Start by importing several PyTorch utilities:

      import torch
      import torch.nn as nn
      import torch.optim as optim

      Here, you alias PyTorch libraries to several commonly used shortcuts:

      • torch contains all PyTorch utilities. However, routine PyTorch code includes a few extra imports. We follow the same convention here, so that you can understand PyTorch tutorials and random code snippets online.
      • torch.nn contains utilities for constructing neural networks. This is often denoted nn.
      • torch.optim contains training utilities. This is often denoted optim.

      Next, define the neural network, training utilities, and the dataset:

      . . .
      net = nn.Linear(1, 1)  # 1. Build a computation graph (a line!)
      optimizer = optim.SGD(net.parameters(), lr=0.1)  # 2. Setup optimizers
      criterion = nn.MSELoss()  # 3. Setup criterion
      x, target = torch.randn((1,)), torch.tensor([0.])  # 4. Setup data
      . . .

      Here, you define several necessary parts of any deep learning training script:

      • net = ... defines the “neural network”. In this case, the model is a line of the form y = m * x; the parameter nn.Linear(1, 1) is the slope of your line. This model parameter nn.Linear(1, 1) will be updated during training. Note that torch.nn (aliased with nn) includes many deep learning operations, like the fully connected layers used here (nn.Linear) and convolutional layers (nn.Conv2d).
      • optimizer = ... defines the optimizer. This optimizer determines how the neural network will learn. We will discuss optimizers in more detail after writing a few more lines of code. Note that torch.optim (aliased to optim) includes many such optimizers that you can use.
      • criterion = ... defines the loss. In short, the loss defines what your model is trying to minimize. For your basic model of a line, the goal is to minimize the difference between your line’s predicted y-values and the actual y-values in the training set. Note that torch.nn (aliased with nn) includes many other loss functions you can use.
      • x, target = ... defines your “dataset”. Right now, the dataset is just one coordinate—one x value and one y value. In this case, the torch package itself offers tensor, to create a new tensor, and randn to create a tensor with random values.

      Finally, train the model by iterating over the dataset ten times. Each time, you adjust the model’s parameter:

      . . .
      # 5. Train the model
      for i in range(10):
          output = net(x)
          loss = criterion(output, target)
          print(round(loss.item(), 2))

      Your general goal is to minimize the loss, by adjusting the slope of the line. To effect this, this training code implements an algorithm called gradient descent. The intuition for gradient descent is as follows: Imagine you’re looking straight down at a bowl. The bowl has many points on it, and each point corresponds to a different parameter value. The bowl itself is the loss surface: the center of the bowl—the lowest point—indicates the best model with the lowest loss. This is the optimum. The fringes of the bowl—the highest points, and the parts of the bowl closest to you—hold the worst models with the highest loss.

      To find the best model with the lowest loss:

      1. With net = nn.Linear(1, 1) you initialize a random model. This is equivalent to picking a random point on the bowl.
      2. In the for i in range(10) loop, you begin training. This is equivalent to stepping closer to the center of the bowl.
      3. The direction of each step is given by the gradient. You will skip a formal proof here, but in summary, the negative gradient points to the lowest point in the bowl.
      4. With lr=0.1 in optimizer = ..., you specify the step size. This determines how large each step can be.

      In just ten steps, you reach the center of the bowl, the best possible model with the lowest possible loss. For a visualization of gradient descent, see Distill’s “Why Momentum Really Works,” first figure at the top of the page.

      The last three lines of this code are also important:

      • net.zero_grad clears all gradients that may have been leftover from the previous step iterate.
      • loss.backward computes new gradients.
      • optimizer.step uses those gradients to take steps. Notice that you didn’t compute gradients yourself. This is because PyTorch, and other deep learning libraries like it, automatically differentiate.

      This now concludes your “hello world” neural network. Save and close your file.

      Double-check that your script matches Then, run the script:

      • python

      Your script will output the following:


      0.33 0.19 0.11 0.07 0.04 0.02 0.01 0.01 0.0 0.0

      Notice that your loss continually decreases, showing that your model is learning. There are two other implementation details to note, when using PyTorch:

      1. PyTorch uses torch.Tensor to hold all data and parameters. Here, torch.randn generates a tensor with random values, with the provided shape. For example, a torch.randn((1, 2)) creates a 1x2 tensor, or a 2-dimensional row vector.
      2. PyTorch supports a wide variety of optimizers. This features torch.optim.SGD, otherwise known as stochastic gradient descent (SGD). Roughly speaking, this is the algorithm described in this tutorial, where you took steps toward the optimum. There are more-involved optimizers that add extra features on top of SGD. There are also many losses, with torch.nn.MSELoss being just one of them.

      This concludes your very first model on a toy dataset. In the next step, you will replace this small model with a neural network and the toy dataset with a commonly used machine learning benchmark.

      Step 3 — Training Your Neural Network on Handwritten Digits

      In the previous section, you built a small PyTorch model. However, to better understand the benefits of PyTorch, you will now build a deep neural network using torch.nn.functional, which contains more neural network operations, and torchvision.datasets, which supports many datasets you can use, out of the box. In this section, you will build a relatively complex, custom model with a premade dataset.

      You’ll use convolutions, which are pattern-finders. For images, convolutions look for 2D patterns at various levels of “meaning”: Convolutions directly applied to the image are looking for “lower-level” features such as edges. However, convolutions applied to the outputs of many other operations may be looking for “higher-level” features, such as a door. For visualizations and a more thorough walkthrough of convolutions, see part of Stanford’s deep learning course.

      You will now expand on the first PyTorch model you built, by defining a slightly more complex model. Your neural network will now contain two convolutions and one fully connected layer, to handle image inputs.

      Start by creating a new file, using your text editor:

      You will follow the same five-step algorithm as before:

      1. Build a computation graph
      2. Set up optimizers
      3. Set up criterion
      4. Set up data
      5. Train the model

      First, define your deep neural network. Note this is a pared down version of other neural networks you may find on MNIST—this is intentional, so that you can train your neural network on your laptop:

      import torch
      import torch.nn as nn
      import torch.optim as optim
      import torch.nn.functional as F
      from torchvision import datasets, transforms
      from torch.optim.lr_scheduler import StepLR
      # 1. Build a computation graph
      class Net(nn.Module):
          def __init__(self):
              super(Net, self).__init__()
              self.conv1 = nn.Conv2d(1, 32, 3, 1)
              self.conv2 = nn.Conv2d(32, 64, 3, 1)
              self.fc = nn.Linear(1024, 10)
          def forward(self, x):
              x = F.relu(self.conv1(x))
              x = F.relu(self.conv2(x))
              x = F.max_pool2d(x, 1)
              x = torch.flatten(x, 1)
              x = self.fc(x)
              output = F.log_softmax(x, dim=1)
              return output
      net = Net()
      . . .

      Here, you define a neural network class, inheriting from nn.Module. All operations in the neural network (including the neural network itself) must inherit from nn.Module. The typical paradigm, for your neural network class, is as follows:

      1. In the constructor, define any operations needed for your network. In this case, you have two convolutions and a fully connected layer. (A tip to remember: The constructor always starts with super().__init__().) PyTorch expects the parent class to be initialized before assigning modules (for example, nn.Conv2d) to instance attributes (self.conv1).
      2. In the forward method, run the initialized operations. This method determines the neural network architecture, explicitly defining how the neural network will compute its predictions.

      This neural network uses a few different operations:

      • nn.Conv2d: A convolution. Convolutions look for patterns in the image. Earlier convolutions look for “low-level” patterns like edges. Later convolutions in the network look for “high-level” patterns like legs on a dog, or ears.
      • nn.Linear: A fully connected layer. Fully connected layers relate all input features to all output dimensions.
      • F.relu, F.max_pool2d: These are types of non-linearities. (A non-linearity is any function that is not linear.) relu is the function f(x) = max(x, 0). max_pool takes the maximum value in every patch of values. In this case, you take the maximum value across the entire image.
      • log_softmax: normalizes all of the values in a vector, so that the values sum to 1.

      Second, like before, define the optimizer. This time, you will use a different optimizer and a different hyper-parameter setting. Hyper-parameters configure training, whereas training adjusts model parameters. These hyper-parameter settings are taken from the PyTorch MNIST example:

      . . .
      optimizer = optim.Adadelta(net.parameters(), lr=1.)  # 2. Setup optimizer
      . . .

      Third, unlike before, you will now use a different loss. This loss is used for classification problems, where the output of your model is a class index. In this particular example, the model will output the digit (possibly any number from 0 to 9) contained in the input image:

      . . .
      criterion = nn.NLLLoss()  # 3. Setup criterion
      . . .

      Fourth, set up the data. In this case, you will set up a dataset called MNIST, which features handwritten digits. Deep Learning 101 tutorials often use this dataset. Each image is a small 28x28 px image containing a handwritten digit, and the goal is to classify each handwritten digit as 0, 1, 2, … or 9:

      . . .
      # 4. Setup data
      transform = transforms.Compose([
          transforms.Resize((8, 8)),
          transforms.Normalize((0.1307,), (0.3081,))
      train_dataset = datasets.MNIST(
          'data', train=True, download=True, transform=transform)
      train_loader =, batch_size=512)
      . . .

      Here, you preprocess the images in transform = ... by resizing the image, converting the image to a PyTorch tensor, and normalizing the tensor to have mean 0 and variance 1.

      In the next two lines, you set train=True, as this is the training dataset and download=True so that you download the dataset if it is not already.

      batch_size=512 determines how many images the network is trained on, at once. Barring ridiculously large batch sizes (for example, tens of thousands), larger batches are preferable for roughly faster training.

      Fifth, train the model. In the following code block, you make minimal modifications. Instead of running ten times on the same sample, you will now iterate over all samples in the provided dataset once. By passing over all samples once, the following is training for one epoch:

      . . .
      # 5. Train the model
      for inputs, target in train_loader:
          output = net(inputs)
          loss = criterion(output, target)
          print(round(loss.item(), 2))
      . . .

      Save and close your file.

      Double-check that your script matches Then, run the script.

      Your script will output the following:


      2.31 2.18 2.03 1.78 1.52 1.35 1.3 1.35 1.07 1.0 ... 0.21 0.2 0.23 0.12 0.12 0.12

      Notice that the final loss is less than 10% of the initial loss value. This means that your neural network is training correctly.

      That concludes training. However, the loss of 0.12 is difficult to reason about: we don’t know if 0.12 is “good” or “bad”. To assess how well your model is performing, you next compute an accuracy for this classification model.

      Step 4 — Evaluating Your Neural Network

      Earlier, you computed loss values on the train split of your dataset. However, it is good practice to keep a separate validation split of your dataset. You use this validation split to compute the accuracy of your model. However, you can’t use it for training. Following, you set up the validation dataset and evaluate your model on it. In this step, you will use the same PyTorch utilities from before, including torchvision.datasets for the MNIST dataset.

      Start by copying your file into Then, open the file:

      • cp
      • nano

      First, set up the validation dataset:

      . . .
      train_loader = ...
      val_dataset = datasets.MNIST(
          'data', train=False, download=True, transform=transform)
      val_loader =, batch_size=512)
      . . .

      At the end of your file, after the training loop, add a validation loop:

          . . .
      correct = 0.
      for inputs, target in val_loader:
          output = net(inputs)
          _, pred = output.max(1)
          correct += (pred == target).sum()
      accuracy = correct / len(val_dataset) * 100.
      print(f'{accuracy:.2f}% correct')

      Here, the validation loop performs a few operations to compute accuracy:

      • Running net.eval() ensures that your neural network is in evaluation mode and ready for validation. Several operations are run differently in evaluation mode than when in training mode.
      • Iterating over all inputs and labels in val_loader.
      • Running the model net(inputs) to obtain probabilities for each class.
      • Finding the class with the highest probability output.max(1). output is a tensor with dimensions (n, k) for n samples and k classes. The 1 means you compute the max along the index 1 dimension.
      • Computing the number of images that were classified correctly: pred == target computes a boolean-valued vector. .sum() casts these booleans to integers and effectively computes the number of true values.
      • correct / len(val_dataset) finally computes the percent of images classified correctly.

      Save and close your file.

      Double-check that your script matches Then, run the script:

      Your script will output the following. Note the specific loss values and final accuracy may vary:


      2.31 2.21 ... 0.14 0.2 89% correct

      You have now trained your very first deep neural network. You can make further modifications and improvements by tuning hyper-parameters for training: This includes different numbers of epochs, learning rates, and different optimizers. We include a sample script with tuned hyper-parameters; this script trains the same neural network but for 10 epochs, obtaining 97% accuracy.

      Risks of Deep Learning

      One gotcha is that deep learning does not always obtain state-of-the-art results. Deep learning works well in feature-rich, data-rich scenarios but conversely performs poorly in data-sparse, feature-sparse regimes. Whereas there is active research in deep learning’s weak areas, many other machine learning techniques are already well-suited for feature-sparse regimes, such as decision trees, linear regression, or support vector machines (SVM).

      Another gotcha is that deep learning is not well understood. There are no guarantees for accuracy, optimality, or even convergence. On the other hand, classic machine learning techniques are well-studied and are relatively interpretable. Again, there is active research to address this lack of interpretability in deep learning. You can read more in “What Explainable AI fails to explain (and how we fix that)”.

      Most importantly, lack of interpretability in deep learning leads to overlooked biases. For example, researchers from UC Berkeley were able to show a model’s gender bias in captioning (“Women also Snowboard”). Other research efforts focus on societal issues such as “Fairness” in machine learning. Given these issues are undergoing active research, it is difficult to recommend a prescribed diagnosis for biases in models. As a result, it is up to you, the practitioner, to apply deep learning responsibly.


      PyTorch is deep learning framework for enthusiasts and researchers alike. To get acquainted with PyTorch, you have both trained a deep neural network and also learned several tips and tricks for customizing deep learning.

      You can also use a pre-built neural network architecture instead of building your own. Here is a link to an optional section: Use Existing Neural Network Architecture on Google Colab that you can try. For demonstration purposes, this optional step trains a much larger model with much larger images.

      Check out our other articles to dive deeper into machine learning and related fields:

      • Is your model complex enough? Too complex? Learn about the bias-variance tradeoff in Bias-Variance for Deep Reinforcement Learning: How To Build a Bot for Atari with OpenAI Gym to find out. In this article, we build AI bots for Atari Games and explore a field of research called Reinforcement Learning. Alternatively, find a visual explanation of the bias-variance trade-off in this Understanding the Bias-Variance Trade-off article.
      • How does a machine learning model process images? Learn more in Build an Emotion-Based Dog Filter. In this article, we discuss how models process and classify images, in more detail, exploring a field of research called Computer Vision.
      • Can a neural network be fooled? Learn how to in Tricking a Neural Network. In this article, we explore adversarial machine learning, a field of research that devises both attacks and defenses for neural networks for more robust real-world deep learning deployments.
      • How can we better understand how neural networks work? Read one class of approaches called “Explainable AI” in How To Visualize and Interpret Neural Networks. In this article, we explore explainable AI, and in particular visualize pixels that the neural networks believes are important for its predictions.

      Source link

      How To Build a Neural Network to Recognize Handwritten Digits with TensorFlow


      Neural networks are used as a method of deep learning, one of the many subfields of artificial intelligence. They were first proposed around 70 years ago as an attempt at simulating the way the human brain works, though in a much more simplified form. Individual ‘neurons’ are connected in layers, with weights assigned to determine how the neuron responds when signals are propagated through the network. Previously, neural networks were limited in the number of neurons they were able to simulate, and therefore the complexity of learning they could achieve. But in recent years, due to advancements in hardware development, we have been able to build very deep networks, and train them on enormous datasets to achieve breakthroughs in machine intelligence.

      These breakthroughs have allowed machines to match and exceed the capabilities of humans at performing certain tasks. One such task is object recognition. Though machines have historically been unable to match human vision, recent advances in deep learning have made it possible to build neural networks which can recognize objects, faces, text, and even emotions.

      In this tutorial, you will implement a small subsection of object recognition—digit recognition. Using TensorFlow, an open-source Python library developed by the Google Brain labs for deep learning research, you will take hand-drawn images of the numbers 0-9 and build and train a neural network to recognize and predict the correct label for the digit displayed.

      While you won’t need prior experience in practical deep learning or TensorFlow to follow along with this tutorial, we’ll assume some familiarity with machine learning terms and concepts such as training and testing, features and labels, optimization, and evaluation. You can learn more about these concepts in An Introduction to Machine Learning.


      To complete this tutorial, you’ll need:

      Step 1 — Configuring the Project

      Before you can develop the recognition program, you’ll need to install a few dependencies and create a workspace to hold your files.

      We’ll use a Python 3 virtual environment to manage our project’s dependencies. Create a new directory for your project and navigate to the new directory:

      • mkdir tensorflow-demo
      • cd tensorflow-demo

      Execute the following commands to set up the virtual environment for this tutorial:

      • python3 -m venv tensorflow-demo
      • source tensorflow-demo/bin/activate

      Next, install the libraries you’ll use in this tutorial. We’ll use specific versions of these libraries by creating a requirements.txt file in the project directory which specifies the requirement and the version we need. Create the requirements.txt file:

      Open the file in your text editor and add the following lines to specify the Image, NumPy, and TensorFlow libraries and their versions:



      Save the file and exit the editor. Then install these libraries with the following command:

      • pip install -r requirements.txt

      With the dependencies installed, we can start working on our project.

      Step 2 — Importing the MNIST Dataset

      The dataset we will be using in this tutorial is called the MNIST dataset, and it is a classic in the machine learning community. This dataset is made up of images of handwritten digits, 28x28 pixels in size. Here are some examples of the digits included in the dataset:

      Examples of MNIST images

      Let's create a Python program to work with this dataset. We will use one file for all of our work in this tutorial. Create a new file called

      Now open this file in your text editor of choice and add this line of code to the file to import the TensorFlow library:

      import tensorflow as tf

      Add the following lines of code to your file to import the MNIST dataset and store the image data in the variable mnist:

      from tensorflow.examples.tutorials.mnist import input_data
      mnist = input_data.read_data_sets("MNIST_data/", one_hot=True) # y labels are oh-encoded

      When reading in the data, we are using one-hot-encoding to represent the labels (the actual digit drawn, e.g. "3") of the images. One-hot-encoding uses a vector of binary values to represent numeric or categorical values. As our labels are for the digits 0-9, the vector contains ten values, one for each possible digit. One of these values is set to 1, to represent the digit at that index of the vector, and the rest are set to 0. For example, the digit 3 is represented using the vector [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. As the value at index 3 is stored as 1, the vector therefore represents the digit 3.

      To represent the actual images themselves, the 28x28 pixels are flattened into a 1D vector which is 784 pixels in size. Each of the 784 pixels making up the image is stored as a value between 0 and 255. This determines the grayscale of the pixel, as our images are presented in black and white only. So a black pixel is represented by 255, and a white pixel by 0, with the various shades of gray somewhere in between.

      We can use the mnist variable to find out the size of the dataset we have just imported. Looking at the num_examples for each of the three subsets, we can determine that the dataset has been split into 55,000 images for training, 5000 for validation, and 10,000 for testing. Add the following lines to your file:

      n_train = mnist.train.num_examples # 55,000
      n_validation = mnist.validation.num_examples # 5000
      n_test = mnist.test.num_examples # 10,000

      Now that we have our data imported, it’s time to think about the neural network.

      Step 3 — Defining the Neural Network Architecture

      The architecture of the neural network refers to elements such as the number of layers in the network, the number of units in each layer, and how the units are connected between layers. As neural networks are loosely inspired by the workings of the human brain, here the term unit is used to represent what we would biologically think of as a neuron. Like neurons passing signals around the brain, units take some values from previous units as input, perform a computation, and then pass on the new value as output to other units. These units are layered to form the network, starting at a minimum with one layer for inputting values, and one layer to output values. The term hidden layer is used for all of the layers in between the input and output layers, i.e. those "hidden" from the real world.

      Different architectures can yield drastically different results, as the performance can be thought of as a function of the architecture among other things, such as the parameters, the data, and the duration of training.

      Add the following lines of code to your file to store the number of units per layer in global variables. This allows us to alter the network architecture in one place, and at the end of the tutorial you can test for yourself how different numbers of layers and units will impact the results of our model:

      n_input = 784   # input layer (28x28 pixels)
      n_hidden1 = 512 # 1st hidden layer
      n_hidden2 = 256 # 2nd hidden layer
      n_hidden3 = 128 # 3rd hidden layer
      n_output = 10   # output layer (0-9 digits)

      The following diagram shows a visualization of the architecture we've designed, with each layer fully connected to the surrounding layers:

      Diagram of a neural network

      The term "deep neural network" relates to the number of hidden layers, with "shallow" usually meaning just one hidden layer, and "deep" referring to multiple hidden layers. Given enough training data, a shallow neural network with a sufficient number of units should theoretically be able to represent any function that a deep neural network can. But it is often more computationally efficient to use a smaller deep neural network to achieve the same task that would require a shallow network with exponentially more hidden units. Shallow neural networks also often encounter overfitting, where the network essentially memorizes the training data that it has seen, and is not able to generalize the knowledge to new data. This is why deep neural networks are more commonly used: the multiple layers between the raw input data and the output label allow the network to learn features at various levels of abstraction, making the network itself better able to generalize.

      Other elements of the neural network that need to be defined here are the hyperparameters. Unlike the parameters that will get updated during training, these values are set initially and remain constant throughout the process. In your file, set the following variables and values:

      learning_rate = 1e-4
      n_iterations = 1000
      batch_size = 128
      dropout = 0.5

      The learning rate represents ow much the parameters will adjust at each step of the learning process. These adjustments are a key component of training: after each pass through the network we tune the weights slightly to try and reduce the loss. Larger learning rates can converge faster, but also have the potential to overshoot the optimal values as they are updated. The number of iterations refers to how many times we go through the training step, and the batch size refers to how many training examples we are using at each step. The dropout variable represents a threshold at which we elimanate some units at random. We will be using dropout in our final hidden layer to give each unit a 50% chance of being eliminated at every training step. This helps prevent overfitting.

      We have now defined the architecture of our neural network, and the hyperparameters that impact the learning process. The next step is to build the network as a TensorFlow graph.

      Step 4 — Building the TensorFlow Graph

      To build our network, we will set up the network as a computational graph for TensorFlow to execute. The core concept of TensorFlow is the tensor, a data structure similar to an array or list. initialized, manipulated as they are passed through the graph, and updated through the learning process.

      We’ll start by defining three tensors as placeholders, which are tensors that we'll feed values into later. Add the following to your file:

      X = tf.placeholder("float", [None, n_input])
      Y = tf.placeholder("float", [None, n_output])
      keep_prob = tf.placeholder(tf.float32) 

      The only parameter that needs to be specified at its declaration is the size of the data we will be feeding in. For X we use a shape of [None, 784], where None represents any amount, as we will be feeding in an undefined number of 784-pixel images. The shape of Y is [None, 10] as we will be using it for an undefined number of label outputs, with 10 possible classes. The keep_prob tensor is used to control the dropout rate, and we initialize it as a placeholder rather than an immutable variable because we want to use the same tensor both for training (when dropout is set to 0.5) and testing (when dropout is set to 1.0).

      The parameters that the network will update in the training process are the weight and bias values, so for these we need to set an initial value rather than an empty placeholder. These values are essentially where the network does its learning, as they are used in the activation functions of the neurons, representing the strength of the connections between units.

      Since the values are optimized during training, we could set them to zero for now. But the initial value actually has a significant impact on the final accuracy of the model. We'll use random values from a truncated normal distribution for the weights. We want them to be close to zero, so they can adjust in either a positive or negative direction, and slightly different, so they generate different errors. This will ensure that the model learns something useful. Add these lines:

      weights = {
          'w1': tf.Variable(tf.truncated_normal([n_input, n_hidden1], stddev=0.1)),
          'w2': tf.Variable(tf.truncated_normal([n_hidden1, n_hidden2], stddev=0.1)),
          'w3': tf.Variable(tf.truncated_normal([n_hidden2, n_hidden3], stddev=0.1)),
          'out': tf.Variable(tf.truncated_normal([n_hidden3, n_output], stddev=0.1)),

      For the bias, we use a small constant value to ensure that the tensors activate in the intial stages and therefore contribute to the propagation. The weights and bias tensors are stored in dictionary objects for ease of access. Add this code to your file to define the biases:

      biases = {
          'b1': tf.Variable(tf.constant(0.1, shape=[n_hidden1])),
          'b2': tf.Variable(tf.constant(0.1, shape=[n_hidden2])),
          'b3': tf.Variable(tf.constant(0.1, shape=[n_hidden3])),
          'out': tf.Variable(tf.constant(0.1, shape=[n_output]))

      Next, set up the layers of the network by defining the operations that will manipulate the tensors. Add these lines to your file:

      layer_1 = tf.add(tf.matmul(X, weights['w1']), biases['b1'])
      layer_2 = tf.add(tf.matmul(layer_1, weights['w2']), biases['b2'])
      layer_3 = tf.add(tf.matmul(layer_2, weights['w3']), biases['b3'])
      layer_drop = tf.nn.dropout(layer_3, keep_prob)
      output_layer = tf.matmul(layer_3, weights['out']) + biases['out']

      Each hidden layer will execute matrix multiplication on the previous layer’s outputs and the current layer’s weights, and add the bias to these values. At the last hidden layer, we will apply a dropout operation using our keep_prob value of 0.5.

      The final step in building the graph is to define the loss function that we want to optimize. A popular choice of loss function in TensorFlow programs is cross-entropy, also known as log-loss, which quantifies the difference between two probability distributions (the predictions and the labels). A perfect classification would result in a cross-entropy of 0, with the loss completely minimized.

      We also need to choose the optimization algorithm which will be used to minimize the loss function. A process named gradient descent optimization is a common method for finding the (local) minimum of a function by taking iterative steps along the gradient in a negative (descending) direction. There are several choices of gradient descent optimization algorithms already implemented in TensorFlow, and in this tutorial we will be using the Adam optimizer. This extends upon gradient descent optimization by using momentum to speed up the process through computing an exponentially weighted average of the gradients and using that in the adjustments. Add the following code to your file:

      cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=output_layer))
      train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

      We've now defined the network and built it out with TensorFlow. The next step is to feed data through the graph to train it, and then test that it has actually learnt something.

      Step 5 — Training and Testing

      The training process involves feeding the training dataset through the graph and optimizing the loss function. Every time the network iterates through a batch of more training images, it updates the parameters to reduce the loss in order to more accurately predict the digits shown. The testing process involves running our testing dataset through the trained graph, and keeping track of the number of images that are correctly predicted, so that we can calculate the accuracy.

      Before starting the training process, we will define our method of evaluating the accuracy so we can print it out on mini-batches of data while we train. These printed statements will allow us to check that from the first iteration to the last, loss decreases and accuracy increases; they will also allow us to track whether or not we have ran enough iterations to reach a consistent and optimal result:

      correct_pred = tf.equal(tf.argmax(output_layer, 1), tf.argmax(Y, 1))
      accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

      In correct_pred, we use the arg_max function to compare which images are being predicted correctly by looking at the output_layer (predictions) and Y (labels), and we use the equal function to return this as a list of [Booleans](tps:// We can then cast this list to floats and calculate the mean to get a total accuracy score.

      We are now ready to initialize a session for running the graph. In this session we will feed the network with our training examples, and once trained, we feed the same graph with new test examples to determine the accuracy of the model. Add the following lines of code to your file:

      init = tf.global_variables_initializer()
      sess = tf.Session()

      The essence of the training process in deep learning is to optimize the loss function. Here we are aiming to minimize the difference between the predicted labels of the images, and the true labels of the images. The process involves four steps which are repeated for a set number of iterations:

      • Propagate values forward through the network
      • Compute the loss
      • Propagate values backward through the network
      • Update the parameters

      At each training step, the parameters are adjusted slightly to try and reduce the loss for the next step. As the learning progresses, we should see a reduction in loss, and eventually we can stop training and use the network as a model for testing our new data.

      Add this code to the file:

      # train on mini batches
      for i in range(n_iterations):
          batch_x, batch_y = mnist.train.next_batch(batch_size)
, feed_dict={X: batch_x, Y: batch_y, keep_prob:dropout})
          # print loss and accuracy (per minibatch)
          if i%100==0:
              minibatch_loss, minibatch_accuracy =[cross_entropy, accuracy], feed_dict={X: batch_x, Y: batch_y, keep_prob:1.0})
              print("Iteration", str(i), "t| Loss =", str(minibatch_loss), "t| Accuracy =", str(minibatch_accuracy))

      After 100 iterations of each training step in which we feed a mini-batch of images through the network, we print out the loss and accuracy of that batch. Note that we should not be expecting a decreasing loss and increasing accuracy here, as the values are per batch, not for the entire model. We use mini-batches of images rather than feeding them through individually to speed up the training process and allow the network to see a number of different examples before updating the parameters.

      Once the training is complete, we can run the session on the test images. This time we are using a keep_prob dropout rate of 1.0 to ensure all units are active in the testing process.

      Add this code to the file:

      test_accuracy =, feed_dict={X: mnist.test.images, Y: mnist.test.labels, keep_prob:1.0})
      print("nAccuracy on test set:", test_accuracy)

      It’s now time to run our program and see how accurately our neural network can recognize these handwritten digits. Save the file and execute the following command in the terminal to run the script:

      You'll see an output similar to the following, although individual loss and accuracy results may vary slightly:


      Iteration 0 | Loss = 3.67079 | Accuracy = 0.140625 Iteration 100 | Loss = 0.492122 | Accuracy = 0.84375 Iteration 200 | Loss = 0.421595 | Accuracy = 0.882812 Iteration 300 | Loss = 0.307726 | Accuracy = 0.921875 Iteration 400 | Loss = 0.392948 | Accuracy = 0.882812 Iteration 500 | Loss = 0.371461 | Accuracy = 0.90625 Iteration 600 | Loss = 0.378425 | Accuracy = 0.882812 Iteration 700 | Loss = 0.338605 | Accuracy = 0.914062 Iteration 800 | Loss = 0.379697 | Accuracy = 0.875 Iteration 900 | Loss = 0.444303 | Accuracy = 0.90625 Accuracy on test set: 0.9206

      To try and improve the accuracy of our model, or to learn more about the impact of tuning hyperparameters, we can test the effect of changing the learning rate, the dropout threshold, the batch size, and the number of iterations. We can also change the number of units in our hidden layers, and change the amount of hidden layers themselves, to see how different architectures increase or decrease the model accuracy.

      To demonstrate that the network is actually recognizing the hand-drawn images, let's test it on a single image of our own.

      First either download this sample test image or open up a graphics editor and create your own 28x28 pixel image of a digit.

      Open the file in your editor and add the following lines of code to the top of the file to import two libraries necessary for image manipulation.

      import numpy as np
      from PIL import Image

      Then at the end of the file, add the following line of code to load the test image of the handwritten digit:

      img = np.invert("test_img.png").convert('L')).ravel()

      The open function of the Image library loads the test image as a 4D array containing the three RGB color channels and the Alpha transparency. This is not the same representation we used previously when reading in the dataset with TensorFlow, so we'll need to do some extra work to match the format.

      First, we use the convert function with the L parameter to reduce the 4D RGBA representation to one grayscale color channel. We store this as a numpy array and invert it using np.invert, because the current matrix represents black as 0 and white as 255, whereas we need the opposite. Finally, we call ravel to flatten the array.

      Now that the image data is structured correctly, we can run a session in the same way as previously, but this time only feeding in the single image for testing. Add the following code to your file to test the image and print the outputted label.

      prediction =,1), feed_dict={X: [img]})
      print ("Prediction for test image:", np.squeeze(prediction))

      The np.squeeze function is called on the prediction to return the single integer from the array (i.e. to go from [2] to 2). The resulting output demonstrates that the network has recognized this image as the digit 2.


      Prediction for test image: 2

      You can try testing the network with more complex images –– digits that look like other digits, for example, or digits that have been drawn poorly or incorrectly –– to see how well it fares.


      In this tutorial you successfully trained a neural network to classify the MNIST dataset with around 92% accuracy and tested it on an image of your own. Current state-of-the-art research achieves around 99% on this same problem, using more complex network architectures involving convolutional layers. These use the 2D structure of the image to better represent the contents, unlike our method which flattened all the pixels into one vector of 784 units. You can read more about this topic on the TensorFlow website, and see the research papers detailing the most accurate results on the MNIST website.

      Now that you know how to build and train a neural network, you can try and use this implementation on your own data, or test it on other popular datasets such as the Google StreetView House Numbers, or the CIFAR-10 dataset for more general image recognition.

      Source link