Learn CNN from scratch with Python and Numpy

A Convolutional Neural Network (CNN or ConvNet) is a deep learning algorithm for image analysis and computer vision. In this CNN deep learning tutorial I will give you a very basic explanation of Convolutional Neural Networks, so that they are easy to understand.

Application of CNN

From OCR (Optical Character Recognition) to self-driving cars, Convolutional Neural Networks are used everywhere. Let’s look at some real-world applications of CNNs:

  • Image tagging: One of the foundational elements of visual search. An image tag is a word or combination of words that describes an image
  • Image search: Google image search is an example of visual search
  • Recommendation engines: Amazon uses CNN image recognition to make suggestions in the “you might also like” section
  • Face detection: Nowadays almost every phone has this feature in its camera
  • Object detection: Detect real-world objects such as a person, a phone, etc.

Advantages of CNN over Basic Neural Network

One of the challenges you will face when solving computer vision problems with a basic neural network is the input size.

Let’s say we are doing image classification: given an image, you need to decide whether it shows a dog or not. If you are working with 64x64x3 images (3 is the number of bands: Red, Green, Blue, or RGB for short), you will have 12288 (64*64*3) input features.

Low Resolution Image

Now suppose you want to work with larger images, say 1000x1000x3. In this case you will have 3 million (1000*1000*3) input features. If the first hidden layer of your basic neural network has 1000 units, the number of weights to train in that layer alone will be 3 billion (3 million * 1000).
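
To make the arithmetic above concrete, here is a small sketch that counts the weights of one fully connected layer. The helper name dense_weight_count is my own, and biases are ignored to match the numbers in the text.

```python
# Weights of one fully connected layer = input features * hidden units (biases ignored)
def dense_weight_count(height, width, channels, hidden_units):
    input_features = height * width * channels
    return input_features, input_features * hidden_units

small_features, small_weights = dense_weight_count(64, 64, 3, 1000)
large_features, large_weights = dense_weight_count(1000, 1000, 3, 1000)

print(small_features, small_weights)   # 12288 12288000
print(large_features, large_weights)   # 3000000 3000000000
```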

High Resolution Image

So you can see that a basic neural network for image classification has a huge number of parameters to train. Because of that you will face:

  • A huge computation cost
  • Overfitting of the basic neural network model

To solve these problems we should use a Convolutional Neural Network (CNN). Now let’s understand the architecture of a Convolutional Neural Network.

CNN Architecture

Architecture of Convolutional Neural Network

There are mainly 5 layers in CNN architecture.

  1. Input Layer
  2. Convolutional Layer
  3. Pooling Layer
  4. Fully Connected Layer
  5. Output Layer

Among these 5 layers of a CNN:

  • Feature Extraction: The first 3 layers (input, convolution and pooling) are used for feature extraction (for example the eyes, nose, legs and body of a dog)
  • Classification: The last two layers (fully connected and output) are used for classification (dog or not)

Input Layer of CNN

When we humans look at an image we simply see a picture, but a computer can only understand numbers. Now let’s say we want to detect whether an image shows the digit 9 or not.

Let’s say the size of the image is 5x7 (5 pixels wide and 7 pixels high). Since it is a grayscale image, we can write it as a 5x7x1 image.

Note: In this image matrix 1 represents a white pixel and -1 represents a black pixel.

Represent image as number in CNN

Define Input Layer in Python with Numpy

import numpy as np
from skimage.util.shape import view_as_windows
from numpy.lib.stride_tricks import as_strided

# *******************************************************************************
# Input Layer
# *******************************************************************************

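input_data_list = [
    [-1, 1,  1, 1, -1],
    [-1, 1, -1, 1, -1],
    [-1, 1,  1, 1, -1],
    # Only the first three rows of the 7-row image are shown in this listing;
    # add the remaining four rows (5 values each) before running the later steps
]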

# Convert list to input data matrix
input_data_matrix = np.asarray(input_data_list, dtype=np.float32)

Convolutional Layer of CNN

The convolution layer is the core building block of a CNN; it is where the convolution operation happens. To understand the convolutional layer, we should first look at how humans recognize an image.


When a person looks at a picture of a dog, they recognize it by looking at small features such as the eyes, nose and ears, detecting these features one by one.

Different sets of neurons in our brain work on these different features. These neurons are connected to another set of neurons that aggregates their results. After the final aggregation, our brain decides whether the image shows a dog or not.

How human recognize any image CNN tutorial

Now let’s understand how a CNN recognizes an image (dog or not).

A computer uses the concept of a filter to recognize these tiny features (eye, nose, leg etc.).

Let’s say we want to detect whether an image shows the handwritten digit 9. To do so we can use three filters:

  1. Box type filter
  2. Vertical line Filter
  3. Diagonal line Filter

Kernels to detect Digit 9 using ConvNet

In the convolution layer we apply the convolution operation (filter operation) to our input image. The filter is also known as a kernel.

First let’s apply the box filter to our image. For this CNN tutorial I am using a 3×3 filter or kernel (you can use other sizes).

Now let’s understand how the filter operation works. It can be divided into these steps:

  • Line up the filter (kernel) with a patch of the image
  • Multiply each image pixel in the patch by the corresponding filter value
  • Add the products up
  • Find the average by dividing the sum by the total number of elements in the filter matrix (9 in our case, because we are using a 3×3 kernel)
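
The four steps above can be sketched for a single image patch as follows. The patch is taken from the first three rows of the digit 9 image (columns 2 to 4). The box kernel values are my assumption: a white border around a black centre, which is the only pattern of 1 and -1 values that gives this patch a score of exactly 1.

```python
import numpy as np

# 3x3 patch from the top of the digit-9 image (rows 0-2, columns 1-3, zero-based)
patch = np.array([[1,  1, 1],
                  [1, -1, 1],
                  [1,  1, 1]], dtype=np.float32)

# Assumed box kernel: white border around a black centre
box_kernel = np.array([[1,  1, 1],
                       [1, -1, 1],
                       [1,  1, 1]], dtype=np.float32)

products = patch * box_kernel        # multiply each pixel by the matching filter value
total = products.sum()               # add the products up
score = total / box_kernel.size      # divide by the number of filter elements (9)

print(total, score)  # 9.0 1.0
```

A score of 1 means the patch matches the filter perfectly; a score of -1 would mean every pixel is the opposite of the filter.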

Let me explain one convolution operation with the animation below:

How Filter Operation works

In the same way we can calculate the entire output matrix of the convolution layer. This output matrix is called a feature map. The animation below shows the box filter operation.

Full Feature Map Calculation

Here I am using a stride of one. The stride is the number of pixels the filter moves between two neighbouring positions. You can also use a stride of two or three, and any filter or kernel size (for this tutorial I have used 3×3).
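
As a rough sketch of how stride and kernel size determine the size of the feature map, the helper below (feature_map_shape is a name I made up) computes the output shape when no padding is used. Shapes are written as (rows, columns), so the 5x7 image of this tutorial is (7, 5).

```python
def feature_map_shape(image_shape, kernel_shape, stride=1):
    # Number of positions the kernel can take along each axis (no padding)
    rows = (image_shape[0] - kernel_shape[0]) // stride + 1
    cols = (image_shape[1] - kernel_shape[1]) // stride + 1
    return rows, cols

print(feature_map_shape((7, 5), (3, 3), stride=1))  # (5, 3)
print(feature_map_shape((7, 5), (3, 3), stride=2))  # (3, 2)
```

A larger stride skips positions, so the feature map gets smaller.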

After finishing the convolution operation on the input image, you get a feature map. Wherever you see a value equal or close to 1 in the feature map, the feature (here a box-type shape) has been detected at that position.

Detected Box type shape

For the box filter operation we found the value 1 in the top row, second column. So it detects a box-type shape at the top of the image of the digit 9. For a dog the detected feature could similarly be a nose or an eye, because eyes, legs, nose etc. are the features of a dog.

For the digit 9 it detects the box-type shape at the top; for the digit 6 it might detect the box-type shape at the bottom.

feature detection for hand written digit 6 cnn

Similarly, if you use the vertical line filter, it will detect the vertical line in the middle of the feature map.

Vertical Line shape detected for hand written digit 9

If you use the diagonal line filter, it will detect the diagonal line at the bottom of the feature map.

Diagonal Shape detected for hand written digit 9

So, in simple words, filters are nothing but feature detectors.

For digit 9 filters are:

  • Box type filter
  • Vertical line Filter
  • Diagonal line Filter

Similarly for Dog it can be:

  • Eye filter
  • Nose filter
  • Front leg filter
  • Back leg filter
  • etc.

Convolution Operation in Python and Numpy

# *******************************************************************************
# Convolution layer (convolution operation)
# *******************************************************************************

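box_filter_list = [
    [1,  1, 1],
    [1, -1, 1],  # assumption: this row is missing in the listing; a black centre matches the box at the top of the 9
    [1,  1, 1]
]

vertical_line_Filter_list = [
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1]
]

Diagonal_line_Filter_list = [
    [-1, -1, 1],
    [-1, 1, -1],
    [1, -1, -1]
]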

# Convert all filter list into matrix
box_filter_matrix = np.asarray(box_filter_list, dtype=np.float32)
vertical_line_Filter_matrix = np.asarray(vertical_line_Filter_list, dtype=np.float32)
Diagonal_line_Filter_matrix = np.asarray(Diagonal_line_Filter_list, dtype=np.float32)

# Extract every kernel-sized window from the input matrix for the given stride
def strided4D_v2(input_image_matrix, kernal_matrix, stride):
    return view_as_windows(input_image_matrix, kernal_matrix.shape, step=stride)

# Shape of the feature map (output matrix of the convolution layer) for stride 1
featureMap_row, featureMap_col = strided4D_v2(input_data_matrix, box_filter_matrix, 1).shape[:2]

# Calculate the feature map for a given filter (stride 1)
def conv2d(input_matrix, kernal_matrix):
    # All windows at once: shape (rows, cols, kernel_rows, kernel_cols)
    windows = strided4D_v2(input_matrix, kernal_matrix, 1)
    featureMap_Output = np.zeros(windows.shape[:2])

    for row in range(windows.shape[0]):
        for col in range(windows.shape[1]):
            # Multiply window and kernel element-wise, then add the products up
            featureMap_Output[row, col] = np.sum(np.multiply(kernal_matrix, windows[row, col]))

    # Take the average by dividing by the number of elements in the filter (9 for 3x3)
    return featureMap_Output / kernal_matrix.size

# ------------------------------------------------------------------------
# Box Filter operation
# ------------------------------------------------------------------------
# Output after applying the box filter with stride 1
featureMap_Box = conv2d(input_data_matrix, box_filter_matrix)

# ------------------------------------------------------------------------
# Vertical line Filter operation
# ------------------------------------------------------------------------
featureMap_Vertical = conv2d(input_data_matrix, vertical_line_Filter_matrix)
# print(featureMap_Vertical)

# ------------------------------------------------------------------------
# Diagonal line Filter operation
# ------------------------------------------------------------------------
# Output after applying Diagonal line Filter with stride 1
featureMap_Diagonal = conv2d(input_data_matrix, Diagonal_line_Filter_matrix)
# print(featureMap_Diagonal)

ReLu Operation

Once we have calculated the feature maps (the output of the convolution layer), we need to apply the ReLu activation function to each of them (box filter, vertical filter and diagonal filter). The ReLu operation adds nonlinearity to the convolutional neural network model.

What does the ReLu activation function do in a CNN? It simply converts every negative number in the feature map to 0. Values greater than 0 are kept as they are.
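
Here is a tiny illustration of that rule. The numbers in feature_map are made up for the example and are not the real feature map of the digit 9.

```python
import numpy as np

# Hypothetical feature map values, chosen only to show the effect of ReLu
feature_map = np.array([[-0.33, 0.11, 1.0],
                        [0.56, -1.0, 0.0]])

# Negative values become 0, everything else is kept unchanged
relu_output = np.maximum(feature_map, 0)
print(relu_output)
```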

Now let’s see the ReLu activation output for each feature map:

ReLu operation on Box Filter Feature Map

ReLu operation on Vertical line Filter Feature Map

ReLu operation on Diagonal line Filter Feature Map

ReLu Operation in Python and Numpy

# *******************************************************************************
# ReLu Operation
# *******************************************************************************
featureMap_Box_ReLu = np.maximum(featureMap_Box, 0)
featureMap_Vertical_ReLu = np.maximum(featureMap_Vertical, 0)
featureMap_Diagonal_ReLu = np.maximum(featureMap_Diagonal, 0)


Pooling Layer

If you are working with high-resolution images, the convolution operation alone does not reduce the size of the image very much, so the computation issue is still not solved.

The pooling layer is used to reduce the size (dimensions) of the feature map, which reduces the amount of computation.

Types of pooling in CNN

There are mainly two types of pooling:

  1. Max Pooling
  2. Average Pooling

In this CNN tutorial I will be using max pooling. As with a filter or kernel, you have to specify the pooling window size and the stride. For this CNN lesson I am using a 2×2 pooling window with a stride of 1.

How Max Pooling works

The pooling operation is simple. Max pooling takes the maximum value of each window, and average pooling takes the average value of each window. You can use any window size for pooling.
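
The idea can be sketched with two plain loops. This is a simplified stand-in with a name of my own (pool2d_simple) and made-up input values, written only to show what happens inside each window.

```python
import numpy as np

def pool2d_simple(matrix, size=2, stride=1, mode="max"):
    # Output shape without padding
    rows = (matrix.shape[0] - size) // stride + 1
    cols = (matrix.shape[1] - size) // stride + 1
    output = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = matrix[r * stride:r * stride + size, c * stride:c * stride + size]
            output[r, c] = window.max() if mode == "max" else window.mean()
    return output

# Hypothetical 3x3 feature map
example = np.array([[1.0, 0.0, 0.5],
                    [0.0, 0.25, 0.0],
                    [0.0, 0.0, 0.75]])

max_pooled = pool2d_simple(example, mode="max")   # [[1.0, 0.5], [0.25, 0.75]]
avg_pooled = pool2d_simple(example, mode="avg")   # [[0.3125, 0.1875], [0.0625, 0.25]]
print(max_pooled)
print(avg_pooled)
```

With a 2×2 window and stride 1, a 5-row by 3-column feature map shrinks to 4 rows by 2 columns.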

Let’s see how max pooling works in the animation below:

how max pooling works in Convolutional Neural Network

Pooling Layer output of Box Filter Feature Map

Pooling Layer output of Vertical line Filter Feature Map

Pooling Layer output of Diagonal line Filter Feature Map

As you can see, after the pooling operation each feature is still detected at its position, but the dimensions have been reduced. Our input image was 5×7 and the pooling layer output is just 2×4. This is how a convolutional neural network reduces the computation cost of image processing.


Benefit of Pooling Layer

  • Reduces computation by reducing the dimensions
  • Reduces overfitting (as there are fewer parameters)

Pooling Operation in Python and Numpy

# *******************************************************************************
# Pooling Layer
# *******************************************************************************

# Pooling function with stride using python and numpy
def pool2d(input_matrix, kernel_size, stride, padding, pool_mode='max'):

    # Padding
    input_matrix = np.pad(input_matrix, padding, mode='constant')

    # Window view of input_matrix
    output_shape = ((input_matrix.shape[0] - kernel_size)//stride + 1,
                    (input_matrix.shape[1] - kernel_size)//stride + 1)
    kernel_size = (kernel_size, kernel_size)
    input_matrix_w = as_strided(input_matrix, shape = output_shape + kernel_size,
                        strides = (stride*input_matrix.strides[0],
                                   stride*input_matrix.strides[1]) + input_matrix.strides)
    input_matrix_w = input_matrix_w.reshape(-1, *kernel_size)

    # Return the result of pooling
    # For Max Pooling
    if pool_mode == 'max':
        return input_matrix_w.max(axis=(1,2)).reshape(output_shape)
    # For Average Pooling
    elif pool_mode == 'avg':
        return input_matrix_w.mean(axis=(1,2)).reshape(output_shape)

# Max Pooling with 2x2 filter & Stride = 1
featureMap_Box_ReLu_MaxPool = pool2d(featureMap_Box_ReLu, kernel_size=2, stride=1, padding=0, pool_mode='max')
featureMap_Vertical_ReLu_MaxPool = pool2d(featureMap_Vertical_ReLu, kernel_size=2, stride=1, padding=0, pool_mode='max')
featureMap_Diagonal_ReLu_MaxPool = pool2d(featureMap_Diagonal_ReLu, kernel_size=2, stride=1, padding=0, pool_mode='max')

Fully Connected Layer & Output Layer

In this layer you just need to flatten each pooling layer output and join (stack) the flattened data. At this point you have extracted the features of the image (using all of your filters or kernels) and turned them into structured data. Now you just need to apply a basic neural network to classify whether the input image is the digit 9 or not.
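
As a compact sketch of the flatten-and-stack step, the same result can be obtained directly with NumPy. The three pooled maps below are placeholders with the 4-row by 2-column shape used in this tutorial; their values are made up.

```python
import numpy as np

# Placeholder pooled feature maps (values are hypothetical)
pooled_box = np.arange(8, dtype=np.float32).reshape(4, 2)
pooled_vertical = np.ones((4, 2), dtype=np.float32)
pooled_diagonal = np.zeros((4, 2), dtype=np.float32)

# Flatten each map to 1D and join them into one feature vector
flat_input = np.concatenate([m.ravel() for m in (pooled_box, pooled_vertical, pooled_diagonal)])
print(flat_input.shape)  # (24,)
```

Three maps of 8 values each give a vector of 24 input features for the classifier.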

Fully connected layer in Convolutional neural network

Fully Connected Layer in Python and Numpy

# *******************************************************************************
# Fully Connected Layer
# *******************************************************************************
# Convert array to list
featureMap_Box_ReLu_MaxPool_list = featureMap_Box_ReLu_MaxPool.tolist()
featureMap_Vertical_ReLu_MaxPool_list = featureMap_Vertical_ReLu_MaxPool.tolist()
featureMap_Diagonal_ReLu_MaxPool_list = featureMap_Diagonal_ReLu_MaxPool.tolist()

# Convert list of list to flat list
featureMap_Box_ReLu_MaxPool_FlatList = [item for sublist in featureMap_Box_ReLu_MaxPool_list for item in sublist]
featureMap_Vertical_ReLu_MaxPool_FlatList = [item for sublist in featureMap_Vertical_ReLu_MaxPool_list for item in sublist]
featureMap_Diagonal_ReLu_MaxPool_FlatList = [item for sublist in featureMap_Diagonal_ReLu_MaxPool_list for item in sublist]

# Stack all flat list data
input_to_basic_neural_network = featureMap_Box_ReLu_MaxPool_FlatList + featureMap_Vertical_ReLu_MaxPool_FlatList + featureMap_Diagonal_ReLu_MaxPool_FlatList

Output Layer in Python and Numpy

# *******************************************************************************
# Output Layer
# *******************************************************************************

# In this layer you just need to apply a basic neural network to input_to_basic_neural_network. You can use the sklearn library for this.
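
If you want to see what such a basic neural network computes, here is a minimal NumPy sketch of one forward pass. Everything in it is an assumption made for illustration: the 24 input features are random stand-ins for input_to_basic_neural_network, the hidden layer has 8 units, and the weights are random and untrained, so the printed probability has no meaning yet.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(features, w_hidden, b_hidden, w_out, b_out):
    # Fully connected layer followed by ReLu
    hidden = np.maximum(features @ w_hidden + b_hidden, 0)
    # Output layer: a single probability for "digit 9"
    return sigmoid(hidden @ w_out + b_out)

features = rng.random(24)                 # stand-in for input_to_basic_neural_network
w_hidden = rng.normal(0, 0.1, (24, 8))    # untrained random weights
b_hidden = np.zeros(8)
w_out = rng.normal(0, 0.1, 8)
b_out = 0.0

probability = forward(features, w_hidden, b_hidden, w_out, b_out)
print(float(probability))
```

Training would adjust w_hidden, w_out and the biases so that the probability is close to 1 for images of the digit 9 and close to 0 otherwise.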


In any data science project you first extract the important features from the dataset and then apply a classification algorithm. A Convolutional Neural Network follows the same two steps:

  1. Feature Extraction: The input layer, convolution layer and pooling layer are used to extract the important features
  2. Classification: Fully connected layer and output layer

The fun part of a Convolutional Neural Network is that you do not need to define the values of the filters or kernels yourself.

A CNN calculates the filter values automatically through backpropagation. When you start training, you can fill the filters with placeholder values (small random values are the usual choice in practice), and during training backpropagation automatically adjusts them towards the best possible kernel values.
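
To give a feeling for how filter values can be learned, here is a heavily simplified sketch. It is plain gradient descent on a single 3×3 filter with a mean squared error loss, not full backpropagation through a whole CNN, and it skips the divide-by-9 averaging used earlier. The image is random, and the target is the feature map produced by a known vertical line filter; all names and values are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def conv2d_valid(image, kernel):
    # Slide the kernel over the image with stride 1 and no padding
    k_rows, k_cols = kernel.shape
    rows = image.shape[0] - k_rows + 1
    cols = image.shape[1] - k_cols + 1
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            out[r, c] = np.sum(image[r:r + k_rows, c:c + k_cols] * kernel)
    return out

image = rng.choice([-1.0, 1.0], size=(7, 5))        # hypothetical 7-row, 5-column input
target_kernel = np.array([[-1.0, 1.0, -1.0]] * 3)   # filter we would like to recover
target_map = conv2d_valid(image, target_kernel)

kernel = rng.normal(0, 0.1, size=(3, 3))            # random starting values
learning_rate = 0.05
losses = []
for step in range(200):
    error = conv2d_valid(image, kernel) - target_map
    losses.append(np.mean(error ** 2))
    # Gradient of the mean squared error with respect to each kernel entry
    grad = np.zeros_like(kernel)
    for i in range(3):
        for j in range(3):
            grad[i, j] = 2 * np.mean(error * image[i:i + error.shape[0], j:j + error.shape[1]])
    kernel -= learning_rate * grad

print(losses[0], losses[-1])
```

The loss after training is lower than at the start, which means the randomly initialized filter has moved towards values that reproduce the target feature map.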
