CNNs: Understanding & Building Them with a Step-by-Step Guide

Devesh Surve
15 min read · Feb 1, 2024


Hey Everyone,

Here’s the topic for today: Not just understanding CNNs, but understanding AND building them. I’ve noticed a lot of resources either skate around the practical aspects or dive too deep into theory.

So, I’m here to offer a middle ground — a way to not only grasp what CNNs and their components are (think interview gold) but also to provide a step-by-step guide for personal projects. I wanted to understand every line of code I wrote, not just blindly follow instructions. With that said, let’s jump in.

Table of Contents:

  1. The CNN Model: An Overview
  2. The Problem with Building CNNs: Which Layer Does What, and How Many Do You Need?
  3. Explaining Convolutional Layers
  4. Why Do We Flatten?
  5. So What Are Dense Layers?
  6. What’s a Pooling Layer?
  7. What Are Activation Functions?
  8. What’s Batch Normalization?
  9. What Are Dropouts?
  10. What Are Optimizers & Learning Rates?
  11. Handling Epochs, Batches & Early Stopping

1. The CNN Model: An Overview

Convolutional Neural Networks (CNNs) are the workhorses of image recognition, powering everything from facial recognition to self-driving cars. At their essence, CNNs learn to recognize patterns in images — such as lines, shapes, and textures — automatically identifying features that define an object.

But here’s the kicker: building a CNN is like stacking Legos. You have different blocks (layers) each with a role, and how you stack them affects what the model sees and learns. It’s not about adding layers until it works; it’s about understanding what each layer does and using them to capture the essence of the images.

As we proceed, we’ll decode these layers, from the ones spotting edges in an image to those making sense of the overall picture. Ready to build your own CNN that sees the world as we do? Let’s get our hands dirty.

2. The Problem with Building CNNs: Which Layer Does What, and How Many Do You Need?

Building a CNN can sometimes feel like being a chef in a gourmet kitchen. You have all these ingredients (layers), but knowing how much of each to use and in what order can make the difference between a Michelin-star dish and a kitchen nightmare. The main challenge? Understanding the role of each layer and how it contributes to the overall learning process.

Let’s break it down:

  • Convolutional Layers: Think of these as your sous-chefs, specialized in chopping and preparing the ingredients. They’re the ones identifying patterns in the image, like edges and textures. But just as having too many sous-chefs can clutter the kitchen, stacking too many convolutional layers without purpose can make your model complex and slow without adding value.
  • Pooling Layers: These are your pot washers, condensing and simplifying what the sous-chefs provide. They reduce the dimensionality of the data, making the model more manageable and focusing on the essential features. However, overdoing it can wash away too much information, leaving your model with too little to learn from.
  • Dense (Fully Connected) Layers: Consider these the head chefs, making the final decisions. They take the simplified, processed inputs and decide what the image represents. But as in any kitchen, too few head chefs might mean not enough insight, while too many can lead to confusion and inefficiency.

The trick to a well-functioning CNN kitchen? Balance. You need just the right number of each type of layer to work harmoniously together, ensuring your model is efficient, accurate, and capable of learning from the data provided.

In the next sections, we’ll dive deeper into each type of layer, starting with convolutional layers, and discuss how to strike the perfect balance for your CNN masterpiece.
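
Before we do, here’s a minimal baseline sketch of how these three layer types stack. I’ll use TensorFlow/Keras for all the snippets in this post (my framework assumption, since the same ideas translate to PyTorch and others), with 32×32 RGB inputs like CIFAR-10’s and deliberately small, untuned layer sizes:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A deliberately tiny baseline (sizes are illustrative, not tuned):
# one "sous-chef" (Conv2D), one "pot washer" (MaxPooling2D),
# and one "head chef" (Dense) on top.
baseline = models.Sequential([
    layers.Input(shape=(32, 32, 3)),         # 32x32 RGB images, e.g. CIFAR-10
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),  # one output per CIFAR-10 class
])
baseline.summary()  # prints each layer's output shape and parameter count
```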

3. Explaining Convolutional Layers

Imagine you’re looking at a beautiful mosaic, each tile representing a part of a bigger picture. Convolutional layers in a CNN work similarly, examining small pieces of an image to understand the larger scene. These layers are the detectives of the network, using their filters (or kernels) to search for clues in the form of patterns, textures, and shapes across the image.

Here’s how they do it:

  • Filters: Each filter in a convolutional layer is like a magnifying glass, focusing on specific features like edges or color gradients. As the filter slides over the image (a process called convolution), it creates a map of where those features occur. Think of it as highlighting all the places in a book where your favorite character appears.
  • Feature Maps: The result of applying a filter over the image is a feature map. If our image is a complex scene, the feature map simplifies it into a canvas that emphasizes the detected features, making it easier for the network to understand.

But why are these layers so crucial? Because they allow the network to see the world in a hierarchical way. The first convolutional layer might catch simple patterns like lines and edges. Layers deeper in the network combine these simple patterns to recognize more complex shapes and eventually whole objects.

The magic of convolutional layers lies in their ability to learn these filters automatically from the training data. Instead of telling the network to look for specific features, it learns from examples, adjusting its filters to improve accuracy over time.

By stacking convolutional layers, we build a deep understanding of the visual world, layer by layer, from simple to complex. This is what makes CNNs so powerful for tasks like image recognition.
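
As a minimal sketch of this in Keras: a convolutional layer is just a bank of learnable filters, and its output stacks one feature map per filter (the 16 filters below are an arbitrary choice for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

image = tf.random.normal((1, 32, 32, 3))  # a batch of one fake 32x32 RGB image

# 16 learnable 3x3 filters slide over the image; padding="same" keeps
# the spatial size, so we get one 32x32 feature map per filter.
conv = layers.Conv2D(16, (3, 3), padding="same", activation="relu")
feature_maps = conv(image)
print(feature_maps.shape)  # (1, 32, 32, 16)
```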

4. Why Do We Flatten?

After our network has combed through the image with convolutional and pooling layers, identifying and condensing the features, we’re left with a high-level understanding of the visual content. This understanding, however, is still in the form of multi-dimensional arrays (the feature maps). Before we can make a final decision (classification), we need a way to process this information in a fully connected (dense) layer. That’s where flattening comes in.

Imagine you’ve just completed a jigsaw puzzle, and you need to store it away without losing the big picture. Flattening is like taking that puzzle, carefully aligning all the pieces in a single row so they can be analyzed sequentially. In technical terms, we convert the multi-dimensional array of features into a one-dimensional array.

This process is crucial because dense layers, which make the final decisions, expect input in a flat, one-dimensional format. By flattening the feature maps, we prepare the data for this last stage of processing, where the network combines all the learned features to make a classification (e.g., identifying the image as a cat, dog, etc.).

It’s a simple yet vital step, bridging the gap between the spatial hierarchy learned by convolutional layers and the classification logic of dense layers. Without flattening, the transition from recognizing patterns to making predictions wouldn’t be possible.
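
In Keras this is a single layer with no learned parameters. A quick sketch with made-up feature-map sizes shows the shape change:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Say the conv/pooling stages left us with 64 feature maps of size 8x8.
maps = tf.random.normal((1, 8, 8, 64))
flat = layers.Flatten()(maps)
print(flat.shape)  # (1, 4096): the 8 * 8 * 64 values laid out in one row
```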

5. So What Are Dense Layers?

Once we’ve flattened our data, it’s time for the dense layers, also known as fully connected layers, to take the stage. These layers are the brain of our operation, where all the high-level reasoning happens. If convolutional layers are the detectives gathering clues and flattening is like laying out all the clues on a table, dense layers are the judges making the final verdict.

Dense layers work by taking all the inputs from the previous layer (now a flat, one-dimensional array) and determining the patterns that most strongly suggest a particular classification. Each neuron in a dense layer is connected to every input, which is why it’s called “fully connected.” These neurons then weigh the inputs based on their importance for making a correct prediction, essentially voting on what the image represents.

Here’s the kicker: the power of dense layers comes from their ability to learn these weights through training. By adjusting these weights, the network gets better and better at making accurate predictions. It’s a bit like refining your recipe based on feedback until your dish is just right.

But why are dense layers placed after convolutional and flattening steps?

Because they require a global understanding of the image, which is only possible after the features have been identified, highlighted, and organized by the preceding layers. Dense layers synthesize all this information, making sense of the patterns to classify the image accurately.

In essence, dense layers are where all the learned features come together to make a coherent decision. Without them, our CNN would be like a detective gathering clues without ever solving the case.
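
Here’s a small sketch of a typical classification “head” (the 4096 input size and 128 hidden neurons are carried over from the flattening example above, purely for illustration):

```python
from tensorflow.keras import layers, models

head = models.Sequential([
    layers.Input(shape=(4096,)),             # the flattened feature vector
    layers.Dense(128, activation="relu"),    # every neuron sees all 4096 inputs
    layers.Dense(10, activation="softmax"),  # one probability per class
])
# Dense(128) alone holds 4096 * 128 weights + 128 biases = 524,416 parameters,
# which is why dense layers dominate the parameter count of small CNNs.
head.summary()
```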

6. What’s a Pooling Layer?

Pooling layers are the unsung heroes of CNNs, optimizing the network by reducing the dimensions of the feature maps without losing the essential information. Imagine you’re an artist tasked with creating a miniature version of a large painting. You’d want to preserve the painting’s essence, capturing the most important elements in a smaller space. That’s essentially what pooling layers do for CNNs.

There are two main types of pooling: max pooling and average pooling.

  • Max Pooling is like looking at small windows of a feature map and keeping only the strongest value in each window. This ensures that the most prominent responses are retained while the amount of data being processed shrinks. It’s particularly effective at preserving the presence of features detected by the convolutional layers.
  • Average Pooling, on the other hand, takes the average of the values in each window. This approach is smoother and less aggressive than max pooling, distributing the focus more evenly across features.

But why do we need pooling? There are a few reasons:

  1. Efficiency: By reducing the size of the data (downsampling), pooling layers make the network faster and reduce computational costs.
  2. Preventing Overfitting: Smaller data means there’s less risk of the model memorizing the training data too closely, helping it generalize better to unseen images.
  3. Feature Preservation: Pooling makes the detection of features more robust to small shifts and distortions in the input, since the exact position of a feature within each window no longer matters.

In essence, pooling layers streamline the network’s learning process, ensuring that it focuses on the most salient features of the data while maintaining efficiency and robustness. They’re a key part of the balance between detail retention and computational efficiency in CNNs.
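
Here’s a side-by-side sketch of the two pooling flavors in Keras, applied to made-up feature maps:

```python
import tensorflow as tf
from tensorflow.keras import layers

maps = tf.random.normal((1, 32, 32, 16))  # 16 feature maps of size 32x32

max_pooled = layers.MaxPooling2D((2, 2))(maps)      # strongest value per 2x2 window
avg_pooled = layers.AveragePooling2D((2, 2))(maps)  # mean value per 2x2 window
print(max_pooled.shape, avg_pooled.shape)  # both (1, 16, 16, 16): width and height halved
```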

7. What Are Activation Functions?

Activation functions are the soul of neural networks, including CNNs. They decide whether a neuron should be activated or not, based on the weighted sum of its inputs. Think of each neuron in your network as a gatekeeper, and the activation function is the rule that tells the gatekeeper when to open the gate.

Why are they important? Because without activation functions, our network would collapse into a single linear transformation no matter how many layers we stacked, incapable of learning complex patterns in data. Activation functions introduce non-linearity into the network, allowing it to tackle problems as complicated as image recognition.

There are several types of activation functions, but let’s focus on a few commonly used in CNNs:

  • ReLU (Rectified Linear Unit): This function is simple yet effective, turning all negative values to zero and keeping positive values as is. It’s like saying, “If the information is useful (positive), let it through; otherwise, block it.” ReLU is popular because it is cheap to compute, speeds up training, and helps mitigate the vanishing gradient problem.
  • Softmax: Often used in the final layer of a classification network, the softmax function converts the output scores from neurons into probabilities by dividing the exponential of each output by the sum of all exponentials. This way, you get a probability distribution across classes, making it clear which class the model predicts.
  • Sigmoid and Tanh: These functions were more common in the early days of neural networks. Sigmoid squashes the values between 0 and 1, making it useful for binary classification. Tanh, on the other hand, normalizes values between -1 and 1, offering a centered scaling. However, both can lead to the vanishing gradient problem, where gradients become too small for effective learning.

Activation functions play a critical role in learning, determining the complexity and versatility of the model. By deciding which signals should pass through, they enable the network to make nuanced decisions based on the learned features from the input data.
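
You can see the difference directly by running a few raw scores through each function (using TensorFlow’s built-ins here):

```python
import tensorflow as tf

scores = tf.constant([[-2.0, 0.5, 3.0]])  # raw outputs from three neurons

print(tf.nn.relu(scores))     # [[0.  0.5 3. ]]  negatives blocked, positives pass
print(tf.nn.sigmoid(scores))  # each value squashed into (0, 1)
print(tf.nn.tanh(scores))     # each value squashed into (-1, 1)
print(tf.nn.softmax(scores))  # probabilities across the three "classes", summing to 1
```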

8. What’s Batch Normalization?

Batch normalization is like the secret ingredient in a recipe that makes everything better. Introduced to address internal covariate shift (where the distribution of a layer’s inputs changes during training, making the model harder to train), batch normalization standardizes the inputs of each layer within the network: it adjusts them to have a mean of 0 and a variance of 1, computed over each mini-batch, then applies a learned scale and shift, much like data is often standardized before training.

Why is this beneficial? Here are a few key reasons:

  • Faster Training: By normalizing the inputs across mini-batches, batch normalization allows the use of higher learning rates, speeding up the training process. It’s like greasing the wheels of your bike, so you can pedal faster without more effort.
  • Reduced Overfitting: Batch normalization has a regularization effect, meaning it helps prevent the model from too closely fitting the noise in the training data. This is partly because the mini-batch statistics add a bit of noise to the signals within the network, making it more robust.
  • Eases Initialization: Choosing the right initialization method for weights can be tricky. Batch normalization helps alleviate this problem by reducing the sensitivity of the network to the initial weights.

In practice, batch normalization is typically applied after convolutional layers but before the activation function. This positioning ensures that the normalized (and rescaled) values are what the activation function sees, leading to more stable and efficient training.

By stabilizing the learning process, batch normalization allows the network to learn more effectively, making it a popular choice in many CNN architectures.
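
A sketch of that ordering in Keras: note that the Conv2D layer has no activation of its own, so BatchNormalization sits between the convolution and the ReLU.

```python
from tensorflow.keras import layers, models

block = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), padding="same"),  # no activation here on purpose
    layers.BatchNormalization(),                # standardize per mini-batch, then
                                                # rescale with learned parameters
    layers.Activation("relu"),                  # non-linearity applied last
])
```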

9. What Are Dropouts?

Imagine you’re working on a group project, and to ensure no one person becomes too crucial (and to prevent over-reliance on any single member), you randomly assign tasks to different people each time you meet. This is akin to what dropout does in a neural network. It randomly “drops out” a subset of neurons (i.e., temporarily removes them from the network) during each training batch, forcing the network to learn more robust features that are not dependent on any small set of neurons.

Here’s why dropout is so effective:

  • Prevents Overfitting: By randomly removing neurons, dropout prevents the network from memorizing the training data. The network has to be adaptable, learning more generalized patterns that are useful across different data samples, not just the training set.
  • Encourages Redundancy: When different neurons can’t rely on the presence of others, the network develops a more redundant representation of the data. This redundancy means that the network doesn’t depend too heavily on any single feature, enhancing its ability to generalize from the training data to new, unseen data.
  • Simple yet Powerful: Dropout is remarkably straightforward to implement but can lead to significant improvements in model performance, especially in complex networks prone to overfitting.

Typically, dropout is applied to the fully connected layers of a CNN, where overfitting is most likely to occur due to the large number of parameters. However, it can also be used after convolutional layers, depending on the architecture and specific challenges of the task at hand.

In essence, dropout introduces a form of training noise, making the model more robust and less prone to the pitfalls of overfitting, ensuring that the network remains adaptable and generalized.
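
In Keras, dropout is one line between layers; the 0.5 rate below is a common default for dense layers, not a tuned value:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(4096,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                     # each training step, half of these 128
                                             # outputs are randomly zeroed
    layers.Dense(10, activation="softmax"),
])
# Dropout is only active during training; Keras disables it automatically
# at inference time, so predictions stay deterministic.
```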

10. What Are Optimizers & Learning Rates?

In the journey of training a neural network, the optimizer is the navigator, guiding the model towards the lowest point of loss, where it makes the most accurate predictions. The learning rate, on the other hand, dictates the size of the steps the model takes on this journey. Too big, and it might overshoot the destination; too small, and the journey becomes tediously long.

Optimizers are algorithms that adjust the attributes of the neural network, such as weights and learning rate, to reduce losses. They play a crucial role in how quickly and effectively a model learns. Some of the most popular optimizers include:

  • SGD (Stochastic Gradient Descent): The simplest form of optimizer. It updates the model’s weights using the gradient of the loss function with respect to the weights, multiplied by the learning rate. It’s like navigating through fog with a compass; basic but reliable.
  • Momentum: Improves on SGD by ‘remembering’ the previous updates and applying them to the current update, essentially gaining speed (momentum) to navigate through flat areas of the loss landscape more effectively.
  • Adam (Adaptive Moment Estimation): Combines the benefits of two other extensions of SGD, AdaGrad and RMSProp, to adjust the learning rate on a per-parameter basis. It’s particularly effective and widely used due to its adaptability across different problems and data types.

Learning Rate is arguably one of the most important hyperparameters to tune for training a neural network. It controls how much we adjust the weights of our network with respect to the loss gradient. The right learning rate helps the model to learn efficiently.

  • Fixed Learning Rate: Keeps the learning rate constant throughout training. It’s simple but might not be efficient for all problems.
  • Adaptive Learning Rate: Adjusts the learning rate dynamically during training. Techniques like learning rate scheduling (gradually decreasing the learning rate) or optimizers like Adam that adjust the learning rate per parameter fall into this category.

Choosing the right optimizer and learning rate is crucial for effective model training. It can make the difference between a model that learns quickly and accurately and one that doesn’t learn at all or converges too slowly.
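
Here’s how those choices look in Keras (the learning rates below are common starting points, not recommendations):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# Plain SGD with a fixed step size, plus momentum:
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Adam adapts the effective step size per parameter:
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)

model.compile(optimizer=adam,
              loss="sparse_categorical_crossentropy",  # integer class labels
              metrics=["accuracy"])
```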

11. Handling Epochs, Batches & Early Stopping

Training a neural network is a balancing act. It involves determining not just how the model learns (via learning rates and optimizers) but also how much it learns at a time (batch size) and how long it keeps learning (number of epochs). Additionally, early stopping serves as a safeguard against overfitting by halting the training process at the right moment. Let’s break these concepts down:

  • Epochs: An epoch represents one full cycle through the entire training dataset. Increasing the number of epochs allows the network more opportunities to learn from the data. However, too many epochs can lead to overfitting, where the model learns the training data too well, including its noise and outliers, at the expense of its ability to generalize to new data.
  • Batches: Instead of feeding the entire dataset into the network at once (which can be memory intensive), we divide it into smaller batches. Batch size affects the model’s learning process; smaller batches mean more updates per epoch, which can lead to faster learning, but too small a batch might increase the noise in each update, making the training process less stable.
  • Early Stopping: This is a form of regularization used to avoid overfitting by stopping the training when the model’s performance on a validation set starts to deteriorate, i.e., when it begins to learn the noise in the training set rather than the signal. It’s like stopping the game when you’re ahead to secure your winnings. By monitoring the model’s performance on a set of data not seen during training (validation set), we can stop the training at the point where performance peaks.

Implementing early stopping involves setting aside a portion of the training data as a validation set and periodically evaluating the model’s performance on this set. If the model’s performance on the validation set fails to improve for a specified number of epochs, training is halted. This approach helps ensure that the model retains its ability to generalize well to new, unseen data.

Together, managing epochs, batch size, and implementing early stopping form a triad of strategies that optimize the training process, ensuring that the model learns efficiently and effectively without overfitting or underfitting.
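
Putting the triad together on CIFAR-10, here’s a minimal end-to-end sketch; the batch size and patience are illustrative values you should tune for your own runs:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# CIFAR-10: 50,000 training / 10,000 test images, 32x32 RGB, 10 classes.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop when validation loss hasn't improved for 3 epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=3,
                                              restore_best_weights=True)

model.fit(x_train, y_train,
          epochs=50,             # an upper bound; early stopping usually ends sooner
          batch_size=64,         # 64 images per weight update
          validation_split=0.1,  # hold out 10% of training data for validation
          callbacks=[early_stop])
```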

Conclusion

CNNs are more than just layers and algorithms; they’re a testament to the power of machine learning to mimic and extend human capabilities in image recognition and beyond. As you embark on building your own CNNs, remember:

  1. Balance is Key: From layer architecture to learning rates, finding the right balance is crucial for effective model training.
  2. Understanding Over Memorization: Grasping what each component does and why it’s used will empower you to design better models, not just replicate existing ones.
  3. Experimentation Leads to Mastery: The best way to learn is by doing. Experiment with different configurations, datasets, and challenges to deepen your understanding and discover what works best for your specific needs.

As we conclude this guide, I encourage you to apply these insights and principles to your projects. Dive into building your own CNNs, experiment with different architectures and settings, and see firsthand how these networks can learn to see the world through data.

And I don’t usually ask for these, but if you read all the way down, I’d really appreciate a clap! Follow for more such content!

And if you need anything, feel free to reach out! https://linkedin.com/in/deveshsurve

Have a great day!
