CNNs & Computer Vision
From pixels to perception
Teaching Computers to See
When you look at a photo of a cat, you instantly recognize it. But to a computer, an image is just a grid of numbers—pixel values. Convolutional Neural Networks (CNNs) bridge this gap, learning to extract meaningful patterns from raw pixels.
Why Not Just Use Regular Neural Networks?
Consider a small 256×256 color image. That's 256 × 256 × 3 = 196,608 input values. In a traditional fully-connected network, a first hidden layer of just 10,000 neurons would already need roughly two billion weights. CNNs solve this with three key ideas:
- Local connectivity: Each neuron only looks at a small patch, not the whole image
- Weight sharing: The same pattern detector is used across the entire image
- Hierarchy: Simple features combine into complex ones
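The parameter-count gap is easy to verify with back-of-the-envelope arithmetic. The 10,000-neuron hidden layer and the 64-filter conv layer below are illustrative assumptions, not taken from any specific architecture:

```python
# Fully-connected first layer on a 256x256 RGB image
inputs = 256 * 256 * 3            # 196,608 pixel values
hidden = 10_000                   # hypothetical first hidden layer
fc_weights = inputs * hidden      # ~2 billion weights

# A convolutional layer, by contrast, shares each filter across the image:
filters = 64                      # hypothetical filter count
conv_weights = filters * 3 * 3 * 3  # 64 filters, each 3x3 over 3 channels

print(fc_weights)    # 1,966,080,000
print(conv_weights)  # 1,728
```

Weight sharing is what makes the difference: the convolutional layer's cost depends on the filter size and count, not the image size.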
Convolutions: Pattern Detectors
A convolution slides a small filter (like a magnifying glass) across an image, checking for a specific pattern at each position:
- Edge detector: Highlights boundaries between light and dark
- Corner detector: Finds intersection points
- Texture detector: Recognizes repeating patterns
The filter is just a small grid of numbers (e.g., 3×3). The network learns what numbers to put in each filter during training.
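The sliding-filter idea can be sketched in a few lines of NumPy. This is a minimal single-channel convolution (valid padding, stride 1), not an optimized implementation; the Sobel-style edge kernel is a classic hand-designed example of the kind of filter a CNN would instead learn:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel across an image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Response = elementwise product of the patch and the filter
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

# A hand-made vertical edge detector (Sobel-like):
edge_kernel = np.array([[1, 0, -1],
                        [2, 0, -2],
                        [1, 0, -1]])

# A 5x5 image with a sharp vertical boundary: bright left, dark right
image = np.zeros((5, 5))
image[:, :2] = 1.0

response = convolve2d(image, edge_kernel)
# Columns near the light/dark boundary respond strongly (4.0);
# the uniform dark region on the right responds with 0.
```

During training, the network adjusts the kernel values by gradient descent so that each filter ends up detecting whatever pattern is most useful for the task.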
Feature Hierarchies
CNNs build up understanding in layers:
- Layer 1: Edges, simple textures
- Layer 2: Corners, curves, basic shapes
- Layer 3: Parts of objects (eyes, wheels, windows)
- Layer 4: Whole objects (faces, cars, buildings)
- Final layers: Categories and decisions
This mirrors how our visual cortex processes information!
Pooling: Simplifying and Generalizing
Pooling shrinks the representation by summarizing regions:
- Max pooling: Keep the strongest signal in each region
- Average pooling: Average the values
This makes the network robust to small shifts and reduces computation.
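Max pooling can be sketched as a reshape-and-reduce over non-overlapping windows. This NumPy version assumes 2×2 windows with stride equal to the window size, the most common configuration:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling (stride equals the window size)."""
    h, w = feature_map.shape
    h2, w2 = h // size, w // size
    trimmed = feature_map[:h2 * size, :w2 * size]
    # Group into (size x size) blocks, keep the max of each block
    return trimmed.reshape(h2, size, w2, size).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 3, 2],
               [2, 6, 1, 1]])

pooled = max_pool(fm)
# [[4 5]
#  [6 3]]
```

Note that the strongest value in each 2×2 block survives wherever it sits inside the block, which is exactly why small shifts of the input often leave the pooled output unchanged.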
Receptive Fields
A neuron's receptive field is the region of the original image it can "see." As you go deeper:
- Early neurons see tiny patches (3×3 pixels)
- Middle neurons see larger areas (dozens of pixels)
- Deep neurons see most or all of the image
This is why deep networks understand context better.
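The growth of the receptive field can be computed layer by layer with a standard recurrence: each layer widens the field by (kernel − 1) times the current spacing between neighboring neurons, and striding multiplies that spacing. The particular layer stack below is an illustrative assumption:

```python
# (kernel_size, stride) per layer: conv, conv, pool, conv, conv, pool, conv
layers = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1)]

rf, jump = 1, 1   # receptive field size; spacing between adjacent neurons
for k, s in layers:
    rf += (k - 1) * jump   # each layer widens the field in input pixels
    jump *= s              # striding spreads later neurons farther apart

# rf is now 24: a neuron after seven small layers already sees a 24x24 patch
```

Even though every individual kernel is tiny, stacking layers (and downsampling between them) quickly expands what each deep neuron can see.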
What CNNs Can Do
Image Classification: "This is a cat" vs "This is a dog"
Object Detection: Find and label multiple objects with bounding boxes
Semantic Segmentation: Label every pixel (road, car, pedestrian, sky)
Face Recognition: Match faces across photos
Medical Imaging: Detect tumors, analyze X-rays
Landmark Architectures
LeNet (1998): The original CNN for handwritten digits
AlexNet (2012): Sparked the deep learning revolution by winning ImageNet
VGGNet (2014): Showed that depth pays off when built from stacks of small 3×3 filters (16-19 layers)
ResNet (2015): Enabled 100+ layer networks with skip connections
Vision Transformer (2020): Applied transformer architecture to images
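The skip connection that made ResNet's depth possible is conceptually one line: the block adds its input to its output, so the layers only have to learn a residual correction. A minimal NumPy sketch (the zero transform below is an illustrative stand-in for a block's learned layers):

```python
import numpy as np

def residual_block(x, transform):
    """Skip connection: output = F(x) + x, so identity is easy to learn."""
    return transform(x) + x

x = np.array([1.0, -2.0, 3.0])

# Even if the learned transform contributes nothing, the input (and,
# during training, the gradient) still flows straight through the skip path:
out = residual_block(x, lambda v: np.zeros_like(v))
# out equals x
```

This is why adding more blocks cannot easily hurt a ResNet: a block that learns nothing simply passes its input through unchanged.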
References
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541-551.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NeurIPS).
Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. CVPR 2016.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., et al. (2014). Going Deeper with Convolutions. CVPR 2015.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.