CNNs & Computer Vision
From pixels to perception
Teaching Computers to See
When you look at a photo of a cat, you instantly recognize it. But to a computer, an image is just a grid of numbers—pixel values. Convolutional Neural Networks (CNNs) bridge this gap, learning to extract meaningful patterns from raw pixels.
Why Not Just Use Regular Neural Networks?
Consider a small 256×256 color image. That's 256 × 256 × 3 = 196,608 input values. In a traditional fully-connected network, a first hidden layer of just 10,000 neurons would already need roughly two billion weights. CNNs solve this with three key ideas:
- Local connectivity: Each neuron only looks at a small patch, not the whole image
- Weight sharing: The same pattern detector is used across the entire image
- Hierarchy: Simple features combine into complex ones
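The parameter-count gap is easy to verify with back-of-the-envelope arithmetic. The 10,000-neuron hidden layer and the 64-filter conv layer below are illustrative assumptions, not taken from any specific architecture:

```python
# Fully-connected first layer on a 256x256 RGB image
inputs = 256 * 256 * 3            # 196,608 pixel values
hidden = 10_000                   # hypothetical first hidden layer
fc_weights = inputs * hidden      # ~2 billion weights

# A convolutional layer, by contrast, shares each filter across the image:
filters = 64                      # hypothetical filter count
conv_weights = filters * 3 * 3 * 3  # 64 filters, each 3x3 over 3 channels

print(fc_weights)    # 1,966,080,000
print(conv_weights)  # 1,728
```

Weight sharing is what makes the difference: the convolutional layer's cost depends on the filter size and count, not the image size.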
Convolutions: Pattern Detectors
A convolution slides a small filter (like a magnifying glass) across an image, checking for a specific pattern at each position:
- Edge detector: Highlights boundaries between light and dark
- Corner detector: Finds intersection points
- Texture detector: Recognizes repeating patterns
The filter is just a small grid of numbers (e.g., 3×3). The network learns what numbers to put in each filter during training.
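The sliding-filter idea can be sketched in a few lines of NumPy. This is a minimal single-channel convolution (valid padding, stride 1), not an optimized implementation; the Sobel-style edge kernel is a classic hand-designed example of the kind of filter a CNN would instead learn:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel across an image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Response = elementwise product of the patch and the filter
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

# A hand-made vertical edge detector (Sobel-like):
edge_kernel = np.array([[1, 0, -1],
                        [2, 0, -2],
                        [1, 0, -1]])

# A 5x5 image with a sharp vertical boundary: bright left, dark right
image = np.zeros((5, 5))
image[:, :2] = 1.0

response = convolve2d(image, edge_kernel)
# Columns near the light/dark boundary respond strongly (4.0);
# the uniform dark region on the right responds with 0.
```

During training, the network adjusts the kernel values by gradient descent so that each filter ends up detecting whatever pattern is most useful for the task.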
Feature Hierarchies
CNNs build up understanding in layers:
- Layer 1: Edges, simple textures
- Layer 2: Corners, curves, basic shapes
- Layer 3: Parts of objects (eyes, wheels, windows)
- Layer 4: Whole objects (faces, cars, buildings)
- Final layers: Categories and decisions
This mirrors how our visual cortex processes information!
Pooling: Simplifying and Generalizing
Pooling shrinks the representation by summarizing regions:
- Max pooling: Keep the strongest signal in each region
- Average pooling: Average the values
This makes the network robust to small shifts and reduces computation.
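Max pooling can be sketched as a reshape-and-reduce over non-overlapping windows. This NumPy version assumes 2×2 windows with stride equal to the window size, the most common configuration:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling (stride equals the window size)."""
    h, w = feature_map.shape
    h2, w2 = h // size, w // size
    trimmed = feature_map[:h2 * size, :w2 * size]
    # Group into (size x size) blocks, keep the max of each block
    return trimmed.reshape(h2, size, w2, size).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 3, 2],
               [2, 6, 1, 1]])

pooled = max_pool(fm)
# [[4 5]
#  [6 3]]
```

Note that the strongest value in each 2×2 block survives wherever it sits inside the block, which is exactly why small shifts of the input often leave the pooled output unchanged.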
Receptive Fields
A neuron's receptive field is the region of the original image it can "see." As you go deeper:
- Early neurons see tiny patches (3×3 pixels)
- Middle neurons see larger areas (dozens of pixels)
- Deep neurons see most or all of the image
This is why deep networks understand context better.
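The growth of the receptive field can be computed layer by layer with a standard recurrence: each layer widens the field by (kernel − 1) times the current spacing between neighboring neurons, and striding multiplies that spacing. The particular layer stack below is an illustrative assumption:

```python
# (kernel_size, stride) per layer: conv, conv, pool, conv, conv, pool, conv
layers = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1)]

rf, jump = 1, 1   # receptive field size; spacing between adjacent neurons
for k, s in layers:
    rf += (k - 1) * jump   # each layer widens the field in input pixels
    jump *= s              # striding spreads later neurons farther apart

# rf is now 24: a neuron after seven small layers already sees a 24x24 patch
```

Even though every individual kernel is tiny, stacking layers (and downsampling between them) quickly expands what each deep neuron can see.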
What CNNs Can Do
Image Classification: "This is a cat" vs "This is a dog"
Object Detection: Find and label multiple objects with bounding boxes
Semantic Segmentation: Label every pixel (road, car, pedestrian, sky)
Face Recognition: Match faces across photos
Medical Imaging: Detect tumors, analyze X-rays
Landmark Architectures
LeNet (1998): The original CNN for handwritten digits
AlexNet (2012): Sparked the deep learning revolution by winning ImageNet
VGGNet (2014): Showed that depth pays off when built from stacks of small 3×3 filters (16-19 layers)
ResNet (2015): Enabled 100+ layer networks with skip connections
Vision Transformer (2020): Applied transformer architecture to images
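The skip connection that made ResNet's depth possible is conceptually one line: the block adds its input to its output, so the layers only have to learn a residual correction. A minimal NumPy sketch (the zero transform below is an illustrative stand-in for a block's learned layers):

```python
import numpy as np

def residual_block(x, transform):
    """Skip connection: output = F(x) + x, so identity is easy to learn."""
    return transform(x) + x

x = np.array([1.0, -2.0, 3.0])

# Even if the learned transform contributes nothing, the input (and,
# during training, the gradient) still flows straight through the skip path:
out = residual_block(x, lambda v: np.zeros_like(v))
# out equals x
```

This is why adding more blocks cannot easily hurt a ResNet: a block that learns nothing simply passes its input through unchanged.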
References
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541-551.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NeurIPS).
Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. CVPR 2016.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., et al. (2014). Going Deeper with Convolutions. CVPR 2015.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.