Convolutional Neural Network for Computer Vision
Have you ever wondered how a computer sees? It is entirely possible to build such a system using machine learning, and deep learning specifically. But doesn't learning the math behind it sound quite scary? Perhaps it is. In fact, you don't need to understand all of the mathematics to get started, although it is a good thing to learn if you want to.
It has been more than a decade since a revolutionary neural network architecture was developed: the convolutional neural network, or CNN for short. CNNs perform well at image processing and feature extraction. One of my favorite CNN-based architectures is the Residual Network, or ResNet, whose deeper variants have more than 100 layers.
So, what actually is a CNN? What does it look like? What kind of architecture is it? I will only scratch the surface and explain the main idea. If you are interested in going deeper into CNNs, you can watch this lecture from Stanford University.
In general, a CNN consists of three kinds of layers: convolutional layers, pooling layers, and fully connected layers.
1. Convolutional layer
The convolutional layer is the main layer for extracting features from the image. It is basically a filter (also called a kernel) of a certain size that slides over every pixel of your input image and maps each neighborhood to a new pixel value, producing a feature map. Common kernel sizes are 3x3, 5x5, and 7x7. Several parameters are commonly used for this layer, such as stride (how many pixels the kernel moves at each step) and padding (how many pixels are added around the border of your image).
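To make the sliding-kernel idea concrete, here is a minimal NumPy sketch of a single-channel convolution with stride and zero padding. The function name `conv2d`, the 5x5 input, and the vertical-edge kernel are made up for illustration, and real libraries are far more optimized than these plain loops. Like most deep learning libraries, the sketch does not flip the kernel (strictly speaking, it is a cross-correlation).

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a 2-D `image` and return the feature map."""
    # Padding adds a border of zeros around the image.
    padded = np.pad(image, padding, mode="constant")
    k_h, k_w = kernel.shape
    out_h = (padded.shape[0] - k_h) // stride + 1
    out_w = (padded.shape[1] - k_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Stride controls how far the window moves at each step.
            r, c = i * stride, j * stride
            window = padded[r:r + k_h, c:c + k_w]
            # Multiply element-wise and sum: this is one new pixel value.
            out[i, j] = np.sum(window * kernel)
    return out

# Hypothetical 5x5 "image" and a 3x3 vertical-edge kernel.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

features = conv2d(image, kernel, stride=1, padding=0)
print(features.shape)  # (3, 3)
```

With no padding, the 5x5 input shrinks to 3x3. Calling `conv2d(image, kernel, padding=1)` instead keeps the output at 5x5, which is why padding is often used together with 3x3 kernels.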
2. Pooling layer
The pooling layer is used to decrease your image size while retaining the most important information. This layer resembles the convolutional layer, but it has no kernel weights to multiply by. Instead, it takes the average (average pooling) or the maximum (max pooling) of the pixels covered by the kernel.
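Pooling can be sketched in the same style. The helper `pool2d` and the 4x4 feature map below are hypothetical; the sketch assumes a 2x2 window with stride 2, which is a common choice but not the only one.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2-D feature map with max or average pooling."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    # No learned weights here: just pick the max or the mean of each window.
    reduce = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce(feature_map[r:r + size, c:c + size])
    return out

# Hypothetical 4x4 feature map.
fmap = np.array([[1.0, 3.0, 2.0, 4.0],
                 [5.0, 7.0, 6.0, 8.0],
                 [9.0, 2.0, 1.0, 0.0],
                 [3.0, 4.0, 5.0, 6.0]])

max_pooled = pool2d(fmap, mode="max")  # [[7, 8], [9, 6]]
avg_pooled = pool2d(fmap, mode="avg")  # [[4, 5], [4.5, 3]]
print(max_pooled)
print(avg_pooled)
```

Both results are 2x2, so the feature map has a quarter of its original pixels while each output still summarizes its own 2x2 region.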
3. Fully connected layer
The last layer is the fully connected layer, the layer responsible for the actual computer vision task, which could be image classification, object detection, image segmentation, etc. Basically, a fully connected layer works just like a plain neural network. Before the image features are fed to this layer, they have to be flattened first. By flatten, I mean stretching your image features (in matrix form) into a single column.
This is what the whole CNN architecture looks like.
I think that's all for CNNs. Hopefully you understand them better after this. Maybe I will write about NLP and RNNs in an upcoming article. Thank you for reading :).
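Here is a rough sketch of flattening followed by one fully connected layer for classification. The feature shape (8 channels of 4x4), the 10 classes, and the random weights are all assumptions made for illustration; in a trained network the weights `W` and `b` would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features from the last pooling layer: 8 channels of 4x4.
features = rng.standard_normal((8, 4, 4))

# Flatten: stretch the features into one long vector (8 * 4 * 4 = 128 values).
x = features.reshape(-1)

# One fully connected layer: every input value connects to every output.
num_classes = 10  # assumed number of classes
W = rng.standard_normal((num_classes, x.size)) * 0.01  # random, untrained
b = np.zeros(num_classes)
logits = W @ x + b

# Softmax turns the raw scores into probabilities that sum to 1.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

print(x.shape)      # (128,)
print(probs.shape)  # (10,)
```

The predicted class would be `probs.argmax()`, although with random weights that prediction is meaningless.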
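One way to get a feel for the whole architecture is to trace how the image size changes from layer to layer. The sketch below uses the standard output-size formula, (n + 2 * padding - kernel) // stride + 1, which applies to both convolution and pooling. The 32x32 input, the layer settings, and the 16 channels are hypothetical, not taken from any specific network.

```python
def output_size(n, kernel, stride=1, padding=0):
    """Width/height after a convolution or pooling step on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

size = 32  # hypothetical 32x32 input image
trace = [size]

size = output_size(size, kernel=3, padding=1)  # conv 3x3, padding 1
trace.append(size)
size = output_size(size, kernel=2, stride=2)   # pool 2x2, stride 2
trace.append(size)
size = output_size(size, kernel=5)             # conv 5x5, no padding
trace.append(size)
size = output_size(size, kernel=2, stride=2)   # pool 2x2, stride 2
trace.append(size)

channels = 16  # assumed number of filters in the last conv layer
flattened = size * size * channels  # input length of the fully connected layer

print(trace)      # [32, 32, 16, 12, 6]
print(flattened)  # 576
```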