Use network quantization in applications with memory constraints where a minor loss in performance is acceptable
Context: With the rising use of deep learning across domains, models are increasingly deployed on battery-powered devices such as smartphones. Running inference on the device is also beneficial in terms of bandwidth and latency. However, the large size of these models and their energy consumption pose a significant challenge on battery-powered devices.
Problem: Running inference with a deep learning model can involve millions of multiplication and addition operations. A high-precision representation of the parameters makes these operations expensive in terms of energy on battery-powered devices.
Solution: Network quantization reduces the number of bits used to represent the parameters of a neural network. Quantization has been used in existing work to improve the energy efficiency of deep learning models. Because the operands have a smaller bit-width, the multiplication and addition operations become computationally cheaper, which reduces power consumption. Quantization also cuts memory requirements because the model itself becomes smaller. If done properly, it causes only a minor loss in performance and does not affect the output significantly.
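To make the idea concrete, the following is a minimal sketch of uniform, symmetric, per-tensor quantization of a weight tensor to a chosen bit-width; the function names and the choice of NumPy are illustrative, not part of any particular framework's API.

```python
import numpy as np

def quantize(weights: np.ndarray, num_bits: int):
    """Map float weights to signed integers of the given bit-width (symmetric, per-tensor)."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for 8 bits, 7 for 4 bits
    scale = np.max(np.abs(weights)) / qmax         # one scale shared by the whole tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights, to inspect the quantization error."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize(weights, num_bits=4)
error = np.mean(np.abs(weights - dequantize(q, scale)))
print(f"mean absolute reconstruction error: {error:.4f}")
print(f"storage: {weights.nbytes} bytes as float32 vs ~{weights.size // 2} bytes packed at 4 bits")
```

The printed error gives a rough sense of the accuracy cost, while the storage comparison shows why the quantized model is smaller and its arithmetic cheaper.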
Example: Consider a MobileNet V2 model that needs to be deployed on a smartphone. Quantizing its parameters to 4-bit precision leads to a smaller model and lower energy consumption per computation.
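A practical sketch of this workflow, assuming TensorFlow Lite as the deployment toolchain, is shown below. Note that standard toolchains typically quantize weights to 8 bits by default; the 4-bit precision mentioned above generally requires more specialized tooling.

```python
import tensorflow as tf

# Load a pretrained MobileNet V2 and convert it to TensorFlow Lite with
# post-training dynamic-range quantization (weights stored as 8-bit integers).
model = tf.keras.applications.MobileNetV2(weights="imagenet")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables weight quantization
tflite_model = converter.convert()

with open("mobilenet_v2_quantized.tflite", "wb") as f:
    f.write(tflite_model)

print(f"quantized model size: {len(tflite_model) / 1e6:.1f} MB")
```

The resulting .tflite file can be bundled with a smartphone app and executed with the TensorFlow Lite runtime, trading a small accuracy drop for a smaller download and lower energy per inference.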