A team of researchers from MIT and the MIT-IBM Watson AI Lab has developed a new technique that enables on-device training using less than a quarter of a megabyte of memory. This is an impressive achievement, as other training solutions typically require more than 500 megabytes of memory, far exceeding the 256-kilobyte capacity of most microcontrollers.
Training a machine learning model directly on a smart edge device lets it adapt to new data and make better predictions. That said, training is usually memory intensive, so it is typically done on computers in a data center before the model is deployed to a device. That centralized process is far more expensive and raises privacy concerns compared with the new technique the team has developed.
The researchers developed the algorithms and framework in a way that reduced the amount of computation needed to train a model, making the process faster and more memory efficient. The technique can help train a machine learning model on a microcontroller in just minutes.
The new technique also helps with privacy as it keeps data on the device, which is important when sensitive data is involved. At the same time, the framework improves the accuracy of the model compared to other approaches.
Song Han is an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and senior author of the research paper.
“Our study enables IoT devices not only to perform inference, but also to continuously update AI models based on newly collected data, paving the way for lifelong on-device learning,” said Han. “Low resource utilization makes deep learning more accessible and can have a wider reach, especially for low-power edge devices.”
Han is joined on the paper by co-lead authors and EECS doctoral students Ji Lin and Ligeng Zhu, MIT postdocs Wei-Ming Chen and Wei-Chen Wang, and Chuang Gan, a senior member of the MIT-IBM Watson AI Lab research staff.
Making the training process more efficient
To make the training process more efficient and less memory intensive, the team relied on two algorithmic solutions. The first, known as sparse update, uses an algorithm that identifies the most important weights to update at each round of training. The algorithm freezes the weights one at a time until it sees the accuracy dip to a set threshold, at which point it stops. The remaining weights are then updated, while the activations corresponding to the frozen weights do not need to be stored in memory.
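The loop below is a minimal sketch of that selection idea, written against a generic PyTorch-style model. The helper `finetune_and_eval` (assumed to briefly fine-tune a copy of the model using only the currently trainable tensors and return validation accuracy) and the importance ordering are hypothetical stand-ins for illustration, not the authors' actual code.

```python
# Sketch of the sparse-update selection loop: freeze weight tensors one at a
# time, stop once accuracy dips past a tolerance, and train only what remains.
import torch.nn as nn

def select_weights_to_update(model: nn.Module, finetune_and_eval, tolerance: float = 0.01):
    """Freeze weight tensors one at a time; stop once accuracy drops too far."""
    baseline = finetune_and_eval(model)          # accuracy with everything trainable
    frozen = []
    # Iterate from least to most important; tensor size is a crude proxy here.
    for name, param in sorted(model.named_parameters(), key=lambda kv: kv[1].numel()):
        param.requires_grad = False              # tentatively freeze this tensor
        if baseline - finetune_and_eval(model) > tolerance:
            param.requires_grad = True           # accuracy dipped too far: keep it trainable
            break                                # stop freezing further tensors
        frozen.append(name)
    # Activations that feed only frozen tensors need not be stored for backprop.
    return frozen
```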
“Updating the whole model is very expensive because there are a lot of activations, so people tend to only update the last layer, but as you can imagine, it hurts the accuracy,” Han said. “For our method, we selectively update these important weights and ensure that accuracy is fully preserved.”
The second solution involves quantized training and simplifying the weights. An algorithm first rounds the weights, which are typically 32 bits, down to just eight bits through a process called quantization, which also cuts the amount of memory needed for both training and inference, inference being the process of applying a model to a dataset and generating a prediction. The algorithm then applies a technique called quantization-aware scaling (QAS), which acts like a multiplier to adjust the ratio between weight and gradient, avoiding any drop in accuracy that might result from quantized training.
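The following numpy sketch illustrates the two ideas in simplified form: rounding float32 weights to 8-bit integers with a per-tensor scale, and rescaling the gradient so the weight-to-gradient ratio matches what full-precision training would see. The scaling rule shown is one straightforward way to restore that ratio, offered as an illustration rather than the exact QAS formula from the paper.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map a float32 tensor to int8 values plus one per-tensor scale factor."""
    scale = max(np.abs(w).max() / 127.0, 1e-8)
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def qas_scale_gradient(grad_q: np.ndarray, scale: float) -> np.ndarray:
    """Rescale the gradient of a quantized weight tensor.

    Quantizing by w_q = w / scale stretches the weights by 1/scale while the
    chain rule shrinks their gradient by scale, so the weight-to-gradient
    ratio drifts by a factor of 1/scale**2. Multiplying the gradient by
    1/scale**2 puts the ratio back where full-precision training had it.
    """
    return grad_q / (scale ** 2)
```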
The researchers also developed a system called a tiny training engine, which runs these algorithmic innovations on a simple microcontroller that lacks an operating system. To do more of the work in the compilation stage, before the model is deployed to the edge device, the system changes the order of the steps in the training process.
“We push a lot of the computation, like auto-differentiation and graph optimization, to compile time. We also aggressively remove redundant operators to support sparse updates. At execution time, we have a lot less work to do on the device,” says Han.
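The toy example below illustrates the general idea of moving work to compile time: given which tensors the sparse update will actually train, an offline step emits a fixed schedule of operators and drops the activation-saving, backward, and update steps for frozen layers, so the device only executes the pruned list. The names and structure here are invented for illustration; the real tiny training engine compiles the training graph far more aggressively.

```python
from dataclasses import dataclass

@dataclass
class Op:
    kind: str    # "forward", "save_activation", "backward", or "update"
    layer: str

def compile_training_step(layers: list[str], trainable: set[str]) -> list[Op]:
    """Build the static operator schedule for one on-device training step."""
    schedule = [Op("forward", name) for name in layers]
    # Only layers that will be updated need their activations kept and a backward pass run.
    schedule += [Op("save_activation", name) for name in layers if name in trainable]
    schedule += [Op("backward", name) for name in reversed(layers) if name in trainable]
    schedule += [Op("update", name) for name in layers if name in trainable]
    return schedule

# Example: a four-layer model where the sparse update only touches the last two layers.
ops = compile_training_step(["conv1", "conv2", "conv3", "fc"], {"conv3", "fc"})
print([f"{op.kind}:{op.layer}" for op in ops])
```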
A highly effective technique
While other techniques designed for lightweight training typically need between 300 and 600 megabytes of memory, the team’s optimizations required only 157 kilobytes to train a machine learning model on a microcontroller.
The framework was tested by training a computer vision model to detect people in images, and it learned to accomplish the task in just 10 minutes. The method also trained a model more than 20 times faster than other approaches.
The researchers will now seek to apply these techniques to language models and other types of data. They also hope to use the insights gained to shrink larger models without losing accuracy, which could help reduce the carbon footprint of training large-scale machine learning models.