A New AI Research Proposes an Effective Optimization Method of Adaptive Column-Wise Clipping (CowClip) that Reduces CTR Prediction Model Training Time from 12 Hours to 10 Minutes on 1 GPU

Source: https://arxiv.org/pdf/2204.06240.pdf

As the Internet and the digital economy grow, online commerce, video applications, and web advertising generate enormous numbers of clicks. The number of click samples in a typical industrial dataset has reached hundreds of billions and continues to grow daily. Click-through rate (CTR) prediction, the task of estimating whether a user will click on a recommended item, is a core component of recommendation and advertising systems, and an accurate CTR model directly improves both user experience and advertising revenue.

Because CTR prediction is time-sensitive (it must capture recent topics and new users' interests), keeping the model up to date requires reducing the time needed to retrain it on a large dataset. Shorter training time also lowers training cost and improves the return on investment under a fixed computing budget. GPU processing power has grown rapidly in recent years, and larger batch sizes can exploit more of the parallelism offered by increasing GPU memory and FLOPS. Figure 1(a) shows that a single forward and backward pass takes almost the same wall-clock time when the batch size is scaled up eight times, indicating that GPUs are severely underutilized at small batch sizes.

To avoid entangling the method with system-level optimizations for reducing communication cost, the researchers focus on an accuracy-preserving approach for increasing the batch size on a single GPU, which can then be readily extended to multi-node training. Since the number of training epochs stays constant, large-batch training reduces the number of optimization steps and therefore the overall training time (Figure 1(b)); a rough back-of-envelope calculation is sketched below. In a multi-GPU setting, where gradients of the large embedding layer must be exchanged across GPUs and machines, a large batch brings even greater benefits because of the high communication costs involved.
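To make the step-count argument concrete, here is a minimal sketch, assuming (as Figure 1(a) suggests) that the time per step stays roughly constant as the batch size grows. The sample count, epoch count, and per-step time below are hypothetical illustration values, not numbers from the paper.

```python
# Back-of-envelope estimate of how batch size affects total training time,
# under the assumption that one step costs roughly the same wall-clock time
# regardless of batch size (hypothetical numbers, not from the paper).

def total_training_time(num_samples, batch_size, epochs, sec_per_step):
    """Total time = number of optimizer steps * time per step."""
    steps = epochs * (num_samples // batch_size)
    return steps * sec_per_step

N, EPOCHS, SEC_PER_STEP = 45_000_000, 1, 0.02  # illustrative values only
for bs in (1024, 8192, 131072):
    hours = total_training_time(N, bs, EPOCHS, SEC_PER_STEP) / 3600
    print(f"batch size {bs:>6}: ~{hours:.3f} h")
```

With a fixed number of epochs, scaling the batch size by 128x cuts the number of steps by 128x, which is where the large wall-clock savings come from, provided accuracy is preserved.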

Figure 1: Relative time required to train the DeepFM model with one V100 GPU using the Criteo dataset.

The difficulty with large-batch training is that naively increasing the batch size degrades accuracy, and CTR prediction is highly sensitive and cannot tolerate accuracy loss. Hyperparameter scaling rules and carefully crafted optimization techniques designed for CV and NLP tasks are not well suited to CTR prediction, because in CTR models the embedding layers dominate the network's parameters (e.g., 99.9%, see Table 1) and the inputs are far sparser and more frequency-imbalanced. In this study, the authors explain why the scaling rules previously used for CTR prediction fail and provide an algorithm and scaling rule that make large-batch training work.
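The following toy calculation illustrates why embedding tables dominate the parameter count in a typical CTR model. The vocabulary size, embedding dimension, and MLP layer sizes are hypothetical, chosen only to show the order of magnitude, not taken from the paper.

```python
# Rough illustration of why embedding tables dominate CTR model parameters.
# All sizes below are hypothetical.

embedding_vocab = 30_000_000   # total number of feature ids across all fields
embedding_dim   = 16           # embedding vector size per id

# A small MLP head on top of the concatenated field embeddings
# (39 fields, as in Criteo-style data; layer widths are illustrative).
mlp_layers = [(39 * embedding_dim, 400), (400, 400), (400, 1)]

embedding_params = embedding_vocab * embedding_dim
mlp_params = sum(inp * out + out for inp, out in mlp_layers)  # weights + biases

total = embedding_params + mlp_params
print(f"embedding share: {100 * embedding_params / total:.2f}%")  # ~99.9%
```

Because almost all parameters live in sparsely and unevenly updated embedding rows, scaling rules derived for dense CV/NLP networks do not transfer directly.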

Conclusion:

• To the best of their knowledge, they are the first to study the stability of training CTR prediction models at very large batch sizes.

• Through careful mathematical analysis, they show that the learning rate for infrequent features should not be scaled when increasing the batch size.

• They attribute the difficulty of scaling the batch size to the imbalance in id frequencies, and pair CowClip with a simple and effective scaling technique so the batch size can be increased safely.

• To stabilize the training of CTR prediction models, they propose an effective optimization strategy called adaptive Column-Wise Clipping (CowClip). With it, they successfully scale four models to 128 times the original batch size on two public datasets; on the Criteo dataset in particular, they train the DeepFM model with a 72x speedup and a 0.1% AUC improvement. A minimal sketch of the column-wise clipping idea follows this list.
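The sketch below shows the core idea of column-wise adaptive clipping: each feature id's embedding gradient is clipped against a threshold tied to that id's own weight norm, so rare ids with large, noisy gradients are tamed without affecting frequent ones. This is a simplified NumPy illustration, not the authors' implementation; the parameter names, the floor value `zeta`, and the choice to treat each id's embedding vector as one "column" are assumptions for illustration.

```python
import numpy as np

def cowclip_sketch(embedding_weights, embedding_grads, ratio=1.0, zeta=1e-5):
    """Minimal sketch of column-wise adaptive clipping for embedding gradients.

    For every feature id, the gradient of its embedding vector is rescaled so
    that its norm does not exceed ratio * ||weight||, with a small floor zeta.
    Hyperparameter names and defaults here are illustrative, not the paper's.
    """
    clipped = embedding_grads.copy()
    for i in range(embedding_weights.shape[0]):      # one id ("column") at a time
        w_norm = np.linalg.norm(embedding_weights[i])
        g_norm = np.linalg.norm(embedding_grads[i])
        threshold = max(ratio * w_norm, zeta)        # adaptive per-id threshold
        if g_norm > threshold:
            clipped[i] = embedding_grads[i] * (threshold / g_norm)
    return clipped

# Toy usage: 5 ids with 4-dimensional embeddings and deliberately large gradients.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))
G = rng.normal(size=(5, 4)) * 10.0
G_clipped = cowclip_sketch(W, G)
```

The key design point the paper argues for is that the clipping threshold adapts per id (per column) rather than being global, which is what keeps infrequent ids stable when the batch size, and hence their per-step gradient magnitude, grows.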

The entire project codebase is open source on GitHub.

Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He likes to connect with people and collaborate on interesting projects.
