Meet ‘The Stack,’ a 3TB Dataset of Permissively Licensed Source Code for LLMs (Large Language Models)
Not long ago, generating code with a large language model (LLM) seemed out of reach. With recent advances in artificial intelligence, LLMs are now used successfully to generate software code, and automatic code generation has streamlined many real-world programming tasks. However, alongside the widespread adoption of code LLMs, there has been growing concern about the source code used as training data for these models. A model learns from its training examples, which may include open-source code governed by restrictive licenses. This has raised questions among developers who do not want their code used to train language models.
The BigCode Project, a collaboration between ServiceNow and Hugging Face, has released The Stack, a 3.1 TB dataset of permissively licensed source code spanning 30 programming languages. Given the ongoing debate over the use of open-source repositories, BigCode released the dataset to promote transparency around pre-training data.
The main idea is to let people choose whether their code is used to train machine learning models. The Hugging Face page at https://huggingface.co/spaces/bigcode/in-the-stack lets developers conveniently opt out of having their repositories included in The Stack for LLM training. By entering their GitHub username on the site, they can check whether their repositories are in The Stack and, if so, request that the data be removed from any future version of the dataset.
The ServiceNow and Hugging Face team, in their recently published paper ‘The Stack: 3TB of Permissively Licensed Source Code,’ list their contributions as follows:
- The team has released 3.1TB of permissively licensed source code in 30 programming languages, along with a near-deduplicated version, both accessible at https://hf.co/BigCode.
- When training 350M-parameter decoder-only transformers on Python data, removing near-duplicates from the training data markedly improves model performance.
- The team shows that it is possible to approximately reproduce the impressive results of Codex and CodeGen using only permissively licensed data.
- It shares a dedicated data governance plan, including instructions and a process for developers to have their open-source repositories removed from the training data.
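The near-deduplication mentioned in the second bullet can be illustrated with a short sketch. This is a simplified, hypothetical version for clarity only: the actual BigCode pipeline uses scalable MinHash-based near-deduplication over 3.1 TB of code, whereas the function names, the token 5-gram shingling, and the 0.85 similarity threshold below are illustrative assumptions, and the O(n²) pairwise comparison would never scale to a real corpus.

```python
# Sketch of near-deduplication: drop a file if its token-shingle Jaccard
# similarity to an already-kept file exceeds a threshold.

def shingles(code: str, n: int = 5) -> set:
    """Split source code into whitespace tokens and return the set of n-grams."""
    tokens = code.split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_dedup(files: list, threshold: float = 0.85) -> list:
    """Keep each file only if it is not a near-duplicate of a kept file.

    O(n^2) pairwise comparison -- fine for a sketch; real pipelines use
    MinHash signatures with locality-sensitive hashing instead.
    """
    kept, kept_shingles = [], []
    for src in files:
        s = shingles(src)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(src)
            kept_shingles.append(s)
    return kept
```

For example, two files that differ only by a trailing comment would share most of their shingles and be collapsed into one, while an unrelated file survives deduplication.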
To obtain license details for the 137.36M GitHub repositories that make up the dataset, the team used GHArchive and the go-license-detector. The most common licenses were MIT and Apache 2.0. The team also compared the size of The Stack with CodeParrot, one of the most popular code datasets: The Stack is more than three times its size. In addition, The Stack is compared to the datasets used for AlphaCode, CodeGen, and PolyCoder.
The lack of transparency around training data has long been a major obstacle in model development. By releasing this enormous dataset and documenting the entire process of compiling it, ServiceNow Research and Hugging Face have brought welcome clarity to code LLMs.
Check out the Paper. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Tanya Malhotra is a final year undergraduate from the University of Petroleum and Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in learning new skills, leading groups, and managing work in an organized manner.