GLM-130B is a bilingual pre-trained large language model with 130 billion parameters, capable of generating text in both English and Chinese. It is an attempt to open-source a language model at the 100B-parameter scale and to document how models of this size can be pre-trained, since training at this scale is often riddled with issues such as divergence and loss spikes.
In this article, we will talk about the GLM-130B framework, which attempts to devise a method to effectively pre-train large language models with hundreds of billions of parameters. We will take a deeper dive into the workings and architecture of GLM-130B, along with the training process and design choices that improve not only efficiency but also stability. Initial experiments on a wide array of English benchmarks showed GLM-130B outperforming the state-of-the-art GPT-3 framework by a considerable margin. So let's begin and explore how GLM-130B delivers such consistent, accurate, and stable results.
Large language models capable of operating in few-shot and zero-shot settings, especially those with over 100 billion parameters, exhibit attractive scaling laws. Among them, GPT-3 is one of the best-performing frameworks, delivering considerable performance gains over its predecessor, BERT. However, despite GPT-3's popularity and widespread applications, its training process, and in some ways the model itself, has remained opaque to the public. Furthermore, empirically enumerating all possible designs for training LLMs with over 100B parameters is computationally unaffordable, which makes it all the more critical to establish a sound pre-training method for large-scale LLMs.
The above point makes sharing the workings and training process of high-quality large-scale LLMs like GPT-3 of critical value, and with ethical concerns kept in mind, GLM-130B is an attempt to pre-train an accurate, open-source LLM with over 100B parameters. During the course of this attempt, the GLM-130B team observed that pre-training a large-scale LLM is accompanied by a wide array of engineering and technical challenges in terms of pre-training stability, efficiency, and convergence.
To be more specific, GLM-130B is a bidirectional and bilingual dense model with 130B parameters, pre-trained on 400B tokens using a cluster of 96 NVIDIA DGX-A100 GPU nodes over a span of nearly two months. Furthermore, instead of opting for a GPT-style architecture, GLM-130B uses the GLM (General Language Model) algorithm to leverage its autoregressive blank-infilling objective and the advantage of bidirectional attention. The following table compares GLM-130B with other models with over 100B parameters, including GPT-3, BLOOM-176B, and OPT-175B.
The engineering and design choices behind GLM-130B allow it to outperform nearly every large-scale LLM, including GPT-3 and the 540B-parameter PaLM, in many cases and across a wide array of benchmarks. The following figure compares the performance of GLM-130B with other 100B+ parameter models, and as can be seen, GLM-130B shows significantly less generation toxicity and bias than its counterparts.
Finally, GLM-130B has been designed to allow as many developers as possible to conduct studies on 100B-scale models, and it achieves this in two ways. First, instead of using over 175B parameters like BLOOM and OPT, GLM-130B uses 130B parameters, because this size supports inference even on a single A100 server. Second, the GPU requirements to run GLM-130B are lower than those of comparable LLMs, which the framework achieves by quantizing the original weights to INT4 precision. The INT4 quantization used by GLM-130B cuts memory and hardware requirements while incurring negligible performance degradation.
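To make the idea concrete, here is a minimal sketch of symmetric, per-row INT4-style weight quantization in PyTorch. This is not GLM-130B's actual quantization code; the absmax scaling scheme, tensor shapes, and function names are illustrative assumptions.

```python
import torch

def quantize_weights_int4(weight: torch.Tensor):
    """Symmetric absmax quantization of a 2-D weight matrix to 4-bit integers.

    Illustrative sketch only -- real INT4 kernels pack two 4-bit values per
    byte and use custom CUDA code; here we just simulate the arithmetic.
    """
    # One scale per output row keeps quantization error local to that row.
    scale = weight.abs().amax(dim=1, keepdim=True) / 7.0   # INT4 range: [-8, 7]
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_weights_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an FP16 approximation of the original weights at inference time."""
    return (q.float() * scale).half()

# Usage: quantize a linear layer's weight matrix and check the reconstruction error.
w = torch.randn(4096, 4096)
q, s = quantize_weights_int4(w)
w_hat = dequantize_weights_int4(q, s)
print((w - w_hat.float()).abs().mean())
```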
GLM-130B: Architecture
A machine learning model's architecture defines its inductive bias, so it comes as no surprise that developers cannot explore many different architectural designs for large language models, given the computational cost involved. With that said, let's have a look at GLM-130B's architecture.
Large-scale LLMs like PaLM and GPT-3 with over 100B parameters are built on the conventional decoder-only, GPT-style architecture for autoregressive language modeling. GLM-130B, on the other hand, explores the possibility of using a bidirectional General Language Model (GLM), a transformer-based language model that uses autoregressive blank infilling as its training objective, as its foundation. Briefly, for a given text sequence, the GLM framework samples text spans that are each replaced with a single mask token.
The General Language Model's bidirectional attention over the uncorrupted (unmasked) context is what separates GLM-130B from the GPT-style approach, which uses unidirectional attention. Furthermore, to support both generation and understanding, the GLM framework combines two corruption strategies, each indicated by a unique special mask token:
- [MASK]: a corruption strategy that blanks out short spans within a sentence, with the span lengths adding up to a certain percentage of the input.
- [gMASK]: a corruption strategy that blanks out a random-length span at the end of the sentence, keeping the prefix as context.
This approach is what allows the GLM framework to record an accuracy of over 80% on zero-shot LAMBADA language modeling, outperforming both PaLM 540B and GPT-3.
Layer Normalization
One of the major challenges developers face when training an LLM is training instability, and choosing an appropriate layer normalization (LN) scheme can help stabilize training. GLM-130B uses a Post-LN approach because of its performance on downstream tasks.
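As a point of reference, Post-LN of the kind used here is typically paired with the DeepNorm scaling recipe. The sketch below shows the general shape of a DeepNorm-style post-layer-norm residual block; the alpha constant is depth-dependent in the DeepNorm recipe, and the values used here are placeholders, not GLM-130B's configuration.

```python
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """Post-LN residual block in the DeepNorm style: LayerNorm(alpha * x + f(x)).

    `alpha` up-weights the identity path and grows with model depth per the
    DeepNorm recipe; the default below is a placeholder, not GLM-130B's value.
    """
    def __init__(self, sublayer: nn.Module, hidden_size: int, alpha: float = 1.0):
        super().__init__()
        self.sublayer = sublayer          # e.g. a self-attention or FFN block
        self.norm = nn.LayerNorm(hidden_size)
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Post-LN: normalization is applied *after* the residual addition.
        return self.norm(self.alpha * x + self.sublayer(x))

# Usage: wrap a feed-forward sublayer.
ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
block = DeepNormResidual(ffn, hidden_size=1024, alpha=2.0)
out = block(torch.randn(2, 16, 1024))
```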
FFNs and Positional Encoding
The choice of feed-forward network (FFN) and of positional encoding are two further design decisions GLM-130B makes in pursuit of strong downstream performance and training stability.
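The article does not name the concrete choices; the GLM-130B report describes Rotary Positional Embedding (RoPE) for positions and a GLU-style FFN with GeLU activation. The snippet below is a generic sketch of those two components under that assumption, not GLM-130B's implementation.

```python
import torch
import torch.nn as nn

class GeGLU(nn.Module):
    """GLU-style feed-forward layer with GeLU gating (a common FFN variant)."""
    def __init__(self, hidden: int, inner: int):
        super().__init__()
        self.gate = nn.Linear(hidden, inner)
        self.value = nn.Linear(hidden, inner)
        self.out = nn.Linear(inner, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gate: GeLU(gate(x)) * value(x), then project back down.
        return self.out(torch.nn.functional.gelu(self.gate(x)) * self.value(x))

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embedding to a (batch, seq, dim) tensor.

    Generic RoPE sketch: channel pairs are rotated by a position-dependent
    angle, so relative positions are encoded directly in the dot product.
    """
    _, seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage: rotate query/key activations before the attention dot product.
q = rotary_embedding(torch.randn(2, 128, 64))
```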
Pre-Training Setup
The pre-training objective of GLM-130B includes not only self-supervised GLM autoregressive blank infilling but also multi-task learning on a small fraction of the tokens, the expectation being that this combination will help GLM-130B on downstream tasks. With that said, the pre-training setup of GLM-130B looks like the following.
Self-Supervised Blank Filling
As already mentioned, GLM-130B uses two corruption strategies, [MASK] and [gMASK], and one of them is independently applied to each individual training sequence. For blank infilling, the [MASK] strategy is used on 30% of training sequences, masking consecutive spans whose lengths follow a Poisson distribution and add up to 15% of the input. For the remaining 70% of sequences, the prefix of each sequence is kept as context and the [gMASK] strategy masks the rest, with the masked length sampled from a uniform distribution.
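A minimal sketch of how such a corruption scheme could be sampled is shown below; the placeholder token strings, the Poisson rate, and the helper names are illustrative assumptions, not GLM-130B's actual data pipeline.

```python
import numpy as np

MASK, GMASK = "[MASK]", "[gMASK]"   # placeholder special tokens

def corrupt_sequence(tokens: list, rng: np.random.Generator) -> list:
    """Apply either the [MASK] or the [gMASK] corruption strategy to one sequence."""
    n = len(tokens)
    if rng.random() < 0.30:
        # [MASK]: blank out short consecutive spans (Poisson-distributed lengths;
        # the rate of 3 is an assumption) until ~15% of the input is masked.
        # Overlap with earlier masks is ignored for brevity.
        out, budget = list(tokens), max(1, int(0.15 * n))
        while budget > 0:
            span = int(min(max(1, rng.poisson(3)), budget, n - 1))
            start = int(rng.integers(0, n - span + 1))
            out[start:start + span] = [MASK]       # each span becomes one mask token
            n, budget = len(out), budget - span
        return out
    # [gMASK]: keep a uniformly sampled prefix as context, mask everything after it.
    prefix_len = int(rng.integers(1, n))
    return tokens[:prefix_len] + [GMASK]

rng = np.random.default_rng(0)
print(corrupt_sequence(list(range(20)), rng))
```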
Multi-Task Instructions Pre-Training
It has been shown that a multi-task learning approach during pre-training can deliver better results than fine-tuning alone for improving task transfer in zero-shot settings. Accordingly, GLM-130B proposes to include an array of instruction-prompted datasets covering language generation, understanding, and information extraction during pre-training.
Compared to other approaches to zero-shot task transfer that rely on multi-task prompted fine-tuning, the Multi-Task Instruction Pre-Training (MIP) approach used by GLM-130B accounts for only 5% of the total tokens and is applied during the pre-training phase, in an attempt to avoid spoiling other abilities of the LLM, in other words, unconditional free generation.
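To illustrate what a 5% budget could look like in a data pipeline, here is a minimal sampler that mixes an instruction stream into a self-supervised stream. The stream interfaces are assumptions for the sketch, and the ratio is applied per sequence rather than per token as a simplification; this is not GLM-130B's data loader.

```python
import random
from typing import Callable, Iterator

def mixed_stream(self_supervised: Callable[[], list],
                 instruction: Callable[[], list],
                 instruction_ratio: float = 0.05) -> Iterator[list]:
    """Yield training sequences, drawing ~5% of them from instruction data."""
    while True:
        if random.random() < instruction_ratio:
            yield instruction()      # instruction-prompted sample (MIP share)
        else:
            yield self_supervised()  # GLM blank-infilling sample (95% share)

# Usage with stand-in generators.
stream = mixed_stream(lambda: ["<self-supervised sample>"],
                      lambda: ["<instruction sample>"])
batch = [next(stream) for _ in range(8)]
```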
3D Parallel Strategy
There are two de facto practices for training large-scale models with billions of parameters: tensor model parallelism and data parallelism. To handle the immense GPU memory requirements while keeping GPU utilization high, GLM-130B implements a 3D parallel strategy that combines pipeline model parallelism with tensor model parallelism and data parallelism.
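A toy sketch of how a fixed pool of GPUs can be arranged into a 3-D grid of data-, pipeline-, and tensor-parallel groups is shown below; the group sizes and the rank-ordering convention are illustrative assumptions, not GLM-130B's actual configuration.

```python
from itertools import product

def build_3d_groups(world_size: int, tensor: int, pipeline: int):
    """Partition `world_size` ranks into a (data, pipeline, tensor) grid."""
    assert world_size % (tensor * pipeline) == 0, "group sizes must divide the world size"
    data = world_size // (tensor * pipeline)
    grid = {}
    for d, p, t in product(range(data), range(pipeline), range(tensor)):
        # Ranks in the same tensor group are adjacent; pipeline stages come next.
        rank = d * pipeline * tensor + p * tensor + t
        grid[rank] = {"data": d, "pipeline": p, "tensor": t}
    return grid

# Example: 96 GPUs split as 8-way data x 3-way pipeline x 4-way tensor parallelism
# (illustrative numbers only).
grid = build_3d_groups(world_size=96, tensor=4, pipeline=3)
print(grid[0], grid[95])
```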
GLM-130B: Training Stability
Training stability is an important factor in determining an LLM's quality, and it is heavily influenced by the number of tokens the model passes through. Furthermore, given computing constraints, it is vital to strike a trade-off between stability and efficiency with regard to floating-point formats: low-precision floating-point formats boost computing efficiency, but because they are prone to overflow and underflow errors, they often result in training collapses.
Mixed Precision
In an attempt to boost training efficiency and reduce memory usage, GLM-130B follows the common practice of mixed precision: FP16 for the forward and backward passes, and FP32 for the master weights and optimizer states. Like other popular LLMs, including BLOOM-176B and OPT-175B, GLM-130B faces frequent loss spikes during training with this mixed-precision strategy, and the frequency of these spikes tends to increase as training progresses. Beyond that, developers face major precision-related issues when scaling up transformers.
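For orientation, here is a generic PyTorch mixed-precision training step: FP16 autocast for the forward and backward math, dynamic loss scaling to avoid FP16 underflow, and FP32 parameters and optimizer states. GLM-130B's actual training stack is a Megatron/DeepSpeed-style pipeline, so treat this purely as a sketch of the concept.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(1024, 1024).cuda()        # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters()) # optimizer states kept in FP32
scaler = GradScaler()                             # dynamic loss scaling for FP16

def train_step(batch: torch.Tensor, target: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    with autocast(dtype=torch.float16):           # forward math runs in FP16
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()                 # scale loss to avoid FP16 underflow
    scaler.step(optimizer)                        # unscale grads, update FP32 weights
    scaler.update()                               # adapt the loss scale over time
    return loss.item()
```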
First, the value scale of the transformer's main branch can grow extremely large in the deeper layers when using Pre-LN; in GLM-130B this is addressed by using DeepNorm-based Post-LN, which keeps the value scale bounded at all times. Second, as the model scales up, the attention scores grow to the point where they exceed FP16's range.
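A common remedy for the second problem is to compute the attention scores and softmax in FP32 before casting back down to FP16. The sketch below shows the idea in isolation; it is not GLM-130B's fused attention kernel.

```python
import torch

def attention_scores_fp32(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Compute softmax attention probabilities in FP32 even when q/k are FP16.

    Keeping the score matrix and softmax in FP32 avoids the overflow that large
    attention logits can cause in FP16 at scale.
    """
    scores = torch.matmul(q.float(), k.float().transpose(-1, -2))
    scores = scores / (q.shape[-1] ** 0.5)
    probs = torch.softmax(scores, dim=-1)
    return probs.to(q.dtype)     # cast back for the value multiplication

# Usage with half-precision activations (batch, heads, seq, head_dim).
q = torch.randn(2, 8, 128, 64, dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, dtype=torch.float16)
p = attention_scores_fp32(q, k)
```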
Embedding-Layer Gradient Shrink or EGS
Developers working on GLM-130B identified that the gradient norm can act as an informative indicator of training collapses: a training collapse usually follows a few steps after a spike in the gradient norm. These spikes are caused by abnormal gradients in the embedding layer, whose gradient norm, the developers observed, is several orders of magnitude larger than that of the other layers and tends to fluctuate dramatically during early training. Vision models face a similar issue and handle it by freezing the patch projection layer; however, the same approach cannot be applied to LLMs, since in language models the embedding layer cannot be frozen.
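The alternative the section's title refers to is to shrink the gradient flowing into the word-embedding layer instead of freezing it. A minimal sketch of that trick follows; the shrink factor of 0.1 is the value commonly cited for GLM-130B, but treat it and the surrounding code as an assumption rather than the project's actual implementation.

```python
import torch

def shrink_embedding_gradient(embeds: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Scale the gradient of the embedding output by `alpha` without changing
    its forward value.

    The forward pass returns `embeds` unchanged, but in the backward pass only
    a fraction `alpha` of the gradient reaches the embedding weights, damping
    the early-training spikes that precede loss collapses.
    """
    return embeds * alpha + embeds.detach() * (1.0 - alpha)

# Usage inside a model's forward pass (illustrative).
embedding = torch.nn.Embedding(32000, 1024)
tokens = torch.randint(0, 32000, (2, 16))
hidden = shrink_embedding_gradient(embedding(tokens), alpha=0.1)
```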
GLM-130B: Results and Performance
To evaluate GLM-130B's performance on English tasks, the same settings used by common LLM frameworks such as PaLM and GPT-3 are adopted, and since GLM-130B is a bilingual framework, it is also evaluated on several Chinese benchmarks. Its performance is measured across multiple benchmarks, including Language Modeling, MMLU or Massive Multitask Language Understanding, BIG-Bench or Beyond the Imitation Game Benchmark, and CLUE or Chinese Language Understanding Evaluation. So let's get started.
Language Modeling
The language modeling benchmark for GLM-130B is evaluated on two datasets: LAMBADA and Pile.
The LAMBADA dataset tests the last-word modeling capability of LLMs, and GLM-130B achieves a zero-shot accuracy of 80.2 in its bilingual setting, setting a new record on the LAMBADA dataset en route.
Pile, on the other hand, is a test set comprising a series of benchmarks for language models. On average, compared to GPT-3 and Jurassic-1, GLM-130B delivers the best performance on the 18 shared test sets in terms of weighted bits per byte (BPB). These results demonstrate GLM-130B's strong language capabilities and are included in the table below.
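For readers unfamiliar with the metric, bits per byte normalizes the model's cross-entropy loss by the number of UTF-8 bytes in the text, which makes models with different tokenizers comparable. A quick sketch of the conversion; the loss and token counts below are placeholders, not actual evaluation numbers.

```python
import math

def bits_per_byte(total_nats: float, num_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) into bits per UTF-8 byte."""
    return total_nats / (num_bytes * math.log(2))

# Example: loss summed over the tokens of a document, divided by its byte length.
text = "An example document used for evaluation."
avg_loss_per_token, num_tokens = 2.1, 9          # placeholder numbers
print(bits_per_byte(avg_loss_per_token * num_tokens, len(text.encode("utf-8"))))
```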
MMLU or Massive Multitask Language Understanding
MMLU or Massive Multitask Language Understanding is a diverse benchmark comprising over 50 multiple-choice question-answering tasks concerning human knowledge, ranging from high-school to expert level. Because it was released after the Pile test set was crawled, it serves as an ideal test bed to evaluate the few-shot learning capabilities of an LLM.
As can be seen, in a few-shot setting (5-shot), GLM-130B approaches the performance of GPT-3 after seeing close to 300B tokens. Its performance continues to improve as training proceeds, and by the end of training, it achieves an accuracy of 44.8 after seeing a total of 400B tokens.
BIG-Bench or Beyond the Imitation Game Benchmark
The challenging tasks in BIG-Bench or Beyond the Imitation Game Benchmark test a model's knowledge, reasoning, and commonsense abilities. As demonstrated in the following figures, in the zero-shot setting GLM-130B outperforms both PaLM 540B and GPT-3 175B, which may be due to MIP and bidirectional context attention boosting GLM-130B's performance on unseen tasks in the zero-shot setting. Furthermore, as the number of shots increases, GLM-130B's performance also improves, consistently outperforming GPT-3.
CLUE or Chinese Language Understanding Evaluation
GLM-130B's Chinese zero-shot performance is evaluated on established NLP benchmarks including CLUE and FewCLUE, and compared against the 260B-parameter ERNIE Titan 3.0, the largest existing Chinese language model. As can be observed, GLM-130B consistently outperforms ERNIE Titan 3.0 across 12 different tasks and performs nearly 260% better than ERNIE on the two abstractive MRC datasets.
Conclusion
In this article, we have talked about GLM-130B, a bilingual pre-trained large language model that aims to promote inclusive LLM research. Its architecture, engineering, and technical undertakings aim to give the AI community better insight into the structure of LLM frameworks, training efficiency and stability, pre-training objectives, and affordable inference.