
Meta AI’s Novel Setup Reveals The Structure and Evolution of Transformers

May 16, 2023

In a new paper Birth of a Transformer: A Memory Viewpoint, a Meta AI research team introduces a new synthetic setup to explore the structure and evolution of transformer language models, aiming to provide insights into the global vs. in-context learning of LLMs.

In recent years, large language models (LLMs) have demonstrated a strong capability to learn vast amounts of ‘global’ knowledge from their training data, as well as an ability to quickly adapt to new information given in a context or prompt. Despite these impressive ‘in-context’ learning capabilities, their internal mechanisms remain under-explored, which raises concerns about their reliability in real-world applications.

In the new paper, Birth of a Transformer: A Memory Viewpoint, the Meta AI research team introduces a novel synthetic setup to explore the structure and evolution of transformer language models. Their aim is to provide insights into the global vs. in-context learning of LLMs.

The team summarizes their main contributions as follows:

The team first develops a synthetic dataset to explore how transformers develop global knowledge and in-context learning capability. The dataset consists of sequences generated by generic bigram language models in which some bigrams are sequence-specific. The transformer models must therefore rely on in-context learning to predict the sequence-specific bigrams well, while generic bigrams can be predicted from global statistics based on the current token.
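
To make the setup concrete, here is a minimal, illustrative sketch of such a data-generating process (not the paper's exact construction; the vocabulary size, sequence length and trigger tokens below are placeholder choices): generic bigrams follow one fixed global transition matrix, while a handful of trigger tokens are followed by output tokens that are re-sampled for every sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, SEQ_LEN, N_TRIGGERS = 64, 256, 4              # placeholder sizes, not the paper's
triggers = rng.choice(VOCAB, size=N_TRIGGERS, replace=False)

# Global bigram statistics: one transition matrix shared by all sequences.
global_T = rng.dirichlet(np.ones(VOCAB), size=VOCAB)  # global_T[i] = P(next token | current = i)

def sample_sequence():
    """Generic bigrams follow global_T; each trigger token is followed by an
    output token drawn fresh for this sequence, so it is only predictable
    from context after its first occurrence."""
    seq_outputs = {int(t): int(rng.integers(VOCAB)) for t in triggers}
    seq = [int(rng.integers(VOCAB))]
    for _ in range(SEQ_LEN - 1):
        cur = seq[-1]
        if cur in seq_outputs:                        # sequence-specific bigram
            seq.append(seq_outputs[cur])
        else:                                         # generic bigram from global statistics
            seq.append(int(rng.choice(VOCAB, p=global_T[cur])))
    return np.array(seq)

batch = np.stack([sample_sequence() for _ in range(8)])
print(batch.shape)                                    # (8, 256)
```

A model can only do well on the trigger tokens by copying what followed them earlier in the same sequence, which is the kind of behaviour an induction head implements; the generic bigrams, by contrast, can be memorized as global statistics in the weights.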

To gain a fine-grained understanding of the in-context mechanism during training, the researchers further simplify the two-layer architecture by freezing some of the layers at their random initialization. This simplification allows them to model individual weight matrices as associative memories that store pairs of embeddings, yielding a precise understanding of the learning dynamics.
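
The associative-memory picture can be illustrated with a small numerical sketch (our own toy example, with arbitrary dimensions): pairs of random high-dimensional embeddings are stored as a sum of outer products, and because such embeddings are nearly orthonormal, multiplying the matrix by a stored input embedding returns roughly the paired output embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 256, 20                                 # embedding dimension and number of stored pairs (illustrative)

# Random high-dimensional embeddings are nearly orthonormal.
V = rng.standard_normal((n_pairs, d)) / np.sqrt(d)   # input embeddings v_i
U = rng.standard_normal((n_pairs, d)) / np.sqrt(d)   # output embeddings u_i

# Associative memory: W = sum_i u_i v_i^T stores the pairs v_i -> u_i.
W = U.T @ V                                          # shape (d, d)

# Retrieval: W v_k is approximately u_k plus small cross-talk from the other pairs.
readout = W @ V[3]
recovered = int(np.argmax(U @ readout))              # nearest stored output embedding
print(recovered)                                     # 3, i.e. the stored pair is retrieved correctly
```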

In their empirical study, the researchers trained their model with mini-batch SGD with momentum. They observed that the global bigram statistics tend to be learned faster than the induction head, and that changes to the data distribution greatly affect the speed of in-context learning.

They also provide theoretical insights into the training dynamics, demonstrating that with enough data the associative memories can filter out noise from the inputs, and that the desired associative memory can be recovered even when the attention patterns are near-uniform.
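
As a rough illustration of why gradient-based training produces such memories (our notation, a simplification rather than the paper's exact statements), consider a cross-entropy loss over logits u_k^T W x with fixed embeddings: the gradient with respect to W is itself a sum of outer products, and a single step from zero initialization already yields an associative memory of input-output correlations. With enough samples, the empirical gradient concentrates around this expectation, which is the sense in which noise from individual inputs gets filtered out.

```latex
% Predictions and gradient of the cross-entropy loss, with fixed embeddings u_k and x:
\hat p_W(k \mid x) \propto \exp\!\big(u_k^\top W x\big), \qquad
\nabla_W L(W) = \mathbb{E}_{(x,y)}\Big[\sum_k \big(\hat p_W(k \mid x) - \mathbf{1}\{y=k\}\big)\, u_k x^\top\Big].

% One gradient step of size \eta from W = 0 (uniform predictions over K classes)
% gives an associative memory of input--output correlations:
W_1 = -\eta\, \nabla_W L(0)
    = \eta\, \mathbb{E}\big[u_y x^\top\big] - \frac{\eta}{K} \sum_k u_k\, \mathbb{E}[x]^\top .
```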

Overall, this work provides valuable insights into the structure and evolution of transformer models. The team says its next step will be to explore how transformers leverage other aspects, such as learned embeddings, factorized key-query matrices and non-linear feed-forward layers, to learn in richer settings.

The paper Birth of a Transformer: A Memory Viewpoint is on arXiv.

Author: Hecate He | Editor: Chain Zhang

We know you don't want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

