Summary: From Deep to Long Learning? · Hazy Research (hazyresearch.stanford.edu)
1,654 words · html page
One Line
The article discusses the challenge of modeling long-range dependencies in machine learning, the Long Range Arena benchmark introduced to measure it, and models such as HiPPO, S4, H3, and Hyena; Hazy Research is exploring longer-sequence models for applications such as high-resolution imaging and language models that can read entire books.
Key Points
- Hazy Research is exploring longer-sequence models for deep learning, motivated by applications such as high-resolution imaging and language models.
- They are proposing new architectures, including the H3 and Hyena models, which combine multiplicative gating with long convolutions to handle longer sequences.
- Other work aimed at longer sequences includes HiPPO, S4, and FlashAttention.
- The Long Range Arena benchmark evaluates models' ability to handle long-range dependencies, and researchers are investigating models that run in nearly linear time in sequence length to address this issue.
- Longer sequences could enable machine learning models to learn from longer contexts, multiple media sources, and complex demonstrations.
- Hazy Research has proposed the first fully near-linear-time convolutional models that match Transformers on perplexity and downstream tasks, with promising results in initial scaling experiments (a sketch of the FFT-based long convolution behind this follows this list).
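The near-linear scaling comes from computing long convolutions with the FFT in O(N log N) instead of the O(N^2) cost of direct convolution or attention. Below is a minimal numpy sketch of that primitive; it is an illustration under that assumption, not the authors' implementation.

```python
import numpy as np

def fft_long_conv(u: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Causal convolution of a length-N signal u with a length-N filter k,
    computed in O(N log N) via the FFT."""
    n = u.shape[-1]
    fft_size = 2 * n  # zero-pad so circular convolution equals linear convolution
    u_f = np.fft.rfft(u, n=fft_size)
    k_f = np.fft.rfft(k, n=fft_size)
    # Multiply in frequency space, transform back, and keep the causal part.
    return np.fft.irfft(u_f * k_f, n=fft_size)[..., :n]

# Tiny usage example: one channel, sequence length 8.
u = np.random.randn(8)
k = np.random.randn(8)
print(fft_long_conv(u, k).shape)  # (8,)
```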
Summaries
267 word summary
The article discusses the challenge of modeling long-range dependencies in machine learning and the need for models that can handle longer sequences. The Long Range Arena (LRA) benchmark was introduced in 2020 to evaluate models' ability to handle long-range dependencies. To address the problem, researchers have been investigating models that run in nearly linear time in sequence length, such as HiPPO, S4, H3, and Hyena. Another approach to increasing sequence length is the use of images as context. The article also mentions FlashAttention, an attention algorithm from the Hazy Research lab that increases the sequence lengths Transformers can handle.

The H3 model takes three projections of the input and iteratively applies long convolutions and multiplicative gates. The Hyena model adds more projections and gates, generalizing to more expressive architectures and closing the gap to attention. The long convolution filter can be parametrized implicitly, as in CKConv or FlexConv, and every SSM can be viewed as a convolution with a filter as long as the input sequence. Hyena's cost grows nearly linearly in sequence length, and the next architecture in this line of work is RWKV. SSMs with gating can also work well in concert with attention in language modeling. The S4 model successfully models long-range dependencies and scales nearly linearly with sequence length. Hazy Research is exploring longer-sequence models for applications such as high-resolution imaging and language models that can read entire books, and is looking at ways to make the connection between the FFT and matrix multiplication more efficient. They have proposed the first fully near-linear-time convolutional models that match Transformers on perplexity and downstream tasks, with promising results in initial scaling experiments.
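As a rough illustration of the gating pattern the summary attributes to H3 and Hyena (several projections of the input, long convolutions interleaved with element-wise gates), here is a minimal numpy sketch. The projection matrices and convolution filters are random placeholders, and the real H3 layer uses particular SSM parametrizations for its filters, so this shows only the data flow, not the published architecture.

```python
import numpy as np

def fft_conv(u, k):
    """Causal FFT convolution along the last axis."""
    n = u.shape[-1]
    u_f = np.fft.rfft(u, n=2 * n)
    k_f = np.fft.rfft(k, n=2 * n)
    return np.fft.irfft(u_f * k_f, n=2 * n)[..., :n]

def h3_like_layer(x, Wq, Wk, Wv, filt1, filt2):
    """Three projections of the input, with convolutions and multiplicative gates.

    x: (seq_len, dim); Wq, Wk, Wv: (dim, dim); filt1, filt2: (seq_len,).
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv     # three projections, as in attention
    kv = fft_conv(k.T, filt1).T * v      # long convolution on k, gate with v
    return fft_conv(kv.T, filt2).T * q   # second convolution, gate with q

# Tiny example: sequence length 16, model dimension 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4))
proj = lambda: rng.standard_normal((4, 4))
filt = lambda: rng.standard_normal(16)
y = h3_like_layer(x, proj(), proj(), proj(), filt(), filt())
print(y.shape)  # (16, 4)
```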
369 word summary
Hazy Research is excited about exploring longer and longer sequences and new architectures for deep learning. They are motivated by applications that could benefit from longer-sequence models, such as high-resolution imaging and language models that can read entire books. They are exploring what class of transforms a learned extension of the FFT can express and what it allows them to do, and are looking at ways to make the connection between the FFT and matrix multiplication more efficient. Hazy Research has proposed the first fully near-linear-time convolutional models that match Transformers on perplexity and downstream tasks, with promising results in initial scaling experiments.

The H3 (Hungry Hungry Hippos) model takes three projections of the input and iteratively applies long convolutions and multiplicative gates: the H3 layer stacks two SSMs and multiplies their outputs together with a multiplicative gate. The Hyena model adds more projections and gates, generalizing to more expressive architectures and closing the gap to attention. The long convolution filter can be parametrized implicitly, as in CKConv or FlexConv, and every SSM can be viewed as a convolution with a filter as long as the input sequence. Hyena's cost grows nearly linearly in sequence length, and the next architecture in this line of work is RWKV. SSMs with gating can also work well in concert with attention in language modeling. FlashAttention, an attention algorithm from the Hazy Research lab, increases the sequence lengths Transformers can handle. The S4 model successfully models long-range dependencies and scales nearly linearly with sequence length.

The article discusses the challenge of modeling long-range dependencies in machine learning and the need for models that can handle longer sequences. The Long Range Arena (LRA) benchmark was introduced in 2020 to evaluate models' ability to handle long-range dependencies, but many Transformer variants struggled to perform better than random guessing. To address this, researchers have been investigating models that run in nearly linear time in sequence length, such as HiPPO, S4, H3, and Hyena. Another approach to increasing sequence length is the use of images as context. The context lengths of foundation models have been growing recently, and longer sequences could enable machine learning models to learn from longer contexts, multiple media sources, complex demonstrations, and more.
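The statement that every SSM can be viewed as a convolution with a filter as long as the input is easy to check numerically. The sketch below is a toy verification under that framing (random matrices rather than a trained HiPPO/S4 parametrization): it materializes the kernel K = [CB, CAB, CA^2 B, ...] from a state-space model x_t = A x_{t-1} + B u_t, y_t = C x_t, and confirms that convolving the input with K matches running the recurrence.

```python
import numpy as np

def ssm_kernel(A, B, C, length):
    """Materialize the length-`length` convolution filter K_j = C A^j B."""
    K, x = [], B
    for _ in range(length):
        K.append(C @ x)   # C A^t B for t = 0, 1, ...
        x = A @ x
    return np.array(K).ravel()

def ssm_recurrent(A, B, C, u):
    """Run the same SSM as a step-by-step recurrence, for comparison."""
    x, ys = np.zeros(A.shape[0]), []
    for u_t in u:
        x = A @ x + B.ravel() * u_t
        ys.append(C @ x)
    return np.array(ys).ravel()

rng = np.random.default_rng(0)
d, n = 4, 16                            # state size, sequence length
A = rng.standard_normal((d, d)) * 0.3   # scaled down so the recurrence stays stable
B, C = rng.standard_normal((d, 1)), rng.standard_normal((1, d))
u = rng.standard_normal(n)

K = ssm_kernel(A, B, C, n)              # filter as long as the input
y_conv = np.convolve(u, K)[:n]          # convolutional view
y_rec = ssm_recurrent(A, B, C, u)       # recurrent view
print(np.allclose(y_conv, y_rec))       # True
```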