Summary: Inverted Transformers for Time Series Forecasting (arxiv.org)
One Line
iTransformer improves time series forecasting by inverting the dimensions on which the attention mechanism and feed-forward network operate, yielding state-of-the-art performance and interpretable attention maps.
Key Points
- Transformers have been successful in natural language processing and computer vision but face challenges in time series forecasting, especially for series with larger lookback windows.
- iTransformer is proposed as a modification of the Transformer architecture for time series forecasting.
- iTransformer uses the attention mechanism to capture multivariate correlations across variate tokens and applies the feed-forward network to each token to learn nonlinear series representations (see the sketch after this list).
- Experimental results show that iTransformer achieves state-of-the-art performance on real-world datasets and addresses the limitations of traditional Transformers.
- The authors discuss related work in time series forecasting and compare iTransformer with other models in terms of performance and efficiency.
- The experiments are conducted in PyTorch on a single GPU using ADAM optimization with L2 loss.
- Ablation studies and hyperparameter sensitivity analysis validate the rationality of Transformer components in iTransformer.
- The iTransformers framework consistently improves the performance of Transformer variants and achieves state-of-the-art results in various forecasting applications.
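As referenced above, the inversion is easy to state in tensor terms. Below is a minimal PyTorch sketch; the dimensions, module choices, and final projection are illustrative assumptions, not the authors' implementation. Each variate's whole lookback series becomes one token, attention mixes the N variate tokens, and the feed-forward network acts on each token independently.

```python
import torch
import torch.nn as nn

# Minimal sketch of one inverted Transformer block; illustrative, not the paper's code.
B, T, N, D, S = 32, 96, 7, 128, 96   # batch, lookback, variates, model dim, horizon
x = torch.randn(B, T, N)             # multivariate lookback series

# Inversion: each variate's whole series becomes one token of dimension D.
embed = nn.Linear(T, D)
tokens = embed(x.transpose(1, 2))    # (B, N, D): N variate tokens

# Attention now mixes information across variates rather than across time steps,
# so its (N x N) map can be read as multivariate correlations.
attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
attn_out, attn_map = attn(tokens, tokens, tokens)   # attn_map: (B, N, N)

# The feed-forward network learns a nonlinear representation per variate token.
ffn = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))
z = tokens + attn_out                # residual; LayerNorms omitted for brevity
h = z + ffn(z)

# A linear head maps each token back to the prediction horizon.
y_hat = nn.Linear(D, S)(h).transpose(1, 2)   # (B, S, N) forecast
```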
Summaries
18 word summary
iTransformer improves time series forecasting by inverting the attention mechanism and feed-forward network, achieving state-of-the-art performance and interpretability.
77 word summary
The iTransformer overcomes limitations of Transformer-based forecasters in time series forecasting by inverting the duties of the attention mechanism and the feed-forward network. It achieves state-of-the-art performance on real-world datasets, outperforming other models in both accuracy and efficiency. Ablation studies and hyperparameter sensitivity analysis support the rationality of the Transformer components in iTransformer, and the attention mechanism yields interpretable learned maps. Comprehensive results demonstrate the consistent improvement and superiority of iTransformer over other models in forecasting applications.
112 word summary
iTransformer is proposed as a solution to the limitations of Transformer-based forecasters in time series forecasting. It addresses the difficulties with larger lookback windows and with the unified embedding of multiple variates by inverting the duties of the attention mechanism and the feed-forward network. Experimental results show that iTransformer achieves state-of-the-art performance on real-world datasets, outperforming other models in both accuracy and efficiency. Ablation studies and hyperparameter sensitivity analysis support the rationality of the Transformer components in iTransformer. The attention mechanism produces interpretable learned maps by correlating variate tokens. Prediction showcases and comprehensive results demonstrate the consistent improvement and superiority of iTransformer over other competitive models across various forecasting applications.
476 word summary
The authors propose iTransformer as a solution to the limitations of Transformer-based forecasters in time series forecasting. Transformers struggle with larger lookback windows, and the unified embedding of multiple variates can result in meaningless attention maps. iTransformer addresses these issues by inverting the duties of the attention mechanism and the feed-forward network: the time points of each individual series are embedded into a variate token, attention captures multivariate correlations across these tokens, and the feed-forward network is applied to each variate token to learn nonlinear representations.
Experimental results show that iTransformer achieves state-of-the-art performance on real-world datasets. The authors compare iTransformer with other models, highlighting the advantages of their approach in terms of performance and efficiency.
The experiments are conducted in PyTorch on a single GPU using ADAM optimization with L2 loss. The authors vary the number of inverted Transformer blocks and series representation dimensions to evaluate their effects on performance.
Detailed descriptions of the datasets used in the experiments are provided, including size, prediction length, dataset size, and frequency.
Ablation studies are conducted to analyze the rationality of Transformer components in iTransformer. Different architectural designs are compared, and iTransformer consistently outperforms other designs.
Hyperparameter sensitivity analysis is conducted to investigate the effects of learning rate, number of Transformer blocks, and hidden dimension on performance. Careful selection of the learning rate is important when the number of variates is large, and larger block counts and hidden dimensions do not necessarily lead to better performance.
The attention mechanism in inverted Transformers yields more interpretable learned maps by correlating variate tokens. Visualization of multivariate correlations on the Solar-Energy dataset shows that the attention maps are interpretable as variate correlations and that layer stacking performs an encoding/decoding process.
Figures 11, 12, and 13 present prediction showcases on three representative datasets, comparing iTransformer with other models. iTransformer consistently exhibits superior performance and predicts future series variations most precisely.
The full results of the iTransformers framework applied to five Transformer variants are presented in Table 2, showcasing consistent improvement and the advantage of efficient attention mechanisms. Supplementary forecasting results are provided in Table 6, further demonstrating the consistent improvement achieved by the iTransformers framework.
Table 7 presents the full results of the iTransformer model compared to other competitive models across six well-acknowledged benchmarks. iTransformer outperforms the other models across all prediction lengths, achieving state-of-the-art performance in real-world forecasting applications.
Table 8 presents the full results of the Market dataset for transaction forecasting. iTransformer consistently achieves lower MSE and MAE values, indicating its superior performance in this task.
In conclusion, the iTransformers framework, with its inverted use of attention and the feed-forward network, yields attention maps that are interpretable as multivariate correlations and consistently improves the performance of Transformer variants. The visualization of multivariate correlations and the prediction showcases highlight the effectiveness of iTransformer in time series forecasting tasks, and the full results demonstrate its superiority over other competitive models across various forecasting applications.
547 word summary
The authors of the paper propose iTransformer as a solution to the limitations of Transformer-based forecasters in time series forecasting. They explain that Transformers have been successful in other domains but struggle with larger lookback windows in time series forecasting, and that the unified embedding of multiple variates can result in meaningless attention maps. iTransformer addresses these issues by inverting the duties of the attention mechanism and the feed-forward network. It embeds the time points of each individual series into a variate token so that attention can capture multivariate correlations, and then applies the feed-forward network to each variate token to learn nonlinear representations.
Experimental results show that iTransformer achieves state-of-the-art performance on real-world datasets. The authors highlight three contributions of their work: reflecting on the architecture of Transformer, proposing iTransformer as a fundamental backbone for time series forecasting, and achieving consistent state-of-the-art performance on real-world benchmarks. They compare iTransformer with other models in terms of performance and efficiency, emphasizing the advantages of their approach.
The experiments are conducted in PyTorch on a single GPU using ADAM optimization with L2 loss. The batch size is 32 and the number of training epochs is 10. The authors vary the number of inverted Transformer blocks and series representation dimensions to evaluate their effects on performance.
The authors provide detailed descriptions of the datasets used in the experiments, including size, prediction length, dataset size, and frequency.
Ablation studies are conducted to analyze the rationality of Transformer components in iTransformer. Different architectural designs are compared, and it is found that iTransformer consistently outperforms other designs.
Hyperparameter sensitivity analysis is conducted to investigate the effects of learning rate, number of Transformer blocks, and hidden dimension on performance. It is found that careful selection of the learning rate is important when the number of variates is large, and that larger block counts and hidden dimensions do not necessarily lead to better performance.
The attention mechanism in inverted Transformers yields more interpretable learned maps by correlating variate tokens. Visualization of multivariate correlations on the Solar-Energy dataset shows that the attention maps are interpretable as variate correlations and that layer stacking performs an encoding/decoding process.
Figures 11, 12, and 13 present prediction showcases on three representative datasets, comparing iTransformer with other models. iTransformer consistently exhibits superior performance and predicts future series variations most precisely.
The full results of the iTransformers framework applied to five Transformer variants are presented in Table 2, showcasing consistent improvement and the advantage of efficient attention mechanisms. Supplementary forecasting results are provided in Table 6, further demonstrating the consistent improvement achieved by the iTransformers framework.
Table 7 presents the full results of the iTransformer model compared to other competitive models across six well-acknowledged benchmarks. iTransformer outperforms the other models across all prediction lengths, achieving state-of-the-art performance in real-world forecasting applications.
Table 8 presents the full results of the Market dataset for transaction forecasting. iTransformer consistently achieves lower MSE and MAE values, indicating its superior performance in this task.
In conclusion, the iTransformers framework, with its inverted use of attention and the feed-forward network, yields attention maps that are interpretable as multivariate correlations and consistently improves the performance of Transformer variants. The visualization of multivariate correlations and the prediction showcases highlight the effectiveness of iTransformer in time series forecasting tasks, and the full results demonstrate its superiority over other competitive models across various forecasting applications.
813 word summary
The recent boom in linear forecasting models has raised questions about the effectiveness of Transformer-based forecasters. While Transformers have been successful in natural language processing and computer vision, their performance in time series forecasting, especially for series with larger lookback windows, has been challenged. Additionally, the unified embedding of multiple variates with potentially unaligned timestamps and distinct physical measurements in Transformers may fail to capture variate-centric representations and result in meaningless attention maps.
In this work, the authors propose iTransformer, which repurposes the Transformer architecture without modifying its basic components. iTransformer inverts the duties of the attention mechanism and the feed-forward network. The time points of each individual series are embedded into a variate token, and the attention mechanism operates on these tokens to capture multivariate correlations. The feed-forward network is then applied to each variate token to learn nonlinear representations.
Experimental results show that iTransformer achieves consistent state-of-the-art performance on several real-world datasets. It outperforms other Transformer-based forecasters and addresses the limitations of the traditional Transformer architecture. The authors highlight three contributions of their work: reflecting on the architecture of Transformer and refining the competent capability of native Transformer components, proposing iTransformer as a fundamental backbone for time series forecasting, and achieving consistent state-of-the-art performance on real-world forecasting benchmarks.
The authors also discuss related work in the field of time series forecasting. They categorize existing modifications of Transformer-based forecasters into four categories based on whether they modify components and architecture. They compare their proposed iTransformer with other models in terms of performance and efficiency, highlighting the advantages of their approach.
In terms of implementation details, the experiments are conducted in PyTorch on a single GPU. The models are optimized using ADAM with L2 loss. The batch size is set to 32, and the number of training epochs is fixed at 10. The number of inverted Transformer blocks in iTransformer and the dimension of series representations are varied to evaluate their effects on performance.
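A training loop matching the reported setup could look like the following sketch. The model and data here are stand-ins (the paper tunes the learning rate per dataset and uses the real benchmark loaders); only the optimizer, loss, batch size, and epoch count come from the text above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

T, S, N = 96, 96, 7                  # lookback, horizon, variates (illustrative)

class NaiveForecaster(nn.Module):
    """Stand-in for an iTransformer-style model: per-variate linear map T -> S."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(T, S)

    def forward(self, x):            # x: (B, T, N)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)  # (B, S, N)

# Synthetic data in place of the real benchmark loaders.
data = TensorDataset(torch.randn(256, T, N), torch.randn(256, S, N))
loader = DataLoader(data, batch_size=32, shuffle=True)     # batch size 32, as reported

model = NaiveForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is tuned per dataset
criterion = nn.MSELoss()                                   # the L2 loss

for epoch in range(10):                                    # 10 epochs, as reported
    for lookback, target in loader:
        optimizer.zero_grad()
        loss = criterion(model(lookback), target)
        loss.backward()
        optimizer.step()
```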
The authors provide detailed descriptions of the datasets used in the experiments, including Electricity, ETT, Traffic, Solar-Energy, Weather, PEMS, and Market datasets. They explain the size, prediction length, dataset size, and frequency of each dataset.
The authors also conduct ablation studies to analyze the rationality of Transformer components in iTransformer. They compare different architectural designs, such as replacing and removing components, and evaluate their performance. They find that iTransformer, which utilizes self-attention for multivariate correlations and feed-forward networks for series representations, consistently outperforms other designs.
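The ablation can be read as a choice of which axis each component operates on and which components are kept. The sketch below is a hypothetical paraphrase of those comparisons, not the paper's code; the modules are freshly initialized and untrained, so it illustrates data flow only.

```python
import torch
import torch.nn as nn

def encode(x, token_axis="variate", use_attention=True, use_ffn=True, D=64):
    """Structural sketch of one ablation variant. x: (B, T, N) lookback series.
    Modules are created untrained on each call; this only shows the data flow."""
    B, T, N = x.shape
    if token_axis == "variate":
        tokens = nn.Linear(T, D)(x.transpose(1, 2))   # (B, N, D): iTransformer layout
    else:
        tokens = nn.Linear(N, D)(x)                   # (B, T, D): vanilla temporal layout
    if use_attention:                                 # mixing across the token axis
        attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        tokens = tokens + attn(tokens, tokens, tokens)[0]
    if use_ffn:                                       # per-token nonlinear representation
        ffn = nn.Sequential(nn.Linear(D, 2 * D), nn.GELU(), nn.Linear(2 * D, D))
        tokens = tokens + ffn(tokens)
    return tokens

# e.g. compare encode(torch.randn(8, 96, 7)) with token_axis="temporal" or use_ffn=False
```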
Finally, the authors investigate the hyperparameter sensitivity of iTransformer. They analyze the effects of learning rate, number of Transformer blocks, and hidden dimension on performance. They find that the learning rate should be carefully selected when the number of variates is large, and larger block numbers and hidden dimensions do not necessarily lead to better performance in iTransformer.
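The sensitivity study amounts to a small grid search over these three knobs. A hedged sketch follows; the value ranges are illustrative assumptions, and `train_and_eval` stands in for the training loop above plus validation.

```python
import itertools
import random

def train_and_eval(lr, num_blocks, hidden_dim):
    """Placeholder: would train iTransformer with this config and return validation MSE."""
    return random.random()

# Illustrative grid; the paper's exact search ranges may differ.
grid = itertools.product([1e-5, 1e-4, 1e-3],   # learning rate
                         [1, 2, 3, 4],         # number of inverted Transformer blocks
                         [128, 256, 512])      # hidden dimension
best = min(grid, key=lambda cfg: train_and_eval(*cfg))
print("best (lr, blocks, hidden_dim):", best)
```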
The attention mechanism in inverted Transformers allows for more interpretable learned maps by correlating variate tokens. Figure 10 showcases the visualization of multivariate correlations in the Solar-Energy dataset. Each case is divided into the lookback and future time series, with distinct multivariate correlations due to seasonal changes. The learned pre-Softmax maps in the shallow attention layer resemble the correlations of the raw lookback series, while deeper layers resemble the correlations of the future series. This shows that the attention maps are interpretable as variate correlations and that layer stacking performs an encoding/decoding process.
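Reproducing that kind of figure needs only the pre-Softmax score map and the raw series correlations it is compared against. A hedged matplotlib sketch, with random stand-ins for real tokens and series:

```python
import torch
import matplotlib.pyplot as plt

N, D, T = 10, 64, 96
tokens = torch.randn(N, D)    # stand-in for variate tokens from some encoder layer
series = torch.randn(N, T)    # stand-in for the raw lookback series, one row per variate

scores = tokens @ tokens.T / D ** 0.5   # pre-Softmax attention scores, (N, N)
corr = torch.corrcoef(series)           # correlations of the raw series, (N, N)

# Side-by-side heatmaps: the paper reports that shallow-layer score maps resemble
# lookback correlations and deep-layer maps resemble future correlations.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, mat, title in zip(axes, [scores, corr],
                          ["pre-Softmax scores", "lookback correlations"]):
    ax.imshow(mat.numpy(), cmap="coolwarm")
    ax.set_title(title)
plt.show()
```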
To provide a clear comparison among different models, Figures 11, 12, and 13 present supplementary prediction showcases on three representative datasets: Traffic, Electricity, and Weather. The models compared include iTransformer, PatchTST, DLinear, Crossformer, Autoformer, and Transformer. Among these models, iTransformer exhibits superior performance and predicts future series variations most precisely.
The full results of the iTransformers framework applied to five Transformer variants (Transformer, Reformer, Informer, Flowformer, Flashformer) are presented in Table 2. The framework consistently promotes these variants and takes advantage of efficient attention mechanisms. Supplementary forecasting results are provided in Table 6, further demonstrating the consistent improvement achieved by the iTransformers framework.
The full multivariate forecasting results are provided in Table 7 for six well-acknowledged benchmarks. The iTransformer model is compared to extensive competitive models under different prediction lengths. The results show that iTransformer outperforms the other models across all prediction lengths, achieving state-of-the-art performance in real-world forecasting applications.
Table 8 presents the full results of the Market dataset for transaction forecasting. iTransformer is compared to other models including PatchTST, Crossformer, TimesNet, SCINet, DLinear, FEDformer, Stationary Autoformer, and Informer. iTransformer consistently achieves lower MSE and MAE values, indicating its superior performance in this task.
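For reference, the MSE and MAE reported in these tables are the standard definitions, averaged over all predicted points and variates:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over all forecast points and variates."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error over all forecast points and variates."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([[0.2, 0.5], [0.1, 0.9]])
y_pred = np.array([[0.3, 0.4], [0.1, 1.0]])
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 0.0075 0.075
```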
In conclusion, the iTransformers framework, with its inverted use of attention and the feed-forward network, yields attention maps that are interpretable as multivariate correlations and consistently improves the performance of Transformer variants. The visualization of multivariate correlations and the prediction showcases highlight the effectiveness of iTransformer in time series forecasting tasks, and the full results demonstrate its superiority over other competitive models across various forecasting applications.