Toto: Time Series Optimized Transformer for Observability
Technical Report

Ben Cohen, Emaad Khwaja, Kan Wang, Charles Masson, Elise Ramé, Youssef Doubli, Othmane Abou-Amal
{ben.cohen, emaad, kan.wang, charles.masson, elise.rame, youssef.doubli, othmane}@datadoghq.com

Abstract

This technical report describes the Time Series Optimized Transformer for Observability (Toto), a new state-of-the-art foundation model for time series forecasting developed by Datadog. In addition to advancing the state of the art on generalized time series benchmarks in domains such as electricity and weather, this model is the first general-purpose time series forecasting foundation model to be specifically tuned for observability metrics.

Toto was trained on a dataset of one trillion time series data points – the largest among all currently published time series foundation models. Alongside publicly available time series datasets, 75% of the data used to train Toto consists of fully anonymous numerical metric data points from the Datadog platform.

In our experiments, Toto outperforms existing time series foundation models on observability data. It does this while also excelling at general-purpose forecasting tasks, achieving state-of-the-art zero-shot performance on multiple open benchmark datasets.

In this report, we detail the following key contributions:

•

Proportional factorized space-time attention: We introduce an advanced attention mechanism that allows for efficient grouping of multivariate time series features, reducing computational overhead while maintaining high accuracy.
•

Student-T mixture model head: This novel use of a probabilistic model that robustly generalizes Gaussian mixture models enables Toto to more accurately capture the complex dynamics of time series data and provides superior performance over traditional approaches.
•

Domain-specific training data: In addition to general multi-domain time series data, Toto is specifically pre-trained on a large-scale dataset of Datadog observability metrics, encompassing unique characteristics not present in open-source datasets. This targeted training ensures enhanced performance in observability metric forecasting.

Figure 1: Toto architecture diagram. Input time series of $T$ steps (univariate example used for simplicity here) are first embedded using the patch embedding layer. They then pass through the transformer stack, which contains $L$ identical segments. Each segment of the transformer consists of one space-wise transformer block followed by $N$ time-wise blocks. The flattened transformer outputs are projected to form the parameters of the Student-T mixture model (SMM) head. The final outputs are the forecasts for the input series, shifted $P$ steps (the patch width) into the future.

1 Background

We present Toto, a time series forecasting foundation model developed by Datadog. Toto is specifically designed to handle the complexities of observability data, leveraging a state-of-the-art transformer architecture to deliver high accuracy and performance. Toto is trained on a massive dataset of diverse time series data, enabling it to excel in zero-shot predictions. This model is tailored to meet the demanding requirements of real-time analysis as well as compute- and memory-efficient scalability to very large data volumes, providing robust solutions for the high-frequency and high-dimensional data commonly encountered in observability metrics.

1.1 Observability data

The Datadog observability platform collects a vast array of metrics across multiple subdomains, crucial for monitoring and optimizing modern infrastructure and applications. These metrics include infrastructure data such as memory usage, CPU load, disk I/O, and network throughput, as well as application performance indicators like hit counts, error rates, and latency [1]. Additionally, Datadog integrates specific metrics from numerous SaaS products, cloud services, open-source frameworks, and other third-party tools. The platform allows users to apply various time series models to proactively alert on anomalous behavior, leading to a reduction in time to detection (TTD) and time to resolution (TTR) of production incidents [2].

The complexity and diversity of these metrics present significant challenges for time series forecasting. Observability data often requires high time resolution, down to seconds or minutes, and is typically sparse with many zero-inflated metrics. Moreover, these metrics can display extreme dynamic ranges and right-skewed distributions. The dynamic and nonstationary nature of the systems being monitored further complicates the forecasting task, necessitating advanced models that can adapt and perform under these conditions.

1.2 Traditional models

Historically, time series forecasting has relied on classical models such as ARIMA, exponential smoothing, and basic machine learning techniques [3]. While foundational, these models necessitate individual training for each metric, presenting several limitations [4]. The need to develop and maintain separate models for each metric impedes scalability, especially given the extensive range of metrics in observability data. Moreover, these models often fail to generalize across different types of metrics, leading to suboptimal performance on diverse datasets [5, 6]. Continuous retraining and tuning to adapt to evolving data patterns further increase the operational burden. This scaling limitation has hindered the adoption of deep learning–based methods for time series analysis, even as they show promise in terms of accuracy [7].

1.3 Foundation models

Large neural network-based generative models, often referred to as “foundation models,” have revolutionized time series forecasting by enabling accurate predictions on new data not seen during training, known as zero-shot prediction [8]. This capability significantly reduces the need for constant retraining on each specific metric, thus saving considerable time and computational resources. Their architecture supports the parallel processing of vast data volumes, facilitating timely insights essential for maintaining system performance and reliability [9, 10].

Through pretraining on diverse datasets, generative models exhibit strong generalization across various types of time series data. This enhances their robustness and versatility, making them suitable for a wide range of applications. Zero-shot predictions are particularly attractive in the observability domain, where the limitations of traditional methods are felt very acutely. The most common use cases for time series models within an observability platform like Datadog include automated anomaly detection and predictive alerting. It is challenging to scale classical forecasting methods to handle cloud-based applications that can be composed of many ephemeral, dynamically scaling components such as containers, VMs, serverless functions, etc. These entities tend to be both high in cardinality and short-lived in time. This limits the practicality of traditional time series models in two ways:

•

First, the high cardinality and volume of data can make fitting individual models to each time series computationally expensive or even intractable. The ability to train a single model and perform inference across a wide range of domains has the potential to dramatically improve the efficiency, and thus the coverage, of an autonomous monitoring system.
•

Second, ephemeral infrastructure elements often lack enough historical data to confidently fit a model. In practice, algorithmic alerting systems often require an adaptation period of days or weeks before they can usefully monitor a new metric. However, if the object being monitored is a container with a lifespan measured in minutes or hours, these classical models are unable to adapt quickly enough to be useful. Real-world systems thus often fall back to crude heuristics, such as threshold-based alerts, which rely on the domain knowledge of users. Zero-shot foundation models can enable accurate predictions with much less historical context, by aggregating and interpolating prior information learned from a massive and diverse dataset.

The integration of transformer-based models [11] like Toto into observability data analysis thus promises significant improvements in forecasting accuracy and efficiency. These models offer a robust solution for managing diverse, high-frequency data and delivering zero-shot predictions. With their advanced capabilities, transformer-based models represent a significant leap forward in the field of observability and time series analysis [12, 13, 14].

1.4 Recent work

The past several years have seen the rise of transformer-based models as powerful tools for time series forecasting. These models leverage multi-head self-attention mechanisms to capture long-range dependencies and intricate patterns in data.

To address the unique challenges of time series data, recent advancements have introduced various modifications to the attention mechanism. For example, Moirai [15] uses “any-variate” attention to model dependencies across different series simultaneously. Factorized attention mechanisms [16] have been developed to separately capture temporal and spatial (cross-series) interactions, enhancing the ability to understand complex interdependencies. Other models [17, 18] have used cross-channel attention in conjunction with feed-forward networks for mixing in the time dimension. Additionally, causal masking [19] and hierarchical encoding [16] can improve the efficiency and accuracy of predictions in time series contexts.

These innovative transformer-based models have demonstrated state-of-the-art performance on benchmark datasets [14], frequently surpassing traditional models in both accuracy and robustness. Their capacity to process high-dimensional data efficiently [20] makes them ideal for applications involving numerous time series metrics with varying characteristics, such as observability.

Even more recently, a number of time series “foundation models” have been released [19, 21, 15, 22, 23, 24]. By pre-training on extensive, multi-domain datasets, these large models achieve impressive zero-shot prediction capabilities, significantly reducing the need for constant retraining. This paradigm is appealing for the observability context, where we constantly have new time series to process and frequent retraining is impractical.

2 Problem statement

At Datadog, our time series data encompasses a variety of observability metrics from numerous subdomains. These metrics present several challenges for existing forecasting models:

•

High time resolution: Users often require data in increments of seconds or minutes, unlike many publicly available time series datasets that are at hourly frequency or above.
•

Sparsity: Metrics such as error counts often track rare events, resulting in sparse and zero-inflated time series.
•

Extreme right skew: Latency measurements in distributed systems exhibit positive, heavy-tailed distributions with extreme values at high percentiles.
•

Dynamic, nonstationary systems: The behavior of monitored systems changes frequently due to code deployments, infrastructure scaling, feature flag management, and other configuration changes, as well as external factors like seasonality and user-behavior-driven trends. Some time series, such as those monitoring fleet deployments, can also have a very low variance, exhibiting a piecewise-constant shape.
•

High-cardinality multivariate data: Monitoring large fleets of ephemeral cloud infrastructure such as virtual machines (VMs), containers, and serverless functions leads to high-cardinality data, with hundreds or thousands of individual time series variates, often with limited historical data for each group.
•

Historical anomalies: Historical data often contains outliers and anomalies caused by performance regressions or production incidents.

Foundation models pre-trained on other domains struggle to generalize effectively to observability data due to these characteristics. To overcome this, we developed Toto, a state-of-the-art foundation model that excels at observability forecasting while also achieving top performance on standard open benchmarks.

3 Model architecture

Toto is a decoder-only forecasting model. This model employs many of the latest techniques from the literature, and introduces a novel method for adapting multi-head attention to multivariate time series data (Fig. 1).

3.1 Transformer design

Transformer models for time series forecasting have variously used encoder-decoder [13, 12, 21], encoder-only [14, 15, 17], and decoder-only architectures [23, 19]. For Toto, we employ a decoder-only architecture. Decoder architectures have been shown to scale well [25, 26], and allow for arbitrary prediction horizons. The causal next-patch prediction task also simplifies the pre-training process.

We use techniques from some of the latest large language model (LLM) architectures, including pre-normalization [27], RMSNorm [28], and SwiGLU feed-forward layers [29].
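As a rough illustration of how these three components fit together, the NumPy sketch below applies RMSNorm before a SwiGLU feed-forward sublayer and adds the residual afterwards (pre-normalization). The function names, the tensor sizes, and the omission of biases are assumptions made for this example; the report does not specify Toto's implementation at this level of detail.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Divide each vector by its root-mean-square, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU: a Swish-activated gate multiplies a linear "up" projection elementwise.
    gate = x @ w_gate
    swish = gate / (1.0 + np.exp(-gate))
    return (swish * (x @ w_up)) @ w_down

def pre_norm_ffn_block(x, gain, w_gate, w_up, w_down):
    # Pre-normalization: normalize the sublayer input, then add the residual.
    return x + swiglu_ffn(rms_norm(x, gain), w_gate, w_up, w_down)

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16  # illustrative sizes, not Toto's
x = rng.normal(size=(4, d_model))
gain = np.ones(d_model)
w_gate = rng.normal(size=(d_model, d_hidden)) * 0.1
w_up = rng.normal(size=(d_model, d_hidden)) * 0.1
w_down = rng.normal(size=(d_hidden, d_model)) * 0.1
y = pre_norm_ffn_block(x, gain, w_gate, w_up, w_down)
```

The attention sublayer of each block would follow the same normalize-then-residual pattern.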

3.2 Input embedding

Time series transformers in the literature have used various approaches for creating input embeddings. We use non-overlapping patch projections (Fig. 3), first introduced for Vision Transformers [30, 31] and popularized in the time series context by PatchTST [14]. Toto was trained using a fixed patch size of 32.

Figure 3: The patch embedding takes as input a multivariate time series of $M$ variates by $N$ time steps. It divides each variate along the time dimension into patches of size $P$ and projects these linearly into an embedding space of latent dimension $D$. This results in an output of size $M\times\frac{N}{P}\times D$ which is fed to the transformer decoder.
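The shape bookkeeping of the patch embedding can be sketched in a few lines of NumPy. The patch size of 32 comes from the text; the values of $M$, $N$, and $D$, the random weights, and the function name `patch_embed` are placeholders chosen for this example.

```python
import numpy as np

def patch_embed(series, weight, bias, patch_size):
    # series: (M, N) array of M variates by N time steps.
    m, n = series.shape
    if n % patch_size != 0:
        raise ValueError("N must be a multiple of the patch size")
    # Split the time axis into non-overlapping patches: (M, N/P, P).
    patches = series.reshape(m, n // patch_size, patch_size)
    # Linear projection of each patch into the latent space: (M, N/P, D).
    return patches @ weight + bias

rng = np.random.default_rng(0)
M, N, P, D = 3, 128, 32, 16  # P = 32 is from the text; M, N, D are illustrative
series = rng.normal(size=(M, N))
weight = rng.normal(size=(P, D))
bias = np.zeros(D)
embedded = patch_embed(series, weight, bias, P)
```

Each variate is embedded independently here, so any mixing across variates is left to the space-wise attention blocks.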

3.3 Attention mechanism

Observability metrics are often high-cardinality, multivariate time series. Therefore, an ideal model will natively handle multivariate forecasting. It should be able to analyze relationships both in the time dimension (what we refer to as “time-wise” interactions) and in the channel dimension (what we refer to as “space-wise” interactions, following the convention in the Datadog platform of describing different groups or tag sets of a metric as the “space” dimension).

In order to model both space and time-wise interactions, we need to adapt the traditional multi-head attention architecture [11] from one to two dimensions. Several approaches have been proposed in the literature to do this, including:

•

Assuming channel independence, and computing attention only in the time dimension [14]. This is efficient, but throws away all information about space-wise interactions.
•

Computing attention only in the space dimension, and using a feed-forward network in the time dimension [18, 17].
•

Concatenating variates along the time dimension and computing full cross-attention between every space/time location [15]. This can capture every possible space and time interaction, but it is computationally costly.
•

Computing “factorized attention,” where each transformer block contains a separate space and time attention computation [16, 32, 33]. This allows both space and time mixing, and is more efficient than full cross-attention. However, it doubles the effective depth of the network.
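A back-of-the-envelope cost comparison makes the trade-off between the last two options concrete. Writing $T = N/P$ for the number of patches per variate, and counting only the attention-score computation (constants, projections, and feed-forward layers are ignored, which is a simplification made for this sketch):

```latex
\begin{aligned}
\text{full attention over all } M T \text{ tokens:} \quad
  & O\big((M T)^2 D\big) \\
\text{factorized (one time-wise plus one space-wise pass):} \quad
  & O\big(M T^2 D + T M^2 D\big) = O\big(M T (M + T) D\big)
\end{aligned}
```

Under these assumptions the ratio of the two costs is $MT/(M+T)$. For instance, with $M = 100$ variates and $T = 16$ patches, full attention computes roughly 14 times as many attention scores as one factorized pair of passes.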

In order to design our attention mechanism, we follow the intuition that for many time series, the time relationships are more important or predictive than the space relationships. As evidence, we observe that even models that completely ignore space-wise relationships (such as PatchTST [14] and TimesFM [19]) can still achieve competitive performance on multivariate datasets. However, other studies (e.g. Moirai [15]) have shown through ablations that there is some clear benefit to including space-wise relationships.

We therefore propose a novel variant of factorized attention, which we call “Proportional Factorized Space-Time Attention.” We use a mixture of alternating space-wise and time-wise attention blocks. As a configurable hyperparameter, we can change the ratio of time-wise to space-wise blocks, thus allowing us to devote more or less compute budget to each type of attention. For our base model, we selected a configuration with one space-wise attention block for every two time-wise blocks.
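The block layout this implies can be written down directly. The sketch below follows the Figure 1 description of a segment (one space-wise block followed by a configurable number of time-wise blocks); the function name and the choice of four segments are hypothetical, since the number of segments is not stated in this section.

```python
def block_schedule(num_segments, time_blocks_per_space_block):
    # Each segment is one space-wise block followed by several time-wise blocks.
    segment = ["space"] + ["time"] * time_blocks_per_space_block
    return segment * num_segments

# Base-model ratio from the text: one space-wise block per two time-wise blocks.
layout = block_schedule(num_segments=4, time_blocks_per_space_block=2)
```

Raising `time_blocks_per_space_block` shifts more of the compute budget toward time-wise attention without changing the total depth formula, which is `num_segments * (1 + time_blocks_per_space_block)`.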

In the time-wise attention blocks, we use causal masking and rotary positional embeddings [34] with XPOS [35] in order to autoregressively model time-dependent features. In the space-wise blocks, by contrast, we use full bidirectional attention in order to preserve permutation invariance of the covariates, with a block-diagonal ID mask to ensure that only related variates attend to each other. This masking allows us to pack multiple independent multivariate time series into the same batch, in order to improve training efficiency and reduce the amount of padding.
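The two kinds of mask can be illustrated with boolean matrices, where `True` means "query may attend to key". This is a minimal sketch: the function names are made up for the example, and it assumes that variates of the same series sit next to each other in the packed batch, which is what makes the ID mask block-diagonal.

```python
import numpy as np

def causal_mask(num_patches):
    # Time-wise blocks: position i may attend to positions j <= i only.
    return np.tril(np.ones((num_patches, num_patches), dtype=bool))

def space_id_mask(series_ids):
    # Space-wise blocks: variates attend to each other in both directions,
    # but only when they belong to the same multivariate series.
    ids = np.asarray(series_ids)
    return ids[:, None] == ids[None, :]

time_mask = causal_mask(4)
# Two independent series packed together: three variates, then two variates.
space_mask = space_id_mask([0, 0, 0, 1, 1])
```

Because the space-wise mask is symmetric and carries no positional information, reordering the variates within a series only permutes the rows and columns of the mask.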

3.4 Probabilistic prediction head

In order to be useful for forecasting applications, a model should produce probabilistic predictions. A common practice in time series models is to use an output layer where the model regresses the parameters of a probability distribution. This allows for prediction intervals to be computed using Monte Carlo sampling [7].

Common choices for an output layer are Normal [7] and Student-T [36, 23], which can improve robustness to outliers. Moirai [15] allows for more flexible residual distributions by proposing a novel mixture model incorporating a weighted combination of Gaussian, Student-T, Log-Normal, and Negative-Binomial outputs.

However, real-world time series can often have complex distributions that are challenging to fit, with outliers, heavy tails, extreme skew, and multimodality. In order to accommodate these scenarios, we introduce an even more flexible output likelihood. To do this we employ a method based on Gaussian mixture models (GMMs), which can approximate any density function [37]. To avoid training instability in the presence of outliers, we use a Student-T mixture model (SMM), a robust generalization of GMMs [38] that has previously shown promise for modeling heavy-tailed financial time series [39, 40]. The model predicts $k$ Student-T distributions (where $k$ is a hyperparameter) for each time step, as well as a learned weighting.
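For reference, a Student-T mixture density of this kind can be written as follows. The location-scale parameterization and the symbols $w_{t,j}$, $\mu_{t,j}$, $\sigma_{t,j}$, and $\nu_{t,j}$ are notation assumed for this sketch; the report does not give an explicit formula.

```latex
p(x_t) = \sum_{j=1}^{k} w_{t,j}\,
  \mathrm{St}\!\left(x_t \mid \mu_{t,j}, \sigma_{t,j}, \nu_{t,j}\right),
\qquad w_{t,j} \ge 0, \quad \sum_{j=1}^{k} w_{t,j} = 1,

\mathrm{St}(x \mid \mu, \sigma, \nu) =
  \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}
       {\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}\,\sigma}
  \left(1 + \frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)^{-\frac{\nu+1}{2}}
```

As $\nu \to \infty$ each component approaches a Gaussian with mean $\mu$ and standard deviation $\sigma$, which is the sense in which the SMM generalizes a GMM.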

When we perform inference, we draw samples from the mixture distribution at each timestamp, then feed each sample back into the decoder for the next prediction. This allows us to produce prediction intervals at any quantile, limited only by the number of samples; for more precise tails, we can choose to spend more computation on sampling (Fig. 2).
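The sampling step for a single time step might look like the sketch below. The mixture parameters are made-up numbers, the function name is hypothetical, and the autoregressive loop that feeds each sample back into the decoder is omitted.

```python
import numpy as np

def sample_smm(weights, locs, scales, dofs, num_samples, rng):
    # Pick a component per sample according to the mixture weights,
    # then draw from that component's location-scale Student-T.
    comp = rng.choice(len(weights), size=num_samples, p=weights)
    return locs[comp] + scales[comp] * rng.standard_t(dofs[comp])

rng = np.random.default_rng(0)
weights = np.array([0.7, 0.3])   # illustrative mixture weights
locs = np.array([0.0, 5.0])      # component locations
scales = np.array([1.0, 0.5])    # component scales
dofs = np.array([3.0, 10.0])     # degrees of freedom
samples = sample_smm(weights, locs, scales, dofs, num_samples=2000, rng=rng)

# Empirical 90% prediction interval and median from the samples.
lower, median, upper = np.quantile(samples, [0.05, 0.5, 0.95])
```

Quantiles far in the tails are estimated from very few samples, which is why more precise tails call for a larger sample count.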

3.5 Input/output scaling

As in other time series models, we perform instance normalization on input data before passing it through the patch embedding, in order to make the model generalize better to inputs of different scales [41]. We scale the inputs to have zero mean and unit standard deviation. The output predictions are then rescaled back to the original units.
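A minimal sketch of this scaling is shown below. Computing the statistics per variate over the context window, and adding a small epsilon to guard against constant series, are assumptions of the example rather than details given in the report.

```python
import numpy as np

def instance_normalize(context, eps=1e-8):
    # Per-variate statistics over the time axis of this one input instance.
    mean = context.mean(axis=-1, keepdims=True)
    scale = context.std(axis=-1, keepdims=True) + eps
    return (context - mean) / scale, mean, scale

def denormalize(prediction, mean, scale):
    # Map model outputs back to the original units.
    return prediction * scale + mean

context = np.array([[100.0, 102.0, 98.0, 100.0],
                    [0.1, 0.3, 0.2, 0.2]])
normed, mean, scale = instance_normalize(context)
restored = denormalize(normed, mean, scale)
```

The two variates above differ in magnitude by three orders, yet both reach the patch embedding with zero mean and unit standard deviation.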

3.6 Training objective

As a decoder-only model, Toto is pre-trained on the next-patch prediction task. We minimize the negative log-likelihood of the next predicted patch with respect to the distribution output of the model. We train the model using the AdamW optimizer [42].
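The negative log-likelihood of a Student-T mixture can be sketched as follows. For brevity the example shares one set of mixture parameters across all target points, whereas the model would predict parameters per time step; the function names and the location-scale parameterization are assumptions of the sketch.

```python
import math
import numpy as np

def student_t_logpdf(x, loc, scale, dof):
    # Log-density of a location-scale Student-T distribution.
    z = (x - loc) / scale
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi) - math.log(scale))
    return log_norm - (dof + 1) / 2 * np.log1p(z * z / dof)

def smm_nll(x, weights, locs, scales, dofs):
    # Per-component log(weight) + log-density, stacked as (k, num_points).
    comp = np.stack([
        math.log(w) + student_t_logpdf(x, m, s, v)
        for w, m, s, v in zip(weights, locs, scales, dofs)
    ])
    # Log-sum-exp over components for numerical stability.
    peak = comp.max(axis=0)
    log_mix = peak + np.log(np.exp(comp - peak).sum(axis=0))
    return -log_mix.mean()

targets = np.array([0.1, -0.4, 5.2, 0.0])
loss = smm_nll(targets, [0.7, 0.3], [0.0, 5.0], [1.0, 0.5], [3.0, 10.0])
```

In training, a loss of this form would be averaged over every point of the next patch and minimized with the optimizer.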

3.7 Hyperparameters

The hyperparameters used for Toto are detailed in Table A.1. The resulting model has 103 million parameters in total.

4 Training data

We pretrained Toto with a dataset of approximately one trillion time series points. Of these, roughly three-quarters are anonymous observability metrics from the Datadog platform. The remaining points come from the LOTSA dataset [15], a compilation of publicly available time series datasets across many different domains.

4.1 Datadog dataset

The Datadog platform ingests more than a hundred trillion events per day. However, much of this data is sparse, noisy, or too granular or high in cardinality to be useful in its raw form. To curate a high-quality dataset for efficient model training, we sample queries based on quality and relevance signals from dashboards, monitor alerts, and notebooks. This provides a strong signal that the data resulting from these queries is of critical importance and sufficient quality for observability of real-world applications.

		Zero Shot					Full Shot
Dataset	Metric	Toto	Moirai_Small	Moirai_Base	Moirai_Large	TimesFM^*	iTransformer	TimesNet	PatchTST	Crossformer	TiDE	DLinear	SCINet	FEDformer
ETTh1	MAE	0.389	0.424	0.438	0.469	0.426	0.448	0.450	0.455	0.522	0.507	0.452	0.647	0.460
	MSE	0.363	0.400	0.434	0.510	-	0.454	0.458	0.469	0.529	0.541	0.456	0.747	0.440
ETTh2	MAE	0.261	0.379	0.382	0.376	0.410	0.407	0.497	0.407	0.684	0.550	0.515	0.723	0.449
	MSE	0.170	0.341	0.345	0.354	-	0.383	0.414	0.387	0.942	0.611	0.559	0.954	0.437
ETTm1	MAE	0.375	0.409	0.388	0.389	0.388	0.410	0.406	0.400	0.495	0.419	0.407	0.481	0.452
	MSE	0.372	0.448	0.381	0.390	-	0.407	0.400	0.387	0.513	0.419	0.403	0.486	0.448
ETTm2	MAE	0.319	0.341	0.321	0.320	0.334	0.332	0.333	0.326	0.611	0.404	0.401	0.537	0.349
	MSE	0.272	0.300	0.272	0.276	-	0.288	0.291	0.281	0.757	0.358	0.350	0.571	0.305
Electricity	MAE	0.246	0.320	0.274	0.273	-	0.270	0.295	0.304	0.334	0.344	0.300	0.365	0.327
	MSE	0.157	0.233	0.188	0.188	-	0.178	0.193	0.216	0.244	0.252	0.212	0.268	0.214
Weather	MAE	0.284	0.267	0.261	0.275	-	0.278	0.287	0.281	0.315	0.320	0.317	0.363	0.360
	MSE	0.256	0.242	0.238	0.259	-	0.258	0.259	0.259	0.259	0.271	0.265	0.292	0.309
Mean	MAE	0.312	0.357	0.341	0.350	-	0.357	0.378	0.362	0.493	0.424	0.399	0.519	0.400
	MSE	0.265	0.328	0.315	0.330	-	0.328	0.336	0.333	0.541	0.409	0.374	0.533	0.359

Table 1: Comparison of different models with Toto on the LSF benchmark datasets. Results are averaged across prediction lengths of 96, 192, 336, and 720 steps. For Toto, we use a stride of 512 steps and a historical context window of 512 steps. For other models, we use the results reported in [15] and [19]. Metrics for each prediction length are available in Table A.2. ^*TimesFM only reports values for MAE on ETTh1, ETTh2, ETTm1, and ETTm2. Key: Best results, Second-best results.

Datadog metrics are accessed using a specialized query language supporting filters, group-bys, time aggregation, and various transformations and postprocessing functions [43]. We consider groups returned from the same query to be related variates in a multivariate time series (Fig. 4). After we retrieve the query results, we discard the query strings and group identifiers, keeping only the raw numeric data.

Handling this vast amount of data requires several preprocessing steps to ensure consistency and quality. Initially, we apply padding and masking techniques to align the series lengths, making them divisible by the patch stride. This involves adding necessary left-padding to both the time series data and the ID mask, ensuring compatibility with the model's requirements.
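One way to implement the left-padding step is sketched below. The array layout (an ID mask with the same shape as the series), the zero pad value, the edge-replicated IDs, and the extra validity mask are all assumptions made for illustration.

```python
import numpy as np

def left_pad_to_multiple(series, id_mask, patch_size, pad_value=0.0):
    # series: (M, N) values; id_mask: (M, N) integer series IDs per position.
    n = series.shape[-1]
    pad = (-n) % patch_size          # steps needed to reach the next multiple
    widths = ((0, 0), (pad, 0))      # pad only on the left of the time axis
    padded = np.pad(series, widths, constant_values=pad_value)
    padded_ids = np.pad(id_mask, widths, mode="edge")
    # Boolean mask marking real (non-padded) time steps.
    valid = np.pad(np.ones_like(series, dtype=bool), widths,
                   constant_values=False)
    return padded, padded_ids, valid

series = np.arange(1.0, 71.0).reshape(1, 70)   # 70 steps, not a multiple of 32
ids = np.zeros((1, 70), dtype=int)
padded, padded_ids, valid = left_pad_to_multiple(series, ids, patch_size=32)
```

Padding on the left keeps the most recent observations aligned with the end of the last patch, which is the side the forecast continues from.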

Various data augmentations are employed to enhance the dataset's robustness. We introduce random time offsets to prevent memorization caused by having series always align the same way with the patch grid. After concatenating the Datadog and LOTSA datasets for training, we also implement a variate shuffling strategy to maintain diversity and representation. Specifically, 10% of the time, we combine variates that are not necessarily related, thus creating new, diverse combinations of data points. To sample the indices, we employ a normal distribution with a standard deviation of 1000, favoring data points that were closer together in the original datasets. This Gaussian sampling ensures that, while there is a preference for adjacent data points, significant randomness is introduced to enhance the diversity of the training data. This approach improves the model's ability to generalize across different types of data effectively.
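The Gaussian index sampling could be sketched as follows. Only the 10% probability and the standard deviation of 1000 come from the text; the function names, the notion of an "anchor" index, and the wrap-around at the dataset boundary are hypothetical choices for this example.

```python
import numpy as np

def sample_partner_indices(anchor, num_variates, dataset_size, rng, std=1000.0):
    # Draw integer offsets from N(0, std^2) around the anchor index, so that
    # nearby entries of the original dataset are favored. Indices wrap around.
    offsets = np.rint(rng.normal(0.0, std, size=num_variates)).astype(int)
    return (anchor + offsets) % dataset_size

def maybe_shuffle(anchor, num_variates, dataset_size, rng, shuffle_prob=0.1):
    # With probability shuffle_prob, combine variates from sampled indices;
    # otherwise keep all variates from the anchor series.
    if rng.random() < shuffle_prob:
        return sample_partner_indices(anchor, num_variates, dataset_size, rng)
    return np.full(num_variates, anchor)

rng = np.random.default_rng(0)
indices = sample_partner_indices(anchor=50_000, num_variates=8,
                                 dataset_size=100_000, rng=rng)
```

With a standard deviation of 1000, about two thirds of the sampled indices fall within 1000 positions of the anchor, while the remainder reach further afield.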

By implementing these rigorous preprocessing steps and sophisticated data handling mechanisms, we ensure that the training data for Toto is of the highest quality, ultimately contributing to the model's superior performance and robustness.

4.2 Synthetic data

We use a synthetic data generation process similar to TimesFM [19] to supplement our training datasets, improving the diversity of the data and helping to teach the model basic structure. We simulate time series data through the composition of components such as piecewise linear trends, ARMA processes, sinusoidal seasonal patterns, and various residual distributions. We randomly combine five of these processes per variate, introducing patterns not always present in our real-world datasets. The generation process involves creating base series with random transformations, clipping extreme values, and rescaling to a specified range. By making synthetic data approximately 5% of our training dataset, we ensure a wide range of time-series behaviors are captured. This diversity exposes our models to various scenarios during training, improving their ability to generalize and effectively handle real-world data.
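A toy version of such a generator is sketched below. The three component types are taken from the list above, and five components per variate matches the text; the specific parameter ranges, the clipping threshold, the target range of [0, 1], and all function names are invented for the example.

```python
import numpy as np

def piecewise_linear_trend(n, rng, num_knots=3):
    # Random slopes between randomly placed knots, integrated into a trend.
    knots = np.sort(rng.choice(np.arange(1, n), size=num_knots, replace=False))
    slopes = rng.normal(0.0, 0.01, size=num_knots + 1)
    lengths = np.diff(np.concatenate(([0], knots, [n])))
    return np.cumsum(np.repeat(slopes, lengths))

def sinusoid(n, rng):
    # Seasonal pattern with a random period and phase.
    period = rng.integers(8, 200)
    phase = rng.uniform(0, 2 * np.pi)
    return np.sin(2 * np.pi * np.arange(n) / period + phase)

def arma_1_1(n, rng, phi=0.7, theta=0.3):
    # ARMA(1,1): x_t = phi * x_{t-1} + e_t + theta * e_{t-1}.
    eps = rng.normal(size=n + 1)
    out = np.zeros(n)
    prev = 0.0
    for t in range(n):
        prev = phi * prev + eps[t + 1] + theta * eps[t]
        out[t] = prev
    return out

def synthetic_series(n, rng, num_components=5, clip=5.0):
    generators = [piecewise_linear_trend, sinusoid, arma_1_1]
    picks = rng.choice(len(generators), size=num_components)
    series = sum(generators[i](n, rng) for i in picks)
    # Standardize, clip extreme values, and rescale to [0, 1].
    series = (series - series.mean()) / (series.std() + 1e-8)
    series = np.clip(series, -clip, clip)
    return (series + clip) / (2 * clip)

example = synthetic_series(256, np.random.default_rng(0))
```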

5 Results

We report experimental results for a pre-trained Toto model in Section 5.1 and Section 5.2.

To evaluate predictions, we sequentially divide a time series into context and forecast segments. We input the context segment into Toto and autoregressively generate output patches by sampling from the Student-T mixture model distribution. We forecast a number of steps equal to the nearest multiple of the patch size at or above the desired length, then truncate the predictions to that length. In order to keep inference time consistent, we vary the number of samples generated based on the cardinality and length of the dataset, with a minimum of 100 samples. We take the median sample at each time step as the final point prediction. This prediction is then compared against the ground-truth forecast segment for evaluation.
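The length arithmetic and the median reduction can be sketched as below. The sketch assumes the forecast length is rounded up to a whole number of patches (which the truncation step implies) and uses hypothetical function names.

```python
import math
import numpy as np

def steps_to_generate(horizon, patch_size):
    # Smallest multiple of the patch size that covers the requested horizon.
    return math.ceil(horizon / patch_size) * patch_size

def point_forecast(samples, horizon):
    # samples: (num_samples, generated_steps).
    # Median across samples at each time step, then truncate to the horizon.
    return np.median(samples, axis=0)[:horizon]

generated = steps_to_generate(horizon=336, patch_size=32)  # 11 patches = 352 steps
forecast = point_forecast(np.arange(12.0).reshape(3, 4), horizon=3)
```

With a patch size of 32, a 336-step horizon therefore requires 11 autoregressive steps, and the last 16 generated values are discarded.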

5.1 LSF benchmarks

To assess general-purpose time series forecasting performance, we use the Long Sequence Forecasting (LSF) benchmark datasets (ETTh1, ETTh2, ETTm1, ETTm2, Electricity, and Weather) [12]. We evaluate with forecast lengths of 96, 192, 336, and 720 time steps, in sliding windows with stride 512, and average the results. For Toto, we used a historical context window of 512 steps and took the median of 200 samples. Following standard practice, we report Mean Absolute Error (MAE) and Mean Squared Error (MSE) on normalized data, with the normalization fitted on a training split, in order to be able to compare performance across different datasets. We compared Toto's performance with the reported results of other recent zero-shot foundation models [15, 19], as well as full-shot time series forecasting models [17, 44, 14, 16, 36, 45, 46, 47]. We display these results in Table 1.
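The normalized metrics can be sketched as follows, assuming the usual LSF convention of standardizing each variate with the mean and standard deviation of the training split. The function name and array layout (time steps by variates) are choices made for this example.

```python
import numpy as np

def normalized_mae_mse(train, target, forecast):
    # Fit per-variate scaling statistics on the training split only.
    mean, std = train.mean(axis=0), train.std(axis=0)
    t = (target - mean) / std
    f = (forecast - mean) / std
    err = t - f
    return np.abs(err).mean(), (err ** 2).mean()

train = np.array([[0.0], [2.0]])      # mean 1, standard deviation 1
target = np.array([[3.0], [5.0]])
forecast = np.array([[2.0], [7.0]])
mae, mse = normalized_mae_mse(train, target, forecast)
```

Because every dataset is brought to a comparable scale before the errors are computed, the per-dataset scores can be averaged into a single figure.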

Toto demonstrates exceptional performance across a variety of benchmark datasets, excelling in zero-shot scenarios. On five of the six LSF datasets, Toto matches or outperforms every other model in Table 1 in terms of both MAE and MSE; the exception is Weather, where Moirai_Small and Moirai_Base report lower MAE and MSE. For example, on the ETTh1 dataset, Toto achieves an MAE of 0.389 and an MSE of 0.363, outperforming all zero-shot models, including the previously reported Moirai series and TimesFM. Macro-averaging across the six LSF datasets, Toto achieves an MAE of 0.312 and MSE of 0.265, again exceeding Moirai's reported zero-shot performance as well as the reported performance of the full-shot models.
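The macro-averages can be reproduced from the per-dataset Toto values in Table 1 with a few lines of Python; the dictionary names are arbitrary.

```python
# Toto's per-dataset results, copied from Table 1.
toto_mae = {"ETTh1": 0.389, "ETTh2": 0.261, "ETTm1": 0.375,
            "ETTm2": 0.319, "Electricity": 0.246, "Weather": 0.284}
toto_mse = {"ETTh1": 0.363, "ETTh2": 0.170, "ETTm1": 0.372,
            "ETTm2": 0.272, "Electricity": 0.157, "Weather": 0.256}

# Macro-average: unweighted mean over datasets.
mean_mae = sum(toto_mae.values()) / len(toto_mae)
mean_mse = sum(toto_mse.values()) / len(toto_mse)
```

Each dataset contributes equally to these averages regardless of its size or number of variates.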

Several architectural choices and data features likely contribute to Toto's superior performance. The novel Proportional Factorized Space-Time Attention mechanism allows Toto to efficiently capture both temporal and spatial dependencies within multivariate time series data. Additionally, the extensive training on a diverse dataset of one trillion time series points, including a mix of real-world observability metrics and multi-domain time series data, enhances Toto's ability to handle varied characteristics of different benchmark datasets.

While Toto generally excels, there are areas where its performance is closely matched by other models. In full-shot scenarios, models like PatchTST, Crossformer, and FEDformer show competitive results. For example, on the Electricity dataset, while Toto achieves a leading zero-shot MAE of 0.246 and MSE of 0.157, iTransformer and TimesNet also show strong performance, indicating that these models can catch up when additional training data is available.

Overall, Toto's architectural innovations and extensive training data enable it to achieve state-of-the-art performance across diverse benchmarks, excelling in zero-shot scenarios while remaining highly competitive in full-shot contexts.

Metric	Toto	Chronos-T5_Tiny	Chronos-T5_Mini	Chronos-T5_Small	Chronos-T5_Base	Chronos-T5_Large	Moirai_Small	Moirai_Base	Moirai_Large	TimesFM
sMAPE	0.672	0.809	0.788	0.800	0.796	0.805	0.808	0.742	0.736	1.246
sMdAPE	0.318	0.406	0.391	0.401	0.393	0.396	0.418	0.370	0.365	0.639

Table 2: Performance of Toto and other zero-shot models on the Datadog benchmark dataset. Key: Best results, Second-best results.

5.2 Datadog benchmark

We created a benchmark using anonymous Datadog data to assess performance across various observability metrics. To ensure a representative and realistic sample, we sampled data based on quality and relevance signals from dashboards, monitor alerts, and notebooks. This benchmark comprises 983,994 data points from 82 distinct multivariate time series, encompassing 1,122 variates.

We analyzed summary statistics of the series in our benchmark to identify characteristics that make observability time series challenging to forecast. The categories and their definitions are as follows:

•

Sparse: Series with a low density of observations, indicating infrequent recording of data or rare events.
•

Extreme right skew: Series with a distribution heavily skewed to the right, characterized by a few very high values and many lower values.
•

Seasonal: Series exhibiting regular and recurring patterns, often linked to daily, weekly, or yearly cycles.
•

Flat: Series with minimal variability, showing little to no change over time.

The relative proportions of these cases are displayed in Table 3.

To assess the prediction of other zero-shot models on the DD Benchmark, we follow sampling procedures delineated in their respective manuscripts. In short, for Chronos models, we generate 20 samples and take the median prediction. For Moirai models, we take the median of 100 samples and set the patch size to “auto”. TimesFM only produces point predictions of the mean, so we use those directly. Since TimesFM and Chronos only support univariate forecasting, we process each variate independently. Moirai, on the other hand, like Toto, makes joint predictions for each group of related variates. For Toto, we utilize the same evaluation procedure we used on the LSF benchmarks.

The evaluation results (Table 2) demonstrate that Toto outperforms the other models. We evaluate using a prediction length of 365, the maximum forecast window available for previous time series models within the Datadog platform. We use a historical context window of 512 steps. Because observability data can have extreme variation in both magnitude and dispersion, we select symmetric mean absolute percentage error (sMAPE) as a scale-invariant performance metric [48]. We also report symmetric median absolute percentage error (sMdAPE), a robust version of sMAPE [49] that minimizes the influence of the extreme outliers present in observability data. With the lowest sMAPE of 0.672 and sMdAPE of 0.318, Toto proves to be the most accurate for forecasting observability time series data.
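Both metrics can be sketched as below. Several definitions of sMAPE are in use; this example assumes the variant bounded between 0 and 2, and it treats points where both the actual and forecast values are zero as having zero error. Neither choice is confirmed by the report.

```python
import numpy as np

def symmetric_ape(actual, forecast, eps=1e-12):
    # Per-point symmetric absolute percentage error, in the range [0, 2].
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    return np.abs(actual - forecast) / np.maximum(denom, eps)

def smape(actual, forecast):
    return symmetric_ape(actual, forecast).mean()

def smdape(actual, forecast):
    # The median is far less sensitive to a handful of extreme errors.
    return np.median(symmetric_ape(actual, forecast))

actual = np.array([1.0, 2.0, 0.0])
forecast = np.array([1.0, 1.0, 0.0])
mean_err = smape(actual, forecast)
median_err = smdape(actual, forecast)
```

In this toy example a single bad point moves sMAPE to about 0.22 while sMdAPE stays at zero, which illustrates why the report includes both.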

These results suggest that current open datasets may not provide sufficient information to extrapolate to the specific nuances of observability data, highlighting the importance of training on more relevant data as demonstrated by Toto's superior performance.

Case	%
Sparse	12.20
Extreme Right Skew	17.07
Seasonal	80.49
Flat	1.22

Table 3: Breakdown of Datadog dataset based on case, computed based on the average characteristics of variates in each multivariate series. Note that these do not add to 100% because time series may fall into multiple categories.
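The overlap noted in the caption can be checked with a few lines of Python. The percentages are those of Table 3; the boolean flag matrix is purely hypothetical toy data, included only to show how multi-label tagging produces shares that sum to more than 100%.

```python
import numpy as np

# Percentages as reported in Table 3.
table3 = {"Sparse": 12.20, "Extreme Right Skew": 17.07, "Seasonal": 80.49, "Flat": 1.22}
total = sum(table3.values())
print(round(total, 2))  # exceeds 100 because the categories overlap

# Hypothetical flags: rows are series, columns follow the order of table3.
flags = np.array([
    [0, 0, 1, 0],  # seasonal only
    [1, 0, 1, 0],  # sparse and seasonal
    [0, 1, 1, 0],  # right-skewed and seasonal
    [0, 0, 0, 1],  # flat only
], dtype=bool)
share = 100.0 * flags.mean(axis=0)  # per-category share of series
print(dict(zip(table3, share)), share.sum())
```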

6 Conclusions

Toto, through a novel architecture and pre-training corpus, demonstrates state-of-the-art performance both on public benchmarks and on the Datadog observability benchmark. We look forward to sharing many more technical details, experiments, and benchmark results in a forthcoming paper.

7 Impact statement

In developing Toto, Datadog follows a structured approach to ensure responsible development, focusing on identifying, assessing, and mitigating potential risks associated with the use of our model. Given that Toto is not intended for mass distribution and specifically generates time series forecasts for observability data, the potential harms are considerably lower compared to more general-purpose models. At Datadog, our primary focus is on ensuring the accuracy, reliability, and security of the forecasts generated by Toto, which are crucial for maintaining and optimizing infrastructure and application performance.

We carefully analyze the potential for Toto to produce incorrect or misleading forecasts that could impact decision-making processes in critical systems. Additionally, we consider the implications of Toto's performance across diverse datasets, ensuring it can generalize well without introducing significant errors.

8 Future directions

Many exciting areas of exploration remain for further study. If you are interested in working with us, please reach out to the authors by email.

Some future research questions that particularly intrigue us include:

•

Multi-modal inputs: Incorporate additional input modalities such as query metadata and captions to enhance forecast performance.
•

Autonomous troubleshooting agents: Augment Datadog's AI agents [50] for troubleshooting and incident response by integrating modality-specific foundation models like Toto to improve their reasoning and planning capabilities with telemetry data.
•

Conversational interfaces: Align time series forecasting models with LLMs to develop conversational agents capable of interpreting and reasoning about time series data.
•

Model enhancements and scaling: Explore ways to improve and scale model performance through optimizations such as new types of input embeddings and attention mechanisms, and by examining alternative variate groupings to capture richer interactions.

9 Contributions

Contributors are listed in alphabetical order.

Othmane Abou-Amal, Joseph Banks, Mayeul Blanzat, Ben Cohen, Youssef Doubli, Ben Hinthorne, Emaad Khwaja, Jared Ledvina, Charles Masson, Sajid Mehmood, Elise Ramé, Maxime Visonneau, Kan Wang.

10 Acknowledgements

Our work is made possible by the efforts of numerous teams at Datadog. Special thanks and acknowledgement to:

Johan Andersen, Roashan Ayene, Romoli Bakshi, Kevin Beach, Bill Birkholz, Rob Boll, Maxim Brown, Benedetto Buratti, Marion Chan-Renous, Jessica Cordonnier, Ben Donohue, Zakaria Fikrat, Quentin François, Erica Hale, Michael Hoang, Joe Jones, Max Livingston, Jesse Mack, Amine Naouas, Sean O'Connor, Brendan Rhoads, Phil Sarin, Vyom Shah, Aaron Taa, Bharath Vontimitta, Dominique West, Steven Zhou.

References

Datadog [2024a] Datadog. Observability platform, 2024a. URL https://www.datadoghq.com/monitoring/observability-platform/.
Datadog [2024b] Datadog. Modern infrastructure monitoring, 2024b. URL https://www.datadoghq.com/product/infrastructure-monitoring/.
Hyndman and Athanasopoulos [2021] Rob J Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, 3rd edition, 2021. URL https://otexts.com/fpp3/.
Fildes et al. [1998] Robert Fildes, Michèle Hibon, Spyros Makridakis, and Nigel Meade. Generalising about univariate forecasting methods: further empirical evidence. International Journal of Forecasting, 14:339–358, 9 1998. ISSN 01692070. doi: 10.1016/S0169-2070(98)00009-0.
Stevenson [2007] Simon Stevenson. A comparison of the forecasting ability of arima models. Journal of Property Investment & Finance, 25:223–240, 5 2007. ISSN 1463-578X. doi: 10.1108/14635780710746902.
Christodoulos et al. [2010] Charisios Christodoulos, Christos Michalakelis, and Dimitris Varoutas. Forecasting with limited data: Combining arima and diffusion models. Technological Forecasting and Social Change, 77:558–565, 5 2010. ISSN 00401625. doi: 10.1016/j.techfore.2010.01.009.
Salinas et al. [2020] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36:1181–1191, 2020. ISSN 0169-2070. doi: 10.1016/j.ijforecast.2019.07.001. URL https://www.sciencedirect.com/science/article/pii/S0169207019301888.
Brophy et al. [2023] Eoin Brophy, Zhengwei Wang, Qi She, and Tomás Ward. Generative adversarial networks in time series: A systematic literature review. ACM Computing Surveys, 55:1–31, 10 2023. ISSN 0360-0300. doi: 10.1145/3559540.
Jia et al. [2018] Zhihao Jia, Sina Lin, Charles R Qi, and Alex Aiken. Exploring the hidden dimension in accelerating convolutional neural networks, 2018. URL https://openreview.net/forum?id=SJCPLLpaW.
Xu et al. [2021] Weizheng Xu, Youtao Zhang, and Xulong Tang. Parallelizing dnn training on gpus: Challenges and opportunities. pages 174–178. ACM, 4 2021. ISBN 9781450383134. doi: 10.1145/3442442.3452055.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Wu et al. [2021] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. 2021. URL https://openreview.net/forum?id=J4gRj6d5Qm.
Zhou et al. [2020] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wan Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. 2020. URL https://api.semanticscholar.org/CorpusID:229156802.
Nie et al. [2023] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. 2023. URL https://openreview.net/forum?id=Jbdc0vTOcol.
Woo et al. [2024] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. 2024. URL https://openreview.net/forum?id=Yd8eHMY1wz.
Zhang and Yan [2023] Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=vSVLM2j9eie.
Liu et al. [2024] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. 2024. URL https://openreview.net/forum?id=JePfAI8fah.
Ilbert et al. [2024] Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, and Ievgen Redko. SAMformer: Unlocking the potential of transformers in time series forecasting with sharpness-aware minimization and channel-wise attention. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=8kLzL5QBh2.
Das et al. [2024] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=jn2iTJas6h.
Lin et al. [2021] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. CoRR, abs/2106.04554, 2021. URL https://arxiv.org/abs/2106.04554.
Ansari et al. [2024] Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Yuyang Wang. Chronos: Learning the language of time series, 2024. URL https://arxiv.org/abs/2403.07815.
Garza and Mergenthaler-Canseco [2023] Azul Garza and Max Mergenthaler-Canseco. Timegpt-1, 2023.
Rasul et al. [2023] Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Hassen, Anderson Schneider, Sahil Garg, Alexandre Drouin, Nicolas Chapados, Yuriy Nevmyvaka, and Irina Rish. Lag-llama: Towards foundation models for time series forecasting. In R0-FoMo:Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023. URL https://openreview.net/forum?id=jYluzCLFDM.
Gruver et al. [2023] Nate Gruver, Marc Anton Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=md68e8iZK1.
Radford and Narasimhan [2018] Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018. URL https://api.semanticscholar.org/CorpusID:49313245.
Radford et al. [2019] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID:160025533.
Xiong et al. [2020] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture, 2020. URL https://openreview.net/forum?id=B1x8anVFPr.
Zhang and Sennrich [2019] Biao Zhang and Rico Sennrich. Root Mean Square Layer Normalization. In Advances in Neural Information Processing Systems 32, Vancouver, Canada, 2019. URL https://openreview.net/references/pdf?id=S1qBAf6rr.
Shazeer [2020] Noam Shazeer. Glu variants improve transformer, 2020. URL https://arxiv.org/abs/2002.05202.
Cordonnier et al. [2020] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=HJlnC1rKPB.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
Rao et al. [2021] Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. Msa transformer. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8844–8856. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/rao21a.html.
Arnab et al. [2021] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6816–6826, 2021. doi: 10.1109/ICCV48922.2021.00676.
Su et al. [2021] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2021.
Sun et al. [2022] Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. In ACL 2023, December 2022. URL https://www.microsoft.com/en-us/research/publication/a-length-extrapolatable-transformer/.
Das et al. [2023] Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. Long-term forecasting with tiDE: Time-series dense encoder. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=pCbC3aQB5W.
Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.
Peel and McLachlan [2000] D. Peel and G.J. McLachlan. Robust mixture modelling using the t distribution. Statistics and Computing, 10(4):339–348, 2000.
Meitz et al. [2018] Mika Meitz, Daniel P. A. Preve, and Pentti Saikkonen. A mixture autoregressive model based on Student's t-distribution. Communications in Statistics - Theory and Methods, 52:499–515, 2018. URL https://api.semanticscholar.org/CorpusID:73615847.
Wong et al. [2009] C. S. Wong, W. S. Chan, and P. L. Kam. A Student t-mixture autoregressive model with applications to heavy-tailed financial data. Biometrika, 96(3):751–760, 2009. ISSN 00063444, 14643510. URL http://www.jstor.org/stable/27798861.
Kim et al. [2022] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=cGDAkQo1C0p.
Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
Datadog [2024c] Datadog. Querying, 2024c. URL https://docs.datadoghq.com/dashboards/querying/.
Wu et al. [2023] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis. In International Conference on Learning Representations, 2023.
Zeng et al. [2023] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):11121–11128, Jun. 2023. doi: 10.1609/aaai.v37i9.26317. URL https://ojs.aaai.org/index.php/AAAI/article/view/26317.
Liu et al. [2022] Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. SCINet: Time series modeling and forecasting with sample convolution and interaction. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=AyajSjTAzmg.
Zhou et al. [2022] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning (ICML 2022), 2022.
Armstrong [1985] J. Scott Armstrong. Long-range Forecasting: From Crystal Ball to Computer. John Wiley & Sons, New York, 1985. ISBN 9780471822608.
Hyndman and Koehler [2006] R. J Hyndman and A. B. Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, 22, 2006.
Datadog [2024d] Datadog. Bits ai: Reimagining the way you run operations with autonomous investigations, 2024d. URL https://www.datadoghq.com/blog/bits-ai-autonomous-investigations.

Appendix

A.1 Model architecture

Hyperparameter	Value
Embedding Dimension	512
MLP Dimension	2048
# Layers	24
# Heads	8
# Variates	32
($\beta_{1}$, $\beta_{2}$)	(0.9, 0.95)
Weight Decay	0.01
Spacewise Layer Cadence	3
Patch Size	32
# Student-T Mixture Model Components	16
Initial Learning Rate	0.001
Annealing Schedule	Cosine
Batch Size	192
Warmup Steps	5000
Total Train Steps	193000

Table A.1: Hyperparameters for Toto
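To make the schedule-related entries of Table A.1 concrete, the following sketch computes a learning rate per training step. The table gives only the initial learning rate, the cosine annealing schedule, the warmup length, and the total step count; the linear shape of the warmup, the decay floor of zero, and the reading of the initial learning rate as the post-warmup peak are assumptions of this example. The two derived quantities at the end follow from simple division, using the 512-step context window from the Datadog benchmark evaluation.

```python
import math

# Values taken from Table A.1.
PEAK_LR = 1e-3
WARMUP_STEPS = 5_000
TOTAL_STEPS = 193_000
EMBED_DIM, NUM_HEADS, PATCH_SIZE = 512, 8, 32

def learning_rate(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS, floor=0.0):
    """Linear warmup (assumed) followed by cosine annealing down to `floor` (assumed 0)."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# Derived quantities.
head_dim = EMBED_DIM // NUM_HEADS          # per-head width
patches_per_context = 512 // PATCH_SIZE    # patches in a 512-step context window

for step in (0, 2_500, 5_000, 99_000, 193_000):
    print(step, learning_rate(step))
```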

A.2 Results

			Zero Shot					Full Shot
Dataset	Prediction Length	Metric	Toto	Moirai_Small	Moirai_Base	Moirai_Large	TimesFM	iTransformer	TimesNet	PatchTST	Crossformer	TiDE	DLinear	SCINet	FEDformer
ETTh1	96	MAE	0.366	0.402	0.402	0.398	0.398	0.405	0.402	0.419	0.448	0.464	0.400	0.599	0.419
		MSE	0.307	0.375	0.384	0.380	-	0.386	0.384	0.414	0.423	0.479	0.386	0.654	0.376
	192	MAE	0.368	0.419	0.429	0.434	0.424	0.436	0.429	0.445	0.474	0.492	0.432	0.631	0.448
		MSE	0.329	0.399	0.425	0.440	-	0.441	0.436	0.460	0.471	0.525	0.437	0.719	0.420
	336	MAE	0.399	0.429	0.450	0.474	0.436	0.458	0.469	0.466	0.546	0.515	0.459	0.659	0.465
		MSE	0.396	0.412	0.456	0.514	-	0.487	0.491	0.501	0.570	0.565	0.481	0.778	0.459
	720	MAE	0.424	0.444	0.473	0.568	0.445	0.491	0.500	0.488	0.621	0.558	0.516	0.699	0.507
		MSE	0.419	0.413	0.470	0.705	-	0.503	0.521	0.500	0.653	0.594	0.519	0.836	0.506
ETTh2	96	MAE	0.197	0.334	0.327	0.325	0.356	0.349	0.374	0.348	0.584	0.440	0.387	0.621	0.397
		MSE	0.093	0.281	0.277	0.287	-	0.297	0.340	0.302	0.745	0.400	0.333	0.707	0.358
	192	MAE	0.231	0.373	0.374	0.367	0.400	0.400	0.414	0.400	0.656	0.509	0.476	0.689	0.439
		MSE	0.135	0.340	0.340	0.347	-	0.380	0.402	0.388	0.877	0.528	0.477	0.860	0.429
	336	MAE	0.260	0.393	0.401	0.393	0.428	0.432	0.541	0.433	0.731	0.571	0.541	0.744	0.487
		MSE	0.160	0.362	0.371	0.377	-	0.428	0.452	0.426	1.043	0.643	0.594	1.000	0.496
	720	MAE	0.355	0.416	0.426	0.421	0.457	0.445	0.657	0.446	0.763	0.679	0.657	0.838	0.474
		MSE	0.294	0.380	0.394	0.404	-	0.427	0.462	0.431	1.104	0.874	0.831	1.249	0.463
ETTm1	96	MAE	0.328	0.383	0.360	0.363	0.345	0.368	0.375	0.367	0.426	0.387	0.372	0.438	0.419
		MSE	0.306	0.404	0.335	0.353	-	0.334	0.338	0.329	0.404	0.364	0.345	0.418	0.379
	192	MAE	0.353	0.402	0.379	0.380	0.374	0.391	0.387	0.385	0.451	0.404	0.389	0.450	0.441
		MSE	0.328	0.435	0.366	0.376	-	0.377	0.374	0.367	0.450	0.398	0.380	0.439	0.426
	336	MAE	0.389	0.416	0.394	0.395	0.397	0.420	0.411	0.410	0.515	0.425	0.413	0.485	0.459
		MSE	0.390	0.462	0.391	0.399	-	0.426	0.410	0.399	0.532	0.428	0.413	0.490	0.445
	720	MAE	0.429	0.437	0.419	0.417	0.436	0.459	0.450	0.439	0.589	0.461	0.453	0.550	0.490
		MSE	0.463	0.490	0.434	0.432	-	0.491	0.478	0.454	0.666	0.487	0.474	0.595	0.543
ETTm2	96	MAE	0.270	0.282	0.269	0.260	0.263	0.264	0.267	0.259	0.366	0.305	0.292	0.377	0.287
		MSE	0.200	0.205	0.195	0.189	-	0.180	0.187	0.175	0.287	0.207	0.193	0.286	0.203
	192	MAE	0.315	0.318	0.303	0.300	0.309	0.309	0.309	0.302	0.492	0.364	0.362	0.445	0.328
		MSE	0.269	0.261	0.247	0.247	-	0.250	0.249	0.241	0.414	0.290	0.284	0.399	0.269
	336	MAE	0.319	0.355	0.333	0.334	0.349	0.348	0.351	0.343	0.542	0.422	0.427	0.591	0.366
		MSE	0.264	0.319	0.291	0.295	-	0.311	0.321	0.305	0.597	0.377	0.369	0.637	0.325
	720	MAE	0.374	0.410	0.377	0.386	0.415	0.407	0.403	0.400	1.042	0.524	0.522	0.735	0.415
		MSE	0.354	0.415	0.355	0.372	-	0.412	0.408	0.402	1.730	0.558	0.554	0.960	0.421
Electricity	96	MAE	0.212	0.299	0.248	0.242	-	0.240	0.272	0.285	0.314	0.329	0.282	0.345	0.308
		MSE	0.124	0.205	0.158	0.152	-	0.148	0.168	0.195	0.219	0.237	0.197	0.247	0.193
	192	MAE	0.232	0.310	0.263	0.259	-	0.253	0.289	0.289	0.322	0.330	0.285	0.355	0.315
		MSE	0.138	0.220	0.174	0.171	-	0.162	0.184	0.199	0.231	0.236	0.196	0.257	0.201
	336	MAE	0.249	0.323	0.278	0.278	-	0.269	0.300	0.305	0.337	0.344	0.301	0.369	0.329
		MSE	0.155	0.236	0.191	0.192	-	0.178	0.198	0.215	0.246	0.249	0.209	0.269	0.214
	720	MAE	0.291	0.347	0.307	0.313	-	0.317	0.320	0.337	0.363	0.373	0.333	0.390	0.355
		MSE	0.211	0.270	0.229	0.236	-	0.225	0.220	0.256	0.280	0.284	0.245	0.299	0.246
Weather	96	MAE	0.223	0.212	0.203	0.208	-	0.214	0.220	0.218	0.230	0.261	0.255	0.306	0.296
		MSE	0.180	0.173	0.167	0.177	-	0.174	0.172	0.177	0.158	0.202	0.196	0.221	0.217
	192	MAE	0.267	0.250	0.241	0.249	-	0.254	0.261	0.259	0.277	0.298	0.296	0.340	0.336
		MSE	0.235	0.216	0.209	0.219	-	0.221	0.219	0.225	0.206	0.242	0.237	0.261	0.276
	336	MAE	0.291	0.282	0.276	0.292	-	0.296	0.306	0.297	0.335	0.335	0.335	0.378	0.380
		MSE	0.252	0.260	0.256	0.277	-	0.278	0.280	0.278	0.272	0.287	0.283	0.309	0.339
	720	MAE	0.356	0.322	0.323	0.350	-	0.349	0.359	0.348	0.418	0.386	0.381	0.427	0.428
		MSE	0.356	0.320	0.321	0.365	-	0.358	0.365	0.354	0.398	0.351	0.345	0.377	0.403

Table A.2: Performance metrics for various models. Key: Best results, Second-best results.
