
Building a Production-Ready Time Series Anomaly Detection Application



Time series — collections of data points indexed and ordered by time — are extensively used in modeling a variety of problems, and detecting outliers or anomalies in time series is one of the most impactful applications of machine learning. Fraud detection in finance, asset health monitoring in heavy industries, and biological health monitoring in healthcare are just a few of the ways time series anomaly detection (TAD) can be applied.

This blog will cover a subset of commonly used TAD methods, including their benefits and limitations. We will also discuss not only the modeling challenges, but also the technical and infrastructure challenges that come with building and deploying TAD applications in a large enterprise. Finally, we will demonstrate how the C3 AI Platform addresses these challenges to enable the deployment of large-scale solutions within a remarkably short time.

Visualizing Time Series Anomalies

The very first step in TAD is data exploration and visualization. For time series data such as sensor measurements, stock market closing values, daily inventory levels, or user actions on a website, simple visualizations can help understand the baseline or non-anomalous activity before building a TAD application.

Due to the sequential nature of time series data, a time series anomaly must consider the time aspect of the information. Anomalies can be point anomalies, which are single instances of deviations, or pattern (also referred to as collective) anomalies, which are multiple, sequential deviations from the normal behavior (Figure 1).

Figure 1 Point vs Pattern Anomalies


One of the easiest ways to explore time series anomalies is to visualize the data. Time series plots, such as in Figure 1, are among the simplest and most useful visualizations and clearly show the characteristics of the present anomalies. However, for higher-dimensional data, visualizing individual features might result in missing important structures and clusters in the data. t-distributed stochastic neighbor embedding (t-SNE) is an effective visualization method for high-dimensional data. t-SNE generates low-dimensional representations of the high-dimensional data such that structures in the data are well preserved. In Figure 2, t-SNE is applied to twenty sensor measurements from an industrial plant with several abnormal measurements.

Figure 2 t-SNE visualization of a set of sensors from an industrial plant. This plant produces pharmaceutical chemicals and contains tens of sensors measuring different physical quantities such as pressure, flow, and vibration.


Several conclusions can be drawn from this figure. For instance, consider the samples from state 500. In general, samples from the same state form a cluster, whereas for this state we can see several scattered samples that are potential outliers and need to be closely examined.
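A projection of this kind can be sketched with scikit-learn's `TSNE`. The sensor data below is synthetic (a normal population plus a shifted outlier group standing in for plant measurements); the perplexity value is an illustrative choice.

```python
# Sketch: projecting high-dimensional sensor data to 2-D with t-SNE.
# The 20 "sensors" here are synthetic stand-ins for real plant measurements.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 20))    # 200 normal samples, 20 sensors
outliers = rng.normal(6.0, 1.0, size=(10, 20))   # 10 abnormal samples, shifted mean
X = np.vstack([normal, outliers])

# Project to 2-D while preserving local structure; perplexity must be < n_samples.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (210, 2)
```

Plotting the two embedding columns (e.g., with matplotlib) would show the shifted samples separated from the main cluster, as in Figure 2.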


Supervised Models for Time Series Anomaly Detection

Supervised learning is one of the most popular approaches used for time series anomaly detection. In this approach, we provide a set of labels that specify whether the data observed for pre-defined timestamps are normal or not. Using these labels, a machine learning model is trained to predict the labels of future samples: normal or abnormal.

To illustrate this approach, let’s consider a predictive maintenance (PM) use case. In predictive maintenance, given various telemetry measurements, we are interested in predicting machine failures prior to the failure of a component. Given historic examples of these failures, we can generate a label by marking a specified period (for example, one day) before a failure as abnormal and all other timestamps as normal. Additionally, data points from periods where the data might be unreliable are removed using a mask. Examples of masked periods include the periods immediately following a shut-down or a failure. The output of a supervised predictive maintenance model is a risk score that roughly represents the probability of failure in the future.
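The labeling and masking step can be sketched with pandas. The hourly index, the single failure timestamp, the one-day pre-failure window, and the six-hour post-failure mask below are all illustrative assumptions, not values from the text.

```python
# Sketch: building supervised PM labels from a list of failure times.
import pandas as pd

index = pd.date_range("2024-01-01", periods=24 * 7, freq="h")  # one week, hourly
df = pd.DataFrame({"label": 0, "mask": False}, index=index)
failures = [pd.Timestamp("2024-01-04 12:00")]  # hypothetical failure event

for t in failures:
    # Mark the day leading up to the failure as abnormal.
    df.loc[t - pd.Timedelta(days=1): t, "label"] = 1
    # Mask the unreliable period immediately after the failure/shut-down.
    df.loc[t: t + pd.Timedelta(hours=6), "mask"] = True

train = df[~df["mask"]]            # masked rows are excluded from training
print(int(train["label"].sum()))   # 24 abnormal hours remain after masking
```

Any classifier that outputs probabilities can then be trained on `train` to produce the risk score described above.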

Figure 3 Supervised Model Summary: Given sensor data and labels, the model predicts the probability of a timestamp belonging to an abnormal state (here referred to as risk score).


Supervised techniques require labeled data and can predict the class of the samples. The building blocks of a supervised model are summarized in Figure 3.

Semi-Supervised Models for Time Series Anomaly Detection

In a typical enterprise, data is vast and imperfect. For example, for many industrial assets, available labels from different failure modes are limited. However, there is usually a large amount of data available from the normal operation modes. This allows for the development of semi-supervised models, which essentially learn the normal behavior of the asset and then measure the anomaly (or risk score) by the distance from the normal behavior. Compared to supervised learning, semi-supervised models don’t require a label for training. However, the user needs to specify abnormal periods so that they are not used when training the semi-supervised ML model to learn the normal behavior of the system. It is also worth noting that a semi-supervised model’s predictions can potentially be used as labels for other machine learning approaches after they are verified with feedback from subject matter experts (SMEs).

Semi-supervised anomaly detection models are typically regressors that learn the changes in the values of one sensor based on the values of the other sensors. Two flavors of these models are particularly worth discussing. In the case where domain knowledge is available, typically through SMEs, one can utilize this information by focusing on certain sensors and learning their expected behaviors conditional on the value of other sensors. This can be achieved by training a regressor (e.g., a linear model or a neural network) that learns the relation between other relevant measurements and the sensor of interest. For example, one can build a regressor that predicts the temperature of a point in a physical asset based on other pressure and temperature readings (or derived features from these measurements). As the estimated temperature value diverges from the measured value for the same point, the risk score from the semi-supervised model increases. Figure 4 summarizes the main inputs and outputs of a semi-supervised ML model.
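A minimal sketch of this first flavor, under synthetic-data assumptions: a ridge regressor learns one target sensor from five related sensors on healthy data only, and the risk score is the absolute residual between the measured and predicted values.

```python
# Sketch: semi-supervised scoring via a regressor on the sensor of interest.
# Sensor layout, coefficients, and the drift magnitude are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
others = rng.normal(size=(500, 5))   # related pressure/temperature readings
coef = np.array([0.5, -1.0, 0.2, 0.0, 0.8])
target = others @ coef + rng.normal(0, 0.05, 500)  # sensor of interest

model = Ridge().fit(others, target)  # trained on healthy periods only
preds = model.predict(others)
residual = np.abs(target - preds)    # small while behavior is normal

drifted = target.copy()
drifted[0] += 3.0                    # simulate the measured value diverging
risk = np.abs(drifted - preds)       # risk score grows with the divergence
print(risk[0] > residual.mean())     # the drifted reading stands out: True
```

In practice the linear model could be swapped for a neural network without changing the residual-based scoring logic.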

When such information or the required subject matter expertise is not available, one can still attempt to learn the normal behavior of the asset by considering available data from every sensor. However, in this case, the dimension of the input space can be very high, and efficient dimension reduction techniques need to be used. Convolutional auto-encoders are one of the most popular and promising techniques used for this purpose. This technique learns a representation of the data on a low-dimensional space which can then be used to reconstruct the input. The reconstruction error can then be used to define an anomaly measure. The building blocks of a semi-supervised TAD model are summarized in Figure 4.
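The compress-reconstruct-score pattern can be sketched as follows. The text describes convolutional auto-encoders; PCA is used here purely as a simple linear stand-in that exposes the same pattern, and the data is synthetic.

```python
# Sketch: reconstruction-error anomaly scoring (PCA as a linear stand-in
# for the convolutional auto-encoder described in the text).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Healthy data lies near a 3-D subspace of the 20-D sensor space.
basis = rng.normal(size=(3, 20))
healthy = rng.normal(size=(1000, 3)) @ basis + rng.normal(0, 0.1, (1000, 20))

pca = PCA(n_components=3).fit(healthy)  # learn the low-dimensional representation

def risk_score(X):
    recon = pca.inverse_transform(pca.transform(X))  # reconstruct the input
    return np.linalg.norm(X - recon, axis=1)         # reconstruction error per sample

anomaly = rng.normal(size=(1, 20)) * 5.0  # sample far from the healthy subspace
print(risk_score(healthy).mean() < risk_score(anomaly)[0])  # True
```

A convolutional auto-encoder would additionally exploit the temporal ordering of the samples, but the scoring step is the same: large reconstruction error means high risk.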

Figure 4 Semi-supervised model summary: Input is healthy sensor data along with periods that need to be masked out. The model generates reconstructed sensor values from which reconstruction errors and risk scores can be computed.


Unsupervised Models for Time Series Anomaly Detection

In many situations, one has limited labeled data and therefore supervised or semi-supervised methods can’t be reliably used. In such cases, one can consider unsupervised methods to find the structures in the data. There is a vast number of unsupervised techniques that can be applied to anomaly detection in time series data. To name a few, one can use (i) clustering methods, such as k-means and DBSCAN, to learn clusters in the data which can then be used to isolate the outliers, (ii) density-based models, such as the local outlier factor (LOF), to learn a local density deviation as a measure of anomaly score, and (iii) ensemble methods, such as isolation forest, to measure the average number of decision steps needed to isolate a sample.

There is no ground truth or label for the unsupervised models, leading to unique practical challenges. For instance, anomalies can form clusters, and such clusters could be misidentified as normal. To address some of these limitations, nearest-neighbor-based methods have been proposed. These methods assume that normal data instances should be close to each other, and hence occur in dense neighborhoods, whereas anomalies should have a significant distance to their closest neighbors. LOF is a commonly used nearest neighbor model for TAD which performs well in applications that contain local outliers. The shortcoming of LOF is that it might not perform well for very high dimensional data.
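The local-outlier behavior described above can be sketched with scikit-learn's `LocalOutlierFactor` on synthetic 2-D data: a point that is unremarkable globally is still flagged because it is far from its own (dense) neighborhood.

```python
# Sketch: LOF flags a local outlier near a tight cluster (synthetic data).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
dense = rng.normal(0.0, 0.1, size=(100, 2))   # tight cluster of normal samples
sparse = rng.normal(5.0, 1.0, size=(100, 2))  # looser, but still normal, cluster
local_outlier = np.array([[0.0, 1.0]])        # far only relative to the dense cluster
X = np.vstack([dense, sparse, local_outlier])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)   # -1 marks detected outliers, 1 marks inliers
print(labels[-1])             # -1: the local outlier is flagged
```

A purely distance-based threshold would likely miss this point, since it is closer to the dense cluster than many sparse-cluster points are to each other.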

Tree-based ensemble models, such as random forest, are known to achieve good performance in several classification and regression tasks, even in very high dimensional settings. Building on the strengths of tree-based methods, isolation forest (IF) is one of the most popular anomaly detection algorithms used for time series data and deserves further attention. The main intuition behind IF is that since abnormal points are usually few and different, they should be easier to isolate than normal points. Isolation forest measures this by building an ensemble of isolation trees, where abnormal points are the ones that have shorter average path lengths on the trees. Figure 5 illustrates an example where abnormal samples have shorter average path lengths; see also Figure 6. IF is quite effective, particularly in the high-dimensional setting. Predictive maintenance for complex industrial systems with thousands of sensor measurements is an example of such high-dimensional settings.
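Scikit-learn's `IsolationForest` exposes this directly; the sketch below, on synthetic data, also min-max scales the scores to a 0-1 risk score of the kind the model summaries describe.

```python
# Sketch: isolation forest anomaly scores on synthetic high-dimensional data,
# rescaled to a 0-1 risk score.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
normal = rng.normal(0.0, 1.0, size=(500, 10))
abnormal = rng.normal(8.0, 1.0, size=(5, 10))  # few and different -> easy to isolate
X = np.vstack([normal, abnormal])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
# score_samples returns higher values for more normal points; negate and
# min-max scale so that higher risk means more anomalous.
raw = -forest.score_samples(X)
risk = (raw - raw.min()) / (raw.max() - raw.min())
print(risk[-5:].mean() > risk[:500].mean())  # abnormal points score higher: True
```

The short-average-path-length intuition is what `score_samples` computes under the hood, averaged over the 100 isolation trees.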

Figure 5 Tree-depths for normal and abnormal samples


Figure 6 Unsupervised model summary: Given input data, the IF model generates an anomaly score (a normalized score between 0 and 1)


Which method(s) suit your needs best?

Choosing the right modeling approach is among the most challenging and important tasks in TAD. There are several important points to consider when making this choice, starting with the availability of labels. As a rule of thumb, if sufficient failure labels from different operation modes of the asset are available, supervised methods can be a good fit. In other situations, where there is a vast amount of data from normal operating modes but not from the other modes of operation, semi-supervised or unsupervised models are preferred.

Availability of labels fundamentally determines the general class of model. However, additional considerations sometimes drive the choice of a specific technique. For instance, the usefulness of a TAD model can be dramatically improved by providing an explanation and/or interpretation of its predictions for the abnormal class. Therefore, whether using a supervised, semi-supervised, or unsupervised framework, algorithms that can be interpreted easily are preferred.

The Challenges of Time Series Anomaly Detection at Enterprise Scale

Beyond choosing the right modeling approach, multiple other factors make TAD a challenging problem. The data for a typical industrial TAD application comes from many sources. The first challenge is access to high-fidelity, up-to-date data from all relevant sources. For example, for time series anomaly detection at industrial plants, there is high-frequency data from various sensors along with infrequent data coming from manual measurements (such as vibration readings). Moreover, various systems are used to record the static data for the different systems and subsystems, such as a gas turbine and its various components, and the connections between these systems. These data can easily span multiple databases and file systems. The C3 AI Platform comes with prebuilt, extensible data models that can easily be configured and tailored for a new complex time series anomaly detection problem, making all the relevant data accessible through a simple and unified SDK.

After choosing the right approach, developing a high-quality machine learning model for TAD is a highly iterative process and typically requires careful exploration of the data sources and experimentation with different feature sets, learning techniques, ML models, and libraries. This ad-hoc exploration and experimentation is simplified with the help of the C3 AI–hosted Jupyter service. This service is tightly integrated with the C3 AI Platform, providing an interface to the application data model and simplifying experimentation by combining multiple data sources that are typically not easily accessible in a standardized way in one place. This enables data scientists and SMEs to efficiently iterate over ideas and build an effective solution in a timely manner.

Finally, in a typical deployment, many machine learning models are needed for a single application. As an example, in a live deployment of C3 AI Reliability for a large biomedical manufacturer, there are more than 60 models providing continuous predictions on live data. Maintaining and managing these models is a challenging task. First, every model needs to be trained and hyperparameter-tuned. After training, the models need to be productionized to continuously generate predictions and scores, and these outputs need to be persisted and efficiently served to the client using the application front-end. Finally, the quality of the predictions needs to be continuously monitored and the models must be retrained or replaced in case of performance degradation. Performing these tasks at the scale required for a field-level anomaly detection application poses many technical challenges.

The C3 AI Platform solves all these problems with C3 AI’s proprietary Model Deployment Framework. By leveraging the elastic, multi-node architecture of the C3 AI Platform, thousands of models can be trained, processed, and/or tuned simultaneously using asynchronous compute jobs. C3 AI’s Model Deployment Framework enables users to easily configure the logic that defines the data used for training each model. Users can specify whether, for one specific system (e.g., a gas turbine compressor), the training data should be shared between all turbines in a plant or collected individually for each turbine. This logic is flexible and can easily be defined by filters based on any part of the application data model. Finally, the C3 AI Platform enables complex deployment strategies such as champion-challenger deployments. For more details and examples about existing enterprise-scale deployments of time series anomaly detection applications, check out our market-leading predictive maintenance application: C3 AI Reliability.

About the Authors

Nevroz Sen is a lead data scientist at C3 AI. He holds a Ph.D. in Applied Mathematics from Queen's University, Canada. Previously, he held post-doc positions at Harvard University and McGill University and worked as a principal engineer in the areas of machine learning and robotic systems. His research interests include machine learning, information theory, stochastic control, mean field games, and nonlinear filtering.

Amir H. Delgoshaie is a Senior Data Science Manager at C3 AI, where he has worked on the development and deployment of multiple large-scale AI applications for the utility, energy, and manufacturing sectors. He holds a Ph.D. in Energy Resources Engineering from Stanford University and master’s and bachelor’s degrees in Mechanical Engineering from ETH Zurich and Sharif UT. Prior to C3 AI, he developed algorithms and software at various research and industrial institutions.

About Us

C3 AI is a leading enterprise AI software provider for accelerating digital transformation.