A Gentle Introduction to Anomaly Detection with Autoencoders
Anomagram is an interactive visualization tool for exploring how a deep learning model can be applied to the task of anomaly detection (on stationary data). Given an ECG signal sample, an autoencoder model (running live in your browser) can predict if it is normal or abnormal. To try it out, click any of the test ECG signals from the ECG5000 dataset below, or better still, draw a signal to see the model's prediction! Disclaimer: This prototype is built for demonstration purposes only and is not intended for use in any medical setting.
Click on a data sample below to see the prediction of a trained autoencoder.
[Interactive demo: filter the test ECG samples by class (All, Normal, R-on-T Premature Ventricular Contraction, Ectopic Beat, Premature Ventricular Contraction), or click and drag to draw a signal within the box. For the selected or drawn signal, the tool plots the input, the autoencoder's prediction, and the reconstruction error, and reports the resulting MSE alongside the model's prediction.]
The autoencoder is trained using normal ECG data samples. It has never seen any of the test signals above, yet it correctly predicts (most of the time) whether a given signal is normal or abnormal. So, how does the autoencoder identify anomalies? Why is mean squared error a useful metric? What is the threshold and how is it set? Read on to learn more!
How does the Autoencoder work?
[Diagram: autoencoder architecture. Input (140 units) → encoder (2 layers) → bottleneck z → decoder (2 layers) → output (140 units).]
Applying Autoencoders for Anomaly Detection
An anomaly (outlier, abnormality) is defined as “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” - Hawkins 1980.
While autoencoder models have been widely applied for dimensionality reduction (similar to techniques such as PCA), they can also be used for anomaly detection[3]. In fact, several deep learning models built from encoders and decoders (e.g. Sequence to Sequence models[5], Variational Autoencoders[2], Bidirectional GANs[4]) also work well for this task with some modifications! So, how is this all achieved? It turns out that if we train the model on normal data (or unlabelled data with very few abnormal samples), it learns a reconstruction function that works well for normal-looking data (low reconstruction error) and poorly for abnormal data (high reconstruction error). We can then use reconstruction error as a signal for anomaly detection.
In particular, if we visualize a histogram of the reconstruction errors generated by a trained autoencoder, we will hopefully observe that the distribution of errors for normal samples is overall smaller and markedly separate from the distribution of errors for abnormal data.
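As a minimal sketch of this idea (assuming a trained autoencoder model, here called `autoencoder`, and a tensor of scaled test signals `xTest`; these names are illustrative, not the project's actual code), the per-sample reconstruction error can be computed with Tensorflow.js as follows:

```javascript
// Sketch: per-sample reconstruction error with Tensorflow.js.
// Assumes `autoencoder` is a trained tf.LayersModel and `xTest` is a
// [numSamples, 140] tensor of (min-max scaled) ECG signals.
const tf = require('@tensorflow/tfjs');

function reconstructionError(autoencoder, xTest) {
  return tf.tidy(() => {
    const reconstruction = autoencoder.predict(xTest);        // [numSamples, 140]
    const squaredDiff = tf.squaredDifference(xTest, reconstruction);
    return squaredDiff.mean(1);                                // MSE per sample, shape [numSamples]
  });
}

// Samples whose error exceeds a chosen threshold are flagged as anomalies:
// const errors = reconstructionError(autoencoder, xTest);
// const isAnomaly = errors.greater(tf.scalar(threshold));
```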
Note: We may not always have labelled data, but we can assume (given the rare nature of anomalies) that the majority of data points for most anomaly detection use cases are normal. See the section below that discusses the impact of data composition (% of abnormal data) on model performance.
The Dataset
This prototype uses the ECG5000 dataset, which contains 5000 examples of ECG signals from a patient. Each data sample (corresponding to an extracted heartbeat of 140 points) has been labelled as either normal or indicative of heart conditions related to congestive heart failure.
[Interactive data explorer: browse the dataset samples by class (All, Normal, R-on-T Premature Ventricular Contraction, Ectopic Beat, Premature Ventricular Contraction).]
Data Transformation
Prior to training the autoencoder, we first apply a min-max scaling transform to the input data, which converts it from its original range (-5 to 2) to a range of 0 to 1. This is done for two main reasons. First, existing research suggests that neural networks in general train better when input values lie between 0 and 1 (or have zero mean and unit variance). Second, scaling the data supports the learning objective for the autoencoder (minimizing reconstruction error) and makes the results more interpretable. In general, the range of output values from the autoencoder depends on the type of activation function used in the output layer. For example, the tanh activation function outputs values in the range of -1 to 1, while sigmoid outputs values in the range of 0 to 1. In the example above, we use the sigmoid activation function in the output layer of the autoencoder, allowing us to directly compare the transformed input signal to the output when computing the mean squared error metric during training. In addition, having both input and output in the same range allows us to visualize the differences that contribute to the anomaly classification.
Note: The parameters of the scaling transform should be computed only on train data and then applied to test data.
-2.22, -4.22, -4.67, -4.47, -3.34, -2.24, -1.55, -1.31, -0.61, -0.26, -0.28, -0.29, -0.35, -0.23, -0.25, -0.31, -0.27, -0.27, -0.30, -0.31, -0.31, -0.36, -0.38, -0.44, -0.41, -0.47, -0.43, -0.51, -0.51, -0.54, -0.55, -0.68, -0.73, -0.72, -0.68, -0.55, -0.50, -0.53, -0.45, -0.46, -0.31, -0.33, -0.15, 0.00, 0.01, 0.02, 0.11, 0.16, 0.12, 0.10, ...
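A minimal sketch of this transform (the function names and structure here are illustrative, not the project's actual code): the scaler parameters are fit on the training split only and then reused for the test split.

```javascript
// Sketch: min-max scaling fit on training data only, then reused for test data.
function fitMinMaxScaler(trainData) {              // trainData: array of samples (arrays of numbers)
  let min = Infinity, max = -Infinity;
  for (const sample of trainData) {
    for (const v of sample) {
      if (v < min) min = v;
      if (v > max) max = v;
    }
  }
  return { min, max };
}

function applyMinMaxScaler(data, { min, max }) {   // maps values into the [0, 1] range
  return data.map(sample => sample.map(v => (v - min) / (max - min)));
}

// const scaler = fitMinMaxScaler(trainSignals);            // fit on train split only
// const trainScaled = applyMinMaxScaler(trainSignals, scaler);
// const testScaled = applyMinMaxScaler(testSignals, scaler);
```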
[Interactive replay of training run visualization.]
Model Implementation and Training
The autoencoder in this prototype (visualized above) has two layers in its encoder and decoder respectively. It is implemented using the Tensorflow.js layers api (similar to the keras api). The encoder/decoder are specified using dense layers, relu activation function, and the Adam optimizer (lr = 0.01) is used for training. Given that each ECG data sample is comprised of 140 values, both the encoder input vector and decoder output layer are of size 140.
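A sketch of such a model using the Tensorflow.js layers API is shown below. The 140-unit input/output layers, relu/sigmoid activations, and Adam optimizer (lr = 0.01) follow the description above, while the hidden and bottleneck sizes are illustrative placeholders rather than the exact values used in the repository.

```javascript
// Sketch: a small dense autoencoder with the Tensorflow.js layers API.
const tf = require('@tensorflow/tfjs');

function buildAutoencoder(inputSize = 140, hiddenSize = 16, latentSize = 2) {
  const model = tf.sequential();
  // Encoder: two dense layers that compress the signal into the bottleneck z.
  model.add(tf.layers.dense({ inputShape: [inputSize], units: hiddenSize, activation: 'relu' }));
  model.add(tf.layers.dense({ units: latentSize, activation: 'relu' }));
  // Decoder: two dense layers that reconstruct the signal from z.
  model.add(tf.layers.dense({ units: hiddenSize, activation: 'relu' }));
  model.add(tf.layers.dense({ units: inputSize, activation: 'sigmoid' }));
  model.compile({ optimizer: tf.train.adam(0.01), loss: 'meanSquaredError' });
  return model;
}

// Training minimizes reconstruction error on normal samples only, e.g.:
// await model.fit(xTrainNormal, xTrainNormal, { epochs: 30, batchSize: 512 });
```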
The full Tensorflow.js code for specifying the autoencoder can be found in the project repository on Github.
As training progresses, the model's weights are updated to minimize the difference between the encoder input and decoder output for the training data (normal samples). To illustrate the relevance of the training process to the anomaly detection task, we can visualize the histogram of reconstruction errors generated by the model (see figure to the right). At initialization (epoch 0), the untrained autoencoder has not learned to reconstruct normal data and hence makes fairly random guesses in its attempt to reconstruct any input data - thus we see a similar distribution of errors for both normal and abnormal data. As training progresses, the model gets better at reconstructing normal data, and its reconstruction error becomes markedly smaller for normal samples, leading to a distinct distribution for normal compared to abnormal data. As both distributions diverge, we can set a threshold or cutoff point; any data point with error above this threshold is termed an anomaly, and any data point below it is termed normal.
Selecting a Threshold
The current setup is semi-supervised, in that we have labels for a small pool of validation/test samples. Using these labels (and some domain expertise), we can automatically determine this threshold - we explore the range of MSE values for each data point in the validation set and select as our threshold the point that yields the best accuracy. But is accuracy enough?
Note: In the absence of labelled data, and if we make a few assumptions (most data points are normal and the MSE values follow a normal distribution), we can use statistics such as standard deviation and percentiles to infer a good threshold.
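A minimal sketch of this selection step (illustrative names; it assumes arrays of per-sample MSE values and binary labels, 1 for abnormal and 0 for normal, on the validation set):

```javascript
// Sketch: pick the threshold that maximizes accuracy on a labelled validation set.
function selectThreshold(valErrors, valLabels) {
  let best = { threshold: 0, accuracy: 0 };
  for (const candidate of valErrors) {               // try each observed error as a cutoff
    let correct = 0;
    for (let i = 0; i < valErrors.length; i++) {
      const predictedAbnormal = valErrors[i] > candidate ? 1 : 0;
      if (predictedAbnormal === valLabels[i]) correct++;
    }
    const accuracy = correct / valErrors.length;
    if (accuracy > best.accuracy) best = { threshold: candidate, accuracy };
  }
  return best;
}

// Without labels, a percentile of the training errors (e.g. the 95th) can serve
// as a rough threshold, under the assumption that most samples are normal.
```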
[Example: histogram of reconstruction errors for 500 test data points (normal vs. abnormal) across training epochs. At epoch 0 the model is untrained, so both normal and abnormal data have similar error ranges and overlapping distributions. Use the replay control to step through epochs.]
Model Evaluation: Accuracy is NOT Enough
For most anomaly detection problems, data is usually imbalanced - the number of labelled normal samples vastly outnumbers abnormal samples. For example, for every 100 patients who take an ECG test, fewer than 23 are likely to have some type of abnormal reading. This sort of data imbalance introduces issues that make accuracy an insufficient metric. Consider a naive model (actually a really bad model) that simply flags every sample as normal. Given our ECG scenario above, it would have an accuracy above 77% despite being a really unskilled model. Clearly, accuracy alone does not tell the complete story, i.e. how often the model flags an ECG as abnormal when it is indeed abnormal (true positive), abnormal when it is normal (false positive), normal when it is abnormal (false negative), and normal when it is indeed normal (true negative). Two important metrics can be applied to address these issues - precision and recall. Precision expresses the percentage of positive predictions that are correct and is calculated as true positives / (true positives + false positives). Recall expresses the proportion of actual positives that were correctly predicted: true positives / (true positives + false negatives).
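A minimal sketch of computing these metrics at a fixed threshold (illustrative names; labels are assumed to be 1 for abnormal and 0 for normal):

```javascript
// Sketch: precision and recall from model predictions at a fixed threshold.
// `errors` are per-sample MSE values; `labels` are 1 = abnormal, 0 = normal.
function precisionRecall(errors, labels, threshold) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (let i = 0; i < errors.length; i++) {
    const predicted = errors[i] > threshold ? 1 : 0;
    if (predicted === 1 && labels[i] === 1) tp++;
    else if (predicted === 1 && labels[i] === 0) fp++;
    else if (predicted === 0 && labels[i] === 1) fn++;
    else tn++;
  }
  return {
    precision: tp / (tp + fp),   // share of flagged anomalies that are truly abnormal
    recall: tp / (tp + fn),      // share of true anomalies that were flagged
  };
}
```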
Depending on the use case, it may be desirable to optimize a model's performance for high precision or high recall. This tradeoff between precision and recall can be adjusted by the selection of a threshold (e.g. a low enough threshold will yield excellent recall but reduced precision). In addition, the Receiver Operating Characteristic (ROC) curve provides a visual assessment of a model's skill (area under the curve - AUC) and is obtained by plotting the true positive rate against the false positive rate at various values of the threshold. The F score metric has also been introduced to summarize both precision and recall, while reflecting an emphasis on either precision or recall via its β parameter.
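As an illustrative sketch (not the project's code), the points of an ROC curve can be generated by sweeping the threshold over the observed errors and recording the true and false positive rates at each cutoff:

```javascript
// Sketch: true/false positive rates across thresholds - the points of an ROC curve.
// `errors` are per-sample MSE values; `labels` are 1 = abnormal, 0 = normal.
function rocPoints(errors, labels) {
  const thresholds = [...errors].sort((a, b) => a - b);
  return thresholds.map(threshold => {
    let tp = 0, fp = 0, fn = 0, tn = 0;
    for (let i = 0; i < errors.length; i++) {
      const predicted = errors[i] > threshold;
      if (predicted && labels[i] === 1) tp++;
      else if (predicted && labels[i] === 0) fp++;
      else if (!predicted && labels[i] === 1) fn++;
      else tn++;
    }
    return { threshold, tpr: tp / (tp + fn), fpr: fp / (fp + tn) };
  });
}

// The area under the resulting (fpr, tpr) curve (AUC) summarizes model skill
// independently of any single threshold choice.
```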
The example below shows the performance of a trained autoencoder model. Move the slider to see how threshold choices impact precision and recall metrics.
Accuracy: 95.00% | Precision: 0.92 | Recall: 0.96 | False Positive Rate: 5.82% | False Negative Rate: 3.85% | True Positive Rate: 96.15% | True Negative Rate: 94.18%
Some Insights on the Effect of Model/Training Parameters
Some interesting insights that can be observed while modifying the training parameters for the model are highlighted below. You can explore them via the train a model interactive tab.
Regularization, Optimizer, Batch Size
Neural networks can approximate complex functions. They are also likely to overfit, given limited data. In this prototype, we have relatively few samples (2500 normal samples), and we can observe signs of overfitting (train loss is less than validation loss). Regularization (l1 and l2) can be an effective way to address this. In addition, the choice of learning rate and optimizer can affect the speed and effectiveness (time to peak performance) of training. For example, Adam reaches peak accuracy within fewer epochs compared to optimizers like rmsprop and good old sgd. In the train a model interactive section, you can apply activation regularization - l1, l2 and l1l2 (regularization rate is set to the learning rate) - and observe its impact! You can also try out 6 different optimizers (Adam, Adamax, Adadelta, Rmsprop, Momentum, Sgd) with various learning rates, as in the sketch below.
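A sketch of how these options map to the Tensorflow.js layers API (the layer sizes, regularization rate, and optimizer choice here are illustrative, not the prototype's exact settings):

```javascript
// Sketch: activity regularization and optimizer choices in Tensorflow.js.
const tf = require('@tensorflow/tfjs');

const learningRate = 0.01;

// Activity (activation) regularization on a dense encoder layer;
// the regularization rate mirrors the learning rate, as in the interactive demo.
const regularizedLayer = tf.layers.dense({
  inputShape: [140],
  units: 16,
  activation: 'relu',
  activityRegularizer: tf.regularizers.l1l2({ l1: learningRate, l2: learningRate }),
});

// A few of the available optimizers, at the same learning rate.
const optimizers = {
  adam: tf.train.adam(learningRate),
  adamax: tf.train.adamax(learningRate),
  adadelta: tf.train.adadelta(learningRate),
  rmsprop: tf.train.rmsprop(learningRate),
  momentum: tf.train.momentum(learningRate, 0.9),
  sgd: tf.train.sgd(learningRate),
};

// model.compile({ optimizer: optimizers.adam, loss: 'meanSquaredError' });
```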
Abnormal Percentage
We may not always have labelled normal data to train a model. However, given the rarity of anomalies (and domain expertise), we can assume that unlabelled data is mostly composed of normal samples. Does model performance degrade with changes in the percentage of abnormal samples in the dataset? In the train a model section, you can specify the percentage of abnormal samples to include when training the autoencoder model. We see that with 0% abnormal data, the model AUC is ~96%. At 30% abnormal sample composition, AUC drops to ~93%. At 50% abnormal data points, there is just not enough information in the data to allow the model to learn a pattern of normal behaviour. It essentially learns to reconstruct normal and abnormal data equally well, and MSE is no longer a good measure of anomaly. At this point, model performance is only slightly above random chance (AUC of 56%).
Ok ... The Road to Production?
Why Use An Autoencoder?
Why is an autoencoder (or any other related deep learning model) a good candidate for anomaly detection problems? First, this approach allows us to train the model with mostly unlabelled data, after which we can evaluate and tune our threshold using a small amount of labelled data. This alleviates the burden/cost associated with amassing a large amount of labelled training data. Next, by using an anomaly threshold, the model is more likely to detect new anomalies that have previously been unseen (unknown unknowns). To explore this, try drawing a really squiggly line that is quite unrepresentative of what an ECG signal could be and see the model's output. On the other hand, if we cast this problem as a classic classification problem (assuming labels exist), we are less likely to detect unknown unknowns. Finally, deep learning models work well in approximating complex non-linear functions (also ... watch out for overfitting!); they can correctly model the non-linear patterns that make up normal samples and can do this with minimal tuning compared to other methods.
Discretizing Data
There are a few important properties of the current dataset that make the autoencoder approach possible. First, while ECG is time series data, the current dataset has been discretized, i.e. chunked into fixed-size slices of 140 values, where each slice constitutes a sample in the dataset. This discretization is performed using some domain knowledge (each set of 140 values corresponds to a heartbeat!), making the samples directly comparable. Next, we can also observe that the data is stationary, i.e. its mean and variance do not change with time. This way, it is more likely that the values being predicted at test time lie in the same range (distribution) as values seen during training. In order to move to production with your own data using the autoencoder approach discussed above, it is important that similar conditions are met - stationarity is handled (if it exists) and the dataset is constructed such that samples are independent and identically distributed, as in the sketch below.
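A minimal sketch of such discretization (fixed-size, non-overlapping windows; the real ECG5000 slices are aligned to individual heartbeats using domain knowledge, which this simple version does not attempt):

```javascript
// Sketch: discretize a long 1-D signal into fixed-size, non-overlapping windows,
// so each slice becomes one sample (a window of 140 mirrors the heartbeat slices above).
function discretize(signal, windowSize = 140) {
  const samples = [];
  for (let start = 0; start + windowSize <= signal.length; start += windowSize) {
    samples.push(signal.slice(start, start + windowSize));
  }
  return samples;
}

// const samples = discretize(longEcgRecording); // array of 140-value samples
```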
Model Serving
For production purposes, there are a couple of options. First, we can integrate our Tensorflow.js model code (model.predict) as is into a Node.js web server application. Here, we can use either the Tensorflow.js CPU backend, which accelerates computation via the Tensorflow C binary, or the Tensorflow.js GPU backend, which accelerates computation via an available CUDA-enabled GPU. This option makes sense for teams already heavily invested in the Node.js stack. Our second option is to rewrite the model using the Keras API. The good part is that the Tensorflow.js (layers) API has an almost 1:1 mapping with the Keras API, making rewrites easy. A Keras model can be served in production using Tensorflow Serving.
Congrats on making it this far! What's next? Click the train a model tab to interactively build and train an autoencoder, evaluate its performance, and visualize model metrics for normal and abnormal test data.
Closing Notes
In this prototype, we have considered the task of detecting anomalies in ECG data. We used an autoencoder and demonstrated fairly good results with minimal tuning. We have also explored how and why it works. This and other neural approaches (Sequence to Sequence models, Variational Autoencoders, BiGANs, etc.) can be particularly effective for anomaly detection on multivariate or high dimensional datasets such as images (think convolutional layers instead of dense layers), multivariate time series, and time series with multiple external regressors.
Note: A deep learning model is not always the best tool for the job. In particular, for univariate (and low-dimensional) data, autoregressive linear models (linear regression, the ARIMA family of models for time series [6], etc.), clustering and decomposition methods (KMeans, PCA, etc.), and nearest neighbour methods (KNN) can be very fast and effective. Interested in learning more about other deep learning approaches to anomaly detection? My colleagues and I cover additional details on this topic in the upcoming Fast Forward Labs 2020 report on Deep Learning for Anomaly Detection.
Further Reading
[1] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Chapter 14: Autoencoders.
[2] An, J., & Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1).
[3] Zhou, Chong, and Randy C. Paffenroth. "Anomaly detection with robust deep autoencoders." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.
[4] Di Mattia, Federico, et al. "A Survey on GANs for Anomaly Detection." arXiv preprint arXiv:1906.11632 (2019).
[5] Malhotra, Pankaj, et al. "LSTM-based encoder-decoder for multi-sensor anomaly detection." arXiv preprint arXiv:1607.00148 (2016).
[6] Hyndman, Rob. "A brief history of time series forecasting competitions." 2018. https://robjhyndman.com/hyndsight/forecasting-competitions/