Anton Lebedevich's Blog


Statistics for Monitoring: Anomaly Detection (Part 2)

21 Apr 2014

Experimental anomaly detection methods based on autocorrelation and non-parametric two-sample tests.

Autocorrelation helps distinguish between metrics that have changing behavior and stable ones.


These are different kinds of graphs that have a high Ljung–Box test statistic, which is based on autocorrelation coefficients at different lags.

The Ljung–Box test is good at finding graphs with non-flat trends and mean shifts. The downside is that it also flags graphs with seasonal changes, oscillations, and already aggregated data (like load average, which is an EWMA). The two bottom graphs in the image above are load average and an oscillating metric.
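As a rough illustration of why a mean shift produces a high statistic, here is a minimal sketch of the Ljung–Box Q statistic in plain Python. The synthetic series, the shift size, and the choice of 10 lags are assumptions made for the example, not values from the monitoring data described here.

```python
import random

def autocorrelation(xs, lag):
    """Sample autocorrelation coefficient of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # constant series: treat as uncorrelated
    num = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return num / denom

def ljung_box(xs, max_lag):
    """Ljung-Box Q = n (n + 2) * sum over k = 1..max_lag of r_k^2 / (n - k)."""
    n = len(xs)
    return n * (n + 2) * sum(
        autocorrelation(xs, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )

# Synthetic data (assumption): white noise, and the same noise with a mean
# shift halfway through.
rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(400)]
shifted = [x + (3 if i >= 200 else 0) for i, x in enumerate(noise)]

q_noise = ljung_box(noise, 10)
q_shifted = ljung_box(shifted, 10)
print(q_noise, q_shifted)
```

The shifted series has strongly correlated neighbouring points (they sit on the same side of the overall mean), so its Q value comes out far larger than that of plain noise.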

The control-chart-based methods mentioned in Part 1 don’t work for data with a relatively stable mean (whose changes fall within the three-sigma range):

[Figure: stable mean]

Both graphs clearly show different behavior in different time intervals, but the changes in mean value (painted red) are too small relative to the standard deviation to be noticed by control charts.
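A small synthetic sketch of this effect, assuming Gaussian data and a mean shift of half a standard deviation (both are assumptions chosen for illustration): the three-sigma limits computed from the "before" interval flag almost nothing in the "after" interval, even though the mean has clearly moved.

```python
import random
import statistics

rng = random.Random(1)
before = [rng.gauss(10.0, 2.0) for _ in range(1000)]
after = [rng.gauss(11.0, 2.0) for _ in range(1000)]  # mean moved by 0.5 sigma

# Three-sigma control limits estimated from the "before" interval.
center = statistics.fmean(before)
sigma = statistics.pstdev(before)
lower, upper = center - 3 * sigma, center + 3 * sigma

def outside(xs):
    """Fraction of points that fall outside the control limits."""
    return sum(1 for x in xs if x < lower or x > upper) / len(xs)

alarm_rate_before = outside(before)
alarm_rate_after = outside(after)
mean_shift = statistics.fmean(after) - center
print(alarm_rate_before, alarm_rate_after, mean_shift)
```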

Metrics that represent request latency or size are another bad fit for control charts.

[Figure: request size daily]

This weird graph shows maximum request size (black dots) measured over 10-second intervals during a day. The red line is an average over 3-minute intervals. There are some spikes in the average value, but there is also a strange cloud of dots in the top-right part, which corresponds to heavy requests appearing at that time of day.

In the case of latency, the data distribution is not bell-shaped. It usually has a long tail (some requests take much longer than the rest), which produces a lot of false alarms when using control charts.
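To get a feel for the size of the problem, the sketch below compares how often a point exceeds "mean + 3 standard deviations" for bell-shaped data and for long-tailed data. The log-normal distribution is only an assumed stand-in for latency here; real latency data may look different.

```python
import random
import statistics

def upper_alarm_rate(xs):
    """Fraction of points above mean + 3 standard deviations."""
    limit = statistics.fmean(xs) + 3 * statistics.pstdev(xs)
    return sum(1 for x in xs if x > limit) / len(xs)

rng = random.Random(3)
bell = [rng.gauss(100.0, 10.0) for _ in range(20000)]
# Assumption: log-normal samples as a rough model of long-tailed latency.
long_tail = [rng.lognormvariate(0.0, 1.0) for _ in range(20000)]

bell_rate = upper_alarm_rate(bell)
long_tail_rate = upper_alarm_rate(long_tail)
print(bell_rate, long_tail_rate)
```

Nothing changed in either series, yet the long-tailed one crosses the upper limit several times more often, and each crossing would be reported as an anomaly.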

It’s possible to find changes in such kinds of data by using non-parametric two-sample tests like the Kolmogorov–Smirnov test or the Cramér–von Mises test. These tests measure how different two sets of data are.
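For reference, the two-sample Kolmogorov–Smirnov statistic is the largest vertical distance between the empirical distribution functions of the two samples. A minimal plain-Python sketch follows; the function name `ks_statistic` is made up for this example, and SciPy users would normally reach for `scipy.stats.ks_2samp` instead.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to compare them at every data point.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)  # share of a that is <= x
        fb = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(fa - fb))
    return d

same = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(same, disjoint, overlap)
```

The statistic is 0 for identical samples and 1 for samples that do not overlap at all. It depends only on the ordering of the values, so it makes no assumption about the shape of the distribution.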

[Figure: Kolmogorov–Smirnov test]

Here we select two adjacent time intervals (right side of the top graph) and compare the data drawn from them using the Kolmogorov–Smirnov test. The value of the test statistic is plotted on the graph below, on the line between those two intervals. The time intervals are quite large here (2 hours), so the bottom graph shows only large-scale changes. The two highest peaks on it mark the times when the heavy request cloud appeared and disappeared.

[Figure: distribution change example]

There is a visible mean shift (left side), which is hidden from control charts by the large standard deviation.

[Figure: Kolmogorov–Smirnov test 2]

The maximum of the Kolmogorov–Smirnov test statistic (bottom graph) falls exactly at the point of change.
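The sliding-window search can be sketched roughly as follows. This is an illustration on synthetic data, not the code behind the graphs: the window length of 200 points, the one-sigma mean shift, and the helper names `ks_statistic` and `sliding_ks` are all assumptions made for the example.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def sliding_ks(series, window):
    """KS statistic between the `window` points before and after each boundary."""
    scores = {}
    for t in range(window, len(series) - window + 1):
        scores[t] = ks_statistic(series[t - window:t], series[t:t + window])
    return scores

# Synthetic series (assumption): the mean moves by one standard deviation
# at index 600, a shift that mostly stays within three-sigma limits.
rng = random.Random(2)
change_at = 600
series = [rng.gauss(0.0, 1.0) for _ in range(change_at)]
series += [rng.gauss(1.0, 1.0) for _ in range(600)]

scores = sliding_ks(series, window=200)
estimated_change = max(scores, key=scores.get)  # boundary with the largest statistic
print(estimated_change, scores[estimated_change])
```

Larger windows give a smoother, less noisy statistic but can only resolve large-scale changes, which matches the behavior of the 2-hour windows described above.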

This method (finding the maximum of a two-sample test statistic between two adjacent sliding windows) is:

Drawbacks of the method:

The main use case for the method is to find the time range when something (maybe good, maybe bad) happened, so that someone can read the logs from that time. Another use case is to find metrics that changed behavior at some known time, when we know that things broke but don’t know exactly why.
