Anton Lebedevich's Blog


Statistics for Monitoring: Data Properties

11 Feb 2014

This post introduces performance metrics and the properties that affect the choice of algorithms for anomaly detection, performance analysis, and capacity planning. Several examples will illustrate typical properties and anomalies.

great wall of graphs

Here’s what typical performance metrics look like. CPU utilization, used memory, network I/O, disk I/O, etc. are drawn as time series. I’ll put no labels on the axes because the horizontal axis is always time and the vertical axis is the value of some metric. In some cases it helps to know what kind of metric we are analyzing, but in general it doesn’t matter for statistical purposes. Sometimes all you know about a metric is its name, and only the developers can shed light on its meaning.

This style of visualization, where black dots are individual data points and the red curve is a value averaged over some period, will be used for the graphs below. It lets you quickly see both the data distribution (the darkness of the dot cloud) and its trend.
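The red curve can be reproduced with a centered moving average; a minimal sketch, where the window size of 5 points is an arbitrary choice for illustration (the original graphs don’t state their averaging period):

```python
# Sketch: smooth a noisy series with a centered moving average.
# A window of 5 points is an arbitrary illustrative choice.
def moving_average(values, window=5):
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)           # clamp the window at the edges
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# A single outlier gets spread out and damped by the average:
series = [0, 0, 10, 0, 0]
print(moving_average(series))  # middle point becomes 10 / 5 = 2.0
```

Plotting `values` as dots and `moving_average(values)` as a curve gives the black-dots-plus-red-line style used throughout this post.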

There are several interesting things in the picture above. One service crashed and was restarted, which led to spikes on some graphs and drops on others. Another service is leaking memory, which looks like linear growth.

It helps a lot to have a good understanding of the system (OS and application) when you are investigating performance problems. But sometimes you have no idea what’s going on. In that case you have to check a lot of metrics in the hope of finding something interesting and related to the problem.

Humans are good at pattern recognition, so they can spot trends and changes on graphs. I used to navigate through hundreds of graphs while investigating production problems, but it doesn’t scale well. If you install a metrics-gathering agent (like collectd) on a server, it’ll produce hundreds of metrics. A dozen servers produce thousands of metrics, which is impossible for a human to review in a timely manner.

Faced with that problem, I noticed that I follow quite simple steps while analyzing data visually: I look for graphs whose shape changes when things break, or I look for similar graphs (e.g. something that spikes at the same time as the error rate).

Statistical tools can spot some trends and recognize some patterns too. There is a whole body of knowledge on changepoint detection and clustering (grouping similar objects). With their help a human can handle more data by looking only at the interesting graphs and skipping unrelated noise.

When I started learning statistics I found that computer-generated data is quite different from the ideal world of statistical models, for reasons illustrated below:


quantization (path)

The graph above shows a metric with visible quantization. It’s not that obvious when all data points are connected by lines. Let’s see how it looks if the lines are removed.

quantization (dots)

The dots fall on horizontal lanes. Let’s make a histogram with bin width 1:

quantization (histogram 1)

Nothing interesting. Let’s make a finer-grained histogram with bin width 0.25:

quantization (histogram 0.25)

Now we see that the data contains only integer values.
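The bin-width effect is easy to reproduce. A minimal sketch with synthetic integer-valued data: with bin width 1 the histogram looks unremarkable, but with bin width 0.25 three out of every four bins come out empty, exposing the quantization:

```python
# Sketch: the same integer-valued data looks ordinary with bin width 1
# but reveals empty bins with width 0.25. The data here is synthetic.
def histogram(values, bin_width, lo, hi):
    nbins = int(round((hi - lo) / bin_width))
    counts = [0] * nbins
    for v in values:
        idx = min(int((v - lo) / bin_width), nbins - 1)
        counts[idx] += 1
    return counts

data = [3, 4, 4, 5, 5, 5, 6]           # integers only, as in the graph
coarse = histogram(data, 1.0, 3, 7)    # 4 bins, all occupied
fine = histogram(data, 0.25, 3, 7)     # 16 bins, mass only at integers
print(coarse)  # [1, 2, 3, 1]
print(fine)    # nonzero only at bins 0, 4, 8, 12
```

With the coarse bins every bin has some mass, so nothing stands out; the finer bins show gaps between the integer values.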

mean shift

It’s a typical changepoint, well known as a ‘mean shift’. CPU usage of a data-processing service grew when a new data producer was added.
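A crude way to locate such a shift is to try every split point and pick the one that maximizes the difference between the means of the two sides. This is only a sketch on synthetic data; real changepoint methods (CUSUM, PELT, and the like) are far more robust:

```python
# Sketch: locate a mean shift by finding the split point that
# maximizes the difference between the means of the two sides.
# Data is synthetic; real changepoint detection is more robust.
def mean_shift_point(values):
    best_i, best_diff = None, 0.0
    for i in range(1, len(values)):
        left, right = values[:i], values[i:]
        diff = abs(sum(right) / len(right) - sum(left) / len(left))
        if diff > best_diff:
            best_i, best_diff = i, diff
    return best_i

cpu = [20, 21, 19, 20, 45, 46, 44, 45]  # level jumps at index 4
print(mean_shift_point(cpu))  # 4
```

On noisier data a statistical test on the two sides (or a penalized-cost method) is needed to decide whether the best split is a real changepoint or just noise.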

sudden drop

This is used disk space at the moment a large log file was deleted by logrotate. There is an almost linear trend (though with some noise in it) followed by an abrupt drop.
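An abrupt drop like this stands out clearly in the first differences of the series: the linear growth contributes small positive steps, while the logrotate deletion produces one large negative step. A minimal sketch on synthetic numbers:

```python
# Sketch: an abrupt drop appears as one large negative first difference,
# while the linear trend only produces small positive ones.
# The disk numbers below are synthetic.
def largest_drop(values):
    diffs = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    i = min(range(len(diffs)), key=lambda j: diffs[j])
    return i + 1, diffs[i]  # index right after the drop, and its size

disk = [10, 12, 14, 16, 18, 5, 7]  # log file deleted before index 5
print(largest_drop(disk))  # (5, -13)
```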

bimodal spike

Closed TCP sockets per second. One system crashed and a lot of clients got disconnected, which resulted in a large spike on the graph.
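A spike like this can be flagged by comparing each point against a robust baseline; a minimal sketch using the median and the median absolute deviation, where both the threshold multiplier and the data are illustrative assumptions:

```python
# Sketch: flag points far above the series median as spikes.
# The "median + 5 * MAD" threshold and the data are illustrative
# choices, not values from the post.
def spikes(values, k=5.0):
    s = sorted(values)
    median = s[len(s) // 2]
    mad = sorted(abs(v - median) for v in values)[len(values) // 2]
    return [i for i, v in enumerate(values) if v > median + k * mad]

closed = [3, 2, 4, 3, 120, 3, 2]  # synthetic: crash at index 4
print(spikes(closed))  # [4]
```

Median and MAD are used instead of mean and standard deviation because the spike itself would inflate the mean-based baseline and hide smaller anomalies.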

previous: Statistics for Monitoring: Preface next: Statistics for Monitoring: Load Testing (Tuning)