Summarizers

A summarizer is an object that ingests (double precision floating point) values and computes some statistics (summaries) of the ingested values. The following code example creates a Tally and feeds it 1000 uniformly distributed pseudo-random values.

Tally tally = new Tally("Example tally");
Random random = new Random(1234);
for (int i = 0; i < 1000; i++)
    tally.ingest(random.nextDouble());
System.out.println("minimum:                       " + tally.getMin());
System.out.println("maximum:                       " + tally.getMax());
System.out.println("count:                         " + tally.getN());
System.out.println("sum:                           " + tally.getSum());
System.out.println("sample mean:                   " + tally.getSampleMean());
System.out.println("sample variance:               " + tally.getSampleVariance());
System.out.println("sample standard deviation:     " + tally.getSampleStDev());
System.out.println("sample skewness:               " + tally.getSampleSkewness());
System.out.println("sample kurtosis:               " + tally.getSampleKurtosis());
System.out.println("sample excess kurtosis:        " + tally.getSampleExcessKurtosis());
System.out.println("population mean:               " + tally.getPopulationMean());
System.out.println("population variance:           " + tally.getPopulationVariance());
System.out.println("population standard deviation: " + tally.getPopulationStDev());
System.out.println("population skewness:           " + tally.getPopulationSkewness());
System.out.println("population kurtosis:           " + tally.getPopulationKurtosis());
System.out.println("population excess kurtosis:    " + tally.getPopulationExcessKurtosis());

When run, this program outputs something like

minimum:                       4.463828850445051E-4
maximum:                       0.9993228356687273
count:                         1000
sum:                           487.6875254457159
sample mean:                   0.4876875254457153
sample variance:               0.0839370407429099
sample standard deviation:     0.28971889952660995
sample skewness:               0.03986803012965087
sample kurtosis:               1.743485723621789
sample excess kurtosis:        -1.2550414677445052
population mean:               0.4876875254457153
population variance:           0.08385310370216699
population standard deviation: 0.28957400384386545
population skewness:           0.03980820314948127
population kurtosis:           1.7452309545763656
population excess kurtosis:    -1.2547690454236344

With the same Java runtime environment as we used to run this example, you should get exactly the same output, because the output of a seeded pseudo-random generator is reproducible. For values that are uniformly distributed between 0.0 and 1.0, the expected mean and median are 0.5, the expected variance is 0.083333 (= 1/12), the expected standard deviation is 0.288675 (= √(1/12)), the expected skewness is 0.0, the expected kurtosis is 1.8 and the expected excess kurtosis is -1.2. The differences between these expected values and the observed values are reasonable for the sample size (1000).

Population versus sample

The getPopulationXXX methods return the result as it should be computed when an entire population has been ingested. The getSampleXXX methods should be used when the ingested values form just a sample of the entire population. In the example above, the Tally has ingested 1000 values from a population of practically infinite size (strictly speaking, the number of double precision floating point values between 0.0 and 1.0 is finite, but a sample of 1000 is nowhere close to the entire population). Thus, the getSampleXXX methods should be used to summarize the ingested values.
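
To make the distinction concrete, the following minimal sketch computes both variants of the variance in plain Java. The class VarianceSketch and its methods are hypothetical illustrations and are not part of the Tally API; the sketch assumes the usual textbook definitions (divisor n for a population, divisor n - 1 for a sample).

```java
/** Hypothetical helper (not part of the Tally API) that contrasts the two variance formulas. */
class VarianceSketch
{
    /** Variance with divisor n: appropriate when the values are the entire population. */
    static double populationVariance(final double[] values)
    {
        return sumOfSquaredDeviations(values) / values.length;
    }

    /** Variance with divisor n - 1 (Bessel's correction): appropriate when the values are a sample. */
    static double sampleVariance(final double[] values)
    {
        return sumOfSquaredDeviations(values) / (values.length - 1);
    }

    /** Sum of the squared differences between each value and the mean of all values. */
    private static double sumOfSquaredDeviations(final double[] values)
    {
        double sum = 0.0;
        for (double value : values)
        {
            sum += value;
        }
        double mean = sum / values.length;
        double result = 0.0;
        for (double value : values)
        {
            result += (value - mean) * (value - mean);
        }
        return result;
    }

    public static void main(final String[] args)
    {
        double[] values = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        System.out.println("population variance: " + populationVariance(values));
        System.out.println("sample variance:     " + sampleVariance(values));
    }
}
```

The two results differ by the factor n / (n - 1). The same factor (1000 / 999) separates the sample variance and the population variance in the output shown earlier.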

Skewness, kurtosis, excess kurtosis

Skewness is a measure of the asymmetry of the distribution. The skewness of a symmetric distribution is 0.0. A negative skewness indicates that the distribution has a longer tail on the left; a positive skewness indicates a longer tail on the right.

Kurtosis is a measure of the tailedness of the distribution. The kurtosis of a normally distributed population is 3.0. The kurtosis of a uniformly distributed population is 1.8. Larger values of the kurtosis indicate that a distribution has longer tails. A large kurtosis in observed sample values is often caused by the presence of outliers in the sample.

Excess kurtosis is defined as kurtosis minus 3. This makes the excess kurtosis of normally distributed values 0.0 and the excess kurtosis of uniformly distributed values -1.2.
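
The definitions above can be illustrated with a short sketch that computes the population skewness and kurtosis from central moments. The class MomentsSketch is a hypothetical illustration and makes no claim about how the Tally computes these values internally; in particular, the sample variants reported by the Tally differ slightly from the population variants (as the output above shows) and are not reproduced here. The input is an evenly spaced grid on the interval from 0.0 to 1.0, which approximates a uniform distribution.

```java
/** Hypothetical helper (not part of the Tally API): population skewness and kurtosis. */
class MomentsSketch
{
    /** Returns the k-th central moment: the mean of (value - mean)^k over all values. */
    static double centralMoment(final double[] values, final int k)
    {
        double mean = 0.0;
        for (double value : values)
        {
            mean += value / values.length;
        }
        double moment = 0.0;
        for (double value : values)
        {
            moment += Math.pow(value - mean, k) / values.length;
        }
        return moment;
    }

    /** Population skewness: third central moment divided by variance to the power 1.5. */
    static double populationSkewness(final double[] values)
    {
        return centralMoment(values, 3) / Math.pow(centralMoment(values, 2), 1.5);
    }

    /** Population kurtosis: fourth central moment divided by the squared variance. */
    static double populationKurtosis(final double[] values)
    {
        double variance = centralMoment(values, 2);
        return centralMoment(values, 4) / (variance * variance);
    }

    /** Population excess kurtosis: kurtosis minus 3. */
    static double populationExcessKurtosis(final double[] values)
    {
        return populationKurtosis(values) - 3.0;
    }

    /** Returns n evenly spaced values (interval midpoints) between 0.0 and 1.0. */
    static double[] uniformGrid(final int n)
    {
        double[] grid = new double[n];
        for (int i = 0; i < n; i++)
        {
            grid[i] = (i + 0.5) / n;
        }
        return grid;
    }

    public static void main(final String[] args)
    {
        double[] grid = uniformGrid(1000);
        System.out.println("skewness:        " + populationSkewness(grid));
        System.out.println("kurtosis:        " + populationKurtosis(grid));
        System.out.println("excess kurtosis: " + populationExcessKurtosis(grid));
    }
}
```

For this grid the skewness is zero up to rounding error, and the kurtosis and excess kurtosis are very close to the values 1.8 and -1.2 quoted for a uniform distribution.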

Drawing the result in a box plot

A box plot is a nice way to graph minimum, first quartile, median, third quartile and maximum of a distribution. As shown above, the Tally collects minimum and maximum values. The quartiles and the median can only be approximated by the Tally. This is done by calling the getQuantile method:

System.out.println("first quartile:                " + tally.getQuantile(0.25));
System.out.println("median:                        " + tally.getQuantile(0.5));
System.out.println("third quartile:                " + tally.getQuantile(0.75));

The output of these extra statements is something like:

first quartile:                0.29227267857053363
median:                        0.4876875254457153
third quartile:                0.683102372320897

The expected values for these quantiles are 0.25, 0.5 and 0.75; so what is going on? The Tally assumes that the values are normally distributed and then estimates these quantile values from the observed mean and standard deviation. (It is possible to improve that approximation by taking the skewness and kurtosis of the data into account.) In our case, the values that were fed to the Tally were uniformly distributed, and the difference is rather striking. To fix this, we should construct a Tally with a suitable quantile accumulator. In the quantileaccumulator package there are a couple to choose from:

  • NoStorageAccumulator: this requires no memory for the ingested values and it is the one used when no specific quantile accumulator is specified at Tally construction time, as in the example above
  • FullStorageAccumulator: as the name suggests, this stores all ingested values and may therefore require more memory than is available
  • TDigestAccumulator: this one is the most complex, but strikes a good balance between memory use and accuracy
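
The normal-distribution estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the class NormalQuantileSketch is hypothetical, the quartiles are estimated as mean + z × standard deviation, and z is hard-coded for the quartiles because the Java standard library has no inverse normal distribution function. The mean and the sample standard deviation are copied from the output shown earlier.

```java
/** Hypothetical helper (not part of the Tally API): quantile estimate under a normal assumption. */
class NormalQuantileSketch
{
    /** Standard normal quantile for probability 0.75; by symmetry the value for 0.25 is -Z_75. */
    static final double Z_75 = 0.6744897501960817;

    /** Estimates a quantile of a normal distribution from its mean, standard deviation and z-score. */
    static double estimate(final double mean, final double stDev, final double z)
    {
        return mean + z * stDev;
    }

    public static void main(final String[] args)
    {
        // Observed sample mean and sample standard deviation from the example output.
        double mean = 0.4876875254457153;
        double stDev = 0.28971889952660995;
        System.out.println("first quartile: " + estimate(mean, stDev, -Z_75));
        System.out.println("median:         " + estimate(mean, stDev, 0.0));
        System.out.println("third quartile: " + estimate(mean, stDev, Z_75));
    }
}
```

The estimates agree with the quantiles printed above to within 0.0001 (the small remaining difference suggests that the library uses its own approximation of the inverse normal distribution). Because a normal distribution concentrates more of its mass near the mean than a uniform distribution does, both quartile estimates land too close to the median.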

To create a Tally with FullStorageAccumulator, replace the first line of the program by:

Tally tally = new Tally("Example tally with full storage accumulator", new FullStorageAccumulator());

The output for the quantiles is now something like

first quartile:                0.22964815841745587
median:                        0.4753812701997516
third quartile:                0.7515290688493804

These values are the exact quantiles of the ingested values (a.k.a. the ground truth). They differ from the expected values 0.25, 0.5 and 0.75 because we have ingested only a small sample of the entire population. The FullStorageAccumulator is perfect when not too many values will be ingested. When the number of ingested values runs into the millions or billions, the TDigestAccumulator is the best choice for non-uniformly distributed values. The TDigestAccumulator is based on the algorithm by Ted Dunning.
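
The idea behind storing all values can be sketched in a few lines of plain Java. The class StoredQuantileSketch below is a hypothetical illustration, not the FullStorageAccumulator itself: it keeps every ingested value, sorts on demand, and interpolates linearly between the two nearest stored values. The interpolation rule is an assumption; the library may use a different convention.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch (not the library class): exact quantiles by storing every value. */
class StoredQuantileSketch
{
    /** All ingested values; memory use grows linearly with the number of values. */
    private final List<Double> values = new ArrayList<>();

    /** Whether the list is currently in ascending order. */
    private boolean sorted = true;

    /** Stores one value. */
    void ingest(final double value)
    {
        this.values.add(value);
        this.sorted = false;
    }

    /** Returns the quantile for a probability between 0.0 and 1.0, or NaN when nothing is stored. */
    double getQuantile(final double probability)
    {
        if (this.values.isEmpty())
        {
            return Double.NaN;
        }
        if (!this.sorted)
        {
            Collections.sort(this.values);
            this.sorted = true;
        }
        // Fractional position in the sorted list (assumed convention: 0.0 maps to the
        // smallest value and 1.0 to the largest value).
        double position = probability * (this.values.size() - 1);
        int lower = (int) Math.floor(position);
        int upper = Math.min(lower + 1, this.values.size() - 1);
        double fraction = position - lower;
        return this.values.get(lower) + fraction * (this.values.get(upper) - this.values.get(lower));
    }

    public static void main(final String[] args)
    {
        StoredQuantileSketch sketch = new StoredQuantileSketch();
        for (double value : new double[] {5.0, 1.0, 3.0, 2.0, 4.0})
        {
            sketch.ingest(value);
        }
        System.out.println("first quartile: " + sketch.getQuantile(0.25));
        System.out.println("median:         " + sketch.getQuantile(0.5));
        System.out.println("third quartile: " + sketch.getQuantile(0.75));
    }
}
```

The sketch makes the trade-off visible: the result depends only on the stored values, so it is exact for the ingested data, but every value must be kept in memory.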

To use a TDigestAccumulator, construct the Tally like this:

Tally tally = new Tally("Example tally with TDigest accumulator", new TDigestAccumulator());

The output for the quantiles is now

first quartile:                0.229876868954619
median:                        0.4751306678330976
third quartile:                0.7507889885484719

These values match the results of the FullStorageAccumulator within 0.001. For most applications such differences won't matter. For higher precision (at the cost of more memory and CPU-time), the TDigestAccumulator can be constructed with an integer argument (the compression setting) like:

Tally tally = new Tally("Example tally with TDigest accumulator with higher precision", new TDigestAccumulator(1000));

The default value for the compression is 100. The output using the TDigestAccumulator with compression set to 1000 is:

first quartile:                0.22962779231833413
median:                        0.4753812701997516
third quartile:                0.7517244282398716

This matches the output of the FullStorageAccumulator within 0.0002.