## Benchmarking methodology

To compare, or benchmark, algorithms for image analysis, researchers must agree on a common reference point (the image set's "correct" answer) and measure how closely each algorithm matches it. We adopt the common term ground truth for this reference point, even though in some cases it is debatable whether the reference value is the exact "truth."

This page discusses the four types of ground truth available in the BBBC, as well as suitable benchmarking methods for each type. In some cases, there are CellProfiler analysis pipelines available to help with benchmarking.

### C Counts

In this case, the ground truth consists of the number of cells (or other objects) in each image, as counted by one or more humans. For each image, the cell count produced by an algorithm is compared to the ground truth. If there is more than one human counter, the mean of their counts is used as ground truth.

The absolute value of the difference between the algorithm's count and the ground truth is divided by the ground truth to obtain the error (in percent) of the algorithm on that image. The mean error over all images is reported. Sometimes the standard deviation of the error is also reported and compared to the standard deviation of the human counters.
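The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of any BBBC tooling: the function name `count_errors` and the example counts are hypothetical, and it assumes every image has a nonzero ground-truth count (otherwise the percent error is undefined).

```python
from statistics import mean, stdev


def count_errors(algorithm_counts, human_counts):
    """Return the percent error of the algorithm's count for each image.

    algorithm_counts: one count per image.
    human_counts: for each image, a sequence of counts from the human counters.
    Assumes the mean human count for every image is nonzero.
    """
    errors = []
    for algo, humans in zip(algorithm_counts, human_counts):
        truth = mean(humans)  # ground truth is the mean of the human counts
        errors.append(100.0 * abs(algo - truth) / truth)
    return errors


# Hypothetical example: three images, two human counters per image.
algorithm_counts = [98, 210, 55]
human_counts = [(100, 100), (200, 200), (50, 50)]

errors = count_errors(algorithm_counts, human_counts)
mean_error = mean(errors)    # the value that is reported
error_spread = stdev(errors)  # optional: standard deviation of the error
```

Here the per-image errors are 2%, 5%, and 10%, so the reported mean error is about 5.7%.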

### F Foreground and background

In this case, a human produces a binary (black and white) image the same size as the original image. Pixels that belong to the foreground (i.e., the cells or other objects) are white, and pixels that belong to the background are black.

The algorithm's binary segmentation is compared to the ground-truth image pixel by pixel. The precision and recall are reported, and algorithms are ranked by F-factor (i.e., the harmonic mean of precision and recall). The CellProfiler module CalculateImageOverlap can be used to make these calculations. The module takes two binary images as input: the segmentation result to be assessed and the ground truth.
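For readers who want to check the numbers by hand, the following is a minimal sketch of the pixel-wise calculation. It is not the implementation used by CalculateImageOverlap; the function name `precision_recall_f` and the tiny example images are hypothetical, and images are represented as nested lists where a truthy value means foreground.

```python
def precision_recall_f(result, truth):
    """Pixel-wise precision, recall, and F-factor for two binary images.

    result, truth: equally sized nested lists; truthy pixels are foreground.
    """
    tp = fp = fn = 0
    for result_row, truth_row in zip(result, truth):
        for r, t in zip(result_row, truth_row):
            if r and t:
                tp += 1  # foreground in both images
            elif r and not t:
                fp += 1  # algorithm says foreground, ground truth says background
            elif t and not r:
                fn += 1  # algorithm missed a foreground pixel
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_factor = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_factor


# Hypothetical 2x3 images.
truth = [[1, 1, 0],
         [0, 1, 0]]
result = [[1, 0, 0],
          [0, 1, 1]]

precision, recall, f_factor = precision_recall_f(result, truth)
```

In this example there are two true positives, one false positive, and one false negative, so precision, recall, and F-factor are all 2/3.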

### O Outlines of individual objects

In this case, a human outlines each cell in the image in order to indicate which pixels belong to which cell. The ground truth is provided as binary images, with black outlines on a white background.

To compare an algorithm's results to the manual outlines, consider each pixel that is on the boundary found by the algorithm and that is not adjacent to any background pixels. For each such pixel, compute the Euclidean distance to the corresponding pixel on the manually found outline. Report the percentage of relevant pixels that are within two pixels of the corresponding pixel on the ground-truth outline.

The benchmarking pipeline "Edges Between Objects" on the CellProfiler Examples page can be used to make these calculations.
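The final step of the outline comparison can be sketched as follows. This is a simplified illustration under stated assumptions, not the "Edges Between Objects" pipeline: it assumes the relevant algorithm boundary pixels (those not adjacent to background) have already been selected, it interprets the "corresponding" ground-truth pixel as the nearest ground-truth outline pixel, and it treats "within two pixels" as a Euclidean distance of at most 2. The function name `percent_within` and the coordinates are hypothetical.

```python
from math import hypot


def percent_within(algorithm_pixels, truth_pixels, tolerance=2.0):
    """Percentage of algorithm boundary pixels close to the ground-truth outline.

    algorithm_pixels: (row, col) pairs of the relevant algorithm boundary pixels.
    truth_pixels: (row, col) pairs of the ground-truth outline pixels.
    tolerance: maximum Euclidean distance (in pixels) counted as a match.
    """
    if not algorithm_pixels:
        raise ValueError("no algorithm boundary pixels to evaluate")
    close = 0
    for r, c in algorithm_pixels:
        # Assumption: the corresponding pixel is the nearest outline pixel.
        nearest = min(hypot(r - tr, c - tc) for tr, tc in truth_pixels)
        if nearest <= tolerance:
            close += 1
    return 100.0 * close / len(algorithm_pixels)


# Hypothetical example: a vertical ground-truth outline and a drifting boundary.
truth_pixels = [(0, 0), (1, 0), (2, 0), (3, 0)]
algorithm_pixels = [(0, 1), (1, 2), (2, 3), (3, 5)]

score = percent_within(algorithm_pixels, truth_pixels)
```

The brute-force nearest-pixel search is quadratic in the number of pixels; for full-size images a distance transform of the ground-truth outline would be a more practical choice.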

### B Biological labels

In these cases, the experiments have been prepared with control samples for which we know the expected biological result. The types of controls that are available dictate the type of statistic that can be calculated. The following two statistics are widely used:

- Z'-factor: The Z'-factor indicates how well the algorithm is able to separate the positive and negative controls, given the variation present in both control populations. This factor can be calculated whenever multiple positive and negative control samples are available. See Zhang et al., J. Biomol. Screen., 1999.
- V-factor: The V-factor analyzes all the data along a dose-response curve rather than just the positive and negative controls alone, and is in many ways more appropriate for image-based assays. In a dose-response collection of images, each sample is treated with a different dose, and the resulting response of the biological system should be detectable, usually producing a sigmoid-curve response to dose. Reference: Ravkin I, "Quality measures for imaging-based cellular assays," Society Biomol. Screen. Conference Posters, 2004. See also http://www.ravkin.net/SBS/Literature.htm

For both the Z'-factor and the V-factor, the highest possible value (best assay quality) is 1. Negative values indicate that distinguishing between positive and negative controls is difficult or impossible. An assay with a Z'-factor > 0 is potentially suitable for a high-throughput screen; an assay with a Z'-factor > 0.5 is considered excellent.
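As an illustration, the Z'-factor is commonly written as 1 - 3(σ_p + σ_n) / |μ_p - μ_n|, where μ and σ are the mean and standard deviation of the positive (p) and negative (n) controls. The sketch below computes it in Python. It is not CellProfiler's CalculateStatistics code; the function name `z_prime_factor` and the control values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def z_prime_factor(positive, negative):
    """Z'-factor from per-sample measurements of the two control populations.

    positive, negative: sequences with at least two measurements each.
    Uses the sample standard deviation (an assumption of this sketch).
    """
    separation = abs(mean(positive) - mean(negative))
    if separation == 0:
        return float("-inf")  # identical means: the controls cannot be separated
    return 1.0 - 3.0 * (stdev(positive) + stdev(negative)) / separation


# Hypothetical control measurements.
positive_controls = [9.0, 10.0, 11.0]
negative_controls = [1.0, 2.0, 3.0]

z_prime = z_prime_factor(positive_controls, negative_controls)
```

With these values the means differ by 8 and each standard deviation is 1, giving a Z'-factor of 0.25: above 0, but below the 0.5 threshold for an excellent assay.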

The CellProfiler software package can calculate Z'-factors and V-factors. See the pipeline "Human cytoplasm-nucleus translocation assay (SBS)" for how to make these calculations on images that are processed in CellProfiler using the CalculateStatistics module or data tool.

CellProfiler's CalculateStatistics data tool can also make these calculations on data produced outside CellProfiler. To do this, download, install, and launch CellProfiler, then add the LoadText, CalculateStatistics, and ExportToExcel modules (in ExportToExcel, choose to export 'Experiment'). Load this CellProfiler pipeline into CellProfiler to get started. There is a short demo on constructing a small pipeline in CellProfiler on the examples page. Format your data in a text file as instructed in the Help for the LoadText module. Set the DefaultOutputFolder to your preferred output location and click "Analyze images". The calculated statistics (Z'-factor, V-factor, and EC50) will be written to Excel-compatible files.