Review Normalizer

Reviewers do not assign scores to papers in the same way: some tend to give lower scores and some higher. Inconsistent scoring can affect which papers are discussed in program committee meetings, which papers are advocated for, and how strongly.

This tool is intended to help program committee chairs compensate for scoring bias. It suggests a compensation to normalize review scores, and plots average score versus normalized score for each paper.

All processing is done locally in the browser: your data will not be transmitted elsewhere.

1 Upload the CSV file containing the review scores. The first line should read:
Paper,Reviewer,Score

Paper and Reviewer names may be any strings. Scores are numeric but need not be integral or positive. [example file] [How to extract this from HotCRP]
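
For concreteness, here is a minimal sketch of reading scores in this format. It is purely illustrative (the tool itself parses the file in the browser) and the function name is invented:

  import csv

  def load_scores(path):
      # Return a list of (paper, reviewer, score) tuples from a CSV
      # whose first line is the header: Paper,Reviewer,Score
      with open(path, newline="") as f:
          reader = csv.DictReader(f)
          return [(row["Paper"], row["Reviewer"], float(row["Score"]))
                  for row in reader]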

2 (Optional) Choose a compensation method:
Deviation from mean (per paper): This compensation method compares how each reviewer's scores differ from the mean score on each paper that the reviewer reviewed. A reviewer who tends to give high or low scores is not necessarily considered biased unless their scores consistently differ from how other reviewers scored the same papers.
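
As an illustration (a hypothetical sketch, not the tool's actual code), the per-paper method could be computed as follows, using the (paper, reviewer, score) records from the loading sketch above. Note this simple version includes each reviewer's own score in the per-paper mean:

  from collections import defaultdict
  from statistics import mean

  def per_paper_compensation(records):
      # records: list of (paper, reviewer, score) tuples.
      # Returns each reviewer's average deviation from the per-paper
      # mean; subtracting it from the reviewer's scores normalizes them.
      scores_by_paper = defaultdict(list)
      for paper, _, score in records:
          scores_by_paper[paper].append(score)
      paper_mean = {p: mean(s) for p, s in scores_by_paper.items()}

      deviations = defaultdict(list)
      for paper, reviewer, score in records:
          deviations[reviewer].append(score - paper_mean[paper])
      return {r: mean(d) for r, d in deviations.items()}
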
Deviation from mean (all papers: crude): This too-simple compensation method compares each reviewer's average score across all papers against other reviewers' average scores, under the assumption that the pools of papers assigned to different reviewers do not differ significantly. This is the method people usually assume revNorm uses when they first hear about it, but it is not recommended: in practice, the sets of papers assigned to different reviewers differ enough that this method produces poor results. revNorm supports it primarily for comparison purposes.
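
For comparison, the crude all-papers variant reduces to the following (again a hypothetical sketch, not the tool's code):

  from collections import defaultdict
  from statistics import mean

  def all_papers_compensation(records):
      # Crude variant: compare each reviewer's mean score across all of
      # their papers against the overall mean of every score in the pool.
      overall = mean(score for _, _, score in records)
      by_reviewer = defaultdict(list)
      for _, reviewer, score in records:
          by_reviewer[reviewer].append(score)
      return {r: mean(s) - overall for r, s in by_reviewer.items()}
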
Maximum likelihood (slower, more sophisticated): In this (usually best) method, the review normalizer constructs a simple linear model for each reviewer, characterized by two parameters: bias and gain. Bias is a constant offset relative to the average reviewer; the average bias is zero. A positive bias means the reviewer is estimated to give higher scores on average, and a negative bias means lower. Gain is a linear multiplier applied to the deviation from the average score, capturing the fact that some reviewers assign a wider spread of scores; the average gain is 1.0. A larger gain means the reviewer is estimated to give more extreme scores, and a smaller gain means more central scores. Thus, this simple linear model predicts that the measured review score will be:
  ((true score) - (average score)) * gain + bias 
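
As written, the expression gives the score relative to the average; adding the average back yields a prediction on the original score scale. A small worked example with invented numbers (a hypothetical helper, not the tool's code):

  def predicted_score(true_score, average_score, bias, gain):
      # The linear model above, with the average added back so the
      # prediction lands on the original score scale.
      return (true_score - average_score) * gain + bias + average_score

  # A reviewer with bias +0.5 and gain 1.2, reviewing a paper whose true
  # score is 4.0 when the average score is 3.0, is predicted to report
  # (4.0 - 3.0) * 1.2 + 0.5 + 3.0 = 4.7.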

This compensation method tries to find, for each reviewer, the bias and gain that maximize the likelihood of the scores actually given, using a standard quasi-Newton nonlinear minimization method (BFGS).

The method is controlled by three parameters whose initial values are probably already reasonable for most conferences:

  • σb represents the expected spread of reviewer biases (around 0).
  • σg represents the expected spread of reviewer gains (around 1).
  • σr represents the expected amount of random noise that reviewers add to their scores.

The estimated likelihood of a particular set of estimated true scores, estimated biases, and estimated gains is computed from three corresponding terms: a score term based on the difference between each observed score and the predicted score, a bias term based on how far each estimated bias is from zero, and a gain term based on how far each estimated gain is from 1.
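
Putting the pieces together, here is a hedged sketch of the penalized objective and the BFGS fit, in Python with NumPy/SciPy. It illustrates the three terms described above; the actual tool runs in the browser, the default σ values are invented placeholders, and the prediction adds the average score back as discussed earlier:

  import numpy as np
  from scipy.optimize import minimize

  def fit_maximum_likelihood(records, sigma_b=0.5, sigma_g=0.2, sigma_r=1.0):
      # Jointly estimate true paper scores, reviewer biases, and reviewer
      # gains by minimizing a negative log-likelihood built from the three
      # terms above: a score term, a bias term (distance from 0), and a
      # gain term (distance from 1).
      papers = sorted({p for p, _, _ in records})
      reviewers = sorted({r for _, r, _ in records})
      pi = {p: i for i, p in enumerate(papers)}
      ri = {r: i for i, r in enumerate(reviewers)}
      avg = np.mean([s for _, _, s in records])
      nP, nR = len(papers), len(reviewers)

      def objective(x):
          true = x[:nP]                 # estimated true scores
          bias = x[nP:nP + nR]          # estimated biases
          gain = x[nP + nR:]            # estimated gains
          nll = 0.0
          for p, r, score in records:
              pred = (true[pi[p]] - avg) * gain[ri[r]] + bias[ri[r]] + avg
              nll += ((score - pred) / sigma_r) ** 2    # score term
          nll += np.sum((bias / sigma_b) ** 2)          # bias term
          nll += np.sum(((gain - 1.0) / sigma_g) ** 2)  # gain term
          return nll

      x0 = np.concatenate([np.full(nP, avg), np.zeros(nR), np.ones(nR)])
      result = minimize(objective, x0, method="BFGS")
      true = result.x[:nP]
      bias = result.x[nP:nP + nR]
      gain = result.x[nP + nR:]
      return (dict(zip(papers, true)),
              dict(zip(reviewers, bias)),
              dict(zip(reviewers, gain)))

One design point worth noting: without the bias and gain penalty terms the model would be unidentifiable (a constant offset could be shifted freely between every bias and every true score), so the σ parameters are what pin the estimates down.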

Scores within the range 0–6 are recommended for fast convergence. Since this method constructs a two-parameter model for each reviewer, it needs more data than the other methods, which construct only a one-parameter model. Reviewers who have reviewed few papers will tend to have their estimated bias and gain discounted somewhat (pulled toward the averages) because there is not enough data to model them accurately.

3 (Optional) Upload a CSV file giving an additional label for each paper (e.g., Accept, Reject). [example] The first line should read:
Paper,Label
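
In the same illustrative spirit as the sketches above (invented helper, not the tool's code), the labels file can be read as:

  import csv

  def load_labels(path):
      # Return {paper: label} from a CSV whose first line is: Paper,Label
      with open(path, newline="") as f:
          return {row["Paper"]: row["Label"] for row in csv.DictReader(f)}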