Review Normalizer

Reviewers do not assign scores to papers in the same way: some tend to score lower, others higher. Inconsistent scoring can affect which papers are discussed in program committee meetings, which papers are advocated for, and how strongly.

This tool is intended to help program committee chairs compensate for scoring bias. It suggests a compensation to normalize review scores, and plots average score versus normalized score for each paper.

All processing is done locally in the browser: your data will not be transmitted elsewhere.

1 Upload the CSV file containing the review scores. The first line should read:

Paper and Reviewer names may be any strings. Scores are numeric but need not be integral or positive. [example file] [How to extract this from HotCRP]

2 (Optional) Choose a compensation method:
Deviation from mean (per paper): This compensation method compares each reviewer's scores with the mean score on each paper that the reviewer reviewed. A reviewer who tends to give high or low scores is not considered biased unless their scores consistently differ from those of other reviewers on the same papers.
Deviation from mean (all papers: crude): This overly simple compensation method compares each reviewer's average score across all papers with the other reviewers' average scores, under the assumption that the pools of papers assigned to reviewers do not differ significantly from one another. This is the method that people usually assume revNorm uses when they first hear about it, but it is not recommended.
Maximum likelihood (slower, more sophisticated):

This compensation method tries to find the bias and gain for each reviewer that maximize the likelihood of the observed scores. Gain captures the tendency of some reviewers to assign a wider spread of scores than others.

This method is controlled by three parameters whose initial values are probably already reasonable for most conferences. First, σb represents the expected spread in bias ratings (around 0). Second, σg represents the expected spread in gain ratings (around 1). Third, σr represents the expected amount of random noise that reviewers add to their scores.

The estimated likelihood of a particular set of estimated true scores, estimated biases, and estimated gains is generated using three corresponding terms: a score term based on the difference between the observed score for each paper and the predicted score, a bias term based on how far each estimated bias is from zero, and a gain term based on each estimated gain's distance from 1.
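The three terms described above can be sketched as a single objective to minimize. This is only an illustration under stated assumptions, not revNorm's actual code: it assumes Gaussian noise and priors, assumes the predicted score is gain × true score + bias, and uses placeholder values for the three σ parameters. The function name and argument layout are hypothetical.

```python
def neg_log_likelihood(reviews, true_score, bias, gain,
                       sigma_b=1.0, sigma_g=1.0, sigma_r=1.0):
    """Negative log-likelihood (up to an additive constant).

    reviews:    list of (paper, reviewer, observed_score) tuples
    true_score: {paper: estimated true score}
    bias, gain: {reviewer: estimate}
    sigma_*:    placeholder values, NOT the tool's defaults
    """
    # Score term: observed score vs. predicted score.
    # ASSUMPTION: predicted = gain * true score + bias.
    score_term = sum(
        (s - (gain[r] * true_score[p] + bias[r])) ** 2
        for p, r, s in reviews
    ) / (2 * sigma_r ** 2)
    # Bias term: how far each estimated bias is from zero.
    bias_term = sum(b ** 2 for b in bias.values()) / (2 * sigma_b ** 2)
    # Gain term: how far each estimated gain is from 1.
    gain_term = sum((g - 1.0) ** 2 for g in gain.values()) / (2 * sigma_g ** 2)
    return score_term + bias_term + gain_term


example_reviews = [("P1", "alice", 5.0), ("P1", "bob", 3.0)]
nll = neg_log_likelihood(
    example_reviews,
    true_score={"P1": 4.0},
    bias={"alice": 1.0, "bob": -1.0},
    gain={"alice": 1.0, "bob": 1.0},
)
```

Under these assumptions, a smaller σb penalizes nonzero biases more heavily, so the fit explains disagreement as random noise rather than as reviewer bias; a larger σb does the opposite.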

Scores within the range 0–6 are recommended for fast convergence.
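To make the difference between the two deviation-from-mean methods concrete, here is a minimal sketch of how each might estimate a reviewer's bias. The function names, the tuple layout, and the sample data are all hypothetical, and the crude variant assumes "other reviewers' average scores" means the mean of all reviewers' averages; revNorm's own computation may differ.

```python
from collections import defaultdict


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def per_paper_bias(reviews):
    """Mean of (score - mean score of that paper), per reviewer."""
    by_paper = defaultdict(list)
    for paper, _, score in reviews:
        by_paper[paper].append(score)
    paper_mean = {p: _mean(s) for p, s in by_paper.items()}
    deviations = defaultdict(list)
    for paper, reviewer, score in reviews:
        deviations[reviewer].append(score - paper_mean[paper])
    return {r: _mean(d) for r, d in deviations.items()}


def all_papers_bias(reviews):
    """Crude variant: reviewer's average minus the mean of all reviewers' averages."""
    by_reviewer = defaultdict(list)
    for _, reviewer, score in reviews:
        by_reviewer[reviewer].append(score)
    reviewer_mean = {r: _mean(s) for r, s in by_reviewer.items()}
    overall = _mean(reviewer_mean.values())  # ASSUMPTION: unweighted mean of averages
    return {r: m - overall for r, m in reviewer_mean.items()}


# Hypothetical data: (paper, reviewer, score)
reviews = [
    ("P1", "alice", 5.0), ("P1", "bob", 3.0),
    ("P2", "alice", 2.0), ("P2", "carol", 2.0),
    ("P3", "bob", 4.0), ("P3", "carol", 6.0),
]

print(per_paper_bias(reviews))   # alice 0.5, bob -1.0, carol 0.5
print(all_papers_bias(reviews))  # alice and bob about -0.167, carol about 0.333
```

In this made-up data, bob scores one point below the paper mean on both of his papers, which the per-paper method picks up. The crude method rates him the same as alice because he happened to review stronger papers.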

3 (Optional) Upload a CSV file containing additional labels for papers (e.g., Accept, Reject). [example] The first line should read: