Agreement, the F-measure, and reliability in information retrieval.
Information retrieval studies that involve searching the Internet or marking phrases usually lack a well-defined number of negative cases. This prevents the use of traditional interrater reliability metrics like the kappa statistic to assess the quality of expert-generated gold standards. Such studies often quantify system performance as precision, recall, and F-measure, or as agreement. It can be shown that the average F-measure among pairs of experts is numerically identical to the average positive specific agreement among the experts, and that kappa approaches these measures as the number of negative cases grows large.
Author(s): Hripcsak, George; Rothschild, Adam S.
DOI: 10.1197/jamia.M1733
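The relationship stated in the abstract is easy to verify numerically for a single pair of raters. The sketch below is not from the paper; the counts a, b, c, d are hypothetical. With a = cases both raters mark positive, b and c = cases only one rater marks positive, and d = cases both mark negative, it checks that the F-measure (treating one rater as the gold standard) equals positive specific agreement, 2a/(2a+b+c), and that Cohen's kappa converges to that same value as the number of jointly negative cases d grows.

# Minimal numeric check (hypothetical counts, not data from the paper).
# Two raters each mark every case positive or negative:
#   a = both positive, b = only rater 1 positive,
#   c = only rater 2 positive, d = both negative.

def f_measure(a, b, c):
    # Treat rater 2 as the gold standard: precision = a/(a+b), recall = a/(a+c).
    precision = a / (a + b)
    recall = a / (a + c)
    return 2 * precision * recall / (precision + recall)  # simplifies to 2a/(2a+b+c)

def positive_specific_agreement(a, b, c):
    # Specific agreement on positive ratings; never uses the negative count d.
    return 2 * a / (2 * a + b + c)

def kappa(a, b, c, d):
    # Cohen's kappa from the full 2x2 table; requires a well-defined d.
    n = a + b + c + d
    p_obs = (a + d) / n
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

a, b, c = 40, 10, 6
print(f_measure(a, b, c))                    # 0.8333...
print(positive_specific_agreement(a, b, c))  # identical: 0.8333...
for d in (100, 10_000, 1_000_000):           # kappa -> 0.8333... as d grows
    print(d, kappa(a, b, c, d))

Because neither the F-measure nor positive specific agreement involves d, both remain defined when the negative class is unbounded, which is exactly the situation the abstract describes.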