+# Pseudo-Perplexity with BERT to Estimate OCR Quality
+Measuring the quality of OCR-extracted text is a challenge. Well-established OCR evaluation metrics such as character error rate (CER) and word error rate (WER) require a ground truth against which the extracted text can be compared. However, when OCR is applied in real-world scenarios, e.g. in retro-digitisation projects in libraries, ground-truth texts are typically not available. At the same time, historical texts pose a particular challenge to OCR software, often leading to low-quality results.
+To alleviate this problem, [Ströbel et al. (2022)](https://arxiv.org/pdf/2201.06170) proposed several ground-truth-free metrics to estimate the quality of OCR texts. Among them, they suggest using pseudo-perplexity scores from masked language models (MLMs). Their paper shows that pseudo-perplexity correlates well with the actual text quality.
+This repository provides different approaches to calculate the pseudo-perplexity of a text snippet with a model from the BERT family.
+The script `compute_pppl.py` calculates word-level pseudo-perplexities, while `run_lmppl.py` runs the [lmppl repository](https://github.com/asahi417/lmppl/tree/main) to calculate pseudo-perplexities at the sentence level.
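As a rough illustration of the quantity involved, pseudo-perplexity is commonly defined as the exponential of the negative mean log-probability that an MLM assigns to each token when that token alone is masked. The sketch below shows only this aggregation step. The names `pseudo_perplexity` and `uniform_log_prob` are hypothetical and are not taken from `compute_pppl.py` or lmppl; the toy scorer stands in for a real BERT model, which would supply the masked log-probabilities in practice.

```python
import math
from typing import Callable, Sequence

# Hypothetical signature: given all tokens and a position, return the
# log-probability of the token at that position when it is masked.
MaskedScorer = Callable[[Sequence[str], int], float]


def pseudo_perplexity(tokens: Sequence[str], masked_log_prob: MaskedScorer) -> float:
    """Return exp(-mean masked log-probability) over all tokens."""
    if not tokens:
        raise ValueError("cannot score an empty token sequence")
    # Sum the log-probability of each token, masking one position at a time.
    total = sum(masked_log_prob(tokens, i) for i in range(len(tokens)))
    return math.exp(-total / len(tokens))


def uniform_log_prob(tokens: Sequence[str], position: int) -> float:
    """Toy stand-in for an MLM: every token gets probability 1/4."""
    return math.log(1.0 / 4.0)


if __name__ == "__main__":
    print(pseudo_perplexity(["the", "cat", "sat"], uniform_log_prob))
```

With the uniform toy scorer the result equals the assumed vocabulary size of 4, which is a convenient sanity check. Lower values mean the model finds the text more predictable, which is the signal used here as a proxy for OCR quality.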
 def parse_args():
     parser = argparse.ArgumentParser(description="Run pseudo-perplexity calculation")
-    parser.add_argument("-m", "--model-name", type=str, default="bert-base-multilingual-uncased", help="Model name")
+    parser.add_argument("-m", "--model-name", type=str, default="bert-base-multilingual-uncased", help="Model name or path")
     parser.add_argument("-i", "--input-path", type=str, default="data/sentences", help="Path to the input directory")
     parser.add_argument("-o", "--output-path", type=str, default="data/pppl_per_sent", help="Path to the output directory")
     parser.add_argument("-b", "--batch-size", type=int, default=32, help="Batch size")