Measuring the quality of OCR-extracted text is a challenge. Well-established OCR evaluation metrics such as character error rate (CER) and word error rate (WER) require a ground truth against which the extracted text can be compared. In real-world scenarios, however, e.g. in retro-digitalisation projects in libraries, ground-truth texts are typically not available. At the same time, historical texts pose a particular challenge to OCR software, often leading to low-quality results.
To alleviate this problem, [Ströbel et al. (2022)](https://arxiv.org/pdf/2201.06170) have proposed several ground-truth-free metrics to estimate the quality of OCR texts. Among other metrics, they suggest using pseudo-perplexity scores from masked language models (MLMs) to estimate the quality of OCR-extracted text. In their paper, they show that pseudo-perplexity correlates well with the actual text quality.
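For reference, pseudo-perplexity (following Salazar et al., 2020) is computed by masking each token in turn and scoring it given its context:

$$
\mathrm{PPPL}(W) = \exp\left(-\frac{1}{|W|}\sum_{i=1}^{|W|}\log P_{\mathrm{MLM}}(w_i \mid W_{\setminus i})\right)
$$

where $W = (w_1, \dots, w_{|W|})$ is the token sequence and $W_{\setminus i}$ is the sequence with $w_i$ masked. The lower the score, the more plausible the text is to the model.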
This repository provides two approaches to calculating the pseudo-perplexity of a text snippet with a model from the BERT family: the script `compute_pppl.py` calculates word-level pseudo-perplexities, while `run_lmppl.py` executes the [lmppl repository](https://github.com/asahi417/lmppl/tree/main) to calculate pseudo-perplexities at sentence level.
A problem with OCR-extracted text is that sentence boundaries are not known. Due to errors in the extracted text, common sentence splitters like NLTK may fail to identify sentence boundaries correctly.
To circumvent the challenge of splitting bad-quality text into sentences, this repository uses a sliding window that moves over the entire text token by token. The extracted text windows have a fixed size (11 tokens by default), and the token in the middle of each window is masked. The sequence consisting of the masked target token and the context tokens on both sides (e.g. 5 tokens on the left and 5 on the right) is then passed to BERT to calculate the pseudo-perplexity of the masked target token.
To calculate the pseudo-perplexities, any Hugging Face model from the BERT family can be used. For our experiments, we used `bert-base-multilingual-uncased`.
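For illustration, the word-level scoring works roughly as follows. This is a minimal sketch, not the exact implementation of `compute_pppl.py`; it assumes `torch` and `transformers` are installed and, to keep things short, scores only the first subword of the target word:

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

def window_pppl(words, center, half_window=5):
    """Pseudo-perplexity of words[center], masked inside a fixed-size window."""
    left = words[max(0, center - half_window):center]
    right = words[center + 1:center + half_window + 1]
    masked_text = " ".join(left + [tokenizer.mask_token] + right)
    inputs = tokenizer(masked_text, return_tensors="pt")
    mask_pos = inputs.input_ids[0].tolist().index(tokenizer.mask_token_id)
    # NOTE: a word may split into several subwords; this sketch only
    # scores the first subword of the target word.
    target_id = tokenizer(words[center], add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
    return math.exp(-log_probs[target_id].item())

words = "Liebe die Schwestern und Brüder in der Ferne".split()
print(window_pppl(words, center=1))  # score for "die"
```

The full script is run as follows: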
```bash
python3 compute_pppl.py -m your-model-name -i path/to/your/data -o path/to/output/directory --window-size 11
```
As input, the script expects a JSON file with the following structure:
```json
[
  {
    "page_id": "ocr_27812752_p1.json",
    "content": [
      {
        "word": "Liebe",
        "index": "ocr_27812752_p1_w27",
        "error": 0
      },
      {
        "word": "die",
        "index": "ocr_27812752_p1_w28",
        "error": 0
      }
    ]
  }
]
```
The output consists of JSON files containing the pseudo-perplexity scores:
```json
[
  {
    "page_id": "ocr_27812752_p1.json",
    "content": [
      {
        "word": "Liebe",
        "index": "ocr_27812752_p1_w27",
        "error": 0,
        "pppl": 4852.421617633325
      },
      {
        "word": "die",
        "index": "ocr_27812752_p1_w28",
        "error": 0,
        "pppl": 488.1390946218524
      }
    ]
  }
]
```
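To judge a whole page rather than individual words, the per-word scores can be aggregated, e.g. into a mean pseudo-perplexity per page. A minimal sketch over the output format shown above (the file path is a placeholder):

```python
import json

# Placeholder path; point this at one of the output files shown above.
with open("path/to/output/ocr_27812752_p1.json") as f:
    pages = json.load(f)

for page in pages:
    scores = [word["pppl"] for word in page["content"]]
    print(f"{page['page_id']}: mean pppl = {sum(scores) / len(scores):.2f}")
```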
To calculate the pseudo-perplexity per sentence, we use the [Language Model Perplexity (LM-PPL)](https://github.com/asahi417/lmppl) repository by Ushio & Clarke. The library expects a list of sentences as input, so the OCR text must be split into sentences beforehand. It then masks each token in the sentence once and calculates the pseudo-perplexity score across the entire sentence.
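Called directly, lmppl's masked-LM scorer looks roughly like this (a sketch; the model name is just an example):

```python
import lmppl

# Sentence-level pseudo-perplexity with a BERT-family model.
scorer = lmppl.MaskedLM("bert-base-multilingual-uncased")

sentences = [
    "Sonder Zweifel niemand besser, als eben Er selber.",
    "Ihn will ich also fragen; Er soll mir antworten.",
]
scores = scorer.get_perplexity(sentences)  # one pseudo-perplexity per sentence
print(list(zip(sentences, scores)))
```

The script `run_lmppl.py` wraps this scorer: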
```bash
python3 run_lmppl.py -m your-model-name -i path/to/your/data -o path/to/output/directory
```
As input, the script expects a JSON file with the following structure:
```json
[
  {
    "sent_id": "ocr_26843985_p4_6",
    "ocr": "Sonder Zweifel niemand besser, als eben Er selber.",
    "gt": "Sonder Zweifel niemand besser, als eben Er selber.",
    "cer": 0.0,
    "wer": 0.0
  },
  {
    "sent_id": "ocr_26843985_p4_7",
    "ocr": "Ihn will ich also fragen; Er soll mir antworten.",
    "gt": "Ihn will ich also fragen; Er soll mir antworten.",
    "cer": 0.0,
    "wer": 0.0
  }
]
```
The output consists of JSON files containing the pseudo-perplexity scores:
```json
[
  {
    "sent_id": "ocr_26843985_p4_6",
    "ocr": "Sonder Zweifel niemand besser, als eben Er selber.",
    "gt": "Sonder Zweifel niemand besser, als eben Er selber.",
    "cer": 0.0,
    "wer": 0.0,
    "pppl": 188.4483450695001
  },
  {
    "sent_id": "ocr_26843985_p4_7",
    "ocr": "Ihn will ich also fragen; Er soll mir antworten.",
    "gt": "Ihn will ich also fragen; Er soll mir antworten.",
    "cer": 0.0,
    "wer": 0.0,
    "pppl": 22.069765228493164
  }
]
```
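Because the sentence-level output keeps the reference-based `cer`/`wer` fields next to `pppl`, the correlation reported by Ströbel et al. (2022) can be checked directly on your own data, e.g. with a rank correlation. A minimal sketch, assuming `scipy` is installed (the file path is a placeholder):

```python
import json
from scipy.stats import spearmanr

# Placeholder path; point this at one of the output files shown above.
with open("path/to/output/ocr_26843985.json") as f:
    sentences = json.load(f)

cer = [s["cer"] for s in sentences]
pppl = [s["pppl"] for s in sentences]

# Rank correlation: does higher pseudo-perplexity go with higher CER?
rho, p = spearmanr(cer, pppl)
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")
```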