# Pseudo-Perplexity with BERT to Estimate OCR Quality

Measuring the quality of OCR-extracted text is a challenge. Well-established OCR evaluation metrics, such as character error rate (CER) and word error rate (WER), require a ground truth against which the extracted text can be compared. However, when applying OCR in real-world scenarios, e.g. in retro-digitisation projects in libraries, ground-truth texts are typically not available. At the same time, historical texts pose a particular challenge to OCR software, often leading to low-quality results.

To alleviate this problem, [Ströbel et al. (2022)](https://arxiv.org/pdf/2201.06170) have proposed several ground-truth-free metrics to estimate the quality of OCR texts. Among other metrics, they suggest using pseudo-perplexity scores from masked language models (MLMs). In their paper, they show that pseudo-perplexity correlates well with actual text quality.
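
For reference, the pseudo-perplexity of a token sequence W = (w_1, ..., w_n) is the exponentiated average negative pseudo-log-likelihood, where each token is masked in turn and scored given the rest of the sequence:

```math
\mathrm{PPPL}(W) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n}\log P_{\mathrm{MLM}}\left(w_i \mid W_{\setminus i}\right)\right)
```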

This repository provides two approaches to calculating the pseudo-perplexity of a text snippet with a model from the BERT family: the script `compute_pppl.py` calculates word-level pseudo-perplexities, while `run_lmppl.py` uses the [lmppl library](https://github.com/asahi417/lmppl/tree/main) to calculate pseudo-perplexities at sentence level.

## Word-Level Pseudo-Perplexity with BERT

A problem with OCR-extracted text is that sentence boundaries are not known. Due to errors in the extracted text, common sentence splitters like NLTK might fail to identify sentence boundaries correctly.

To circumvent the challenge of splitting low-quality text into sentences, this repository uses a sliding window that moves over the entire text token by token. The extracted text windows are of a fixed size (11 tokens by default), and the token in the middle of the window is masked. This sequence, consisting of the masked target token and the context tokens on both sides (e.g. 5 tokens on the left and 5 on the right), is passed to BERT to calculate the pseudo-perplexity of the masked target token.
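
As an illustration, here is a minimal sketch of this sliding-window scoring. It is not the exact implementation of `compute_pppl.py`: it assumes whitespace-tokenised input and scores each word by masking all of its subword tokens at once.

```python
# Minimal sketch of sliding-window pseudo-perplexity scoring (illustrative,
# not the exact implementation of compute_pppl.py).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-multilingual-uncased"  # any BERT-family model works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def word_pppl(words, window_size=11):
    """Return one pseudo-perplexity score per word."""
    half = window_size // 2
    scores = []
    for i, target in enumerate(words):
        left = words[max(0, i - half):i]    # up to `half` words of left context
        right = words[i + 1:i + 1 + half]   # up to `half` words of right context
        # The target word may split into several subword tokens; mask each one.
        target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
        masked = left + [tokenizer.mask_token] * len(target_ids) + right
        inputs = tokenizer(" ".join(masked), return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Score the original subword tokens at the masked positions.
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
        nll = -log_probs[torch.arange(len(target_ids)), torch.tensor(target_ids)].mean()
        scores.append(torch.exp(nll).item())  # exponentiate to get a perplexity
    return scores

print(word_pppl("Liebe die Freunde , wie geht es euch ?".split()))
```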

To calculate the pseudo-perplexities, any Hugging Face model from the BERT family can be used. For our experiments, we used `bert-base-multilingual-uncased`.

### Getting Started

To calculate the pseudo-perplexity per word, run:

```bash
# Install the dependencies (torch is needed as the backend for transformers)
pip install torch transformers tqdm

python3 compute_pppl.py -m your-model-name -i path/to/your/data -o path/to/output/directory --window-size 11
```

As input, the script expects a JSON file with the following structure:

```json
[
    {
        "page_id": "ocr_27812752_p1.json",
        "content": [
            {
                "word": "Liebe",
                "index": "ocr_27812752_p1_w27",
                "error": 0
            },
            {
                "word": "die",
                "index": "ocr_27812752_p1_w28",
                "error": 0
            }
        ]
    }
]
```
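
If the OCR output is only available as plain text, an input file in this format can be generated with a few lines of Python. The helper below is hypothetical and not part of the repository; it follows the `index` pattern from the example above and sets `error` to 0 as a placeholder:

```python
# Hypothetical helper (not part of this repository): wrap a plain-text OCR
# page in the input format expected by compute_pppl.py.
import json

def page_to_input(page_id, text):
    stem = page_id.removesuffix(".json")  # e.g. "ocr_27812752_p1"
    return {
        "page_id": page_id,
        "content": [
            {"word": word, "index": f"{stem}_w{i}", "error": 0}
            for i, word in enumerate(text.split())
        ],
    }

pages = [page_to_input("ocr_27812752_p1.json", "Liebe die Freunde")]
with open("input.json", "w", encoding="utf-8") as f:
    json.dump(pages, f, ensure_ascii=False, indent=4)
```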

The output consists of JSON files containing the pseudo-perplexity scores:

```json
[
    {
        "page_id": "ocr_27812752_p1.json",
        "content": [
            {
                "word": "Liebe",
                "index": "ocr_27812752_p1_w27",
                "error": 0,
                "pppl": 4852.421617633325
            },
            {
                "word": "die",
                "index": "ocr_27812752_p1_w28",
                "error": 0,
                "pppl": 488.1390946218524
            }
        ]
    }
]
```

## Sentence-Level Pseudo-Perplexity with BERT

To calculate the pseudo-perplexity per sentence, we use the [Language Model Perplexity (LM-PPL)](https://github.com/asahi417/lmppl) library by Ushio & Clarke. The library expects a list of sentences as input; hence, the OCR text must be split into sentences beforehand. LM-PPL then masks each token of a sentence in turn and calculates the pseudo-perplexity score across the entire sentence.
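
For orientation, here is a minimal usage sketch following the lmppl README (the exact calls in `run_lmppl.py` may differ):

```python
# Minimal LM-PPL usage, following the lmppl README; run_lmppl.py wraps this
# with the JSON input/output format described below.
import lmppl

scorer = lmppl.MaskedLM("bert-base-multilingual-uncased")
sentences = [
    "Sonder Zweifel niemand besser, als eben Er selber.",
    "Ihn will ich also fragen; Er soll mir antworten.",
]
ppl = scorer.get_perplexity(sentences)  # one pseudo-perplexity per sentence
for sentence, score in zip(sentences, ppl):
    print(f"{score:10.2f}  {sentence}")
```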

### Getting Started

Install the Language Model Perplexity (LM-PPL) package:

```bash
pip install lmppl
```

To calculate the sentence-level pseudo-perplexities, run:

```bash
python3 run_lmppl.py -m your-model-name -i path/to/your/data -o path/to/output/directory
```

As input, the script expects a JSON file with the following structure:

```json
[
    {
        "sent_id": "ocr_26843985_p4_6",
        "ocr": "Sonder Zweifel niemand besser, als eben Er selber.",
        "gt": "Sonder Zweifel niemand besser, als eben Er selber.",
        "cer": 0.0,
        "wer": 0.0
    },
    {
        "sent_id": "ocr_26843985_p4_7",
        "ocr": "Ihn will ich also fragen; Er soll mir antworten.",
        "gt": "Ihn will ich also fragen; Er soll mir antworten.",
        "cer": 0.0,
        "wer": 0.0
    }
]
```

The output consists of JSON files containing the pseudo-perplexity scores:

```json
[
    {
        "sent_id": "ocr_26843985_p4_6",
        "ocr": "Sonder Zweifel niemand besser, als eben Er selber.",
        "gt": "Sonder Zweifel niemand besser, als eben Er selber.",
        "cer": 0.0,
        "wer": 0.0,
        "pppl": 188.4483450695001
    },
    {
        "sent_id": "ocr_26843985_p4_7",
        "ocr": "Ihn will ich also fragen; Er soll mir antworten.",
        "gt": "Ihn will ich also fragen; Er soll mir antworten.",
        "cer": 0.0,
        "wer": 0.0,
        "pppl": 22.069765228493164
    }
]
```