Lang Identification additions & fixes

Luka Jovanovic requested to merge exclude_shorter_spans_from_tagging into development

The language identification sometimes returned languages that the current state of promethia does not support. The cause was the 'closest_supported_match' function from the 'langcodes' library: one needs to pass a 'max_distance' argument to ensure that the returned language is one of the given list. This commit also extends the spaCy wrapper with a 'SpacyLemmatizer' class, which can be used to find the correct lemmas when working with a spaCy dependency parser. The pipeline script was extended with a 'try/except' block that ensures a continuous run, even if a document raises an error.
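As an illustration of the 'try/except' change, here is a minimal sketch of a per-document guard. It is not the code from this merge request: `run_pipeline`, `toy_process`, and the returned `failures` list are hypothetical names, and the real pipeline script may log or recover differently.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def run_pipeline(
    documents: Iterable[str],
    process: Callable[[str], dict],
) -> Tuple[List[dict], List[Tuple[int, Exception]]]:
    """Process every document; record failures instead of aborting the run."""
    results: List[dict] = []
    failures: List[Tuple[int, Exception]] = []
    for index, text in enumerate(documents):
        try:
            results.append(process(text))
        except Exception as exc:  # deliberately broad: one bad document must not stop the run
            logger.warning("Skipping document %d: %s", index, exc)
            failures.append((index, exc))
    return results, failures


def toy_process(text: str) -> dict:
    # Hypothetical stand-in for the real per-document processing step.
    if not text.strip():
        raise ValueError("empty document")
    return {"tokens": text.split()}


if __name__ == "__main__":
    results, failures = run_pipeline(["a b", "", "c"], toy_process)
    print(len(results), "processed;", len(failures), "skipped")
```

Keeping the index and exception of each failed document makes it possible to inspect the skipped documents after the run instead of losing them silently.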
