Frequently Asked Questions

How should I deal with OCR errors in the text corpus?

The texts were produced by OCR and then corrected by humans. Since neither OCR nor humans are perfect, many small errors remain, and these can lead to false positives when detecting false documents. Of course, if you detect such errors, you can correct them to improve your precision score!

Please let us know of any errors you find, and we will add them to the following list:

Task 1, Training Corpus:

  • Receipt 15: 3-12€ should be 3.12€
  • Receipt 66: 0.8l€ should be 0.81€
  • Receipt 130: 5.48€ should be 6.48€
  • Receipt 169: 6.50 should be 8.50
  • Receipt 194: 2.35€/kg should be 2.05€/kg
  • Receipt 201: 3.1O€ should be 3.18€
  • Receipt 927: 254€ should be 2.54
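The corrections listed above could be applied automatically before running a detector. The following is only a minimal sketch: the `CORRECTIONS` table and the `apply_correction` helper are hypothetical names, the wrong/right strings are copied literally from the list, and it assumes that the first occurrence of the wrong string in a receipt is the erroneous one, which should be checked by hand (a short string such as 6.50 may also appear inside another number).

```python
# Known corrections for the Task 1 training corpus, copied from the list above.
# Keys are receipt numbers; values are (wrong, right) string pairs.
CORRECTIONS = {
    15: ("3-12€", "3.12€"),
    66: ("0.8l€", "0.81€"),
    130: ("5.48€", "6.48€"),
    169: ("6.50", "8.50"),
    194: ("2.35€/kg", "2.05€/kg"),
    201: ("3.1O€", "3.18€"),
    927: ("254€", "2.54"),
}


def apply_correction(receipt_id, text):
    """Return the receipt text with its known correction applied, if any."""
    if receipt_id not in CORRECTIONS:
        return text
    wrong, right = CORRECTIONS[receipt_id]
    # Assumption: only the first occurrence is the OCR error.
    return text.replace(wrong, right, 1)
```

Receipts that are not in the table are returned unchanged, so the helper can be called on every document of the corpus.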

There’s a text file “1087.txt” but no image “1087.jpg” in Task 1, Test Corpus.

Indeed, it seems that “1087.txt” has no corresponding image and “1008.jpg” has no corresponding text in the Test Corpus of Task 1.

If your approach uses both images and texts, I suggest you skip image 1008 and text 1087. In this case, the evaluation will cover 499 documents.

If you use only the texts, or only the images, there is no problem for evaluation.
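For approaches that use both images and texts, one way to skip unmatched files is to keep only the document numbers that have both a ".txt" and a ".jpg" file. This is a sketch under assumptions: `paired_ids` is a hypothetical helper, and it assumes files are named by document number with those two extensions, as in “1087.txt” and “1008.jpg”.

```python
from pathlib import PurePath


def paired_ids(filenames):
    """Return the document ids that have both a text file and an image."""
    texts = {PurePath(name).stem for name in filenames
             if PurePath(name).suffix.lower() == ".txt"}
    images = {PurePath(name).stem for name in filenames
              if PurePath(name).suffix.lower() == ".jpg"}
    # Keep only ids present in both sets; ids are strings, so the
    # sort order is lexicographic, not numeric.
    return sorted(texts & images)
```

The list of file names could come from, for example, `os.listdir` on the directory where the test corpus was unpacked; with the files described above, 1008 and 1087 would be dropped from the result.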