Content Reconstruction of Parliamentary Questions. Combining Metadata with an OCR Process


The Hellenic Parliament stores each parliamentary question as a combination of metadata, extracted manually from the original text, and an image file of the scanned document. Consequently, broad access to and study of the parliamentary questions are limited, as there is no direct access to the original content. A combined process was designed to fully reconstruct the original content of the parliamentary questions from the available metadata, which were extracted during the archiving phase, and a modified mass Optical Character Recognition (OCR) process. Post-correction of the OCR results and quality control of the extracted text are paramount to ensure that the text output matches the text of the original document. The results of the OCR process are joined with the metadata, allowing a full description of the original document.

The 5th International Virtual Conference on Advanced Scientific Results