Computer Science in Muninn
The computer science component of this research project concerns the extraction, classification and curation of the information within WWI documents. Previous approaches to this type of problem have used either human transcription or Optical Character Recognition (OCR) algorithms on the contents of the pages. Both are time-consuming and prone to error, owing to the semi-structured nature of the forms and the difficulty algorithms have in recognising hand-written documents.
We propose to attack the problem from both a statistical and a retrieval perspective, in which extraction is based on an incrementally created probabilistic model. Military organisations of the Great War period tended to use printed forms filled in with a mixture of hand-written text, type-written text and rubber stamps. This is an opportunity, in that we are already able to build classifiers to identify the actual form. Because forms have set data-entry fields, only a limited number of words and characters will appear at each location. This allows us to create a localised model of the type of information that can be gathered, and that should be recognised, from each form. The restricted “symbol set” found at each location will improve recognition of the data, since there is only a small set of possibilities to choose from. Since the printed text of a form never changes from one instance to another, we are able to extract the field labels using a basic OCR model. The same approach can be applied to identifying the rubber stamps used within each service; because they are usually in a coloured ink, they are all the easier to identify rapidly.
That said, some information within the forms will remain impossible to extract, especially when dealing with faded hand-written documents, and human review will be required. Another possible approach, currently under study, is to ask volunteers for help over the internet.
We are likely to document our data in a machine-readable format and either attempt this work in a future extension to this project or enable another team to carry the extraction forward.
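The restricted "symbol set" idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the field names, the vocabularies, the counts and the function names are hypothetical, and the string similarity from Python's standard `difflib` module merely stands in for whatever confidence a real hand-writing recogniser would produce. It is not the project's actual model, only one way a per-field vocabulary might constrain a noisy reading and grow incrementally as values are confirmed.

```python
from difflib import SequenceMatcher

# Hypothetical per-field models: each maps an allowed value to the number of
# times it has been confirmed so far (invented counts, for illustration only).
FIELD_MODELS = {
    "rank": {"Private": 120, "Corporal": 30, "Sergeant": 20,
             "Lieutenant": 8, "Captain": 4},
}

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def recognise(field, noisy_text, min_similarity=0.5, prior_weight=0.1):
    """Return the most plausible value for `field`, or None if nothing in the
    field's vocabulary is close enough (the reading then needs human review)."""
    model = FIELD_MODELS.get(field, {})
    total = sum(model.values())
    best, best_score = None, 0.0
    for candidate, count in model.items():
        sim = similarity(noisy_text, candidate)
        if sim < min_similarity:
            continue  # too far from this candidate to be trusted
        # The prior frequency only nudges the ranking between close candidates.
        score = sim + prior_weight * (count / total)
        if score > best_score:
            best, best_score = candidate, score
    return best

def confirm(field, value):
    """Incrementally update the model with a human-confirmed value."""
    model = FIELD_MODELS.setdefault(field, {})
    model[value] = model.get(value, 0) + 1

if __name__ == "__main__":
    print(recognise("rank", "Serqeant"))  # a noisy reading of a known rank
    print(recognise("rank", "xxxxxxxx"))  # nothing plausible in the vocabulary
```

Returning `None` rather than a forced guess mirrors the point made above that some entries will still require human review; a confirmed value can then be fed back through `confirm` so that the field's model improves over time.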