Towards Big Data Infrastructure for Historic Handwritten Document Transcription
Date
2017-05-03
Authors
Hoffman, David Richard
Publisher
East Carolina University
Abstract
Historical archives are vast, containing many thousands of pages of unstructured data that are often not easily searchable. The current state of the art is searchable metadata tags, which help a reader land on the right page, but the text within the documents is not fully searchable. By contrast, if this unstructured handwritten data were transcribed and stored in a database, search would be much more straightforward. Digitizing the text also allows visually impaired persons to stay connected with history, since the transcribed text can be passed to assistive technology. Furthermore, digitized text can be translated into other languages, allowing for greater accessibility. The problem is apparent and the end goal is clear, but the steps to reach it are not. Some museums and libraries have resorted to crowdsourced, volunteer transcription via the web. A more logical approach would be to use computer-based handwriting recognition to help achieve the goal. In this research we present Big Data approaches to handwriting recognition. High-resolution document scans consume a large amount of disk space, and processing the images with advanced algorithms requires a large number of operations. Sharing the data storage load and processing in parallel across a scalable cluster of machines is the logical next step, and has not been reported on in depth in this domain. This research does not seek to deliver a complete handwriting recognition system. Rather, it describes and demonstrates how a cluster in a Hadoop environment can assist such research by processing documents in parallel. Although the algorithms used in this research already exist, the framework introduced here may be extended to accommodate more advanced methods and speed up their processing.