Towards Big Data Infrastructure for Historic Handwritten Document Transcription
Hoffman, David Richard
This item will be available on: 2018-05-01
Historical archival collections are vast and contain many thousands of pages of unstructured data, often not easily searchable. The current state of the art is searchable metadata tags that help a reader land on the right page, but the data within the documents is not fully searchable. By contrast, if this unstructured handwritten data were transcribed and stored in a database, search would be much more straightforward. Digitization of documents also allows visually impaired persons to stay connected with history, as the transcribed text can be sent through assistive technology. Furthermore, digitized text can be translated into other languages, allowing for greater accessibility. The problem is apparent and the end-goal solution is clear, but the steps to achieve the solution are not. Some museums and libraries have resorted to crowdsourced, volunteer transcription via the web. A more logical means would be to use computer-based handwriting recognition to help achieve the goal. In this research we present Big Data approaches towards handwriting recognition. High-resolution document scans consume a large amount of disk space, and computer processing of the images with advanced algorithms requires a large number of operations. Sharing the load of data storage and parallel processing over a scalable cluster of machines is the logical next step, and has not been reported on in depth in this domain. This research does not seek to completely solve the problem of a full-fledged handwriting recognition system. Rather, it describes and demonstrates how using a cluster in a Hadoop environment can assist the research by processing in parallel. Although the algorithms utilized in this research already exist, the framework introduced may be extended to accommodate more advanced methods in order to speed up the processing time.
Hoffman, David Richard. (May 2017). Towards Big Data Infrastructure for Historic Handwritten Document Transcription (Master's Thesis, East Carolina University). Retrieved from the Scholarship (http://hdl.handle.net/10342/6180).
East Carolina University