Towards Big Data Infrastructure for Historic Handwritten Document Transcription

Hoffman, David Richard

dc.access.option	Restricted Campus Access Only
dc.contributor.advisor	Tabrizi, M. H. N
dc.contributor.author	Hoffman, David Richard
dc.contributor.department	Computer Science
dc.date.accessioned	2017-06-01T12:04:35Z
dc.date.available	2019-02-26T14:23:52Z
dc.date.created	2017-05
dc.date.issued	2017-05-03
dc.date.submitted	May 2017
dc.date.updated	2017-05-30T18:46:41Z
dc.degree.department	Computer Science
dc.degree.discipline	MS-Software Engineering
dc.degree.grantor	East Carolina University
dc.degree.level	Masters
dc.degree.name	M.S.
dc.description.abstract	Historical archival collections are vast and contain many thousands of pages of unstructured data that are often not easily searchable. The current state of the art is searchable meta-tags, which help a reader land on the right page, but the data within the documents is not fully searchable. By contrast, if this unstructured handwritten data were transcribed and stored in a database, search would be much more straightforward. Digitization of documents also allows visually impaired persons to stay connected with history, since the transcribed text can be sent through assistive technology. Furthermore, digitized text can be translated into other languages, allowing for greater accessibility. The problem is apparent and the end goal is clear, but the steps to achieve it are not. Some museums and libraries have resorted to crowdsourced, volunteer transcription via the web. A more logical approach would be to use computer-based handwriting recognition to help achieve the goal. In this research we present Big Data approaches to handwriting recognition. High-resolution document scans consume a large amount of disk space, and processing the images with advanced algorithms requires a large number of operations. Sharing the data storage and processing load across a scalable cluster of machines is the logical next step, and it has not been reported on in depth in this domain. This research does not seek to completely solve the problem of building a full-fledged handwriting recognition system. Rather, it describes and demonstrates how a cluster in a Hadoop environment can assist the research by processing in parallel. Although the algorithms used in this research already exist, the framework introduced may be extended to accommodate more advanced methods in order to speed up processing.
dc.embargo.lift	2018-05-01
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/10342/6180
dc.language.iso	en
dc.publisher	East Carolina University
dc.subject	handwriting recognition
dc.subject	manuscript
dc.subject	hadoop streaming api
dc.subject.lcsh	Big data
dc.subject.lcsh	Archival materials--Digitization
dc.subject.lcsh	Image processing--Digital techniques
dc.subject.lcsh	Electronic information resource searching
dc.title	Towards Big Data Infrastructure for Historic Handwritten Document Transcription
dc.type	Master's Thesis
dc.type.material	text

Files

Original bundle

Name: HOFFMAN-MASTERSTHESIS-2017.pdf
Size: 4.58 MB
Format: Adobe Portable Document Format

Collections

Master's Theses
Computer Science