Towards Big Data Infrastructure for Historic Handwritten Document Transcription

dc.access.option: Restricted Campus Access Only
dc.contributor.advisor: Tabrizi, M. H. N
dc.contributor.author: Hoffman, David Richard
dc.contributor.department: Computer Science
dc.date.accessioned: 2017-06-01T12:04:35Z
dc.date.available: 2019-02-26T14:23:52Z
dc.date.created: 2017-05
dc.date.issued: 2017-05-03
dc.date.submitted: May 2017
dc.date.updated: 2017-05-30T18:46:41Z
dc.degree.department: Computer Science
dc.degree.discipline: MS-Software Engineering
dc.degree.grantor: East Carolina University
dc.degree.level: Masters
dc.degree.name: M.S.
dc.description.abstract: Historical archival documents are vast and contain many thousands of pages of unstructured data, often not easily searchable. The current state of the art is searchable meta-tags, which help a reader land on the right page, but the data within the documents is not fully searchable. By contrast, if this unstructured handwritten data were transcribed and stored in a database, search would be much more straightforward. Digitization of documents allows visually impaired persons to stay connected with history, as the transcribed text can be sent through assistive technology. Furthermore, digitized text can be translated into languages used worldwide, allowing for greater accessibility. The problem is apparent and the end-goal solution is clear, but the steps to achieve the solution are not. Some museums and libraries have resorted to crowdsourced, volunteer transcription via the web. A more logical means would be to use computer-based handwriting recognition to help achieve the goal. In this research we present Big Data approaches towards handwriting recognition. High-resolution document scans consume a large amount of disk space, and computer processing of the images with advanced algorithms requires a large number of operations. Sharing the load of data storage and parallel processing over a scalable cluster of machines is the logical next step, and it has not been reported on in depth in this domain. This research does not seek to completely solve the problem of building a full-fledged handwriting recognition system. Rather, it describes and demonstrates how a cluster in a Hadoop environment can assist the research by processing in parallel. Although the algorithms used in this research already exist, the framework introduced may be extended to accommodate more advanced methods in order to speed up the processing time.
dc.embargo.lift: 2018-05-01
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/10342/6180
dc.language.iso: en
dc.publisher: East Carolina University
dc.subject: handwriting recognition
dc.subject: manuscript
dc.subject: hadoop streaming api
dc.subject.lcsh: Big data
dc.subject.lcsh: Archival materials--Digitization
dc.subject.lcsh: Image processing--Digital techniques
dc.subject.lcsh: Electronic information resource searching
dc.title: Towards Big Data Infrastructure for Historic Handwritten Document Transcription
dc.type: Master's Thesis
dc.type.material: text

Files

Original bundle

Name: HOFFMAN-MASTERSTHESIS-2017.pdf
Size: 4.58 MB
Format: Adobe Portable Document Format