Research Guide Subtitle: Transcribing Handwritten and Early Printed Books with Artificial Intelligence
Contributors: David Brown
First published: 2022
Machine Transcription (MT) is the application of machine learning to transcribe handwritten documents. The technology is also known as HTR, or Handwritten Text Recognition, but has wider applications including the ability to digitize early printed text and to recognise text on other media, for example maps.
MT uses machine learning algorithms similar to those used in other areas of digital image analysis, of which facial recognition is the most widely known application to date.
Two steps are required to transcribe each page.
MT is effectively a word-based application. Although the algorithm will attempt a best match for each cursive character it encounters, it will then also attempt to match each string of characters to a word in memory. Consequently, the best results are achieved when character recognition takes place in conjunction with a good dictionary or language model for the target language.
The key difference between a machine learning approach and more traditional computational methods is in the use of probabilities to identify words and characters. The difference between a dictionary and a language model in this environment is that a dictionary is a simple list of words in which each word appears once. In a language model, a count is made for all of the individual words in a passage of text, and the words are then ranked according to how often they appear. When the transcription is unclear, for example if the machine is unsure about a selection of characters within a word, it will go to the language model to see what the most likely match may be.
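The distinction between a dictionary and a language model can be sketched in a few lines of Python. This is a minimal illustration only, not the method used inside Transkribus: the function names, the sample passage and the use of '?' to mark an uncertain character are all assumptions made for the example.

```python
import re
from collections import Counter


def build_language_model(text):
    """Count how often each word appears in a passage of text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


def matches(pattern, word):
    """True if word fits the pattern; '?' stands for an uncertain character."""
    return len(pattern) == len(word) and all(
        p in ("?", w) for p, w in zip(pattern, word)
    )


def best_match(pattern, model):
    """Return the most frequent word in the model that fits the pattern."""
    candidates = [(count, word) for word, count in model.items()
                  if matches(pattern, word)]
    if not candidates:
        return None
    return max(candidates)[1]


# Hypothetical training passage, invented for this example.
passage = ("the deed was signed and the deed was sealed before "
           "the lord died and the deed was lodged")

model = build_language_model(passage)   # word -> number of occurrences
dictionary = set(model)                 # a dictionary lists each word once

# The recogniser read 'd', two unclear characters, then 'd'.
guess = best_match("d??d", model)
```

Both "deed" and "died" fit the unclear reading, and a plain dictionary gives no reason to prefer one over the other. The language model prefers "deed" because it appears more often in the passage.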
Probability and confidence are key concepts in understanding machine learning. The algorithm is offering what it ‘thinks’ is most likely the correct answer and selects the answer with the lowest ambiguity according to the parameters of the algorithm. It is this approach which separates machine learning, or neural computing, from more traditional information search and retrieval.
The algorithm is not searching a database for a simple yes or no answer, but weighing a number of possible results using its neural algorithm and returning what is calculated as being the best answer, but not on an all or nothing basis. The downside to this approach is that false positives are not only possible, but likely as an alternative to no answer at all.
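The contrast between a yes-or-no lookup and a weighted best answer can be illustrated as follows. This is a simplified sketch under stated assumptions: the candidate words and their scores are invented, and the confidence value is simply each score's share of the total, which is not necessarily how a real recognition engine calculates it.

```python
def exact_lookup(reading, known_words):
    """Traditional retrieval: the reading is either present or it is not."""
    return reading if reading in known_words else None


def most_likely(scores):
    """Return the highest-scoring candidate and its share of the total score."""
    total = sum(scores.values())
    word = max(scores, key=scores.get)
    return word, scores[word] / total


# A database search for a misread word finds nothing at all.
lookup_result = exact_lookup("Dubl1n", {"Dublin", "Durrow", "Dundalk"})

# Hypothetical scores for a smudged place name. The word actually written
# might be none of these, yet one of them is still returned.
scores = {"Dublin": 4.0, "Durrow": 3.0, "Dundalk": 3.0}
word, confidence = most_likely(scores)
```

Here the lookup returns nothing, while the weighted approach returns "Dublin" with a confidence of only 0.4. An answer is always produced, which is why a low-confidence result may be a false positive.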
For this reason, MT text always requires editing by scholars who are expert in the documents that have been processed to produce a final text.
MT is, however, a hugely powerful productivity tool. It facilitates transcription on a scale impossible until now, with an immediate application in keyword search: the contents of handwritten documents become searchable by an information retrieval system.
MT and Optical Character Recognition (OCR) are different technologies. Although both make use of image analysis to segment an image into regions and zones, OCR relies on a bitmap of each character to recognise which letter it is. A bitmap is the grid of pixels that becomes visible when a scanned image is magnified.
A typical OCR program includes bitmaps of typefaces in commonly used fonts and compares these with the scans to convert the segmented image into letters and words. For this reason, a typical OCR program will not work on older printed texts unless it has been specially configured to include the older typeface. An OCR program cannot ‘guess’ in the same way as a neural algorithm.
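The bitmap comparison described above can be sketched as follows. This is a deliberately simplified illustration, not the workings of any particular OCR product: the 3x3 templates, the function names and the tolerance parameter are all assumptions made for the example.

```python
# Hypothetical 3x3 bitmaps for three letters in a single typeface.
# '1' is an inked pixel, '0' is blank paper.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def pixel_difference(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def recognise(bitmap, templates=TEMPLATES, tolerance=0):
    """Return the closest template's letter, or None if it is not close enough."""
    best_letter, best_diff = None, None
    for letter, template in templates.items():
        diff = pixel_difference(bitmap, template)
        if best_diff is None or diff < best_diff:
            best_letter, best_diff = letter, diff
    if best_diff is not None and best_diff <= tolerance:
        return best_letter
    return None


modern_t = ("111", "010", "010")   # identical to the stored typeface
older_t = ("111", "010", "111")    # same letter with a wide foot serif

modern_result = recognise(modern_t)
older_result = recognise(older_t)
```

The character in the stored typeface is recognised as "T". The older form differs from every stored bitmap by at least two pixels, so no letter is returned until a template for that typeface is added or the tolerance is loosened.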
MT does not use bitmaps. Instead, each word and letter is transformed into a vector diagram that allows the complex geometries of cursive handwriting to be understood and manipulated by a computer. This approach allows far more complicated probability equations to be performed on the geometries than can be achieved simply by examining the patterns of pixels on a black and white image. Consequently, MT can recognise and transcribe handwriting that is similar, but not necessarily identical, to the handwriting stored in each model.
The result of combining many different hands into a single model is a general model capable of reading documents both in the hands that have been used to train the algorithm and any reasonably similar hand. This approach enables the large-scale transcription of historical documents that are initially searchable and suitable for final scholarly editing.
Beyond 2022 | Virtual Record Treasury of Ireland has used Transkribus, the MT system provided by the READ Cooperative at the University of Innsbruck, to transform hundreds of thousands of pages of documents into word-searchable text.
The READ Cooperative is the result of over a decade of European Research Council funded projects intended to bring Europe’s documentary heritage into a digital environment. The Transkribus platform, the best known output of this series of projects, works effectively on most European languages and scripts from the medieval through to the early modern period.
For each language, a ‘ground truth’ or exact transcription is produced manually. This text, with its matching segmented images, is used as training data that allows the machine to ‘learn’ varied styles of handwriting and language models. Beyond 2022 has been working with the Transkribus and READ research teams since 2018 and has produced a range of handwriting models specific to historical texts that are replacements for the originals lost in 1922.
These handwriting models are capable of transcribing original documents from the early modern period through to calendars made of earlier documents in the nineteenth century by bodies such as the Irish Record Commission and the PROI itself.
The Library of Trinity College Dublin became a founding member of the READ Cooperative in 2020, once the research phase of the Transkribus projects had come to an end, and continues to participate in the ongoing development and outreach of the platform.
The Beyond 2022 handwriting model was launched on the 99th anniversary of the Four Courts blaze (30 June 2021). The model includes:
The model was curated by David Brown and includes contributions from Beyond 2022’s Archival Discovery team: Brian Gurrin, Sarah Hendriks and Timothy Murtagh. Sarah Hendriks’ contribution includes both English secretarial hand from the mid seventeenth century and transcriptions in Latin of medieval documents made in the nineteenth century.
The earliest material in the model, starting in 1610, was donated to the project by Bríd McGrath and a large set of ground truth was obtained from the 1641 Depositions project at Trinity College Dublin. The 1641 Depositions is a large collection of witness statements in multiple hands that are especially rich in person and place names. Later text from the seventeenth century was curated by David Brown from transcriptions funded by the Arts and Social Science Benefaction Fund at Trinity College Dublin.
The general English model contains roughly 400,000 words, transcribed in blocks of 10,000 to 15,000 words in each distinctive style of handwriting from selected seventeenth- and eighteenth-century documents. Most of the remaining eighteenth-century text was transcribed by Brian Gurrin and Tim Murtagh, with Brian’s contribution covering the informal handwriting found in local documents such as parish registers.
A further 165,000 words of transcription was provided by the Property Registration Authority of Ireland (The Registry of Deeds), copied from their Memorials of Deeds, 1709–30.
The nineteenth-century text in the model includes transcriptions from calendars (summaries) of much earlier documents, including Latin records dating from the period 1250–1500. This approach enables the language model element to recognise the multiple variants of place and person names found in earlier documents. By combining the earlier handwriting styles with the rich sources of vocabulary found in the nineteenth-century calendars, the MT system as a whole is capable of finding entities in difficult-to-read early modern texts.
Go to readcoop.eu for manuals and videos on using Transkribus.
There are three versions of the software. The expert client is installed on a PC or Mac and allows access to all of Transkribus’ functions. For smaller projects, the web-based app, Transkribus Lite, should be sufficient.
An even simpler app, Transkribus AI, allows the automatic transcription of single pages using a variety of publicly available models. There is a processing charge for transcribing more than an initial 100 pages.
Beyond 2022’s largest general purpose model, B2022 English M4, is publicly available for use in all versions of Transkribus. This model will transcribe most documents written in a clear hand between 1600 and 1900, although some subsequent editing will almost always be necessary. For best results, use a good quality orthogonal image with high contrast.