CARLI Digital Collections Users’ Group Links revised: 09/23/2014
Digital Text Basics
Digital representations of text are based on the concept of character encoding: each character
in a given repertoire is assigned a numeric code, and that code is mapped to a sequence of bits
so that text can be transmitted and stored in digital form. The character encoding used in a
file determines which characters can be represented in that file. Currently, the 8-bit Unicode
Transformation Format (UTF-8) is the generally accepted standard for digital texts. UTF-8 can
represent not only the characters of Latin-script languages, but also Greek, Cyrillic, Hebrew,
Arabic, and many other scripts. For these reasons, it is recommended that all textual documents
be encoded as UTF-8.
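
The mapping from character to numeric code to bytes can be made concrete with a short Python sketch. It is only an illustration: the function name describe_utf8 and the sample characters are assumptions chosen for this example, not part of any standard.

```python
import unicodedata


def describe_utf8(text):
    """Return (Unicode name, code point, UTF-8 bytes in hex) for each character."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"          # the numeric code assigned to the character
        utf8_hex = ch.encode("utf-8").hex(" ")   # the bytes actually stored in a UTF-8 file
        rows.append((unicodedata.name(ch), code_point, utf8_hex))
    return rows


# Sample characters (illustrative): Latin A, Latin e with acute accent,
# Greek capital omega, Cyrillic capital zhe, Hebrew alef.
sample = "A\u00e9\u03a9\u0416\u05d0"

for name, code_point, utf8_hex in describe_utf8(sample):
    print(f"{code_point}  {utf8_hex:<8}  {name}")
```

Characters in the basic Latin range occupy a single byte, while the accented, Greek, Cyrillic, and Hebrew characters in this sample each occupy two bytes. All of them can sit side by side in the same UTF-8 file.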
Most computer programs can save text-based documents (plain text files, XML, or HTML) as
UTF-8 encoded documents. Additionally, some document formats, such as XML and HTML,
provide a way to declare explicitly within the markup that the file is UTF-8 encoded, which a
parser can then use to interpret the rest of the document. In XML, the declaration appears on
the first line of the file, which states both the type of file (XML) and its encoding (UTF-8).
Before saving a text file, check the software’s save options to make sure that UTF-8 encoding
is being used.
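
When files are written by a script instead of a desktop program, the encoding can be stated explicitly in the same way. The following is a minimal sketch using Python's standard library; the helper names (save_plain_text, save_xml, declared_first_line) and the element names record and title are hypothetical and used only for illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def save_plain_text(path, text):
    """Write a plain text file, naming UTF-8 explicitly instead of relying on a default."""
    Path(path).write_text(text, encoding="utf-8")


def save_xml(path, title):
    """Write a small XML file whose first line declares the UTF-8 encoding."""
    root = ET.Element("record")                # hypothetical element names
    ET.SubElement(root, "title").text = title
    ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)


def declared_first_line(path):
    """Return the first line of a file, where the XML declaration appears."""
    with open(path, "rb") as handle:
        return handle.readline().decode("utf-8").strip()


with tempfile.TemporaryDirectory() as folder:
    xml_path = Path(folder) / "record.xml"
    save_xml(xml_path, "Caf\u00e9 \u03a9")     # Latin and Greek characters in one title
    print(declared_first_line(xml_path))       # the XML declaration, including the encoding
```

A parser that reads the file back uses the declared encoding to decode the remaining lines, so the accented and Greek characters in the title survive the round trip.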
Optical Character Recognition (OCR)
OCR is the process of electronically translating a scanned bitmapped image of text material into
machine-readable text. A computer program “reads” the character content within the image and
creates a digital version of the text, usually in a separate file. This allows the text to be searched
and indexed, or used in other processes such as data mining or machine translation.
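
To illustrate what searching and indexing mean in practice, the sketch below builds a simple word index over OCR output. It assumes the OCR text is already available as one string per page; the sample page text and the function names build_index and search are invented for the example, and production search systems are considerably more sophisticated.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page numbers on which it occurs."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return index


def search(index, term):
    """Return the sorted page numbers on which the term occurs (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))


# Invented OCR output, keyed by page number.
ocr_pages = {
    1: "Annual report of the library board",
    2: "The board approved the library budget",
    3: "Budget tables and notes",
}

index = build_index(ocr_pages)
print(search(index, "library"))
print(search(index, "budget"))
```

An index like this only finds words that the OCR software recognized correctly, which is why the accuracy of the OCR output matters for retrieval.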
The accuracy of the OCR process depends on a number of factors, including the quality of the
scanned image, the language in which the text is written, and the typeface used in printing.
Poor-quality images in which the text does not contrast clearly with the background, text in
non-European languages (or non-Latin character sets), and text rendered in serif fonts
can all decrease the accuracy of the resulting text file. At this time, hand-printed manuscripts are
extremely difficult for OCR software to interpret, and those written in cursive are effectively
impossible. However, with a clear typeset image, an accuracy of 80% to 90% may be achieved
using readily available and relatively inexpensive software.
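
These guidelines do not specify how accuracy is measured. One common convention, assumed here purely for illustration, is character-level accuracy: compare the OCR output with a hand-checked transcription of the same passage and count the single-character insertions, deletions, and substitutions needed to turn one into the other. The function names and sample strings below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when the characters match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_text):
    """Share of reference characters that needed no correction, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical passage: a hand-checked transcription and an OCR rendering of it.
reference = "Optical Character Recognition"
ocr_text = "Optica1 Charaeter Recogniti0n"

print(f"{character_accuracy(reference, ocr_text):.1%}")
```

Checking a small sample of pages this way gives a rough estimate of whether a batch falls within the expected accuracy range or needs manual correction.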
The advantage of OCR is that it eliminates the need for costly, time-consuming transcription. For
most libraries, transcription may not be an option, so even an imperfect rendering
produced by OCR is still better than having no digital representation of the text at all.
OCR routines can also be built into the digitization workflow and do not require a
significant time investment. For documents where the accuracy of the machine-readable text is of
primary importance, the OCR-produced text can be manually corrected.