Capture Technology refers to the technology used to transform the images or information contained in the original document into some other form, the form depending upon the overall media conversion technology being used. This term is not relevant to Conservation (3.1.1) or Deacidification (3.1.2), which are conservation technologies and do not employ media conversion techniques. Printing on paper (see 1.1.1) is, of course, also a capture technology.
A Photocopier is a device for making photographic copies of graphic images. A common form of the photocopier uses the xerographic process, in which light reflected from the original document is focused onto an electrically charged insulated photoconductor, and the latent image is developed using a resinous powder. For the purposes of this Glossary, the term photocopier is restricted to devices that use analog technologies, such as light lens technology. Digital technologies are treated separately (see 3.2.3). With photocopiers so defined, the image is normally scanned and printed essentially in a single operation, and an intermediate latent image is not normally stored for re-use at a later stage. The two-stage processes of photography, which may indeed be used for photocopying, do however permit the use of the photographic negative as an intermediate storage device (a particular case of which is the use of microform recording technology; see 3.2.2).
A Microform Recorder is a camera or other photographic device for photographing the original document and printing it onto one of several forms of microform (1.1.2). The microform film in essence becomes both a storage medium (see 3.3.1.2) and a presentation medium (see 3.6.1.2 and 3.6.2.1). Other film copies and paper copies may also be made from the microform negatives for presentation (see 3.6.1.2).
A Digital Image Scanner is a device for scanning the images contained on pages of a document and transforming the scanned image into digital electronic signals corresponding to the physical state at each part of the scanned area, that is, into image documents (3.1.5.1). These signals are most often stored (see 3.3) for subsequent interpretation (see 3.2.5, 3.2.6, 3.2.7, and 3.3.2, 3.3.4), access (3.4), distribution (3.5), or presentation (3.6). Each small element of the document (known as a "pixel") is thus encoded quantitatively by a digital number, where the number contains sufficient information to represent the image content of the pixel (see 3.1.5.1). A digital image scanner on its own does not interpret the image information. The number of pixels per linear inch is considered to be the resolution of the scanner. Typical resolutions with current technology range from 100 pixels per linear inch to over 1,000 pixels per linear inch, but there are trade-offs between resolution, speed, cost, and quality.
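As a rough illustration of the trade-off between resolution and storage, the following sketch computes the pixel count and uncompressed size of one scanned page. The page dimensions, the 300 pixels per linear inch setting, and the function names are assumptions chosen for the example, not values taken from this Glossary.

```python
def page_pixels(width_in, height_in, ppi):
    """Return (columns, rows, total pixels) for a page scanned at
    `ppi` pixels per linear inch."""
    cols = round(width_in * ppi)
    rows = round(height_in * ppi)
    return cols, rows, cols * rows


def raw_size_bytes(total_pixels, bits_per_pixel):
    """Uncompressed image size in bytes, rounded up to a whole byte."""
    return (total_pixels * bits_per_pixel + 7) // 8


# Assumed example: an 8.5 x 11 inch page at 300 pixels per linear inch.
cols, rows, total = page_pixels(8.5, 11, 300)
print(cols, rows, total)          # 2550 3300 8415000
print(raw_size_bytes(total, 1))   # 1051875 bytes at 1 bit per pixel
print(raw_size_bytes(total, 8))   # 8415000 bytes at 8 bits per pixel
```

Because the pixel count grows with the square of the linear resolution, doubling the resolution quadruples the uncompressed size.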
Digital Image Scanners may scan in one or more different modes, depending upon their capability and depending upon whether they are scanning monotone or color (1.4.1), or whether they are scanning line art, greyscale, halftone, or continuous tone objects (1.4.2.3, 3.1.5.1). Performance, in terms of speed, accuracy, and resolution, depends upon the degree to which these attributes can be accommodated. The speed of digital image scanners ranges from one or two pages per minute to around fifty per minute.
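To give a sense of what this range of speeds means in practice, the sketch below estimates the scanning time for a single volume. The 300-page volume and the function name are assumptions for illustration only, and the estimate ignores handling time.

```python
def scan_minutes(pages, pages_per_minute):
    """Minutes needed to scan `pages` at a steady rate, ignoring handling time."""
    return pages / pages_per_minute


# Assumed example: a 300-page volume at the two ends of the speed range.
print(scan_minutes(300, 2))    # 150.0 minutes
print(scan_minutes(300, 50))   # 6.0 minutes
```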
A FAX machine (3.5.3) is a special form of digital image scanner. Other special forms of digital image scanners exist for scanning from media other than paper, such as digital image scanners that scan directly from microfilm (1.1.2). Such images scanned from microfilm, however, can be no better than the original microfilm image itself (see 3.1.4).
Digital image scanners may come equipped with different physical devices for accommodating the original documents. These may include flatbed platens equipped with manual feeds, semi-automatic feeds (one page at a time is fed into an automatic hopper), or fully-automatic feeds. Manual feeds offer the greatest safety from potential jamming, a point of importance in the scanning of unique documents. Flatbed scanners generally require either that books be disbound and placed on the platen one page at a time, or that books be laid open face-down on the platen, which may cause some distortion. Scanners may also come equipped with edge-scanners, which scan right up to the binding of the book, avoiding this distortion; or with cradle scanners, where the book is opened in a cradle (such devices are also used in some microform recording devices) and two angled scanning heads are lowered into the open, cradled book. In all cases, quality control of scanning is an issue with respect to fidelity of the scanned image and registration of the scanned image with respect to a defined standard.
An Optical Character Recognition (OCR) Scanner is a digital image scanner that in addition interprets the textual portion of the images and converts it to digital codes representing formatted or unformatted text (3.1.5.2). The less sophisticated of such devices can only "recognize" one or a few fonts of a fixed size, and can only interpret such information as unformatted text. The more sophisticated devices can represent multiple fonts of different sizes, and can interpret limited information as formatted text. At either extreme, no device achieves 100% recognition accuracy: accuracy of the better devices typically ranges between 95% and 98%, depending upon manufacturer-imposed trade-offs between the sophistication of the device, its speed, and its intended range of applicability.
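The practical meaning of these accuracy figures can be seen with a little arithmetic. The sketch below estimates the number of misrecognized characters on one page, assuming (for illustration only) a page of 2,000 characters and treating the accuracy figure as a per-character rate.

```python
def expected_errors(chars, accuracy):
    """Expected number of misrecognized characters, treating `accuracy`
    as the per-character recognition rate (an assumption of this sketch)."""
    return chars * (1.0 - accuracy)


# Assumed example: a page of 2,000 characters.
print(round(expected_errors(2000, 0.98), 1))   # 40.0 errors per page
print(round(expected_errors(2000, 0.95), 1))   # 100.0 errors per page
```

Even at the upper end of the range, a page of this size would contain dozens of errors, which is why proofreading or error-tolerant uses of the text matter.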
OCR devices are most often used where scanning errors and unformatted text are acceptable limitations, such as, for example, where the input material can be subsequently proofread and corrected, or where redundant information is scanned and the redundant information used to correct any inconsistencies arising from scanning errors (typically in certain commercial applications). In the context of document preservation, most uses of OCR devices are limited to cases where text information alone suffices, and the form of the original document is not an important aspect of preservation. An important application is the construction of indices for access and distribution (see 3.4 and 3.5), or full contextual searching of information (3.4.2). Promising research has been done, for example, on the searching and retrieval of documents using the "corrupted" (erroneous) text derived from the OCR scanning of documents. The techniques utilized in this approach exploit the redundant information contained in the corrupted text.
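One simple way to see how redundancy in corrupted text can support retrieval is to match on overlapping character trigrams, so that a single misread character spoils only a few of a word's trigrams. The sketch below is an illustrative technique chosen for this example, not necessarily the method used in the research mentioned above; the function names, the threshold, and the sample text are all assumptions.

```python
def trigrams(word):
    """Set of overlapping three-character sequences in a padded, lowercased word."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def dice(a, b):
    """Dice similarity between the trigram sets of two words (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def search(query, ocr_text, threshold=0.5):
    """Return the words in the OCR text whose trigram similarity to the
    query meets the (assumed) threshold."""
    return [w for w in ocr_text.split() if dice(query, w) >= threshold]


# Assumed sample of corrupted OCR output: "rv" misread as "w", "m" as "rn".
corrupted = "the presewation of brittle docurnents"
print(search("preservation", corrupted))   # ['presewation']
```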
Handwriting recognition devices, an extreme form of OCR devices, are not included in this Glossary. At this time, such devices are limited in capability.
Internal Character Recognition is the term sometimes used when the same interpretation technology that is used in OCR devices (3.2.4) is applied to an already stored digital image at a later date. This separates the functions of scanning the images (3.2.3) digitally, and of interpreting the images. Interpreting the scanned and stored images at a later date also allows for using different recognition technologies in the tradeoffs between accuracy, speed, and function. In the context of preservation and media conversion, it also allows for the immediate focus to be placed on scanning and storage (and possibly media conversion), deferring the option of character recognition and its applications (see 3.2.4) to a later date--at such time, massive-volume character recognition and information interpretation is likely to be more economically feasible at higher levels of accuracy than with present technology.
Intelligent Character Recognition is the term sometimes given to Optical or Internal Character Recognition where the scanned and recognized information is further interpreted to take advantage of contextual information, that is, words, phrases, and so forth, rather than simply treating the text as a string of independent characters. Intelligent Character Recognition, for example, may be used by sophisticated computer programs to construct concordances automatically, or to create highly sophisticated indexes. At this stage, intelligent character recognition is of research interest rather than production interest.
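To make the idea of an automatically constructed concordance concrete, the following minimal sketch lists every occurrence of each word together with a few words of surrounding context. It operates on text that has already been recognized and does not itself perform any character recognition; the function name and the context width are assumptions for illustration.

```python
from collections import defaultdict


def concordance(text, width=2):
    """Map each word to its occurrences, each shown with up to `width`
    words of context on either side and the keyword in upper case."""
    words = text.lower().split()
    index = defaultdict(list)
    for i, word in enumerate(words):
        left = words[max(0, i - width):i]
        right = words[i + 1:i + 1 + width]
        index[word].append(" ".join(left + [word.upper()] + right))
    return dict(index)


entries = concordance("the scanner reads the page")
print(entries["the"])   # ['THE scanner reads', 'scanner reads THE page']
```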
Page Recognition is the term given to the automatic interpretation of features contained within the printed page such as titles, subheads, columns, paragraphs, figures, figure captions, footnotes, and so forth. Additional capabilities of sophisticated page recognition algorithms include the ability to determine fonts and font sizes. In essence, Page Recognition "reverse engineers" the image into marked-up copy.
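A toy sketch of this "reverse engineering" step is given below. It assumes that an earlier stage has already segmented the page into text blocks and estimated a font size for each; the classification rules and the tag names (title, subhead, para, footnote) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def mark_up(blocks):
    """Wrap each (text, font_size) block in a tag chosen by comparing its
    font size with the most common (assumed body text) size on the page."""
    body_size = Counter(size for _, size in blocks).most_common(1)[0][0]
    largest = max(size for _, size in blocks)
    lines = []
    for text, size in blocks:
        if size == largest and size > body_size:
            tag = "title"
        elif size > body_size:
            tag = "subhead"
        elif size < body_size:
            tag = "footnote"
        else:
            tag = "para"
        lines.append(f"<{tag}>{text}</{tag}>")
    return lines


# Assumed example page: blocks in reading order with estimated font sizes.
page = [
    ("Capture Technology", 18),
    ("Scanners", 14),
    ("A scanner is a device.", 10),
    ("It encodes pixels.", 10),
    ("1. See 3.2.3.", 8),
]
for line in mark_up(page):
    print(line)
```

Real page recognition must also infer the blocks themselves from the image, which is the harder part of the problem and is not attempted here.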
As an alternative or complement to OCR (3.2.4), textual information can be encoded by directly keying alpha-numeric text into computer files manually. This has some advantage in accuracy over OCR, but is slower. It may also be used in situations where the brittleness of acidic documents makes them so fragile that scanning technologies cannot safely be used. See also 3.1.6.
Enhancement refers to the use of mathematical algorithms to improve the quality of digitally scanned images (3.2.3), such as by computationally adjusting the contrast or brightness of the scanned image. The term also includes techniques that may be used to modify the scanned image for structural reasons, such as bordering to remove any unwanted scanned areas surrounding the actual document pages, de-skewing to rectify the scanned image to correct for any skew in the placement of the document on the scanner, or margin adjustment to ensure that pages are properly aligned with each other.
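Two of these operations can be sketched on a greyscale image held as a list of rows of pixel values from 0 (black) to 255 (white). The linear contrast formula, the assumption that unwanted border areas are pure white, and the function names are all illustrative choices rather than a standard method; de-skewing is omitted because it requires rotation and resampling.

```python
def adjust(image, contrast=1.0, brightness=0):
    """Scale each pixel's distance from mid-grey (128) by `contrast`,
    add `brightness`, and clip the result to the range 0..255."""
    def clip(value):
        return max(0, min(255, int(round(value))))
    return [[clip(contrast * (p - 128) + 128 + brightness) for p in row]
            for row in image]


def crop_border(image, background=255):
    """Remove outer rows and columns that contain only background pixels."""
    rows = [i for i, row in enumerate(image)
            if any(p != background for p in row)]
    if not rows:
        return []  # the whole image is background
    cols = [j for j in range(len(image[0]))
            if any(row[j] != background for row in image)]
    return [row[cols[0]:cols[-1] + 1]
            for row in image[rows[0]:rows[-1] + 1]]


print(adjust([[100, 128, 156]], contrast=2.0))   # [[72, 128, 184]]
page = [
    [255, 255, 255, 255],
    [255,   0,  40, 255],
    [255, 255, 255, 255],
]
print(crop_border(page))                          # [[0, 40]]
```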
A full glossary of terms associated with enhancement is beyond the scope of this document.