[previous] [Table of Contents] [Next]

Workshop on Electronic Texts
Session IV.

Image Capture, Text Capture, Overview of Text and Image Storage Formats

William HOOTON, vice president of operations, I-NET, moderated this session.

KENNEY

Factors influencing development of CXP
Advantages of using digital technology versus photocopy and microfilm
A primary goal of CXP; publishing challenges
Characteristics of copies printed
Quality of samples achieved in image capture
Several factors to be considered in choosing scanning
Emphasis of CXP on timely and cost-effective production of black-and-white printed facsimiles
Results of producing microfilm from digital files
Advantages of creating microfilm
Details concerning production
Costs
Role of digital technology in library preservation

Anne KENNEY, associate director, Department of Preservation and Conservation, Cornell University, opened her talk by observing that the Cornell Xerox Project (CXP) has been guided by the assumption that the ability to produce printed facsimiles or to replace paper with paper would be important, at least for the present generation of users and equipment. She described three factors that influenced development of the project: 1) Because the project has emphasized the preservation of deteriorating brittle books, the quality of what was produced had to be sufficiently high to return a paper replacement to the shelf. CXP was only interested in using: 2) a system that was cost-effective, which meant that it had to be cost-competitive with the processes currently available, principally photocopy and microfilm, and 3) new or currently available product hardware and software.

KENNEY described the advantages that using digital technology offers over both photocopy and microfilm: 1) The potential exists to create a higher quality reproduction of a deteriorating original than conventional light-lens technology. 2) Because a digital image is an encoded representation, it can be reproduced again and again with no resulting loss of quality, as opposed to the situation with light-lens processes, in which there is discernible difference between a second and a subsequent generation of an image. 3) A digital image can be manipulated in a number of ways to improve image capture; for example, Xerox has developed a windowing application that enables one to capture a page containing both text and illustrations in a manner that optimizes the reproduction of both. (With light-lens technology, one must choose which to optimize, text or the illustration; in preservation microfilming, the current practice is to shoot an illustrated page twice, once to highlight the text and the second time to provide the best capture for the illustration.) 4) A digital image can also be edited, density levels adjusted to remove underlining and stains, and to increase legibility for faint documents. 5) On-screen inspection can take place at the time of initial setup and adjustments made prior to scanning, factors that substantially reduce the number of retakes required in quality control.

A primary goal of CXP has been to evaluate the paper output printed on the Xerox DocuTech, a high-speed printer that produces 600-dpi pages from scanned images at a rate of 135 pages a minute. KENNEY recounted several publishing challenges to represent faithful and legible reproductions of the originals that the 600-dpi copy for the most part successfully captured. For example, many of the deteriorating volumes in the project were heavily illustrated with fine line drawings or halftones or came in languages such as Japanese, in which the buildup of characters comprised of varying strokes is difficult to reproduce at lower resolutions; a surprising number of them came with annotations and mathematical formulas, which it was critical to be able to duplicate exactly.

KENNEY noted that 1) the copies are being printed on paper that meets the ANSI standards for performance, 2) the DocuTech printer meets the machine and toner requirements for proper adhesion of print to page, as described by the National Archives, and thus 3) paper product is considered to be the archival equivalent of preservation photocopy.

KENNEY then discussed several samples of the quality achieved in the project that had been distributed in a handout, for example, a copy of a print-on-demand version of the 1911 Reed lecture on the steam turbine, which contains halftones, line drawings, and illustrations embedded in text; the first four loose pages in the volume compared the capture capabilities of scanning to photocopy for a standard test target, the IEEE standard 167A 1987 test chart. In all instances scanning proved superior to photocopy, though only slightly more so in one.

Conceding the simplistic nature of her review of the quality of scanning to photocopy, KENNEY described it as one representation of the kinds of settings that could be used with scanning capabilities on the equipment CXP uses. KENNEY also pointed out that CXP investigated the quality achieved with binary scanning only, and noted the great promise in gray scale and color scanning, whose advantages and disadvantages need to be examined. She argued further that scanning resolutions and file formats can represent a complex trade-off between the time it takes to capture material, file size, fidelity to the original, and on-screen display; and printing and equipment availability. All these factors must be taken into consideration.

CXP placed primary emphasis on the production in a timely and cost-effective manner of printed facsimiles that consisted largely of black-and-white text. With binary scanning, large files may be compressed efficiently and in a lossless manner (i.e., no data is lost in the process of compressing [and decompressing] an image--the exact bit-representation is maintained) using Group 4 CCITT (i.e., the French acronym for International Consultative Committee for Telegraph and Telephone) compression. CXP was getting compression ratios of about forty to one. Gray-scale compression, which primarily uses JPEG, is much less economical and can represent a lossy compression (i.e., not lossless), so that as one compresses and decompresses, the illustration is subtly changed. While binary files produce a high-quality printed version, it appears 1) that other combinations of spatial resolution with gray and/or color hold great promise as well, and 2) that gray scale can represent a tremendous advantage for on-screen viewing. The quality associated with binary and gray scale also depends on the equipment used. For instance, binary scanning produces a much better copy on a binary printer.

Among CXP's findings concerning the production of microfilm from digital files, KENNEY reported that the digital files for the same Reed lecture were used to produce sample film using an electron beam recorder. The resulting film was faithful to the image capture of the digital files, and while CXP felt that the text and image pages represented in the Reed lecture were superior to that of the light-lens film, the resolution readings for the 600 dpi were not as high as standard microfilming. KENNEY argued that the standards defined for light-lens technology are not totally transferable to a digital environment. Moreover, they are based on definition of quality for a preservation copy. Although making this case will prove to be a long, uphill struggle, CXP plans to continue to investigate the issue over the course of the next year.

KENNEY concluded this portion of her talk with a discussion of the advantages of creating film: it can serve as a primary backup and as a preservation master to the digital file; it could then become the print or production master and service copies could be paper, film, optical disks, magnetic media, or on-screen display.

Finally, KENNEY presented details re production:

Development and testing of a moderately-high resolution production scanning workstation represented a third goal of CXP; to date, 1,000 volumes have been scanned, or about 300,000 images.
The resulting digital files are stored and used to produce hard-copy replacements for the originals and additional prints on demand; although the initial costs are high, scanning technology offers an affordable means for reformatting brittle material.
A technician in production mode can scan 300 pages per hour when performing single-sheet scanning, which is a necessity when working with truly brittle paper; this figure is expected to increase significantly with subsequent iterations of the software from Xerox; a three-month time-and-cost study of scanning found that the average 300-page book would take about an hour and forty minutes to scan (this figure included the time for setup, which involves keying in primary bibliographic data, going into quality control mode to define page size, establishing front-to-back registration, and scanning sample pages to identify a default range of settings for the entire book--functions not dissimilar to those performed by filmers or those preparing a book for photocopy).
The final step in the scanning process involved rescans, which happily were few and far between, representing well under 1 percent of the total pages scanned.

In addition to technician time, CXP costed out equipment, amortized over four years, the cost of storing and refreshing the digital files every four years, and the cost of printing and binding, book-cloth binding, a paper reproduction. The total amounted to a little under $65 per single 300-page volume, with 30 percent overhead included--a figure competitive with the prices currently charged by photocopy vendors.

Of course, with scanning, in addition to the paper facsimile, one is left with a digital file from which subsequent copies of the book can be produced for a fraction of the cost of photocopy, with readers afforded choices in the form of these copies.

KENNEY concluded that digital technology offers an electronic means for a library preservation effort to pay for itself. If a brittle-book program included the means of disseminating reprints of books that are in demand by libraries and researchers alike, the initial investment in capture could be recovered and used to preserve additional but less popular books. She disclosed that an economic model for a self-sustaining program could be developed for CXP's report to the Commission on Preservation and Access (CPA).

KENNEY stressed that the focus of CXP has been on obtaining high quality in a production environment. The use of digital technology is viewed as an affordable alternative to other reformatting options.

ANDRE

Overview and history of NATDP
Various agricultural CD-ROM products created inhouse and by service bureaus
Pilot project on Internet transmission
Additional products in progress

Pamela ANDRE, associate director for automation, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), presented an overview of NATDP, which has been underway at NAL the last four years, before Judith ZIDAR discussed the technical details. ANDRE defined agricultural information as a broad range of material going from basic and applied research in the hard sciences to the one-page pamphlets that are distributed by the cooperative state extension services on such things as how to grow blueberries.

NATDP began in late 1986 with a meeting of representatives from the land-grant library community to deal with the issue of electronic information. NAL and forty-five of these libraries banded together to establish this project--to evaluate the technology for converting what were then source documents in paper form into electronic form, to provide access to that digital information, and then to distribute it. Distributing that material to the community--the university community as well as the extension service community, potentially down to the county level--constituted the group's chief concern.

Since January 1988 (when the microcomputer-based scanning system was installed at NAL), NATDP has done a variety of things, concerning which ZIDAR would provide further details. For example, the first technology considered in the project's discussion phase was digital videodisc, which indicates how long ago it was conceived.

Over the four years of this project, four separate CD-ROM products on four different agricultural topics were created, two at a scanning-and-OCR station installed at NAL, and two by service bureaus. Thus, NATDP has gained comparative information in terms of those relative costs. Each of these products contained the full ASCII text as well as page images of the material, or between 4,000 and 6,000 pages of material on these disks. Topics included aquaculture, food, agriculture and science (i.e., international agriculture and research), acid rain, and Agent Orange, which was the final product distributed (approximately eighteen months before the Workshop).

The third phase of NATDP focused on delivery mechanisms other than CD-ROM. At the suggestion of Clifford LYNCH, who was a technical consultant to the project at this point, NATDP became involved with the Internet and initiated a project with the help of North Carolina State University, in which fourteen of the land-grant university libraries are transmitting digital images over the Internet in response to interlibrary loan requests--a topic for another meeting. At this point, the pilot project had been completed for about a year and the final report would be available shortly after the Workshop. In the meantime, the project's success had led to its extension. (ANDRE noted that one of the first things done under the program title was to select a retrieval package to use with subsequent products; Windows Personal Librarian was the package of choice after a lengthy evaluation.)

Three additional products had been planned and were in progress:

An arrangement with the American Society of Agronomy--a professional society that has published the Agronomy Journal since about 1908--to scan and create bit-mapped images of its journal. ASA granted permission first to put and then to distribute this material in electronic form, to hold it at NAL, and to use these electronic images as a mechanism to deliver documents or print out material for patrons, among other uses. Effectively, NAL has the right to use this material in support of its program. (Significantly, this arrangement offers a potential cooperative model for working with other professional societies in agriculture to try to do the same thing--put the journals of particular interest to agriculture research into electronic form.)
An extension of the earlier product on aquaculture.
The George Washington Carver Papers--a joint project with Tuskegee University to scan and convert from microfilm some 3,500 images of Carver's papers, letters, and drawings.

It was anticipated that all of these products would appear no more than six months after the Workshop.

ZIDAR

(A separate arena for scanning)
Steps in creating a database
Image capture, with and without performing OCR
Keying in tracking data
Scanning, with electronic and manual tracking
Adjustments during scanning process
Scanning resolutions
Compression
De-skewing and filtering
Image capture from microform: the papers and letters of George Washington Carver
Equipment used for a scanning system

Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), illustrated the technical details of NATDP, including her primary responsibility, scanning and creating databases on a topic and putting them on CD-ROM.

(ZIDAR remarked a separate arena from the CD-ROM projects, although the processing of the material is nearly identical, in which NATDP is also scanning material and loading it on a Next microcomputer, which in turn is linked to NAL's integrated library system. Thus, searches in NAL's bibliographic database will enable people to pull up actual page images and text for any documents that have been entered.)

In accordance with the session's topic, ZIDAR focused her illustrated talk on image capture, offering a primer on the three main steps in the process: 1) assemble the printed publications; 2) design the database (database design occurs in the process of preparing the material for scanning; this step entails reviewing and organizing the material, defining the contents--what will constitute a record, what kinds of fields will be captured in terms of author, title, etc.); 3) perform a certain amount of markup on the paper publications. NAL performs this task record by record, preparing work sheets or some other sort of tracking material and designing descriptors and other enhancements to be added to the data that will not be captured from the printed publication. Part of this process also involves determining NATDP's file and directory structure: NATDP attempts to avoid putting more than approximately 100 images in a directory, because placing more than that on a CD-ROM would reduce the access speed.

This up-front process takes approximately two weeks for a 6,000-7,000-page database. The next step is to capture the page images. How long this process takes is determined by the decision whether or not to perform OCR. Not performing OCR speeds the process, whereas text capture requires greater care because of the quality of the image: it has to be straighter and allowance must be made for text on a page, not just for the capture of photographs.

NATDP keys in tracking data, that is, a standard bibliographic record including the title of the book and the title of the chapter, which will later either become the access information or will be attached to the front of a full-text record so that it is searchable.

Images are scanned from a bound or unbound publication, chiefly from bound publications in the case of NATDP, however, because often they are the only copies and the publications are returned to the shelves. NATDP usually scans one record at a time, because its database tracking system tracks the document in that way and does not require further logical separating of the images. After performing optical character recognition, NATDP moves the images off the hard disk and maintains a volume sheet. Though the system tracks electronically, all the processing steps are also tracked manually with a log sheet.

ZIDAR next illustrated the kinds of adjustments that one can make when scanning from paper and microfilm, for example, redoing images that need special handling, setting for dithering or gray scale, and adjusting for brightness or for the whole book at one time.

NATDP is scanning at 300 dots per inch, a standard scanning resolution. Though adequate for capturing text that is all of a standard size, 300 dpi is unsuitable for any kind of photographic material or for very small text. Many scanners allow for different image formats, TIFF, of course, being a de facto standard. But if one intends to exchange images with other people, the ability to scan other image formats, even if they are less common, becomes highly desirable.

CCITT Group 4 is the standard compression for normal black-and-white images, JPEG for gray scale or color. ZIDAR recommended 1) using the standard compressions, particularly if one attempts to make material available and to allow users to download images and reuse them from CD-ROMs; and 2) maintaining the ability to output an uncompressed image, because in image exchange uncompressed images are more likely to be able to cross platforms.

ZIDAR emphasized the importance of de-skewing and filtering as requirements on NATDP's upgraded system. For instance, scanning bound books, particularly books published by the federal government whose pages are skewed, and trying to scan them straight if OCR is to be performed, is extremely time-consuming. The same holds for filtering of poor-quality or older materials.

ZIDAR described image capture from microform, using as an example three reels from a sixty-seven-reel set of the papers and letters of George Washington Carver that had been produced by Tuskegee University. These resulted in approximately 3,500 images, which NATDP had had scanned by its service contractor, Science Applications International Corporation (SAIC). NATDP also created bibliographic records for access. (NATDP did not have such specialized equipment as a microfilm scanner.

Unfortunately, the process of scanning from microfilm was not an unqualified success, ZIDAR reported: because microfilm frame sizes vary, occasionally some frames were missed, which without spending much time and money could not be recaptured.

OCR could not be performed from the scanned images of the frames. The bleeding in the text simply output text, when OCR was run, that could not even be edited. NATDP tested for negative versus positive images, landscape versus portrait orientation, and single- versus dual-page microfilm, none of which seemed to affect the quality of the image; but also on none of them could OCR be performed.

In selecting the microfilm they would use, therefore, NATDP had other factors in mind. ZIDAR noted two factors that influenced the quality of the images: 1) the inherent quality of the original and 2) the amount of size reduction on the pages.

The Carver papers were selected because they are informative and visually interesting, treat a single subject, and are valuable in their own right. The images were scanned and divided into logical records by SAIC, then delivered, and loaded onto NATDP's system, where bibliographic information taken directly from the images was added. Scanning was completed in summer 1991 and by the end of summer 1992 the disk was scheduled to be published.

Problems encountered during processing included the following: Because the microfilm scanning had to be done in a batch, adjustment for individual page variations was not possible. The frame size varied on account of the nature of the material, and therefore some of the frames were missed while others were just partial frames. The only way to go back and capture this material was to print out the page with the microfilm reader from the missing frame and then scan it in from the page, which was extremely time-consuming. The quality of the images scanned from the printout of the microfilm compared unfavorably with that of the original images captured directly from the microfilm. The inability to perform OCR also was a major disappointment. At the time, computer output microfilm was unavailable to test.

The equipment used for a scanning system was the last topic addressed by ZIDAR. The type of equipment that one would purchase for a scanning system included: a microcomputer, at least a 386, but preferably a 486; a large hard disk, 380 megabyte at minimum; a multi-tasking operating system that allows one to run some things in batch in the background while scanning or doing text editing, for example, Unix or OS/2 and, theoretically, Windows; a high-speed scanner and scanning software that allows one to make the various adjustments mentioned earlier; a high-resolution monitor (150 dpi ); OCR software and hardware to perform text recognition; an optical disk subsystem on which to archive all the images as the processing is done; file management and tracking software.

ZIDAR opined that the software one purchases was more important than the hardware and might also cost more than the hardware, but it was likely to prove critical to the success or failure of one's system. In addition to a stand-alone scanning workstation for image capture, then, text capture requires one or two editing stations networked to this scanning station to perform editing. Editing the text takes two or three times as long as capturing the images.

Finally, ZIDAR stressed the importance of buying an open system that allows for more than one vendor, complies with standards, and can be upgraded.

WATERS

Yale University Library's master plan to convert microfilm to digital imagery (POB)
The place of electronic tools in the library of the future
The uses of images and an image library
Primary input from preservation microfilm
Features distinguishing POB from CXP and key hypotheses guiding POB
Use of vendor selection process to facilitate organizational work
Criteria for selecting vendor
Finalists and results of process for Yale
Key factor distinguishing vendors
Components, design principles, and some estimated costs of POB
Role of preservation materials in developing imaging market
Factors affecting quality and cost
Factors affecting the usability of complex documents in image form

Donald WATERS, head of the Systems Office, Yale University Library, reported on the progress of a master plan for a project at Yale to convert microfilm to digital imagery, Project Open Book (POB). Stating that POB was in an advanced stage of planning, WATERS detailed, in particular, the process of selecting a vendor partner and several key issues under discussion as Yale prepares to move into the project itself. He commented first on the vision that serves as the context of POB and then described its purpose and scope.

WATERS sees the library of the future not necessarily as an electronic library but as a place that generates, preserves, and improves for its clients ready access to both intellectual and physical recorded knowledge. Electronic tools must find a place in the library in the context of this vision. Several roles for electronic tools include serving as: indirect sources of electronic knowledge or as "finding" aids (the on-line catalogues, the article-level indices, registers for documents and archives); direct sources of recorded knowledge; full-text images; and various kinds of compound sources of recorded knowledge (the so-called compound documents of Hypertext, mixed text and image, mixed-text image format, and multimedia).

POB is looking particularly at images and an image library, the uses to which images will be put (e.g., storage, printing, browsing, and then use as input for other processes), OCR as a subsequent process to image capture, or creating an image library, and also possibly generating microfilm.

While input will come from a variety of sources, POB is considering especially input from preservation microfilm. A possible outcome is that the film and paper which provide the input for the image library eventually may go off into remote storage, and that the image library may be the primary access tool.

The purpose and scope of POB focus on imaging. Though related to CXP, POB has two features which distinguish it: 1) scale--conversion of 10,000 volumes into digital image form; and 2) source--conversion from microfilm. Given these features, several key working hypotheses guide POB, including: 1) Since POB is using microfilm, it is not concerned with the image library as a preservation medium. 2) Digital imagery can improve access to recorded knowledge through printing and network distribution at a modest incremental cost of microfilm. 3) Capturing and storing documents in a digital image form is necessary to further improvements in access. (POB distinguishes between the imaging, digitizing process and OCR, which at this stage it does not plan to perform.)

Currently in its first or organizational phase, POB found that it could use a vendor selection process to facilitate a good deal of the organizational work (e.g., creating a project team and advisory board, confirming the validity of the plan, establishing the cost of the project and a budget, selecting the materials to convert, and then raising the necessary funds).

POB developed numerous selection criteria, including: a firm committed to image-document management, the ability to serve as systems integrator in a large-scale project over several years, interest in developing the requisite software as a standard rather than a custom product, and a willingness to invest substantial resources in the project itself.

Two vendors, DEC and Xerox, were selected as finalists in October 1991, and with the support of the Commission on Preservation and Access, each was commissioned to generate a detailed requirements analysis for the project and then to submit a formal proposal for the completion of the project, which included a budget and costs. The terms were that POB would pay the loser. The results for Yale of involving a vendor included: broad involvement of Yale staff across the board at a relatively low cost, which may have long-term significance in carrying out the project (twenty-five to thirty university people are engaged in POB); better understanding of the factors that affect corporate response to markets for imaging products; a competitive proposal; and a more sophisticated view of the imaging markets.

The most important factor that distinguished the vendors under consideration was their identification with the customer. The size and internal complexity of the company also was an important factor. POB was looking at large companies that had substantial resources. In the end, the process generated for Yale two competitive proposals, with Xerox's the clear winner. WATERS then described the components of the proposal, the design principles, and some of the costs estimated for the process.

Components are essentially four: a conversion subsystem, a network-accessible storage subsystem for 10,000 books (and POB expects 200 to 600 dpi storage), browsing stations distributed on the campus network, and network access to the image printers.

Among the design principles, POB wanted conversion at the highest possible resolution. Assuming TIFF files, TIFF files with Group 4 compression, TCP/IP, and ethernet network on campus, POB wanted a client-server approach with image documents distributed to the workstations and made accessible through native workstation interfaces such as Windows. POB also insisted on a phased approach to implementation: 1) a stand-alone, single-user, low-cost entry into the business with a workstation focused on conversion and allowing POB to explore user access; 2) movement into a higher-volume conversion with network-accessible storage and multiple access stations; and 3) a high-volume conversion, full-capacity storage, and multiple browsing stations distributed throughout the campus.

The costs proposed for start-up assumed the existence of the Yale network and its two DocuTech image printers. Other start-up costs are estimated at $1 million over the three phases. At the end of the project, the annual operating costs estimated primarily for the software and hardware proposed come to about $60,000, but these exclude costs for labor needed in the conversion process, network and printer usage, and facilities management.

Finally, the selection process produced for Yale a more sophisticated view of the imaging markets: the management of complex documents in image form is not a preservation problem, not a library problem, but a general problem in a broad, general industry. Preservation materials are useful for developing that market because of the qualities of the material. For example, much of it is out of copyright. The resolution of key issues such as the quality of scanning and image browsing also will affect development of that market.

The technology is readily available but changing rapidly. In this context of rapid change, several factors affect quality and cost, to which POB intends to pay particular attention, for example, the various levels of resolution that can be achieved. POB believes it can bring resolution up to 600 dpi, but an interpolation process from 400 to 600 is more likely. The variation quality in microfilm will prove to be a highly important factor. POB may reexamine the standards used to film in the first place by looking at this process as a follow-on to microfilming.

Other important factors include: the techniques available to the operator for handling material, the ways of integrating quality control into the digitizing work flow, and a work flow that includes indexing and storage. POB's requirement was to be able to deal with quality control at the point of scanning. Thus, thanks to Xerox, POB anticipates having a mechanism which will allow it not only to scan in batch form, but to review the material as it goes through the scanner and control quality from the outset.

The standards for measuring quality and costs depend greatly on the uses of the material, including subsequent OCR, storage, printing, and browsing. But especially at issue for POB is the facility for browsing. This facility, WATERS said, is perhaps the weakest aspect of imaging technology and the most in need of development.

A variety of factors affect the usability of complex documents in image form, among them: 1) the ability of the system to handle the full range of document types, not just monographs but serials, multi-part monographs, and manuscripts; 2) the location of the database of record for bibliographic information about the image document, which POB wants to enter once and in the most useful place, the on-line catalog; 3) a document identifier for referencing the bibliographic information in one place and the images in another; 4) the technique for making the basic internal structure of the document accessible to the reader; and finally, 5) the physical presentation on the CRT of those documents. POB is ready to complete this phase now. One last decision involves deciding which material to scan.

DISCUSSION

TIFF files constitute de facto standard
NARA's experience with image conversion software and text conversion
RFC 1314
Considerable flux concerning available hardware and software solutions * NAL through-put rate during scanning
Window management questions

In the question-and-answer period that followed WATERS's presentation, the following points emerged:

ZIDAR's statement about using TIFF files as a standard meant de facto standard. This is what most people use and typically exchange with other groups, across platforms, or even occasionally across display software.
HOLMES commented on the unsuccessful experience of NARA in attempting to run image-conversion software or to exchange between applications: What are supposedly TIFF files go into other software that is supposed to be able to accept TIFF but cannot recognize the format and cannot deal with it, and thus renders the exchange useless. Re text conversion, he noted the different recognition rates obtained by substituting the make and model of scanners in NARA's recent test of an "intelligent" character-recognition product for a new company. In the selection of hardware and software, HOLMES argued, software no longer constitutes the overriding factor it did until about a year ago; rather it is perhaps important to look at both now.
Danny Cohen and Alan Katz of the University of Southern California Information Sciences Institute began circulating as an Internet RFC (RFC 1314) about a month ago a standard for a TIFF interchange format for Internet distribution of monochrome bit-mapped images, which LYNCH said he believed would be used as a de facto standard.
FLEISCHHAUER's impression from hearing these reports and thinking about AM's experience was that there is considerable flux concerning available hardware and software solutions. HOOTON agreed and commented at the same time on ZIDAR's statement that the equipment employed affects the results produced. One cannot draw a complete conclusion by saying it is difficult or impossible to perform OCR from scanning microfilm, for example, with that device, that set of parameters, and system requirements, because numerous other people are accomplishing just that, using other components, perhaps. HOOTON opined that both the hardware and the software were highly important. Most of the problems discussed today have been solved in numerous different ways by other people. Though it is good to be cognizant of various experiences, this is not to say that it will always be thus.
At NAL, the through-put rate of the scanning process for paper, page by page, performing OCR, ranges from 300 to 600 pages per day; not performing OCR is considerably faster, although how much faster is not known. This is for scanning from bound books, which is much slower.
WATERS commented on window management questions: DEC proposed an X-Windows solution which was problematical for two reasons. One was POB's requirement to be able to manipulate images on the workstation and bring them down to the workstation itself and the other was network usage.

THOMA

Illustration of deficiencies in scanning and storage process * Image quality in this process
Different costs entailed by better image quality
Techniques for overcoming various de-ficiencies: fixed thresholding, dynamic thresholding, dithering, image merge
Page edge effects

George THOMA, chief, Communications Engineering Branch, National Library of Medicine (NLM), illustrated several of the deficiencies discussed by the previous speakers. He introduced the topic of special problems by noting the advantages of electronic imaging. For example, it is regenerable because it is a coded file, and real-time quality control is possible with electronic capture, whereas in photographic capture it is not.

One of the difficulties discussed in the scanning and storage process was image quality which, without belaboring the obvious, means different things for maps, medical X-rays, or broadcast television. In the case of documents, THOMA said, image quality boils down to legibility of the textual parts, and fidelity in the case of gray or color photo print-type material. Legibility boils down to scan density, the standard in most cases being 300 dpi. Increasing the resolution with scanners that perform 600 or 1200 dpi, however, comes at a cost.

Better image quality entails at least four different kinds of costs: 1) equipment costs, because the CCD (i.e., charge-couple device) with greater number of elements costs more; 2) time costs that translate to the actual capture costs, because manual labor is involved (the time is also dependent on the fact that more data has to be moved around in the machine in the scanning or network devices that perform the scanning as well as the storage); 3) media costs, because at high resolutions larger files have to be stored; and 4) transmission costs, because there is just more data to be transmitted.

But while resolution takes care of the issue of legibility in image quality, other deficiencies have to do with contrast and elements on the page scanned or the image that needed to be removed or clarified. Thus, THOMA proceeded to illustrate various deficiencies, how they are manifested, and several techniques to overcome them.

Fixed thresholding was the first technique described, suitable for black-and-white text, when the contrast does not vary over the page. One can have many different threshold levels in scanning devices. Thus, THOMA offered an example of extremely poor contrast, which resulted from the fact that the stock was a heavy red. This is the sort of image that when microfilmed fails to provide any legibility whatsoever. Fixed thresholding is the way to change the black-to-red contrast to the desired black-to-white contrast.

Other examples included material that had been browned or yellowed by age. This was also a case of contrast deficiency, and correction was done by fixed thresholding. A final example boils down to the same thing, slight variability, but it is not significant. Fixed thresholding solves this problem as well. The microfilm equivalent is certainly legible, but it comes with dark areas. Though THOMA did not have a slide of the microfilm in this case, he did show the reproduced electronic image.

When one has variable contrast over a page or the lighting over the page area varies, especially in the case where a bound volume has light shining on it, the image must be processed by a dynamic thresholding scheme. One scheme, dynamic averaging, allows the threshold level not to be fixed but to be recomputed for every pixel from the neighboring characteristics. The neighbors of a pixel determine where the threshold should be set for that pixel.

THOMA showed an example of a page that had been made deficient by a variety of techniques, including a burn mark, coffee stains, and a yellow marker. Application of a fixed-thresholding scheme, THOMA argued, might take care of several deficiencies on the page but not all of them. Performing the calculation for a dynamic threshold setting, however, removes most of the deficiencies so that at least the text is legible.

Another problem is representing a gray level with black-and-white pixels by a process known as dithering or electronic screening. But dithering does not provide good image quality for pure black-and-white textual material. THOMA illustrated this point with examples. Although its suitability for photoprint is the reason for electronic screening or dithering, it cannot be used for every compound image. In the document that was distributed by CXP, THOMA noticed that the dithered image of the IEEE test chart evinced some deterioration in the text. He presented an extreme example of deterioration in the text in which compounded documents had to be set right by other techniques. The technique illustrated by the present example was an image merge in which the page is scanned twice and the settings go from fixed threshold to the dithering matrix; the resulting images are merged to give the best results with each technique.

THOMA illustrated how dithering is also used in nonphotographic or nonprint materials with an example of a grayish page from a medical text, which was reproduced to show all of the gray that appeared in the original. Dithering provided a reproduction of all the gray in the original of another example from the same text.

THOMA finally illustrated the problem of bordering, or page-edge, effects. Books and bound volumes that are placed on a photocopy machine or a scanner produce page-edge effects that are undesirable for two reasons: 1) the aesthetics of the image; after all, if the image is to be preserved, one does not necessarily want to keep all of its deficiencies; 2) compression (with the bordering problem THOMA illustrated, the compression ratio deteriorated tremendously). One way to eliminate this more serious problem is to have the operator at the point of scanning window the part of the image that is desirable and automatically turn all of the pixels out of that picture to white.

FLEISCHHAUER

AM's experience with scanning bound materials
Dithering

Carl FLEISCHHAUER, coordinator, American Memory, Library of Congress, reported AM's experience with scanning bound materials, which he likened to the problems involved in using photocopying machines. Very few devices in the industry offer book-edge scanning, let alone book cradles. The problem may be unsolvable, FLEISCHHAUER said, because a large enough market does not exist for a preservation-quality scanner. AM is using a Kurzweil scanner, which is a book-edge scanner now sold by Xerox.

Devoting the remainder of his brief presentation to dithering, FLEISCHHAUER related AM's experience with a contractor who was using unsophisticated equipment and software to reduce moire patterns from printed halftones. AM took the same image and used the dithering algorithm that forms part of the same Kurzweil Xerox scanner; it disguised moire patterns much more effectively.

FLEISCHHAUER also observed that dithering produces a binary file which is useful for numerous purposes, for example, printing it on a laser printer without having to "re-halftone" it. But it tends to defeat efficient compression, because the very thing that dithers to reduce moire patterns also tends to work against compression schemes. AM thought the difference in image quality was worth it.

DISCUSSION

Relative use as a criterion for POB's selection of books to be converted into digital form

During the discussion period, WATERS noted that one of the criteria for selecting books among the 10,000 to be converted into digital image form would be how much relative use they would receive--a subject still requiring evaluation. The challenge will be to understand whether coherent bodies of material will increase usage or whether POB should seek material that is being used, scan that, and make it more accessible. POB might decide to digitize materials that are already heavily used, in order to make them more accessible and decrease wear on them. Another approach would be to provide a large body of intellectually coherent material that may be used more in digital form than it is currently used in microfilm. POB would seek material that was out of copyright.

BARONAS

Origin and scope of AIIM
Types of documents produced in AIIM's standards program
Domain of AIIM's standardization work
AIIM's structure
TC 171 and MS23
Electronic image management standards
Categories of EIM standardization where AIIM standards are being developed

Jean BARONAS, senior manager, Department of Standards and Technology, Association for Information and Image Management (AIIM), described the not-for-profit association and the national and international programs for standardization in which AIIM is active.

Accredited for twenty-five years as the nation's standards development organization for document image management, AIIM began life in a library community developing microfilm standards. Today the association maintains both its library and business-image management standardization activities--and has moved into electronic image-management standardization (EIM).

BARONAS defined the program's scope. AIIM deals with: 1) the terminology of standards and of the technology it uses; 2) methods of measurement for the systems, as well as quality; 3) methodologies for users to evaluate and measure quality; 4) the features of apparatus used to manage and edit images; and 5) the procedures used to manage images.

BARONAS noted that three types of documents are produced in the AIIM standards program: the first two, accredited by the American National Standards Institute (ANSI), are standards and standard recommended practices. Recommended practices differ from standards in that they contain more tutorial information. A technical report is not an ANSI standard. Because AIIM's policies and procedures for developing standards are approved by ANSI, its standards are labeled ANSI/AIIM, followed by the number and title of the standard.

BARONAS then illustrated the domain of AIIM's standardization work. For example, AIIM is the administrator of the U.S. Technical Advisory Group (TAG) to the International Standards Organization's (ISO) technical committee, TC l7l Micrographics and Optical Memories for Document and Image Recording, Storage, and Use. AIIM officially works through ANSI in the international standardization process.

BARONAS described AIIM's structure, including its board of directors, its standards board of twelve individuals active in the image-management industry, its strategic planning and legal admissibility task forces, and its National Standards Council, which is comprised of the members of a number of organizations who vote on every AIIM standard before it is published. BARONAS pointed out that AIIM's liaisons deal with numerous other standards developers, including the optical disk community, office and publishing systems, image-codes-and-character set committees, and the National Information Standards Organization (NISO).

BARONAS illustrated the procedures of TC l7l, which covers all aspects of image management. When AIIM's national program has conceptualized a new project, it is usually submitted to the international level, so that the member countries of TC l7l can simultaneously work on the development of the standard or the technical report. BARONAS also illustrated a classic microfilm standard, MS23, which deals with numerous imaging concepts that apply to electronic imaging. Originally developed in the l970s, revised in the l980s, and revised again in l991, this standard is scheduled for another revision. MS23 is an active standard whereby users may propose new density ranges and new methods of evaluating film images in the standard's revision.

BARONAS detailed several electronic image-management standards, for instance, ANSI/AIIM MS44, a quality-control guideline for scanning 8.5" by 11" black-and-white office documents. This standard is used with the IEEE fax image--a continuous tone photographic image with gray scales, text, and several continuous tone pictures--and AIIM test target number 2, a representative document used in office document management.

BARONAS next outlined the four categories of EIM standardization in which AIIM standards are being developed: transfer and retrieval, evaluation, optical disc and document scanning applications, and design and conversion of documents. She detailed several of the main projects of each: 1) in the category of image transfer and retrieval, a bi-level image transfer format, ANSI/AIIM MS53, which is a proposed standard that describes a file header for image transfer between unlike systems when the images are compressed using G3 and G4 compression; 2) the category of image evaluation, which includes the AIIM-proposed TR26 tutorial on image resolution (this technical report will treat the differences and similarities between classical or photographic and electronic imaging); 3) design and conversion, which includes a proposed technical report called "Forms Design Optimization for EIM" (this report considers how general-purpose business forms can be best designed so that scanning is optimized; reprographic characteristics such as type, rules, background, tint, and color will likewise be treated in the technical report); 4) disk and document scanning applications includes a project a) on planning platters and disk management, b) on generating an application profile for EIM when images are stored and distributed on CD-ROM, and c) on evaluating SCSI2, and how a common command set can be generated for SCSI2 so that document scanners are more easily integrated. (ANSI/AIIM MS53 will also apply to compressed images.)

BATTIN

The implications of standards for preservation
A major obstacle to successful cooperation
A hindrance to access in the digital environment
Standards a double-edged sword for those concerned with the preservation of the human record
Near-term prognosis for reliable archival standards
Preservation concerns for electronic media
Need for reconceptualizing our preservation principles
Standards in the real world and the politics of reproduction
Need to redefine the concept of archival and to begin to think in terms of life cycles
Cooperation and the La Guardia Eight
Concerns generated by discussions on the problems of preserving text and image
General principles to be adopted in a world without standards

Patricia BATTIN, president, the Commission on Preservation and Access (CPA), addressed the implications of standards for preservation. She listed several areas where the library profession and the analog world of the printed book had made enormous contributions over the past hundred years--for example, in bibliographic formats, binding standards, and, most important, in determining what constitutes longevity or archival quality.

Although standards have lightened the preservation burden through the development of national and international collaborative programs, nevertheless, a pervasive mistrust of other people's standards remains a major obstacle to successful cooperation, BATTIN said.

The zeal to achieve perfection, regardless of the cost, has hindered rather than facilitated access in some instances, and in the digital environment, where no real standards exist, has brought an ironically just reward.

BATTIN argued that standards are a double-edged sword for those concerned with the preservation of the human record, that is, the provision of access to recorded knowledge in a multitude of media as far into the future as possible. Standards are essential to facilitate interconnectivity and access, but, BATTIN said, as LYNCH pointed out yesterday, if set too soon they can hinder creativity, expansion of capability, and the broadening of access. The characteristics of standards for digital imagery differ radically from those for analog imagery. And the nature of digital technology implies continuing volatility and change. To reiterate, precipitous standard-setting can inhibit creativity, but delayed standard-setting results in chaos.

Since in BATTIN'S opinion the near-term prognosis for reliable archival standards, as defined by librarians in the analog world, is poor, two alternatives remain: standing pat with the old technology, or reconceptualizing.

Preservation concerns for electronic media fall into two general domains. One is the continuing assurance of access to knowledge originally generated, stored, disseminated, and used in electronic form. This domain contains several subdivisions, including 1) the closed, proprietary systems discussed the previous day, bundled information such as electronic journals and government agency records, and electronically produced or captured raw data; and 2) the application of digital technologies to the reformatting of materials originally published on a deteriorating analog medium such as acid paper or videotape.

The preservation of electronic media requires a reconceptualizing of our preservation principles during a volatile, standardless transition which may last far longer than any of us envision today. BATTIN urged the necessity of shifting focus from assessing, measuring, and setting standards for the permanence of the medium to the concept of managing continuing access to information stored on a variety of media and requiring a variety of ever-changing hardware and software for access--a fundamental shift for the library profession.

BATTIN offered a primer on how to move forward with reasonable confidence in a world without standards. Her comments fell roughly into two sections: 1) standards in the real world and 2) the politics of reproduction.

In regard to real-world standards, BATTIN argued the need to redefine the concept of archival and to begin to think in terms of life cycles. In the past, the naive assumption that paper would last forever produced a cavalier attitude toward life cycles. The transient nature of the electronic media has compelled people to recognize and accept upfront the concept of life cycles in place of permanency.

Digital standards have to be developed and set in a cooperative context to ensure efficient exchange of information. Moreover, during this transition period, greater flexibility concerning how concepts such as backup copies and archival copies in the CXP are defined is necessary, or the opportunity to move forward will be lost.

In terms of cooperation, particularly in the university setting, BATTIN also argued the need to avoid going off in a hundred different directions. The CPA has catalyzed a small group of universities called the La Guardia Eight--because La Guardia Airport is where meetings take place--Harvard, Yale, Cornell, Princeton, Penn State, Tennessee, Stanford, and USC, to develop a digital preservation consortium to look at all these issues and develop de facto standards as we move along, instead of waiting for something that is officially blessed. Continuing to apply analog values and definitions of standards to the digital environment, BATTIN said, will effectively lead to forfeiture of the benefits of digital technology to research and scholarship.

Under the second rubric, the politics of reproduction, BATTIN reiterated an oft-made argument concerning the electronic library, namely, that it is more difficult to transform than to create, and nowhere is that belief expressed more dramatically than in the conversion of brittle books to new media. Preserving information published in electronic media involves making sure the information remains accessible and that digital information is not lost through reproduction. In the analog world of photocopies and microfilm, the issue of fidelity to the original becomes paramount, as do issues of "Whose fidelity?" and "Whose original?"

BATTIN elaborated these arguments with a few examples from a recent study conducted by the CPA on the problems of preserving text and image. Discussions with scholars, librarians, and curators in a variety of disciplines dependent on text and image generated a variety of concerns, for example: 1) Copy what is, not what the technology is capable of. This is very important for the history of ideas. Scholars wish to know what the author saw and worked from. And make available at the workstation the opportunity to erase all the defects and enhance the presentation. 2) The fidelity of reproduction--what is good enough, what can we afford, and the difference it makes--issues of subjective versus objective resolution. 3) The differences between primary and secondary users. Restricting the definition of primary user to the one in whose discipline the material has been published runs one headlong into the reality that these printed books have had a host of other users from a host of other disciplines, who not only were looking for very different things, but who also shared values very different from those of the primary user. 4) The relationship of the standard of reproduction to new capabilities of scholarship--the browsing standard versus an archival standard. How good must the archival standard be? Can a distinction be drawn between potential users in setting standards for reproduction? Archival storage, use copies, browsing copies--ought an attempt to set standards even be made? 5) Finally, costs. How much are we prepared to pay to capture absolute fidelity? What are the trade-offs between vastly enhanced access, degrees of fidelity, and costs?

These standards, BATTIN concluded, serve to complicate further the reproduction process, and add to the long list of technical standards that are necessary to ensure widespread access. Ways to articulate and analyze the costs that are attached to the different levels of standards must be found.

Given the chaos concerning standards, which promises to linger for the foreseeable future, BATTIN urged adoption of the following general principles:

Strive to understand the changing information requirements of scholarly disciplines as more and more technology is integrated into the process of research and scholarly communication in order to meet future scholarly needs, not to build for the past. Capture deteriorating information at the highest affordable resolution, even though the dissemination and display technologies will lag.
Develop cooperative mechanisms to foster agreement on protocols for document structure and other interchange mechanisms necessary for widespread dissemination and use before official standards are set.
Accept that, in a transition period, de facto standards will have to be developed.
Capture information in a way that keeps all options open and provides for total convertibility: OCR, scanning of microfilm, producing microfilm from scanned documents, etc.
Work closely with the generators of information and the builders of networks and databases to ensure that continuing accessibility is a primary concern from the beginning.
Piggyback on standards under development for the broad market, and avoid library-specific standards; work with the vendors, in order to take advantage of that which is being standardized for the rest of the world.
Concentrate efforts on managing permanence in the digital world, rather than perfecting the longevity of a particular medium.

DISCUSSION

Additional comments on TIFF

During the brief discussion period that followed BATTIN's presentation, BARONAS explained that TIFF was not developed in collaboration with or under the auspices of AIIM. TIFF is a company product, not a standard, is owned by two corporations, and is always changing. BARONAS also observed that ANSI/AIIM MS53, a bi-level image file transfer format that allows unlike systems to exchange images, is compatible with TIFF as well as with DEC's architecture and IBM's MODCA/IOCA.

HOOTON

Several questions to be considered in discussing text conversion

HOOTON introduced the final topic, text conversion, by noting that it is becoming an increasingly important part of the imaging business. Many people now realize that it enhances their system to be able to have more and more character data as part of their imaging system. Re the issue of OCR versus rekeying, HOOTON posed several questions: How does one get text into computer-readable form? Does one use automated processes? Does one attempt to eliminate the use of operators where possible? Standards for accuracy, he said, are extremely important: it makes a major difference in cost and time whether one sets as a standard 98.5 percent acceptance or 99.5 percent. He mentioned outsourcing as a possibility for converting text. Finally, what one does with the image to prepare it for the recognition process is also important, he said, because such preparation changes how recognition is viewed, as well as facilitates recognition itself.

LESK

Roles of participants in CORE
Data flow
The scanning process
The image interface
Results of experiments involving the use of electronic resources and traditional paper copies
Testing the issue of serendipity
Conclusions

Michael LESK, executive director, Computer Science Research, Bell Communications Research, Inc. (Bellcore), discussed the Chemical Online Retrieval Experiment (CORE), a cooperative project involving Cornell University, OCLC, Bellcore, and the American Chemical Society (ACS).

LESK spoke on 1) how the scanning was performed, including the unusual feature of page segmentation, and 2) the use made of the text and the image in experiments.

Working with the chemistry journals (because ACS has been saving its typesetting tapes since the mid-1970s and thus has a significant back-run of the most important chemistry journals in the United States), CORE is attempting to create an automated chemical library. Approximately a quarter of the pages by square inch are made up of images of quasi-pictorial material; dealing with the graphic components of the pages is extremely important. LESK described the roles of participants in CORE: 1) ACS provides copyright permission, journals on paper, journals on microfilm, and some of the definitions of the files; 2) at Bellcore, LESK chiefly performs the data preparation, while Dennis Egan performs experiments on the users of chemical abstracts, and supplies the indexing and numerous magnetic tapes; 3) Cornell provides the site of the experiment; 4) OCLC develops retrieval software and other user interfaces. Various manufacturers and publishers have furnished other help.

Concerning data flow, Bellcore receives microfilm and paper from ACS; the microfilm is scanned by outside vendors, while the paper is scanned inhouse on an Improvision scanner, twenty pages per minute at 300 dpi, which provides sufficient quality for all practical uses. LESK would prefer to have more gray level, because one of the ACS journals prints on some colored pages, which creates a problem.

Bellcore performs all this scanning, creates a page-image file, and also selects from the pages the graphics, to mix with the text file (which is discussed later in the Workshop). The user is always searching the ASCII file, but she or he may see a display based on the ASCII or a display based on the images.

LESK illustrated how the program performs page analysis, and the image interface. (The user types several words, is presented with a list-- usually of the titles of articles contained in an issue--that derives from the ASCII, clicks on an icon and receives an image that mirrors an ACS page.) LESK also illustrated an alternative interface, based on text on the ASCII, the so-called SuperBook interface from Bellcore.

LESK next presented the results of an experiment conducted by Dennis Egan and involving thirty-six students at Cornell, one third of them undergraduate chemistry majors, one third senior undergraduate chemistry majors, and one third graduate chemistry students. A third of them received the paper journals, the traditional paper copies and chemical abstracts on paper. A third received image displays of the pictures of the pages, and a third received the text display with pop-up graphics.

The students were given several questions made up by some chemistry professors. The questions fell into five classes, ranging from very easy to very difficult, and included questions designed to simulate browsing as well as a traditional information retrieval-type task.

LESK furnished the following results. In the straightforward question search--the question being, what is the phosphorus oxygen bond distance and hydroxy phosphate?--the students were told that they could take fifteen minutes and, then, if they wished, give up. The students with paper took more than fifteen minutes on average, and yet most of them gave up. The students with either electronic format, text or image, received good scores in reasonable time, hardly ever had to give up, and usually found the right answer.

In the browsing study, the students were given a list of eight topics, told to imagine that an issue of the Journal of the American Chemical Society had just appeared on their desks, and were also told to flip through it and to find topics mentioned in the issue. The average scores were about the same. (The students were told to answer yes or no about whether or not particular topics appeared.) The errors, however, were quite different. The students with paper rarely said that something appeared when it had not. But they often failed to find something actually mentioned in the issue. The computer people found numerous things, but they also frequently said that a topic was mentioned when it was not. (The reason, of course, was that they were performing word searches. They were finding that words were mentioned and they were concluding that they had accomplished their task.)

This question also contained a trick to test the issue of serendipity. The students were given another list of eight topics and instructed, without taking a second look at the journal, to recall how many of this new list of eight topics were in this particular issue. This was an attempt to see if they performed better at remembering what they were not looking for. They all performed about the same, paper or electronics, about 62 percent accurate. In short, LESK said, people were not very good when it came to serendipity, but they were no worse at it with computers than they were with paper.

(LESK gave a parenthetical illustration of the learning curve of students who used SuperBook.)

The students using the electronic systems started off worse than the ones using print, but by the third of the three sessions in the series had caught up to print. As one might expect, electronics provide a much better means of finding what one wants to read; reading speeds, once the object of the search has been found, are about the same.

Almost none of the students could perform the hard task--the analogous transformation. (It would require the expertise of organic chemists to complete.) But an interesting result was that the students using the text search performed terribly, while those using the image system did best. That the text search system is driven by text offers the explanation. Everything is focused on the text; to see the pictures, one must press on an icon. Many students found the right article containing the answer to the question, but they did not click on the icon to bring up the right figure and see it. They did not know that they had found the right place, and thus got it wrong.

The short answer demonstrated by this experiment was that in the event one does not know what to read, one needs the electronic systems; the electronic systems hold no advantage at the moment if one knows what to read, but neither do they impose a penalty.

LESK concluded by commenting that, on one hand, the image system was easy to use. On the other hand, the text display system, which represented twenty man-years of work in programming and polishing, was not winning, because the text was not being read, just searched. The much easier system is highly competitive as well as remarkably effective for the actual chemists.

ERWAY

Most challenging aspect of working on AM
Assumptions guiding AM's approach
Testing different types of service bureaus
AM's requirement for 99.95 percent accuracy
Requirements for text-coding
Additional factors influencing AM's approach to coding
Results of AM's experience with rekeying
Other problems in dealing with service bureaus
Quality control the most time-consuming aspect of contracting out conversion
Long-term outlook uncertain

To Ricky ERWAY, associate coordinator, American Memory, Library of Congress, the constant variety of conversion projects taking place simultaneously represented perhaps the most challenging aspect of working on AM. Thus, the challenge was not to find a solution for text conversion but a tool kit of solutions to apply to LC's varied collections that need to be converted. ERWAY limited her remarks to the process of converting text to machine-readable form, and the variety of LC's text collections, for example, bound volumes, microfilm, and handwritten manuscripts.

Two assumptions have guided AM's approach, ERWAY said: 1) A desire not to perform the conversion inhouse. Because of the variety of formats and types of texts, to capitalize the equipment and have the talents and skills to operate them at LC would be extremely expensive. Further, the natural inclination to upgrade to newer and better equipment each year made it reasonable for AM to focus on what it did best and seek external conversion services. Using service bureaus also allowed AM to have several types of operations take place at the same time. 2) AM was not a technology project, but an effort to improve access to library collections. Hence, whether text was converted using OCR or rekeying mattered little to AM. What mattered were cost and accuracy of results.

AM considered different types of service bureaus and selected three to perform several small tests in order to acquire a sense of the field. The sample collections with which they worked included handwritten correspondence, typewritten manuscripts from the 1940s, and eighteenth-century printed broadsides on microfilm. On none of these samples was OCR performed; they were all rekeyed. AM had several special requirements for the three service bureaus it had engaged. For instance, any errors in the original text were to be retained. Working from bound volumes or anything that could not be sheet-fed also constituted a factor eliminating companies that would have performed OCR.

AM requires 99.95 percent accuracy, which, though it sounds high, often means one or two errors per page. The initial batch of test samples contained several handwritten materials for which AM did not require text-coding. The results, ERWAY reported, were in all cases fairly comparable: for the most part, all three service bureaus achieved 99.95 percent accuracy. AM was satisfied with the work but surprised at the cost.

As AM began converting whole collections, it retained the requirement for 99.95 percent accuracy and added requirements for text-coding. AM needed to begin performing work more than three years ago before LC requirements for SGML applications had been established. Since AM's goal was simply to retain any of the intellectual content represented by the formatting of the document (which would be lost if one performed a straight ASCII conversion), AM used "SGML-like" codes. These codes resembled SGML tags but were used without the benefit of document-type definitions. AM found that many service bureaus were not yet SGML-proficient.

Additional factors influencing the approach AM took with respect to coding included: 1) the inability of any known microcomputer-based user-retrieval software to take advantage of SGML coding; and 2) the multiple inconsistencies in format of the older documents, which confirmed AM in its desire not to attempt to force the different formats to conform to a single document-type definition (DTD) and thus create the need for a separate DTD for each document.

The five text collections that AM has converted or is in the process of converting include a collection of eighteenth-century broadsides, a collection of pamphlets, two typescript document collections, and a collection of 150 books.

ERWAY next reviewed the results of AM's experience with rekeying, noting again that because the bulk of AM's materials are historical, the quality of the text often does not lend itself to OCR. While non-English speakers are less likely to guess or elaborate or correct typos in the original text, they are also less able to infer what we would; they also are nearly incapable of converting handwritten text. Another disadvantage of working with overseas keyers is that they are much less likely to telephone with questions, especially on the coding, with the result that they develop their own rules as they encounter new situations.

Government contracting procedures and time frames posed a major challenge to performing the conversion. Many service bureaus are not accustomed to retaining the image, even if they perform OCR. Thus, questions of image format and storage media were somewhat novel to many of them. ERWAY also remarked other problems in dealing with service bureaus, for example, their inability to perform text conversion from the kind of microfilm that LC uses for preservation purposes.

But quality control, in ERWAY's experience, was the most time-consuming aspect of contracting out conversion. AM has been attempting to perform a 10-percent quality review, looking at either every tenth document or every tenth page to make certain that the service bureaus are maintaining 99.95 percent accuracy. But even if they are complying with the requirement for accuracy, finding errors produces a desire to correct them and, in turn, to clean up the whole collection, which defeats the purpose to some extent. Even a double entry requires a character-by-character comparison to the original to meet the accuracy requirement. LC is not accustomed to publish imperfect texts, which makes attempting to deal with the industry standard an emotionally fraught issue for AM. As was mentioned in the previous day's discussion, going from 99.95 to 99.99 percent accuracy usually doubles costs and means a third keying or another complete run-through of the text.

Although AM has learned much from its experiences with various collections and various service bureaus, ERWAY concluded pessimistically that no breakthrough has been achieved. Incremental improvements have occurred in some of the OCR technology, some of the processes, and some of the standards acceptances, which, though they may lead to somewhat lower costs, do not offer much encouragement to many people who are anxiously awaiting the day that the entire contents of LC are available on-line.

ZIDAR

Several answers to why one attempts to perform full-text conversion
Per page cost of performing OCR
Typical problems encountered during editing
Editing poor copy OCR vs. rekeying

Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), offered several answers to the question of why one attempts to perform full-text conversion: 1) Text in an image can be read by a human but not by a computer, so of course it is not searchable and there is not much one can do with it. 2) Some material simply requires word-level access. For instance, the legal profession insists on full-text access to its material; with taxonomic or geographic material, which entails numerous names, one virtually requires word-level access. 3) Full text permits rapid browsing and searching, something that cannot be achieved in an image with today's technology. 4) Text stored as ASCII and delivered in ASCII is standardized and highly portable. 5) People just want full-text searching, even those who do not know how to do it. NAL, for the most part, is performing OCR at an actual cost per average-size page of approximately $7. NAL scans the page to create the electronic image and passes it through the OCR device.

ZIDAR next rehearsed several typical problems encountered during editing. Praising the celerity of her student workers, ZIDAR observed that editing requires approximately five to ten minutes per page, assuming that there are no large tables to audit. Confusion among the three characters I, 1, and l, constitutes perhaps the most common problem encountered. Zeroes and O's also are frequently confused. Double M's create a particular problem, even on clean pages. They are so wide in most fonts that they touch, and the system simply cannot tell where one letter ends and the other begins. Complex page formats occasionally fail to columnate properly, which entails rescanning as though one were working with a single column, entering the ASCII, and decolumnating for better searching. With proportionally spaced text, OCR can have difficulty discerning what is a space and what are merely spaces between letters, as opposed to spaces between words, and therefore will merge text or break up words where it should not.

ZIDAR said that it can often take longer to edit a poor-copy OCR than to key it from scratch. NAL has also experimented with partial editing of text, whereby project workers go into and clean up the format, removing stray characters but not running a spell-check. NAL corrects typos in the title and authors' names, which provides a foothold for searching and browsing. Even extremely poor-quality OCR (e.g., 60-percent accuracy) can still be searched, because numerous words are correct, while the important words are probably repeated often enough that they are likely to be found correct somewhere. Librarians, however, cannot tolerate this situation, though end users seem more willing to use this text for searching, provided that NAL indicates that it is unedited. ZIDAR concluded that rekeying of text may be the best route to take, in spite of numerous problems with quality control and cost.

DISCUSSION

Modifying an image before performing OCR
NAL's costs per page
AM's costs per page and experience with Federal Prison Industries * Elements comprising NATDP's costs per page
OCR and structured markup
Distinction between the structure of a document and its representation when put on the screen or printed

HOOTON prefaced the lengthy discussion that followed with several comments about modifying an image before one reaches the point of performing OCR. For example, in regard to an application containing a significant amount of redundant data, such as form-type data, numerous companies today are working on various kinds of form renewal, prior to going through a recognition process, by using dropout colors. Thus, acquiring access to form design or using electronic means are worth considering. HOOTON also noted that conversion usually makes or breaks one's imaging system. It is extremely important, extremely costly in terms of either capital investment or service, and determines the quality of the remainder of one's system, because it determines the character of the raw material used by the system.

Concerning the four projects undertaken by NAL, two inside and two performed by outside contractors, ZIDAR revealed that an in-house service bureau executed the first at a cost between $8 and $10 per page for everything, including building of the database. The project undertaken by the Consultative Group on International Agricultural Research (CGIAR) cost approximately $10 per page for the conversion, plus some expenses for the software and building of the database. The Acid Rain Project--a two-disk set produced by the University of Vermont, consisting of Canadian publications on acid rain--cost $6.70 per page for everything, including keying of the text, which was double keyed, scanning of the images, and building of the database. The in-house project offered considerable ease of convenience and greater control of the process. On the other hand, the service bureaus know their job and perform it expeditiously, because they have more people.

As a useful comparison, ERWAY revealed AM's costs as follows: $0.75 cents to $0.85 cents per thousand characters, with an average page containing 2,700 characters. Requirements for coding and imaging increase the costs. Thus, conversion of the text, including the coding, costs approximately $3 per page. (This figure does not include the imaging and database-building included in the NAL costs.) AM also enjoyed a happy experience with Federal Prison Industries, which precluded the necessity of going through the request-for-proposal process to award a contract, because it is another government agency. The prisoners performed AM's rekeying just as well as other service bureaus and proved handy as well. AM shipped them the books, which they would photocopy on a book-edge scanner. They would perform the markup on photocopies, return the books as soon as they were done with them, perform the keying, and return the material to AM on WORM disks.

ZIDAR detailed the elements that constitute the previously noted cost of approximately $7 per page. Most significant is the editing, correction of errors, and spell-checkings, which though they may sound easy to perform require, in fact, a great deal of time. Reformatting text also takes a while, but a significant amount of NAL's expenses are for equipment, which was extremely expensive when purchased because it was one of the few systems on the market. The costs of equipment are being amortized over five years but are still quite high, nearly $2,000 per month.

HOCKEY raised a general question concerning OCR and the amount of editing required (substantial in her experience) to generate the kind of structured markup necessary for manipulating the text on the computer or loading it into any retrieval system. She wondered if the speakers could extend the previous question about the cost-benefit of adding or exerting structured markup. ERWAY noted that several OCR systems retain italics, bolding, and other spatial formatting. While the material may not be in the format desired, these systems possess the ability to remove the original materials quickly from the hands of the people performing the conversion, as well as to retain that information so that users can work with it. HOCKEY rejoined that the current thinking on markup is that one should not say that something is italic or bold so much as why it is that way. To be sure, one needs to know that something was italicized, but how can one get from one to the other? One can map from the structure to the typographic representation.

FLEISCHHAUER suggested that, given the 100 million items the Library holds, it may not be possible for LC to do more than report that a thing was in italics as opposed to why it was italics, although that may be desirable in some contexts. Promising to talk a bit during the afternoon session about several experiments OCLC performed on automatic recognition of document elements, and which they hoped to extend, WEIBEL said that in fact one can recognize the major elements of a document with a fairly high degree of reliability, at least as good as OCR. STEVENS drew a useful distinction between standard, generalized markup (i.e., defining for a document-type definition the structure of the document), and what he termed a style sheet, which had to do with italics, bolding, and other forms of emphasis. Thus, two different components are at work, one being the structure of the document itself (its logic), and the other being its representation when it is put on the screen or printed.

[previous] [Table of Contents] [Next]

[Search all CoOL documents]

[previous] [Table of Contents] [Next]

Workshop on Electronic Texts Session IV.

Image Capture, Text Capture, Overview of Text and Image Storage Formats

KENNEY

ANDRE

ZIDAR

WATERS

DISCUSSION

THOMA

FLEISCHHAUER

DISCUSSION

BARONAS

BATTIN

DISCUSSION

HOOTON

LESK

ERWAY

ZIDAR

DISCUSSION

[previous] [Table of Contents] [Next]

Workshop on Electronic Texts
Session IV.