The Cornell / Xerox / Commission on Preservation and Access Joint
Study in Digital Preservation
Notes
1. Digital image technology, for the
purposes of this report, is defined as the electronic copying of
scanned documents in image form. The text contained in these images
is not converted to alphanumeric representation at the time of
scanning, although the potential exists for such conversion, in
whole or in part, from the digital files at some later time. The
present capabilities of optical character recognition are inadequate
for capturing both the information and the presentation of the
original page, which is critical when replacing rapidly
self-destructing books, especially when one considers the vast
number of languages, illustrations, type faces, and printing
techniques present in the collections of modern research libraries.
The creation of digital images does not preclude the use of OCR
capabilities. In fact it represents the first step in that
direction--the scanning of paper copies to which character
recognition can then be applied. See for instance: Stephen Smith and
Craig Stanfill, "An Analysis of the Effects of Data Corruption on
Text Retrieval Performance," (Thinking Machines Corporation,
Cambridge, MA: December 14, 1988).
2. The Joint Study compared the quality and costs
associated with monochromatic scanning and photocopying only.
3. Conceivably, this may at some point allow
librarians to propose other service alternatives as a substitute for
traditional shelf storage.
4. Contrary to the frequently expressed concern
about the longevity of the physical storage medium itself, it is the
obsolescence of standards, formats, and access software tools that
is of greatest concern. The physical medium will normally long
outlive these considerations.
5. This Report covers the period of the project
ending December 31, 1991. Subsequent to this date, Cornell project
staff have verified that the microfilm produced digitally by this
project does not meet microfilm preservation standards. This
is not surprising given the scanning resolution. However, such
microfilm may nevertheless be adequate for preserving texts produced
at 4 point type and larger. In addition, early experiments suggest
that halftone images can be scanned with resultant quality superior
to that normally obtained with most production microfilming
processes. Quality issues will be discussed in subsequent
reports.
6. The current national preservation program to
preserve brittle material is based on the replacement of originals
with copies that faithfully capture their intellectual content,
including text, illustrations, and presentation. In order to
preserve the largest number of items possible, the time spent in
copying material should occur just once and should result in the
production of a print master that can be used to make subsequent
copies at lower costs. Information about the availability of copies
should be widely publicized and included in the national on-line
bibliographic databases. Finally, a preservation master of the
original should be stored and maintained in a manner that will
guarantee its long-term availability.
7. For instance, in the field of mathematics from
which over half of the materials were selected, users "object to the
inconvenience of microfilm, especially for monographs...Hardcopy
reformatting (through photocopying) of older monographs is the
preferred way to provide access in many libraries." Constance C.
Gould and Karla Pearce, Information Needs in the Sciences: An
Assessment, (Mountain View, CA: Research Libraries Group,
Inc., 1991), pp. 65-68.
8. For a Xerox Corporation perspective on the
importance of co-development, see William Anderson, William Crocca,
and Steven Barley, "Customer Co-Development: The Cornell/Xerox Joint
Study Project Interim Report," PARC Technical Report SSL-91-139.
9. One thousand books were chosen for scanning.
Fifty of the most heavily illustrated ones have been reserved for
scanning using the windowing capabilities recently developed by
Xerox.
10. Katz, A., and Cohen, D., Network FAX Working
Group of the Internet Engineering Task Force, A File Format for the
Exchange of Images in the Internet, Request for Comments
number 1314, April 1992.
11. Digital files must be created in a manner that
provides users with instructions on how to gain access to the
information contained in them. It is one thing to store information
on a disk, and another to gain access to it. Material cannot be
considered preserved if one cannot "read" it. Thus a file must
contain documentation on its format. Though there are many competing
file formats, TIFF is in wide use. Unfortunately there are multiple
TIFF formats, but a committee currently exists to address this
issue. Today TIFF comes close to representing an industry standard.
Aldus Corporation and Microsoft Corporation, "Tag Image File
Specification Revision 5.0" (Aldus/Microsoft Technical memorandum,
August 1988).
12. The International Telegraph and Telephone
Consultative Committee (CCITT) has originated two algorithms, Group
3 and Group 4, that are in wide use for black and white images.
13. Norvell M.M. Jones, Archival Copies of
Thermofax, Verifax, and Other Unstable Records. National
Archives Technical Information Paper No. 5 (Washington: National
Archives and Records Administration, 1990). ANSI Standard
Z39.48-1984, currently being revised, covers the requirements for
permanent/durable paper. See also RLG Preservation Manual (1986) and
the Reproduction of Library Materials (ALA) draft photocopy
guidelines of the Subcommittee on Preservation Photocopying
Guidelines. The guidelines currently available for preservation
photocopying place greater emphasis on image stability and paper
permanence than image quality.
14. Cornell did prepare a Preservation Scope Note
for the mathematics material which appears in the RLIN Conspectus.
Preservation Scope Notes provide RLG and individual institutions
with information about large preservation projects, both in progress
and completed, to assist in the planning and coordination of
preservation activities.
15. Format Integration and Its Effect on the
USMARC Bibliographic Format, Library of Congress, 1988.
Prepared by Network Development and MARC Standards Office.
16. Performance issues associated with reading
material from the network will be addressed in the Testbed Project,
begun in January 1992.
17. The film emulsion layer is unusually thin and
characterized by extremely fine grains and a relatively high silver
to gel ratio; the support is ESTAR base, a clear 4-mil polyester
film. Based on discussions with technical experts at Kodak and
University Microfilms, it appears that the archival properties of
the SO-219 are questionable. Image Graphics is investigating the use
of Image Link film for subsequent tests.
18. Subsequent to the close of Phase 1, the
microfilm was indeed produced. The quality will be discussed in
subsequent reports.
19. Donald J. Waters, From Microfilm to
Digital Imagery: On the feasibility of a project to study
means, costs, and benefits of converting large quantities of
preserved library materials from microfilm to digital images
(Washington: The Commission on Preservation and Access, 1991).
20. The selection process is described by Steven
Rockey in "The Cornell-Xerox-CPA Project to Digitally Reformat
Books," paper presented to the AMS/MAA Joint Mathematics Meetings,
Baltimore, MD, January 8-11, 1992. A bibliography of the mathematics
books preserved in this project is included as Appendix VII. A
bibliography of all volumes scanned in this project can be prepared
by conducting a search on RLIN using the Series Note ("CXJSP"), and
downloading the on-line records.
21. Disbinding books with minimal artifactual
value met little faculty resistance when high-quality replacement
facsimiles were produced, and additional copies can be printed on
demand.
22. It is anticipated that as data exchange
standards are developed and implemented, the time between refreshing
will increase from four years to ten years and beyond. See for
instance, Charles M. Dollar, "The Impact of Information
Technologies on Archival Principles and Practices: Some
Considerations," Draft Version 16, November 15, 1990, p. 63.
23. This study investigated the quality achieved
with binary scanning only. Depending on the object being scanned,
grey scale or color scanning may be superior, and the
advantages/disadvantages of the various approaches need to be
examined. Scanning resolutions and file formats can represent a
complex tradeoff between time, file size, fidelity, on-screen
display, printing, and equipment availability. The study had as a
primary emphasis the production of printed facsimiles that were
largely black and white text in a timely and cost-effective manner.
With binary scanning, large files may be compressed efficiently and
in a lossless manner using CCITT Group IV Facsimile compression
algorithms. Grey scale compression, using JPEG, is much less
economical and is "lossy," which may make it inappropriate as a
preservation method. It appears that while binary files produce a
high quality printed version, other combinations of spatial
resolution with grey and/or color will also be adequate. Grey scale
can offer an advantage for on-screen viewing. For instance, on a low
resolution screen display, two bits of grey at 100 dpi may be more
readable than 600 dpi or 300 dpi binary. The advantage is lost,
however, when the on-screen image is enlarged. The quality
associated with binary or grey scale is also dependent on the
equipment used; for instance, binary scanning produces a better paper
copy when it is printed on a binary printer. See Michael Ester,
"Image Quality and User Perception," LEONARDO Digital Image, Digital
Cinema Supplemental Issue, (1990) pp. 51-63.
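The storage side of the tradeoff described in note 23 can be sketched with simple arithmetic. The following is only an illustration: it counts raw, uncompressed bits, and the 6 x 9 inch page size is a hypothetical value chosen for the example, not a figure from the study. Compression (such as CCITT Group 4 for binary images) would reduce the binary figures.

```python
def raw_bits_per_square_inch(dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed bits needed to cover one square inch of the page."""
    return dpi * dpi * bits_per_pixel


def raw_page_bytes(width_in: int, height_in: int, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of one page image in bytes, rounded up."""
    bits = width_in * height_in * raw_bits_per_square_inch(dpi, bits_per_pixel)
    return (bits + 7) // 8


# Hypothetical page dimensions, assumed only for this illustration.
PAGE_WIDTH_IN = 6
PAGE_HEIGHT_IN = 9

# The three combinations named in the note: (label, dpi, bits per pixel).
SETTINGS = [
    ("2-bit grey, 100 dpi", 100, 2),
    ("binary, 300 dpi", 300, 1),
    ("binary, 600 dpi", 600, 1),
]

for label, dpi, depth in SETTINGS:
    size = raw_page_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, dpi, depth)
    print(f"{label}: {size:,} bytes uncompressed")
```

Under these assumptions the 100 dpi grey image is the smallest of the three before compression, which is consistent with its use for on-screen viewing rather than for printed facsimiles.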
24. Generational loss is acknowledged in the draft
photocopying guidelines of the Subcommittee on Preservation
Photocopying Guidelines, of the Reproduction of Library Materials
Section of ALA. The August 1991 version emphasizes that acceptable
copy image quality should consider reproducibility (i.e., can the
text be copied again). The generational loss with microfilm is not
as great, but does represent about a 10% reduction in resolution
with each generation. As such the technical specifications for
microfilm vary from one generation to the next. See, for example,
Research Libraries Group, Inc., RLG Preservation Microfilming
Handbook, edited by Nancy E. Elkington, (Mountain View, CA: The
Research Libraries Group, Inc., 1992), Appendix 18. See also, Don
Willis, A Hybrid Systems Approach to Preserving Printed Materials
using Microfilm and Digital Imaging, presentation at the AIIM
conference, April 1991.
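The generational loss figure in note 24 can be illustrated with a short calculation. This sketch assumes that the roughly 10% reduction compounds multiplicatively with each generation, which the note does not state explicitly; the starting resolution is normalized to 1.0 rather than taken from any measurement.

```python
def resolution_after(generations: int,
                     original: float = 1.0,
                     loss_per_generation: float = 0.10) -> float:
    """Resolution remaining after a number of film generations.

    Assumes each generation keeps (1 - loss_per_generation) of the
    resolution of the generation it was copied from.
    """
    return original * (1.0 - loss_per_generation) ** generations


# Generation 0 is the camera original; each later generation is a copy.
for n in range(4):
    print(f"generation {n}: {resolution_after(n):.3f} of original resolution")
```

Under this assumption a second-generation copy retains about 81% of the original resolution, which is one way to see why technical specifications differ from one generation to the next.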
25. A process of auto-segmentation, which
incorporates the windowing function automatically as a page is
scanned, is being refined by Xerox. When available, it will increase
the speed of capture for illustrated text.
26. An excellent discussion of relating
photographic quality indexes with digital scanning is presented in
AIIM Technical Report (TR 26), "A Tutorial on Photographic and
Electronic Imaging Resolution," draft, 2/5/92. See also Tom Bagg,
"Image Quality," paper presented to the Digital Image Applications
Group, Sept. 25, 1986; and Don Willis, "A Hybrid Systems Approach to
Preserving Printed Materials using Microfilm and Digital Imaging,"
draft paper, 1991, unnumbered.
27. Nonetheless, Xerox has concurred with the
figure used in the cost study.
28. Costs associated with digital technology are
derived from Table A. The numbers in [brackets] refer to line
numbers in Table A. Overhead reflects the general and administrative
costs and profit margin that would be included by an outside vendor.
The 1992 cost of photocopying is based on two quotes for
photocopying and binding a 300 page book (Library Bindery Service
and Ridley's Book Bindery). The average annual inflation rate is
calculated at 5%.
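The 5% inflation assumption in note 28 can be applied as follows. This is a minimal sketch that assumes annual compounding from a 1992 base; the base cost of 100.00 is a placeholder, not a figure from the cost study.

```python
def inflate(cost_1992: float, year: int, annual_rate: float = 0.05) -> float:
    """Project a 1992 cost to a later year, compounding once per year."""
    if year < 1992:
        raise ValueError("year must be 1992 or later")
    return cost_1992 * (1.0 + annual_rate) ** (year - 1992)


# Placeholder base cost, assumed only for illustration.
BASE_COST_1992 = 100.00

for year in (1992, 1996, 2000):
    print(f"{year}: {inflate(BASE_COST_1992, year):.2f}")
```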
29. The numbers in [brackets] for digital
technology refer to line numbers in Table A. A book scanned in 1992
will be refreshed twice in the next decade, in 1996 and 2000.
Overhead reflects the general and administrative costs and profit
margin that would be included by an outside vendor. Microfilm
figures are based on 1992 prices quoted by MicrogrAphics
Preservation Service (MAPS). Cost of archival master is based on
$.195/frame for one-up and two-up filming. Cost of print master is
$15. For two-up filming, assume six books can be stored on each
roll; for one-up filming, assume three books. The cost of one book
on the print master will be $5.00 (one-up) or $2.50 (two-up).
Storage costs are based on $1/year to store one roll of film. The
cost of book storage/year will equal $1 divided by 3 (one-up) or by
6 (two-up). Since two generations are being stored, the cost equals
$.66 (one-up) and $.33 (two-up) per year times 10 years, or $6.66
and $3.33 respectively.
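The per-book microfilm figures in note 29 can be reproduced with the sketch below. All input values come from the note; the function and variable names are illustrative only. Exact fractions are used so that the rounding is visible: ten years of storage comes to 20/3 and 10/3 dollars, which the note reports as $6.66 and $3.33.

```python
from fractions import Fraction

PRINT_MASTER_ROLL_COST = Fraction(15)  # dollars per print-master roll
ROLL_STORAGE_PER_YEAR = Fraction(1)    # dollars to store one roll for a year
GENERATIONS_STORED = 2                 # two film generations are stored
YEARS = 10
BOOKS_PER_ROLL = {"one-up": 3, "two-up": 6}


def per_book_costs(mode: str) -> tuple[Fraction, Fraction]:
    """Return (print master cost, ten-year storage cost) for one book."""
    books = BOOKS_PER_ROLL[mode]
    print_master = PRINT_MASTER_ROLL_COST / books
    storage_per_year = GENERATIONS_STORED * ROLL_STORAGE_PER_YEAR / books
    return print_master, storage_per_year * YEARS


for mode in ("one-up", "two-up"):
    master, storage = per_book_costs(mode)
    print(f"{mode}: print master ${float(master):.2f}, "
          f"ten-year storage ${float(storage):.2f}")
```

Printed to two decimal places, the one-up storage figure appears as 6.67 because it is rounded rather than truncated.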
30. The numbers in [brackets] for digital
technology refer to line numbers in Table A. Overhead reflects the
general and administrative costs and profit margin that would be
included by an outside vendor. The binding cost included here
assumes that 20% of all requests for subsequent copies will be bound
with a full cloth library binding, 40% will be bound using Docutech
in-line tape binding, and 40% will be unbound or stapled. If we
assumed that all subsequent copies were bound in a full cloth
binding the total digital cost would rise to $19.81 in 1992
dollars.
1-1. Chart IEEE Std 167A-1987. Prepared by the
IEEE Facsimile Subcommittee and printed by Eastman Kodak Company.
For use in accordance with IEEE Std 167-1966, Test Procedure for
Facsimile. Copyright 1987, Institute of Electrical and Electronics
Engineers.
2-1. The research and development flavor of the
study was reflected in fluctuations in scanning productivity.
Between April 5 and May 24, 1991--an eight week period--the average
weekly scan rate was 6,795 pages, which represents 22.65 books/week.
This highly productive period was followed by a week in which only
7.5 books were scanned. System upgrades occurred at regular
intervals throughout the year and a reduction in scanning production
invariably accompanied software installation. Installation itself
usually took a day for testing and debugging. Technicians had to
prepare for the installation by clearing the hard disk of work in
progress. They then had to learn the new system. Difficulties
associated with installing new software on a networked system also
were common. For instance, during the week that the P1 software was
installed, 3,883 images were scanned; the week the P2.0 software was
installed only 3,245 images were scanned; and the week the P2.1
software was installed only 2,778 images were scanned.
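The productivity figures in note 2-1 can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the note; it assumes that one scanned image corresponds to one page, which the note implies but does not state.

```python
AVG_PAGES_PER_WEEK = 6795    # average weekly scan rate in the productive period
AVG_BOOKS_PER_WEEK = 22.65   # the same rate expressed in books

# Implied average book length (about 300 pages).
pages_per_book = AVG_PAGES_PER_WEEK / AVG_BOOKS_PER_WEEK

# Images scanned during each week in which new software was installed.
upgrade_weeks = {"P1": 3883, "P2.0": 3245, "P2.1": 2778}

# Output of each upgrade week as a share of the productive-period average.
share_of_average = {
    release: images / AVG_PAGES_PER_WEEK
    for release, images in upgrade_weeks.items()
}

print(f"implied pages per book: {pages_per_book:.0f}")
for release, share in share_of_average.items():
    print(f"{release}: {share:.0%} of the average weekly rate")
```

By this measure each installation week ran at roughly 40% to 60% of the average rate for the productive period.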
2-2. Statistics prepared by Dorothy Wright,
Preservation Librarian, Mann Library, Cornell University, December
1991.
3-1. For instance, subsequent iterations of
system software will increase the speed of scanning. Xerox has
developed a fast scan capability which delays the document structure
building until after the actual scanning has been completed. This
upgrade has been tested on a scanning workstation located in
Cornell's book store and its use at 300 dpi scanning led to a
doubling of the production rate. Cornell did experiment with using a
feed mechanism. It was determined that pages that were only
marginally brittle (i.e., it took five double corner folds before
the paper broke) could survive most paper jams. Libraries may be
willing to risk a paper jam to achieve faster production rates for
material held by a number of libraries. Before feed mechanisms can
be used with this system, however, registration and deskewing must
become software functions.