The Commission on Preservation and Access

Preserving the Whole:
A Two-Track Approach to Rescuing Data and Metadata


Interim Report to the Commission on Preservation and Access

Ann Gerken Green and JoAnn Dionne
December 20, 1996

Project Overview

This project employs a two-track preservation strategy of migrating digital files and digitizing related paper records to enhance access. At the halfway point of the project, we have made considerable progress in evaluating the alternative formats for migrating the original data files from tape and have focused on the benefits and drawbacks of each alternative. The project has allowed us to investigate in detail the implications of preserving data in their original format versus migrating them to a restructured format. We are now finishing the data file evaluation and concentrating on the documentation scanning process and output evaluations.

The Roper Collection at Yale

The Yale University Library, one of the first academic libraries to form a collection of machine-readable data, began collecting numeric data in 1972. Over the years, Yale has copied its data from one form of digital storage to another as mainframe computer technology has dictated. The copying of data, while labor-intensive, was straightforward: it created exact logical copies, moving them from out-of-date media to newer data storage formats. Now, as users move from the mainframe to distributed computing systems and from one hardware and software configuration to another, digital formats require not just simple duplication but restructuring. The migration of data from tapes is becoming more urgent as access to the Yale mainframe, and support in using it, are being discontinued.

The Yale Roper Collection includes materials from the Roper Center for Public Opinion Research, whose data files comprise a rich resource for research in political psychology and sociology. They provide a record of public opinion research in the U.S. from 1935 to the present, along with surveys conducted abroad since the 1940s. The Yale Roper Collection materials also include paper records such as questionnaires, information on sample sizes, and other notes necessary for use of the data files. Many of the paper records are brittle or produced through unstable copying technologies such as mimeograph. The paper preservation needs have not been addressed until now.

Selection of Datasets

Our initial discussions led us to select the Roper Reports as representative of the entire collection. These studies are a significant, heavily used part of the Yale Roper Collection. The Roper Reports have been produced since 1973 by the Roper Organization, a commercial polling company now known as Roper Starch Worldwide, Inc. Each Roper Report has 1500-2000 respondents and 200-300 variables, and the surveys are conducted ten times per year. Datasets include demographic information such as age, sex, race, economic level, education, marital status, union membership, religion and political affiliation. Questions cover a broad array of the issues facing society such as energy, politics, media, health and medical care, consumer behavior, education, and foreign policy.

The Roper Reports are not 'clean': they are in column-binary format, do not have machine-readable documentation supplied with the datasets, and often have handwritten notes in the margins of the questionnaires that document the data files. They thus represent the problems inherent in the rest of the Yale Roper Collection. Yale owns approximately 200 Roper Reports, from which 10 studies were selected to span the full range of years and to capture any differences in format or documentation. Three studies were selected from the 1970s, four from the 1980s, and three from the 1990s.

Literature Search

A preliminary literature search has revealed much information on imaging as a preservation technique for books but little on preserving documentation for data files. We have uncovered, to date, no previously published material on preservation of electronic materials other than duplicate copies moved from one storage medium to another. We have found little information on the subject of copying data files and changing the way they are coded.

Data Migration Activities

The process of migrating digital numeric information from computer tape to system-independent formats was broken down into a series of steps.

First, we identified the formats to test and the computer equipment upon which to test these formats. The primary equipment in use at Yale and similar academic institutions consists of IBM mainframes, UNIX-based machines, and PC/Intel-based networks and stand-alone computers. Macintoshes are not widely used by social scientists at Yale. Data formats were determined by considering primary software in use, transportability, and long-term archival applications. We decided to produce a wide range of formats for two of the Roper Report datasets on multiple platforms and evaluate the time, size, and utility of each format. (See Table 1.) Further, these formats were to be tested on the UNIX and PC/Intel equipment configurations at the Social Science Statistical Laboratory (Statlab). Worksheets for recording procedures, time, disk space, and special notations were developed. The formats selected were: SAS system files of recoded column-binary data; SAS program files to read in the column-binary data for producing a SAS data set or a subset of the original file; ASCII files produced from recoded column-binary data; and ASCII files of the binary data patterns in the original file, called 'spread' ASCII.

Second, we migrated the column-binary data sets from old round reel tapes to new 3480 IBM cartridges on the Yale mainframe. This is the first step in migrating the data from the old tape media to more stable magnetic media. Next, the data were copied from cartridge to online disk and then moved via ftp to the Statlab Novell PC server. This server became the home for the SAS program writing, record keeping, and storage of the resultant output.

Third, the column-binary datasets were read with SAS input statements and recoded using question number variable names, with the resulting SAS programs and SAS datasets of recoded data stored on the network. The SAS input statement programs were edited and run on both the Windows and UNIX versions of SAS. Various methods of reading in the column-binary data and recoding them were tested. In the process, we established standard variable naming formats based upon the original question coding structure. Standard types of variables were defined: numeric, numeric with special missing data, multiple-response, or single-response. Each of these variable types requires different variable naming, formatting, and recoding procedures. The complexity of a particular data file can be estimated by examining the frequency of complex types of variables, especially the multiple-response type. Template files were created for each variable type to reduce the amount of repetitive typing and speed up the production of SAS programming. We performed error checking routines to compare selected frequencies with 'x-rays' from the Roper Center. During the entire process we evaluated our procedures, examined the sizes of datasets, noted problem variables, and reviewed the formats. Additional versions of the datasets were produced that can be transferred among the various versions of SAS and to other statistical software packages.
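
As an illustration of the reading and recoding step, the following Python sketch decodes one card column and recodes it as a single-response or multiple-response variable. It is not the SAS code used in the project. It assumes a common column-binary layout that this report does not itself specify: each 80-column card image occupies 160 bytes, and each column is stored in two bytes that carry six punch bits apiece (rows 12, 11, 0-3 in the first byte, rows 4-9 in the second). All function names are hypothetical.

```python
# ASSUMPTION (not stated in the report): a card image is 160 bytes, two bytes
# per column, six punch bits in the low-order bits of each byte.
ROWS_FIRST_BYTE = ("12", "11", "0", "1", "2", "3")
ROWS_SECOND_BYTE = ("4", "5", "6", "7", "8", "9")
CARD_BYTES = 160


def column_punches(card, column):
    """Return the punched rows, as strings, for a 1-based card column."""
    if len(card) != CARD_BYTES:
        raise ValueError("expected a 160-byte card image")
    if not 1 <= column <= 80:
        raise ValueError("column must be between 1 and 80")
    offset = 2 * (column - 1)
    punches = []
    for byte, rows in ((card[offset], ROWS_FIRST_BYTE),
                       (card[offset + 1], ROWS_SECOND_BYTE)):
        for position, row in enumerate(rows):
            # 0x20 is the highest of the six punch bits in each byte.
            if byte & (0x20 >> position):
                punches.append(row)
    return punches


def single_response(punches):
    """Recode a column that should hold one punch; None means blank."""
    if not punches:
        return None
    if len(punches) > 1:
        raise ValueError("more than one punch in a single-response column")
    return punches[0]


def multiple_response(punches, codes):
    """Recode a multi-punch column into one 0/1 indicator per listed code."""
    return {code: int(code in punches) for code in codes}


# A hypothetical card whose first column has rows 12 and 5 punched.
card = bytes([0x20, 0x10]) + bytes(158)
print(column_punches(card, 1))  # ['12', '5']
print(multiple_response(column_punches(card, 1), ("12", "11", "5")))
```

A multiple-response question expands into several indicator variables while a single-response question yields one value, which is one reason the multiple-response type adds most to the recoding effort.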

Fourth, we evaluated the size and production requirements of an ASCII version of the data set, both in recoded form and spread form. A custom C program was written to produce the many repetitive lines in the SAS code needed to read and write the datasets in 'spread' ASCII format.

Initial findings about data conversion

For present-day users of the data, conversion of the column-binary format into SAS and SPSS transport and export files is the most attractive option. The resulting format can be used easily, is transportable to multiple operating systems and equipment configurations, and can be transformed into other software-specific formats. However, the original data file must be recoded, a process that is lengthy and potentially error prone. The archival limitations are also significant. The SAS and SPSS system files are manipulated by proprietary software, must be migrated through future versions of SAS and SPSS to guarantee utility, and do not retain the original multi-punch structure of the column-binary files. Once the recoding is done, future researchers will be unable to re-create the original data set.

If an archival standard is defined as a non-column-binary format that reproduces the complete structure of the original files, only the 'spread' ASCII format meets these conditions. This 'spread' format, however, is at least 600% larger than the original file and requires recoding. Further, there is no easy mapping of new variable locations onto the original codebook, as each punch listed in the original codebook must be assigned a new column location in the spread ASCII data.
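
The size penalty and the remapping problem can both be seen in a small Python sketch. It assumes, as an illustration only, that a 'spread' record writes one ASCII character ('0' or '1') per possible punch and that the column-binary source stores each 12-row column in two bytes; neither detail is specified in this report, and the function names are hypothetical. Under those assumptions every two bytes of input become twelve characters of output, a factor of six before any record delimiters are added.

```python
# ASSUMPTIONS: two bytes per column-binary column (six punch bits each),
# and one '0'/'1' character per punch in the spread record.
PUNCHES_PER_COLUMN = 12


def spread_column(first_byte, second_byte):
    """Return 12 characters, one per punch row, in the order 12, 11, 0-9."""
    bits = ((first_byte & 0x3F) << 6) | (second_byte & 0x3F)
    return format(bits, "012b")


def spread_card(card):
    """Spread a whole card image (two bytes per column) into '0'/'1' text."""
    return "".join(spread_column(card[i], card[i + 1])
                   for i in range(0, len(card), 2))


def spread_position(column, row_index):
    """1-based position in the spread record of a punch.

    column is the 1-based card column from the original codebook;
    row_index runs 0..11 over the rows 12, 11, 0, 1, ..., 9.
    """
    return (column - 1) * PUNCHES_PER_COLUMN + row_index + 1


card = bytes([0x20, 0x10]) + bytes(158)   # hypothetical 160-byte card image
record = spread_card(card)
print(len(card), len(record), len(record) / len(card))  # 160 960 6.0
print(spread_position(2, 0))  # 13
```

The last line shows why the original codebook no longer applies directly: the first punch of card column 2 lands at position 13 of the spread record, so every codebook location has to be translated through a mapping of this kind.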

The column-binary format itself begins to look more attractive as a long term archival standard than we had anticipated. It conserves space, it preserves the original coding of data and matches the column location information in the codebooks, it can be transferred in standard binary, and it can be read by standard statistical packages on all of the platforms we have used in testing. Unfortunately, it is difficult to locate and decipher information about how to read in the column-binary data with SAS and SPSS, as the latest manuals no longer contain information about this format. If the column-binary format is archived, sample programs for reading the data with standard statistical packages, or a stand-alone program, must be included in the collection of supporting files.

Documentation Conversion Activities

Several options are available for digitizing the documentation, including optical character recognition (OCR), image scanning, text encoding, and manual data entry. Alternate methods and costs are being evaluated. The initial part of this project included identifying formats to test and setting up the scanning workstation. We designed worksheets for record and time keeping to standardize our evaluations.

Our first step was to scan the documentation for one Roper Report using the OCR software TextBridge Pro. We reviewed alternative OCR software products and, finding no significant differences, chose a package with which the staff had experience. Initial evaluation of the OCR output showed that there were significant numbers of errors in the resultant ASCII text. We tested various resolutions and documented the time taken for setup and scanning, settings, file sizes, and proofing summaries and procedures.

The questionnaires we scanned with the TextBridge Pro software had an unacceptable rate of character recognition, including incorrect location information necessary for manipulating the accompanying data files. Handwritten notes are completely lost and editing costs of reviewing the output and changing all errors are prohibitive. This format does not present us with an adequate archival solution to preserving the textual material.

The next step in the documentation portion of the project is to archive and make available documents in the Portable Document Format (PDF) used by Adobe Acrobat. PDF is becoming a widely accepted de facto standard for encoding electronic documents. The viewing software provided by Adobe allows for reading and searching a document, and high-quality printouts can be made. PDF documents can be displayed as clear, accurate reproductions of the questionnaires. The Adobe Capture software produces an accompanying ASCII file that can be edited to improve text searching. We purchased this software, installed it on the scanning workstation, and have begun evaluating the procedures, output and applications of this system.

The PDF format provides solutions to some of the documentation distribution and preservation problems we face, but it does not meet all of our needs. 1) PDF files are produced and stored in a format that may be problematic in the future in terms of reading and searching. The PDF format depends upon proprietary software which may not be available in future computing environments. 2) PDF files do not contain marked-up text found in a structured document format, e.g. SGML. The format does not go far enough in providing internal structure for the manipulation, output and analysis of the metadata.

Second Phase

The second part of the project will continue the two-pronged approach. The Data Conversion section has the following tasks to be done:

  1. Investigate more fully the structure, size, and production of the 'spread' ASCII format; produce a standard conversion program that produces 'spread' ASCII
  2. Produce a sample data dictionary for the 'spread' ASCII format, evaluate costs and time requirements, evaluate procedures done at Louis Harris Data Center and the Roper Center in distributing and documenting 'spread' ASCII data and documentation
  3. Address the issues of proprietary formats in an archival context, especially SAS and SPSS
  4. Produce an estimate of recoding costs for producing SAS input statements
  5. Test various compression mechanisms

The Documentation Preservation section has the following tasks to be done:

  1. Investigate alternative image formats; investigate PDF conversion to image files and ASCII files; determine if there is any difference in quality in going from OCR to PDF vs PDF to OCR, implications of both over time given improvements in OCR technology
  2. Evaluate results: time, costs, learning curve, alternate resolutions, etc.
  3. Address archival standards implications of PDF format and precedence at other locations (ICPSR and the US Bureau of the Census)
  4. Illustrate examples of searching and displaying the ASCII text in Adobe Acrobat
  5. HTML applications: produce a sample marked-up HTML document
  6. SGML review: discuss the application of the Data Documentation Initiative DTD (which will not be available for this project) to the Roper Report documentation.

Finally, we will produce the Final Report and WWW site for the project and conduct an informal faculty and graduate student review of the findings and sample output from the project.

Final Report

The project advances a design on which Yale and others can subsequently build. Out of this experience, Yale is creating a model of the process needed to handle a large-scale project that entails migration of data and preservation of accompanying documentation. The contract calls for Yale to produce a report to help other institutions working on digital archiving projects. The report will include: an evaluation of findings, a glossary and bibliography, technical descriptions of software and equipment, and a WWW site with sample programs, datafiles and documentation.

The report will be summarized and presented at the IASSIST/IFDO Conference, Odense, Denmark, May 6-9, 1997. IASSIST is the International Association for Social Science Information Service and Technology and IFDO is the International Federation of Data Organizations. The URL for the conference is http://www.sa.dk/dda/conf97.


Table 1. Data Conversion Formats and Storage Requirements
(Sept. 1993 Roper Report)

  Dataset description                                          Storage (bytes)  % of original

All platforms:
  Original (column-binary)                                             2858400

IBM PC datasets:
  SAS dataset (full set of variables)                                 19464984           681%
  SAS XPORT dataset (full set of variables)                           19144720           670%
  SAS dataset with 4 byte integers (full set of variables)             9978112           349%
  SAS dataset with 3 byte integers (full set of variables)             7528192           263%
  SAS dataset (partial set of variables)                               8879360           311%
  SAS XPORT dataset (partial set of variables)                         8843760           309%
  SAS dataset with 4 byte integers (partial set of variables)          4571392           160%
  SAS dataset with 3 byte integers (partial set of variables)          3471616           121%

UNIX datasets:
  SAS dataset (full set of variables)                                 21831680           764%
  SAS XPORT dataset (full set of variables)                           19144720           670%
  SAS dataset with 4 byte integers (full set of variables)            10166272           356%
  SAS dataset with 3 byte integers (full set of variables)             7413760           259%
  SAS dataset (partial set of variables)                               9379840           328%
  SAS XPORT dataset (partial set of variables)                         8843760           309%
  SAS dataset with 4 byte integers (partial set of variables)          4595712           161%
  SAS dataset with 3 byte integers (partial set of variables)          3416064           120%

Mainframe datasets:
  SAS dataset (full set of variables)                                 23347200           817%
  SAS XPORT dataset (full set of variables)                           19144720           670%
  SAS dataset (partial set of variables)                              11059200           387%
  SAS XPORT dataset (partial set of variables)                         8843760           309%
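
The percentage column in Table 1 is each file's size divided by the size of the original column-binary file. A minimal Python sketch of that arithmetic follows, using a few byte counts copied from the table; the dictionary labels and the function name are illustrative shorthand, not names used in the project. The original file size is also an exact multiple of 160 bytes, which would be consistent with 160-byte card images, although the report does not state the record length.

```python
ORIGINAL_BYTES = 2858400  # original column-binary file, from Table 1


def percent_of_original(size_bytes, original_bytes=ORIGINAL_BYTES):
    """Return a file size as a whole-number percentage of the original."""
    return round(100 * size_bytes / original_bytes)


# A few rows copied from Table 1 (labels abbreviated for illustration).
sizes = {
    "IBM PC, SAS dataset, full": 19464984,
    "IBM PC, SAS 3 byte integers, partial": 3471616,
    "UNIX, SAS dataset, full": 21831680,
    "Mainframe, SAS dataset, full": 23347200,
    "SAS XPORT, full (all platforms)": 19144720,
}

for label, size in sizes.items():
    print(f"{label}: {percent_of_original(size)}%")

# Consistent with (but not proof of) 160-byte card images.
print(ORIGINAL_BYTES % 160 == 0)
```

The SAS XPORT sizes are identical on all three platforms in the table, while the native SAS dataset sizes differ by platform.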
