Document Attribute Format Specification


RAF Technology, Inc.
16650 NE 79th Street, Suite 200
Redmond, WA 98052
Tel: (206) 867-0700
FAX: (206) 882-7370
Table of Contents


  User Comment


About DAFS

DAFS Design

  Reverse Encoding Format Considerations

  Object-Oriented Design

  Hierarchical Design


  Primary DAFS Requirements

  Secondary DAFS Requirements

  Existing Standards

     Advantages of SGML

     Disadvantages of SGML


     A Basis for DAFS

     Working With SGML

     DAFS Tags

  DAFS Storage Formats

  DAFS Entities

     Hierarchical Relationships


  Confidences, Alternative Sets, and Property Ranges


     Property Ranges

     Alternative Contents:  Or and Borrow

     Alternative Read Orders and Page Layouts

The Illuminator User Interface:  Using DAFS



  Out-of-Context Mode

  Flagged Mode

The DAFS Library:  Programming DAFS


  DAFSlib Routines

Appendix A.  DAFS-B Storage Format

  Entities, Properties and Tokens

     List of Tokens

  Entity Definition

  Property Definition

  Image Definition

  DAFS-B Documents

  Example Document

Appendix B.  DAFS-U Storage Format

  Modification of Base Character Set


  Symbolic Names

Appendix C.  DAFS-A Storage Format

Appendix D.  DAFSlib Routines




     Get and Set

     Images and bounds

     Call Backs

     Error Messages

     DAFS-B Utility Routines


Sponsored by ARPA, RAF Technology, Inc. is developing a
powerful new document interchange format and tool set for
document decomposition and data sharing applications.  The
Document Attribute Format Specification (DAFS) provides a
format for breaking down documents into standardized entities
(such as Word and Glyph), defining entity boundaries and
attributes, and labeling their contents (text values) and
attributes.  DAFS builds in extensibility to allow users to
configure their own documents as needed, without violating the

This document is intended to be a preliminary specification
for DAFS.  It outlines the considerations behind DAFS design
as it currently stands.  It also provides an introduction to
Illuminator, an editor for working with DAFS documents, and
dafslib, a library of C functions for working with DAFS files.

User Comment

The pre-revision 1.00 versions of DAFS are meant to enable
potential users to investigate the appropriateness of DAFS for
their projects, and to allow users to provide feedback before
DAFS is cast in concrete.  Long-term academic or standards
committee processes have their merits in the careful
development of formats and specifications, but under ARPA's
DIMUND program, there is a need for such a format today.  The
DIMUND program is willing to accept the risks of implementing
these specifications before all the issues can be completely
explored.  These specifications are not intended to be the
final form of the representation of document attributes, but
are intended to be a good initial approach.

Under the DIMUND program, DAFS will be used for building
databases of tagged images, and a set of software tools that
will process the data in this format.  It is the intention of
DIMUND to provide access to these databases and software tools
at little or no charge.  It is also desirable that this file
format achieve more widespread use than just in the document
image understanding community.  As a result, the DIMUND
program is requesting feedback regarding this document from a
wider community.

Comments on this document and requests for copies of future
revisions should be addressed to:

Mitch Buchman
e-mail: mebuchm@afterlife.ncsc.mil
Voice: (301) 688-4760

Reports of errors and inconsistencies in DAFS and requests for
additions to the standard should be directed to:

David Justin Ross
e-mail: davros@raf.com
Fax:  (206) 882-7370
Voice: (206) 868-0700
RAF Technology, Inc.
16650 NE 79th Street, Suite 200
Redmond, WA   98052


The term UNICODE Standard in this document refers to The
UNICODE Standard Worldwide Character Encoding, Version 1.0,
The UNICODE Consortium, Addison-Wesley, 1991, together with
the UNICODE 1.0.1 Addendum.

Character codes which are part of the UNICODE Standard
character set will be referred to in the same format as they
are represented in The UNICODE Standard.  An individual
UNICODE character can be expressed U+nnnn, where nnnn is a
four digit hexadecimal number.
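
As an illustration of the U+nnnn notation only, the following
minimal Python sketch converts between a character and its
four-digit hexadecimal form.  The function names are
hypothetical and are not part of DAFS or dafslib.

```python
def to_u_notation(ch: str) -> str:
    """Format one character as U+nnnn (four hexadecimal digits)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("code does not fit in four hexadecimal digits")
    return "U+{:04X}".format(code)


def from_u_notation(text: str) -> str:
    """Parse a U+nnnn string back into the character it names."""
    if len(text) != 6 or not text.startswith("U+"):
        raise ValueError("expected the form U+nnnn")
    return chr(int(text[2:], 16))


print(to_u_notation("A"))         # U+0041
print(from_u_notation("U+0041"))  # A
```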

SGML in this document refers to the Standard Generalized
Markup Language, ISO Standard 8879, published October, 1986.

                          About DAFS

While many formats exist for composing a document from
electronic storage onto paper, no satisfactory standard exists
for the reverse process.  DAFS is intended to be a standard
for document decomposition.  It will be used in applications
such as OCR and document image understanding.

There are three storage formats:  DAFS-Unicode, DAFS-ASCII and
a more compact DAFS-Binary form.

DAFS is a file format specification for documents with a
variety of uses.  It is developed under the Document Image
Understanding (DIMUND) project funded by ARPA.  As such, DAFS
is meant to be the file format for all documents used as part
of DIMUND.  These include any documents whose content has been
examined either manually or automatically and which form parts
of DIMUND databases.  In addition, DAFS-formatted documents
are used in the Illuminator project where they are employed
for training and testing document image understanding tools.
It is hoped that DAFS will prove to be general enough to enjoy
widespread use, particularly in document understanding.

Several standards have been developed which address the
creation or composition of documents, but none of these
standards is well suited to the problem of document
decomposition.  There are many applications which would
require some form of document decomposition.  These include
character recognition and document image understanding.  DAFS
is a new format designed explicitly for representing the
encoding of decomposed documents (reverse encoding).  As such,
DAFS is designed to allow representation of both the physical
and semantic information contained within a document image.
It is desired that this format have many applications beyond
these specific ones.  With this as a goal, the principle was
established that the format should be extensible to meet the
needs of a broad base of users, and that the format should not
impose unnecessary assumptions on the potential users.

                          DAFS Design

This section describes the philosophy which was used in the
development of DAFS.  Readers may find this section useful in
the interpretation of the other sections of this document.

Reverse Encoding Format Considerations

In publishing applications, documents are "encoded" via
standards such as SGML in preparation for the actual printing
process.  In the document understanding and page decomposition
arena, we perform "reverse encoding", seeking to reverse-
engineer the meaning from an image of the printed page.  The
greatest difference between encoding for document creation and
the reverse encoding of document images is that during the
reverse encoding process, there may be varying levels of
uncertainty in the interpretation of aspects within the
document.  In the document creation process, this ambiguity is
not present, since the document is usually being encoded by
the same person who created the representation of the
document.  A data format for document reverse encoding must
have a mechanism for representing these ambiguities.

In document image reverse encoding, there exists the concept
of the physical structure of the document and there is also
the semantic structure of the document.  These may share
common aspects, but they are still two different ways of
perceiving the structure of a document.  A data format for
document reverse encoding must be able to provide a mechanism
to encode both the physical and the semantic structure of a
document.

While document creation proceeds in a serial manner, document
image decomposition usually traces through a document
hierarchically rather than serially.  This is a result of the
way that reverse engineering processes are usually applied to
documents.  An example of this is that the discrimination
between text and nontext regions is often performed for an
entire document before any character recognition is performed.
Thus, a data format for document reverse encoding must have a
mechanism for hierarchically building a data structure to
represent a document.

Another factor worth keeping in mind is that it is often
desirable to begin outputting a document before the entire
thing has been processed.  A format for reverse-encoding
document images should make it possible to do so.  For
example, knowing the total number of paragraphs should not be
necessary before output can begin.

Object-Oriented Design

In the process of reverse encoding a document image, it
becomes clear that it would be convenient to be able to handle
portions of the image as discrete objects.  First, the concept
"object" must be established.  An object might be any part of
a document that may be defined in a stable form.  Under DAFS,
an object is called an "entity", and is essentially one or
more rectangular pieces of image from the document.  Examples
of useful entities are "paragraph", a "character", and a
"document".  Each may be part of a document, but needs to have
a sufficiently stable form to be unambiguously defined.  By
implementing this specification in an object-oriented fashion,
each entity may have any number of properties associated with
it, allowing information about the entity to be entered.  (See
"DAFS Entities" for more information.)  Another advantage of
an object-oriented format is that an object may also contain
other objects, which is the key to a hierarchical structure.

Hierarchical Design

Often the objects we wish to define in a document fit into a
hierarchical relationship.  For example, a character may be
contained within a word.  That word may then be contained
within a line of text.  This in turn may be within a
paragraph, within a page.  All of these may be part of a
document.  DAFS provides the ability to create "parent",
"child" and "sibling" relationships between entities, forming
the basis for specifying any hierarchy desired.

The use of a hierarchical structure for describing objects
within a document can make the description of the document
more compact.  Users can leave out the layers or details that
they do not need.  For example, an OCR application might use
the structure document-paragraph-line-word-character in
processing a document image.  A pitch and phase detector,
working with the same image, might use the structure document-
line and have no need of word and character information.


Another design goal is to give DAFS the greatest potential for
extension.  It is not envisioned that the developers of this
specification will be aware of all potential applications of
DAFS.  In addressing such future needs, this specification is
being designed to provide a practical degree of extensibility,
and to permit asynchronous revisions of data and applications
to coexist.

The use of discrete objects to describe document structure
permits easy extension to the set of objects available for
this purpose.  In another venue, the availability of an
unlimited number of properties for each object permits user
extensions to document descriptions.  The user may classify,
modify, or record information about a pre-existing document's
content by creating and using new properties.  (See
"Properties" for additional information.)

In order to preserve the extensibility of DAFS and to maintain
compliance with this specification, a process that conforms to
DAFS must either interpret code values as specified, or pass
through these values and not interpret them at all.  A process
must not change a code that it cannot interpret.
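
The pass-through rule above can be sketched in Python as
follows.  This is only an illustration under stated
assumptions: the property names, the dictionary representation
and the handler functions are hypothetical and do not describe
the actual dafslib interface.

```python
def interpret_bold(value):
    # Hypothetical interpretation: treat any nonzero value as bold.
    return bool(value)


def interpret_point_size(value):
    # Hypothetical interpretation: point size as an integer.
    return int(value)


# Codes this particular (imaginary) process knows how to interpret.
INTERPRETERS = {
    "bold": interpret_bold,
    "point size": interpret_point_size,
}


def process(properties):
    """Interpret known codes; copy unknown codes through unchanged."""
    result = {}
    for code, value in properties.items():
        handler = INTERPRETERS.get(code)
        if handler is None:
            result[code] = value           # not understood: do not touch
        else:
            result[code] = handler(value)  # understood: interpret
    return result


record = {"bold": 1, "point size": "12", "x-vendor-code": b"\x01\x02"}
processed = process(record)
print(processed)
```

A process written this way can coexist with newer data that
carries codes it has never seen, since those codes survive a
round trip untouched.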

Primary DAFS Requirements

There are a number of requirements which DAFS must meet to
fulfill its purpose.

1.DAFS must be powerful enough to serve as the format for
  database and tools document interchange.

2.DAFS must be rich enough to represent data from the custom
  formats used by our clients.

3.ARPA wants all their clients to use the same standard as far
  as possible.

4.ARPA wants the various DIMUND tools to support a single data
  format.

5.Both the tools and the data sets built as part of these
  related projects should be in a common format.  DAFS will be
  that format.

6.ARPA wants to fully understand image content.  A major
  problem at ARPA is getting at information in a data stream
  of images.  They want the ability to:

    Find the topics that the data stream relates to.

    Search the data stream for images matching a query topic.

7.ARPA has short-term needs which involve data-handling
  problems surrounding their system integration.  They want a
  standard data interchange format to help them integrate the
  different pieces and only want to define one standard.  DAFS
  should meet these short-term needs also.

Secondary DAFS Requirements

In addition, there are several desirable traits for DAFS.
Ideally, DAFS should:

1.Offer expandability.

2.Allow alternatives/possibilities; e.g., allow a supposed
  character's content to be labeled 'm' or 'rn' or 'iii'.

3.Allow confidence values for each alternative, so that a
  measured choice might be made among them.

4.Support many human languages.

5.Be backward compatible with selected existing tools.

6.Provide easy conversion of other formats to and from DAFS.

7.Help tools process images rapidly.

8.Have small data size and fast compression/decompression.

9.Be human readable (to allow editing on non-DAFS tools).

10.    Be multi-platform portable.

11.    Support both physical and semantic entities.

12.    Support a wide range of tools.

13.    Allow encapsulation of foreign formats (e.g. vector-
  format images)

14.    Minimize the number of files necessary to support one

15.    Provide for easy interconversion of different DAFS
  storage formats (currently there are three).

16.    Keep track of which code version was used to make the
  DAFS document.

17.    Be capable of indicating that a document has been QA'd
  with respect to some aspect, e.g. boldness, even if there
  are no bold characters in the document.

18.     Support existing standards to facilitate understanding
  and interchange.

Existing Standards

The definition of DAFS has been influenced by a number of
current standards.  Some of these are from private companies,
while others are international standards.  Among the private
formats are IBM's RFT:DCA, the Rich Text Format (RTF)
from Microsoft, and CDA from Digital Equipment
Corporation.  International standards have been implemented by
the International Organization for Standardization (ISO) and
the Comite Consultatif International Telegraphique et
Telephonique (CCITT) to assist in the open exchange of
documents.  These standards include the Open Document
Architecture (ODA), ISO 8613, and the Standard Generalized
Markup Language (SGML), ISO 8879.

These formats have been made available to the public and are
each implemented by numerous vendors.  Each was created to
provide a common interchange format for moving documents
between heterogeneous document formatting systems and text
publishing systems.  Because they were originally conceived
for encoding documents for publishing applications, none of
these formats adequately addresses the problems of document
decomposition and reverse encoding.

The three standards which most influence DAFS are SGML for
overall structure and text handling, CCITT Group IV for
images, and ISO 10646 for UNICODE.

Advantages of SGML

SGML, an international standard for document interchange,
offers a number of significant advantages and is a good place
for DAFS to start.  SGML encompasses a standard for describing
document structure, tools for parsing and writing conforming
documents, and an overall philosophy of document structure.

There is no single SGML format.  Instead, applications use a
Document Type Definition (DTD) to describe the grammar of that
specific application.  The DTD defines the type of document
with which it is concerned, the names of allowed document
elements (called entities under DAFS), the tagset used to
delimit them, the set of attributes permitted each element,
etc.  In theory, SGML could be used to describe almost any
kind of document, though it is most often used for text
document interchange.

Some of SGML's good features are listed below.

1.SGML is well-designed.  Much thought has gone into what
  makes up a document; how to handle natural hierarchies and
  nesting of hierarchies; how to link yet distinguish document
  content and markup.

2.SGML is designed to be general and to allow modification.
  For example, the character set can readily be changed,
  enabling SGML to handle documents in many foreign language

3.While by no means perfect in handling non-English text, SGML
  offers several reasonable alternatives.  These are discussed
  in "DAFS-U Storage Format".

4.SGML was designed to be easy to parse.  It is possible to
  start anywhere in an SGML document and be able to discover
  where you are without scanning from the beginning.  Elements
  not recognized by the current application are easily

5.The user needn't know everything about a document to begin
  using SGML; it is not necessary to know the total number of
  paragraphs, for example.

6.SGML is a well-known and widely used standard.

Disadvantages of SGML

The following is a list of some of the shortcomings of SGML as
a standard for document decomposition:

1.SGML specifically avoids addressing the encoding of physical
  characteristics such as bold, centered, pitch or font.  It
  addresses rather the encoding of the semantic structures
  like paragraph, word, and character.  SGML conventionally
  encodes the purpose of an element (e.g. "emphasis") rather
  than how that purpose is expressed (e.g. print this element
  in "bold").  While it is technically possible to represent
  physical characteristics in SGML, doing so violates the
  spirit of SGML.  The use of specific physical attributes is
  discouraged in the interests of generality.  Unfortunately,
  these physical attributes can be important clues toward
  document understanding in the reverse encoding problem.
  When necessary, it is nonetheless straightforward to ignore
  this convention without actually violating SGML.

2.SGML is not set up to handle images easily.  This problem
  stems from the fact that the bytes of an image might contain
  anything, including a sequence of bytes which
  unintentionally looks like an SGML end-tag.

3.SGML applications are generally written to handle a
  prescribed type of document (specified in a DTD), and will
  have varying requirements.  Tools designed for one set of
  SGML documents won't necessarily work on others.  Using a
  new DTD or extending the current one can require some


A Basis for DAFS

The advantages of SGML are so strong it was decided to use it
as a basis for DAFS.  DAFS is being implemented as a special
SGML application with a set DTD, but with some built-in
features providing the extensibility and flexibility to cover
a wide range of applications.

Working With SGML

As discussed in the preceding section, SGML is not without its
disadvantages as a format for document decomposition
applications.  Solutions to the three problems mentioned in
Disadvantages of SGML are discussed below.

1.  SGML discourages encoding of physical characteristics.

  It was decided to overlook SGML inhibitions about encoding
  physical attributes.  A DTD was created which encodes
  essential attributes, including physical ones like "bold"
  and "point size", and which allows users to add their own.

2.  SGML does not handle images easily.

  Document decomposition applications will often require both
  image and text (for example, the image of a page and the
  corresponding OCR'd text).  SGML was not designed with this
  need in mind.  The difficulty arises from the need to
  distinguish intentional SGML tags from chance sequences of
  image bytes which imitate them.  A number of options were
  considered.  DAFS could:

    Escape the images.  This method of handling the problem
     makes reading and writing tedious.

    Write the image after the final tag of the SGML data.
     Unfortunately, the end of an SGML file is not well
     defined, and SGML parsers would not necessarily interpret
     this the same way.

    Write the image in an associated external file which is
     referenced by the text file.  Maintaining multiple files
     for one document can be inconvenient and is not
     aesthetic, but there is no question which bytes are image
     and which text.

  The decision was made to use references to external image
  files.  The negative aspects of storing a single document
  across more than one file are offset by the certain
  knowledge of which bytes are text and which image.

  Including image with text via external files is not without
  precedent.  The CALS (Computer-aided Acquisition and
  Logistics Support) standards by the US Dept. of Defense
  call for just such handling of mixed text and image.  Under
  CALS, text is to be converted to SGML and image to another
  format, then stored in separate but linked files.

3.  SGML applications are not necessarily portable from one
DTD to another, yet different applications will require
different DTD's.

  We are alleviating this problem by carefully constructing
  the DAFS DTD to be as generally applicable as possible, and
  by building in a limited expandability through DAFS
  "properties".  These provide a way for users to create new
  categories of information about a document's entities, and
  are discussed further under "Properties".
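
The image problem described in item 2 above can be
demonstrated with a short Python sketch.  The tag names and the
file attribute syntax used here are hypothetical and are not
taken from the DAFS DTD; the point is only that raw image
bytes can imitate an end-tag, while an external reference
cannot.

```python
START_TAG = b"<image>"   # hypothetical tag, for illustration only
END_TAG = b"</image>"


def embed_inline(text_markup: bytes, image: bytes) -> bytes:
    """Write raw image bytes directly between tags (the risky option)."""
    return text_markup + START_TAG + image + END_TAG


def naive_extract(stream: bytes) -> bytes:
    """Read image bytes up to the first thing that looks like an end-tag."""
    start = stream.index(START_TAG) + len(START_TAG)
    end = stream.index(END_TAG, start)
    return stream[start:end]


def reference_external(text_markup: bytes, filename: str) -> bytes:
    """Store only a reference; the image bytes live in a separate file."""
    return text_markup + b'<image file="' + filename.encode("ascii") + b'">'


# Image data that happens to contain the end-tag byte sequence.
image = b"\x00\x10</image>\xff\x7f"

recovered = naive_extract(embed_inline(b"<page>", image))
print(recovered == image)   # False: the image was cut short

linked = reference_external(b"<page>", "page1.img")
print(linked)
```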


DAFS Tags

The following suggestions regarding tags have guided the
creation of the DAFS tagset and DTD:

  1.Because DAFS aims to support many languages and scripts,
     DAFS will incorporate the UNICODE character set.  SGML (and
     by extension, DAFS) allows any character to be used in a
     tag, and the idea of specifying a defined set of tags
     violates the complete freedom of SGML; nevertheless, it is
     highly recommended that all tag characters be the UNICODE
     equivalent of ASCII.  Limiting tag characters to ASCII helps
     maintain their human-readability, and eases the
     interconversion of DAFS-U and DAFS-A files (discussed in
     "DAFS Storage Formats").

  2.It is anticipated that new tags will be developed by users,
     but in the interest of portability of the resulting
     documents, the DAFS-defined tags should be used as much as
     possible.  Proposed additional tags can be submitted to the
     DAFS committee.

  3.High-frequency entity names should be kept short.

DAFS Storage Formats

DAFS has three storage formats, each designed for a different
purpose, but wholly interconvertible.  These are the compact
binary format DAFS-B (BINARY), and the "human-readable" DAFS-A
(ASCII) and DAFS-U (UNICODE).  In DAFS-B, image and text (if
any) are stored in a binary format in one file.  DAFS-A is a
direct application of SGML.  DAFS-U is similar to DAFS-A, but
modified to allow UNICODE characters as content. In DAFS-A and
DAFS-U, images (if any) are included in external linked files.
RAF will provide library routines to read and write all
three versions.

DAFS Entities

DAFS entities are conveniently defined objects within a
document such as a paragraph or word.  In essence, an entity
is one area of image which is usually defined by a bounding
box (though it need not be).  An entity can have content,
which might be the text it encompasses; properties, such as
bounding box, font and point size; and hierarchical
relationships with other entities, allowing specification of
read orders and page layouts.  (NOTE:  DAFS "entity" is
analogous to the SGML element.  It has nothing to do with the
SGML concept of entity.)

RAF's customers have requested that the following document
elements and characteristics be definable under DAFS.  It is
our intent to make each of these available.  The primary DAFS
entity types are:

  doc        The document as a whole.
  page       A given page.
  column     A column of text.
  paragraph  A delimited block of text comprising a paragraph.
  line       A line of text.
  word       A word in the text.
  glyph      A single character in the text.  "Glyph" rather
             than "character" is used because we are referring
             to an area of image which is meant to be a
             character, but which may not actually be correctly

Hierarchical Relationships

DAFS permits the creation of parent, child and sibling
relationships between entities, providing easy representation
of the hierarchical structures of a document.  As an example,
consider a paragraph entity made up of words, and the words
composed of glyphs.  The component glyphs are the child
entities of each word, while the paragraph is the parent of
each word.  The other words in the paragraph are a given
word's siblings.
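
The paragraph, word and glyph example above might be modeled
as in the following Python sketch.  The Entity class and its
method names are hypothetical; they illustrate the parent,
child and sibling relationships and are not the dafslib data
structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare entities by identity, not by field values
class Entity:
    kind: str
    content: str = ""
    parent: Optional["Entity"] = field(default=None, repr=False)
    children: List["Entity"] = field(default_factory=list)

    def add_child(self, child: "Entity") -> "Entity":
        """Attach child to this entity and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def siblings(self) -> List["Entity"]:
        """Other children of this entity's parent."""
        if self.parent is None:
            return []
        return [e for e in self.parent.children if e is not self]


paragraph = Entity("paragraph")
the = paragraph.add_child(Entity("word", "the"))
cat = paragraph.add_child(Entity("word", "cat"))
for ch in "the":
    the.add_child(Entity("glyph", ch))

print(the.parent.kind)                     # paragraph
print([g.content for g in the.children])   # ['t', 'h', 'e']
print([w.content for w in the.siblings()]) # ['cat']
```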


Properties

An entity may have various attributes, or "properties",
associated with it.  A few predefined properties are listed
below.

  bounding box   Rectangular box delimiting the entity or
                 portions of it.
  font class     Information on the character font (e.g.
                 Courier or Helvetica).
  point size     Size of the printed characters.
  bold           Characters printed with thicker lines for
                 emphasis.
  italic         Characters printed with slanting lines for
                 emphasis.

DAFS permits the creation of an unlimited number of user
defined properties.   A property is used to describe or
classify an entity and its contents, and exists only in
association with the entity to which it refers.  In SGML
applications, such attributes are generally predefined in the
DTD.  DAFS introduces user-defined properties as a way for
users to create their own entity categories and descriptions,
without the need to alter the underlying DAFS DTD.  It
provides flexibility for handling a large variety of
applications, yet protects the ability to share tools and

Confidences, Alternative Sets, and Property Ranges

  conf             Contains the confidence in the value of an
                   entity's contents or its properties.  It
                   defaults to a single unsigned byte 0-255.
  alternative set  A list of alternatives suggested or
                   allowed as the content of an entity.
                   Alternatives within an alternative set are
                   meant to be read as either the first one or
                   the second, and so on.


Since DAFS is meant for use with document decomposition, and
since there is always some uncertainty or ambiguity associated
with determining exactly what the content of an entity is,
DAFS must be able to assign confidence values to all its

Property Ranges

A related idea is the allowed range of values for properties.
We anticipate the existence of tools using DAFS which test the
effectiveness of automatic character recognizers, page
decomposers, and other image understanding tools.  For
example, an automatic page decomposition tool might put a
bounding box around a paragraph, different from the one the
human creating the test set had assigned.  The testing tool
must determine whether the machine-set bounding box is close
enough to the human-created 'ideal'.  Since exact matches are
not required for this type of application, the exact values of
some properties may be uncertain.   DAFS accommodates this
need with entities which set the allowed range of properties.
The property "bold" is another example.  It may have an
allowed range of 100-200 for font class 1.  If "boldness" of a
glyph is measured as 132, this is within the range and it will
be concluded that the glyph is in fact bold.  For consistency,
all other bold glyphs like it from font class 1 should also
have a boldness of 132.
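
The two range checks described above can be sketched in Python
as follows.  The boldness range of 100-200 for font class 1
comes from the example in the text; the per-edge tolerance
test for bounding boxes is only one possible definition of
"close enough" and is an assumption of this sketch, as are all
of the names.

```python
# Allowed boldness range per font class (values from the example above).
BOLD_RANGE = {1: (100, 200)}


def is_bold(boldness: int, font_class: int) -> bool:
    """True if the measured boldness falls inside the allowed range."""
    low, high = BOLD_RANGE[font_class]
    return low <= boldness <= high


def boxes_match(machine, truth, tolerance: int) -> bool:
    """Compare two (left, top, right, bottom) boxes edge by edge.

    Assumption: a machine-set box is acceptable when every edge
    lies within `tolerance` units of the human-created box.
    """
    return all(abs(m - t) <= tolerance for m, t in zip(machine, truth))


print(is_bold(132, font_class=1))   # True
print(is_bold(99, font_class=1))    # False

truth_box = (10, 10, 200, 50)
machine_box = (12, 9, 198, 52)
print(boxes_match(machine_box, truth_box, tolerance=3))  # True
print(boxes_match(machine_box, truth_box, tolerance=1))  # False
```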

Alternative Contents:  Or and Borrow

DAFS must be able to handle alternative values for entity
content.  Any attempt to decompose a document will engender
areas of uncertainty.  A classic OCR uncertainty, for example,
involves distinguishing 'I' (capital I), '1' and 'l' (lower
case L).  If just one of the three is selected as "most
likely", the fact the other two were very nearly as likely is
lost.  DAFS provides easy means of preserving and presenting
sets of such alternatives.  The use of alternatives is
available not only for sets of characters, but for any other
kind of entity as well.

The key to DAFS alternatives involves the concepts of child
type and entity borrowing.  An entity's children may be the
"And" type, such as the component glyphs of a word, which are
all meant to be presented together.  "Or" type children, on
the other hand, are alternatives of one another; only one of
the set can be presented at one time.  A glyph may have "Or"
children 'I', '1' and 'l', allowing a variety of techniques to
be tried in selecting the best of the possibilities.
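
A minimal Python sketch of "Or" alternatives follows.  It
assumes each alternative carries a confidence in the default
0-255 range; the class, the function names and the confidence
values are invented for illustration.  Picking the highest
confidence is only one of the selection techniques a tool
might try.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alternative:
    content: str
    conf: int  # confidence, a single unsigned byte (0-255)

    def __post_init__(self):
        if not 0 <= self.conf <= 255:
            raise ValueError("conf must fit in one unsigned byte")


def best_alternative(alts: List[Alternative]) -> Alternative:
    """Pick the alternative with the highest confidence."""
    if not alts:
        raise ValueError("an alternative set must not be empty")
    return max(alts, key=lambda a: a.conf)


def near_ties(alts: List[Alternative], margin: int) -> List[str]:
    """Contents whose confidence is within `margin` of the best one."""
    top = best_alternative(alts).conf
    return [a.content for a in alts if top - a.conf <= margin]


# "Or" children of one glyph: capital I, digit one, lower-case L.
glyph_alts = [Alternative("I", 120), Alternative("1", 118),
              Alternative("l", 125)]

print(best_alternative(glyph_alts).content)  # l
print(near_ties(glyph_alts, margin=5))       # ['I', 'l']
```

Because the whole set is kept, a later tool (a dictionary
lookup, for instance) can still overrule the first choice.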

Entity borrowing is a very useful device which permits easy
data sharing among entities.  As an example, consider a
document that has a read order different from the physical
order of the entities on the page.  The Document could be an
"Or" type entity with two child entities.  The first child
would arrange the Paragraphs, Words etc. to represent the read
order.  The second would arrange them to represent page
layout, borrowing the same images used by the read-order
child, but arranging them differently.  The Borrow concept
allows the data to appear only once, but to be arranged and
used in multiple ways.  Through Borrowing, DAFS files can be
more compact than would otherwise be possible.
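
Borrowing can be pictured with the following Python sketch, in
which a borrowed entity is simply a second reference to the
same object.  This is an assumption made for illustration; the
way DAFS storage formats actually record a borrow is not shown
here, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # entities are compared by identity
class Entity:
    kind: str
    content: str = ""
    child_type: str = "and"   # "and": children together; "or": alternatives
    children: List["Entity"] = field(default_factory=list)


def text_of(entity: Entity) -> List[str]:
    """Contents of an entity's children, in that entity's order."""
    return [c.content for c in entity.children]


# The paragraph data exists only once.
headline = Entity("paragraph", "Headline")
sidebar = Entity("paragraph", "Sidebar")
body = Entity("paragraph", "Body")

# First child: read order.  Second child: page layout, which
# borrows the same three paragraphs in a different arrangement.
read_order = Entity("page", children=[headline, body, sidebar])
page_layout = Entity("page", children=[headline, sidebar, body])

document = Entity("doc", child_type="or",
                  children=[read_order, page_layout])

print(text_of(read_order))   # ['Headline', 'Body', 'Sidebar']
print(text_of(page_layout))  # ['Headline', 'Sidebar', 'Body']

# A correction made once is visible through both arrangements.
body.content = "Body (corrected)"
print(text_of(page_layout)[2])  # Body (corrected)
```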

Alternative Read Orders and Page Layouts

Documents can have multiple allowed read orders and page
layouts, and it will be desirable to encode them into the
document itself for the applications which use them.
Automatic testers might use this information when evaluating
page decomposition systems.  Alternative read orders and page
layouts rely on the entity borrowing capability discussed
above, so that the same entities from the image of the
document can be linked together in different orders.

          The Illuminator User Interface:  Using DAFS

The Illuminator is an editor and set of tools created for
building document understanding test and training sets.  It
uses the DAFS format to great advantage.  When Illuminator is
completed, it will support documents that have a combination
of text and images, handling text from any script and
language.  It will be the most efficient, easy to use editor
available for correcting machine-recognized or hand-entered
documents. Error-free documents created with this editor can
be used as training data for new kinds of document
recognizers, as a reference set (ground truth) for testing
such recognizers, or for automated entry of information into

Illuminator offers four modes of operation, all currently
under development.  ImageMode is for viewing and working with
actual images from a document.  TextMode is a simple but
complete text editor supporting all major scripts and
languages; currently, Russian and American are available.  Out-
of-Context Mode is a powerful mode of error correction.
Document entities of the same type are collected into a
single file (e.g., all the 'A's from a page), where errors may
easily be spotted by eye and corrected.  Flagged Mode provides
quick, easy correction of errors flagged by OCR.  The user can
switch from Mode to Mode at the touch of a button. Changes
made in one Mode will be apparent immediately in the others as
well, and multiple windows will be supported.  The discussion
below presents the state of Illuminator as of Release 0.4.


In ImageMode, an image of a page from the document is
displayed.  The image may be subdivided into convenient
smaller images called entities.  The image is not subdivided
physically; rather, information about the entities (such as
their bounding boxes and textual content) is superimposed.
Entities may be related hierarchically; such entities are
referred to as "parent" and "child" entities; many entities
are both.  The textual content of an entity can be represented
as a set of possible strings, each with an associated
confidence value.  A given entity's textual content can be
edited or deleted, and other attributes of interest recorded
as properties of the entity.

Future plans for ImageMode include the ability to edit entity
bounding boxes as well as their contents, and to some degree
perhaps the image itself.  There will be further tools for
creating, merging and resizing bounding boxes, which will be
used in creating and altering entities.  Other tools may
include but are not limited to: a shrink wrap tool (for
snugging a box or other shape to a Glyph), a Glyph finder, and
a picture finder.


TextMode displays the text contents of labeled entities from
DAFS documents.  It is also capable of reading, displaying and
editing ASCII and TIFF files.  If positional information is
available, text is presented as close in appearance to the
original arrangement as possible.  In the absence of such
information, the text flows, wrapping around at the edge of
the window.

Insert typing and deletion are provided at a cursor positioned
with the mouse.  The mouse can also be used to select blocks of
text, shown in reverse video, for deletion or overtyping.  As
text is edited in TextMode, Illuminator tracks which entities
have been edited and which have been deleted, preparing to
reflect these changes should the user switch to one of the
other Modes.  Entities previously selected from another Mode
also appear in TextMode in reverse video, until a new selection
is made.

In the future, TextMode will be similar to PageMaker or
FrameMaker.  It will be as close to WYSIWYH (What You See Is What
You Had) as possible.  It will take advantage of error flags
from OCR programs, and will be capable of pop-up verification,
bringing up the image of a flagged entity for comparison with
the text.

Out-of-Context Mode

All the Glyphs of one type can be displayed together in this
Mode.  The user is able to bring up a window of just the 'A's,
for example.  They will be displayed packed from left to right
and top to bottom, without regard to the original location on
the page.  If a '4' shows up on the 'A' page in Out-of-Context
Mode, it's immediately obvious that it is out of place.  After
it is relabeled, the image of the '4' will no longer appear
with the 'A's.  It will instead be present on the '4's page.

This Mode is a completely different way of verifying a
document's accuracy, and provides a powerful tool for creating
error-free documents and training sets.  Because the Out-of-
Context Mode brings all Glyphs of the same type into one
window, mislabeled ones really stand out.  Errors can be
spotted with a quick glance at the page.

In the future, other types of entities (not just Glyph
entities) will be classified and grouped this way in Out-of-
Context Mode.  Pages of identical Glyph entities will be
savable to separate files.  For example, a correctly labeled
page of 'e's may be saved for use as part of an OCR training
set.

Flagged Mode

Flagged Mode is somewhat similar to Out-of-Context Mode.
Flagged Mode is for correcting entities which have been
flagged by OCR as questionable or unknown.  Some OCR programs
may flag single glyphs; some may only flag entire words.  The
flagged entities are grouped by entity type rather than by
content, so that the user may view a page of flagged Glyphs, a
page of flagged Words, flagged Lines, etc.  The OCR-derived
content of each entity is displayed along with its image, and
can be edited in the same way as in TextMode.

              The DAFS Library:  Programming DAFS

The DAFSlib enables the developer to read, write and work with
DAFS files.  It provides a way to read TIFF or pda images and
uncompress them into a bit map.  Routines are provided for
creating, labeling and working with all the types of entities
defined under DAFS.  The whole document, including bounding
boxes and labels, can then be saved to a DAFS file.


Some applications for the DAFS Library follow.

     An OCR package could use it to read in a TIFF image and
     write the recognized data in DAFS format.

     An OCR developer might write a character segmenter that
     puts the image of each character into a DAFS file
     consisting of all the same glyph.  The files of sorted
     images could then be used to train an OCR engine.

     An existing database could be converted to DAFS format,
     enabling the use of DAFS-based tools like Illuminator,
     and giving other DAFS users easy access.

     A filter program could be written to read a DAFS file and
     write just specified elements into a separate file.  For
     instance, all the paragraphs containing the word
     'eggplant' could be pulled out of a document and placed
     into a separate file.

     Document structure recognizers might output their
     information in a DAFS file using the DAFSlib.

DAFSlib Routines

The current set of routines in the DAFS Library is listed by
name in Appendix D.  For further information, consult the
Programmer's Guide to the DAFS Library.

              Appendix A.  DAFS-B Storage Format

DAFS-B is a file storage format designed to be easy to read
and write by Illuminator and associated tools and at the same
time to be compact.  These files incorporate the image
directly in the file in an unescaped format, making it quick
to read and write.  All scripts (character sets) which can be
defined under UNICODE are supported in the DAFS-B format.
There has been no attempt to make DAFS-B files human-readable;
rather, DAFS-B is intended to promote efficient handling of
DAFS files where a human-readable requirement might add
substantial size and processing overhead.  DAFS-B is a
compromise between file size, processing speed, and ease of
incorporation of the DAFS-B format into tool and database
programs.  We anticipate that nearly all projects which use
Illuminator and associated tools will read and write DAFS-
BINARY files.

Entities, Properties and Tokens

DAFS-B seeks to preserve the flavor of SGML as much as
possible while incorporating unescaped images directly into
the file.  DAFS entities roughly correspond to SGML elements
and DAFS properties to SGML attributes.  One of the reasons we
chose to give them separate names is to make it easier for a
user to define what would be a new element type in SGML.
Under DAFS, provided the properties of two different entities
are identical, there is no need for them to have separate
entity definitions.  Thus, while SGML might have separate
element types for "paragraph" and "word", DAFS has a single
entity type with type strings paragraph and word.  These do
not need to be defined in the DTD, and hence it is not necessary
to change the DTD if the user adds a new entity type.  Indeed,
at the current time DAFS has only one entity type with all
distinction made by TStrings (type strings) which contain the
name of the entity.  These may contain any text.

We avoid escaping binary data (such as images and feature
vectors) in DAFS-B by specifying the length of such data
beforehand.  As a result, DAFS-B files must be scanned from
the beginning in order to ensure correct parsing.  Parsing
cannot be picked up from the middle of the file, as it often
can be in SGML.

The basic structure of a DAFS-B file is straightforward.  The
basic unit is either

          Token  Size-of-Data  Data

or, when the token alone carries the information, simply

          Token

All tokens are single bytes with their two least significant
bits set to 0.  These two bits are used to represent the
number of bytes in the Size-of-Data (SOD) field.  SOD can be
zero, one, two, or four bytes, the content of which tells
how many bytes are in the Data field.  (Note that if it is
either a two or a four byte number, the bytes may need to be
swapped in order to correctly interpret them.  The ByteOrder
token (see dafstype.h) determines whether swapping is
necessary.)  If the SOD field is one byte, the Token is OR'd
with 0x1; if it is two bytes, the Token is OR'd with 0x2; if
it is four bytes, the Token is OR'd with 0x3.  In the token
byte the bits are used as follows:

bits 0-1: size of SOD

bits 2-7: Token

where bit 7 is the most significant bit (MSB).

This yields 64 possible tokens.  Since the tokens are designed
to be as general as possible (there is, for example, only one
type of entity in DAFS-B) this seems sufficient.

The Data field consists of SOD bytes in a format to be
interpreted according to "Token".  If "Token" itself is
sufficient to communicate the required information (e.g. BTrue
and BFalse), then the last two bits are left set to 0.  If the
size of the SOD is zero then the SOD is implicitly zero.
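The header rule described above can be sketched in C.  This is only an illustration under stated assumptions, not the DAFSlib implementation: the names dafsb_sod_width and dafsb_put_header are hypothetical, the choice of the smallest SOD width that fits is an assumption, and the SOD is written most significant byte first (a real writer would use whatever order it announced with the ByteOrder token).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for the Size-of-Data (SOD) field: 0, 1, 2 or 4.
 * Assumption: the writer picks the smallest width that fits. */
static int dafsb_sod_width(uint32_t data_len)
{
    if (data_len == 0)       return 0;
    if (data_len <= 0xFFu)   return 1;
    if (data_len <= 0xFFFFu) return 2;
    return 4;
}

/*
 * Write "Token [SOD]" into out (at least 5 bytes) and return the number
 * of bytes written.  The SOD is written most significant byte first.
 */
static size_t dafsb_put_header(uint8_t token, uint32_t data_len, uint8_t *out)
{
    /* Size code stored in bits 0-1 of the token byte, indexed by width. */
    static const uint8_t code_for_width[5] = { 0x0, 0x1, 0x2, 0x0, 0x3 };
    int width = dafsb_sod_width(data_len);
    size_t n = 0;
    int i;

    assert(out != NULL);
    assert((token & 0x03u) == 0);  /* bits 0-1 must be free for the code */

    out[n++] = (uint8_t)(token | code_for_width[width]);
    for (i = width - 1; i >= 0; i--)
        out[n++] = (uint8_t)(data_len >> (8 * i));
    return n;
}
```

With these assumptions, a token 0x04 carrying 3 bytes of data produces the header bytes 0x05 0x03, and a token with no data produces the token byte alone.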

Here is an example from a hex dump of a file.

0x05 0x03 0x00 0x98 0xfe 0x08 0x0f 0x00 0x01 0x23 0x45 0xf4 0x

The first byte (0x05) is the first token 0x04 OR'd with 0x01,
which indicates a size of SOD of  1.  The second byte (0x03)
is the SOD, which is 3 bytes.  Therefore the data is the next
3 bytes 0x00, 0x98, and 0xfe.  The next token follows
immediately and is 0x08.  Its size-of-SOD is 0, so there is no
SOD and no data.  The next byte (0x0f) derives from the token
0x0c OR'd with 0x03, indicating a four-byte SOD.  The SOD is the number
0x00012345, which forms the next four bytes.  The first byte
of 0x12345 bytes of data is 0xf4.
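The walk through the dump above can be reproduced with a small header decoder.  This is a sketch, not the DAFSlib routine i_ReadToken or i_ReadSize: the DafsbHeader type and the function dafsb_get_header are hypothetical names, and the SOD is assumed to be stored most significant byte first, as it is in the dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  token;      /* token byte with bits 0-1 cleared      */
    int      sod_width;  /* size of the SOD field: 0, 1, 2 or 4   */
    uint32_t data_len;   /* value of the SOD field                */
    size_t   header_len; /* bytes occupied by the token and SOD   */
} DafsbHeader;

/*
 * Decode one header starting at buf[0].  Returns 0 on success and -1
 * if fewer than header_len bytes are available.
 */
static int dafsb_get_header(const uint8_t *buf, size_t len, DafsbHeader *h)
{
    static const int width_for_code[4] = { 0, 1, 2, 4 };
    int i;

    assert(buf != NULL && h != NULL);
    if (len < 1)
        return -1;

    h->token = (uint8_t)(buf[0] & 0xFCu);
    h->sod_width = width_for_code[buf[0] & 0x03u];
    h->header_len = 1u + (size_t)h->sod_width;
    if (len < h->header_len)
        return -1;

    h->data_len = 0;
    for (i = 0; i < h->sod_width; i++)
        h->data_len = (h->data_len << 8) | buf[1 + i];
    return 0;
}
```

To step through a file, a caller would decode a header, skip header_len plus data_len bytes, and decode the next one.  Because the data is never escaped, this only works when starting from the beginning of the file.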

List of Tokens

The following are currently allowed tokens.  Immediately
following is a description of how they are used to form
properties, images, entities, and finally whole DAFS-B
documents.

  NullTok      SOD = 0.  It starts a DAFS-B document.
  ByteOrder    SOD = 2 bytes.  Determines the byte order of the
               document.
  BeginEntity  SOD = 4 bytes.  It is required to begin any
  EndEntity    SOD = 0.  It ends an entity definition.
  BorrowEntity SOD = 4 bytes.  Tells what entity is
  Box          SOD = 16 bytes.  It defines the bounding box of
               an entity.
  BeginProp    SOD = 0.  It starts a property definition.
  EndProp      SOD = 0.  It ends a property definition.
  BeginImage   SOD = 4 bytes.  It starts an image
  EndImage     SOD = 0.  Denotes the end of an image.
  ImageID      SOD = 4 bytes.  It refers to an image
               previously defined with BeginImage.
  Data         Any SOD allowed.  It presents data of any type.
               A parser would not know how to interpret it;
               only the application that wrote it would know
               its meaning.
  Int          SOD = 4 bytes.  Presents a single (long) of
  BTrue        SOD = 0.  The data is a Boolean of value TRUE.
  BFalse       SOD = 0.  The data is a Boolean of value FALSE.
  Float        SOD = 4 bytes.  Presents a single (float) of
  DPosSet      SOD = any multiple of 4 bytes.  Presents an
               alternative set.
  D1PosSet     SOD = 4 bytes.  Presents an alternative set
               with one member.
  CCITT4Chunk  Any SOD allowed.  It is a CCITT group 4
               compressed chunk of image.
  DString      Any SOD allowed.  These tokens define strings.
               There are four different ones so that within an
               entity more than one string can be used, each
               with a different meaning.

Entity Definition

  BeginEntity      Begins the definition of an entity.  The
               data of this token is the entity's ID, which is
               used only when the file is being read.  The
               token BorrowEntity will refer to other entities
               by this ID.

If any of the following tokens do not appear, the entity uses
a default value.  If more than one appears or if conflicting
tokens appear, Data and DPosSet for example, the entity will
use the last one encountered.

  TString      Stores the entity's type string (default type
               string is "").
  Box          Stores the entity's bounding box (default box
               has w = 0, indicating no box).
  BFalse       Sets the child type to iOr.
  BTrue        Sets the child type to iAnd (default).
  DString      Sets the entity's contents to this string
               (default is no contents).
  DPosSet      Sets the entity's contents to this array of
  D1PosSet     Sets the entity's contents to this glyph (one
  Data         Sets the entity's contents to this user-defined
  ImageID      Makes the entity's image point to a previously
               defined image.
  BeginProp    Starts a property set, which is described below.
  EndProp      Ends the property set.
  BeginImage   Starts an image, which is described below.
               The data of this token is the image's ID, which
               is used only when reading the file.  The
               ImageID token refers to this image by this ID.
  EndImage     Ends the image.

Because an entity can have any number of children (sub-
entities) there can be any number of the following tokens,
which define children.

  BorrowEntity Creates a child entity (subentity) that borrows
               data from another entity.  The data of this
               token is the ID of the entity that will be
               borrowed from.  The entity that is to be
               borrowed from must have already been read in.
               Any number of children can appear; all will be
               attached to the end of the list of children for
               this entity.
  BeginEntity  Recursively begins a new entity
               definition.  This entity is attached to its
               parent at the end of the parent's list of child
  EndEntity    Ends the entity's definition.  The
               BeginEntity token and this token always appear
               in pairs.

Property Definition

  BeginProp   Begins the definition of a set of properties.

There can be any number of property pairs, each consisting of a
TString token followed by any data token.  The string token is
the name of the property.  The data is one of the following:

  Int      The property's data is an integer.
  Float    The property's data is a float.
  BFalse   The property's data is a Boolean and has the value
            of False.
  BTrue    The property's data is a Boolean and has the value
            of True.
  Data     This presents user-defined data.
  AString  The property's value is this string.
  EndProp  Ends the set of properties.

For example the following token sequence would define two
properties.  The first is the property "bold" which has the
value of False.  The second is the "point size" property which
has the value of 12.

  BeginProp
  TString  "bold"
  BFalse
  TString  "point size"
  Int      12
  EndProp

Image Definition

  BeginImage       Begins the definition of an image.

At this time there is only one image format defined.  It first
begins with a bounding box, then has any number of
CCITT4Chunks, each of which can be any size.

  Box         This must appear first and is the size of the
               image.  x and y will always be zero, so only w
               and h are used.
  CCITT4Chunk This is a chunk of data of any amount.  It is
               best to keep this data on the order of w bytes
               long.  Because it is group 4 compressed there
               may be any number of image scan lines per
               CCITT4Chunk, and there is no guarantee that the
               data will end on a whole scan line.
  EndImage    Ends the image.  If there wasn't enough
               compressed data to make an image of the size
               specified by the bounding box, DAFSlib will
               return an error.

For example, the following token sequence would define an image
of size 34 by 100 pixels.

  Box      0 0 34 100

DAFS-B Documents

A DAFS-B document begins with two NullTok tokens.  Since an
SGML document never begins with two null bytes we can
distinguish between the different DAFS documents without
relying on file name extensions.  The ByteOrder token will
follow so that the parser will know what byte order the data
was written with.  The rest of the document should be one

For example:

  BeginEntity         1
  BeginImage          1
  Box      0 0 12 2
  BeginEntity         2
  ImageID  1

This document has one entity that has an image of 12 by 2
pixels.  This entity has a child that points to the same
image.  This DAFS-B document was written by a machine that has
the same byte order as the machine that parsed this file.
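The two checks a reader performs at the start of a file can be sketched as follows.  This is an illustration only: the function names are hypothetical and are not DAFSlib routines, and the sketch assumes, as the text implies, that NullTok is a zero byte.  How the ByteOrder data is interpreted is defined in dafstype.h and is not shown here; the swap helpers only show what "swapping" means once the reader has decided it is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if the buffer starts with the two null bytes that open a DAFS-B
 * document (assumes NullTok is the byte 0x00). */
static int looks_like_dafsb(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    return len >= 2 && buf[0] == 0x00 && buf[1] == 0x00;
}

/* Reverse the byte order of a two-byte value. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Reverse the byte order of a four-byte value. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}
```

A reader would call looks_like_dafsb on the first bytes of a file to choose between the DAFS-B parser and an SGML parser, and would apply swap16 or swap32 to two-byte and four-byte fields only when the ByteOrder token says the writer's order differs from its own.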

Example Document

The following example was created by running a program called
"dumpdafs" on a DAFS-B document.  "dumpdafs" prints out the
tokens and the data that they represent.  It doesn't try to
build or understand the entity, property, or image data
structures that are defined (as the DAFS Library routine
i_ReadEntity would).  A simple example is presented here.
This document shows just one entity that has a content of
"incunabulum".  This document also includes some user data,
called "RAFWLfeat" as a "property" of the entity.

NullTok                                  This is a DAFS-B document.
ByteOrder    000002   same               This document was written by
                                         a processor that has the
                                         same byte ordering.
BeginEntity  000004   1                  Here is Entity #1
BeginImage   000004   1                  This entity has an image and
                                         this is image #1
Box          000016:  0 0 1216 1024      Bounding box of the image.
CCITT4Chunk  000932:  ....               932 bytes of image data
                                         (in CCITT group 4 format).
Box          000016:  767 745 251 39     The bounding box of the
                                         entity on its image (#1).
TString      000009:  Document           This entity is of type
                                         "Document".
TString      000010:  RAFWLfeat          There is only one property
                                         and it is named "RAFWLfeat"
Data         000010:  ....               A RAFWLfeat contains 10 bytes
                                         of data.  Because those data
                                         are binary and of arbitrary
                                         meaning and format, dumpdafs
                                         does not display them.
DString      000012:  incunabulum        This content is a datastring
                                         (DString) which is 12 bytes
                                         long, containing the letters
                                         "incunabulum" and the
                                         terminating NULL.

              Appendix B.  DAFS-U Storage Format

DAFS-UNICODE is a data-storage format capable of handling non-
ASCII characters as well as ASCII, incorporating the UNICODE
character set into an SGML format.  Images may be incorporated
as well via a separate attached file.  DAFS-U is in the
preliminary design phase.  It is not yet implemented.  This
section considers possibilities for DAFS-U, presenting three
possible schemes for incorporating UNICODE which easily
simplify to straight ASCII.  The following  formats are under
consideration.  Users are encouraged to send comments and
suggestions to David Ross (e-mail:  davros@raf.com).

DAFS-UNICODE is very similar to SGML with two exceptions.
First, since SGML was not originally designed to deal with
images at all, incorporating images into an SGML document is
somewhat arbitrary.  In defining DAFS we have followed the
CALS approach of pointing to an external file containing the
image.  This means that DAFS-U documents can be read and
edited by any editor which can handle UNICODE.  In this sense
they are "human-readable".   It also means that any document
containing both image and markup will be made up of at least
two files, which can be something of a disadvantage.  This
arrangement represents a compromise between competing

Second, DAFS-U supports entity tags (delimiters) and contents
which are UNICODE instead of ASCII.  While this increases its
utility for processing non-Roman scripts, it may severely limit
the number of non-Illuminator tools which can edit DAFS-U
documents.  DAFS-U will support any script which a user may
care to define which fits into UNICODE.  Illuminator and its
associated tools, however, so far support only a portion of the
UNICODE scripts (because of the need to display them) and are
not yet set up for user scripts, although with some work these
can be incorporated.  Although DAFS-U allows UNICODE values for
all entity tags, we strongly recommend that the tags be limited
to characters which also have an ASCII value.

Modification of Base Character Set

The first option is to change the base character set (tagset)
or a content character set to be UNICODE.  SGML specifically
allows for these sorts of modifications, and such an approach
would have the advantage of most closely following SGML's
rules.  Stating that this option is acceptable in SGML,
however, is like saying that the Susan B. Anthony dollar is
legal tender -- it is technically true, but practically false.
Many SGML parsers do not implement this feature, which would
effectively rule out its use with programs employing those
parsers.  In addition, DAFS-U implemented this way would not
really be human readable because most equipment does not have
software to view UNICODE as a standard option.


The second option is to use the UTF-FSS encoding algorithm.
This is an algorithm for turning "wide chars" (characters of a
fixed width greater than 1) into multi-byte strings, in which
a single character may be represented by a variable number of
bytes.  It was designed by Ken Thompson of AT&T and it has a
number of interesting properties:

  UNICODE characters in the range U+0000 - U+007F
  (corresponding to the ASCII range) just drop the byte on the
  left, becoming their ASCII equivalent. This means they are
  represented as a single byte.

  Encoded characters are easily recognized by having the
  leftmost bit set to 1 (which ASCII does not).

  A maximum of three bytes is required to represent the
  remaining UNICODE characters (because the leftmost bit of
  encoded bytes is set to 1).  All non-encoded UNICODE
  characters require two bytes.

In documents where characters in the ASCII range predominate,
UTF-FSS encoding saves a great deal of space.  Using this
approach, DAFS-U simplifies automatically to DAFS-A (see
below) when no non-ASCII characters are present.  Unlike the
first option, this scheme is passable by all real-world SGML
parsers. On the downside, we have yet to encounter anyone who
is using it for SGML, though the algorithm is well known.
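The properties listed above can be seen in a short encoder sketch.  UTF-FSS is the scheme that later became widely known as UTF-8, and the bit layout used here is that standard layout for 16-bit values.  The function name utf_fss_encode is hypothetical and is not part of DAFSlib; the sketch handles only single 16-bit UNICODE values, which is the range this document discusses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one 16-bit UNICODE value into out (at least 3 bytes).
 * Returns the number of bytes written: 1, 2 or 3.
 */
static size_t utf_fss_encode(uint16_t ch, uint8_t *out)
{
    assert(out != NULL);

    if (ch < 0x80) {                         /* ASCII range: 0xxxxxxx */
        out[0] = (uint8_t)ch;
        return 1;
    }
    if (ch < 0x800) {                        /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (ch >> 6));
        out[1] = (uint8_t)(0x80 | (ch & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (uint8_t)(0xE0 | (ch >> 12));
    out[1] = (uint8_t)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (ch & 0x3F));
    return 3;
}
```

An ASCII letter such as 'A' (U+0041) stays the single byte 0x41, while U+30BC becomes the three bytes 0xE3 0x82 0xBC, each with its leftmost bit set.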

Symbolic Names

The third option, and one that has some backing in the SGML
community, is to use symbolic names for non-ASCII characters.
There is a standard under development known as ISO 9573 for
creating standardized names, and work is in progress on
mapping tables between ISO 9573 and UNICODE.

For example, the symbolic name "&KATAKANA LETTER ZE;" might
represent the UNICODE character U+30BC.  A variant on this
approach would be to create more compact symbolic names, for
example "&U+30BC;".  Ideally, an SGML parser will  be capable
of handling the symbolic names via a mapping table.  If not, a
dummy table might be supplied, permitting the parser to ignore
the symbolic names.
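The compact variant of the symbolic name can be generated mechanically.  The following is a minimal sketch of that idea, not a proposal for the final DAFS-U syntax: the function name put_symbolic is hypothetical, and the sketch does not address how a literal '&' in ASCII content would be distinguished from the start of a symbolic name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a 16-bit UNICODE value into out either as a plain ASCII
 * character or as a compact symbolic name such as "&U+30BC;".
 * out must hold at least 9 bytes.  Returns the string length.
 */
static int put_symbolic(unsigned ch, char *out)
{
    assert(out != NULL && ch <= 0xFFFFu);

    if (ch < 0x80u) {              /* ASCII passes through unchanged */
        out[0] = (char)ch;
        out[1] = '\0';
        return 1;
    }
    return snprintf(out, 9, "&U+%04X;", ch);
}
```

A document containing only ASCII characters passes through unchanged, which is the sense in which this option simplifies to plain ASCII.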

As was the case with the UTF-FSS option, DAFS-U naturally
simplifies to DAFS-A under the symbolic names approach.

              Appendix C.  DAFS-A Storage Format

DAFS-ASCII is an SGML application, capable of incorporating
image data in external attached files, which is currently
under development.  It will be identical to DAFS-U except that
it permits ASCII characters only as content and in tags.
RAF's customers have asked that a purely ASCII storage option
be made available for documents in which no UNICODE characters
occur (that is, no characters that could not have been
represented by ASCII alone).  DAFS-A will comply with these

                 Appendix D.  DAFSlib Routines


  void i_InitLib(void);
  void i_ExitLib(void);


  iEntityPtr i_NewEntity(const char *type,iEntityPtr parent);
  iEntityPtr i_GetChildEntity(const iEntityPtr entity);
  iDAFSError i_BorrowEntity(iEntityPtr entity,iEntityPtr
            parent, iEntityPtr *epp);
  void i_DisposeEntity(iEntityPtr entity);
  iDAFSError i_ReadEntity(const char *fileName,iEntityPtr
  iDAFSError i_WriteEntity(const char *fileName,const
            iEntityPtr entity, iDAFSWriteMode mode);
  iDAFSError i_TransferEntity(iEntityPtr entity,iEntityPtr
  iEntityPtr i_GetNextEntity(const iEntityPtr entity);
  iEntityPtr i_GetPrevEntity(const iEntityPtr entity);
  iEntityPtr i_GetChildEntity(const iEntityPtr entity);
  iEntityPtr i_GetParentEntity(const iEntityPtr entity);
  void i_SetChildType(iEntityPtr entity,iChildType type);
  iChildType i_GetChildType(iEntityPtr entity);
  void i_MoveToFront(iEntityPtr entity);
  void i_MoveToBack(iEntityPtr entity);
  iDAFSError i_MoveBefore(iEntityPtr entity,iEntityPtr
  iDAFSError i_MoveAfter(iEntityPtr entity,iEntityPtr


  iPropPtr i_SetProperty(iEntityPtr entity, const char
            *name,iPTag type,long size,void *data);
  iPropPtr i_FindProperty(const iEntityPtr entity,const char
  iPropPtr i_FirstProperty(const iEntityPtr entity);
  iPropPtr i_NextProperty(const iPropPtr LastProp);
  const iPDataPtr i_GetPropertyData(const iPropPtr prop);
  const char *i_GetPropertyName(const iPropPtr prop);
  iDAFSError i_SetPropertyData(iPropPtr prop,const iPData
  void i_SetPropertyName(iPropPtr prop,const char *name);

Get and Set

  const char *i_GetComment(const iEntityPtr entity);
  const char *i_GetType(const iEntityPtr entity);
  void *i_GetUserPtr(const iEntityPtr entity);
  void i_SetType(iEntityPtr entity,const char *type);
  void i_SetUserPtr(iEntityPtr entity,void *user);
  iDTag i_GetEDataType(iEntityPtr entity);
  const char *i_GetText(const iEntityPtr entity);
  iDAFSError i_SetText(iEntityPtr entity,const char *text);
  void *i_GetUserData(const iEntityPtr entity,long *size);
  void i_SetUserData(iEntityPtr entity,void *userData,long
  short i_GetGlyph(const iEntityPtr entity);
  void i_SetGlyph(iEntityPtr entity,short chr);
  const iPosSet *i_GetPosSet(iEntityPtr entity,int *n);
  void i_SetPosSet(iEntityPtr entity,int n,const iPosSet

Images and bounds

  void i_SetImage(iEntityPtr entity,iImagePtr image);
  const iImagePtr i_GetImage(const iEntityPtr entity);
  iImagePtr i_CloneEntityImage(const iEntityPtr entity);
  iBox i_GetBounds(const iEntityPtr entity);
  void i_SetBounds(iEntityPtr entity,iBox bounds);
  iImagePtr i_BitMap2Image(iBox bounds, int dataWidth, char
  iDAFSError i_Image2BitMap(iImagePtr image,iBox bounds, long
            *dataWidth,char **data);
  void i_SetImageMag(iImagePtr image,long mag);
  long i_GetImageMag(iImagePtr image);
  iImagePtr i_RotateImage(iImagePtr image,int rot);
  int i_GetImageWidth(const iImagePtr image);
  int i_GetImageHeight(const iImagePtr image);
  int i_ImageExt(const char *fileName);
  iDAFSError i_ReadImage(const char *fileName,iImagePtr
  iDAFSError  i_WriteImage(const char *fileName,iImagePtr
            image, iDAFSCompType comp);
  void i_DisposeImage(iImagePtr image);

Call Backs

  void i_SetCallBack(iCallBackProc callBack,void *userData);
  void i_GetCallBack(iCallBackProc *callBack,void **userData);

Error Messages

  const char *i_GetError(iDAFSError error);

DAFS-B Utility Routines

  iDAFSError i_Read1Token(FILE *fo,char *t);
  iDAFSError i_ReadToken(FILE *fo,char *t,int *swap);
  iDAFSError i_ReadSize(FILE *fo,char t,long *size);
  iDAFSError i_ReadItem(FILE *fo,char t,long *size,void
  iDAFSError i_ReadItemAlloc(FILE *fo,char t,long *size,void
  iDAFSError i_WriteItem(FILE *fo,char t,long size,const void
