Research Projects at the CECM: Document Vault

Document Vault

The Document Vault project is intended to provide the basis for the flexible creation and use of on-line, multi-media documents. In particular, a mechanism for the inclusion of arbitrary types of information into a document is needed to support standard archiving and distribution practices on the Internet. The basic Vault features include:

A definition of Vault objects composed of standard file types (images, sounds, text, LaTeX, data, multi-part, etc.) and appropriate headers.
A definition of documents as one or more objects assembled into a particular structure.
A definition of the methods, appropriate to each object, that are needed to manifest a document in whatever target context is required (PostScript, HTML, ASCII, DVI, etc.).
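
The three definitions above might be sketched as follows. This is only a minimal illustration in Python, not the Vault's actual design: the names VaultObject, Document, METHODS and manifest, the object kinds, and the "caption" header are all hypothetical, and only two target contexts are covered.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Callable, Dict, List, Tuple


@dataclass
class VaultObject:
    """Content of a standard file type plus descriptive headers (assumed layout)."""
    kind: str                                   # e.g. "text" or "image" (assumed labels)
    payload: str                                # the content itself, or a path to it
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    """One or more objects assembled in a particular order."""
    objects: List[VaultObject]


def text_ascii(obj: VaultObject) -> str:
    return obj.payload


def text_html(obj: VaultObject) -> str:
    return f"<p>{escape(obj.payload)}</p>"


def image_ascii(obj: VaultObject) -> str:
    # ASCII cannot show an image, so fall back to its caption header.
    return f"[Image: {obj.headers.get('caption', 'no caption')}]"


def image_html(obj: VaultObject) -> str:
    caption = escape(obj.headers.get("caption", ""))
    return f'<img src="{escape(obj.payload)}" alt="{caption}">'


# Hypothetical method table: (object kind, target context) -> method.
METHODS: Dict[Tuple[str, str], Callable[[VaultObject], str]] = {
    ("text", "ascii"): text_ascii,
    ("text", "html"): text_html,
    ("image", "ascii"): image_ascii,
    ("image", "html"): image_html,
}


def manifest(doc: Document, target: str) -> str:
    """Render every object of the document for the given target context."""
    parts = []
    for obj in doc.objects:
        method = METHODS.get((obj.kind, target))
        if method is None:
            raise ValueError(f"no method for {obj.kind!r} in context {target!r}")
        parts.append(method(obj))
    return "\n".join(parts)
```

In this sketch a single Document instance is passed to manifest once per target context, so the source is written once and each served form is derived from it.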

Consequently, a document can be authored by including text, images, sounds, data and network pointers to the same or other forms of information, and then served in whatever form is required, for either local or network consumption. Because there is only one authoritative copy of each document, maintenance is kept to a minimum.

Originally, the Vault project was initiated to alleviate problems related to the maintenance and development of electronic archives. Anonymous FTP, Gopher and World Wide Web (WWW) archives differ in how their contents are typically organized, largely because of the nature of the underlying protocols. Moreover, the type of information varies: raw or compressed data files are the norm for FTP, ASCII text is the most useful for Gopher and HTML-structured multi-media is appropriate for the Web. Worse, the concept of a document is not clearly defined, since a mix of images, sounds, text (ASCII and formatted) and data may be required. For a site offering all three services, problems of document authority, duplication and redundant maintenance chores are difficult to cope with.

In addition to addressing the archive maintenance problem, it has become apparent that the Vault offers a great deal of potential for document authors. Instead of information being trapped within the structure of the chosen format, an author can freely include any previously used object in a new document. The origin and context of the object are preserved. Further, the document can be constructed to appear appropriate to the context within which it is being viewed. For example, images can be replaced by a text caption in ASCII representations (and similarly for sounds in printed copies). Dynamic information can be added at the time of viewing (e.g. from a database). Network pointers can be included, or excised and replaced with references.
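
As one illustration of context-dependent presentation, the handling of network pointers might look like the sketch below. It is written in Python under stated assumptions: the fragment representation, the function name render_fragments and the example URL are hypothetical and are not taken from the Vault prototype.

```python
from html import escape
from typing import List, Tuple, Union

# A fragment is either plain text or a (label, url) network pointer.
Fragment = Union[str, Tuple[str, str]]


def render_fragments(fragments: List[Fragment], target: str) -> str:
    """Keep pointers as live links in HTML; replace them with numbered
    references in ASCII, where a link cannot be followed."""
    if target not in ("html", "ascii"):
        raise ValueError(f"unsupported target context: {target!r}")

    body: List[str] = []
    refs: List[str] = []          # URLs in order of first appearance
    for frag in fragments:
        if isinstance(frag, str):
            body.append(escape(frag) if target == "html" else frag)
            continue
        label, url = frag
        if target == "html":
            body.append(f'<a href="{escape(url)}">{escape(label)}</a>')
        else:
            if url not in refs:
                refs.append(url)
            body.append(f"{label} [{refs.index(url) + 1}]")

    text = "".join(body)
    if refs:
        # Excised pointers are collected into a reference list at the end.
        listing = "\n".join(f"[{i}] {u}" for i, u in enumerate(refs, 1))
        text += "\n\nReferences:\n" + listing
    return text
```

Given the same fragment list, the "html" target keeps each pointer inline, while the "ascii" target moves the URLs into a numbered list at the end of the text.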

While projects of this sort are likely underway in many places, we are working on establishing a rapid prototype based on extant software, standards and protocols. The primary goal is to relieve the pressure on local archive maintainers while supporting growth and development of new documents. A simple prototypical interface is already in use at the CECM site for documents such as preprints.
