DCPD: Common Formats

For additional DCPD information, see:

  1. Introduction
  2. Framework
  3. Envelope
  4. Extensions
  5. Common Formats

Common Formats

Some subformats seem likely to be used in multiple contexts. They are described here to avoid scattering multiple definitions or descriptions across the specification.

SOURCE: Bibliographic references
TEXT: Textual information
URI and URL: Data references
VERSION: Version or release information

SOURCE

(Bibliographic reference. For future specification.)

TEXT

Text is free text intended for a human reader.

The exact form is probably something determined by the relevant specification, but three formats seem desirable to allow:

Plain text.
  No additional mark-up. None.
  Unicode as base character code with UTF-8 encoding.
Markdown.
  Lightly marked-up text. The CommonMark specification may be a reasonable level to start with, but it may need to be restricted as well as extended to fit DCPD.
  Extended Syntax support might be desirable, but CommonMark does allow for HTML tags to be used.
  CommonMark presupposes Unicode.
  URIs may need to be restricted to in-line data, and HTML blocks probably also need to be restricted.
HTML.
  Some reasonable subset of HTML tags. Probably what is needed for Markdown to start with. (This seems to be at least these tags: block blockquote code em h1 h2 h3 h4 h5 h6 hr li ol p pre strong table), plus standard entities, while tags such as a, iframe and img may need to be restricted to inline or local URIs (i.e. data:… and file:///… ).

<img> is needed to support inclusion of chess diagrams in TEXT/Markdown or TEXT/HTML as long as standard character code support is lacking.

This type may be expanded into:

{
  type : "text"  or "md" or "html" -- req
  data : string                    -- req
}

or perhaps:

{
  type : ... as above  -- req
  data : URL           -- req
}

where the URL locates UTF-8 byte-stream data. (If other character encodings are required, a field to identify them is obviously needed, in which case some convention is also needed for handling an encoding that may be specified twice when the data: URL scheme is used.)
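As an illustration only, the first (inline string) form of the TEXT structure above could be checked along these lines. This is a minimal sketch, not part of the specification: the names Text, ALLOWED_TEXT_TYPES and text_from_dict are hypothetical, and the rule that both fields are required is taken from the "req" markers above.

```python
from dataclasses import dataclass

# The three type labels named in the TEXT description above.
ALLOWED_TEXT_TYPES = {"text", "md", "html"}


@dataclass(frozen=True)
class Text:
    """Hypothetical in-memory form of a TEXT value with inline data."""
    type: str
    data: str

    def __post_init__(self):
        if self.type not in ALLOWED_TEXT_TYPES:
            raise ValueError(f"unknown TEXT type: {self.type!r}")
        if not isinstance(self.data, str):
            raise TypeError("TEXT data must be a string")


def text_from_dict(record: dict) -> Text:
    """Build a Text from a decoded record, insisting on both required fields."""
    missing = {"type", "data"} - record.keys()
    if missing:
        raise ValueError(f"missing required field(s): {sorted(missing)}")
    return Text(type=record["type"], data=record["data"])
```

A record such as {"type": "md", "data": "*Zugzwang*"} would be accepted, while an unknown type label or a missing field raises an error. The URL-based variant is left out here because its handling of character encodings is still open.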

URI and URL

The term URI corresponds to the standard definition of the term. (See https://en.wikipedia.org/wiki/Uniform_Resource_Identifier and the standard documents cited by that web page.) These references are all in a context where a web browser may be needed to interpret data, and where a human user is the main recipient of the information.

For computer use, any web-related surroundings (JavaScript-based etc.) are undesirable. This excludes sites such as Google Books, HathiTrust Digital Library and others from being used to provide, say, single pages: these sites typically produce a web reader environment, with controls for browsing, searching, zooming etc., i.e. they provide a fairly complex environment useful for a general reader using a web browser, but not useful for computer access to requested data.

Thus, the current DCPD use of the term URL implies that a successful request will produce a response that contains a single data stream that can be interpreted in the context of the request. That context implies that a request for a page image should return a page image, not an entire PDF file, or an HTML page containing JavaScript that puzzles together separate 256x256 image fragments into such an image.

At present the only URL allowed uses the ‘file’ scheme (see RFC 8089) and refers only to local files. The file structure model can be illustrated as follows. A published collection is stored on a local disk in a structure similar to:

.../DIRECTORY
	collection.dcp
	image-1.png, image-2.png, ...

The file that describes a collection (i.e. collection.dcp in this case) refers to the files using URLs in the format

file:/image-1.png  (or possibly file:///image-1.png)

Also, the only form of image data that is supported is PNG.
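The current restrictions (file scheme only, local files only, PNG only) could be enforced roughly as follows. This is a sketch under stated assumptions: the function name resolve_image_url is hypothetical, and the text above does not say that the URL path is interpreted relative to the directory holding collection.dcp, so that reading is an assumption.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the fixed first eight bytes of a PNG file


def resolve_image_url(url: str, collection_dir: Path) -> Path:
    """Map a 'file' URL to a path inside the collection directory.

    Assumption: the URL path is relative to the directory that holds the
    collection file. Both file:/name.png and file:///name.png are accepted.
    """
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"only the 'file' scheme is allowed: {url!r}")
    if parts.netloc not in ("", "localhost"):
        raise ValueError(f"only local files are allowed: {url!r}")
    relative = PurePosixPath(unquote(parts.path).lstrip("/"))
    if ".." in relative.parts:
        raise ValueError(f"path must stay inside the collection: {url!r}")
    if relative.suffix.lower() != ".png":
        raise ValueError(f"only PNG images are supported: {url!r}")
    return collection_dir.joinpath(*relative.parts)


def looks_like_png(first_bytes: bytes) -> bool:
    """Check the file signature rather than trusting the extension."""
    return first_bytes.startswith(PNG_SIGNATURE)
```

Checking the signature as well as the extension guards against a mislabelled file; whether DCPD should require that check is not stated in the text.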

Further developments

The restriction to PNG files seems easy enough to lift to allow other raster-graphic formats to be used. The support of such formats should probably be guided by existing recommendations of file formats suitable for digital preservation. (Examples of main formats often mentioned in such recommendations: PNG, TIFF. Some restriction may be necessary, for example concerning multi-image TIFF.)

The restriction to individual local files is probably also easy enough to lift. Two possibilities are:

scheme zip: local ZIP file archive
zip:/<subfolder>/<archive>.zip?<internal-path>/file.png
(should be similar to jar: see https://docs.oracle.com/javase/6/docs/api/java/net/JarURLConnection.html)
Open question: the use of ! as separator in jar: seems odd. Is it legal?

Again, ZIP and TAR are two formats recommended as archive formats.

scheme pdf: local PDF file
Not an existing scheme
pdf:/<subfolder>/<file>.pdf?page=<nr> or #<nr> or something similar (nr is the PDF page number, not the printed page number)

For document archive formats, PDF/A, EPUB and OpenOffice (.odt, perhaps also .sxw) are often recommended.
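To make the zip: possibility above more concrete, the following sketch splits such a URL into archive path and internal path and reads the member. The zip: scheme is only a proposal in this document, so everything here is hypothetical: the function names, the use of ? as separator exactly as written above, and the assumption that the archive path is relative to a base directory.

```python
import zipfile
from pathlib import Path
from urllib.parse import unquote, urlsplit


def split_zip_url(url: str) -> tuple:
    """Split zip:/<subfolder>/<archive>.zip?<internal-path> into its two parts."""
    parts = urlsplit(url)
    if parts.scheme != "zip":
        raise ValueError(f"not a zip: URL: {url!r}")
    if not parts.query:
        raise ValueError(f"missing internal path after '?': {url!r}")
    archive = unquote(parts.path).lstrip("/")
    member = unquote(parts.query)
    return archive, member


def read_zip_member(url: str, base_dir: Path) -> bytes:
    """Return the bytes of the single archive member the URL points at."""
    archive, member = split_zip_url(url)
    with zipfile.ZipFile(base_dir / archive) as zf:
        return zf.read(member)
```

The result is a single data stream, which matches the DCPD meaning of URL given earlier. A pdf: scheme would need a page-extraction step in place of the archive read and is not sketched here.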

As far as remote access is concerned, allowing URLs to include an authority field (e.g. ‘//host.example.com’) does not seem impossible.

Further developments: Integrity assurance

For some (all?) links to files or file containers that do not provide any kind of time-stamped integrity assurance, it may be desirable to include such information together with the URI/URL.

A simple form of this would be a date and time when the link was established, and a digital hash of the contents retrieved.
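Such a record could look roughly like this. It is a sketch only: the field names, the choice of SHA-256 as the hash, and the ISO 8601 UTC timestamp format are all assumptions, since the text does not fix any of them.

```python
import hashlib
from datetime import datetime, timezone


def make_link_record(url: str, content: bytes) -> dict:
    """Pair a URL with the time it was resolved and a hash of what it returned."""
    return {
        "url": url,
        # Hypothetical field names; time recorded in UTC, to the second.
        "retrieved": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "hash_algorithm": "sha256",
        "hash": hashlib.sha256(content).hexdigest(),
    }


def content_matches(record: dict, content: bytes) -> bool:
    """Re-hash freshly retrieved content and compare with the stored digest."""
    return hashlib.sha256(content).hexdigest() == record["hash"]
```

Storing the algorithm name next to the digest would let a later revision change the hash function without invalidating existing records.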

VERSION

Specifies the version of formats, extensions, etc.

The version format is patterned on one used for software releases: <major>.<minor>.<build>.<state>

major : integer 0–MAX
  large, significant or important changes to envelope contents. Change of format or extensions. Not necessarily backwards compatible.
minor : integer 0–MAX
  minor changes to envelope contents or changes of optional format / extension format or data. Almost always backwards compatible.
build : integer 0–MAX
  individual changes. Intended to be incremented at every change or save or commit, or equivalent. May be present in versions of published collections, but is primarily intended to be used for internal or private releases.
state : string
  “PRIVATE” - private publication of work-in-progress
  “REVIEW” - internal publication of work-in-progress for review
  “PRELIMINARY” - official publication of preliminary release

A fourth state (“PUBLISHED”) is not intended to be used explicitly, but expressed through the absence of any other state label.

This is expressed as a string, in the format “DIGITS.DIGITS[.DIGITS[.STATE]]” where square brackets represent optional contents.

A struct-based representation is also possible.
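As an illustration, the string format above can be parsed into such a struct as follows. This is a sketch with hypothetical names (Version, parse_version). It assumes that an absent build is kept as None, and it rejects an explicit “PUBLISHED” label because the text says that state is expressed only by absence.

```python
import re
from typing import NamedTuple, Optional

# The three state labels that may appear explicitly.
STATES = ("PRIVATE", "REVIEW", "PRELIMINARY")

# DIGITS.DIGITS[.DIGITS[.STATE]]: the state may only follow a build number.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+)(?:\.([A-Z]+))?)?")


class Version(NamedTuple):
    major: int
    minor: int
    build: Optional[int] = None
    state: str = "PUBLISHED"  # implied when no state label is present


def parse_version(text: str) -> Version:
    """Parse a version string into its struct-based representation."""
    match = _VERSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid version string: {text!r}")
    major, minor, build, state = match.groups()
    if state is not None and state not in STATES:
        raise ValueError(f"unknown state label: {state!r}")
    return Version(
        int(major),
        int(minor),
        int(build) if build is not None else None,
        state if state is not None else "PUBLISHED",
    )
```

With this reading, "1.2" and "1.2.34.REVIEW" are accepted, while "1.2.REVIEW" is rejected because the nested brackets in the format only allow a state after a build number.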