Open-Databank Format

Author: Alan G. Isaac
Date: 20050826
First Posted:20050718
Version: 1.1.0
Copyright: Creative Commons Attribution-ShareAlike 2.5 (or later version).
Document URL:http://www.american.edu/econ/pytrix/opendatabank.txt
HTML version:http://www.american.edu/econ/pytrix/opendatabank.htm

The open-databank single-series format is intended to be highly backwards compatible (see notes below) with the classic microTSP databank format. Essentially, the open-databank format enhances the comment lines to include multi-line comments and label comments. This document also serves to document both the original standard, which seems largely undocumented on the internet, and the open-databank extension.

The MicroTSP Databank Format

The original databank format was promulgated by microTSP. It is primarily useful for data of a fixed annual sample frequency (annual, quarterly, monthly) and for undated data. It is supported by TSP and EViews (by means of their store and fetch commands) and by many other applications. A brief description of the microTSP specification follows.[1]

A microTSP databank file is an ASCII text file. Traditionally the name of the file ends with extension .db. A databank file is formatted linewise. The first n lines (n>=1) are comment lines. For dated series, the next three (n+1--n+3) lines specify the frequency, start date, and end date. For undated series, the next two (n+1--n+2) lines specify the start index and end index. The remaining lines are data: one observation per line, or NA if missing.

In More Detail

  • Dated and Undated Series:

    Line 1--n:

    A databankfile begins with n comment lines, each starting with quote-c: "c The first comment is the creation/update date, formatted as follows: cLast updated: 08-18-2006 Subsequent comments contain optional documentation, e.g., cMy useful comment. (See below for the open-databank modification of this specification of the comment lines.)

  • Undated Series:

    Line n+1--n+2

    Positive integers, coding the starting index and ending index of an undated series.

    E.g.,

    1
    300 
    
  • Dated Series:

    Line n+1

    A negative integer, coding the annual frequency of the time series (times -1): -12 (monthly), -4 (quarterly), or -1 (annual).

    Line n+2--n+3

    Starting date and ending date for series in format yyyy, yyyy.q, or yyyy.mm.

    E.g., for monthly data:

    -12
    1980.01
    1990.12
    

    E.g., for quarterly data:

    -4
    1980.1
    1990.4
    

    E.g., for annual data:

    -1
    1980
    1990
    
  • Dated and Undated Series:

    Remaining lines.

    Data, with one observation per line. An observation is either a number (float) or a missing value, coded as NA

Open-Databank: Modifications of the Specification

The open-databank format enhances the original databank specification for the comment lines and adds a couple details. It retains all of the advantages of the microTSP databank format for fixed-frequency time series data. (That is, it is easily human readable, almost self documenting, easily parsed, and terse.) Any correctly formatted microTSP databank file is an open-databank file.

Comment Lines

  • A comment line begins with a comment marker, which is followed by comment content. (The comment content may be padded on either side with white space.)

  • There are two kinds of comment line, new-comment lines and continued-comment lines, distinguished by their two-character comment markers.

    new-comment marker:

    quote-c: "c

    continued-comment marker:

    quote-space:

    • A comment line starts with a two-character comment marker.
    • The first character on a comment line is the double-quote character (ASCII 34).
    • The second character on a comment line is either the lowercase-c character (ASCII 99) or the ordinary-space character (ASCII 32).
  • Comment content begins with the first non-whitespace character after a comment marker and ends with the last non-whitespace character on the comment line. White space immediately following the comment marker or at the end of the comment line is not part of the comment content.

  • There are two kinds of comment:
    • Ordinary comment (also called a remark)
    • Label comment (also called a label)
  • A new-comment line containing a colon (ASCII 58) begins a label comment, which specifies a key:value pair.

    • The key is the comment content before the first colon.
    • The colon may be padded with whitespace on either side; such whitespace is not part of the key or the value.
    • The label value follows the first colon, possibly on one or more continuation lines.

    Example:

    "c Units: current dollars
    

    Example:

    "c Units:
    "  current dollars
    

Details

A line may contain 1024 characters, including end-of-line characters. Readers may truncate longer lines. Any standard (Unix, DOS, Mac) line-termination sequence may be used to indicate the end of a line. It is recommended that the line-termination sequence be consistent throughout a file. The last line of a file should include a line-termination sequence. A file-termination sequence should not be used.

Comments on Backward Compatability

  • As of version 5, EViews simply discards comment continuation lines.

  • As of version 5, EViews recognizes only three labels:

    • Last updated
    • Display Name
    • Modified

    All other labels are treated as ordinary comments.

  • Eviews writes a quote at the end of each comment line. (This practice is discouraged.)

Recommendations

Writer recommendations

  • Write the file with a .db extension unless otherwise specified.
  • Follow each comment marker with one space.
  • Include a label comment with the DBversion key to accommodate possible later extensions of the open-databank format. E.g., DBversion: 1.0
  • Provide an option for writing microTSP compatible output. (Essentially: each comment is written to a single line.)

Reader recommendations

  • Ignore blank lines and lines containing only white space.
  • Read any line starting with a quote as a comment line. (EViews 5 discards lines starting with a quote but not with quote-c.)
  • Disregard white space immediately following a comment marker. (EViews 5 follows this practice.)
  • Disregard any quote at the end of a comment line, and then disregard white space at the end of a comment line. (EViews 5 disregards the final quote but preserves the white space.)
  • Ensure that the sample specified implies the number of observations actually listed. If not, offer a warning.
  • Some existing .db files code missing values as 0.10E-36.[2] Readers are recommended but not required to accommodate this practice, as an option.
  • If the series has a label comment with key = SeriesName and non-empty value, treat the value as the name of the series. Otherwise use the name of the file containing the series (after dropping any extension such as .db or .txt).

Open-Databank Multifiles

The open-databank multifile format specification is still in progress. However it is usable in its current state and files meeting the current specification will remain valid. (Comments are welcome.) The object is to allow simple storage of multiple series in a single file. This is achieved in the most obvious way: stacking the contents of individual open-databank files, each of which is preceded by an series-boundary marker. The open-databank multifile format specification includes a set of basic extensions to the open-databank single-series format. Use of these extensions breaks compatibility with the single-series format.

Basic Concepts

The Open-Databank Multifile format allows stacked series.

  • Each series should follow the open-databank format described above, except as noted in Extensions.

  • Preceding each series, including the first series in the file, must be a line containing only the series-boundary marker:

    --series-boundary
    
  • The series-boundary marker may be preceded and followed by empty lines.

  • Each series in the stack should include a label comment with key = SeriesName and value specifying the name of the series. It is recommended that this be the first comment line for the series.

  • The last line in the file must be the following:

    --series-boundary--
    

    Note that the series-boundary markers thereby mimic a MIME standard, making the open-databank multifile format easy to parse.[3]

  • Lines above the first --series-boundary marker are considered to be comment lines, used for comments on the entire file rather on a specific series. Since this entire area is reserved for comments, there is no special comment marker.

    • The first character on a new comment line must be alpha-numeric.
    • The first character on a comment continuation line must be the space-character (ASCII 31).
    • A new comment line containing a colon (ASCII 58) specifies a label comment. (Note: see FRED for examples of this comment format. Note: be careful not to place a URL on the first line of an ordinary comment.)

Extensions

  • The frequency, sample start, and sample end information may be placed on a single line, separated by white space. No data is allowed on this line.
  • Multiple white-space separated observations may be placed on a single line.

Possible Future Extensions

These extensions are listed in rank order of likelihood of implementation.

  • Allow specifying frequency, sample start, and sample end in comment lines.
  • Allow missing values to be coded as a period (full-stop): .

Notes and References

[1]

Details on the original format were derived from the following sources:

[2]See e.g. [NBER_2001] and associated historical files. (The missing value code in the .db files of this database was 0.1E-36 until August 2005, when it was replaced with NA.)
[3]See for example the Python multifile module.
[Hall_and_Lilien_1989]Hall, Robert E., and Lilien, David, microTSP Version 6.5 User's Manual, Quantitative Micro Software, 1989.
[NBER_2001](1, 2) Feenberg, Daniel and Jeff Miron, NBER Macrohistory Database, NBER, 2001.