| Author: | Alan G. Isaac |
|---|---|
| Date: | 20050826 |
| First Posted: | 20050718 |
| Version: | 1.1.0 |
| Copyright: | Creative Commons Attribution-ShareAlike 2.5 (or later version). |
| Document URL: | http://www.american.edu/econ/pytrix/opendatabank.txt |
| HTML version: | http://www.american.edu/econ/pytrix/opendatabank.htm |
The open-databank single-series format is intended to be highly backwards compatible (see notes below) with the classic microTSP databank format. Essentially, the open-databank format enhances the comment lines to include multi-line comments and label comments. This document describes both the original standard, which seems largely undocumented on the internet, and the open-databank extension.
The original databank format was promulgated by microTSP. It is primarily useful for data of a fixed annual sample frequency (annual, quarterly, monthly) and for undated data. It is supported by TSP and EViews (by means of their store and fetch commands) and by many other applications. A brief description of the microTSP specification follows.[1]
A microTSP databank file is an ASCII text file. Traditionally the name of the file ends with extension .db. A databank file is formatted linewise. The first n lines (n>=1) are comment lines. For dated series, the next three (n+1--n+3) lines specify the frequency, start date, and end date. For undated series, the next two (n+1--n+2) lines specify the start index and end index. The remaining lines are data: one observation per line, or NA if missing.
Dated and Undated Series:
A databank file begins with n comment lines, each starting with quote-c:
"c
The first comment is the creation/update date, formatted as follows:
"cLast updated: 08-18-2006
Subsequent comments contain optional documentation, e.g.,
"cMy useful comment.
(See below for the open-databank modification of this specification of the comment lines.)
Undated Series:
Positive integers, coding the starting index and ending index of an undated series.
E.g.,
1
300
Dated Series:
A negative integer, coding the annual frequency of the time series (times -1): -12 (monthly), -4 (quarterly), or -1 (annual).
Starting date and ending date for series in format yyyy, yyyy.q, or yyyy.mm.
E.g., for monthly data:
-12
1980.01
1990.12
E.g., for quarterly data:
-4
1980.1
1990.4
E.g., for annual data:
-1
1980
1990
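As a rough illustration of the date coding above, the following Python sketch computes how many observations a dated series should contain. The function names `parse_date` and `expected_observations` are hypothetical. The sketch assumes that the end date is inclusive and that the frequency line is one of -1, -4, or -12. Dates are parsed as text rather than as floats, so that 1980.1 (first quarter) and 1980.10 (October) stay distinct.

```python
def parse_date(text, freq):
    """Return (year, period) for a date such as '1980', '1980.3' or '1980.07'.

    freq is the positive number of observations per year (1, 4, or 12).
    """
    year, _, sub = text.strip().partition(".")
    period = int(sub) if sub else 1  # annual dates carry no sub-period
    if not 1 <= period <= freq:
        raise ValueError(f"period {period} out of range for frequency {freq}")
    return int(year), period


def expected_observations(freq_line, start_line, end_line):
    """Count the observations implied by the three header lines."""
    freq = -int(freq_line)  # the file stores the frequency times -1
    if freq not in (1, 4, 12):
        raise ValueError(f"unsupported frequency line: {freq_line!r}")
    start_year, start_period = parse_date(start_line, freq)
    end_year, end_period = parse_date(end_line, freq)
    # Assumption: the end date is included in the sample.
    return (end_year - start_year) * freq + (end_period - start_period) + 1


print(expected_observations("-12", "1980.01", "1990.12"))
```

A reader could compare this count with the number of data lines actually found, as a basic consistency check.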
Dated and Undated Series:
Data, with one observation per line. An observation is either a number (float) or a missing value, coded as NA
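The layout described above (comment lines, then the header lines, then data) can be read with very little code. The following Python function is only a minimal sketch, not a reference implementation: the name `read_databank` and the shape of the returned dictionary are hypothetical, blank lines are assumed to carry no information, and missing values are stored as `None`.

```python
def read_databank(lines):
    """Parse one classic databank series from an iterable of text lines."""
    rows = [ln.rstrip("\r\n") for ln in lines]
    rows = [ln for ln in rows if ln.strip()]  # assumption: blank lines are ignorable

    # Leading comment lines all start with the double-quote character.
    n = 0
    while n < len(rows) and rows[n].startswith('"'):
        n += 1
    comments = rows[:n]

    first = int(rows[n])
    if first < 0:
        # Dated series: frequency (times -1), start date, end date.
        freq = -first
        start, end = rows[n + 1].strip(), rows[n + 2].strip()  # kept as text
        body = rows[n + 3:]
    else:
        # Undated series: start index, end index.
        freq = None
        start, end = first, int(rows[n + 1])
        body = rows[n + 2:]

    # One observation per line; NA marks a missing value (None in this sketch).
    data = [None if item.strip() == "NA" else float(item) for item in body]
    return {"comments": comments, "freq": freq,
            "start": start, "end": end, "data": data}
```

For example, `read_databank(open("myseries.db"))` would return the parsed series for a hypothetical file named myseries.db.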
The open-databank format enhances the original databank specification for the comment lines and adds a couple of details. It retains all of the advantages of the microTSP databank format for fixed-frequency time series data. (That is, it is easily human readable, almost self documenting, easily parsed, and terse.) Any correctly formatted microTSP databank file is an open-databank file.
A comment line begins with a comment marker, which is followed by comment content. (The comment content may be padded on either side with white space.)
There are two kinds of comment line, new-comment lines and continued-comment lines, distinguished by their two-character comment markers.
- new-comment marker:
quote-c: "c
- continued-comment marker:
quote-space: "
- A comment line starts with a two-character comment marker.
- The first character on a comment line is the double-quote character (ASCII 34).
- The second character on a comment line is either the lowercase-c character (ASCII 99) or the ordinary-space character (ASCII 32).
Comment content begins with the first non-whitespace character after a comment marker and ends with the last non-whitespace character on the comment line. White space immediately following the comment marker or at the end of the comment line is not part of the comment content.
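The marker and content rules above might be applied as in the following Python sketch. The function name `split_comment_line` and the returned kind names ("new", "continued") are hypothetical, chosen only for illustration.

```python
def split_comment_line(line):
    """Return (kind, content) for one comment line.

    kind is 'new' for the quote-c marker and 'continued' for quote-space.
    """
    line = line.rstrip("\r\n")
    marker, rest = line[:2], line[2:]
    if marker == '"c':
        kind = "new"
    elif marker == '" ':
        kind = "continued"
    else:
        raise ValueError(f"not a comment line: {line!r}")
    # White space after the marker and at the end of the line
    # is not part of the comment content.
    return kind, rest.strip()


print(split_comment_line('"c   My useful comment.   '))
```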
A new-comment line containing a colon (ASCII 58) begins a label comment, which specifies a key:value pair.
- The key is the comment content before the first colon.
- The colon may be padded with whitespace on either side; such whitespace is not part of the key or the value.
- The label value follows the first colon, possibly on one or more continuation lines.
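The key and value rules listed above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive reader: the function name `parse_labels` is hypothetical, and joining the content of continuation lines with a single space is an assumption that the specification does not settle.

```python
def parse_labels(comment_lines):
    """Collect label comments (key: value) from a sequence of comment lines."""
    labels = {}
    key = None  # key of the label currently being continued, if any
    for line in comment_lines:
        marker, content = line[:2], line[2:].strip()
        if marker == '"c':
            if ":" in content:
                # The key is everything before the first colon.
                raw_key, _, raw_value = content.partition(":")
                key = raw_key.strip()
                labels[key] = raw_value.strip()
            else:
                key = None  # an ordinary comment, not a label
        elif marker == '" ' and key is not None:
            # Assumption: continuation content is appended after one space.
            labels[key] = (labels[key] + " " + content).strip()
    return labels


print(parse_labels(['"c Units:', '"  current dollars']))
```

Under these assumptions, a label written on a single line and the same label split across a continuation line yield the same key and value.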
Example:
"c Units: current dollars
Example:
"c Units:
"  current dollars
A line may contain up to 1024 characters, including end-of-line characters. Readers may truncate longer lines. Any standard (Unix, DOS, Mac) line-termination sequence may be used to indicate the end of a line. It is recommended that the line-termination sequence be consistent throughout a file. The last line of a file should include a line-termination sequence. A file-termination sequence should not be used.
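A reader that accepts all three line-termination conventions and truncates overlong lines might look like the following Python sketch. The function name `read_lines` is hypothetical, the input is assumed to be text that has already been decoded, and counting the terminator toward the 1024-character limit is one reading of the rule above.

```python
import re

MAX_LINE_LENGTH = 1024  # includes the end-of-line characters
_EOL = re.compile(r"(\r\n|\r|\n)")  # DOS, classic Mac, and Unix terminators


def read_lines(text):
    """Split text into lines, dropping terminators and truncating long lines."""
    # With a capturing group, re.split returns
    # [content0, eol0, content1, eol1, ..., last_content].
    pieces = _EOL.split(text)
    lines = []
    for i in range(0, len(pieces), 2):
        content = pieces[i]
        is_last = i + 1 >= len(pieces)
        eol = "" if is_last else pieces[i + 1]
        if is_last and content == "":
            break  # nothing follows the final terminator
        lines.append(content[:MAX_LINE_LENGTH - len(eol)])
    return lines


print(read_lines("-4\r\n1980.1\r1990.4\n"))
```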
As of version 5, EViews simply discards comment continuation lines.
As of version 5, EViews recognizes only three labels:
- Last updated
- Display Name
- Modified
All other labels are treated as ordinary comments.
EViews writes a quote at the end of each comment line. (This practice is discouraged.)
The open-databank multifile format specification is still in progress. However, it is usable in its current state, and files meeting the current specification will remain valid. (Comments are welcome.) The object is to allow simple storage of multiple series in a single file. This is achieved in the most obvious way: stacking the contents of individual open-databank files, each of which is preceded by a series-boundary marker. The open-databank multifile format specification includes a set of basic extensions to the open-databank single-series format. Use of these extensions breaks compatibility with the single-series format.
The Open-Databank Multifile format allows stacked series.
Each series should follow the open-databank format described above, except as noted in Extensions.
Preceding each series, including the first series in the file, must be a line containing only the series-boundary marker:
--series-boundary
The series-boundary marker may be preceded and followed by empty lines.
Each series in the stack should include a label comment with key = SeriesName and value specifying the name of the series. It is recommended that this be the first comment line for the series.
The last line in the file must be the following:
--series-boundary--
Note that the series-boundary markers thereby mimic a MIME standard, making the open-databank multifile format easy to parse.[3]
Lines above the first --series-boundary marker are considered to be comment lines, used for comments on the entire file rather than on a specific series. Since this entire area is reserved for comments, there is no special comment marker.
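The stacking rules above suggest a simple splitter, sketched here in Python. The function name `split_multifile` is hypothetical. The sketch assumes that white space around a marker can be ignored and that empty lines carry no information, so they are dropped everywhere. Each returned series is a list of lines that could then be handed to a single-series reader.

```python
def split_multifile(lines):
    """Split a multifile into (file_comments, list_of_series_line_lists)."""
    file_comments = []
    series = []
    current = None  # lines of the series being collected, if any
    for raw in lines:
        line = raw.rstrip("\r\n")
        stripped = line.strip()
        if stripped == "--series-boundary--":
            break  # closing marker: end of file content
        if stripped == "--series-boundary":
            current = []  # a new series starts after each marker
            series.append(current)
        elif not stripped:
            continue  # assumption: empty lines are ignorable
        elif current is None:
            file_comments.append(line)  # above the first marker
        else:
            current.append(line)
    return file_comments, series
```

The series name could then be taken from the SeriesName label comment of each returned block.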
These extensions are listed in rank order of likelihood of implementation.
| [1] | Details on the original format were derived from the following sources: |
| [2] | See e.g. [NBER_2001] and associated historical files. (The missing value code in the .db files of this database was 0.1E-36 until August 2005, when it was replaced with NA.) |
| [3] | See for example the Python multifile module. |
| [Hall_and_Lilien_1989] | Hall, Robert E., and Lilien, David, microTSP Version 6.5 User's Manual, Quantitative Micro Software, 1989. |
| [NBER_2001] | Feenberg, Daniel and Jeff Miron, NBER Macrohistory Database, NBER, 2001. |