Using Right-to-Know Data to Answer Important Environmental Policy Questions
Center for Environmental Information and Statistics
U.S. Environmental Protection Agency
September, 1998 by
Environment, Statistics and Policy (ESP) Project
Table of Contents

1. Tell the user more about the technical aspects of using the data.
2. Make the data easier to use.
3. Create and make available time series data.
4. Provide a more useful context for the data.

This case study is built around a fictitious organization that represents chicken farmers of the Mid-Atlantic states. The Chicken Farmer's Association (CFA) looks out for the interests of the chicken farmers, provides support to them, and lobbies on their behalf. An issue of great importance to these farmers is the Mid-Atlantic problem of pfiesteria that has occurred in several rivers in this area (generally, EPA Region 3). We assume the role of statisticians who have been hired by the CFA to explore possible causes of pfiesteria other than run-off problems linked to chicken-related operations.
Some believe that the pfiesteria outbreaks are due to the application of manure-based fertilizers, which are used on farms in the area. Chicken manure is a byproduct of the chicken farming industry located there. The sale of manure provides additional income to farmers and also helps them dispose of chicken waste. This manure is rich in nitrogen and phosphorus, which are thought to fuel pfiesteria outbreaks. The nutrients are also detrimental to waterways since they feed algae growth, which takes oxygen from water when introduced in great quantities.
Several states, including Maryland and Virginia, have proposed legislation which would require farmers to adopt tighter regulations on waste products. This legislation may be costly to farmers in terms of both capital and operating expenses. The CFA argues that there is little evidence that the pfiesteria outbreak is the result of chicken manure. Pollutants could also come from a variety of industrial, municipal, and other sources.
To test our theory we want data on a river where a pfiesteria outbreak has occurred. For that river, we want to examine data on discharges from sources other than chicken farmers. We want data on excessive water pollutant discharges, and in particular, those that occurred during the periods of the pfiesteria outbreaks.
A good case study is the Pocomoke River, where a major

To relate other contaminants to pfiesteria, we need time series data that match other area water discharges against outbreaks of pfiesteria. In fact, while water discharges may be a leading contributor to the pfiesteria problem, part of the problem originates in air and land media.
The most accessible data is obtained through the Internet. The least costly way for us to do research is to use the data collected by the EPA under its Permit Compliance System (PCS). PCS data is now available to the public from sites maintained by the EPA (Envirofacts) and by a private group (RTK.NET, from the Unison Institute).
Both sites provide about 15 different types of environmental data sets available to the public. PCS data is available from both systems. Both systems offer drag and click choices for identifying regions for data requests. Our initial reaction was that the RTK.NET data was the more easily accessible of the two and that it provided greater depth.
Both systems, however, are oriented towards a higher-ability Web user. We believe both sites need a greater degree of

III. Results
This section discusses right-to-know data, where to find it,
and how to get it.
A. Using Right-to-Know Data
We accessed environmental data available to the public through "Right-to-Know" legislation. RTK.NET is the major private "The Right-to-Know" Web site, run by a private organization affiliated with OMB Watch and the Unison Institute. RTK.NET uses information from EPA, so RTK data appears to cover about the same categories of reporting as Envirofacts.
Another level of sorting options must be made available to the user to make the system more usable. Further, many user choices could be expanded and other "user-friendly" features added. Here is a brief description of the process of getting data.
The RTK.NET web site is very accessible, and the options for sorting by geographic locale are quite easy. Actually managing and utilizing the data is another matter. The system does have a nice feature which sends the data for the chosen geographic locale to
The data we received is an "ascii" file, which you can
easily download from Netscape email using the "save as" option. The file
contains more than 100 categories and can be
transmitted in either TAB or COMMA delimited format. From there
one can import the files into Quattro Pro or Excel (and presumably
SPSS and SAS) for data analysis.
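As an illustration of the import step, the following is a minimal sketch of reading such a file outside a spreadsheet. It assumes only what is stated above (plain ASCII text, TAB or COMMA delimited); the function names and the sample records, including the facility IDs, are hypothetical.

```python
import csv
import io

def detect_delimiter(first_line):
    """Guess TAB vs COMMA by counting which one appears more often in the first line."""
    return "\t" if first_line.count("\t") >= first_line.count(",") else ","

def read_records(text):
    """Parse delimited text into a list of rows, each row a list of strings."""
    first_line = text.splitlines()[0] if text else ""
    reader = csv.reader(io.StringIO(text), delimiter=detect_delimiter(first_line))
    return [row for row in reader]

# Hypothetical two-record sample; the real files carry more than 100 fields per record.
sample = "MD0000001\t03\tMD\tSOMERSET\nMD0000002\t03\tMD\tWORCESTER\n"
rows = read_records(sample)
print(len(rows), "records,", len(rows[0]), "fields in the first record")
```

Guessing the delimiter from the first line is a convenience only; when the delimiter chosen at download time is known, passing it explicitly is the safer choice.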
B. What Does the Data Tell Us?
Total Major and Minor Facilities

We collected PCS data for three counties bordering the Pocomoke, including both major and minor facilities: Somerset and Worcester Counties in Maryland and Accomack County in Virginia. Worcester had about 100 reporting facilities, Accomack 70, and Somerset 50 (see Figure 1).
The types of facilities that may also be contributing to the pfiesteria incidents are: Industry, Federal, Municipal, and Other. The specific types of industries that contribute pollution to the river include construction, asphalt, shipyards, container, fuel spill, and lumber.
The key data for us are the recorded violations of the facilities, and there are several categories that give this violation data. These include indicators such as the number of
One problem is that the data provided is for the life of the facility, which itself can differ, and is not broken down by time or owner for that matter (although such data does exist). While environmental problems are often the result of the aggregation of pollutants, and it may well be so with respect to pfiesteria, correlating pollution to health problems requires the use of time series data. If such data are unavailable, this will severely limit any evaluation of the data.
A more serious problem is in summing of the data for the aggregates. These variables appear toward the end of the 100-plus categories reported for the PCS data. The problem may be due to missing data points in the transmitted files. Thus, if there are any missing or extra categories downloaded for these 100-plus variables, the alignments in a spreadsheet such as Quattro Pro will be off. This misalignment in reading in the data for these MOST critical variables renders them less useful in any analysis role. In fact, we found misalignment to be a common occurrence.
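One way to catch this kind of misalignment before any summing is to count the fields in each record. The sketch below is an illustration rather than the procedure used in this study; the expected width of 102 is an assumption taken from the coding key at the end of this report (columns A through CX), and the function names are hypothetical.

```python
from collections import Counter

EXPECTED_FIELDS = 102  # assumption: columns A through CX in the coding key

def misaligned_rows(rows, expected=EXPECTED_FIELDS):
    """Return (row_number, field_count) for every row whose width differs from expected."""
    return [(i, len(row)) for i, row in enumerate(rows, start=1) if len(row) != expected]

def width_profile(rows):
    """Count how many rows have each field count, to show how common misalignment is."""
    return Counter(len(row) for row in rows)

# Hypothetical rows using a small expected width of 4 for readability.
demo = [["a", "b", "c", "d"], ["a", "b", "c"], ["a", "b", "c", "d", "e"]]
print(misaligned_rows(demo, expected=4))
print(dict(width_profile(demo)))
```

Rows flagged this way should be inspected by hand before the violation counts at the end of the record are trusted, since a single missing or extra field shifts every later column.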
Where might the pfiesteria problem actually lie if additional sources are
considered? In what data are available, there were 27 instances of
single event violations and 17 quarters where facilities were not
in compliance in Somerset County. This gives Somerset County the
Total Single Event Violations

C. Comparable and Additional EPA Data
Through examination of comparable and additional data, it is
possible to provide context for the pfiesteria case.
1. Comparable Data on the Web: Envirofacts
We attempted to access the same data set using EPA's
Envirofacts (EF) Web system, as an initial effort to compare data
sets on water discharges. As a comparative test we
attempted to retrieve the same PCS data for Somerset County,
Maryland. Envirofacts on the Internet, like RTK.NET, has a click
and drag menu feature which is relatively easy to use. The EF data
has a greater degree of informational support than RTK.NET. This
support information is embedded within other related
features, including a useful mapping feature which creates maps of
A cursory study of the informational records uncovers a high
degree of overlap in the types of data available on the EF and RTK
sites. This overlap is not, however, across the board. RTK.NET
reports more data than EF, and some of that additional data is
critical in successfully completing this case study. EF reports on
the permit level per facility, but nowhere could we find reports on
the number of permit violations at the facility level as we did in
RTK.NET. There appears to be no readily accessible method for
evaluating the actual levels of discharge, but merely whether or
not discharge levels represented a violation.
Total Quarters in Non-Compliance

For both RTK.NET and EF data sets, we chose to obtain report information on all facilities both major and minor. As noted, there were 52 in Somerset County reported by RTK.NET (see Figure 3). However, EF only reported 19 major or minor facilities for Somerset County. Since the data is derived from a common source, the difference may well be in the percentage of total records made available by the two different systems.
This difference clearly displays the clash between the goals of
statistical purity and the institutional process. RTK.NET may have
chosen to target the more sophisticated user, who wants data for
2. Additional Data on the Web
a. EPA's Surf Your Watershed
Data on water discharges is only part of the overall story in explaining other factors present in the pfiesteria outbreak on the Pocomoke River. In probing deeper into these other issues, we would also need to obtain data from Storet(x) which details the volume of water discharges and therefore provides context for the PCS discharge data.
Watershed data (Surf Your Watershed) may also provide critical background information. The level of aggregation for watershed data is, however, geographically too inclusive to provide additional insight into conditions on the Pocomoke River. For example, the watershed area for Somerset County is the lower Chesapeake Bay, but there is no "Pocomoke River" watershed or method by which to indicate it as an area of focus.
Here the need is for watersheds to be defined on a variety of levels of aggregation. The current levels cover large areas, but in fact, public interest is likely to center around much more specific areas. People will want to know about the rivers and
b. FedStat
We also used FedStat, which is a collection of databases from many Federal agencies, to seek out causes for the pfiesteria outbreak related to chicken farming data. Through the site search engine we input the search word "chicken".
The search revealed sites where chicken data (6 data sets) and chicken reports (2) were available. Most of these site references originate from the U.S. Department of Agriculture, from the NASS and ERS data bases.
One site we discovered through this system was a survey on Maryland chicken farmers provided by the Maryland Department of Agriculture. The survey found that two-thirds of Maryland farmers focused on poultry (67 percent). The remaining animal operations focused on swine and cattle, each at about 18 percent. The survey covered the Pocomoke watershed, or parts of Worcester, Wicomico and Somerset counties in Maryland.
Agriculture is the major industry in the Pocomoke River area. About 85 percent of farmers apply manure to their crops, and 62 percent get that manure from another farmer. However, only 42 percent of cropland in the state receives manure. Therefore,
The survey found that "there are no extraordinary conditions in the Pocomoke Watershed. Most farmers are protecting water quality in an appropriate manner, using current technology." This statement does not entirely correspond with the data from the IDEA dataset. During the last three years, according to IDEA data, Perdue, Holly, and Hudson were all cited in violation. Were these firms included in this interview? If so, are they not extraordinary conditions?
3. Additional Data Not on the Web
Publicly available data on water discharges only indicates that
violations have occurred at facilities. We asked for more
information on possible sources of pfiesteria, beyond what was
available on the Web at the right-to-know sites maintained by the
Unison Institute (RTK.NET) and the U.S. EPA (Envirofacts).
We received a description of the IDEA dataset from EPA and four related data files for the three counties. The IDEA dataset covers all types of emissions (from 12 differing datasets) by facility. There is one page of description plus several hundred pages of dreary field identifier printouts.
Somerset County Raw files show PCS data on inspections and violations by facility. Even with this added layer
Data are shown by facility, with violations for calendar years 1995-97. There are several types of violations, but no explanation of what the fields include. The data sometimes do not seem to match the descriptors contained in the dreary field descriptions noted above, despite their length (see Table 1).
Table 1
IDEA Reporting Fields
AllViols Effective Inspections NOVs AAs JAs
Violations Violations
CY 1995
CY 1996
CY 1997
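Table 2
Selected Data from IDEA
| Facility | Measure | 1995 | 1996 | 1997 |
|---|---|---|---|---|
| Pocomoke City Sewage Treatment | Violations | 1 | 3 | 1 |
| Pocomoke City Sewage Treatment | Inspections | 2 | 1 | 11 |
| Crisfield Sewage Treatment | Violations | 1 | 3 | 1 |
| Crisfield Sewage Treatment | Inspections | 2 | 0 | 1 |
| Perdue Farms | Violations | 2 | 2 | 5 |
| Perdue Farms | Inspections | 5 | 3 | 3 |
| Hudson Foods | Violations | 19 | 19 | 27 |
| Hudson Foods | Inspections | 13 | 5 | 8 |
| Holly Farms | Violations | 0 | 4 | 3 |
| Holly Farms | Inspections | 4 | 1 | 1 |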
Table 2 shows time series data on violations and inspections trends by facility. Note that the table only shows the number of violations, but not their magnitude.
Note the diametrically different relationship between inspections and violations at the two municipal facilities (Pocomoke City and Crisfield) and at the facilities run by private industry.
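One simple way to put a number on that relationship is the count of violations per inspection over the three years. The sketch below uses the figures from Table 2; the ratio itself is our illustrative choice, not a measure defined in the IDEA files, and it says nothing about the magnitude of any violation.

```python
# Violations and inspections for 1995, 1996, 1997, copied from Table 2.
TABLE_2 = {
    "Pocomoke City Sewage Treatment": {"violations": [1, 3, 1], "inspections": [2, 1, 11]},
    "Crisfield Sewage Treatment": {"violations": [1, 3, 1], "inspections": [2, 0, 1]},
    "Perdue Farms": {"violations": [2, 2, 5], "inspections": [5, 3, 3]},
    "Hudson Foods": {"violations": [19, 19, 27], "inspections": [13, 5, 8]},
    "Holly Farms": {"violations": [0, 4, 3], "inspections": [4, 1, 1]},
}

def violations_per_inspection(record):
    """Total violations divided by total inspections; None if there were no inspections."""
    total_violations = sum(record["violations"])
    total_inspections = sum(record["inspections"])
    if total_inspections == 0:
        return None
    return total_violations / total_inspections

for facility, record in TABLE_2.items():
    ratio = violations_per_inspection(record)
    print(f"{facility}: {ratio:.2f} violations per inspection")
```

Totals over the three years are used because single years contain zero inspections (Crisfield in 1996), which would make a year-by-year ratio undefined.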
The same data for Worcester County, Maryland, looks quite different. In fact, there was some reporting of minimal air emissions, but no violations or inspections occurred in the county. The disparity with Somerset County is startling.
We were also provided a file that showed the Judicial Docket Data
for the two counties in Maryland and the one in Virginia. This file has a
curious cadre of violators,
outside of the others noted above, including:
two violators from the state of Delaware,
a violator from West Virginia,
the University of Maryland,
Maryland State Highway Administration,
Chesapeake & Potomac Telephone Co.,
Berlin Baptist Learning Center,
McReady Memorial Hospital,
7-11 Store,
about 50 violations with no recorded information,
the U.S. Department of Interior,
Westover School,
Eastern Correction Institution,
Maryland Police Barrack V,
Goddard Space Flight Center,
Accomack nursing home,
A laundromat, and
NASA Launch facility.
The files indicated that the Somerset County Sanitary Facility was
fined $27,000 for violations. These reports covered 454
facilities.
Finally, there is a summary report that provides macro-information for the three counties and the attributes of the IDEA database. First, here is the reporting for the counties by the differing reporting mechanisms.
Table 3 shows IDEA data by type of reporting mechanism and environmental data set.
| Program | Number |
|---|---|
| AFS | 24 |
| CER | 3 |
| DCK | 29 |
| DUN | 0 |
| FFI | 6 |
| FIN | 372 |
| LST | 0 |
| PCS | 184 |
| RCR | 95 |
| SET | 0 |
| TRI | 14 |
The data are also broken out by SIC, the Standard Industrial Classification, for the facility of interest. Here is a ranking of the leading SIC codes for the region that includes the three counties (see Table 4). The data show that the most violations relate to local seafood processing, with sewerage systems ranking second. Poultry was cited directly in only 6 of the 111 violating facilities.
Table 4
SIC Codes and Facilities
| SIC | Type | Number |
|---|---|---|
| 2092 | Frozen Seafood | 50 |
| 4952 | Sewerage Systems | 24 |
| 2091 | Canned/Cured Seafood | 10 |
| 913 | Shellfish | 7 |
| 2048 | Animal Foods | 7 |
| 4911 | Electrical Services | 7 |
| 2015 | Poultry Slaughter | 6 |
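The shares implied by Table 4 can be checked with a few lines of arithmetic. The sketch below copies the counts from the table and assumes, as the text does, that the seven listed categories account for all 111 violating facilities.

```python
# (SIC code, type, number of facilities), copied from Table 4.
TABLE_4 = [
    ("2092", "Frozen Seafood", 50),
    ("4952", "Sewerage Systems", 24),
    ("2091", "Canned/Cured Seafood", 10),
    ("913", "Shellfish", 7),
    ("2048", "Animal Foods", 7),
    ("4911", "Electrical Services", 7),
    ("2015", "Poultry Slaughter", 6),
]

def percent_share(count, total):
    """Share of the total, in percent."""
    return 100.0 * count / total

total_facilities = sum(number for _, _, number in TABLE_4)
for code, label, number in TABLE_4:
    print(f"{code} {label}: {percent_share(number, total_facilities):.1f} percent")
```

On these counts, poultry slaughter comes to a little over 5 percent of the listed facilities, against roughly 45 percent for frozen seafood.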
TRI data is also indicated in the aggregate for the three counties. Air data may provide some clues to pfiesteria outbreaks, since air pollution eventually falls to earth and often ends up in water. The data, however, are of little solace for those seeking consistent data trends. Ammonia emissions for the three counties doubled between 1988 and 1994, and then dropped by one-half. Arsenic mysteriously shows the same value for each of the years 1989-91. Chlorine emissions jumped from 500 in 1989 to 3,600 in 1993. By 1995, the emissions had fallen to 541.
Some data is obviously wrong. Copper compounds show a hectic data path (see Table 5). The data trend shows increases between 1988 and 1992. The data is largely unbelievable.
Table 5
Copper Compounds
1988 250
1989 502
1990 510
1991 510
1992 760
1993 0
1994 509
1995 3
The story for Ethylene Glycol is even more ominous in terms of trend reliability. It is clearly a statistician's nightmare, insofar as the data show absolutely no variation. Between 1991 and 1995 the emissions were constantly held at 250 units per annum.
Table 6
Ethylene Glycol
1991 250
1992 250
1993 250
1994 250
1995 250
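Patterns like those in Tables 5 and 6 can be screened for automatically. The following is a rough sketch, not an established data-quality test: it flags a series that never varies, and flags any year that jumps to or from zero or changes by more than a chosen factor. The factor of 10 is an arbitrary assumption, and the function names are hypothetical.

```python
# Reported values by year, copied from Table 5 and Table 6.
COPPER = {1988: 250, 1989: 502, 1990: 510, 1991: 510, 1992: 760, 1993: 0, 1994: 509, 1995: 3}
ETHYLENE_GLYCOL = {1991: 250, 1992: 250, 1993: 250, 1994: 250, 1995: 250}

def is_constant(series):
    """True when every year reports exactly the same value."""
    return len(set(series.values())) == 1

def abrupt_changes(series, factor=10):
    """Years whose value jumps to or from zero, or differs from the prior year by more than `factor` times."""
    years = sorted(series)
    flagged = []
    for previous, current in zip(years, years[1:]):
        a, b = series[previous], series[current]
        if a == b:
            continue
        low, high = min(a, b), max(a, b)
        if low == 0 or high / low > factor:
            flagged.append(current)
    return flagged

print("Copper flagged years:", abrupt_changes(COPPER))
print("Ethylene glycol constant:", is_constant(ETHYLENE_GLYCOL))
```

A flag is only a prompt to go back to the reporting facility or the source file; a constant value may reflect a reporting default rather than a true measurement.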
According to the data, sulfuric acid has been eliminated for the three counties. Between 1988 and 1994 there were no sulfuric acid emissions reported.

IV. Recommendations
Here are four recommendations about improving public access
to and use of right-to-know environmental data.
1. Tell the user more about the technical aspects of using the data.
There is simply not enough
information available on the process of downloading and utilizing
data from either Web site. We use Quattro Pro on the AU system.
The default on the RTK.NET system is tab-delimited format, although
Quattro Pro supports a comma-delimited format. We unfortunately
discovered this the hard way. There should be an explanation of
how to actually manage the data in various software packages as
well as introductory instruction in analyzing it. At the user end,
there should be a user-friendly choice of downloading the data in
readily accessible formats (for example, Quattro Pro, Excel, Word,
etc.).
2. Make the data easier to use.
The data is presented in a random way that confuses the user as
to order of information types. For example, the data fields in the
files when downloaded are not accompanied by the data headers when
imported, which means these must be imported from another file or
typed in by hand. Included in the e-mailed data set, there is a
hyperlink for the header categories, but this step serves as an
additional obstacle for the user to solve as well as another
potential source of error in data use.
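The missing-header step can be scripted rather than typed in by hand. The sketch below assumes a tab-delimited file without a header row and a separately obtained list of field names; only the first five names from the coding key are shown, the sample record is hypothetical, and the function name is ours.

```python
import csv
import io

# First five field names from the coding key (columns A to E); the full key is longer.
HEADERS = ["npdes_id", "region", "state", "permit_ind_cat_tr", "inactive_status"]

def attach_headers(text, headers, delimiter="\t"):
    """Pair each field with its header name; refuse rows whose width does not match."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    labeled = []
    for line_number, row in enumerate(reader, start=1):
        if len(row) != len(headers):
            raise ValueError(
                f"line {line_number}: expected {len(headers)} fields, got {len(row)}"
            )
        labeled.append(dict(zip(headers, row)))
    return labeled

# Hypothetical single record for illustration.
records = attach_headers("MD0000001\t03\tMD\tINDUSTRIAL\tA\n", HEADERS)
print(records[0]["state"], records[0]["npdes_id"])
```

Raising an error on a width mismatch is a deliberate choice here: silently labeling a short or long row would reproduce the column misalignment problem described earlier.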
3. Create and make available time series data.
There needs to be readily-usable time series data.
There should also be a means by which to discriminate data by
time, as that is a feature which will be of constant concern. Data
will naturally need to be examined in terms of periodicity. This
information is determinable, but is not easily attainable in the
current data offering on the Web sites.
4. Provide a more useful context for the data.
There is context for the data, but it is often at levels too disparate from the level of data. In the Pfiesteria case study, there was a context, but the specific locations of the point source data could not link up to the eco-system level data of the context. There must be some discrimination in eco-system levels and scopes to provide a link to the point-source data.

PCS Data Field Explanations
Coding Key for RTK.net Data Case Study: Pfiesteria
A=npdes_id (A unique alphanumeric which identifies either a permit or a facility)
B=region (Two digit code for EPA region in which the facility is located)
C=state (FIPS alphabetic state code (generated by PCS system))
D=permit_ind_cat_tr (Translation of permit_ind cat field)
E=inactive_status (Code indicating whether facility is active or inactive)
F=facility_name_1 (Official or legal name of facility (1st segment))
G=facility_name_2 (Official or legal name of facility (2nd segment))
H=facility_name_3 (Official or legal name of facility (3rd segment))
I=facility_name_4 (Official or legal name of facility (4th segment))
J=major_facility (Code indicating that the facility is a major discharger, M=major)
K=sic (Four-digit Standard Industrial Classification code for facility)
L=sic_tr (Translation of sic field)
M=major_rating (Numerical total of ranking points used to delineate major and minor facilities)
N=county (Name of the county in which the facility is located)
O=owner_type (Code for ownership classification)
P=owner_type_tr (Translation of owner type field)
Q=appl_type (Indicates the type of application form that the facility submitted)
R=appl_type_tr (Translation of appl_type field)
S=priority_epa_hq (Management tool used by EPA headquarters to assign priorities to facilities)
T=epa_or_state_perm (Indicates whether EPA (=E) or the state (=S) issued the permit)
U=facility_name (Name of entity located at facility's physical address)
V=facility_street_1 (First Line of address of physical location of facility)
W=facility_street_2 (Second line of address of physical location of facility)
X=facility_city (Name of the city or town in which the facility is physically located)
Y=facility_state (State or territory in which the facility is physically located)
Z=facility_zip (Zip code of the physical location of the facility)
AA=facility_phone (Telephone number of the facility)
AB=name_mail (facility name in the primary mailing address)
AC=street_1_mail (First line of primary mailing address of facility)
AD=street_2_mail (Second line of primary mailing address of facility)
AE=city_mail (City in the primary mailing address for the facility)
AF=state_mail (State in the primary mailing address of the facility)
AG=zip_mail (Zip code in the primary mailing address of the facility)
AH=hearing_status (Indicates evidentiary hearing anticipated or in progress for permit (I or A))
AI=hearing_file_num (EPA file number identifying the evidentiary hearing)
AJ=hearing_docket (Legal case number identifying evidentiary hearing)
AK=hearing_issue_1 (First of 3 codes for central issue causing evidentiary hearing)
AL=hearing_issue_1_tr (Translation of hearing_issue_1 field)
AM=hearing_issue_2 (Second of 3 codes for central issue causing evidentiary hearing)
AN=hearing_issue_2_tr (Translation of hearing_issue_2 field)
AO=hearing_issue_3 (Third of 3 codes for central issue causing evidentiary hearing)
AP=hearing_issue_3_tr (Translation of hearing_issue_3 field)
AQ=contact_name (Name/department of permittee's representative responsible for DMRs)
AR=contact_phone (Telephone number of the permittee's representative responsible for DMRs)
AS=issue_date (Date the first permit was issued for a facility)
AT=river_basin_tr (Translation of river basin field)
AU=river_segment (River segment or sub-basin (extension to river basin))
AV=inactive_date (Date on which the facility became inactive or active)
AW=number_reissues (The number of times the permit has been re-issued)
AX=new_facility (Code indicating a new facility with no previous discharge permit)
AY=new_facility_tr (Translation of new_facility field)
AZ=new_date (Date that new source or new discharge began operation)
BA=receiving_waters (Name of river, stream, lake, or other body of water which receives discharge)
BB=grant_indicator (Identifies POTW with SIC code 4952 which obtained federal grant money (=$))
BC=final_limits (Indicates facility on final limits; when treatment construction complete (=F))
BD=latitude (Latitude of facility (degrees to tenths of seconds & direction DDMMSSTD))
BE=longitude (Longitude of facility (degrees to tenths of seconds & direction DDMMSSTD))
BF=design_flow (Average design flow for a facility (in million gallons per day))
BG=pretreat_req (Code indicating whether municipality is required to develop pretreatment prog)
BH=pretreat_req_tr (Translation of pretreat_req field)
BI=water_qual_limits (Indicates whether permit contains water quality based limits (Y=yes))
BJ=state_permit_num (Space available to state user to classify permits)
BK=nmp_schedule (Indicates whether Municipal Compliance Plan schedule made in accord with NMP)
BL=nmp_schedule_tr (Translation of nmp_schedule field)
BM=nmp_financial (Indicate financial fitness of POTW to comply with MCP in accord with NMP)
BN=nmp_quarter (Indicates fiscal quarter
BO=nmp_quarter_tr (Translation of NMP_quarter field)
BP=owner (Legal name of the person, firm or entity that owns the facility)
BQ=street_1_owner (First line of the address of the owner of the facility)
BR=street_2_owner (Second line of the address of the owner of the facility)
BS=city_owner (Name of the city or town in the address of the owner of the facility)
BT=state_owner (State or territory of the address of the owner of the facility)
BU=zip_owner (Zip code in the address of the owner of the facility)
BV=phone_owner (Telephone number of the owner of the facility)
BW=operator (Name of the person, firm, or entity that legally operates the facility)
BX=street_1_operator (First line of the street address of the operator of the facility)
BY=street_2_operator (Second line of the street address of the operator of the facility)
BZ=city_operator (Name of the city or town in which the facility's operator is located)
CA=state_operator (State or territory in which the facility's operator is located)
CB=zip_operator (Zip code in the address of the operator of the facility)
CC=phone_operator (Telephone number of the operator of the facility)
CD=control_auth_id (Control authority for enforcing pretreatment regulations)
CE=potw_id (NPDES ID of POTW that receives discharge (monitored by PPETS))
CF=hq01 (1st EPA Headquarters defined data field)
CG=dry_sludge_amount (Amount of sludge a facility produces in DMT/year, dry weight)
CH=sludge_class_ind (Classification assigned to facility producing sludge)
CI=sludge_cls_ind_tr (Translation of sludge_class_ind field)
CJ=sludge_fac_ind
CK=sludge_fac_ind_tr (Translation of the sludge_fac_ind field)
CL=industrial_cat_tr (Translation of industrial category code)
CM=facility_type_tr (Translation of facility type code)
CN=epa_id (EPA ID for facility)
CO=water_basin
CP=num_enforcement (Number of enforcement actions for this permit)
CQ=num_dmr_viol (Number of DMR measurement records with violations (effluent or non-reporting))
CR=num_inspection (Number of inspections of this facility)
CS=num_limit (Number of limit records with this record's NPDES ID)
CT=num_outfall (Number of outfalls regulated under this permit)
CU=num_single_viol (Number of single event violations for this permit)
CV=num_compsched_viol (Number of compliance schedule violations for this permit)
CW=num_nc_quarter (Number of quarter years that facility was in noncompliance)
CX=city (City in which facility is located (updated by EPA))
**
Fields in bold are empty in the data set so there is no way
to match columns with labels.
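Because the key is written in spreadsheet column letters, it helps to convert a letter label to a column position when checking a file outside a spreadsheet. The helper below is a small sketch of the standard letter-to-number conversion; the function name is hypothetical.

```python
def column_number(letters):
    """Convert a spreadsheet column label (A, Z, AA, CX, ...) to its 1-based position."""
    number = 0
    for ch in letters.upper():
        if not ("A" <= ch <= "Z"):
            raise ValueError(f"not a column label: {letters!r}")
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number

# Positions of the violation-related fields named in the key above.
for label in ("CP", "CQ", "CU", "CV", "CW", "CX"):
    print(label, column_number(label))
```

With this conversion, the last label in the key (CX) corresponds to position 102, which gives the expected number of fields per record when every column is present.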

Next Steps
We think one way to explore this case study is to follow up on this trail of discovery by turning attention away from somewhat sophisticated use by researchers to the problems of providing accessible data that can be used. Therefore, we suggest the case
We propose a project that will both educate and examine "right-to-know" (RTK)
consumer data that is now available on the Web. The Educating and Evaluating RTK
project (EE-RTK) would use students in assessing and using right-to-know data.
Not only would it provide valuable feedback on the use and misuse of the data,
it can also serve as a basis for developing the elements of a class built around
this subject.

Nine Database Quality Questions
This case study constitutes a good basis from which to answer the "Nine Database Quality Questions" which form the basis for review of PCS and other EPA databases. Our approach to the subject is as scientists. We believe that the level of accuracy requires an assumption of proof, both for attaining reasonable scientific findings and for the legal reasons that flow from scientific findings, especially those based on statistics. We will answer these questions from the perspective of an academic researcher using the data, one whose findings would therefore need to be sufficient to stand as expert testimony in a court case or as proof of a statistical relationship. We also assume that the data is publicly available, and our work began with a non-profit user of EPA data.
1.
Unknown. As a case study, comprehensiveness was antithetical to the scope of the research.
2.
Maybe. There are spatial variables in the database. However, it is unknown whether they are geographically exact enough to establish cause and effect. Is the address reported the site of an event, the site of the nearest post office, or the corporate headquarters filing the report? Likewise, do municipal variables refer to the location of the event or of the government office responding to the request? Furthermore, there are distinct state-by-state reporting characteristics that were found in this case that need to be addressed.
3.
No. Publicly available PCS data are inadequate for even constructing a time series, a key finding of our report. This data does exist, but the data on the Web, and thus publicly available, has only limited time series indications. This is a function of both funding and protection of business and privacy interests.
4.
Not enough. Time is distorted and space may be limited in the dataset.
5.
Absolutely. We were able to use PCS along with other data through facility reports provided, although these were not publicly available data sets.
6.
We did not investigate this.
7.
8.
Any Internet account with a search engine can find the data. We did not investigate ordering the data by phone in hard copy.
9.
Yes, but not very accessible.