Scraping Data for Use by Investigative Journalists
“Liberating” data from publicly available websites, and then making that information accessible for important journalistic research, may be the future of investigative reporting.
The ScraperWiki Journalism Data Camp, held March 30–31 at the Washington Post, was a natural outgrowth of the Post’s growing partnership with AU’s School of Communication. That partnership led the media company to cosponsor the event, and it gives AU students the benefit of working with Post data experts.
Exemplifying the university-wide nature of the matchup was the winning performance of College of Arts and Sciences computer science students in the event’s Data Derby contest.
“The future of journalism is going to be data-based; there’s so much data out there but not necessarily in a format that’s easily told in a story,” said Jolie Lee, a graduate student in interactive journalism and one of several students from SOC professor Lynne Perri’s Online News Production class. “This is a great way to bridge the gap between the computer programmers and the storytellers.”
The event attracted 100 participants from throughout the region, including a half dozen computer science students from AU’s College of Arts and Sciences and 20 SOC interactive journalism graduate students.
“Why are my students here?” said David Johnson, who teaches courses in digital journalism at SOC. “To learn how to do it for themselves.”
Before the data crunching began in earnest, ScraperWiki participants heard from the founders of the data-freeing organization leading the camp, as well as heavy hitters from the Post and AU’s School of Communication.
“The conference actually grew out of our partnership with American University,” said Washington Post local editor Vernon Loeb. “It really is a win-win situation for us. I know that’s a cliché, but it’s true in this case. We benefit from AU students coming down here and working with us, and I think that they really benefit from getting real-world experience that only we can give them.”
Chuck Lewis, a professor in AU’s School of Communication and a nationally known investigative reporter who is the founder of several important nonprofit organizations, including the Center for Public Integrity, echoed the importance of the Post partnership. Lewis is also founding executive editor of SOC’s Investigative Reporting Workshop.
“We at the Investigative Reporting Workshop . . . have an exciting relationship with the Washington Post,” said Lewis, who added he knows of no other partnership quite like it. “In this age of collaboration, which is the spirit of ScraperWiki in this whole process, we have seven students from the School of Communication right now working at the Washington Post while they’re students. And since September they’ve written 200 stories that are in the Washington Post.”
What’s a ScraperWiki?
ScraperWiki was born when organization cofounder Julian Todd discovered that learning how members of Parliament in his native England voted on issues was a maddeningly difficult exercise.
ScraperWiki gathers scattered data from the Internet — from crime statistics to campaign finance spending — and coders put that information in forms that journalists and researchers can use. The organization makes the tools to gather data available online, for free, and encourages collaboration.
The data camp offered sessions on using the Python programming language to liberate data from websites, with instruction from Todd and developer advocate Thomas Levine.
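The article doesn’t reproduce the camp’s exercises, but the kind of task those sessions cover can be sketched in a few lines of Python: pulling the rows out of an HTML table using only the standard library. The table here is inline sample data (a real scraper would first fetch the page, for instance with `urllib.request`).

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a cell of the current row.
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Made-up sample page, echoing Todd's original problem of
# tracking how members of Parliament voted.
html = """
<table>
  <tr><th>Member</th><th>Vote</th></tr>
  <tr><td>Smith</td><td>Aye</td></tr>
  <tr><td>Jones</td><td>No</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)
# [['Member', 'Vote'], ['Smith', 'Aye'], ['Jones', 'No']]
```

Once the rows are plain Python lists, they can be written to a spreadsheet or database and compared with other datasets, which is the point of the exercise.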
ScraperWiki won a Knight Foundation grant of about $300,000, in part to spread scraping skills across the country. The organization is partnering with top journalism schools — its first academic partner was Columbia University — and major news organizations.
Washington, D.C., was a natural stop, and the organization connected with American University through Chuck Lewis and others at SOC. Sharon Metcalf, SOC’s senior director of strategic partnerships and programs, quickly saw that the Washington Post was the perfect partner to put on the event, and the Post just as quickly agreed.
The two-day workshop featured a friendly competition of teams tackling the problems of gathering data from websites and putting that data into a form that could be manipulated and compared with other data.
Students worked on a wide range of topics. SOC grad students Dickson Mercer and Travis Pratt, for example, worked on a project to analyze civil and criminal violations of off-shore drilling regulations. Mercer noted that the workshop was useful for learning how to organize data: “When you have an idea about something, the information is there [on the web], but it is hard to make sense of it.”
AU student Zach Allaun ’14, who is switching to a computer science major, teamed up with Max Richman, a data analyst at InterMedia, a nonprofit research firm, to take on the problem of gathering test scores from Montgomery County, Maryland, public schools with the aim of making an easily accessible site to compare that web-scattered data.
The challenge was gathering and formatting data from PDF files, which require much more ingenuity to extract data from than standard HTML tables. Allaun used a tool developed by ScraperWiki’s Julian Todd, but found that he had to fix two bugs in the tool to proceed.
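The reason PDFs demand more ingenuity is that they store positioned text fragments rather than table markup, so a scraper must reassemble the rows itself. The snippet below is a toy illustration of that idea, not Todd’s actual tool: the fragment data is made up (a real scraper would read coordinates out of the PDF with a library such as pdfminer), and fragments are clustered into rows by nearby y-coordinates, then sorted left to right.

```python
# Each fragment: (x, y, text), with y measured down the page.
# Note the y values are slightly misaligned, as they often are in PDFs.
fragments = [
    (200, 71, "82"),
    (50, 70, "School A"),
    (120, 70, "Grade 3"),
    (50, 90, "School B"),
    (200, 90, "77"),
    (120, 91, "Grade 3"),
]

def group_rows(fragments, tolerance=3):
    """Cluster fragments into rows by y-coordinate, then order each row by x."""
    rows = []
    for frag in sorted(fragments, key=lambda f: f[1]):
        # Start a new row unless this fragment sits close to the current one.
        if rows and abs(frag[1] - rows[-1][0][1]) <= tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Sorting each row's (x, y, text) tuples orders cells left to right.
    return [[text for _, _, text in sorted(row)] for row in rows]

print(group_rows(fragments))
# [['School A', 'Grade 3', '82'], ['School B', 'Grade 3', '77']]
```

An HTML table hands the scraper this row structure for free; with a PDF, even this small reconstruction step is the programmer’s job, which is why bugs like the ones Allaun hit are common in PDF-extraction tools.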
Allaun’s performance impressed Todd. The AU student, whose team won first place, will continue working with Todd to refine the PDF-extracting tool. Allaun’s computer science classmate Josh Foster ’14 was on the team that captured second place.
AU computer science professor Serge Kruk summed up the spirit of the event, saying he plans to continue the collaboration between CAS computer science students and journalism students at SOC.
Other event sponsors were the Knight and Sunlight foundations, the Associated Press, and SOC’s J-Lab: The Institute for Interactive Journalism.