Data Science Practicum

The Data Science Practicum (DATA-793) is the capstone experience for the MS in Data Science and provides assistance to faculty and staff across the university. 

Students entering the practicum have completed coursework in statistics, regression, and R for data science, along with completed or ongoing coursework in statistical machine learning. Practicum students are ready to put their visualization, analytics, and data modelling skills to work on live projects.

Call for Faculty & Staff Projects

Let our advanced students help with your data:

  1. Provide your project title, description, required skills, and email on the Faculty Project form. The only requirement is that the projects use the students' data science skills.
  2. Under the guidance of our faculty, students in the Data Science Practicum (DATA-793) or other advanced research courses review the available projects for the best fit and contact you.
  3. You and the student(s) agree on a work plan for the project. Work can begin as early as January 2020.

Multivariate Analysis of Blueberry Flavor

Breeding programs historically focused on producer-favored traits such as crop yield, which inadvertently resulted in worse tasting fruit. More recently, these programs have started to focus on consumer-favored traits such as flavor and texture. We have a dataset with consumer panel ratings for different blueberry varieties, along with measurements of these blueberries' chemical compositions. The goal of this project is to characterize the relationship between flavor perception and chemical composition using multivariate analysis approaches.
Prerequisites: Regression, Linear Algebra, knowledge of R.
Contact: David Gerard, dgerard@american.edu

Understanding the effect of Deep Architectures on the Class Imbalance Problem

The purpose of this project is to study the behavior of deep learning systems in settings that have previously been deemed challenging to classical machine learning systems to find out whether the depth of the systems is an asset in such settings.
Prerequisites: CSC-680, Python, Scikit-Learn, TensorFlow Keras, Colab
Contact: Nathalie Japkowicz, japkowic@american.edu

Automating Survey Data Processes

This project is working with PRRI, a nonprofit, nonpartisan policy research organization, to develop automated processes for creating our topline and banner documents using open-source software. The automation process will include reading in data, performing recodes, weighting data, and producing formatted tables and crosstabs that are ready to be published. The deliverable for this project is a code file to produce high-quality survey topline and banner documents that can be easily customized to any dataset. Success in working with the PRRI research director on this project will result in practical resumé experience; this is a particularly good opportunity for someone interested in the survey research industry.
Prerequisites: Strong R skills, R Markdown, knowledge of working with survey data and survey weights.
Contact: Natalie Jackson, njackson@american.edu

Tracing Policy through Congress

Research in legislative studies is often hampered by the need to observe policy changes, and the preferences and behavior of members, only indirectly through coarse and heavily constrained measures such as how members vote and which bills become law. Both measures mask much of the negotiation around how policy is made. To better understand how certain policies become law, we are developing an approach for determining how individual policy provisions move through the legislative process in the US Congress by estimating the similarity between the text of sections of bills considered in Congress. Students will help 1) algorithmically and manually split the text of bills into sections, 2) code bill sections by policy topic, 3) compute the text similarity between sections, and 4) analyze the results.
Prerequisites: Data analysis (data manipulation, regression, working with text strings) in R or Python
Contact: Andrew Ballard, aballard@american.edu
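
As a rough illustration of step 3 above, the similarity between two bill sections can be sketched with a bag-of-words cosine similarity. This is only a minimal sketch and not necessarily the method the project uses: the function names, the tokenization rule, and the sample section texts are all assumptions made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words that appear in both texts.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a section with no words is treated as dissimilar
    return dot / (norm_a * norm_b)


def most_similar(name, sections):
    """Return the key of the section most similar to sections[name]."""
    others = (k for k in sections if k != name)
    return max(others, key=lambda k: cosine_similarity(sections[name], sections[k]))


# Hypothetical bill sections, invented for illustration only.
sections = {
    "bill_a_sec_1": "The Secretary shall establish a grant program for rural broadband.",
    "bill_b_sec_4": "The Secretary shall establish a grant program for rural broadband access.",
    "bill_c_sec_2": "This Act may be cited as the Clean Water Act.",
}

if __name__ == "__main__":
    print(most_similar("bill_a_sec_1", sections))
```

Word counts ignore word order and synonyms, so a real pipeline would likely need something more robust (for example, n-grams or weighting of rare terms); the sketch only shows the shape of the computation.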

Comparing demographic profiles of Democratic and Republican donors

This project will build a dataset of federal-level donors to ActBlue and WinRed, the two major fundraising platforms for the Democratic and Republican parties, respectively. Using FEC records (especially by geocoding the addresses on file), we will infer demographic characteristics of Democratic and Republican donors and check for systematic differences in the distributions of race, gender, income, occupation, and geographic location. In particular, we will check whether, within each platform, the distributions differ between small and large donors.
Prerequisites: Knowledge of and prior experience with R
Contact: Silvia Kim, sskim@american.edu

Donors to 2020 Democratic primaries

This project aims to identify how donors to Democratic presidential primaries behaved. The question is as follows: as the perceived electability of candidates shifted, did donors shift the target of their giving? We will work with the FEC dataset for the 2020 primary and general elections, identify donors to multiple candidates, and analyze their patterns of giving.
Prerequisites: Knowledge of and prior experience with R
Contact: Silvia Kim, sskim@american.edu

Searching for Signatures of Elusive Stellar Coronal Mass Ejections from Young Suns Using X-ray Data

NASA’s Kepler and TESS missions have revealed frequent explosive events called superflares from many planet-hosting dwarf stars, providing a mechanism by which host stars may have profound effects on the physical and chemical evolution of exoplanetary atmospheres. Solar studies suggest that large flares are accompanied by fast ejections of coronal magnetized material, referred to as coronal mass ejections (CMEs). However, astronomers do not have reliable methods to detect these elusive stellar magnetic ejections in available data. My team and scientists from Penn State have collected large datasets based on Chandra and XMM-Newton X-ray observations of hundreds of explosive events from very young stars resembling our Sun in its infancy. In this project, students will explore how statistical and machine learning techniques can be used to search for signatures of elusive CMEs. Students do not need to have any background in astronomy.
Prerequisites: Statistics, Machine Learning, knowledge of Python or IDL/Matlab.
Contact: Vladimir Airapetian, vladimir.airapetian@nasa.gov

Prediction of molecular properties using machine learning techniques

Due to its high computational speed and accuracy compared to ab-initio quantum chemistry and forcefield modeling, the prediction of molecular properties using machine learning has received great attention in the fields of materials design and drug discovery. In this project, students will use a data fusion framework that is based on Independent Vector Analysis to exploit underlying complementary information contained in different molecular featurization methods. This information will then be used to enhance the prediction ability of a regression model as well as to discover relationships between different molecular structures and properties. Students do not need to have any background in chemistry.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact: Dr. Zois Boukouvalas, boukouva@american.edu.

Data collection, pre-processing, and visualization for understanding the spread of misinformation in social media

Due to the wide use of online media, false information can spread rapidly, affecting decision making, cooperation, communications, and markets. Modern social technologies can transmit massive amounts of information quickly, which enables the spread of misinformation (inaccurate or misleading information). A crucial question is therefore how true and false information diffuse and how they correlate with each other. In this project, students are expected to collect and pre-process data from social media, news sites, and RSS feeds and to apply different data visualization techniques in order to identify how false information diffuses and how it correlates with true information.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact: Dr. Zois Boukouvalas, boukouva@american.edu.

Extracting chemical insights from energetic materials using Natural Language Processing (NLP) techniques

The number of scientific journal articles and reports being published about energetic materials every year is growing exponentially, and therefore extracting relevant information and actionable insights from the latest research is becoming a considerable challenge. In this project, students will explore how techniques from natural language processing and machine learning can be used to automatically extract chemical insights from large collections of documents. Students do not need to have any background in chemistry.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact: Dr. Zois Boukouvalas, boukouva@american.edu.

Knowledge discovery and detection of misinformation on social media during high impact events

With the evolution of various social media technologies, there has been a fundamental change in how information propagates and is shared on the Internet and microblogs. During a high impact event (e.g., a hurricane, terror attack, or stock market crash), social media users can be thought of as generative functions that output network posts. These posts then propagate through the social network, enabling the rapid spread of misinformation, which can affect decision making, communications, and markets. Students working on this project will use a data-driven approach based on latent variable analysis to extract information from data so that early detection of misinformation and knowledge discovery during a high impact event are achieved jointly.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact: Dr. Zois Boukouvalas, boukouva@american.edu.

Seafood appearances on historical menus

The New York Public Library has compiled a database of menu items dating back to the 1840s. The goal of this project is to develop an automated approach for identifying and categorizing seafood dishes. The resulting categorization will be used to understand the change in seafood diversity and sourcing over time within this menu collection. Students working on this project will build off of an initial training dataset to apply machine learning techniques for identifying and categorizing menu items.
Prerequisites: R; machine learning; text mining.
Contact: Jessica Gephart, jgephart@american.edu.

Estimating physical properties from videos with neural networks

Accurately estimating physical properties of objects from visual and multimedia inputs (e.g. stiffness, roughness, softness) is important for automatic scene understanding in everyday tasks in an AI system. This project aims to leverage human knowledge and physics to learn to estimate physical properties of objects in image/video using deep learning models, with a special emphasis on learning from limited data and with built-in uncertainty in the model.
Prerequisites: NumPy/SciPy/Python, Web programming, Basic machine learning, Linear Algebra; CSC476 Computer Vision is a plus.
Contact: Bei Xiao, bxiao@american.edu.

Examining the impact of a multicomponent nutrition education intervention program

The goal of the Healthy Schoolhouse 2.0 intervention is to prevent childhood obesity in a high-needs community in Washington, DC. This 5-year prospective study follows a pretest-posttest design and includes data collected from teachers, students, and schools in Wards 7 and 8. The intervention engages teachers as agents of change by implementing a structured professional development program to support the integration of nutrition concepts in the classroom. The study examines changes in pre-post survey assessments of students’ nutrition literacy, attitudes, and intent; changes in teachers’ self-efficacy toward teaching nutrition; and fruit and vegetable consumption data collected 6 times per year in the cafeteria. Cluster design effects arising from school assignment are accounted for using multilevel mixed modeling (MLM).
Prerequisites: Knowledge of R and SPSS, Regression
Contact: Melissa Hawkins, mhawkins@american.edu.

Inclusion by Design

Minority inclusion is central not only to creating more equal societies but also to democratic stability and the prevention of ethnic conflict. Yet scholars have found no solution to the knotty problem of measuring inclusion across countries. This problem limits our ability to learn how to design political institutions, such as the electoral system or federalism, to enhance minority inclusion more effectively. My solution to this problem centers on estimating minority electoral support for governing parties in legislatures (and for winning presidential candidates where the president serves more than a symbolic role). Using this information, one can also estimate the minority share of the government’s (and the president’s) electoral supporters. Toward that end, I am taking a multipronged approach to estimating voting behavior by different groups, relying on both ecological inference and polling data.
Prerequisites: I need students who are very comfortable (1) locating polling data and (2) getting properly weighted key descriptive statistics out of it. Appropriate skills in statistical packages are helpful. I have a grant to pay interested students over the summer.
Contact: David Lublin, dlublin@american.edu.

Application of Machine Learning on the Survival Analysis of Breast Cancer Patients

This project studies the application of machine learning techniques in Weka (clustering, classification, association rules, and regression) to the survival analysis of breast cancer patients. The National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Public-Use Data breast cancer data set (years 1977-2017) will be used for all experiments, and R will be used for visualization. The aim of this project is to investigate the significance of prognostic factors such as age, ethnicity/race, tumor grade, and tumor size for the survival of breast cancer patients.
Prerequisites: Knowledge of R, Machine Learning Algorithms, Weka
Contact: Mehdi Owrang, owrang@american.edu.

The Accountability Project

The Accountability Project was built as a tool to help researchers and newsrooms search across otherwise siloed data. We acquire, process and standardize hundreds of data sets, accounting for nearly 900 million records. We are seeking student researchers who are interested in either contributing data to the project or analyzing data within the project.
Prerequisites: We are software agnostic, but the work requires skills in basic data review, cleaning, and analysis. Statistics are not required, but rigorous data standards are. We're especially interested in folks who want to tell stories with data. Most of the team uses either R or SQL.
Contact: Jennifer LaFleur, lafleur@american.edu.

Methods for video-based behavior tracking

Methods for video-based behavior tracking have been of major interest in fields such as neuroscience over the past several years and have widespread utility in many other fields. These methods depend on recent developments in machine learning (e.g., deep learning and algorithms for classification and regression using high-dimensional data) and computer hardware (e.g., GPUs, single-board computers). Projects are available to explore the parameter spaces of currently popular methods (e.g., DeepLabCut, SimBA, B-SOiD), to develop a database of recordings across a range of behaviors that are commonly used in the field of neuroscience, and to develop training materials for teaching novice users to carry out video analysis. Projects would be done in collaboration with members of the Laubach Laboratory in the Department of Neuroscience at AU and an international team of researchers through the OpenBehavior project.
Prerequisites: Basic coding skills in R and scientific Python; familiarity with Jupyter and Colab notebooks.
Contact: Mark Laubach, mark.laubach@american.edu.

Transportation and Air Quality using Machine Learning techniques

This project is an attempt to help the public understand the effects of their transportation emissions on air quality and health, as well as actions that can be taken to reduce these effects. The aim of this project is to develop a digital tool for monitoring and analyzing traffic patterns, air pollution, and weather in the D.C. area. Students will use large datasets to understand the significance of transportation emissions, such as SO2, CO2, and NOx, and their impact on air quality. Moreover, they will identify underlying associations between air pollutant concentration and traffic. Through the Census Bureau Opportunity Project (TOP), students will meet the problem statement leaders from EPA and learn more about the transportation and air quality problem statement, any relevant data, and how this challenge affects communities.
Prerequisites: Knowledge of R, Regression.
Contact: Maria Barouti, barouti@american.edu

Trustworthy Machine Learning

The deployment of machine learning in real-world systems is growing faster than many had predicted. Today, organizations across different industries are increasingly using machine learning to augment human decision making, reduce costs, and enhance productivity. Recent research has shown that malicious actors can use modified input data to make a machine learning algorithm behave in unexpected ways. For example, researchers have shown that they can trick the machine learning based computer vision algorithms designed for self-driving cars into mistaking stop signs for speed limit signs. This calls for technologies that ensure machine learning is trustworthy, so that we can rely on it to produce reliable outputs. Students working on this project will study and investigate the trustworthiness of machine learning models.
Prerequisites: Basic understanding of machine learning. Python required. Computer vision is a plus.
Contact: Leah Ding, ding@american.edu

Data-driven material estimation from images 

We are interested in estimating 3D shape, material attributes, and classes from photographs of translucent materials. The project involves assisting the PI with a large image dataset of translucent objects and using unsupervised learning and representation learning techniques to learn image features that are useful for teasing apart the causal factors (geometry, lighting, and optical properties) that influence material properties in images. In addition, the project requires building crowd-sourcing experiments to collect human annotations from online platforms such as Amazon Mechanical Turk and comparing the data with outputs from the machine learning models.
Prerequisites: Python (NumPy, PyTorch), Basic Machine Learning, Deep Learning, Statistics.
Contact: Bei Xiao, bxiao@american.edu

Cardiovascular Risk Factor Prevention among Formerly Incarcerated African Americans

Mixed methodological study.
Prerequisites: SPSS, Qualtrics knowledge
Contact: Ebony Russ, eruss@american.edu