
Data Science Practicum

Pollinating bee on AU campus, Washington, DC. Photo: Dylan Singleton.

The Data Science Practicum (DATA-793) is the capstone experience for the MS in Data Science and provides assistance to faculty, government organizations, and companies. 

Students entering the practicum have completed coursework in statistics, regression, and R for data science, along with completed or ongoing coursework in statistical machine learning. Practicum students are ready to put their visualization, analytics, and data modelling skills to work on live projects.


Please contact Practicum Coordinator, Maria Barouti, at

Call for Projects from Faculty & Staff

Let our advanced students help with your data:

  1. Provide your project title, description, required skills, and email on the Faculty Project form. The only requirement is that the projects use the students' data science skills. 
  2. Under the guidance of our faculty, students in the Data Science Practicum (DATA-793) or other advanced research courses review available projects for best fits and contact you.
  3. You and the student(s) agree on a plan for work on the project. Work can begin as early as January 2020.

Current Projects Now Inviting Student Participation

Please browse descriptions further below for details on each project:


Predicting Failure of Vehicle Components

Condition-based maintenance plus (CBM+) is a strategy that monitors the real-time condition of a vehicle in operation to determine what maintenance needs to be performed and to predict future maintenance needs or failure points. It differs from typical preventative maintenance strategies in that it is based on real-time data rather than on calendar- or mileage-based schedules. The purpose of this project is to analyze and explore Controller Area Network (CAN) data from multiple vehicles in order to develop predictive algorithms that begin to estimate the Remaining Useful Life (RUL) of the different platforms/vehicles.

Students will develop models that follow a data-driven approach as opposed to the standard physics-based approach, which focuses on exploiting known relationships between sensor signals. Students will work with anonymized data to identify signals of interest, normal patterns of operation, and deviations/abnormalities from those patterns indicating maintenance is due soon. 

Prerequisites: US citizenship; knowledge of Python and common packages (numpy, pandas, scikit-learn, scipy, etc.); an interest in finding value in seemingly random data patterns
Contact:  Edward Baumann,

Tackling the problem of learning Long-Tailed Distributions with Error Correcting Codes

In classification problems, long-tailed distributions correspond to multiclass problems for which some classes are represented by large numbers of samples, while others are represented by only a few. It is, in some sense, an extension of the classical class imbalance problem. A traditional ensemble method used to enhance the performance of multi-class classifiers uses Error Correcting Codes and was previously used to tackle class imbalanced situations. The purpose of this project is to investigate whether this approach is competitive in Deep Learning long-tailed problems. 
Prerequisites: Experience with Deep Learning models applied to image data; good programming skills in Python
Contact: Nathalie
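
To make the error-correcting idea concrete, here is a minimal sketch in plain Python. It assumes a hypothetical four-class problem in which each class is assigned a 7-bit codeword and each bit position stands for one binary classifier in the ensemble. The class names, the code matrix, and the classifier votes are all invented for illustration; the project itself may use a different coding scheme and would obtain the votes from trained deep models.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical code matrix: one 7-bit codeword per class. Each bit position
# corresponds to one binary classifier in the ensemble. Every pair of
# codewords differs in 4 positions, so a single wrong bit can be corrected.
CODEBOOK: Dict[str, Tuple[int, ...]] = {
    "head_class_a": (0, 0, 0, 0, 0, 0, 0),
    "head_class_b": (1, 1, 1, 1, 0, 0, 0),
    "tail_class_c": (0, 0, 1, 1, 1, 1, 0),
    "tail_class_d": (1, 0, 1, 0, 1, 0, 1),
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits: Sequence[int], codebook: Dict[str, Tuple[int, ...]] = CODEBOOK) -> str:
    """Return the class whose codeword is closest to the classifier votes."""
    return min(codebook, key=lambda label: hamming(codebook[label], bits))


# Assumed votes from the seven binary classifiers for one sample: the
# codeword of tail_class_c with its first bit flipped by one mistaken classifier.
noisy_votes = (1, 0, 1, 1, 1, 1, 0)
print(decode(noisy_votes))  # prints "tail_class_c"
```

Because decoding picks the nearest codeword, one mistaken binary classifier does not change the predicted class, which is the property that makes the approach a candidate for classes with few samples.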

Analysis of State Teacher Licensure Exam Pass Rates 

Our program participants are required to pass a set of teacher licensure exams for entry into and completion of our educator preparation program. The timing and type of exam required vary by state and by licensure area, and these state requirements have changed over the years. Urban Teachers collects licensure exam data from two external testing vendors and uploads them into our internal data platform. We are interested in: 1) an initial quality analysis and cleaning of our data; 2) an analysis of the number of attempts and pass rates, in order to understand if there are trends by participant demographics and type of exam; 3) the financial costs to our participants; and 4) the development of a report or dashboard that ensures alignment between each participant’s program of study, licensure area, clinical placement, and exams taken. This work can inform programmatic decisions on where to devote specific resources and supports as well as internal policy decisions on testing requirements.
Prerequisites: Knowledge of R and willingness to learn Power BI for the development of a dashboard.
Contact:  Viticia Thames,

Analyzing Teacher Attrition to gain insights for how to support Teacher Retention 

Teacher attrition from our teacher preparation program can happen either because a teacher was dismissed or because the teacher resigned from the program. We would like the reasons for program exits to be analyzed to determine whether the reasons for dismissals and resignations show trends that vary by participant demographic background. We would like insights from this project to generate data-informed strategies that can increase program teacher retention across all four of our teacher preparation sites in D.C., Dallas, Philadelphia, and Baltimore.
Prerequisites: Knowledge of R; covariance analysis by teacher demographic
Contact:  Viticia Thames,

Text mining on big data

Identifying entities of interest in text is an important precursor for many data mining, natural language processing and natural language understanding tasks. We are interested in processing large English text corpora to recognize text strings that denote named entities of interest such as diseases, symptoms, and laboratory measurements as well as personal information such as age, gender and occupations of patients.

Students will develop algorithmic solutions to recognize a particular set of named entities in a large corpus of text. A relevant set of metadata will be provided. Students can write their own algorithms or use an off-the-shelf tool of their choice to accomplish the task. If any training dataset is available for that particular task, a supervised machine learning approach can be taken, but the students might need to extend the size of the training data to increase the system’s predictive performance.
Prerequisites: Students must be competent in at least one programming language, preferably C, C++, Perl, or Python. Additional requirements are a good mathematical background and some background in natural language processing and machine learning methods.
Contact:  Dr. Mehmet Kayaalp,
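
As one illustration of the "write their own algorithms" route, the sketch below shows a simple dictionary (lexicon) lookup for named entities in plain Python. The lexicon entries, the label names, and the sample sentence are all invented for the example. A real system would draw its terms from the metadata provided for the project and would likely need to handle spelling variants and ambiguity.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lexicon mapping lowercase surface strings to entity types.
LEXICON: Dict[str, str] = {
    "diabetes": "DISEASE",
    "fever": "SYMPTOM",
    "hemoglobin a1c": "LAB_MEASUREMENT",
}


def find_entities(text: str, lexicon: Dict[str, str]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched_text, entity_type) for each lexicon hit."""
    # Try longer terms first so multi-word terms win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sample = "Patient with diabetes reports fever; Hemoglobin A1c was high."
for start, end, surface, label in find_entities(sample, LEXICON):
    print(start, end, surface, label)
```

A lookup like this can also serve as a baseline, or as a way to pre-label text when extending a training set for a supervised model.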

Metadata preprocessing and information extraction

To understand the content of any text, we need to identify entities that are represented in words and phrases. Resources that contain such entities in a semi-structured fashion are very helpful. The target audience of such a resource is usually either a human or a specific application. Some of these resources are not curated professionally but produced by volunteers, making the end product less coherent and noisier.

At NIH, we work on several metadata sets, which need to be preprocessed to make the metadata more coherent so that we can read them algorithmically and extract necessary information reliably. Students may choose to work either on biomedical metadata such as vocabularies in UMLS or on general metadata such as Wikidata along with text corpora such as PubMed Central and Wikipedia. Through this practicum, students will be working on big data, identifying and correcting errors, and transforming the data into a suitable format so that they can be digested by in-house applications.
Prerequisites: Students must be able to write programs to perform text processing and be competent in at least one programming language, preferably C, C++, Perl, or Python. Some background in natural language processing would be very valuable.
Contact:  Dr. Mehmet Kayaalp,

Annotation tool selection

The success of modern artificial intelligence (AI) systems partly depends on the availability of large annotated / labeled datasets. They are essential not only to train such systems but also to evaluate their output.

At NIH we build freely available AI applications using such annotated datasets. Unfortunately, our annotation tool is not easy for our users to use. We need to identify a capable annotation tool that is easy to use and available to our users free of charge.
There are a number of such annotation tools on the market. The students will be tasked with learning the annotation requirements of the ongoing project at NIH, searching through the available tools, evaluating them based on the requirements of the project, and selecting the most likely candidates.
Prerequisites: Students should be able to install and test annotation applications on their own and have access to a Windows or Linux computer.
Contact: Dr. Mehmet

Survey data from online psychology experiments

Mood disorders, such as depression, are a major cause of suffering worldwide. In order to improve scientific understanding of mood, our clinical research colleagues ran a series of experiments where they asked participants to continually rate their mood while playing a simple gambling game online. Participants were also asked to respond to a survey eliciting their daily happiness, ability to enjoy life, and their experience of the gambling game. From these surveys and the logs of the gambling game, we would like to explore possible associations between survey responses and mood dynamics during the gambling game, which could be of relevance to our scientific understanding of the causes or effects of mood. Students working on this project will gain experience in processing data collected from online surveys, in visualizing and analyzing survey data, and in formulating and testing statistical hypotheses.
Prerequisites: Statistics and/or machine learning; knowledge of Python (with numpy and pandas) or R
Contact: Charles Zheng,

Enhancing Children’s Resilience to Adversity in Puerto Rico

Supporting children’s resilience – their ability to bounce back in the face of adversity – is critical to their wellbeing, including by helping avoid long-term negative mental health outcomes. The goal of this project is to create child- and youth-friendly tools that highlight culturally relevant and available resources, practices, and tools for children and youth in Puerto Rico to build resilience and fill a critical gap. Advancing equity in mental health for children and youth in Puerto Rico is particularly pressing because of the significant exposure to natural disasters, the impact of COVID-19 in the community, and the limited access to resources to support and strengthen the mental health of children and youth.
Prerequisites: Statistics, Data Analysis, Visualization and Exploration, knowledge of R.
Contact: Dr. Maria Barouti, Dr. Richard Ressler,

The Inclusion Collaborative FMG project

The Inclusion Collaborative (IC) is a group of employees who serve as Diversity, Equity, and Inclusion (DE&I) liaisons at Fors Marsh Group (FMG). FMG seeks to conduct a process assessment and evaluation. This involves 1) creating a tracking system (e.g., a dashboard) for key pieces of information on program use (e.g., number of liaisons trained, number of requests received) and 2) administering pre- and post-surveys (with quantitative and qualitative data) to assess liaison satisfaction and performance and identify areas for improvement. This work presents the opportunity to analyze and communicate data in an applied setting, to work with a team, and to navigate a novel and ambiguous project. Time permitting, a companion effort involves collecting and analyzing data on employee satisfaction, engagement, and attitudes towards special topics or initiatives at FMG. Tasks may include data cleaning, analysis, and report and brief creation.
Prerequisites: Data Analysis and Visualization (use of Stata, R, and other related software).
Contact: Laura Severance,

Comparative analysis related to Wildfires in Conservation areas in Angola and Mozambique

Africa is often referred to as “The Fire Continent” because it endures catastrophic wildfires annually. The US Forest Service is interested in assessing two countries in southern Africa, Angola and Mozambique, to analyze the factors related to the causes, patterns, and aftermath of wildfires in these two countries. In developing this comparative analysis, the student will assess regions in both countries with similar landscapes, ecosystems, size, and other factors that can provide information on wildfires. This project will be used for future program implementation, specifically a regional fire program currently in development. The student may also have the opportunity to work with Forest Service colleagues.
Prerequisites: Knowledge of Remote Sensing, GIS, Regression, Data Analysis (such as R and other related programs), visualization.
Contact:  Michelle Zweede,

A Comparative Study of Litigation Data related to India’s Agricultural Land Laws

Agricultural land in India is heavily regulated. It is a state subject. State laws regulate who can be a farmer, what a farmer can do with the land, whom a farmer can sell the land to, how much land a farmer can own, and under what conditions the State can compel acquisition of farm land. Details vary from state to state. The students working on this project will web scrape the judgments and case filing data related to agricultural land laws from the central case law repository and undertake a comparative frequency analysis for various Indian states. The goal of this project is to explore how litigation frequency has varied both over time and with the kind and degree of restrictions on agricultural land, and to identify the most contested and litigated issues with respect to agricultural land. Students will apply a range of statistical and data science techniques, including web scraping, data tidying, visualization, and data exploration, to obtain and analyze legal case data.
Prerequisites: Knowledge of web scraping techniques is essential. In addition, knowledge of R (or Python) for data tidying, visualization and exploration; Statistics.
Contact: Prashant Narang (CCS, India) & Dr. Nimai Mehta (Math & Stat Department, AU),

The Adopt a Pixel Project

The Adopt a Pixel Project needs data scientists to build tools that will extract NASA satellite image areas and corresponding citizen science ground photos, and then associate them in a digital platform for analysis by citizen scientists. The project employs the high-profile Zooniverse citizen science platform.

Students will implement strategies and methods to automate the extraction, processing, and publishing of the data on the Zooniverse platform. The analyzed data will be used for further analysis and visualization in data dashboards, websites, and programmatic notebooks. The resulting data stream will support scientific tasks such as graphing, mapping, data reduction, and AI.

Contact: Peder Nelson, Oregon State University, College of Earth, Ocean, and Atmospheric Sciences. Peder is an instructor of geography and geospatial sciences, and the NASA GLOBE Observer Land Cover Science Lead.

Online Antisemitism Detection Using Multimodal Learning

Increasing cases of online antisemitism have become a major concern due to their socio-political consequences. Detecting online antisemitism poses multiple challenges, including the extraction of joint representations from multi-modal data (e.g., text and images). Students working on this project will work with different data fusion approaches based on latent variable analysis and deep neural networks in order to extract joint representations from multi-modal data so that online detection of antisemitism and knowledge discovery are achieved jointly.
Prerequisites: Machine Learning, good knowledge of Python
Contact: Nathalie Japkowicz and Zois

Exploring multi-task motor learning

How does our brain allow us to learn countless complex motor skills, from tying our shoes to hitting a tennis serve? How and why do previously learned skills affect how well we learn a new skill, and are such interactions reflected on the neuronal and memory level? Answers to these questions are not only important for basic neuroscience, but also for education, rehabilitation and the improvement of artificial neural networks for multi-task learning. We train rats in a high-throughput manner on multiple motor skills and track their performance and fine-grained movement trajectories over tens of thousands of trials. The goal of this project is to explore how learning of multiple motor skills differs from learning individual skills and how skills interact on the behavioral level under various training conditions. Students will explore our large performance and movement datasets to determine the relationship between learned skills and how it develops over the course of training. 
Prerequisites: Matlab and/or Python
Contact:  Steffen Wolff,

Detecting Storytelling Tactics in Text Using Machine Learning

Research in entertainment theory has shown that people are more effectively mobilized through engagement and entertainment than through overt and explicit persuasive arguments. Utilizing entertainment tactics - or storytelling tactics - has been shown to “transport” readers into the story world, ultimately leading to more effective communication campaigns. For this effort, Protagonist is interested in building a machine learning classifier to better identify the storytelling potential of media articles and social media campaigns. Protagonist will provide a list of storytelling features known to contribute to a “state of transportation” and practicum participants will create a model to classify text data as “transportive” based on the presence of those features in multilingual text-based data. Practicum students may also employ feature selection techniques to identify the strongest predictors of transportation in text data.
Prerequisites: Knowledge of machine learning, text mining, and Python
Contact: Becky Owens, Maria Barouti,

Statistical Consulting class: STAT 798

In this class, you will work with a statistics professor to learn new techniques, such as:
1) Determination of sample sizes and randomization of subjects to include in an experiment or survey
2) Design of survey sampling plans and optimal experiments
3) Choice of the proper statistical methods for studies and experiments
4) Transformation and import of data into desired statistical software packages
5) Interpretation of statistical analysis results for researchers
6) Assessment of the power of statistical tests (power analysis)
7) Software support for major statistical software packages: R, SPSS, SAS, STATA, etc.
8) Assistance with the statistics section of a research manuscript before submission to a journal

Permission of the Director of Data Science and the Director of the Statistical Consulting Center is required.
Prerequisites: If 1) you have a strong statistical background and 2) you would like to work on a variety of short and long projects, then you can join STAT 798!

Contact: Aleka

LH Dynamics in African and Asian Elephants

Both Asian and African elephants exhibit a unique hormone pattern during the follicular phase of the estrous cycle with two luteinizing hormone surges that occur approximately 3 weeks apart (double LH surge); the surges are indistinguishable, but only the second one induces ovulation. The goal of this project is to summarize decades of hormonal data and identify intra- and inter-elephant differences in various aspects of the double LH surge: time from the decline in progestagens during the luteal phase to the first LH surge; time between LH1 and LH2; and time from LH2 to the rise in progestagens after LH2-induced ovulation. Some elephants have even demonstrated three surges during the follicular phase, which also need to be documented and compared to the normal double LH surge.
Prerequisites: Data manipulation and visualization, knowledge of R.
Contact: Natalia Prado,

Characterizing Temporal Patterns in Longitudinal Prolactin Secretion in African Elephants

Normally cycling African elephant females have temporal patterns in prolactin secretion, while acyclic females with abnormal prolactin levels (too high or low) appear to lose this temporal pattern. The goals of this project are to describe normal temporal prolactin patterns in elephants and determine how acyclic females change or lose their prolactin temporal patterns. In doing so, we aim to better understand underlying causes for hyperprolactinemia, a reproductive disorder that is associated with ovarian acyclicity in African elephant females.
Prerequisites: Data manipulation and visualization, knowledge of R.
Contact: Natalia Prado,

Multivariate Analysis of Blueberry Flavor

Breeding programs historically focused on producer-favored traits such as crop yield, which inadvertently resulted in worse-tasting fruit. More recently, these programs have started to focus on consumer-favored traits such as flavor and texture. We have a dataset with consumer panel ratings for different blueberry varieties, along with measurements of these blueberries' chemical compositions. The goal of this project is to characterize the relationship between flavor perception and chemical composition using multivariate analysis approaches.
Prerequisites: Regression, Linear Algebra, knowledge of R.
Contact: David Gerard,

Understanding the effect of Deep Architectures on the Class Imbalance Problem

The purpose of this project is to study the behavior of deep learning systems in settings that have previously been deemed challenging to classical machine learning systems to find out whether the depth of the systems is an asset in such settings.
Prerequisites: CSC-680, Python, Scikit-Learn, TensorFlow Keras, Colab
Contact: Nathalie Japkowicz,

Automating Survey Data Processes

This project works with PRRI, a nonprofit, nonpartisan policy research organization, to develop automated processes for creating its topline and banner documents using open-source software. The automation process will include reading in data, performing recodes, weighting data, and producing formatted tables and crosstabs that are ready to be published. The deliverable for this project is a code file to produce high-quality survey topline and banner documents that can be easily customized to any dataset. Success in working with the PRRI research director on this project will result in practical resumé experience; this is a particularly good opportunity for someone interested in the survey research industry.
Prerequisites: Strong R skills, R Markdown, knowledge of working with survey data and survey weights.
Contact: Natalie Jackson,
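
For readers unfamiliar with weighted toplines, the core computation can be sketched in a few lines. The example below uses Python only for illustration (the project itself calls for R), and the function name, responses, and weights are invented. It assumes each respondent has one answer and one positive survey weight, and reports each answer's share of the total weight.

```python
from collections import defaultdict
from typing import Dict, Sequence


def weighted_topline(responses: Sequence[str], weights: Sequence[float]) -> Dict[str, float]:
    """Weighted percentage of respondents giving each answer."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    totals: Dict[str, float] = defaultdict(float)
    for answer, weight in zip(responses, weights):
        totals[answer] += weight
    grand_total = sum(totals.values())
    return {answer: 100.0 * total / grand_total for answer, total in totals.items()}


# Invented data: three respondents, the second counting double after weighting.
answers = ["yes", "no", "yes"]
survey_weights = [1.0, 2.0, 1.0]
print(weighted_topline(answers, survey_weights))  # {'yes': 50.0, 'no': 50.0}
```

Unweighted, "yes" would be reported as about 67 percent; with the weights applied it is 50 percent, which is why the weighting step has to come before any table is formatted.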

Tracing Policy through Congress

Legislative studies is often hampered by the necessity of observing policy changes and the preferences and behavior of members indirectly through coarse and heavily constrained measures like how members vote and what bills become law, both of which mask much of the negotiation around how policy is made. To get a better understanding of how certain policies become law, we are developing an approach to determining how individual policy provisions move through the legislative process in the US Congress by estimating the similarity between the text of sections of bills considered in Congress. Students will help 1) algorithmically and manually split the text of bills into sections, 2) code bill sections by policy topic, 3) compute the text similarity between sections, and 4) analyze the results.
Prerequisites: Data analysis (data manipulation, regression, working with text strings) in R or Python
Contact: Andrew Ballard,
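
The description does not say which similarity measure the project uses. As one common possibility, the sketch below computes cosine similarity between bag-of-words counts for two bill sections, using only the Python standard library. The function names and the sample sentences are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty section is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Invented section texts, standing in for provisions from two different bills.
section_1 = "The Secretary shall establish a grant program for rural broadband."
section_2 = "The Secretary may establish a loan program for rural broadband."
print(round(cosine_similarity(section_1, section_2), 3))
```

Scores near 1 suggest that a provision was carried from one bill into another with little change, while scores near 0 suggest unrelated text. Word counts ignore word order, so a production approach might compare word n-grams or aligned passages instead.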

Searching for Signatures of Elusive Stellar Coronal Mass Ejections from Young Suns Using X-ray Data

NASA’s Kepler and TESS missions have revealed frequent explosive events called superflares from many planet-hosting dwarf stars, providing a mechanism by which host stars may have profound effects on the physical and chemical evolution of exoplanetary atmospheres. Solar studies suggest that large flares are accompanied by fast ejection of coronal magnetized materials referred to as coronal mass ejections or CMEs. However, astronomers do not have reliable methods to detect and reveal these elusive stellar magnetic ejections from available data. My team and scientists from Penn State have collected large sets of data based on Chandra and XMM-Newton X-ray observations on hundreds of explosive events from very young stars resembling our Sun in its infancy. In this project, students will explore how statistical and machine learning techniques can be used to search for signatures of elusive CMEs. Students do not need to have any background in astronomy.
Prerequisites: Statistics, Machine Learning, knowledge of Python or IDL/Matlab.
Contact: Vladimir Airapetian, 

Seafood appearances on historical menus

The New York Public Library has compiled a database of menu items dating back to the 1840s. The goal of this project is to develop an automated approach for identifying and categorizing seafood dishes. The resulting categorization will be used to understand the change in seafood diversity and sourcing over time within this menu collection. Students working on this project will build on an initial training dataset to apply machine learning techniques for identifying and categorizing menu items.
Prerequisites: R; machine learning; text mining.
Contact: Jessica Gephart,

Estimating physical properties from videos with neural networks

Accurately estimating physical properties of objects from visual and multimedia inputs (e.g. stiffness, roughness, softness) is important for automatic scene understanding in everyday tasks in an AI system. This project aims to leverage human knowledge and physics to learn to estimate physical properties of objects in image/video using deep learning models, with a special emphasis on learning from limited data and with built-in uncertainty in the model.
Prerequisites: NumPy/SciPy/Python, Web programming, Basic machine learning, Linear Algebra, CSC476 Computer Vision is a plus.
Contact: Bei Xiao,

Examining the impact of a multicomponent nutrition education intervention program

The goal of the Healthy Schoolhouse 2.0 intervention is to prevent childhood obesity in a high-needs community in Washington, DC. This 5-year prospective study follows a pretest-posttest design and includes data collected from teachers, students, and schools in Wards 7 and 8. The intervention engages teachers as agents of change by implementing a structured professional development program to support the integration of nutrition concepts in the classroom. The study examines changes in pre-post survey assessments of students’ nutrition literacy, attitudes, and intent; changes in teachers’ self-efficacy toward teaching nutrition; and fruit and vegetable consumption data collected six times per year in the cafeteria. Cluster design effects arising from school assignment are accounted for using multilevel mixed modeling (MLM).
Prerequisites: Knowledge of R and SPSS, Regression
Contact: Melissa Hawkins,

Inclusion by Design

Minority inclusion is at the center of not only creating more equal societies but also democratic stability and preventing ethnic conflict. Yet scholars have found no solution to the knotty problem of measuring inclusion across countries. This problem limits our ability to learn how to design political institutions, such as the electoral system or federalism, to enhance minority inclusion more effectively. My solution to this problem centers on estimating minority electoral support for governing parties in legislatures (and for winning presidential candidates where the president serves more than a symbolic role). Using this information, one can also estimate the minority share of the government’s (and the president’s) electoral supporters. Towards that end, I am taking a multipronged approach to estimating voting behavior by different groups, relying on both ecological inference and polling data.
Prerequisites: I need students who are very comfortable (1) locating polling data and (2) getting properly weighted key descriptive statistics out of it. Appropriate skills in statistical packages are helpful. I have a grant to pay students over the summer who are interested.
Contact: David Lublin,

Application of Machine Learning on the Survival Analysis of Breast Cancer Patients

This project is an attempt to study the applications of machine learning techniques in Weka (clustering, classification, association rules, regression) for the survival analysis of breast cancer patients. The National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Public-Use Data (years 1977-2017) breast cancer data set will be used for all experiments, with R used for visualization. The aim of this project is to investigate the significance of prognostic factors such as age, ethnicity/race, tumor grade, and tumor size for the survival of breast cancer patients.
Prerequisites: Knowledge of R, Machine Learning Algorithms, Weka
Contact: Mehdi Owrang,

The Accountability Project

The Accountability Project was built as a tool to help researchers and newsrooms search across otherwise siloed data. We acquire, process and standardize hundreds of data sets, accounting for nearly 900 million records. We are seeking student researchers who are interested in either contributing data to the project or analyzing data within the project.
Prerequisites: We are software agnostic, but the work requires skills in basic data review, cleaning and analysis. Statistics not required, but rigorous data standards are. We're especially interested in folks who want to tell stories with data. Most of the team uses either R or SQL.
Contact: Jennifer LaFleur,

Methods for video-based behavior tracking

Methods for video-based behavior tracking have been of major interest in fields such as neuroscience over the past several years, and have widespread utility in many fields. These methods depend on recent developments in machine learning (e.g., deep learning and algorithms for classification and regression using high-dimensional data) and computer hardware (e.g., GPUs, single-board computers). Projects are available to explore the parameter spaces of currently popular methods (e.g., DeepLabCut, SimBA, B-SOiD), develop a database of recordings across a range of behaviors that are commonly used in the field of neuroscience, and develop training materials for teaching novice users to carry out video analysis. Projects would be done in collaboration with members of the Laubach Laboratory in the Department of Neuroscience at AU and an international team of researchers through the OpenBehavior project.
Prerequisites: Basic coding skills in R and scientific Python; familiarity with Jupyter and Colab notebooks.
Contact:  Mark

Trustworthy Machine Learning

The deployment of machine learning in real-world systems is growing faster than many had predicted. Today, organizations across different industries are increasingly using machine learning to augment human decision making, reduce costs, and enhance productivity. Recent research has shown that malicious actors can use modified input data to make a machine learning algorithm behave in unexpected ways. For example, researchers have shown that they can trick the machine learning based computer vision algorithms designed for self-driving cars into mistaking stop signs for speed limit signs. This calls for technologies that ensure machine learning is trustworthy, so that we can rely on it to produce reliable outputs. Students working on this project will study and investigate the trustworthiness of machine learning models.
Prerequisites: A basic understanding of machine learning. Python required. Computer vision is a plus.
Contact:  Leah Ding,

Data-driven material estimation from images 

We are interested in estimating 3D shape, material attributes, and classes from photographs of translucent materials. The project involves assisting the PI with a large image dataset of translucent objects and using unsupervised learning and representation learning techniques to learn image features that are useful for teasing apart the causal factors (geometry, lighting, and optical properties) that influence material properties in images. In addition, the project requires building crowd-sourcing experiments to collect human annotations from online platforms such as Amazon Mechanical Turk and comparing the data with outputs from the machine learning models.
Prerequisites: Python (NumPy, PyTorch), Basic Machine Learning, Deep Learning, Statistics.
Contact:  Bei Xiao,

Cardiovascular Risk Factor Prevention among Formerly Incarcerated African Americans

Mixed methodological study.
Prerequisites: SPSS, Qualtrics knowledge
Contact: Ebony Russ,