
Data Science Practicum

The Data Science Practicum (DATA-793) is the capstone experience for the MS in Data Science and provides assistance to faculty and staff across the university. 

Students entering the practicum have completed coursework in statistics, regression, and R for data science, along with completed or ongoing coursework in statistical machine learning. Practicum students are ready to put their visualization, analytics, and data modelling skills to work on live projects.

Call for Faculty & Staff Projects

Let our advanced students help with your data:

  1. Provide your project title, description, required skills, and email on the Faculty Project form. The only requirement is that the projects use the students' data science skills. 
  2. Under the guidance of our faculty, students in the Data Science Practicum (DATA-793) or other advanced research courses review available projects for best fits and contact you.
  3. You and the student(s) agree on a plan for work on the project. Work can begin as early as January 2020.

Prediction of molecular properties using machine learning techniques

The prediction of molecular properties using machine learning has received great attention in materials design and drug discovery because of its high computational speed and accuracy compared to ab initio quantum chemistry and force-field modeling. In this project, students will use a data fusion framework based on Independent Vector Analysis to exploit the complementary information contained in different molecular featurization methods. This information will then be used to enhance the predictive ability of a regression model and to discover relationships between different molecular structures and properties. Students do not need to have any background in chemistry.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact Dr. Zois Boukouvalas

Data collection, pre-processing, and visualization for understanding the spread of misinformation in social media

Because online media are so widely used, false information can spread rapidly, affecting decision making, cooperation, communications, and markets. Modern social technologies can disseminate massive amounts of information quickly, which enables the spread of misinformation (inaccurate or misleading information). A crucial question, therefore, is how true and false information diffuse and how they correlate with each other. In this project, students are expected to collect and pre-process data from social media, news sites, and RSS feeds and to apply different data visualization techniques in order to identify how false information diffuses and how it correlates with true information.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact Dr. Zois Boukouvalas

Extracting chemical insights from energetic materials using Natural Language Processing (NLP) techniques

The number of scientific journal articles and reports being published about energetic materials every year is growing exponentially, and therefore extracting relevant information and actionable insights from the latest research is becoming a considerable challenge. In this project, students will explore how techniques from natural language processing and machine learning can be used to automatically extract chemical insights from large collections of documents. Students do not need to have any background in chemistry.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact Dr. Zois Boukouvalas

Knowledge discovery and detection of misinformation on social media during high impact events

With the evolution of various social media technologies, there has been a fundamental change in how information propagates and is shared on the Internet and microblogs. During a high impact event (e.g., a hurricane, terror attack, or stock market crash), social media users can be thought of as generative functions that output network posts. These posts then propagate through the social network, enabling the rapid spread of misinformation, which can affect decision making, communications, and markets. Students on this project will use a data-driven approach based on latent variable analysis to extract information from data, so that early detection of misinformation and knowledge discovery during a high impact event are achieved jointly.
Prerequisites: Regression, Machine Learning, knowledge of R, Python, or Matlab.
Contact Dr. Zois Boukouvalas

Seafood appearances on historical menus

The New York Public Library has compiled a database of menu items dating back to the 1840s. The goal of this project is to develop an automated approach for identifying and categorizing seafood dishes. The resulting categorization will be used to understand how seafood diversity and sourcing have changed over time within this menu collection. Students working on this project will build on an initial training dataset, applying machine learning techniques to identify and categorize menu items.
Prerequisites: R; machine learning; text mining.
Contact Jessica Gephart

Estimating physical properties from videos with neural networks

Accurately estimating physical properties of objects (e.g., stiffness, roughness, softness) from visual and multimedia inputs is important for automatic scene understanding in the everyday tasks of an AI system. This project aims to leverage human knowledge and physics to learn to estimate physical properties of objects in images and video using deep learning models, with a special emphasis on learning from limited data and on building uncertainty into the model.
Prerequisites: NumPy/SciPy/Python, web programming, basic machine learning, linear algebra; CSC476 Computer Vision is a plus.
Contact Bei Xiao

Examining the impact of a multicomponent nutrition education intervention program

The goal of the Healthy Schoolhouse 2.0 intervention is to prevent childhood obesity in a high-needs community in Washington DC. This 5-year prospective study follows a pretest-posttest design and includes data collected from teachers, students, and schools in Wards 7 and 8. The intervention engages teachers as agents of change by implementing a structured professional development program to support the integration of nutrition concepts in the classroom. The study examines changes in pre-post survey assessments of students’ nutrition literacy, attitudes, and intent; changes in teachers’ self-efficacy toward teaching nutrition; and fruit and vegetable consumption data collected six times per year in the cafeteria. Cluster design effects arising from school assignment are accounted for using multilevel mixed modeling (MLM).
Prerequisites: Knowledge of R and SPSS, Regression
Contact Melissa Hawkins
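
As a rough illustration of why cluster design effects matter when students are grouped within schools, the sketch below computes the commonly used Kish approximation, design effect = 1 + (m - 1) * ICC, and the resulting effective sample size. This is not the study's method (the project uses multilevel mixed modeling); the sample size, cluster size, and intraclass correlation are hypothetical values chosen only for the example.

```python
def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Kish approximation: variance inflation caused by clustering."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def effective_sample_size(n: int, avg_cluster_size: float, icc: float) -> float:
    """Number of independent observations that n clustered observations are worth."""
    return n / design_effect(avg_cluster_size, icc)


# Hypothetical inputs: 500 students, about 25 per classroom, ICC of 0.05.
deff = design_effect(avg_cluster_size=25, icc=0.05)
n_eff = effective_sample_size(n=500, avg_cluster_size=25, icc=0.05)

print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.1f}")
```

Even a small intraclass correlation can shrink the effective sample size substantially, which is one reason a multilevel model is preferred over treating all students as independent.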

Inclusion by Design

Minority inclusion is central not only to creating more equal societies but also to maintaining democratic stability and preventing ethnic conflict. Yet scholars have found no solution to the knotty problem of measuring inclusion across countries. This problem limits our ability to learn how to design political institutions, such as the electoral system or federalism, to enhance minority inclusion more effectively. My solution to this problem centers on estimating minority electoral support for governing parties in legislatures (and for winning presidential candidates where the president serves more than a symbolic role). Using this information, one can also estimate the minority share of the government’s (and the president’s) electoral supporters. Towards that end, I am taking a multipronged approach to estimating voting behavior by different groups, relying on both ecological inference and polling data.
Prerequisites: I need students who are very comfortable (1) locating polling data and (2) getting properly weighted key descriptive statistics out of it. Appropriate skills in statistical packages are helpful. I have a grant to pay interested students over the summer.
Contact: David Lublin
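
To give a sense of what "properly weighted" descriptive statistics can look like, here is a minimal sketch that computes weighted support for the governing party within each group, and the minority share of the government's supporters, from survey rows. The record layout (group, supports_government, weight) and the numbers are hypothetical; real polling files will have their own variable names and weighting schemes.

```python
from collections import defaultdict


def weighted_support_by_group(respondents):
    """Weighted share of each group that supports the governing party."""
    total = defaultdict(float)
    support = defaultdict(float)
    for group, supports_government, weight in respondents:
        total[group] += weight
        if supports_government:
            support[group] += weight
    return {group: support[group] / total[group] for group in total}


def group_share_of_supporters(respondents, group):
    """Weighted share of all government supporters who belong to `group`."""
    supporters = [(g, w) for g, s, w in respondents if s]
    all_weight = sum(w for _, w in supporters)
    group_weight = sum(w for g, w in supporters if g == group)
    return group_weight / all_weight


# Hypothetical survey rows: (group, supports_government, survey weight).
poll = [
    ("minority", True, 1.5),
    ("minority", False, 0.5),
    ("majority", True, 1.0),
    ("majority", False, 1.0),
    ("majority", True, 2.0),
]

support = weighted_support_by_group(poll)
minority_share = group_share_of_supporters(poll, "minority")
print(support)
print(round(minority_share, 3))
```

Using the weights rather than raw counts matters whenever some respondents stand in for more of the population than others, as is typical in polls.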

Application of Machine Learning on the Survival Analysis of Breast Cancer Patients

This project studies the application of machine learning techniques in Weka (clustering, classification, association rules, and regression) to the survival analysis of breast cancer patients. The breast cancer data set from the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Public-Use Data (years 1977-2017) will be used for all experiments, with R used for visualization. The aim of this project is to investigate the significance of prognostic factors such as age, ethnicity/race, tumor grade, and tumor size for the survival of breast cancer patients.
Prerequisites: Knowledge of R, machine learning algorithms, Weka
Contact: Mehdi Owrang
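
For readers new to survival analysis, the sketch below shows a plain-Python Kaplan-Meier estimator, a standard descriptive baseline that handles censored follow-up times. It is offered only as background: the project itself names Weka and R, not this estimator, and the follow-up times and event flags here are made up for illustration.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if the event (death) was observed, 0 if censored
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        n_at_risk -= removed
    return curve


# Hypothetical follow-up times (months) and event indicators.
curve = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
for time, prob in curve:
    print(time, round(prob, 3))
```

Censored patients contribute to the risk set up to their last follow-up but never count as deaths, which is what distinguishes this estimate from a simple proportion of survivors.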

The Accountability Project

The Accountability Project was built as a tool to help researchers and newsrooms search across otherwise siloed data. We acquire, process and standardize hundreds of data sets, accounting for nearly 900 million records. We are seeking student researchers who are interested in either contributing data to the project or analyzing data within the project.
Prerequisites: We are software agnostic, but the work requires skills in basic data review, cleaning, and analysis. Statistics are not required, but rigorous data standards are. We're especially interested in folks who want to tell stories with data. Most of the team uses either R or SQL.
Contact: Jennifer LaFleur