MSDSA Project Descriptions (2024 Cohort)

Author

Nate Dailey and Robin Donatello

Master's project abstracts for the 2024 cohort of Master's in Data Science and Analytics students.

Integrating Indian Sign Language (ISL) into Healthcare for Deaf Patients in India.

Pushpak Rane Final Defense Presentation, Portfolio

This project builds a real-time Indian Sign Language (ISL) recognition system for healthcare settings. It uses a standard webcam to track the signer's hands, extracts hand landmarks with MediaPipe, and converts them into numeric features a model can learn from. A deep learning model (a CNN–LSTM style network with attention) then classifies the input as one of three healthcare signs: doctor, pain, or skin. The app shows the sign name and a confidence score on screen and can read the result aloud for accessibility. Everything can run locally on the user's own computer. The goal is not to replace interpreters but to reduce communication gaps when a sign language interpreter is not immediately available, supporting quick, clear messages in clinics or triage-style situations. The system was trained on image data organized by sign class, with steps to balance classes and reduce bias. The model reached about 80% validation accuracy, while the live app prioritizes speed and stable predictions, smoothing outputs so labels do not flicker. Future work includes more signs, more diverse users and settings, and stronger real-world testing.
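The abstract does not say how the live app smooths its predictions. One common way to keep labels from flickering is a majority vote over the most recent frames. The sketch below illustrates that idea only; the class name `LabelSmoother` and the window size of 5 are assumptions for illustration, not details from the project.

```python
from collections import Counter, deque


class LabelSmoother:
    """Majority vote over the most recent per-frame predictions (illustrative)."""

    def __init__(self, window=5):
        # deque with maxlen drops the oldest label automatically
        self.recent = deque(maxlen=window)

    def update(self, label):
        """Record the newest per-frame label and return the smoothed label."""
        self.recent.append(label)
        # most_common(1) returns the label with the highest count in the window
        return Counter(self.recent).most_common(1)[0][0]


smoother = LabelSmoother(window=5)
# Hypothetical per-frame outputs: one stray "skin" frame, then a real change
frames = ["pain", "pain", "skin", "pain", "pain", "doctor", "doctor", "doctor"]
smoothed = [smoother.update(label) for label in frames]
print(smoothed)
```

With this scheme a single stray frame does not change the displayed label, while a sustained change takes over once it fills most of the window. A larger window gives steadier labels at the cost of slower reaction.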

Optimizing Volunteer Scheduling and Emergency Response Using Machine Learning

Khushi Choudhary Final Defense Presentation, LinkedIn, GitHub

During wildfire emergencies, coordinating volunteer response is as urgent as the response itself. For the North Valley Animal Disaster Group (NVADG), that coordination happened across three separate scheduling systems with no real-time visibility into staffing levels or where gaps would emerge. This project replaces that process with a deployed, full-stack web application that connects operational need to available volunteers in real time.

The application gives volunteers a color-coded shift calendar to view and claim availability, and gives coordinators live signup counts, per-location staffing controls, and one-click CSV exports for post-incident documentation. The colors are not arbitrary: they are driven by a Poisson regression model trained on 673 hourly records from four 2024 wildfire deployments, covering 91 volunteers. The model predicts expected volunteer counts by location, day, and shift band, and each calendar slot is colored based on how current signups compare to that expectation.
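As a rough illustration of the coloring rule, the sketch below maps the ratio of current signups to the model's expected count onto a color. The function name `slot_color`, the three colors, and the 0.5 and 1.0 cutoffs are assumptions made for this example; the abstract does not state the thresholds the application uses.

```python
def slot_color(signups, expected, low=0.5, high=1.0):
    """Color a calendar slot by comparing signups to the expected count.

    `expected` stands in for the Poisson model's predicted volunteer count
    for the slot. The cutoffs `low` and `high` are hypothetical.
    """
    if expected <= 0:
        # No volunteers expected, so the slot cannot be understaffed
        return "green"
    ratio = signups / expected
    if ratio < low:
        return "red"      # well below expectation
    if ratio < high:
        return "yellow"   # approaching expectation
    return "green"        # at or above expectation


for signups in (2, 4, 6):
    print(signups, slot_color(signups, expected=6.0))
```

Comparing against a model-based expectation rather than a fixed headcount means a slot that usually draws few volunteers is not flagged as understaffed just because its raw count is low.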

Key findings include a consistent increase in volunteer turnout as an event progresses, significantly higher engagement during evening shifts, and roughly 76% more volunteers at one deployment site compared to another holding all else constant. The model achieves a pseudo-R-squared of 0.51 and a mean absolute error of 5.88 volunteers per shift.
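A statement like "76% more volunteers" is how a Poisson regression coefficient is usually read: with the standard log link, a coefficient `beta` multiplies the expected count by `exp(beta)`. The sketch below shows only that conversion. The coefficient value is back-calculated from the reported 76% figure and is not a number taken from the project's model output.

```python
import math


def percent_change(beta):
    """Percent change in expected count implied by a log-link coefficient."""
    return (math.exp(beta) - 1.0) * 100.0


# Hypothetical site coefficient chosen so that exp(beta) equals 1.76
beta_site = math.log(1.76)
print(round(beta_site, 3), round(percent_change(beta_site), 1))
```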

The application is live on Posit Connect through CSU Chico’s Data Science Initiative and was built using Shiny for Python, Google Sheets, and statsmodels.

Offline-First Web App for Collecting and Syncing Geology Field Data

Govardhan Baddala Final Defense Presentation, GitHub, LinkedIn

The Geology Field App is a web application designed for geology field courses where internet connection may be weak or unavailable. The app allows students to collect and store field data directly on their phones or laptops using simple forms. To prevent data loss in remote areas, the application uses PouchDB to save data locally in the browser even when the device is offline. When internet connectivity becomes available again, the data automatically syncs with the main CouchDB database on the server. The frontend was built using React, while Docker Compose and Nginx were used to manage deployment and communication between services. Testing showed that the offline-first approach worked reliably in field conditions and allowed users to continue entering data without interruption. The project demonstrates a practical solution for reliable field data collection in low connectivity environments.

The project also showed that using local storage and automatic sync can make field data collection faster, easier, and more reliable for students and instructors.

GraphLearnR: An R Package for Powerful and Accessible Graph Learning

Shivam Pawar Final Defense Presentation

This project addresses the challenge of learning network structures from observational data when external factors influence the measured signals. Traditional approaches assume that signals are determined solely by the underlying network, which can lead to inaccurate representations in real-world settings. The purpose of this study is to develop and evaluate a user-friendly software tool that incorporates external variables into the process of network inference.

The study introduces GraphLearnR, an R-based software package designed to estimate network structure while accounting for additional influencing variables. The methodology combines signal-based network learning with regression techniques to separate network effects from external influences. The approach is validated using both simulated data and real-world temperature data collected from weather stations.

Results from the simulated experiments demonstrate that incorporating external variables improves the accuracy of network recovery compared to traditional methods. The real-world application reveals meaningful patterns in temperature relationships that reflect both geographic and environmental factors. The learned networks show clear regional groupings and provide interpretable insights into how temperature behavior varies across locations.

The findings indicate that accounting for external influences leads to more reliable and interpretable network structures. This work contributes a practical tool that can be applied across multiple fields, including environmental science, social science, and biology. The study highlights the importance of considering multiple sources of variation when analyzing complex systems and provides a foundation for future research in network modeling.

The GraphLearnR package is publicly available at: https://github.com/Shivam9927/GraphLearnR

Raster-Based Wildfire Risk Model in Python

Nate Dailey Final Defense Presentation, Website, LinkedIn

This study develops structure-level fire risk models for the 2025 Eaton and Palisades fires to better understand the factors that contributed to their destructiveness. Only pre-fire explanatory variables were used, including structure density, structure area, vegetation presence, topography, structure age, and wind alignment. These variables served as inputs to logistic regression and XGBoost models that predict whether residential-scale structures burned or survived. The models demonstrated strong predictive performance, even when trained on one study area and evaluated on the other. Density of surrounding structures and the presence of vegetation were the most impactful predictors, but slope, year built, structure area, and wind alignment were also significant factors. These findings are consistent with prior research identifying built-environment and vegetation characteristics as key drivers of structural loss during urban conflagrations. Because the models rely solely on pre-fire data, they are transferable to other fire-prone urban areas, provided that local wind patterns are considered. Given the scale of destruction and long-term social and economic impacts caused by the Eaton and Palisades fires, structure-level risk models such as this can support mitigation planning, defensible-space strategies, and city planning decisions aimed at reducing future wildfire losses.

B-Line Public Transit: A Data-Driven Analysis for Service Optimization

Zakir Elaskar Final Defense Presentation, LinkedIn, GitHub

This project applies data science techniques to analyze B-Line, the public transportation system serving Chico and surrounding communities. Public transit is essential for students, workers, and residents, yet inefficiencies can limit service quality and cost effectiveness. By examining trends, detecting inefficiencies, and proposing optimization strategies, this work provides data-driven insights to improve service delivery and promote more sustainable, cost-effective urban mobility. The findings reveal strong temporal patterns in ridership, with peak demand occurring during weekday commuting hours and significant declines during weekends and summer periods. Additionally, several routes and trips were identified as underutilized, highlighting opportunities for more efficient resource allocation. Overall, the results demonstrate that data-driven analysis can support practical, evidence-based improvements in transit scheduling and service planning.

Gradient Boosting Model to Predict Air Pollution in California

Snehitha Gorantla Final Defense Presentation

This study develops an integrated machine learning framework for forecasting ground-level ozone concentrations and assessing long-term vegetation exposure risk across California’s federal air quality monitoring network. Using over three decades of hourly observations from ten Clean Air Status and Trends Network stations, three LightGBM gradient-boosted regression models were trained to predict ozone at one-, eight-, and twenty-four-hour horizons, alongside a dedicated binary classifier for identifying exceedances of the National Ambient Air Quality Standard. The models incorporate approximately 160 predictive features including autoregressive ozone lags, atmospheric chemistry measurements, cyclical calendar variables, site geographic characteristics, and spatially explicit wildfire exposure derived from California Department of Forestry and Fire Protection perimeter records. Evaluated on a held-out test period spanning 2021 through 2025, the one-hour and eight-hour models reduced mean absolute error by 40 and 50 percent respectively relative to naive baselines, while the exceedance classifier achieved an area under the ROC curve of 0.918. A complementary vegetation exposure analysis applied Mann-Kendall trend testing and the W126 cumulative index to reveal statistically significant ozone improvements at seven of ten sites, improvements that were previously obscured by the saturation of European-derived binary risk thresholds in California’s high-background-ozone environment. The findings demonstrate that explicitly incorporating wildfire exposure and site-specific bias correction improves both short-range forecast skill and long-term ecological risk assessment in an era of intensifying fire activity.
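"Cyclical calendar variables" are commonly built by mapping a periodic quantity such as hour of day onto a circle with sine and cosine, so that hour 23 and hour 0 end up close together. The sketch below shows that general technique; the function name `encode_cyclical` is an assumption, and the abstract does not specify which encoding the study used.

```python
import math


def encode_cyclical(value, period):
    """Map a periodic value (e.g. hour 0-23 with period 24) to (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


midnight = encode_cyclical(0, 24)
late_evening = encode_cyclical(23, 24)
noon = encode_cyclical(12, 24)

# 23:00 sits next to 00:00 on the circle, while 12:00 is on the opposite side
print(distance(late_evening, midnight) < distance(noon, midnight))
```

A raw hour feature would place 23 and 0 at opposite ends of its range, while the two-column encoding preserves the wrap-around that daily and seasonal ozone cycles follow.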

Case Manager: An AI-Assisted Mobile Application for Health Record Organization, Medication Management, and Appointment Scheduling

Abinesh S Final Defense Presentation, Website

As the global population aged 65 and older approaches 703 million and is projected to double by 2050 (UN DESA, 2023), managing complex health regimens has become a growing public health concern. Senior citizens and individuals with special needs frequently miss medications and appointments due to the cognitive demands of tracking fragmented health information across paper records, multiple applications, and verbal memory, contributing to preventable hospital readmissions and increased caregiver burden.

This project presents Case Manager, a voice-first, AI-powered mobile health management application designed to address this gap. The system integrates a cross-platform React Native frontend with a Python FastAPI backend and Firebase cloud infrastructure. A multi-stage AI orchestration pipeline powered by the Groq Llama 3.3-70b large language model and Deepgram speech services classifies user voice queries across 15 or more intent types, retrieves relevant records from a personal Firestore database, and routes requests to specialized agents for medication, appointment, and general health guidance. Every response passes through a mandatory safety validation layer that filters unsupported medical claims and injects appropriate disclaimers.

The implemented system supports 10 structured medical record categories with full CRUD functionality, achieves speech recognition accuracy exceeding 95%, and delivers end-to-end voice responses in 2 to 4 seconds across iOS, Android, and web platforms. Results demonstrate a robust framework for personalized, context-aware health assistance validated through functional and integration testing.

Leveraging Machine Learning to Model CalFresh Participation

Amol Bhalerao LinkedIn

CalFresh, California’s version of SNAP, can help reduce food insecurity among college students, but many eligible students never enroll. My MS project used CHC-UCLA CalFresh Messaging Survey data to understand both who participates in CalFresh and why students drop out of the application process. I analyzed 4,968 baseline survey responses and 2,228 follow-up responses, comparing machine learning models alongside statistical barrier analysis. The strongest predictive results came from XGBoost for baseline CalFresh participation and SVM for post-survey application behavior. The post-survey model performed especially well, reaching an AUC of 0.818. More importantly, the results showed that nonparticipation is not explained by awareness alone. Many students assumed they were ineligible without checking, while self-reliance, stigma, documentation issues, family influence, and confusion about student eligibility created additional barriers. Among students who started but did not finish an application, common drop-off reasons included believing they were not eligible, being too busy, missing documentation, and family concerns. This project shows how machine learning can help identify who is least likely to access benefits and which barriers campus programs should address first.

Evaluating Post-Fire Vegetation Response in Masticated Chaparral at Big Chico Creek Ecological Reserve

Anand Gangavarapu Final Defense Presentation

Wildfire recovery is often discussed in broad terms, but land managers need site-level evidence to understand whether fuel treatments are supporting resilience or creating new ecological risks. This project evaluates post-fire vegetation response in a masticated chaparral site at Big Chico Creek Ecological Reserve after the 2024 Park Fire. Using paired transect data collected before and one year after the fire, I analyzed changes in woody cover, herbaceous cover, native and non-native species, Cal-IPC invasion-risk groups, and fuel consumption. Results showed a strong structural reset in the shrub layer, with woody cover declining from 46.4% to 9.7% across all paired transects. Herbaceous cover also declined overall, but its response was more variable. Native herbaceous cover remained relatively stable, while non-native cover declined significantly, mainly through Moderate and Limited Cal-IPC groups. However, High-rated invasive species did not clearly decline, meaning invasion risk still requires monitoring. Fuel consumption helped explain woody cover loss, but not herbaceous change. Overall, this project shows that post-fire recovery in treated chaparral should be evaluated through both vegetation structure and community composition. The work also provides a foundation for a repeatable monitoring framework to guide restoration, invasion management, and future fuel-treatment decisions.

Morphometric Sex Determination and GPS Telemetry of Turkey Vultures in Western Montana

Jayana Sarma Final Defense Presentation

Turkey Vultures (Cathartes aura) play an important ecological role as scavengers, but determining their sex in the field is difficult because males and females show little obvious external difference. My MS project evaluates whether morphometric measurements can be used to predict the sex of Turkey Vultures captured in western Montana. Using laboratory-confirmed sex records for 65 individuals, I compared several statistical and machine learning approaches, including logistic regression, linear discriminant analysis, random forest, and regularized regression. The results showed that morphometric measurements contain useful but incomplete information for sex classification. Logistic regression provided the strongest overall balance of predictive performance and interpretability. Although a 9-variable logistic regression model produced the highest cross-validated AUC, a reduced 3-variable model using tail length, culmen length, and head height remained competitive while being simpler and more practical for field use. Overall, this project suggests that morphometric models can serve as a useful preliminary screening tool when genetic sexing is unavailable or delayed, but they should not replace molecular confirmation. Future work should validate the model using larger and independent datasets.