- Starts: 12:00 pm on Friday, May 22, 2026
- Ends: 2:00 pm on Friday, May 22, 2026
ECE PhD Thesis Defense: Chonghua Xue
Title: Strategies For Interpretable Machine Learning On Incomplete Real-World Medical Data
Presenter: Chonghua Xue
Advisor: Professor Vijaya B. Kolachalama
Chair: TBD
Committee: Professor Vijaya B. Kolachalama, Professor Ioannis Paschalidis, Professor Archana Venkataraman, Professor Prakash Ishwar
Google Scholar Link: https://scholar.google.com/citations?user=f9k4jcMAAAAJ
Abstract: This dissertation studies interpretable machine learning under incomplete real-world medical data, with particular emphasis on settings in which the set of available features changes across patients, cohorts, and deployment environments. In such settings, conventional responses such as case deletion, imputation, or restriction to a small shared feature set can reduce predictive utility and weaken the reliability of post hoc explanation. The central argument of the dissertation is that missingness should be modeled directly rather than treated only as a preprocessing nuisance. This perspective is especially important in biomedical applications, where feature availability is shaped by clinical workflow, cost, site-specific measurement practices, and patient-specific testing decisions.
The dissertation develops a feature-as-token transformer framework for structured and multimodal medical data, in which unavailable features are excluded from attention-based computation without being converted into ordinary predictive covariates. On top of this architecture, it introduces masked training as a family of stochastic feature-deletion schemes designed to improve robustness to incomplete inputs. It then develops mask-space distributionally robust optimization, implemented as MaDRO, to improve worst-case performance under adverse shifts in feature-availability patterns. Beyond prediction alone, the dissertation also studies the reliability of Shapley-based explanation under deletion semantics, showing theoretically and empirically that explanation fidelity depends on how accurately the model represents partially observed inputs.
These ideas are evaluated through biomedical case studies, public benchmark experiments, and controlled semisynthetic analyses. The application-focused studies span Parkinson's disease metabolomics, dementia assessment from digital voice recordings, and multimodal differential diagnosis of dementia etiologies. The benchmark and semisynthetic experiments isolate the effects of deletion-aware training and robustification under known missingness regimes. Across these settings, masked training consistently improves performance relative to ordinary training when features are unavailable at test time, and robustification often yields additional gains when deployment conditions are more adverse than those observed during training. The explanation analyses further show that predictive robustness and attribution fidelity are related but not identical objectives, and that deletion-aware training can improve the reliability of Shapley-based interpretation when the explanation game is aligned with the model's treatment of missingness.
Taken together, the dissertation advances a unified view of learning under incomplete data. It argues that architectural design, training-time feature deletion, robust optimization, and explanation reliability should be treated as connected aspects of the same modeling problem. This view yields practical methods for building predictive models that remain more robust and more interpretable under realistic patterns of feature unavailability in biomedical data.
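The abstract gives no implementation details, so the following is only a minimal sketch of two ideas it describes: excluding unavailable features from attention, and randomly deleting observed features during training. The function names (`masked_softmax`, `random_feature_deletion`), the per-feature scores, and the independent drop probability are all illustrative assumptions, not the dissertation's actual method or the MaDRO implementation.

```python
import math
import random

# Hypothetical helpers for illustration only; not taken from the dissertation.

def masked_softmax(scores, available):
    """Softmax over available feature tokens; unavailable tokens get weight 0."""
    if not any(available):
        raise ValueError("at least one feature must be available")
    # Subtract the largest available score for numerical stability.
    peak = max(s for s, a in zip(scores, available) if a)
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, available)]
    total = sum(exps)
    return [e / total for e in exps]

def random_feature_deletion(available, drop_prob, rng):
    """Hide each observed feature with probability drop_prob, keeping at least one."""
    observed = [i for i, a in enumerate(available) if a]
    if not observed:
        return list(available)
    kept = [i for i in observed if rng.random() >= drop_prob]
    if not kept:
        # Assumption: never delete every observed feature from a sample.
        kept = [rng.choice(observed)]
    kept_set = set(kept)
    return [i in kept_set for i in range(len(available))]

# Example: four feature tokens, the second one was never measured.
scores = [2.0, 0.5, -1.0, 1.0]
available = [True, False, True, True]
weights = masked_softmax(scores, available)
train_mask = random_feature_deletion(available, drop_prob=0.5, rng=random.Random(0))
```

In this sketch the unmeasured feature receives exactly zero attention weight instead of an imputed value, and the training mask can only remove features that were actually observed. Both are simplifications of the deletion semantics the abstract refers to.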
- Location: PHO 339
