Xiaoxiao Li
Graduate Student Supervision
Master's Student Supervision
Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.
With the increasing interest in deep learning, data safety issues have become more prevalent as we rely more on artificial intelligence. Adversaries can easily obtain sensitive information through various attacks; this dramatically discourages patients and clients from contributing invaluable data that may benefit research. This problem underscores the need for a gold-standard privacy notion, and in recent years Differential Privacy (DP) has been recognized as that standard. Among popular DP methods, maximum mean discrepancy (MMD) is a particularly useful distance metric for differentially private data generation. When used with finite-dimensional features, it allows us to summarize and privatize the data distribution once, and then reuse that privatized summary throughout generator training without further privacy loss. An important question in this framework is what features are useful for distinguishing real from synthetic data distributions, and whether those features enable us to generate quality synthetic data. This work considers the features of neural tangent kernels (NTKs), more precisely empirical NTKs (e-NTKs). We find that, perhaps surprisingly, the expressiveness of untrained e-NTK features is comparable to that of perceptual features taken from networks pre-trained on public data. As a result, our method improves the privacy-accuracy trade-off compared to other state-of-the-art methods without relying on any public data, as demonstrated on several tabular and image benchmark datasets. In addition, we extend NTKs to data distillation (DD) in federated learning (FL) settings, where we aim to condense sensitive information into a small set of images for deep learning training in a DP manner; we show that our method obtains meaningful results even under class imbalance and on spuriously correlated image datasets.
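A minimal sketch of the "privatize once, reuse forever" idea described in this abstract is shown below: the mean feature embedding of the real data is privatized a single time with the Gaussian mechanism, and the generator is then trained against that fixed target, since with finite-dimensional features the squared MMD reduces to the squared distance between mean embeddings. The helper names `dp_mean_embedding` and `mmd2_to_private_target` are hypothetical, the feature rows (which could be e-NTK features) are assumed pre-normalized to unit L2 norm, and the noise calibration is the standard Gaussian-mechanism bound for epsilon < 1; this is an illustration of the general technique, not the thesis's exact implementation.

```python
import numpy as np

def dp_mean_embedding(features, epsilon, delta, rng=None):
    """Privatize the mean feature embedding once via the Gaussian mechanism.

    `features` is an (n, d) array of fixed feature maps phi(x_i); each row is
    assumed pre-normalized to L2 norm <= 1, so the mean embedding has L2
    sensitivity 2/n under replace-one neighbouring datasets.
    """
    if rng is None:
        rng = np.random.default_rng()
    n, d = features.shape
    mu = features.mean(axis=0)
    sensitivity = 2.0 / n
    # Standard (epsilon, delta) Gaussian-mechanism calibration (epsilon < 1).
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return mu + rng.normal(0.0, sigma, size=d)

def mmd2_to_private_target(gen_features, private_mu):
    """Squared MMD between a generated batch and the privatized real mean.

    With finite-dimensional features, MMD^2 equals the squared distance
    between mean embeddings, so the generator can be trained against
    `private_mu` for any number of steps with no further privacy loss.
    """
    return float(np.sum((gen_features.mean(axis=0) - private_mu) ** 2))
```

Because the privatized summary is computed before training begins, every generator update reads only `private_mu`, so the total privacy cost is paid once regardless of how many training iterations follow.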
The standard deep learning paradigm may not be practical for real-world heterogeneous medical data, where new diseases emerge over time and data are acquired in a distributed manner across various hospitals. Existing approaches address two primary categories of heterogeneity: 1) class incremental learning, which offers a promising solution for sequential heterogeneity by adapting a deep network trained on previous disease classes to handle newly introduced diseases over time; and 2) federated learning, which offers a promising solution for distributed heterogeneity by training a global model on a centralized server over the private datasets of various hospitals or clients, without requiring them to share data. The core challenge in both approaches is catastrophic forgetting: performance degrades on previously trained data when the model is adapted to newly available data. Because of strict patient privacy regulations, storing and sharing medical data are often discouraged, posing a significant hurdle to addressing such forgetting. We propose to leverage medical data synthesis to recover inaccessible medical data in heterogeneous learning, presenting two distinct novel frameworks. Our first framework introduces a novel two-step, data-free class incremental learning pipeline. It first synthesizes data by inverting model weights trained on previous classes and matching the statistics saved in continual normalization layers, obtaining continual class-specific samples. It then updates the model with three novel loss functions that enhance the utility of the synthesized data and mitigate forgetting. Extensive experiments demonstrate that the proposed framework achieves results comparable to state-of-the-art methods on four public MedMNIST datasets and an in-house heart echocardiography dataset. Our second framework is a novel federated learning approach that mitigates forgetting by generating and using unified global synthetic data among clients. First, we propose constrained model inversion over the server model to enforce an information-preserving property in the synthetic data and to leverage the global distribution captured in the globally aggregated server model. We then use this synthetic data alongside the local data to enhance the generalization capabilities of local training. Extensive experiments show that the proposed method achieves state-of-the-art performance on the BloodMNIST and Retina datasets.
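The data-free inversion step this abstract describes can be sketched as follows: synthetic inputs are optimized so that a frozen classifier both predicts a chosen previous class and produces feature statistics matching the running statistics stored in its normalization layers. This is a minimal illustration, assuming standard `BatchNorm2d` layers as a stand-in for the continual normalization layers mentioned above; `invert_class_samples`, the input shape, and all hyperparameters are hypothetical, and the thesis's three additional loss functions are not reproduced here.

```python
import torch
import torch.nn.functional as F

def invert_class_samples(model, target_class, n_images=16, steps=500,
                         lr=0.1, stat_weight=1.0):
    """Data-free model inversion sketch: recover class-specific samples from
    a frozen model by matching stored normalization statistics."""
    model.eval()
    x = torch.randn(n_images, 3, 32, 32, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)

    # Hook every BatchNorm layer to penalize mismatch between the batch
    # statistics of the synthetic inputs and the stored running statistics.
    stat_losses = []
    def bn_hook(module, inputs, _output):
        feat = inputs[0]
        mean = feat.mean(dim=(0, 2, 3))
        var = feat.var(dim=(0, 2, 3), unbiased=False)
        stat_losses.append(F.mse_loss(mean, module.running_mean)
                           + F.mse_loss(var, module.running_var))

    hooks = [m.register_forward_hook(bn_hook)
             for m in model.modules()
             if isinstance(m, torch.nn.BatchNorm2d)]

    target = torch.full((n_images,), target_class, dtype=torch.long)
    for _ in range(steps):
        stat_losses.clear()
        opt.zero_grad()
        logits = model(x)
        # Class loss keeps samples on-target; stat loss keeps them in-distribution.
        loss = F.cross_entropy(logits, target) + stat_weight * sum(stat_losses)
        loss.backward()
        opt.step()

    for h in hooks:
        h.remove()
    return x.detach()
```

In the federated setting described above, the same inversion would presumably be run against the globally aggregated server model, with additional constraints to preserve the global information, and the resulting synthetic data mixed into each client's local training.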