The Value of Side Information in Unlabeled Data

Explore how side information can enhance machine learning performance in data-scarce environments through this 43-minute conference talk from Harvard CMSA's Workshop on Mathematical Foundations of AI. Learn about a novel framework where practitioners can leverage extra features available only during training to improve models that will be deployed with limited feature sets. Discover the iterative training process involving rich-view models that generate pseudo-labels and deployment models trained on both real and synthetic labels, with each iteration refining the pseudo-labeling process. Understand the theoretical foundations showing that side information provides benefits specifically when rich-view and deployment models produce different types of errors, formalized through a decorrelation score that quantifies error independence and predicts performance gains. Gain insights into practical applications for scenarios with abundant unlabeled data but limited labeled examples, and learn how to strategically use temporary access to additional features during training to boost final model performance.
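The iterative process described above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the speaker's method. The nearest-centroid classifier, the synthetic features `x` (available at deployment) and `z` (side information, training only), and the refinement rule that retrains the rich-view model on real plus pseudo-labels are all hypothetical choices made for the sketch. `error_overlap` is only a rough proxy for error independence. It is not the decorrelation score from the talk, whose definition is not given here.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, X):
    """Assign each row to the class whose centroid is closest."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: dist2(row, centroids[c])) for row in X]

def make_data(n, rng):
    """Toy rows (x, z, y): x is the deployment feature, z the side feature."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(2.0 * y, 1.5)  # noisy feature available at deployment
        z = rng.gauss(2.0 * y, 1.0)  # training-only feature, independent noise
        rows.append((x, z, y))
    return rows

def train(labeled, unlabeled, rounds=3):
    """Alternate between pseudo-labeling (rich view) and deployment training."""
    real_y = [y for _, _, y in labeled]
    real_rich = [[x, z] for x, z, _ in labeled]
    real_deploy = [[x] for x, _, _ in labeled]
    pool_rich = [[x, z] for x, z, _ in unlabeled]   # true labels are ignored
    pool_deploy = [[x] for x, _, _ in unlabeled]
    rich_X, rich_y = real_rich, real_y
    for _ in range(rounds):
        rich = fit_centroids(rich_X, rich_y)         # rich-view model
        pseudo = predict(rich, pool_rich)            # pseudo-labels for the pool
        deploy = fit_centroids(real_deploy + pool_deploy, real_y + pseudo)
        # Assumed refinement step: next round's rich-view model also sees
        # the pseudo-labeled pool.
        rich_X, rich_y = real_rich + pool_rich, real_y + pseudo
    return rich, deploy, pseudo

def error_overlap(y_true, pred_a, pred_b):
    """Joint error rate, and the rate expected if errors were independent."""
    n = len(y_true)
    err_a = [int(p != t) for p, t in zip(pred_a, y_true)]
    err_b = [int(p != t) for p, t in zip(pred_b, y_true)]
    joint = sum(a * b for a, b in zip(err_a, err_b)) / n
    independent = (sum(err_a) / n) * (sum(err_b) / n)
    return joint, independent

rng = random.Random(0)
labeled = make_data(20, rng)      # scarce labeled examples
unlabeled = make_data(500, rng)   # abundant unlabeled examples
test = make_data(200, rng)

rich, deploy, pseudo = train(labeled, unlabeled)
y_test = [y for _, _, y in test]
rich_pred = predict(rich, [[x, z] for x, z, _ in test])
deploy_pred = predict(deploy, [[x] for x, _, _ in test])
joint, independent = error_overlap(y_test, rich_pred, deploy_pred)
print(f"joint error rate {joint:.3f}, under independence {independent:.3f}")
```

The deployment model only ever sees `x`, so it can be used after the side feature `z` is no longer available. Comparing the joint error rate with the product of the individual error rates gives a rough sense of how independently the two models fail on this toy data.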