Building a Semantic Search Engine and Open-Status Classifier over the ResearchMath-14k Dataset
In this tutorial, we work with the amphora/ResearchMath-14k dataset, a collection of research-level mathematics problems mined from arXiv. We load the dataset, inspect its structure, and explore how the problems are distributed across mathematical fields and open-status categories. We then move beyond basic analysis by extracting field-specific keywords, generating semantic embeddings, visualizing the problem landscape, clustering related problems, and building a simple search engine over the dataset. Also, we train a classifier to predict problem status from embeddings […]