[P] Training GitHub Repository Embeddings using Stars

digitado ⋅ January 6, 2026

People use GitHub Stars as bookmarks, so repositories starred by the same people tend to be related. This makes stars an excellent signal for understanding which repositories are semantically similar.
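To make the intuition concrete, here is a minimal sketch of measuring how much two repositories' audiences overlap. It is not the trained model from this project, just plain cosine similarity over binary "who starred what" vectors. The events, user names, and repository names are all hypothetical.

```python
from collections import defaultdict
import math

# Toy star events as (user, repo) pairs. All names are hypothetical.
STAR_EVENTS = [
    ("u1", "org/web-a"), ("u1", "org/web-b"),
    ("u2", "org/web-a"), ("u2", "org/web-b"), ("u2", "org/web-c"),
    ("u3", "org/data-a"), ("u3", "org/data-b"),
    ("u4", "org/data-a"), ("u4", "org/data-b"),
]


def stargazers_by_repo(events):
    """Group events into a mapping of repo -> set of users who starred it."""
    result = defaultdict(set)
    for user, repo in events:
        result[repo].add(user)
    return dict(result)


def co_star_similarity(a, b, stargazers):
    """Cosine similarity between the binary stargazer vectors of two repos."""
    users_a, users_b = stargazers[a], stargazers[b]
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


stargazers = stargazers_by_repo(STAR_EVENTS)
print(co_star_similarity("org/web-a", "org/web-b", stargazers))   # identical audience
print(co_star_similarity("org/web-a", "org/data-a", stargazers))  # disjoint audience
```

Raw overlap like this gets noisy for repositories with few stars, which is one reason to learn dense embeddings instead of comparing star lists directly.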

The Data: Processed ~1TB of raw data from GitHub Archive (BigQuery) to build an interest matrix of 4 million developers.
The ML: Trained embeddings for 300k+ repositories using Metric Learning (EmbeddingBag + MultiSimilarityLoss).
The Frontend: Built a client-only demo that runs vector search (KNN) directly in the browser via WASM, with no backend involved.
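The search step can be sketched as a brute-force nearest-neighbor scan over the embedding table. The demo runs this in the browser via WASM; the version below is only a Python illustration of the same idea, assuming cosine similarity as the metric. The three-dimensional vectors and repository names are hypothetical placeholders, not the trained embeddings.

```python
import heapq
import math

# Hypothetical 3-dimensional embeddings; trained ones would be much wider.
EMBEDDINGS = {
    "org/web-a": [0.9, 0.1, 0.0],
    "org/web-b": [0.8, 0.2, 0.1],
    "org/data-a": [0.1, 0.9, 0.2],
    "org/data-b": [0.0, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def nearest_repos(query, embeddings, k=2):
    """Return the k repos most similar to `query`, excluding the query itself."""
    q = embeddings[query]
    scored = (
        (cosine(q, vec), name)
        for name, vec in embeddings.items()
        if name != query
    )
    return [name for _, name in heapq.nlargest(k, scored)]


print(nearest_repos("org/web-a", EMBEDDINGS, k=1))
```

An exhaustive scan is linear in the number of repositories, which is usually acceptable for a few hundred thousand vectors and avoids shipping an index structure to the client.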

The Result: The system finds non-obvious library alternatives and allows for semantic comparison of developer profiles.
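One plausible way to compare developer profiles is to average the embeddings of the repositories each developer starred and take the cosine similarity of the resulting vectors. Mean pooling is an assumption here (it mirrors what an EmbeddingBag layer does in its default "mean" mode), and the embeddings, developer names, and star lists below are hypothetical.

```python
import math

# Hypothetical repo embeddings and star lists, for illustration only.
REPO_EMBEDDINGS = {
    "org/web-a": [0.9, 0.1, 0.0],
    "org/web-b": [0.8, 0.2, 0.1],
    "org/data-a": [0.1, 0.9, 0.2],
    "org/data-b": [0.0, 0.8, 0.3],
}
STARS = {
    "dev-1": ["org/web-a", "org/web-b"],
    "dev-2": ["org/web-b"],
    "dev-3": ["org/data-a", "org/data-b"],
}


def profile_vector(repos, embeddings):
    """Mean-pool the embeddings of the starred repos that have an embedding."""
    vectors = [embeddings[r] for r in repos if r in embeddings]
    if not vectors:
        raise ValueError("no known repositories in this profile")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def profile_similarity(dev_a, dev_b):
    """Compare two developers by the cosine of their pooled profile vectors."""
    vec_a = profile_vector(STARS[dev_a], REPO_EMBEDDINGS)
    vec_b = profile_vector(STARS[dev_b], REPO_EMBEDDINGS)
    return cosine(vec_a, vec_b)


print(profile_similarity("dev-1", "dev-2"))  # both star web repos
print(profile_similarity("dev-1", "dev-3"))  # web vs. data interests
```

With this toy data the first pair scores much higher than the second, since dev-1 and dev-2 share an interest area while dev-3 stars a different cluster.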

I hope the source code, the raw dataset, and the trained embeddings help you build some interesting projects.
