LVFace performance vs. ArcFace/ResNet
I’m looking at swapping my current face recognition stack for LVFace (the ByteDance paper from ICCV 2025) and wanted to see if anyone has real-world benchmarks yet.
Currently, I’m running a standard InsightFace-style pipeline: SCRFD (det_10g) feeding into the Buffalo_L (ArcFace) models. It’s reliable, and I’ve tuned it to run quickly with predictable VRAM usage in a long-running environment. LVFace, by contrast, uses a Vision Transformer (ViT) backbone instead of the usual ResNet/CNN setup, and it reportedly took 1st place in the MFR-Ongoing challenge.
In particular, I’m interested in better facial discrimination and recall on partially occluded (e.g. mask-wearing) faces. ArcFace tends to get confused by masks; it will happily compute nonsense embeddings for the masked part of the face rather than saying “Oh, that’s a mask, let me focus on the peri-orbital region and give that more weight in the embedding.”
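To make that concern concrete, here’s a toy numpy sketch of the failure mode — random unit vectors standing in for embeddings, not real model outputs, and the noise scales are arbitrary illustration values, not measured ArcFace behavior:

```python
import numpy as np

# Toy illustration: random 512-D unit vectors as stand-ins for face
# embeddings. Corrupting a probe (a crude stand-in for the junk a mask
# injects) drags its cosine similarity down toward impostor territory.
rng = np.random.default_rng(0)
dim = 512

def normalize(v):
    return v / np.linalg.norm(v)

enroll = normalize(rng.standard_normal(dim))

# Same identity: mild capture noise vs. heavy corruption.
probe_clean = normalize(enroll + 0.02 * rng.standard_normal(dim))
probe_masked = normalize(enroll + 0.10 * rng.standard_normal(dim))

sim_clean = float(enroll @ probe_clean)    # stays close to 1.0
sim_masked = float(enroll @ probe_masked)  # drops sharply
print(f"clean: {sim_clean:.2f}  masked: {sim_masked:.2f}")
```

With typical ArcFace cosine accept thresholds sitting somewhere around 0.3–0.4, the corrupted probe ends up near the decision boundary, which is exactly where masked faces start producing misses and false matches.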
LVFace supposedly solves this. I’ve done some small-scale testing, but I’m wondering if anyone’s tried using it in production. If you’ve tested it, I’m curious about:
- Inference Speed: ViTs can be heavy—how much slower is it compared to the r50 Buffalo model in practice?
- VRAM Usage: Is the footprint manageable for high-concurrency batching?
- Masks/Occlusions: It won the Masked Face Recognition challenge, but does that actually translate to better field performance for you?
- Recall at Scale: Any issues with embedding drift or false positives when searching against a million+ identity gallery?
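For the recall-at-scale question, this is roughly the harness I’ve been using on synthetic data — a hedged sketch with random embeddings and a shrunken gallery (10k instead of 1M+) so it runs instantly; real model outputs obviously cluster very differently:

```python
import numpy as np

# Toy recall@1 harness. Swap the random vectors for real ArcFace/LVFace
# embeddings (and grow n_ids) to compare the two models on your own gallery.
rng = np.random.default_rng(42)
n_ids, dim = 10_000, 512

# L2-normalize the gallery so cosine similarity is a plain dot product.
gallery = rng.standard_normal((n_ids, dim)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# Probes: noisy re-captures of 100 enrolled identities.
probe_ids = rng.choice(n_ids, size=100, replace=False)
probes = gallery[probe_ids] + 0.05 * rng.standard_normal((100, dim)).astype(np.float32)
probes /= np.linalg.norm(probes, axis=1, keepdims=True)

scores = probes @ gallery.T          # (100, n_ids) cosine scores
top1 = scores.argmax(axis=1)
recall_at_1 = float((top1 == probe_ids).mean())
print(f"recall@1 = {recall_at_1:.3f}")
```

Side note on why the VRAM question matters: at 1M+ identities a float32 gallery is already ~2 GB (1e6 × 512 dims × 4 bytes) before you even load a model, so brute-force GPU search and high-concurrency batching fight over the same memory.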
I’m trying to decide if the accuracy gain is worth the extra compute overhead (doing all local inference here). Any insights appreciated!
[ going to tag u/mrdividendsniffer here in case he has any feedback on LVFace ]
submitted by /u/dangerousdotnet