Context
I’m training a Spanish legal NER model (RoBERTa-based, 28 PII categories) using curriculum learning. For the real-world legal corpus (BOE/BORME gazette), I built a multi-annotator pipeline with 5 annotators:
| Annotator | Type | Strengths |
|---|---|---|
| RoBERTa-v2 | Transformer (fine-tuned) | PERSON, ORG, LOC |
| Flair | Transformer (off-the-shelf) | PERSON, ORG, LOC |
| GLiNER | Zero-shot NER | DATE, ADDRESS, broad coverage |
| Gazetteer | Dictionary lookup | LOC (cities, provinces) |
| Cargos | Rule-based | ROLE (job titles) |
Consensus rule: an entity is accepted if ≥N annotators agree on span (IoU ≥80%) AND category.
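For concreteness, the consensus check is conceptually just span matching plus vote counting. A minimal sketch of what I mean (function and variable names are mine, not the actual pipeline code):

```python
def span_iou(a, b):
    """Character-level IoU between two (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def count_votes(candidate, predictions, iou_min=0.8):
    """Count annotators proposing a span with IoU >= iou_min and the same category.

    candidate:   (start, end, category)
    predictions: dict mapping annotator name -> list of (start, end, category)
    """
    start, end, category = candidate
    votes = 0
    for ents in predictions.values():
        if any(e[2] == category and span_iou((start, end), (e[0], e[1])) >= iou_min
               for e in ents):
            votes += 1
    return votes

# Example: two annotators propose near-identical PERSON_NAME spans -> 2 votes.
preds = {
    "roberta_v2": [(10, 25, "PERSON_NAME")],
    "flair":      [(10, 24, "PERSON_NAME")],
    "gliner":     [(40, 50, "DATE")],
}
assert count_votes((10, 25, "PERSON_NAME"), preds) == 2
```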
The problem
Not all annotators can detect all categories. DATE is only detectable by GLiNER + RoBERTa-v2. ADDRESS is similar. So I use asymmetric thresholds:
| Category | Threshold | Rationale |
|---|---|---|
| PERSON_NAME | ≥3 | 4 annotators capable |
| ORGANIZATION | ≥3 | 3 annotators capable |
| LOCATION | ≥3 | 4 annotators capable (best agreement) |
| DATE | ≥2 | Only 2 annotators capable |
| ADDRESS | ≥2 | Only 2 annotators capable |
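In code, this just turns acceptance into a per-category lookup. A minimal sketch, reusing `count_votes` from the consensus sketch above (the dict and names are illustrative):

```python
# Per-category minimum vote counts; unlisted categories fall back to the default.
CATEGORY_THRESHOLDS = {
    "PERSON_NAME": 3,
    "ORGANIZATION": 3,
    "LOCATION": 3,
    "DATE": 2,
    "ADDRESS": 2,
}
DEFAULT_THRESHOLD = 3

def accept(candidate, predictions):
    """Accept a candidate entity if it reaches its category's vote threshold."""
    threshold = CATEGORY_THRESHOLDS.get(candidate[2], DEFAULT_THRESHOLD)
    return count_votes(candidate, predictions) >= threshold
```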
Actual data (the cliff effect)
I computed retention curves across all thresholds. Here’s what the data shows:
| Category | Total | ≥1 | ≥2 | ≥3 | ≥4 | =5 |
|---|---|---|---|---|---|---|
| PERSON_NAME | 257k | 257k | 98k (38%) | 46k (18%) | 0 | 0 |
| ORGANIZATION | 974k | 974k | 373k (38%) | 110k (11%) | 0 | 0 |
| LOCATION | 475k | 475k | 194k (41%) | 104k (22%) | 40k (8%) | 0 |
| DATE | 275k | 275k | 24k (8.8%) | 0 | 0 | 0 |
| ADDRESS | 54k | 54k | 1.4k (2.6%) | 0 | 0 | 0 |
Key observations:
- DATE and ADDRESS drop to exactly 0 at ≥3. A uniform threshold would eliminate them entirely.
- LOCATION is the only category reaching ≥4 (Gazetteer + Flair + GLiNER + RoBERTa-v2 all detect it).
- No entity in the entire corpus gets 5/5 agreement. The annotators are too heterogeneous.
- Even PERSON_NAME only retains 18% at ≥3.
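For reference, the retention table above is nothing more than a cumulative histogram over vote counts; something along these lines, assuming a list of (category, vote_count) pairs, one per candidate entity (names are mine):

```python
from collections import Counter, defaultdict

def retention_table(voted, max_votes=5):
    """Cumulative retention per category.

    voted: iterable of (category, vote_count) pairs, one per candidate entity.
    Returns {category: {"total": N, 1: n1, 2: n2, ...}} where the value at
    threshold t is the number of candidates with at least t votes.
    """
    per_cat = defaultdict(Counter)
    for category, votes in voted:
        per_cat[category][votes] += 1
    table = {}
    for category, hist in per_cat.items():
        row = {"total": sum(hist.values())}
        for t in range(1, max_votes + 1):
            row[t] = sum(count for v, count in hist.items() if v >= t)
        table[category] = row
    return table
```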

My concerns
- ≥2 for DATE/ADDRESS effectively means “both capable annotators agree”, which is weaker than a true multi-annotator consensus. Is this still meaningfully better than a single annotator?
- Category-specific thresholds introduce a confound — are we measuring annotation quality or annotator capability coverage?
- Alternative approach: should I add more DATE/ADDRESS-capable annotators (e.g., regex date patterns, an address parser) to enable a uniform ≥3 threshold instead? (A rough regex sketch follows this list.)
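On that last point, a cheap extra DATE voter could be as simple as a regex over the long-form dates that BOE/BORME text uses. A rough, deliberately narrow sketch (the pattern is illustrative, not exhaustive; numeric formats like 12/03/2024 would need extra patterns):

```python
import re

# Spanish long-form dates as they typically appear in gazette text,
# e.g. "12 de marzo de 2024".
MONTHS = ("enero|febrero|marzo|abril|mayo|junio|julio|agosto|"
          "septiembre|octubre|noviembre|diciembre")
DATE_RE = re.compile(rf"\b\d{{1,2}} de (?:{MONTHS}) de \d{{4}}\b", re.IGNORECASE)

def regex_date_annotator(text):
    """Return (start, end, category) spans for long-form Spanish dates."""
    return [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]

# regex_date_annotator("En Madrid, a 12 de marzo de 2024, comparece...")
# -> one DATE span covering "12 de marzo de 2024"
```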
Question
For those who’ve worked with multi-annotator NER pipelines: is varying the consensus threshold per entity category a valid practice, or should I invest in adding specialized annotators to enable uniform thresholds?
Any pointers to papers studying this would be appreciated. The closest I’ve found is Rodrigues & Pereira (2018) on learning from crowds, but it doesn’t address category-asymmetric agreement.