We’re releasing FASHN Human Parser, a SegFormer-B4 fine-tuned for human parsing in fashion contexts.
Background: Dataset quality issues
Before training our own model, we spent time analyzing the commonly used datasets for human parsing: ATR, LIP, and iMaterialist. We found consistent quality issues that affect models trained on them:
ATR:
- Annotation “holes” where background pixels appear inside labeled regions
- Label spillage where annotations extend beyond object boundaries
LIP:
- Same annotation holes and label spillage as ATR (both datasets come from the same research group)
- Inconsistent labeling between left/right body parts and clothing
- Aggressive crops from multi-person images causing artifacts
- Ethical concerns (a significant portion of the images include minors)
iMaterialist:
- Higher quality images and annotations overall
- Multi-person images where only one person is labeled (~6% of dataset)
- No body part labels (clothing only)
We documented these findings in detail: Fashion Segmentation Datasets and Their Common Problems
What we did
We curated our own dataset addressing these issues and fine-tuned a SegFormer-B4. The model outputs 18 semantic classes relevant for fashion applications:
- Body parts: face, hair, arms, hands, legs, feet, torso
- Clothing: top, dress, skirt, pants, belt, scarf
- Accessories: bag, hat, glasses, jewelry
- Background
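As a rough illustration of how this schema might be used downstream, the sketch below collapses a per-pixel label mask into a boolean mask for one group (for example, all clothing pixels). The class names are taken from the list above; the group keys, the index order (background first), and the helper `group_mask` are assumptions made for this example and may not match the model's actual label mapping.

```python
import numpy as np

# Class names come from the post; the grouping keys and the index order
# below are assumptions for illustration, not the model's actual id2label.
CLASS_GROUPS = {
    "body": ["face", "hair", "arms", "hands", "legs", "feet", "torso"],
    "clothing": ["top", "dress", "skirt", "pants", "belt", "scarf"],
    "accessories": ["bag", "hat", "glasses", "jewelry"],
}

# Hypothetical ordering: background first, then the groups in listed order.
LABELS = ["background"] + [name for names in CLASS_GROUPS.values() for name in names]
LABEL2ID = {name: idx for idx, name in enumerate(LABELS)}


def group_mask(label_mask: np.ndarray, group: str) -> np.ndarray:
    """Return a boolean mask of pixels whose class belongs to `group`."""
    ids = [LABEL2ID[name] for name in CLASS_GROUPS[group]]
    return np.isin(label_mask, ids)


# Toy 2x3 "segmentation mask" standing in for real model output.
toy = np.array([
    [LABEL2ID["background"], LABEL2ID["hair"], LABEL2ID["top"]],
    [LABEL2ID["dress"], LABEL2ID["bag"], LABEL2ID["pants"]],
])
clothing = group_mask(toy, "clothing")
```

When working with the real model, the class indices should be read from the model's own configuration rather than from the ordering assumed here.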
Technical details
| Spec | Value |
|---|---|
| Architecture | SegFormer-B4 (MIT-B4 encoder + MLP decoder) |
| Input size | 384 x 576 |
| Output | Segmentation mask at input resolution |
| Model size | ~244 MB |
| Inference | ~300 ms GPU, 2-3 s CPU |
The PyPI package resizes inputs with cv2.INTER_AREA during preprocessing (matching training), while the Hugging Face pipeline uses PIL LANCZOS for broader compatibility, so the two paths can produce slightly different masks for the same image.
Limitations
- Optimized for fashion/e-commerce images (single person, relatively clean backgrounds)
- Performance may degrade on crowded scenes or unusual poses
- 18-class schema is fashion-focused; may not suit all human parsing use cases
Happy to discuss the dataset curation process, architecture choices, or answer any questions.