[P] Open-sourcing a human parsing model trained on curated data to address ATR/LIP/iMaterialist quality issues

We’re releasing FASHN Human Parser, a SegFormer-B4 fine-tuned for human parsing in fashion contexts.

Background: Dataset quality issues

Before training our own model, we spent time analyzing the commonly used datasets for human parsing: ATR, LIP, and iMaterialist. We found consistent quality issues that affect models trained on them:

ATR:

  • Annotation “holes” where background pixels appear inside labeled regions
  • Label spillage where annotations extend beyond object boundaries

LIP:

  • Same issues as ATR (both datasets come from the same research group)
  • Inconsistent labeling between left/right body parts and clothing
  • Aggressive crops from multi-person images causing artifacts
  • Ethical concerns (a significant portion of the images includes minors)

iMaterialist:

  • Higher quality images and annotations overall
  • Multi-person images where only one person is labeled (~6% of dataset)
  • No body part labels (clothing only)

We documented these findings in detail: Fashion Segmentation Datasets and Their Common Problems
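
The annotation "holes" described above can be screened for mechanically. The following is a minimal sketch, not the tooling we used: it assumes a mask stored as a 2D list of integer class IDs with 0 as background (both are assumptions for illustration), and the function name `find_holes` is hypothetical. It flood-fills background from the image border and reports any background pixel that was never reached, meaning it is fully enclosed by labeled pixels.

```python
from collections import deque


def find_holes(mask, background=0):
    """Return (row, col) positions of background pixels enclosed by labels.

    `mask` is a 2D list of integer class IDs. The background ID of 0 is an
    assumption for this sketch, not something taken from any dataset spec.
    """
    height, width = len(mask), len(mask[0])
    reached = [[False] * width for _ in range(height)]
    queue = deque()

    # Seed the flood fill with every background pixel on the image border.
    for y in range(height):
        for x in range(width):
            on_border = y in (0, height - 1) or x in (0, width - 1)
            if on_border and mask[y][x] == background:
                reached[y][x] = True
                queue.append((y, x))

    # Spread through 4-connected background pixels.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < height and 0 <= nx < width
                    and not reached[ny][nx]
                    and mask[ny][nx] == background):
                reached[ny][nx] = True
                queue.append((ny, nx))

    # Background that the border fill never reached is enclosed.
    return [(y, x)
            for y in range(height)
            for x in range(width)
            if mask[y][x] == background and not reached[y][x]]


# A ring of class 4 around a single unlabeled pixel.
ring = [
    [0, 0, 0, 0, 0],
    [0, 4, 4, 4, 0],
    [0, 4, 0, 4, 0],
    [0, 4, 4, 4, 0],
    [0, 0, 0, 0, 0],
]
print(find_holes(ring))  # [(2, 2)]
```

Enclosed background is not always an error (the gap between an arm and the torso is legitimately background), so a check like this is better treated as a way to surface candidates for manual review than as an automatic fix.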

What we did

We curated our own dataset addressing these issues and fine-tuned a SegFormer-B4. The model outputs 18 semantic classes relevant for fashion applications:

  • Body parts: face, hair, arms, hands, legs, feet, torso
  • Clothing: top, dress, skirt, pants, belt, scarf
  • Accessories: bag, hat, glasses, jewelry
  • Background
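
For downstream use it is often convenient to collapse the 18 classes into the four groups listed above. The sketch below is illustrative only: the class names come from the list, but the integer IDs in the example are made up, and the real ID-to-name mapping should be read from the released model's configuration (for Hugging Face checkpoints this is typically `config.id2label`). The helper name `group_pixel_counts` is hypothetical.

```python
from collections import Counter

# Class names as listed in the post, grouped the same way.
CLASS_GROUPS = {
    "body": ["face", "hair", "arms", "hands", "legs", "feet", "torso"],
    "clothing": ["top", "dress", "skirt", "pants", "belt", "scarf"],
    "accessories": ["bag", "hat", "glasses", "jewelry"],
    "background": ["background"],
}

# Reverse lookup: class name -> group name.
LABEL_TO_GROUP = {
    label: group
    for group, labels in CLASS_GROUPS.items()
    for label in labels
}


def group_pixel_counts(mask, id_to_label):
    """Count pixels per group in a 2D mask of integer class IDs.

    `id_to_label` must be supplied by the caller; this sketch does not
    know the model's actual ID ordering.
    """
    counts = Counter()
    for row in mask:
        for class_id in row:
            counts[LABEL_TO_GROUP[id_to_label[class_id]]] += 1
    return dict(counts)


# Made-up IDs, for demonstration only.
demo_ids = {0: "background", 1: "face", 2: "top"}
demo_mask = [
    [0, 1, 1],
    [0, 2, 2],
]
print(len(LABEL_TO_GROUP))                      # 18
print(group_pixel_counts(demo_mask, demo_ids))  # 2 pixels in each of three groups
```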

Technical details

Spec          Value
Architecture  SegFormer-B4 (MiT-B4 encoder + MLP decoder)
Input size    384 x 576
Output        Segmentation mask at input resolution
Model size    ~244 MB
Inference     ~300 ms on GPU, 2-3 s on CPU

The PyPI package resizes inputs with cv2.INTER_AREA, matching the preprocessing used during training, while the Hugging Face pipeline resizes with PIL LANCZOS for broader compatibility.
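
To give a feel for why the resampling filter matters, here is a simplified 1D sketch in pure Python. It is not the code either library runs. It assumes that area interpolation at an integer downscale factor behaves like a plain box average, and that Lanczos downscaling uses a 3-lobe windowed-sinc kernel stretched by the scale factor; both function names are hypothetical. On a hard edge the box average stays within the input range, while the Lanczos kernel's negative lobes overshoot it (real image libraries then clip to the valid pixel range).

```python
import math


def lanczos3(x):
    """3-lobe Lanczos kernel: sinc(x) * sinc(x / 3) for |x| < 3."""
    if x == 0.0:
        return 1.0
    if abs(x) >= 3.0:
        return 0.0
    px = math.pi * x
    return 3.0 * math.sin(px) * math.sin(px / 3.0) / (px * px)


def downsample_area(signal, factor):
    """Box-average each non-overlapping block of `factor` samples."""
    return [
        sum(signal[i:i + factor]) / factor
        for i in range(0, len(signal) - factor + 1, factor)
    ]


def downsample_lanczos(signal, factor):
    """Downsample by an integer factor with a stretched Lanczos-3 kernel."""
    out = []
    for i in range(len(signal) // factor):
        center = (i + 0.5) * factor  # output sample center in input coords
        lo = max(0, math.floor(center - 3 * factor))
        hi = min(len(signal), math.ceil(center + 3 * factor))
        weights = [lanczos3((j + 0.5 - center) / factor) for j in range(lo, hi)]
        total = sum(weights)  # renormalize, including at the borders
        value = sum(w * signal[j] for w, j in zip(weights, range(lo, hi)))
        out.append(value / total)
    return out


# A hard edge, similar to a garment boundary against a plain background.
edge = [0] * 16 + [255] * 16
area = downsample_area(edge, 4)
lanczos = downsample_lanczos(edge, 4)

print(area)
print([round(v, 1) for v in lanczos])
print(min(lanczos) < 0, max(lanczos) > 255)  # True True (overshoot at the edge)
```

The differences are small and concentrated near boundaries, which is where a segmentation model's predictions are most sensitive, so matching the training-time resize is the safer default when exact reproducibility matters.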

Limitations

  • Optimized for fashion/e-commerce images (single person, relatively clean backgrounds)
  • Performance may degrade on crowded scenes or unusual poses
  • 18-class schema is fashion-focused; may not suit all human parsing use cases

Happy to discuss the dataset curation process, architecture choices, or answer any questions.

submitted by /u/JYP_Scouter