[R] Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages

The paper presents SDF (Structured Data Format), an open JSON protocol for pre-extracting agent-oriented semantic representations from web pages, so that a page is converted once and the result can be cached and reused by many consumers.

Key contributions:

  • Hierarchical type system (10 parent types, 50+ subtypes) with type-conditioned extraction
  • Two-pass pipeline: a QLoRA-fine-tuned 1.5B classifier plus a 3B extractor reaches 90% accuracy at 4.1x the speed of a 14B baseline
  • Five-stage type normalization cascade that corrects 63 taxonomy violations from classifier drift
  • Downstream consumption experiment: 7B and 3B consumer models both answer significantly more accurately from SDF than from raw markdown (0.739 vs 0.352 at 7B, p < 0.05)
  • 99.2% token reduction from HTML, 51.8% from markdown
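
To make the two-pass, type-conditioned idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the taxonomy, the field lists, the function names (`normalize`, `convert`, `stub_classify`, `stub_extract`), and the output keys are all hypothetical, and the `normalize` step is a toy stand-in for the five-stage cascade, whose actual stages are not described in this post. The models are passed in as plain callables, so the stubs below only stand in for the 1.5B classifier and 3B extractor.

```python
from dataclasses import dataclass

# Hypothetical two-level taxonomy: parent type -> allowed subtypes.
# (The real spec has 10 parent types and 50+ subtypes.)
TAXONOMY = {
    "article": {"news", "blog_post"},
    "product": {"listing", "review"},
}

# Hypothetical field lists; the parent type decides what the extractor is asked for.
FIELDS_BY_TYPE = {
    "article": ["title", "author", "summary"],
    "product": ["name", "price", "summary"],
}


@dataclass
class TypeLabel:
    parent: str
    subtype: str


def normalize(label: TypeLabel) -> TypeLabel:
    """Toy stand-in for type normalization: repair labels that violate the taxonomy."""
    parent = label.parent.strip().lower()
    subtype = label.subtype.strip().lower().replace(" ", "_")
    if parent not in TAXONOMY:
        # Unknown parent: recover it from the subtype if some parent owns that subtype.
        for candidate, subtypes in TAXONOMY.items():
            if subtype in subtypes:
                return TypeLabel(candidate, subtype)
        # Nothing recoverable: fall back to a fixed default (an arbitrary choice here).
        return TypeLabel("article", "blog_post")
    if subtype not in TAXONOMY[parent]:
        # Known parent, invalid subtype: pick a deterministic valid subtype.
        subtype = sorted(TAXONOMY[parent])[0]
    return TypeLabel(parent, subtype)


def convert(page_text, classify, extract):
    """Pass 1 classifies the page; pass 2 extracts only the fields for that type."""
    label = normalize(classify(page_text))
    fields = FIELDS_BY_TYPE[label.parent]
    data = extract(page_text, fields)
    return {"type": label.parent, "subtype": label.subtype, "data": data}


# Stub models so the sketch runs without any weights.
def stub_classify(text):
    if "$" in text:
        return TypeLabel("Product", "Review ")  # messy casing and whitespace
    return TypeLabel("essay", "news")           # parent outside the taxonomy


def stub_extract(text, fields):
    return {field: None for field in fields}


if __name__ == "__main__":
    print(convert("Widget for $9.99", stub_classify, stub_extract))
    print(convert("Local election results", stub_classify, stub_extract))
```

The point of the split is that the small classifier's label selects a narrow extraction schema, so the extractor never has to guess which fields matter; the normalization step exists because a drifting classifier can emit labels outside the taxonomy.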

Limitations acknowledged in the paper: ground-truth circularity (SDF serves as its own ground truth in the downstream evaluation), limited consumer model scale (only 7B and 3B), template-based questions, and a small sample (30 documents, 150 questions).

Open weights on HF: https://huggingface.co/sdfprotocol

Spec + schemas: https://github.com/sdfprotocol/sdf

Protocol site: https://sdfprotocol.org

submitted by /u/PlayfulLingonberry73