Per-pixel bounding-box regression + DBSCAN for handwritten word detection – visual walkthrough of WordDetectorNet [P]
Figure: Overview of the WordDetectorNN architecture.

Sharing a visual breakdown of WordDetectorNet, Harald Scheidl’s handwritten-word detection model. I think the design choice at its core is unusual enough to be worth a closer look – and I haven’t seen it written up in detail anywhere else.

The mechanism: instead of anchor-based detection + NMS, every pixel the network classifies as a “word pixel” also regresses 4 scalar distances (top/right/bottom/left) to the edges of the enclosing bounding box. Each word pixel therefore reconstructs one candidate box, producing thousands of overlapping candidates per word. These are then collapsed with DBSCAN.

Architecture: ResNet18 backbone (modified to 1-channel grayscale input, with intermediate features exposed after each residual block) → FPN-style decoder that upsamples and concatenates features at all scales → head producing 6 output channels per pixel (2 segmentation logits + 4 distance values). Loss = cross-entropy + IoU, equally weighted. Trained on IAM with 448×448 inputs → 224×224 outputs.

What I find interesting about the design:
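As an aside on the mechanism described above, here is a minimal, dependency-free sketch of the post-processing: decode one candidate box per word pixel, cluster the candidates, and merge each cluster into a single box. Several things here are my assumptions and not taken from the original code: the function names (`decode_boxes`, `dbscan`, `merge_clusters`) are hypothetical, the clustering distance is 1 − IoU, the `eps` and `min_samples` values are arbitrary, and clusters are merged by taking the coordinate-wise median. The tiny DBSCAN is hand-written only to keep the example self-contained.

```python
from statistics import median

def decode_boxes(seg_mask, dists):
    """One candidate box (x_min, y_min, x_max, y_max) per predicted word pixel.

    seg_mask[y][x] is True for word pixels; dists[y][x] holds the predicted
    (top, right, bottom, left) distances from that pixel to the box edges.
    """
    boxes = []
    for y, row in enumerate(seg_mask):
        for x, is_word in enumerate(row):
            if is_word:
                top, right, bottom, left = dists[y][x]
                boxes.append((x - left, y - top, x + right, y + bottom))
    return boxes

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def dbscan(boxes, eps=0.5, min_samples=3):
    """Minimal DBSCAN with distance 1 - IoU (assumed metric). Label -1 means noise."""
    n = len(boxes)
    neighbors = [[j for j in range(n) if 1.0 - iou(boxes[i], boxes[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels

def merge_clusters(boxes, labels):
    """Collapse each cluster to the coordinate-wise median box (assumed merge rule)."""
    clusters = {}
    for box, label in zip(boxes, labels):
        if label != -1:
            clusters.setdefault(label, []).append(box)
    return [tuple(median(c) for c in zip(*members))
            for _, members in sorted(clusters.items())]

# Toy input: two words on a 6x12 grid, with perfect per-pixel predictions.
words = [(1, 1, 5, 4), (7, 1, 11, 4)]
H, W = 6, 12
seg_mask = [[False] * W for _ in range(H)]
dists = [[None] * W for _ in range(H)]
for x0, y0, x1, y1 in words:
    for y in range(y0 + 1, y1):
        for x in range(x0 + 1, x1):
            seg_mask[y][x] = True
            dists[y][x] = (y - y0, x1 - x, y1 - y, x - x0)

boxes = decode_boxes(seg_mask, dists)
labels = dbscan(boxes)
words_found = merge_clusters(boxes, labels)
```

In the toy input every word pixel votes for exactly the same box, so each cluster is trivially tight; with real predictions the votes scatter and the median makes the merged box robust to outliers. In practice you would likely swap the hand-written loop for scikit-learn's `DBSCAN` with `metric="precomputed"` and a 1 − IoU distance matrix.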
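To illustrate the "cross-entropy + IoU, equally weighted" loss mentioned above, here is a rough per-pixel sketch in plain Python. Treat it as one plausible reading, not the original implementation: the channel order (2 logits first, then 4 distances), the −log(IoU) form of the box term, applying the box term only on word pixels, and the names `pixel_iou`, `cross_entropy` and `pixel_loss` are all my assumptions. The useful observation is that both boxes contain the pixel, so the intersection can be computed directly from the distances.

```python
import math

def pixel_iou(pred, target):
    """IoU of two boxes given as (top, right, bottom, left) distances from the same pixel.

    Assumes all distances are non-negative and the target box has positive area.
    """
    pt, pr, pb, pl = pred
    tt, tr, tb, tl = target
    inter_h = min(pt, tt) + min(pb, tb)
    inter_w = min(pl, tl) + min(pr, tr)
    inter = inter_w * inter_h
    union = (pt + pb) * (pl + pr) + (tt + tb) * (tl + tr) - inter
    return inter / union

def cross_entropy(logits, label):
    """Softmax cross-entropy for one pixel; label is 0 (background) or 1 (word)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def pixel_loss(output, label, target_dists):
    """output: 6 values for one pixel, assumed to be 2 logits followed by 4 distances."""
    seg_logits, pred_dists = output[:2], output[2:]
    loss = cross_entropy(seg_logits, label)
    if label == 1:  # assumption: the box term only applies to word pixels
        iou = max(pixel_iou(pred_dists, target_dists), 1e-9)  # guard against log(0)
        loss += -math.log(iou)  # equal weighting: both terms are simply summed
    return loss
```

For example, `pixel_iou((1, 1, 1, 1), (2, 2, 2, 2))` compares a 2×2 predicted box with a 4×4 target around the same pixel and yields 0.25. A training loss would average `pixel_loss` over all pixels of the output map.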
What I don’t like about the design:
Full visual write-up with figures (one per pipeline stage + an architecture diagram): https://lellep.xyz/blog/worddetectornet-visually-explained.html

Credit where credit is due: the original architecture is by Harald Scheidl, see https://github.com/githubharald/WordDetectorNN

submitted by /u/martin_lellep