Spine surgery has massive decision variability. Retrospective ML won’t fix it. Curious if a workflow-native, outcome-driven approach could. [D]
Hi everyone, I’m a fellowship-trained neurosurgeon / spine surgeon. I’ve been discussing a persistent problem in our field with other surgeons for a while, and I wanted to run it by people who think about ML systems, not just model performance.
I’m trying to pressure-test whether a particular approach is even technically sound, where it would break, and what I’m likely underestimating. I’d love to find someone interested in a discussion to get a 10,000-foot view of the scope of what I’m trying to accomplish.
The clinical problem:
For the same spine pathology and very similar patient presentations, you can see multiple reputable surgeons and get very different surgical recommendations: anything from continued conservative management to decompression, short fusion, or long multilevel constructs. Costs and outcomes vary widely.
This isn’t because surgeons are careless. It’s because spine surgery operates with:
- Limited prospective evidence
- Inconsistent documentation
- Weak outcome feedback loops
- Retrospective datasets that are biased, incomplete, and poorly labeled
EMRs are essentially digital paper charts. PACS is built for viewing images, not capturing decision intent. Surgical reasoning is visual, spatial, and 3D, yet we reduce it to free-text notes after the fact. From a data perspective, the learning signal is pretty broken.
Why I’m skeptical that training on existing data works:
- “Labels” are often inferred indirectly (billing codes, op notes)
- Surgeon decision policies are non-stationary
- Available datasets are institution-specific and access-restricted
- Selection bias is extreme (who gets surgery vs who doesn’t is itself a learned policy)
- Outcomes are delayed, noisy, and confounded
Even with access, I’m not convinced retrospective supervision converges to something clinically useful.
The idea I’m exploring:
Instead of trying to clean bad data later, what if the workflow itself generated structured, high-fidelity labels as a byproduct of doing the work (or at least most of it)?
Concretely, I’m imagining an EMR-adjacent, spine-specific surgical planning and case monitoring environment that surgeons would actually want to use. Not another PACS viewer, but a system that allows:
- 3D reconstruction from pre-op imaging
- Automated calculation of alignment parameters
- Explicit marking of anatomic features tied to symptoms
- Surgical plan modeling (levels, implants, trajectories, correction goals)
- Structured logging of surgical cases (to derive patterns and analyze for trends)
- Productivity tooling (note generation, auto-populated plans, etc.)
- Standardized, automated patient outcome data collection
The key point isn’t the UI, although UI is also an area that currently suffers. It’s that surgeons would be forced (in a useful way) to externalize decision intent in a structured format, because it directly helps them plan cases and generate documentation. Labeling wouldn’t feel like labeling; it would just be how you work. The data used for learning would explicitly include post-operative outcomes: PROMs collected at standardized intervals, complications (SSI, reoperation), operative time, etc., with automated follow-up built into the system.
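To make "structured decision intent plus outcomes" a bit more concrete as data, here is a minimal sketch of what a case record might look like. This is purely illustrative; every field name here is a hypothetical example, not a proposed standard.

```python
# Minimal sketch of a structured case record (hypothetical field names).
# The point is that plan intent and outcomes are first-class, queryable
# fields rather than free text recovered after the fact.
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class PlannedLevel:
    level: str                # e.g. "L4-L5"
    decompression: bool
    fusion: bool
    implant: Optional[str]    # implant type / trajectory, if any
    rationale: str            # which anatomic finding or symptom this addresses

@dataclass
class SurgicalPlan:
    diagnosis_codes: list[str]
    alignment_params: dict[str, float]  # e.g. {"pelvic_incidence": 54.0, "lumbar_lordosis": 41.0}
    levels: list[PlannedLevel]
    correction_goals: dict[str, float]  # targeted alignment changes
    stated_objective: str               # e.g. "pain_relief" vs "durability"

@dataclass
class OutcomeRecord:
    collected_on: date
    proms: dict[str, float]             # PROM scores at a standardized interval
    complications: list[str]            # e.g. ["SSI"], ["reoperation"]
    operative_time_min: Optional[int] = None

@dataclass
class CaseRecord:
    case_id: str
    surgeon_id: str
    plan: SurgicalPlan
    executed_plan: Optional[SurgicalPlan] = None      # what was actually done
    outcomes: list[OutcomeRecord] = field(default_factory=list)
```

Even a schema this crude would let you query "similar cases, different plans, different outcomes" directly, which is exactly the divergence analysis mentioned further down.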
The goal would not be to replicate surgeon decisions, but to learn decision patterns that are associated with better outcomes. Surgeons could specify what they want to optimize for a given patient (e.g., pain relief vs. complication risk vs. durability), and the system would generate predictions conditioned on those objectives.
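One simple way to frame "predictions conditioned on objectives" is a model that outputs several calibrated outcome estimates per candidate plan, which are then combined with surgeon-specified weights. A rough sketch, assuming a hypothetical `outcome_model` exists and is trained elsewhere:

```python
def rank_plans(candidate_plans, outcome_model, objective_weights):
    """Score candidate plans by a weighted combination of predicted outcomes.

    outcome_model(plan) is assumed to return a dict of calibrated predictions,
    e.g. {"pain_relief": 0.70, "complication_risk": 0.12, "reoperation_5yr": 0.08}.
    objective_weights encodes what this surgeon/patient wants to optimize,
    e.g. {"pain_relief": 1.0, "complication_risk": -2.0, "reoperation_5yr": -1.0}.
    """
    scored = []
    for plan in candidate_plans:
        preds = outcome_model(plan)
        score = sum(objective_weights.get(k, 0.0) * v for k, v in preds.items())
        scored.append((score, plan))
    scored.sort(key=lambda t: t[0], reverse=True)  # best first
    return scored
```

The hard part is obviously the outcome model and its calibration, not the weighting; this is just to show where the surgeon's stated objective would enter.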
Over time, this would generate:
- Surgeon-specific decision + outcome datasets
- Aggregate cross-surgeon data
- Explicit representations of surgical choices, not just endpoints
Learning systems could then train on:
- Individual surgeon decision–outcome mappings
- Population-level patterns
- Areas of divergence where similar cases lead to different choices and outcomes
Where I’m unsure, and why I’m posting here:
From an ML perspective, I’m trying to understand:
- Given delayed, noisy outcomes, is this best framed as supervised prediction or closer to learning decision policies under uncertainty? (See the sketch after this list.)
- How feasible is it to attribute outcome differences to surgical decisions rather than execution, environment, or case selection?
- Does it make sense to learn surgeon-specific decision–outcome mappings before attempting cross-surgeon generalization?
- How would you prevent optimizing for measurable metrics (PROMs, SSI, etc) at the expense of unmeasured but important patient outcomes?
- Which outcome signals are realistically usable for learning, and which are too delayed or confounded?
- What failure modes jump out immediately?
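For the first two questions above, one framing I keep running into is offline (off-policy) evaluation from observational data: treat each surgeon's decision process as a behavior policy and estimate what outcomes a candidate policy would have produced, using importance weighting. A minimal sketch follows; the propensity and target-policy probabilities are assumed to come from models built elsewhere, and this does nothing by itself about unmeasured confounding or case selection.

```python
import numpy as np

def ipw_value_estimate(outcomes, behavior_probs, target_probs, clip=10.0):
    """Self-normalized inverse-propensity-weighted estimate of the mean
    outcome under a candidate decision policy, from logged case data.

    outcomes:       observed outcome scores for the logged decisions
    behavior_probs: estimated probability the logging surgeon chose the
                    logged action given the case (propensity model)
    target_probs:   probability the candidate policy would choose that
                    same action for that case
    clip:           weight clipping to control variance from rare actions
    """
    outcomes = np.asarray(outcomes, dtype=float)
    w = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    w = np.clip(w, 0.0, clip)
    return float(np.sum(w * outcomes) / np.sum(w))
```

Whether estimators like this are usable here depends entirely on overlap (do different surgeons ever make different choices on comparable cases?) and on how much of the case-selection signal is actually captured in the structured record.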
I’m also trying to get a realistic sense of:
- The data engineering complexity this implies
- Rough scale of compute once models actually exist
- The kind of team required to even attempt this (beyond just training models)
I know there are a lot of missing details. If anyone here has worked on complex ML systems tightly coupled to real-world workflows (medical imaging, decision support, etc) and finds this interesting, I’d love to continue the discussion privately or over Zoom. Maybe we can collaborate on some level!
Appreciate any critique, especially the uncomfortable kind!
submitted by /u/LaniakeaResident