[R] shadow APIs breaking research reproducibility (arxiv 2603.01919)
just read this paper auditing shadow APIs (third-party services claiming to provide GPT-5/Gemini access). 187 academic papers used these services, and the most popular one has 5,966 citations.
findings are bad. performance divergence up to 47%, safety behavior completely unpredictable, 45% of fingerprint tests failed identity verification
so basically a bunch of research might be built on fake model outputs
this explains some weird stuff i've seen. tried reproducing results from a paper last month, using what they claimed was “gpt-4 via api”. numbers were way off. thought i screwed up the prompts, but maybe they were using a shadow api that wasn't actually gpt-4
paper mentions these services are popular because of payment barriers and regional restrictions. makes sense, but the reproducibility crisis this creates is insane
what's wild is the most cited one has 58k github stars. people trust these things
for anyone doing research: how do you verify you're actually using the official model? the paper suggests fingerprint tests, but that's extra work most people won't do
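for what it's worth, a do-it-yourself version of a fingerprint check doesn't have to be much work. rough sketch below. to be clear, this is not the paper's method, just one way i'd assume you could approximate it: collect answers to a few fixed probe prompts from the official endpoint (temperature 0), collect answers to the same probes from the provider you're unsure about, and see how many line up. the function names, the 0.9 threshold, and the string-similarity metric are all my own placeholders.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not count as a mismatch.
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    # Character-level similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fingerprint_pass_rate(
    reference: dict[str, str],
    candidate: dict[str, str],
    threshold: float = 0.9,  # hypothetical cutoff, tune for your probes
) -> float:
    """Fraction of probes where the candidate output is close to the reference.

    `reference` maps probe prompt -> output saved from the official endpoint.
    `candidate` maps probe prompt -> output from the provider under test.
    A probe missing from `candidate` counts as a failure.
    """
    if not reference:
        raise ValueError("need at least one probe")
    passed = sum(
        1
        for probe, ref_out in reference.items()
        if similarity(ref_out, candidate.get(probe, "")) >= threshold
    )
    return passed / len(reference)
```

caveat: even official models aren't perfectly deterministic at temperature 0, so a pass rate below 1.0 doesn't prove anything by itself. a rate that's far below what you get when comparing the official endpoint against itself is the signal worth chasing.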
this also affects production systems. if you're building something that depends on specific model behavior and your api provider is lying about which model they're serving, your whole system could break randomly
been more careful about this lately. switched my coding tools to ones that use official apis (verdent, cursor with direct keys, etc). costs more, but at least i know what model i'm actually getting. for research work that's probably necessary
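a cheap first line of defense is just checking where your requests actually go. minimal sketch, assuming you can read the base URL your tool or script is configured with. the allowlist here is my own and deliberately short: official access through cloud platforms (Azure, Vertex AI, Bedrock) uses other hostnames, so extend it for your setup. `is_official_endpoint` is a made-up helper name.

```python
from urllib.parse import urlparse

# Assumed allowlist of first-party API hosts. Not exhaustive.
OFFICIAL_HOSTS = {
    "api.openai.com",
    "api.anthropic.com",
    "generativelanguage.googleapis.com",
}


def is_official_endpoint(base_url: str) -> bool:
    # Compare the exact hostname, so lookalikes such as
    # "api.openai.com.example.net" are rejected.
    host = urlparse(base_url).hostname
    return host is not None and host in OFFICIAL_HOSTS
```

this only tells you the request is addressed to an official host. it says nothing about a proxy that sits in between, so it complements a fingerprint check rather than replacing it. logging the base URL next to your results also helps anyone trying to reproduce them later.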
the bigger issue is that this undermines trust in the whole field. how many papers need to be retracted? how many production systems are built on unreliable foundations?
submitted by /u/Electrical-Shape-266