Weight-Space Detection of Backdoors in LoRA Adapters

arXiv:2602.15195v1 Announce Type: new
Abstract: LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, because adapters are shared through open repositories such as the Hugging Face Hub, they are vulnerable to backdoor attacks. Current detection methods require running the model on test inputs, which makes them impractical for screening thousands of adapters whose backdoor triggers are unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model, making our method data-agnostic. Our method extracts simple spectral statistics, namely how concentrated the singular values are, their entropy, and the shape of their distribution, and flags adapters that deviate from the patterns of clean adapters. We evaluate the method on 500 LoRA adapters (400 clean, 100 poisoned) trained for Llama-3.2-3B on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE. We achieve 97% detection accuracy with fewer than 2% false positives.
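The abstract does not include code, but the statistics it names are straightforward to sketch. Below is a minimal illustration, assuming an adapter is given as its low-rank factors A and B: it forms the effective update delta_W = B @ A, computes the singular values, and derives concentration, entropy, and distribution-shape features. All names here (lora_svd_stats, the smoothing constants, the demo shapes) are hypothetical and not the authors' implementation.

import numpy as np

def lora_svd_stats(A: np.ndarray, B: np.ndarray) -> dict:
    """Spectral statistics of the effective LoRA update delta_W = B @ A.

    A has shape (r, d_in) and B has shape (d_out, r), so delta_W has
    rank at most r and only its top-r singular values are nonzero.
    """
    r = A.shape[0]
    s = np.linalg.svd(B @ A, compute_uv=False)[:r]  # top-r singular values
    p = s / s.sum()                                 # spectral mass distribution
    mu, sigma = s.mean(), s.std()
    return {
        # concentration: share of spectral mass in the leading direction
        "top1_ratio": float(p[0]),
        # spectral entropy: low when the update concentrates on few directions
        "entropy": float(-(p * np.log(p + 1e-12)).sum()),
        # shape of the singular-value distribution
        "skewness": float(((s - mu) ** 3).mean() / (sigma ** 3 + 1e-12)),
        "kurtosis": float(((s - mu) ** 4).mean() / (sigma ** 4 + 1e-12)),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r, d = 16, 512  # illustrative rank and width; Llama-3.2-3B layers are wider
    A = rng.normal(scale=0.02, size=(r, d))
    B = rng.normal(scale=0.02, size=(d, r))
    print(lora_svd_stats(A, B))

A screener along these lines would collect such statistics from known-clean adapters and flag any adapter whose values fall outside the empirical range, for example via a z-score threshold; the abstract does not specify which anomaly rule the authors use.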
