Optimizing LoRA target module selection for efficient fine tuning

Fine-tuning a large language model (LLM) on a specific task requires updates to billions of parameters across trillions of tokens, with the attendant costs in GPU resources and time. Low-rank adaptation (LoRA) is a more efficient alternative that freezes the original model weights but introduces lightweight matrices into specific model sublayers, or “modules”. These matrices (commonly referred to as “adapters”) modify the modules’ weights, enabling not only efficient fine tuning but also on-demand model serving, which dramatically lowers inference costs; base-model sharing across GPUs, which cuts memory requirements; lower download overhead; and parallel inference across multiple adapters.

The question is where to insert these adapters across the model. Empirically, targeting more and larger modules tends to boost performance, because it allows more flexibility in customization; but it also increases training and inference costs. Using a smaller, well-chosen subset preserves most of the gains with significantly better efficiency.

Using Amazon’s Nova 2.0 Lite multimodal reasoning LLM as our base model, we set ourselves the goal of identifying a subset of standardized target-module configurations that works effectively across the vast majority of customer use cases. Through an ablation study, we identified o_proj as the single module where adding an adapter achieves the best trade-off between efficiency and accuracy. (o_proj is a linear transformation that mixes representations across attention heads into a single, cohesive form for the rest of the model to use.)

The Transformer architecture

Transformer models, the models responsible for all of AI’s remarkable recent gains, consist largely of blocks that are repeated multiple times.
Each block in turn has two main components: an attention mechanism, which determines the relevance of previously seen tokens to the token currently being processed, and a feed-forward network, a conventional neural network that does additional processing on the outputs of the attention mechanism.

The attention mechanism involves three different matrices, which take their names from database design: the query matrix represents how relevant the current token is to the other tokens in the input sequence; the key matrix represents how relevant other tokens are to one another; and the value matrix represents the raw content of those other tokens. Multiplying the three matrices together creates, essentially, a recipe for the Transformer’s next output. To reduce computational complexity, these multiplications take place in a space with reduced dimensions. The matrices themselves and the results of their multiplication then have to be projected back up to the original dimensions of the input.

LoRA approximates weight updates using a product of two smaller matrices, drastically reducing the number of trainable parameters. The technique is typically applied to attention projection layers and feed-forward network layers. These modules are ideal candidates because they constitute the bulk of Transformer parameters, directly govern representation learning, and exhibit natural alignment with low-rank approximations. Empirical evidence shows that weight changes in these layers often lie within a low-dimensional subspace during fine tuning.

Target module selection

Selecting the right target modules directly affects accuracy, latency, and computational efficiency. The optimal choice of target modules is primarily a function of (a) the base model being fine-tuned (i.e., its architecture, pre- and post-training data distributions, etc.) and (b) the customization domain and modality.
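To make the low-rank update concrete, here is a minimal sketch of how a frozen weight W is combined with an adapter pair (A, B), and of how few parameters that pair adds for one module. The alpha / rank scaling follows the common LoRA convention; the layer sizes, rank, and alpha below are illustrative assumptions, not Nova 2.0 Lite’s actual dimensions or the settings used in the study.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); the frozen W itself is not modified."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_in, d_out, rank):
    """Parameters in the adapter pair: A is (r x d_in), B is (d_out x r)."""
    return rank * (d_in + d_out)


# Hypothetical sizes, chosen only for illustration.
D_IN, D_OUT, RANK, ALPHA = 4096, 4096, 16, 32

full = D_IN * D_OUT
lora = trainable_params(D_IN, D_OUT, RANK)
print(f"full update: {full:,} params; LoRA update: {lora:,} params "
      f"({100 * lora / full:.2f}% of full)")

# With B initialized to zero, the adapted module starts out identical to the base.
rng = random.Random(0)
w_small = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2)]  # d_out=2, d_in=3
a_small = [[rng.gauss(0, 0.01) for _ in range(3)]]                 # rank 1: 1 x 3
b_small = [[0.0], [0.0]]                                           # d_out x rank
unchanged = lora_effective_weight(w_small, a_small, b_small, alpha=2, rank=1)
print("adapter is a no-op at initialization:", unchanged == w_small)
```

Every additional target module adds another (A, B) pair per Transformer block, so both the trainable-parameter count and the extra inference work grow with the number and size of the modules targeted. That is the cost side of the trade-off discussed in this section.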
When fine-tuning Nova 2.0 Lite, we balanced two competing objectives: maximizing accuracy across diverse tasks and modalities, and minimizing latency to preserve LoRA’s efficiency benefits.

We investigated the application of LoRA to four different modules in each Transformer block: the query, key, and value projection layers (qkv); the o_proj layer; and two different fully connected layers in the feed-forward network, gate_up_proj and gate_down_proj (referred to as fc1 and fc2). Below are the trade-offs for these modules, both singly and in combination, based on results published in the literature and on empirical studies.

| Combination | Expected accuracy | Expected latency | Use case |
|---|---|---|---|
| qkv only | Good (baseline) | Lowest | Resource-constrained environments; tasks where attention mechanisms are critical (e.g., classification, lightweight generation); prioritizes speed over maximum accuracy |
| o_proj only | Moderate | Lowest | Ultralow-latency scenarios; tasks where refining attention outputs is sufficient (e.g., simple sentiment analysis); plays an important role in reasoning; less effective than qkv, but very efficient |
| qkv + o_proj | High | Low to moderate (+5–10%) | Attention-focused tasks (e.g., machine translation, summarization); balances refinement of both attention context (o_proj) and query/key/value projections (qkv); best accuracy-to-latency ratio for most NLP tasks |
| qkv + fc1 / fc2 | Very high (close to full fine tuning) | Moderate (+10–15%) | Complex generation tasks (e.g., translation, long-form summarization); when feed-forward layers (fc1/fc2) significantly influence output quality, as they store and retrieve factual knowledge; prioritizes accuracy over speed |
| o_proj + fc1 / fc2 | Good to high | Moderate (+5–10%) | Tasks requiring adaptation of both attention output (o_proj) and feed-forward layers (e.g., text classification, sentiment analysis); suitable when qkv adaptation is unnecessary |
| qkv + o_proj + fc1 / fc2 | Highest (near-full fine tuning) | High (+15–20%) | Maximum accuracy for critical tasks (e.g., research benchmarks, high-stakes generation); when all components of the Transformer block need adaptation; avoid for production if latency matters |
| All modules (qkv, o_proj, fc1, fc2) | Maximum | Highest (+20–25%) | Prototyping/research with no latency constraints; rarely justified in practice; marginal gains over qkv + o_proj + fc1/fc2 |

Trade-offs of accuracy and latency across target modules, based on literature review and empirical evidence.

Experimental methodology

We conducted a comprehensive ablation study, training multiple supervised-fine-tuning (SFT) LoRA variants on seven datasets spanning both text and visual data, across reasoning tasks (i.e., tasks whose training datasets themselves include reasoning content) and non-reasoning tasks. The datasets covered diverse challenges, from simple question answering to long-context summarization and structured JSON extraction.

| Dataset | Modality | Reasoning traces | Domain | Tasks | Training size | Eval size | Eval metric | Source |
|---|---|---|---|---|---|---|---|---|
| Fin-COT | Text | Yes | Finance | Financial-reasoning dataset. Samples consist of complex financial queries, along with reasoning traces obtained from GPT-4o. Predictions are typically complex tables or calculations based on the input. | 7436 | 1147 | Accuracy | https://huggingface.co/datasets/TheFinAI/FinCoT |
| GovReport | Text | No | Government documents | Large-context (30-40K tokens) summarization | 17457 | 837 | RougeLsum | https://gov-report-data.github.io/ |
| MedMCQA | Text | No | Medical | Dataset for multiple-choice QA (also used in Nova 1.0) | 20k | 3683 | Accuracy | https://huggingface.co/datasets/openlifescienceai/medmcqa |
| MedReason | Text | Yes | Medical | Medical-reasoning dataset that consists of questions and answers compiled from various medical benchmarks (MedQA, MedMCQA, etc.), along with synthetic, high-quality reasoning traces. (This uses the same eval set as MedMCQA.) | 31682 | 3683 | Accuracy | https://huggingface.co/datasets/UCSC-VLAA/MedReason |
| CoCoHD | Text | No | Political documents | A complex benchmark consisting of large-context (>20K tokens) transcripts of congressional hearings. The output is expected to be a summary in a specific JSON format, consisting of the members present, topic discussed, outcomes, etc. | 732 | 1053 | Averaged key and value match rate | https://github.com/gtfintechlab/CoCoHD |
| LLaVA-CoT | Image | Yes | Image understanding, general/science | Multimodal image benchmark consisting of Q&A reasoning questions. The dataset includes high-quality reasoning traces. | 10k | 270 | Exact match rate | https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k |
| Invoice OCR | Image | No | Image understanding | OCR benchmark that takes an input image and produces a JSON file with fields from the image. | 1400 | 447 | Accuracy | |

Summary of the experiment datasets

All experiments used the Nova 2.0 Lite general-availability checkpoint with consistent hyperparameters across target modules, including learning-rate ratio and alpha values.
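The ablation grid described here can be sketched as a list of run configurations that differ only in their target modules. The setting names come from the ablation results; the `LoraRunConfig` class and the rank, alpha, and learning-rate values are hypothetical placeholders, since the article does not state the hyperparameters that were actually used.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Shorthand -> module name, following the article's naming for fc1 and fc2.
MODULE_ALIASES: Dict[str, str] = {
    "qkv": "qkv",
    "o_proj": "o_proj",
    "fc1": "gate_up_proj",
    "fc2": "gate_down_proj",
}

# The nine settings that appear in the ablation results.
SETTINGS: Dict[str, Tuple[str, ...]] = {
    "qkv": ("qkv",),
    "o_proj": ("o_proj",),
    "fc1": ("fc1",),
    "fc2": ("fc2",),
    "o_proj + fc1": ("o_proj", "fc1"),
    "qkv + fc2": ("qkv", "fc2"),
    "o_proj + fc2": ("o_proj", "fc2"),
    "qkv + fc1": ("qkv", "fc1"),
    "all target modules": ("qkv", "o_proj", "fc1", "fc2"),
}

# Hypothetical values: held constant across settings so that only the
# target modules vary from one run to the next.
SHARED = {"rank": 16, "alpha": 32, "learning_rate": 1e-4}


@dataclass(frozen=True)
class LoraRunConfig:
    """One SFT LoRA run in the ablation grid (illustrative structure)."""
    name: str
    target_modules: Tuple[str, ...]
    rank: int
    alpha: int
    learning_rate: float


def build_grid(settings=SETTINGS, shared=SHARED):
    """Expand each named setting into a run config with shared hyperparameters."""
    return [
        LoraRunConfig(
            name=name,
            target_modules=tuple(MODULE_ALIASES[alias] for alias in aliases),
            **shared,
        )
        for name, aliases in settings.items()
    ]


for cfg in build_grid():
    print(f"{cfg.name:>20} -> {', '.join(cfg.target_modules)}")
```

Holding everything except `target_modules` fixed is what lets differences in the results be attributed to module choice rather than to tuning.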
| Target dataset | Setting | SFT LoRA target performance | Nova 2.0 Lite performance |
|---|---|---|---|
| Fin-COT | qkv | 67.09% | 72.12% |
| | o_proj | 68.30% | |
| | fc1 | 75.35% | |
| | fc2 | 60.24% | |
| | o_proj + fc1 | 61.38% | |
| | qkv + fc2 | 60.31% | |
| | o_proj + fc2 | 62.79% | |
| | qkv + fc1 | 68.37% | |
| | All target modules | 66.15% | |
| CoCoHD | qkv | 19.64% | 45.14% |
| | o_proj | 65.88% | |
| | fc1 | 41.96% | |
| | fc2 | 17.62% | |
| | o_proj + fc1 | 76.83% | |
| | qkv + fc2 | 66.47% | |
| | o_proj + fc2 | 79.14% | |
| | qkv + fc1 | 45.45% | |
| | All target modules | 82.75% | |
| GovReport | o_proj | 41.25% | 38.90% |
| | fc1 | 39.69% | |
| | o_proj + fc1 | 41.74% | |
| | o_proj + fc2 | 42.16% | |
| | qkv + fc1 | 41.66% | |
| | qkv + fc2 | 39.02% | |
| | All target modules | 41.95% | |
| LLaVA-CoT | qkv | 64.26% | 16.22% |
| | o_proj | 64.26% | |
| | fc1 | 65.92% | |
| | fc2 | 65.02% | |
| | o_proj + fc1 | 63.21% | |
| | qkv + fc2 | 62.76% | |
| | o_proj + fc2 | 66.37% | |
| | qkv + fc1 | 66.52% | |
| | All target modules | 63.96% | |
| Invoice OCR | o_proj | 89.07% | 14.10% |
| | o_proj + fc1 | 90.03% | |
| | qkv + fc2 | 87.84% | |
| | o_proj + fc2 | 89.47% | |
| | qkv + fc1 | 88.55% | |
| | All target modules | 90.11% | |
| MedReason | o_proj | 24.55% | 1.68% |
| | o_proj + fc1 | 20.88% | |
| | qkv + fc2 | 8.39% | |
| | o_proj + fc2 | 20.36% | |
| | qkv + fc1 | 4.32% | |
| | All target modules | 26.72% | |
| MedMCQA | qkv | 62.18% | 1.68% |
| | o_proj | 63.10% | |
| | fc1 | 12.90% | |
| | fc2 | 59.98% | |
| | o_proj + fc1 | 61.39% | |
| | qkv + fc2 | 65.63% | |
| | o_proj + fc2 | 64.95% | |
| | qkv + fc1 | 57.21% | |
| | All target modules | 66.11% | |

Ablation study for target module selection. Some benchmarks have fewer variations, to save on computation and time. MedMCQA and MedReason use the MedMCQA test set for evaluation. On this task, Nova 2.0 Lite fails mainly due to formatting inconsistencies, even though it produces the right answer. For consistency’s sake, we use the same strict parser for SFT models.

Key findings

1. o_proj is the most robust single target

The o_proj-only configuration demonstrated remarkable consistency, never failing outright on any task and typically performing within a few percentage points of the best configuration (i.e., using all target modules). On MedMCQA, CoCoHD, GovReport, LLaVA-CoT, and Invoice OCR, o_proj-only either matched or came very close to optimal performance, making it an attractive default choice that balances performance and simplicity. There is emerging evidence that this module plays a key role in reasoning, which may explain its effectiveness here.

2. qkv-only shows instability

While qkv-only performed well on MedMCQA, it exhibited extreme variability, performing below baseline on CoCoHD and showing unremarkable results elsewhere. This aligns with the hypothesis that attention-only LoRA can underfit on tasks requiring richer features from the feed-forward network, rather than relying on modified token routing.

3. Module combinations provide modest gains

Combinations like o_proj + fc2 or “all target modules” often achieved the highest per-dataset scores (particularly on CoCoHD, MedReason, and Invoice OCR). However, improvements over the best single module were typically modest, usually 1-3 percentage points.

4. Task difficulty amplifies configuration impact

On challenging benchmarks where the base model performed poorly, the choice of target modules had greater impact. For example, on CoCoHD (long-context, complex JSON generation), o_proj + fc2 scored 79.14%, about 34 percentage points above the base model (45.14%), compared to about 21 points for o_proj alone (65.88%).

5. LoRA consistently outperforms base models

Across nearly all datasets, any reasonable LoRA configuration dramatically outperformed the base model. For instance, MedMCQA, LLaVA-CoT, and Invoice OCR improved from a baseline accuracy of roughly 1-16% to 60-90% with LoRA, and MedReason rose from 1.68% to as much as 26.72%. The notable exception was Fin-COT, where only certain configurations (notably fc1) exceeded baseline performance, suggesting task-specific sensitivity to adaptation strategy.

Recommendations

For accuracy-prioritized scenarios, we recommend o_proj + fc2 as the optimal configuration for both text and multimodal tasks, showing 2-12% improvements over o_proj alone across benchmarks.

For balanced efficiency and performance, o_proj-only provides an excellent default, offering robust performance with minimal latency overhead, which is particularly valuable when serving multiple adapters or operating under resource constraints.
For challenging tasks, such as benchmarks with long context or complex generation requirements or other tasks where base models struggle, the additional accuracy from o_proj + fc2 justifies the modest latency increase.

Future directions

Our research opens several promising avenues for further optimization:

Modality and task-specific configurations: Segmenting target module selection by modality and task difficulty (e.g., long-context scenarios) could yield specialized configurations with better accuracy-latency trade-offs.

Per-module hyperparameter optimization: Extensive hyperparameter optimization for each target module configuration could unlock additional performance gains, though computational costs remain a consideration.

Two-stage LoRA for early candidate identification: Leveraging two-stage LoRA approaches that use training dynamics, gradients, etc., to determine the importance of different modules/layers could help identify promising configurations early in training, reducing the cost of comprehensive hyperparameter searches.

Layer pruning for latency reduction: Using two-stage training to identify and prune unused layers could further reduce inference latency while maintaining accuracy.

Conclusion

Our comprehensive study demonstrates that thoughtful target module selection in LoRA fine tuning can improve accuracy while preserving the efficiency advantages that make LoRA attractive for production deployments. The o_proj layer emerges as a remarkably robust single target, while o_proj + fc2 combinations offer the best accuracy for challenging tasks. On average, o_proj LoRA is within 2% of o_proj + fc2 in terms of accuracy but has 22.6% lower latency (p95 time per output token, or TPOT, drops from 10.085 ms to 7.803 ms). These findings provide a principled foundation for standardizing LoRA configurations across diverse customer use cases, balancing the competing demands of model performance and computational efficiency.
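As a quick check on the two headline numbers in the conclusion, the sketch below recomputes them from figures reported in this article. It assumes the accuracy comparison is an unweighted mean over the seven datasets in the ablation table, which the article does not state explicitly; the function names are illustrative.

```python
# (o_proj, o_proj + fc2) scores in percent, copied from the ablation table.
SCORES = {
    "Fin-COT": (68.30, 62.79),
    "CoCoHD": (65.88, 79.14),
    "GovReport": (41.25, 42.16),
    "LLaVA-CoT": (64.26, 66.37),
    "Invoice OCR": (89.07, 89.47),
    "MedReason": (24.55, 20.36),
    "MedMCQA": (63.10, 64.95),
}

# p95 time per output token in milliseconds, from the conclusion.
TPOT_P95_MS = {"o_proj + fc2": 10.085, "o_proj": 7.803}


def mean_gap(scores):
    """Unweighted mean of (o_proj + fc2) minus (o_proj), in percentage points."""
    gaps = [combo - single for single, combo in scores.values()]
    return sum(gaps) / len(gaps)


def relative_reduction(before, after):
    """Fractional decrease going from `before` to `after`."""
    return (before - after) / before


gap = mean_gap(SCORES)
saving = relative_reduction(TPOT_P95_MS["o_proj + fc2"], TPOT_P95_MS["o_proj"])

print(f"mean accuracy gap: {gap:.2f} percentage points")
print(f"TPOT p95 reduction: {saving:.1%}")
```

Under that assumption the mean gap comes to about 1.26 percentage points in favor of o_proj + fc2, consistent with “within 2%”, and the latency reduction comes to 22.6%. The per-dataset gaps vary widely (from roughly -5.5 points on Fin-COT to +13.3 on CoCoHD), so the average is best read alongside the full table.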
Acknowledgements: Kevin Rondinone, Kevin Chen, Nicole Ding, Sebastian Massella, Andy Li
