How to build effective reward functions with AWS Lambda for Amazon Nova model customization

Building effective reward functions can help you customize Amazon Nova models to your specific needs, with AWS Lambda providing the scalable, cost-effective foundation. Lambda’s serverless architecture lets you focus on defining quality criteria while it handles the computational infrastructure.

Amazon Nova offers multiple customization approaches, with Reinforcement fine-tuning (RFT) standing out for its ability to teach models desired behaviors through iterative feedback. Unlike Supervised fine-tuning (SFT) that requires thousands of labeled examples with annotated reasoning paths, RFT learns from evaluation signals on final outputs. At the heart of RFT lies the reward function—a scoring mechanism that guides the model toward better responses.

This post demonstrates how Lambda enables scalable, cost-effective reward functions for Amazon Nova customization. You’ll learn to choose between Reinforcement Learning via Verifiable Rewards (RLVR) for objectively verifiable tasks and Reinforcement Learning via AI Feedback (RLAIF) for subjective evaluation, design multi-dimensional reward systems that help you prevent reward hacking, optimize Lambda functions for training scale, and monitor reward distributions with Amazon CloudWatch. Working code examples and deployment guidance are included to help you start experimenting.

Building code-based rewards using AWS Lambda

You have multiple pathways to customize foundation models, each suited to different scenarios. SFT excels when you have clear input-output examples and want to teach specific response patterns—it’s particularly effective for tasks like classification, named entity recognition, or adapting models to domain-specific terminology and formatting conventions. SFT works well when the desired behavior can be demonstrated through examples, making it ideal for teaching consistent style, structure, or factual knowledge transfer.However, some customization challenges require a different approach. When applications need models to balance multiple quality dimensions simultaneously—like customer service responses that must be accurate, empathetic, concise, and brand-aligned simultaneously —or when creating thousands of annotated reasoning paths proves impractical, reinforcement-based methods offer a better alternative. RFT addresses these scenarios by learning from evaluation signals rather than requiring exhaustive labeled demonstrations of correct reasoning processes.

AWS Lambda-based reward functions simplifies this through feedback-based learning. Instead of showing the model thousands of effective examples, you provide prompts and define evaluation logic that scores responses—then the model learns to improve through iterative feedback. This approach requires fewer labelled examples while giving you precise control over desired behaviors. Multi-dimensional scoring captures nuanced quality criteria that prevent models from exploiting shortcuts, while Lambda’s serverless architecture handles variable training workloads without infrastructure management. The result is Nova customization that’s accessible to developers without deep machine learning expertise, yet flexible enough for sophisticated production use cases.

How AWS Lambda based rewards work

The RFT architecture uses AWS Lambda as a serverless reward evaluator that integrates with Amazon Nova training pipeline, creating an feedback loop that guides model learning. The process begins when your training job generates candidate responses from the Nova model for each training prompt. These responses flow to your Lambda function, which evaluates their quality across dimensions like correctness, safety, formatting, and conciseness. The function then returns scalar numerical scores—typically in the -1 to 1 range as a best practice. Higher scores guide the model to reinforce the behaviors that produced them, while lower scores guide it away from patterns that led to poor responses. This cycle repeats thousands of times throughout training, progressively shaping the model toward responses that consistently earn higher rewards.

The architecture brings together several AWS services in a cohesive customization solution. Lambda executes your reward evaluation logic with automatic scaling that handles variable training demands without requiring you to provision or manage infrastructure. Amazon Bedrock provides the fully managed RFT experience with integrated Lambda support, offering AI judge models for RLAIF implementations through a simple Application Programming Interface (API). For teams needing advanced training control, Amazon SageMaker AI offers options through Amazon SageMaker AI Training Jobs and Amazon SageMaker AI HyperPod, both supporting the same Lambda-based reward functions. Amazon CloudWatch monitors Lambda performance in real-time, logs detailed debugging information about reward distributions and training progress, and triggers alerts when issues arise. At the foundation sits Amazon Nova itself—models with customization recipes optimized across a wide variety of use cases that respond effectively to the feedback signals your reward functions provide

This serverless approach makes Nova customization cost-effective. Lambda automatically scales from handling 10 concurrent evaluations per second during initial experimentation to 400+ evaluations during production training, without infrastructure tuning or capacity planning. Your single Lambda function can assess multiple quality criteria simultaneously, providing the nuanced, multi-dimensional feedback that prevents models from exploiting simplistic scoring shortcuts. The architecture supports both objective verification through RLVR—running code against test cases or validating structured outputs—and subjective judgment through RLAIF, where AI models evaluate qualities like tone and helpfulness. You pay only for actual compute time during evaluation with millisecond billing granularity, making experimentation affordable while keeping production costs proportional to training intensity. Perhaps most valuable for iterative development, Lambda functions save as reusable “Evaluator” assets in Amazon SageMaker AI Studio, enabling you to maintain consistent quality measurement as you refine your customization strategy across multiple training runs.

Choosing the right rewards mechanism

The foundation of successful RFT is choosing the right feedback mechanism. Two complementary approaches serve different use cases: RLVR and RLAIF are two techniques used to fine-tune large language models (LLMs) after their initial training. Their major difference lies in how they provide feedback to the model.

RLVR (Reinforcement Learning via Verifiable Rewards)

RLVR uses deterministic code to verify objective correctness. RLVR is designed for domains where a “correct” answer can be mathematically or logically verified, for example, solving a math problem. RLVR uses deterministic functions to grade outputs instead of a learned reward model. RLVR fails for tasks like creative writing or brand voice where no absolute ground truth exists.

  • Best for: Code generation, mathematical reasoning, structured output tasks
  • Example: Running generated code against test cases, validating API responses, checking calculation accuracy
  • Advantage: Reliable, auditable, deterministic scoring

RLVR functions programmatically verify correctness against ground truth. Here in this example doing sentiment analysis.

from typing import List
import json
import random

from dataclasses import asdict, dataclass

import re
from typing import Optional


def extract_answer_nova(solution_str: str) -> Optional[str]:
    """Extract sentiment polarity from Nova-formatted response for chABSA."""
    # First try to extract from solution block
    solution_match = re.search(r'<|begin_of_solution|>(.*?)<|end_of_solution|>', solution_str, re.DOTALL)
    if solution_match:
        solution_content = solution_match.group(1)
        # Look for boxed format in solution block
        boxed_matches = re.findall(r'\boxed{([^}]+)}', solution_content)
        if boxed_matches:
            return boxed_matches[-1].strip()
    
    # Fallback: look for boxed format anywhere
    boxed_matches = re.findall(r'\boxed{([^}]+)}', solution_str)
    if boxed_matches:
        return boxed_matches[-1].strip()
    
    # Last resort: look for sentiment keywords
    solution_lower = solution_str.lower()
    for sentiment in ['positive', 'negative', 'neutral']:
        if sentiment in solution_lower:
            return sentiment
    
    return None


def normalize_answer(answer: str) -> str:
    """Normalize answer for comparison."""
    return answer.strip().lower()


def compute_score(
    solution_str: str,
    ground_truth: str,
    format_score: float = 0.0,
    score: float = 1.0,
    data_source: str = 'chabsa',
    extra_info: Optional[dict] = None
) -> float:
    """chABSA scoring function with VeRL-compatible signature."""
    answer = extract_answer_nova(solution_str)
    if answer is None:
        return 0.0
    
    # Parse ground_truth JSON to get the answer
    gt_answer = ground_truth.get("answer", ground_truth)
    
    clean_answer = normalize_answer(answer)
    clean_ground_truth = normalize_answer(gt_answer)
    
    return score if clean_answer == clean_ground_truth else format_score

@dataclass
class RewardOutput:
    """Reward service."""

    id: str
    aggregate_reward_score: float

def lambda_handler(event, context):

    scores: List[RewardOutput] = []

    samples = event

    for sample in samples:
        # Extract the ground truth key. In the current dataset it's answer
        print("Sample: ", json.dumps(sample, indent=2))
        ground_truth = sample["reference_answer"]
        
        idx = "no id"
        # print(sample)
        if not "id" in sample:
            print(f"ID is None/empty for sample: {sample}")
        else:
            idx = sample["id"]

        ro = RewardOutput(id=idx, aggregate_reward_score=0.0)

        if not "messages" in sample:
            print(f"Messages is None/empty for id: {idx}")
            scores.append(RewardOutput(id="0", aggregate_reward_score=0.0))
            continue
        
        # Extract answer from ground truth dict
        if ground_truth is None:
            print(f"No answer found in ground truth for id: {idx}")
            scores.append(RewardOutput(id="0", aggregate_reward_score=0.0))
            continue
        
        # Get completion from last message (assistant message)
        last_message = sample["messages"][-1]
        completion_text = last_message["content"]
        
        if last_message["role"] not in ["assistant", "nova_assistant"]:
            print(f"Last message is not from assistant for id: {idx}")
            scores.append(RewardOutput(id="0", aggregate_reward_score=0.0))
            continue

        if not "content" in last_message:
            print(f"Completion text is empty for id: {idx}")
            scores.append(RewardOutput(id="0", aggregate_reward_score=0.0))
            continue

        random_score = compute_score(solution_str=completion_text, ground_truth=ground_truth)
        ro = RewardOutput(id=idx, aggregate_reward_score=random_score)

        print(f"Response for id: {idx} is {ro}")
        scores.append(ro)

    return [asdict(score) for score in scores]

Your RLVR function should incorporate three critical design elements for effective training. First, create a smooth reward landscape by awarding partial credit—for example, providing format_score points for proper response structure even when the final answer is incorrect. This prevents binary scoring cliffs that make learning difficult. Second, implement good extraction logic with multiple parsing strategies that handle various response formats gracefully. Third, validate inputs at every step using defensive coding practices that prevent crashes from malformed inputs

RLAIF (Reinforcement Learning via AI Feedback)

RLAIF uses AI models as judges for subjective evaluation. RLAIF achieves performance comparable to RLHF(Reinforcement Learning via Human Feedback) while being significantly faster and less costly. Here is an example RLVR lambda function code for sentiment classification.

  • Best for: Creative writing, summarization, brand voice alignment, helpfulness
  • Example: Evaluating response tone, assessing content quality, judging user intent alignment
  • Advantage: Scalable human-like judgment without manual labeling costs

RLAIF functions delegate judgment to capable AI models as shown in this sample code below

import json
import re
import time
import boto3
from typing import List, Dict, Any, Optional

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')
JUDGE_MODEL_ID = "<jude_model_id>" #Replace with judge model id of your interest
SYSTEM_PROMPT = "You must output ONLY a number between 0.0 and 1.0. No explanations, no text, just the number."

JUDGE_PROMPT_TEMPLATE = """Compare the following two responses and rate how similar they are on a scale of 0.0 to 1.0, where:
- 1.0 means the responses are semantically equivalent (same meaning, even if worded differently)
- 0.5 means the responses are partially similar
- 0.0 means the responses are completely different or contradictory

Response A: {response_a}

Response B: {response_b}

Output ONLY a number between 0.0 and 1.0. No explanations."""

def extract_solution_nova(solution_str: str, method: str = "strict") -> Optional[str]:
    """Extract solution from Nova-formatted response."""
    assert method in ["strict", "flexible"]
    
    if method == "strict":
        boxed_matches = re.findall(r'\boxed{([^}]+)}', solution_str)
        if boxed_matches:
            final_answer = boxed_matches[-1].replace(",", "").replace("$", "")
            return final_answer
        return None
        
    elif method == "flexible":
        boxed_matches = re.findall(r'\boxed{([^}]+)}', solution_str)
        if boxed_matches:
            numbers = re.findall(r"(\-?[0-9\.\,]+)", boxed_matches[-1])
            if numbers:
                return numbers[-1].replace(",", "").replace("$", "")
        
        answer = re.findall(r"(\-?[0-9\.\,]+)", solution_str)
        if len(answer) == 0:
            return None
        else:
            invalid_str = ["", "."]
            for final_answer in reversed(answer):
                if final_answer not in invalid_str:
                    break
        return final_answer

def lambda_graded(id: str, response_a: str, response_b: str, max_retries: int = 50) -> float:
    """Call Bedrock to compare responses and return similarity score."""
    prompt = JUDGE_PROMPT_TEMPLATE.format(response_a=response_a, response_b=response_b)
    
    for attempt in range(max_retries):
        try:
            response = bedrock_runtime.converse(
                modelId=JUDGE_MODEL_ID,
                messages=[{"role": "user", "content": [{"text": prompt}]}],
                system=[{"text": SYSTEM_PROMPT}],
                inferenceConfig={"temperature": 0.0, "maxTokens": 10}
            )
            
            output = response['output']['message']['content'][0]['text'].strip()
            score = float(output)
            return max(0.0, min(1.0, score))

        except Exception as e:
            if "ThrottlingException" in str(e) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                return 0.0
    return 0.0

def compute_score(id: str, solution_str: str, ground_truth: str) -> float:
    """Compute score for train.jsonl format."""
    answer = extract_solution_nova(solution_str=solution_str, method="flexible")
    if answer is None:
        return 0.0
    
    clean_answer = str(answer)
    clean_ground_truth = str(ground_truth)
    
    score = lambda_graded(id, response_a=clean_answer, response_b=clean_ground_truth)
    return score

def lambda_grader(samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """
    Process samples from train.jsonl format and return scores.
    
    Args:
        samples: List of dictionaries with messages and metadata
        
    Returns:
        List of dictionaries with reward scores
    """
    results = []
    
    for sample in samples:
        sample_id = sample.get("id", "unknown")
        
        # Extract reference answer from metadata or top level
        metadata = sample.get("metadata", {})
        reference_answer = metadata.get("reference_answer", sample.get("reference_answer", {}))
        
        if isinstance(reference_answer, dict):
            ground_truth = reference_answer.get("answer", "")
        else:
            ground_truth = str(reference_answer)
        
        # Get assistant response from messages
        messages = sample.get("messages", [])
        assistant_response = ""
        
        for message in reversed(messages):
            if message.get("role") in ["assistant", "nova_assistant"]:
                assistant_response = message.get("content", "")
                break
        
        if not assistant_response or not ground_truth:
            results.append({
                "id": sample_id,
                "aggregate_reward_score": 0.0
            })
            continue
        
        # Compute score
        score = compute_score(
            id=sample_id,
            solution_str=assistant_response,
            ground_truth=ground_truth
        )
        
        results.append({
            "id": sample_id,
            "aggregate_reward_score": score,
            "metrics_list": [
                {
                    "name": "semantic_similarity",
                    "value": score,
                    "type": "Reward"
                }
            ]
        })
    
    return results

def lambda_handler(event, context):
    return lambda_grader(event)

While implementing RLAIF function consider client initialization with global variables to reduce overall invocations latency. Handle throttling exceptions gracefully to avoid training interruptions. Use temperature 0.0 for deterministic judge scores, it helps with model consistency. And provide clear rubric, it helps judge provide calibrated scores

Considerations for writing good reward functions

To write good reward functions for RFT, start simple, create a smooth reward landscape (notbinary cliffs), ensure rewards align with the true goal (avoid hacking), use dense/shapedrewards for complex tasks, provide clear signals, and make them verifiable and consistent.

  • Define Goal Clearly: Know exactly what success looks like for your model.
  • Smooth Reward Landscape: Instead of simple pass/fail (0 or 1), use smooth, dense

reward signals that provide partial credit for being “on the right track”. This granularfeedback helps the model learn from incremental improvements rather than waiting fora perfect response. For complex, multi-step tasks, provide rewards for intermediateprogress (shaping) rather than just the final outcome (sparse).

  • Making Rewards Multi-Dimensional: A single scalar reward is too easily hacked. The

reward should evaluate model performance from multiple dimensions: e.g. correctness,faithfulness to input, safety/policy alignment, formatting, and conciseness, etc.

  • Reward Hacking Prevention: Ensure the model can’t get high rewards through shortcuts

(e.g., lucky guesses, repetitive actions); make the task guess-proof.

  • Use Verifiable Rubrics: For objective tasks like code generation or math, use automated

graders that execute the code or parse specific answer tags (e.g., <answer>) to verifycorrectness without a human in the loop.

  • Implement LLM Judges for Subjective Tasks: When programmatic code cannot judge

the answer (e.g., summarization), use a separate, capable model as an “LLM Judge”. Youmust evaluate this judge first to ensure its grades are stable and aligned with humanpreferences.

Optimizing your reward function execution within the training loop

Once your reward function works correctly, optimization helps you train faster while controlling costs. This section covers techniques to consider for your workloads. Optimization techniques compound in their impact—a well-configured Lambda function with appropriate batch sizing, concurrency settings, cold start mitigation, and error handling can evaluate responses ten times faster than a naive implementation while costing significantly less and providing better training reliability. The investment in optimization early in the customization process pays dividends throughout training by reducing iteration time, lowering compute costs, and catching issues before they require expensive retraining.

  1. Ensure IAM permissions are correctly configured before you start training

Dependency Management and Permissions

  • How to add dependencies: you can either bundle them directly with your code in a deployment package (.zip file) or use Lambda layers to manage dependencies separately from your core logic.
    • Creating a .zip deployment package (see instructions here)
    • Using Lambda layers (see instructions here)
  • Amazon Bedrock access for RLAIF: the execution role for the Lambda function should have access to Amazon Bedrock for LLM API call.

Use layers for dependencies shared across multiple functions. Use deployment packages for function-specific logic.Attach AWS Identity and Access Management (IAM) permissions to Lambda execution role for RLAIF implementations. Following the principle of least privilege, scope the Resource ARN to the specific foundation model you are using as a judge rather than using a wildcard

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": "arn:aws:bedrock:<region>:<account-id>:foundation-model/<model-id>"
        }
    ]
}
  1. Understanding platform differences and which platform might be more suitable for your needs

Optimizing Lambda-based reward functions requires understanding how different training environments interact with serverless evaluation and how architectural choices impact throughput, latency, and cost. The optimization landscape differs substantially between synchronous and asynchronous processing models, making environment-specific tuning essential for production-scale customization.

Amazon SageMaker AI Training Jobs employ synchronous processing that generates rollouts first before evaluating them in parallel batches. This architecture creates distinct optimization opportunities around batch sizing and concurrency management. The lambda_batch_size parameter, defaulting to 64, determines how many samples Lambda evaluates in a single invocation—tune this higher for fast reward functions that complete in milliseconds, but lower it for complex evaluations approaching timeout thresholds. The lambda_concurrency parameter controls parallel execution, with the default of 12 concurrent invocations often proving conservative for production workloads. Fast reward functions benefit from significantly higher concurrency, sometimes reaching 50 or more simultaneous executions, though you must monitor account-level Lambda concurrency limits that cap total concurrent executions across your functions in a region.

Amazon SageMaker AI HyperPod takes a fundamentally different approach through asynchronous processing that generates and evaluates samples individually rather than in large batches. This sample-by-sample architecture naturally supports higher throughput, with default configurations handling 400 transactions per second through Lambda without special tuning. Scaling beyond this baseline requires coordinated adjustment of HyperPod recipe parameters—specifically proc_num and rollout_worker_replicas that control worker parallelism. When scaling workers aggressively, consider increasing generation_replicas proportionally to prevent generation from becoming the bottleneck while evaluation capacity sits idle.

  1. Optimization of reward function using concurrency of Lambda

Lambda configuration directly impacts training speed and reliability:

    • Timeout Configuration: Set timeout to 60 seconds (default is only 3 seconds), this provides headroom for RLAIF judge calls or complex RLVR logic
    • Memory Allocation: Set memory to 512 MB (default is 128 MB), accelerated CPU improves response time performance
  1. Cold start mitigation

Cold start mitigation prevents latency spikes that can slow training and increase costs. Keep deployment packages under 50MB to minimize initialization time—this often means excluding unnecessary dependencies and using Lambda layers for large shared libraries. Reuse connections across invocations by initializing clients like the Amazon Bedrock runtime client in global scope rather than inside the handler function, allowing the Lambda execution environment to maintain these connections between invocations. Profile your function using Lambda Insights to identify performance bottlenecks. Cache frequently accessed data such as evaluation rubrics, validation rules, or configuration parameters in global scope so Lambda loads them once per container rather than on every invocation. This pattern of global initialization with handler-level execution proves particularly effective for Lambda functions handling thousands of evaluations during training.

# Keep deployment package under 50MB
# Reuse connections across invocations
bedrock_client = boto3.client('bedrock-runtime')  # Global scope

# Cache frequently accessed data
EVALUATION_RUBRICS = {...}  # Load once

def lambda_handler(event, context):
    # Clients and cached data persist across invocations
    return evaluate_responses(event, bedrock_client, EVALUATION_RUBRICS)
  1. Optimizing RLAIF judge models

For RLAIF implementations using Amazon Bedrock models as judges, there’s an important trade-off to consider. Larger models provide more reliable judgments but have lower throughput, while smaller models offer better throughput but may be less capable—pick the smallest judge model sufficient for your task to maximize throughput. Profile judge consistency before scaling to full training.

Throughput Management:

    • Monitor Amazon Bedrock throttling limits at region level
    • Consider Amazon SageMaker AI endpoints for judge models. It offers higher throughput but currently limited to open weight and Nova models
    • Batch multiple evaluations per API call when possible
    • Account for concurrent training jobs sharing Amazon Bedrock quota
  1. Ensuring your Lambda reward function is error tolerant and corrective

Real-world systems encounter failures—network hiccups, temporary service unavailability, or occasional Lambda timeouts. Rather than letting a single failure derail your entire training job, we’ve built robust retry mechanisms that handle timeouts, Lambda failures, and transient errors automatically. The system intelligently retries failed reward calculations with exponential backoff, giving temporary issues time to resolve. If a call fails even after three retries, you’ll receive a clear, actionable error message pinpointing the specific issue—whether it’s a timeout, a permissions problem, or a bug in your reward logic. This transparency lets you quickly identify and fix problems without sifting through cryptic logs.

def robust_evaluation(sample, max_retries=3):
    """Evaluation with comprehensive error handling."""
    for attempt in range(max_retries):
        try:
            score = compute_score(sample)
            return score
        except ValueError as e:
            # Parsing errors - return 0 and log
            print(f"Parse error for {sample['id']}: {str(e)}")
            return 0.0
        except Exception as e:
            # Transient errors - retry with backoff
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                print(f"Failed after {max_retries} attempts: {str(e)}")
                return 0.0
    return 0.0
  1. Iterative CloudWatch debugging and catching any signs of errors early on

Visibility into your training process is essential for both monitoring progress and troubleshooting issues. We automatically log comprehensive information to CloudWatch for every stage of the training pipeline: each training step’s metrics – including step wise training reward scores and detailed execution traces for each pipeline component. This granular logging makes it straightforward to track training progress in real-time, verify that your reward function is scoring responses as expected, and quickly diagnose issues when they arise. For example, if you notice training isn’t improving, you can examine the reward distributions in CloudWatch to see if your function is returning mostly zeros or if there’s insufficient signal

CloudWatch provides comprehensive visibility into reward function performance. Here are few useful Amazon CloudWatch Insights Queries for the solution

-- Find samples with zero rewards
SOURCE '/aws/lambda/my-reward-function'
| fields @timestamp, id, aggregate_reward_score
| filter aggregate_reward_score = 0.0
| sort @timestamp desc

-- Calculate reward distribution
SOURCE '/aws/lambda/my-reward-function'
| fields aggregate_reward_score
| stats count() by bin(aggregate_reward_score, 0.1)

-- Identify slow evaluations
SOURCE '/aws/lambda/my-reward-function'
| fields @duration, id
| filter @duration > 5000
| sort @duration desc

-- Track multi-dimensional metrics
SOURCE '/aws/lambda/my-reward-function'
| fields @timestamp, correctness, format, safety, conciseness
| stats avg(correctness) as avg_correctness, 
        avg(format) as avg_format,
        avg(safety) as avg_safety,
        avg(conciseness) as avg_conciseness 
  by bin(5m)

Conclusion

Lambda-based reward functions unlock Amazon Nova customization for organizations that need precise behavioral control without massive labeled datasets and improved reasoning. This approach delivers significant advantages through flexibility, scalability, and cost-effectiveness that streamline your model customization process.The architecture allows RLVR to handle objective verification tasks while RLAIF helps with subjective judgment for nuanced quality assessments. Organizations can use them individually or combine them for comprehensive evaluation that captures both factual accuracy and stylistic preferences. Scalability emerges naturally from the serverless foundation, automatically handling variable training workloads from early experimentation through production-scale customization. Cost-effectiveness flows directly from this design—organizations pay only for actual evaluation compute, with training jobs completing faster due to optimized Lambda concurrency and efficient reward calculation.The combination of Amazon Nova foundation models, Lambda serverless scalability, and Amazon Bedrock’s managed customization infrastructure makes reinforcement fine-tuning more accessible regardless of organizational scale. Start experimenting with the sample code in this blog, and begin customizing Amazon Nova models that deliver exactly the behaviors your applications need.

Acknowledgements

Special thanks to Eric Grudzien and Anupam Dewan for their review and contributions to this post.


About the Authors

Bharathan Balaji

Bharathan Balaji is a Senior Applied Scientist at Amazon Web Services, working on reinforcement learning and foundation model services. His work focuses on building AI capabilities that help customers transform their businesses.

Manoj Gupta

Manoj Gupta is a Senior Solutions Architect at AWS, based in San Francisco. With over 4 years of experience at AWS, he works closely with customers to build optimized AI/ML powered solutions and cloud infrastructure. His primary focus areas are Data, AI/ML, and Security, helping organizations modernize their technology stacks. Outside of work, he enjoys outdoor activities and traveling with family.

Brian Hu

Brian Hu is a Senior Applied Scientist at AWS, specializing in supervised and reinforcement fine-tuning and their applications across various domains. He works closely with customers to customize large language models (LLMs) for enhanced performance and domain-specific optimization.

Sarthak Khanna

Sarthak Khanna is a Software Development Engineer at Amazon AGI, specializing in reinforcement fine-tuning and agentic AI systems. His work focuses on building scalable training pipelines for large language models, leveraging reinforcement learning to enable multi-turn reasoning, tool use, and autonomous decision-making.

Liked Liked