Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation

Welcome to Part 2 of our SAM 3 tutorial. In Part 1, we explored the theoretical foundations of SAM 3 and demonstrated basic text-based segmentation. Now, we unlock its full potential by mastering advanced prompting techniques and interactive workflows.


SAM 3’s true power lies in its flexibility; it does not just accept text prompts. It can process multiple text queries simultaneously, interpret bounding box coordinates, combine text with visual cues, and respond to interactive point-based guidance. This multi-modal approach enables sophisticated segmentation workflows that were previously impractical with traditional models.

In Part 2, we will cover:

  • Multi-prompt Segmentation: Query multiple concepts in a single image
  • Batched Inference: Process multiple images with different prompts efficiently
  • Bounding Box Guidance: Use spatial hints for precise localization
  • Positive and Negative Prompts: Include desired regions while excluding unwanted areas
  • Hybrid Prompting: Combine text and visual cues for selective segmentation
  • Interactive Refinement: Draw bounding boxes and click points for real-time segmentation control

Each technique is demonstrated with complete code examples and visual outputs, providing production-ready workflows for data annotation, video editing, scientific research, and more.

This lesson is the 2nd of a 4-part series on SAM 3:

  1. SAM 3: Concept-Based Visual Understanding and Segmentation
  2. Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation (this tutorial)
  3. Lesson 3
  4. Lesson 4

To learn how to perform advanced multi-modal prompting and interactive segmentation with SAM 3, just keep reading.

Looking for the source code to this post?

Jump Right To The Downloads Section


Would you like immediate access to 3,457 images curated and labeled with hand gestures to train, explore, and experiment with … for free? Head over to Roboflow and get a free account to grab these hand gesture images.


Configuring Your Development Environment

To follow this guide, you need to have the following libraries installed on your system.

!pip install -q git+https://github.com/huggingface/transformers supervision jupyter_bbox_widget

We install the transformers library to load the SAM 3 model and processor, the supervision library for annotation, drawing, and inspection (which we use later to visualize bounding boxes and segmentation outputs). Additionally, we install jupyter_bbox_widget, an interactive widget that runs inside a notebook, enabling us to click on the image to add points or draw bounding boxes.

We also pass the -q flag to suppress installation logs and keep the notebook output clean.
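Note that the imports in the next section also rely on the accelerate and ipywidgets packages. Both are usually already available on Google Colab, but if either is missing from your environment, install it explicitly:

!pip install -q accelerate ipywidgets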


Need Help Configuring Your Development Environment?

Having trouble configuring your development environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join PyImageSearch University — you will be up and running with this tutorial in a matter of minutes.

All that said, are you:

  • Short on time?
  • Learning on your employer’s administratively locked system?
  • Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
  • Ready to run the code immediately on your Windows, macOS, or Linux system?

Then join PyImageSearch University today!

Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.

And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!


Setup and Imports

Once installed, we proceed to import the required libraries.

import io
import torch
import base64
import requests
import matplotlib
import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt

from google.colab import output
from accelerate import Accelerator
from IPython.display import display
from jupyter_bbox_widget import BBoxWidget
from PIL import Image, ImageDraw, ImageFont
from transformers import Sam3Processor, Sam3Model, Sam3TrackerProcessor, Sam3TrackerModel

We import the following:

  • io: Python’s built-in module for handling in-memory image buffers when converting PIL images to base64 format
  • torch: used to run the SAM 3 model, send tensors to the GPU, and work with model outputs
  • base64: used to convert our images into base64 strings so that the BBox widget can display them in the notebook
  • requests: a library to download images directly from a URL; this keeps our workflow simple and avoids manual file uploads

We also import several helper libraries:

  • matplotlib.pyplot: helps us visualize masks and overlays
  • numpy: gives us fast array operations
  • ipywidgets: enables interactive elements inside the notebook

We import the output utility from Colab, which we later use to enable interactive widgets. Without this step, our bounding box widget will not render. We also import Accelerator from Hugging Face to run the model efficiently on either the CPU or GPU using the same code. It also simplifies device placement.
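For context, here is the one-liner we use later in this tutorial to resolve the device via Accelerator. It picks the GPU when one is available and falls back to the CPU otherwise:

from accelerate import Accelerator

device = Accelerator().device  # resolves to cuda (or cpu) automatically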

We import the display function to render images and widgets directly in notebook cells, and BBoxWidget serves as the core interactive tool, allowing us to click and draw bounding boxes or points on an image. We use this as our prompt input system.

We also import 3 classes from Pillow:

  • Image: loads RGB images
  • ImageDraw: helps us draw shapes on images
  • ImageFont: gives us text rendering support for overlays

Finally, we import our SAM 3 tools from transformers:

  • Sam3Processor: prepares inputs for the segmentation model
  • Sam3Model: performs segmentation from text and box prompts
  • Sam3TrackerProcessor: prepares inputs for point-based or tracking prompts
  • Sam3TrackerModel: runs point-based segmentation and masking

Loading the SAM 3 Model

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Sam3Processor.from_pretrained("facebook/sam3")
model = Sam3Model.from_pretrained("facebook/sam3").to(device)

First, we check if a GPU is available in the environment. If PyTorch detects CUDA (Compute Unified Device Architecture) support, then we use the GPU for faster inference. Otherwise, we fall back to the CPU. This check ensures our code runs efficiently on any machine (Line 1).

Next, we load the Sam3Processor. The processor is responsible for preparing all inputs before they reach the model. It handles image preprocessing, bounding box formatting, text prompts, and tensor conversion. In short, it makes our raw images compatible with the model (Line 3).

Finally, we load the Sam3Model from Hugging Face. This model takes the processed inputs and generates segmentation masks. We immediately move the model to the selected device (GPU or CPU) for inference (Line 4).


Downloading a Few Images

!wget -q https://media.roboflow.com/notebooks/examples/birds.jpg
!wget -q https://media.roboflow.com/notebooks/examples/traffic_jam.jpg
!wget -q https://media.roboflow.com/notebooks/examples/basketball_game.jpg
!wget -q https://media.roboflow.com/notebooks/examples/dog-2.jpeg

Here, we download a few images from the Roboflow media server using the wget command and use the -q flag to suppress output and keep the notebook clean.


Multi-Text Prompts on a Single Image

In this example, we apply two different text prompts to the same image: player in white and player in blue. Instead of running SAM 3 once, we loop over both prompts, and each text query produces a new set of instance masks. We then merge all detections into a single result and visualize them together.

prompts = ["player in white", "player in blue"]
IMAGE_PATH = "/content/basketball_game.jpg"

# Load image
image = Image.open(IMAGE_PATH).convert("RGB")

all_masks = []
all_boxes = []
all_scores = []

total_objects = 0

for prompt in prompts:
   inputs = processor(
       images=image,
       text=prompt,
       return_tensors="pt"
   ).to(device)

   with torch.no_grad():
       outputs = model(**inputs)

   results = processor.post_process_instance_segmentation(
       outputs,
       threshold=0.5,
       mask_threshold=0.5,
       target_sizes=inputs["original_sizes"].tolist()
   )[0]

   num_objects = len(results["masks"])
   total_objects += num_objects

   print(f"Found {num_objects} objects for prompt: '{prompt}'")

   all_masks.append(results["masks"])
   all_boxes.append(results["boxes"])
   all_scores.append(results["scores"])

results = {
   "masks": torch.cat(all_masks, dim=0),
   "boxes": torch.cat(all_boxes, dim=0),
   "scores": torch.cat(all_scores, dim=0),
}

print(f"nTotal objects found across all prompts: {total_objects}")

First, we define our two text prompts. Each describes a different visual concept in the image (Line 1). We also set the path to our basketball game image (Line 2). We load the image and convert it to RGB. This ensures the colors are consistent before sending it to the model (Line 5).

Next, we initialize empty lists to store masks, bounding boxes, and confidence scores for each prompt. We also track the total number of detections (Lines 7-11).

We run inference without tracking gradients. This is more efficient and uses less memory. After inference, we post-process the outputs. We apply thresholds, convert logits to binary masks, and resize them to match the original image (Lines 13-28).

We count the number of objects detected for the current prompt, update the running total, and print the result. We store the current prompt’s masks, boxes, and scores in their respective lists (Lines 30-37).

Once the loop is finished, we concatenate all masks, bounding boxes, and scores into a single results dictionary. This allows us to visualize all objects together, regardless of which prompt produced them. We print the total number of detections across all prompts (Lines 39-45).

Below are the numbers of objects detected for each prompt, as well as the total number of objects detected.

Found 5 objects for prompt: 'player in white'

Found 6 objects for prompt: 'player in blue'

Total objects found across all prompts: 11

Output

labels = []
for prompt, scores in zip(prompts, all_scores):
   labels.extend([prompt] * len(scores))

overlay_masks_boxes_scores(
   image=image,
   masks=results["masks"],
   boxes=results["boxes"],
   scores=results["scores"],
   labels=labels,
   score_threshold=0.5,
   alpha=0.45,
)

Now, to visualize the output, we generate a list of text labels. Each label matches the prompt that produced the detection (Lines 1-3).

Finally, we visualize everything at once using overlay_masks_boxes_scores. The output image (Figure 1) shows masks, bounding boxes, and confidence scores for players in white and players in blue — cleanly layered on top of the original frame (Lines 5-13).

Figure 1: Multi-text prompt segmentation of “player in white” and “player in blue” on a single image (source: visualization by the author)
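Note: overlay_masks_boxes_scores is a custom visualization helper assumed to be defined earlier in the notebook (it may well be built on the supervision annotators we installed). If you need a stand-in, the following minimal NumPy/Pillow sketch with the same signature works as a drop-in; the colors and text styling here are assumptions, not the original implementation.

import numpy as np
from PIL import Image, ImageDraw

# Hypothetical stand-in for the overlay helper used throughout this tutorial.
_PALETTE = [(255, 64, 64), (64, 160, 255), (64, 220, 120), (255, 200, 0), (200, 120, 255)]

def overlay_masks_boxes_scores(image, masks, boxes, scores, labels,
                               score_threshold=0.5, alpha=0.45):
    """Blend instance masks onto the image and draw boxes, labels, and scores."""
    img = np.array(image.convert("RGB")).astype(np.float32)

    # Keep only detections above the confidence threshold
    keep = [i for i, s in enumerate(scores) if float(s) >= score_threshold]

    # Alpha-blend each kept mask with a palette color
    for n, i in enumerate(keep):
        mask = masks[i]
        m = np.asarray(mask.cpu() if hasattr(mask, "cpu") else mask).astype(bool)
        color = np.array(_PALETTE[n % len(_PALETTE)], dtype=np.float32)
        img[m] = (1 - alpha) * img[m] + alpha * color

    # Draw boxes and "label: score" text on top of the blended image
    vis = Image.fromarray(img.astype(np.uint8))
    draw = ImageDraw.Draw(vis)
    for i in keep:
        x1, y1, x2, y2 = [float(v) for v in boxes[i]]
        draw.rectangle([(x1, y1), (x2, y2)], outline="white", width=2)
        draw.text((x1, max(0.0, y1 - 12)), f"{labels[i]}: {float(scores[i]):.2f}", fill="white")
    return vis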

Batched Inference Using Multiple Text Prompts Across Multiple Images

In this example, we run SAM 3 on two images at once and provide a separate text prompt for each. This gives us a clean, parallel workflow: one batch, two prompts, two images, two sets of segmentation results.

cat_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
images = [
   Image.open(requests.get(cat_url, stream=True).raw).convert("RGB"),
   Image.open(requests.get(kitchen_url, stream=True).raw).convert("RGB")
]

text_prompts = ["ear", "dial"]

inputs = processor(images=images, text=text_prompts, return_tensors="pt").to(device)

with torch.no_grad():
   outputs = model(**inputs)

# Post-process results for both images
results = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs.get("original_sizes").tolist()
)

print(f"Image 1: {len(results[0]['masks'])} objects found")
print(f"Image 2: {len(results[1]['masks'])} objects found")

First, we define two URLs. The first points to a cat image. The second points to a kitchen scene from COCO (Lines 1 and 2).

Next, we download the two images, load them into memory, and convert them to RGB. We store both images in a list. This allows us to batch them later. Then, we define one prompt per image. The first prompt searches for a cat’s ear. The second prompt looks for a dial in the kitchen scene (Lines 3-8).

We batch the images and batch the prompts into a single input structure. This gives SAM 3 two parallel vision-language tasks, packed into one tensor (Line 10).

We disable gradient computation and run the model in inference mode. The outputs contain segmentation predictions for both images. We post-process the raw logits. SAM 3 returns results as a list: one entry per image. Each entry contains instance masks, bounding boxes, and confidence scores (Lines 12-21).

We count the number of objects detected for each prompt. This gives us a simple, semantic summary of model performance (Lines 23 and 24).

Below is the total number of objects detected in each image provided for each text prompt.

Image 1: 2 objects found

Image 2: 7 objects found

Output

for image, result, prompt in zip(images, results, text_prompts):
   labels = [prompt] * len(result["scores"])
   vis = overlay_masks_boxes_scores(image, result["masks"], result["boxes"], result["scores"], labels)
   display(vis)

To visualize the output, we pair each image with its corresponding prompt and result. For each batch entry, we do the following (Line 1):

  • create a label per detected object (Line 2)
  • visualize the masks, boxes, and scores using our overlay helper (Line 3)
  • display the annotated result in the notebook (Line 4)

This approach shows how SAM 3 handles multiple text prompts and images simultaneously, without writing separate inference loops.

In Figure 2, we can see the object (ear) detected in the image.

Figure 2: Batched inference result for Image 1 showing “ear” detections (source: visualization by the author)

In Figure 3, we can see the object (dial) detected in the image.

Figure 3: Batched inference result for Image 2 showing “dial” detections (source: visualization by the author)

Single Bounding Box Prompt

In this example, we perform segmentation using a bounding box instead of a text prompt. We provide the model with a spatial hint that says: “focus here.” SAM 3 treats the box as an exemplar and segments all detected instances of the concept it indicates.

# Load image
image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Box in xyxy format: [x1, y1, x2, y2]
box_xyxy = [100, 150, 500, 450]

input_boxes = [[box_xyxy]]          # batch of 1 image, 1 box
input_boxes_labels = [[1]]          # 1 = positive (foreground) box

def draw_input_box(image, box, color="red", width=3):
   img = image.copy().convert("RGB")
   draw = ImageDraw.Draw(img)
   x1, y1, x2, y2 = box
   draw.rectangle([(x1, y1), (x2, y2)], outline=color, width=width)
   return img

input_box_vis = draw_input_box(image, box_xyxy)
input_box_vis

First, we load an example COCO image directly from a URL. We read the raw bytes, open them with Pillow, and convert them to RGB (Lines 2 and 3).

Next, we define a bounding box around the region to be segmented. The coordinates follow the xyxy format (Line 6).

  • (x1, y1): top-left corner
  • (x2, y2): bottom-right corner

We prepare the box for the processor.

  • The outer list indicates a batch size of 1. The inner list holds the single bounding box (Line 8).
  • We set the label to 1, meaning this is a positive box, and SAM 3 should focus on this region (Line 9).

Then, we define a helper to visualize the prompt box. The function draws a colored rectangle over the image, making the prompt easy to verify before segmentation (Lines 11-16).

We display the input box overlay. This confirms our prompt is correct before running the model (Lines 18 and 19).

Figure 4 shows the bounding box prompt overlaid on the input image.

Figure 4: Single bounding box prompt drawn over the input image (source: visualization by the author)
inputs = processor(
   images=image,
   input_boxes=input_boxes,
   input_boxes_labels=input_boxes_labels,
   return_tensors="pt"
).to(device)

with torch.no_grad():
   outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")

Now, we prepare the final inputs for the model. Instead of passing text, we pass bounding box prompts. The processor handles resizing, padding, normalization, and tensor conversion. We then move everything to the selected device (GPU or CPU) (Lines 1-6).

We run SAM 3 in inference mode. The torch.no_grad() function disables gradient computation, reducing memory usage and improving speed (Lines 8 and 9).

After inference, we reshape and threshold the predicted masks. We resize them back to their original sizes so they align perfectly. We index [0] because we are working with a single image (Lines 11-16).

We print the number of foreground objects that SAM 3 detected within the bounding box (Line 18).

Found 1 objects

Output

labels = ["box-prompted object"] * len(results["scores"])

overlay_masks_boxes_scores(
   image=image,
   masks=results["masks"],
   boxes=results["boxes"],
   scores=results["scores"],
   labels=labels,
   score_threshold=0.5,
   alpha=0.45,
)

To visualize the results, we create a label string "box-prompted object" for each detected instance to keep the overlay looking clean (Line 1).

Finally, we call our overlay helper. It blends the segmentation mask, draws the bounding box, and shows confidence scores on top of the original image (Lines 3-11).

Figure 5 shows the segmented object.

Figure 5: Segmentation result guided by a single bounding box prompt (source: visualization by the author)

Multiple Bounding Box Prompts on a Single Image (Dual Positive Foreground Regions)

In this example, we guide SAM 3 using two positive bounding boxes. Each box marks a small region of interest inside the image: one around the oven dial and one around a nearby button. Both boxes act as foreground exemplars, and SAM 3 segments every instance that matches the concept they indicate, which is why it can return more objects than boxes.

kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Image.open(
   requests.get(kitchen_url, stream=True).raw
).convert("RGB")

box1_xyxy = [59, 144, 76, 163]   # Dial
box2_xyxy = [87, 148, 104, 159] # Button

input_boxes = [[box1_xyxy, box2_xyxy]]      # batch of 1 image, 2 boxes
input_boxes_labels = [[1, 1]]               # 1 = positive (foreground)

def draw_input_boxes(image, boxes, color="red", width=3):
   img = image.copy().convert("RGB")
   draw = ImageDraw.Draw(img)

   for box in boxes:
       x1, y1, x2, y2 = box
       draw.rectangle([(x1, y1), (x2, y2)], outline=color, width=width)

   return img

input_box_vis = draw_input_boxes(
   kitchen_image,
   [box1_xyxy, box2_xyxy]
)

input_box_vis

First, we load the kitchen image from COCO. We download the raw image bytes, open them with Pillow, and convert the image to RGB. Next, we define two bounding boxes. Both follow the xyxy format. The first box highlights the oven dial. The second box highlights the oven button (Lines 1-7).

We pack both bounding boxes into a single list, since we are working with a single image. We assign a value of 1 to both boxes, indicating that both are positive prompts. We define a helper function to visualize the bounding box prompts. For each box, we draw a red rectangle overlay on a copy of the image (Lines 9-20).

We draw both boxes and display the result. This gives us a visual confirmation of our bounding box prompts before running the model (Lines 22-27).

Figure 6 shows the two positive bounding boxes superimposed on the input image.

Figure 6: Two positive bounding box prompts (dial and button) superimposed on the input image (source: visualization by the author)
inputs = processor(
   images=kitchen_image,
   input_boxes=input_boxes,
   input_boxes_labels=input_boxes_labels,
   return_tensors="pt"
).to(device)

with torch.no_grad():
   outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")

Now, we prepare the image and the bounding box prompts using the processor. We then send the tensors to the CPU or GPU. We run SAM 3 in inference mode. We disable gradient tracking to improve memory and speed (Lines 1-9).

Next, we post-process the raw outputs. We resize masks back to their original shape, and we filter low-confidence results. We print the number of detected objects that fall inside our two positive bounding box prompts (Lines 11-18).

Below is the total number of objects detected in the image.

Found 7 objects

Output

labels = ["box-prompted object"] * len(results["scores"])

overlay_masks_boxes_scores(
   image=kitchen_image,
   masks=results["masks"],
   boxes=results["boxes"],
   scores=results["scores"],
   labels=labels,
)

We generate a label for visualization. Finally, we overlay the segmented objects on the image using the overlay_masks_boxes_scores function (Lines 1-9).

Here, Figure 7 displays all segmented objects.

Figure 7: Segmentation results from dual positive bounding box prompts (source: visualization by the author)

Multiple Bounding Box Prompts on a Single Image (Positive Foreground and Negative Background Control)

In this example, we guide SAM 3 using two bounding boxes: one positive and one negative. The positive box highlights the region we want to segment, while the negative box tells the model to ignore a nearby region. This combination gives us fine control over the segmentation result.

kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Image.open(
   requests.get(kitchen_url, stream=True).raw
).convert("RGB")

box1_xyxy = [59, 144, 76, 163]   # Dial
box2_xyxy = [87, 148, 104, 159] # Button

input_boxes = [[box1_xyxy, box2_xyxy]]     # batch of 1 image, 2 boxes
input_boxes_labels = [[1, 0]]              # 1 = positive (dial), 0 = negative (button)

def draw_input_boxes(image, boxes, labels, width=3):
   """
   boxes  : list of [x1, y1, x2, y2]
   labels : list of ints (1 = positive, 0 = negative)
   """
   img = image.copy().convert("RGB")
   draw = ImageDraw.Draw(img)

   for box, label in zip(boxes, labels):
       x1, y1, x2, y2 = box

       # Color by label
       color = "green" if label == 1 else "red"

       draw.rectangle(
           [(x1, y1), (x2, y2)],
           outline=color,
           width=width,
       )

   return img

input_box_vis = draw_input_boxes(
   kitchen_image,
   boxes=[box1_xyxy, box2_xyxy],
   labels=[1, 0],   # 1 = positive, 0 = negative
)

input_box_vis

First, we load our kitchen image from the COCO dataset. We fetch the bytes from the URL and convert them to RGB (Lines 1-4).

Next, we define two bounding boxes. Both follow the xyxy coordinate format (Lines 6 and 7):

  • first box: surrounds the oven dial
  • second box: surrounds a nearby oven button

We pack the two boxes into a single list because we are working with a single image. We set labels [1, 0], meaning (Lines 9 and 10):

  • dial box: positive (foreground to include)
  • button box: negative (area to exclude)

We define a helper function that draws bounding boxes in different colors. Positive prompts are drawn in green. Negative prompts are drawn in red (Lines 12-32).

We visualize the bounding box prompts overlaid on the image. This gives us a clear understanding of how we are instructing SAM 3 (Lines 34-40).

Figure 8 shows the positive and negative box prompts superimposed on the input image.

Figure 8: Positive (include) and negative (exclude) bounding box prompts shown on the input image (source: visualization by the author)
inputs = processor(
   images=kitchen_image,
   input_boxes=input_boxes,
   input_boxes_labels=input_boxes_labels,
   return_tensors="pt"
).to(device)

with torch.no_grad():
   outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")

We prepare the inputs for SAM 3. The processor handles preprocessing and tensor conversion. We perform inference. Gradients are disabled to reduce memory usage. Next, we post-process the results. SAM 3 returns instance masks filtered by confidence and resized to the original resolution (Lines 1-16).

We print the number of objects segmented using this foreground-background combination (Line 18).

Below is the total number of objects detected in the image.

Found 6 objects

Output

labels = ["box-prompted object"] * len(results["scores"])

overlay_masks_boxes_scores(
   image=kitchen_image,
   masks=results["masks"],
   boxes=results["boxes"],
   scores=results["scores"],
   labels=labels,
)

We assign labels to detections to ensure the overlay displays meaningful text. Finally, we visualize the segmentation (Lines 1-9).

In Figure 9, the positive prompt guides SAM 3 to segment the dial, while the negative prompt suppresses the nearby button.

Figure 9: Segmentation result using combined positive/negative box guidance to isolate the dial while suppressing a nearby region (source: visualization by the author)

Combining Text and Visual Prompts for Selective Segmentation (Excluding the Undesired Regions)

In this example, we use two different prompt types at the same time:

  • text prompt: to search for "handle"
  • negative bounding box: to exclude the oven handle region

This provides selective control, allowing SAM 3 to focus on handles in the scene while ignoring a specific area.

kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Image.open(
   requests.get(kitchen_url, stream=True).raw
).convert("RGB")

# Segment "handle" but exclude the oven handle using a negative box
text = "handle"
# Negative box covering oven handle area (xyxy): [40, 183, 318, 204]
oven_handle_box = [40, 183, 318, 204]
input_boxes = [[oven_handle_box]]

def draw_negative_box(image, box, width=3):
   img = image.copy().convert("RGB")
   draw = ImageDraw.Draw(img)

   x1, y1, x2, y2 = box
   draw.rectangle(
       [(x1, y1), (x2, y2)],
       outline="red",   # red = negative
       width=width,
   )

   return img

neg_box_vis = draw_negative_box(
   kitchen_image,
   oven_handle_box
)

neg_box_vis

First, we load the kitchen image from the COCO dataset. We read the file from the URL, open it as a Pillow image, and convert it to RGB (Lines 1-4).

Next, we define the structure of our prompt. We want to segment handles in the kitchen, but exclude the large oven handle. We describe the concept using text ("handle") and draw a bounding box over the oven handle region (Lines 7-10).

We write a helper function to visualize our negative region. We draw a red bounding box to show that this area should be excluded. We display the negative prompt overlay. This helps confirm that the region is positioned correctly (Lines 12-30).

Figure 10 shows the bounding box prompt to exclude the oven handle region.

Figure 10: Negative bounding box covering the oven handle region to exclude it from segmentation (source: visualization by the author)
inputs = processor(
   images=kitchen_image,
   text="handle",
   input_boxes=[[oven_handle_box]],
   input_boxes_labels=[[0]],   # negative box
   return_tensors="pt"
).to(device)

with torch.no_grad():
   outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")

Here, we prepare the inputs for SAM 3. We combine text and bounding box prompts. We mark the bounding box with a 0 label, meaning it is a negative region that the model must ignore (Lines 1-7).

We run the model in inference mode. This yields raw segmentation predictions based on both prompt types. We post-process the results by converting logits into binary masks, filtering low-confidence predictions, and resizing the masks back to the original resolution (Lines 9-17).

We report below the number of handle-like objects remaining after excluding the oven handle (Line 19).

Found 3 objects

Output

labels = ["handle (excluding oven)"] * len(results["scores"])

final_vis = overlay_masks_boxes_scores(
   image=kitchen_image,
   masks=results["masks"],
   boxes=results["boxes"],
   scores=results["scores"],
   labels=labels,
   score_threshold=0.5,
   alpha=0.45,
)

final_vis

We assign meaningful labels for visualization. Finally, we draw masks, bounding boxes, labels, and scores on the image (Lines 1-13).

In Figure 11, the result shows only handles outside the negative region.

Figure 11: Hybrid prompting result: "handle" segmentation while excluding the oven handle via a negative box (source: visualization by the author)

Batched Mixed-Prompt Segmentation Across Two Images (Text and Bounding Box Guidance)

In this example, we demonstrate how SAM 3 can handle multiple prompt types in a single batch. The first image receives a text prompt ("laptop"), while the second image receives a visual prompt (positive bounding box). Both images are processed together in a single forward pass.

text=["laptop", None]
input_boxes=[None, [box2_xyxy]]
input_boxes_labels=[None, [1]]

def draw_input_box(image, box, color="green", width=3):
   img = image.copy().convert("RGB")
   draw = ImageDraw.Draw(img)
   x1, y1, x2, y2 = box
   draw.rectangle([(x1, y1), (x2, y2)], outline=color, width=width)
   return img

input_vis_1 = images[0]  # text prompt → no box
input_vis_2 = draw_input_box(images[1], box2_xyxy)

First, we define 3 parallel prompt lists:

  • 1 for text
  • 1 for bounding boxes
  • 1 for bounding box labels

We set the first entry in each list to None for the first image because we only want to use natural language there (laptop). For the second image, we supply a bounding box and label it as positive (1) (Lines 1-3).

We define a small helper function to draw a bounding box on an image. This helps us visualize the prompt region before inference. Here, we prepare two preview images (Lines 5-13):

  • first image: shows no box, since it will use text only
  • second image: is rendered with its bounding box prompt
input_vis_1

Figure 12 shows no box over the image, since it uses a text prompt for segmentation.

Figure 12: Batched mixed-prompt setup: Image 1 uses a text prompt (no box overlay shown) (source: image by the author)
input_vis_2

Figure 13 shows a bounding box over the image because it uses a box prompt for segmentation.

Figure 13: Batched mixed-prompt setup: Image 2 uses a positive bounding box prompt (source: visualization by the author)
inputs = processor(
   images=images,
   text=["laptop", None],
   input_boxes=[None, [box2_xyxy]],
   input_boxes_labels=[None, [1]],
   return_tensors="pt"
).to(device)

with torch.no_grad():
   outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)

Next, we assemble everything into a single batched input. This gives SAM 3:

  • 2 images
  • 2 prompt types
  • 1 forward pass

We run SAM 3 inference without computing gradients. This produces segmentation predictions for both images simultaneously (Lines 1-10).

We post-process the model outputs for both images. The result is a two-element list (Lines 12-17):

  • entry [0]: corresponds to the laptop query
  • entry [1]: corresponds to the bounding box query

Output 1: Text Prompt Segmentation

labels_1 = ["laptop"] * len(results[0]["scores"])

overlay_masks_boxes_scores(
   image=images[0],
   masks=results[0]["masks"],
   boxes=results[0]["boxes"],
   scores=results[0]["scores"],
   labels=labels_1,
   score_threshold=0.5,
)

We apply a label to each detected object in the first image. We visualize the segmentation results overlaid on the first image (Lines 1-10).

In Figure 14, we observe detections guided by the text prompt "laptop".

Figure 14: Text-prompt segmentation result for "laptop" in Image 1 (source: visualization by the author)

Output 2: Bounding Box Prompt Segmentation

labels_2 = ["box-prompted object"] * len(results[1]["scores"])

overlay_masks_boxes_scores(
   image=images[1],
   masks=results[1]["masks"],
   boxes=results[1]["boxes"],
   scores=results[1]["scores"],
   labels=labels_2,
   score_threshold=0.5,
)

We create labels for the second image. These detections are from the bounding box prompt. Finally, we visualize the bounding box guided segmentation on the second image (Lines 1-10).

In Figure 15, we can see the detections guided by the bounding box prompt.

Figure 15: Bounding-box-guided segmentation result in Image 2 (source: visualization by the author)

Interactive Segmentation Using Bounding Box Refinement (Draw to Segment)

In this example, we turn segmentation into a fully interactive workflow. We draw bounding boxes directly over the image using a widget UI. Each drawn box becomes a prompt signal for SAM 3:

  • green (positive) boxes: identify areas we want to segment
  • red (negative) boxes: exclude areas we want the model to ignore

After drawing, we convert the widget output into proper box coordinates and run SAM 3 to produce refined segmentation masks.

output.enable_custom_widget_manager()

# Load image
url = "http://images.cocodataset.org/val2017/000000136466.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Convert to base64
def pil_to_base64(img):
   buffer = io.BytesIO()
   img.save(buffer, format="PNG")
   return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()

# Create widget
widget = BBoxWidget(
   image=pil_to_base64(image),
   classes=["positive", "negative"]
)

widget

We enable custom widget support in Colab to ensure the bounding box UI renders properly. We download the kitchen image, load it into memory, and convert it to RGB format (Lines 1-5).

Before sending the image into the widget, we convert it into a base64 PNG buffer. This encoding step makes the image displayable in the browser UI (Lines 8-11).

We create an interactive drawing widget. It displays the image and allows the user to add labeled boxes. Each box is tagged as either "positive" or "negative" (Lines 14-17).

We render the widget in the notebook. At this point, the user can draw, move, resize, and delete bounding boxes (Line 19).

In Figure 16, we can see the positive and negative bounding boxes drawn by the user. The blue box indicates regions that belong to the object of interest, while the orange box marks background areas that should be ignored. These annotations serve as interactive guidance signals for refining the segmentation output. 

Figure 16: Interactive box drawing UI showing positive and negative box annotations (source: image by the author)
print(widget.bboxes)

The widget.bboxes object stores metadata for every annotation drawn by the user on the image. Each entry corresponds to a single box created in the interactive widget.

A typical output looks like this:

[{'x': 58, 'y': 147, 'width': 18, 'height': 18, 'label': 'positive'}, {'x': 88, 'y': 149, 'width': 18, 'height': 8, 'label': 'negative'}]

Each dictionary represents a single user annotation:

  • x and y: indicate the top-left corner of the drawn box in pixel coordinates
  • width and height: describe the size of the box
  • label: tells us whether the annotation is a 'positive' point (object) or a 'negative' point (background)
def widget_to_sam_boxes(widget):
   boxes = []
   labels = []

   for ann in widget.bboxes:
       x = int(ann["x"])
       y = int(ann["y"])
       w = int(ann["width"])
       h = int(ann["height"])

       x1 = x
       y1 = y
       x2 = x + w
       y2 = y + h

       label = ann.get("label") or ann.get("class")

       boxes.append([x1, y1, x2, y2])
       labels.append(1 if label == "positive" else 0)

   return boxes, labels

boxes, box_labels = widget_to_sam_boxes(widget)

print("Boxes:", boxes)
print("Labels:", box_labels)

We define a helper function to translate widget data into SAM-compatible xyxy coordinates. The widget gives us x/y + width/height. We convert to SAM’s xyxy format.

We encode labels into SAM 3 format:

  • 1: positive region
  • 0: negative region

The function returns valid box lists ready for inference. We extract the interactive box prompts (Lines 23-45).

Below are the Boxes and Labels in the required format.

Boxes: [[58, 147, 76, 165], [88, 149, 106, 157]]

Labels: [1, 0]

inputs = processor(
   images=image,
   input_boxes=[boxes],              # batch size = 1
   input_boxes_labels=[box_labels],
   return_tensors="pt"
).to(device)

with torch.no_grad():
   outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
   outputs,
   threshold=0.5,
   mask_threshold=0.5,
   target_sizes=inputs["original_sizes"].tolist()
)[0]

print(f"Found {len(results['masks'])} objects")

We pass the image and interactive box prompts into the processor. We run inference without tracking gradients. We convert logits into final mask predictions. We print the number of detected regions matching the interactive prompts (Lines 49-66).

Below is the number of objects detected by the model.

Found 6 objects

Output

labels = ["interactive object"] * len(results["scores"])

overlay_masks_boxes_scores(
   image=image,
   masks=results["masks"],
   boxes=results["boxes"],
   scores=results["scores"],
   labels=labels,
   alpha=0.45,
)

We assign simple labels to each detected region and overlay masks, bounding boxes, and scores on the original image (Lines 1-10).

This workflow demonstrates an effective use case: human-guided refinement through live drawing tools. With just a few annotations, SAM 3 adapts the segmentation output, giving us precision control and immediate visual feedback.

In Figure 17, we can see the segmented regions according to the positive and negative bounding box prompts annotated by the user over the input image.

Figure 17: Interactive segmentation output produced from the user-drawn positive/negative box prompts (source: visualization by the author)

Interactive Segmentation Using Point-Based Refinement (Click to Guide the Model)

In this example, we segment using point prompts rather than text or bounding boxes. We click on the image to mark positive and negative points. The center of each clicked point becomes a guiding coordinate, and SAM 3 uses those coordinates to refine segmentation. This workflow provides fine-grained, pixel-level control, well suited for interactive editing or correction.

# Setup device
device = Accelerator().device

# Load model and processor
print("Loading SAM3 model...")
model = Sam3TrackerModel.from_pretrained("facebook/sam3").to(device)
processor = Sam3TrackerProcessor.from_pretrained("facebook/sam3")
print("Model loaded successfully!")

# Load image
IMAGE_PATH = "/content/dog-2.jpeg"
raw_image = Image.open(IMAGE_PATH).convert("RGB")

def pil_to_base64(img):
   """Convert PIL image to base64 for BBoxWidget"""
   buffer = io.BytesIO()
   img.save(buffer, format="PNG")
   return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()

We set up our compute device using the Accelerator() class. This automatically detects the GPU if available. We load the SAM 3 tracking model and processor. This variant supports point-based refinement and multi-mask output (Lines 2-7).

We load the dog image into memory and convert it to RGB format. The BBoxWidget expects image data in base64 format. We write a helper function to convert a PIL image to base64 (Lines 11-18).

def get_points_from_widget(widget):
   """Extract point coordinates from widget bboxes"""
   positive_points = []
   negative_points = []
 
   for ann in widget.bboxes:
       x = int(ann["x"])
       y = int(ann["y"])
       w = int(ann["width"])
       h = int(ann["height"])
     
       # Get center point of the bbox
       center_x = x + w // 2
       center_y = y + h // 2
     
       label = ann.get("label") or ann.get("class")
     
       if label == "positive":
           positive_points.append([center_x, center_y])
       elif label == "negative":
           negative_points.append([center_x, center_y])
 
   return positive_points, negative_points

We loop over bounding boxes drawn on the widget and convert them into point coordinates. Each tiny bounding box becomes a center point. We split them into (Lines 20-42):

  • positive points: object
  • negative points: background
def segment_from_widget(b=None):
   """Run segmentation with points from widget"""
   positive_points, negative_points = get_points_from_widget(widget)
 
   if not positive_points and not negative_points:
       print("⚠️ Please add at least one point (draw small boxes on the image)!")
       return
 
   # Combine points and labels
   all_points = positive_points + negative_points
   all_labels = [1] * len(positive_points) + [0] * len(negative_points)
 
   print(f"n🔄 Running segmentation...")
   print(f"  • {len(positive_points)} positive points: {positive_points}")
   print(f"  • {len(negative_points)} negative points: {negative_points}")
   # Prepare inputs (4D for points, 3D for labels)
   input_points = [[all_points]]  # [batch, object, points, xy]
   input_labels = [[all_labels]]   # [batch, object, labels]
 
   inputs = processor(
       images=raw_image,
       input_points=input_points,
       input_labels=input_labels,
       return_tensors="pt"
   ).to(device)
 
   # Run inference
   with torch.no_grad():
       outputs = model(**inputs)
 
   # Post-process masks
   masks = processor.post_process_masks(
       outputs.pred_masks.cpu(),
       inputs["original_sizes"]
   )[0]
 
   print(f"✅ Generated {masks.shape[1]} masks with shape {masks.shape}")
 
   # Visualize results
   visualize_results(masks, positive_points, negative_points)

This segment_from_widget function handles (Lines 44-83):

  • reading positive + negative points (Lines 46-58)
  • building SAM 3 inputs (Lines 60-68)
  • running inference (Lines 71 and 72)
  • post-processing masks (Lines 75-78)
  • visualizing results (Line 83)

We pack points and labels into the correct model format. The model generates multiple ranked masks, with the highest-quality mask ranked first at index 0.
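If you only care about the single best prediction, you can keep just the top-ranked mask inside segment_from_widget once post_process_masks has run. Here is a minimal sketch (best_mask is a hypothetical variable name, not part of the original code):

# Inside segment_from_widget, after post_process_masks:
best_mask = masks[0, 0].numpy() > 0   # top-ranked mask for the single prompted object
print(f"Best mask covers {int(best_mask.sum())} pixels")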

def visualize_results(masks, positive_points, negative_points):
   """Display segmentation results"""
   n_masks = masks.shape[1]
 
   # Create figure with subplots
   fig, axes = plt.subplots(1, min(n_masks, 3), figsize=(15, 5))
   if n_masks == 1:
       axes = [axes]
 
   for idx in range(min(n_masks, 3)):
       mask = masks[0, idx].numpy()
     
       # Overlay mask on image
       img_array = np.array(raw_image)
       colored_mask = np.zeros_like(img_array)
       colored_mask[mask > 0] = [0, 255, 0]  # Green mask
     
       overlay = img_array.copy()
       overlay[mask > 0] = (img_array[mask > 0] * 0.5 + colored_mask[mask > 0] * 0.5).astype(np.uint8)
     
       axes[idx].imshow(overlay)
       axes[idx].set_title(f"Mask {idx + 1} (Quality Ranked)", fontsize=12, fontweight='bold')
       axes[idx].axis('off')
     
       # Plot points on each mask
       for px, py in positive_points:
           axes[idx].plot(px, py, 'go', markersize=12, markeredgecolor='white', markeredgewidth=2.5)
       for nx, ny in negative_points:
           axes[idx].plot(nx, ny, 'ro', markersize=12, markeredgecolor='white', markeredgewidth=2.5)
 
   plt.tight_layout()
   plt.show()

We overlay segmentation masks over the original image. Positive points are displayed as green dots. Negative points are shown in red (Lines 85-116).

def reset_widget(b=None):
   """Clear all annotations"""
   widget.bboxes = []
   print("🔄 Reset! All points cleared.")

This clears previously selected points so we can start fresh (Lines 118-121).

# Create widget for point selection
widget = BBoxWidget(
   image=pil_to_base64(raw_image),
   classes=["positive", "negative"]
)

Users draw small boxes anywhere on the image; the center of each box becomes a point prompt. The widget captures both the position and the label (Lines 124-127).

# Create UI buttons
segment_button = widgets.Button(
   description='🎯 Segment',
   button_style='success',
   tooltip='Run segmentation with marked points',
   icon='check',
   layout=widgets.Layout(width='150px', height='40px')
)
segment_button.on_click(segment_from_widget)

reset_button = widgets.Button(
   description='🔄 Reset',
   button_style='warning',
   tooltip='Clear all points',
   icon='refresh',
   layout=widgets.Layout(width='150px', height='40px')
)
reset_button.on_click(reset_widget)

We create UI buttons for:

  • running segmentation (Lines 130-137)
  • clearing annotations (Lines 139-146)
# Display UI
print("=" * 70)
print("🎨 INTERACTIVE SAM3 SEGMENTATION WITH BOUNDING BOX WIDGET")
print("=" * 70)
print("n📋 Instructions:")
print("  1. Draw SMALL boxes on the image where you want to mark points")
print("  2. Label them as 'positive' (object) or 'negative' (background)")
print("  3. The CENTER of each box will be used as a point coordinate")
print("  4. Click 'Segment' button to run SAM3")
print("  5. Click 'Reset' to clear all points and start over")
print("n💡 Tips:")
print("  • Draw tiny boxes - just big enough to see")
print("  • Positive points = parts of the object you want")
print("  • Negative points = background areas to exclude")
print("n" + "=" * 70 + "n")

display(widgets.HBox([segment_button, reset_button]))
display(widget)

We render the interface side-by-side. The user can now:

  • click positive points
  • click negative points
  • run segmentation live
  • reset anytime

Output

In Figure 18, we can see the whole point-based segmentation process.

Figure 18: Point-based interactive refinement workflow: selecting points and generating ranked masks (source: GIF by the author).


What’s next? We recommend PyImageSearch University.

Course information:
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: February 2026
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled

I strongly believe that if you had the right teacher you could master computer vision and deep learning.

Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?

That’s not the case.

All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.

If you’re serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.

Inside PyImageSearch University you’ll find:

  • ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
  • ✓ 86 Certificates of Completion
  • ✓ 115+ hours of on-demand video
  • ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
  • ✓ Pre-configured Jupyter Notebooks in Google Colab
  • ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
  • ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
  • ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
  • ✓ Access on mobile, laptop, desktop, etc.

Click here to join PyImageSearch University


Summary

In Part 2 of this tutorial, we explored the advanced capabilities of SAM 3, transforming it from a powerful segmentation tool into a flexible, interactive visual query system. We demonstrated how to leverage multiple prompt types (text, bounding boxes, and points) both individually and in combination to achieve precise, context-aware segmentation results.

We covered sophisticated workflows, including:

  • Segmenting multiple concepts simultaneously in the same image
  • Processing batches of images with different prompts efficiently
  • Using positive bounding boxes to focus on regions of interest
  • Employing negative prompts to exclude unwanted areas
  • Combining text and visual prompts for selective, fine-grained control
  • Building fully interactive segmentation interfaces where users can draw boxes or click points and see results in real-time

These techniques showcase SAM 3’s versatility for real-world applications. Whether you’re building large-scale data annotation pipelines, developing intelligent video editing tools, creating AR experiences, or conducting scientific research, the multi-modal prompting capabilities we explored give you pixel-perfect control over segmentation outputs.


Citation Information

Thakur, P. “Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation,” PyImageSearch, P. Chugh, S. Huot, G. Kudriavtsev, and A. Sharma, eds., 2026, https://pyimg.co/5c4ag

@incollection{Thakur_2026_advanced-sam-3-multi-modal-prompting-and-interactive-segmentation,
  author = {Piyush Thakur},
  title = {{Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Georgii Kudriavtsev and Aditya Sharma},
  year = {2026},
  url = {https://pyimg.co/5c4ag},
}

To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!

Download the Source Code and FREE 17-page Resource Guide

Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you’ll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!

