Part 2: Guide to Hugging Face AutoModels for Vision and Audio

In the first part of this article, we explored NLP AutoModels — classification, question answering, text generation, and more.

Now let’s move beyond text.

The Hugging Face Transformers library also supports vision, audio, and multimodal tasks using the same powerful AutoModel philosophy.

Vision models work with pixels instead of words, but the idea remains the same:

  • Pretrained on large image datasets
  • Fine-tuned on your specific task
  • Used for inference in production systems

Before models see images, they need preprocessing.

Image Preprocessing

from transformers import AutoImageProcessor
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

This handles:

  • Resizing
  • Normalization
  • Pixel conversion
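
As a quick check, here is a minimal sketch of what the processor produces, assuming "cat.jpg" is a placeholder path for any local RGB image and the processor loaded above:

from PIL import Image

# "cat.jpg" is a hypothetical image path
image = Image.open("cat.jpg").convert("RGB")

inputs = processor(image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 224, 224]) for this ViT checkpoint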

1. AutoModelForImageClassification

Assigns one label per image.

from transformers import AutoModelForImageClassification

Example Tasks

  • Cat vs Dog classification
  • Medical image diagnosis
  • Product categorization
  • Defect detection

Fine-Tuning Pipeline

Step 1: Load Pretrained Model & Image Processor

Vision models do not use tokenizers.
They use image processors.

from transformers import AutoImageProcessor, AutoModelForImageClassification

model_name = "google/vit-base-patch16-224"

processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForImageClassification.from_pretrained(
    model_name,
    num_labels=2,
    ignore_mismatched_sizes=True  # replace the 1000-class ImageNet head with a fresh 2-class head
)

What the image processor does:

  • Resize images
  • Normalize pixel values
  • Convert images → tensors

Step 2: Dataset Format

dataset/
├── cat/
│ ├── img1.jpg
│ ├── img2.jpg
├── dog/
│ ├── img3.jpg
│ ├── img4.jpg

Step 3: Load Dataset

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="dataset/")

From this folder structure, Hugging Face automatically infers:

  • Images
  • Labels
  • Class names
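
As a quick sanity check (a minimal sketch, assuming the cat/dog folder layout above), you can inspect what was inferred:

print(dataset)                                   # DatasetDict with a "train" split
print(dataset["train"].features["label"].names)  # e.g. ['cat', 'dog']
print(dataset["train"][0]["image"])              # a PIL.Image object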

Step 4: Preprocess Images

Just like tokenization in NLP, images must be processed.

def transform(batch):
    # with_transform passes batches, so batch["image"] is a list of PIL images
    batch["pixel_values"] = processor(
        batch["image"],
        return_tensors="pt"
    )["pixel_values"]
    del batch["image"]  # keep only tensors for the default data collator
    return batch

dataset = dataset.with_transform(transform)

Step 5: Fine-Tuning

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./image_classifier",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="no",
    logging_steps=10,
    remove_unused_columns=False
)

### Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"]
)

trainer.train()

### Save the model
model.save_pretrained("./image_classifier")
processor.save_pretrained("./image_classifier")

### Inference
### Load the model
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

processor = AutoImageProcessor.from_pretrained("./image_classifier")
model = AutoModelForImageClassification.from_pretrained("./image_classifier")

### Predict the image
image = Image.open("test.jpg")

inputs = processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_class = logits.argmax(-1).item()

print(model.config.id2label[predicted_class])

### Output: Dog
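
If you also want a confidence score rather than just the label, a softmax over the logits gives per-class probabilities (a small optional addition to the inference code above):

probs = torch.softmax(logits, dim=-1)[0]
print(f"{model.config.id2label[predicted_class]}: {probs[predicted_class].item():.2%}")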

2. AutoModelForObjectDetection

AutoModelForObjectDetection loads a pretrained vision model with an object detection head.

from transformers import AutoModelForObjectDetection

The model predicts:

  • Object labels
  • Bounding boxes (x, y, width, height or xmin, ymin, xmax, ymax)
  • Confidence scores

Example of bounding boxes generated for an image.

Use AutoModelForObjectDetection if:

  • An image contains multiple objects
  • You need locations, not just labels
  • Objects may appear more than once

Common Real-World Use Cases

  • Autonomous driving (cars, pedestrians, signs)
  • Retail shelf monitoring
  • Face detection
  • Traffic analysis
  • Security & surveillance
  • Sports analytics

Example

Input: A street image

Output: Labeled bounding boxes

Car           → box (x1, y1, x2, y2)
Person        → box (x1, y1, x2, y2)
Traffic light → box (x1, y1, x2, y2)
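
Before fine-tuning anything, you can get a feel for detection with the high-level pipeline API. A minimal sketch, assuming "street.jpg" is a placeholder for any local image:

from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Each result has a label, a confidence score, and a box (xmin, ymin, xmax, ymax)
for det in detector("street.jpg"):
    print(det["label"], round(det["score"], 2), det["box"])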

Popular Detection Models Behind the Scenes

AutoModel automatically selects the correct architecture, such as:

  • DETR (Detection Transformer)
  • YOLOS
  • RT-DETR
  • Faster R-CNN (via adapters)

Fine-Tuning Pipeline

Step 1: Load Pretrained Model and Image Processor

from transformers import AutoImageProcessor, AutoModelForObjectDetection

model_name = "facebook/detr-resnet-50"

processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForObjectDetection.from_pretrained(
    model_name,
    num_labels=3,  # example: person, car, bicycle
    ignore_mismatched_sizes=True  # replace the pretrained COCO head with a fresh 3-class head
)

Step 2: Dataset Format

Unlike image classification, labels are not single values.

Each image needs:

  • Object labels
  • Bounding box coordinates

Example Annotated image (COCO style)

{
  "image": "image1.jpg",
  "objects": {
    "bbox": [[50, 60, 200, 300], [300, 100, 400, 350]],
    "category": [0, 1]
  }
}

Load Dataset

from datasets import load_dataset

dataset = load_dataset("coco", split="train")

### Preprocess
def transform(example):
    image = example["image"]
    annotations = example["objects"]

    # Note: DETR-style processors expect COCO-format annotation dicts
    # ({"image_id": ..., "annotations": [...]}); adapt your fields accordingly.
    encoding = processor(
        images=image,
        annotations=annotations,
        return_tensors="pt"
    )

    example["pixel_values"] = encoding["pixel_values"][0]
    example["labels"] = encoding["labels"][0]
    return example

dataset = dataset.with_transform(transform)

### Fine-tuning
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./object_detector",
    per_device_train_batch_size=4,
    num_train_epochs=5,
    logging_steps=10,
    remove_unused_columns=False
)

### Training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

trainer.train()

### Save the model
model.save_pretrained("./object_detector")
processor.save_pretrained("./object_detector")

### Inference
### Load the model
from transformers import AutoImageProcessor, AutoModelForObjectDetection
from PIL import Image
import torch

processor = AutoImageProcessor.from_pretrained("./object_detector")
model = AutoModelForObjectDetection.from_pretrained("./object_detector")

### Detection
image = Image.open("test.jpg")

inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

### Post-processing
target_sizes = torch.tensor([image.size[::-1]])

results = processor.post_process_object_detection(
    outputs,
    threshold=0.5,
    target_sizes=target_sizes
)[0]

### Example output
for score, label, box in zip(
    results["scores"],
    results["labels"],
    results["boxes"]
):
    print(
        model.config.id2label[label.item()],
        score.item(),
        box.tolist()
    )

### Output
person 0.98 [34.5, 60.1, 210.2, 380.4]
car 0.95 [260.3, 120.0, 420.8, 340.7]
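
To turn these predictions into a labeled image, you can draw the boxes with PIL's ImageDraw. A minimal visualization sketch, building on the image, results, and model from the inference code above:

from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    x1, y1, x2, y2 = box.tolist()
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(0, y1 - 12)), f"{model.config.id2label[label.item()]} {score.item():.2f}", fill="red")

image.save("test_annotated.jpg")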

How to Generate Labeled Images (with Bounding Boxes)

Tools to Label Images for Object Detection Training

Object detection models require bounding box annotations around objects. These annotations are usually stored in formats like COCO, Pascal VOC, or YOLO.

Below are the best tools used in real-world projects, from beginners to enterprise scale.

  • LabelImg: a simple, open-source desktop tool for drawing bounding boxes.
  • CVAT (Computer Vision Annotation Tool): built for large datasets and teams.
  • Label Studio: an open-source, multi-purpose data labeling platform.
  • Roboflow: a full dataset management platform (labeling + augmentation + hosting).

Annotation format

| Format     | Used By            |
| ---------- | ------------------ |
| COCO | DETR, Faster R-CNN |
| YOLO | YOLO models |
| Pascal VOC | Legacy models |
| JSON | Custom pipelines |
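
These formats mainly differ in how a box is encoded: COCO uses [x, y, width, height] in pixels, Pascal VOC uses [xmin, ymin, xmax, ymax] in pixels, and YOLO uses normalized [x_center, y_center, width, height]. A small conversion sketch (illustrative helper functions, not part of any library):

def coco_to_voc(box):
    # [x, y, w, h] -> [xmin, ymin, xmax, ymax]
    x, y, w, h = box
    return [x, y, x + w, y + h]

def coco_to_yolo(box, img_w, img_h):
    # [x, y, w, h] in pixels -> normalized [x_center, y_center, w, h]
    x, y, w, h = box
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

print(coco_to_voc([100, 200, 80, 250]))               # [100, 200, 180, 450]
print(coco_to_yolo([100, 200, 80, 250], 1280, 720))   # normalized values in [0, 1]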

COCO (Common Objects in Context) format is the most widely used annotation format for object detection, segmentation, and keypoints.

High-Level Structure of a COCO Annotation File

A COCO annotation file is a single JSON file with these main sections:

{
  "images": [],
  "annotations": [],
  "categories": []
}

### Images section
"images": [
  {
    "id": 1,
    "file_name": "street1.jpg",
    "width": 1280,
    "height": 720
  }
]

### Categories section
"categories": [
  {
    "id": 1,
    "name": "person"
  },
  {
    "id": 2,
    "name": "car"
  },
  {
    "id": 3,
    "name": "traffic light"
  }
]

### Annotations section
"annotations": [
  {
    "id": 1,
    "image_id": 1,
    "category_id": 1,
    "bbox": [100, 200, 80, 250],
    "area": 20000,
    "iscrowd": 0
  }
]

3. AutoModelForSemanticSegmentation

Fine-Tuning & Inference Explained (Pixel-Level Understanding)

If object detection answers:

“What objects are in the image and where?”

Then semantic segmentation answers:

“What is every pixel in this image?”

This is the most detailed form of image understanding.

AutoModelForSemanticSegmentation loads a pretrained vision backbone with a pixel-wise classification head.

When Should You Use This Model?

Use AutoModelForSemanticSegmentation if:

  • You need pixel-accurate results
  • Object boundaries matter
  • Each pixel belongs to exactly one class

Common Real-World Use Cases

  • Autonomous driving (road, car, sidewalk, sky)
  • Medical imaging (organs, tumors)
  • Satellite imagery (water, land, buildings)
  • Background removal
  • Scene understanding

Example Problem

Input image: Street scene

Output:

Road     → gray
Car      → blue
Person   → green
Sky      → light blue
Building → brown

Fine-Tuning Pipeline

from transformers import AutoImageProcessor, AutoModelForSemanticSegmentation

model_name = "nvidia/segformer-b0-finetuned-ade-512-512"

processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForSemanticSegmentation.from_pretrained(model_name)

Image segmentation example

Dataset Format

Unlike detection:

  • You do not annotate boxes
  • You annotate masks

dataset/
├── images/
│ ├── img1.jpg
│ ├── img2.jpg
├── masks/
│ ├── img1.png
│ ├── img2.png

### Load Dataset
from datasets import load_dataset

dataset = load_dataset(
    "imagefolder",
    data_dir="dataset/",
)
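
Depending on your datasets version and layout, imagefolder may not automatically pair each image with its mask. A minimal sketch (assuming the images/ and masks/ layout above, with matching filenames) that builds the pairs explicitly:

import os
from datasets import Dataset, Image as HFImage

image_dir = "dataset/images"
mask_dir = "dataset/masks"

files = sorted(os.listdir(image_dir))
pairs = {
    "image": [os.path.join(image_dir, f) for f in files],
    "mask": [os.path.join(mask_dir, os.path.splitext(f)[0] + ".png") for f in files],
}

dataset = Dataset.from_dict(pairs)
dataset = dataset.cast_column("image", HFImage()).cast_column("mask", HFImage())
dataset = dataset.train_test_split(test_size=0.1)  # provides dataset["train"] for the Trainer below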

### Preprocess images
def transform(example):
    image = example["image"]
    mask = example["mask"]

    inputs = processor(
        images=image,
        segmentation_maps=mask,
        return_tensors="pt"
    )

    example["pixel_values"] = inputs["pixel_values"][0]
    example["labels"] = inputs["labels"][0]
    return example

dataset = dataset.with_transform(transform)

### Fine-tuning
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./segmentation_model",
    per_device_train_batch_size=4,
    num_train_epochs=5,
    logging_steps=10,
    remove_unused_columns=False
)

### Training the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"]
)

trainer.train()

### Save the model
model.save_pretrained("./segmentation_model")
processor.save_pretrained("./segmentation_model")

### Inference
from transformers import AutoImageProcessor, AutoModelForSemanticSegmentation
from PIL import Image
import torch

processor = AutoImageProcessor.from_pretrained("./segmentation_model")
model = AutoModelForSemanticSegmentation.from_pretrained("./segmentation_model")

### Running inference
image = Image.open("test.jpg")

inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits

### Convert logits to a mask
### Note: the logits are lower resolution than the input image; upsample if you need a full-size mask
predicted_mask = logits.argmax(dim=1)[0]
### Each pixel in predicted_mask corresponds to a class ID.

### Visualization
Original Image → Segmentation Mask → Colored Overlay
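
A minimal sketch of that visualization, assuming image and predicted_mask from the inference code above (the color palette is arbitrary, just one fixed color per class ID):

import numpy as np
from PIL import Image

mask = predicted_mask.cpu().numpy().astype(np.uint8)

# Fixed pseudo-random RGB color per class ID
palette = np.random.default_rng(0).integers(0, 255, size=(256, 3), dtype=np.uint8)
color_mask = Image.fromarray(palette[mask]).resize(image.size, resample=Image.NEAREST)

overlay = Image.blend(image.convert("RGB"), color_mask, alpha=0.5)
overlay.save("segmentation_overlay.jpg")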

4. AutoModelForVision2Seq

Image → Text Explained (Captioning, OCR, Visual Reasoning)

AutoModelForVision2Seq is used when your input is an image and your output is text.

If:

  • Image classification → image → label
  • Object detection → image → boxes
  • Segmentation → image → pixels

Then Vision2Seq answers:

“Describe or reason about this image in words.”

AutoModelForVision2Seq is a vision encoder + text decoder model.

The model:

  1. Encodes the image (vision transformer / CNN)
  2. Decodes text token by token (language model)

Use AutoModelForVision2Seq if you want:

  • Image captioning
  • OCR + text generation
  • Visual question answering (image → text)
  • Document understanding
  • Multimodal assistants

Real-World Use Cases

  • Accessibility tools (describe images)
  • Invoice & document processing
  • Screenshot understanding
  • Image-based chatbots
  • Visual reasoning systems

Behind AutoModelForVision2Seq, Hugging Face loads models like:

  • BLIP / BLIP-2
  • Donut (document understanding)
  • Kosmos
  • VisionEncoderDecoder models

You don’t need to manage architectures manually.

Example

Input: Image of a street with cars and people

Output: “A busy street with pedestrians crossing and cars stopped at a traffic light.”

Pipeline

### Model loading
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "Salesforce/blip-image-captioning-base"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)

### Inference (Image → Text)
from PIL import Image
import torch

image = Image.open("street.jpg").convert("RGB")

### Caption generation
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=30
    )

caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)

### Example output
A city street with cars and people walking across the road.

Example Training Data

Image: street.jpg
Target text: "A busy street with pedestrians and cars."

Dataset format

data = {
    "image": ["img1.jpg", "img2.jpg"],
    "text": [
        "A dog playing in the park.",
        "A person riding a bicycle."
    ]
}

Vision2Seq models are pretrained on massive data,
but fine-tuning is how you make them useful for your problem.

Minimal Fine-Tuning Example

### Loading the model
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "Salesforce/blip-image-captioning-base"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)

### Example dataset
data = [
    {
        "image": "img1.jpg",
        "text": "A yellow taxi driving down a city street."
    },
    {
        "image": "img2.jpg",
        "text": "A person crossing the road at a traffic light."
    }
]

### Preprocessing
from datasets import Dataset
from PIL import Image

def preprocess(example):
    image = Image.open(example["image"]).convert("RGB")

    inputs = processor(
        images=image,
        text=example["text"],
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )

    # Drop the batch dimension added by return_tensors="pt"
    inputs = {k: v[0] for k, v in inputs.items()}
    inputs["labels"] = inputs["input_ids"]
    return inputs

# Build a Dataset from the examples above and apply the preprocessing
dataset = Dataset.from_list(data).map(preprocess)

### Training
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./vision2seq",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    fp16=True,
    remove_unused_columns=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

trainer.train()

In the next part (Part 3), I will cover Hugging Face audio models. Thanks for your time and attention!

