Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export

digitado ⋅ May 26, 2026

In this tutorial, we explore the TuringEnterprises/Open-MM-RL dataset as a practical foundation for multimodal reasoning and reinforcement learning with verifiable rewards. We load the dataset, inspect its schema, analyze domains, formats, question lengths, answer types, and image distributions, and visualize representative examples from each domain. We also build a lightweight reward function that checks exact, numeric, fractional, LaTeX, and symbolic answers, giving us a useful way to evaluate model outputs. Finally, we format prompts for vision-language models, optionally test SmolVLM on sample examples, and export the dataset into a GRPO-style structure for future multimodal RL training.


import subprocess, sys
subprocess.run([sys.executable, "-m", "pip", "-q", "install",
               "datasets>=3.0", "huggingface_hub>=0.24", "transformers>=4.45",
               "Pillow", "matplotlib", "pandas", "numpy", "sympy",
               "accelerate", "tqdm"], check=True)
import os, re, io, json, math, random, textwrap, hashlib, warnings
from collections import Counter
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
import sympy as sp
from datasets import load_dataset
warnings.filterwarnings("ignore")
random.seed(0); np.random.seed(0)
pd.set_option("display.max_colwidth", 120)
DS_ID = "TuringEnterprises/Open-MM-RL"
ds = load_dataset(DS_ID, split="train")
print(f"Loaded {DS_ID} — {len(ds)} rows")
print("Features:", ds.features)
print("Row 0 keys:", list(ds[0].keys()))

We install all required libraries and import the core tools needed for dataset loading, analysis, visualization, symbolic math, and file handling. We set random seeds for reproducibility and configure pandas so that longer text fields display clearly. We then load the TuringEnterprises/Open-MM-RL dataset from Hugging Face and inspect its size, features, and first-row structure.


df = ds.remove_columns(["images"]).to_pandas()
df["n_images"]    = [len(ex["images"]) for ex in ds]
df["q_len_chars"] = df["question"].str.len()
df["a_len_chars"] = df["answer"].str.len()
print("\n=== Domain ==="); print(df["domain"].value_counts())
print("\n=== Format ==="); print(df["format"].value_counts())
print("\n=== Sub-domain (top by domain) ===")
print(df.groupby("domain")["subDomain"].value_counts().head(15))
print(f"\nMean images/example: {df['n_images'].mean():.2f}   max: {df['n_images'].max()}")
print(f"Median Q length: {df['q_len_chars'].median():.0f}   "
     f"Median A length: {df['a_len_chars'].median():.0f}")
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
df["domain"].value_counts().plot.bar(ax=axes[0], color="#4C72B0")
axes[0].set_title("Examples per domain"); axes[0].set_ylabel("count")
df["format"].value_counts().plot.bar(ax=axes[1], color="#55A868")
axes[1].set_title("Image-format type"); axes[1].tick_params(axis='x', rotation=25)
df["n_images"].plot.hist(ax=axes[2], bins=range(1, df["n_images"].max() + 2),
                        color="#C44E52", edgecolor="white")
axes[2].set_title("Images per example"); axes[2].set_xlabel("n_images")
plt.tight_layout(); plt.show()
def img_stats(ex):
   sizes = [im.size for im in ex["images"]]
   modes = [im.mode for im in ex["images"]]
   return {
       "n_images": len(sizes),
       "min_w": min(w for w, h in sizes), "max_w": max(w for w, h in sizes),
       "min_h": min(h for w, h in sizes), "max_h": max(h for w, h in sizes),
       "modes": "|".join(sorted(set(modes))),
       "total_pixels": sum(w * h for w, h in sizes),
   }
img_df = pd.DataFrame([img_stats(ex) for ex in ds])
print("\n=== Image resolution stats ===")
print(img_df[["min_w", "max_w", "min_h", "max_h", "total_pixels"]].describe().round(0))
print("\nMode mix:", Counter("|".join(img_df["modes"]).split("|")))

We convert the dataset into a DataFrame after removing the image column, then calculate useful fields such as the number of images, question length, and answer length. We analyze domain counts, format distribution, sub-domain breakdowns, and basic text/image statistics. We also create charts to visualize the number of examples per domain, the image formats, and the distribution of images per example.
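One detail worth guarding against: img_stats calls min() and max() on each example's image sizes, which raises ValueError if an example has no images. Whether Open-MM-RL contains such rows is not stated here, so the sketch below is only a defensive variant under that assumption. The helper name safe_img_stats and the sample sizes are hypothetical, and it takes plain (width, height) tuples so it runs without PIL.

```python
def safe_img_stats(sizes):
    """Summarize a list of (width, height) tuples; tolerate an empty list."""
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    return {
        "n_images": len(sizes),
        # default=0 keeps min()/max() from raising ValueError on empty input
        "min_w": min(widths, default=0), "max_w": max(widths, default=0),
        "min_h": min(heights, default=0), "max_h": max(heights, default=0),
        "total_pixels": sum(w * h for w, h in sizes),
    }

two = safe_img_stats([(640, 480), (100, 200)])  # made-up sizes for illustration
none = safe_img_stats([])                        # an example with no images
print(two)
print(none)
```

With real data, the same function could be called as safe_img_stats([im.size for im in ex["images"]]).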


def show_example(ex, max_chars=600):
   print("=" * 80)
   print(f"id={ex['conversation_id']}   {ex['domain']} / {ex['subDomain']}")
   print(f"format={ex['format']}   n_images={len(ex['images'])}")
   print("-" * 80)
   q = ex["question"][:max_chars] + ("..." if len(ex["question"]) > max_chars else "")
   print("Q:", textwrap.fill(q, 100))
   print("-" * 80)
   print("A (gold):", ex["answer"])
   n = len(ex["images"])
   fig, axes = (plt.subplots(1, n, figsize=(5 * n, 5)) if n > 1
                else plt.subplots(1, 1, figsize=(6, 6)))
   axes = np.atleast_1d(axes)
   for ax, im in zip(axes, ex["images"]):
       ax.imshow(im); ax.set_xticks([]); ax.set_yticks([])
       ax.set_title(f"{im.size[0]}×{im.size[1]}  ({im.mode})")
   plt.tight_layout(); plt.show()
for dom in df["domain"].unique():
   idx = int(df[df["domain"] == dom].index[0])
   show_example(ds[idx])
LATEX_PAT = re.compile(r"\\\[[\s\S]+?\\\]|\\\([\s\S]+?\\\)|\$[^$]+\$")
df["latex_blocks_q"] = df["question"].apply(lambda s: len(LATEX_PAT.findall(s or "")))
df["latex_blocks_a"] = df["answer"].apply(lambda s: len(LATEX_PAT.findall(s or "")))
print("\n=== LaTeX blocks per field ===")
print(df[["latex_blocks_q", "latex_blocks_a"]].describe().round(2))
def classify_answer(a):
   s = (a or "").strip().strip("$ []").strip()
   s_no_dollar = s.replace("$", "")
   if re.fullmatch(r"-?\s*\d+(\.\d+)?\s*", s_no_dollar):       return "integer/float"
   if any(t in s for t in ["\\sqrt", "\\frac", "\\pi", "^", "_", "\\kappa", "\\lceil"]):
       return "symbolic"
   if re.fullmatch(r"[-+0-9./()\s\\a-zA-Z{}]+", s) and any(c.isdigit() for c in s):
       return "numeric_expr"
   return "text"
df["answer_type"] = df["answer"].apply(classify_answer)
print("\n=== Answer-type breakdown ==="); print(df["answer_type"].value_counts())
print("\n=== Answer-type × domain ===")
print(pd.crosstab(df["domain"], df["answer_type"]))

We define a helper function to display one representative example from each domain, including its question, gold answer, and associated images. We use this visual inspection step to better understand how multimodal reasoning problems are structured across different domains. We then analyze LaTeX usage in questions and answers, classify answer types, and compare answer-type distributions across domains.
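To make the LaTeX-counting step concrete, here is a small standalone sketch that applies the same kind of pattern to two hand-written strings. The helper name count_latex_blocks and the sample sentences are illustrative assumptions, not part of the dataset.

```python
import re

# Matches \[ ... \], \( ... \), or $ ... $ (non-greedy, may span lines)
LATEX_PAT = re.compile(r"\\\[[\s\S]+?\\\]|\\\([\s\S]+?\\\)|\$[^$]+\$")

def count_latex_blocks(text):
    """Count delimited LaTeX spans in a string; None counts as empty."""
    return len(LATEX_PAT.findall(text or ""))

sample = r"Compute \(x^2\) given $a=1$ and \[ y = 2 \]."
n_math = count_latex_blocks(sample)                    # three delimited spans
n_plain = count_latex_blocks("The ticket costs $5.")   # a lone dollar sign is not a block
print(n_math, n_plain)
```

A lone dollar sign does not count because the pattern requires a closing `$`, although two unrelated dollar amounts in one sentence would still be counted as a single block.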


EXTRACT_PATS = [
   r"\\boxed\{([^{}]+)\}",
   r"final\s+answer\s*[:=]\s*([^\n]+)",
   r"answer\s*[:=]\s*([^\n]+)",
]
def extract_final(text):
   if not text: return ""
   for p in EXTRACT_PATS:
       m = re.search(p, text, flags=re.IGNORECASE)
       if m: return m.group(1).strip().strip(".,;")
   lines = [l.strip() for l in str(text).strip().splitlines() if l.strip()]
   return lines[-1] if lines else ""
def latex_to_sympy(s):
   s = (s or "").strip().strip("$").strip()
   s = re.sub(r"^\\[\[(]", "", s); s = re.sub(r"\\[\])]$", "", s)
   s = (s.replace("\\pi", "pi").replace("\\cdot", "*").replace("\\times", "*")
          .replace("\\,", "").replace("\\;", "").replace("\\!", ""))
   s = re.sub(r"\\frac\s*\{([^{}]+)\}\s*\{([^{}]+)\}", r"((\1)/(\2))", s)
   s = re.sub(r"\\sqrt\s*\{([^{}]+)\}", r"sqrt(\1)", s)
   s = s.replace("^", "**")
   s = re.sub(r"\\[a-zA-Z]+", "", s)
   s = s.replace("{", "(").replace("}", ")")
   return s
def grade(pred, gold, tol=1e-4):
   """Verifiable reward in [0,1]: exact > numeric > sympy-symbolic > partial."""
   if pred is None or gold is None: return 0.0
   p = extract_final(str(pred)).strip()
   g = str(gold).strip()
   norm = lambda x: re.sub(r"\s+", "", x.lower()).strip("$.,;[]()")
   if norm(p) == norm(g): return 1.0
   def to_float(x):
       try: return float(latex_to_sympy(x))
       except Exception:
           try: return float(sp.sympify(latex_to_sympy(x)).evalf())
           except Exception: return None
   fp, fg = to_float(p), to_float(g)
   if fp is not None and fg is not None:
       if abs(fp - fg) / max(1.0, abs(fg)) < tol: return 1.0
   try:
       ep = sp.sympify(latex_to_sympy(p)); eg = sp.sympify(latex_to_sympy(g))
       if sp.simplify(ep - eg) == 0: return 1.0
   except Exception:
       pass
   if norm(g) and norm(g) in norm(p): return 0.5
   return 0.0
print("\n=== Grader sanity checks ===")
for pred, gold, want in [
   ("The answer is \\boxed{120}",            "[120]",            1.0),
   ("After computing: 7396 \\pi",            "7396\\pi",         1.0),
   ("Final answer: -71/4",                   "-\\frac{71}{4}",   1.0),
   ("Therefore the result is 0.0074",        "0.0074",           1.0),
   ("Final answer: nucleus accumbens",       "Nucleus accumbens",1.0),
   ("I don't know",                          "12",               0.0),
]:
   print(f"  pred={pred[:38]!r:42s} gold={gold!r:22s} -> r={grade(pred, gold)}  (want {want})")
SYSTEM = ("You are a STEM expert solving multimodal reasoning problems. "
         "You will see a question and one or more figures. "
         "Reason step by step, then end with exactly one line:\n"
         "Final answer: <your answer>")
def build_prompt(ex):
   img_tags = "\n".join(f"[Image {i+1}]" for i in range(len(ex["images"])))
   return f"{SYSTEM}\n\n{img_tags}\n\nQuestion:\n{ex['question']}\n\nLet's think step by step."
print("\n=== Example prompt (truncated) ===")
print(build_prompt(ds[0])[:600], "...\n")

We build a verifiable reward function that extracts final answers and compares predictions against gold answers using exact, numeric, and symbolic matching. We also add a LaTeX-to-SymPy conversion helper, allowing mathematical expressions to be evaluated more reliably. We test the grader with sanity checks and then create a structured prompt format for vision-language model reasoning.
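The numeric branch of the grader is the easiest one to reason about in isolation. The sketch below reproduces its tolerance rule using only the standard library: fractions.Fraction stands in for SymPy, and to_number and numeric_match are hypothetical helper names. It handles only plain decimals and integer fractions, so treat it as an illustration of the comparison rule rather than a replacement for grade.

```python
import re
from fractions import Fraction

# \frac{a}{b} with integer numerator and denominator
FRAC_PAT = re.compile(r"\\frac\s*\{\s*(-?\d+)\s*\}\s*\{\s*(-?\d+)\s*\}")

def to_number(s):
    """Parse strings like '-\\frac{71}{4}', '-71/4', or '0.0074'; None if not numeric."""
    s = s.strip().strip("$").replace(" ", "")
    s = FRAC_PAT.sub(r"\1/\2", s)          # \frac{71}{4} -> 71/4
    try:
        return float(Fraction(s))
    except (ValueError, ZeroDivisionError):
        return None

def numeric_match(pred, gold, tol=1e-4):
    """Same rule as the grader: |p - g| / max(1, |g|) < tol."""
    p, g = to_number(pred), to_number(gold)
    if p is None or g is None:
        return False
    return abs(p - g) / max(1.0, abs(g)) < tol

print(numeric_match("-71/4", r"-\frac{71}{4}"))
print(numeric_match("0.00741", "0.0074"))
print(numeric_match("I don't know", "12"))
```

Because the denominator is max(1, |gold|), the tolerance is effectively absolute for gold values below 1. A prediction of 0.00741 therefore passes against 0.0074 even though it is off by more than 0.1% in relative terms, which may or may not be the strictness you want for small quantities.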


import torch
USE_VLM = torch.cuda.is_available()
print(f"CUDA available: {USE_VLM}")
if USE_VLM:
   try:
       from transformers import AutoProcessor, AutoModelForVision2Seq
       MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"
       print(f"Loading {MODEL_ID} (this takes ~1 min) ...")
       processor = AutoProcessor.from_pretrained(MODEL_ID)
       model = AutoModelForVision2Seq.from_pretrained(
           MODEL_ID, torch_dtype=torch.float16, device_map="auto"
       )
       def vlm_solve(ex, max_new_tokens=512):
           imgs = [im.convert("RGB") for im in ex["images"]]
           content = [{"type": "image"} for _ in imgs]
           content.append({"type": "text", "text": build_prompt(ex)})
           text = processor.apply_chat_template(
               [{"role": "user", "content": content}], add_generation_prompt=True)
           inputs = processor(text=text, images=imgs, return_tensors="pt").to(model.device)
           with torch.no_grad():
               out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
           return processor.batch_decode(
               out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
       rows, sample_idx = [], random.sample(range(len(ds)), 6)
       for i in sample_idx:
           ex = ds[i]
           try:
               pred = vlm_solve(ex)
               r = grade(pred, ex["answer"])
           except Exception as e:
               pred, r = f"<error: {e}>", 0.0
           rows.append({"id": ex["conversation_id"], "domain": ex["domain"],
                        "reward": r, "pred_tail": pred[-200:]})
           print(f"  id={ex['conversation_id']}  {ex['domain']:9s}  r={r:.2f}")
       res = pd.DataFrame(rows)
       print(f"\nMean reward over {len(res)} samples: {res['reward'].mean():.3f}")
       print(res.groupby("domain")["reward"].mean().rename("avg_reward"))
   except Exception as e:
       print(f"VLM run failed ({e}); reward & data pipeline remain usable.")
else:
   print("No GPU detected — skipping live VLM inference (Runtime → Change runtime type → GPU).")
out_dir = Path("/content/open_mm_rl_processed"); out_dir.mkdir(exist_ok=True, parents=True)
img_dir = out_dir / "images"; img_dir.mkdir(exist_ok=True)
records = []
for ex in ds:
   paths = []
   for j, im in enumerate(ex["images"]):
       p = img_dir / f"{ex['conversation_id']}_{j}.png"
       im.convert("RGB").save(p)
       paths.append(str(p))
   records.append({
       "id":         ex["conversation_id"],
       "domain":     ex["domain"],
       "subDomain":  ex["subDomain"],
       "format":     ex["format"],
       "prompt":     build_prompt(ex),
       "gold":       ex["answer"],
       "image_paths": paths,
   })
jsonl_path = out_dir / "data.jsonl"
with open(jsonl_path, "w") as f:
   for r in records: f.write(json.dumps(r) + "\n")
print(f"\nWrote {len(records)} records → {jsonl_path}")
print(f"Saved {sum(len(r['image_paths']) for r in records)} images under {img_dir}")
def mock_policy_samples(gold, K=4):
   """Stand-in for K policy rollouts. Replace with model.generate(do_sample=True)."""
   return [gold,
           "Final answer: 0",
           f"Final answer: {gold} (≈)",
           "I think the answer is unclear."][:K]
def grpo_advantages(rewards):
   r = np.asarray(rewards, dtype=float)
   return (r - r.mean()) / (r.std() + 1e-6)
print("\n=== Mock GRPO rollouts for example 0 ===")
gold0 = ds[0]["answer"]
cands = mock_policy_samples(gold0, K=4)
rewards = [grade(c, gold0) for c in cands]
adv = grpo_advantages(rewards)
for c, r, a in zip(cands, rewards, adv):
   print(f"  r={r:.2f}  adv={a:+.2f}   cand={c!r}")
print("\nDone. To turn this into real training:")
print("  1. Replace mock_policy_samples with vlm_solve(..., do_sample=True, num_return_sequences=K).")
print("  2. Feed (prompt, K rollouts, K rewards) into TRL's GRPOTrainer or verl.")
print("  3. Curriculum: start with examples where rewards have non-zero variance.")

We check whether CUDA is available and, optionally, run SmolVLM on a few examples to generate predictions, then score them using our reward function. We then export the dataset to a GRPO-style JSONL format, saving all images to disk for future multimodal RL experiments. Finally, we demonstrate mock GRPO rollouts, calculate group-relative advantages, and outline how this can be replaced with real model-generated samples.
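The group-relative advantage used above is simply each rollout's reward standardized within its group: A_i = (r_i - mean(r)) / (std(r) + eps). The following dependency-free sketch works through one group by hand. The function name group_advantages and the reward values are made up for illustration, and it divides by n to mirror np.std's default population standard deviation.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (population std, like np.std)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical rollouts for one prompt: exact match, miss, partial credit, miss
adv = group_advantages([1.0, 0.0, 0.5, 0.0])
print([round(a, 3) for a in adv])

# If every rollout earns the same reward, all advantages collapse to zero
flat = group_advantages([0.0, 0.0, 0.0, 0.0])
print(flat)
```

The second case shows why groups with zero reward variance contribute no learning signal: every advantage is zero, so filtering toward prompts whose rollouts disagree is a reasonable starting curriculum.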

In conclusion, we built a complete workflow for understanding, evaluating, and preparing the Open-MM-RL dataset for multimodal reasoning experiments. We moved from dataset loading and exploratory analysis to image inspection, LaTeX-aware answer classification, reward scoring, prompt construction, optional VLM inference, and GRPO-style rollout preparation. This workflow provides a starting point for training and evaluating vision-language models with verifiable rewards, and it shows how a multimodal dataset can be turned into a practical reinforcement learning pipeline.


The post Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export appeared first on MarkTechPost.
