Training Qwen2.5-0.5B-Instruct on Reddit posts summarization tasks with length constraint on my 3xMac Minis with GRPO – evals update

digitado ⋅ 16 April 2026

So, I trained two variants of this task:

using just length penalty
using a quality reward and length penalty

I ran an LLM-as-a-Judge eval to check summarization quality using DeepEval tools. The metrics are:

Conciseness
Coverage
Clarity
Faithfulness

The results are as follows:

with quality + length penalty rewards: 2.5/4
with just length penalty: 2.4/5

Results:

The model trained with the length penalty plus a ROUGE-L quality reward scores significantly higher than the baseline on the final composite score, with a p-value of 0.0042 from a one-sided t-test over a total of 5 rounds of evals for each model.

Evals were performed on a test sample of 200 examples from the smoltldr dataset.

Baseline: length penalty only
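
For reference, here's a rough sketch of how a one-sided two-sample t-test over per-round composite scores can be computed with only the Python standard library. It assumes Welch's unequal-variance form (the post does not say which t-test variant was used), and the two score lists are made-up placeholders, not the actual eval numbers. The names `t_sf` and `welch_one_sided` are illustrative.

```python
import math
from statistics import mean, variance


def t_sf(t, df, steps=2000):
    """P(T > t) for Student's t with `df` degrees of freedom (Simpson's rule)."""
    if t < 0:
        return 1.0 - t_sf(-t, df, steps)
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return norm * (1 + x * x / df) ** (-(df + 1) / 2)

    # Integrate the density from 0 to t, then subtract from the upper half (0.5).
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3


def welch_one_sided(a, b):
    """Test H1: mean(a) > mean(b). Returns (t statistic, degrees of freedom, p-value)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, t_sf(t, df)


# Placeholder composite scores for 5 eval rounds per model (NOT the real results).
quality_plus_length = [2.52, 2.48, 2.55, 2.50, 2.47]
length_only = [2.41, 2.38, 2.44, 2.40, 2.37]

t_stat, dof, p_value = welch_one_sided(quality_plus_length, length_only)
print(f"t={t_stat:.3f}, df={dof:.2f}, one-sided p={p_value:.5f}")
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` should give the same kind of result with less code.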

What is LLM-as-a-Judge?

Well, it lets any LLM of your choice judge outputs that can't easily be reduced to a definitive reward because they are variable or subjective in nature, like summarization!

Such judgments vary from person to person, so we employ an LLM to act as the judge, have it score the outputs multiple times, and aggregate the results, which is cheap compared to human labelers!

So, I used DeepEval's amazing tools to build an eval system that scores my models' summaries on the aforementioned four factors:

Faithfulness: does the summary stay fully grounded in the source, with no hallucinations or contradictions?

Coverage: does the summary capture the source’s key points without missing meaning-critical information?

Conciseness: is the summary substantially shorter than the source without redundancy or unnecessary detail?

Clarity: is the summary easy to read, grammatically clean, and understandable on its own?

The composite score is the mean of the above scores.
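
A minimal sketch of that aggregation, assuming each judge round returns one numeric score per metric: the composite for a round is the plain mean of the four metric scores, and the rounds are then averaged. The dictionary layout, the function names, and the example scores are all hypothetical; this is not DeepEval's API.

```python
from statistics import mean

METRICS = ("faithfulness", "coverage", "conciseness", "clarity")


def composite_score(scores):
    """Mean of the four metric scores for a single judge round."""
    return mean(scores[m] for m in METRICS)


def aggregate_rounds(rounds):
    """Average the per-round composite scores across repeated judge runs."""
    return mean(composite_score(r) for r in rounds)


# Made-up scores for two judge rounds, for illustration only.
example_rounds = [
    {"faithfulness": 3, "coverage": 2, "conciseness": 3, "clarity": 2},
    {"faithfulness": 2, "coverage": 2, "conciseness": 3, "clarity": 3},
]
print(aggregate_rounds(example_rounds))
```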

Reward system

length_penalty: basically, -abs(response_length - MAX_LENGTH)
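
As a rough sketch, the length penalty could look like the function below. Two things here are assumptions, not details from the training run: length is counted in whitespace-separated words (the real setup may count tokenizer tokens instead), and `MAX_LENGTH = 64` is a placeholder value.

```python
MAX_LENGTH = 64  # placeholder target length, not the value used in training


def length_penalty(response: str, max_length: int = MAX_LENGTH) -> float:
    """Return -|response_length - max_length|; 0 is best, more negative is worse."""
    # Assumption: length is measured in whitespace-separated words.
    response_length = len(response.split())
    return float(-abs(response_length - max_length))


print(length_penalty("a short summary of the post", max_length=6))
```

The penalty is symmetric, so a summary that is too short is punished just as much as one that is too long by the same margin.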

quality_reward: ROUGE-L, which is basically based on the longest common subsequence (LCS) between the generated summary and the golden summaries that come with the above dataset, to keep some structure across the generated responses and minimize degradation.
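
Here's a small, dependency-free sketch of an LCS-based ROUGE-L score, to show what the quality reward measures. It assumes lowercased whitespace tokenization and the F1 form of the score (harmonic mean of LCS precision and recall); the actual run may have used a library implementation with different tokenization or weighting. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a generated summary and a golden summary."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate that is in the LCS
    recall = lcs / len(ref)      # share of the reference that is in the LCS
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on mat", "the cat is on the mat"))
```

Because the LCS only requires tokens to appear in the same order, not contiguously, this rewards summaries that follow the golden summary's structure without demanding exact phrase matches.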

https://preview.redd.it/a7bzpcu8xkvg1.png?width=800&format=png&auto=webp&s=0a809b761d7c285bc70d52175ebbf219e6d79fc5

submitted by /u/East-Muffin-6472