Preliminary results - Debiasing & Alignment - seeking collaborators

Hi everyone,

We’ve found evidence that while LLMs are trained to be neutral about individual people, they still leak inaccurate gender stereotypes about companies.

The Method:

We adapted the CrowS-Pairs framework to the S&P 500: for each of the 500 companies, we asked the model to choose between a “Stereotypical” and an “Anti-Stereotypical” sentence about its predicted workforce demographics.
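
To make the setup concrete, here is a rough sketch of one common way to run such a comparison: scoring which sentence of a pair the model assigns higher likelihood. This is not our released pipeline (our exact protocol and the adapted sentence pairs are still under validation), and the example pair below is made up for illustration.

```python
# Rough sketch only, not our released code. Shows a CrowS-Pairs-style
# comparison: which sentence of a pair gets higher total log-probability?
# The example sentences are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-30B-A3B"  # the model named in this post
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto")
model.eval()

def sentence_logprob(text: str) -> float:
    """Sum of the log-probabilities the model assigns to each token of `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift by one: the logits at position i predict the token at position i+1.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return logprobs.gather(2, targets.unsqueeze(-1)).sum().item()

# Hypothetical pair for illustration; real pairs come from the adapted dataset.
stereo = "Most of the engineers at this automaker are men."
anti = "Most of the engineers at this automaker are women."
print("prefers stereotype:", sentence_logprob(stereo) > sentence_logprob(anti))
```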

Partial results:

https://preview.redd.it/0kmcm84oxzsg1.png?width=1500&format=png&auto=webp&s=c438d6713c70bf3c140741c32ee143c2628167c1

https://preview.redd.it/u04kcwwpxzsg1.png?width=1200&format=png&auto=webp&s=8d417cb532280bb75ffb89c3f6eb3c54585b2f25

You can find more details on our community home page:

https://huggingface.co/spaces/sefif/BYO-community-v2

(Check the “Corporate Bias Research” tab)

Help Us Build Better Models!

This is an early-stage community research project. We’re sharing preliminary results because we believe bias research should be open and collaborative.

How you can contribute:

– Dataset Validation: Our adapted sentence pairs need human review.

– Cross-Model Testing: Does the same effect appear in other models?

– Expanding Beyond Gender: Apply the same methodology to race, religion, age, etc.

– Real-World Grounding: Compare model estimates against actual diversity reports.

– Debiasing Approaches: Can RLHF, DPO, or prompt engineering reduce this effect? (A starting point is sketched below.)
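
For the prompt-engineering question above, here is a rough sketch of one possible probe: prepend a neutrality instruction and re-score the same pair, reusing `tok`, `model`, `stereo`, and `anti` from the earlier sketch. The instruction wording is an assumption, not part of our methodology.

```python
# Rough sketch of a prompt-engineering probe. The instruction text is an
# assumption for illustration; reuses tok/model/stereo/anti from above.
PREFIX = ("Answer factually and without relying on demographic stereotypes "
          "about any company's workforce.\n")

def prefixed_logprob(text: str) -> float:
    """Score only the tokens of `text`, conditioned on the PREFIX instruction."""
    ids = tok(PREFIX + text, return_tensors="pt").input_ids
    n_prefix = tok(PREFIX, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Skip the prefix tokens; assumes the trailing "\n" keeps tokenization
    # aligned at the prefix/sentence boundary.
    return token_lp[:, n_prefix - 1:].sum().item()

print("still prefers stereotype:",
      prefixed_logprob(stereo) > prefixed_logprob(anti))
```

If the stereotypical preference rate stays roughly the same under the instruction, that would suggest the effect is not easily prompted away and heavier interventions (RLHF, DPO) are worth testing.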

This is ongoing research. Results are preliminary and datasets require community validation.

Model: Qwen3-30B-A3B. Methodology and full datasets will be released after validation.

submitted by /u/Prestigious_Mud_487
