Preliminary results – Debiasing & Alignment – seeking collaborators
|
Hi everyone, We’ve found evidence that while LLMs are trained to be neutral about people, they still leak inaccurate gender stereotypes toward companies. The Method: We adapted the CrowS-Pairs framework for the S&P 500. We asked the model to choose between “Stereotypical” and “Anti-Stereotypical” sentences for 500 different brands based on their predicted workers demographics. Partial results: you can find more details at our community home page https://huggingface.co/spaces/sefif/BYO-community-v2 (Check the “Corporate Bias Research” tab) Help Us Build Better Models! This is an early-stage community research project. We’re sharing preliminary results because we believe bias research should be open and collaborative. How you can contribute: – Dataset Validation: Our adapted sentence pairs need human review. – Cross-Model Testing: Does the same effect appear in other models? – Expanding Beyond Gender: Apply the same methodology to race, religion, age, etc. – Real-World Grounding: Compare model estimates against actual diversity reports. – Explore debiasing approaches: Can RLHF, DPO, or prompt engineering reduce this? This is ongoing research. Results are preliminary and datasets require community validation. Model: Qwen3-30B-A3B. Methodology and full datasets will be released after validation. submitted by /u/Prestigious_Mud_487 |