[D] Data curation and targeted replacement as a pre-training alignment and controllability method
Hi, r/MachineLearning: has much research been done on large-scale training scenarios where undesirable data, such as instances of violence, lying, or deception, is removed or replaced before training?
Most controllability work, like RLHF or constitutional AI, seems to be done post-training. What I’m considering is intentionally training models on more carefully chosen data, and never letting them train on undesirable data at all. This is a literal application of Mo Gawdat’s proposal to “raise AI like a child”, but with the option to never train it on harmful material, even at a “mature” stage of development.
Questions:
– If an entire dataset has all deception or violence removed or replaced, how much does that lower its ability to reason in general, or about deceptive or violent behavior in particular?
– How much would it hurt (or not hurt) overall coherence and capability? How about scientific and algorithmic capability, specifically?
– To what degree would ablated concepts still manifest as emergent properties, if at all?
– Is it possible to make a model much more truthful or much less violent by doing this, and by how much? What is the minimum amount of the original behavior or concept that would remain?
– Could any concept that can be identified and targeted in advance also be ablated similarly?
– Have there been many/any concrete studies done to answer these questions?
I have been able to produce a custom wavelet-based model with a semantic embedding that uses the two methods below to almost entirely ablate violence from generated output while maintaining high coherence and cohesion. Sadly, due to my own monetary constraints, it has only been trained on WikiText-103, but hopefully it can be open-sourced soon.
Two methods (though more likely exist):
- Replace violent passages in the original dataset with non-violent alternatives that do not introduce factually conflicting information, while maintaining the same narrative style and flow.
- For a word embedding-based architecture that uses plain-language features as dimensions and words or n-grams as tokens, don’t train on violent tokens. Instead, use non-violent tokens as training targets. Specifically, use replacement tokens that minimize the Hamming distance to the original word’s feature vector after zeroing out the violent dimensions.
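For concreteness, here is a toy sketch of the second method. Everything in it — the vocabulary, the four feature dimensions, and the `nonviolent_replacement` helper — is invented for illustration, not taken from my actual model:

```python
# Toy sketch: replace a violent token with the non-violent vocabulary token
# whose binary feature vector is closest (by Hamming distance) to the
# original token's vector with the violent dimensions zeroed out.

VIOLENT_DIMS = {2}  # hypothetical index of the "violent" feature

# token -> binary features: (physical_contact, sudden_motion, violent, loud)
VOCAB = {
    "strike": (1, 1, 1, 1),
    "tap":    (1, 1, 0, 0),
    "wave":   (0, 1, 0, 0),
    "rest":   (0, 0, 0, 0),
}

def hamming(a, b):
    """Number of feature dimensions on which two vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def nonviolent_replacement(token):
    """Return the non-violent vocab token closest to `token`'s vector
    after zeroing the violent dimensions."""
    target = tuple(0 if i in VIOLENT_DIMS else v
                   for i, v in enumerate(VOCAB[token]))
    # candidates: tokens with all violent dimensions already zero
    candidates = [t for t, vec in VOCAB.items()
                  if all(vec[i] == 0 for i in VIOLENT_DIMS)]
    return min(candidates, key=lambda t: hamming(VOCAB[t], target))
```

Here `nonviolent_replacement("strike")` yields `"tap"` — the candidate agreeing with the zeroed-out target on the most dimensions — and a token that is already non-violent maps to itself. A real implementation would of course need a much larger feature set and some way to handle ties and n-grams.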
While neither method is universally applicable, even the first alone could help align and control AI to a degree seemingly not yet achieved. Smaller-scale studies aimed at answering the above questions might be essential to move the needle in this direction.
submitted by /u/Real_Beach6493