[D] LLMs as a semantic regularizer for feature synthesis (small decision-tree experiment)

I’ve been experimenting with a small idea: use an LLM not to generate features, but to filter them during enumerative feature synthesis.

The approach was inspired by this paper: https://arxiv.org/pdf/2403.03997v1

I had already been playing with enumerative bottom-up synthesis, but noticed it usually gave me unintelligible features (even with regularization).

I looked into how other symbolic approaches deal with this problem and saw that they try to model the semantics of the domain somehow, including dimensions, refinement types, etc. But those approaches weren't appealing to me because I was trying to come up with something that works in general.

So I tried using an LLM to score candidate expressions by how meaningful they are. The idea was that the semantic meaning of the column names, the dimensions, and the salience of the operations could be embedded in the LLM.
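To make the filter concrete, here's a minimal sketch of the kind of scorer I mean. It is not the exact prompt or model from my write-up; the OpenAI client, the gpt-4o-mini model name, and the 0-10 scale are assumptions for illustration.

```python
# Hedged sketch: ask an LLM to rate how semantically meaningful a candidate
# feature expression is, given the dataset's column names.
# Assumptions: the openai SDK, the "gpt-4o-mini" model, and a 0-10 scale.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def semantic_score(expression: str, columns: list[str]) -> float:
    """Return a 0-10 plausibility score for a synthesized feature."""
    prompt = (
        f"Dataset columns: {', '.join(columns)}.\n"
        f"Candidate feature: {expression}.\n"
        "On a scale of 0 to 10, how likely is this expression to be a "
        "meaningful real-world quantity (consistent units, sensible "
        "combination of columns)? Reply with a single number."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat unparsable replies as "not meaningful"
```

For example, on a health dataset you'd hope `semantic_score("weight / height**2", ["weight", "height", "age"])` scores high (it's BMI) while `"age * height"` scores low.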

My approach was (rough sketch after this list):

* Enumerate simple arithmetic features (treat feature eng as program synthesis)
* Use an LLM as a semantic filter ("does this look like a meaningful quantity?")
* Train a decision tree (with oblique splits) considering only the filtered candidates as potential splits
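Putting the three steps together, here's a hedged sketch of the loop under some assumptions of mine: a depth-1 enumeration over pairwise arithmetic ops, a fixed score threshold, and scikit-learn's DecisionTreeClassifier (axis-aligned splits, not the oblique splits I actually used). It reuses the `semantic_score` function sketched above.

```python
# Sketch of the enumerate -> LLM-filter -> train pipeline.
# Assumptions: depth-1 enumeration, a 7.0 score threshold, and a standard
# (axis-aligned) sklearn tree instead of an oblique-split tree.
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b.replace(0, np.nan),  # avoid divide-by-zero
}


def enumerate_features(df: pd.DataFrame) -> dict[str, pd.Series]:
    """Depth-1 enumeration: every binary op over every pair of columns."""
    candidates = {}
    for a, b in combinations(df.columns, 2):
        for sym, fn in OPS.items():
            candidates[f"({a} {sym} {b})"] = fn(df[a], df[b])
    return candidates


def synthesize_and_fit(df: pd.DataFrame, y, threshold: float = 7.0):
    """Keep only LLM-approved candidates and train a tree on them."""
    candidates = enumerate_features(df)
    kept = {
        expr: vals
        for expr, vals in candidates.items()
        if semantic_score(expr, list(df.columns)) >= threshold  # LLM filter
    }
    X = pd.DataFrame(kept).fillna(0.0)  # the tree only sees filtered features
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    return tree, list(kept)
```

In this sketch the tree's split search is restricted to the surviving candidates, mirroring the "only the filtered candidates as potential splits" step above.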

The result: the tree was noticeably more readable, and accuracy was similar or slightly better in my small test.

I wrote it up here: https://mchav.github.io/learning-better-decision-tree-splits/

Runnable code is here.

If you’ve tried constraining feature synthesis before: what filters worked best in practice? Are there any measures of semantic viability out there?

submitted by /u/ChavXO
