[D] Research on self-supervised fine-tuning of “sentence” embeddings?

Typical transformer models output per-token embeddings, and people often take the mean of all token embeddings within a “sentence” to create a “sentence” embedding that can be used for low-data downstream tasks.
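For concreteness, here is the usual mean-pooling baseline in a minimal sketch (names and shapes are illustrative, not from the post); padding tokens are masked out so they don't drag the average toward zero:

```python
import numpy as np

def mean_pool(token_embs, mask):
    """Average token embeddings, ignoring padding positions.

    token_embs: (seq_len, dim) array of per-token embeddings.
    mask: (seq_len,) array of 1s for real tokens, 0s for padding.
    """
    mask = mask[:, None].astype(float)               # (seq_len, 1)
    return (token_embs * mask).sum(axis=0) / mask.sum()

# toy example: 4 tokens, last one is padding, dim = 3
embs = np.array([[1., 0., 0.],
                 [0., 1., 0.],
                 [0., 0., 1.],
                 [9., 9., 9.]])   # padding row should be ignored
mask = np.array([1, 1, 1, 0])
sent_emb = mean_pool(embs, mask)  # → [1/3, 1/3, 1/3]
```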

I feel a lot gets lost in just taking the mean.

Assuming you can’t change your transformer, what are ways of fine-tuning the aggregation operation on a particular dataset (assuming no labels)?
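One common answer (my sketch, not something the post prescribes) is to replace the mean with a small learned attention-pooling head and train only that head with an unsupervised contrastive objective, SimCSE-style: two dropout-perturbed views of the same example should pool to nearby vectors, with other examples in the batch as negatives. The `AttnPool` module, hyperparameters, and random features below are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnPool(nn.Module):
    """Learned weighted average over frozen token embeddings.

    A linear scorer assigns one score per token; softmax turns the
    scores into attention weights, and the pooled vector is the
    weighted sum. Only this module is trained; the transformer
    producing token_embs stays frozen.
    """
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, token_embs, mask):
        # token_embs: (batch, seq, dim); mask: (batch, seq), 1 = real token
        scores = self.scorer(token_embs).squeeze(-1)     # (batch, seq)
        scores = scores.masked_fill(mask == 0, -1e9)     # ignore padding
        w = scores.softmax(dim=-1).unsqueeze(-1)         # (batch, seq, 1)
        return (w * token_embs).sum(dim=1)               # (batch, dim)

def simcse_loss(pool, token_embs, mask, tau=0.05):
    """Two stochastic views (dropout on the frozen features) of each
    example should be most similar to each other within the batch."""
    z1 = pool(F.dropout(token_embs, 0.1), mask)
    z2 = pool(F.dropout(token_embs, 0.1), mask)
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau
    labels = torch.arange(sim.size(0))                   # positives on the diagonal
    return F.cross_entropy(sim, labels)

# toy training loop on random stand-in "frozen" features
torch.manual_seed(0)
feats = torch.randn(8, 16, 32)        # batch=8, seq=16, dim=32
mask = torch.ones(8, 16)
pool = AttnPool(32)
opt = torch.optim.Adam(pool.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = simcse_loss(pool, feats, mask)
    loss.backward()
    opt.step()
```

Nothing here is text-specific: the head only ever sees a `(batch, seq, dim)` feature tensor, so the same recipe applies to any sequence of frozen embeddings (audio frames, image patches, sensor windows).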

A bonus would be reducing the dimensionality of the sentence embeddings.
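For the dimensionality-reduction part, the simplest label-free option is PCA fitted once on a corpus of pooled embeddings; a sketch under that assumption (function name and sizes are illustrative):

```python
import numpy as np

def pca_reduce(X, k):
    """Project embeddings onto their top-k principal components.

    X: (n, dim) matrix of pooled sentence embeddings; returns (n, k).
    Fit the projection once on unlabeled data, then reuse it.
    """
    mu = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are principal directions
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return (X - mu) @ Vt[:k].T

rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 64))   # 100 pooled embeddings, dim 64
low = pca_reduce(embs, 16)          # reduced to shape (100, 16)
```

A learned alternative is to put a linear bottleneck between the pooling head and the contrastive loss, which trains the reduction jointly with the aggregation.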

I’m actually interested in non-NLP applications, so looking for general strategies.

submitted by /u/LetsTacoooo