[P] Yet another garage model – Prisma: Interpretability-Inspired Architecture

Hey y’all! I think some of you might be interested in this creature.

Don’t roast me too hard; I really want to collect your feedback and ideas on this rough prototype.

At least it isn’t based on the GPT/Llama/Mistral/Qwen architecture; I built it on some ideas I had while studying other models. The basic differences are:

  • Attention and output weight sharing (reduces parameters);
  • Additional weight set in the FFN (increases parameters, yay!);
  • Introduces Word-Relative Rotary Position Embedding;

The added weight set is, I think, the most interesting part of the architecture, and I’d like many pinches of salt on it. It acts as a nested gate, turning the usual W2 @ (W1 @ x * silu(W3 @ x)) into W2 @ (W1 @ x * silu(W3 @ x * silu(W4 @ x)))… I’ll leave it at that and wait for the stones to come.
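To make the formula concrete, here is a minimal NumPy sketch of the two FFN variants, written directly from the expressions above. The function names, shapes, and random weights are mine, purely for illustration; the actual model may lay out its weights differently.

```python
import numpy as np

def silu(x):
    # SiLU/swish activation: x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def swiglu_ffn(x, W1, W2, W3):
    # Standard SwiGLU block: W2 @ (W1 @ x * silu(W3 @ x))
    return W2 @ ((W1 @ x) * silu(W3 @ x))

def nested_gate_ffn(x, W1, W2, W3, W4):
    # Nested gate: the extra W4 projection gates the gate itself,
    # W2 @ (W1 @ x * silu(W3 @ x * silu(W4 @ x)))
    return W2 @ ((W1 @ x) * silu((W3 @ x) * silu(W4 @ x)))

# Tiny illustrative shapes: hidden dim 4, FFN dim 8.
rng = np.random.default_rng(0)
d, h = 4, 8
x = rng.standard_normal(d)
W1, W3, W4 = (rng.standard_normal((h, d)) for _ in range(3))
W2 = rng.standard_normal((d, h))

y = nested_gate_ffn(x, W1, W2, W3, W4)
print(y.shape)  # (4,)
```

Compared to SwiGLU, the only change is that the gate branch W3 @ x is itself modulated by silu(W4 @ x) before activation, at the cost of one extra h×d weight matrix.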

Yes, it is a garage model, but it works. In training it is about 25% more data efficient than the “standard transformer architecture” and gets pretty decent results on basic benchmarks (ARC-e, ARC-c, PIQA, BoolQ, HellaSwag…). Trained on a single H100 with 30B tokens (OpenWebText and FineWeb-Edu).

Anyhow, if you’re interested: hf:y3i12/Prisma.

Looking forward to your thoughts and comments 😁

submitted by /u/y3i12