[P] I made my first Transformer architecture code

In this code I used PyTorch and the math module to build each block of the Transformer as a separate class, and then composed those blocks in the main Transformer class. I used the hyperparameters suggested in the original paper: a model dimension of 512, 6 layers, and 8 attention heads. A rough sketch of this structure is shown below.
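For readers who haven't opened the notebook, here is a minimal sketch of that "one class per block" structure with the paper's settings (d_model=512, 6 layers, 8 heads). The class and argument names (MultiHeadAttention, EncoderLayer, Encoder) are my own placeholders and may not match the actual code in the notebook.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention split across several heads."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        # reshape to (batch, heads, seq_len, d_k)
        def split(x):
            return x.view(B, -1, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(q)), split(self.k_proj(k)), split(self.v_proj(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, -1, self.n_heads * self.d_k)
        return self.out_proj(out)

class EncoderLayer(nn.Module):
    """One encoder block: attention + feed-forward, each with a residual and LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # post-norm residual connections, as in the original paper
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))
        return self.norm2(x + self.drop(self.ff(x)))

class Encoder(nn.Module):
    """Stack of 6 encoder layers, composed from the block classes above."""
    def __init__(self, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(EncoderLayer(d_model, n_heads) for _ in range(n_layers))

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x
```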

My question: is there any better way to optimize this before I train it?

Also, what dataset is a good fit for a T4 GPU (Google Colab)? This is the link to my code:

https://github.com/Rishikesh-2006/NNs/blob/main/Pytorch%2FTransformer.ipynb

submitted by /u/Jumbledsaturn52
