Papers to Read
- DeBERTaV3: https://openreview.net/pdf?id=sE7-XhLxHA
- Llama 2: https://arxiv.org/pdf/2307.09288
- Sentence-BERT: https://arxiv.org/pdf/1908.10084
- Noise Contrastive Estimation: https://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf
- Sparse Transformers: https://arxiv.org/pdf/1904.10509
- Longformer: https://arxiv.org/pdf/2004.05150
- Transformer-XL: https://arxiv.org/pdf/1901.02860
- GQA (Grouped Query Attention): https://arxiv.org/pdf/2305.13245
- Fast Transformer Decoding (introduces Multi-Query Attention): https://arxiv.org/pdf/1911.02150
- FlashAttention: https://arxiv.org/pdf/2205.14135
- FlashAttention-2: https://tridao.me/publications/flash2/flash2.pdf
- RMSNorm (Root Mean Square Layer Normalization): https://arxiv.org/pdf/1910.07467
- On Layer Normalization in the Transformer Architecture: https://arxiv.org/pdf/2002.04745
- SwiGLU: https://arxiv.org/pdf/2002.05202
- RoFormer / RoPE (Rotary Positional Embeddings): https://arxiv.org/pdf/2104.09864v4
- Data Packing w/o Cross-Contamination: https://arxiv.org/pdf/2107.02027, https://www.graphcore.ai/posts/introducing-packed-bert-for-2x-faster-training-in-natural-language-processing