➡️ https://transformers-dashboard.vercel.app

Since the publication of the now famous 2017 paper Attention is All You Need^{1}, many large language models based on the transformer architecture have emerged. Fortunately, some studies ^{2} ^{3} have compiled extensive data on many published models, including the dimensions of their transformers.

Much like my experience learning about CNNs and their growth in complexity, I wanted to analyze LLM transformers. Which models are the largest? What is the optimal size for the feed-forward layer? Is it better to add more embeddings or more attention heads? Can we easily derive the total number of parameters from the network dimensions?

## Transformer model parameters

I will use the notations from the original Attention is All You Need ^{1} paper.

- $N$ : the number of layers
- $h$ : the number of attention heads
- $d_{\textrm{model}}$ : the size of the embeddings
- $d_{\textrm{ff}}$ : the size of the hidden FFN layer
- $V$ : the vocabulary size, that is the number of tokens used

In order to count model parameters, we need to break the model down into building blocks:

**Multi-head attention block**: trainable parameters are contained in the weight matrices $W_i^Q, W_i^K, W_i^V$, for $1 \leq i \leq h$, as well as $W^O$, and their associated biases. Summing the per-head parameters, multiplying by $h$, and using the relationship $d_k=d_v=d_{\textrm{model}} / h$ ^{1}, we get

$$ \begin{aligned} P_{\textrm{MHA}} &= h (2d_{\textrm{model}}d_k + 2d_k + d_{\textrm{model}}d_v + d_v) + hd_vd_{\textrm{model}} + d_{\textrm{model}} \\ &= 4 (d_{\textrm{model}}^2 + d_{\textrm{model}}) \end{aligned} $$
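As a sanity check, the closed form above can be verified by counting parameters directly from the matrix shapes. A minimal sketch (the helper name `mha_params` is mine):

```python
def mha_params(d_model: int, h: int) -> int:
    """Count MHA parameters directly from the matrix shapes."""
    d_k = d_v = d_model // h  # per-head dimensions
    # W_i^Q, W_i^K: d_model x d_k each, W_i^V: d_model x d_v, plus biases
    per_head = 2 * (d_model * d_k + d_k) + (d_model * d_v + d_v)
    # W^O: (h * d_v) x d_model, plus its bias
    output_proj = h * d_v * d_model + d_model
    return h * per_head + output_proj

# Matches the closed form 4 * (d_model^2 + d_model)
assert mha_params(512, 8) == 4 * (512**2 + 512)  # 1,050,624
```

Note that $h$ cancels out entirely: more heads do not change the parameter count, only how the projections are partitioned.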

**Feed-forward block**: in both the encoder and the decoder, the output of size $d_{\textrm{model}}$ is passed through a feed-forward block^{1}: $f(x) = \textrm{ReLU}(xW_1 + b_1)W_2 + b_2$. This leads to the following number of parameters

$$ P_{\textrm{FFN}} = 2 d_{\textrm{ff}} d_{\textrm{model}} + d_{\textrm{ff}} + d_{\textrm{model}} $$
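This follows from the shapes in $f(x)$: $W_1$ is $d_{\textrm{model}} \times d_{\textrm{ff}}$ with bias $b_1$ of size $d_{\textrm{ff}}$, and $W_2$ is $d_{\textrm{ff}} \times d_{\textrm{model}}$ with bias $b_2$ of size $d_{\textrm{model}}$. A small sketch (the function name is mine):

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of f(x) = ReLU(x W1 + b1) W2 + b2."""
    layer1 = d_model * d_ff + d_ff     # W1 and b1
    layer2 = d_ff * d_model + d_model  # W2 and b2
    return layer1 + layer2

assert ffn_params(512, 2048) == 2_099_712
```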

**Layer normalization block**: gain and bias with dimension $d_{\textrm{model}}$

$$ P_{\textrm{LN}} = 2 d_{\textrm{model}} $$

**Encoder**: each encoder layer has one MHA and one FFN block, each followed by a layer norm.

$$ P_{\textrm{encoder}} = N (P_{\textrm{MHA}} + P_{\textrm{FFN}} + 2P_{\textrm{LN}} )$$

**Decoder**: each decoder layer has two MHA and one FFN block, each followed by a layer norm.

$$ P_{\textrm{decoder}} = N (2P_{\textrm{MHA}} + P_{\textrm{FFN}} + 3 P_{\textrm{LN}})$$

**Linear block**: the linear block outputs as many logits as the vocabulary size, hence its weight matrix and bias contribute

$$ P_{\textrm{linear}} = d_{\textrm{model}} V + V $$

Finally, the total number of parameters is $$ P = P_{\textrm{encoder}} + P_{\textrm{decoder}} + P_{\textrm{linear}} $$ Note that in the original paper, the input and output embedding matrices share their weights with the final linear layer ^{1}, so they do not add to this count.
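Putting the pieces together, here is a sketch that plugs in the base-model dimensions from the original paper ($N=6$, $d_{\textrm{model}}=512$, $d_{\textrm{ff}}=2048$, $V \approx 37000$). The function name is mine, and the FFN term is counted from the matrix shapes of $W_1$, $b_1$, $W_2$, $b_2$:

```python
def transformer_params(N: int, d_model: int, d_ff: int, V: int) -> int:
    """Total parameters of the original encoder-decoder transformer."""
    p_mha = 4 * (d_model**2 + d_model)           # multi-head attention
    p_ffn = 2 * d_ff * d_model + d_ff + d_model  # feed-forward block
    p_ln = 2 * d_model                           # layer norm (gain + bias)
    encoder = N * (p_mha + p_ffn + 2 * p_ln)
    decoder = N * (2 * p_mha + p_ffn + 3 * p_ln)
    linear = d_model * V + V
    return encoder + decoder + linear

# Base model from the paper: N=6, d_model=512, d_ff=2048, V≈37000
print(transformer_params(6, 512, 2048, 37000))  # 63119496, i.e. ≈ 63M
```

This lands close to the ~65M usually quoted for the base model; a small discrepancy is expected, since details such as the exact vocabulary size are not fully pinned down by the paper.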

## Gathering data

Although the aforementioned studies ^{2} ^{3} are invaluable and packed with useful information, they quickly became outdated given the current pace of model releases. I decided to collect my own data from original research papers, announcement posts, and some Hugging Face configuration files. I focused on models published by large research teams and/or models that had a significant impact.

Here are my findings:

- GPT ^{4} used a causal decoder-only transformer, which many models have since adopted. This means the encoder block is absent in most models
- GPT used $d_{\textrm{ff}}/d_{\textrm{model}} = 4$
- According to ^{2}, biases are sometimes omitted from the model
- Sometimes, some parameters are omitted in the paper and must be inferred from a previous version of the model
- Closed-source models rarely disclose detailed architecture information
- Hugging Face configuration files generally describe only one size from a family of models, which can lead to misleading interpretations
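For a causal decoder-only model like GPT, the counting formulas simplify: the encoder and the cross-attention MHA disappear, leaving one MHA and one FFN per layer. A hedged sketch with GPT-1-like dimensions ($N=12$, $d_{\textrm{model}}=768$, $d_{\textrm{ff}}=3072=4 d_{\textrm{model}}$, $V \approx 40000$), ignoring positional embeddings; the function name is mine:

```python
def decoder_only_params(N: int, d_model: int, d_ff: int, V: int) -> int:
    """Parameter count for a causal decoder-only transformer (no cross-attention)."""
    p_mha = 4 * (d_model**2 + d_model)
    p_ffn = 2 * d_ff * d_model + d_ff + d_model
    p_ln = 2 * d_model
    return N * (p_mha + p_ffn + 2 * p_ln) + d_model * V + V

# GPT-1-like dimensions; note d_ff / d_model = 4
print(decoder_only_params(12, 768, 3072, 40_000))  # ~116M
```

This is in the ballpark of the ~117M parameters commonly cited for GPT-1, with the remainder attributable to details such as positional embeddings and the exact vocabulary size.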

## Publishing a dashboard

Once the data started to look interesting, I put together a small Next.js app using shadcn/ui data tables. A dashboard is available at https://transformers-dashboard.vercel.app.

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

2. Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., … & Mian, A. (2023). A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.

3. Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., … & Wen, J. R. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223.

4. Radford, A., & Narasimhan, K. (2018). Improving Language Understanding by Generative Pre-Training.