Since the publication of the now famous 2017 paper Attention is All You Need1, many large language models based on the transformer architecture have emerged. Fortunately, some studies 2 3 have compiled extensive data on many published models, including the dimensions of their transformers.
Much like my experience learning about CNNs and their increasing complexity, I wanted to analyze LLM transformers. Which models are the largest? What is the optimal size for the feed-forward layer? Is it better to add more embeddings or more attention heads? Can we easily derive the total number of parameters from the network dimensions?
Transformer model parameters
I will use the notations from the original Attention is All You Need 1 paper.
- $N$ : the number of layers
- $h$ : the number of attention heads
- $d_{\textrm{model}}$ : the size of the embeddings
- $d_{\textrm{ff}}$ : the size of the hidden FFN layer
- $V$ : the vocabulary size, that is the number of tokens used
In order to count model parameters, we need break the model down into building blocks:
- Multi-head attention block : trainable parameters are contained in weight matrices $W_i^Q, W_i^K, W_i^V$, for $1 \leq i \leq h$, as well as $W^O$ and their associated biaises. We then multiply the added number of parameters by $h$, the number of heads. Using the relationship $d_k=d_v=d_{\textrm{model}} / h$ 1 we get
$$ \begin{aligned} P_{\textrm{MHA}} &= h (2d_{\textrm{model}}d_k + 2d_k + d_{\textrm{model}}d_v + d_v) + hd_vd_{\textrm{model}} + d_{\textrm{model}} \\ &= 4 (d_{\textrm{model}}^2 + d_{\textrm{model}}) \end{aligned} $$
- Feed-forward block : in both the encoder and the decoder, the output of size $d_{\textrm{model}}$ is passed throught a feed-forward block 1 : $f(x) = \textrm{ReLU}(xW_1 + b_1)W_2 + b_2$. This leads to the following number of parameters
$$ P_{\textrm{FFN}} = 2 (d_{\textrm{ff}} d_{\textrm{model}} + d_{\textrm{model}})$$
- Layer normalization block : gain and bias with dimension $d_{\textrm{model}}$
$$ P_{\textrm{LN}} = 2 d_{\textrm{model}} $$
Encoder : the encoder has one MHA and one FFN. Each one has a norm layer. $$ P_{\textrm{encoder}} = N (P_{\textrm{MHA}} + P_{\textrm{FFN}} + 2P_{\textrm{LN}} )$$
Decoder : the decoder has two MHA and one FFN. Each one has a norm layer. $$ P_{\textrm{decoder}} = N (2P_{\textrm{MHA}} + P_{\textrm{FFN}} + 3 P_{\textrm{LN}})$$
Linear block : the linear block outputs as many logits as the vocabulary size, hence the dimension of its matrix and bias is
$$ P_{\textrm{linear}} = d_{\textrm{model}} V + V $$
Finally the total number of parameters is $$ P = P_{\textrm{encoder}} + P_{\textrm{decoder}} + P_{\textrm{linear}} $$
Gathering data
Although the aforementioned studies 2 3 are invaluable and packed with useful information, they’ve become quickly outdated given the pace of model releases these days. I decided to collect my own data from original research papers, announcement posts, as well as some Hugging Face configuration files. I focused on models published by large research teams and/or that had significant impact.
Here are my findings:
- GPT 4 used a causal decoder-only transformer, which many models have adopted. This means the encoder block is not present in most models
- GPT used $d_{\textrm{ff}}/d_{\textrm{model}} = 4 $
- According to 2, sometimes biases are omitted in the model
- Sometimes, some parameters are omitted in the paper and implied from previous version of the model
- Closed-source models rarely disclose detailed architecture information
- Hugging Face configuration files generally display one version (size) from a family of models, potentially leading to misleading interpretations
Publishing a dashboard
Once the data started to look interesting, I put together a small Next.js app using shadcn/ui data tables. A dashboard is available at https://transformers-dashboard.vercel.app.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. ↩︎ ↩︎ ↩︎ ↩︎
Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., … & Mian, A. (2023). A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435. ↩︎ ↩︎ ↩︎
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., … & Wen, J. R. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223. ↩︎ ↩︎
Radford, A., & Narasimhan, K. (2018). Improving Language Understanding by Generative Pre-Training. ↩︎