Name | Release date[a] | Developer | Number of parameters (billion)[b] | Corpus size | Training cost (petaFLOP-day) | License[c] | Notes
Attention Is All You Need | June 2017 | Vaswani et al. at Google | 0.213 | 36 million English-French sentence pairs | 0.09[1] | | Trained for 0.3M steps on 8 NVIDIA P100 GPUs.
GPT-1 | June 2018 | OpenAI | 0.117 | | 1[2] | MIT[3] | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.
BERT | October 2018 | Google | 0.340[4] | 3.3 billion words[4] | 9[5] | Apache 2.0[6] | An early and influential language model.[7] Encoder-only and thus not built to be prompted or generative.[8] Training took 4 days on 64 TPUv2 chips.[9]
T5 | October 2019 | Google | 11[10] | 34 billion tokens[10] | | Apache 2.0[11] | Base model for many Google projects, such as Imagen.[12]
XLNet | June 2019 | Google | 0.340[13] | 33 billion words | 330 | Apache 2.0[14] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[15]
GPT-2 | February 2019 | OpenAI | 1.5[16] | 40 GB[17] (~10 billion tokens)[18] | 28[19] | MIT[20] | Trained on 32 TPUv3 chips for 1 week.[19]
GPT-3 | May 2020 | OpenAI | 175[21] | 300 billion tokens[18] | 3640[22] | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[23]
GPT-Neo | March 2021 | EleutherAI | 2.7[24] | 825 GiB[25] | | MIT[26] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[26]
GPT-J | June 2021 | EleutherAI | 6[27] | 825 GiB[25] | 200[28] | Apache 2.0 | GPT-3-style language model.
Megatron-Turing NLG | October 2021[29] | Microsoft and Nvidia | 530[30] | 338.6 billion tokens[30] | 38,000[31] | Restricted web access | Trained for 3 months on over 2,000 A100 GPUs on the NVIDIA Selene supercomputer, for over 3 million GPU-hours.[31]
Ernie 3.0 Titan | December 2021 | Baidu | 260[32] | 4 TB | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model.
Claude[33] | December 2021 | Anthropic | 52[34] | 400 billion tokens[34] | | beta | Fine-tuned for desirable behavior in conversations.[35]
GLaM (Generalist Language Model) | December 2021 | Google | 1200[36] | 1.6 trillion tokens[36] | 5600[36] | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.
Gopher | December 2021 | DeepMind | 280[37] | 300 billion tokens[38] | 5833[39] | Proprietary | Later developed into the Chinchilla model.
LaMDA (Language Models for Dialog Applications) | January 2022 | Google | 137[40] | 1.56T words,[40] 168 billion tokens[38] | 4110[41] | Proprietary | Specialized for response generation in conversations.
GPT-NeoX | February 2022 | EleutherAI | 20[42] | 825 GiB[25] | 740[28] | Apache 2.0 | Based on the Megatron architecture.
Chinchilla | March 2022 | DeepMind | 70[43] | 1.4 trillion tokens[43][38] | 6805[39] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law.
PaLM (Pathways Language Model) | April 2022 | Google | 540[44] | 768 billion tokens[43] | 29,250[39] | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips.[39] As of October 2024, it is the largest dense Transformer published.
OPT (Open Pretrained Transformer) | May 2022 | Meta | 175[45] | 180 billion tokens[46] | 310[28] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.[47]
YaLM 100B | June 2022 | Yandex | 100[48] | 1.7 TB[48] | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM.
Minerva | June 2022 | Google | 540[49] | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[49] | | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning".[50] Initialized from PaLM models, then fine-tuned on mathematical and scientific data.
BLOOM | July 2022 | Large collaboration led by Hugging Face | 175[51] | 350 billion tokens (1.6 TB)[52] | | Responsible AI | Essentially GPT-3 but trained on a multilingual corpus (30% English, excluding programming languages).
Galactica | November 2022 | Meta | 120 | 106 billion tokens[53] | Unknown | CC BY-NC 4.0 | Trained on scientific text and modalities.
AlexaTM (Teacher Models) | November 2022 | Amazon | 20[54] | 1.3 trillion[55] | | Proprietary[56] | Bidirectional sequence-to-sequence architecture.
LLaMA (Large Language Model Meta AI) | February 2023 | Meta AI | 65[57] | 1.4 trillion[57] | 6300[58] | Non-commercial research[e] | Corpus has 20 languages. "Overtrained" (compared to the Chinchilla scaling law) for better performance with fewer parameters.[57]
GPT-4 | March 2023 | OpenAI | Unknown[f] (according to rumors: 1760)[60] | Unknown | Unknown, estimated 230,000 | Proprietary | Available for ChatGPT Plus users and used in several products.
Chameleon | June 2024 | Meta AI | 34[61] | 4.4 trillion | | |
Cerebras-GPT | March 2023 | Cerebras | 13[62] | | 270[28] | Apache 2.0 | Trained with the Chinchilla formula.
Falcon | March 2023 | Technology Innovation Institute | 40[63] | 1 trillion tokens, from RefinedWeb (filtered web text corpus)[64] plus some "curated corpora"[65] | 2800[58] | Apache 2.0[66] |
BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general-purpose datasets[67] | | Proprietary | Trained on financial data from proprietary sources, for financial tasks.
PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens[68] | | Proprietary |
OpenAssistant[69] | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data.
Jurassic-2[70] | March 2023 | AI21 Labs | Unknown | Unknown | | Proprietary | Multilingual.[71]
PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340[72] | 3.6 trillion tokens[72] | 85,000[58] | Proprietary | Was used in the Bard chatbot.[73]
Llama 2 | July 2023 | Meta AI | 70[74] | 2 trillion tokens[74] | 21,000 | Llama 2 license | 1.7 million A100-hours.[75]
Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot.[76]
Granite 13b | July 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx.[77]
Mistral 7B | September 2023 | Mistral AI | 7.3[78] | Unknown | | Apache 2.0 |
Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[79]
Grok 1[80] | November 2023 | xAI | 314 | Unknown | Unknown | Apache 2.0 | Used in the Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X (Twitter).[81]
Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name.[82]
Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[83] Mixture of experts model, with 12.9 billion parameters activated per token.[84]
Mixtral 8x22B | April 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | [85]
DeepSeek-LLM | November 29, 2023 | DeepSeek | 67 | 2T tokens[86]: table 2 | 12,000 | DeepSeek License | Trained on English and Chinese text. 1e24 FLOPs for the 67B model, 1e23 FLOPs for the 7B model.[86]: figure 5
Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419[87] | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.[87]
Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens.[88]
Gemini Ultra | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | |
Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use[89] | |
Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models, Haiku, Sonnet, and Opus.[90]
Nova | October 2024 | Rubik's AI | Unknown | Unknown | Unknown | Proprietary | Consisted of three models: Nova-Instant, Nova-Air, and Nova-Pro. The company later shifted to Sonus AI.
Sonus[91] | January 2025 | Rubik's AI | Unknown | Unknown | Unknown | Proprietary |
DBRX | March 2024 | Databricks and Mosaic ML | 136 | 12T tokens | | Databricks Open Model License | Training cost 10 million USD.
Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained using only CPUs, on the Fugaku supercomputer.[92]
Phi-3 | April 2024 | Microsoft | 14[93] | 4.8T tokens | | MIT | Microsoft markets them as "small language models".[94]
Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 |
Qwen2 | June 2024 | Alibaba Cloud | 72[95] | 3T tokens | Unknown | Qwen License | Multiple sizes, the smallest being 0.5B.
DeepSeek-V2 | June 2024 | DeepSeek | 236 | 8.1T tokens | 28,000 | DeepSeek License | 1.4M GPU-hours on H800 GPUs.[96]
Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | 200,000 | NVIDIA Open Model License | Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.[97][98]
Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | The 405B version took 31 million GPU-hours on H100-80GB, at 3.8E25 FLOPs.[99][100]
DeepSeek-V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | 56,000 | MIT | 2.788M GPU-hours on H800 GPUs.[101] Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025.[102]
Amazon Nova | December 2024 | Amazon | Unknown | Unknown | Unknown | Proprietary | Includes three models, Nova Micro, Nova Lite, and Nova Pro.[103]
DeepSeek-R1 | January 2025 | DeepSeek | 671 | Not applicable | Unknown | MIT | No separate pretraining; trained with reinforcement learning on top of DeepSeek-V3-Base.[104][105]
Qwen2.5 | January 2025 | Alibaba | 72 | 18T tokens | Unknown | Qwen License | Seven dense models, with parameter counts from 0.5B to 72B, plus two MoE variants.[106]
MiniMax-Text-01 | January 2025 | Minimax | 456 | 4.7T tokens[107] | Unknown | Minimax Model license | [108][107]
Gemini 2.0 | February 2025 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite, and Pro.[109][110][111]
Mistral Large | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Upgraded over time. The latest version is 24.11.[112]
Pixtral | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Multimodal. There is also a 12B version under the Apache 2.0 license.[112]
Grok 3 | February 2025 | xAI | Unknown | Unknown | Unknown, estimated 5,800,000 | Proprietary | Training cost claimed to be "10x the compute of previous state-of-the-art models".[113]
Llama 4 | April 5, 2025 | Meta AI | 400 | 40T tokens | | Llama 4 license | [114][115]
Qwen3 | April 2025 | Alibaba Cloud | 235 | 36T tokens | Unknown | Apache 2.0 | Multiple sizes, the smallest being 0.6B.[116]
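The training-cost column is expressed in petaFLOP-days. As a minimal worked check, assuming that one petaFLOP-day here denotes 10^15 FLOP/s sustained for 24 hours (about 8.64×10^19 FLOP), the 3.8E25 FLOPs reported for Llama 3.1 405B converts to roughly the 440,000 petaFLOP-days listed in its row:

\[
\frac{3.8\times 10^{25}\ \text{FLOP}}{10^{15}\ \tfrac{\text{FLOP}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}}} \approx 4.4\times 10^{5}\ \text{petaFLOP-days}
\]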