List of large language models

For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. Also, only the largest model's cost is written.

More information Name, Release date ...

Name	Release date^[a]	Developer	Number of parameters (billion) ^[b]	Corpus size	Training cost (petaFLOP-day)	License^[c]	Notes
Attention Is All You Need	June 2017	Vaswani et al at Google	0.213	36 million English-French sentence pairs	0.09^[1]	Unreleased	Trained for 0.3M steps on 8 NVIDIA P100 GPUs. Training and evaluation code released under Apache 2.0 license.^[2]
GPT-1	June 2018	OpenAI	0.117	Unknown	1^[3]	MIT^[4]	First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.
BERT	October 2018	Google	0.340^[5]	3.3 billion words^[5]	9^[6]	Apache 2.0^[7]	An early and influential language model.^[8]Encoder-only and thus not built to be prompted or generative.^[9] Training took 4 days on 64 TPUv2 chips.^[10]
T5	October 2019	Google	11^[11]	34 billion tokens^[11]		Apache 2.0^[12]	Base model for many Google projects, such as Imagen.^[13]
XLNet	June 2019	Google	0.340^[14]	33 billion words	330	Apache 2.0^[15]	An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.^[16]
GPT-2	February 2019	OpenAI	1.5^[17]	40GB^[18] (~10 billion tokens)^[19]	28^[20]	MIT^[21]	Trained on 32 TPUv3 chips for 1 week.^[20]
GPT-3	May 2020	OpenAI	175^[22]	300 billion tokens^[19]	3640^[23]	Proprietary	A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.^[24]
GPT-Neo	March 2021	EleutherAI	2.7^[25]	825 GiB^[26]	Unknown	MIT^[27]	The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.^[27]
GPT-J	June 2021	EleutherAI	6^[28]	825 GiB^[26]	200^[29]	Apache 2.0	GPT-3-style language model
Megatron-Turing NLG	October 2021^[30]	Microsoft and Nvidia	530^[31]	338.6 billion tokens^[31]	38000^[32]	Unreleased	Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours^[32]
Ernie 3.0 Titan	December 2021	Baidu	260^[33]	4TB	Unknown	Proprietary	Chinese-language LLM. Ernie Bot is based on this model.
Claude^[34]	December 2021	Anthropic	52^[35]	400 billion tokens^[35]	Unknown	Proprietary	Fine-tuned for desirable behavior in conversations.^[36]
GLaM (Generalist Language Model)	December 2021	Google	1200^[37]	1.6 trillion tokens^[37]	5600^[37]	Proprietary	Sparse mixture of experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.
Gopher	December 2021	DeepMind	280^[38]	300 billion tokens^[39]	5833^[40]	Proprietary	Later developed into the Chinchilla model.
LaMDA (Language Models for Dialog Applications)	January 2022	Google	137^[41]	1.56T words,^[41] 168 billion tokens^[39]	4110^[42]	Proprietary	Specialized for response generation in conversations.
GPT-NeoX	February 2022	EleutherAI	20^[43]	825 GiB^[26]	740^[29]	Apache 2.0	based on the Megatron architecture
Chinchilla	March 2022	DeepMind	70^[44]	1.4 trillion tokens^[44]^[39]	6805^[40]	Proprietary	Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law.
PaLM (Pathways Language Model)	April 2022	Google	540^[45]	768 billion tokens^[44]	29,250^[40]	Proprietary	Trained for ~60 days on ~6000 TPU v4 chips.^[40] As of October 2024^[update], it is the largest dense Transformer published.
OPT (Open Pretrained Transformer)	May 2022	Meta	175^[46]	180 billion tokens^[47]	310^[29]	Non-commercial research^[d]	GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.^[48]
YaLM 100B	June 2022	Yandex	100^[49]	1.7TB^[49]	Unknown	Apache 2.0	English-Russian model based on Microsoft's Megatron-LM
Minerva	June 2022	Google	540^[50]	38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server^[50]	Unknown	Proprietary	For solving "mathematical and scientific questions using step-by-step reasoning".^[51] Initialized from PaLM models, then finetuned on mathematical and scientific data.
BLOOM	July 2022	Large collaboration led by Hugging Face	175^[52]	350 billion tokens (1.6TB)^[53]	Unknown	Responsible AI	Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages)
Galactica	November 2022	Meta	120	106 billion tokens^[54]	Unknown	CC-BY-NC-4.0	Trained on scientific text and modalities.
AlexaTM (Teacher Models)	November 2022	Amazon	20^[55]	1.3 trillion^[56]	Unknown	Proprietary^[57]	Bidirectional sequence-to-sequence architecture
Llama	February 2023	Meta AI	65^[58]	1.4 trillion^[58]	6300^[59]	Non-commercial research^[e]	Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters.^[58]
GPT-4	March 2023	OpenAI	Unknown^[f] (According to rumors: 1760)^[61]	Unknown	Unknown, estimated 230,000	Proprietary	Available for ChatGPT Plus users and used in several products.
Chameleon	June 2024	Meta AI	34^[62]	4.4 trillion	Unknown	Non-commercial research^[63]
Cerebras-GPT	March 2023	Cerebras	13^[64]		270^[29]	Apache 2.0	Trained with Chinchilla formula.
Falcon	March 2023	Technology Innovation Institute	40^[65]	1 trillion tokens, from RefinedWeb (filtered web text corpus)^[66] plus some "curated corpora".^[67]	2800^[59]	Apache 2.0^[68]
BloombergGPT	March 2023	Bloomberg L.P.	50	363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets^[69]	Unknown	Unreleased	Trained on financial data from proprietary sources, for financial tasks
PanGu-Σ	March 2023	Huawei	1085	329 billion tokens^[70]	Unknown	Proprietary
OpenAssistant^[71]	March 2023	LAION	17	1.5 trillion tokens	Unknown	Apache 2.0	Trained on crowdsourced open data
Jurassic-2^[72]	March 2023	AI21 Labs	Unknown	Unknown	Unknown	Proprietary	Multilingual^[73]
PaLM 2 (Pathways Language Model 2)	May 2023	Google	340^[74]	3.6 trillion tokens^[74]	85,000^[59]	Proprietary	Was used in Bard chatbot.^[75]
Llama 2	July 2023	Meta AI	70^[76]	2 trillion tokens^[76]	21,000	Llama 2 license	1.7 million A100-hours.^[77]
Claude 2	July 2023	Anthropic	Unknown	Unknown	Unknown	Proprietary	Used in Claude chatbot.^[78]
Granite 13b	July 2023	IBM	Unknown	Unknown	Unknown	Proprietary	Used in IBM Watsonx.^[79]
Mistral 7B	September 2023	Mistral AI	7.3^[80]	Unknown	Unknown	Apache 2.0
Claude 2.1	November 2023	Anthropic	Unknown	Unknown	Unknown	Proprietary	Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.^[81]
Grok 1^[82]	November 2023	xAI	314	Unknown	Unknown	Apache 2.0	Used in Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X (Twitter).^[83]
Gemini 1.0	December 2023	Google DeepMind	Unknown	Unknown	Unknown	Proprietary	Multimodal model, comes in three sizes. Used in the chatbot of the same name.^[84]
Mixtral 8x7B	December 2023	Mistral AI	46.7	Unknown	Unknown	Apache 2.0	Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.^[85] Mixture of experts model, with 12.9 billion parameters activated per token.^[86]
Mixtral 8x22B	April 2024	Mistral AI	141	Unknown	Unknown	Apache 2.0	^[87]
DeepSeek-LLM	November 29, 2023	DeepSeek	67	2T tokens^[88]^{: table 2}	12,000	DeepSeek License	Trained on English and Chinese text. 1e24 FLOPs for 67B. 1e23 FLOPs for 7B^[88]^{: figure 5}
Phi-2	December 2023	Microsoft	2.7	1.4T tokens	419^[89]	MIT	Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.^[89]
Gemini 1.5	February 2024	Google DeepMind	Unknown	Unknown	Unknown	Proprietary	Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens.^[90]
Gemini Ultra	February 2024	Google DeepMind	Unknown	Unknown	Unknown	Proprietary
Gemma	February 2024	Google DeepMind	7	6T tokens	Unknown	Gemma Terms of Use^[91]
Claude 3	March 2024	Anthropic	Unknown	Unknown	Unknown	Proprietary	Includes three models, Haiku, Sonnet, and Opus.^[92]
DBRX	March 2024	Databricks and Mosaic ML	136	12T tokens	Unknown	Databricks Open Model License^[93]^[94]	Training cost 10 million USD
Fugaku-LLM	May 2024	Fujitsu, Tokyo Institute of Technology, etc.	13	380B tokens	Unknown	Fugaku-LLM Terms of Use^[95]	The largest model ever trained on CPU-only, on the Fugaku^[96]
Phi-3	April 2024	Microsoft	14^[97]	4.8T tokens	Unknown	MIT	Microsoft markets them as "small language model".^[98]
Granite Code Models	May 2024	IBM	Unknown	Unknown	Unknown	Apache 2.0
Qwen2	June 2024	Alibaba Cloud	72^[99]	3T tokens	Unknown	Qwen License	Multiple sizes, the smallest being 0.5B.
DeepSeek-V2	June 2024	DeepSeek	236	8.1T tokens	28,000	DeepSeek License	1.4M hours on H800.^[100]
Nemotron-4	June 2024	Nvidia	340	9T tokens	200,000	NVIDIA Open Model License^[101]^[102]	Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.^[103]^[104]
Claude 3.5	June 2024	Anthropic	Unknown	Unknown	Unknown	Proprietary	Initially, only one model, Sonnet, was released.^[105] In October 2024, Sonnet 3.5 was upgraded, and Haiku 3.5 became available.^[106]
Llama 3.1	July 2024	Meta AI	405	15.6T tokens	440,000	Llama 3 license	405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs.^[107]^[108]
OpenAI o1	September 12, 2024	OpenAI	Unknown	Unknown	Unknown	Proprietary	Reasoning model.^[109]
Mistral Large	November 2024	Mistral AI	123	Unknown	Unknown	Mistral Research License	Upgraded over time. The latest version is 24.11.^[110]
Pixtral	November 2024	Mistral AI	123	Unknown	Unknown	Mistral Research License	Multimodal. There is also a 12B version which is under Apache 2 license.^[110]
DeepSeek-V3	December 2024	DeepSeek	671	14.8T tokens	56,000	MIT	2.788M hours on H800 GPUs.^[111] Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025.^[112]
Amazon Nova	December 2024	Amazon	Unknown	Unknown	Unknown	Proprietary	Includes three models, Nova Micro, Nova Lite, and Nova Pro^[113]
DeepSeek-R1	January 2025	DeepSeek	671	Not applicable	Unknown	MIT	No pretraining. Reinforcement-learned upon V3-Base.^[114]^[115]
Qwen2.5	January 2025	Alibaba	72	18T tokens	Unknown	Qwen License	7 dense models, with parameter count from 0.5B to 72B. They also released 2 MoE variants.^[116]
MiniMax-Text-01	January 2025	Minimax	456	4.7T tokens^[117]	Unknown	Minimax Model license	^[118]^[117]
Gemini 2.0	February 2025	Google DeepMind	Unknown	Unknown	Unknown	Proprietary	Three models released: Flash, Flash-Lite and Pro^[119]^[120]^[121]
Claude 3.7	February 24, 2025	Anthropic	Unknown	Unknown	Unknown	Proprietary	One model, Sonnet 3.7.^[122]
GPT-4.5	February 27, 2025	OpenAI	Unknown	Unknown	Unknown	Proprietary	Largest non-reasoning model.^[123]
Grok 3	February 2025	xAI	Unknown	Unknown	Unknown, estimated 5,800,000	Proprietary	Training cost claimed "10x the compute of previous state-of-the-art models".^[124]
Llama 4	April 5, 2025	Meta AI	400	40T tokens	Unknown	Llama 4 license	^[125]^[126]
OpenAI o3 and o4-mini	April 16, 2025	OpenAI	Unknown	Unknown	Unknown	Proprietary	Reasoning models.^[127]
Qwen3	April 2025	Alibaba Cloud	235	36T tokens	Unknown	Apache 2.0	Multiple sizes, the smallest being 0.6B.^[128]
Claude 4	May 22, 2025	Anthropic	Unknown	Unknown	Unknown	Proprietary	Includes two models, Sonnet and Opus.^[129]
Grok 4	July 9, 2025	xAI	Unknown	Unknown	Unknown	Proprietary
GLM-4.5	July 29, 2025	Zhipu AI	355	22T tokens	Unknown	MIT	Released in 335B and 106B sizes.^[130] Corpus size was calculated by combining the 15 trillion tokens and the 7 trillion tokens pre-training mix.^[131]
GPT-OSS	August 5, 2025	OpenAI	117	Unknown	Unknown	Apache 2.0	Released in 20B and 120B sizes.^[132]
Claude 4.1	August 5, 2025	Anthropic	Unknown	Unknown	Unknown	Proprietary	Includes one model, Opus.^[133]
GPT-5	August 7, 2025	OpenAI	Unknown	Unknown	Unknown	Proprietary	Includes three models, GPT-5, GPT-5 Thinking, and GPT-5 Pro. GPT-5 is available in ChatGPT for all users.^[134]

List of large language models

List

See also

Notes

References

Wikiwand - on