How is model size measured for Large Language Models?

AI Scaling 

ChatGPT4 gives a far better answer than Gemini/Bard, although ChatGPT's outdated training cutoff in "early 2023" is a handicap. Hoffmann et al. found "that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled." One of the authors, Arthur Mensch, went on to found a company, Mistral, based on making the most efficient AI components and advancing Europe in the AI arms race (an expression I don't care for because it conflicts with safety).
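Hoffmann et al.'s rule has a direct consequence for compute. Under the common rule of thumb from the scaling-law literature that training compute is roughly C ≈ 6·N·D FLOPs (N = parameters, D = training tokens; the 6·N·D approximation is an assumption I'm adding, not something from the quote), doubling both N and D quadruples compute:

```python
# Rough Chinchilla-style scaling sketch.
# Approximation: training compute C ~= 6 * N * D FLOPs,
# where N = parameter count and D = training tokens.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs via the 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens

# Chinchilla itself: 70B parameters trained on 1.4T tokens.
c1 = training_flops(70e9, 1.4e12)

# Compute-optimal scaling: doubling model size means doubling tokens too,
# so training compute goes up by a factor of 4.
c2 = training_flops(140e9, 2.8e12)
print(c2 / c1)  # 4.0
```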

ChatGPT4

Model size for Large Language Models (LLMs), including GPT models, is typically measured by the number of parameters they have. These parameters are the internal settings learned from data during the training process, and they dictate how the model responds to input queries.

A parameter in this context is a component of the model's architecture that can adjust and learn from the training data. In neural networks, which include LLMs, parameters are the weights and biases in the network's layers. These weights and biases are adjusted during training to minimize the difference between the model's predictions and the actual outcomes (known as the loss or error).
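To illustrate how weights and biases add up to a parameter count, here is a minimal sketch for a toy fully connected network (the layer sizes are invented for illustration; real LLMs are transformers with a more involved parameter breakdown):

```python
# Count parameters in a toy fully connected network.
# Each layer mapping n_in -> n_out units contributes:
#   n_in * n_out weights + n_out biases.

def count_params(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# A tiny 3-layer MLP: 512 -> 1024 -> 1024 -> 256.
print(count_params([512, 1024, 1024, 256]))  # 1837312
```

At LLM scale the same bookkeeping, applied to attention and feed-forward blocks, is what yields the headline figures in the billions.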

The model size, expressed in the number of parameters, can range from millions (e.g., hundreds of millions for smaller models) to billions or even trillions for the largest models available as of my last update in early 2023. Here's a rough breakdown:

- **Small models**: Up to a few hundred million parameters.

- **Medium models**: From a few hundred million to a couple of billion parameters.

- **Large models**: Tens of billions of parameters.

- **Very large models**: Hundreds of billions to over a trillion parameters.
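The breakdown above amounts to a simple lookup. A sketch, with the caveat that these thresholds are the answer's rough categories rather than any industry standard:

```python
def size_class(n_params):
    """Map a parameter count to the rough size categories above."""
    if n_params < 500e6:      # up to a few hundred million
        return "small"
    elif n_params < 5e9:      # a few hundred million to a couple of billion
        return "medium"
    elif n_params < 100e9:    # tens of billions
        return "large"
    else:                     # hundreds of billions and beyond
        return "very large"

print(size_class(125e6))  # small
print(size_class(1.5e9))  # medium
print(size_class(70e9))   # large
print(size_class(1e12))   # very large
```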

The size of the model is a key factor in its potential capabilities, with larger models generally able to understand and generate more nuanced and complex responses. However, larger models also require more computational resources for both training and inference (generating responses), including more powerful hardware and greater energy consumption.
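One concrete resource cost scales directly with parameter count: the memory needed just to hold the weights. A back-of-the-envelope sketch (the 2-bytes-per-parameter figure assumes 16-bit weights; actual serving memory is higher once activations and key-value caches are included):

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate memory in GB just to store the weights (fp16 = 2 bytes each)."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))   # 14.0  -> a 7B model fits on one consumer GPU
print(weight_memory_gb(70e9))  # 140.0 -> a 70B model needs multiple accelerators
print(weight_memory_gb(1e12))  # 2000.0
```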
