Large Language Models (LLMs)
Systems that generate text through statistical pattern prediction: powerful, yet operating in a way fundamentally different from human understanding
What are Large Language Models?
Core definition: LLMs are AI systems optimized for predicting the next token in a sequence.
This seemingly simple objective (sketched in code after the list below) enables remarkable capabilities:
- Generating coherent essays and articles
- Answering questions across domains
- Translating between languages
- Writing functional code
- Engaging in dialogue
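To make next-token prediction concrete, here is a minimal sketch of the final step of that prediction: converting the per-token scores (logits) a model produces into a probability distribution with a softmax. The four-word vocabulary and the logit values are invented for illustration; a real model scores a vocabulary of tens of thousands of tokens.

```python
import math

# Toy next-token prediction: the model has scored each candidate token
# (the logits below are made-up values for the prompt "The cat sat on the"),
# softmax turns those scores into probabilities, and decoding picks a token.
vocab = ["mat", "floor", "moon", "sofa"]
logits = [2.1, 1.8, -0.5, 0.9]

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token}: {p:.2f}")
# Greedy decoding would pick the highest-probability token: "mat"
```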
Training Process
Phase 1: Data ingestion
- Web pages, books, code repositories, conversations
- Hundreds of billions to trillions of tokens
- Requires months of computation and significant resources
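As a toy stand-in for this ingestion step (a sketch, not how production pipelines work), the snippet below splits a tiny invented corpus into tokens by whitespace. Real pipelines use subword tokenizers such as byte-pair encoding and process trillions of tokens.

```python
# Miniature "data ingestion": a three-document corpus, tokenized naively.
# Real training data spans web pages, books, and code at vastly larger scale.
corpus = [
    "The cat sat on the mat.",
    "How to bake bread: mix, knead, rest, bake.",
    "function add(a, b) { return a + b; }",
]

tokens = [tok for doc in corpus for tok in doc.split()]
print(f"{len(corpus)} documents, {len(tokens)} tokens")
print(tokens[:8])  # ['The', 'cat', 'sat', 'on', 'the', 'mat.', 'How', 'to']
```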
Phase 2: Pattern extraction
- "After 'The cat sat on the...' likely follows 'mat' or 'floor'"
- "Questions beginning with 'How to...' typically yield procedural answers"
- "Code starting with 'function...' usually includes '' syntax"
Phase 3: Text generation
- Input: "Write a poem about dogs"
- Model: Accesses learned patterns about poetry structure
- Output: "Golden fur in morning light..."
- Process: Predicts each subsequent token probabilistically
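The loop below imitates this step-by-step generation using the same bigram-table idea: at each step, sample the next token in proportion to how often it followed the current one in the training text. An LLM's generation loop has the same shape, except the "table lookup" is a forward pass through billions of parameters. The seed corpus and start token here are invented.

```python
import random
from collections import Counter, defaultdict

# Build a bigram table from a tiny invented corpus, then generate by
# repeatedly sampling a next token weighted by its observed frequency.
corpus = "golden fur in morning light golden fur in evening light".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

token = "golden"
output = [token]
for _ in range(5):
    counts = follows[token]
    if not counts:
        break  # no observed continuation; stop generating
    choices, weights = zip(*counts.items())
    token = random.choices(choices, weights=weights)[0]
    output.append(token)

print(" ".join(output))  # e.g. "golden fur in morning light golden"
```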
The Appearance of Understanding
LLMs are sufficiently skilled at pattern prediction that their outputs appear to reflect understanding. However, they lack genuine comprehension.
A useful analogy: an extremely sophisticated autocomplete system that has processed virtually all human text and can recombine patterns convincingly. Impressive capability? Certainly. Conscious understanding? No.
Scale and Parameters
Parameters: The numerical "knobs" the model adjusts during training
- Small model: 1 million parameters (can barely form sentences)
- Medium model: 1 billion parameters (can chat reasonably)
- Large model: 100+ billion parameters (can fool you into thinking it's human)
More parameters = better predictions (usually)
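For a rough sense of where these tiers come from, a common back-of-the-envelope estimate for a decoder-only transformer is params ≈ 12 × n_layers × d_model² (attention plus feed-forward weights, ignoring embeddings and biases). The three configurations below are illustrative shapes chosen to land near the tiers above, not published model configs, though the largest deliberately matches GPT-3's 96-layer, 12288-wide shape.

```python
# Back-of-the-envelope transformer parameter count:
# roughly 12 * n_layers * d_model^2, ignoring embeddings and biases.
def approx_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

for name, layers, width in [
    ("small", 2, 192),       # ~1M-parameter tier
    ("medium", 24, 2048),    # ~1B-parameter tier
    ("large", 96, 12288),    # ~175B-parameter tier (GPT-3-like shape)
]:
    print(f"{name}: {approx_params(layers, width):,} parameters")
```

Run this and the large configuration comes out near 174 billion, which is why the 96-layer, 12288-wide shape is associated with models in the 175B class.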