From Zero to LLM: A Complete Guide to the Large Language Model Tech Tree


All content herein was generated by an LLM and compiled into this document to facilitate sharing with family and friends.

Part 1: The Trunk—Teaching Machines to Think

Before diving into the complex technologies that form the core of today’s Artificial Intelligence (AI) revolution, we must first establish a solid conceptual framework. This section will map out the macroscopic landscape of the AI field, clarifying key terms that are often confused. We will start with the broadest forest, gradually focus on specific trees, and finally arrive at our destination for this journey—the Large Language Model (LLM).

1.1 The Forest of AI—More Than Just Robots

When we hear the term “Artificial Intelligence,” what often comes to mind are self-aware robots from science fiction films. However, real-world AI is a far broader and more diverse field. A more precise and practical definition comes from AI researcher François Chollet, who describes AI as “the effort to automate intellectual tasks normally performed by humans” ¹. This definition encompasses everything from simple task automation to complex cognitive simulations, such as visual perception, speech recognition, decision-making, and language translation ².

To better understand this vast field, we can use two vivid analogies.

The first analogy is the Russian nesting doll ¹. Imagine a set of Russian nesting dolls. The largest doll represents “Artificial Intelligence” (AI), an all-encompassing term. Open it, and inside is a slightly smaller doll representing “Machine Learning” (ML). Open the Machine Learning doll, and you’ll find an even smaller one: “Deep Learning” (DL). This nested structure intuitively shows their relationship: Deep Learning is a subset of Machine Learning, and Machine Learning is a subset of Artificial Intelligence ¹.

The second analogy is transportation ². Think of “Artificial Intelligence” as the grand concept of “transportation,” which includes all ways of moving people and goods. Within this framework, “Machine Learning” is like the “automobile,” a very important and popular method of transportation. “Deep Learning” can then be seen as the “electric vehicle,” a more advanced and specific technology within the automotive field. The brilliance of this analogy is that it clearly shows that not all AI is Machine Learning, just as not all forms of transportation are cars—we also have trains, bicycles, and airplanes.

This distinction is crucial because it reveals a fundamental division within AI: learning-based AI versus non-learning-based AI.

  • Non-learning AI (Rule-Based Systems): This is the early form of AI, where systems follow a series of “if-then” rules meticulously crafted by human programmers to perform tasks ⁴. They can be very complex and “intelligent,” but they cannot learn from experience or improve themselves. Typical examples include:
    • Game AI: In video games, the behavior of non-player characters (NPCs) is often driven by preset scripts and rules, rather than by learning from the player’s actions to adjust their strategies ².
    • Expert Systems: The MYCIN system from the 1970s could diagnose bacterial infections and recommend antibiotics based on a database containing a vast amount of medical knowledge and rules ⁴.
    • Automated Decision Systems: Simple tax preparation software calculates taxes based on preset tax law rules and does not learn new tax avoidance techniques from user data ².
  • Learning-based AI: This is the mainstream of modern AI, with Machine Learning at its core. These systems do not passively execute instructions but actively discover patterns from data to make decisions ⁴.

Understanding that “intelligence” is a spectrum, not a switch, is the first step into the world of AI. A system is called “AI” not because it possesses human-like consciousness, but because it can automate a task that would otherwise require human intellect. The complexity of this task determines its position on the intelligence spectrum. Rule-based systems are at one end of the spectrum, like highly skilled but uncreative artisans perfectly executing their given instructions. Learning-based systems are at the other end, more like apprentices who master skills through observation and practice. It is this paradigm shift from “coding instructions” to “providing experience” that spurred the rise of Machine Learning and paved the way for the LLMs we know today.

1.2 The First Main Branch—Machine Learning

Now, let’s open the largest AI nesting doll and look at the first core branch inside: Machine Learning (ML). ML pioneer Arthur Samuel gave a definition decades ago that remains remarkably insightful today: it is the “field of study that gives computers the ability to learn without being explicitly programmed” ¹. The key phrase here is “without being explicitly programmed.”

This means the role of the programmer has fundamentally changed. In the traditional programming model, the programmer had to anticipate all possibilities and write precise instructions for each scenario. In Machine Learning, the programmer is no longer the rule-maker but more like a teacher. Their core task is to provide the machine with a meticulously compiled “textbook”—that is, a massive, labeled dataset ². The machine “reads” this textbook to independently learn and generalize the patterns and rules hidden within the data.

To understand this process concretely, let’s walk through a simple scenario: fruit sorting ¹.

  1. Human Intelligence: Imagine a worker by a conveyor belt, skillfully sorting apples, bananas, and oranges into different bins based on their experience and knowledge.
  2. Non-learning AI (Rule-Based System): Now, we replace the worker with a machine. This machine is equipped with a scanner. Whenever a fruit passes by, it scans a pre-applied label. The machine’s program is very simple: IF label == “apple”, THEN place in apple bin. This system works but is very rigid. If a fruit has no label or the label is unclear, the machine is helpless. It completely relies on rules preset by humans.
  3. Machine Learning: This is the real revolution. We equip the machine with a camera and then show it thousands of labeled fruit images—”this is an apple,” “this is a banana,” “this is not an apple.” We don’t tell it that apples are red and round, or that bananas are yellow and curved. We simply provide a vast number of “examples.” The machine analyzes these massive amounts of images to discover the intrinsic patterns of “appleness” or “banananess” for itself. Eventually, it learns to recognize them. When a brand new, unseen, unlabeled apple appears on the conveyor belt, it can accurately identify and sort it based on the patterns it has learned ².
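
To make the contrast concrete, here is a toy sketch in Python: the rule-based approach is a hand-written if-statement, while the learning-based approach trains a small scikit-learn classifier on labeled examples. The feature values (redness, roundness) and the training data are invented purely for illustration.

```python
# Approach 2 -- rule-based: a human writes the rule.
def sort_by_label(label: str) -> str:
    if label == "apple":
        return "apple bin"
    return "reject bin"  # helpless when the label is missing or unknown

# Approach 3 -- machine learning: a human provides labeled examples.
from sklearn.tree import DecisionTreeClassifier

# Each fruit is described by two toy features: [redness, roundness], both 0-1.
examples = [[0.9, 0.95], [0.85, 0.9], [0.1, 0.2], [0.15, 0.1]]
labels   = ["apple", "apple", "banana", "banana"]

model = DecisionTreeClassifier().fit(examples, labels)

# A brand-new, unlabeled fruit: the model generalizes from the examples.
print(model.predict([[0.8, 0.88]]))  # -> ['apple']
```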

This shift from “hard-coding rules” to “learning from data” is the core idea of Machine Learning. Its applications are everywhere:

  • Spam Filters: They learn the language patterns of spam by analyzing millions of emails that users have marked as “spam” and “not spam,” thereby automatically filtering out new spam emails ².
  • Customer Service Chatbots: Some chatbots can learn from past customer conversations to more accurately answer new customer questions ².
  • Credit Scoring: Banks use ML models to analyze large amounts of historical loan data, learning the features that predict whether a customer will default, to assess the credit risk of new applicants ².

The significance of this paradigm shift is profound. It means that for many complex problems, humans no longer need to understand all the underlying rules. For example, it’s very difficult to precisely define the “tone of a spam email” with code, but we can easily provide thousands of examples. The advent of Machine Learning has shifted the focus of solving such problems from “designing clever algorithms” to “collecting, cleaning, and labeling high-quality data.” The entire industry built around data—data collection, data labeling, data storage—has boomed as a result. The role of the programmer has transformed from that of a meticulous architect to a knowledgeable librarian and a patient teacher.

1.3 Deeper Growth—Deep Learning

Within the landscape of Machine Learning, there is a particularly powerful and vibrant branch that has been the engine behind almost all major AI breakthroughs in recent years—Deep Learning (DL). Deep Learning is a subfield of Machine Learning, and its core is the use of a model inspired by the structure of the human brain, namely “Artificial Neural Networks” ⁴. When these networks contain many layers, we call them “deep neural networks,” which is where the term “deep” comes from ².

The most fundamental difference between Deep Learning and traditional Machine Learning lies in the way “feature extraction” is performed. This is a key concept for understanding the power of DL.

  • Traditional Machine Learning: In our fruit sorting example, if we used traditional ML, a human expert might need to intervene first. This expert would tell the machine which “features” to focus on, such as “color,” “shape,” “texture,” etc. This process is called feature extraction. Humans need to manually define these key features before the machine can learn based on them ². This is like teaching a child to read by first breaking down the characters into radicals and teaching those.
  • Deep Learning: Deep Learning is completely different; it can perform automatic feature extraction. You don’t need to tell the model what to focus on. You simply “feed” the raw data—like all the pixels of an image—directly to the deep neural network. The network’s multi-layered structure acts like a highly automated assembly line, learning and extracting features layer by layer, automatically ².
    • The first layer might only learn very basic features, like edges, corners, and color patches.
    • The second layer will combine the edges and color patches learned by the first layer to form more complex features, like textures, patterns, and simple shapes (e.g., circles, squares).
    • Deeper layers continue this combination, possibly identifying parts like “eyes,” “noses,” and “ears.”
    • The final few layers combine these parts to form a highly abstract concept, like “dog” or “cat.”

This hierarchical learning process, from simple to complex, from concrete to abstract, bears some resemblance to how the human brain’s visual cortex processes information. It is this ability to automatically construct a hierarchy of features that makes Deep Learning exceptionally good at handling complex “unstructured data” like images, sound, and text ³.

To explain why Deep Learning has only exploded with such tremendous energy in recent years, we can borrow the “rocket ship” analogy from renowned AI scientist Andrew Ng ⁶:

  • The rocket engine: Represents the Deep Learning models, especially those deep neural networks with vast, complex structures.
  • The fuel: Represents the massive amount of data needed to train the models.

The insight of this analogy is that to successfully launch a rocket into orbit, you must have both a huge engine and a massive amount of fuel. If you have a powerful engine but not enough fuel, you won’t fly far before crashing. If you have tons of fuel but a tiny engine, you can’t even lift off.

For decades, the theory of neural networks (the blueprints for the engine) has actually existed. But we long lacked two things: first, the computational power to drive these massive engines (powerful GPUs, or Graphics Processing Units, which can be seen as the rocket’s launchpad), and second, the massive data to fill them (the fuel). With the popularization of the internet and the falling cost of computation, we finally gathered both these elements in the early 21st century. It was as if we had finally filled a powerful rocket engine with fuel and built a sturdy launchpad, making an AI “liftoff” inevitable.

Almost all of the amazing modern AI applications are driven by Deep Learning:

  • Image Recognition: Social media automatically tagging your friends in photos ².
  • Autonomous Driving: Vehicles like Tesla perceiving their surroundings through cameras and sensors to make driving decisions ².
  • Voice Assistants: Siri and Alexa accurately understanding your voice commands and responding in a natural way ².

The true magic of Deep Learning isn’t just that it “imitates the brain,” but its ability to abstract at scale. This layered structure gives it the capacity to build a complex, rich model of the world from raw sensory data. When this ability is combined with unprecedented computational power and data scale, it unlocks the potential to solve problems once thought to be beyond the reach of machines. This is the final cornerstone on our path to Large Language Models.

Part 2: The Challenge of Language—Teaching Machines to Read

We’ve learned how machines acquire “intelligence” through learning. Now, we turn our focus to a challenge that is innate for humans but incredibly difficult for machines: understanding language. This section will lead us into the world of Natural Language Processing (NLP), exploring the early difficulties machines faced in “reading” and how the initial solutions laid the groundwork for the revolution to come.

2.1 The Quest for Meaning—Natural Language Processing

Natural Language Processing (NLP) is an important branch of AI and Machine Learning whose core goal is to enable computers to interpret, process, and understand human language ⁷. This may sound simple, but in reality, human language is one of the thorniest problems in computer science. The reason is that our language is full of “irregular” features:

  • Ambiguity: The same word can have completely different meanings in different contexts. For example, in “Please give me an Apple,” “Apple” could refer to the fruit or the phone.
  • Context Dependency: The true meaning of a sentence often depends on the preceding and succeeding conversation.
  • Diversity: Language is filled with dialects, slang, sarcasm, metaphors, emotional tones, and even grammatical errors ⁷.

The human brain can effortlessly handle these complexities, but for a computer that only understands logic and mathematics, it’s a nightmare. Therefore, the primary task of NLP, and the first huge obstacle it faces, is how to convert ambiguous, symbolic words into precise, numerical representations that a computer can understand ⁹. This process is called “Word Embedding.”

We can understand word embedding through the analogy of a “vocabulary map.” Imagine that instead of seeing each word as an isolated symbol, we place it on a huge, multi-dimensional map. On this map, each word has its own unique “coordinates” (a vector composed of many numbers). The magic of this map is that the spatial relationships between words reflect their semantic relationships ¹¹.

  • The coordinates for “king” and “queen” would be very close.
  • The coordinates for “man” and “woman” would also be close.
  • More interestingly, the vector from “king” to “queen” (the direction and distance) would be very similar to the vector from “man” to “woman.” Similarly, the vector from “France” to “Paris” would be similar to the vector from “China” to “Beijing.”

In this way, the meaning of a word is no longer defined as an isolated point but by its relative position in the entire “vocabulary universe.” This method of mathematizing word relationships is the cornerstone of modern NLP.
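
A tiny sketch of this idea, using hand-written 3-dimensional “coordinates” (real embeddings have hundreds of dimensions and are learned from text, not written by hand):

```python
import numpy as np

# Purely illustrative toy vectors; not real learned embeddings.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.5, 0.8, 0.3]),
    "woman": np.array([0.5, 0.2, 0.3]),
}

def cosine(a, b):
    # Similarity of direction between two word vectors (1.0 = identical).
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The famous analogy: king - man + woman should land near queen.
target = vec["king"] - vec["man"] + vec["woman"]
for word, v in vec.items():
    print(word, round(cosine(target, v), 3))  # "queen" scores highest
```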

After digitizing words, a computer can perform a series of basic NLP tasks to gradually “understand” text ⁷:

  • Tokenization: This is the most basic step, breaking a sentence down into individual units called “tokens.” For example, the sentence “The cat sat on the mat.” would be broken down into the tokens ['The', 'cat', 'sat', 'on', 'the', 'mat', '.'] ⁷.
  • Part-of-Speech (POS) Tagging: The machine assigns a grammatical identity to each token, such as noun, verb, adjective, etc. This helps in understanding the sentence structure ⁷.
  • Named Entity Recognition (NER): The machine identifies proper nouns in the sentence, such as names of people (“Jane”), places (“France”), or organizations ⁷.
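
A minimal sketch of these three steps using the open-source spaCy library (assuming its small English model, en_core_web_sm, has been installed):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Jane moved from France to work at Google.")

# Tokenization: the sentence split into tokens.
print([token.text for token in doc])

# Part-of-speech tagging: a grammatical identity for each token.
print([(token.text, token.pos_) for token in doc])

# Named entity recognition: people, places, organizations.
print([(ent.text, ent.label_) for ent in doc.ents])
```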

The entire history of NLP can be seen as a long journey of transforming the ambiguous, context-dependent symbols of human language into concrete, structured mathematical representations that computers can process. Early NLP methods tried to establish rigid rules and dictionaries, but with little success. The advent of word embedding was a turning point. It no longer tried to define an absolute meaning for words but captured their relative, dynamic meaning through their co-occurrence relationships in massive texts. This shift from “absolutism” to “relationalism” opened the door for the application of Deep Learning in the field of NLP.

2.2 The Old Path—Reading Word by Word (Recurrent Neural Networks - RNN)

When Deep Learning met Natural Language Processing, an intuitive and elegant solution was born: the Recurrent Neural Network (RNN). RNNs were designed specifically to handle sequential data, such as time series, audio, and, most importantly for us, text ¹². Their way of working is very much in line with human intuition for reading: processing a sentence from left to right, one word at a time.

The core mechanism of an RNN is its “memory cell,” or “hidden state.” When an RNN reads the first word, it generates an understanding of that word and stores it in this memory cell. When it reads the second word, it combines the input of the second word with the information about the first word in the memory cell to update its memory. This continues, with the understanding at each step building upon all previous words. This is like how our brain continuously accumulates information from previous sentences to understand the current content.
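
The core loop can be sketched in a few lines of NumPy. The weights here are random stand-ins; a real RNN learns W_xh, W_hh, and b during training:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_xh = rng.normal(size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)  # the "memory cell": empty before reading anything
sentence = [rng.normal(size=input_size) for _ in range(5)]  # 5 toy word vectors

for x in sentence:
    # New memory = f(current word, previous memory)
    h = np.tanh(W_xh @ x + W_hh @ h + b)

print(h)  # the network's "understanding" after reading the whole sentence
```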

However, this seemingly perfect design hides a fatal flaw: its memory is short-lived.

Let’s look at a classic example that clearly exposes the “amnesia” of RNNs ¹²:

Sentence 1: “Tom is a cat.”

Sentence 2: “Tom’s favorite food is __.”

For a human, the answer to the blank is obviously “fish.” We clearly remember the key information provided in the first sentence. But a standard RNN, by the time it processes the end of the second sentence, has likely already forgotten the content of the first. It knows it needs to predict a type of food, but because it has lost the core context that “Tom is a cat,” it might guess “pizza,” “apples,” or any other food, but not the most logical one, “fish” ¹².

Behind this “long-term dependency problem” is a technical challenge known as “vanishing/exploding gradients” ¹². We can understand it with an analogy: imagine you whisper a secret at the front of a very long line of people and ask them to pass it down. By the time the message reaches the end of the line, one of two things is likely to happen: either the message gets weaker and weaker during transmission, eventually becoming vague or disappearing entirely (vanishing gradient); or the message gets distorted and amplified, becoming completely unrecognizable (exploding gradient).

In an RNN, “memory” is like this whispered secret. With each time step (each word), it decays or deforms. When a sentence is very long, the information from the beginning words has a hard time being effectively transmitted to the end.
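
Two lines of arithmetic show why: a signal repeatedly multiplied by a factor slightly below or slightly above 1 either fades to nothing or blows up over 100 steps.

```python
# The "whisper down the line" effect in numbers, over 100 time steps.
print(0.9 ** 100)  # ~2.7e-05: the message fades away (vanishing gradient)
print(1.1 ** 100)  # ~1.4e+04: the message blows up (exploding gradient)
```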

To solve this problem, researchers designed a more complex RNN variant called the Long Short-Term Memory (LSTM) network ¹². LSTMs add some clever “gating” structures (input gate, forget gate, output gate) to the RNN, like installing valves on the information pipeline. These gates can intelligently decide which information is important and needs to be retained long-term, and which is secondary and can be forgotten. LSTMs alleviated the amnesia problem of RNNs to some extent and became the workhorse model for NLP tasks for a long time.

However, neither RNNs nor LSTMs could escape a fundamental constraint. This constraint stemmed from their most core design philosophy—sequential processing. This one-word-at-a-time approach acts as a natural bottleneck, limiting their potential.

First is the memory bottleneck. Information must pass linearly, step-by-step, through the entire sequence. No matter how cleverly the LSTM gates are designed, information loss and distortion are still inevitable over long distances.

Second is the computational bottleneck. This is the more fatal point. Because processing must be sequential, you cannot compute the representation for the 10th word and the 1st word at the same time, because the calculation for the 10th word depends on the result of the 9th, which in turn depends on the 8th, and so on. This means the entire computation process cannot be massively parallelized. In today’s era of explosive growth in data volume and model size, this computational inefficiency is unacceptable.

Therefore, the entire field was awaiting a revolution. We needed a brand-new architecture that could completely break free from the linear constraints of time and read and understand language in a more global and efficient way. That revolutionary answer was the Transformer.

Part 3: The Revolution—A New Way of Reading

While RNNs and LSTMs were struggling with their inherent sequential processing bottlenecks, a disruptive change was brewing in the AI field. In 2017, Google researchers published a paper with a highly declarative title—“Attention Is All You Need”—heralding the dawn of a new era. The Transformer architecture introduced in this paper completely changed how machines process language and became the foundation for almost all subsequent large language models ¹⁵.

3.1 The Breakthrough—The Transformer Architecture

The Transformer’s most fundamental and revolutionary innovation is that it completely abandons the recurrent structure of RNNs ¹⁵. It no longer needs to read text sequentially, word by word, like a human. Instead, it can process all words in the input sequence simultaneously ¹⁷.

Imagine an RNN is like a person trying to drink soup through a long, thin straw, taking only a small sip at a time—inefficient and with a limited view. The Transformer, on the other hand, is like a person with a giant funnel, able to pour the entire bowl of soup in at once and observe and analyze all its ingredients simultaneously.

This parallel processing capability shattered the computational bottleneck of RNNs. A computer’s GPU (Graphics Processing Unit), with its thousands of cores, is naturally suited for parallel computation. The Transformer’s design perfectly leverages this, leading to an unprecedented increase in training speed. Models could be trained on larger datasets in less time, opening the door to building bigger and more powerful language models.

The impact of this revolution was so profound that some have compared it to a “new industrial revolution” ¹⁸. In past industrial revolutions, humans invented the generator, converting the potential energy of water into electrical energy, creating unprecedented productivity. The Transformer, then, is like the “generator” of the language processing field. It unlocked a new form of “software energy” capable of generating and understanding language at a massive scale, and on this basis, created software that can create more software.

So, how does the Transformer understand the order and relationships of words in a sentence without a recurrent structure? The answer lies in the title of that paper—the Attention Mechanism.

3.2 The Secret Weapon—The Self-Attention Mechanism

If the Transformer is the revolutionary architecture, then the Self-Attention Mechanism is the secret weapon of this revolution. It’s a relatively complex concept, but we can peel back its layers of mystery with a series of analogies. The core idea is this: when processing any given word in a sentence, the self-attention mechanism allows this word to “examine” all other words in the sentence and dynamically assign them different “attention weights” based on their relevance ¹⁹.

Analogy 1: The Cocktail Party

Imagine you are at a noisy cocktail party. To clearly hear what one person is saying, you don’t just isolate the sound they are making. Your brain automatically performs a series of complex processes: you pay more attention to the person they are talking to, observe their mouth movements and body language, and also take note of the overall atmosphere and topics of surrounding conversations. All this information, combined, helps you accurately understand their intent. The self-attention mechanism gives every word in a sentence this ability. When the model processes the word “it,” “it” will “look around” and “listen” to all the other words in the sentence to determine what it actually refers to.

Analogy 2: Search Engine and Online Matchmaking (The Magic of Q, K, V)

To achieve the “looking around” described above, the model generates three special vectors for each word in the sentence. These are the core components of self-attention: Query, Key, and Value ¹⁹. We can use the example of a search engine or an online matchmaking platform to understand their roles ²².

Let’s take the sentence “The robot picked up the book because it was heavy” and focus on the word “it.”

  1. Query (Q): This is the “search request” or “pairing need” issued by the current word. It expresses, “What information do I need to better understand myself?”
    • Example: When the model processes the word “it,” it generates a Query vector. The meaning of this vector can be understood as: “I am a pronoun, and I am looking for a noun that I can refer to, which is likely an object.”
  2. Key (K): This is the “keyword” or “personal tag” that each word in the sentence uses to “be searched.” It broadcasts, “What am I, and what attributes do I have?”
    • Example: The word “robot” will generate a Key vector with the meaning: “I am a singular noun, a concrete object, and can be referred to by a pronoun.” The word “book” will generate a similar Key vector.
  3. Matching and Scoring: Next, the model uses the Query vector from “it” to perform a matching calculation (usually a dot product) with the Key vectors of all words in the sentence (including itself). This produces an “attention score” ¹⁹. This score represents the relevance or degree of match.
    • Example: The Query from “it” (looking for an object noun) will have a high match with the Key from “robot” (I am an object noun), resulting in a high score. The match with the Key from “book” will also be quite high. The match with the Keys from words like “picked” or “because” will be very low.
  4. Weighting and Aggregation (Value, V): Besides a Key, each word also has a Value vector. The Value vector represents the “true meaning” or “informational content” of the word. The attention scores, after being normalized by a Softmax function, become a set of weights. The final, context-aware new representation for the current word (“it”) is obtained by taking a weighted sum of the Value vectors of all words in the sentence, using these weights ¹⁹.
    • Example: Because “robot” and “book” received the highest attention scores, their Value vectors (i.e., their semantic information) will play a dominant role in constructing the new representation of “it.” By analyzing the sentence structure (“because it was heavy”), the model might ultimately give “book” a higher weight than “robot,” thus correctly understanding that “it” refers to “book.”

This Query-Key-Value (QKV) process is essentially a highly flexible and dynamic information filtering and aggregation process. It allows each word in the sentence to draw the most relevant information from a global perspective, thereby constructing a deep understanding of its own meaning.
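
The whole QKV computation fits in a short NumPy sketch. The weight matrices below are random stand-ins for the learned ones, and the dimensions are arbitrary toy values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sentence X.

    X: (seq_len, d_model) -- one embedding vector per word.
    The W_* matrices are learned during training; random here.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # match every Query with every Key
    weights = softmax(scores)                  # attention weights per word
    return weights @ V                         # weighted sum of the Values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 9, 16, 8  # e.g. the 9 words of "The robot ... heavy"
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (9, 8): a new, context-aware vector for every word
```

Each row of the output is exactly the “new representation” described above: the word’s original meaning, re-expressed as a blend of the Values it paid the most attention to.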

Multi-Head Attention

To make the understanding richer and more multi-faceted, the Transformer doesn’t just perform one QKV calculation. It performs multiple calculations in parallel and independently. This process is called “Multi-Head Attention” ¹⁵.

This is like an expert consultation. Instead of just inviting one general practitioner, we invite a grammar expert, a semantic relations expert, a logical reasoning expert, and others at the same time. Each “Attention Head” is like an expert. It has its own independent set of Q, K, and V weight matrices and focuses on analyzing the relationships between words from different angles.

  • One head might focus on identifying the subject-verb-object grammatical structure.
  • Another head might focus on identifying synonym or antonym relationships.
  • Yet another head might focus on resolving pronoun references.

Finally, the analysis results from all these “expert heads” are integrated to form a comprehensive, multi-dimensional, and deep understanding of the sentence.
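
Building on the sketch above, multi-head attention simply runs several independent sets of Q/K/V matrices in parallel and merges the results with one final projection matrix (again, all weights here are random stand-ins):

```python
# Continuing the previous sketch: several independent "expert heads",
# each with its own Q/K/V matrices, analyzed in parallel and then merged.
def multi_head_attention(X, heads, W_out):
    # Each head returns its own (seq_len, d_head) view of the sentence.
    views = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    # Concatenate all expert opinions and mix them with one final matrix.
    return np.concatenate(views, axis=-1) @ W_out

n_heads = 2
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_out = rng.normal(size=(n_heads * d_head, d_model))

print(multi_head_attention(X, heads, W_out).shape)  # (9, 16)
```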

The true power of the self-attention mechanism is that it is not just a clever engineering design, but a fundamental computational paradigm. It is essentially a learnable, differentiable, relational database query system ²⁴.

  • Relational: It doesn’t look up isolated information but finds the mutual relationships within a set of data (all the words in a sentence).
  • Learnable: The weight matrices that generate the Q, K, and V vectors are not fixed but are continuously learned and optimized during the model training process ²⁰. This means that as the model learns the language, it is also learning “how to ask better questions (Query)” and “how to be better retrieved (Key)”—that is, learning “what to pay attention to.”
  • Differentiable: The entire attention calculation process is composed of a series of smooth mathematical operations (matrix multiplication and Softmax). Mathematically, “differentiable” means we can calculate the impact of tiny adjustments on the final result. This is crucial because it allows the entire model to be trained end-to-end using optimization algorithms like “gradient descent.” When the model makes a mistake, the error signal can be smoothly propagated back to guide the adjustment of the Q, K, and V weight matrices, so the model does better next time.

It is this powerful mechanism, which integrates querying, learning, and optimization, that frees the Transformer from the shackles of time, allowing it to deeply capture the complex meaning of language in a global, dynamic, and scalable way, thereby ushering in the glorious era of large language models.

Part 4: The Canopy—The Age of Large Language Models

With the revolutionary trunk of the Transformer architecture established, the AI tech tree began to grow upwards at an unprecedented rate, its branches and leaves flourishing, ultimately forming the spectacular “canopy” we see today—Large Language Models (LLMs). In this part, we will delve into this lush canopy to explore what exactly makes these models worthy of the word “large,” how they are “educated” to become so capable, and what famous members and factions exist in this vast family.

4.1 What Makes LLMs “Large”?

When we talk about a model being a “Large” Language Model, “large” primarily refers to two dimensions: the number of Parameters and the scale of the Training Data. These two aspects are complementary and together form the foundation of an LLM’s powerful capabilities.

Element 1: Parameters—The Model’s “Brain Capacity”

  • Definition: Parameters are the tunable variables within a model, typically referring to the “weights” and “biases” in a neural network ²⁷. They are the carriers of the “knowledge” that the model learns from data during training ²⁸. When the model makes predictions or generates text, it is using these parameters.
  • Analogy: The most intuitive way to understand parameters is to imagine them as billions of “knobs” or “levers” on an extremely complex machine ³⁰. A simple linear equation y=mx+b has only two parameters (m and b) ³¹. An LLM, in contrast, is a super-function with tens of billions or even trillions of knobs. The “training” process of the model is essentially the process of “debugging” with massive amounts of data to precisely adjust these billions of knobs to their correct positions. (A back-of-the-envelope parameter count is sketched after this list.)
  • Scale: The number of parameters directly determines the complexity and “brain capacity” of the model. More parameters mean the model has the ability to capture and store finer, more complex patterns and knowledge from the data ²⁸. To give you a concrete idea of this “largeness,” here are the parameter scales of some well-known models:
    • OpenAI’s GPT-3 model has 175 billion parameters ²⁷.
    • Meta’s Llama 2 family of models ranges in size from 7 billion to 70 billion parameters.
    • Some newer models are rumored to have reached the trillion-parameter level ³³.
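
As a rough illustration of this scale, a common back-of-the-envelope formula estimates a Transformer’s weight count as about 12 × layers × width², since each layer holds roughly 4·d² attention weights plus 8·d² feed-forward weights. Plugging in GPT-3’s published configuration (96 layers, model width 12,288) lands close to its 175 billion parameters:

```python
# Back-of-the-envelope Transformer parameter count:
# ~4*d^2 attention weights (Q, K, V, output) + ~8*d^2 feed-forward weights
# per layer, i.e. roughly 12*d^2 "knobs" per layer.
def approx_params(n_layers: int, d_model: int) -> int:
    return 12 * n_layers * d_model ** 2

# Published GPT-3 configuration: 96 layers, width 12288.
print(f"{approx_params(96, 12288):,}")  # ~174 billion, close to the reported 175B
```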

Element 2: Training Data—The Model’s “Library”

  • Definition: Training data is the vast corpus of text used to “teach” an LLM how to understand and generate language ²⁸. The quality and quantity of the data directly determine the model’s performance.
  • Scale: The amount of training data for LLMs is usually measured in Petabytes (PB) ³⁴. 1 PB is equal to 1 million GB. For reference, it’s estimated that the memory capacity of the human brain is around 2.5 PB ³⁴. Where does this data come from? Much of it is drawn from the publicly accessible internet, including:
    • Common Crawl: A massive web crawl dataset containing over 50 billion web pages ²⁷.
    • Wikipedia: A treasure trove of knowledge with about 57 million pages ²⁷.
    • Massive collections of books, academic articles, news, social media posts, etc. ¹¹.

It can be said that LLMs learn in a “digital library” of a size unprecedented in human history.

However, the significance of “large” goes far beyond just “knowing more.” When the number of parameters and the scale of data cross a certain threshold, a wonderful qualitative change occurs—the emergence of “Emergent Abilities.”

The most basic training task of an LLM is actually very simple: predict the next word ³⁵. Given a piece of text, like “The weather is so nice today, let’s go to the park,” the model’s job is to predict the next most likely word, such as “for a walk.” A small model can also perform this task, but perhaps in a more mechanical way.
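
That “mechanical way” can literally be a lookup table. The toy model below predicts the next word purely from bigram counts in a miniature corpus; an LLM performs the same prediction task, but with billions of learned parameters instead of a frequency table:

```python
from collections import Counter, defaultdict

# A deliberately tiny "language model" over a toy corpus.
corpus = ("the weather is nice today . let us go to the park for a walk . "
          "let us go to the beach for a swim .").split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # how often does `nxt` follow `prev`?

def predict_next(word: str) -> str:
    # The most frequent continuation seen in training.
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # 'weather' (ties broken by first occurrence)
print(predict_next("for"))  # -> 'a'
```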

But when a model needs to continuously optimize this “predict the next word” task on a dataset that covers almost all areas of human knowledge and contains trillions of words, it is forced to rely on more than just simple statistics and memorization. To predict more accurately, it must learn and internalize deeper rules, such as grammatical structures, logical relationships, causal reasoning, and even basic common sense about the physical world and the subtleties of social culture.

Thus, those advanced abilities that we did not explicitly teach it, such as performing zero-shot learning (completing a new task without any examples) ²⁷, writing code, performing mathematical reasoning, writing poetry, etc., “emerge” as “byproducts” of this extreme-scale learning ¹⁶. It’s like a student who, in order to become the ultimate master of imitation, ends up acquiring profound wisdom and diverse skills after mimicking the words and actions of all the great people in the world.

Therefore, the “largeness” of an LLM is not just a quantitative accumulation but a necessary condition for a qualitative leap in ability. It transforms a simple “word predictor” into an engine with preliminary “reasoning abilities.” This also explains why major tech companies are in an arms race to build larger and larger models—they are not just trying to increase the model’s knowledge reserve, but also to unlock more, and more powerful, unknown emergent abilities.

4.2 The LLM Education System—Pre-training and Fine-tuning

From its birth to its application in specific scenarios, a large language model typically goes through a two-stage “education” process: Pre-training and Fine-tuning. We can understand this process with a very fitting analogy: a person’s growth and educational experience ³⁶.

Stage 1: Pre-training—General Education

This stage is like a person’s general education process from kindergarten to an undergraduate degree ³⁶.

  • Textbooks: The model uses the massive dataset mentioned earlier, covering a vast amount of information from the internet. This is an all-encompassing set of “general education textbooks.”
  • Learning Objective: At this stage, the model’s goal is to learn the most universal and fundamental rules of language itself. It learns grammar, factual knowledge, reasoning abilities, the styles of different genres, etc. Just like a student receiving a general education, they will dabble in literature, history, science, art, and other fields.
  • Learning Method: This is mainly “self-supervised learning.” In the task of “predicting the next word,” the text itself provides the “labels”—each word is the “correct answer” for all the words that precede it. The model continuously self-corrects and improves in this process ³⁵. (A toy illustration of this labeling trick follows this list.)
  • Outcome: After a long and costly pre-training process (often requiring millions or even tens of millions of dollars in computing resources) ¹⁶, we get a “Foundation Model” ³⁵. This model is like a knowledgeable university graduate who is a “jack of all trades, master of none.” It has broad common sense and powerful general abilities, but its depth of knowledge in any specific professional field may not be sufficient.
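
This labeling trick is easy to see in code: every prefix of a sentence becomes a training input, and the word that follows becomes its “correct answer”, with no human annotation required:

```python
# Self-supervised labels come for free from the text itself.
tokens = "the cat sat on the mat".split()

training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in training_pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... and so on
```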

Stage 2: Fine-tuning—Specialized Education & Job Training

After the foundation model “graduates from university,” it can enter “graduate school” or a “company” for specialized training based on specific job requirements ³⁶.

  • Textbooks: Fine-tuning uses a much smaller but extremely high-quality and highly relevant “specialized textbook.” For example, if we want an LLM to become a medical assistant, we would use a large number of medical textbooks, clinical guidelines, and medical literature to fine-tune it ¹⁶. If we want it to be a company’s customer service representative, we would train it with the company’s product manuals and historical customer service conversation logs.
  • Learning Objective: The goal is no longer to learn general knowledge but to make the model an “expert” in a specific domain. It needs to learn the professional terminology, unique writing style, and problem-solving logic of that field.
  • Outcome: After fine-tuning, the general-purpose large language model becomes an “expert model.” Its performance on specific tasks will far exceed that of an un-fine-tuned foundation model ¹⁶. For example, a model fine-tuned on legal documents will be much better at drafting contracts than a general-purpose ChatGPT.
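
For a flavor of what fine-tuning looks like in practice, here is a compressed sketch using the open-source Hugging Face transformers and datasets libraries. The model name (gpt2 as a small stand-in), the data file name, and the hyperparameters are placeholders; a real run needs a prepared corpus, a GPU, and careful evaluation.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The small, high-quality "specialized textbook", e.g. medical text.
data = load_dataset("text", data_files="medical_corpus.txt")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="expert-model", num_train_epochs=1),
    train_dataset=data,
    # mlm=False => the standard next-word (causal) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the generalist "graduate" becomes a domain expert
```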

This two-stage education system is key to the widespread application of LLMs. Pre-training builds a powerful foundation of general abilities, meaning we don’t have to train a model from scratch for every task, which greatly saves costs. Fine-tuning, on the other hand, provides a path for customization and specialization, allowing the same foundation model to be adapted to thousands of different application scenarios.

4.3 A Tour of the LLM Zoo—GPT, BERT, Llama, and Friends

“LLM” is a category, not a single product, just like “mammal.” In this vast family, there are various “species” created by different companies or research institutions, each with different design philosophies, strengths, and application scenarios. To understand the current LLM ecosystem, we need to distinguish them along two key dimensions: core objective and access model.

Key Difference 1: Objective—“Creators” vs. “Understanders”

Although all are based on the Transformer architecture, different model families have different design emphases, which is mainly reflected in which part of the Transformer they use.

  • The BERT Family (Understanders/Analysts): The representative model is BERT (Bidirectional Encoder Representations from Transformers), developed by Google.
    • Core Technology: BERT mainly uses the Encoder part of the Transformer. Its key feature is “bidirectionality,” meaning that when processing a word, it can simultaneously consider all the context to the left and right of that word ³⁷. This is like a detective reading a testimony, who reads the entire text and repeatedly deliberates to achieve the deepest and most accurate understanding of each word’s meaning.
    • Strengths: This deep contextual understanding makes BERT very suitable for performing “understanding-based” tasks, such as improving a search engine’s understanding of user query intent, performing sentiment analysis (judging whether a text is positive or negative), and text classification ³⁸. BERT’s goal is not to write a new article, but to analyze and understand an existing one.
  • The GPT Family (Creators/Generators): The representative models are the GPT (Generative Pre-trained Transformer) series developed by OpenAI.
    • Core Technology: GPT mainly uses the Decoder part of the Transformer. Its feature is “autoregressive” or “unidirectional,” meaning it always proceeds from left to right, predicting the next word based on the words already generated ³⁷. This is like a writer who, based on the sentences already written, conceives the most appropriate words to follow.
    • Strengths: This design makes GPT a master of “generative” tasks. It can write fluent, coherent, and creative text, thus excelling in areas like chatbots (e.g., ChatGPT), content creation, article summarization, and code generation ³⁷.

Key Difference 2: Access Model—“Closed” vs. “Open”

In addition to technical routes, business and research models have also led to a divergence in the LLM world.

  • Closed/Proprietary Models: Typified by OpenAI’s GPT series.
    • Model: These models are “black boxes.” General users and developers can pay to call the model’s capabilities through an API (Application Programming Interface), but they cannot obtain the model itself, nor can they view or modify its internal parameters (weights) ³⁸.
    • Analogy: This is like renting a high-performance Ferrari. You can drive it and experience its powerful performance, but you cannot open the hood to study its construction, let alone modify it ⁴⁰. You are just a user of the service.
  • Open-Weight Models: Primarily represented by Meta’s Llama series.
    • Model: These models publicly release their parameters (weights) ³⁸. This means that researchers, developers, and companies can download the entire model for free, run it on their own servers, and perform deep fine-tuning and customization according to their own needs ⁴¹. (A code sketch contrasting the two access models follows this list.)
    • Analogy: This is like a car manufacturer not only selling you a car but also giving you the engine’s design blueprints and core components. You can then build a completely custom race car that meets your needs based on this powerful engine.
    • Note: Although often called “open source,” the “open-weight” model is not entirely equivalent to open-source software in the traditional sense. This is because core assets like the training data and training methods are often still kept secret ⁴².
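
The difference is tangible in code. Below is a sketch of both access models; the model names, the prompt, and the assumption that you hold an API key (and, for Llama, download approval and sufficient hardware) are all illustrative:

```python
# (a) Closed model behind an API: you send text and get text back.
#     The weights never leave the provider's servers.
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain attention in one line."}],
)
print(reply.choices[0].message.content)

# (b) Open-weight model: the weights are downloaded and run locally,
#     so they can be inspected, fine-tuned, and customized.
from transformers import pipeline
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
print(generator("Explain attention in one line.", max_new_tokens=40))
```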

To more clearly summarize these differences, the following table summarizes the characteristics of the main LLM “species”:

| Model Family | Core Objective | Core Architecture | Typical Use Cases | Access Model |
| --- | --- | --- | --- | --- |
| BERT | Understanding & Analysis | Bidirectional Encoder | Search engine optimization, sentiment analysis, text classification | Open Source |
| GPT | Generation & Creation | Unidirectional Decoder | Chatbots, content writing, code generation | Proprietary API |
| Llama | Generation & Creation | Unidirectional Decoder | Academic research, enterprise custom fine-tuning, on-device AI | Open-Weight |

The diversity in this “zoo” reflects a healthy and vibrant ecosystem in the AI field. There is no single “best” model, only the “most suitable” one. The choice depends on your specific goal (do you need an analyst or a creator?), your resources (can you afford expensive API fees, or do you want to deploy on your own hardware?), and your needs for customization and data privacy. Understanding these fundamental differences is the first step to making informed decisions in this new era driven by AI.

Part 5: Coexisting with Giants—Hopes, Risks, and the Road Ahead

We have climbed the LLM tech tree, from its roots deep into its lush canopy. Now, it’s time to come down from the tree and return to the ground to examine the light and shadows cast by these technological giants in the real world. While understanding the potential of LLMs is exciting, it is equally important to be soberly aware of their inherent risks and to look ahead to their future development. This is crucial for us to use this transformative technology responsibly.

5.1 Shadows in the Forest—Hallucination and Bias

In interactions with LLMs, users quickly discover two “ghosts” that are ever-present: Hallucination and Bias. These are not “software bugs” that can be easily fixed, but fundamental problems stemming from the core design and training methods of LLMs.

Risk 1: Hallucination—Confidently Talking Nonsense

  • Definition: LLM hallucination refers to the model generating text that is grammatically correct, confident in tone, and seemingly plausible, but whose content is factually incorrect, fabricated out of thin air, or logically unsound ⁴⁴. This is one of the most confusing and worrying characteristics of LLMs.
  • Cause: To understand the root of hallucination, we must return to the most fundamental training objective of an LLM: to predict the next most statistically probable word, not to state objective truth ⁴⁶. An LLM is essentially an extremely complex pattern matcher and text completer, not a knowledge base with a fact-checking mechanism.
  • Analogy: An LLM is like a student who has memorized everything but lacks true understanding. He has read all the books in the library and can perfectly imitate any writing style. But when asked a question at the edge of his knowledge, he will not admit “I don’t know.” Instead, he will use the language patterns he has mastered to confidently “fabricate” an answer that sounds the most plausible to fill the knowledge gap ⁴⁶.
  • Real-world Cases:
    • In academic research, an LLM might cite a paper that doesn’t exist, even fabricating the author and journal names convincingly ⁴⁶.
    • In legal consulting, a lawyer used ChatGPT for case research, and the chatbot fabricated several non-existent legal precedents, leading to the lawyer being severely sanctioned in court.
    • In code generation, the model might suggest using a function or API that looks very reasonable but does not actually exist, leading developers into a trap ⁴⁶.

Risk 2: Bias—A Flawed Mirror

  • Definition: Because the “textbooks” for LLMs are massive amounts of human language data from the internet, they inevitably learn, replicate, and even amplify the various social biases present in this data, including stereotypes related to gender, race, religion, age, occupation, and more ⁴⁸.
  • Cause: Behind this is an old principle in computer science: “Garbage in, garbage out” ⁵⁰. An LLM is like a mirror, faithfully reflecting the collective mind of our human society, which contains both the brilliance of wisdom and the darkness of prejudice. If the training data is filled with a certain bias, the model must learn that bias in order to better predict text.
  • Real-world Cases:
    • Multiple studies have found that when asked to describe different professions, LLMs exhibit strong gender biases, such as associating “doctor” and “engineer” with men, and “nurse” and “secretary” with women ⁵¹.
    • In recruitment scenarios, a model might be more inclined to recommend executive positions for resumes with white male names and entry-level positions for resumes with names of other ethnicities or women ⁵¹.
    • In content moderation, a model might disproportionately flag African American Vernacular English (AAVE) as “toxic” or “offensive” because it has learned a false association between this language pattern and negative content from its training data ⁴⁹.

Deeply recognizing that hallucination and bias are not accidental “mistakes” but endemic risks of the current technological paradigm is a key step towards a mature view of AI. As long as the core task of LLMs remains statistical pattern matching, and as long as their primary diet remains internet data reflecting the imperfections of human society, these two problems will persist.

This means the solution is not to “fix the model” once and for all, but to build a whole system of risk mitigation. For example, to combat hallucinations, researchers have developed Retrieval-Augmented Generation (RAG) technology. This involves having the model first retrieve relevant information from a trusted, up-to-date knowledge base (like a company’s internal documents) before generating an answer, and then organizing its response based on these facts, thereby greatly reducing the likelihood of fabrication ¹⁶. To combat bias, continuous efforts in data cleaning, model alignment (correcting the model’s values through human feedback), and rigorous bias detection are necessary ⁵¹. We must treat it like a “giant” with superhuman abilities but an immature mind, setting clear boundaries and guardrails for it.
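
A minimal sketch of the RAG pattern is shown below. A real system would use an embedding model and a vector database for retrieval and then send the prompt to an actual LLM; here, retrieval is toy word overlap and we simply print the grounded prompt:

```python
# Trusted, up-to-date knowledge base (e.g. a company's internal documents).
documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
]

def relevance(question: str, doc: str) -> int:
    # Toy relevance score: number of shared words (real systems use embeddings).
    return len(set(question.lower().split()) & set(doc.lower().split()))

def build_prompt(question: str) -> str:
    # 1. Retrieve the most relevant trusted document.
    best = max(documents, key=lambda d: relevance(question, d))
    # 2. Ground the model: it must answer from the retrieved facts,
    #    which sharply reduces the room for fabrication.
    return ("Answer using ONLY the context below. If the context is not "
            "enough, say you don't know.\n"
            f"Context: {best}\nQuestion: {question}")

print(build_prompt("How many days do I have to return a purchase"))
```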

5.2 The Newest Sprouts—Multimodality and AI Agents

Despite the risks, the growth of the LLM tech tree has not stalled. On the contrary, it is extending new branches in two exciting directions, heralding a future far beyond pure text interaction. These two directions are: Multimodality and AI Agents.

Future Direction 1: Multimodality—Perception Beyond Text

  • Definition: Multimodal AI refers to models that can simultaneously process, understand, and generate information from multiple “modalities” or data types. These modalities include not only text but also images, audio, and video ⁵².
  • Analogy: This is like the evolution of human communication. We first had only text (letters), then images (picture books), then sound (radio), and finally video (movies). A traditional LLM is like a pen pal who can only read and write letters. A multimodal LLM, however, is like a modern person who can seamlessly switch between text, images, voice, and video on a smartphone for richer, more three-dimensional communication ⁵².
  • How It Works (Simplified): The core idea is “universal translation.” The system equips each modality (like images or audio) with a dedicated “encoder.” The task of this encoder is to translate the information of that modality into a common mathematical language—the “embedding vectors” we mentioned earlier. Then, the embedding vectors from different modalities are “fused” together and sent to the core of the LLM for unified reasoning and processing ⁵⁵. (See the toy sketch after this list.)
  • Application Scenarios:
    • Image Captioning and Q&A: You can upload a photo of a family party and ask the model, “What is the little girl in the red dress doing in the photo?” The model can understand your question, recognize the content of the photo, and answer, “She is blowing out the candles on her birthday cake.”
    • Text-to-Image/Video Generation: You just need to input a text description, like “A corgi in an astronaut suit walking on the moon, digital art style,” and the model can generate the corresponding image or short video.
    • Medical Diagnosis: A doctor can describe a patient’s symptoms verbally while uploading the patient’s X-ray. A multimodal model can combine the voice information and image analysis to provide a preliminary diagnostic suggestion ⁵².
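
A toy version of this “universal translation” in NumPy: each modality gets its own encoder (here just a random projection matrix standing in for a real neural network) that maps its features into one shared embedding space:

```python
import numpy as np

rng = np.random.default_rng(0)
d_shared = 8                             # the common "mathematical language"

image_features = rng.normal(size=64)     # stand-in for pixels -> vision features
audio_features = rng.normal(size=32)     # stand-in for waveform -> audio features
text_features  = rng.normal(size=16)     # stand-in for tokens -> text features

W_image = rng.normal(size=(d_shared, 64))  # image "encoder"
W_audio = rng.normal(size=(d_shared, 32))  # audio "encoder"
W_text  = rng.normal(size=(d_shared, 16))  # text "encoder"

# Translate each modality into the shared space, then fuse.
embeddings = [W_image @ image_features,
              W_audio @ audio_features,
              W_text  @ text_features]
fused = np.stack(embeddings)  # (3, 8): a sequence the LLM core can reason over
print(fused.shape)
```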

Future Direction 2: AI Agents—Action Beyond Chat

  • Definition: An AI agent is a system that uses an LLM as its core “brain” or reasoning engine, capable of autonomously planning steps, using tools, and executing tasks to achieve a set goal ⁵⁷.
  • Analogy: If a standard LLM is an incredibly smart “genie” trapped in a bottle, able to answer all your questions but unable to leave the bottle, then an AI agent is a genie that has been given “hands and feet.” It can step out of the bottle and do things for you in the digital world (and even the physical world).
  • How It Works (Simplified): The operational flow of an agent is typically:
    1. Goal Setting: You give the agent a high-level goal, for example, “Help me plan a five-day trip to Paris for next week with a budget of $5,000.”
    2. Planning and Decomposition: The LLM brain breaks this complex goal down into a series of executable sub-tasks, such as: (1) search for round-trip flights to Paris for next week; (2) find highly-rated hotels that fit the budget; (3) plan the daily sightseeing routes; (4) book tickets and restaurant reservations.
    3. Tool Use: To complete these tasks, the agent calls on various “tools.” These tools can be external APIs, software, or websites, such as using a search engine to check flights, calling a hotel booking site’s API, or accessing a map application to plan routes ⁵⁷.
    4. Execution and Iteration: The agent executes the tasks, gets the results, and continuously adjusts its subsequent plans based on the outcomes until the goal you set is finally achieved.
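
This loop can be sketched in a few lines. Here the “brain” (llm_decide) and the tools are hard-coded stand-ins; a real agent would call an actual LLM and real external APIs:

```python
def search_flights(query: str) -> str:
    return "Found round-trip flights for $800."            # stand-in tool

def book_hotel(query: str) -> str:
    return "Booked a well-reviewed hotel for $150/night."  # stand-in tool

TOOLS = {"search_flights": search_flights, "book_hotel": book_hotel}

def llm_decide(goal: str, history: list) -> str:
    # Stand-in for the LLM brain: here the "plan" is fixed in advance;
    # a real agent would re-plan based on the goal and the history so far.
    plan = ["search_flights", "book_hotel", "done"]
    return plan[len(history)]

goal = "Plan a five-day trip to Paris under $5,000"
history = []
while (action := llm_decide(goal, history)) != "done":
    result = TOOLS[action](goal)       # use a tool
    history.append((action, result))   # observe the result, then re-plan
    print(action, "->", result)
```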

If the tech tree we’ve described so far has been primarily about mastering language, then the future direction is about connecting this powerful language intelligence with two things: the ability to perceive the multi-sensory world and the ability to take action in the real world.

Multimodal technology equips LLMs with “eyes” and “ears,” allowing them to “see” and “hear,” and thus understand the physical world composed of images and sounds. AI agent technology equips LLMs with “hands” and “feet,” allowing them to turn thoughts into actions, operate software, call services, and complete tasks.

The convergence of these two trends suggests that LLMs are undergoing a profound evolution: from a Large Language Model to a Large Action Model, and eventually to a more general Large Intelligence Model. This is not just a new branch growing on the tech tree; it may well represent the beginning of a new era where intelligence is deeply integrated with the world.

Conclusion: Navigating in a New World

We started from the vast forest of Artificial Intelligence, followed the paths of Machine Learning and Deep Learning, and delved deep into the Transformer architecture and its core self-attention mechanism that underpin modern AI. We witnessed how parameters and data jointly gave rise to the technological marvel of “Large” Language Models and learned about their human-like “education” process. We also strolled through the LLM “zoo,” getting to know model families with different design philosophies, and confronted the profound challenges they bring, such as hallucination and bias. Finally, we looked to the horizon and saw the future paths leading to multimodal perception and intelligent action.

The core purpose of this journey has been to strip away the mysterious aura surrounding AI. Hopefully, through this systematic, from-the-ground-up review, you can see that LLMs are not incomprehensible “magic black boxes,” but a tech tree rooted in mathematics, logic, and data, with a clear growth trajectory.

By mastering the structure of this tree, from its foundations to its newest sprouts, you now possess a navigational map. When new technical terms or products appear in the future—be it a more powerful model, a more clever risk control method, or a more amazing application—you will be able to place it in its corresponding position on this tree, understand its context, and judge its significance.

We are in an era being profoundly reshaped by AI. Understanding this technology is no longer just a task for technical experts, but a required course for every modern person who hopes to navigate the new world. With this newly acquired conceptual framework, you are now ready to participate in this great transformation that concerns us all with more confidence, prudence, and insight.

Works cited

  1. Understanding The Difference Between AI, ML, And DL: Using An …, accessed July 2, 2025, https://www.advancinganalytics.co.uk/blog/2021/12/15/understanding-the-difference-between-ai-ml-and-dl-using-an-incredibly-simple-example

  2. AI, Machine Learning, and Deep Learning: Key Differences Explained - Skiplevel, accessed July 2, 2025, https://www.skiplevel.co/blog/ai-machine-deep-learning

  3. The Difference Between AI, ML and DL - CENGN, accessed July 2, 2025, https://www.cengn.ca/information-centre/innovation/difference-between-ai-ml-and-dl/

  4. What’s the relationship of AI, ML, DL and Generative AI? | by Jerel Velarde - Medium, accessed July 2, 2025, https://medium.com/@jereljohnvelarde/whats-the-relationship-of-ai-ml-dl-and-generative-ai-1f4c8295432a

  5. Relationship between AI, Machine Learning, Deep Learning & Data Science? - Corpnce, accessed July 2, 2025, https://www.corpnce.com/relationship-ai-ml-dl-ds/

  6. What Is Deep Learning and How Does It Work? - Built In, accessed July 2, 2025, https://builtin.com/machine-learning/deep-learning

  7. What is Natural Language Processing? - Introduction to NLP - AWS, accessed July 2, 2025, https://aws.amazon.com/cn/what-is/nlp/

  8. New Developments and New Challenges in Language Intelligence - Guangming Online Technology Channel, accessed July 2, 2025, https://tech.gmw.cn/2023-02/20/content_36377739.htm

  9. The First Step in Natural Language Processing: How Algorithms Understand Text - NVIDIA Developer, accessed July 2, 2025, https://developer.nvidia.com/zh-cn/blog/natural-language-processing-first-steps-how-algorithms-understand-text/

  10. Language Cognition and Language Computing - Language Understanding by Humans and Machines - National Laboratory of Pattern Recognition, accessed July 2, 2025, https://nlpr.ia.ac.cn/cip/ZongPublications/2022/2022%E7%8E%8B%E5%B0%91%E6%A5%A0-%E4%B8%AD%E5%9B%BD%E7%A7%91%E5%AD%A6.pdf

  11. How Do Large Language Models Work? Conceptual But Non Technical Explanation, accessed July 2, 2025, https://medium.com/@Gbgrow/how-do-large-language-models-work-conceptual-but-non-technical-explanation-ea369334d32e

  12. What is an RNN? - Introduction to Recurrent Neural Networks - AWS, accessed July 2, 2025, https://aws.amazon.com/cn/what-is/recurrent-neural-network/

  13. What is an RNN? - Introduction to Recurrent Neural Networks - AWS, accessed July 2, 2025, https://aws.amazon.com/tw/what-is/recurrent-neural-network/

  14. What is a Recurrent Neural Network (RNN)? - IBM, accessed July 2, 2025, https://www.ibm.com/cn-zh/think/topics/recurrent-neural-networks

  15. The Transformer Attention Mechanism - MachineLearningMastery.com, accessed July 2, 2025, https://machinelearningmastery.com/the-transformer-attention-mechanism/

  16. What Can Large Language Models (LLMs) Be Used For? | deepset Blog, accessed July 2, 2025, https://www.deepset.ai/blog/large-language-models-enterprise-use

  17. Transformer Architecture - Wikipedia, the Free Encyclopedia, accessed July 2, 2025, https://zh.wikipedia.org/zh-tw/Transformer%E6%9E%B6%E6%9E%84

  18. Jensen Huang Convenes All Seven Authors of the Transformer Paper for an Hour-Long, Insight-Packed Conversation - Wallstreetcn, accessed July 2, 2025, https://wallstreetcn.com/articles/3710964

  19. A Beginner’s Guide to Self-Attention in Transformers | by Nacho Zobian | Medium, accessed July 2, 2025, https://medium.com/@nachozobian/a-beginners-guide-to-self-attention-in-transformers-baf71a971efd

  20. Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch - Sebastian Raschka, accessed July 2, 2025, https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html

  21. Understanding Transformer Attention Mechanisms : Attention Is All You Need | by Tahir | Medium, accessed July 2, 2025, https://medium.com/@tahirbalarabe2/understanding-transformer-attention-mechanisms-attention-is-all-you-need-2a5dd89196ab

  22. LLM Transformer Model Visually Explained - Polo Club of Data Science, accessed July 2, 2025, https://poloclub.github.io/transformer-explainer/

  23. [D] How to truly understand attention mechanism in transformers? : r/MachineLearning - Reddit, accessed July 2, 2025, https://www.reddit.com/r/MachineLearning/comments/qidpqx/d_how_to_truly_understand_attention_mechanism_in/

  24. Understanding The Attention Mechanism In Transformers: A 5-minute visual guide. - Reddit, accessed July 2, 2025, https://www.reddit.com/r/compsci/comments/1cjc318/understanding_the_attention_mechanism_in/

  25. [D] How does ‘self-attention’ work in transformer models? : r/MachineLearning - Reddit, accessed July 2, 2025, https://www.reddit.com/r/MachineLearning/comments/16q8pwa/d_how_does_selfattention_work_in_transformer/

  26. [draft] Note 10: Self-Attention & Transformers 1, accessed July 2, 2025, https://web.stanford.edu/class/cs224n/readings/cs224n-self-attention-transformers-2023_draft.pdf

  27. What is LLM? - Large Language Models Explained - AWS, accessed July 2, 2025, https://aws.amazon.com/what-is/large-language-model/

  28. Understanding LLMs: Model size, training data, and tokenization - Outshift - Cisco, accessed July 2, 2025, https://outshift.cisco.com/blog/understanding-llms-model-size-training-data-tokenization

  29. What are LLM Parameters? Explained Simply - Deepchecks, accessed July 2, 2025, https://www.deepchecks.com/glossary/llm-parameters/

  30. LLM Parameters Explained - The Cloud Girl, accessed July 2, 2025, https://www.thecloudgirl.dev/blog/llm-parameters-explained

  31. What exactly are parameters? : r/learnmachinelearning - Reddit, accessed July 2, 2025, https://www.reddit.com/r/learnmachinelearning/comments/1dz7w1y/what_exactly_are_parameters/

  32. A Brief Guide To LLM Numbers: Parameter Count vs. Training Size | by Greg Broadhead, accessed July 2, 2025, https://gregbroadhead.medium.com/a-brief-guide-to-llm-numbers-parameter-count-vs-training-size-894a81c9258

  33. LLMs vs. SLMs: The Differences in Large & Small Language Models | Splunk, accessed July 2, 2025, https://www.splunk.com/en_us/blog/learn/language-models-slm-vs-llm.html

  34. An explanation of large language models - TechTarget, accessed July 2, 2025, https://www.techtarget.com/whatis/video/An-explanation-of-large-language-models

  35. Large language models: their history, capabilities and limitations - Snorkel AI, accessed July 2, 2025, https://snorkel.ai/large-language-models/

  36. Pre-training, Fine-tuning, and Transfer learning. To make these ideas more relatable, let’s use a real-world analogy - DEV Community, accessed July 2, 2025, https://dev.to/sreeni5018/pre-training-fine-tuning-and-transfer-learning-to-make-these-ideas-more-relatable-lets-use-a-real-world-analogy-3d0o

  37. Bert vs gpt vs llama: understanding the best AI model for your needs - BytePlus, accessed July 2, 2025, https://www.byteplus.com/en/topic/560409

  38. 7 Popular LLMs Explained in 7 Minutes: GPT, BERT, LLaMA & More | by Rohan Mistry | Jun, 2025 | Medium, accessed July 2, 2025, https://medium.com/@rohanmistry231/7-popular-llms-explained-in-7-minutes-gpt-bert-llama-more-239807219f6f

  39. BERT vs. GPT: What’s the Difference? - Coursera, accessed July 2, 2025, https://www.coursera.org/articles/bert-vs-gpt

  40. Your AI terminology cheat sheet: GPT, ChatGPT, LLaMa, Alpaca, Bard, LLMs - Karbon, accessed July 2, 2025, https://karbonhq.com/resources/generative-ai-terminology-cheat-sheet/

  41. Llama vs GPT: Comparing Open-Source Versus Closed-Source AI Development - Netguru, accessed July 2, 2025, https://www.netguru.com/blog/gpt-4-vs-llama-2

  42. No, Llama 2 is NOT an open source LLM : r/LocalLLaMA - Reddit, accessed July 2, 2025, https://www.reddit.com/r/LocalLLaMA/comments/153i6vi/no_llama_2_is_not_an_open_source_llm/

  43. Open Source LLMs: Llama and Its Competitors | Michigan Online, accessed July 2, 2025, https://online.umich.edu/collections/artificial-intelligence/short/open-source-llms-llama-and-its-competitors/

  44. www.lakera.ai, accessed July 2, 2025, https://www.lakera.ai/blog/guide-to-hallucinations-in-large-language-models#:~:text=Hallucinations%20in%20LLMs%20refer%20to,trust%20placed%20in%20these%20models.

  45. What are LLM Hallucinations? - Iguazio, accessed July 2, 2025, https://www.iguazio.com/glossary/llm-hallucination/

  46. LLM Hallucinations Explained. LLMs like the GPT family, Claude… | by Nirdiamant - Medium, accessed July 2, 2025, https://medium.com/@nirdiamant21/llm-hallucinations-explained-8c76cdd82532

  47. When LLMs day dream: Hallucinations and how to prevent them - Red Hat, accessed July 2, 2025, https://www.redhat.com/en/blog/when-llms-day-dream-hallucinations-how-prevent-them

  48. Bias and Fairness in Large Language Models: A Survey - MIT Press Direct, accessed July 2, 2025, https://direct.mit.edu/coli/article/50/3/1097/121961/Bias-and-Fairness-in-Large-Language-Models-A

  49. Bias in Large Language Models: Origin, Evaluation, and Mitigation - arXiv, accessed July 2, 2025, https://arxiv.org/html/2411.10915v1

  50. Data bias in LLM and generative AI applications - Mostly AI, accessed July 2, 2025, https://mostly.ai/blog/data-bias-types

  51. Explicitly unbiased large language models still form biased associations - PNAS, accessed July 2, 2025, https://www.pnas.org/doi/10.1073/pnas.2416228122

  52. Exploring Multimodal LLMs? Applications, Challenges, and How They Work - Shaip, accessed July 2, 2025, https://www.shaip.com/blog/multimodal-large-language-models-mllms/

  53. A Comprehensive Guide to Multimodal LLMs and How they Work - Ionio, accessed July 2, 2025, https://www.ionio.ai/blog/a-comprehensive-guide-to-multimodal-llms-and-how-they-work

  54. What is multimodal AI: Complete overview 2025 | SuperAnnotate, accessed July 2, 2025, https://www.superannotate.com/blog/multimodal-ai

  55. How Multimodal LLMs Work - The Vision Story - Analytics Vidhya, accessed July 2, 2025, https://www.analyticsvidhya.com/blog/2025/06/multimodal-llm/

  56. What is Multimodal AI? - DataCamp, accessed July 2, 2025, https://www.datacamp.com/blog/what-is-multimodal-ai

  57. What are AI agents? Definition, examples, and types | Google Cloud, accessed July 2, 2025, https://cloud.google.com/discover/what-are-ai-agents#:~:text=Model%3A%20Large%20language%20models%20(LLMs,components%20facilitate%20reason%20and%20action.