Large Language Models (LLMs)

The foundation under every AI pitch. The differences between GPT-4, Claude, Llama, and your vendor's 'proprietary model' matter more than you think.

Models & Architecture

The Technical Definition

An LLM (Large Language Model) is a machine learning model trained on vast amounts of text data, capable of understanding and generating human language. It processes text by breaking it into tokens and predicting the next token based on patterns learned during training. Modern LLMs use the transformer architecture and contain billions of parameters (the numeric weights adjusted during training), which is why they’re called “large.” GPT-4, Claude, Llama, and Gemini are all LLMs, but they’re not interchangeable.
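To make “predicting the next token” concrete, here is a deliberately tiny sketch. It is not how a real LLM works internally: production models use subword tokenizers and transformer networks, while this toy swaps in whitespace splitting and simple word-pair counts. The corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Toy tokenizer: lowercase and split on whitespace.
    # Real LLMs use subword tokenizers instead.
    return text.lower().split()

def train_bigram(corpus):
    # For every token, count which tokens follow it in the training text.
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = tokenize(sentence)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    # Return the most frequently seen follower, or None if the token is unseen.
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training text.
corpus = [
    "the invoice is overdue",
    "the invoice is paid",
    "the invoice is overdue again",
]
model = train_bigram(corpus)
print(predict_next(model, "is"))
```

The toy model picks “overdue” after “is” only because that pairing appeared most often in its training text. An LLM does the same kind of pattern-based prediction at vastly larger scale, which is also why it can produce fluent text that is wrong.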

What This Actually Means for Your Business

Every AI product you’re considering is built on top of an LLM. The model choice matters more than vendors want you to think. They say “it doesn’t matter which LLM you use—just use our product.” That’s half-true. The difference between GPT-4 and an older open-source model can be the difference between an assistant that’s useful and one that wastes your team’s time.

Here’s what varies: reasoning capability, code generation quality, instruction-following reliability, hallucination rate, context window size, and cost per token. A model with better reasoning saves your engineers time because it makes fewer careless mistakes. A larger context window lets you feed the model more of your documents before you have to truncate. A lower cost per token keeps your customer service chatbot from becoming prohibitively expensive at scale.

The real operational difference: if you use GPT-4, you’re sending data to OpenAI’s servers. If you use Claude, you’re sending it to Anthropic. If you use Llama or Mistral, you can host it yourself on your own infrastructure. That’s not a small difference. It’s the difference between SaaS and self-hosted, between vendor dependency and operational control, between regulatory compliance being a conversation with the vendor and being entirely your responsibility.

Vendors also love to pitch their “proprietary model.” Usually, this means they’ve fine-tuned an existing open-source LLM or they’re reselling a commercial LLM with custom prompting on top. Nothing wrong with that—but it’s not a moat. You’re evaluating their application, not their model. Ask what foundation model they’re actually using.

The cost dimension gets underestimated. Small differences in token cost multiply. If you’re processing thousands of documents daily, the gap between a cheap model and an expensive one compounds into a real line item. And the list price can mislead: some models cost less per token but need longer prompts (more instructions, more examples) to reach the same output quality, which can make them more expensive overall.
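A quick back-of-the-envelope calculation shows how a lower per-token price can still lose. Every number below is hypothetical and chosen only to illustrate the arithmetic; substitute your vendor’s actual prices and your own measured token counts.

```python
def daily_cost(docs_per_day, input_tokens, output_tokens,
               input_price_per_m, output_price_per_m):
    # Prices are in dollars per one million tokens.
    tokens_times_price = (input_tokens * input_price_per_m
                          + output_tokens * output_price_per_m)
    return docs_per_day * tokens_times_price / 1_000_000

DOCS_PER_DAY = 5_000  # hypothetical workload

# Hypothetical "expensive" model: higher prices, short prompt is enough.
expensive = daily_cost(DOCS_PER_DAY, input_tokens=1_000, output_tokens=500,
                       input_price_per_m=10, output_price_per_m=30)

# Hypothetical "cheap" model: lower prices, but it needs a much longer
# prompt (extra instructions and examples) to reach the same quality.
cheap = daily_cost(DOCS_PER_DAY, input_tokens=6_000, output_tokens=500,
                   input_price_per_m=4, output_price_per_m=12)

print(f"expensive model: ${expensive:.2f}/day")
print(f"cheap model:     ${cheap:.2f}/day")
```

With these made-up figures the “cheap” model costs $150 per day against $125 for the “expensive” one, because the longer prompt outweighs the lower price. The point is to compare cost per completed task, not cost per token.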

Reality Check

What the vendor says: “We use the most advanced AI model available.”

What that means in practice: They probably use one of three foundation models (GPT-4, Claude, or Llama), combined with prompt engineering. The real differentiation is in what they do with it, not the model itself.

What Operators Actually Do

Smart teams evaluate LLMs empirically on their specific task. They pick three models, run them on 50 representative examples, and score the outputs against answers they already know to be correct. They measure not just accuracy but cost, latency, and how often the model hallucinates on hard or unusual inputs.
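The bake-off described above can be sketched in a few lines. This is a minimal illustration, not a full evaluation framework: the two “models” are stand-in functions rather than real API calls, the four labeled examples are invented, and exact-match scoring is the simplest possible metric. In practice you would plug in real model clients and closer to 50 examples from your own work.

```python
import time
from dataclasses import dataclass

@dataclass
class Result:
    name: str
    accuracy: float
    avg_latency_s: float

def evaluate(name, model_fn, examples):
    # Run every example through the model, timing each call and
    # scoring the answer by case-insensitive exact match.
    correct = 0
    total_latency = 0.0
    for prompt, expected in examples:
        start = time.perf_counter()
        answer = model_fn(prompt)
        total_latency += time.perf_counter() - start
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    n = len(examples)
    return Result(name, correct / n, total_latency / n)

# Hypothetical labeled examples: (support ticket, expected category).
examples = [
    ("Customer wants their money back", "refund"),
    ("Package never arrived", "shipping"),
    ("Card was charged twice", "billing"),
    ("Tracking number does not work", "shipping"),
]

# Stand-ins for real models; replace with calls to the models you are testing.
_answers = dict(examples)

def stub_model_a(prompt):
    return _answers[prompt]   # pretends to always be right

def stub_model_b(prompt):
    return "billing"          # pretends to always give the same answer

results = [
    evaluate("model-a", stub_model_a, examples),
    evaluate("model-b", stub_model_b, examples),
]
for r in results:
    print(f"{r.name}: accuracy={r.accuracy:.0%}, avg latency={r.avg_latency_s:.6f}s")
```

Cost per example could be tracked the same way by recording token counts for each call alongside the latency.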

They also consider operational factors: Am I comfortable with vendor dependency here? Is my data sensitive enough that I need to self-host? Is my use case so intolerant of hallucinations that I need the most capable model even if it costs more? These are business questions, not technical ones.

Companies also diversify strategically. They might use GPT-4 for high-stakes customer-facing work where reliability is critical, and a cheaper model for internal research tasks where perfection isn’t required. A model choice isn’t permanent, either. You can migrate if a better or cheaper option emerges.
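One way to keep that flexibility is to route by task type through a small lookup table, so that switching models is a configuration change rather than a rewrite. The sketch below is an assumption about how such routing might look, not a description of any particular product; the model functions, task names, and route table are all hypothetical.

```python
# Stand-ins for real model clients; each takes a prompt and returns text.
def premium_model(prompt: str) -> str:
    return f"[premium] {prompt}"

def budget_model(prompt: str) -> str:
    return f"[budget] {prompt}"

# Hypothetical registry and routing table.
MODELS = {"premium": premium_model, "budget": budget_model}
ROUTES = {"customer_facing": "premium", "internal_research": "budget"}

def complete(task_type, prompt, routes=ROUTES, models=MODELS):
    # Unknown task types fall back to the premium model as the safer default.
    model_name = routes.get(task_type, "premium")
    return models[model_name](prompt)

print(complete("customer_facing", "Draft a reply to this complaint"))
print(complete("internal_research", "Summarize these meeting notes"))

# Migrating a task to a different model only touches the routing table.
new_routes = {**ROUTES, "internal_research": "premium"}
print(complete("internal_research", "Summarize these meeting notes", routes=new_routes))
```

Application code that calls `complete` never names a specific model, which is what makes a later migration cheap.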

The pattern that works: start with the best model you can afford for your most critical use case. Measure what you actually get. Then be willing to experiment with alternatives if cost or performance warrants it.

The Questions to Ask

  1. What foundation model are you actually using, and why that one? Are you using it for reasoning capability, cost, latency, or something else? What tradeoffs did you make?

  2. How does this model perform on our specific task? Can you show me examples from work similar to ours? What’s the failure rate on the edge cases we care about?

  3. What changes if you switch to a different model? If you’re using GPT-4 today and Llama 3 overtakes it, how easily can you migrate? What’s locked in?
