Pre-training
The expensive part of building a model. $100M+ for frontier-scale. Why 'we trained our own AI' is usually a lie, a bad idea, or both.
The Technical Definition
Pre-training is the initial, massive learning phase that produces a foundation model. Engineers feed the model trillions of tokens of text — books, web pages, code, scientific papers — and the model learns the statistical patterns of language by predicting the next token, billions of times, over months of compute. The output is a model that has absorbed general knowledge of how language works.
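To make "predicting the next token" concrete, here is a toy sketch. It is not how a foundation model is built: it swaps the neural network for a simple count of which word follows which, and the corpus and function names are invented for illustration. The idea is the same at a vastly smaller scale: learn from text what tends to come next.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical twelve-token "corpus"; real pre-training uses trillions of tokens.
corpus = "the pump failed . the pump restarted . the pump failed again".split()
model = train_bigram(corpus)
print(predict_next(model, "pump"))  # "failed" follows "pump" twice, "restarted" once
```

A real model replaces the lookup table with billions of learned parameters, which is where the months of compute go.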
Pre-training is distinct from fine-tuning. Pre-training builds the base model from scratch. Fine-tuning takes an existing pre-trained model and adapts it to a narrower task using a much smaller dataset.
What This Actually Means for Your Business
When a vendor tells you they “trained their own AI model,” one of three things is true. First, they actually pre-trained a foundation model from scratch. This is extremely rare and expensive: the companies that do it include OpenAI, Anthropic, Google, Meta, Mistral, and a handful of others, almost all backed by billions in capital. Second, they fine-tuned an existing open-weight model like Llama or Mistral on their own data, which is common and useful but not the same thing. Third, they’re using prompt engineering on a commercial API and calling it training, which is lazy at best and deceptive at worst.
The distinction matters because the economics are not close. Pre-training a frontier-scale model — the kind that competes with GPT-4 or Claude — costs $100 million to $1 billion in compute alone, plus a research team that’s nearly impossible to hire and a data pipeline that takes years to build. The companies doing this are doing it because models are their entire product.
For a $300M industrials company, pre-training your own model is roughly equivalent to building your own electrical grid because you don’t trust the utility company. Possible. Spectacularly stupid in almost every case.
Fine-tuning, on the other hand, is reasonable. Taking Llama-3 or a similar open base model and adapting it on your domain documents can produce a useful, owned, self-hostable system for hundreds of thousands of dollars rather than hundreds of millions. The result isn’t as capable as a frontier model, but for narrow tasks (legal contract review, specific code generation, customer support in your domain) it can be more accurate, and cheaper at scale.
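"Cheaper at scale" is a break-even question, and the arithmetic is simple enough to sketch. Every number below is a hypothetical placeholder, not a quote from any vendor; the function name and parameters are invented for illustration. Plug in your own one-time fine-tuning cost and your own per-request costs.

```python
import math

def break_even_requests(fixed_cost_usd, api_usd_per_1k, self_hosted_usd_per_1k):
    """Requests needed before a fine-tuned, self-hosted model beats the API on total cost.

    Costs are per 1,000 requests. Returns None when self-hosting is not
    cheaper per request, because then it never catches up.
    """
    saving_per_1k = api_usd_per_1k - self_hosted_usd_per_1k
    if saving_per_1k <= 0:
        return None
    return math.ceil(fixed_cost_usd / saving_per_1k * 1000)

# Hypothetical inputs: $300,000 to fine-tune and deploy,
# $20 per 1,000 API requests versus $5 per 1,000 self-hosted.
requests_needed = break_even_requests(300_000, 20, 5)
print(f"{requests_needed:,} requests to break even")
```

If your realistic volume is nowhere near the break-even figure, the cost argument for owning a model does not hold, whatever the other arguments are.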
The third pattern — prompt engineering on a commercial API — isn’t training at all. It’s writing instructions. Vendors who call this “training their AI” are abusing the term to inflate their moat.
Reality Check
What the vendor says: “We trained a custom AI model on your industry’s data.”
What that means in practice: Almost certainly fine-tuning, not pre-training. Ask how many GPU-hours of compute were involved, what base model they started from, and how big the training dataset was. If the answers are “we use OpenAI’s API” or vague hand-waving, they didn’t train anything — they wrote a system prompt.
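The GPU-hours answer is useful because it converts directly into dollars. A rough sketch of that conversion follows, assuming a hypothetical rental rate per GPU-hour (actual rates vary by hardware and provider) and borrowing the $10M rule of thumb from the questions at the end of this piece. The function names and labels are illustrative only.

```python
def training_compute_cost_usd(gpu_hours, usd_per_gpu_hour):
    """Convert a vendor's stated GPU-hours into an approximate compute bill."""
    return gpu_hours * usd_per_gpu_hour

def size_up_claim(gpu_hours, usd_per_gpu_hour, frontier_floor_usd=10_000_000):
    """Label a 'we trained a model' claim by the compute behind it.

    frontier_floor_usd is a rule-of-thumb cutoff, not a precise boundary.
    """
    if gpu_hours <= 0:
        return "prompting"  # no training compute at all
    cost = training_compute_cost_usd(gpu_hours, usd_per_gpu_hour)
    if cost < frontier_floor_usd:
        return "fine-tuning scale"
    return "possible pre-training"

# Hypothetical vendor answer: 50,000 GPU-hours at an assumed $2 per GPU-hour.
print(training_compute_cost_usd(50_000, 2.0))  # 100000.0
print(size_up_claim(50_000, 2.0))              # fine-tuning scale
```

A bill above the cutoff does not prove pre-training; it only means the claim is worth a second round of questions.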
What Operators Actually Do
The companies getting real value treat pre-training as someone else’s problem. Frontier labs such as OpenAI, Anthropic, Google, and Meta burn billions doing the heavy lifting. Operators stand on top of that work. The right question isn’t “should we train our own model?” but “which foundation model do we build on, and what do we do with it?”
A small minority of operators do invest in fine-tuning. The pattern that works: a company has a well-defined, repeatable, narrow task (extracting data from a specific document type, classifying support tickets in their domain, generating copy in their voice), enough labeled examples to actually teach the model something (typically 1,000 to 100,000 examples), and a real reason that prompt engineering won’t get them there. Most companies don’t actually meet those criteria — they just want to.
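Those three criteria can be written down as a checklist. The sketch below is one way to do that; the class, field names, and the use of 1,000 examples as a floor are assumptions drawn from the "typically 1,000 to 100,000" range above, not a formal standard.

```python
from dataclasses import dataclass

@dataclass
class FineTuneCase:
    task_is_narrow_and_repeatable: bool
    labeled_examples: int
    prompting_already_fell_short: bool

def unmet_criteria(case, min_examples=1_000):
    """Return the criteria a proposed fine-tuning project fails; empty means go."""
    gaps = []
    if not case.task_is_narrow_and_repeatable:
        gaps.append("task is not narrow and repeatable")
    if case.labeled_examples < min_examples:
        gaps.append(f"fewer than {min_examples:,} labeled examples")
    if not case.prompting_already_fell_short:
        gaps.append("prompt engineering has not been shown to fall short")
    return gaps

# Hypothetical project: a narrow task, but only 200 labeled examples so far.
print(unmet_criteria(FineTuneCase(True, 200, True)))
```

An empty list is the only result that justifies the spend. Anything else is a reason to keep working with prompts and an API.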
The pragmatic stack for almost everyone: use a frontier API for hard reasoning, fine-tune a smaller open model for high-volume narrow tasks, and stop pretending you need to pre-train from scratch.
The Questions to Ask
- Did you pre-train this model, or fine-tune it from a base model? If they say pre-train, ask what the compute budget was. If under $10M, they didn’t pre-train anything frontier-scale. They fine-tuned and called it training.
- What base model is this built on, and what happens when it gets superseded? If they fine-tuned Llama-2 in 2024, you may already be on a stale base. Find out what the upgrade path looks like, and who pays for it.
- What’s our actual case for owning a model versus using an API? Cost at scale, data sensitivity, latency, and offline operation are the four legitimate reasons. “We want our own AI” is not one of them.