The State of LLMs, Part I
Building blocks
This is Startup Pirate #135, a newsletter about startups, technology, and entrepreneurship. Made in Greece.
For years, the simplest way to describe AI progress was bigger models, more data, more compute. That formula helped usher in the LLM era, but it no longer tells the whole story.
This is the first part of a two-part conversation on where AI and large language models are today and where they may be heading. I sat down with two of my favourite people in the AI community, Christos Baziotis of Cohere and Christos Perivolaropoulos of Google DeepMind.
Part I starts with the building blocks; Part II turns to agents, harnesses, benchmarks, memory, and what today’s AI systems still cannot do.
In Part I, we cover four topics:
How Christos Baziotis and Christos Perivolaropoulos found their way into two of the world’s leading AI companies.
What transformers actually do, and why they became the architecture behind most modern AI systems.
Why bigger models are no longer the whole story, even though scaling still matters.
What post-training actually does, and whether it teaches models new abilities or helps them use capabilities they already had more reliably.
Let’s get to it.
Their paths into AI
So how’d you get into AI? I’d love to hear about the moments along the way that brought you to the problems and technologies you work on today.
C. Perivolaropoulos: When I was in school, I was active on the forum of PC Magazine, a large computer magazine at the time, where a small group of members pushed each other to code and enter programming competitions. That was probably the beginning.
I studied Electrical and Computer Engineering at the University of Patras, and while at university, I worked at the Greek chip design company ThinkSilicon, long before its acquisition by the US semiconductor leader Applied Materials. There, I spent about three years working on compilers and assemblers, the low-level tools that convert human-readable code into the binary machine code, 0s and 1s, that a computer can execute. This shaped a lot of how I think about performance and hardware.
While still at university, I began collaborating with MIT as a Visiting Researcher and worked at another Patras-based startup, Codebender, which built developer tools for programming Arduino-style hardware. Eventually, I decided to pursue a PhD at the University of Edinburgh in 2016, focusing on query languages and databases, and interned at Google and Amazon.
After the PhD, I spent some time at Huawei and later joined DeepMind, where I am today. My work is split between making AI models run more efficiently on GPUs and studying how models, mainly LLMs, but also graph models and other systems, can or cannot generalise beyond their training data.
Quick explainer: LLMs
A large language model, or LLM, is an AI system trained on vast amounts of text and other data to predict the next token, a word or piece of a word, based on the context it has seen. Through that seemingly simple task, repeated billions of times, it learns enough about language and the world to hold conversations, answer questions, write, summarise, and reason. ChatGPT, Claude, Gemini, and Cohere’s models are examples of LLM-powered systems.
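To make next-token prediction concrete, here is a deliberately tiny sketch. It is not how an LLM works internally: it only counts which word follows which in one made-up sentence, whereas a real model learns from vast amounts of data with a neural network. The function names and the corpus are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up training text; words stand in for tokens.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
```

Even this toy shows the basic loop: look at the context, pick a likely continuation. An LLM does the same thing with far richer context and learned representations instead of raw counts.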
C. Baziotis: I studied Computer Science at the University of Piraeus. That was where I first saw traditional AI, such as rule-based systems and evolutionary algorithms. I liked it, but I did not imagine I would actually work on AI.
After graduating, I worked as a software engineer in Greece while doing a master’s degree in the same department. During that master’s program, I discovered modern machine learning and natural language processing, a field focused on helping computers understand and generate language. For my thesis, I participated in SemEval, an international competition for language-processing systems, and our work topped the leaderboard. That was the first time I felt I could maybe do research seriously.
I started a PhD at NTUA in 2016. The lab was great, but the field was moving extremely fast, so I decided to leave a year and a half later and restart my PhD at the University of Edinburgh, focusing on pre-training, the stage where a model first learns broad patterns from large amounts of data, and synthetic data for low-resource machine translation. I also interned at Meta and Amazon.
Then, in 2023, I joined the founding team of Samaya, an AI platform for finance, where I worked on the search and agentic stack. The team was strong, and it was a lot of fun. But after two years, I started to miss working on frontier research and LLMs, so I joined Cohere. I now work mainly on memory for agents and on helping models learn from experience after deployment.
The architecture behind LLMs
Quick explainer: Transformer
A transformer is the architecture behind most modern LLMs. It takes a window of information, such as a prompt, document, or conversation, and breaks it into small pieces called tokens. Rather than reading those tokens strictly one after another, it processes them together, allowing the model to build relationships before generating the next token or output.
If you had to explain transformers to an outsider, what is a transformer actually doing? As we ask these models to do more complex things, do you think the basic transformer idea is starting to show its limits?
C. Perivolaropoulos: A transformer is an architecture that helps an AI model connect the relevant pieces of information together.
Before a transformer can do that, the model has to turn information into a form it can work with. Words, ideas, documents, or pieces of information may look separate to us, but neural networks turn them into mathematical representations called embeddings. Embeddings are vectors, lists of numbers that act a bit like coordinates in a model’s learned map of meaning. Related concepts tend to sit closer together in that space, while dissimilar concepts tend to sit farther apart.
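A minimal sketch of the "map of meaning" idea, assuming hand-picked three-number vectors rather than embeddings learned by a real model. Cosine similarity is one common way to measure how close two vectors point; the words and values below are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; real ones have hundreds or thousands of dimensions.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(sim_cat_dog > sim_cat_car)  # related concepts score higher in this toy space
```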
The transformer’s key move is association. It looks at the information you give it and determines which pieces are relevant to one another. If I give it a set of files or some information, the transformer can connect the relevant parts of that input to one another and to patterns it learned during training.
Through many layers of this process, the model builds increasingly rich representations. In an LLM, those representations are used to predict the next token. In another kind of AI system, they might be used to generate an image, for example.
So, at a high level, a transformer is not just memorising facts. It is building relationships between pieces of information and using those relationships to generate something useful. At that level, I do not think the basic idea is breaking down. Transformer-like architectures are still evolving and producing gains.
The limits show up when we ask models to generalise beyond what they have learned. One example is length generalisation. If a model learns a small version of a task, can it scale that task up? If it learns to multiply three-, four-, or five-digit numbers, does it then know how to multiply fifty-digit numbers? With current architectures, the answer is no.
But when you look closely, the failure often comes from specific design choices in today’s models, such as how they handle position or how they choose between possible outputs. It is not necessarily that the core idea of connecting tokens to one another is fundamentally broken.
The current architecture has real limitations, but I believe we can keep modifying and improving transformer-like systems and continue to get gains.
C. Baziotis: To explain transformers, it would be useful to compare them with the models that came before them. Before transformers, many language models worked more like someone reading a text word by word. They would read the first word, update an internal summary, then read the next word and update that summary as they went. These models were called RNNs, or recurrent neural networks.
That approach made intuitive sense because language is sequential, with one word following another. But it had two big problems. First, the model had to squeeze everything it had read into one running summary. If something important appeared early in a long paragraph, it could get lost by the time the model reached the end. Second, the model had to process the text step by step, which made it hard to train efficiently on GPUs.
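The "running summary" problem can be sketched with a toy recurrent update. This assumes a single-number state and fixed, hand-picked weights; real RNNs use trained vector states and often gating mechanisms designed to reduce exactly this kind of forgetting.

```python
import math

def run_rnn(inputs, w_state=0.5, w_input=1.0):
    """Read inputs one at a time, folding each into a single running state."""
    state = 0.0
    for x in inputs:
        # Each step must finish before the next begins: the work is sequential.
        state = math.tanh(w_state * state + w_input * x)
    return state

# Two sequences that differ only in their first element.
near_gap = abs(run_rnn([1.0, 0.0]) - run_rnn([-1.0, 0.0]))
far_gap = abs(run_rnn([1.0] + [0.0] * 20) - run_rnn([-1.0] + [0.0] * 20))
print(near_gap, far_gap)
```

After one further step the two sequences still produce clearly different states, but after twenty further steps the difference has all but vanished: the early information has been squeezed out of the summary. The loop also shows the second problem, since step t cannot start until step t-1 has finished.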
Transformers changed the approach. Instead of reading one word at a time, a transformer looks at a whole window of text at once. That window is called the context. Within that context, the model uses attention, a mechanism that helps each word find the other words or pieces of information that matter to it.
Quick explainer: Context
Context is the information a model can use when producing an answer: the prompt, the conversation so far, retrieved documents, code, files, or other material presented to it. A larger context window means the model can work with more information at once, but it also makes the system more expensive and can introduce more noise.
You can think of attention as a shortcut. The model no longer needs to carry everything in a single running summary. If a word near the end of a paragraph depends on something near the beginning, it can connect to it directly. That was the breakthrough. Transformers could train much faster, use GPUs much better, and learn from much larger datasets. They also became much better at connecting information across longer texts.
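A minimal sketch of that shortcut, using scaled dot-product attention in plain Python. It assumes the token vectors themselves serve as queries, keys, and values; a real transformer first applies learned projections, uses many attention heads, and in an LLM masks out future tokens. The three toy vectors are invented.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, mix the values, weighted by how well the query matches each key."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy context of three tokens; the first and last are similar, the middle one is not.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = attention(tokens, tokens, tokens)
print([round(w, 3) for w in weights[0]])  # how much token 0 attends to each token
```

Token 0 gives the distant but similar token 2 as much weight as itself, and more than its immediate neighbour: distance in the sequence does not matter, only relevance. Each query is also handled independently, which is what lets the computation run in parallel on GPUs. The weight table has one entry per pair of tokens, which hints at why longer contexts get expensive.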
But that design also creates the main limitation. A transformer only works with what is inside its current context window. If the information is inside the window, the model can use it. If not, the model does not naturally have access to it. This is why so much work in AI has focused on increasing context windows and managing them more effectively. But context is not free. The more information you put into the window, the more expensive and noisy the process becomes.
Longer context helps, but it is not the same thing as memory. Memory means the system can keep track of useful information over time, update what it knows from experience, and carry that forward into future tasks. A larger context window just means the model can see more at once.
Overall, I would not say transformers are breaking down. They solved a very important problem and remain extremely powerful. However, their limits are becoming clearer as we ask AI systems to behave less like chatbots and more like long-running agents.
The likely future is not that we throw transformers away. It is that we keep what works, such as attention, parallel processing, and scalable training, while adding better ways for AI systems to remember, update, and use information over time.
Scaling laws
For a while, the story in AI was bigger models, more data, better results. Is that still true, or is the field starting to run into limits where model size alone is no longer enough?
C. Baziotis: The short answer is that model size still matters, but it is no longer the whole story. For a long time, the field had a very powerful recipe: make the model bigger, train it on more data, use more compute, and performance tends to improve. The technical term behind that intuition is scaling laws.
A scaling law is an observed relationship between a resource and an outcome. The resource could be the number of parameters in a model, the amount of training data, the amount of compute used during training, or even the amount of compute used when the model is answering a question.
Quick explainer: Model parameters
Parameters are the muscle memory of the AI. They do not store exact facts like a hard drive. Instead, they store the patterns of how words connect. More parameters mean the AI has a larger brain capacity to learn trickier skills, like sarcasm, coding, or storytelling, but performance also depends on data quality, training methods, and compute.
In the early LLM era, scaling mostly meant pre-training larger models on more data. The basic idea was that if you make the model bigger and give it more data, performance improves in a fairly predictable way. That predictability is what made scaling laws so important. They did not just say “bigger is better”; they gave labs a way to estimate how much better a model might become by adding more data, more parameters, or more compute.
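As a sketch of what "predictable" means here, assume loss follows a simple power law in model size, loss = a * size^(-b). The measurements below are synthetic, generated from made-up constants, and published scaling-law fits typically include more terms (for example, data size and an irreducible loss). The point is only the workflow: fit on small models, then extrapolate to a larger one.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_loss(a, b, size):
    """Loss the fitted power law predicts for a model of the given size."""
    return a * size ** (-b)

# Synthetic "small model" results, generated from loss = 10 * size^-0.1 (made up).
sizes = [1e6, 1e7, 1e8]
losses = [10 * s ** -0.1 for s in sizes]

a, b = fit_power_law(sizes, losses)
predicted = predict_loss(a, b, 1e9)  # extrapolate to a model ten times larger
print(round(a, 3), round(b, 3), round(predicted, 3))
```

Under this assumed curve, every tenfold increase in size cuts the loss by the same fraction, so each step costs ten times more for the same relative gain.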
That confidence was slightly shaken around GPT-4. GPT-4 was OpenAI’s major model released after GPT-3.5, the model family behind the first version of ChatGPT. OpenAI has not publicly disclosed GPT-4’s exact parameter count, but it was widely believed to be larger and was clearly more capable than GPT-3.5. Still, many people expected an even more dramatic leap, partly because it was rumoured to be much larger. That led to speculation that model-size scaling was starting to hit diminishing returns.
I do not believe the right conclusion is that scaling laws are dead, but rather that scaling has become more complicated. You can’t just keep making the model bigger and expect the same returns forever. If you increase the model size, you also need enough high-quality data to train it. If high-quality data becomes harder to find, as it increasingly seems to be, then model size alone is no longer the obvious path forward.
That is why the industry is now exploring other forms of scaling, such as test-time compute, which allows the model to use more computation when answering difficult questions. This is what made reasoning models such as OpenAI’s o1 so interesting. The model can “think” for longer, and in some cases, that improves the answer.
Quick explainer: Reasoning models
Reasoning models are LLMs designed to allocate more computational resources to difficult questions before answering. The idea is that for some tasks, a model can perform better by “thinking” longer at runtime rather than simply being made bigger during training.
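One simple way to spend more compute at answer time is to sample several answers and take a majority vote. This is only an illustration of the trade-off, not a description of how o1 or any specific reasoning model works. The sketch assumes independent samples, two possible answers, and an odd number of samples; the accuracy figures are made up.

```python
from collections import Counter
from math import comb

def majority_vote(answers):
    """Return the most common answer among the sampled answers."""
    return Counter(answers).most_common(1)[0][0]

def majority_vote_accuracy(p_correct, n_samples):
    """Chance that most of n independent samples are correct (n odd, two answers)."""
    needed = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )

print(majority_vote(["42", "41", "42"]))
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(0.6, n), 4), round(majority_vote_accuracy(0.4, n), 4))
```

When the model is right more often than wrong (60% per sample here), extra samples push accuracy up. When it is wrong more often than right (40%), extra samples make things worse, which is one reason more test-time compute helps only in some cases.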
Another important axis is post-training, the stage after a model’s initial training, where it is refined to behave more usefully. One important post-training method is reinforcement learning.
Quick explainer: Reinforcement learning
Reinforcement learning, or RL, is one method used in post-training. The model receives feedback or rewards that signal which outputs are better, and it is adjusted to produce those kinds of outputs more reliably.
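A toy version of that adjustment, assuming a "model" that chooses among just three candidate outputs, with only the third one rewarded. For simplicity it follows the exact gradient of the expected reward; real RL post-training estimates this from sampled outputs (rollouts) and uses more elaborate algorithms. All numbers are invented.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(l - highest) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, learning_rate=1.0):
    """One exact gradient-ascent step on the expected reward of a softmax policy."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    # Outputs with above-average reward are pushed up, the rest are pushed down.
    return [
        logit + learning_rate * p * (r - expected_reward)
        for logit, p, r in zip(logits, probs, rewards)
    ]

rewards = [0.0, 0.0, 1.0]   # hypothetical feedback: only output 2 is "good"
logits = [0.0, 0.0, 0.0]    # the model starts with no preference
initial_probs = softmax(logits)
for _ in range(100):
    logits = policy_gradient_step(logits, rewards)
final_probs = softmax(logits)
print(round(initial_probs[2], 3), round(final_probs[2], 3))
```

The rewarded output starts at one chance in three and ends up being chosen almost every time. Nothing new was added to the list of possible outputs; the training only shifted probability toward one that was already there.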
Instead of only asking, “How big should the model be?”, labs now ask, “How much should we spend shaping the model after pre-training? How many examples, rollouts, or feedback loops should we use? How much performance do we get from that?” So the question today is not simply whether larger models still improve. They often do. The question is whether making the model larger is the best use of resources.
When people say a scaling law is “breaking,” they may mean different things. It could mean the model can no longer learn much more, or that we do not have enough good data, or that the data exists but is too expensive to collect. Or it could mean that spending more compute yields diminishing returns.
Scaling is still a useful guide, but the simple version of the story is over. The field is moving from “make the model bigger” to “find the best axis to scale.” Right now, post-training with reinforcement learning looks like one of the most promising areas to scale.
C. Perivolaropoulos: I agree that scaling laws remain a fundamental guide, but their application is changing because the landscape of data and compute is changing. We are still seeing substantial returns from scaling, even if those are no longer exponential. The transformer has proven highly scalable, but it is also resource-intensive. In earlier stages, the field benefited from an abundance of data and compute. Today, there is a shift toward efficiency.
As high-quality data becomes scarcer, the challenge is less a simple breakdown of scaling logic and more a logistical and economic constraint: where do you get the data, what does it cost, and how do you use the compute most effectively? In that sense, scaling laws still matter, but the relevant bottleneck may be the practical cost of moving further along the curve.
That is why efficiency has become so important. The field is no longer only asking whether bigger models work; it is asking how to get more out of each unit of data and compute.
How models improve after initial training
We talked about post-training and reinforcement learning as new ways labs improve models after the initial training run. But what are these methods actually doing? Are they teaching models genuinely new abilities, or helping them use capabilities they already have more reliably?
C. Perivolaropoulos: Post-training does something important, but in a sense, it sits outside the base model. One uncharitable view is that LLMs have fundamental flaws, and post-training is a set of bandages that merely cover them. There is some truth to that, as these methods work partly because the base model is not perfect.
But I do not like that framing. Post-training is better understood as another way to turn compute, data, and feedback into more useful behaviour. It is not always about making the model fundamentally smarter in some direct way. It is about shaping the system so that the model produces better, more reliable outputs.
That is why I see post-training as part of a broader process for extracting useful behaviour from probabilistic models. The same general idea may not be specific to LLMs; similar approaches could potentially be adapted to other types of models as well.
C. Baziotis: My view is that many gains from post-training today come from elicitation. By elicitation, I mean that the model already has many capabilities from pre-training, and post-training helps bring them out more reliably.
You can see this in areas such as maths and code. In the first few steps of post-training, performance can improve very quickly. That probably does not mean the model suddenly learned an entirely new skill. More often, post-training makes the model more likely to produce the right format, follow a useful strategy, or apply knowledge it already has.
There is evidence pointing in that direction. For example, some papers last year found that Qwen2.5 could improve on maths even when using reward signals that were random or explicitly incorrect. That suggests the training process may be amplifying behaviours that were already latent in the model, rather than teaching a completely new capability from scratch.
Papers have shown a similar pattern when sampling many answers from a base model. Even before post-training, the correct answer may already appear in some fraction of cases. Post-training changes the model’s behaviour so that the good answer comes out more reliably.
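The arithmetic behind that observation can be sketched in a few lines. It assumes each sample is independent and correct with the same fixed probability; the 5% figure is made up for illustration and is not taken from any paper.

```python
def chance_any_correct(p_single, k):
    """Chance that at least one of k independent samples is correct."""
    return 1 - (1 - p_single) ** k

# Hypothetical base model that answers a hard problem correctly 5% of the time.
for k in (1, 8, 64):
    print(k, round(chance_any_correct(0.05, k), 3))
```

A model that is right only once in twenty tries still produces a correct answer somewhere in a batch of 64 samples most of the time. Seen this way, much of the job of post-training is moving that answer from "somewhere in the batch" to the first attempt.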
That also explains why progress has been strongest in domains such as maths and code, where correctness is easier to verify. If you can check whether an answer is right, it is easier to give the model useful feedback. In messier real-world domains, where correctness is more subjective or harder to measure, post-training becomes much harder.
—
Part I leaves us with the conclusion that the old story of AI progress is no longer enough. Bigger models still matter, but the real gains now come from how these models are built, scaled, and refined. Part II turns to what they can actually do. Once a model is powerful enough, the harder questions become practical ones: Can it use tools? Can it remember what came before? Can it work through long tasks without losing the thread? And can we even measure whether it’s useful?
That’s where we go in The State of LLMs, Part II, in two weeks. Stay tuned.
Top News
We invested in NOFire AI to help engineering teams prevent, resolve, and automate responses to production issues
AI is helping software teams write and ship code faster than ever. But when more code reaches production faster, things can also break faster, and finding the real cause of an outage can feel like searching for a needle in a haystack.
Founded by Spiros Economakis, Panagiotis Moustafellos, Anastassios Nanos and Antonios Chalkiopoulos (ex Elastic, Mattermost, Lenses), NOFire AI helps engineering teams understand what changed, what broke, and why. It connects the dots across a company’s software systems, identifies risky updates before they cause problems, and helps teams resolve incidents in minutes instead of hours.
We led NOFire AI’s $2.5M Seed round to build the reliability layer for the age of AI-generated software.
New drug could be a game-changer for people with high cholesterol
Eli Lilly reported a few days ago that a high dose of its experimental gene-editing therapy cut LDL cholesterol by 62% in an early clinical trial, a promising sign that a single treatment might one day replace the daily pills people now take to manage their cholesterol. Unlike statins, the therapy, called VERVE-102, works by making a one-time edit to a gene in the liver, so its effect is meant to last for years rather than wearing off between doses.
Lilly gained the treatment through its roughly $1B acquisition last year of Verve Therapeutics, a startup whose origins trace back to physician and scientist Anthony Philippakis, who spearheaded its founding at Google Ventures in 2016 to bring genetic medicine to the treatment of heart disease.
How well do LLMs understand and perform in Greek?
What happens when you throw LLMs into messy, high-context Greek scenarios with ambiguity, bureaucracy, domain-specific language, outdated public information, false assumptions and local jargon? Greek LLM Arena is an open-source benchmarking platform designed to assess how well large language models understand and perform in Greek.
Jobs
Check out 320+ fresh job openings from startups hiring in Greece here.
Fundings
Drone delivery platform Matternet raised $33M and went public via a reverse merger.
August Robotics raised $30M in Series B funding, led by Big Pi Ventures, to build autonomous robot fleets for construction, industrial, and large-scale projects.
AI customer support platform Gradient Labs raised a fresh $13M in funding led by Octopus Ventures, doubling its Series A to $26M.
Twin Prime announced a $10M Pre-seed round led by Expeditions to build AI models for defence and security.
AI-powered sales training platform Deelan AI raised a Seed round led by Expon Capital.
That’s a wrap, thank you for reading!
Alex

