The State of LLMs, Part II

Beyond the model

Jun 19, 2026

Telling the stories of Greeks in tech building cool things and pushing frontiers. If you haven’t subscribed, join 7,000+ readers by subscribing here:

The State of LLMs, Part II: Beyond the Model

A few weeks ago, two of my favourite people in the AI community, Christos Baziotis of Cohere and Christos Perivolaropoulos of Google DeepMind, hopped on a call with me and we riffed for 1.5 hours on where AI and large language models are today, and where they may be heading.

We went into the technical, and sometimes slightly speculative, depths of the powerful AI systems we use today, like how they are built, trained, evaluated, and how we can make the most out of them.

I decided to publish that conversation in two parts. Part I focused on the foundations: transformers, scaling, and post-training. Today, I’m pleased to share Part II: Beyond the Model.

A large language model can answer questions, write text, and reason through problems. But by itself, it is still only part of the story. To be useful in the real world, it has to sit within a broader system, one that provides it with tools, memory, context, workflows, and a way to act.

That is where this second part begins. You do not need to have read Part I to follow along, but if you want the foundations first, you can find it here.

In Part II, we cover five questions:

Why the harness around a model may be just as important as the model itself.
Why benchmarks are becoming harder to trust, and how we might measure real-world usefulness instead.
What today’s AI systems still cannot do.
What kinds of breakthroughs could make AI systems faster, more useful, and better at long-horizon work.
What would make each guest say they were wrong about how these systems work.

Let’s get to it.

Why the harness matters

Quick explainer: Harness

The harness is the software that surrounds a model, turning it from a raw model into a useful system. It connects the model to tools, files, APIs, memory, workflows, and user interfaces. In an AI agent, the harness helps decide what information the model sees, which tools it can use, when to use them, and how its output translates into an action.

Is AI progress coming mostly from smarter models, or from better systems around them? And when the harness improves, is that real progress, or just a workaround for what models still cannot do?

C. Perivolaropoulos: I think it is real progress. In academic circles, people often focus mostly on the model itself, but the harness around the model is just as important, maybe more important right now.

The model matters, of course. But a better harness expands what the user can do with it. At the end of the day, the user still owns the task. The AI system may help, but the person is still trying to get something done. To me, that is real progress.

C. Baziotis: Yes, agent systems are clearly becoming better at useful work. Whether you call that “intelligence” depends on how you define the word, but I believe it is real progress. The model is still extremely important. It is the core of the system: it processes the information it is given and produces the next answer or action. But it cannot do everything by itself, and in many cases, it does not need to.

Today’s harnesses partly compensate for model weaknesses. Models can still struggle with complex instructions, long plans, large amounts of context, and multi-step tasks. The harness helps by adding features such as task lists, self-correction, memory, tool use, and sub-agents.

But the harness is not only a patch. Some things simply sit outside the model’s natural role. A model cannot, by itself, browse a codebase, run a test, call an API, inspect a file system, or send an email. The harness connects the model to the outside world.

So what we are seeing is both sides improving. If you swap models in your favourite harness, like Cursor or OpenCode, the difference between a current frontier model and one from a year or two ago is obvious. But the harness side also matters a lot, and public benchmarks like SWE-bench Pro verify this: the performance of the same frontier model can vary wildly just by changing the agent scaffold around it.

A lot of this comes down to what people now call context engineering, which is deciding what information the model sees, what tools it gets, when it gets them, how information is summarised, and how memory is managed. The field is still empirical; we are still learning what works.

But these two sides are not improving independently. The dominant recipe at the frontier now is using reinforcement learning to train the model inside an agent loop, with tools, a filesystem, and so on. Because of this, I expect that as models get stronger, more of what the harness does today will move into the model itself and become part of its policy.

You can see the impact of this co-adaptation in coding. For a while, coding agents were brittle and needed constant correction. Then, around the end of 2025, the experience improved sharply, something like a phase transition in usability. Agents started to be significantly more reliable and able to perform more long-horizon tasks.

Why benchmarks are getting harder to trust

Quick explainer: Benchmark

A benchmark is a standardised test used to compare model performance. Benchmarks give the field a shared way to measure progress.

As AI models get better, benchmarks are becoming harder to trust. Once a benchmark becomes important, companies can optimise for it directly. So how should we measure real progress, not just whether a model scores well on a test, but whether it is actually useful?

C. Baziotis: Benchmarks have always been imperfect, and the field has replaced them many times as models got better. The problem is more intense now because progress is faster, benchmarks saturate more quickly, and benchmark scores have become part of product launches.

I do not think we should abandon benchmarks. They have driven significant progress. But we need to be more careful about what we measure. In the past, models were less capable, so benchmarks often measured narrow skills such as classification, question answering, translation, or similar tasks. That made sense at the time. Today, models can do more useful end-to-end work, so we should start measuring whether a model can complete tasks that have economic value or help people in their daily work.

That means moving from narrow tests to more realistic environments. But those are much harder to build. You need something closer to a real workspace: email, Slack, documents, codebases, virtual machines, tools, and tasks that require using all of them together.

You also need a reliable way to check whether the task was completed. Did the agent produce the correct report? Did it fix the code? Did it complete the workflow? That is why building good benchmarks is becoming a serious product and infrastructure problem, not just an academic exercise, and the field is already investing heavily in it.

C. Perivolaropoulos: To add to that, if a user is using a model not to have abstract conversations but to do actual work, does it help them do the job, or does it make the process more stressful? That kind of “vibe check” is not just vibes. It is a proxy for whether the system is useful in practice.

What is still missing from today’s AI systems

When you look at today’s AI systems, what still feels fundamentally missing? What do we still not really know how to build?

C. Perivolaropoulos: One major missing piece is reliability. I want to be able to trust AI the way I trust a doctor or an accountant. I would like a surgeon to be able to trust AI in a critical situation. We are not close to that level of reliability yet.

Beyond reliability, there are areas where AI’s impact has not yet reached the level of application I would like to see. One example is simulation in physical systems and biology. That is partly a complaint about the economy and incentives, not only about technology. I would like AI to be available and capable of helping virologists prevent the next pandemic, or to help with food production and supply chain management in the global south. There is incredible innovation across modalities, but I would personally like to see more of our collective compute and research effort focused on foundational scientific challenges.

On the more speculative side, I would love to see other branches of AI enjoy groundbreaking breakthroughs and attention as LLMs do, including approaches based on search, evolution, explicit rules, or logic. As an industry, we do not really have the hardware for some of that, and our training methods are still crude. But I am willing to dream, because gradient descent has also been unreasonably effective.

C. Baziotis: In my view, the biggest missing thing is that models do not reliably learn from experience after deployment. If you give them the right context, they already have the skills to do many everyday tasks. What they cannot do well is build on what they learn from each interaction. Whatever they figure out in one session does not reliably carry forward, so they have to relearn things again and again. That is a major obstacle for long-horizon work.

By learning from interaction, I mean two things. First, the model should learn from the environment: which tool actually worked, how a codebase is structured, and what action had what consequence. Second, it should learn from users: the conventions of a particular team, what good looks like for a specific customer, and so on.

Today, the model can use those signals for the next few turns inside a session. But once the session ends, most of that teaching is gone. This is part of the broader problem of continual learning.

Quick explainer: Continual learning

Continual learning is the ability of an AI system to keep improving through new experiences after deployment, rather than starting from scratch each session. For agents, this could mean remembering user preferences, lessons from previous attempts, or what worked in a particular environment.

People often think continual learning means updating the model itself after deployment, changing its internal “weights,” or learned settings. But I see that as only one form of it. There are other ways, such as external memory, persistent instructions, skills, procedural notes, and so on. Eventually, some of that should probably feed back into the weights, too.

The agent should end each interaction a little more effective in that environment than it was at the start.

What could change how we build AI

What kind of breakthrough could unlock the next phase of AI, not just better benchmark scores, but systems that are faster, more useful, and able to work over longer horizons?

C. Perivolaropoulos: For me, one important direction is speed. Today’s language models are powerful, but complex tasks can still be slow. If an AI agent needs hours to work through a difficult problem, that limits how people can use it.

So one possible breakthrough would be a different way for models to reason, not necessarily producing everything step by step as today’s LLMs do, but finding faster ways to build an internal understanding before giving an answer.

In technical terms, some of this work includes latent-space reasoning, diffusion-style methods, and JEPA-like models. These may not yet beat today’s best LLMs, but if they could make complex AI work dramatically faster, the impact could be huge.

If I do not have to wait three hours for an agent to make sense of my large codebase, but can get the answer in a few seconds, that completely changes the interaction. A new architecture that makes systems much faster, without losing intelligence, or even while gaining intelligence, would have a major effect.

C. Baziotis: Building on the previous point, the breakthrough for me will come from memory and recurrence. Both have to become first-class parts of these systems, not something we add on later.

When it comes to memory, if a model tries something and it does not work, then later finds the solution, it should remember that. If a user teaches it something, it should be able to use that later.

The other part is recurrence. Transformers dropped it in favour of attention over a flat context window, and that tradeoff paid off. But it also means that information is either in the window and fully accessible or completely gone. Recurrent models do not have this boundary. They maintain an evolving internal state, where important things persist, and the rest naturally fade. For long-running agents, I think something like recurrence has to come back and combine the best of both.

We are already seeing early versions of this, from new architectures to agent loops with persistent state, and I am very optimistic about it. So, overall, I do not think these systems will stay as “flat” as they are now.

What would change their minds

If we sat here again in five years, what result or breakthrough would make you say that your current understanding of how these systems work was wrong?

C. Baziotis: For me, the thing that would make me say I was wrong is if most of the open problems we discussed turn out to be solved simply by growing the context window.

If labs keep stretching context and that keeps unblocking everything, such as agents staying coherent over long tasks, learning from experience, maintaining useful state, then my bet on memory would be wrong. But my current view is that the larger context helps, while memory is a different, deeper problem.

C. Perivolaropoulos: If transformers are still around and still improving in five years, that would be a sign that I had misjudged the limitations of current systems.

So far, I have consistently underestimated neural AI. I had barely absorbed the implications of AlexNet before LLM agents arrived.

Quick explainer: AlexNet

AlexNet was a pioneering, convolutional neural network architecture developed for image classification tasks in 2012 that revolutionised the field of AI.

I have learned to be more careful with my assumptions. If transformer-like architectures keep improving, that will be another lesson: the simple, scalable architecture may have more life left in it than many of us expected.

There is also a practical reason they may stay around. The whole industry has been built around them, such as chips, kernels, tooling, and production systems. Even if a new architecture is theoretically better, replacing the whole stack is hard. There are production lines, software systems, and hardware assumptions built around the existing technology.

So I think we may continue to have something that resembles the transformer for at least the next five years, simply because the industry has so much momentum behind it.

Top News

Databricks gets a 16-person engineering team in Greece via the acquisition of Panther

Databricks, the sixth-most-valuable private company in the world after its Series L round at a $134B valuation, has acquired the AI cybersecurity platform, Panther.

There’s also a Greek angle here. Panther has had a presence in Greece almost from the beginning, with Kostas Papageorgiou joining as a founding engineer. I hosted Kostas in edition #18 of the newsletter, six years ago, in a conversation about building an engineering team in Greece. Today, according to LinkedIn, Panther’s local team has grown to 16 people, led by Director of Engineering Panos Sakkos.

Computer Scientist Ilias Diakonikolas wins one of Computer Science’s highest honours

Ilias Diakonikolas, a computer scientist and professor at UW–Madison, has won the 2026 Gödel Prize, one of the most prestigious awards in theoretical computer science. The prize recognises his co-authored work on making algorithms learn reliably from large, complex datasets even when some of the data has been corrupted, which is a foundational problem for modern AI. The academic prestige is huge; this is the kind of prize given to papers that reshape a field. You can read more here.

Bringing photonics into AI data centres

Oriole Networks is bringing photonics into AI data centres through a deployment with AMD and the UK’s £50M ARIA Scaling Inference Lab. The company is trying to replace the electrical switching layer inside AI data centres with light. Its platform routes data between chips as photons rather than electrical signals, aiming to reduce latency, cut energy use, and keep expensive GPUs from sitting idle while they wait for data.

UCL professor and longtime researcher in optical networked systems, George Zervas, is Oriole’s co-founder and CTO, and has spent years working to connect large-scale computing systems with photonics. Oriole was founded as a UCL spinout in 2023, and this deployment marks its first commercial step toward bringing that research into real AI infrastructure.

Jobs

Check out 320+ fresh job openings from startups hiring in Greece here.

Fundings

Cybersecurity company Ent emerged from stealth with $100M in Seed financing by Sequoia, Felicis, and others. The founder, Elias Manousos, previously built RiskIQ, which Microsoft acquired for over $500M.

Neion Bio, a biotech company producing biologic medicines, i.e. complex drugs made from living organisms, announced the close of a $23M Series A led by Caffeinated Capital.

Neion Bio extracts cells from a fertilised egg in its genetic engineering process

Cargofy landed $11M to enable freight companies to automate operations, improve efficiency, and scale across international markets.

Enlaye raised $5M in Seed funding led by Glasswing Ventures to build a risk management platform for the construction and infrastructure industries.

SOLIDKOSMOS, the first spin-off of the University of Piraeus, raised €300K from Corallia Ventures to enhance the reliability of computing systems across space, defence, automotive, and critical infrastructure.

New Funds

Kos Biotechnology Partners announced the third closing of its life sciences fund at $123M.

That’s all for today, thank you for reading! If you liked it, give it a ❤️ and share.

Alex

Startup Pirate by Alex Alexakis

Discussion about this post

Ready for more?