LLMs have 100% gotten better, but it's hard to say whether they're "intrinsically better", if that makes sense.
> OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024 [1]
That's evidence against "intrinsically better". They've also already trained on the entire internet, and we only have 1 internet.
However, late 2024 brought o1, and early 2025 brought DeepSeek R1 and o3. These were definitely significant reasoning models: this is where test-time compute and serious RL pipelines arrived.
Mid 2025 was when they really started getting integrated with tool calling.
Late 2025 is when they really started to become agentic and integrate with the CLI pretty well (at least for me). For example, Codex would at least try to run some smoke tests on its own code.
In early 2026, the trend appears to be harness engineering. As opposed to the "context engineering" of 2025, where we had to carefully babysit 1 model's context, we now make it easier to rebuild context (classic CS trick btw: rebooting is easier than restoring stale state [2]) and really lean into raw CLI tool calling, subagents, etc.
FWIW, AI programming is still as frustrating as it was when it was just test-time compute (TTC) in 2025. Maybe it's because I don't have the "full harness", but the output still has programming styles baked in, such as silent fallback values, overly defensive programming, etc., which are obviously gleaned from the desire to just pass all tests rather than from truly good programming design. I've been able to do more, but I have to review more slop. Also, the agents are really unpleasant to work with if you're trying to have any reasonable conversation with them and not just delegate to them. If you try to open a proper 2-way channel, it's as if they think the entire world revolves around them and all information from the operator is BS.
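To make the "silent fallback" complaint concrete, here's a made-up sketch of what I mean. The function names and the "timeout" config key are hypothetical, purely for illustration, not taken from any real agent output:

```python
def load_timeout_defensive(config: dict) -> float:
    """The style I keep having to review: every failure is swallowed."""
    try:
        return float(config.get("timeout", 30.0))
    except (TypeError, ValueError):
        # Silent fallback: a malformed value is hidden from the caller,
        # so the tests pass and the misconfiguration ships.
        return 30.0


def load_timeout_strict(config: dict) -> float:
    """Fail loudly: a missing or malformed value is a bug worth seeing."""
    if "timeout" not in config:
        raise KeyError("config is missing required key 'timeout'")
    # float() raises ValueError on garbage such as "abc".
    return float(config["timeout"])
```

With a bad value like {"timeout": "abc"}, the first version quietly returns the default, while the second raises and points you at the actual problem.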
It seems like 2026 will go full speed on AI tooling because the goal is to replace devs, but hopefully AI agents become actually nice to work with. Not sycophantic, but not passive-aggressively arrogant either.
[1] https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...
[2] https://en.wikipedia.org/wiki/Kernel_panic