
AI code plus performance minus brittle = ?

TL;DR There’s a path towards AI that actually produces high-reliability, high-performance code, and it goes through formal methods and compilers.

Generative AI has been something of a controversial topic among developers. Some claim that it’s making their life easier, or even that they don’t need to code anymore and can let some agent do the job. Others claim that it’s unreliable, that it’s operating at the wrong level, and that it will never successfully replace developers.

A few point out (correctly) that Earth simply cannot sustain the exponential energy requirements of AI – or even the exponential money demands of some famous AI company.

For this post, let’s assume that Generative AI is here to stay, and that it will eventually write (almost) all the code. This is by no means a certainty, but it’s a possible scenario, or even a self-fulfilling prophecy.

If this happens, there are two directions in which code-written-by-AI can go. Either AI will get better at writing reliable code, or we’ll get used to rolling dice whenever we use an AI to write code and, more often than not, end up with bad, bug-riddled, unreliable and slow AI-generated code. But hey, we’ll have managed to lay off all the developers (and all the voice actors, all the paralegals, all the book illustrators, etc.) so the money will flow and everybody1 will be happy. Let’s be frank: given how the industry has evolved during the last ~40 years, the second option is the most likely. But maybe, just maybe, there is a path towards good AI code2.

Step 1: Eliminating code hallucinations, baby steps

If you are a senior enough developer, chances are that you don’t consider writing code as something difficult. It may be time-consuming, it may, in some cases, be annoying, but with sufficient experience, it’s hardly more difficult than writing in your native language. Yes, learning new libraries may be a headscratcher, especially when the documentation is poor, but not a true obstacle.

How do you write code? Well, you start from some kind of spec, which could be as fuzzy as “I want to write some kind of game based on the motion of fluids”, or, if you work at NASA, could be thousands of pages of meticulously compiled requirements. Then you refine the spec. If you’re a practitioner of Test-Driven Development, you refine your specs into tests, then fill in the blanks (i.e. the implementation), and chances are that your tests are part of your documentation. If you’re a practitioner of Types-Driven Development, you refine your specs into types, then fill in the blanks (i.e. the implementation), and chances are that your types are part of your documentation. There are countless other variants, but the ones that interest me for this entry are all based on the same template:

  1. Start with a list of external requirements.
  2. Refine your list into something that can be understood by the computer.
  3. Fill in the blanks.
  4. Let the computer tell you whether the blanks are filled correctly.

As they’re currently being hyped, LLMs promise to semi-magically handle items 2., 3. and 4. without (much) human intervention – and fail to do this reliably. The much-hyped solutions pretty much come down to brute-forcing an answer, and none of them seems to actually scale up to any interesting problem.

On the other hand, the template above already suggests a path that feels much more reliable.

  1. A human being starts with a list of external requirements (in some language, human or formal).
  2. A human being refines this list into something formal.
  3. An Oracle fills in the blanks (presumably, the Oracle is an LLM, but it doesn’t have to be).
  4. The toolchain uses the formal specifications (tests, types, etc.) to confirm that the blank has been filled correctly.
  5. Return to 3. until the blank has been filled entirely, you run out of money, or the climate is so warm that people have given up on computers.
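The loop above can be sketched in a few lines. This is a minimal, illustrative sketch, not an existing tool: `propose` stands for whatever Oracle you plug in, and the “formal specifications” are reduced to plain predicate tests.

```python
def run_oracle_loop(spec_tests, propose, budget=100):
    """Repeatedly ask an untrusted Oracle for a candidate implementation
    (step 3) and accept the first one that passes every check (step 4),
    giving up when the budget runs out (step 5)."""
    for attempt in range(budget):
        candidate = propose(attempt)
        if all(test(candidate) for test in spec_tests):
            return candidate
    raise RuntimeError("budget exhausted before the blank was filled")

# Toy demo: the "blank" is an absolute-value function.
spec = [
    lambda f: f(-3) == 3,
    lambda f: f(0) == 0,
    lambda f: f(5) == 5,
]

def unreliable_oracle(attempt):
    # A deliberately flaky Oracle: it cycles through guesses, one of
    # which happens to be correct.
    guesses = [lambda x: x, lambda x: -x, lambda x: abs(x)]
    return guesses[attempt % len(guesses)]

impl = run_oracle_loop(spec, unreliable_oracle)
print(impl(-7))  # prints 7: the accepted candidate behaves like abs
```

Note that the loop never trusts the Oracle: a candidate only survives by passing every check the toolchain can throw at it.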

Note that I’m not nearly the first person to come up with the idea of getting an Oracle to fill in the blanks. I’ve seen a number of people across social networks doing this, with various levels of success. In fact, Coq/Rocq developers have been doing this for decades, refining informal specs into types, then getting the Oracle to write the code, and using the checker to make sure that the code matches the specifications. Just, instead of calling these specs “types” and these blanks “code”, they call them “proof obligations” and “proofs”, but this is, in fact, the same thing. Better still, these proofs can be compiled to native code and executed. Also, there are precedents for using an LLM as an Oracle for Coq/Rocq.

Thesis 1 With a powerful enough specification mechanism, we can eliminate code hallucinations.

Importantly (and unsurprisingly), this mechanism is independent from the Oracle used. You can use ChatGPT as an Oracle, or genetic algorithms, or your interns, or an agentic architecture, or a Mechanical Turk, or an army of monkeys on typewriters.

Step 2: Eliminating code hallucinations, for real

Actually, we didn’t entirely eliminate hallucinations. For this section, let’s assume that you’re using Test-Driven Development. There is still a chance that your Oracle could be lying to you (or hallucinating convincingly) by producing code that satisfies only the tests and fails in every other situation.

When we’re writing code manually, this is usually not a problem: we count on developers to be honest, first because most people are honest, and second because if they’re not, they’re going to lose their job. So, if we have sufficient tests/types, it’s usually enough. But if your Oracle is an algorithm, or perhaps if you’re buying your snippet from some mechanical turk of untraceable origin, you can’t afford to make any such assumption. In fact, in a future of AI marketplaces and generalized economic warfare/industrial espionage, it’s far from certain that you will be able to trust an AI bought or rented from another company/nation.

In other words, in the presence of an unreliable Oracle, tests are not sufficient. What could be better than tests? Off the top of my head, I can think of only one thing that is more powerful than a finite testsuite: an infinite testsuite, one that can test every single combination of inputs and outputs that your code could need to deal with and produce. Of course, the space of such combinations is infinite, so we will want to pack it into some form of code.
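One standard way to pack an unbounded test space into code is property-based testing: state a property once, then sample the input space as much as you like. A minimal sketch (the property and the generator below are illustrative, not a specific library’s API):

```python
import random
from collections import Counter

def prop_sort_correct(sort_fn, trials=1000, seed=0):
    """A testsuite 'packed into code': instead of enumerating cases, we
    state what sorting *means* (the output is ordered and is a
    permutation of the input) and check it on as many inputs as we want."""
    rng = random.Random(seed)
    cases = [[3, 1, 2]]  # one fixed case, plus as many random ones as we like
    cases += [[rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
              for _ in range(trials)]
    for xs in cases:
        ys = sort_fn(xs)
        ordered = all(a <= b for a, b in zip(ys, ys[1:]))
        permutation = Counter(ys) == Counter(xs)
        if not (ordered and permutation):
            return False
    return True

print(prop_sort_correct(sorted))         # prints True
print(prop_sort_correct(lambda xs: xs))  # prints False: caught on [3, 1, 2]
```

Mature tools such as QuickCheck or Hypothesis follow the same idea, with much smarter input generation and shrinking of counter-examples.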

One way to do this is to actually write the code for your application/library/service yourself. You’ll then be able to compare the behavior of handwritten code to whatever your Oracle can generate. But there’s the small matter that you’ll need to trust your own code, and the other small matter that, since you have written the code, what’s the point of having an Oracle in the first place?

So perhaps we could do better? What if, instead of writing the code manually, we had several competing Oracles, each writing one version of the code that fulfills the tests, and we compared their results? Something like inverse-fuzzing, or perhaps a bit like genetic algorithms. Then you’d just need to pick the best sample. But there’s the small matter that, since all the samples already fulfill your specs-as-tests, you have, at least in theory, already used all the knowledge you have at hand to pick one.
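The competing-Oracles idea is essentially differential testing with majority voting. A minimal sketch, with made-up candidates; in a real pipeline, the candidates would come from the Oracles and the inputs from a generator:

```python
def differential_vote(candidates, inputs):
    """Run every candidate implementation on every input and keep only
    the candidates that agree with the majority output everywhere."""
    survivors = set(range(len(candidates)))
    for x in inputs:
        outputs = [c(x) for c in candidates]
        majority = max(set(outputs), key=outputs.count)
        survivors &= {i for i, out in enumerate(outputs) if out == majority}
    return survivors

# Three "Oracle" submissions for absolute value; the third is subtly wrong
# (it passes any testsuite that only uses non-negative inputs).
cands = [abs, lambda x: -x if x < 0 else x, lambda x: x]
print(differential_vote(cands, range(-50, 51)))  # prints {0, 1}
```

Of course, majority voting only helps if the Oracles fail independently; colluding or identically-trained Oracles could outvote the correct sample.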

So you’ll have to be smarter and borrow a page from the data science community: hide some of your specs, and use only these hidden specs to reject samples that are convincing but false. This is a form of defense against overfitting/a hostile Oracle. You’ll probably want to combine this with some other metric (e.g. performance) to pick among the remaining samples.

How many hidden tests will you need? If I recall correctly, a common practice among data scientists is to keep ~10% hidden3. I don’t think that I’d be satisfied with decreasing by ~10% the chances that my plane decides to drop from the sky because it doesn’t like the song I’m playing, so I hope you’ll be using more.
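A minimal sketch of the hidden-spec idea (the split ratio, seed and helper names are all illustrative): the visible tests drive generation, while the hidden ones act as a holdout set that rejects overfitted samples.

```python
import random

def select_candidates(candidates, tests, hidden_fraction=0.5, seed=42):
    """Split the spec-as-tests into a visible part (which the Oracles are
    allowed to see) and a hidden part (which rejects overfitted samples)."""
    rng = random.Random(seed)
    shuffled = tests[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * hidden_fraction)
    hidden, visible = shuffled[:cut], shuffled[cut:]
    passing_visible = [c for c in candidates if all(t(c) for t in visible)]
    return [c for c in passing_visible if all(t(c) for t in hidden)]

# Spec: clamp an integer to [0, 10], expressed as one test per input.
def clamp(x):
    return max(0, min(10, x))

tests = [lambda f, a=a: f(a) == max(0, min(10, a)) for a in range(-5, 16)]
cheat = lambda x: x  # only correct on 0..10: a plausible overfitted sample

survivors = select_candidates([clamp, cheat], tests)
print(len(survivors))    # prints 1: only the real clamp survives
print(survivors[0](12))  # prints 10
```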

Thesis 2 There is a way to (probabilistically) eliminate hallucinations with Test-Driven Development.

I don’t think that there is anything fundamentally wrong with this approach, but I find it a bit disappointing. After all, we’re still rolling dice, we’re just removing some of the bad rolls. In fact, I suspect that removing bad rolls will require exponential effort in writing tests. However, in our industry, we are used to just giving up on testing and reviewing at some point and simply releasing the code, counting on our ability to patch problems once they have surfaced, so perhaps this approach will be sufficient.

Step 3: Eliminating code hallucinations, absolutely

I believe that we have exhausted all that Test-Driven Development can bring to the conversation on hallucinations. It is time to take a look at some of the other fill-in-the-blank approaches to writing machine-readable specifications.

Formal specifications are languages designed to encode all the meaningful properties of the system we want to develop, without any unspecified/undefined behavior. Most developers are familiar with the Backus-Naur Form (BNF), used, for instance, to formally specify the syntax of a language. Beyond that, several languages have formal semantics, including several versions of Java and JavaScript, i.e. a mathematical definition of the exact behavior of the language that covers every possible case. Many tools can interpret these mathematical definitions, for instance to come up with new tests – an unlimited number of distinct tests, if necessary – or to check mathematical proofs that the code actually implements the specifications.

Formal specifications have been used to develop high-reliability software, including software that powers Airbus planes, automated subways, operating systems, and compilers.

One of the key aspects of formal methods is that there are well-defined and well-tested mechanisms to guarantee, mathematically, that an implementation respects the specification, even in presence of a hostile Oracle. It’s not probabilistic, it’s a certainty (barring bugs within the proof-checker itself). For instance, code produced by the Oracle can be annotated in such a way that conformance with the specifications can be checked upon delivery (this mechanism is called Proof-Carrying Code).

Thesis 3 There is a way to absolutely eliminate hallucinations with Formal Methods. Yes, for real.

Even better, a number of languages contain built-in proof mechanisms. I’m thinking of the type systems of Rust, Ada, Haskell, OCaml/F#/ReasonML, etc. These type systems will let you prove sophisticated properties – assuming that you run some parallel analysis to guarantee that you don’t use unsafe/Obj.magic/FFI code/…

Sadly, experience shows that the type systems of these specific languages aren’t quite sufficient to encode all the interesting properties that we want to put in a real-world spec. However, more powerful type systems have been designed for that purpose (e.g. Idris, Coq/Rocq, …), and these are sufficient to encode properties of almost arbitrary complexity. I’m going to keep focusing on type systems, because this is a domain I understand rather well, and because some properties of local reasoning are important in my eyes, but there are also entirely different mechanisms, such as Abstract Interpretation or Model Checking, that have been used to prove extremely sophisticated, real-world software.
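To make “the type is the spec” concrete, here is a tiny Lean sketch (Lean stands in for Coq/Rocq or Idris here, and the lemma is deliberately trivial). The theorem statement is the formal specification; the proof body is the blank the Oracle must fill; the proof-checker rejects anything else:

```lean
-- The spec, expressed as a type: reversing a list preserves its length.
theorem reverse_preserves_length (xs : List Nat) :
    xs.reverse.length = xs.length := by
  -- The "blank": here a library simp lemma happens to close it,
  -- but any hallucinated proof would simply fail to check.
  simp
```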

Historically, the lengthy part of Types-Driven Development/writing formal proofs is coming up with the actual proof body (aka the code). It’s kind of fun, because Types-Driven Development is really about solving a puzzle, assembling the types/proofs (aka variables and functions) at hand to build larger types/proofs, but it’s extremely time-consuming.

I actually have fairly high hopes that LLMs (or some other untrusted Oracle, whether it’s Chain-of-Thought or some variant on Genetic Algorithms) can solve, or at least help solve, this problem. After all, since we deal with formal proofs, the worst that can happen if the Oracle hallucinates (or attempts a hostile action against the rest of the code) is that its code will be rejected by the proof-checker.

This, of course, assumes, just as with Test-Driven Development, that we have a sufficiently precise specification of what the code needs to do. Unlike with Test-Driven Development, however, there is no dice roll involved.

Step 4: What about performance?

Oh, right. I did mention performance. And, ideally, we want as much as possible of the performance work automated.

Performance is… complicated. To keep it simple, I’ll summarize a few principles:

  1. You need to be willing to invest in optimization. These days, most companies don’t care until [often way too] late in the process.
  2. You need a core technology that supports optimizations. To this day, nothing supports as much optimization as Rust, Ada, Zig, C++ or C, but you can achieve high performance with a host of other languages, from C# to Java to Go. For some other languages, it will depend on the domain at hand.
  3. You need the ability to get performance information, typically with a combination of Telemetry (to find out what feature you need to optimize), Profiling (to refine this into actual function names) and Benchmarking (to catch regressions quickly).
  4. You need the ability to replace code that works but with poor performance with code that works with better performance without causing bugs.
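Point 3. can be made concrete with a minimal benchmark-regression check; the helper names and the 1.5× tolerance are illustrative assumptions, and a real setup would use a proper harness:

```python
import time

def bench(fn, arg, repeats=5):
    """Time fn(arg) several times and keep the best run to reduce noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best

def regressed(old_fn, new_fn, arg, tolerance=1.5):
    """Flag a replacement that is more than `tolerance` times slower
    than the implementation it is meant to improve on."""
    return bench(new_fn, arg) > tolerance * bench(old_fn, arg)

# Toy example: a hand-rolled loop vs. the builtin sum.
def slow_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(100_000))
print(regressed(sum, slow_sum, data))
```

Telemetry and profiling tell you *where* to point such a check; the check itself is what catches regressions quickly.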

Step 1. is non-technical, so it doesn’t really have a bearing on this conversation.

Step 4. is already covered by the specs: whether we go with Test-Driven Development or Types-Driven Development, we should already have all the tools we need to make sure that we do not break anything by making things faster. If we do, we have found a hole in our specs.

For Step 3., all the tools already exist. I’m going to handwave the automated introduction of Telemetry and the automated extraction of Benchmarks.

Even better, by using the information from Step 3. and some automation, it should be possible to extract specifications for the part of the code we wish to optimize, to minimize the amount of rewriting/re-testing. We get these specifications for free if we’re using Types-Driven Development. With Test-Driven Development, it’s less clear, but depending on the language, it’s possible that static analysis could be sufficient to extract local tests.

Is there a problem with Step 2.? After all, Rust, Ada, Zig, C++, C, C#, Java and Go already exist. Well, if we’re using Test-Driven Development, we have already accepted a degree of risk, and it doesn’t look like we’re incurring any further risk, so the path looks open.

Thesis 4 With sufficient instrumentation and profiling and with a language with clean enough semantics, it should be possible to perform automated optimizations in an Oracle+Test-Driven context.

I’m handwaving a bit, once again, the definition of “language with clean enough semantics”. In my mind, it’s something like “any language in which we can quickly determine the dependencies and side-effects of a function”. Python definitely doesn’t do the trick, for instance, because much of the language can be redefined by a hostile Oracle, nor does today’s C++, because Undefined Behaviors or accidental imports can cause havoc. Does Rust (minus unsafe) do the trick? Does Go (minus CGo and any of the Unsafe functions) do the trick? To be determined.

Step 5: Combining performance and Formal Methods

If we use Formal Methods, we have set ourselves a higher standard, so things are a bit more complicated. For one thing, none of the performance-oriented languages I’ve mentioned features a combination of a sufficiently powerful type system and sufficiently clean semantics to permit Types-Driven Development. The languages that do, such as Coq/Rocq or Idris, are not well-suited for optimizations.

There is, however, hope.

For one thing, most of the performance-oriented languages have large subsets that support Model-Checking and Abstract Interpretation, which suggests that these avenues are open.

For another, both Rust and Ada have demonstrated that we can take some of the research in formal methods and turn it into industrial languages that don’t require a PhD. My gut tells me that it is possible to go much further and develop a Rust/Ada-meets-Coq/Rocq, designed to be used by machines, with both a type system powerful enough to represent highly complex specifications with zero-cost abstractions, and a highly optimizing compiler. I suspect that this would be a very long research project.

Thesis 5 With sufficient instrumentation and profiling, if we restrict ourselves to a language with clean enough semantics and formal methods, it should be possible to perform automated optimizations in an Oracle + Formal Methods context.

Is it worth it?

I, for one, don’t want to fly in a plane whose code is written by ChatGPT, or to live close to a plant controlled by code written by Claude, and I don’t see any sign that LLMs are going to reach sufficient reliability before the end of our civilization, at least not without the kind of help outlined in this post. So no matter what happens to the LLM scene, we will want to maintain a branch of the development industry that produces highly-reliable code.

There are many possible futures. One in which such code remains hand-written, by a small, elite caste of system developers, while the rest of the trade progressively disappears, eaten away by LLMs, at least until we run out of energy to power them.

But if we head towards a future in which Oracles write code that must keep people alive, I would really hate for it to be done by dice-rolling. So, at the very least, I believe that we as a society, need to investigate how to make sure that our Oracles don’t kill us by hallucinating bad code.


  1. For some very unconvincing definition of “everybody”. ↩︎

  2. But yes, the developers will have been laid off regardless. ↩︎

  3. I’m not a data scientist, don’t take my word on this. ↩︎

Copyright: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Author: David Teller

Posted on: January 6, 2025
