Unpublished Musings about software
-
Reviewing AI Code Is Not A Viable Argument
10 Jun 2025
I am an LLM Coding Assistant skeptic. It is not because of IP law problems (even if they are highly problematic), nor is it for ecological and resource consumption reasons. It is not even for “they are all crap” reasons. My problem with LLM Coding Assistants is that I cannot see, in the face of the scientific evidence, how they can help someone write code better or faster.
And the thing that irritates me is that none of the proponents of LLM Coding Assistants ever seems to address this problem and this evidence when they defend their tooling choices. Worse, they seem to give fuel to my arguments every time they write a rebuttal of the skeptics. So let’s look at what I have a problem with, how empirical scientific research supports that view, how proponents of LLM Coding Assistants could show this is not a problem, and then how, right now, they are doing the exact opposite.
The Intern Problem
The fundamental problem with LLM Coding Assistants that my criticism centers on is their relatively high risk of getting things wrong. For all kinds of reasons, some structural to how LLMs work and others more linked to the interfaces we are given to interact with them, LLM Coding Assistants get things wrong. It can be hallucinations, typos, simply doing something unrelated to the task asked for, wandering down a different path, etc.
A lot of people I have talked to who experimented with LLM Coding Assistants explain that they feel “like an intern”. Like an intern, you should not expect too much from them, you should expect that everything they do will be more or less wrong, and that they have no idea what they are doing, but are highly enthusiastic. I guess they never had me as an intern. I was definitely not enthusiastic.
And their answer to that problem, the one you will see all over the internet, is simple. You just do the same thing that you do with interns and junior developers in your team. No, they do not mean you put everything they did in the bin and forget about it. What they mean is that you should review all the code yourself. I mean, you are the human who knows better. And you are the one responsible for the code anyway. And on top of this, you are already doing that for all the code that gets into your codebase, you do not let code get in without a review, right?
What We Mean When We Say Reviews
First of all, I want to be clear here. There are different practices in the literature and in the profession grouped under the term “review”. So let’s be explicit. Given the degree of (mis)trust and potential mistakes involved, we should not accept the kind of “lightweight and heavily distributed” reviews that are the most common in our industry as the standard for supervision of LLM Coding Assistants by a professional developer. They are not a bad thing to do, nor are they ineffective, but the literature mostly shows them to be good at distributing knowledge of changes and at enforcing all kinds of surface-level rules.
For AI Coding Assistants, we will need a proper “code review”. Not something as formal and complex as the reviews of old, by committees, painstakingly checking every line one by one over a few hours. But still, we want something quite involved and complete. After all, these are interns writing sometimes highly complex code. And if there is something we know in software, it is that the devil can be in the details.
The Limits Of Reviews
Without going into any philosophical depth about reviewing as a practice, there is a glaring problem with this idea. From all the research we have (and if I am motivated, you should see a lot of links to references in the next paragraphs), we have learned, empirically, a few things about code reviews. And the evidence is relatively solid here, within reasonable limits. You will see that these limits do not matter here.
- A review that lasts more than 1h is too long.
- A review that has to be effective cannot cover more than 400 LOC in that time.
Empirical research has shown that reviews longer than 1h quickly reach diminishing returns, whatever the size of the code being reviewed. So this is not only because people cannot find bugs anymore after 1h, having already thoroughly reviewed most of the code. No, it is more linked to the fact that after 1h at that level of attention, people start getting tired, bored, and simply need some time off.
Of note is the total absence of research, as far as I could find, on the recovery time needed between 1h review sessions. So I cannot tell you how frequently someone could do them. But we could probably accept an extreme maximum of a handful per day. Which is probably far more than most people could do, I would probably put the average at 2, but eh. That is still in the right ballpark.
The second limit that has been seen in empirical research is that speed, that is, the number of Lines of Code reviewed per Hour (LOC/h), is highly variable, mostly depending on the context of the code, the kind of code being reviewed, experience, knowledge, and the rest of the reasons you would expect. But something that is regularly pointed out is that, even if there is no hard cutoff, a maximum of 400 LOC/h seems to be a good upper bound for effectiveness, as nearly no review above this speed seems particularly effective at finding and flagging defects in the empirical data.
What It Means For LLMs
So, if we combine the claim that the solution to the problems of LLM Coding Assistants is to review the code, with the empirical evidence from scientific research on code reviews, what do we get? For every 400 LOC written by an LLM Coding Assistant (at best, less for code that is hard to review), we need one professional senior developer to spend 1h of their time reviewing the code. And they only have between 10 and 40 of these reviewing slots per week, with a recovery period in between of unknown length, but probably of at least an hour or two.
And that is without taking into account that these slots of intense concentration are probably in high demand. Meetings, code to think through, design sessions, incidents, etc. You had better hope you can take all of them. This means that in the best case scenario, a developer using an LLM Coding Assistant can write, review and commit, at best, a few thousand LOC per day to a codebase. And that is the best case scenario; a more realistic one is less than 1k LOC per day.
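To put numbers on it, here is the back-of-the-envelope calculation, using my assumptions from above (400 LOC per one-hour slot, 2 slots per day realistically, 8 as an absolute extreme):

1> 2 * 400.   % realistic: 2 one-hour review slots per day, 400 LOC each
800
2> 8 * 400.   % absolute best case: 8 slots per day, which nobody sustains
3200

So somewhere between 800 and 3200 reviewed LOC per day, before anything else eats into those slots.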
If that sounds high, remember that this comprises everything. Boilerplate, tests, migrations, configuration, etc. And that this is the best case scenario, where most of the code is boilerplate and easy to review. I have single, basic test files that are more than 400 LOC. So the productivity of said developer is not going to be really high. Hell, if anything, this sounds like a hard ceiling on how much these LLM Coding Assistants can help in writing code.
But Wait, It Gets Worse
This makes the “It is ok, you just have to review everything like usual” argument hard to believe in. After all, even if its proponents were right, it would not really provide a speedup or that much more productivity, due to the limits of code reviews. On top of this, remember that this is empirical evidence coming from research on how human reviewers find defects in code written by other humans. We have no evidence that humans reviewing the code of LLM Coding Assistants are as effective as humans reviewing human code. If anything, we have some tentative evidence that humans who review code generated by LLM Coding Assistants are more confident that they found all the defects, while finding fewer of them.
The output of a Human Reviewer teamed with an LLM Coding Assistant is of lesser quality than that of a Human Reviewer teamed with a Human Coder. But the reviewer in the first case believes that they did a better job. So not only does the “just review everything” practice put hard limits on the productivity benefit of using LLM Coding Assistants, we do not even have solid proof that it works at all to fix the problem of LLM Coding Assistants being wrong, a lot.
Note that I never talk about the cost of fixing the defects here. I am only talking about the ability of professional developers, in a professional environment, to review code in order to flag problems. This is not even about how costly the defects introduced by LLM Coding Assistants are. This is about how costly reviewing the generated code is, whatever the defects are or however many of them there are. Even if the LLM Coding Assistants were extremely good, these costs would stay the same and productivity would hit the same limits.
I Mean, It Could Be Worse
Well. Bad news. It does get worse. There is a benefit of LLM Coding Assistants regularly touted by the developers who are proponents of them and are going all in on using them as part of their professional tools. It is that they can type and produce the code that is the most painful to author and deal with in your professional life. Here is a quote from a prominent and visible “bash the skeptics” blog post contemporary to me writing this, talking about what LLM Coding Assistants can produce.
Also: 100% of all the Bash code you should author ever again
I want to stop you there for a minute, so you can think through what the author means here. He means that you do not have to write your shell scripts yourself. You can now let the computer, the LLM Coding Assistant, do that torturous, painful part. No need to lose your sanity anymore: the code that is the easiest to get wrong, the hardest to review, the hardest to realize you made a fatal mistake in that will blow up everything, can now be written and handled by the machine. The code that is so loose on parsing but also so semantically overloaded that a simple typo in a punctuation mark could be utterly harmless or literally make you wipe a whole computer. That code.
Well, you can just let the thing that gets things wrong randomly, all the time, write it. And then you can just review it and never have to write it ever again. I mean, what could go wrong. It is not like it is the hardest code to review for mistakes, right? Right? RIGHT!? I mean, maybe they have not at all addressed the question of whether we can actually review LLM Coding Assistant output in a way that works out.
But for sure, they are not touting, as the main use of this, the code that is also the hardest to review effectively. RIGHT?! Yeah, they are. And this is where they lost me. They do not stop at not engaging with the argument of “are you sure that reviewing the code solves the problems?” or “but if we do all that reviewing, is it really more productive and better?”. No, they go deeper. They offer the code that we know, as a professional field, is the hardest to review and get right, as the best application for LLM Coding Assistants.
What To Do To Make Me Calm Down
Ok, I will calm down. Sure, no one engaged with my argument, sure I keep pointing at problems, sure I cry wolf at people using these tools to make their life easier. All nice, keep yelling at clouds, but how could someone convince me? After all, I could be wrong, couldn’t I? I mean, if I use science all the time to support my point, I should also have some ideas of what could prove me wrong, no?
Yes. First of all, I would like to see empirical research on the ability of human reviewers to find defects in LLM-generated code. And how fast. And how much of it we can do per day. We already have empirical studies of this for reviewing code written by humans, so if you want to convince me my argument is wrong, we will need a set of data about humans reviewing LLMs. We have some experimental data and some empirical data already, but the dataset is limited in size and in context. More reproduction would help. Note, however, that so far the scientific data points at humans being bad at reviewing LLM output (or at LLMs being good at avoiding detection). This is consistent with expectations, as LLMs are trained to evade detection. But maybe this was a fluke.
Secondly, you could try to show us how reviewing LLM-generated content is different from reviewing human-generated content. Maybe all the data and empirical evidence we have is not applicable here. Maybe LLMs are sufficiently different that this is a qualitatively different problem. In which case, my arguments against LLM Coding Assistants fall apart. Note, once again, that so far there is tentative, burgeoning evidence that reviewing LLM-generated content is, indeed, different. But it points at it being harder than reviewing human-generated content, which would make the argument I present here even stronger. So beware of what knowledge you seek.
And Yet I Am Mad
The argument above is not the only reason I am an AI skeptic, but it is probably the biggest one I have against LLM Coding Assistants. Mostly, I cannot see, knowing all of this about code reviews, how we can get benefits, as professional developers, out of LLM tools using this interface and process. It flies in the face of everything we know about producing quality code.
But the thing that really drives me crazy is not that we keep getting these tools and processes offered again and again by vendors, despite said scientific evidence. No, what drives me crazy and angry is being called “nuts” by someone who never even engages with the problem or the question I asked. I have yet to see a single one of the AI proponents even engage with the problem or try to do empirical research on it.
I am used to this behaviour. I have seen the same thing play out again and again in our field around things like TDD, type systems, separating test and coding teams, CI/CD, devops, etc. Anecdotal data always wins in our field, even when it flies against everything the empirical evidence shows. But still, I would like our field to move past its bloodletting era. Let’s try to have an actual discussion about ergonomics and empirical evidence. If you want to convince me, please stop calling me nuts or bringing anecdotal information about “hey, this time it worked for me”. Start looking at how we got all this empirical evidence around code review in the first place and go do some studies. Please. Let’s use this moment to start taking our professional tools seriously.
-
An SSR dev Browser API wishlist
08 Mar 2023
Last time, I shared my view on the current state of the browser APIs for SSR developers. This time, so as not to be seen as only criticizing, let’s look at what is in the pipeline, or could be in the pipeline, from the browsers to make our lives easier.
-
The Missing Middle of Funding Tech Companies
05 Dec 2022
Tech companies that make money for their investors seem to mostly… not exist. And the reason for that is that we have no financial instrument to fund them.
In this post I will review the existing financial instruments usually offered to finance the creation of a new firm. Try to show you these instruments do not work for tech companies that want to make money for their investors. And leave you at that.
-
What Diana has done, H1-2022 Edition
14 Sep 2022
So we are now past the first half of the year and I have not published a lot. I feel like I should at least give a small update on what I have been doing in the first half of 2022.
New Company, New Job Title
First of all, I am now an SRE at Indeed. I started looking for a new job at the start of 2022. Community was great, but the pace and risk of a startup in this kind of economy were not worth the salary I was getting. The offers I found as an Elixir engineer at the time were… underwhelming. Far too many red flags, far too many companies that seem not to understand remote work.
I then got approached by Indeed. They were launching a new SRE team, centered around Incident Management. This fit right in with the kind of stuff I have been learning about and exploring in the Learning From Incidents and Resilience Engineering communities, so it felt like an offer I could not refuse.
The interview process was the usual load of bullshit. But I am happy to say that since April 2022, I am a Senior SRE in the Incident Management team at Indeed.
Shipping Ryu in OTP 25
OTP 25 ships with the :short option for float_to_binary/2 and float_to_list/2. It gives a significant boost in performance and significantly reduces the memory impact of these operations. It may sound niche, but it is quite impressive on JSON encoding, as Michal Muskala showed on the Jason project. Please use it if you are doing Erlang. And for Elixir, you can use it already, but your library will slowly upgrade to it as we keep dropping support for older versions of OTP.
I spent a significant amount of time in 2021 getting that in. I had financial support from the EEF for part of it, which I am thankful for. To know more about what is entailed by this change and how it works, you can go watch the talk I gave at ElixirConf.
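Concretely, in an Erlang shell it looks roughly like this (the exact default output depends on your OTP version, but the idea holds):

1> float_to_binary(0.1).              % default: scientific notation, 20 digits
<<"1.00000000000000005551e-01">>
2> float_to_binary(0.1, [short]).     % OTP 25: shortest string that round-trips
<<"0.1">>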
That said, it brings us to the elephant in the room that I think deserves its own point.
The BEAM Situation
I am no longer being paid to work in the BEAM community. I am still bullish on Elixir, Erlang and the BEAM, and use them for all my personal projects. As you can see, I am still around in my free time. I would have preferred to stay in the BEAM engineering community as someone paid to do the work.
But I am getting really scared about the direction of the community right now. Don’t misunderstand me, José, Chris and all the maintainers are doing amazing work. The OTP team does great things. But the organizations that use the BEAM seem totally disconnected from the ecosystem.
As of today, there are still only one or two people paid to work on the BEAM itself outside of the OTP team at Ericsson. No one is paid to work on rebar3. Hex does not make enough money to pay the maintainers. The EEF does a great job, but is nowhere near the amount of donations that would allow engineers to see it as a force they can bet on to put food on the table. Contributing to OTP is still a massive pain in the ass, with tooling that simply does not work and takes hours to set up. Cowboy and its dependencies are not actively being worked on because Loïc needs to put food on the table. Fred Hébert, maintainer of tons of resources essential to the community, had to move to SRE to find a job that fit his needs.
And the orgs keep saying they have trouble recruiting, while the foundations are crumbling due to chronic underinvestment. Let’s be clear. I am not saying that other stacks have far better prospects. This is a reality that you could find, at one level or another, in every language and runtime. But the BEAM community is supposed to be about robustness and anticipating future failures. And in this domain, what I see is willful blindness.
And I understand why WASM or ML are areas to go after in order to attract new domains and new people. But consider that to build on top of foundations, you need them to be solid. And the BEAM’s may not be as solid as you think. If you want to know more, feel free to contact me, I have receipts and ideas.
Ok, Rant Over.
HyperLogLog
I spent a good part of my free time over the past 6 months working on my fork of the hyper reference library in Erlang. One of my personal projects needed a good distributed way to do count-distinct queries over massive sharded datasets. So I went looking at the Erlang reference implementation. It is used in Riak, it should be good, right?
- It uses rebar2 to compile, which is definitely not supported anymore.
- When porting it to rebar3, it fails to run its tests due to a dependency on :random, which was deprecated in favor of :rand a few years ago in Erlang (a minimal sketch of that kind of change follows the list).
- After porting the test suite to :rand, it became obvious that the library was… wrong. The test suite would not pass, purely because it got a different set of random inputs, despite the invariant being tested not depending on the input in theory.
- It was not even on Hex.
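Here is a minimal sketch of what moving from :random to :rand looks like in general (illustrative only, not the actual hyper diff; the old module may no longer exist on recent OTP releases, and exsss is the default algorithm in recent OTP):

-module(rand_migration_sketch).
-export([old_sample/0, new_sample/0]).

%% Before: the deprecated random module, seeded per process with three integers.
old_sample() ->
    random:seed(1, 2, 3),
    random:uniform(100).

%% After: the rand module, seeded with an algorithm name plus a seed term.
new_sample() ->
    rand:seed(exsss, {1, 2, 3}),
    rand:uniform(100).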
At this point, the library had simply been unmaintained for years. And buggy. So, time to fork! And publish on Hex. In order: I ported it to rebar3. Ported the tests. Added ex_doc for documentation. Published it on Hex. And then started the work of making the test suite make sense (no, Proper is not a test runner for unit tests), fixing the bugs and trying to make the codebase a bit more maintainable.
In the process, I managed to find a newer, more precise way to estimate cardinality, from a paper by Otmar Ertl. I added a few things I needed for safety, like versioning, in order to not merge two HLLs that are not compatible, which kinda matters in distributed systems. I have bigger plans for this, and it is definitely not at version 1.0 yet. At the very least I need better documentation.
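To give an idea of why that versioning and compatibility check matters, here is a minimal sketch (hypothetical code, not the actual API of my fork): merging two HyperLogLogs is only meaningful if they were built with the same precision, that is the same number of registers, so an incompatible pair should be refused instead of silently corrupting the estimate.

-module(hll_merge_sketch).
-export([merge/2]).

%% An HLL is represented here as {Version, Precision, Registers}.
%% Merging is an element-wise max over the registers, but only when both
%% sketches agree on version and precision.
merge({Vsn, P, RegsA}, {Vsn, P, RegsB}) ->
    Merged = lists:zipwith(fun erlang:max/2, RegsA, RegsB),
    {ok, {Vsn, P, Merged}};
merge(_, _) ->
    %% Different precision or version: refuse rather than return garbage.
    {error, incompatible_sketches}.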
But I think we are quickly coming to the point where it will be the most advanced and complete HyperLogLog implementation, in any language, that I have been able to find. Not there yet, mind you, but eh.
And not a lot more.
And that is mostly the extent of my FOSS work over the past 6 months. It may not sound like a lot, but the reality is that I can probably only find 2 to 4 hours a month to work on “personal” projects that do not pay me. And that limits what I can do. That is also why I have published so few things. A post easily takes 2h to write, check for language errors and publish. It competes with everything else for this limited time.
This is also why I am quite scared for the BEAM ecosystem. This situation is not specific to me. If we want things to evolve in the lower levels of the runtime, we are going to need the community to step up to support it. We cannot depend on Ericsson for everything. They have been great stewards, but at some point, they have their own priorities.
Until next time, I hope you are all having fun with your own projects in the meantime.
-
Float To String in OTP 25
01 Mar 2022
In OTP 25, you will have access to a new option for float_to_list/2 and float_to_binary/2. This was the result of work I contributed with the support of the EEF. Why did we do that, and when should you use it?
-
The Release Pipeline you can't have
21 Mar 2021
Let me tell you about the Release Pipeline we nearly built at one of my previous jobs. Part of it cannot be done today, because the technologies do not exist or are not composable enough to be used this way yet. But we are really close, and I really, really want this to become a thing. Because the benefits for everyone would be too big.
No One Like You
Let’s think about what we want from a high-level standpoint. The first thing is that I want the team that owns the service to own the whole pipeline. If there is infra done for them, it is at best a set of VMs and some networking. That way, the pains are theirs and they can follow what happens. And adapt as needed.
The second thing we want is for the Build system to know nothing about the deployment side. All it does is produce an artifact, Content Addressed by the hash of its content, and push it to whatever Content Addressed Store is used. At best it is allowed to set a flag on the content, in the Store, to say whether it is allowed to be deployed yet.
That store needs to be hostable in a variety of ways: Artifactory, S3, a filesystem… That is the third constraint. On top of that, the store needs to be able to communicate with the services somehow, to tell them a new version is available. Or the services need to query it regularly.
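Here is a minimal sketch of what that last option could look like on the service side (the store: module is a placeholder for whatever backend you host the store on, not an existing library):

-module(deploy_poller).
-export([check_for_update/2]).

%% Periodically called by the service: ask the store for the newest artifact
%% flagged as deployable, fetch it, and verify it against its content hash.
check_for_update(Store, CurrentHash) ->
    case store:latest_deployable(Store) of
        {ok, CurrentHash} ->
            up_to_date;
        {ok, NewHash} ->
            {ok, Artifact} = store:fetch(Store, NewHash),
            %% The artifact is content addressed: its key is its sha256 hash,
            %% so we can check locally that we got exactly what was published.
            case crypto:hash(sha256, Artifact) =:= NewHash of
                true  -> {deploy, NewHash, Artifact};
                false -> {error, hash_mismatch}
            end;
        {error, Reason} ->
            {error, Reason}
    end.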
The fourth constraint: the service needs to be able to self-orchestrate a deploy of the new version across all the machines that make up its deployment set. It also needs to be able to coordinate with the external services that need it. At the very least, to not disturb in-flight work too much.
Too Drunk To Drive Drunk
If the SolarWinds hack told us anything, it is that CI/CD is RCE-as-a-service. So to mitigate this kind of supply chain attack, we want our Build system to follow a few rules. It always starts with a blank environment. The inputs and dependencies need to come from a CAS, and be verified against their hash. The output will be a CAS artifact too. Everything is rebuilt from source, the only exception being that you can use the CAS as a cache. Every build ends with the machine being scrubbed clean.
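To make the content-addressing part concrete, here is a toy sketch using a plain directory as the store (purely illustrative, not one of the tools named below): the key of an artifact is the hash of its bytes, so anything fetched can be verified to be exactly what was stored.

-module(cas_sketch).
-export([store/2, fetch/2]).

%% Store some bytes under the hex-encoded sha256 of their content.
store(Dir, Bytes) ->
    Hash = binary:encode_hex(crypto:hash(sha256, Bytes)),
    ok = file:write_file(filename:join(Dir, Hash), Bytes),
    {ok, Hash}.

%% Fetch by hash, and refuse content that does not hash back to its own key.
fetch(Dir, Hash) ->
    {ok, Bytes} = file:read_file(filename:join(Dir, Hash)),
    case binary:encode_hex(crypto:hash(sha256, Bytes)) =:= Hash of
        true  -> {ok, Bytes};
        false -> {error, corrupted}
    end.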
The recipe itself is then CAS stored, and is the package you mark as deployable. You are not allowed to know more. The Build system lives outside of the rest and never talks to anything, except through content strictly identified by its hash. The typical examples here are Bazel, Nix, Guix, Habitat, …
That means that the team is responsible for updating their dependencies. To mitigate the cost, it may make sense to use a central team to update “low level” dependencies. In particular, things like libc, compilers, classic GNU utilities, …
This is the part you can have these days. Nix and Guix are well advanced and have a nice set of packages as a base to build on top of. Habitat has a nice set of packages too.
Takin’ Up Space
On the service side, this gets a bit harder.
On one hand, swapping the service atomically on the machine is doable. NixOS, Guix, Habitat even, they all can do it. A pretty good start.
Publishing the recipe and using that as a base? Same thing, we can do it.
Getting the information that an update is needed, that gets harder. There are mechanisms, but outside of Habitat, everyone is stuck with Docker and its friend Kubernetes.