Excerpt:

“Even within the coding, it’s not working well,” said Smiley. “I’ll give you an example. Code can look right and pass the unit tests and still be wrong. The way you measure that is typically in benchmark tests. So a lot of these companies haven’t engaged in a proper feedback loop to see what the impact of AI coding is on the outcomes they care about. Lines of code, number of [pull requests], these are liabilities. These are not measures of engineering excellence.”

Measures of engineering excellence, said Smiley, include metrics like deployment frequency, lead time to production, change failure rate, mean time to restore, and incident severity. And we need a new set of metrics, he insists, to measure how AI affects engineering performance.

“We don’t know what those are yet,” he said.

One metric that might be helpful, he said, is measuring tokens burned to get to an approved pull request – a formally accepted change in software. That’s the kind of thing that needs to be assessed to determine whether AI helps an organization’s engineering practice.
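As a rough illustration of the metric Smiley describes, here is a minimal sketch. It assumes each pull request record carries a token count and an approval flag. The `PullRequest` class, its fields, and the sample numbers are hypothetical, not taken from any real tool or from the article.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    tokens_used: int  # hypothetical: tokens spent by AI tooling on this PR
    approved: bool    # whether the change was formally accepted


def tokens_per_approved_pr(prs):
    """Total tokens spent across all PRs, divided by the number approved."""
    approved = sum(1 for pr in prs if pr.approved)
    if approved == 0:
        return None  # the metric is undefined when nothing was approved
    total_tokens = sum(pr.tokens_used for pr in prs)
    return total_tokens / approved


# Made-up sample data: tokens burned on the rejected PR still count as cost.
prs = [
    PullRequest(tokens_used=120_000, approved=True),
    PullRequest(tokens_used=300_000, approved=False),
    PullRequest(tokens_used=80_000, approved=True),
]
print(tokens_per_approved_pr(prs))  # 250000.0
```

Counting tokens from rejected pull requests in the numerator is a design choice in this sketch. It makes wasted attempts show up as a higher cost per accepted change.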

To underscore the consequences of not having that kind of data, Smiley pointed to a recent attempt to rewrite SQLite in Rust using AI.

“It passed all the unit tests, the shape of the code looks right,” he said. “It’s 3.7x more lines of code that performs 2,000 times worse than the actual SQLite. Two thousand times worse for a database is a non-viable product. It’s a dumpster fire. Throw it away. All that money you spent on it is worthless.”
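The gap between passing unit tests and performing acceptably can be shown with a toy example. This is only a sketch with hypothetical function names and made-up data. It has nothing to do with the actual SQLite rewrite, and the measured ratio will vary by machine.

```python
import timeit


def lookup_scan(rows, key):
    """Linear scan: returns the right answer, but costs O(n) per lookup."""
    for k, v in rows:
        if k == key:
            return v
    return None


def lookup_indexed(index, key):
    """Dictionary lookup: same answer, O(1) on average."""
    return index.get(key)


rows = [(i, f"value-{i}") for i in range(5_000)]
index = dict(rows)

# A typical unit test checks correctness only; both implementations pass.
assert lookup_scan(rows, 4_999) == lookup_indexed(index, 4_999) == "value-4999"

# A benchmark measures what the unit test cannot see.
scan_time = timeit.timeit(lambda: lookup_scan(rows, 4_999), number=200)
index_time = timeit.timeit(lambda: lookup_indexed(index, 4_999), number=200)
print(f"scan is roughly {scan_time / index_time:.0f}x slower")
```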

All the optimism about using AI for coding, Smiley argues, comes from measuring the wrong things.

“Coding works if you measure lines of code and pull requests,” he said. “Coding does not work if you measure quality and team performance. There’s no evidence to suggest that that’s moving in a positive direction.”

  • Tiresia@slrpnk.net
    13 hours ago

    I think that’s called a cargo cult. Just because something is a tech gadget doesn’t mean it’s going to change the world.

    Basically, the question is this: If you were to adopt it late and it became a hit, could you emulate the technology with what you have in the brief window between when your business partners and customers start expecting it and when you have adapted your workflow to include it?

    For computers, the answer was no. You had to get ahead of it so companies with computers could communicate with your computer faster than with any competitors.

    But e-mail is just a cheaper fax machine. And for office work, mobile phones are just digital secretaries+desk phones. Mobile phones were critical on the move, though.

    Even if LLMs were profitable, they are not going to be better at talking to LLMs than humans are. Put two LLMs together and they tend to enter hallucinatory death spirals, lose their sense of identity, or hit other failure modes. Computers could rely on communicable standards, but LLMs fundamentally don’t have standards. There is no API, no consistent internal data structure.

    If you put in the labor to make an LLM play nice with another LLM, you just end up with a standard API. And yes, it’s possible that this ends up being cheaper than humans, but it does mean you lose nothing by adopting late, when all the kinks have been worked out and protocols have been established. Just hire some LLM experts to do the transfer right the first time.