Ars points out that these findings contradict those of other experiments and then goes on to postulate as to why. I clicked on the link to the other experiment:
when data is combined across three experiments and 4,867 developers, our analysis reveals a 26.08% increase (SE: 10.3%) in completed tasks among developers using the AI tool
By comparison, this experiment considered 16 developers. That’s 0.3% as many as the experiments its findings contradict. Fortunately, the authors don’t claim their findings are broadly applicable. They even have a table that reads:
| We do not provide evidence that | Clarification |
| --- | --- |
| AI systems do not currently speed up many or most software developers | We do not claim that our developers or repositories represent a majority or plurality of software development work |
| AI systems do not speed up individuals or groups in domains other than software development | We only study software development |
| AI systems in the near future will not speed up developers in our exact setting | Progress is difficult to predict, and there has been substantial AI progress over the past five years [2] |
| There are not ways of using existing AI systems more effectively to achieve positive speedup in our exact setting | Cursor does not sample many tokens from LLMs, it may not use optimal prompting/scaffolding, and domain/repository-specific training/finetuning/few-shot learning could yield positive speedup |
That said, the study has been an interesting read so far. I highly recommend reading it directly rather than just the news posts about it. Check out their own blog post: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
I personally find the psychological effect - the devs thought they were 20% faster even afterward - to be pretty interesting, as it suggests that even if more time overall is spent, use of AI could reduce cognitive load and potentially side effects like burnout.
I’d like to see much larger scale studies set up like this, as well as studies of other real world situations. For example, how does AI use affect the amount of time it takes 10,000 different developers to onboard onto an unfamiliar repository?
There’s a whole history of people, both inside and outside the field, shifting the definition of AI to exclude any problem that had been the focus of AI research as soon as it’s solved.
Bertram Raphael said “AI is a collective name for problems which we do not yet know how to solve properly by computer.”
Pamela McCorduck wrote “it’s part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, but that’s not thinking” (Page 204 in Machines Who Think).
In Gödel, Escher, Bach: An Eternal Golden Braid, Douglas Hofstadter named “AI is whatever hasn’t been done yet” Tesler’s Theorem (crediting Larry Tesler).
https://praxtime.com/2016/06/09/agi-means-talking-computers/ reiterates the “AI is anything we don’t yet understand” point, but also touches on one reason why LLMs are still considered AI - because in fiction, talking computers were AI.
The author also quotes Jeff Hawkins’ book On Intelligence:
Now we can see the entire picture. Nature first created animals such as reptiles with sophisticated senses and sophisticated but relatively rigid behaviors. It then discovered that by adding a memory system and feeding the sensory stream into it, the animal could remember past experiences. When the animal found itself in the same or a similar situation, the memory would be recalled, leading to a prediction of what was likely to happen next. Thus, intelligence and understanding started as a memory system that fed predictions into the sensory stream. These predictions are the essence of understanding. To know something means that you can make predictions about it. …
The human cortex is particularly large and therefore has a massive memory capacity. It is constantly predicting what you will see, hear, and feel, mostly in ways you are unconscious of. These predictions are our thoughts, and, when combined with sensory input, they are our perceptions. I call this view of the brain the memory-prediction framework of intelligence.
If Searle’s Chinese Room contained a similar memory system that could make predictions about what Chinese characters would appear next and what would happen next in the story, we could say with confidence that the room understood Chinese and understood the story. We can now see where Alan Turing went wrong. Prediction, not behavior, is the proof of intelligence.
Another reason why LLMs are still considered AI, in my opinion, is that we still don’t understand how they work - and by that, I of course mean that LLMs have emergent capabilities that we don’t understand, not that we don’t understand how the technology itself works.
We are. Why do you think we stopped?
It may be aware of them, but not in that context. If you asked it how to solve the problem rather than to solve the problem for you, there’s a chance it would suggest you use a reverse image search.
LLM image processing doesn’t work the same way reverse image lookup does.
Tldr explanation: Multimodal LLMs turn pictures not into a thousand words but into 200-500 or so tokens, whereas reverse image lookups create a perceptual hash of your uploaded image and look that hash up in a database.
Much longer explanation:
Multimodal LLMs (technically, LMMs - large multimodal models) use vision transformers to turn images into tokens. They use tokens for words, too, but these image tokens don’t correspond to words. There are multiple ways this could be implemented, but a common approach is to break the image down into a grid, then transform each “patch” of a specific size, e.g., 16x16, into a single token. The patches aren’t transformed individually - the whole image is processed together, in context - but the model still ends up with roughly 200 or so tokens that allow it to respond to the image the same way it would respond to text.
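To make the “200 or so tokens” figure concrete, here is a minimal sketch of the patch arithmetic. The `patch_token_count` helper and the 224x224 and 336x336 input sizes are assumptions picked for illustration, not the settings of any particular model, and real models may add extra tokens or resize images differently.

```python
def patch_token_count(width: int, height: int, patch: int = 16) -> int:
    """Count the non-overlapping patch tokens a grid-based encoder would produce."""
    if width % patch or height % patch:
        # Assumption: the image has already been resized to a multiple of the patch size.
        raise ValueError("image dimensions must be divisible by the patch size")
    return (width // patch) * (height // patch)


# Hypothetical input sizes, chosen only to show the scale of the numbers.
print(patch_token_count(224, 224))  # 14 x 14 grid -> 196
print(patch_token_count(336, 336))  # 21 x 21 grid -> 441
```

Both results land in the 200-500 range, which is why a whole image ends up summarized by only a few hundred tokens.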
Current vision transformers also struggle with spatial awareness. They embed basic positional data into the tokens, but that encoding is fragile and unsophisticated. Fortunately there’s a lot to explore in that area so I’m sure there will continue to be improvements.
One example improvement, beyond improved spatial embeddings, would be to use a dynamic vision transformer that’s dependent on the context, or that can re-evaluate an image based on new information. Outside the use of vision transformers, simply training LMMs to use other tools on images when appropriate can potentially help with many of LMM image processing’s current shortcomings.
Given all that, asking an LLM to find the album for you (assuming you’ve given it the ability and permission to search the web) is like showing the image to someone with no context, then asking them to help you find which music video it’s from: a video they’ve never seen, by an artist whose appearance they can only describe with 10-20 generic words, none of which are the artist’s name. You’d then have to hope that the image contained, and that they remembered, the specific details that would make it come up in the top ten results if searched for on Google. That’s a convoluted way to say that it’s a hard task.
By contrast, reverse image lookup basically uses a perceptual hash generated for each image. It’s the tool that should be used for your particular problem, because it’s well suited for it. LLMs were the hammer and this problem was a torx screw.
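As a rough illustration of what a perceptual hash is, here is a toy “average hash” sketch. It is a simplification and an assumption on my part: real reverse image search services use more sophisticated techniques, and this version works on a tiny hand-written grayscale grid instead of a decoded, downscaled image. The names `average_hash`, `hamming_distance`, and `index` are made up for the example.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the image's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; a small distance means similar-looking images."""
    return bin(a ^ b).count("1")


# Toy 2x2 grayscale "images" (0-255). Real hashes use larger grids.
original = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]   # same picture with brightness raised by 20
different = [[200, 10], [30, 220]]  # a different picture

# Hypothetical database mapping known hashes to what the image is.
index = {average_hash(original): "album cover"}

print(index.get(average_hash(brighter)))  # the brightened copy still matches
print(hamming_distance(average_hash(original), average_hash(different)))
```

The point of the design is that the hash depends on the overall light/dark layout, so small changes like brightness or compression leave it intact, while a text description of the image would not help at all.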
Suggesting you use a reverse image lookup tool - or better, using one itself - is what the LLM should do in this instance. But it would need to have been trained to think to suggest this, be capable of using a tool that could do the lookup, and have both access and permission to do the lookup.
Here’s a paper that might help you understand the gaps between LMMs and tools built for a specific purpose: https://arxiv.org/html/2305.07895v7
Thank you! That gives me a starting point that should be easy to look up!
Why is 255 off limits? What is 127.0.0.0 used for?
To clarify, I meant that specific address - if the range starts at 127.0.0.1 for local, then surely 127.0.0.0 does something (or is reserved to sometimes do something, even if it never actually does in practice), too.
Advanced setup would include a reverse proxy to forward the requests from the application’s port to the internet.
I use Traefik as my reverse proxy, but I have everything on subdomains for simplicity’s sake (no path mapping except when necessary, which it generally isn’t). I know 127.0.0.53 has special meaning when it comes to how the machine directs particular requests, but I never thought to look into whether Traefik or any other reverse proxy supported routing rules based on the IP address. But unless there’s some way to specify that IP and the IP of the machine, it would be limited to same device communications. Makes me wonder if that’s used for any container system (vs the use of the 10, 172.16-31, and 192.168 blocks that I’ve seen used by Docker).
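For anyone wanting to check which block a given address falls into, here is a small sketch using Python’s standard `ipaddress` module. The `classify` helper and the sample addresses are assumptions for illustration; it only reports how the standard library categorizes each address and says nothing about how Traefik or Docker treat them.

```python
import ipaddress


def classify(addr: str) -> str:
    """Label an IPv4 address as loopback, private, or other."""
    ip = ipaddress.ip_address(addr)
    # Check loopback first so 127.x.x.x addresses are not reported as private.
    if ip.is_loopback:
        return "loopback"
    if ip.is_private:
        return "private"
    return "other"


samples = [
    "127.0.0.0", "127.0.0.1", "127.0.0.53",   # all inside 127.0.0.0/8
    "10.0.0.5", "172.16.0.1", "172.31.255.254", "192.168.1.10",
    "172.32.0.1",                              # just outside 172.16-31
]
for addr in samples:
    print(addr, classify(addr))
```

The whole 127.0.0.0/8 block is reported as loopback, not just 127.0.0.1, which is consistent with 127.0.0.53 having a local-only meaning.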
Well, this is another advanced setup, but if you wanted to segregate two applications on different subnets, you can. I’m not sure if there is a security benefit from adding the extra hop.
Is there an extra hop when you’re still on the same machine? Like an extra resolution step?
I still don’t understand why .255 specifically is prohibited. 8 bits can go up to 255, so it seems weird to prohibit one specific value. I’ve seen router subnet configurations that explicitly cap the top of the range at .254, though - I feel like I’ve also seen some that capped at .255 but I don’t have that hardware available to check. So my assumption is that it’s implementation specific, but I can’t think of an implementation that would need to reserve all the .255 values. If it was just the last one, that would make sense - e.g., as a convention for where the DHCP server lives on each network.
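One possible explanation for the .254 cap, offered as a sketch and not as a definitive answer: by convention the last address of a subnet is its broadcast address and the first is the network address, so which value is off limits depends on the subnet size. The `is_usable_host` helper below is hypothetical and assumes ordinary subnets (a prefix length of /30 or shorter).

```python
import ipaddress


def is_usable_host(addr: str, network: str) -> bool:
    """True if addr is in the network and is neither its network nor broadcast address."""
    net = ipaddress.ip_network(network)
    ip = ipaddress.ip_address(addr)
    return ip in net and ip not in (net.network_address, net.broadcast_address)


# In a /24, .255 is the broadcast address, so .254 is the highest usable host.
print(ipaddress.ip_network("192.168.1.0/24").broadcast_address)
print(is_usable_host("192.168.1.255", "192.168.1.0/24"))
print(is_usable_host("192.168.1.254", "192.168.1.0/24"))

# In a /16, only the very last address is the broadcast address,
# so an address ending in .255 in the middle of the range is an ordinary host.
print(is_usable_host("10.0.0.255", "10.0.0.0/16"))
print(is_usable_host("10.0.255.255", "10.0.0.0/16"))
```

That matches the intuition that reserving only the last address makes sense: on the common /24 home network, the last address just happens to end in .255.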
PSTN is wiretapped.
It’s a good thing that the website itself supports sending and receiving alerts, then.
When did Democrats have a supermajority in both the House and the Senate?
I thought Hue bulbs used Zigbee?
The up arrow moves through the letters, e.g., A->B->C. The down arrow moves to the next character in the sequence, e.g., C->CA->CAA. If you click past the correct letter, you’ll have to click all the way through again. And if you submit the wrong letter, you have to start all over (after it takes twenty seconds attempting to connect with the wrong password and then alerts you that it didn’t work, of course).
Fair point, I should have asked about commercial games in general
That said I didn’t mean that the game studio itself would do the AI training and own their models in-house; if they did, I’d expect it to go just as poorly as you would. Rather, I’d expect the model to be created by an organization specialized in that sort of thing.
For example, “Marey” is a GenAI model I found whose creators say it was trained ethically.
Another is Adobe Firefly, where Adobe says they trained only on licensed and public domain content. It also sounds like Adobe is paying the artists whose content was used for AI training. I believe that Canva is doing something similar.
StabilityAI is also doing something similar with Stable Audio 2.0, where they partnered with a music licensing company, AudioSparx, to ensure that artists are compensated, AI opt outs are respected, etc…
I haven’t dug into any of those too deeply, but they seem to be heading in the right direction at the surface level, at least.
One of the GenAI scenarios that’s the most terrifying to me is the idea of a company like Disney using all the material they have copyright for to train their own, proprietary GenAI image, audio, and video tools… not because I think the outputs would be bad, but because of the impact that would have on creators in that industry.
Fortunately, as long as copyright doesn’t apply to purely AI generated outputs, even if trained entirely on your own content, then I don’t think Disney specifically will do this.
I mention that as an example because that usage of AI, regardless of how ethically the model was trained, would still be unethical, in my opinion. Likewise in game creation, an ethically trained and operated model could still be used unethically to eliminate many people’s jobs solely in the interest of higher profits.
I’d be on board with AI use (in game creation or otherwise) if a company were to say, “We’re not changing the budget we have for our human workforce, including for contractors, licensed art, and so on, other than increasing it as inflation and wages increase. We will be using ethical AI models to create more content than we otherwise would have been able to.” But I feel like in a corporate setting, its use is almost always going to result in them cutting jobs.
Are you okay with AAA studios using GenAI that was trained only on licensed works?
Depends on your e-reader! If you have a Kindle, Kobo, or Nook, yes, that’s true. However:
Boox has e-readers that run Android, so you can install Hoopla on them. The Palma 2 is phone sized, which is great. The Page, Leaf2, and Go 7 are all in the 7” form factor, plus they have 6” versions. And they have tablet sizes, too. They have both traditional black and white and color e-ink displays.
I have the Boox Air 3C and the original Palma and both are great. I’ll likely get a Boox as my next standard sized e-reader, too (whenever I replace my Kindle Oasis). Though unless the technology drastically improves before then, it’ll be one with a black and white screen. (The color is nice in the tablet sizes, though, especially for comics from Hoopla.)
Some other options that I’m less familiar with include:
Okay, and? What nontechnical user cares enough to use it specifically when they could use Microsoft Office, Google Docs, Polaris Office, MobiOffice, WPS Office, Collabora, etc., instead?
But do nontechnical users care about the “missing” features? A lot of nontechnical users prefer simpler apps.
There is a version of Blender that was made for Android. It’s quite old, though. But if you’re competent enough with Blender that you’ve memorized all its keyboard shortcuts and workflows, you’re likely technical enough to get it working via Termux. But if not, Nomad Sculpt (on both iOS and Android), SpaceDraw (Android only), and several other apps can serve the same purposes.
Not sure why you listed video editing software and two different specific video editors, but Android and iOS both have LumaFusion. I’m sure there are other decent editors but I haven’t used them because LumaFusion is great. iPads do have DaVinci Resolve, though, for what that’s worth. If you care about using a FOSS video editor then you should care enough to install it via Termux. But let’s be real, most nontechnical users are probably happy using CapCut.
DJ software - Cross DJ is free. There are other alternatives. And there are web based DJ software apps like YouDJ.
OnlyOffice is available on Android already.
“any linux app” - I don’t think any nontechnical users want GParted on their Android phones, and it wouldn’t work anyway.
Android has its own games, same as iOS. Nontechnical users are way more likely to want Windows games than Linux games anyway.
Wine used to be developed natively for Android but they stopped a few years back. You can still download it at WineHQ, though. I think Box64 with Wine is a decent option?
Overall, the thing I’m confused about is why you think Google or any major Android phone manufacturer has a motivation to make native Linux apps more accessible. Google certainly doesn’t want to make it easier for you to use the better versions of their competitors’ apps. Google is moving further away from Linux, not closer. Providing a usable, good enough desktop experience that’s still Android underneath makes far more sense for them.
Fortunately, like I said earlier, there are workarounds to get access to those Linux apps.
The thing that is more likely to change is for the creators of Android apps to build apps that function better when used in a phone-as-desktop format. And even if they don’t, there are enough competent web apps out there that just being able to use your browser full screen on a monitor solves 90% of people’s actual use cases - and probably over 95% when you include the other apps that have decent desktop experiences that can be run alongside them.
The Steam Deck approach is much closer to what you seem to want. The Steam Deck is an actually competent Linux machine that has a Valve-supported compatibility layer in Proton for running non-Linux games. It plugs into a USB-C hub connected to a monitor, mouse, and keyboard just fine, can install any Linux app, etc… It’s completely usable handheld as well. But it isn’t a phone, and even though it’s quite portable, it’s not “stick into your pocket” portable.
I don’t expect a major manufacturer to make a Linux phone any time soon, and I don’t think the Linux phones that are out already have - or will have in the next 5 years - a smooth enough experience to convince any nontechnical user to switch.
There’s a difference between an answer from ChatGPT and an answer from ChatGPT that’s been reviewed by a person, particularly if that person is knowledgeable about the topic. ChatGPT isn’t deterministic, so if I go and ask ChatGPT the same thing, there’s no guarantee I’ll get a similar answer at all.
The problem for me is that I have no way of knowing whether the person posting the ChatGPT response is or isn’t an expert and whether they actually reviewed the output. However, that’s true of people in general, just replace “reviewing the output” with “not trolling,” so the effort to assess the utility of a comment is pretty similar.
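The non-determinism mentioned above can be illustrated with a toy sampler. This is only a sketch: the `sample_token` helper, the three-word vocabulary, and the scores are made up for illustration, and nothing here reflects how ChatGPT is actually implemented or configured.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick the next token: greedily at temperature 0, otherwise by weighted sampling."""
    if temperature == 0:
        # Always take the highest-scoring token, so repeated runs agree.
        return max(logits, key=logits.get)
    # Softmax-style weights; higher temperature flattens the distribution.
    weights = [math.exp(score / temperature) for score in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


# Hypothetical scores for the first word of an answer.
logits = {"yes": 2.0, "no": 1.5, "maybe": 1.0}

greedy = {sample_token(logits, 0, random.Random(seed)) for seed in range(200)}
sampled = {sample_token(logits, 1.0, random.Random(seed)) for seed in range(200)}
print(greedy)        # a single answer every time
print(len(sampled))  # more than one distinct answer across runs
```

Because each sampled word feeds into the next one, small differences early on can snowball into a very different answer, which is why two people asking the same question can get responses that disagree.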