Ok, I can buy this
> It is about the engineering of context and providing the right information and tools, in the right format, at the right time.
when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?
If the definition of "right" information is "information which results in a sufficiently accurate answer from a language model" then I fail to see how you are doing anything fundamentally different from prompt engineering. Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally distinguishable from "trying and seeing" with prompts.
I don't quite follow. Prompts and contexts are different things. Sure, you can get things into the context with prompts, but that doesn't mean they are entirely the same.
You could have a long running conversation with a lot in the context. A given prompt may work poorly, whereas it would have worked quite well earlier. I don't think this difference is purely semantic.
For whatever it's worth I've never liked the term "prompt engineering." It is perhaps the quintessential example of overusing the word engineering.
The term engineering makes little sense in this context, but really... Did it make sense for eg "QA Engineer" and all the other jobs we tacked it on, too? I don't think so, so it's kinda arguing after we've been misusing the term for well over 10 yrs
When you play with the APIs the prompt/context all blurs together into just stuff that goes into the text fed to the model to produce text. Like when you build your own basic chatbot UI and realize you're sending the whole transcript along with every step. Using the terms from the article, that's "State/History." Then "RAG" and "Long term memory" are ways of working around the limits of context window size and the tendency of models to lose the plot after a huge number of tokens, to help make more effective prompts. "Available tools" info also falls squarely in the "prompt engineering" category.
The reason prompt engineering is going the way of the dodo is because tools are doing more of the drudgery to make a good prompt themselves. E.g., finding relevant parts of a codebase. They do this with a combination of chaining multiple calls to a model together to progressively build up a "final" prompt plus various other less-LLM-native approaches (like plain old "find").
So yeah, if you want to build a useful LLM-based tool for users you have to write software to generate good prompts. But... it ain't really different than prompt engineering other than reducing the end user's need to do it manually.
It's less that we've made the AI better and more that we've made better user interfaces than just-plain-chat. A chat interface on a tool that can read your code can do more, more quickly, than one that relies on you selecting all the relevant snippets. A visual diff inside of a code editor is easier to read than a markdown-based rendering of the same in a chat transcript. Etc.
Never mind that prompt engineering goes back to pure LLMs before ChatGPT was released (i.e. before the conversation paradigm was even the dominant one for LLMs), and covers everything from few-shot prompting (including question-answer pairs) and providing tool definitions and examples to retrieval augmented generation and conversation history manipulation. In academic writing, LLMs are often defined as a distribution P(y|x), where x is not infrequently referred to as the prompt. In other words, anything that comes before the output is considered the prompt.
But if you narrow the definition of "prompt" down to "user instruction", then you get to ignore all the work that's come before and talk up the new thing.
But personally I think a focus on "prompt" that refers to a specific text box in a specific application vs using it to refer to the sum total of the model input increases confusion about what's going on behind the scenes. At least when referring to products built on the OpenAI Chat Completions APIs, which is what I've used the most.
Building a simple dummy chatbot UI is very informative here for de-mystifying things and avoiding misconceptions about the model actually "learning" or having internal "memory" during your conversation. You're just supplying a message history as the model input prompt. It's your job to keep submitting the history - and you're perfectly able to change it if you like (such as rolling up older messages to keep a shorter context window).
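A minimal sketch of that loop, assuming the OpenAI Python SDK's chat completions API (the model name and the pruning rule are just illustrative):

from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    history.append({"role": "user", "content": input("> ")})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    text = response.choices[0].message.content
    print(text)
    history.append({"role": "assistant", "content": text})
    # The "memory" is just this list. Nothing stops you from editing it,
    # e.g. keeping the system message plus only the most recent turns:
    if len(history) > 41:
        history = history[:1] + history[-40:]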
For that kind of tasks (and there are many of those!), I don't see why you would expect something fundamentally different in the case of LLMs.
Exactly the problem with all "knowing how to use AI correctly" advice out there rn. Shamans with drums, at the end of the day :-)
Drew Breunig has been doing some fantastic writing on this subject - coincidentally at the same time as the "context engineering" buzzword appeared but actually unrelated to that meme.
How Long Contexts Fail - https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-ho... - talks about the various ways in which longer contexts can start causing problems (also known as "context rot")
How to Fix Your Context - https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.... - gives names to a bunch of techniques for working around these problems including Tool Loadout, Context Quarantine, Context Pruning, Context Summarization, and Context Offloading.
> The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round.
In the military you don't select your abilities before entering a level.
Does he pretend to give the etymology and ultimate origin of the term, or just where he or other AI discussions found it? Because if it's the latter, he is entitled to call it a "gaming" term, because that's what it is to him and those in the discussion. He didn't find it in some military manual or learn it at boot camp!
But I would mostly challenge the idea that this mistake, if we admit it as such, is "significant" in any way.
The origin of loadout is totally irrelevant to the point he makes and the subject he discusses. It's just a useful term he adopted; its history is not really relevant.
Doesn't seem that significant?
Not that those blog posts say much that any "prompt engineer" (someone who uses LLMs frequently) doesn't already know anyway, but maybe it is useful to some at such an early stage of these things.
For example: in form, things like negative shape and overlap. In color contrast, things like ratio contrast and dynamic range contrast. Or how manipulating neighboring regional contrast produces tone wrap. I could go on.
One reason for this state of affairs is that artists and designers lack the consistent terminology to describe what they are doing (though this does not stop them from operating at a high level). Indeed, many of the terms I have used here we (my colleagues and I) had to invent ourselves. I would love to work with an AI guru to address this developing problem.
I don't think they do. It may not be completely consistent, but open any art book and you find the same thing being explained again and again. Just for drawing humans, you will find emphasis on the skeleton and muscle volume for forms and poses, planes (especially the head) for values and shadows, some abstract things like stability and line weight, and some more concrete things like foreshortening.
Several books and courses have gone over those concepts. They are not difficult to explain, they are just difficult to master. That's because you have to apply judgement for every single line or brush stroke, deciding which factors matter most and whether you even want to make the stroke. Then there's the whole hand-eye coordination.
So unless you can solve judgement (which styles derive from), there's not a lot of hope there.
ADDENDUM
And when you do a study of another's work, it's not about copying the data, extracting colors, or comparing labels... it's about studying judgement. You already know the complete formula, of which a more basic version is being used in the work, and you only want to recover the parameters. Machine training, by contrast, is mostly going after the wrong formula with completely different variables.
How does this actually work, and how can one better define this to further improve the prompt?
This statement feels like the 'draw the rest of the fucking owl' referred to elsewhere in the thread
The "Read large enough context to ensure you get what you need" quote is from a different post entirely, this one: https://simonwillison.net/2025/Jun/30/vscode-copilot-chat/
That's part of the system prompts used by the GitHub Copilot Chat extension for VS Code - from this line: https://github.com/microsoft/vscode-copilot-chat/blob/40d039...
The full line is:
When using the {ToolName.ReadFile} tool, prefer reading a
large section over calling the {ToolName.ReadFile} tool many
times in sequence. You can also think of all the pieces you
may be interested in and read them in parallel. Read large
enough context to ensure you get what you need.
That's a hint to the tool-calling LLM that it should attempt to guess which area of the file is most likely to include the code that it needs to review. It makes more sense if you look at the definition of the ReadFile tool:
https://github.com/microsoft/vscode-copilot-chat/blob/40d039...
description: 'Read the contents of a file. Line numbers are
1-indexed. This tool will truncate its output at 2000 lines
and may be called repeatedly with offset and limit parameters
to read larger files in chunks.'
The tool takes three arguments: filePath, offset and limit.

llm -m openai/o3 \
-f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
-f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
-s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'
Which produced this: https://gist.github.com/simonw/249e16edffe6350f7265012bee9e3...

Beautiful one-shot results, and I now have really nice animations of some complex maths to help others understand. (I'll put it up on YouTube soon.)
I don't know the manim library at all, so this saved me about a week of work learning and implementing it.
Anyone basing their future agentic systems on current LLMs would likely face LangChain fate - built for GPT-3, made obsolete by GPT-3.5.
https://arxiv.org/abs/2402.04253
For long contexts start with activation beacons and RoPE scaling.
Drew calls that one "Tool Loadout" https://www.dbreunig.com/2025/06/26/how-to-fix-your-context....
This field, I swear...it's the PPAP [1] of engineering.
[1] https://www.youtube.com/watch?v=NfuiB52K7X8
"I have a toool...I have a seeeeearch...unh! Now I have a Tool Loadout!" *dances around in leopard print pyjamas*
Cloud API recommender systems must seem like a gift to that industry.
Not my area anyways but I couldn't see a profit model for a human search for an API when what they wanted is well covered by most core libraries in Python etc...
Hmm first time hearing about this, could you share any examples please?
you just need to knowingly resource what glue code is needed, and build it in a way it can scale with whatever new limits that upgraded models give you.
i can’t imagine a world where people aren’t building products that try to overcome the limitations of SOTA models
With that in mind, what would be the business sense in siloing a single "Agent" instead of using something like a service discovery service that all benefit from?
Also, the current LLMs still have too many issues because they are autoregressive and heavily biased towards the first few generated tokens. They also still don't have full bidirectional awareness of certain relationships due to how they are masked during training. Discrete diffusion looks interesting, but I'm not sure how that class of models deals with tools, as I've never seen one of them use any.
Observation: this isn't anything that can't be automated.
It is somewhat bothersome to have another buzz phrase. I don't know why we are doing this, other than there was a Xeet from the Shopify CEO, QT'd approvingly by Karpathy, then it was written up at length and tied to another set of blog posts.
To wit, it went from "buzzphrase" to "skill that'll probably be useful in 3 years still" over the course of this thread.
Has it even been a week since the original tweet?
There doesn't seem to be a strong foundation here, but due to the reach potential of the names involved, and their insistence on this being a thing while also indicating they're sheepish it is a thing, it will now be a thing.
Smacks of a self-aware version of Jared Friedman's tweet re: watching the invention of "Founder Mode" was like a startup version of the Potsdam Conference. (which sorted out Earth post-WWII. and he was not kidding. I could not even remember the phrase for the life of me. Lasted maybe 3 months?)
I find they take off when someone crystallizes something many people are thinking about internally without realizing everyone else is having similar thoughts. In this example, I think the way agent and app builders are wrestling with LLMs is fundamentally different from the way chatbot users do (it's closer to programming), and this phrase resonates with that crowd.
Here’s an earlier write up on buzzwords: https://www.dbreunig.com/2020/02/28/how-to-build-a-buzzword....
EDIT: Ah, you also wrote the blog posts tied to this. It gives 0 comfort that you have a blog post re: building buzz phrases in 2020, rather, it enhances the awkward inorganic rush people are self-aware of.
And I wrote the first post before the meme.
We should be able to name the source of this sheepishness and have fun with that we are all things at once: you can be a viral hit 2002 super PhD with expertise in all areas involved in this topic that has brought pop attention onto something important, and yet, the hip topic you feel centered on can also make people's eyes roll temporarily. You're doing God's work. The AI = F(C) thing is really important. Its just, in the short term, it will feel like a buzzword.
This is much more about me playing with, what we can reduce to, the "get off my lawn!" take. I felt it interesting to voice because it is a consistent undercurrent in the discussion and also leads to observable absurdities when trying to describe it. It is not questioning you, your ideas, or work. It has just come about at a time when things become hyperreal hyperquickly and I am feeling old.
However, many fundamental phenomena are missing from the "context engineering" scope, so neither context engineering nor prompt engineering are useful terms.
Surely not prompt engineering itself, for example.
We are entering a new era of gamification of programming, where the power users force their imaginary strategies on innocent people by selling them to the equally clueless and gaming-addicted management.
This really does sound like Computer Science since its very beginnings.
The only difference is that now it's a much more popular field, and not restricted to a few nerds sharing tips over e-mail or bbs.
Except in actual computer science you can prove that your strategies, discovered by trial and error, are actually good. Even though Dijkstra invented his eponymous algorithm by writing on a napkin, it's phrased in the language of mathematics and one can analyze quantitatively its effectiveness and trade-offs, and one can prove if it's optimal (as was done recently).
that applies to basically any domain-specific terminology, from WoW raids through cancer research to computer science and say OpenStreetMap
25 years ago it was object oriented programming.
People are using their thinking caps and modelling data.
I had one view of what these things were and how they work, and a bunch of outcomes attached to that. And then I spent a bunch of time training language models in various ways and doing other related upstream and downstream work, and I had a different set of beliefs and outcomes attached to it. The second set of outcomes is much preferable.
I know people really want there to be some different answer, but it remains the case that mastering a programming tool involves implementing such a thing, to one degree or another. I've only done medium-sophistication ML programming, and my understanding is therefore kinda medium, but like compilers, even doing a medium one is the difference between getting good results from a high-complexity one and guessing.
Go train an LLM! How do you think Karpathy figured it out? The answer is on his blog!
There will always be a crowd that wants the "master XYZ in 72 hours with this ONE NEAT TRICK" course, and there will always be a..., uh, group of people serving that market need.
But most people? Especially in a place like HN? I think most people know that getting buff involves going to the gym, especially in a place like this. I have a pretty high opinion of the typical person. We're all tempted by the "most people are stupid" meme, but that's because bad interactions are memorable, not because most people are stupid or lazy or whatever. Most people are very smart if they apply themselves, and most people will work very hard if the reward for doing so is reasonably clear.
Building powerful and reliable AI Agents is becoming less about finding a magic prompt or model updates. It is about the engineering of context and providing the right information and tools, in the right format, at the right time. It’s a cross-functional challenge that involves understanding your business use case, defining your outputs, and structuring all the necessary information so that an LLM can “accomplish the task."
That’s actually also true for humans: the more context (aka right info at the right time) you provide the better for solving tasks.
Context is often incomplete, unclear, contradictory, or just contains too much distracting information. Those are all things that will cause an LLM to fail that can be fixed by thinking about how an unrelated human would do the job.
It's easy to forget that the conversation itself is what the LLM is helping to create. Humans will ignore or deprioritize extra information; they also need that extra information to get a loose sense of what you're looking for. The LLM is much more easily influenced by any extra wording you include, and loose guiding is likely to become strict guiding.
Maybe not very often in a chat context, my experience is in trying to build agents.
Of course, that comment was just one trivial example, this trope is present in every thread about LLMs. Inevitably, someone trots out a line like "well humans do the same thing" or "humans work the same way" or "humans can't do that either". It's a reflexive platitude most often deployed as a thought-terminating cliche.
These days, so can LLM systems. The tool calling pattern got really good in the last six months, and one of the most common uses of that is to let LLMs search for information they need to add to their context.
o3 and o4-mini and Claude 4 all do this with web search in their user-facing apps and it's extremely effective.
The same pattern is increasingly showing up in coding agents, giving them the ability to search for relevant files or even pull in official documentation for libraries.
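The shape of that pattern, sketched against the OpenAI chat completions tool-calling API; search_web here is a hypothetical helper you'd wire to a real search backend:

import json
from openai import OpenAI

client = OpenAI()

def search_web(query: str) -> str:
    # hypothetical helper: call a real search backend here and return snippets
    return "(search results for: " + query + ")"

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What changed in Django 5.2?"}]
msg = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages, tools=tools
).choices[0].message

if msg.tool_calls:
    messages.append(msg)  # keep the assistant's tool request in the transcript
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_web(args["query"]),
        })
    # Second pass: the model now has the search results in its context.
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(final.choices[0].message.content)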
Until we can scan your brain and figure out what you really want, it's going to be necessary to actually describe what you want built, and not just rely on vibes.
(X-Y problem, for example.)
Yes: if you have an over-eager but inexperienced entity that wants nothing more than to please you by writing as much code as possible, then as the entity's lead, you have to architect a good space where it has all the information it needs but can't get easily distracted by nonessential stuff.
RAG wasn’t invented this year.
Proper tooling that wraps esoteric knowledge like using embeddings, vector DBs or graph DBs is becoming more mainstream. Big players are improving their tooling, so more stuff is available.
One thing that is missing from this list is: evaluations!
I'm shocked how often I still see large AI projects being run without any regard to evals. Evals are more important for AI projects than test suites are for traditional engineering ones. You don't even need a big eval set, just one that covers your problem surface reasonably well. However without it you're basically just "guessing" rather than iterating on your problem, and you're not even guessing in a way where each guess is an improvement on the last.
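Even a tiny eval harness changes how you iterate. A sketch (the cases and the generate() wrapper are whatever fits your problem):

# Each case is (input, check). Run this after every prompt/context change so
# every guess gets measured against the same small slice of your problem surface.
EVALS = [
    ("Summarize: 'The invoice is due on March 3.'", lambda out: "March 3" in out),
    ("Extract the email from: 'Contact bob@example.com.'", lambda out: "bob@example.com" in out),
]

def run_evals(generate):
    # generate(prompt) wraps your context assembly + model call and returns text
    passed = sum(1 for prompt, check in EVALS if check(generate(prompt)))
    print(f"{passed}/{len(EVALS)} passed")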
edit: To clarify, I ask myself this question. It's frequently the case that we expect LLMs to solve problems without the necessary information for a human to solve them.
"Make it possible for programmers to write in English and you will find that programmers cannot write in English."
It's meant to be a bit tongue-in-cheek, but there is a certain truth to it. Most human languages fail at being precise in their expression and interpretation. If you can exactly define what you want in English, you probably could have saved yourself the time and written it in a machine-interpretable language.
For those actually using the products to make money: well, hey, all of those have evaluations.
God bless the people who give large scale demos of apps built on this stuff. It brings me back to the days of doing vulnerability research and exploitation demos, in which no matter how much you harden your exploits, it's easy for something to go wrong and wind up sputtering and sweating in front of an audience.
I've seen lots of AI demos that prompt "build me a TODO app", pretend that is sufficient context, and then claim that the output matches their needs. Without proper context, you can't tell if the output is correct.
"Back in the day", we had to be very sparing with context to get great results so we really focused on how to build great context. Indexing and retrieval were pretty much our core focus.
Now, even with the larger windows, I find this still to be true.
The moat for most companies is actually their data, data indexing, and data retrieval[0]. Companies that 1) have the data and 2) know how to use that data are going to win.
My analogy is this:
> The LLM is just an oven; a fantastical oven. But for it to produce a good product still depends on picking good ingredients, in the right ratio, and preparing them with care. You hit the bake button, then you still need to finish it off with presentation and decoration.
[0] https://chrlschn.dev/blog/2024/11/on-bakers-ovens-and-ai-sta...

Single prompts can only get you so far (surprisingly far actually, but then they fall over quickly).
This is actually the reason I built my own chat client (~2 years ago), because I wanted to “fork” and “prune” the context easily; using the hosted interfaces was too opaque.
In the age of (working) tool-use, this starts to resemble agents calling sub-agents, partially to better abstract, but mostly to avoid context pollution.
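Since the context is just a list of messages, forking and pruning are really a copy and a slice. A rough sketch of what such a client does under the hood (assuming the system message comes first and a summary can stand in for dropped turns):

import copy

def fork(history):
    # explore a side branch without polluting the main conversation
    return copy.deepcopy(history)

def prune(history, keep_last=10, summary=None):
    head = history[:1]  # keep the system message
    if summary:
        head.append({"role": "assistant", "content": f"Summary of earlier turns: {summary}"})
    return head + history[-keep_last:]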
A big textarea, you plug in your prompt, click generate, the completions are added in-line in a different color. You could edit any part, or just append, and click generate again.
90% of contemporary AI engineering these days is reinventing well understood concepts "but for LLMs", or in this case, workarounds for the self-inflicted chat-bubble UI. aistudio makes this slightly less terrible with its edit button on everything, but still not ideal.
I thought it would also be neat to merge contexts, by maybe mixing summarizations of key points at the merge point, but never tried.
Alchemical is "you are the world's top expert on marketing, and if you get it right I'll tip you $100, and if you get it wrong a kitten will die".
The techniques in https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.... seem a whole lot more rational to me than that.
If you look at how sophisticated current LLM systems work there is so much more to this.
Just one example: Microsoft open sourced VS Code Copilot Chat today (MIT license). Their prompts are dynamically assembled with tool instructions for various tools based on whether or not they are enabled: https://github.com/microsoft/vscode-copilot-chat/blob/v0.29....
And the autocomplete stuff has a wealth of contextual information included: https://github.com/microsoft/vscode-copilot-chat/blob/v0.29....
You have access to the following information to help you make
informed suggestions:
- recently_viewed_code_snippets: These are code snippets that
the developer has recently looked at, which might provide
context or examples relevant to the current task. They are
listed from oldest to newest, with line numbers in the form
#| to help you understand the edit diff history. It's
possible these are entirely irrelevant to the developer's
change.
- current_file_content: The content of the file the developer
is currently working on, providing the broader context of the
code. Line numbers in the form #| are included to help you
understand the edit diff history.
- edit_diff_history: A record of changes made to the code,
helping you understand the evolution of the code and the
developer's intentions. These changes are listed from oldest
to latest. It's possible a lot of old edit diff history is
entirely irrelevant to the developer's change.
- area_around_code_to_edit: The context showing the code
surrounding the section to be edited.
- cursor position marked as ${CURSOR_TAG}: Indicates where
the developer's cursor is currently located, which can be
crucial for understanding what part of the code they are
focusing on.
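Going back to the first link: the "dynamically assembled with tool instructions" part is simpler than it sounds. Only the instructions for the tools that are actually enabled make it into the prompt. A toy sketch (the tool names and wording here are made up; the real ones are in the linked repo):

TOOL_INSTRUCTIONS = {
    "read_file": "Use read_file to inspect sources; prefer one large read over many small ones.",
    "run_tests": "Use run_tests after every change and report failures verbatim.",
    "web_search": "Use web_search only when the answer is not in the provided files.",
}

def build_system_prompt(enabled_tools, base="You are a coding assistant."):
    lines = [base] + [TOOL_INSTRUCTIONS[t] for t in enabled_tools if t in TOOL_INSTRUCTIONS]
    return "\n".join(lines)

print(build_system_prompt(["read_file", "run_tests"]))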
For example, while the specifics of the prompts you're highlighting are unique to Copilot, I've basically implemented the same ideas on a project I've been working on, because it was clear from the limitations of these models that sooner rather than later it was going to be necessary to pick and choose amongst tools.
LLM "engineering" is mostly at the same level of technical sophistication that web work was back when we were using CGI with Perl -- "hey guys, what if we make the webserver embed the app server in a subprocess?" "Genius!"
I don't mean that in a negative way, necessarily. It's just...seeing these "LLM thought leaders" talk about this stuff in thinkspeak is a bit like getting a Zed Shaw blogpost from 2007, but fluffed up like SICP.
I don't think that's true.
Even if it is true, there's a big difference between "thinking about the problem" and spending months (or even years) iteratively testing out different potential prompting patterns and figuring out which are most effective for a given application.
I was hoping "prompt engineering" would mean that.
OK, well...maybe I should spend my days writing long blogposts about the next ten things that I know I have to implement, then, and I'll be an AI thought-leader too. Certainly more lucrative than actually doing the work.
Because that's literally what's happening -- I find myself implementing (or having implemented) these trendy ideas. I don't think I'm doing anything special. It certainly isn't taking years, and I'm doing it without reading all of these long posts (mostly because it's kind of obvious).
Again, it very much reminds me of the early days of the web, except there's a lot more people who are just hype-beasting every little development. Linus is over there quietly resolving SMP deadlocks, and some influencer just wrote 10,000 words on how databases are faster if you use indexes.
The goal is to design a probability distribution to solve your task by taking a complicated probability distribution and conditioning it, and the more detail you put into thinking about ("how to condition for this?" / "when to condition for that?") the better the output you'll see.
(what seems to be meant by "context" is a sequence of these conditioning steps :) )
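Written down, the whole exercise is just choosing what goes into x:

P(y \mid x), \qquad x = (\text{instructions},\ \text{few-shot examples},\ \text{retrieved docs},\ \text{tools},\ \text{history},\ \text{user message})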
I mean yes, duh, relevant context matters. This is why so much effort was put into things like RAG, vector DBs, prompt synthesis, etc. over the years. LLMs still have pretty abysmal context windows so being efficient matters.
LLM farts — Stochastic Wind Release.
The latest one is yet another attempt to make prompting sound like some kind of profound skill, when it’s really not that different from just knowing how to use search effectively.
Also, “context” is such an overloaded term at this point that you might as well just call it “doing stuff” — and you’d objectively be more descriptive.
While models were less powerful a couple of years ago, there was nothing stopping you at that time from taking a highly dynamic approach to what you asked of them as a "prompt engineer"; you were just more vulnerable to indeterminism in the contract with the models at each step.
Context windows have grown larger; you can fit more in now, push out the need for fine-tuning, and get more ambitious with what you dump in to help guide the LLM. But I'm not immediately sure what skill requirements fundamentally change here. You just have more resources at your disposal, and can care less about counting tokens.
https://twitter.com/karpathy/status/1937902205765607626
> [..] in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. Science because doing this right involves task descriptions and explanations, few shot examples, RAG, related (possibly multimodal) data, tools, state and history, compacting... Too little or of the wrong form and the LLM doesn't have the right context for optimal performance. Too much or too irrelevant and the LLM costs might go up and performance might come down. Doing this well is highly non-trivial. And art because of the guiding intuition around LLM psychology of people spirits.
I think it's just game theory in play and we can do nothing but watch it play out. The "up side" is insane, potentially unlimited. The price is high, but so is the potential reward. By the rules of the game, you have to play. There is no other move you can make. No one knows the odds, but we know the potential reward. You could be the next T company easy. You could realistically go from startup -> 1 Trillion in less than a year if you are right.
We need to give this time to play itself out. The "odds" will eventually be better estimated and it'll affect investment. In the mean time, just give your VC Google's, Microsoft's, or AWS's direct deposit info. It's easier that way.
You are constructing the set of context, policies, directed attention toward some intentional end, same as it ever was. The difference is you need fewer meat bags to do it, even as your projects get larger and larger.
To me this is wholly encouraging.
Some projects will remain outside what models are capable of, and your role as a human will be to stitch many smaller projects together into the whole. As models grow more capable, that stitching will still happen - just at larger levels.
But as long as humans have imagination, there will always be a role for the human in the process: as the orchestrator of will, and ultimate fitness function for his own creations.
"for their own creations" is grammatically valid, and would avoid accusations of sexism!
I feel the OP's blog is largely a duplicate of the LangChain post above from a week ago.
Obviously we’ve got to tame the version of LLMs we’ve got now, and this kind of thinking is a step in the right direction. What I take issue with is the way this thinking is couched as a revolutionary silver bullet.
But looking at the trend of these tools, the help they require is becoming higher and higher level, and they are becoming more and more capable of doing longer, more complex tasks, as well as being able to find the information they need from other systems/tools (search, internet, docs, code, etc.).
I think it's that trend that is really the exciting part, not just the current capabilities.
I hope the generalized future of this doesn't look like the generalized future of that, though. Now it's darn near impossible to find very specific things on the internet because the search engines will ignore any "operators" you try to use if they generate "too few" results (by which they seem to mean "few enough that no one will pay for us to show you an ad for this search"). I'm moderately afraid the ability to get useful results out of AIs will be abstracted away to some lowest common denominator of spammy garbage people want to "consume" instead of use for something.
"here's where to find the information to solve the task"
than for me to manually type out the code, 99% of the time
If the majority of the code is generated by AI, you'll still need people with technical expertise to make sense of it.
Ultimately humans will never need to look at most AI-generated code, any more than we have to look at the machine language emitted by a C compiler. We're a long way from that state of affairs -- as anyone who struggled with code-generation bugs in the first few generations of compilers will agree -- but we'll get there.
Some developers do actually look at the output of C compilers, and some of them even spend a lot of time criticizing that output by a specific compiler (even writing long blog posts about it). The C language has an ISO specification, and if a compiler does not conform to that specification, it is considered a bug in that compiler.
You can even go to godbolt.org / compilerexplorer.org and see the output generated for different targets by different compilers for different languages. It is a popular tool, also for language development.
I do not know what prompt engineering will look like in the future, but without AGI, I remain skeptical about verification of different kinds of code not being required in at least a sizable proportion of cases. That does not exclude usefulness of course: for instance, if you have a case where verification is not needed; or verification in a specific case can be done efficiently and robustly by a relevant expert; or some smart method for verification in some cases, like a case where a few primitive tests are sufficient.
But I have no experience with LLMs or prompt engineering.
I do, however, sympathize with not wanting to deal with paying programmers. Most are likely nice, but for instance a few may be costly, or less than honest, or less than competent, etc. But while I think it is fine to explore LLMs and invest a lot into seeing what might come of them, I would not personally bet everything on them, neither in the short term nor the long term.
May I ask what your professional background and experience is?
Those programmers don't get much done compared to programmers who understand their tools and use them effectively. Spending a lot of time looking at assembly code is a career-limiting habit, as well as a boring one.
> I do not know what prompt engineering will look like in the future, but without AGI, I remain skeptical about verification of different kinds of code not being required in at least a sizable proportion of cases. That does not exclude usefulness of course: for instance, if you have a case where verification is not needed; or verification in a specific case can be done efficiently and robustly by a relevant expert; or some smart method for verification in some cases, like a case where a few primitive tests are sufficient.
Determinism and verifiability is something we'll have to leave behind pretty soon. It's already impossible for most programmers to comprehend (or even access) all of the code they deal with, just due to the sheer size and scope of modern systems and applications, much less exercise and validate all possible interactions. A lot of navel-gazing about fault-tolerant computing is about to become more than just philosophical in nature, and about to become relevant to more than hardware architects.
In any event, regardless of your and my opinions of how things ought to be, most working programmers never encounter compiler output unless they accidentally open the assembly window in their debugger. Then their first reaction is "WTF, how do I get out of this?" We can laugh at those programmers now, but we'll all end up in that boat before long. The most popular high-level languages in 2040 will be English and Mandarin.
> May I ask what your professional background and experience is?
Probably ~30 kloc of C/C++ per year since 1991 or thereabouts. Possibly some of it running on your machine now (almost certainly true in the early 2000s but not so much lately.)
Probably 10 kloc of x86 and 6502 assembly code per year in the ten years prior to that.
> But I have no experience with LLMs or prompt engineering.
May I ask why not? You and the other users who voted my post down to goatse.cx territory seem to have strong opinions on the subject of how software development will (or at least should) work going forward.
>[Inspecting assembly and caring about its output]
I agree that it does not make sense for everyone to inspect generated assembly code, but for some jobs, like compiler developers, it is normal to do so, and for some other jobs it can make sense to do so occasionally. But inspecting assembly was not my main point. My main point was that a lot of people, probably many more than those who inspect assembly code, care about the generated code. If a C compiler does not conform to the C ISO specification, a C programmer who does not inspect assembly can still decide to file a bug report, due to caring about conformance of the compiler.
The scenario you describe, as I understand it at least, of codebases where they are so complex and quality requirements are so low that inspecting code (not assembly, but the output from LLMs) is unnecessary, or mitigation strategies are sufficient, is not consistent with a lot of existing codebases, or parts of codebases. And even for very large and messy codebases, there are still often abstractions and layers. Yes, there can be abstraction leakage in systems, and fault tolerance against not just software bugs but unchecked code, can be a valuable approach. But I am not certain it would make sense to have even most code be unchecked (in the sense of having been reviewed by a programmer).
I also doubt a natural language would replace a programming language, at least if verification or AGI is not included. English and Mandarin are ambiguous. C and assembly code is (meant to be) unambiguous, and it is generally considered a significant error if a programming language is ambiguous. Without verification of some kind, or an expert (human or AGI), how could one in general cases use that code safely and usefully? There could be cases where one could do other kinds of mitigation, but there are at least a large proportion of cases where I am skeptical that sole mitigation strategies would be sufficient.
Absolutely not.
An experienced individual in their field can tell if the AI made a mistake in the comments / code rather than the typical untrained eye.
So no, actually read the code and understand what it does.
> Ultimately humans will never need to look at most AI-generated code, any more than we have to look at the machine language emitted by a C compiler.
So for safety critical systems, one should not look or check if code has been AI generated?
If you don't review the code your C compiler generates now, why not? Compiler bugs still happen, you know.
I see in one of your other posts that you were loudly grumbling about being downvoted. You may want to revisit if taking a combative, bad faith approach while replying to other people is really worth it.
For programming I don't use any prompts. I give a problem solved already, as a context or example, and I ask it to implement something similar. One sentence or two, and that's it.
For other kinds of tasks, like writing, I use prompts, but even then, context and examples are still the driving factor.
In my opinion, we are in an interesting point in history, in which now individuals will need their own personal database. Like companies the last 50 years, which had their own database records of customers, products, prices and so on, now an individual will operate using personal contextual information, saved over a long period of time in wikis or Sqlite rows.
Sounds like good managers and leaders now have an edge. As Patty McCord of Netflix fame used to say: all that a manager does is set the context.
I'm trying to figure out how to build a "Context Management System" (as compared to a Content Management System) for all of my prompts. I completely agree with the premise of this article, if you aren't managing your context, you are losing all of the context you create every time you create a new conversation. I want to collect all of the reusable blocks from every conversation I have, as well as from my research and reading around the internet. Something like a mashup of Obsidian with some custom Python scripts.
The ideal inner loop I'm envisioning is to create a "Project" document that uses Jinja templating to allow transclusion of a bunch of other context objects like code files, documentation, articles, and then also my own other prompt fragments, and then to compose them in a master document that I can "compile" into a "superprompt" that has the precise context that I want for every prompt.
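A rough sketch of that "compile" step using Jinja2 (all paths and the task string are illustrative):

from pathlib import Path
from jinja2 import Template

template = Template("""\
# Project brief
{{ brief }}
{% for path in code_files %}
## {{ path }}
{{ read(path) }}
{% endfor %}
# Task
{{ task }}
""")

superprompt = template.render(
    brief=Path("docs/brief.md").read_text(),
    code_files=["app/models.py", "app/views.py"],
    read=lambda path: Path(path).read_text(),
    task="Add pagination to the article list view.",
)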
Since with the chat interfaces they are always already just sending the entire previous conversation message history anyway, I don't even really want to use a chat style interface as much as just "one shotting" the next step in development.
It's almost a turn based game: I'll fiddle with the code and the prompts, and then run "end turn" and now it is the llm's turn. On the llm's turn, it compiles the prompt and runs inference and outputs the changes. With Aider it can actually apply those changes itself. I'll then review the code using diffs and make changes and then that's a full turn of the game of AI-assisted code.
I love that I can just brain dump into speech to text, and llms don't really care that much about grammar and syntax. I can curate fragments of documentation and specifications for features, and then just kind of rant and rave about what I want for a while, and then paste that into the chat and with my current LLM of choice being Claude, it seems to work really quite well.
My Django work feels like it's been supercharged with just this workflow, and my context management engine isn't even really that polished.
If you aren't getting high quality output from llms, definitely consider how you are supplying context.
I know context engineering is critical for agents, but I wonder if it's also useful for shaping personality and improving overall relatability? I'm curious if anyone else has thought about that.
If I'm debugging something with ChatGPT and I hit an error loop, my fix is to start a new conversation.
Now I can't be sure ChatGPT won't include notes from that previous conversation's context that I was trying to get rid of!
Thankfully you can turn the new memory thing off, but it's on by default.
I wrote more about that here: https://simonwillison.net/2025/May/21/chatgpt-new-memory/
It's good that you can turn it off. I can see how it might cause problems when trying to do technical work.
Edit: Note, the introduction of memory was a contributing factor to "the sycophant" that OpenAI had to roll back. Praise from something that seemed to know you encouraged addictive use.
Edit2: Here's the previous Hacker News discussion on Simon's "I really don’t like ChatGPT’s new memory dossier"
AI turtles all the way down.
I almost always rewrite AI-written functions in my code a few weeks later. It doesn't matter whether they have more context or better context; they still fail to write code that is easily understandable by humans.
I actually think they're a lot more than incremental. 3.7 introduced "thinking" mode and 4 doubled down on that and thinking/reasoning/whatever-you-want-to-call-it is particularly good at code challenges.
As always, if you're not getting great results out of coding LLMs it's likely you haven't spent several months iterating on your prompting techniques to figure out what works best for your style of development.
- compile scripts that can grep / compile list of your relevant files as files of interest
- make temp symlinks in relevant repos to each other for documentation generation, and pass the documentation collected from the respective repos to enable cross-repo ops to be performed atomically
- build scripts to copy schemas, db ddls, dtos, example records, api specs, contracts (still works better than MCP in most cases)
I found these steps not only help produce better output but also greatly reduce cost by avoiding some "reasoning" hops. I'm sure the practice can extend beyond coding.
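A rough sketch along those lines, gathering files of interest plus schemas/specs into one paste-ready blob (patterns and paths are illustrative):

import re
from pathlib import Path

PATTERNS = [r"class Invoice", r"def settle_"]   # symbols of interest
EXTRA = ["db/schema.sql", "api/openapi.yaml"]   # schemas, specs, contracts

def files_of_interest(root="."):
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        if any(re.search(p, text) for p in PATTERNS):
            yield path

bundle = [f"### {p}\n{p.read_text(errors='ignore')}"
          for p in list(files_of_interest()) + [Path(e) for e in EXTRA]
          if p.exists()]
Path("context.md").write_text("\n\n".join(bundle))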
I understand why they do it though: if they presented the actual content that came back from search they would absolutely get in trouble for copyright-infringement.
I suspect that's why so much of the Claude 4 system prompt for their search tool is the message "Always respect copyright by NEVER reproducing large 20+ word chunks of content from search results" repeated half a dozen times: https://simonwillison.net/2025/May/25/claude-4-system-prompt...
Using just a few words (the name of the team) feels OK to me, though you're welcome to argue otherwise.
The Claude search system prompt is there to ensure that Claude doesn't spit out multiple paragraphs of text from the underlying website, in a way that would discourage you from clicking through to the original source.
Personally I think this is an ethical way of designing that feature.
(Note that the way this works is an entirely different issue from the fact that these models were training on unlicensed data.)
I find this very hypocritical given that, for all intents and purposes, the infringement already happened at training time, since most content wasn't acquired with any form of remuneration or attribution (otherwise this entire endeavor would not have been economically worth it). See also the "you're not allowed to plagiarize Disney" restriction being applied by all commercial text-to-image providers.
LLMs are quite useful and I leverage them all the time. But I can’t stand these AI yappers saying the same shit over and over again in every media format and trying to sell AI usage as some kind of profound wizardry when it’s not.
What makes it quackery is there's no evidence to show that these "suggestions" actually work (and how well) when it comes to using LLMs. There's no measurement, no rigor, no analysis. Just suggestions and anecdotes: "Here's what we did and it worked great for us!" It's like the self-help section of the bookstore, but now we're (as an industry) passing it off as technical content.
In this arrangement, the LLM is a component. What I meant is that it seems to me that other non-LLM AI technologies would be a better fit for this kind of thing. Lighter, easier to change and adapt, potentially even cheaper. Not for all scenarios, but for a lot of them.
In OpenAI hype language, this is a problem for "Software 2.0", not "Software 3.0" in 99% of the cases.
The thing about matching an informal tone would be the hard part. I have to concede that LLMs are probably better at that. But I have the feeling that this is not exactly the feature most companies are looking for, and they would be willing to not have it for a cheaper alternative. Most of them just don't know that's possible.
I’ll usually spend a few minutes going back and forth before making a request.
For some reason, it just feels like this doesn't work as well with ChatGPT or Gemini. It might be my overuse of o3? The latency can wreck the vibe of a conversation.
This new stillpointlab hacker news account is based on the company name I chose to pursue my Context as a Service idea. My belief is that context is going to be the key differentiator in the future. The shortest description I can give to explain Context as a Service (CaaS) is "ETL for AI".
"Actually, you need to engineer the prompt to be very precise about what you want to AI to do."
"Actually, you also need to add in a bunch of "context" so it can disambiguate your intent."
"Actually English isn't a good way to express intent and requirements, so we have introduced protocols to structure your prompt, and various keywords to bring attention to specific phrases."
"Actually, these meta languages could use some more features and syntax so that we can better express intent and requirements without ambiguity."
"Actually... wait we just reinvented the idea of a programming language."
(Whoever's about to say "well ackshually temperature of zero", don't.)
prompt engineering/context engineering : stringbuilder
Retrieval augmented generation: search+ adding strings to main string
test time compute: running multiple generation and choosing the best
agents: for loop and some ifs
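Which, to be fair, is not far off. A deliberately literal sketch (names invented; the llm(), search() and tool callables are whatever you have):

def rag(query, search):
    # retrieval augmented generation: search + adding strings to the main string
    return "\n".join(search(query)) + "\n\n" + query

def agent(task, llm, tools, max_steps=10):
    # an "agent": a for loop and some ifs
    context = task
    for _ in range(max_steps):
        action = llm(context)
        if action.startswith("DONE"):
            return action
        tool = tools.get(action.split()[0], lambda _: "unknown tool")
        context += "\n" + tool(action)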
The idea behind "context engineering" is to help people understand that a prompt these days can be long, and can incorporate a whole bunch of useful things (examples, extra documentation, transcript summaries etc) to help get the desired response.
"Prompt engineering" was meant to mean this too, but the AI influencer crowd redefined it to mean "typing prompts into a chatbot".
Personally, my goalpost still hasn’t moved: I’ll invest in using AI when we are past this grand debate about its usefulness. The utility of a calculator is self-evident. The utility of an LLM requires 30k words of explanation and nuanced caveats. I just can’t even be bothered to read the sales pitch anymore.
If you think that's still a debate, you might be listening to the small pool of very loud people who insist nothing has improved since the release of GPT-4.
I’m listening to my own experience. Just today I gave it another fair shot. GitHub Copilot agent mode with GPT-4.1. Still unimpressed.
This is a really insightful look at why people perceive the usefulness of these models differently. It is fair to both sides, without dismissing one side as just not "getting it" or insisting that we should be "so far" past this debate:
https://ferd.ca/the-gap-through-which-we-praise-the-machine....
https://alexgaynor.net/2025/jun/20/serialize-some-der/ - using Claude Code to compose and have a PR accepted into llvm that implements a compiler optimization (more of my notes here: https://simonwillison.net/2025/Jun/30/llvm/ )
https://lucumr.pocoo.org/2025/6/21/my-first-ai-library/ - Claude Code for writing and shipping a full open source library that handles sloppy (hah) invalid XML
Examples from the past two weeks, both from expert software engineers.
And both of them heavily caveat that experience:
> This only works if you have the capacity to review what it produces, of course. (And by “of course”, I mean probably many people will ignore this, even though it’s essential to get meaningful, consistent, long-term value out of these systems.)
> To be clear: this isn't an endorsement of using models for serious Open Source libraries...Treat it as a curious side project which says more about what's possible today than what's necessarily advisable.
It does nobody any good to oversell this shit.
I linked to those precisely because they aren't over-selling things. They're extremely competent engineers using LLMs to produce work that they would not have produced otherwise.
That's when I understood that vibe coding is real and context is the biggest hurdle.
That said, most of the context could not be pulled from the codebase directly but came from me after asking the AI to check/confirm certain things that I suspected could be the problem.
I think vibe coding can be very powerful in the hands of a senior developer, because if you're the kind of person who can clearly explain their intuitions with words, that's exactly the missing piece the AI needs to solve the problem... And you still need to do the code review aspect, which is also something senior devs are generally good at. Sometimes it makes mistakes/incorrect assumptions.
I'm feeling positive about LLMs. I was always complaining about other people's ugly code before... I HATE over-modularized, poorly abstracted code where I have to jump across 5+ different files to figure out what a function is doing; with AI, I can just ask it to read all the relevant code across all the files and tell me WTF the spaghetti is doing... Then it generates new code which 'follows' existing 'conventions' (same level of mess). The AI basically automates the most horrible aspect of the work; making sense of the complexity and churning out more complexity that works. I love it.
That said, in the long run, to build sustainable projects, I think it will require following good coding conventions and minimal 'low code' coding... Because the codebase could explode in complexity if not used carefully. Code quality can only drop as the project grows. Poor abstractions tend to stick around and have negative flow-on effects which impact just about everything.
> > Hey Jim! Tomorrow’s packed on my end, back-to-back all day. Thursday AM free if that works for you? Sent an invite, lmk if it works.
Feel free to send generated AI responses like this if you are a sociopath.
Assuming that this will be using the totally flawed MCP protocol, I can only see more cases of data exfiltration attacks on these AI systems just like before [0] [1].
Prompt injection + Data exfiltration is the new social engineering in AI Agents.
[0] https://embracethered.com/blog/posts/2025/security-advisory-...
[1] https://www.bleepingcomputer.com/news/security/zero-click-ai...
I am leading a small team working on a couple of “hard” problems to put the limits of LLMs to the test.
One is an options trader. Not algo / HFT, but simply doing due diligence, monitoring the news and making safe long-term bets.
Another is an online research and purchasing experience for residential real-estate.
Both these tasks, we’ve realized, you don’t even need a reasoning model. In fact, reasoning models are harder to get consistent results from.
What you need is a knowledge base infrastructure and pub-sub for updates. Amortize the learned knowledge across users and you have collaborative self-learning system that exhibits intelligence beyond any one particular user and is agnostic to the level of prompting skills they have.
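The shape of that, at its simplest (all names illustrative): a shared knowledge store plus a pub-sub bus, so a fact learned while serving one user is available to every other session.

from collections import defaultdict

class SharedKnowledge:
    def __init__(self):
        self.facts = {}                       # amortized across all users
        self.subscribers = defaultdict(list)

    def publish(self, topic, fact):
        self.facts[topic] = fact
        for callback in self.subscribers[topic]:
            callback(fact)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

bus = SharedKnowledge()
bus.subscribe("rates/mortgage", lambda fact: print("agent sees:", fact))
bus.publish("rates/mortgage", "30-year fixed averaged 6.9% this week")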
Stay tuned for a limited alpha in this space. And DM if you’re interested.