The result here seems to be "Our Judge LLM gave another LLM a 21% grade for some code it generated", which is ... not qualitatively meaningful at all to me.
We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.
They could build a system that gives each model equal compute time by ignoring time spent waiting on rate limits and the like, but they chose not to.
Damned if you do, damned if you don't.
https://source-lens.vercel.app
(Note: this is not at all a production-ready app. It's just something I've been making for myself, though I'm also now sharing it with my students to see how they use it. If anyone reads this and is interested in collaborating, let me know.)
I regularly paste papers into LLM interfaces, but they all spit out generic, unhelpful answers. Your app is the only one I've seen that actually helps me understand.
I am using Gemini 2.0 Pro.
There is a coefficient of intelligence replication, i.e., model M with intelligence I_m can reproduce a model N with intelligence I_n. When (I_n / I_m) > 1, we'll have a runaway intelligence explosion. There are of course several elements in the chain, akin to a Drake equation for intelligent machines, and their combined multiplicative effect determines the overall intelligence of the system. If f(paper) -> code is the weakest part of the chain, it makes sense to target that.
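To make the multiplicative-chain idea concrete, here is a rough sketch, not anything taken from the paper. It assumes each link in the chain can be scored as a single factor and that the overall ratio I_n / I_m is simply their product. The stage names and numbers are hypothetical placeholders; only the paper-to-code step comes from the comment above.

```python
from math import prod


def replication_coefficient(stage_factors):
    """Multiply the per-stage factors to get the overall ratio I_n / I_m."""
    return prod(stage_factors.values())


def weakest_stage(stage_factors):
    """Return the name of the stage with the smallest factor."""
    return min(stage_factors, key=stage_factors.get)


def simulate(initial, coefficient, generations):
    """Apply the same coefficient once per generation, starting from `initial`."""
    levels = [initial]
    for _ in range(generations):
        levels.append(levels[-1] * coefficient)
    return levels


# Hypothetical factors, chosen only for illustration (not measured values).
stages = {
    "read_paper": 0.9,
    "paper_to_code": 0.2,  # the f(paper) -> code link
    "run_experiments": 0.8,
    "analyze_results": 0.9,
}

coefficient = replication_coefficient(stages)
print(f"I_n / I_m = {coefficient:.4f}")
print(f"weakest link: {weakest_stage(stages)}")
print("runaway" if coefficient > 1 else "no runaway at these numbers")
```

Under these assumptions, a coefficient below 1 shrinks every generation and one above 1 compounds. Because the factors multiply, raising the smallest one gives the largest relative gain per unit of improvement.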
My point is that LLMs have potentially already seen the solutions on GitHub, so you can't use that benchmark as a metric without some explanation.
The agent didn't have access to the code. They acknowledge it could theoretically be in the training set, but even then the original code wouldn't conform to the structure of the test.
Yeah, this part should be central to this work: how well those tests are built, whether they actually catch data leakage, how this is measured, etc.
Put all the GPUs in cloud(s) controlled by international scientists (you can still use your GPU from any device and can earn money by renting it out when you don't need it; nothing changes except that you need to be online to use it, but we'll have 5G and better worldwide. You can develop, sell, or release for free mathematically proven safe AI models in this cloud "AI App Store", etc.).
Because the main risk is an AI agent botnet. Current GPUs are like nukes that are 100% unprotected: any hacker can make a virus with an AI agent component just to steal money. This AI will not be aligned at all, and it will become a perpetual and eventually autonomous botnet.