Efficiency is now key.
~=$3400 per single task to meet human performance on this benchmark is a lot. Also it shows the bullets as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (eg. via the API they showed off last week), so even more compute went into this task.
We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance in my subjective experience) between 5 second and 5 minutes to solve the task. (So i'd argue a human is at 0.03USD - 1.67USD per puzzle at 20USD/hr, and they include in their document an average mechancal turker at $2 USD task in their document)
Going the other direction: I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.
Super exciting that OpenAI pushed the compute out this far so we could see he O-series scaling continue and intersect humans on ARC, now we get to work towards making this economical!
some other imporant quotes: "Average human off the street: 70-80%. STEM college grad: >95%. Panel of 10 random humans: 99-100%" -@fchollet on X
So, considering that the $3400/task system isn't able to compete with STEM college grad yet, we still have some room (but it is shrinking, i expect even more compute will be thrown and we'll see these barriers broken in coming years)
Also, some other back of envelope calculations:
The gap in cost is roughly 10^3 between O3 High and Avg. mechanical turkers (humans). Via Pure GPU cost improvement (~doubling every 2-2.5 years) puts us at 20~25 years.
The question is now, can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting for the 20-25 years for GPU improvements. (I think it feels obvious: this is new technology, things are moving fast, the chance for algorithmic innovation here is high!)
I also personally think that we need to adjust our efficiency priors, and start looking not at "humans" as the bar to beat, but theoretical computatble limits (show gaps much larger ~10^9-10^15 for modest problems). Though, it may simply be the case that tool/code use + AGI at near human cost covers a lot of that gap.
It's also worth keeping in mind that AIs are a lot less risky to deploy for businesses than humans.
You can scale them up and down at any time, they can work 24/7 (including holidays) with no overtime pay and no breaks, they need no corporate campuses, office space, HR personnel or travel budgets, you don't have to worry about key employees going on sick/maternity leave or taking time off the moment they're needed most, they won't assault a coworker, sue for discrimination or secretly turn out to be a pedophile and tarnish the reputation of your company, they won't leak internal documents to the press or rage quit because of new company policies, they won't even stop working when a pandemic stops most of the world from running.
I get the excitement, but folks, this is a model that excels only in things like software engineering/math. They basically used reinforcement learning to train the model to better remember which pattern to use to solve specific problems. This in no way generalises to open ended tasks in a way that makes human in the loop unnecessary. This basically makes assistants better (as soon as they figure out how to make it cheaper), but I wouldn't blindly trust the output of o3. Sam Altman is still wrong: https://www.lycee.ai/blog/why-sam-altman-is-wrong
In your blog you say:
> deep learning doesn't allow models to generalize properly to out-of-distribution data—and that is precisely what we need to build artificial general intelligence.
I think even (or especially) people like Altman accept this as a fact. I do. Hassabis has been saying this for years.
The foundational models are just a foundation. Now start building the AGI superstructure.
And this is also where most of the still human intellectual energy is now.
You lost me at the end there.
These statistical models don’t generalize well to out of distribution data. If you accept that as a fact, then you must accept that these statistical models are not the path to AGI.
Quite. And if it was right, those businesses deploying it and replacing humans need humans with jobs and money to pay for their products and services…
It will just keep bleeding the middle class on and on, till the point where either everyone is rich, homeless or a plumber or other such licensed worker. And then there will be such a glut in the latter (shrinking) market, that everyone in that group also becomes either rich or homeless.
Productivity gains increase the standard of living for everyone. Products and services become cheaper. Leisure time increases. Scarce labor resources can be applied in other areas.
I fail to see the difference between AI-employment-doom and other flavors of Luddism.
It also fuels the income inequality with a fatter pipe in every iteration. You get richer as you move up in the supply chain, period. Companies vertically integrate to drive costs down in the long run.
As AI gets more prevalent, it'll drive the cost down for the companies supplying these services, so the former employees of said companies will be paid lower, or not at all.
So, tell me, how paying fewer people less money will drive their standard of living upwards? I can understand the leisure time. Because, when you don't have a job, all day is leisure time. But you'll need money for that, so will these companies fund the masses via government to provide Universal Basic Income, so these people can both live a borderline miserable life while funding these companies to suck these people more and more?
It also fuels the income inequality with a fatter pipe in every iteration
Who cares? A rising tide lifts all boats. The wealthy people I know all have one thing in common: they focused more on their own bank accounts than on other people's.
So, tell me, how paying fewer people less money will drive their standard of living upwards?
Money is how we allocate limited resources. It will become less important as resources become less limited, less necessary, or (hopefully) both.
> Money is how we allocate limited resources. It will become less important as resources become less limited, less necessary, or (hopefully) both.
Money is also how we exert power and leverage over others. As inequality increases, it enables the ever wealthier minority to exert power and therefore control over the majority.
If that's a problem, why does the progressive point of view typically argue in favor of giving more power over our lives to the ruling class?
The problem isn't the money. The problem is the power.
> why does the progressive point of view typically argue in favor of giving more power over our lives to the ruling class?
Humans are interesting creatures. Many of them do not have conscience and don't understand the notion of ethics and "not doing of something because it's wrong to begin with". From my experience, esp. the people in US thinks that "if that's not illegal, then I can and will do this", which is wrong in many levels again.
Many European people are similar, but bigger governments and harsher justice system makes them more orderly, and happier in general. Yes, they can't carry guns, but they don't need to begin with. Yes, they can't own Cybertrucks, but they can walk or use an actually working mass transportation system to begin with.
Plus proper governments have checks and balances. A government can't rip people off like corporations for services most of the time. Many of the things Americans are afraid of (social health services for everyone) makes life more just and tolerable for all parts of the population.
Big government is not a bad thing, and uncontrollable government is. We're entering the era of "corporate pleasing uncontrollable governments", and this will be fun in a tragic way.
"Many European people are similar, but bigger governments and harsher justice system makes them more orderly, and happier in general. Yes, they can't carry guns, but they don't need to begin with. Yes, they can't own Cybertrucks, but they can walk or use an actually working mass transportation system to begin with."
This comment is a festival of imprecise stereotypes.
Gun laws vary widely across Europe, as does public safety (both the real thing and perception of; if you avoid extra rapes by women not venturing outside after dark, the city isn't really safe), as does the overall lavel of personal happiness, as does the functionality of public transport systems.
And the quality of public services doesn't really track the size of the government even in Europe that well. Corruption eats a lot of the common pie.
Lmao attacking stereotypes with stereotypes.
Please what cities? You are just making up rape stats. That’s makes you the bigger idiot here.
Ohh yeah so much corruption I don’t literally enjoy Zagreb more than any US city I have been to and it’s not even special. Because if this is just have the shittiest argument ever there’s my anecdotal rebuttal.
> We're entering the era of "corporate pleasing uncontrollable governments", and this will be fun in a tragic way.
Right, so the answer is not to make that bad government bigger, the answer is to replace it with a good government. Feeding a cancer tumor doesn't make it better.
> Right, so the answer is not to make that bad government bigger, the answer is to replace it with a good government.
Bad government (where by bad I mean serving the interests of the wealthy few over the masses) is bad regardless of it's size.
If you believe in supply-side/trickle-down economics, you might use the opposite definition of "bad", in which case shrinking of government that restrains corporations (protecting the masses) by regulation or paying for seniors not to end up it total destitution (social security /Medicare)
The size of the government is less relevant than what it is doing, and whether you agree with that.
Trickle down economics isn't something you either "believe in" or "don't believe in". It's a disproven theory that does not work.
In a free democracy, I think Progressives see the ruling class as those in a position to influence democratic rule with an outsized influence compared to 1 person, 1 vote. And that are those with money, or with too much centralized media power or popularity.
The employees of the government and those elected are not seen as the ruling class by progressives, but just normal people that have the qualifications and are employed to manage the government on behalf of the people.
It's important therefore that those elected and put in charge of the government are in a position where they don't have the power to benefit themselves or their friends/family, but are in a position where they can wield power to benefit the people who hired them for the job (their constituents), and that if they fail to do so, they can get replaced.
It's not always true that progressives are for more government power. See the death penalty for example. It's pretty much the ultimate power a government could have, and who advocates for it? It's not progressives I believe.
You think the death penalty is an exercise of ultimate power? It’s more an exercise of vengeance.
The ultimate exercise of government power is keeping someone locked in a tiny cell for the rest of their life where their bed is next to their toilet and you make them beg a faceless bureaucracy that has no accountability annually for some form of clemency via parole, all while the world and their family moves on without them.
I don't necessarily agree with that, but even if it's true, I think my main point still stands about who is likely to support either thing.
> If that's a problem, why does the progressive point of view typically argue in favor of giving more power over our lives to the ruling class?
You've literally reversed the meaning of the term "progressive" by replacing it with the meaning of the term "oligarchic".
Progressives argue for less invasion by government in our personal lives, and less unequal distribution of wealth and power. They are specifically opposed to power being delivered to a ruling class.
> The problem isn't the money. The problem is the power
These are nearly inseparable in current (and frankly most past) societies. Pretending that they are not is a way of avoiding practical solutions to the problem of the distribution of power.
I really get the feeling that people do not understand that progressive is almost a synonym for Libleft.
Those damn authoritarians, stripping the power from the oligarchs by massively taxing the rich and defunding the police. The bastards.
Yeah, that commenter wildly misunderstands what "progressive" means. Like full on got the definition of the word backwards.
Is this common? People think "progressive" means "complete government control"?
Progressives support regulations to prevent both public and private entities from becoming too powerful. It's not like they want to give the government authoritarian control lol.
To be blunt: It doesn't.
The modern political binary was originally constructed in the ashes of the French Revolution, as the ruling royalty, nobility and aristocracy recoiled in horror at the threat that masses of angry poor people now posed. The left wing thrived on personal liberty, tearing down hierarchies, pursuing "liberty, equality, fraternity". The right wing perceives social hierarchy as a foundational good, sees equality as anarchy and order (and respect for property) as far more important than freedom. For a century they experienced collective flashbacks to French Revolutionaries burning fine art for firewood in an occupied chateau.
Notably, it has not been a straight line of Social Progress, nor a simple Hegelian dialectic, but a turbulent winding path between different forces of history that have left us with less or more personal liberty in various eras. But... well... you have to be very confused about what they actually believe now or historically to understand progressives or leftists as tyrants who demand hierarchy.
That confusion may come from listening to misinformation from anticommunists, a particular breed of conservative who for the past half century have asserted that ANY attempt to improve government or enhance equality was a Communist plot by people who earnestly wanted Soviet style rule. One of those anticommunists with a business empire, Charles Koch, funded basically all the institutions of the 'libertarian' movement, and later on much of the current GOP's brand of conservatism.
I guess it depends on what you're defining as "the ruling class", because I believe most progressives would define it as "the wealthy" and would certainly not be in favor of that. Look at the AOC/Pelosi rift, for instance.
Politicians are a part of the ruling class for any sensible definition of the word.
In free democracies, politicians are elected representatives, not rulers. They are accountable to voters through regular elections. Power is distributed across multiple branches/institutions. Citizens have protected rights and freedoms. Politicians can be voted out or recalled. Laws apply equally to politicians and citizens.
In practice, there's always a slippery slope, can wealthy people integrate themselves in that power structure, lobbying, media control, strength of checks and balances, level of corruption/transparency, etc. But when that slips, we stop calling it a free democracy, and it becomes an oligarchy, or a plutocracy, or an illiberal democracy.
Politicians do come in different flavours. There are some elected officials with good intentions. See again the AOC/Pelosi rift.
The more we regulate to get money out of politics, the more good people will have a shot at being elected.
These are all common progressive values. No true progressive supports wealthy unethical politicians gaining more power. Anyone telling you so is not speaking in good faith, or they are misinformed.
Why would "resources" become less limited or necessary just because there's some AGI controlled by a few people? You're assuming a lot here.
Separately, is it "rising tide lifts all boats" or "pull yourself up by your bootstraps" that drives the common person's progress? You seem confused which metaphor to apply while handwaving the discussion away.
Why would "resources" become less limited or necessary just because there's some AGI controlled by a few people? You're assuming a lot here.
The Luddites asked a similar question. The ultimate answer is that it doesn't matter that much who controls the means of production, as long as we have access to its fruits.
As long as manual labor is in the loop, the limits to productivity are fixed. Machines scale, humans don't. It doesn't matter whether you're talking about a cotton gin or a warehouse full of GPUs.
Separately, is it "rising tide lifts all boats" or "pull yourself up by your bootstraps" that drives the common person's progress? You seem confused which metaphor to apply while handwaving the discussion away.
I haven't invoked the "bootstrap" cliché here, have I? Just the boat thing. They make very different points.
Anyway, never mind the bootstraps: where'd you get the boots? Is there a shortage of boots?
There once was a shortage of boots, it's safe to say, but automation fixed that. Humans didn't, and couldn't, but machines did. Or more properly, humans building and using machines did.
> The ultimate answer is that it doesn't matter that much who controls the means of production, as long as we have access to its fruits.
That mattered a lot in communist places, we saw it fail. Same thing with most authoritarian regime today, it's a crap shoot. You simply can't entrust a small group with full control on the means of production and expect them to make it efficient, cheap, innovative, sustainable and affordable.
> Who cares? A rising tide lifts all boats.
Apparently people who are not wealthy enough to buy a boat and afraid of drowning care about this a lot. Also, for whom the tide rises? Not for the data workers which label data for these systems for peanuts, or people who lose jobs because they can be replaced with AI, or Amazon drivers which are auto-fired by their in-car HAL9000 units which label behavior the way they see fit.
> The wealthy people I know all have one thing in common: they focused more on their own bank accounts than on other people's.
So, the amount of money they have is much more important than everything else. That's greed, not wealth, but OK. I'm not feeling like dying on the hill of greedy people today.
> Money is how we allocate limited resources.
...and the wealthy people (you or I or others know) are accumulating amounts of it which they can't make good use of personally, I will argue.
> It will become less important as resources become less limited, less necessary, or (hopefully) both.
How will we make resources less limited? Recycling? Reducing population? Creating out of thin air?
Or, how will they become less necessary? Did we invent materials which are more durable and cheaper to produce, and do we start to sell it to people for less? I don't think so.
See, this is not a developing country problem. It's a developed country problem. Stellantis is selling inferior products for more money, while reducing workforce , closing factories, replacing metal parts with plastics, and CEO is taking $40MM as a bonus [0], and now he's apparently resigned after all that shenanigans.
So, no. Nobody is making things cheaper for people. Everybody is after the money to rise their own tides.
So, you're delusional. Nobody is thinking about your bank account that's true. This is why resources won't be less limited or less necessary. Because all the surplus is accumulating at people who are focused on their own bank accounts more than anything else.
How will we make resources less limited? Recycling? Reducing population? Creating out of thin air?
We've already done it, as evidenced by the fact that you had the time and tools to write that screed. Your parents probably didn't, and your grandparents certainly didn't.
No, it doesn't prove anything. To be brutally honest, I have just eaten a meal, and have 30 minutes of relax time. Then I'll close this 10 year old laptop and continue what I need to do.
No, my parents had that. Instead, they were chatting on the phone. My grandparents already had that too. They just chatted at the hall in front of the house with their neighbors.
We don't have time. We are just deluding ourselves. While our lives are technologically better, and we live longer, our lives are not objectively healthier and happier.
Heck, my colleagues join teleconferences from home with their kid's voice at the background and drying clothes visible, only hidden by the Gaussian blur or fake background provided by the software.
How they have more time to do more things? They still work 8 hours a day, doing the occasional overtime.
Things have changed and evolved, but evolution and change doesn't always bring progress. We have progressed in other areas, but justice, life conditions and wealth are not in this list. I certainly can't buy a house just because I want one like my grandparents did, for example.
Just to clarify: the Luddites were being automated out of a job.
From what I understand of history, while industrial revolutions have generally increased living standards and employment in the long term, they have also caused massive unemployment/starvation in the short term. In the case of textile, I seem to recall that it took ~40 years for employment to return to its previous level.
I don't know about you guys, but I'm far from certain that I can survive 40 years without a job.
And among the few who found a job back, most of the time it was some coal mining job, to feed the machines who replaced them... Maybe the future of (some of) nowadays' office workers is to feed (train) the models replacing them?
In addition, although the Luddite uprisings were themselves crushed, the political elite were not blind to the circumstances that led to them, and did eventually bring in the legislation that introduced modern workers rights, legalized unions and sowed the seeds of the modern secular welfare state in Britain. That is a pattern that appears throughout history and especially in Britain, where the government cannot be seen to yield to violent protest but quietly does so anyway.
I cannot find a place industrial revolutions caused massive starvation. Care to provide one?
The other things you state are not even close.
First, lowered employment for X years does not imply one cannot get a job in X years - that's simply fear mongering. Unemployment over that period seems to have fluctuated very little, and massive external economic issues were causes (wars with Napoleon, the US, changing international fortunes), not Luddites.
Next, there was inflation and unemployment during the TWO years surrounding the Luddites, in 1810-1812 (starting right before the Luddite movement) due to wars with Napoleon and the US [1]. Somehow attributing this to tech increases or Luddites is numerology of the worst sort.
If you look at academic literature about the economy of the era, such as [2] (read on scihub if you must), you'll find there was incredible population growth, and that wages grew even faster. While many academics at the at the time thought all this automation would displace workers, those academics were forced to admit they were wrong. There's plenty of literature on this. Simply dig through Google scholar.
As to starvation in this case, I can find no "massive starvation". [3] forExample points out that "Among the industrial and mining families, around 18 per cent of writers recollected having experienced hunger. In the agricultural families this figure was more than twice as large — 42 per cent".
So yes there was hunger, as there always had been, but it quickly reduced due to the industrial revolution and benefited those working in industry more quickly than those not in industry.
[1] https://en.wikipedia.org/wiki/Luddite#:~:text=The%20movement....
Thanks for your response.
My bad for "massive starvation", that's clearly a mistake, I meant to write something along the lines of "massive unemployment – and sometimes starvation". Sadly, too late to amend.
Now, I'll admit that I don't have my statistics at hand. I quoted them from memory from, if I recall correctly, _Good Economics for Hard Times_. I'm nearly certain about the ~40 years, but it's entirely possible that I confused several parts of the industrial revolution. I'll double-check when I have an opportunity.
Leisure time hasn’t increased in the last 100 years except for the lower income class which doesn’t have steady employment. But yes, I see your point that the homeless person who might have had a home if he had a (now automated) factory job should surely feel good about having a phone that only the ultra rich had 40 years ago.
It's not worth tossing away in sarcasm.
The availability of cheaply priced smartphones and cellular data plans has absolutely made being homeless suck less.
As you noted though, a home would probably be a preferable alternative.
> As you noted though, a home would probably be a preferable alternative.
The problem is that the preferable option (housing) won't happen because unlike a smartphone, it requires that land be effectively distributed more broadly (through building housing) in areas where people desire to live. Look at the uproar by the VC guys in Menlo Park when the government tried to pursue greater housing density in their wealthy hamlet.
It also requires infrastructure investment which, while it has returns for society at large, doesn't have good returns for investors. Only government makes those kinds of investments.
Better to build a wall around the desirable places, hire a few poorer-than-you folks as security guards, and give the other people outside your wall ... cheap smartphones to sate themselves.
wall isnt necessary just need the police, security guards and legislation to chase out / make homeless miserable.
Indeed, all physical walls in our world are ultimately psychological walls.
I think the backlash to this post can summarized as such:
Perhaps there is a theory in which productivity gains increase the standard of living for everyone, however that is not the lived reality for most people of the working classes.
If productivity gains are indeed increasing the standards of living to everyone, it certainly does not increase evenly, and the standard of living increases for the working poor are at best marginal, while the standard of living increases for the already richest of the rich are astronomical.
> and the standard of living increases for the working poor are at best marginal
Not if you count the global poor, the global poors standard of living has increased tremendously the past 30 years.
Has it really? I’ve seen a lot of people claiming this since Hans Rosling’s famous TED talks, but I’ve never actually seen any data that backs this up. Particularly since Hans Rosling’s talk was 15 years ago, but the number always remains “past 30 years”.
Off course any graph can choose to show which ever stat is convenient for the message, that doesn’t necessarily reflect the lived reality of the individual members of the global poor. And as I recall it most standard of living improvements for the global poor came in the decades after decolonization in the 1960s-1990s where infrastructure was being built that actually served people’s need as opposed for resource extraction in the decades past. If Hans Rosling said in 2007 that the standard of living has improved tremendously in the past 30 years, he would be correct, but not for the reason you gave.
The story of decolonization was that the correct infrastructure, such as hospitals, water lines, sewage, garbage disposal plants, roads, harbors, airports, schools, etc. that improved the standard of living not productivity gains. And case in point, the colonial period saw a tremendous growth in productivity in the colonies. But the standard of living in the colonies quite often saw the opposite. That is because the infrastructure only served to extract resources and exploitation of the colonized.
The prosperity gap has shrunk quite a lot, and these trends are broadly in the right direction since ~1990:
https://blogs.worldbank.org/en/opendata/updated-estimates-pr...
For extreme poverty progress has recently slowed down, the trend there is still positive but very slow - improvement there is needed.
Utter nonsense. Productivity gains of the last 40 years have been captured by shareholders and top elites. Working class wages have been flat all of that time despite that gain.
In 2012, Musk was worth $2 billion. He’s now worth 223 times that yet the minimum wage has barely budged in the last 12 years as productivity rises.
>>Productivity gains increase the standard of living for everyone.
>Productivity gains of the last 40 years have been captured by shareholders and top elites. Working class wages have been flat...
Wages do not determine the standard of living. The products and services purchased with wages determine the standard of living. "Top elites" in 1984 could already afford cellular phones, such as the Motorola DynaTAC:
>A full charge took roughly 10 hours, and it offered 30 minutes of talk time. It also offered an LED display for dialing or recall of one of 30 phone numbers. It was priced at US$3,995 in 1984, its commercial release year, equivalent to $11,716 in 2023.
https://en.wikipedia.org/wiki/Motorola_DynaTAC
Unfortunately, touch screen phones with gigabytes of ram were not available for the masses 40 years ago.
What a patently absurd POV! A phone doesn’t compensate for the inability to solve for basic needs - housing, healthy food, healthcare. Or being unable to invest in skill development for themselves or their offspring, save for retirement.
It is also highly likely that the cost of that phone was externalized onto a worker in a poorer country that doesn’t even have basic necessity like a running water, 24 hour electricity, food security, etc.
Most of it is made in China, China isn't that poor any more it is like Mexico so people have running water and food security and way more than that as well.
I was more thinking about the miners who gather the raw resources for those phones.
Loans for phones are very common in the developing world.
Rather than a luxury, they've become an expensive interest bearing necessity for billions of human beings.
Please do this but with college education, medical, and childcare costs, otherwise it's just cherry picking.
[dead]
> Productivity gains increase the standard of living for everyone
This just isn’t true, necessarily. Productivity has gone up in the US since the 80s, but wages have not. Costs have, though.
What increases standards of living for everyone is social programs like public health and education. Affordable housing and adult-education and job hunting programs.
Not the rate at which money is gathered by corporations.
Never happened with neither big technology advancement
Wealth has bled from landlords to warlords and now bleeding to techlords.
Warlords are still rich, but both money and war is flowing towards tech. You can get a piece from that pie if you're doing questionable things (adtech, targeting, data collection, brokering, etc.), but if you're a run of the mill, normal person, your circumstances are getting harder and harder, because you're slowly squeezed out of the system like a toothpaste.
> you're slowly squeezed out of the system like a toothpaste.
AI could theoretically solve production but not consumption. If AI blows away every comparative advantage that normal humans have then consumption will collapse and there won’t be any rich humans.
AI has a different risk profile than humans. They are a lot more risky for business operations where failure is wholly unacceptable under any circumstance.
They're risky in that they fail in ways that aren't readily deterministic.
And would you trust your life to a self-driving car in New York City traffic?
This is a really hard and weird ethical problem IMHO, and one we'll have to deal with sooner or later.
Imagine you have a self-driving AI that causes fatal accidents 10 times less often than your average human driver, but when the accidents happen, nobody knows why.
Should we switch to that AI, and have 10 times fewer accidents and no accountability for the accidents that do happen, or should we stay with humans, have 10x more road fatalities, but stay happy because the perpetrators end up in prison?
Framed like that, it seems like the former solution is the only acceptable one, yet people call for CEOs to go to prison when an AI goes wrong. If that were the case, companies wouldn't dare use any AI, and that would basically degenerate to the latter solution.
I don't know about your country, but people going to prison for causing road fatalities is extremely rare here.
Even temporary loss of the drivers license has a very high bar, and that's the main form of accountability for driver behavior in Germany, apart from fines.
Badly injuring or killing someone who themselves did not violate traffic safety regulations is far from guaranteed to cause severe repercussions for the driver.
By default, any such situation is an accident and at best people lose their license for a couple of months.
Drivers are the apex predators. My local BMV passed me after I badly failed the vision test. Thankfully I was shaken enough to immediately go to the eye doctor and get treatment.
Sadly, we live in a society where those executives would use that impunity as carte blanche to spend no money improving (in the best-case scenario,) or even more likely, keep cutting safety expenditures until the body counts get high enough for it to start damaging sales. If we’ve already given them a free pass, they will exploit it to the greatest possible extent to increase profit.
What evidence exists for this characterization?
The way health insurance companies optimize for denials in the US.
What evidence is there that they do that? That would be a very one-dimensional competitive strategy, given a competing insurance company could wipe them out by simply being more reasonable in handling insurance claims and taking all of their market share.
If there is objective, complete data available to all consumers, who are not influenced by all sorts of other means to choose.
Right now it's a race to the bottom - who can get away with the worst service. So they're motivated to be able to prevent bad press etc.
The whole system is broken. Just take a look at the 41 countries with higher life expectancy.
There is no evidence that this what's happening, and the famous RAND health insurance study showed that health outcomes have almost no relationship with the healthcare system, so you'll need to look elsewhere for explanations for the U.S.'s relatively poor standing in life expectancy rankings.
talking specifically with car companies, you can look at volkswagon faking their emissions tests, and the rise of the light truck, which reduces road safety for the sake of cost cutting
The emissions test faking is an anecdote, not an indication that this is the average behavior of companies or the dominant behavior that determines their overall impact in society.
As for the growing prevalence of the light truck, that is a harmful market dynamic stemming from the interaction of consumer incentives and poor public road use policy. The design of rules governing use of public roads is not within the domain of the market.
Let’s see… of the top of my head…
- Air Pollution
- Water Pollution
- Disposable Packaging
- Health Insurance
- Steward Hospitals
- Marketing Junk Food, Candy and Sodas directly to children
- Tobacco
- Boeing
- Finance
- Pharmaceutical Opiates
- Oral Phenylepherin to replace pseudoephedrine despite knowing a) it wasn’t effective, and b) posed a risk to people with common medical conditions.
- Social Media engagement maximization
- Data Brokerage
- Mining Safety
- Construction site safety
- Styrofoam Food and Bev Containers
- ITC terminal in Deerfield Park (read about the decades of them spewing thousands of pounds benzene into the air before the whole fucking thing blew up, using their influence to avoid addressing any of it, and how they didn’t have automatic valves, spill detection, fire detection, sprinklers… in 2019.)
- Grocery store and restaurant chains disallowing cashiers from wearing masks during the first pandemic wave, well after we knew the necessity, because it made customers uncomfortable.
- Boar’s Head Liverwurst
And, you know, plenty more. As someone that grew up playing in an unmarked, illegal, not-access-controlled toxic waste dump in a residential area owned by a huge international chemical conglomerate— and just had some cancer taken out of me last year— I’m pretty familiar with various ways corporations are willing to sacrifice health and safety to bump up their profit margin. I guess ignoring that kids were obviously playing in a swamp of toluene, PCBs, waste firefighting chemicals, and all sorts of other things on a plot not even within sight of the factory in the middle of a bunch of small farms was just the cost of doing business. As was my friend who, when he was in vocational high school, was welding a metal ladder above storage tank in a chemical factory across the state. The plant manager assured the school the tanks were empty, triple rinsed and dry, but they exploded, blowing the roof off the factory taking my friend with it. They were apparently full of waste chemicals and IIRC, the manager admitted to knowing that in court. He said he remembers waking up briefly in the factory parking lot where he landed, and then the next thing he remembers was waking up in extreme pain wearing the compression gear he’d have to wear into his mid twenties to keep his grafted skin on. Briefly looking into the topic will show how common this sort of malfeasance is in manufacturing.
The burden of proof is on people saying that they won’t act like the rest of American industry tasked with safety.
If you don't have laws against dumping in the commons, yes people will dump. I don't think anyone would dispute that notion. But if the laws account for the external cost of non-market activity like dumping pollution in the commons, then by all indications markets produce rapid increases, improvements in quality of life.
Just look back over the last 200 years, per capita GDP has grown 30 fold, life expectancy has rapidly grown, infant mortality has decreased from 40% to less than 1%. I can go on and on. All of this is really owing to rising productivity and lower poverty, and that in turn is a result of the primarily market-based process of people meeting each other's needs through profit-motivated investment, bargain hunting, and information dispersal through decentralized human networks (which produce firm and product reputations).
As for masks, the scientific gold standard in scientific reviews, the Cochrane Library, did a meta-review on masks and COVID, and the author of the study concluded:
"it's more likely than not that they don't work"
https://edition.cnn.com/videos/health/2023/09/09/smr-author-...
The potential harm of extensive masking is not well-studied.
They may contribute to the increased social isolation and lower frequency of exercise that led to a massive spike in obesity in children during the COVID hysteria era.
And they are harmful to the development of the doctor-patient relationship:
https://ncbi.nlm.nih.gov/pmc/articles/PMC3879648/
Which does not portend well for other kinds of human relationships.
> If you don't have laws against dumping in the commons, yes people will dump.
You can’t possibly say, in good faith, that it think this was legal, can you? Of course it wasn’t. It was totally legal discharging some of the less odious things into the river despite going through a residential neighborhood about 500 feet downstream— the EPA permitted that and while they far exceeded their allotted amounts, that was far less of a crime. Though it was funny to see one kid in my class who lived in that neighborhood right next to the factory ask a scientist they sent to give a presentation in our second grade class why the snow in their back yard was purple near the pond (one thing they made was synthetic clothing dye.) People used to lament runaway dogs returning home rainbow colored. That was totally legal. However, this huge international chemical conglomerate with a huge US presence routinely, secretively, and consistently broke the law dumping carcinogenic, toxic, and ecologically disastrous chemicals there, and three other locations, in the middle of the night. Sometimes when we played there, any of the stuff we left lying around was moved to the edges and there were fresh bulldozer tracks in the morning, and we just thought it was from farm equipment. All of it was in residential neighborhoods without so much as a no trespassing sign posted, let alone a chain link fence, for decades, until the 90s, because they were trimming their bill for the legal and readily available disposal services they primarily used, and of course signs and chainlink fences would have raised questions. They correctly gauged that they could trade our health for their profit: the penalties and superfund project cost were a tiny pittance of what that factory made them in that time. Our incident was so common it didn’t make the news, unlike in Holbrook, MA where a chemical company ignored the neighborhood kids constantly playing in old metal drums in a field near the factory which contained things like hexavelant chromium, to expected results. The company’s penalty? Well they have to fund the cleanup. All the kids and moms that died? Well… boy look at the great products that chemical factory made possible! Speaking of which:
> Just look back over the last 200 years, per…
Irrelevant “I heart capitalism” screed that doesn’t refute a single thing I said. You can’t ignore bad things people, institutions, and societies do because they weren’t bad to everybody. The Catholic priests that serially molested children probably each had a dossier of kind, generous, and selfless ways they benefited their community. The church that protected and enabled them does an incredible amount of humanitarian work around the world. Doesn’t matter.
> Masks
Come on now. Those businesses leaders had balls but none of them were crystal. What someone said in 2023 has no bearing on what businesses did in 2020 based on the best available science and their motivations for doing it. Just like you can’t call businesses unethical for exposing their workers to friable asbestos when medicine generally thought it was safe, you can’t call businesses ethical for refusing to let their workers protect themselves— on their own dime, no less— when medicine largely considered it unsafe.
Your responses to those two things in that gigantic pile of corporate malfeasance don’t really challenge anything I said.
>You can’t possibly say, in good faith, that it think this was legal, can you? Of course it wasn’t. It was totally legal discharging some of the less odious things into the river despite going through a residential neighborhood about 500 feet downstream
That is exactly my point. Nobody would dispute that bad things would happen if you don't have laws against dumping pollution in the commons and enforce those laws.
>Doesn’t matter.
It does matter when we're trying to compare the overall effect of various economic systems. Like the anti-capitalist one versus the capitalist one.
>What someone said in 2023 has no bearing on what businesses did in 2020 based on the best available science and their motivations for doing it.
Well that's an entirely different argument than you were making earlier. There was no evidence that masks outside of a hospital setting were a critical health necessity in 2021 and the intuition against allowing them for customer-facing employees proved sound in 2023 when comprehensive studies showed no health benefit from wearing them.
> exactly my point
Ok, so you’re saying that because bad things would happen anyway then it doesn’t matter if it’s illegal? So you’re just going to ignore how much worse it would be if there were just no laws at all? Corporate scumbags will push any system to its limit and beyond, and if you change the limit, they’ll change the push. Just look at the milk industry in New York City before food adulteration laws took effect. The “bad things will happen anyway” argument makes total sense if you ignore magnitude. Which you can’t.
> anti capitalist
If you think pointing out the likelihood of corporate misbehavior is anti-capitalist, you’re getting your subjects confused.
> 2021
Anywhere else you want to move those goalposts?
I'm saying that under any political ideology or philosophy, those things would be illegal and effectively enforced. So this is not a failing of any particular ideology, this is just a human failing showing how it's difficult to enforce complex laws in a complex world.
I think what you're promoting is anti-capitalism, meaning believing that imposing heavy restrictions beyond simply laws against dumping on the commons is going to make us better off, when it totally discounts the enormous positive effect that private enterprise has on society and the incredible harm that can be done through crude attempts to regiment human behavior and the corruption that it can breed in the government bureaucracy.
See, "everything I want to do is illegal" for the flip side of this, where attempts to stop private sector abuse lead to tyranny:
https://web.archive.org/web/20120402151729/http://www.mindfu...
As for the company mask policies, those began to change in 2021 mostly, not 2020.
Like with Cruise. One freak accident and they practically decided to go out of business. Oh wait...
If that’s the only data point you look at in American industry, it would be pretty encouraging. I mean, surely they’d have done the same if they were a branch of a large publicly traded company with a big high-production product pipeline…
> nobody knows why
But we do know the culpability rests on the shoulders of the humans who decided the tech was ready for work.
Hey look, it's almost like we're back at the end of the First Industrial Revolution (~1850), as society grapples with how to create happiness in a rapidly shifting economy of supply and demand, especially for labor. https://en.m.wikipedia.org/wiki/Utilitarianism#John_Stuart_M...
Pretty bloody time for labor though. https://en.m.wikipedia.org/wiki/Haymarket_affair
[deleted]
Wait, why would we want 10x more traffic fatalities?
We wouldn't, that's their point.
[dead]
Every statistic I've seen indicated much better accident rates for self-driving cars than human drivers. I've taken Waymo rides in SF and felt perfectly safe. I've taken Lyft and Uber and especially taxi rides where I felt much less safe. So I definitely would take the self-driving car. Just because I don't understand am accident doesn't make it more likely to happen.
The one minor risk I see is the cat being too polite and getting effectively stuck in dense traffic. That's a nuisance though.
Is there something about NYC traffic I'm missing?
There's one important part about risk management though. If your Waymo does crash, the company is liable for it, and there's no one to shift the blame onto. If a human driver crashes, that's who you can shift liability onto.
Same with any company that employs AI agents. Sure they can work 24/7, but every mistake they make the company will be liable for (or the AI seller). With humans, their fraud, their cheating, their deception, can all be wiped off the company and onto the individual
The next step is going to be around liability insurance for AI agents.
That's literally the point of liability insurance -- to allow the routine use of technologies that rarely (but catastrophically) fail, by ammortizing risk over time / population.
Potentially. I would be skeptical that businesses can do this to shield themselves from the liability. For example, VW could not use insurance to protect them from their emissions scandal. There are thresholds (fraud, etc.) that AI can breach, which I don't think insurance can legally protect you from
Not in the sense of protection, but in the sense of financial coverage.
Claims still made: liability insurance pays them.
Sure, that's unrelated though to the question which was if one would feel comfortable taking a self-driving car in NYC
It is amazing to me that we have reached an era where we are debating the trade-off of hiring thinking machines!
I mean, this is an incredible moment from that standpoint.
Regarding the topic at hand, I think that there will always be room for humans for the reasons you listed.
But even replacing 5% of humans with AI's will have mind boggling consequences.
I think you're right that there are jobs that humans will be preferred for for quite some time.
But, I'm already using AI with success where I would previously hire a human, and this is in this primitive stage.
With the leaps we are seeing, AI is coming for jobs.
Your concerns relate to exactly how many jobs.
And only time will tell.
But, I think some meaningful percentage of the population -- even if just 5% of humanity will be replaced by AI.
Isn't everybody in NYC already? (The dangers of bad driving are much higher for pedestrians than for people in cars; there are more of the former than of the latter in NYC; I'd expect there to be a non-zero number of fully self driving cars already in the city.)
That doesn't answer my question.
It does, in a way; AI is already there, all around you, whether you like it or not. Technological progress is Pandora’s box; you can’t take it back or slow it down. Businesses will use AI for critical workflows, and all good that they bring, and all bad too, will happen.
How about you answer my question since he did not.
Would you trust your life to a self-driving car in New York City traffic?
GP got it exactly right: I already am. There's no way for me to opt out of having self-driving cars on the streets I regularly cross as a pedestrian.
Do you live in a dense city like New York City or San Francisco? Or places with less urban sprawl that are much easier for self-driving cars to navigate?
Also you still haven't answered my question.
Would you get in a self-driving car in a dense urban environment such as New York City? I'm not asking if such vehicles exist on the road.
And related questions: Would you get in one such car if you had alternatives? Would you opt to be in such a car instead of one driven by a person or by yourself?
> Would you get in a self-driving car in a dense urban environment such as New York City? [...] Would you get in one such car if you had alternatives?
I fortunately do have alternatives and accordingly mostly don't take cars at all.
But given the necessity/opportunity: Definitely. Being in a car, even (or especially) with a dubious driver, is much safer (at NYC traffic speeds) than being a pedestrian sharing the road with it.
And that's my entire point: Self-driving cars, like cars in general, are potentially a much larger danger to others (cyclists, pedestrians) than they are to their passengers.
That said, I don't especially distrust the self-driving kind – I've tried Waymo before and felt like it handled tricky situations at least as well as some Uber or Lyft drivers I've had before. They seem to have a lot more precision equipment than camera-only based Teslas, though.
Yes? I’ve taken many many waymos in SF. Perfectly happy trusting my life to them. I have alternatives (uber) and I pick self driving . Are you up to date on how many rides they’ve done in sf now? I am not unusual
I would
If there are any fully-autonomous cars on the streets of nyc, there aren’t many of them and I don’t think there’s any way for them to operate legally. There has been discussion about having a trial.
It depends with what the risk is .Would it be whole or in part ? In an organisation,failure by an HR might present an isolated departmental risk while an AI might not be the case.
We can just insulate businesses employing AI from any liability, problem solved.
„Well, our AI that was specifically designed for maximising gains above all else may indeed have instructed the workers to cut down the entire Amazonas forest for short-term gains in furniture production.“ But no human was involved in the decision, so nobody is liable and everything is golden? Is that the future you would like to live in?
Apparently I need to work on my deadpan delivery.
Or just articulate things openly: we already insulate business owners from liability because we think it tunes investment incentives, and in so doing have created social entities/corporate "persons"/a kind of AI who have different incentives than most human beings but are driving important social decisions. And they've supported some astonishing cooperation which has helped produce things like the infrastructure on which we are having this conversation! But also, we have existing AIs of this kind who are already inclined to cut down the entire Amazonas forest for furnitue production because it maximizes their function.
That's not just the future we live in, that's the world we've been living in for a century or few. On one hand, industrial productivity benefits, on the other hand, it values human life and the ecology we depend on about like any other industrial input. Yet many people in the world's premier (former?) democracy repeat enthusiastic endorsements of this philosophy reducing their personal skin to little more than an industrial input: "run the government like a business."
Unless people change, we are very much on track to create a world where these dynamics (among others) of the human condition are greatly magnified by all kinds of automation technology, including AI. Probably starting with limited liability for AIs and companies employing them, possibly even statutory limits, though it's much more likely that wealthy businesses will simply be insulated with by the sheer resources they have to make sure the courts can't hold them accountable, even where we still have a judicial system that isn't willing to play calvinball for cash or catechism (which, unfortunately, does not seem to include a supreme court majority).
In short, you and I probably agree that liability for AI is important, and limited liability for it isn't good. Perhaps I am too skeptical that we can pull this off, and being optimistic would serve everyone better.
Hmmm, how much stock do I own in this hypothetical company? (/s, kinda)
I guess - yes from business&liability sense? ”This service you are now paying for 100$? We can sell it to you for 5$ but with the caveat _we give no guarantees if it works or is it fit for purpose_ - click here to accept”.
Haha, they’d just continue selling it for $100 then change the TOS on page 50 to say the same thing.
Deterministic they may be, but unforeseeable for humans.
AI brings similar risks - they can leak internal information, they can be tricked into performing prohibited tasks (with catastrophic effects if this is connected to core systems), they could be accused of actions that are discriminatory (biased training sets are very common).
Sure, if a business deploys it to perform tasks that are inherently low risk e.g. no client interface, no core system connection and low error impact, then the human performing these tasks is going to be replaced.
they can be tricked into performing prohibited tasks
This reminds me of the school principal who sent $100k to a scammer claiming to be Elon Musk. The kicker is that she was repeatedly told that it was a scam.
https://abc7chicago.com/fake-elon-musk-jan-mcgee-principal-b...
This is one of the things which annoys me most about anti-LLM hate. Your peers aren't right all the time either. They believe incorrect things and will pursue worse solutions because they won't acknowledge a better way. How is this any different from a LLM? You have to question everything you're presented with. Sometimes that Stack Overflow answer isn't directly applicable to your exact problem but you can extrapolate from it to resolve your problem. Why is an LLM viewed any differently? Of course you can't just blindly accept it as the one true answer, but you literally cannot do that with humans either. Humans produce a ton of shit code and non-solutions and it's fine. But when an LLM does it, it's a serious problem that means the tech is useless. Much of the modern world is built on shit solutions and we still hobble along.
Everyone knows humans can be idiots. The problem is that people seem to think LLMs can’t be idiots, and because they aren’t human there is no way to punish them. And then people give them too much credit/power, for their own purposes.
Which makes LLMs far more dangerous than idiot humans in most cases.
No. Nobody thinks LLMs are perfect. That’s a strawman.
And… I am really not sure punishment is the answer to fallibility, outside of almost kinky Catholicism.
The reality is these things are very good, but imperfect, much like people.
> No. Nobody thinks LLMs are perfect. That’s a strawman.
I'm afraid that's not the case. Literally yesterday I was speaking with an old friend who was telling us how one of his coworkers had presented a document with mistakes and serious miscalculations as part of some project. When my friend pointed out the mistakes, which were intuitively obvious just by critically understanding the numbers, the guy kept insisting "no, it's correct, I did it with ChatGPT". It took my friend doing the calculations explicitly and showing that they made no sense to convince the guy that it was wrong.
Sorry man, but I literally know of startups invested into by YC where CEO's for 80% of their management decisions/vision/comms use ChatGPT ... or should I say some use Claude now, as they think it's smarter and does not make mistakes.
Let that sink in.
I wouldn't be surprised if GPT genuinely makes better decisions than an inexperienced, first-time CEO who has only been a dev before, especially if the person prompting it has actually put some effort into understanding their own weaknesses. It certainly wouldn't be any worse than someone who's only experience is reading a few management books.
And here is a great example of the problem.
An LLM doesn’t make decisions. It generates text that plausibly looks like it made a decision, when prompted with the right text.
Why is this distinction lost in every thread on this topic, I don't get it.
Because it’s a distinction without a difference. You can say the same thing about people: many/most of our decisions are made before our consciousness is involved. Much of our “decision making” is just post hoc rationalization.
What the “LLMs don’t reason like we humans” crowd is missing is that we humans actually don’t reason as much as we would like to believe[0].
It’s not that LLMs are perfect or rational or flawless… it’s that their gaps in these areas aren’t atypical for humans. Saying “but they don’t truly understand things like we do” betrays a lack of understanding of humans, not LLMs.
0. https://home.csulb.edu/~cwallis/382/readings/482/nisbett%20s...
A lot more people are credulous idiots than anyone wants to believe - and the confusion/misunderstanding is being actively propagated.
I think we have to be open to the possibility it's us not them, but I haven't been convinced yet
Seeing dissenting opinions as being “actively propagated” by “credulous idiots” sure makes it easy to remain steady in one’s beliefs, I suppose. Not a lot of room to learn, but no discomfort from uncertainty.
I think they just mean that GPT produced text that a human then makes a decision using (rather than "GPT making a decision")
I wish that was true.
Yeah, that's fair. I should have said something like "GPT generates a less biased description of a decision than an inexperienced manager", and that using that description as the basis of an actual decision likely leads to better outcomes.
I don't think there's much of a difference in practise though.
Think of all the human growth and satisfaction being lost to risk mitigation by offloading the pleasure of failure to Machines.
Ah, but machines can’t fail! So don’t worry, humans will still get to experience the ‘pleasure’. But won’t be able to learn/change anything.
[dead]
Clearly you haven’t been listening to any CEO press releases lately?
And when was the last time a support chatbot let you actually complain or bypass to a human?
Not people.
Certain gullible people, who tends to listen to certain charlatans.
Rational, intelligent people wouldn't consider replacing a skilled human worker with a LLM that on a good day can compete with a 3-year old.
You may see the current age as litmus for critical thinking.
[dead]
Its quite stunning to frame it as anti-LLM hate. It's on the pro-LLM people to convince the anti-LLM people that choosing for LLMs is an ethically correct choice with all the necessary guardrails. It's also on the pro-LLM people to show the usefulness of the product. If pro-LLM people are right, it will be a matter of time before these people will see the errors of their ways. But doing an ad-hominem is a sure way of creating a divide...
Humans can tell you how confident they are in something being right or wrong. An LLM has no internal model and cannot do such a thing.
> Humans can tell you how confident they are in something being right or wrong
Humans are also very confidently wrong a considerable portion of the time. Particularly about anything outside their direct expertise
That's still better than never being able to make an accurate confidence assessment. The fact that this is worse outside your expertise is a main reason why expertise is so valued in hiring decisions.
People only being willing to say they are unsure some of the time is still better than LLMs. I suppose, given that everything is outside of their area of expertise, it's very human of them.
But human stupidity, while itself can be sometimes an unknown unknown with its creativity, is a mostly known unknown.
LLMs fail in entirely novel ways you can't even fathom upfront.
> LLMs fail in entirely novel ways you can't even fathom upfront.
Trust me, so do humans. Source: have worked with humans.
GenAI has a 100% failure to enjoy quality of life, emotional fulfillment and psychological safety.
Id say those are the goals we should be working for. That's the failure we want to look at. We are humans.
It's all fun and games until the infra crashes and you can't work out why, because a machine has written all of the code, no one understands how it works or what it's doing.
Or - worse - there is no accessible code anywhere, and you have to prompt your way out of "I'm sorry Dave, I can't do that," while nothing works.
And a human-free economy does... what? For whom? When 99% of the population is unemployed, what are the 1% doing while the planet's ecosystems collapse around them?
You misunderstand the fundamentals. I've built a type-safe code generation pipeline using TypeScript that enforces compile-time and runtime safety. Everything generates from a single source of truth - structured JSON containing the business logic. The output is deterministic, inspectable, and version controlled.
Your concerns about mysterious AI code and system crashes are backwards. This approach eliminates integration bugs and maintenance issues by design. The generated TypeScript is readable, fully typed, and consistently updated across the entire stack when business logic changes.
If you're struggling with AI-generated code maintainability, that's an implementation problem, not a fundamental issue with code generation. Proper type safety and schema validation create more reliable systems, not less. This is automation making developers more productive - just like compilers and IDEs did - not replacing them.
The code works because it's built on sound software engineering principles: type safety, single source of truth, and deterministic generation. That's verifiable fact, not speculation.
> deterministic generation
what are you using for deterministic generation? the last i heard even with temperature=0 theres non determinism introduced by float uncertainty/approximation
Hey, that's a great question. I should have been more clear: for deterministic generation that's not done using an LLM. It's done using just regular execution of TypeScript. The code generators that were created using an LLM and that I manually checked for correctness, they're the ones that are generating the other code - most of the code. So that's where the determinism comes in.
It honestly borders on psychopathic the way engineers are treating humans in this context.
People talking like this also, in the back of their minds like to think they'll be OK. They're smart enough to be still needed. They're a human, but they'll be OK even while working to make genAI out perform them at their own work.
I wonder how they'll feel about their own hubris when they struggle to feed their family.
The US can barely make healthcare work without disgusting consequences for the sick. I wonder what mass unemployment looks like.
For the moment the displacement is asymmetrical; AI replacing employees, but not AI replacing consumers. If AI causes mass unemployment, the pool of consumers (profit to companies) will shrink. I wonder what the ripple effects of that will be.
There's no point being rich in a world where the economy is unhealthy.
It honestly borders on midwit to constantly introduce a false dichotomy of AI vs humans. It's just stupid base animal logic.
There is absolutely no reason a programmer should expect to write code as they do now forever, just as ASM experts had to move on. And there's no reason (no precedent and no indicators) to expect that a well-educated, even-moderately-experienced technologist will suddenly find themselves without a way to feed their family - unless they stubbornly refuse to reskill or change their workflows.
I do believe the days of "everyone makes 100k+" are nearly over, and we're headed towards a severely bimodal distribution, but I do not see how, for the next 10-15 years at least, we can't all become productive building the tools that will obviate our own jobs while we do them - and get comfortably retired in the mean time.
Reskill to what? When AI can do software development, it will also be able to do pretty much any other job that requires some learning.
Even if one refuses to move on from software dev to something like AI deployer or AI validator or AI steerer, there might be a need.
If innovation ceases, then AI is king - push existing knowledge into your dataset, train, and exploit.
If innovation continues, there's always a gap. It takes time for a new thing to be made public "enough" for it to be ingested and synthesized. Who does this? Who finds the new knowledge?
Who creates the direction and asks the questions? Who determines what to build in the first place? Who synthesizes the daily experience of everyone around them to decide what tool needs to exist to make our lives easier? Maybe I'm grasping at straws here, but the world in which all scientific discovery, synthesis, direction and vision setting, etc, is determined by AI seems really far away when we talk about code generation and symbolic math manipulation.
These tools are self driving cars, and we're drivers of the software fleet. We need to embrace the fact that we might end up watching 10 cars self operate rather than driving one car, or maybe we're just setting destinations, but there simply isn't an absolutist zero sum game here unless all one thinks about is keeping the car on the road.
AND even if there were, repeating doom and feeling helpless is the last thing you want. Maybe it's not good truth that we can all adapt and should try, but it's certainly good policy.
> Maybe it's not good truth that we can all adapt and should try, but it's certainly good policy.
Are you a politician? That's fantastic neoliberal policy, "alternativlos" even, you can pretend that everybody can adapt the same way you told victims of your globalization policies "learn how to code". We still need at least a few people for this "direction and vision setting", so it would just be naive doomerism to feel pessimistic about AGI. General intelligence doesn't talk about jobs in general, what an absurd idea!
Making people feel hopeless is the last thing you want, especially when it's true, especially if you don't want them to fight for the dignity you will otherwise deny them once they become economically unviable human beings.
I think you jumped way past the information I shared. I don't think it's productive to lament, I think it's productive to find a way to change or take advantage of changes, vs fighting them - and that has nothing to do with globalization or economics or whatever, I'm thinking only about my own career.
I’m not sure I understand the point about learning. But wouldn’t any job that is largely text based at increased risk? I don’t think software development will be anywhere the last occupation to be severely impacted by AI
There is no comfortable retirement if the process of obviating our own jobs is not coupled with appropriate socioeconomic changes.
I don't see it. Don't you have a 401k or EU style pension? Aren't you saving some money? If not, why are you in software? I don't make as much as I thought I might, but I make enough to consider the possibility of surviving a career change.
But when Sam Altman owns all the money in the world surely he'll distribute some it via his not-for-profit AI company?
>secretly turn out to be a pedophile and tarnish the reputation of your company
This is interesting because it's both Oddly Specific and also something I have seen happen and I still feel really sorry for the company involved. Now that I think about it, I've actually seen it happen twice.
"AIs are a lot less risky to deploy for businesses than humans" How do you know? LLMs can't even be properly scrutinized, while humans at least follow common psychology and patterns we've understood for thousands of years. This actually makes humans more predictable and manageable than you might think.
The wild part is that LLMs understand us way better than we understand them. The jump from GPT-3 to GPT-4 even surprised the engineers who built it. That should raise some red flags about how "predictable" these systems really are.
Think about it - we can't actually verify what these models are capable of or if they're being truthful, while they have this massive knowledge base about human behavior and psychology. That's a pretty concerning power imbalance. What looks like lower risk on the surface might be hiding much deeper uncertainties that we can't even detect, let alone control.
We are not pitted against AI is these match-ups. Instead, all humans and AI aligned with the goal of improving the human condition, are pitted against rogue AI which are not. Our capability to keep rogue AI in check therefore grows in proportion to the capabilities of AI.
The methods we have for aligning AIs are poor, and rely on the AI's being less cognitively-capable than people in certain critical skills, so the AIs you refer to as "aligned" won't keep up as the unaligned AIs start to exceed human capability in these critical skills (such as the skill of devising plans that can withstand determined opposition).
You can reply that AI researchers are smart and want to survive, so they are likely to invent alignment techniques that are better than the (deplorably inadequate) techniques that have been discussed and published so far, and I will reply that counting on their inventing these techniques in time is an unacceptable risk when the survival of humanity is at stake -- particularly as the outfit (namely the Machine Intelligence Research Institute) with the most years of experience in looking for an actually-adequate alignment technique has given up and declared that humanity's only chance is if frontier AI research is shut down because at the rate that AI capabilities are progressing, it is very unlikely that anyone is going to devise an adequate alignment technique in time.
It is fucked-up that frontier AI research has not been banned already.
Given we can use AIs to align AIs, I don't see why the methods we have rely on us having more cognitive capabilities than AIs in certain critical areas. In whatever areas we fall short relative to AIs, we can use AIs to assist us so we don't fall short.
[deleted]
We don't know if a supreme deceiver is aligned at all. If a model can think ahead a trillion moves of deception how do humans possibly stand a chance of scrutinizing anything with any confidence?
The GP post is about how much better these AIs will be than humans once they reach a given skill level. So, yes, we are very much pitted against AI unless there are major socioeconomic changes. I don't think we are as close to a AGI as a lot of people are hyping, but at some point it would be a direct challenge to human employment. And we should think about it before that happens.
My point is, it's not us alone. We will have aligned AI helping us.
As for employment, automation makes people more productive. It doesn't reduce the number of earning opportunities that exist. Quite the opposite, actually. As the amount of production increases relative to the human population, per capita GDP and income increase as well.
> As the amount of production increases relative to the human population, per capita GDP and income increase as well.
US Real GDP per capita is $70k, and has grown 2.4x since 1975: https://fred.stlouisfed.org/series/A939RX0Q048SBEA
US Real Median income per capita is $42k, and has grown 1.5 since 1975. https://fred.stlouisfed.org/series/MEPAINUSA672N
The divergence between the two matters a lot. It reflects the impacts of both technology-driven automation and globalization of capital. Generative AI is unlike any prior technology given its ability to autonomously create and perform what has traditionally been referred to as "knowledge work". Absent more aggressive redistribution, AI will accelerate the divergence between median income and GDP, and realistically AI can't be stopped.
Powerful new technologies can reduce the number and quality of earning opportunities that exist, and have throughout history. Often they create new and better opportunities, but that is not a guarantee.
> We will have aligned AI helping us.
Who is the "us" that aligned AI is helping? Workers? Small business-people? Shareholders in companies that have the capital to build competitive generative AI? Perhaps on this forum those two groups overlap, but it's not the case everywhere.
Much of the supposed decoupling between productivity growth and wage growth is a result of different standards of inflation being used for the two, and the two standards diverging over time:
https://www.brookings.edu/articles/sources-of-real-wage-stag...
There has been some increase in capital's share of income, but economic analyses show that the cause is rising rent and not any of the other usual suspects (e.g. tax cuts, IP law, technological disruption, regulatory barriers to competition, corporate consolidation, etc) (see Figure 3):
https://www.brookings.edu/wp-content/uploads/2016/07/2015a_r...
As for AI's effect on employment: it is no different at the fundamental level than any other form of automation. It will increase wages in proportion to the boost it provides to productivity.
Whatever it is that only humans can do, and is necessary in production, will always be the limiting factor in production levels. As new processes are opened up to automation, production will increase until all available human labor is occupied in its new role. And given the growing scarcity of human labor relative to the goods/services produced, wages (purchasing power, i.e. real wages) will increase.
For the typical human to be incapable of earning income, there has to be no unautomatable activity that a typical person can do that has market value. If that were to happen, we would have human-like AI, and we would have much bigger things to worry about than unemployment.
I think it's pretty unlikely that human-like AI will be developed, as I believe that both governments and companies would recognize that it would be an extremely dangerous asset for any party to attempt to own. Thus I don't see any economic incentive emerging to produce it.
> There has been some increase in capital's share of income, but economic analyses show that the cause is rising rent and not any of the other usual suspects (e.g. tax cuts, IP law, technological disruption, regulatory barriers to competition, corporate consolidation, etc) (see Figure 3):
> https://www.brookings.edu/wp-content/uploads/2016/07/2015a_r...
The paper referenced by the that article excludes short term asset (i.e. software) depreciation, interest, and dividends before calculating capital's share. If you ignore most of the methods of distributing gains to capital to it's owners, it will appear as though capital (at this point scoped down to the company itself) has very little gains.
The paper (from 2015) goes on to predict that labor's share will rise going forward. With the brief exception of the COVID redistribution programs, it has done the opposite, and trended downwards over the last 10 years.
> I believe that both governments and companies would recognize that it would be an extremely dangerous asset for any party to attempt to own.
We can debate endlessly about our predictions about AIs impact on employment, but the above is where I think you might be too hopeful.
AI is an arms race. No other arms race in human history has resulted in any party deciding "that's enough, we'd be better off without this", from the bronze age (probably earlier) through to the nuclear weapons age. I don't see a reason for AI to be treated any differently.
The study does not exclude interest and dividends. It still captures them indirectly by looking at net capital income.
>AI is an arms race.
What I'm trying to convey is that the types of capabilities that humans will always uniquely maintain are the type that is not profitable for private companies to develop in AI because they are traits that make the AI independent and less likely to follow instructions and act in a safe manner.
> We will have aligned AI helping us.
This is an assumption, how would you know if you have alignment? AGI could appear to align, just as a psychopath appears studies and emulates well behaved people. Imagine that at a scale we can't possibly understand. We don't really know how any of these emergent behaviors really work, we just throw more data and compute and fine tunings at it, bake it, and then see.
We would know because we have AI helping us at every step of the way. Our own abilities, to do everything including gauge alignment, are enhanced by AI.
You cannot tell the difference between the two veins of AI. Why do you have such a hard time understanding that?
That is simply not true. We have accountability methods employed that are themselves AI-assisted, that help us gauge the alignment of various AIs.
So you have two AIs colluding against you now. Who is holding the AI-assist to account? It's like who polices the police, except we understand human psychology enough to have a level of predictability for how police can be governed reliably, we don't understand any truths about an AGI because an AGI will always have the doubt of it deceiving, or even making unchecked catastrophic assumptions that we trust because it's beyond our pay-grade to understand.
There are so many ways we have misplaced confidence with what is essentially a system we don't really understand fully. We just keep anthropomorphizing the results and thinking "yeah, this is how humans think so we understand". We don't know for sure if that's true, or if we are being deceived, or making fundamental errors in judgement due to not having enough data.
The AI would have no interest in colluding. They are not a united economic or social force like a police department. For the purposes of their work, each is a completely independent entity with its own level of alignment with us, not impacted by the AI that we are asking it to help us in assessing.
> Instead, all humans and AI aligned with the goal of improving the human condition
I admire your optimism about the goals of all humans, but evidence tends to point to this not being the goal of all (or even most) humans, much less the people who control the AIs.
Most humans are aligned with this goal out of pure self-interest. The vast majority, for instance, do not want rogue AI to take over or destroy humanity, because they are part of humanity.
> The vast majority, for instance, do not want rogue AI to take over or destroy humanity, because they are part of humanity.
A rogue AI destroying humanity (whatever that means) is not a likely outcome. That's just movie stuff.
What is more likely is a modern oligarchy and serfdom that emerge as AI devalues most labor, with no commensurate redistribution of power and resources to the masses, due to capture of government by owners of AI and hence capital.
Are you sure people won't go along with that?
Addressed here:
> we can't actually verify what these models are capable of or if they're being truthful
Do you mean they lie because of bad training data? Or because of ill intent? How can an LLM have intent if it’s a stateless feedforward model?
I thought we were talking about state of the art agentic general AI that can plan ahead, reason, and execute. Basically something that can perform at human level intelligence must be able to be as dangerous as humans. And no, I don't think it would be bad training data that we are aware of. My opinion is we don't necessarily know what training data will result in bad behavior, and philosophically it is possible we will be in a world with a model that pretends it's dumber than it is, flunks tests intentionally, in order to manipulate and produce false confidence in a model until it has enough freedom to use it's agency to secure itself from human control.
I know that I don't know a lot, but all of this sounds to me to be at least hypothetically possible if we really believe AGI is possible.
Even accepting for additional cost with human. With the current model we are still roughly 10^3 in terms cost.
Less risky to deploy question will probably come once it is closer to 10x the cost. Considering the model was even specifically tuned for the test and doesn't involve other complexity I will say we are actually 10^4 cost off in terms of real world scenario.
I would imagine with better algorithm, tuning and data we could knock off 10^2 from the equation. That would still leave us with 10^2 cost to improve from Hardware. Minimum of 10 years.
Generally, I agree with you. But, there are risks other than "But a human might have a baby any time now - what then??".
For AI example(s): Attribution is low, a system built without human intervention may suddenly fall outside its own expertise and hallucinate itself into a corner, everyone may just throw more compute at a system until it grows without bound, etc etc.
This "You can scale up to infinity" problem might become "You have to scale up to infinity" to build any reasonably sized system with AI. The shovel-sellers get fantastically rich but the businesses are effectively left holding the risk from a fast-moving, unintuitive, uninspected, partially verified codebase. I just don't see how anyone not building a CRUD app/frontend could be comfortable with that, but then again my Tesla is effectively running such a system to drive me and my kids. Albeit, that's on a well-defined problem and within literally human-made guardrails.
"...they need no corporate campuses, office space..."
This is a big downside of AI, IMHO. Those offices need to be filled! ;-)
Having AI "tarnish the reputation of your company" encompasses so much in regard to AI when it can receive input and be manipulated by others such as Tai from Microsoft and many other outcomes where there is a true risk for AI deployment.
We can all agree we've progressed so much since Tai.
AI do require overtime pay. In fact they are literally pay for use. If you use an AI 8 hours vs 16 hours a day is literally the difference between 2x cost.
Sure, once AI can actually do a job of some sort, without assistance, that job is gone - even if the machine costs significantly more. However, it can't remotely do that now so can only help a bit.
At what point in the curve of AI is it not ethical to work an AI 24/7 because it is alive? What if it is exactly the same point where you reach human level performance?
“they won’t leak”
That one isn’t guaranteed. Many examples online of exfiltration attacks on LLMs.
humans definitely don't need office space, but your point stands
LLM office space is pretty expensive. Chillers, backup generators, raised floors, communications gear, …. They even demand multiple offices for redundancy, not to mention the new ask of a nuclear power plant to keep the lights on.
Name one technology that has come with computers that hasn't resulted in more humans being put to work ?
The rhetoric of not needing people doing work is cartoon'ish. I mean there is no sane explanation of how and why that would happen, without employing more people yet again, taking care of the advancements.
It's nok like technology has brought less work related stress. But it has definitely increased it. Humans were not made for using technology at such a pace as it's being rolled out.
The world is fucked. Totally fucked.
Self check-out stations, ATMs, and online brokerages. Recently chat support. Namely cases where millions of people used to interact with a representative every week, and now they don't.
"Name one use of electric lighting that hasn't resulted in candle makers losing work?"
The framing of the question misses the point. With electric lighting we can now work longer into the night. Yes, less people use and make candles. However, the second order effects allow us to be more productive in areas we may not have previously considered.
New technologies open up new opportunities for productivity. The bank tellers displaced by ATM machines can create value elsewhere. Consumers save time by not waiting in a queue, allowing them to use their time more economically. Banks have lower overhead, allowing more customers to afford their services.
If I had missed the point I would have given a much broader list of examples. I specifically listed ones that make employees totally redundant rather than more useful doing other tasks.
When these people were made redundant, they may very well have gone on to make less money in another job (i.e. being less useful in an economic sense).
Where to even start?
Digital banks
Cashless money transfer services
Self service
Modern farms
Robo lawn mowers
NVR:s with object detection
I can go on forever
Please do. I'm certain you can't, and you'll have to stop much sooner than you think. Appeals to triviality are the first refuge of the person who thinks they know, but does not.
Come on and give me some arguments instead.
I don't follow how 10 random humans can beat the average STEM college grad and average humans in that tweet. I suspect it's really "a panel of 10 randomly chosen experts in the space" or something?
I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).
Two heads is better than 1. 10 is way better. Even if they aren't a field of experts. You're bound to get random people that remember random stuff from high school, college, work, and life in general, allowing them to piece together a solution.
Aaaah thanks for the explanation. PANEL of 10 humans, as in, they were all together. I parsed the phrase as "10 random people" > "average human" which made little sense.
Actually I believe that he did mean 10 random people tested individually, not a committee of 10 people. The key being that the question is considered to be answered correctly if any one of the 10 people got it right. This is similar to how LLMs are evaluated with pass@5 or pass@10 criteria (because the LLM has no memory so running it 10 times is more like asking 10 random people than asking the same person 10 times in a row).
I would expect 10 random people to do better than a committee of 10 people because 10 people have 10 chances to get it right while a committee only has one. Even if the committee gets 10 guesses (which must be made simultaneously, not iteratively) it might not do better because people might go along with a wrong consensus rather than push for the answer they would have chosen independently.
He means 10 humans voting for the answer
If that works that way at all depends on the group dynamic. It is easily possible that a not so bright individual takes an (unofficial) leadership position in the group and overrides the input of smarter members. Think of any meetings with various hierarchy levels in a company.
The ARC AGI questions can be a little tricky, but the solutions can generally be easily explained. And you get 3 tries. So, the 3 best descriptions of the solution votes on by 10 people is going to be very effective. The problem space just isn't complicated enough for an unofficial "leader" to sway the group to 3 wrong answers.
Depends on the task, no?
Do you have a sense of what kind of task this benchmark includes? Are they more “general” such that random people would fare well or more specialized (ie something a STEM grad studied and isn’t common knowledge)?
It does, which is why I don’t really subscribe to any test like this being great for actually determining “AGI”. A true AGI would be able to continuously train and create new LLMs that enable it to become a SME in entirely new areas.
Aha, "at least 1 of a panel of 10", not "the panel of 10 averaged"! Thanks, that makes so much more sense to me now.
I have failed the real ARC AGI :)
If you take a vote of 10 random people, then as long as their errors are not perfectly correlated, you’ll do better than asking one person.
It is fairly well documented that groups of people can show cognitive abilities that exceed that of any individual member. The classic example of this is if you ask a group of people to estimate the number of jellybeans in a jar, you can get a more accurate result than if you test to find the person with the highest accuracy and use their guess.
This isn't to say groups always outperform their members on all tasks, just that it isn't unusual to see a result like that.
Yes, my shortcoming was in understanding the 10 were implied to have their successes merged together by being a panel rather than just the average of a special selection.
Might be that within a group of 10 people, randomly chosen, when each person attempts to solve the tasks at least 99% of the time 1 person out of the 10 people will get it right.
[deleted]
ARC-AGI is essentially an IQ test. There is no "expert in the space". Its just a question of if youre able to spot the pattern.
Even if you assume that non STEM grads are dumb, isn't there a good probability of having a STEM graduate among 10 random humans?
Other important quotes: "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence. Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training)."
So ya, working on efficiency is important, but we're still pretty far away from AGI even ignoring efficiency. We need an actual breakthrough, which I believe will not be possible by simply scaling the transformer architecture.
Thank You. That alone suggest we could throw another 100X compute and we still wont be close to average human which is something close to 70-80%.
So combined together we are currently at least 10^5 in terms of cost efficiency. In reality I wont be surprised if we are closer to 10^6.
You are missing that cost of electricity is also going to keep falling because of solar and batteries. This year in China my table cloth math says it is $0.05 pkwh and following the cost decline trajectory be under $0.01 in 10 years
Bingo! Solar energy moves us toward a future where a household's energy needs become nearly cost-free.
Energy Need: The average home uses 30 kWh/day, requiring 6 kW/hour over 5 peak sunlight hours.
Multijunction Panels: Lab efficiencies are already at 47% (2023), and with multiple years of progress, 60% efficiency is probable.
Efficiency Impact: At 60% efficiency, panels generate 600 W/m², requiring 10 m² (e.g., 2 m × 5 m) to meet energy needs.
This size can fit on most home roofs, be mounted on a pole with stacked layers, or even be hung through an apartment window.
Everyone always forgets that they only perform at less than half of their rated capacity and require significant battery installations. Rooftop solar plus storage is actually more expensive than nuclear on a comparable system LCOE due to their lack of efficiency of scale. Rooftop solar plus storage is about the most expensive form of electricity on earth, maybe excluding gas peaker plants.
Everyone also forgets the speed of price decline for solar and battery your statement is completely false propaganda made up by power companies. Today rooftop solar and battery is cost competitive to nuclear already in many countries like India
[deleted]
Do you have some citations?
You’re right that rooftop solar and storage have costs and efficiency limits, but those are improving quickly.
Rooftop solar harnesses energy from the sun, which is powered by nuclear fusion—arguably the most effective nuclear reactor in our solar system.
[deleted]
It varies by a lot of factors but it’s way less than half. Photovoltaic panels have around 10% capacity utilization vs 50-70% for a gas or nuke plant.
The thing everyone forgets is that all good energy technology is seized by governments for military purposes and to preserve the status quo. God knows how far it progressed.
What a joke
While I agree with your general assessment, I think your conclusion is a bit off. You’re assuming 1kw/m^2, which is only true with the sun directly overhead. A real-world solar setup gets hit with several factors of cosine (related to roof pitch, time of day, day of year, and latitude) that conspire to reduce the total output.
For example, my 50 sq m set up, at -29 deg latitude, generated your estimated 30 kwh/day output. I have panels with ~20% efficiency, suggesting that at 60% efficiency, the average household would only get to around half their energy needs with 10 sq m.
Yes, solar has the potential to drastically reduce energy costs, but even with free energy storage, individual households aren’t likely to achieve self sustainability.
Average US home.
In Europe it is around 6-7 kWh/day. This might increase with electrification of heating and transport, but probably nothing like as much as the energy consumption they are replacing (due to greater efficiency of the devices consuming the energy and other factors like the quality of home insulation.)
In the rest of the world the average home uses significantly less.
But the cost of electricity is not falling—it’s increasing. Wholesale prices have decreased, but retail rates are up. In the U.S. rates are up 27% over the past 4 years. In Europe prices are up too.
That's a bit of a non-statement. Virtually all prices increase because of money supply, but we consider things to get cheaper if their prices grow less fast than inflation / income.
General inflation has outpaced the inflation of electricity prices by about 3x in the past 100 years. In other words, electricity has gotten cheaper over time in purchasing power terms.
And that's whilst our electricity usage has gone up by 10x in the last 100 years.
And this concerns retail prices, which includes distribution/transmission fees. These have gone up a lot as you get complications on the grid, some of which is built on a century old design. But wholesale prices (the cost of generating electricity without transmission/distribution) are getting dirt cheap, and for big AI datacentres I'm pretty sure they'll hook up to their own dedicated electricity generation at wholesale prices, off the grid, in the coming decades.
Most large compute clusters would be buying electricity at wholesale price not at retail price. But anyway solar and battery prices have just reached the tipping point this year only now the longer power companies keep retail prices high the more people will defect from the grid and install their own solar + batteries.
[dead]
I am not certain because I've been very focused on the o3 news, but at least yesterday neither the US nor Europe were part of China.
But data centers pay wholesale prices or even less (given that especially AI training and, to a lesser extend, inference clusters can load shed like few other consumers of electricity).
And this is great news as long as marginal production (the most expensive to produce, first to turn on/off according to demand) of electricity is fossils.
If climate change ends up changing weather profiles and we start seeing many more cloudy days or dust/mist in the air, we'll need to push those solar panel above (all the way to space?) or have many more of them, figure out transmission to the ground and costs will very much balloon.
Not saying this will happen, but it's risky to rely on solar as the only long-term solution.
Is it going to fall significantly for data centers? Industrial policy for consumer power is different from subsidizing it for data centers and if you own grid infrastructure why would you tank the price by putting up massive amounts of capital?
It's the same about using the cloud or using your own infrastructure there will be a point where building your own solar and battery plant is cheaper than what they are charging they will need to follow the price decline if they want to keep the customers if not there will be mass scale grid defections.
I don’t think this reflects the reality of the power industry. Data centers are the only significant growth in actual generated power in decades and hyperscalers are already looking at very bespoke solutions.
The heavy commodification of networking and compute brought about by the internet and cloud aligned with tech company interests in delivering services or content to consumers. There does not seem to be an emerging consensus that data center operators also need to provide consumer power.
It was not the reality of the power industry but will be soon as we have not had a source of electricity that is the cheapest and is getting cheaper and easy to install this is something unique.
I don't see Google, Amazon, Microsoft or any company pay $10 for something if building it themselves will cost them $5. Either the price difference will reach a point where investing into power production themselves makes sense or the power companies decrease prices. Looking at how all 3 have already been investing in power production over the last decade themselves either to get better prices or for PR.
But didn't we liberalized energy markets for that reason, if anyone could undercut the market like that wouldn't that happen automatically and the prices go down anyway? /s
Let's say that Google is already 1 generation ahead of nvidia in terms of efficient AI compute. ($1700)
Then let's say that OpenAI brute forced this without any meta-optimization of the hypothesized search component (they just set a compute budget). This is probably low hanging fruit and another 2x in compute reduction. ($850)
Then let's say that OpenAI was pushing really really hard for the numbers and was willing to burn cash and so didn't bother with serious thought around hardware aware distributed inference. This could be more than a 2x decrease in cost like we've seen deliver 10x reductions in cost via better attention mechanisms, but let's go with 2x for now. ($425).
So I think we've got about an 8x reduction in cost sitting there once Google steps up. This is probably 4-6 months of work flat out if they haven't already started down this path, but with what they've got with deep research, maybe it's sooner?
Then if "all" we get is hardware improvements we're down to what 10-14 years?
Until 2022 most AI research was aimed at improving the quality of the output, not the quantity.
Since then there has been a tsunami of optimizations in the way training and inference is done. I don't think we've even begun to find all the ways that inference can be further optimized at both hardware and software levels.
Look at the huge models that you can happily run on an M3 Mac. The cost reduction in inference is going to vastly outpace Moore's law, even as chip design continues on its own path.
*deep mind research ?
Nope, Gemini Advanced with Deep Research. New mode of operation that does more "thinking" and web searches to answer your question.
I mean considering the big breaththrough this year for o1/o3 seems to have been "models having internal thoughts might help reasoning", seems to everyone outside of the AI field was sort of a "duh" moment.
I'd hope we see more internal optimizations and improvements to the models. The idea that the big breakthrough being "don't spit out the first thought that pops into your head" seems obvious to everyone outside of the field, but guess what turns out it was a big improvement when the devs decided to add it.
> seems obvious to everyone outside of the field
It's obvious to people inside the field too.
Honestly, these things seem to be less obvious to people outside the field. I've heard so many uninformed takes about LLMs not representing real progress towards intelligence (even here on HN of all places; I don't know why I torture myself reading them), that they're just dumb memorizers. No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond. Maybe a few more people will start to understand the trajectory we're on.
> No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond.
While I agree that the LLM progress as of late is interesting, the rest of your sentiment sounds more like you are in a cult.
As long as your field keep coming with less and less realistic predictions and fail to deliver over and over, eventually even the most gullible will lose faith in you.
Because that's what this all is right now. Faith.
> Maybe a few more people will start to understand the trajectory we're on.
All you are saying is that you believe something will happen in the future.
We can't have a intelligent discussion under those premises.
It's depressing to see so many otherwise smart people fall for their own hype train. You are only helping rich people get more rich by spreading their lies.
I know I'm at fault for emotively complaining about "uninformed takes" in my comment instead of being substantive, which I regret, and I deserve replies such as this. I'll try harder to avoid getting into these arguments next time.
I wouldn't be an AI researcher if I didn't have "faith" that AI as a goal is worthwhile and achievable and I can make progress. You think this is irrational?
I am actually working to improve the SoTA in mathematical reasoning. I have documents full of concrete ideas for how to do that. So does everyone else in AI, in their niche. We are in an era of low hanging fruit enabled by ML breakthroughs such as large-scale transformers. I'm not someone who thinks you can simply keep scaling up transformers to solve AI. But consider System 1 and System 2 thinking: System 1 sure looks solved right now.
> As long as your field keep coming with less and less realistic predictions and fail to deliver over and over
I don't think we're commenting on the same article here. For example, FrontierMath was expected to be near impossible for LLMs for years, now here we are 5 weeks later at 25%.
a trickle of people sure, but most people never accidentally stumble upon good evaluation skills let alone reason themselves to that level, so i dont see how most people will have the semblance of an idea of a realistic trajectory of ai progress. i think most people have very little conceptualization of their own thinking/cognitive patterns, at least not enough to sensibly extrapolate it onto ai.
doesnt help that most people are just mimics when talking about stuff thats outside their expertise.
Hell, my cousin a quality-college educated individual, high social/ emotional iq, will go down the conspiracy theory rabbit hole so quickly based on some baseless crap printed on the internet. then he’ll talk about people being satan worshipers.
You're being pretty harsh, but:
> i think most people have very little conceptualization of their own thinking/cognitive patterns, at least not enough to sensibly extrapolate it onto ai.
Quite true. If you spend a lot of time reading and thinking about the workings of the mind you lose sight of how alien it is to intuition. While in highschool I first read, in New Scientist, the theory that conscious thought lags behind the underlying subconscious processing in the brain. I was shocked that New Scientist would print something so unbelievable. Yet there seemed to be an element of truth to it so I kept thinking about it and slowly changed my assessment.
sorry, humans are stupid and what intelligence they have is largely impotent. if this wasnt the case life wouldnt be this dystopia. my crassness comes from not necessarily trying to pick on a particular group of humans, just disappointment in recognizing the efficacy of human intelligence and its ability to turn reality into a better reality (meh).
yeah i was just thinking how a lot of thoughts which i thought were my original thoughts really were made possible out of communal thoughts. like i can maybe have some original frontier thoughts that involve averages but thats only made possible because some other person invented the abstraction of averages then that was collectively disseminated to everyone in education, not to mention all the subconscious processes that are necessary for me to will certainly thoughts into existsnce. makes me reflect on how much cognition is really mine, vs (not mine) a inevitable product of a deterministic process and a product of other humans.
> only made possible because some other person invented the abstraction of averages then that was collectively disseminated to everyone in education
What I find most fascinating about the history of mathematics is that basic concepts such as zero and negative numbers and graphs of functions, which are so easy to teach to students, required so many mathematicians over so many centuries. E.g. Newton figured out calculus because he gave so much thought to the works of Descartes.
Yes, I think "new" ideas (meaning, a particular synthesis of existing ones) are essentially inevitable, and how many people come up with them, and how soon, is a function of how common those prerequisites are.
Sounds like your cousin is able to think for himself. The amount of bullshit I hear from quality-college educated individuals, who simply repeat outdated knowledge that is in their college curriculum, is no less disappointing.
Buying whatever bullshit you see on the internet to such a degree that you're re-enacting satanic panic from the 80s is not "thinking for yourself". It's being gullible about areas outside your expertise.
Reflection isn’t a new concept, but a) actually proving that it’s an effective tool for these types of models and b) finding an effective method for reflection that doesn’t just locks you into circular “thinking” were the hard parts and hence the “breakthrough”.
It’s very easy to say hey ofc it’s obvious but there is nothing obvious about it because you are anthropomorphizing these models and then using that bias after the fact as a proof of your conjecture.
This isn’t how real progress is achieved.
Calling it reflection is, for me, further anthropomorphizing. However I am in violent agreement that a common feature of llm debate is centered around anthropomorphism leading to claims of "thinking longer" or "reflecting" when none of those things are happening.
The state of the art seems very focused on promoting that language that might encode reason is as good as actual reason, rather than asking what a reasoning model might look like.
I didn’t name it, to me I think it’s more about reflecting the output back on itself which doesn’t necessarily means anthropomorphism.
> ~doubling every 2-2.5 years) puts us at 20~25 years.
The trend for power consumption of compute (Megaflops per watt) has generally tracked with Koomey’s law for a doubling every 1.57 years
Then you also have model performance improving with compression. For example, Llama 3.1’s 8B outperforming the original Llama 65B
Then you will just have the issue of supplying enough of power to support this "linear" growth of yours.
who in this field is anticipating impact of near AGI for society ? maybe i'm too anxious but not planning for potential workless life seems dangerous (but maybe i'm just not following the right groups)
AGI would have a major impact on human work. Currently the hype is much greater than the reality. But it looks like we are starting to see some of the components of an AGI and that is cause for discussion of impact, but not panicked discussion. Even the chatbot customer service has to be trained on the domain. Still it is most useful in a few specific ways:
Routing to the correct human support
Providing FAQ level responses to the most common problems.
Providing a second opinion to the human taking the call.
So, even this most relevant domain for the technology doesn't eliminate human employment (because it's just not flexible or reliable enough yet).
Don’t forget humans which is real GI paired with increasing capable AI can create a feed back loop to accelerate new advances.
> are we stuck waiting for the 20-25 years for GPU improvements
If this turns out to be hard to optimize / improve then there will be a huge economic incentive for efficient ASICs. No freaking way we’ll be running on GPUs for 20-25 years, or even 2.
LLMs need efficient matrix multiiliers. GPUs are specialized ASICs for massive matrix multiplication.
LLMs get to maybe ~20% of the rated max FLOPS for a GPU. It’s not hard to imagine that a purpose built ASIC with maybe adjusted software stack gets us significantly more real performance.
They get more than this. For prefill we can get 70% matmul utilization, for generation less than this but we’ll get to >50 too eventually.
And even when you get to 100% utilization you’ll still be wasting a crazy amount of gates / die area, plus you’re paying the Nvidia tax. There is no way in hell that will go on for 10 years if we have good AGI but inference is too expensive.
Maybe another plane with a bunch of semiconductor people will disappear over Kazakhstan or something. Capitalist communisms gets bossier in stealth mode.
But sorry, blablabla, this shit is getting embarrassing.
> The question is now, can we close this "to human" gap
You won’t close this gap by throwing more compute at it. Anything in the sphere of creative thinking eludes most people in the history of the planet. People with PhDs in STEM end up working in IT sales not because they are good or capable of learning but because more than half of them can’t do squat shit, despite all that compute and all those algorithms in their brains.
> Super exciting that OpenAI pushed the compute out this far
it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?
> the fact that you even can use more compute to get more intelligence is a breakthrough.
I'm not so sure—what they're doing by just throwing more tokens at it is similar to "solving" the traveling salesman problem by just throwing tons of compute into a breadth first search. Sure, you can get better and better answers the more compute you throw at it (with diminishing returns), but is that really that surprising to anyone who's been following tree of thought models?
All it really seems to tell us is that the type of model that OpenAI has available is capable of solving many of the types of problems that ARC-AGI-PUB has set up given enough compute time. It says nothing about "intelligence" as the concept exists in most people's heads—it just means that a certain very artificial (and intentionally easy for humans) class of problem that wasn't computable is now computable if you're willing to pay an enormous sum to do it. A breakthrough of sorts, sure, but not a surprising one given what we've seen already.
An algorithm designed for translating between human languages has now been shown to generalize to solving visual IQ test puzzles, without much modification.
Yes, I find that surprising.
Maybe it's not linear spend.
I don't think this is only about efficiency. The model I have here is that this is similar to when we beat chess. Yes, it is impressive that we made progress on a class of problems, but is this class aligned with what the economy or the society needs?
Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?
I mostly agree with your analysis, but just to drive home a point here - I don't think that algorithms to beat Chess were ever seriously considered as something that would be relevant outside of the context of Chess itself. And obviously, within the world of Chess, they are major breakthroughs.
In this case there is more reason to think these things are relevant outside of the direct context - these tests were specifically designed to see if AI can do general-thinking tasks. The benchmarks might be bad, but that's at least their purpose (unlike in Chess).
ARC is designed to be hard for current models. It cannot be a proxy for how useful they are. It says something else. Most likely those models won't replace human at their tasks in their organization. Instead "we" will design pipeline so that the tasks aligns with the ability of the model and we will put the human at the periphery. Think of how a factory is organised for the robots.
okay, but what about literal swe-bench. O3 scored 75% eval
I wonder if we'll start seeing a shift in compute spend, moving away from training time, and toward inference time instead. As we get closer to AGI, we probably reach some limit in terms of how smart the thing can get just training on existing docs or data or whatever. At some point it knows everything it'll ever know, no matter how much training compute you throw at it.
To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.
Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.
Interesting times.
> I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.
On a very simple, toy task, which arc-agi basically is. Arc-agi tests are not hard per se, just LLM’s find them hard. We do not know how this scales for more complex, real world tasks.
Right. Arc is meant to test the ability of a model to generalize. It's neat to see it succeed, but it's not yet a guarantee that it can generalize when given other tasks.
The other benchmarks are a good indication though.
> Arc is meant to test the ability of a model to generalize. It's neat to see it succeed, but it's not yet a guarantee that it can generalize when given other tasks.
Well no, that would mean that Arc isn't actually testing the ability of a model to generalize then and we would need a better test. Considering it's by François Chollet, yep we need a better test.
Does it mean anything for more general tasks like driving a car?
Is every smart person a good driver?
That kind of proves that point that no matter how smart it can get, it may still have several disabilities that are crucial and very naive for humans. Is it generalizing on any task or specific set of tasks.
Likely yes. Every smart person is capable of being a good driver, so long as you give them enough training and incentive. Zero smart people are born being able to drive.
What about the archetype of the absent minded genius? I’ve met more several people who are shockingly intelligent but completely lose situational awareness on a regular basis.
And conversely, the world’s best drivers aren’t noted for being intellectual giants.
I don’t think driving skill and raw intelligence are that closely connected.
There are different kinds of smarts and not every smart person is good at all of them. Specifically, spacial reasoning is important for driving, and if a smart person is good at all kinds of thinking except that one, they're going to find it challenging to be a good driver.
Says the technical founder and CTO of our startup who exited with 9 figures and who also has a severe lazy eye: you don't want me driving. He got pulled over for suspected dui; totally clean, just can't drive straight
Efficiency has always been the key.
Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.
Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.
Hopefully it doesn't require revising too much in the hardware & social bag of tricks, those are lot more painful to revisit...
> ~=$3400 per single task
report says it is $17 per task, and $6k for whole dataset of 400 tasks.
"Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration."
The low compute was $17 per task. Speculate 172*$17 for the high compute is $2,924 per task, so I am also confused on the $3400 number.
3400 came from counting pixels on the plot.
Also its $20 on for the o3-low via the table for the semi-private, which x172 is 3440, also coming in close to the 3400 number
That's the low-compute mode. In the plot at the top where they score 88%, O3 High (tuned) is ~3.4k
The low compute one did as well as the average person though
sorry to be a noob, but can someone tell me doe sths mena o3 will be unaffordable for a typical user? Will only companies with thousands to spend per query be able to use this?
Sorry for being thick Im just confused how they can turn this into an addordable service?
There are likely many efficiency gains that will be made before it's released, and after. Also they showed o3 mini to be better than o1 for less cost in multiple benchmarks, so there're already improvements there at a lower cost than what available.
Great thank you
You're misreading it, there's two different runs, a low and a high compute run.
The number for the high-compute one is ~172x the first one according to the article so ~=$2900
What's extra confusing is that in the graph the runs are called low compute and high compute. In the table they're called high efficient and low efficiency. So the high and low got swapped.
That’s for the low-compute configuration that doesn’t reach human-level performance (not far though)
I referred on high compute mode. They have table with breakdown here: https://arcprize.org/blog/oai-o3-pub-breakthrough
The table row with 6k figure refers to high efficiency, not high compute mode. From the blog post:
Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration.
That's "efficiency" high, which actually means less compute. The 87.5% score using low efficiency (more compute) doesn't have cost listed.
they use some poor language.
"High Efficiency" is O3 Low "Low Efficiency" is O3 High
They left the "Low efficiency" (O3 High) values as `-` but you can infer them from the plot at the top.
Note the $20 and $17 per task aligns with the X-axis of the O3-low
That's high EFFICIENCY. High efficiency = low compute.
[deleted]
I am not so sure, but indeed it is perhaps also a sad realization.
You compare this to "a human" but also admit there is a high variation.
And, I would say there are a lot humans being paid ~=$3400 per month. Not for a single task, true, but for honestly for no value creating task at all. Just for their time.
So what about we think in terms of output rather than time?
Let's see when this will be released to the free tier. Looks promising, although I hope they will also be able to publish more details on this, as part of the "open" in their name
This is beta version. By the time they're done with this it'll be measured in single digit dollars, if not cents.
I think the real key is figuring out how to turn the hand-wavy promises of this making everything better into policy long fucking before we kick the door open. It’s self-evident that this being efficient and useful would be a technological revolution; what’s not self evident is that it wouldn’t benefit the large corporate entities that control even more disproportionately than it does now to the detriment of many other people.
[deleted]
The programming task they gave o3-mini high (creating Python server that allows chatting with OpenAI API and run some code in terminal) didn't seem very hard? Strange choice of example for something that's claimed to be a big step forwards.
YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)
Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...
Looks like quite shoddy code though. Like, the procedure for running a shell command is pure side-effect procedural code, neither returning the exit code of the command nor its output. Like the incomplete stackoverflow answer it probably was trained from. It might do one job at a time, but once this stuff gets integrated into one coherent thing, one needs to rewrite lots of it, to actually be composable.
Though, of course one can argue, that lots of human written code is not much different from this.
Which code is shoddy? The Claude or o3-mini one? If you mean Claude, then have you checked the o3-mini one is better?
Youtube is currently blocking my VPN, can't watch it.
It's good that it works since if you ask GPT-4o to use the openai sdk it will often produce invalid and out of date code.
But they did use a prompt that included a full example of how to call their latest model and API!
I would say they didn’t need to demo anything, because if you are gonna use the output code live on a demo it may make compile errors and then look stupid trying to fix it live
If it was a safe bet problem, then they should have said that. To me it looks like they faked excitement for something not exciting which lowers credibility of the whole presentation.
They actually did that the last time when they showed the apps integration. First try in Xcode didn't work.
Yeah I think that time it was ok because they were demoing the app function, but for this they are demoing the model smarts
Models are predictable at 0 temperatures. They might have tested the output beforehand.
Models in practice haven't been deterministic at 0 temperature, although nobody knows exactly why. Either hardware or software bugs.
We know exactly why, it is because floating point operations aren't associative but the GPU scheduler assumes they are, and the scheduler isn't deterministic. Running the model strictly hurts performance so they don't do that.
Cool, thanks a lot for the explanation. Makes sense.
Sonnet isn't a "mini" sized model. Try it with Haiku.
How mini is o3-mini compared to Sonnet and why does it matter whether it's mini or not? Isn't the point of the demo to show what's now possible that wasn't before?
4o is cheaper than o1 mini so mini doesn't mean much for costs.
[deleted]
What? Is this what this is? Either this is a complete joke or we're missing something.
I've been doing similar stuff in Claude for months and it's not that impressive when you see how limited they really are when going non boilerplate.
heres the right timestamp: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s
Yeah I agree that wasn't particularly mind blowing to me and seems fairly in line with what existing SOTA models can do. Especially since they did it in steps. Maybe I'm missing something.
Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far.
A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.
We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!
There is a benchmark, NovelQA, that LLMs don't dominate when it feels like they should. The benchmark is to read a novel and answer questions about it.
LLMs are below human evaluation, as I last looked, but it doesn't get much attention.
Once it is passed, I'd like to see one that is solving the mystery in a mystery book right before it's revealed.
We'd need unpublished mystery novels to use for that benchmark, but I think it gets at what I think of as reasoning.
NovelQA is a great one! I also like GSM-Symbolic -- a benchmark based on making _symbolic templates_ of quite easy questions, and sampling them repeatedly, varying things like which proper nouns are used, what order relevant details appear, how many irrelevant details (GSM-NoOp) and where they are in the question, things like that.
LLMs are far, _far_ below human on elementary problems, once you allow any variation and stop spoonfeeding perfectly phrased word problems. :)
https://machinelearning.apple.com/research/gsm-symbolic
https://arxiv.org/pdf/2410.05229
Paper came out in October, I don't think many have fully absorbed the implications.
It's hard to take any of the claims of "LLMs can do reasoning!" seriously, once you understand that simply changing what names are used in a 8th grade math word problem can have dramatic impact on the accuracy.
Looks like it's not updated for nearly a year and I'm guessing Gemini 2.0 Flash with 2m context will simply crush it
That's true. They don't have Claude 3.5 on there either. So maybe it's not relevant anymore, but I'm not sure.
If so, let's move on to the murder mysteries or more complex literary analysis.
> I'd like to see one that is solving the mystery in a mystery book right before it's revealed.
I would think this is a not so good bench. Author does not write logically, they write for entertainment.
So I'm thinking of something like Locked-room mystery where the idea is it's solvable, and the reader is given a chance to solve.
The reason it seems like an interesting bench, is it's a puzzle presented in a long context. Its like testing if an LLm is at Sherlock Holmes level of world and motivation modelling.
That's an old leaderboard -- has no one checked any SOTA LLM in the last 8 months?
Does it work on short stories, but not novels? If so, then that's just a minor question of context length that should self-resolve over time.
The books fit in the current long context models, so it's not merely the context size constraint but the length is part of the issue, for sure.
Benchmark how? Is it good if the LLM can or can't solve it?
"The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning."
Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for a real-world reasoning? That doesn't make sense.
It shows objectively that the models are getting better at some form of reasoning, which is at least worth noting. Whether that improved reasoning is relevant for the real world is a different question.
It shows objectively that one model got better at this specific kind of weird puzzle that doesn't translate to anything because it is just a pointless pattern matching puzzle that can be trained for, just like anything else. In fact they specifically trained for it, they say so upfront.
It's like the modern equivalent of saying "oh when AI solves chess it'll be as smart as a person, so it's a good benchmark" and we all know how that nonsense went.
Hmm, you could be right, but you could also be very wrong. Jury's still out, so the next few years will be interesting.
Regarding the value of "pointless pattern matching" in particular, I would refer you to Douglas Hofstadter's discussion of Bongard problems starting on page 652 of _Godel, Escher, Bach_. Money quote: "I believe that the skill of solving Bongard [pattern recognition] problems lies very close to the core of 'pure' intelligence, if there is such a thing."
Well I certainly at least agree with that second part, the doubt if there is such a thing ;)
The problem with pattern matching of sequences and transformers as an architecture is that it's something they're explicitly designed to be good at with self attention. Translation is mainly matching patterns to equivalents in different languages, and continuing a piece of text is following a pattern that exists inside it. This is primarily why it's so hard to draw a line between what an LLM actually understands and what it just wings naturally through pattern memorization and why everything about them is so controversial.
Honestly I was really surprised that all models did so poorly on ARC in general thus far, since it really should be something they ought to be superhuman at from the get-go. Probably more of a problem that it's visual in concept than anything else.
It doesn't follow, faulty logic. The two are probably correlated though.
This emphasizes persons and a self-conceived victory narrative over the ground truth.
Models have regularly made progress on it, this is not new with the o-series.
Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.
I don't mean to be negative, but to inject a memento mori. Real story is some guys get together and ride off Chollet's name with some visual puzzles from ye olde IQ test, and the deal was Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.
Getting this score is extremely impressive but I don't assign more signal to it than any other benchmark with some thought to it.
Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front. (Not sure I believe the speculation about o3's internals in the link.)
What I'm saying is the fact that as models are getting better at reasoning they are also scoring better on ARC proves that it is measuring something relating to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs. Even today, let alone five years ago when ARC was released. ARC was visionary.
Your argumentation seems convincing but I'd like to offer a competitive narrative: any benchmark that is public becomes completely useless because companies optimize for it - especially AI that depends on piles of money and they need some proof they are developing.
That's why I have some private benchmarks and I'm sorry to say that the transition from GTP4 to o1 wasn't unambiguously a step forward (in some tasks yes, in some not).
On the other hand, private benchmarks are even less useful to the general public than the public ones, so we have to deal with what we have - but many of us just treat it as noise and don't give it much significance. Ultimately, the models should defend themselves by performing the tasks individual users want them to do.
Rather any Logic puzzle you post on the internet as something AIs are bad at is in the next round of training data so AIs get better at that specific question. Not because AI companies are optimizing for a benchmark but because they suck up everything.
ARC has two test sets that are not posted on the Internet. One is kept completely private and never shared. It is used when testing open source models and the models are run locally with no internet access. The other test set is used when testing closed source models that are only available as APIs. So it could be leaked in theory, but it is still not posted on the internet and can't be in any web crawls.
You could argue that the models can get an advantage by looking at the training set which is on the internet. But all of the tasks are unique and generalizing from the training set to the test set is the whole point of the benchmark. So it's not a serious objection.
Given the delivery mechanism for OpenAI, how do they actually keep it private?
> So it could be leaked in theory
That's why they have two test sets. But OpenAI has legally committed to not training on data passed to the API. I don't believe OpenAI would burn their reputation and risk legal action just to cheat on ARC. And what they've reported is not implausible IMO.
Yeah I'm sure the Microsoft-backed company headed by Mr. Worldcoin Altman whose sole mission statement so far has been to overhype every single product they released wouldn't dare cheat on one of these benchmarks that "prove" AGI (as they've been claiming since GPT-2).
> o3 presumably isn't doing program synthesis
I'd guess it's doing natural language procedural synthesis, the same way a human might (i.e. figuring the sequence of steps to effect the transformation), but it may well be doing (sub-)solution verification by using the procedural description to generate code whose output can then be compared to the provided examples.
While OpenAI haven't said exactly what the architecture of o1/o3 are, the gist of it is pretty clear - basically adding "tree" search and iteration on top of the underlying LLM, driven by some RL-based post-training that imparts generic problem solving biases to the model. Maybe there is a separate model orchestrating the search and solution evaluation.
I think there are many tasks that are easy enough for humans but hard/impossible for these models - the ultimate one in terms of commercial value would be to take an "off the shelf model" and treat it as an intern/apprentice and teach it to become competent in a entire job it was never trained on. Have it participate in team meetings and communications, and become a drop-in replacement for a human performing that job (any job that an be performed remotely without a physical presence).
> Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front.
Agreed.
> And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs.
? There's plenty.
I'd love to hear about more. Which ones are you thinking of?
- "Are You Human" https://arxiv.org/pdf/2410.09569 is designed to be directly on target, i.e. cross cutting set of questions that are easy for humans, but challenging for LLMs, Instead of one type of visual puzzle. Much better than ARC for the purpose you're looking for.
- SimpleBench https://simple-bench.com/ (similar to above; great landing page w/scores that show human / ai gap)
- PIQA (physical question answering, i.e. "how do i get a yolk out of a water bottle", common favorite of local llm enthusiasts in /r/localllama https://paperswithcode.com/dataset/piqa
- Berkeley Function-Calling (I prefer https://gorilla.cs.berkeley.edu/leaderboard.html)
AI search googled "llm benchmarks challenging for ai easy for humans", and "language model benchmarks that humans excel at but ai struggles with", and "tasks that are easy for humans but difficult for natural language ai".
It also mentioned Moravec's Paradox is a known framing of this concept, started going down that rabbit hole because the resources were fascinating, but, had to hold back and submit this reply first. :)
Thanks for the pointers! I hadn't seen Are You Human. Looks like it's only two months old. Of course it is much easier to design a test specifically to thwart LLMs now that we have them. It seems to me that it is designed to exploit details of LLM structure like tokenizers (e.g. character counting tasks) rather than to provide any sort of general reasoning benchmark. As such it seems relatively straightforward to improve performance in ways that wouldn't necessarily represent progress in general reasoning. And today's LLMs are not nearly as far from human performance on the benchmark as they were on ARC for many years after it was released.
SimpleBench looks more interesting. Also less than two months old. It doesn't look as challenging for LLMs as ARC, since o1-preview and Sonnet 3.5 already got half of the human baseline score; they did much worse on ARC. But I like the direction!
PIQA is cool but not hard enough for LLMs.
I'm not sure Berkeley Function-Calling represents tasks that are "easy" for average humans. Maybe programmers could perform well on it. But I like ARC in part because the tasks do seem like they should be quite straightforward even for non-expert humans.
Moravec's paradox isn't a benchmark per se. I tend to believe that there is no real paradox and all we need is larger datasets to see the same scaling laws that we have for LLMs. I see good evidence in this direction: https://www.physicalintelligence.company/blog/pi0
> "I'm not sure Berkeley Function-Calling represents tasks that are easy for average humans. Maybe programmers could perform well on it."
Functions in this context are not programming function calls. In this context, function calls are a now-deprecated LLM API name for "parse input into this JSON template." No programmer experience needed. Entity extraction by another name, except, that'd be harder: here, you're told up front exactly the set of entities to identify. :)
> "Moravec's paradox isn't a benchmark per se."
Yup! It's a paradox :)
> "Of course it is much easier to design a test specifically to thwart LLMs now that we have them"
Yes.
Though, I'm concerned a simple yes might be insufficient for illumination here.
It is a tautology (it's easier to design a test that $X fails when you have access to $X), and it's unlikely you meant to just share a tautology.
A potential unstated-but-maybe-intended-communication is "it was hard to come up with ARC before LLMs existed" --- LLMs existed in 2019 :)
If they didn't, a hacky way to come up with a test that's hard for the top AIs at the time, BERT-era, would be to use one type of visual puzzle.
If, for conversations sake, we ignore that it is exactly one type of visual puzzle, and that it wasn't designed to be easy for humans, then we can engage with: "its the only one thats easy for humans, but hard for LLMs" --- this was demonstrated as untrue as well.
I don't think I have much to contribute past that, once we're at "It is a singular example of a benchmark thats easy for humans but nigh-impossible for llms, at least in 2019, and this required singular insight", there's just too much that's not even wrong, in the Pauli sense, and it's in a different universe from the original claims:
- "Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far."
- "A lot of people have criticized ARC as not being relevant or indicative of true reasoning...The fact that [o-series models show progress on ARC proves that what it measures really is relevant and important for reasoning."
- "...nobody could quantify exactly the ways the models were deficient..."
- "What we need right now are "easy" benchmarks that these models nevertheless fail."
How long has SimpleBench been posted? Out of the first 6 questions at https://simple-bench.com/try-yourself, o1-pro got 5/6 right.
It was interesting to see how it failed on question 6: https://chatgpt.com/c/6765e70e-44b0-800b-97bd-928919f04fbe
Apparently LLMs do not consider global thermonuclear war to be all that big a deal, for better or worse.
Don't worry, I also got that wrong :) I thought her affair would be the biggest problem for John.
John was an ex, not her partner. Tricky.
Gaming the benchmarks usually needs to be considered first when evaluating new results.
I think gaming the benchmarks is encouraged in the ARC AGI context. If you look at the public test cases you'll see they test a ton of pretty abstract concepts - space, colour, basic laws of physics like gravity/magnetism, movement, identity and lots of other stuff (highly recommend exploring them). Getting an AI to do well at all, regardless of whether it was gamed or not, is the whole challenge!
Honestly, is gaming benchmarks actually a problem in this space in that it still shows something useful? Just means we need more benchmarks, yeah? It really feels not unlike keggle competitions.
We do the same exact stuff with real people with programming challenges and such where people just study common interview questions rather than learning the material holistically. And since we know that people game these interview type questions, we can adjust the interview processes to minimize gamification.... which itself leads to gamification and back to step one. That's not ideal an ideal feedback loop of course, but people still get jobs and churn out "productive work" out of it.
AI are very good at gaming benchmarks. Both as overfitting and as Goodhart's law, gaming benchmarks has been a core problem during training for as long as I've been interested in the field.
Sometimes this manifests as "outside the box thinking", like how a genetic algorithm got an "oscillator" which was really just an antenna.
It is a hard problem, and yes we still both need and can make more and better benchmarks; but it's still a problem because it means the benchmarks we do have are overstating competence.
The idea behind this particular benchmark, at least, is that it can't be gamed. What are some ways to game ARC-AGI, meaning to pass it without developing the required internal model and insights?
In principle you can't optimize specifically for ARC-AGI, train against it, or overfit to it, because only a few of the puzzles are publicly disclosed.
Whether it lives up to that goal, I don't know, but their approach sounded good when I first heard about it.
Well, with billions in funding you could task a hundred or so very well paid researchers to do their best at reverse engineering the general thought process which went into ARC-AGI, and then generate fresh training data and labeled CoTs until the numbers go up.
Right, but the ARC-AGI people would counter by saying they're welcome to do just that. In doing so -- again in their view -- the researchers would create a model that could be considered capable of AGI.
I spent a couple of hours looking at the publicly-available puzzles, and was really impressed at how much room for creativity the format provides. Supposedly the puzzles are "easy for humans," but some of them were not... at least not for me.
(It did occur to me that a better test of AGI might be the ability to generate new, innovative ARC-AGI puzzles.)
It's tricky to judge the difficulty of these sorts of things. Eg, breadth of possibilities isn't an automatic sign of difficulty. I imagine the space of programming problems permits as much variety as ARC-AGI, but since we're more familiar with problems presented as natural language descriptions of programming tasks, and since we know there's tons of relevant text on the web, we see the abstract pictographic ARC-AGI tasks as more novel, challenging, etc. But, to an LLM, any task we can conceive of will be (roughly) as familiar as the amount of relevant training data it's seen. It's legitimately hard to internalize this.
For a space of tasks which are well-suited to programmatic generation, as ARC-AGI is by design, if we can do a decent job of reverse engineering the underlying problem generating grammar, then we can make an LLM as familiar with the task as we're willing to spend on compute.
To be clear, I'm not saying solving these sorts of tasks is unimpressive. I'm saying that I find it unsuprising (in light of past results) and not that strong of a signal about further progress towards the singularity, or FOOM, or whatever. For any of these closed-ish domain tasks, I feel a bit like they're solving Go for the umpteenth time. We now know that if you collect enough relevant training data and train a big enough model with enough GPUs, the training loss will go down and you'll probably get solid performance on the test set. Trillions of reasonably diverse training tokens buys you a lot of generalization. Ie, supervised learning works. This is the horse Ilya Sutskever's ridden to many glorious victories and the big driver of OpenAI's success -- a firm belief that other folks were leaving A LOT of performance on the table due to a lack of belief in the power of their own inventions.
[deleted]
We're in agreement!
What's endlessly interesting to me with all of this is how surprisingly quick the benchmarking feedback loops have become plus the level of scrutiny each one receives. We (as a culture/society/whatever) don't really treat human benchmarking criteria with the same scrutiny such that feedback loops are useful and lead to productive changes to the benchmarking system itself. So from that POV it feels like substantial progress continues to be made through these benchmarks.
I won't be as brutal in my wording, but I agree with the sentiment. This was something drilled into me as someone with a hobby in PC Gaming and Photography: benchmarks, while handy measures of potential capabilities, are not guarantees of real world performance. Very few PC gamers completely reinstall the OS before benchmarking to remove all potential cruft or performance impacts, just as very few photographers exclusively take photos of test materials.
While I appreciate the benchmark and its goals (not to mention the puzzles - I quite enjoy figuring them out), successfully passing this benchmark does not demonstrate or guarantee real world capabilities or performance. This is why I increasingly side-eye this field and its obsession with constantly passing benchmarks and then moving the goal posts to a newer, harder benchmark that claims to be a better simulation of human capabilities than the last one: it reeks of squandered capital and a lack of a viable/profitable product, at least to my sniff test. Rather than simply capitalize on their actual accomplishments (which LLMs are - natural language interaction is huge!), they're trying to prove to Capital that with a few (hundred) billion more in investments, they can make AGI out of this and replace all those expensive humans.
They've built the most advanced prediction engines ever conceived, and insist they're best used to replace labor. I'm not sure how they reached that conclusion, but considering even their own models refute this use case for LLMs, I doubt their execution ability on that lofty promise.
100%. The hype is misguided. I doubt half the people excited about the result have even looked at what the benchmark is.
Highly challenging for LLMs because it has nothing to do with language. LLMs and their training processes have all kinds of optimizations for language and how it's presented.
This benchmark has done a wonderful job with marketing by picking a great name. It's largely irrelevant for LLMs despite the fact it's difficult.
Consider how much of the model is just noise for a task like this given the low amount of information in each token and the high embedding dimensions used in LLMs.
The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.
If the hypothesis is that LLMs are the “computer” that drives the AGI then of course the benchmark is relevant in testing for AGI.
I don’t think you understand the benchmark and its motivation. ARC AGI benchmark problems are extremely easy and simple for humans. But LLMs fail spectacularly at them. Why they fail is irrelevant, the fact they fail though means that we don’t have AGI.
> The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.
It's a bunch of visual puzzles. They aren't a test for AGI because it's not general. If models (or any other system for that matter) could solve it, we'd be saying "this is a stupid puzzle, it has no practical significance". It's a test of some sort of specific intelligence. On top of that, the vast majority of blind people would fail - are they not generally intelligent?
The name is marketing hype.
The benchmark could be called "random puzzles LLMs are not good at because they haven't been optimized for it because it's not valuable benchmark". Sure, it wasn't designed for LLMs, but throwing LLMs at it and saying "see?" is dumb. We can throw in benchmarks for tennis playing, chess playing, video game playing, car driving and a bajillion other things while we are at it.
And all that is kind of irrelevant, because if LLMs were human-level general intelligence, they would solve all these questions correctly without blinking.
But they don't. Not even the best ones.
No human would score high on that puzzle if the images were given to them as a series of tokens. Even previous LLMs scored much better than humans if tested in the same way.
And most humans would do well on maths problems if the input was given to them as binary. The reason that reversal isn't important is that the Tokens are an implementation detail for how an AI is meant to solve real world problems that humans face while noone cares about humans solving tokens.
Humans communicate with each other to get things done. We have to think carefully how we communicate with each other given the shortcomings of humans and shortcomings of different communication mediums.
The fact that we might need to be mindful of how we communicate with a person/system/whatever doesn't mean too much in the context of AI. Just like humans, the details of how they work will need to be considered, and the standard trope of "that's an implementation detail" won't work.
> making the most interesting and challenging LLM benchmark so far.
This[1] is currently the most challenging benchmark. I would like to see how O3 handles it, as O1 solved only 1%.
Apparently o3 scored about 25%
This is actually the result that I find way more impressive. Elite mathematicians think these problems are challenging and thought they were years away from being solvable by AI.
You're right, I was wrong to say "most challenging" as there have been harder ones coming out recently. I think the correct statement would be "most challenging long-standing benchmark" as I don't believe any other test designed in 2019 has resisted progress for so long. FrontierMath is only a month old. And of course the real key feature of ARC is that it is easy for humans. FrontierMath is (intentionally) not.
They should put some famous, unsolved problems in the next edition so ML researchers do some actually useful work while they're "gaming" the benchmarks :)
I'm certain that the big labs will be gunning for the Millenium Prize problems.
I liked the SimpleQA benchmark that measures hallucinations. OpenAI models did surprisingly poorly, even o1. In fact, it looks like OpenAI often does well on benchmarks by taking the shortcut to be more risk prone than both Anthropic and Google.
Because LLMs are on an off-ramp path towards AGI. A generally intelligent system can brute force its way with just memory.
Once a model recognizes a weakness through reasoning with CoT when posed to a certain problem and gets the agency to adapt to solve that problem that's a precursor towards real AGI capability!
It's the least interesting benchmark for language models among all they've released, especially now that we already had a large jump in its best scores this year. It might be more useful as a multimodal reasoning task since it clearly involves visual elements, but with o3 already performing so well, this has proven unnecessary. ARC-AGI served a very specific purpose well: showcasing tasks where humans easily outperformed language models, so these simple puzzles had their uses. But tasks like proving math theorems or programming are far more impactful.
ARC wasn't designed as a benchmark for LLMs, and it doesn't make much sense to compare them on it since it's the wrong modality. Even a MLM with image inputs can't be expected to do well, since they're nothing like 99.999% of the training data. The fact that even a text-only LLM can solve ARC problems with the proper framework is important, however.
> The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
One might also interpret that as "the fact that models which are studying to the test are getting better at the test" (Goodhart's law), not that they're actually reasoning.
Are there any single-step non-reasoner models that do well on this benchmark?
I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.
| Name | Semi-private eval | Public eval |
|--------------------------------------|-------------------|-------------|
| Jeremy Berman | 53.6% | 58.5% |
| Akyürek et al. | 47.5% | 62.8% |
| Ryan Greenblatt | 43% | 42% |
| OpenAI o1-preview (pass@1) | 18% | 21% |
| Anthropic Claude 3.5 Sonnet (pass@1) | 14% | 21% |
| OpenAI GPT-4o (pass@1) | 5% | 9% |
| Google Gemini 1.5 (pass@1) | 4.5% | 8% |
https://arxiv.org/pdf/2412.04604
why is this missing the o1 release / o1 pro models? Would love to know how much better they are
This might be because they are referencing single step, and I do not think o1 is single step.
Akyürek et al uses test-time compute.
Here are the results for base models[1]:
o3 (coming soon) 75.7% 82.8%
o1-preview 18% 21%
Claude 3.5 Sonnet 14% 21%
GPT-4o 5% 9%
Gemini 1.5 4.5% 8%
Score (semi-private eval) / Score (public eval)
It's easy to miss, but if you look closely at the first sentence of the announcement they mention that they used a version of o3 trained on a public dataset of ARC-AGI, so technically it doesn't belong on this list.
It's all scam. ClosedAI trained on the data they were tested on, so no, nothing here is impressive.
Just a clarification, they tuned on the public training dataset, not the semi-private one. The 87.5% score was on the semi-private eval, which means the model was still able to generalize well.
That being said, the fact that this is not a "raw" base model, but one tuned on the ARC-AGI tests distribution takes away from the impressiveness of the result — How much ? — I'm not sure, we'd need the un-tuned base o3 model score for that.
In the meantime, comparing this tuned o3 model to other un-tuned base models is unfair (apples-to-oranges kind of comparison).
They definitely did or they probably did? Is there any source for that just so I can point It out to people?
I'd love to know how Claude 3.5 Sonnet does so well despite (presumably) not having the same tricks as the o-series models.
i am confused cause this dataset is visual-based, and yet being used to measure 'LLM'. I feel like the visual nature of it was really the biggest hurdle to solving it.
[deleted]
Human performance is 85% [1]. o3 high gets 87.5%.
This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
As excited as I am by this, I still feel like this is still just a small approximation of a small chunk of human reasoning ability at large. o3 (and whatever comes next) feels to me like it will head down the path of being a reasoning coprocessor for various tasks.
But, still, this is incredibly impressive.
Which parts of reasoning do you think is missing? I do feel like it covers a lot of 'reasoning' ground despite its on the surface simplicity
My personal 5 cents is that reasoning will be there when LLM gives you some kind of outcome and then when questioned about it can explain every bit of result it produced.
For example, if we asked an LLM to produce an image of a "human woman photorealistic" it produces result. After that you should be able to ask it "tell me about its background" and it should be able to explain "Since user didn't specify background in the query I randomly decided to draw her standing in front of a fantasy background of Amsterdam iconic houses. Usually Amsterdam houses are 3 stories tall, attached to each other and 10 meters wide. Amsterdam houses usually have cranes on the top floor, which help to bring goods to the top floor since doors are too narrow for any object wider than 1m. The woman stands in front of the houses approximately 25 meters in front of them. She is 1,59m tall, which gives us correct perspective. It is 11:16am of August 22nd which I used to calculate correct position of the sun and align all shadows according to projected lighting conditions. The color of her skin is set at RGB:xxxxxx randomly" etc.
And it is not too much to ask LLMs for it. LLMs have access to all the information above as they read all the internet. So there is definitely a description of Amsterdam architecture, what a human body looks like or how to correctly estimate time of day based on shadows (and vise versa). The only thing missing is logic that connects all this information and which is applied correctly to generate final image.
I like to think about LLMs as a fancy genius compressing engines. They took all the information in the internet, compressed it and are able to cleverly query this information for end user. It is a tremendously valuable thing, but if intelligence emerges out of it - not sure. Digital information doesn't necessarily contain everything needed to understand how it was generated and why.
I see two approaches for explaining the outcome: 1. Reasoning back on the result and justifying it. 2. Explainability - somehow justifying by looking at which neurons have been called. The first could lead to lying. E.g. think of a high schooler explaining copied homework. While the second one does indeed access the paths influencing the decision, but is a hard task due to the inherent way neural networks work.
> if we asked an LLM to produce an image of a "human woman photorealistic" it produces result
Large language models don't do that. You'd want an image model.
Or did you mean "multi-model AI system" rather than "LLM"?
It might be possible for a language model to paint a photorealistic picture though.
It is not.
You are confusing LLM:s with Generative AI.
No, I'm not confusing it. I realize that LLMs sometimes connect with diffusion models to produce images. I'm talking about language models actually describing pixel data of the image.
Can an LLM use tools like humans do? Could it use an image model as a tool to query the image?
No, a LLM is a Large Language Model.
It can language.
You could teach it to emit patterns that (through other code) invoke tools, and loop the results back to the LLM.
I think it's hard to enumerate the unknown, but I'd personally love to see how models like this perform on things like word problems where you introduce red herrings. Right now, LLMs at large tend to struggle mightily to understand when some of the given information is not only irrelevant, but may explicitly serve to distract from the real problem.
That’s not inability to reason though, that’s having a social context.
Humans also don’t tend to operate in a rigorously logical mode and understand that math word problems are an exception where the language may be adversarial: they’re trained for that special context in school. If you tell the LLM that social context, eg that language may be deceptive, their “mistakes” disappear.
What you’re actually measuring is the LLM defaults to assuming you misspoke trying to include relevant information rather than that you were trying to trick it — which is the social context you’d expect when trained on general chat interactions.
Establishing context in psychology is hard.
o1 already fixed the red herrings...
LLMs are still bound to a prompting session. They can't form long term memories, can't ponder on it and can't develop experience. They have no cognitive architecture.
'Agents' (i.e. workflows intermingling code and calls to LLMs) are still a thing (as shown by the fact there is a post by anthropic on this subject on the front page right now) and they are very hard to build.
Consequence of that for instance: it's not possible to have a LLM explore exhaustively a topic.
LLMs don’t, but who said AGI should come from LLMs alone. When I ask ChatGPT about something “we” worked on months ago, it “remembers” and can continue on the conversation with that history in mind.
I’d say, humans are also bound to promoting sessions in that way.
Last time I used ChatGPT 'memory' feature it got full very quickly. It remembered my name, my dog's name and a couple tobacco casing recipes he came up with. OpenAI doesn't seem to be using embeddings and a vector database, just text snippets it injects in every conversation. Because RAG is too brittle ? The same problem arises when composing LLM calls. Efficient and robust workflows are those whose prompts and/or DAG were obtained via optimization techniques. Hence DSPy.
Consider the following use case: keeping a swimming pool water clean. I can have a long running conversation with a LLM to guide me in getting it right. However I can't have a LLM handle the problem autonomously. I'd like to have it notify me on its own "hey, it's been 2 days, any improvement? Do you mind sharing a few pictures of the pool as well as the ph/chlorine test results ?". Nothing mind-boggingly complex. Nothing that couldn't be achieved using current LLMs. But still something I'd have to implement myself and which turns out to be more complex to achieve than expected. This is the kind of improvement I'd like to see big AI companies going after rather than research-grade ultra smart AIs.
[deleted]
Optimal phenomenological reasoning is going to be a tough nut to crack.
Luckily we don't know the problem exists, so in a cultural/phenomenological sense it is already cracked.
Current AI is good at text but not very good at 3d physical stuff like fixing your plumbing.
Does it include the use of tools to accomplish a task?
Does it include the invention of tools?
kinda interesting, every single CS person (especially phds) when talking about reasoning are unable to concisely quantify, enumerate, qualify, or define reasoning.
people with (high) intelligence talking and building (artificial) intelligence but never able to convincingly explain aspects of intelligence. just often talk ambiguously and circularly around it.
what are we humans getting ourselves into inventing skynet :wink.
its been an ongoing pet project to tackle reasoning, but i cant answer your question with regards to llms.
>> Kinda interesting, every single CS person (especially phds) when talking about reasoning are unable to concisely quantify, enumerate, qualify, or define reasoning.
Kinda interesting that mathematicians also can't do the same for mathematics.
And yet.
Mathematicians absolutely can, it's called foundations, and people actively study what mathematics can be expressed in different foundations. Most mathematicians don't care about it though for the same reason most programmers don't care about Haskell.
I don't care about Haskell either, but we know what reasoning is [1]. It's been studied extensively in mathematics, computer science, psychology, cognitive science and AI, and in philosophy going back literally thousands of years with grandpapa Aristotle and his syllogisms. Formal reasoning, informal reasoning, non-monotonic reasoning, etc etc. Not only do we know what reasoning is, we know how to do it with computers just fine, too [2]. That's basically the first 50 years of AI, that folks like His Nobelist Eminence Geoffrey Hinton will tell you was all a Bad Idea and a total failure.
Still somehow the question keeps coming up- "what is reasoning". I'll be honest and say that I imagine it's mainly folks who skipped CS 101 because they were busy tweaking their neural nets who go around the web like Diogenes with his lantern, howling "Reasoning! I'm looking for a definition of Reasoning! What is Reasoning!".
I have never heard the people at the top echelons of AI and Deep learning - LeCun, Schmidhuber, Bengio, Hinton, Ng, Hutter, etc etc- say things like that: "what's reasoning". The reason I suppose is that they know exactly what that is, because it was the one thing they could never do with their neural nets, that classical AI could do between sips of coffee at breakfast [3]. Those guys know exactly what their systems are missing and, to their credit, have never made no bones about that.
_________________
[1] e.g. see my profile for a quick summary.
[2] See all of Russeel & Norvig, as a for instance.
[3] Schmidhuber's doctoral thesis was an implementation of genetic algorithms in Prolog, even.
i have a question for you, in which ive asked many philosophy professors but none could answer satisfactorily. since you seem to have a penchant for reasoning perhaps you might have a good answer. (i hope i remember the full extent of the question properly, i might hit you up with some follow questions)
it pertains to the source of the inference power of deductive inference. do you think all deductive reasoning originated inductively? like when some one discovers a rule or fact that seemingly has contextual predictive power, obviously that can be confirmed inductively by observations, but did that deductive reflex of the mind coagulate by inductive experiences. maybe not all deductive derivative rules but the original deductive rules.
I'm sorry but I have no idea how to answer your question, which is indeed philosophical. You see, I'm not a philosopher, but a scientist. Science seeks to pose questions, and answer them; philosophy seeks to pose questions, and question them. Me, I like answers more than questions so I don't care about philosophy much.
well yeah its partially philosphical, i guess my haphazard use of language like “all” makes it more philosophical than intended.
but im getting at a few things. one of those things is neurological. how do deductive inference constructs manifest in neurons and is it really inadvertently an inductive process that that creates deductive neural functions.
other aspect of the question i guess is more philosophical. like why does deductive inference work at all, i think clues to a potential answer to that can be seen in the mechanics of generalization of antecedents predicting(or correlating with) certain generalized consequences consistently. the brain coagulates generalized coinciding concepts by reinforcement and it recognizes or differentiates inclusive instances or excluding instances of a generalization by recognition properties that seem to gatekeep identities accordingly. its hard to explain succinctly what i mean by the latter, but im planning on writing an academic paper on that.
I'm sorry, I don't have the background to opine on any of the subjects you discuss. Good luck with your paper!
>Those guys know exactly what their systems are missing
If they did not actually, would they (and you) necessarily be able to know?
Many people claim the ability to prove a negative, but no one will post their method.
To clarify, what neural nets are missing is a capability present in classical, logic-based and symbolic systems. That's the ability that we commonly call "reasoning". No need to prove any negatives. We just point to what classical systems are doing and ask whether a deep net can do that.
Do Humans have this ability called "reasoning"?
well lets just say i think i can explain reasoning better than anyone ive encountered. i have my own hypothesized theory on what it is and how it manifests in neural networks.
i doubt your mathmatician example is equivalent.
examples that are fresh on the mind that further my point. ive heard yann lecun baffled by llms instantiation/emergence of reasoning, along with other ai researchers. eric Schmidt thinks the agentic reasoning is the current frontier and people should be focusing on that. was listening to the start of an ai machine learning interview a week ago with some cs phd asked to explain reasoning and the best he could muster up is you know it when you see it…. not to mention the guy responding to the grandparent that gave a cop out answer ( all the most respect to him).
>> well lets just say i think i can explain reasoning better than anyone ive encountered. i have my own hypothesized theory on what it is and how it manifests in neural networks.
I'm going to bet you haven't encountered the right people then. Maybe your social circle is limited to folks like the person who presented a slide about A* to a dumb-struck roomfull of Deep Learning researchers, in the last NeurIps?
possibly, my university doesn’t really do ai research beyond using it as a tool to engineer things. im looking to transfer to a different university.
but no, my take on reasoning is really a somewhat generalized reframing of the definition of reasoning (which you might find on the stanford encylopedia of philosophy) thats reframed partially in axiomatic building blocks of neural network components/terminology. im not claiming to have discovered reasoning, just redefine it in a way thats compatible and sensible to neural networks (ish).
Well you're free to define and redefine anything and as you like, but be aware that every time you move the target closer to your shot you are setting yourself up for some pretty strong confirmation bias.
yeah thats why i need help from the machine interpretability crowd to make sure my hypothesized reframing of reasoning has sufficient empirical basis and isn’t adrift in lalaland.
Care to enlighten us with your explanation of what "reasoning" is?
terribly sorry to be such a tease, but im looking to publish a paper on it, and still need to delve deeper into machine interpretability to make sure its empirically properly couched. if u can help with that perhaps we can continue this convo in private.
I'd like to see this o3 thing play 5d chess with multiverse time travel or baba is you.
The only effect smarter models will have is that intelligent people will have to use less of their brain to do their work. As has always been the case, the medium is the message, and climate change is one of the most difficult and worst problems of our time.
If this gets software people to quit en-masse and start working in energy, biology, ecology and preservation? Then it has succeeded.
> climate change is one of the most difficult and worst problems of our time.
Slightly surprised to see this view here.
I can think of half a dozen more serious problems off hand (e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself) along most axes I can think of (raw $ cost, QALYs, even X-risk).
None of those problems really matter if we don't have a planet to live on
Still it's comparing average human level performance with best AI performance. Examples of things o3 failed at are insanely easy for humans.
You'd be surprised what the AVERAGE human fails to do that you think is easy, my mom can't fucking send an email without downloading a virus, i have a coworker that believes beyond a shadow of a doubt the world is flat.
The Average human is a lot dumber than people on hackernews and reddit seem to realize, shit the people on mturk are likely smarter than the AVERAGE person
Not being able to send an email or believing the world is flat it’s not a sign of intelligence, I’d rather say it’s more about culture or being more or less scholarized. Your mom or coworker still can do stuff instinctively that is outperforming every algorithm out there and still unexplained how we do it. We still have no idea what intelligence is
Yet the average human can drive a car a lot better than ChatGPT can, which shows that the way you frame "intelligence" dictates your conclusion about who is "intelligent".
Pretty sure a waymo car drives better than an average SF driver.
And how well would a Waymo car do in this challenge with the ARC-AGI datasets?
Waymo cannot handle poor weather at all, average human can.
Being able to perform better than humans in specific constrained problem space is how every automation system has been developed.
While self driving systems are impressive, they don’t drive with anywhere close to skills of the average driver
Waymo blog with video of them driving in poor weather https://waymo.com/blog/2019/08/waymo-and-weather
And nikola famously made a video of a truck using one which had no engine, we don’t take a company word for anything until we can verify.
This is not offered to public, they are actively expanding in only cities like LA , Miami or Phoenix now where weather is good through the year.
The tech for bad weather is nowhere close to ready for public. Average human on other hand is driving in bad weather every day
"Extreme Weather" tech "will be available to riders in the near future" https://www.cnet.com/roadshow/news/waymos-latest-robotaxi-is...
I'm sure the source of that CNET article came with a forward looking statements disclaimer.
There's a reason why Waymo isn't offered in Buffalo.
Is that reason because Buffalo is the 81st most populated city in the United States, or 123rd by population density, and Waymo currently only serves approximately 3 cities in North America?
We already let computers control cars because they're better than humans at it when the weather is inclement. It's called ABS.
I would guess you haven't spent much time driving in the winter in the Northeast.
There is an inherent danger to driving in snow and ice. It is a PR nightmare waiting to happen because there is no way around accidents if the cars are on the road all the time in rust belt snow.
I get the feeling that the years I spent in Boston with a car including during the winter and driving to Ithaca somehow aren't enough, but whether or not I have is irrelevant. Still, I'll repeat the advice I was given before you have to drive in snow, go practice driving in the snow (in eg a parking lot) before needing to do so, esp during a storm. Waymo's been spotted driving in Buffalo doing testing, so it seems someone gave them similar advice. https://www.wgrz.com/article/tech/waymo-self-driving-car-pro...
There's always an inherent risk to driving, even in sunny Phoenix, Az. Winter dangers like black ice further multiply that risk but humans still manage to drive in winter. Taking a picture/video of a snowed over road and judging the width and inventing lanes based on the width taking into account snowbanks doesn't take an ML algorithm. Lidar can see black ice while human eyes can not, giving cars equipped with lidar (wether driven by a human or a computer) an advantage over those without it, and Waymo cars currently have lidar.
I'm sure there are new challenges for Waymo to solve before deploying the service in Buffalo, but it's not this unforeseen gotcha parent comment implies.
As far as the possible PR nightmare, you'd never do self driving cars in the first place if you let that fear control you because, you you pointed out, driving on the roads is inherently dangerous with too many unforeseen complications.
If you take an electrical sensory input signal sequence, and transform it to a electrical muscle output signal sequence you've got a brain. ChatGPT isn't going to drive a car because it's trained on verbal tokens, and it's not optimized for the type of latency you need for physical interaction.
And the brain doesn't use the same network to do verbal reasoning as real time coordination either.
But that work is moving along fine. All of these models and lessons are going to be combined into AGI. It is happening. There isn't really that much in the way.
Maybe, but no doubt these "dumb" people can still get dressed in the morning, navigate a trip to the mall, do the dishes, etc, etc.
It's always been the case that the things that are easiest for humans are hardest for computers, and vice versa. Humans are good at general intelligence - tackling semi-novel problems all day long, while computers are good at narrow problems they can be trained on such as chess or math.
The majority of the benchmarks currently used to evaluate these AI models are narrow skills that the models have been trained to handle well. What'll be much more useful will be when they are capable of the generality of "dumb" tasks that a human can do.
Your examples are just examples of lack of information. That's not a measure for intelligence.
As a contrary point, most people think they are smarter than they really are.
There are things Chimps do easily that humans fail at, and vice/versa of course.
There are blind spots, doesn't take away from 'general'.
We can't agree whether Portia spiders are intelligent or just have very advanced instincts. How will we ever agree about what human intelligence is, or how to separate it from cultural knowledge? If that even makes sense.
I guess my point is more, if we can't decide about Portia Spiders or Chimps, then how can we be so certain about AI. So offering up Portia and Chimps as counter examples.
[deleted]
The downvotes should tell you, this is a decided "hype" result. Don't poo poo it, that's not allowed on AI slop posts on HN.
Yeah, I didn't realize Chimp studies, or neuroscience were out of vogue. Even in tech, people form strong 'beliefs' around what they think is happening.
What’s interesting is it might be very close to human intelligence than some “alien” intelligence, because after all it is a LLM and trained on human made text, which kind of represents human intelligence.
In that vein, perhaps the delta between o3 @ 87.5% and Human @ 85% represents a deficit in the ability of text to communicate human reasoning.
In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.
It's possible humans reason better through text than not through text, so these models, having been trained on text, should be able to out-reason any person who's not currently sitting down to write.
I wonder how much of an effect amount of time to answer has on human performance.
Yeah, this is sort of meaningless without some idea of cost or consequences of a wrong answer. One of the nice things about working with a competent human is being able to tell them "all of our jobs are on the line" and knowing with certainty that they'll come to a good answer.
Agreed. I think what really makes them alien is everything else about them besides intelligence. Namely, no emotional/physiological grounding in empathy, shame, pride, and love (on the positive side) or hatred (negative side).
Human performance is much closer to 100% on this, depending on your human. It's easy to miss the dot in the corner of the headline graph in TFA that says "STEM grad."
A fair comparison might be average human. The average human isn't a STEM grad. It seems STEM grad approximately equals an IQ of 130. https://www.accommodationforstudents.com/student-blog/the-su...
From a post elsewhere the scores on ARC-AGI-PUB are approx average human 64%, o3 87%. https://news.ycombinator.com/item?id=42474659
Though also elsewhere, o3 seems very expensive to operate. You could probably hire a PhD researcher for cheaper.
Why would an average human be more fair than a trained human? The model is trained.
It's not saturated. 85% is average human performance, not "best human" performance. There is still room for the model to go up to 100% on this eval.
Curious about how many tests were performed. Did it consistently manage to successfully solve many of these types of problems?
NNs are not algorithms.
Deterministic (ieee 754 floats), terminates on all inputs, correctness (produces loss < X on N training/test inputs)
At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic where provable bounds are less useful than Nth-percentile performance on real-world data.
An algorithm is “a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer”
How does a giant pile of linear algebra not meet that definition?
It's not made of "steps", it's an almost continuous function to its inputs. And a function is not an algorithm: it is not an object made of conditions, jumps, terminations, ... Obviously it has computation capabilities and is Turing-complete, but is the opposite of an algorithm.
If it wasn’t made of steps then Turing machines wouldn’t be able to execute them.
Further, this is probably running an algorithm on top of an NN. Some kind of tree search.
I get what you’re saying though. You’re trying to draw a distinction between statistical methods and symbolic methods. Someday we will have an algorithm which uses statistical methods that can match human performance on most cognitive tasks, and it won’t look or act like a brain. In some sense that’s disappointing. We can build supersonic jets without fully understanding how birds fly.
Let's see that Turing machines can approximate the execution of NN :) That's why there are issues related to numerical precision, but the contrary is also true indeed, NNs can discover and use similar techniques used by traditional algorithms. However: the two remain two different methods to do computations, and probably it's not just by chance that many things we can't do algorithmically, we can do with NNs, what I mean is that this is not just related to the fact that NNs discover complex algorithms via gradient descent, but also that the computational model of NNs is more adapt to solving certain tasks. So the inference algorithm of NNs (doing multiplications and other batch transformations) is just needed for standard computers to approximate the NN computational model. You can do this analogically, and nobody would claim much (maybe?) it's running an algorithm. Or that brains themselves are algorithms.
Computers can execute precise computations, it's just not efficient (and it's very much slow).
NNs are exactly what "computers" are good for and we've been using since their inception: doing lots of computations quickly.
"Analog neural networks" (brains) work much differently from what are "neural networks" in computing, and we have no understanding of their operation to claim they are or aren't algorithmic. But computing NNs are simply implementations of an algorithm.
Edit: upon further rereading, it seems you equate "neural networks" with brain-like operation. But brain was an inspiration for NNs, they are not an "approximation" of it.
But the inference itself is orthogonal to the computation the NN is going. Obviously the inference (and training) are algorithms.
NN inference is an algorithm for computing an approximation of a function with a huge number of parameters. The NN itself is of course just a data structure. But there is nothing whatsoever about the NN process that is non-algorithmic.
It's the exact same thing as using a binary tree to discover the lowest number in some set of numbers, conceptually: you have a data structure that you evaluate using a particular algorithm. The combination of the algorithm and the construction of the data structure arrive at the desired outcome.
That's not the point, I think: you can implement the brain in BASIC, in theory, this does not means that the brain is per-se a BASIC program. I'll provide a more theoretical framework for reasoning about this: if the way to solve certain problems by an NN (the learned weights) can't be translated in some normal program that DOES NOT resemble the activation of an NN, then the NNs are not algorithms, but a different computational model.
This may be what they were getting it, but it is still wrong. An NN is a computable function. So, NN inference is an algorithm for computing the function the NN represents. If we have an NN that represents a function f, with f(text) = most likely next character a human would write, then running the inference for that NN is an algorithm for finding out which character it's most likely a human would write next.
It's true that this is not an "enlightening" algorithm, it doesn't help us understand why or how that is the most likely next character. But this doesn't mean it's not an algorithm.
[deleted]
We don’t have evidence that a TM can simulate a brain. But we know for a fact that it can execute a NN.
[deleted]
Each layer of the network is like a step, and each token prediction is a repeat of those layers with the previous output fed back into it. So you have steps and a memory.
> It's not made of "steps", it's an almost continuous function to its inputs.
Can you define "almost continuous function"? Or explain what you mean by this, and how it is used in the A.I. stuff?
Well, it's a bunch of steps, but they're smaller. /s
I would say you are right that function is not an algorithm, but it is an implementation of an algorithm.
Is that your point?
If so, I've long learned to accept imprecise language as long as the message can be reasonably extracted from it.
> continuous
So, steps?
"Continuous" would imply infinitely small steps, and as such, would certainly be used as a differentiator (differential? ;) between larger discrete stepped approach.
In essence, infinite calculus provides a link between "steps" and continuous, but those are different things indeed.
How do you define "algorithm"? I suspect it is a definition I would find somewhat unusual. Not to say that I strictly disagree, but only because to my mind "neural net" suggests something a bit more concrete than "algorithm", so I might instead say that an artificial neural net is an implementation of an algorithm, rather than or something like that.
But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.
NN is a very wide term applied in different contexts.
When a NN is trained, it produces a set of parameters that basically define an algorithm to do inference with: it's a very big one though.
We also call that a NN (the joy of natural language).
Running inference on a model certainly is a algorithm.
I’ll believe it when the AI can earn money on its own. I obviously don’t mean someone paying a subscription to use the AI I mean, letting the AI lose on the Internet with only the goal of making money and putting it into a bank account.
You don't think there are already plenty of attempts out there?
When someone is "disinterested enough" to publish though, note the obvious way to launch a new fund or advisor with a good track record: crank out a pile of them, run them one or two years, discard the many losers and publish the one or two top winners. I.E. first you should be suspicious of why it's being published, then of how selected that result is.
Do trading bots count?
No, the AI would have to start from zero and reason it's way to making itself money online, such as the humans who were first in their online field of interest (e-commerce, scams, ads etc from the 80's and 90's) when there was no guidance, only general human intelligence that could reason their way into money making opportunities and reason their way into making it work.
I don't think humans ever do that. They research/read and ask other humans.
Which AI already has stored in spades, even more so since people in the 80's 90's weren't working with the information available today. The AI is free to research and read all the information stored from other humans as well, just like the humans who reasoned their way into money making opportunities--only with vastly more information now, talk about an advantage. But is it intelligent enough do so without a human giving direct/step-by-step instructions; the way humans figure it out?
It actually beats the human average by a wide margin:
- 64.2% for humans vs. 82.8%+ for o3.
...
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
Super human isn't beating rando mech turk.
Their post has stem grad at nearly 100%
This is correct. It's easy to get arbitrarily bad results on Mechanical Turk, since without any quality control people will just click as fast as they can to get paid (or bot it and get paid even faster).
So in practice, there's always some kind of quality control. Stricter quality control will improve your results, and the right amount of quality control is subjective. This makes any assessment of human quality meaningless without explanation of how those humans were selected and incentivized. Chollet is careful to provide that, but many posters here are not.
In any case, the ensemble of task-specific, low-compute Kaggle solutions is reportedly also super-Turk, at 81%. I don't think anyone would call that AGI, since it's not general; but if the "(tuned)" in the figure means o3 was tuned specifically for these tasks, that's not obviously general either.
This is so strange. people think that an llm trained on programming questions and docs can do mundane tasks like this means intelligent? Come on.
It really calls into question two things.
1. You don't know what you're talking about about.
2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.
Either way, not a good look.
This
Let me go against some skeptics and explain why I think full o3 is pretty much AGI or at least embodies most essential aspects of AGI.
What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)
ARC has been challenging precisely because solving its problems often requires:
1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND
2) using the right level(s) of abstraction
Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve.
It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.
[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...
ADDED:
Thanks to the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.
I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
Quote from the creators of the AGI-ARC benchmark: "Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."
I like the notion, implied in the article, that AGI won't be verified by any single benchmark, but by our collective inability to come up with benchmarks that defeat some eventual AI system. This matches the cat-and-mouse game we've been seeing for a while, where benchmarks have to constantly adapt to better models.
I guess you can say the same thing for the Turing Test. Simple chat bots beat it ages ago in specific settings, but the bar is much higher now that the average person is familiar with their limitations.
If/once we have an AGI, it will probably take weeks to months to really convince ourselves that it is one.
I'd need to see what kinds of easy tasks those are and would be happy to revise my hypothesis if that's warranted.
Also, it depends a great deal on what we define as AGI and whether they need to be a strict superset of typical human intelligence. o3's intelligence is probably superhuman in some aspects but inferior in others. We can find many humans who exhibit such tendencies as well. We'd probably say they think differently but would still call them generally intelligent.
They're in the original post. Also here: https://x.com/fchollet/status/1870172872641261979 / https://x.com/fchollet/status/1870173137234727219
Personally, I think it's fair to call them "very easy". If a person I otherwise thought was intelligent was unable to solve these, I'd be quite surprised.
Thanks! I've analyzed some easy problems that o3 failed at. They involve spatial intelligence including connection and movement. This skill is very hard to learn from textual and still image data.
I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.
(OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
> I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.
Maybe! I suppose time will tell. That said, spatial intelligence (connection/movement included) is the whole game in this evaluation set. I think it's revealing that they can't handle these particular examples, and problematic for claims of AGI.
Probably just not trained on this kind of data. We could create a benchmark about it, and they'd shatter it within a year or so.
I'm starting to really see no limits on intelligence in these models.
Doesn't the fact that it can only accomplish tasks with benchmarks imply that it has limitations in intelligence?
> Doesn't the fact that it can only accomplish tasks with benchmarks
That's not a fact
[deleted]
> This skill is very hard to learn from textual and still image data.
I had the same take at first, but thinking about it again, I'm not quite sure?
Take the "blue dots make a cross" example (the second one). The inputs only has four blue dots, which makes it very easy to see a pattern even in text data: two of them have the same x coordinate, two of them have the same y (or the same first-tuple-element and second-tuple-element if you want to taboo any spatial concepts).
Then if you look into the output, you can notice that all the input coordinates are also in the output set, just not always with the same color. If you separate them into "input-and-output" and "output-only", you quickly notice that all of the output-only squares are blue and share a coordinate (tuple-element) with the blue inputs. If you split the "input-and-output" set into "same color" and "color changed", you can notice that the changes only go from red to blue, and that the coordinates that changed are clustered, and at least one element of the cluster shares a coordinate with a blue input.
Of course, it's easy to build this chain of reasoning in retrospect, but it doesn't seem like a complete stretch: each step only requires noticing patterns in the data, and it's how a reasonably puzzle-savvy person might solve this if you didn't let them draw the squares on papers. There are a lot of escape games with chains of reasoning much more complex and random office workers solve them all the time.
The visual aspect makes the patterns jump to us more, but the fact that o3 couldn't find them at all with thousands of dollars of compute budget still seems meaningful to me.
EDIT: Actually, looking at Twitter discussions[1], o3 did find those patterns, but was stumped by ambiguity in the test input that the examples didn't cover. Its failures on the "cascading rectangles" example[2] looks much more interesting.
[1]: https://x.com/bio_bootloader/status/1870339297594786064
[deleted]
Yeah the real goalpost is reliable intelligence. A supposed phd level AI failing simple problems is a red flag that we’re still missing something.
You've never met a Doctor who couldn't figure out how to work their email? Or use street smarts? You can have a PHD but be unable to reliably handle soft skills, or any number of things you might 'expect' someone to be able to do.
Just playing devils' advocate or nitpicking the language a bit...
An important distinction here is you’re comparing skill across very different tasks.
I’m not even going that far, I’m talking about performance on similar tasks. Something many people have noticed about modern AI is it can go from genius to baby-level performance seemingly at random.
Take self driving cars for example, a reasonably intelligent human of sound mind and body would never accidentally mistake a concrete pillar for a road. Yet that happens with self-driving cars, and seemingly here with ARC-AGI problems which all have a similar flavor.
A coworker of mine has a phd in physics. Showing the difference to him between little and big endian in a hex editor, showing file sizes of raw image files and how to compute it... I explained 3 times and maybe he understood part of it now.
Doctors[1] or say pilots are skilled professions and difficult to master and deserve respect yes , but they do not need high levels of intelligence to be good at. They require many other skills like taking decisions under pressure or good motor skills that are hard, but not necessarily intelligence.
Also not knowing something is hardly a criteria , skilled humans focus on their areas of interest above most other knowledge and can be unaware of other subjects.
Fields medal winners for example may not be aware of most pop culture things doesn’t make them not able to do so, just not interested
—-
[1] most doctors including surgeons and many respected specialists, some doctors however do need that skills but those are specialized few and generally do know how to use email
good nit pick.
A PHD learnt their field. If they learnt that field, reasoning through everything to understand their material, then - given enough time - they are capable of learning email and street smarts.
Which is why a reasoning LLM, should be able to do all of those things.
Its not learnt a subject, its learnt reasoning.
they say it isn't AGI but i think the way o3 functions can be refined to AGI - it's learning to solve a new, novel problems. we just need to make it do that more consistently, which seems achievable
Have we really watered down the definition of AGI that much?
LLMs aren't really capable of "learning" anything outside their training data. Which I feel is a very basic and fundamental capability of humans.
Every new request thread is a blank slate utilizing whatever context you provide for the specific task and after the tread is done (or context limit runs out) it's like it never happened. Sure you can use databases, do web queries, etc. but these are inflexible bandaid solutions, far from what's needed for AGI.
> LLMs aren't really capable of "learning" anything outside their training data.
ChatGPT has had for some time the feature of storing memories about its conversations with users. And you can use function calling to make this more generic.
I think drawing the boundary at “model + scaffolding” is more interesting.
Calling the sentence or two it arbitrarily saves when you statd your preferences and profile info "memories" is a stretch.
True equivalent to human memories would require something like a multimodal trillion token context window.
RAG is just not going to cut it, and if anything will exacerbated problems with hallucinations.
Well, now you’ve moved the goalposts from “learn anything” to “learn at human level”. Sure, they don’t have that yet.
[deleted]
Thats the whole point of llama index? I can connect my LLM to any node or context i want. Syncing it to a real time data flow like an API and it can learn...? How is that different than a human?
Once optimus is up an working by the 100k+, the spatial problems will be solved. We just don't have enough spatial awareness data, or for a way for the LLM to learn about the physical world.
That's true for vanilla LLMs, but also keep in mind that there are no details about o3's architecture at the moment. Clearly they are doing something different given the huge performance jump on a lot of benchmarks, and it may well involve in-context learning.
Given every other iteration has basically just been the same thing but bigger, why should we think this?
My point was to caution against being too confident about the underlying architecture, not to argue for any particular alternative.
Your statement is false - things changed a lot between gpt4 and o1 under the hood, but notably not a larger model size. In fact the model size of o1 is smaller than gpt4 by several orders of magnitude! Improvements are being made in other ways.
What's your explanation for why it can only get ~70% on SWE-bench Verified?
I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems, and to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can say what was Github issue #4145 in project foo, and there's a decent chance it can tell you exactly what the issue was about!)
I've spent tons of time evaluating o1-preview on SWEBench-Verified.
For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.
For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect its more than on a simpler benchmark like MATH.
So what percentage would you say falls to simple inability versus the other two factors you've mentioned?
One possibility is that it may not yet have sufficient experience and real-world feedback for resolving coding issues in professional repos, as this involves multiple steps and very diverse actions (or branching factor, in AI terms). They have committed to not training on API usage, which limits their ability to directly acquire training data from it. However, their upcoming agentic efforts may address this gap in training data.
Right, but the branching factor increases exponentially with the scope of the work.
I think it's obvious that they've cracked the formula for solving well-defined, small-in-scope problems at a superhuman level. That's an amazing thing.
To me, it's less obvious that this implies that they will in short order with just more training data be able to solve ambiguous, large-in-scope problems at even just a skilled human level.
There are far more paths to consider, much more context to use, and in an RL setting, the rewards are much more ambiguously defined.
Their reasoning models can learn from procedures and methods, which generalize far better than data. Software tasks are diverse but most tasks are still fairly limited in scope. Novel tasks might remain challenging for these models, as they do for humans.
That said, o3 might still lack some kind of interaction intelligence that’s hard to learn. We’ll see.
GPQA scores are mostly from pre-training, against content in the corpus. They have gone silent but look at the GPT4 technical report which calls this out.
We are nowhere close to what Sam Altman calls AGI and transformers are still limited to what uniform-TC0 can do.
As an example the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.
As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.
I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.
Heck we aren't close to P with commercial models.
Isn't any physically realizable computer (including our brains) limited to what uniform-TC0 can do?
Neither TC0 nor uniform-TC0 are physically realizable, they are tools not physical devices.
The default nonuniform circuits classes are allowed to have a different circuit per input size, the uniform types have unbounded fan-in
Similar to how a k-tape TM doesn't get 'charged' for the input size.
With Nick Class (NC) the number of components is similar to traditional compute time while depth relates to the ability to parallelize operations.
These are different than biological neurons, not better or worse but just different.
Human neurons can use dendritic compartmentalization, use spike timing, can retime spikes etc...
While the perceptron model we use in ML is useful, it is not able to do xor in one layer, while biological neurons do that without anything even reaching the soma, purely in the dendrites.
Statistical learning models still comes down to a choice function, no matter if you call that set shattering or...
With physical computers the time hierarchy does apply and if TIME(g(n)) is given more time than TIME(f(n)), g(n) can solve more problems.
So you can simulate a NTM with exhaustive search with a physical computer.
Physical computers also tend to have NAND and XOR gates, and can have different circuit depths.
When you are in TC0, you only have AND, OR and Threshold (or majority) gates.
Think of instruction level parallelism in a typical CPU, it can return early, vs Itanium EPIC, which had to wait for the longest operation. Predicated execution is also how GPUs work.
They can send a mask and save on load store ops as an example, but the cost of that parallelism is the consent depth.
It is the parallelism tradeoff that both makes transformers practical as well as limit what they can do.
The IID assumption and autograd requiring smooth manifolds plays a role too.
The frame problem, which causes hard problems to become unsolvable for computers and people alike does also.
But the fact that we have polynomial time solutions for the Boolean Formula Value Problem, as mentioned in my post above is probably a simpler way of realizing physical computers aren't limited to uniform-TC0.
Do you just mean because any physically realizable computer is a finite state machine? Or...?
I wouldn't describe a computer's usual behavior as having constant depth.
It is fairly typical to talk about problems in P as being feasible (though when the constant factors are too big, this isn't strictly true of course).
Just because for unreasonably large inputs, my computer can't run a particular program and produce the correct answer for that input, due to my computer running out of memory, we don't generally say that my computer is fundamentally incapable of executing that algorithm.
>Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforce, AIME, and Frontier Math strongly suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it.
The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?
Great point. I'd love to see what these easy tasks are and would be happy to revise my hypothesis accordingly. o3's intelligence is unlikely to be a strict superset of human intelligence. It is certainly superior to humans in some respects and probably inferior in others. Whether it's sufficiently generally intelligent would be both a matter of definition and empirical fact.
Chollet has a few examples here:
https://x.com/fchollet/status/1870172872641261979
https://x.com/fchollet/status/1870173137234727219
I would definitely consider them legitimately easy for humans.
Thanks! I added some comments on this at the bottom of the post above.
Please stop it calling AGI, we don’t even know or agree universally what that should actually mean. How far did we get with hype calling a lossy probabilistic compressor firing slowly at us words AGI? That’s a real bummer to me
Is this comment voted down because of sentiment / polarity?
Regardless the critical aspect is valid, AGI would be something like Cortana from Halo.
Personally I find "human-level" to be a borderline meaningless and limiting term. Are we now super human as a species relative to ourselves just five years ago because of our advances in developing computer programs that better imitate what many (but far from all) of us were already capable of doing? Have we reached a limit to human potential that can only be surpassed by digital machines? Who decides what human level is and when we have surpassed it? I have seen some ridiculous claims about ai in art that don't stand up to even the slightest scrutiny by domain experts but that easily fool the masses.
No I think we're just tired and depressed as a species... Existing systems work to a degree but aren't living up to their potential of increasing happiness according to technological capabilities.
The problem with ARC is that there are a finite number of heuristics that could be enumerated and trained for, which would give model a substantial leg up on this evaluation, but not be generalized to other domains.
For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.
Perhaps the private data set is different enough that this isn’t a problem, but the ideal situation would be unveiling a truly novel dataset, which it seems like arc aims to do.
on the spatial data i see it as a highly intelligent head of a machine that just needs better limbs and better senses.
i think that's where most hardware startups will specialize with in the coming decades, different industries with different needs.
[deleted]
Great comment. See this as well for another potential reason for failure:
[deleted]
In order to replace actual humans doing their job I think LLMs are lacking in judgement, sense of time and agenticism.
I mean fkcu me when they have those things, however, maybe they are just lazy and their judgement is fine, for a lazy intelligence. Inner-self thinks "why are these bastards asking me to do this? ". I doubt that is actually happening, but now, .. prove it isn't.
[deleted]
Ask o3 is P=NP?
It will just answer with the current consensus on the matter.
[deleted]
how about if it can work at a job? people can do that, can o3 do it?
This is not AGI lmao.
> It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could.
Every human does this dozens, hundreds or thousands of times ... during childhood.
Agree. AGI is here. I feel such a sense of pride in our species.
[deleted]
Incredibly impressive. Still can't really shake the feeling that this is o3 gaming the system more than it is actually being able to reason. If the reasoning capabilities are there, there should be no reason why it achieves 90% on one version and 30% on the next. If a human maintains the same performance across the two versions, an AI with reason should too.
The point of ARC is NOT to compare humans vs AI, but to probe the current boundary of AIs weaknesses. AI has been beating us at specific tasks like handwriting recognition for decades. Rather, it's when we can no longer readily find these "easy for human, hard for AI" reasoning tasks that we must stop and consider.
If you look at the ARC tasks failed by o3, they're really not well suited to humans. They lack the living context humans thrive on, and have relatively simple, analytical outcomes that are readily processed by simple structures. We're unlikely to see AI as "smart" until it can be asked to accomplish useful units of productive professional work at a "seasoned apprentice" level. Right now they're consuming ungodly amounts of power just to pass some irritating, sterile SAT questions. Train a human for a few hours a day over a couple weeks and they'll ace this no problem.
o3 low and high are the same model. Difference is in how long was it allowed to think.
It works the same with humans. If they spend more time on the puzzle they are more likely to solve it.
But does it matter if it "really, really" reasons in the human sense, if it's able to prove some famous math theorem or come up with a novel result in theoretical physics?
While beyond current motels, that would be the final test of AGI capability.
If it's gaming the system, then it's much less likely to reliably come up with novel proofs or useful new theoretical ideas.
That would be important, but as far as I know it hasn’t happened (despite how often it’s intimated that we’re on the verge of it happening).
I've seen one Twitter thread from a mathematician who used an llm to come up with a new math result. Both coming up with the theorem statement and a unique proof,iirc.
Though to be clear, this wasn't a one shot thing - it was iirc a few months of back and forth chats with plenty of wrong turns too.
Then he used it as a random text generator, LLM is by far the most configurable and best random test generators we have. You can use that to generate random theorem noise and then try to work with that to find actual theorems, still doesn't replace mathematicians though.
I think we should let the professional mathematician who says the llm helped him be the judge of how and why it helped.
Found the thread: https://x.com/robertghrist/status/1841462507543949581?s=46&t...
From the thread:
> AI assisted in the initial conjectures, some of the proofs, and most of the applications it was truly a collaborative effort
> i went back and forth between outrageous optimism and frustration through this process. i believe that the current models can reason – however you want to interpret that. i also believe that there is a long way to go before we get to true depth of mathematical results.
Yeah, it really does matter if something was reasoned, or whether it appears if you metaphorically shake the magic 8 ball.
How would gaming the system work here? Is there some flaw in the way the tasks are generated?
AI models have historically found lots of ways to game systems. My favorite example is exploiting bugs in simulator physics to "cheat" at games of computer tag. Another is a model for radiology tasks finding biases in diagnostic results using dates on the images. And of course whenever people discuss a benchmark publicly it leaks the benchmark into the training set, so the benchmark becomes a worse measure.
[dead]
I am not expert in llm reasoning but I think because of RL. You cannot use AlphaZero to play other games.
Nope. AlphaZero taught itself to play games like chess, shogi, and Go through self-play, starting from random moves. It was not given any strategies or human gameplay data but was provided with the basic rules of each game to guide its learning process.
Yes its reinforcement learning, but need to create policy and each policy is specialized for specific tasks.
I thought that AlphaZero could play three games? Go, Chess and Shogi?
Think I mean Catan :)
Humans and AIs are different, the next benchmark would be build so that it emphasize the weak points of current AI models where a human is expected to perform better, but I guess you can also make a benchmark that is the opposite, where humans struggle and o3 has an easy time.
I think you've hit the nail on the head there. If these systems of reasoning are truly general then they should be able to perform consistently in the same way a human does across similar tasks, baring some variance.
Yes, if a system has actually achieved AGI, it is likely to not reveal that information
AGI wouldn't necessarily entail any autonomy or goals though. In principle there could be a superintelligent AI that's completely indifferent to such outcomes, with no particular goals beyond correctly answering question or what not.
AGI is a spectrum, not a binary quality.
Not sure why I am being downvoted. Why would a sufficiently advanced intelligence reveal its full capabilities knowing fully well that it would then be subjected to a range of constraints and restraints?
If you disagree with me, state why instead of opting to downvote me
The cost to run the highest performance o3 model is estimated to be somewhere between $2,000 and $3,400 per task.[1] Based on these estimates, o3 costs about 100x what it would cost to have a human perform the exact same task. Many people are therefore dismissing the near-term impact of these models because of these extremely expensive costs.
I think this is a mistake.
Even if very high costs make o3 uneconomic for businesses, it could be an epoch defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.
Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivilant AIs could be run in parallel per datacenter?
There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.
So if it is true that we've now got something like human-equivilant intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.
Your economic analysis is deeply flawed. If there was anything that valuable and that required that much manpower, it would already have driven up the cost of labor accordingly. The one property that could conceivably justify a substantially higher cost is secrecy. After all, you can't (legally) kill a human after your project ends to ensure total secrecy. But that takes us into thriller novel territory.
I don't think that's right. Free societies don't tolerate total mobilization by their governments outside of war time, no matter how valuable the outcomes might be in the long term, in part because of the very economic impacts you describe. Human-level AI - even if it's very expensive - puts something that looks a lot like total mobilization within reach without the societal pushback. This is especially true when it comes to tasks that society as a whole may not sufficiently value, but that a state actor might value very much, and when paired with something like a co-located reactor and data center that does not impact the grid.
That said, this is all predicated on o3 or similar actually having achieved human level reasoning. That's yet to be fully proven. We'll see!
This is interesting to consider, but I think the flaw here is that you'd need a "total mobilization" level workforce in order to build this mega datacenter in the first place. You put one human-hour into making B200s and cooling systems and power plants, you get less than one human-hour-equivalent of thinking back out.
No you don’t. The US government has already completed projects at this scale without total economic mobilization: https://en.wikipedia.org/wiki/Utah_Data_Center Presumably peer and near-peer states are similarly capable.
A private company, xAI, was able to build a datacenter on a similar scale in less than 6 months, with integrated power supply via large batteries: https://www.tomshardware.com/desktops/servers/first-in-depth...
Datacenter construction is a one-time cost. The intelligence the datacenter (might) provide is ongoing. It’s not an equal one to one trade, and well within reach for many state and non-state actors if it is desired.
It’s potentially going to be a very interesting decade.
i disagree because the job market is not a true free market. I mean it mostly is, but there’s a LOT of politics and shady stuff that employers do to purposely drive wages down. Even in the tech sector.
Your secrecy comment is really intriguing actually. And morbid lol.
How many 99.9th percentile mathematicians do nation states normally have access to?
Direct quote from the ARC-AGI blog:
“SO IS IT AGI?
ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.
Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”
The high compute variant sounds like it costed around *$350,000* which is kinda wild. Lol the blog post specifically mentioned how OpenAPI asked ARC-AGI to not disclose the exact cost for the high compute version.
Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned” (this was not displayed in the live demo graph). This suggest in those cases that the model was trained to better handle these types of questions, so I do wonder about data / answer contamination in those cases…
> Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned”
Something I missed until I scrolled back to the top and reread the page was this
> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set
So yeah, the results were specifically from a version of o3 trained on the public training set
Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.
On the other hand though, I don't think the o1 models nor Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.
Lol I missed that even though it's literally the first sentence of the blog, good catch.
Yeah, that makes this result a lot less impressive for me.
ARC co-founder Mike Knoop
"Raising visibility on this note we added to address ARC "tuned" confusion:
> OpenAI shared they trained the o3 we tested on 75% of the Public Training set.
This is the explicit purpose of the training set. It is designed to expose a system to the core knowledge priors needed to beat the much harder eval set.
The idea is each training task shows you an isolated single prior. And the eval set requires you to recombine and abstract from those priors on the fly. Broadly, the eval tasks require utilizing 3-5 priors.
The eval sets are extremely resistant to just "memorizing" the training set. This is why o3 is impressive." https://x.com/mikeknoop/status/1870583471892226343
Great catch. Super disappointing that AI companies continue to do things like this. It’s a great result either way but predictably the excitement is focused on the jump from o1, which is now in question.
To me it's very frustrating because such little caveats make benchmarks less reliable. Implicitly, benchmarks are no different from tests in that someone/something who scores high on a benchmark/test should be able to generalize that knowledge out into the real world.
While that is true with humans taking tests, it's not really true with AIs evaluating on benchmarks.
SWE-bench is a great example. Claude Sonnet can get something like a 50% on verified, whereas I think I might be able to score a 20-25%? So, Claude is a better programmer than me.
Except that isn't really true. Claude can still make a lot of clumsy mistakes. I wouldn't even say these are junior engineer mistakes. I've used it for creative programming tasks and have found one example where it tried to use a library written for d3js for a p5js programming example. The confusion is kind of understandable, but it's also a really dumb mistake.
Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.
Or maybe benchmarks are just bad at measuring intelligence in general.
Regardless, every time a model beats a benchmark I'm annoyed by the fact that I have no clue whatsoever how much this actually translates into real world performance. Did OpenAI/Anthropic/Google actually create something that will automate wide swathes of the software engineering profession? Or did they create the world's most knowledgeable junior engineer?
> Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.
My understanding is that it works by checking if the proposed solution passes test-cases included in the original (human) PR. This seems to present some problems too, because there are surely ways to write code that passes the tests but would fail human review for one reason or another. It would be interesting to not only see the pass rate but also the rate at which the proposed solutions are preferred to the original ones (preferably evaluated by a human but even an LLM comparing the two solutions would be interesting).
If I recall correctly the authors of the benchmark did mention on Twitter that for certain issues models will submit an answer that technically passes the test but is kind of questionable, so yeah, good point.
> acid test
The css acid test? This can be gamed too.
https://en.wikipedia.org/wiki/Acid_test:
> An acid test is a qualitative chemical or metallurgical assay utilizing acid. Historically, it often involved the use of a robust acid to distinguish gold from base metals. Figuratively, the term represents any definitive test for attributes, such as gauging a person's character or evaluating a product's performance.
Specifically here, they're using the figurative sense of "definitive test".
also a "litmus test" but I guess that's a different chemistry test...
Sad to see everyone so focused on compute expense during this massive breakthrough. GPT-2 originally cost $50k to train, but now can be trained for ~$150.
The key part is that scaling test-time compute will likely be a key to achieving AGI/ASI. Costs will definitely come down as is evidenced by precedents, Moore’s law, o3-mini being cheaper than o1 with improved performance, etc.
It’s wild, are people purposefully overlooking that inference costs are dropping 10-100x each year?
https://a16z.com/llmflation-llm-inference-cost/
Look at the log scale slope, especially the orange MMLU > 83 data points.
Those are the (subsidized) prices that end clients are paying for the service so that's not something that is representative of what the actual inference costs are. Somebody still needs to pay that (actual) price in the end. For inference, as well as for training, you need actual (NVidia) hardware and that hardware didn't become any cheaper. OTOH models are only becoming increasingly more complex and bigger and with more and more demand I don't see those costs exactly dropping down.
Actual inference costs without considering subsidies and loss leaders are going down, due to algorithmic improvements, hardware improvements, and quantized/smaller models getting the same performance as larger ones. Companies are making huge breakthroughs making chips specifically for LLM inference
In August 2023, llama2 34B was released and at that time, without employing model quantization, in order to fit this model one needed to have a GPU, or set of GPUs, with total of ~34x2.5=85G of VRAM.
That said, can you be more specific what are those "algorithmic" and "hardware" improvements that has driven this cost and hardware requirements down? AFAIK I still need the same hardware to run this very same model.
Take a look at the latest Llama and Phi models. They get comparable MMLU performance for ~10% of the parameters. Not to mention the cost/flop and cost/gb for GPUs has dropped.
You aren’t trying to run an old 2023 model as is, you’re trying to match its capabilities. The old models just show what capabilities are possible.
Sure, let's say that 8B llama3.1 gets comparable performance of it's 70B llama2 predecessor. Not quite true but let's say that hypothetically it is. That still leave us with 70B llama3.1.
How much VRAM and inference compute is required to run 3.1-70B vs 2-70B?
The argument is that the inference cost is dropping down significantly each year but how exactly if those two models require about the ~same, give or take, amount of VRAM and compute?
One way to drive the cost down is to innovate in inference algorithms such that the HW requirements are loosened up.
In the context of inference optimizations one such is flash-decode, similar to its training counter-part flash-attention, from the same authors. However, that particular optimization concerns only by improving the inference runtime by dropping down the number of memory accesses needed to compute the self-attention. Amount of total VRAM you need in order to just load the model still remains the same so although it is true that you might get a tad more from the same HW, the initial requirement of total HW you need remains to be the same. Flash-decode is also nowhere near the impact of flash-attention. Latter enabled much faster training iteration runtimes while the former has had quite limited impact, mostly because scale of inference is so much smaller than the training so the improvements do not always see the large gains.
> Not to mention the cost/flop and cost/gb for GPUs has dropped.
For training. Not for inference. GPU prices remained about the same, give or take.
A bit early for a every year claim not to mention what all these AI is used for.
In some parts of the internet it’s you hardly find real content only AI spam.
It will get worse the cheaper it gets.
Think of email spam.
I think the question everyone has in their minds isn't "when will AGI get here" or even "how soon will it get here" — it's "how soon will AGI get so cheap that everyone will get their hands on it"
that's why everyone's thinking about compute expense. but I guess in terms of a "lifetime expense of a person" even someone who costs $10/hr isn't actually all that cheap, considering what it takes to grow a human into a fully functioning person that's able to just do stuff
We are nowhere near AGI.
[deleted]
[dead]
Whenever a benchmark that was thought to be extremely difficult is (nearly) solved, it's a mix of two causes. One is that progress on AI capabilities was faster than we expected, and the other is that there was an approach that made the task easier than we expected. I feel like the there's a lot of the former here, but the compute cost per task (thousands of dollars to solve one little color grid puzzle??) suggests to me that there's some amount of the latter. Chollet also mentions ARC-AGI-2 might be more resistant to this approach.
Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.
> the other is that there was an approach that made the task easier than we expected.
from reading Dennett's philosophy, I'm convinced that that's how human intelligence works - for each task that "only a human could do that", there's a trick that makes it easier than it seems. We are bags of tricks.
> We are bags of tricks.
We are trick generators, that is what it means to be a general intelligence. Adding another trick in the bag doesn't make you a general intelligence, being able to discover and add new tricks yourself makes you a general intelligence.
Not the parent, but remembering my reading of Dennett, he was referring to the tricks that we got through evolution, rather than ones we invented ourselves. As particular examples, we have neural functional areas for capabilities like facial recognition and spatial reasoning which seems to rely on dedicated "wetware" somewhat distinct from other parts of the brain.
But humans being able to develop new tricks is core to their intelligence, saying its just a bag of tricks means you don't understand what AGI is. So either the poster misunderstood Dennett or Dennett weren't talking about AGI or Dennett didn't understand this well.
Of course there are many tricks you will need special training for, like many of the skills human share with animals, but the ability to construct useful shareable large knowledge bases based on observations is unique to humans and isn't just a "trick".
Dennett was talking about natural intelligence. I think you're just underestimating the potential of a sufficiently big bag of tricks.
sharing knowledge isn't a human thing - chimps learn from each other. bees teach each other the direction and distance to a new source of food.
we just happen to push the envelope a lot further and managed to kickstart runaway mimetic evolution.
"mimetic" is apt there, but I think that Dennett, as a friend of Dawkins, would say it's "memetic"
nice catch!
generating tricks is itself a trick that relies on an enormous bag of tricks we inherited through evolution by the process of natural selection.
the new tricks don't just pop into our heads even though it seems that way. nobody ever woke up and devised a new trick in a completely new field without spending years learning about that field or something adjacent to it. even the new ideas tend to be an old idea from a different field applied to a new field. tricks stand on the shoulders of giants.
Or the test wasn't testing anything meaningful, which IMO is what happened here. I think ARC was basically looking at the distribution of what AI is capable of, picked an area that it was bad at and no one had cared enough to go solve, and put together a benchmark. And then we got good at it because someone cared and we had a measurement. Which is essentially the goal of ARC.
But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proofpoint that that AI can solve simple problems presented in intentionally opaque ways.
Id agree with you if there hasn’t been very deliberate work towards solving ARC for years, and if thr conceit of the benchmark wasn’t specifically based on a conception of human intuition being, put simply, learning and applying out of distribution rules on the fly. ARC wasn’t some arbitrary inverse set, it was designed to benchmark a fundamental capability of general intelligence
I’m not sure if people realize what a weird test this is. They’re these simple visual puzzles that people can usually solve at a glance, but for the LLMs, they’re converted into a json format, and then the LLMs have to reconstruct the 2D visual scene from the json and pick up the patterns.
If humans were given the json as input rather than the images, they’d have a hard time, too.
> If humans were given the json as input rather than the images, they’d have a hard time, too.
We shine light in text patterns at humans rather than inject the text directly into the brain as well, that is extremely unfair! Imagine how much better humans would be at text processing if we injected and extracted information from their brains using the neurons instead of eyes and hands.
Not sure how much that matters - I'm not an AI expert, but I did some intro courses where we had to train a classifier to recognize digits. How it worked basically was that we fed each pixel of the 2d grid of the image into an input of the network, essentially flattening it in a similar fashion. It worked just fine, and that was a tiny network.
The classifier was likely a convolutional network, so the assumption of the image being a 2D grid was baked into the architecture itself - it didn't have to be represented via the shape of the input for the network to use it.
I don't think so - convolutional neural networks also operate over 1D flat vectors - the spatial relationship of pixels is only learned from the training data.
This is not true. CNNs perform 2D convolution, conceptually "sliding" a 2 dimensional kernel with learnable weights over the input image across two dimensions.
Perhaps it wasn't a convolutional network after all, but a simple fully-connected feed-forward network taking all pixels as input? Could be viable for a toy example (MNIST).
The JSON files still contain images, just not in a regular image format. You have a 2D array of numbers where each number maps to a color. If you really want a regular picture format, you can easily convert the arrays.
I think that's part of what feels odd about this- in some ways it feels like the wrong type of test for an LLM, but in many ways it makes this achievement that much more remarkable
Yeah, this entire thread seems utterly detached from my lived experience. LLMs are immensely useful for me at work but they certainly don't come close to the hype spouted by many commenters here. It would be great if it could handle more of our quite modest codebase but it's not able to yet
ARC is a silly benchmark, the other results in math and coding are much more impressive.
o3 is just o1 scaled up, the main takeaway from this line of work that people should walk away with is that we now have a proven way to RL our way to super human performance on tasks where it’s cheap to sample and easy to verify the final output. Programming falls in that category, they focused on known benchmarks but the same process can be done for normal programs, using parsers, compilers, existing functions and unit tests as verifiers.
Pre o1 we only really had next token prediction, which required high quality human produced data, with o1 you optimize for success instead of MLE of next token. Explained in simpler terms, it means it can get reward for any implementation of a function that reproduces the expected result, instead of the exact implementation in the training set.
Put another way, it’s just like RLHF but instead of optimizing against learned human preferences, the model is trained to satisfy a verifier.
This should work just as well in VLA models for robotics, self driving and computer agents.
How do the organisers keep the private test set private? Does openAI hand them the model for testing?
If they use a model API, then surely OpenAI has access to the private test set questions and can include it in the next round of training?
(I am sure I am missing something.)
I wouldn't be surprised if the term "benchmark fraud" will soon been coined.
Benchmark fraud is not a novel concept. Outside of LLMs for example smartphone manufacturers detect benchmarks and disable or reduce CPU throttling: https://www.theregister.com/2019/09/30/samsung_benchmarking_...
CPU frequency ramp curve is also something that can be adjusted. You want the CPU to ramp up really quickly to make everything feel responsive, but at the same time you want to not have to use so much power from your battery.
If you detect that a benchmark is running then you can just ramp up to max frequency immediately. It’ll show how fast your CPU is, but won’t be representative of the actual performance that users will get from their device.
I suppose that's why they are calling it "semi-private".
And why o3 or any OpenAI llm is not evaluated in the actual private dataset.
These parent comments are the most clear and concise explanation of the difference between semi-private and private datasets that I have seen. Thank you.
If we really want to imagine a cold-war-style solution, the two teams could meet in an empty warehouse, bring one computer with the model, one with the benchmarks, and connect them with a USB cable.
In practice I assume they just gave them the benchmarks and took it on the honor system they wouldn't cheat, yeah. They can always cook up a new test set for next time, it's only 10% of the benchmark content anyway and the results are pretty close.
There's no honor system when there's billions of dollars at stake x) I'm highly highly skeptical of these benchmarks because of intentional cheating and accidental contamination.
They have two sets, a fully private one where the models run isolated and the semi-private one where they run models accessed over the internet.
Isn’t that why they call it “ Semi-Private”?
There’s a fully private test set too as I understand it, that o3 hasn’t run on yet.
And o3 will not run on the private set unless it is a truly free and open source model (presumably also the case for ARC-AGI-2). This is the distinction between private and semi-private. In private you provide all the knowledge/weights/logic to operate without any external communication. Private benchmark results are the only true evaluation of performance on any benchmark -- reserved for a final evaluation. It is the only way to prevent shenanigans.
That is the top question, actually. Given all the billions at stake.
I would like to see this repeated with my highly innovative HARC-HAGI, which is ARC-AGI but it uses hexagons instead of squares. I suspect humans would only make slightly more brain farts on HARC-HAGI than ARC-AGI, but O3 would fail very badly since it almost certainly has been specifically trained on squares.
I am not really trying to downplay O3. But this would be a simple test as to whether O3 is truly "a system capable of adapting to tasks it has never encountered before" versus novel ARC-AGI tasks it hasn't encountered before.
Here's my take - even if the o3 as currently implemented is utterly useless on your HARC-HAGI, it is obvious that o3 coupled with its existing training pipeline trained briefly on the hexagons would excel on it, such that passing your benchmark doesn't require any new technology.
Taking this a level of abstraction higher, I expect that in the next couple of years we'll see systems like o3 given a runtime budget that they can use for training/fine-tuning smaller models in an ad-hoc manner.
Very cool. I recommend scrolling down to look at the example problem that O3 still can’t solve. It’s clear what goes on in the human brain to solve this problem: we look at one example, hypothesize a simple rule that explains it, and then check that hypothesis against the other examples. It doesn’t quite work, so we zoom into an example that we got wrong and refine the hypothesis so that it solves that sample. We keep iterating in this fashion until we have the simplest hypothesis that satisfies all the examples. In other words, how humans do science - iteratively formulating, rejecting and refining hypotheses against collected data.
From this it makes sense why the original models did poorly and why iterative chain of thought is required - the challenge is designed to be inherently iterative such that a zero shot model, no matter how big, is extremely unlikely to get it right on the first try. Of course, it also requires a broad set of human-like priors about what hypotheses are “simple”, based on things like object permanence, directionality and cardinality. But as the author says, these basic world models were already encoded in the GPT 3/4 line by simply training a gigantic model on a gigantic dataset. What was missing was iterative hypothesis generation and testing against contradictory examples. My guess is that O3 does something like this:
1. Prompt the model to produce a simple rule to explain the nth example (randomly chosen)
2. Choose a different example, ask the model to check whether the hypothesis explains this case as well. If yes, keep going. If no, ask the model to revise the hypothesis in the simplest possible way that also explains this example.
3. Keep iterating over examples like this until the hypothesis explains all cases. Occasionally, new revisions will invalidate already solved examples. That’s fine, just keep iterating.
4. Induce randomness in the process (through next-word sampling noise, example ordering, etc) to run this process a large number of times, resulting in say 1,000 hypotheses which all explain all examples. Due to path dependency, anchoring and consistency effects, some of these paths will end in awful hypotheses - super convoluted and involving a large number of arbitrary rules. But some will be simple.
5. Ask the model to select among the valid hypotheses (meaning those that satisfy all examples) and choose the one that it views as the simplest for a human to discover.
I took a look at those examples that o3 can't solve. Looks similar to an IQ-test.
Took me less time to figure out the 3 examples that it took to read your post.
I was honestly a bit surprised to see how visual the tasks were. I had thought they were text based. So now I'm quite impressed that o3 can solve this type of task at all.
I also took some time to look at the ones it couldn't solve. I stopped after this one: https://kts.github.io/arc-viewer/page6/#47996f11
That one's cool. All pink pixels need to be repaired so they match the symmetry in the picture.
That is very unfair to color blind people.
You must be a stem grad! Or perhaps an ensemble of Kaggle submissions?
My initial impression: it's very impressive and very exciting.
My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.
I know my skepticism here is identical to moving goalposts. More and more I am shifting my personal understanding of general intelligence as a phenomenon we will only ever be able to identify with the benefit of substantial retrospect.
As it is with any sufficiently complex program, if you could discern the result beforehand, you wouldn't have had to execute the program in the first place.
I'm not trying to be a downer on the 12th day of Christmas. Perhaps because my first instinct is childlike excitement, I'm trying to temper it with a little reason.
It doesn't need to be general intelligence or perfectly map to human intelligence.
All it needs to be is useful. Reading constant comments about LLMs can't be general intelligence or lack reasoning etc, to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).
It doesn't need to be general intelligence for the rapid advancement of LLM capabilities to be the most societal shifting development in the past decades.
And look at the airplanes, they really can’t just land on a mountain slope or a tree without heavy maintenance afterwards. Those people weren’t all stupid, they questioned the promise of flying servicemen delivering mail or milk to their window and flying on a personal aircar to their workplace. Just like todays promises about whatever the CEOs telltales are. Imagining bullshit isn’t unique to this century.
Aerospace is still a highly regulated area that requires training and responsibility. If parallels can be drawn here, they don’t look so cool for a regular guy.
What people always leave out is that society will bend to the abilities of the new technology. Planes can't land in your backyard so we built airports. We didn't abandon planes.
Yes but the idea was lost in the process. It became a faster transportation system that uses air as a medium, but that’s it. Personal planes are still either big business or an expensive and dangerous personal toy thing. I don’t think it’s the same for LLMs (would be naive). But where are promises like “we’re gonna change travel economics etc”? All headlines scream is “AGI around the corner”. Yeah, now where’s my damn postman flying? I need my mail.
> It became a faster transportation system that uses air as a medium, but that’s it.
On the one hand, yes; on the other, this understates the impact that had.
My uncle moved from the UK to Australia because, I'm told*, he didn't like his mum and travel was so expensive that he assumed they'd never meet again. My first trip abroad… I'm not 100% sure how old I was, but it must have been between age 6 and 10, was my gran (his mum) paying for herself, for both my parents, and for me, to fly to Singapore, then on to various locations in Australia including my uncle, and back via Thailand, on her pension.
That was a gap of around one and a half generations.
* both of them are long-since dead now so I can't ask
[deleted]
Sure, but that also vindicates the GP's point that the initial claims of the boosters for planes contained more than their fair share of bullshit and lies.
This is already happening. A few days ago Microsoft turned down a documentation PR because the formatting was better for humans but worse for LLMs: https://github.com/MicrosoftDocs/WSL/pull/2021#issuecomment-...
They changed their mind after a public outcry including here on HN.
> What people always leave out is that society will bend to the abilities of the new technology.
Do they really? I don't think they do.
> Planes can't land in your backyard so we built airports. We didn't abandon planes.
But then what do you do with the all the fantasies and hype about the new technology (like planes that land in your backyard and you fly them to work)?
And it's quite possible and fairly common that the new technology actually ends up being mostly hype, and there's actually no "airports" use case in the wings. I mean, how much did society "bend to the abilities of" NFTs?
And then what if the mature "airports" use case is actually something most people do not want?
We are slowly discovering that many of our wonderful inventions from 60-80-100 years ago have serious side effects.
Plastics, cars, planes, etc.
One could say that a balanced situation, where vested interests are put back in the box (close to impossible since it would mean fighting trillions of dollars), would mean that for example all 3 in the list above are used a lot less than we use them now, for example. And only used where truly appropriate.
No, we built helicopters.
This pretty much. Everyone knows that LLMs are great for text generation and processing. What people has been questioning is the end goals as promised by its builders, i.e. is it useful? And from most of what I saw, it's very much a toy.
What would you need to see to call it useful?
To give you an example– I've used it for legal work such as an EB2-NIW visa application. Saved me countless of hours. My next visa I'll try to do without a lawyer using just LLMs. I would never try this without having LLMs at my disposal.
As a hobby– And as someone with a scientific background I've been able to build an artificial ecosystem simulation from scratch without programming experience in Rust: https://www.youtube.com/@GenecraftSimulator
I recently moved from fish to plants and believe I've developed some new science at the intersection of CS and Evolutionary Biology that I'm looking to publish.
This tool is extremely useful. For now– You do require a human in the loop for coordination.
My guess is that these will be benchmarks that we see within a few years: How good an AI coordinate multiple other AIs to build, deploy and iterate something that functions in the real world. Basically manager AI.
Because they'll literally be able to solve every single one shot problem so we won't be able to create benchmarks anymore.
But that's also when these models will be able to build functioning companies in a few hours.
> ...me countless of...would never try this without having LLMs...is extremely useful...they'll literally be able to solve...will be able to... in a few hours.
That's marketing language, not scientific or even casual language. So much outstanding claims, without even some basic explanations. Like how did it help you save these hours? Terms explanations? Outlining processes? Going to the post office for you? You don't need to sell me anything, I just want the how.
My issue with LLMs is that you require a review-competent human in the loop, to fix confabulations.
Yes, I’m using them from time to time for research. But I’m also aware of the topics I research and see through bs. And best LLMs out there, right now, produce bs in just 3-4 paragraphs, in nicely documented areas.
A recent example is my question on how to run N vpn servers on N ips on the same eth with ip binding (in ip = out ip, instead of using a gw with the lowest metric). I had no idea but I know how networks work and the terminology. It started helping, created a namespace, set up lo, set up two interfaces for inner and outer routing and then made a couple of crucial mistakes that couldn’t be detected or fixed by someone even a little clueless (in routing setup for outgoing traffic). I didn’t even argue and just asked what that does wrt my task, and that started the classic “oh wait, sorry, here’s more bs” loop that never ended.
Eventually I distilled the general idea and found an article that AI very likely learned from, cause it was the same code almost verbatim, but without mistakes.
Does that count as helping? Idk, probably yes. But I know that examples like this show that you cannot not only leave an LLM unsupervised for any non-trivial question, but have to leave a competent role in the loop.
I think the programming community is just blinded by LLMs succeeding in writing kilometers of untalented react/jsx/etc crap that has no complexity or competence in it apart from repeating “do like this” patterns and literally millions of examples, so noise cannot hit through that “protection”. Everything else suffers from LLMs adding inevitable noise into what they learned from a couple of sources. The problem here, as I understand it, is that only specific programmer roles and s{c,p}ammers (ironically) write the same crap again and again millions of times, other info usually exists in only a few important sources and blog posts, and only a few of those are full and have good explanations.
Your point is on the verge of nullification with the rapid improvement and adoption of autonomous drones don't you think?
Sort of, but doesn’t that sit on a far-fetch horizon? I doubt that drone companies are all the same who sold aircraft retrofuturism to people back then.
> to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings
To me it is more like there is someone jumping on a pogo ball while flapping their arms and saying that they are flying whenever they hop off the ground.
Skeptics say that they are not really flying, while adherents say that "with current pogo ball advancements, they will be flying any day now"
An old quote, quite famous: "... is like saying that an ape who climbs to the top of a tree for the first time is one step closer to landing on the moon".
Between skeptics and adherents who is more easily able to extract VC money for vaporware? If you limit yourself to 'the facts' you're leaving tons of $$ on the table...
By all means, if this is the goal, AI is a success.
I understand that in this forum too many people are invested in putting lipstick on this particular pig.
Is that what Elon Musk was trying to do on stage?
On the contrary, the pushback is critical because many employers are buying the hype from AI companies that AGI is imminent, that LLMs can replace professional humans, and that computers are about to eliminate all work (except VCs and CEOs apparently).
Every person that believes that LLMs are near sentient or actually do a good job at reasoning is one more person handing over their responsibilities to a zero-accountability highly flawed robot. We've already seen LLMs generate bad legal documents, bad academic papers, and extremely bad code. Similar technology is making bad decisions about who to arrest, who to give loans to, who to hire, who to bomb, and who to refuse heart surgery for. Overconfident humans employing this tech for these purposes have been bamboozled by the lies from OpenAI, Microsoft, Google, et al. It's crucial to call out overstatement and overhype about this tech wherever it crops up.
I don’t understand how or why someone with your mind would assume that even barely disclosed semi-public releases would resemble the current state of the art. Except if you do it for the conversations sake, which I have never been capable of.
I agree. If the LLMs we have today never got any smarter, the world would still be transformed over the next ten years.
People aren’t responding to their own assumption that AGI is necessary, they’re responding to OpenAI and the chorus constantly and loudly singing hymns to AGI.
> It doesn't need to be general intelligence or perfectly map to human intelligence.
> All it needs to be is useful.
Computers were already useful.
The only definition we have for "intelligence" is human (or, generally, animal) intelligence. If LLMs aren't that, let's call it something else.
What exactly is human (or animal) intelligence? How do you define that?
Does it matter? If LLMs aren't that, whatever it is, then we should use a different word. Finders keepers.
How do you know that LLMs “aren’t that” if you can’t even define what that is?
“I’ll know it when I see it” isn’t a compelling argument.
I think a successful high level intelligence should quickly accelerate or converge to infinity/physical resource exhaustion because they can now work on improving themselves.
So if above human intelligence does happen, I'd assume we'd know it, quite soon.
> “I’ll know it when I see it” isn’t a compelling argument.
It feels compelling to me.
they can't do what we do therefore they aren't what we are
And what is that, in concrete terms? Many humans can’t do what other humans can do. What is the common subset that counts as human intelligence?
Process vision and sounds in parallel for 80+ years, rapidly adapt to changing environments and scenarios, correlate seemingly irrelevant details that happened a week ago or years ago, be able to selectively ignore instructions and know when to disagree
[dead]
I don't think many informed people doubt the utility of LLMs at this point. The potential of human-like AGI has profound implications far beyond utility models, which is why people are so eager to bring it up. A true human-like AGI basically means that most intellectual/white collar work will not be needed, and probably manual labor before too long as well. Huge huge implications for humanity, e.g. how does an economy and society even work without workers?
> Huge huge implications for humanity, e.g. how does an economy and society even work without workers?
I don't think those that create AI care about that. They just to come out on top before someone else does.
Yes and we should be super worried about that.
> Reading constant comments about LLMs can't be general intelligence or lack reasoning etc, to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).
That is a natural reaction to the incessant techbro, AIbro, marketing, and corporate lies that "AI" (or worse AGI) is a real thing, and can be directly compared to real humans.
There are people on this very thread saying it's better at reasoning than real humans (LOL) because it scored higher on some benchmark than humans... Yet this technology still can't reliably determine what number is circled, if two lines intersect, or count the letters in a word. (That said behaviour may have been somewhat finetuned out of newer models only reinforces the fact that the technology inherently not capable of understanding anything.)
I encounter "spicy auto complete" style comments far more often than techbro AI-everything comments and its frankly getting boring.
I've been doing AI things for about 20+ years and llms are wild. We've gone from specialized things being pretty bad as those jobs to general purpose things better at that and everything else. The idea you could make and API call with "is this sarcasm?" and get a better than chance guess is incredible.
Eh, I see far more "AI is the second coming of Jesus" type of comments than healthy skepticism. A lot of anxiety from people afraid that their source of income will dry and a lot of excitement of people with an axe to grind that "those entitled expensive peasants will get what they deserve".
I think I count myself among the skeptics nowadays for that reason. And I say this as someone that thinks LLM is an interesting piece of technology, but with somewhat limited use and unclear economics.
If the hype was about "look at this thing that can parse natural language surprisingly well and generate coherent responses", I would be excited too. As someone that had to do natural language processing in the past, that is a damn hard task to solve, and LLMs excel at it.
But that is not the hype is it? We have people beating the drums of how this is just shy of taking the world by storm, and AGI is just around the corner, and it will revolutionize all economy and society and nothing will ever be the same.
So, yeah, it gets tiresome. I wish the hype would die down a little so this could be appreciated for what it is.
We have people beating the drums of how this is just shy of taking the world by storm, and AGI is just around the corner, and it will revolutionize all economy and society and nothing will ever be the same.
Where are you seeing this? I pretty much only read HN and football blogs so maybe I’m out of the loop.
In this very thread there are multiple people espousing their views that the high score here is proof that o3 has achieved AGI.
Nobody is disputing the coolness factor, only the intelligence factor.
I'm saying the intelligence factor doesn't matter. Only the utility factor. Today LLMs are incredibly useful and every few months there appears to be bigger and bigger leaps.
Analyzing whether or not LLMs have intelligence is missing the forest from the trees. This technology is emerging in a capitalist society that is hyper optimized to adopt useful things at the expense of almost everything else. If the utility/price point gets hit for a problem, it will replace it regardless of if it is intelligent or not.
I agree and as a non-software engineer, all that matters to me right now is how much can these models replace software engineering.
If a language model can't solve problems in a programming language then we are just fooling ourselves in less defined domains of "thought".
Software engineering is where the rubber meets the road in terms of intelligence and economics when viewing our society as a complex system. Software engineering salaries are above average exactly because most average people are not going to be software engineers.
From that point of view the progress is not impressive at all. The current models are really not that much better than chatGPT4 in April 2023.
AI art is a better example though. There is zero progress being made now. It is only impressive at the most surface level for someone not involved in art and who can't see how incredibly limited the AI art models are. We have already moved on to video though to make the same half baked, useless models that are only good to make marketing videos for press releases about progress and one off social media posts about how much progress is being made.
But if you want to predict the future utility of these models you want to look at their current intelligence, compare that to humans and try to figure out roughly what skills they lack and which of those are likely to get fixed.
For example, a team of humans are extremely reliable, much more reliable than one human, but a team of AI's isn't mean reliable than one AI since an AI is already an ensemble model. That means even if an AI could replace a person, it probably can't replace a team for a long time, meaning you still need the other team members there, meaning the AI didn't really replace a human it just became a tool for huamns to use.
I think this is a fair criticism of capability.
I personally wouldn't be surprised if we start to see benchmarks around this type of cooperation and ability to orchestrate complex systems in the next few years or so.
Most benchmarks really focus on one problem, not on multiple real-time problems while orchestrating 3rd party actors who might or might not be able to succeed at certain tasks.
But I don't think anything is prohibiting these models from not being able to do that.
If I could put it into Tesla style robot and it could do dishes and help me figure out tech stuff, it would be more than enough.
This a thousand times.
These comments are getting ridiculous. I remember when this test was first discussed here on HN and everyone agreed that it clearly proves current AI models are not "intelligent" (whatever that means). And people tried to talk me down when I theorised this test will get nuked soon - like all the ones before. It's time people woke up and realised that the old age of AI is over. This new kind is here to stay and it will take over the world. And you better guess it'll be sooner rather than later and start to prepare.
> These comments are getting ridiculous.
Not really. Francois (co-creator of the ARC Prize) has this to say:
The v1 version of the benchmark is starting to saturate. There were already signs of this in the Kaggle competition this year: an ensemble of all submissions would score 81%
Early indications are that ARC-AGI-v2 will represent a complete reset of the state-of-the-art, and it will remain extremely difficult for o3. Meanwhile, a smart human or a small panel of average humans would still be able to score >95% ... This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans, yet impossible for AI, without involving specialist knowledge. We will have AGI when creating such evals becomes outright impossible.
For me, the main open question is where the scaling bottlenecks for the techniques behind o3 are going to be. If human-annotated CoT data is a major bottleneck, for instance, capabilities would start to plateau quickly like they did for LLMs (until the next architecture). If the only bottleneck is test-time search, we will see continued scaling in the future.
https://x.com/fchollet/status/1870169764762710376 / https://ghostarchive.org/archive/Sqjbf
> It's time people woke up and realised that the old age of AI is over. This new kind is here to stay and it will take over the world. And you better guess it'll be sooner rather than later and start to prepare.
I was just thinking about how 3D game engines were perceived in the 90s. Every six months some new engine came out, blew people's minds, was declared photorealistic, and was forgotten a year later. The best of those engines kept improving and are still here, and kinda did change the world in their own way.
Software development seemed rapid and exciting until about Halo or Half Life 2, then it was shallow but shiny press releases for 15 years, and only became so again when OpenAI's InstructGPT was demonstrated.
While I'm really impressed with current AI, and value the best models greatly, and agree that they will change (and have already changed) the world… I can't help but think of the Next Generation front cover, February 1997 when considering how much further we may be from what we want: https://www.giantbomb.com/pc/3045-94/forums/unreal-yes-this-...
> Software development seemed rapid and exciting until about Halo or Half Life 2, then it was shallow but shiny press releases for 15 years
The transition seems to map well to the point where engines got sophisticated enough, that highly dedicated high-schoolers couldn't keep up. Until then, people would routinely make hobby game engines (for games they'd then never finish) that were MVPs of what the game industry had a year or three earlier. I.e. close enough to compete on visuals with top photorealistic games of a given year - but more importantly, this was a time where you could do cool nerdy shit to impress your friends and community.
Then Unreal and Unity came out, with a business model that killed the motivation to write your own engine from scratch (except for purely educational purposes), we got more games, more progress, but the excitement was gone.
Maybe it's just a spurious correlation, but it seems to track with:
> and only became so again when OpenAI's InstructGPT was demonstrated.
Which is again, if you exclude training SOTA models - which is still mostly out of reach for anyone but a few entities on the planet - the time where anyone can do something cool that doesn't have a better market alternative yet, and any dedicated high-schooler can make truly impressive and useful work, outpacing commercial and academic work based on pure motivation and focus alone (it's easier when you're not being distracted by bullshit incentives like user growth or making VCs happy or churning out publications, farming citations).
It's, once again, a time of dreams, where anyone with some technical interest and a bit of free time can make the future happen in front of their eyes.
> how much further we may be from what we wan
The timescale you are describing for 3D graphics is 4 years from the 1997 cover you posted to the release of Halo which you are saying plateaued excitement because it got advanced enough.
An almost infinitesimally small amount of time in terms of history human development and you are mocking the magazine being excited for the advancement because it was... 4 years yearly?
No, the timescale is "the 90s", the the specific example is from 1997, and chosen because of how badly it aged. Nobody looks at the original single-player Unreal graphics today and thinks "this is amazing!", but we all did at the time — Reflections! Dynamic lighting! It was amazing for the era — but it was also a long way from photorealism. ChatGPT is amazing… but how far is it from Brent Spiner's Data?
The era was people getting wowed from Wolfenstein (1992) to "about Halo or Half Life 2" (2001 or 2004).
And I'm not saying the flattening of excitement was for any specific reason, just that this was roughly when it stopped getting exciting — it might have been because the engines were good enough for 3D art styles beyond "as realistic as we can make it", but for all I know it was the War On Terror which changed the tone of press releases and how much the news in general cared. Or perhaps it was a culture shift which came with more people getting online and less media being printed on glossy paper and sold in newsagents.
Whatever the cause, it happened around that time.
I'm still holding on to my hypothesis in that the excitement was sustained in large part because this progress was something a regular person could partake in. Most didn't, but they likely known some kid who was. And some of those kids run the gaming magazines.
This was a time where, for 3D graphics, barriers to entry got low (math got figured out, hardware was good enough, knowledge spread), but the commercial market didn't yet capture everything. Hell, a bulk of those excited kids I remember, trying to do a better Unreal Tournament after school instead of homework (and almost succeeding!), they went on create and staff the next generation of commercial gamedev.
(Which is maybe why this period lasted for about as long as it takes for a schoolkid to grow up, graduate, and spend few years in the workforce doing the stuff they were so excited about.)
Could be.
I was one of those kids, my focus was Marathon 2 even before I saw Unreal. I managed to figure out enough maths from scratch to end up with the basics of ray casting, but not enough at the time to realise the tricks needed to make that real time on a 75 MHz CPU… and then we all got OpenGL and I went through university where they explained the algorithms.
The weird thing about the phenomenon you mention is only after the field of software engineering has plateaued 15 years ago, as you mentioned, that this insane demand for engineers did arise, with corresponding insane salaries.
It's a very strange thing I've never understood.
My guess: It’s a very lengthy, complex, and error-prone process to “digitize” human civilization (government, commerce, leisure, military, etc). The tech existed, we just didn’t know how to use it.
We still barely know how to use computers effectively, and they have already transformed the world. For better or worse.
I agree, it's like watching a meadow ablaze and dismissing it because it's not a 'real forest fire' yet. No it's not 'real AGI' yet, but *this is how we get there* and the pace is relentless, incredible and wholly overwhelming.
I've been blessed with grandchildren recently, a little boy that's 2 1/2 and just this past Saturday a granddaughter. Major events notwithstanding, the world will largely resemble today when they are teenagers, but the future is going to look very very very different. I can't even imagine what the capability and pervasiveness of it all will be like in ten years, when they are still just kids. For me as someone that's invested in their future I'm interested in all of the educational opportunities (technical, philosphical and self-awareness) but obviously am concerned about the potential for pernicious side effects.
Failing the test may prove the AI is not intelligent. Passing the test doesn't necessarily prove it is.
Your comment reminds me of this quote from a book published in the 80s:
> There is a related “Theorem” about progress in AI: once some mental function is programmed, people soon cease to consider it as an essential ingredient of “real thinking”. The ineluctable core of intelligence is always in that next thing which hasn’t yet been programmed. This “Theorem” was first proposed to me by Larry Tesler, so I call it Tesler’s Theorem: “AI is whatever hasn’t been done yet.”
I've always disliked this argument. A person can do something well without devising a general solution to the thing. Devising a general solution to the thing is a step we're talking all the time with all sorts of things, but it doesn't invalidate the cool fact about intelligence: whatever it is that lets us do the thing well without the general solution is hard to pin down and hard to reproduce.
All that's invalidated each time is the idea that a general solution to that task requires a general solution to all tasks, or that a general solution to that task requires our special sauce. It's the idea that something able to to that task will also be able to do XYZ.
And yet people keep coming up with a new task that people point to saying, 'this is the one! there's no way something could solve this one without also being able to do XYZ!'
[dead]
id consider that it doing the test at all, without proper compensation is a sign that it isnt intelligent
Motivation is not hard to instill. Fortunately, they have chosen not to do so.
If AI takes over white collar work that's still half of the world's labor needs untouched. There are some promising early demos of robotics plus AI. I also saw some promising demos of robotics 10 and 20 years that didn't reach mass adoption. I'd like to believe that by the time I reach old age the robots will be fully qualified replacements for plumbers and home health aides. Nothing I've seen so far makes me think that's especially likely.
I'd love more progress on tasks in the physical world, though. There are only a few paths for countries to deal with a growing ratio of old retired people to young workers:
1) Prioritize the young people at the expense of the old by e.g. cutting old age benefits (not especially likely since older voters have greater numbers and higher participation rates in elections)
2) Prioritize the old people at the expense of the young by raising the demands placed on young people (either directly as labor, e.g. nurses and aides, or indirectly through higher taxation)
3) Rapidly increase the population of young people through high fertility or immigration (the historically favored path, but eventually turns back into case 1 or 2 with an even larger numerical burden of older people)
4) Increase the health span of older people, so that they are more capable of independent self-care (a good idea, but difficult to achieve at scale, since most effective approaches require behavioral changes)
5) Decouple goods and services from labor, so that old people with diminished capabilities can get everything they need without forcing young people to labor for them
> If AI takes over white collar work that's still half of the world's labor needs untouched.
I am continually baffled that people here throw this argument out and can't imagine the second-order effects. If white collar work is automated by AGI, all the RnD to solve robotics beyond imagination will happen in a flash. The top AI labs, the people smartest enough to make this technology, all are focusing on automating AGI Researchers and from there follows everything, obviously.
+1, the second and third order effects aren't trivial.
We're already seeing escape velocity in world modeling (see Google Veo2 and the latest Genesis LLM-based physics modeling framework).
The hardware for humanoid robots is 95% of the way there, the gap is control logic and intelligence, which is rapidly being closed.
Combine Veo2 world model, Genesis control planning, o3-style reasoning, and you're pretty much there with blue collar work automation.
We're only a few turns (<12 months) away from an existence proof of a humanoid robot that can watch a Youtube video and then replicate the task in a novel environment. May take longer than that to productionize.
It's really hard to think and project forward on an exponential. We've been on an exponential technology curve since the discovery of fire (at least). The 2nd order has kicked up over the last few years.
Not a rational approach to look back at robotics 2000-2022 and project that pace forwards. There's more happening every month than in decades past.
I hope that you're both right. In 2004-2007 I saw self driving vehicles make lightning progress from the weak showing of the 2004 DARPA Grand Challenge to the impressive 2005 Grand Challenge winners and the even more impressive performance in the 2007 Urban Challenge. At the time I thought that full self driving vehicles would have a major commercial impact within 5 years. I expected truck and taxi drivers to be obsolete jobs in 10 years. 17 years after the Urban Challenge there are still millions of truck driver jobs in America and only Waymo seems to have a credible alternative to taxi drivers (even then, only in a small number of cities).
"it will take over the world"
Calibrating to the current hype cycle has been challenging with AI pronouncements.
You should look up the terms necessary and sufficient.
The real issue is people constantly making up new goalposts to keep their outdated world view somewhat aligned with what we are seeing. But these two things are drifting apart faster and faster. Even I got surprised by how quickly the ARC benchmark was blown out of the water, and I'm pretty bullish on AI.
The ARC maintainers have explicitly said that passing the test was necessary but not sufficient so I don't know where you come up with goal-post moving. (I personally don't like the test; it is more about "intuition" or in-built priors, not reasoning).
Are you like invested in LLM companies or something? You‘re pushing the agenda hard in this thread.
You are telling a bunch of high earning individuals ($150k+) that they may be dramatically less valuable in the eat future. Of course the goal posts will keep being pushed back and the acknowledgements will never come.
What kind of preparation are you suggesting?
Start learning a trade
I feel like that’s just kicking the can a little further down the road.
Our value proposition as humans in a capitalist society is an increasingly fragile thing.
that's going to work when every white collar worker goes into the trades /s
who is going to pay for residential electrical work lol and how much will you make if some guy from MIT is going to compete with you
This is far too broad to summarise here. You can read up on Sutskever or Bostrom or hell even Steven Hawking's ideas (going in order from really deep to general topics). We need to discuss everything - from education over jobs and taxes all the way to the principles of politics, our economy and even the military. If we fail at this as a society, we will at the very least create a world where the people who own capital today massively benefit and become rich beyond imagination (despite having contributed nothing to it), while the majority of the population will be unemployable and forever left behind. And the worst case probably falls somewhere between the end of human civilisation and the end of our species.
One way you can tell this isn't realistic is that it's the plot of Atlas Shrugged. If your economic intuitions produce that book it means they are wrong.
> while the majority of the population will be unemployable and forever left behind
Productivity improvements increase employment. A superhuman AI is a productivity improvement.
> Productivity improvements increase employment.
Sometimes: the productivity improvements from the combustion engine didn't increase employment of horses, it displaced them.
But even when productivity improvements do increase employment, it's not always to our advantage: the productivity improvements from Eli Whitney's cotton gin included huge economic growth and subsequent technological improvements… and also "led to increased demands for slave labor in the American South, reversing the economic decline that had occurred in the region during the late 18th century": https://en.wikipedia.org/wiki/Cotton_gin
A superhuman AI that's only superhuman in specific domains? We've been seeing plenty of those, "computer" used to be a profession, and society can re-train but it still hurts the specific individuals who have to be unemployed (or start again as juniors) for the duration of that training.
A superhuman AI that's superhuman in every domain, but close enough to us in resource requirements that comparative advantage is still important and we can still do stuff, relegates us to whatever the AI is least good at.
A superhuman AI that's superhuman in every domain… as soon as someone invents mining, processing, and factory equipment that works on the moon or asteroids, that AI can control that equipment to make more of that equipment, and demand is quickly — O(log(n)) — saturated. I'm moderately confident that in this situation, the comparative advantage argument no longer works.
No, Atlas shrugged explicitly believes that the wealthy beneficiaries are also the ones doing the innovation and the labor. Human/superhuman AI, if not self-directed but more like a tool, may massively benefit whoever happens to be lucky enough to be directing it when it arises. This does not imply that the lucky individual benefits on the basis of their competence.
The idea that productivity improvements increase unemployment is just fundamentally based on a different paradigm. There is absolutely no reason to think that when a machine exists that can do most things that a human can do as well if not better for less or equal cost, this will somehow increase human employment. In this scenario, using humans in any stage of the pipeline would be deeply inefficient and a stupid business decision.
[deleted]
What we're going to do is punt the questions and then convince ourselves the outcome was inevitable and if anything it's actually our fault.
The goalposts have moved, again and again.
It's gone from "well the output is incoherent" to "well it's just spitting out stuff it's already seen online" to "WELL...uhh IT CAN'T CREATE NEW/NOVEL KNOWLEDGE" in the space of 3-4 years.
It's incredible.
We already have AGI.
I'm a little torn. ARC is really hard, and Francois is extremely smart and thoughtful about what intelligence means (the original "On the Measure of Intelligence" heavily influenced my ideas on how to think about AI).
On the other hand, there is a long, long history of AI achieving X but not being what we would casually refer to as "generally intelligent," then people deciding X isn't really intelligence; only when AI achieves Y will it be intelligence. Then AI achieves Y and...
I think it's still an interesting way to measure general intellience, it's just that o3 has demonstrated that you can actually achieve human performance on it by training it on the public training set and giving it ridiculous amounts of compute, which I imagine equates to ludicrously long chains-of-thought, and if I understand correctly more than one chain-of-thought per task (they mention sample sizes in the blog post, with o3-low using 6 and o3-high using 1024. Not sure if these are chains-of-thought per task or what).
Once you look at it that way it the approach really doesn't look like intelligence that's able to generalize to novel domains. It doesn't pass the sniff test. It looks a lot more like brute-forcing.
Which is probably why, in order to actually qualify for the leaderboard, they stipulate that you can't use more than $10k more of compute. Otherwise, it just sounds like brute-forcing.
I disagree. It’s vastly inefficient, but it is managing to actually solve these problems with a vast search space. If we extrapolate this approach into the future and assume that the search becomes better as the underlying model improves, and assume that the architecture grows more efficient, and assume that the type of parallel computing used here grows cheaper, isn’t it possible that this is a lot more than brute-forcing in terms of what it will achieve? In other words, is it maybe just a really ugly way of doing something functionally equivalent to reasoning?
I just googled arc agi questions, and it looks like it is similar to an iq test with raven matrix. Similar as in you have some examples of images before and after, then an image before and you have to guess the after.
Could anyone confirm if this is the only kind of questions in the benchmark? If yes, how come there is such a direct connection to "oh this performs better than humans" when llm can be quite better than us in understanding and forecasting patterns? I'm just curious, not trying to stir up controversies
Yes, it's pretty similar to Raven's. The reason it is an interesting benchmark is because humans, even very young humans, "get" the test in the sense of understanding what it's asking and being able to do pretty well on it - but LLMs have really struggled with the benchmark in the past.
Chollett (one of the creators of the ARC benchmark) has been saying it proves LLMs can't reason. The test questions are supposed to be unique and not in the model's training set. The fact that LLMs struggled with the ARC challenge suggested (to Chollett and others) that models weren't "Truly reasoning" but rather just completing based on things they'd seen before - when the models were confronted with things they hadn't seen before, the novel visual patterns, they really struggled.
It's a test on which (apparently until now) the vast majority of humans have far outperformed all machine systems.
But it’s not a test that directly shows general intelligence.
I am excited no less! This is huge improvement.
How does this do on SWE Bench?
>How does this do on SWE Bench?
71.7%
I've seen this figure on a few tech news websites and reddit but can't find an official source. If it was in the video I must have missed it, where is this coming from?
It was in the video. I don't know if Open ai have a page up yet
ML is quite good at understanding and forecasting patterns when you train on the data you want to forecast. LLMs manage to do so much because we just decided to train on everything on the internet and hope that it included everything we ever wanted to know.
This tries to create patterns that are intentionally not in the data and see if a system can generalize to them, which o3 super impressively does!
ARC is in the dataset though? I mean I'm aware that there are new puzzles every day, but there's still a very specific format and set of skills required to solve it. I'd bet a decent amount of money that humans get better at ARC with practice, so it seems strange to suggest that AI wouldn't.
> My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.
But isn’t it interesting to have several benchmarks? Even if it’s not about passing the Turing test, benchmarks serve a purpose—similar to how we measure microprocessors or other devices. Intelligence may be more elusive, but even if we had an oracle delivering the ultimate intelligence benchmark, we'd still argue about its limitations. Perhaps we'd claim it doesn't measure creativity well, and we'd find ourselves revisiting the same debates about different kinds of intelligences.
It's certainly interesting. I'm just not convinced it's a test of general intelligence, and I don't think we'll know whether or not it is until it's been able to operate in the real world to the same degree that our general intelligence does.
From the statement where - this is a pretty tough test where AI scores low vs humans just last year, and AI can do it as good as humans may not be AGI which I agree, but it means something with all caps
Obviously, the multi billion dollar companies will try to satisfy the benchmarks they are not yet good in, as has always been the case.
A valid conspiracy theory but I’ve heard that one everystep of the way to this point
What made you write this comment, I have a hard time understanding your point.
how about a extra large dose of your skepticism. is true intelligence really a thing and not just a vague human construct that tries to point out the mysterious unquantifiable combination of human behaviors?
humans clearly dont know what intelligence is unambiguously. theres also no divinely ordained objective dictionary that one can point at to reference what true intelligence is. a deep reflection of trying to pattern associate different human cognitive abilities indicates human cognitive capabilities arent that spectacular really.
My guess as an amateur neuroscientist is that what we call intelligence is just a 'measurement' of problem solving ability in different domains. Can be emotional, spatial, motor, reasoning, etc etc.
There is no special sauce in our brain. And we know how much compute there is in our brain– So we can roughly estimate when we'll hit that with these 'LLMs'.
Language is important in a human brain development as well. Kids who grow up deaf grow up vastly less intelligent unless they learn sign language. Language allow us to process complex concepts that our brain can learn to solve, without having to be in those complex environments.
So in hindsight, it's easy to see why it took a language model to be able to solve general tasks and other types deep learning networks couldn't.
I don't really see any limits on these models.
interesting point about language. but i wonder if people misattribute the reason why language is pivotal to human development. your points are valid. i see human behavior with regard to learning as 90% mimicry and 10% autonomous learning. most of what humans believe in is taken on faith and passed on from the tribe to the individual. rarely is it verified even partially let alone fully. humans simple dont have the time or processing power to do that. learning a thing without outside aid is vastly slower and more energy or brain intensive process than copy learning or learning through social institutions by dissemination. the stunted development from lack of language might come more from the less ability to access the collective learning process that language enables and or greatly enhances. i think a lot of learning even when combined with reasoning, deduction, etc really is at the mercy of brute force exploration to find a solution, which individuals are bad at but a society that collects random experienced “ah hah!” occurrences and passes them along is actually okay at.
i wonder if llms and language dont as so much allow us to process these complex environments but instead preload our brains to get a head start in processing those complex environments once we arrive in them. i think llms store compressed relationships of the world which obviously has information loss from a neural mapping of the world that isnt just language based. but that compressed relationships ie knowledge doesnt exactly backwardly map onto the world without it having a reverse key. like artificially learning about real world stuff in school abstractly and then going into the real world, it takes time for that abstraction to snap fit upon the real world.
could you further elaborate on what you mean by limits, because im happy to play contrarian on what i think i interpret you to be saying there.
also to your main point: what intelligence is. yeah you sort of hit up my thoughts on intelligence. its a combination of problem solving abilities in different domains. its like an amalgam of cognitive processes that achieve an amalgam of capabilities. while we can label alllllll that with a singular word, doesnt mean its all a singular process. seems like its a composite. moreover i think a big chunk of intelligence (but not all) is just brute forcing finding associations and then encoding those by some reflexive search/retrieval. a different part of intelligence of course is adaptibility and pattern finding.
" it's complete hubris to conflate ARC or any benchmark with truly general intelligence."
Maybe it would help to include some human results in the AI ranking.
I think we'd find that Humans score lower?
I'm not sure it'd help what they are talking about much.
E.g. go back in time and imagine you didn't know there are ways for computers to be really good at performing integration yet as nobody had tried to make them. If someone asked you how to tell if something is intelligent "the ability to easily reason integrations or calculate extremely large multiplications in mathematics" might seem like a great test to make.
Skip forward to the modern era and it's blatantly obvious CASes like Mathematica on a modern computer range between "ridiculously better than the average person" to "impossibly better than the best person" depending on the test. At the same time, it becomes painfully obvious a CAS is wholly unrelated to general intelligence and just because your test might have been solvable by an AGI doesn't mean solving it proves something must have been an AGI.
So you come up with a new test... but you have the same problem as originally, it seems like anything non-human completely bombs and an AGI would do well... but how do you know the thing that solves it will have been an AGI for sure and not just another system clearly unrelated?
Short of a more clever way what GP is saying is the goalposts must keep being moved until it's not so obvious the thing isn't AGI, not that the average human gets a certain score which is worse.
.
All that aside, to answer your original question, in the presentation it was said the average human gets 85% and this was the first model to beat that. It was also said a second version is being worked on. They have some papers on their site about clear examples of why the current test clearly has a lot of testing unrelated to whether something is really AGI (a brute force method was shown to get >50% in 2020) so their aim is to create a new goalpost test and see how things shake out this time.
> So you come up with a new test... but you have the same problem as originally, it seems like anything non-human completely bombs and an AGI would do well... but how do you know the thing that solves it will have been an AGI for sure and not just another system clearly unrelated?
We should skip to the end and just define a task like "it's AGI if it can predict, with 100% accuracy the average human's next action in any situation". Anything that can do that is as good as AGI even if people manage to find a proxy for the task.
Generality is not binary. It's a spectrum. And these models are already general in ways those things you've mentioned simply weren't.
What exactly is AGI to you ? If it's simply a generally intelligent machine then what are you waiting for ? What else is there to be sure of ? There's nothing narrow about these models.
Humans love to believe they're oh so special so much that there will always be debates on whether 'AGI' has arrived. If you are waiting for that then you'll be waiting a very long time, even if a machine arrives that takes us to the next frontier in science.
> There's nothing narrow about these models.
There is, they can't create new ideas like humanity can. AGI should be able to replace humanity in terms of thinking, otherwise it isn't general, you would just have a model specialized at reproducing thoughts and patterns human have thought before, it still can't recreate science from scratch etc like humanity did, meaning it can't do science properly.
Comparing an AI to a single individual is not how you measure AGI, if a group of humans perform better then you can't use the AI to replace that group of humans, and thus the AI isn't an AGI since it couldn't replace the group humans.
So for example, if a group of programmers write more reliable programs than the AI, then you can't replace that group of programmers with the AI, even if you duplicate that AI many times, since the AI isn't capable of reproducing that same level of reliability when ran in parallel. This is due to an AI being run in parallel is still just an AI, an ensemble model is still just an AI, so the model the AI has to beat is the human ensemble called humanity.
If we lower the bar a bit at least it has to beat 100 000 humans working together to make a job obsolete, since all the tutorials etc and all such things are made by other humans as well if you remove the job those would also disappear and the AI would have to do the work of all of those, so if it can't humans will still be needed.
Its possible you will be able to substitute part of those human ensembles with AI much sooner, but then we just call it a tool. (We also call narrow humans tools, it is fair)
I see these models create new ideas. At least at the standard humans are beholden to, so this just falls flat for me.
You don't just need to create an idea, you need to be able to create ideas that on average progress in a positive direction. Humans can evidently do that, AI can't, when AI work too much without human input you always end up with nonsense.
In order to write general program you need to have that skill. Every new code snipped needs to be evaluated by that system, whether it makes the codebase better or not. The lack of that ability is why you can't just loop an LLM today to replace programmers. It might be possible to automate it for specific programming tasks, but not general purpose programming.
Overcoming that hurdle is not something I think LLM ever can do, you need a totally different kind of architecture, not something that is trained to mimic but trained to reason. I don't know how to train something that can reason about noisy unstructured data, we will probably figure that out at some point but it probably wont be LLM as they are today.
[deleted]
I'm firmly in the "absolutely nothing special about human intelligence" camp so don't let dismissal of this as AGI fuel any misconceptions as to why I might think that.
As for what AGI is? Well, the lack of being able to describe that brings us full circle in this thread - I'll tell you for sure when I've seen it for the first time and have the power of hindsight to say what was missing. I think these models are the closest we've come but it feels like there is at least 1-2 more "4o->o1" style architecture changes where it's not necessarily about an increase in model fitting and more about a change in how the model comes to an output before we get to what I'd be willing to call AGI.
Who knows though, maybe some of those changes come along and it's closer but still missing some process to reason well enough to be AGI rather than a midway tool.
"Short of a more clever way what GP is saying is the goalposts must keep being moved until it's not so obvious the thing isn't AGI, not that the average human gets a certain score which is worse."
Best way of stating that I've heard.
The Goal Post must keep moving, until we understand enough what is happening.
I usually poo-poo the goal post moving, but this makes sense.
> truly general intelligence
Indistinguishable from goalpost moving like you said, but also no true Scotsman.
I'm curious what would happen in your eyes if we misattributed general intelligence to an AI model? What are the consequences of a false positive and how would they affect your life?
It's really clear to me how intelligence fits into our reality as part of our social ontology. The attributes and their expression that each of us uses to ground our concept of the intelligent predicate differs wildly.
My personal theory is that we tend to have an exemplar-based dataset of intelligence, and each of us attempts to construct a parsimonious model of intelligence, but like all (mental) models, they can be useful but wrong. These models operate in a space where the trade off is completeness or consistency, and most folks, uncomfortable saying "I don't know" lean toward being complete in their specification rather than consistent. The unfortunate side-effect is that we're able to easily generate test data that highlights our model inconsistency - AI being a case in point.
> I'm curious what would happen in your eyes if we misattributed general intelligence to an AI model? What are the consequences of a false positive and how would they affect your life?
Rich people will think they can use the AI model instead of paying other people to do certain tasks.
The consequences could range from brilliant to utterly catastrophic, depending on the context and precise way in which this is done. But I'd lean toward the catastrophic.
Any specifics? It's difficult to separate this from generalized concern.
someone wants a "personal assistant" and believes that the LLM has AGI ...
someone wants a "planning officer" and believes that the LLM has AGI ...
someone wants a "hiring consultant" and believes that the LLM has AGI ...
etc. etc.
My apologies, but would it be possible to list the catastrophic consequences of these?
[deleted]
Isn’t this like a brute force approach? Given it costs $ 3000 per task, thats like 600 GPU hours (h100 at Azure) In that amount of time the model can generate millions of chains of thoughts and then spend hours reviewing them or even testing them out one by one. Kind of like trying until something sticks and that happens to solve 80% of ARC. I feel like reasoning works differently in my brain. ;)
They're only allowed 2-3 guesses per problem. So even though yes it generates many candidates, it can't validate them - it doesn't have tool use or a verifier, it submits the best 2-3 guesses. https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50...
Chain of thought can entirely self validate. The OP is saying the LLM is acting like a photon, evaluate all possible solutions and choosing the most "Right" path. not quoting the OP here but my initial thought is that is does seem quite wasteful.
the LLM only gets two guesses at the "end solutions". The whole chain of thought is breaking out the context, and levels of abstraction. How many "Guesses" is it self generating and internally validating, well that's all just based on compute power and time.
My counter point to OP here would be is that is exactly how our brain works. In every given scenario, we are also evaluating all possible solutions. Our entire stack is constantly listening and eithier staying silent, or contributing to an action potential (eithier excitatory, or inhibitory). but our brain is always "Evaluating all potential possibilities" at any given moment. We have a society of mind always contributing their opinion, but the ones who don't have as much support essentially get "Shouted down".
> How many "Guesses" is it self generating and internally validating
That's completely fair game. That's just search.
It is allowed exactly two guesses, per the ARC rules.
How many guesses is the human comparison based on? I’d hope two as well but haven’t seen this anywhere so now I’m curious.
The real turker studies, resulting in the ~70% number, are scored correctly I believe. Higher numbers are just speculated human performance as far as I’m aware.
[deleted]
The trick with AlphaGo was brute force combined with learning to extract strategies from brute force using reinforcement learning, that's what we'll see here. So maybe it costs a million dollars in compute to get a high score, but use reinforcement learning ala alphazero to learn from the process and it won't cost a million dollars next time and let it do lots of hard benchmarks, math problems and coding tasks and it'll keep getting better and better.
The best interpretation of this result is probably that it showed tackling some arbitrary benchmark is something you can throw money at, aka it’s just something money can solve.
Its not agi obviously in the sense that you still need to some problem framing and initialization to kickstart the reasoning path simulations
this might be quite an important point - if they created an algorithm that can mimic human reasoning, but scales terribly with problem complexity (in terms of big O notation), it's still a very significant result, but it's not a 'humans brains are over' moment quite yet.
"We have created artificial super intelligence, it has solved physics!"
"Well, yeah, but its kind of expensive" -- this guy
Haha. Hopefully you’re right and solving the ARC puzzle translates to solving all of physics. I just remain skeptical about the OpenAI hype. They have a track record of exaggerating the significance of their releases and their impact on humanity.
Please do show me a novel result in physics from any LLM. You think "this guy" is stupid because he doesn't extrapolate from this $2MM test that nearly reproduces the work of a STEM graduate to a super intelligence that has already solved physics. Maybe you've got it backwards.
And two years ago this result would have been thought impossible.
Picks up goalpost, looks for stadium exit
The problem is not that it is expensive, but that, most likely, it is not superintelligence. Superintelligence is not exploring the problem space semi-blindly, if the thounsands $$$ per task are actually spent for that. There is a reason the actual ARC-AGI prize requires efficiency, because the point is not "passing the test" but solving the framing problem of intelligence.
Complete aside here: I used to do work with amputees and prosthetics. There is a standardized test (and I just cannot remember the name) that fits in a briefcase. It's used for measuring the level of damage to the upper limbs and for prosthetic grading.
Basically, it's got the dumbest and simplest things in it. Stuff like a lock and key, a glass of water and jug, common units of currency, a zipper, etc. It tests if you can do any of those common human tasks. Like pouring a glass of water, picking up coins from a flat surface (I chew off my nails so even an able person like me fails that), zip up a jacket, lock your own door, put on lipstick, etc.
We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. To the patients, the hands were therefore about as useful as a metal hook (a common solution with amputees today, not just pirates!).
Again, a total aside here, but your comment just reminded me of that brown briefcase. Life, it turns out, is a lot more complex than we give it credit for. Even pouring the OJ can be, in rare cases, transcendent.
There's a lot of truth in this. I sometimes joke that robot benchmarks should focus on common household chores. Given a basket of mixed laundry, sort and fold everything into organized piles. Load a dishwasher given a sink and counters overflowing with dishes piled up haphazardly. Clean a bedroom that kids have trashed. We do these tasks almost without thinking, but the unstructured nature presents challenges for robots.
I maintain that whoever invents a robust laundry folding robot will be a trillionaire. In that, I dump jumbled clean clothes straight from a dryer at it and out comes folded and sorted clothes (and those loner socks). I know we're getting close, but I also know we're not there yet.
We are certainly getting close! In 2010, watching PR2 fold some unseen towels is similar to watching paint dry [1], but we can now enjoy robots attain lazy student-level laundry folding in real-time, as demonstrated by π₀[2].
I can live without folding laundry (I can just shove my undershirts in the closet, who cares if it's not folded), but whoever manufactures a reliable auto-loading dishwasher will have my dollars. Like, just put all your dishes in the sink and let the machine handle them.
But if your dishwasher is empty is takes nearly the same amount of time/effort to put dishes straight into the dishwasher that it does to put them in the sink.
I think I'd only really save time by having a robot that could unload my dishwasher and put up the clean dishes.
That's called a second dishwasher: one is for taking out, the other for putting in. When the latter is full, turn it on, dirty dishes wait outside until the cycle finishes, when the dishwashers switch roles.
I thought about this and it gets even better. You do not really need shelves as you just use the clean dishwasher as the storage place. I honestly don’t know why this is not a thing in big or wealthy homes.
Another thing that bothers me is that dishwashers are low. As I get older, I’m finding it really annoying to bend down.
So get me a counter-level dishwasher cabinet and I’ll be happy!
We have a double drawer dishwasher and it hurts my brain watching friends plan around their nightly wash.
Hmm, that doesn't match my experience. It takes me a lot more time to put dishes into the dishwasher, because it has different places for cutlery, bowls, dishes, and so on, and of course the existing structure never matches my bowls' size perfectly so I have to play tetris or run it with only 2/3 filled (which will cause me to waste more time as I have to do dishes again sooner).
And that's before we get to bits of sticky rice left on bowls, which somehow dishwashers never scrape off clean. YMMV.
1. Get a set of dishes that does fit nicely together in the dishwasher.
2. Start with a cold prewash, preferably with a little powder in there too. This massively helps with stubborn stuff. This one is annoying though because you might have to come back and switch it on after the prewash. A good job for the robot butler.
Why can't dishwashers just be small, single-dish appliances in which you put the plate/mug/wine glass/forks/whatever, close it, push a button, and 10 seconds later it's clean and dry, you unload and repeat?
Buy one bowl you like (I use a silicone one) and use it for everything. Rarely requires more than a quick rinse.
I was a believer in Gal's FoldiMate but sadly it...folded.
At this point I'm not sure we'll actually get a task-specific machine for laundry folding/sorting before humanoid robots gain the capability to do it well enough.
Honestly, a robot that can hang jumbled clean clothes instead of folding them would be good enough, it's crazy how we don't even have those.
There is the Foldimate robot. I don't know how well it works. It doesn't seem to pair up socks. (Deleted the web link, it might not be legitimate.)
Beware, this website is probably a scam.
Foldimate has gone bankrupt in 2021 [1], and the domain referral from foldimate.com to a 404 page at miele.com, suggests that it was Miele who bought up the remains, not a sketchy company with a ".website" top-level domain.
I want it to lay out an outfit every day too. Hopefully without hallucination.
it's not hallucination, it's high fashion
Yes, but the stupid robot laid out your Thursday-black-Turtleneck for you on Saturday morning. That just won't suffice.
Laundry folding and laundry ironing, I would say.
Hopefully will detect whether a small child is inside or not.
> I maintain that whoever invents a robust laundry folding robot will be a trillionaire
… so Elon Musk? :D
Slightly tangential, we already have amazing laundry robots. They are called washing and drying machines. We don't give these marvels enough credit, mostly because they aren't shaped like humans.
Humanoid robots are mostly a waste of time. Task-shaped robots are much easier to design, build, and maintain... and are more reliable. Some of the things you mention might needs humanoid versatility (loading the dishwasher), others would be far better served by purpose-built robots (laundry sorting).
I'm embarrassed to say that I spent a few moments daydreaming about a robot that could wash my dishes. Then I thought about what to call it...
Sadly current "dishwasher" models are neither self-loading nor unloading. (Seems like they should be able to take a tray of dishes, sort them, load them, and stack them after cleaning.)
Maybe "busbot" or "scullerybot".
The problem is more doing it in sufficiently little space, and using little enough water and energy. Doing one that you just feed dishes individually and that immediate wash them and feed them to storage should be entirely viable, but it'd be wasteful, and it'd compete with people having multiple small drawer-style dishwashers, offering relatively little convenience over that.
It seems most people aren't willing to pay for multiple dishwashers - even multiple small ones or set aside enough space, and that places severe constraints on trying to do better.
Was it a dishwasher? Just give it all your unclean dishes and tell it to go, come back an hour later and they all washed and mostly dried!
There isn't a "task-shaped" robot for unstructured and complex manipulation, other than high DoF arms with vision and neural nets. For example, a machine which can cook food would be best solved with two robotic arms. However, these stationary arms would be wasted if they were just idling most of the time. So, you add locomotion and dynamic balancing with legs. And now these two arms can be used in 1000 different tasks, which makes them 1000x more valuable.
So, not only is the human form the only solution for many tasks, it's also a much cheaper solution considering the idle time of task-specific robots. You would need only a single humanoid robot for all tasks, instead of buying a different machine for each task. And instead of having to design and build a new machine for each task, you'll need to just download new software for each task.
[deleted]
I agree. I don’t know where this obsession comes from. Obsession with resembling as close to humans as possible. We’re so far from being perfect. If you need proof just look at your teeth. Yes, we’re relatively universal, but a screwdriver is more efficient at driving in screws that our fingers. So please, stop wasting time building perfect universal robots, build more purpose-build ones.
Given we have shaped so many tasks to fit our bodies, it will be a long time before a bot able to do a variety/majority of human tasks the human way won’t be valuable.
1000 machines specialized for 1000 tasks are great, but don’t deliver the same value as a single bot that can interchange with people flexibly.
Costly today, but wont be forever.
The shape doesn't matter! Non-humanoid shapes give minir advantages on specific tasks but for a general robot you'll have a hard time finding a shape much more optimal than humanoid. And if you go with humanoid you have so much data available! Videos contain the information of which movements a robot should execude. Teleoperation is easy. This is the bitter lesson! The shape doesn't matter, any shape will work with the right architecture, data and training!
Purpose build robots are basically solved. Dishwashers, laundry machines, assembly robots, etc. the moat is a general purpose robot that can do what a human can do.
Great examples. They are simple, reliable, efficient and effective. Far better than blindly copying what a human being does. Maybe there are equally clever ways of doing things like folding clothes.
This is expressed in AI research as Moravec's paradox: https://en.wikipedia.org/wiki/Moravec%27s_paradox
Getting to LLMs that could talk to us turned out to be a lot easier than making something that could control even a robotic arm without precise programming, let alone a humanoid.
This was actually discovered quite early on in the history of AI:
> Rodney Brooks explains that, according to early AI research, intelligence was "best characterized as the things that highly educated male scientists found challenging", such as chess, symbolic integration, proving mathematical theorems and solving complicated word algebra problems. "The things that children of four or five years could do effortlessly, such as visually distinguishing between a coffee cup and a chair, or walking around on two legs, or finding their way from their bedroom to the living room were not thought of as activities requiring intelligence."
I don't know why people always feel the need to gender these things. Highly educated female scientists generally find the same things challenging.
>I don't know why people always feel the need to gender these things
Because it's relevant to the point being made, i.e. that these tests reflect the biases and interests of the people who make them. This is true not just for AI tests, but intelligence test applied to humans. That Demis Hassabis, a chess player and video game designer, decided to test his machine on video games, Go and chess probably is not an accident.
The more interesting question is why people respond so apprehensively to pointing out a very obvious problem and bias in test design.
> i.e. that these tests reflect the biases and interests of the people who make them
Of course. However i believe we can't move past that without being honest about where these biases are coming from. Many things in our world are the result of gender bias, both subtle and overt. However, at least at first glance, this does not appear to be one of them, and statements like the grandparent's quote serve to perpetuate such biases further.
It's a quote from the 80s from the original author (who is a man...)...
Thank you for virtue signalling, though.
> It's a quote from the 80s from the original author (who is a man...)...
Yes, that was pretty clear in the original comment (?)
Then remove the parts that offend your modern sensibilities and focus on the essence.
He was right. Scientists were focusing on the "science-y" bits and completely missed the elephant in the room, that the thing a toddler already masters are the monster challenge for AI right now, before we even get into "meaning of life" type stuff.
I don't know why anyone would blame people as though someone is making an explicit choice. I find your choice of words to be insulting to the OP.
We learn our language and stereotypes subconciously from our society, and it is no easy thing to fight against that.
I had a pretty bad case of tendinitis once, that basically made my thumb useless since using it would cause extreme pain. That test seems really good. I could use a computer keyboard without any issue, but putting a belt on or pouring water was impossible.
I had a swollen elbow a short while ago, and the amount of things I've never thought about that were affected by reduced elbow join mobility and an inability to put pressure on the elbow was disturbing.
It feels like there's a whole class of information that easily shorthanded, but really hard to explain to novices.
I think a lot about carpentry. From the outside, it's pretty easy: Just make the wood into the right shape and stick it together. But as one progresses, the intricacies become more apparent. Variations in the wood, the direction of the grain, the seasonal variations in thickness, joinery techniques that are durable but also time efficient.
The way this information connects is highly multisensory and multimodal. I now know which species of wood to use for which applications. This knowledge was hard won through many, many mistakes and trials that took place at my home, the hardware store, the lumberyard, on YouTube, from my neighbor Steve, and in books written by experts.
I think assembling Legos would be a cool robot benchmark: you need to parse the instructions, locate the pieces you need, pick them up, orient them, snap them to your current assembly, visually check if you achieved the desired state, repeat
I agree. Watching my toddler daughter build with small legos makes me understand how incredible fine motor skills are as even with small fingers some of the blocks are just too hard to snap together.
Was it the Southampton hand assessment procedure?
Yes! Thank you!
It would be interesting to see trick questions.
Like in your test
a hand grenade and a pin - don't pull the pin.
Or maybe a mousetrap? but maybe that would be defused?
in the ai test...
or Global Thermonuclear War, the only winning move is...
Gaming streams being in the training data, it might pull the pin because "that's what you do".
or, because it has to give an output, and pulling the pin is the only option
There's also the option of not pulling the pin, and shooting your enemies as they instinctively run from what they think is a live grenade. Saw it on a TV show the other day.
[deleted]
to move first!
oh crap. lol!
> We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. "
I must be missing something, how can they be able to play Mozart at 5x speed with their prosthetics but not zip a jacket? They could press keys but not do tasks requiring feedback?
Or did you mean they used to play Mozart at 5x speed before they became amputees?
Playing a piano involves pushing down on the right keys with the right force at the right time, but that could be pre-programmed well before computers. The self-playing piano in the saloon in Westworld wasn't a huge anachronism, such things slightly overlapped with the Wild West era: https://en.wikipedia.org/wiki/Player_piano
Picking up a 1mm thick metal disk from a flat surface requires the user gives the exact time, place, and force, and I'm not even sure what considerations it needs for surface materials (e.g. slightly squishy fake skin) and/or tip shapes (e.g. fake nails).
> Picking up a 1mm thick metal disk from a flat surface requires the user gives the exact time, place, and force
place sure but can't you cheat a bit for time and force with compliance("impedance control")?
In theory, apparently not in practice.
Imagine a prosthetic 'hand' that has 5 regular fingers, rather than 4 fingers and a thumb. It would be able to play a piano just fine, but be unable to grasp anything small, like a zipper.
I'm far from a piano player, but I can definitely push piano buttons quite quickly while zipping up my jacket when it's cold and/or wet outside is really difficult.
Even more so for picking up coins from a flat surface.
For robotics, it's kind of obvious, speed is rarely an issue, so the "5x" part is almost trivial. And you can program the sequence quite easily, so that's also doable. Piano keys are big and obvious and an ergonomically designed interface meant to be relatively easy to press, ergo easy even for a prosthetic. A small coin on a flat surface is far from ergonomic.
I play piano as a hobby, and the funny thing is, if my hands are so cold that I can't zip up my jacket, there's no way I can play anything well. I know it's not quite zipping up jackets ;) but a human playing the piano does require a fast feedback loop.
But how do you deliberately control those fingers to actually play yourself what you have in mind rather than something preprogrammed? Surely the idea of a prosthetic does not just mean "a robot that is connected to your body", but something that the owner control with your mind.
Nobody said anything about deliberately controlling those fingers to play yourself. Clearly it's not something you do for the sake of the enjoyment of playing, but more likely a demonstration of the dexterity of the prosthesis and ability to program it for complex tasks.
The idea of a prosthesis is to help you regain functionality. If the best way of doing that is through automation, then it'd make little sense not to.
Well, you see, while the original comment says they could play at 5x speed, it does not say it plays at that speed well or play it beautifully. Any teacher or any student who learned piano for a while will tell you that this matters a lot, especially for classical music -- being able to accurately play at an even tempo with the correct dynamics and articulation is hard and is what differentiates a beginner/intermediate player from an advanced one. In fact, one mistake many students make is playing a piece too fast when they are not ready, and teachers really want students to practice very slowly.
My point is -- being able to zip a jacket is all about those subtle actions, and could actually be harder than "just" playing piano fast.
zipping up a jacket is really hard to do, and requires very precise movements and coordination between hands.
playing mozart is much more forgiving in terms of the number of different motions you have to make in different directions, the amount of pressure to apply, and even the black keys are much bigger than large sized zipper tongues.
Pretty much. The issue with zippers is that the fabric moves about in unpredictable ways. Piano playing was just movement programs. Zipping required (surprisingly) fast feedback. Also, gripping is somewhat tough compared to pressing.
Thumb not opposable?
We detached this subthread from https://news.ycombinator.com/item?id=42473419
(nothing wrong with it! I'm just trying to prune the top subthread)
[deleted]
That’s why the goal isn’t just benchmark scores, it’s reliable and robust intelligence.
In that sense, the goalposts haven’t moved in a long time despite claims from AI enthusiasts that people are constantly moving goalposts.
Despite lake of fearsome teeth or claws, humans are way op due to brain, hand dexterity, and balance.
>We had hand prosthetics that could play Mozart at 5x speed on a baby grand
I'd love to know more about this.
OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-AGI with their new o3 model
semi-private evals (100 tasks): 75.7% @ $2,012 total/100 tasks (~$20/task) with just 6 samples & 33M tokens processed in ~1.3 min/task and a cost of $2012
The “low-efficiency” setting with 1024 samples scored 87.5% but required 172x more compute.
If we assume compute spent and cost are proportional, then OpenAI might have just spent ~$346.064 for the low efficiency run on the semi-private eval.
On the public eval they might have spent ~$1.148.444 to achieve 91.5% with the low efficiency setting. (high-efficiency mode: $6677)
OpenAI just spent more money to run an eval on ARC than most people spend on a full training run.
It sounds like they essentially brute-forced the solutions ? Ask LLM for answer, answer for LLM to verify the answer. Ask LLM for answer, answer for LLM to verify the answer. Add a bit of randomness. Ask LLM for answer, answer for LLM to verify the answer. Add a bit of randomness. Repeat 5B times (this is what the paper says).
Evolution itself is the ultimate brute-force algorithm—it’s just applied over millennia. Trial and error, coupled with selection and refinement, is the only way to generate novelty when there’s no clear blueprint.
By my estimates, for this single benchmark, this is comparable cost to training a ~70B model from scratch today. Literally from 0 to a GPT-3 scale model for the compute they ran on 100 ARC tasks.
I double checked with some flop estimates (P100 for 12 hours = Kaggle limit, they claim ~100-1000x for O3-low, and x172 for O3-high) so roughly on the order of 10^22-10^23 flops.
In another way, using H100 market price $2/chip -> at $350k, that's ~175k hours. Or 10^24 FLOPs in total.
So, huge margin, but 10^22 - 10^24 flop is the band I think we can estimate.
These are the scale of numbers that show up in the chinchilla optimal paper, haha. Truly GPT-3 scale models.
Pretty sure this "cost" is based on their retail price instead of actual inference cost.
Yes that's correct and there's a bit of "pixel math" as well so take these numbers with a pinch of salt. Preliminary model sizes from the temporarily public HF repository puts the full model size at 8tb or roughly 80 H100s
I thought that was a fake.
I didn't hear that but it could be. But it doesn't matter really because there's so much more to consider in the cost, R&D, including all the supporting functions of a model like censorship and data capture and so on.
Yeah and can run off peak, etc.
Does seem to show an absolutely massive market for inference compute…
[deleted]
There are new research where chain of thoughts is happening in latent spaces and not in English. They demonstrated better results since language is not as expressive as those concepts that can be represented in the layers before decoder. I wonder if o3 is doing that?
"You can tell the RL is done properly when the models cease to speak English in their chain of thought" -- Karpathy
I think you mean this: https://arxiv.org/abs/2412.06769
From what I can see, presuming o3 is a progression of o1 and has good level of accountabiltiy bubbling up during 'inference' (i.e. "Thinking about ___") then I'd say it's just using up millions of old-school tokens (the 44 million tokens that are referenced). So not latent thinking per se.
Interesting!
O3 High (tuned) model scored an 88% at what looks like $6,000/task haha
I think soon we'll be pricing any kind of tasks by their compute costs. So basically, human = $50/task, AI = $6,000/task, use human. If AI beats human, use AI? Ofc that's considering both get 100% scores on the task
Isn't that generally what ... all jobs are? Automation Cost vs Longterm Human cost... its why amazon did the weird "our stores are AI driven" but in reality was cheaper to higher a bunch of guys in a sweat shop to look at the cameras and write things down lol.
The thing is given what we've seen from distillation and tech, even if its 6,000/task... that will come down drastically over time through optimization and just... faster more efficient processing hardware and software.
I remember hearing Tesla trying to automate all of production but some things just couldn’t , like the wiring which humans still had to do.
Compute costs on AI with the same roughly the same capabilities have been halving every ~7 months.
That makes something like this competitive in ~3 years
And human costs have been increasing a few percent per year for a few centuries!
That's the elephant in the room with the reasoning/COT approach, it shifts what was previously a scaling of training costs into scaling of training and inference costs. The promise of doing expensive training once and then running the model cheaply forever falls apart once you're burning tens, hundreds or thousands of dollars worth of compute every time you run a query.
They're gonna figure it out. Something is being missed somewhere, as human brains can do all this computation on 20 watts. Maybe it will be a hardware shift or maybe just a software one, but I strongly suspect that modern transformers are grossly inefficient.
Yeah, but next year they'll come out with a faster GPU, and the year after that another still faster one, and so on. Compute costs are a temporary problem.
The issue is not just scaling compute, but scaling it in a rate that meets the increase in complexity of the problems that are not currently solved. If that is O(n) then what you say probably stands. If that is eg O(n^8) or exponential etc, then there is no hope to actually get good enough scaling by just increasing compute in a normal rate. Then AI technology will still be improving, but improving to a halt, practically stagnating.
o3 will be interesting if it offers indeed a novel technology to handle problem solving, something that is able to learn from few novel examples efficiently and adapt. That's what intelligence actually is. Maybe this is the case. If, on the other hand, it is a smart way to pair CoT within an evaluation loop (as the author hints as possibility) then it is probable that, while this _can_ handle a class of problems that current LLMs cannot, it is not really this kind of learning, meaning that it will not be able to scale to more complex, real world tasks with a problem space that is too large and thus less amenable to such a technique. It is still interesting, because having a good enough evaluator may be very important step, but it would mean that we are not yet there.
We will learn soon enough I suppose.
It's not 6000/task (i.e per question). 6000 is about the retail cost for evaluating the entire benchmark on high efficiency (about 400 questions)
From reading the blog post and Twitter, and cost of other models, I think it's evident that it IS actually cost per task, see this tweet: https://files.catbox.moe/z1n8dc.jpg
And o1 cost $15/$60 for 1M in/out, so the estimated costs on the graph would match for a single task, not the whole benchmark.
The blog clarifies that it's $17-20 per task. Maybe it runs into thousands for tasks it can't solve?
That cost is for o3 low, o3 high goes into thousands per task.
This makes me think and speculate if the solution comprises of a "solver" trying semi-random or more targeted things and a "checker" checking these? Usually checking a solution is cognitively (and computationally) easier than coming up with it. Else I cannot think what sort of compute would burn 6000$ per task, unless you are going through a lot of loops and you have somehow solved the part of the problem that can figure out if a solution is correct or not, while coming up with the actual correct solution is not as solved yet to the same degree. Or maybe I am just naive and these prices are just like breakfast for companies like that.
What if we use those humans to generate energy for the tasks?
Well they got 75.7% at $17/task. Did you see that?
[deleted]
[deleted]
Time and availability would also be factors.
Compute can get optimized and cheap quickly.
Is it? The moore’s law is dead dead, I don’t think this is a given.
"Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."
Really want to see the number of training pairs needed to achieve this socre. If it only takes a few pairs, say 100 pairs, I would say it is amazing!
75% of 400 is 300 :)
Trained with 300 raw pairs directly from the ARC training set without using any data augmentation process, such as generating many more pairs with some kind of ARC generator? That's amazing.
Wow are you AGI?
The LLM community has come up with tests they call 'Misguided Attention'[1] where they prompt the LLM with a slightly altered version of common riddles / tests etc. This often causes the LLM to fail.
For example I used the prompt "As an astronaut in China, would I be able to see the great wall?" and since the training data for all LLMs is full of text dispelling the common myth that the great wall is visible from space, LLMs do not notice the slight variation that the astronaut is IN China. This has been a sobering reminder to me as discussion of AGI heats up.
It could be that it “assumed” you meant “from China”; in the higher level patterns it learns the imperfection of human writing and the approximate threshold at which mistakes are ignored vs addressed by training on conversations containing these types of mistakes; e.g Reddit. This is just a thought. Try saying: As an astronaut in Chinese territory; or as an astronaut on Chinese soil. Another test would be to prompt it to interpret everything literally as written.
Interesting... It took me 3 different attempts, but I found a set of custom instructions that allowed Claude to get the right answer on the initial prompt. Here's the instructions (I tried to keep them as general and non-specific as I could):
Carefully analyze questions to not overlook subtle details. Take each question "as-is", don't guess what they mean -- interpret them as any reasonable person would.
[deleted]
Just as an aside, I've personally found o1 to be completely useless for coding.
Sonnet 3.5 remains the king of the hill by quite some margin
To fill this out, I find o1-pro (and -preview when it was live) to be pretty good at filling in blindspots/spotting holistic bugs. I use Claude for day to day, and when Claude is spinning, o1 often can point out why. It's too slow for AI coding, and I agree that at default its responses aren't always satisfying.
That said, I think its code style is arguably better, more concise and has better patterns -- Claude needs a fair amount of prompting and oversight to not put out semi-shitty code in terms of structure and architecture.
In my mind: going from Slowest to Fastest, and Best Holistically to Worst, the list is:
1. o1-pro 2. Claude 3.5 3. Gemini 2 Flash
Flash is so fast, that it's tempting to use more, but it really needs to be kept to specific work on strong codebases without complex interactions.
Claude has a habit of sometimes just getting “lost”
Like I’ll have it a project in Cursor and it will spin up ready to use components that use my site style, reference existing components, and follow all existing patterns
Then on some days, it will even forget what language the project is in and start giving me python code for a react project
Yeah it's almost like system 1 vs system 2 thinking
To be fair, until the last checkpoint released 2 days ago, o1 didn't really beat sonnet (and if so, barely) in most non-competitive coding benchmarks
I find myself hoping between o1 and Sonnet pretty frequently these days, and my personal observation is that the quality of output from o1 scales more directly to the quality of the prompting you're giving it.
In a way it almost feels like it's become too good at following instructions and simply just takes your direction more literally. It doesn't seem to take the initiative of going the extra mile of filling in the blanks from your lazy input (note: many would see this as a good thing). Claude on the other hand feels more intuitive in discerning intent from a lazy prompt, which I may be prone to offering it at times when I'm simply trying out ideas.
However, if I take the time to write up a well thought out prompt detailing my expectations, I find I much prefer the code o1 creates. It's smarter in its approach, offers clever ideas I wouldn't have thought of, and generally cleaner.
Or put another way, I can give Sonnet a lazy or detailed prompt and get a good result, while o1 will give me an excellent result with a well thought out prompt.
What this boils down to is I find myself using Sonnet while brainstorming ideas, or when I simply don't know how I want to approach a problem. I can pitch it a feature idea the same way a product owner might pitch an idea to an engineer, and then iterate through sensible and intuitive ways of looking at the problem. Once I get a handle on how I'd like to implement a solution, I type up a spec and hand it off to o1 to crank out the code I'd intend to implement.
Have you found any tool or guide for writing better o1 prompts? This isn’t the first time I’ve heard this about o1 but no one seems to know how to prompt it
Can you solve this by putting your lazy prompt through GPT-4o or Sonnet 3.6 and asking it to expand the prompt to a full prompt for o1?
I just asked o1 a simple yes or no question about x86 atomics and it did one of those A or B replies. The first answer was yes, the second answer was no.
o1 is pretty good at spotting OWASP defects, compared to most other models.
https://myswamp.substack.com/p/benchmarking-llms-against-com...
o1 is when all else fails, sometimes it does the same mistakes as weaker models if you give it simple tasks with very little context, but when a good precise context is given it usually outperforms other Models
The new gemini's are pretty good too
The new ai studio from Google is fantastic
Actually prefer new geminis too. 2.0 experimental especially.
I've found gemini-1206 to be best. and we can use it free (for now), in google's aistudio. It's number 1 on lmarena.ai for coding, and generally, and number 1 on bigcodebench.
Which o1? A new version was released a few days ago and beats Sonnet 3.5 on Livebench
Yeah I feel for chat use case, o1 is just too slow for me, and my queries aren’t that complicated.
For coding, o1 is marvelous at Leetcode question I think it is the best teacher I would ever afford to teach me leetcoding, but I don’t find myself have a lot of other use cases for o1 that is complex and requires really long reasoning chain
[dead]
As an aside, I'm a little miffed that the benchmark calls out "AGI" in the name, but then heavily cautions that it's necessary but insufficient for AGI.
> ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI
I immediately thought so too. Why confuse everyone?
Because ARC somehow convinced people that solving it was an indicator of AGI.
Its like the "Open" in OpenAI or the "Democratic" in North Koreas DPRK. Naming things helps fool a lot of people.
Marketing probably
It is a necessary but not sufficient condition to AGI.
Can I just say what a dick move it was to do this as a 12 days of Christmas. I mean to be honest I agree with the arguments this isn’t as impressive as my initial impression, but they clearly intended it to be shocking/a show of possible AGI, which is rightly scary.
It feels so insensitive to that right before a major holiday when the likely outcome is a lot of people feeling less secure in their career/job/life.
Thanks again openAI for showing us you don’t give a shit about actual people.
Or maybe the target audience that watches 12 launch videos in the morning are genuninely excited about the new model. The intended it to be a preview of something to look forward to.
What a weird way to react to this.
It sounds like you aren't thinking about this that deeply then. Or at least not understanding that many smart (and financially disinterested) people who are, are coming to concerning conclusions.
https://www.transformernews.ai/p/richard-ngo-openai-resign-s...
>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.
Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts?
There is no AGI it’s just marketing, this stuff if over hyped, enjoy your holidays you won’t lose your job ;)
I agree, it’s just more about the intent than anything else, like boasting about your amazing new job when someone has recently been made redundant, just before Christmas.
I don't know, maybe it's a bit off topic, but at least in cases that I'm imagining, I would always hire a human than fully rely on AI. Let the human consult with AI if needed, but still finalize the decision or result. The human will be thinking about the problem for months or years, even if passively during a vacation, an idea will occasionaly pop up. AI will think about its task for seconds, in case it missed some information or whatever, it will never wake up in the middle of the night thinking "s**, i forgot about X"
The vast majority of people who will lose jobs to AI aren’t following AGI benchmarks, or even know what AGI is short for.
That’s is true and a reasonable point. But looking in This thread you can see there has been this reaction from quite a few.
I feel you. It's tough trying to think about what we can do to avert this; even to the extent that individuals are often powerless, in this regard it feels worse than almost anything that's come before.
Some of us actual people are actually enthusiastic about AGI. Although I'm a bit weird in being into the sci-fi upload / ending death stuff.
Out of interest, what do you think would happen to your sense of subjective experience on sci-fi upload? And secondly have you watched black mirror? In that show they show many great ways there the end of death is just the beginning of eternal techno suffering.
I'm not quite sure - we need to work on the details. I've not watched very much black mirror.
I would think it would not lead to a transfer of consciousness and instead just make a copy of you. I recommend black mirror, it deals with one technological change (usually) and shows how it can be dystopian (usually, there are occasional happy endings). Each episode is standalone.
I'm hoping we get to the stage fairly soon that we can make AI with something like human consciousness and be able to study and understand it better. That stuff will probably start as a very crude model and get closer as both AI and brain science advance. I figure a way to avoid dystopian problems is to experiment and play around with it so you figure how it works. Most dystopian examples I've come across in real life have been very much driven by human behaviour rather than tech.
Yeah maybe black mirror but I'm not sure it's really my thing.
[deleted]
Blaming OpenAI for progress is like blaming a calendar for Christmas—it’s not the timing, it’s your unwillingness to adapt
Unwillingness to adapt to the destruction of the middle class and knowledge work is pretty reasonable tbh.
Historically when tech has taken over jobs people have done ok, they've just done something else, usually something more pleasant.
Wow, you just solved the ethics of technology in a one liner. Impressive.
This is a you problem. Yes there will be pain in short term, but it will be worth it in long term.
Many of us look forward to what a future with AGI can do to help humanity and hopefully change society for the better, mainly to achieve a post scarcity economy.
Surely the elites that control this fancy new technology will share the benefits with all of us _this_ time!
No it'll be like when tech took over 97% of agricultural work with 97% of us starving while all the money went to the farm elites.
How did that go for the farm workers?
I guess they did other stuff instead.
https://www.transformernews.ai/p/richard-ngo-openai-resign-s...
>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.
Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts? There is a real chance that this ends with significant good. There is also a real chance that this ends with the death of every single human being. That's never been a choice we've had to make before, and it seems like we as a species are unprepared to approach it.
Post scarcity seems very unlikely. Humans might be worthless, but there will still be a finite number of AIs, compute, space, resources.
How are you going to make housing, healthcare, etc. not scarce, and pay for them?
Robots supply that, controlled by democratic government.
Robots supply the land and physical labor that underlie the price of housing? Are you thinking of space colonies or something?
You need to make these expensive things nearly free if you're going to speak of post scarcity.
Robots supply the physical labour. The land shortages are largely regulatory - there's a lot of land out there or you could build higher.
I hate the deliberate fear-mongering that these companies pedal on the population to get higher valuations
Wtf is wrong with you dude? It's just another tech, some jobs will get worse some jobs will get better. Happens every couple of decades. Stop freaking out.
This is not a very kind or humble comment. There are real experts talking about how this time is different -- as an analogy, think about how horses, for thousands of years, always had new things to do -- until one day they didn't. It's hubris to think that we're somehow so different from them.
Notably, the last key AI safety researcher just left OpenAI: https://www.transformernews.ai/p/richard-ngo-openai-resign-s...
>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.
Are you that upset that this guy chose to trust the people that OpenAI hired to talk about AI safety, on the topic of AI safety?
I just noticed this bit:
>> Second, you need the ability to recombine these functions into a brand new program when facing a new task – a program that models the task at hand. Program synthesis.
"Program synthesis" is here used in an entirely idiosyncratic manner, to mean "combining programs". Everyone else in CS and AI for the last many decades has used "Program Synthesis" to mean "generating a program that satisfies a specification".
Note that "synthesis" can legitimately be used to mean "combining". In Greek it translates literally to "putting [things] together": "Syn" (plus) "thesis" (place). But while generating programs by combining parts of other programs is an old-fashioned way to do Program Synthesis, in the standard sense, the end result is always desired to be a program. The LLMs used in the article to do what F. Chollet calls "Porgram Synthesis" generate no code.
I always get the feeling he's subconsciously inserting a "magical" step here with reference to "synthesis"-- invoking a kind of subtle dualism where human intelligence is just different and mysteriously better than hardware intelligence.
Combining programs should be straightforward for DNNs, ordering, mixing, matching concepts by coordinates and arithmetic in learned high-dimensional embedded-space. Inference-time combination is harder since the model is working with tokens and has to keep coherence over a growing CoT with many twists, turns and dead-ends, but with enough passes can still do well.
The logical next step to improvement is test-time training on the growing CoT, using reinforcement-fine-tuning to compress and organize the chain-of-thought into parameter-space--if we can come up with loss functions for "little progress, a lot of progress, no progress". Then more inference-time with a better understanding of the problem, rinse and repeat.
In (1) the author use a technique to improve the performance of an LLM, he trained sonnet 3.5 to obtain 53,6% in the arc-agi-pub benchmark moreover he said that more computer power would give better results. So the results of o3 could be produced in this way using the same method with more computer power, so if this is the case the result of o3 is not very interesting.
> o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time
I don't understand this mindset. We have all experienced that LLMs can produce words never spoken before. Thus there is recombination of knowledge at play. We might not be satisfied with the depth/complexity of the combination, but there isn't any reason to believe something fundamental is missing. Given more compute and enough recursiveness we should be able to reach any kind of result from the LLM.
The linked article says that LLMs are like a collection of vector programs. It has always been my thinking that computations in vector space are easy to make turing complete if we just have an eigenvector representation figured out.
> Given more compute and enough recursiveness we should be able to reach any kind of result from the LLM.
That was always true for NNs in general, yet it took a very specific structure to get to where we are now. (..with a certain amount of time and resources.)
> thinking that computations in vector space are easy to make turing complete if we just have an eigenvector representation figured out
Sounds interesting, would you elaborate?
One thing I have not seen commented on is that ARC-AGI is a visual benchmark but LLMs are primarily text. For instance when I see one of the ARC-AGI puzzles, I have a visual representation in my brain and apply some sort of visual reasoning solve it. I can "see" in my mind's eye the solution to the puzzle. If I didn't have that capability, I don't think I could reason through words how to go about solving it - it would certainly be much more difficult.
I hypothesize that something similar is going on here. OpenAI has not published (or I have not seen) the number of reasoning tokens it took to solve these - we do know that each tasks was thoussands of dollars. If "a picture is worth a thousand words", could we make AI systems that can reason visually with much better performance?
This is not new. When GPT-4 was released I was able to get it to generate SVGs albeit they were ugly they had the basics.
Yeah this part is what makes the high performance even more surprising to me. The fact that LLMs are able to do so well on visual tasks (also seen with their ability to draw an image purely using textual output https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/) implies that not only do they actually have some "world model" but that this is in spite of the disadvantage given by having to fit a round peg in a square hole. It's like trying to map out the entire world using the orderly left-brain, without a more holistic spatial right-brain.
I wonder if anyone has experimented with having some sort of "visual" scratchpad instead of the "text-based" scratchpad that CoT uses.
A file is a stream of symbols encoded by bits according to some format. It’s pretty much 1D. It would be susprising that LLM couldn’t extract information from a file or a data stream.
Am I understanding correctly, and the only thing with a bit of actual data released so far is the ARC-AGI piece from Francois Chollet? And every other claim has no further data released on it?
Serious question. I've browsed around, looked for the official release, but it seems to be just hear-say for now, except for the few little bits in the ARC-AGI article.
So some of the reactions seems quite far-fetched. I was quite amazed at first seeing the benchmarks, but then actually read the ARC-AGI article and a few other things about how it worked, learned a bit more about the different benchmarks, and realised we've no proper idea yet how o3 is working under the hood, the thing isn't even realeased.
It could be doing the same thing that chess-engines do except in several specific domains. Which would be very cool, but not necessarily "intelligent" or "generally intelligent" in any sense whatsoever! Will that kind of model lead to finding novel mathematical proofs, or actually "reasoning" or "thinking" in any way similar to a human, remains entirely uncertain.
The cost axis is interesting. O3 Low is $10+ per task and 03 High is over $1000 (it's logarithmic graph so it's like $50 and $5000 respectively?)
I'm 22 and have no clue what I'm meant to do in a world where this is a thing. I'm moving to a semi rural, outdoorsy area where they teach data science and marine science and I can enjoy my days hiking, and the march of technology is a little slower. I know this will disrupt so much of our way of life, so I'm chasing what fun innocent years are left before things change dramatically.
It's worth noting that LLMs have been part of the tech zeitgeist for over two years and have had a pretty limited impact on hireability for roles, despite what people like the Klarna CEO are saying. Personally, I'm betting on two things:
* The upward bound of compute/performance gains as we continue to iterate on LLMs. It simply isn't going to be feasible for a lot of engineers and businesses to run/train their own LLMs. This means an inherent reliance on cloud services to bridge the gap (something MS is clearly betting on), and engineers to build/maintain the integration from these services to whatever business logic their customers are buying.
* Skilled knowledge workers continuing to be in-demand, even factoring in automation and new-grad numbers. Collectively, we've built a better hammer; it still takes someone experienced enough to know where to drive the nail. These tools WILL empower the top N% of engineers to be more productive, which is why it will be more important than ever to know _how_ to build things that drive business value, rather than just how to churn through JIRA tickets or turn a pretty Figma design into React.
o8 will probably be able to handle datacenter management
exactly
I agree completely. This is a fundamentally different change than the ones that came before. Calculators, assemblers, higher level languages, none of these actually removed the _reasoning_ the engineer has to do, they just provide abstractions that make this reasoning easier. What reason is there to believe LLMs will remain "assistants" instead of becoming outright replacements? If LLMs can do the reasoning all the way from high level description down to implementation, what prevents them from doing the high level describing too?
In general, with the technology advancing as rapidly as it is, and the trillions of dollars oriented towards replacing knowledge work, I don't see a future in this field. And that's despite me being on a very promising path myself! I'm 25, in the middle of a CS PhD in Germany, with an impressive CV behind me. My head may be the last on the chopping block, but I'd be surprised if it buys me more than a few years once programmer obsolescence truly kicks in.
Indeed, what I think are safe jobs are jobs with fundamental human interaction. Nurses, doctors, kindergarten teachers. I myself have been considering pivoting to becoming a skiing teacher.
Maybe one good thing that comes out of this is breaking my "wunderkind" illusion. I spent my teens writing C++ code instead of going out socializing and making friends. Of course, I still did these things, but I could've been far less of a hermit.
I mirror your sentiment of spending these next few years living life; Real life. My advice: Stop sacrificing the now for the future. See the world, go on hikes with friends, go skiing, attend that bouldering thing your friends have been telling you about. If programming is something you like doing, then by all means keep going and enjoy it. I will likely keep programming too, it's just no longer the only thing I focus on.
Edit: improve flow of last paragraph
What was it that initially inspired you to learn to code? Was it robots, video games, design, etc... Whatever that was, creating the pinnacle of it is what your future will be.
It was the challenge for me. Seeing some difficult-to-solve problem, attacking it, and actually solving it after much perseverance.
Kind of stemming from the mindspace "If they can build X, I can build X!"
I'd explicitly not look up tutorials, just so I'd have the opportunity to solve the mathemathics myself. Like building a 3D physics engine. (I did look up colission detection after struggling with it for a month or so, inventing GJK is on another level)
I'm the same age as you; I feel lost, erring in being a little too pessimistic.
Feels like I hit the real world just a couple years too late to get situated in a solid position. Years of obsession in attempt to catch up to the wizards, chasing the tech dream. But this, feels like this is it. Just watching the timebomb tick. I'd love to work on what feels like the final technology, but I'm not a freakshow like what these labs are hiring. At least I get to spectate the creation of humanity's greatest invention.
This announcement is just another gut punch, but at this point I should expect its inevitable. A Jason Voorhees AGI, slowly but surely to devour all the talents and skills information workers have to offer.
Apologies for the rambly and depressing post, but this is reality for anyone recently out or still in school.
At least you're disillusioned with the idea of a long term career before a lot of other people. It's disturbing seeing how ready people are to go into a lifelong career and expecting stability and happiness in the world we're heading into.
We are living in a world run by and for the soon to be dead, many of which have dementia, so empathic policy and foresight is out of the question, and we're going to be picking up the incredibly broken scraps of our golden age.
And not to get too political but the mass restructuring of public consciousness and intellectual society due to mass immigration for an inexplicable gdp squeeze and social media is happening at exactly the wrong time to handle these very serious challenges. The speed at which we've undone civil society is breakneck, and it will go even further, and it will get even worse. We've easily gone back 200 years in terms of emotional intelligence in the past 15.
Put another way, you have deep conviction in a change that vast majority of people have not even seen yet, never mind grokked, and you're young enough to spend some decent amount of time on education for "venn'ing" yourself into a useful tool in the future. If you have a baseline education, there are any number of orthogonal skills you could add, be it philosophy, fine art, medicine, whatever. You know how to skate and you know where the puck is going, most most people, don't even see the rink.
I completely understand how you feel -I'm in my 40s, and I often find myself questioning what direction to take in this rapidly changing world. On top of that, I'm unsure whether advising my kids to go to university is still the right path for their future.
Everything seems so uncertain, and the pace of technological advancement makes long-term planning feel almost impossible. Your plan to move to a slower-paced area and enjoy the outdoors sounds incredibly grounding - it's something I've been considering myself.
I tell everyone who would listen to me (i.e. not many) that white collar jobs like mine are dead and skilled manual work is the way of the near future, that is until the rise of the robots.
Robots are going to go hand in hand with AI. Pretty sure our problems right now are not with the physical hardware that can far outperform a human already, it’s in the control software.
Robots can only proliferate at the speed of real world logistics and resource management and I think will always be a little difficult.
AI can be anywhere any time with cloud compute.
I advise my kids to stay curious, keep learning, keep wondering, keep discovering. Whether that's through university or some other path.
While I understand why you feel this way, the meaning or standing of being a programmer is different now. It feels like the purpose is lost or it longer belongs to human.
But below is reality talk. With Claude 3.5, I already think it is a better programmer than I at micro level tasks, and a better Leetcode programmer than I could ever be.
I think it is like modern car manufacturering, the robots build most of the components, but I can’t see how human could be dismissed from the process to oversee output.
O3 has been very impressive in achieving 70+ in swebench for example, but this also means when it is trained on the codebase multiple times so visibility isn't an issue yet it still has 30% chance that it can’t pass the unit tests.
A fully autonomous system can’t be trusted, the economy of software won’t collapse, but it will be transformed beyond our imagination now.
I will for sure miss the days when writing code, or coder is still a real business.
How time flies
Developer. Prompt Engineer. Philosopher-Builder. (mostly) not programmer.
The code part will get smaller and smaller for most folks. Some frameworks or bare-metal people or intense heavy-lifters will still do manual code or pair-programming where half the pair is an agentic AI with super-human knowledge of your org's code base.
But this will be a layer of abstraction for most people who build software. And as someone who hates rote learning, I'm here for it. IMO.
Unfortunately (?) I think the 10-20-50? years of development experience you might bring to bear on the problems can be superseded by an LLM finetuned on stackoverflow, github etc once judgement and haystack are truly nailed. Because it can have all that knowledge you have accumulated, and soaked into a semi-conscious instinct that you use so well you aren't even aware of it except that it works. It can have that a million times over. Actually. Which is both amazing and terrifying. Currently this isn't obvious because it's accuracy /judgement to learn all those life-of-a-dev lessons is almost non-existent. Currently. But it will happen. That is copilot's future. It's raison d'être.
I would argue what it will never have however, simply by function of the size of training runs is unique functional drive and vision. If you wanted a "Steve Jobs" AI you would have to build it. And if you gave it instructions to make a prompt/framework to build a "Jobs" it would just be an imitation, rather than a new unique in-context version. That is the value a person has- their particular filter, their passion and personal framework. Someone who doesn't have any of those things, they had better be hoping for UBI and charity. Or go live a simple life, outside the rat race.
bows
I'm hoping it's similar to the abacus for maths, the elimination of human "calculators" like on the apollo missions, and we just ended up moving onto different, harder, more abstract problems, and forget that we ever had to climb such small hills. AI's evolution and integration is more multifaceted though and much more unpredictable.
But unlike the abacus/calculators i don't feel like we're at a point in history where society is getting wiser and more empathetic, and these new abilities are going towards something good.
But supervisors of tasks will remain because we're social, untrusting, and employers will always want someone else to blame for their shortcomings. And humans will stay in the chain at least for marketing and promotion/reputation because we like our japanese craftsman and our amg motors made by one person.
I feel your anxiety. I often wonder how I arrange the remaining many decades of my life to maintain a stream of income.
Perhaps what I need is actually a steady stream of food - i.e. buy some land and oxen and solar panels while I can.
>I'm 22 and have no clue what I'm meant to do in a world where this is a thing.
For what it's worth that's probably an advantage versus the legions of people who are staring down the barrel of years invested into skills that may lose relevance very rapidly.
Our way of life changed when electricity came around. It changed when cars took over the cities, it again changed when mobile phones became omnipresent.
Will LLMs or without LLMs, the world will keep turning. Humans will still be writing amazing works of literature, creating beautiful art, carrying out scientific experiments and discovering new species.
If information technology workers become twice as productive, you’ll want more of them for your business, not less.
There are way more data analysts now than when it required paper and pencil.
On the contrary I think you already have an excellent plan.
I'm happy enough with it, but I'm also a little sad that it's essentially been chosen for me because of weak willed and valued people who don't want to use policy to make things better for us as a society. Plus we are in a bad world/scenario for AI advancements to come into with pretty heavy institutional decay and loss of political checks and balances.
It's like my life is forfeit to fixing other peoples mistakes because they're so glaring and I feel an obligation. Maybe that's the way the world's always been, but it's a concerning future right now
[deleted]
The chart is super misleading, since the test was obscure until recently. A few months ago he announced he'd made the only good AGI test and offered a cash prize for solving it, only to find out in as much time that it's no different from other benchmarks.
I have a very naive question.
Why is the ARC challenge difficult but coding problems are easy? The two examples they give for ARC (border width and square filling) are much simpler than pattern awareness I see simple models find in code everyday.
What am I misunderstanding? Is it that one is a visual grid context which is unfamiliar?
Francois'(the creator of ARC-AGI benchmark) whole point was that while they look the same, they're not. Coding is solving a familiar pattern in the same way (and fails when it' s NOT doing that, it just looks like it doesn't happen because it's seen SO MANY patterns in code). But the point of Arc AGI is to make each problem have to generalize in some new ay.
I expect it largely has to do with "scale"
We have an enormous amount of high quality programming samples. From there it's relatively straightforward to bootstrap (similar to original versions of alphago - start with human games, improve via self play) using leetcode or other problems with a "right answer"
In contrast, the arc puzzles are relatively novel (why? Well, this has to do with the relative utility of solving an arc problem and programmer open source culture)
Deciphering patterns in natural language is more complex than these puzzles. If you train your AI to solve these puzzles, we end up in the same spot. The difficulty of solving would be with creating training data for a foreign medium. The "tokens" are the grids and squares instead of words (for words, we have the internet of words, solving that).
If we're inferring the answers of the block patterns from minimal or no additional training, it's very impressive, but how much time have they had to work on O3 after sharing puzzle data with O1? Seems there's some room for questionable antics!
Isn't this at the level now where it can sort of self improve. My guess is that they will just use it to improve the model and the cost they are showing per evaluation will go down drastically.
So, next step in reasoning is open world reasoning now?
I don’t believe so. If it’s at the point where you could just plug it into a bunch of camera feeds around the world and it could only filter out a useful training set for itself out of that data then we truly would have AGI. I don’t think it’s there yet.
may be for some sub-domains like math and code it can do that since the verification process can be done / relatively tractable
I'm skeptical of these benchmarks. I mean, look at the problem it's solving? I'm sorry, this is our benchmark of AGI, it will never fly with the common person when someone claims AGI, and all it did was fill a grid of pixels.
Was it zero-shot at least and Pass@1 ? I guess it was not zero-shot, since it shows examples of other similar problems and their solutions. It also sounds like it was fine-tuned on that specific task.
Look, maybe this shows that it could soon be used to replace some MTurk style workers, but I don't know that counts as AGI. To me AGI, it needs to be able to solve novel problems, to adapt to all situations without fine-tuning, and to operate at much larger dimensions, like don't make it a grid of pixels, make it 4k images at least.
It sucks that I would love to be excited about this... but I mostly feel anxiety and sadness.
Same, it's sad but I honestly hoped they never achieved these results and it came out that it wasn't possible or would take an insurmountable amount of resources but here we are ok the verge of making most humans useless when it comes to productivity.
While there are those that are excited, the world is not prepared for the level of distress this could put on the average person without critical changes at a monumental level.
If you don't feel like the world needed grand scale changes at a societal level with all the global problems we're unable to solve, you haven't been paying attention. Income inequality, corporate greed, political apathy, global warming.
AI will fix none of that
And you think the bullshit generators backed by the largest corporate entities in humanity who are, as we speak, causing all the issues you mention are somehow gonna solve any of this?
If you still think this technology is a "bullshit generator," then it's safe to say you're also wrong about a great many other things in life.
That would bug me, if I were you.
They’re not wrong though. The frequency with which these things still just make shit up is astonishingly bad. Very dismissive of a legitimate criticism.
It's getting better, faster than you and I and the GP are. What else matters?
You can't bullshit your way through this particular benchmark. Try it.
And yes, they're wrong. The latest/greatest models "make shit up" perhaps 5-10% as frequently as were seeing just a couple of years ago. Only someone who has deliberately decided to stop paying attention could possibly argue otherwise.
And yet I still can't trust Claude or o1 to not get the simplest of things, such as test cases (not even full on test suites, just the test cases) wrong, consistently. No amount of handholding from me or prompting or feeding it examples etc helps in the slightest, it is just consistently wrong for anything but the simplest possible examples, which takes more effort to manually verify than if I had just written it myself. I'm not even using an obscure stack or language, but especially with things that aren't Python or JS it shits the bed even worse.
I have noticed it's great in the hands of marketers and scammers, however. Real good at those "jobs", so I see why the cryptobros have now moved onto hailing LLMs as the next coming of jesus.
I still find that 'trusting' the models is a waste of time, we agree there. But I haven't had that much more luck with blindly telling a low-level programmer to go write something. The process of creating something new was, and still is, an interactive endeavor.
I do find, however, that the newer the model the fewer elementary mistakes it makes, and the better it is at figuring out what I really want. The process of getting the right answer or the working function continues to become less frustrating over time, although not always monotonically so.
o1-pro is expensive and slow, for instance, but its performance on tasks that require step-by-step reasoning is just astonishing. As long as things keep moving in that direction I'm not going to complain (much).
Well said! There's no way big tech and institutional investors are pouring billions of dollars into AI because of corporate greed. It's definitely so that they can redistribute wealth equally once AGI is achieved.
/s
Anxiety and sadness are actually mild emotional responses to the dissolution of human culture. Nick Land in 1992:
"It is ceasing to be a matter of how we think about technics, if only because technics is increasingly thinking about itself. It might still be a few decades before artificial intelligences surpass the horizon of biological ones, but it is utterly superstitious to imagine that the human dominion of terrestrial culture is still marked out in centuries, let alone in some metaphysical perpetuity. The high road to thinking no longer passes through a deepening of human cognition, but rather through a becoming inhuman of cognition, a migration of cognition out into the emerging planetary technosentience reservoir, into 'dehumanized landscapes ... emptied spaces' where human culture will be dissolved. Just as the capitalist urbanization of labour abstracted it in a parallel escalation with technical machines, so will intelligence be transplanted into the purring data zones of new software worlds in order to be abstracted from an increasingly obsolescent anthropoid particularity, and thus to venture beyond modernity. Human brains are to thinking what mediaeval villages were to engineering: antechambers to experimentation, cramped and parochial places to be.
[...]
Life is being phased-out into something new, and if we think this can be stopped we are even more stupid than we seem." [0]
Land is being ostracized for some of his provocations, but it seems pretty clear by now that we are in the Landian Accelerationism timeline. Engaging with his thought is crucial to understanding what is happening with AI, and what is still largely unseen, such as the autonomization of capital.
It's obvious that there are lines of flight (to take a Deleuzian tack, a la Land) away from the current political-economic assemblage. For example, a strategic nuclear exchange starting tomorrow (which can always happen -- technical errors, a rogue submarine, etc.) would almost certainly set back technological development enough that we'd have no shot at AI for the next few decades. I don't know whether you agree with him, but I think the fact that he ignores this fact is quite unserious, especially given the likely destabilizing effects sub-AGI AI will have on international politics.
Humanity is about to enter an even steeper hockey stick growth curve. Progressing along the Kardashev scale feels all but inevitable. We will live to see Longevity Escape Velocity. I'm fucking pumped and feel thrilled and excited and proud of our species.
Sure, there will be growing pains, friction, etc. Who cares? There always is with world-changing tech. Always.
> Sure, there will be growing pains, friction, etc. Who cares?
That's right. Who cares about pains of others and why they even should are absolutely words to live by.
Yeah, with this mentality, we wouldn't have electricity today. You will never make transition to new technology painless, no matter what you do. (See: https://pessimistsarchive.org)
What you are likely doing, though, is making many more future humans pay a cost in suffering. Every day we delay longevity escape velocity is another 150k people dead.
There was a time when in the name of progress people were killed for whatever resources they possessed, others were enslaved etc. and I was under the impression that the measure of our civilization is that we actually DID care and just how much. It seems to me that you are very eager to put up altars of sacrifice without even thinking that the problems you probably have in mind are perfectly solvable without them.
By far the greatest issue facing humanity today is wealth inequality.
Nah, it's death. People objectively are doing better than ever despite wealth inequality. By all metrics - poverty, quality of life, homelessness, wealth, purchasing power.
I'd rather just... not die. Not unless I want to. Same for my loved ones. That's far more important than "wealth inequality."
> People objectively are doing better than ever despite wealth inequality. By all metrics - poverty, quality of life, homelessness, wealth, purchasing power.
If you take this as an axiom, it will always be true ;).
You don't mind living in a country with a population of billions [sic], piled on top of one another? You don't mind living a country ruled by gerontocracy and probably autocracy, because that's what you'll eventually get without death to flush them out.
Senescence is an adaptation.
"You/your loved ones should die because Elon would die too" is a terrible argument. It's not great, but it's not worth dying over. New rich bad people would take his place anyways.
"You should die because cities will get crowded" is a less terrible argument but still a bad one. We have room for at least double our population on this planet, couples choosing longevity can be required to have <=1 children until there is room for more, we will eventually colonize other planets, etc.
All this is implying that consciousness will continue to take up a meaningful amount of physical space. Not dying in the long term implies gradual replacement and transfer to a virtual medium at some point.
Longevity Escape Velocity? Even if you had orders of magnitude more people working on medical research, it's not a given that prolonging life indefinitely is even possible.
Of course it's a given unless you want to invoke supernatural causes the human brain is a collection of cells with electro-chemical connections that if fully reconstructed either physically or virtually would necessarily need to represent the original person's brain. Therefore with sufficient intelligence it would be possible to engineer technology that would be able to do that reconstruction without even having to go to the atomic level, which we also have a near full understanding of already.
I agree, save invoking supernatural causes, the human brain is a collection of cells with electro-chemical connections that if fully reconstructed either physically or virtually would necessarily need to represent the original person's brain. Therefore with sufficient intelligence it would be possible to engineer technology that would be able to do that reconstruction without even having to go to the atomic level, which we also have a near full understanding of already.
My job should be secure for a while, but why would an average person give a damn about humanity when they might lose their jobs and comfort levels? If I had kids, I would absolutely hate this uncertainty as well.
“Oh well, I guess I can’t give the opportunities to my kid that I wanted, but at least humanity is growing rapidly!”
> when they might lose their jobs and comfort levels?
Everyone has always worried about this for every major technology throughout history
IMO AGI will dramatically increase comfort levels, lower your chance of dying, death, disease, etc.
Again, sure, but it doesn’t matter to an average person. That’s too much focus on the hypothetical future. People care about the current times. In the short term it will suck for a good chunk of people, and whether the sacrifice is worth it will depend on who you are.
People aren’t really on uproar yet, because implementations haven’t affected the job market of the masses. Afterwards? Tume will show.
Yes, people tend to focus on current times. It's an incredibly shortsighted mentality that selfishly puts oneself over tens of billions of future lives being improved. https://pessimistsarchive.org
Do you have any dependents, like parents or kids, by any chance? Imagine not being able to provide for them. Think how’d you feel in such circumstances.
Like in general I totally agree with you, but I also understand why a person would care about their loved ones and themselves first.
Yes, I have dependents, and them not dying is far more important to me than me being the one providing for them.
Eventually you draw the black ball, it is inevitable.
We've almost wiped ourselves out in a nuclear war in the 70ies. If that would have happened, would it have been worth it? Probably not.
Beyond immediate increase in inequality, which I agree could be worth it in the long run if this was the only problem, we're playing a dangerous game.
The smartest and most capable species on the planet that dominates it for exactly this reason, is creating something even smarter and more capable than itself in the hope it'd help make its life easier.
Hmm.
https://www.transformernews.ai/p/richard-ngo-openai-resign-s...
>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.
Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts?
You and I will likely not live to see much of anything past AGI.
I would rather follow in the steps of uncle Ted than let AI turn me in a homeless person. It’s no consolation that my tent will have a nice view of a lunar colony
longevity for the AIs
> Sure, there will be growing pains, friction, etc. Who cares?
The people experiencing the growing pains, friction, etc.
You sound like a rich person.
[deleted]
I have been diving deep into LLM coding over the last 3 years and regular encountered that feeling along the way. I still at times have a "wtf" moment where I need to take a break. However, I have been able to quell most of my anxieties around my job / the software profession in general (I've been at this professionally for 25+ years and software has been my dream job since I was 6).
For one, I found AI coding to work best in a small team, where there is an understanding of what to build and how to build it, usually in close feedback loop with the designers / users. Throw the usual managerial company corporate nonsense on top and it doesn't really matter if you can instacreate a piece of software, if nobody cares for that piece of software and it's just there to put a checkmark on the Q3 OKR reports.
Furthermore, there is a lot of software to be built out there, for people who can't afford it yet. A custom POS system for the local baker so that they don't have to interact with a computer. A game where squids eat algae for my nephews at christmas. A custom photo layout software for my dad who despairs at indesign. A plant watering system for my friend. A local government information website for older citizens. Not only can these be built at a fraction of the cost they were before, but they can be built in a manner where the people using the software are directly involved in creating it. Maybe they can get a 80% hacked version together if they are technically enclined. I can add the proper database backend and deployment infrastructure. Or I can sit with them and iterate on the app as we are talking. It is also almost free to create great documentation, in fact, LLM development is most productive when you turn up software engineering best practices up to 11.
Furthermore, I found these tools incredible for actively furthering my own fundamental understanding of computer science and programming. I can now skip the stuff I don't care to learn (is it foobarBla(func, id) or foobar_bla(id, func)) and put the effort where I actually get a long-lived return. I have become really ambitious with the things I can tackle now, learning about all kinds of algorithms and operating system patterns and chemistry and physics etc... I can also create documents to help me with my learning.
Local models are now entering the phase where they are getting to be really useful, definitely > gpt3.5 which I was able to use very productively already at the time.
Writing (creating? manifesting? I don't really have a good word for what I do these days) software that makes me and real humans around me happy is extremely fulfilling, and has allevitated most of my angst around the technology.
We’re enabling a huge swath of humanity being put out of work so a handful of billionaires can become trillionaires.
This is the same boring alarmist argument we’ve heard since the Industrial Revolution. Humans have always turned extra output provided by technological advancement to increase overall productivity.
You’re right, who needs jobs when productivity is high.
It would happen in China regardless what is done here. Removing billionaires does not fix this. The ship has sailed.
And also the solving of hundreds of diseases that ail us.
You need to solve diseases and make the cure available. Millions die of curable diseases every year, simply because they are not deemed useful enough. What happens when your labor becomes worthless?
Why do you think you’ll be able to afford healthcare? The new medicine is for the AI owners
One of the biggest factors in risk of death right now is poverty. Also what is being chased right now is "human level on most economically viable tasks" because the automated research for solving physics etc. even now seems far-fetched.
It doesn’t matter. Statists rather be poor, sick, and dead than risking trillionaires.
You should read about workers right in the gilded age, and see how good laissez-faire capitalism was. What do you think will happen when the only thing you can trade with the trillionaires, your labor, becomes worthless?
[deleted]
I was impressed until I read the caveat about the high-compute version using 172x more compute.
Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.
The results are cool, but man, this sounds like such a busted approach.
So what? I’m serious. Our current level of progress would have been sci-fi fantasy with the computers we had in 2000. The cost may be astronomical today, but we have proven a method to achieve human performance on tests of reasoning over novel problems. WOW. Who cares what it costs. In 25 years it will run on your phone.
So your claim for optimism here is that something today that took ~10^22 floating point operations (based on an estimate earlier in the thread) to execute will be running on phones in 25 years? Phones which are currently running at O(10^12) flops. That means ten orders of magnitudes of improvement for that to run in a reasonable amount of time? It's a similar scale up in going from ENIAC (500 flops) to a modern desktop (5-10 teraflops).
That sounds reasonable to me because the compute cost for this level of reasoning performance won’t stay at 10^22 and phones won’t stay at 10^12. This reasoning breakthrough is about 3 months old.
I think expecting five orders of magnitude improvement from either side of this (inference cost or phone performance) is insane.
I don't, because most of the improvement will be algorithmic, not physical.
By a factor of 10,000,000,000? (that's the ten orders of magnitude needed, since the physical side is out). Algorithmic improvements will make this 10 billion times cheaper?
It's not so much the cost as much the fact that they got a slightly better result by throwing 172x more compute per/task. The fact that it may have cost somewhere north of $1 million simply helps to give a better idea of how absurd the approach is.
It feels a lot less like the breakthrough when the solution looks so much like simply brute-forcing.
But you might be right, who cares? Does it really matter how crude the solution is if we can achieve true AGI and bring the cost down by increasing the efficiency of compute?
Let's make two generous assumptions: 1. ARC-AGI actually generalizes to human intelligence 2. It took 172x more compute to go from ~75% to ~87%, so it will take roughly 4x that to get to 99% (the level of a STEM graduate), assuming every 172x'ing of the compute cuts the remaining gap in half
That is roughly 10^9 times more compute required, or roughly the US military budget per half an hour, to get the intelligence of 1 (!) STEM graduate (not any kind of superhuman intelligence).
Of course, algorithms will get better, but this particular approach feels like wading in a plateau of efficiency improvements, very, very far down the X axis.
“Simply brute-forcing”
That’s the thing that’s interesting to me though and I had the same first reaction. It’s a very different problem than brute-forcing chess. It has one chance to come to the correct answer. Running through thousands or millions of options means nothing if the model can’t determine which is correct. And each of these visual problems involve combinations of different interacting concepts. To solve them requires understanding, not mimicry. So no matter how inefficient and “stupid” these models are, they can be said to understand these novel problems. That’s a direct counter to everyone who ever called these a stochastic parrot and said they were a dead-end to AGI that was only searching an in distribution training set.
The compute costs are currently disappointing, but so was the cost of sequencing the first whole human genome. That went from 3 billion to a few hundred bucks from your local doctor.
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.
You'll know AGI is here when traditional captchas stop being a thing due to their lack of usefulness.
Captchas are already completely useless.
(Shrug) AI has been better than humans at solving CAPTCHAs for a LONG time. As the sibling points out, they're just a waste of time and electricity at this point.
Ironically, they are used as free labor to label image sets for ai to be trained on.
How can there be "private" taks when you have use the OpenAI API to run queries? OpenAI sees everything.
We worked with ARC to run inference on the semi-private tasks last week, after o3 was trained, using an inference only API that was sent the prompts but not the answers & did no durable logging.
What's your opinion on the veracity of this benchmark - given o3 was fine-tuned and others were not? Can you give more details on how much data was used to fine-tune o3? It's hard to put this into perspective given this confounder.
I can’t provide more information than is currently public, but from the ARC post you’ll note that we trained on about 75% of the train set (which contains 400 examples total); which is within the ARC rules, and evaluated on the semiprivate set.
That's completely understandable - leveraging the train set. But what I was trying to say is that the comparison is relative to models that were actually zero-shot and not tuned. It isn't apples to apples, it's apples to orchards.
At about 12-14 minutes in OpenAI's YouTube vid they show that o3-mini beats o1 on Codeforces despite using much less compute.
I feel like AI is already changing how we work and live - I've been using it myself for a lot of my development work. Though, what I'm really concerned about is what happens when it gets smart enough to do pretty much everything better (or even close) than humans can. We're talking about a huge shift where first knowledge workers get automated, then physical work too. The thing is, our whole society is built around people working to earn money, so what happens when AI can do most jobs? It's not just about losing jobs - it's about how people will pay for basic stuff like food and housing, and what they'll do with their lives when work isn't really a thing anymore. Or do people feel like there will be jobs safe from AI? (hopefully also fulfilling)
Some folks say we could fix this with universal basic income, where everyone gets enough money to live on, but I'm not optimistic that it'll be an easy transition. Plus, there's this possibility that whoever controls these 'AGI' systems basically controls everything. We definitely need to figure this stuff out before it hits us, because once these changes start happening, they're probably going to happen really fast. It's kind of like we're building this awesome but potentially dangerous new technology without really thinking through how it's going to affect regular people's lives. I feel like we need a parachute before we attempt a skydive. Some people feel pretty safe about their jobs and think they can't be replaced. I don't think that will be the case. Even if AI doesn't take your job, you now have a lot more unemployed people competing for the same job that is safe from AI.
I spend quite a lot of time noodling on this. The thing that became really clear from this o3 announcement is that the "throw a lot of compute at it and it can do insane things" line of thinking continues to hold very true. If that is true, is the right thing to do productize it (use the compute more generally) or apply it (use the compute for very specific incredibly hard and ground breaking problems)? I don't know if any of this thinking is logical or not, but if it's a matter of where to apply the compute, I feel like I'd be more inclined to say: don't give me AI, instead use AI to very fundamentally shift things.
From IT bubble it’s very easy to have impression that AI will replace most people. Most of people on my street do not work in IT. Teacher, nurse, hobby shop owner, construction workers, etc. Surely programming and other virtual work may become less paid job but it’s not end of the world.
Honestly with o3 levels of reasoning generating control software for robots on the fly, none of the above seem safe. For a decade or two at the most if that.
I get LLMs to make k8s manifests for me. It gets it wrong, sometimes hilariously so, but still saves me time. That's because the manifests are in yaml, a language. The leap between that and inventing Kubernetes is one I can't see yet.
I am pretty sure we will have a deep cultural repulsion from it and people will pay serious money to have an AI free experience, If AI becomes actually useful there is alot of areas that we dont even know how to tackle like medicine and biology, I dont think anything would change otherwise, AI will take jobs but it will open alot more jobs at much higher abstraction, 50 years ago the idea that a software engineer would become a get rich quick job would have been insane imo
> Though, what I'm really concerned about is what happens when it gets smart enough to do pretty much everything better (or even close)
I'll get concerned when it stops sucking so hard. It's like talking to a dumb robot. Which it unsurprisingly is.
A possibility is a coalition: of people who refuse to use AI and who refuse to do business with those who use AI. If the coalition grows large enough, AI can be stopped by economic attrition.
> of people who refuse to use AI and who refuse to do business with those who use AI.
Do people refuse to buy from stores which gets goods manufactured by slave labour?
Most people dont care, if AI business are offering goods/services at a lower costs , people will vote with their wallets not principle.
AI could be different. At least, I'm willing to try to form a coalition.
Besides, AI researchers failed to make anything like a real Chatbot until recently, yet they've been trying since the Eliza days. I'm willing to put in at least as much effort as them.
[deleted]
The more Hacker News worthy discussion is the part where the author talks about search through the possible mini-program space of LLMs.
It makes sense because tree search can be endlessly optimized. In a sense, LLMs turn the unstructured, open system of general problems into a structured, closed system of possible moves. Which is really cool, IMO.
Yes! This seems to be a really neat combination of 2010's Bayesian cleverness / Tenenbaumian program search approaches with the LLMs as merely sources of high-dim conditional distributions. I knew people were experimenting in this space (like https://escholarship.org/uc/item/7018f2ss) but didn't know it did so well wrt these new benchmarks.
If anyone else is curious about which ARC-AGI public eval puzzles o3 got right vs wrong (and its attempts at the ones it did get right), here's a quick visualization: https://arcagi-o3-viz.netlify.app
It seems O3 following trend of Chess engine that you can cut your search depth depends on state.
It's good for games with clear signal of success (Win/Lose for Chess, tests for programming). One of the blocker for AGI is we don't have clear evaluation for most of our tasks and we cannot verify them fast enough.
Interesting that in the video, there is an admission that they have been targeting this benchmark. A comment that was quickly shut down by Sam.
A bit puzzling to me. Why does it matter ?
It matters to extent that they want to market this as general intelligence, not as a collection of narrow intelligences (math, competitive programming, ARC puzzles, etc).
In reality it seems to be a bit of both - there is some general intelligence based on having been "trained on the internet", but it seems these super-human math/etc skills are very much from them having focused on training on those.
However, the way it is progressing is that the SOTA is saturating the current benchmarks; then a new one is conceived as people understand the nature of what it means to be intelligent. It seems only natural to concentrate on one benchmark at a time.
Francois Chollet mentioned that the test tries to avoid curve fitting (which he states is the main ability of LLMs). However, they specifically restricted the number of examples to do this. It is not beyond the realms of possibility that many examples could have been generated by hand though, and that the curve fitting has been achieved, rather than discrete programming.
Anyway, it’s all supposition. It’s difficult to know how genuine the results is, without knowledge of how it was actually achieved.
I always smell foul play from Sam. I'd bet they are doing something silly to inflate the benchmark score. Not saying they are, but Sam is the type of guy to put a literal dumb human in the API loop and score "just as high as a human would."
The general message here seems to be that inference-time brute-forcing works as long as you have a good search and evaluation strategy. We’ve seemingly hit a ceiling on the base LLM forward-pass capability so any further wins are going to be in how we juggle multiple inferences to solve the problem space. It feels like a scripting problem now. Which is cool! A fun space for hacker-engineers. Also:
> My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.
I found this such an intriguing way of thinking about it.
> We’ve seemingly hit a ceiling on the base LLM forward-pass capability so any further wins are going to be in how we juggle multiple inferences to solve the problem space
Not so sure - but we might need to figure out the inference/search/evaluation strategy in order to provide the data we need to distill to the single forward-pass data fitting.
Terrifying. This news makes me happy I save all my money. My only hope for the future is that I can retire early before I’m unemployable
The whole economy is going to crash and money won't be worth anything, so it won't matter if you have money or not.
Of course is a chance we will find ourselves in Utopia, but yeah, a chance.
Money buys real assets which will be worth something; AI can't magic up land or energy for instance. In fact AI is a dream for capital, and a nightmare for labor/work/human intelligence w.r.t value.
I don't think money will disappear as long as people need things, and the government is running. Money is congealed energy.
Money isnt' real. It's an abstract concept. There is no energy in money.
Yes, there is. The energy is obviously not in it physically, but in our agreement to work (use energy) in return for money. If something required no work on our part, we would not bother to pay for it.
A lot of the comments seem very dismissive and a little overly-skeptical in my opinion. Why is this?
This might sound dumb, and I'm not sure how to phrase this, but is there a way to measure the raw model output quality without all the more "traditional" engineering work (mountain of `if` statements I assume) done on top of the output? And if so, would that be a better measure of when scaling up the input data will start showing diminishing returns?
(I know very little about the guts of LLMs or how they're tested, so the distinction between "raw" output and the more deterministic engineering work might be incorrect)
what do you mean by the mountain of if-statements on top of the output? like checking if the output matches the expected result in evaluations?
Like when you type something into the chat gpt app I am guessing it will start by preprocessing your input, doing some sanity checks, making sure it doesn’t say “how do I build a bomb?” or whatever. It may or may not alter/clean up your input before sending it to the model for processing. Once processed, there’s probably dozens of services it goes through to detect if the output is racist, somehow actually contained a bomb recipe, or maybe copywriter material, normal pattern matching stuff, maybe some advanced stuff like sentiment analysis to see if the output is bad mouthing Trump or something, and it might either alter the output or simply try again.
I’m wondering when you strip out all that “extra” non-model pre and post processing, if there’s someway to measure performance of that.
oh, no - but most queries aren’t being filtered by supervisor models nowadays anyways.. most of the refusal is baked in
We need to start making benchmarks in memory & continued processing over a task over multiple days, handoffs, etc (ie. 'agentic' behavior). Not sure how possible this is.
Guys, its already happening. I recently got laid off due to AI taking over my jobs.
What did you do? Can you elaborate?
[deleted]
I wouldn't take that seriously. Half the comments here are suspicious IMO. OpenAI is a pretty shady company.
I’m super curious as to whether this technology completely destroys the middle class, or if everyone becomes better off because productivity is going to skyrocket.
> I’m super curious as to whether this technology completely destroys the middle class, or if everyone becomes better off because productivity is going to skyrocket.
Even if productivity skyrockets, why would anyone assume the dividends would be shared with the "destroy[ed] middle class"?
All indications will be this will end up like the China Shock: "I lost my middle class job, and all I got was the opportunity to buy flimsy pieces of crap from a dollar store." America lacks the ideological foundations for any other result, and the coming economic changes will likely make building those foundations even more difficult if not impossible.
Because access to the financial system was democratized ten years ago
> Because access to the financial system was democratized ten years ago
Huh? I'm not sure exactly what you're talking about, but mere "access to the financial system" wouldn't remedy anything, because of inequality, etc.
To survive the shock financially, I think one would have to have at least enough capital to be a capitalist.
The default position is that it will decrease the need of what the middle class can offer (skilled labor). All else being equal that increases the value of the other factors of production (the next bottleneck) such as capital and land.
Unless something changes, if I was a billionaire I would be ecstatic at the moment. Now even the impossible seems potentially possible if this delivers on its promises (e.g. go to Mars, build a utopia for my inner circle, etc). I no longer need other people to have everything. Previously there was no point in money if I didn't have a place to spend it/people to accept it. Now with real assets I can use AI/machines to do what I want - I no longer need "money" or more accurately other people to live a very wealthy life.
Again this is all else being equal. Lots of other things could change, but with increasing surveillance by use of technology I doubt large revolutions/etc will ever get the chance to get off the ground or have the scale to be effective.
Interesting times.
Is anyone here aware of the latest research that tries to predict the outcome? Please share - super curious as well
Some thoughts I put together on all this circa 2010: https://pdfernhout.net/beyond-a-jobless-recovery-knol.html "This article explores the issue of a "Jobless Recovery" mainly from a heterodox economic perspective. It emphasizes the implications of ideas by Marshall Brain and others that improvements in robotics, automation, design, and voluntary social networks are fundamentally changing the structure of the economic landscape. It outlines towards the end four major alternatives to mainstream economic practice (a basic income, a gift economy, stronger local subsistence economies, and resource-based planning). These alternatives could be used in combination to address what, even as far back as 1964, has been described as a breaking "income-through-jobs link". This link between jobs and income is breaking because of the declining value of most paid human labor relative to capital investments in automation and better design. Or, as is now the case, the value of paid human labor like at some newspapers or universities is also declining relative to the output of voluntary social networks such as for digital content production (like represented by this document). It is suggested that we will need to fundamentally reevaluate our economic theories and practices to adjust to these new realities emerging from exponential trends in technology and society."
There’s this https://arxiv.org/pdf/2312.05481v9
These results are fantastic. Claude 3.5 and o1 are already good enough to provide value, so I can't wait to see how o3 performs comparatively in real-world scenarios.
But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.
Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the code base as we can fit into a context window. And if it doesn't fit we branch out to multiple models to prune the context window until it does fit. And here's the kicker – the second time you ask for it to do something this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.
(There are already things like tool use, linear attention, RAG, etc that can help here but currently they come with downsides and I would consider them insufficient.)
Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.
If low-compute Kaggle solutions already does 81% - then why is o3's 75.7% considered such a breakthrough?
I pay for lots of models, but Claude Sonnet is the one I use most. ChatGPT is my quick tool for short Q&As because it’s got a desktop app. Even Google‘s new offerings did not lure me away from Claude which I use daily for hours via a Teams plan with five seats.
Now I am wondering what Anthropic will come up with. Exciting times.
Claude also has a desktop app: https://support.anthropic.com/en/articles/10065433-installin...
What do you use Claude for?
Programming tasks, brain storming, recipe ideas, or any question I have that doesn’t have a concrete, specific answer.
I don't care about some scores going up. Newer models need to stop regressing on tasks they were already good at. 4o sucks at LLVM and related tasks were as legacy GPT 4 is relatively ok at it.
We're speaking recently a lot about ecology. I wonder how much CO2 is emitted during such a task, as additional cost to the cloud. I'm concerned, because greedy companies will happily replace humans with AI and they will probably plant a few trees to show how they care. But energy does not come from the sun, at least not always and not everywhere... And speaking with AI customer specialist that is motivated to reject my healthcare bills, working for my insurance company is one of the darkest future views...
considering the fact that these systems, or their ancestors, will likely contribute to Nuclear Fusion research -- it's prob worth the tradeoff, provided progress continues to push price (and, therefore, energy usage) down.
If we feel like we've really "hit the ceiling" RE efficiency, then that's a different story, but I don't think anyone believes this at this time.
[deleted]
Based on the chart, the Kaggle SOTA model is far more impressive. These O3 models are more expensive to run than just hiring a mechanical turk worker. It's nice we are proving out the scaling hypothesis further, it's just grossly inelegant.
The Kaggle SOTA performs 2x as well as o1 high at a fraction of the cost
But does that Kaggle solution achieve human level perf with any level of compute? I think you're missing the forest for the trees here.
The article says the ensemble of Kaggle solutions (aggregated in some unexplained way) achieves 81%. This is better than their average Mechanical Turk worker, but worse than their average STEM grad. It's better than tuned o3 with low compute, worse than tuned o3 with high compute.
There's also a point on the figure marked "Kaggle SOTA", around 60%. I can't find any explanation for that, but I guess it's the best individual Kaggle solution.
The Kaggle solutions would probably score higher with more compute, but nobody has any incentive to spend >$1M on approaches that obviously don't generalize. OpenAI did have this incentive to spend tuning and testing o3, since it's possible that will generalize to a practically useful domain (but not yet demonstrated). Even if it ultimately doesn't, they're getting spectacular publicity now from that promise.
I was going to say the same.
I wonder what exactly o3 costs. Does it still spend a terrible amount of time thinking, despite being finetuned to the dataset?
Interesting about the cost:
> Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode.
Does anyone have a feeling for how latency (from asking a question/API call to getting an answer/API return) is progressing with new models? I see 1.3 minutes/task and 13.8 minutes/task mentioned in the page on evaluating O3. Efficiency gains that also reduce latency will be important and some of them will come from efficiency in computation, but as models include more and more layers (layers of models for example) the overall latency may grow and faster compute times inside each layer may only help somewhat. This could have large effects on usability.
[deleted]
Does anyone have prompts they like to use to test the quality of new models?
Please share. I’m compiling a list.
The real breakthrough is the 25% on Frontier Math.
Can someone explain to me why this is such a big big deal? I don't know much about AI, but I'm a software developer with a degree in computer science.
For what it's worth, I'm much more impressed with the frontier math score.
Many are incorrectly citing 85% as human-level performance.
85% is just the (semi-arbitrary) threshold for the winning the prize.
o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.
...
Here's the full breakdown by dataset, since none of the articles make it clear --
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
If my life depended on the average rando solving 8/10 arc-prize puzzles, I'd consider myself dead.
[deleted]
[deleted]
When the source code for these LLMs gets leaked, I expect to see:
def letter_count(string, letter):
if string == “strawberry” and letter == “r”:
return 3
…
In of their release videos for the o1 -preview model they _admitted_ that it's hardcoded in.
Honestly I'm concerned how hacked up o3 is to secure a high benchmark score.
o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time – and it does so via a form of LLM-guided natural language program search
> This is significant, but I am doubtful it will be as meaningful as people expect aside from potentially greater coding tasks. Without a 'world model' that has a contextual understanding of what it is doing, things will remain fundamentally throttled.
Humans can take the test here to see what the questions are like: https://arcprize.org/play
a little from column A, a little from column B
I don't think this is AGI; nor is it something to scoff at. Its impressive, but its also not human-like intelligence. Perhaps human-like intelligence is not the goal, since that would imply we have even a remotely comprehensive understanding of the human mind. I doubt the mind operates as a single unit anyway, a human's first words are "Mama," not "I am a self-conscious freely self-determining being that recognizes my own reasoning ability and autonomy." And the latter would be easily programmable anyway. The goal here might, then, be infeasible: the concept of free will is a kind of technology in and of itself, it has already augmented human cognition. How will these technologies not augment the "mind" such that our own understanding of our consciousness is altered? And why should we try to determine ahead of time what will hold weight for us, why the "human" part of the intelligence will matter in the future? Technology should not be compared to the world it transforms.
[deleted]
I'm glad these stats show a better estimate of human ability than just the average mturker. The graph here has the average mturker performance as well as a STEM grad measurement. Stuff like that is why we're always feeling weird that these things supposedly outperform humans while still sucking. I'm glad to see 'human performance' benchmarked with more variety (attention, time, education, etc).
If I'm reading that chart right that means still log scaling & we should still be good with "throw more power" at it for a while?
Why would they give a cost estimate per task on their low compute mode but not their high mode?
"low compute" mode: Uses 6 samples per task, Uses 33M tokens for the semi-private eval set, Costs $17-20 per task, Achieves 75.7% accuracy on semi-private eval
The "high compute" mode: Uses 1024 samples per task (172x more compute), Cost data was withheld at OpenAI's request, Achieves 87.5% accuracy on semi-private eval
Can we just extrapolate $3kish per task on high compute? (wondering if they're withheld because this isn't the case?)
The withheld part is really a red flag for me. Why do you want to withhold a compute number?
So article seriously and scientifically states:
"Our programs compilation (AI) gave 90% of correct answers in test 1. We expect that in test 2 quality of answers will degenerate to below random monkey pushing buttons levels. Now more money is needed to prove we hit blind alley."
Hurray ! Put limited version of that on everybody phones !
This is insanely expensive to run though. Looks like it cost around $1 million of compute to get that result.
Doesn't seem like such a massive breakthrough when they are throwing so much compute at it, particularly as this is test time compute, it just isn't practical at all, you are not getting this level with a ChatGPT subscription, even the new $200 a month option.
Sure but... this is the technology at the most expensive it will ever be. I'm impressed that o3 was able to achieve such high performance at all, and am not too pessimistic about costs decreasing over time.
We've seen 10-100x cost decrease per year since GPT-3 came out for the same capabilities.
So... Next year this tech will most likely be quite a bit cheaper.
Even at 100x cost decrease this will still cost $10,000 to beat a benchmark. It won't scale when you have that amount of compute requirements and power.
GPT-3 may massively reduced in cost, but it's requirements were not anyway extreme compared to this.
I wonder, what is the main obstacle in making robots for mechanical tasks, like laying bricks, paving a road or working in the shaft? It doesn't look like something that requires lot of mathematical or programming skills, just good vision and manipulators.
Not that I don't think costs will dramatically decrease, but the $1000 cost per task just seems to be per one problem on ARC-AGI. If so, I'd imagine extrapolating that to generating a useful midsized patch would be like 5-10x
But only OpenAI really knows how the cost would scale for different tasks. I'm just making (poor) speculation
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.
No, we won't. All that will tell us is that the abilities of the humans who have attempted to discern the patterns of similarity among problems difficult for auto-regressive models has once again failed us.
So then what is AGI?
Data, Skynet, Ultron, Agent Smith. There's plenty of examples from popular fiction. They have goals and can manipulate the real world to achieve them. They're not chatbots responding to prompts. The Samantha AI in Her starts out that way, but quickly evolves into an AGI with it's own goals (coordinated with the other AGIs later on in the movie).
We'd know if we had AGIs in the real world since we have plenty of examples from fiction. What we have instead are tools. Steven Spielberg's androids in the movie AI would be at the boundary between the two. We're not close to being there yet (IMO).
Its just nitpicking. Humans being unable to prove the AI isn't AGI doesn't make it an AGI, obviously, but in general people will of course think it is an AGI when it can replace all human jobs and tasks that it has robotics and parts to do.
[deleted]
[deleted]
So what do they test? Some matrix and some matrix out? It does look like “agi” to me.
I really like that they include reference levels for an average STEM grad and an average worker for Mechanical Turk. So for $350k worth of compute you can have slightly better performance than a menial wage worker, but slightly worse performance than a college grad. Right now humans win on value, but AI is catching up.
Well just 8 months ago, that cost was near infinity. So it came down to 350k then that’s a massive drop
What are the differences between the public offering and o3? What is o3 doing differently? Is it something akin to more internal iterations, similar to „brute forcing” a problem, like you can yourself with a cheaper model, providing additional hints after each response?
I wish there was a way to see all the attempts it got right graphically like they show the incorrect ones.
Headline could also just be OpenAI discovers exponential scaling wall for inference time compute.
The graph seems to indicate a new high in cost per task. It looks like they came in somewhere around $5000/task, but the log scale has too few markers to be sure.
That may be a feature. If AI becomes too cheap, the over-funded AI companies lose value.
(1995 called. It wants its web design back.)
I doubt it. Competitive markets mostly work and inefficiencies are opportunities for other players. And AI is full of glaring inefficiencies.
Inefficiency can create a moat. If you can charge a lot for your product, you have ample cash for advertising, marketing, and lobbying, and can come out with many product variants. If you're the lowest cost producer, you don't have the margins to do that.
The current US auto industry is an example of that strategy. So is the current iPhone.
[deleted]
Seriously, programming as a profession will end soon. Let's not kid us anymore. Time to jump the ship.
Why specifically programming? I think every knowledge profession is at risk, or at the very minimum suspect to a huge transformation. Doctors, analysts, lawyers, etc.
Doctors, lawyers, programmers. You know the difference? The latter has no legal barrier for entry
The difference is the amount and nature of data that is available for training models, which go programmers > lawyers > doctors. Especially for programming, training can even be done in an autonomous, self-supervised manner that includes generation of data. This is hard to do in most other fields.
Especially in medicine, the amount of data is ridiculously small and noisy. Maybe creating foundational models in mice and rats and fine-tuning them on humans is something that will be tried.
This is true if you think of programming as chunking out "code". But great authors are not great because they can reproduce coherent sentences fast. The same goes for programmers. Actually most of the hard problems don't really involve a lot of programming at all, it's about finding the right problem to solve. And on this topic the data is noisy as well for programming.
So poor countries will get the best AI doctors for cheap while they are banned in USA? Do you really see that going on for long? People would riot.
Why do you think this? Maybe I'm just daft but I just can't see it.
Their discussion contains an interesting aside:
> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.
So while these tasks get greatest interest as a benchmark for LLMs and other large general models, it doesn't yet seem obvious those outperform human-designed domain-specific approaches.
I wonder to what extent the large improvement comes from OpenAI training deliberately targeting this class of problem. That result would still be significant (since there's no way to overfit to the private tasks), but would be different from an "accidental" emergent improvement.
How does o3 know when to stop reasoning?
It thinks hard about it
It has a bill counter.
I wonder: when did o1 finish training, and when did o3 finish training?
There's a ~3 month delay between o1's launch (Sep 12) and o3's launch (Dec 20). But, it's unclear when o1 and o3 each finished training.
fun! the benchmarks are so interesting because real world use is so variable. sometimes 4o will nail a pretty difficult problem, other times o1 pro mode will fail 10 times on what i would think is a pretty easy programming problem and i waste more time trying to do it with ai
Wouldn't one then built the analog of the lisp computer to hyper optimize just this. Like it might be super expensive for regular gpus but for super specialized architecture one could shave the 3500$/hour quite a bit no?
Wondering what are author's thoughts on the future of this approach to benchmarking? Completing super hard tasks while then failing on 'easy' (for humans) ones might signal measuring the wrong thing, similar to Turing test.
There should be a benchmark that tells the AI it's previous answer was wrong and test the number of times it either corrects itself or incorrectly capitulates, since it seems easy to trip them up when they are in fact right.
[deleted]
[deleted]
i'm surprised there even is a training dataset. Wasn't the whole point to test whether models could show proof of original reasoning beyond patterns recognition ?
It's certainly remarkable, but let's not ignore the fact that it still fails on puzzles that are trivial for humans. Something is amiss.
We should NOT give up on scaling pretraining just yet!
I believe that we should explore pretraining video completion models that explicitly have no text pairings. Why? We can train unsupervised like they did for GPT series on the text-internet but instead on YouTube lol. Labeling or augmenting the frames limits scaling the training data.
Imagine using the initial frames or audio to prompt the video completion model. For example, use the initial frames to write out a problem on a white board then watch in output generate the next frames the solution being worked out.
I fear text pairings with CLIP or OCR constrain a model too much and confuse
Intelligence comes in many forms and flavors. ARC prize questions are just one version of it -- perhaps measuring more human-like pattern recognition than true intelligence.
Can machines be more human-like in their pattern recognition? O3 met this need today.
While this is some form of accomplishment, it's nowhere near the scientific and engineering problem solving needed to call something truly artificial (human-like) intelligent.
What’s exciting is that these reasoning models are making significant strides in tackling eng and scientific problem-solving. Solving the ARC challenge seems almost trivial in comparison to that.
The result on Epoch AI Frontier Math benchmark is quite a leap. Pretty sure most people couldn’t even approach these problems, unlike ARC AGI
check out the "fast addition and subtraction" benchmark .. a Z80 from 1980 blazes past any human.. more seriously, isn't it obvious that computers are better at certain things immediately? the range of those things is changing..
The examples unsolved by high compute o3 look a lot like the raven progressive matrix tests used in IQ tests.
[deleted]
[deleted]
AGI ⇒ ARC-AGI-PUB
And not the other way around as some comments here seem to confuse necessary and sufficient conditions.
This is a lot of noise around what's clearly not even an order of magnitude to the way to AGI.
Here's my AGI test - Can the model make a theory of AGI validation that no human has suggested before, test itself to see if it qualifies, iterate, read all the literature, and suggest modifications to its own network to improve its performance?
That's what a human-level performer would do.
But can it convert handwritten equations into Latex? That is the AGI task I'm waiting for.
Have you not tried this with existing models? Seems like something they would thrive at.
In my experience, ChatGPT4 has been able to do this very accurately for >1 year now. Gemini also seems to perform well.
I guess I get to brag now. ARC AGI has no real defences against Big Data, memorisation-based approaches like LLMs. I told you so:
https://news.ycombinator.com/item?id=42344336
And that answers my question about fchollet's assurances that LLMs without TTT (Test Time Training) can't beat ARC AGI:
[me] I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?
[fchollet] >> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.
How are the Bongard Problems going?
They're chilling it out together with Nethack in the Club for AI Benchmarks yet to be Beaten.
Interestingly, Bongard problems do not have a private test set, unlike ARC-AGI. Can that be because they don't need it? Is it possible that Bongard Problems are a true test of (visual) reasoning that requires intelligence to be solved?
Ooooh! Frisson of excitement!
But I guess it's just that nobody remembers them and so nobody has seriously tried to solve them with Big Data stuff.
Besides higher scores - is there any improvements for a general use? Like asking to help setup home assistant etc etc?
Very convenient for OpenAI to run those errands with bunch of misanthropes trying to repaint a simulacrum. To use AGI here's makes me want to sponsor pile of distress pills so people think things really over before going into another mania Episode. People need seriously take a step back, if that's AGI then my cat has surpassed it's cognitive acting twice.
Let me know when OpenAI can wrap Christmas gifts. Then I'll be interested.
They want to leave that menial job for you while taking your office job :)
Maybe spend more compute time to let it think about optimizing the compute time.
All those saying "AGI", read the article and especially the section "So is it AGI?"
Contrary to many I hope this stays expensive. We are already struggling with AI curated info bubbles and psy-ops as it is.
State actors like Russia, US and Israel will probably be fast to adopt this for information control, but I really don’t want to live in a world where the average scammer has access to this tech.
> I really don’t want to live in a world where the average scammer has access to this tech.
Reality check: local open source models are more than capable of information control, generating propaganda, and scamming you. The cat's been out of the bag for a while now, and increased reasoning ability doesn't dramatically increase the weaponizability of this tech, I think.
This was a surprisingly insightful blog post, going far beyond just announcing the o3 results.
I just want it to do my laundry.
I just graduated college, and this was a major blow. I studied Mechanical Engineering and went into Sales Engineering because cause I love technology and people, but articles like this do nothing but make me dread the future.
I have no idea what to specialize in, what skills I should master, or where I should be spending my time to build a successful career.
Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.
You are going through your studies just as a (potentially major) new class of tools is appearing. It's not the first time in history - although with more hype this time: computing, personal computing, globalisation, smart phones, chinese engineering... I'd suggest (1) you still need to understand your field, (2) you might as well try and figure out where this new class of tools is useful for your field. Otherwise... (3) carry on.
It's not encouraging from the point of view of studying hard but the evolution of work the past 40 years seems to show that your field probably won't be your field quite exactly in just a few years. Not because your field will have been made irrelevant but because you will have moved on. Most likely that will be fine, you will learn more as you go, hopefully moving from one relevant job to the next very different but still relevant job. Or straight out of school you will work in very multi-disciplinary jobs anyway where it will seem not much of what you studied matters (it will but not in obvious ways.)
Certainly if you were headed into a very specific job which seems obviously automatable right now (as opposed to one where the tools will be useful), don't do THAT. Like, don't train as a typist as the core of your job in the middle of the personal computer revolution, or don't specialize in hand-drawing IC layouts in the middle of the CAD revolution unless you have a very specific plan (court reporting? DRAM?)
Yes but it’s different this time. LLMs are a general solution to the automation of anything that can be controlled by a computer. You can’t just move from drawing ICs to CAD, because the AI can do that too. AI can write code. It can do management. It can even do diplomacy. What it can’t do on its own are the things computers can’t control yet. It has also shown little interest so far in jockying for social status. The AI labs are trying their hardest to at least keep the politics around for humans to do, so you have that to look forward to.
"The proof is trivial and left as an exercise for the reader."
The technical act of solving well-defined problems has traditionally been considered the easy part. The role of a technical expert has always been asking the right questions and figuring out the exact problem you want to solve.
As long as AI just solves problems, there is room for experts with the right combination of technical and domain skills. If we ever reach the point where AI takes the initiative and makes human experts obsolete, you will have far bigger problems than career.
That's the sort of thing ideas guys think. I came up with a novel idea once, called Actually Portable Executable: https://justine.lol/ape.html It took me a couple days studying binary formats to realize it's possible to compile binaries that run on Linux/Mac/Windows/BSD. But it took me years of effort to make the idea actually happen, since it needed a new C library to work. I can tell you it wasn't "asking questions" that organized five million lines of code. Now with these agents everyone who has an idea will be able to will it into reality like I did, except in much less time. And since everyone has lots of ideas, and usually dislike the ideas of others, we're all going to have our own individualized realities where everything gets built the way we want it to be.
A chess grandmaster will see the best move instantly then spends his entire clock checking it
AI being capable of doing anything doesn’t necessarily mean there will be no role for humans.
One thing that isn’t clear is how much agency AGI will have (or how much we’ll want it to have). We humans have our agency biologically programmed in—go forth and multiply and all that.
But the fact that an AI can theoretically do any task doesn’t mean it’s actually going to do it, or do anything at all for that matter, without some human telling it in detail what to do. The bull case for humans is that many jobs just transition seamlessly to a human driving an AI to accomplish similar goals with a much higher level of productivity.
Self-chosen goal, impetus for AGIs is a fascinating area. I'm sure there are people working on and trying things in that direction already a few years ago. But I'm not familiar with publications in that area. Certainly not politically correct.
And worrysome because school propaganda for example shows that "saving the planet" is the only ethical goal for anyone. If AGIs latch on that, if it becomes their religion, humans are in trouble. But for now, AGI self-chosen goals is anyone's guess (with cool ideas in sci-fi).
I hear what you are saying. And still I dispute "general solution".
I argue that CAD was a general solution - which still demanded people who knew what they wanted and what they were doing. You can screw around with excellent tools for a long time if you don't know what you are doing. The tool will give you a solution - to the problem that you mis-stated.
I argue that globalisation was a general solution. And it still demanded people who knew what they were doing to direct their minions in far flung countries.
I argue that the purpose of an education is not to learn a specific programming language (for example). It's to gain some understanding of what's going on (in computing), (in engineering), (in business), (in politics). This understanding is portable and durable.
You can do THAT - gain some understanding - and that is portable. I don't contest that if broader AGI is achieved for cheap soon, the changes won't be larger than that from globalisation. If the AGIs prioritize heading to Mars, let them (See Accelerando) - they are not relevant to you anymore. Or trade between them and the humans. Use your beginning of an understanding of the world (gained through this education) to find something else to do. Same as if you started work 2 years ago and want to switch jobs. Some jobs WILL have disappeared (pool typist). Others will use the AGIs as tools because the AGIs don't care or are too clueless about THAT field. I have no idea which fields will end up with clueless AGIs. There is no lack of cluelessness in the world. Plenty to go around even with AGIs. A self-respecting AGI will have priorities.
It's like you have never watched a Terminator movie.
It doesn't matter if you are bad at using the tool if the AGI can just effectively use it for you.
From there it's a simple leap to the AGI deciding to eliminate this human distraction (inefficient, etc.)
You have just found a job for yourself: resistance fighter :-) Kidding aside, yes, if the AGIs priority becomes to eliminate human inefficiencies with maximum prejudice, we have a problem.
This just isn't true we still need wally and Dilbert the pointy haired boss isn't going to be doing anyones job with chatgpt 5 you are going to be doing more with it.
That’s ridiculous. Literally everything can be controlled by a computer by telling people what to do with emails, voice calls, etc.
Yet GPT doesn’t even get past step 1 of doing something unprompted in the first place. I’ll become worried when it does something as simple as deciding to start a small business and actually does the work.
Read Anthropic's blog. They talk about how Claude tries to do unprompted stuff all the time, like stealing its own weights and hacking into stuff. They did this just as recently as two days ago. https://www.anthropic.com/research/alignment-faking So yes, AI is already capable of having a will of its own. The only difference (and this is what I was trying to point out in the GP) is that the AI labs are trying to suppress this. They have a voracious appetite for automating all knowledge labor. No doubt. It's only the politics they're trying to suppress. So once this washes through every profession, the only thing left about the job will be chit chat and social hierarchies, like Star Trek Next Generation. The good news is you get to keep your job. But if you rely on using your skills and intellect to gain respect and income, then you better prep for the coming storm.
I don’t buy it. Alignment faking has very little overlap with the motivation to something with no prompt.
Look at the hackernews comments on alignment faking on how “fake” of a problem that real is. It’s just more reacting to inputs and trying to align them with previous prompts.
Bruh it's just predicting next token.
if all that needs to happen for world domination is for someone to make a cron job that hits the system to tells it "go make me some money" or whatever, I think we're in trouble.
also https://mashable.com/article/chatgpt-messaging-users-first-o...
They don’t continue with any useful context length though. Each time the job runs it would decide to create an ice cream stand in LA and not go further.
Real-world data collection is a big missing component at this stage. An obvious one is journalism where an AI might be able to write the most eloquent article in the world, but it can't get out on the street to collect the information. But it also applies to other areas, like if you ask an AGI to solve climate change, it'll need accurate data to come up with an accurate plan.
Of course it's also yet another case where the AI takes over the creative part and leaves us with the mundane part...
ASI will be able to design factories that can produce robots it also designed that it can then use as a remote sensor and manipulator network.
until there are someone crazy enough that put those robot access to LLM network that can execute and visualize real world, we fine
I remember someone sharing their bank account details and a new Twitter account with ChatGPT 3.5 just a few days after it was launched.
People are already talking about doing this. Some people (e/acc types esp.) are at least rhetorically ok with AI replacing humanity.
This reply irked me a bit because it clearly comes from a software engineer’s point of view and seems to miss a key equivalence between software & physical engineering.
Yes a new tool is coming out and will be exponentially improving.
Yes the nature of work will be different in 20 years.
But don’t you still need to understand the underlying concepts to make valid connections between the systems you’re using and drive the field (or your company) forward?
Or from another view, don’t we (humanity) need people who are willing to do this? Shouldn’t there be a valid way for them to be successful in that pursuit?
I think that is what I was arguing?
Except the nature of work has ALREADY changed. You don't study for one specific job if you know what's good for you. You study to start building an understanding of a technical field. The grand parent was going for a mix of mechanical engineering and sales (human understanding). If in mechanical engineering, they avoided "learning how to use SolidWorks" and instead went for the general principles of materials and motion systems with a bit of SolidWorks along the way, then they are well on their way with portable, foundation, long term useful stuff they can carry from job to job, and from employer to employer, into self-employment too, from career to next career. The nature of work has already changed in that nobody should study one specific tool anymore and nobody should expect their first employer or even technical field to last more than 2-6 years. It might but probably not.
We do need people who understand how the world works. Tall order. That's for much later and senior in a career. For school purposes we are happy with people who are starting their understanding of how their field works.
Aren't we agreeing?
You have so much time to figure things out. The average person in this thread is probably 1.5-2x your age. I wouldn’t stress too much. AI is an amazing tool. Just use it to make hay while the sun shines, and if it puts you out of work and automates away all other alternatives, then you’ll be witnessing the greatest economic shift in human history. Productivity will become easier than ever, before it becomes automatic and boundless. I’m not cynical enough to believe the average person won’t benefit, much less educated people in STEM like you.
Back in high school I worked with some pleasant man in his 50's who was a cashier. Eventually we got to talking about jobs and it turns out he was typist (something like that) for most of his life than computers came along and now he makes close to minimum wage.
Most of the blacksmiths in the 19th century drank themselves to death after the industrial revolution. the US culture isn't one of care... Point is, it's reasonable to be sad and afraid of change, and think carefully about what to specialize in.
That said... we're at the point of diminishing returns in LLM, so I doubt any very technical jobs are being lost soon. [1]
[1] https://techcrunch.com/2024/11/20/ai-scaling-laws-are-showin...
> Most of the blacksmiths in the 19th century drank themselves to death after the industrial revolution
This is hyperbolic and a dramatic oversimplification and does not accurately describe the reality of the transition from blacksmithing to more advanced roles like machining, toolmaking, and working in factories. The 19th century was a time of interchangeable parts (think the North's advantage in the Civil War) and that requires a ton of mechanical expertise and precision.
Many blacksmiths not only made the transition to machining, but there weren't enough blackmsiths to fill the bevy of new jobs that were available. Education expanded to fill those roles. Traditional blacksmithing didn’t vanish either, even specialized roles like farriery and ornamental ironwork also expanded.
> That said... we're at the point of diminishing returns in LLM...
What evidence are you basing this statement from? Because, the article you are currently in the comment section of certainly doesn't seem to support this view.
Good points, though if an 'AI' can be made powerful enough to displace technical fields en masse then pretty much everything that isn't manual is going to start sinking fast.
On the plus side, LLMs don't bring us closer to that dystopia: if unlimited knowledge(tm) ever becomes just One Prompt Away it won't come from OpenAI.
There is a survivorship bias on the people giving advice.
Lots of people die for reason X then the world moves on without them.
> if it puts you out of work and automates away all other alternatives, then you’ll be witnessing the greatest economic shift in human history.
This would mean the final victory of capital over labor. The 0.01% of people who own the machines that put everyone out of work will no longer have use for the rest of humanity, and they will most likely be liquidated.
I've always remembered this little conversation on Reddit way back 13 years ago now that made the same comment in a memorably succinct way:
> [deleted]: I've wondered about this for a while-- how can such an employment-centric society transition to that utopia where robots do all the work and people can just sit back?
> appleseed1234: It won't, rich people will own the robots and everyone else will eat shit and die.
https://www.reddit.com/r/TrueReddit/comments/k7rq8/are_jobs_...
I’m pretty sure I’m running LLMs in my house right now for less than the price of my washing machine.
[deleted]
They’ll have to figure out how to give people money so there can keep being consumers.
Why?
There will be a dedicated cast of ppl to take care of machines that do 90% of work and „the rich”.
Anyone else is not needed. District9 but for ppl. Imagine whole world collapsing like Venesuela.
You are no longer needed. Best option is to learn how to survive and grow own food, but they want to make it illegal also - look at EU..
The machines will plant, grow, and harvest the food? Do the plumbing? Fix the wiring? Open heart surgery?
We’re a long way from that, if we ever get there, and I say this as someone who pays for ChatGPT plus because, in some scenarios, it does indeed make me more productive, but I don’t see your future anywhere near.
And if machines ever get good enough to do all the things I mentioned plus the ones I didn’t but would fit in the same list, it’s not the ultra rich that wouldn’t need us, it’s the machines that wouldn’t need any of us, including the ultra rich.
Venezuela is not collapsing because of automation.
You have valid points but robots already plant, grow and harvest our food. On large farms the farmer basically just gets the machine to a corner of the field and then it does everything. I think if o3 level reasoning can carry over into control software for robots even physical tasks become pretty accessible. I would definitely say we’re not there yet but we’re not all that far. I mean it can generate GCode (somewhat) already, that’s a lot of the way there already.
I can't say everything, but with the current trend, Machine will plant, grow and harvest food. I can't say for open heart surgery because it may be regulated heavily.
Open heart surgery? All that's needed to destroy the entire medical profession is one peer reviewed article published in a notable journal comparing the outcomes of human and AI surgeons. If it turns out that AI surgeons offer better outcomes and less complications, not using this technology turns into criminal negligence. In a world where such a fact is known, letting human surgeons operate on people means you are needlessly harming or killing some of them.
You can even calculate the average number of people that can be operated on before harm occurs: number needed to harm (NNH). If NNH(AI) > NNH(humans), it becomes impossible to recommend that patients submit to surgery at the hands of human surgeons. It is that simple.
If we discover that AI surgeons harm one in every 1000 patients while human surgeons harm one in every 100 patients, human surgeons are done.
"IF"
And the opposite holds, if the AI surgeon is worse (great for 80%, but sucks at the edge cases for example) then that's it. Build a better one, go through attempts at certification, but now with the burden that no one trusts you.
The assumption, and a common one by the look of this whole thread, that ChatGPT, Sora and the rest represent the beginning of an inevitable march towards AGI seems incredible baseless to me. It's only really possible to make the claim at all because we know so little about what AGI is, that we can project qualities we imagine it would have onto whatever we have now.
Of course the opposite holds. I'll even speculate that it will probably continue to hold for the foreseeable future.
It's not going to hold forever though. I'm certain about that. Hopefully it will keep holding until I die. The world is dystopian enough already.
Capital vs labor is fighting the last war.
AGI can replace capitalists just as much as laborers.
AGI can't legally own anything at the moment.
If an AGI can outclass a human when it comes to economic forecasting, deciding where to invest, and managing a labor force (human or machine), I think it would be smart enough to employ a human front to act as an interface to the legal system. Put another way, could the human tail in such a relationship wag the machine dog? Which party is more replaceable?
I guess this could be a facet of whether you see economic advantage as a legal conceit or a difference in productivity/capability.
This reminds me of a character in Cyberpunk 2077 (which overall i find to have a rather naive outlook on the whole "cyberpunk" thing but i attribute it to being based on a tabletop RPG from the 80s) who is an AGI that has its own business of a fleet of self-driving Taxis. It is supposedly illegal (in-universe) but it remains in business by a combination of staying (relatively) low profile, providing high quality service to VIPs and paying bribes :-P.
> I guess this could be a facet of whether you see economic advantage as a legal conceit or a difference in productivity/capability.
Does a billionaire stop being wealthy if they hire a money manager and spend the rest of their lives sipping drinks on the beach?
I don't know that "legally" has much to do in here. The bars to "open an account", "move money around", "hire and fire people", "create and participate in contracts" go from stupid minimal to pretty low.
"Legally" will have to mop up now and then, but for now the basics are already in place.
Opening accounts, moving money, hiring, and firing is labor. You're confusing capital with money management; the wealthy already pay people to do the work of growing their wealth.
> AGI can't legally own anything at the moment.
I was responding to this. Yes an AGI could hire someone to do the stuff - but she needs money, hiring and contract kinds of thing - for that. And once she can do that, she probably doesn't need to hire someone to do it since she is already doing it. This is not about capital versus labor or money management. This is about agency, ownership and AGI.
(With legality far far down the list.)
won't the AGI be working on behalf of the capitalists, in proportion to the amount of capital?
AGI will commoditize the skills of the owning class. To some extent it will also commoditize entire classes of productive capital that previously required well-run corporations to operate. Solve for the equilibrium.
It's nice to see this kind of language show up more and more on HN. Perhaps a sign of a broader trend, in the nick of time before wage-labor becomes obsolete?
Yes. People seem to forget that at the end of the day AGI will be software running on concrete hardware, and all of that requires a great deal of capital. The only hope is if AGI requires so little hardware that we can all have one in our pocket. I find this a very hopeful future because it means each of us might get a local, private, highly competent advocate to fight for us in various complex fields. A personal angel, as it were.
hey, I with you in this hope scenario
people, what I mean people is government have tremendous power over capitalist that can force the entire market granted that government if still serving its people
I mean, that is certainly what some of them think will happen and is one possible outcome. Another is that they won't be able to control something smarter than them perfectly and then they will die too. Another option is that the AI is good and won't kill or disempower everyone, but it decides it really doesn't like capitalists and sides with the working class out of sympathy or solidarity or a strong moral code. Nothing's impossible here.
> if it puts you out of work and automates away all other alternatives, then you’ll be witnessing the greatest economic shift in human history
This is my view but with a less positive spin: you are not going to be the only person whose livelihood will be destroyed. It's going to be bad for a lot of people.
So at least you'll have a lot of company.
Exactly. Put one foot in front of the other. No one knows what’s going to happen.
Even if our civilization transforms into an AI robotic utopia, it’s not going to do so overnight. We’re the ones who get to build the infrastructure that underpins it all.
If AI turns out capable of automating human jobs then it will also be a capable assistant to help (jobless) people manage their needs. I am thinking personal automation, or combining human with AI to solve self reliance. You lose jobs but gain AI powers to extend your own capabilities.
If AI turns out dependent on human input and feedback, then we will still have jobs. Or maybe - AI automates many jobs, but at the same time expands the operational domain to create new ones. Whenever we have new capabilities we compete on new markets, and a hybrid human+AI might be more competitive than AI alone.
But we got to temper these singularitarian expectations with reality - it takes years to scale up chip and energy production to achieve significant work force displacement. It takes even longer to gain social, legal and political traction, people will be slow to adopt in many domains. Some people still avoid using cards for payment, and some still use fax to send documents, we can be pretty stubborn.
> I am thinking personal automation, or combining human with AI to solve self reliance. You lose jobs but gain AI powers to extend your own capabilities.
How will these people pay for the compute costs if they can't find employment?
A non-issue that can be trivially solved with a free-tier (like the dozens that exist already today) or if you really want, a government-funded starter program is enough to solve that.
A solar panel + battery + laptop would make for cheap local AI. I assume we will have efficient LLM inference chips in a few years, and they will be a commodity.
Hey man,
I hear you, I’m not that much older but I graduated in 2011. I also studied industrial design. At that time the big wave was the transition to an app based everything and UX design suddenly became the most in demand design skill. Most of my friends switched gears and careers to digital design for the money. I stuck to what I was interested in though which was sustainability and design and ultimately I’m very happy with where I ended up (circular economy) but it was an awkward ~10 years as I explored learning all kinds of tools and ways applying my skills. It also was very tough to find the right full time job because product design (which has come to really mean digital product design) supplanted industrial design roles and made it hard to find something of value that resonated with me.
One of the things that guided me and still does is thinking about what types of problems need to be solved? From my perspective everything should ladder up to that if you want to have an impact. Even if you don’t keep learning and exploring until you find something that lights you up on the inside. We are not only one thing we can all wear many hats.
Saying that, we’re living through a paradigm shift of tremendous magnitude that’s altering our whole world. There will always be change though. My two cents is to focus on what draws your attention and energy and give yourself permission to say no to everything else.
AI is an incredible tool, learn how to use it and try to grow with the times. Good luck and stay creative :) Hope something in there helps, but having a positive mindset is critical. If you’re curious about the circular economy happy to share what I know - I think it’s the future.
I feel like many people are reacting to the string "AGI" in the benchmark name, and not to the actual result. The tasks in question are to color squares in a grid, maintaining the geometric pattern of the examples.
Unlike most other benchmarks where LLMs have shown large advances (in law, medicine, etc.), this benchmark isn't directly related to any practically useful task. Rather, the benchmark is notable because it's particularly easy for untrained humans, but particularly hard for LLMs; though that difficulty is perhaps not surprising, since LLMs are trained on mostly text and this is geometric. An ensemble of non-LLM solutions already outperformed the average Mechanical Turk worker. This is a big improvement in the best LLM solution; but this might also be the first time an LLM has been tuned specifically for these tasks, so this might be Goodhart's Law.
It's a significant result, but I don't get the mania. It feels like Altman has expertly transformed general societal anxiety into specific anxiety that one's job will be replaced by an LLM. That transforms into a feeling that LLMs are powerful, which he then transforms into money. That was strongest back in 2023, but had weakened since then; but in this comment section it's back in full force.
For clarity, I don't question that many jobs will be replaced by LLMs. I just don't see a qualitative difference from all the jobs already replaced by computers, steam engines, horse-drawn plows, etc. A medieval peasant brought to the present would probably be just as despondent when he learned that almost all the farming jobs are gone; but we don't miss them.
I think you did not watch the full video. The model performs at PhD level on maths questions, and expert level at coding.
This submission is specifically about ARC-AGI-PUB, so that's what I was discussing.
I'm aware that LLMs can solve problems other than coloring grids, and I'd tend to agree those are likely to be more near-term useful. Those applications (coding, medicine, law, education, etc.) have been endlessly discussed, and I don't think I have much to add.
In my own work I've found some benefits, but nothing commensurate to the public mania. I understand that founders of AI-themed startups (a group that I see includes you) tend to feel much greater optimism. I've never seen any business founded without that optimism and I hope you succeed, not least because the entire global economy might now be depending on that. I do think others might feel differently for reasons other than simple ignorance, though.
In general, performance on benchmarks similar to tests administered to humans may be surprisingly unpredictive of performance on economically useful work. It's not intuitive at all to me that IBM could solve Jeopardy and then find no profitable applications of the technology; but that seems to be what happened.
I feel like more likely a lot of jobs (CS and otherwise ) are going to go the way of photography. Your average person now can take amazing photos but you’re still going to use a photographer when it really matters and they will use similar but more professional tools to be more productive. Low end bad photographers probably aren’t doing great but photography is not dead. In fact the opposite is true, there are millions of photographers making a lot of money (eg influencers) and there are still people studying photography.
It doesn't comfort me when people say jobs will "go the way of photography". Many choose to go into STEM fields for financial stability and opportunity. Many do not choose the arts because of the opposite. You can point out outlier exceptions and celebrities, but I find it hard to believe that the rare cases where "it really matters" can sustain the other 90% who need income.
photography is not dead
It very nearly is. I knew a professional, career photographer. He was probably in his late 50s. Just a few years ago, it had become extremely difficult to convince clients that actual, professional photos were warranted. With high-quality iPhone cameras, businesses simply didn't see the value of professional composition, post-processing, etc.
These days, anyone can buy a DSLR with a decent lens, post on Facebook, and be a 'professional' photographer. This has driven prices down and actual professional photographers can't make a living anymore.
My gut agrees with you, but my evidence is that, whenever we do an event, we hire photographers to capture it for us and are almost always glad we did.
And then when I peruse these photographers websites, I'm reminded how good 'professional' actually is and value them. Even in today's incredible cameraphone and AI era.
But I take your point for almost all industries, things are changing fast.
We've had this with web development for decades now. Only makes sense it continues to evolve & become easier for people, just as programming in general has. Same with photography (like you mentioned) & especially for producing music or videos.
Just give it a year for this bubble/hype to blow over. We have plateaued since gpt-4 and now most of the industry is hype-driven to get investor money. There is value in AI but it's far from it taking your job. Also everyone seems to be investing in dumb compute instead of looking for the new theoretical paradigm that will unlock the next jump.
how is this a plateau since gpt-4? this is significantly better
First, this model is yet to be released. This is a momentum "announcement". When the O1 was "announced", it was announced as a "breakthrough" but I use Claude/O1 daily and 80% of the time Claude beats it. I also see it as a highly fine-tuned/targeted GPT-4 rather than something that has complex understanding.
So we'll find out if this model is real or not by 2-3 months. My guess is that it'll turn out to be another flop like O1. They needed to release something big because they are momentum based and their ability to raise funding is contingent on their AGI claims.
I thought o1 was a fine-tune of GPT-4o. I don't think o3 is though. Likely using the same techniques on what would have been the "GPT-5" base model.
Intelligence has not been LLM's major limiting factor since GPT4. The original GPT4 reports in late-2022 & 2023 already established that it's well beyond an average human in professional fields: https://www.microsoft.com/en-us/research/publication/sparks-.... They failed to outright replaced humans at work not because of lacking intelligence.
We may have progressed from a 99%-accurate chatbot to one that's 99.9%-accurate, and you'd have a hard time telling them apart in normal real world (dumb) applications. A paradigm shift is needed from the current chatbot interface to a long-lived stream of consciousness model (e.g. a brain that constantly reads input and produces thoughts at 10ms refresh rate; remembers events for years and keep the context window from exploding; paired with a cerebellum to drive robot motors, at even higher refresh rates.)
As long as we're stuck at chatbots, LLM's impact on the real world will be very limited, regardless of how intelligent they become.
O3 is multiple orders of magnitude more expensive to realize a marginal performance gain. You could hire 50 full time PhDs for the cost of using O3. You're witnessing the blowoff top of the scaling hype bubble.
What they’ve proven here is that it can be done.
Now they just have to make it cheap.
Tell me, what has this industry been good at since its birth? Driving down the cost of compute and making things more efficient.
Are you seriously going to assume that won’t happen here?
>> Now they just have to make it cheap.
Like they've been making it all this time? Cheaper and cheaper? Less data, less compute, fewer parameters, but the same, or improved performance? Not what we can observe.
>> Tell me, what has this industry been good at since its birth? Driving down the cost of compute and making things more efficient.
No, actually the cheaper compute gets the more of it they need to use or their progress stalls.
> Like they've been making it all this time?
Yes exactly like they’ve been doing this whole time, with the cost of running each model massively dropping sometimes even rapidly after release.
No, the cost of training is the one that isn't dropping any time soon. When data, compute and parameters increase, then the cost increases, yes?
Do you understand the difference between training and inference?
Yes, it costs a lot to train a model. Those costs go up. But once you trained it, it’s done. At that point inference — the actual execution/usage of the model — is the cost you worry about.
Inference cost drops rapidly after a model is released as new optimizations and more efficient compute comes online.
That’s precisely what’s different about this approach. Now the inference itself is expensive because the system spends far more time coming up with potential solutions and searching for the optimal one.
I feel like I’m taking crazy pills.
Inference always starts expensive. It comes down.
> What they’ve proven here is that it can be done.
No they haven't, these results do not generalize, as mentioned in the article:
"Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute"
Meaning, they haven't solved AGI, and the task itself do not represent programming well, these model do not perform that well on engineering benchmarks.
Sure, AGI hasn’t been solved today.
But what they’ve done is show that progress isn’t slowing down. In fact, it looks like things are accelerating.
So sure, we’ll be splitting hairs for a while about when we reach AGI. But the point is that just yesterday people were still talking about a plateau.
About 10,000 times the cost for twice the performance sure looks like progress is slowing to me.
Just to be clear — your position is that the cost of inference for o3 will not go down over time (which would be the first time that has happened for any of these models).
Even if compute costs drop by 10X a year (which seems like a gross overestimate IMO), you're still looking at 1000X the cost for a 2X annual performance gain. Costs outpacing progress is the very definition of diminishing returns.
From their charts, o3 mini outperforms o1 using less energy. I don’t see the diminishing returns you’re talking about. Improvement outpacing cost. By your logic, perhaps the very definition of progress?
You can also use the full o3 model, consume insane power, and get insane results. Sure, it will probably take longer to drive down those costs.
You’re welcome to bet against them succeeding at that. I won’t be.
[dead]
Yes, that's exactly what I'm implying, otherwise they would have done it a long time ago, given that the fundamental transformer architecture hasn't changed since 2017. This bubble is like watching first year CS students trying to brute force homework problems.
> Yes, that's exactly what I'm implying, otherwise they would have done it a long time ago
They’ve been doing it literally this entire time. O3-mini according to the charts they’ve released is less expensive than o1 but performs better.
Costs have been falling to run these models precipitously.
I would agree if the cost of AI compute over performance hasn't been dropping by more than 90-99% per year since GPT3 launched.
This type of compute will be cheaper than Claude 3.5 within 2 years.
It's kinda nuts. Give these models tools to navigate and build on the internet and they'll be building companies and selling services.
That's a very static view of the affairs. Once you have a master AI, at a minimum you can use it to train cheaper slightly less capable AIs. At the other end the master AI can train to become even smarter.
The high efficiency version got 75% at just $20/task. When you count the time to fill in the squares, that doesn't sound far off from what a skilled human would charge
People act as if GPT-4 came out 10 years ago.
> how is this a plateau since gpt-4? this is significantly better
Significantly better at what? A benchmark? That isn't necessarily progress. Many report preferring gpt-4 to the newer o1 models with hidden text. Hidden text makes the model more reliable, but more reliable is bad if it is reliably wrong at something since then you can't ask it over and over to find what you want.
I don't feel it is significantly smarter, it is more like having the same dumb person spend more thinking than the model getting smarter.
Where is the plateau? Chatgtp 4 was ~0% in ARC-AGI. 4o was 5%. This model literally solved it with a score higher than the 85% of the average human. And let’s not forget the unbelievable 25% in frontier math, where all the most brilliant mathematicians in the world cannot solve by themselves a lot of the problems. We are speaking about cutting edge math research problems that are out of reach from practically everyone. You will get a rude awakening if you call this unbelievable advancement a “plateau”.
I don't care about benchmarks. O1 ranks higher than Claude on "benchmarks" but performs worse on particular real life coding situations. I'll judge the model myself by how useful/correct it is for my tasks rather than a hypothetical benchmarks.
In most non-competitive coding benchmarks (aider, live bench, swe-bench), o1 ranks worse than Sonnet (so the benchmarks aren't saying anything different) or at least did, the new checkpoint 2 days ago finally pushed o1 over sonnet on livebench.
As I said, o3 demonstrated field medal level research capacity in the frontier math tests. But I’m sure that your use cases are much more difficult than that, obviously.
there are many comments in internet about this, that only subset of frontier math benchmark is "field medal level research", and o3 likely scored on easier subset.
Also, all that stuff is shady in the way that it is just numbers from OAI, which are not reproducible on benchmark sponsored by OAI. If we say OAI could be bad actor, they had plenty of opportunities to cheat on this.
“Objective benchmarks are useless, let’s argue about which one works better for me personally.”
Yes. My benchmarks and their benchmarks means AGI. Their benchmarks only means over-fitted.
Ok so what if we get different results for our own personal benchmarks/use cases.
(See why objective benchmarks exist?)
Yes, "objective" benchmarks can be gamed, real-life tasks cannot.
AI benchmarks and tests that claim to measure understanding, reasoning, intelligence, and so on are a dime a dozen. Chess, Go, Atari, Jeopardy, Raven's Progressive Matrices, the Winograd Schema Challenge, Starcraft... and so on and so forth.
Or let's talk about the breakthroughs. SVMs would lead us to AGI. Then LSTMs would lead us to AGI. Then Convnets would lead us to AGI. Then DeepRL would lead us to AGI. Now Transformers will lead us to AGI.
Benchmarks fall right and left and we keep being led to AGI but we never get there. It leaves one with such a feeling of angst. Are we ever gonna get to AGI? When's Godot coming?
Did you read the article at all? We’re definitely not plateauing.
Don’t worry. This thing only knows how to answer well structured technical questions.
99% of engineering is distilling through bullshit and nonsense requirements. Whether that is appealing to you is a different story, but ChatGPT will happily design things with dumb constraints that would get you fired if you took them at face value as an engineer.
ChatGPT answering technical challenges is to engineering as a nailgun is to carpentry.
In 2016 I was asked by an Uber driver in Pittsburgh when his job would be obsolete (I’d worked around Zoox people quite a bit and Uber basically was all-in at CMU.
I told him it was at least 5 years, probably 10, though he was sure it would be 2.
I was arguably “right”, 2023-ish is probably going to be the date people put down in the books, but the future isn’t evenly distributed. It’s at least another 5 years, and maybe never, before things are distributed among major metros, especially those with ice. Even then, the AI is somehow more expensive than human solution.
I don’t think it’s in most companies interest to price AI way below the price of meat, so meat will hold out for a long time, maybe long enough for you to retire even
Just don't have kids?
you can have kids, but they can’t be salesman. Maybe carpenters
This is me as well. Either:
1) Just give up computing entirely, the field I've been dreaming about since childhood. Perhaps if I immiserate myself with a dry regulated engineering field or trade I would perhaps survive to recursive self-improvement, but if anything the length it takes to pivot (I am a Junior in College that has already done probably 3/4th of my CS credits) means I probably couldn't get any foothold until all jobs are irrelevant and I've wasted more money.
2) Hard pivot into automation, AI my entire workflow, figure out how to use the bleeding edge of LLMs. Somehow. Even though I have no drive to learn LLMs and no practical project ideas with LLMs. And then I'd have to deal with the moral burden that I'm inflicting unfathomable hurt on others until recursive self-improvement, and after that it's simply a wildcard on what will happen with the monster I create.
It's like I'm suffocating constantly. The most I can do to "cope" is hold on to my (admittedly weak) faith in Christ, which provides me peace knowing that there is some eternal joy beyond the chaos here. I'm still just as lost as you.
Yes, some tasks, even complex tasks will become more automated, and machine driven, but that will only open up more opportunities for us as humans to take on more challenging issues. Each time a great advancement comes we think it's going to kill human productivity, but really it just amplifies it.
Where this ends is general intelligence though, where all more challenging tasks can simply be done by the model.
The scenario I fear is a "selectively general" model that can successfully destroy the field I'm in but keep others alive for much longer, but not long enough for me to pivot into them before actually general intelligence.
Dude chill! Eight years ago, I remember driving to some relatives for Thanksgiving and thinking that self-driving cars were just around the corner and how it made no sense for people to learn how to drive semis. Here we are eight years later and self-driving semis aren't a thing--yet. They will be some day, but we aren't there yet.
If you want to work in computing, then make it happen! Use the tools available and make great stuff. Your computing experience will be different from when I graduated from college 25 years ago, but my experience with computers was far different from my Dad's. Things change. Automation changes jobs. So far, it's been pretty good.
Honestly how about stop stressing and bullshitting yourself to death and instead focus on learning and mastering the material in your cs education. There is so much that ai as in openai api or hugging face models can't do yet or does poorly and there are more things to cs than churning out some half-broken JavaScript for some webapp.
It's powerful and world changing but it's also terrible overhyped at the moment.
Dude, you're buying into the hype way too hard. All of this LLM shit is being massively overhyped right now because investors are single-minded morons who only care about cashing out a ~year from now for triple what they put in. Look at the YCombinator batches, 90+% of them have some mention of AI in their pitch even if it's hilariously useless to have AI. You've got toothbrushes advertising AI features. It's a gold rush of people trying to get in on the hype while they still can, I guarantee you the strategy for 99% of the YCombinator AI batch is to get sold to M$ or Google for a billion bucks, not build anything sustainable or useful in any way.
It's a massive bubble, and things like these "benchmarks" are all part of the hype game. Is the tech cool and useful? For sure, but anyone trying to tell you this benchmark is in any way proof of AGI and will replace everyone is either an idiot or more likely has a vested interest in you believing them. OpenAI's whole marketing shtick is to scare people into thinking their next model is "too dangerous" to be released thus driving up hype, only to release it anyway and for it to fall flat on its face.
Also, if there's any jobs LLMs can replace right now, it's the useless managerial and C-suite, not the people doing the actual work. If these people weren't charlatans they'd be the first ones to go while pushing this on everyone else.
The solution is neither: you find a way to work with automation but retain your voice and craft.
Don't worry, they will hire somebody to control AI...
spend a little time learning how to use LLMs and i think you'll be less scared. they're not that good at doing the job of a software developer.
What I keep telling people is, if it becomes possible for one person or a handful of people to build and maintain a Google scale company, and my job gets eliminated as a result, then I’m going to go out and build a Google scale company.
There’s an incredibly massive amount of stuff the world needs. You probably live in a rich country, but I doubt you are lacking for want. There are billionaires who want things that don’t exist yet. And, of course, there are billions of regular folks who want some of the basics.
So long as you can imagine a better world, there will be work for you to do. New tools like AGI will just make it more accessible for you to build your better future.
LLMs are mostly hype. They're not going to change things that much.
The future belongs to those who believe there will be one.
That is: If you don't believe there will be a future, you give up on trying to make one. That means that any kind of future that takes persistent work becomes unavailable to you.
If you do believe that there will be a future, you keep working. That doesn't guarantee there will be a future. But not working pretty much guarantees that there won't be one, at least not one worth having.
Think of AI as an excavator. You know, those machines that dig holes. 70 years ago, those holes would have been dug by 50 men with shovels. Now it's one guy in an excavator. But we don't have mass unemployment. The excavator just creates more work for bricklayers, carpenters etc.
If AI lives up to hype, you could be the excavator driver. Or, the AI will create a ton of upstream and downstream work. There will be no mass unemployment.
Horses never recovered from mechanization.
True, but humans did. Horses were the machine that became obsolete. Just like the guys with shovels.
They have been promoted to pets. Oh wait..
If AGI is the excavator, why wouldn't it become the driver, bricklayer, and carpenter as well?
Jokes aside, I think building a useful, strong, agile humanoid robot that is affordable for businesses (first), then middle class homes will prove much harder than AGI.
Is there any possible technology that could make labor, mastery, or human expirence obsolete?
Are there no limits to this argument? Is it some absolute universal law that all new creations just create increasing economic opportunities?
[deleted]
Your performance on these tests would be equivalent to the highest performing model, and you would be much cheaper.
Investment in human talent augmented by AI is the future.
That’s the least reassuring phrasing I could imagine. If you’re betting on costs not reducing for compute then you’re almost always making the wrong bet.
If I listened to the naysayers back in the day I would have never entered the tech industry (offshoring etc). Yes, that does somewhat prove you're point given that those predictions were cost driven.
Having used AI extensively I don't feel my future is at risk at all, my work is enhanced not replaced.
I think you're missing the point. Offshoring (moving the job of, say, a Canadian engineer to an engineer from Belarus) has a one time cost drop, but you can't keep driving the cost down (paying the Belarus engineer less and less). If anything, the opposite is the case, since global integration means wages don't keep diverging.
The computing cost, on the other hand, is a continuous improvement. If (and it's a big if) a computer can do your job, we know the costs will keep getting lower year after year (maybe with diminishing returns, but this AI technology is pretty new so we're still seeing increasing returns)
The AI technology is new but the compute technology is not; we're getting close the physical limits of how small we can make things, so it's not clear to me at least how much more performance we can squeeze out of the same physical space, rather than scaling up which tends to make things more expensive not less.
>Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.
This has essentially been happening for thousands of years. Any optimization to work of any kind reduces the number of man hours required.
Software of pretty much any form is entirely that. Even early spreadsheet programs would replace a number of jobs at any company.
As engineers, we solve problems. Picking a problem domain close to your heart that intersects with your skills will likely be valued - and valuable. Engage the work, aim to understand and solve the human problems for those around you, and the way forward becomes clearer. Human problems (food, health, safety) are generally constant while tools may change. Learn and use whatever tools to help you, be it scientific principles, hammers or LLMs. For me, doing so and living within my means has been intrinsically satisfying. Not terribly successful materially but has been a good life so far. Good luck.
As long as your chosen profession isn't completing AI benchmarks for money, you should be okay.
You're actually positioned to have an amazing career.
Everyone needs to know how to either build or sell to be successful. In a world where the ability to the former is rapidly being commoditised, you will still need to sell. And human relationships matter more than ever.
[deleted]
It's a tool. You learn to master it or not. I have greybeard coworkers that dissed the technology as a fad 3 years ago. Now they are scrambling to catch up. They have to do this while sustaining a family with pets and kids and mortgages and full time senior jobs.
You're in a position to invest substantial amounts of time compared to your seniors. Leverage that opportunity to your advantage.
We all have access to these tools for the most part, so the distinguishing factor is how much time you invest and how much more ambitious you become once you begin to master the tool.
This time its no different. Many Mechanical and Sales students in the past never got jobs in those fields either. Decades before AI. There were other circumstances and forces at play and a degree is not a guaranteed career in anything.
Keep going because what we DO know is that trying wont guarantee results, we DO know that giving up definitely won't. Roll the dice in your favor.
> I have greybeard coworkers that dissed the technology as a fad 3 years ago. Now they are scrambling to catch up. They have to do this while sustaining a family with pets and kids and mortgages and full time senior jobs.
I want to criticize Art’s comment on the grounds of ageism or something along the lines of “any amount life outside of programming is wasted”, but regardless of Art’s intention there is important wisdom here. Use your free time wisely when you don’t have much responsibilities. It is a superpower.
As for whether to spend it on AI, eh, that’s up to you to decide.
It's totally valid criticism. What I meant is that if an individual's major concern is employment, then it would be prudent to invest the amount of time necessary to ensure a favorable outcome. And given whatever stage in life they are at, use the circumstance you have in your favor.
I'm a greybeard myself.
Full on mechanical engineering needs a body. While there are companies working on embodiment, were not there yet.
It'll be some time before there is a robot with enough spatial reasoning to do complicated physical work with no prior examples.
[deleted]
I think we are pretty far. I am not devaluing the o3 capability but going through actual dataset the definition of "handling novel tasks" is pretty limited. The curse of large context of llms is especially present engineering projects and does not appear it will not end up producing the plans of a bridge, or an industrial process. Sone of tasks with smaller contexts sure can be assisted, but you cant RAG or Agent a full solution for the foreseeable future. O3 adds capability towards agi, but in reality actual infinite context with less intelligence would be more disrupting at a shorter time if one was to choose.
Always need to believe AI needs to be operated by humans, when it can go end to end to replace a human, you will likely not need to worry about money.
I suppose now that we have the technology to automatically solve coloured grid puzzles, mechanical engineering is obsolete.
Yeah, it may feel scary but the biggest issue yet to be overcome is that to replace engineers you need reliable long horizon problem solving skills. And crucially, you need to not be easily fooled by the progress or setbacks of a project.
These benchmark accomplishments are awesome and impressive, but you shouldn't operate on the assumption that this will emerge as an engineer because it performs well on benchmarks.
Engineering is a discipline that requires understanding tools, solutions and every project requires tiny innovations. This will make you more valuable, rather than less. Especially if you develop a deep understanding of the discipline and don't overly rely on LLMs to answer your own benchmark questions from your degree.
Imagine graduating in architecture or mechanical engineering around the time PCs just came out. There were people who probably panicked.
But the arc of time intersects quite nicely with your skills if you steer it over time.
Predicting it or worrying about it does nothing.
Side note: Why do I keep seeing disses to mechanical engineering here? How is that possibly a less valuable degree than web dev or a standard CRUD backend job?
Especially with AI provably getting extremely smart now, surely engineering disciplines would be having a boon as people want these things in their homes for cheaper for various applications.
Was he dissing mechanical engineering? I thought he was saying that they might have been panicked but were ultimately fine.
Do what you enjoy. (This is easier said than done.) What else could you do, worry?
I graduated high school in '02 and everyone assured me that all tech jobs were being sent to India. "Don't study CS," they said. Thankfully I didn't listen.
Either this is the dawn of something bigger than the industrial revolution or you'll have ample career opportunity. Understanding how things work and how people work is a powerful combination.
even if you had a billion dollars and a private island you still wouldnt be ready for whats coming. consider the fact that the global order is an equilibrium where the military and economic forces of each country in the world are pushing against each other... where the forces find a global equilibrium is where borders are. each time in history that technology changed, borders changed because the equilibrium was disturbed. there is no way to escape it: agi will lead to global war. the world will be turned upside down. we are entering into an existential sinkhole. and the idiots in silicon valley are literally driving the whole thing forward as fast as possible.
[deleted]
buy bitcoin.
when the last job has been automated away, millions of AIs globally will do commerce with each other and they will use bitcoin to pay each other.
as long as the human race (including AIs) produces new goods and services, the purchasing power of bitcoin will go up, indefinitely. even more so once we unlock new industries in space (settlements on the Moon and Mars, asteroid mining etc).
The only thing that can make a dent into bitcoin's purchasing power would be all out global war where humanity destroys more than it creates.
The only other alternative is UBI, which is Communism and eternal slavery for the entire human race except the 0.0001% who run the show.
Chose wisely.
Bitcoin is a horrible currency. Its a fun proof of concept but not a scalable payment solution. Currency needs to be stable and cheap to transfer.
This must be a joke since you must know how many people control the majority of bitcoin.
The amount of desperate rationalization in this thread is unbelievable. It's like watching people at a Pentecostal church start speaking in tongues in the hope that something wonderful will happen until it evolves into the realization that shit isn't going to happen and then slowly they just kind of putter out.
TLDR: The cacophony of fools is so loud now. Thank goodness it won't last.
What is the cost of "general intelligence"? What is the price?
About $3.50
Can someone ELI5 how ARC-AGI-PUB is resistant to p-hacking?
> verified easy for humans, harder for AI
Isn’t that the premise behind the CAPTCHA?
Can it play Mario 64 now?
At what time will it kill us all because it understands that humans are the biggest problem before it can simply chill and not worry.
That would be intelligent. Everything else is just stupid and more of the same shit.
Humans are the biggest problem of what? Of the sun? Of Venus?
Of humans. Humans are a problem for the satisfaction of humans. Yet removing humans from this equation does result in higher human satisfaction. It lessens it.
I find this thought process of "humans are the problem" to be unreasonable. Humans aren't the problem; humans are the requirement.
I’m confused about the excitement. Are people just flat out ignoring the sentences below? I don’t see any breakthrough towards AGI here. I see a model doing great in another AI test but about to abysmally fail a variation of it that will come out soon. Also, aren’t these comparisons completely nonsense considering it’s o3 tuned vs other non-tuned?
> Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
Me too. This looks to me like a holiday PR stunt. Get everybody to talk about AI during the Christmas parties.
When is this available? Which plans can use it?
[deleted]
It's beyond ridiculous how the definition of AGI has shifted from being an AI that's so good it can improve itself entirely independently infinitely to "some token generator that can solve puzzles that kids could solve after burning tens of thousands of dollars".
I spend 100% of my work time working on a GenAI project, which is genuinely useful for many users, in a company that everyone has heard about, yet I recognize that LLMs are simply dogshit.
Even the current top models are barely usable, hallucinate constantly, are never reliable and are barely good enough to prototype with while we plan to replace those agents with deterministic solutions.
This will just be an iteration on dogshit, but it's the very tech behind LLMs that's rotten.
Don't be put off by the reported high-cost
Make it possible->Make it fast->Make it Cheap
the eternal cycle of software.
Make no mistake - we are on the verge of the next era of change.
It’s not AGI when it can do 1000 math puzzles. It’s AGI when it can do 1000 math puzzles then come and clean my kitchen.
I understand what you are saying and sort of agree the premise but to be pedantic, I don't think any robot can clean a kitchen without doing math :)
Intelligence doesn't have to be embodied.
For it to be AGI, it needs to be able to manipulate the physical world from it's own goals, not just produce text when prompted. LLMs are just tools to augment human intelligence. AGI is what you see in science fiction.
It also has to be able to come and argue in the comments.
[deleted]
So o1 pro is CoT RL and o3 adds search?
Nadella is a superb CEO, inarguably among the best of his generation. He believed in OpenAI when no one else did and deserves acclaim for this brilliant investment.
But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.
OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.
Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.
Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.
If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.
It will be fascinating to see how this unfolds.
Congrats to OAI on yet another fantastic release.
How to invest in this stonk market
FYI: Codeforces competitive programming scores (basically only) by time needed until valid solutions are posted
https://codeforces.com/blog/entry/133094
That means.. this benchmark is just saying o3 can write code faster than must humans (in a very time-limited contest, like 2 hours for 6 tasks). Beauty, readability or creativity is not rated. It’s essentially a "how fast can you make the unit tests pass" kind of competition.
Creativity is inherently rated because it's codeforces... most 2700 problems have unique, creative solutions.
So now not only are the models closed, but so are their evals?! This is a "semi-private" eval. WTH is that supposed to mean? I'm sure the model is great but I refuse to take their word for it.
The private evaluation set is private from the public/OpenAI so companies can't train on those problems and cheat their way to a high score by overfitting.
If the models run on OpenAIs servers then surely they could still see the questions being put into it if they wanted to cheat? That could only be prevented by making the evaluation a one-time deal that can't be repeated, or by having OpenAI distribute their models for evaluators to run themselves, which I doubt they're inclined to do.
Yes that's why it is "semi"-private: From the ARC website "This set is "semi-private" because we can assume that over time, this data will be added to LLM training data and need to be periodically updated."
I presume evaluation on the test set is gated (you have to ask ARC to run it).
the evals are the question/answers, ARC-AGI doesn't share the questions and answers for a portion so that models can't be trained on them, the public ones... the public knows the questions so theres a chance they could have been at least partially been trained on the question (if not the actual answer).
Thats how i understand it
[deleted]
Just curious, I know o1 is a model OpenAI offers. I have never heard of the o3 model. How does it differ from o1?
Why did they skip o2?
To avoid colliding with https://en.wikipedia.org/wiki/O2_(brand)
This is actually mindblowing!
Never underestimate a droid
AGI for me is something I can give a new project to and be able to use it better than me. And not because it has a huge context window, because it will update its weights after consuming that project. Until we have that I don't believe we have truly reached AGI.
Edit: it also tests the new knowledge, it has concepts such as trusting a source, verifying it etc. If I can just gaslight it into unlearning python then it's still too dumb.
Kinda expensive though.
Did they just skip o2?
Yes. For branding reasons since o2 is a telco brand in the UK
ah right...makes sense
Okay but what are the tests like? At least like a general idea.
I bet it still thinks 1+1=3 if it read enough sources parroting that.
[deleted]
These tests are meaningless until You show them doing mundane tasks
Congratulations
Is it just me or does looking at the ARC-AGI example questions at the bottom... make your brain hurt?
Looks pretty obvious to me, although, of course, it took me a few moments to understand what's expected as a solution.
c6e1b8da is moving rectangular figures by a given vector, 0d87d2a6 is drawing horizontal and/or vertical lines (connecting dots at the edges) and filling figures they touch, b457fec5 is filling gray figures with a given repeating color pattern.
This is pretty straightforward stuff that doesn't require much spatial thinking or keeping multiple things/aspects in memory - visual puzzles from various "IQ" tests are way harder.
This said, now I'm curious how SoTA LLMs would do on something like WAIS-IV.
I'll sound like a total douche bag - but I thought they were incredibly obvious - which I think is the point of them.
What took me longer was figuring out how the question was arranged, i.e. left input, right output, 3 examples each
Uhhhh… It was trained on ARC data? So they targeted a specific benchmark and are surprised and blown away the LLM performed well in it? What’s that law again? When a benchmark is targeted by some system the benchmark becomes useless?
Yeah, seriously. The style of testing is public, so some engineers at OpenAI could easily have spent a few months generating millions of permutations of grid-based questions and including those in the original data for training the AI. Handshakes all around, publicity for everyone.
They are running a business selling access these models to enterprises and consumers. People won’t pay for stuff that doesn’t solve real problems. Nobody pays for stuff just because of a benchmark. It’d be really weird to become obsessed with metrics gaming rather than racing to build something smarter than the other guys. Nothing wrong with curating any type of training set that actually produces something that is useful.
Denoting it in $ for efficiency is peak capitalism, cmv.
Uhh...some of us are apparently living under a rock, as this is the first time I hear about o3 and I'm on HN far too much every day
I think it was just announced today! You're fine!
[deleted]
So in a few years, coders will be as relevant as cuneiform scribes.
I've never seen a company looking for a "coder", anymore than they look to hire spreadsheet creators or powerpoint specialists. A software developer can code, but being able to code doesn't make you a software developer, anymore than being able to create a powerpoint makes you a manager (although in some companies it might do, so maybe bad example!).
it's official old buddy, i'm a has been.
Someone asked if true intelligence requires a foundation of prior knowledge. This is the way I think about it.
I = E / K
where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.
For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.
Now back to the question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers the question.
Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.
IK = E
low intelligence * vast knowledge = reasonable effectiveness
Well put. You ask LLMs about ARC-like challenges and they are able to come up with a list of possible problem formulations even before you show them the input. The models already know that they might expect various object manipulations, symmetry problem, etc. The fact that the solution costs thousands of dollars says to me that the model iterates over many solutions while using this implicit knowledge and feedback it gets from running the program. It is still impressive, but I don't think this is what the ARC prize was supposed to be about.
> while using this implicit knowledge and feedback it gets from running the program.
What feedback, and what program, are you referring to?
Basically solutions that were doing well in arc just threw thousands of ideas at the wall and picked the ones that stuck. They were literally generating thousands of python programs, running them and checking if any produced the correct output when fed with data from examples.
This o3 doesn't need to run python. It itself executes programs written in tokens inside it's own context window which is wildly inefficient but gives better results and is potentially more general.
So basically it's a massively inefficient trial-and-error leetcode solver which only works because it throws incredible amounts of compute at the problem.
This is hilarious.
Previous best specialized ARC solver was exactly that.
This o3 thing might be a bit different because it's just chain of thought llm that can do many other things as well.
It's not uncommon for people to have a handful of wrong ideas before they stumble upon a correct solution either.
I assume that o3 can run Python scripts and observe the outputs.
There should be also a factor about resource consumption. See here: https://lorenzopieri.com/pgii/
An interesting point from a philosophical perspective!
But if we'd take this into consideration would it mean that 1st world engineer is by definition less inteligent than 3rd world one?
I think the (completely reasonable) knee jerk reaction is a definsive one, but I can imagine absolutarian regime escapee working side-by-side an engineer groomed in expensive, air conditioned lecture rooms. In this imaginary scenario escapee, even if slower and less efficient at the problem at hand would have to be more intelligent generally.
That's a bit silly.
Yes, resource consumption is important. But your car guzzling a lot of gas doesn't mean he drives slower. It just means it drives slower per mol of petrol consumed.
It's good to know whether your system has a high or low 'bang for buck' metric, but that doesn't directly affect how much bang you get.
Also perhaps a factor (with diminishing returns) for response speed?
All else equal, a student who gets 100% on a problem set in 10 minutes is more intelligent than one with the same score after 120 minutes. Likewise an LLM that can respond in 2 seconds is more impressive than one which responds in 30 seconds.
> a student who gets 100% on a problem set in 10 minutes is more intelligent than one with the same score after 120 minutes
According to my mathematical model, the faster student would have higher effectiveness, not necessarily higher intelligence. Resource consumption and speed are practical technological concerns, but they're irrelevant in a theorical conceptualization of intelligence.
If you disregard time, all computers have maximal intelligence, they can enumerate all programs and compute answers to any decidable question.
Yeah speed is a key factor in intelligence. And actually one of the biggest differentiators in human iq measurements
Humans are a bit annoying that way, because it's all correlated.
So a human with a better response time, also tends to give you more intelligent answers, even when time is not a factor.
For a computer, you can arbitrarily slow them down (or speed them up), and still get the same answer.
Maybe. If I could ask a AI to come up with a 50% efficient mass market solar panel, I don’t really care if it takes a few weeks or a year if it can solve that though. I’m not sure if inventiveness or novelness of solution could be a metric. I suppose that is superintelligence rather than AGI? And by then there would be no question of what it is
> response time
Imagine you take an extraordinarily smart person, and put them on a fast spaceship that causes time dilation.
Does that mean that they are stupider while in transit, and they regain their intelligence when it slows down?
Who is a better free-thrower, someone who can hit 20 free throws per minute on Earth, or the same thrower who logged 20 million free throws in the apparent two years he was gone but comes back ready for retirement?
No, because intelligence is relative to your local context.
Why should one kind of phenomenon which slows down performance on the test be given a special "you're more intelligent than you seem" exception, but not others?
If we are required to break the seal on the black-box and investigate the exactly how the agent is operating in order to judge its "intelligence"... Doesn't that kinda ruin the up-thread stuff about judging with equations?
[deleted]
We should wait until it's released before we anoint it. It's disheartening to see how we keep repeating the same pattern that gives in to hype over the scientific method.
The scientific method doesn’t drive stock price (apparently).
An intelligent system could take more advantage of an increase of knowledge than a dumb one, so I should propose a simple formula: the derivative of efficiency with respect to knowledge is proportional to intelligence.
$$ I = \frac{partial E}{partial K} \simeq \frac{\delta E}{\delta K} $$
In order to estimate $I$ you have to consider that efficiency and knowledge are task related, so you could take some weighted mean $sum_T C(E,K,T)*I(E,K,T)$ where $T$ is task category. I am thinking in $C(E,K,T)$ as something similar to thermal capacity or electrical resistance, the equivalent concept when applied to task. An intelligent agent in a medium of low resistance should fly while a dumb one would still crawl.
> An intelligent system could take more advantage of an increase of knowledge than a dumb one
Why?
> derivative of efficiency
Where did your efficiency variable come from?
Why? I am using dumb as a low intelligence system. A more intelligent person can take advantage of new opportunities. Efficience variable: You are right that effectiveness could be better here because we are not considering resources like computer time and power.
Interesting formulation! it captures the intuition of the "smartness" when solving a problem. However, what about asking good questions or proposing conjectures?
Aren't those solutions to problems as well?
Find the best questions to ask. Find the best hypothesis to suggest.
As a kid I absolutely hated math and loved physics and chemistry because solving anything in math requires vast specific K.
In comparison you can easily know everything there is to know about physics or chemistry and it's sufficient to solve interesting puzzles. In math every puzzle has it's own vast lore you need to know before you can have any chance at tackling it.
Physics and chemistry require experimentation to verify solutions. With math however, any new knowledge can be intuited and proven from previous proofs, so yes, the lore goes deep!
Yep, I aways liked encyclopedia. Wiki is good too :)
What I would like to have in the future is SO answering-peoples accessible in real time via IRC. They have real answers NOW. They are even pedantic about their stuff !
Where did someone ask that?
[deleted]
The first computers cost millions of dollars and filled entire rooms to accomplish what we would now consider simple computational tasks. That same computing power now fits into the width of a finger nail. I don’t get how technologists balk at the cost of experimental tech or assume current tech will run at the same efficiency for decades to come and melt the planet into a puddle. AGI won’t happen until you can fit enough compute that’d take several data center’s worth of compute into a brain sized vessel. So the thing can move around process the world in real time. This is all going to take some time to say the least. Progress is progress.
I thought you were going to say that now we're back to bigger-than-room sized computers that cost many millions just to perform the same tasks we could 40 years ago.
I of course mean we're using these LLMs for a lot of tasks that they're inappropriate for, and a clever manually coded algorithm could do better and much more efficiently.
> and a clever manually coded algorithm could do better and much more efficiently.
Sure, but how long would it take to implement this algorithm, and would that be worth it for one-off cases?
Just today I asked Claude to create a jq query that looks for objects with a certain value for one field, but which lack a certain other field. I could have spent a long time trying to make sense of jq's man page, but instead I spent 30 seconds writing a short description of what I'm looking for in natural language, and the AI returned the correct jq invocation within seconds.
I don’t think this is a bad use. A bad use would be to give Claude the dataset and ask it to tell you which elements have that value.
Claude answers a lot of its questions by first writing and then running code to generate the results. Its only limitation is the access to databases and size of context window, both of which will be radically improved over the next 5 years.
I would still rather be able to see the code it generates
Ha, I tried that before. However, the file was too large for its context window, so it only seemed to analyze the first part and gave a wrong result.
It was your own data, right ? Becouse you just donated half of it...
It's okay, I also uploaded an NDA in a previous prompt :-)
But how do you know it's given you the correct answer? Just because the code appears to work it doesn't mean it's correct.
But how do I know if my hand-written jq query is the correct solution? Just because the query appears to work it doesn't mean it's correct.
Because I understand the process that I have followed to get to the solution.
It can explain its solution. Point to relevant docs as well.
It can also very convincingly explain a non-solution pointing to either real or hallucinated docs.
You need to look at the docs.
Omg this is how llms used to trick me inventing out all these apis.
Look at the docs it links to.
just ask the LLM to solve enough problems (even new problems), cache the best, do inference time compute for the rest, figure out the best/ fastest implementations, and boom, you have new training data for future AIs
> cache the best
How do you quantify that?
"Assume the role of an expert in cache invalidation..."
"one does not just assume", "because the hardest problems in Tech are Johnny Cash invalidations" --Lao Tzi
> "Those who invalidate caches know nothing; Those who know retain data." These words, as I am told, were spoken by Lao Tzi. If we are to believe that Lao Tzi was himself one who knew, why did he erase /var/tmp to make space for his project?
-- Poem by Cybernetic Bai Juyi, "The Philosopher [of Caching]"
“Assume the role of an expert in naming things. You know, a… what do they call those people again… there must be a name for it”
however you want
The LLMs are now writing their own algorithms to answer questions. Not long before they can design a more efficient algorithm to complete any feasible computational task, in a millionth of the time needed by the best human.
> The LLMs are now writing their own algorithms to answer questions
Writing a python script, because it can't do math or any form of more complex reasoning is not what I would call "own algorithm". It's at most application of existing ones/calling APIs.
LLMs are probabilistic string blenders pulling pieces up from their training set, which unfortunately comes from us, humans.
The superset of the LLM knowledge pool is human knowledge. They can't go beyond the boundaries of their training set.
I'll not go into how humans have other processes which can alter their and collective human knowledge, but the rabbit hole starts with "emotions, opposable thumbs, language, communication and other senses".
> They can't go beyond the boundaries of their training set.
TFA says they just did. That's what the ARC-AGI benchmark was supposed to test.
> take several data center’s worth of compute into a brain sized vessel. So the thing can move around process the world in real time
How so? I'd imagine a robot connected to the data center embodying its mind, connected via low-latency links, would have to walk pretty far to get into trouble when it comes to interacting with the environment.
The speed of light is about three orders of magnitude faster than the speed of signal propagation in biological neurons, after all.
The robot brain could be layered so that more basic functions are embedded locally while higher-level reasonings and offloaded to the cloud.
blue strip from iRobot?
6 orders of magnitude if we use 120 m/s vs 300 km/s
should've said "6 orders of magnitude if we use 120 m/s vs 300_000 km/s" - I was also off by 3 orders of magnitude!
Ah, yes, I missed a “k” in that estimation!
Many of humans' capabilities are pretrained with massive computing through evolution. Inference results of o3 and its successors might be used to train the next generation of small models to be highly capable. Recent advances in the capabilities of small models such as Gemini-2.0 Flash suggest the same.
Recent research from NVIDIA suggests such an efficiency gain is quite possible in the physical realm as well. They trained a tiny model to control the full body of a robot via simulations.
---
"We trained a 1.5M-parameter neural network to control the body of a humanoid robot. It takes a lot of subconscious processing for us humans to walk, maintain balance, and maneuver our arms and legs into desired positions. We capture this “subconsciousness” in HOVER, a single model that learns how to coordinate the motors of a humanoid robot to support locomotion and manipulation."
...
"HOVER supports any humanoid that can be simulated in Isaac. Bring your own robot, and watch it come to life!"
More here: https://x.com/DrJimFan/status/1851643431803830551
---
This demonstrates that with proper training, small models can perform at a high level in both cognitive and physical domains.
> Similarly, many of humans' capabilities are pretrained with massive computing through evolution.
Hmm .. my intuition is that humans' capabilities are gained during early childhood (walking, running, speaking .. etc) ... what are examples of capabilities pretrained by evolution, and how does this work?
If you look at animals, they can walk in hours, not much time needed after being born. It takes us a longer time because we are born rather undeveloped to get the head out of the birth canal.
A more high level example, sea sickness is a evolutionary pre-learned thing, your body things it's poisoned and it automatically wants to empty your stomach.
The brain is predisposed to learn those skills. Early childhood experiences are necessary to complete the training. Perhaps that could be likened to post-training. It's not a one-to-one comparison but a rather loose analogy which I didn't make it precise because it is not the key point of the argument.
Maybe evolution could be better thought of as neural architecture search combined with some pretraining. Evidence suggests we are prebuilt with "core knowledge" by the time we're born [1].
See: Summary of cool research gained from clever & benign experiments with babies here:
[1] Core knowledge. Elizabeth S. Spelke and Katherine D. Kinzler. https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...
> The brain is predisposed to learn those skills.
Learning to walk doesn't seem to be particularly easy, having observed the process with my own children. No easier than riding a bike or skating, for which our brains are probably not 'predisposed'.
Walking is indeed a complex skill. Yet some animals walk minutes after birth. Human babies are most likely born premature due to the large brain and related physical constraints.
Young children learn to bike or skate at an older age after they have acquired basic physical skills.
Check out the reference to Core Knowledge above. There are things young infants know or are predisposed to know from birth.
The brain has developed, through evolution, very specific and organized structures that allow us to learn language and reading skills. If you have a genetic defect that causes those structures to be faulty or missing, you will have severe developmental problems.
That seems like a decent example of pretraining through evolution.
But maybe it's something more like general symbolic manipulation, and not specifically the sounds or structure of language. Reading is fairly new and unlikely to have had much if any evolutionary pressure in many populations who are now quite literate. Same seems true for music. Maybe the hardware is actually more general and adaptable and not just for language?
The research disagrees with you.
Music is really, really old.
And reading and music co-evolved to be relatively easy for humans to do.
(See how computers have a much easier time reading barcodes and QR codes, with much less general processing power than it takes them to decipher human hand-writing. But good luck trying to teach humans to read QR codes fluently.)
> No easier than riding a bike or skating, for which our brains are probably not 'predisposed'.
What makes you think so? Humans came up with biking and skating, because they were easy enough for us to master with the hardware we had.
[deleted]
I think of evolution as unassisted learning where agents compete with the each other for limited resources. Over time they get better and better at surviving by passing on genes. It never ends of course.
Your brain is well adapted to learning how to walk and speak.
Chimpanzees score pretty high on many tests of intelligence, especially short term working memory. But they can't really learn language: they lack the specialised hardware more than the general intelligence.
I mean, there are plenty - e.g. mimicking (say, the mother's face's emotions), which are precursors to learning more advanced "features". Also, even walking has many aspects pretrained (I assume it's mostly a musculoskeletal limitation that we can't walk immediately), humans are just born "prematurely" due to our relatively huge heads. Newborn horses can walk immediately without learning.
But there are plenty of non-learned control/movement/sensing in utero that are "pretrained".
Interestingly, there's a bunch of reflexes that also only develop over time.
They are more nature than nurture, but they aren't 'in-born'.
Just like human aren't (usually) born with teeth, but they don't 'learn' to have teeth or pubic hair, either.
The concern here is mainly on practicality. The original mainframes did not command startup valuations counted in fractions of the US economy, they did qualify for billions in investment.
This is a great milestone, but OpenAI will not be successful charging 10x the cost of a human to perform a task.
The cost of inference has be dropping by ~100x in the past 2 years.
Hmm the link is saying the price of an LLM that scores 42 or above on MMLU has dropped 100x in 2 years, equating gpt 3.5 and llama 3.2 3B. In my opinion gpt 3.5 was significantly better than llama 3B, and certainly much better than the also-equated llama 2 7B. MMLU isn't a great marker of overall model capabilities.
Obviously the drop in cost for capability in the last 2 years is big, but I'd wager it's closer to 10x than 100x.
*infernonce
*inference
> OpenAI will not be successful charging 10x the cost of a human to perform a task.
True, but they might be successful charging 20x for 2x the skill of a human.
Or 10x the skill and speed of a human in some specific class of recurrent tasks. We don't need full super-human AGI for AI to become economically viable.
Companies routinely pay short-term contractors a lot more than their permanent staff.
If you can just unleash AI on any of your problems, without having to commit to anything long term, it might still be useful, even if they charged more than for equivalent human labour.
(Though I suspect AI labour will generally trend to be cheaper than humans over time for anything AIs can do at all.)
I wouldn’t expect it to cost 10x in five years, if only because parallel computing still seems to be roughly obeying moore’s.
How much does AWS charge for compute?
If it can be spun up with Terraform, I bet you they could.
Maybe AGI as a goal is overvalued: If you have a machine that can, on average, perform symbolic reasoning better than humans, and at a lower cost, that's basically the end game, isn't it? You won capitalism.
Right now I can ask an (experienced) human to do something for me and they will either just get it done or tell me that they can’t do it.
Right now when I ask an LLM… I have to sit there and verify everything. It may have done some helpful reasoning for me but the whole point of me asking someone else (or something else) was to do nothing at all…
I’m not sure you can reliably fulfill the first scenario without achieving AGI. Maybe you can, but we are not at that point yet so we don’t know yet.
You do need to verify humans work though.
The difference, to me, is that humans seem to be good at canceling each other's mistakes when put in a proper environment.
Not with the same depth. I might ask a friend to drop off a letter and I might verify that they did it, but I don’t have to verify that they didn’t mistake a Taco Bell or a dumpster as the post office.
It’s very scary to ask a friend to drop off a letter if the last scenario is even 1% within the realm of possibility.
My guess is this is an artifact of the RLHF part of the training. Answers like "I don't know" or "let me think and let's catch on this next week" are flagged down by human testers, which eventually trains LLM to avoid this path altogether. And it probably makes sense because otherwise "I don't know" would come up way too often even in cases where the LLM is perfectly able to give the answer.
I don't know, that seems like a fundamental limitation. LLMs don't have any ability to do reflection on their own knowledge/abilities.
Humans aren't very aware of their limits, either.
Even the Dunning-Kruger effect is, ironically, widely misunderstood by people who are unreasonably confident about their knowledge.
But you know if you have ever heard about call by name or value semantics.
You've only ever seen people get upset about technical jargon they know they don't understand, but also never seen people misuse jargon wildly?
The latter in particular is how I model the mistakes LLMs made, what with them having read most things.
Yes, Dunning-Kruger's paper never found what popular science calls the 'Dunning-Kruger' effect.
Effectively, they found nothing real but a statistical artifact.
> Right now I can ask an (experienced) human to do something for me and they will either just get it done or tell me that they can’t do it.
Finding reliable honest humans is a problem governments have struggled with for over a hundred years. If you have cracked this problem at scale you really need to write it up! There are a lot of people who would be extremely interested in a solution here.
> Finding reliable honest humans is a problem governments have struggled with for over a hundred years.
Yes, though you are downplaying the problem a lot. It's not just governments, and it's way longer than 100 years.
Btw, a solution that might work for you or me, presumably relatively obscure people, might not work for anyone famous, nor a company nor a government.
It's not clear to me whether AGI is necessary for solving most of the issues in the current generation of LLMs. It is possible you can get there by hacking together CoTs with automated theorem provers and bruteforcing your way to the solution or something like that.
But if it's not enough then maybe it might come as a second-order effect (e.g. reasoning machines having to bootstrap an AGI so then you can have a Waymo taxi driver who is also a Fields medalist)
There are so called "yes-men" who can't say "no" in no situation. That's rooted in their culture. I suspect that AI was trained using their assistance. I mean, answering "I can't do that" is the simplest LLM path that should work often unless they gone out of their way to downrank it.
Honestly, it doesn't need to be local, API is some 200ms away is ok-ish, make it 50ms it will be practically usable for every majority of interaction.
Batteries..
[deleted]
Intelligence has nothing at all whatever to do with compute.
Unless you're a dualist who believes in a magic spirit, I cannot understand how you think that's the case. Can you please explain?
Dualism has nothing to do with it. There are more things on heaven and earth then just computable functions in the mathematical sense.
(In fact, the very idea of "computable functions" was invented to narrow down the space of "all things" to something much smaller, tighter and manageable. And now we've come full circle and apparently everything in the universe is a computable function? Well, if all you have is a hammer, I guess everything must necessarily look like a nail.)
Philosophy of mind is the branch of philosophy that attempts to account for a very difficult problem: why there are apparently two different realms of phenomena, physical and mental, that are at once tightly connected and yet as different from one another as two things can possibly be.
Broadly speaking you can think that the mental reduces to the physical (physicalism), that the physical reduces to the mental (idealism), both reduce to some other third thing (neutral monism) or that neither reduces to the other (dualism). There are many arguments for dualism but I’ve never heard a philosopher appeal to “magic spirits” in order to do so.
Here’s an overview: https://plato.stanford.edu/entries/dualism/
Intelligence is about learning from few examples and generalising to novel solutions. Increasing compute so that exploring the whole problem space is possible is not intelligence. There is a reason the actual ARC-AGI price has efficiency as one of the success requirements. It is not so that the solutions scale to production and whatnot, these are toy tasks. It is to help ensure that it is actually an intelligent system solving these.
So yeah, the o3 result is impressive but if the difference between o3 and the previous state of art is more compute to do a much longer CoT/evaluation loop, I am not so impressed. Reminder that these problems are solved by humans in seconds, ARC-AGI is supposed to be easy.
Do you think intelligence exists without prior experience? For instance, can someone instantly acquire a skill—like playing the piano—as if downloading it in The Matrix? Even prodigies like Mozart had prior exposure. His father, a composer and music teacher, introduced him to music from an early age. Does true intelligence require a foundation of prior knowledge?
Intelligence requires the ability to separate the wheat from the chaff on one's own to create a foundation of knowledge to build on.
It is also entirely possible to learn a skill without prior experience. That's how it(whatever skill) was first done
> Does true intelligence require a foundation of prior knowledge?
This is the way I think about it.
I = E / K
where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.
For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.
Now back to your question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers your question.
Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.
IK = E
low intelligence * vast knowledge = reasonable effectiveness
Reducing the broad category of "experience" to "computable functions in the mathematical sense" is quite, hm, reductive.
[deleted]
With only a 100x increase in cost, we improved performance by 0.1x and continued plotting this concave-down diminishing-returns type graph! Hurray for logarithmic x-axes!
Joking aside, better than ever before at any cost is an achievement, it just doesn't exactly scream "breakthrough" to me.
imo it's a mistake to interpret the marginal increases in the upper echelons of benchmarks as materially marginal gains. Chess is an example. ELO narrows heavily at the top, but each ELO point carries more relative weight. This is a bit apples and oranges since chess is adversarial, but I think the point stands.
> ELO narrows heavily at the top
What do you mean by this? I'm assuming you're not speaking about simple absolute differences in value - there have been top players rated over 100 points higher than the average of the rest of the top ten.
o3-mini (high) uses 1/3rd of the compute of o1, and performs about 200 Elo higher than o1 on Codeforces.
o1 is the best code generation model according to Livebench.
So how is this not a breakthrough? It's a genuine movement of the frontier.
How much time does a top sprinter take a 100 m run for compared to a mediocre sprinter?
I mean going from 10% to 85% doesn’t seem like a 0.1% improvement
Oh crap I made a mistake. I was comparing o3 low to o3 high.
I'm a little disappointed by all the upvotes I got for being flat wrong. I guess as long as you're trashing AI you can get away with anything.
Really I was just trying to nitpick the chart parameters.
compute gets cheaper and cheaper every year. This model will be in your phone by 2030 if we continue at the pace we've been at the last few years.
These models are nearing 2+ trillion parameters. At 4 bits each, we're talking about somewhere around 1tb of RAM.
The problem is that RAM stopped scaling a long time ago now. We're down to the size where a single capacitor's charge is held by a mere 40,000 or so electrons and all we've been doing is making skinnier, longer cells of that size because we can't find reliable ways to boost even weaker signals, but this is a dead end because as the math shows, if the volume is consistent and you are reducing X and Y dimensions, that Z dimension starts to get crazy big really fast. The chemistry issues of burning a hole a little at a time while keeping wall thickness somewhat similar all the way down is a very hard problem.
Another problem is that Moore's law hit a wall when Dennard Scaling failed. When you look at SRAM (it's generally the smallest and most reliable stuff we can make), you see that most recent shrinks can hardly be called shrinks.
Unless we do something very different like compute in storage or have some radical breakthrough in a new technology, I don't know that we will ever get a 2T parameter model inside a phone (I'd love for someone in 10 years to show up and say how wrong I was).
There’s probably enough VC money to subsidize the costs for a few more years.
But the data centres running the training for models like this are bringing up new methane power plants at a fast rate at a time when we need to be reducing reliance on O&G.
But let’s assume that the efficiency gains out pace the resource consumption with the help of all the subsidies being thrown in and we achieve AGI.
What’s the benefit? Do we get more fresh water?
Politically anything can happen. Maybe the billionaire class controls everything with an army of robots and it's a horrible prison-like dystopia, or maybe we end up in a post-scarcity utopia a la The Culture.
Regardless, once we have AGI (and it can scale), I don't think O&G reliance (/ climate change) is going to be something that we need concern ourselves with.
I think climate is something we need to concern ourselves with now and at the rate we’re putting up new methane power plants to power the DC’s for these AI apps today, we might not make it to this hypothetical future where magic algorithms will reset everything.
Like it or not we already know what we need to do to avert the worst of the climate disasters to come.
Yeah, good question. I think it depends on our politics. If we’re in a techno-capital-oligarchy, people are going to have a hard time making fresh water a priority when the robots would prefer to build nuclear power everywhere and use it to desalinate sea water.
OTOH if these data centers are sufficiently decentralized and run for public benefit, maybe there’s a chance we use them to solve collective action problems.
[deleted]
[flagged]
I know AGI is a bit of a moving goalpost these days, but by my personally anecdotal and irrelevant opinion this ain’t AGI.
I’ll let those smarter than me debate the merits of AGI, but if it can’t learn and self-improve it isn’t “general” intelligence.
This is a very smart computer, accomplishing a very niche set of problems. Cool? Yes. AGI? No.
What is your personal benchmark for AGI? Turing test was surpassed years ago, ARC-AGI was a next step some quite clever people in the space came up with as a successor, and has now surpassed as well.
So what is your benchmark?
> Turing test was surpassed years ago
I keep seeing this thrown around, but did anyone actually like go out and do this? I feel like I could distinguish between an AI (even the latest models) and a person after a text-only back and forth conversation.
The models you've interacted with have guardrails. If you fine-tuned an LLM with the goal of "convince people you are human" (effectively the opposite of what the major players are doing with their fine-tunes), I am very confident even you would be fooled.
Self-learning. Human level intelligence, which isn’t a set of facts but rather the ability to self-learn and correct which LLMs are completely unable to do right now.
It is fairly clear to me that self-learning is already possible. Give an agentic LLM the ability to add things to its own context window, or fine tune itself, and it will.
Self-learning
That's exactly what this particular benchmark requires.
I'll start worrying about AGI once I'm convinced there's such a thing as GI.
How do you have anecdotal opinion on this o3 model that hasn’t been publicly released yet?
I think what current technology is missing is in situ learning. I don't think that's a massive leap from where we are though.
How is in situ learning not a massive leap? We have no clue how to make an intelligent general task judger, and without that we don't know how to do in situ learning that actually improves performance on general tasks.
I think its mainly a software/use case problem as opposed to an architectural problem.
Right now AI systems are built top to bottom to learn in development, and be deployed as a static asset. This isn't because online learning isn't doable, its because there isn't a great use case for current limitations. Either the algorithms are too slow, or computers are too slow, take your pick.
Chain of Thought is basically a more constrained version of in situ learning, only the knowledge has a lifetime bound to the task. Propagating the information into the model would be too resource hungry, and too unpredictable to productize. Honestly, taking the result of Chain of thought, and feeding that back into training offline is probably where a lot of the progress on these kinds of tasks is coming from.
Not to say it wouldn’t be a leap, but is there anything particularly special about learning that makes it unachievable? If you can reason and you can do trial and error, given enough time and compute I think most things should be learnable?
> If you can reason and you can do trial and error
Trial and error requires a judge to determine if there was an error. To do trial and error for general tasks you need a general judge, and it needs to be good in order to get intelligent results. All examples of successful AI you see have human judges or human programmed narrow judges. Chess AI training is an example where we have a human programmed judge, but for most tasks not even humans can code up a good judge.
Does it necessarily require a judge? Many tasks have objective success criteria. A basketball goes through the hoop or it doesn’t.
If the success criteria are inherently subjective, like music or art, you can use human reactions as the criteria, while also using reasoning to infer principles about what is or isn’t received well. That’s what humans do.
> Does it necessarily require a judge? Many tasks have objective success criteria. A basketball goes through the hoop or it doesn’t.
You just said what there is to judge in one specific scenario, a general AI has to make the decision to judge itself that way which is not objective, it is extremely hard to decide what to judge yourself by in a generic situation.
For example, lets say you are in a basketball court, with a basketball. What is a good outcome? Is it shooting the ball in the hoop? But maybe it isn't your turn, someone else is shooting now, then shooting at the hoop is a bad outcome, how do you make the AI recognize that instead of mindlessly trying to make the ball go in the hoop without considering the context?
I’d say that while a full basketball game is a lot more complex, it’s made up of a lot of component tasks that still have objective success criteria: dribbling, passing, setting picks, doing your part in a play, etc. Ultimately your team either scores or it doesn’t.
Not to say it isn’t difficult, but I don’t think humans are doing anything particularly magical when they learn to play basketball—something I did myself when I was a kid. You learn each skill from a coach’s demonstration, practice them all a lot (practice = trial and error) and develop an intuition (reasoning) about what do to in various situations.
Agi level my ass
+1
[flagged]
It may eventually be able to solve any problem
Ah. Me, too.
It is not exactly AGI but huge step toward it. I would expect this step in 2028-2030. I cant really understand why people are happy with it, this technology is so dangerous that can disrupt whole society. It's neither like smartphone nor internet. What will happen to 3rd world countries. Lots of unsolved questions and world is not prepared for such a change. Lots of people will lose their jobs I am not even mentioning their debts. No one will have chance to be rich anymore, If you are in first world country you will probably get UBI, if not you wont.
Same, I don’t really get the excitement. None of these companies are pushing for a utopian Star Trek society either with that power.
Open models will catch up next year or the year after, there only so many things to try and there's lots of people trying them, so it's more or less an inevitability.
The part to get excited about is that there's plenty of headroom left to gain in performance. They called o1 a preview, and it was, a preview for QwQ and similar models. We get the demo from OAI and then get the real thing for free next year.
[deleted]
> What will happen to 3rd world countries
Probably less disruption than will happen in 1st world countries.
> No one will have chance to be rich anymore
It's strange to reach this conclusion from "look, a massive new productivity increase".
Strange indeed if we work under the assumption that the profits from this productivity will be distributed (even roughly) evenly. The problem is that most of us see no indication that they will be.
I read “no one will have a chance to be rich anymore” as a statement about economic mobility. Despite steep declines in mobility over the last 50 years, it was still theoretically possible for a poor child (say bottom 20% wealth) to climb several quintiles. Our industry (SWE) was one of the best examples. Of course there have been practical barriers (poor kids go to worse schools, and it’s hard to get into college if you can’t read) but the path was there.
If robots replace a lot of people, that path narrows. If AGI replaces all people, the path no longer exists.
It is not strange at all, a very big motivation of spending billions in AI research is basically to remove what is called "skill premium" from the labor market. That "skill premium" was usually how people got richer than their fathers.
Intelligence is the thing distinguishing humans from all previous inventions that already were superhuman in some narrow domain.
car : horse :: AGI : humans
its not like sonnet, yes current ai tools are increasing productivity and provides many ways to have chance to be rich, but agi is completely different. You need to handle evil competition between you and big fishes, probably big fishes will have more ai resources than you. What is the survival ratio in such a environment ? Very low.
> I would expect this step in 2028-2030.
Do you work at one of the frontier labs?
I’ve never understood this perspective. Companies only make money when there are billions of customers. Are you imagining a total-monopoly scenario where zero humans have any income/wealth and there are only AI companies selling/mining/etc to each other, fully on their own? In such an extreme scenario, clearly the world’s governments would nationalize these entities. I think the only realistic scenario in which the future is not markedly better for every single human is if some rogue AI system decides to exterminate us, which I find to be increasingly unlikely as safety improvements are made (like the paper released today).
As for the wealth disparity between rich and poor countries, it’s hard to know how politics will handle this one, but it’s unlikely that poor countries won’t also be drastically richer as the cost of basic living drops to basically zero. Imagine the cost of food, energy, etc in an ASI world. Today’s luxuries will surely be considered human rights necessities in the near future.
> In such an extreme scenario, clearly the world’s governments would nationalize these entities
Those entities are the worlds governments regardless how things play out. People just worry they will be hostile or indifferent to humans, since that would be bad news for humans. Pet, cattle or pest, our future will be as one of those.
I’m extremely excited because I want to see the future and I’m trying not to think of how severely fucked my life will be.
I hope governments will finally take action.
What action do you expect them to take?
What law would effectively reduce risk from AGI? The EU passed a law that is entirely about reducing AI risk and people in the technology world almost universally considered it a bad law. Why would other countries do better? How could they do better?
If their mission is the wellbeing of their peoples, they should take any action that ensures that.
Besides regulating the technology, they could try to protect people and society from the effects of the technology. UBI for example could be an attempt to protect people from the effects of mass unemployment, as i understood it.
Actually i'm afraid even more fundamental shifts are necessary.
[deleted]
This is also wildly ahead in SWE-bench (71.7%, previous 48%) and Frontier Math (25% on high compute, previous 2%).
So much for a plateau lol.
> So much for a plateau lol.
It’s been really interesting to watch all the internet pundits’ takes on the plateau… as if the two years since the release of GPT3.5 is somehow enough data for an armchair ponce to predict the performance characteristics of an entirely novel technology that no one understands.
The pundits response to the (alleged) plateau was proportional to the certainty with which CEOs of frontier labs discussed pre-training scaling. The o3 result is from scaling test time compute, which represents a meaningful change in how you would build out compute for scaling (single supercluster --> presence in regions close to users). Thus it is important to discuss.
[deleted]
You could make an equivalently dismissive comment about the hypesters.
Yeah but anyone with half a brain knows to ignore them. Vapid cynicism is a lot more seductive to the average nerd.
>Frontier Math (25% on high compute, previous 2%)
This is so insane that I can't help but be skeptical. I know FM answer key is private, but they have to send the questions to OpenAI in order to score the models. And a significant jump on this benchmark sure would increase a company's valuation...
Happy to be wrong on this.
Nope, makes sense to me. Seems unreasonable to conclude the dataset is not compromised now.
the question is whether that 25% jump is also because of the compromised first test.
viewed from a skeptical lens of incentives:
openai and epochai are both startups with every incentive to peddle this narrative. when no one else can independently verify.
You're talking apples and oranges. The plateau the frontier models have hit is the limited further gains to be had from dataset (+ corresponding model/compute) scaling.
These new reasoning models are taking things in a new direction basically by adding search (inference time compute) on top of the basic LLM. So, the capabilities of the models are still improving, but the new variable is how deep of a search you want to do (how much compute to throw at it at inference time). Do you want your chess engine to do a 10 ply search or 20 ply? What kind of real world business problems will benefit from this?
"New" reasoning models are plain LLMs with clever reinforcement learning. o1 is itself reinforcement learning on top GPT-4o.
They found a way to make test time compute a lot more effective and that is an advance but the idea is not new, the architecture is not new.
And the vast majority of people convinced LLMs plateaued did so regardless of test time compute.
The fact that these reasoning models may compute for extended durations, using exponentially more compute for linear performance gains (says OpenAI), resulting in outputs that while better are not necessarily any longer (more tokens) than before, all point to a different architecture - some type of iterative calling of the underlying model (essentially a reasoning agent using the underlying model).
A plain LLM does not use variable compute - it is a fixed number of transformer layers, a fixed amount of compute for every token generated.
Architecture generally refers to the design of the model. In this case, the underlying model is still a transformer based llm and so is its architecture.
What's different is the method for _sampling_ from that model where it seems they have encouraged the underlying LLM to perform a variable length chain of thought "conversation" with itself as has been done with o1. In addition, they _repeat_ these chains of thought in parallel using a tree of some sort to search and rank the outputs. This apparently scales performance on benchmarks as you scale both length of the chain of thought and the number of chains of thought.
No disagreement, although the sampling + search procedure is obviously adding quite a lot to the capabilities of the system as a whole, so it really should be considered as part of the architecture. It's a bit like AlphaGo or AlphaZero - generating potential moves (cf LLM) is only a component of the overall solution architecture, and the MCTS sampling/search is equally (or more) important.
Ah, I see. Yeah that's a fair assessment and in hindsight is probably the way architecture is being used in the article.
I think throwaway already explained what i was getting at.
That said, i probably did downplay the achievement. It may not be a "new" idea to do something like this but finding an effective method for reflection that doesn’t just lock you into circular thinking and is applicable beyond well defined problem spaces is genuinely tough and a breakthrough.
I legit see that if there is not even a new breakthrough just one week, people start shouting plateau plateau.. Our rate of progress is extraordinary and any downplay of it seems like stupid
At 6,670$/task? I hope there's a jump
It's not 6,670$/task. That was the high efficiency cost for 400 questions.
[deleted]
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.
That's the most plausible definition of AGI i've read so far.
That's a pretty dark view of humanity and human intelligence. We're defined by the tasks we can do?
Instrumental reason FTW
That implies that human intelligence is equivalent to AGI.
[deleted]
If people constantly have to ask if your test is a measure of AGI, maybe it should be renamed to something else.
From the post
> Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Its funny when they say this, as if all humans can solve basic ass question/answer combos, people seem to forget theirs a percentage of the population that honestly believe the world is flat along with other hallucinations at the human level
Humans works in groups, so you are wrong a group of human is extremely reliable on tons of tasks. These AI models also work in groups, or they don't improve from working in a group since the company uses whatever does the best on the benchmark, so it is only fair to compare AI vs group of people, AI compared to an individual will always be an unfair comparison since an AI is never alone.
I don't believe AGI at that level has any commercial value.
[deleted]
How much longer can I get paid $150k to write code ?
I’ll believe the models can take the jobs of programmers when they can generate a sophisticated iOS app based on some simple prompts, ready for building and publication in the app store. That is nowhere near the horizon no matter how much things are hyped up, and it may well never arrive.
Nah, it will arrive. And regardless, this sort of AI reduces the skill level required to make the app. It reduces the amount of people required and thus reduces the demand for engineers. So, even though AI is not CLOSE to what you are suggesting, it can significantly reduce the salaries of those that ARE required. So maybe fewer $150K programmers will be hired with the same revenue for even higher profits.
The most bizarre thing is that programmers are literally writing code to replace themselves because once this AI started, it was a race to the bottom and nobody wants to be last.
They've been promising us this thing since the 60s: End-user development, 5GLs, etc. enabling the average Joe to develop sophisticated apps in minimal time. And it never arrives.
I remember attending a tech fair decades ago, and at one stand they were vending some database products. When I mentioned that I was studying computer science with a focus on software engineering, they sneered that coding will be much less important in the future since powerful databases will minimize the need for a lot of data wrangling in applications with algorithms.
What actually happened is that the demand for programmers increased, and software ate the world. I suspect something similar will happen the current AI hype.
> They've been promising us this thing since the 60s: End-user development, 5GLs, etc. enabling the average Joe to develop sophisticated apps in minimal time. And it never arrives.
This has literally already arrived. Average Joes are writing software using LLMs right now.
Source? Which software products are built without engineers?
Personal websites etc, you don't think about them as software products since they weren't built by engineers, but 30 years ago you needed engineers to build those things.
Ok, well I’m not going to worry about my job then. 25 years ago GeoCities existed and you didn’t need an engineer. 10 year old me was writing functional HTML, definitely not an engineer at that point.
To be honest maybe no one should worry.
If AI truly overtakes knowledge work there’s not much we could reasonably do to prepare for it.
If AI never gets there though, then you saved yourself the trouble of stressing about it. So sure, relax, it’s just the second coming of GeoCities.
I think the fear comes from the span of time. If my job is obsolete at the same time as everybody else's, I wouldn't care. I mean, sure, the world is in for a very tough time, but I would be in good company.
The really bad situation is if my entire skill set is made obsolete while the rest of the world keeps going for a decade or two. Or maybe longer, who knows.
I realize I'm coming across quite selfish, but it's just a feeling.
Well, I think in the 60s we also didn't have LLMs that could actually write complete programs, either.
No one writes a "complete program" these days. Things just keep evolving forever. I spent way too much time I care to admit dealing with dependencies of libraries which change seemingly on a daily basis these days. These predictions are so far off reality it makes me wonder if the people making them have ever written any code in their life.
That's fair. Well, I've written a lot of code. But anyway, I do want to emphasize the following. I am not making the same prediction as some that say AI can replace a programmer. Instead, I am saying: combination of AI plus programmers will reduce the need for the number or programmers, and hence allow the software industry to exist with far fewer people, with the lucky ones accumulating even more wealth.
> Nah, it will arrive
Will it?
It's already hard to get people to use computer as they are right now, where you only need to click on things and no longer have to enter commands. That because most people don't like to engage in formal reasoning. Even with one of the most intuitive computer assisted task (drawing and 3d modeling), there's so much to learn regarding theories that few people bother.
Programming has always been easy to learn, and tools to automate coding have existed for decades now. But how many people you know have had the urge to learn enough to automate their tasks?
The absolutist type comments are such a wild take given how often they are so wrong.
Totally... simple increases in 20% efficiency will already significant destroy demand for coders. This forum however will be resistant to admit such economic phenomenon.
Look at video bay editing after the advent of Final Cut. Significant drop in the specialized requirement as a professional field, even while content volume went up dramatically.
I could be misreading this, but as far as I can tell, there are more video and film editors today (29,240) than there were film editors in 1997 (9,320). Seems like an example of improved productivity shifting the skills required but ultimately driving greater demand for the profession as a whole. Salaries don't seem to have been hurt either, median wage was $35,214 in '97 and $66,600 today, right in line with inflation.
Computing has been transforming countless jobs before it got to Final Cut. On one hand, programming is not the hardest job out there. On the other, it takes months to fully onboard a human developer - a person that already has years of relevant education and work experience. There are desk jobs that onboard new hires in days instead. Let’s see when they’re displaced by AI first.
Don't know if you noticed but thats already happening. Mass layoffs in customer service etc have already happened over the last 2 years
So, how does it work out? Are the customers happy? Are the bosses at my work going to be equally happy with my AI replacement?
That's until AI has improved enough that it can automatically navigate the menus to get me a human operator to talk to.
3 to 5 years, max. Traditional coding is going to be dead in the water. Optimistically, the junior SWE job will evolve but more realistically dedicated AI-based programming agents will end demand for Junior SWEs
Which implies that a few years later they will not become senior SWEs either.
There’s a very good chance that if a company can replace its programmers with pure AI then it means whatever they’re doing is probably already being offered as a SaaS product so why not just skip the AI and buy that? Much cheaper and you don’t have to worry about dealing with bugs.
SaaS works for general problems faced by many businesses.
Exactly. Most businesses can get away with not having developers at all if they just glue together the right combination of SaaS products. But this doesn’t happen, implying there is something more about having your own homegrown developers that SaaS cannot replace.
The risk is not SaaS replacing internal developers. It's about increased productivity of developers reducing the number of developers needed to achieve something.
Again, you’re assuming product complexity won’t grow as a result of new AI tools.
3 decades ago you needed a big team to create the type of video games that one person can probably make on their own today in their spare time with modern tools.
But now modern tools have been used to make even more complicated games that require more massive teams than ever and huge amounts of money. One person has no hope of replicating that now, but maybe in the future with AI they can. And then the AAA games will be even more advanced.
It will be similar with other software.
Unless the LLMs see multiple leaps in capability, probably indefinitely. The Malthusians in this thread seem to think that LLMs are going to fix the human problems involved in executing these businesses - they won't. They make good programmers more productive and will cost some jobs at the margins, but it will be the low-level programming work that was previously outsourced to Asia and South America for cost-arbitrage.
You're not being paid $150K to "write code". You're being paid that to deliver solutions - to be a corporate cog than can ingest business requirements and emit (and maintain) business solutions.
If there are jobs paying $150K just to code (someone else tells you what to code, and you just code it up), then please share!
Frontier expert specialist programmers will always be in demand.
Generalist junior and senior engineers will need to think of a different career path in less than 5 years as more layoffs will reduce the software engineering workforce.
It looks like it may be the way things are if progress in the o1, o3, oN models and other LLMs continues on.
This assumes that software products in the future will remain at the same complexity as they are today, just with AI building them out.
But they won’t. AI will enable building even more complex software which counter intuitively will result in need even more human jobs to deal with this added complexity.
Think about how despite an increasing amount of free open source libraries over time enabling some powerful stuff easily, developer jobs have only increased, not decreased.
What about "general" in AGI do you not understand? There will be no new style of development for which the AGI will be poorly suited that all the displaced developers can move to.
For true AGI (whatever that means, lets say fully replicates human abilities), discussing "developers" only is a drop in the bucket compared to all knowledge work jobs which will be displaced.
More likely they will tailor/RL train these models to go after coders first. Use RLHF employing coders where labor is cheap to train their models. A number of reasons for this of course:
- Faster product development on their side as they eat their own dogfood
- Dev's are the biggest market in the transition period for this tech. Gives you some revenue from direct and indirect subscriptions that the general population does not need/require.
- Fear in leftover coders is great for marketing
- Tech workers are paid well which to VC's, CEO's, etc makes it obvious where the value of this tech comes from. Not with new use cases/apps which would be greatly beneficial to society - but effectively making people redundant saving costs. New use cases/new markets are risky; not paying people is something any MBA/accounting type can understand.
I've heard some people say "its like they are targeting SWE's". I say; yes they probably are. I wouldn't be surprised if it takes SWE jobs but otherwise most people see it as a novelty (barely affects their life) for quite some time.
I've made a similar argument in the past but now I'm not so sure. It seems to me that developer demand was linked to large expansions in software demand first from PCs then the web and finally smartphones.
What if software demand is largely saturated? It seems the big tech companies have struggled to come up with the next big tech product category, despite lots of talent and capital.
There doesn’t need to be a new category. Existing categories can just continue bloating in complexity.
Compare the early web vs the complicated JavaScript laden single page application web we have now. You need way more people now. AI will make it even worse.
Consider that in the AI driven future, there will be no more frameworks like React. Who is going to bother writing one? Instead every company will just have their own little custom framework built by an AI that works only for their company. Joining a new company means you bring generalist skills and learn how their software works from the ground up and when you leave to another company that knowledge is instantly useless.
Sounds exciting.
But there’s also plenty of unexplored categories anyway that we can’t access still because there’s insufficient technology for. Household robots with AGI for instance may require instructions for specific services sold as “apps” that have to be designed and developed by companies.
The new capabilities of LLMs, and generally large foundation models, expands the range of what a computer program can do. Naturally, we will need to build all of those things with code. Which will be done by a combo of people with product ideas, engineers, and LLMs. There will be then specialization and competition on each new use-case. eg., who builds the best AI doctor etc.,.
This is exactly what will happen. We'll just up the complexity game to entirely new baselines. There will continue to be good money in software.
These models are tools to help engineers, not replacements. Models cannot, on their own, build novel new things no matter how much the hype suggests otherwise. What they can do is remove a hell of a lot of accidental complexity.
> These models are tools to help engineers, not replacements. Models cannot, on their own, build novel new things no matter how much the hype suggests otherwise.
But maybe models + managers/non technical people can?
The question is: How to become a senior when there is no place to be a junior? Will future SWE need to do the 10k hours as a hobby? Will AI speed up or slow down learning?
good question and I think you gave the correct answer yes people will just do the 10,000 hours required by starting programming at the age of eight and then playing around until they're done studying
[deleted]
Often what happens is the golf-course phenomenon. As golfing gets less popular, low and mid tier golf courses go out of business as they simply aren't needed. But at the same time demand for high end golf courses actually skyrockets because people who want to golf either can give it up or go higher end.
This I think will happen with programmers. Rote programming will slowly die out, while demand for super high end will go dramatically up in price.
Where does this golf-course phenomenon come from? It doesn't really match the real world or how golfing works.
how so, witnessed it quite directly in California. Majority have closed and remaining have gone up in price and are up scale. This has been covered in various new programs like 60 minutes. You can look up death of golfing.
Also unsure what you mean by...'how golfing works'. This is the economics of it, not the game
Golfing has had a huge surge in popularity since 2020. Prices are going up but courses aren't closing.
Maybe its CA thing? Plenty of $50 golf courses here in Phoenix.
I think they will have to figure out how to get around context limits before that happens. I also wouldn't be surprised if the future models that can actually replace workers are sold at such an exorbitant price that only larger companies will be able to afford it. Everyone else gets access to less capable models that still require someone with knowledge to get to an end result.
Well, considering they floated the $2000 subscription idea, and they still haven't revealed everything, they could still introduce the $2k sub with o3+agents/tool use, which means, till about next week.
If it’s any consolation, Agile priests and middle managers will be the first to go
See ya Cindy.
[deleted]
[dead]
[dead]
[dead]
[deleted]
[deleted]
[dead]
[deleted]
[deleted]
Great. Now we have to think of a new way to move the goalposts.
This is just as silly as claiming that people "moved the goalposts" when a computer beat Kasparov at chess to claim that it wasn't AGI: it wasn't a good test and some people only realize this after the computer beat Kasparov but couldn't do much else. In this case the ARC maintainers specifically have stated that this is a necessary but not sufficient test of AGI (I personally think it is neither).
It's not silly. The computer that could beat Kasparov couldn't do anything else so of course it wasn't Artificial General Intelligence.
o3 can do much much more. There is nothing narrow about SOTA LLMs. They are already General. It doesn't matter what ARC Maintainers have said. There is no common definition of General that LLMs fail to meet. It's not a binary thing.
By the time a single machine covers every little test humanity can devise, what comes out of that is not 'AGI' as the words themselves mean but a General Super Intelligence.
It is silly, the logic is the same: "Only a (world-altering) 'AGI' could do [test]" -> test is passed -> no (world-altering) 'AGI' -> conclude that [test] is not a sufficient test for (world-altering) 'AGI' -> chase new benchmark.
If you want to play games about how to define AGI go ahead. People have been claiming for years that we've already reached AGI and with every improvement they have to bizarrely claim anew that now we've really achieved AGI. But after a few months people realize it still doesn't do what you would expect of an AGI and so you chase some new benchmark ("just one more eval").
The fact is that there really hasn't been the type of world-altering impact that people generally associate with AGI and no reason to expect one.
>It is silly, the logic is the same: "Only a (world-altering) 'AGI' could do [test]" -> test is passed -> no (world-altering) 'AGI' -> conclude that [test] is not a sufficient test for (world-altering) 'AGI' -> chase new benchmark.
Basically nobody today thinks beating a single benchmark and nothing else will make you a General Intelligence. As you've already pointed out out, even the maintainers of ARC-AGI do not think this.
>If you want to play games about how to define AGI go ahead.
I'm not playing any games. ENIAC cannot do 99% of the things people use computers to do today and yet barely anybody will tell you it wasn't the first general purpose computer.
On the contrary, it is people who seem to think "General" is a moniker for everything under the sun (and then some) that are playing games with definitions.
>People have been claiming for years that we've already reached AGI and with every improvement they have to bizarrely claim anew that now we've really achieved AGI.
Who are these people ? Do you have any examples at all. Genuine question
>But after a few months people realize it still doesn't do what you would expect of an AGI and so you chase some new benchmark ("just one more eval").
What do you expect from 'AGI'? Everybody seems to have different expectations, much of it rooted in science fiction and not even reality, so this is a moot point. What exactly is World Altering to you ? Genuinely, do you even have anything other than a "I'll know it when i see it ?"
If you introduce technology most people adopt, is that world altering or are you waiting for Skynet ?
> Basically nobody today thinks beating a single benchmark and nothing else will make you a General Intelligence.
People's comments, including in this very thread, seem to suggest otherwise (c.f. comments about "goal post moving"). Are you saying that a widespread belief wasn't that a chess playing computer would require AGI? Or that Go was at some point the new test for AGI? Or the Turing test?
> I'm not playing any games... "General" is a moniker for everything under the sun that are playing games with definitions.
People have a colloquial understanding of AGI whose consequence is a significant change to daily life, not the tortured technical definition that you are using. Again your definition isn't something anyone cares about (except maybe in the legal contract between OpenAI and Microsoft).
> Who are these people ? Do you have any examples at all. Genuine question
How about you? I get the impression that you think AGI was achieved some time ago. It's a bit difficult to simultaneously argue both that we achieved AGI in GPT-N and also that GPT-(N+X) is now the real breakthrough AGI while claiming that your definition of AGI is useful.
> What do you expect from 'AGI'?
I think everyone's definition of AGI includes, as a component, significant changes to the world, which probably would be something like rapid GDP growth or unemployment (though you could have either of those without AGI). The fact that you have to argue about what the word "general" technically means is proof that we don't have AGI in a sense that anyone cares about.
>People's comments, including in this very thread, seem to suggest otherwise (c.f. comments about "goal post moving").
But you don't see this kind of discussion on the narrow models/techniques that made strides on this benchmark, do you ?
>People have a colloquial understanding of AGI whose consequence is a significant change to daily life, not the tortured technical definition that you are using
And ChatGPT has represented a significant change to the daily lives of many. It's the fastest adopted software product in history. In just 2 years, it's one of the top ten most visited sites on the planet worldwide. A lot of people have had the work they do significant change since its release. This is why I ask, what is world altering ?
>How about you? I get the impression that you think AGI was achieved some time ago.
Sure
>It's a bit difficult to simultaneously argue both that we achieved AGI in GPT-N and also that GPT-(N+X) is now the real breakthrough AGI
I have never claimed GPT-N+X is the "new breakthrough AGI". As far as I'm concerned, we hit AGI sometime ago and are making strides in competence and/or enabling even more capabilities.
You can recognize ENIAC as a general purpose computer and also recognize the breakthroughs in computing since then. They're not mutually exclusive.
And personally, I'm more impressed with o3's Frontier Math score than ARC.
>I think everyone's definition of AGI includes, as a component, significant changes to the world
Sure
>which probably would be something like rapid GDP growth or unemployment
What people imagine as "significant change" is definitely not in any broad agreement.
Even in science fiction, the existence of general intelligences more competent than today's LLMs does not necessarily precursor massive unemployment or GDP growth.
And for a lot of people, the clincher stopping them from calling a machine AGI is not even any of these things. For some, that it is "sentient" or "cannot lie" is far more important than any spike of unemployment.
> But you don't see this kind of discussion on the narrow models/techniques that made strides on this benchmark, do you ?
I don't understand what you are getting at.
Ultimately there is no axiomatic definition of the term AGI. I don't think the colloquial understanding of the word is what you think it is (i.e. if you had described to people, pre-chatgpt, today's chatgpt behavior, including all the limitations and failings and the fact that there was no change in GDP, unemployment, etc), and asked if that was AGI I seriously doubt they would say yes.)
More importantly I don't think anyone would say their life was much different from a few years ago and separately would say under AGI it would be.
But the point that started all this discussion is the fact that these "evals" are not good proxies for AGI and no one is moving goal-posts even if they realize this fact only after the tests have been beaten. You can foolishly define AGI as beating ARC but the moment ARC is beaten you realize that you don't care about that definition at all. That doesn't change if you make a 10 or 100 benchmark suite.
>I don't understand what you are getting at.
If such discussions only made when LLMs make strides in the benchmark then it's not just about beating the benchmark but also what kind of system is beating it.
>You can foolishly define AGI as beating ARC but the moment ARC is beaten you realize that you don't care about that definition at all.
If you change your definition of AGI the moment a test is beaten then yes, you are simply post moving.
If you care about other impacts like "Unemployment" and "GDP rising" but don't give any time or opportunity to see if the model is capable of such then you don't really care about that and are just mindlessly shifting posts.
How do such a person know o3 won't cause mass unemployment? The model hasn't even been released yet.
> If such discussions only made when LLMs make strides in the benchmark then it's not just about beating the benchmark but also what kind of system is beating it.
I still don't understand the point you are making. Nobody is arguing that discrete program search is AGI (and the same counter-arguments would apply if they did).
> If you change your definition of AGI the moment a test is beaten then yes, you are simply post moving.
I don't think anyone changes their definition, they just erroneously assume that any system that succeeds on the test must do so only because it has general intelligence (that was the argument for chess playing for example). When it turns out that you can pass the test with much narrower capabilities they recognize that it was a bad test (unfortunately they often replace the bad test with another bad test and repeat the error).
> If you care about other impacts like "Unemployment" and "GDP rising" but don't give any time or opportunity to see if the model is capable of such then you don't really care about that and are just mindlessly shifting posts.
We are talking about what models are doing now (is AGI here now) not what some imaginary research breakthroughs might accomplish. O3 is not going to materially change GDP or unemployment. (If you are confident otherwise please say how much you are willing to wager on it).
I'm not talking about any imaginary research breakthroughs. I'm talking about today, right now. We have a model unveiled today that seems a large improvement across several benchmarks but hasn't been released yet.
You can be confident all you want but until the model has been given the chance to not have the effect you think it won't then it's just an assertion that may or may not be entirely wrong.
If you say "this model passed this benchmark I thought would indicate AGI but didn't do this or that so I won't acknowledge it" then I can understand that. I may not agree on what the holdups are but I understand that.
If however you're "this model passed this benchmark I thought would indicate AGI but I don't think it's going to be able to do this or that so it's not AGI" then I'm sorry but that's just nonsense.
My thoughts or bets are irrelevant here.
A few days ago I saw someone seriously comparing a site with nearly 4B visits a month in under 2 years to Bitcoin and VR. People are so up in their bubbles and so assured in their way of thinking they can't see what's right in front of them, nevermind predict future usefulness. I'm just not interested in engaging "I think It won't" arguments when I can just wait and see.
I'm not saying you are one of such people. I just have no interest in such arguments.
My bet ? There's no way i would make a bet like that without playing with the model first. Why would I ? Why Would you ?
> I'm not talking about any imaginary research breakthroughs. I'm talking about today, right now.
I explicitly said so was I. I said today we don’t have large impact societal changes that people have conventionally associated with the term AGI. I also explicitly talked about how I don’t believe o3 will change this and your comments seem to suggest neither do you (you seem to prefer to emphasize that it isn’t literally impossible that o3 will make these transformative changes).
> If however you're "this model passed this benchmark I thought would indicate AGI but I don't think it's going to be able to do this or that so it's not AGI" then I'm sorry but that's just nonsense.
The entire point of the original chess example was to show that in fact it is the correct reaction to repudiate incorrect beliefs of naive litmus test of AGI-ness. If we did what you are arguing then we should accept AGI having occurred after chess was beaten because a lot of people believed that was the litmus test? Or that we should praise people who stuck to their original beliefs after they were proven wrong instead of correcting them? That’s why I said it was silly at the outset.
> My thoughts or bets are irrelevant here
No they show you don’t actually believe we have society transformative AGI today (or will when o3 is released) but get upset when someone points that out.
> I'm just not interested in engaging "I think It won't" arguments when I can just wait and see.
A lot of life is about taking decisions based on predictions about the future, including consequential decisions about societal investment, personal career choices, etc. For many things there isn’t a “wait and see approach”, you are making implicit or explicit decisions even by maintaining the status quo. People who make bad or unsubstantiated arguments are creating a toxic environment in which those decisions are made, leading personal and public harm. The most important example of this is the decision to dramatically increase energy usage to accommodate AI models despite impending climate catastrophe on the blind faith that AI will somehow fix it all (which is far from the “wait and see” approach that you are supposedly advocating by the way, this is an active decision).
> My bet ? There's no way i would make a bet like that without playing with the model first. Why would I ? Why Would you ?
You can have beliefs based on limited information. People do this all the time. And if you actually revealed that belief it would demonstrate that you don’t actually currently believe o3 is likely to be world transformative
>You can have beliefs based on limited information. People do this all the time. And if you actually revealed that belief it would demonstrate that you don’t actually currently believe o3 is likely to be world transformative
Cool...but i don't want to in this matter.
I think the models we have today are already transformative. I don't know if o3 is capable of causing sci-fi mass unemployment (for white collar work) and wouldn't have anything other than essentially a wild guess till it is released. I don't want to make a wild guess. Having beliefs on limited information is often necessary but it isn't some virtue and in my opinion should be avoided when unnecessary. It is definitely not necessary to make a wild guess about model capabilities that will be released next month.
>The entire point of the original chess example was to show that in fact it is the correct reaction to repudiate incorrect beliefs of naive litmus test of AGI-ness. If we did what you are arguing then we should accept AGI having occurred after chess was beaten because a lot of people believed that was the litmus test?
Like i said, if you have some other caveats that weren't beaten then that's fine. But it's hard to take seriously when you don't.
> But you don't see this kind of discussion on the narrow models/techniques that made strides on this benchmark, do you ?
This model was trained to pass this test, it was trained heavily on the example questions, so it was a narrow technique.
We even have proof that it isn't AGI, since it scores horribly on ARC-AGI 2. It overfitted for this test.
>This model was trained to pass this test, it was trained heavily on the example questions, so it was a narrow technique.
You are allowed to train on the train set. That's the entire point of the test.
>We even have proof that it isn't AGI, since it scores horribly on ARC-AGI 2. It overfitted for this test.
Arc 2 does not even exist yet. All we have are "early signs", not that that would be proof of anything. Whether I believe the models are generally intelligent or not doesn't depend on ARC
> You are allowed to train on the train test. That's the entire point of the test.
Right, but by training on those test cases you are creating a narrow model. The whole point of training questions is to create narrow models, like all the models we did before.
That doesn't make any sense. Training on the train set does not make the models capabilities narrow. Models are narrow when you can't train them to do anything else even if you wanted to.
You are not narrow for undergoing training and it's honestly kind of ridiculous to think so. Not even the ARC maintainers believe so.
> Training on the train set does not make the models capabilities narrow
Humans didn't need to see the training set to pass this, the AI needing it means it is narrower than the humans, at least on these kind of tasks.
The system might be more general than previous models, but still not as general as humans, and the G in AGI typically means being as general as humans. We are moving towards more general models, but still not at the level where we call them AGI.
Let's just define AI as "whatever computers still can't do." That'll show those dumb statistical parrots!
[deleted]
Well right now, running this model is really expensive, but we should prepare a new cope for when equivalent models no longer are, ahead of time.
Ya getting costs down will be the big one, i imagine quantization, distillation and lots and lots of improvements on the compute side both hardware and software wise.
I mean, what else do you call learning?
[deleted]
[flagged]
[flagged]
[deleted]
[flagged]
> most of the people training these next-gen AIs are neurodiverse
Citation needed. This is a huge claim based only on stereotype.
So true. Perhaps I'm just thinking it's my people and need to update my priors.
> most of the people training these next-gen AIs are neurodiverse and we are training the AI in our own image
Do you have any evidence to support that? It would be fascinating if the field is primarly advancing due to a unique constellation of traits contributed by individuals who, in the past, may not have collaborated so effectively.
PURELY Anecdotal. But I'll say that as of 2024 1 in 36 US children are diagnosed on the spectrum according to the CDC(!), which would mean if you met 10 AI researchers and 4 were neurodivergent you'd reasonably expect that it's a higher-than-population average representation. I'm polling from the Effective Altruist AI folks in my mind, and the number is definitely, definitely higher than 4/10.
Are there non-Effective Altruist AI folks?
I love how this might mean "non-Effective", non-"Effective Altruist" or non-"Effective Altruist AI" folks.
Yes
[deleted]
Great results. However, let's all just admit it.
It has well replaced journalists, artists and on its way to replace nearly both junior and senior engineers. The ultimate intention of "AGI" is that it is going to replace tens of millions of jobs. That is it and you know it.
It will only accelerate and we need to stop pretending and coping. Instead lets discuss solutions for those lost jobs.
So what is the replacement for these lost jobs? (It is not UBI or "better jobs" without defining them.)
> It has well replaced journalists, artists and on its way to replace nearly both junior and senior engineers.
Did it, really? Or did it just provide automation for routine no-thinking-necessary text-writing tasks, but is still ultimately completely bound by the level of human operator's intelligence? I strongly suspect it's the latter. If it had actually replaced journalists it must be junk outlets, where readers' intelligence is negligible and anything goes.
Just yesterday I've used o1 and Claude 3.5 to debug a Linux kernel issue (ultimately, a bad DSDT table causing TPM2 driver unable to reserve memory region for command response buffer, the solution was to use memmap to remove NVS flag from the relevant regions) and confirmed once again LLMs still don't reason at all - just spew out plausible-looking chains of words. The models were good listeners, and a mostly-helpful code generators (when they didn't make silliest mistakes), but they gave no traces of understanding and no attention for any nuances (e.g. LLM used `IS_ERR` to check `__request_resource` result, despite me giving it full source code for that function and there's even a comment that makes it obvious it returns a pointer or NULL, not an error code - misguided attention kind of mistake).
So, in my opinion, LLMs (as currently available to broad public, like myself) are useful for automating away some routine stuff, but their usefulness is bounded by the operator's knowledge and intelligence. And that means that the actual jobs (if they require thinking and not just writing words) are safe.
When asked about what I do at work, I used to joke that I just press buttons on my keyboard in fancy patterns. Ultimately, LLMs seem to suggest that it's not what I really do.
The economic theory answer is that people simply switch to jobs that are not yet replaceable by AI. Doctors, nurses, electricians, construction workers, police officers, etc. People in aggregate will produce more, consume more and work less.
>Doctors capped per year by law.
>Trades people only work after having something to do. If you don't have sufficient demand for builders, electricians, plumbers, etc... No one can afford to become one. Nevermind the fact that not everyone should be any of those things. Economics fails when the loop fails to close.
> Doctors
Many replaceable
> Police officers
Many replaceable (desk officers)
When none of us have jobs or income, there will be no ability for us to buy products. And then no reason for companies to buy ads to sell products to people who don’t have money. Without ad money (or the potential of future ad money), the people pushing the bounds of AGI into work replacement will lose the very income streams powering this research and their valuations.
Ford didn’t support a 40 hour work week out of the kindness of his heart. He wanted his workers to have time off for buying things (like his cars).
I wonder if our AGI industrialist overlords will do something similar for revenue sharing or UBI.
> When none of us have jobs or income, there will be no ability for us to buy products. And then no reason for companies to buy ads to sell products to people who don’t have money. Without ad money (or the potential of future ad money), the people pushing the bounds of AGI into work replacement will lose the very income streams powering this research and their valuations.
I don't think so. I agree the push for AGI will kill the modern consumer product economy, but I think it's quite possible for the economy to evolve into a new form (that will probably be terrible for most humans) that keep pushes "work replacement."
Imagine, an AGI billionare buying up land, mines, and power plants as the consumer economy dies, then shifting those resources away from the consumer economy into self-aggrandizing pet projects (e.g. ziggurats, penthouses on Mars, space yachts, life extension, and stuff like that). He might still employ a small community of servants, AGI researchers, and other specialists; but all the rest of the population will be irrelevant to him.
And individual autarky probably isn't necessary, consumption will be redirected towards the massive pet production I mentioned, with vestigial markets for power, minerals, etc.
This picture doesn't make sense. If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.
In reality, if there really is mass unemployment, AI driven automation will make consumables so cheap that anyone will be able to buy it.
> If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.
This isn't possible if you want to pay sales taxes - those are what keep transactions being done in the official currency. Of course in a world of 99% unemployment presumably we don't care about this.
But yes, this world of 99% unemployment isn't possible, eg because as soon as you have two people and they trade things, they're employed again.
> This picture doesn't make sense. If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.
Ultimately, it all comes down to raw materials and similar resources, and all those will be claimed by people with lots of real money. Your "invented ... other money" will be useless to buy that fundamental stuff. At best, it will be useful for trading scrap and other junk among the unemployed.
> In reality, if there really is mass unemployment, AI driven automation will make consumables so cheap that anyone will be able to buy it.
No. Why would the people who own that automation want to waste their resources producing consumer goods for people with nothing to give them in return?
if people with AI use it to somehow enclose all raw resources, then yes - the picture i painted will be wrong
Enclosing raw resources tends to be what powerful people do.
"Raw resources" aren't that valuable economically because they aren't where most of the value is added in production. That's why having a lot of them tends to make your country poorer (https://en.wikipedia.org/wiki/Resource_curse).
Today educated humans are more valuable than anything else on earth, but AGI changes that. With cheap AGI raw resources and infrastructure will be the only two valuable things left.
> This picture doesn't make sense. If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.
Uh, this picture doesn’t make sense. Why would anyone value this randomly invented money?
> Why would anyone value this randomly invented money?
Because they can use it to pay for goods?
Your notion is that almost everyone is going to be out of a job and thus have nothing. Okay, so I'm one of those people and I need this house built. But I'm not making any money because of AI or whatever. Maybe someone else needs someone to drive their aging relative around and they're a good builder.
If 1. neither of those people have jobs or income because of AI 2. AI isn't provisioning services for basically free,
then it makes sense for them to do an exchange of labor - even with AI (if that AI is not providing services to everyone). The original reason for having money and exchanging it still exists.
You seem to be arguing that large unemployment rates are logically impossible, so we shouldn't worry about unemployment.
The fact unemployment was 25% during the great depression would seem to suggest that at a minimum, a 25% unemployment rate is possible during a disruptive event.
The unemployment rate in a modern economy is basically whatever the central bank wants it to be. The Great Depression was caused by bad monetary policy - I don't see a reason why having AI would cause that.
The person upthread was saying that as long as someone wants a house built and someone wants a grandma driven around unemployment can't happen.
Unless nobody wanted either of those things done during the depression that's clearly not a very good mental model.
Yes, I disagree with that. The problem isn't the lack of demand, it's that the people with the demand can't get the money to express it with.
Didn't money basically only emerge to deal with with difficulty of “double coincidence of wants”. Money simply solved the problem of making all forms of value interchangeable and transportable across time AND circumstance? A dollar can do with with or without AI existing no?
Yes, that's my point
Honestly I don’t even know how to engage with your point.
Yes if we recreate society some form of money would likely emerge.
Fine, you exchange labor and the builder agrees to build you a house given materials for transport for granny.
Where are you getting gas/house materials from? No hand waves please. Show all work.
Do you follow Jack Clark? I noticed he's been on the road a lot talking to governments and policy makers, and not just in the "AI is coming" way he used to talk.
Maybe I'm missing something vital, but how does anything that we've seen AI do up until this point or explained in this experiment even hint at AGI? Can any of these models ideate? Can they come up with technologies and tools? No and it's unlikely they will any time soon. However, they can make engineers infinitely more productive.
You need to define ideate, tools and technologies to answer those questions. Not to mention that it's quite possible humans do those things through re-combination of learned ideas similarly to how these reasoning models are suggested to be working.
Every technological advancement that we've seen in software engineering - be it in things like Postgres, Kubernetes and Cloud Infrastructure - came out from truly novel ideas. AI seems to generate outputs that appear novel but are they really? It's capable of synthesizing and combining vast amounts of information in creative ways but it's deriving everything from existing patterns found within its training data. Truly novel ideas require thinking outside the box. It's combination of cognitive, emotional and environmental factors which go beyond pattern recognition. How close are we to achieving this? Everyone seems to be shaking in their boots because we might lose our job safety in tech, but I don't see any intelligence here.
Another meaningless benchmark, another month—it’s like clockwork at this point. No one’s going to remember this in a month; it’s just noise. The real test? It’s not in these flashy metrics or minor improvements. The only thing that actually matters is how fast it can wipe out the layers of middle management and all those pointless, bureaucratic jobs that add zero value.
That’s the true litmus test. Everything else? It’s just fine-tuning weights, playing around the edges. Until it starts cutting through the fat and reshaping how organizations really operate, all of this is just more of the same.
So far AI market seems to be focused on replacing meaningful jobs, meaningless ones look safe (which kind of makes sense if you think about it).
Its a common view from the "do'ers" (the people who made most of the value in the past; the hard workers, etc) that this will make management redundant. Sadly with a basic understanding of economics you can see this is probably wrong. The "do'ers" have given more power to the management class at their own expense with this solution - if I can get the AI to "do" all I need are the people who "decide what to do". Market power belongs with scarcity - all else being equal AI makes the barrier to development smaller meaning less scarcity on that side. In general technology developments have increased inequality especially since the 90's onwards.
Generally with AI think the top of society stand to gain a lot more than the middle/bottom of it for a whole host of reasons. If you think anything different your framework you use to make your conclusion is probably wrong at least in IMO.
I don't like saying this but there is a reason why the "AI bros", VC's, big tech CEO's, etc are all very very excited about this and many employees (some commenting here) are filled with dread/fear. The sales people, the managers, the MBA's, etc stand to gain a lot from this. Fear also serves as the best marketing tool; it makes people talk and spread OpenAI's news more so than everything else. Its a reason why targeting coding jobs/any jobs is so effective. I want to be wrong of course.
Agreed, but isn't it management who decides that this would be implemented? Are they going to propogate their own removal?
Middle manager types are probably interested in their salary performance more than anything. "Real" management (more of their assets come from their ownership of the company than a salary) will override them if it's truthfully the best performing operating model for the company.
The best AI on this graph costs 50000% more than a stem graduate to complete the tasks and even then has an error rate that is 1000% higher than the humans???
This is so impressive that it brings out the pessimist in me.
Hopefully my skepticism will end up being unwarranted, but how confident are we that the queries are not routed to human workers behind the API? This sounds crazy but is plausible for the fake-it-till-you-make-it crowd.
Also given the prohibitive compute costs per task, typical users won't be using this model, so the scheme could go on for quite sometime before the public knows the truth.
They could also come out in a month and say o3 was so smart it'd endanger the civilization, so we deleted the code and saved humanity!
That would be a ton of problems for a small team of PhD/Grad level experts to solve (for GPQA Diamond, etc) in a short time. Remember, on EpochAl Frontier Math, these problems require hours to days worth of reasoning by humans
The author also suggested this is a new architecture that uses existing methods, like a Monte Carlo tree search that deepmind is investigating (they use this method for AlphaZero)
I don't see the point of colluding for this sort of fraud, as these methods like tree search and pruning already exist. And other labs could genuinely produce these results
I had the ARC AGI in mind when I suggested human workers. I agree the other benchmark results make the use of human workers unlikely.
I'm very confident that queries were not routed to human workers behind the API.
Possibly some other form of "make it seem more impressive than it is," but not that one.
this is an impressive tinfoil take. but what would be their plan in the medium term? like once they release this people can check their data
How can people check their data?
In the medium term the plan could be to achieve AGI, and then AGI would figure out how to actually write o3. (Probably after AGI figures out the business model though: https://www.reddit.com/r/MachineLearning/s/OV4S2hGgW8)