Pushing Operator to the Limit with Video Games

agi forecasting • ai scaling • reinforcement learning

2025-01-28 • status: in-progress • importance: high

The first version of OpenAI's computer-use agent, "Operator", was released to brief fanfare and with obvious, glaring issues.

However, it is highly capable, and its present issues seem to stem mostly from its inability to see. While this is of course a decisive issue for a computer-use agent, Operator's intended use case in this early roll-out seems to be filling out forms, and I find its ability to locate text on screen nearly flawless. It struggles instead with spatial relationships and contrasting colors. I assume OAI is using some Gato-style form of RL to generate computer-use data, combined with self-supervised training on that data.

1. Exploring Strengths
2. Simplified Montezuma's Revenge
3. In-progress sections

1. Exploring Strengths

Computer-using agents are of particular interest to me because of Metaculus's Date of Weakly General AI market, which has seemed resolved in spirit since 2020's foreboding AI flurries, chiefly GPT-3 and Agent57, both of which displayed considerable and uncanny generality despite their heavy constraints.

Over the last five years, the flurries have grown into a full-on snowstorm, with humans scrambling to assemble cognitive benchmarks AI can't solve. Yet the market remains open because of one stand-out criterion: a weakly general AI must be a unified AI system capable of learning Montezuma's Revenge in under 100 hours of real-time play. This sits alongside the more traditional criteria that it must score above the 75th percentile on the SAT, score above 90% on WinoGrande, and pass a two-hour-long Turing test. These last three have long since been saturated, and only the most fiendishly difficult Atari game remains.
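To make the shape of that resolution rule concrete, here is a minimal sketch of the criteria as a conjunction, with each status set according to my own reading above rather than any official resolution. The names `Criterion`, `unmet`, `resolves`, and `frame_budget` are hypothetical, and the 60 frames-per-second figure is an assumption about how "real-time play" converts to emulator frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    satisfied: bool  # this essay's judgment, not an official resolution


# Statuses mirror the claims in the paragraph above.
CRITERIA = [
    Criterion("SAT above the 75th percentile", True),
    Criterion("WinoGrande above 90%", True),
    Criterion("Two-hour Turing test", True),
    Criterion("Learn Montezuma's Revenge in under 100 hours of real-time play", False),
]


def unmet(criteria):
    """Return the names of criteria that are still open."""
    return [c.name for c in criteria if not c.satisfied]


def resolves(criteria):
    """The market resolves only if every criterion holds at once."""
    return all(c.satisfied for c in criteria)


ASSUMED_FPS = 60  # assumption: one second of real-time play is 60 frames


def frame_budget(hours, fps=ASSUMED_FPS):
    """Convert a real-time play budget into a number of game frames."""
    return int(hours * 3600 * fps)


if __name__ == "__main__":
    print(resolves(CRITERIA))
    print(unmet(CRITERIA))
    print(frame_budget(100))
```

Under that frame-rate assumption, the 100-hour budget works out to 21,600,000 frames, which gives a rough sense of how tight the remaining criterion is compared with exam-style benchmarks.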

My intuition is that a scaled-up Gato or frankensteined Dreamer-v3 could probably muscle its way through Montezuma's Revenge without special training, since the main challenge for RL systems of the (early) 2010s wasn't the complexity of the game mechanics but that no world models existed to predict the game's sparse rewards. With systems of the 2020s now integrating RL with world modeling from unsupervised learning, the remaining issues have dwindled to pesky engineering problems rather than a conceptual wall.

Case in point: language models show stunning out-of-distribution generalization in-context, and are even capable of episodic memory-style recall over their contexts. While teaching entirely new skills in-context remains a struggle, fine-tuning and reinforcement learning atop LMs now press at the barrier of super-human (competition) coding ability, and crush pared-down tasks that attempt to test skill acquisition.

My testing of OpenAI's Operator over the last few days suggests that the Montezuma's Revenge criterion is now satisfied per its original intention, despite Operator's limitations causing it to struggle on the exact challenge. I also test Operator on additional challenges and more exploratory environments, though these sections are in-progress.

While most benchmarks aim to push models to conditions where they fail, I instead treat Operator as having idiosyncratic visual impairments and seek the conditions that allow it to succeed.

Forecasting criteria must be precise so that they resolve fairly for participants, and benchmarks tend to swing sensationally between "total failure to reason" and enormous step changes. As a result, little attention is paid to how systems manage to learn and accomplish tasks. I mean this not from the low-level perspective of mechanistic interpretability, but in the hands-on sense of watching how AI systems attempt tasks in-context and imagining what prosthetics might let them work around their constraints.

Such results are only vaguely indicative, often contain failures for unsexy reasons, and don't compress neatly into "what the system can definitively do" or exam scores, but they reveal much more about what the system is really capable of once barriers are removed via scaling or algorithmic tweaks. Gwern's creative writing experiments and janus's loom did much more to illustrate that GPT-3 possessed "a world model" than any WinoGrande or common knowledge score. I attempt to continue exploration in that tradition here.

2. Simplified Montezuma's Revenge

I created a simple 2D platformer game of similar "abstract complexity" to the first level of Montezuma's Revenge using Claude 3.5 Sonnet, modifying the game so that Operator would be able to recognize artifacts despite its "blindness". Operator impressed me not for its ability to play, but for its ability to correct its mistakes, try new strategies, and apply my advice to update its strategies. In terms of the cognitive component of the challenge, exploring each room in a novel environment such as Montezuma's Revenge seems within the scope of its ability, though the visual components and "motor complexity" of the game seem far beyond its reach.

During the first successful trial run of my simplified game, Operator correctly identified the meaning of different game elements, such as hazards and a ladder (though the ladder's meaning is explained in the text instructions). It also, to my surprise, correctly figured out how to use the poorly defined "grapple" functionality, an in-game tool allowing the player to "swing" to other parts of the same level. It applied this to save time and used it to solve the "victory circle" puzzle at the end, though I did not explain that it had to do that. It did, however, get catastrophically stuck figuring out how to actually use the ladder, seemingly owing to its inability to see its upward progress (metaphor?).

Here is the video playthrough, provided by OpenAI. It's slightly sped up by their video export software. Operator is really slow! I did not interfere with the browser while Operator played, though it does ask me for help. The video contains a few minutes of Operator messing up while climbing the ladder, so if you skip most of it or watch it at 3x speed you won't miss anything. Operator allows the user to chat with the agent during task execution, and below is a reproduced transcript:

User: Play this game:

https://koratkar.github.io/claude-game/

Operator: [Worked for 7 minutes] I've attempted various strategies to move the black square to the victory circle, but it seems there's a limitation in reaching the green circle from the bottom level. The "GRAPPLE" button and the ladder haven't been successful. Would you like me to try something else?

User: Remember, you can't grapple across levels.

Operator: Got it! I'll focus on moving the black square within the same level. I'll try to align it with the victory circle using the available movement options.

Operator: [Worked for 1 minute] I've tried various approaches to move the black square to the victory circle, but it seems there's a challenge in ascending the ladder. Would you like me to try a different strategy or approach?

User: You fell off the ladder.

Operator: [Worked for 1 minute] I've attempted several approaches to move the black square to the victory circle, but it seems there's a challenge in ascending the ladder. How would you like me to proceed?

User: Why do you keep moving left when you go up the ladder?

Operator: I'm sorry for the confusion. I'll try ascending the ladder again and avoid moving left. I'll focus on moving directly to the victory circle once I reach the top level. Let me attempt that.

Operator: [Worked for 37 seconds] It seems there's a challenge in ascending the ladder, as the black square isn't moving up. Would you like me to try a different approach or strategy?

User: Oh, I see the issue. You can't see it going up. When you are on the ladder you are moving up, but movement is rather slow. If you are on the ladder, you can keep pressing up and be assured you are moving. You will receive visual indication of success once it says "You are on the: TOP LEVEL".

Operator: Thank you for clarifying! I'll keep pressing the "UP" button while on the ladder until I reach the top level. I'll look for the message indicating success. I'll start by moving the black square to the ladder and proceed from there.

Operator: [Worked for 32 seconds] The game is complete! The black square reached the victory circle, and the "You Won!" message is displayed. Let me know if there's anything else you'd like to do.

User: Great job!

Operator: Thank you! If you have any other tasks or need assistance, feel free to let me know.
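The ladder failure in the transcript can be illustrated with a toy model. This is only a sketch under loud assumptions: `LadderGame` is a hypothetical stand-in and not the actual game's code, the ladder height, climb speed, and "noticeable movement" threshold are made-up numbers, and the "BOTTOM LEVEL" string is my guess at the non-top status text. It contrasts two strategies: judging progress by visible movement after a few presses, and continuing to press UP until the status text reads "You are on the: TOP LEVEL".

```python
from dataclasses import dataclass


@dataclass
class LadderGame:
    """Hypothetical stand-in for the ladder section of the simplified game."""

    height: int = 40     # assumed ladder height in pixels
    climb_step: int = 1  # assumed pixels gained per UP press (slow movement)
    y: int = 0           # current height of the black square on the ladder

    def press_up(self) -> None:
        self.y = min(self.height, self.y + self.climb_step)

    def status_text(self) -> str:
        # "BOTTOM LEVEL" is an assumed label; only "TOP LEVEL" is from the transcript.
        level = "TOP LEVEL" if self.y >= self.height else "BOTTOM LEVEL"
        return f"You are on the: {level}"


def climb_by_eye(game: LadderGame, presses: int = 5, min_visible_delta: int = 10) -> bool:
    """Press UP a few times and report whether any movement was noticeable.

    Models an agent that cannot perceive small position changes: if the square
    moved less than `min_visible_delta` pixels, it concludes it is stuck.
    """
    start = game.y
    for _ in range(presses):
        game.press_up()
    return game.y - start >= min_visible_delta


def climb_by_status(game: LadderGame, max_presses: int = 200):
    """Keep pressing UP until the status text confirms the top level.

    Returns the number of presses used, or None if the budget runs out.
    """
    for presses in range(1, max_presses + 1):
        game.press_up()
        if "TOP LEVEL" in game.status_text():
            return presses
    return None


if __name__ == "__main__":
    print(climb_by_eye(LadderGame()))     # False: 5 px of progress goes unnoticed
    print(climb_by_status(LadderGame()))  # 40: reaches the top by trusting the text
```

The point of the contrast is that the second strategy swaps a perception Operator is bad at (small spatial changes) for one it is good at (reading on-screen text), which is roughly the prosthetic my hint supplied.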

3. In-progress sections

Board games
Social simulator games
Drawing
    Expressive drawing: drawing to express emotions or ideas
    Skill acquisition: drawing from instructions

Thanks to Bjartur Tomas for research funding to test Operator.
