Why is the Operator/visual workflow data important? The DeepSeek R1 paper, to my mind, argued that R1-Zero's impact, self-reinforcement on evaluatable objectives, mattered more than the human chain-of-thought SFT on top. Models with richer reasoning-chain data are still better, but why Operator, then? My intuition is that the data on Airbnb's feedback form page is low signal, especially when you lack the primitives on how it's built. Am I missing something?
It's a good Q! I wrote in my previous blog post about R1: "It’s worth noting that there are still many problems that require human guidance, especially if success is not something that’s easy to “programmatically verify”. AKA: is there a way that the model can easily check its answers? Or do you need a human to tell it if it’s correct because there’s not a clear, obvious answer?"
For tasks with clear verifiable answers (like math), DeepSeek demonstrated that you don't need to develop a reward model trained on human data. But for problems that you can't programmatically verify (problems with subjective answers, for example), you still need a way to give the model a signal on what correct looks like so that it can self-verify its work. I'm suggesting that exposure to what people are doing on their screens might be one way to show that.
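To make that distinction concrete, here is a minimal sketch of what "programmatically verifiable" means in practice. The `\boxed{}` answer format and both function names are illustrative assumptions, not anything taken from the R1 paper or the blog post:

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Verifiable reward: pull out the final boxed answer and compare it
    exactly against the reference. No human label or learned reward model
    is needed -- the check itself is the training signal."""
    match = re.search(r"\\boxed\{(.+?)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def subjective_reward(model_output: str) -> float:
    """Non-verifiable task (e.g. "make this feedback form friendlier"):
    there is no reference answer to compare against, so the signal has to
    come from elsewhere -- human labels, a learned reward model, or traces
    of what people actually do on their screens."""
    raise NotImplementedError("no programmatic check exists for this task")

print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```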
Agreed, but there is process (reasoning) and there are outcomes. For coding, what Operator sees is the outcome. For agents, what Operator sees is the process. So you could reinforce the process and get a better agent for clicking things in a UI, but your claim is that this enables an AI-Quantity app future. Here the outcomes are low signal because they do not contain why decisions were made. My intuition, though, is that coding itself can be self-evaluated just like maths, through self-generated test-driven development. Still, this would not embed taste; I am just not sure that Operator can infer it visually, if that is what you mean. I think, for now, you need a company to provide a stopgap or fine-tuning.
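As an aside on what "self-generated test-driven development" could look like mechanically, here is a minimal sketch; `run_self_generated_tests` and the toy `add` example are hypothetical, and a real setup would sandbox execution rather than running model-written code directly:

```python
import subprocess
import sys
import tempfile

def run_self_generated_tests(candidate_code: str, generated_tests: str) -> float:
    """Self-generated TDD as a reward: the model writes an implementation
    *and* its own tests; a clean exit of the combined file is the
    verification signal. Whether those tests capture real intent (or
    'taste') is exactly the gap discussed above."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + generated_tests + "\n")
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    return 1.0 if result.returncode == 0 else 0.0

# Toy usage with hypothetical model outputs:
candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(run_self_generated_tests(candidate, tests))  # 1.0 if the tests pass
```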
I’m not sure if we’re on the same page here; I don’t follow what you’re saying. But if you’re able to show that all coding instructions and tasks can be self-evaluated, you’ll be ahead of the major labs!
Did you read [Competitive Programming with Large Reasoning Models](https://arxiv.org/pdf/2502.06807)? It seems to me, from reading it, that not only do the major labs know you can train an AI to generate tests to evaluate itself, and thus coding is like maths in objectivity, but also, reading between the lines, that the AI may have discovered by itself that it could generate tests to self-improve.
I guess it depends on what you mean by "all". I am also not sure what you are confused about, but perhaps this is a perfect segue into rekindling the chat we missed out on.
The information I have is coming from the major labs. I covered what you’re talking about in my last blog post on recursive improvement, including coding tasks. When I say “all”, I mean that there are still coding tasks that cannot be programmatically verified.