Macroeconomics
productivity
- Electrification: early motors just replaced the windmill/large crankshaft. Productivity improvements came only after process/reorg changes unleashed things
- Productivity J curve — extra effort up front = a measured decline before the gains show up
- [a] How does this look with AI engineering
- 20-30% gains from implementations
- Low productivity future: “patent thicket” = reluctant to train models
- remote workers? “offshoring is a waystation on the road to automation”
- pandemic: remote work went from 16% pre to 50% at peak, dropped back to 33%
inequality
- deaths of despair up ([a] but what’s baseline?)
- is AI a complement or a substitute?
- long tail of tasks: ML does better with more data; humans better at improv, novelty. division of labor: humans take exceptions
- [a] applying to my work:
industrial concentration
- $10b or $100b models — will AI be winner takes all?
- Hayek: “use of knowledge in society”. Need people with relevant info & incentives. but relevant info is very dispersed
- if you know 18% of trucks are half-empty, it doesn’t help — need to know the specific shop
- need decentralized market (vs all in moscow)
- Sam Walton: Walmart HQ has a bunch of data, giant ML models to figure it out
- share of mom & pop shops is shrinking; centralization of data is winning in the market. Maybe Bentonville knows more than the local shopkeeper
- Also other utilities [a] like big tech?
Economics of Transformative AI — focus on what AI does for the economy; spent 3 days thinking about what the world would look like at industrial-revolution levels of impact. Trying to pull together a group thinking about the econ
Q: How could AI help with econ research?
- Large scale simulation with LLM agents
- designing research agenda. “undermine AI” organizes, or NotebookLM
Q: What’s the future economy?
- UBI? idk
- not worried about loss of meaning, but worried about loss of power
- at the mercy of Sama/DJT; could we continue to empower folks
Q: This model assumes capital being human owned. When AI is owning capital and consuming — how much of econ overton window exists?
- Rich Sutton — these are our mindchildren, we’re just stepping stones
- Q: modeling perspective: are you talking about the machines helping us, or
- More interested in descriptive picture
- truth terminal act as representative of AI, kind of a conceptual art project
- Can imagine capital/labor share
- What does it look like with a nonhuman process (AI)
Q: Better centralized economy with AI?
- worth reopening — Hayek, soviets couldn’t do this
- fears: 2030, might come up differently
- [a] … why does this matter, dissolve the question.
Cursor
- Better tool calling
- Grand project: unification. Many different components. Chat,
- Composer, recently popular: letting model write out code & apply it
- Don’t want to be typing out which files to edit or which files for context — “btw, you should know in the last hour I was doing this.” Too tedious; should be baked in
- Going out today:
- lots more searches — “agent mode”
- Other fun feature: Bug finder, not meant to be found yet
- Costs $4
- Usually finds 1-2 bugs on every PR that Sualeh likes to fix
- initially: do you want to spend $15-20 on this PR. Is that worth it?
- But now: starting out, 10 PRs in a row each had a bug; gotten humbled, tend to read the blob of text
- First 2-3 bugs are somewhat reasonable
- model doing tool calls, and helping find bugs, were both in v0 cursor
- almost immediately, we cut them out for not being useful. either model was too stupid, or bugs couldn’t be found — artifacts of that moment in time
- Mistake 1: be super-ready to re-evaluate based on where the models are today. There are some features that sound reasonable but don’t work
- Bugs: believed that models didn’t have enough training data
- Didn’t re-evaluate until 1mo ago. Maybe general question:
- Maybe easy solution: run the evals, see eval score go up?
- Naive problem: evals are super tied to the implementation you wrote down
- e.g. if you wrote an eval for 10k tokens (arbitrary constraints that no longer exist)
- Q: Tend to believe in discontinuities but can’t prove it. Saw a jump with Sonnet. Internet benchmarks say continuous; personal experience is discontinuous
- bug finder uses o1
- At cursor, half the job is looking at UX
- Other half is training models, now training to improve false positive rate
- Not trying to compete with Sonnet, but useful to finetune for task specific
- Deepseek models have worked really well, great base models, good for many tasks can just do a lot
- Point 2: Epistemology, debates are hard because you have to reevaluate every 3 months
- 15mo ago you evaluated something; I tried it for 1mo and it didn’t work — “you’re totally wrong”
- And then something underlying in the world changed
- List: “cursor graveyard”. Every idea we deleted code for
- Checking if what can be awoken from the dead, sure there will be more
- Hard to send a new employee to graveyard idea 33. But back then the models only had 8k context windows
- It’s really hard — trying to convey how hard to do
- Do you test 5 models, see the percentage, see if they get over the edge? or more ad hoc
- Sualeh: Combo, free-for-all, very vibe-based
- Wished there was a more systematic way
- Frontier seems like deepseek, o1, sonnet 3.5
- Sualeh: as of this moment in time
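The "more systematic" comparison wished for above could be as simple as a fixed task set and pass rates per model. Everything below is a hypothetical sketch — `run_model`, the model names, and the tasks are stand-ins, not Cursor's actual harness:

```python
# Hypothetical sketch of a "more systematic" comparison: run each
# candidate model over a fixed task set and report pass rates, instead
# of a vibe-based free-for-all. `run_model` is a stand-in for whatever
# inference call you actually have.

def run_model(model: str, task: str) -> bool:
    """Stand-in: return True if `model` solved `task`."""
    raise NotImplementedError  # plug in real inference here

def compare(models, tasks, run=run_model):
    """Return {model: fraction of tasks passed}."""
    return {m: sum(run(m, t) for t in tasks) / len(tasks) for m in models}

# Dummy runner for illustration only — real scores come from `run_model`.
scores = compare(
    ["deepseek", "o1", "sonnet-3.5"],
    ["task-1", "task-2"],
    run=lambda m, t: True,
)
```

The eval-tied-to-implementation trap still applies: the task set itself encodes assumptions (context lengths, file sizes) that can silently expire as models change.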
- Point 3: Pricing. Seems inevitable there has to be a pricing flip for product builders
- No one wants to be the first, but there’s no way you can be a flat rate per month
- $5 per PR — who will be the first to succumb to pricing change
- Ning: turn to enterprise
- Sualeh: don’t care. I want general public to be able to do it, seems like a general public question.
- Owain: Why is it $5?
- Details matter. But as a human, evaluating a PR, analyzing code paths, seeing if all reasonable paths don’t degrade or you said “X” but you did “not X”
- In humans: either have it cached in head, or have to go read it
- usage-based pricing, vs charging everyone, and then the vendor finds cheaper ways to serve
- Sualeh: worries that usage-based pricing decreases the incentive to improve
- statement: Also see a base rate that covers 90%, and then à la carte
- If I’m tired, cursor goes off the rails, no higher-level executive function
- Sualeh: running joke, all files get big enough that composer can’t edit them
- Keep adding stuff to the file; you composer it, now you need to understand it. Then
- [a] Cursor refactors?
- How do you break strange loops — I did the thing for you?
- Any evals for customer
- How do you construct evals to capture actual user experience?
- Sualeh: Hard
- Point 4: Cursor principle: make it interactive and fast
- Worry: this is hard. Models are not used to being interrupted, would love to figure out how to interrupt and still make things much faster
- Inevitably, looking at 32 things
- Autocomplete: making fast is important. Same with apply
- Underrated property of sonnet: it’s fast
- [ning] Supermaven, acquired, was it because it was fast?
- Theoretical plan to integrate. Future autocomplete plan: do perfect jumps in current file
- Objective: How many tabs in a row can you do. 30?
- Artificial restrictions: after it’s done, I suspect your next edit is on line 34, I propose you go there
- Or: model suspects edit is on another file
- Owain Evans: Are people building real stuff?
- Ning: Lots of magical moments, UX has been really good
- Friend tried a bit, but got stuck pretty fast
- Sualeh: “For the most part, we wrote it for ourselves”
- and the 10 people in the office
- Has been some effort to make it good for the general engineer
- Side effect (unplanned) — popular with people who didn’t know coding
- But this was never the initial intention. Happy accident
- 9yo coding video.
- Probably more general than expect. Photoshop is built for the expert
- One fallacy: if you just build for the beginner, you’re building for no one. You don’t know how to use it
- beginners want to become experts
- Pro tools: want to be the most approachable tool for experts
- that said, suspect all the web-dev specific ones will be successful
- Devops concerns: have some code, trying to use tools outside of the editor itself. these are more problematic
- Sualeh: Look, before kubernetes, just talk about debugging. Models can’t help debug right now
- Graveyard number K/L/M are around helping people debug code
- Some fundamental things which make it hard
- SOTA: Let models console log & run code
- Jarred from Bun: export runtime errors on a websocket/port to the model, so model can see runtime errors as part of some hot reload loop as if they are proper errors
- These things are hard, can get infinitely difficult
- [a] separation of human/LLM
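The Bun idea above — exporting runtime errors to the model as part of the hot-reload loop — can be reduced to a transport-free core. This is a sketch of the shape, not Bun's actual mechanism: a queue stands in for the websocket/port, and the function names are invented.

```python
# Sketch: intercept runtime errors and feed them to the model loop as if
# they were ordinary compile errors. In the real setup the errors travel
# over a websocket/port; here a queue stands in for that channel.
import queue
import traceback

error_channel: "queue.Queue[str]" = queue.Queue()

def report_runtime_error(exc: BaseException) -> None:
    """Serialize an exception the way a hot-reload hook might, and
    publish it on the channel the model listens to."""
    text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    error_channel.put(text)

def drain_errors() -> list:
    """What the model-facing side would do on each reload tick."""
    errors = []
    while not error_channel.empty():
        errors.append(error_channel.get())
    return errors

# Wrapping user code execution:
try:
    1 / 0
except ZeroDivisionError as e:
    report_runtime_error(e)
```

The hard part the notes point at isn't the plumbing — it's deciding which runtime signals count as "errors" the model should see.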
- Q: Many tools by good engineers for good engineers didn’t take off, why Cursor
- Sualeh: hard to analyze ex-post
- Something about making it relatively simple? Hard to say
- Could give reasons Sualeh think are true
- Really annoying to paste context perfectly into a GPT window, so the question left was: what’s the least-bad option in the editor? Cursor was the least bad
- Probably not great, but trying to be better
- Q: What are the autocomplete actions? Create new file, install package?
- [sualeh] not in hot loop. creating files is far too rare. Getting multi file cursor prediction would be amazing
- Q: How well can cursor support large monorepo?
- Sualeh: Yeah, improved search a lot. Turns out this is valuable. Some core components
- Won’t support google/facebook, different worlds
- Not doing that because Sualeh not working at Google
- Q: Cursor made a decision to use base models, vs others like Augment/Poolside raising $100m to train their own. Was that right?
- Came down to a simple thing: could spend all this money recreating GPT4 — but we have GPT4, and before you make strong claims, it’ll take you a year, etc. Given it’s hard, don’t you just want to make the most useful thing?
- Make the useful thing, and if it’s good, you can go train the model. Didn’t make any sense. Not that Sualeh doesn’t like model training, but believed at the time: overhang in capabilities
- Many tasks that the prompt construction is too tedious
- Q: How much finetuning/posttraining in models?
- Sualeh: Substantial part of the team. Is it crucial to cursor? if we just remove the models, it’d be much worse. Maybe 5-6 people
- Q: How far ahead does Sualeh thinking, if models are getting a lot better in a year, are you considering this?
- Don’t think the labs can tell you about the things they haven’t built yet
- Mentally preparing that stuff might have to go. But don’t start coding it now, don’t know what it’d be like to code for it
- Might build overcomplicated stuff not necessary. You can build overcomplicated systems for GPT4, destroyed by Sonnet
- You have to build with the final artifact, not what you think it’ll be
- Q: Do you think Cursor will be more like Devin, iterate & run code itself
- You want to give highest autonomy possible but still useful. Suspect we’re underneath that bar. Not giving highest level now
- Q: Refactoring?
- Can ask Cursor
- 2 things:
- Meta point: try to build for things you do at least once an hour, or at most once a day. Refactoring is something you do once a month — it goes out of the purview of things we build for ourselves
- Too easy to build once-a-month tools; try to build things used once a second
- Q: Other graveyard ideas that were most painful to bin?
- idk, probably the worst ones are the ones that should go into the graveyard but are still actually in the product
- Eg long context chat, no one uses
- Beta tab as a bunch
- Q: How do you stay grounded when everyone is raising a lot of money, hiring quickly, how do you stay out of FOMO, are you naturally chill?
- Out of scope, no good answer
SF Compute
- How a GPU cloud thinks about compute, and how that leads to GPU-hour pricing
- Software margins are what both pitches assume
- VC pitch
- Hyperscaler pitch
- CPUs
- Buy enough, but don’t need more
- GPUs — you don’t stop once you’ve trained
- Want more GPUs to keep training
- All hyperscalers and neoclouds pitched thinking like CPU clouds
- CPUs:
- customer wants enough compute to launch, then they keep making money, compute costs don’t grow
- GPUs
- customers maximize FLOPs per dollar and don’t care about anything else; GPU clouds have very slim margins on this
- Customer requires capex and is then super price-sensitive. So for traditional cloud providers, pretty much everyone loses money on GPUs
- Solutions:
- Move down value chain (make chips) — NVIDIA has high margins
- Move up value chain (chatGPT), make ML infra, just sell software
- Win by correctly pricing risk
- Interest-rates graph (% if you sell the GPU vs depreciation)
- Map the graph onto GPU pricing so you don’t lose money
- Or: large contracts paid over time
- 1gpu per hour, or thousands for year. Bottom right
- But: most people don’t like large contracts, up front. A few vendors won the market
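The pricing-risk point above comes down to simple break-even arithmetic. The numbers below are made up for illustration; only the shape of the calculation (capex amortized over a depreciation window, scaled by utilization, plus financing) reflects the discussion:

```python
# Illustrative break-even $/GPU-hour: capex recovered over a
# depreciation window at an expected utilization, plus a simple-interest
# financing cost. All inputs are invented example numbers.

def breakeven_gpu_hour(capex: float, years: float,
                       utilization: float, annual_rate: float) -> float:
    """Minimum $/GPU-hour needed to recover capex plus simple interest."""
    hours_sold = years * 365 * 24 * utilization
    financing = capex * annual_rate * years  # simple interest, for the sketch
    return (capex + financing) / hours_sold

# e.g. a $30k GPU, 4-year depreciation, 60% utilization, 8% rate:
price = breakeven_gpu_hour(30_000, 4, 0.60, 0.08)
```

Lowering the risk on any input (locking in utilization with long contracts, cheaper financing) lowers the price you can profitably quote — which is the "win by correctly pricing risk" claim.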
- Predictions
- Coreweave will make money — long term locked
- Lambda’s 512 — will lose money. Lambda pushes down total costs to seem like the cheap option
- Digital Ocean & Together AI — bought large clusters, will lose money
- Fantastic software ⇒ bad hardware economics. Guess is they’re losing money
- Modal & Replicate will make money. Don’t own the underlying hardware; prices are higher, pretty much earning on a per-unit basis
- What does SF Compute do?
- Need a DCM (designated contract market) to sell futures & options. 3-8 years to get a DCM
- Meantime: spot market for chunks of compute to spread out risk
- Can also build a software business
- Modal or Replicate can buy from SF Compute on market price, and can take a spread with no inventory risk
- What is the service you’re providing on top of the GPU?
- A: Most services are not valuable enough to cover GPU cost
- DX, optimization, etc
- Problem: any reasonable margin for a software business is like 10%. 10% on a $100m contract is $10m — not worth it; enterprise contracts for pure software cap out around this
- That $10m would otherwise have gone to the GPU provider, or could have paid your own engineers to replicate it. So if you’re a GPU cloud: sell at the lowest prices — better off optimizing risk to lower the underlying cost
- How fungible is GPU?
- Hard to funge:
- Time of when you want it
- Bunch of compute in the same cluster is different than spread out — interconnect matters
- Under the hood: clusters are a bit different, some are broken, security requirements
- We create a rulebook, have this stamp or better, and gotten an audit
- Is that: training or inference?
- Who is the sell side — people who have bought clusters they don’t need
- A lot of people who thought they were going to make money
- People with extra GPUs, and need some way to sell them, typically in chunks that are different
- Today, looks like HotelTonight for GPUs — a bunch of folks with GPUs, sold at market price
- Prevents their branded channel from having to drop prices
- Very much a market glut atm
- Long term, think this is the right way to sell compute. The past model (cloud provider with hardware in the same company) is a bad model for GPUs, so a market will disintermediate it.
- Pure software vs pure hardware companies
HF0
- Problem:
- Dialog agents have bad UX
- Why it matters
- Emotionality is missing
- Solutions
- Brute force
- Busy listening
- Emotional LoRAs
- Affordances
- Conversational affordances
- yeah, uh huh, waiting to get more
- shared context
- Reactions, acknowledgements, laughter, cadence changes, changes in dominant speaker, emotional tensions & transitions
- Dialog agents are stuck in the command line prompt, waiting for enter
- Legacy factor in having conversational
- Input your text, and then inference happens, and then output comes
- Stuck in the moment
- Smashing enter button
- Or waiting for you to be silent long enough to imply/infer you’ve hit the symbolic enter button
- But normal convos don’t have this button, it’s a continuous flow
- ping pong dialog
- though: this is fundamental to dialog, it’s limiting
- What’s happening: dialog agents don’t have temporality. No moment-to-moment awareness
- Hey, how are you vs Hey…. how are you
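The "symbolic enter button" described above is usually just a silence timeout. A minimal sketch — threshold and event format are invented for illustration:

```python
# Most voice agents decide you are done talking once your silence
# exceeds a fixed threshold — the "symbolic enter button". `events` is
# an assumed stream of (timestamp_seconds, is_speech) samples.

SILENCE_THRESHOLD_S = 0.7  # arbitrary, tuned per product

def utterance_boundaries(events, threshold=SILENCE_THRESHOLD_S):
    """Yield the times at which the agent 'presses enter' for the user."""
    last_speech = None
    triggered = False
    for t, is_speech in events:
        if is_speech:
            last_speech, triggered = t, False
        elif (last_speech is not None and not triggered
              and t - last_speech >= threshold):
            triggered = True
            yield t

# "Hey, how are you" vs "Hey…… how are you": the pause in the second
# version fires the button mid-sentence.
events = [(0.0, True), (0.3, True), (0.5, False),
          (1.0, False), (1.3, False), (1.5, True)]
cuts = list(utterance_boundaries(events))  # one cut, at t=1.0
```

This is why the pause in "Hey…… how are you" gets cut off: the timeout fires on silence, with no model of whether the thought is finished.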
- Want emotionally coherent, fluent AIs
- skipping “should we want these things”
- Character AI: has users who are nerds willing to suspend disbelief to chat with fantasy bots; has not spread to mainstream
- (but these people are getting a shitload of value — using 2-3h on average)
- When conversational UX mirrors natural human convo
- Business/utility applications: whether you like talking to a bot might determine sales, recruiting — enjoyability of the interface matters; not enough energy goes into it
- Solution 1: Brute Force Agentic
- Meta-conversation affordances. agentic decision making on when to use it
- Solution 2: Better training data
- Time stamps, audio training, will big multimodal models solve everything?
- But still losing to the “big enter button” problem
- Solution 3: busy listening
- What if we built an app to constantly run inference
- Assess inputs, think about things to say, decide when/whether to say them, and buffer
- [a] trying it for text chat first?
- Q: parameter for how often the speaker wants to be interrupted
- if I’m talking to an AI and it starts responding
- Other times: wants it to just be “yeah”
- Kevin: NotebookLM has research from DeepMind — the disfluencies make it sound more realistic. Conversations between two agents; could imagine
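The busy-listening loop from Solution 3 could be sketched like this — `draft_reply` and `should_speak_now` are stand-ins for model calls, not any actual product's API:

```python
# Sketch of "busy listening": run inference continuously on the growing
# transcript, buffer a candidate reply, and separately decide whether
# now is a good moment to say it.
from dataclasses import dataclass, field

@dataclass
class BusyListener:
    transcript: list = field(default_factory=list)
    buffered_reply: str = ""

    def on_fragment(self, fragment, draft_reply, should_speak_now):
        """Called on every incoming speech fragment — not on 'enter'."""
        self.transcript.append(fragment)
        self.buffered_reply = draft_reply(self.transcript)  # always thinking
        if should_speak_now(self.transcript):               # rarely speaking
            reply, self.buffered_reply = self.buffered_reply, ""
            return reply
        return None  # keep listening; a "yeah"/"uh huh" could go here

# Dummy model calls for illustration:
listener = BusyListener()
draft = lambda transcript: "doing well, you?"
gate = lambda transcript: transcript[-1].rstrip().endswith("?")
first = listener.on_fragment("hey", draft, gate)
second = listener.on_fragment("how are you?", draft, gate)
```

Splitting "what to say" from "whether to say it now" is the core move: the interruption-tolerance parameter from the Q above would live in `should_speak_now`.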
- Solution 4: making the AI more emotional
- Example conversation: if you keep saying the same thing over and over again, the LLM will keep responding the same way
- Appropriate in a sense, because the prompt is the same each time
- But a person will get annoyed: “are you okay?”
- Something changes over time — modify the latent space based on an emotional aspect
- Annoyance, curiosity — how could you create a bot that, if it’s silent, reaches out “hey are you still there”, feeling of presence, witnessing the moment
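The repetition example above can be sketched by tracking a tiny emotional state outside the model and folding it into the prompt, so identical inputs stop producing identical outputs. The prompt strings and thresholds here are invented, a stand-in for actually modifying the latent space:

```python
# Sketch: a repetition counter that shifts the system prompt, standing
# in for "modify the latent space based on an emotional aspect".
from collections import Counter

class EmotionalState:
    def __init__(self):
        self.seen = Counter()

    def system_prompt(self, user_msg: str) -> str:
        """Return a system prompt that drifts as the user repeats themselves."""
        self.seen[user_msg] += 1
        n = self.seen[user_msg]
        if n >= 3:
            return "You are annoyed; ask the user if they are okay."
        if n == 2:
            return "You are mildly puzzled by the repetition."
        return "You are a friendly assistant."

state = EmotionalState()
prompts = [state.system_prompt("hello") for _ in range(3)]
```

The same external-state trick could drive the "hey, are you still there?" behavior: a silence timer instead of a repetition counter.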
- How to create realistic film dialog
- People talk over each other, not a ping pong, people are trying to compete for power
- playing devil’s advocate: making the LLM less aligned, less friendly would help with the UX issue
- Many chatbots out there are now sterile — maybe the future of companionship is the schizo roommate who is constantly bothering you
- AI doesn’t initiate the conversation, proactively start convos, text me at the right time
- Hard to build in a way that isn’t an annoying notification
- Real life friend — doesn’t create that experience
- Tech will imitate life just like art imitates life
- Ask questions like: once you have this, how to make it safer
- Might become more addictive; might help people learn how to communicate better than talking to Character AI does
- Would prefer our kids talk to non-ping-pongs rather than Character AI
- Hume is trying
- OpenAI — Greg Brockman’s team generated an image with GPT-4o; what if we model everything jointly?
- Research on Replika (proactive notifications) — not that sophisticated as a bot, but research shows reduced suicidal ideation and more inclination to go out into the world
for HF0: Austin: isn’t investing in startup founders kind of old-school? What does investing in founders who are LLMs look like?
Constellation in SF
- Alexandra Bates is excited by the idea, advises Constellation
- Lauren Mangla is interested in happy hours
- Constellation CEO James might be underwater
- Ben Goldhaber interested in helping
- Jonas Vollmer talking through it
- Jueyan Zhang in SF might be interested in coworking?