buni dev notes

@October 28, 2024 election-promises usage notes

Started with Claude Artifacts

(also wtf Claude really needs to make these things exportable)

Mock up a website design to inform voters of likely outcomes of the presidential election. The website name is "If elected, I will...". The outcomes are taken from predictions on the site Manifold Markets.
Show an image of Harris saying "reschedule marijuana from a schedule 1 drug" and trump saying "organize a ceasefire in Ukraine before 2026 midterms".

Put Kamala on the left and Trump on the right of their cards; make the quotes look like speech bubbles coming out of their avatar

But it didn’t actually have Trump/Harris images, so I grabbed them from Manifold

Replace the images with https://manifold.markets/political-candidates/harris.png and trump.png; add a little tail to the speech bubbles so it looks like it comes out of their mouths

Could paste in images, which was cool/timesaving

But inability to see the results with real images led me to switch to yield…

See discussion on https://yield.sh/edit/election-promises/app.tsx

Looking at the diffs sometimes helped when the code didn’t match my expectations

I guess it saves a bit of time too

Applying diffs via Haiku seemed to basically always work, nice

Could write an eval for Cerebras

Had to be prompted to refactor code
I sometimes dropped in to tweak wording exactly (best for copy text, where my taste is better than Claude’s)

Do want to have side-by-side code + preview for these cases

Not having image paste was a bit awkward; I copied in some text instead

Maybe ideal is pasting in a URL?
With images, the benefit is in providing exactly the context and no more

Benefits of yield over Claude Artifacts

Immediately hosted & shareable (eg sent to Nate)
Can render externally linked images…
Can tweak code directly for things I know how to do, instead of asking via prompt
Can import custom libraries from esm.sh

Benefits of Claude Artifacts over yield

Accepts image-pasted input

shadcn/ui out of the box

Fewer clicks (applied & saved automatically)

Feels a bit more reliable? idk
Version tabbing is nice, maybe we should steal it

At least, get rid of alert popup, and show current loaded version “7/9”. Maybe also show timestamp.

Though in practice I’m not really going back to old versions very much lol. It’s more about peace of mind, having an undo function

Maybe instead of versioning to a number, it saves a commit message based on the prompt

Maybe it’s integrated inline to the chat, vs separate numbering system?

Try: fast diff gen + apply with cerebras, versioned on every checkpoint?
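A rough sketch of checkpoint-per-prompt versioning with a “7/9”-style indicator and prompt-derived labels instead of bare numbers. Everything here (the `VersionHistory` class, using the prompt’s first line as the label, the 60-char cutoff) is a hypothetical illustration, not how yield or Artifacts actually do it.

```typescript
// Hypothetical shape: one checkpoint per applied prompt.
interface Checkpoint {
  code: string;
  label: string; // derived from the prompt, like a commit message
  createdAt: Date;
}

class VersionHistory {
  private checkpoints: Checkpoint[] = [];
  private index = -1; // position of the currently loaded version

  // Save a new checkpoint. Anything "ahead" of the current version is dropped,
  // like undo/redo in an editor.
  save(code: string, prompt: string, now: Date = new Date()): void {
    this.checkpoints = this.checkpoints.slice(0, this.index + 1);
    this.checkpoints.push({ code, label: labelFromPrompt(prompt), createdAt: now });
    this.index = this.checkpoints.length - 1;
  }

  undo(): Checkpoint | undefined {
    if (this.index <= 0) return undefined;
    this.index -= 1;
    return this.checkpoints[this.index];
  }

  redo(): Checkpoint | undefined {
    if (this.index >= this.checkpoints.length - 1) return undefined;
    this.index += 1;
    return this.checkpoints[this.index];
  }

  // Indicator like "7/9", shown instead of an alert popup.
  position(): string {
    return `${this.index + 1}/${this.checkpoints.length}`;
  }

  current(): Checkpoint | undefined {
    return this.checkpoints[this.index];
  }
}

// First line of the prompt, truncated, as a commit-message-style label.
function labelFromPrompt(prompt: string, max = 60): string {
  const firstLine = prompt.trim().split("\n")[0] ?? "";
  return firstLine.length > max ? firstLine.slice(0, max - 1) + "…" : firstLine;
}
```

Saving while on an older version drops the newer ones, like editor undo; keeping them as a branch would be closer to git, at the cost of a more complicated indicator.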

Export formats

Could export to single HTML page which is hosted… somewhere (git?)

There’s gotta be some cheap/free place to just host static HTML files at scale. (Cloudflare?)
Some stuff not needed for export (like uiw react editor)

Nice to have source code too, for modifying + exporting to NextJS I guess (did this for futarchy.dev)

Could just add a copy button

@October 25, 2024 Playing with different futarchy.dev versions

🪡futarchy.dev + yield.sh

@October 25, 2024 Check in with Matt

Catch ups, childcare, what happened, Matt & childcare
Overall:

yield.sh: upgraded Sonnet, db generation works out of the box a lot, one more prompt that passed the “want to share” test

shareability
Intuition: turn up the fun, turn up variability, get something that you are
Sonnet should make things a lot better

Old sonnet 3.5 vs new sonnet 3.5 to build stuff in Minecraft

Even though it’s not much better in general, on this test it’s so much better. Some speculation: maybe it got better spatial & visual reasoning. Feels handwavy

Some tradeoff between “how to get started” vs constraining space

Pippa: Maybe in early days, don’t constrain the space
Like Matt: find a set of users, have them force through 15 prompts, get past the hurdle

Instinctive user behavior in the early days, the 2 things Pippa sees in the left panel. Really interesting behavior

Started talking with Nate Foss (founder of Gather Town) about a joint project to hack on (prediction market futarchy)

Matt: Usual model, be pretty explicit, it’s something we’re pretty into, don’t know where it’ll go, maybe won’t be a thing. Pretty open minded, hope that it evolves into something that’s important to me

In an ideal world, have someone to do it with. Not asking to accept a marriage proposal
Way Matt usually thinks about it — do we feel more productive, ideally more than sum of parts
Anchor on a reasonable level of uncertainty, with upside case being clear

Could be really interesting, as well as hacking together, to spend some time talking about what it could become and seeing what his instincts are

There’s the core of something interesting here, suspect there’ll be a new behavior somewhere, needs to come up
Prompting, sharing, a loop around it
Fun to see whether we feel aligned or not

slight tangent: was at an AI conference last week. “Why do AI panels suck?” Was talking about how it’s a weird technology: the primitives are so far ahead of productization

Other than ChatGPT, few products that use the primitives
Panel with chief product officers of OAI, Ant, but also Instagram

Riff on early insta: they got so obsessed with features, but in the end what they thought was important wasn’t, and the things they didn’t think about were key
Like the square — scroll back to early insta, became iconic
More important: have the right primitive, and then create enough variance that interesting behaviors get amplified

Just getting people to use it even if it feels weird

Very generative, the 2 of them, talking about what it might be. But then — let’s build it, put in people hands

What are the new primitives?

Arbitrary image, arbitrary prompt and get high quality response
Core premise: ChatGPT will not be the most popular product in 5 years. We’re basically not using the primitives yet; the way we’re using them is boring

(Already told Pippa) 2nd panel: the OG 6 of ChatGPT

When they finished GPT-4 training in May/Jun of ’22, there were 2 post-training teams; the smaller one was chat, which wasn’t seen as that interesting, not the cool thing (which was instruction tuning)
In Nov there were 7 people in Chat
Median guess for #users was 50k after 1mo. 1 person guessed 1m, actually 100m
Ran a friends & family beta with 50 people from Aug to Oct. Average daily actives: 4

Reflection — 50 is too small to observe real variation in usage. Interesting, the 4 people, all were using it for debugging
Nobody was using it for things that drive most of the usage — 50% of GPT might still be people who are coding
Interesting, given it’s inarguably the fastest-growing product

Small usage at start can obscure what’s interesting
Also: debugging matters, something feels applicable around how you generate enough breadth to allow the product to blossom

Some equivalent of temperature for this, where you can increase the variance in the actual code generation

Artificial equivalent of 4 different types, where you increase the variance at level of
Computer usage
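One way to get a “temperature” for the product: fan out several generations of the same prompt at different sampling temperatures and show them side by side. Minimal sketch; `Generate` is a hypothetical stand-in for whatever model call is used, and the 0.2 to 1.0 spread is an assumption (Anthropic’s API accepts temperatures from 0 to 1).

```typescript
// Hypothetical signature for whatever model call the app uses.
type Generate = (prompt: string, temperature: number) => Promise<string>;

interface Variation {
  temperature: number;
  code: string;
}

// Spread n temperatures evenly between min and max (inclusive).
function temperatureSpread(n: number, min = 0.2, max = 1.0): number[] {
  if (n <= 1) return [min];
  const step = (max - min) / (n - 1);
  return Array.from({ length: n }, (_, i) => Number((min + i * step).toFixed(2)));
}

// Run n generations in parallel. Failed calls are dropped instead of failing
// the whole batch, since a partial set of variations is still useful.
async function generateVariations(prompt: string, generate: Generate, n = 4): Promise<Variation[]> {
  const settled = await Promise.allSettled(
    temperatureSpread(n).map(async (temperature): Promise<Variation> => ({
      temperature,
      code: await generate(prompt, temperature),
    })),
  );
  return settled
    .filter((r): r is PromiseFulfilledResult<Variation> => r.status === "fulfilled")
    .map((r) => r.value);
}
```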

[m] Feels like it’s been pretty underappreciated

What to do next

advice for The Curve (esp — maintaining value after the conference?)

Won’t be able to get out to SF in Nov; will be in London
From Jan, will be in SF a lot. Keep having a lot of other things
also Progress Conference — but sad not able to make
E.g. how did Matt do it for AISS

AISS is very formal
Been to a bunch of good conferences with good informal ways of following up

Genuinely active whatsapp groups
For AISS — easy bit, knew there was another one in Korea in May. So formal touchpoints leading up to May.
Road to summit, lay trail for next of agenda

One thing Matt really likes: intl stuff, cool to have a London day, SF day, 3mo later

Dialog every year, it works really well. Some city that no one’s from (last one was Frankfurt)

Most important: events can be fantastic ways to build community, but it’s real work. Need someone who wakes up thinking “what do I do to stimulate this?”

Ben Casnocha. Runs Village Global, man about town in SV, runs an annual conference called Satori

Satori whatsapp group remains active all year. Ben just makes a massive effort

Part LinkedIn, part news alerts: any time anyone who attended the last one does something cool, he posts in that group
Whenever he writes anything that is vaguely topical for people, posts on it

Rhythm “Can’t believe Pippa got knighted”

Pippa: Group chat from events conference, one of the only networking things that is valuable, active, the exact thing playbook

Plus they actually have a monthly call where people catch up

advice for Manifold

Next steps for checkpoint

try the “variations” approach
put it in front of 15 people
build out the futarchy thing here

@October 24, 2024

How to do a db migration?
Adding upvotes sucks

Even though no migration needed!
Like, need a new join table, then need to add APIs to write to and sync from the table, ugh
Just want to pretend everything’s in memory and just works

@October 22, 2024 Orchestrating prompts

Fundamentally: how to structure code when language models are creative but unreliable?

Analogy: human users are also creative and unreliable, eg on a social media site, or GitHub
Also corporations, on AWS

But there


text/code-based


node-based

Hm… Instead of complicated fanout structure for eg generating a slug for an app, can just ask for structured text output:

<antArtifact identifier="dashboard-component" type="application/vnd.ant.react" title="React Component: Metrics Dashboard">

Or could try some kind of structured JSON or tool calling stuff
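Sketch of getting the slug out of that kind of structured text output: read the attributes of the first `<antArtifact>` tag and slugify the identifier. Assumes double-quoted attributes with no escaped quotes; `parseArtifactTag` and `slugFrom` are hypothetical helpers.

```typescript
interface ArtifactTag {
  identifier?: string;
  type?: string;
  title?: string;
}

// Pull the attributes out of the first <antArtifact ...> opening tag.
function parseArtifactTag(text: string): ArtifactTag | null {
  const open = text.match(/<antArtifact\b([^>]*)>/);
  if (!open) return null;
  const body = open[1] ?? "";
  const attrs: Record<string, string> = {};
  const attrRe = /([\w-]+)="([^"]*)"/g;
  let m: RegExpExecArray | null;
  while ((m = attrRe.exec(body)) !== null) {
    const [, key, value] = m;
    if (key !== undefined && value !== undefined) attrs[key] = value;
  }
  return { identifier: attrs["identifier"], type: attrs["type"], title: attrs["title"] };
}

// Turn the identifier (or title, as a fallback) into a URL slug.
function slugFrom(tag: ArtifactTag): string {
  return (tag.identifier ?? tag.title ?? "untitled")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}
```

Compared to a fanout of separate calls, this gets the slug for free from the same response that carries the code.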

@October 19, 2024

Bun is seriously so good, shell scripting instead of package.json :chefskiss:
This Vercel AI SDK thing seems annoying to merge with Cerebras. Let’s just use bare cerebras for now.
Cerebras is fast (like, 1s to generate apps from scratch) but the apps are somewhat worse than sonnet’s

Could be good for applying diffs
Regenerating apps is a neat party trick, but users probably want a more iterative, diffusion-y approach
See Aider benchmarks: https://aider.chat/docs/leaderboards/
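For applying diffs, a sketch of Aider-style SEARCH/REPLACE blocks, assuming the fast model (Cerebras, or Haiku as above) is prompted to emit that format; the helper names are made up. Each SEARCH must match exactly once, otherwise it throws so the caller can fall back to a full regeneration.

```typescript
interface Edit {
  search: string;
  replace: string;
}

// Parse blocks of the form:
// <<<<<<< SEARCH
// old lines
// =======
// new lines
// >>>>>>> REPLACE
function parseEdits(response: string): Edit[] {
  const re = /<<<<<<< SEARCH\n([\s\S]*?)=======\n([\s\S]*?)>>>>>>> REPLACE/g;
  const edits: Edit[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(response)) !== null) {
    edits.push({ search: m[1] ?? "", replace: m[2] ?? "" });
  }
  return edits;
}

// Apply edits in order. Each SEARCH text must occur exactly once in the current
// source; anything else throws so the caller can decide what to do.
function applyEdits(source: string, edits: Edit[]): string {
  let out = source;
  for (const { search, replace } of edits) {
    if (search === "") throw new Error("empty SEARCH block");
    const at = out.indexOf(search);
    if (at === -1) throw new Error(`SEARCH text not found:\n${search}`);
    if (out.indexOf(search, at + 1) !== -1) throw new Error(`SEARCH text is ambiguous:\n${search}`);
    // slice instead of String.replace, so "$&" etc. in the new code stay literal
    out = out.slice(0, at) + replace + out.slice(at + search.length);
  }
  return out;
}
```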

meta: there’s probably going to be lots of things like this, where there’s a tradeoff between price, latency, and quality among different LLM providers

Useful thinking for a world of LLMs, eg o1 is very high quality vs Cerebras Llama 3.1 8B is very fast

@October 8, 2024

Notes from Ben Evans:

https://www.ben-evans.com/benedictevans/2024/6/8/building-ai-products

A stock reaction of AI people to examples like mine is to say “you’re holding it wrong” - I asked 1: the wrong kind of question and 2: I asked it in the wrong way. I should have done a bunch of prompt engineering! But the message of the last 50 years of consumer computing is that you do not move adoption forward by making the users learn command lines - you have to move towards the users.

Looking at this on another axis: with any new technology, we begin by trying to make it fit the problems we already have, while the incumbents try to make it a feature (hence Google and Microsoft spraying LLMs all over their products in the last year). Then startups use it to unbundle the incumbents (to unbundle search, Oracle or Email), but meanwhile, other startups try to work out what we could build that would be truly native to the new technology. That comes in stages. First, Flickr had an iPhone app, but then Instagram used the smartphone camera, and used local computing to add filters, and further on again, Snap and TikTok used the touch screen, video and location to make something truly native to the platform. So, what native experiences do we build with this, that aren’t the chatbot itself, or where the ‘error rate’ doesn’t matter, but abstracts this new capability in some way?

@October 7, 2024

Playing with Eden Treaty; need to figure out how to pass the server-generated Elysia app type to clients?

Maybe a codegen solution is actually better vs magic type inference

Spent a while trying to reroute to OpenRouter; it’s not an exact match for Anthropic API…

Could go all in on OpenRouter
… or could try Vercel’s ai-sdk…
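Roughly where the Anthropic API and OpenRouter’s OpenAI-style chat completions format differ on the request side: Anthropic takes `system` as a top-level field, while OpenAI-style APIs expect it as the first message. Text-only sketch with hypothetical helper names; the target model id is a parameter because I’m assuming OpenRouter uses ids like `anthropic/claude-3.5-sonnet` rather than Anthropic’s own model names.

```typescript
// Minimal Anthropic Messages request shape (text-only subset).
type AnthropicContent = string | { type: "text"; text: string }[];

interface AnthropicRequest {
  model: string;
  max_tokens: number;
  system?: string;
  messages: { role: "user" | "assistant"; content: AnthropicContent }[];
  temperature?: number;
}

// OpenAI-style chat completions request, which OpenRouter accepts.
interface ChatCompletionRequest {
  model: string;
  max_tokens: number;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  temperature?: number;
}

function flattenContent(content: AnthropicContent): string {
  return typeof content === "string" ? content : content.map((block) => block.text).join("\n");
}

// Move `system` from a top-level field into the first message.
function toChatCompletion(req: AnthropicRequest, modelId: string = req.model): ChatCompletionRequest {
  const messages: ChatCompletionRequest["messages"] = [];
  if (req.system) messages.push({ role: "system", content: req.system });
  for (const m of req.messages) messages.push({ role: m.role, content: flattenContent(m.content) });
  const out: ChatCompletionRequest = { model: modelId, max_tokens: req.max_tokens, messages };
  if (req.temperature !== undefined) out.temperature = req.temperature;
  return out;
}
```

The response side differs too (`content[0].text` vs `choices[0].message.content`), so a real adapter needs both directions.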

@October 5, 2024

Hm, try catching errors the way Val Town does?

https://esm.town/v/std/catch?v=7
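I haven’t checked what std/catch does internally, so this is only a guess at the general shape: inside the preview iframe, listen for `error` and `unhandledrejection` and forward a serializable report to the parent editor. Structural types keep it compiling without the DOM lib; in the browser you’d pass `window` and something like `(r) => window.parent.postMessage(r, "*")`. All names are hypothetical.

```typescript
// Just the fields this sketch reads from ErrorEvent / PromiseRejectionEvent.
interface ErrorLikeEvent {
  message?: string;
  error?: unknown;
  reason?: unknown;
}

interface ErrorEventTarget {
  addEventListener(type: string, listener: (e: ErrorLikeEvent) => void): void;
}

interface ReportedError {
  kind: "error" | "unhandledrejection";
  message: string;
  stack?: string;
}

// Errors can't be posted across frames as-is in every case, so flatten them.
function toReport(kind: ReportedError["kind"], value: unknown): ReportedError {
  if (value instanceof Error) return { kind, message: value.message, stack: value.stack };
  return { kind, message: typeof value === "string" ? value : "Unknown error" };
}

function installErrorReporter(target: ErrorEventTarget, post: (report: ReportedError) => void): void {
  target.addEventListener("error", (e) => post(toReport("error", e.error ?? e.message)));
  target.addEventListener("unhandledrejection", (e) => post(toReport("unhandledrejection", e.reason)));
}
```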

Bug: modifyCode sometimes just returns a subset
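A cheap guard for that bug: flag a modifyCode result that is much shorter than its input, or that introduces an elision comment like `// ... rest of code`. `checkForTruncation`, the 50% threshold, and the patterns are hypothetical guesses to tune, not measured values.

```typescript
// Comments models tend to write when they elide code instead of returning it.
const ELISION_PATTERNS: RegExp[] = [
  /\/\/\s*\.\.\.\s*(rest|remaining|existing|other)\b/i,
  /\/\*\s*\.\.\.\s*(rest|remaining|existing|other)\b/i,
];

interface TruncationCheck {
  suspicious: boolean;
  reasons: string[];
}

function checkForTruncation(before: string, after: string, minRatio = 0.5): TruncationCheck {
  const reasons: string[] = [];
  const ratio = before.length === 0 ? 1 : after.length / before.length;
  if (ratio < minRatio) reasons.push(`output is ${Math.round(ratio * 100)}% of the original length`);
  for (const p of ELISION_PATTERNS) {
    // Only count markers the model introduced, not ones already in the file.
    if (p.test(after) && !p.test(before)) reasons.push(`contains elision marker matching ${p}`);
  }
  return { suspicious: reasons.length > 0, reasons };
}
```

A flagged result could be retried, or the change requested as a diff instead of a full rewrite.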

Testing user interaction with LLMs?

Spin up a new app, make various calls, have LLMs input commands and test them?

For each app, ask LLM to create a testing plan for each one??

Write tests for the different platform components (eg auth, db, etc?)
Note: When more of the coding happens via LLM, then testing/verifying becomes more expensive, takes up a larger chunk of human time.

Flo user test

Expected code to be on the same side as the chat
Generated movie recs, enter a movie.

Oh, I really like “The Intouchables”
Get more recommendations
I want to change to be book recommendations ⇒

~~Realtime messages didn’t load!!~~

Looking at diff through what’s changed, just curious

When Flo uses lang models to generate code, feel like I want to check that it makes sense
If it had just applied, I wouldn’t have done it

Auto-apply diffs? Cursor doesn’t, v0 does…

~~Would expect the conversation to auto-scroll, and have what would you have to change at bottom, like chatbot. Feels unintuitive~~

Tried getting Claude to do it

don’t have a Claude API key

Can you edit the diff?
“You have to apply”. Oh, you can edit the code

Cool I can get my API key, I guess
I guess the reason is that we don’t want people to use your API keys, maybe

Tried to save just now with ctrl+s

Formatting feels weird, thought it would auto format
Now it doesn’t work — is that something I did?

[a] Still had to take over, check console logs, etc

Ended up having to delete Flo’s stuff because of local vs remote sync issues… ideally figure out a way to merge volumes and databases?

Volumes could go with git
Databases…. not sure how to generally merge sqlite stuff

Cleaning up data sucks, dealing with data incompatibility sucks… db migration sucks

Wonder how to make all this stuff much better — how to get to developer nirvana with databases?
(go schemaless?)
(why do you need a database when your data fits in memory?)

Oh, should ask about background with coding, code gen tools as part of interview


Render debugging — fixed itself (maybe because resolved %/buni/FileEditor?)

@October 4, 2024

More context to Claude:

Internal APIs

LLM capabilities like /anthropic. Maybe add /fal.

Setting up DBs
Other files in the same project
NPM packages — how to know which esm.sh things will work?

Approaches for context on internal APIs:

Just write a text prompt
Build on some Docusaurus-like thing

… or build our own…

Bun docs are a collection of .md files https://github.com/oven-sh/bun/tree/main/docs

Get it for free(ish) from Elysia & Swagger?

Hm… Elysia is a bit of a commitment

Also, do we want strong typing? Or does LLM solve this?
Scalar looks nice though!

Get Claude to do documentation?

see what Hanson is doing
Generate from example apps?
Maybe we should start versioning APIs…

@October 2, 2024

Auth


Ugh spent forever figuring https vs http on AuthJS, debugging notes:

Okay now trying to fix sth based on the way csrf token is set — Claude noticed that __Host-authjs.csrf-token= matches, but not authjs.csrf-token=

See also https://scottspence.com/posts/csrf-with-sveltekit-on-flyio
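My reading of the https vs http mismatch: Auth.js prefixes the CSRF cookie with `__Host-` only when it uses secure cookies, so if the app decides “https” when setting the cookie and “http” when reading it (easy to do behind a TLS-terminating proxy like Fly’s, where the app itself sees plain http), the names don’t line up. Sketch with hypothetical helpers; Auth.js has its own settings for this (e.g. the `useSecureCookies` option), which are the real fix.

```typescript
type HeaderBag = Record<string, string | undefined>; // lowercase header names

// Decide whether the *external* request was https. Behind a proxy, trust
// X-Forwarded-Proto instead of the raw request URL.
function isSecureRequest(requestUrl: string, headers: HeaderBag): boolean {
  const forwarded = headers["x-forwarded-proto"]?.split(",")[0]?.trim();
  if (forwarded) return forwarded === "https";
  return requestUrl.startsWith("https://");
}

// Auth.js-style naming: __Host- prefix only with secure cookies, so setting
// and reading must agree on `secure`.
function csrfCookieName(secure: boolean): string {
  return `${secure ? "__Host-" : ""}authjs.csrf-token`;
}

// Debug helper: find the CSRF cookie under either name and report which matched.
function findCsrfCookie(cookieHeader: string): { name: string; value: string } | null {
  for (const part of cookieHeader.split(";")) {
    const [rawName, ...rest] = part.trim().split("=");
    const name = rawName ?? "";
    if (name === csrfCookieName(true) || name === csrfCookieName(false)) {
      return { name, value: rest.join("=") };
    }
  }
  return null;
}
```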

Database

How to handle database migrations in userspace?

Generate via Drizzle? Not sure if we want to be locked into that though, adds extra complexity
Generate a migration somehow, apply diff in bun, or somehow specify the init script to rerun — without dropping everything

Do something like https://david.rothlis.net/declarative-schema-migration-for-sqlite/

(could migrate python script to JS)
hm, doesn’t do data migrations though

… ask Claude to generate the migration??

very tempted… Could introspect the existing sql schema maybe, then diff against intended schema.


This seems to work!
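A sketch of the introspect-then-diff idea: given the current schema (e.g. built from `PRAGMA table_info(...)`, not shown) and the intended one, emit the additive statements automatically and punt anything destructive to a human or to Claude. The schema shape and `planMigration` are hypothetical, and both sides need column definitions written the same way for the comparison to mean anything.

```typescript
// Column name -> definition, e.g. { id: "INTEGER PRIMARY KEY", title: "TEXT" }.
type Columns = Record<string, string>;
type Schema = Record<string, Columns>; // table name -> columns

interface MigrationPlan {
  statements: string[]; // additive changes that can run automatically
  needsReview: string[]; // changed or removed things; no data migration here
}

function normalize(def: string): string {
  return def.trim().replace(/\s+/g, " ").toUpperCase();
}

function planMigration(current: Schema, desired: Schema): MigrationPlan {
  const statements: string[] = [];
  const needsReview: string[] = [];
  for (const [table, cols] of Object.entries(desired)) {
    const existing = current[table];
    if (!existing) {
      const defs = Object.entries(cols).map(([name, def]) => `${name} ${def}`).join(", ");
      statements.push(`CREATE TABLE ${table} (${defs});`);
      continue;
    }
    for (const [name, def] of Object.entries(cols)) {
      const before = existing[name];
      if (before === undefined) statements.push(`ALTER TABLE ${table} ADD COLUMN ${name} ${def};`);
      else if (normalize(before) !== normalize(def)) needsReview.push(`${table}.${name}: "${before}" -> "${def}"`);
    }
    for (const name of Object.keys(existing)) {
      if (!(name in cols)) needsReview.push(`${table}.${name}: column removed`);
    }
  }
  for (const table of Object.keys(current)) {
    if (!(table in desired)) needsReview.push(`${table}: table removed`);
  }
  return { statements, needsReview };
}
```

Even additive statements can fail: SQLite’s ADD COLUMN rejects PRIMARY KEY or UNIQUE columns, and NOT NULL columns without a non-null default. Those cases need the rebuild-the-table approach from the rothlis post.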

Go schemaless???

@October 1, 2024 Approaches to screenshots

Tried some html2canvas approaches suggested by Claude, but doesn’t work for iframes?

Might be good for instant screenshots for introspecting though, vs another service with roundtrip & rendering

Would be nice to use end-user’s browser for rendering though — literally just shows what’s on the screen

Also,

Could also try a screenshot API of some kind from chrome?

More exotic: Add a chrome extension that makes it easy to send context to Claude

Started trying puppeteer

But hosting this on the same flyio/docker container is a bit annoying

Blows up the install size, adds complexity.

Plus it doesn’t work yet, something about missing a dependency…
After a bunch of wrangling, still doesn’t work to install chrome-headless-shell into the bun image, maybe time to give up

Learned a bunch about running docker locally though, not too bad

Hm, to run screenshotting might have to use

docker buildx create --use
docker buildx build --platform linux/amd64 -t your-image-name .

To get around the x64 vs arm differences on local Mac dev vs Fly.io

Claude recommends splitting up microservices but thinking about architecture there is also kind of annoying
Docker images to choose from…


https://fly.io/blog/fly-with-alpine/

https://pptr.dev/guides/docker
Unsure what the differences are to https://developer.chrome.com/blog/chrome-headless-shell

Which is available via NPM…?

Could try to find a different service, like an existing API? “trim” or sth looked okay

Though tbh maybe should just host our own service if it’s going to be important

Franz

Try creating chess game, and then create a neural net with self play, then an animation of the action space distribution, animated on the board

~~Noticed prompt isn’t there, add it~~

Add a chess set make it playable

saw chess prompts

~~Would have imagined seeing the diff~~

Some animation of the thing loading would be good

Something went wrong — want to go back, do the edit thing

(not clear that you could just go back)
Want to reset the state for the ping pong, to reproduce the bug, otherwise

Could reprompt the last

Want to reproduce the game when it’s stuck. Post hoc, see what happened

Expectation that refreshing would kill the game

More output when “something went wrong”

Even if I wanted to test it, get a gradient to best update prompting strategy

Python is more familiar, reason being machine learning

Pippa

John Daley — ping, looking for first users, understand the demo
If you’re doing this, the metric to decide to invest, who’s winning, who has customers
Cursor vs Buni

Buni could be better for nontechnical people
How to decide Wix vs more technical

Put together some paragraphs on:

What you’re building, who it’s for, how it plays in the existing landscape, by 4pm with Matt

Meaningful input.

If you’re building for the same audience as Cursor, the path to winning is narrow

Don’t need to write “why you can win”, just do that by winning

Other question:

What do you want to get out of the cohort?

As part of research: think about what Cursor doesn’t work well for right now

For the people who are here

Who is Wix built for, how would you build
[a] Even my wife uses Wix

Or who are the right users
Put together a persona — technicality, comfortable with these tasks, this is what they’d use cursor for today, this is where you find it painful
Figure out the format, clarify thoughts, 1 page, 1 place, do a brief

After figuring that out, can think through GTM

Distribution — the broader base from day 1
Level of virality in the right circles
Use distribution network of EF/YC while you have it.

@September 26, 2024

How much to focus on de novo app generation, vs steady state editing?

De novo is flashy, explains the system
Huge % of coding activity is steady state editing
Goal of buni is to blur the line between the two concepts? When I use NextJS etc to spin up a new app, it’s still scaffolding off a long list of stuff


De novo today often starts from templates anyways

@September 24, 2024

Codemirror

Pros: Batteries included; supports extensions? Nicer to code in…
Cons:

Packaging via bun install really inflates HTML size/parseability
Also adds like 1mb in general
Trying to figure out how to build without babel but it turned out nontrivial

Overall question: how important will coding be? Ideally, not very…

@September 21, 2024

How to do sql access in userspace?

Something with Bun, with straight sql strings? https://bun.sh/docs/api/sqlite

Could proxy as an API call, for now.

Drizzle ORM? Nicer DX but maybe harder for LLM to understand?
Insane: don’t use sql at all, just write data directly to JS files

Might be easier for Claude to read…
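For the “proxy as an API call” option, a sketch of the client side that generated apps would call. The `/api/sql` endpoint and `{ sql, params }` wire format are hypothetical. `fetchImpl` is injected so it can be stubbed; in the browser, `fetch` itself fits the type.

```typescript
// Hypothetical wire format between generated apps and the server.
interface SqlRequest {
  sql: string;
  params: unknown[];
}
interface SqlResponse<T> {
  rows: T[];
}

// The subset of fetch this sketch needs.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

function createSqlClient(endpoint: string, fetchImpl: FetchLike) {
  return {
    // Plain SQL strings with ? placeholders; params travel separately.
    async query<T>(sql: string, params: unknown[] = []): Promise<T[]> {
      const body: SqlRequest = { sql, params };
      const res = await fetchImpl(endpoint, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify(body),
      });
      if (!res.ok) throw new Error(`SQL proxy returned ${res.status}`);
      const data = (await res.json()) as SqlResponse<T>;
      return data.rows;
    },
  };
}
```

On the server, the handler could run the query with bun:sqlite (`db.query(sql).all(...params)`), ideally against a per-app database file, since the client can send arbitrary SQL.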

For sqlite tracking chats:

one database per app, including meta level stuff?
Or one global db for the “buni” chat app

Late musings

Why is prompting high-effort?

You try to be creative, but often the result doesn’t live up to what you imagined
So much easier to browse and not risk being wrong
Slow to see results
Not sure what’s in your possibility space
Solutions:

Multiple tries?
Show examples?
Make iterating cheap
Juice: reward typing in longer, better prompts, reward iterating
Suggest better things to do

Everything will be meta*programming

Telling AI to write prompts and code to tell other AI to …

Most of programming with AI is thinking very hard about which tokens to keep

What gets checked into a codebase is a very small set of things you trust and affix your name to.
Right now: staring at AI-gen code to see if it’s doing the right thing
Expensive piece is thinking of what might work, and validating that it did work
Implies: make it easier to throw prototype stuff away, and keep/amplify stuff that does work

Also, a lot of doing research, figuring out what APIs and components are good to use, what people say about them, previewing stuff, having some taste

@September 18, 2024

Maybe just importmap React to esm.sh and don't bundle React via bun's node_modules. This has been working ok, but adding eg react-codemirror has been a pain
Even with the simpler @uiw/react-textarea-code-editor, I have to: 1. add react/jsx-runtime, 2. add it to the importmap, 3. remove deps with ?external=... (and it still doesn't style right)
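Those three steps could be generated instead of hand-edited: pin the shared React packages, add the `react/jsx-runtime` subpath, and load every other package with React marked external so it resolves to the same copy through the importmap. The versions and `buildImportMap` are placeholders; `?external=` is the esm.sh option already used in these notes.

```typescript
interface ImportMap {
  imports: Record<string, string>;
}

// Shared packages every generated app imports, pinned so all modules resolve
// to one copy. Versions are placeholders.
const SHARED: Record<string, string> = {
  react: "18.3.1",
  "react-dom": "18.3.1",
};

function buildImportMap(extras: Record<string, string>, base = "https://esm.sh"): ImportMap {
  const imports: Record<string, string> = {};
  for (const [pkg, version] of Object.entries(SHARED)) {
    imports[pkg] = `${base}/${pkg}@${version}`;
  }
  // Subpaths that compiled JSX and app entry points import.
  imports["react/jsx-runtime"] = `${base}/react@${SHARED["react"]}/jsx-runtime`;
  imports["react-dom/client"] = `${base}/react-dom@${SHARED["react-dom"]}/client`;
  // Other packages leave the shared deps as bare imports, which this map resolves.
  const external = Object.keys(SHARED).join(",");
  for (const [pkg, version] of Object.entries(extras)) {
    imports[pkg] = `${base}/${pkg}@${version}?external=${external}`;
  }
  return { imports };
}

function importMapScript(map: ImportMap): string {
  return `<script type="importmap">${JSON.stringify(map)}</script>`;
}
```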

One path: just roll my own textarea thing that is natively here
Oh, well, now it’s probably just css, so adding ?css&external=… gets you most of the way

Now something about MIME types, maybe css imports are just going to be hard

Thorny problem: How to deal with CSS imports, if not for Tailwind?

One answer: just use Tailwind.
Another answer: hack the <html> generated to bundle in <link> tag, maybe with Bun plugin

Esbuild recommends this strategy: https://esbuild.github.io/content-types/#css-from-js. For now, might just hardcode this for the text editor.
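For the “hack the generated <html>” route: once the bundler has split imported CSS into its own file (the esbuild strategy above), inject a `<link>` for it. Sketch with hypothetical helper names; the href is whatever the build actually emits.

```typescript
function escapeAttr(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

// Insert <link rel="stylesheet"> tags just before </head>, or at the top if
// the document has no head.
function injectStylesheets(html: string, hrefs: string[]): string {
  const tags = hrefs.map((href) => `<link rel="stylesheet" href="${escapeAttr(href)}">`).join("\n");
  if (!tags) return html;
  const headClose = html.search(/<\/head>/i);
  if (headClose === -1) return tags + "\n" + html;
  return html.slice(0, headClose) + tags + "\n" + html.slice(headClose);
}
```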

Musing: Debuggability is not super there atm, want to expose more of the guts (or, build the thing that makes the guts exposable)

@September 16, 2024

esm.sh is kinda amazing.
Also https://docs.val.town/reference/import/
importmapping esm.sh

Pros

Easier to read html output
Sets stage for esm.sh importing of external NPM modules

Cons

Slower than full bundling? (maybe on first load but not after caching the “react” import)
One more API for me/Claude to learn

Some things don’t work and then take some debugging
So possibly esm.sh path still requires some kind of whitelisting

(alternatively, could self-host a version of esm.sh that runs on Bun and thus does fun bundle things)

Hm, packaging @uiw/react-codemirror has been kinda hard with esm.sh

react-dom.production.min.js:127 Uncaught Error: Unrecognized extension value in extension set ([object Object]). This sometimes happens because multiple instances of @codemirror/state are loaded, breaking instanceof checks

https://gist.github.com/Potherca/028514c75f581db115797ecb50c6f945 but the example doesn’t even work for regular codemirror
Though this does work, so maybe just version mismatch? https://glitch.com/edit/#!/furtive-bird-alder

Okay CodeMirror is a bit nicer but since it doesn’t work with importmaps, replacing. Tried https://esm.sh/@uiw/react-codemirror@4.23.0?external=react,react-dom but no dice on the multi-load problem.

See discussion here: https://discuss.codemirror.net/t/uncaught-error-unrecognized-extension-value-in-extension-set-object-object-this-sometimes-happens-because-multiple-instances-of-codemirror-state-are-loaded-breaking-instanceof-checks/7898/8

@September 15, 2024

pricing: Claude 3.5 Sonnet is $15/Mtok for output, aka:

1.5c per 1000 tokens (a typical response)
at ~100tok/s, ~10c/min or $6/h. (So can’t bankrupt myself running Claude Sonnet continuously)
Claude Haiku is 2x faster and 10x cheaper so maybe just use that
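The arithmetic above as code, using only the numbers in this note ($15 per million output tokens, ~100 tok/s).

```typescript
// Cost of generating `tokens` output tokens at a given $/million-token price.
function outputCostUSD(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1_000_000) * usdPerMillionTokens;
}

// Cost of one hour of continuous generation at a given speed.
function hourlyCostUSD(tokensPerSecond: number, usdPerMillionTokens: number): number {
  return outputCostUSD(tokensPerSecond * 3600, usdPerMillionTokens);
}

const SONNET_OUTPUT = 15; // $/Mtok, Claude 3.5 Sonnet output, per the note

console.log(outputCostUSD(1_000, SONNET_OUTPUT).toFixed(3)); // "0.015" ($ per 1k-token response)
console.log((hourlyCostUSD(100, SONNET_OUTPUT) / 60).toFixed(2)); // "0.09" ($ per minute)
console.log(hourlyCostUSD(100, SONNET_OUTPUT).toFixed(2)); // "5.40" ($ per hour)
```

The exact figures (~9c/min, ~$5.40/h) round up to the ~10c/min and ~$6/h above.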

Prompt architecture?

Maybe prompts should be stored in /codegen instead of in repo

… once codegen versioning is anything like git

how to bootstrap the prompt/Claude API calls?

abstract away:

File read/writes
API keys
internet read/writes
Someplace to store this code
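One way to abstract those four things: a capability interface that generated code talks to, with swappable backends. All names here are hypothetical; the in-memory version is the kind of thing that would make previews and tests easy.

```typescript
interface Platform {
  // File read/writes
  readFile(path: string): Promise<string>;
  writeFile(path: string, contents: string): Promise<void>;
  // API keys: resolved on the platform side, never shipped to the client
  secret(name: string): string | undefined;
  // Internet read/writes: GET when body is omitted, POST otherwise
  http(url: string, body?: string): Promise<string>;
  // Someplace to store this code
  saveCode(appId: string, code: string): Promise<void>;
  loadCode(appId: string): Promise<string | undefined>;
}

// In-memory backend with canned HTTP responses keyed like "GET <url>".
class MemoryPlatform implements Platform {
  private files = new Map<string, string>();
  private apps = new Map<string, string>();
  private secrets: Record<string, string>;
  private responses: Record<string, string>;

  constructor(secrets: Record<string, string> = {}, responses: Record<string, string> = {}) {
    this.secrets = secrets;
    this.responses = responses;
  }

  async readFile(path: string): Promise<string> {
    const contents = this.files.get(path);
    if (contents === undefined) throw new Error(`no such file: ${path}`);
    return contents;
  }
  async writeFile(path: string, contents: string): Promise<void> {
    this.files.set(path, contents);
  }
  secret(name: string): string | undefined {
    return this.secrets[name];
  }
  async http(url: string, body?: string): Promise<string> {
    const key = `${body === undefined ? "GET" : "POST"} ${url}`;
    const res = this.responses[key];
    if (res === undefined) throw new Error(`no canned response for ${key}`);
    return res;
  }
  async saveCode(appId: string, code: string): Promise<void> {
    this.apps.set(appId, code);
  }
  async loadCode(appId: string): Promise<string | undefined> {
    return this.apps.get(appId);
  }
}
```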

Misc musing: RTS like Age of Mythology except you spend your “gold” on training LLMs?

Or imagine a CTF where everything is constructed via LLM and you are the one allocating funding to agents which you think are doing a better job

@September 2, 2024

Flyio mysteries while trying to get push & pull to volumes working:

Why is the app down sometimes?
… why isn’t there any easy way to sync local files to cloud files?

Trying out Render — seems okay too, on the paid $7/mo plan

@July 29, 2024

Hm, doesn’t build correctly in prod…
Should we build as a single HTML file?

The completely bundled js file seems to cause string escape issues (eg a stray </script> might kill it)
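The usual fix for a stray `</script>` when inlining a bundle: rewrite `</script` as `<\/script`, which means the same thing inside JS string, template and regex literals but no longer closes the HTML element. Sketch with hypothetical helper names; I haven’t confirmed this is what breaks the prod build.

```typescript
// Escape any "</script" (any case) so it can't terminate the inline <script>.
function escapeInlineScript(js: string): string {
  return js.replace(/<\/(script)/gi, "<\\/$1");
}

// Inline the bundle as a module script just before </body>.
function inlineBundle(html: string, js: string): string {
  const tag = `<script type="module">${escapeInlineScript(js)}</script>`;
  const bodyClose = html.lastIndexOf("</body>");
  return bodyClose === -1 ? html + tag : html.slice(0, bodyClose) + tag + html.slice(bodyClose);
}
```

Splicing with `slice` rather than `html.replace("</body>", tag)` also keeps `$&` or `$'` sequences in minified code from being treated as replacement patterns.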

The issue only happens on prod though — maybe it’s a configuration of Bun.build?
Or conflicting bun versions?

But a relative /dist/ file is somewhat more annoying to send around

Could try to host it statically?

Maybe that’s the overall architecture of buni

Dev notes @July 24, 2024

Architecture questions:

Where to store LLM-generated code?

First guess: filesystem at buni/austin/counter
Is there a package.json?

If so: adds complexity
If not: how to specify imports between LLM generated code?

How to compile and serve it?
What is the ownership model of data?

Note: NPM packages are published built

See https://claude.ai/chat/67c1ba48-09dc-4427-a0ad-a2a087605bbc
Do we want NPM compatibility (for plugins/ecosystem; don’t reinvent the wheel)

Do we want NPM interoperability (so people can literally import @buni/austin/counter…?)

Alternatively: Deno’s JSR? https://deno.com/blog/jsr_open_beta

Things we’re trying to replace

Twitter as a town square
Traditional hosting of web apps
NPM as a package registry

(Github as a source code ecosystem)

What is the first use case?

Building a bunch of my own small side projects?

eg Contact Swap?

Hosting blogs?

Self expression?

Minigames? Like the boardless games vision?
Calculators? Like microcovid, kelly bet? Or RPS poker calculator?
Connecting to AI functionality? Like midjourney, chatgpt?

“Zapier for AI”

Q: What have Claude Artifacts been most used for?
Q: Which things require user navigation? Where do people spend most of their time?

Productivity (Notion, VS Code)
Social (Twitter, Reddit)

Information (Substack)

Q: What could go viral?

The glitch meme generator