Teaching AI to write lyrics

I have been on a bit of a kick to get large language models (LLMs) to generate interesting lyrics for songs. Over time, it’s become quite an adventure, a fun exploration of LLM capabilities. Here’s the multi-part account of this quest, offered as a loose timeline and progression of techniques, interspersed with pictures of and links to working prototypes, all made with Breadboard Visual Editor.

Part 1: A simple song writer

At first blush, it seems too easy. Just ask ChatGPT or Gemini to write lyrics – and it will happily oblige, generating rhyming verses and a chorus for us. The first few times, it’s pretty neat. Building something like this in Breadboard takes a few seconds:
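For reference, here’s roughly what that one-step setup amounts to outside of Breadboard – a minimal sketch using the official @google/generative-ai Node SDK, with the model name and prompt as illustrative placeholders rather than the actual board configuration:

// A minimal sketch of the one-step song writer: one prompt, one model call.
// Assumes the official @google/generative-ai Node SDK and a GEMINI_API_KEY
// environment variable; the model name and prompt are placeholders.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

async function writeLyrics(idea: string): Promise<string> {
  const result = await model.generateContent(
    `Write lyrics for a song about: ${idea}. Include verses and a chorus.`
  );
  return result.response.text();
}

writeLyrics("a road trip at dawn").then(console.log);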

However, on the fourth or fifth try, a glaring problem becomes apparent: models are rather uncreative and lazy. The rhymes are repetitive. The metaphors are quite limited. The interpretation of ideas is too literal. It’s like the LLM lacks imagination, producing something average that is only impressive the first time.

And it kind of makes sense. What we’re observing here is a process of inference: an LLM trying its best to predict what might be a natural completion of – let’s admit it – a rather average request.

Part 2: A lyrical factory

To address the mundanity of the model, my first idea was to catalyze it with the power of methodology. As we know well from how human organizations work, practices and methodology are often responsible for a dramatic increase in the quality of the product. Workers who aren’t deeply familiar with the nuance and craft – who perhaps aren’t even particularly skilled – can be given simple tasks, arranged into production lines, and still deliver quality results.

Applying industrial-age technology of organizing workers to LLMs  is something I already wrote about a while back, and it’s not a difficult thing to imagine. Let’s break down the process of creating a song into some components, and then form a production line, focusing an LLM on one task at a time.

After a brief foray into researching song-writing best practices, here’s what I ended up building:

First, we have a “Theme Developer”, who is instructed to develop a theme based on the provided material. Next, we have a “Copywriter” who is asked to write a storyline based on the theme that was developed. After that, the “Hook Artist” steps in to develop catchy hooks for the lyrics. And finally, a “Lyricist” completes the job to create the lyrics.
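For the curious, here’s the gist of that production line, sketched as plain sequential LLM calls rather than the actual Breadboard components. The instructions below are heavily abbreviated stand-ins for the real prompts, and generate is whatever model call you prefer (the writeLyrics-style helper above would do):

// Sketch of the lyrical factory: each worker is a single-task LLM call,
// and each call sees only its instruction plus the accumulated context.
type Worker = { name: string; instruction: string };

const factory: Worker[] = [
  { name: "Theme Developer", instruction: "Develop a theme based on the provided material." },
  { name: "Copywriter", instruction: "Write a storyline based on the developed theme." },
  { name: "Hook Artist", instruction: "Develop catchy hooks for the lyrics." },
  { name: "Lyricist", instruction: "Write the lyrics using the theme, storyline, and hooks." },
];

async function runFactory(
  material: string,
  generate: (prompt: string) => Promise<string>
): Promise<string> {
  let context = material;
  for (const worker of factory) {
    context = await generate(`${worker.instruction}\n\n${context}`);
  }
  return context; // the Lyricist's output: the finished lyrics
}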

After much tweaking, I settled on the technique of “reminding the model” for my prompt design. The basic premise of this technique is that LLMs already have all of the necessary information. Our problem is not that they don’t have the knowledge. Our problem is that they don’t know which information is important right now. For instance, the “Copywriter” is reminded to use the Freytag Pyramid, but I don’t include its definition in the prompt.

Even with this shortcut, the prompts for each worker in our lyrical factory ended up being quite elaborate (you can see each of them by clicking on each individual component in Breadboard Visual Editor and looking at the “Instruction” field). Song-writing is quite a process, it turns out.

The resulting lyrical factory produced much more interesting results: it was fun to see how the LLM would make up details and themes and then build out a story to base the lyrics on. Especially in situations when I didn’t quite know what I was looking for in the lyrics, it worked surprisingly well.

Lyrics for the song “The Blade” were written by the factory. It’s a pretty nice song. However, if we compare the factory’s original lyrics with what actually went into the song, the difference is pretty dramatic. The original feels like the lyricist is walking on stilts.

This became a bit of a pattern for me. After generating a few outputs, I pick the one that’s closest to what I want and then edit it to turn it into a final song. And with that came the ongoing challenge: teaching an LLM to require fewer edits to get the lyrics into decent shape.

Part 3: The loosey-goosey duo

So, the results were better and more creative, but not quite there. I decided to see if I could mix things up a bit. Instead of a predefined sequence of steps, I employed the Plan + Execute pattern. In this pattern, there’s one planner who decides what the steps should be. Once these steps have been decided, a model is invoked multiple times, once for each step, to eventually produce the final result.

Compared to the lyrical factory, this approach adds another level of creativity (or at least, randomness) to the overall process. Lucky for me, it’s super-easy to implement the P+E pattern in Breadboard with the Looper component (here’s the board).

The prompt for it is fairly straightforward:

Develop a step-by-step plan to write lyrics for a modern hit song 
based on the source material from building a list of stereotypes
and cliches in popular songs that might be related to the source
material that are already overused, to developing themes, to
writing the storyline, to creating catchy hooks, to researching
existing lyrics on the selected theme as inspiration, with the
final task of writing the lyrics. Each task must be
a comprehensive job description and include self-reflection
and critique.

Note the “remind the model” pattern. Instead of specifying what the steps are, I just remind the LLM of some steps that a good plan might contain.

The structure of the team also became much simpler. Instead of multiple workers in the factory line, it’s a loosey-goosey duo of a “Writer” and a “Planner”, who jam together to produce lyrics. The “Planner” sets the pace and manages the execution of the steps in the plan. The “Writer” goes hard at each particular task.
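Stripped of the Breadboard specifics, the shape of the duo looks roughly like this – a sketch, not the actual Looper implementation (the real component uses structured output rather than splitting the plan on newlines):

// Sketch of the Plan + Execute pattern behind the Planner/Writer duo.
// `generate` is any LLM call; `planPrompt` is the planning prompt above.
async function planAndExecute(
  sourceMaterial: string,
  planPrompt: string,
  generate: (prompt: string) => Promise<string>
): Promise<string> {
  // The Planner turns the request into a list of steps (one per line here).
  const plan = await generate(`${planPrompt}\n\nSource material:\n${sourceMaterial}`);
  const steps = plan.split("\n").filter((line) => line.trim().length > 0);

  // The Writer goes hard at each step, carrying prior work forward.
  let workSoFar = `Source material:\n${sourceMaterial}`;
  for (const step of steps) {
    const result = await generate(`${workSoFar}\n\nYour current task: ${step}`);
    workSoFar = `${workSoFar}\n\n${step}\n${result}`;
  }
  return workSoFar; // the final step's output contains the lyrics
}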

Surprisingly (or maybe not), this configuration ended up producing the most creative results. It was downright fascinating to watch the “Planner” make up steps that I could have never imagined, like requesting to draw inspiration from existing songs or have multiple rounds of self-critique. Some plans would be just three-to-four steps, and some would run for a while.

My favorite output of this creative duo is probably the “Chrysalis” song, which was a result of me showing the duo my Chrysalis poem. My mind boggled when I saw the coming-of-age, small-town girl story that the loosey-goosey team came up with. Not far behind is the “Sour Grapes” track, which was generated from the Fox and the Grapes fable. It’s pretty cool, right?

Unfortunately, just like any P + E implementation (and any highly creative team composed of actual humans, for that matter), the duo setup was somewhat unreliable. Often, instead of producing a song, I would get back a transcript of collaborators quarreling with each other, or something entirely irrelevant, like the model pleading to stop asking it to promote the non-existent song on Instagram. It was time to add some oversight.

Part 4: Just add human

Breadboard’s Agent Kit had just the ingredient I needed: a Human component. I used it as a way to insert myself into the conversation, and help steer it with my feedback. While in the previous design, I would only provide input once, in this one, I get to speak up at every turn of the plan, and let the duo adjust the plan accordingly.

It was a foolproof plan. And indeed, the results were immediately more relevant to what I was looking for. Except… it was no longer fun. I found it exhausting to be part of the team, having to speak up and offer my thoughts. So more often than not, I would offer some terse reply like “yeah good” or “make it more edgy”, and couldn’t wait until the session was over. I came here for casino creativity, not to sweat over the details with the crew.

Moreover, the challenge I posed for myself at the start was still not addressed. Way too many edits were still necessary to bring the lyrics to the shape where they didn’t smell AI-generated. This smell is something that everyone who tried generating songs with AI can recognize. Each model has their own repertoire of words that it’s prone to inserting into the lyrics. There’s always  “neon lights”, “whispers”, a “kaleidoscope” or two, something is always “shimmering”, and of course, there’s a “symphony” of something. And OMG, please stop with the “embrace”. It’s like, everywhere.

Part 5: A few-shot rethink

For this last leg of the trip, I tried to take an entirely different approach. For a while, the idea of employing few-shot prompting had been percolating in my brain. Perhaps I could dislodge my synthetic lyricist from its rut by giving it a few examples? But how would I go about doing that?

My first idea was to add an “Editor” worker at the end of the process, and give it all of the “before/after” pairs of my edits with the task of improving the produced lyrics. By then, I had accumulated quite a collection of these, and it seemed reasonable to try. Unfortunately, there’s something about LLMs and lyrics that makes them not do well with “improvements” or “edits” of lyrics. You’re welcome to try it. Any attempt at improving lyrics with suggestions just produces even more average lyrics.

I had to come up with a way for the ersatz lyricist’s creativity to be stimulated by something real.

What if I gave it a few examples of existing similar lyrics? Would that work?

RAG  to the rescue. For those not following the whole AI rigamarole closely, RAG, or retrieval-augmented generation, is a fairly broad – and growing! – collection of techniques that typically take advantage of vector embeddings.

With a bit of ETL elbow grease, I built a simple semantic store with Qdrant that contained about 10K songs from a lyric database I purchased at  Usable databases. With this semantic store, I could now take any prompt, generate an embedding for it, and then search the store, returning the top N results. Easy-peasy in Breadboard:
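Outside of Breadboard, that step amounts to roughly the following – a sketch assuming the @qdrant/js-client-rest client, a collection named “lyrics”, and an embed helper that wraps whichever embedding model you use (all of these names are illustrative):

// Sketch of the retrieval step: embed the prompt, pull the top-N similar
// songs from the semantic store. Collection name, env variables, and the
// embed helper are illustrative, not the actual board wiring.
import { QdrantClient } from "@qdrant/js-client-rest";

const qdrant = new QdrantClient({
  url: process.env.QDRANT_URL,
  apiKey: process.env.QDRANT_API_KEY,
});

declare function embed(text: string): Promise<number[]>; // embedding model of choice

async function findSimilarSongs(prompt: string, topN = 5) {
  const vector = await embed(prompt);
  const hits = await qdrant.search("lyrics", {
    vector,
    limit: topN,
    with_payload: true, // each payload holds a song's title and lyrics
  });
  return hits.map((hit) => hit.payload);
}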

Great! I can get N songs related to the prompt! What do I do with them? 

I then spent a bunch of time playing with various ideas and eventually settled on an architecture. Instead of the loosey-goosey flow of steps, I decided on a simple prompt -> proposal -> lyrics pipeline. A song is produced with two LLM invocations: one – let’s call it “Proposal Developer” – to generate a song proposal from a prompt, and one (the “Song Writer”) to generate lyrics from the proposal. Each invocation uses a few-shot prompt. The first one has a bunch of examples of how to create a song proposal from a prompt, and the second one has a bunch of examples of how to produce lyrics from the proposal.

Wait a second. I only have the lyrics examples. To implement the few-shot prompting properly, I need to provide two sets of pairs: a “prompt -> proposal” set for the “Proposal Developer” and a “proposal -> lyrics” set for the “Song Writer”.

Welcome to the wonderful world of synthetic datasets. If we don’t have the data, why not just generate it? That’s right, I came up with a simple system that works backwards: it takes a song and then comes up with a proposal for it. Then, in a similar fashion, it takes a proposal and comes up with a prompt that might have been used to generate it. Wiring it all together, I finally had a working system.
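In sketch form, the backwards trick looks something like this. The instruction wording is a stand-in, generate is any LLM call, and the resulting pairs become the few-shot examples for the two stages:

// Sketch of the backwards synthetic-data generation: from lyrics to a
// proposal, from a proposal to a plausible prompt.
type Example = { prompt: string; proposal: string; lyrics: string };

async function buildExample(
  lyrics: string,
  generate: (prompt: string) => Promise<string>
): Promise<Example> {
  const proposal = await generate(
    `Here are song lyrics. Write the song proposal that could have produced them:\n\n${lyrics}`
  );
  const prompt = await generate(
    `Here is a song proposal. Write the short user prompt that might have led to it:\n\n${proposal}`
  );
  return { prompt, proposal, lyrics };
}

// The pairs then feed the few-shot prompts: "prompt -> proposal" examples
// for the Proposal Developer, "proposal -> lyrics" examples for the Song Writer.
function fewShot(pairs: [string, string][], task: string, input: string): string {
  const examples = pairs
    .map(([from, to]) => `Input:\n${from}\nOutput:\n${to}`)
    .join("\n\n");
  return `${task}\n\n${examples}\n\nInput:\n${input}\nOutput:\n`;
}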

Immediately, I could tell the difference. The introduction of few-shot examples dramatically changed the LLM’s lyrical prowess, making it more flexible, interesting, and alive. It would still insert “whispers” and an occasional “symphony” here and there, but the change felt significant. Trippy stuff. Here are a few examples: “The Algorithm of Us”, where I literally just copy-pasted the output (I even forgot to remove the title of the song from the body 🤦). The “Uhh… What Were The Words Again?” track ended up being particularly cute. I did have to make some edits to these lyrics, but only to make them more palatable for Suno’s consumption.

As a finishing touch, I multiplied the lyricists to produce three variants of lyrics. This was a last-minute thought: what if I give song writers different personas? Would their output vary? Turns out, it does! And, just like Midjourney and Suno and Udio found out, offering more than one output is a valuable trick.
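The fan-out itself is simple – something along these lines, with the personas made up purely for illustration:

// Sketch of fanning the Song Writer out into three personas.
const personas = [
  "a seasoned songwriter who favors concrete imagery",
  "an indie artist who writes sparse, conversational lines",
  "a pop lyricist obsessed with hooks and repetition",
];

async function writeVariants(
  proposal: string,
  generate: (prompt: string) => Promise<string>
): Promise<string[]> {
  return Promise.all(
    personas.map((persona) =>
      generate(`You are ${persona}. Write lyrics for this proposal:\n\n${proposal}`)
    )
  );
}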

Intentionally skipping over some lazy synthetic data generation techniques (this post is already too long), here is the final design.

Sadly, you won’t be able to run boards from this part of the journey directly. Unlike the Gemini API key, which can be easily obtained, Qdrant keys are tied to collections, which means that you will need to build your own lyrics database to get it to work. However, if you send an email to dglazkov.writeMeASong@valtown.email, you  will receive back a song that was generated using the body of the email as a prompt. Or, you can peruse this massive side-by-side eval spreadsheet, which is automatically updated by the cron job, filling out variant columns anytime I add a new prompt row.

Part 6: Not the end

I am not quite ready to declare victory yet. As the quality of lyrics improved, flaws in other parts of the system became more apparent. For instance, the final variant is not that great at proposal generation. And that makes sense, since the proposals aren’t original  – they are synthetic data, which means that the creative potential of the proposals  is limited by the confines of model inference. Which means that there’s work to be done.

One big lesson for me in this project is this: while it may seem like turning to LLMs to perform a task is a simple step, when we dig down into the nature of the task and really try to get that task done reliably and with the quality we desire, simple prompt engineering won’t do. We need AI systems composed of various AI patterns to take us where we want to go.

Jamming with Udio

Today, I had my first jam session with Udio. With the introduction of audio prompting, I am now able to use my own sound as the starting point for a generated track. This seems like a leap forward, despite the actual product still being quite clunky. It’s a leap, because through introducing audio upload, Udio managed to merge casino creativity with the other, more traditional kind. Let me walk you through what I found.

As my first try, I fed Udio one of my old abandoned loops. As part of making music, I often arrive at dead ends: promising loops that I can’t figure out what to do with. I have a bunch.

Then, I extended this loop from both ends with Udio, producing a decent trance track in a matter of a few minutes. It’s not going to win any awards, but it’s definitely farther than I’ve been able to walk on my own.

Here’s the original loop that I made a while back:

Here’s the finished Udio track: https://www.udio.com/songs/usCmcABg3yP4aC1J8S5WCA

Extending existing audio clips fits well into the standard Udio process.

In the standard Udio process, we get 32 seconds of audio as a starting point, and then we iteratively extend this audio from either end to produce music that we like. Each iterative extension is an opportunity to make some choices – the casino creativity at its finest.

When uploading the audio prompt, the prompt becomes the first N seconds of the audio (the N depends on the length of the audio prompt we load).

Udio tries to match the prompt’s tempo and the style, expanding it, and in the process, riffing on it. It feels like extrusion: pushing more music through the template that I defined. As Udio expands the clip, it adds new details to what was in the original clip, trying to predict what might have been playing before or after that clip.

You can still hear the original loop in the finished track at 3:12. It is bookended by entirely new sound that now fits seamlessly around it. The music around it is something that Udio extruded, generating it using the original loop as a template.

The presence of the original loop hints at the connection between two kinds of creativity that I mentioned earlier. For instance, I could imagine myself sitting down with Ableton and building out a catchy loop, then shifting to Udio to help me imagine the track that would contain this loop. I could then go back to Ableton and use the results of our little jam session as inspiration.

As my next try, I did something slightly different. I gave a simple melody to Udio and then rolled the casino dice until I got the right sound. At this point, Udio anchors on the audio prompt quite firmly, so if you give it a piano (like I did) and ask for a saxophone, it might take a few attempts to produce a rendition of the melody with a different instrument.

Here, I was looking to create something that sounds like a film score, so I was looking for strings. After a little while, Udio relented and gave me an extension that seemed right.

At that point, I trimmed the original clip from the Udio track. Now that Udio had learned my melody, I no longer needed the original material, since it didn’t fit with the vibe I was looking for. This remove-the-original trick is something I expect to become pretty common. For instance, I could hum a melody or peck it out single-note on my keyboard. My intuition is that Udio will (sooner or later) have a “remix” feature for audio prompts, where I can start with the sound of me whistling a tune and then shape it directly, rather than waiting for the right extension to happen.

Once the right vibe was established, the rest of the process was quite entertaining. It was fun to watch Udio reimagine my original melody in minor for the “scary” part of the movie and boost it with drums and a full orchestra at the climax.

Here’s the original melody:

Here’s the finished track: https://www.udio.com/songs/eCMEFFSGnoicnHR4S5fRK1

In both cases, the process felt a lot more like a jam session than a creative casino, because the final product included a distinct contribution from me. It wasn’t something that I just told Udio to do. I gave it raw material to riff on. And it did a pretty darned good job.

Casino Creativity

I’ve been geeking out on AI-generated music services Suno and Udio, and it’s been super-interesting to see them iterate and ship quickly. It looks like there might be a value niche in this particular neck of the woods in the larger generative AI space. There are tons of users with very active Discord communities for both, and it does not seem like the interest is waning.

The overall arc of the generative music story seems to follow that of Midjourney, with the interest primarily fueled by the phenomenon that I would like to name “casino creativity”. Let’s see if I can define what I mean by that.

I would like to start by positing that the craving to create is in every one of us. Some of us are more blessed than others in also having the skills to satisfy this craving. Moreover, I am going to proclaim that most of us are unable to fully embrace our creative selves because we lack some of the skills required to take flight.

For instance, I can make music. I have been making music since I was a teenager. For me, satisfying my craving for creativity is just a matter of firing up Ableton. When I am skilled in the medium, the friction to create is low. All it takes is being next to my keyboard (and Push), a little inspiration – and a track begins to emerge.

However, I can’t sing. Like, not at all. Like, don’t even ask me. In music school, when trying out for the choir or orchestra, I was asked to sing. After I belted out a few words (not even a full verse!), the teacher yelled: “The Orchestra! The Orchestra!”

Being a music producer without a voice is a story of unrequited love. I have to settle for tracks without lyrics. The instrumentals are nice, but it’s just not the same feeling without a voice.

So obviously, ever since the current generative AI spring blossomed, I’ve been on a quest to find a way to sate this creative craving. I played with Melodyne and Synth V, and while they both offered a path forward, the barrier to entry was just too high. Gaining a voice is not the same as knowing how to sing. It’s about the same distance as between being able to buy a violin and knowing how to play one.

Things started shifting with Chirp. This was the original model created by Suno, and it was Discord-only, very similar to Midjourney – feed it the lyrics alongside a description of the vibe, and out comes a 30-second clip of music. Not just music – it also sang out the lyrics I gave it!

Brain-splosion. Sort of. The output quality of Chirp was pretty weak-sauce. It was not music I could share with anyone except for minor giggles and an eye roll. I forgot about Chirp for a little while, until this spring, when Suno came out with v3 of their sound model. I heard about it from Alex, whose work colleagues composed various songs to celebrate his last day at Stripe.

Ok, now we were getting somewhere. Songs generated with Suno v3 possessed that extra emotional weight that made them nearly passable as listenable music. When Udio came out shortly after with their own model, it raised the bar even higher. I was blown away by some of the output. Just like that, my age of voiceless musicing was over. I could type in some lyrics and get back something that expressed them back to me as music.

Every generation took only about a minute and produced two variants for me to pick from. I could choose the one I like and extend it or remix it – or roll the dice again. All it takes is a click.

It’s this metaphorical rolling of the dice that gives the name to the titular term. As I was pushing the “Create 🎶” button, I realized that the anticipation of the output came with a pronounced dopamine hit. What will come out? Will it be something like Duran Duran? Or maybe more like Bono? Will it go in a completely different direction? Gimme gimme gimme. I was hooked on Suno.

Casino creativity is a form of creative expression that emerges when the creative environment has such a low barrier to entry that the main way to express my creativity is through providing preference: selecting one choice out of a few offered. A creative casino is a place where all I need to bring is my money and my vibes: everything else will be provided. 

Midjourney is one of the first environments where I experienced casino creativity. There’s something subtly addictive about looking for that prompt and seeing those 4-up images that pop out. I know peeps who can spend a very long time tweaking and tuning their inputs. We could argue that prompt craftsmanship itself is a skill that must be acquired. But this skill has a short expiration date – as the models improve and change, the need for prompt-foo diminishes rapidly.

In the end, what we’re left with is pressing the button and making choices. Casino creativity is less about the skill and more about the vibes.

Not to say that casino creativity isn’t able to produce interesting – and perhaps even beautiful – things. Vibes are important – and some of us have more latent vibes hidden within us than we could ever realize. Ultimately, casino creativity is very similar in spirit to the democratization of writing that we’ve seen with the Web. I am not yet ready to proclaim that casino creativity is somehow less intriguing and full of potential than any other type of creativity. Just like my Midjourney-obsessed friends, I can see how unleashing one’s creative energy might lead to surprising and wonderful results.

Here’s a twist though. As long as I have the credits to roll the dice, I can see if my vibes work for others. Both Suno and Udio are vying to be the place where music happens. I can look at what’s popular and peruse the top charts. It’s all very naive and simplistic at the moment. 

Yet, when executed ruthlessly (and it’s inevitable that somebody will do this), the creative casino is not just the place where I can express my creativity. It’s also the place where I can get the extra dopamine release of seeing my song climb the charts – of my vibes becoming recognized. Come for the vibes, stay for the likes.

An interesting effect of introducing generative AI, it seems, is that we’re likely to see more creative casinos and more ventures capitalizing on casino creativity itself. And we have to ponder the implications of that.

Chrysalis

A moment of clarity
I suspect it’s playing a game
I reach out, and it’s gone
My unreality
By a different name
Is what yet to be drawn.

What needs to be done
Always feels right
And when the story had spun
Shying from light,
It always begs to forget
Filling the stores of regret.

Do caterpillars dream of flying?
Do they know they will have wings?
Do they realize that being land-bound
Is just a temporary thing?

Imperceptible? Immense?
I can’t tell, barely there myself
Unable to keep the facade
of pretense
swallowed by the intense
losing all sense
of space, time, and self.

Losing all sense
Changing, yet staying the same,
Thrashing my wits and will,
Am I still me?
In my defense,
This question is unanswered still,
While being reframed.
What will I be?

The third option

I facilitated a workshop on systems thinking recently (or “lensical thinking” as I’ve come to call it). The purpose of the lensical thinking workshop is to provide a coherent set of tools for approaching problems of a particularly unsolvable kind, using lenses as the key organizing concept.

One of the participants became quite fond of the polarity lens, realizing how broadly applicable it is, from the larger organizational challenges (like “centralized vs. decentralized”), to something as in-the-moment as hidden tension in a conversation. 

The participant even pointed out that the use of lenses, at its core, has a polarity-like quality: a wide diversity of lenses brings more nuance to the picture and, as such, makes the picture less simple and crisp. The desire for clarity and the desire to see the fuller picture are in tension, causing us – if we’re not careful – to swing wildly between the two extremes.

That one resonated with me, because it’s a very common problem that every leader faces: in a vastly complex space of mostly unsolvable problems, how do they speak with clarity and intention while still leaving room for nuance and conveying how fuzzy everything actually is?

The deeper insight here is that we are surrounded by polarities. Behind every bad decision, there is undoubtedly a pair of what seem like two equally bad conflicting options and the agony of having to commit to one – often knowing full well that some time from now, we’ll have to swing all the way to the other side. In that moment, the polarities rule us, both individually and collectively.

In a separate conversation with a dear friend of mine, we arrived at a point of clarity when we both saw how a problem we’ve been looking at was a gnarly polarity. We saw how, over time, the only seemingly available options had this quality of “either do <clearly unsatisfying thing> or do <another clearly unsatisfying thing>”, repeating over and over.

When that happened, we both felt a bit of despondence setting in. This has been going on for a long time. What is the way out of this swing of a pendulum? Is there even one?

This question was surprisingly helpful: “What’s the third option?” When caught in the grip of the polarity, it feels counterintuitive to slow down and look around for something other than the usual. However, given that the other choices leave us animated by the pendulum, the least we can do is invest in observing what is happening to us.

The third option is rarely that of deciding to do better this time, now that we’ve seen how we are subject to polarity. We may claim to have a more active role, to elevate the polarity into something productive, and many consulting hours are spent heaving that weight. Unfortunately, the pendulum cares very little about that. And inevitably, we find ourselves back in its grip.

The third option is rarely that of nihilism, of deciding that the world around us is inherently bad and we are better off uncontaminated by it. The WarGames ending was beautiful, but in a polarity, it’s an option that is chosen when guided by naivete – for nihilism and naivete are close cousins.

The third option is rarely that of avoidance – even if it’s avoidance through drastic action, like proclaiming that we’re moving to an island/starting a new business/adventure, or joining a group of like-minded, clear-sighted individuals that suspiciously smells like a cult. When we choose this path, we mustn’t be surprised when the same old chorus cuts into our apparently new song.

The presence of a polarity is a sign that our thinking is too constrained, flattened either by the burden of the allostatic load or simply by the absence of experience with higher-dimensional spaces. The search for the “third option” is an attempt to break out of a two-dimensional picture – into the third dimension, to see the limits of movement along the X and Y axes and look for ways to fold our two-dimensional space into the Z axis.

Put differently, polarities are a sign that our mental models are due for reexamination. Polarities, especially particularly vicious ones, are a blinking light on our dashboard of vertical development. This means that the third option will not be obvious and immediately seen. “Sitting with it” is a very common refrain of third-option seekers. The Z axis is hard to comprehend with a two-dimensional brain, and some serious stretching of the mind will be necessary to even catch the first glimpses of it.

Most significantly, that first glimpse or even a full realization doesn’t necessarily bring instant victory. Orthogonality is a weird trick. Opening up a new dimension does not obsolete the previous ones – it just creates more space. Those hoping for a neat solution will be disappointed.

Instead of “solving” a polarity, we might find a whole different perspective, which may not change the situation in a dramatic way – at least not at first. We might find that we are less attached to the effects of a pendulum. We might find that we no longer suffer at the extremes, and have a tiny bit more room to move, rather than feeling helpless. We might find that the pendulum swings no longer seem as existential. And little by little, its impact on us will feel less and less intense.

Flexibility of the medium

I did this fun little experiment recently. I took my two last posts (Thinking to Write and The Bootstrapping Phase) and asked an LLM to turn them into lyrics. Then, after massaging the lyrics a bit to better fit the message I wanted to come across, I played with Suno for a bit to transform them into short, 2-minute songs – vignettes of sorts for my long-form writing. Here they are:

Unbaked Cookie Testers on Suno

Catchy, content-ful, and in total, maybe 20 minutes to make. And, it was so much fun! I got to think about what’s important, and how to express somewhat dry writing in an emotionally interesting way. I got to consider what music style would resonate with what I am trying to convey in the original content.

This got me thinking. What I was doing in those few minutes was transforming the medium of the message. With generative AI, the cost of medium transformation seems to be going down dramatically. 

I know how to make music. I know how to write lyrics. But it would have taken me hours of uninterrupted time (which would likely translate into months of elapsed time) to actually produce something like this. Such investment makes medium transformation all but prohibitive. It’s just too much effort.

However,  with the help of a couple of LLMs, I was able to walk over this threshold like there’s nothing to it. I had fun, and – most importantly – I had total agency in the course of the transformation. I had the opportunity to tweak the lyrics. I played around with music styles and rejected a bunch of things I didn’t like. It was all happening in one-minute intervals, in rapid iteration.

This rapid iteration was more reminiscent of jamming with a creative partner than working with a machine. Gemini gave me a bunch of alternatives (some better than others), and Suno was eager to mix bluegrass with glitch, no matter how awful the results. At one moment I paused and realized: wow, this feels closer to the ideal creative collaboration than I’ve ever noticed before.

More importantly, the new ease of medium transformation opens up all kinds of new possibilities. If we presume – and that’s a big one – for a moment that the cost of medium transformation will indeed go down for all of us, we can now flexibly adjust the medium according to the circumstances of the audience.

The message does not have to be locked in a long-form post or an academic tome, waiting for someone to summarize it in an easily consumable format. We can turn it into a catchy tune, or a podcast. It could be a video. It could be something we don’t yet have, like a “zoomable” radio station where I listen to a stream of short-form snippets of ideas, and can “zoom in” to the ones I am most interested in, pausing the stream to have a conversation with the avatar of the author of the book, or have an avatar of someone I respect react to it. I could then “zoom out” again and resume the flow of short-form snippets.

Once flexible, the medium of the message can adapt and meet me where I am currently.

The transformation behind this flexibility will often be lossy. Just like the tweets pixelate the nuance of the human soul, turning a book into a two-verse ditty will flatten its depth. My intuition is that this lossiness and the transformation itself will usher in a whole new era of UX explorations, where we struggle to find that new shared way of interacting with the infinitely flexible, malleable canvas of the medium. Yup, this is going to get weird.

The Bootstrapping Phase

I think I have a slightly better way of describing a particular moment in a product’s life that I alluded to in Rock tumbler teams, Chances to get it right,  and later, Build a thing to build the thing. I call this moment the “bootstrapping phase.” It very much applies to consumer-oriented products as well, but is especially pronounced – and viscerally felt – in the developer experience spaces.

I use the term “bootstrapping phase” to point at the period of time when our aspiring developer product is facing a tension of two forces. On one hand, we must start having actual users to provide the essential feedback loop that will guide us. On the other hand, the product itself isn’t yet good enough to actually help users.

The bootstrapping phase is all about navigating this tension in the most effective way. Move a little too much away from having the feedback loop, and we run the danger of building something that nobody wants. Go a little too hard on growing the user base, and we might prematurely conclude the story of the product entirely.

The trick about this phase is that all assumptions we might have made about the final shape of what we’re building are up in the air. They could be entirely wrong, based on our misunderstanding of the problem space, or overfit to our particular way of thinking. These assumptions must face the contact with reality, be tested – and necessarily, change.

The word “bootstrapping” in the name refers to this iterative process of evolving our assumptions in collaboration with a small group of users who are able and eager to engage.

Those of you hanging out in the Breadboard project heard me use the expression “unbaked cookies”: we would like to have you try the stuff we made, and we’re pretty sure it’s not yet cooked. Our cookies might have bits of crushed glass in them, and we don’t yet know if that cool new ingredient we added last night is actually edible. Yum.

At the bootstrapping phase of the project, the eagerness to eat unbaked cookies is a precious gift. I am in awe of the folks I know who have this mindset. For them, it’s a chance to play with something new and influence – often deeply – what the next iteration of the product will look like. On the receiving end, we get a wealth of insights they generate by trying – and gleefully failing – to use the product as intended.

For this process to work, we must show a complementary eagerness to change our assumptions. It is often disheartening to see our cool ideas be dismantled with a single click or a confused stare. Instead of falling prey to the temptation of filtering out these moments, we must use them as guiding signals – these are the bits that take us toward a better product.

The relationship between the bakers of unbaked cookies and cookie testers requires a lot of trust – and this can only be built over time. Both parties need to develop a sense of collaborative relationship that allows them to take risks, challenging each other. As disconcerting as it may be, some insights generated might point at fundamental problems with the product – things that aren’t fixable without rethinking everything. While definitely a last resort, such rethinking must always be on the table. Bits of technology can be changed with some work. The mental models behind the product, once it ships to the broader audience, are much, much more difficult to change.

Because of that, the typical UX studies aren’t a great fit for the bootstrapping phase of the project. We’re not looking for folks to react to the validity of mental models we imbued the nascent product with. We fully realize that some of them – likely many – are wrong. Instead, we need a collaborative, tight-feedback-loop relationship with potential users, who feel entrusted with steering the product direction by chewing on not-yet-baked cookies. They aren’t just trusted testers of the product. They aren’t just evaluators of it. They are full participants in its development, representing the users.

Thinking to write

I’ve had this realization about myself recently, and it’s been rather useful in gaining a bit more understanding about how my mind works. I am writing it down in hopes it would help you in your own self-reflections.

The well-worn “Writing to think” maxim is something that’s near and dear to my heart: weaving a sequential story of the highly non-linear processes that are happening in my mind is a precious tool. I usually recommend to my colleagues and friends that they develop the muscle for writing to think as a way to keep their thoughts organized. Often, when I do, I am asked: “What do I write about?”

It’s a good question. At least for me, the ability to write appears to be closely connected to the space in which I am doing the thinking. It seems like the whole notion of “writing to think” might also work in reverse: when I don’t have something to write about, it might be a signal that my thinking space is fairly small or narrow.

There might be fascinating and very challenging problems that I am working on. I could be spending many hours wracking my brain trying to solve them. However, if this thinking doesn’t spur me to write about it, I am probably inhabiting a rather confined problem space.

I find that writing code and software engineering in general tend to collapse this space for me. Don’t get me wrong, I love making software. It’s one of those things that I genuinely enjoy and get a “coder’s high” from.

Yet, when doing so, I find that my thoughts are sharply focused and narrow. They don’t undulate and wander in vast spaces. They don’t get lost just for the sake of getting lost. Writing code is about bringing an idea to life. It’s a very concretizing process. Writing code is most definitely a process of writing to think, but it’s more of “writing it”, rather than “writing about it”.

The outcome is a crisp – albeit often spaghetti-like – set of instructions that are meant to be understood by a machine, which for all its complicatedness is a lot less complex than a human mind.

On the other hand, when I was doing more strategy work a few years back, I found myself brimming with ideas to write down. It was very easy to just knock out a post – nearly every idea I had was begging to be played with and turned into a story to share. I was in the wide-open space of thinking among people – long-horizon, broad thinking and wandering.

Nothing’s wrong with inhabiting smaller problem spaces for a little while. However, it’s probably not something I would pick as the only way of being. “Inhabiting” brings habits, and habits entrench. Becoming entrenched in smaller problem spaces means that the larger spaces become less and less accessible over time, resulting in strategic myopia.

It seems that to avoid such a diagnosis, we’ve gotta keep finding ways to think in spaces big enough to spur us to write. To use an analogy, “writing to think” is like developing a habit of brushing our teeth, and “thinking to write” is a way to check if we indeed follow this habit. If we find ourselves struggling to write, then maybe we need to broaden the problem space we inhabit.

Aircraft Carriers and Zodiac Boats

An insightful conversation with a colleague inspired me to articulate the distinction between velocity and agility – and all that implies. When I talked about velocity in the past, I sort of elided the fact that agility, or the ability to change direction at a given velocity, plays a crucial role in setting organizations for success in certain conditions. I also want to disclaim that the definition of “agility” I use here is only loosely related to the well-known/loved/hated agile methodologies.

The example I’ve used in the past is that of a zodiac boat and an aircraft carrier. Though they are capable of going at roughly the same velocity, their agility is dramatically different. The zodiac boat’s maneuverability is what gives it a decisive advantage in an environment where the situation changes rapidly. On the other hand, an aircraft carrier is able to sustain velocity for much longer periods of time, which enables it to travel around the globe.

In engineering teams, velocity and agility are often used interchangeably, and I am quite guilty of doing this as well. Only in retrospect am I realizing why some teams that I’ve worked both with and next to looked and acted so differently. They were valuing, respectively, velocity or agility.

When the team favors velocity, it invests significantly into its capacity to achieve maximum sustained speed for the longest possible time. Decision-making and engineering processes, tools and infrastructure all feel like that of an aircraft carrier, regardless of the team’s actual size. It’s like the question on everyone’s mind is: “Can we cross the Pacific Ocean? How many times over?” The team is designed to go far, even if that means sacrificing some velocity to its robustness.

For instance, the Blink team I led a while back was all about velocity, borrowing most of its ethos from Google engineering culture. We designed our infrastructure to enable us to ship directly from trunk through diligent test coverage and a phenomenal build system (we built our own!), and we followed the practice of code reviews and a disciplined shipping process. We talked about how this team was built to run for multiple decades.

This was (and a decade later, still is), of course, the right fit for that project. Rendering engines represent highly overconstrained, immensely complex systems of a relatively well-defined shape. The team that ships a rendering engine will not one day decide to do something completely different. The word “velocity” in such a team is tightly coupled with achieving predictable results over a long period of time.

However, when the final shape of the value niche is still unknown, and the product-market fit is something we only wistfully talk about, a different structure is needed. Here, the engineering team needs to lean into agility. When they do so, the project will act very differently. It will be more like a zodiac boat: not built to run forever, but rather to zig and zag quickly.

A project structured like a zodiac boat will have alarmingly few processes and entrenched practices. “What? They don’t do code reviews?” The trunk might be unshippable for periods that would be unacceptable by any standards of an aircraft carrier team. The codebase will have large cheese holes in implementation and test coverage, with many areas only loosely sketched. In a zodiac boat project, everything is temporary, and meant to shift as soon as a new promising approach is discovered.

Such projects are also typically small-sized. Larger teams mean more opinions and more coordination headwinds, so zodiac boat projects will favor fewer folks who deeply understand the code base and have no problem diving in and changing everything. They will also attract those who are comfortable with uncertainty. In a highly dynamic situation, the courage to make choices (even if they might not be the right ones) and the skill to keep the OODA loop spinning are paramount.

A well-organized startup will have to run like a zodiac boat project. Startups rarely form around old ideas or long-discovered value niches. A lot of maneuvering will be necessary to uncover that motherlode. Any attempts to turn this zodiac boat into an aircraft carrier prematurely will dramatically reduce the probability of finding it. This is why ex-Googlers often struggle in startups: their culturally-instilled intuition will direct them to install nuclear reactors and rivet steel panels onto their boats – and in doing so, sink them.   

Which brings me to engineering mastery. In my experience, there are two kinds of successful zodiac boat projects: ones run by people who aren’t that familiar with robust software engineering, and ones run by people who have achieved enough software engineering mastery to know which practices can be broken and disregarded.

The first group of folks succeeded accidentally. The second – intentionally. This second group knows where to leave the right cheese holes in their project, and will do so consistently with magical results.

That’s what’s so tricky about forming zodiac boat projects. One can’t just put together a bunch of engineers into a small boat and let them loose. As with any elite special force, zodiac boat projects require a crew that is tightly knit, intrinsically motivated, and skilled to extreme.

Curiously, aircraft carrier-culture organizations can sometimes produce relatively high agility through what I call ergodic agility. Ergodic agility refers to the phenomenon where a multitude of projects are given room to start and fail, and over time, through this ergodic motion, find a new value niche. Here, maneuverability is achieved through the quantity and diversity of unchanging directions.

Like the infamous quote from Shrek, this process looks and feels like utter failure from inside of most of these teams, with the lucky winner experiencing the purest form of survivorship bias.

I am not sure if ergodic agility is more or less expensive for a large organization compared to cultivating an archipelago of zodiac boat teams and culture. One thing is certain: to thrive in an ever-changing world, any organization will need to find a way to have both aircraft carriers and zodiac boats in its fleet.

AI Baklava

At its core, the process of layering resolves two counterposed forces. Both originate from the need for predictability in behavior. One from below, where the means are concrete, but the aims are abstract – and one from above, where the aims are concrete, but the means are abstract. 

Each layer is a translation of an abstraction: as we go up in the stack of layers, the means become more abstract and the aims more concrete. At each end, we want to have predictability in what happens next, but the object of our concern is different. Lower layers want to ensure that the means are employed correctly. Higher layers want to ensure that the aims are fulfilled.

For example, a user who wants to press a button to place an order in the food-ordering app has a very concrete aim in mind for this button – they want to satisfy their hunger. The means of doing that are completely opaque to the user – how this will happen is entirely hidden by the veil of abstraction.

One layer below, the app has a somewhat more abstract aim (the user pressed a button), but the means are a bit more concrete: it needs to route the tap to the right handler which will then initiate the process of placing the order.

The aims are even less concrete at a lower layer. The button widget receives a tap and invokes an event handler that corresponds to it. Here, we are unaware of the user’s hunger. We don’t know why they want this button tapped, and nor do we care: we just need to ensure that the right sequence of things transpires when a tap event is received. The means are very concrete.
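If it helps to see the three layers side by side, here’s the food-ordering example in sketch form (the names are made up, and everything below the button is waved away):

// Bottom layer: the button widget. Very concrete means, no idea about aims.
class Button {
  constructor(private onTap: () => void) {}
  handleTapEvent() {
    this.onTap(); // just route the tap to whoever registered a handler
  }
}

// Middle layer: the app. A somewhat more abstract aim ("the user pressed
// Order"), somewhat more concrete means: kick off the order flow.
class OrderScreen {
  readonly orderButton = new Button(() => this.placeOrder());
  private placeOrder() {
    // initiate the process of placing the order; how food actually arrives
    // is hidden behind further layers (APIs, restaurants, couriers...)
  }
}

// Top layer: the user. An entirely concrete aim (satisfy hunger) and
// entirely abstract means: press the button, trust the layers below.
const screen = new OrderScreen();
screen.orderButton.handleTapEvent();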

A reasonable question might be: why layer at all? Why not just connect the means and aims directly? This is where another interesting part of the story comes in. It appears that we humans have limits to the level of complexity of mental models we can reasonably contain in our minds. These limits are well-known to software engineers.

For a seasoned engineer, the pull toward layering emerges nearly simultaneously with the first lines of code written. It’s such a habit, they do it almost automatically. The experience that drives this habit is that of painful untangling of spaghetti after the code we wrote begins to resist change. This resistance, this unwillingness to cooperate with its own creator, is not the fault of the code. It is the limit of the engineer’s mental capacity to hold the entirety of the software in their mind.

When I talk about software with non-technical people, they are nearly always surprised about the amount of bugs any mature software contains. It seems paradoxical that older software has more bugs than the newer code. “What exactly are y’all doing there? Writing bugs?!” Yup. The mental model capacity needed to deeply grok a well-used, well-loved piece of software is typically way beyond that of any individual human.

So we come up with ways to break software into smaller chunks to allow us to compartmentalize their mental models, to specialize. And because of the way the forces of the aims and the means are facing each other, this process of chunking results in layering. Whether we want it or not, the layering will emerge in our code.

Putting it another way, layering is the artifact of the limits of human capacity to hold coherent mental models. If we imagine a software engineer with near-infinite mental model capacity, they could likely write well-functioning, relatively bug-free code using few (if any) layers of abstraction.

The converse is also true: lesser capacity to hold a mental model of traversing from the aims to the means will lead to software with more layers of abstraction.

By now, you probably know where I am going with this. Let’s see if we can apply these insights to the realm of large language models. What kind of layering will yield better results when we ask LLMs to write software for us?

It is my guess that the mental model-holding capacity of an LLM is roughly proportional to the size of this model’s context window. It is not the parametric memory that matters here. The parametric memory reflects an LLM’s ability to choose and apply various layers of abstraction. It is the context window that places a hard limit on what a model can and can not coherently perceive holistically.

Models with smaller context windows will have to rely on thinner layers and be more clever with the abstraction layers they choose. They would have to work harder and need more assistance from their human coworkers.  Models with larger context windows would be able to get by with fewer layers.

How will the LLM-based software engineers compare to human counterparts? Here’s my intuition. LLMs will continue to be abysmally bad at understanding large code bases. There are just way too many assumptions and too much tacit knowledge lurking in those lines of code. We will likely see an industry-wide spirited attempt to solve this problem, and the solution will likely involve thinning the abstraction layers within the code base to create safe, limited-scope lanes in which synthetic software engineers can be effective in their work.

At the same time, LLMs will have a definite advantage over humans in learning the codebases that are well within their limits. Unlike humans, they will not get tired and are easily cloned. If I fine-tune a model on my codebase, all I need is the GPU/TPU capacity to scale it to a multitude of synthetic workers.

Putting these two together, I wonder if we’ll see the emergence of synthetic software engineering as a discipline. This discipline will encompass the best practices for the human software engineer to construct – and maintain – the scaffolding for the baklava of layers of abstraction populated by a hive of their synthetic kin.