I have been on a bit of a kick to get large language models (LLMs) to generate interesting lyrics for songs. Over time, it’s become quite an adventure, a fun exploration of LLM capabilities. Here’s the multi-part account of this quest, offered as a loose timeline and progression of techniques, interspersed with pictures of and links to working prototypes, all built with the Breadboard Visual Editor.
Part 1: A simple song writer
At first blush, it seems too easy. Just ask ChatGPT or Gemini to write lyrics – and it will happily oblige, generating rhyming verses and a chorus for us. The first few times, it’s pretty neat. Building something like this in Breadboard takes a few seconds:
![](https://cdn.statically.io/img/i0.wp.com/glazkov.com/wp-content/uploads/2024/07/image.png?resize=840%2C344&ssl=1)
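Under the hood, that one-component board amounts to a single model call. Here is a minimal TypeScript sketch using the @google/generative-ai SDK; the model name and the prompt are my placeholders, not what the board above uses:

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";

// One prompt in, one song out: the whole "simple song writer".
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

const result = await model.generateContent(
  "Write lyrics for a song about a lighthouse keeper. " +
    "Include two verses, a chorus, and a bridge."
);
console.log(result.response.text());
```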
However, on the fourth or fifth try, a glaring problem becomes apparent: models are rather uncreative and lazy. The rhymes are repetitive. The metaphors are quite limited. The interpretation of ideas is too literal. It’s as if the LLM lacks imagination, producing something average that is only impressive the first time.
And it kind of makes sense. What we’re observing here is a process of inference: an LLM trying its best to predict what might be a natural completion of – let’s admit it – a rather average request.
Part 2: A lyrical factory
To address the mundanity of the model, my first idea was to catalyze it with the power of methodology. As we know well from how human organizations work, practices and methodology are often responsible for a dramatic increase in the quality of a product. Workers who aren’t deeply familiar with the nuance and craft – perhaps not even particularly skilled – can be given simple tasks, arranged into production lines, and still deliver quality results.
Applying this industrial-age practice of organizing workers to LLMs is something I already wrote about a while back, and it’s not a difficult thing to imagine. Let’s break down the process of creating a song into components, and then form a production line, focusing the LLM on one task at a time.
After a brief foray into researching song-writing best practices, here’s what I ended up building:
![](https://cdn.statically.io/img/i0.wp.com/glazkov.com/wp-content/uploads/2024/07/image-1.png?resize=840%2C189&ssl=1)
First, we have a “Theme Developer”, who is instructed to develop a theme based on the provided material. Next, we have a “Copywriter” who is asked to write a storyline based on the theme that was developed. After that, the “Hook Artist” steps in to develop catchy hooks for the lyrics. And finally, a “Lyricist” completes the job to create the lyrics.
After much tweaking, I settled on the technique of “reminding the model” for my prompt design. The basic premise of this technique is that LLMs already have all of the necessary information. Our problem is not that they don’t have the knowledge. Our problem is that they don’t know which information is important right now. For instance, the “Copywriter” is reminded to use the Freytag Pyramid, but I don’t include its definition in the prompt.
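To make the factory’s shape concrete, here is a rough TypeScript sketch of the same production line as plain sequential calls. The worker instructions are one-line paraphrases of the idea, not the actual (much more elaborate) board prompts:

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

// One focused worker: a single instruction applied to the previous
// worker's output.
async function worker(instruction: string, input: string): Promise<string> {
  const result = await model.generateContent(`${instruction}\n\n${input}`);
  return result.response.text();
}

const material = "A poem about moths drawn to a porch light.";

// Note the "remind the model" style: the Copywriter is told to use the
// Freytag Pyramid, but the prompt never defines it.
const theme = await worker(
  "You are a Theme Developer. Develop a song theme from this material.",
  material
);
const story = await worker(
  "You are a Copywriter. Write a storyline for this theme. Use the Freytag Pyramid.",
  theme
);
const hooks = await worker(
  "You are a Hook Artist. Develop catchy hooks for this storyline.",
  story
);
const lyrics = await worker(
  "You are a Lyricist. Write the final lyrics from the storyline and hooks.",
  `${story}\n\n${hooks}`
);
console.log(lyrics);
```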
Even with this shortcut, the prompts for each worker in our lyrical factory ended up being quite elaborate (you can see each of them by clicking on an individual component in the Breadboard Visual Editor and looking at the “Instruction” field). Song-writing is quite a process, it turns out.
The resulting lyrical factory produced much more interesting results: it was fun to see how the LLM would make up details and themes and then build out a story to base the lyrics on. Especially in situations when I didn’t quite know what I was looking for in the lyrics, it worked surprisingly well.
Lyrics for the song “The Blade” were written by the factory. It’s a pretty nice song. However, if we compare the factory’s original lyrics with what actually went into the song, the difference is pretty dramatic. The original feels like the lyricist is walking on stilts.
This became a bit of a pattern for me: after generating a few outputs, I pick the one that’s closest to what I want and then edit it into a final song. And with that came the ongoing challenge: teaching an LLM to require fewer edits to get the lyrics into decent shape.
Part 3: The loosey-goosey duo
So, the results were better and more creative, but not quite there. I decided to see if I could mix it up a bit: instead of a predefined sequence of steps, I employed the Plan + Execute pattern. In this pattern, a planner decides what the steps should be. Once these steps have been decided, a model is invoked multiple times, once for each step, to eventually produce the final result.
Compared to the lyrical factory, this approach adds another level of creativity (or at least, randomness) to the overall process. Lucky for me, it’s super-easy to implement the P+E pattern in Breadboard with the Looper component (here’s the board).
![](https://cdn.statically.io/img/i0.wp.com/glazkov.com/wp-content/uploads/2024/07/image-2.png?resize=840%2C384&ssl=1)
The prompt for it is fairly straightforward:
Develop a step-by-step plan to write lyrics for a modern hit song
based on the source material from building a list of stereotypes
and cliches in popular songs that might be related to the source
material that are already overused, to developing themes to
writing the storyline, to creating catchy hooks, to researching
existing lyrics on the selected theme as inspiration, with the
final task of writing the lyrics. Each task must be
a comprehensive job description and include self-reflection
and critique.
Note the “remind the model” pattern. Instead of specifying what the steps are, I just remind the LLM of some steps that a good plan might contain.
The structure of the team also became much simpler. Instead of multiple workers in the factory line, it’s a loosey-goosey duo of a “Writer” and a “Planner”, who jam together to produce lyrics. The “Planner” sets the pace and manages the execution of the steps in the plan. The “Writer” goes hard at each particular task.
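Stripped to its bones, the duo looks something like this sketch. The prompts are paraphrased, and the real Looper component handles all of the bookkeeping that this sketch hand-waves away:

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
const ask = async (prompt: string) =>
  (await model.generateContent(prompt)).response.text();

// The "Planner" decides the steps...
const plan = await ask(
  "Develop a step-by-step plan to write lyrics for a modern hit song " +
    "based on this source material. Respond with only a JSON array of " +
    "strings, one task per string.\n\nMaterial: a poem about moths."
);
// Tolerate models that wrap the JSON in prose or code fences.
const steps: string[] = JSON.parse(
  plan.slice(plan.indexOf("["), plan.lastIndexOf("]") + 1)
);

// ...and the "Writer" goes hard at them, one at a time, carrying the
// work so far as context.
let context = "";
for (const step of steps) {
  context = await ask(`Task: ${step}\n\nWork so far:\n${context}`);
}
console.log(context); // the last task's output should be the lyrics
```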
Surprisingly (or maybe not), this configuration ended up producing the most creative results. It was downright fascinating to watch the “Planner” make up steps that I could have never imagined, like requesting to draw inspiration from existing songs or holding multiple rounds of self-critique. Some plans would be just three or four steps, and some would run for a while.
My favorite output of this creative duo is probably the “Chrysalis” song, which was a result of me showing the duo my Chrysalis poem. My mind boggled when I saw the coming-of-age, small-town girl story that the loosey-goosey team came up with. Not far behind is the “Sour Grapes” track, which was generated from the Fox and the Grapes fable. It’s pretty cool, right?
Unfortunately, just like any P+E implementation (and any highly creative team composed of actual humans, for that matter), the duo setup was somewhat unreliable. Often, instead of producing a song, I would get back a transcript of the collaborators quarreling with each other, or something entirely irrelevant, like the model pleading with me to stop asking it to promote a non-existent song on Instagram. It was time to add some oversight.
Part 4: Just add human
Breadboard’s Agent Kit had just the ingredient I needed: a Human component. I used it as a way to insert myself into the conversation and help steer it with my feedback. Whereas in the previous design I would only provide input once, in this one I get to speak up at every turn of the plan and let the duo adjust accordingly.
![](https://cdn.statically.io/img/i0.wp.com/glazkov.com/wp-content/uploads/2024/07/image-3.png?resize=840%2C284&ssl=1)
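Conceptually, the change is tiny: a human turn in between steps. Here is a sketch of the idea, extending the Plan + Execute sketch from Part 3 (ask, steps, and context are reused from there; the actual Human component lives in Breadboard’s Agent Kit):

```ts
import readline from "node:readline/promises";
import { stdin, stdout } from "node:process";

const rl = readline.createInterface({ input: stdin, output: stdout });

for (const step of steps) {
  context = await ask(`Task: ${step}\n\nWork so far:\n${context}`);
  // The human turn: after every step, I get to steer.
  const feedback = await rl.question(`Feedback on "${step}"? `);
  if (feedback.trim()) {
    context = await ask(
      `Revise the work below to address this feedback: ${feedback}\n\n${context}`
    );
  }
}
rl.close();
```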
It was a foolproof plan. And indeed, the results were immediately more relevant to what I was looking for. Except… it was no longer fun. I found it exhausting to be part of the team, having to speak up and offer my thoughts. More often than not, I would give some terse reply like “yeah good” or “make it more edgy”, and couldn’t wait for the session to be over. I came here for casino creativity, not to sweat over the details with the crew.
Moreover, the challenge I posed for myself at the start was still not addressed. Way too many edits were still necessary to bring the lyrics to the shape where they didn’t smell AI-generated. This smell is something that everyone who has tried generating songs with AI can recognize. Each model has its own repertoire of words that it’s prone to inserting into the lyrics. There’s always “neon lights”, “whispers”, a “kaleidoscope” or two, something is always “shimmering”, and of course, there’s a “symphony” of something. And OMG, please stop with the “embrace”. It’s like, everywhere.
Part 5: A few-shot rethink
For this last leg of the trip, I tried an entirely different approach. For a while, the idea of employing few-shot prompting had been percolating in my brain. Perhaps I could dislodge my synthetic lyricist from its rut by giving it a few examples? But how would I go about doing that?
My first idea was to add an “Editor” worker at the end of the process and give it all of the “before/after” pairs from my edits, with the task of improving the produced lyrics. By then, I had accumulated quite a collection of these, and it seemed reasonable to try. Unfortunately, there’s something about LLMs and lyrics that makes them do poorly with “improvements” or “edits”. You’re welcome to try it. Any attempt at improving lyrics with suggestions just produces even more average lyrics.
I had to come up with a way for the ersatz lyricist’s creativity to be stimulated by something real.
What if I gave it a few examples of existing similar lyrics? Would that work?
RAG to the rescue. For those not following the whole AI rigamarole closely, RAG, or retrieval-augmented generation, is a fairly broad – and growing! – collection of techniques that typically take advantage of vector embeddings.
With a bit of ETL elbow grease, I built a simple semantic store with Qdrant that contained about 10K songs from a lyric database I purchased at Usable databases. With this semantic store, I could now take any prompt, generate an embedding for it, and then search the store, returning the top N results. Easy-peasy in Breadboard:
![](https://cdn.statically.io/img/i0.wp.com/glazkov.com/wp-content/uploads/2024/07/image-4.png?resize=840%2C431&ssl=1)
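Under the hood, the retrieval step boils down to “embed the prompt, search the collection”. Here is a minimal sketch using the @qdrant/js-client-rest and @google/generative-ai SDKs; the collection name and payload shape are my assumptions about how the store is laid out:

```ts
import { QdrantClient } from "@qdrant/js-client-rest";
import { GoogleGenerativeAI } from "@google/generative-ai";

const qdrant = new QdrantClient({
  url: process.env.QDRANT_URL!,
  apiKey: process.env.QDRANT_API_KEY,
});
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const embedder = genAI.getGenerativeModel({ model: "text-embedding-004" });

// Embed the prompt...
const { embedding } = await embedder.embedContent("a song about leaving home");

// ...then return the top N most similar songs. "songs" and the payload
// contents are assumptions about how the collection was built.
const hits = await qdrant.search("songs", {
  vector: embedding.values,
  limit: 5,
});
for (const hit of hits) {
  console.log(hit.score, hit.payload);
}
```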
Great! I can get N songs related to the prompt! What do I do with them?
I then spent a bunch of time playing with various ideas and eventually settled on an architecture. Instead of the loosey-goosey flow of steps, I decided on a simple prompt -> proposal -> lyrics pipeline. A song is produced with two LLM invocations: one – let’s call it the “Proposal Developer” – to generate a song proposal from a prompt, and one (the “Song Writer”) to generate lyrics from the proposal. Each invocation uses a few-shot prompt. The first one has a bunch of examples of how to create a song proposal from a prompt, and the second one has a bunch of examples of how to produce lyrics from the proposal.
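In sketch form, assuming the example pairs already exist (promptToProposalPairs, proposalToLyricsPairs, and userPrompt are placeholders, and ask is the one-line Gemini helper from the Part 3 sketch):

```ts
type Pair = { input: string; output: string };

// Placeholders for the two sets of example pairs.
declare const promptToProposalPairs: Pair[];
declare const proposalToLyricsPairs: Pair[];
const userPrompt = "a song about my dog learning to swim";

// Assemble a few-shot prompt from example pairs.
function fewShot(task: string, examples: Pair[], input: string): string {
  const shots = examples
    .map((e) => `Input:\n${e.input}\n\nOutput:\n${e.output}`)
    .join("\n\n---\n\n");
  return `${task}\n\n${shots}\n\n---\n\nInput:\n${input}\n\nOutput:\n`;
}

// "Proposal Developer": prompt -> proposal.
const proposal = await ask(
  fewShot("Turn a prompt into a song proposal.", promptToProposalPairs, userPrompt)
);
// "Song Writer": proposal -> lyrics.
const lyrics = await ask(
  fewShot("Write song lyrics from a song proposal.", proposalToLyricsPairs, proposal)
);
```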
Wait a second. I only have the lyrics examples. To implement the few-shot prompting properly, I need to provide two sets of pairs: a “prompt -> proposal” set for the “Proposal Developer” and a “proposal -> lyrics” set for the “Song Writer”.
![](https://cdn.statically.io/img/i0.wp.com/glazkov.com/wp-content/uploads/2024/07/image-5.png?resize=840%2C376&ssl=1)
Welcome to the wonderful world of synthetic datasets. If we don’t have the data, why not just generate it? That’s right, I came up with a simple system that works backwards: it takes a song and then comes up with a proposal for it. Then, in a similar fashion, it takes a proposal and comes up with a prompt that might have been used to generate it. Wiring it all together, I finally had a working system.
![](https://cdn.statically.io/img/i0.wp.com/glazkov.com/wp-content/uploads/2024/07/image-6.png?resize=840%2C490&ssl=1)
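The backwards generator is almost embarrassingly simple in sketch form (sampleOfSongs stands in for a slice of the lyrics store, Pair and ask come from the earlier sketches, and the prompts are paraphrased):

```ts
// Placeholder for a slice of the lyrics store.
declare const sampleOfSongs: { lyrics: string }[];

const promptToProposalPairs: Pair[] = [];
const proposalToLyricsPairs: Pair[] = [];

// Work backwards from real lyrics: lyrics -> proposal -> prompt.
for (const song of sampleOfSongs) {
  const proposal = await ask(
    `Here are song lyrics. Write the song proposal that might have produced them.\n\n${song.lyrics}`
  );
  const prompt = await ask(
    `Here is a song proposal. Write the short prompt that might have produced it.\n\n${proposal}`
  );
  proposalToLyricsPairs.push({ input: proposal, output: song.lyrics });
  promptToProposalPairs.push({ input: prompt, output: proposal });
}
```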
Immediately, I could tell the difference. The introduction of few-shot examples dramatically changed the LLM’s lyrical prowess, making it more flexible, interesting, and alive. It would still insert “whispers” and an occasional “symphony” here and there, but the change felt significant. Trippy stuff. Here are a few examples: “The Algorithm of Us”, where I literally just copy-pasted the output (I even forgot to remove the title of the song from the body 🤦). “Uhh… What Were The Words Again?” ended up being particularly cute. I did have to make some edits to these lyrics, but only to make them more palatable for Suno’s consumption.
As a finishing touch, I multiplied the lyricists to produce three variants of the lyrics. This was a last-minute thought: what if I gave the song writers different personas? Would their output vary? Turns out, it does! And, just as Midjourney, Suno, and Udio have found out, offering more than one output is a valuable trick.
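In sketch form, the personas are just one extra line of instruction per lyricist (the persona wording here is mine, and fewShot, proposalToLyricsPairs, proposal, and ask come from the earlier sketches):

```ts
// Three lyricists, same proposal, different personas.
const personas = [
  "You are a wistful folk singer-songwriter.",
  "You are a sharp-tongued pop-punk lyricist.",
  "You are a polished top-40 hit writer.",
];
const variants = await Promise.all(
  personas.map((persona) =>
    ask(
      fewShot(
        `${persona} Write lyrics from a song proposal.`,
        proposalToLyricsPairs,
        proposal
      )
    )
  )
);
console.log(variants.length); // three takes on the same proposal
```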
Intentionally skipping over some lazy synthetic data generation techniques (this post is already too long), here is the final design.
![](https://cdn.statically.io/img/i0.wp.com/glazkov.com/wp-content/uploads/2024/07/image-7.png?resize=840%2C522&ssl=1)
Sadly, you won’t be able to run the boards from this part of the journey directly. Unlike the Gemini API key, which can be easily obtained, Qdrant keys are tied to collections, which means you would need to build your own lyrics database to get them to work. However, if you send an email to dglazkov.writeMeASong@valtown.email, you will receive back a song generated using the body of the email as a prompt. Or you can peruse this massive side-by-side eval spreadsheet, which is automatically updated by a cron job that fills out the variant columns any time I add a new prompt row.
Part 6: Not the end
I am not quite ready to declare victory yet. As the quality of the lyrics improved, flaws in other parts of the system became more apparent. For instance, the final variant is not that great at proposal generation. And that makes sense: the proposals aren’t original – they are synthetic data – which means that their creative potential is limited by the confines of model inference. Which means that there’s work to be done.
One big lesson for me in this project is this: while it may seem like turning to LLMs to perform a task is a simple step, when we dig into the nature of the task and really try to get it done reliably and with the quality we desire, simple prompt engineering won’t do. We need AI systems composed of various AI patterns to take us where we want to go.