Audio Lyric Engine
2 March 2026
I am currently working on training an AI to match each syllable of a song to its timestamp in the audio. Syllable-level timing data is valuable for linguistics and dialectology. This is proving to be a much more difficult task than I anticipated.
The basic input of the system is an audio file with the song title and artist name as metadata. The engine is then responsible for multiple things. First, it needs to determine the genre of the song. Then, it needs to do one of three things to find the lyrics: ask the user for a lyric.txt file, scrape the internet for the lyrics, or transcribe the lyrics from the audio itself. For its own transcription, the engine first needs to isolate the vocals from the instrumental "noise." Next, it needs to pull from the data it was trained on for that specific artist, and apply the recognition skills learned for that artist in that genre to build a new transcription. This is because of how wildly artists differ from one another, let alone from themselves across genres. For example, transcribing Monica is different from transcribing Nettspend, and transcribing Future on a Trap flow is different from transcribing Future on an R&B feature.
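The three-way lyric-sourcing fallback could be sketched roughly like this. This is a minimal illustration, not the engine's actual code: the function name, the `lyric.txt` lookup next to the audio file, and the `scraper`/`transcriber` callables are all hypothetical placeholders for the real components.

```python
from pathlib import Path

def get_lyrics(audio_path, scraper=None, transcriber=None):
    """Hypothetical fallback chain: user-supplied lyric.txt -> web scrape
    -> the engine's own transcription (vocal isolation + artist model)."""
    # 1. Check for a user-provided lyric.txt beside the audio file.
    txt = Path(audio_path).with_name("lyric.txt")
    if txt.exists():
        return txt.read_text()
    # 2. Try scraping the lyrics from the internet.
    if scraper is not None:
        lyrics = scraper(audio_path)
        if lyrics:
            return lyrics
    # 3. Fall back to transcribing the isolated vocals directly.
    if transcriber is not None:
        return transcriber(audio_path)
    raise ValueError("no lyric source available for " + str(audio_path))
```

Keeping each source behind a plain callable makes it easy to swap in a real scraper or a per-artist transcription model later without touching the fallback logic.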
Here comes the most difficult part: training an AI to break the vocals file down into syllables and interpolating that timing data onto the lyric file. This is what I am currently attempting to tackle first. In the meantime, I vibe-coded a simple MVP, which fetches the lyric data from Spotify and matches the phrase timing to the start of each hook. I built this as a Python script with an HTML front end, which feeds TouchDesigner via a webhook, where I animate particles through audio analysis. This is close to what the Cotodama Lyric Speaker does.
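The interpolation step in the MVP amounts to spreading a phrase's syllables across its known time window. The sketch below shows that idea under a naive assumption (not the real engine's method) that syllables are evenly spaced within the phrase; the function name and signature are my own illustration.

```python
def interpolate_syllables(syllables, start, end):
    """Assign an estimated timestamp to each syllable of a phrase by
    spacing them evenly across the phrase's [start, end) window.
    This even-spacing assumption is a placeholder for a trained model."""
    n = len(syllables)
    if n == 0:
        return []
    step = (end - start) / n
    return [(syl, round(start + i * step, 3)) for i, syl in enumerate(syllables)]
```

For example, a two-syllable phrase spanning 12.0 s to 13.0 s would yield timestamps at 12.0 and 12.5. A real model would replace the even spacing with timings learned from the isolated vocal track.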