Audio Lyric Engine

2 March 2026

I am currently training an AI to match specific syllables of a song to precise timestamps. Providing syllables and their timing creates a valuable dataset for linguistics and dialectology, though this is proving to be a much more difficult task than I anticipated.


The basic input is an audio file with the song title and artist name as metadata. From there, the engine is responsible for several tasks. First, it must determine the genre. Then it obtains the lyrics in one of three ways: a user-provided .txt file, web scraping, or its own transcription. For transcription, the engine isolates the vocals from the instrumental "noise" and then runs a custom recognition model trained on that specific artist and genre. Transcribing Monica is fundamentally different from transcribing Nettspend; even a single artist like Future requires different logic for a Trap flow versus an R&B feature.
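As a rough illustration of the lyric step, here is a minimal Python sketch that tries the three sources in order. The fallback order, the function names (`load_from_txt`, `resolve_lyrics`), and the idea of passing the scraper and transcriber in as callables are all assumptions made for this example; the engine's real interfaces are not described in this post.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple


def load_from_txt(path: Optional[Path]) -> Optional[str]:
    """Return the contents of a user-provided lyric file, or None if unusable."""
    if path is None or not path.is_file():
        return None
    text = path.read_text(encoding="utf-8").strip()
    return text or None


def resolve_lyrics(
    txt_path: Optional[Path],
    scrape: Callable[[], Optional[str]],
    transcribe: Callable[[], str],
) -> Tuple[str, str]:
    """Try each lyric source in turn and report which one succeeded.

    Hypothetical order: user file first, then scraping, then transcription.
    `scrape` and `transcribe` are placeholders for the real components.
    """
    text = load_from_txt(txt_path)
    if text:
        return "txt", text
    text = scrape()
    if text:
        return "scrape", text
    # Last resort: vocal isolation plus the artist/genre-specific model.
    return "transcription", transcribe()


if __name__ == "__main__":
    source, lyrics = resolve_lyrics(None, lambda: None, lambda: "la la la")
    print(source, lyrics)
```

Returning the source name alongside the text makes it easy to log how often the expensive transcription path is actually needed.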


The most difficult challenge is training an AI to break the vocal file down into syllables and interpolate that data onto the lyric file. While I tackle that, I vibe-coded a simple MVP that fetches lyric data from Spotify and matches phrase timing to the start of each hook. Built with Python and HTML, it serves as a webhook for TouchDesigner, where I animate particles through real-time audio analysis, similar to the functionality of the Cotodama Lyric Speaker.
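To make the interpolation idea concrete, here is a naive baseline in Python: given a phrase's start and end time and its syllables, it spaces the syllables evenly across that window. This is not the trained model and not necessarily what the MVP does; the function name `interpolate_syllables` and the assumption that syllables are already split are both hypothetical. Real vocals are rarely evenly spaced, which is exactly why a learned model is needed.

```python
from typing import List, Tuple


def interpolate_syllables(
    syllables: List[str], start: float, end: float
) -> List[Tuple[str, float]]:
    """Assign each syllable an evenly spaced onset time within [start, end).

    Assumes the syllables are pre-split and the phrase boundaries (in seconds)
    come from line-level lyric timing. Purely linear; ignores rhythm and rests.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    if not syllables:
        return []
    step = (end - start) / len(syllables)
    return [(syl, start + i * step) for i, syl in enumerate(syllables)]


if __name__ == "__main__":
    for syl, onset in interpolate_syllables(["au", "di", "o", "en"], 10.0, 12.0):
        print(f"{onset:6.2f}s  {syl}")
```

A baseline like this is mainly useful as a reference point: any syllable-level model should beat evenly spaced onsets on the same phrases.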
