Caption formats, language support, workflow, pricing, and the rest. If you don't see your question, email hello@delivercc.io.
DeliverCC outputs four formats from a single generation. Here's what each one is and where to use it:
| Format | What it is | Where to use it |
|---|---|---|
| SRT | The universal subtitle format. Plain text, simple timecodes | YouTube, Vimeo, Facebook, Instagram, TikTok, most video editors |
| VTT | WebVTT, the standard text-track format for web video | HTML5 video players, web embeds |
| SCC | Scenarist Closed Captions. CEA-608 broadcast standard | US broadcast TV (CBS, NBC, ABC, Fox) |
| TTML | Timed Text Markup Language. Apple Music synced-lyrics dialect (line-level) | Apple Music synced song lyrics. The file labels send via their distributor to power karaoke-style highlighting in the Apple Music app |
Both. Paste everything you want captioned: the lyrics, ad-libs, and any spoken dialogue, and DeliverCC aligns all of it. A broadcast caption file has to carry every word, sung and spoken, so captioning the spoken parts is what makes your delivery complete and compliant.
Because the tool aligns the text you provide, anything you want captioned has to be in what you paste. An ad-lib that is not in your lyric sheet will not appear unless you add it.
They go to two different places. Video captions (SRT, VTT, and SCC) ride with your video. They display text over the picture, synced to everything audible, and they work wherever the video plays: SRT and VTT for YouTube, Vimeo, and social, and SCC for US broadcast TV. Apple Music synced lyrics ride with the song instead. They are the lyrics that scroll and highlight line by line inside the Apple Music app while the track plays. Same timed text underneath, two different destinations, and they are not interchangeable. One paints words on a video; the other powers the lyrics view in the streaming app.
TTML (Timed Text Markup Language) is a W3C standard for timed text. DeliverCC emits the Apple Music synced-lyrics dialect of TTML, the line-level format Apple Music uses for lyrics that highlight in time with playback. It is the file your label or distributor submits to Apple, through Transporter or iTunes Connect, to turn on synced lyrics for a release. It is not a general video-caption TTML and it is not a video subtitle. For video, use the SRT, VTT, or SCC output.
Not as a file, because Spotify does not accept one. Spotify's synced lyrics are powered entirely by Musixmatch. The only way to add them is to verify an artist or label account in Musixmatch and sync the lyrics inside Musixmatch's own tool, which then pushes them to Spotify. No tool can hand Spotify a finished lyrics file.
Apple Music is different: it accepts a time-synced TTML lyrics file submitted directly by the rights holder or distributor, which is the file DeliverCC produces. DeliverCC therefore serves the destination that accepts a file; destinations that require manual syncing in a separate tool are handled in that tool. Instagram, Amazon Music, and Tidal run through Musixmatch the same way Spotify does.
Twenty-one alignment languages, covering most major music markets:
English, Spanish, Portuguese, Korean, Japanese, French, German, Italian, Arabic, Danish, Dutch, Finnish, Hindi, Indonesian, Norwegian, Polish, Russian, Swedish, Thai, Turkish, and Chinese.
Each language uses the highest-quality alignment model available for it. One thing to know for non-Latin script languages (Korean, Japanese, Arabic, Hindi, Thai, Chinese): paste your lyrics in the native script of the song, not romanized versions. "Tum ho meri zindagi" won't align for a Hindi song. "तुम हो मेरी ज़िन्दगी" will.
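If you want to catch a romanized paste before generating, a rough pre-check along these lines can help. This is a sketch, not part of DeliverCC: `looks_romanized` is a hypothetical helper, and it only makes sense for songs in the non-Latin script languages listed above.

```python
import unicodedata


def looks_romanized(lyrics: str) -> bool:
    """Return True when every letter in the text is a Latin-script letter.

    Hypothetical pre-check: for a Hindi, Korean, Japanese, Arabic, Thai,
    or Chinese song, an all-Latin lyric sheet is probably romanized.
    """
    letters = [ch for ch in lyrics if ch.isalpha()]
    if not letters:
        return False  # nothing to judge (empty or punctuation only)
    return all(unicodedata.name(ch, "").startswith("LATIN") for ch in letters)


print(looks_romanized("Tum ho meri zindagi"))   # True
print(looks_romanized("तुम हो मेरी ज़िन्दगी"))  # False
```

Lyrics that legitimately mix scripts, such as a Korean song with English hooks, return `False` and pass the check, since only an entirely Latin sheet is flagged.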
Music vocals break speech-to-text. Mumbled delivery, ad-libs, harmonies, autotune, and non-lexical sounds all degrade transcription accuracy to the point where the output no longer matches what was actually sung.
DeliverCC takes a different approach. You provide the correct, artist-approved lyrics, and the system aligns them to the audio rather than guessing what was sung. The captions say exactly what the lyric sheet says, with word-level timing accuracy that holds even on the hardest vocal performances.
You provide them. DeliverCC is built around the lyric sheet being the source of truth, not a transcription. This matches the workflow most labels already use: captions go out matched to the official lyrics that have been signed off on, not to whatever an AI thinks it heard in the recording. DeliverCC handles the timing; you control what the words say.
Typical generation takes 30 to 60 seconds from the moment you click Generate to the moment captions appear. The first request on a fresh worker takes longer (around 90 seconds) while infrastructure spins up. Subsequent requests on warm workers are consistently faster, and most users see sub-60-second times in normal use.
Yes. Every generation lands in the timeline editor with a waveform view, draggable block edges, per-block text editing, and full undo and redo. Most songs need zero edits. When edits are needed (usually for ad-libs or intro instrumentals), the fix takes seconds. Edits are baked into the exported caption file in whatever format you select.
Forced alignment handles these better than transcription tools. Ad-libs ("yeah", "oh", "mmm"), producer tags, and mumbled or harmonized vocals all confuse caption tools that rely on speech-to-text. DeliverCC aligns to the lyrics you provide, so the captions contain exactly the words in your lyric sheet, and nothing is guessed from the audio.
If you want to add an ad-lib that wasn't in your lyric sheet, or remove one that was, the timeline editor lets you edit any block text and adjust timing manually.
DeliverCC accepts standard audio and video formats.
Uploads are capped at 500 MB and 15 minutes of duration. DeliverCC automatically pulls the audio out of video uploads, so you don't need to convert anything yourself.
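The two limits are easy to check before you start an upload. The sketch below assumes 500 MB means 500,000,000 bytes (the exact byte count the service enforces is not stated here), and `upload_problems` is a hypothetical helper, not a DeliverCC API.

```python
from __future__ import annotations

# Limits from the FAQ. Treating 1 MB as 1,000,000 bytes is an assumption.
MAX_BYTES = 500 * 1000 * 1000
MAX_SECONDS = 15 * 60


def upload_problems(size_bytes: int, duration_seconds: float) -> list[str]:
    """Return a list of reasons the file would exceed the upload caps."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes / 1e6:.0f} MB, over the 500 MB cap")
    if duration_seconds > MAX_SECONDS:
        problems.append(
            f"duration is {duration_seconds / 60:.1f} min, over the 15 minute cap"
        )
    return problems


# A 5-minute MP3 of about 8 MB passes; a 2 GB, 20-minute video fails both checks.
print(upload_problems(8_000_000, 300))
print(upload_problems(2_000_000_000, 1200))
```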
Tip for music video editors: Raw video exports (MOV, ProRes, etc.) are often many gigabytes, well past the 500 MB upload cap. The faster path is to export an audio-only file from your video editor and upload that. A 5-minute song as an MP3 is typically under 10 MB, while the same song in raw video can be gigabytes. Generation runs the same alignment either way, and you skip the long upload wait.
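If you would rather script the audio export than do it in your editor, one option is to call `ffmpeg` from Python. This swaps in a different tool from the editor export described above, and it assumes `ffmpeg` is installed and built with the `libmp3lame` encoder. The function names and file names are placeholders.

```python
from __future__ import annotations

import subprocess


def audio_export_command(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that drops the video stream and encodes MP3."""
    return [
        "ffmpeg",
        "-i", video_path,      # input video (MOV, ProRes, etc.)
        "-vn",                 # no video in the output
        "-c:a", "libmp3lame",  # encode audio as MP3
        "-q:a", "2",           # high-quality variable bitrate
        audio_path,
    ]


def export_audio(video_path: str, audio_path: str) -> None:
    """Run the export; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(audio_export_command(video_path, audio_path), check=True)


# Example with placeholder file names:
# export_audio("music_video.mov", "music_video.mp3")
print(" ".join(audio_export_command("music_video.mov", "music_video.mp3")))
```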
Audio files are automatically deleted from DeliverCC's storage approximately 14 days after upload, a window that covers the project review and revision phase.
Generated caption files stay in your account until you delete them. Nothing about your audio or your lyrics is used to train any model. The full retention policy is in the privacy policy.
One credit equals one caption generation. You get all four export formats with that single credit, generated from the same alignment data.
Monthly plans grant credits at the start of each billing period and reset each month: Creator gets 5, Studio gets 12, Label gets 30. Pay-as-you-go credits are bought one at a time and never expire. If you run out mid-month, you can buy a Pay-as-you-go credit or upgrade your plan. There are no overage charges and no per-format fees.