When you think of text-to-speech in AI terms, the first company that may come to mind is ElevenLabs, as the quality of their product literally speaks for itself. If you are looking for an open-source tool, then Bark, by Suno, may be of interest.
On Hacker News, one of the founders of Suno said this of Bark: 'At Suno we work on audio foundation models, creating speech, music, sounds effects etc….
Text to speech was a natural playground for us to share with the community and get some feedback. Given that this model is a full GPT model, the text input is merely a guidance and the model can technically create any audio from scratch even without input text, aka hallucinations or audio continuation.
When used as a TTS model, it’s very different from the awesome high quality TTS models already available. It produces a wider range of audio – that could be a high quality studio recording of an actor or the same text leading to two people shouting in an argument at a noisy bar.'
Bark is already available on Hugging Face (which I'm due to write a blog piece on; the to-do list is growing), which makes it much easier to try out.
The GitHub description states:
'Similar to Vall-E and some other amazing work in the field, Bark uses GPT-style models to generate audio from scratch. Different from Vall-E, the initial text prompt is embedded into high-level semantic tokens without the use of phonemes. It can therefore generalize to arbitrary instructions beyond speech that occur in the training data, such as music lyrics, sound effects or other non-speech sounds. A subsequent second model is used to convert the generated semantic tokens into audio codec tokens to generate the full waveform. To enable the community to use Bark via public code we used the fantastic EnCodec codec from Facebook to act as an audio representation.'
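To get a feel for what that looks like in practice, here is a minimal sketch of generating a clip with the `bark` Python package. The `preload_models`, `generate_audio` and `SAMPLE_RATE` names come from the package itself; `write_wav` and `speak` are my own hypothetical helpers, and the assumption is that `generate_audio` returns mono float samples in the range -1.0 to 1.0. Treat it as a starting point rather than the recommended way to use Bark.

```python
import sys
import wave
from array import array


def write_wav(path, samples, sample_rate):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV."""
    pcm = array("h")
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against out-of-range values
        pcm.append(int(clipped * 32767))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


def speak(text, out_path="bark_generation.wav"):
    """Hypothetical wrapper: generate speech with Bark and save it to out_path."""
    # Imported lazily so write_wav can be used without Bark installed.
    from bark import SAMPLE_RATE, generate_audio, preload_models

    preload_models()              # downloads and caches the model weights on first run
    audio = generate_audio(text)  # text -> semantic tokens -> codec tokens -> waveform
    write_wav(out_path, audio, SAMPLE_RATE)
    return out_path
```

Calling `speak("Hello, this is a test of Bark.")` should leave a WAV file in the working directory. The first run will be slow because the model weights have to be downloaded, and, as the Suno quote above suggests, the same text can come out sounding quite different from one run to the next.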