[Discussion] Speech synthesis – text-to-speech vs speech-to-speech
I’ve recently started looking into speech synthesis, and noticed that most of the focus is on text-to-speech.
I haven’t had much luck finding anything on speech-to-speech – that is, changing the voice in an audio clip to that of another person (e.g. by passing a voice embedding as an input to the model). I’m not sure what the actual term for it is. Is there much happening in this space, and if so, any recommendations on where to start? While it’s not as broadly applicable, it seems (on the surface) like it’d be a lot easier than TTS.