How We Made an Iron Man JARVIS-Inspired TTS Voice

Iron Man's JARVIS, a fictional AI voice assistant, has captured the imagination of tech enthusiasts and movie buffs alike. In this post, we take you behind the scenes of how we built a JARVIS-inspired text-to-speech (TTS) voice, and walk through the tools and techniques that bring some of that JARVIS magic to life.

1. Cloning the Voice with Bark

To create a JARVIS-like TTS, we rely on the power of Bark, a transformer-based text-to-speech model developed by Suno AI. Bark comprises four main models, each contributing to the synthesis of lifelike speech:

BarkSemanticModel (Text Model):

This causal auto-regressive transformer model takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.

BarkCoarseModel (Coarse Acoustics Model):

As a causal autoregressive transformer, it uses the results of the BarkSemanticModel to predict the first two audio codebooks required for EnCodec.

BarkFineModel (Fine Acoustics Model):

This non-causal autoencoder transformer iteratively predicts the remaining codebooks, each one based on the sum of the previous codebook embeddings.

EnCodec Model:

Having predicted all the codebook channels, Bark utilizes EnCodec to decode the output audio array.

Additionally, each of the first three modules supports conditional speaker embeddings to customize the output sound according to specific predefined voices.
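
To get a feel for how much data each stage handles, here is a small back-of-the-envelope sketch. The constants are copied from what we understand to be the values in bark.generation (treat them as assumptions and check your installed version), and token_budget is a hypothetical helper name introduced only for this illustration.

```python
# Constants mirroring the values we believe are defined in bark.generation
# (copied here so the sketch runs without Bark installed).
SEMANTIC_RATE_HZ = 49.9   # semantic tokens per second of audio
COARSE_RATE_HZ = 75       # EnCodec frames per second of audio
N_COARSE_CODEBOOKS = 2    # codebooks predicted by the coarse model
N_FINE_CODEBOOKS = 8      # total codebooks once the fine model has run


def token_budget(seconds: float) -> dict:
    """Estimate how many tokens each Bark stage handles for a clip."""
    frames = round(COARSE_RATE_HZ * seconds)
    return {
        "semantic_tokens": round(SEMANTIC_RATE_HZ * seconds),
        "coarse_tokens": frames * N_COARSE_CODEBOOKS,
        "fine_tokens": frames * N_FINE_CODEBOOKS,
    }


print(token_budget(10))
```

Under these assumptions, a 10-second clip comes to roughly 499 semantic tokens, 1,500 coarse tokens, and 6,000 fine tokens, which shows why the fine acoustics stage carries most of the audio detail.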

Code to Set Up Bark:

Here's the code to set up Bark for your project:

!pip install git+https://github.com/suno-ai/bark.git
!git clone https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer
!pip install -r ./bark-voice-cloning-HuBERT-quantizer/requirements.txt

This code prepares the environment to use Bark for voice cloning.
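
Before moving on, it can help to confirm that the modules used in the rest of this post are actually importable. The sketch below is one way to do that with the standard library. The module list is our reading of the imports used later, and missing_modules is a hypothetical helper name. If bark_hubert_quantizer is reported as missing, the cloned repository may need to be installed as a package or added to your Python path, depending on your setup.

```python
from importlib.util import find_spec

# Top-level module names used later in this post (an assumption based on its imports)
REQUIRED_MODULES = [
    "bark",
    "bark_hubert_quantizer",
    "encodec",
    "torch",
    "torchaudio",
    "numpy",
]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```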

2. Cloning the Voice with HuBERT

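For the JARVIS-inspired voice, we employ HuBERT, a self-supervised speech representation model, to clone the speaker's voice. HuBERT turns the reference audio into features that a small tokenizer model then maps to the semantic tokens Bark expects, which makes it a good fit for voice cloning.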

Code to Set Up HuBERT:

Here's how you set up HuBERT for your project:

import torch

# Load HuBERT for semantic tokens
from bark_hubert_quantizer.pre_kmeans_hubert import CustomHubert
from bark_hubert_quantizer.customtokenizer import CustomTokenizer

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the HuBERT model (the checkpoint must already exist at this path)
hubert_model = CustomHubert(checkpoint_path='data/models/hubert/hubert.pt').to(device)

# Load the CustomTokenizer model
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth', map_location=device).to(device)

This code snippet loads the HuBERT model and the CustomTokenizer model, which are essential components for voice cloning.

3. Processing the Audio

Once we have the models in place, we can begin processing the audio waveform that contains the desired speaker's voice.

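import torchaudio
from encodec.utils import convert_audio
from bark.generation import load_codec_model

model = load_codec_model(use_gpu=(device == 'cuda'))  # EnCodec model used by Bark

# Load and pre-process the audio waveform
audio_filepath = 'speaker.wav'  # Replace with the path to your audio file
wav, sr = torchaudio.load(audio_filepath)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.to(device)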

This code prepares the audio file for further analysis.
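
The convert_audio call resamples the clip to the sample rate and channel count that the EnCodec model expects. If you want to see what your reference clip looks like before conversion, a quick check with Python's built-in wave module can be sketched as follows. It assumes an uncompressed PCM WAV file, and describe_wav is a hypothetical helper name.

```python
import os
import wave


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / sample_rate
    return sample_rate, channels, duration


# Only inspect the clip if it is present in the working directory
if os.path.exists("speaker.wav"):
    rate, channels, seconds = describe_wav("speaker.wav")
    print(f"{rate} Hz, {channels} channel(s), {seconds:.1f} s")
```

A clip that is very short, or that has a sample rate far from the codec's, is still converted by convert_audio, but knowing the original properties makes it easier to judge the quality of the reference material.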

This is the reference audio clip that we used to clone the voice:

https://prismic-io.s3.amazonaws.com/multiverseapp/b883a6f6-6aa4-48e2-902f-4484fdc568d8_jarvis_training.wav

4. Extracting Semantic Tokens

Next, we pass the waveform through HuBERT and quantize its output with the tokenizer, producing the semantic tokens that form the semantic part of the voice prompt.

semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
semantic_tokens = tokenizer.get_token(semantic_vectors)

These semantic tokens capture what is being said in the reference clip and later serve as the semantic prompt for Bark's text model.

5. Extracting Discrete Codes

To capture how the voice sounds, we also encode the same waveform with EnCodec to obtain its discrete codebook indices.

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()  # [n_q, T]

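These codes carry the acoustic character of the speaker's voice: the first two codebooks become Bark's coarse prompt, and the full set becomes the fine prompt.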

6. Saving the Voice Model

We save the generated voice model to use in the TTS system.

import os
import numpy as np

# Move codes and semantic tokens to CPU
codes = codes.cpu().numpy()
semantic_tokens = semantic_tokens.cpu().numpy()

voice_filename = 'output.npz'
current_path = os.getcwd()
voice_name = os.path.join(current_path, voice_filename)

# Save the voice model
np.savez(voice_name, fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)

This code snippet ensures that the voice model is easily accessible for TTS synthesis.
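
A malformed prompt file tends to fail only later, during generation, so a quick shape check right after saving can save time. The sketch below assumes the layout implied by the np.savez call above: a 1-D semantic prompt, a coarse prompt with 2 codebook rows, and a fine prompt with 8 codebook rows (the 8 is our assumption about Bark's EnCodec configuration). check_prompt_shapes is a hypothetical helper name, and it takes plain shape tuples so it runs without NumPy.

```python
N_COARSE_CODEBOOKS = 2  # rows expected in coarse_prompt
N_FINE_CODEBOOKS = 8    # rows expected in fine_prompt (assumed codec setting)


def check_prompt_shapes(semantic_shape, coarse_shape, fine_shape):
    """Return a list of problems found in the prompt shapes (empty if none)."""
    problems = []
    if len(semantic_shape) != 1:
        problems.append("semantic_prompt should be 1-D")
    if len(coarse_shape) != 2 or coarse_shape[0] != N_COARSE_CODEBOOKS:
        problems.append("coarse_prompt should have shape (2, T)")
    if len(fine_shape) != 2 or fine_shape[0] != N_FINE_CODEBOOKS:
        problems.append("fine_prompt should have shape (8, T)")
    if len(coarse_shape) == 2 and len(fine_shape) == 2 and coarse_shape[1] != fine_shape[1]:
        problems.append("coarse_prompt and fine_prompt should share the same T")
    return problems


# With the saved file, the shapes would come from NumPy, for example:
#   data = np.load(voice_name)
#   check_prompt_shapes(data["semantic_prompt"].shape,
#                       data["coarse_prompt"].shape,
#                       data["fine_prompt"].shape)
print(check_prompt_shapes((499,), (2, 750), (8, 750)))
```

An empty list means the prompt has the expected layout; any returned message points at the array that needs a second look.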

7. Creating JARVIS-Like Voice

Finally, we can utilize the saved voice model to generate a JARVIS-inspired voice using Bark.

from bark.api import generate_audio
from bark.generation import SAMPLE_RATE, preload_models, codec_decode, generate_coarse, generate_fine, generate_text_semantic

# Enter your prompt and speaker here
text_prompt = "Hello Mr. Stark. I am JARVIS, your personal voice assistant. Thanks for reading this blog by Multiverse Software!"

# Simple generation
audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

The code above uses Bark to synthesize the speaker's voice based on the prompt provided.
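
If you want more control than generate_audio offers, the same result can be approached by running Bark's stages one at a time with the lower-level functions from bark.generation. The sketch below reflects our understanding of what generate_audio does internally, not an official recipe. generate_staged is a hypothetical wrapper name, and the fixed 0.5 temperature for the fine stage is an assumption.

```python
def generate_staged(text, voice_path, text_temp=0.7, waveform_temp=0.7):
    """Run Bark's stages one at a time and return the waveform array."""
    # Imported lazily so this function can be defined without Bark installed
    from bark.generation import (
        codec_decode,
        generate_coarse,
        generate_fine,
        generate_text_semantic,
    )

    # Stage 1: text -> semantic tokens, conditioned on the cloned voice
    semantic = generate_text_semantic(text, history_prompt=voice_path, temp=text_temp)
    # Stage 2: semantic tokens -> the first two EnCodec codebooks
    coarse = generate_coarse(semantic, history_prompt=voice_path, temp=waveform_temp)
    # Stage 3: coarse codebooks -> the full set of codebooks
    fine = generate_fine(coarse, history_prompt=voice_path, temp=0.5)
    # Stage 4: EnCodec decodes the codebooks into audio samples
    return codec_decode(fine)


# Example call, assuming the prompt file saved earlier:
#   audio_array = generate_staged(text_prompt, voice_name)
```

Splitting the stages makes it possible to inspect or cache the intermediate tokens, for example to reuse one set of semantic tokens while experimenting with different waveform temperatures.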

8. Enjoy the JARVIS-Inspired Voice

Now, you can enjoy the JARVIS-like voice that you've created. Play or save it for your project's unique needs.

from IPython.display import Audio
# Play the audio
Audio(audio_array, rate=SAMPLE_RATE)

The JARVIS-inspired TTS is ready to explore and integrate into your applications. It's a step towards creating the future of AI-powered voice assistants.
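
To keep the result outside a notebook, the generated samples can be written to a WAV file. The sketch below sticks to the standard library and assumes the audio is a sequence of floats in the range [-1.0, 1.0] at 24 kHz, which is our understanding of Bark's output. save_wav is a hypothetical helper name. A library such as SciPy can do the same in one call if it is already installed.

```python
import struct
import wave

SAMPLE_RATE = 24_000  # assumed to match bark.generation.SAMPLE_RATE


def save_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] to a 16-bit mono WAV file."""
    # Clip each sample, then scale it to the signed 16-bit range
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)


# Example call with the array produced above:
#   save_wav(audio_array, "jarvis.wav")
```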

These are the results that we got:

https://prismic-io.s3.amazonaws.com/multiverseapp/dcf99896-1220-48cc-b606-af5d3e918269_jarvis-voice-1.wav

Creating a voice that resonates with your vision is both an art and a science. The process may seem complex, but with the right tools and expertise, it becomes a journey of creativity and innovation. At Multiverse Software, we're committed to pushing the boundaries of technology to bring your ideas to life. Reach out to us today and embark on your AI-powered voice assistant adventure.

Creating a JARVIS-inspired TTS system involves harnessing advanced models and deep learning techniques. From extracting semantic tokens with HuBERT and acoustic codes with EnCodec to synthesizing speech with Bark, it's a multi-step process that blends technology and creativity. The resulting voice is a testament to the power of AI and its potential in shaping the future of human-computer interaction.

At Multiverse Software, we're at the forefront of this technology, ready to turn your vision into reality. Contact us today, and let's take your voice assistant project to the next level.