SparkNV-Voice
1 language
license:cc-by-nc-sa-4.0
by yasserrmd
Audio Model
OTHER
2507.13155B params
New
6 downloads
Early-stage
Quick Summary
SparkNV-Voice is a fine-tuned version of the Spark-TTS model trained on the NonverbalTTS dataset.
Device Compatibility
Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum Recommended
2335GB+ RAM
Code Examples
🛠 Installation (bash)
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
pip install --no-deps unsloth
git clone https://github.com/SparkAudio/Spark-TTS
pip install omegaconf einx
🚀 Inference Code (Python, PyTorch)
import torch
import re
import numpy as np
from unsloth import FastModel
import sys

# Make the cloned Spark-TTS repository importable
sys.path.append('Spark-TTS')
from sparktts.models.audio_tokenizer import BiCodecTokenizer
from huggingface_hub import snapshot_download

# Download model and code
snapshot_download("yasserrmd/SparkNV-Voice", local_dir = "SparkNV-Voice")

max_seq_length = 2048 # Choose any for long context!

model, tokenizer = FastModel.from_pretrained(
    model_name = "SparkNV-Voice",
    max_seq_length = max_seq_length,
    dtype = torch.float32, # Spark seems to only work on float32 for now
    full_finetuning = True, # We support full finetuning now!
    load_in_4bit = False,
    #token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

FastModel.for_inference(model) # Enable native 2x faster inference

audio_tokenizer = BiCodecTokenizer("SparkNV-Voice", "cuda")
audio_tokenizer.model.to("cuda")

input_text = "Hey there, my name is Yasser, and I'm a 🌬️ speech generation model that can sound like a person."
chosen_voice = None # None for single-speaker

@torch.inference_mode()
def generate_speech_from_text(
    text: str,
    temperature: float = 0.8, # Generation temperature
    top_k: int = 50, # Generation top_k
    top_p: float = 1.0, # Generation top_p
    max_new_audio_tokens: int = 2048, # Max tokens for audio part
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
    """
    Generates speech audio from text using default voice control parameters.

    Args:
        text (str): The text input to be converted to speech.
        temperature (float): Sampling temperature for generation.
        top_k (int): Top-k sampling parameter.
        top_p (float): Top-p (nucleus) sampling parameter.
        max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).
        device (torch.device): Device to run inference on.

    Returns:
        np.ndarray: Generated waveform as a NumPy array.
    """
    torch.compiler.reset()

    # Build the TTS prompt from the special task and content tokens
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>"
    ])

    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

    print("Generating token sequence...")
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_audio_tokens, # Limit generation length
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id, # Stop token
        pad_token_id=tokenizer.pad_token_id # Use the model's pad token id
    )
    print("Token sequence generated.")

    # Drop the prompt tokens and keep only the newly generated ones
    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]
    # print(f"\nGenerated Text (for parsing):\n{predicts_text}\n") # Debugging

    # Extract semantic token IDs using regex
    semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
    if not semantic_matches:
        print("Warning: No semantic tokens found in the generated output.")
        # Handle appropriately - perhaps return silence or raise error
        return np.array([], dtype=np.float32)

    pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0) # Add batch dim

    # Extract global token IDs using regex (assuming controllable mode also generates these)
    global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
    if not global_matches:
        print("Warning: No global tokens found in the generated output (controllable mode). Might use defaults or fail.")
        pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
    else:
        pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0) # Add batch dim

    pred_global_ids = pred_global_ids.unsqueeze(0) # Shape becomes (1, 1, N_global)

    print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
    print(f"Found {pred_global_ids.shape[2]} global tokens.")

    # Detokenize using BiCodecTokenizer
    print("Detokenizing audio tokens...")
    # Ensure audio_tokenizer and its internal model are on the correct device
    audio_tokenizer.device = device
    audio_tokenizer.model.to(device)
    # Squeeze the extra dimension from global tokens as seen in SparkTTS example
    wav_np = audio_tokenizer.detokenize(
        pred_global_ids.to(device).squeeze(0), # Shape (1, N_global)
        pred_semantic_ids.to(device) # Shape (1, N_semantic)
    )
    print("Detokenization complete.")

    return wav_np

if __name__ == "__main__":
    print(f"Generating speech for: '{input_text}'")
    # Prefix the speaker name when a voice is chosen, then synthesize that text
    text = f"{chosen_voice}: " + input_text if chosen_voice else input_text
    generated_waveform = generate_speech_from_text(text)

    if generated_waveform.size > 0:
        import soundfile as sf
        output_filename = "generated_speech_controllable.wav"
        sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
        sf.write(output_filename, generated_waveform, sample_rate)
        print(f"Audio saved to {output_filename}")

        # Optional: Play in notebook
        from IPython.display import Audio, display
        display(Audio(generated_waveform, rate=sample_rate))
    else:
        print("Audio generation failed (no tokens found?).")
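The prompt layout and the token-parsing step used in the inference code can be tried without loading the model or a GPU. The following is a minimal sketch under stated assumptions: `build_tts_prompt` and `parse_bicodec_tokens` are hypothetical helper names (not part of Spark-TTS or Unsloth), and the sample string is hand-written for illustration, not real model output. Only the special tokens and the two regular expressions are taken from the inference code.

```python
import re
from typing import List, Optional, Tuple

# Patterns copied from the inference code; the helper names below are hypothetical.
SEMANTIC_RE = re.compile(r"<\|bicodec_semantic_(\d+)\|>")
GLOBAL_RE = re.compile(r"<\|bicodec_global_(\d+)\|>")


def build_tts_prompt(text: str, voice: Optional[str] = None) -> str:
    """Wrap text in the special tokens used by the inference code."""
    # Optional speaker prefix, mirroring the chosen_voice handling in the script.
    content = f"{voice}: {text}" if voice else text
    return "".join([
        "<|task_tts|>",
        "<|start_content|>",
        content,
        "<|end_content|>",
        "<|start_global_token|>",
    ])


def parse_bicodec_tokens(decoded: str) -> Tuple[List[int], List[int]]:
    """Return (global_ids, semantic_ids) found in a decoded string."""
    global_ids = [int(m) for m in GLOBAL_RE.findall(decoded)]
    semantic_ids = [int(m) for m in SEMANTIC_RE.findall(decoded)]
    return global_ids, semantic_ids


if __name__ == "__main__":
    print(build_tts_prompt("Hello there."))
    # Hand-written stand-in for decoded model output (illustrative only).
    sample = (
        "<|bicodec_global_12|><|bicodec_global_7|>"
        "<|bicodec_semantic_301|><|bicodec_semantic_45|>"
    )
    print(parse_bicodec_tokens(sample))
```

Keeping the parsing in a pure function makes it straightforward to check the empty-result case (no semantic tokens) separately from model generation.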
import torch
import re
import numpy as np
from typing import Dict, Any
import torchaudio.transforms as T
from unsloth import FastModel
import sys
sys.path.append('Spark-TTS')
from sparktts.models.audio_tokenizer import BiCodecTokenizer
from huggingface_hub import snapshot_download
# Download model and code
snapshot_download("yasserrmd/SparkNV-Voice", local_dir = "SparkNV-Voice")
max_seq_length = 2048 # Choose any for long context!
model, tokenizer = FastModel.from_pretrained(
model_name = "SparkNV-Voice",
max_seq_length = max_seq_length,
dtype = torch.float32, # Spark seems to only work on float32 for now
full_finetuning = True, # We support full finetuning now!
load_in_4bit = False,
#token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastModel.for_inference(model) # Enable native 2x faster inference
audio_tokenizer = BiCodecTokenizer("SparkNV-Voice", "cuda")
audio_tokenizer.model.to("cuda")
input_text = "Hey there, my name is Yasser, and I'm a 🌬️ speech generation model that can sound like a person."
chosen_voice = None # None for single-speaker
@torch.inference_mode()
def generate_speech_from_text(
text: str,
temperature: float = 0.8, # Generation temperature
top_k: int = 50, # Generation top_k
top_p: float = 1, # Generation top_p
max_new_audio_tokens: int = 2048, # Max tokens for audio part
device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
"""
Generates speech audio from text using default voice control parameters.
Args:
text (str): The text input to be converted to speech.
temperature (float): Sampling temperature for generation.
top_k (int): Top-k sampling parameter.
top_p (float): Top-p (nucleus) sampling parameter.
max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).
device (torch.device): Device to run inference on.
Returns:
np.ndarray: Generated waveform as a NumPy array.
"""
torch.compiler.reset()
prompt = "".join([
"<|task_tts|>",
"<|start_content|>",
text,
"<|end_content|>",
"<|start_global_token|>"
])
model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
print("Generating token sequence...")
generated_ids = model.generate(
**model_inputs,
max_new_tokens=max_new_audio_tokens, # Limit generation length
do_sample=True,
temperature=temperature,
top_k=top_k,
top_p=top_p,
eos_token_id=tokenizer.eos_token_id, # Stop token
pad_token_id=tokenizer.pad_token_id # Use models pad token id
)
print("Token sequence generated.")
generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]
# print(f"\nGenerated Text (for parsing):\n{predicts_text}\n") # Debugging
# Extract semantic token IDs using regex
semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
if not semantic_matches:
print("Warning: No semantic tokens found in the generated output.")
# Handle appropriately - perhaps return silence or raise error
return np.array([], dtype=np.float32)
pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0) # Add batch dim
# Extract global token IDs using regex (assuming controllable mode also generates these)
global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
if not global_matches:
print("Warning: No global tokens found in the generated output (controllable mode). Might use defaults or fail.")
pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
else:
pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0) # Add batch dim
pred_global_ids = pred_global_ids.unsqueeze(0) # Shape becomes (1, 1, N_global)
print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
print(f"Found {pred_global_ids.shape[2]} global tokens.")
# 5. Detokenize using BiCodecTokenizer
print("Detokenizing audio tokens...")
# Ensure audio_tokenizer and its internal model are on the correct device
audio_tokenizer.device = device
audio_tokenizer.model.to(device)
# Squeeze the extra dimension from global tokens as seen in SparkTTS example
wav_np = audio_tokenizer.detokenize(
pred_global_ids.to(device).squeeze(0), # Shape (1, N_global)
pred_semantic_ids.to(device) # Shape (1, N_semantic)
)
print("Detokenization complete.")
return wav_np
if __name__ == "__main__":
print(f"Generating speech for: '{input_text}'")
text = f"{chosen_voice}: " + input_text if chosen_voice else input_text
generated_waveform = generate_speech_from_text(input_text)
if generated_waveform.size > 0:
import soundfile as sf
output_filename = "generated_speech_controllable.wav"
sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
sf.write(output_filename, generated_waveform, sample_rate)
print(f"Audio saved to {output_filename}")
# Optional: Play in notebook
from IPython.display import Audio, display
display(Audio(generated_waveform, rate=sample_rate))
else:
print("Audio generation failed (no tokens found?).")🚀 Inference Codepythonpytorch
import torch
import re
import numpy as np
from typing import Dict, Any
import torchaudio.transforms as T
from unsloth import FastModel
import sys
sys.path.append('Spark-TTS')
from sparktts.models.audio_tokenizer import BiCodecTokenizer
from huggingface_hub import snapshot_download
# Download model and code
snapshot_download("yasserrmd/SparkNV-Voice", local_dir = "SparkNV-Voice")
max_seq_length = 2048 # Choose any for long context!
model, tokenizer = FastModel.from_pretrained(
model_name = "SparkNV-Voice",
max_seq_length = max_seq_length,
dtype = torch.float32, # Spark seems to only work on float32 for now
full_finetuning = True, # We support full finetuning now!
load_in_4bit = False,
#token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastModel.for_inference(model) # Enable native 2x faster inference
audio_tokenizer = BiCodecTokenizer("SparkNV-Voice", "cuda")
audio_tokenizer.model.to("cuda")
input_text = "Hey there, my name is Yasser, and I'm a 🌬️ speech generation model that can sound like a person."
chosen_voice = None # None for single-speaker
@torch.inference_mode()
def generate_speech_from_text(
text: str,
temperature: float = 0.8, # Generation temperature
top_k: int = 50, # Generation top_k
top_p: float = 1, # Generation top_p
max_new_audio_tokens: int = 2048, # Max tokens for audio part
device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
"""
Generates speech audio from text using default voice control parameters.
Args:
text (str): The text input to be converted to speech.
temperature (float): Sampling temperature for generation.
top_k (int): Top-k sampling parameter.
top_p (float): Top-p (nucleus) sampling parameter.
max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).
device (torch.device): Device to run inference on.
Returns:
np.ndarray: Generated waveform as a NumPy array.
"""
torch.compiler.reset()
prompt = "".join([
"<|task_tts|>",
"<|start_content|>",
text,
"<|end_content|>",
"<|start_global_token|>"
])
model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
print("Generating token sequence...")
generated_ids = model.generate(
**model_inputs,
max_new_tokens=max_new_audio_tokens, # Limit generation length
do_sample=True,
temperature=temperature,
top_k=top_k,
top_p=top_p,
eos_token_id=tokenizer.eos_token_id, # Stop token
pad_token_id=tokenizer.pad_token_id # Use models pad token id
)
print("Token sequence generated.")
generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]
# print(f"\nGenerated Text (for parsing):\n{predicts_text}\n") # Debugging
# Extract semantic token IDs using regex
semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
if not semantic_matches:
print("Warning: No semantic tokens found in the generated output.")
# Handle appropriately - perhaps return silence or raise error
return np.array([], dtype=np.float32)
pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0) # Add batch dim
# Extract global token IDs using regex (assuming controllable mode also generates these)
global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
if not global_matches:
print("Warning: No global tokens found in the generated output (controllable mode). Might use defaults or fail.")
pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
else:
pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0) # Add batch dim
pred_global_ids = pred_global_ids.unsqueeze(0) # Shape becomes (1, 1, N_global)
print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
print(f"Found {pred_global_ids.shape[2]} global tokens.")
# 5. Detokenize using BiCodecTokenizer
print("Detokenizing audio tokens...")
# Ensure audio_tokenizer and its internal model are on the correct device
audio_tokenizer.device = device
audio_tokenizer.model.to(device)
# Squeeze the extra dimension from global tokens as seen in SparkTTS example
wav_np = audio_tokenizer.detokenize(
pred_global_ids.to(device).squeeze(0), # Shape (1, N_global)
pred_semantic_ids.to(device) # Shape (1, N_semantic)
)
print("Detokenization complete.")
return wav_np
if __name__ == "__main__":
print(f"Generating speech for: '{input_text}'")
text = f"{chosen_voice}: " + input_text if chosen_voice else input_text
generated_waveform = generate_speech_from_text(input_text)
if generated_waveform.size > 0:
import soundfile as sf
output_filename = "generated_speech_controllable.wav"
sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
sf.write(output_filename, generated_waveform, sample_rate)
print(f"Audio saved to {output_filename}")
# Optional: Play in notebook
from IPython.display import Audio, display
display(Audio(generated_waveform, rate=sample_rate))
else:
print("Audio generation failed (no tokens found?).")🚀 Inference Codepythonpytorch
import torch
import re
import numpy as np
from typing import Dict, Any
import torchaudio.transforms as T
from unsloth import FastModel
import sys
sys.path.append('Spark-TTS')
from sparktts.models.audio_tokenizer import BiCodecTokenizer
from huggingface_hub import snapshot_download
# Download model and code
snapshot_download("yasserrmd/SparkNV-Voice", local_dir = "SparkNV-Voice")
max_seq_length = 2048 # Choose any for long context!
model, tokenizer = FastModel.from_pretrained(
model_name = "SparkNV-Voice",
max_seq_length = max_seq_length,
dtype = torch.float32, # Spark seems to only work on float32 for now
full_finetuning = True, # We support full finetuning now!
load_in_4bit = False,
#token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastModel.for_inference(model) # Enable native 2x faster inference
audio_tokenizer = BiCodecTokenizer("SparkNV-Voice", "cuda")
audio_tokenizer.model.to("cuda")
input_text = "Hey there, my name is Yasser, and I'm a 🌬️ speech generation model that can sound like a person."
chosen_voice = None # None for single-speaker
@torch.inference_mode()
def generate_speech_from_text(
text: str,
temperature: float = 0.8, # Generation temperature
top_k: int = 50, # Generation top_k
top_p: float = 1, # Generation top_p
max_new_audio_tokens: int = 2048, # Max tokens for audio part
device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
"""
Generates speech audio from text using default voice control parameters.
Args:
text (str): The text input to be converted to speech.
temperature (float): Sampling temperature for generation.
top_k (int): Top-k sampling parameter.
top_p (float): Top-p (nucleus) sampling parameter.
max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).
device (torch.device): Device to run inference on.
Returns:
np.ndarray: Generated waveform as a NumPy array.
"""
torch.compiler.reset()
prompt = "".join([
"<|task_tts|>",
"<|start_content|>",
text,
"<|end_content|>",
"<|start_global_token|>"
])
model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
print("Generating token sequence...")
generated_ids = model.generate(
**model_inputs,
max_new_tokens=max_new_audio_tokens, # Limit generation length
do_sample=True,
temperature=temperature,
top_k=top_k,
top_p=top_p,
eos_token_id=tokenizer.eos_token_id, # Stop token
pad_token_id=tokenizer.pad_token_id # Use models pad token id
)
print("Token sequence generated.")
generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]
# print(f"\nGenerated Text (for parsing):\n{predicts_text}\n") # Debugging
# Extract semantic token IDs using regex
semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
if not semantic_matches:
print("Warning: No semantic tokens found in the generated output.")
# Handle appropriately - perhaps return silence or raise error
return np.array([], dtype=np.float32)
pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0) # Add batch dim
# Extract global token IDs using regex (assuming controllable mode also generates these)
global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
if not global_matches:
print("Warning: No global tokens found in the generated output (controllable mode). Might use defaults or fail.")
pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
else:
pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0) # Add batch dim
pred_global_ids = pred_global_ids.unsqueeze(0) # Shape becomes (1, 1, N_global)
print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
print(f"Found {pred_global_ids.shape[2]} global tokens.")
# 5. Detokenize using BiCodecTokenizer
print("Detokenizing audio tokens...")
# Ensure audio_tokenizer and its internal model are on the correct device
audio_tokenizer.device = device
audio_tokenizer.model.to(device)
# Squeeze the extra dimension from global tokens as seen in SparkTTS example
wav_np = audio_tokenizer.detokenize(
pred_global_ids.to(device).squeeze(0), # Shape (1, N_global)
pred_semantic_ids.to(device) # Shape (1, N_semantic)
)
print("Detokenization complete.")
return wav_np
if __name__ == "__main__":
print(f"Generating speech for: '{input_text}'")
text = f"{chosen_voice}: " + input_text if chosen_voice else input_text
generated_waveform = generate_speech_from_text(input_text)
if generated_waveform.size > 0:
import soundfile as sf
output_filename = "generated_speech_controllable.wav"
sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
sf.write(output_filename, generated_waveform, sample_rate)
print(f"Audio saved to {output_filename}")
# Optional: Play in notebook
from IPython.display import Audio, display
display(Audio(generated_waveform, rate=sample_rate))
else:
print("Audio generation failed (no tokens found?).")🚀 Inference Codepythonpytorch
import torch
import re
import numpy as np
from typing import Dict, Any
import torchaudio.transforms as T
from unsloth import FastModel
import sys
sys.path.append('Spark-TTS')
from sparktts.models.audio_tokenizer import BiCodecTokenizer
from huggingface_hub import snapshot_download
# Download model and code
snapshot_download("yasserrmd/SparkNV-Voice", local_dir = "SparkNV-Voice")
max_seq_length = 2048 # Choose any for long context!
model, tokenizer = FastModel.from_pretrained(
model_name = "SparkNV-Voice",
max_seq_length = max_seq_length,
dtype = torch.float32, # Spark seems to only work on float32 for now
full_finetuning = True, # We support full finetuning now!
load_in_4bit = False,
#token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastModel.for_inference(model) # Enable native 2x faster inference
audio_tokenizer = BiCodecTokenizer("SparkNV-Voice", "cuda")
audio_tokenizer.model.to("cuda")
input_text = "Hey there, my name is Yasser, and I'm a 🌬️ speech generation model that can sound like a person."
chosen_voice = None # None for single-speaker
@torch.inference_mode()
def generate_speech_from_text(
text: str,
temperature: float = 0.8, # Generation temperature
top_k: int = 50, # Generation top_k
top_p: float = 1, # Generation top_p
max_new_audio_tokens: int = 2048, # Max tokens for audio part
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
    """
    Generates speech audio from text using default voice control parameters.

    Args:
        text (str): The text input to be converted to speech.
        temperature (float): Sampling temperature for generation.
        top_k (int): Top-k sampling parameter.
        top_p (float): Top-p (nucleus) sampling parameter.
        max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).
        device (torch.device): Device to run inference on.

    Returns:
        np.ndarray: Generated waveform as a NumPy array.
    """
    torch.compiler.reset()

    # Build the TTS prompt; generation continues from the global-token marker.
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>"
    ])
    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

    print("Generating token sequence...")
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_audio_tokens,  # Limit generation length
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,  # Stop token
        pad_token_id=tokenizer.pad_token_id  # Use the model's pad token id
    )
    print("Token sequence generated.")

    # Drop the prompt tokens and decode only the newly generated part.
    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]
    # print(f"\nGenerated Text (for parsing):\n{predicts_text}\n")  # Debugging

    # Extract semantic token IDs using regex
    semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
    if not semantic_matches:
        print("Warning: No semantic tokens found in the generated output.")
        # Return an empty waveform; the caller checks `.size` before saving.
        return np.array([], dtype=np.float32)
    pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0)  # Add batch dim

    # Extract global token IDs using regex (assuming controllable mode also generates these)
    global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
    if not global_matches:
        print("Warning: No global tokens found in the generated output (controllable mode). Might use defaults or fail.")
        pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
    else:
        pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0)  # Add batch dim
    pred_global_ids = pred_global_ids.unsqueeze(0)  # Shape becomes (1, 1, N_global)

    print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
    print(f"Found {pred_global_ids.shape[2]} global tokens.")

    # Detokenize using BiCodecTokenizer
    print("Detokenizing audio tokens...")
    # Ensure audio_tokenizer and its internal model are on the correct device
    audio_tokenizer.device = device
    audio_tokenizer.model.to(device)
    # Squeeze the extra dimension from global tokens as seen in SparkTTS example
    wav_np = audio_tokenizer.detokenize(
        pred_global_ids.to(device).squeeze(0),  # Shape (1, N_global)
        pred_semantic_ids.to(device)  # Shape (1, N_semantic)
    )
    print("Detokenization complete.")
    return wav_np


if __name__ == "__main__":
    print(f"Generating speech for: '{input_text}'")
    # Prefix the speaker name only when a voice is chosen.
    text = f"{chosen_voice}: " + input_text if chosen_voice else input_text
    generated_waveform = generate_speech_from_text(text)
    if generated_waveform.size > 0:
        import soundfile as sf
        output_filename = "generated_speech_controllable.wav"
        sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
        sf.write(output_filename, generated_waveform, sample_rate)
        print(f"Audio saved to {output_filename}")
        # Optional: Play in notebook
        from IPython.display import Audio, display
        display(Audio(generated_waveform, rate=sample_rate))
    else:
        print("Audio generation failed (no tokens found?).")

🚀 Inference Code
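The inference script recovers audio token IDs from the decoded model output with two regular expressions. A minimal standalone sketch of that parsing step is shown here, without the model or `BiCodecTokenizer`. The helper name `parse_bicodec_tokens` and the sample string are illustrative assumptions, not part of the original script; only the two regex patterns are taken from it.

```python
import re
from typing import List, Tuple

# Patterns copied from the inference script.
SEMANTIC_RE = re.compile(r"<\|bicodec_semantic_(\d+)\|>")
GLOBAL_RE = re.compile(r"<\|bicodec_global_(\d+)\|>")


def parse_bicodec_tokens(decoded: str) -> Tuple[List[int], List[int]]:
    """Return (global_ids, semantic_ids) found in a decoded string.

    Hypothetical helper for illustration; the script does this inline.
    """
    global_ids = [int(m) for m in GLOBAL_RE.findall(decoded)]
    semantic_ids = [int(m) for m in SEMANTIC_RE.findall(decoded)]
    return global_ids, semantic_ids


# Made-up sample output; real generations contain many more tokens.
sample = (
    "<|bicodec_global_12|><|bicodec_global_7|>"
    "<|bicodec_semantic_301|><|bicodec_semantic_5|>"
)
global_ids, semantic_ids = parse_bicodec_tokens(sample)
print(global_ids, semantic_ids)
```

In the full script these lists are turned into `torch.long` tensors with a leading batch dimension before being passed to `audio_tokenizer.detokenize`. An empty semantic list corresponds to the "no semantic tokens found" warning path.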
import re
import sys

import numpy as np
import torch
from huggingface_hub import snapshot_download
from unsloth import FastModel

# The Spark-TTS repository must be cloned next to this script (see Installation)
sys.path.append('Spark-TTS')
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Download model and code
snapshot_download("yasserrmd/SparkNV-Voice", local_dir = "SparkNV-Voice")

max_seq_length = 2048  # Choose any for long context!

model, tokenizer = FastModel.from_pretrained(
    model_name = "SparkNV-Voice",
    max_seq_length = max_seq_length,
    dtype = torch.float32,  # Spark seems to only work on float32 for now
    full_finetuning = True,  # We support full finetuning now!
    load_in_4bit = False,
    # token = "hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastModel.for_inference(model)  # Enable native 2x faster inference

audio_tokenizer = BiCodecTokenizer("SparkNV-Voice", "cuda")
audio_tokenizer.model.to("cuda")

input_text = "Hey there, my name is Yasser, and I'm a 🌬️ speech generation model that can sound like a person."
chosen_voice = None  # None for single-speaker


@torch.inference_mode()
def generate_speech_from_text(
    text: str,
    temperature: float = 0.8,          # Generation temperature
    top_k: int = 50,                   # Generation top_k
    top_p: float = 1.0,                # Generation top_p
    max_new_audio_tokens: int = 2048,  # Max tokens for audio part
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
    """
    Generates speech audio from text using default voice control parameters.

    Args:
        text (str): The text input to be converted to speech.
        temperature (float): Sampling temperature for generation.
        top_k (int): Top-k sampling parameter.
        top_p (float): Top-p (nucleus) sampling parameter.
        max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).
        device (torch.device): Device to run inference on.

    Returns:
        np.ndarray: Generated waveform as a NumPy array (empty if no semantic tokens were generated).
    """
    torch.compiler.reset()

    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>"
    ])

    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

    print("Generating token sequence...")
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_audio_tokens,  # Limit generation length
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,  # Stop token
        pad_token_id=tokenizer.pad_token_id   # Use the model's pad token id
    )
    print("Token sequence generated.")

    # Keep only the newly generated tokens (drop the prompt)
    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]
    # print(f"\nGenerated Text (for parsing):\n{predicts_text}\n")  # Debugging

    # Extract semantic token IDs using regex
    semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
    if not semantic_matches:
        print("Warning: No semantic tokens found in the generated output.")
        # Handle appropriately - perhaps return silence or raise error
        return np.array([], dtype=np.float32)

    pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0)  # Add batch dim

    # Extract global token IDs using regex (assuming controllable mode also generates these)
    global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
    if not global_matches:
        print("Warning: No global tokens found in the generated output (controllable mode). Might use defaults or fail.")
        pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
    else:
        pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0)  # Add batch dim

    pred_global_ids = pred_global_ids.unsqueeze(0)  # Shape becomes (1, 1, N_global)

    print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
    print(f"Found {pred_global_ids.shape[2]} global tokens.")

    # Detokenize using BiCodecTokenizer
    print("Detokenizing audio tokens...")
    # Ensure audio_tokenizer and its internal model are on the correct device
    audio_tokenizer.device = device
    audio_tokenizer.model.to(device)
    # Squeeze the extra dimension from global tokens as seen in SparkTTS example
    wav_np = audio_tokenizer.detokenize(
        pred_global_ids.to(device).squeeze(0),  # Shape (1, N_global)
        pred_semantic_ids.to(device)            # Shape (1, N_semantic)
    )
    print("Detokenization complete.")
    return wav_np


if __name__ == "__main__":
    print(f"Generating speech for: '{input_text}'")
    # Prefix the speaker name when a voice is chosen, and pass the result on
    text = f"{chosen_voice}: " + input_text if chosen_voice else input_text
    generated_waveform = generate_speech_from_text(text)

    if generated_waveform.size > 0:
        import soundfile as sf
        output_filename = "generated_speech_controllable.wav"
        sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
        sf.write(output_filename, generated_waveform, sample_rate)
        print(f"Audio saved to {output_filename}")

        # Optional: Play in notebook
        from IPython.display import Audio, display
        display(Audio(generated_waveform, rate=sample_rate))
    else:
        print("Audio generation failed (no tokens found?).")

🚀 Inference Code
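The token-parsing step inside `generate_speech_from_text` can be tried on its own, without loading the model or a GPU. The sketch below is only an illustration: the helper name `parse_bicodec_tokens` and the sample string are hypothetical and not part of the model card, while the two regular expressions are the ones used in the inference script.

```python
import re
from typing import List, Tuple

# Same patterns as in the inference script.
SEMANTIC_RE = re.compile(r"<\|bicodec_semantic_(\d+)\|>")
GLOBAL_RE = re.compile(r"<\|bicodec_global_(\d+)\|>")


def parse_bicodec_tokens(decoded: str) -> Tuple[List[int], List[int]]:
    """Return (global_ids, semantic_ids) found in a decoded generation string.

    Hypothetical helper for illustration; the script does this inline.
    """
    global_ids = [int(m) for m in GLOBAL_RE.findall(decoded)]
    semantic_ids = [int(m) for m in SEMANTIC_RE.findall(decoded)]
    return global_ids, semantic_ids


# Made-up decoded output; real generations contain many more tokens.
sample = (
    "<|bicodec_global_12|><|bicodec_global_7|>"
    "<|bicodec_semantic_101|><|bicodec_semantic_5|><|bicodec_semantic_42|>"
)

global_ids, semantic_ids = parse_bicodec_tokens(sample)
print(global_ids)    # [12, 7]
print(semantic_ids)  # [101, 5, 42]
```

In the script, each list is then wrapped in a `torch.tensor` with a leading batch dimension before being handed to `audio_tokenizer.detokenize`. An empty semantic list corresponds to the early return of an empty waveform.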
import re
import sys

import numpy as np
import torch
from huggingface_hub import snapshot_download
from unsloth import FastModel

# The Spark-TTS repository must be cloned next to this script (see Installation)
sys.path.append('Spark-TTS')
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Download model and code
snapshot_download("yasserrmd/SparkNV-Voice", local_dir = "SparkNV-Voice")

max_seq_length = 2048  # Choose any for long context!

model, tokenizer = FastModel.from_pretrained(
    model_name = "SparkNV-Voice",
    max_seq_length = max_seq_length,
    dtype = torch.float32,  # Spark seems to only work on float32 for now
    full_finetuning = True,  # We support full finetuning now!
    load_in_4bit = False,
    # token = "hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)
FastModel.for_inference(model)  # Enable native 2x faster inference

audio_tokenizer = BiCodecTokenizer("SparkNV-Voice", "cuda")
audio_tokenizer.model.to("cuda")

input_text = "Hey there, my name is Yasser, and I'm a 🌬️ speech generation model that can sound like a person."
chosen_voice = None  # None for single-speaker


@torch.inference_mode()
def generate_speech_from_text(
    text: str,
    temperature: float = 0.8,          # Generation temperature
    top_k: int = 50,                   # Generation top_k
    top_p: float = 1.0,                # Generation top_p
    max_new_audio_tokens: int = 2048,  # Max tokens for audio part
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
    """
    Generates speech audio from text using default voice control parameters.

    Args:
        text (str): The text input to be converted to speech.
        temperature (float): Sampling temperature for generation.
        top_k (int): Top-k sampling parameter.
        top_p (float): Top-p (nucleus) sampling parameter.
        max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).
        device (torch.device): Device to run inference on.

    Returns:
        np.ndarray: Generated waveform as a NumPy array.
    """
    torch.compiler.reset()

    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>"
    ])

    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

    print("Generating token sequence...")
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_audio_tokens,  # Limit generation length
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,  # Stop token
        pad_token_id=tokenizer.pad_token_id   # Use the model's pad token id
    )
    print("Token sequence generated.")

    # Keep only the newly generated tokens (drop the prompt)
    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]
    # print(f"\nGenerated Text (for parsing):\n{predicts_text}\n")  # Debugging

    # Extract semantic token IDs using regex
    semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
    if not semantic_matches:
        print("Warning: No semantic tokens found in the generated output.")
        # Handle appropriately - perhaps return silence or raise error
        return np.array([], dtype=np.float32)

    pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0)  # Add batch dim

    # Extract global token IDs using regex (assuming controllable mode also generates these)
    global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
    if not global_matches:
        print("Warning: No global tokens found in the generated output (controllable mode). Might use defaults or fail.")
        pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
    else:
        pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0)  # Add batch dim

    pred_global_ids = pred_global_ids.unsqueeze(0)  # Shape becomes (1, 1, N_global)

    print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
    print(f"Found {pred_global_ids.shape[2]} global tokens.")

    # Detokenize using BiCodecTokenizer
    print("Detokenizing audio tokens...")
    # Ensure audio_tokenizer and its internal model are on the correct device
    audio_tokenizer.device = device
    audio_tokenizer.model.to(device)
    # Squeeze the extra dimension from global tokens as seen in SparkTTS example
    wav_np = audio_tokenizer.detokenize(
pred_global_ids.to(device).squeeze(0), # Shape (1, N_global)
pred_semantic_ids.to(device) # Shape (1, N_semantic)
)
print("Detokenization complete.")
return wav_np
if __name__ == "__main__":
print(f"Generating speech for: '{input_text}'")
text = f"{chosen_voice}: " + input_text if chosen_voice else input_text
generated_waveform = generate_speech_from_text(input_text)
if generated_waveform.size > 0:
import soundfile as sf
output_filename = "generated_speech_controllable.wav"
sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
sf.write(output_filename, generated_waveform, sample_rate)
print(f"Audio saved to {output_filename}")
# Optional: Play in notebook
from IPython.display import Audio, display
display(Audio(generated_waveform, rate=sample_rate))
else:
print("Audio generation failed (no tokens found?).")🚀 Inference Codepythonpytorch
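The token-parsing step in the inference code above can be tried on its own, without loading the model. The following is a minimal sketch using only the standard library: the helper name `parse_bicodec_tokens` and the sample string are hypothetical and made up for illustration, while the two regular expressions are the ones used in the inference code.

```python
import re
from typing import List, Tuple

# Same patterns as in the inference code
SEMANTIC_RE = re.compile(r"<\|bicodec_semantic_(\d+)\|>")
GLOBAL_RE = re.compile(r"<\|bicodec_global_(\d+)\|>")


def parse_bicodec_tokens(decoded: str) -> Tuple[List[int], List[int]]:
    """Return (global_ids, semantic_ids) found in a decoded generation string.

    Hypothetical helper, not part of Spark-TTS or Unsloth.
    """
    global_ids = [int(m) for m in GLOBAL_RE.findall(decoded)]
    semantic_ids = [int(m) for m in SEMANTIC_RE.findall(decoded)]
    return global_ids, semantic_ids


# Made-up sample output; real generations contain many more tokens
sample = (
    "<|bicodec_global_12|><|bicodec_global_7|>"
    "<|bicodec_semantic_101|><|bicodec_semantic_5|>"
)
global_ids, semantic_ids = parse_bicodec_tokens(sample)
print(global_ids, semantic_ids)
```

Both lists come back empty when the model emits no audio tokens, which is the case the inference code handles by returning an empty array.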