Rapid Voice Cloning
Description
Generates audio in a specified voice from the given text and a sample audio clip.
Request URL
[POST] wss://api.sensenova.cn/v2/audio
Request Header
Add an Authorization field to the request header carrying your API Key, as shown below:
HEADERS = {
"Authorization": "Bearer {API_KEY}" // $API_KEY is obtained under Service Management in SenseCore's ModelStudio platform
}
Client Request Format
| Byte offset | Field | Meaning | Format | Notes |
|---|---|---|---|---|
| 0 | version | Protocol version | uint8 integer; 0x01 is currently the only supported value, other values are undefined | |
| 1 | serialization | Serialization method | uint8 integer; 0x01: JSON, 0x02: protobuf (not implemented in the current version) | Payload data serialization method |
| 2-5 | data length | Payload length | uint32, big endian | Length of the payload that follows |
| 6-N | payload data | Request payload | byte[] | Parsed according to the payload type |
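As a quick illustration of the framing above, the header can be packed with Python's struct module (the complete pack/parse helpers appear in the full Python example at the end of this document):

```python
import json
import struct

# Build a frame per the table above: uint8 version (0x01), uint8 serialization
# (0x01 = JSON), big-endian uint32 payload length, then the payload bytes.
payload = json.dumps({"model": "SenseNova-Audio-TTS-0901"}).encode("utf-8")
frame = struct.pack(">BBI", 0x01, 0x01, len(payload)) + payload

assert frame[0] == 0x01                                   # version
assert frame[1] == 0x01                                   # serialization: JSON
assert int.from_bytes(frame[2:6], "big") == len(payload)  # data length
assert frame[6:] == payload                               # payload data
```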
Clone a Voice
This endpoint clones a voice from user-supplied audio data and its transcript, generating a corresponding voice ID.
Endpoint
wss://api.sensenova.cn/v2/audio/speech_clone
Request Parameters
| Parameter | Type | Required | Value | Description |
|---|---|---|---|---|
| model | string | Yes | SenseNova-Audio-Clone-0901 | Currently the only supported model |
| audio_data | bytes | Yes | Audio data, (0, 2] MB | |
| text | string | Yes | Transcript of the audio; must be non-empty, length (0, 200] characters | |
| audio_format | string | Yes | Audio format: PCM / WAV* / MP3 | |
| disable_noise_reduction | bool | No | true/false | Defaults to false |
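The size limits above can be checked client-side before sending. A minimal sketch (the helper name is illustrative; limits per the table):

```python
def validate_clone_request(audio_data: bytes, text: str) -> None:
    # audio_data must be non-empty and at most 2 MB
    if not 0 < len(audio_data) <= 2 * 1024 * 1024:
        raise ValueError(f"audio size out of range: {len(audio_data)} bytes")
    # text must be non-empty and at most 200 characters
    if not 0 < len(text) <= 200:
        raise ValueError(f"text length out of range: {len(text)} characters")
```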
Response Parameters
| Parameter | Value | Description |
|---|---|---|
| status | SUCCESS/FAILED | |
| error_detail | | Present only when status is FAILED |
| session_id | | Session ID |
| voice | | Voice ID of the cloned voice. Save it yourself once cloning completes; the service does not provide a way to look up cloned voice IDs later. |
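Once a response frame has been unpacked and JSON-decoded, the voice ID can be extracted as sketched below. The field values here are illustrative; remember to persist the voice ID yourself, since it cannot be queried afterwards:

```python
import json

# Hypothetical decoded clone response, with fields per the table above
raw = '{"status": "SUCCESS", "session_id": "sess-example", "voice": "spk_example_id"}'
resp = json.loads(raw)

if resp["status"] == "SUCCESS":
    cloned_voice = resp["voice"]  # save this ID; there is no lookup service for it
else:
    raise RuntimeError(resp.get("error_detail", "clone failed"))
```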
Request Example
payload = {
"model": "SenseNova-Audio-Clone-0901",
"text": text,
"audio_data": base64.b64encode(audio_data).decode("utf-8"),
"audio_format": "wav",
"disable_noise_reduction": True
}
Query Voices
Lists the system default voices and the voices you have cloned yourself.
Endpoint
https://api.sensenova.cn/v2/audio/voices
Method: GET
Request Parameters
| Parameter | Type | Required | Value | Description |
|---|---|---|---|---|
| voice_type | string | Yes | system / cloned* / all | Query parameter |
| voice | string | No | Voice ID | Query parameter |
Response Parameters
| Parameter | Value | Description |
|---|---|---|
| system_voices | Voice Object list | |
| cloned_voices | Voice Object list | |

Each Voice Object contains the following fields:
| Field | Type | Description |
|---|---|---|
| voice | string | Voice ID |
| voice_type | string | Voice type |
| description | string | Voice description |
| created_at | string | Voice creation time |
Request Example
curl --location 'https://api.sensenova.cn/v2/audio/voices?voice_type=all' \
--header 'Authorization: Bearer {sk}'
Response Example
{
"system_voices": [
{
"voice": "male_naigou_m2",
"voice_type": "system",
"description": "男-角色扮演",
"created_at": "2025-09-17T10:22:20.142328992Z"
},
{
"voice": "male_nangong",
"voice_type": "system",
"description": "男-角色扮演",
"created_at": "2025-09-17T10:22:20.137216375Z"
}
],
"cloned_voices": [
{
"voice": "spk_e7de22be_136_250917195119_mrkeor",
"voice_type": "cloned",
"description": "spk_e7de22be_136_250917195119_mrkeor",
"created_at": "2025-09-17T11:51:19.768525911Z"
}
]
}
Delete a Voice
Deletes a voice you created yourself.
Endpoint
https://api.sensenova.cn/v2/audio/voices/:voice
Method: DELETE
Request Parameters
| Parameter | Type | Required | Value | Description |
|---|---|---|---|---|
| voice | string | Yes | Voice ID | Path parameter |
Response Parameters
| Parameter | Type | Description |
|---|---|---|
| voice | string | Voice ID |
| status | string | Voice status |
| created_at | string | Voice creation time |
Request Example
curl --location --request DELETE 'https://api.sensenova.cn/v2/audio/voices/spk_e7de22be_136_250917195119_mrkeor' \
--header 'Authorization: Bearer {sk}'
Response Example
{
"voice": "spk_e7de22be_136_250917195119_mrkeor",
"created_at": "2025-09-17T11:51:19.768525911Z",
"status": "deleted"
}
Speech Synthesis
Generates audio data from the user-supplied text, voice, and other settings.
Request URL
wss://api.sensenova.cn/v2/audio/speech
Request Parameters
| Parameter | Type | Required | Value | Description |
|---|---|---|---|---|
| model | string | Yes | SenseNova-Audio-TTS-0901 | Currently the only supported model |
| input | string | Yes | Non-empty | Text to synthesize |
| trunk_seq | int32 | No | >= 0 | Sequence number of the request. May be omitted when stream is false; required when stream is true, starting from 0 |
| last_trunk | bool | Yes | true/false | Whether this is the last packet |
| voice | string | Yes | See the voice table | Voice |
| language | string | No | ZH_CN/EN_US/ZH_CN_HK | Language |
| style | string | No | See the voice table | Style |
| speed | float | No | [0.5, 2.0] | Speaking rate |
| volume | float | No | [-12, 12] | Volume |
| pitch | float | No | [-24, 24] | Pitch |
| sample_rate | int | No | 8000/16000/24000/32000/48000 | Sample rate |
| response_format | string | No | PCM/WAV/MP3 | Audio format of the generated speech |
| with_subtitles | bool | No | true/false | Whether to return subtitle information |
| stream | bool | No | true/false | Whether to stream the response |
voice
Style values are passed verbatim (as the Chinese strings below) in the style parameter.
| Voice | Gender | style | Typical use |
|---|---|---|---|
| male_nanxingyoushengshu1_p2 | Male | 愤怒,厌恶,恐惧,高兴,正常,悲伤,惊讶 | News broadcast |
| female_nvxingyoushengshu2_p2 | Female | 愤怒,厌恶,恐惧,高兴,正常,悲伤,惊讶 | News broadcast |
| male_miantian | Male | 正常 | General assistant |
| female_nvxingyoushengshu1_p2 | Female | 愤怒,厌恶,恐惧,高兴,正常,悲伤,惊讶 | General assistant |
| male_jingyingqingnianyinse_p2 | Male | 愤怒,厌恶,恐惧,高兴,正常,悲伤,惊讶 | General assistant |
| male_JieShuoXiaoMing_p2 | Male | 正常 | Narration |
| male_yunxi_p2 | Male | 智能助手,对话,尴尬,新闻,愤怒,高兴,厌恶,恐惧,诗歌,正常,悲伤,阴阳怪气,悄悄话,惊讶,期待,赞美,傲娇,鼓励 | Narration |
| female_xiaoxiao_p2 | Female | 智能助手,冷静,对话,闲聊,客服,惊讶,礼貌,新闻,疑惑,赞美,愤怒,高兴,鼓励,厌恶,期待,正常,诗歌,悲伤,傲娇,悄悄话,深情,抱歉,温柔,聊天,平静,恐惧,抒情 | Audiobook |
| male_nanxingyoushengshu2_p2 | Male | 愤怒,厌恶,恐惧,高兴,正常,悲伤,惊讶 | Audiobook |
| female_chunzhen_m2 | Female | 正常 | Audiobook |
| male_kaishujianggushi_p2 | Male | 正常 | Audiobook |
| female_jiaomei_m2 | Female | 正常 | Audiobook |
| child_congmingnantong_p2 | Male | 愤怒,厌恶,恐惧,高兴,正常,悲伤,惊讶 | Role play |
| child_katongzhuxiaoqi_p2 | Female | 愤怒,厌恶,恐惧,高兴,正常,悲伤,惊讶 | Role play |
| female_diantai | Female | 正常 | Role play |
| female_daihuo_p2 | Female | 正常 | Role play |
| female_tianmeinvxingyinse_m2 | Female | 愤怒,厌恶,恐惧,高兴,正常,悲伤,惊讶 | Role play |
| male_dashu | Male | 正常 | Role play |
| male_nangong | Male | 正常 | Role play |
| male_naigou_m2 | Male | 正常 | Role play |
Response Parameters
| Parameter | Value | Description |
|---|---|---|
| status | SUCCESS/FAILED | |
| error_detail | | Present only when status is FAILED |
| session_id | | Session ID |
| voice | | Voice |
| time_cost_ms | | Time taken, in milliseconds |
| response_format | | Audio encoding format |
| chunk_seq | | Sequence number |
| audio_data | | Audio data (base64-encoded) |
| last_chunk | | Whether this is the last packet |
| subtitles | | Subtitle information |
| usage_characters | | Billed character count for this synthesis; in streaming responses it is returned in the last packet |
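In streaming mode the audio arrives across several messages; the chunks can be reassembled in chunk_seq order until last_chunk is seen. A minimal sketch over hypothetical decoded responses (field names per the table above):

```python
import base64

# Hypothetical decoded streaming responses, deliberately out of order
responses = [
    {"chunk_seq": 1, "audio_data": base64.b64encode(b"\x02\x03").decode(), "last_chunk": True},
    {"chunk_seq": 0, "audio_data": base64.b64encode(b"\x00\x01").decode(), "last_chunk": False},
]

audio = b""
for resp in sorted(responses, key=lambda r: r["chunk_seq"]):
    if resp["audio_data"]:
        audio += base64.b64decode(resp["audio_data"])
    if resp["last_chunk"]:
        break
# audio now holds the concatenated bytes in the requested response_format
```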
Request Example
payload = {
"model": "SenseNova-Audio-TTS-0901",
"input": input,
"trunk_seq": 0,
"last_trunk": True,
"text_type": text_type,
"voice": voice,
"language": language,
"style": style,
"speed": speed,
"volume": volume,
"pitch": pitch,
"stream": False,
"sample_rate": sample_rate,
"response_format": response_format,
"with_subtitles": with_subtitles
}
Python Example
import argparse
import asyncio
import base64
import json
import struct
import numpy as np
import soundfile as sf
from typing import AsyncIterable
from websockets.exceptions import ConnectionClosedError, ConnectionClosedOK
from websockets.legacy.client import connect, WebSocketClientProtocol
WS_BASE_URL = "wss://api.sensenova.cn/v2/audio"
API_KEY = "your api key"
HEADERS = {
"Authorization": f"Bearer {API_KEY}"
}
# WebSocket configuration
WS_CONFIG = {
    "max_size": 100 * 1024 * 1024,  # 100 MB max frame size
    "max_queue": 32,  # max number of queued messages
    "read_limit": 1024,  # read buffer limit, in bytes
    "write_limit": 1024,  # write buffer limit, in bytes
}
async def send_messages(ws: WebSocketClientProtocol, payloads: AsyncIterable[bytes]):
async for data in payloads:
await ws.send(data)
print(f"> Sent binary message: {len(data)} bytes")
async def receive_messages(ws: WebSocketClientProtocol, queue: asyncio.Queue):
try:
async for message in ws:
if isinstance(message, bytes):
print(
f"< Received binary message: {len(message)} bytes and request id: {ws.response_headers.get('X-Request-Id')}")
parsed = parse_data_in_protocol(message)
json_data = json.loads(parsed.decode("utf-8"))
await queue.put(json_data)
else:
raise ValueError("Unsupported message type")
except ConnectionClosedOK:
print("Receiver closed cleanly.")
except ConnectionClosedError as e:
print(f"Receiver error: {e}")
finally:
await queue.put(None)
async def tts_payloads_generator(hyper_param: dict = {}):
    # Use .get() so missing keys fall back to defaults instead of raising KeyError
    streaming = hyper_param.get('stream') or False
    voice = hyper_param.get('voice') or 'M20'
    language = hyper_param.get('language') or "ZH_CN"
    text_type = hyper_param.get('text_type') or "PLAIN"
    response_format = hyper_param.get('response_format') or "mp3"
    with_subtitles = hyper_param.get('with_subtitles') or False
    style = hyper_param.get('style') or "正常"
    speed = hyper_param.get('speed') or 1.0
    volume = hyper_param.get('volume') or 1.0
    pitch = hyper_param.get('pitch') or 1.0
    sample_rate = hyper_param.get('sample_rate') or 16000
    input = hyper_param.get('input') or "欲买桂花同载酒,终不似,少年游。"
if not streaming:
payload = {
"model": "SenseNova-Audio-TTS-0901",
"input": input,
"trunk_seq": 0,
"last_trunk": True,
"text_type": text_type,
"voice": voice,
"language": language,
"style": style,
"speed": speed,
"volume": volume,
"pitch": pitch,
"stream": False,
"sample_rate": sample_rate,
"response_format": response_format,
"with_subtitles": with_subtitles
}
raw = json.dumps(payload).encode("utf-8")
packed = pack_data_in_protocol(raw)
yield packed
return
inputs = list(input)
for i, text in enumerate(inputs):
payload = {
"model": "SenseNova-Audio-TTS-0901",
"input": text,
"trunk_seq": i,
"last_trunk": (i == len(inputs) - 1),
"text_type": text_type,
"voice": voice,
"language": language,
"style": style,
"speed": speed,
"volume": volume,
"pitch": pitch,
"stream": streaming,
"sample_rate": sample_rate,
"response_format": response_format,
"with_subtitles": with_subtitles
}
raw = json.dumps(payload).encode("utf-8")
packed = pack_data_in_protocol(raw)
yield packed
await asyncio.sleep(0.1)
# protocol header:
# uint8 version
# uint8 serialization
# uint32 length
# protocol body:
# bytes data using serialization
def pack_data_in_protocol(payload: bytes) -> bytes:
version = 0x01
serialization = 0x01 # JSON
length = len(payload)
header = struct.pack(">BBI", version, serialization, length)
return header + payload
def parse_data_in_protocol(payload: bytes) -> bytes:
if len(payload) < 6:
raise ValueError(f"invalid data length: {len(payload)}")
version, serialization, length = struct.unpack(">BBI", payload[:6])
if version != 0x01:
raise ValueError("invalid version")
if serialization != 0x01:
raise ValueError("invalid serialization")
expected_len = 6 + length
if len(payload) < expected_len:
raise ValueError("data length mismatch")
return payload[6:expected_len]
def write_audio_to_file(audio_data: bytes, sample_rate: int, dest_file: str, response_format: str):
file_name = dest_file + f".{response_format.lower()}"
audio_np = np.frombuffer(audio_data, dtype=np.int16)
if response_format.lower() == "mp3":
import io
from pydub import AudioSegment
# Convert numpy array to AudioSegment
audio_segment = AudioSegment(
audio_np.tobytes(),
frame_rate=sample_rate,
sample_width=2, # 16-bit
channels=1
)
audio_segment.export(file_name, format="mp3")
elif response_format.lower() == "wav":
sf.write(file_name, audio_np, samplerate=sample_rate, subtype="PCM_16")
elif response_format.lower() == "pcm":
# For PCM, write raw bytes directly
with open(file_name, "wb") as f:
f.write(audio_data)
else:
raise ValueError(f"Unsupported response format: {response_format}")
print(f"write audio to file: {file_name} successfully...")
async def voice_clone():
async def payloads_generator():
with open("sample.wav", "rb") as f:
audio_data = f.read()
text = "欲买桂花同载酒,终不似,少年游。"
payload = {
"model": "SenseNova-Audio-Clone-0901",
"text": text,
"audio_data": base64.b64encode(audio_data).decode("utf-8"),
"audio_format": "wav",
"disable_noise_reduction": True
}
raw = json.dumps(payload).encode("utf-8")
packed = pack_data_in_protocol(raw)
yield packed
return
async with connect(f"{WS_BASE_URL}/speech_clone", extra_headers=HEADERS, **WS_CONFIG) as ws:
queue = asyncio.Queue()
await send_messages(ws, payloads_generator())
await receive_messages(ws, queue)
while True:
item = await queue.get()
if item is None:
break
print(item)
async def tts(hyper_param: dict = {}):
queue = asyncio.Queue()
response_format = hyper_param.get("response_format") or "wav"
async with connect(f"{WS_BASE_URL}/speech", extra_headers=HEADERS, **WS_CONFIG) as websocket:
send_task = asyncio.create_task(send_messages(
websocket, tts_payloads_generator(hyper_param)))
recv_task = asyncio.create_task(receive_messages(websocket, queue))
responses = []
while True:
item = await queue.get()
if item is None:
break
responses.append(item)
await asyncio.gather(send_task, recv_task)
if responses:
stream = hyper_param.get("stream")
if not stream:
audio_data = responses[0]["audio_data"]
if audio_data:
decoded_audio_bytes = base64.b64decode(audio_data)
write_audio_to_file(decoded_audio_bytes,
16000, "output", response_format)
else:
stream_bytes = bytes()
for response in responses:
audio_data = response["audio_data"]
if audio_data:
decoded_audio_bytes = base64.b64decode(audio_data)
stream_bytes += decoded_audio_bytes
write_audio_to_file(stream_bytes, 16000,
"output_stream", response_format)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
    parser.add_argument(
        '--service', '-s', help='service name: voice_clone or tts', type=str, default="tts")
    parser.add_argument(
        '--voice', '-v', help='voice', type=str, default="M20")
    parser.add_argument(
        '--input', '-i', help='input text', type=str, default="欲买桂花同载酒,终不似,少年游。")
    parser.add_argument(
        '--style', help='style', type=str, default="正常")
    parser.add_argument(
        '--speed', help='speaking rate', type=float, default=1.0)
    parser.add_argument(
        '--volume', help='volume', type=float, default=1.0)
    parser.add_argument(
        '--pitch', help='pitch', type=float, default=1.0)
    parser.add_argument(
        '--sample_rate', help='sample rate', type=int, default=16000)
    parser.add_argument(
        '--language', '-l', help='language', type=str, choices=['ZH_CN', 'ZH_CN_SICHUAN', 'ZH_CN_HK'], default="ZH_CN")
    parser.add_argument(
        '--text_type', '-t', help='text type', type=str, choices=['PLAIN', 'SSML'], default="PLAIN")
    parser.add_argument(
        '--response_format', '-r', help='response format', type=str, choices=['mp3', 'wav', 'pcm'], default="wav")
    parser.add_argument('--with_subtitles', action='store_true', default=False,
                        help='whether to return subtitles')
    parser.add_argument('--stream', action='store_true', default=False,
                        help='whether to stream the audio')
try:
args = parser.parse_args()
print(args)
if args.service == "tts":
param = dict(
voice=args.voice,
input=args.input,
stream=args.stream,
language=args.language,
with_subtitles=args.with_subtitles,
style=args.style,
speed=args.speed,
volume=args.volume,
pitch=args.pitch,
sample_rate=args.sample_rate,
text_type=args.text_type,
response_format=args.response_format
)
asyncio.run(tts(param))
elif args.service == "voice_clone":
asyncio.run(voice_clone())
except KeyboardInterrupt:
print("\nDisconnected.")
except Exception as e:
print(e)