Fast Voice Cloning | SenseCore Help Center

Fast Voice Cloning

Description

Generates audio in a specified voice from the given text and a sample audio clip.


Request URL

[POST] wss://api.sensenova.cn/v2/audio

Request Header

In the request header, add an Authorization field carrying your API Key, as shown below:

HEADERS = {
    "Authorization": "Bearer {API_KEY}"  # obtain API_KEY under Service Management in the SenseCore ModelStudio platform
}

Client Request Format

Each binary WebSocket message is framed as follows:

| Byte offset | Field | Meaning | Format | Notes |
| --- | --- | --- | --- | --- |
| 0 | version | protocol version | uint8 | integer value 0x01; currently the only supported value, others are undefined |
| 1 | serialization | serialization method | uint8 | 0x01: JSON; 0x02: protobuf (not implemented in the current version) |
| 2-5 | data length | request length | uint32, big endian | length of the payload that follows |
| 6-N | payload data | request body | byte[] | parsed according to the serialization method |
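The framing above can be sketched with Python's struct module. A minimal sketch; the helper names pack_frame and parse_frame are illustrative, not part of the API:

```python
import json
import struct

def pack_frame(payload: bytes) -> bytes:
    # Header: version 0x01, serialization 0x01 (JSON), big-endian uint32 payload length.
    return struct.pack(">BBI", 0x01, 0x01, len(payload)) + payload

def parse_frame(frame: bytes) -> bytes:
    # Unpack the 6-byte header, validate it, then slice out the payload.
    version, serialization, length = struct.unpack(">BBI", frame[:6])
    if version != 0x01 or serialization != 0x01:
        raise ValueError("unsupported version or serialization")
    return frame[6:6 + length]

raw = json.dumps({"model": "SenseNova-Audio-TTS-0901"}).encode("utf-8")
frame = pack_frame(raw)
assert frame[0] == 0x01 and frame[1] == 0x01
assert parse_frame(frame) == raw
```

The same framing is used for both requests and responses, so one pair of helpers covers both directions.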

Clone a Voice

Clones a voice: given the user's audio data and its transcript, the service generates a corresponding voice ID.

Endpoint

wss://api.sensenova.cn/v2/audio/speech_clone

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | | SenseNova-Audio-Clone-0901; currently the only supported model |
| audio_data | bytes | | audio data; size within (0, 2] MB |
| text | string | | transcript of the audio; must be non-empty, length within (0, 200] characters |
| audio_format | string | | audio format: PCM / WAV* / MP3 |
| disable_noise_reduction | bool | | defaults to false |

Response Parameters

| Parameter | Description |
| --- | --- |
| status | SUCCESS / FAILED |
| error_detail | present only when status is FAILED |
| session_id | session ID |
| voice | the voice ID produced by cloning; save it yourself once cloning completes, as there is no API for querying cloned voice IDs |

Request Example

payload = {
    "model": "SenseNova-Audio-Clone-0901",
    "text": text,
    "audio_data": base64.b64encode(audio_data).decode("utf-8"),
    "audio_format": "wav",
    "disable_noise_reduction": True
}
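Since audio_data is limited to (0, 2] MB and text to 200 characters, it is worth validating both before opening the WebSocket. A minimal sketch; build_clone_payload is an illustrative helper, not part of any SDK:

```python
import base64

MAX_AUDIO_BYTES = 2 * 1024 * 1024  # (0, 2] MB limit from the parameter table

def build_clone_payload(audio_data: bytes, text: str, audio_format: str = "wav") -> dict:
    # Enforce the documented constraints client-side before sending the request.
    if not 0 < len(audio_data) <= MAX_AUDIO_BYTES:
        raise ValueError("audio_data must be within (0, 2] MB")
    if not 0 < len(text) <= 200:
        raise ValueError("text must be 1 to 200 characters")
    return {
        "model": "SenseNova-Audio-Clone-0901",
        "text": text,
        "audio_data": base64.b64encode(audio_data).decode("utf-8"),
        "audio_format": audio_format,
        "disable_noise_reduction": False,
    }

payload = build_clone_payload(b"\x00" * 1024, "欲买桂花同载酒,终不似,少年游。")
```

Failing fast on oversized audio avoids paying for the upload only to get a FAILED status back.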

Query Voices

Lists the system default voices and the voices you have cloned.

Endpoint

https://api.sensenova.cn/v2/audio/voices
Method: GET

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| voice_type | string | | system / cloned* / all; query param |
| voice | string | | voice ID; query param |

Response Parameters

| Parameter | Description |
| --- | --- |
| system_voices | list of Voice objects |
| cloned_voices | list of Voice objects |

Voice object:

| Field | Type | Description |
| --- | --- | --- |
| voice | string | voice ID |
| voice_type | string | voice type |
| description | string | voice description |
| created_at | string | voice creation time |

Request Example

curl --location 'https://api.sensenova.cn/v2/audio/voices?voice_type=all' \
--header 'Authorization: Bearer {sk}'
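The same call with Python's standard library; voices_url and list_voices are illustrative helper names:

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.sensenova.cn/v2/audio"
API_KEY = "your api key"  # obtained from ModelStudio service management

def voices_url(voice_type: str = "all", voice: str = "") -> str:
    # Both filters are query parameters on GET /v2/audio/voices.
    params = {"voice_type": voice_type}
    if voice:
        params["voice"] = voice
    return f"{API_BASE}/voices?{urlencode(params)}"

def list_voices(voice_type: str = "all") -> dict:
    req = urllib.request.Request(
        voices_url(voice_type),
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

# list_voices("all") returns a dict with "system_voices" and "cloned_voices" lists.
```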

Response Example


{
"system_voices": [
{
"voice": "male_naigou_m2",
"voice_type": "system",
"description": "男-角色扮演",
"created_at": "2025-09-17T10:22:20.142328992Z"
},
{
"voice": "male_nangong",
"voice_type": "system",
"description": "男-角色扮演",
"created_at": "2025-09-17T10:22:20.137216375Z"
}
],
"cloned_voices": [
{
"voice": "spk_e7de22be_136_250917195119_mrkeor",
"voice_type": "cloned",
"description": "spk_e7de22be_136_250917195119_mrkeor",
"created_at": "2025-09-17T11:51:19.768525911Z"
}
]
}

Delete a Voice

Deletes a voice that you created yourself.

Endpoint

https://api.sensenova.cn/v2/audio/voices/:voice
Method: DELETE

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| voice | string | | voice ID; passed in the URL path as :voice |

Response Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| voice | string | voice ID |
| status | string | voice status |
| created_at | string | voice creation time |

Request Example

curl --location --request DELETE 'https://api.sensenova.cn/v2/audio/voices/spk_e7de22be_136_250917195119_mrkeor' \
--header 'Authorization: Bearer {sk}'

Response Example

{
"voice": "spk_e7de22be_136_250917195119_mrkeor",
"created_at": "2025-09-17T11:51:19.768525911Z",
"status": "deleted"
}
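The equivalent request built with Python's standard library; delete_voice_request is an illustrative helper name:

```python
import urllib.request

API_BASE = "https://api.sensenova.cn/v2/audio"
API_KEY = "your api key"

def delete_voice_request(voice: str) -> urllib.request.Request:
    # DELETE /v2/audio/voices/:voice; the voice ID is part of the URL path.
    return urllib.request.Request(
        f"{API_BASE}/voices/{voice}",
        method="DELETE",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

req = delete_voice_request("spk_e7de22be_136_250917195119_mrkeor")
# urllib.request.urlopen(req, timeout=10) would perform the deletion.
```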

Speech Generation

Generates the corresponding audio data from the user's input text, voice, and other settings.

Request URL

wss://api.sensenova.cn/v2/audio/speech

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | | SenseNova-Audio-TTS-0901; currently the only supported model |
| input | string | | user input text; must be non-empty |
| trunk_seq | int32 | | request sequence number, >= 0; may be omitted when stream is false, but is required when stream is true, starting from 0 |
| last_trunk | bool | | true/false; whether this is the last packet |
| voice | string | | voice ID; see the voice table |
| language | string | | ZH_CN / EN_US / ZH_CN_HK |
| style | string | | style; see the voice table |
| speed | float | | [0.5, 2.0]; speaking rate |
| volume | float | | [-12, 12]; volume |
| pitch | float | | [-24, 24]; pitch |
| sample_rate | int | | 8000 / 16000 / 24000 / 32000 / 48000; sample rate |
| response_format | string | | PCM / WAV / MP3; output audio format |
| with_subtitles | bool | | true/false; whether to return subtitle information |
| stream | bool | | true/false; whether to stream the response |

voice

Style values are the literal strings accepted by the style parameter (e.g. 正常 = neutral, 愤怒 = angry) and are passed verbatim. Gender follows the voice ID prefix (male_ / female_ / child_).

| Voice | Gender | Styles | Use case |
| --- | --- | --- | --- |
| male_nanxingyoushengshu1_p2 | male | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | News broadcasting |
| female_nvxingyoushengshu2_p2 | female | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | News broadcasting |
| male_miantian | male | 正常 | General assistant |
| female_nvxingyoushengshu1_p2 | female | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | General assistant |
| male_jingyingqingnianyinse_p2 | male | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | General assistant |
| male_JieShuoXiaoMing_p2 | male | 正常 | Narration |
| male_yunxi_p2 | male | 智能助手, 对话, 尴尬, 新闻, 愤怒, 高兴, 厌恶, 恐惧, 诗歌, 正常, 悲伤, 阴阳怪气, 悄悄话, 惊讶, 期待, 赞美, 傲娇, 鼓励 | Narration |
| female_xiaoxiao_p2 | female | 智能助手, 冷静, 对话, 闲聊, 客服, 惊讶, 礼貌, 新闻, 疑惑, 赞美, 愤怒, 高兴, 鼓励, 厌恶, 期待, 正常, 诗歌, 悲伤, 傲娇, 悄悄话, 深情, 抱歉, 温柔, 聊天, 平静, 恐惧, 抒情 | Audiobook |
| male_nanxingyoushengshu2_p2 | male | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | Audiobook |
| female_chunzhen_m2 | female | 正常 | Audiobook |
| male_kaishujianggushi_p2 | male | 正常 | Audiobook |
| female_jiaomei_m2 | female | 正常 | Audiobook |
| child_congmingnantong_p2 | child | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | Role play |
| child_katongzhuxiaoqi_p2 | child | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | Role play |
| female_diantai | female | 正常 | Role play |
| female_daihuo_p2 | female | 正常 | Role play |
| female_tianmeinvxingyinse_m2 | female | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | Role play |
| male_dashu | male | 正常 | Role play |
| male_nangong | male | 正常 | Role play |
| male_naigou_m2 | male | 正常 | Role play |

Response Parameters

| Parameter | Description |
| --- | --- |
| status | SUCCESS / FAILED |
| error_detail | present only when status is FAILED |
| session_id | session ID |
| voice | voice ID |
| time_cost_ms | elapsed time, in milliseconds |
| response_format | audio encoding format |
| chunk_seq | sequence number |
| audio_data | audio data |
| last_chunk | whether this is the last packet |
| subtitles | subtitle information |
| usage_characters | number of billable characters for this generation; for streaming responses this value is returned in the last packet |
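In streaming mode each response packet carries a chunk_seq and a base64-encoded audio_data field, so the client has to reassemble the audio in sequence order. A minimal sketch; assemble_stream is an illustrative helper:

```python
import base64

def assemble_stream(responses: list) -> bytes:
    # Sort packets by chunk_seq, then concatenate the decoded audio_data fields;
    # packets without audio_data (e.g. a final status-only packet) are skipped.
    chunks = sorted(
        (r for r in responses if r.get("audio_data")),
        key=lambda r: r["chunk_seq"],
    )
    return b"".join(base64.b64decode(r["audio_data"]) for r in chunks)

packets = [
    {"chunk_seq": 1, "audio_data": base64.b64encode(b"world").decode(), "last_chunk": True},
    {"chunk_seq": 0, "audio_data": base64.b64encode(b"hello ").decode(), "last_chunk": False},
]
assert assemble_stream(packets) == b"hello world"
```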

Request Example

payload = {
    "model": "SenseNova-Audio-TTS-0901",
    "input": input,
    "trunk_seq": 0,
    "last_trunk": True,
    "text_type": text_type,
    "voice": voice,
    "language": language,
    "style": style,
    "speed": speed,
    "volume": volume,
    "pitch": pitch,
    "stream": False,
    "sample_rate": sample_rate,
    "response_format": response_format,
    "with_subtitles": with_subtitles
}


Python Example

import argparse
import asyncio
import base64
import json
import struct
import numpy as np
import soundfile as sf
from typing import AsyncIterable
from websockets.exceptions import ConnectionClosedError, ConnectionClosedOK
from websockets.legacy.client import connect, WebSocketClientProtocol


WS_BASE_URL = "wss://api.sensenova.cn/v2/audio"
API_KEY = "your api key"

HEADERS = {
    "Authorization": f"Bearer {API_KEY}"
}

# WebSocket configuration
WS_CONFIG = {
    "max_size": 100 * 1024 * 1024,  # max frame size: 100 MB
    "max_queue": 32,  # max number of messages in the queue
    "read_limit": 1024,  # read buffer limit, in bytes
    "write_limit": 1024,  # write buffer limit, in bytes
}


async def send_messages(ws: WebSocketClientProtocol, payloads: AsyncIterable[bytes]):
    async for data in payloads:
        await ws.send(data)
        print(f"> Sent binary message: {len(data)} bytes")


async def receive_messages(ws: WebSocketClientProtocol, queue: asyncio.Queue):
    try:
        async for message in ws:
            if isinstance(message, bytes):
                print(
                    f"< Received binary message: {len(message)} bytes and request id: {ws.response_headers.get('X-Request-Id')}")
                parsed = parse_data_in_protocol(message)
                json_data = json.loads(parsed.decode("utf-8"))
                await queue.put(json_data)
            else:
                raise ValueError("Unsupported message type")
    except ConnectionClosedOK:
        print("Receiver closed cleanly.")
    except ConnectionClosedError as e:
        print(f"Receiver error: {e}")
    finally:
        await queue.put(None)


async def tts_payloads_generator(hyper_param: dict = {}):
    streaming = hyper_param.get('stream') or False
    voice = hyper_param.get('voice') or 'M20'
    language = hyper_param.get('language') or "ZH_CN"
    text_type = hyper_param.get('text_type') or "PLAIN"
    response_format = hyper_param.get('response_format') or "mp3"
    with_subtitles = hyper_param.get('with_subtitles') or False
    style = hyper_param.get('style') or "正常"
    speed = hyper_param.get('speed') or 1.0
    volume = hyper_param.get('volume') or 1.0
    pitch = hyper_param.get('pitch') or 1.0
    sample_rate = hyper_param.get('sample_rate') or 16000
    input = hyper_param.get('input') or "欲买桂花同载酒,终不似,少年游。"
    if not streaming:
        # Non-streaming: send the whole text in a single packet.
        payload = {
            "model": "SenseNova-Audio-TTS-0901",
            "input": input,
            "trunk_seq": 0,
            "last_trunk": True,
            "text_type": text_type,
            "voice": voice,
            "language": language,
            "style": style,
            "speed": speed,
            "volume": volume,
            "pitch": pitch,
            "stream": False,
            "sample_rate": sample_rate,
            "response_format": response_format,
            "with_subtitles": with_subtitles
        }
        raw = json.dumps(payload).encode("utf-8")
        packed = pack_data_in_protocol(raw)
        yield packed
        return

    # Streaming: send one character per packet with an increasing trunk_seq.
    inputs = list(input)

    for i, text in enumerate(inputs):
        payload = {
            "model": "SenseNova-Audio-TTS-0901",
            "input": text,
            "trunk_seq": i,
            "last_trunk": (i == len(inputs) - 1),
            "text_type": text_type,
            "voice": voice,
            "language": language,
            "style": style,
            "speed": speed,
            "volume": volume,
            "pitch": pitch,
            "stream": streaming,
            "sample_rate": sample_rate,
            "response_format": response_format,
            "with_subtitles": with_subtitles
        }
        raw = json.dumps(payload).encode("utf-8")
        packed = pack_data_in_protocol(raw)
        yield packed
        await asyncio.sleep(0.1)


# protocol header:
#   uint8  version
#   uint8  serialization
#   uint32 length (big endian)
#
# protocol body:
#   bytes data, encoded with the serialization method
def pack_data_in_protocol(payload: bytes) -> bytes:
    version = 0x01
    serialization = 0x01  # JSON
    length = len(payload)

    header = struct.pack(">BBI", version, serialization, length)
    return header + payload


def parse_data_in_protocol(payload: bytes) -> bytes:
    if len(payload) < 6:
        raise ValueError(f"invalid data length: {len(payload)}")

    version, serialization, length = struct.unpack(">BBI", payload[:6])

    if version != 0x01:
        raise ValueError("invalid version")
    if serialization != 0x01:
        raise ValueError("invalid serialization")

    expected_len = 6 + length
    if len(payload) < expected_len:
        raise ValueError("data length mismatch")
    return payload[6:expected_len]


def write_audio_to_file(audio_data: bytes, sample_rate: int, dest_file: str, response_format: str):
    file_name = dest_file + f".{response_format.lower()}"
    audio_np = np.frombuffer(audio_data, dtype=np.int16)

    if response_format.lower() == "mp3":
        from pydub import AudioSegment

        # Convert the numpy array to an AudioSegment for MP3 export
        audio_segment = AudioSegment(
            audio_np.tobytes(),
            frame_rate=sample_rate,
            sample_width=2,  # 16-bit
            channels=1
        )
        audio_segment.export(file_name, format="mp3")

    elif response_format.lower() == "wav":
        sf.write(file_name, audio_np, samplerate=sample_rate, subtype="PCM_16")

    elif response_format.lower() == "pcm":
        # For PCM, write the raw bytes directly
        with open(file_name, "wb") as f:
            f.write(audio_data)

    else:
        raise ValueError(f"Unsupported response format: {response_format}")

    print(f"write audio to file: {file_name} successfully...")


async def voice_clone():
    async def payloads_generator():
        with open("sample.wav", "rb") as f:
            audio_data = f.read()
        text = "欲买桂花同载酒,终不似,少年游。"
        payload = {
            "model": "SenseNova-Audio-Clone-0901",
            "text": text,
            "audio_data": base64.b64encode(audio_data).decode("utf-8"),
            "audio_format": "wav",
            "disable_noise_reduction": True
        }
        raw = json.dumps(payload).encode("utf-8")
        packed = pack_data_in_protocol(raw)
        yield packed
        return

    async with connect(f"{WS_BASE_URL}/speech_clone", extra_headers=HEADERS, **WS_CONFIG) as ws:
        queue = asyncio.Queue()
        await send_messages(ws, payloads_generator())
        await receive_messages(ws, queue)
        while True:
            item = await queue.get()
            if item is None:
                break
            print(item)


async def tts(hyper_param: dict = {}):
    queue = asyncio.Queue()
    response_format = hyper_param.get("response_format") or "wav"
    async with connect(f"{WS_BASE_URL}/speech", extra_headers=HEADERS, **WS_CONFIG) as websocket:
        send_task = asyncio.create_task(send_messages(
            websocket, tts_payloads_generator(hyper_param)))
        recv_task = asyncio.create_task(receive_messages(websocket, queue))

        responses = []

        while True:
            item = await queue.get()
            if item is None:
                break
            responses.append(item)
        await asyncio.gather(send_task, recv_task)

    if responses:
        stream = hyper_param.get("stream")
        if not stream:
            audio_data = responses[0]["audio_data"]
            if audio_data:
                decoded_audio_bytes = base64.b64decode(audio_data)
                write_audio_to_file(decoded_audio_bytes,
                                    16000, "output", response_format)
        else:
            stream_bytes = bytes()
            for response in responses:
                audio_data = response["audio_data"]
                if audio_data:
                    decoded_audio_bytes = base64.b64decode(audio_data)
                    stream_bytes += decoded_audio_bytes
            write_audio_to_file(stream_bytes, 16000,
                                "output_stream", response_format)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--service', '-s', help='service name: voice_clone or tts', type=str, default="tts")
    parser.add_argument(
        '--voice', '-v', help='voice ID', type=str, default="M20")
    parser.add_argument(
        '--input', '-i', help='input text', type=str, default="欲买桂花同载酒,终不似,少年游。")
    parser.add_argument(
        '--style', help='style', type=str, default="正常")
    parser.add_argument(
        '--speed', help='speaking rate', type=float, default=1.0)
    parser.add_argument(
        '--volume', help='volume', type=float, default=1.0)
    parser.add_argument(
        '--pitch', help='pitch', type=float, default=1.0)
    parser.add_argument(
        '--sample_rate', help='sample rate', type=int, default=16000)
    parser.add_argument(
        '--language', '-l', help='language', type=str, choices=['ZH_CN', 'ZH_CN_SICHUAN', 'ZH_CN_HK'], default="ZH_CN")
    parser.add_argument(
        '--text_type', '-t', help='text type', type=str, choices=['PLAIN', 'SSML'], default="PLAIN")
    parser.add_argument(
        '--response_format', '-r', help='response format', type=str, choices=['mp3', 'wav', 'pcm'], default="wav")
    parser.add_argument('--with_subtitles', action='store_true', default=False,
                        help='whether to return subtitles')
    parser.add_argument('--stream', action='store_true', default=False,
                        help='whether to stream the audio back')
    try:
        args = parser.parse_args()
        print(args)
        if args.service == "tts":
            param = dict(
                voice=args.voice,
                input=args.input,
                stream=args.stream,
                language=args.language,
                with_subtitles=args.with_subtitles,
                style=args.style,
                speed=args.speed,
                volume=args.volume,
                pitch=args.pitch,
                sample_rate=args.sample_rate,
                text_type=args.text_type,
                response_format=args.response_format
            )
            asyncio.run(tts(param))
        elif args.service == "voice_clone":
            asyncio.run(voice_clone())
    except KeyboardInterrupt:
        print("\nDisconnected.")
    except Exception as e:
        print(e)