Fast Voice Cloning | SenseCore Help Center

Fast Voice Cloning

Description

Generates audio in a specified voice from the given text and a sample audio clip.


Request URL

[POST] wss://api.sensenova.cn/v2/audio

Request Header

In the request header, add an Authorization field carrying your API Key, as shown below:

HEADERS = {
    "Authorization": "Bearer {API_KEY}"  # obtain API_KEY under Service Management in the SenseCore ModelStudio platform
}

Client Request Format

Each binary WebSocket message is framed as follows:

| Byte offset | Field | Meaning | Format | Notes |
| --- | --- | --- | --- | --- |
| 0 | version | protocol version | uint8 | integer value 0x01; currently the only supported value, others are undefined |
| 1 | serialization | serialization method | uint8 | 0x01: JSON; 0x02: protobuf (not implemented in the current version) |
| 2-5 | data length | request length | uint32, big endian | length of the payload that follows |
| 6-N | payload data | request body | byte[] | parsed according to the serialization method |
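The framing above can be sketched with Python's struct module. A minimal sketch; the helper names pack_frame and parse_frame are illustrative, not part of the API:

```python
import json
import struct

def pack_frame(payload: bytes) -> bytes:
    # Header: version 0x01, serialization 0x01 (JSON), big-endian uint32 payload length.
    return struct.pack(">BBI", 0x01, 0x01, len(payload)) + payload

def parse_frame(frame: bytes) -> bytes:
    # Unpack the 6-byte header, validate it, then slice out the payload.
    version, serialization, length = struct.unpack(">BBI", frame[:6])
    if version != 0x01 or serialization != 0x01:
        raise ValueError("unsupported version or serialization")
    return frame[6:6 + length]

raw = json.dumps({"model": "SenseNova-Audio-TTS-0901"}).encode("utf-8")
frame = pack_frame(raw)
assert frame[0] == 0x01 and frame[1] == 0x01
assert parse_frame(frame) == raw
```

The same framing is used for both requests and responses, so one pair of helpers covers both directions.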

Clone a Voice

Clones a voice: given the user's audio data and its transcript, the service generates a corresponding voice ID.

Endpoint

wss://api.sensenova.cn/v2/audio/speech_clone

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | | SenseNova-Audio-Clone-0901; currently the only supported model |
| audio_data | bytes | | audio data; size within (0, 2] MB |
| text | string | | transcript of the audio; must be non-empty, length within (0, 200] characters |
| audio_format | string | | audio format: PCM / WAV* / MP3 |
| disable_noise_reduction | bool | | defaults to false |

Response Parameters

| Parameter | Description |
| --- | --- |
| status | SUCCESS / FAILED |
| error_detail | present only when status is FAILED |
| session_id | session ID |
| voice | the voice ID produced by cloning; save it yourself once cloning completes, as there is no API for querying cloned voice IDs |

Request Example

payload = {
    "model": "SenseNova-Audio-Clone-0901",
    "text": text,
    "audio_data": base64.b64encode(audio_data).decode("utf-8"),
    "audio_format": "wav",
    "disable_noise_reduction": True
}
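Since audio_data is limited to (0, 2] MB and text to 200 characters, it is worth validating both before opening the WebSocket. A minimal sketch; build_clone_payload is an illustrative helper, not part of any SDK:

```python
import base64

MAX_AUDIO_BYTES = 2 * 1024 * 1024  # (0, 2] MB limit from the parameter table

def build_clone_payload(audio_data: bytes, text: str, audio_format: str = "wav") -> dict:
    # Enforce the documented constraints client-side before sending the request.
    if not 0 < len(audio_data) <= MAX_AUDIO_BYTES:
        raise ValueError("audio_data must be within (0, 2] MB")
    if not 0 < len(text) <= 200:
        raise ValueError("text must be 1 to 200 characters")
    return {
        "model": "SenseNova-Audio-Clone-0901",
        "text": text,
        "audio_data": base64.b64encode(audio_data).decode("utf-8"),
        "audio_format": audio_format,
        "disable_noise_reduction": False,
    }

payload = build_clone_payload(b"\x00" * 1024, "欲买桂花同载酒,终不似,少年游。")
```

Failing fast on oversized audio avoids paying for the upload only to get a FAILED status back.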

Query Voices

Lists the system default voices and the voices you have cloned.

Endpoint

https://api.sensenova.cn/v2/audio/voices
Method: GET

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| voice_type | string | | system / cloned* / all; query param |
| voice | string | | voice ID; query param |

Response Parameters

| Parameter | Description |
| --- | --- |
| system_voices | list of Voice objects |
| cloned_voices | list of Voice objects |

Voice object:

| Field | Type | Description |
| --- | --- | --- |
| voice | string | voice ID |
| voice_type | string | voice type |
| description | string | voice description |
| created_at | string | voice creation time |

Request Example

curl --location 'https://api.sensenova.cn/v2/audio/voices?voice_type=all' \
--header 'Authorization: Bearer {sk}'
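The same call with Python's standard library; voices_url and list_voices are illustrative helper names:

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.sensenova.cn/v2/audio"
API_KEY = "your api key"  # obtained from ModelStudio service management

def voices_url(voice_type: str = "all", voice: str = "") -> str:
    # Both filters are query parameters on GET /v2/audio/voices.
    params = {"voice_type": voice_type}
    if voice:
        params["voice"] = voice
    return f"{API_BASE}/voices?{urlencode(params)}"

def list_voices(voice_type: str = "all") -> dict:
    req = urllib.request.Request(
        voices_url(voice_type),
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

# list_voices("all") returns a dict with "system_voices" and "cloned_voices" lists.
```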

Response Example


{
"system_voices": [
{
"voice": "male_naigou_m2",
"voice_type": "system",
"description": "男-角色扮演",
"created_at": "2025-09-17T10:22:20.142328992Z"
},
{
"voice": "male_nangong",
"voice_type": "system",
"description": "男-角色扮演",
"created_at": "2025-09-17T10:22:20.137216375Z"
}
],
"cloned_voices": [
{
"voice": "spk_e7de22be_136_250917195119_mrkeor",
"voice_type": "cloned",
"description": "spk_e7de22be_136_250917195119_mrkeor",
"created_at": "2025-09-17T11:51:19.768525911Z"
}
]
}

Delete a Voice

Deletes a voice that you created yourself.

Endpoint

https://api.sensenova.cn/v2/audio/voices/:voice
Method: DELETE

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| voice | string | | voice ID; passed in the URL path as :voice |

Response Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| voice | string | voice ID |
| status | string | voice status |
| created_at | string | voice creation time |

Request Example

curl --location --request DELETE 'https://api.sensenova.cn/v2/audio/voices/spk_e7de22be_136_250917195119_mrkeor' \
--header 'Authorization: Bearer {sk}'

Response Example

{
"voice": "spk_e7de22be_136_250917195119_mrkeor",
"created_at": "2025-09-17T11:51:19.768525911Z",
"status": "deleted"
}
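The equivalent request built with Python's standard library; delete_voice_request is an illustrative helper name:

```python
import urllib.request

API_BASE = "https://api.sensenova.cn/v2/audio"
API_KEY = "your api key"

def delete_voice_request(voice: str) -> urllib.request.Request:
    # DELETE /v2/audio/voices/:voice; the voice ID is part of the URL path.
    return urllib.request.Request(
        f"{API_BASE}/voices/{voice}",
        method="DELETE",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

req = delete_voice_request("spk_e7de22be_136_250917195119_mrkeor")
# urllib.request.urlopen(req, timeout=10) would perform the deletion.
```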

Speech Generation

Generates the corresponding audio data from the user's input text, voice, and other settings.

Request URL

wss://api.sensenova.cn/v2/audio/speech

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | | SenseNova-Audio-TTS-0901; currently the only supported model |
| input | string | | user input text; must be non-empty |
| trunk_seq | int32 | | request sequence number, >= 0; may be omitted when stream is false, but is required when stream is true, starting from 0 |
| last_trunk | bool | | true/false; whether this is the last packet |
| voice | string | | voice ID; see the voice table |
| language | string | | ZH_CN / EN_US / ZH_CN_HK |
| style | string | | style; see the voice table |
| speed | float | | [0.5, 2.0]; speaking rate |
| volume | float | | [-12, 12]; volume |
| pitch | float | | [-24, 24]; pitch |
| sample_rate | int | | 8000 / 16000 / 24000 / 32000 / 48000; sample rate |
| response_format | string | | PCM / WAV / MP3; output audio format |
| with_subtitles | bool | | true/false; whether to return subtitle information |
| stream | bool | | true/false; whether to stream the response |

voice

Style values are the literal strings accepted by the style parameter (e.g. 正常 = neutral, 愤怒 = angry) and are passed verbatim. Gender follows the voice ID prefix (male_ / female_ / child_).

| Voice | Gender | Styles | Use case |
| --- | --- | --- | --- |
| male_nanxingyoushengshu1_p2 | male | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | News broadcasting |
| female_nvxingyoushengshu2_p2 | female | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | News broadcasting |
| male_miantian | male | 正常 | General assistant |
| female_nvxingyoushengshu1_p2 | female | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | General assistant |
| male_jingyingqingnianyinse_p2 | male | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | General assistant |
| male_JieShuoXiaoMing_p2 | male | 正常 | Narration |
| male_yunxi_p2 | male | 智能助手, 对话, 尴尬, 新闻, 愤怒, 高兴, 厌恶, 恐惧, 诗歌, 正常, 悲伤, 阴阳怪气, 悄悄话, 惊讶, 期待, 赞美, 傲娇, 鼓励 | Narration |
| female_xiaoxiao_p2 | female | 智能助手, 冷静, 对话, 闲聊, 客服, 惊讶, 礼貌, 新闻, 疑惑, 赞美, 愤怒, 高兴, 鼓励, 厌恶, 期待, 正常, 诗歌, 悲伤, 傲娇, 悄悄话, 深情, 抱歉, 温柔, 聊天, 平静, 恐惧, 抒情 | Audiobook |
| male_nanxingyoushengshu2_p2 | male | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | Audiobook |
| female_chunzhen_m2 | female | 正常 | Audiobook |
| male_kaishujianggushi_p2 | male | 正常 | Audiobook |
| female_jiaomei_m2 | female | 正常 | Audiobook |
| child_congmingnantong_p2 | child | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | Role play |
| child_katongzhuxiaoqi_p2 | child | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | Role play |
| female_diantai | female | 正常 | Role play |
| female_daihuo_p2 | female | 正常 | Role play |
| female_tianmeinvxingyinse_m2 | female | 愤怒, 厌恶, 恐惧, 高兴, 正常, 悲伤, 惊讶 | Role play |
| male_dashu | male | 正常 | Role play |
| male_nangong | male | 正常 | Role play |
| male_naigou_m2 | male | 正常 | Role play |

Response Parameters

| Parameter | Description |
| --- | --- |
| status | SUCCESS / FAILED |
| error_detail | present only when status is FAILED |
| session_id | session ID |
| voice | voice ID |
| time_cost_ms | elapsed time, in milliseconds |
| response_format | audio encoding format |
| chunk_seq | sequence number |
| audio_data | audio data |
| last_chunk | whether this is the last packet |
| subtitles | subtitle information |
| usage_characters | number of billable characters for this generation; for streaming responses this value is returned in the last packet |
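In streaming mode each response packet carries a chunk_seq and a base64-encoded audio_data field, so the client has to reassemble the audio in sequence order. A minimal sketch; assemble_stream is an illustrative helper:

```python
import base64

def assemble_stream(responses: list) -> bytes:
    # Sort packets by chunk_seq, then concatenate the decoded audio_data fields;
    # packets without audio_data (e.g. a final status-only packet) are skipped.
    chunks = sorted(
        (r for r in responses if r.get("audio_data")),
        key=lambda r: r["chunk_seq"],
    )
    return b"".join(base64.b64decode(r["audio_data"]) for r in chunks)

packets = [
    {"chunk_seq": 1, "audio_data": base64.b64encode(b"world").decode(), "last_chunk": True},
    {"chunk_seq": 0, "audio_data": base64.b64encode(b"hello ").decode(), "last_chunk": False},
]
assert assemble_stream(packets) == b"hello world"
```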

Request Example

payload = {
    "model": "SenseNova-Audio-TTS-0901",
    "input": input,
    "trunk_seq": 0,
    "last_trunk": True,
    "text_type": text_type,
    "voice": voice,
    "language": language,
    "style": style,
    "speed": speed,
    "volume": volume,
    "pitch": pitch,
    "stream": False,
    "sample_rate": sample_rate,
    "response_format": response_format,
    "with_subtitles": with_subtitles
}


Python Example

import argparse
import asyncio
import base64
import json
import struct
import numpy as np
import soundfile as sf
from typing import AsyncIterable
from websockets.exceptions import ConnectionClosedError, ConnectionClosedOK
from websockets.legacy.client import connect, WebSocketClientProtocol


WS_BASE_URL = "wss://api.sensenova.cn/v2/audio"
API_KEY = "your api key"

HEADERS = {
    "Authorization": f"Bearer {API_KEY}"
}

# WebSocket configuration
WS_CONFIG = {
    "max_size": 100 * 1024 * 1024,  # max frame size: 100 MB
    "max_queue": 32,  # max number of messages in the queue
    "read_limit": 1024,  # read buffer limit, in bytes
    "write_limit": 1024,  # write buffer limit, in bytes
}


async def send_messages(ws: WebSocketClientProtocol, payloads: AsyncIterable[bytes]):
    async for data in payloads:
        await ws.send(data)
        print(f"> Sent binary message: {len(data)} bytes")


async def receive_messages(ws: WebSocketClientProtocol, queue: asyncio.Queue):
    try:
        async for message in ws:
            if isinstance(message, bytes):
                print(
                    f"< Received binary message: {len(message)} bytes and request id: {ws.response_headers.get('X-Request-Id')}")
                parsed = parse_data_in_protocol(message)
                json_data = json.loads(parsed.decode("utf-8"))
                await queue.put(json_data)
            else:
                raise ValueError("Unsupported message type")
    except ConnectionClosedOK:
        print("Receiver closed cleanly.")
    except ConnectionClosedError as e:
        print(f"Receiver error: {e}")
    finally:
        await queue.put(None)


async def tts_payloads_generator(hyper_param: dict = {}):
    streaming = hyper_param.get('stream') or False
    voice = hyper_param.get('voice') or 'M20'
    language = hyper_param.get('language') or "ZH_CN"
    text_type = hyper_param.get('text_type') or "PLAIN"
    response_format = hyper_param.get('response_format') or "mp3"
    with_subtitles = hyper_param.get('with_subtitles') or False
    style = hyper_param.get('style') or "正常"
    speed = hyper_param.get('speed') or 1.0
    volume = hyper_param.get('volume') or 1.0
    pitch = hyper_param.get('pitch') or 1.0
    sample_rate = hyper_param.get('sample_rate') or 16000
    input = hyper_param.get('input') or "欲买桂花同载酒,终不似,少年游。"
    if not streaming:
        # Non-streaming: send the whole text in a single packet.
        payload = {
            "model": "SenseNova-Audio-TTS-0901",
            "input": input,
            "trunk_seq": 0,
            "last_trunk": True,
            "text_type": text_type,
            "voice": voice,
            "language": language,
            "style": style,
            "speed": speed,
            "volume": volume,
            "pitch": pitch,
            "stream": False,
            "sample_rate": sample_rate,
            "response_format": response_format,
            "with_subtitles": with_subtitles
        }
        raw = json.dumps(payload).encode("utf-8")
        packed = pack_data_in_protocol(raw)
        yield packed
        return

    # Streaming: send one character per packet with an increasing trunk_seq.
    inputs = list(input)

    for i, text in enumerate(inputs):
        payload = {
            "model": "SenseNova-Audio-TTS-0901",
            "input": text,
            "trunk_seq": i,
            "last_trunk": (i == len(inputs) - 1),
            "text_type": text_type,
            "voice": voice,
            "language": language,
            "style": style,
            "speed": speed,
            "volume": volume,
            "pitch": pitch,
            "stream": streaming,
            "sample_rate": sample_rate,
            "response_format": response_format,
            "with_subtitles": with_subtitles
        }
        raw = json.dumps(payload).encode("utf-8")
        packed = pack_data_in_protocol(raw)
        yield packed
        await asyncio.sleep(0.1)


# protocol header:
#   uint8  version
#   uint8  serialization
#   uint32 length (big endian)
#
# protocol body:
#   bytes data, encoded with the serialization method
def pack_data_in_protocol(payload: bytes) -> bytes:
    version = 0x01
    serialization = 0x01  # JSON
    length = len(payload)

    header = struct.pack(">BBI", version, serialization, length)
    return header + payload


def parse_data_in_protocol(payload: bytes) -> bytes:
    if len(payload) < 6:
        raise ValueError(f"invalid data length: {len(payload)}")

    version, serialization, length = struct.unpack(">BBI", payload[:6])

    if version != 0x01:
        raise ValueError("invalid version")
    if serialization != 0x01:
        raise ValueError("invalid serialization")

    expected_len = 6 + length
    if len(payload) < expected_len:
        raise ValueError("data length mismatch")
    return payload[6:expected_len]


def write_audio_to_file(audio_data: bytes, sample_rate: int, dest_file: str, response_format: str):
    file_name = dest_file + f".{response_format.lower()}"
    audio_np = np.frombuffer(audio_data, dtype=np.int16)

    if response_format.lower() == "mp3":
        from pydub import AudioSegment

        # Convert the numpy array to an AudioSegment for MP3 export
        audio_segment = AudioSegment(
            audio_np.tobytes(),
            frame_rate=sample_rate,
            sample_width=2,  # 16-bit
            channels=1
        )
        audio_segment.export(file_name, format="mp3")

    elif response_format.lower() == "wav":
        sf.write(file_name, audio_np, samplerate=sample_rate, subtype="PCM_16")

    elif response_format.lower() == "pcm":
        # For PCM, write the raw bytes directly
        with open(file_name, "wb") as f:
            f.write(audio_data)

    else:
        raise ValueError(f"Unsupported response format: {response_format}")

    print(f"write audio to file: {file_name} successfully...")


async def voice_clone():
    async def payloads_generator():
        with open("sample.wav", "rb") as f:
            audio_data = f.read()
        text = "欲买桂花同载酒,终不似,少年游。"
        payload = {
            "model": "SenseNova-Audio-Clone-0901",
            "text": text,
            "audio_data": base64.b64encode(audio_data).decode("utf-8"),
            "audio_format": "wav",
            "disable_noise_reduction": True
        }
        raw = json.dumps(payload).encode("utf-8")
        packed = pack_data_in_protocol(raw)
        yield packed
        return

    async with connect(f"{WS_BASE_URL}/speech_clone", extra_headers=HEADERS, **WS_CONFIG) as ws:
        queue = asyncio.Queue()
        await send_messages(ws, payloads_generator())
        await receive_messages(ws, queue)
        while True:
            item = await queue.get()
            if item is None:
                break
            print(item)


async def tts(hyper_param: dict = {}):
    queue = asyncio.Queue()
    response_format = hyper_param.get("response_format") or "wav"
    async with connect(f"{WS_BASE_URL}/speech", extra_headers=HEADERS, **WS_CONFIG) as websocket:
        send_task = asyncio.create_task(send_messages(
            websocket, tts_payloads_generator(hyper_param)))
        recv_task = asyncio.create_task(receive_messages(websocket, queue))

        responses = []

        while True:
            item = await queue.get()
            if item is None:
                break
            responses.append(item)
        await asyncio.gather(send_task, recv_task)

    if responses:
        stream = hyper_param.get("stream")
        if not stream:
            audio_data = responses[0]["audio_data"]
            if audio_data:
                decoded_audio_bytes = base64.b64decode(audio_data)
                write_audio_to_file(decoded_audio_bytes,
                                    16000, "output", response_format)
        else:
            stream_bytes = bytes()
            for response in responses:
                audio_data = response["audio_data"]
                if audio_data:
                    decoded_audio_bytes = base64.b64decode(audio_data)
                    stream_bytes += decoded_audio_bytes
            write_audio_to_file(stream_bytes, 16000,
                                "output_stream", response_format)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--service', '-s', help='service name: voice_clone or tts', type=str, default="tts")
    parser.add_argument(
        '--voice', '-v', help='voice ID', type=str, default="M20")
    parser.add_argument(
        '--input', '-i', help='input text', type=str, default="欲买桂花同载酒,终不似,少年游。")
    parser.add_argument(
        '--style', help='style', type=str, default="正常")
    parser.add_argument(
        '--speed', help='speaking rate', type=float, default=1.0)
    parser.add_argument(
        '--volume', help='volume', type=float, default=1.0)
    parser.add_argument(
        '--pitch', help='pitch', type=float, default=1.0)
    parser.add_argument(
        '--sample_rate', help='sample rate', type=int, default=16000)
    parser.add_argument(
        '--language', '-l', help='language', type=str, choices=['ZH_CN', 'ZH_CN_SICHUAN', 'ZH_CN_HK'], default="ZH_CN")
    parser.add_argument(
        '--text_type', '-t', help='text type', type=str, choices=['PLAIN', 'SSML'], default="PLAIN")
    parser.add_argument(
        '--response_format', '-r', help='response format', type=str, choices=['mp3', 'wav', 'pcm'], default="wav")
    parser.add_argument('--with_subtitles', action='store_true', default=False,
                        help='whether to return subtitles')
    parser.add_argument('--stream', action='store_true', default=False,
                        help='whether to stream the audio back')
    try:
        args = parser.parse_args()
        print(args)
        if args.service == "tts":
            param = dict(
                voice=args.voice,
                input=args.input,
                stream=args.stream,
                language=args.language,
                with_subtitles=args.with_subtitles,
                style=args.style,
                speed=args.speed,
                volume=args.volume,
                pitch=args.pitch,
                sample_rate=args.sample_rate,
                text_type=args.text_type,
                response_format=args.response_format
            )
            asyncio.run(tts(param))
        elif args.service == "voice_clone":
            asyncio.run(voice_clone())
    except KeyboardInterrupt:
        print("\nDisconnected.")
    except Exception as e:
        print(e)