This document explains the details of each models, such as export configuration, and input/output argument for onnx model.
Export configuration
| config key | type | note | default |
|---|---|---|---|
| max_seq_len | int | Maximum sequence length. | 512 |
| noise_scale | float | Noise scale parameter for flow. | 0.667 |
| noise_scale_dur | float | Noise scale parameter for duration predictor. | 0.8 |
| alpha | float | Alpha parameter to control the speed of generated speech. | 1.0 |
| use_teacher_forcing | bool | Whether to use teacher forcing. | False |
| predict_duration | bool | Whether to predict duration while inference. | True |
model input
| input name | detail | shape | dtype | dynamic dim |
|---|---|---|---|---|
| text | Input text token ids. | (1,) |
int64 | 0 |
| feats | Feature vector. Required if use_teacher_forcing is True. |
(feats_length, feat_dim) |
float32 | 0 |
| sids | Speaker id. Required if exported model requires speaker id. | (1,) |
int64 | - |
| spembs | Speaker vector. Required if exported model requires speaker embedding. | (spk_embed_dim,) |
float32 | - |
| lids | Language id. Required if exported model required language id. | (1,) |
int64 | - |
| duration | Ground-truth duration tensor. Required if predict_duration is False when exporting the model. |
(len_text,) |
float32 | 0 |
model output
| output name | detail | shape | dtype | dynamic dim |
|---|---|---|---|---|
| wav | Generated waveform tensor | (len_wav,) |
float32 | 0 |
| att_w | Monotonic attention weight tensor | (feats_len, len_text) |
float32 | 0, 1 |
| dur | Predicted duration tensor | (len_text,) |
float32 | 0 |