
English | 简体中文

Transformer Mood

A local-first speech emotion recognition toolkit with training, CLI inference, and a FastAPI WebUI.


Features

  • RAVDESS dataset preprocessing and training
  • Transformer Encoder speech emotion classification model
  • Single-file CLI inference
  • FastAPI WebUI
  • Browser microphone recording, audio uploads, and probability display

Repository Layout

src/
  transformer_mood/
    __init__.py
    main.py
    speech_emotion_classifier.py
    templates/
      index.html
    static/
      .gitkeep
README.zh.md                    # Chinese project README
output/                        # Runtime output directory (ignored except .gitkeep)
data/README.md                 # Dataset placement notes
data/README.zh.md              # Chinese dataset placement notes
transformer-md/                # Reference materials
requirements.txt               # Non-PyTorch Python dependencies
requirements-webui.txt         # Minimal extra dependencies for the WebUI

Quick Start

python3 -m venv .venv
source .venv/bin/activate

Install torch and torchaudio first, then install the remaining dependencies:

pip install torch torchaudio
pip install -r requirements.txt

Install ffmpeg:

sudo apt update
sudo apt install -y ffmpeg
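ffmpeg must end up on your PATH for audio decoding. A minimal sketch (not part of the project) for verifying this from Python:

```python
import shutil

def ffmpeg_available() -> bool:
    # True if an ffmpeg binary is discoverable on PATH.
    return shutil.which("ffmpeg") is not None

print("ffmpeg on PATH:", ffmpeg_available())
```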

The recommended project entrypoints are:

python run.py doctor
python run.py

Dataset

This repository does not include the RAVDESS dataset itself. Download it manually and place it under:

data/ravdess/

See data/README.md for details.
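RAVDESS encodes each clip's metadata in seven dash-separated filename fields, with the emotion code in the third field. A sketch of extracting the label, assuming the published RAVDESS naming convention (the helper name here is hypothetical, not an API of this project):

```python
# Emotion codes from the RAVDESS filename convention,
# e.g. 03-01-05-01-02-01-12.wav -> emotion code "05" -> angry.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(filename: str) -> str:
    # The third dash-separated field is the emotion code.
    code = filename.split("-")[2]
    return RAVDESS_EMOTIONS[code]
```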

Training

python run.py train
python run.py train -- --dataset tess

--dataset tess keeps the old CLI name, but now reads training audio from data/vec/ instead of data/tess/.

The vec-backed tess mode uses 6 classes:

  • angry
  • disgust
  • fearful
  • happy
  • neutral
  • sad
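One way to picture the six-class label space, assuming alphabetical ordering (the authoritative mapping lives in the training code):

```python
# Hypothetical label-to-index mapping for the vec-backed tess mode;
# check the training code for the actual ordering used at train time.
CLASSES = ["angry", "disgust", "fearful", "happy", "neutral", "sad"]
LABEL_TO_INDEX = {name: i for i, name in enumerate(CLASSES)}
```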

CLI Prediction

python run.py predict --audio path/to/example.wav

WebUI

python run.py
python run.py webui --host 127.0.0.1 --port 8000

Open:

http://127.0.0.1:8000

The WebUI supports:

  • Uploading local audio files
  • Recording from the browser microphone
  • Displaying predicted emotion, confidence, and full probability distribution

Prediction requires a local checkpoint at output/best_model.pth, or an explicit EMOTION_MODEL_PATH environment variable.
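The lookup order described above can be sketched as follows (a sketch of the stated behavior, not the project's actual resolution code):

```python
import os
from pathlib import Path

def resolve_checkpoint() -> Path:
    # EMOTION_MODEL_PATH, if set, overrides the default checkpoint location.
    env = os.environ.get("EMOTION_MODEL_PATH")
    return Path(env) if env else Path("output/best_model.pth")
```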

Notes

  • This project is released under the MIT License. See LICENSE for details.
  • data/ravdess/ is in .gitignore so the raw dataset will not be committed accidentally
  • Legacy root-level model and image artifacts are ignored; the current expected outputs live in output/
  • output/ is kept as a directory boundary, but generated model files and figures are ignored for the public repository
