Example of the component esphome/micro-opus v0.3.3
# ESP32-S3 Opus Encode Benchmark

Benchmarks Opus encoding performance across a matrix of settings using two 30-second audio clips. Tests speech encoding (SILK codec, low bitrates) and music encoding (CELT codec, high bitrates), reporting per-frame timing statistics and actual vs target bitrate.

## Features

- Two embedded 30-second test audio clips (public domain):
  - **SPEECH (SILK)**: 16kHz mono, tests low-bitrate encoding (10-32 kbit/s)
  - **MUSIC (CELT)**: 48kHz stereo, tests high-bitrate encoding (64-192 kbit/s)
- Full test matrix: complexity levels (0, 2, 5, 8, 10) x application modes (VOIP, AUDIO) x bitrates
- Per-frame encoding timing with statistical analysis (min/max/avg/stddev)
- Only encodes are timed (decode time excluded from measurements)
- Actual vs target bitrate comparison
- Auto-skip: stops test series when encoding becomes slower than real-time
- Pre-configured for maximum performance (240MHz, PSRAM, fixed-point)

## Test Matrix

### Speech (40 configurations)

| Mode | Complexity | Bitrates |
| ---- | ---------- | -------- |
| VOIP | 0, 2, 5, 8, 10 | 10k, 16k, 24k, 32k |
| AUDIO | 0, 2, 5, 8, 10 | 10k, 16k, 24k, 32k |

### Music (20 configurations)

| Mode | Complexity | Bitrates |
| ---- | ---------- | -------- |
| AUDIO | 0, 2, 5, 8, 10 | 64k, 96k, 128k, 192k |
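
The configuration counts follow directly from the two tables. As a minimal host-side sketch (Python, not the firmware code), the matrix can be enumerated as below. The names `COMPLEXITIES`, `SPEECH`, `MUSIC`, and `configurations` are illustrative and do not come from the benchmark source, and the iteration order (mode, then complexity, then bitrate) is an assumption based on the first and last speech entries in the sample log.

```python
from itertools import product

COMPLEXITIES = (0, 2, 5, 8, 10)

# (modes, bitrates in bit/s) per audio type, copied from the tables above.
SPEECH = (("VOIP", "AUDIO"), (10_000, 16_000, 24_000, 32_000))
MUSIC = (("AUDIO",), (64_000, 96_000, 128_000, 192_000))


def configurations(modes, bitrates):
    """Return every (mode, complexity, bitrate) combination."""
    return list(product(modes, COMPLEXITIES, bitrates))


speech_configs = configurations(*SPEECH)
music_configs = configurations(*MUSIC)
print(len(speech_configs), len(music_configs))  # 40 20
```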

## Building and Flashing

### Prerequisites

- **PlatformIO** (recommended) OR ESP-IDF v5.0 or later
- ESP32-S3 development board with PSRAM

### Option 1: PlatformIO (Recommended)

PlatformIO provides a simplified build process with automatic dependency management.

```bash
cd examples/encode_benchmark

# Build the project
pio run

# Upload and monitor
pio run -t upload -t monitor
```

The PlatformIO configuration uses the parent microOpus repository as a component, so no additional setup is required.

### Option 2: Native ESP-IDF

```bash
cd examples/encode_benchmark
idf.py set-target esp32s3
idf.py build
idf.py flash monitor
```

### Configuration Options

#### PlatformIO

The default configuration is optimized for maximum performance. To customize:

1. Edit `sdkconfig.defaults` to change Opus-specific settings
2. Use `pio run -t menuconfig` for full ESP-IDF configuration

#### Native ESP-IDF

```bash
idf.py menuconfig
```

Navigate to **Component config → Opus Audio Codec** to adjust:

- Memory allocation mode (THREADSAFE_PSEUDOSTACK, NONTHREADSAFE_PSEUDOSTACK, USE_ALLOCA)
- Floating-point vs fixed-point implementation
- Memory preferences (PSRAM vs internal RAM for state/pseudostack)
- Pseudostack size

## Expected Output

Each iteration runs through all encoder configurations for both audio types:

```text
I (1019) ENCODE_BENCH: === ESP32-S3 Opus Encode Benchmark ===
I (1019) ENCODE_BENCH: Audio sources:
I (1029) ENCODE_BENCH:   SPEECH (SILK): 38196 bytes, 16 kHz, 1 channel
I (1029) ENCODE_BENCH:   MUSIC (CELT): 497933 bytes, 48 kHz, 2 channels
I (1039) ENCODE_BENCH: Processing: decode packet -> encode packet (timing encode only)
I (1049) ENCODE_BENCH: Test matrix: 40 speech configs + 20 music configs = 60 total
I (1049) ENCODE_BENCH: Free heap: 17070164 bytes
I (1059) ENCODE_BENCH: Free PSRAM: 16774624 bytes
I (1059) ENCODE_BENCH: Free Internal: 295540 bytes

I (1069) ENCODE_BENCH: ========== Iteration 1 ==========

I (1079) ENCODE_BENCH: === SPEECH Encoding Tests (40 configurations) ===

I (1079) ENCODE_BENCH: --- SPEECH: VOIP, complexity=0, bitrate=10000 ---
I (9899) ENCODE_BENCH: Frame (us): min=3733 max=5057 avg=4467.0 sd=448.8 (n=1500)
I (9899) ENCODE_BENCH: Total: 6700 ms (30.0s audio), RTF: 0.223 (4.5x real-time)
I (9899) ENCODE_BENCH: Encoded: 34542 bytes (9211 bps actual, target 10000 bps)

...

I (346419) ENCODE_BENCH: --- SPEECH: AUDIO, complexity=10, bitrate=32000 ---
I (626839) ENCODE_BENCH: Frame (us): min=4316 max=8150 avg=6860.2 sd=603.6 (n=1500)
I (626839) ENCODE_BENCH: Total: 10290 ms (30.0s audio), RTF: 0.343 (2.9x real-time)
I (626839) ENCODE_BENCH: Encoded: 121566 bytes (32418 bps actual, target 32000 bps)

I (626849) ENCODE_BENCH: === MUSIC Encoding Tests (20 configurations) ===

I (626859) ENCODE_BENCH: --- MUSIC: AUDIO, complexity=0, bitrate=64000 ---
I (644049) ENCODE_BENCH: Frame (us): min=5781 max=6811 avg=6556.7 sd=96.0 (n=1500)
I (644049) ENCODE_BENCH: Total: 9835 ms (30.0s audio), RTF: 0.328 (3.1x real-time)
I (644059) ENCODE_BENCH: Encoded: 241654 bytes (64441 bps actual, target 64000 bps)

...

I (1079799) ENCODE_BENCH: --- MUSIC: AUDIO, complexity=10, bitrate=192000 ---
I (1079799) ENCODE_BENCH: Frame (us): min=8537 max=16959 avg=13005.2 sd=1959.1 (n=1500)
I (1079799) ENCODE_BENCH: Total: 19507 ms (30.0s audio), RTF: 0.650 (1.5x real-time)
I (1079799) ENCODE_BENCH: Encoded: 721980 bytes (192528 bps actual, target 192000 bps)

I (1079809) ENCODE_BENCH: === Iteration 1 Summary ===
I (1079819) ENCODE_BENCH: All encodes successful: YES
I (1079819) ENCODE_BENCH: Free heap: 16949064 bytes
```
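
For post-processing a captured log on a host machine, the `Frame (us)` lines can be parsed with a regular expression. This is a sketch that assumes the line format shown above; `FRAME_RE` and `parse_frame_line` are hypothetical helper names, not part of the component.

```python
import re

FRAME_RE = re.compile(
    r"Frame \(us\): min=(\d+) max=(\d+) avg=([\d.]+) sd=([\d.]+) \(n=(\d+)\)"
)


def parse_frame_line(line):
    """Extract per-frame timing statistics from one log line, or None."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "min_us": int(m.group(1)),
        "max_us": int(m.group(2)),
        "avg_us": float(m.group(3)),
        "sd_us": float(m.group(4)),
        "frames": int(m.group(5)),
    }


line = "I (9899) ENCODE_BENCH: Frame (us): min=3733 max=5057 avg=4467.0 sd=448.8 (n=1500)"
stats = parse_frame_line(line)
# Average frame time multiplied by the frame count approximates the "Total" line.
total_ms = stats["avg_us"] * stats["frames"] / 1000
print(stats["frames"], total_ms)
```

For the first speech result, the product of average frame time and frame count comes out at about 6700 ms, which matches the `Total` line in the log.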

### Output Fields

- **Frame (us)**: Per-frame encode time statistics (min/max/avg/sd in microseconds, n = frame count)
- **Total**: Wall-clock time spent encoding all frames
- **RTF**: Real-Time Factor (encode_time / audio_duration). RTF < 1 means faster than real-time
- **Nx real-time**: How many times faster than real-time encoding (1/RTF)
- **Encoded**: Compressed size and actual bitrate vs target bitrate
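
The derived fields can be reproduced from the raw numbers in the log. The sketch below uses the first speech result from the sample output; the function names are illustrative only.

```python
def rtf(encode_ms, audio_s):
    """Real-Time Factor: encode time divided by audio duration."""
    return (encode_ms / 1000.0) / audio_s


def actual_bitrate_bps(encoded_bytes, audio_s):
    """Average bitrate of the encoded stream in bits per second."""
    return encoded_bytes * 8 / audio_s


# Values from "SPEECH: VOIP, complexity=0, bitrate=10000" in the sample log.
r = rtf(6700, 30.0)
print(f"RTF {r:.3f}, {1 / r:.1f}x real-time")  # RTF 0.223, 4.5x real-time
print(round(actual_bitrate_bps(34542, 30.0)))  # 9211
```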

### Auto-Skip Behavior

When encoding becomes slower than real-time (RTF > 1.0), the benchmark skips remaining configurations in that audio type since higher complexity/bitrate settings will be even slower:

```text
W (95000) ENCODE_BENCH: RTF > 1.0, skipping remaining MUSIC tests (higher settings will be slower)
```
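
The skip rule can be sketched as a loop that stops at the first configuration slower than real-time. This is a host-side illustration, not the firmware implementation: `run_series` and `measure_rtf` are hypothetical names, and the two RTF values are the floating-point speech figures reported later in this README.

```python
def run_series(configs, measure_rtf):
    """Run configs in order; stop after the first one slower than real-time.

    `measure_rtf` is a hypothetical callable that runs one configuration
    and returns its RTF. Returns (results, skipped).
    """
    results = []
    for i, cfg in enumerate(configs):
        r = measure_rtf(cfg)
        results.append((cfg, r))
        if r > 1.0:
            return results, list(configs[i + 1:])
    return results, []


# Only the first two bitrates have measurements; the rest are never run.
known_rtf = {10_000: 0.90, 16_000: 1.55}
bitrates = [10_000, 16_000, 24_000, 32_000]
results, skipped = run_series(bitrates, lambda b: known_rtf[b])
print(results)  # [(10000, 0.9), (16000, 1.55)]
print(skipped)  # [24000, 32000]
```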

## Benchmark Results (Fixed-Point)

Results from ESP32-S3 at 240MHz using fixed-point arithmetic (`CONFIG_OPUS_FLOATING_POINT=n`). Fixed-point encoding is significantly faster than floating-point on ESP32-S3, especially for SILK, where floating-point falls below real-time even at complexity 0 (16 kbit/s).

Values show real-time multiplier with RTF in parentheses.

### Speech Encoding (16kHz mono)

#### VOIP Mode

| Complexity | 10 kbit/s | 16 kbit/s | 24 kbit/s | 32 kbit/s |
| ---------- | --------- | --------- | --------- | --------- |
| 0 | 4.5x (0.22) | 3.6x (0.28) | 3.6x (0.28) | 3.6x (0.28) |
| 2 | 2.8x (0.36) | 2.8x (0.36) | 2.8x (0.36) | 2.8x (0.36) |
| 5 | 2.0x (0.50) | 2.0x (0.50) | 2.0x (0.51) | 2.0x (0.51) |
| 8 | 1.4x (0.69) | 1.4x (0.69) | 1.4x (0.69) | 1.4x (0.70) |
| 10 | 1.4x (0.69) | 1.4x (0.69) | 1.4x (0.69) | 1.4x (0.70) |

#### AUDIO Mode

| Complexity | 10 kbit/s | 16 kbit/s | 24 kbit/s | 32 kbit/s |
| ---------- | --------- | --------- | --------- | --------- |
| 0 | 4.5x (0.22) | 3.6x (0.28) | **5.5x (0.18)** | **5.4x (0.18)** |
| 2 | 2.8x (0.36) | 2.8x (0.36) | **4.6x (0.22)** | **4.6x (0.22)** |
| 5 | 2.0x (0.50) | 2.0x (0.50) | **3.0x (0.34)** | **3.0x (0.34)** |
| 8 | 1.4x (0.70) | 1.4x (0.69) | **2.9x (0.34)** | **2.9x (0.34)** |
| 10 | 1.4x (0.70) | 1.4x (0.69) | **2.9x (0.34)** | **2.9x (0.34)** |

**Bold** = encoder switched to CELT (faster than SILK at the same bitrate)

### Music Encoding (48kHz stereo, AUDIO mode)

| Complexity | 64 kbit/s | 96 kbit/s | 128 kbit/s | 192 kbit/s |
| ---------- | --------- | --------- | ---------- | ---------- |
| 0 | 3.1x (0.33) | 2.9x (0.35) | 2.8x (0.36) | 2.6x (0.39) |
| 2 | 2.6x (0.39) | 2.4x (0.41) | 2.3x (0.43) | 2.2x (0.45) |
| 5 | 1.9x (0.52) | 1.8x (0.55) | 1.8x (0.56) | 1.7x (0.58) |
| 8 | 1.8x (0.55) | 1.7x (0.60) | 1.6x (0.63) | 1.5x (0.65) |
| 10 | 1.8x (0.55) | 1.7x (0.60) | 1.6x (0.63) | 1.5x (0.65) |

### Key Observations

| Finding | Details |
| ------- | ------- |
| Complexity 8 = 10 | No performance difference between complexity 8 and 10 |
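| CELT faster than SILK | At 24+ kbit/s in AUDIO mode, encoder switches to CELT (~1.5x to 2x faster than SILK at the same complexity) |
| All configs real-time capable | Worst case 1.4x real-time (SILK speech at complexity 8/10, in both modes) |
| Bitrate effect (SILK) | Minimal impact on encode time, except at complexity 0, where 10 kbit/s runs at 4.5x vs 3.6x at higher SILK bitrates |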
| Bitrate effect (CELT) | Higher bitrates slightly slower (more data to process) |

### Summary by Codec

| Codec | Best Case | Worst Case | Notes |
| ----- | --------- | ---------- | ----- |
| SILK (speech) | 4.5x @ c=0 | 1.4x @ c=8+ | Bitrate has little effect on speed |
| CELT (speech) | 5.5x @ c=0 | 2.9x @ c=8+ | ~1.5x to 2x faster than SILK at same complexity |
| CELT (music) | 3.1x @ c=0 | 1.5x @ c=8+ | Stereo 48kHz more demanding than mono 16kHz |

## Benchmark Results (Floating-Point)

Results with `CONFIG_OPUS_FLOATING_POINT=y` for comparison. For SILK encoding at complexity 0, floating-point is roughly 4x to 6x slower than fixed-point and falls below real-time at 16 kbit/s.

### Speech Encoding (16kHz mono) - Floating-Point

| Mode | Complexity | 10 kbit/s | 16 kbit/s | 24-32 kbit/s |
| ---- | ---------- | --------- | --------- | ------ |
| VOIP | 0 | 1.1x (0.90) | **0.6x (1.55)** | skipped |
| AUDIO | 0 | 1.1x (0.89) | **0.6x (1.55)** | skipped |

**Bold** = Slower than real-time (RTF > 1.0). Higher complexity levels were not tested because complexity 0 already fails at 16 kbit/s.

### Music Encoding (48kHz stereo, AUDIO mode) - Floating-Point

| Complexity | 64 kbit/s | 96 kbit/s | 128 kbit/s | 192 kbit/s |
| ---------- | --------- | --------- | ---------- | ---------- |
| 0 | 2.8x (0.36) | 2.6x (0.39) | 2.4x (0.42) | 2.1x (0.48) |
| 2 | 2.4x (0.42) | 2.2x (0.46) | 2.1x (0.48) | 1.9x (0.54) |
| 5 | 2.0x (0.51) | 1.8x (0.54) | 1.8x (0.57) | 1.6x (0.62) |
| 8 | 1.3x (0.77) | 1.2x (0.83) | 1.1x (0.88) | 1.1x (0.95) |
| 10 | 1.3x (0.77) | 1.2x (0.83) | 1.1x (0.88) | 1.1x (0.95) |

### Fixed-Point vs Floating-Point Comparison

| Codec | Fixed-Point | Floating-Point | Speedup |
| ----- | ----------- | -------------- | ------- |
| SILK (c=0, 10k) | 4.5x real-time | 1.1x real-time | **4x faster** |
| SILK (c=0, 16k) | 3.6x real-time | 0.6x (fails) | **6x faster** |
| CELT music (c=0, 64k) | 3.1x real-time | 2.8x real-time | 1.1x faster |
| CELT music (c=8, 192k) | 1.5x real-time | 1.1x real-time | 1.4x faster |

**Recommendation**: Use fixed-point (`CONFIG_OPUS_FLOATING_POINT=n`) for encoding on ESP32-S3. SILK encoding with floating-point is not viable for real-time applications.
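
The Speedup column is the ratio of the two real-time multipliers. A short sketch that recomputes it from the rounded values in the comparison table (so the results are themselves approximate):

```python
# (fixed-point, floating-point) real-time multipliers from the table above.
multipliers = {
    "SILK (c=0, 10k)": (4.5, 1.1),
    "SILK (c=0, 16k)": (3.6, 0.6),
    "CELT music (c=0, 64k)": (3.1, 2.8),
    "CELT music (c=8, 192k)": (1.5, 1.1),
}

speedups = {name: fixed / floating for name, (fixed, floating) in multipliers.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")
```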

## Performance Characteristics

### Encoder Complexity

Opus complexity ranges from 0 (fastest) to 10 (best quality):

| Complexity | Trade-off |
| ---------- | --------- |
| 0-2 | Fastest encoding, lower quality |
| 5 | Balanced (default in most applications) |
| 8-10 | Best quality, slowest encoding |

Higher complexity uses more CPU cycles for analysis and psychoacoustic modeling. Note that complexity 8 and 10 show identical performance on ESP32-S3.

### Application Mode

- **VOIP**: Optimized for speech, prefers SILK codec even at higher bitrates
- **AUDIO**: Optimized for music, prefers CELT codec, uses SILK only at very low bitrates

### Bitrate Impact

Higher bitrates generally:

- Increase encoding time (more data to process)
- Improve audio quality
- Result in larger compressed output
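
To make "larger compressed output" concrete, the average payload per frame follows from the target bitrate and the 20 ms frame duration used by this benchmark. A small sketch, with `bytes_per_frame` as an illustrative helper name:

```python
def bytes_per_frame(bitrate_bps, frame_ms=20):
    """Average encoded payload size for one frame at the given bitrate."""
    return bitrate_bps * frame_ms / 1000 / 8


for bitrate in (10_000, 32_000, 64_000, 192_000):
    print(bitrate, bytes_per_frame(bitrate))
```

At the 10 kbit/s target this gives 25 bytes per frame. The sample log reports 34542 bytes over 1500 frames, about 23 bytes per frame, which is consistent with the actual bitrate landing slightly under the target.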

## Configuration

The default configuration uses 240MHz, fixed-point, THREADSAFE_PSEUDOSTACK, and pseudostack in PSRAM.

Key settings in `sdkconfig.defaults`:

```ini
# Fixed-point (currently configured, can change to floating-point)
CONFIG_OPUS_FLOATING_POINT=n
```

## Memory Usage

| Type | Size | Notes |
| ---- | ---- | ----- |
| Flash | ~640KB | 100KB code + 498KB music + 38KB speech |
| Pseudostack | 120KB | Shared between encoder and decoder |

## Troubleshooting

| Problem | Solution |
| ------- | -------- |
| Watchdog timeout | Already disabled in default config |
| Stack overflow | Increase `CONFIG_ESP_MAIN_TASK_STACK_SIZE` |
| Allocation failures | Check PSRAM is enabled, reduce pseudostack size, or set state to prefer PSRAM |
| All tests skip | RTF > 1.0 on first test; consider lowering complexity or switching to fixed-point (`CONFIG_OPUS_FLOATING_POINT=n`) |

## Technical Details

**Processing Flow**: For each encoder configuration, the benchmark:

1. Decodes the embedded Opus file packet by packet using `OggOpusDecoder`
2. Accumulates PCM samples until a full 20ms frame is ready
3. Encodes the frame using the raw Opus encoder API (`opus_encode`)
4. Times only the encode step (decode time is excluded)
5. Reports statistics after processing all audio
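
Step 2 can be illustrated with a small accumulator. This is a host-side Python sketch of the idea, not the firmware code: `samples_per_frame` and `FrameAccumulator` are hypothetical names, and the buffering strategy is an assumption. It is needed whenever a decoded packet does not contain exactly one 20 ms frame.

```python
FRAME_MS = 20


def samples_per_frame(sample_rate_hz, channels):
    """Interleaved sample count in one 20 ms frame."""
    return sample_rate_hz * FRAME_MS // 1000 * channels


class FrameAccumulator:
    """Collects decoded PCM until full frames are available."""

    def __init__(self, frame_len):
        self.frame_len = frame_len
        self.buffer = []

    def push(self, pcm):
        """Append decoded samples and return any complete frames."""
        self.buffer.extend(pcm)
        frames = []
        while len(self.buffer) >= self.frame_len:
            frames.append(self.buffer[:self.frame_len])
            del self.buffer[:self.frame_len]
        return frames


print(samples_per_frame(16_000, 1))  # 320 (speech clip)
print(samples_per_frame(48_000, 2))  # 1920 (music clip, interleaved stereo)
print(30 * 1000 // FRAME_MS)         # 1500 frames in 30 s, matching n=1500
```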

**Music Audio (CELT)**: Beethoven Symphony No. 3 "Eroica", Op. 55, Movement I, 30s extract.

- Performer: Czech National Symphony Orchestra
- Source: [Musopen Collection](https://archive.org/details/MusopenCollectionAsFlac) on Archive.org
- License: Public Domain
- Format: Ogg Opus 48kHz stereo ~128kbit/s VBR (CELT codec)

**Speech Audio (SILK)**: The Art of War, Chapters 1-2, 30s extract.

- Author: Sun Tzu
- Reader: Moira Fogarty (October 2006)
- Source: [LibriVox](https://archive.org/details/art_of_war_librivox) on Archive.org
- License: Public Domain
- Format: Ogg Opus 16kHz mono ~10kbit/s (SILK codec)

**Timing**: Uses `esp_timer_get_time()` for microsecond precision. Only measures `opus_encode()` calls.
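
The per-frame statistics can be gathered without storing every sample. The sketch below uses Welford's online algorithm in Python as an illustration; it is not taken from the benchmark source. `time.perf_counter_ns()` stands in for `esp_timer_get_time()`, the `sum(...)` call is a placeholder for the encode step, and whether the firmware reports population or sample standard deviation is not stated in this README.

```python
import math
import time


class FrameStats:
    """Running min/max/mean/stddev using Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.min = math.inf
        self.max = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        # Population standard deviation (divides by n); an assumption.
        return math.sqrt(self._m2 / self.n) if self.n else 0.0


def timed_us(fn, *args):
    """Call fn and return (result, elapsed microseconds)."""
    start = time.perf_counter_ns()  # stand-in for esp_timer_get_time()
    result = fn(*args)
    return result, (time.perf_counter_ns() - start) / 1000


stats = FrameStats()
for _ in range(5):
    _, us = timed_us(sum, range(1000))  # placeholder for the encode call
    stats.add(us)
print(f"min={stats.min:.0f} max={stats.max:.0f} avg={stats.mean:.1f} "
      f"sd={stats.stddev:.1f} (n={stats.n})")
```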

To create a project from this example, run `idf.py create-project-from-example "esphome/micro-opus=0.3.3:encode_benchmark"`.
