
Building Vocalio: An AI-Powered Voice Training App

How I built Vocalio, a voice training app with real-time pitch detection, vocal analysis, and 40+ professional exercises for singers and speakers.

iOS • Audio • AI • Mobile

After my previous App Store account was terminated, I needed to start fresh. I wanted to build something meaningful - an app that genuinely helps people. Voice training was a space I'd been interested in for years.

The Problem

Voice training apps fall into two categories:

  • Too basic - just audio playback with no feedback
  • Too expensive - professional coaching costs $100+/hour

There was a gap: an affordable app with professional-grade feedback.

Research Phase

Before writing code, I spent weeks researching:

  • Interviewed vocal coaches about their training methods
  • Studied music theory and vocal physiology
  • Analyzed competitor apps (their strengths and weaknesses)
  • Read research papers on pitch detection algorithms

Key insight: most voice apps only track pitch. But professional vocal training involves much more - breath control, resonance, vibrato, strain detection.

Technical Architecture

Vocalio is built with:

  • SwiftUI for the UI
  • AVFoundation for audio capture
  • Accelerate framework for DSP (Digital Signal Processing)
  • Core ML for some analysis features

Real-Time Pitch Detection

The core challenge was accurate pitch detection in real time. I implemented an autocorrelation-based algorithm:

    func detectPitch(buffer: AVAudioPCMBuffer) -> Float? {
        guard let channelData = buffer.floatChannelData?[0] else {
            return nil
        }

        let frames = Int(buffer.frameLength)

        // Apply a Hann window to reduce spectral leakage
        var windowedSignal = [Float](repeating: 0, count: frames)
        vDSP_vmul(channelData, 1, hannWindow, 1, &windowedSignal, 1, vDSP_Length(frames))

        // Compute the autocorrelation; vDSP_conv expects its first input
        // to be padded to frames + frames - 1 samples
        let paddedSignal = windowedSignal + [Float](repeating: 0, count: frames - 1)
        var autocorrelation = [Float](repeating: 0, count: frames)
        vDSP_conv(paddedSignal, 1, windowedSignal, 1, &autocorrelation, 1,
                  vDSP_Length(frames), vDSP_Length(frames))

        // Find the peak in the autocorrelation (excluding lag 0), searching
        // only the lags that correspond to plausible vocal frequencies
        let minLag = Int(Float(sampleRate) / maxFrequency)
        let maxLag = Int(Float(sampleRate) / minFrequency)

        var peakLag = minLag
        var peakValue: Float = 0

        for lag in minLag..<min(maxLag, frames) {
            if autocorrelation[lag] > peakValue {
                peakValue = autocorrelation[lag]
                peakLag = lag
            }
        }

        // Convert the winning lag back to a frequency in Hz
        return Float(sampleRate) / Float(peakLag)
    }

This runs at 60fps even on older iPhones.
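The same peak-picking idea can be sanity-checked outside the app on a synthetic tone. Here is a plain-Swift sketch without Accelerate; the buffer size and frequency bounds are illustrative, not Vocalio's actual settings:

```swift
import Foundation

// Naive autocorrelation-based pitch estimate on a synthetic 220 Hz sine.
// A plain-Swift stand-in for the vDSP version above, for illustration only.
func estimatePitch(_ signal: [Float], sampleRate: Float,
                   minFrequency: Float, maxFrequency: Float) -> Float {
    let n = signal.count
    let minLag = Int(sampleRate / maxFrequency)
    let maxLag = min(Int(sampleRate / minFrequency), n - 1)

    var peakLag = minLag
    var peakValue = -Float.greatestFiniteMagnitude
    for lag in minLag...maxLag {
        // r[lag] = sum of signal[i] * signal[i + lag]
        var r: Float = 0
        for i in 0..<(n - lag) { r += signal[i] * signal[i + lag] }
        if r > peakValue { peakValue = r; peakLag = lag }
    }
    return sampleRate / Float(peakLag)
}

let sampleRate: Float = 44_100
let signal: [Float] = (0..<2048).map { i in
    Float(sin(2 * Double.pi * 220 * Double(i) / 44_100))
}
let pitch = estimatePitch(signal, sampleRate: sampleRate,
                          minFrequency: 60, maxFrequency: 1000)
print(pitch)  // ≈ 220 Hz, within one lag step of quantization error
```

The recovered pitch is quantized to whole-sample lags, which is why production detectors usually add parabolic interpolation around the peak.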

Vibrato Analysis

Detecting vibrato required tracking pitch variations over time:

    struct VibratoAnalysis {
        let rate: Float        // Oscillations per second (typically 5-7 Hz)
        let extent: Float      // Pitch variation in semitones
        let consistency: Float // How regular the oscillation is
    }

    func analyzeVibrato(pitchHistory: [Float]) -> VibratoAnalysis {
        // Apply FFT to pitch history to find oscillation frequency
        // Peak in 4-8 Hz range indicates vibrato
        // ...
    }
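A rough rate estimate can also come from counting zero crossings of the mean-removed pitch track, a simpler stand-in for the FFT approach. This is a sketch under assumptions: the 60 Hz analysis rate and the synthetic 6 Hz vibrato are mine, not Vocalio's:

```swift
import Foundation

// Estimate vibrato rate (Hz) from a pitch history sampled at a fixed rate
// by counting zero crossings of the mean-removed track; each full
// oscillation contributes two crossings.
func estimateVibratoRate(pitchHistory: [Float], samplesPerSecond: Float) -> Float {
    guard pitchHistory.count > 1 else { return 0 }
    let mean = pitchHistory.reduce(0, +) / Float(pitchHistory.count)
    let centered = pitchHistory.map { $0 - mean }

    var crossings = 0
    for i in 1..<centered.count where centered[i - 1] * centered[i] < 0 {
        crossings += 1
    }

    let duration = Float(centered.count) / samplesPerSecond
    return Float(crossings) / (2 * duration)
}

// Synthetic pitch track: 6 Hz vibrato around 220 Hz, sampled at 60 Hz for 2 s.
// The half-sample phase offset keeps samples off exact zero crossings.
let history: [Float] = (0..<120).map { i in
    220 + 3 * Float(sin(2 * Double.pi * 6 * (Double(i) + 0.5) / 60))
}
let rate = estimateVibratoRate(pitchHistory: history, samplesPerSecond: 60)
print(rate)  // 5.75 — within a quarter-Hz of the true 6 Hz rate
```

Zero-crossing counting is noisier than a spectral peak but cheap enough to run on every frame.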

Strain Detection

This was tricky. Vocal strain manifests as:

  • Increased noise in the signal (a lower HNR - Harmonic-to-Noise Ratio)
  • Pitch instability
  • Tightening of the formants

I combined these signals into a strain score:

    func detectStrain(buffer: AVAudioPCMBuffer, pitch: Float) -> Float {
        let hnr = calculateHNR(buffer)
        let jitter = calculatePitchJitter(recentPitches)
        let shimmer = calculateAmplitudeShimmer(buffer)

        // Weighted combination, clamped to the 0-1 range
        let strainScore = (1 - hnr) * 0.4 + jitter * 0.3 + shimmer * 0.3
        return min(1.0, max(0.0, strainScore))
    }
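The post doesn't show calculatePitchJitter, but a common approximation is the mean absolute change between consecutive pitch estimates, normalized by the mean pitch. This is a hypothetical stand-in, not Vocalio's actual implementation:

```swift
import Foundation

// Relative pitch jitter: mean absolute difference between consecutive
// pitch estimates, divided by the mean pitch. 0 means perfectly steady.
// Hypothetical stand-in for the article's calculatePitchJitter.
func pitchJitter(_ pitches: [Float]) -> Float {
    guard pitches.count > 1 else { return 0 }
    var totalDelta: Float = 0
    for i in 1..<pitches.count {
        totalDelta += abs(pitches[i] - pitches[i - 1])
    }
    let meanDelta = totalDelta / Float(pitches.count - 1)
    let meanPitch = pitches.reduce(0, +) / Float(pitches.count)
    return meanDelta / meanPitch
}

let steady: [Float] = [220, 220.2, 219.9, 220.1, 220.0]
let shaky: [Float]  = [220, 228, 213, 226, 215]
print(pitchJitter(steady))  // small, on the order of 0.001
print(pitchJitter(shaky))   // much larger, around 0.05
```

Normalizing by the mean pitch keeps the score comparable across voice types, since the same absolute wobble is proportionally larger for a low voice.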

The 40+ Exercises

Each exercise was designed with specific goals:

Breathing exercises - Foundation of good vocal technique

  • Diaphragmatic breathing (belly expansion, not chest)
  • Breath stamina (sustained exhale)
  • Rib cage expansion

Warm-ups - Prepare the voice safely

  • Lip trills (reduces tension)
  • Humming (engages resonance)
  • Sirens (smooth transitions)

Strength training - Build power and control

  • Messa di voce (soft → loud → soft)
  • Sustained notes at different volumes
  • Power bursts

Flexibility - Expand range and agility

  • Scales and arpeggios
  • Octave jumps
  • Quick pitch changes

Challenges Faced

Background Audio

Users wanted to practice while playing backing tracks. Implementing proper audio session management for simultaneous playback and recording was complex:

    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.playAndRecord,
                                 mode: .measurement,
                                 options: [.defaultToSpeaker, .allowBluetooth])
    try audioSession.setActive(true)

Latency

Users are sensitive to latency when hearing their own voice. I minimized it by:

  • Using the smallest viable buffer size (256 samples)
  • Processing on a high-priority audio thread
  • Avoiding any UI updates in the audio callback
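For context on the buffer-size choice, the per-buffer latency is just the buffer length divided by the sample rate. The 44.1 kHz rate below is an assumption (a common default); the post doesn't state Vocalio's actual rate:

```swift
// Latency contributed by one capture buffer alone, ignoring hardware
// and output-path delays. 44.1 kHz is assumed, not stated in the post.
let bufferSize = 256.0
let sampleRate = 44_100.0
let latencyMs = bufferSize / sampleRate * 1_000
print(latencyMs)  // ≈ 5.8 ms per buffer
```

At that hop size the buffer itself is well under the ~10-20 ms threshold where self-monitoring starts to feel delayed, which is why shrinking it further has diminishing returns.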

Different Voice Types

A baritone and a soprano have very different frequency ranges. The app detects voice type during onboarding and adjusts all exercises accordingly.
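One way to model this adjustment is a voice-type enum with an associated range, so exercise targets can be clamped or transposed into it. The ranges below are rough textbook approximations for illustration, not Vocalio's actual onboarding values:

```swift
import Foundation

// Illustrative voice-type ranges in Hz (rough approximations, not the
// app's real values). Exercise targets are clamped into the range.
enum VoiceType {
    case bass, baritone, tenor, alto, soprano

    var range: ClosedRange<Float> {
        switch self {
        case .bass:     return 82...330    // roughly E2-E4
        case .baritone: return 110...440   // roughly A2-A4
        case .tenor:    return 131...523   // roughly C3-C5
        case .alto:     return 175...698   // roughly F3-F5
        case .soprano:  return 262...1047  // roughly C4-C6
        }
    }

    // Pull an exercise's target pitch into this voice's comfortable range
    func clampedTarget(_ frequency: Float) -> Float {
        min(max(frequency, range.lowerBound), range.upperBound)
    }
}

let target: Float = 523  // C5
print(VoiceType.baritone.clampedTarget(target))  // 440.0, pulled down into range
print(VoiceType.soprano.clampedTarget(target))   // 523.0, already in range
```

In practice transposing by whole octaves usually feels more musical than hard clamping, but the data model is the same.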

Business Model

I chose freemium:

  • Free: basic exercises, limited analysis
  • Premium: full exercise library, advanced analytics, progress tracking

There are weekly and annual subscription options; annual plans retain users better.

Results

Since launch:

  • 5-star ratings from real users
  • Positive feedback from vocal coaches who recommend it
  • A growing user base

The most rewarding feedback: "I used to hate my voice. Now I actually enjoy speaking."

What's Next

Currently working on:

  • Song practice mode (sing along with pitch guidance)
  • Duet recording features
  • More specialized exercise packs (public speaking, singing genres)

Building Vocalio taught me that the best apps solve real problems with technical excellence. Audio processing is hard, but when you get it right, users notice.