
Building Vocalio: An AI-Powered Voice Training App

How I built Vocalio, a voice training app with real-time pitch detection, vocal analysis, and 40+ professional exercises for singers and speakers.

iOS • Audio • AI • Mobile

After my previous App Store account was terminated, I needed to start fresh. I wanted to build something meaningful - an app that genuinely helps people. Voice training was a space I'd been interested in for years.

The Problem

Voice training apps fall into two categories:

  • Too basic - just audio playback with no feedback
  • Too expensive - professional coaching costs $100+/hour

There was a gap: an affordable app with professional-grade feedback.

Research Phase

Before writing code, I spent weeks researching:

  • Interviewed vocal coaches about their training methods
  • Studied music theory and vocal physiology
  • Analyzed competitor apps (their strengths and weaknesses)
  • Read research papers on pitch detection algorithms

Key insight: most voice apps only track pitch. But professional vocal training involves much more - breath control, resonance, vibrato, strain detection.

Technical Architecture

Vocalio is built with:

  • SwiftUI for the UI
  • AVFoundation for audio capture
  • Accelerate framework for DSP (Digital Signal Processing)
  • Core ML for some analysis features

Real-Time Pitch Detection

The core challenge was accurate pitch detection in real time. I implemented an autocorrelation-based algorithm:

    func detectPitch(buffer: AVAudioPCMBuffer) -> Float? {
        guard let channelData = buffer.floatChannelData?[0] else {
            return nil
        }

        let frames = Int(buffer.frameLength)

        // Apply a Hann window to reduce spectral leakage
        var windowedSignal = [Float](repeating: 0, count: frames)
        vDSP_vmul(channelData, 1, hannWindow, 1, &windowedSignal, 1, vDSP_Length(frames))

        // Compute the autocorrelation; vDSP_conv expects its first input
        // to be padded to frames + frames - 1 samples
        let paddedSignal = windowedSignal + [Float](repeating: 0, count: frames - 1)
        var autocorrelation = [Float](repeating: 0, count: frames)
        vDSP_conv(paddedSignal, 1, windowedSignal, 1, &autocorrelation, 1,
                  vDSP_Length(frames), vDSP_Length(frames))

        // Find the peak in the autocorrelation (excluding lag 0), searching
        // only the lags that correspond to plausible vocal frequencies
        let minLag = Int(Float(sampleRate) / maxFrequency)
        let maxLag = Int(Float(sampleRate) / minFrequency)

        var peakLag = minLag
        var peakValue: Float = 0

        for lag in minLag..<min(maxLag, frames) {
            if autocorrelation[lag] > peakValue {
                peakValue = autocorrelation[lag]
                peakLag = lag
            }
        }

        // Convert the winning lag back to a frequency in Hz
        return Float(sampleRate) / Float(peakLag)
    }

This runs at 60fps even on older iPhones.
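The same peak-picking idea can be sanity-checked outside the app on a synthetic tone. Here is a plain-Swift sketch without Accelerate; the buffer size and frequency bounds are illustrative, not Vocalio's actual settings:

```swift
import Foundation

// Naive autocorrelation-based pitch estimate on a synthetic 220 Hz sine.
// A plain-Swift stand-in for the vDSP version above, for illustration only.
func estimatePitch(_ signal: [Float], sampleRate: Float,
                   minFrequency: Float, maxFrequency: Float) -> Float {
    let n = signal.count
    let minLag = Int(sampleRate / maxFrequency)
    let maxLag = min(Int(sampleRate / minFrequency), n - 1)

    var peakLag = minLag
    var peakValue = -Float.greatestFiniteMagnitude
    for lag in minLag...maxLag {
        // r[lag] = sum of signal[i] * signal[i + lag]
        var r: Float = 0
        for i in 0..<(n - lag) { r += signal[i] * signal[i + lag] }
        if r > peakValue { peakValue = r; peakLag = lag }
    }
    return sampleRate / Float(peakLag)
}

let sampleRate: Float = 44_100
let signal: [Float] = (0..<2048).map { i in
    Float(sin(2 * Double.pi * 220 * Double(i) / 44_100))
}
let pitch = estimatePitch(signal, sampleRate: sampleRate,
                          minFrequency: 60, maxFrequency: 1000)
print(pitch)  // ≈ 220 Hz, within one lag step of quantization error
```

The recovered pitch is quantized to whole-sample lags, which is why production detectors usually add parabolic interpolation around the peak.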

Vibrato Analysis

Detecting vibrato required tracking pitch variations over time:

    struct VibratoAnalysis {
        let rate: Float        // Oscillations per second (typically 5-7 Hz)
        let extent: Float      // Pitch variation in semitones
        let consistency: Float // How regular the oscillation is
    }

    func analyzeVibrato(pitchHistory: [Float]) -> VibratoAnalysis {
        // Apply FFT to pitch history to find oscillation frequency
        // Peak in 4-8 Hz range indicates vibrato
        // ...
    }
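A rough rate estimate can also come from counting zero crossings of the mean-removed pitch track, a simpler stand-in for the FFT approach. This is a sketch under assumptions: the 60 Hz analysis rate and the synthetic 6 Hz vibrato are mine, not Vocalio's:

```swift
import Foundation

// Estimate vibrato rate (Hz) from a pitch history sampled at a fixed rate
// by counting zero crossings of the mean-removed track; each full
// oscillation contributes two crossings.
func estimateVibratoRate(pitchHistory: [Float], samplesPerSecond: Float) -> Float {
    guard pitchHistory.count > 1 else { return 0 }
    let mean = pitchHistory.reduce(0, +) / Float(pitchHistory.count)
    let centered = pitchHistory.map { $0 - mean }

    var crossings = 0
    for i in 1..<centered.count where centered[i - 1] * centered[i] < 0 {
        crossings += 1
    }

    let duration = Float(centered.count) / samplesPerSecond
    return Float(crossings) / (2 * duration)
}

// Synthetic pitch track: 6 Hz vibrato around 220 Hz, sampled at 60 Hz for 2 s.
// The half-sample phase offset keeps samples off exact zero crossings.
let history: [Float] = (0..<120).map { i in
    220 + 3 * Float(sin(2 * Double.pi * 6 * (Double(i) + 0.5) / 60))
}
let rate = estimateVibratoRate(pitchHistory: history, samplesPerSecond: 60)
print(rate)  // 5.75 — within a quarter-Hz of the true 6 Hz rate
```

Zero-crossing counting is noisier than a spectral peak but cheap enough to run on every frame.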

Strain Detection

This was tricky. Vocal strain manifests as:

  • Increased noise in the signal (a lower HNR - Harmonic-to-Noise Ratio)
  • Pitch instability
  • Tightening of the formants

I combined these signals into a strain score:

    func detectStrain(buffer: AVAudioPCMBuffer, pitch: Float) -> Float {
        let hnr = calculateHNR(buffer)
        let jitter = calculatePitchJitter(recentPitches)
        let shimmer = calculateAmplitudeShimmer(buffer)

        // Weighted combination, clamped to the 0-1 range
        let strainScore = (1 - hnr) * 0.4 + jitter * 0.3 + shimmer * 0.3
        return min(1.0, max(0.0, strainScore))
    }
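The post doesn't show calculatePitchJitter, but a common approximation is the mean absolute change between consecutive pitch estimates, normalized by the mean pitch. This is a hypothetical stand-in, not Vocalio's actual implementation:

```swift
import Foundation

// Relative pitch jitter: mean absolute difference between consecutive
// pitch estimates, divided by the mean pitch. 0 means perfectly steady.
// Hypothetical stand-in for the article's calculatePitchJitter.
func pitchJitter(_ pitches: [Float]) -> Float {
    guard pitches.count > 1 else { return 0 }
    var totalDelta: Float = 0
    for i in 1..<pitches.count {
        totalDelta += abs(pitches[i] - pitches[i - 1])
    }
    let meanDelta = totalDelta / Float(pitches.count - 1)
    let meanPitch = pitches.reduce(0, +) / Float(pitches.count)
    return meanDelta / meanPitch
}

let steady: [Float] = [220, 220.2, 219.9, 220.1, 220.0]
let shaky: [Float]  = [220, 228, 213, 226, 215]
print(pitchJitter(steady))  // small, on the order of 0.001
print(pitchJitter(shaky))   // much larger, around 0.05
```

Normalizing by the mean pitch keeps the score comparable across voice types, since the same absolute wobble is proportionally larger for a low voice.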

The 40+ Exercises

Each exercise was designed with specific goals:

Breathing exercises - Foundation of good vocal technique

  • Diaphragmatic breathing (belly expansion, not chest)
  • Breath stamina (sustained exhale)
  • Rib cage expansion

Warm-ups - Prepare the voice safely

  • Lip trills (reduces tension)
  • Humming (engages resonance)
  • Sirens (smooth transitions)

Strength training - Build power and control

  • Messa di voce (soft → loud → soft)
  • Sustained notes at different volumes
  • Power bursts

Flexibility - Expand range and agility

  • Scales and arpeggios
  • Octave jumps
  • Quick pitch changes

Challenges Faced

Background Audio

Users wanted to practice while playing backing tracks. Implementing proper audio session management for simultaneous playback and recording was complex:

    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.playAndRecord,
                                 mode: .measurement,
                                 options: [.defaultToSpeaker, .allowBluetooth])
    try audioSession.setActive(true)

Latency

Users are sensitive to latency when hearing their own voice. I minimized it by:

  • Using the smallest viable buffer size (256 samples)
  • Processing on a high-priority audio thread
  • Avoiding any UI updates in the audio callback
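For context on the buffer-size choice, the per-buffer latency is just the buffer length divided by the sample rate. The 44.1 kHz rate below is an assumption (a common default); the post doesn't state Vocalio's actual rate:

```swift
// Latency contributed by one capture buffer alone, ignoring hardware
// and output-path delays. 44.1 kHz is assumed, not stated in the post.
let bufferSize = 256.0
let sampleRate = 44_100.0
let latencyMs = bufferSize / sampleRate * 1_000
print(latencyMs)  // ≈ 5.8 ms per buffer
```

At that hop size the buffer itself is well under the ~10-20 ms threshold where self-monitoring starts to feel delayed, which is why shrinking it further has diminishing returns.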

Different Voice Types

A baritone and a soprano have very different frequency ranges. The app detects voice type during onboarding and adjusts all exercises accordingly.
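One way to model this adjustment is a voice-type enum with an associated range, so exercise targets can be clamped or transposed into it. The ranges below are rough textbook approximations for illustration, not Vocalio's actual onboarding values:

```swift
import Foundation

// Illustrative voice-type ranges in Hz (rough approximations, not the
// app's real values). Exercise targets are clamped into the range.
enum VoiceType {
    case bass, baritone, tenor, alto, soprano

    var range: ClosedRange<Float> {
        switch self {
        case .bass:     return 82...330    // roughly E2-E4
        case .baritone: return 110...440   // roughly A2-A4
        case .tenor:    return 131...523   // roughly C3-C5
        case .alto:     return 175...698   // roughly F3-F5
        case .soprano:  return 262...1047  // roughly C4-C6
        }
    }

    // Pull an exercise's target pitch into this voice's comfortable range
    func clampedTarget(_ frequency: Float) -> Float {
        min(max(frequency, range.lowerBound), range.upperBound)
    }
}

let target: Float = 523  // C5
print(VoiceType.baritone.clampedTarget(target))  // 440.0, pulled down into range
print(VoiceType.soprano.clampedTarget(target))   // 523.0, already in range
```

In practice transposing by whole octaves usually feels more musical than hard clamping, but the data model is the same.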

Business Model

I chose freemium:

  • Free: basic exercises, limited analysis
  • Premium: full exercise library, advanced analytics, progress tracking

There are weekly and annual subscription options; annual plans retain users better.

Results

Since launch:

  • 5-star ratings from real users
  • Positive feedback from vocal coaches who recommend it
  • A growing user base

The most rewarding feedback: "I used to hate my voice. Now I actually enjoy speaking."

What's Next

Currently working on:

  • Song practice mode (sing along with pitch guidance)
  • Duet recording features
  • More specialized exercise packs (public speaking, singing genres)

Building Vocalio taught me that the best apps solve real problems with technical excellence. Audio processing is hard, but when you get it right, users notice.