Building Vocalio: An AI-Powered Voice Training App
How I built Vocalio, a voice training app with real-time pitch detection, vocal analysis, and 40+ professional exercises for singers and speakers.
After my previous App Store account was terminated, I needed to start fresh. I wanted to build something meaningful - an app that genuinely helps people. Voice training was a space I'd been interested in for years.
The Problem
Voice training apps fell into two categories:
- Expensive, professional-grade tools aimed at serious students
- Cheap apps that offered little meaningful feedback
There was a gap: an affordable app with professional-grade feedback.
Research Phase
Before writing code, I spent weeks researching:
- Interviewed vocal coaches about their training methods
- Studied music theory and vocal physiology
- Analyzed competitor apps (their strengths and weaknesses)
- Read research papers on pitch detection algorithms
Technical Architecture
Vocalio is built with:
- SwiftUI for the UI
- AVFoundation for audio capture
- Accelerate framework for DSP (Digital Signal Processing)
- Core ML for some analysis features
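These pieces connect through one audio pipeline: AVAudioEngine captures microphone buffers and hands them to the DSP code. A minimal sketch of that wiring (the class and `processBuffer` names are mine, not necessarily the app's):

```swift
import AVFoundation

// Minimal wiring: AVAudioEngine delivers microphone buffers to the analysis code.
// `processBuffer` stands in for the pitch/strain analysis described below.
final class AudioPipeline {
    private let engine = AVAudioEngine()

    func start(processBuffer: @escaping (AVAudioPCMBuffer) -> Void) throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)

        // The requested buffer size is advisory; the callback runs off the main thread
        input.installTap(onBus: 0, bufferSize: 256, format: format) { buffer, _ in
            processBuffer(buffer)
        }
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```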
Real-Time Pitch Detection
The core challenge was accurate pitch detection in real-time. I implemented an autocorrelation-based algorithm:
func detectPitch(buffer: AVAudioPCMBuffer) -> Float? {
    guard let channelData = buffer.floatChannelData?[0] else {
        return nil
    }
    let frames = Int(buffer.frameLength)

    // Apply Hann window to reduce spectral leakage
    var windowedSignal = [Float](repeating: 0, count: frames)
    vDSP_vmul(channelData, 1, hannWindow, 1, &windowedSignal, 1, vDSP_Length(frames))

    // Compute autocorrelation (zero-padded so every lag reads valid samples)
    let padded = windowedSignal + [Float](repeating: 0, count: frames)
    var autocorrelation = [Float](repeating: 0, count: frames)
    vDSP_conv(padded, 1, windowedSignal, 1, &autocorrelation, 1,
              vDSP_Length(frames), vDSP_Length(frames))

    // Restrict the search to lags for plausible vocal frequencies
    let minLag = Int(Float(sampleRate) / maxFrequency)
    let maxLag = Int(Float(sampleRate) / minFrequency)

    // Find the peak in the autocorrelation (excluding lag 0)
    var peakLag = minLag
    var peakValue: Float = 0
    for lag in minLag..<min(maxLag, frames) {
        if autocorrelation[lag] > peakValue {
            peakValue = autocorrelation[lag]
            peakLag = lag
        }
    }

    // Convert the winning lag back to a frequency
    return Float(sampleRate) / Float(peakLag)
}
This runs at 60 fps even on older iPhones.
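Once a frequency comes back from detectPitch, turning it into a note name for display is standard equal-temperament math (a sketch of one way to do it, using A4 = 440 Hz; the app's actual mapping may differ):

```swift
import Foundation

// Map a detected frequency to the nearest equal-tempered note name.
// MIDI note 69 is A4, and each semitone is a factor of 2^(1/12).
func noteName(for frequency: Float) -> String {
    let names = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]
    let midi = Int((12 * log2(Double(frequency) / 440) + 69).rounded())
    let octave = midi / 12 - 1
    return "\(names[midi % 12])\(octave)"
}
```

For example, `noteName(for: 440)` gives `"A4"` and `noteName(for: 261.63)` gives `"C4"`.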
Vibrato Analysis
Detecting vibrato required tracking pitch variations over time:
struct VibratoAnalysis {
    let rate: Float        // Oscillations per second (typically 5-7 Hz)
    let extent: Float      // Pitch variation in semitones
    let consistency: Float // How regular the oscillation is
}
func analyzeVibrato(pitchHistory: [Float]) -> VibratoAnalysis {
    // Apply FFT to pitch history to find oscillation frequency
    // Peak in 4-8 Hz range indicates vibrato
    // ...
}
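The FFT body is elided above; a simpler estimate of rate and extent can also be made directly from the pitch trace. This sketch is my own illustration, not the app's implementation: rate from zero crossings of the mean-removed trace, extent from the peak-to-peak swing in semitones.

```swift
import Foundation

// Estimate vibrato rate (Hz) and extent (semitones) from a pitch trace.
// `framesPerSecond` is how often the pitch tracker produces a sample.
func estimateVibrato(pitchHistory: [Float],
                     framesPerSecond: Float) -> (rate: Float, extent: Float)? {
    guard pitchHistory.count > 1 else { return nil }

    let mean = pitchHistory.reduce(0, +) / Float(pitchHistory.count)
    let centered = pitchHistory.map { $0 - mean }

    // Two sign changes per full oscillation
    var crossings = 0
    for i in 1..<centered.count where centered[i - 1].sign != centered[i].sign {
        crossings += 1
    }
    let duration = Float(pitchHistory.count) / framesPerSecond
    let rate = Float(crossings) / 2 / duration

    // Half the peak-to-peak swing, expressed in semitones
    guard let maxP = pitchHistory.max(), let minP = pitchHistory.min(),
          minP > 0 else { return nil }
    let extent = 12 * Float(log2(Double(maxP / minP))) / 2
    return (rate, extent)
}
```

A 6 Hz wobble of ±5 Hz around 440 Hz comes out at roughly rate 6 and extent 0.2 semitones, consistent with the 5-7 Hz range noted above.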
Strain Detection
This was tricky. Vocal strain manifests as:
- Increased noise in the signal (a lower HNR - Harmonic-to-Noise Ratio)
- Pitch instability
- Tightening of the formants
func detectStrain(buffer: AVAudioPCMBuffer, pitch: Float) -> Float {
    let hnr = calculateHNR(buffer)
    let jitter = calculatePitchJitter(recentPitches)
    let shimmer = calculateAmplitudeShimmer(buffer)

    // Weighted combination of the three markers, clamped to 0...1
    let strainScore = (1 - hnr) * 0.4 + jitter * 0.3 + shimmer * 0.3
    return min(1.0, max(0.0, strainScore))
}
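calculatePitchJitter isn't shown above. One common definition of jitter is the mean absolute frame-to-frame pitch change, normalized by the mean pitch; this sketch assumes that definition rather than reproducing the app's exact code:

```swift
// Jitter as mean absolute frame-to-frame pitch change divided by the
// mean pitch, so the result is dimensionless (0 for a perfectly steady tone).
func calculatePitchJitter(_ pitches: [Float]) -> Float {
    guard pitches.count > 1 else { return 0 }
    var totalChange: Float = 0
    for i in 1..<pitches.count {
        totalChange += abs(pitches[i] - pitches[i - 1])
    }
    let meanChange = totalChange / Float(pitches.count - 1)
    let meanPitch = pitches.reduce(0, +) / Float(pitches.count)
    return meanChange / meanPitch
}
```

A steady trace scores 0; an unstable one scores higher, which is what the weighted strain formula relies on.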
The 40+ Exercises
Each exercise was designed with specific goals:
Breathing exercises - Foundation of good vocal technique
- Diaphragmatic breathing (belly expansion, not chest)
- Breath stamina (sustained exhale)
- Rib cage expansion
Warm-up exercises - Preparing the voice safely
- Lip trills (reduces tension)
- Humming (engages resonance)
- Sirens (smooth transitions)
Dynamics exercises - Control over volume and intensity
- Messa di voce (soft → loud → soft)
- Sustained notes at different volumes
- Power bursts
Agility exercises - Speed and pitch accuracy
- Scales and arpeggios
- Octave jumps
- Quick pitch changes
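Keeping 40+ exercises manageable mostly comes down to a simple data model. Something like this (the type and field names are illustrative, not the app's actual model):

```swift
// Illustrative exercise model: category drives which analysis runs,
// and the premium flag gates the freemium split described later.
enum ExerciseCategory: String, CaseIterable {
    case breathing, warmup, dynamics, agility
}

struct Exercise {
    let name: String
    let category: ExerciseCategory
    let durationSeconds: Int
    let isPremium: Bool
}

let lipTrills = Exercise(name: "Lip trills", category: .warmup,
                         durationSeconds: 120, isPremium: false)
```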
Challenges Faced
Background Audio
Users wanted to practice while playing backing tracks. Implementing proper audio session management for simultaneous playback and recording was complex:
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.playAndRecord,
                             mode: .measurement,
                             options: [.defaultToSpeaker, .allowBluetooth])
try audioSession.setActive(true)
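Even a correctly configured session gets interrupted by phone calls or Siri, so the app also needs to observe the standard interruption notification and pause cleanly. A sketch of that handling (the exact pause/resume logic is app-specific):

```swift
import AVFoundation

// Pause on audio interruptions and reactivate the session when they end.
func observeInterruptions() {
    NotificationCenter.default.addObserver(
        forName: AVAudioSession.interruptionNotification,
        object: nil,
        queue: .main
    ) { notification in
        guard let info = notification.userInfo,
              let rawType = info[AVAudioSessionInterruptionTypeKey] as? UInt,
              let type = AVAudioSession.InterruptionType(rawValue: rawType)
        else { return }

        switch type {
        case .began:
            // Stop the engine and pause the exercise UI here
            break
        case .ended:
            // Reactivate the session and resume if appropriate
            try? AVAudioSession.sharedInstance().setActive(true)
        @unknown default:
            break
        }
    }
}
```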
Latency
Users are sensitive to latency when hearing their own voice. I minimized it by:
- Using the smallest viable buffer size (256 samples)
- Processing on a high-priority audio thread
- Avoiding any UI updates in the audio callback
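For context, the latency a buffer contributes is just its length divided by the sample rate, so 256 samples at 48 kHz is about 5.3 ms per direction (a quick check):

```swift
// Milliseconds of latency contributed by one buffer of audio
func bufferLatencyMs(samples: Int, sampleRate: Double) -> Double {
    Double(samples) / sampleRate * 1000
}
```

`bufferLatencyMs(samples: 256, sampleRate: 48_000)` is about 5.33, comfortably under the ~10-20 ms range where self-monitoring starts to feel off.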
Different Voice Types
A baritone and soprano have very different frequency ranges. The app detects voice type during onboarding and adjusts all exercises accordingly.
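Detection can be as simple as comparing the user's comfortable median pitch against typical ranges. The thresholds below are rough textbook approximations of my own, not the app's exact values, and real voices overlap these boundaries:

```swift
enum VoiceType: String {
    case bass, baritone, tenor, alto
    case mezzoSoprano = "mezzo-soprano"
    case soprano
}

// Rough classification from the median pitch (Hz) of an onboarding sample.
// Boundary values are approximate and overlap in practice.
func classifyVoice(medianPitch: Float) -> VoiceType {
    switch medianPitch {
    case ..<110:    return .bass
    case 110..<150: return .baritone
    case 150..<200: return .tenor
    case 200..<250: return .alto
    case 250..<300: return .mezzoSoprano
    default:        return .soprano
    }
}
```

The resulting type can then shift every exercise's target range up or down by the appropriate interval.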
Business Model
I chose freemium:
- Free: Basic exercises, limited analysis
- Premium: Full exercise library, advanced analytics, progress tracking
Results
Since launch:
- 5-star ratings from real users
- Positive feedback from vocal coaches who recommend it
- Growing user base
What's Next
Currently working on:
- Song practice mode (sing along with pitch guidance)
- Duet recording features
- More specialized exercise packs (public speaking, singing genres)