The three to know
- Shazam — founded 1999 (UK), acquired by Apple in 2018
- Chris Barton, Philip Inghelbrecht, and Dhiraj Mukherjee founded the company in London in 1999, with Avery Wang (Stanford PhD) building the core technology. The service launched commercially in the UK in 2002 via the short code "2580". The iPhone app shipped with the App Store on July 10, 2008. Apple announced the acquisition, reported at about US$400M (£300M), in December 2017; it closed on September 24, 2018. Shazam has logged more than 100 billion lifetime recognitions. It is the consumer juggernaut, but it offers no official public API for external services.
- ACRCloud — China-based
- 150M+ track reference database. Broadcast monitoring, humming recognition, offline SDK, Speech-to-Text, and more. Free 14-day trial (no card required), then metered pricing. Widely adopted by automated VJ tools, DJ-mix identification services, and radio monitoring. autovj.club uses ACRCloud.
- AudD — open API
- Token-only authentication (no HMAC), public pricing (US$5 per 1,000 requests after a 300-request free quota), long-file support, on-prem deploy option. A natural choice for side projects, prototypes, and individual developers.
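To make the AudD pricing above concrete, here is a minimal cost sketch. It assumes the 300-request free quota is a one-time allowance and that billing is pro rata per request rather than in blocks of 1,000; neither detail is confirmed here, and the function name is hypothetical.

```python
def audd_cost_usd(requests: int, free_quota: int = 300,
                  usd_per_1000: float = 5.0) -> float:
    """Estimate AudD spend from the published figures.

    Assumptions (not confirmed by the source): the free quota is a
    one-time allowance and requests are billed pro rata.
    """
    billable = max(0, requests - free_quota)
    return billable * usd_per_1000 / 1000


# A 4-hour test at one request per 60 seconds is 240 requests,
# which stays inside the free quota.
print(audd_cost_usd(240))   # 0.0
print(audd_cost_usd(1300))  # 5.0
```

Under these assumptions a prototype can run for several hours of venue time before any charge applies.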
How fingerprinting works
The baseline algorithm — Avery Wang's 2003 paper — still underpins the industry. Simplified: (1) run a Short-Time Fourier Transform (STFT) to get a spectrogram, (2) extract local peaks in time-frequency space, (3) hash pairs of nearby peaks, (4) compare hashes against the reference DB and return the track with the most consistent time-offset alignment.
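Steps (3) and (4) can be sketched in a few lines. This is a simplified illustration, not any service's actual implementation: it assumes the peaks from steps (1) and (2) are already available as (time frame, frequency bin) pairs, and the function names, the `max_dt` window, and the sample data are all made up for the example.

```python
from collections import Counter, defaultdict

Peak = tuple[int, int]  # (time_frame, frequency_bin)


def hash_pairs(peaks, max_dt=10):
    """Yield ((f1, f2, dt), anchor_time) for each pair of nearby peaks."""
    peaks = sorted(peaks)
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # peaks are time-sorted, so later ones are farther
            if dt > 0:
                yield (f1, f2, dt), t1


def build_index(tracks):
    """Map each hash to the (track_id, anchor_time) places it occurs."""
    index = defaultdict(list)
    for track_id, peaks in tracks.items():
        for h, t in hash_pairs(peaks):
            index[h].append((track_id, t))
    return index


def match(query_peaks, index):
    """Return (track_id, offset, votes) for the most consistent alignment."""
    votes = Counter()
    for h, t_query in hash_pairs(query_peaks):
        for track_id, t_ref in index.get(h, []):
            votes[(track_id, t_ref - t_query)] += 1
    if not votes:
        return None
    (track_id, offset), count = votes.most_common(1)[0]
    return track_id, offset, count


# Hypothetical reference tracks (peaks invented for illustration).
tracks = {
    "track_a": [(0, 10), (2, 40), (5, 25), (7, 60), (9, 15), (12, 33)],
    "track_b": [(0, 12), (3, 40), (4, 22), (8, 61), (10, 15), (13, 30)],
}
index = build_index(tracks)

# An excerpt of track_a starting at frame 2, plus one noise peak (4, 99).
query = [(0, 40), (3, 25), (4, 99), (5, 60), (7, 15)]
print(match(query, index))  # ('track_a', 2, 6)
```

In this toy data, track_b shares one hash with the query by coincidence, but only track_a collects many votes at a single time offset, which is what the alignment step rewards.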
Because the fingerprint is built only from the strongest spectrogram peaks, which tend to survive added noise and distortion, venue noise and speaker distortion rarely break recognition. The dominant factor for accuracy is mic placement. With the mic within 1 to 2 meters of a speaker, a 60-second Auto Identify cadence typically achieves a hit rate of 70% or better. Beyond 5 m, or close to an HVAC vent, no service will hold up.
Which service fits which use case
- Automated VJ (track → preset switching in a venue)
- ACRCloud is the de facto choice. Its 150M-track coverage reliably picks up J-POP, anime tracks, and niche electronic releases. autovj.club uses it. A 60-second recognition interval is the cost/accuracy sweet spot.
- Personal app or side project
- AudD is the easiest start. A 300-request free quota plus published $5/1,000 pricing makes cost predictable at prototype scale.
- Consumer "name that song" app
- Shazam has no public API, so building this experience yourself means ACRCloud or AudD. That said, the Shazam consumer UX is its own product — if you truly just need "name the song," the Shazam app is the right answer.
- Broadcast / streaming monitoring
- ACRCloud's broadcast-monitoring product, or dedicated services like BMAT or MediaGuide. Used for rights-reporting and piracy-detection workflows.
Operating notes for automated VJ
Three choices shape operating quality: (1) recognition interval (60 seconds is the sweet spot; shrinking to 15–30 seconds barely improves hits but multiplies cost by 2× to 4×), (2) mic placement (1–2 m from a speaker, away from HVAC), (3) fallback on recognition failure (hold the previous preset or revert to a default).
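Choices (1) and (3) can be expressed as a small sketch. The function names and the `hold_previous` flag are hypothetical, and the cost comparison assumes spend scales linearly with request count.

```python
from typing import Optional


def requests_per_hour(interval_s: float) -> float:
    """Recognition requests issued per hour at a fixed interval."""
    return 3600 / interval_s


def choose_preset(matched_preset: Optional[str],
                  previous: Optional[str],
                  default: str,
                  hold_previous: bool = True) -> str:
    """Pick the preset to show after one recognition attempt.

    matched_preset is None when recognition failed. On failure, either
    hold the previous preset or revert to the default.
    """
    if matched_preset is not None:
        return matched_preset
    if hold_previous and previous is not None:
        return previous
    return default


# Relative request volume versus the 60-second baseline.
for interval in (60, 30, 15):
    ratio = requests_per_hour(interval) / requests_per_hour(60)
    print(interval, requests_per_hour(interval), ratio)
```

At 30 seconds the request volume doubles and at 15 seconds it quadruples, so the interval is the main cost lever.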
The "after you recognize a track" layer is a separate system: genre inference (house / hip hop / pop / anime), BPM detection, lyric retrieval (via LRCLIB and similar), each driving preset or overlay switches. Stacking those layers on top of recognition is the emerging default pattern for automated VJ.
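One way to picture the layering is a function that takes whatever the downstream systems managed to infer about a recognized track and turns it into display decisions. This is only a sketch: `TrackInfo`, the preset names, and the `GENRE_PRESETS` mapping are hypothetical, and the genre, BPM, and lyric lookups themselves are assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrackInfo:
    title: str
    genre: Optional[str] = None      # from a genre-inference layer
    bpm: Optional[float] = None      # from a BPM-detection layer
    has_synced_lyrics: bool = False  # from a lyric-retrieval layer


# Hypothetical genre-to-preset mapping; the names are placeholders.
GENRE_PRESETS = {
    "house": "strobe_grid",
    "hip hop": "bold_type",
    "pop": "color_wash",
    "anime": "cel_shade",
}


def plan_visuals(track: TrackInfo, default_preset: str = "ambient") -> dict:
    """Combine the per-layer results into one set of display decisions."""
    preset = GENRE_PRESETS.get(track.genre, default_preset)
    # One beat lasts 60 / BPM seconds; None when BPM is unknown.
    pulse_s = 60.0 / track.bpm if track.bpm else None
    return {
        "preset": preset,
        "pulse_interval_s": pulse_s,
        "lyrics_overlay": track.has_synced_lyrics,
    }


print(plan_visuals(TrackInfo("Example", genre="house", bpm=120.0,
                             has_synced_lyrics=True)))
```

Each layer can fail independently, so every field is optional and the function falls back to a default preset when a layer returns nothing.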