How to detect AI voices. Voice cloning and audio deepfakes.
A step-by-step guide for verifying suspicious audio. Written for journalists, investigators, compliance teams, and anyone who has been forwarded a voicemail that "felt off."
What is an AI voice?
An AI voice is human-sounding speech generated by a machine learning model rather than recorded from a person. The two common categories are voice cloning, which produces speech in the voice of a specific person, and generic synthesis, which produces speech in a stock voice that is not modeled on any one person. Both are increasingly hard to distinguish from real audio by ear alone.
The current generation of models (ElevenLabs v2, Resemble v3, PlayHT, OpenAI's TTS, and others) produces speech that is hard to tell from a recording, and some cloning systems need as little as three seconds of source audio.
Why AI voice scams are rising
Three reasons:
- The bar to clone a voice has collapsed. A few years ago, voice cloning required hours of clean studio audio and serious compute. Today it takes seconds of any audio (a TikTok, a voicemail) and a free trial of a hosted service.
- The attack surface is huge. Anyone who answers the phone is a target. CEO fraud, family ransom scams, election robocalls, and customer-service impersonation all use the same underlying technique.
- Detection has lagged. Until recently, synthetic-voice detection was research-grade only, and there was no consumer-accessible way to get a verdict on a suspicious clip.
Fast checklist: signs a voice may be AI
If you have 30 seconds and need to make a snap judgment, listen for these:
- Unnatural pauses. Real speech has irregular pauses for breath and thought. Synthesized speech often paces too evenly.
- Missing room tone. Real recordings carry background noise (HVAC, traffic, a TV). Synthesized audio is often too clean.
- Identical prosody across sentences. Real speakers vary cadence and pitch. Synthesized voices often hit the same rhythm twice in a row.
- Compressed dynamic range. Real voices get louder and softer. Synthesized voices often stay in a narrow band.
- "Plastic" sibilance. The "s" sounds in synthesized speech can have a metallic edge that real voices do not.
None of these alone is conclusive. All five together are a strong tell. The next section shows you how to verify quickly.
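Three of the cues above (pause regularity, room tone, dynamic range) can be roughly quantified. The sketch below is a listening aid only, not the detector's method. It assumes you have already decoded the clip to a mono array of float samples, and the function names, the 20 ms frame size, and the -40 dB silence threshold are all illustrative choices.

```python
import numpy as np


def frame_levels_db(samples, sample_rate, frame_ms=20):
    """Return per-frame RMS level in dB relative to the loudest frame."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        raise ValueError("clip is shorter than one frame")
    frames = np.asarray(samples[: n_frames * frame_len], dtype=np.float64)
    rms = np.sqrt(np.mean(frames.reshape(n_frames, frame_len) ** 2, axis=1))
    if rms.max() == 0:
        raise ValueError("clip is digital silence")
    return 20 * np.log10(np.maximum(rms, 1e-10) / rms.max())


def pause_lengths(levels_db, silence_db):
    """Length, in frames, of each run of consecutive below-threshold frames."""
    lengths, run = [], 0
    for quiet in levels_db < silence_db:
        if quiet:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def checklist_stats(samples, sample_rate, silence_db=-40.0):
    """Rough numbers for three checklist cues. Thresholds are assumptions."""
    levels = frame_levels_db(samples, sample_rate)
    pauses = pause_lengths(levels, silence_db)
    voiced = levels[levels >= silence_db]
    # Coefficient of variation of pause length: near 0 means very even pacing.
    pause_cv = float(np.std(pauses) / np.mean(pauses)) if len(pauses) >= 2 else None
    return {
        "pause_count": len(pauses),
        "pause_cv": pause_cv,
        # Spread between loud and soft voiced frames: small means a narrow band.
        "dynamic_range_db": float(np.percentile(voiced, 95) - np.percentile(voiced, 5)),
        # Level of the quietest frames: very low suggests missing room tone.
        "noise_floor_db": float(np.percentile(levels, 5)),
    }
```

A `pause_cv` near zero, a small `dynamic_range_db`, and a very low `noise_floor_db` line up with the checklist cues, but like the cues themselves they are hints and not a verdict.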
Step by step: how to verify suspicious audio
1. Save the audio file
If the audio came through a messaging app, download the original file. Do not screen-record. Screen recording re-encodes the audio, and the extra compression degrades the acoustic signature we rely on.
2. Run it through the detector
Drop the file at aivoicedetector.com/is-this-ai. The verdict appears in about half a second. You will see a probability, a model attribution (when we can identify the specific generator), and a confidence number.
3. Read the verdict carefully
A high confidence number means we are very sure. A confidence between 40% and 60% means we genuinely cannot tell, usually because the audio is heavily compressed or very short. In those cases, find a longer or cleaner sample.
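The reading rule above can be written down as a small helper. This is a sketch under assumptions: the function name and the 0.90 cutoff for "high" are hypothetical, and the guide does not say how to treat confidence values outside the two bands it names, so the sketch labels them "uncertain".

```python
def read_confidence(confidence, high=0.90):
    """Map a confidence fraction (0.0 to 1.0) to a reading.

    The 40% to 60% band comes from the guide. The `high` cutoff is an
    assumed value; the guide only says "a high confidence number".
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a fraction between 0 and 1")
    if 0.40 <= confidence <= 0.60:
        # The detector cannot tell: look for a longer or cleaner sample.
        return "inconclusive"
    if confidence >= high:
        return "confident"
    # Not covered by the guide; treated conservatively in this sketch.
    return "uncertain"
```

For example, `read_confidence(0.55)` falls in the inconclusive band, which is the cue to go back and find better source audio before drawing any conclusion.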
4. Cross-check
If the audio claims to be a specific person saying something specific, call that person on a verified number and ask. Detector verdicts are evidence, not proof. They join other evidence in a chain.
5. File the verdict if it matters
If you are publishing a story or making a legal decision, sign in and save the verdict to your dossier. Every saved verdict gets a permanent URL and an APA-format citation that you can include in a legal filing or hand to an editor.
Tools we recommend
- The web detector for one-off verification. Free, no account.
- The Chrome extension for in-browser audio (WhatsApp Web, YouTube, podcasts).
- The API for newsrooms and call centers processing audio at scale.
A note on real cases
We have analyzed audio from the 2024 election robocall campaigns, from the Hong Kong $25 million CFO deepfake fraud, and from dozens of journalist-submitted clips that turned out to be authentic recordings of real people saying real things. The detector is right roughly 99% of the time on clean audio. We tell you when we are not sure.
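A 99% figure still has to be weighed against how likely a clip was to be synthetic in the first place. The sketch below applies Bayes' rule under two loud assumptions: that "right roughly 99% of the time" means the detector catches 99% of AI clips and clears 99% of real ones, and that you can estimate a prior for the kind of audio you receive. The function and parameter names are illustrative.

```python
def probability_clip_is_ai(prior, sensitivity=0.99, specificity=0.99):
    """Chance a clip flagged as AI really is AI, by Bayes' rule.

    prior:       assumed fraction of clips like this one that are AI-generated
    sensitivity: assumed fraction of AI clips the detector flags
    specificity: assumed fraction of real clips the detector clears
    """
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must be a fraction between 0 and 1")
    true_positives = sensitivity * prior
    false_positives = (1.0 - specificity) * (1.0 - prior)
    if true_positives + false_positives == 0:
        raise ValueError("no clip would ever be flagged under these inputs")
    return true_positives / (true_positives + false_positives)
```

Under these assumptions, if only 1 clip in 100 is synthetic, a flag is right about half the time (`probability_clip_is_ai(0.01)` is about 0.5); at even odds it is right about 99% of the time. That gap is why a verdict joins other evidence and does not replace it.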
The detector returns a probability. It is not a verdict on the speaker. It is a verdict on the audio. Read it that way.