
Can AI Detect Deepfake Video Calls in Real Time?


[Image: Split-screen video call showing a real person and a deepfake impersonator, with a detection overlay flagging synthetic regions]

Your CFO joins a Zoom call and asks the finance team to wire $25 million. The face looks right. The voice matches. Forty minutes later, the real CFO finds out nothing was scheduled. The Arup fraud in early 2024 unfolded exactly this way, and detection did not save them. No Zoom plugin flagged the deepfake. No audio analyzer caught the clone.

The obvious question follows: can AI detect deepfake video calls in real time? And if the tools exist, why did Arup lose $25 million?

Partially. Enterprise deepfake detection platforms exist and work under lab conditions. In live production calls, detection accuracy degrades as generation quality improves. Major consumer video platforms (Zoom, Microsoft Teams, Google Meet) do not ship reliable built-in detection in 2026.

For any high-stakes request arriving on a live call, process-based verification (callback on known numbers, pre-shared code words, out-of-context internal questions) remains more reliable than any real-time detection tool on the market.

Real-time deepfake detection is the automated analysis of video or audio streams during a live call to identify synthetic media before a human acts on the request. Detection systems look for signals across three layers:

  - Pixel-level artifacts left by GAN generation.
  - Physiological inconsistencies: irregular blink patterns, absent pulse signals in skin tone, and lip-sync mismatches.
  - Temporal anomalies, including frame-rate inconsistencies and compression artifacts.

Voice detection adds spectral analysis of clone fingerprints and prosodic irregularities in cadence and intonation. According to the 2024 Regula survey, 49 percent of businesses worldwide had experienced deepfake fraud, and detection vendors responded by shipping real-time plugins for enterprise video platforms. Accuracy varies widely: vendor claims of 95 percent detection rates apply to lab datasets of known generator outputs. Against a novel generator the detector has never seen, accuracy can drop below 60 percent, which is why production deployments rarely match vendor benchmarks.
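As a toy illustration of one physiological signal from the list above, a blink-rate check might look like the sketch below. All names are hypothetical, not any vendor's API; real systems fuse many such scores rather than relying on one.

```python
def blink_anomaly_score(blink_timestamps, call_duration_s,
                        plausible_rate=(0.1, 0.6)):
    """Return a 0..1 anomaly score: 0 = blink rate looks human.

    Humans blink roughly 6-30 times per minute; early GAN face-swaps
    often blinked rarely or not at all. Band limits are illustrative.
    """
    if call_duration_s <= 0:
        return 1.0
    rate = len(blink_timestamps) / call_duration_s  # blinks per second
    low, high = plausible_rate
    if low <= rate <= high:
        return 0.0
    # Distance outside the plausible band, clipped to 1.0.
    gap = (low - rate) if rate < low else (rate - high)
    return min(1.0, gap / high)
```

A stream that shows no blinks at all over a minute scores above zero, while a normal six-blinks-per-minute stream scores zero; a production detector would combine this with pulse, lip-sync, and pixel-level scores.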

What vendors offer real-time detection in 2026?


Reality Defender. Browser-based and API-based detection for video and audio. Claims around 95 percent accuracy on known generators, lower on novel models. Used by major banks for call center fraud screening.

Pindrop. Voice-focused detection with phoneme-level analysis. Strong on call center use cases. Has integrated with Zoom Contact Center.

Intel FakeCatcher. Uses photoplethysmography (detecting blood flow through skin tone changes) to identify real humans in video. Claims real-time detection but requires specific lighting conditions.

Microsoft Video Authenticator. Part of Microsoft's responsible AI tooling. Not packaged as a real-time Teams plugin as of 2026.

DuckDuckGoose, Sensity, Truepic. Smaller players with specific verticals such as identity verification and media forensics.

All of these target enterprise buyers, not consumer video call platforms. Pricing starts around $50,000 per year for mid-market deployments and scales with usage.
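FakeCatcher's core idea, recovering a pulse from subtle skin-tone changes, can be illustrated in a few lines. This is a toy sketch of remote photoplethysmography, not Intel's implementation: it estimates the dominant frequency of per-frame green-channel means in a face region and checks it against the human heart-rate band.

```python
import numpy as np

def estimated_pulse_bpm(green_means, fps=30.0):
    """Estimate pulse (beats per minute) from per-frame mean green
    values of a face region, via the dominant spectral peak in the
    plausible human heart-rate band (42-180 bpm)."""
    signal = np.asarray(green_means, dtype=float)
    signal = signal - signal.mean()            # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 3.0)     # 42-180 bpm
    if not band.any() or spectrum[band].max() == 0:
        return None                            # no usable peak
    return float(freqs[band][np.argmax(spectrum[band])] * 60.0)
```

A live face should produce a stable peak in that band; a synthetic face often does not. In practice compression and lighting corrupt the signal, which is one reason FakeCatcher requires controlled lighting conditions.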

The fundamental problem is speed. A detector that takes two seconds to analyze a video frame cannot keep up with a 30-fps live call.

Four technical constraints make real-time detection difficult in practice.

Computational cost. Running a deep learning classifier on every frame of video requires GPU resources that most video call infrastructure does not allocate per participant. Lighter classifiers trade accuracy for speed.

Novel generator adaptation. Detectors trained on 2023 GAN outputs often miss 2025 diffusion-model video. Retraining cycles run behind attacker innovation by weeks or months.

Compression and network artifacts. Legitimate video compression, packet loss, and codec variations produce artifacts that look similar to deepfake artifacts. False positive rates climb quickly.

Lighting and angle variability. Physiological detection methods (blink rate, pulse detection) depend on consistent lighting and front-facing camera angles. Real video calls often have neither.
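The arithmetic behind the speed and cost constraints is simple. This sketch (illustrative, not any vendor's scheduler) shows how per-frame inference latency forces frame sampling on a live call:

```python
import math

def sampling_stride(inference_ms, fps=30.0):
    """Minimum frame stride so a single detector keeps up with a live
    stream: a detector needing 2000 ms per frame on a 30 fps call can
    only score every 60th frame, so artifacts visible for under two
    seconds are never analyzed at all."""
    return max(1, math.ceil(inference_ms * fps / 1000.0))
```

A 10 ms classifier can score every frame; a 2-second forensic model sees one frame in sixty, which is the accuracy-for-speed trade described above.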

The Arup case illustrates the production gap. Multiple people on that call were synthetic. If reliable real-time detection existed at consumer scale, finance teams would have a Zoom plugin that flagged the fraud. No such plugin shipped. Arup implemented internal verification controls after the incident, not a technical detection fix.

What does real-time detection catch in practice?


Real-world performance depends heavily on who generated the deepfake.

Commodity tools such as DeepFaceLive or consumer face-swap apps are detected reasonably well. These tools produce identifiable artifacts and run on known generators.

Custom models trained on the target become harder to catch. An attacker who trains a dedicated model on an executive's public YouTube appearances can produce output that classifiers miss.

Real-time face-swaps running on modern GPUs are hard to detect reliably. Latency constraints force fast classifiers with lower accuracy.

Voice cloning using VALL-E, ElevenLabs, and consumer text-to-speech services is more tractable than video detection. Phoneme analysis and spectral fingerprinting work better against audio than pixel-based methods against video.
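As a flavor of why audio is more tractable: frame-level spectral features are cheap to compute and discriminative. Below is a toy example of one such feature, spectral flatness, assuming nothing about any vendor's pipeline; clone detectors combine many features like it with learned classifiers.

```python
import numpy as np

def spectral_flatness(frame):
    """Geometric mean over arithmetic mean of the magnitude spectrum.

    Values near 1 indicate noise-like spectra; low values indicate
    tonal structure. Unnatural flatness trajectories across frames
    are one cue that audio was synthesized.
    """
    mag = np.abs(np.fft.rfft(np.asarray(frame, dtype=float))) + 1e-12
    return float(np.exp(np.mean(np.log(mag))) / np.mean(mag))
```

A pure tone scores far lower than broadband noise, and the per-frame cost is a single FFT, which is why audio screening fits real-time budgets more easily than pixel-level video forensics.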

The pattern across vendor benchmarks and independent evaluations is consistent: detection tools perform well against the generators they were trained on and poorly against everything else. The gap closes slowly because training data for the latest generators is always limited.

Every security team that has reviewed a deepfake incident reaches the same conclusion: detection is a secondary defense. Primary defense is procedural.

Callback verification on a known number defeats a deepfake regardless of its quality. A pre-shared code word agreed on a different channel cannot be cloned. An out-of-context question about internal non-public information will trip up an impersonator working from public research.

These controls do not require AI. They do not degrade as generators improve. They work on consumer Zoom calls where no detection plugin is installed. They also extend naturally to vishing calls, BEC emails, and AI-powered phishing, which attackers increasingly combine with deepfakes in multi-channel campaigns.

For the underlying attack mechanics and why deepfakes succeed beyond the technical challenge, see our deepfake social engineering guide.

Real-time detection is a reasonable layer for three specific scenarios.

Call centers handling high-volume verification. Banks, insurance companies, and identity verification services process thousands of calls per day. Automated detection catches a meaningful share of commodity fraud attempts before they reach human agents.

Media and newsroom operations. News organizations verifying submitted video benefit from forensic detection on pre-recorded content, which is materially more accurate than real-time live call detection.

Executive protection programs. Organizations with named high-value targets can justify the cost of a detection layer on video calls involving those individuals. The detection does not replace procedural verification; it supplements it.

For most organizations, the ROI on a $50,000 to $200,000 annual detection platform does not beat the ROI on training finance, HR, and executive assistants to execute callback verification every time. Detection tooling is expensive to scale across every call; verification behavior scales through policy at almost no marginal cost.

Four actions matter more than buying a detection plugin.

  1. Write callback verification into every financial and credential-related process. No exceptions, even when the request looks routine.

  2. Issue rotating code words to executives and their assistants. Update weekly and distribute through a channel separate from the calls being verified.

  3. Train employees to expect deepfakes on live calls. Show convincing examples so the baseline assumption becomes: the face and voice do not authenticate the speaker.

  4. Build a low-friction reporting path. Employees who felt a call was “off” but cannot articulate why should have a one-click way to flag it for security review.
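The rotating code word in step 2 could be derived deterministically instead of distributed by hand. The sketch below is a hypothetical scheme, not a product feature: both parties hold a secret shared out of band and compute the week's word locally, so nothing travels near the calls being verified.

```python
import hashlib
import hmac

# Illustrative wordlist; a real deployment would use a much larger one.
WORDLIST = ["amber", "basalt", "cobalt", "delta",
            "ember", "falcon", "garnet", "harbor"]

def weekly_code_word(shared_secret: bytes, iso_year: int, iso_week: int) -> str:
    """Derive this week's code word by HMAC-ing the ISO week number
    with a secret that was shared on a separate channel."""
    msg = f"{iso_year}-W{iso_week:02d}".encode()
    digest = hmac.new(shared_secret, msg, hashlib.sha256).digest()
    index = int.from_bytes(digest[:4], "big") % len(WORDLIST)
    return WORDLIST[index]
```

Executive and assistant compute the same word independently each week; an attacker working from public research has no path to it, regardless of how good the deepfake is.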

Interactive practice matters more than any slide deck here. Our Whaling With A Deepfake exercise puts employees in a realistic deepfake video call scenario modeled on the Arup case, so the verification reflex is rehearsed before a real attack lands.

Can AI detect deepfake video calls in real time? Some of it, some of the time, if you buy the right enterprise platform and the attacker uses a generator your platform has seen before.

That is not a strategy. The organizations handling deepfake risk well in 2026 treat detection as a useful supplement and procedural verification as the actual defense. Attackers will always have access to generators the classifiers have not seen yet. They will not always have access to your pre-shared code words.


Train your team to verify identity when faces and voices cannot be trusted. Try our free Whaling With A Deepfake exercise based on the $25 million Arup fraud, or explore the full security awareness training catalogue for more hands-on exercises.