Researchers from Google’s AI division DeepMind and the University of Oxford have used artificial intelligence to create the most accurate lip-reading software ever. Using thousands of hours of TV footage from the BBC, scientists trained a neural network to annotate video footage with 46.8 percent accuracy. That might not seem that impressive at first — especially compared to AI accuracy rates when transcribing audio — but tested on the same footage, a professional human lip-reader was only able to get the right word 12.4 percent of the time.
The research follows similar work published by a separate group at the University of Oxford earlier this month. Using related techniques, these scientists were able to create a lip-reading program called LipNet that achieved 93.4 percent accuracy in tests, compared to 52.3 percent human accuracy. However, LipNet was only tested on specially recorded footage that used volunteers speaking formulaic sentences. By comparison, DeepMind’s software (known as “Watch, Listen, Attend, and Spell”) was tested on far more challenging footage: natural, unscripted conversations from BBC politics shows.
More than 5,000 hours of footage from TV shows including Newsnight, Question Time, and the World Today were used to train DeepMind’s “Watch, Listen, Attend, and Spell” program. The videos included 118,000 different sentences and some 17,500 unique words, compared to just 51 unique words in LipNet’s test database.
DeepMind’s researchers suggest that the program could have a host of applications, including helping hearing-impaired people understand conversations. It could also be used to annotate silent films or allow you to control digital assistants like Siri or Alexa by just mouthing words to a camera (handy if you’re using the program in public).
But when most people learn that an AI program has learned how to lip-read, their first thought is how it might be used for surveillance. Researchers say that there’s still a big difference between transcribing brightly lit, high-resolution TV footage and transcribing grainy CCTV video with a low frame rate, but you can’t ignore the fact that artificial intelligence seems to be closing this gap.