Captions now adorn more than a billion YouTube videos, and the video site is making sure to keep its text add-ons up to date. A blog post reveals that YouTube’s automatic captions can now recognize three common sound effects and include them in the caption track for hard-of-hearing viewers.
The sound effects in question are applause, laughter, and music, which YouTube chose to encode within its automatic caption system because they are common and because their meanings are relatively unambiguous. “While the sound space is obviously far richer and provides even more contextually relevant information than these three classes,” reads YouTube’s introductory blog post, “the semantic information conveyed by these sound effects in the caption track is relatively unambiguous, as opposed to sounds like [RING] which raises the question of ‘what was it that rang – a bell, an alarm, a phone?’”
YouTube was able to add sound effects to its automatic captions thanks to a deep neural network that powers the machine-learning process. If you’re the kind of person who can understand the technical details of that branch of programming (I, unfortunately, am not), you can glean more information about YouTube’s process by reading its blog post. The rest of us will have to be content to watch the automatic sound effect captions in action. They show up in the video below, offered by YouTube: