Listen, Decipher and Sign: Toward Unsupervised Speech-to-Sign Language Recognition

Liming Wang, Junrui Ni, Heting Gao, Jialu Li, Kai Chieh Chang, Xulin Fan, Junkai Wu, Mark Hasegawa-Johnson, Chang D. Yoo

Findings Paper: Speech and Multimodality

Session 7: Speech and Multimodality (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 12, Session 7 (15:00-16:30 UTC)
Spotlight Session: Metropolitan Centre
Conference Room: Metropolitan Centre
Conference Time: July 10, 19:00-21:00 (EDT) (America/Toronto)
Global Time: July 10, Spotlight Session (23:00-01:00 UTC)
Keywords: speech and vision, speech technologies, multimodality
TLDR: Existing supervised sign language recognition systems rely on an abundance of well-annotated data. Instead, an unsupervised speech-to-sign language recognition (SSR-U) system learns to translate between spoken and sign languages by observing only non-parallel speech and sign-language corpora.
Abstract: Existing supervised sign language recognition systems rely on an abundance of well-annotated data. Instead, an unsupervised speech-to-sign language recognition (SSR-U) system learns to translate between spoken and sign languages by observing only non-parallel speech and sign-language corpora. We propose speech2sign-U, a neural network-based approach capable of both character-level and word-level SSR-U. Our approach significantly outperforms baselines directly adapted from unsupervised speech recognition (ASR-U) models, by as much as 50% recall@10, on several challenging American Sign Language corpora with varying sample sizes, vocabulary sizes, and levels of audio and visual variability. The code is available at https://github.com/cactuswiththoughts/UnsupSpeech2Sign.git.
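The headline number is recall@10. For readers unfamiliar with the metric, below is a minimal sketch of the standard retrieval-style recall@k computation; the paper's exact evaluation protocol may differ, and the similarity matrix and names here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, k: int = 10) -> float:
    """Standard retrieval recall@k over a square similarity matrix.

    scores[i, j] is the model's similarity between speech query i and
    sign-language candidate j; the correct match for query i is assumed
    to sit on the diagonal (index i).
    """
    # Rank candidates for each query from most to least similar.
    ranking = np.argsort(-scores, axis=1)
    # A query is a hit if its true match appears among the top-k candidates.
    hits = (ranking[:, :k] == np.arange(len(scores))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 4 speech queries scored against 4 sign candidates.
rng = np.random.default_rng(0)
print(recall_at_k(rng.random((4, 4)), k=2))
```

Under this definition, recall@10 is the fraction of speech queries whose correct sign-language match ranks in the model's top ten candidates, which is why it is a natural headline metric for a translation-by-retrieval setup.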