[Industry] Referring to Screen Texts with Voice Assistants

Shruti Bhargava, Anand Dhoot, Ing-marie Jonsson, Hoang Long Nguyen, Alkesh Patel, Hong Yu, Vincent Renkens

Track: Industry Paper

Session 5: Industry (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 11, 16:15-17:45 (EDT) (America/Toronto)
Global Time: July 11, Session 5 (20:15-21:45 UTC)
Abstract: Voice assistants help users make phone calls, send messages, create events, navigate, and do a lot more. However, assistants have limited capacity to understand their users' context. In this work, we aim to take a step in this direction. Our work dives into a new experience for users to refer to phone numbers, addresses, email addresses, URLs, and dates on their phone screens. We focus on reference understanding, which is particularly interesting when, similar to visual grounding, there are multiple similar texts on screen. We collect a dataset and propose a lightweight, general-purpose model for this novel experience. Since consuming pixels directly is expensive, our system is designed to rely only on text extracted from the UI. Our model is modular, offering flexibility, better interpretability, and efficient runtime memory.
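
Illustration (not from the paper): to make the task concrete, the toy sketch below extracts candidate entities from screen text and resolves a spoken reference such as "call the second number" when several similar texts are on screen. Everything in it is an assumption made for illustration: the regex patterns, the keyword and ordinal tables, and the names `extract_entities` and `resolve` are hypothetical, and this rule-based approach stands in for, rather than reproduces, the model the abstract describes. It covers only three of the five entity types.

```python
import re
from dataclasses import dataclass

# Hypothetical, deliberately simple patterns (assumption, not the paper's detectors).
PATTERNS = {
    "phone": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://\S+"),
}

# Hypothetical cue words mapping an utterance to an entity type.
KEYWORDS = {"number": "phone", "phone": "phone", "email": "email",
            "link": "url", "url": "url", "website": "url"}

# Ordinal words resolved against top-to-bottom screen order.
ORDINALS = {"first": 0, "second": 1, "third": 2, "last": -1}


@dataclass
class Entity:
    kind: str   # entity type, a key of PATTERNS
    text: str   # matched text as it appears on screen
    line: int   # index of the screen line it was found on


def extract_entities(screen_lines):
    """Scan text extracted from the UI and return entities in screen order."""
    entities = []
    for i, line in enumerate(screen_lines):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                entities.append(Entity(kind, match.group(), i))
    return entities


def resolve(utterance, entities):
    """Return the referred entity, or None if unknown or ambiguous."""
    tokens = re.findall(r"[a-z]+", utterance.lower())
    kind = next((KEYWORDS[t] for t in tokens if t in KEYWORDS), None)
    candidates = [e for e in entities if e.kind == kind]
    index = next((ORDINALS[t] for t in tokens if t in ORDINALS), None)
    if index is None:
        # No ordinal cue: only unambiguous when exactly one candidate exists.
        return candidates[0] if len(candidates) == 1 else None
    if -len(candidates) <= index < len(candidates):
        return candidates[index]
    return None


screen = [
    "Front desk: 415-555-0132",
    "Billing: 415-555-0199",
    "Email support@example.com",
    "Visit https://example.com/help",
]
entities = extract_entities(screen)
target = resolve("call the second number", entities)
print(target.text if target else "ambiguous")
```

With two phone numbers on screen, "call the number" returns None in this sketch; that ambiguity among similar on-screen texts is the case the abstract singles out as interesting. A keyword table like this also confuses verbs with entity types (for example "email the last number"), which is one reason such a toy is no substitute for a learned model.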