By Chris Quirk, Carnegie Mellon University, April 23, 2022
As more people watch movies, edit video, read the news and keep up with social media on their smartphones, these devices have grown to accommodate the bigger screens and higher processing power needed for more demanding activities.
The problem with unwieldy phones is that they frequently require a second hand or voice commands to operate, which can be cumbersome and inconvenient.
In response, researchers in the Future Interfaces Group at Carnegie Mellon University's Human-Computer Interaction Institute (HCII) are developing a tool called EyeMU, which allows users to execute operations on a smartphone by combining gaze control and simple hand gestures.
"We asked the question, 'Is there a more natural mechanism to use to interact with the phone?' And the precursor for a lot of what we do is to look at something," said Karan Ahuja, a doctoral student in human-computer interaction.
Gaze analysis and prediction aren't new, but achieving an acceptable level of functionality on a smartphone would be a noteworthy advance.
"The eyes have what you would call the Midas touch problem," said Chris Harrison, an associate professor in the HCII and director of the Future Interfaces Group. "You can't have a situation in which something happens on the phone everywhere you look. Too many applications would open."
Software that tracks the eyes with precision can solve this problem. Andy Kong, a senior majoring in computer science, had been interested in eye-tracking technologies since he first came to CMU. He found commercial versions pricey, so he wrote a program that used a laptop's built-in camera to track the user's eyes, which in turn moved the cursor around the screen—an important early step toward EyeMU.
CMU researchers show how gaze estimation using a phone’s user-facing camera can be paired with motion gestures to enable a rapid interaction technique on handheld phones. Credit: Carnegie Mellon University
"Current phones only respond when we ask them for things, whether by speech, taps or button clicks," Kong said. "If the phone is widely used now, imagine how much more useful it would be if we could predict what the user wanted by analyzing gaze or other biometrics."
It wasn't easy to streamline the software so it could run fast enough on a smartphone.
"That's a resource constraint. You must make sure your algorithms are fast enough," Ahuja said. "If it takes too long, your eye will move along."
Kong, the paper's lead author, presented the team's findings with Ahuja, Harrison and HCII Assistant Professor Mayank Goel at the 2021 International Conference on Multimodal Interaction. Having a peer-reviewed paper accepted to a major conference was a huge achievement for Kong, an undergraduate researcher.
Kong and Ahuja advanced Kong's early laptop prototype by using Google's Face Mesh tool to study the gaze patterns of users looking at different areas of the screen and render the mapping data. Next, the team developed a gaze predictor that uses the smartphone's front-facing camera to lock in what the viewer is looking at and register it as the target.
The team made the tool more productive by combining the gaze predictor with the smartphone's built-in motion sensors to enable commands. For example, a user could look at a notification long enough to secure it as a target, then flick the phone to the left to dismiss it or to the right to respond to it. Similarly, a user might pull the phone closer to enlarge an image or move the phone away to disengage the gaze control, all while holding a large latte in the other hand.
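For readers who want a concrete picture of the look-then-flick idea, here is a minimal Python sketch. It is not the EyeMU code. The names `GazeLock`, `classify_flick` and `handle_sample`, the half-second dwell time, the gyroscope threshold and the sign convention for left and right are all assumptions made for illustration. It shows only the general logic: a command fires when a gaze target has been held long enough and a flick arrives, never from one of the two alone.

```python
from dataclasses import dataclass
from typing import Optional

DWELL_SECONDS = 0.5    # assumed: how long gaze must rest on an item to lock it
FLICK_THRESHOLD = 2.0  # assumed: angular velocity (rad/s) that counts as a flick


@dataclass
class GazeLock:
    """Hypothetical helper: locks a target once gaze has rested on it long enough."""
    dwell_seconds: float = DWELL_SECONDS
    candidate: Optional[str] = None
    since: float = 0.0
    locked: Optional[str] = None

    def update(self, target: Optional[str], t: float) -> Optional[str]:
        if target != self.candidate:
            # Gaze moved to a different item: restart the timer and drop any lock.
            self.candidate, self.since, self.locked = target, t, None
        elif target is not None and t - self.since >= self.dwell_seconds:
            self.locked = target
        return self.locked


def classify_flick(gyro_y: float, threshold: float = FLICK_THRESHOLD) -> Optional[str]:
    """Assumed convention: negative rotation is a left flick, positive is right."""
    if gyro_y <= -threshold:
        return "left"
    if gyro_y >= threshold:
        return "right"
    return None


def handle_sample(lock: GazeLock, target: Optional[str], t: float,
                  gyro_y: float) -> Optional[str]:
    """Combine one gaze estimate and one gyroscope reading into a command."""
    locked = lock.update(target, t)
    flick = classify_flick(gyro_y)
    if locked is None or flick is None:
        return None  # looking alone, or flicking alone, triggers nothing
    action = "dismiss" if flick == "left" else "respond"
    return f"{action}:{locked}"
```

Requiring both signals is what sidesteps the Midas touch problem Harrison describes: a glance at a notification does nothing until the phone is also flicked.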
"The big tech companies like Google and Apple have gotten pretty close with gaze prediction, but just staring at something alone doesn't get you there," Harrison said. "The real innovation in this project is the addition of a second modality, such as flicking the phone left or right, combined with gaze prediction. That's what makes it powerful. It seems so obvious in retrospect, but it's a clever idea that makes EyeMU much more intuitive."