The annual IEEE conference on virtual reality took place in Minneapolis last week. It was a unique opportunity to meet some of the leading VR researchers in the world, to showcase new product innovations and to exchange views on the present and future of VR.
I had the pleasure of sharing the stage in "the battle of the HMDs" panel session at the conference, together with David A. Smith, Chief Innovation Officer for Lockheed Martin; Stephen Ellis, who leads the Advanced Displays and Spatial Perception Laboratory at NASA; and Dr. Jason Jerald of NextGen Interactions.
Below are a (slightly edited) version of my slide and a free-form version of the accompanying text. The audience was primarily VR researchers, so if one thinks of "R&D" as "Research and Development", this talk was aimed more at the research side than the development side.
I believe that there are three layers to what I call the "HMD value pyramid": baseline technology, sensing and context. As one would expect, the pyramid cannot stand without its baseline technology, which we will discuss shortly, but once that baseline exists, additional layers of value build upon it. While the baseline technologies are mandatory, the real value, in my opinion, is in the layers above them. This is where I am hoping the audience will focus their research: making these layers work, and then developing methods and algorithms to make these capabilities affordable and thus widespread.
There are several components that form the baseline of the VR visual experience:
- Display(s)
- Optics that adapt the displays to the appropriate viewing distance and provide the desired field of view, eye relief and other optical qualities.
- Ergonomics: a way to wear these optics and displays comfortably on the head, accounting for different head sizes and facial structures, and to quickly adjust them to an optimal position
- Wireless video, which allows disconnecting an HMD from a host computer, thus allowing freedom of motion without risk of cable entanglement
- Processing power, whether for the simple task of controlling the displays, for calculation-intensive work such as distortion correction (see the sketch just after this list), or ultimately for running applications entirely inside the HMD without the need to connect to an external computing device.
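To make the distortion-correction item above a bit more concrete, here is a small, hypothetical Python sketch of radial ("barrel") pre-distortion of texture coordinates, which is one common way to cancel the pincushion distortion of typical HMD optics. The function name and the coefficients k1 and k2 are illustrative placeholders; real values come from calibrating the specific lens design.

```python
import numpy as np

def barrel_predistort(uv, center=(0.5, 0.5), k1=0.22, k2=0.24):
    """Map normalized screen coordinates to pre-distorted texture coordinates.

    uv: array of shape (N, 2) with values in [0, 1].
    k1, k2: placeholder radial distortion coefficients (lens-specific).
    """
    uv = np.asarray(uv, dtype=float)
    d = uv - center                              # offset from the lens center
    r2 = np.sum(d * d, axis=1, keepdims=True)    # squared radial distance
    scale = 1.0 + k1 * r2 + k2 * r2 * r2         # polynomial radial scaling
    return center + d * scale

# The image center stays put; points near the corners are pushed outward.
print(barrel_predistort([[0.5, 0.5], [0.9, 0.9]]))
```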
There will clearly continue to be many improvements in these components. We will see higher-resolution and faster displays. We will continue to see innovative optical designs (as Sensics is showing in the exhibit outside). We will continue to see alternative displays such as pico projectors. But fundamentally, we can now deliver a reasonably good visual experience at a reasonably good price. Yes, just as with cars or audio systems or airplane seats or wedding services, there are different experience levels at different price levels, but I think these topics are moving from a 'research' focus to a 'development' focus.
Once the underlying technologies of the HMD are in place, we can move to the next layer, which I think is more interesting and more valuable: the sensory layer. I've spoken and
written about this before: beyond a head-worn display, the HMD is a platform. It is a computing platform but it is first and foremost a sensory platform that is uniquely positioned to gather real-time information about the user. Some of the key sensors:
- Head orientation sensors (yaw/pitch/roll) that have become commonplace in HMDs
- Head position sensors (X/Y/Z)
- Position and orientation sensors for other body parts such as arms or legs
- Sensors to detect hands and fingers
- Eye tracking which provides real-time reporting of gaze direction
- Biometric sensors - heart rate, skin conductivity, EEG
- Outward-facing cameras that can provide a real-time image of the surroundings (whether visible, IR or depth)
- Inward-facing cameras that might provide clues about facial expressions
Each of these sensors is at a different stage of technical maturity. Head orientation sensors, for instance, cost thousands of dollars just a few years ago. Today, an orientation sensor can be had for a few dollars and is much more powerful than those of the past: tracking accuracy has improved; predictive tracking is sometimes built in; tremor cancellation, gravity-direction sensing and magnetic-north sensing are often included; and reporting rates keep getting higher.
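As an illustration of what "predictive tracking" means in practice, here is a minimal sketch assuming the sensor reports Euler angles (yaw/pitch/roll) and angular velocities; real trackers typically work with quaternions and filtered velocity estimates, so treat this only as the basic idea of extrapolating ahead by the expected rendering latency.

```python
def predict_orientation(angles_deg, angular_velocity_dps, latency_s=0.020):
    """Extrapolate each angle forward by latency_s seconds.

    angles_deg: last measured (yaw, pitch, roll) in degrees.
    angular_velocity_dps: angular velocity in degrees per second.
    """
    return [a + w * latency_s for a, w in zip(angles_deg, angular_velocity_dps)]

# Head turning at 90 deg/s: with 20 ms of latency, render the scene
# 1.8 degrees ahead of the last measured yaw to hide the lag.
print(predict_orientation([10.0, 0.0, 0.0], [90.0, 0.0, 0.0]))
```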
HMD eye-tracking sensors are further behind on the development curve. Yes, it is possible to buy excellent HMD-based eye trackers for $10K-$20K, but at those prices only a few can afford them. What would it take for a "good enough" eye tracker to follow the price curve of the orientation tracker?
HMD-based hand and finger sensors are probably even farther behind in terms of robustness, responsiveness, detection field and analysis capabilities.
All these sensors could bring tremendous benefits to the user experience, to the ability of the application to effectively serve the user, or even to the ability of remote users to communicate naturally with each other while wearing HMDs. I think the challenge to this audience is to advance these frontiers: make these sensors work; make them work robustly (e.g. across many users and in many different conditions, not just in the lab); and then make them in such a way that they can be mass-produced inexpensively. Whether the required breakthroughs are in new types of sensing elements or in new computational algorithms is up to you to decide, but I can't overstate how important sensors are beyond the basic capabilities of HMDs.
Once sensors have been figured out, context is the next and ultimate frontier. Context takes data from one or more sensors and combines it into information. It gives the application a high-level cue about what is going on: what the user is doing, where the user is, or what is going to happen next.
For instance, it's nice to know where my hand is, but tracking the hand over time might indicate that I am drawing a "figure 8" in the air. Or maybe that my hands are positioned to signal a "time out". Or maybe, as in the Microsoft patent filing image above, that the hand is brought close to the ear to signal that I would like to increase the volume. That "louder" gesture doesn't work if the hand is 50 cm from the head. Detecting it takes an understanding of where the hand is relative to the head, which is why I see it as a higher level of information than the raw positional data of the head and hand alone.
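As a toy illustration of how raw positions become context, here is a hypothetical Python sketch that only recognizes the "louder" gesture when the hand is close to the head; the 0.25 m threshold and the coordinate conventions are assumptions made for the example, not values from the talk.

```python
import math

def is_louder_gesture(head_pos, hand_pos, threshold_m=0.25):
    """Return True when the hand is within threshold_m meters of the head.

    head_pos, hand_pos: (x, y, z) positions in meters in a shared
    tracking coordinate frame.
    """
    return math.dist(head_pos, hand_pos) < threshold_m

# Hand roughly 15 cm from the head: interpreted as the "louder" gesture.
print(is_louder_gesture((0.0, 1.7, 0.0), (0.12, 1.78, 0.05)))
# Hand 50 cm away: just positional data, no gesture.
print(is_louder_gesture((0.0, 1.7, 0.0), (0.5, 1.7, 0.0)))
```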
Additional examples of context derived from multiple sensors: the user is walking, or jumping, or excited (inferred from biometric data and pupil size), or smiling, or scared. The user is about to run into the sofa. The user is next to Joe. The user is holding a toy gun and is aiming at the window.
Sometimes, there are many ways to express the same thing. Consider a "yes/no" dialog box in Windows. The user might click on "yes" with the mouse, or "tab" over to the "yes" button and hit space, or press Alt-Y, or say "yes", and there are perhaps a few other ways to achieve the same result. Similarly in VR, the user might speak "yes", or nod her head up and down in a "yes" gesture, or give a thumbs-up sign, or touch a virtual "yes" button in space. Context enables the multi-modal interface that focuses on "what" you are trying to express as opposed to exactly "how" you are expressing it.
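One way to picture such a multi-modal interface is a thin mapping layer that collapses modality-specific events into a single abstract intent; the event and modality names below are invented for the example.

```python
# Events that all express the same "yes" answer, regardless of modality.
CONFIRM_EVENTS = {
    ("speech", "yes"),
    ("head_gesture", "nod"),
    ("hand_gesture", "thumbs_up"),
    ("virtual_button", "yes"),
}

def to_intent(modality, value):
    """Collapse a modality-specific event into a high-level intent."""
    return "CONFIRM" if (modality, value) in CONFIRM_EVENTS else "UNKNOWN"

# The application only sees the "what" (CONFIRM), not the "how".
for event in [("speech", "yes"), ("head_gesture", "nod"), ("hand_gesture", "wave")]:
    print(event, "->", to_intent(*event))
```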
Context, of course, requires a lot of research. Which sensors are available? How much can their data be trusted? How can we minimize training? How can we reduce false negatives and false positives? This is yet another great challenge for this community.
In summary, we live in exciting times for the VR world, and we hope that you can join us on the journey up the HMD value pyramid.