
Sunday, June 11, 2017

How does eye tracking work?

Eye tracking could become a standard peripheral in VR/AR headsets. Tracking gaze direction can deliver many benefits. Foveated rendering, for instance, uses eye tracking data to optimize GPU resources: higher-resolution images are shown in the central vision area and lower-resolution images outside it. Understanding gaze direction can also lead to more natural interaction. Additionally, people with certain disabilities can use their eyes instead of their hands. Eye tracking can detect concussions in athletes and can even help people see better. It can also help advertisers understand what interests customers.

Eye tracking is complex. Scientists and vendors have spent many years perfecting its algorithms and techniques.

But how does it work? Let's look at a high-level overview.

Most eye tracking systems use a camera pointing at the eye together with infrared (IR) light. The IR light illuminates the eye, and a camera sensitive to IR analyzes the reflections. The wavelength of the light is often 850 nanometers, just outside the visible spectrum of roughly 390 to 700 nanometers: the eye can't detect the illumination, but the camera can.

We see the world when our retina detects light entering through the pupil. IR light also enters the eye through the pupil. Outside the pupil area, light does not enter the eye; instead, it reflects back towards the camera. Thus, the camera sees the pupil as a dark area - no reflection - whereas the rest of the eye is brighter. This is "dark pupil eye tracking". If the IR light source is near the optical axis, it can reflect from the back of the eye, and the pupil then appears bright. This is "bright pupil eye tracking", similar to the "red eye" effect in flash photography. Whether we use dark or bright pupil, the key point is that the pupil looks different than the rest of the eye.

The image captured by the camera is then processed to determine the location of the pupil, which allows estimating the direction of gaze of the observed eye. Processing is sometimes done on a PC, phone or other connected processor; other vendors have developed special-purpose chips that offload the processing from the main CPU. If eye tracking cameras observe both eyes, one can combine the gaze readings from both eyes. This allows estimating the fixation point of the user in real or virtual 3D space.
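
As a rough illustration of dark-pupil detection, here is a minimal sketch in Python that thresholds an IR image and takes the centroid of the dark pixels as the pupil center. The threshold value and the synthetic test image are assumptions for illustration only; real trackers use far more robust methods such as ellipse fitting and outlier rejection.

```python
import numpy as np

def find_pupil_center(ir_image: np.ndarray, dark_threshold: int = 40):
    """Estimate the pupil center in a grayscale IR eye image.

    Dark-pupil assumption: the pupil reflects almost no IR back to the
    camera, so it appears as the darkest blob in the frame.
    """
    mask = ir_image < dark_threshold          # candidate pupil pixels
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None                           # no dark blob found
    # Centroid of the dark region as a crude pupil-center estimate.
    return float(xs.mean()), float(ys.mean())

# Synthetic 200x200 "eye" image: bright iris/sclera, dark pupil at (120, 90).
frame = np.full((200, 200), 180, dtype=np.uint8)
yy, xx = np.mgrid[0:200, 0:200]
frame[(xx - 120) ** 2 + (yy - 90) ** 2 < 20 ** 2] = 10
print(find_pupil_center(frame))               # approximately (120.0, 90.0)
```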

There are other eye tracking approaches that are less popular. For instance, some have tried to detect movements of the eye muscles. This method provides high-speed data but is less accurate than camera-based tracking.

How often should we calculate the gaze direction? The eyes have several types of movements. Saccadic movements are fast and happen when we need to shift gaze from one area to another. Vergence shifts are small movements that help in depth perception; they aim to get the image of an object to appear on corresponding spots on both retinas. Smooth pursuit is how we move our eyes when we track a moving object. To capture saccadic movements, one needs to sample the eye hundreds of times per second. But saccadic movements themselves do not provide gaze direction, so they are interesting for research applications but not for mass-market eye tracking. Vergence and smooth pursuit movements are slower, and tens of samples per second are often enough. Since many VR applications want the freshest data, there is a trend to track the eyes at the VR frame rate.

Eye tracking systems need to compensate for movements of the camera relative to the eye. For instance, a head-mounted display can slide and shift relative to the eyes. One popular technique is to use reflections of the light source from the cornea. These reflections, called Purkinje reflections, change little during eye rotation and can serve as an anchor for the algorithm. Other algorithms try to identify the corners of the eye as anchor points.
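
As a rough sketch of this idea, the vector from the corneal glint to the pupil center can serve as a gaze feature that is less sensitive to headset slippage than the raw pupil position. The thresholds and centroid math below are illustrative assumptions, reusing the style of the pupil-detection sketch above.

```python
import numpy as np

def pupil_glint_vector(ir_image, dark_threshold=40, bright_threshold=240):
    """Return the vector from the corneal glint to the pupil center.

    The glint (first Purkinje reflection) moves mostly with the camera and
    light source, so pupil-minus-glint is a more slippage-tolerant feature
    than the pupil position alone.
    """
    dark = ir_image < dark_threshold       # candidate pupil pixels
    bright = ir_image > bright_threshold   # candidate glint pixels
    if not dark.any() or not bright.any():
        return None
    pupil_y, pupil_x = [c.mean() for c in np.nonzero(dark)]
    glint_y, glint_x = [c.mean() for c in np.nonzero(bright)]
    return (pupil_x - glint_x, pupil_y - glint_y)
```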

There are other variables that an algorithm needs to compensate for. The eye is not a perfect sphere. Some people have bulging eyes and others have inset eyes. The location of the eye relative to the camera is not constant between users. These and other variables are often addressed during a calibration procedure. Simple calibration presents a cross on the screen at a known location and asks the user to fixate on it. By repeating this for a few locations, the algorithm calibrates the tracker to a user.
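
A minimal sketch of such a calibration, assuming we have already collected one eye feature per target (for example, the pupil-glint vector from the sketch above) along with the known on-screen target positions. The second-order polynomial model is a common choice, shown here purely as an illustration; at least six targets are needed to fit it.

```python
import numpy as np

def fit_calibration(eye_features, targets):
    """Fit a 2nd-order polynomial map from eye features to screen points.

    eye_features: N x 2 array of eye measurements, one per fixation target.
    targets:      N x 2 array of the corresponding known target positions.
    Returns a 6 x 2 coefficient matrix.
    """
    x, y = eye_features[:, 0], eye_features[:, 1]
    # Design matrix: constant, linear, cross and quadratic terms.
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    coeffs, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return coeffs

def apply_calibration(coeffs, eye_feature):
    """Map a new eye feature to an estimated gaze point on the screen."""
    x, y = eye_feature
    terms = np.array([1.0, x, y, x * y, x**2, y**2])
    return terms @ coeffs
```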

Beyond the algorithm, the optical system of the tracker presents extra challenges. It aims to be lightweight. It tries to avoid imposing constraints on the optics used to present the actual VR/AR image to the user. It needs to work with a wide range of facial structures. For a discussion of optical configurations for eye tracking, please see here.

Eye trackers used to be expensive. This was not the result of expensive components, but rather of a limited market. When only researchers bought eye trackers, companies charged more to cover their R&D expenses. As eye tracking moves into the mainstream, eye trackers will become inexpensive.

Sunday, May 8, 2016

Understanding Predictive Tracking

Image source: Adrian Boeing blog
In the context of AR and VR systems, predictive tracking refers to the process of predicting the future orientation and/or position of an object or body part. For instance, one might want to predict the orientation of the head or the position of the hand.

Why is predictive tracking useful?

One common use of predictive tracking is to reduce the apparent "motion to photon" latency, meaning the time between movement and when that movement is reflected in the drawn scene. Since there is some delay between movement and an updated display (more on the sources of that delay below), using an estimated future orientation and position when updating the display can shorten that perceived latency.

While a lot of attention has been focused on predictive tracking in virtual reality applications, it is also very important in augmented reality. For instance, if you are displaying a graphical overlay on top of a physical object that you see through augmented reality goggles, it is important that the overlay stays on the object even when you rotate your head. The object might be recognized with a camera, but it takes time for the camera to capture the frame, for a processor to determine where the object is in the frame, and for a graphics chip to render the new overlay. By using predictive tracking, you can get better apparent registration between the overlay and the physical object.

How does it work? 

If you saw a car travelling at a constant speed and you wanted to predict where that car will be one second in the future, you could probably make a fairly accurate prediction. You know the current position of the car, you might know (or can estimate) the current velocity, and thus you can extrapolate the position into the near future.

Of course if you compare your prediction with where the car actually is in one second, your prediction is unlikely to be 100% accurate every time: the car might change direction or speed during that time. The farther out you are trying to predict, the less accurate your prediction will be: predicting where the car will be in one second is likely much more accurate than predicting where it will be in one minute.

The more you know about the car and its behavior, the better chance you have of making an accurate prediction. For instance, if you were able to measure not only the velocity but also the acceleration, you could make a more accurate prediction.

If you have additional information about the behavior of the tracked body, this can also improve prediction accuracy. For instance, when doing head tracking, understanding how fast the head can possibly rotate and what common rotation speeds are can improve the tracking model. Similarly, if you are doing eye tracking, you can use the eye tracking information to anticipate head movements, as discussed in this post.

Sources of latency

The desire to perform predictive tracking comes from having some latency between actual movement and displaying an image that reflects that movement. Latency can come from multiple sources, such as:
  • Sensing delays. The sensors (e.g. gyroscope) may be bandwidth-limited and may not instantaneously report orientation or position changes. Similarly, camera-based sensors may exhibit delay between the time a pixel on the camera sensor receives light from the tracked object and the time that frame is ready to be sent to the host processor.
  • Processing delays. Sensors are often combined using some kind of sensor fusion algorithm, and executing this algorithm can add latency.
  • Data smoothing. Sensor data is sometimes noisy and to avoid erroneous jitter, software or hardware-based low pass algorithms are executed.
  • Transmission delays. For example, if orientation sensing is done using a USB-connected device, there is some non-zero time between the data being available to be read by the host processor and the time the data transfer over USB is completed.
  • Rendering delays. When rendering a non-trivial scene, it takes some time to have the image ready to be sent to the display device.
  • Frame rate delays. If a display is operating at 100 Hz, for instance, there are 10 mSec between successive frames. Information that arrives just after a particular pixel is drawn may need to wait until the next time that pixel is refreshed on the display.
Some of these delays are very small, but unfortunately they all add up, and predictive tracking, along with other techniques such as time warping, is helpful in reducing the apparent latency.

How much to track into the future?

In two words: it depends. You will want to estimate the end-to-end latency of your system as a starting point and then optimize the prediction interval to your liking.

It may be that you will need to predict several timepoints into the future at any given time. Here are some examples of why this may be required:
  • There are objects with different end-to-end delays. For instance, a hand tracked with a camera may have a different latency than a head tracker, but both need to be drawn in sync in the same scene, so predictive tracking with different 'look ahead' times will be used.
  • In configurations where a single screen - such as a cell phone screen - is used to provide imagery to both eyes, it is often the case that the image for one eye appears with a delay of half a frame (e.g. half of 1/60 seconds, or approx 8 mSec) relative to the other eye. In this case, it is best to use predictive tracking that looks ahead 8 mSec more for that delayed half of the screen.
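
As a rough illustration of how these look-ahead times might be budgeted, here is a small Python sketch. All latency numbers are made-up placeholders, not measurements of any real system.

```python
# Hypothetical per-source latency estimates, in milliseconds.
LATENCY_MS = {
    "sensing": 2.0,
    "processing": 1.0,
    "transmission": 1.0,
    "rendering": 4.0,
    "frame_wait": 5.0,   # average wait for the next frame at 100 Hz
}

base_lookahead_ms = sum(LATENCY_MS.values())

# On a single shared screen, one eye's half of the image is scanned out
# roughly half a frame later (half of 1/60 sec, about 8 mSec).
half_frame_ms = 0.5 * (1000.0 / 60.0)

lookahead_ms = {
    "first_eye": base_lookahead_ms,
    "second_eye": base_lookahead_ms + half_frame_ms,
}
print(lookahead_ms)   # e.g. {'first_eye': 13.0, 'second_eye': ~21.3}
```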

Common prediction algorithms

Here is a sampling of predictive tracking algorithms:
  • Dead reckoning. This is a very simple algorithm: if the position and velocity (or angular position and angular velocity) are known at a given time, the prediction assumes that the last known position and velocity are correct and that the velocity remains the same. For instance, if the last known position is 100 units and the last known velocity is 10 units/sec, then the predicted position 10 mSec (0.01 seconds) into the future is 100 + 10 x 0.01 = 100.1. While this is very simple to compute, it assumes that the last position and velocity are accurate (e.g. not subject to any measurement noise) and that the velocity is constant. Both these assumptions are often incorrect.
  • Kalman predictor. This is based on the popular Kalman filter, which is used to reduce sensor noise in systems where there exists a mathematical model of the system's operation. See here for a more detailed explanation of the Kalman filter.
  • Alpha-beta-gamma. The ABG predictor is closely related to the Kalman predictor, but is less general and has simpler math, which we can explain here at a high level. ABG tries to continuously estimate both velocity and acceleration and use them in prediction. Because the estimates take actual data into account, they provide some measurement noise reduction. Configuring the parameters (alpha, beta and gamma) provides the ability to emphasize responsiveness as opposed to noise reduction. If you'd like to follow the math, here it goes:
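
A minimal sketch of one common alpha-beta-gamma formulation, in Python. The one-dimensional state, the default parameter values and the exact gamma update term are illustrative assumptions; real implementations tune these per system. Note that the predict step is the same constant-velocity/constant-acceleration extrapolation used in dead reckoning; the alpha, beta and gamma terms add the noise-reducing blend with each new measurement.

```python
class AlphaBetaGammaPredictor:
    """Minimal alpha-beta-gamma predictor for a single tracked axis.

    alpha, beta and gamma trade responsiveness against noise reduction;
    the defaults below are illustrative, not tuned for any real tracker.
    """

    def __init__(self, alpha=0.5, beta=0.4, gamma=0.1, dt=0.01):
        self.alpha, self.beta, self.gamma, self.dt = alpha, beta, gamma, dt
        self.x = 0.0   # estimated position (or angle)
        self.v = 0.0   # estimated velocity
        self.a = 0.0   # estimated acceleration

    def update(self, measured_x):
        """Fold one new measurement into the state estimates."""
        dt = self.dt
        # Predict the state one sample ahead (constant-acceleration model).
        x_pred = self.x + self.v * dt + 0.5 * self.a * dt * dt
        v_pred = self.v + self.a * dt
        # Blend the prediction with the measurement via the residual.
        residual = measured_x - x_pred
        self.x = x_pred + self.alpha * residual
        self.v = v_pred + (self.beta / dt) * residual
        self.a = self.a + (2.0 * self.gamma / (dt * dt)) * residual

    def predict(self, lookahead_s):
        """Extrapolate the filtered state 'lookahead_s' seconds ahead."""
        t = lookahead_s
        return self.x + self.v * t + 0.5 * self.a * t * t
```

For example, feeding the filter one orientation sample every 10 mSec and then calling predict(0.02) would return an estimate 20 mSec into the future, which can be handed to the renderer.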

Summary

Predictive tracking is a useful and commonly-used technique for reducing apparent latency. It can be implemented in simple or sophisticated ways, and while it requires some thought and analysis, it is well worth the effort.

Monday, April 11, 2016

Understanding Foveated Rendering





Foveated rendering is a rendering technique that takes advantage of the fact that the resolution of the eye is highest in the fovea (the central vision area) and lower in the peripheral areas. As a result, if one can sense the gaze direction (with an eye tracker), GPU computational load can be reduced by rendering an image that has higher resolution in the direction of gaze and lower resolution elsewhere.

The challenge in turning this from theory to reality is to find the optimal function and parameters that maximally reduce GPU computation while maintaining highest quality visual experience. If done well, the user shouldn’t be able to tell that foveated rendering is being used. The main questions to address are:
  1. In what angle around the center of vision should we keep the highest resolution? 
  2. Is there a mid-level resolution that is best to use? 
  3. What is the drop-off in “pixel density” between central and peripheral vision? 
  4. What is the maximum speed that the eye can move? This question is important because even though the eye is normally looking at the center of the image, the eye can potentially rotate so that the fovea is aimed at image areas with lower resolution.
Let's address these questions:

1. In what angle around the center of vision should we keep the highest resolution?

Source: Wikipedia
The macula portion of the retina is responsible for fine detail. It spans the central 18˚ around the gaze point, or 9˚ eccentricity (the angular distance away from the center of gaze). This would be the best place to put the boundary of the inner layer. Fine detail is processed by cones (as opposed to rods), and at eccentricities past 9˚ you see a rapid fall off of cone density, so this makes sense biologically as well. Furthermore, the “central visual field” ends at 30˚ eccentricity, and everything past that is considered periphery. This is a logical spot to put the boundary between the middle and outermost layer for foveated rendering.

2. Is there a mid-level resolution that is best to use?  and 3. What is the drop-off in “pixel density” between central and peripheral vision? 

Some vendors such as Sensomotoric Instruments (SMI) use an inner layer at full native resolution, a middle layer at 60% resolution, and an outer layer at 20% resolution. When selecting the resolution dropoff, it is important to ensure that at the layer boundaries, the resolution is at or above the eye’s acuity at that eccentricity. At 9˚ eccentricity, acuity drops to 20% of the maximum acuity, and at 30˚ acuity drops to 7.4% of the max acuity. Given this, it appears that SMI’s values work, but are generous compared to what the eye can see.
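
As an illustration, the layer boundaries and resolution values quoted above can be expressed as a simple lookup function. The numbers are the SMI-style example values from the text, not a recommendation.

```python
def resolution_scale(eccentricity_deg):
    """Rendering resolution scale for a pixel at a given angular distance
    from the gaze point, using the three example layers quoted above."""
    if eccentricity_deg <= 9.0:      # inner (macular) layer: full resolution
        return 1.0
    elif eccentricity_deg <= 30.0:   # middle layer: 60% resolution
        return 0.6
    else:                            # outer (peripheral) layer: 20% resolution
        return 0.2

# Sanity check against the acuity falloff mentioned above: acuity is roughly
# 20% of peak at 9 degrees and 7.4% at 30 degrees, so each layer's resolution
# stays at or above the eye's acuity at its outer boundary.
for ecc in (0, 9, 15, 30, 45):
    print(ecc, resolution_scale(ecc))
```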




4. What is the maximum speed that the eye can move?


Source: Indiana University
A saccade is a rapid movement of the eye between fixation points. Saccade speed is determined by the distance between the current gaze and the stimulus. If the stimulus is as far as 50˚ away, then peak saccade velocity can reach around 900˚/sec. This is important because you want the high resolution layer to be large enough so that the eye can't move to the lower resolution portion in the time it takes to get the gaze position and render the scene. So if system latency is 20 msec and the eye can move at 900˚/sec, the eye could move 18˚ in that time, meaning you would want the inner (highest-resolution) layer radius to be greater than that - but that is only if the stimulus presented is 50˚ away from the current gaze.
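
That worst-case arithmetic can be captured in a tiny helper; the 900˚/sec peak velocity and 20 msec latency are the example values from the text above.

```python
def min_inner_layer_radius_deg(latency_s, peak_saccade_deg_per_s=900.0):
    """Angular distance the eye could travel before the system reacts.

    The full-resolution layer should be at least this large in the worst
    case, i.e. when a stimulus appears far enough away to trigger a
    peak-velocity saccade.
    """
    return peak_saccade_deg_per_s * latency_s

print(min_inner_layer_radius_deg(0.020))   # 18.0 degrees for 20 msec latency
```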

Additional thoughts

Source: Vision and Ocular Motility by Gunter Noorden

Visual acuity decreases on the temporal side (e.g. towards the ear) somewhat more rapidly than on the nasal side. It also decreases more sharply below and, especially, above the fovea, so that lines connecting points of equal visual acuity are elliptic, paralleling the outer margins of the visual field. Following this, it might make sense to render the different layers in ellipses rather than circles. The image shows the lines of equal visual acuity for the visual field of the left eye – so one can see that it extends farther to the left (temporal side) for the left eye, and for the right eye visual field would extend farther to the right.

For additional reading

This paper from Microsoft Research is particularly interesting.
They approach the foveated rendering problem in a more technical way - optimizing to find layer parameters based on a simple but fundamental idea: for a given acuity falloff line, find the eccentricity layer sizes that support at least that much resolution at every eccentricity, while minimizing the total number of pixels across all layers. The paper explains their methodology, though it does not give their results for the resolution values and layer sizes.

Note: special thanks to Emma Hafermann for her research on this post

For additional VR tutorials on this blog, click here
Expert interviews and tutorials can also be found on the Sensics Insight page here