Sunday, May 8, 2016

Understanding Predictive Tracking

Image source: Adrian Boeing blog
In the context of AR and VR systems, predictive tracking refers to the process of predicting the future orientation and/or position of an object or body part. For instance, one might want to predict the orientation of the head or the position of the hand.

Why is predictive tracking useful?

One common use of predictive tracking is to reduce the apparent "motion to photon" latency, meaning the time between movement and when that movement is reflected in the drawn scene. Since there is some delay between movement and an updated display (more on the sources of that delay below), using an estimated future orientation and position when updating the display can shorten that perceived latency.

While a lot of attention has been focused on predictive tracking in virtual reality applications, it is also very important in augmented reality. For instance, if you are displaying a graphical overlay to appear on top of a physical object that you see with augmented reality goggles, it is important that the overlay stays on the object even when you rotate your head. The object might be recognized with a camera, but it takes time for the camera to capture the frame, for a processor to determine where the object is in the frame, and for a graphics chip to render the new overlay. By using predictive tracking, you can get better apparent registration between the overlay and the physical object.

How does it work? 

If you saw a car travelling at a constant speed and you wanted to predict where that car will be one second in the future, you could probably make a fairly accurate prediction. You know the current position of the car, you might know (or can estimate) the current velocity, and thus you can extrapolate the position into the near future.

Of course if you compare your prediction with where the car actually is in one second, your prediction is unlikely to be 100% accurate every time: the car might change direction or speed during that time. The farther out you are trying to predict, the less accurate your prediction will be: predicting where the car will be in one second is likely much more accurate than predicting where it will be in one minute.

The more you know about the car and its behavior, the better chance you have of making an accurate prediction. For instance, if you are able to measure not only the velocity but also the acceleration, you can make a more accurate prediction.
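
As a minimal sketch of the car example, the following extrapolates a position from the current position, velocity and (optionally) acceleration. The function name and the numbers are assumptions chosen for illustration, and the model assumes the acceleration stays constant over the prediction interval.

```python
def extrapolate(position, velocity, dt, acceleration=0.0):
    """Predict position dt seconds ahead, assuming constant acceleration."""
    return position + velocity * dt + 0.5 * acceleration * dt ** 2


# A car at position 0 m moving at 20 m/s, predicted one second ahead.
print(extrapolate(0.0, 20.0, 1.0))                    # velocity only: 20.0
print(extrapolate(0.0, 20.0, 1.0, acceleration=2.0))  # with acceleration: 21.0
```

The two results differ by one meter after one second; the gap grows with the square of the look-ahead time, which is one reason longer predictions are less reliable.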

If you have additional information about the behavior of the tracked body, this can also improve prediction accuracy. For instance, when doing head tracking, understanding how fast the head can possibly rotate and what the common rotation speeds are can improve the tracking model. Similarly, if you are doing eye tracking, you can use the eye tracking information to anticipate head movements as discussed in this post.

Sources of latency

The desire to perform predictive tracking comes from having some latency between actual movement and displaying an image that reflects that movement. Latency can come from multiple sources, such as:
  • Sensing delays. The sensors (e.g. gyroscope) may be bandwidth-limited and do not instantaneously report orientation or position changes. Similarly, camera-based sensors may exhibit a delay between the time a pixel on the camera sensor receives light from the tracked object and the time that frame is ready to be sent to the host processor.
  • Processing delays. Sensors are often combined using some kind of sensor fusion algorithm, and executing this algorithm can add latency.
  • Data smoothing. Sensor data is sometimes noisy and to avoid erroneous jitter, software or hardware-based low pass algorithms are executed.
  • Transmission delays. For example, if orientation sensing is done using a USB-connected device, there is some non-zero time between the data being available to be read by the host processor and the time the data transfer over USB is completed.
  • Rendering delays. When rendering a non-trivial scene, it takes some time to have the image ready to be sent to the display device.
  • Frame rate delays. If a display is operating at 100 Hz, for instance, there is a 10 mSec time between successive frames. Information that is not precisely current to when a particular pixel is drawn may need to wait until the next time that pixel is drawn on the display.
Some of these delays are very small, but unfortunately they all add up. Predictive tracking, along with other techniques such as time warping, helps reduce the apparent latency.

How much to track into the future?

In two words: it depends. You will want to estimate the end-to-end latency of your system as a starting point and then tune the look-ahead time to your liking.

It may be that you will need to predict several timepoints into the future at any given time. Here are some examples why this may be required:
  • There are objects with different end-to-end delays. For instance, a hand tracked with a camera may have a different latency than a head tracker, but both need to be drawn in sync in the same scene, so predictive tracking with different 'look ahead' times will be used.
  • In configurations where a single screen - such as a cell phone screen - is used to provide imagery to both eyes, it is often the case that the image for one eye appears with a delay of half a frame (e.g. half of 1/60 seconds, or approx 8 mSec) relative to the other eye. In this case, it is best to use predictive tracking that looks ahead 8 mSec more for that delayed half of the screen.
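
The two cases above can be sketched as a small calculation of look-ahead times. The function name and the head and hand latency figures are hypothetical; only the half-frame offset (half of 1/60 seconds) comes from the text.

```python
def look_ahead_times(end_to_end_latency_s, refresh_hz, shared_screen=True):
    """Return (first_eye, second_eye) look-ahead times in seconds."""
    # On a single shared screen, the second eye is drawn about half a frame later.
    half_frame = 0.5 / refresh_hz if shared_screen else 0.0
    return end_to_end_latency_s, end_to_end_latency_s + half_frame


# Hypothetical end-to-end latencies, for illustration only.
head_latency_s = 0.020
hand_latency_s = 0.045

for name, latency in (("head", head_latency_s), ("hand", hand_latency_s)):
    first, second = look_ahead_times(latency, refresh_hz=60)
    print(f"{name}: first eye {first * 1000:.1f} ms, second eye {second * 1000:.1f} ms")
```

Each tracked object gets its own base look-ahead, and the delayed half of the screen adds roughly 8 mSec on top of it.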

Common prediction algorithms

Here is a sampling of predictive tracking algorithms:
  • Dead reckoning. This is a very simple algorithm: if the position and velocity (or angular position and angular velocity) are known at a given time, the predicted position assumes that the last known position and velocity are correct and that the velocity remains the same. For instance, if the last known position is 100 units and the last known velocity is 10 units/sec, then the predicted position 10 mSec (0.01 seconds) into the future is 100 + 10 x 0.01 = 100.1. While this is very simple to compute, it assumes that the last position and velocity are accurate (e.g. not subject to any measurement noise) and that the velocity is constant. Both these assumptions are often incorrect.
  • Kalman predictor. This is based on the popular Kalman filter, which is used to reduce sensor noise in systems where there exists a mathematical model of the system's operation. See here for a more detailed explanation of the Kalman filter.
  • Alpha-beta-gamma. The ABG predictor is closely related to the Kalman predictor, but is less general and has simpler math, which we can explain here at a high level. ABG tries to continuously estimate both velocity and acceleration and use them in prediction. Because the estimates take into account actual data, they provide some measurement noise reduction. Configuring the parameters (alpha, beta and gamma) provides the ability to emphasize responsiveness as opposed to noise reduction. If you'd like to follow the math, here it goes:
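
The equations from the original post are not reproduced here. As a rough sketch under that caveat, the following uses the common textbook form of the alpha-beta-gamma filter for a single coordinate. The class name, the gain values and the scaling of the gamma term (conventions vary between references) are assumptions for illustration, not necessarily the formulas the post had in mind.

```python
class AlphaBetaGammaPredictor:
    """Textbook alpha-beta-gamma filter for one scalar coordinate (a sketch)."""

    def __init__(self, alpha, beta, gamma, x0=0.0, v0=0.0, a0=0.0):
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.x, self.v, self.a = x0, v0, a0

    def predict(self, dt):
        """Extrapolate the current estimate dt seconds ahead without changing state."""
        x = self.x + self.v * dt + 0.5 * self.a * dt ** 2
        v = self.v + self.a * dt
        return x, v, self.a

    def update(self, measurement, dt):
        """Fold in a measurement taken dt seconds after the previous one."""
        x_pred, v_pred, a_pred = self.predict(dt)
        residual = measurement - x_pred  # how far off the prediction was
        self.x = x_pred + self.alpha * residual
        self.v = v_pred + self.beta * residual / dt
        self.a = a_pred + 2.0 * self.gamma * residual / dt ** 2


# Hypothetical gains and a noiseless constant-velocity track (10 units/sec at 100 Hz).
tracker = AlphaBetaGammaPredictor(alpha=0.5, beta=0.4, gamma=0.1, x0=100.0, v0=10.0)
dt = 0.01
for step in range(1, 6):
    tracker.update(100.0 + 10.0 * step * dt, dt)

predicted_position, _, _ = tracker.predict(0.01)  # look 10 mSec ahead
print(predicted_position)
```

With noiseless constant-velocity input the prediction should land very close to 100.6, matching what dead reckoning would give. Larger gains make the estimates follow each new measurement more closely, while smaller gains smooth more at the cost of responsiveness.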

Summary

Predictive tracking is a useful and commonly used technique for reducing apparent latency. Implementations range from simple to sophisticated, and all require some thought and analysis, but the effort is well worth it.

Saturday, April 30, 2016

VR and AR in 12 variations

I've been thinking about how to classify VR and AR headsets and am starting to look at them along three dimensions (no pun intended):
  1. VR vs AR
  2. PC-powered vs. Phone-powered vs. Self-powered. This looks at where the processing and video generation is coming from. Is it connected to a PC? Is it using a standard phone? Or does it embed processing inside the headset?
  3. Wide field of view vs. Narrow FOV
This generates a total of 2 x 3 x 2 = 12 options as follows:


Configuration

Example and typical use

1: VR, PC-powered, Wide-field
Examples: Oculus, HTC Vive, Sensics dSight, OSVR HDK. This immersive VR configuration is used in many applications, though the most popular one is gaming. One attribute that separates consumer-grade goggles like the HTC Vive from professional-grade goggles such as the Sensics dSight is pixel density: the number of pixels per degree. You can think about this as the difference between watching a movie on a 50 inch standard-definition TV as opposed to a 50 inch HDTV.

2: VR, PC-powered, Narrow-field
Example: Sensics zSight 1920. With a given number of pixels per eye, narrow-field systems allow for much higher pixel density, which allows observing fine details or very small objects. For instance, imagine that you are training to land a UAV. The first step in landing a UAV is spotting it in the sky. The higher the pixel density is, the farther out you can spot an object of a given size. The zSight 1920 has about 32 pixels/degree whereas a modern consumer goggle like the HTC Vive has less than half that.

3: VR, Phone-powered, Wide-field
Examples: Samsung Gear VR, Google Cardboard, Zeiss VROne. This configuration, where the phone is inserted into some kind of holster, is used for general-purpose mobile VR. The advantages of this configuration are its portability as well as its low cost, assuming you already own a compatible phone. The downside of this configuration is that the processing power of a phone is inferior to a high-end PC and thus the experience is more limited in terms of frame rate and scene complexity. Today's phones were not fully designed with VR in mind, so there are sometimes concerns about overheating and battery life.

4: VR, Phone-powered, Narrow-field
Example: LG 360 VR. In this configuration, the phone is not carried on the head but rather connected via a thin wire to a smaller unit on the head. The advantage of this configuration is that it can be very lightweight and compact. Also, the phone could potentially be used as an input pad. The downside is that the phone is connected via a cable. Another downside is often the cost: because this configuration does not use the phone screen, it needs to include its own screens, which might add to the cost. A further downside is that the phone camera cannot be used for video see-through or for sensing.

5: VR, Self-powered, Wide-field
Examples: Gameface Labs, Pico Neo. These configurations aim for standalone, mobile VR without requiring the mobile phone. They potentially save weight by not using unnecessary phone components such as the casing and touch screen, but would typically be more expensive than phone based VR for those users that already own the phone. They might have additional flexibility with regards to which sensors to include, camera placement and battery location. They are more difficult to upgrade relative to a phone-based VR solution, but the fact that the phone cannot be taken out might be an advantage for applications such as public VR where a fully-integrated system that cannot be easily taken apart is a plus.

6: VR, Self-powered, Narrow-field
Example: Sensics SmartGoggles. These configurations are less popular today. Even the Sensics SmartGoggles, which included an on-board Android processor as well as wide-area hand sensors, was built with a relatively narrow field of view (60 degrees) because of the components available at the time.

7: AR, PC-powered, Wide-field
Example: Meta 2. In many augmented reality applications, people ask for wide field so that, for instance, a virtual object that appears overlaid on the real world does not disappear when the user looks to the side. This configuration may end up being transient because in many cases the value of augmented reality is in being able to interact with the real world, and the user's motion when tethered to a PC is more limited. However, one might see uses in applications such as an engineering workstation.

8: AR, PC-powered, Narrow-field
I am not aware of good examples of this configuration. It combines the limitations of narrow-field AR with the tether to the PC.

9: AR, Phone-powered, Wide-field
This could become one of the standard AR configurations, just like phone-powered, wide-field VR is becoming a mainstream configuration. To get there, the processing power and optics/display technology need to catch up with the requirements.

10: AR, Phone-powered, Narrow-field
Example: Seebright. In this configuration, a phone is worn on the head and its screen becomes the display for the goggles. Semi-transparent optics combine phone-generated imagery with the real world. I believe this is primarily a transient configuration until wide-field models appear.

11: AR, Self-powered, Wide-field
I am unaware of current examples of this configuration, though one would assume it could be very attractive because of the mobility on one hand and the ability to interact in a wide field of view on the other.

12: AR, Self-powered, Narrow-field
Examples: Microsoft Hololens, Google Glass, Vuzix M300. There are two types of devices here: one is an 'information appliance' like Google Glass, designed to provide contextually-relevant information without taking over the field of view. These configurations are very attractive in industrial settings for applications like field technicians, workers in a warehouse or even customer service representatives needing a mobile, wearable terminal, often to connect with a cloud-based database. The second type of device, exemplified by the Hololens, seeks to augment reality by placing virtual objects locked in space. I am sure the Hololens would like to be a wide-field model, and it is narrow-field at the moment because of the limitations of its current display technology.
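
The numbering of the 12 configurations above follows directly from combining the three dimensions in order. As a small illustration (the variable names are arbitrary), the list can be regenerated like this:

```python
from itertools import product

reality = ["VR", "AR"]
power = ["PC-powered", "Phone-powered", "Self-powered"]
field = ["Wide-field", "Narrow-field"]

# The last dimension varies fastest, which matches the numbering used above.
configurations = list(product(reality, power, field))
for number, combo in enumerate(configurations, start=1):
    print(f"{number}: {', '.join(combo)}")
```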


Looking forward to feedback and comments.

Monday, April 11, 2016

Understanding Foveated Rendering





Foveated rendering is a rendering technique that takes advantage of the fact that the resolution of the eye is highest in the fovea (the central vision area) and lower in the peripheral areas. As a result, if one can sense the gaze direction (with an eye tracker), GPU computational load can be reduced by rendering an image that has higher resolution in the direction of gaze and lower resolution elsewhere.

The challenge in turning this from theory to reality is to find the optimal function and parameters that maximally reduce GPU computation while maintaining the highest-quality visual experience. If done well, the user shouldn’t be able to tell that foveated rendering is being used. The main questions to address are:
  1. In what angle around the center of vision should we keep the highest resolution? 
  2. Is there a mid-level resolution that is best to use? 
  3. What is the drop-off in “pixel density” between central and peripheral vision? 
  4. What is the maximum speed that the eye can move? This question is important because even though the eye is normally looking at the center of the image, the eye can potentially rotate so that the fovea is aimed at image areas with lower resolution.
Let's address these questions:

1. In what angle around the center of vision should we keep the highest resolution?

Source: Wikipedia
The macula portion of the retina is responsible for fine detail. It spans the central 18˚ around the gaze point, or 9˚ eccentricity (the angular distance away from the center of gaze). This would be the best place to put the boundary of the inner layer. Fine detail is processed by cones (as opposed to rods), and at eccentricities past 9˚ you see a rapid fall off of cone density, so this makes sense biologically as well. Furthermore, the “central visual field” ends at 30˚ eccentricity, and everything past that is considered periphery. This is a logical spot to put the boundary between the middle and outermost layer for foveated rendering.

2. Is there a mid-level resolution that is best to use?  and 3. What is the drop-off in “pixel density” between central and peripheral vision? 

Some vendors such as Sensomotoric Instruments (SMI) use an inner layer at full native resolution, a middle layer at 60% resolution, and an outer layer at 20% resolution. When selecting the resolution dropoff, it is important to ensure that at the layer boundaries, the resolution is at or above the eye’s acuity at that eccentricity. At 9˚ eccentricity, acuity drops to 20% of the maximum acuity, and at 30˚ acuity drops to 7.4% of the max acuity. Given this, it appears that SMI’s values work, but are generous compared to what the eye can see.
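
The three-layer scheme and the acuity check described above can be sketched as follows. The boundaries, resolution scales and acuity figures are the ones quoted in the text; the names are arbitrary, and comparing a resolution fraction directly with an acuity fraction assumes that full native resolution corresponds to peak acuity.

```python
# Layer outer edges (degrees of eccentricity) and resolution scales quoted in the text.
LAYERS = [
    (9.0, 1.00),           # inner layer: full native resolution out to 9 degrees
    (30.0, 0.60),          # middle layer: 60% resolution out to 30 degrees
    (float("inf"), 0.20),  # outer layer: 20% resolution beyond that
]

# Relative acuity at the two layer boundaries, as quoted in the text.
ACUITY_AT_BOUNDARY = {9.0: 0.20, 30.0: 0.074}


def resolution_scale(eccentricity_deg):
    """Return the rendering resolution scale for a given eccentricity."""
    for outer_edge_deg, scale in LAYERS:
        if eccentricity_deg <= outer_edge_deg:
            return scale
    raise ValueError("eccentricity must be a number")


def layers_meet_acuity():
    """Check that the layer starting at each boundary is at or above the eye's acuity there."""
    for (edge_deg, _), (_, next_scale) in zip(LAYERS, LAYERS[1:]):
        if next_scale < ACUITY_AT_BOUNDARY[edge_deg]:
            return False
    return True


print(resolution_scale(5.0), resolution_scale(20.0), resolution_scale(45.0))  # 1.0 0.6 0.2
print(layers_meet_acuity())  # True
```

The margin is visible in the numbers: 60% resolution is used where acuity has already dropped to 20%, and 20% where it has dropped to 7.4%.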




4. What is the maximum speed that the eye can move?


Source: Indiana University
A saccade is a rapid movement of the eye between fixation points. Saccade speed is determined by the distance between the current gaze and the stimulus. If the stimulus is as far as 50˚ away, then peak saccade velocity can get up to around 900˚/sec. This is important because you want the high-resolution layer to be large enough that the eye can’t move to the lower-resolution portion in the time it takes to get the gaze position and render the scene. So if system latency is 20 msec and we assume the eye can move at 900˚/sec, the eye could move 18˚ in that time, meaning you would want the radius of the inner (highest-resolution) layer to be greater than that. However, that is only the case if the stimulus presented is 50˚ away from the current gaze.
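
The arithmetic above is a simple worst-case bound, sketched here with an illustrative function name. It assumes the eye moves at peak velocity for the entire latency interval, which overstates what happens for shorter saccades.

```python
def min_inner_radius_deg(latency_ms, peak_velocity_deg_per_s=900.0):
    """Largest angle the eye could travel during one latency interval."""
    return peak_velocity_deg_per_s * latency_ms / 1000.0


print(min_inner_radius_deg(20))  # 18.0
```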

Additional thoughts

Source: Vision and Ocular Motility by Gunter Noorden

Visual acuity decreases on the temporal side (i.e. towards the ear) somewhat more rapidly than on the nasal side. It also decreases more sharply below and, especially, above the fovea, so that lines connecting points of equal visual acuity are elliptic, paralleling the outer margins of the visual field. Following this, it might make sense to render the different layers as ellipses rather than circles. The image shows the lines of equal visual acuity for the visual field of the left eye, so one can see that it extends farther to the left (temporal side) for the left eye; for the right eye, the visual field would extend farther to the right.

For additional reading

This paper from Microsoft Research is particularly interesting.
They approach the foveated rendering problem in a more technical way – optimizing to find layer parameters based on a simple but fundamental idea: for a given acuity falloff line, find the eccentricity layer sizes which support at least that much resolution at every eccentricity, while minimizing the total number of pixels across all layers. It explains their methodology though does not give their results for the resolution values and layer sizes.

Note: special thanks to Emma Hafermann for her research on this post

For additional VR tutorials on this blog, click here
Expert interviews and tutorials can also be found on the Sensics Insight page here

Tuesday, April 5, 2016

Time-warp Explained

In the context of virtual reality, time warp is a technique to reduce the apparent latency between head movement and the corresponding image that appears inside an HMD.

In an ideal world, the rendering engine would render an image using the measured head pose (orientation and position) immediately before the image is displayed on the screen. However, in the real world, rendering takes time, so the rendering engine uses a pose reading that was taken a few milliseconds before the image is displayed on the screen. During these few milliseconds, the head moves, so the displayed image lags a little behind the actual head pose.

Let's take a numerical example. Let's assume we need to render at 90 frames per second, so there are approximately 11 milliseconds for the rendering process of each frame. Let's assume that head tracking data is available pretty much continuously but that rendering takes 10 milliseconds. Knowing the rendering time, the rendering engine starts rendering as late as possible, which is 10 milliseconds before the frame needs to be displayed. Thus, the rendering engine uses head tracking data that is 10 milliseconds old. If the head rotates at a rate of 200 degrees/second, these 10 milliseconds are equivalent to 2 degrees. If the horizontal field of view of the HMD is 100 degrees and there are 1000 pixels across the visual field, a 2-degree error means that the image lags actual head movement by about 20 pixels.
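
The numbers in this example can be checked with a few lines. The function name is illustrative, and the calculation assumes pixels are spread evenly across the field of view, as the example implicitly does.

```python
def lag_in_pixels(head_rate_deg_per_s, pose_age_ms, fov_deg, pixels_across):
    """Return (angular error in degrees, image lag in pixels) for a stale pose."""
    error_deg = head_rate_deg_per_s * pose_age_ms / 1000.0
    pixels_per_degree = pixels_across / fov_deg
    return error_deg, error_deg * pixels_per_degree


frame_budget_ms = 1000.0 / 90  # about 11.1 ms per frame at 90 fps
error_deg, lag_px = lag_in_pixels(200.0, 10.0, 100.0, 1000)
print(round(frame_budget_ms, 1), error_deg, lag_px)  # 11.1 2.0 20.0
```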

However, it turns out that even a 2 degree head rotation does not dramatically change the perspective of how the image is drawn. Thus, if there were a way to move the image by 20 pixels on the screen (i.e. 2 degrees in the example), the resultant image would be pretty much exactly what the render engine would draw if the reported head position was changed by two degrees.

That's precisely what time-warping (or "TW" for short) does: it quickly (in less than 1 millisecond) translates the image a little bit based on how much the head rotated between the time the render engine used the head rotation reading and the time the time warping begins.

The process with time warping is fairly simple: the render engine renders and then when the render engine is done, the time-warp is quickly applied to the resultant image.

But what happens if the render engine takes more time than is available between frames? In this case, a version of time-warping called asynchronous time-warping ("ATW") is often used. ATW takes the last available frame and applies time-warping to it. If the render engine did not finish in time, ATW takes the previous frame and applies time-warping to it. If the previous frame is taken, the head probably rotated even more, so a greater shift is required. While not as ideal as having the render engine finish on time, ATW on the previous frame is still better than just missing a frame, which typically manifests itself in 'judder': uneven movement on the screen. This is why ATW is sometimes referred to as a "safety net" for rendering, acting in case the render did not complete on time. The "asynchronous" part of ATW comes from the fact that ATW is an independent process/thread from the main render engine, and runs at a higher priority than the render engine so that it can present an updated frame to the display even if the render engine did not finish on time.
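
The frame-selection logic described above can be sketched as follows. This only models the decision of which frame to warp and by how much (yaw only); it does not model the separate high-priority thread, and the class, function and timing values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RenderedFrame:
    frame_id: int
    render_yaw_deg: float  # head yaw the render engine used for this frame
    finished_at_ms: float  # time the render completed


def pick_frame_and_correction(frames, vsync_ms, latest_yaw_deg):
    """Pick the newest frame finished by vsync and the yaw correction to warp it by."""
    ready = [f for f in frames if f.finished_at_ms <= vsync_ms]
    if not ready:
        return None, 0.0
    frame = max(ready, key=lambda f: f.finished_at_ms)
    return frame, latest_yaw_deg - frame.render_yaw_deg


# Hypothetical timeline: frame 2 misses the display deadline at 22.2 ms.
frames = [
    RenderedFrame(1, render_yaw_deg=10.0, finished_at_ms=10.5),
    RenderedFrame(2, render_yaw_deg=12.0, finished_at_ms=24.0),
]
frame, correction = pick_frame_and_correction(frames, vsync_ms=22.2, latest_yaw_deg=14.5)
print(frame.frame_id, correction)  # 1 4.5
```

Because frame 2 was not ready, the older frame 1 is reused and needs a 4.5 degree correction; had frame 2 finished on time, the correction would have been only 2.5 degrees.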



Let's finish with a few finer technical points:

  • The time-warping example might lead one to believe that only left-right (i.e. yaw) head motion can be compensated. In practice, all three rotation directions (yaw, pitch and roll) can be compensated, as well as head position under some assumptions. For instance, OSVR actually performs 6-DOF warping based on an assumption of objects that are 2 meters from the center of projection. It handles rotation about the gaze direction and approximates all other translations and rotations.
  • Moving objects in the scene - such as hands - will still exhibit judder if the render engine misses a frame, in spite of time-warping. 
  • For time-warping to work well, the rendered frame needs to be somewhat bigger than the size of the display. Otherwise, when shifting the image one might end up shifting empty pixels into the visible area. Exactly how much larger the rendered frame needs to be depends on the frame rate and the expected velocity of the head rotation. Larger frames mean more pixels to render and more memory, so time warping is not completely 'free'.
  • If the image inside the HMD is rendered onto a single display (as opposed to two displays - one per eye), time warping might want to use different warping amounts for each eye because typically one eye would be drawn on screen before the other.
  • Objects such as a menu that are in "head space" (i.e. should be fixed relative to the head) need to be rendered and submitted to the time-warp code separately since they should not be modified for projected head movement.
  • Predictive tracking (estimating future pose based on previous reads of orientation, position and angular/linear velocity) can help as input to the render engine, but an actual measurement is always preferable to estimation of the future pose.
  • Depending on the configuration of the HMD displays, there may be some rendering delay between left eye and right eye (for instance, if the screen is a portrait-mode screen, renders top to bottom and the left eye maps to the top part of the screen). In this case, one can use different time warp values for each eye.
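
The point above about rendering a frame larger than the display can be illustrated with a simplified, translation-only warp. Real implementations reproject the image rather than just shifting it; the function names, the sign convention for the shift and the tiny test frame are assumptions for illustration.

```python
import math


def margin_pixels(max_head_rate_deg_per_s, warp_interval_ms, pixels_per_degree):
    """Extra pixels needed on each side so that a worst-case shift stays inside the frame."""
    max_shift_deg = max_head_rate_deg_per_s * warp_interval_ms / 1000.0
    return math.ceil(max_shift_deg * pixels_per_degree)


def warp_crop(frame, display_w, display_h, shift_x_px, shift_y_px=0):
    """Cut a display-sized window out of an oversized frame, offset from its centre."""
    frame_h, frame_w = len(frame), len(frame[0])
    x0 = (frame_w - display_w) // 2 + shift_x_px
    y0 = (frame_h - display_h) // 2 + shift_y_px
    # Clamp so the window never leaves the rendered area.
    x0 = max(0, min(x0, frame_w - display_w))
    y0 = max(0, min(y0, frame_h - display_h))
    return [row[x0:x0 + display_w] for row in frame[y0:y0 + display_h]]


# Tiny example: an 8x4 "frame" whose pixel values are their own column index.
frame = [list(range(8)) for _ in range(4)]
print(warp_crop(frame, display_w=4, display_h=2, shift_x_px=1)[0])  # [3, 4, 5, 6]
print(margin_pixels(200.0, 10.0, 10.0))  # 20
```

With a head rate of 200 degrees/second, a 10 millisecond interval and 10 pixels per degree, about 20 extra pixels are needed on each side, which is the cost in rendered pixels and memory that the list mentions.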



Saturday, March 19, 2016

The temporary job of a VR bridesmaid

At GDC - the Game Developers Conference - I saw quite a lot of HTC Vive demos. Compare these two images:



See any similarities? Just like the bridesmaid carries the bride's train, the "VR bridesmaid" carries the wire for the VR user.

Low-latency wireless video links have been on the market for years. I hope to see some in consumer VR very soon, so that the "VR bridesmaid" can be a thing of the past.

Tuesday, March 15, 2016

Action Items from the OSVR Software Developer Survey

A couple of weeks ago, we surveyed the OSVR community about what they would like the core software development team to focus on. You can see the questions and answers here.

Based on this input, the Sensics development team met and decided to focus on the following items:

1. Smoother end-user experience
This includes:

  • 1-click installer for both end-user and developer
  • Lightweight software that will detect when new hardware is connected and help with the process of obtaining and installing the relevant drivers.
  • A graphical configurator to help the end-user select the HMD, input and output devices as well as configure key parameters for each
2. Continue to add device support. An immediate focus is to add support for the HTC Vive, so that an end-user can obtain software written on the OSVR framework and, using a simple configurator, decide what to run it on: for instance 1) HTC Vive; 2) Oculus + hand controller; 3) OSVR HDK and more.

3. Make it easier for the OSVR community to contribute to the platform by listing key development priorities as well as high-level directions on how to perform certain development tasks.

Keep an eye on the progress of the OSVR software platform in the coming weeks.

Thank you for your feedback!

Monday, February 15, 2016

Vision Summit 2016: Using OSVR to Support (practically) Any Device in AR/VR

I delivered this presentation last week at the Vision Summit 2016.

Full video:


Slides only:



The key point in the presentation is that no one wants to write AR/VR applications that work only on one device. To put a positive frame on it, the ability of applications to work across a wide range of displays, inputs and output devices is valuable to practically everyone:

  • Content providers want their applications to be used on the widest range of possible devices.
  • Makers of VR displays, input or output devices don't want to settle for a few pieces of content that are written specifically for them; they want access to a wide range of content.
  • Consumers want 2016 applications to work on 2017 hardware without having to buy upgrades or wonder if their new hardware will ever be supported.
OSVR achieves just that - it allows runtime choice of what input, output and display devices to use and the presentation illustrates this. OSVR supports numerous devices today, with new devices being added every week.