Saturday, April 30, 2016

VR and AR in 12 variations

I've been thinking about how to classify VR and AR headsets and am starting to look at them along three dimensions (no pun intended):
  1. VR vs AR
  2. PC-powered vs. Phone-powered vs. Self-powered. This looks at where the processing and video generation come from. Is it connected to a PC? Is it using a standard phone? Or does it embed processing inside the headset?
  3. Wide field of view vs. Narrow FOV
This generates a total of 2 x 3 x 2 = 12 options, as follows:
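As a quick sanity check, the 12 combinations can be enumerated programmatically (a minimal sketch; the axis labels are simply the ones used in this post):

```python
from itertools import product

# The three classification axes described above
reality = ["VR", "AR"]
power = ["PC-powered", "Phone-powered", "Self-powered"]
fov = ["Wide-field", "Narrow-field"]

# Every combination of the three axes, numbered 1..12
# (the ordering matches the numbering used in the sections below)
variations = [f"{i}: {r}, {p}, {f}"
              for i, (r, p, f) in enumerate(product(reality, power, fov), start=1)]

for v in variations:
    print(v)
```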


Example and typical use

1: VR, PC-powered, Wide-field
Examples: Oculus, HTC Vive, Sensics dSight, OSVR HDK. This immersive VR configuration is used in many applications, though the most popular one is gaming. One attribute that separates consumer-grade goggles like the HTC Vive from professional-grade goggles such as the Sensics dSight is pixel density: the number of pixels per degree. You can think about this as the difference between watching a movie on a 50-inch standard-definition TV as opposed to a 50-inch HDTV.

2: VR, PC-powered, Narrow-field
Example: Sensics zSight 1920. With a given number of pixels per eye, narrow-field systems allow for much higher pixel density, which allows observing fine details or very small objects. For instance, imagine that you are training to land a UAV. The first step in landing a UAV is spotting it in the sky. The higher the pixel density is, the farther out you can spot an object of a given size. The zSight 1920 has about 32 pixels/degree whereas a modern consumer goggle like the HTC Vive has less than half that.

3: VR, Phone-powered, Wide-field
Examples: Samsung Gear VR, Google Cardboard, Zeiss VR One. This configuration, where the phone is inserted into some kind of holder, is used for general-purpose mobile VR. The advantages of this configuration are its portability as well as its low cost - assuming you already own a compatible phone. The downside is that the processing power of a phone is inferior to that of a high-end PC, so the experience is more limited in terms of frame rate and scene complexity. Today's phones were not fully designed with VR in mind, so there are sometimes concerns about overheating and battery life.

4: VR, Phone-powered, Narrow-field
Example: LG 360 VR. In this configuration, the phone is not carried on the head but rather connected via a thin wire to a smaller unit on the head. The advantage of this configuration is that it can be very lightweight and compact. Also, the phone could potentially be used as an input pad. One downside is that the phone is connected via a cable. Another downside is often the cost: because this configuration does not use the phone's screen, it needs to include its own screens, which adds to the cost. A further downside is that the phone camera cannot be used for video see-through or for sensing.

5: VR, Self-powered, Wide-field
Examples: Gameface Labs, Pico Neo. These configurations aim for standalone, mobile VR without requiring a mobile phone. They potentially save weight by not using unnecessary phone components such as the casing and touch screen, but would typically be more expensive than phone-based VR for users that already own the phone. They might have additional flexibility with regard to which sensors to include, camera placement and battery location. They are more difficult to upgrade relative to a phone-based VR solution, but the fact that the phone cannot be taken out might be an advantage for applications such as public VR, where a fully-integrated system that cannot be easily taken apart is a plus.

6: VR, Self-powered, Narrow-field
Example: Sensics SmartGoggles. These configurations are less popular today. Even the Sensics SmartGoggles, which included an on-board Android processor as well as wide-area hand sensors, were built with a relatively narrow field of view (60 degrees) because of the components available at the time.

7: AR, PC-powered, Wide-field
Example: Meta 2. In many augmented reality applications, people ask for a wide field of view so that, for instance, a virtual object that appears overlaid on the real world does not disappear when the user looks to the side. This configuration may end up being transient because in many cases the value of augmented reality is in being able to interact with the real world, and the user's motion when tethered to a PC is more limited. However, one might see uses in applications such as engineering workstations.

8: AR, PC-powered, Narrow-field
I am not aware of good examples of this configuration. It combines the limitations of narrow-field AR with the tether to the PC.

9: AR, Phone-powered, Wide-field
This could become one of the standard AR configurations, just like phone-powered, wide-field VR is becoming a mainstream configuration. To get there, processing power and optics/display technology need to catch up with the requirements.

10: AR, Phone-powered, Narrow-field
Example: Seebright. In this configuration, a phone is worn on the head and its screen becomes the display for the goggles. Semi-transparent optics combine phone-generated imagery with the real world. I believe this is primarily a transient configuration until wide-field models appear.

11: AR, Self-powered, Wide-field
I am unaware of current examples of this configuration, though one would assume it could be very attractive because it combines mobility with the ability to interact in a wide field of view.

12: AR, Self-powered, Narrow-field
Examples: Microsoft HoloLens, Google Glass, Vuzix M300. There are two types of devices here. One is an 'information appliance' like Google Glass, designed to provide contextually-relevant information without taking over the field of view. These configurations are very attractive in industrial settings for applications like field technicians, workers in a warehouse or even customer service representatives needing a mobile, wearable terminal, often to connect with a cloud-based database. The second type of device, exemplified by the HoloLens, seeks to augment reality by placing virtual objects locked in space. I am sure the HoloLens would like to be a wide-field model; it is narrow-field at the moment because of the limitations of its current display technology.

Looking forward to feedback and comments.

Monday, April 11, 2016

Understanding Foveated Rendering

Foveated rendering is a rendering technique that takes advantage of the fact that the resolution of the eye is highest in the fovea (the central vision area) and lower in the peripheral areas. As a result, if one can sense the gaze direction (with an eye tracker), GPU computational load can be reduced by rendering an image that has higher resolution in the direction of gaze and lower resolution elsewhere.

The challenge in turning this from theory to reality is to find the optimal function and parameters that maximally reduce GPU computation while maintaining highest quality visual experience. If done well, the user shouldn’t be able to tell that foveated rendering is being used. The main questions to address are:
  1. In what angle around the center of vision should we keep the highest resolution? 
  2. Is there a mid-level resolution that is best to use? 
  3. What is the drop-off in “pixel density” between central and peripheral vision? 
  4. What is the maximum speed that the eye can move? This question is important because even though the eye is normally looking at the center of the image, the eye can potentially rotate so that the fovea is aimed at image areas with lower resolution.
Let's address these questions:

1. In what angle around the center of vision should we keep the highest resolution?

Source: Wikipedia
The macula portion of the retina is responsible for fine detail. It spans the central 18˚ around the gaze point, or 9˚ eccentricity (the angular distance away from the center of gaze). This would be the best place to put the boundary of the inner layer. Fine detail is processed by cones (as opposed to rods), and at eccentricities past 9˚ you see a rapid fall off of cone density, so this makes sense biologically as well. Furthermore, the “central visual field” ends at 30˚ eccentricity, and everything past that is considered periphery. This is a logical spot to put the boundary between the middle and outermost layer for foveated rendering.
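The two boundaries described above can be expressed as a simple lookup (a sketch; the function name and layer labels are mine, but the 9˚ and 30˚ figures come from the text):

```python
def foveation_layer(eccentricity_deg):
    """Map angular distance from the gaze center to a rendering layer,
    using the 9 deg (macula) and 30 deg (central visual field) boundaries."""
    if eccentricity_deg <= 9.0:
        return "inner"    # full resolution: fovea/macula region
    elif eccentricity_deg <= 30.0:
        return "middle"   # mid resolution: rest of the central visual field
    else:
        return "outer"    # low resolution: periphery

print(foveation_layer(5))   # well inside the macula -> inner
print(foveation_layer(20))  # central field, past the macula -> middle
print(foveation_layer(45))  # periphery -> outer
```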

2. Is there a mid-level resolution that is best to use?  and 3. What is the drop-off in “pixel density” between central and peripheral vision? 

Some vendors such as Sensomotoric Instruments (SMI) use an inner layer at full native resolution, a middle layer at 60% resolution, and an outer layer at 20% resolution. When selecting the resolution dropoff, it is important to ensure that at the layer boundaries, the resolution is at or above the eye’s acuity at that eccentricity. At 9˚ eccentricity, acuity drops to 20% of the maximum acuity, and at 30˚ acuity drops to 7.4% of the max acuity. Given this, it appears that SMI’s values work, but are generous compared to what the eye can see.
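The boundary check described above can be written out explicitly (a sketch using the SMI resolutions and acuity figures quoted in the text):

```python
# Layer resolutions reportedly used by SMI (fractions of native resolution)
layers = {"inner": 1.0, "middle": 0.60, "outer": 0.20}

# Relative acuity of the eye where each lower-resolution layer takes over
acuity_at_boundary = {
    "middle": 0.20,   # acuity at 9 deg eccentricity, start of the middle layer
    "outer": 0.074,   # acuity at 30 deg eccentricity, start of the outer layer
}

# Each layer's resolution should be at or above the eye's acuity at the
# eccentricity where that layer begins; print the headroom for each
for layer, acuity in acuity_at_boundary.items():
    margin = layers[layer] / acuity
    print(f"{layer}: {layers[layer]:.0%} resolution vs {acuity:.1%} acuity "
          f"({margin:.1f}x headroom)")
```

The headroom factors above 1.0 are what makes SMI's values "generous compared to what the eye can see."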

4.    What is the maximum speed that the eye can move?

Source: Indiana University
A saccade is a rapid movement of the eye between fixation points. Saccade speed is determined by the distance between the current gaze and the stimulus. If the stimulus is as far as 50˚ away, peak saccade velocity can get up to around 900˚/sec. This is important because you want the high-resolution layer to be large enough that the eye can't move to the lower-resolution portion in the time it takes to get the gaze position and render the scene. So if system latency is 20 msec and the eye can move at 900˚/sec, the eye could move 18˚ in that time, meaning you would want the inner (highest-resolution) layer radius to be greater than that - but that is only if the stimulus presented is 50˚ away from the current gaze.
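The worst-case eye travel above is just latency times peak saccade velocity (a sketch; the function name is mine):

```python
def min_inner_radius_deg(latency_s, peak_saccade_deg_per_s=900.0):
    """Worst-case eye travel during one motion-to-photon interval.

    The inner (full-resolution) layer radius should exceed this so the
    fovea cannot land on a lower-resolution region before the next update.
    """
    return latency_s * peak_saccade_deg_per_s

# 20 ms system latency with a 900 deg/s peak saccade -> 18 degrees
print(min_inner_radius_deg(0.020))
```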

Additional thoughts

Source: Vision and Ocular Motility by Gunter Noorden

Visual acuity decreases on the temporal side (e.g. towards the ear) somewhat more rapidly than on the nasal side. It also decreases more sharply below and, especially, above the fovea, so that lines connecting points of equal visual acuity are elliptic, paralleling the outer margins of the visual field. Following this, it might make sense to render the different layers in ellipses rather than circles. The image shows the lines of equal visual acuity for the visual field of the left eye – so one can see that it extends farther to the left (temporal side) for the left eye, and for the right eye visual field would extend farther to the right.

For additional reading

This paper from Microsoft Research is particularly interesting.
It approaches the foveated rendering problem in a more technical way - optimizing to find layer parameters based on a simple but fundamental idea: for a given acuity falloff line, find the eccentricity layer sizes that support at least that much resolution at every eccentricity, while minimizing the total number of pixels across all layers. The paper explains the methodology but does not give results for the resolution values and layer sizes.

Note: special thanks to Emma Hafermann for her research on this post

For additional VR tutorials on this blog, click here
Expert interviews and tutorials can also be found on the Sensics Insight page here

Tuesday, April 5, 2016

Time-warp Explained

In the context of virtual reality, time warp is a technique to reduce the apparent latency between head movement and the corresponding image that appears inside an HMD.

In an ideal world, the rendering engine would render an image using the measured head pose (orientation and position) immediately before the image is displayed on the screen. However, in the real world, rendering takes time, so the rendering engine uses a pose reading that is a few milliseconds before the image is displayed on the screen. During these few milliseconds, the head moves, so the displayed image lags a little bit after the actual pose reading.

Let's take a numerical example. Assume we need to render at 90 frames per second, so there are approximately 11 milliseconds to render each frame. Let's assume that head tracking data is available pretty much continuously but that rendering takes 10 milliseconds. Knowing the rendering time, the rendering engine starts rendering as late as possible, which is 10 milliseconds before the frame needs to be displayed. Thus, the rendering engine uses head tracking data that is 10 milliseconds old. If the head rotates at a rate of 200 degrees/second, these 10 milliseconds are equivalent to 2 degrees. If the horizontal field of view of the HMD is 100 degrees and there are 1000 pixels across the visual field, a 2-degree error means that the image lags actual head movement by about 20 pixels.
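The arithmetic in this example generalizes to a one-line formula (a sketch; the function name is mine):

```python
def latency_error_pixels(head_rate_deg_s, latency_s, hfov_deg, h_pixels):
    """Image lag, in pixels, caused by rendering with a stale head pose."""
    error_deg = head_rate_deg_s * latency_s   # how far the head turned
    pixels_per_deg = h_pixels / hfov_deg      # display pixel density
    return error_deg * pixels_per_deg

# 200 deg/s head rotation, 10 ms old pose, 100 deg FOV, 1000 px across:
# reproduces the ~20-pixel lag from the example above
print(latency_error_pixels(200, 0.010, 100, 1000))
```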

However, it turns out that even a 2-degree head rotation does not dramatically change the perspective from which the image is drawn. Thus, if there were a way to move the image by 20 pixels on the screens (i.e. 2 degrees in the example), the resultant image would be pretty much exactly what the render engine would draw if the reported head position were changed by two degrees.

That's precisely what time-warping (or "TW" for short) does: it quickly (in less than 1 millisecond) translates the image a little bit based on how much the head rotated between the time the render engine used the head rotation reading and the time the time warping begins.

The process with time warping is fairly simple: the render engine renders and then when the render engine is done, the time-warp is quickly applied to the resultant image.

But what happens if the render engine takes more time than is available between frames? In this case, a version of time-warping, called asynchronous time-warping ("ATW"), is often used. ATW takes the last available frame and applies time-warping to it: if the render engine did not finish in time, ATW takes the previous frame and applies time-warping to it. If the previous frame is used, the head has probably rotated even more, so a greater shift is required. While not as good as having the render engine finish on time, ATW on the previous frame is still better than just missing a frame, which typically manifests itself as 'judder' - uneven movement on the screen. This is why ATW is sometimes referred to as a "safety net" for rendering, acting in case the render did not complete on time. The "asynchronous" part of ATW comes from the fact that ATW is an independent process/thread from the main render engine and runs at a higher priority, so that it can present an updated frame to the display even if the render engine did not finish on time.
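A toy 1-D model makes the two ATW cases concrete (a sketch; all names are mine, and yaw-only warping is a deliberate simplification of what real compositors do):

```python
def apply_time_warp(pixels_per_deg, rendered_yaw_deg, current_yaw_deg):
    """Horizontal pixel shift that compensates for a stale-pose render:
    translate the image by however far the head turned since the frame's
    pose was sampled."""
    return (current_yaw_deg - rendered_yaw_deg) * pixels_per_deg

def select_frame_atw(new_frame_ready, new_frame, previous_frame):
    """ATW's 'safety net': reuse the previous frame when rendering is late."""
    return new_frame if new_frame_ready else previous_frame

# Renderer finished on time: warp the fresh frame by a small amount
shift = apply_time_warp(10, rendered_yaw_deg=0.0, current_yaw_deg=2.0)
print(shift)  # 2 degrees at 10 px/deg -> 20.0 px shift

# Renderer missed the frame: the previous frame is reused; its pose is
# older, the head has turned farther, so a larger shift is needed
late_shift = apply_time_warp(10, rendered_yaw_deg=-2.0, current_yaw_deg=2.0)
print(late_shift)  # 4 degrees of accumulated rotation -> 40.0 px shift
```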

Let's finish with a few finer technical points:

  • The time-warping example might lead one to believe that only left-right (e.g. yaw) head motion can be compensated. In practice, all three rotation directions - yaw, pitch and roll - can be compensated, as can head position under some assumptions. For instance, OSVR actually performs 6-DOF warping based on the assumption that objects are 2 meters from the center of projection. It handles rotation about the gaze direction and approximates all other translations and rotations.
  • Moving objects in the scene - such as hands - will still exhibit judder if the render engine misses a frame, in spite of time-warping. 
  • For time-warping to work well, the rendered frame needs to be somewhat bigger than the size of the display. Otherwise, when shifting the image one might end up shifting empty pixels into the visible area. Exactly how much larger the rendered frame needs to be depends on the frame rate and the expected velocity of the head rotation. Larger frames mean more pixels to render and more memory, so time warping is not completely 'free'.
  • If the image inside the HMD is rendered onto a single display (as opposed to two displays - one per eye), time warping might want to use different warping amounts for each eye because typically one eye would be drawn on screen before the other.
  • Objects such as a menu that are in "head space" (e.g. should be fixed relative to head) need to be rendered and submitted to the time-warp code separately since they should not be modified for projected head movement.
  • Predictive tracking (estimating future pose based on previous reads of orientation, position and angular/linear velocity) can help as input to the render engine, but an actual measurement is always preferable to estimation of the future pose.
  • Depending on the configuration of the HMD displays, there may be some rendering delay between left eye and right eye (for instance, if the screen is a portrait-mode screen, renders top to bottom and the left eye maps to the top part of the screen). In this case, one can use different time warp values for each eye.
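The render-target padding mentioned in the list above can be estimated with a quick calculation (a sketch; the 300˚/sec worst-case head rate is an assumed figure for illustration, not from the text):

```python
def overscan_margin_px(max_head_rate_deg_s, frame_time_s, pixels_per_deg):
    """Extra pixels to render on each side of the display so that
    time-warp never shifts empty pixels into the visible area."""
    worst_case_shift_deg = max_head_rate_deg_s * frame_time_s
    return worst_case_shift_deg * pixels_per_deg

# 300 deg/s worst-case head rate, 90 Hz frames, 10 px/deg
print(overscan_margin_px(300, 1 / 90, 10))  # ~33 px of margin per side
```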
