Tuesday, December 22, 2015

A 2014 WebVR Challenge and Review

I almost can't believe it was nearly 2 years ago when I started to think about WebVR as a serious and quite new presentation platform for the web. The WebGL implementation in Internet Explorer 11 was state of the art and you could do some pretty amazing things with it. While there were still a few minor holes, customers were generally happy and the performance was great.

A good friend, Reza Nourai, had his DK1 at the time and was trying out a bunch of experimental optimizations. He was working on the DX12 team, so you can imagine he knew his stuff and could find performance problems both in the way we built games/apps for VR and in the way the hardware serviced all of the requests. In fact, it wasn't long after our Hackathon that he got a job at Oculus and gained his own bit of power over the direction that VR was heading ;-) For the short time I had access to the GPU mad scientist, we decided that if Microsoft was going to give us two days for the first ever OneHack, then we'd play around with developing an implementation of this spec we kept hearing about, WebVR, and attaching it to the DK1 we had access to.

This became our challenge. Over the course of 2 days, implement the critical parts of the WebVR specification (that was my job, I'm the OM expert for the IE team), get libVR running in our build environment and finally attach the business end of a WebGLRenderingContext to the device itself so we could hopefully spin some cubes around. The TL;DR is that we both succeeded and failed. We spent more time redesigning the API and crafting it into something useful than simply trying to put something on the screen. So in the end, we wound up with an implementation that blew up the debug runtime and rendered a blue texture. This eventually fixed itself with a new version of libVR, but that was many months later. We never did track down why we hit this snag, nor was it important. We had already done what we set out to do: integrate and build an API set so we had something to play around with, something to find all of the holes and issues and things we wanted to make better. It's from this end point that many ideas and understandings came, and I hope to share these with you now.

Getting Devices

Finding and initializing devices or hooking up a protected process to a device is not an easy task. You might have to ask the user's permission (blocking) or any number of other things. At the time the WebVR implementation did not have a concept for how long this would take, and it did not return a Promise or have any other asynchronous completion object (a callback, for instance) that would let you continue to run your application code and respond to user input, such as trying to navigate away. This is just a bad design. The browser needs APIs that get out of your way as quickly as possible when long running tasks could be serviced on a background thread.

We redesigned this portion and passed in a callback. We resolved the callback in the requestAnimationFrame queue and gave you access at this point to a VRDevice. Obviously this wasn't the best approach, but our feedback, had we had the foresight at the time to approach the WebVR group, would have been, "Make this a Promise or Callback". At the time a Promise was not a browser intrinsic so we probably would have ended up using a callback, and then later, moving to a Promise instead. I'm very happy to find the current specification does make this a Promise.
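The two shapes discussed above can be sketched roughly as follows. This is a minimal illustration, not the code from the hack: `HackVRDevice`, `getVRDevicesWithCallback` and `getVRDevicesAsPromise` are hypothetical names, and a timer stands in for the slow device discovery.

```typescript
// Hypothetical device shape; the field names are assumptions for illustration.
interface HackVRDevice {
  deviceId: string;
  deviceName: string;
}

type DeviceCallback = (found: HackVRDevice[]) => void;

// Stand-in for a callback-style entry point. A real browser would do the slow
// discovery on a background thread; here a timer fakes that delay.
function getVRDevicesWithCallback(onReady: DeviceCallback): void {
  setTimeout(() => onReady([{ deviceId: "hmd-0", deviceName: "DK1" }]), 0);
}

// The same operation exposed as a Promise by wrapping the callback version.
function getVRDevicesAsPromise(): Promise<HackVRDevice[]> {
  return new Promise((resolve) => getVRDevicesWithCallback(resolve));
}

// The caller is not blocked while discovery is pending.
const pendingDevices = getVRDevicesAsPromise();
pendingDevices.then((found) => {
  const labels = found.map((d) => d.deviceName).join(", ");
  console.log(`found ${found.length} device(s): ${labels}`);
});
```

Either form lets the page keep responding to input while the devices are located; the Promise form simply composes better with other asynchronous code.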

This still comes with trade-offs. Promises are micro-tasks and serviced aggressively. Do you really want to service this request aggressively, or wait until idle time or some other time? You can always post your own idle task once you get the callback to process later. The returned value is a sequence and so it is fixed and unchanging.

The next trade-off comes when the user has a lot of VR devices. Do you init them all? Our API would let you get back device descriptors and then you could acquire them. This had two advantages. First, we could cache the previous state and return it more quickly without having to acquire the devices themselves. Second, you could get information about the devices without having to pay seconds of cost or have to immediately ask the user for permission. You might say, what is a lot? Well, imagine that I have 6 or 7 positional devices that I use along with 2 HMDs. And let's not forget positional audio, which is completely missing from current specifications.
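A rough sketch of the descriptor-then-acquire idea, under the assumption that descriptors are cheap cached records and acquisition is the expensive step. `DeviceDescriptor`, `DeviceRegistry` and their members are hypothetical names invented for this example.

```typescript
// Hypothetical two-step API: cheap descriptors first, expensive acquisition later.
interface DeviceDescriptor {
  deviceId: string;
  kind: "hmd" | "positional";
  label: string;
}

interface AcquiredDevice {
  descriptor: DeviceDescriptor;
  release(): void;
}

class DeviceRegistry {
  private acquiredIds: string[] = [];

  constructor(private cachedDescriptors: DeviceDescriptor[]) {}

  // Cheap: answers from cached state, no hardware init or permission prompt.
  listDescriptors(): DeviceDescriptor[] {
    return this.cachedDescriptors.slice();
  }

  // In a real implementation this is where initialization and any
  // permission prompt would happen; the mock only records the id.
  acquire(descriptor: DeviceDescriptor): AcquiredDevice {
    if (this.acquiredIds.indexOf(descriptor.deviceId) < 0) {
      this.acquiredIds.push(descriptor.deviceId);
    }
    return {
      descriptor,
      release: () => {
        this.acquiredIds = this.acquiredIds.filter((id) => id !== descriptor.deviceId);
      },
    };
  }

  acquiredCount(): number {
    return this.acquiredIds.length;
  }
}

const registry = new DeviceRegistry([
  { deviceId: "hmd-0", kind: "hmd", label: "Primary HMD" },
  { deviceId: "hmd-1", kind: "hmd", label: "Spare HMD" },
  { deviceId: "pos-0", kind: "positional", label: "Left hand tracker" },
]);

// Inspect everything, but only pay the acquisition cost for one HMD.
const hmdDescriptors = registry.listDescriptors().filter((d) => d.kind === "hmd");
if (hmdDescriptors.length > 0) {
  const acquired = registry.acquire(hmdDescriptors[0]);
  console.log(`acquired ${acquired.descriptor.label}; in use: ${registry.acquiredCount()}`);
}
```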

The APIs we build for this first step will likely be some of the most important we build for the entire experience. Right now the APIs cater towards the interested developer who has everything connected, is actively trying to build something with the API and is willing to work around poor user experience. Future APIs and experiences will have to be seamless and allow normal users easy access to our virtual worlds.

Using and Exposing Device Information

Having played with the concept of using the hardware device ID to tie multiple devices together, I find the arrangement very similar to how we got devices. While an enterprising developer can make sure that their environment is set up properly, we can't assert the same for the common user. For now, we should probably assume that the current way of tying devices together is sufficient, and that an average user would only have one set of hardware. But then, if that is the case, why would we separate the positional tracking from the HMD itself? We are, after all, mostly tracking the position of the HMD itself in 3D space. For this reason, we didn't implement a positional VR device at all. We simply provided the positional information directly from the HMD through a series of properties.

Let's think about how the physical devices then map to the existing web object model. For the HMD we definitely need some concept of WebVR and the ability to get a device which comprises a rendering target and some positional/tracking information. This is all a single device, so having a single device expose the information makes the API much simpler to understand from a developer perspective.

What about those wicked hand controllers? We didn't have any, but we did have some gamepads. The Gamepad API is much more natural for this purpose. Of course it needs a position/orientation on the pad so that you can determine where it is. This is an easy addition that we hadn't made. It will also need a reset so you can zero out the values and set a zero position. Current VR input hardware tends to need this constantly, if for no other reason than user psychology.
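To make the position and reset idea concrete, here is a small sketch of a tracked pad that reports its position relative to the last reset. `TrackedPad` and its members are hypothetical and are not part of the Gamepad API; orientation would follow the same pattern and is left out to keep the example short.

```typescript
interface Vec3 {
  x: number;
  y: number;
  z: number;
}

// Hypothetical tracked-controller wrapper; not part of the Gamepad API.
class TrackedPad {
  private zeroPoint: Vec3 = { x: 0, y: 0, z: 0 };
  private rawPosition: Vec3 = { x: 0, y: 0, z: 0 };

  // Called by the (imaginary) tracking system with raw sensor coordinates.
  updateRaw(raw: Vec3): void {
    this.rawPosition = { x: raw.x, y: raw.y, z: raw.z };
  }

  // Treat the current physical location as the new origin.
  resetPose(): void {
    this.zeroPoint = { x: this.rawPosition.x, y: this.rawPosition.y, z: this.rawPosition.z };
  }

  // Position relative to the most recent reset.
  get position(): Vec3 {
    return {
      x: this.rawPosition.x - this.zeroPoint.x,
      y: this.rawPosition.y - this.zeroPoint.y,
      z: this.rawPosition.z - this.zeroPoint.z,
    };
  }
}

const rightPad = new TrackedPad();
rightPad.updateRaw({ x: 2, y: 1, z: -3 });
rightPad.resetPose(); // the user zeroes the controller where it currently is
rightPad.updateRaw({ x: 3, y: 1, z: -1 });
console.log(rightPad.position); // offset from the zero point, not the raw value
```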

Since we also didn't have WebAudio and positional audio in the device at the time we couldn't really have come up with a good solution then. Exposing a set of endpoints from the audio output device is likely the best way to make this work. Assuming that you can play audio out of the PC speakers directly is likely to fail miserably. While you could achieve some 3D sound you aren't really taking advantage of the speakers in the HMD itself. More than likely you'll want to send music and ambient audio to the PC speakers and you'll want to send positional audio information, like gunshots, etc... to the HMD speakers. WebAudio, fortunately, allows us to construct and bind our audio graphs however we want making this quite achievable with existing specifications.

The rest of WebVR was "okay" for our needs for accessing device state. We didn't see the purpose of defining new types for things like rects and points. For instance DOMPoint is overkill (and I believe it was named something different before it took on the current name). There is nothing of value in defining getter/setter pairs for a type which could just be a dictionary (a generic JavaScript object). Further, it bakes in a concept like x, y, z, w natively that shouldn't be there at all and seems only to make adoption more difficult. To be fair to the linked specification, it seems to agree that other options, based solely on static methods and dictionary types, are possible.

Rendering Fast Enough

The VR sweet spot for fast enough is around 120fps. You can achieve results that don't induce simulation sickness (or virtual reality sickness) at lower frame rates, but you really can't miss a single frame and you have to have a very fast and responsive IMU (this is the unit that tracks your head movement). What we found when using canvas and window.requestAnimationFrame is that we could barely get 60fps, let alone more. The reason is the browser tends to lock to the monitor refresh rate. At the time, we also had 1 frame to commit the scene and one more frame to compose the final desktop output. That added 2 more frames of latency. That latency will dramatically impact the simulation quality.
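The arithmetic behind those numbers is simple enough to write down. The helper names below are made up for this sketch; the only inputs are the refresh rates and the two queued frames mentioned above.

```typescript
// Milliseconds available to produce one frame at a given refresh rate.
function frameBudgetMs(fps: number): number {
  return 1000 / fps;
}

// Extra delay added when finished frames wait in a queue before display.
function queuedLatencyMs(fps: number, queuedFrameCount: number): number {
  return queuedFrameCount * frameBudgetMs(fps);
}

for (const fps of [60, 75, 90, 120]) {
  console.log(`${fps}fps -> ${frameBudgetMs(fps).toFixed(1)}ms per frame`);
}
// The commit frame plus the compose frame, at a 60Hz browser refresh.
console.log(`2 queued frames at 60fps add ${queuedLatencyMs(60, 2).toFixed(1)}ms`);
```

At 60fps each frame gets about 16.7ms, so the two extra frames cost roughly 33ms before the image reaches the display. At 120fps the whole per-frame budget is only about 8.3ms.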

But we could do better. First, we could disable browser frame commits and take over the device entirely. By limiting the browser frames we could get faster intermediate outputs from the canvas. We also had to either layer or flip the canvas itself (technical details) and so we chose to layer it since flipping an overly large full-screen window was a waste of pixels. We didn't need that many, so we could get away with a much smaller canvas entirely.

We found we didn't want a browser requestAnimationFrame at all. That entailed giving up time to the browser's layout engine, it entailed sharing the composed surface and in the end it meant sharing the refresh rate. We knew that devices were going to get faster. We knew 75fps was on the way and that 90 or 105 or 120 was just a year or so away. Obviously browsers were going to crush FPS to the lowest number possible for achieving performance while balancing the time to do layout and rendering. 60fps is almost "too fast" for the web and most pages only run at 1-3 "changed" frames per second while browsers do tricks behind the scenes to make other things like user interactivity and scrolling appear to run at a much faster rate.

We decided to add requestAnimationFrame to the VRDevice instead. Now we gained a bunch of capabilities, but they aren't all obvious so I'll point them out. First, we now hit the native speeds of the devices and we sync to the device v-sync and we don't wait for the layout and rendering sub-systems of the browser to complete. We give you the full v-sync if we can. This is huge. Second, we are unbound from the monitor refresh rate and backing browser refresh rate, so if we want to run at 120fps while the browser does 60fps, we can. An unexpected win is that we could move the VRDevice off-thread entirely and into a web worker. So long as the web worker had all of the functionality we needed, or it existed on the device or other WebVR interfaces, we could do it off-thread now! You might point out that WebGL wasn't available off-thread generally in browsers, but to be honest, that wasn't a blocker for us. We could get a minimal amount of rendering available in the web worker. We began experimenting here, but never completed the transition. It would have been a couple of weeks, rather than days, of work to make the entire rendering stack work in the web worker thread at the time.
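A device-owned frame loop might look something like the sketch below. `MockVRDevice` is a hypothetical stand-in that is stepped by hand so the example runs anywhere; a real device would fire the callbacks on its own v-sync rather than from a `tick()` call.

```typescript
type DeviceFrameCallback = (timestampMs: number) => void;

// Hypothetical device that owns its own frame loop, independent of the
// browser's requestAnimationFrame and the monitor refresh rate.
class MockVRDevice {
  private queuedCallbacks: DeviceFrameCallback[] = [];
  private clockMs = 0;

  constructor(readonly refreshHz: number) {}

  requestAnimationFrame(callback: DeviceFrameCallback): void {
    this.queuedCallbacks.push(callback);
  }

  // Simulate one device v-sync: advance the clock, then run the callbacks
  // that were queued for this frame.
  tick(): void {
    this.clockMs += 1000 / this.refreshHz;
    const toRun = this.queuedCallbacks;
    this.queuedCallbacks = [];
    for (const callback of toRun) {
      callback(this.clockMs);
    }
  }
}

const mockHmd = new MockVRDevice(120);
const frameTimesMs: number[] = [];

function renderLoop(timestampMs: number): void {
  frameTimesMs.push(timestampMs); // a real loop would render the scene here
  mockHmd.requestAnimationFrame(renderLoop); // re-arm for the next device frame
}

mockHmd.requestAnimationFrame(renderLoop);
for (let i = 0; i < 3; i++) {
  mockHmd.tick();
}
console.log(frameTimesMs.map((t) => t.toFixed(1)).join(", "));
```

Because nothing in the loop touches layout or the page's own refresh, the same pattern could in principle live in a web worker, which is the direction described above.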

So we found that as long as you treat VR like another presentation mechanism with unique capabilities, you can achieve fast and fluid frame commits. You have to really break the browser's normal model to get here though. For this reason I felt that WebVR in general should focus more on how to drive content to this alternate device using a device specific API rather than on piggybacking existing browser features such as canvas, full-screen APIs, and just treating the HMD like another monitor.

Improving Rendering Quality

When we were focusing on our hack, WebVR had some really poor rendering. I'm not even sure we had proper time warp available to us in the SDK we chose and certainly the DK1 we had was rendering at a low frame rate, low resolution, etc... But even with this low resolution you still really had to think about the rendering quality. You still wanted to render a larger than necessary texture to take advantage of deformation. You still wanted to get a high quality deformation mesh that matched the user's optics profile. And you still wanted to hit the 60fps framerate. Initial WebGL implementations with naive programming practices did not make this easy. Fortunately, we weren't naive and we owned the entire stack so when we hit a stall we knew exactly where and how to work around it. That makes this next section very interesting, because we were able to achieve a lot without the restrictions of building against a black box system.

The default WebVR concept at the time was to take a canvas element, size it to the full size of the screen and then full screen it onto the HMD. In this mode the HMD is visualizing the browser window. With the original Oculus SDK you even had to go move the window into an area of the virtual desktop that showed up on the HMD. This was definitely an easy way to get something working. You simply needed to render into a window, and move that onto the HMD desktop and disable all of the basic features like deformation etc... (doing them yourself) to get things going. But this wasn't the state of the art, even at that time. So we went a step further.

We started by hooking our graphics stack directly into the Oculus SDK's initialization. This allowed us to set up all of the appropriate swap chains and render targets, while also giving us the ability to turn on and off Oculus specific features. We chose to use the Oculus deformation meshes, for instance, rather than our own since it offloaded 1 more phase out of our graphics pipeline that could be done on another thread in the background without us having to pay the cost.

That got us pretty far, but we still had a concept of using a canvas in order to get the WebGLRenderingContext back. We then told the device about this canvas and it effectively swapped over to "VR" mode. Again, this was way different than the existing frameworks that relied on using the final textures from the canvas to present to the HMD. This extra step seemed unnecessary so we got rid of it and had the device give us back the WebGLRenderingContext instead. This made a LOT of sense. This also allowed the later movement off to the web worker thread ;-) So we killed two birds with one stone. We admitted that the HMD itself was a device with all of the associated graphics context, we gave it its own rendering context, textures and a bunch of other state and simply decoupled that from the browser itself. At this point you could almost render headless (no monitor) directly to the HMD. This is not easy to debug on the screen though, but fortunately Oculus had a readback texture that would give you back the final image presented to the HMD, so we could use that texture and make it available, on demand, off of the device so we only paid the perf cost if you requested it.

At the time, this was the best we could do. We were basically using WebGL to render, but we were using it in a way that made us look a lot more like an Oculus demo, leaning heavily on the SDK. The rendering quality was as good as we could get at the time, without us going into software level tweaks. I'll talk about some of those ideas (in later posts), which have now been implemented I believe by Oculus demos and some industry teams, so they won't be anything new, but can give you a better idea of why the WebVR API has to allow for innovation and can't simply be a minimal extension of existing DOM APIs and concepts if it wants to be successful.

Improvements Since the Hackathon

We were doing this for a Microsoft Hackathon and our target was the Internet Explorer 11 browser. You might notice IE and later Microsoft Edge don't have any support for WebVR. This is due both to the infancy of the technology and to there not being a truly compelling API. Providing easy access to VR for the masses sounds great, but VR requires a very high fidelity rendering capability and great performance if we want users to adopt it. I've seen many times where users will try VR for the first time, break out the sick bag, and not go back. Even if the devices are good enough, if the APIs are not good then it will hold back the adoption rates for our mainstream users. While great for developers, WebVR simply doesn't set, IMO, the web up for great VR experiences. This is a place where we just have to do better, a lot better, and fortunately we can.

The concept of the HMD as its own rendering device seems pretty critical to me. Also, giving it its own event loop and making it available on a web worker thread go a long way to helping the overall experience and achieving 120fps rendering sometime in the next two years. But we can go even further. We do want, for instance, to be able to render both 3D content and 2D content in the same frame. A HUD is a perfect example. We want the devices to compose, where possible, these things together. We want to use time warp when we can't hit the 120fps boundaries so that there is a frame that the user can see that has been moved and shifted. Let's examine how a pre-deformed, pre-composed HUD system would look using our existing WebVR interfaces today if we turned on time warp.

We can use something like Babylon.js or Three.js for this and we can turn on their default WebVR presentation modes. By doing so, we get a canonical deformation applied for us when we render the scene. We overlay the HUD using HTML 5, probably by layering it over top of the canvas. The browser then snapshots this and presents it to the HMD. The HUD itself is now "stuck" and occluding critical pixels from the 3D scene that would be nice to have. If you turned on time warp you'd smear the pixels in weird ways and it just wouldn't look as good as if you had submitted the two textures separately.

Let's redo this simulation using the WebGLRenderingContext on the device itself and having it get a little bit more knowledge about the various textures involved. We can instead render the 3D scene in full fidelity and commit that directly to the device. It now has all of the pixels. Further, it is NOT deformed, so the device is going to do that for us. Maybe the user has a custom deformation mesh that helps correct an optical abnormality for them, we'll use that instead of a stock choice. Next we tell the device its base element for the HUD. The browser properly layers this HUD and commits that as a separate texture to the device. The underlying SDK is now capable of time warping this content for multiple frames until we are ready to commit another update and this can respond to the user as they move their head in real-time with the highest fidelity.
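The separate-texture idea can be sketched as a tiny mock compositor. Everything here is hypothetical (`SubmittedLayer`, `MockLayerCompositor` and the labels are invented for illustration); the point is only that the scene and the HUD are submitted independently, so updating one does not require resubmitting or flattening the other.

```typescript
// Hypothetical layer record: "scene" is the undistorted 3D render and "hud"
// is the browser-rendered overlay. None of these names come from WebVR.
interface SubmittedLayer {
  kind: "scene" | "hud";
  label: string;
}

// Stand-in for the device-side compositor. It keeps the latest texture of
// each kind so either one can be updated without resubmitting the other.
class MockLayerCompositor {
  private latestScene: SubmittedLayer | null = null;
  private latestHud: SubmittedLayer | null = null;

  submit(layer: SubmittedLayer): void {
    if (layer.kind === "scene") {
      this.latestScene = layer;
    } else {
      this.latestHud = layer;
    }
  }

  // Back-to-front order the device would compose on its next refresh.
  composeOrder(): string[] {
    const order: string[] = [];
    if (this.latestScene) order.push(`scene:${this.latestScene.label}`);
    if (this.latestHud) order.push(`hud:${this.latestHud.label}`);
    return order;
  }
}

const layerCompositor = new MockLayerCompositor();
layerCompositor.submit({ kind: "scene", label: "frame-1" });
layerCompositor.submit({ kind: "hud", label: "menu" });
layerCompositor.submit({ kind: "scene", label: "frame-2" }); // the HUD is untouched
console.log(layerCompositor.composeOrder().join(" | ")); // scene:frame-2 | hud:menu
```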

You might say, but if you can render at 120fps then you are good right? No, not really. That still means up to 8ms of latency between the IMU reading and your rendering. The device can compensate for this by time warping with a much smaller latency by sampling the IMU when it goes to compose the final scene in the hardware. Also, since we decomposed the HTML overlay into its own texture, we can also billboard that into the final scene, partially transparent, or however we might want to show it. The 3D scene can shine through or we can even see new pixels from "around" the HUD since we didn't break them away.


Since our hack, the devices have changed dramatically. They are offering services, either in software or in the hardware, that couldn't have been predicted. Treating WebVR like a do-it-all shop that then splashes onto a flat screen seems like it's not going to be able to take advantage of the features in the hardware itself. An API that instead gets more information from the device and allows the device to advertise features that can be turned on and off might end up being a better approach. We move from an API that is device agnostic to one that embraces the devices themselves. No matter what we build, compatibility with older devices and having something "just work" on a bunch of VR devices is likely not going to happen. There is simply too much innovation happening and the API really has to allow for this innovation without getting in the way.
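One possible shape for a device that advertises its own features is sketched below. `DeviceFeatureSet` and the feature names are hypothetical examples, not taken from any SDK or specification.

```typescript
// Hypothetical capability negotiation; the feature names are examples only.
class DeviceFeatureSet {
  private enabledNames: string[] = [];

  constructor(private advertisedNames: string[]) {}

  advertises(featureName: string): boolean {
    return this.advertisedNames.indexOf(featureName) >= 0;
  }

  // Returns false instead of throwing so callers can fall back gracefully
  // when a device does not offer a feature.
  enable(featureName: string): boolean {
    if (!this.advertises(featureName)) {
      return false;
    }
    if (this.enabledNames.indexOf(featureName) < 0) {
      this.enabledNames.push(featureName);
    }
    return true;
  }

  enabledList(): string[] {
    return this.enabledNames.slice();
  }
}

// An older device that offers a distortion mesh but no time warp.
const olderDeviceFeatures = new DeviceFeatureSet(["distortion-mesh", "readback-texture"]);
const gotTimeWarp = olderDeviceFeatures.enable("time-warp");
const gotDistortion = olderDeviceFeatures.enable("distortion-mesh");
console.log(`time warp: ${gotTimeWarp}, distortion mesh: ${gotDistortion}`);
```

The application asks for what it wants and adapts to the answer, instead of the API pretending every device is the same.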

Our current WebVR specifications are pretty much the same now as they were when we did our hack in 2014. It's been almost 2 years now and the biggest improvement I've seen is the usage of the Promise capability. I don't know what advances a specification such as WebVR, but I'm betting the commercial kit from Oculus coming out in 2016 will do the trick. With a real device gaining broad adoption there will likely be a bigger push to get something into all of the major browsers.


  1. Interesting read. You appear to be in a position of some influence on how this effort will take root within MS, so please please please please please ensure that Microsoft breaks with previous patterns and works collaboratively with the folks working on WebVR from Chrome and FF.

    1. I would honestly say the era of Microsoft not working collaboratively ended a few years ago. IE 9 or maybe even bits of 10 had some proprietary stuff as we were ramping up our standards engagement still. We were very active though even during 9 and 10, contributing to HTML 5 specs, having a lot of people on various standards committees and generally trying to contribute back where we could.

      Some great examples are the CSS 2.1 test suite which we contributed to significantly. Also, the WebGL conformance suite which Rafael Cintron has committed many changes to.

      Right now, WebVR just doesn't seem real. The spec grew organically based on the technology at the time and as such really needs a revamp. This is the problem with building an API set for an early technology such as VR: you put out APIs that you HAVE to deprecate. The web isn't really prepared to do that in most cases. It's not a problem with compiled software since you ship the library with you, though it is a problem when your library fails to run on the latest drivers and devices and you have to recompile. There isn't a similar thing for the web. Had we implemented the existing WebVR spec and shipped it, I guarantee 6 months from now we'd all be scrambling to pull it out and replace it with something better, taking many breaking changes in the process.

      I will most certainly be working in collaboration with other vendors here. I wanted to get these thoughts out so I had something to point to when I had conversations. I'll probably go even further and expand on some of these issues so that I have a canonical example, probably with nice graphs and such, to demonstrate why we need to change. Keep watching the blog and I'm sure you'll find it interesting ;-)