Replacing Reality on the Vision Pro


Introduction

It’s been a while since I posted, so I thought I’d write about a subject not many people know about.

In undergrad, I wrote a research thesis on diminished reality, an interesting subfield of mixed reality, and I’d like to share some of the fun I’ve been having building a DR pipeline on the Apple Vision Pro (AVP) for this project. The thesis is here if you would like to read it in full: Graduation Thesis.

What is DR?

Diminished Reality (DR) is a subset of mixed reality focused on removing, replacing, or visually minimizing objects in the user’s environment by combining computer vision and spatial computing. A typical pipeline uses precise real-time object segmentation and image inpainting to replace the masked region with geometrically consistent background texture. To maintain “visual invisibility,” the system must close the observation–diminishment loop: the synthesized content has to stay consistent with 6DOF pose tracking and local epipolar constraints as the viewpoint changes.
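To unpack that last constraint: for a point x in one frame and its corresponding point x′ in a frame captured from a new head pose, the standard two-view epipolar relation is

```latex
x'^{\top} F \, x = 0
```

where F is the fundamental matrix between the two views. For DR, this means the synthesized background has to reproject consistently as your head moves; otherwise the inpainted patch visibly “swims” against the real scene.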

Current research focuses on overcoming latency bottlenecks in dense semantic scene reconstruction and ensuring temporal coherence under dynamic illumination using generative synthesis models.

The seminal work in this area, where the authors explore simpler techniques, is Towards Understanding Diminished Reality by Cheng et al.

The principal pipeline can be broken down into two major steps: observation and diminishing. Once we master both steps, the pipeline becomes robust and seamless.

Full AVP Diminishing Pipeline
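Roughly, a single frame flows through the loop like this (a minimal Swift sketch; segmentTarget and inpaintAndComposite are stand-in names for the two stages detailed below, not real APIs):

```swift
import CoreVideo

/// One VST frame through the DR loop: observe, then diminish.
func diminish(frame: CVPixelBuffer) {
    // 1. Observation: locate the target and produce a binary ROI mask.
    guard let mask = segmentTarget(in: frame) else { return } // target not visible this frame
    // 2. Diminishing: synthesize background over the mask and composite it back.
    inpaintAndComposite(frame: frame, mask: mask)
}

func segmentTarget(in frame: CVPixelBuffer) -> CVPixelBuffer? {
    nil // stub: see "Identifying and Observing the Object"
}

func inpaintAndComposite(frame: CVPixelBuffer, mask: CVPixelBuffer) {
    // stub: see "Diminishing the Object"
}
```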

Identifying and Observing the Object

The observation stage takes the form of a multi-modal image acquisition pipeline from the headset, using either the photo/video camera on the HoloLens 2 or the video see-through (VST) pipeline of the Apple Vision Pro. I found it incredibly hard to diminish anything on the HoloLens: its optical see-through display only adds light to the scene, so it can never fully overwrite what the user sees. To isolate the target, I used a point-prompt-based segmentation strategy with lightweight models like EdgeSAM or FastSAM to generate precise binary masks (although, with the rate at which new SAM models are being developed, I would love to try this on newer versions).

On the Vision Pro, this perception is further enhanced by the built-in photogrammetry-based 3D object tracking system, which uses .usdz reference models to keep object recognition robust across varying viewpoints. I also tried optimizing the segmentation and tracking modules via Core ML and Metal shaders to minimize latency by processing data directly on the device’s Neural Engine. The final output of this stage is a dual-input packet containing the raw RGB frame and the isolated Region of Interest (ROI) mask, which provides the necessary spatial context for the subsequent synthesis.

Example: The Original Scene
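To give a flavor of the Core ML side, here is a minimal sketch of running a compiled segmentation model on one frame through the Vision framework and bundling the dual-input packet. ObservationPacket and the model file are placeholder names of mine; a true point-prompted SAM variant takes the prompt as a second model input, which would mean using the raw MLModel prediction API rather than a single-input VNCoreMLRequest.

```swift
import Vision
import CoreML
import CoreVideo

/// Output of the observation stage: the raw frame plus the binary ROI mask.
struct ObservationPacket {
    let frame: CVPixelBuffer // raw RGB frame from the VST pipeline
    let mask: CVPixelBuffer  // binary segmentation mask isolating the target
}

/// Runs a compiled Core ML segmentation model (e.g. an EdgeSAM/FastSAM port)
/// on a single frame and returns the frame-plus-mask packet.
func observe(frame: CVPixelBuffer, modelURL: URL) throws -> ObservationPacket {
    let config = MLModelConfiguration()
    config.computeUnits = .all // let Core ML schedule work onto the Neural Engine

    let model = try VNCoreMLModel(for: MLModel(contentsOf: modelURL, configuration: config))
    let request = VNCoreMLRequest(model: model)
    request.imageCropAndScaleOption = .scaleFill

    try VNImageRequestHandler(cvPixelBuffer: frame, options: [:]).perform([request])

    // Image-to-image models surface their output as a pixel buffer observation;
    // here that buffer is the binary mask over the target object.
    guard let mask = (request.results?.first as? VNPixelBufferObservation)?.pixelBuffer else {
        throw NSError(domain: "Observation", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "model produced no mask"])
    }
    return ObservationPacket(frame: frame, mask: mask)
}
```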

Diminishing the Object

To inpaint over the object, I pass the masked ROI into a deep inpainting engine, specifically LaMa. This engine reconstructs the masked pixels by synthesizing a background that maintains structural and textural continuity, although it could be better optimized for fast-feedback systems such as mine. To ensure the result is geometrically consistent with the physical room, the system maps the inpainted texture onto a localized 3D mesh rather than a simple 2D plane. Techniques such as triplanar mapping are used during this rendering phase to prevent texture stretching on irregular surfaces and maintain visual fidelity. Finally, the processed content is composited back into the user’s view through the headset’s rendering pipeline, effectively “diminishing” the physical object in real time. See some examples below for yourself!

Example: The Inpainted Scene
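If triplanar mapping is new to you: the inpainted texture is sampled three times using world-space coordinates projected onto the YZ, XZ, and XY planes, and the three samples are blended with weights derived from the surface normal, so no single projection stretches across a surface it faces edge-on. The sampling itself lives in a shader, but the weighting math is tiny; here is a minimal sketch (triplanarWeights is my own illustrative helper, not an API):

```swift
import simd

/// Blend weights for triplanar mapping, derived from a unit surface normal.
/// Each component says how much the texture projected along that axis
/// (onto the YZ, XZ, and XY planes respectively) contributes at this point.
func triplanarWeights(normal: SIMD3<Float>) -> SIMD3<Float> {
    let a = abs(normal)               // only axis alignment matters, not sign
    let sharpened = (a * a) * (a * a) // 4th power tightens the blend seams
    return sharpened / (sharpened.x + sharpened.y + sharpened.z)
}

// A surface facing mostly upward takes nearly all of its texture from the
// floor-plane (XZ) projection, hiding stretching on the steeper axes:
let w = triplanarWeights(normal: simd_normalize(SIMD3<Float>(0.2, 0.9, 0.1)))
// w.y ≈ 0.99, so the Y-axis projection dominates
```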

Samples

Here are some more examples from my work. I mostly just used whatever objects were on hand and easy to model with photogrammetry. The practical use case here is diminishing your phone – a package of everlasting entertainment and boredom-killers at your fingertips. I was interested in how we could mitigate phone addiction and distraction and improve attention in mixed reality, and this study was one example of doing so:

Diminished Phone On Window
Diminished Phone On Window 2

In the background you can see the beautiful Duke scenery, and how the LaMa model tries to compensate and guess those spaces with deep learning.

As a side note, I also evaluated the immersiveness of my diminished scenes with VLMs, similar to what I did for this article on using VLMs in mixed reality spaces.

Limitations and Future Work