Apple’s Machine Learning Research wing has developed a foundation AI model “for zero-shot metric monocular depth estimation.” Called Depth Pro, it enables high-speed generation of detailed 3D depth maps from a single two-dimensional image.
Our brains process visual information from two image sources – our eyes. Each has a slightly different view of the world, and the two views are fused into a single stereo image, with the differences between them also helping us gauge how close or far objects are.
Many cameras and smartphones look at life through a single lens, but three-dimensional depth maps can still be created using information hidden in the metadata of 2D photos (such as focal length and sensor details) or estimated from multiple images.
Depth Pro doesn’t need any of that, yet it can generate a detailed 2.25-megapixel 3D depth map from a single image in 0.3 seconds on a standard graphics processing unit.
The model’s architecture includes a multi-scale vision transformer that processes the overall context of an image while also capturing finer details like “hair, fur, and other fine structures.” It estimates both relative and absolute depth, meaning the model can furnish real-world measurements that allow, for example, augmented reality apps to precisely position virtual objects in a physical space.
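To illustrate why absolute (metric) depth matters for that kind of placement, here’s a minimal sketch – not taken from the Depth Pro release – that back-projects a metric depth map into camera-space 3D coordinates using the standard pinhole camera model. The depth array and the camera intrinsics (fx, fy, cx, cy) are assumed inputs.

```python
import numpy as np

def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a metric depth map (in meters) into camera-space 3D points.

    Standard pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    This only yields real-world coordinates when the depth is absolute
    (metric); relative depth alone leaves the overall scale unknown.
    """
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel image coordinates
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1)       # (H, W, 3) XYZ in meters

# Toy example: a flat 4x4 depth map two meters away, with made-up intrinsics.
points = depth_to_points(np.full((4, 4), 2.0), fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(points.shape)  # (4, 4, 3)
```

An AR app could anchor a virtual object at one of those points, which is exactly the kind of thing an absolute-depth output makes possible.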
The AI is able to do all this without resource-intensive training on narrowly targeted datasets, employing something called zero-shot learning, which IBM describes as “a machine learning scenario in which an AI model can recognize and categorize unseen classes without labeled examples.” That makes for quite a versatile beast.
As for applications, beyond the AR scenario mentioned above, Depth Pro could make photo editing far more efficient, enable real-time 3D imagery from a single-lens camera, and help machines like autonomous vehicles and robots better perceive the world around them in real time.
The project is still at the research stage, but, perhaps unusually for Apple, the code and supporting documentation are being made available as open source on GitHub, allowing developers and researchers to take the technology to the next level.
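For a sense of what working with that release might look like, here’s a minimal sketch of loading the model and running inference on a photo. The module name `depth_pro` and the functions shown (`create_model_and_transforms`, `load_rgb`, `infer`) are assumptions modeled on the repository’s published example rather than a verified interface, so check the GitHub README for the current API.

```python
# Hypothetical usage sketch for Apple's open-source Depth Pro release.
# Module and function names are assumptions based on the repository's example;
# consult the GitHub README for the actual, current API.
import depth_pro

# Build the network and its matching preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an RGB image; f_px is the focal length in pixels if the photo's
# metadata provides one (Depth Pro can also estimate it from the image alone).
image, _, f_px = depth_pro.load_rgb("photo.jpg")

# Run inference: the prediction includes an absolute depth map in meters,
# plus the focal length the model settled on.
prediction = model.infer(transform(image), f_px=f_px)
depth_m = prediction["depth"]
focal_px = prediction["focallength_px"]
```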
A paper on the project has been published on the arXiv preprint server, and there’s a live demo available for anyone who wants to try the current version for themselves.
Source: Apple