An Introduction to Stereo Depth
Depth perception is the ability to perceive the distance of objects in our environment. Accurately judging those distances was a critical evolutionary step that allowed our ancestors to hunt, navigate, and avoid dangerous predators. In robotics, various sensing techniques are used to replicate the ability of our eyes to map the surrounding environment in three dimensions. And as with our ancestors, this task is just as important for navigation, planning, and other advanced behaviors. In this post, we introduce one of the most common methods of mapping a 3D environment: stereo depth camera systems.
In animals, a large part of depth perception is achieved through a process called stereopsis where each eye receives a slightly different image of the same target due to the distance between the pupils. Because objects closer to the animal will appear to move significantly from eye to eye, and distant objects appear to move very little, animals can get a sense of distance to the target object. In Fig. 1, for example, object C will appear to move less than object D because object C is further away.
The difference in position measured in each eye, referred to as “horizontal disparity”, is processed in the visual cortex of the brain to help provide depth perception. In robotics, a similar effect is achieved with two cameras and is the basis for stereo vision.
In practice, it is not trivial to determine the depth directly from the disparity, because it can be difficult to match a pixel in the right camera image to a pixel in the left camera image. A stereo matching algorithm is usually required to match up pixels from both images and then calculate the depth from disparity.
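To make the matching step concrete, here is a deliberately tiny sketch of block matching along a single image row, using the sum of absolute differences (SAD) as the matching cost. The function name, window size, and disparity range are illustrative assumptions; production systems use optimized library matchers such as OpenCV's StereoSGBM rather than anything this simple.

```python
import numpy as np

def match_row(left_row, right_row, x, window=3, max_disp=16):
    """Estimate the disparity of pixel x in the left row by SAD block matching.

    For a rectified stereo pair, the matching pixel in the right image lies
    on the same row, shifted to the left by the disparity.
    """
    half = window // 2
    patch = left_row[x - half : x + half + 1]
    best_d, best_cost = 0, np.inf
    for d in range(0, min(max_disp, x - half) + 1):
        cand = right_row[x - d - half : x - d + half + 1]
        cost = np.abs(patch.astype(int) - cand.astype(int)).sum()  # SAD cost
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

For example, if a bright feature sits at column 10 in the left row and at column 7 in the right row, `match_row` returns a disparity of 3.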
Horizontal disparity and stereo vision are also the basis of the structured light imaging technique (Fig. 2). In this variant, the system does not rely on passive light in the environment but instead actively projects a known pattern of light onto the surroundings. The horizontal disparity and the distortion of the pattern by a surface can then be used, as before, to extract depth data for objects and the environment.
Note that structured light techniques tend to be more accurate than passive stereo vision for a couple of reasons. The most obvious is that the projected light pattern provides more contrast and illumination than a passive system, which reduces measurement error. This is especially important in low-light or no-light conditions outside the active range of the imaging sensor.
The more subtle reason for increased performance, however, is because the structured pattern itself provides constraints that would not be normally available in a passive light environment. These constraints allow for better pixel matching between the right and left images to produce a more accurate disparity value. This technique is therefore widely used in applications requiring more accuracy such as in industrial robotics, face scanning, and microscopy.
Depth Maps and Point Clouds
Once pixel matching is complete and disparity has been accurately determined, depth can be calculated as:
depth = (baseline * focal length) / disparity
where the so-called baseline is the distance between the two cameras, and the focal length is the focal length of each camera. This equation highlights some of the design considerations when choosing a stereo vision system, because each system will have different camera characteristics and baseline values. It also highlights the need for accurate calibration procedures, e.g. for camera intrinsics such as the focal length, which will be covered in a future post.
The equation above lets us compute a depth value for each pixel of the environment, producing what is called a depth map. Depth maps can be visualized as a gray-scale image in which the distance value at each pixel is encoded in the range 0 (darkest) to 255 (lightest), as in Fig. 3.
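The gray-scale encoding is just a linear rescaling of depth into 8-bit values. A minimal sketch, assuming a chosen near/far clipping range in meters:

```python
import numpy as np

def depth_to_gray(depth, d_min, d_max):
    """Linearly map depth values in [d_min, d_max] to 8-bit gray [0, 255]."""
    scaled = (np.clip(depth, d_min, d_max) - d_min) / (d_max - d_min)
    return (scaled * 255).astype(np.uint8)

depth = np.array([[0.5, 1.0], [2.0, 4.0]])        # depths in meters
gray = depth_to_gray(depth, d_min=0.5, d_max=4.0)  # 0.5 m -> 0, 4 m -> 255
```

The choice of clipping range is application-specific; anything beyond `d_max` simply saturates to white.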
While depth maps are already useful for a variety of tasks such as object segmentation (Fig. 4), they are typically converted to coordinate representations by converting each pixel position (x, y) and depth value (z) into a 3D coordinate (x, y, z) using open source tools such as OpenCV or Open3D (Fig. 5). We can additionally add color data to each coordinate if the stereo depth system uses an RGB camera, enabling us to build a very rich representation of the 3D world (Fig. 6).
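The pixel-to-coordinate conversion is a pinhole-camera back-projection. A small sketch, assuming hypothetical intrinsics (a 700 px focal length and a 640×480 image with the principal point at its center):

```python
import numpy as np

# Assumed pinhole intrinsics: focal length and principal point, in pixels.
fx = fy = 700.0
cx, cy = 320.0, 240.0

def depth_to_points(depth):
    """Back-project a depth map (H x W, meters) into an N x 3 point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx   # pixel column -> metric X
    y = (v - cy) * z / fy   # pixel row    -> metric Y
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Libraries such as Open3D wrap exactly this operation (plus color lookup) so you rarely need to write it by hand, but it is useful to see that the geometry is only a few lines.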
Such 3D representations, or point clouds, are foundational for many applications in robotics and AI. For example, if a robot can move through its environment while collecting point cloud data, it can accurately calculate its position, map an environment, navigate obstacles, execute complex planning tasks, and detect and understand objects more effectively than 2D imaging alone.
Pros and Cons of Stereo Vision
Point clouds are not unique to stereo vision systems; they can also be generated from a variety of sensors such as sonar, LiDAR, or radar. What sets stereo vision apart is that it leverages cameras: inexpensive, easy-to-calibrate, high-resolution sensors that can attach color information to each point cloud coordinate, effectively doubling the amount of useful information available to the system. Stereo depth systems thus tend to provide much more information at a much lower price point, and that information can in turn power computer vision algorithms that wouldn't be possible with other sensors.
However, stereo vision has some disadvantages compared to these systems as well. Distance estimation tends to be much less accurate than with time of flight (ToF) sensors such as LiDAR; structured light is not always practical (meaning low-light environments are difficult to measure accurately); and disparity can be difficult to determine in many environments (e.g. on surfaces without textures or distinguishing features, such as flat walls). And those increased data rates? You'll need more compute and power to handle them.
To help you accelerate your development and determine when to use stereo vision, we provide a quick comparison of passive stereo vision (PSV), structured light stereo vision (SSV), and time of flight (ToF) sensors in Table 1 below.
| Technology | Stereo Vision | Time of Flight | Structured Light |
|---|---|---|---|
| Typical Range | 1 – 20 m | 0 – 2000 m | 0 – 3 m |
| Typical Accuracy | mm – cm | mm – cm | <mm |
| Ideal Environment | Passively-lit, textured (i.e. feature-rich) environments | Highly reflective environments with static objects | Low-light, indoor environments with reflective surfaces |
| Typical Frame Rate | 30 – 60 fps | 10 – 30 fps | 30 – 60 fps |
| Applications | VR/AR, mobile robotics, mapping, gaming | Surveying & mapping, inspection, digital twins, autonomy | Facial recognition, pick and place, forensics, microscopy |
Stereo vision is a powerful technique, but as with any technology, it has its limits. We hope this article helped you understand how stereo depth vision works and when it can be used. In future posts, we will discuss some of our favourite stereo vision sensor systems for our robotics development, and in particular, resource constrained systems such as drones.
Adinkra is an R&D engineering firm helping customers create state of the art robotics and AI products while minimizing costs and time to market. We combine a world-class engineering team with a flexible project management framework to offer a one-stop development solution and unlock your product’s full potential for your customers.