
State of the Art in 2D Real-Time Object Detection

Object detection is the ability to determine “what” an object is and “where” it is in an image, video, or other type of sensor data. It is critical for a variety of robotics and AI applications because it allows these systems to process and understand large amounts of data automatically. For example, object detection could be used to automatically find asset defects in hundreds of hours of utility inspection video or help rescue workers rapidly locate survivors in a search and rescue mission by processing multiple input video streams in parallel. In robotics applications, such as for autonomous vehicles or drones, object detection is critical for the robot to understand and interact with its surroundings safely and effectively. 

Object detection can be performed on data from both 2D sensors, such as cameras, and 3D sensors, such as LiDAR, and it is not always required to run in real time. However, in this article we are going to focus on real-time 2D object detection and leave the exploration of more advanced 3D methods to a future post. We do this for the following reasons:

  1. 2D sensor systems are becoming cheaper, more powerful, and more ubiquitous than ever before. A huge number of applications use 2D data, making 2D object detection algorithms increasingly relevant and popular. 
  2. The “where” of 3D object detection can be accomplished in many cases with 2D object detection methods. Why? Depth values (or z-coordinates) for each pixel can be estimated using 2D sensor data, e.g. through SLAM or stereo vision techniques, rather than measured directly using 3D sensors. This often makes it possible to transform results from the output of a 2D object detection method directly into a 3D positional estimate for an object with reasonable accuracy.
  3. There is already so much information available in 2D sensor data that accurately determining “what” an object is often doesn’t improve significantly with the addition of 3D data. Additionally, acquiring and using 3D data comes at a steep cost: fusing 2D and 3D sensor data is computationally expensive, inference is typically much slower than for 2D data alone, and the power requirements can make 3D data acquisition impractical. 
  4. Real-time or “online” inference is often required for robotics and for much of the work we do at Adinkra. While we will touch on a few of the latest methods that are maturing rapidly but are not yet real-time, our main focus will be on evaluating algorithms that can be deployed today for these applications.
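
To make the second point above more concrete, here is a minimal sketch of how a 2D detection plus a depth estimate could be turned into a 3D position using the standard pinhole camera model. The function name, the camera intrinsics, and the example detection are all hypothetical values chosen for illustration, and the sketch assumes the depth estimate (from stereo, SLAM, or similar) is already available.

```python
def backproject_box_center(box, depth_m, fx, fy, cx, cy):
    """Convert the center of a 2D bounding box plus a depth estimate into a
    3D point in the camera frame using the pinhole camera model.

    box:      (x_min, y_min, x_max, y_max) in pixels
    depth_m:  estimated distance along the optical axis (z), in meters
    fx, fy:   focal lengths in pixels
    cx, cy:   principal point (optical center) in pixels
    """
    x_min, y_min, x_max, y_max = box
    u = (x_min + x_max) / 2.0  # pixel column of the box center
    v = (y_min + y_max) / 2.0  # pixel row of the box center
    x = (u - cx) * depth_m / fx  # meters to the right of the optical axis
    y = (v - cy) * depth_m / fy  # meters below the optical axis
    return (x, y, depth_m)


# Hypothetical intrinsics for a 1280x720 camera and a made-up detection.
point = backproject_box_center(
    box=(740, 310, 820, 410),
    depth_m=5.0,
    fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
)
print(point)
```

In this made-up example the box center lies 140 pixels to the right of the principal point, which at 5 m depth corresponds to roughly 0.7 m to the right of the camera axis. A real system would also need to account for lens distortion and for noise in the depth estimate.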

Let’s jump in!

How Does Object Detection Work?

Let’s review how object detection works. In a nutshell, object detectors aim to identify patterns inside images and generalize these patterns to make predictions on new data. However, because the patterns describing real-world objects are often complex, they usually need to be extracted and learned automatically rather than designed by hand. This is most commonly achieved today with a type of machine learning called deep learning. Deep learning models are trained on a large data set of example images and objects so they can learn to identify useful patterns for making new detections at runtime.

Training a model usually requires collecting a data set of images and manually identifying and locating the objects in those images through a process called data labeling or data annotation. These labels help the algorithm learn to distinguish patterns unique to those objects from other patterns in the environment. An example of a common labeling scheme with simple rectangular boxes (called bounding boxes) is shown in Fig. 1. 

Fig. 1: Objects are identified and localized in the image with boxes called bounding boxes for the training set.

It is important to note that there are often a huge number of labels and examples needed (often 10,000+ image examples per object category) for these algorithms to generalize well enough to be useful in the real world. These data requirements can make training deep learning models an expensive and challenging task. Many techniques, such as synthetic data generation, proactive learning, and auto-annotation, have therefore been developed to speed up the labeling process. 

Types of Object Detectors

Convolutional Neural Networks

Most of the state-of-the-art object detectors in use today are based on Convolutional Neural Networks (CNNs). CNNs automatically extract features such as edges, corners, or regions of interest which have proven useful as learned patterns on object detection tasks. Their key innovations are the mathematical convolution operation (giving them their name) which allows for nearby pixels to be analyzed together when learning features, as well as a parameter-sharing constraint allowing useful features learned in one part of an image to be used for other parts of the image. There have been many other notable architecture and algorithm advancements using CNNs which still make them competitive and effective object detectors today. 
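
As a rough illustration of the convolution operation and parameter sharing described above, the sketch below slides one small kernel over a toy image. The kernel here is hand-picked to respond to vertical edges, which is an assumption made for illustration; in a real CNN the kernel weights are learned during training. The function name and the toy data are hypothetical.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the
    resulting feature map. Deep learning frameworks call this operation
    convolution, although strictly speaking it is cross-correlation."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Nearby pixels are analyzed together, and the SAME kernel
            # weights are reused at every position (parameter sharing).
            total = 0
            for di in range(kernel_h):
                for dj in range(kernel_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d(image, vertical_edge)
print(feature_map)  # [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The feature map is zero over the flat regions and large where the window straddles the edge, regardless of which row the edge appears in. That is the sense in which a feature learned in one part of an image can be reused elsewhere.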

Single-Stage Detectors

Single-stage detectors are a type of CNN-based architecture that tends to be simpler and faster than its two-stage counterparts, albeit at the cost of some accuracy. The detector uses a CNN to extract so-called feature maps (i.e. learned patterns) from the image, and then uses these to find and classify bounding boxes. Popular single-stage detectors include YOLO (You Only Look Once), SSD (Single-Shot Detector) and RetinaNet, with YOLO being one of the most widely used base architectures for real-time object detection applications. 

Fig. 2: Single-stage object detection is faster and simpler, but at the cost of some accuracy. This figure shows bounding box priors (anchor boxes), but the field is moving toward so-called anchor-free methods, such as the YOLOX variants.

Two-Stage Detectors

Two-stage detectors work similarly to single-stage ones, except there is an extra step that proposes candidate regions of the image that may contain objects before bounding box coordinates are refined and classified. This results in slower inference speeds, but higher object localization and recognition accuracy. The region proposal step often, but not always, contains convolutional layers. Some common examples of two-stage detectors include Region-based Convolutional Neural Networks (R-CNN), Faster R-CNN, Mask R-CNN, and Granulated R-CNN (G-RCNN).

Fig. 3: Two-stage object detection involves first proposing a region of interest before refining and classifying the bounding box for an object.


Transformers

By the end of 2020, transformer architectures were starting to take over object detection, beginning with the introduction of DEtection TRansformers (DETR). DETR showed improved performance on object detection tasks compared to pure CNNs. The reason is that the convolution operation works only on local neighboring pixels and thus misses the global information available from other pixels in the image. DETR uses the self-attention capabilities of transformers to capture these long-range correlations between pixels, providing significant improvements for localizing and identifying objects. This method also simplified architectures by removing the need for anchor boxes, non-max suppression, and region proposals.
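
To give some intuition for how self-attention captures long-range correlations, here is a deliberately simplified sketch of scaled dot-product self-attention over four image patches. It is not DETR's implementation: real transformers apply learned query, key, and value projections and use multiple attention heads, all of which are omitted here. The function names and the toy patch features are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention in which queries, keys, and values
    are all the raw input features (no learned projections)."""
    dim = len(features[0])
    outputs, all_weights = [], []
    for query in features:
        # Compare this position against EVERY position, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        # Each output is a weighted mix of all positions' features.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, features))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Four toy patches in a row; the first and last look alike.
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(patches)
print([round(w, 3) for w in weights[0]])
```

In this toy input the first patch attends to the distant last patch as strongly as to itself, and more strongly than to its immediate neighbor. A small convolution kernel centered on the first patch would never see the last one.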

The paradigm shift to transformers is continuing at a rapid clip. In 2022, transformer-based models such as the Swin Transformer (Shifted-WINdow Transformer) are reaching SOTA performance results. However, they are not yet capable of real-time inference on edge GPUs. 

Comparing Object Detectors

Fig. 4: Advancement of computer vision models on object detection with the standard COCO dataset. The box mean average precision is a key metric that captures how correctly objects are both classified and localized.

In the last section, we gave a high-level overview of the different types of object detectors. However, in reality, this is an active area of research with new variants and ideas being published on a nearly daily basis (see Fig. 4). This raises the important question: how do we pick the best detector for a given application? In this section, we will talk about the most important metrics to consider and how these metrics are evaluated for various model types. 


Mean Average Precision

Perhaps unsurprisingly, two key performance metrics for a real-time object detection system are how well an object is localized in an image (the “where”), and how well it is classified (the “what”). The “what” and “where” of model performance are typically captured in a metric called the mean average precision (mAP). 

To get some intuition for this metric, we first need to define a quantity called the intersection over union (IoU). The IoU is the area of overlap between the true object location (created during data labeling) and the predicted object location, divided by the total area covered by the two boxes together (Fig. 5). It ranges from 0 (no overlap) to 1 (a perfect match). The more overlap there is between our prediction and the true location of the object, the better the object detector is performing.

Fig. 5: The intersection over union for object detection. More overlap indicates better performance. 
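
For axis-aligned bounding boxes, the IoU can be computed directly from the box corners. The sketch below is a minimal version, assuming boxes are given as (x_min, y_min, x_max, y_max) in pixels; the function name and the example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection  # avoid double-counting overlap

    return intersection / union if union > 0 else 0.0


true_box = (0, 0, 10, 10)
predicted_box = (5, 0, 15, 10)  # same size, shifted right by half a width
print(iou(true_box, predicted_box))  # 50 / 150, i.e. about 0.333
```

Note that a prediction shifted by only half a box width already drops to an IoU of about one third, which is why IoU is a fairly strict measure of localization quality.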

Note, however, that there is some ambiguity about how much overlap is enough to accept a prediction as a true positive (Fig. 6): if the overlap is small, we should consider the prediction an incorrect detection (a false positive), whereas if the overlap is large, we should consider it a detected object (a true positive). To address this, we set a parameter called the IoU threshold, which is the minimum overlap required to consider something a true positive. We can then apply our object detector to the dataset, and calculate the precision as (# true positives) / (# true positives + # false positives). Note that boxes with no overlap with true objects will always be considered false positives, as intended, and we also consider the overlap to be zero if the object categories don’t match. 
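
The precision calculation above can be sketched as follows. This is a simplified illustration rather than any benchmark's official evaluation code: it assumes each true object can be matched by at most one prediction and that predictions are processed in the order given (evaluation tools typically process them in order of decreasing confidence). All names and example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def precision_at_threshold(predictions, ground_truths, iou_threshold):
    """predictions and ground_truths are lists of (category, box) pairs."""
    matched = set()  # indices of true objects already claimed by a prediction
    true_positives = 0
    false_positives = 0
    for category, box in predictions:
        best_iou, best_index = 0.0, None
        for index, (gt_category, gt_box) in enumerate(ground_truths):
            # A category mismatch is treated as zero overlap.
            if index in matched or gt_category != category:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_index = overlap, index
        if best_index is not None and best_iou >= iou_threshold:
            matched.add(best_index)
            true_positives += 1
        else:
            false_positives += 1
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


ground_truths = [("cat", (0, 0, 10, 10)), ("dog", (20, 20, 30, 30))]
predictions = [
    ("cat", (1, 0, 11, 10)),    # IoU of about 0.82 with the cat
    ("dog", (25, 20, 35, 30)),  # IoU of about 0.33 with the dog
    ("cat", (50, 50, 60, 60)),  # overlaps nothing
]
print(precision_at_threshold(predictions, ground_truths, 0.5))  # 1 of 3
print(precision_at_threshold(predictions, ground_truths, 0.3))  # 2 of 3
```

The same three predictions score a precision of one third at an IoU threshold of 0.5 but two thirds at 0.3, which shows how strongly the reported number depends on the chosen threshold.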

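For a fixed IoU threshold, a single precision value still depends on how confident the detector must be before we accept a prediction. Sweeping that confidence cutoff trades precision against recall, defined as (# true positives) / (# true objects), and summarizing the resulting precision-recall curve gives the average precision (AP) for each object category. Since IoU thresholds are somewhat arbitrary, benchmarks such as COCO repeat this calculation for multiple IoU thresholds (0.50 to 0.95 in steps of 0.05) and average the results. Averaging the AP across all object categories then gets us to the mean average precision (mAP) metric. In other words, the mAP metric measures the performance of the detector across all classes and IoU assumptions, providing a good metric to use when comparing object detector performance. 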
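
As a rough sketch of the averaging described above, the code below computes AP for one category at one IoU threshold as the area under the precision-recall curve, then averages over IoU thresholds and categories. It is an illustration under simplifying assumptions, not a reimplementation of the COCO evaluation tools: it uses only two IoU thresholds, and the true-positive flags are made-up inputs that would normally come from matching predictions to labels. All names are hypothetical.

```python
def average_precision(detections, num_true_objects):
    """AP for one category at one IoU threshold.

    detections: list of (confidence, is_true_positive) pairs
    num_true_objects: number of labeled objects of this category
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_true_objects)
    # Smooth the curve so precision never increases as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve.
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def mean(values):
    return sum(values) / len(values)


# Made-up results: the same detections, judged at two IoU thresholds.
# A stricter threshold turns some true positives into false positives.
results = {
    "cat": {0.50: [(0.9, True), (0.8, False), (0.7, True)],
            0.75: [(0.9, True), (0.8, False), (0.7, False)]},
    "dog": {0.50: [(0.95, True), (0.6, True)],
            0.75: [(0.95, True), (0.6, False)]},
}
num_true_objects = {"cat": 2, "dog": 2}

ap_per_category = {
    category: mean([
        average_precision(dets, num_true_objects[category])
        for dets in by_threshold.values()
    ])
    for category, by_threshold in results.items()
}
map_score = mean(list(ap_per_category.values()))
print(ap_per_category, map_score)
```

With these made-up inputs the cat category averages to about 0.67 and the dog category to 0.75, giving an mAP of about 0.71.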

Fig. 6: Ambiguity about whether or not the predicted box should count as a correct “cat” detection.


Runtime

Naturally, runtime is another important metric for real-time applications: the faster, the better. Runtime is impacted by many factors, such as input image size (e.g., is this a slow high-resolution image or a fast low-resolution image?), model architecture (e.g., is this a scaled YOLO-large model or a YOLO-tiny model?), hardware acceleration (e.g., is the model compiled using TensorRT?), available memory (e.g., are we on an 8GB Xavier or a 16GB Orin?), model parameters (e.g., how aggressively is non-max suppression applied?), and batching options (e.g., do we need to process images one at a time or can we process them in batches?).

Engineers often need to carefully select the hardware and parameters of these models to achieve the desired performance. For this reason, evaluations of models are usually performed on standard datasets (such as Microsoft’s COCO dataset), standard hardware (such as the A100 or V100 GPU), and standard parameterization (no TensorRT, standard image input sizes, etc.). 
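
When measuring runtime on your own hardware, a simple timing loop is often a reasonable starting point. The sketch below is a minimal, framework-agnostic example. The `detect` callable is a hypothetical stand-in for a real model, simulated here with a short sleep, and the warm-up count is an arbitrary choice. On a GPU, where execution is generally asynchronous, you would also need to wait for the device to finish its work before reading the clock.

```python
import time


def measure_fps(detect, frames, warmup=2):
    """Average frames per second of `detect`, processing one frame at a time."""
    # Warm-up runs are excluded from timing, since the first few calls
    # often include one-time setup costs.
    for frame in frames[:warmup]:
        detect(frame)
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed


def fake_detect(frame):
    """Hypothetical stand-in for a detector that takes about 5 ms per frame."""
    time.sleep(0.005)
    return []


fps = measure_fps(fake_detect, frames=[None] * 20)
print(f"{fps:.1f} FPS")
```

The reported number will vary from run to run and machine to machine, which is exactly why published comparisons fix the dataset, hardware, and parameterization.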

Other Metrics

Expanding on the previous section, it is common for architectures to be scaled up or down based on the intended runtime hardware. For example, in the YOLOv5 family, there are multiple variants such as nano, small, medium, large, and extra-large (and even more if you consider the P6 networks for larger objects) which are intended to fit into the memory of various target hardware. Therefore, engineers typically must consider both the model size (number of parameters) and the number of floating point operations (FLOPs) required per inference when selecting a model. The number of model parameters also impacts how quickly models can be trained.
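
To see where parameter and operation counts come from, here is a back-of-the-envelope sketch for a single convolutional layer. The layer dimensions are hypothetical, and the sketch counts each multiply-accumulate as two floating point operations. Conventions vary, and some papers report multiply-accumulates instead, so published numbers may differ by a factor of two.

```python
def conv_layer_cost(in_channels, out_channels, kernel_size, out_height, out_width):
    """Parameter count and approximate floating point operations for one
    standard convolutional layer with a square kernel and a bias term."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    biases = out_channels
    parameters = weights + biases

    # One multiply-accumulate per weight per output position.
    macs = weights * out_height * out_width
    flops = 2 * macs  # count the multiply and the add separately
    return parameters, flops


# Hypothetical first layer: RGB input, 32 filters of size 3x3,
# producing a 320x320 feature map.
params, flops = conv_layer_cost(3, 32, 3, 320, 320)
print(params)         # 896
print(flops / 1e9)    # about 0.18 GFLOPs
```

The example shows why the two numbers need to be considered separately: this layer has fewer than a thousand parameters but still costs roughly 0.18 billion operations per image, because the same weights are applied at every output position.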

State of the Art Object Detection

At this point we can evaluate what is state of the art in 2D object detection as of the end of 2022. We have listed out the top algorithms, including the up and coming ConvNeXt and Swin Transformer models, and our previous favorite object detector, YOLOv5. The mAP is computed using the COCO evaluation dataset, and the frames per second (FPS) is shown for V100 GPUs (with the exception of the ConvNeXt and Swin Transformers, which were run on more advanced A100 GPUs). 

We also attempt to answer the question “what algorithm should I use for my application?” by highlighting the compute regime for each model as an edge device GPU, normal GPU, or more capable cloud GPU. However, there are many compute platforms, hardware acceleration options, and practical engineering considerations that can allow for flexibility in what models will work for a given application. For example, we have optimized a YOLOv5-L model to run on an edge device despite also running an object tracker, optical character recognition algorithm, and color analyzer on the same device. 

Algorithm     Image Size  When to Use                   mAP   FPS
YOLOv7-tiny   640         Object Detection, Edge GPU    38.7  286
YOLOv7        640         Object Detection, Normal GPU  51.2  161
YOLOv7-X      640         Object Detection, Cloud GPU   52.9  114
YOLOv7-W6     1280        Object Detection, Cloud GPU   54.6  84
YOLOv7-E6     1280        Object Detection, Cloud GPU   55.9  56
YOLOv7-D6     1280        Object Detection, Cloud GPU   56.3  44
YOLOv7-E6E    1280        Object Detection, Cloud GPU   56.8  36
YOLOR-P6      1280        Multi-Task, Edge GPU          53.5  76
YOLOR-W6      1280        Multi-Task, Edge GPU          54.8  66
YOLOR-E6      1280        Multi-Task, Edge GPU          55.7  45
YOLOR-D6      1280        Multi-Task, Edge GPU          56.1  34
ConvNeXt-S    1280        Multi-Task, Edge GPU          51.9  12*
ConvNeXt-B    1280        Multi-Task, Edge GPU          52.7  11.4*
ConvNeXt-L    1280        Multi-Task, Edge GPU          54.8  10*
ConvNeXt-XL   1280        Multi-Task, Edge GPU          55.2  8.6*
Swin-B        1280        Multi-Task, Edge GPU          51.9  11.6*
Swin-L        1280        Multi-Task, Edge GPU          53.9  9.2*
YOLOv5-S      640         Object Detection, Edge GPU    37.4  156
YOLOv5-M      640         Object Detection, Normal GPU  45.4  122
YOLOv5-L      640         Object Detection, Cloud GPU   49.0  99
YOLOv5-XL     640         Object Detection, Cloud GPU   50.7  83
Table 1. Comparison of state of the art object detectors. CNN-based YOLOR is included as a multi-task computer vision alternative to YOLOv7. ConvNeXt (CNN-based) and Swin Transformers (Transformer-based) are included for comparison, but are arguably not yet fast enough for real-time edge applications. We include YOLOv5, arguably the most popular previous state of the art approach, for comparison purposes. Note that mAP was calculated on the COCO dataset. FPS values marked with * were measured on an A100 GPU rather than a V100.
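
One simple way to use a table like this is to filter by the frame rate your application needs and then pick the most accurate remaining model. The sketch below does this for the V100 rows of Table 1 (the ConvNeXt and Swin rows are left out because their FPS was measured on different hardware). The function name and the example FPS requirements are hypothetical. Keep in mind that these are V100 figures, so frame rates on your own target hardware would need to be measured separately.

```python
# (name, image size, mAP, FPS on a V100) copied from Table 1.
MODELS = [
    ("YOLOv7-tiny", 640, 38.7, 286),
    ("YOLOv7", 640, 51.2, 161),
    ("YOLOv7-X", 640, 52.9, 114),
    ("YOLOv7-W6", 1280, 54.6, 84),
    ("YOLOv7-E6", 1280, 55.9, 56),
    ("YOLOv7-D6", 1280, 56.3, 44),
    ("YOLOv7-E6E", 1280, 56.8, 36),
    ("YOLOR-P6", 1280, 53.5, 76),
    ("YOLOR-W6", 1280, 54.8, 66),
    ("YOLOR-E6", 1280, 55.7, 45),
    ("YOLOR-D6", 1280, 56.1, 34),
    ("YOLOv5-S", 640, 37.4, 156),
    ("YOLOv5-M", 640, 45.4, 122),
    ("YOLOv5-L", 640, 49.0, 99),
    ("YOLOv5-XL", 640, 50.7, 83),
]


def best_model(models, min_fps):
    """Return the highest-mAP model that meets the FPS requirement,
    or None if no model is fast enough."""
    candidates = [model for model in models if model[3] >= min_fps]
    if not candidates:
        return None
    return max(candidates, key=lambda model: model[2])


print(best_model(MODELS, min_fps=60))   # ('YOLOR-W6', 1280, 54.8, 66)
print(best_model(MODELS, min_fps=100))  # ('YOLOv7-X', 640, 52.9, 114)
print(best_model(MODELS, min_fps=300))  # None
```

In practice the choice also depends on memory limits, the other workloads sharing the device, and the hardware acceleration options discussed earlier, so a filter like this is only a first pass.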

That’s all for now. Stay tuned for future computer vision posts from our team where we will talk about some of the pitfalls in camera-based object detection and how to avoid them, stereo vision systems and hardware, 3D object detection, and more!

About Adinkra

Adinkra is an R&D engineering firm helping customers create state of the art robotics and AI products while minimizing costs and time to market. We combine a world-class engineering team with a flexible project management framework to offer a one-stop development solution and unlock your product’s full potential for your customers.