Deep Learning Architectures: Semantic Segmentation Models for Autonomous Driving


List of Semantic Segmentation Models for Autonomous Vehicles

State-of-the-art semantic segmentation models need to be tuned for efficient memory consumption and frame rate (fps) before they can be used in time-sensitive domains like autonomous vehicles.

In a previous post, we studied various open datasets that could be used to train a model for pixel-wise semantic segmentation (one of the image annotation types) of urban scenes. Here, we take a look at various deep learning architectures that cater specifically to time-sensitive domains like autonomous vehicles. In recent years, deep learning has surpassed traditional computer vision approaches to object detection by learning a hierarchy of features from the training data itself.

This eliminates the need for hand-crafted features and thus such techniques are being extensively explored in academia and industry.

Deep Learning Architectures for Semantic Segmentation

Prior to deep learning, semantic segmentation models relied on hand-crafted features fed into classifiers such as Random Forests and SVMs. After proving their mettle on image classification tasks, deep learning architectures began to be used by researchers as backbones for semantic segmentation. Their feature learning capabilities, along with further algorithmic and network design improvements, have since helped produce fine, dense pixel predictions. We introduce one such pioneering work below, the Fully Convolutional Network (FCN), on which most later models are roughly based.


A skip connection based network for end-to-end semantic segmentation.


Figure 1 [Source] : VGG-16 architecture reinterpreted as FCN

Contribution: This work reinterpreted the final fully connected layers of various LSVRC (Large Scale Visual Recognition Challenge, a.k.a. ImageNet) networks such as AlexNet and VGG-16 as convolutional layers, yielding fully convolutional networks. Skip-layer fusion, which decodes low-resolution feature maps into pixel-wise predictions, allowed the network to be trained end to end.

Architecture: As seen in the image above, the upsampled outputs of a deeper layer are fused with the outputs of a shallower layer to improve the accuracy of the output. Thus, appearance information (edges) from the shallower layers is combined with coarse semantic information from the deeper layers.

The upsampling operation applied to the deeper layers' feature maps is also trainable, unlike conventional upsampling operations that use fixed mathematical interpolation.
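As a concrete illustration of trainable upsampling: FCN initialises its transposed-convolution (deconvolution) filters with bilinear interpolation weights and then lets training refine them. A minimal NumPy sketch of building such an initial filter (the size-4 filter corresponds to 2x upsampling):

```python
import numpy as np

def bilinear_kernel(size):
    """Bilinear interpolation weights for a transposed-conv filter of the
    given size -- FCN initialises its learnable upsampling layers with
    these values instead of random weights."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

# A 4x4 filter performs 2x upsampling; each output pixel blends nearby inputs.
print(bilinear_kernel(4))
```

Because the filter sits in an ordinary (transposed) convolution layer, gradient descent is free to adjust these weights away from pure bilinear interpolation during training.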

Drawbacks: The authors did not add more decoders since there was no additional accuracy gain and thus, high-resolution features were ignored. Also, using the encoder feature maps during inference time makes the process memory intensive.

Real-Time Semantic Segmentation

Post FCN, various other networks such as DeepLab (introduced atrous convolutions), UNet (introduced the encoder-decoder structure), etc., have made pioneering contributions to the field of semantic segmentation. Building on the aforementioned networks, various state-of-the-art models like RefineNet, PSPNet, DeepLabv3, etc. have achieved an IoU (Intersection over Union) above 80% on benchmark datasets like Cityscapes and PASCAL VOC.
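Since IoU is the yardstick used throughout this comparison, here is a minimal NumPy sketch of computing per-class IoU, and its mean, from a predicted and a ground-truth label mask (the toy 2x4 masks are illustrative only):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class Intersection over Union, averaged over the classes
    that actually appear in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:                      # class absent from both masks
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])
gt   = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])
print(mean_iou(pred, gt, num_classes=2))    # 0.775
```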


Figure 2 [Source] : Accuracy vs Time for various Semantic Segmentation architectures

But real-time domains like autonomous vehicles need to make decisions on the order of milliseconds. As can be seen from Figure 2, the aforementioned networks are quite time-intensive. Table 1 also details the memory requirements of various models. This has encouraged researchers to explore novel designs that achieve output rates above 10 fps with fewer parameters.

Table 1: Semantic Segmentation models for autonomous vehicles


Using encoder pooling parameters at the decoder for efficient training


Figure 3 [Source] : A) VGG-16 architecture with max-pooling indices from encoder to decoder, B) Sparse Upsampling


This network has far fewer trainable parameters because its decoder layers use max-pooling indices from the corresponding encoder layers to perform sparse upsampling. This reduces inference time at the decoder stage since, unlike FCN, the encoder feature maps are not involved in the upsampling; it also eliminates the need to learn upsampling parameters.


The SegNet architecture adopts the VGG16 network along with an encoder-decoder framework wherein it drops the fully connected layers of the network. The decoder sub-network is a mirror copy of the encoder sub-network, both containing 13 layers. Figure 3(B) shows how SegNet and FCN carry out their feature map decoding.
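The max-pooling-indices mechanism can be sketched in a few lines of NumPy. This toy single-channel version (not SegNet's actual code) shows how the recorded argmax positions let the decoder perform sparse upsampling without keeping the encoder feature maps around:

```python
import numpy as np

def maxpool_with_indices(x):
    """2x2 max-pool that also records argmax positions (flat indices),
    mimicking what a SegNet encoder layer passes to its decoder."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    idx = np.zeros((H // 2, W // 2), dtype=int)
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            win = x[i:i+2, j:j+2]
            r, c = np.unravel_index(win.argmax(), win.shape)
            out[i // 2, j // 2] = win[r, c]
            idx[i // 2, j // 2] = (i + r) * W + (j + c)
    return out, idx

def max_unpool(pooled, idx, shape):
    """Sparse upsampling (Figure 3B): each pooled value returns to its
    recorded position; every other location stays zero."""
    out = np.zeros(shape).ravel()
    out[idx.ravel()] = pooled.ravel()
    return out.reshape(shape)

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 2., 2.],
              [3., 0., 1., 0.]])
pooled, idx = maxpool_with_indices(x)
up = max_unpool(pooled, idx, x.shape)
```

The indices are tiny compared to full feature maps, which is where SegNet's memory saving over FCN comes from; the sparse result is then densified by subsequent (trainable) convolutions.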


Drawbacks: Strong downsampling in the encoder hurts boundary accuracy.


A data-augmentation-centric encoder-decoder architecture to segment biomedical images


  • U-Net makes use of extensive data augmentation for the segmentation of biomedical images. The fundamental goal is to enable segmentation of high-resolution biomedical images on GPUs.
  • Prior work used a patch around each pixel to classify that single pixel. The number of patches generated this way was enormous, computation was heavily redundant, and inference was slow. U-Net instead segments an entire tile at once, with a surrounding patch providing the context for the tile.


UNet - For Image Semantic Segmentation

U-Net is a contraction-expansion network with skip connections between the two paths to carry knowledge from the earlier stages to the later ones. The contracting path increases the number of feature maps with depth, while the expansive path recovers image resolution through upsampling after concatenation with feature maps from the contracting path.


To handle boundaries between touching objects of the same class, a weighted cross-entropy loss was used. Border pixels were assigned higher weights than cell interiors, based on each pixel's distance to the borders of the two nearest cells.
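This border-emphasis scheme can be sketched as follows: `unet_weight_map` implements the paper's exponential border term (the class-balancing term is omitted for brevity), and `weighted_ce` scales a pixel-wise cross-entropy by that map. Array shapes and demo values here are illustrative:

```python
import numpy as np

def unet_weight_map(d1, d2, w0=10.0, sigma=5.0):
    """Border-emphasis weight from the U-Net loss:
    w = w0 * exp(-(d1 + d2)^2 / (2 * sigma^2)),
    where d1, d2 are distances to the two nearest cell borders.
    (The class-balancing term w_c is omitted for brevity.)"""
    return w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))

def weighted_ce(probs, labels, weights):
    """Pixel-wise cross-entropy scaled by a per-pixel weight map.
    probs: (classes, H, W) softmax outputs; labels: (H, W) int mask."""
    H, W = labels.shape
    p_true = probs[labels, np.arange(H)[:, None], np.arange(W)]
    return float((-weights * np.log(p_true)).mean())

probs = np.full((2, 2, 2), 0.5)        # a maximally unsure 2-class model
labels = np.zeros((2, 2), dtype=int)   # ground truth: everything class 0
loss = weighted_ce(probs, labels, np.ones((2, 2)))
```

Pixels sitting right between two cells (d1 = d2 = 0) receive the maximum weight w0, forcing the network to learn the thin separating borders.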

Data augmentation: Traditional methods such as shifts, rotations and deformations were used, along with dropout. In particular, random elastic deformations, with displacement fields sampled from a Gaussian distribution, were introduced.
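A common way to implement such elastic deformations with SciPy is to smooth a random displacement field and resample the image with it; the `alpha`/`sigma` values below are illustrative, not the paper's exact settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=10.0, sigma=4.0, seed=0):
    """Random elastic deformation: per-pixel displacements sampled from a
    Gaussian, smoothed with a Gaussian filter (std sigma), scaled by alpha,
    then applied with bilinear interpolation."""
    rng = np.random.default_rng(seed)
    dx = gaussian_filter(rng.normal(size=image.shape), sigma) * alpha
    dy = gaussian_filter(rng.normal(size=image.shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(image.shape[0]), np.arange(image.shape[1]),
                       indexing="ij")
    return map_coordinates(image, [y + dy, x + dx], order=1, mode="reflect")

img = np.arange(64, dtype=float).reshape(8, 8)
warped = elastic_deform(img)
```

The same displacement field must of course be applied to the label mask (with nearest-neighbour interpolation) so image and annotation stay aligned.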


A light network with reduced inference time using an asymmetric encoder-decoder architecture.


Figure 4 [Source]: E-Nets with A) Cityscapes output, B) elementary units, C) network layers with output dimensions

Contribution: The authors created a light network using an asymmetric encoder-decoder architecture. In addition, they made various other architectural decisions such as early downsampling, dilated and asymmetric convolutions, not using bias terms, parametric ReLU activations and Spatial DropOut.


  • Efficient Neural Network (ENet) aims to reduce inference time by cutting down the large number of floating-point operations present in previous architectures. This too is an encoder-decoder based architecture, with the difference that the encoder is much larger than the decoder. Here, the authors take inspiration from ResNet-based architectures – there is a single main branch, with extensions (convolutional filters) that separate from it and merge back using element-wise addition, as shown in Figure 4(B)(2).
  • The authors also drop all bias terms and observe that this does not lead to any loss of accuracy. They employ early downsampling – Figure 4(C) – which reduces the image dimensions and hence saves costly computation at the beginning of the network. Using dilated convolutions also gives the network a wide receptive field without resorting to further aggressive downsampling.
  • This model has been tested by the authors on the Nvidia Jetson TX1 Embedded Platform and code may be found here.
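The payoff of early downsampling is easy to see with a back-of-the-envelope cost model counting the multiply-accumulates of a single convolution layer (the layer sizes below are illustrative, not ENet's exact configuration):

```python
# Rough conv-layer cost model: multiply-accumulates ~ H * W * Cin * Cout * k * k.
def conv_macs(h, w, c_in, c_out, k=3):
    return h * w * c_in * c_out * k * k

# A 3x3 conv with 16 filters on a full 1024x512 RGB frame...
full = conv_macs(1024, 512, 3, 16)
# ...versus the same conv after an initial block has already downsampled
# the frame 2x in each spatial dimension.
early = conv_macs(512, 256, 3, 16)
print(full / early)   # 4.0 -- downsampling 2x quarters this layer's cost
```

Every layer that runs after the downsampling block enjoys the same 4x saving, which is why ENet pushes the resolution reduction as early as possible.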


A baseline model for several later works in semantic segmentation, DeepLab introduces atrous convolutions.


  • Atrous convolutions
  • Atrous spatial pyramid pooling
  • Using CRFs for efficient boundary segmentation

Atrous convolutions: The work proposes atrous convolutions (convolution kernels with holes, i.e. interleaved zeros) to preserve resolution. Though the kernel's footprint grows, the effective computation, counting only the non-zero elements, remains the same. Atrous convolution thus enlarges the field of view (FoV) of each kernel to an arbitrary size without extra computational expense and without adding parameters.
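The "kernel with holes" idea can be made concrete by zero-inserting a dense kernel; note how the field of view grows while the number of non-zero weights, and hence multiplications, stays fixed. A small NumPy sketch:

```python
import numpy as np

def dilate_kernel(k, rate):
    """Insert (rate - 1) zeros between kernel taps: the non-zero weight
    count stays the same while the field of view grows to
    size + (size - 1) * (rate - 1)."""
    size = k.shape[0]
    eff = size + (size - 1) * (rate - 1)   # effective kernel size
    out = np.zeros((eff, eff))
    out[::rate, ::rate] = k
    return out

k = np.ones((3, 3))
k2 = dilate_kernel(k, rate=2)
print(k2.shape)               # (5, 5) -- wider field of view
print(int((k2 != 0).sum()))   # 9 -- same number of multiplications
```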

Atrous Spatial Pyramid Pooling: Scale variance is usually handled by resampling the original image to several scales, extracting features with a DCNN for each, and fusing them back at the original resolution, at both train and test time. However, this multiplies computation (as much as 3x for 3 scales). ASPP instead applies parallel kernels with different hole spacings (a.k.a. sampling rates) to a single-scale feature map.
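A single-channel toy version of ASPP can be written with SciPy: the same kernel is applied in parallel at several atrous rates and the branch outputs are fused (summed here). Small rates keep the demo readable; the paper uses rates such as 6, 12, 18 and 24 on deep feature maps:

```python
import numpy as np
from scipy.signal import convolve2d

def dilate(k, rate):
    """Zero-insert a kernel to atrous rate `rate` (rate 1 = unchanged)."""
    eff = k.shape[0] + (k.shape[0] - 1) * (rate - 1)
    out = np.zeros((eff, eff))
    out[::rate, ::rate] = k
    return out

def aspp(feature_map, kernel, rates=(1, 2, 3)):
    """Single-channel sketch of ASPP: one kernel applied in parallel at
    several atrous rates, per-branch outputs summed."""
    branches = [convolve2d(feature_map, dilate(kernel, r), mode="same")
                for r in rates]
    return np.sum(branches, axis=0)

x = np.ones((8, 8))
out = aspp(x, np.ones((3, 3)) / 9.0)   # 3 averaging branches fused
```

Each branch sees the same feature map through a different field of view, which is how ASPP captures multiple object scales in a single forward pass.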

Fully connected CRFs for boundary recovery: The CNN score maps are smooth, and predicted boundaries spread beyond the objects. To achieve superior boundary segmentation, DeepLab uses a fully connected CRF. The CRF minimizes an energy combining the negative log-likelihood of the CNN score maps with pairwise potentials that push similarly colored pixels in a neighborhood toward the same label, enforcing smoothness between similar pixels.


  •  DeepLab-LargeFOV
  •  DeepLab-ASPP-S
  •  DeepLab-ASPP-L

The LargeFOV configuration uses atrous convolutions in the deeper layers. The ASPP configurations handle multi-scale objects by running atrous convolutions with different rates as multiple parallel layers.

DeepLab: For Image Semantic Segmentation

The ASPP-L configuration has the best performance, and the CRF post-processing improves mIoU by 3-4%.

DeepLab: For Image Semantic Segmentation [2]

The need for CRFs is called into question by a comparison of the VGG-16 and ResNet-101 backbones: the pre-CRF score maps have imprecise boundaries, and this is far more noticeable with VGG-16 than with the ResNet-101 configuration.

DeepLab: For Image Semantic Segmentation [3]



DeepLabv3 aims to eliminate CRFs and explores multi-grid configurations with atrous convolution.


  • Having demonstrated that VGG with a dense CRF (a context module) performs as well as ResNet without CRF post-processing, the authors explore architectures that eliminate the context module altogether.


  • The cascaded atrous (dilated) convolution layers in this work use different atrous rates. Whereas simply stacking extra blocks onto a traditional ResNet would leave the output 256x smaller than the input, this design maintains an output_stride (input resolution / output resolution) of 16. The authors also introduce the multi-grid method, in which the convolution rates vary between the units inside a block instead of the entire block adopting a single rate.
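The multi-grid idea reduces to simple rate arithmetic: each unit in a block multiplies the block's base atrous rate by its multi-grid factor. A sketch, where the (1, 2, 4) grid is one of the settings explored in the paper:

```python
# Multi-grid from DeepLabv3: the three conv units inside a ResNet block use
# different atrous rates, each a multiple of the block's base rate.
def block_rates(base_rate, multi_grid=(1, 2, 4)):
    return [base_rate * m for m in multi_grid]

# With output_stride 16, the last block runs at base rate 2:
print(block_rates(2))   # [2, 4, 8]
```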

DeepLab Architecture: For Image Semantic Segmentation [1]

The parallel atrous architecture (ASPP) employs multi-grid atrous convolutions similar to the cascaded configuration:

DeepLab Architecture: For Image Semantic Segmentation [2]

Inference: Atrous convolutions combined with deeper networks (ResNet-101) demonstrate significant performance improvements. Choosing the rates for multi-grid ASPP is tricky, and large rates can degrade performance; a constant atrous rate across the grid is not preferred.



Figure 5 [Source] : ICNet architecture with its three branches for multi-scale inputs


  • ICNet (Image Cascade Network) cascades feature maps from various resolutions. This is done to exploit the processing efficiency of low-resolution images and high inference quality of high-resolution images. A representation of this logic is shown in Figure 5.
  • Instead of trying intuitive strategies such as downsampling inputs (as in ENets), downsampling feature maps or model compression (removing feature maps entirely), the authors use a CFF (Cascade Feature Fusion) unit to merge feature maps of low resolutions with those of high resolutions.


  • Here the input image is fed into three branches at resolutions of 1/4, 1/2 and 1. Each branch downsamples its input by a further factor of 8, so the outputs of the three branches are 1/32, 1/16 and 1/8 of the original image. The output of branch 1 (1/32) is fused with the output of branch 2 (1/16) using the previously mentioned CFF unit; a similar fusion combines branch 2 and branch 3, and the final output of the network is 1/8 of the original size.
  • Since convolutional parameters are shared between the 1/4 and 1/2 resolution branches, the network size is also reduced. During inference, the cascade label guidance on the low- and medium-resolution branches is abandoned and only the high-resolution output is retained, leading to a reported computation time of a mere 9 ms.
  • The branches follow the design of PSPNet50 (a 50-layer ResNet adapted for semantic segmentation) and contain 50, 17 and 3 layers respectively.
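The resolution bookkeeping described above can be checked with a few lines of arithmetic (illustrative, for a 1024x2048 Cityscapes frame, not ICNet code):

```python
# Each ICNet branch downsamples its (already rescaled) input by a factor of 8.
def branch_output(h, w, input_scale, branch_downsample=8):
    return (int(h * input_scale / branch_downsample),
            int(w * input_scale / branch_downsample))

for scale in (1/4, 1/2, 1):        # the three input scales
    print(branch_output(1024, 2048, scale))
# (32, 64), (64, 128), (128, 256) -- i.e. 1/32, 1/16 and 1/8 of the original
```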

Final Thoughts

Various architectures have made novel improvements in the way 2-dimensional data is processed through computation graphs. Although embedded platforms continue to improve in memory and FLOPS, the architectural and mathematical improvements above have led to major leaps in semantic segmentation network outputs. With state-of-the-art networks, we can now achieve an output rate (in fps) close to image acquisition rates, with quality (in mIoU) acceptable for autonomous vehicles. If you are looking to scale up your image labelling needs, try Playment!