Choosing an appropriate loss function is important as it affects the ability of the algorithm to produce optimum results as fast as possible. Here we study losses in popular models like MaskRCNN and PSPNet.
Machine learning algorithms are designed to “learn” from their mistakes and “update” themselves using the training data we provide. But how do they quantify these mistakes? They use “loss functions”, which measure how far an algorithm's predictions are from the ground truth.
Basic Definitions
L2 LOSS
 This is the most basic loss and is also called the Euclidean loss. It relies on the Euclidean distance between two vectors: the prediction ŷ and the ground truth y.
 $$ L = \sum_{i} (y_i - \hat{y}_i)^2 $$
 It is very sensitive to outliers because the error is squared.
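 The outlier sensitivity can be seen in a minimal sketch. It assumes the loss is the plain sum of squared differences; some libraries also divide by the number of elements or by 2, which is not modeled here. The function name `l2_loss` and the sample vectors are illustrative only.

```python
def l2_loss(prediction, target):
    """Sum of squared differences between two equal-length vectors."""
    if len(prediction) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


target = [1.0, 2.0, 3.0]

# Three small errors of 0.5 each.
close = l2_loss([1.5, 2.5, 3.5], target)

# Two perfect predictions and one outlier that is off by 3.0.
outlier = l2_loss([1.0, 2.0, 6.0], target)

print(close, outlier)  # the single outlier dominates the total loss
```

 Because each error is squared, the single error of 3.0 contributes far more than the three errors of 0.5 combined.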
CROSS-ENTROPY LOSS
 Cross-entropy loss is a more advanced loss function that uses the natural logarithm (log_e). For networks with sigmoid or softmax outputs, it typically speeds up training compared with the quadratic loss, because its gradient does not vanish when an output saturates on a wrong prediction.
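 The formula for cross-entropy in the multi-class case, also called categorical cross-entropy, is as follows. Here c is the class ID, o is the observation ID, p is the predicted probability, M is the number of classes, and y_{o,c} is 1 if c is the correct class for observation o and 0 otherwise.
 $$ L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) $$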
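 For a single observation, categorical cross-entropy can be sketched as below. This is an illustration rather than any library's implementation: the function name and the `eps` clamp (added to avoid taking the log of zero) are assumptions of this sketch.

```python
import math


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Cross-entropy for one observation.

    one_hot: list with 1 at the true class and 0 elsewhere.
    probs:   predicted probability for each class (expected to sum to 1).
    eps:     small clamp so log(0) is never evaluated (sketch assumption).
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


label = [0, 1, 0]  # the true class is class 1

confident_right = categorical_cross_entropy(label, [0.05, 0.90, 0.05])
confident_wrong = categorical_cross_entropy(label, [0.90, 0.05, 0.05])

print(confident_right, confident_wrong)
```

 With a one-hot label, only the probability assigned to the true class contributes, so a confident wrong prediction is penalized much more heavily than a confident correct one.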
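 The formula for cross-entropy in the binary case, also called log loss, is as follows. Here y ∈ {0, 1} is the label and ŷ ∈ (0, 1) is the predicted probability.
 $$ L = -\left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right) $$
 For more information, refer to the fast.ai wiki or this GitHub gist.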
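 The binary case can be sketched in a few lines. The function name and the `eps` clamp that keeps the prediction strictly inside (0, 1) are assumptions of this example, not part of the definition.

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Log loss for one label y in {0, 1} and one predicted probability y_hat."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # keep log() finite
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


good = binary_cross_entropy(1, 0.9)       # correct and confident: small loss
unsure = binary_cross_entropy(1, 0.5)     # coin flip: loss equals log(2)
bad = binary_cross_entropy(1, 0.1)        # wrong and confident: large loss

print(good, unsure, bad)
```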
SIGMOID FUNCTION
 Cross-entropy expects every scalar output of the model to be a probability. Raw outputs are not always probabilities, so we can pass them through the sigmoid function, a nonlinear function that maps any real number into the range (0, 1). Its formula (a special case of the logistic function) is as follows.
 $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
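 A minimal sketch of the sigmoid follows. The two-branch form is a choice made for this example so that very negative inputs do not overflow `math.exp`; mathematically both branches compute the same function.

```python
import math


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x) to avoid overflow.
    z = math.exp(x)
    return z / (1.0 + z)


for raw_score in (-4.0, 0.0, 4.0):
    print(raw_score, sigmoid(raw_score))
```

 A raw score of 0 maps to a probability of 0.5, and scores of equal magnitude but opposite sign map to probabilities that sum to 1.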
SOFTMAX FUNCTION
 We can use the softmax function for the same reason. It is also referred to as the normalized exponential function, and it generalizes the logistic function to multiple inputs. It “squashes” a K-dimensional vector z into a K-dimensional vector σ(z) whose entries lie in the range (0, 1) and add up to 1. One can also read up the definition here. The equation is as follows.
 $$ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \dots, K $$
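 Softmax can be sketched as follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that this example assumes; it does not change the result because the shift cancels in the ratio.

```python
import math


def softmax(z):
    """Turn a list of K raw scores into K probabilities that sum to 1."""
    shift = max(z)  # stability shift; cancels out in the division below
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

 The largest score receives the largest probability, and the ordering of the inputs is preserved.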
Examples of Loss Functions in Popular Semantic Segmentation Models
Semantic Segmentation – PSPNet
 Quote from paper

Apart from the main branch using softmax loss to train the final classifier, another classifier is applied after the fourth stage, i.e., the res4b22 residue block
 Here, the softmax loss refers to the softmax activation function followed by the cross-entropy loss function.

 Loss Calculation (code)

    # net, args, label_batch and prepare_label come from the surrounding training script.
    raw_output = net.layers['conv6']

    # Step1 : flatten predictions to [num_pixels, num_classes] and labels to [num_pixels]
    raw_prediction = tf.reshape(raw_output, [-1, args.num_classes])
    label_proc = prepare_label(label_batch, tf.stack(raw_output.get_shape()[1:3]),
                               num_classes=args.num_classes, one_hot=False)  # [batch_size, h, w]
    raw_gt = tf.reshape(label_proc, [-1,])
    indices = tf.squeeze(tf.where(tf.less_equal(raw_gt, args.num_classes - 1)), 1)

    # Step2 : keep only the pixels whose label is a valid class id
    gt = tf.cast(tf.gather(raw_gt, indices), tf.int32)
    prediction = tf.gather(raw_prediction, indices)

    # Step3 : Pixel-wise softmax loss.
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=prediction, labels=gt)

 The above code snippet defines the loss on the masks for PSPNet. It is split into three main steps:
 Step1: Reshaping the inputs
 Step2: Gathering the indices of interest
 Step3: Computing loss (Softmax Cross Entropy)
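 The three steps can be mimicked without TensorFlow in a small sketch. It assumes the inputs are already flattened to one row of scores per pixel, and it uses 255 as a hypothetical "ignore" label; the function name and sample values are illustrative and are not taken from the PSPNet repository.

```python
import math


def pixelwise_softmax_loss(logits, labels, num_classes):
    """Per-pixel softmax cross-entropy, skipping pixels with invalid labels.

    logits: list of rows, one row of num_classes raw scores per pixel (Step 1).
    labels: list of integer class ids, one per pixel.
    """
    # Step 2: keep only pixels whose label is a valid class id.
    kept = [(row, lab) for row, lab in zip(logits, labels) if lab <= num_classes - 1]

    # Step 3: softmax cross-entropy, computed as log-sum-exp minus the true-class score.
    losses = []
    for row, lab in kept:
        shift = max(row)
        log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in row))
        losses.append(log_sum_exp - row[lab])
    return losses


logits = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
labels = [0, 0, 255]  # 255 is a hypothetical ignore label for this example
losses = pixelwise_softmax_loss(logits, labels, num_classes=2)
print(losses)  # one loss value per kept pixel; the third pixel is skipped
```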

 Inference (code)

raw_output_up = tf.argmax(raw_output_up, dimension=3)
 Here we compute the class_id of each pixel by taking the index of the highest score along dimension=3 (the depth, or class, axis).
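 The same idea on plain nested lists, as a sketch: the batch dimension is dropped for brevity, so the class axis is the innermost one here. The function name and sample scores are illustrative.

```python
def predict_class_ids(scores):
    """scores: nested list shaped [height][width][num_classes].

    Returns a [height][width] grid holding the index of the best class per pixel.
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


scores = [
    [[0.1, 0.9], [0.8, 0.2]],
    [[0.3, 0.7], [0.6, 0.4]],
]
class_map = predict_class_ids(scores)
print(class_map)
```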

Instance Semantic Segmentation – MaskRCNN
 Quote

The mask branch has a Km²-dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes. To this we apply a per-pixel sigmoid, and define L_mask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, L_mask is only defined on the k-th mask (other mask outputs do not contribute to the loss).

 Network Architecture
 MaskRCNN splits into three branches: classes, bounding box, and mask. Let’s focus on the mask branch, since that is the one used to create masks for the objects of interest.
 Loss Calculation (code)


    import tensorflow as tf
    import keras.backend as K

    def mrcnn_mask_loss_graph(target_masks, target_class_ids, pred_masks):
        """Mask binary cross-entropy loss for the masks head.

        target_masks: [batch, num_rois, height, width].
            A float32 tensor of values 0 or 1. Uses zero padding to fill array.
        target_class_ids: [batch, num_rois]. Integer class IDs. Zero padded.
        pred_masks: [batch, proposals, height, width, num_classes] float32 tensor
            with values from 0 to 1.
        """
        # Step1.1 : Reshape for simplicity. Merge first two dimensions into one.
        target_class_ids = K.reshape(target_class_ids, (-1,))
        mask_shape = tf.shape(target_masks)
        target_masks = K.reshape(target_masks, (-1, mask_shape[2], mask_shape[3]))
        pred_shape = tf.shape(pred_masks)
        pred_masks = K.reshape(pred_masks,
                               (-1, pred_shape[2], pred_shape[3], pred_shape[4]))
        # Step1.2 : Permute predicted masks to [N, num_classes, height, width]
        pred_masks = tf.transpose(pred_masks, [0, 3, 1, 2])

        # Step2 : Only positive ROIs contribute to the loss. And only
        # the class specific mask of each ROI.
        positive_ix = tf.where(target_class_ids > 0)[:, 0]
        positive_class_ids = tf.cast(
            tf.gather(target_class_ids, positive_ix), tf.int64)
        indices = tf.stack([positive_ix, positive_class_ids], axis=1)

        # Step2.1 : Gather the masks (predicted and true) that contribute to loss
        y_true = tf.gather(target_masks, positive_ix)
        y_pred = tf.gather_nd(pred_masks, indices)

        # Step3 : Compute binary cross entropy. If no positive ROIs, then return 0.
        # shape: [batch, roi, num_classes]
        loss = K.switch(tf.size(y_true) > 0,
                        K.binary_crossentropy(target=y_true, output=y_pred),
                        tf.constant(0.0))
        loss = K.mean(loss)
        return loss

 The above code snippet defines the loss on the masks for MaskRCNN. It is split into three main steps:
 Step1: Reshaping the inputs
 Step2: Gathering the indices of interest
 Step3: Computing loss (Binary CrossEntropy Loss)
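 The selection logic can be sketched on plain Python lists. This is an illustration under assumptions, not the repository's code: the batch and ROI dimensions are already merged, class ID 0 is treated as background or padding (as in the docstring above), and the `eps` clamp is added only to keep the logarithm finite.

```python
import math


def mask_loss(target_masks, target_class_ids, pred_masks, eps=1e-7):
    """Average binary cross-entropy over the class-specific masks of positive ROIs.

    target_masks:     [num_rois][height][width] with values 0 or 1.
    target_class_ids: [num_rois]; 0 means background or padding.
    pred_masks:       [num_rois][height][width][num_classes] probabilities.
    """
    terms = []
    for target, class_id, pred in zip(target_masks, target_class_ids, pred_masks):
        if class_id <= 0:
            continue  # Step 2: only positive ROIs contribute
        for target_row, pred_row in zip(target, pred):
            for t, class_probs in zip(target_row, pred_row):
                # Only the mask of the ground-truth class is used.
                p = min(max(class_probs[class_id], eps), 1.0 - eps)
                terms.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    # Step 3: mean over all contributing pixels, or 0 if there are none.
    return sum(terms) / len(terms) if terms else 0.0


target_masks = [[[1, 0]], [[1, 1]]]          # two ROIs, each a 1x2 mask
target_class_ids = [1, 0]                    # second ROI is background
pred_masks = [
    [[[0.5, 0.9], [0.5, 0.1]]],              # ROI 0: class-1 channel is 0.9, 0.1
    [[[0.5, 0.5], [0.5, 0.5]]],              # ROI 1: ignored
]
loss = mask_loss(target_masks, target_class_ids, pred_masks)
print(loss)
```

 Changing the predictions of the background ROI or of the non-target class channel leaves the loss unchanged, which matches the quoted definition of L_mask.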


 Inference
 The class branch predicts the class ID of each region of interest, and the mask for that class is then picked out of the mask branch's prediction.
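 A sketch of that selection on plain lists is shown below. The function name, the tensor layout, and the 0.5 binarization threshold are assumptions made for this example.

```python
def select_masks(pred_masks, class_ids, threshold=0.5):
    """Pick, for every ROI, the mask channel of its predicted class.

    pred_masks: [num_rois][height][width][num_classes] probabilities.
    class_ids:  [num_rois] class ids predicted by the class branch.
    Returns binary masks shaped [num_rois][height][width].
    """
    return [
        [[1 if pixel[class_id] >= threshold else 0 for pixel in row] for row in masks]
        for masks, class_id in zip(pred_masks, class_ids)
    ]


# One ROI with a 1x2 mask and three classes.
pred_masks = [[[[0.1, 0.8, 0.3], [0.9, 0.2, 0.7]]]]

print(select_masks(pred_masks, class_ids=[1]))
print(select_masks(pred_masks, class_ids=[2]))
```

 The same mask prediction yields a different binary mask depending on which class the class branch selects.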
Conclusion
Here we learned about some basic loss functions and how their complex variants are used in state-of-the-art networks. Go ahead and play around with the repositories in the links above!