Land Cover Classification with U-Net

Satellite Image Multi-Class Semantic Segmentation Task with PyTorch Implementation of U-Net

Srimannarayana Baratam
13 min read · Jun 16, 2021

This article is co-authored by Srimannarayana Baratam and Georgios Apostolides, as a part of the “Seminar Computer Vision by Deep Learning” course offered at TU Delft.

1 Introduction

Developing a model that accurately predicts the different types of land cover is of particular importance for monitoring environmental changes as well as the growth of human settlements. Today this often requires experts to spend a lot of time examining satellite images and labelling the land types by hand. Instead, a model could be trained on expert labels so that it automatically predicts the type of land present in a satellite image. This blog post discusses the task of image segmentation carried out on satellite images through an implementation of the famous U-Net model.

Fig1: An example of satellite image segmentation from the LCC Dataset ([LandCoverClassificationDataset] © [2018] DigitalGlobe, Inc): on the left-hand side, a sample image that needs to be segmented into the target classes of interest, as represented in the mask on the right-hand side.

Semantic segmentation refers to the task of predicting the class of individual pixels in an image. This is fundamentally different from image classification or object detection, where a single label is assigned to the whole image or objects of interest are localized. In semantic segmentation of satellite images, we need to predict, for every pixel, the type of land (including water bodies) it corresponds to.

Over the past few years, there has been significant research interest in the area of semantic segmentation following the revolutionary “Fully Convolutional Networks (FCNs)”. For fine-grained localization, models based on Encoder-Decoder (ED) architectures, Spatial Pyramid Pooling, Dilated Convolutions and other methods are actively being researched. An impressive summary of deep-learning-based state-of-the-art architectures for semantic segmentation is available in this survey for interested readers.

EDs, of which U-Net is the best-known example, are comprised of two parts: the encoder gradually reduces the spatial dimension with pooling layers, while the decoder gradually recovers the object details and spatial dimension. U-Net architectures have proven very useful for segmentation in many domains, such as medical images, street-view images, satellite images, etc. We shall be implementing this model to perform per-pixel multi-class prediction on images from the “Land Cover Classification Dataset”.

In the following sections, we will discuss the U-Net architecture and the dataset, followed by the challenges we faced in implementing the model, the modifications to the base code to overcome them, prediction results and suggestions for future work. We will also highlight key learnings from this attempt which might be of interest to fellow Deep Learning students and enthusiasts.

2 U-Net Architecture

U-Net is one of the most popular fully-convolutional architectures for semantic image segmentation. It consists of two major parts: the contracting path (left) and the expansive path (right), joined by skip connections.

The contracting path uses repeated application of two 3x3 convolutions, each followed by a ReLU, with a 2x2 max-pooling operation for down-sampling. For every down-sampling of the feature map in the contracting path, the number of channels is doubled. In contrast, on the expansive path the feature map is up-sampled with a 2x2 up-convolution, which halves the number of channels. The corresponding feature map from the contracting path and the result of the up-convolution are concatenated to form the feature map that is processed further. The last layer of the architecture performs a 1x1 convolution that reduces the 64 components to the desired number of classes (in our case, 7).
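To make the description concrete, below is a condensed PyTorch sketch of the two recurring building blocks, assuming padded 3x3 convolutions as in many popular U-Net implementations (the class names are our choices, not from the original paper):

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by ReLU (the repeated U-Net block)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    """2x2 up-convolution (halves the channels), concatenate the skip, double-convolve."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = DoubleConv(in_ch, out_ch)
    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([skip, x], dim=1))
```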

The code for the base implementation of U-Net is taken from this repository. We have made our modified version of this repository available for others to perform satellite image segmentation on the custom DeepGlobe challenge dataset.

3 Dataset

The dataset chosen for the implementation is part of the 2018 DeepGlobe challenge, which consists of three challenges: Road Extraction, Building Detection and Land Cover Classification. We will focus on the Land Cover Classification challenge, which poses the problem of classifying different land types from satellite images.

The original dataset consists of satellite images, namely 803 for training, 171 for validation and 171 for testing, which can be downloaded from [LandCoverDataset]. For our implementation we use only the training data, since the 171 images each for validation and testing don’t have labels and the submission server used for the challenge has closed. We train our network on 547 images (roughly 68%), use 136 for validation (roughly 17%) and 120 for testing (roughly 15%). [Available here]

The dataset consists of 7 land cover classes as presented in the paper [DeepGlobe2018]:

  • Urban Land: man-made areas, buildings, human establishments.
  • Agriculture Land: farms, plantations, cropland, orchards, vineyards, etc.
  • Range Land: green land that is neither forest nor farmland, e.g., grass.
  • Forest: land with at least 20% tree crown density, plus clear cuts.
  • Water: rivers, oceans, lakes, ponds and wetlands.
  • Barren Land: mountains, rocks, deserts and beaches.
  • Unknown: clouds and other artifacts.

The distribution of classes in our training+validation set is tabulated below. A significant skew between the different classes is observed; a challenge that our implementation tries to tackle using different loss functions, which will be discussed in a later section.

Fig2: Class Imbalance identified in the training and validation data together (Image by Author)

3.1 One-hot Encoding of Labels

The loss functions (introduced in the next section) take predictions and ground truth in one-hot encoded form. However, the labels provided in the dataset are .png images (refer to Fig1), in which each class has a corresponding RGB value. Furthermore, pixel intensities span the range (0, 255), so a threshold of 128 is chosen to snap them to one of the listed class colours. An algorithm was developed to convert these RGB values into class indices and, subsequently, one-hot encodings to be fed to the loss functions. A Python implementation is provided below for quick reference.
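The snippet below is a minimal sketch of that conversion (the helper name mask_to_one_hot is ours; the class colours follow the DeepGlobe colour coding, with “Unknown” as black):

```python
import numpy as np

# DeepGlobe class colours (R, G, B); index order used throughout this post.
CLASS_COLOURS = {
    0: (0, 255, 255),    # urban land (cyan)
    1: (255, 255, 0),    # agriculture land (yellow)
    2: (255, 0, 255),    # range land (magenta)
    3: (0, 255, 0),      # forest (green)
    4: (0, 0, 255),      # water (blue)
    5: (255, 255, 255),  # barren land (white)
    6: (0, 0, 0),        # unknown (black)
}

def mask_to_one_hot(mask_rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB label image into an (n_classes, H, W) one-hot array."""
    # Threshold at 128 to snap noisy intensities onto the pure class colours.
    binary = np.where(mask_rgb >= 128, 255, 0)
    one_hot = np.zeros((len(CLASS_COLOURS),) + mask_rgb.shape[:2], dtype=np.float32)
    for idx, colour in CLASS_COLOURS.items():
        one_hot[idx] = np.all(binary == colour, axis=-1)
    return one_hot
```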

4 Loss Function

As mentioned in the section addressing the dataset, class imbalance is present both across and within the images (refer to Fig2), and it needs to be tackled for good predictions. We began by changing the base code from binary segmentation to multi-class segmentation; binary cross entropy cannot be used, since the prediction of pixel labels is not binary (foreground vs. background) but multi-class (7 land classes).

4.1 Simple Categorical Cross Entropy

While training the network with a simple cross entropy, we noticed that the network’s predictions often settle to a single colour for all pixels of any input image. It quickly became evident that this colour corresponds to the over-represented class (yellow/agriculture land in our case), a consequence of the large class imbalance discussed earlier.

Simple Categorical Cross Entropy Loss Function
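Since the embed above may not render, here is a minimal PyTorch equivalent using the built-in nn.CrossEntropyLoss on raw logits and per-pixel class indices (the tensor shapes are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Illustrative shapes: batch of 2, 7 classes, 64x64 pixels.
logits = torch.randn(2, 7, 64, 64)         # raw network outputs
target = torch.randint(0, 7, (2, 64, 64))  # per-pixel class indices, e.g. one_hot.argmax(dim=0)
loss = criterion(logits, target)           # scalar loss over all pixels
```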
Fig3: Simple Categorical Cross Entropy | Left: Training Subset, Middle: Ground Truth, Right: Prediction (Images developed by Author from [LandCoverClassificationDataset] © [2018] DigitalGlobe, Inc)

“A Simple Categorical Cross Entropy loss is not enough for class imbalance datasets”

4.2 Dice Loss

Continuing our exploration of loss functions used for semantic segmentation, we decided to implement one of the most widely used loss functions for imbalanced datasets: Dice Loss.

Dice Loss for multi-class prediction

Since the PyTorch framework does not come with a predefined Dice loss, we had to either implement it ourselves or find an existing implementation. Based on a hint from an issue tracker, we implemented Dice Loss for multi-class segmentation. While training the model, however, the gradients appeared to be “frozen”; after a day’s worth of debugging we realized that certain operations on torch variables detach the output, so back-propagation is no longer possible, as discussed in this thread. Functions were chosen carefully to avoid this detachment while implementing the algorithm.
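For reference, a minimal sketch of this kind of multi-class Dice loss is shown below (the function name and the exact reduction over classes are our choices); it uses only differentiable tensor operations, so the detachment issue described above does not arise:

```python
import torch

def dice_loss(probs: torch.Tensor, one_hot: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Multi-class Dice loss.

    probs:   (N, C, H, W) softmax probabilities from the network
    one_hot: (N, C, H, W) one-hot encoded ground truth
    """
    dims = (0, 2, 3)  # reduce over batch and spatial dims, keep per-class scores
    intersection = torch.sum(probs * one_hot, dims)
    cardinality = torch.sum(probs + one_hot, dims)
    dice_per_class = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice_per_class.mean()  # average Dice over classes, turned into a loss
```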

“Avoid operations on variables that might detach the gradients (output), there will be no convergence without gradient information”

However, the convergence of the network’s loss was unstable (as argued here): the gradients of Dice Loss are much less well-behaved than those of cross entropy, and training becomes unstable.

Fig4: Training curves on a mini dataset: magenta corresponds to training with weighted cross entropy, while cyan shows the trend with Dice loss. On the left-hand side, we see the validation/test loss (simple categorical cross entropy); on the right-hand side, the plot shows the training loss over 400 epochs. (Images by Author)

“Dice Loss can become unstable during training due to its gradient’s nature.”

4.3 Weighted Cross Entropy (The Comeback)

Our final effort towards addressing the skewed class representation in the dataset, and a successful one, is the weighted form of cross entropy loss. It benefits from the well-behaved gradients of cross entropy while using per-class weights to compensate for class imbalance. This approach ensures steady convergence of the loss given an appropriate learning rate (discussed in a later section). The weight for each class was calculated by dividing the pixel count of the least-represented class by the pixel count of that class in the training set.

The weighted cross entropy loss can then be used for training as shown below:

Weighted Cross Entropy Loss Function
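The embed above may not render, so here is a minimal sketch of the idea; the per-class pixel counts are illustrative placeholders (replace them with the actual counts from the training set), and index 6 for “Unknown” follows our class ordering:

```python
import torch
import torch.nn as nn

# Illustrative per-class pixel counts (replace with counts from the training set);
# order: urban, agriculture, range, forest, water, barren, unknown.
class_counts = torch.tensor([2.0e8, 9.0e8, 1.5e8, 2.5e8, 0.5e8, 1.8e8, 0.1e8])

weights = class_counts.min() / class_counts  # least-represented class gets weight 1.0
weights[6] = 0.0  # "Unknown" is excluded from the loss, per the challenge rules

criterion = nn.CrossEntropyLoss(weight=weights)
loss = criterion(torch.randn(2, 7, 64, 64), torch.randint(0, 7, (2, 64, 64)))
```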

An implementation detail worth mentioning is that we decided to exclude the class “Unknown” from the class weights (assigning it zero weight) when passing them to the loss function. As mentioned in [DeepGlobe2018], this class refers to clouds and other artifacts; its prediction is not important and is not considered in the evaluation of the algorithm.

Weighted cross entropy loss is a good choice and offers stable convergence in the presence of class imbalance, both within a sample and across the dataset.

5 Mother of All Errors — OOM (Out Of Memory)

One of the most limiting, if not the most frustrating, errors for any deep neural network enthusiast/student is the GPU memory shortfall that throws “CUDA_ERROR_OUT_OF_MEMORY” when the model is fired up. We received no exemption here and had to take the challenge head-on. In addition to controlling the batch size and resizing the images, two interesting methods can drastically reduce the memory demand. We discuss all of these measures here.

5.1 Batch Size

Undeniably, the first and foremost answer on all forums when OOM is googled is to reduce the batch size as much as possible. However, in the case of satellite image segmentation, each image in the dataset is roughly 2500 x 2500 pixels. So, despite setting the batch size to 1, OOM showed no mercy.

5.2 Image Scaling

Of course, resizing the images to a lower resolution lowers the memory allocation requirement. Yet we need to realize that reducing resolution also means losing local information from the data. For images with scattered classes, this leads to poor predictions. As much as we hated to reduce the resolution, it was inevitable, and 40% of the original resolution was retained for training and prediction.

5.3 Half Precision

One highly effective way to reduce memory demand is to run the neural network in half precision (16-bit) mode during training and convert back to full precision (32-bit) for the back-propagation step. The resulting model may have slightly lower accuracy than its full-precision counterpart, but the benefits outweighed the cons in this specific case and we chose to train in half precision. It can be implemented with ease in PyTorch by calling the methods shown below.
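The exact embed is not reproduced here. As one way to realize this half/full switching in PyTorch, the built-in automatic mixed precision utilities (torch.cuda.amp, available since PyTorch 1.6) automate the pattern described above; a minimal training-loop sketch, assuming net, criterion and train_loader are defined as elsewhere in the pipeline:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

optimizer = torch.optim.Adam(net.parameters(), lr=4e-5)
scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

for images, masks in train_loader:
    images, masks = images.cuda(), masks.cuda()
    optimizer.zero_grad()
    with autocast():               # forward pass runs in fp16 where safe
        loss = criterion(net(images), masks)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, steps in full precision
    scaler.update()
```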

5.4 Reduced Model

As a final resort, one might even consider reducing the number of channels in each layer while retaining the overall encoder-decoder architecture; not so surprisingly, it is common practice in prototyping to use half the number of channels proposed in the original paper. While we did use this approach initially to test various aspects of the project, we eventually trained the model with full channels on the complete training data. Nevertheless, it is a powerful tool for tackling OOM issues.

Half precision training is a worthy weapon against OOM errors. Using a fraction of the channels for prototyping can also be considered in case of memory limitations.

6 Optimal Learning Rate

“The learning rate is perhaps the most important hyper-parameter. If you have time to tune only one hyper-parameter, tune the learning rate.” — Ian Goodfellow et al.

While one can find an effective learning rate through a conventional grid search, a very systematic approach to finding the optimal learning rate has been proposed by Jeremy Jordan.

To summarize his approach in short, “……we gradually increase the learning rate after each mini batch, recording the loss at each increment…… This gradual increase can be on either a linear or exponential scale. The best learning rate is associated with the steepest drop in loss.”
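A minimal sketch of such a range test (exponential variant) is shown below; the bounds lr_min and lr_max are illustrative, and net, criterion and train_loader are assumed to be defined as elsewhere in the pipeline:

```python
import torch

lr_min, lr_max = 1e-7, 1e-1
steps = len(train_loader)
gamma = (lr_max / lr_min) ** (1.0 / steps)  # per-batch multiplicative LR growth

optimizer = torch.optim.Adam(net.parameters(), lr=lr_min)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

lrs, losses = [], []
for images, masks in train_loader:
    optimizer.zero_grad()
    loss = criterion(net(images.cuda()), masks.cuda())
    loss.backward()
    optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])
    losses.append(loss.item())  # plot losses vs. lrs; pick an LR on the steepest drop
    scheduler.step()
```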

We tested this approach in our implementation, and the chosen learning rates always provided very stable convergence. The mini-batch loss curve for the half-precision U-Net model on the full dataset is shown below along with the associated learning rate. Learning rates in the range of 2e-6 to 2e-5 resulted in a very steep slope, and thus we chose an LR of 4e-5 to fully train our model.

Fig5: Systematic approach towards optimal learning rate — U-Net (Half Precision mode) [Images by Author]

To summarize the hyper-parameter settings chosen to improve training and inference under our compute limitations: the model is trained with a batch size of 4, an image scaling factor of 0.2, half precision mode, the ADAM optimizer, a constant LR of 4e-5, a train-validation split of 80-20 and 200 epochs.

Results

Training Curves

The model was trained on a Deep Learning VM hosted on Google Cloud Platform. With a K80 GPU, the half-precision model took 24 hours to complete training in the aforementioned hyper-parameter configuration.

Fig6: Training Curve showing the loss convergence over the training and validation data sets (Images by Author)

Before the reader gets upset over the test (validation) loss being so “high” compared to the training loss, we would like to clarify that the training loss is the weighted cross entropy loss explained in an earlier section, while the validation loss is simple cross entropy; weighting the validation loss would be unfair (follow this thread for a nice discussion). This may also explain the fluctuation in the validation loss (the interested reader may follow this thread to read further).

Metric

For the evaluation of the model, we followed the instructions of the challenge’s proposers [DeepGlobeChallenge2018] and used the Intersection over Union (IoU) metric, comparing results on the land cover segmentation challenge.

IoU is a metric for quantifying the precision of a segmentation algorithm by overlapping the model’s predictions with the ground truth. As explained in this article, IoU divides the number of pixels common to the prediction and the ground truth (the intersection of the two sets) by the total number of pixels present in either (the union of the two sets).

For our case, the IoU was calculated for each class and then averaged over the number of classes.

It is important to mention that, as instructed by the challenge rubric, the class “Unknown” was not included in model evaluation, so the total number of classes used to calculate the IoU metric is 6.
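A minimal per-class IoU sketch consistent with this protocol (the helper name and the skipping of classes absent from both prediction and ground truth are our choices):

```python
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, n_classes: int = 7,
             ignore_index: int = 6) -> float:
    """Mean IoU over classes, skipping "Unknown" (assumed to be index 6).

    pred, target: (N, H, W) integer class maps.
    """
    ious = []
    for c in range(n_classes):
        if c == ignore_index:
            continue
        pred_c, target_c = (pred == c), (target == c)
        intersection = (pred_c & target_c).sum().item()
        union = (pred_c | target_c).sum().item()
        if union > 0:  # skip classes absent from both sets
            ious.append(intersection / union)
    return sum(ious) / len(ious)
```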

Performance

Land cover segmentation was the most challenging task proposed in the DeepGlobe 2018 challenge. The baseline proposed by the organizers was based on ResNet18 with atrous spatial pyramid pooling and had an IoU score of 0.433 at epoch 30 with a patch size of 512x512 (corresponding to a scale of 0.2). Our model achieves an IoU score of 0.608 at 30 epochs with the same image scaling factor on the test set we separated from the training data. The best score on the leader-board of the DeepGlobe Challenge 2018 is 0.6005, evaluated of course on a different test set.

Fig7: Terminal output showing the IoU metric output
Fig8: Qualitative Analysis: Left Column — Subset of the input images ([LandCoverClassificationDataset] © [2018] DigitalGlobe, Inc) from the test set, Middle Column — Corresponding ground truth masks, Last Column — Predicted Segmentation Masks (Images by Author)

Fig8 shows a few samples from the test set along with the predicted masks. Visually, we can qualitatively assess that the model is able to perform meaningful segmentation of the input images. We should also remember that only 40% of the original image resolution was used for training and evaluation; we believe this loses local information and might limit the model in learning fragmented and fine class instances within an image (refer to the second example in Fig8).

Note that we trained on fewer images than the other challengers (due to the internal split of the data discussed earlier), and with limited resolution, batch size and half precision owing to limited memory/compute. We believe the performance would improve further in the absence of such bottlenecks.

Conclusion and Future Work

Through this blog, we hope to have provided the reader with a comprehensive perspective on multi-class semantic segmentation as a dense prediction task, on neural network architectures designed for this category of tasks, and on an implementation of U-Net on satellite images covering common challenges one might face along the way. We also shared training and inference results, both qualitatively and quantitatively. The modified code, along with a notebook wrapper, can be found here.

Given the time and resource constraints, we could not explore the approaches listed below to address the various bottlenecks discussed in this blog. Interested Deep Learning enthusiasts may consider pursuing them and continuing the process of learning.

  1. More advanced architectures, such as those using Spatial Pyramid Pooling (e.g., DeepLabV3+), could be put to the task.
  2. To address OOM issues, breaking the images into tiles (with mirror padding at borders) can be considered in the pre-processing stage. This will ensure that full image resolution is used effectively.
  3. A soft dice loss can also be tested for the skewed class representation problem.
  4. A scheduler can be added as well to reduce the learning rate gradually, of course within the optimal range found through the technique discussed.
  5. Early stopping may be implemented using validation accuracy or other metric to prevent over-fitting of the model.
