Mapping of Land Use and Land Cover (LULC) using EuroSAT and Transfer Learning

As the global population continues to expand, so does the demand for natural resources. Land-related human activities such as agriculture and forestry account for 23% of global greenhouse gas emissions. At the same time, remote sensing technologies have emerged as valuable tools for managing our environment. These technologies allow us to monitor land use, plan urban areas, and drive advancements in areas such as agriculture, climate change mitigation, disaster recovery, and environmental monitoring. Recent advances in AI, computer vision, and Earth observation data have enabled unprecedented accuracy in land use mapping. Using transfer learning and fine-tuning with RGB bands, we achieved 99.19% accuracy in land use classification. Such findings can inform conservation and urban planning policies.


Introduction
The world population has increased significantly over the last few centuries and is projected to continue growing [1]. With growing demand, human beings are consuming more natural resources, including water, energy, minerals, and agricultural products. Activities such as agriculture, forestry, and urbanization contribute to 23% of global greenhouse gas emissions [2], primarily due to deforestation and land degradation. Monitoring land use changes is vital for better environmental management, urban planning, and nature protection [3,4].
With advancements in remote sensing technologies, satellite image data are now either freely available or can be commercially acquired, promoting innovation and entrepreneurship. Thanks to the use of low-orbit and geostationary satellites [5], we can observe the Earth with unprecedented detail. Moreover, improvements in remote sensing technology have resulted in better spatial resolution [6], enabling more precise ground surface analyses. Such data access has fueled advancements in agriculture, urban development, climate change mitigation, disaster recovery, and environmental monitoring [7][8][9]. Advances in computer vision, AI, and Earth observation data facilitate large-scale land use mapping [10,11]. The community has extensively embraced methods for classifying land use and land cover (LULC), including machine learning [12] and deep learning (DL) [13]. Recent studies suggest that DL techniques demonstrate remarkable performance in remote sensing (RS) image scene classification [14]. The objective of image scene classification and retrieval is to automatically allocate class labels to every RS image scene stored in an archive. This differs from semantic segmentation tasks used for LULC mapping and classification.
One application of this technology is scene classification [15], which involves labeling an image based on specific semantic categories. This approach has numerous practical uses, including LULC analysis as well as land resource management [16,17].
Figure 1: Mapping of LULC using satellite images.

Figure 1 provides an overview of the LULC classification process using satellite images. Satellites capture images of the Earth. These images are then utilized to extract patches for classification. The objective is to automatically label the physical type of land or its utilization. Each image patch is fed into a classifier, which then outputs the corresponding class depicted on the patch.
However, DL models tend to overfit and demand large quantities of labeled input data to perform well on unseen data [18]. This limitation has restricted their adoption in geosciences and remote sensing. Leveraging pre-trained models from optical datasets, such as ImageNet [19], can facilitate training new RS models using smaller labeled datasets. Various approaches have leveraged pre-trained models alongside EuroSAT and have demonstrated encouraging results [20][21][22].
With the advancement of Vision Transformers (ViT), many applications have adopted them for image classification tasks [23], including on EuroSAT [24,25]. It has been suggested that further scaling can enhance performance [26], but this model has yet to be integrated with geospatial data.
The contributions of this paper are summarized below:
• We offer a thorough evaluation of the Vision Transformer (ViT) model, considering a range of settings and hyperparameters, with a specific focus on the EuroSAT dataset.
• We have implemented cutting-edge model improvement techniques to improve the performance of the selected model.
• The 'Kreis Borken' area is mapped using the best-performing model and geospatial data for visual classification.
The paper is structured as follows: Section 2 reviews related work. Section 3 describes the dataset used in this study and introduces the methodologies applied using the ViT model. Section 4 presents the results and provides an analysis. Section 5 discusses the findings and concludes the paper.

Related Works
This section discusses various state-of-the-art image classification techniques that use DL and Transfer Learning (TL) for LULC using the EuroSAT dataset.
Fine-tuning large-scale pre-trained ViTs, introduced by Dosovitskiy et al. [27], has shown prominent performance on computer vision tasks such as image classification, and ViTs have been applied to EuroSAT data since their early proposal.
The model utilized in this study is shown in Figure 2. In their study, Helber et al. [30] tested the GoogleNet and ResNet-50 architectures with various band combinations. Among these, ResNet-50 using RGB bands outperformed the other configurations. Li et al.'s DDRL-AM method [31] achieved a peak accuracy of 98.74% with RGB bands. Yassine et al. [32] implemented two approaches on the EuroSAT dataset. The first utilized 13 Sentinel-2 spectral bands and achieved 98.78% accuracy. The second combined these 13 spectral bands with calculated indices, resulting in an improved accuracy of 99.58%.
Naushad et al. [33] advanced LULC classification by employing transfer learning with pre-trained VGG16 and Wide Residual Networks (WRNs) on the EuroSAT dataset's RGB version. Techniques such as data augmentation [34], gradient clipping [35], adaptive learning rates [36], and early stopping [37] optimized performance and reduced computational time. Their WRN-based approach achieved a remarkable accuracy of 99.17%, setting a new benchmark in efficiency and accuracy.
Recently, other variants, including hierarchical ViTs with diverse resolutions and spatial embeddings [38], have been proposed. These advancements in large ViTs underscore the importance of developing efficient model adaptation strategies.

Materials and Methods
To classify LULC in a specific region using geospatial data, a transfer learning (TL) task was undertaken with a ViT model pre-trained on ImageNet-21k, and the EuroSAT classes were subsequently color-mapped. Two datasets were utilized: one with data augmentation and one without. For simplicity, the RGB version of the EuroSAT dataset was chosen, and the model was trained using the PyTorch framework. Both model training and testing were conducted on the Tesla T4 GPUs available on Google Colab.

Dataset
The EuroSAT dataset is a novel dataset comprising 27,000 labeled and georeferenced images captured by the Sentinel-2 satellite. The images are classified into ten scene classes: Forest, Herbaceous Vegetation, Highway, Pasture, River, Industrial, Permanent Crop, Residential, Annual Crop, and Sea/Lake. Each image patch contains 64 × 64 pixels with a spatial resolution of 10 m. Figure 3 displays several sample images obtained from the EuroSAT dataset [39]. The dataset is split at random into a train set (80% of the data) and a test set (20% of the data).
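The random 80/20 split can be sketched with PyTorch's `random_split`. A small random tensor dataset stands in for EuroSAT here so the sketch is self-contained (in practice, `torchvision.datasets.EuroSAT` or a folder of downloaded patches would be used); the seed value is an illustrative assumption.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder for the EuroSAT RGB dataset: 270 random 64x64 RGB "patches"
# with labels drawn from the 10 scene classes.
dataset = TensorDataset(torch.randn(270, 3, 64, 64), torch.randint(0, 10, (270,)))

# Random 80/20 train/test split, as described above.
n_train = int(0.8 * len(dataset))
train_set, test_set = random_split(
    dataset, [n_train, len(dataset) - n_train],
    generator=torch.Generator().manual_seed(42),  # fixed seed for reproducibility
)
```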

Training and Evaluation
During model training, two sets of data were fed into the system: one with augmentation and one without. In the augmentation process, various image transformation techniques [40], such as crops, horizontal flips, and vertical flips, were utilized to augment the data. This strategy aids in preventing the neural network from overfitting to the training dataset, enabling it to generalize more effectively to unseen test data. Since we utilized pre-trained models, the input dataset was normalized to match the statistics (mean and standard deviation) of those models.
The model was trained under various settings, and its accuracy was subsequently evaluated. In this context, the loss was quantified using cross-entropy loss. To counteract potential issues of vanishing or exploding gradients during training, which could adversely affect the parameters, the gradient clipping technique [41] was employed with a value set to 1.0. We used the Adam optimizer, combined with multiple learning rates, due to its proven efficacy in image classification tasks [42]. Regularization strategies, including early stopping, dropout, and weight decay [43], were also implemented to combat overfitting and to optimize time and resources. We also recorded the total duration of each experiment.
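The training setup described above can be sketched as follows: cross-entropy loss, the Adam optimizer with weight decay, gradient clipping at 1.0, and simple early stopping. A tiny linear model and random tensors stand in for the ViT and EuroSAT so the sketch runs standalone; the learning rate and patience values are illustrative assumptions.

```python
import torch
from torch import nn

# Stand-in model and data (a ViT and EuroSAT loaders in the actual setup).
model = nn.Linear(3 * 64 * 64, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 3 * 64 * 64)
y = torch.randint(0, 10, (32,))

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(20):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Clip gradient norm at 1.0 to guard against exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    val_loss = loss.item()  # placeholder: evaluate on a held-out set in practice
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping when validation loss plateaus
            break
```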

Results
This section presents the results obtained from the ViT, VGG16, and ResNet-50 models under the same settings and then narrows down to the ViT model's performance under different settings. Metrics such as accuracy and the time taken to train the model are measured. The model's performance is assessed both with and without data augmentation, maintaining similar settings for both. As observed in Table 2, the accuracy of the model increases as it trains longer on augmented data; in contrast, accuracy decreases for the non-augmented version. Training on augmented data also takes relatively longer than on non-augmented data. The loss and accuracy of both the augmented and non-augmented models are depicted in Figure 4. Subsequently, the model is tested with test data, as shown in Figure 5.

Figure 6 shows the confusion matrix of the ViT model with and without data augmentation on validation data. The Forest and Sea/Lake classes showed the best performance, with higher accuracy on augmented data. The Pasture class has the lowest accuracy, while the remaining classes reach almost 99% accuracy. From the experiments conducted, we found that the model trained with augmented data yields better accuracy.

Using Google Earth Engine and Sentinel-2A [44], we selected the Kreis Borken area. The data, drawn from satellite images spanning 2018 to 2020, was segmented into 64 × 64 tiles within the specified boundary. Using the model, these tiles were classified and color-mapped, as depicted in Figure 7.
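The tiling step for the regional map can be sketched as follows: a large scene is cut into non-overlapping 64 × 64 patches, each of which would then be classified and color-mapped by predicted class. The `tile_image` helper is a hypothetical illustration, not the paper's code, and assumes the scene is an H × W × C array.

```python
import numpy as np

def tile_image(arr, tile=64):
    """Split an H x W x C array into non-overlapping tile x tile patches,
    dropping any partial tiles at the right/bottom edges."""
    h, w = arr.shape[:2]
    return [arr[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

# A 256 x 192 scene yields a 4 x 3 grid of 64 x 64 tiles.
patches = tile_image(np.zeros((256, 192, 3)))
```

Each patch can then be passed through the classifier, and the predicted class index mapped to a color for the final LULC map.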

Conclusion
The objective of this research was to explore the application of transfer learning in LULC classification.
We utilized the ViT model, pre-trained on ImageNet-21K, and fine-tuned it with the RGB bands of the EuroSAT dataset for classification. Consistent with other experiments, the results of this study suggest that transfer learning is a reliable method capable of delivering superior results. Our approach advanced the state-of-the-art by achieving 99.19% accuracy on the RGB bands of the EuroSAT dataset. We also compared the classification results with and without data augmentation. Data augmentation was observed to enhance the visual variability of each training image without introducing new spectral or topological information, thereby enriching the dataset's diversity. The performance of the augmented data surpassed that of the model trained on the original dataset.
Additionally, this study incorporated model enhancement techniques, including regularization, early stopping, gradient clipping, and learning rate optimization, to optimize model training, improve performance, and reduce the computational time needed. The best-performing model was subsequently used to map class distributions and offer insights into a specific region of geospatial imagery. Such insights can assist in monitoring shifts in land usage and shaping policies for environmental conservation and urban development.

Figure 3 :
Figure 3: Sample images of three different classes from the EuroSAT dataset.

Figure 4 :
Figure 4: Comparison of loss and accuracy of models at different epochs and iterations.

Figure 5 :
Figure 5: Both the augmented model (20 epochs) and the non-augmented model (20 epochs) were able to predict correct results with test data.

Figure 6 :
Figure 6: Confusion matrix of the ViT model with and without data augmentation on validation data.

Figure 7 :
Figure 7: Mapping of LULC of 'Kreis Borken' area using satellite images and ViT model.

Table 1 :
Comparative experimental results of ViT, ResNet-50 and VGG16 models with and without data augmentation.

Table 1
shows that the ViT model is more accurate on both augmented and non-augmented data, but it takes longer to train than the other models. Meanwhile, the ResNet-50 model exhibits better accuracy than VGG16 while requiring relatively less training time for both augmented and non-augmented data.

Table 2 :
Comparative experimental results of ViT model with and without data augmentation.