Deep Learning Based License Plate Number Recognition for Smart Cities

Abstract: Urban areas that aspire to become smart cities should have a number of necessary elements in place to achieve the intended objective. Precise control and management of traffic conditions, increased safety and surveillance, and enhanced incident avoidance and management should be top priorities in smart city management. At the same time, Vehicle License Plate Number Recognition (VLPNR) has become a hot research topic, owing to several real-time applications such as automated toll fee processing, traffic law enforcement, private space access control, and road traffic surveillance. Automated VLPNR is a computer vision-based technique that recognizes automobiles by their number plates. The current research paper presents an effective Deep Learning (DL)-based VLPNR model, called DL-VLPNR, to identify and recognize the alphanumeric characters present in a license plate. The proposed model involves two main stages, namely license plate detection and Tesseract-based character recognition. Detection of the license plate takes place with the help of Faster R-CNN with the Inception V2 model. Then, the characters in the detected number plate are extracted using the Tesseract Optical Character Recognition (OCR) model. The performance of the DL-VLPNR model was tested using two benchmark databases, and the experimental outcome established the superior performance of the model compared to other methods.


Introduction
There has been a tremendous increase in the usage of vehicles in recent years, thanks to rapid economic growth. In smart cities, road safety can be improved through an automated VLPNR process. VLPNR offers significant benefits in several real-time settings. It is useful in applications such as automated toll fee collection systems [1], car parking access control [2], and road traffic control [3]. VLPNR is an active research domain that has received increasing attention in recent years. Various applications have been developed recently that deploy intelligent transportation and surveillance systems, helped by improvements in digital cameras and computational capacity. These systems are intended to recognize vehicles by their number plates and offer automated identification and recognition of vehicle license plates from real-time images. After a vehicle's front view is captured using a camera, the captured image is fed into computer vision-based algorithms that examine, identify, and separate the plate areas from the backdrop. The identification process performs character segmentation in the detected area, followed by character recognition. Identification and recognition of number plates are two different tasks. Various models exist for a particular kind of number plate style (font size, text, font type, backdrop color, and shape) or for particular conditions such as camera motion, angle, lighting, and occlusion.
Classical VLPNR models utilize Machine Learning (ML) models with hand-crafted features to represent the essential features of a vehicle license plate image. These models gather some morphological variables and are susceptible to noise in the image and to complicated backdrops. DL models offer an alternative: automated feature selection from images with the help of representations of the underlying data learned through trainable filters. Convolutional Neural Networks (CNNs) are among the most advanced and effective DL models and have gained significant attention in recent years in various fields of computer vision such as handwriting recognition [4], text recognition [5], and visual object recognition. Since identifying the location of a vehicle license plate is treated as a detection problem, diverse region-based CNN models can be employed to detect the objects in a rapid and precise manner. For VLPNR, the available DL-based models are divided into two types, namely segmentation-based and segmentation-free models. The former carry out a segmentation task to separate the characters and then recognize the individual characters. The latter, on the other hand, recognize the characters without separation, using particular architectures such as Recurrent Neural Networks (RNNs).
In the detection process, one of the steps is the localization of the bounding boxes of the vehicle license plate within the complete input image. The outcome of this step affects the accuracy of the detection process significantly. Several VLPNR models have been proposed and implemented in the literature. The classical ML models, with hand-crafted features, depend on a particular set of descriptors such as edge, color, and texture descriptors [6]. Some methods treat vehicle license plate recognition as the detection of a homogeneous text region by detecting characters directly from the image [7]. Though simple and rapid, such models produce a low detection rate, since the features learned from the characters are not sufficient to identify every character present in the image. Besides, other characters present in the image create confusion when detecting the vehicle license plate. Other existing models assume that the license plate is an area with high contrast and edge density, or a portion that contains a high density of key points identified with the Scale-Invariant Feature Transform (SIFT) descriptor [8].
Currently, DL-based models are in use to localize vehicle license plates. In particular, a 4-layer CNN-based model is used to detect the text regions present in the input image. Afterward, a second 4-layer plate/non-plate CNN classification model is applied to differentiate vehicle license plates from ordinary text. The authors in [9] utilized a classification model based on the FAST-YOLO network to detect the front view of cars in the input image. The vehicle license plate details are then extracted from the identified front view image. The work in [10] employed a pipeline architecture using a series of deep CNNs to detect vehicle license plates under diverse scenarios. The architecture handles a set of different number plate designs. However, it is designed mainly for Arabic text, and so it cannot be applied to other languages.
The current research paper presents an effective DL-based VLPNR method to identify and recognize the alphanumeric characters, present in the license plate. The proposed model involves two main stages namely, license plate detection and Tesseract-based character recognition. License plate detection occurs with the help of Faster RCNN and Inception V2 model. Then, the characters in the detected number plate are extracted by Tesseract OCR model. The study validated the performance of DL-VLPNR model utilizing a set of two benchmark databases. The experimental outcomes established the optimal performance of the proposed method over compared techniques.
The upcoming sections of the paper are as follows. Section 2 briefs the works related to VLPNR model. Section 3 discusses the presented DL-VLPNR model. The validation of the proposed DL-VLPNR model is presented in Section 4, and the paper is concluded in Section 5.

Literature Survey
Segmentation-dependent models first extract every individual character from the vehicle license plate. Afterwards, an OCR algorithm recognizes every character from the extracted image. The existing models for vehicle license plate image segmentation are of two types: projection-based and connected component-based. The former exploits the fact that characters and backdrop have different colors in a number plate and therefore take opposite values in the binary image. The histograms of vertical and horizontal pixel projections can then be used to segment the characters [11]. These models are easily affected by rotation of the vehicle license plate. The connected component-based model segments the characters by labeling the connected pixels in the binary image as components. Though it is robust to rotation, it fails to segment the characters properly when they are joined together or broken apart. After the segmentation of characters, recognition is performed as a classification task, with an individual class for every alphanumeric character.
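As an illustration of the projection-based idea, the following minimal sketch segments a binarized plate by its vertical projection histogram. It is not the method of [11]; the function name, the toy image, and the rule of cutting at empty columns are assumptions made for the example.

```python
def segment_by_vertical_projection(binary, min_width=1):
    """Split a binary plate image into character column ranges.

    binary: list of rows, each a list of 0/1 values (1 = character pixel).
    Returns a list of (start_col, end_col) pairs, end exclusive.
    """
    if not binary:
        return []
    n_cols = len(binary[0])
    # Vertical projection: number of character pixels in each column.
    projection = [sum(row[c] for row in binary) for c in range(n_cols)]

    segments = []
    start = None
    for c, count in enumerate(projection):
        if count > 0 and start is None:
            start = c  # a character run begins
        elif count == 0 and start is not None:
            if c - start >= min_width:
                segments.append((start, c))  # an empty column ends the run
            start = None
    if start is not None and n_cols - start >= min_width:
        segments.append((start, n_cols))  # run touching the right border
    return segments


# Toy 3 x 9 "plate" with three character blobs (hypothetical data).
plate = [
    [0, 1, 1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 1, 0, 1, 1],
]
print(segment_by_vertical_projection(plate))
```

Because the cut points depend on empty columns, a rotated plate smears the projection and merges neighbouring runs, which is the sensitivity to rotation noted above.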
Existing recognition techniques fall into two groups, namely template matching and learning-based techniques. The former compares the similarity of a given character against a set of templates and chooses the template with the highest resemblance. Various similarity metrics have been used, for instance the Mahalanobis and Hamming distances [12]. These metrics are applied to binary images and are limited because they work only for a fixed character size and font; they do not handle rotated or broken characters. The latter group is more robust and handles characters of different sizes, fonts, and rotations. It makes use of ML models to differentiate the characters with the help of one or more features such as edges, gradients, and SIFT. In an earlier study [13], a 5-layer CNN was used to recognize Malaysian vehicle license plates, where every character was extracted and segmented manually. The VLPNR process was treated as a classification problem with 33 classes, and the model achieved a maximum accuracy of 98.79% on a limited number of samples. A CNN-based model for VLPNR was introduced in [14], which used several preprocessing steps such as filtering, thresholding, and segmentation.
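Template matching with the Hamming distance can be sketched as follows. This is only an illustrative example, not the procedure of [12]; the 3 × 3 templates and the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length binary glyphs differ."""
    if len(a) != len(b):
        raise ValueError("glyph and template must have the same size")
    return sum(x != y for x, y in zip(a, b))


def match_character(glyph, templates):
    """Return the label of the template closest to the glyph."""
    return min(templates, key=lambda label: hamming_distance(glyph, templates[label]))


# Hypothetical 3 x 3 binary templates, flattened row by row.
TEMPLATES = {
    "I": (0, 1, 0, 0, 1, 0, 0, 1, 0),
    "L": (1, 0, 0, 1, 0, 0, 1, 1, 1),
    "T": (1, 1, 1, 0, 1, 0, 0, 1, 0),
}

# A "T" with one corrupted pixel in the bottom-right corner.
noisy_t = (1, 1, 1, 0, 1, 0, 0, 1, 1)
print(match_character(noisy_t, TEMPLATES))
```

The comparison is pixel by pixel, so the glyph must already have the size and orientation of the templates, which is why this family of methods struggles with rotated or broken characters.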
In segmentation-free models, VLPNR is carried out on the whole vehicle license plate image with no character segmentation. Generally, a sliding window is moved over the input image in small steps to generate several tentative characters. Then, every tentative character is passed to a recognition model. Once the sliding window has swept the input image completely, the predicted outputs are examined and the final sequence is decided. Successive identical characters are treated as a single character, whereas the character space is used for separation. Some VLPNR models implement segmentation-free recognition with DL. In [15], a CNN model was proposed to obtain the features of the vehicle license plate, together with an RNN to find the sequence of characters. In an earlier study [16], the recognition of license plate characters was treated as a sequence labeling problem and solved by an RNN with Long Short-Term Memory (LSTM). A deep (16-layer) CNN based on Spatial Transformer Networks was used in [17] to make character identification less sensitive to spatial transformations of the entire license plate image. This model also avoids the crucial process of segmenting the image into characters. A YOLO-based network was proposed for VLPNR using an integrated classification-detection model. In [18], a CNN model was utilized to identify the characters in the vehicle license plate and localize the corners of the character bounding boxes. It handled a classification task with 33 classes for Italian VLPNR. Generally, it is observed that DL models for VLPNR are still under development and are limited to particular scenarios. Few of the models discussed above perform VLPNR in a dedicated way, while many are based on hand-crafted features.
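The rule that successive identical predictions count as one character, with the character space acting as a separator, can be sketched as below. The blank symbol "-" and the function name are assumptions made for illustration; the cited works may decode their sequences differently.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol for "character space / no character"


def decode_window_predictions(labels, blank=BLANK):
    """Collapse per-window predictions into the final plate string.

    Runs of identical labels are merged into one character, and blanks are
    dropped afterwards, so a blank between two equal labels keeps them apart.
    """
    collapsed = [label for label, _ in groupby(labels)]
    return "".join(label for label in collapsed if label != blank)


# One predicted label per sliding-window position (hypothetical data).
windows = list("AA-BB--11-1")
print(decode_window_predictions(windows))
```

In this toy sequence the two adjacent "1" windows merge into one character, while the blank before the last "1" keeps it as a separate character.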

The Proposed DL-VLPNR Model
The working process of the proposed DL-VLPNR method is depicted in Fig. 1. The proposed DL-VLPNR model has two main stages. Number plate detection is performed with the help of Faster R-CNN and the Inception V2 model. Then, the characters in the detected number plate are extracted by the Tesseract OCR model. At this point, the Tesseract OCR engine is used to recognize the alphanumeric characters present in the detected plate. The Tesseract engine is trained to improve the accuracy of the analysis. The training process involves generating images of the characters to be predicted in the desired fonts. Further, a dictionary of the valid character sequences for a number plate should be defined, covering information such as regional codes, suffixes, and registration numbers. The result of this phase is the text representation of the vehicle number.

Faster R-CNN with Inception V2 Model for Number Plate Detection
The Faster R-CNN method has two major stages: a Region Proposal Network (RPN) and the Fast R-CNN detector. The RPN generates the region proposals, and the Fast R-CNN model then explores the objects within the proposed regions. The Faster R-CNN approach takes the whole image and a set of object proposals as input in order to forecast the abnormalities that exist in the input image, as shown in Fig. 2 [19]. In the first phase, Faster R-CNN features are computed to filter the candidate regions, whereas in the second phase the class labeling function is carried out. As a result, class labels are declared for all the observed regions in a video frame. Later, the anomaly is predicted. A group of frames obtained from a video sequence of the tracked objects acts as the input for anomaly detection. Faster R-CNN performs this operation and maps the observed regions. When the observed regions of a frame are identified, the corresponding labels are assigned together with their prediction values.
The RPN receives images of diverse sizes and outputs a group of rectangular object proposals, each with an objectness score. It is modeled as a fully convolutional network, and its main aim is to share computation with Fast R-CNN. To produce the region proposals, a small network is slid across a convolutional (Conv) feature map. It takes an n × n spatial window of the feature map as input. Each sliding window is mapped to a lower-dimensional feature, which is fed into two sibling fully connected layers, one for box regression (reg) and one for box classification (cls). The RPN is trained by assigning a binary class label to every anchor. A positive label is allocated to two kinds of anchors: (i) the anchor with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, and (ii) any anchor whose IoU with a ground-truth box is higher than 0.7. A single ground-truth box may therefore assign positive labels to several anchors. In most cases, the second condition is sufficient to find the positive samples, and the first condition is needed only in the rare scenarios where the second finds none.
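The anchor labeling rule can be illustrated with the short sketch below. The 0.7 positive threshold comes from the text; the 0.3 negative threshold, the "ignored" label -1, the corner-based box format (x1, y1, x2, y2), and all names are assumptions borrowed from common Faster R-CNN practice rather than details stated in this paper.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for every anchor."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row, default=0.0)
        if best > pos_thresh:
            labels.append(1)       # rule (ii): IoU above the positive threshold
        elif best < neg_thresh:
            labels.append(0)       # assumed negative threshold
        else:
            labels.append(-1)      # neither clearly positive nor negative
    # Rule (i): the best-overlapping anchor of each ground-truth box is positive.
    for g in range(len(gt_boxes)):
        best_anchor = max(range(len(anchors)), key=lambda i: overlaps[i][g])
        if overlaps[best_anchor][g] > 0:
            labels[best_anchor] = 1
    return labels


# Hypothetical anchors and ground-truth plates.
anchors = [(0, 0, 10, 10), (0, 0, 10, 5), (5, 5, 15, 15), (42, 42, 52, 52)]
gt_boxes = [(0, 0, 10, 10), (40, 40, 50, 50)]
print(label_anchors(anchors, gt_boxes))
```

In this toy case the last anchor overlaps the second ground-truth box with an IoU of only about 0.47, so it becomes positive through rule (i) alone, which is the rare scenario mentioned above.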
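Faster R-CNN minimizes an objective function that follows the multi-task loss of Fast R-CNN. It can be formulated as in Eq. (1):

$$L(\{p_u\}, \{z_u\}) = \frac{1}{N_{cls}} \sum_u L_{cls}(p_u, p_u^*) + \lambda \frac{1}{N_{reg}} \sum_u p_u^* L_{reg}(z_u, z_u^*) \qquad (1)$$

where $u$ denotes the index of an anchor and $p_u$ is the predicted probability that anchor $u$ is an object. The ground-truth label $p_u^*$ is 1 if the anchor is positive and 0 if it is negative. $z_u$ represents a vector of the four parameterized coordinates of the predicted box, and $z_u^*$ is the corresponding vector for the ground-truth box. $N_{cls}$ and $N_{reg}$ are normalization terms and $\lambda$ is a balancing weight, as in the standard Faster R-CNN formulation. The classification loss $L_{cls}$ is the log loss over the two classes. The regression loss is $L_{reg}(z_u, z_u^*) = R(z_u - z_u^*)$, where $R$ is the robust loss function (smooth L1). The term $p_u^* L_{reg}$ means that the regression loss is active only for $p_u^* = 1$ and inactive for $p_u^* = 0$. The outputs of the cls and reg layers are $\{p_u\}$ and $\{z_u\}$, respectively. The bounding box regression parameters are shown in Eq. (2):

$$z_a = (a - a_x)/w_x, \quad z_b = (b - b_x)/h_x, \quad z_w = \log(w/w_x), \quad z_h = \log(h/h_x),$$
$$z_a^* = (a^* - a_x)/w_x, \quad z_b^* = (b^* - b_x)/h_x, \quad z_w^* = \log(w^*/w_x), \quad z_h^* = \log(h^*/h_x) \qquad (2)$$

where $a$, $b$, $w$, and $h$ are the center coordinates of the box and its width and height. The variables $a$, $a_x$, and $a^*$ refer to the detected box, the anchor box, and the ground-truth box, respectively (and likewise for $b$, $w$, and $h$).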
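The regression term can be made concrete with a small sketch. It assumes the usual smooth L1 definition (0.5x² for |x| < 1, |x| − 0.5 otherwise) and the usual Faster R-CNN box parameterization relative to the anchor; the function names and the sample boxes are hypothetical.

```python
import math


def smooth_l1(x):
    """Robust loss R: quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def encode_box(box, anchor):
    """Parameterize a box (a, b, w, h) relative to an anchor (a_x, b_x, w_x, h_x).

    (a, b) is the box center; w and h are its width and height.
    """
    a, b, w, h = box
    ax, bx, wx, hx = anchor
    return ((a - ax) / wx, (b - bx) / hx, math.log(w / wx), math.log(h / hx))


def regression_loss(pred_box, gt_box, anchor, is_positive):
    """p* . L_reg for one anchor: zero unless the anchor is positive."""
    if not is_positive:
        return 0.0
    z = encode_box(pred_box, anchor)
    z_star = encode_box(gt_box, anchor)
    return sum(smooth_l1(zi - zsi) for zi, zsi in zip(z, z_star))


anchor = (10.0, 10.0, 20.0, 20.0)
gt_box = (10.0, 10.0, 20.0, 20.0)
pred_box = (12.0, 10.0, 20.0, 20.0)   # center shifted 2 px to the right
print(regression_loss(pred_box, gt_box, anchor, is_positive=True))
print(regression_loss(pred_box, gt_box, anchor, is_positive=False))
```

Dividing the offsets by the anchor size makes the targets scale-independent, so a 2-pixel error on a 20-pixel-wide anchor costs the same as a 20-pixel error on a 200-pixel-wide one.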
The RPN outputs proposed regions of diverse sizes. A different-sized region corresponds to a different-sized CNN feature map, and it is difficult to develop an efficient method that can handle features of different sizes. Region of Interest (ROI) pooling simplifies the problem by reducing the feature maps to the same size. Unlike ordinary max pooling, ROI pooling divides the input feature map into a predefined number of roughly equal regions and applies max pooling to each region. Thus, the result obtained from ROI pooling always has the same fixed size, denoted k.
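A minimal single-channel sketch of ROI max pooling is given below. The bin boundaries (floor for the start, ceiling for the end) are one common convention and are an assumption here, as are the function name and the toy feature map.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool a region of a 2-D feature map to a fixed out_h x out_w grid.

    feature_map: list of rows of numbers.
    roi: (row0, col0, row1, col1) with exclusive end indices.
    """
    r0, c0, r1, c1 = roi
    roi_h, roi_w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Floor for the bin start, ceiling for the bin end, so no cell is skipped.
        rs = r0 + (i * roi_h) // out_h
        re = r0 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            cs = c0 + (j * roi_w) // out_w
            ce = c0 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 feature map (hypothetical values).
fm = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
print(roi_max_pool(fm, (0, 0, 4, 6), 2, 2))  # 4 x 6 region
print(roi_max_pool(fm, (1, 1, 4, 4), 2, 2))  # 3 x 3 region
```

Both calls return a 2 × 2 grid although the regions differ in size, which is what allows the subsequent fully connected layers to accept proposals of any shape.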
In order to detect the abnormalities in pedestrian walkways, Fast R-CNN method is applied. The strategy of the working process learns the conv layers that have been distributed from RPN and Fast R-CNN. RPN as well as Fast R-CNN are trained autonomously. Hence, Conv. layers can be changed in diverse modules. So, there is a requirement for development which allows the Conv. layers to be shared between two networks. This is executed by replacing the learning model that has been carried out in two distinct networks. It is very complex to describe the individual network of RPN and Fast R-CNN that undergoes optimization with the help of Back Propagation (BP) technique. The training process for Fast R-CNN depends on the predetermined object proposals.
A 4-step alternating training scheme is applied to learn the shared features. In the first step, the RPN is trained. In the second step, a separate detection network is trained by Fast R-CNN using the proposals generated by the RPN of the first step. At this point, the RPN and Fast R-CNN models do not yet share Conv. layers. In the third step, the detection network is used to initialize RPN training; the shared Conv. layers are kept fixed, and only the layers exclusive to the RPN are fine-tuned. In the fourth step, the shared Conv. layers remain fixed while the layers exclusive to Fast R-CNN are fine-tuned. Therefore, the RPN and Fast R-CNN share the same Conv. layers and form a unified network.

Inception V2 Model
Developers at Google established the Inception network for the ImageNet competition to address its classification and detection challenges. The method is built from a fundamental component called the 'Inception cell', which processes a set of Conv. layers at diverse scales in parallel and then concatenates their outputs. To save computation, 1 × 1 Conv. layers are applied to decrease the input channel depth. In every cell, a collection of 1 × 1, 3 × 3, and 5 × 5 filters is applied to extract features from the input at various scales. Max pooling is also employed, with 'same' padding to preserve the spatial dimensions, so that the outputs of all branches can be concatenated. The Inception network plays a significant role in the development of CNN classification models. Before the Inception network, well-known CNNs simply stacked Conv. layers to a greater depth to attain better performance.
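The saving obtained from the 1 × 1 reduction can be checked with simple arithmetic. The feature-map size and channel counts below are illustrative assumptions, not values reported in this paper.

```python
def conv_macs(height, width, in_ch, out_ch, kernel):
    """Multiply-accumulate count of a stride-1, 'same'-padded convolution."""
    return height * width * in_ch * out_ch * kernel * kernel


# Illustrative setting: 28 x 28 feature map, 192 input channels, 32 output channels.
H, W, C_IN, C_OUT = 28, 28, 192, 32
C_REDUCED = 16  # channels left after the 1 x 1 bottleneck (assumed)

# 5 x 5 convolution applied directly to all 192 channels.
direct = conv_macs(H, W, C_IN, C_OUT, 5)

# 1 x 1 reduction to 16 channels, followed by the 5 x 5 convolution.
reduced = conv_macs(H, W, C_IN, C_REDUCED, 1) + conv_macs(H, W, C_REDUCED, C_OUT, 5)

print(direct, reduced, round(direct / reduced, 2))
```

Under these assumptions the bottlenecked branch needs roughly a tenth of the multiply-accumulate operations of the direct 5 × 5 convolution, which is why every expensive branch of the cell is preceded by a 1 × 1 layer.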

OCR Engine: Tesseract
The pipeline of the Tesseract OCR engine is shown in Fig. 3. Initially, adaptive thresholding is applied to convert the image into a binary version using Otsu's method. Page layout analysis is the next step and is applied to extract the text blocks within the region. Then, the baseline of every text line is detected, and the text is divided into words using definite spaces as well as fuzzy spaces.
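Otsu's method chooses the threshold that maximizes the between-class variance of the gray-level histogram. The sketch below shows the global form of the method on a flat list of 8-bit pixels; it is not Tesseract's internal implementation, and the function names and sample pixels are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of gray levels in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2  # proportional to between-class variance
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to 1 and the rest to 0."""
    return [1 if p > threshold else 0 for p in pixels]


# Dark characters (about 10) on a bright plate (about 200), hypothetical values.
pixels = [10, 12, 11, 13, 200, 205, 198, 210]
t = otsu_threshold(pixels)
print(t, binarize(pixels, t))
```

Since the threshold is derived from the histogram of each image, no manual tuning is needed when the illumination of the plate changes.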

Figure 3: Processes involved in tesseract-based character recognition
In the next step, the character outlines are extracted from the words. Text recognition then proceeds as a 2-pass method. In the first pass, word recognition is carried out with a static classifier. Every word that is recognized satisfactorily is passed to an adaptive classifier as training data. A second pass is then run over the page with the newly trained adaptive classifier, in which the words that were not recognized well enough in the first pass are examined again.
A series of processes involved in the implementation of the presented method is summarized herewith.
• Initially, a collection of training images is provided.
• In the next stage, a set of data points is obtained from the available annotated images. Next, the data points are converted to a .csv file.
• Then, the records are generated in TensorFlow. In the next stage, a training model is created using Faster R-CNN with the Inception V2 method.
• Upon completion of the training, new input images are provided to the system.
• When new input images are provided, the Faster R-CNN model first detects the number plate.
• Then, the text is recognized with the help of PyTesseract.
• Finally, the text in the vehicle number plate is identified.
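The annotation-to-.csv step listed above can be sketched as follows. The paper does not state the annotation format; the example assumes Pascal VOC-style XML (as written by common labeling tools), and the file name, class name, and column layout are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layout of the .csv file.
FIELDS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]

# Hypothetical annotation of one image with one number plate.
SAMPLE_XML = """
<annotation>
  <filename>car_001.jpg</filename>
  <size><width>640</width><height>480</height></size>
  <object>
    <name>number_plate</name>
    <bndbox><xmin>120</xmin><ymin>300</ymin><xmax>360</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>
"""


def voc_xml_to_rows(xml_text):
    """Turn one VOC-style annotation into a list of row dictionaries."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        rows.append({
            "filename": filename,
            "width": width,
            "height": height,
            "class": obj.findtext("name"),
            "xmin": int(box.findtext("xmin")),
            "ymin": int(box.findtext("ymin")),
            "xmax": int(box.findtext("xmax")),
            "ymax": int(box.findtext("ymax")),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = voc_xml_to_rows(SAMPLE_XML)
print(rows_to_csv(rows))
```

Each row holds one bounding box, so an image with several plates yields several rows; such a table is a convenient intermediate form before the TensorFlow records are generated.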

Implementation Details
The proposed DL-VLPNR model was simulated on a PC with an 8th-generation i5 processor and 16 GB RAM. The DL-VLPNR model was programmed in Python with TensorFlow, Pillow, OpenCV, and PyTesseract. Fig. 4 shows the visualization outcomes of the proposed DL-VLPNR model. It is inferred that the DL-VLPNR model can clearly recognize the license plate number in all the images. Fig. 4 also shows that the presented model accurately recognized the number plate even when it was poorly captured. This illustrates the reliable operation of the developed system under different circumstances.
Fig. 5 shows the results of the analysis of the proposed model on the FZU Cars dataset under different runs [20,21]. From the obtained results, it is apparent that the proposed model achieves high recognition performance in terms of precision, recall, F1-score, and mAP. The proposed model attained a high average precision of 0.9780, recall of 0.9820, F1-score of 0.9760, and mAP of 0.9690.
Tab. 2 and Fig. 6 show the results offered by different VLPNR models on the FZU Cars dataset. The table values indicate that the ZF model produced the weakest detection outcomes, with the lowest precision, recall, F1-score, and mAP values of 0.916, 0.948, 0.932, and 0.908 respectively. The VGG16 model performed better than the ZF model and attained slightly higher precision, recall, F1-score, and mAP values of 0.925, 0.955, 0.940, and 0.912 respectively. Along with that, the ResNet 50 model achieved even higher detection outcomes, with precision, recall, F1-score, and mAP values of 0.916, 0.948, 0.932, and 0.908 respectively. In line with this, the ResNet 101 model produced acceptable recognition, with precision, recall, F1-score, and mAP values of 0.945, 0.958, 0.951, and 0.922 respectively.
Besides, the DA-Net136, DA-Net160, DA-Net168, and DA-Net200 approaches produced competitive and nearly similar recognition rates compared with the other methods. The DA-Net136 model showed a reasonable outcome, with precision, recall, F1-score, and mAP values of 0.961, 0.964, 0.962, and 0.942 respectively. Next, DA-Net160 offered slightly higher precision, recall, F1-score, and mAP values of 0.965, 0.966, 0.965, and 0.952 respectively. Even higher performance was achieved by DA-Net168, with precision, recall, F1-score, and mAP values of 0.966, 0.968, 0.967, and 0.955 respectively.
In line with this, the DA-Net200 model produced near-optimal results, with precision, recall, F1-score, and mAP values of 0.978, 0.982, 0.976, and 0.969 respectively. However, the proposed DL-VLPNR model accomplished the optimal performance, with precision, recall, F1-score, and mAP values of 0.978, 0.982, 0.976, and 0.969 respectively.
Fig. 7 shows the results of the detection analysis of the proposed model on the HumAIn2019 dataset under different runs. From the resultant values, it is evident that the proposed model exhibited an improved outcome in terms of precision, recall, F1-score, and mAP. The proposed model accomplished a high average precision of 0.9780, recall of 0.9820, F1-score of 0.9760, and mAP of 0.9740.
Fig. 8 shows the recognition performance achieved by diverse VLPNR models on the HumAIn2019 dataset. The table values denote that the ZF model produced the worst detection outcomes, with the lowest precision, recall, F1-score, and mAP values of 0.863, 0.873, 0.864, and 0.869 respectively. The VGG16 model is superior to the ZF model, as it achieved slightly higher precision, recall, F1-score, and mAP values of 0.869, 0.889, 0.874, and 0.876 respectively. The ResNet 50 model produced higher detection performance, with precision, recall, F1-score, and mAP values of 0.871, 0.892, 0.887, and 0.892 respectively. The ResNet 101 model produced satisfactory recognition, with precision, recall, F1-score, and mAP values of 0.913, 0.923, 0.913, and 0.925 respectively. Besides, the DA-Net136, DA-Net160, DA-Net168, and DA-Net200 methodologies accomplished satisfactory results. The DA-Net136 model exhibited reasonable outcomes, with precision, recall, F1-score, and mAP values of 0.923, 0.931, 0.926, and 0.935 respectively.
After that, DA-Net160 offered slightly higher precision, recall, F1-score, and mAP values of 0.936, 0.942, 0.937, and 0.938 respectively. Even higher recall, F1-score, and mAP were produced by DA-Net168, with precision, recall, F1-score, and mAP values of 0.932, 0.948, 0.941, and 0.942 respectively. The DA-Net200 model produced near-optimal results, with precision, recall, F1-score, and mAP values of 0.945, 0.957, 0.949, and 0.953 respectively. However, the proposed DL-VLPNR model accomplished superior results over the earlier models, with precision, recall, F1-score, and mAP values of 0.978, 0.982, 0.976, and 0.974 respectively.
Tab. 5 presents the overall accuracy of the DL-VLPNR model and of existing techniques on the applied dataset. It can be inferred that the presented DL-VLPNR technique leads in recognition performance, as it produced the highest accuracy of 0.986. The VGG16 and ResNet 50 models produced near-identical and competitive outcomes, with accuracy values of 0.971 and 0.976 respectively. Along with that, the VGG_CNN_M_1024 approach offered a somewhat lower accuracy of 0.967, whereas the lowest accuracy was achieved by the ZF and ResNet 101 methods, i.e., 0.942 and 0.943 respectively. Overall, the proposed DL-VLPNR methodology recognized the applied images more effectively than the other methods.
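For reference, precision, recall, and F1-score follow from the counts of true positives, false positives, and false negatives, as in the sketch below. The counts are hypothetical and are not taken from the experiments reported here.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1-score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical run: 90 plates found correctly, 10 false alarms, 30 plates missed.
precision, recall, f1 = detection_metrics(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The F1-score is the harmonic mean of precision and recall, so it lies between the two and is pulled toward the lower value.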

Conclusion
The current research article proposed an effective DL-VLPNR model to identify and recognize the license plate characters of a vehicle. The proposed method utilizes Faster R-CNN with the Inception V2 model to detect the license plate in a vehicle image. Afterward, the characters in the detected number plate are extracted by the Tesseract OCR model. The performance of the DL-VLPNR model was validated using two benchmark databases, namely the FZU Cars and HumAIn2019 datasets. The results were analyzed in terms of different measures such as precision, recall, F1-score, accuracy, and mAP. The experimental results indicate that the DL-VLPNR model achieves the best detection and recognition performance among the compared methods, as it attained the highest accuracy of 0.986. The proposed DL-VLPNR model can therefore be employed as an appropriate tool for VLPNR. In future, the proposed model can be deployed with real-time traffic surveillance cameras in smart cities to rapidly identify vehicles that cross traffic signals.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.