An Automatic Deep Neural Network Model for Fingerprint Classification

The accuracy of a fingerprint recognition model is extremely important because of its use in forensic and security fields. Each fingerprint recognition system is tied to a particular network architecture, even though other architectures may achieve higher accuracy. To solve this problem in a unified model, this paper proposes a model that can automatically specify itself; we therefore call it an automatic deep neural network (ADNN). Our algorithm specifies the appropriate architecture of the neural network and some of its significant parameters: the number of filters, epochs, and iterations. It pursues the highest accuracy by updating itself until it reaches 99% accuracy; it then stops and outputs the result. Moreover, this paper proposes an end-to-end methodology, based on a residual convolutional neural network, for recognizing a person’s identity from an input fingerprint image. The system is complete and fully automated in both the feature extraction stage and the classification stage. Our goal is to automate fingerprint recognition because the more automatic the system is, the more time and effort it saves. Our model also lets users interact by supplying initial values for these parameters; the model then updates itself until it finds the parameter values that achieve the best accuracy. Further advantages of our algorithm are that it can recognize people from their thumbs and other fingers and that it can recognize distorted samples. Our algorithm achieved 99.75% accuracy on the public fingerprint dataset (SOCOFing), the highest accuracy among the models compared.


Introduction
Fingerprint recognition is one of the most extensively studied fields in biometrics. It now has many applications, such as airport security, law enforcement, mobile access, and authentication. Passwords and ID cards have long been used to identify individuals and control access. However, with advances in technology and the widespread presence of hackers, personal identification numbers (PINs) no longer guarantee security. Moreover, such credentials are easy to steal, lose, or forget. Hence, using a biometric trait such as a fingerprint is safer. A fingerprint is unique: no two people, including identical twins, have the same fingerprint, and it remains permanent and stable throughout a person's life. Therefore, we propose a system for fingerprint recognition. Almost ten years ago, automatic fingerprint identification systems (AFIS) were used in criminal cases to identify people from fingerprints stored in a large database. For this application and the others mentioned earlier, performance remains a challenge; for example, the detection accuracy for smaller objects is still much lower than that for larger objects [1].
Several fingerprint recognition methods exist, and minutiae matching is the most commonly used. Minutiae are the bifurcations and endings of the ridges that form fingerprint images (see Fig. 1). These features are not unique to one person; everyone has them. The difference lies in their distribution: the precise position and direction of the ridge endings and bifurcations can distinguish one person from another.
Many techniques can identify fingerprints, including fuzzy logic (FL), neural networks (NN), and genetic algorithms (GA). These algorithms fall under artificial intelligence (AI), as listed in Tab. 1. There are also non-AI methods for fingerprint recognition, such as statistical techniques [2]. The two kinds of methods (AI and statistical) can be combined to classify fingerprints efficiently. Recently, deep neural networks (DNNs) have proven effective in many tasks; hence, a DNN is used in this study for fingerprint identification.
Several papers have been written on the minutiae extraction stage because of its importance. Here, we discuss the overall network, not just this stage, and explain how it can be successfully automated. The architecture of our network has two main stages. The first stage is feature extraction, which mainly depends on convolutional layers. The second stage is a classifier, which recognizes or classifies the fingerprint image using the features extracted in the first stage. These two stages are illustrated in detail in Section 4. The contributions of our paper are as follows. Theoretically, we discuss the principle of network depth, relating it to the features and size of the input images, and the principle of the residual network. Practically, we propose an automatic system based on a residual convolutional network that performs efficiently for fingerprint recognition. Hence, we can control the overall network (parameters and depth), achieving our goal of the best accuracy.
Section 2 discusses some recent studies on DNN-based fingerprint recognition algorithms. Section 3 discusses the principles of this network based on previous studies. Section 4 introduces our system in detail, including its overall architecture and how it addresses the disadvantages of previous studies. Section 5 presents our experimental results and comparisons, supported by charts. Finally, Section 6 concludes the study.

Related Work
Fattahi et al. [6] presented the architecture of a model that recognizes damaged fingerprints using Convolutional Short-Term Memory Networks for forensics. Khan et al. [7] proposed a model that classifies rolled, plain, and latent fingerprint samples using a convolutional neural network (CNN) to facilitate fingerprint matching. Tertychnyi et al. [8] used a VGG16-based deep network to classify low-quality fingerprint images resulting from dryness, wetness, physical damage, dots, and blurriness. Peralta et al. [9] proposed a CNN-based approach to fingerprint classification that also handles low-quality fingerprints. Zhang et al. [10] used residual convolutional networks to distinguish live fingerprints from fake ones as protection against spoof attacks. Nahar et al. [11] recognized people from their fingerprint images using ResNet50, a deep network with high accuracy. Our study also uses residual networks; however, our network achieves higher performance in less time, and it has the advantage of using automatic techniques. Many capable algorithms for object detection and image classification have also emerged. Li et al. [12] proposed a Hierarchical Convolutional Neural Network (HCNN) for image classification; HCNNs consist of multiple subnetworks that classify images progressively. Chang et al. [13] presented an improved version of the deep-learning network You Only Look Once, version 3 (YOLOv3). Kumar et al. [14] proposed an automatic system that detects face masks using deep learning and image processing algorithms. Cao et al. [15] presented an improved network-inspired NN to solve object rotation problems.
For minutiae extraction, Nguyen et al. [16] proposed a universal minutiae extractor based on a modified U-shaped network for segmentation. Zhou et al. [17] proposed a two-stage network: in the first stage, a network produces initial candidate patches of minutiae; in the second stage, another network extracts the direction and precise location of the minutia in each patch. Although feature extraction is essential in recognition systems, it is still an intermediate stage, whereas our paper presents a complete fingerprint recognition system.
Table 1: AI techniques for fingerprint recognition
Reference | Technique
Iancu et al. [3] | Fuzzy logic
Sagayam et al. [4] | Neural networks
Tan et al. [5] | Genetic algorithms
Shehu et al. [18] proposed a deep CNN that classifies three types of alterations in the fingerprint image: Z-cut, obliteration, and central rotation. That study used the SOCOFing dataset, as ours does. Classifying alterations is important, but classifying people is more important and useful in real life. Therefore, we adapted the dataset so that the system recognizes individuals from their fingerprints. This is discussed in detail in the next section.
None of the previous studies suits all kinds of datasets, because each relies on a single fixed network architecture. Our model can adapt because it changes its network according to the accuracy it obtains. Moreover, the previous studies do not achieve the best performance, which motivated the design of the proposed methodology.

Data Collection and System Implementation
We used the SOCOFing dataset [19] for training and testing to demonstrate the efficiency of our system. We divided the dataset into three sets (training, validation, and testing) in the ratio 7:2:1. The original dataset is organized into Z-cut, obliteration (Obl), and central rotation (CR) alterations, plus real fingerprints; Fig. 2 shows some samples. When evaluated on classifying these four categories, our model yielded a test accuracy of 0.965 and a test loss of 0.098. However, because real-life applications need to classify people, we reorganized the dataset to classify by person rather than by category and split it using the split-folders library. This was easy and automatic, saving the time and effort of splitting the data manually. We arranged the data files by individual so that each person's folder contained his or her samples. All the preprocessing steps were likewise automatic, without manual intervention.
We resized all the images in the dataset to 100 × 100 pixels; we then padded all the images to 106 × 106 pixels. This was the only preprocessing step to standardize the dimensions of all the images.
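The padding step can be sketched in plain Python as follows. The text does not say how the padding is distributed or which fill value is used; this sketch assumes symmetric zero padding of 3 pixels per side (100 + 2 × 3 = 106), and `pad_image` is a hypothetical helper rather than the authors' code.

```python
def pad_image(image, pad=3, fill=0):
    """Symmetrically pad a 2-D image (list of rows) with a constant value.

    Assumption: zero padding, 3 pixels per side, so 100 x 100 -> 106 x 106.
    """
    width = len(image[0])
    blank_row = [fill] * (width + 2 * pad)
    padded = [blank_row[:] for _ in range(pad)]          # top border
    for row in image:
        padded.append([fill] * pad + list(row) + [fill] * pad)
    padded.extend(blank_row[:] for _ in range(pad))      # bottom border
    return padded


resized = [[255] * 100 for _ in range(100)]   # stand-in for a resized image
padded = pad_image(resized)
print(len(padded), len(padded[0]))
```

With these assumed settings the padded image is 106 rows by 106 columns, matching the dimensions stated above.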
Our dataset consisted of fingerprint images of 600 subjects. Every subject provided one sample of each of his or her ten fingers, giving a total of 6000 fingerprint samples. The last three categories in Fig. 2 represent the different types of alterations. Including them enables our model to recognize distorted samples and match them with their real subject. This matters in practice because real-life fingerprints often change and become distorted due to many factors. Each of the three alterations contributes around 6000 samples, resulting in 17,934 altered images under the simplest parameter setting; there are three settings: easy, medium, and hard. After dividing the dataset by subject, we randomly selected 100 subjects to generate a set of 10 × 100 × 4 = 4000 images (ten fingers, 100 subjects, four categories), which was divided into training, validation, and testing sets, as shown in Tab. 2.
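The subset size and the effect of a 7:2:1 split can be checked with a few lines of arithmetic. The per-split counts below are simply what the stated ratio implies for 4000 images; they are an assumption and may differ from the exact figures in Tab. 2. `split_counts` is a hypothetical helper.

```python
def split_counts(total, ratios=(7, 2, 1)):
    """Divide `total` items by integer ratios; any remainder goes to the first split."""
    unit = sum(ratios)
    counts = [total * r // unit for r in ratios]
    counts[0] += total - sum(counts)
    return counts


fingers_per_subject = 10
subjects = 100
categories = 4   # real, Z-cut, obliteration, central rotation

subset_size = fingers_per_subject * subjects * categories
train, val, test = split_counts(subset_size)
print(subset_size, train, val, test)
```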
We used Python 3.8, the Anaconda environment, and Jupyter Notebook for the implementation, with the Keras and TensorFlow libraries. We first trained and tested our model on the Kaggle website, but we switched to the Anaconda environment because of Kaggle's limited memory and run time. Our network had two tasks to perform: extracting features and classifying the input image. Feature extraction was performed automatically by the conv layers. For the classification task, we used the "sparse_categorical_crossentropy" loss and the "Adam" optimizer.

Preliminaries
As Fig. 3 shows, feature extraction is one of the most important stages in a recognition or classification system. This stage can be implemented manually or automatically. The classical method uses histogram of oriented gradients (HOG) features, which involve many steps: preprocessing, calculating gradients, calculating a histogram of gradients, block normalization, and creating the HOG descriptor vectors. A classifier such as a support vector machine (SVM) or an NN is then applied. Our system uses a CNN because it extracts the features of input images automatically, without human intervention, which saves time and effort. The feature map is a representation of the resulting image features. Deep learning is a black-box technique, but the underlying concept is that the network's early layers detect low-level features [20] (such as colors and edges), and the later layers detect high-level features (such as shapes and objects). Adding more layers to a deep network is important for extracting more complex features. As an example of such features, a person is composed of a head, hands, and legs; the head is composed of eyes, a nose, and a mouth; and the eyes are composed of edges and circles. The network must be deep enough to learn these features at different levels (see Fig. 4). Adding more conv layers is also useful for learning features from large-scale images, where it decreases error and increases accuracy.
The image of a fingerprint is composed of dark lines. The important features of these lines are minutiae, which are low-level features. Furthermore, the size of fingerprint images is often small compared to natural ones (images of persons, animals, etc.). We can use shallow networks to extract minutiae for fingerprint recognition. Therefore, we can specify our network's depth based on the complexity of the features and the size of the dataset images.
Our network is a residual CNN inspired by ResNet50 [11], with the number of conv layers reduced because the minutiae features it must learn are low-level.
Despite the importance of adding layers to learn more features, there is a maximum threshold for depth in the CNN model [21]. The plot in Fig. 5 shows that the training error of the 20-layer network is smaller than that of the 56-layer network.
The failure of the 56-layer CNN could be blamed on the well-known vanishing/exploding gradient problem [22]. Before the residual technique, very deep nets (100-1000 layers) could not converge during training. Instead of simply adding more layers, we used residual convolutional networks, which alleviate the degradation problem and the difficulty of training very deep networks [21]. Our fingerprint classification network therefore did not have to be very deep, and our automatic system ensures this. This system is discussed in the next section.

The Automatic System
For most networks, all the parameters are fixed at constant values; for example, ResNet50 uses 64 filters in its first stage. Why not start with fewer or more filters? Similarly, for the number of layers, why choose one specified architecture when another achieves better results? In this research, we resolve this question with an automatic system. The network is created automatically, without human intervention; for our data, the system settles on starting with 32 filters and on the number of layers that achieves the best accuracy.
After simple preprocessing, as shown in Fig. 6, the initial number of filters is set to two by default, or the user can input this number. The system then updates itself until it finds the best architecture with the best number of filters. In the first stage, our ResNet uses 7 × 7 convolutions with stride 2. The convolution operation is defined in terms of the following quantities: I is the input image, with width W and height H; $n_c$ denotes the number of image channels; K is the kernel matrix, with dimensions $k_1 \times k_2$; b is the bias value for each kernel K; and the output is indexed by x = 0,…,H and y = 0,…,W.
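For reference, a standard form of the discrete convolution that is consistent with the symbols defined above is sketched below. The index convention (kernel offsets starting at 1) is an assumption and not necessarily the authors' exact notation.

```latex
\mathrm{conv}(I, K)_{x,y} \;=\; \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \sum_{c=1}^{n_c} K_{i,j,c}\, I_{x+i-1,\; y+j-1,\; c} \;+\; b
```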
The dimensions of the output matrix can be calculated from the following equations:
$$n_H = \left\lfloor \frac{n_{H_{prev}} + 2p - f}{s} \right\rfloor + 1, \qquad n_W = \left\lfloor \frac{n_{W_{prev}} + 2p - f}{s} \right\rfloor + 1$$
Here, $n_H$ and $n_W$ denote the height and width of the output convolved image, and $n_{H_{prev}}$ and $n_{W_{prev}}$ denote the image height and width of the previous layer, respectively. f is the filter size (the same for width and height, as the filter is square), p denotes the padding value, and s denotes the stride.
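As a worked example, the usual output-size rule, floor((n_prev + 2p − f) / s) + 1, can be evaluated for the first layers of the network. The walk-through assumes that the 7 × 7 convolution is applied to the already padded 106 × 106 image with no further padding, which the text does not state explicitly; `output_size` is a hypothetical helper.

```python
def output_size(n_prev, f, p=0, s=1):
    """Spatial size after a conv or pooling layer: floor((n_prev + 2p - f) / s) + 1."""
    return (n_prev + 2 * p - f) // s + 1


# Assumed walk-through: 106 x 106 input, 7 x 7 conv with stride 2 and no
# extra padding, followed by 3 x 3 max pooling with stride 2.
after_conv = output_size(106, f=7, p=0, s=2)
after_pool = output_size(after_conv, f=3, p=0, s=2)
print(after_conv, after_pool)
```

Under these assumptions the feature map shrinks from 106 to 50 and then to 24 pixels per side. The same function covers pooling layers by setting p = 0.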
The main goal of the convolutional operation is to extract features and produce a feature map much smaller than the input fingerprint image.
Then a max-pooling layer is applied with a (3, 3) window and a stride of 2. The following equations give the dimensions of the image after a pooling layer, where f is now the pooling window size:
$$n_H = \left\lfloor \frac{n_{H_{prev}} - f}{s} \right\rfloor + 1, \qquad n_W = \left\lfloor \frac{n_{W_{prev}} - f}{s} \right\rfloor + 1$$
These two equations apply to both average pooling and max pooling. After that, downsampling is performed by the convolution layers and by one average pooling layer at the end. The flattened output, of size 100, is fed directly into a dense (softmax) layer. In this way, the network assigns the fingerprint image to the class to which it belongs. In addition to the initial number of filters, the automatic system also allows the user to input the epochs and iterations. Therefore, our model can control and specify these parameters (number of layers, number of filters, epochs, and iterations). In Fig. 6, M and T represent the maximum initial number of filters and the initial number of filters, respectively. We set M = 64; if T >= 64, the system updates the NN architecture.
Figure 6: Detailed structure of the automatic system. Numbers 2, 4, and 6 represent looping in the same architecture; numbers 3 and 5 represent updating to another architecture; z denotes printing results; and T, S, and M denote the initial number of filters, the number of filters in the conv layer, and the maximum initial number of filters, respectively.
Notice that the number of filters in our system is $2^n$: n = 1 gives 2 filters, n = 2 gives 4 filters, n = 3 gives 8 filters, and so on. We chose powers of two because our experiments showed that they achieve the best accuracy; therefore, the system updates the number of filters by doubling it. The stopping threshold is based on test accuracy.
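The update loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `auto_search` and `evaluate` are hypothetical names, `evaluate(blocks, filters)` stands in for building, training, and testing one network, and restarting the filter count at its initial value after each architecture update is an assumption about Fig. 6.

```python
def auto_search(evaluate, initial_filters=2, max_filters=64,
                target=0.99, max_blocks=5):
    """Double the initial filter count, then grow the architecture, until
    `evaluate(blocks, filters)` reaches the target test accuracy."""
    best = (0.0, None, None)                 # (accuracy, blocks, filters)
    for blocks in range(1, max_blocks + 1):  # update to another architecture
        filters = initial_filters            # T, assumed to restart each time
        while filters < max_filters:         # T >= M triggers the update
            accuracy = evaluate(blocks, filters)
            if accuracy > best[0]:
                best = (accuracy, blocks, filters)
            if accuracy >= target:
                return best                  # condition met: stop and report
            filters *= 2                     # 2, 4, 8, ... (powers of two)
    return best                              # target never reached


# Toy stand-in for training; the numbers are made up for illustration only.
def fake_evaluate(blocks, filters):
    return min(1.0, 0.5 + 0.1 * blocks + 0.006 * filters)


print(auto_search(fake_evaluate))
```

Passing the evaluation step in as a function keeps the search logic independent of the training framework; in practice it would wrap the Keras model construction and a `fit`/`evaluate` cycle.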
Although our system stops at block 3, where it achieves 99% test accuracy as shown in Tab. 4, the structure in Fig. 6 continues beyond this point so that the automatic system is applicable to any goal and dataset. The final network with the best parameters is presented in the Experiments and Results section. Block 2, block 3, and any later block each consist of an identity block and a conv block, which are discussed in detail in the next section.

Residual Neural Network
We designed the following model to address the drawbacks of the studies discussed in Section 2. Unlike a plain network, which is a sequence of traditional conv layers followed by fully connected layers, a residual network has shortcut connections that add the output of an earlier layer to the output of a later convolutional layer through an element-wise add operation. This operation is performed on two corresponding feature maps, channel by channel. The resulting feature map is then fed to the activation function, as shown in Fig. 7.
We used two kinds of residual blocks in our network, identity blocks and convolutional blocks, as shown in Fig. 8. Each has two conv layers, and each conv layer is followed by a batch normalization layer [23] and then the activation function. We chose the rectified linear unit (ReLU) activation function [24].
The difference between the two blocks is the shortcut connection. The skip connection of the identity block does not contain any parameters; it simply adds the two feature maps. In the conv block, by contrast, convolutional and batch normalization operations are applied on the shortcut before the adding step (see Fig. 8). The identity block is expressed in terms of the following symbols: r represents the ReLU function, z represents the convolutional and batch normalization processes, o denotes the output of a layer, and l is the index of that layer.
In the conv block, the additional convolutional and batch normalization processes on the shortcut are likewise represented by a function z. The dimensions of the two feature maps must be the same to perform the element-wise add operation. Hence, in the identity block, we use "same" padding rather than "valid" to preserve the dimensions, with a stride of 1.
In the conv block, the stride of the first conv layer is 2, which shrinks the spatial resolution of the feature map. The stride of the next conv layer in this block is 1, and the padding is "same." In the shortcut connection, we apply a conv layer with stride 2 so that the two feature maps have the same dimensions. This step is responsible for reducing the dimensions of the feature map.
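One way to write the two blocks that is consistent with the symbols above is sketched here. The layer indexing, with $o^{(l-1)}$ as the block input and $o^{(l)}$ as its output, and the name $z_s$ for the shortcut's convolution and batch normalization are assumptions, not necessarily the authors' notation.

```latex
\begin{align*}
\text{identity block:}\quad o^{(l)} &= r\!\left(z\!\left(o^{(l-1)}\right) + o^{(l-1)}\right) \\
\text{conv block:}\quad o^{(l)} &= r\!\left(z\!\left(o^{(l-1)}\right) + z_s\!\left(o^{(l-1)}\right)\right)
\end{align*}
```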
For the classification loss, we use sparse categorical cross-entropy, where w refers to the model parameters, i is the sample index, $y_i$ is the true label for the ith sample, $\hat{y}_i$ is the predicted label for the ith input, and N is the sample size.
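A minimal sketch of sparse categorical cross-entropy in plain Python follows. It assumes the usual definition, the mean negative log of the probability the model assigns to each sample's true class; `probs` and `labels` are hypothetical names, and in practice the built-in Keras loss would be used rather than this helper.

```python
import math


def sparse_categorical_cross_entropy(probs, labels, eps=1e-12):
    """Mean of -log(p[true class]) over N samples.

    probs:  list of N probability vectors (softmax outputs)
    labels: list of N integer class indices ("sparse" labels, not one-hot)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(max(p[y], eps))   # eps guards against log(0)
    return total / len(labels)


probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
print(round(sparse_categorical_cross_entropy(probs, labels), 4))
```

The labels are plain integers (one subject index per sample), which is what distinguishes the sparse variant from the one-hot categorical loss.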
We trained the network with the adaptive moment estimation (Adam) optimizer [25], using a learning rate of 0.001 and exponential decay rates of 0.9 and 0.999 for the first and second moment estimates, respectively.
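For illustration, a single Adam update with the stated hyperparameters can be written as below. This is a scalar sketch of the standard Adam rule; the epsilon value of 1e-7 is an assumption (the text does not state it), and `adam_step` is a hypothetical helper.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimise f(theta) = theta^2 (gradient 2 * theta) for a few steps.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)
```

Because of the bias correction, each early step moves the parameter by roughly the learning rate regardless of the gradient's scale.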

Experiments and Results
This section is divided into three subsections. First, we describe the experiments we implemented. Second, we present the final and best network obtained by applying the automatic system. Finally, we present the model results.
First, we implemented the ResNet50 model. It had good accuracy, but it was not fast and required many epochs, unlike our model, which achieved better accuracy in earlier epochs, consistent with the depth principle discussed in the Preliminaries.
Then, the LeNet network [26] was implemented with the ReLU activation function as an example of a simple, non-residual CNN. Its accuracy was high but still lower than that of our model, which supports our choice of a residual CNN for our data and goal. The original LeNet architecture used the Tanh activation function; we used ReLU instead for better accuracy. Next, the MobileNet model [27] was implemented as an example of a lightweight DNN. This model uses depthwise separable convolutions [28-43].
Figure 8: The upper figure shows the identity block and the lower figure the conv block, where conv denotes the convolution layer, batch denotes the batch normalization layer, and X is the input of the first layer.
Tab. 3 compares the three algorithms above with our algorithm; all four used the same SOCOFing dataset. The table shows that our algorithm has the highest test accuracy, F1_score, and Recall and the lowest test loss. Furthermore, our algorithm has the fewest parameters, reflecting its low complexity compared with the other algorithms. Fig. 9 shows the comparison chart for the four algorithms.
We also tried the VGG-16 model [23], but it failed: its implementation could not be completed because the small fingerprint images do not suit very deep networks like VGG. As mentioned earlier, such very deep networks are more suitable for large-scale images.
From these experiments, we could show that our model has the following advantages. The system is completely automatic and searches for the configuration with the best accuracy. It is fast and works on the dataset without complex preprocessing stages. It is easy to implement, and the network is not very deep. It can recognize a person from any of his or her ten fingers, not just the thumb. Furthermore, it can recognize distorted samples.
Fig. 10 shows the final network architecture reached after applying the automatic system. In this figure, all the convolutional layers in our net have a 3 × 3 kernel except the first conv layer, which has a 7 × 7 kernel, 32 filters, and a stride of 2. The best initial number of filters is 32. The window size and stride are both (2, 2) for the average pooling layer. A dotted arrow represents a conv block, whereas an ordinary arrow represents an identity block. In this step-by-step figure, every conv layer is followed by a batch normalization layer and uses a ReLU activation function. Our goal is to have our model automatically specify these parameters based on accuracy.
Tab. 4 compares the test accuracy (TA) of block1, block1 + block2, and block1 + block2 + block3 at 5 epochs for different initial numbers of filters, whereas Tab. 5 compares the corresponding test loss (TL) values.

Figure 10: Step-by-step architecture of the best network, where conv denotes a convolutional layer, max pool the max-pooling layer, Avg pool the average pooling layer, and Fc a fully connected layer.
Notice that in each architecture, the test accuracy increases gradually up to a specific point, after which it starts to decrease. With three blocks, however, the 99% condition is met and the system stops.
The two previous tables show the most successful architecture for our goal and dataset, which contains 3 blocks and begins with 32 filters.
We used four performance metrics for evaluation: accuracy, loss, f1_score, and recall. They show how successfully our model classified the fingerprint samples in our datasets. Tab. 6 shows the results of our model when applied to the dataset in the easy, medium, and hard settings. In Fig. 11, the network's accuracy and loss curves for the training and testing stages make it clear that there is no overfitting problem.
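The classification metrics can be computed as sketched below. The text does not state how recall and F1 are averaged over the subject classes; macro averaging is assumed here, and `macro_metrics` is a hypothetical helper (a library such as scikit-learn would normally be used instead).

```python
def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro recall, macro F1) for integer class labels."""
    classes = sorted(set(y_true))
    recalls, f1s = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, sum(recalls) / len(classes), sum(f1s) / len(classes)


# Tiny made-up example with three classes, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(macro_metrics(y_true, y_pred))
```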

Conclusion
We proposed a new automatic system for fingerprint recognition: a residual convolutional neural network that builds itself and achieves the highest performance among the models compared. Our algorithm is substantially more efficient and faster than the other algorithms. Another useful contribution of this paper is the principle that a network's depth should match the complexity of the features and the size of the images. We came to this conclusion after analyzing the properties of fingerprint images and previous CNN-based fingerprint algorithms. Future work on the proposed fingerprint recognition algorithm will include (1) attempting to automate the entire network by controlling all model parameters; (2) applying our algorithm to larger datasets; and (3) extending the algorithm to recognize other fingerprint types, such as partial and latent fingerprints.