Soft Tissue Feature Tracking Based on Deep Matching Network

Research in the field of medical image is an important part of the medical robot to operate human organs. A medical robot is the intersection of multi-disciplinary research fields, in which medical image is an important direction and has achieved fruitful results. In this paper, a method of soft tissue surface feature tracking based on a depth matching network is proposed. This method is described based on the triangular matching algorithm. First, we construct a self-made sample set for training the depth matching network from the first N frames of speckle matching data obtained by the triangle matching algorithm. The depth matching network is pre-trained on the ORL face data set and then trained on the self-made training set. After the training, the speckle matching is carried out in the subsequent frames to obtain the speckle matching matrix between the subsequent frames and the first frame. From this matrix, the inter-frame feature matching results can be obtained. In this way, the inter-frame speckle tracking is completed. On this basis, the results of this method are compared with the matching results based on the convolutional neural network. The experimental results show that the proposed method has higher matching accuracy. In particular, the accuracy of the MNIST handwritten data set has reached more than 90%.


Introduction
In recent years, surgical robots have begun to be frequently used in minimally invasive surgery to reduce the pain of patients, reduce the work intensity of the surgeon, improve the accuracy of surgical operations, and reduce the difficulty of surgical operations [1][2][3][4]. This operation is mainly used for disease monitoring and treatment of various parts of the human body through an endoscope, which enters the human body through a small channel (a natural channel or a channel confirmed by a doctor). Compared with traditional surgery, the position perception of intraoperative equipment and soft tissue surface requires high accuracy, and because the intraoperative field of view is relatively narrow, it causes a lot of difficulties [5]. Therefore, many computer-assisted techniques have been proposed to assist the operation process [6,7], and many advanced robot-assisted surgical techniques have extremely high requirements for the tracking of the soft tissue surface characteristics of the surgical organs, such as abnormal brain detection method for magnetic resonance image and detecting tuberculosis from chest CT images [8,9]. Tracking research on the surface of soft tissue is conducive to the use of surgical robots' high-precision and high-flexibility characteristics. It can perform precise surgical operations in different organs and tissues of the human body [10][11][12]. It is conducive to the recovery and reconstruction of surgical organs and tissues, greatly reduces the danger caused by the shaking of the body during the operation of the surgeon, greatly enhances the doctor's confidence and reduces the surgeon's fatigue, and enhances the safety and effectiveness of the operation. In addition, the tracking of soft tissue surface features of endoscopic image sequences has very important applications in postoperative surgical effect analysis, surgical training and teaching, and virtual reality soft tissue 3D modeling [11,13].
The tracking problem in the medical field is a hot issue [14], and most of the technical routes adopted are based on the feature as the object to launch the tracking. However, problems such as low matching accuracy and slow speed of feature points in endoscopic images remain.
Recent research has revealed that image-based methods can enhance accuracy and safety in laser microsurgery. Schoob et al. proposed a non-rigid tracking using surgical stereo imaging [15]. A recently developed motion estimation framework based on piecewise affine deformation modeling is extended by a mesh refinement step and considers texture information. This compensates for tracking inaccuracies potentially caused by inconsistent feature matches or drift. To facilitate the online application of the method, the computational load is reduced by concurrent processing and affineinvariant fusion of the tracking and refinement results. The residual latency-dependent tracking error is further minimized by Kalman filter-based upsampling, considering a motion model in disparity space.
The surface feature of the soft tissue image is used as the tracking object to realize the tracking of the surface of the soft tissue [16]. The key step of the soft tissue surface feature tracking process is feature matching. The feature matching method is also applied to feature matching in different views of the same frame. At a certain moment, the coordinates of a three-dimensional space point are mapped to two different perspective images in the left and right views, but they are actually the same space point. Similarly, feature matching is performed on the points of the left and right views to obtain the parallax under different viewing angles. In addition, the internal and external parameters of the camera that shoots the left and right views and the focal length are added to obtain the threedimensional coordinates of the space points.
Robotic automation in surgery requires the precise tracking of surgical tools and mapping of deformable tissue. Previous works on surgical perception frameworks require significant effort in developing features for surgical tools and tissue tracking. In this work, Lu et al. [17] overcame the challenge by exploiting deep learning methods for surgical perception. They integrated deep neural networks, capable of efficient feature extraction, into the tissue tracking and surgical tool tracking processes. By leveraging transfer learning, the deep-learning-based approach requires minimal training data and reduced feature engineering efforts to fully perceive a surgical scene.
Verdie et al. [18] proposed a learning-based time invariant feature detector (TILDE), which can reliably detect key points in the case of severe changes in external conditions such as illumination. An effective method is proposed to generate the training set of training regression. This method learns three regressors, and the segmented regressor shows the best effect. The author evaluates the regressor on the new outdoor benchmark data set, which shows that the performance of the regressor proposed by the author on the benchmark data set is obviously better than the most excellent algorithm at that time. Savinov et al. proposed Quartnetworks [19]. They first proposed to learn the feature detector from scratch, train a neural network to rank the key points, and then find the key points from the top/bottom bits of the ranking. The workflow of the whole method is to extract random block pairs from two images. Each image block obtains a response through the neural network, then calculates the loss through the sorting consistency function of quadruple and optimizes it by gradient descent method. The algorithm based on data learning can not only learn the feature detector like Quarknetworks, but also learn the feature descriptor. With the improvement of machine learning [20][21][22][23][24], Simoserra et al. proposed Deepdesc [25] for key point descriptor learning. This method uses a convolutional neural network to learn the discriminant representation of image blocks (patches), trains a Siamese network with paired inputs, and processes a large number of paired image blocks by combining the random extraction of training sets and the mining strategy for patch pairs that are difficult to classify. The L2 distance is used in training and testing, and the learned 128-d descriptor is used. Its Euclidean distance reflects the similarity of patch pairs. The feature learning method maps the pixel values of image blocks to description vectors through nonlinear coding. The goal is to learn description vectors. The selection of measurement rules of these description vectors is generally related to the real label vector. The processing process of references [26,27] included multiple parameterization modules such as gradient calculation, spatial pooling, feature normalization and dimension reduction. Trzcinski et al. [28] used a "weak learning" accelerator, including a series of capabilities of gradient direction and spatial position parameterization. In order to find the optimal parameters, different types of optimization algorithms, Powell minimization, boosting and convex optimization are used respectively.
Feature based 3D reconstruction is the last step of soft tissue surface tracking, mainly to build a visual object model with more three-dimensional spatial characteristics. The key point of 3D reconstruction technology is feature matching. In the stereo matching of binocular vision, the corresponding points in space are obtained from the two-dimensional feature matching results and camera parameters, combined with the triangular knowledge of epipolar geometry, multiple feature matching, and then the three-dimensional point cloud set of multiple points is obtained. Finally, the three-dimensional shape of the soft tissue surface is restored through triangulation. The essence of soft tissue surface reconstruction is to accurately estimate the object's three-dimensional shape. It is a process of converting a two-dimensional image into a three-dimensional image based on feature point matching data. In [29], the authors proposed an intraoperative surface reconstruction method based on stereo endoscope images. At the same time, the author also proposed a new hybrid CPU-GPU algorithm, which unifies the advantages of CPU and GPU versions. An innovative synchronous positioning and mapping algorithm is proposed in [30], which used a series of images of the stereo mirror to reconstruct the surface deformably. The author introduced a distortion field based on embedded deformation nodes, which can restore the three-dimensional shape from continuous paired stereo images.
In this paper, we used the feature matching algorithm based on deep learning, mainly based on the soft tissue tracking of the deep matching network. First, we used the triangle matching algorithm to obtain a self-made data set, then used the ORL face data set to pre-train the deep matching network, and then used the self-made training set to train the deep matching network. After carrying out the two-class feature tracking of the deep matching network, we carried out the multi-class spot tracking based on the convolutional neural network. In the multi-class convolutional neural network part, the same neural network architecture is used, and different pre-training sets, the MNIST handwritten data set and CIFAR-10 data set are used to experiment with the effect of the pre-training set on retraining and get the results. Finally, experiments are carried out on the algorithm of this paper and the influence of the network structure and training data set on the experimental results is analyzed and compared. The innovations are that we used three unrelated data sets to pre-train and retrain the matching network, constructed the training data set to prepare the training samples for the neural training network, improved the depth matching network based on the Siamese network and finally achieved good matching results.

Dataset
The initialization parameters of the neural network are obtained by training on the ORL face data set [31]. The ORL face data set contains a total of 400 images of 40 different people. Each person has 10 different images. The light, facial expressions and details of the images are different, and the size is 112 * 92 grayscale images. The data set is shown in Fig. 1.

Figure 1: ORL face dataset
The two data sets used in the pre-training in this article are the MNIST data set and the CIFAR-10 data set [32,33]. The data set is shown in Figs. 2 and 3.

Figure 2: MNIST dataset
Firstly, the MNIST data set of handwritten scanned digits is introduced. NIST, on behalf of the National Institute of standards and technology, is the organization that originally collected these data "M" stands for modified. In order to use machine learning algorithm easier, we first preprocessed the data. The MNIST dataset includes scanning of handwritten digits and related labels (describing which number of 0∼9 is contained in each image). It includes 60000 training images with 28 * 28 pixels and 10000 test images with 28 * 28 pixels. As shown in Fig. 2, these handwritten digits are standardized in size and located in the center of the image, and the pixel value of the image is normalized to 0∼1. The CIFAR-10 dataset contains 10 categories, aircraft, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. A total of 60000 RGB color images, including 50000 training images and 10000 test images.
Medical dataset we used in this paper is a set of actual three-dimensional images of the soft tissue of the heart provided by Hamlyn Center at Imperial College London. They are available on the website: https://imperialcollegelondon.app.box.com/s/kits2r3uha3fn7zkoyuiikjm1gjnyle3.

Triangular Matching Algorithm
The constructed matching data set is shown in Fig. 4. The first frame is our known frame, as shown in the figure, the 25th and 30th frames are the data sets matched by our triangle matching algorithm. Because the spot detection algorithm is affected by light, etc., the triangle matching can not match the first frame one by one, and there is spot loss in the subsequent frame F i (i ≥ 1), but it does not prevent us from intercepting the spots to make the data set. Even if a 32 * 32 size screenshot of a certain spot is missing in a certain frame, a screenshot of a certain spot will still appear in its subsequent frames.

Speckle Tracking Based on Depth Matching Network
The depth matching network is mainly composed of two parts, feature extraction network and metric network. The feature extraction is composed of two convolutional neural networks [22,34] with shared weights. This thinking comes from the Siamese network and twin neural network, which is very suitable for the binary classification task of image matching. Each image block (patch) inputs a feature extraction network to generate a fixed dimension sift like feature. This feature is a depth feature, but different from sift, the similarity and difference between the two feature description vectors in sift are calculated by the Euclidean distance, while in depth matching network, the metric network is used.
The metric network consists of three fully connected layers. The last layer uses sigmoid function (i.e., Eq. (1)) to output scores to obtain the similarity probability of image blocks.
The feature extraction network includes five convolution layers and two lower sampling layers, and includes an FC layer used to reduce the dimension of the features extracted by the feature extraction network. The function of the FC layer is to reduce the feature dimension extracted by the feature extraction network and control the overfitting of the network, because the number of parameters involved in full connection is large. If the feature dimension is too high, it will easily lead to a large number of parameters and over fitting. The output 256 dimension of FC layer represents the advanced features of the input image block, and each image block represents the spots detected in the frame. Therefore, FC layer represents the spot depth features "integrated" by the feature extraction network.
Because our input image is small, the size of the convolution kernel we use is relatively small, which is also to comprehensively and carefully obtain the feature information in the image block (patch). The convolution kernel after filling and then convolution greatly increases the nonlinear characteristics without losing the resolution while keeping the scale of the feature map unchanged; A convolution kernel corresponds to a feature map after convolution. Different convolution kernels (with different weights and bias) will get different feature maps after convolution to extract different features.
We use the RELU (rectified linear units) function [35] as the activation function of the convolution layer. Similarly, the sign function of the full connection layer is also RELU. The introduction of the activation function is to increase the nonlinearity of the convolution layer. Without the activation function, each convolution layer is equivalent to matrix multiplication, and the output of each layer only goes through a linear transformation. No matter how many layers the neural network has, the linear transformation is of little significance as a whole. Obviously, we want to learn the nonlinear characteristics of image blocks. By adding an activation function, nonlinear changes are introduced into neurons, so the neural network can be used to simulate any nonlinear function arbitrarily.
The details of the depth matching network, such as parameters, convolution kernel size, convolution step size and so on, are listed in Tables 1 and 2.  After the detailed information description of the network layer is completed, we make necessary explanations for the training method. We select positive and negative samples with batch size from the training sample set to build a group of training samples. The number of positive and negative samples in each batch size is equal.
Depth matching network is based on minimizing the cross entropy loss function to train network parameters. The cross-entropy function [36] is: where y i is the 0/1 label of the input image pair, 1 represents a match, and 0 represents a non-match. y i is the actual output value of the matching network, n represents the number of picture pairs in each batchsize, and batchsize = 32. We update the network weight according to the cross entropy, and then continue to input the next set of training samples, and repeat the above training process, and complete the epoch group training. In order to ensure that the direction of optimization is correct, the number of positive samples in the input samples of each batch is equal to the number of negative samples, and the training process is shown in Fig. 5, where M = 32, B = 256, N2 = 9. We calculate the matching matrix between F i (i > 1) and the spots in F 1 . The image blocks corresponding to the feature points in F 1 and F i (i > 1) are combined into the depth matching network in pairs, and the similarity of the spots will be calculated, and the matching matrix will be filled in according to the correspondence between rows and columns. Each row of the matching matrix corresponds to a spot in F 1 , and each column corresponds to a detected spot in F i (i > 1). According to the matching matrix, we select the column with the highest score in each row (corresponding to a feature point in F 1 ) and exceed a set threshold (corresponding to the feature point in F i (i > 1)) as the matching feature point to complete the spot in the frame Match between (tracking). If the matching degree is less than the set threshold, it means that no matching spots are detected in this frame.

Experiments and Results
Our experiment is based on TensorFlow 2.1, python 3.7, NVIDIA Geforce GTX 1060ti and other platforms. The program running environment is shown in Table 3. When constructing the sample set, we need to first expand the edges of the bottom of the input image. Our input picture is 288 * 320, and the spot coordinates near the bottom edge of the detected spots are close to the y value (288) in the picture coordinate system, where the origin of the picture coordinate system is in the upper left corner. The y-value index of the spot coordinate has the hidden danger of crossing the boundary. Therefore, before capturing the picture, fill in the bottom of the picture. In this article, we choose the boundary pixel extension [37], which is conducive to the feature extraction network to fully extract the pixel information around the spots. The edge filling result is shown in Fig. 6.  Fig. 7 and the re-training results are shown in Fig. 8. It can be seen from Figs. 7 and 8 that although the pre-training stage has performed well, under the retraining of the self-made training set, the curve is smoother and converges faster from the accuracy curve and loss curve. Therefore, the weight parameters obtained by pre-training are effective, which speeds up the training progress and convergence speed.  For the traditional neural network, 32 * 32 * 3 endoscopic soft tissue speckle map is used as the input, and the previous n = 100 frame matching result is used as the training set. Due to the loss of spots, a total of 750 speckle maps are used as the training set, and 180 speckle images of the subsequent 20 frames are used as the test set to train and test the convolutional neural network.
The two data sets used for pre-training are the MNIST data set and the CIFAR-10 dataset. We will perform pre-training on the two data sets respectively, and then use our self-made training set for retraining. On the same network structure, different data sets are used for pre-training, and the influence of the pre-training data set on the convolutional neural network training can be compared; After the retraining stage, we can see the impact of pre-training on retraining.
In the network setting, the learning rate is 0.0001, the optimizer adopts Adam optimizer and small batch training method, batchsize = 32, and the maximum training epoch = 100. The training results are shown in Fig. 9.   As mentioned in the previous paper, we use three different data sets to study inter frame speckle matching on two neural networks. The differences between neural networks and the comparison of data sets are shown in Table 4. In the subsequent frames, in order to ensure the universality of the test, any frame is selected. In the experiment, any frame selected by the program is F 148 , after the spot detection, 29 spots are detected. Take the detected spot coordinates as the center, intercept 32 * 32 image blocks, form image block pairs with the 9 spots detected in the first frame, input the trained depth matching network, and the depth matching network outputs the similarity of each image block pair to obtain a 9 * 29 matching score matrix. Therefore, the matching result of the first frame and the F 148 spot can be obtained from the matching score matrix, as shown in Fig. 12. And the speckle matching diagram obtained by the matching score matrix is shown in Fig. 13.  From Figs. 9 and 10, the accuracy of the two data sets in the pre-training stage is different. The accuracy of the MNIST handwritten data set has reached more than 90%, which can be considered that the purpose of pre-training has been achieved. However, the accuracy rate on the CIFAR-10 data set is only 50%, and even the accuracy of the test set tends to decline.
As can be seen from Fig. 12, retraining has obvious divergence and overfitting. The reason is that the pre-training of the pre-training data set is not in place. We can also see from the pre-training curve that its accuracy curve and the trend of decline appear. Only the pre-training sample set is large, and it may continue to be pre-trained for hundreds of rounds, which will be the same as that of retraining. In the complex feature information, the pre-trained sample data set has a considerable impact on the subsequent retraining.
It can be seen from Figs. 10-12 that the grid multi-classification effect based on CIFAR-10 data set is obviously inferior to that based on the MNIST handwritten data set. It shows that the pre-training of the data set still has a significant impact on the subsequent retraining.
It can be seen from Table 4 that in the pre-training stage, the training results of the depth matching network on the face data set are excellent. If we continue to use the self-made training set, the convergence is faster, the accuracy curve is steeper, and its initial starting point is also relatively high, It shows that the weight parameter obtained by pre-training on the gray image face data set is an effective weight parameter for soft tissue image speckle training set, and shortens the retraining time; Convolutional neural network also performs well on MNIST handwritten gray-scale images, but it performs poorly on the RGB CIFAR-10 data set for two reasons: the first reason is that the characteristics of the MNIST handwritten digits are as simple as those of soft tissue image surface spots, and the gray-scale information is regional, while the image information of cars, animals and other images in CIFAR-10 data set is more complex, And the characteristics of surface spots in soft tissue images; The second reason is that the structure of convolutional neural network itself is relatively simple, which can only deal with images with simple feature information. For complex feature information such as images in CIFAR-10 dataset, it is easy to overfit, resulting in non-convergence or even divergence.
As shown in Fig. 12, it can be seen that the spots detected in F 148 are compared with the first frame. The spots 5 and 6 detected in the first frame are not detected in F 148 . Therefore, in the matching score matrix, the spots 5, 6 in the first frame, the similarity probability of the column corresponding to the row is below 0.5, indicating that the spots 5 and 6 are not detected, and the spots 5 and 6 fail to track in F 148 . The similarity probability of the columns corresponding to the rows of the remaining spots is above 0.5 and the largest among the columns. The column where the probability is located in the corresponding spot of the first frame spot in F 148 .
In the experiment, spot 1 is tracked from frame 121. From the pixel coordinates, the heartbeat range is still about 30 pixels, which means that the soft tissue feature tracking algorithm based on depth matching network is successful.

Conclusions
This paper constructs the training data set to prepare the training samples for the neural training network. After that, we improve the depth matching network based on the Siamese network to adapt to the feature extraction and measurement of soft tissue surface images. Firstly, we pre-train on the ORL face data set to get better results and then retrain on our own data set to get a smoother and steeper loss curve and accuracy curve so as to achieve the purpose of retraining. Furthermore, we compared the spot-matching algorithms of classified convolutional neural networks. In terms of convolutional neural networks, Lenet was used as the basic structure and slightly modified, pre-trained on the MNIST handwritten data set and the CIFAR-10 data set respectively, and then retrained with a self-made training set. The pre-training based on the MNIST data set performed well in the retraining stage. However, the pre-training accuracy based on the CIFAR-10 data set reaches 50%, and the loss in the retraining stage shows a divergent form, and the accuracy decreases significantly. It can be seen that the data set has a significant impact on the results of pre-training and retraining. Therefore, in the follow-up research, we can further select more data sets to train the deep matching network for better performance.