Dual-Transformer Head End-to-End Person Search Network

Person Search mainly consists of two sub-tasks, namely person detection and person re-identification (re-ID). Existing approaches are primarily based on Faster R-CNN and CNN backbones (e.g., ResNet). While these structures can detect high-quality bounding boxes, they tend to degrade re-ID performance. To address this issue, this paper proposes a Dual-Transformer Head Network (DTHN) for end-to-end person search, which contains two independent Transformer heads: a box head for detecting bounding boxes and extracting efficient bounding-box features, and a re-ID head for capturing high-quality re-ID features for the re-ID task. Specifically, after the image goes through the ResNet backbone network to extract features, the RPN proposes candidate bounding boxes. The box head then extracts more efficient features within these bounding boxes for detection. Following this, the re-ID head computes the occluded attention of the features in these bounding boxes and distinguishes them from other persons or the background. Extensive experiments on two widely used benchmark datasets, CUHK-SYSU and PRW, achieve state-of-the-art performance: 94.9 mAP and 95.3 top-1 score on CUHK-SYSU, and 51.6 mAP and 87.6 top-1 score on PRW, which demonstrates the advantages of this paper's approach.


Introduction
Person Search aims to localize a specific target person in a gallery set, which means it contains two sub-tasks, person detection and person re-ID. Depending on how these two sub-tasks are handled, existing work can be divided into two-step and end-to-end methods. Two-step methods [3, 6, 9, 13, 21, 26] treat them separately by conducting re-ID on cropped person patches found by a standalone person box detector. They trade time and resource consumption for better performance, as shown in Fig. 1(a). By comparison, end-to-end methods [1, 14, 23-25, 27, 28] tackle detection and re-ID simultaneously in a multi-task framework, as seen in Fig. 1(b). These approaches commonly utilize a person detector (e.g., Faster R-CNN [18], RetinaNet [15], or FCOS [19]) for detection and then feed the features into re-ID branches. To address the issue caused by the parallel structure of Faster R-CNN, Li et al. [14] proposed SeqNet, which performs detection and re-ID sequentially to extract high-quality features and achieve superior re-ID performance. Yu et al. [28] introduced COAT to resolve the imbalance between detection and re-ID by learning pose/scale-invariant features in a coarse-to-fine manner, achieving improved performance. However, end-to-end methods still suffer from several challenges:

• Handling occlusions by background objects or partial appearance poses a significant challenge. Detecting and correctly re-identifying persons becomes harder when they are obscured by objects or positioned at the edges of the captured image. While current models may perform well in general person search, they are prone to failure in complex occlusion situations.
• The significant scale and pose variations complicate re-ID. Since current models mainly utilize CNNs to extract re-ID features, they tend to suffer from scale and pose variations due to inconsistent receptive fields, which degrades re-ID performance.
• Efficient re-ID feature extraction remains a thorny problem. Existing methods perform either re-ID first or detection first, but still leave unsolved the issue of how to efficiently extract re-ID features for better performance.
For such cases, we propose a Dual-Transformer Head End-to-End Person Search Network (DTHN) to address the above limitations. First, inspired by SeqNet, an additional Faster R-CNN head is used as an enhanced RPN to provide high-quality bounding boxes. Then a Transformer-based box head is utilized to efficiently extract box features for high-accuracy detection. Next, a Transformer-based re-ID head is employed to efficiently obtain re-ID representations from the bounding boxes. Moreover, we randomly mix up partial tokens of instances in a mini-batch to learn cross-attention. Compared to previous works that struggle to balance detection and re-ID, DTHN can achieve high detection accuracy without degrading re-ID performance.
The main contributions of this paper are as follows:
• We propose a dual-Transformer head end-to-end person search network, refining the box and re-ID feature extraction that limited previous end-to-end frameworks. Performance is improved by designing a dual-Transformer head structure containing two independent Transformer heads that handle high-quality bounding-box feature extraction and high-quality re-ID feature extraction, respectively.
• We improve end-to-end person search performance by extracting high-quality fine-grained features under occlusion through the attention mechanism in the Transformer. By employing the occluded attention mechanism, the network can learn person features under occlusion, which substantially improves re-ID performance for small-scale persons and in occlusion situations.
• We validate the effectiveness of our approach by achieving state-of-the-art performance on two widely used datasets, CUHK-SYSU and PRW: 94.9 mAP and 95.3 top-1 score on CUHK-SYSU, and 51.6 mAP and 87.6 top-1 score on PRW.
The remainder of this paper is organized as follows: Section 2 presents research related to this work in recent years; Section 3 reviews the relevant preliminary knowledge and presents the proposed DTHN design in detail; Section 4 presents the experimental setup and verifies the effectiveness of the proposed method through experiments; Section 5 summarizes this work and provides an outlook on future work.
Related Work

Person Search
Person search has received increasing attention since the release of CUHK-SYSU and PRW, two large-scale datasets. This development marked a shift in how researchers approach person search, viewing it as a holistic task instead of treating its parts separately. Early solutions were two-step methods: a person detector (or manually constructed person boxes), followed by a person re-ID model to search for targets in the gallery. Their high performance comes with high time and resource consumption; two-step methods tend to require more computation and time to reach the same level as end-to-end methods. End-to-end person search has attracted extensive interest due to the integrity of solving the two sub-tasks together. Li and Miao [14] share the stem representations of person detection and re-ID, solving the two sub-tasks sequentially. Yan et al. [24] propose the first anchor-free person search method to address the misalignment problem at different levels. Furthermore, Yu et al. [28] present a three-stage cascade framework for progressively balancing person detection and re-ID.

Vision Transformer
Transformer [20] was initially designed to solve problems in natural language processing. Since the release of the Vision Transformer (ViT) [7], it has become popular in computer vision (CV) [29-32]. This pure Transformer backbone achieves state-of-the-art performance on many CV problems and has been shown to extract multi-scale features that traditional CNNs struggle with. The re-ID process heavily relies on fine-grained features, making ViT a promising technology in this field. Several efforts have explored the application of ViT to person re-ID. Li et al. [33] propose a part-aware Transformer to perform occluded person re-ID through diverse part discovery. Yu et al. [28] perform person search with multi-scale convolutional Transformers, learning discriminative re-ID features and distinguishing people from the background in a cascade pipeline. This paper proposes a dual-Transformer head for an end-to-end person search network to efficiently extract high-quality bounding-box features and re-ID features.

Attention Mechanism
The attention mechanism plays a crucial role in the operation of the whole Transformer. Since the proposal of ViT, numerous variants have tried to bring different capabilities to the Transformer by changing the attention mechanism. Among them, in object detection, combining artificial token transformations has become a mainstream approach to detecting occluded targets. Based on this, Yu et al. [28] proposed an occluded attention module in which positive and negative samples in the same mini-batch are randomly partially swapped to simulate the background occlusion of persons, achieving good performance. This is also the main attention mechanism used in this paper.
To give the reader further insight, Table 1 provides a brief summary of the related work and the work in this paper.

Methods
As previously mentioned, existing end-to-end person search works still struggle with the conflict between person detection and person re-ID. Prior studies have indicated that, despite a potential decrease in detection precision, re-ID precision can be maintained or even improved through serialization. Moreover, high detection precision yields accurate bounding-box features, which benefit re-ID. Thus, we propose the Dual-Transformer Head Person Search Network (DTHN) to achieve both high-quality detection and refined re-ID accuracy.

End-to-end person search network
As shown in Fig. 2, our network is based on the Faster R-CNN object detector with a Region Proposal Network (RPN). We first use the ResNet-50 [11] backbone to extract the 1024-dim backbone feature, then feed it into the RPN to obtain region proposals.
During training, RoI-Align is performed on the proposals generated by the RPN to obtain region-of-interest features for bounding-box search, while during the re-ID phase RoI-Align is performed on the ground-truth bounding boxes. Note that instead of using ResNet-50 stage 5 (res5) as our box head, we utilize a Transformer to extract high-quality box features for high detection accuracy, and use the predictor head of Faster R-CNN to obtain high-confidence detection boxes. The RoI-Align operation pools an h × w region of interest, which we use as the stem feature F ∈ R^{h×w×c}, where h is the height, w the width, and c the number of channels. We set the intersection-over-union (IoU) threshold to 0.5 in the training phase to distinguish positive and negative samples, and to 0.8 in the testing phase to obtain high-confidence bounding boxes. A Transformer re-ID head is then utilized to extract discriminative features from F. In each Transformer head, the features are supervised by two regression losses L_reg1 and L_reg2, which share the representation:

L_reg = (1 / N_p) Σ_{i=1}^{N_p} L_loc(r_i, Δ_i),

where N_p denotes the number of positive samples, r_i the computed regression of the i-th positive sample, Δ_i the corresponding ground-truth regression, and L_loc the Smooth-L1 loss.
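As a minimal sketch, the box-regression supervision above can be written out as follows (NumPy for illustration only; the function names are ours, not the paper's):

```python
import numpy as np

def smooth_l1(x: np.ndarray) -> np.ndarray:
    """Smooth-L1 (Huber) loss, elementwise: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def box_regression_loss(pred_deltas: np.ndarray, gt_deltas: np.ndarray) -> float:
    """L_reg = (1 / N_p) * sum_i L_loc(r_i, Delta_i), summed over the four
    box-regression coordinates of each positive sample."""
    n_p = pred_deltas.shape[0]  # number of positive samples N_p
    return float(smooth_l1(pred_deltas - gt_deltas).sum() / n_p)
```

A single positive sample whose predicted deltas exactly match the ground truth contributes zero loss, and small residuals are penalized quadratically.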
In addition, we calculate the classification losses L_cls1 and L_cls2 after the two Transformer heads:

L_cls = -(1 / N) Σ_{i=1}^{N} c_i log p_i,

where N denotes the number of samples, p_i the predicted classification probability of the i-th sample, and c_i the ground-truth label.
Note that L_cls2 and the re-ID loss L_reid are calculated by the Norm-Aware Embedding L_nae(·), where f denotes the extracted 256-dim feature.
Finally, the overall loss function is defined as the weighted sum

L = λ_1 L_reg1 + λ_2 L_cls1 + λ_3 L_reg2 + λ_4 L_cls2 + λ_5 L_reid,

where λ_i denotes the weight of each loss.

Occluded attention
The attention mechanism plays a crucial role in the Transformer. In our application, where we aim to extract high-quality bounding boxes and re-ID features, we must address the issue of occlusion. To this end, we use occluded attention in the DTH to prompt the model to learn occlusion features and handle them in real applications, as shown in Fig. 3.

Fig. 3 Occluded attention: token maps with random token switching.
First, we build the token bank X = {x_1, x_2, ..., x_p}, where p denotes the number of box proposals and x_i denotes a token in one mini-batch. We exchange part of the tokens with another token from the token bank according to the index, using the Token-Mix-Up (TMU) function:

TMU(x_i, x_j) = x_j if R < T, otherwise x_i,

where x_i, x_j denote the tokens to be handled, R denotes a random value generated by the system, and T denotes the exchange threshold.
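A sketch of our reading of TMU follows (NumPy for illustration; the per-token-position exchange granularity and partner selection are assumptions, as the paper only specifies the thresholded random swap):

```python
import numpy as np

def token_mix_up(tokens, threshold=0.5, rng=None):
    """Randomly exchange part of each instance's tokens with those of
    another instance from the token bank (mini-batch).

    tokens: array of shape (p, n_tok, c_hat) -- p box proposals,
    n_tok tokens each. For every token position a random value R is
    drawn; if R < threshold, that token is swapped with the same
    position of a randomly chosen partner proposal j (assumption)."""
    rng = rng or np.random.default_rng()
    p, n_tok, _ = tokens.shape
    mixed = tokens.copy()
    for i in range(p):
        j = rng.integers(p)                 # partner index from the token bank
        swap = rng.random(n_tok) < threshold  # R < T per token position
        mixed[i, swap] = tokens[j, swap]
    return mixed
```

With threshold T = 0 no tokens are exchanged and the input passes through unchanged.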
After random swapping, we transform the tokenized features into three matrices through three fully connected (FC) layers: a query matrix Q, a key matrix K, and a value matrix V, and then compute the multi-head self-attention (MSA):

Attention(Q, K, V) = softmax(QK^T / √ĉ) V,

where ĉ denotes the channel scale of each token, equal to c/n, n is the number of slices during tokenization, and m denotes the number of heads in the MSA.
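The MSA computation can be sketched as follows (a minimal NumPy version; plain weight matrices stand in for the three FC layers, and the per-head scaling by the head dimension is our assumption):

```python
import numpy as np

def multi_head_self_attention(x, wq, wk, wv, m):
    """Minimal MSA sketch.
    x: (t, c_hat) tokens; wq/wk/wv: (c_hat, c_hat) projections standing
    in for the FC layers; m: number of heads."""
    t, c_hat = x.shape
    d = c_hat // m                                   # per-head dimension
    q, k, v = x @ wq, x @ wk, x @ wv
    # split channels into m heads: (m, t, d)
    q = q.reshape(t, m, d).transpose(1, 0, 2)
    k = k.reshape(t, m, d).transpose(1, 0, 2)
    v = v.reshape(t, m, d).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # scaled dot product, (m, t, t)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over keys
    out = attn @ v                                   # (m, t, d)
    return out.transpose(1, 0, 2).reshape(t, c_hat)  # concatenate heads
```

The output has the same shape as the input tokens, so it can be passed directly to the FFN.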
After the MSA, a Feed-Forward Network (FFN) outputs features for regression, classification, and re-ID.

Dual-Transformer Head
The Dual-Transformer Head (DTH) consists of two individual Transformer heads designed for detection and re-ID. Although working in different parts of the network, the detection and re-ID heads share the same mechanism. The Transformer box head takes box proposals as input and generates processed features as output. In contrast, the Transformer re-ID head takes the ground truth as input during the training phase but proposals during the testing phase. Therefore, we hypothesize that the quality of detection can positively impact re-ID performance. The structure of the DTH is visualized in Fig. 4. First, the pooled stem feature F ∈ R^{h×w×c} is fed into the Transformer box head to obtain the proposal feature, which is fed into Faster R-CNN to calculate the proposal regression and proposal classification. After that, F is re-fed into the Transformer re-ID head to obtain the box feature, which is fed into the bounding-box regressor and NAE to calculate the box regression and box classification. The NAE box classification loss is

L_nae = -y log σ(norm_r) - (1 - y) log(1 - σ(norm_r)),

where y ∈ {0, 1} denotes whether the box is a person or background, norm_r ∈ [0, ∞) is the norm of the embedding, and σ denotes the sigmoid activation function. The OIM loss is calculated on the features processed by NAE. OIM has two auxiliary structures: a Look-Up Table (LUT) storing the feature vectors of tagged identities, and a Circular Queue (CQ) storing untagged identities detected in recent mini-batches. The OIM loss is

L_oim = -E_x [log p_t],
where E_x denotes the expectation and p_t denotes the probability of x being judged as identity t.
We take the Transformer re-ID head as an example to demonstrate the process. After the feature has been pooled into F ∈ R^{h×w×c}, F goes through tokenization. We split F channel-wise into n slices, obtaining F′ ∈ R^{h×w×ĉ}. We then utilize a series of convolutional layers to generate token maps from F′, obtaining F″ ∈ R^{ĥ×ŵ×ĉ}, and flatten F″ into tokens x ∈ R^{ĥŵ×ĉ}. After TMU, the tokens go through the MSA and FFN described above. The outcome is projected back to the input size ĥ × ŵ × ĉ. The features output by the Transformer are pooled and delivered to different loss functions according to the type of Transformer head. The internal structure of the Transformer head is shown in Fig. 5.
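The shape flow of this tokenization can be traced with the paper's sizes (14 × 14 × 1024); the slice count n = 2 and the 2× spatial downsampling by the convolutional token generator are assumptions, and a simple average pool stands in for those convolutions:

```python
import numpy as np

# Shape walk-through of tokenization in the Transformer head.
h, w, c = 14, 14, 1024   # pooled stem feature F (paper's setting)
n = 2                    # assumed number of channel slices
c_hat = c // n           # per-slice channels: 512
F = np.zeros((h, w, c))

# 1) split channel-wise into n slices of shape (h, w, c_hat)
slices = np.split(F, n, axis=-1)

# 2) stand-in for the convolutional token generator: a 2x2 average pool
#    halving the spatial size, giving (h_hat, w_hat, c_hat)
h_hat, w_hat = h // 2, w // 2
tok_maps = [s.reshape(h_hat, 2, w_hat, 2, c_hat).mean(axis=(1, 3))
            for s in slices]

# 3) flatten each map into tokens x of shape (h_hat * w_hat, c_hat)
tokens = [t.reshape(h_hat * w_hat, c_hat) for t in tok_maps]
```

Each slice thus yields 49 tokens of dimension ĉ = 512 under these assumptions, which are then passed to TMU and the MSA.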

Experiment
All training is conducted in PyTorch with one NVIDIA A40 GPU, while testing is conducted with one NVIDIA RTX 3070Ti GPU. The original image is resized to 900 × 1500 as the input and passed through ResNet-50 up to stage 4.

Datasets and metrics
We conduct our experiments on two widely used datasets. The CUHK-SYSU dataset [23] contains images from 18,184 scenes with 8,432 identities and 96,143 bounding boxes; the default test protocol uses 2,900 query identities over 6,978 gallery images with a default gallery size of 100. The PRW dataset [26] collects 11,816 video frames from 6 cameras, divided into a training set with 5,704 frames and 482 identities and a testing set with 2,057 query persons in 6,112 frames. We evaluate our model following the standard evaluation metrics. Following the Cumulative Matching Characteristic (CMC), a detected box is considered correct only when its IoU with the ground truth exceeds 0.5. We therefore use Recall and Average Precision (AP) as the performance metrics for person detection, and mean Average Precision (mAP) and top-1 score for person re-ID. For all metrics, higher is better.
AP and mAP are computed as

AP = Σ_n (R_n − R_{n−1}) P_n,   mAP = (1/C) Σ_{c=1}^{C} AP_c,

where R_n and P_n denote the recall and precision at the n-th confidence threshold and C denotes the number of classes (queries). The top-1 score is the fraction of queries whose highest-ranked result is correct.
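The AP/mAP definitions above amount to the following short computation (NumPy for illustration; the function names are ours):

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP as the weighted sum of precisions over recall increments:
    AP = sum_n (R_n - R_{n-1}) * P_n, with R_0 = 0 and recalls
    given in increasing order."""
    recalls = np.concatenate([[0.0], recalls])
    return float(np.sum(np.diff(recalls) * precisions))

def mean_average_precision(ap_per_query):
    """mAP averages AP over all C queries/classes."""
    return float(np.mean(ap_per_query))
```

For example, two thresholds with (recall, precision) pairs (0.5, 1.0) and (1.0, 0.5) give AP = 0.5 × 1.0 + 0.5 × 0.5 = 0.75.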

Implementation detail
We take ResNet-50 pre-trained on ImageNet as the backbone. The batch size is set to 5 during training and 1 during testing. The size of F is set to 14 × 14 × 1024. The number of heads m in the MSA is set to 8. The loss weight λ_1 is set to 10; the others are set to 1. We use the SGD optimizer with a momentum of 0.9 and train for 20 epochs. The learning rate warms up to 0.003 during the first epoch and is decayed by a factor of 10 after the 16th epoch. The CQ size of OIM is set to 5,000 for CUHK-SYSU and 500 for PRW. The IoU threshold is set to 0.4 in the testing phase.
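The learning-rate schedule described above can be sketched as follows (the linear warm-up shape and per-epoch iteration count are assumptions; the text only fixes the peak value, warm-up epoch, and decay epoch):

```python
def learning_rate(epoch, iters_done=0, warmup_iters=1000):
    """Schedule sketch: linear warm-up to 0.003 during the first epoch
    (epoch 0), constant until the 16th epoch, then a 10x decay.
    warmup_iters is an assumed iteration count for the warm-up epoch."""
    base = 0.003
    if epoch == 0:
        # linear warm-up over the first epoch's iterations
        return base * min(1.0, (iters_done + 1) / warmup_iters)
    if epoch > 16:
        return base / 10  # decayed by a factor of 10 after the 16th epoch
    return base
```

This yields 0.003 for the bulk of training and 0.0003 for the final epochs.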

Ablation study
We conducted several experiments on the PRW dataset to analyze our proposed method. As shown in Table 2, we test several combinations of box heads and re-ID heads and evaluate their performance on PRW. We set the default box head and re-ID head to ResNet-50 (stage 5) for the first experiment, follow with two experiments replacing either the box head or the re-ID head with the corresponding Transformer head, and finally set both heads to Transformer heads. As Table 2 shows, with ResNet-50 (stage 5) as both box head and re-ID head, both detection and re-ID are at a moderate level. When only the box head is changed to a Transformer, detection accuracy does not improve and re-ID accuracy drops slightly, so a Transformer box head alone is not effective. When we keep ResNet-50 (stage 5) as the box head and replace the re-ID head with a Transformer, re-ID accuracy increases significantly, which shows that the Transformer can maximize the information extracted from the feature for re-ID. Finally, when both heads are replaced with Transformers, detection accuracy is slightly reduced, but re-ID accuracy improves significantly with the support of the DTH. Although the Transformer box head reduces detection accuracy, it efficiently extracts valid information and improves overall re-ID performance together with the Transformer re-ID head; the Transformer re-ID head in particular enhances re-ID performance in various occlusion scenarios.
Therefore, we believe that our design of the DTHN can fully extract both the box features and the unique features of person for efficient re-ID.

Comparison with state-of-the-art models
We compare our DTHN with state-of-the-art methods on CUHK-SYSU and PRW, including two-step and end-to-end methods.The results are shown in Table 3.
The results of using Context Bipartite Graph Matching (CBGM) are shown in Table 4.
Results on CUHK-SYSU. As shown in the table, we achieve the same 93.9 mAP as, and a comparable 94.3 top-1 score to, the state-of-the-art two-step method TCTS. Compared with recent end-to-end works, our mAP outperforms AlignPS, SeqNet, and AGWF, and our top-1 score outperforms AlignPS and AGWF. Additionally, with the post-processing operation CBGM, both our mAP and top-1 score improve to 94.9 and 95.3, achieving the best mAP among all methods with a highly competitive top-1 score.
Results on PRW. The PRW dataset is well known to be more challenging. We achieve 50.7 mAP and 85.1 top-1 score. Our mAP outperforms all two-step methods. Among end-to-end methods, our mAP and top-1 score outperform AlignPS and SeqNet, while a 2.5 mAP gap remains with AGWF and COAT. Due to its structural advantage, COAT remains state-of-the-art on PRW, but the DTHN proposed in this paper still achieves respectable results with fewer parameters and less computation. By applying CBGM as a post-processing operation, we obtain a gain of 0.9 mAP and a significant gain of 2.5 in top-1 score, further improving our method and narrowing the gap with COAT. This shows that the proposed DTHN is effective on the challenging PRW dataset.
Efficiency Comparison. We compare our efficiency with two end-to-end networks, SeqNet and COAT. All experiments are conducted on an RTX 3070Ti GPU on the PRW dataset. As shown in Table 5, we include the number of parameters, the multiply-accumulate operations (MACs), and the running speed in frames per second (FPS) in the comparison.
Compared with SeqNet and COAT, we significantly reduce the number of parameters while maintaining equivalent MACs and comparable accuracy. In terms of FPS, SeqNet is the fastest at 9.43 because it does not compute attention, while we have a slight speed advantage over COAT, which also computes attention. In summary, our model runs efficiently while performing well.

Visualization analysis
To show the recognition accuracy of DTHN in different scenes, several scenes are selected as demonstrations in Fig. 6. A green bounding box indicates a detection result with higher than 0.5 similarity.

Conclusion and Outlook
After noticing the challenges of occlusion and efficiency in end-to-end person search, we propose the DTHN to address these problems. We use two Transformer heads to handle the box detection and re-ID tasks separately. DTHN outperforms existing methods on the CUHK-SYSU dataset and achieves competitive results on the PRW dataset.
Our method is slightly slower than traditional CNN methods because the scaled dot-product attention in the Transformer consumes more computational resources. However, thanks to the small size of the Transformer, we have cut the number of parameters compared to traditional CNNs, which gives us hope for deployment on terminal devices. Despite the good results, we believe there is still room for improvement in our approach, whether through better and more convenient attention computation methods or through adaptive attention mechanisms. Eventually we may be able to create a pure Transformer model that uses different attention heads in a single Transformer to accomplish different tasks; this is the main focus of our future work. We believe the deployment of person search on terminal devices is just around the corner.

Fig. 1
Fig. 1 Two main types of person search network.

Fig. 2
Fig. 2 Network structure of DTHN; dotted lines denote operations that occur only in the testing phase.
Fig. 6 Visualization of search results; green bounding boxes mark detections with higher than 0.5 similarity. The number indicates the similarity between the query and the result. In most open scenes, the model re-identifies the target very well. In a crowded scene where persons overlap, the model gives the non-target 0.59 similarity and the target 0.64 similarity; visually the two persons have similar features, a light-colored top and jeans. The model can still extract features well despite the large occlusion of the non-target.

Table 1
A summary of related person search works and our work.
Person Search: end-to-end methods jointly solve the detection and re-ID sub-problems in a more efficient, multi-task learning framework.
Vision Transformer: based on the original Transformer model for natural language processing, the Vision Transformer (ViT) is the first pure Transformer network to extract features for image recognition.
Attention Mechanism: the attention mechanism plays a crucial role in Transformers; the proposed occluded attention module considers token cross-attention between positive and negative instances in the mini-batch.
Ours: we propose a dual-Transformer head for the end-to-end person search network, refining the box and re-ID feature extraction that limited previous end-to-end frameworks.

Table 2
Comparison with different heads

Table 3
Comparison with SOTA models

Table 4
Comparison with SOTA models using CBGM

Table 5
Efficiency comparison