
Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification

Mei Qiu, Lauren Christopher, and Lingxi Li

Mei Qiu, Lauren Christopher, and Lingxi Li are with the Department of Electrical and Computer Engineering, Purdue University in Indianapolis, 723 West Michigan Street, SL-160, Indianapolis, Indiana 46202, USA. Emails: meiqiu@iu.edu, {lauchris,ll7}@iupui.edu.

Abstract

Vision Transformers (ViTs) have excelled in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video input might significantly affect the re-identification performance. To address this issue, we propose a novel ViT-based ReID framework in this paper, which fuses models trained on a variety of aspect ratios. Our main contributions are threefold: (i) We analyze aspect ratio performance on the VeRi-776 and VehicleID datasets, guiding input settings based on the aspect ratios of the original images. (ii) We introduce intra-image patch-wise mixup during ViT patchification (guided by spatial attention scores) and implement uneven stride for better object aspect ratio matching. (iii) We propose a dynamic feature fusing ReID network, enhancing model robustness. Our ReID method achieves a significantly improved mean Average Precision (mAP) of 91.0% compared to the closest state-of-the-art (CAL) result of 80.9% on the VehicleID dataset.

I INTRODUCTION

Figure 1: Aspect ratio distribution of images in training datasets from ReID benchmark datasets, VeRi-776 and VehicleID, varies significantly. These datasets show that a substantial portion of the images are non-square.

A fundamental task in intelligent transportation systems is vehicle re-identification (Re-ID): identifying vehicles across multiple non-overlapping cameras [wang2019survey]. Despite its significance, vehicle Re-ID encounters challenges due to variations in vehicle appearance across different viewpoints, poses, illumination, and backgrounds. Deep learning models face the challenge of extracting discriminative features resistant to viewpoint variations [zheng2020vehiclenet, chen2023global]. Both global and local features are essential for generating robust representations for vehicle pairs [peng2019learning, zhang2020part, gu2021efficient].

Numerous benchmark datasets contain images from real-world surveillance scenes, including VeRi-776 [zheng2020vehiclenet], PKU-VD [yan2017exploiting], VehicleID [liu2016deep], Vehicle-1M [guo2018learning], and VERI-Wild [lou2019veri], crucial for vehicle Re-ID AI development [zakria2021trends]. State-of-the-art models leverage the self-attention mechanism. Vision transformers (ViTs) have demonstrated superior ability in capturing discriminative details compared to previous CNN-based methods [he2021transreid]. Additionally, images in different datasets exhibit various aspect ratios, as illustrated in Fig. 1. VeRi-776 and VehicleID display different size and shape distributions, posing a significant challenge in model training.

However, unlike CNN-based models, which can handle varying aspect ratios to some extent due to their translation invariance and local receptive fields, vision transformers consider the entire image as a sequence of patches. The fixed patch size and sequential nature of transformers necessitate careful consideration of how input images are resized and cropped [ke2021musiq, liu2022aspect, mao2022towards, dehghani2023patch]. Early implementations of vision transformers adopted resizing strategies from CNNs, often distorting the original aspect ratio and potentially compromising performance, particularly in tasks reliant on object shape and scale.

To address this, subsequent studies explored padding strategies, adaptive post-patch extraction, trainable resizing networks, aspect-ratio aware attention mechanisms, and multi-scale/multi-aspect training [xia2022vision, lv2022scvit, hwang2022vision, zhu2022aret, ke2021musiq, li2022multi]. However, these approaches entail computational burdens, data requirements, and optimization challenges.

In summary, gaps persist in applying Vision Transformers (ViTs) to vehicle Re-ID. Optimization of scaling and resizing strategies, understanding aspect ratio effects, and exploring data augmentations such as mixup [zhang2018mixup] at the patch level for ViTs in multi-aspect ratio scenarios are needed. Intra-image mixup can potentially enhance the model’s ability to learn detailed features, particularly when the whole image is distorted by resizing with unsatisfactory aspect ratios. TransReID [he2021transreid] introduces a jigsaw patch module (JPM) to enhance feature robustness and discrimination ability, representing a step towards addressing these challenges. However, this intra-mixup occurs at the feature level; its potential at the pixel level needs to be comprehensively explored.

Figure 2: (Left) Existing Method: Image size is typically fixed and set to a single size with a square shape. (Right) Our Method: Combined Vision Transformer (ViT)-based ReID model that dynamically fuses features extracted from multiple models. Each model is trained on a fixed size and aspect ratio.
Figure 3: The structure of each individual model is designed to adapt to the dataset’s size and aspect ratio distribution. During the patchification process, the stride size in the horizontal and vertical directions is dynamically determined based on the input object’s aspect ratio. Subsequently, a Patch Mixing (PM) module shuffles and mixes patches from the same image using an attention-guided strategy. Any Vision Transformer (ViT)-based architecture can be chosen as the backbone. In this study, we select ViT/B-16. The features extracted from ViTs are used for the vehicle ReID downstream task.

In this work, we are the first to propose that aspect ratio is a key factor affecting vehicle Re-ID performance and the robustness of feature learning in Vision Transformers (ViTs). We conduct a series of novel, scientific, and comprehensive experiments to explore the effects of various aspect ratios on ViT-based ReID.

To enhance the model’s generality to various aspect ratios in the input, we dynamically fuse features extracted from several models trained on images with different aspect ratios, as shown in Fig. 2. Additionally, when training a single model, we propose a novel intra-image patch mixup (PM) data augmentation method to improve the model’s learning ability on details and mitigate overfitting during training. Furthermore, to mitigate the distortion caused by unsatisfactory resizing, we employ an uneven stride strategy in the patchify step.

The key contributions of this work are:

  • Proposed using aspect ratio as a critical factor impacting vehicle Re-ID performance and ViT’s feature learning robustness.

  • Conducted novel and comprehensive experiments to explore the effects of various aspect ratios on ViT-based ReID.

  • Introduced dynamic feature fusion from models trained on different aspect ratios to enhance model generality.

  • Proposed an intra-image patch mixup (PM) data augmentation method to improve model learning ability and prevent overfitting.

  • Implemented an uneven stride strategy to reduce the distortion caused by unsatisfactory resizing.

II Related Works

Vision Transformer. The Vision Transformer (ViT) is an adaptation of the transformer architecture from natural language processing (NLP) tasks [vaswani2017attention] to computer vision. Dosovitskiy et al. introduced the ViT [dosovitskiy2010image], the first to demonstrate that a pure transformer applied directly to sequences of image patches can excel in large-scale image classification tasks. Within ViTs, the multi-head self-attention mechanism enables the model to capture diverse dependencies in images, including shapes, textures, and contextual relationships between objects in parallel, facilitating efficient learning of richer data representations. ViTs and their variants, such as DeiT [touvron2021training], Swin Transformer [liu2021swin], PVT [wang2021pyramid], and CPVT [chu2021conditional], prove beneficial across tasks including image classification, object detection, and Re-ID [khan2022transformers].

Vehicle Re-Identification. As discussed, vehicle Re-ID constitutes the primary focus of this study. Numerous works leveraging deep learning have achieved notable performance across various public vehicle Re-ID benchmarks. These works often utilize either CNN backbones [he2016deep, bashir2019vr, roman2021improving] or ViT backbones [he2021transreid, lian2022transformer, luo2021empirical, wei2022transformer]. Common loss functions employed in deep vehicle Re-ID network training include cross-entropy loss (ID loss) [zheng2017discriminatively], triplet loss [liu2017end], and contrastive loss [hadsell2006dimensionality].

Figure 4: Displayed are examples from the VehicleID (first two columns) and VeRi-776 (last two columns) test datasets, illustrating the effects of the intra-image patch mixup (PM) data augmentation method. This technique blends various parts of an image based on attention-driven distances, increasing image complexity to enhance model robustness and reduce overfitting. The top row presents images without the PM module, while the bottom row features images processed with the PM module.

III Method

Model Structure. We train one model for each major aspect ratio. The determination of how many models are needed for training with various aspect ratios is learned from the data, as illustrated in Fig. 3. For each model with a fixed aspect ratio, the resized input images undergo augmentation using a Patch Mixup (PM) module. Subsequently, a chosen Vision Transformer (ViT) backbone and its pre-trained weights are used for initialization. Features extracted from the last transformer layer are utilized for the ReID task.

Image Input with Adaptive Size and Shape. We employ a statistical method to estimate the dataset’s size and aspect ratios. Starting with the mean or median value of the dataset’s size is a suitable approach. For aspect ratios, clustering methods such as K-means can be used to generate several clusters representing the aspect ratios present in the original dataset. Subsequently, a combination of size and aspect ratio guides the resizing of images for the training of each specific model. Each model is trained on a fixed size and aspect ratio.
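The size and aspect-ratio estimation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `kmeans_1d` and `target_sizes` are hypothetical, the aspect ratio is taken as width divided by height (which matches the size/ratio pairs reported later in Table I), the median height is used as the target height, and the k-means initialization from evenly spaced quantiles is a choice made here for determinism.

```python
import statistics


def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means; returns the sorted cluster centers.

    Centers are initialized at evenly spaced quantiles of the data
    (an assumption made here so that the result is deterministic).
    """
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Keep the previous center if a cluster ends up empty.
        updated = [
            statistics.fmean(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def target_sizes(image_shapes, k=3, height=None):
    """image_shapes: list of (height, width) pairs from the training set.

    Returns one (H, W, aspect_ratio) triple per aspect-ratio cluster.
    """
    ratios = [w / h for h, w in image_shapes]
    if height is None:
        height = round(statistics.median(h for h, _ in image_shapes))
    return [(height, round(height * r), round(r, 2)) for r in kmeans_1d(ratios, k)]
```

Each returned triple would then define the fixed input size of one model in the ensemble.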

Patchification with uneven stride. To enhance the model’s ability to learn spatial relationships from image data, we incorporate uneven strides in different dimensions, based on the aspect ratio. Specifically, in our approach, the stride sizes $s_h$ and $s_w$ can vary, with the stride size being smaller in the shorter dimension compared to the longer dimension. We maintain a fixed patch size $p$ of 16. For an input of height $H$ and width $W$, the total patch number $n$ is calculated as $\left(\frac{H-p}{s_h}+1\right)\times\left(\frac{W-p}{s_w}+1\right)$.
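The patch-count formula can be checked with a few lines of Python. This is a sketch only: the function name `num_patches` is hypothetical, and flooring the division is an assumption made here (it mirrors how a strided sliding window behaves when the stride does not divide the remaining extent evenly; the formula in the text does not state a rounding rule).

```python
def num_patches(height, width, patch=16, stride_h=16, stride_w=16):
    """Return (rows, cols, total) patches for a strided sliding window.

    Floor division is an assumption for strides that do not divide
    (height - patch) or (width - patch) exactly.
    """
    rows = (height - patch) // stride_h + 1
    cols = (width - patch) // stride_w + 1
    return rows, cols, rows * cols


# Square input, non-overlapping patches.
print(num_patches(224, 224))                            # (14, 14, 196)
# Width is the shorter side, so it gets the smaller stride.
print(num_patches(224, 212, stride_h=16, stride_w=12))  # (14, 17, 238)
# Height is the shorter side, so it gets the smaller stride.
print(num_patches(224, 298, stride_h=12, stride_w=16))  # (18, 18, 324)
```

Under this reading, the smaller stride along the shorter side increases the number of (overlapping) patches in that direction, which brings the patch grid closer to square.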

Patch Mixup intra image module. We propose a novel intra-image data augmentation method where each patch of an image has a probability to mix adaptively with another randomly chosen patch from the same image. Mixup weights are determined based on the spatial distance between these two patches in pixels, with closer patches assigned higher mixup weights. Let $A$ denote the original patches with $n$ patches, and their positions are denoted by the coordinates of their top-left corners $(x_i, y_i)$ for the $i$-th patch. After patch partitioning, all patch indexes are fixed, with each patch having a unique index ranging from $0$ to $n-1$. The Euclidean distance between the centers of two patches indexed by $i$ and $j$ can be computed as:

$$d(i,j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2} \tag{1}$$

The patch distance matrix $D$ for all patches can then be represented as:

$$D=\begin{bmatrix}d(1,1)&d(1,2)&\cdots&d(1,n)\\ d(2,1)&d(2,2)&\cdots&d(2,n)\\ \vdots&\vdots&\ddots&\vdots\\ d(n,1)&d(n,2)&\cdots&d(n,n)\end{bmatrix} \tag{2}$$

Given a distance matrix $D$ where each element $D_{ij}$ represents the distance between patch $i$ and patch $j$, the attention scores matrix $S$ can be computed as:

$$S=\frac{1}{1+D \ast p} \tag{3}$$

In this formula, the attention score $S_{ij}$ increases as the distance $D_{ij}$ decreases, indicating that closer patches have more influence on each other. After shuffling patches, we generate a new patch-to-patch mixup weights matrix $S^{\prime}$ based on the patches’ indexes, as shown in Equation 4, where each value of $S^{\prime}$ is fetched from $S$ using the indexes of the patches:

$$S^{\prime}=S[A_{k},B_{k}^{\prime}] \tag{4}$$

Here, $k$ is the patch index of a patch from the original patches $A$, and $k^{\prime}$ is the patch index of the selected patch from the shuffled patches $B$ for mixup processing. Assume the original patches $A$ are generated using non-overlapping patchification; the patch distance matrix $D$ is then calculated from Equations 1 and 2 and rescaled as $D \leftarrow 16 \times D$, since each patch is a fixed-size square. The original spatial attention scores and the mixup weights matrix based on the shuffled patches are calculated using Equations 3 and 4. The final patches of this image, $F$, are given by $F=S^{\prime}\times B+(1-S^{\prime})\times A$, consistent with the mixing step in Algorithm 1. Several image samples before and after Patch Mixing (PM) data augmentation are shown in Fig. 4, where it can be observed that after PM, detailed information from different parts of the same vehicle is mixed. The ablation study result is shown in Table I. The pseudocode of this method is shown in Algorithm 1.
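As a rough numeric illustration of the attention score in Equation 3, assume (this is one reading of the text, not something it states explicitly) that distances are measured in patch-grid units and that the patch size $p=16$ rescales them to pixels. Three representative cases are then:

```latex
\begin{align*}
\text{same patch:}           \quad & D_{ii}=0        & \Rightarrow \quad S_{ii} &= \frac{1}{1+0\cdot 16} = 1,\\
\text{adjacent patch:}       \quad & D_{ij}=1        & \Rightarrow \quad S_{ij} &= \frac{1}{1+1\cdot 16} = \frac{1}{17}\approx 0.059,\\
\text{diagonal neighbour:}   \quad & D_{ij}=\sqrt{2} & \Rightarrow \quad S_{ij} &= \frac{1}{1+16\sqrt{2}} \approx 0.042.
\end{align*}
```

Under this reading, a patch paired with itself is left unchanged, while even an adjacent partner contributes only about 6% of the mixed patch, so the augmentation perturbs each patch mildly.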

Algorithm 1 Patch Mixing guided by Spatial Attention Scores in Vision Transformer
Require: An input image $I$ of dimensions $H\times W\times C$, a patch size $P$, and stride $S$ for overlapping.
$B \leftarrow$ batch size of $I$
$N \leftarrow$ number of patches along height
$M \leftarrow$ number of patches along width
$D \leftarrow$ patch distance matrix of size $(N\times M)\times(N\times M)$
$S \leftarrow$ compute attention scores from $D$
for $b=1$ to $B$ do
    $P_b \leftarrow$ extract patches from image $I_b$ with overlap
    $\pi \leftarrow$ random permutation of $\{1,\ldots,N\times M\}$
    $P^{\prime}_b \leftarrow P_b[\pi]$  $\triangleright$ shuffled patches using permutation $\pi$
    $S^{\prime}_b \leftarrow S[\pi]$  $\triangleright$ adjust attention scores with $\pi$
    for $i=1$ to $N\times M$ do
        $\lambda \leftarrow S^{\prime}_{b_{i,\pi(i)}}$  $\triangleright$ attention weight for current patch
        $P^{\prime\prime}_{b_i} \leftarrow (1-\lambda)\cdot P_{b_i}+\lambda\cdot P^{\prime}_{b_{\pi(i)}}$
    end for
    $O_b \leftarrow$ reconstruct output from mixed patches $P^{\prime\prime}_b$
end for
return $O$  $\triangleright$ Return the batch of mixed images
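The inner loop of Algorithm 1 can be sketched in plain Python for a single image. This is an illustrative sketch under several assumptions rather than the authors' code: `patch_mixup` is a hypothetical name, patches are represented as flat lists of floats in row-major grid order, distances are measured in patch-grid units and rescaled by the patch size as in Equation 3, and `perm[i]` is read as the index of the partner patch mixed into patch `i`. Patch extraction and image reconstruction are omitted.

```python
import math
import random


def patch_mixup(patches, rows, cols, patch=16, rng=None):
    """Mix every patch with a randomly assigned partner from the same image.

    patches: list of rows * cols flattened patches (lists of floats),
             stored in row-major grid order.
    Returns (mixed_patches, perm), where perm[i] is the partner of patch i.
    """
    if rng is None:
        rng = random.Random()
    n = rows * cols
    if len(patches) != n:
        raise ValueError("expected rows * cols patches")

    perm = list(range(n))
    rng.shuffle(perm)  # random permutation of patch indexes

    mixed = []
    for i, j in enumerate(perm):
        ri, ci = divmod(i, cols)
        rj, cj = divmod(j, cols)
        # Grid distance rescaled by the patch size (assumed reading of Eq. 3).
        dist = math.hypot(ri - rj, ci - cj) * patch
        lam = 1.0 / (1.0 + dist)  # closer partner gives a larger weight
        mixed.append(
            [(1.0 - lam) * a + lam * b for a, b in zip(patches[i], patches[j])]
        )
    return mixed, perm
```

A patch whose partner is itself gets a weight of 1 and is therefore returned unchanged; all other patches are dominated by their original content, with a small contribution from the partner.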

ReID Task. We optimize the network by constructing ID loss and triplet loss for global features. The ID loss $L_{ID}$ is the cross-entropy loss without label smoothing. For a triplet set $\{a,p,n\}$, the triplet loss $L_T$ with soft margin is defined as:

$$L_T=\log\left(1+\exp\left(\|\mathbf{f}_a-\mathbf{f}_p\|^2-\|\mathbf{f}_a-\mathbf{f}_n\|^2\right)\right)$$

where:

  • $\mathbf{f}_a$, $\mathbf{f}_p$, and $\mathbf{f}_n$ are the feature embeddings of the anchor, positive, and negative samples, respectively.

  • $\|\cdot\|^2$ denotes the squared Euclidean distance.
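The soft-margin triplet loss above can be evaluated for a single triplet as in the sketch below. The function names are hypothetical and the embeddings are plain Python lists; the only deviation from the literal formula is that $\log(1+\exp(x))$ is computed in a numerically stable form, which gives the same value without overflowing for large $x$.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def soft_margin_triplet_loss(anchor, positive, negative):
    """log(1 + exp(||f_a - f_p||^2 - ||f_a - f_n||^2)) for one triplet."""
    x = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    # Stable softplus: max(x, 0) + log(1 + exp(-|x|)) equals log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss approaches zero when the positive is much closer to the anchor than the negative, and grows roughly linearly in the distance gap when the ordering is violated.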

Inference Phase. As shown in Fig. 2, we use a Dynamic Feature Fusing strategy during inference to fuse features from models trained on multiple inputs. The output feature vector can be represented as:

$$\mathbf{F\_out}=w_1\cdot\mathbf{f}_1+w_2\cdot\mathbf{f}_2+w_3\cdot\mathbf{f}_3$$

or as a weighted concatenation:

$$\mathbf{F\_out}=\left[w_1\cdot\mathbf{f}_1;\,w_2\cdot\mathbf{f}_2;\,w_3\cdot\mathbf{f}_3\right]$$

where $w_1$, $w_2$, and $w_3$ are weights. For the adaptive weight assignment $w$:

$$w=\begin{cases}1.3,&\text{if }|\text{model\_ar}-\text{image\_ar}|\leq 0.3\\ 1.0,&\text{if }0.3<|\text{model\_ar}-\text{image\_ar}|\leq 0.6\\ 0.9,&\text{otherwise}\end{cases}$$

We investigated the ReID performance by summing or concatenating features, and the results are shown in Table I.
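The fusion rule can be sketched as follows, using the thresholds and weights given above. The function names `adaptive_weight` and `fuse` are hypothetical, features are plain Python lists, and the weighted sum assumes all models emit features of the same length; any feature normalization before or after fusion is not specified in the text and is left out.

```python
def adaptive_weight(model_ar, image_ar):
    """Weight for one model given its training aspect ratio and the image's."""
    gap = abs(model_ar - image_ar)
    if gap <= 0.3:
        return 1.3
    if gap <= 0.6:
        return 1.0
    return 0.9


def fuse(features, model_ars, image_ar, mode="sum"):
    """Fuse per-model feature vectors by weighted sum or weighted concatenation.

    features:  list of feature vectors, one per model
    model_ars: aspect ratio each model was trained on (same order)
    image_ar:  aspect ratio of the original query or gallery image
    """
    weights = [adaptive_weight(ar, image_ar) for ar in model_ars]
    scaled = [[w * x for x in f] for w, f in zip(weights, features)]
    if mode == "sum":
        return [sum(values) for values in zip(*scaled)]
    if mode == "concat":
        return [x for f in scaled for x in f]
    raise ValueError("mode must be 'sum' or 'concat'")
```

For example, with models trained on aspect ratios 1.0, 0.95, and 1.33 and an image of aspect ratio 1.4, the third model receives weight 1.3 and the other two receive weight 1.0.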

IV Experiments

IV-A Experiment Settings

Dataset. Two popular vehicle Re-ID benchmark datasets, VeRi-776 [zheng2020vehiclenet] and VehicleID [liu2016deep], are used in this work.

Evaluation Metrics. In line with the Re-ID field, we employ mAP (mean Average Precision) and CMC (Cumulative Matching Characteristic) as our performance estimation metrics. Specifically, we consider Rank-1 (R1), Rank-5 (R5), and Rank-10 (R10) accuracy.
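For reference, the two metrics can be computed from ranked match lists as in the sketch below. It is a simplified illustration with hypothetical function names: each query is represented by a list of booleans marking whether the gallery item at each rank (sorted by ascending feature distance) has the same identity, and benchmark-specific protocol details such as filtering same-camera gallery images are deliberately omitted.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[r] is True if rank r+1 is a correct match."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def rank_k_hit(ranked_matches, k):
    """True if at least one correct match appears in the top k results."""
    return any(ranked_matches[:k])


def evaluate(all_ranked_matches, ks=(1, 5, 10)):
    """Return (mAP, {k: Rank-k accuracy}) over a list of queries."""
    n = len(all_ranked_matches)
    mean_ap = sum(average_precision(r) for r in all_ranked_matches) / n
    cmc = {k: sum(rank_k_hit(r, k) for r in all_ranked_matches) / n for k in ks}
    return mean_ap, cmc
```

Rank-k only asks whether a correct match appears in the top k, whereas mAP also rewards placing all correct matches early in the ranking.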

| Dataset | Image Size/aspect ratio (w/ PM or w/o PM) | Stride Size | Fusing Strategy | mAP(%) | R1(%) | R5(%) | R10(%) |
|---|---|---|---|---|---|---|---|
| VeRi776 | [224,224]/1.0 (w/o) | [16,16] | - | 74.9 | 95.0 | 97.6 | 98.6 |
| VeRi776 | [224,224]/1.0 (w/) | [16,16] | - | 77.8 | 96.1 | 98.8 | 99.1 |
| VeRi776 | [224,212]/0.95 (w/o) | [16,12] | - | 75.5 | 95.2 | 97.9 | 98.5 |
| VeRi776 | [224,212]/0.95 (w/) | [16,12] | - | 73.2 | 93.5 | 97.4 | 98.4 |
| VeRi776 | [224,298]/1.33 (w/o) | [12,16] | - | 78.5 | 96.5 | 98.0 | 99.0 |
| VeRi776 | [224,298]/1.33 (w/) | [12,16] | - | 79.4 (4.5↑) | 96.1 | 98.2 | 99.0 |
| VeRi776 | Fusing three models (w/o) | - | Weighted Sum | 78.1 | 96.1 | 98.1 | 98.7 |
| VeRi776 | Fusing three models (w/) | - | Weighted Sum | 81.4 (6.5↑) | 97.0 (2.0↑) | 98.5 | 99.1 |
| VeRi776 | Fusing three models (w/o) | - | Weighted Concatenate | 78.0 | 95.9 | 98.1 | 98.6 |
| VeRi776 | Fusing three models (w/) | - | Weighted Concatenate | 81.4 | 97.0 | 98.5 | 99.0 |
| VehicleID | [384,384]/1.0 (w/o) | [16,16] | - | 90.2 | 85.0 | 97.5 | 99.3 |
| VehicleID | [384,384]/1.0 (w/) | [16,16] | - | 90.8 (0.6↑) | 85.7 | 97.8 | 99.0 |
| VehicleID | [384,308]/0.80 (w/o) | [16,12] | - | 89.9 | 84.1 | 97.6 | 99.2 |
| VehicleID | [384,308]/0.80 (w/) | [16,12] | - | 89.6 | 83.8 | 97.1 | 98.8 |
| VehicleID | [384,396]/1.03 (w/o) | [12,16] | - | 87.8 | 81.7 | 96.1 | 98.6 |
| VehicleID | [384,396]/1.03 (w/) | [12,16] | - | 90.2 | 85.2 | 97.1 | 98.8 |
| VehicleID | Fusing three models (w/o) | - | Weighted Sum | 90.1 | 84.8 | 97.3 | 98.9 |
| VehicleID | Fusing three models (w/) | - | Weighted Sum | 91.0 (0.8↑) | 86.3 (1.3↑) | 97.4 | 99.0 |
| VehicleID | Fusing three models (w/o) | - | Weighted Concatenate | 90.5 | 85.4 | 97.4 | 99.0 |
| VehicleID | Fusing three models (w/) | - | Weighted Concatenate | 90.7 | 85.3 | 97.7 | 99.1 |
TABLE I: The basic results of our method. The blue color marks the baseline result we used for comparison and the red color marks the best performance.

Implementation Details. After analyzing the data size and aspect ratio distributions from the VeRi-776 and VehicleID training datasets, we standardized the resized image heights to 224 pixels for VeRi-776 and 384 pixels for VehicleID. We used a uniform patch size of $16\times16$ for both training and testing phases, with stride sizes set to 16 for the longer dimension and 12 for the shorter. The aspect ratios were $[1.0, 0.95, 1.33]$ for VeRi-776 and $[1.0, 0.80, 1.03]$ for VehicleID. In the testing stage, we used the entire VeRi-776 test dataset and the largest VehicleID subset, containing 800 vehicles. Image preprocessing included 50% random horizontal flipping, padding, cropping, and erasing.

All experiments were conducted on four NVIDIA RTX A6000 GPUs using PyTorch with FP16 training. We used an SGD optimizer with a momentum of 0.1 and a weight decay of 1e-4, maintaining equal weights of 1.0 for ID and triplet losses. The batch size was set at 128, with four images per ID, across 120 training epochs. Initial learning rates were set at 0.035 for VeRi-776 and 0.045 for VehicleID, both decreasing linearly.

IV-B Results

Major Results of Our Method. Table I compares the baseline (square input without patch mixup (PM) data augmentation) against different input shapes, with and without PM. On VeRi-776, the non-square input of $224\times298$ with PM outperforms the square $224\times224$ baseline without PM by 4.5 points in mAP (79.4% vs. 74.9%); without PM, the same non-square input reaches 78.5%. With sum-based weighted feature fusion of three models trained with PM, the mAP improves by 6.5 points over the baseline (81.4%). On VehicleID, the square input of $384\times384$ with PM achieves 0.6 points higher mAP than the same input without PM (90.8% vs. 90.2%), while sum-based feature fusion raises mAP by 0.8 points (91.0%).

Comparison with the State-of-the-Art Methods. In Tables II and III, we compare our best model’s ReID performance with three state-of-the-art methods (RPTM [ghosh2023relation], CAL [rao2021counterfactual], and TransReID [he2021transreid]) on the VeRi-776 and VehicleID large test datasets. Our model, using a ViT-B/16 backbone, outperforms the pure ViT-B/16 baseline by 3.3% mAP and 0.5% R1 on VeRi-776 (81.5% vs. 78.2% mAP and 97.0% vs. 96.5% R1 in Table II). On VehicleID, our approach achieves 91.0% mAP and 97.4% R5, surpassing RPTM by 10.5% mAP and 1.1% R5. These results demonstrate the significant potential for improving ReID performance by solely modifying ViTs’ input without altering their architectures.

Figure 5: ReID Performance of Square Input on VeRi-776.
Figure 6: ReID Performance of Square Input on VehicleID.
| Backbone | Input Size | Method | mAP | R1 | R5 |
|---|---|---|---|---|---|
| ResNet-101 | 240x240 | RPTM[ghosh2023relation] | 88.0 | 97.3 | 98.4 |
| ResNet-50 | 384x192 | CAL[rao2021counterfactual] | 74.3 | 95.4 | 97.9 |
| ViT/B-16 | 384x128 | TransReID[he2021transreid] | 82.0 | 97.1 | - |
| ViT/B-16 | 256x128 | ViT/B-16 Baseline[he2021transreid] | 78.2 | 96.5 | - |
| ViT/B-16 | Fused | Ours | 81.5 | 97.0 | 98.5 |
TABLE II: Comparison with state-of-the-art methods of Re-ID on VeRi-776
| Backbone | Input Size | Method | mAP(%) | R1(%) | R5(%) |
|---|---|---|---|---|---|
| ResNet-101 | 240x240 | RPTM[ghosh2023relation] | 80.5 | 92.9 | 96.3 |
| ResNet-50 | 384x192 | CAL[rao2021counterfactual] | 80.9 | 75.1 | 88.5 |
| ViT/B-16 | 256x128 | TransReID[he2021transreid] | - | 85.2 | 97.5 |
| ViT/B-16 | 256x128 | ViT/B-16 Baseline[he2021transreid] | - | 83.5 | 96.7 |
| ViT/B-16 | Fused | Ours | 91.0 | 86.3 | 97.4 |
TABLE III: Comparison with state-of-the-art methods of Re-ID on VehicleID large dataset

Ablation Study: Square input with different sizes. We assess the impact of input size on ReID performance using square shapes of $224\times224$, $256\times256$, $384\times384$, and $416\times416$ for both the VeRi-776 and VehicleID datasets. All training runs are conducted without the PM module. The results are shown in Figs. 5 and 6. On both datasets, the ReID performance does not increase linearly with the input size. The highest ReID accuracy is observed when the input size is 256 pixels for VeRi-776 and 384 pixels for VehicleID.

Figure 7: Average inference time for a single image from the test dataset.
Figure 8: Attention map on VeRi-776 without and with PM module. The first row shows the results without using PM module, and the second row shows the results using PM module.
Figure 9: Attention map on VehicleID without and with PM module. The first row shows the results without using PM module, and the second row shows the results using PM module.

IV-C Results Analysis

Patch Number affects Average Inference Time. Fig. 7 illustrates the correlation between patch count and inference time per object. A larger number of patches increases the time cost, which indicates room for optimization. Although model fusion lengthens inference, this can be mitigated through mask or attention strategies that trim the patch count during training, thus curtailing parameter size.

How does patch mixup (PM) work? We use Grad-CAM to visualize attention (Figs. 8 and 9). Intra-image Patch Mixup improves the focus on the object, especially on VeRi-776, and enhances robustness to viewpoint and aspect ratio, notably on VehicleID. By utilizing self-attention, our method enhances spatial relations, improving global feature learning in ViTs.

V Conclusion

We explore the impact of aspect ratio on ViT-based vehicle ReID. Fusing models trained on varied aspect ratios boosts robustness and Re-ID performance, and our intra-image Patch Mixup augmentation enhances generalization. Our method outperforms the baselines on VeRi-776 and VehicleID. Although fusion may raise inference time, network pruning can be used to mitigate this. Future work aims at efficient ReID models for diverse aspect ratios.
