\addbibresource{main.bib}

Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification

Mei Qiu, Lauren Christopher, and Lingxi Li

Mei Qiu, Lauren Christopher, and Lingxi Li are with the Department of Electrical and Computer Engineering, Purdue University in Indianapolis, 723 West Michigan Street, SL-160, Indianapolis, Indiana 46202, USA. Emails: meiqiu@iu.edu, {lauchris,ll7}@iupui.edu.

Abstract

Vision Transformers (ViTs) have excelled in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video input might significantly affect the re-identification performance. To address this issue, we propose a novel ViT-based ReID framework that fuses models trained on a variety of aspect ratios. Our main contributions are threefold: (i) We analyze aspect ratio performance on the VeRi-776 and VehicleID datasets, guiding input settings based on the aspect ratios of the original images. (ii) We introduce intra-image patch-wise mixup during ViT patchification (guided by spatial attention scores) and implement uneven strides to better match object aspect ratios. (iii) We propose a dynamic feature fusing ReID network that enhances model robustness. Our ReID method achieves a significantly improved mean Average Precision (mAP) of 91.0%, compared to the closest state-of-the-art (CAL) result of 80.9%, on the VehicleID dataset.

I INTRODUCTION

Figure 1: The aspect ratio distributions of the training images in the ReID benchmark datasets VeRi-776 and VehicleID vary significantly. A substantial portion of the images in both datasets is non-square.

A fundamental task in intelligent transportation systems is vehicle re-identification (Re-ID): identifying vehicles across multiple non-overlapping cameras [wang2019survey]. Despite its significance, vehicle Re-ID encounters challenges due to variations in vehicle appearance across different viewpoints, poses, illumination, and backgrounds. Deep learning models face the challenge of extracting discriminative features resistant to viewpoint variations [zheng2020vehiclenet, chen2023global]. Both global and local features are essential for generating robust representations for vehicle pairs [peng2019learning, zhang2020part, gu2021efficient].

Numerous benchmark datasets contain images from real-world surveillance scenes, including VeRi-776 [zheng2020vehiclenet], PKU-VD [yan2017exploiting], VehicleID [liu2016deep], Vehicle-1M [guo2018learning], and VERI-Wild [lou2019veri], crucial for vehicle Re-ID AI development [zakria2021trends]. State-of-the-art models leverage the self-attention mechanism. Vision transformers (ViTs) have demonstrated superior ability in capturing discriminative details compared to previous CNN-based methods [he2021transreid]. Additionally, images in different datasets exhibit various aspect ratios, as illustrated in Fig. 1. VeRi-776 and VehicleID display different size and shape distributions, posing a significant challenge in model training.

However, unlike CNN-based models, which can handle varying aspect ratios to some extent due to their translation invariance and local receptive fields, vision transformers consider the entire image as a sequence of patches. The fixed patch size and sequential nature of transformers necessitate careful consideration of how input images are resized and cropped [ke2021musiq, liu2022aspect, mao2022towards, dehghani2023patch]. Early implementations of vision transformers adopted resizing strategies from CNNs, often distorting the original aspect ratio and potentially compromising performance, particularly in tasks reliant on object shape and scale.

To address this, subsequent studies explored padding strategies, adaptive post-patch extraction, trainable resizing networks, aspect-ratio aware attention mechanisms, and multi-scale/multi-aspect training [xia2022vision, lv2022scvit, hwang2022vision, zhu2022aret, ke2021musiq, li2022multi]. However, these approaches entail computational burdens, data requirements, and optimization challenges.

In summary, gaps persist in applying Vision Transformers (ViTs) to vehicle Re-ID. Optimization of scaling and resizing strategies, understanding aspect ratio effects, and exploring data augmentations such as mixup [zhang2018mixup] at the patch level for ViTs in multi-aspect ratio scenarios are needed. Intra-image mixup can potentially enhance the model’s ability to learn detailed features, particularly when the whole image is distorted by resizing with unsatisfactory aspect ratios. TransReID [he2021transreid] introduces a jigsaw patch module (JPM) to enhance feature robustness and discrimination ability, representing a step towards addressing these challenges. However, this intra-mixup occurs at the feature level; its potential at the pixel level needs to be comprehensively explored.

In this work, we are the first to propose that aspect ratio is a key factor affecting vehicle Re-ID performance and the robustness of feature learning in Vision Transformers (ViTs). We conduct a series of novel, scientific, and comprehensive experiments to explore the effects of various aspect ratios on ViT-based ReID.

To enhance the model’s generality to various aspect ratios in the input, we dynamically fuse features extracted from several models trained on images with different aspect ratios, as shown in Fig. 2. Additionally, when training a single model, we propose a novel intra-image patch mixup (PM) data augmentation method to improve the model’s learning ability on details and mitigate overfitting during training. Furthermore, to mitigate the distortion caused by unsatisfactory resizing, we employ an uneven stride strategy in the patchify step.

The key contributions of this work are:

• Proposed aspect ratio as a critical factor affecting vehicle Re-ID performance and the robustness of ViT feature learning.

• Conducted novel and comprehensive experiments to explore the effects of various aspect ratios on ViT-based ReID.

• Introduced dynamic feature fusion from models trained on different aspect ratios to enhance model generality.

• Proposed an intra-image patch mixup (PM) data augmentation method to improve model learning ability and prevent overfitting.

• Implemented an uneven stride strategy to reduce the distortion caused by unsatisfactory resizing.

II Related Works

Vision Transformer. The Vision Transformer (ViT) is an adaptation of the transformer architecture from natural language processing (NLP) tasks [vaswani2017attention] to computer vision. Dosovitskiy et al. introduced the ViT [dosovitskiy2010image], the first to demonstrate that a pure transformer applied directly to sequences of image patches can excel in large-scale image classification tasks. Within ViTs, the multi-head self-attention mechanism enables the model to capture diverse dependencies in images, including shapes, textures, and contextual relationships between objects in parallel, facilitating efficient learning of richer data representations. ViTs and their variants, such as DeiT [touvron2021training], Swin Transformer [liu2021swin], PVT [wang2021pyramid], and CPVT [chu2021conditional], prove beneficial across tasks including image classification, object detection, and Re-ID [khan2022transformers].

Vehicle Re-Identification. As discussed, vehicle Re-ID constitutes the primary focus of this study. Numerous works leveraging deep learning have achieved notable performance across various public vehicle Re-ID benchmarks. These works often utilize either CNN backbones [he2016deep, bashir2019vr, roman2021improving] or ViT backbones [he2021transreid, lian2022transformer, luo2021empirical, wei2022transformer]. Common loss functions employed in deep vehicle Re-ID network training include cross-entropy loss (ID loss) [zheng2017discriminatively], triplet loss [liu2017end], and contrastive loss [hadsell2006dimensionality].

III Method

Model Structure. We train one model for each major aspect ratio. The determination of how many models are needed for training with various aspect ratios is learned from the data, as illustrated in Fig. 3. For each model with a fixed aspect ratio, the resized input images undergo augmentation using a Patch Mixup (PM) module. Subsequently, a chosen Vision Transformer (ViT) backbone and its pre-trained weights are used for initialization. Features extracted from the last transformer layer are utilized for the ReID task.

Image Input with Adaptive Size and Shape. We employ a statistical method to estimate the dataset’s size and aspect ratios. Starting with the mean or median value of the dataset’s size is a suitable approach. For aspect ratios, clustering methods such as K-means can be used to generate several clusters representing the aspect ratios present in the original dataset. Subsequently, a combination of size and aspect ratio guides the resizing of images for the training of each specific model. Each model is trained on a fixed size and aspect ratio.
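The clustering step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `kmeans_1d`, the quantile-based initialization, and the sample ratios are hypothetical and are not taken from the paper, which does not specify its clustering setup.

```python
def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalar aspect ratios (width / height).

    Initialization picks evenly spaced quantiles of the sorted data, which is
    an assumption made here so that the result is deterministic.
    """
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            # Assign each ratio to its nearest center.
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Recompute each center as the mean of its cluster (keep empty ones).
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


if __name__ == "__main__":
    # Hypothetical aspect ratios measured on a training set.
    ratios = [1.33, 0.78, 1.0, 0.82, 1.36, 0.98, 0.8, 1.3, 1.02]
    print(kmeans_1d(ratios, k=3))
```

Each resulting center would then serve as the fixed aspect ratio of one model, combined with the chosen image height.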

Patchification with uneven stride. To enhance the model’s ability to learn spatial relationships from image data, we use different strides along the two image dimensions, chosen according to the aspect ratio. Specifically, the stride sizes $s_{h}$ and $s_{w}$ can differ, with a smaller stride along the shorter dimension than along the longer one. We maintain a fixed patch size $p$ of 16. The total patch number $n$ is calculated as $\left(\left\lfloor\frac{H-p}{s_{h}}\right\rfloor+1\right)\times\left(\left\lfloor\frac{W-p}{s_{w}}\right\rfloor+1\right)$.
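To make the patch-count formula concrete, the following minimal sketch evaluates it for input sizes and strides that appear in Table I. The helper name `patch_grid` is hypothetical, and floor division is an assumption made here for sizes that the stride does not divide evenly.

```python
def patch_grid(height, width, patch=16, stride_h=16, stride_w=16):
    """Return (rows, cols, total) of patches for a sliding-window patchify."""
    rows = (height - patch) // stride_h + 1  # patches along the height
    cols = (width - patch) // stride_w + 1   # patches along the width
    return rows, cols, rows * cols


if __name__ == "__main__":
    print(patch_grid(224, 224))                            # square, stride 16
    print(patch_grid(224, 212, stride_h=16, stride_w=12))  # width is shorter
    print(patch_grid(224, 298, stride_h=12, stride_w=16))  # height is shorter
```

Under these assumptions the smaller stride along the shorter side increases the number of patches there, so the token grid stays closer to square even when the image is not.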

Patch Mixup intra image module. We propose a novel intra-image data augmentation method in which each patch of an image has a probability of being mixed adaptively with another randomly chosen patch from the same image. Mixup weights are determined by the spatial distance between the two patches in pixels, with closer patches assigned higher mixup weights. Let $A$ denote the original set of $n$ patches, with the position of the $i$-th patch given by the coordinates of its top-left corner $(x_{i},y_{i})$. After patch partitioning, all patch indexes are fixed, with each patch having a unique index ranging from $1$ to $n$. Because all patches have the same size, the Euclidean distance between the centers of two patches indexed by $i$ and $j$ equals the distance between their top-left corners and can be computed as:

\begin{equation}
d(i,j)=\sqrt{(x_{i}-x_{j})^{2}+(y_{i}-y_{j})^{2}} \tag{1}
\end{equation}

The patch distance matrix $D$ for all patches can then be represented as:

\begin{equation}
D=\begin{bmatrix}d(1,1)&d(1,2)&\cdots&d(1,n)\\ d(2,1)&d(2,2)&\cdots&d(2,n)\\ \vdots&\vdots&\ddots&\vdots\\ d(n,1)&d(n,2)&\cdots&d(n,n)\end{bmatrix} \tag{2}
\end{equation}

Given a distance matrix $D$ where each element $D_{ij}$ represents the distance between patch $i$ and patch $j$ , the attention scores matrix $S$ can be computed as:

\begin{equation}
S_{ij}=\frac{1}{1+p\cdot D_{ij}} \tag{3}
\end{equation}

In this formula, the attention score $S_{ij}$ increases as the distance $D_{ij}$ decreases, indicating that closer patches have more influence on each other. After shuffling patches, we generate a new patch-to-patch mixup weights matrix $S^{\prime}$ based on the patches’ indexes, as shown in Equation 4, where each value of $S^{\prime}$ is fetched from $S$ using the indexes of the patches:

\begin{equation}
S^{\prime}=S[A_{k},B_{k^{\prime}}] \tag{4}
\end{equation}

Here, $k$ is the index of a patch from the original patches $A$, and $k^{\prime}$ is the index of the patch selected from the shuffled patches $B$ for mixup. Assume that the original patches $A$ are generated by non-overlapping patchification and that the patch distance matrix $D$ is computed from Equations 1 and 2 in units of patches; because every patch is a fixed-size square, the pixel distance is $16\times D$, which corresponds to the factor $p$ in Equation 3. The original spatial attention scores and the mixup weight matrix for the shuffled patches are then computed with Equations 3 and 4. The final patches $F$ of the image are given by $F=S^{\prime}\times B+(1-S^{\prime})\times A$. Several image samples before and after Patch Mixup (PM) data augmentation are shown in Fig. 4; after PM, detailed information from different parts of the same vehicle is mixed. The ablation study results are shown in Table I, and the pseudocode of the method is given in Algorithm 1.
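As a rough worked example of Equation 3, assume non-overlapping patches with $p=16$ and distances $D_{ij}$ measured in patch units. This reading of the scaling is an interpretation of the text rather than something the paper states explicitly.

```latex
\begin{align*}
D_{ij}=0 &: \quad S_{ij}=\frac{1}{1+16\cdot 0}=1, \\
D_{ij}=1 &: \quad S_{ij}=\frac{1}{1+16\cdot 1}=\frac{1}{17}\approx 0.059, \\
D_{ij}=\sqrt{2} &: \quad S_{ij}=\frac{1}{1+16\sqrt{2}}\approx 0.042 .
\end{align*}
```

Under these assumptions a horizontally adjacent partner contributes about 6% of the mixed patch while the original patch keeps about 94%, so the augmentation perturbs local detail without replacing it.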

Algorithm 1 Patch Mixing guided by Spatial Attention Scores in Vision Transformer

1: An input image $I$ of dimensions $H\times W\times C$, a patch size $P$, and stride $S$ for overlapping.
2: $B\leftarrow$ batch size of $I$
3: $N\leftarrow$ number of patches along height
4: $M\leftarrow$ number of patches along width
5: $D\leftarrow$ patch distance matrix of size $(N\times M)\times(N\times M)$
6: $S\leftarrow$ compute attention scores from $D$
7: for $b=1$ to $B$ do
8:     $P_{b}\leftarrow$ extract patches from image $I_{b}$ with overlap
9:     $\pi\leftarrow$ random permutation of $\{1,\ldots,N\times M\}$
10:     $P^{\prime}_{b}\leftarrow P_{b}[\pi]$ (shuffled patches using permutation $\pi$)
11:     $S^{\prime}_{b}\leftarrow S[\pi]$ (adjust attention scores with $\pi$)
12:     for $i=1$ to $N\times M$ do
13:         $\lambda\leftarrow S^{\prime}_{b_{i,\pi(i)}}$ (attention weight for current patch)
14:         $P^{\prime\prime}_{b_{i}}\leftarrow(1-\lambda)\cdot P_{b_{i}}+\lambda\cdot P^{\prime}_{b_{\pi(i)}}$
15:     end for
16:     $O_{b}\leftarrow$ reconstruct output from mixed patches $P^{\prime\prime}_{b}$
17: end for
18: return $O$ $\triangleright$ Return the batch of mixed images
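The core of Algorithm 1 can be sketched for a single image as follows. This is a simplified illustration, not the authors' implementation: it assumes a non-overlapping patch grid, represents each patch as a flat list of pixel values, measures distances in patch units scaled by the patch size, and the names `attention_scores`, `patch_mixup`, and the `prob` parameter are hypothetical.

```python
import math
import random


def attention_scores(rows, cols, patch=16):
    """S[i][j] = 1 / (1 + patch * grid distance) for a rows x cols patch grid."""
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    return [[1.0 / (1.0 + patch * math.dist(a, b)) for b in coords]
            for a in coords]


def patch_mixup(patches, rows, cols, patch=16, prob=1.0, rng=None):
    """Mix each patch with a randomly paired patch from the same image.

    Returns the mixed patches and the permutation that paired them.
    """
    rng = rng or random.Random()
    n = rows * cols
    scores = attention_scores(rows, cols, patch)
    perm = list(range(n))
    rng.shuffle(perm)  # perm[i] is the partner of patch i
    mixed = []
    for i in range(n):
        j = perm[i]
        if rng.random() >= prob:
            mixed.append(list(patches[i]))  # leave this patch unchanged
            continue
        lam = scores[i][j]  # closer partners get a larger weight
        mixed.append([(1 - lam) * a + lam * b
                      for a, b in zip(patches[i], patches[j])])
    return mixed, perm


if __name__ == "__main__":
    # Toy 2x2 grid with one "pixel" per patch.
    toy = [[0.0], [1.0], [2.0], [3.0]]
    out, pairing = patch_mixup(toy, rows=2, cols=2, rng=random.Random(0))
    print(pairing, out)
```

In a real pipeline the same weights would be applied to image tensors before the ViT patch embedding; the list-based version is only meant to show the pairing and weighting logic.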

ReID Task. We optimize the network with an ID loss and a triplet loss on the global features. The ID loss $L_{ID}$ is the cross-entropy loss without label smoothing. For a triplet set $\{a,p,n\}$, the triplet loss $L_{T}$ with soft margin is defined as:

\[
L_{T}=\log\left(1+\exp\left(||\mathbf{f}_{a}-\mathbf{f}_{p}||^{2}-||\mathbf{f}_{a}-\mathbf{f}_{n}||^{2}\right)\right)
\]

where:

• $\mathbf{f}_{a}$, $\mathbf{f}_{p}$, and $\mathbf{f}_{n}$ are the feature embeddings of the anchor, positive, and negative samples, respectively.

• $||\cdot||^{2}$ denotes the squared Euclidean distance.
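The soft-margin triplet loss above can be evaluated directly for one triplet. The sketch below uses plain Python lists as embeddings; the function names are hypothetical, and the softplus is written in a numerically stable form that is mathematically equal to $\log(1+\exp(x))$.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def soft_margin_triplet_loss(anchor, positive, negative):
    """log(1 + exp(d(a, p) - d(a, n))) with squared Euclidean distances."""
    x = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    # Stable softplus: max(x, 0) + log(1 + exp(-|x|)) avoids overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    a = [0.0, 0.0]
    print(soft_margin_triplet_loss(a, [1.0, 0.0], [3.0, 0.0]))  # easy triplet
    print(soft_margin_triplet_loss(a, [3.0, 0.0], [1.0, 0.0]))  # hard triplet
```

The loss approaches zero when the positive is much closer to the anchor than the negative and grows roughly linearly when the ordering is violated, which is why no explicit margin hyperparameter is needed.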

Inference Phase. As shown in Fig. 2, we use a Dynamic Feature Fusing strategy during inference to fuse features from models trained on multiple inputs. The output feature vector can be represented as:

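\[
\mathbf{F}_{\mathrm{out}}=w_{1}\cdot\mathbf{f}_{1}+w_{2}\cdot\mathbf{f}_{2}+w_{3}\cdot\mathbf{f}_{3}
\]

or as a weighted concatenation:

\[
\mathbf{F}_{\mathrm{out}}=\left[w_{1}\cdot\mathbf{f}_{1};\,w_{2}\cdot\mathbf{f}_{2};\,w_{3}\cdot\mathbf{f}_{3}\right]
\]

where $\mathbf{f}_{1}$, $\mathbf{f}_{2}$, and $\mathbf{f}_{3}$ are the features from the three models and $w_{1}$, $w_{2}$, and $w_{3}$ are their weights. Each weight $w$ is assigned adaptively from the gap between the aspect ratio the model was trained on (model\_ar) and the aspect ratio of the input image (image\_ar):

\[
w=\begin{cases}1.3,&\text{if }|\text{model\_ar}-\text{image\_ar}|\leq 0.3\\ 1.0,&\text{if }0.3<|\text{model\_ar}-\text{image\_ar}|\leq 0.6\\ 0.9,&\text{otherwise}\end{cases}
\]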

We investigated the ReID performance by summing or concatenating features, and the results are shown in Table I.
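The two fusion variants and the piecewise weight rule can be sketched as below. The function names and the toy feature vectors are hypothetical; the thresholds and weights are those given in the weight assignment above.

```python
def adaptive_weight(model_ar, image_ar):
    """Weight for one model, based on the aspect-ratio gap to the input image."""
    gap = abs(model_ar - image_ar)
    if gap <= 0.3:
        return 1.3
    if gap <= 0.6:
        return 1.0
    return 0.9


def fuse_features(features, model_ars, image_ar, mode="sum"):
    """Fuse per-model feature vectors by weighted sum or weighted concatenation."""
    weights = [adaptive_weight(ar, image_ar) for ar in model_ars]
    scaled = [[w * x for x in f] for w, f in zip(weights, features)]
    if mode == "sum":
        return [sum(column) for column in zip(*scaled)]
    if mode == "concat":
        return [x for f in scaled for x in f]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-D features
    ars = [1.0, 0.95, 1.33]                       # VeRi-776 model aspect ratios
    print(fuse_features(feats, ars, image_ar=1.25, mode="sum"))
    print(fuse_features(feats, ars, image_ar=1.25, mode="concat"))
```

The weighted sum keeps the feature dimension fixed, while concatenation triples it, which matters for gallery storage and distance computation at retrieval time.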

IV Experiments

IV-A Experiment Settings

Dataset. Two popular vehicle Re-ID benchmark datasets, VeRi-776 [zheng2020vehiclenet] and VehicleID [liu2016deep], are used in this work.

Evaluation Metrics. In line with the Re-ID field, we employ mAP (mean Average Precision) and CMC (Cumulative Matching Characteristic) as our performance estimation metrics. Specifically, we consider Rank-1 (R1), Rank-5 (R5), and Rank-10 (R10) accuracy.
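For reference, both metrics can be computed from ranked match lists as in the minimal sketch below. It assumes that, for each query, the gallery has already been sorted by distance and filtered according to the benchmark protocol, and that each entry is `True` when the gallery image shares the query identity. The function names are hypothetical.

```python
def average_precision(ranked_matches):
    """Average precision of one query from its ranked True/False match list."""
    hits, total = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / hits if hits else 0.0


def evaluate(all_ranked_matches, ranks=(1, 5, 10)):
    """Return (mAP, {k: Rank-k accuracy}) over all queries."""
    n = len(all_ranked_matches)
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / n
    # Rank-k counts a query as correct if any of its top-k results matches.
    cmc = {k: sum(any(m[:k]) for m in all_ranked_matches) / n for k in ranks}
    return mean_ap, cmc


if __name__ == "__main__":
    queries = [[True, False, True], [False, True, False]]
    print(evaluate(queries, ranks=(1, 5)))
```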

\begin{tabular}{llllllll}
\hline
Dataset & Image size / aspect ratio (w/ or w/o PM) & Stride size & Fusing strategy & mAP (\%) & R1 (\%) & R5 (\%) & R10 (\%) \\
\hline
VeRi-776 & [224,224] / 1.0 (w/o) & [16,16] & - & 74.9 & 95.0 & 97.6 & 98.6 \\
 & [224,224] / 1.0 (w/) & [16,16] & - & 77.8 & 96.1 & 98.8 & 99.1 \\
 & [224,212] / 0.95 (w/o) & [16,12] & - & 75.5 & 95.2 & 97.9 & 98.5 \\
 & [224,212] / 0.95 (w/) & [16,12] & - & 73.2 & 93.5 & 97.4 & 98.4 \\
 & [224,298] / 1.33 (w/o) & [12,16] & - & 78.5 & 96.5 & 98.0 & 99.0 \\
 & [224,298] / 1.33 (w/) & [12,16] & - & 79.4 (4.5$\uparrow$) & 96.1 & 98.2 & 99.0 \\
 & Fusing three models (w/o) & - & Weighted Sum & 78.1 & 96.1 & 98.1 & 98.7 \\
 & Fusing three models (w/) & - & Weighted Sum & 81.4 (6.5$\uparrow$) & 97.0 (2.0$\uparrow$) & 98.5 & 99.1 \\
 & Fusing three models (w/o) & - & Weighted Concatenate & 78.0 & 95.9 & 98.1 & 98.6 \\
 & Fusing three models (w/) & - & Weighted Concatenate & 81.4 & 97.0 & 98.5 & 99.0 \\
\hline
VehicleID & [384,384] / 1.0 (w/o) & [16,16] & - & 90.2 & 85.0 & 97.5 & 99.3 \\
 & [384,384] / 1.0 (w/) & [16,16] & - & 90.8 (0.6$\uparrow$) & 85.7 & 97.8 & 99.0 \\
 & [384,308] / 0.80 (w/o) & [16,12] & - & 89.9 & 84.1 & 97.6 & 99.2 \\
 & [384,308] / 0.80 (w/) & [16,12] & - & 89.6 & 83.8 & 97.1 & 98.8 \\
 & [384,396] / 1.03 (w/o) & [12,16] & - & 87.8 & 81.7 & 96.1 & 98.6 \\
 & [384,396] / 1.03 (w/) & [12,16] & - & 90.2 & 85.2 & 97.1 & 98.8 \\
 & Fusing three models (w/o) & - & Weighted Sum & 90.1 & 84.8 & 97.3 & 98.9 \\
 & Fusing three models (w/) & - & Weighted Sum & 91.0 (0.8$\uparrow$) & 86.3 (1.3$\uparrow$) & 97.4 & 99.0 \\
 & Fusing three models (w/o) & - & Weighted Concatenate & 90.5 & 85.4 & 97.4 & 99.0 \\
 & Fusing three models (w/) & - & Weighted Concatenate & 90.7 & 85.3 & 97.7 & 99.1 \\
\hline
\end{tabular}

TABLE I: The basic results of our method. The blue color marks the baseline result we used for comparison and the red color marks the best performance.

Implementation Details. After analyzing the data size and aspect ratio distributions from the VeRi-776 and VehicleID training datasets, we standardized the resized image heights to 224 pixels for VeRi-776 and 384 pixels for VehicleID. We used a uniform patch size of $16\times 16$ for both training and testing phases, with stride sizes set to 16 for the longer dimension and 12 for the shorter. The aspect ratios were $[1.0,0.95,1.33]$ for VeRi-776 and $[1.0,0.80,1.03]$ for VehicleID. In the testing stage, we used the entire VeRi-776 test dataset and the largest VehicleID subset, containing 800 vehicles. Image preprocessing included 50% random horizontal flipping, padding, cropping, and erasing.

All experiments were conducted on four NVIDIA RTX A6000 GPUs using PyTorch with FP16 training. We used an SGD optimizer with a momentum of 0.1 and a weight decay of 1e-4, maintaining equal weights of 1.0 for ID and triplet losses. The batch size was set at 128, with four images per ID, across 120 training epochs. Initial learning rates were set at 0.035 for VeRi-776 and 0.045 for VehicleID, both decreasing linearly.

IV-B Results

Major Results of Our Method. Table I compares performance across input shapes, with and without patch mixup (PM) data augmentation; the square input without PM serves as the baseline. On VeRi-776, the non-square $224\times 298$ input with PM outperforms the $224\times 224$ baseline by 4.5% in mAP (79.4% vs. 74.9%), and sum-based weighted feature fusion of the three models trained with PM improves mAP by 6.5% (81.4%). On VehicleID, adding PM to the square $384\times 384$ input raises mAP by 0.6% (90.8% vs. 90.2%), while weighted-sum feature fusion raises it by 0.8% (91.0%).

Comparison with the State-of-the-Art Methods. In Tables II and III, we compare our best model’s ReID performance with three state-of-the-art methods (RPTM [ghosh2023relation], CAL [rao2021counterfactual], and TransReID [he2021transreid]) on the VeRi-776 and VehicleID large test datasets. Our model, using a ViT-B/16 backbone, outperforms the pure ViT-B/16 baseline by 3.3% mAP and 0.5% R1 on VeRi-776. On VehicleID, our approach achieves 91.0% mAP and 97.4% R5, surpassing RPTM by 10.5% mAP and 1.1% R5. These results demonstrate the significant potential for improving ReID performance solely by modifying the input of ViTs, without altering their architectures.

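\begin{tabular}{llllll}
\hline
Backbone & Input size & Method & mAP (\%) & R1 (\%) & R5 (\%) \\
\hline
ResNet-101 & 240x240 & RPTM [ghosh2023relation] & 88.0 & 97.3 & 98.4 \\
ResNet-50 & 384x192 & CAL [rao2021counterfactual] & 74.3 & 95.4 & 97.9 \\
ViT/B-16 & 384x128 & TransReID [he2021transreid] & 82.0 & 97.1 & - \\
ViT/B-16 & 256x128 & ViT/B-16 Baseline [he2021transreid] & 78.2 & 96.5 & - \\
ViT/B-16 & Fused & Ours & 81.5 & 97.0 & 98.5 \\
\hline
\end{tabular}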

TABLE II: Comparison with state-of-the-art methods of Re-ID on VeRi-776

\begin{tabular}{llllll}
\hline
Backbone & Input size & Method & mAP (\%) & R1 (\%) & R5 (\%) \\
\hline
ResNet-101 & 240x240 & RPTM [ghosh2023relation] & 80.5 & 92.9 & 96.3 \\
ResNet-50 & 384x192 & CAL [rao2021counterfactual] & 80.9 & 75.1 & 88.5 \\
ViT/B-16 & 256x128 & TransReID [he2021transreid] & - & 85.2 & 97.5 \\
ViT/B-16 & 256x128 & ViT/B-16 Baseline [he2021transreid] & - & 83.5 & 96.7 \\
ViT/B-16 & Fused & Ours & 91.0 & 86.3 & 97.4 \\
\hline
\end{tabular}

TABLE III: Comparison with state-of-the-art methods of Re-ID on VehicleID large dataset

Ablation Study: Square input with different sizes. We assess the impact of input size on ReID performance using square inputs of $224\times 224$, $256\times 256$, $384\times 384$, and $416\times 416$ on both the VeRi-776 and VehicleID datasets. All models are trained without the PM module. The results are shown in Figs. 5 and 6. On both datasets, ReID performance does not increase linearly with the input size. The highest ReID accuracy is observed at an input size of 256 pixels for VeRi-776 and 384 pixels for VehicleID.

IV-C Results Analysis

Patch Number Affects Average Inference Time. Fig. 7 illustrates the correlation between patch count and inference time per object: more patches mean a higher time cost, which indicates room for optimization. Although model fusion lengthens inference, this can be mitigated with mask or attention strategies that trim the patch count during training, thereby curtailing parameter size.

How does patch mixup (PM) work? We visualize attention with Grad-CAM (Figs. 8 and 9). Intra-image Patch Mixup improves the focus on the object, especially on VeRi-776, and enhances robustness to viewpoint and aspect ratio, notably on VehicleID. By exploiting self-attention, our method strengthens the learning of spatial relations and thus improves global feature learning in ViTs.

V Conclusion

We explore the impact of aspect ratio on ViT-based vehicle ReID. Fusing models trained on varied aspect ratios boosts robustness and Re-ID performance, and our intra-image Patch Mixup augmentation enhances generalization. Our method outperforms the baselines on VeRi-776 and VehicleID. Although fusion may raise inference time, network pruning can be used to mitigate this. Future work aims at efficient ReID models for diverse aspect ratios.

\printbibliography