
I am training a BERT model using PyTorch and HuggingFace's BertModel. The sequences of tokens can vary in length from 1 (just a [CLS] token) to 128. The model trains fine when using absolute position embeddings, but when I switch to relative position embeddings (specifically setting position_embedding_type="relative_key"), training fails with DistributedDataParallel's unused-parameters error. When I investigate further (adding print statements as proposed in this thread), I find that the unused parameter is module.bert_model.embeddings.position_embeddings.weight.
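For reference, here is a minimal sketch of how I configure the model and check which parameters receive no gradient. The config values and dummy batch are illustrative, not my real training setup:

```python
import torch
from transformers import BertConfig, BertModel

# Small illustrative config; only position_embedding_type matters here
config = BertConfig(
    vocab_size=30522,
    max_position_embeddings=128,
    position_embedding_type="relative_key",  # switching from the default "absolute"
)
model = BertModel(config)

# Dummy batch: sequences padded to the maximum length of 128
input_ids = torch.randint(0, config.vocab_size, (4, 128))
attention_mask = torch.ones(4, 128, dtype=torch.long)

out = model(input_ids=input_ids, attention_mask=attention_mask)
loss = out.last_hidden_state.mean()
loss.backward()

# Diagnostic: list parameters that got no gradient from the backward pass
for name, p in model.named_parameters():
    if p.requires_grad and p.grad is None:
        print("unused:", name)
```

With position_embedding_type="relative_key" this prints embeddings.position_embeddings.weight; with "absolute" it prints nothing.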

I am aware that I can avoid this error by setting find_unused_parameters=True in DDP, and training then runs fine. But I'd like to understand why this is happening, to make sure there isn't a deeper problem.
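The workaround I'm currently using looks roughly like this (a single-process gloo init is shown only so the snippet runs standalone; my actual launch is the usual multi-process one):

```python
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group, just for illustration
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# `model` is the BertModel from the sketch above
ddp_model = DDP(model, find_unused_parameters=True)  # suppresses the unused-parameter error
```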

I tried padding all sequences to the maximum length, and that did not help. I tried setting the attention mask to all ones, and that did not help either. I would expect the model to train and to use the position_embeddings.weight parameter, but it does not.

Why would switching to relative position embeddings cause the position_embeddings.weight parameter to be unused?
