
I am training a BERT model using PyTorch and HuggingFace's BertModel. The sequences of tokens can vary in length from 1 (just a [CLS] token) to 128. The model trains fine when using absolute position embeddings, but when I switch to relative position embeddings (specifically setting position_embedding_type="relative_key"), training fails with DistributedDataParallel's unused-parameters error. When I investigate further (adding print statements as proposed in this thread), I find that the unused parameter is module.bert_model.embeddings.position_embeddings.weight.
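For reference, here is a minimal sketch of how I configure the model and check which parameters receive no gradient. The config values and dummy batch are illustrative, not my real training setup:

```python
import torch
from transformers import BertConfig, BertModel

# Small illustrative config; only position_embedding_type matters here
config = BertConfig(
    vocab_size=30522,
    max_position_embeddings=128,
    position_embedding_type="relative_key",  # switching from the default "absolute"
)
model = BertModel(config)

# Dummy batch: sequences padded to the maximum length of 128
input_ids = torch.randint(0, config.vocab_size, (4, 128))
attention_mask = torch.ones(4, 128, dtype=torch.long)

out = model(input_ids=input_ids, attention_mask=attention_mask)
loss = out.last_hidden_state.mean()
loss.backward()

# Diagnostic: list parameters that got no gradient from the backward pass
for name, p in model.named_parameters():
    if p.requires_grad and p.grad is None:
        print("unused:", name)
```

With position_embedding_type="relative_key" this prints embeddings.position_embeddings.weight; with "absolute" it prints nothing.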

I am aware that I can avoid this error by setting find_unused_parameters=True in DDP, and training then runs fine. But I'd like to understand why this is happening, to make sure there isn't a deeper problem.
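The workaround I'm currently using looks roughly like this (a single-process gloo init is shown only so the snippet runs standalone; my actual launch is the usual multi-process one):

```python
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group, just for illustration
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# `model` is the BertModel from the sketch above
ddp_model = DDP(model, find_unused_parameters=True)  # suppresses the unused-parameter error
```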

I tried padding all sequences to the maximum length, and that did not help. I tried setting the attention mask to all ones, and that did not help either. I would expect the model to train and to use the position_embeddings.weight parameter, but it does not.

Why would switching to relative position embeddings cause the position_embeddings.weight parameter to be unused?
