
I am using a pre-trained LLM to generate a representative embedding for an input text. But it is weird that the output embeddings are all the same regardless of the input text.

The code:

from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np
import torch
PRETRAIN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)

def generate_embedding(document):
    inputs = tokenizer(document, return_tensors='pt')
    print("Tokenized inputs:", inputs)
    with torch.no_grad():
        outputs = model(**inputs)
    embedding = outputs.last_hidden_state[0, 0, :].numpy()
    print("Generated embedding:", embedding)
    return embedding

text1 = "this is a test"
text2 = "this is another test"
text3 = "there are other tests"

embedding1 = generate_embedding(text1)
embedding2 = generate_embedding(text2)
embedding3 = generate_embedding(text3)

are_equal = np.array_equal(embedding1, embedding2) and np.array_equal(embedding2, embedding3)

if are_equal:
    print("The embeddings are the same.")
else:
    print("The embeddings are not the same.")

The printed tokens are different, but the printed embeddings are the same. The outputs:

Tokenized inputs: {'input_ids': tensor([[   1,  456,  349,  264, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
Tokenized inputs: {'input_ids': tensor([[   1,  456,  349, 1698, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
Tokenized inputs: {'input_ids': tensor([[   1,  736,  460,  799, 8079]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
The embeddings are the same.

Does anyone know where the problem is? Many thanks!


1 Answer


You're not slicing the dimensions right at

outputs.last_hidden_state[0, 0, :].numpy()

Q: What is the 0th token in all inputs?

A: The beginning-of-sentence (BOS) token.

Q: So the "embedding" I'm slicing is just the BOS token's hidden state?

A: Yes. Try this and compare it with your printed embeddings:

from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np

PRETRAIN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)

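# an empty string tokenizes to just the BOS token (the tokenizer prepends BOS by default),
# so this prints the hidden state of the BOS token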
model(**tokenizer("", return_tensors='pt')).last_hidden_state

[out]:

tensor([[[-1.7763,  1.9293, -2.2413,  ...,  2.6380, -3.1049,  4.8060]]],
       grad_fn=<MulBackward0>)

Q: Then, how do I get the embeddings from a decoder-only model?

A: Can you really get a single "embedding" from a decoder-only model? The model outputs one hidden state per token as it auto-regresses through the sequence, so different texts produce output tensors of different sizes.
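
For example, a quick shape check (a sketch reusing the model, tokenizer and torch import from the question's code):

for text in ["this is a test", "there are other tests and then some"]:
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    # shape is [batch, seq_len, hidden_dim]; seq_len changes with the input text
    print(text, '->', hidden.shape)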

Q: How do you turn that into a single fixed-size vector then?

A: Most probably, by pooling over the token hidden states, e.g. mean pooling across the sequence (or taking the last token's hidden state).
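
A minimal mean-pooling sketch, assuming the same model, tokenizer and torch import as above (the attention-mask weighting only matters once you batch padded inputs, but it keeps the function general):

def generate_embedding(document):
    inputs = tokenizer(document, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state              # [1, seq_len, hidden_dim]
    mask = inputs['attention_mask'].unsqueeze(-1)   # [1, seq_len, 1]
    # average the hidden states of the non-padding tokens -> one fixed-size vector
    embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return embedding[0].numpy()

With this, embedding1, embedding2 and embedding3 in the question's script should come out different.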

  • Thank you so much for your detailed answer @alvas. I thought I could use the embedding of the special token at the beginning to represent the whole sequence, just like the CLS token embedding is used to represent a sequence for text classification. Now it turns out that the embedding of the BOS token in this model stays almost the same for different input texts. I need to at least do some pooling over the embeddings of the tokens in the sequence then. Thanks!
    – Howie
    Commented Apr 12 at 9:12
  • @Howie: you might want to look at this answer for fetching sentence embeddings from decoder models.
    – cronoik
    Commented Apr 25 at 21:53
