Why doesn't the Euclidean distance between tokens in the vocabulary reflect similarity? #647

Open
Cryptocxf opened this issue Apr 23, 2025 · 0 comments

Comments

@Cryptocxf

Hello, I deployed the R1 7B model (DeepSeek-R1-Distill-Qwen-7B) locally and wanted to measure the similarity between common Chinese tokens in the vocabulary, so I used the following code to get the token embedding vectors:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("./my_path/DeepSeek-R1-Distill-Qwen-7B")
model = AutoModel.from_pretrained("/my_path/DeepSeek-R1-Distill-Qwen-7B")
model.eval()
# Input embedding matrix, shape (vocab_size, hidden_size)
embeddings = model.get_input_embeddings().weight.data

I computed the Euclidean distance between the token "苹果" (apple) and every other Chinese token and sorted the results in ascending order, but the 100 nearest tokens are semantically nothing like "苹果". Why is that? Did I obtain the embedding vectors incorrectly, or is something else going on?
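
For reference, the nearest-token lookup looks roughly like this (a minimal sketch of what I did; nearest_tokens and k are just illustrative names, and it assumes the query string maps to exactly one vocabulary token):

import torch

def nearest_tokens(token_str, k=100):
    # Assumes token_str corresponds to a single entry in the vocabulary.
    ids = tokenizer.encode(token_str, add_special_tokens=False)
    assert len(ids) == 1, "query string must map to exactly one token"
    query = embeddings[ids[0]].unsqueeze(0)            # (1, hidden_size)
    dists = torch.cdist(query, embeddings).squeeze(0)  # Euclidean distance to every vocab row
    idx = torch.topk(dists, k + 1, largest=False).indices.tolist()
    # Drop the query token itself and decode the neighbours.
    return [tokenizer.decode([i]) for i in idx if i != ids[0]][:k]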

I also tried Euclidean distances between English tokens, and none of the 20 tokens nearest to "apple" are semantically close to "apple" either!
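
The English check was done the same way, e.g. (again just illustrative, using the sketch above):

print(nearest_tokens("苹果", k=100)[:20])
print(nearest_tokens("apple", k=20))  # neighbours are not semantically related to "apple"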
