Clarification for behavior of en_core_web_lg with doc.tensor
#4787
Replies: 3 comments
-
The way I understand it, the tensors hold information related to the predictions of the statistical models (such as the tagger or parser), while the vectors represent semantic similarity. The
But I think you're right that this could be clarified in the docs more.
-
Doc.vector is a real-valued meaning representation that defaults to an average of the token vectors (taken from https://spacy.io/api/doc). So they are taking the average: if doc.tensor.shape = (7, 768), then doc.vector.shape = (768,), which is nothing but the average of all 7 rows of 768 dimensions, giving one row with 768 dimensions.
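To make the averaging step concrete, here is a minimal NumPy sketch. It does not call spaCy: the random token_rows array is only a stand-in (an assumption for illustration) for a matrix with one 768-dimensional row per token.

```python
import numpy as np

# Stand-in for per-token rows: 7 tokens, 768 dimensions each.
# Random values, for illustration only; this is not real model output.
rng = np.random.default_rng(0)
token_rows = rng.normal(size=(7, 768))

# Averaging over the token axis collapses the 7 rows into a single row.
averaged = token_rows.mean(axis=0)

print(token_rows.shape)  # (7, 768)
print(averaged.shape)    # (768,)
```

The same reduction applies to any number of tokens: the result always has the width of a single row.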
-
This section has been further expanded for the v3 docs: https://nightly.spacy.io/usage/linguistic-features#vectors-similarity. So I think we can close this one? If you have further suggestions for clarifications in the docs, feel free to open a PR :-)
-
Using the en_core_web_lg model, I was comparing the output from doc.tensor and doc.vector. Until recently, I was expecting doc.tensor to be a matrix representation of the stacked word vectors (i.e., [token.vector for token in doc]). As it turns out, doc.tensor is in fact similar in dimension to the doc.tensor produced by en_core_web_sm or en_core_web_md (although the matrix entries do not match), which is not the dimension of the distinct token.vectors and of doc.vector.
I was wondering whether this definition should be clarified in the Doc documentation page. I have not spotted any comment on the differences (or similarities) of the two properties. This is also especially relevant when considering the calculation of .similarity(), where it is not immediately clear from the docs whether this is based on doc.tensor or doc.vector.
Furthermore, I could not find any comment on the specific computation of the tensors, so maybe a paragraph on this could also help clarify this issue?
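As a rough illustration of the similarity question, the sketch below uses plain NumPy rather than spaCy. The spaCy API docs describe the default Doc.similarity estimate as cosine similarity using an average of word vectors, and the sketch mimics only that description. The arrays are random stand-ins, the width of 300 is chosen for illustration, and cosine is a hypothetical helper. Nothing here reflects how doc.tensor is computed.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)

# Random stand-ins for [token.vector for token in doc] of two documents.
doc_a_tokens = rng.normal(size=(5, 300))  # 5 tokens
doc_b_tokens = rng.normal(size=(8, 300))  # 8 tokens

# Averaging over tokens gives one vector per document, whatever its length.
doc_a_vector = doc_a_tokens.mean(axis=0)
doc_b_vector = doc_b_tokens.mean(axis=0)

similarity = cosine(doc_a_vector, doc_b_vector)
print(doc_a_vector.shape, doc_b_vector.shape)
print(similarity)
```

Because both documents are reduced to a single vector of the same width, they can be compared even though they contain different numbers of tokens.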
Which page or section is this issue related to?
https://github.com/explosion/spaCy/blob/master/website/docs/api/doc.md