What does it mean to have CNN as both model and tok2vec? #12260
-
Apologies if this has been asked previously, but diving deeper into my configurations, I realized I'm not sure I understand what it means for both the model component and the model's tok2vec sublayer to specify a CNN. The component section looks like this:
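It's essentially the standard TextCatCNN setup from the docs, something like this (with the usual default values):

```ini
[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "spacy.TextCatCNN.v2"
exclusive_classes = true
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
```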
In my understanding, the HashEmbedCNN embeds each word and encodes it as a vector containing its context. That would mean for each document we have a T×E matrix (where T is the number of tokens and E is the encoding size).

The docs for TextCatCNN say: "A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network." If I understand correctly, the tok2vec component is creating the token vectors. Does the TextCatCNN in the config above just implement the mean pooling and feed-forward network, or is it doing another set of convolutions on the encoded text?

I read through this post, but was a bit confused by what it said about sentence splitting. Is that going on under the hood? Hope this question makes sense; I realized I've been using this for a while without really understanding it.
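To make the shapes concrete, here's a little sketch I used to check what the embedding layer produces on its own (building just the HashEmbedCNN from a config fragment; the example sentence and values are mine):

```python
import spacy
from thinc.api import Config
from spacy.util import registry

# Build only the HashEmbedCNN tok2vec layer from a config fragment.
cfg = Config().from_str("""
[model]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
""")
tok2vec = registry.resolve(cfg)["model"]

nlp = spacy.blank("en")
doc = nlp("spaCy encodes every token in context.")
tok2vec.initialize(X=[doc])
vectors = tok2vec.predict([doc])  # List[Floats2d]: one T x E array per Doc
print(vectors[0].shape)           # (7, 96): T tokens, E = width
```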
-
This comes down to the details of the "listener" mechanism. Rest assured that we don't love this part either --- I tried really hard to find a better solution, and unfortunately I still don't have one.

The listener is a way for multiple components to share weights. So you can have a textcat and a POS tagger, and they both get the same token vectors, and those token vectors will be updated by gradients from both components.

Consider the following two layer definitions:
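Abbreviated sketch of the two (signatures paraphrased from the HashEmbedCNN and Tok2VecListener source linked at the bottom of this thread; bodies elided):

```python
from typing import List, Optional
from spacy.tokens import Doc
from thinc.api import Model
from thinc.types import Floats2d

def build_hash_embed_cnn_tok2vec(
    *, width: int, depth: int, embed_size: int, window_size: int,
    maxout_pieces: int, subword_features: bool,
    pretrained_vectors: Optional[bool],
) -> Model[List[Doc], List[Floats2d]]:
    """Registered as "spacy.HashEmbedCNN.v2": embeds each token with hashed
    (sub)word features, then contextualizes the sequence with a CNN that is
    `depth` layers deep."""
    ...

def tok2vec_listener_v1(
    width: int, upstream: str = "*"
) -> Model[List[Doc], List[Floats2d]]:
    """Registered as "spacy.Tok2VecListener.v1": computes nothing itself; it
    forwards the output of an upstream tok2vec component and relays the
    gradients from its own consumer back to that component."""
    ...
```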
Both of these functions return a `Model[List[Doc], List[Floats2d]]`. To the component that owns the sublayer, the listener is indistinguishable from a real tok2vec layer, which is what lets several components share one set of token vectors.
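In config terms, that sharing looks roughly like this minimal (hypothetical) example, with a tagger as the listening component; a textcat would plug in its own listener the same way:

```ini
# A shared tok2vec component that does the real work:
[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v2"
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
pretrained_vectors = null

# A tagger whose tok2vec sublayer is just a listener:
[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.width}
upstream = "tok2vec"
```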
-
The CNN in the textcat ensemble is this chain:

```python
cnn_model = (
    tok2vec
    >> list2ragged()
    >> attention_layer
    >> reduce_sum()
    >> residual(maxout_layer >> norm_layer >> Dropout(0.0))
)
```

This code represents a chain of functions. In the usual notation it would be something like `residual_block(reduce_sum(attention_layer(list2ragged(tok2vec(X)))))`. This is going to be combined with the bag-of-words model:

spacy/pipeline/textcat.py, line 38 at 2d4fb94: `"spacy.TextCatBOW.v2"`
The architectures get combined here. So in conclusion, one member of the textcat ensemble is the neural network that starts with a tok2vec layer (the `cnn_model` above), and the other is the linear bag-of-words model.
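If the operator notation is unfamiliar: `>>` and `|` are Thinc combinators (`chain` and `concatenate`). Here is a toy, self-contained sketch of the same combination pattern, with plain `Linear` layers standing in for the two ensemble members (not spaCy's actual code):

```python
import numpy
from thinc.api import Model, chain, concatenate, Linear, Softmax

# `>>` composes layers into a pipeline; `|` runs layers on the same input
# and concatenates their outputs, as in (linear_model | cnn_model).
with Model.define_operators({">>": chain, "|": concatenate}):
    member_a = Linear(nO=4, nI=8)  # stand-in for the bag-of-words member
    member_b = Linear(nO=4, nI=8)  # stand-in for the CNN member
    model = (member_a | member_b) >> Softmax(nO=3, nI=8)

X = numpy.zeros((2, 8), dtype="f")
model.initialize(X=X)
print(model.predict(X).shape)  # (2, 3): batch of 2 inputs, 3 classes
```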
HashEmbedCNN: https://github.com/explosion/spaCy/blob/v3.5.0/spacy/ml/models/tok2vec.py#L34
Tok2VecListener: https://github.com/explosion/spaCy/blob/v3.5.0/spacy/ml/models/tok2vec.py#L17, https://github.com/explosion/spaCy…