Is it possible? Extracting subject, predicate, conditional, and prepositional clauses with a trained spaCy model, as opposed to rule-based matching #13125
grahamanderson asked this question in Help: Coding & Implementations
In general, I would recommend using a syntactic parse (either constituency or dependency) with rules to identify the types of clauses you're interested in. Our provided trained pipelines only include dependency parsers, but it can sometimes be simpler to identify clauses from constituency parses, so it could be worth checking out constituency parsers from outside the core spaCy library. A few examples of projects that include clause detection:
Question: Has anyone identified clause spans with a trained spaCy model, and was it effective?
[training...bunch of sentences]
[
{
"text": "When the levee breaks, the cat will smile.",
"spans": [
{"start": 0, "end": 21, "label": "CONDITIONAL_CLAUSE"},
{"start": 23, "end": 30, "label": "SUBJECT_CLAUSE"},
{"start": 31, "end": 41, "label": "PREDICATE_CLAUSE"}
]
},
// many more sentences...
]
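As a rough sketch of how annotations in this shape could be turned into spaCy training data: the character offsets are mapped to token-aligned spans with `Doc.char_span` and stored under a span group key. The key `"sc"` is assumed here because it is the span categorizer's default `spans_key`; the single example and the output file name are placeholders.

```python
import spacy
from spacy.tokens import DocBin

# Same shape as the annotations above; "end" offsets are exclusive.
examples = [
    {
        "text": "When the levee breaks, the cat will smile.",
        "spans": [
            {"start": 0, "end": 21, "label": "CONDITIONAL_CLAUSE"},
            {"start": 23, "end": 30, "label": "SUBJECT_CLAUSE"},
            {"start": 31, "end": 41, "label": "PREDICATE_CLAUSE"},
        ],
    },
]

nlp = spacy.blank("en")  # tokenizer only, no trained components required
SPANS_KEY = "sc"  # assumption: the default spans_key of the spancat component

doc_bin = DocBin()
for example in examples:
    doc = nlp.make_doc(example["text"])
    spans = []
    for ann in example["spans"]:
        # char_span returns None when offsets do not line up with token boundaries.
        span = doc.char_span(ann["start"], ann["end"], label=ann["label"])
        if span is None:
            raise ValueError(f"Misaligned span {ann} in {example['text']!r}")
        spans.append(span)
    doc.spans[SPANS_KEY] = spans
    doc_bin.add(doc)

converted = [(span.text, span.label_) for span in doc.spans[SPANS_KEY]]
print(converted)
# doc_bin.to_disk("train.spacy")  # placeholder path for the training corpus
```

Raising on misaligned offsets is a deliberate choice in this sketch: silently dropping spans tends to hide annotation or tokenization problems.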
Given that every token is a big vector, it seems a model would be far more effective than trying to write token-level rules for each sentence.
The approach below is one direction. Has anyone tried it, and did it work better than applying token logic?
As an alternative, I could feed sentences to OpenAI, which seems to work. At the very least, I could build a training set from its output.
Expected Output:
Rather than iterating over the part-of-speech tags of sentence tokens (for rule logic), can I identify clause spans as a new, separate kind of spaCy entity that doesn't overlap with normal entities? I could annotate these with some Prodigy-like NER app.
Basically, store the token range of each clause as a span and label it as one of:
SUBJ_CLAUSE, PRED_CLAUSE, COND_CLAUSE
Not dissimilar from normal entity training.
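One way this could look, as a sketch rather than a recommendation: spaCy keeps named span groups in `doc.spans`, separate from `doc.ents`, so clause spans can sit alongside (and overlap with) ordinary entities. The sentence, the entity label, and the group name `"clauses"` are made up for illustration, and the spans are set by hand instead of being predicted.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("When the levee breaks, Memphis will flood.")
# Tokens: When(0) the(1) levee(2) breaks(3) ,(4) Memphis(5) will(6) flood(7) .(8)

# An ordinary named entity, stored in doc.ents (entities there cannot overlap).
doc.ents = [Span(doc, 5, 6, label="GPE")]

# Clause spans, stored in a separate span group that may overlap anything.
doc.spans["clauses"] = [
    Span(doc, 0, 4, label="COND_CLAUSE"),
    Span(doc, 5, 6, label="SUBJ_CLAUSE"),  # same tokens as the entity above
    Span(doc, 6, 8, label="PRED_CLAUSE"),
]

entities = [(ent.text, ent.label_) for ent in doc.ents]
clauses = [(span.text, span.label_) for span in doc.spans["clauses"]]
print(entities)
print(clauses)
```

Because the two containers are independent, no custom extension attribute should be needed just to keep clause labels apart from entity labels.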
In my use case, identifying clauses is crucial, and the wordings are often 'strange'.
I'm assuming you can create a new entity type for clauses with set_extension or something similar. Iterating over each sentence with rules seems inefficient and prone to failure, especially when you have a large corpus of custom sentences.
I feel like I've been learning spaCy in reverse, from advanced to beginner.
I'm guessing that the above is a huge rabbit hole that others have gone down, and come back to tell the tale.
Any advice is appreciated :)