Is Possible?: Extracting Subject, Predicate, Conditional, Prepositional Clauses with a Spacy Trained Model, as opposed to rule based? #13125

grahamanderson · 2023-11-12T14:56:09Z

Question: Has anyone identified clause spans with a trained spaCy model, and was it effective?

[training...bunch of sentences]
[
{
"text": "When the levee breaks, the cat will smile.",
"spans": [
{"start": 0, "end": 21, "label": "CONDITIONAL_CLAUSE"},
{"start": 23, "end": 30, "label": "SUBJECT_CLAUSE"},
{"start": 31, "end": 41, "label": "PREDICATE_CLAUSE"}
]
},
// many more sentences...
]

Given that every token is a big vector, it seems a model would be far more effective than trying to write token-level rules for sentences.
The code below is one direction. Has anyone tried this, and did it work better than applying token logic?

As an alternative, I could feed sentences to OpenAI, which seems to work. At least I could make a training set from it.

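import random

import spacy
from spacy.tokens import Span
from spacy.training import Example
from spacy.util import compounding, minibatch

# Add a custom attribute 'clause_type' to Span (not used by the training loop below)
Span.set_extension("clause_type", default=None, force=True)

# Training data as (text, annotations) pairs, with the spans from the JSON above
# written as (start, end, label) character offsets
TRAIN_DATA = [
    (
        "When the levee breaks, the cat will smile.",
        {"entities": [(0, 21, "CONDITIONAL_CLAUSE"), (23, 30, "SUBJECT_CLAUSE"), (31, 41, "PREDICATE_CLAUSE")]},
    ),
    # many more sentences...
]

# spacy.blank() takes a language code such as "en", not a package name like "en_core_web_sm"
nlp = spacy.blank("en")  # create a blank English pipeline
ner = nlp.add_pipe("ner")  # add a new NER component

# Add the new labels to the NER component
for label in ["CONDITIONAL_CLAUSE", "SUBJECT_CLAUSE", "PREDICATE_CLAUSE"]:
    ner.add_label(label)

optimizer = nlp.initialize()  # spaCy v3 name for the deprecated begin_training()
for i in range(20):  # Number of training iterations
    random.shuffle(TRAIN_DATA)
    losses = {}

    # Batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        examples = [Example.from_dict(nlp.make_doc(text), annotations) for text, annotations in batch]
        nlp.update(examples, sgd=optimizer, drop=0.5, losses=losses)

    print("Losses", losses)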

test_text = "When the levee breaks, the cat will smile."
doc = nlp(test_text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Expected Output:

When the levee breaks 0 21 CONDITIONAL_CLAUSE
the cat 23 30 SUBJECT_CLAUSE
will smile 31 41 PREDICATE_CLAUSE
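
The training loop reads TRAIN_DATA as (text, annotations) pairs, while the sample data at the top of the post is a JSON list of objects with "spans". A minimal sketch of a converter between the two, assuming the JSON layout shown in the post; the helper name to_train_data is hypothetical, and the overlap check is there because the ner component cannot represent overlapping spans:

```python
import json

# Sample in the JSON layout from the post (one sentence only).
RAW = json.loads("""
[
  {
    "text": "When the levee breaks, the cat will smile.",
    "spans": [
      {"start": 0, "end": 21, "label": "CONDITIONAL_CLAUSE"},
      {"start": 23, "end": 30, "label": "SUBJECT_CLAUSE"},
      {"start": 31, "end": 41, "label": "PREDICATE_CLAUSE"}
    ]
  }
]
""")


def to_train_data(records):
    """Convert span records to (text, {"entities": [(start, end, label)]}) pairs."""
    train_data = []
    for record in records:
        text = record["text"]
        entities = []
        previous_end = 0
        for span in sorted(record["spans"], key=lambda s: s["start"]):
            start, end = span["start"], span["end"]
            # Offsets must lie inside the text and describe a non-empty span.
            if not 0 <= start < end <= len(text):
                raise ValueError(f"bad offsets {start}-{end} in {text!r}")
            # Reject overlaps, which entity-style annotations cannot hold.
            if start < previous_end:
                raise ValueError(f"overlapping span at {start} in {text!r}")
            entities.append((start, end, span["label"]))
            previous_end = end
        train_data.append((text, {"entities": entities}))
    return train_data


TRAIN_DATA = to_train_data(RAW)
for start, end, label in TRAIN_DATA[0][1]["entities"]:
    print(repr(TRAIN_DATA[0][0][start:end]), label)
```

Printing the sliced text next to each label is a cheap way to spot offsets that are off by one before any training starts.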

Rather than iterating over the part-of-speech tags of sentence tokens (for logic rules), can I identify clause spans as a new, separate spaCy entity type that doesn't overlap with normal entities? I could do this with some Prodigy-like NER app.
Basically, store the spans of each clause's token range and label them as:
SUBJ_CLAUSE, PRED_CLAUSE, COND_CLAUSE
Not dissimilar from normal entity training.
In my use case, identifying clauses is crucial, and the wordings are often 'strange'.

I'm assuming you can create a new entity type for clauses with set_extension or something. Iterating over each sentence with rules seems inefficient and prone to failure, especially when you have a large corpus of custom sentences.
I feel like I've been learning spaCy in reverse...from advanced to beginner.

I'm guessing that the above is a huge rabbit hole that others have gone down...and come back to tell the tale.

Any advice is appreciated :)

adrianeboyd · 2023-11-13T06:24:10Z

In general I would recommend using a syntactic parse (either constituency or dependency) with rules to identify the types of clauses you're interested in. Our provided trained pipelines only include dependency parsers, but it can sometimes be simpler to identify clauses from constituency parses, so it could be worth it to check out constituency parsers from outside the core spaCy library.

A few examples of projects that include clause detection:

spacy-clausie, which I think uses rule-based clause detection
Healthsea, which uses the benepar constituency parser to identify clauses
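
To make the "parse plus rules" idea concrete, here is a minimal sketch in plain Python. The parse rows are hand-written and only stand in for real parser output (an assumption, not the output of any specific pipeline), and the three rules and the helper names are hypothetical. Note that an advcl is not always a conditional, so a real rule set would also inspect the clause marker ("when", "if", "unless", and so on).

```python
# Hand-written dependency parse of the example sentence: (text, head index, dep).
# These rows are an assumption standing in for real parser output.
PARSE = [
    ("When", 3, "advmod"),
    ("the", 2, "det"),
    ("levee", 3, "nsubj"),
    ("breaks", 8, "advcl"),
    (",", 8, "punct"),
    ("the", 6, "det"),
    ("cat", 8, "nsubj"),
    ("will", 8, "aux"),
    ("smile", 8, "ROOT"),  # the root points at itself
    (".", 8, "punct"),
]


def subtree(parse, index):
    """Return the sorted indices of a token and all of its descendants."""
    indices = [index]
    for child, (_, head, _) in enumerate(parse):
        if head == index and child != index:
            indices.extend(subtree(parse, child))
    return sorted(indices)


def text_of(parse, indices):
    """Join the token texts at the given indices with single spaces."""
    return " ".join(parse[i][0] for i in indices)


def find_clauses(parse):
    """Apply three simple rules keyed on dependency labels."""
    root = next(i for i, (_, _, dep) in enumerate(parse) if dep == "ROOT")
    clauses = {}
    for i, (_, head, dep) in enumerate(parse):
        if head != root or i == root:
            continue
        if dep == "advcl":  # adverbial clause attached to the main verb
            clauses["CONDITIONAL_CLAUSE"] = text_of(parse, subtree(parse, i))
        elif dep == "nsubj":  # subject of the main verb, with its modifiers
            clauses["SUBJECT_CLAUSE"] = text_of(parse, subtree(parse, i))
    # Predicate: the main verb plus its auxiliaries.
    predicate = [
        i for i, (_, head, dep) in enumerate(parse)
        if i == root or (head == root and dep == "aux")
    ]
    clauses["PREDICATE_CLAUSE"] = text_of(parse, sorted(predicate))
    return clauses


print(find_clauses(PARSE))
```

With a real parser, the same rules would read each token's head and dependency label from the parsed document instead of from a hard-coded table.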

9 replies

grahamanderson Nov 17, 2023
Author

For a new person like me, I think it might be easier to create the JSON list that Prodigy can import from a simple OpenAI call.
As an alternative, I would love a method that works seamlessly and intuitively within Prodigy and doesn't involve too much cruft :)

import openai
import os
import json

def join_list_with_and(items):
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    return ', '.join(items[:-1]) + ' and ' + items[-1]

def generate_sentences(prompt, limit=10, num_variations=50):
    # Note: limit and num_variations are currently unused; the prompt text controls the output size.
    # This uses the pre-1.0 openai package interface (openai.ChatCompletion).
    models = [
        'gpt-3.5-turbo-1106',
        'gpt-4-1106-preview',
    ]
    try:
        # Read the API key from the environment
        openai.api_key = os.getenv("OPENAI_API_KEY")

        # Call the OpenAI API and ask for a JSON object in the reply
        response = openai.ChatCompletion.create(
            model=models[1],
            response_format={"type": "json_object"},
            max_tokens=3000,
            stop=None,
            temperature=0.7,
            stream=False,
            messages=[
                {"role": "system", "content": "Assistant is a large language model trained by OpenAI."},
                {"role": "system", "content": "Make sure to return a JSON output"},
                {"role": "user", "content": f"{prompt}"},
            ],
        )

        # Parse the reply and return the first top-level value (expected to be the list)
        data = json.loads(response['choices'][0]['message']['content'])
        for data_list in data.values():
            return data_list

    except Exception as e:
        # On failure, a JSON string describing the error is returned instead of a list
        return json.dumps({'Error': str(e)})

# Example usage
limit = 3
product = 'Bell Helicopter'
main_component = 'Main Rotor System'
resource_urls = ['https://en.wikipedia.org/wiki/Glossary_of_aerospace_engineering'] #, 'https://www.grc.nasa.gov/www/k-12/TRC/glossary.htm']
clause_types = ['subject', 'predicate', 'conditional']

prompt = f"""
    From the {product}'s {main_component}:
    Step 1: Create a detailed list of the top {limit} components of the {main_component}
            and recursively iterate through its components and subcomponents until there are no more components to analyze.

    Step 2: For each component entry in this list, write at least 10 single synthetic requirements that are
            common for that component. Include simple and complex examples.

    Step 3: Return a list which contains:
    - the entire component hierarchy (parent to children),
    - the component name,
    - a description of the current component,
    - each synthetic requirement sentence parsed into {join_list_with_and(clause_types)} clauses.
      Include these clauses in the list.
    - For each requirement, a list of extracted terminology commonly used in aeronautical engineering,
      like the resources found in the URLs: {join_list_with_and(resource_urls)}.
      -- Within the list of extracted terminology, include all units (examples include milliseconds and parsecs) and values (like False, is less than 72, between 46 and 72, or is equal to False).
"""
response = generate_sentences(prompt, limit=limit)
response

Ali-MH-Mansour Nov 18, 2023

I'm also new. Splitting a sentence into clauses with spaCy (or another parser) requires studying grammar and language. It won't be easy, especially since the parser will not be ideal in some cases. Also, English is not my native language.
I came across a recent article about clauses that contains a lot of useful information, and I hope it will be of benefit to you:
Hierarchical Clause Annotation: Building a Clause-Level Corpus for Semantic Parsing with Complex Sentences
Yunlong Fan, Bin Li, Yikemaiti Sataer, Miao Gao, Chuanqi Shi, Siyi Cao and Zhiqiang Gao

I also found this link (you can get the parse tree from spaCy instead of Stanza):
https://stackoverflow.com/questions/26070245/clause-extraction-using-stanford-parser

grahamanderson Nov 18, 2023
Author

Thank you :) I'm beginning to believe that LLMs are better for this sort of work...My workflow (I think) would be to start from a bunch of LLM-parsed sentences, do some span correction in Prodigy, and then train a model to flag those clauses as spancat spans.
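
A minimal sketch of the first step of that workflow, assuming the LLM returns each clause as a verbatim substring of the sentence. It locates every clause in the text and writes the same text/spans layout as the sample at the top of the thread, one JSON object per line; whether an annotation tool accepts exactly this layout should be checked against its documentation. The function name clauses_to_task and the llm_clauses data are hypothetical.

```python
import json


def clauses_to_task(text, clauses):
    """Locate each (label, clause_text) pair in text and return a span-annotated record."""
    spans = []
    cursor = 0  # search left to right so repeated phrases map to successive matches
    for label, clause in clauses:
        start = text.find(clause, cursor)
        if start == -1:
            # The LLM paraphrased or reordered the text; leave it for manual review.
            raise ValueError(f"clause not found verbatim: {clause!r}")
        end = start + len(clause)
        spans.append({"start": start, "end": end, "label": label})
        cursor = end
    return {"text": text, "spans": spans}


# Hypothetical LLM output for one sentence.
llm_clauses = [
    ("CONDITIONAL_CLAUSE", "When the levee breaks"),
    ("SUBJECT_CLAUSE", "the cat"),
    ("PREDICATE_CLAUSE", "will smile"),
]
task = clauses_to_task("When the levee breaks, the cat will smile.", llm_clauses)
print(json.dumps(task))  # writing one such line per sentence gives a JSONL file
```

Raising on clauses that cannot be found verbatim keeps paraphrased LLM output out of the training set, so only sentences with exact character offsets reach the correction step.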

Ali-MH-Mansour Nov 18, 2023

good luck :)

Ali-MH-Mansour Mar 28, 2024

Hello @grahamanderson
How are you progressing with your chosen solution?
Also, may I ask what your purpose is in obtaining clauses?
For me and our R&D team, we are trying to simplify sentences into smaller components as part of a meaning-extraction process and then obtain triplets (verb, subject, object) to build knowledge graphs.

In your experience, why would people need to extract clauses?
