Preserve paragraph divisions of documents #3501

Locked Answered by ines

Should I create 'paragraph Spans' or something like that?

Yes, that sounds like a good solution. If you have the character offsets of the paragraphs, you can use Doc.char_span to create the spans.
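If you don't already have the offsets, here is a minimal sketch of one way to compute them, assuming paragraphs are separated by blank lines. The helper name `paragraph_offsets` and the sample text are made up for illustration and are not part of spaCy.

```python
import re

def paragraph_offsets(text):
    """Return (start, end) character offsets of blank-line-separated paragraphs.

    Hypothetical helper: assumes a paragraph break is a newline, optional
    whitespace, and another newline.
    """
    offsets = []
    start = 0
    for sep in re.finditer(r"\n\s*\n", text):
        # Skip empty stretches, e.g. when the text starts with blank lines.
        if sep.start() > start:
            offsets.append((start, sep.start()))
        start = sep.end()
    # The last paragraph runs to the end of the text, minus trailing whitespace.
    end = len(text.rstrip())
    if start < end:
        offsets.append((start, end))
    return offsets

text = "First paragraph here.\n\nSecond paragraph."
offsets = paragraph_offsets(text)
print(offsets)
```

Each `(start, end)` pair can then be passed to `doc.char_span(start, end)`. Keep in mind that `char_span` returns `None` when the offsets don't line up with token boundaries, so it's worth checking the result.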

If you want to do this more elegantly, you could also store the paragraph offsets in doc.user_data and then add a custom Doc attribute like doc._.paragraphs that returns (or yields) the paragraph spans for a given Doc.

For example:

doc = nlp(paragraph_text)
doc.user_data["paragraph_offsets"] = [(0, 392), (393, 848)]  # etc.

from spacy.tokens import Doc

def get_paragraphs(doc):
    offsets = doc.user_data.get("paragraph_offsets", [])
    return [doc.char_span(start, end) for start, end in o…
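The snippet above is cut off, so here is a self-contained sketch of the same idea that you can run end to end. It assumes a blank English pipeline (`spacy.blank("en")`) so no model download is needed, uses a short made-up text with hand-written offsets, and registers the getter with `Doc.set_extension`. The registration step and the `None` filtering are my assumptions about a reasonable way to finish it, not necessarily what the original answer did.

```python
import spacy
from spacy.tokens import Doc

def get_paragraphs(doc):
    """Getter for doc._.paragraphs: build Spans from the stored offsets."""
    offsets = doc.user_data.get("paragraph_offsets", [])
    spans = [doc.char_span(start, end) for start, end in offsets]
    # char_span returns None if the offsets don't match token boundaries,
    # so drop those instead of returning None entries (assumed behavior).
    return [span for span in spans if span is not None]

# force=True only so the example can be re-run in the same session.
Doc.set_extension("paragraphs", getter=get_paragraphs, force=True)

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline is enough here
text = "First paragraph here.\n\nSecond paragraph."
doc = nlp(text)
# Illustrative offsets for the sample text above.
doc.user_data["paragraph_offsets"] = [(0, 21), (23, 40)]

for paragraph in doc._.paragraphs:
    print(repr(paragraph.text))
```

Because the attribute is a getter, the spans are computed from `doc.user_data` each time it is accessed, so updating the stored offsets is enough to change what `doc._.paragraphs` returns.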

Labels: usage (General spaCy usage), feat / doc (Feature: Doc, Span and Token objects)

This discussion was converted from issue #3501 on December 10, 2020 13:53.