Preserve paragraph divisions of documents #3501
-
Hello, I have documents that are structured as paragraphs, since they are cleaned versions of scraped HTML pages. What's the best / most spaCy way of preserving the paragraph structure (while still having one Doc per file), so that I can later refer back to paragraph units for exporting XML or doing text classification at the paragraph level? Should I create 'paragraph Spans' or something like that? Thanks!
Replies: 3 comments
-
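Yes, that sounds like a good solution. If you have the character offsets of the paragraphs, you can use `Doc.char_span` to create the spans.

If you want to do this more elegantly, you could also store the paragraph offsets in the `doc.user_data` and then add a custom `Doc` attribute like `doc._.paragraphs` that returns (or yields) the paragraph spans for a given `Doc`. For example:

```python
import spacy
from spacy.tokens import Doc

def get_paragraphs(doc):
    # Build one Span per stored (start_char, end_char) pair.
    # Note: char_span returns None if offsets don't align with token boundaries.
    offsets = doc.user_data.get("paragraph_offsets", [])
    return [doc.char_span(start, end) for start, end in offsets]

Doc.set_extension("paragraphs", getter=get_paragraphs)

nlp = spacy.load("en_core_web_sm")  # assumption: any loaded pipeline works here
doc = nlp(paragraph_text)  # paragraph_text is the full text of one file
doc.user_data["paragraph_offsets"] = [(0, 392), (393, 848)]  # etc.
```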
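If you don't have the offsets yet, here is a minimal sketch of computing them from the raw text. It assumes paragraphs are separated by blank lines in the cleaned files, which may not match your data; the helper name `paragraph_offsets` is made up for this example and is not part of spaCy.

```python
import re

def paragraph_offsets(text):
    """Return (start_char, end_char) pairs for blank-line-separated paragraphs.

    Hypothetical helper, not a spaCy API. Surrounding whitespace is trimmed
    from each paragraph so the offsets point at the text itself.
    """
    offsets = []
    pos = 0
    # The capturing group keeps the separators in the result list, so `pos`
    # stays in sync with positions in the original string.
    for i, chunk in enumerate(re.split(r"(\n\s*\n)", text)):
        # Even indices are paragraph chunks, odd indices are separators.
        if i % 2 == 0 and chunk.strip():
            start = pos + (len(chunk) - len(chunk.lstrip()))
            end = pos + len(chunk.rstrip())
            offsets.append((start, end))
        pos += len(chunk)
    return offsets

text = "First paragraph.\n\nSecond paragraph,\non two lines.\n"
offsets = paragraph_offsets(text)
paragraphs = [text[start:end] for start, end in offsets]
```

The resulting list can be assigned to `doc.user_data["paragraph_offsets"]` as long as the `Doc` was created from the same, unmodified string. Trimming the whitespace also makes it more likely that the offsets line up with token boundaries, which matters because `Doc.char_span` returns `None` for offsets that don't.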
-
Thanks a lot for your quick and helpful reply! It works :) Is it possible to access the paragraphs' sentences?
-
Yeah, there's