Preserve paragraph divisions of documents #3501

wuqui · 2019-03-28T15:44:33Z

wuqui
Mar 28, 2019

Hello,

I have documents that are structured as paragraphs, as they are cleaned versions of scraped HTML pages. What's the best / most spaCy way of preserving the paragraph structure (while still having one Doc per file), so that I can later refer back to paragraph units for exporting XML or doing text classification on paragraph level? Should I create 'paragraph Spans' or something like that?

Thanks!
Quirin

Answered by ines

ines · 2019-03-28T17:11:09Z

ines
Mar 28, 2019
Maintainer

Should I create 'paragraph Spans' or something like that?

Yes, that sounds like a good solution. If you have the character offsets of the paragraphs, you can use Doc.char_span to create the spans.

If you want to do this more elegantly, you could also store the paragraph offsets in the doc.user_data and then add a custom Doc attribute like doc._.paragraphs that returns (or yields) the paragraph spans for a given Doc.

For example:

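doc = nlp(paragraph_text)  # nlp is your loaded spaCy pipeline
# (start_char, end_char) character offsets of each paragraph
doc.user_data["paragraph_offsets"] = [(0, 392), (393, 848)] # etc.

from spacy.tokens import Doc

def get_paragraphs(doc):
    offsets = doc.user_data.get("paragraph_offsets", [])
    # char_span returns None if the offsets don't line up with token boundaries
    return [doc.char_span(start, end) for start, end in offsets]

# Register the getter so the spans are available as doc._.paragraphs
Doc.set_extension("paragraphs", getter=get_paragraphs)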
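If you don't have the character offsets yet, one way to get them is to compute them from the raw text before passing it to nlp. The following is only a rough sketch: the helper name paragraph_offsets and the assumption that paragraphs are separated by blank lines ("\n\n") are mine, not part of spaCy, so adjust them to however your cleaned files are laid out.

```python
def paragraph_offsets(text, separator="\n\n"):
    """Return (start_char, end_char) pairs for each paragraph in text.

    Hypothetical helper: assumes paragraphs are separated by `separator`.
    """
    offsets = []
    start = 0
    for chunk in text.split(separator):
        stripped = chunk.strip()
        if stripped:
            # Skip leading whitespace so the span starts on a real character
            lead = len(chunk) - len(chunk.lstrip())
            offsets.append((start + lead, start + lead + len(stripped)))
        # Move past this chunk and the separator that followed it
        start += len(chunk) + len(separator)
    return offsets


text = "First paragraph.\n\nSecond one here."
offsets = paragraph_offsets(text)
# Each offset pair slices the paragraph back out of the original text
paragraphs = [text[start:end] for start, end in offsets]
```

The resulting list could then be assigned to doc.user_data["paragraph_offsets"]. Stripping the surrounding whitespace should make it more likely that the offsets fall on token boundaries, but it's still worth checking for None in the spans returned by Doc.char_span.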

wuqui · 2019-03-28T23:25:23Z

wuqui
Mar 28, 2019
Author

Thanks a lot for your quick and helpful reply! It works :)

Is it possible to access the paragraphs' sentences? .sents doesn't work, as Spans are probably meant to be used for sub-sentential patterns.

ines · 2019-03-29T10:56:18Z

ines
Mar 29, 2019
Maintainer

Is it possible to access the paragraphs' sentences? .sents doesn't work, as Spans are probably meant to be used for sub-sentential patterns.

Yeah, there's Span.sent, but no Span.sents. However, the sentences in Doc.sents are also just spans, and they all expose start_char/end_char (character offsets) and start/end (token offsets). So for each paragraph span, you could check whether a sentence's start and end both fall within the paragraph's start and end. If so, you know that the paragraph contains the sentence.
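As a rough sketch, the containment check described above could look like this. The function name sentences_in is made up for illustration, and it only relies on the start_char/end_char attributes, so the demo below uses simple stand-in objects instead of real spaCy spans.

```python
from types import SimpleNamespace


def sentences_in(paragraph, sentences):
    """Return the sentences whose character range lies inside the paragraph.

    Hypothetical helper: works on any objects exposing start_char/end_char,
    e.g. a paragraph Span and the spans yielded by doc.sents.
    """
    return [
        sent
        for sent in sentences
        if sent.start_char >= paragraph.start_char
        and sent.end_char <= paragraph.end_char
    ]


# Stand-ins for spans, only for demonstration (not real spaCy objects)
paragraph = SimpleNamespace(start_char=0, end_char=50)
sentences = [
    SimpleNamespace(start_char=0, end_char=20),
    SimpleNamespace(start_char=21, end_char=50),
    SimpleNamespace(start_char=51, end_char=80),  # belongs to the next paragraph
]
contained = sentences_in(paragraph, sentences)
```

With real data, you'd call it as sentences_in(par, doc.sents) for each par in doc._.paragraphs. Note that a sentence that straddles a paragraph boundary (because the sentence segmenter didn't split there) won't be assigned to either paragraph with this strict check.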
