Closes #427 #428

Open
wants to merge 4 commits into
base: main

Conversation

nomisto
Contributor

@nomisto nomisto commented Apr 12, 2022

Closes #427

The dataset contains 8 different subset_ids (different dataset settings), each with a bigbio and a source schema.

In addition, there is a subset called mediqa_ans_all which includes all data (articles, sections, URLs of documents, all four kinds of summaries, ...). I did not implement a bigbio schema for this all view, as I don't think it makes sense here. Since the bigbio schema is missing for the all view, the tests fail for subset mediqa_ans_all.
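
For orientation, the subset layout described above can be sketched roughly as follows. This is only an illustration: the subset names are taken from the test commands in this PR, while `schemas_for` and the constant names are hypothetical and not part of the dataloader.

```python
from itertools import product

# Building blocks of the eight subset_ids, as they appear in the test commands.
INPUTS = ("page2answer", "section2answer")
ANSWER_COUNTS = ("multi", "single")
SUMMARY_TYPES = ("abstractive", "extractive")

SUBSET_IDS = [
    f"mediqa_ans_{inp}_{count}_{kind}"
    for inp, count, kind in product(INPUTS, ANSWER_COUNTS, SUMMARY_TYPES)
]
ALL_SUBSET = "mediqa_ans_all"


def schemas_for(subset_id):
    """Hypothetical helper: which schemas a subset offers, per the PR description."""
    if subset_id == ALL_SUBSET:
        # The all view ships the source schema only.
        return ("source",)
    if subset_id in SUBSET_IDS:
        return ("source", "bigbio")
    raise ValueError(f"unknown subset_id: {subset_id}")
```

The eight names are the product of two input granularities, two answer counts, and two summary types; mediqa_ans_all sits outside that grid.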

Tests:

python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_all
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_page2answer_multi_abstractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_page2answer_multi_extractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_page2answer_single_abstractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_page2answer_single_extractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_section2answer_multi_abstractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_section2answer_multi_extractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_section2answer_single_abstractive
python -m tests.test_bigbio biodatasets/mediqa_ans/mediqa_ans.py --subset_id mediqa_ans_section2answer_single_extractive
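
The commands above follow a regular pattern, so they could be generated instead of typed out. The sketch below only builds and prints the command lines; it does not run them. `build_test_commands` is a hypothetical helper, and it leaves out mediqa_ans_all by default on the assumption that its test run is expected to fail.

```python
import shlex
from itertools import product

SCRIPT = "biodatasets/mediqa_ans/mediqa_ans.py"


def build_test_commands(include_all=False):
    """Return one argument list per subset_id (hypothetical helper)."""
    subset_ids = [
        f"mediqa_ans_{inp}_{count}_{kind}"
        for inp, count, kind in product(
            ("page2answer", "section2answer"),
            ("multi", "single"),
            ("abstractive", "extractive"),
        )
    ]
    if include_all:
        # The all view has no bigbio schema, so its run is expected to fail.
        subset_ids.insert(0, "mediqa_ans_all")
    return [
        ["python", "-m", "tests.test_bigbio", SCRIPT, "--subset_id", subset_id]
        for subset_id in subset_ids
    ]


if __name__ == "__main__":
    for command in build_test_commands():
        print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run` as-is if the tests should actually be executed from the repository root.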

Comment on lines 236 to 253
def _source_to_t2t(self, example):
    example_ = {}
    example_["document_id"] = ""
    example_["text_1_name"] = ""
    example_["text_2_name"] = ""

    text1 = ""
    text1 += "Question ID: " + example["question_id"] + "\n"
    text1 += "Question: " + example["question"] + "\n"
    for article in example["articles"]:
        text1 += "Answer ID: " + article["answer_id"] + "\n"
        text1 += "Answer: " + article["text"] + "\n"
        text1 += "Rating: " + article["rating"] + "\n"
    example_["text_1"] = text1

    example_["text_2"] = example["summary"]

    return example_
Contributor Author

This is the transformation of the source data to fit the t2t schema.
Basically, the summarization works like this: question + answer -> summarized_answer. For the t2t schema I therefore concatenated all relevant values, separated by "\n", into text_1.

An example from page2answer_single_abstractive:

"1_Answer4": {
        "summary": "Abetalipoproteimemia, also known as Bassen-Kornzweig syndrome, ... ",
        "articles": " Bassen-Kornzweig syndrome Abetalipoproteinemia Acanthocytosis Apolipoprotein B deficiency...",
        "question": "abetalipoproteimemia hi, I would like to know if there is any support for those suffering with abetalipoproteinemia? ...",
        "question_id": "1",
        "rating": "3-Incomplete"
}

where "1_Answer4" is answer_id above and "articles" corresponds to article["text"]
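
To make the resulting text_1 concrete, here is a minimal standalone sketch of the same concatenation applied to a made-up record. It assumes the normalized shape that `_source_to_t2t` iterates over (a list of article dicts with `answer_id`, `text`, and `rating`); the record contents are invented for illustration and are not taken from the dataset.

```python
def source_to_t2t(example):
    """Standalone version of the concatenation logic in _source_to_t2t."""
    lines = [
        "Question ID: " + example["question_id"],
        "Question: " + example["question"],
    ]
    for article in example["articles"]:
        lines.append("Answer ID: " + article["answer_id"])
        lines.append("Answer: " + article["text"])
        lines.append("Rating: " + article["rating"])
    return {
        "document_id": "",
        "text_1_name": "",
        "text_2_name": "",
        # Every value is terminated by "\n", matching the original method.
        "text_1": "".join(line + "\n" for line in lines),
        "text_2": example["summary"],
    }


# Made-up record in the assumed normalized shape.
toy_example = {
    "question_id": "1",
    "question": "Is there any support for this condition?",
    "articles": [
        {
            "answer_id": "1_Answer4",
            "text": "Some article text.",
            "rating": "3-Incomplete",
        }
    ],
    "summary": "A short summary.",
}

record = source_to_t2t(toy_example)
print(record["text_1"])
```

In the multi-answer subsets the loop would simply append one Answer ID / Answer / Rating triple per article.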

@sunnnymskang sunnnymskang self-assigned this Apr 12, 2022
@hakunanatasha hakunanatasha self-assigned this Apr 25, 2022
@sunnnymskang sunnnymskang added and removed the "tricky schema" label (bigbio schema doesn't fit this dataset easily) Apr 26, 2022
Collaborator

@sunnnymskang sunnnymskang left a comment

@nomisto In the description, can you add information about the subset_ids (and note that mediqa_ans_all implements only the source schema)? I confirmed that the other 8 subset_ids pass the unit tests.

@nomisto nomisto requested a review from debajyotidatta as a code owner April 26, 2022 07:57
@nomisto
Contributor Author

nomisto commented Apr 26, 2022

Hi @sunnnymskang, sure, I've added a description to the value of _DESCRIPTION and to the docstring.

@hakunanatasha
Collaborator

@nomisto Can you remind me why this fits the t2t schema better than question answering? We want to merge this PR asap; it looks mostly ok.

@nomisto
Contributor Author

nomisto commented Apr 27, 2022

Hi @hakunanatasha, the name of this dataset is a little misleading: it is a summarization task, more specifically an answer summarization task. The input is question + answer, and the task is to generate a summary of that answer.

@hakunanatasha
Collaborator

@nomisto got it; I'll merge this later today. Sorry for the holdup. I assume that since it's a summarization task, text_1_name and text_2_name are left blank because there is nothing to fill in here.

Development

Successfully merging this pull request may close these issues.

Proposal to add MEDIQA-AnS
3 participants