
Add JCoLA task #93

Merged: 17 commits merged into Stability-AI:jp-stable from the jcola branch on Oct 11, 2023

Conversation

@kumapo commented Oct 3, 2023

JCoLA results

| Model | mcc | mcc_stderr |
|---|---|---|
| rinna/japanese-gpt-neox-3.6b | 0.0503 | 0.0370 |
| open-calm-large | 0.0218 | 0.0364 |
| rinna/japanese-gpt-1b | 0.0183 | 0.0346 |
| open-calm-medium | -0.0075 | 0.0330 |
| rinna/bilingual-gpt-neox-4b | -0.0204 | 0.0333 |
| Llama-2-7b-hf | -0.0236 | 0.0391 |
| open-calm-7b | -0.0258 | 0.0088 |
| rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 | -0.0313 | 0.0281 |
| rinna/bilingual-gpt-neox-4b-instruction-sft | -0.0511 | 0.0263 |

From the JCoLA paper:

[Screenshot 2023-10-07 15:39:23: results table from the JCoLA paper]

CoLA results

| Model | mcc | mcc_stderr |
|---|---|---|
| Llama-2-7b-hf | 0.3231 | 0.0322 |
| Llama-7b | 0.1093 | 0.0339 |
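(For reference, mcc here is the Matthews correlation coefficient. Below is a minimal standalone sketch of how such a score and a bootstrap standard error could be computed, assuming scikit-learn and numpy are available; the harness's own aggregation code may differ.)

```python
# Minimal sketch: MCC plus a bootstrap standard error over binary
# acceptability predictions. Illustrative only; not the harness's
# actual aggregation code.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_with_stderr(golds, preds, n_boot=1000, seed=0):
    golds, preds = np.asarray(golds), np.asarray(preds)
    point = matthews_corrcoef(golds, preds)
    rng = np.random.default_rng(seed)
    resampled = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(golds), size=len(golds))
        resampled.append(matthews_corrcoef(golds[idx], preds[idx]))
    return point, float(np.std(resampled))

# Toy example: 1 = acceptable, 0 = unacceptable
print(mcc_with_stderr([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1]))
```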

@kumapo force-pushed the jcola branch 3 times, most recently from 051d52c to b9117ab (October 4, 2023 02:00)
@kumapo force-pushed the jcola branch 3 times, most recently from 81c2306 to 9498cec (October 4, 2023 04:49)
@mkshing commented Oct 4, 2023

@kumapo thank you as always for your contribution! We'd be happy to add this task the next time the leaderboard is updated, provided there are no issues.
Please feel free to ping me when it's ready. I will review this PR ASAP 👍

@kumapo force-pushed the jcola branch 2 times, most recently from 5b5d346 to 4aa8971 (October 5, 2023 05:50)
@mkshing commented Oct 5, 2023

@kumapo we're adding balanced accuracy soon via PR #95. Can you add this metric to JCoLA as well?
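(For context, balanced accuracy is the mean of per-class recall, which can be more informative than plain accuracy when labels are imbalanced. A minimal standalone sketch, not the harness's actual implementation from #95:)

```python
# Minimal sketch of balanced accuracy: the mean of per-class recall.
# Illustrative only; PR #95 adds the metric to the harness itself.
from collections import defaultdict

def balanced_accuracy(golds, preds):
    correct, total = defaultdict(int), defaultdict(int)
    for g, p in zip(golds, preds):
        total[g] += 1
        correct[g] += int(g == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# A majority-class predictor scores 0.5 here, even though plain accuracy is 0.75.
print(balanced_accuracy([1, 1, 1, 0], [1, 1, 1, 1]))  # 0.5
```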

@polm-stability (Collaborator) commented:

I went ahead and added the new metrics here. Aside from the missing 0.2 prompt version, this looks OK to me. What is left to be done?

@kumapo (Author) commented Oct 7, 2023

@mkshing @polm-stability

Sorry for the late reply, but I'm ready for review now.
I was worried that the models' performance on the JCoLA task seemed too poor.
(That's why I tried updating the prompt a few times.)
Based on the comparison with llama-7b's performance on the CoLA task, I believe it's reasonable for now.

Please let me know if I missed anything.

@kumapo marked this pull request as ready for review October 7, 2023 06:55
@kumapo requested a review from jon-tow as a code owner October 7, 2023 06:55
@mkshing commented Oct 10, 2023

@kumapo big thanks for your investigation and PR! And @polm-stability, thank you for your support :)

@mkshing left a comment:

@kumapo could you check the following things? Thank you in advance.

  • Can we remove all the changes in models/ for now? As @kumapo and @polm know, we're evaluating all models soon.
  • Please add JCoLA to docs/jptasks.md.

@mkshing requested review from mkshing and polm-stability and removed request for jon-tow October 10, 2023 04:34
@polm-stability (Collaborator) left a comment:

Looks good!

@mkshing left a comment:

Added prompt version 0.6 for llama2-chat

Review thread on lm_eval/tasks/ja/jcola.py (outdated, resolved)
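(For reference, Llama-2-chat models expect the [INST]/<<SYS>> prompt wrapping. A minimal sketch of building such a prompt follows; the exact wording of the repository's 0.6 template in prompt_templates.md may differ, and the strings below are hypothetical examples.)

```python
# Minimal sketch of a Llama-2-chat style prompt. The system and user
# strings below are hypothetical examples, not the repository's actual
# 0.6 template text.
def build_llama2_chat_prompt(system: str, user: str) -> str:
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

print(build_llama2_chat_prompt(
    "あなたは役に立つアシスタントです。",
    "次の文は文法的に正しいですか?「猫が魚を食べた。」",
))
```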
kumapo and others added 7 commits October 10, 2023 21:05
This reverts commit cd9a914.
This modifies cola, since jcola just inherits this part. It's not a
problem to modify the parent task because it just adds some output.
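(To illustrate why modifying the parent is safe here, a minimal sketch of the inheritance relationship; the class and attribute names are hypothetical, not the repository's exact code.)

```python
# Hypothetical sketch of the CoLA/JCoLA relationship: the child task
# only overrides the data/prompt side, so a metric added to the parent's
# aggregation automatically applies to JCoLA too.
class CoLATask:
    def aggregation(self):
        # Adding e.g. "mcc" or "balanced_acc" here covers every subclass.
        return {"acc": "mean", "mcc": "matthews_corrcoef", "balanced_acc": "mean"}

    def doc_to_text(self, doc):
        return f"{doc['sentence']}\nQuestion: Is this sentence grammatical?\nAnswer:"


class JCoLATask(CoLATask):
    # Japanese prompt; aggregation() is inherited unchanged.
    def doc_to_text(self, doc):
        return f"{doc['sentence']}\n質問:この文は文法的に正しいですか?\n答え:"


print(JCoLATask().aggregation())  # same metrics as the parent CoLA task
```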
@kumapo (Author) commented Oct 10, 2023

Please let me know if I missed anything again!

  • add JCoLAWithLlama2
  • update JCoLA's prompt version to 0.0
  • update jptasks.md and prompt_templates.md
  • revert updates in models/

@mkshing left a comment:

@kumapo thank you for fixing. Can we remove harness.jcola.sh? Other than that, LGTM :)

@kumapo (Author) commented Oct 11, 2023

harness.jcola.sh and all the result files were deleted in this commit.
Does that fulfill your request?

@mkshing commented Oct 11, 2023

@kumapo hi, thank you for your quick work! But I still see harness.jcola.sh in the file changes...

@mkshing commented Oct 11, 2023

@kumapo I deleted the script :)

@mkshing left a comment:

@kumapo thank you for your PR as always! Let's merge this.

@mkshing merged commit 950ba75 into Stability-AI:jp-stable Oct 11, 2023
1 check passed
@kumapo (Author) commented Oct 11, 2023

Sorry, I missed that, but thank you for your quick move as always!

polm-stability pushed a commit to polm-stability/lm-evaluation-harness that referenced this pull request Oct 11, 2023
* WIP: need JCoLA

* Update harness.jcola.sh

* update prompt

* update prompt

* update prompt

* update prompt

* Revert "update prompt"

This reverts commit cd9a914.

* WIP: evaluate on JCoLA

* Add new metrics to cola

This modifies cola, since jcola just inherits this part. It's not a
problem to modify the parent task because it just adds some output.

* Linter edits

* evaluate on JCoLA

* need JCoLAWithLlama2

* JCoLA's prompt version should be 0.0

https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/docs/prompt_templates.md

* documentation

jptasks.md and prompt_templates.md

* won't need harness and result for JCoLA

* fix linter related issue

* Delete harness.jcola.sh

---------

Co-authored-by: Paul O'Leary McCann <[email protected]>
Co-authored-by: mkshing <[email protected]>
polm-stability added a commit that referenced this pull request Nov 6, 2023
* Initial working refactor

This just pulls the argparse stuff into a separate function.

* Do some rearrangement for the refactor

Eval args are necessary, other params are optional.

The print output is only needed when called from the cli, plus it
assumes that various keys are present (even if None), which is not the
case when calling from Python.

* Move main script to scripts dir, add symlink

Other scripts can't import the main script since it's in the top level.
This moves it into the scripts dir and adds a symlink so it's still
usable at the old location.

* Work on adding example Python harness script

* Add notify script

* Fix arg

* task cleanup

* Add versions to tasks

* Fix typo

* Fix versions

* Read webhook URL from env var

* evaluate line-corporation large models (#81)

* compare results between Jsquad prompt with title and without title (#84)

* re-evaluate models with jsquad prompt with title

* update jsquad to include titles into the prompt

* re-evaluate models with jsquad prompt with title

* inherit JSQuAD v1.2 tasks from v1.1 for readability

* re-evaluate models with jsquad prompt with title

* wont need jsquad_v11

* revert result.json and harness.sh in models

* fix format

* Verbose output for more tasks (#92)

* Add output to jaqket v2

* Add details to jsquad

* Add verbose output to xlsum

---------

Co-authored-by: Paul O'Leary McCann <[email protected]>

* Add gptq support (#87)

* add EleutherAI PR519 autoGPTQ

* add comma

* change type

* change type2

* change path

* Undo README modifications

---------

Co-authored-by: webbigdata-jp <[email protected]>

* Add Balanced Accuracy (#95)

* First implementation of balanced accuracy

* Add comment

* Make JNLI a balanced acc task

* Add mcc and balanced f1 scores

---------

Co-authored-by: Paul O'Leary McCann <[email protected]>

* Remove 3.8 version spec from pre-commit config

The version here makes it so that pre-commit can only run in an
environment with python3.8 in the path, but there's no compelling reason
for that. Removing the spec just uses system python.

* Fix Linter Related Issues (#96)

* Change formatting to make the linter happy

This is mostly:

- newlines at end of files
- removing blank lines at end of files
- changing single to double quotes
- black multi-line formatting rules
- other whitespace edits

* Remove codespell

Has a lot of false positives

* boolean style issue

* bare except

These seem harmless enough, so just telling the linter to ignore them

* More linter suggestions

---------

Co-authored-by: Paul O'Leary McCann <[email protected]>

* Simplify neologdn version

This was pointing to a commit, but the relevant PR has been merged and
released for a while now, so a normal version spec can be used.

* Update xwinograd dataset

The old dataset was deleted.

* won't need llama2/llama2-2.7b due to duplication (#99)

* add gekko (#98)

Co-authored-by: webbigdata-jp <[email protected]>

* add llama2 format (#100)

* add llama2 format

* add 0.6 in prompt_templates.md

* make pre-commit pass

* remove debugging line

* fix bug on `mgsm` for prompt version `0.3` (#101)

* Add JCoLA task (#93)

* WIP: need JCoLA

* Update harness.jcola.sh

* update prompt

* update prompt

* update prompt

* update prompt

* Revert "update prompt"

This reverts commit cd9a914.

* WIP: evaluate on JCoLA

* Add new metrics to cola

This modifies cola, since jcola just inherits this part. It's not a
problem to modify the parent task because it just adds some output.

* Linter edits

* evaluate on JCoLA

* need JCoLAWithLlama2

* JCoLA's prompt version should be 0.0

https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/docs/prompt_templates.md

* documentation

jptasks.md and prompt_templates.md

* won't need harness and result for JCoLA

* fix linter related issue

* Delete harness.jcola.sh

---------

Co-authored-by: Paul O'Leary McCann <[email protected]>
Co-authored-by: mkshing <[email protected]>

* Linter fixes

* Remove example - script is used instead of function

* Cleanup

* Cleanup / linter fixes

There were some things related to the old shell script usage that
weren't working, this should fix it.

* Add README section describing cluster usage

---------

Co-authored-by: Paul O'Leary McCann <[email protected]>
Co-authored-by: kumapo <[email protected]>
Co-authored-by: webbigdata-jp <[email protected]>
Co-authored-by: webbigdata-jp <[email protected]>
Co-authored-by: mkshing <[email protected]>