Papers #3

Open
shm007g opened this issue Apr 19, 2023 · 7 comments

Comments


shm007g commented Apr 19, 2023

No description provided.


shm007g commented Apr 19, 2023

[Instruction tuning with GPT-4, Microsoft, 2023.04]

  • This paper aims to build the first self-instruct-tuned LLM based on LLaMA-7B using GPT-4 responses. blog.
  • 1st, it collects a 52K English instruction dataset (plus 52K Chinese via translation) by sending the 52K prompt inputs of the Alpaca dataset to GPT-4. It then performs supervised fine-tuning (as in Stanford Alpaca) on this dataset to get the model LLaMA-GPT4(-CN)(-7B).
  • 2nd, it trains a reward model based on OPT-1.3B. Because hand-labeling a comparison dataset is expensive and GPT-4 is good at judging quality, it uses GPT-4 to assign scores in [1, 10] to the different responses for each prompt input.
  • 3rd, to evaluate the self-instruct-tuned model on unseen instructions, it chooses 3 instruction-following datasets: User-Oriented-Instructions-252, Vicuna-Instructions-80, and Unnatural Instructions.
    • It uses Amazon Mechanical Turk for human evaluation of model generations on the User-Oriented-Instructions-252 dataset with the 3H alignment criteria. The GPT-4-instruction-tuned LLaMA-GPT4(-7B) performs very comparably to the original GPT-4.
    • It uses GPT-4 to perform automatic evaluation of different SOTA models on Vicuna-Instructions-80. For each evaluation, it asks GPT-4 to rate the response quality of 2 models against each other, with scores from 1 to 10 (a minimal sketch of this judging step follows this list). LLaMA-GPT4(-7B), fine-tuned on GPT-4 outputs, works better than Alpaca-13B (fine-tuned on ChatGPT outputs). This suggests GPT-4 outputs are much better for instruction tuning.
    • Automatic evaluation on the Chinese Vicuna-Instructions-80 (translated by GPT-4). Vicuna-13B also performs well.
    • It computes ROUGE-L on Unnatural Instructions, evaluated on 9K samples. LLaMA-GPT4-7B performs better when the response is long.
  • Most importantly, regarding the reward model: Figure 4 shows that the 1.3B regression reward model, fine-tuned on the GPT-4-generated comparison dataset, works nearly as well as the original GPT-4. This points to a very promising way to perform RLHF and the full 3-step fine-tuning pipeline (as in ChatGPT) in future work.
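A minimal sketch of the GPT-4-as-judge scoring step described above: for one prompt and two candidate responses, GPT-4 is asked to return a 1-10 score for each. The judging prompt wording and the pre-1.0 openai client usage are my own assumptions, not the paper's exact setup.

```python
# Hypothetical GPT-4-as-judge scorer: rate two responses to one prompt, 1-10 each.
import re
import openai  # pre-1.0 openai-python client (assumed)

JUDGE_TEMPLATE = """Rate the responses of two AI assistants to the user question below.
Give each assistant an overall score from 1 to 10 for helpfulness, relevance,
accuracy, and level of detail. Output only the two scores on the first line,
separated by a space, then a short explanation.

[Question]
{question}

[Assistant 1]
{answer_1}

[Assistant 2]
{answer_2}"""

def judge_pair(question: str, answer_1: str, answer_2: str) -> tuple[float, float]:
    """Ask GPT-4 to score two candidate responses from 1 to 10 each."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer_1=answer_1, answer_2=answer_2)
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    first_line = resp["choices"][0]["message"]["content"].strip().splitlines()[0]
    s1, s2 = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", first_line)[:2]]
    return s1, s2
```

The same kind of scoring call, run over all candidate responses for a prompt, is also how a GPT-4-labeled comparison dataset for the OPT-1.3B reward model could be collected.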

@shm007g shm007g changed the title from LLM Papers to LLM Theory on Apr 25, 2023

shm007g commented May 12, 2023

[PaLM 2 Technical Report, Google, 2023.05]

  • scaling law: model and data should be scaled in roughly equal proportion (1:1) following a power law, so data size is at least as important as model size; data selection and efficient architectures/objectives improve performance as well; the pre-training mixture is more multilingual and diverse, extending across hundreds of languages and domains; built on the strong UL2 objective (20B model in the UL2 paper); the largest PaLM 2-L is significantly smaller than the largest PaLM (540B) but much better.
  • scaling law experiment: there is an optimal parameter count at each compute scale, e.g. 10.7B at 10^22 FLOPs, 3.35B at 10^21, and 1.04B at 10^20 (a back-of-the-envelope fit of these numbers follows this list).
  • model size: three variants of PaLM 2: a Small (S), Medium (M), and Large (L) version. PaLM 2 refers to the Large version. The blog says there will be four sizes, from smallest to largest: Gecko, Otter, Bison and Unicorn.
  • Evaluation: six high-level categories: language proficiency exams (a human benchmark), plus the academic benchmarks classification and question answering, reasoning, coding, translation, and natural language generation.
  • (1) Language proficiency exams (multilingual): PaLM 2 passes all 6 professional language proficiency exams following the C2 definition, covering Chinese, Japanese, Italian, French, Spanish, and German. It received generic instruction fine-tuning without exam contents, and passes the exams with zero-shot prompting, graded by native human evaluators.
  • (2) Classification and question answering: datasets commonly used in the LLM literature, plus multilingual capabilities.
  • (2.1) English QA and classification tasks (one-shot setting)
    • Open-domain closed-book question answering tasks: TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), and WebQuestions (Berant et al., 2013)
    • Cloze and completion tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), and StoryCloze
      (Mostafazadeh et al., 2016)
    • Winograd-style tasks: Winograd (Levesque et al., 2012) and WinoGrande (Sakaguchi et al., 2021)
    • Reading comprehension: SQuAD v2 (Rajpurkar et al., 2018) and RACE (Lai et al., 2017)
    • Common sense reasoning: PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and OpenBookQA (Mihaylov
      et al., 2018)
    • SuperGLUE (Wang et al., 2019)
    • Natural language inference: Adversarial NLI (ANLI; Nie et al., 2020)
  • (2.2) Multilingual QA (one-shot and no-context settings): TyDi QA (Clark et al., 2020)
  • (2.3) Multilingual toxicity classification
    • Toxicity classification with CivilComments
    • Multilingual toxicity classification with Jigsaw Multilingual
  • (3) Reasoning:
  • (3.1) representative reasoning datasets in a few-shot setting: WinoGrande (Sakaguchi et al., 2021), ARC-C (Clark et al., 2018), DROP (Dua et al., 2019), StrategyQA (Geva et al., 2021), CommonsenseQA (CSQA; Talmor et al., 2019), XCOPA (Ponti et al., 2020), and BIG-Bench (BB) Hard (23 tasks out of 200+ where LLMs performed below the average human rater) (Suzgun et al., 2022). Competitive with GPT-4.
    • Multilingual common sense reasoning: XCOPA
    • BIG-Bench (BB) Hard: 23 tasks out of 200+ where LLMs performed below the average human rater, e.g. multi-step arithmetic problems (multistep_arithmetic)
  • (3.2) Mathematical reasoning: fine-tuned on the Flan dataset (1,800 tasks, at least 20 instruction templates per task)
    • MATH (Hendrycks et al., 2021), which contains 12,500 problems from high school competitions in 7 mathematics subject areas
    • GSM8K (Cobbe et al., 2021), a dataset of 8,500 grade school math word problems
    • MGSM (Shi et al., 2023), a multilingual version of GSM8K with translations of a subset of examples into ten typologically diverse languages.
  • (4) Coding: train the PaLM 2-S model on an extended, code-heavy, heavily multilingual data mixture, resulting model PaLM 2-S*.
    • Code Generation: 3 coding datasets: HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and ARCADE (Yin et al., 2022), PaLM 2-S* outperforms PaLM-540B-Coder on all benchmarks with few-shot setting.
    • Multilingual Evaluation: BabelCode (Orlanski et al., 2023), which translates HumanEval into a variety of other programming languages including C++, Java, Go, Haskell, and Julia.
  • (5) Translation
    • WMT21 Experimental Setup: automatic metric using BLEURT, human metric using Multidimensional Quality Metrics (MQM) with hired professional translators
    • Regional translation experimental setup: FRMT benchmark
    • Potential misgendering harms
  • (6) Natural language generation: ROUGE in a 1-shot setting
    • XLSum (Hasan et al., 2021), which asks a model to summarize a news article
    • WikiLingua (Ladhak et al., 2020), which focuses on generating section headers for step-by-step instructions from WikiHow
    • XSum (Narayan et al., 2018), which tasks a model with generating a news article’s first sentence
    • Potential harms and bias: ParlAI Dialogue Safety, RealToxicityPrompts, BBQ Bias Benchmark for QA, Multilingual Representational Bias
    • Multilingual capabilities: Explaining jokes, Explaining translation ambiguities, Translating into dialects, Expanding abbreviations and fixing typos, Converting formal text into colloquial chat text, Transliterating into new scripts
  • (7) Memorization
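A back-of-the-envelope check of the scaling-law numbers quoted above (compute-optimal sizes at 10^20, 10^21, 10^22 FLOPs). The fit and the 10^23 extrapolation are my own illustration, not figures from the report; they just show that the quoted points imply roughly N_opt ~ C^0.5, i.e. parameters and data scaling in equal proportion when C ~ 6*N*D.

```python
# Fit a power law N_opt = a * C^b to the three compute-optimal points quoted above.
import numpy as np

flops = np.array([1e20, 1e21, 1e22])          # training compute C
params = np.array([1.04e9, 3.35e9, 10.7e9])   # compute-optimal parameter count N_opt

slope, intercept = np.polyfit(np.log10(flops), np.log10(params), 1)
print(f"N_opt ~ C^{slope:.2f}")  # exponent close to 0.5

# Illustration only: extrapolate the fit to 10^23 FLOPs.
n_at_1e23 = 10 ** (intercept + slope * 23)
print(f"extrapolated N_opt at 1e23 FLOPs: {n_at_1e23 / 1e9:.1f}B")
```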


shm007g commented May 15, 2023

[GPT-4 Technical Report, OpenAI, 2023.03]

  • no further details about architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
  • Multi-modal: accepts image and text inputs and produces text outputs.
  • Academic and professional exams (designed for humans): exhibits human-level performance on the majority of these exams.
  • traditional NLP benchmarks: outperforms previous LLMs and systems; academic benchmarks: MMLU, HellaSwag, AI2 Reasoning Challenge (ARC), WinoGrande, HumanEval, DROP, GSM-8K.
  • HumanEval dataset: mean log pass rate is predictable from smaller models via a power law in training compute, just like the final loss (a fitting sketch follows this list).
  • inverse scaling prize: Hindsight Neglect; GPT-4 reverses the trend.
  • open sourcing OpenAI Evals: https://github.com/openai/evals
  • Visual Input: parallel to text-only setting;
  • hallucinations: GPT-4 reduces hallucinations relative to GPT-3.5, scoring 19 percentage points higher on OpenAI internal factuality evaluations (covering learning, technology, writing, history, math, science, recommendation, code, business).
  • TruthfulQA: RLHF post-trained GPT-4 is much better than GPT-3.5; it lacks knowledge of events after September 2021, where the majority of its pre-training data cuts off.
  • not fully reliable: hallucinations, limited context window, does not learn from experience.
  • brings novel safety challenges.
  • developed infrastructure and optimization methods with predictable behavior across multiple scales of compute.
  • GPT-4 System Card: more than half the length of the paper.
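A sketch of the "predictable scaling" procedure mentioned above: fit a power law to the mean log pass rate of much smaller training runs and extrapolate to the full run. The compute and pass-rate numbers below are hypothetical placeholders (the report does not publish them); only the fitting procedure is illustrated.

```python
# Fit -E[log pass_rate] = alpha * C^(-k) on small runs, then extrapolate to full compute.
import numpy as np
from scipy.optimize import curve_fit

compute = np.array([1e-5, 1e-4, 1e-3, 1e-2])        # normalized training compute (hypothetical)
mean_log_pass = np.array([-6.0, -4.2, -3.0, -2.1])  # E[log pass_rate] per run (hypothetical)

def power_law(c, alpha, k):
    return alpha * c ** (-k)

(alpha, k), _ = curve_fit(power_law, compute, -mean_log_pass, p0=(1.0, 0.1))

# Extrapolate to the full training run (C = 1 in these normalized units).
pred_log_pass = -power_law(1.0, alpha, k)
print(f"predicted mean log pass rate at full compute: {pred_log_pass:.2f}")
print(f"predicted mean pass rate: {np.exp(pred_log_pass):.1%}")
```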


shm007g commented May 15, 2023

[Sparks of Artificial General Intelligence: Early experiments with GPT-4, MSFT, 2023.04]

  • (1) refinement: the experiments were refined over the span of a month
  • (2) Multimodal and interdisciplinary composition: not only does it demonstrate a high level of proficiency in different domains such as literature, medicine, law, mathematics, physical sciences, and programming, it is also able to combine skills; it understands image and text input and can manipulate text and images in a genuine way, not just copy them; it does not understand harmony in music.
  • (3) Code: it can reason about code execution, simulate the effects of instructions, and explain the results in natural language, even for pseudocode;
  • HumanEval, a description-to-code benchmark; LeetCode, 100 samples per difficulty level, pass within the first 5 attempts (a pass@k sketch follows these notes); real-world tasks: data visualization, front-end / game development, writing code for deep learning, interfacing with LaTeX;
  • understands existing code, reasons about code execution; executing Python code (plugin?);
  • (4) Mathematical abilities
  • GSM8K: an elementary school math dataset containing 8,000 questions on topics such as arithmetic, fractions, geometry, and word problems;
  • MATH: a high school math dataset containing 12,500 questions on topics such as algebra, calculus, trigonometry, and probability;
  • MMMLU-STEM: 2,000 multiple-choice questions covering high school and college STEM topics;
  • Minerva, a specially fine-tuned math model, scores between text-davinci-003 and GPT-4; GPT-4 makes many mistakes on MATH due to arithmetic and calculation errors;
  • Fermi questions: require both quantitative thinking and general knowledge; the model doesn't make much progress;
  • Higher-level mathematics: 2022 International Mathematical Olympiad;
  • (5) Real-world interaction: tool use and embodied interaction;
  • (6) Interaction with humans: successfully passes the Sally-Anne test, a classic false-belief test; miscommunication and misunderstanding; explainability;
  • (7) Discriminative capabilities: different aspects and situations; personally identifiable information (PII); text anonymization benchmark (TAB); TruthfulQA, for misconceptions and fact-checking;
  • (8) Limitations: lack of planning in arithmetic/reasoning problems; long-term memory;
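The HumanEval/LeetCode "pass within k attempts" numbers above are conventionally scored with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021): sample n completions per problem, count the c correct ones, and average 1 - C(n-c, k)/C(n, k) over problems. Whether the Sparks paper uses exactly this estimator for its LeetCode results is an assumption; the sketch below just shows the standard calculation.

```python
# Unbiased pass@k estimator for one problem (n samples drawn, c of them correct).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k), computed in a numerically stable way."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 100 samples per problem, 20 of them pass the tests -> estimate pass@5.
print(f"pass@5 = {pass_at_k(100, 20, 5):.3f}")
```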


shm007g commented May 25, 2023

OpenAI Research

InstructGPT: [Training language models to follow instructions with human feedback, OpenAI, 2022.03]

GPT3: [Language Models are Few-Shot Learners, OpenAI, 2020.05]

GPT2

GPT1

Other research

https://openai.com/research/techniques-for-training-large-neural-networks
https://openai.com/research/sparse-transformer
https://openai.com/research/measuring-goodharts-law
https://openai.com/research/webgpt
https://openai.com/research

@shm007g shm007g changed the title from LLM Theory to Papers on May 30, 2023

shm007g commented May 31, 2023

Prompt Tuning

  • prompt tuning
  • prefix tuning
  • p-tuning
  • p-tuning-v2

[Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021/01, Stanford]

[The Power of Scale for Parameter-Efficient Prompt Tuning, 2021/09, Google]

  • conditioning a frozen model with soft prompts; outperforms GPT-3's few-shot learning with discrete text prompts on downstream tasks; benefits include robustness to domain transfer and efficient "prompt ensembling".
  • model tuning/fine-tuning: all model parameters are tuned; prompt design: a task description and examples given to a frozen big model; soft prompts perform much better than prompt design and reach performance comparable to model tuning as the model gets big (a minimal soft-prompt sketch follows these notes);
  • other methods: automated prompt design, e.g. searching the discrete space of words; prefix-tuning, which backpropagates errors to prefix tensors/activations;
  • this paper: prompt tuning;
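A minimal sketch of soft prompt tuning: a small matrix of learnable "virtual token" embeddings is prepended to the input embeddings of a frozen LM, and only that matrix is trained. The backbone choice (gpt2), random initialization (the paper initializes from vocabulary embeddings), and hyperparameters are illustrative assumptions.

```python
# Soft prompt tuning sketch: train only a prepended block of virtual-token embeddings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"        # illustrative small backbone
n_virtual_tokens = 20      # length of the soft prompt

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
for p in model.parameters():           # freeze the whole backbone
    p.requires_grad = False

embed = model.get_input_embeddings()   # frozen token-embedding layer
soft_prompt = torch.nn.Parameter(torch.randn(n_virtual_tokens, embed.embedding_dim) * 0.02)
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

def train_step(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids            # (1, T)
    tok_embeds = embed(ids)                                         # (1, T, D)
    inputs = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], 1)   # (1, P+T, D)
    # Mask the virtual-token positions out of the LM loss with -100 labels.
    labels = torch.cat([torch.full((1, n_virtual_tokens), -100, dtype=torch.long), ids], 1)
    loss = model(inputs_embeds=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

print(train_step("Translate to French: cheese => fromage"))
```

At inference time the learned soft_prompt is prepended the same way, which is why prompt ensembling is cheap: many soft prompts can share one frozen backbone.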

[GPT Understands, Too, 2021/03, Tsinghua, Peking, BAAI]

[P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks, 2022/03, Tsinghua, BAAI]


shm007g commented May 31, 2023

Google Research

T5, Flan-T5

Pathways, UL2, MoE


[LLMs are Zero-Shot Rankers for Recommender Systems]
[Amazon, Text Is All You Need: Learning Language Representations for Sequential Recommendation]
A new alternative to RLHF just dropped! https://twitter.com/rasbt/status/1663883300522295296
[Direct Preference Optimization: Your Language Model is Secretly a Reward Model, https://arxiv.org/abs/2305.18290] https://github.com/eric-mitchell/direct-preference-optimization (a sketch of the DPO loss follows this list)
LAION-AI/Open-Assistant#3347
Distilling Step-by-Step: outperforming larger LLMs with less training data and smaller model sizes
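A sketch of the DPO objective from the paper linked above: given the summed log-probabilities of a chosen and a rejected response under the trained policy and a frozen reference model, minimize -log sigmoid(beta * ((log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l)))). Tensor names and the beta value are illustrative; computing the per-sequence log-probs from a model is omitted.

```python
# DPO loss over a batch of (chosen, rejected) response pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss; each input is one log-prob per sequence."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps        # implicit reward of y_w
    rejected_margin = policy_rejected_logps - ref_rejected_logps  # implicit reward of y_l
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of summed per-sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)
```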
