iwiwi memorandum

Rough notes on things I've learned. These are memos for myself.

Evaluation tasks commonly used in lm-eval-harness (LAMBADA, HellaSwag, WinoGrande, PIQA, CoQA)

https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=0

For now, the goal is to understand the tasks used in this spreadsheet.

The LAMBADA dataset: Word prediction requiring a broad discourse context

https://github.com/EleutherAI/lm-evaluation-harness/blob/fdd3dbc3b2871d6a0a5433bfa42f56490ee058ab/lm_eval/tasks/lambada.py
https://arxiv.org/pdf/1606.06031.pdf

LAMBADA stands for LAnguage Modeling Broadened to Account for Discourse Aspects.

(1) Context: “Yes, I thought I was going to lose the baby.” “I was scared too,” he stated, sincerity flooding his eyes. “You were ?” “Yes, of course. Why do you even ask?” “This baby wasn’t exactly planned for.”
Target sentence: “Do you honestly think that I would want you to have a ?”
Target word: miscarriage

The answer is a single word, and accuracy and perplexity are computed as the evaluation metrics.
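As a rough picture of those two metrics, here is a minimal sketch (my own, not the harness's code), assuming we already have, for each example, the summed log likelihood of the target word and a flag for whether that word was the model's greedy continuation.

    import math

    # Toy per-example results: (log likelihood of the target word, was it the greedy prediction?)
    per_example = [
        (-2.3, True),
        (-7.1, False),
        (-0.9, True),
    ]

    # Accuracy: fraction of examples where the target word was the greedy continuation.
    acc = sum(greedy for _, greedy in per_example) / len(per_example)
    # Perplexity: exponentiated negative mean log likelihood of the target words.
    ppl = math.exp(-sum(ll for ll, _ in per_example) / len(per_example))
    print(f"acc={acc:.3f}, ppl={ppl:.3f}")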

HellaSwag: Can a Machine Really Finish Your Sentence?

Problems that require commonsense. Humans can reach >95% accuracy. Apparently, "Adversarial Filtering" is used to generate answer choices that are easy for humans to rule out but that machines tend to get wrong.

A dataset called SWAG already existed and also required commonsense, but BERT ended up reaching human-level performance on it, so apparently they built a harder one.

HuggingFace Dataset

https://huggingface.co/datasets/hellaswag

{'activity_label': 'Removing ice from car',
 'ctx': 'Then, the man writes over the snow covering the window of a car, and '
        'a woman wearing winter clothes smiles. then',
 'ctx_a': 'Then, the man writes over the snow covering the window of a car, '
          'and a woman wearing winter clothes smiles.',
 'ctx_b': 'then',
 'endings': [', the man adds wax to the windshield and cuts it.',
             ', a person board a ski lift, while two men supporting the head '
             'of the person wearing winter clothes snow as the we girls sled.',
             ', the man puts on a christmas coat, knitted with netting.',
             ', the man continues removing the snow on his car.'],
 'ind': 4,
 'label': '3',
 'source_id': 'activitynet~v_-1IBHYS3L-Y',
 'split': 'train',
 'split_type': 'indomain'}

lm-eval-harness

https://github.com/EleutherAI/lm-evaluation-harness/blob/fdd3dbc3b2871d6a0a5433bfa42f56490ee058ab/lm_eval/tasks/hellaswag.py

    def _process_doc(self, doc):
        ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
        out_doc = {
            "query": self.preprocess(doc["activity_label"] + ": " + ctx),
            "choices": [self.preprocess(ending) for ending in doc["endings"]],
            "gold": int(doc["label"]),
        }
        return out_doc

After this preprocessing, the example looks like the following.

{'choices': [', the man adds wax to the windshield and cuts it.',
             ', a person board a ski lift, while two men supporting the head '
             'of the person wearing winter clothes snow as the we girls sled.',
             ', the man puts on a christmas coat, knitted with netting.',
             ', the man continues removing the snow on his car.'],
 'gold': 3,
 'query': 'Removing ice from car: Then, the man writes over the snow covering '
          'the window of a car, and a woman wearing winter clothes smiles. '
          'Then'}

This query is given to the model as the context, and the continuations are evaluated. The evaluation part reuses the implementation from the parent class MultipleChoiceTask as-is: the log likelihood of each of the choices is computed (a small sketch follows).
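A minimal sketch of that scoring step (my own, using a hypothetical loglikelihood(context, continuation) function rather than the harness's request API): one score is computed per choice, and the resulting list is what process_results below receives as results.

    def score_choices(query, choices, loglikelihood):
        # One score per choice: log p(" " + choice | query).
        # (The leading space between context and continuation is an assumption.)
        return [loglikelihood(query, " " + choice) for choice in choices]

    # Dummy scorer purely for illustration: pretend longer continuations are less likely.
    def toy_loglikelihood(ctx, cont):
        return -0.5 * len(cont)

    results = score_choices(
        "Removing ice from car: Then, the man writes over the snow ... Then",
        [", the man adds wax to the windshield and cuts it.",
         ", the man continues removing the snow on his car."],
        toy_loglikelihood,
    )
    print(results)  # one log likelihood per choice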

    def process_results(self, doc, results):
        gold = doc["gold"]

        acc = 1.0 if np.argmax(results) == gold else 0.0
        completion_len = np.array([float(len(i)) for i in doc["choices"]])
        acc_norm = 1.0 if np.argmax(results / completion_len) == gold else 0.0

        return {
            "acc": acc,
            "acc_norm": acc_norm,
        }

In practice, acc_norm is the metric that is generally used. acc_norm appears to normalize by the length of the completion so that answers are neither advantaged nor disadvantaged by their length.
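To make the effect concrete, here is a toy example (invented numbers, not real model output): the correct ending is long, so its summed log likelihood is lower than that of a short distractor and the raw argmax picks the wrong choice, while the length-normalized score recovers the correct one.

    import numpy as np

    choices = ["No.",                                                 # short distractor
               ", the man continues removing the snow on his car."]   # long, correct ending
    results = np.array([-6.0, -30.0])  # toy summed log likelihoods per choice
    gold = 1

    completion_len = np.array([float(len(c)) for c in choices])
    acc = 1.0 if np.argmax(results) == gold else 0.0                        # 0.0: picks the short distractor
    acc_norm = 1.0 if np.argmax(results / completion_len) == gold else 0.0  # 1.0: normalization favors the long correct ending
    print(acc, acc_norm)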

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

https://arxiv.org/abs/1907.10641

This is something like a newer version of The Winograd Schema Challenge (WSC), a dataset created in 2011. WSC had only 273 problems, whereas this one has 44k. WSC is a dataset about commonsense-based reasoning.

The task is to identify what a pronoun in the sentence refers to. The sentences are written so that there are essentially two candidates, and a small change in wording makes either one the correct answer; both variants are actually included as examples. Such a pair is called a twin.

HuggingFace Dataset

https://huggingface.co/datasets/winogrande

Of these, winogrande_xl is the configuration that contains the 44k examples.

{'answer': '2',
 'option1': 'Ian',
 'option2': 'Dennis',
 'sentence': "Ian volunteered to eat Dennis's menudo after already having a "
             'bowl because _ despised eating intestine.'}
{'answer': '1',
 'option1': 'Ian',
 'option2': 'Dennis',
 'sentence': "Ian volunteered to eat Dennis's menudo after already having a "
             'bowl because _ enjoyed eating intestine.'}

As shown above, the twin examples are included separately, one right after the other.

lm-eval-harness

https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/winogrande.py

    def construct_requests(self, doc, ctx):
        target = self.partial_target(doc)
        lls = []
        for option in [doc["option1"], doc["option2"]]:
            partial_ctx = self.partial_context(doc, option)
            full_ctx = self.append_context(ctx, partial_ctx)
            lls.append(rf.loglikelihood(full_ctx, target)[0])
        return lls

partial_context is the string up to the "_" plus the option; partial_target is the string after the "_". In other words, the procedure is as follows (a small sketch follows the list).

  • The string up to the "_", concatenated with each option, is given as the context
  • The log likelihood of the text after the "_" is computed under each context
  • Check whether the larger log likelihood corresponds to the correct option
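A minimal sketch of that splitting, using the first twin example above (this is my own illustration of the idea, not code quoted from the harness):

    doc = {
        "sentence": "Ian volunteered to eat Dennis's menudo after already having a "
                    "bowl because _ despised eating intestine.",
        "option1": "Ian",
        "option2": "Dennis",
        "answer": "2",
    }

    pronoun_loc = doc["sentence"].index("_")
    # Text after "_": this is the part whose log likelihood gets compared.
    target = doc["sentence"][pronoun_loc + 1:].strip()
    # Text up to "_" with each option substituted in: these are the two contexts.
    contexts = [doc["sentence"][:pronoun_loc] + doc[key] for key in ("option1", "option2")]

    # A real run would ask the model for log p(target | context) for both contexts
    # and predict the option whose context yields the larger value.
    print(target)
    for c in contexts:
        print(c)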

PIQA: Reasoning about Physical Commonsense in Natural Language

https://arxiv.org/abs/1911.11641

As "Physical" suggests, the questions probe understanding of the physical world, e.g., asking about the tools or steps needed to do something. Each problem has two choices.

HuggingFace Dataset

https://huggingface.co/datasets/piqa

{'goal': "When boiling butter, when it's ready, you can",
 'label': 1,
 'sol1': 'Pour it onto a plate',
 'sol2': 'Pour it into a jar'}

lm-eval-harness

It is implemented as a MultipleChoiceTask.

        out_doc = {
            "goal": doc["goal"],
            "choices": [doc["sol1"], doc["sol2"]],
            "gold": doc["label"],
        }

HellaSwag is also a MultipleChoiceTask, but for PIQA it seems that acc, rather than acc_norm, is used. Presumably that is because, as in the example above, the choices are similar sentences of roughly the same length.

CoQA: A Conversational Question Answering Challenge

https://arxiv.org/abs/1808.07042

There is a story, and question answering about it is carried out in a conversational format.

HuggingFace Dataset

What lm-eval-harness uses is not the HF Dataset itself.

https://github.com/EleutherAI/lm-evaluation-harness/blob/d145167959c2b1826d900524912cb99c44d5fb30/lm_eval/datasets/coqa/coqa.py

{'additional_answers': {'0': {'input_text': [''],
                              'span_end': [-1],
                              'span_start': [-1],
                              'span_text': [''],
                              'turn_id': [-1]},
                        '1': {'input_text': [''],
                              'span_end': [-1],
                              'span_start': [-1],
                              'span_text': [''],
                              'turn_id': [-1]},
                        '2': {'input_text': [''],
                              'span_end': [-1],
                              'span_start': [-1],
                              'span_text': [''],
                              'turn_id': [-1]}},
 'answers': {'input_text': ['It was formally established in 1475',
                            'research',
                            'history, and law',
                            'philosophy, science and theology',
                            'a  project',
                            'into periods',
                            'five',
                            'The Vatican Apostolic Library',
                            'in Vatican City',
                            '1.1 million',
                            'at the beginning of the 17th century;',
                            '150,000',
                            'anyone who can document their qualifications and '
                            'research needs.',
                            'unknown',
                            'Photocopies',
                            'only books published between 1801 and 1990',
                            'the Holy See',
                            'a handful of volumes',
                            'digitising manuscripts',
                            'them to be viewed online.'],
             'span_end': [179,
                          494,
                          511,
                          545,
                          879,
                          1127,
                          1128,
                          94,
                          150,
                          412,
                          1009,
                          1046,
                          643,
                          -1,
                          764,
                          724,
                          125,
                          1384,
                          881,
                          910],
             'span_start': [151,
                            454,
                            457,
                            457,
                            769,
                            1048,
                            1048,
                            4,
                            94,
                            328,
                            917,
                            915,
                            546,
                            -1,
                            643,
                            644,
                            78,
                            1192,
                            785,
                            868],
             'span_text': ['Formally established in 1475',
                           'he Vatican Library is a research library',
                           'Vatican Library is a research library for history, '
                           'law',
                           'Vatican Library is a research library for history, '
                           'law, philosophy, science and theology',
                           'March 2014, the Vatican Library began an initial '
                           'four-year project of digitising its collection of '
                           'manuscripts',
                           'Scholars have traditionally divided the history of '
                           'the library into five period',
                           'Scholars have traditionally divided the history of '
                           'the library into five periods',
                           'Vatican Apostolic Library (), more commonly called '
                           'the Vatican Library or simply the Vat, ',
                           'is the library of the Holy See, located in Vatican '
                           'City.',
                           ' It has 75,000 codices from throughout history, as '
                           'well as 1.1 million printed books',
                           'atican Secret Archives were separated from the '
                           'library at the beginning of the 17th century;',
                           ' Vatican Secret Archives were separated from the '
                           'library at the beginning of the 17th century; they '
                           'contain another 150,000 items. ',
                           ' The Vatican Library is open to anyone who can '
                           'document their qualifications and research needs. ',
                           'unknown',
                           'Photocopies for private study of pages from books '
                           'published between 1801 and 1990 can be requested '
                           'in person or by mail. ',
                           'hotocopies for private study of pages from books '
                           'published between 1801 and 1990',
                           'simply the Vat, is the library of the Holy See,',
                           'Pre-Lateran period, comprising the initial days of '
                           'the library, dated from the earliest days of the '
                           'Church. Only a handful of volumes survive from '
                           'this period, though some are very significant',
                           'Vatican Library began an initial four-year project '
                           'of digitising its collection of manuscripts, ',
                           'manuscripts, to be made available online. '],
             'turn_id': [1,
                         2,
                         3,
                         4,
                         5,
                         6,
                         7,
                         8,
                         9,
                         10,
                         11,
                         12,
                         13,
                         14,
                         15,
                         16,
                         17,
                         18,
                         19,
                         20]},
 'id': '3zotghdk5ibi9cex97fepx7jetpso7',
 'questions': {'input_text': ['When was the Vat formally opened?',
                              'what is the library for?',
                              'for what subjects?',
                              'and?',
                              'what was started in 2014?',
                              'how do scholars divide the library?',
                              'how many?',
                              'what is the official name of the Vat?',
                              'where is it?',
                              'how many printed books does it contain?',
                              'when were the Secret Archives moved from the '
                              'rest of the library?',
                              'how many items are in this secret collection?',
                              'Can anyone use this library?',
                              'what must be requested to view?',
                              'what must be requested in person or by mail?',
                              'of what books?',
                              'What is the Vat the library of?',
                              'How many books survived the Pre Lateran period?',
                              'what is the point of the project started in '
                              '2014?',
                              'what will this allow?'],
               'turn_id': [1,
                           2,
                           3,
                           4,
                           5,
                           6,
                           7,
                           8,
                           9,
                           10,
                           11,
                           12,
                           13,
                           14,
                           15,
                           16,
                           17,
                           18,
                           19,
                           20]},
 'source': 'wikipedia',
 'story': 'The Vatican Apostolic Library (), more commonly called the Vatican '
          'Library or simply the Vat, is the library of the Holy See, located '
          'in Vatican City. Formally established in 1475, although it is much '
          'older, it is one of the oldest libraries in the world and contains '
          'one of the most significant collections of historical texts. It has '
          '75,000 codices from throughout history, as well as 1.1 million '
          'printed books, which include some 8,500 incunabula. \n'
          '\n'
          'The Vatican Library is a research library for history, law, '
          'philosophy, science and theology. The Vatican Library is open to '
          'anyone who can document their qualifications and research needs. '
          'Photocopies for private study of pages from books published between '
          '1801 and 1990 can be requested in person or by mail. \n'
          '\n'
          'In March 2014, the Vatican Library began an initial four-year '
          'project of digitising its collection of manuscripts, to be made '
          'available online. \n'
          '\n'
          'The Vatican Secret Archives were separated from the library at the '
          'beginning of the 17th century; they contain another 150,000 '
          'items. \n'
          '\n'
          'Scholars have traditionally divided the history of the library into '
          'five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and '
          'Vatican. \n'
          '\n'
          'The Pre-Lateran period, comprising the initial days of the library, '
          'dated from the earliest days of the Church. Only a handful of '
          'volumes survive from this period, though some are very significant.'}

lm-eval-harness

The model receives the story and several QA turns, then answers the final question. The context runs up to "A:". The model keeps generating text until it outputs "\nQ:". The generated answer is then evaluated. A rough sketch of the prompt layout follows.
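This sketch is based on my own assumptions about the exact formatting (the harness's real doc_to_text may use different separators); it only illustrates the layout described above.

    def build_coqa_prompt(story, questions, answers):
        # `answers` has one fewer element than `questions`: the last question is the
        # one the model must answer.
        lines = [story, ""]
        for i, question in enumerate(questions):
            lines.append("Q: " + question)
            if i < len(answers):
                lines.append("A: " + answers[i])
            else:
                lines.append("A:")  # the model continues from here until it emits "\nQ:"
        return "\n".join(lines)

    prompt = build_coqa_prompt(
        "The Vatican Apostolic Library (), more commonly called the Vatican Library ...",
        ["When was the Vat formally opened?", "what is the library for?"],
        ["It was formally established in 1475"],
    )
    print(prompt)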

There are two evaluation metrics.

  • exact: check whether the answer, after normalization, exactly matches the reference
  • F1: after tokenization, compute an F1 score based on precision and recall over the sets of words (a small sketch follows the list)
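A minimal sketch of that token-overlap F1 (SQuAD-style; the actual normalization, tokenization, and multi-reference handling in the harness may differ):

    from collections import Counter

    def token_f1(prediction, reference):
        pred_tokens = prediction.lower().split()
        ref_tokens = reference.lower().split()
        common = Counter(pred_tokens) & Counter(ref_tokens)  # overlapping tokens, with multiplicity
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("in Vatican City", "Vatican City"))  # 0.8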

A single question can have multiple reference answers, in which case the scores are averaged... I think, but what exactly is the compute_scores implementation doing?