<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">62004</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.062004</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Joint Generation of Distractors for Multiple-Choice Questions: A Text-to-Text Approach</article-title>
<alt-title alt-title-type="left-running-head">Joint Generation of Distractors for Multiple-Choice Questions: A Text-to-Text Approach</alt-title>
<alt-title alt-title-type="right-running-head">Joint Generation of Distractors for Multiple-Choice Questions: A Text-to-Text Approach</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Rodriguez-Torrealba</surname><given-names>Ricardo</given-names></name></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Garcia-Lopez</surname><given-names>Eva</given-names></name><email>eva.garcial@uah.es</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Garcia-Cabot</surname><given-names>Antonio</given-names></name></contrib>
<aff id="aff-1"><institution>Departamento de Ciencias de la Computaci&#x00F3;n, Universidad de Alcal&#x00E1;, Alcal&#x00E1; de Henares</institution>, <addr-line>Madrid, 28801</addr-line>, <country>Spain</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Eva Garcia-Lopez. Email: <email>eva.garcial@uah.es</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>16</day><month>04</month><year>2025</year>
</pub-date>
<volume>83</volume>
<issue>2</issue>
<fpage>1683</fpage>
<lpage>1705</lpage>
<history>
<date date-type="received">
<day>08</day>
<month>12</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>3</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_62004.pdf"></self-uri>
<abstract>
<p>Generating good-quality distractors is a key and time-consuming task associated with multiple-choice questions (MCQs), one of the assessment items that have dominated the educational field for years. Recent advances in language models and architectures present an opportunity to help teachers generate and update these elements at the speed and scale demanded by the widespread growth of online education. This study focuses on a text-to-text approach for the joint generation of distractors for MCQs, where the context, question, and correct answer are used as input and the set of distractors is the output, allowing three distractors to be generated in a single model inference. By fine-tuning Flan-T5 models and LongT5 with TGlobal attention on a RACE-based dataset, the potential of this approach is explored, demonstrating an improvement in the BLEU and ROUGE-L metrics over previous works and a GPT-3.5 baseline. Additionally, BERTScore is introduced in the evaluation, showing that the fine-tuned models generate distractors semantically close to the references, although the GPT-3.5 baseline still outperforms them in this area. A tendency to duplicate distractors is noted, although models fine-tuned with Low-Rank Adaptation (LoRA) and 4-bit quantization showed a significant reduction in duplicated distractors.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Text-to-text</kwd>
<kwd>distractor generation</kwd>
<kwd>fine-tuning</kwd>
<kwd>FlanT5</kwd>
<kwd>LongT5</kwd>
<kwd>multiple-choice</kwd>
<kwd>questionnaire</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Universidad de Alcal&#x00E1; (UAH)</funding-source>
<award-id>PIUAH21/IA-010</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Comunidad Aut&#x00F3;noma de Madrid</funding-source>
<award-id>CM/JIN/2021-034</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>The educational field employs a diverse array of assessment instruments, each serving different purposes and learning outcomes. Among these, multiple-choice quizzes have been fundamental, playing an important role over the years [<xref ref-type="bibr" rid="ref-1">1</xref>]. Today, they continue to be a valuable assessment tool [<xref ref-type="bibr" rid="ref-2">2</xref>], and their success and acceptance in education are due to two main reasons. First, they facilitate the measurement of different types and levels of acquired knowledge in different domains, including higher-order cognitive abilities such as synthesis and problem-solving [<xref ref-type="bibr" rid="ref-3">3</xref>]. Second, they are easy and quick to administer, allowing objective grading [<xref ref-type="bibr" rid="ref-4">4</xref>].</p>
<p>The basic structure of a multiple-choice item consists of three elements: the question to be answered (also called stem), the correct answer, and the incorrect (or in some cases, partially incorrect) options called distractors [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>]. In some cases, the stem also includes the context of the question, something common in reading comprehension assessments. When developing multiple-choice questions (MCQs), selecting plausible and effective distractors is crucial for setting the difficulty level of items, minimizing random guessing, and distinguishing between the different cognitive levels of students [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>].</p>
<p>The latest advances in Natural Language Processing (NLP) and new Large Language Models (LLMs) offer the possibility of assisting educators in routine tasks such as assessment generation, so they can spend more time with their students, motivating them and sharing their knowledge [<xref ref-type="bibr" rid="ref-7">7</xref>]. Although distractor generation (DG) is a key and time-consuming component of MCQs for student assessments, it has not received as much attention in the NLP community as other tasks such as question answering (QA) or question generation (QG) [<xref ref-type="bibr" rid="ref-8">8</xref>]. This lack of popularity can be attributed to several factors, including the absence of standard benchmarks, metrics, and specific datasets dedicated to DG [<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
<p>However, recent studies have begun to bridge this gap by exploring the automatic generation of distractors using Reading Comprehension (RC) datasets and MCQs datasets as a source of data [<xref ref-type="bibr" rid="ref-9">9</xref>&#x2013;<xref ref-type="bibr" rid="ref-13">13</xref>]. This shift shows an emerging interest in understanding how the DG task can be improved.</p>
<p>The fundamental role of distractors in MCQs is to confuse students by introducing plausible alternatives that challenge their understanding and application of knowledge. If this is not achieved, the quality of the multiple-choice item is compromised, as students might identify the correct answer without requiring the application of the knowledge or skills that are being assessed [<xref ref-type="bibr" rid="ref-13">13</xref>].</p>
<p>To address this problem, guidelines and recommendations for manually generating high-quality distractors have been developed in the past [<xref ref-type="bibr" rid="ref-4">4</xref>]. However, a concrete methodology for evaluating the quality of automatically generated distractors remains elusive. Currently, researchers rely on standard text generation metrics such as BLEU [<xref ref-type="bibr" rid="ref-14">14</xref>] and ROUGE [<xref ref-type="bibr" rid="ref-15">15</xref>], which may not fully capture the nuanced effectiveness of a distractor.</p>
<p>Despite the growing interest in the automatic generation of distractors, many existing approaches lack an integrated mechanism to ensure that all generated distractors share a semantic relationship while remaining distinct from both the correct answer and each other. Methods that generate one distractor at a time [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>] (e.g., approaches based on beam search [<xref ref-type="bibr" rid="ref-16">16</xref>]) may struggle to maintain plausibility or semantic diversity across all distractors. In addition, they frequently rely on ranking or filtering steps, increasing complexity and computation. This opens an opportunity to explore approaches that jointly generate all distractors, allowing the model to capture cross-dependencies and potentially avoid distractors that are too obvious or overly similar. Joint generation can improve the semantic relevance, diversity, and overall quality of distractors compared to more traditional, single-output methods.</p>
<p>In response to these challenges, this research focuses on improving the creation of distractors for MCQs in the context of RC datasets. A joint generation of distractors (i.e., all at once) using a text-to-text approach is proposed, fine-tuning the Flan-T5 [<xref ref-type="bibr" rid="ref-17">17</xref>] and LongT5 [<xref ref-type="bibr" rid="ref-18">18</xref>] models on the RACE dataset [<xref ref-type="bibr" rid="ref-19">19</xref>]. This approach is designed to generate all distractors at once in a single inference step, potentially offering a more cohesive and contextually relevant set of options, as an alternative to the more common methods that output a single distractor at a time, as previously mentioned.</p>
<p>For evaluating the generated distractors, in addition to the standard BLEU and ROUGE metrics, an analysis of the semantic distance between the distractors and the correct answers is incorporated, with outputs compared across different models and datasets. In addition to the RACE dataset, the MCTest, SciQ, and OpenBookQA datasets are included in the evaluation framework, enabling the assessment of performance across various contexts that differ from the training data. Additionally, BERTScore [<xref ref-type="bibr" rid="ref-20">20</xref>] is utilized to assess the semantic relevance of distractors in relation to the references. Finally, a grammar check for each distractor is performed, comparing results across datasets and models. This approach helps to narrow the gap in the evaluation of distractor effectiveness.</p>
<p>In summary, the main contributions of this research are (1) different versions of Flan-T5 and LongT5 models fine-tuned for the DG task; (2) an alternative approach to jointly generate distractors using a text-to-text paradigm; and (3) the incorporation of BERTScore and cosine similarity analysis into the evaluation framework, offering a comprehensive assessment of the semantic proximity and diversity of the generated distractors.</p>
<p>The rest of this paper is organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> reviews current research and models related to the distractor generation task. In <xref ref-type="sec" rid="s3">Section 3</xref>, Materials and Methods, we describe our text-to-text approach for generating distractors jointly and detail our experimental setup. <xref ref-type="sec" rid="s4">Section 4</xref> then presents our evaluation data and benchmarks, followed by the Discussion (<xref ref-type="sec" rid="s5">Section 5</xref>), where we analyze these findings and explore future work, and the Limitations (<xref ref-type="sec" rid="s6">Section 6</xref>). Finally, <xref ref-type="sec" rid="s7">Section 7</xref>, the Conclusion, summarizes our main contributions.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Distractor Generation Approaches</title>
<p>Distractor generation for MCQs has long been a focus in the fields of education and assessment. Recently, there has been interest in using automated methods to create these distractors. Various approaches have been explored, including sequence-to-sequence models [<xref ref-type="bibr" rid="ref-21">21</xref>] along with the application of large Transformer-based language models [<xref ref-type="bibr" rid="ref-22">22</xref>], such as GPT-2 [<xref ref-type="bibr" rid="ref-23">23</xref>], BERT [<xref ref-type="bibr" rid="ref-24">24</xref>], or T5 [<xref ref-type="bibr" rid="ref-25">25</xref>], usually fine-tuned specifically for the DG task.</p>
<p>During the construction of the SciQ dataset [<xref ref-type="bibr" rid="ref-26">26</xref>], the DG problem was addressed more traditionally. The focus was to help crowdworkers select the best options from a large set of distractors created using a GloVe vocabulary [<xref ref-type="bibr" rid="ref-27">27</xref>]. To facilitate this, a classifier was trained to rank good candidates based on multiple features, including embeddings, POS-tagging, the distance between the correct answer and candidate distractor, token length, token overlap, and hypernymy/hyponymy indicators. Similarly, for technical domains like engineering, the use of ontologies to generate distractors in MCQs has been suggested [<xref ref-type="bibr" rid="ref-6">6</xref>]. Another study used the T5 model for producing English grammar MCQs [<xref ref-type="bibr" rid="ref-28">28</xref>]. However, distractors were generated based on inputs composed of a keyword and a part-of-speech template and then selected using a rule-based algorithm.</p>
<p>Another study introduced a framework called EDGE (quEstion and answer guided Distractor GEneration) [<xref ref-type="bibr" rid="ref-13">13</xref>]. This approach generates distractors based on the context, the question, and the correct answer, using a sequence-to-sequence model. Two important characteristics of distractors are improved with this model: incorrectness (using a gate mechanism that constrains answer-relevant words based on distance) and plausibility (employing the semantic representation of the question and the context). The study used a modified version of the RACE dataset named DG-RACE [<xref ref-type="bibr" rid="ref-10">10</xref>].</p>
<p>Exploring transformer-based approaches, the DG-RACE dataset was also used to fine-tune a T5 model specifically for the DG task in the context of the end-to-end generation of MCQs [<xref ref-type="bibr" rid="ref-9">9</xref>]. This study proposed a text-to-text approach, which generates a single distractor by leveraging the context, the question, and the correct answer.</p>
<p>The aforementioned studies share a common area for improvement: the output of newly generated distractors is not conditioned on the ones previously generated for the same MCQ. Also, they rely on beam-search methods for regulating the output [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>].</p>
<p>In addition to T5, alternative Transformer models such as GPT-2 have also been explored. For example, another work approached the DG task by fine-tuning a GPT-2 language model on the RACE dataset to generate three distractors for a given question, correct answer, and context [<xref ref-type="bibr" rid="ref-12">12</xref>]. Following this, an additional step was incorporated, utilizing a DistilBERT model [<xref ref-type="bibr" rid="ref-29">29</xref>] fine-tuned as a classifier. This classifier, also trained on the RACE dataset, had the objective of filtering out generated distractors that could themselves answer the question. However, this latter step did not show a meaningful improvement.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Datasets</title>
<p>Distractors, integral to the structure of MCQs, are typically found in RC datasets, such as MCTest, RACE, OpenBookQA, SciQ, CosmosQA, ARC [<xref ref-type="bibr" rid="ref-30">30</xref>], and CommonsenseQA [<xref ref-type="bibr" rid="ref-31">31</xref>], among others. Consequently, these datasets can be used as rich resources when training models for the DG task due to the inclusion of context for each question-answer pair, along with carefully constructed distractors. Many studies referenced above utilize the context paragraph, the question, and the correct answer as inputs to guide the generation of distractors.</p>
<p>The subject domain varies across these datasets. The RACE dataset covers narratives, ads, and informational passages across multiple subjects and domains such as history, science, and geography; it was built from English-language exams for middle and high school students with the aim of evaluating text comprehension [<xref ref-type="bibr" rid="ref-19">19</xref>]. MCTest is an open-domain dataset, mainly composed of fictional narratives that a child could understand [<xref ref-type="bibr" rid="ref-32">32</xref>]. In the case of SciQ and OpenBookQA, topics are specific to the science domain, including biology, physics, chemistry, earth science, and others [<xref ref-type="bibr" rid="ref-26">26</xref>,<xref ref-type="bibr" rid="ref-33">33</xref>]. While for RACE, MCTest, and SciQ the correct answer to a given context and question can be inferred from the passage, OpenBookQA is designed to require multi-step reasoning and common-sense knowledge to answer the questions [<xref ref-type="bibr" rid="ref-33">33</xref>].</p>
<p>In another line of research, a study utilized these RC datasets to train a unified model for question-answering (QA) tasks [<xref ref-type="bibr" rid="ref-34">34</xref>,<xref ref-type="bibr" rid="ref-35">35</xref>]. The model, which implemented answering of MCQs (with and without context paragraphs), used a text-to-text approach based on T5 and BART [<xref ref-type="bibr" rid="ref-36">36</xref>]. This model could be used as a tool to validate automatically generated distractors by incorporating it as an answerability filter, a method proposed by other studies [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-37">37</xref>].</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>T5 Model Variants</title>
<p>The inherent flexibility of T5 models and their text-to-text approach to a variety of NLP tasks have led to the development of new T5 variants that can potentially be used for DG tasks. Among these, LongT5 has exhibited superior performance in processing long input sequences and offers a solution to the size-scalability issues often associated with the standard T5 model [<xref ref-type="bibr" rid="ref-18">18</xref>]. Furthermore, a version of LongT5 has been enhanced with an attention mechanism known as Transient Global (TGlobal) attention. This mechanism divides the input sequence into blocks of k tokens each. A global token is then computed for each block by summing and normalizing the embeddings of the tokens within it. As a result, the attention mechanism enables the input tokens to attend not only to their immediate neighbors but also to the collection of global tokens [<xref ref-type="bibr" rid="ref-18">18</xref>].</p>
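<p>The block-summarization idea behind TGlobal attention can be sketched as follows. This is a toy NumPy illustration of computing one global token per block of k tokens, not LongT5's actual implementation (which uses a learned layer normalization); the function name and the simple mean/variance normalization are illustrative assumptions.</p>

```python
import numpy as np

def tglobal_tokens(token_embeds: np.ndarray, k: int) -> np.ndarray:
    """Toy sketch of TGlobal global tokens: one per block of k input tokens.

    Each global token is the normalized sum of the token embeddings in its
    block (here a simple zero-mean/unit-variance normalization stands in
    for the learned layer normalization used by LongT5).
    """
    seq_len, d = token_embeds.shape
    n_blocks = int(np.ceil(seq_len / k))
    globals_ = np.zeros((n_blocks, d))
    for b in range(n_blocks):
        block = token_embeds[b * k:(b + 1) * k]
        s = block.sum(axis=0)                         # sum embeddings in the block
        globals_[b] = (s - s.mean()) / np.sqrt(s.var() + 1e-6)  # normalize
    return globals_
```

<p>In the full mechanism, each input token then attends to its local neighborhood plus these global tokens, keeping attention cost manageable for long inputs.</p>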
<p>Another significant development is Flan-T5, which represents a T5 model fine-tuned on a larger corpus comprising 473 datasets and 1836 tasks. This comprehensive fine-tuning has enabled Flan-T5 to surpass the performance of published T5 checkpoints, in some instances by a margin exceeding 10% [<xref ref-type="bibr" rid="ref-17">17</xref>].</p>
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Evaluation Challenges</title>
<p>Despite these advancements in model development, some challenges remain in the realm of DG evaluation. Multiple studies have highlighted the absence of standardized metrics and benchmarks for evaluating the quality of distractors [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-12">12</xref>]. Traditionally, these studies have relied on metrics like BLEU and ROUGE, which measure word overlap and are widely used in machine translation tasks. However, distance measures, which have been employed both as features for distractor generation [<xref ref-type="bibr" rid="ref-26">26</xref>] and as supplementary evaluation metrics [<xref ref-type="bibr" rid="ref-9">9</xref>], offer an interesting alternative.</p>
<p>To improve the evaluation of text generation more broadly, BERTScore has been introduced. This metric uses contextual embeddings from BERT to calculate a similarity score between the reference and the generated text, which, in the original study, correlated better with human judgment [<xref ref-type="bibr" rid="ref-20">20</xref>].</p>
</sec>
<sec id="s2_5">
<label>2.5</label>
<title>Research Gap and Proposed Approach</title>
<p>Multiple prior approaches to DG exhibit limitations in maintaining semantic coherence and diversity between distractors. Some methods rely on beam search, producing iterative outputs where distractors are generated independently, often leading to additional ranking and filtering steps. Others do not condition generated distractors on those already produced, increasing the possibility of redundant or trivial options. In addition, given the lack of a standard metric for DG, existing methods mostly use word overlap metrics and have not explored the deeper semantic characteristics of distractors, with only a few starting to incorporate cosine similarity.</p>
<p>To address these limitations, a joint distractor generation approach is proposed, utilizing fine-tuned FlanT5 and LongT5 models with a RACE-based dataset. By generating multiple distractors in a single inference step, this method leverages the text-to-text nature of the models to enhance the coherence and semantic relationship of the options. Additionally, BERTScore and cosine similarity are incorporated into the evaluation framework to assess the relevance of the generated distractors.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Materials and Methods</title>
<p>The advent of the Transformer architecture has enabled the emergence of LLMs that have revolutionized multiple domains of NLP [<xref ref-type="bibr" rid="ref-38">38</xref>]. These LLMs showcase an outstanding ability to capture linguistic patterns and dependencies at a level not seen before. Among them, the T5 model stands out for its versatile interface, with a design in which NLP problems such as summarization, classification, and translation are framed as text-to-text tasks [<xref ref-type="bibr" rid="ref-25">25</xref>]. This characteristic allows these models to be used for multiple applications. T5 models use both the encoder and decoder layers of the original Transformer architecture, unlike BERT or the GPT-X model family, which are based only on encoders or decoders, respectively.</p>
<p>LongT5 and Flan-T5 represent an evolution of the original T5 model, and their selection for this study was motivated by their improved efficiency in processing larger inputs and their adaptability to various tasks, respectively. These characteristics are interesting for the domain of MCQs because an efficient multi-task model can be optimized to perform the QG, QA, and DG tasks, which are three dimensions of the MCQ generation problem [<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
<p>To optimize LLMs for specific tasks, Parameter-Efficient Fine-Tuning (PEFT) techniques [<xref ref-type="bibr" rid="ref-39">39</xref>], such as LoRA and quantization, have recently emerged. These techniques make it possible to fine-tune larger models with fewer computational resources by introducing efficient parameter updates and adaptations, as well as weight-precision reduction. In this study, full fine-tuning is used for medium-size models, while PEFT is applied to larger models.</p>
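<p>To illustrate why LoRA reduces the trainable footprint, the following minimal NumPy sketch applies a rank-r update to a single frozen weight matrix and compares parameter counts; the dimensions, function name, and default hyperparameters are illustrative assumptions, not the study's training code.</p>

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=8):
    """Forward pass with a LoRA adapter: y = x @ (W + (alpha/r) * B @ A).T.

    W (d_out x d_in) is frozen; only the low-rank factors A (r x d_in)
    and B (d_out x r) receive gradient updates during fine-tuning.
    """
    return x @ (W + (alpha / r) * (B @ A)).T

d_in, d_out, r = 1024, 1024, 8
W = np.zeros((d_out, d_in))                    # frozen pre-trained weight (toy values)
A = np.random.randn(r, d_in) * 0.01            # trainable low-rank factor
B = np.zeros((d_out, r))                       # B starts at zero: update begins as a no-op

full_params = W.size                           # 1,048,576 trainable params if fully fine-tuned
lora_params = A.size + B.size                  # 16,384 trainable params with the adapter
print(lora_params / full_params)               # 0.015625, i.e., ~1.6% of the full matrix
```

<p>Combining such adapters with 4-bit quantization of the frozen base weights is what allows the larger models in this study to fit within GPU-memory constraints.</p>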
<p>The approach of this research for DG uses a text-to-text paradigm, which is natural to T5-like multitask language models, as mentioned before. The context, question, and correct answer are used as input for the model while the set of distractors is the expected output (<xref ref-type="fig" rid="fig-1">Fig. 1</xref>). The question is added at the beginning of the input, followed by the correct answer and the context paragraph, which are separated by the labels &#x201C;CORRECT-ANSWER:&#x201D; and &#x201C;CONTEXT:&#x201D;, respectively. The output text is structured as a lettered list of distractors. To avoid ambiguity during fine-tuning (especially for Flan-T5), the task is prefixed with the label &#x201C;GENERATE-DISTRACTORS:&#x201D;.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Text-to-Text approach for the joint generation of distractors</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_62004-fig-1.tif"/>
</fig>
<p>Both the Flan-T5 and LongT5 models are fine-tuned using the specified input and output structures. The maximum input length for these models is set to 1024 tokens, and training examples that exceed this limit are excluded from the training set. The ordering of elements in the formatted input is intentionally designed to retain most of the pertinent information even when inputs exceed the limit during inference. This ensures that the question and correct answer are always included, with only the less critical portions of the context potentially being truncated.</p>
<p>The training and evaluation processes are illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. The training process begins by formatting the train and validation splits of the RACE dataset according to the format presented in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. Each example is then tokenized using the model-specific tokenizer (Flan-T5 and LongT5 each have their own). Next, the process diverges based on the fine-tuning technique. For models undergoing full fine-tuning, the approach is straightforward: the pre-trained public model is fetched, and training is executed using the Seq2SeqTrainer from the Transformers library in Python [<xref ref-type="bibr" rid="ref-40">40</xref>]. For larger models, the methodology is slightly different. Upon retrieving the pre-trained model, a LoRA adapter model, significantly smaller than the full model, is prepared. This adapter goes through the fine-tuning process, updating only its own parameters instead of those of the pre-trained model, which, in this case, is loaded into GPU memory using 4-bit quantization. This effectively reduces memory demands during training. Training then proceeds as in the full fine-tuning approach, utilizing a Seq2SeqTrainer from the Transformers library; however, in this instance, the output is an adapter model optimized for the DG task, which must be merged with the original pre-trained model. The output of this entire process is a model fine-tuned for the DG task.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Training and evaluation process. Boxes with dashed lines showcase specific steps for the case of full fine-tuning and LoRA &#x002B; quantization fine-tuning</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_62004-fig-2.tif"/>
</fig>
<p>For the evaluation, the test splits from the RACE, MCTest, SciQ, and OpenBookQA datasets are formatted according to the structure proposed in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. The inputs undergo tokenization before using the models to generate distractors. The performance of each fine-tuned model is assessed using BLEU and ROUGE metrics, in addition to calculating the BERTScore. To further evaluate and understand the behavior of the generated distractors, an analysis of their grammatical correctness, as well as their distances with the correct answers is conducted.</p>
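<p>The distance and duplication analyses can be sketched as follows. This is a simplified, self-contained illustration using placeholder embedding vectors; in the study, the embeddings come from a pre-trained model, and the function names here are illustrative.</p>

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analyze_distractors(answer_vec, distractor_vecs, distractor_texts):
    """Semantic closeness of each distractor to the correct answer,
    plus a count of exact duplicates after basic normalization."""
    sims = [cosine_similarity(answer_vec, d) for d in distractor_vecs]
    normalized = [t.strip().lower() for t in distractor_texts]
    n_duplicates = len(normalized) - len(set(normalized))
    return sims, n_duplicates
```

<p>Low similarity to the correct answer suggests a distractor is safely incorrect, while similarities near 1.0 flag options that may be too close to the answer; the duplicate count captures the repetition tendency discussed later.</p>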
<sec id="s3_1">
<label>3.1</label>
<title>Experimental Setup</title>
<sec id="s3_1_1">
<label>3.1.1</label>
<title>Datasets</title>
<p>As previously noted, datasets featuring MCQ structures composed of context, question, correct answer, and distractors are well-suited for the DG task. In the scope of this study, the RACE dataset, collected from English examinations for both middle and high school students, was pre-processed and used for model fine-tuning. <xref ref-type="table" rid="table-1">Table 1</xref> shows the number of examples in the dataset. Each example in the RACE dataset is composed of an article (the context of the question), the question stem, a set of 4 options (including the correct one), and the answer (the correct option identified by a letter from A to D). The letter of the answer is used to identify the correct answer in the set of 4 options.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Number of formatted text-to-text examples per dataset, including train, validation, and test splits for RACE, and only the test splits for SciQ, MCTest (mc500), and OpenBookQA</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Dataset (Split)</th>
<th>Examples</th>
</tr>
</thead>
<tbody>
<tr>
<td>RACE (Train)</td>
<td>87,560</td>
</tr>
<tr>
<td>RACE (Val)</td>
<td>4867</td>
</tr>
<tr>
<td>RACE (Test)</td>
<td>4934</td>
</tr>
<tr>
<td>SciQ (Test)</td>
<td>1000</td>
</tr>
<tr>
<td>MCTest-mc500 (Test)</td>
<td>600</td>
</tr>
<tr>
<td>OpenBookQA (Test)</td>
<td>500</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>All dataset examples were transformed into the input-output format previously described in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. Specifically, each input is structured as follows: <italic>&#x201C;GENERATE-DISTRACTORS: &#x003C;question stem&#x003E;\nCORRECT-ANSWER: &#x003C;correct answer text&#x003E;\nCONTEXT: &#x003C;context/article&#x003E;&#x201D;</italic>. Each output is a lettered list of distractors: <italic>&#x201C;(A) &#x003C;distractor 1&#x003E;\n(B) &#x003C;distractor 2&#x003E;\n(C) &#x003C;distractor 3&#x003E;&#x201D;</italic>. These distractors correspond to the options available in the original MCQ example, excluding the correct one. A training example is shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, where the <italic>Formatted Input</italic> block is used to feed the model, and the <italic>Expected Output</italic> block is the target to match.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Formatted training example based on the RACE dataset for the distractor generation task</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_62004-fig-3.tif"/>
</fig>
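The construction of these input-output pairs can be sketched in a few lines of Python. The field names mirror the RACE structure described above (article, question stem, four options, answer letter); the exact dataset keys may differ depending on the distribution used:

```python
def format_example(article, question, options, answer_letter):
    """Build one text-to-text training pair in the proposed format.

    `options` is the list of four answer options; `answer_letter`
    is 'A'-'D' identifying the correct one (RACE-style fields).
    """
    idx = ord(answer_letter) - ord("A")
    correct = options[idx]
    # The target distractors are the remaining three options
    distractors = [o for i, o in enumerate(options) if i != idx]

    model_input = (
        f"GENERATE-DISTRACTORS: {question}\n"
        f"CORRECT-ANSWER: {correct}\n"
        f"CONTEXT: {article}"
    )
    # Re-letter the distractors as a lettered list
    target = "\n".join(f"({chr(ord('A') + i)}) {d}"
                       for i, d in enumerate(distractors))
    return model_input, target
```

Applying this transformation to every record yields the Formatted Input / Expected Output pairs illustrated in Fig. 3.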
<p>Once the input and output were structured accordingly, a tokenization process was applied to enforce the 1024-token input limit. As mentioned before, examples exceeding this limit were excluded, reducing the training samples for the train split of the RACE dataset from 87,866 to 87,560. This filtering procedure was consistently applied to all splits and additional datasets.</p>
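The filtering step reduces to a simple predicate over tokenized inputs. In the sketch below a whitespace split stands in for the model's actual subword tokenizer, so the counts are only illustrative:

```python
MAX_SOURCE_TOKENS = 1024  # input limit used in the study

def count_tokens(text):
    # Stand-in tokenizer: whitespace split. The study used the
    # model's own subword tokenizer, which yields more tokens.
    return len(text.split())

def keep(example_input, limit=MAX_SOURCE_TOKENS):
    """True if the formatted input fits within the token budget."""
    return count_tokens(example_input) <= limit

inputs = ["short example", "word " * 2000]
kept = [s for s in inputs if keep(s)]  # over-limit examples dropped
```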
<p>The test split was used to generate distractors for unseen inputs and then compared with baseline models. Additional evaluation was performed on test splits from the MCTest, SciQ, and OpenBookQA datasets, enabling the assessment of performance across various contexts that differ from the training dataset (<xref ref-type="table" rid="table-1">Table 1</xref>).</p>

</sec>
<sec id="s3_1_2">
<label>3.1.2</label>
<title>Fine-Tuned Models</title>
<p>For the purpose of this study, three versions of the pre-trained Flan-T5 and LongT5-TGlobal models were fine-tuned: Base (250 M parameters), Large (780 M parameters), and XL (3 billion parameters). The Base and Large models underwent a full fine-tuning process on an RTX A6000 GPU, at a cost of $1.89/h. Due to memory limitations and GPU availability, the XL models were fine-tuned on an NVIDIA A10G Tensor Core GPU ($1.21/h), utilizing Low-Rank Adaptation (LoRA) [<xref ref-type="bibr" rid="ref-41">41</xref>] and 4-bit quantization techniques [<xref ref-type="bibr" rid="ref-42">42</xref>]. To establish a reference for models employing LoRA and quantization, a LongT5 Base model was also fine-tuned using these methods.</p>
<p>In the fine-tuning process for all models, the max_source_length was set to 1024 tokens and the max_target_length to 256. These parameters were selected based on the distribution of text lengths in the RACE dataset, which has the longest inputs and outputs among the datasets used in the study. By using 1024 tokens for max_source_length, the training process was able to include 99.6% of the training examples using the proposed input format (<xref ref-type="fig" rid="fig-1">Fig. 1</xref>) without truncation, while keeping the memory requirements within the GPU and resource availability constraints. A max_target_length of 256 tokens provides enough output length for generating distractors in the proposed format for the studied datasets.</p>
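The core idea of LoRA [41] is that the frozen base weight W is augmented with a trainable low-rank correction, so the adapted layer computes y = W x + (α/r)·B·A·x while only A and B are updated. A pure-Python toy sketch of the adapted forward pass (dimensions, values, and hyperparameters are illustrative, not those of the fine-tuned models):

```python
def matvec(W, x):
    """Dense matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """y = W x + (alpha / r) * B (A x), the LoRA-adapted linear layer.

    W is the frozen base weight; only A (r x d_in) and B (d_out x r)
    would receive gradient updates during fine-tuning.
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * lr for b, lr in zip(base, low_rank)]

# With B initialized to zeros (as in LoRA), the adapter starts as an
# exact no-op: the output equals the frozen layer's output.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.1, 0.2], [0.3, 0.4]]        # r=2, d_in=2
B_zero = [[0.0, 0.0], [0.0, 0.0]]   # d_out=2, r=2
x = [1.0, 1.0]
assert lora_forward(W, A, B_zero, x) == matvec(W, x)
```

Because only the small A and B matrices are trained, the memory footprint of optimizer states shrinks dramatically, which is what made fine-tuning the 3 B-parameter XL models feasible on a single GPU.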
</sec>
<sec id="s3_1_3">
<label>3.1.3</label>
<title>Baseline Models</title>
<p>To benchmark the performance of the fine-tuned Flan-T5 and LongT5 models, BLEU and ROUGE results were compared with those reported by models from three separate studies: GPT-2 &#x002B; DistilBERT [<xref ref-type="bibr" rid="ref-12">12</xref>], T5-DG [<xref ref-type="bibr" rid="ref-9">9</xref>], and a Seq-to-Seq model [<xref ref-type="bibr" rid="ref-10">10</xref>]. In these studies, models were fine-tuned on the DG task using the RACE dataset, requiring the context, question, and correct answer as input, similar to the present study. In the case of GPT-2 &#x002B; DistilBERT, the reported model sizes were 355 M and 66 M parameters, respectively. The T5-DG model was based on T5-Small (60 M parameters), and the Seq-to-Seq model was based on a custom long short-term memory (LSTM) network that used GloVe embeddings (840B.300d version). More details on the fine-tuning process of these models can be found in their respective studies.</p>
<p>Furthermore, for a comprehensive baseline across various datasets, distractors were generated for the test splits of RACE, MCTest, SciQ, and OpenBookQA using GPT-3.5-turbo-1106 via the OpenAI API, incurring a total cost of approximately $3.05. A system prompt indicated the details of the DG task and the expected output format, while a user prompt was utilized to input the question, correct answer, and context, as illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. The produced distractors and the corresponding metrics obtained serve as a reference for the DG task applied to all datasets evaluated in the research.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Prompts for generating distractors with GPT-3.5 and OpenAI API</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_62004-fig-4.tif"/>
</fig>
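The request to the chat API pairs a fixed system prompt with a per-question user prompt. The sketch below shows only the message construction, without the network call; the prompt wording is a paraphrase for illustration, not the exact text of Fig. 4:

```python
def build_messages(question, correct_answer, context):
    """Assemble the chat messages for one DG request.

    The prompt texts below are illustrative paraphrases of the
    prompts used in the study, not their exact wording.
    """
    system_prompt = (
        "You generate three plausible but incorrect options "
        "(distractors) for a multiple-choice question. "
        "Reply with a lettered list: (A) ... (B) ... (C) ..."
    )
    user_prompt = (
        f"QUESTION: {question}\n"
        f"CORRECT-ANSWER: {correct_answer}\n"
        f"CONTEXT: {context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```

The resulting message list would then be sent to the chat-completions endpoint for the gpt-3.5-turbo-1106 model, one request per test example.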
</sec>
<sec id="s3_1_4">
<label>3.1.4</label>
<title>Automatic Evaluation</title>
<p>To assess the quality of the generated distractors, three metrics were employed: BLEU (ranging from 1 to 4 n-grams) [<xref ref-type="bibr" rid="ref-14">14</xref>], ROUGE-L [<xref ref-type="bibr" rid="ref-15">15</xref>], and BERTScore [<xref ref-type="bibr" rid="ref-20">20</xref>]. These evaluation metrics were compared against the baseline models mentioned before.</p>
<p>While BLEU and ROUGE measure n-gram or subsequence token overlap, they can fail to capture semantic differences (or similarities). This is especially interesting when analyzing distractors because they should be contextually plausible but distinct from the correct answer. BERTScore leverages contextual embeddings from BERT-based models, offering a better approximation of semantic similarity, with improved correlation to human judgments [<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-43">43</xref>] compared to traditional token overlap metrics.</p>
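ROUGE-L, for instance, scores the longest common subsequence (LCS) shared by candidate and reference tokens. A minimal implementation of a simplified F1 variant (whitespace tokenization; standard ROUGE-L additionally supports a recall-weighted F-measure):

```python
def lcs_length(a, b):
    """Dynamic-programming longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if tok_a == tok_b
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """LCS-based F1 between two whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

As the paragraph above notes, such overlap scores say nothing about whether a non-matching distractor is semantically plausible, which is what motivates the addition of BERTScore.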
<p>An additional distance analysis was conducted on pairs consisting of a correct answer and its corresponding distractors, as shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. These pairs were extracted from the test splits of all the datasets used in this study. Cosine similarity calculations were performed using embeddings from Sentence-BERT (all-MiniLM-L6-v2 version) [<xref ref-type="bibr" rid="ref-44">44</xref>]. The resulting similarity scores from the test splits served as a reference for comparison against the similarity scores obtained from distractors generated by both the fine-tuned models and the GPT-3.5 baseline.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Example of correct-answer and distractor pairs extracted from the same question, with the respective cosine similarity measure</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_62004-fig-5.tif"/>
</fig>
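Given sentence embeddings for a correct answer and a distractor, the similarity score of Fig. 5 is the cosine of the angle between the two vectors. In the study the embeddings came from Sentence-BERT (all-MiniLM-L6-v2, 384-dimensional); the toy vectors below merely illustrate the computation:

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d embeddings standing in for real Sentence-BERT vectors
answer_vec = [0.9, 0.1, 0.2]
distractor_vec = [0.7, 0.3, 0.1]
score = cosine_similarity(answer_vec, distractor_vec)
```

Computing this score for every (correct answer, distractor) pair in a test split yields the distributions summarized in the box plots that follow.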
<p>Both BERTScore and cosine similarity offer complementary views for analyzing distractors. However, both metrics can overestimate similarity for pairs sharing common tokens and can fail to capture small but meaningful changes such as negations [<xref ref-type="bibr" rid="ref-45">45</xref>]. The use of contextual embeddings should mitigate this effect to some extent; these metrics are therefore incorporated to provide an additional perspective alongside BLEU and ROUGE-L.</p>
<p>Finally, an analysis of the grammatical correctness of the generated distractors was performed. This is particularly relevant for the RACE dataset, where distractors are typically composed of multiple words. The analysis was based on LanguageTool<xref ref-type="fn" rid="fn-1"><sup>1</sup></xref><fn id="fn-1"><label>1</label><p><ext-link ext-link-type="uri" xlink:href="https://github.com/languagetool-org/languagetool">https://github.com/languagetool-org/languagetool</ext-link> (accessed on 1 January 2025).</p></fn>, an open-source grammar checker that shows high accuracy and exhibits a significant correlation with human ratings [<xref ref-type="bibr" rid="ref-46">46</xref>]. For this evaluation, the original distractors from the test splits of all datasets, as well as those generated by the fine-tuned models and the GPT-3.5 baseline, were used. Specifically, the percentage of distractors containing errors in the &#x201C;GRAMMAR&#x201D; category was computed. This category covers issues related to verb usage, pluralization, tense, nouns, and more [<xref ref-type="bibr" rid="ref-46">46</xref>]. Note that this evaluation focuses on structural errors rather than spelling, capitalization, punctuation, or whitespace issues.</p>
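With LanguageTool's matches in hand, the reported figure is simply the share of distractors that receive at least one match in the GRAMMAR category. The sketch below shows only this aggregation step; the per-distractor category lists are stubbed data, whereas in practice they would come from a LanguageTool client:

```python
def grammar_error_rate(match_categories):
    """Percentage of distractors with >= 1 match in the GRAMMAR category.

    `match_categories` maps each distractor to the list of rule
    categories the checker flagged for it (stubbed here).
    """
    flagged = sum(1 for cats in match_categories.values()
                  if "GRAMMAR" in cats)
    return 100.0 * flagged / len(match_categories)

# Stubbed checker output: only structural (GRAMMAR) matches count;
# spelling or punctuation categories are ignored by this evaluation.
matches = {
    "He go to school": ["GRAMMAR"],
    "the respiratory system": [],
    "Its a big problem": ["TYPOS"],   # not counted
    "They was late": ["GRAMMAR"],
}
rate = grammar_error_rate(matches)    # 2 of 4 flagged -> 50.0
```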
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Results</title>
<p><xref ref-type="table" rid="table-2">Table 2</xref> presents the BLEU, ROUGE-L, and BERTScore F1 metrics for the models trained in this study, evaluated on the test split of the RACE dataset, alongside comparisons with baseline models. The table also includes the percentage of inferences that resulted in duplicated distractors, a phenomenon observed during the analysis of the generated outputs. BLEU (B1&#x2013;B4) and ROUGE-L (R-L) scores range from 0 to 100, with higher values indicating greater n-gram and longest common subsequence overlap, respectively. BERTScore F1 (BS-F1) also ranges from 0 to 100, reflecting semantic similarity to the reference, where higher scores are better. The percentage of duplicated distractors (%DUP) ranges from 0 to 100, with lower values being preferable.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Automatic evaluation results for DG on the RACE dataset. BLEU scores in columns B1 to B4, ROUGE-L in column R-L, BERT-Score F1 in column BS-F1. Percentage of instances with duplicated distractors in column %DUP. Results from GPT-2 &#x002B; DistilBERT, T5-DG, and Seq-to-Seq were sourced directly from their published studies. Bolded values indicate the highest performance for that metric</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>B1</th>
<th>B2</th>
<th>B3</th>
<th>B4</th>
<th>R-L</th>
<th>BS-F1</th>
<th>%DUP</th>
</tr>
</thead>
<tbody>
<tr>
<td>FlanT5-Base</td>
<td>52.53</td>
<td>35.73</td>
<td>23.98</td>
<td>12.42</td>
<td>37.32</td>
<td>90.18</td>
<td>10.9</td>
</tr>
<tr>
<td>FlanT5-Large</td>
<td>54.05</td>
<td>37.2</td>
<td>25.24</td>
<td><bold>13.53</bold></td>
<td><bold>38.55</bold></td>
<td>90.53</td>
<td>3.95</td>
</tr>
<tr>
<td>FlanT5-Base (LoRA)</td>
<td>40.82</td>
<td>26.21</td>
<td>15.69</td>
<td>4.69</td>
<td>23.25</td>
<td>87.52</td>
<td>0.79</td>
</tr>
<tr>
<td>FlanT5-XL (LoRA)</td>
<td>55.62</td>
<td>37.78</td>
<td>24.77</td>
<td>11.58</td>
<td>35.18</td>
<td>90.19</td>
<td>1.34</td>
</tr>
<tr>
<td>LongT5-Base</td>
<td>51.74</td>
<td>35.05</td>
<td>23.62</td>
<td>12.41</td>
<td>37.38</td>
<td>90.2</td>
<td>7.36</td>
</tr>
<tr>
<td>LongT5-Large</td>
<td>52.83</td>
<td>35.21</td>
<td>23.12</td>
<td>11.18</td>
<td>35.51</td>
<td>89.98</td>
<td>6.97</td>
</tr>
<tr>
<td>LongT5-XL (LoRA)</td>
<td>55.94</td>
<td><bold>38.2</bold></td>
<td><bold>25.39</bold></td>
<td>12.59</td>
<td>36.6</td>
<td>90.6</td>
<td>0.85</td>
</tr>
<tr>
<td>GPT-3.5</td>
<td>46.98</td>
<td>29.65</td>
<td>18.04</td>
<td>6.85</td>
<td>32.01</td>
<td><bold>91.43</bold></td>
<td><bold>0</bold></td>
</tr>
<tr>
<td>GPT-2 &#x002B; DistilBERT</td>
<td><bold>60.12</bold></td>
<td>26.56</td>
<td>13.64</td>
<td>9.17</td>
<td>12.36</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>T5-DG</td>
<td>14.80</td>
<td>7.06</td>
<td>3.75</td>
<td>2.16</td>
<td>14.91</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Seq-to-Seq</td>
<td>26.93</td>
<td>13.57</td>
<td>8.0</td>
<td>5.21</td>
<td>14.54</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>It is visible that the FlanT5 and LongT5 models, particularly in their Large (780 M parameters) and XL (3 B parameters) versions, achieve strong performance across BLEU scores, indicating improved word overlap with the reference texts, especially for 2- to 4-grams. The ROUGE-L scores, which focus on the longest common subsequence between the generated text and the reference, are also robust for the fine-tuned models, with all fine-tuned versions improving on the baselines, except for FlanT5-Base fine-tuned with LoRA.</p>
<p>Regarding the BERTScore F1 metric, values gravitate around 90, suggesting that the generated distractors are semantically close to the reference. However, GPT-3.5 scores slightly higher than all other models, suggesting that, despite lower n-gram overlap, its distractors may be semantically closer to the references.</p>
<p>As mentioned above, duplication of distractors in model outputs was observed, which can indicate an inability to generate diverse distractors. Notably, models fine-tuned with LoRA and quantization exhibit the lowest duplication rates compared to their fully fine-tuned counterparts, evidenced by a reduction of more than 10 percentage points between the fully fine-tuned and LoRA fine-tuned FlanT5-Base models. GPT-3.5, for its part, generated no duplicated distractors in any inference.</p>
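The %DUP figure can be computed by checking, per inference, whether any distractor appears more than once among the generated options. A sketch (the case-folding normalization is an assumption; the paper does not state how duplicates were matched):

```python
def duplication_rate(generations):
    """Percentage of outputs containing at least one repeated distractor.

    `generations` is a list of distractor lists, one per inference.
    """
    def has_dup(distractors):
        # Normalize before comparing; exact-match counting would
        # miss near-duplicates differing only in case or whitespace.
        normalized = [d.strip().lower() for d in distractors]
        return len(set(normalized)) < len(normalized)

    dup = sum(1 for g in generations if has_dup(g))
    return 100.0 * dup / len(generations)

outputs = [
    ["go home", "stay late", "go home"],   # duplicated
    ["red", "green", "blue"],
    ["Paris", "paris", "London"],          # duplicate after normalization
    ["one", "two", "three"],
]
rate = duplication_rate(outputs)           # 2 of 4 -> 50.0
```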
<p>Another observation is that the TGlobal attention mechanism in the LongT5 models does not provide a significant performance advantage over the FlanT5 models. In fact, the FlanT5-Large model clearly outperforms its LongT5 counterpart.</p>
<p><xref ref-type="table" rid="table-3">Table 3</xref> reinforces the observed duplication phenomenon across other RC datasets, indicating that, in the presence of other contexts, there is a higher tendency to duplicate a distractor, especially for smaller models. Regarding BERTScore, values still gravitate around 89 to 90, suggesting that the generated distractors are semantically close to the reference; however, the fine-tuned models still underperform the GPT-3.5 baseline on this metric.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Automatic evaluation results for other popular RC datasets. BLEU scores in columns B1 to B4, ROUGE-L in column R-L, and BERT-Score F1 in column BS-F1. Percentage of instances with 1 duplicated distractor in column %DUP1 and with 2 duplicated distractors in %DUP2</title>
</caption>
<table>
<colgroup>
<col align="center" width="20mm"/>
<col align="center" width="12mm"/>
<col align="center" width="12mm"/>
<col align="center" width="12mm"/>
<col align="center" width="12mm"/>
<col align="center" width="12mm"/>
<col align="center" width="12mm"/>
<col align="center" width="15mm"/>
<col align="center" width="15mm"/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>B1</th>
<th>B2</th>
<th>B3</th>
<th>B4</th>
<th>R-L</th>
<th>BS-F1</th>
<th>%DUP1</th>
<th>%DUP2</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="9">FlanT5-Base</td>
</tr>
<tr>
<td>MCTest</td>
<td>66.19</td>
<td>50.96</td>
<td>35.15</td>
<td>18.93</td>
<td>50.77</td>
<td>91.6</td>
<td>20.33</td>
<td>4</td>
</tr>
<tr>
<td>SciQ</td>
<td>69.83</td>
<td>51.16</td>
<td>29.99</td>
<td>5.3</td>
<td>48.3</td>
<td>89.23</td>
<td>30.6</td>
<td>25.8</td>
</tr>
<tr>
<td>OpenBookQA</td>
<td>60.67</td>
<td>42.99</td>
<td>26.2</td>
<td>8.05</td>
<td>42.15</td>
<td>89.34</td>
<td>29.2</td>
<td>16.6</td>
</tr>
<tr>
<td colspan="9">FlanT5-Large</td>
</tr>
<tr>
<td>MCTest</td>
<td>68.97</td>
<td>53.97</td>
<td>37.4</td>
<td>20.45</td>
<td>51.15</td>
<td>92.08</td>
<td>3.67</td>
<td>0.17</td>
</tr>
<tr>
<td>SciQ</td>
<td>69.55</td>
<td>51.3</td>
<td>30.09</td>
<td>5.82</td>
<td>48.06</td>
<td>89.62</td>
<td>18.8</td>
<td>6.7</td>
</tr>
<tr>
<td>OpenBookQA</td>
<td>60.38</td>
<td>43.16</td>
<td>26.23</td>
<td>7.91</td>
<td>41.86</td>
<td>89.6</td>
<td>19.2</td>
<td>2.6</td>
</tr>
<tr>
<td colspan="9">FlanT5-Base (LoRA)</td>
</tr>
<tr>
<td>MCTest</td>
<td>45.36</td>
<td>31.22</td>
<td>18.54</td>
<td>4.78</td>
<td>29.7</td>
<td>88.18</td>
<td>0.33</td>
<td>0</td>
</tr>
<tr>
<td>SciQ</td>
<td>27.89</td>
<td>18.59</td>
<td>9.71</td>
<td>0.23</td>
<td>33.42</td>
<td>87.31</td>
<td>1.8</td>
<td>0.7</td>
</tr>
<tr>
<td>OpenBookQA</td>
<td>21.26</td>
<td>13.72</td>
<td>7.28</td>
<td>0.59</td>
<td>29.26</td>
<td>86.71</td>
<td>1</td>
<td>0.2</td>
</tr>
<tr>
<td colspan="9">FlanT5-XL (LoRA)</td>
</tr>
<tr>
<td>MCTest</td>
<td>63.27</td>
<td>46.9</td>
<td>30.54</td>
<td>13.25</td>
<td>43.69</td>
<td>91.08</td>
<td>0.17</td>
<td>0</td>
</tr>
<tr>
<td>SciQ</td>
<td>65.18</td>
<td>47.81</td>
<td>27.53</td>
<td>4.71</td>
<td>46.41</td>
<td>90.66</td>
<td>1.1</td>
<td>0.2</td>
</tr>
<tr>
<td>OpenBookQA</td>
<td>60.59</td>
<td>43.15</td>
<td>25.77</td>
<td>6.58</td>
<td>40.6</td>
<td>90.02</td>
<td>1.4</td>
<td>0</td>
</tr>
<tr>
<td colspan="9">LongT5-Base</td>
</tr>
<tr>
<td>MCTest</td>
<td>66.6</td>
<td>51.72</td>
<td>35.76</td>
<td>19.43</td>
<td>50.84</td>
<td>91.76</td>
<td>9.17</td>
<td>2.17</td>
</tr>
<tr>
<td>SciQ</td>
<td>68.08</td>
<td>49.57</td>
<td>28.89</td>
<td>5.17</td>
<td>47.81</td>
<td>89.29</td>
<td>25.9</td>
<td>16.3</td>
</tr>
<tr>
<td>OpenBookQA</td>
<td>59.05</td>
<td>41.81</td>
<td>25.42</td>
<td>7.64</td>
<td>41.33</td>
<td>89.11</td>
<td>24.6</td>
<td>14.6</td>
</tr>
<tr>
<td colspan="9">LongT5-Large</td>
</tr>
<tr>
<td>MCTest</td>
<td>61.76</td>
<td>45.03</td>
<td>29.93</td>
<td>13.91</td>
<td>45</td>
<td>90.88</td>
<td>9.33</td>
<td>2.17</td>
</tr>
<tr>
<td>SciQ</td>
<td>61.97</td>
<td>44.24</td>
<td>24.74</td>
<td>2.35</td>
<td>42.11</td>
<td>88.61</td>
<td>26.5</td>
<td>4.8</td>
</tr>
<tr>
<td>OpenBookQA</td>
<td>55.37</td>
<td>38.27</td>
<td>22.12</td>
<td>4.73</td>
<td>37.55</td>
<td>88.96</td>
<td>14.6</td>
<td>1</td>
</tr>
<tr>
<td colspan="9">LongT5-XL (LoRA)</td>
</tr>
<tr>
<td>MCTest</td>
<td>70.09</td>
<td>54.23</td>
<td>36.89</td>
<td>18.73</td>
<td>49.29</td>
<td>92.04</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>SciQ</td>
<td>48.92</td>
<td>34.35</td>
<td>18.5</td>
<td>1.45</td>
<td>44.31</td>
<td>89.54</td>
<td>0.3</td>
<td>0</td>
</tr>
<tr>
<td>OpenBookQA</td>
<td>45.41</td>
<td>30.7</td>
<td>16.5</td>
<td>1.82</td>
<td>36.07</td>
<td>88.61</td>
<td>1.2</td>
<td>0</td>
</tr>
<tr>
<td colspan="9">GPT-3.5</td>
</tr>
<tr>
<td>MCTest</td>
<td>58.12</td>
<td>41.44</td>
<td>25.71</td>
<td>9.83</td>
<td>43.42</td>
<td>93.41</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>SciQ</td>
<td>72.7</td>
<td>54.85</td>
<td>32.17</td>
<td>6.88</td>
<td>49.68</td>
<td>94.24</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>OpenBookQA</td>
<td>57.29</td>
<td>40.25</td>
<td>23.67</td>
<td>6.23</td>
<td>40.18</td>
<td>92.69</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>When looking at the XL models, LongT5-XL (LoRA) excels on MCTest, showing no duplication and outperforming GPT-3.5 in BLEU and ROUGE-L, although it still trails on BERTScore. FlanT5-XL (LoRA) notably outperforms its LongT5 counterpart on SciQ and OpenBookQA; it also outperforms GPT-3.5 in BLEU and ROUGE-L for OpenBookQA while showing relatively low duplication. GPT-3.5 clearly outperformed all fine-tuned models in BERTScore and generated no duplicated distractors.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Distance Analysis</title>
<p>The cosine similarity between the distractors and correct answers, measured on the test splits of the four evaluated datasets, showed a median value of around 0.4, with the interquartile range (IQR) falling between 0.2 and 0.6. However, large whiskers and outliers in the box plots indicate that the measures were not evenly distributed (<xref ref-type="fig" rid="fig-6">Fig. 6</xref>). OpenBookQA visibly behaves differently from the rest, with a range and median slightly lower than the other datasets.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Ranges of distance measures (cosine similarity) between distractors and correct answers for the test splits from the MCTest, OpenBookQA, RACE, and SciQ datasets</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_62004-fig-6.tif"/>
</fig>
<p>A distance analysis based on the distractors generated by the fine-tuned models (<xref ref-type="fig" rid="fig-7">Fig. 7</xref>) shows a spread similar to the reference, but with a tendency toward a higher cosine similarity median. FlanT5-XL (LoRA) closely approximates the distance ranges observed in the reference and GPT-3.5. Additionally, the lower performance of FlanT5-Base (LoRA), noted in <xref ref-type="table" rid="table-2">Table 2</xref>, becomes more evident in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, with a similarity median below 0.2.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Ranges of distance measures (cosine similarity) between the correct answer and distractors generated by fine-tuned models for the test split from the RACE dataset, compared against the reference and GPT-3.5</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_62004-fig-7.tif"/>
</fig>
<p>A similar comparison for MCTest, SciQ, and OpenBookQA is shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>. It is evident that smaller models struggled to approximate the distances between the distractors and the correct answer across datasets. In particular, the fully fine-tuned Base versions, which also showed a high duplication rate, present a larger IQR, especially on the SciQ dataset. An upper quartile extending to 1 (the maximum similarity) indicates that these models generate distractors that are, in effect, paraphrases of or synonyms for the correct answer. An example is shown in <xref ref-type="table" rid="table-4">Table 4</xref> for FlanT5-Base, where the generated distractors include the correct answer &#x201C;nervous system&#x201D; as part of them. Compared to the reference and LongT5-XL (LoRA), the cosine similarity of the FlanT5-Base examples is considerably higher. A similar phenomenon is observed for OpenBookQA, as shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Ranges of distance measures (cosine similarity) between the correct answer and distractors generated by fine-tuned models for the test split from MCTest, SciQ, and OpenBookQA datasets, compared with reference and GPT-3.5</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_62004-fig-8.tif"/>
</fig><table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Example of distractors generated by FlanT5-Base that include the correct answer as a part of the options and show high cosine similarity (CS) compared to distractors generated by a larger model and the SciQ reference</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Model or reference</th>
<th>Correct answer</th>
<th>Distractor</th>
<th>CS</th>
</tr>
</thead>
<tbody>
<tr>
<td>FlanT5-Base</td>
<td>Nervous system</td>
<td>The <italic>nervous system</italic></td>
<td>0.915</td>
</tr>
<tr>
<td>FlanT5-Base</td>
<td>Nervous system</td>
<td>The <italic>nervous system</italic> of the brain</td>
<td>0.870</td>
</tr>
<tr>
<td>FlanT5-Base</td>
<td>Nervous system</td>
<td>The <italic>nervous system</italic> of the body</td>
<td>0.891</td>
</tr>
<tr>
<td>LongT5-XL (LoRA)</td>
<td>Nervous system</td>
<td>Respiratory system</td>
<td>0.490</td>
</tr>
<tr>
<td>LongT5-XL (LoRA)</td>
<td>Nervous system</td>
<td>Digestive system</td>
<td>0.431</td>
</tr>
<tr>
<td>LongT5-XL (LoRA)</td>
<td>Nervous system</td>
<td>Circulatory system</td>
<td>0.496</td>
</tr>
<tr>
<td>SciQ Reference</td>
<td>Nervous system</td>
<td>Cardiovascular system</td>
<td>0.542</td>
</tr>
<tr>
<td>SciQ Reference</td>
<td>Nervous system</td>
<td>Circulatory system</td>
<td>0.496</td>
</tr>
<tr>
<td>SciQ Reference</td>
<td>Nervous system</td>
<td>Central system</td>
<td>0.440</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Grammatical Correctness</title>
<p>Overall, the larger models (XL versions) fine-tuned using LoRA and quantization tend to have the lowest rates of distractors with grammatical issues (less than 0.25%) among all fine-tuned models (<xref ref-type="fig" rid="fig-9">Fig. 9</xref>). However, it is worth noting that FlanT5-Base (LoRA) exhibits an even lower percentage of distractors with grammatical errors for the RACE dataset but one of the highest for SciQ. In general, the FlanT5 models show fewer grammatical problems across all datasets than LongT5, with LongT5-Large being the worst performer in the grammar analysis. These results could be due to the nature of the pre-trained FlanT5, which has been instruction-tuned on a broad mix of datasets and tasks.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Percentage of distractors with grammar issues, generated by the fine-tuned models for the test split from RACE, MCTest, SciQ, and OpenBookQA datasets, compared with reference and GPT-3.5</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_62004-fig-9.tif"/>
</fig>
<p>The GPT-3.5 baseline consistently has the lowest percentage of distractors with grammatical errors, except for MCTest. Interestingly, the references from each dataset tend to be on the higher end of the error percentages (although still very low, around 0.50% for RACE, MCTest, and OpenBookQA). This could be explained by the fact that these datasets were built using a mix of crowdsourcing and semi-automated techniques.</p>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Discussion</title>
<p>The results demonstrate that the proposed approach can outperform baseline models in the DG task, evidenced by the improvements in BLEU and ROUGE-L metrics in <xref ref-type="table" rid="table-2">Table 2</xref>. When compared to previous works [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-12">12</xref>], the fine-tuned models in this study, which employ a text-to-text format to jointly generate all distractors in a single inference, improved BLEU-2 to 38.2, BLEU-3 to 25.39, BLEU-4 to 13.53, and ROUGE-L to 38.55 for the RACE dataset. These results showcase an improvement in the overlap between the generated distractors and the reference, especially for higher-order n-grams. It is worth noting that one of the reference models (GPT-2 &#x002B; DistilBERT) still shows a higher BLEU-1 score. However, the relevance of 1-gram metrics for distractors in the RACE dataset is limited, given that most distractors are composed of multiple words. The baseline models T5-DG and Seq-to-Seq, which generate a single distractor per inference without any ranking, exhibit significantly lower performance than the fine-tuned models, including the smaller ones.</p>

<p>Further analysis is needed to understand the extent to which some distractors generated by this approach may represent variations of the correct answer, for instance through synonyms or paraphrasing. A recent study focused on generating programming-related MCQs with GPT-4 observed the same phenomenon [<xref ref-type="bibr" rid="ref-47">47</xref>]: in some instances, all distractors generated for a question were valid correct answers. The presented analysis of cosine similarity between correct answers and distractors can provide insight into this tendency, which manifests as an increase in the median and upper quartile of the similarity scores.</p>
<p>It is also worth noting that the distractors contained in the datasets do not necessarily originate from the text or refer to it. Some options are plausible to a human reader yet not directly related to the context, making them challenging to model. This is why conventional token-overlap metrics like BLEU or ROUGE do not, on their own, accurately capture distractor quality. The inclusion of BERTScore provides insight into semantic proximity to the reference text. However, unlike this study, the baseline models from other works do not report this score, opening opportunities for future research in this area.</p>
<p>It is also important to mention that the BERTScore for GPT-3.5 outputs was consistently higher across all datasets, particularly for MCTest, SciQ, and OpenBookQA. The best-performing fine-tuned models in this study achieved BERTScores of 92.08, 90.66, and 90.02 on these datasets, respectively, while GPT-3.5 scored 93.41, 94.24, and 92.69, indicating that the fine-tuned models still fall behind an LLM like GPT-3.5 in the semantic proximity of their distractors. One potential explanation is GPT-3.5's model size and the vast amount of data used for its training, allowing it to model semantic relationships better. Using larger T5 variants, such as XXL, and increasing the diversity of the fine-tuning datasets (beyond RACE) could potentially improve this metric.</p>
<p>The observed tendency to duplicate distractors was not fully eliminated. Models fine-tuned using LoRA exhibited higher distractor diversity, evidenced by significantly lower duplication rates. The precise reasons require further investigation, but a plausible explanation lies in the nature of LoRA: only a small set of parameters is fine-tuned while the majority remains frozen, which may curb overfitting and thus lead to better generalization, including the generation of more diverse distractors.</p>
<p>Given that the datasets used in the experiments were mostly composed of questions with 4 options (1 correct answer &#x002B; 3 distractors), the flexibility to control the resulting number of distractors for each question is limited. This could be addressed by enriching the training dataset with a variable number of distractors per question (for example, generated by GPT-3.5) and adjusting the prefix of the DG task to specify the number of distractors to generate.</p>
<p>Also, the semantic proximity of the distractors to the correct answer (and to one another) cannot be controlled. Moreover, the datasets utilized consist of MCQs with a single correct option; further research is therefore required to investigate the performance of the proposed approach with questions whose correct answers comprise multiple options.</p>
<p>Fine-tuned models were capable of generating distractors that are grammatically correct, sometimes matching the level of GPT-3.5 and even surpassing the reference. However, a deeper analysis and future research are needed to automatically evaluate their effectiveness and quality, including considerations such as the length of the distractor compared to the correct answer, the plausibility of the generated options, grammatical concordance with the question, and linguistic complexity, among other recommendations for writing effective multiple-choice items [<xref ref-type="bibr" rid="ref-4">4</xref>]. In addition, human evaluation could offer an opportunity to further assess the quality of generated distractors using the proposed method and explore to what extent these distractors can confuse examinees. Due to the constraints of the current study, this was outside of the scope, presenting an opportunity for future work.</p>
<p>When comparing the fine-tuned XL versions, FlanT5 outperformed LongT5 on SciQ and OpenBookQA. These datasets have significantly shorter inputs and distractors compared to RACE and MCTest, where LongT5 with the TGlobal attention mechanism exhibited better performance. However, for smaller models, FlanT5 outperformed LongT5 most of the time across all datasets. As a consequence, further research is needed to understand the impact and effectiveness of the TGlobal attention mechanism, particularly for the DG task.</p>
<p>LLMs typically require extensive fine-tuning and significant computational resources, even when using LoRA and quantization to reduce memory usage. Most of the cost comes from training, which can take many hours (XL models took 60&#x2013;65 h on a single GPU). During inference, however, the cost difference compared to API-based solutions such as GPT-3.5 is smaller: the fine-tuned Large models can generate distractors for all test splits for about $1.89 (1 h of compute time), compared to $3.05 for GPT-3.5-Turbo.</p>
<p>Overall, the findings of this study demonstrate the potential of using a text-to-text approach for the joint generation of distractors for MCQs. Nevertheless, more comprehensive research is required to fully understand its limitations and potential and investigate alternate datasets, architectures, and methodologies for distractor generation via large language models.</p>
</sec>
<sec id="s6">
<label>6</label>
<title>Limitations</title>
<p>Due to GPU resource limitations, this study only fine-tuned models up to 3 billion parameters (XL versions), and it was not possible to fine-tune the larger models with 11 billion parameters (XXL versions).</p>
<p>The evaluation of distractor quality in this study relies mainly on automatic metrics, which do not capture the impact of the distractors on learning outcomes. Human evaluation, case studies, and psychometric analyses for examining item difficulty and discrimination are recommended to validate the educational effectiveness of the generated distractors and their applicability in educational settings.</p>
<p>The method was based on the RACE dataset, and the evaluation included other RC datasets with different domains, such as science and common knowledge. However, its generalizability to any other domain, question type, or language besides English remains unconfirmed.</p>
<p>Finally, while metrics like BERTScore provide useful perspectives for the DG task, they do not model all the characteristics of a good distractor. In fact, they can overestimate similarity for cases like negations.</p>
</sec>
<sec id="s7">
<label>7</label>
<title>Conclusion</title>
<p>This study presents a text-to-text approach for the joint generation (i.e., all at once) of distractors and evaluates its potential by fine-tuning FlanT5 and LongT5 (with TGlobal attention) models using a RACE-based dataset. Both Base and Large variants are fully fine-tuned, while XL variants are fine-tuned using LoRA and 4-bit quantization. Compared to previous works, the proposed method and models demonstrate an improvement in the standard metrics, BLEU and ROUGE-L, for distractors generated for the RACE dataset. They also outperform, on the same metrics, a baseline generated in this study using GPT-3.5. The fine-tuned models have been published and made available on the Huggingface platform (<xref ref-type="sec" rid="app-1">Appendix A</xref>).</p>
<p>An additional evaluation is performed by generating distractors for other MCQ datasets (MCTest, SciQ, and OpenBookQA). The FlanT5-XL model fine-tuned with LoRA outperformed its LongT5 counterpart on SciQ and OpenBookQA, but LongT5-XL performed better on MCTest and RACE. In the case of smaller models, FlanT5 typically outperformed LongT5 across all datasets.</p>
<p>This study introduces BERTScore as an additional metric in the evaluation framework for DG, given that research suggests token-overlap metrics such as BLEU and ROUGE do not fully measure the quality of distractors. BERTScore results show that models fine-tuned using the proposed approach generate distractors that are semantically close to the reference. However, despite underperforming on BLEU and ROUGE, the GPT-3.5 baseline still scored better on this metric.</p>
<p>The presented approach generates multiple distractors per model inference, taking into consideration the relationship of all distractors with the context, question, and correct answer. This leads to better performance when compared to generating a single distractor per inference. Additionally, this method generates sets of grammatically correct distractors that can approximate the range of semantic distances with the correct answer observed in the references, especially those generated by the XL models. A tendency toward the repetition of distractors has been observed, with models fine-tuned using LoRA exhibiting a considerably lower rate of duplicated distractors when compared to fully fine-tuned models. Additional research is needed to fully understand how LoRA fine-tuning leads to better diversity with the proposed approach for DG.</p>
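The repetition tendency noted above can be quantified with a simple duplicate-rate measure over the generated option sets. The sketch below is a minimal illustration of such a measure (case-insensitive exact match), not the study's actual evaluation code:

```python
def duplicate_rate(distractor_sets: list[list[str]]) -> float:
    """Fraction of generated distractors that duplicate another option
    within the same question's set.

    Uses case-insensitive exact matching after stripping whitespace;
    near-duplicates (paraphrases) are not detected by this heuristic.
    """
    total = dups = 0
    for options in distractor_sets:
        seen = set()
        for d in options:
            key = d.strip().lower()
            if key in seen:
                dups += 1
            else:
                seen.add(key)
            total += 1
    return dups / total if total else 0.0
```

Comparing this rate between fully fine-tuned and LoRA fine-tuned models would make the diversity difference directly measurable.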
<p>Future work can explore how elements like distractor length, option plausibility, grammatical consistency, and linguistic complexity, alongside the relationship of distances between distractors, correct answers, and the context, could help develop better metrics to automatically assess the quality of distractors generated for MCQs. Although the ability of the generated distractors to confuse examinees is not analyzed, human evaluation offers an opportunity for future studies. Furthermore, this study suggests how the proposed text-to-text approach can be improved by enriching the training dataset and adjusting the task prefix to control the number of distractors generated in a single inference. Lastly, distractors generated using GPT-3.5-turbo-1106 for the test splits of RACE, MCTest, SciQ, and OpenBookQA datasets have been made available (<xref ref-type="sec" rid="app-2">Appendix B</xref>). These distractors can be used by other studies as baselines for comparing performance in future works.</p>
</sec>
</body>
<back>
<ack>
<p>This work was partially co-funded by the Comunidad de Madrid (Grant number: CM/JIN/2021-034) and the University of Alcala (Grant number: PIUAH21/IA-010 and PIUAH23/IA-007).</p>
</ack>


<sec>
<title>Funding Statement</title>
<p>This work was supported by the Universidad de Alcal&#x00E1; (UAH) under Grant PIUAH21/IA-010; and Comunidad Aut&#x00F3;noma de Madrid under Grant CM/JIN/2021-034.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm their contribution to the paper as follows: Ricardo Rodriguez-Torrealba: Conceptualization, Investigation, Methodology, Software, Visualization, Writing&#x2014;review &#x0026; editing. Eva Garcia-Lopez: Conceptualization, Investigation, Methodology, Supervision, Writing&#x2014;review &#x0026; editing. Antonio Garcia-Cabot: Funding acquisition, Investigation, Methodology, Supervision, Writing&#x2014;review &#x0026; editing. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are available from the corresponding author, EGL, upon reasonable request.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<app-group id="apg-1">
<app id="app-1"><label>Appendix A</label>
<title></title>
<p>Fine-tuned FlanT5 and LongT5 models for DG have been made public under the Apache-2.0 License on the Hugging Face<xref ref-type="fn" rid="fn-2"><sup>2</sup></xref><fn id="fn-2"><label>2</label><p><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/">https://huggingface.co/</ext-link> (accessed on 1 January 2025).</p></fn> platform (<xref ref-type="table" rid="table-5">Table A1</xref>).</p>
<table-wrap id="table-5">
<label>Table A1</label>
<caption>
<title>List of published models</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Variant</th>
<th>URL</th>
</tr>
</thead>
<tbody>
<tr>
<td>FlanT5-Base</td>
<td><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/rrodrigu3z/flan-t5-base-joint-dg">https://huggingface.co/rrodrigu3z/flan-t5-base-joint-dg</ext-link> (accessed on 1 January 2025)</td>
</tr>
<tr>
<td>FlanT5-Large</td>
<td><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/rrodrigu3z/flan-t5-large-joint-dg">https://huggingface.co/rrodrigu3z/flan-t5-large-joint-dg</ext-link> (accessed on 1 January 2025)</td>
</tr>
<tr>
<td>FlanT5-Base (LoRA)</td>
<td><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/rrodrigu3z/flan-t5-base">https://huggingface.co/rrodrigu3z/flan-t5-base</ext-link> (accessed on 1 January 2025)</td>
</tr>
<tr>
<td>FlanT5-XL (LoRA)</td>
<td><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/rrodrigu3z/flan-t5-xl/tree/main">https://huggingface.co/rrodrigu3z/flan-t5-xl/tree/main</ext-link> (accessed on 1 January 2025)</td>
</tr>
<tr>
<td>LongT5-Base</td>
<td><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/rrodrigu3z/long-t5-tglobal-base-joint-dg">https://huggingface.co/rrodrigu3z/long-t5-tglobal-base-joint-dg</ext-link> (accessed on 1 January 2025)</td>
</tr>
<tr>
<td>LongT5-Large</td>
<td><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/rrodrigu3z/long-t5-tglobal-large-joint-dg">https://huggingface.co/rrodrigu3z/long-t5-tglobal-large-joint-dg</ext-link> (accessed on 1 January 2025)</td>
</tr>
<tr>
<td>LongT5-XL (LoRA)</td>
<td><ext-link ext-link-type="uri" xlink:href="https://huggingface.co/rrodrigu3z/long-t5-tglobal-xl/tree/main">https://huggingface.co/rrodrigu3z/long-t5-tglobal-xl/tree/main</ext-link> (accessed on 1 January 2025)</td>
</tr>
</tbody>
</table>
</table-wrap>
</app>
<app id="app-2"><label>Appendix B</label>
<title></title>
<p>Distractors generated for all datasets using GPT-3.5-turbo-1106 via OpenAI API can be downloaded in the following URL: <ext-link ext-link-type="uri" xlink:href="https://dg-inferences.s3.amazonaws.com/gpt-3.5-baseline/chatgpt_predictions.jsonl">https://dg-inferences.s3.amazonaws.com/gpt-3.5-baseline/chatgpt_predictions.jsonl</ext-link> (accessed on 1 January 2025).</p>
</app>
</app-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Butler</surname> <given-names>AC</given-names></string-name></person-group>. <article-title>Multiple-choice testing in education: are the best practices for assessment also good for learning?</article-title> <source>J Appl Res Mem Cogn</source>. <year>2018</year>;<volume>7</volume>(<issue>3</issue>):<fpage>323</fpage>&#x2013;<lpage>31</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.jarmac.2018.07.002</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shin</surname> <given-names>J</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Gierl</surname> <given-names>MJ</given-names></string-name></person-group>. <article-title>Multiple-choice item distractor development using topic modeling approaches</article-title>. <source>Front Psychol</source>. <year>2019</year>;<volume>10</volume>:<fpage>825</fpage>. doi:<pub-id pub-id-type="doi">10.3389/fpsyg.2019.00825</pub-id>; <pub-id pub-id-type="pmid">31133911</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Lane</surname> <given-names>S</given-names></string-name></person-group>. <source>Handbook of test development</source>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>Routledge</publisher-name>; <year>2015</year>. doi:<pub-id pub-id-type="doi">10.4324/9780203102961</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Haladyna</surname> <given-names>TM</given-names></string-name>, <string-name><surname>Rodriguez</surname> <given-names>MC</given-names></string-name></person-group>. <source>Developing and validating test items</source>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>Routledge</publisher-name>; <year>2013</year>. doi:<pub-id pub-id-type="doi">10.4324/9780203850381</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>L</given-names></string-name>, <string-name><surname>VanLehn</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Evaluation of auto-generated distractors in multiple choice questions from a semantic network</article-title>. <source>Interact Learn Environ</source>. <year>2021</year>;<volume>29</volume>(<issue>6</issue>):<fpage>1019</fpage>&#x2013;<lpage>36</lpage>. doi:<pub-id pub-id-type="doi">10.1080/10494820.2019.1619586</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kumar</surname> <given-names>AP</given-names></string-name>, <string-name><surname>Nayak</surname> <given-names>A</given-names></string-name>, <string-name><surname>Manjula Shenoy</surname> <given-names>K</given-names></string-name>, <string-name><surname>Goyal</surname> <given-names>S</given-names></string-name></person-group>. <article-title>A novel approach to generate distractors for multiple choice questions</article-title>. <source>Expert Syst Appl</source>. <year>2023</year>;<volume>225</volume>(<issue>7</issue>):<fpage>120022</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2023.120022</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><collab>UNESCO</collab></person-group>. <article-title>Artificial intelligence in education: challenges and opportunities for sustainable development [Internet]</article-title>. <source>Work Pap Educ Policy</source>. <year>2019</year>;<volume>7</volume>:<fpage>46</fpage>. [cited 2025 Jan 1]. Available from: <ext-link ext-link-type="uri" xlink:href="https://en.unesco.org/themes/education-policy-">https://en.unesco.org/themes/education-policy-</ext-link>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>F</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>C</given-names></string-name>, <string-name><surname>Bao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>M</given-names></string-name></person-group>. <chapter-title>Neural question generation from text: a preliminary study</chapter-title>. In: <source>Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics)</source>. <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2018</year>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-73618-1_56</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rodriguez-Torrealba</surname> <given-names>R</given-names></string-name>, <string-name><surname>Garcia-Lopez</surname> <given-names>E</given-names></string-name>, <string-name><surname>Garcia-Cabot</surname> <given-names>A</given-names></string-name></person-group>. <article-title>End-to-end generation of multiple-choice questions using text-to-text transfer transformer models</article-title>. <source>Expert Syst Appl</source>. <year>2022</year>;<volume>208</volume>:<fpage>118258</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2022.118258</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Gao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Bing</surname> <given-names>L</given-names></string-name>, <string-name><surname>Li</surname> <given-names>P</given-names></string-name>, <string-name><surname>King</surname> <given-names>I</given-names></string-name>, <string-name><surname>Lyu</surname> <given-names>MR</given-names></string-name></person-group>. <article-title>Generating distractors for reading comprehension questions from real examinations</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>; <year>2019 Jan 27&#x2013;Feb 1</year>; <publisher-loc>Honolulu, HI, USA</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v33i01.33016423</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Dave</surname> <given-names>N</given-names></string-name>, <string-name><surname>Wham</surname> <given-names>D</given-names></string-name>, <string-name><surname>Pursel</surname> <given-names>B</given-names></string-name>, <string-name><surname>Giles</surname> <given-names>CL</given-names></string-name></person-group>. <article-title>Distractor generation for multiple choice questions using learning to rank</article-title>. In: <conf-name>Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications</conf-name>; <year>2018 Jun 5</year>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>284</fpage>&#x2013;<lpage>90</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/w18-0533</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Offerijns</surname> <given-names>J</given-names></string-name>, <string-name><surname>Verberne</surname> <given-names>S</given-names></string-name>, <string-name><surname>Verhoef</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Better distractions: transformer-based distractor generation and multiple choice question filtering [Internet]</article-title>. <year>[cited 2021 Nov 13]</year>. Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2010.09598">http://arxiv.org/abs/2010.09598</ext-link>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Qiu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Automatic distractor generation for multiple choice questions in standard tests</article-title>. <comment>arXiv:2011.13100. 2021</comment>. doi:<pub-id pub-id-type="doi">10.18653/v1/2020.coling-main</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Papineni</surname> <given-names>K</given-names></string-name>, <string-name><surname>Roukos</surname> <given-names>S</given-names></string-name>, <string-name><surname>Ward</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>WJ</given-names></string-name></person-group>. <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>. In: <conf-name>Proceedings of the 40th Annual Meeting on Association for Computational Linguistics&#x2014;ACL &#x2019;02</conf-name>; <year>2002 Jul 7&#x2013;12</year>; <publisher-loc>Philadelphia, PA, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>CY</given-names></string-name></person-group>. <article-title>ROUGE: a package for automatic evaluation of summaries</article-title>. In: <conf-name>Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004)</conf-name>; <year>2004 Jul 25&#x2013;26</year>; <publisher-loc>Barcelona, Spain</publisher-loc>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Vijayakumar</surname> <given-names>AK</given-names></string-name>, <string-name><surname>Cogswell</surname> <given-names>M</given-names></string-name>, <string-name><surname>Selvaraju</surname> <given-names>RR</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>S</given-names></string-name>, <string-name><surname>Crandall</surname> <given-names>D</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Diverse beam search: decoding diverse solutions from neural sequence models [Internet]</article-title>. <year>[cited 2021 Dec 23]</year>. Available from: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1610.02424">https://arxiv.org/abs/1610.02424</ext-link>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chung</surname> <given-names>HW</given-names></string-name>, <string-name><surname>Hou</surname> <given-names>L</given-names></string-name>, <string-name><surname>Longpre</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zoph</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tay</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fedus</surname> <given-names>W</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Scaling instruction-finetuned language models</article-title>. <source>J Mach Learn Res</source>. <year>2024</year>;<volume>25</volume>(<issue>70</issue>):<fpage>1</fpage>&#x2013;<lpage>53</lpage>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>M</given-names></string-name>, <string-name><surname>Ainslie</surname> <given-names>J</given-names></string-name>, <string-name><surname>Uthus</surname> <given-names>D</given-names></string-name>, <string-name><surname>Ontanon</surname> <given-names>S</given-names></string-name>, <string-name><surname>Ni</surname> <given-names>J</given-names></string-name>, <string-name><surname>Sung</surname> <given-names>YH</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>LongT5: efficient text-to-text transformer for long sequences</article-title>. <comment>arXiv:2112.07916. 2021</comment>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lai</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hovy</surname> <given-names>E</given-names></string-name></person-group>. <article-title>RACE: large-scale ReAding comprehension dataset from examinations</article-title>. In: <conf-name>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</conf-name>; <year>2017 Sep 9&#x2013;11</year>; <publisher-loc>Copenhagen, Denmark</publisher-loc>. doi:<pub-id pub-id-type="doi">10.18653/v1/d17-1082</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Kishore</surname> <given-names>V</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>F</given-names></string-name>, <string-name><surname>Weinberger</surname> <given-names>KQ</given-names></string-name>, <string-name><surname>Artzi</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>BERTScore: evaluating text generation with BERT</article-title>. <comment>arXiv:1904.09675. 2019</comment>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Sutskever</surname> <given-names>I</given-names></string-name>, <string-name><surname>Vinyals</surname> <given-names>O</given-names></string-name>, <string-name><surname>Le</surname> <given-names>QV</given-names></string-name></person-group>. <article-title>Sequence to sequence learning with neural networks [Internet]</article-title>. <year>[cited 2025 Jan 1]</year>. Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1409.3215">http://arxiv.org/abs/1409.3215</ext-link>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vaswani</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shazeer</surname> <given-names>N</given-names></string-name>, <string-name><surname>Parmar</surname> <given-names>N</given-names></string-name>, <string-name><surname>Uszkoreit</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gomez</surname> <given-names>AN</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Attention is all you need</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2017</year>;<volume>30</volume>:<fpage>5999</fpage>&#x2013;<lpage>6009</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Radford</surname> <given-names>A</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Child</surname> <given-names>R</given-names></string-name>, <string-name><surname>Luan</surname> <given-names>D</given-names></string-name>, <string-name><surname>Amodei</surname> <given-names>D</given-names></string-name>, <string-name><surname>Sutskever</surname> <given-names>I</given-names></string-name></person-group>. <article-title>Language models are unsupervised multitask learners</article-title>. <source>OpenAI Blog</source>. <year>2019</year>;<volume>1</volume>(<issue>8</issue>):<fpage>9</fpage>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Devlin</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>MW</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>K</given-names></string-name>, <string-name><surname>Toutanova</surname> <given-names>K</given-names></string-name></person-group>. <article-title>BERT: pre-training of deep bidirectional transformers for language understanding [Online]</article-title>. <year>[cited 2025 Jan 1]</year>. Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</ext-link>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Raffel</surname> <given-names>C</given-names></string-name>, <string-name><surname>Shazeer</surname> <given-names>N</given-names></string-name>, <string-name><surname>Roberts</surname> <given-names>A</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>K</given-names></string-name>, <string-name><surname>Narang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Matena</surname> <given-names>M</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer [Internet]</article-title>. <source>J Mach Learn Res</source>. <year>2019</year>;<volume>21</volume>:<fpage>1</fpage>&#x2013;<lpage>67</lpage>. <comment>[cited 2025 Jan 1]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1910.10683">http://arxiv.org/abs/1910.10683</ext-link>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Welbl</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>NF</given-names></string-name>, <string-name><surname>Gardner</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Crowdsourcing multiple choice science questions</article-title>. In: <conf-name>Proceedings of the 3rd Workshop on Noisy User-generated Text</conf-name>; <year>2017 Sep 7</year>; <publisher-loc>Copenhagen, Denmark</publisher-loc>. doi:<pub-id pub-id-type="doi">10.18653/v1/w17-4413</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Pennington</surname> <given-names>J</given-names></string-name>, <string-name><surname>Socher</surname> <given-names>R</given-names></string-name>, <string-name><surname>Manning</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Glove: global vectors for word representation</article-title>. In: <conf-name>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</conf-name>; <year>2014 Oct 25&#x2013;29</year>; <publisher-loc>Doha, Qatar</publisher-loc>. doi:<pub-id pub-id-type="doi">10.3115/v1/d14-1162</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chomphooyod</surname> <given-names>P</given-names></string-name>, <string-name><surname>Suchato</surname> <given-names>A</given-names></string-name>, <string-name><surname>Tuaycharoen</surname> <given-names>N</given-names></string-name>, <string-name><surname>Punyabukkana</surname> <given-names>P</given-names></string-name></person-group>. <article-title>English grammar multiple-choice question generation using text-to-text transfer transformer</article-title>. <source>Comput Educ Artif Intell</source>. <year>2023</year>;<volume>5</volume>:<fpage>100158</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.caeai.2023.100158</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Sanh</surname> <given-names>V</given-names></string-name>, <string-name><surname>Debut</surname> <given-names>L</given-names></string-name>, <string-name><surname>Chaumond</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wolf</surname> <given-names>T</given-names></string-name></person-group>. <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>. <comment>arXiv:1910.01108. 2019</comment>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Clark</surname> <given-names>P</given-names></string-name>, <string-name><surname>Cowhey</surname> <given-names>I</given-names></string-name>, <string-name><surname>Etzioni</surname> <given-names>O</given-names></string-name>, <string-name><surname>Khot</surname> <given-names>T</given-names></string-name>, <string-name><surname>Sabharwal</surname> <given-names>A</given-names></string-name>, <string-name><surname>Schoenick</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Think you have solved question answering? Try ARC, the AI2 reasoning challenge</article-title>. <comment>arXiv:1803.05457. 2018</comment>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Talmor</surname> <given-names>A</given-names></string-name>, <string-name><surname>Herzig</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lourie</surname> <given-names>N</given-names></string-name>, <string-name><surname>Berant</surname> <given-names>J</given-names></string-name></person-group>. <article-title>CommonSenseqa: a question answering challenge targeting com-monsense knowledge</article-title>. In: <conf-name>NAACL HLT 2019&#x2014;2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</conf-name>; <year>2019 Jun 2&#x2013;7</year>; <publisher-loc>Minneapolis, MN, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Richardson</surname> <given-names>M</given-names></string-name>, <string-name><surname>Burges</surname> <given-names>CJC</given-names></string-name>, <string-name><surname>Renshaw</surname> <given-names>E</given-names></string-name></person-group>. <article-title>MCTest: a challenge dataset for the open-domain machine comprehension of text</article-title>. In: <conf-name>Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</conf-name>; <year>2013 Oct 18&#x2013;21</year>; <publisher-loc>Seattle, WA, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Mihaylov</surname> <given-names>T</given-names></string-name>, <string-name><surname>Clark</surname> <given-names>P</given-names></string-name>, <string-name><surname>Khot</surname> <given-names>T</given-names></string-name>, <string-name><surname>Sabharwal</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Can a suit of armor conduct electricity? A new dataset for open book question answering</article-title>. In: <conf-name>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</conf-name>; <year>2018 Oct 31&#x2013;Nov 4</year>; <publisher-loc>Brussels, Belgium</publisher-loc>. doi:<pub-id pub-id-type="doi">10.18653/v1/d18-1260</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Khashabi</surname> <given-names>D</given-names></string-name>, <string-name><surname>Min</surname> <given-names>S</given-names></string-name>, <string-name><surname>Khot</surname> <given-names>T</given-names></string-name>, <string-name><surname>Sabharwal</surname> <given-names>A</given-names></string-name>, <string-name><surname>Tafjord</surname> <given-names>O</given-names></string-name>, <string-name><surname>Clark</surname> <given-names>P</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>UNIFIEDQA: crossing format boundaries with a single QA system</article-title>. <comment>arXiv:2005.00700. 2020</comment>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Khashabi</surname> <given-names>D</given-names></string-name>, <string-name><surname>Kordi</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hajishirzi</surname> <given-names>H</given-names></string-name></person-group>. <article-title>UnifiedQA-v2: stronger generalization via broader cross-format training</article-title>. <comment>arXiv:2202.12359. 2022</comment>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lewis</surname> <given-names>M</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Goyal</surname> <given-names>N</given-names></string-name>, <string-name><surname>Ghazvininejad</surname> <given-names>M</given-names></string-name>, <string-name><surname>Mohamed</surname> <given-names>A</given-names></string-name>, <string-name><surname>Levy</surname> <given-names>O</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>. In: <conf-name>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</conf-name>; <year>2020 Jul 5&#x2013;10</year>; <publisher-loc>Online</publisher-loc>. doi:<pub-id pub-id-type="doi">10.18653/v1/2020.acl-main.703</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Raina</surname> <given-names>V</given-names></string-name>, <string-name><surname>Gales</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Multiple-choice question generation: towards an automated assessment framework</article-title>. <comment>arXiv:2209.11830. 2022</comment>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Qiu</surname> <given-names>X</given-names></string-name></person-group>. <article-title>A survey of transformers</article-title>. <source>AI Open</source>. <year>2022</year>;<volume>3</volume>(<issue>120</issue>):<fpage>111</fpage>&#x2013;<lpage>32</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.aiopen.2022.10.001</pub-id>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Mangrulkar</surname> <given-names>S</given-names></string-name>, <string-name><surname>Gugger</surname> <given-names>S</given-names></string-name>, <string-name><surname>Debut</surname> <given-names>L</given-names></string-name>, <string-name><surname>Belkada</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Paul</surname> <given-names>S</given-names></string-name>, <string-name><surname>Bossan</surname> <given-names>B</given-names></string-name></person-group>. <source>PEFT: state-of-the-art parameter-efficient fine-tuning methods</source>. <publisher-loc>San Francisco, CA, USA</publisher-loc>: <publisher-name>GitHub</publisher-name>; <year>2022</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Wolf</surname> <given-names>T</given-names></string-name>, <string-name><surname>Debut</surname> <given-names>L</given-names></string-name>, <string-name><surname>Sanh</surname> <given-names>V</given-names></string-name>, <string-name><surname>Chaumond</surname> <given-names>J</given-names></string-name>, <string-name><surname>Delangue</surname> <given-names>C</given-names></string-name>, <string-name><surname>Moi</surname> <given-names>A</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Transformers: state-of-the-art natural language processing [Internet]</article-title>. <year>[cited 2025 Jan 1]</year>. Available from: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1910.03771">https://arxiv.org/abs/1910.03771</ext-link>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>EJ</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wallis</surname> <given-names>P</given-names></string-name>, <string-name><surname>Allen-Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>LoRA: low-rank adaptation of large language models</article-title>. In: <conf-name>ICLR 2022&#x2014;10th Inter-National Conference on Learning Representations</conf-name>; <year>2022 Apr 25&#x2013;29</year>; <publisher-loc>Virtual</publisher-loc>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dettmers</surname> <given-names>T</given-names></string-name>, <string-name><surname>Pagnoni</surname> <given-names>A</given-names></string-name>, <string-name><surname>Holtzman</surname> <given-names>A</given-names></string-name>, <string-name><surname>Zettlemoyer</surname> <given-names>L</given-names></string-name></person-group>. <article-title>QLoRA: efficient finetuning of quantized LLMs</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2023</year>;<volume>36</volume>:<fpage>10088</fpage>&#x2013;<lpage>115</lpage>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jauregi Unanue</surname> <given-names>I</given-names></string-name>, <string-name><surname>Parnell</surname> <given-names>J</given-names></string-name>, <string-name><surname>Piccardi</surname> <given-names>M</given-names></string-name></person-group>. <article-title>BERTTune: fine-tuning neural machine translation with BERTScore</article-title>. In: <conf-name>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)</conf-name>; <year>2021 Aug 1&#x2013;6</year>; <publisher-loc>Online</publisher-loc>. doi:<pub-id pub-id-type="doi">10.18653/v1/2021.acl-short.115</pub-id>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Reimers</surname> <given-names>N</given-names></string-name>, <string-name><surname>Gurevych</surname> <given-names>I</given-names></string-name></person-group>. <article-title>Sentence-BERT: sentence embeddings using Siamese BERT-networks</article-title>. In: <conf-name>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</conf-name>; <year>2019 Nov 3&#x2013;7</year>; <publisher-loc>Hong Kong, China</publisher-loc>. doi:<pub-id pub-id-type="doi">10.18653/v1/d19-1410</pub-id>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hanna</surname> <given-names>M</given-names></string-name>, <string-name><surname>Bojar</surname> <given-names>O</given-names></string-name></person-group>. <article-title>A fine-grained analysis of BERTScore</article-title>. In: <conf-name>Proceedings of the Sixth Conference on Ma-chine Translation</conf-name>; <year>2021 Nov 10&#x2013;11</year>; <publisher-loc>Punta Cana, Dominican Republic</publisher-loc>. p. <fpage>507</fpage>&#x2013;<lpage>17</lpage>. [cited 2025 Jan 1]. Available from: <ext-link ext-link-type="uri" xlink:href="https://aclanthology.org/2021.wmt-1.59/">https://aclanthology.org/2021.wmt-1.59/</ext-link></mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Crossley</surname> <given-names>SA</given-names></string-name>, <string-name><surname>Bradfield</surname> <given-names>F</given-names></string-name>, <string-name><surname>Bustamante</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing</article-title>. <source>J Writ Res</source>. <year>2019</year>;<volume>11</volume>(<issue>2</issue>):<fpage>251</fpage>&#x2013;<lpage>70</lpage>. doi:<pub-id pub-id-type="doi">10.17239/jowr-2019.11.02.01</pub-id>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Doughty</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Bompelli</surname> <given-names>A</given-names></string-name>, <string-name><surname>Qayum</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A comparative study of AI-generated (GPT-4) and human-crafted MCQs in programming education</article-title>. In: <conf-name>Proceedings of the 26th Australasian Computing Education Conference</conf-name>; <year>2024 Jan 29&#x2013;Feb 2</year>; <publisher-loc>Sydney, NSW, Australia</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1145/3636243.3636256</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>