HEDS Form

Download to file
download json

Press the button to download your current form in JSON format.
Upload from file


upload json

Press the button to upload a JSON file. Warning: This will clear your current form completely and then populate it with the contents of the file.
Count of errors
Updates every 60 seconds.
41 blank fields.

Instructions

This is the Human Evaluation Datasheet (HEDS) form which is designed to record full details of human evaluation experiments in Natural Language Processing (NLP), addressing a history of details often going unreported in the field (in extreme cases, no details at all are reported). Reporting such details is crucial for gauging the reliability of results, determining comparability with other experiments, and for assessing reproducibility (Belz et al., 2023a,b; Thomson et al., 2024; Thomson and Belz, 2024). Having a standard set of questions to answer (as provided by HEDS) means not having to worry about what information to include or in what detail, as well as the information being in a format directly comparable to information reported for other human evaluation experiments. To maximise standardisation, questions are in multiple-choice format where possible.

The HEDS form is divided into five main sections, containing questions that record information about resources, evaluated system(s), test set sampling, quality criteria assessed, and ethics, respectively. Within each of the main sections there can be multiple subsections which can be expanded or collapsed.

Each HEDS question comes with instructions and notes to help with answering it, except where the task is exceedingly simple (e.g. when a contact email address is asked for).

HEDS Section 4 needs to be completed for each quality criterion that is evaluated in the experiment. Instructions on how to do this are shown at the start of HEDS Section 4.

The form is not submitted to any server when it is completed; instead it needs to be downloaded to a local file. A tool is available in the GitHub repository for converting the file to LaTeX format (which we used to generate the next section). Please use the "download json" button in the "Download to file" section. This will download a file (in .json format) that contains the current values from each form field. You can also upload a json file (see the "Upload from file" section on the left of the screen). Warning: this will delete your current form content, then populate the blank form with content from the file. It is advisable to download files as a backup while you are completing the form. The form saves the field values in your browser's local storage; they will be deleted if you clear the local storage, or if you are in a private/incognito window and then close it.
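
If you want to inspect or back up a downloaded file programmatically, a minimal sketch is shown below. It assumes (hypothetically) that the file is a flat JSON object mapping form field identifiers to string values; the actual structure of the exported file may differ, so check a downloaded example first.

```python
import json

# Load a previously downloaded HEDS form file (the filename is an example).
with open("heds_form.json", encoding="utf-8") as f:
    form = json.load(f)

# Count fields that are still blank, assuming a flat {field_id: value} layout.
blank = [key for key, value in form.items() if isinstance(value, str) and not value.strip()]
print(f"{len(blank)} blank fields")
```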

The form will not prevent you from downloading your save file, even when there are error or warning messages. Yellow warning messages indicate fields that have not been completed. If a field is not relevant for your experiment, enter N/A, and ideally also explain why. Red messages are errors: for example, if the form expects an integer and you have entered something else, a red message will be shown. Errors will also not prevent you from saving the form.

You can generate a list of all current errors/warnings, along with their section numbers, in the "all form errors" tab at the bottom of the form. A count of errors will also be refreshed every 60 seconds on the panel on the left side of the screen.

We recognise that completing a form of this length and level of detail constitutes an overhead in terms of time and effort, especially the first time a HEDS form is completed when the learning curve is steepest. However, this overhead does go down substantially with each use of HEDS, and, we believe, is far outweighed by the benefits: increased scientific rigour, reliability and repeatability.

We envisage the main uses of HEDS to be as follows. Ideally, it should be completed before a human evaluation experiment is run, at the point when the design is final, as part of a formal preregistration process. Once the experiment has been run, the information in the sheet can be updated if necessary, e.g. if the final number of evaluators had to change due to unforeseen circumstances.

Another use is for the purpose of reporting the details of a completed experiment. For this, the completed HEDS sheet can be automatically converted to LaTeX, ready for inclusion in the supplementary material.

A third use is for carrying out reproducibility studies, as has been done extensively in the ReproGen and ReproNLP shared tasks (Belz et al., 2022; Belz & Thomson, 2024). Here, the HEDS sheets were used to ensure that the original work and the reproduction experiment had the same properties and hence could be expected to produce similar results.


How to cite
The paper describing HEDS 3.0 is Belz & Thomson 2024.



Question 1.1.1:  Where can the main reference for the evaluation experiment be found?

Multiple-choice options (select one)

Referring to the main reference entered for Question 1.1.1, identify the experiment that you’re completing this form for (see the instructions section at the start for an explanation of the term ‘experiment’), in particular to differentiate this experiment from any others that you are carrying out as part of the same overall work: (a) if a link for a published paper was entered under Question 1.1.1, give here the section(s) and/or table(s) that best identify the experiment, plus a brief description for clarity; (b) if ‘preregistration’ or ‘unpublished’ was selected, enter a brief description of the experiment, mentioning quality criteria, dataset and systems.

1.1.2:  Please complete this question.


Question 1.2:  Where can the resources that were used in the evaluation experiment be found?

Multiple-choice options (select one)



1.3.1.1:  Please complete this question.

1.3.1.2:  Please complete this question.

1.3.1.3:  Please complete this question.


1.3.2.1:  Please complete this question.

1.3.2.2:  Please complete this question.

1.3.2.3:  Please complete this question.

Notes: Questions 2.1–2.5 record information about the system(s) evaluated in the experiment that this sheet is being completed for. The input, output, and task questions in this section are closely interrelated: the value for one partially determines the others, as indicated for some combinations in Question 2.3.


Question 2.1:  What type of input do the evaluated system(s) take?

Notes: The term ‘input’ here refers to the text, representations and/or data structures that all of the evaluated systems take as input (including prompts). This question is about input type, regardless of number. E.g. if the input is a set of documents, you would still select ‘text: document’ below.

Check-box options (select all that apply)

Please provide further details for your above selection(s)
2.1:  Please select at least 1 of the above options.

Question 2.2:  What type of output do the evaluated system(s) generate?

Notes: The term ‘output’ here refers to the text, representations and/or data structures that all of the evaluated systems produce as output. This question is about output type, regardless of number. E.g. if the output is a set of documents, you would still select ‘text: document’ below.

Check-box options (select all that apply)

Please provide further details for your above selection(s)
2.2:  Please select at least 1 of the above options.

Question 2.3:  What is the task that the evaluated system(s) perform in mapping the inputs in Question 2.1 to the outputs in Question 2.2?

Notes: This question is about the task(s) performed by the system(s) being evaluated. This is independent of the application domain (financial reporting, weather forecasting, etc.), or the specific method (rule-based, neural, etc.) implemented in the system. We indicate mutual constraints between inputs, outputs and task for some of the options below.

Check-box options (select all that apply)

Please provide further details for your above selection(s)
2.3:  Please select at least 1 of the above options.

Question 2.4:  What are the language(s) of the inputs accepted by the system(s)?

Notes: Select any language(s) that apply from this list of standardised full language names as per ISO 639-1 (2019). If language is not (part of) the input, select ‘N/A’.

Check-box options (select all that apply)

Please provide further details for your above selection(s)
2.4:  Please select at least 1 of the above options.

Question 2.5:  What are the language(s) of the outputs produced by the system?

Notes: Select any language(s) that apply from this list of standardised full language names as per ISO 639-1 (2019). If language is not (part of) the output, select ‘N/A’.

Check-box options (select all that apply)

Please provide further details for your above selection(s)
2.5:  Please select at least 1 of the above options.


Questions 3.1.1–3.1.3 record information about the size of the sample of outputs (or human-authored stand-ins) evaluated per system, how the sample was selected, and what its statistical power is.


The number of system outputs (or other evaluation items) that are evaluated per system by at least one evaluator in the experiment. For most experiments this should be a single integer. If the number of outputs varies, please explain how and why.

3.1.1:  Please complete this question.

Question 3.1.2:  How are system outputs (or other evaluation items) selected for inclusion?

Multiple-choice options (select one)

Please provide further details for your above selection(s)
3.1.2:  Please select at least 1 of the above options.

Notes: All evaluation experiments should perform a power analysis to determine an appropriate sample size. If none was performed, enter ‘N/A’ in Questions 3.1.3.1–3.1.3.3.
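
As a hedged illustration only (not part of the form): if your design compared two systems with an independent-samples t-test, a prospective power analysis might look like the sketch below, using the statsmodels library. The effect size, alpha and target power are placeholder assumptions that you would replace with values appropriate to your own experiment.

```python
import math

# Prospective power analysis for a two-sample t-test design (illustrative assumptions).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Assumed medium effect size (Cohen's d = 0.5), alpha = 0.05, target power = 0.8.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Required evaluation items per system (rounded up): {math.ceil(n_per_group)}")
```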


The name of the method used, and a URL linking to a reference for the method.

3.1.3.1:  Please complete this question.

The numerical results of the statistical power calculation on the output sample obtained with the method in Question 3.1.3.1.

3.1.3.2:  Please complete this question.

A URL linking to any code used in the calculation in Question 3.1.3.2.

3.1.3.3:  Please complete this question.


A single integer representing the total number of evaluators whose assessments contribute to results in the experiment. Don’t count evaluators who performed some evaluations but who were subsequently excluded.

3.2.1:  Please complete this question.


Question 3.2.2.1:  Are the evaluators in this experiment domain experts?

Multiple-choice options (select one)

Please provide further details for your above selection(s)
3.2.2.1:  Please select at least 1 of the above options.

Question 3.2.2.2:  Did participants receive any form of payment?

Multiple-choice options (select one)

Please provide further details for your above selection(s)
3.2.2.2:  Please select at least 1 of the above options.

Question 3.2.2.3:  Were any of the participants previously known to the authors?

Multiple-choice options (select one)

Please provide further details for your above selection(s)
3.2.2.3:  Please select at least 1 of the above options.

Question 3.2.2.4:  Were any of the researchers running the experiment among the participants?

Multiple-choice options (select one)

Please provide further details for your above selection(s)
3.2.2.4:  Please select at least 1 of the above options.

Explain how your evaluators are recruited. Do you send emails to a given list? Do you post invitations on social media? Posters on university walls? Are there any gatekeepers involved?

3.2.3:  Please complete this question.

Describe any training evaluators were given to prepare them for the evaluation task, including any practice evaluations they did. This includes introductory explanations, e.g. on the start page of an online evaluation tool.

3.2.4:  Please complete this question.

Use this space to list any characteristics not covered in previous questions that the evaluators are known to have, e.g. because of information collected during the evaluation. This might include geographic location, educational level, or demographic information such as gender, age, etc. Where characteristics differ among evaluators (e.g. gender, age, location etc.), also give numbers for each subgroup.

3.2.5:  Please complete this question.


Question 3.3.1:  Has the experimental design been preregistered?

Notes: If the answer is yes, also give a link to the registration page for the experiment.

Multiple-choice options (select one)

Please provide further details for your above selection(s)
3.3.1:  Please select at least 1 of the above options.

Describe the platform or other medium used to collect responses, e.g. paper forms, Google forms, SurveyMonkey, Mechanical Turk, CrowdFlower, audio/video recording, etc.

3.3.2:  Please complete this question.

Notes: Question 3.3.3.1 records information about the type(s) of quality assurance employed, and Question 3.3.3.2 records the details of the corresponding quality assurance methods.


Question 3.3.3.1:  What types of quality assurance methods are used to ensure that evaluators are sufficiently qualified and/or their responses are of sufficient quality?

If any quality assurance methods other than those listed were used, select ‘other’, and describe why below. If no methods were used, select none of the above.

Check-box options (select all that apply)

Please provide further details for your above selection(s)
3.3.3.1:  Please select at least 1 of the above options.

Give details of the methods used for each of the quality assurance types selected in the last question. E.g. if quality checks were used, give details of the check. If no quality assurance methods were used, enter ‘N/A’.

3.3.3.2:  Please complete this question.


Enter a URL linking to a screenshot or copy of the form if possible. If there are many files, please create a signpost page (e.g. on GitHub) that contains links to all applicable files. If there is a separate introductory interface/page, include it under Question 3.2.4.


Describe the types of information (the evaluation item, a rating instrument, instructions, definitions, etc.) evaluators can see while carrying out each assessment. In particular, explain any variation that cannot be seen from the information linked to in Question 3.3.4.1.

3.3.4.2:  Please complete this question.

Question 3.3.5:  How free are evaluators regarding when and how quickly to carry out evaluations?

Check-box options (select all that apply)

Please provide further details for your above selection(s)
3.3.5:  Please select at least 1 of the above options.

Question 3.3.6:  Are evaluators told they can ask questions about the evaluation and/or provide feedback?

Check-box options (select all that apply)

Please provide further details for your above selection(s)
3.3.6:  Please select at least 1 of the above options.

Question 3.3.7:  What are the conditions in which evaluators carry out the evaluations?

Multiple-choice options (select one)

Please provide further details for your above selection(s)
3.3.7:  Please select at least 1 of the above options.

For those conditions that are not controlled to be the same, describe the variation that can occur. For conditions that are controlled to be the same, enter ‘N/A’.

3.3.8:  Please complete this question.

Notes: Questions in this section record information about each quality criterion (Fluency, Grammaticality, etc.) assessed in the human evaluation experiment that this sheet is being completed for.

If multiple quality criteria are evaluated, the form creates a subsection for each criterion, headed by the criterion name. These are implemented as overlaid windows with tabs for navigating between them.


In this section you can create a named subsection for each criterion that is being evaluated; the form questions are then duplicated for each criterion. To create a criterion, type its name in the field and press the New button; it will then appear on a tab that lets you switch the active criterion. To delete the current criterion, press the Delete current button.



Notes: Questions 4.1.1–4.1.3 capture aspects of quality assessed by a given quality criterion in terms of three orthogonal properties: (i) what type of quality is being assessed; (ii) what aspect of the system output is being assessed; and (iii) whether system outputs are assessed in their own right or with reference to some system-internal or system-external frame of reference. For full explanations see Belz et al. (2020).


Question 4.1.1:  What type of quality is assessed by the quality criterion?

Multiple-choice options (select one)

Please provide further details for your above selection(s)

Question 4.1.2:  Which aspect of system outputs is assessed by the quality criterion?

Multiple-choice options (select one)

Please provide further details for your above selection(s)

Question 4.1.3:  Is each output assessed for quality in its own right, or with reference to a system-internal or external frame of reference?

Multiple-choice options (select one)

Please provide further details for your above selection(s)

Notes: Questions 4.2.1–4.2.3 record properties that are orthogonal to quality criterion properties (preceding section), i.e. any given quality criterion can in principle be combined with any of the modes (although some combinations are much more common than others).


Question 4.2.1:  Does an individual assessment involve an objective or a subjective judgment?

Multiple-choice options (select one)

Please provide further details for your above selection(s)

Question 4.2.2:  Are outputs assessed in absolute or relative terms?

Multiple-choice options (select one)

Please provide further details for your above selection(s)

Question 4.2.3:  Is the evaluation intrinsic or extrinsic?

Multiple-choice options (select one)

Please provide further details for your above selection(s)

Notes: The questions in this section concern response elicitation, by which we mean how the ratings or other measurements that represent assessments for the quality criterion in question are obtained. This includes what is presented to evaluators, how they select a response, and via what type of tool, etc.



The name you use to refer to the quality criterion in explanations and/or interfaces created for evaluators. Examples of quality criterion names include Fluency, Clarity, Meaning Preservation. If no name is used, state ‘no name given’.


Map the quality criterion name used in the evaluation experiment to its equivalent in a standardised set of quality criterion names and definitions such as QCET (Belz et al. 2024, Belz et al. 2025), and enter the standardised name and reference to the paper here. In performing this mapping, the information given in Questions 4.3.7 (question/prompt), 3.3.4.1–3.3.4.2 (interface/information shown to evaluators), 4.3.2 (QC definition), 3.2.4 (training/practice), and 4.3.1.1 (verbatim QC name) should be taken into account, in this order of precedence.


Copy and paste the verbatim definition you give to evaluators to explain the quality criterion they’re assessing. If you don’t explicitly call it a definition, enter the nearest thing to a definition you give them. If you don’t give any definition, state ‘no definition given’.


An integer representing the number of different possible response values obtained with the scale or rating instrument. Enter ‘continuous’ if the number of response values is not finite. Enter ‘N/A’ if there is no scale or rating instrument. E.g. for a 5-point rating scale, enter ‘5’; for a slider that can return 100 different values (even if it looks continuous), enter ‘100’. If no rating instrument is used (e.g. when evaluation gathers post-edits or qualitative feedback only), enter ‘N/A’.


List, or give the range of, the possible response values returned by the rating instrument. The list or range should be of the size specified in Question 4.3.3. If there are too many to list, use a range. E.g. for two-way forced-choice preference judgments collected via a slider, the list entered might be ‘[-50,+50]’. If no rating instrument is used, enter ‘N/A’.


Question 4.3.5:  How is the scale or other rating instrument presented to evaluators?

Multiple-choice options (select one)

Please provide further details for your above selection(s)

If (and only if) there is no rating instrument, i.e. you entered ‘N/A’ for Questions 4.3.3–4.3.5, use this space to describe the task evaluators perform, and what information is recorded. Tasks that don’t use rating instruments include ranking multiple outputs, finding information, playing a game, etc. If there is a rating instrument, enter ‘N/A’.


Copy and paste the verbatim text that evaluators see during each assessment that is intended to convey the evaluation task to them. E.g. ‘Which of these texts do you prefer?’ or ‘Make any corrections to this text that you think are necessary in order to improve it to the point where you would be happy to provide it to a client.’


Question 4.3.8:  What form of response elicitation is used in collecting assessments from evaluators?

The terms and explanations in this section have been adapted from Howcroft et al. (2020).

Multiple-choice options (select one)

Please provide further details for your above selection(s)

Normally a set of separate assessments is collected from evaluators and then converted to the results as reported. Describe here the method(s) used in the conversion(s). E.g. macro-averages or micro-averages are computed from numerical scores to provide summarising, per-system results. If no such method was used, enter ‘results were not processed or aggregated before being reported’.
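
For illustration only (the form does not prescribe any particular method), the sketch below shows the difference between a macro-average and a micro-average over per-evaluator ratings for one system; the data layout and values are hypothetical assumptions.

```python
from statistics import mean

# Hypothetical ratings for one system, keyed by evaluator.
ratings_by_evaluator = {
    "evaluator_1": [5, 4, 4],
    "evaluator_2": [3, 3],          # this evaluator rated fewer items
    "evaluator_3": [4, 5, 4, 4],
}

# Macro-average: average each evaluator's mean, so every evaluator counts equally.
macro = mean(mean(scores) for scores in ratings_by_evaluator.values())

# Micro-average: average over all individual ratings, so every rating counts equally.
all_scores = [s for scores in ratings_by_evaluator.values() for s in scores]
micro = mean(all_scores)

print(f"macro-average: {macro:.2f}, micro-average: {micro:.2f}")
```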


The list of methods used for calculating the effect size and significance of any results for this quality criterion, as reported in the paper given in Question 1.1. If neither was calculated, enter ‘None’.



The method(s) used for measuring inter-annotator agreement. If inter-annotator agreement was not measured, enter ‘InterAA not assessed’.


The inter-annotator agreement score(s) obtained with the method(s) in Question 4.3.11.1. Enter ‘InterAA not assessed’ if applicable.
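
As an illustrative example only (Cohen’s kappa is just one common choice for two annotators, not a requirement of the form), the sketch below computes agreement between two annotators’ categorical judgements using scikit-learn; the label data is hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical judgements from two annotators on the same ten items.
annotator_1 = ["good", "good", "bad", "good", "bad", "bad", "good", "good", "bad", "good"]
annotator_2 = ["good", "bad", "bad", "good", "bad", "good", "good", "good", "bad", "good"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```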



The method(s) used for measuring intra-annotator agreement. If intra-annotator agreement was not measured, enter ‘IntraAA not assessed’.


The intra-annotator agreement score(s) obtained with the method(s) in Question 4.3.12.1. Enter ‘IntraAA not assessed’ if applicable.



Normally, research organisations, universities and other higher-education institutions require some form of ethical approval before experiments involving human participants, however innocuous, are permitted to proceed. Please provide here the name of the body that approved the experiment, or state ‘No ethical approval obtained’ if applicable.

5.1:  Please complete this question.

Question 5.2:  Does personal data (as defined in GDPR Art. 4, §1: https://gdpr.eu/article-4-definitions) occur in any of the system outputs (or human-authored stand-ins) evaluated, or responses collected, in the experiment this sheet is being completed for?

Multiple-choice options (select one)

Please provide further details for your above selection(s)
5.2:  Please select at least 1 of the above options.

Question 5.3:  Does special category information (as defined in GDPR Art. 9, §1: https://gdpr.eu/article-9-processing-special-categories-of-personal-data-prohibited) occur in any of the evaluation items evaluated, or responses collected, in the evaluation experiment this sheet is being completed for?

Multiple-choice options (select one)

Please provide further details for your above selection(s)
5.3:  Please select at least 1 of the above options.

If an ex ante or ex post impact assessment has been carried out, and the assessment plan and process, as well as the outcomes, were captured in written form, describe them here and link to the report. Otherwise enter ‘no impact assessment carried out’. Types of impact assessment include data protection impact assessments, e.g. under GDPR. Environmental and social impact assessment frameworks are also available.

5.4:  Please complete this question.

List of all errors
refresh list of all errors

Press the button to refresh the list of all errors.