Instructions
This is the Human Evaluation Datasheet (HEDS) form, which is designed to record full details of human evaluation experiments in Natural Language Processing (NLP). It addresses a long-standing problem in the field: such details often go unreported, and in extreme cases no details at all are reported. Reporting such details is crucial for gauging the reliability of results, determining comparability with other experiments, and assessing reproducibility (Belz et al., 2023a,b; Thomson et al., 2024; Thomson and Belz, 2024). Having a standard set of questions to answer (as provided by HEDS) means that you do not have to worry about what information to include or in what detail, and that the information is in a format directly comparable to information reported for other human evaluation experiments. To maximise standardisation, questions are in multiple-choice format where possible.
The HEDS form is divided into five main sections, containing questions that record information about resources, evaluated system(s), test set sampling, quality criteria assessed, and ethics, respectively. Within each of the main sections there can be multiple subsections which can be expanded or collapsed.
Each HEDS question comes with instructions and notes to help with answering it, except where the task is exceedingly simple (e.g. when a contact email address is asked for).
HEDS Section 4 needs to be completed for each quality criterion that is evaluated in the experiment. Instructions on how to do this are shown at the start of HEDS Section 4.
The form is not submitted to any server when it is completed; instead, it needs to be downloaded to a local file. To do this, use the "download json" button in the "Download to file" section. This will download a file (in .json format) that contains the current values from each form field. A tool is available in the GitHub repository for converting the file to LaTeX format (which we used to generate the next section). You can also upload a JSON file (see the "Upload from file" section on the left of the screen). Warning: this will delete your current form content, then populate the blank form with content from the file. It is advisable to download files as a backup while you are completing the form. The form saves the field values in your browser's local storage; they will be deleted if you clear the local storage, or if you are in a private/incognito window and then close it.
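Since regular backups are advised, the following is a minimal Python sketch of one way to keep timestamped copies of a downloaded file. It is not part of HEDS or its GitHub tooling: the function name backup_heds_file and the file and directory names are hypothetical, and the script makes no assumption about the internal structure of the JSON beyond checking that it parses.

```python
import json
import shutil
from datetime import datetime
from pathlib import Path


def backup_heds_file(src, backup_dir):
    """Copy a downloaded HEDS JSON file to backup_dir under a timestamped name.

    Hypothetical helper, not part of the HEDS tooling. Raises
    json.JSONDecodeError if the file is not well-formed JSON, so that a
    truncated or corrupted download is not silently kept as a backup.
    """
    src = Path(src)
    backup_dir = Path(backup_dir)

    # Parse only to confirm the file is well-formed JSON; nothing is
    # assumed about the names or layout of the form fields inside it.
    with src.open(encoding="utf-8") as f:
        json.load(f)

    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = backup_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)
    return dest
```

Calling backup_heds_file("heds.json", "heds_backups") after each download would leave one dated copy per call, any of which could later be restored through the "Upload from file" section.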
The form will not prevent you from downloading your save file, even when there are error or warning messages. Yellow warning messages indicate fields that have not been completed. If a field is not relevant for your experiment, enter N/A, and ideally also explain why. Red messages indicate errors; for example, a red message will be shown if the form expects an integer and you have entered something else. Errors will still not prevent you from saving the form.
You can generate a list of all current errors/warnings, along with their section numbers, in the "all form errors" tab at the bottom of the form. A count of errors will also be refreshed every 60 seconds on the panel on the left side of the screen.
We recognise that completing a form of this length and level of detail constitutes an overhead in terms of time and effort, especially the first time a HEDS form is completed when the learning curve is steepest. However, this overhead does go down substantially with each use of HEDS, and, we believe, is far outweighed by the benefits: increased scientific rigour, reliability and repeatability.
We envisage the main uses of HEDS to be as follows. Ideally, it should be completed before a human evaluation experiment is run, at the point when the design is final, as part of a formal preregistration process. Once the experiment has been run, the information in the sheet can be updated if necessary, e.g. if the final number of evaluators had to change due to unforeseen circumstances.
Another use is for the purpose of reporting the details of a completed experiment. For this, the completed HEDS sheet can be automatically converted to LaTeX, ready for inclusion in the supplementary material.
A third use is for carrying out reproducibility studies, as has been done extensively in the ReproGen and ReproNLP shared tasks (Belz et al., 2022; Belz & Thomson, 2024). Here, the HEDS sheets were used to ensure that the original work and the reproduction experiment had the same properties, and hence could be expected to produce similar results.
How to cite
The paper describing HEDS 3.0 is
Belz & Thomson 2024.