ICFP 2021 - Artifact Evaluation

Authors with a paper accepted to ICFP 2021 are invited to submit an artifact that supports the conclusions of the paper. The Artifact Evaluation Committee will read the paper, explore the artifact, and provide feedback on how easy it would be for future researchers to build on. The ultimate goal of artifact evaluation is to support future researchers in their ability to reproduce and build on today’s work.

If you have a paper accepted at ICFP 2021, please see the Call for Artifacts for instructions on submitting an artifact for evaluation.

Call for Artifacts

The Artifact Evaluation Committee (AEC) invites authors of accepted papers to submit an artifact that supports the conclusions of the paper. The committee will read the paper, explore the artifact, and provide feedback on how easy it would be for future researchers to build on. The ultimate goal of artifact evaluation is to support future researchers in their ability to reproduce and build on today’s work.

The submission of artifacts for review is voluntary and will not influence the final decision of whether the paper itself is accepted. Papers with successfully reviewed artifacts will receive a seal of approval printed on the first page of the paper in the ICFP proceedings. Authors of papers with successfully reviewed artifacts are encouraged to make the artifact publicly available upon publication of the proceedings, by including it as “source materials” in the ACM Digital Library.

If the artifact review is successful then it will be awarded the “Artifact evaluated: functional” badge. For information on what we consider to be “functional” please see the page on Expected Forms of Artifacts.

Types of Artifacts

An artifact that supports the paper’s conclusions can take many forms, including any or all of the following:

a working copy of the software and its dependencies, including benchmarks, examples and/or case studies
experimental data sets
a mechanized proof

Paper proofs are not accepted as artifacts for evaluation.

Selection Criteria

The AEC targets a 100% acceptance rate. The artifact will be evaluated in relation to the expectations set by the paper, and should be:

consistent with the claims of the paper and the results it presents,
as complete as possible, supporting all claims of the paper,
well documented,
future-proof, and
easy to reuse, facilitating further research.

The community benefits most when an artifact facilitates future research. For example, future researchers may build on an artifact by extending it to cover new situations or augmenting it with new components to solve a different class of problems. Other researchers may try an alternative approach to solving the same problem. This new work can benefit by comparing new results directly with the ones produced by the artifact, and by understanding the various tradeoffs and engineering decisions that were taken in the past.

We expect that most artifacts submitted for review at ICFP will have a few common forms: compilers, interpreters, proof scripts and so on. We have codified common forms of artifacts on a separate page. If you are considering submitting an artifact that does not have one of these forms, we encourage you to contact the Artifact Evaluation chairs before the submission deadline to discuss what is expected.

Submission Process

The evaluation process uses an (optional, lightweight) double-blind system. Of course, authors will not know the names of reviewers. Authors are also encouraged (but not required) to take any reasonably easy steps to anonymize their submissions, and reviewers will be discouraged from trying to learn the names of authors. Note that we do not intend to impose a lot of extra work here: anonymizing artifact submissions is an opportunity to help ensure the reviewing process is fair, but is in no way required, especially if the anonymization would require a lot of work or compromise the integrity of the artifact. There will be a mechanism for free, double-blind communication between reviewers and authors, so that small technical problems can be overcome during the reviewing process. Authors may iteratively improve their artifacts during the process to overcome small technical problems, but may not submit new material that would require substantially more reviewing effort.

We intend for most artifact submissions to include BOTH:

Software installed into a QEMU Debian base Virtual Machine (VM) image provided by the committee. The base VM image (500 MB) can be downloaded here, and includes instructions on how to use and build on it, as well as help debugging common problems.
A separate source tarball that includes just the source files.

In most cases, artifacts should include BOTH the extended VM image AND a separate source tarball. The intention is that reviewers who are familiar with certain tools (e.g. Agda or OCaml) can inspect the artifact sources directly, while reviewers who are less familiar can still execute the artifact without needing to install new software on their own machines besides QEMU. The VM image will be archived so that future researchers, say in 5 years' time, do not need to worry about version incompatibilities between old tool versions and new operating systems.

The detailed submission process is as follows:

Read the Forms of Artifacts page for details on artifact preparation.
Register your intent to submit an artifact on the separate artifact-only HotCRP site before the end of Thursday 13th May.
Prepare your artifact, building upon the base VM image.
Upload your artifact to Zenodo (recommended), or otherwise make it available via a stable URL (i.e. the URL should not change if you later make updates to the artifact; and ideally, the URL has a good chance of continuing to exist well into the future).
- See here for one recommendation on how to anonymize your submission to Zenodo.
Finalize your submission on HotCRP with a link to your artifact. You should also upload a preprint of your paper, and any additional materials the reviewers may find helpful (e.g. appendices).

For questions about the overall review process or specific reviews, please contact Gabriel Scherer (gabriel.scherer@gmail.com). For questions relating to technical aspects of HotCRP, uploading to Zenodo, etc., please contact Brent Yorgey (byorgey@gmail.com). If you are not sure, feel free to contact both!

Timeline

It takes time to produce a good artifact; thus we allot two weeks between conditional paper acceptance and artifact submission. These are the key dates (all dates are in the Anywhere on Earth (AOE / UTC-12) timezone):

Event	Date
ICFP Conditional Acceptance	Sat 8 May
Registration date	Thu 13 May
Artifact submission	Thu 20 May
Review and technical clarification	Wed 2 June - Wed 16 June
Preliminary reviews available	Wed 16 June
Further clarification if needed	Wed 16 June - Tue 22 June
Final decision sent to authors	Tue 22 June
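
Because the deadlines are given in Anywhere on Earth time, it can help to convert them to UTC or to your local zone. The following is a minimal sketch using only the Python standard library. The helper name `end_of_day_aoe` is illustrative, and the year 2021 and the "end of day" reading are taken from the registration step above.

```python
from datetime import datetime, timedelta, timezone

# "Anywhere on Earth" is a fixed offset of UTC-12 with no daylight saving.
AOE = timezone(timedelta(hours=-12), name="AoE")


def end_of_day_aoe(year: int, month: int, day: int) -> datetime:
    """Return the last second of the given calendar day in AoE."""
    return datetime(year, month, day, 23, 59, 59, tzinfo=AOE)


# Registration deadline from the timeline: end of Thursday 13 May (AoE).
registration = end_of_day_aoe(2021, 5, 13)
print(registration.astimezone(timezone.utc).isoformat())
# prints 2021-05-14T11:59:59+00:00
```

In other words, a deadline at the end of a day in AoE falls at just before noon UTC on the following day.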

More Information

For additional information, clarification, or answers to questions, please contact the ICFP Artifact Evaluation co-chairs:

Gabriel Scherer (gabriel.scherer@gmail.com)
Brent Yorgey (byorgey@gmail.com)

Forms of Artifacts

Most artifacts that are submitted for review at ICFP have one of a few common forms, and we have codified what we expect from each of these common forms. We also have advice to authors and reviewers about how to prepare and review them. This material should be taken as highly suggestive, but not prescriptive. If authors or reviewers have questions about what is expected, please contact the AEC co-chairs as early as possible. If you are an author and your artifact does not fit into one of the obvious categories then please contact the AEC co-chairs well before the submission deadline.

We also describe some common problems to avoid. This advice has been distilled from past experience at a variety of events, and does not describe specific papers, artifacts or authors.

Selection Criteria

As stated on the main page, the artifact should be:

consistent with the paper
as complete as possible
well documented
future-proof, and
easy to reuse, facilitating future research.

Consistent with the paper

The artifact should directly implement or support the technical content of the paper. If the paper describes a particular algorithm that does something in a certain way, then the artifact should also do it that same way. It is fine for the artifact to implement an extended version of what is in the paper, but the examples discussed in the paper should work with the artifact with minimal changes.

Complete as possible

All examples, benchmarking results and graphs described in the paper should be reproducible with the artifact. If the paper describes an interpreter that evaluates some example expressions, and the artifact includes an implementation of that interpreter, then all the examples should work with the interpreter. If the paper describes a program to compute numerical results and presents graphs of those results, then the artifact should be able to reproduce all of those graphs.

Common problem: A paper contains five graphs of benchmarking results, but the artifact only reproduces two of them. Reviewers are prone to reject such an artifact due to it being incomplete. In general all graphs should be reproducible by the artifact, else very clear reasons should be given why this is not possible. For example, if a benchmark run needs two weeks of compute time then reviewers are not expected to reproduce that during the review process, but if it only takes 30 minutes they will probably expect to.

Common problem: A paper describes a compiler or interpreter implementation, but its execution depends on commercially licensed tools (like MATLAB, or some commercial SMT solver). Reviewers are prone to reject such an artifact due to it being either incomplete or not easy to reuse. However, it is not unreasonable for authors to expect such tools to be available to the intended audience of the paper. One way to address this may be for the authors to provide a login environment for reviewers to use. A better way is to contact the AEC co-chairs before submission so that we can try to assign reviewers that already have the required tools.

Common problem: A paper describes the implementation of a particular algorithm and gives benchmark results, but the artifact includes only the results and not the implementation. This can happen if the implementation was built as proprietary software, or part of it cannot be released due to confidentiality issues. In this case please contact the AEC co-chairs to discuss whether the artifact is eligible for review. Perhaps some defined fragment of it can still be reviewed.

Well documented

The paper itself describes the technical content of the artifact, and all reviewers will read the paper. However, there must be sufficient documentation to be able to build, test and execute the implementation, as well as to debug minor problems. At an absolute minimum, for a compiler project there need to be clear instructions about how to build it, execute the test suite, and run it on the examples provided by the paper. Reviewers will also expect to have enough documentation to adapt those examples and try out some of their own.

Common problem: A paper describes a new language and the artifact has an interpreter for that language, but the command line interpreter program does not have basic “--help” style functionality for the reviewers to work out what the flags mean. If the paper shows some form of intermediate representation (IR) produced by the compiler, then this sets the expectation that reviewers could inspect the IR version for some of their own examples. The flags to do this need to be documented.

Future proof

Artifacts should be reusable in 5-10 year time frames. For this reason they should not unnecessarily depend on specific operating system versions, kernel drivers, processor architectures and so on. If the paper describes a DSL for configuring a particular sort of FPGA device, then it is fine to require that particular device – and perhaps provide the reviewers with a simulator. However, artifacts should not depend on quirks of particular operating systems, such as requiring specific system linker or driver versions when this is not intrinsically necessary.

Common problem: An artifact contains source code written in a particular language, but it only builds with an old compiler version. This old compiler version does not work on the latest version of the popular operating system used by the reviewer. Most of these problems will be mitigated by basing the artifact around the standard VM image. However, if an author asks a reviewer to downgrade to an older version of their installed tools then the artifact will probably fail the “future proof” criterion.

Easy to reuse

Future researchers will have a limited attention span for debugging problems with archived artifacts. If an artifact requires excessive configuration or hand-holding to execute, then it is unlikely a future researcher will put time into comparing their new work with the old artifact.

Common problem: An artifact needs to perform numerical computation for 8 hours before producing a result, but it is not possible to pause and resume the computation. A reviewer may not be able to leave such computations running on their personal machines for extended periods, as they may be travelling or need to do other work. It is particularly irksome to both reviewers and authors if the artifact crashes when run on new examples, or if the host machine itself is unstable. Such artifacts should have clear checkpoints that allow intermediate results to be saved and resumed. When the artifact runs it should be clear to the reviewers how to resume computation from a particular checkpoint – such as by printing resumption commands to the console at regular intervals. Such resumption commands should not be swamped by debugging output from the tool.
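
As one possible shape for such checkpointing, here is a minimal sketch in Python. The sum of squares stands in for the real computation, and the function names, the JSON state format and the wording of the printed messages are all assumptions made for illustration, not a required interface.

```python
import json
import os
import tempfile


def save_checkpoint(path, next_step, acc):
    """Write the state to a temp file, then rename it, so a crash in the
    middle of a write cannot leave a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_step": next_step, "acc": acc}, f)
    os.replace(tmp, path)


def run(total_steps, checkpoint_path, every=1000):
    """Sum the squares of 0..total_steps-1, resuming from checkpoint_path
    if it exists. The sum is a stand-in for a long computation."""
    step, acc = 0, 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        step, acc = state["next_step"], state["acc"]
        print(f"[resume] continuing from step {step}")
    while step < total_steps:
        acc += step * step
        step += 1
        if step % every == 0 or step == total_steps:
            save_checkpoint(checkpoint_path, step, acc)
            # One short line per checkpoint, easy to spot in the output.
            print(f"[checkpoint] {step}/{total_steps} saved; "
                  f"rerun with checkpoint file {checkpoint_path} to resume")
    return acc


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = os.path.join(tmp, "state.json")
        run(1500, ckpt)         # simulate a run that stops early
        print(run(3000, ckpt))  # second run picks up at step 1500
```

Keeping the checkpoint message to a single prefixed line makes it easy to find even when the tool prints other output.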

Common problem: An artifact includes system software that only works with particular operating system driver versions, such as custom Linux networking drivers. Although it may be possible to install new driver versions, reviewers are unlikely to want to do so for risk of destabilizing their own machines. Networking code can also be difficult to review if the ability to reproduce the results depends on particular network bandwidths or transmission latencies. In such cases it may be appropriate to supply a simulator, so that the overall algorithm can be tested without needing a particular physical network configuration.

Advice per Artifact

All artifacts

All artifacts must contain a top-level Readme.md file that gives the name of the paper and step-by-step instructions about how to execute the artifact.

In most cases the step-by-step instructions should be a list of commands to execute to build and test the artifact on the examples described in the paper, and to reproduce all the graphs and benchmarking results. The instructions should call out particular features of the results such as “this produces the graph in Fig 5 that shows our algorithm runs in linear time”. Try to keep the instructions clear enough that reviewers can work through them in under 30 minutes.

If the build process emits warning messages, perhaps when building libraries that are not under the author’s control, then include a note in the instructions that this is the case. Without a note the reviewers may assume something is wrong with the artifact itself.

Separately from the step-by-step instructions, provide other details about what a reviewer should look at. For example, “our artifact extends existing system X and our extension is the code located in file Y”.

Try to avoid requiring graphical environments (X Windows) to be installed into the VM unless truly necessary. Graphical environments in VMs are sometimes slow and unstable. If possible, keep graphics rendering such as web browsing on the host.

Consider providing a top-level Makefile so that the commands to be executed are just make targets that automatically build their prerequisites.

Command-line tools

Unix command-line tools should have standard --help style command-line help pages. It is not acceptable for an executable to throw uninformative exceptions when executed with no flags, or with the wrong flags.
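
For tools written in Python, the standard `argparse` module provides this behaviour with little effort. The sketch below is only an illustration: the tool name `mytool`, the `--dump-ir` flag and the whitespace "tokenizer" are hypothetical stand-ins, not part of any real artifact.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="mytool",  # hypothetical tool name
        description="Hypothetical interpreter front end (illustration only).",
    )
    parser.add_argument("source", help="path to the program to run")
    parser.add_argument(
        "--dump-ir",  # hypothetical flag
        action="store_true",
        help="print the intermediate representation instead of running",
    )
    return parser


def main(argv) -> int:
    parser = build_parser()
    if not argv:
        # No arguments: show the help text rather than a traceback.
        parser.print_help()
        return 2
    # Unknown or malformed flags make argparse print a usage message
    # and exit with status 2.
    args = parser.parse_args(argv)
    try:
        with open(args.source) as f:
            text = f.read()
    except OSError as err:
        print(f"mytool: cannot read {args.source}: {err.strerror}",
              file=sys.stderr)
        return 1
    tokens = text.split()  # stand-in for a real front end
    if args.dump_ir:
        print(tokens)
    else:
        print(f"read {len(tokens)} tokens from {args.source}")
    return 0

# A real entry point would end with: sys.exit(main(sys.argv[1:]))
```

With this structure, `mytool --help`, `mytool` with no arguments, and `mytool --unknown-flag` all produce a readable message instead of an uninformative exception.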

Compilers and interpreters

It should be obvious how to run the tool on new examples that the reviewers write themselves. Do not just hard-code the examples described in the paper.

If your tool consumes expressions in a custom DSL then we recommend supplying a grammar for the concrete syntax, so that reviewers can try the tool on new examples. Papers that describe such languages often give just an abstract syntax, and it is often not clear what the full concrete syntax is from the paper alone.

Proof scripts

In most cases, the artifact VM should contain an installation of the proof checker and specify a single command (preferably “make”) to re-check the proof. It is fine to leave the VM itself command-line only, and require reviewers to browse the proof script locally on their own machines. It should not be necessary to have CoqIDE or Emacs/ProofGeneral installed into the VM, unless the paper is particularly about IDE functionality.

Include comments in the proof scripts that highlight the main theorems described in the paper. Use comments like “This is Theorem 1.2: Soundness described on page 6 of the paper”. Proof scripts written in “apply style” are typically unreadable without loading them into an IDE, but reviewers will still want to find the main lemmas and understand how they relate.

Reviewers almost always complain about lack of comments in proof scripts. To authors, the logical statements of the lemmas themselves are likely quite readable, but reviewers typically want English prose that repeats the same information.

Before submission, scan through the script and erase TODO and FIXME style comments. Reviewers will expect proved statements to be true, so there should not be TODOs in submitted proofs.
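
A small script can make this scan less error-prone. The following is a minimal sketch, not a required tool: the function name `find_markers` and the default list of file suffixes (Coq, Agda, Lean and Isabelle sources) are assumptions to adjust for your own proof assistant.

```python
import re
from pathlib import Path

# Match TODO or FIXME as whole words anywhere in a line.
MARKER = re.compile(r"\b(TODO|FIXME)\b")


def find_markers(root, suffixes=(".v", ".agda", ".lean", ".thy")):
    """Yield (path, line_number, line) for each TODO/FIXME found in
    files under root whose suffix is in suffixes."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if MARKER.search(line):
                    yield path, number, line.strip()


if __name__ == "__main__":
    # Scan the current directory and report each leftover marker.
    for path, number, line in find_markers("."):
        print(f"{path}:{number}: {line}")
```

Running it from the top of the source tree lists each remaining marker with its file and line number, which can then be resolved or removed before submission.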

Programming environments presented via web interfaces

Try to get the server running locally inside the VM, and allow the reviewer to connect to it via a web browser running natively on their host machine. Graphical environments installed into VMs are sometimes laggy and unstable, and standard web protocols are stable enough that such artifacts should be usable with new browsers.

Programs that generate images

If the artifact produces a .bmp or .png file then expect the reviewer to use “scp” or some such to copy it out to the host machine and view it. Authors should test that the connection to the VM works, so that this is possible.

Long running artifacts

If the artifact needs to run for more than 10 minutes then this must be highlighted in the instructions, and there should be a way to stop and resume the computation.

Artifacts that run on GPUs

If the artifact needs standard GPU hardware then the authors must specify this very clearly when the artifact is submitted. It should not be a problem to find reviewers with standard GPU hardware, but this needs to be called out so that the AEC co-chairs can assign reviewers that do have it.

Artifacts that need many resources

If the artifact needs CPU, disk or memory resources that are larger than are found on a typical laptop then please contact the AEC co-chairs before submission. At the time of writing, if the artifact runs with < 8GB RAM and < 16GB disk space then this should not be a problem.
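
An artifact can also check its own requirements at start-up and warn the reviewer early. The sketch below is one way to do that with the Python standard library. The function names are illustrative, the 8 and 16 figures in the demo simply echo the guideline above, and the RAM query relies on `os.sysconf`, which is not available on every platform (the code falls back to `None` in that case).

```python
import os
import shutil

GIB = 1024 ** 3


def available_resources(path="."):
    """Return (total RAM in GiB or None, free disk in GiB at path)."""
    try:
        ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    except (AttributeError, ValueError, OSError):
        ram = None  # the query is not supported on this platform
    disk = shutil.disk_usage(path).free / GIB
    return ram, disk


def check(ram_needed_gib, disk_needed_gib, path="."):
    """Return a list of human-readable warnings (empty if all is fine)."""
    ram, disk = available_resources(path)
    warnings = []
    if ram is not None and ram < ram_needed_gib:
        warnings.append(
            f"needs {ram_needed_gib} GiB RAM, machine has {ram:.1f} GiB")
    if disk < disk_needed_gib:
        warnings.append(
            f"needs {disk_needed_gib} GiB free disk, found {disk:.1f} GiB")
    return warnings


if __name__ == "__main__":
    for message in check(ram_needed_gib=8, disk_needed_gib=16):
        print("warning:", message)
```

Printing such warnings before a long build or benchmark starts saves reviewers from discovering the shortfall through an out-of-memory crash.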

Advice for Reviewers

Expect to spend about a day reviewing each artifact. Budget about 4 hours for reading the paper, and 4 hours for experimenting with the artifact itself. You should be able to get the basic artifact functionality to work in about 1 hour, spend 2 hours inspecting the implementation, and 1 hour writing up your report.

If you find yourself debugging problems for more than 1 hour then set the artifact to a preliminary ‘Reject’ and discuss what to do about it with the other committee members.

If an artifact runs for more than 10 minutes and crashes or fails 3 times in a row, then set the review to a preliminary ‘Reject’ and discuss what to do about it with the other committee members. It may be that other reviewers have had more success in slightly different environments.

