Autograding at Scale: One Fix for Flaky Tests

1 Answer

Abhishek Pareek

Founder & Director at Coders.dev

Answered 4 months ago

We eliminated an entire class of flaky tests by requiring an explicit I/O readiness probe in our GitHub Actions workflow. Before the autograder runs, a short shell script must create, write to, and delete a temporary file in the container's working directory within a 50 ms timeout. This decouples container filesystem latency from student code and prevents tests from starting before the environment is actually ready.

To tell a flake from a student bug, we used a "three-strike" rule in fresh containers. If a test failed, the Action automatically re-ran that single test in a fresh, clean container; if it failed again, it ran a third and final time, again in a fresh container. A test that fails identically three times in a row is a student bug. A test that passes on the second or third attempt is flagged as a flake for our team to review.
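For readers who want something concrete, here is a minimal Python sketch of both ideas. It is not the script from the answer, which was a shell script that is not shown. The names `io_ready` and `classify` are hypothetical, and `run_test` is assumed to be a callable that returns `None` on a pass or a failure message on a fail. In a real pipeline each call would start a fresh container. The answer does not say how to treat three failures with different messages, so this sketch assumes they count as a flake to review.

```python
import os
import tempfile
import time
from typing import Callable, Optional


def io_ready(workdir: str = ".", timeout_s: float = 0.050) -> bool:
    """Create, write, and delete a temp file in workdir.

    Returns True only if all three steps succeed within timeout_s.
    The time is checked after the fact, so a hung filesystem still
    blocks; a hard limit would need an external timeout.
    """
    start = time.monotonic()
    path = None
    try:
        fd, path = tempfile.mkstemp(prefix=".probe-", dir=workdir)
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"ready\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the write to the filesystem
        os.unlink(path)
        path = None
    except OSError:
        return False
    finally:
        # Clean up if the probe failed after the file was created.
        if path is not None and os.path.exists(path):
            os.unlink(path)
    return time.monotonic() - start <= timeout_s


def classify(run_test: Callable[[], Optional[str]], max_runs: int = 3) -> str:
    """Apply the three-strike rule to a single test.

    run_test returns None on pass, or a failure message on fail.
    """
    first = run_test()
    if first is None:
        return "pass"
    for _ in range(max_runs - 1):
        outcome = run_test()
        # A later pass, or a different failure (assumption), means flake.
        if outcome != first:
            return "flake"
    return "student_bug"  # failed identically on every run
```

A workflow step could call `io_ready()` before starting the autograder and exit non-zero when it returns `False`, so the job fails early instead of producing a misleading test result.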

What is one tactic that eliminated flaky tests in your containerized autograder pipeline for data structures assignments using GitHub Classroom and Actions? What simple diagnostic or threshold helped you decide when a test was truly flaky versus a student bug?
