Gradle projects can be HUGE. Particularly in the Android world, it is quite easy to find a monorepo with several hundred modules, a few million lines of code, tens of thousands of unit/integration tests and all sorts of customisations at both build and automation level, as a result of the project, team and company status quo.
If you work in a project like this, here is a tricky question for you: is your team enforcing timeouts properly in Gradle builds?
The importance of timeouts in builds
In my experience, timeouts on build tasks are usually neglected, both for builds running on developers' machines and for builds running on continuous integration systems. In addition, we need to acknowledge that in huge projects it is quite easy for things to go wrong and for builds to get stuck for all sorts of mysterious reasons.
For instance, in a previous big Android project I worked on - which pretty much fulfilled all the aforementioned conditions - at a given moment of the tooling the Android Lint task kept freezing due to some live-lock / race condition which was very hard to track, since the logs offered no meaningful clue as to whether the issue was at Android Gradle Plugin level or at Gradle build system level.
As a result, we saw some Jenkins jobs running for quite a few days without achieving or reporting anything, consuming CI resources and delaying all other Android jobs - pretty much because no timeouts were in place. The same situation is quite common when running integration tests for backend services/systems, especially ones which rely on all sorts of connectivity.
When talking about builds, it is common sense that all code execution should end somehow - code that runs forever is not the use case here - but as a rule of thumb not all build-related code is designed (and built) with completion guarantees in mind. This means that sometimes we need to enforce such completion guarantees - or timeouts - ourselves.
I see a series of benefits in enforced timeouts when designing and building the Developer Experience around builds, namely:
- Timeouts break what I call the developer's passive mode when building locally, exposing that something is wrong at local machine level
- Timeouts offer observable evidence - usually at CI level - that something in the build's status quo has changed, which makes it easier to prioritize initiatives around the issue
- Timeouts contribute to cost control when running CI builds. In a world where a CI system like GitHub Actions bills per minute, we want to make sure that things fail fast whenever possible. For instance, in GitHub Actions the default timeout for a job is 6 hours, which is far more than any sane Android pipeline needs to complete.
A hierarchy for enforced timeouts
Taking Gradle as our build system and GitHub Actions as our CI system, for instance, we can enforce timeouts at three levels.
At build task level: we want to ensure that an individual task, for instance
→ ./gradlew app:bundleRelease --no-build-cache --no-daemon
won't take more than half an hour to complete
At build step level: we want to ensure that a particular set of tasks - usually sharing the same semantic relationship in the build pipeline - for instance
→ ./gradlew ktlintCheck detektCheck lintDebug
won't take more than 20 minutes to complete
At build pipeline level: we want to ensure - mostly at CI level - that a set of steps like
- Setup the environment
- Checkout the source code
- Run static analysis
- Run unit tests for the debug build variant
- Run Espresso tests for the debug build variant
- Build the Debug APK
- Build the QA APK
- Store artifacts
- Send notifications
won’t take more than two hours to complete.
Since timeouts are enforced from outside the task/step/pipeline execution context, the hierarchy shown above follows naturally, with the outer build context taking higher precedence.
For instance, consider that for a given build system and CI system the configuration is such that:
- The timeout for each individual build task is 1 minute
- The timeout for a step (or set of tasks at build system level) is 2 minutes
- The timeout for a given pipeline run is 5 minutes
Thus, it is easy to see that:
- A set of three tasks in the same step can potentially time out in 2 minutes, although the individual timeout enforced per task is 1 minute
- A set of three steps in the same pipeline can potentially time out in 5 minutes, although the individual timeout enforced per step is 2 minutes
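The arithmetic behind these two statements can be written as a small worked bound. This is only a sketch: it assumes tasks and steps run one after another, and the symbols are introduced here purely for illustration.

```latex
% Worst-case duration before something is killed, in minutes.
% T_task = 1, T_step = 2, T_pipeline = 5 (the values from the example above).
\[
  T_{\text{step}}^{\text{eff}}
    = \min\Bigl(T_{\text{step}},\ \sum_{i=1}^{3} T_{\text{task}}\Bigr)
    = \min(2,\ 3 \times 1) = 2
\]
\[
  T_{\text{pipeline}}^{\text{eff}}
    = \min\Bigl(T_{\text{pipeline}},\ \sum_{j=1}^{3} T_{\text{step}}\Bigr)
    = \min(5,\ 3 \times 2) = 5
\]
```

In both cases the outer limit is smaller than the sum of the inner limits, so the outer context is the one that fires first.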
Enforcing timeouts at Gradle level
Since version 5.0, Gradle provides a simple mechanism to define a timeout for an individual task: every task exposes a timeout property that can be set in the build script.
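As a minimal sketch of that property in a Groovy build.gradle file: the task name slowReport and the 30-minute value are made up for illustration, while timeout itself is the property available on Gradle tasks.

```groovy
import java.time.Duration

// Hypothetical task, registered only to illustrate the timeout property.
tasks.register('slowReport') {
    // If the task runs longer than this, Gradle interrupts it and marks it as failed.
    timeout = Duration.ofMinutes(30)

    doLast {
        println 'Generating report...'
    }
}
```

When the limit is reached, the thread executing the task is interrupted and the build fails, unless it was started with --continue, in which case the remaining independent tasks still run.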
However, there is no comprehensive way to define a timeout for all tasks available in a given Gradle project - especially those you don't own and/or those that are added dynamically - nor a way to enforce a timeout for the whole set of tasks running in a given Gradle build command.
Fortunately, tackling these use cases can be quite easy. My take here is a small Gradle plugin you might be interested in: gradle-timeouts-enforcer is a very simple approach to this problem, and it allows you to enforce timeouts with a straightforward setup:
(1) At your root-project/build.gradle file add
(2) At your root-project/gradle.properties file
This plugin pretty much sets the timeout property for each task evaluated in the build task graph, without making any distinction between fast and slow Gradle tasks. In addition, in order to enforce a timeout at task-set level, a global timer is installed at Gradle's invocation time and the timeout-reached condition is checked after each task initialization and/or resolution.
Together with the timeout definition at CI level, this plugin might help you make sure that no Gradle build runs outside a sane time box, given the project's size and the team's and company's resources for Continuous Integration.
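The idea can be approximated directly in a root build.gradle file. The sketch below is not the plugin's source code: the sketch.* property names and the default values are hypothetical, and the global check runs before each task starts, which only approximates the behaviour described above. It has also not been checked against Gradle's configuration cache.

```groovy
import java.time.Duration

// Hypothetical gradle.properties keys; NOT the plugin's actual configuration.
def perTaskMinutes = Long.parseLong((findProperty('sketch.timeoutPerTaskMinutes') ?: '10').toString())
def perBuildMinutes = Long.parseLong((findProperty('sketch.timeoutPerBuildMinutes') ?: '30').toString())

// "Global timer": remember roughly when this build invocation started.
def buildStartedAt = System.nanoTime()
def buildBudget = Duration.ofMinutes(perBuildMinutes)

allprojects {
    tasks.configureEach { task ->
        // Same limit for every task, fast or slow, owned or not.
        task.timeout.set(Duration.ofMinutes(perTaskMinutes))

        // Before the task runs, fail the build if the overall budget is already spent.
        task.doFirst {
            def elapsed = Duration.ofNanos(System.nanoTime() - buildStartedAt)
            if (elapsed > buildBudget) {
                throw new GradleException(
                    "Build exceeded its ${perBuildMinutes} minute budget before running ${task.path}")
            }
        }
    }
}
```

Note that in this sketch only the per-task timeout can interrupt a task that is already hanging; the build-level check fires when the next task is about to start.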
Timeouts in builds are an often neglected compliance aspect that we would like to ensure as part of the CI/CD journey, and as with any other compliance requirement, "fail fast" is the way to go here; frequent violations of the enforced rules should be observed via the proper tooling in order to generate action items at build infrastructure level.
Enforced timeouts for builds naturally follow a hierarchy according to the enforcement level in build pipelines, with the timeout definition at CI configuration level overriding any other definition at local build level in order to both (a) leverage observability at the CI system level and (b) make sure that CI system resources are used fairly, contributing to Cloud cost control.
Last but not least, while the Gradle build system does not provide a super fancy mechanism to enforce timeouts on tasks/builds, it is quite easy to tackle that - with all sorts of customisations - with a custom Gradle plugin.
Therefore, there are no more excuses for letting the jobs triggered by your big Android project run forever on CI: define sane timeout values at all build levels, enforce them, then learn and take action when violations arise.