Enabling fault-tolerant HTTP networking for native Android apps

Why do Mobile apps need a fault-tolerant HTTP networking layer?#

HTTP resilience is a notable topic when creating and maintaining backend services. Considering JVM-powered projects, several libraries like Resilience4j and Failsafe implement the most embraced resilience patterns for service-to-service communication over the HTTP protocol.

In the context of Mobile applications, networking over HTTP is a critical way to drive the communication between servers and clients. Although we have great solutions like OkHttp in the landscape of HTTP client libraries, understanding the different types of networking issues, how they might affect customers on the go, and how we can build reliable user experiences over unreliable networks remains an open problem.

After over a decade of working with Mobile apps, I see the implementation of a fault-tolerant HTTP data layer as an underrated problem. Even for products that target high-end users living in the most developed countries of the world, connectivity failures happening over unreliable or unstable networks are always a scenario to consider, whether when our customers need to use our product when attending a music festival or a conference or maybe when commuting in the subway. These use cases are even more critical in non-developed countries.

Therefore, in this article I want to demonstrate how to design and test fault-tolerant HTTP abstractions for Android apps, simulating common issues in the last mile of Mobile networks and ensuring the desired behavior with integration tests. All the code snippets presented here also exist in this Github repository, which implements some techniques for Android projects I’ll write about in upcoming articles, so stay tuned!

Understanding HTTP resilience patterns in Mobile applications#

I could write a whole article about HTTP resilience patterns and how Mobile and Front-end applications might use them well. To keep the focus of this article, let’s do a quick review of the most common patterns and discuss how they fit (or not) in our context.

Let’s start with timeouts for HTTP requests, which are essential in most front-end applications. Few things stress our users more than an action that never completes or feedback that never arrives, and implementing this pattern is generally simple in most cases (Android, iOS, and Web). Usually, there is no need to implement different versions of this pattern in the same application (e.g., having a more flexible timeout for a specific HTTP request). Hence, having this implemented is a no-brainer.

The following pattern in our list is retrying requests, a challenge to implement correctly even when designing resilient service-to-service communication. The inner challenge of this pattern comes from the fact that, by retrying a any HTTP request that just failed, we may contribute to decreasing the availability of a potentially struggling server 🔥. In the context of the backend services, circuit breakers are a widely used alternative to avoid the exhaustion of upstream servers.

Although circuit breakers do not fit well for Mobile or Front-end applications, retrying is something to consider. In general, we should ponder retrying when we face connectivity errors, especially when we know that the requested endpoint provides an idempotent API. We are also qualified to recover from network failures when performing GETs from endpoints supported directly by a cache, for example.

There are several other use cases to consider. Still, a critical aspect to note when implementing a solid retry policy is the time-scheduling strategy we’ll have in place since waiting some seconds before firing another request might be the only way to avoid DDoSing our systems. While a fixed-time interval (like two seconds between attempts) is doable when retrying due to connectivity issues, it might be too dangerous to adopt in all scenarios, making an exponential backoff a preferred option when sticking with a single retry policy for HTTP requests.

Conceptually, other well-known HTTP resilience patterns like bulkheading or rate-limiting requests are eligible for Mobile applications but have yet to be widely adopted. Consider them in particular situations, like in systems where you have to deal with minimal connectivity scenarios, offline-first products, or use cases where we want to aggressively minimize the amount of data we exchange over the network from the Mobile client perspective. In addition, be aware that a solid implementation suitable for Mobile may be available on something other than existing HTTP resilience libraries since projects like Resilience4j or Failsafe focus on the backend implementation for these patterns.

Designing a fault-tolerant HTTP layer with OkHttp, Retrofit, and Resilience4j#

In most Android projects, the starting point of a solid HTTP layer is the foundational JVM libraries OkHttp and Retrofit. Both have been around for a few years now, are battle-tested, and actively maintained and improved by the fine folks of Square. We stick with the industry standard in this article and its companion project.

Before getting started with HTTP resilience, it is critical to implement an effective mechanism to differentiate errors the HTTP stack raises. There are several ways to do that, but in the end, we want to guarantee proper disambiguation between HTTP errors based on status codes and issues related to networking. For the sake of this article, we achieve that using simple transformation functions that we will wire on top of the HTTP stack later.

object ConnectivityErrorTransformer : NetworkingErrorTransformer {

    override fun transform(incoming: Throwable) =
        when {
            (!isNetworkingError(incoming)) -> incoming
            isConnectionTimeout(incoming) -> OperationTimeout
            cannotReachHost(incoming) -> HostUnreachable
            else -> ConnectionSpike
        }

    private fun isNetworkingError(error: Throwable) = {...}

    private fun isRequestInterrrupted(error: Throwable) = {...}

    private fun isRequestCanceled(error: Throwable) = {...}

    private fun cannotReachHost(error: Throwable) = {...}

    private fun isConnectionTimeout(error: Throwable) = {...}
}

The companion project for this article implements two resilience patterns: timeouts and retries. The first doesn’t need particular libraries since we can use a built-in functionality from the OkHttp client: we can access different timeout configurations at OkHttpClient.Builder and a simple way to enforce time limits for all requests is to stick with the callTimeout method.

Given how we model the underlying API proxied by Retrofit, we chose the Resilience4j library to implement the Retry pattern. We must use Resilience4j v1.7.x since newer versions require JVM bytecode compatible with JDK17, which is not officially supported by the Android tooling right now. Although Failsafe is also an excellent option for JVM-based projects, Resilience4j has a dedicated Kotlin module where we’ll find some handy utilities geared to Kotlin Coroutines and function composability, which just fit in our demo project.

Configuring a retry policy with Resilience4j is simple; examples are available in the official documentation as expected. As previously discussed, the points of attention when implementing retries are:

the conditions that trigger the retry function
the time rule used to schedule a retry attempt

Hence, we design an HttpResilience data structure to encode our expectations:

// Note : Duration requires minSDK 26+ on Android
import java.time.Duration

data class HttpResilience(
    val retriesAttemptPerRequest: Int,
    val delayBetweenRetries: Duration,
    val timeoutForHttpRequest: Duration
)

In our scope, we want to recover from potential connectivity errors in the last mile. Since we already have a method to capture such exceptions when they emerge from the underlying stack, we can control the retry function conditionally. In addition, we stick with a simple fixed time interval function in this configuration since we assume that we don’t need an exponential backoff strategy when healing from connectivity issues.

Thus, an abstraction wrapping the configuration for Retry and leveraging the first-class Resilience4j support for Kotlin Coroutines could look like this:

ResilienceAwareExecution private constructor(spec: HttpResilience) {

    private val retryRegistry by lazy {
        RetryRegistry.ofDefaults()
    }

    private val fixedInterval by lazy {
        IntervalFunction.of(spec.delayBetweenRetries)
    }

    private val retryConfig by lazy {
        RetryConfig {
            maxAttempts(spec.retriesAttemptPerRequest)
            retryOnException { it is HttpNetworkingError.Connectivity }
            intervalFunction(fixedInterval)
        }
    }

    private val retryRunner by lazy {
        retryRegistry.retry("retry-policy", retryConfig)
    }

    suspend operator fun <T> invoke(block: suspend () -> T): T =
        retryRunner.executeSuspendFunction {
            NetworkingErrorAwareExecution {
                block.invoke()
            }
        }

    companion object {
        fun create(spec: HttpResilience) = ResilienceAwareExecution(spec)
    }
}

Finally, we wire everything inside the API service client that upstream feature modules will use:

class RestClient(
    private val service: ChuckNorrisService,
    private val resilience: HttpResilience
) {

    private val resilienceAwareExecution by lazy {
        ResilienceAwareExecution.create(resilience)
    }

    suspend fun categories(): List<String> =
        resilienceAwareExecution {
            service.categories()
        }

    suspend fun search(query: String): RawSearch =
        resilienceAwareExecution {
            service.search(query)
        }
}

Simulating networking issues in the last mile with Toxiproxy, and Testcontainers#

Everything previously exposed constitutes a simple but solid example of a resilient HTTP layer suitable for our use case. Our next step is figuring out how to simulate connectivity errors effectively.

Luckily for us, plenty of tools nowadays address scenarios of Chaos Engineering. In particular, Toxiproxy is one solid option for our use case.

Understanding how Toxiproxy works is straightforward: think of an HTTP proxy server empowered with an API designed to inject failures in the proxied requests programmatically. In a nutshell, Toxiproxy models different categories of injectable faults and calls each one Toxic. Each Toxic applies over a Direction associated with the HTTP connection (Downstream or Upstream) and follow a level of Toxicity, i.e., a probability of having the fault happening during the request lifecycle.

Of course, one can spin a Toxiproxy instance and connect a real or virtual Android pointing it, whether to inject failures on manual or instrumented tests. However, there is a clever way to automate integration tests and verify the effectiveness of an HTTP resilience strategy: we’ll take advantage of the fantastic Testcontainers library.

In addition to being a battle-tested solution for integration testing in the Backend world, we chose this option because it has an official module Toxiproxy, and it also makes it trivial to containerize a WireMock server, which we picked due to its simplicity and speed. Note that WireMock does not have an official module for Testcontainers yet, but it is pretty easy to create one on top of a GenericContainer.

Testcontainers make it easy to design our test setup by proxying a containerized instance of any fake server through a containerized Toxiproxy, which makes exercising our HTTP abstraction through the unreliable proxy straightforward.

// Pins the Docker components we want to use

private val toxyProxyImage by lazy {
    val componentName = "ghcr.io/shopify/toxiproxy"
    val imageTag = "2.5.0"
    DockerImageName.parse(componentName).withTag(imageTag)
}

private val wireMockServerImage by lazy {
    val componentName = "wiremock/wiremock"
    val imageTag = "2.32.0"
    DockerImageName.parse(componentName).withTag(imageTag)
}

// Configures Toxiproxy and Wiremock server in the same network

@get:Rule val network: Network = Network.newNetwork()

@get:Rule val wireMockContainer: WireMockContainer =
    WireMockContainer(wireMockServerImage)
        .withStubMapping("categories", loadFile("categories-stub.json"))
        .withNetworkAliases("wiremock")
        .withNetwork(network)

@get:Rule val toxiProxyContainer: ToxiproxyContainer =
    ToxiproxyContainer(toxyProxyImage)
        .withNetwork(network)
        .dependsOn(wireMockContainer)

Most Android Engineers are unfamiliar with Docker, Containers, and related tooling and concepts, so I leave some implementation details from the snippet above as an exercise for the reader. You may also want to have a look in the complete working example. Once done with the test setup, we can now design test cases, which will follow a structure like the one described below:

GIVEN

A desired response for a given HTTP call

A mock server that answers with the desired response

The resilient HTTP abstraction under test

A Toxic configured on Toxiproxy

WHEN

The HTTP request goes through the proxy

THEN

I expected to get the desired response

Toxiproxy supports several Toxics, but we’ll stick with three of them to simulate situations typical in the last mile of Mobile networks:

Timeout, to forge HTTP requests failing on time due to connection timeouts
Latency, to simulate both latency and jitter in the network
LimitData, to falsify interruptions on networking connection during an in-flight request

Our integration tests aim to validate a recovery from the Toxic effect in the request lifecycle. However, we should be aware that a Toxicity level is a probability, which means that, by design, such integration tests are non-deterministic. This fact damages one of the most acknowledged properties of good tests, as per Kent Beck’s Test Desiderata, and it has significant consequences, whether in how we write or execute our tests. More on that ahead.

One example of such a test ensures resilience against connection spikes:

private lateinit var toxiproxy: Proxy
private lateinit var restClient: RestClient

@Before fun `before each test`() {
    val toxiproxyClient = toxiProxyContainer.let {
        ToxiproxyClient(it.host, it.controlPort)
    }

    toxiproxy = toxiproxyClient.createProxy(
        "wiremock",
        "0.0.0.0:$toxyProxyPort",
        "wiremock:$wireMockPort"
    )

    val baseUrl = toxiProxyContainer.let {
        "http://${it.host}:${it.getMappedPort(toxyProxyPort)}"
    }

    restClient = RestClientFactory.create(baseUrl, resilienceSpec)
}

@Test fun `should recover from connection spikes`() {

    // Our WireMock stub is around 30 bytes, hence we force an end of stream
    toxiproxy.limitData(numberOfBytes = 15).setToxicity(ToxicityLevel.MODERATE)

    runBlocking {
        val categories = restClient.categories()
        assertThat(categories).isNotEmpty()
    }
}

Note that we have a ToxicityLevel concept that wraps a sensitive value around the reasonable level of Toxicity we want to use. In most of our tests, a small amount of Toxicity is suitable to trigger our retry logic at least once. Still, we can also force the exhaustion of retry attempts if we want by just disabling the proxy (which causes an immediate interruption for any incoming connection) and also ensure no retries due HTTP-based failures.

Due to the non-deterministic nature of such integration tests, we need to take additional measures to ensure that test flakiness does not screw with the trust in our test suite, especially considering that flaky tests might be false positives in the scenario where a low probability drove them. A straightforward way to improve here is retrying the build system task at the CI level, ensuring that we re-run the failing tests if needed.

With all that in place, we now have a foundation to improve on HTTP resilience, eventually supporting different strategies like retries for HTTP-based request failures. But more than that, we are now in a position where we have more confidence in our HTTP layer and provide some additional value to our users.

Final Remarks#

This article shows that Chaos Engineering is suitable and desirable for Mobile projects. Although we only have a little material on this topic, we can take inspiration from techniques widely adopted in backend services and evaluate how they fit our use cases, as Twitch folks shared with us some time ago.

Nevertheless, I see Testcontainers as an underrated tool in the Mobile testing toolkit. Although OkHttp’s MockWebServer is a great option for integration tests in Android, either JVM-based or Android/Instrumentation ones, tools like Docker, WireMock, Pact, Toxiproxy, etc. put us close to the way Backend Engineers work and eventually enable better context sharing with them.

Last but not least, it’s vital to acknowledge that there are no silver bullets when implementing solid abstractions closer to the infrastructure layers of our Mobile application. Sometimes we’ll get a few things out-the-box from our tools of trade; sometimes, another project already solves our problem, and sometimes we’ll have to craft a custom solution geared towards our context.

If you made it this far, I’d like to thank you for reading. I also share a couple of links that might be of your interest:

Shopify’s blog post that introduced Toxiproxy
More on the Retry pattern by Azure team
Automated Chaos Testing in the Front-end by Twitch team
Chaos testing with Testcontainers and Toxiproxy by AtomicJar team
A comparison between Resilience4j and Failsafe by Nicolas Fränkel

See you in the next post!