Fault Tolerance with Resilience4j and Spring Boot
Introduction
Fault Tolerance patterns define how our application should behave when it cannot fulfil a request, whether because its own resources are exhausted or because an external call fails. If, for example, an external service responds slowly and inconsistently, returning errors after long waits, it is not a good strategy to keep calling it: we block our own resources and give the service no chance to recover. It is better to detect such a scenario, switch the application to an error flow automatically without even calling the external service, and, after some time, try to return to the original behaviour. That is exactly what the Circuit Breaker pattern does. There are multiple patterns we can apply to our methods to protect them from various fault states, each suited to a slightly different scenario. Applying adequate patterns to our services not only helps us manage our resources better, ultimately reducing the lifetime of the actual problem, but also makes our application more responsive and consistent.
Some of the most important Fault Tolerance patterns include:
- Circuit Breaker: The application keeps track of how often the protected service fails. If failures exceed a configured threshold, the circuit is opened and all subsequent calls fail immediately, giving the faulty service time to recover. After some time the circuit becomes half-open and permits some calls to go through, testing the service. If the service is responsive, the circuit is closed and the application's behaviour goes back to normal; otherwise, the circuit opens again.
- Retry: If a protected service fails, the application waits a configured period of time and then tries again, repeating the process up to a configured number of times.
- Bulkhead: Only a certain number of concurrent calls to the protected service is allowed at the same time; any further calls are rejected.
- RateLimiter: Only a certain rate of traffic is allowed to reach the protected service. If the configured limit is exceeded, subsequent calls are rejected.
Fault Tolerance with Resilience4j
Resilience4j is a lightweight fault tolerance library that we can easily add to and use within a Spring Boot project. It provides implementations of a variety of fault tolerance and stability patterns. It is annotation-based: we add an annotation, such as @CircuitBreaker, on a method or a class (meaning all of its public methods) and provide the specific configuration in our YAML configuration files. We can also provide a fallback method for the annotated method, which will be called in case of an error instead of propagating the exception. Resilience4j is a successor to the popular, but no longer actively developed, Netflix Hystrix library. It works well with Spring Boot 3 and extends its predecessor's functionality, providing more configurability (for example, it greatly expands the configurability of the Circuit Breaker half-open state – Hystrix would only perform a single call in that state).
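To build some intuition before the Spring setup, here is a minimal sketch of the same circuit breaker idea using Resilience4j's programmatic core API instead of annotations (the instance name and the supplier body are placeholders; the rest of the article sticks to the annotation style):

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import java.util.function.Supplier;

public class CircuitBreakerSketch {

    public static void main(String[] args) {
        // A circuit breaker instance with the library's default settings.
        CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("externalService");

        // The decorated supplier records successes and failures; while the circuit
        // is open, calls are rejected without reaching the protected code.
        Supplier<String> decorated = CircuitBreaker.decorateSupplier(
                circuitBreaker, () -> "response from the external service");

        try {
            System.out.println(decorated.get());
        } catch (CallNotPermittedException e) {
            // Thrown when the circuit is open and the call is short-circuited.
            System.out.println("Circuit open, using fallback behaviour");
        }
    }
}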
Setting up the project
Let's set up a Spring Boot 3 project to test Resilience4j functionalities. We will define a simple backend server calling an external service, protect the external calls with Resilience4j, and test the error scenarios using Spring Boot integration tests and WireMock.
REST services
Let's add a three-layered structure:
- The Controller receives the request and passes it to the Service.
- The Service performs business logic and calls the Connector.
- The Connector calls the external API.
It is the Connector methods that we will annotate with the Resilience4j annotations.
We will test the Circuit Breaker, Retry, Bulkhead and RateLimiter patterns. In our tests, for brevity, we will directly call the Service methods, so we don’t have to duplicate all the methods in the controller.
Controller:
@Slf4j
@RestController
@RequiredArgsConstructor
public class TaskController {

    private final TaskService taskService;

    @GetMapping("/task/{id}")
    public String getTaskDetails(@PathVariable("id") Integer id) {
        return taskService.getTaskDetailsCircuitBreaker(id);
    }
}
Service:
@Slf4j
@Service
@RequiredArgsConstructor
public class TaskService {

    private final TaskConnector taskConnector;

    public String getTaskDetailsCircuitBreaker(Integer id) {
        log.info("Getting task {} details using Circuit Breaker pattern", id);
        // some other business logic here
        return taskConnector.getTaskDetailsCircuitBreaker(id);
    }

    public String getTaskDetailsRetry(Integer id) {
        log.info("Getting task {} details using Retry pattern", id);
        // some other business logic here
        return taskConnector.getTaskDetailsRetry(id);
    }

    public String getTaskDetailsBulkhead(Integer id) {
        log.info("Getting task {} details using Bulkhead pattern", id);
        // some other business logic here
        return taskConnector.getTaskDetailsBulkhead(id);
    }

    public String getTaskDetailsRatelimiter(Integer id) {
        log.info("Getting task {} details using Ratelimiter pattern", id);
        // some other business logic here
        return taskConnector.getTaskDetailsRatelimiter(id);
    }
}
Connector:
@Slf4j
@Service
@RequiredArgsConstructor
public class TaskConnector {

    private final RestTemplate restTemplate;

    // TODO: add resilience4j annotation
    public String getTaskDetailsCircuitBreaker(Integer id) {
        return restTemplate.getForObject("/task/{id}", String.class, id);
    }

    // TODO: add resilience4j annotation
    public String getTaskDetailsRetry(Integer id) {
        return restTemplate.getForObject("/task/{id}", String.class, id);
    }

    // TODO: add resilience4j annotation
    public String getTaskDetailsBulkhead(Integer id) {
        return restTemplate.getForObject("/task/{id}", String.class, id);
    }

    // TODO: add resilience4j annotation
    public String getTaskDetailsRatelimiter(Integer id) {
        return restTemplate.getForObject("/task/{id}", String.class, id);
    }
}
We also define the RestTemplate bean that our connector uses:
@Configuration
public class RestConfiguration {

    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplateBuilder().rootUri("http://localhost:8081")
                .build();
    }
}
For testing purposes we configure our app to call external services on the following address: localhost:8081.
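If we later want to point the connector at a real service without touching the code, the root URI could be externalized into a property instead; a minimal sketch, assuming a hypothetical task.service.url property that defaults to the test address:

@Configuration
public class RestConfiguration {

    @Bean
    public RestTemplate restTemplate(
            @Value("${task.service.url:http://localhost:8081}") String rootUri) {
        // Defaults to the WireMock address used in the tests; a production profile
        // can override task.service.url with the real external service address.
        return new RestTemplateBuilder().rootUri(rootUri).build();
    }
}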
WireMock
In order to use WireMock, a tool for mocking external services in tests, we need to add the following dependencies:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-test</artifactId>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-contract-stub-runner</artifactId>
    <version>4.0.3</version>
    <scope>test</scope>
</dependency>
We can now set up our Spring Boot tests like this:
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@AutoConfigureWireMock(port = 8081)
class Resilience4jApplicationTests {

    private static final String SERVER_ERROR_NAME = "org.springframework.web.client.HttpServerErrorException$InternalServerError";
    private static final String CIRCUIT_BREAKER_ERROR_NAME = "io.github.resilience4j.circuitbreaker.CallNotPermittedException";
    private static final String BULKHEAD_ERROR_NAME = "io.github.resilience4j.bulkhead.BulkheadFullException";
    private static final String RATELIMITER_ERROR_NAME = "io.github.resilience4j.ratelimiter.RequestNotPermitted";

    @Autowired
    private TaskService taskService;

    @SpyBean
    private RestTemplate restTemplate;

    @BeforeEach
    void initWireMock() {
        stubFor(get(urlEqualTo("/task/1")).willReturn(aResponse().withBody("Task 1 details")));
        stubFor(get(urlEqualTo("/task/2")).willReturn(serverError().withBody("Task 2 details failed")));
    }

    // [...]
}
- We specify that during integration tests our application should run on a random port.
- We define constants: the name of the exception that will be thrown by the external service WireMock stub, and the names of the exceptions we expect Resilience4j to throw when it rejects a protected method call.
- We autowire our TaskService.
- We prepare a Mockito spy for the RestTemplate that our connector will call.
- Using WireMock, we specify that external calls with a path parameter of '1' should always succeed, and those with '2' should always fail.
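For completeness, the stubbing helpers used in initWireMock (stubFor, get, urlEqualTo, aResponse, serverError) are static methods of WireMock's client class, so the test class needs static imports along these lines:

import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
import static com.github.tomakehurst.wiremock.client.WireMock.get;
import static com.github.tomakehurst.wiremock.client.WireMock.serverError;
import static com.github.tomakehurst.wiremock.client.WireMock.stubFor;
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;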
Adding Resilience4j to the Spring Boot project
Now let's actually add Resilience4j to our project. We will need the following dependencies:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.1.0</version>
</dependency>
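A quick note on the spring-boot-starter-actuator dependency: Resilience4j publishes the state of its instances through Micrometer metrics and optional health indicators. A minimal sketch of enabling the health indicator for the circuit breaker instance we configure below, using the property names from the Resilience4j Spring Boot documentation (worth verifying against your exact versions):

management:
  health:
    circuitbreakers:
      enabled: true
resilience4j:
  circuitbreaker:
    instances:
      CircuitBreakerService:
        register-health-indicator: true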
Now we can add Resilience4j annotations to our TaskConnector! In the following sections we will configure and test the fault tolerance patterns described earlier.
Circuit Breaker
We add the @CircuitBreaker annotation to the connector method:
@CircuitBreaker(name = "CircuitBreakerService", fallbackMethod = "getTaskDetailsFallback")
public String getTaskDetailsCircuitBreaker(Integer id) {
    return restTemplate.getForObject("/task/{id}", String.class, id);
}
- We specify the name of the circuit breaker – this will identify it in our configuration file.
- We specify the fallbackMethod – this method will be called if the original function throws an exception.
The fallback part is not mandatory. When it's not provided, the exception is simply propagated further. In our case the fallback method will return an error message containing the name of the thrown exception. That way we test both whether the fallback is called and whether the original exception is the expected one. The fallback method's signature must match the annotated method's, with the addition of a Throwable argument carrying the exception that triggered the fallback.
public String getTaskDetailsFallback(Integer id, Throwable error) {
    return "Default fallback with error " + error.getClass().getName();
}
We define the Circuit Breaker's configuration in the YAML file, like this:
resilience4j:
  circuitbreaker:
    instances:
      CircuitBreakerService:
        failure-rate-threshold: 50
        minimum-number-of-calls: 5
        automatic-transition-from-open-to-half-open-enabled: true
        wait-duration-in-open-state: 15s
        permitted-number-of-calls-in-half-open-state: 3
        sliding-window-type: count_based
        sliding-window-size: 10
As we can see, we’ve used the identifier from the annotation. With the configurations, we can customize the behaviour of our circuit breaker:
- failure-rate-threshold: The failure threshold percentage. If the failure rate is equal to or higher than this value, the circuit is opened.
- minimum-number-of-calls: The minimum number of calls which are required to calculate the error rate.
- automatic-transition-from-open-to-half-open-enabled: Whether the transition from open to half-open should be automatic, dependent only on the time elapsed.
- wait-duration-in-open-state: How long to wait before transitioning from open to half-open.
- permitted-number-of-calls-in-half-open-state: How many testing calls to permit in the half-open state.
- sliding-window-type: Defines the sliding window type: whether the recent results are tracked based on call count or based on time.
- sliding-window-size: The size of the sliding window – the number of calls if the type is count_based or seconds if it’s time_based.
The full list of possible configurations can be found in the Resilience4j Circuit Breaker documentation.
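As a side note, when several instances share most of their settings, Resilience4j also supports shared configurations that instances can inherit from via base-config; a minimal sketch (the values shown are illustrative):

resilience4j:
  circuitbreaker:
    configs:
      default:
        failure-rate-threshold: 50
        sliding-window-type: count_based
        sliding-window-size: 10
    instances:
      CircuitBreakerService:
        base-config: default
        minimum-number-of-calls: 5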
Testing the Circuit Breaker
We add a test to our Resilience4jApplicationTests:
@Test
void testCircuitBreaker() {
    IntStream.rangeClosed(1, 5).forEach(i -> {
        String details = taskService.getTaskDetailsCircuitBreaker(2);
        assertThat(details).isEqualTo("Default fallback with error " + SERVER_ERROR_NAME);
    });
    IntStream.rangeClosed(1, 5).forEach(i -> {
        String details = taskService.getTaskDetailsCircuitBreaker(2);
        assertThat(details).isEqualTo("Default fallback with error " + CIRCUIT_BREAKER_ERROR_NAME);
    });
    Mockito.verify(restTemplate, Mockito.times(5)).getForObject("/task/{id}", String.class, 2);
}
- The first 5 calls should fail with the server error exception, as we have configured the minimum number of calls to 5.
- The next 5 calls should fail with the Resilience4j CallNotPermittedException, as the calculated error rate for the window is equal to or higher than our configured 50%.
- The actual RestTemplate should be called only 5 times, not 10, as the circuit should be open.
We can see that the test passes.
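If we want to assert the open state explicitly instead of inferring it from the fallback message, we could also inject the CircuitBreakerRegistry that the starter auto-configures; a minimal sketch of such an extra assertion:

@Autowired
private CircuitBreakerRegistry circuitBreakerRegistry;

// e.g. at the end of testCircuitBreaker(), after the failing calls:
assertThat(circuitBreakerRegistry.circuitBreaker("CircuitBreakerService").getState())
        .isEqualTo(CircuitBreaker.State.OPEN);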
Retry
We add the @Retry annotation to the connector method:
@Retry(name = "RetryService", fallbackMethod = "getTaskDetailsFallback")
public String getTaskDetailsRetry(Integer id) {
    return restTemplate.getForObject("/task/{id}", String.class, id);
}
Again, we use the identifier from the annotation for configurations:
resilience4j:
  retry:
    instances:
      RetryService:
        max-attempts: 3
        wait-duration: 1s
- max-attempts: The maximum number of attempts, including the initial call.
- wait-duration: How long to wait between attempts.
The full list of possible configurations can be found in the Resilience4j Retry documentation.
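Retry offers more knobs than the two used here; for example, exponential backoff and restricting which exceptions trigger a retry can be configured declaratively. A minimal sketch extending the same instance (the values are illustrative):

resilience4j:
  retry:
    instances:
      RetryService:
        max-attempts: 3
        wait-duration: 1s
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2
        retry-exceptions:
          - org.springframework.web.client.HttpServerErrorException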
Testing the Retry
We add a test to our Resilience4jApplicationTests:
@Test
public void testRetry() {
    String result1 = taskService.getTaskDetailsRetry(1);
    assertThat(result1).isEqualTo("Task 1 details");
    Mockito.verify(restTemplate, Mockito.times(1)).getForObject("/task/{id}", String.class, 1);

    String result2 = taskService.getTaskDetailsRetry(2);
    assertThat(result2).isEqualTo("Default fallback with error " + SERVER_ERROR_NAME);
    Mockito.verify(restTemplate, Mockito.times(3)).getForObject("/task/{id}", String.class, 2);
}
- Invocations for the id of ‘1’ should succeed, so there should be only one attempt.
- Invocations for the id of ‘2’ should fail, so there should be three attempts, ultimately returning an error through the fallback.
We can see that the test passes.
Bulkhead
We add the @Bulkhead annotation to the connector method:
@Bulkhead(name = "BulkheadService", fallbackMethod = "getTaskDetailsFallback")
public String getTaskDetailsBulkhead(Integer id) {
    return restTemplate.getForObject("/task/{id}", String.class, id);
}
Again, we use the identifier from the annotation for configurations:
resilience4j:
  bulkhead:
    instances:
      BulkheadService:
        max-concurrent-calls: 3
        max-wait-duration: 1
- max-concurrent-calls: The maximum number of concurrent calls to accept.
- max-wait-duration: How long subsequent calls can wait to enter the bulkhead (in ms) before being rejected.
The full list of possible configurations can be found in the Resilience4j Bulkhead documentation.
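The annotation above uses the default semaphore-based bulkhead. Resilience4j also offers a thread-pool-based variant selected through the annotation's type attribute; it requires the method to return a CompletionStage and is configured under resilience4j.thread-pool-bulkhead. A minimal sketch (the method and fallback names here are hypothetical, not part of the project above):

@Bulkhead(name = "BulkheadService", type = Bulkhead.Type.THREADPOOL, fallbackMethod = "getTaskDetailsAsyncFallback")
public CompletableFuture<String> getTaskDetailsBulkheadAsync(Integer id) {
    // With the thread-pool variant the call is executed on the bulkhead's own thread pool,
    // isolating it from the caller's threads.
    return CompletableFuture.completedFuture(
            restTemplate.getForObject("/task/{id}", String.class, id));
}

public CompletableFuture<String> getTaskDetailsAsyncFallback(Integer id, Throwable error) {
    return CompletableFuture.completedFuture(
            "Default fallback with error " + error.getClass().getName());
}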
Testing the Bulkhead
We add a test to our Resilience4jApplicationTests:
@Test
public void testBulkhead() throws Exception {
    ExecutorService executorService = Executors.newFixedThreadPool(5);
    CountDownLatch latch = new CountDownLatch(5);
    AtomicInteger successCounter = new AtomicInteger(0);
    AtomicInteger failCounter = new AtomicInteger(0);
    IntStream.rangeClosed(1, 5)
            .forEach(i -> executorService.execute(() -> {
                String result = taskService.getTaskDetailsBulkhead(1);
                if (result.equals("Default fallback with error " + BULKHEAD_ERROR_NAME)) {
                    failCounter.incrementAndGet();
                } else if (result.equals("Task 1 details")) {
                    successCounter.incrementAndGet();
                }
                latch.countDown();
            }));
    latch.await();
    executorService.shutdown();
    assertThat(successCounter.get()).isEqualTo(3);
    assertThat(failCounter.get()).isEqualTo(2);
    Mockito.verify(restTemplate, Mockito.times(3)).getForObject("/task/{id}", String.class, 1);
}
- We call the Bulkhead method from 5 separate threads.
- Only three of the calls should return the success response; the remaining two should be rejected and return an error response through the fallback.
- Ultimately, restTemplate should be called exactly 3 times.
We can see that the test passes.
RateLimiter
We add the @RateLimiter annotation to the connector method:
@RateLimiter(name = "RateLimiterService", fallbackMethod = "getTaskDetailsFallback")
public String getTaskDetailsRatelimiter(Integer id) {
    return restTemplate.getForObject("/task/{id}", String.class, id);
}
Again, we use the identifier from the annotation for configurations:
resilience4j:
  ratelimiter:
    instances:
      RateLimiterService:
        limit-for-period: 5
        limit-refresh-period: 60s
        timeout-duration: 0s
- limit-for-period: The limit of calls permitted within a single period. After each period the limit is reset.
- limit-refresh-period: The duration of a single period, after which the limit is refreshed.
- timeout-duration: How long a thread waits for a permit when the limit for the current period is exhausted, before the call is rejected.
The full list of possible configurations can be found in the Resilience4j RateLimiter documentation.
Testing the RateLimiter
We add a test to our Resilience4jApplicationTests:
@Test
public void testRateLimiter() {
    AtomicInteger successCounter = new AtomicInteger(0);
    AtomicInteger failCounter = new AtomicInteger(0);
    IntStream.rangeClosed(1, 10)
            .parallel()
            .forEach(i -> {
                String result = taskService.getTaskDetailsRatelimiter(1);
                if (result.equals("Default fallback with error " + RATELIMITER_ERROR_NAME)) {
                    failCounter.incrementAndGet();
                } else if (result.equals("Task 1 details")) {
                    successCounter.incrementAndGet();
                }
            });
    assertThat(successCounter.get()).isEqualTo(5);
    assertThat(failCounter.get()).isEqualTo(5);
    Mockito.verify(restTemplate, Mockito.times(5)).getForObject("/task/{id}", String.class, 1);
}
- We call the RateLimiter service simultaneously 10 times.
- Only 5 of the calls should return the success response; the remaining calls exceed the limit and are rejected.
- Ultimately, restTemplate should be called exactly 5 times.
We can see that the test passes.
Conclusion
In this article we've learned about the importance of fault tolerance patterns and how to implement them using the Resilience4j library. Applying the right pattern on a case-by-case basis greatly increases our control over external failures and capacity problems, making our application more predictable. Maybe it's worth giving it a try in your Spring Boot service?