The topic of this post, Hystrix, may be familiar to many of us. There are already plenty of blogs and posts on the web that explain what Hystrix is, its advantages, its use cases, how to configure it, and so on. The intention of this post is not to cover those areas again and create one more duplicate post; rather, it is to walk through the issues we ran into in our micro-services deployment when Hystrix was used as the solution for fault tolerance and latency.

Let’s first understand the problem. In our micro-services deployments, the problem statement we had with Hystrix was:

“We were facing cascading failures with Hystrix and observed that when one API fails, it impacts other APIs too. It is bringing down other APIs, so finally we decided to move away from Hystrix and started making direct remote API calls.”

Observations:

As part of my study of this issue, I started by analysing how Hystrix was being used in our micro-services components. The common pattern observed across all services was:

  • Every micro-service component exposes 5-10 or more APIs.
  • For each micro-service, a client JAR is generated as part of the Maven build.
  • All consumers of these micro-services use the client JAR as a dependency to make remote calls to the APIs it exposes.
  • Within this client JAR, a client class (XXXServiceClient.java) wraps each API in a Hystrix command and makes the remote call.
  • All the methods in this client class (which wrap remote calls in Hystrix commands) take one common Hystrix property object that describes how Hystrix should be configured, as sketched below.
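To make the pattern concrete, here is a minimal sketch of such a client class. The names (SessionServiceClient, HystrixCommandConfig) and the use of a Spring RestTemplate are my own assumptions for illustration, not our actual generated code; the point is that every method, whatever API it calls, builds its Hystrix command from the same shared property object.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;
import org.springframework.web.client.RestTemplate;

// Hypothetical equivalent of the generated XXXServiceClient.java.
public class SessionServiceClient {

    private final RestTemplate restTemplate = new RestTemplate();

    // Every API method takes the same shared Hystrix property object.
    public String validateSession(HystrixCommandConfig config, String sessionId) {
        return new HystrixCommand<String>(setterFrom(config)) {
            @Override
            protected String run() {
                return restTemplate.getForObject(
                        "http://session-service/sessions/{id}/validate", String.class, sessionId);
            }
        }.execute();
    }

    public void deleteSession(HystrixCommandConfig config, String sessionId) {
        new HystrixCommand<Void>(setterFrom(config)) {   // same timeout, same thread pool
            @Override
            protected Void run() {
                restTemplate.delete("http://session-service/sessions/{id}", sessionId);
                return null;
            }
        }.execute();
    }

    // Every command, regardless of which API it wraps, is configured identically.
    private HystrixCommand.Setter setterFrom(HystrixCommandConfig cfg) {
        return HystrixCommand.Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("SessionServiceClient"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(cfg.getTimeoutMs()))
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(cfg.getThreadCount()));
    }
}

// Hypothetical holder for the single, shared set of Hystrix properties.
class HystrixCommandConfig {
    private final int timeoutMs;
    private final int threadCount;
    HystrixCommandConfig(int timeoutMs, int threadCount) {
        this.timeoutMs = timeoutMs;
        this.threadCount = threadCount;
    }
    int getTimeoutMs()   { return timeoutMs; }
    int getThreadCount() { return threadCount; }
}
```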
Figure: Consumption of APIs (consumers calling APIs from other micro-services)

 

The figure above illustrates the same problem: one consumer service uses multiple client JARs, and, as can be seen, not just one remote API but all the APIs from all the clients share the same Hystrix configuration bean. How does this pattern harm our micro-services deployment?

Micro-service clients (A) and (B) both use a common configuration for all their APIs. This kind of shared Hystrix configuration effectively says, “Here are my benchmarked Hystrix properties; use them for all your remote calls, irrespective of which remote API you are going to call.” It implies that all APIs, irrespective of their nature, use the same timeout, the same number of threads, the same retry period, and so on. All of this pointed to a reason to change the implementation and see if we could get past these failures. A few facts about using a common Hystrix configuration for all API calls are listed below:

  • When we use the APIs exposed by micro-service components, we should not expect all of them to be of the same nature. For example, a remote call that finds a resource by primary key comes back in milliseconds, whereas another API from the same micro-service may have to execute complex logic before it returns a response. We should know how long each API takes to respond; the response time differs based on the logic executed to build the response.
  • Not all APIs should be wrapped in a Hystrix command. For example, locating a resource by primary key, or updating or deleting a resource, are good candidates for Hystrix because their execution time is relatively easy to estimate. But there are APIs whose execution time cannot be predicted, for example generating a PDF document. If we know the document size can vary from a few KBs to 100 MB, it is not advisable to wrap this call in Hystrix, because the execution time varies from a few milliseconds to minutes.
  • Not all APIs of a client are used at the same frequency. For example, the validate-session API of a session service is used very frequently (roughly 10-20 times per page load), whereas the delete-session API is used only when a user logs out. It is not advisable to assign the same number of threads to both APIs. In our case, we were assigning the same number of default threads to every API call, and this was one of the contributors to the cascading failures: when an API like validate-session is in trouble, it consumes all the allocated threads and blocks the delete-session API too (see the sketch after this list).
  • Many more similar issues can be identified with the single Hystrix configuration approach used here.
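To make the last two points concrete, here is a hedged sketch of what per-API isolation could look like; the command keys, pool sizes and timeouts below are illustrative, not our production values. The frequently called validate-session API and the rarely called delete-session API each get their own thread pool and timeout, while an unpredictable call such as PDF generation is simply not wrapped in a Hystrix command.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class PerApiHystrixSetters {

    // Called 10-20 times per page load: short timeout, larger dedicated pool.
    public static final HystrixCommand.Setter VALIDATE_SESSION = HystrixCommand.Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("SessionService"))
            .andCommandKey(HystrixCommandKey.Factory.asKey("validateSession"))
            .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("validateSessionPool"))
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                    .withExecutionTimeoutInMilliseconds(300))
            .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                    .withCoreSize(20));

    // Called only on logout: its own small pool, so a saturated
    // validateSessionPool cannot starve it.
    public static final HystrixCommand.Setter DELETE_SESSION = HystrixCommand.Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("SessionService"))
            .andCommandKey(HystrixCommandKey.Factory.asKey("deleteSession"))
            .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("deleteSessionPool"))
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                    .withExecutionTimeoutInMilliseconds(1000))
            .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                    .withCoreSize(2));

    // An API such as "generate PDF document" (milliseconds to minutes) would be
    // called directly, without a Hystrix command, because no sensible timeout exists.
}
```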

Solution:

Having identified the above issues, the Hystrix configuration was updated to resolve them. Here are the details of the changes.

  • Let the client JAR be free of Hystrix code and make plain HTTP remote calls, and let the consumer of the APIs decide how to call each API based on its own usage pattern. The client code shown earlier is updated as follows:
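Here is a minimal sketch of the Hystrix-free client, again with assumed names and a Spring RestTemplate rather than the original snippet: it only knows how to make the plain HTTP call and return the result.

```java
import org.springframework.web.client.RestTemplate;

// Hypothetical XXXServiceClient, now free of any Hystrix code.
public class SessionServiceClient {

    private final RestTemplate restTemplate = new RestTemplate();
    private final String baseUrl;

    public SessionServiceClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // Plain remote call; no timeouts, thread pools or circuit breakers here.
    public String validateSession(String sessionId) {
        return restTemplate.getForObject(
                baseUrl + "/sessions/{id}/validate", String.class, sessionId);
    }

    public void deleteSession(String sessionId) {
        restTemplate.delete(baseUrl + "/sessions/{id}", sessionId);
    }
}
```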

  • Next, on the consumer side, define the Hystrix configuration based on the API's usage requirements, such as the number of threads required, a timeout based on the expected response time, and when to open the Hystrix circuit based on the failure rate. Then pass this API-specific Hystrix configuration when calling the remote API through the client JAR, as follows:
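The following is a hedged sketch of the consumer side, with illustrative names only: the consumer wraps the plain client call in its own Hystrix command and passes in an API-specific HystrixCommand.Setter (threads, timeout, circuit-breaker thresholds) that it owns.

```java
import com.netflix.hystrix.HystrixCommand;

// Hypothetical consumer-side command for one specific API.
public class ValidateSessionCommand extends HystrixCommand<String> {

    private final SessionServiceClient client;   // the plain HTTP client from the client JAR
    private final String sessionId;

    public ValidateSessionCommand(HystrixCommand.Setter apiSpecificSetter,
                                  SessionServiceClient client,
                                  String sessionId) {
        super(apiSpecificSetter);                 // per-API threads, timeout, circuit breaker
        this.client = client;
        this.sessionId = sessionId;
    }

    @Override
    protected String run() {
        return client.validateSession(sessionId); // the plain remote call
    }

    @Override
    protected String getFallback() {
        return "INVALID";                         // fail fast with a safe default when the circuit is open
    }
}
```

A consumer would then execute the call with, for example, new ValidateSessionCommand(validateSessionSetter, sessionServiceClient, sessionId).execute(), where validateSessionSetter carries that consumer's own configuration for this one API.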

The above API call is configured as a Spring bean, and the Hystrix properties are injected as a dependency of this API bean.
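A minimal sketch of that wiring, assuming Spring Java configuration; the bean names, URL and property values are assumptions rather than the original setup. The plain client and an API-specific Setter are declared as beans and injected wherever that API is consumed.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SessionClientConfiguration {

    @Bean
    public SessionServiceClient sessionServiceClient() {
        return new SessionServiceClient("http://session-service");   // plain HTTP client bean
    }

    // Hystrix properties for one specific API, injected wherever that API is called.
    @Bean
    public HystrixCommand.Setter validateSessionSetter() {
        return HystrixCommand.Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("SessionService"))
                .andCommandKey(HystrixCommandKey.Factory.asKey("validateSession"))
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("validateSessionPool"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(300)            // expected response time
                        .withCircuitBreakerErrorThresholdPercentage(50)     // open the circuit at 50% failures
                        .withCircuitBreakerSleepWindowInMilliseconds(5000)) // wait before retrying once open
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(20));                                 // sized for this API's traffic
    }
}
```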

With these changes in place, we are free to decide whether to wrap a remote API call in a Hystrix command or just make a plain HTTP call, and many more API-specific configuration parameters can be set based on each API's behaviour.

Results:

When the above fix was applied and tested to see whether it resolved the problem simply by changing the way Hystrix was used, the results came out in favour of Hystrix. The test results showed that the Hystrix library did not have any issue in itself; the problem was the way Hystrix had been integrated with our micro-services. Hystrix is meant to be used on the client side rather than plugged into server-side code artifacts.

Here is a reference video that I captured to show the impact on micro-service components without Hystrix and with Hystrix (not the faulty configuration versus the fixed configuration, because our micro-services are not using Hystrix as of today; it was removed when we had these issues).

A few results from the video are shown below:

Results when Hystrix is not used with micro-services:

API in trouble: When used without Hystrix, once one of the API calls starts failing there is no way to recover and no fault tolerance for the failing HTTP remote call. As can be seen in the screenshot below, all calls to providerById take a consistent time to fail (load time and latency: 43 seconds).

Screen Shot 2016-05-22 at 8.01.04 am

Impact on other APIs: While the above call was failing, the micro-service component blocked all its threads, and this impacted other APIs from the same component as well. As we can see below, another API from the same micro-service took close to 27 seconds to respond (load time at the bottom of the image, in red).

Screen Shot 2016-05-22 at 8.11.41 am.png

 

Results when Hystrix is configured with Micro-service components:

Now let’s see the results when Hystrix is used with these micro-service components and compare them with the previous two observations:

API in trouble: When used with Hystrix, once one of the API calls starts failing, the API fails much faster than in the previous configuration, because once the threshold value is reached the Hystrix circuit opens and no traffic is routed to the failing API until it recovers from the fault. As we can see below, compared to the previous result (43 seconds), the failing API is no longer blocked for a long time; it responds within minimal time (latency and load time: 8 milliseconds).

Screen Shot 2016-05-22 at 8.16.20 am

Impact on other APIs: When we did not use Hystrix, all the configured threads were consumed by the failing API because there was no way to allocate threads on a per-API basis, and the other API was impacted as a result (43 seconds for the failing call, and close to 27 seconds for the other API, as shown earlier). Comparing the same API when used with Hystrix, it took just a few milliseconds to respond (load time displayed at the bottom, in red).

Screen Shot 2016-05-22 at 8.24.57 am

 

Conclusion:

When we work with distributed systems, failing remote calls are always expected, but applications have to be prepared to deal with these failures. To be precise, it is not just about the failures themselves, but about how fast they fail and how little impact they have on other parts of the application. Hystrix fits well in solving these problems, and it works well too.

Thanks for reading 🙂