Failure As A Use Case

It is nearly a guarantee that your production software systems are going to fail in some way.

In an actual production environment, many things can go wrong: memory leaks and excessive memory utilization, port exhaustion, connection pool timeouts, too many open file handles, and numerous others.

Even more potential issues are introduced when distributed systems such as Microservices are adopted, as the entire system has more moving parts than a standard monolithic web application. Service registries, load balancing, failover, and redundancy all become essential, and each adds surface area for potential failures. Handling these failures gracefully is a defining characteristic of system stability.

System stability can be tested and validated outside of production. Unfortunately, doing so is difficult and expensive: you have to create and maintain a separate environment, then apply usage patterns and loads that realistically emulate production.

Why Not Do It Purposefully?

A successful Microservices platform requires a durable and resilient environment that supports the ability to continuously deploy multiple services. Automated deployment is a must, and when possible, automated recovery from failures should be implemented, because failures will happen (Murphy’s Law: anything that can go wrong will go wrong).

Our answer: treat failure as a use case, and engineer failures into your platform’s production environment purposefully.

A use case is a description of how users will perform and experience tasks on your system. It outlines the system’s behavior as it responds to a request. Instead of waiting for a failure to occur and only then finding out how durable and resilient your platform is, we suggest that you be proactive and make failure a USE CASE of your platform.

Netflix has been a pioneer of this purposeful failure strategy, using a tool called Chaos Monkey, which can be configured to randomly take down AWS resources (e.g. load balancers) during normal business hours. When this happens, automated or manual procedures should kick in to remediate the problem while the system continues to operate and serve users.

If you know failures are occurring, yet pagers are not going off at 3 a.m. and the help desk is not being called, then you know your system is durable.

Introducing Trouble Maker

Netflix’s Chaos Monkey is built on the Amazon EC2 API. We wanted a solution that did not depend on the cloud and could easily be used within an enterprise environment.

Trouble Maker was implemented for Java-based web and Microservices-based applications. It randomly takes down application services and provides a web console for performing stability tests against servers.

Random Kill

Trouble Maker is a Java Spring Boot application that communicates with client services through a small servlet registered in each Java API-based service application. By default, Trouble Maker uses Eureka to discover services and, on a cron schedule, randomly selects one to kill (that is, shut down).
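The random selection step can be illustrated with a minimal sketch. The class name, method name, and service names below are hypothetical and are not taken from the Trouble Maker source. The list of names stands in for whatever the service registry returns, and no Eureka calls are shown.

```java
import java.util.List;
import java.util.Optional;
import java.util.Random;

// Hypothetical sketch of the "pick a random service" step.
// The service names stand in for names discovered from a registry.
class RandomKillSketch {

    /** Returns one randomly chosen service name, or empty if none were discovered. */
    static Optional<String> pickVictim(List<String> serviceNames, Random random) {
        if (serviceNames == null || serviceNames.isEmpty()) {
            return Optional.empty();
        }
        int index = random.nextInt(serviceNames.size());
        return Optional.of(serviceNames.get(index));
    }

    public static void main(String[] args) {
        // Assumed example data; a real run would use the discovered services.
        List<String> discovered = List.of("orders-service", "billing-service", "inventory-service");
        pickVictim(discovered, new Random())
                .ifPresent(name -> System.out.println("Selected for KILL: " + name));
    }
}
```

Passing the `Random` instance in, rather than creating it inside the method, lets a test supply a seeded generator.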

By default, once Trouble Maker is started, a random service is selected and killed once per day, Monday through Friday. This schedule can be reconfigured or turned off.
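To make the "once per day, Monday through Friday" rule concrete, here is a small sketch that computes the next weekday run time. It is an illustration only: the class and method names are hypothetical, and the time of day is an assumption, since the default hour is not stated here. In a Spring application, a schedule like this would typically be written as a cron expression such as `0 0 14 * * MON-FRI`, where the hour 14 is again just an example.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical sketch of a weekday-only daily schedule.
class WeekdaySchedule {

    static boolean isWeekday(DayOfWeek day) {
        return day != DayOfWeek.SATURDAY && day != DayOfWeek.SUNDAY;
    }

    /** Returns the first Monday-to-Friday occurrence of runAt strictly after now. */
    static LocalDateTime nextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime candidate = now.toLocalDate().atTime(runAt);
        // If today's slot has already passed (or is right now), start from tomorrow.
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1);
        }
        // Skip Saturday and Sunday.
        while (!isWeekday(candidate.getDayOfWeek())) {
            candidate = candidate.plusDays(1);
        }
        return candidate;
    }

    public static void main(String[] args) {
        // The run time of 14:00 is an assumed example value.
        System.out.println("Next kill scheduled for: "
                + nextRun(LocalDateTime.now(), LocalTime.of(14, 0)));
    }
}
```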

Trouble Dashboard

The Trouble Maker Dashboard provides both an event log and a way to request trouble actions on demand.

Trouble Maker Dashboard

From the dashboard, a service can be selected and the following troubles applied:

  • KILL – Terminates the service (a system exit is performed). Tests failover and alert mechanisms.
  • LOAD – Invokes the selected service with numerous blocking API calls. The blocking time and number of threads can be specified. This emulates how the service behaves under API load.
  • MEMORY – The selected service consumes memory until the heap limit is reached, then blocks for a specified time. This emulates how the system performs under low-memory conditions.
  • EXCEPTION – The selected service throws an exception. This tests the exception handling, logging, and reporting mechanisms of the service.
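As a rough illustration of what each trouble does inside the target service, the four actions might be sketched as follows. This is not Trouble Maker's actual implementation. The class and method names are hypothetical, and the `maxBytes` cap on the MEMORY action is an added safety assumption so the sketch can be tried without exhausting the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the four troubles; not Trouble Maker's real classes.
class TroubleSketch {

    /** KILL: exit the JVM so failover and alerting can be observed. */
    static void kill() {
        System.exit(1);
    }

    /** LOAD: run `threads` (must be > 0) blocking tasks of `blockMillis` each; returns how many completed. */
    static int load(int threads, long blockMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger finished = new AtomicInteger();
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                sleepQuietly(blockMillis);
                finished.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(blockMillis + 5_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return finished.get();
    }

    /** MEMORY: allocate 1 MiB chunks until the heap is exhausted or the cap is hit, then hold them. */
    static long memory(long maxBytes, long holdMillis) {
        final int chunk = 1024 * 1024;
        List<byte[]> hog = new ArrayList<>();
        try {
            while ((long) (hog.size() + 1) * chunk <= maxBytes) {
                hog.add(new byte[chunk]);
            }
        } catch (OutOfMemoryError e) {
            // Heap exhausted before the cap was reached; keep what was allocated.
        }
        sleepQuietly(holdMillis);
        // Reading the list after the hold keeps the allocations reachable while blocking.
        return (long) hog.size() * chunk;
    }

    /** EXCEPTION: throw an unchecked exception to exercise handling, logging, and reporting. */
    static void exception() {
        throw new IllegalStateException("Simulated failure (EXCEPTION trouble)");
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

For example, `TroubleSketch.load(10, 2000)` would occupy ten threads for about two seconds each. Passing `Long.MAX_VALUE` as `maxBytes` removes the cap, so allocation continues until the JVM reports the heap as exhausted.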

Open Source Project

Trouble Maker is an open source project hosted on GitHub: https://github.com/in-the-keyhole/khs-trouble-maker.

A Spring Boot auto-configuration starter is also available as open source on GitHub: https://github.com/in-the-keyhole/khs-spring-boot-troublemaker-starter.

Please feel free to make suggestions or submit pull requests. Our goal is for this to help organizations that are adopting Microservices build stable and durable platforms.

— The Keyhole Labs Team
