Event-Driven Integration - Testing Indeterminate Systems
This is a super fun topic and was the bane of my existence for a couple of years, actually. Event-driven systems are becoming more and more popular as a way to build integrated systems, and for good reason. The idea in an event-driven system is that one component of the system fires an ‘event’ that is logged in a queue. At that point, any other integrated system that is ‘watching’ that queue can grab the event and ‘do something’ with it. In this way all of your systems are nicely encapsulated: they work off the queue with no direct integration between them.
This idea is often combined with microservices, where your system is broken down into really tiny components that do a very specific task and each component stays in sync via the queue.
One way to think about this is as a GUI system with event handlers. When a User updates their Address, an Address Change Event is raised and the Payment system may catch that and update the user's billing address. When a User checks out, a Purchase Event is raised and the Inventory Management system may catch it, which itself may trigger a Low Inventory Event, which is in turn picked up by the Supply Management system to order more stock. All the systems know exactly what they need to know exactly when they need to know it, and they use the queue to talk indirectly.
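To make the cascade concrete, here's a minimal in-memory sketch of that Purchase → Low Inventory chain. Everything here is illustrative - the event names, the handler logic, and the in-process "bus" are stand-ins for whatever broker and systems you'd actually use:

```python
# Toy event bus: producers publish events, subscribers react independently.
# One handler's reaction can itself publish a new event (the cascade).
from collections import defaultdict

handlers = defaultdict(list)   # event name -> list of handler callbacks
log = []                       # record of every event that hit the "queue"

def subscribe(event, handler):
    handlers[event].append(handler)

def publish(event, payload):
    # The producer doesn't know or care who is listening.
    log.append(event)
    for handler in handlers[event]:
        handler(payload)

inventory = {"widget": 1}

def on_purchase(payload):
    # Inventory Management consumes the Purchase Event...
    inventory[payload["sku"]] -= 1
    if inventory[payload["sku"]] == 0:
        # ...and may itself raise a Low Inventory Event.
        publish("low_inventory", {"sku": payload["sku"]})

def on_low_inventory(payload):
    # Supply Management restocks (quantity is arbitrary here).
    inventory[payload["sku"]] += 10

subscribe("purchase", on_purchase)
subscribe("low_inventory", on_low_inventory)
publish("purchase", {"sku": "widget"})
```

After the single `publish("purchase", ...)`, the log shows both events fired and the stock was replenished - no system ever called another directly.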
Alternatively, try this. In a traditional integration, where systems are explicitly talking to each other via APIs or even database queries, those systems are having a close, intimate, quiet conversation - sharing ideas in detail. Event-driven integration is like using a megaphone in a crowded room - you don't care who hears (in fact, you want everyone to hear); you just want to broadcast your message to everyone within range, and what they do with that information is completely up to them (think social media).
This kind of system is extremely hard to test: the complexity is effectively infinite.
Understanding Indeterminate Systems
We like to think that, given enough time, we can test every corner of our software. This is, of course, not true in even the simplest of software. But we like to believe that it is. We want to believe that we can handwave certain things ‘invisible’: we only officially support Chrome, so we don't need to test other browsers. We only officially support the Windows 8.1 August 2017 release, so we don't need to test other OSs. We clearly state you need exactly 4GB of Corsair RAM to run our software, so we don't need to test other configurations. Etc., etc. And if we just ignore those silly little things, then we can test everything! (We can't, but we tend to ignore those things and redefine ‘complete testing’ to suit our whims - and our egos.)
What frustrates us about event-driven systems is that the impossibility of testing every corner is glaringly obvious. It’s baked into the design of the system itself, in bold print in the requirements documents:
Components will be able to be replaced without affecting the functionality of other components.
You may currently have System A talking to System B, but the entire purpose of the design is that you can replace System B at any time with System B2 or System BZ87. It’s right there in the definition. In order to test every input, every output, every message possible, you’d need to conceive of every possible system that could ever be integrated.
This is of course impossible. Even more than impossible, it’s part of the definition. We have to finally acknowledge that there are things we can't test and we have to embrace better test design.
The initial thought in testing here is usually to do ‘round trip’ or end-to-end testing. That is, you trigger an event in System A, wait for System B to read it, then verify that System B did what it was supposed to do.
This sounds simple until you think about what it actually means. You’d need to follow an entire workflow through System A, read the message in the queue, wait for System B to read it, then follow a second, totally unrelated workflow through System B. This design is indeterminate: to cover it, you’d need tests for every integrated system imaginable. Further, even doing this for a few test cases creates a massive amount of overhead - these tests will be long, egregiously complicated, and exceptionally fragile.
If, on the other hand, you can say with a straight face, "well, System A will only really ever be integrated with System B and System C..." then I have to question why you're choosing an event-driven architecture. Usually the response is, "in case we want to scale in the future!" to which I say checkmate - your tests should scale right along with your systems. If your system is designed to scale to be indeterminate, so too should your tests.
Creating a Determinate System by Design
The point of using event-driven integration is encapsulation - each system is independent of the others. So why would you then tightly couple your tests? This is where team discipline comes into play: you need to encapsulate your services. Each team has core responsibility for a specific subset of services (preferably one or two), and the testers embedded on those teams (both automated and manual) have responsibility for testing those specific systems.
Essentially, know your boundaries. As part of the system design, System A needs a well-defined set of responsibilities that do not overlap with any other system, along with well-defined, documented APIs and data contracts for its messages. If System B wants to consume messages that System A produces, it simply needs to conform to System A’s data contracts. In turn, if System A wants to consume messages from System B, it simply needs to conform to System B’s data contracts.
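What does a "documented data contract" look like in practice? Here's one hedged sketch: a required-fields-and-types table plus a conformance check. The event name and fields are hypothetical examples, not anything mandated by a real contract format:

```python
# Hypothetical documented contract for an "address_changed" event:
# each required field, and the type consumers are promised.
REQUIRED_FIELDS = {"event": str, "user_id": int, "new_address": str}

def conforms(message: dict) -> bool:
    """True if every required field is present with the documented type."""
    return all(
        field in message and isinstance(message[field], expected)
        for field, expected in REQUIRED_FIELDS.items()
    )

good = {"event": "address_changed", "user_id": 42, "new_address": "1 Main St"}
bad = {"event": "address_changed", "user_id": "42"}  # wrong type, missing field
```

A consumer team never needs to know how System A works internally - only that messages matching this table will show up on the queue. (In a real project you'd likely reach for a schema language such as JSON Schema rather than hand-rolling this.)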
The queue is the delineation point between the systems. Thus, as QA, we only need to test two things (technically 4 if you want to do some negative testing as well):
- Do our system’s outgoing messages match the intended data contracts?
- Accomplished by defining what actions trigger what messages and figuring out how to mock that in test. Once the message is in the queue, have your tests consume it and verify it. This can be tricky; some messages will be very hard to force your system to produce.
- Does our system accept and properly process messages that match the intended data contracts?
- Accomplished by injecting a message into the queue and verifying your system consumes it and does whatever it’s supposed to do.
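The two tests above can be sketched against an in-memory queue standing in for the real broker. The `produce_event` and `consume_event` functions are placeholders for however your actual system emits and handles messages, and the contract itself is invented for illustration:

```python
import queue

# Required fields per the (hypothetical) documented contract.
CONTRACT = {"event", "user_id"}

def produce_event(q):
    # Stand-in for "force your system to produce a message".
    q.put({"event": "purchase", "user_id": 7})

def consume_event(message, state):
    # Stand-in for "your system consumes the message and does its job".
    if message["event"] == "purchase":
        state["orders"] = state.get("orders", 0) + 1

# Test 1: do outgoing messages match the contract?
q = queue.Queue()
produce_event(q)
outgoing = q.get_nowait()
assert CONTRACT <= outgoing.keys()

# Test 2: does a contract-conforming message get processed correctly?
state = {}
consume_event({"event": "purchase", "user_id": 7}, state)
assert state["orders"] == 1
```

The negative variants would follow the same shape: assert a malformed outgoing message never appears, and assert a non-conforming injected message is rejected or ignored rather than half-processed.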
Now we’ve reduced our test cases significantly. Testing this can still be complicated, but the scope is finite. You still need to figure out how to get your system to send a message to the queue, and then read the resulting message. The same in reverse: you need to inject a message into the queue and verify that your system acts accordingly. These are difficult tasks with varied solutions, but they are finite, they are determinate, and they are testable to within a degree of confidence.
Testing to the Contract
I believe that Test Automation, and QA in general, is a natural result of a high-functioning team. If you have proper processes, your testing should be incredibly clear. If you’re unsure how to test your system, in most cases you should look at where the process went wrong, not at the development work. Event-driven systems will magnify these deficiencies and make them glaringly obvious. Your teams and your project managers need the utmost discipline to properly define the system and the team structure before the project begins. If you don't, you’ll end up with a mess.
For any integration, documentation is key. This is doubly so in event-driven integration. The product is the sum of the parts: each individual system is developed by an independent team, but that system cannot stand alone; it must work in concert with dozens or hundreds of other systems, each equally developed by an independent team. Proper documentation of each system's data contracts - what data it will accept and in what form - and each system's event triggers - what events it sends and consumes and for what reasons - is paramount to ensuring system stability.
If your data contracts are properly documented and you have fluid cross-team communication where needed, then you can confidently test just the data contracts of each system without worrying about round-trip or end-to-end testing (at least in AQA; manual QA may want to run some round-trip scenarios). If you do encounter a problem where the systems pass their individual tests but fail when put into production, the bug is easy to identify: the data contract was incorrect, and the documentation was wrong.