Performance Testing: Benchmarking Your Product

Performance tests test the performance of your system (circular definitions are rarely helpful). They can tell you many things about your system: how fast does your system run? Did your system get slower over time or with the latest release? What areas of your system run poorly?

This information is extremely valuable for any system. If your system is customer-facing and a page takes too long to load, people won't use it. If your system is a back-end service or an internal library and its computations take too long, the problem is magnified in whatever system relies on it. If certain subsystems eat up resources, that can cost you in hosting charges or additional equipment.

In short, poor performance can cost you money and clients. Technology has advanced to the point that a working product is no longer good enough: it has to be fast too.

The concept of performance testing is fairly simple: you build out a plan, say, go to the homepage, search for an item, add the first result to the cart, then proceed to checkout. Then you tell the performance tool to run that scenario ten thousand times - simultaneously - and monitor the response times of the server.
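As a rough illustration of that idea, here is a minimal Python sketch. It is not how any particular performance tool works: the thread pool stands in for the tool, the scenario steps are simulated with a short sleep instead of real HTTP requests, and the names run_scenario, run_load and summarize are made up for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_scenario(user_id):
    """Run one simulated user journey and return its duration in seconds."""
    start = time.perf_counter()
    # Hypothetical steps; a real test would issue requests to the server here.
    for _step in ("homepage", "search", "add_to_cart", "checkout"):
        time.sleep(0.001)  # stand-in for waiting on the server's response
    return time.perf_counter() - start


def run_load(users):
    """Run the scenario for `users` simulated users at the same time."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        return list(pool.map(run_scenario, range(users)))


def summarize(durations):
    """Reduce the raw timings to the numbers you would actually monitor."""
    ordered = sorted(durations)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": sum(ordered) / len(ordered),
    }


if __name__ == "__main__":
    print(summarize(run_load(users=50)))
```

A dedicated tool adds ramp-up control, reporting and distributed execution on top of this, but the core loop is the same: run the plan many times at once and record how long each run took.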

Performance testing is something that simply cannot be done without automation. It’s one of the main areas where AQA can increase the quality of your product in a way that manual testing simply cannot.

Aside: Big O Notation

Feel free to skip this section if you want, but Big O notation comes up frequently when talking about Performance Testing, so a brief primer is helpful. A simple Google search will turn up many great articles that explain it in far better detail than I’m going to attempt. Consider the graph at the top of the post.

This shows the amount of time needed to complete a computation vs the amount of items being processed. The common Big O notations are as follows:

  • O(1). The computation always takes the same amount of time.
  • O(log N). The computation takes slightly longer with more items, but the growth flattens out as the item count rises. This is the ideal case that most systems strive for.
  • O(N). The computation increases in time in direct relation to the number of items. This is the common case with most systems.
  • O(N^2). The computation time increases faster than the number of items added: it grows with the square of the item count, so doubling the items roughly quadruples the time. This is a poor performing system that failed to make performance a priority.
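To make the four growth rates concrete, the sketch below counts steps rather than measuring time, so the numbers are the same on any machine. The functions are toy stand-ins written for this example, not real algorithms from any library.

```python
def constant_steps(items):
    """O(1): the same single step no matter how many items there are."""
    return 1


def logarithmic_steps(items):
    """O(log N): halve the remaining range until one item is left."""
    steps, remaining = 0, len(items)
    while remaining > 1:
        remaining //= 2
        steps += 1
    return steps


def linear_steps(items):
    """O(N): visit every item once."""
    return sum(1 for _ in items)


def quadratic_steps(items):
    """O(N^2): pair every item with every item."""
    return sum(1 for _ in items for _ in items)


for n in (10, 100, 1000):
    data = list(range(n))
    print(n, constant_steps(data), logarithmic_steps(data),
          linear_steps(data), quadratic_steps(data))
```

Going from 10 to 1000 items, the logarithmic count rises from 3 to 9 while the quadratic count rises from 100 to 1,000,000.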

Big O Notation is a rating that shows certain algorithms perform differently depending on the number of items they are given. This is key to understand when talking about Performance Testing: testing with just one user isn't enough to say you have a high-performing system, you must test your system under load.

Lexicon: Performance, Load, Stress Testing, Oh My

As noted in my Testing Taxonomy post, there are lots of names thrown around for lots of different kinds of testing. Different terms mean different things to different teams and not everyone is speaking the same language. Performance testing is a good case study in this.

Technically speaking, there are three types of Performance Tests:

  • Performance Testing. This tests how fast the system is capable of performing a task. Click a link, time how long it takes the next page to load. Call a function, time how long until you get a response.
  • Load Testing. This tests how fast the system responds under load. Load here meaning ‘multiple people using the product concurrently’. Similar to Performance Testing above, but you have lots of ‘users’ (or more specifically sessions or threads) active and doing the same thing at the same time. Some functions or servers will work super fast with just one user, but scale at a rate of O(2^N).
  • Stress Testing. This tests the absolute bounds of your system. These tests literally crash your system; that is what they are designed to do. They continually spin up more and more concurrent users until the system crashes. They tell you the maximum number of users your system can handle.

If you’ll notice, while these are technically three separate types of tests that will give you different results, the only real difference is the number of concurrent users. Thus, a well-designed performance test can be scaled to a load test which can be scaled to a stress test simply by increasing the number of users.
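A small sketch of that scaling idea, under loud assumptions: the server is replaced by a made-up latency formula (simulated_response_ms), and the 5000 ms timeout and the step size are arbitrary. The point is only that one test definition serves all three purposes when the user count is a parameter.

```python
from typing import Optional


def simulated_response_ms(users: int) -> float:
    """Hypothetical server model: latency grows with the square of concurrent users."""
    base_ms = 120.0
    return base_ms + 0.002 * users ** 2


def run_suite(users: int, timeout_ms: float = 5000.0) -> dict:
    """One test definition; only the user count changes between test types."""
    latency = simulated_response_ms(users)
    return {"users": users, "latency_ms": latency, "ok": latency <= timeout_ms}


def find_breaking_point(start: int = 100, step: int = 100,
                        limit: int = 10_000) -> Optional[int]:
    """Stress test: keep adding users until the suite fails; return that user count."""
    for users in range(start, limit + 1, step):
        if not run_suite(users)["ok"]:
            return users
    return None  # the system survived every load level we tried


performance = run_suite(users=1)         # performance test: a single user
load = run_suite(users=500)              # load test: many concurrent users
breaking_point = find_breaking_point()   # stress test: ramp up until failure
print(performance, load, breaking_point)
```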

Thus, I’ve taken to simply calling them Performance Tests as an umbrella term. I found this decreases the frustration my teams feel from nitpicking lexicon all the time, especially with managers. As long as everyone is aware that Test Suite A uses 1 user and Test Suite B uses 500, and that those differences produce different results that, taken together, mark our system's performance - that matters more than what you call it.

Tools: jMeter

jMeter is the workhorse of the performance testing world. It’s a Java-based open-source testing tool from Apache. It doesn't look like much, and the Java Swing UI hasn't evolved much since the late 90’s, but it is immensely powerful.

There are many, many online tutorials and books to get you running in jMeter and a huge amount of community support. The important thing to note is that there isn't really one best way to do anything: jMeter has a great variety of features (and any feature it doesn't have can be had via plugin), and it’s up to you to understand them and adapt them to your specific needs.

It’s a bulky and cumbersome tool that takes a lot of getting used to. You’ll need to invest a fair amount of time in research and training to really get your tests running in a maintainable fashion (hint: start small and simple, and really refine those before moving to more complicated scenarios). I wouldn't typically recommend such a tool, except for the fact that it works and works well.

Performing at Every Level

Performance testing can be done (and arguably should be done) at all levels of your product: from database to front-end. Personally, as with all AQA, I like to start at the API layer. Just as with regular API tests, performance tests for APIs are quick and simple.

The important thing to understand is that the more layers you apply your tests to, the better your results. If you only test the front-end, you’ll know that a certain page is slow to load but you won't know what’s causing it. If you have tests at all layers, you’ll instead see a spike in the response times of the front-end and API layer, but not the database - that’s a good indication that some poor-performing code was added to the API layer.
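The reasoning above can be sketched in a few lines of Python. The timings are invented, and the sketch assumes each outer layer's measurement includes the time spent in the layers beneath it, so subtracting the inner layer leaves the time spent in the layer itself.

```python
# Hypothetical timings in milliseconds for the same request, before and after a release.
# Each outer layer's time includes the layers beneath it.
before = {"front_end": 300.0, "api": 200.0, "database": 80.0}
after = {"front_end": 700.0, "api": 600.0, "database": 80.0}

# Layers ordered from outermost to innermost.
LAYERS = ["front_end", "api", "database"]


def exclusive_times(timings):
    """Subtract each inner layer's time to get the time spent in the layer itself."""
    result = {}
    for index, layer in enumerate(LAYERS):
        inner = timings[LAYERS[index + 1]] if index + 1 < len(LAYERS) else 0.0
        result[layer] = timings[layer] - inner
    return result


def likely_culprit(old, new):
    """Return the layer whose own time grew the most between two runs."""
    old_own, new_own = exclusive_times(old), exclusive_times(new)
    return max(LAYERS, key=lambda layer: new_own[layer] - old_own[layer])


print(likely_culprit(before, after))  # prints "api"
```

Both the front-end and API totals spiked, but once the inner layers are subtracted only the API's own time has grown.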

Benchmarks: Interpreting Your Results

Of course the question is how fast is fast enough? Is a response time of 300 milliseconds sufficient? How about 500, is that acceptable? The answer is it depends on your product. Unlike traditional tests that have readily identifiable pass or fail conditions, Performance Tests do not. Performance Tests will output a huge spreadsheet of data with many rows and columns of response times - it’s up to you to figure out if those numbers are good or bad. And for that, you need benchmarks.

Benchmarks are simply markers: arbitrary lines in the sand. They answer two absolutely key questions in Performance Testing:

  1. “How fast are we?”
  2. “How fast do we need to be?”

There are two types of benchmarks: external and internal. Neither of them alone is sufficient to answer the above questions.

External benchmarks come from outside your company. They are things like user studies on how long people will wait for a page to load (5 seconds, by the way), or looking at other competing products to gather heuristics. External benchmarks show you how you stand in the market as a whole.

Internal benchmarks are set by your company and come from your own users or testing existing software. If users complain a certain feature is slow, run a Performance Test and see the response time. That is now your benchmark - that number is too slow.

Benchmarks are incredibly important - without them you cannot interpret the results of your tests. Benchmarks can serve as metrics that drive development efforts. Ignoring benchmarks - or worse, not having them - can cause you to make poor decisions about where you spend your development efforts - including causing you to believe a certain component is ‘done’ only to have to revisit it when clients complain of 45 second load times. But it’s equally important to remember that they are arbitrary lines in the sand that can easily get washed away as the tides change - and tides change frequently.
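Here is one way the comparison might look in code. The sample response times and the 800 ms internal figure are invented; the 5000 ms external figure is the 5-second wait mentioned above. Using the 95th percentile rather than the average is my own assumption for this sketch, not a rule.

```python
import math

# Hypothetical benchmarks in milliseconds.
EXTERNAL_MS = 5000.0  # how long people will wait for a page to load
INTERNAL_MS = 800.0   # a response time our own users complained about


def percentile(samples, pct):
    """Nearest-rank percentile of a list of response times."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate(samples):
    """Turn a column of response times into a verdict against both benchmarks."""
    p95 = percentile(samples, 95)
    return {
        "p95_ms": p95,
        "meets_external": p95 <= EXTERNAL_MS,
        # The internal number is known to be too slow, so we must beat it.
        "meets_internal": p95 < INTERNAL_MS,
    }


response_times_ms = [210, 250, 230, 900, 240, 260, 220, 245, 235, 255]
print(evaluate(response_times_ms))
```

With these invented numbers the run passes the external benchmark but fails the internal one, which is exactly why neither benchmark alone is enough.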

Performance Tuning: Using Your Benchmarks

Fun fact: in older versions of the Java source code (Java 6 and earlier), Arrays.sort used a tuned quicksort for primitives - unless the array contained fewer than 7 items, in which case an insertion sort was used instead. Why? Someone figured out that insertion sort was faster at that arbitrary size.
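The general trick, switching algorithms below a size cutoff, can be sketched as follows. This is not the Java implementation: as an assumption for the example it pairs a merge sort for large inputs with an insertion sort for small ones, and the cutoff of 7 is purely illustrative. Finding the right cutoff for your own code is a job for a performance test.

```python
CUTOFF = 7  # illustrative threshold; the right value comes from measuring


def insertion_sort(items):
    """Simple sort with low overhead; fine for a handful of items."""
    result = list(items)
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]  # shift larger items one slot right
            j -= 1
        result[j + 1] = current
    return result


def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def hybrid_sort(items):
    """Merge sort that hands small lists to insertion sort."""
    if len(items) < CUTOFF:
        return insertion_sort(items)
    middle = len(items) // 2
    return merge(hybrid_sort(items[:middle]), hybrid_sort(items[middle:]))


print(hybrid_sort([9, 4, 7, 1, 8, 2, 6, 3, 5, 0]))
```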

This is performance tuning: You build the application, run performance tests first to get a baseline, and then look for any bottlenecks - operations that run noticeably slower than others - and try to improve those. Over time you’ll develop a baseline of how fast your application can run (an important benchmark) and you’ll use that as the metric for future features.

Tuning also helps you deal with spikes - suddenly this API went from 200ms to 900ms, why? Was that expected? Was a new feature added that added complexity? If so, is that feature worth the degradation? Is there anything we can do to improve it?
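Spotting those spikes is easy to automate once you have a baseline. In the sketch below the endpoints and timings are invented, and the 20% tolerance is an arbitrary assumption; pick whatever margin fits your product.

```python
# Hypothetical response times in milliseconds per endpoint.
baseline_ms = {"/search": 200.0, "/cart": 150.0, "/checkout": 400.0}
latest_ms = {"/search": 900.0, "/cart": 155.0, "/checkout": 410.0}


def find_spikes(baseline, latest, tolerance=0.20):
    """Return endpoints whose latency grew by more than `tolerance` (a fraction),
    mapped to the growth as a percentage."""
    spikes = {}
    for endpoint, old in baseline.items():
        new = latest.get(endpoint)
        if new is None:
            continue  # endpoint was not measured in the latest run
        growth = (new - old) / old
        if growth > tolerance:
            spikes[endpoint] = round(growth * 100, 1)
    return spikes


print(find_spikes(baseline_ms, latest_ms))  # {'/search': 350.0}
```

The script only tells you that something spiked; the questions above about whether it was expected and whether it is worth it still need a human.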

Out-Performing: Better Tests for Better Performance

Performance testing is highly subjective - which means it can take some creativity to come up with well-designed test cases. While most AQA has a hard line of ‘pass’ or ‘fail’, Performance Testing does not. On top of that, you’ll get very different results under different circumstances. Realistically, this is true of all testing, as no two environments are the same (the age-old “but it works fine on my computer!”), but for functional tests the ideal is that the results shouldn't differ. Performance Tests, in contrast, will by their very nature have different results in different environments.

Varying your tests between performance, load, and stress can show you the breakpoints of your product. Perhaps your product performs fine with one user, but not under load. Or perhaps something that seems inefficient for one user handles loads remarkably well (a legitimate trade off you may need to make).

Simultaneous users are not the only factor to consider when performance testing. Larger systems will probably also have larger data sets. This will strain your search and sort algorithms - suddenly that section of code that worked just fine with a triple-nested for-each loop on small data sets is causing O(N^3) performance. Attempting a filtered search for in-stock jeans sorted by brand name is going to be different than a search for all Kitchenware products in the warehouse sorted by brand and sub-sorted by name. Don't neglect these differences as part of your testing.

If you support more than one browser, or operating system, or hosting service, make sure to test combinations of these. Perhaps not all combinations, but a reasonable sub-set.
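One simple way to pick such a sub-set is to make sure every value shows up at least once. The sketch below uses placeholder names for the browsers, operating systems and hosts; each_value_once is a made-up helper and only one of many possible selection strategies.

```python
from itertools import product

# Placeholder environment values; substitute the ones you actually support.
browsers = ["browser_a", "browser_b", "browser_c"]
systems = ["os_a", "os_b", "os_c"]
hosts = ["host_a", "host_b"]


def all_combinations(*dimensions):
    """Every possible combination: thorough, but grows quickly."""
    return list(product(*dimensions))


def each_value_once(*dimensions):
    """A small sub-set in which every value appears at least once.
    Shorter dimensions wrap around to fill the remaining rows."""
    longest = max(len(d) for d in dimensions)
    return [tuple(d[i % len(d)] for d in dimensions) for i in range(longest)]


print(len(all_combinations(browsers, systems, hosts)))  # 18
for combo in each_value_once(browsers, systems, hosts):
    print(combo)
```

Here 18 possible combinations shrink to 3 runs. The trade-off is that interactions between specific pairs go untested, so add back any combinations you know to be risky.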

Performance testing will require a bit more Exploratory Testing - and a manual QA skillset - than your typical AQA tests to tease out these different scenarios.

