Data Generators: The Wise Fool
In the US, second-year college and university students are referred to as ‘sophomores.’ In the original Greek, this translates to ‘wise fool.’ The moniker comes from the idea that second-year students have often learned just enough to be dangerous: first-year 100-level classes often introduce you to many big-idea concepts without any of the refinement, nuance, or context that is meant to come in later, higher-level classes.
Data generators fall into this same category. Data generators sound like a good idea on their face: you build a database, pre-load it with test data such as Person Names, then in your tests rather than coding specific Person Names you simply make a call to the database to get a random Person Name.
Data generators are wise: they make good use of metadata. Your tests are nicely decoupled from your data. You can add or remove data from the data generator without altering your tests, and vice versa, and the data generator (if built well) is smart enough to surface the perfect data sets.
Data generators are also fools: they assume that you can decouple your tests and your data. Your tests ARE your data. Your product’s entire purpose is taking data from Point A, transforming it, and moving it to Point B. Your test steps are the series of actions that need to be taken to ensure that process moves forward. Altering the data IS altering the tests: they are inherently linked.
Data generators aren’t bad. They mean well, but they can lead you astray, and Dory (pictured up top) in particular has a rather short memory. When you’re running a test case, you want repeatability and readability. If you’re using a data generator, you can’t guarantee either.
Readability is key to reliable test cases. Your code needs to be simple and easy to understand. Data generators make your code less readable by making it difficult to trace specific scenarios.
This is a fine line to walk, as I often advocate for separating your data from your tests. But we need to be careful about how far that separation goes. I often use JSON files inside the test project itself as my ‘external’ dataset as part of ACS (page-object model); a database would also serve this purpose. But it needs to be specific: you should be able to find each test’s data separate from all other test data (a single file or table per test). With a data generator, you usually re-use large tables of things like ‘valid usernames’ or ‘invalid emails’ or, worse, a generic ‘user’ table which could contain any number of datasets.
This problem is compounded by typically having a middle-layer translation. Your tests pass flags to a middle-man class that then actually handles the data generation/database queries. This only further obfuscates the tests.
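To make the "single file per test" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `load_test_data` helper, the file name, and the fields are made up for illustration, and a temporary folder stands in for a data folder checked into the test project.

```python
import json
import tempfile
from pathlib import Path


def load_test_data(data_dir: Path, test_name: str) -> dict:
    """Load the one JSON file that belongs to the named test (hypothetical helper)."""
    with open(data_dir / f"{test_name}.json", encoding="utf-8") as handle:
        return json.load(handle)


def demo() -> dict:
    # A temporary folder stands in for a data folder inside the test project.
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        # One file per test: the file name matches the test name,
        # so the data for a scenario is trivial to find and read.
        (data_dir / "test_signup_rejects_bad_email.json").write_text(
            json.dumps({"username": "jsmith", "email": "not-an-email"}),
            encoding="utf-8",
        )
        return load_test_data(data_dir, "test_signup_rejects_bad_email")


if __name__ == "__main__":
    data = demo()
    print(data["username"], data["email"])
```

Compare that with a test that asks a middle-man class for "an invalid email": to learn which value was actually used, you have to read the middle layer and then the shared table behind it.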
Suppose you run a test case with random data and it fails. You run it ten more times and it passes. Why did it fail the first time? Was it a fluke? Was it some temporary error? Or was there some piece of data used on that particular run that caused the failure? The only way to know for sure is to re-run the test, only this time hard-coding all the exact data that was generated. Now you’re defeating the purpose of using random data and creating more work for yourself.
Properly defined test cases should have specific data sets that test for specific scenarios. You can argue that having some “wildcard” tests that use randomized data can help uncover bugs you never thought to look for, and you wouldn’t be wrong. But be careful about deploying them on a large scale: keep them as specific, dedicated wildcard tests.
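The scenario above can be sketched in a few lines of Python. This is an illustration only: `NAME_POOL`, `format_greeting`, and its apostrophe bug are invented for the example and do not come from any real product or data generator.

```python
import random
from typing import Tuple

# Hypothetical pool of names a data generator might draw from.
NAME_POOL = ["Alice", "Bob", "O'Brien", "Chen", "Dana"]


def format_greeting(name: str) -> str:
    """Hypothetical code under test, with a bug: it rejects apostrophes."""
    if "'" in name:
        raise ValueError(f"unsupported character in name: {name!r}")
    return f"Hello, {name}!"


def run_generated_test(rng: random.Random) -> Tuple[bool, str]:
    """Generator-style test: the input is different on every run."""
    name = rng.choice(NAME_POOL)
    try:
        format_greeting(name)
        return True, name
    except ValueError:
        return False, name


def run_pinned_test() -> bool:
    """Same scenario with the data pinned: it fails on every run until the bug is fixed."""
    try:
        format_greeting("O'Brien")
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    rng = random.Random()
    # Eleven runs, as in the story above: pass or fail depends on the draw.
    for passed, name in (run_generated_test(rng) for _ in range(11)):
        print("PASS" if passed else "FAIL", name)
    print("pinned test passed:", run_pinned_test())
```

The generated test passes or fails depending on which name it happens to draw, while the pinned test reports the same result every time and names its scenario in the code itself.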