AMA: Test Data Infrastructure

Anonymous asks…

Have you set up (inexpensive) infrastructure to store data collected in your automated tests? We are currently using the Selenium Java WebDriver to automate our tests and IntelliJ as our IDE. We create data from scratch for each and every test case :(

My response…

I’m a little confused by the question and whether it’s about test data (data that is needed by the automated tests) or test results data (insights into the results of our automated tests). So I’ll answer both 😀

Infrastructure to manage test data

Our tests run on specific test accounts and sites against production databases. Since our tests are end-to-end in nature, we try to give them as few dependencies as possible on existing data. Often an end-to-end scenario will involve creating, viewing, editing and deleting something. If we don’t do all of this via our UI, we can use hooks that call either services or database jobs to clean up the data. I explained this in more detail previously.
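The clean-up hooks can be sketched as a small tracker: tests register whatever they create, and an after-hook deletes it all. This is only a sketch with invented names; the real hooks would call services or database jobs.

```ruby
# Hypothetical helper: each test registers anything it creates, and an
# after-test hook deletes it all, so no test depends on leftover data.
class TestDataTracker
  def initialize
    @created = []               # [type, id] pairs created during a test
  end

  # Register anything a test creates via the UI or a service call.
  def register(type, id)
    @created << [type, id]
  end

  # Run from an after-test hook: delete in reverse creation order so
  # dependent records (e.g. comments on a post) go first.
  def clean_up!(&delete_fn)
    @created.reverse_each { |type, id| delete_fn.call(type, id) }
    @created.clear
  end
end

tracker = TestDataTracker.new
tracker.register(:post, 101)
tracker.register(:comment, 202)

deleted = []
tracker.clean_up! { |type, id| deleted << [type, id] }
```

In a real suite the block passed to `clean_up!` would call a deletion service or a database job rather than append to an array.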

Infrastructure to manage test results data

We use CircleCI for automated end-to-end tests. We have a number of projects that run different types of end-to-end tests from the same code repository for different purposes (canary tests, visual-diff tests, full regression tests for example).

We generate x-unit test results (from Mocha/Magellan), which CircleCI uses to provide insights into our test results: you can drill down into the slowest tests, the most failed tests and so on.
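For reference, x-unit results are just an XML file per run in roughly this shape (an illustrative fragment, not our actual output); CircleCI reads the timing and failure data from it:

```xml
<testsuite name="canary" tests="2" failures="1" time="34.2">
  <testcase classname="sign-up" name="can create a new site" time="21.7"/>
  <testcase classname="sign-up" name="can pick a theme" time="12.5">
    <failure message="element not visible">stack trace here</failure>
  </testcase>
</testsuite>
```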

Since all our tests are open source, you can view these build insights yourself!

We’re pretty happy with the insights we get from CircleCI at the moment, so we don’t currently see a need to develop anything ourselves.

AMA: product APIs for test automation

Michael Karlovich asks…

What’s your design approach for incorporating internal product APIs into test automation? I don’t mean in order to explicitly test them, but more for leveraging them to stage data and set application states.

My response…

As explained previously, in my current role at Automattic I primarily work on end-to-end automated tests. These tests run against live data (production) no matter where our UI client (Calypso) is running (for example, on localhost), so we don’t use APIs for staging data or setting application state.

In previous roles we utilised a REST API to create dynamic data for an internally used web application which we found useful/necessary for repeatable UI tests.
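That approach can be sketched like this. The endpoint, payload and helper name are all invented for illustration; a before-test hook would build and send a request that creates exactly the data the test needs.

```ruby
require 'json'
require 'net/http'

# Hypothetical helper: build a request that creates a fresh user for a
# test. The endpoint and fields are made up; a real API will differ.
def build_create_user_request(base_url, user)
  uri = URI("#{base_url}/api/v1/users")
  req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
  req.body = JSON.generate(user)
  [uri, req]
end

uri, req = build_create_user_request('https://test-env.example.com',
                                     name: 'e2e-user', role: 'customer')
# A before-test hook would then send it:
#   Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
```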

We also utilised test controllers to set web application state for a public website. These test controllers were very handy: they allowed you to visit a special URL which would set up an order with products in your session and instantly display the checkout page, which would typically take eight steps to reach from the start of the process.

This saved us lots of time and made our tests more deterministic, as we could avoid the eight or so ‘setup’ steps and use a single URL to access our page.

This approach had a couple of downsides: it could never be deployed to production, and it didn’t test the realistic user flow that includes those ‘setup’ steps. There were two things we did to mitigate the risks of this approach: firstly, we ensured through config that these test controllers were never deployed to production; secondly, we made sure we had some end-to-end coverage so we were at least testing some real user flows.
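A minimal sketch of that config guard (routing logic and names are hypothetical): the test-controller routes simply don’t exist unless the environment enables them.

```ruby
# Hypothetical routing table: the /test/ controllers are only reachable
# when config enables them, so they can never ship to production.
def route_for(path, config)
  if path.start_with?('/test/') && !config[:test_controllers_enabled]
    return :not_found                 # controller absent outside test envs
  end
  case path
  when '/test/checkout-with-items' then :checkout_page_with_seeded_order
  when '/checkout'                 then :checkout_page
  else :not_found
  end
end

test_env = { test_controllers_enabled: true }
prod_env = { test_controllers_enabled: false }
```

With this shape, a misconfigured production deploy fails safe: the test URL just returns a 404 rather than exposing the state-seeding controller.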

AMA: handling the database

Andy asks…

How are you handling the db in automation suites?

I’m running into issues where the test DB is, by necessity, a rather weighty 900mb, so a simple drop and restore from known backup is hugely time consuming.

“If you automate a mess, you get an automated mess.” -Rod Michael

My response…

In my current role at Automattic I primarily work on end-to-end automated tests. These tests run against live data (production) no matter where our UI client (Calypso) is running (for example, on localhost), so we just make sure our config points to the data we need (test sites) and create other test data within the e2e scenarios.

In previous organisations I have used a scaled-down backup of production that had specific test data ‘seeded’ into it. Our DBAs had a bunch of scripts that would take a backup and cleanse/remove a whole heap of data (for example, archived products and orders), resulting in a small, manageable backup that we could quickly restore into an environment. I found this to be a good approach as it gave us realistic data, and restoring it when necessary (e.g. before a CI test run) wasn’t time consuming.
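The cleanse step can be sketched like this (table names and predicates are invented): a set of rules is turned into the DELETE statements a DBA script would run against the restored copy to shrink it.

```ruby
# Hypothetical cleanse rules: each entry says which rows to throw away
# from which table before the backup is handed to the test environment.
CLEANSE_RULES = {
  'orders'   => "created_at < NOW() - INTERVAL '90 days'",
  'products' => 'archived = true'
}

def cleanse_statements(rules)
  rules.map { |table, predicate| "DELETE FROM #{table} WHERE #{predicate};" }
end

statements = cleanse_statements(CLEANSE_RULES)
```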

I also shared some other data creation techniques in a previous answer.

AMA: managing test data

Cameron asks…

I’m new to test automation. I’m writing Selenium/Protractor tests in C# within the project solution, which allows developers to run all of my UI tests alongside their own unit tests.

The project is all very new, and big chunks aren’t built yet. I’m trying to grow my tests along with the project as each function is fleshed out.

I’m struggling with test data! The BAs have had a tool built for them which allows them to create series of test data in XML and have it all imported. This seems a bit cumbersome for my uses and I’d prefer to seed my test data programmatically. I have mostly figured out how to use the data layer of our application to get stuff in there, but with the amount of test data being created it’s very quickly getting out of hand and very hard to manage.

Should each test case seed its own test data as part of the test run? This would have the benefit that if requirements change, the test will fail, and I can go directly to it and amend the test data to match the new requirements.

Or, should test data be separated out in a central location?

My response…

I answered a similar question to this yesterday, so it might help to read that first.

It’s great to hear you’re writing tests alongside the application code: I have found this leads to better collaboration and increased usefulness and adoption of automated testing.

As per that other post, I find that a combination of seeding test data in a central location (generic enough to be used across many different tests) and programmatically creating/destroying data in test hooks (via scripts or APIs) works quite well. I avoid manually created data as much as possible, as it isn’t easily repeatable.
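Sketching that combination (all names here are invented): generic data lives in a central seed, while hooks create anything test-specific with unique identifiers so runs never collide.

```ruby
require 'securerandom'

# Central, generic seed data shared by many tests (would live in a
# seed script or fixture in practice; the values here are invented).
SEEDED = { company: 'Test Co', admin_email: 'admin@test.example' }

# Per-test data is created in a before-hook with a unique name, so a
# failed run can never pollute the next one.
def unique_test_email(prefix = 'e2e')
  "#{prefix}-#{SecureRandom.uuid}@test.example"
end
```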

Should you use production data or generate test data for testing?

This post is part of the Pride & Paradev series.

Should you use production data or generate test data for testing?

You should generate test data for testing

Generating test data is the only reliable way to accurately run tests repeatedly and consistently knowing that the input test data hasn’t changed.

Some applications rely upon specific data which is either hard to find, or hard to fake. For example, the web application I am working on displays different promotions based upon which day of the week you are using the system, and also changes prices depending on the day of week and time of day.

If you were using production data for testing, you would either have to run tests at specific dates/times to test different promotions/prices, or you would have to change the server date/time to test these. Changing the date/time on the server will affect anyone else using that server, so it should be avoided. It also means that as you run your automated tests continuously against new check-ins, if you don’t use a known set of generated test data, you will get different results depending on the time of day.
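One way around changing the server clock is to make the code under test accept an injectable clock, so a test can pin the day of week itself. This is a sketch of the idea with invented promotion logic; in Ruby the Timecop gem achieves the same thing transparently for `Time.now`.

```ruby
# Hypothetical promotion logic that depends on the day of week.
# Accepting a clock object (defaulting to Time) lets tests control it.
def promotion_for(clock = Time)
  now = clock.now
  now.saturday? || now.sunday? ? :weekend_special : :weekday_deal
end

FakeClock = Struct.new(:now)
saturday = FakeClock.new(Time.new(2024, 6, 1))  # a Saturday
monday   = FakeClock.new(Time.new(2024, 6, 3))  # a Monday
```

With the clock injected, the same suite produces the same results whatever time of day CI happens to run.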

When developing an entirely new feature, there won’t be production data that you can use for testing, so you will need to generate some in this case.

Generating specific test data will often take longer than sourcing production data, but it pays off over time as tests run consistently against a known data set.

You should use production data for testing

When you’re testing a web application, you’re as much testing the data as testing the application behaviour. Using production data will ensure that what you are testing is as close as possible to the actual behaviour once the feature is released to production users.

If you generate test data and use it to test, who is to say that this test data is actually valid? If you generate test data through lower-level means (such as SQL insert scripts), you may introduce data that isn’t representative of production, which may either hide errors in functionality that occur against real production data, or introduce errors in test that won’t actually exist in production. As your database schema evolves, you will also need to keep your data generation scripts up to date so they remain reflective of production at all times.

If you do use production data, you need to be clever about how you source it. Querying the database using SQL scripts is an effective approach, as it will enable you to quickly find real data that you can use to verify a story has been implemented correctly.

It will also allow you to identify outliers and edge cases that can be tested using real production data against the system in development.
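For example (table and column names are invented), a query like this against a production replica surfaces real outliers to test with:

```ruby
# Hypothetical query for finding edge-case orders in a production
# replica: unusually large baskets and zero-total orders.
EDGE_CASE_ORDERS = <<~SQL
  SELECT id, total, item_count
  FROM orders
  WHERE item_count > 50 OR total = 0
  ORDER BY created_at DESC
  LIMIT 20;
SQL
```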

If there are any concerns about using production data for testing, these can be mitigated by obfuscating the data so that real users can’t be identified.
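A sketch of that obfuscation (the field and helper names are invented): replace identifying values with stable but meaningless ones, so records stay internally consistent without exposing real users.

```ruby
require 'digest'

# Map a real email to a stable fake one: the same input always yields
# the same output (so references between records still line up), but
# the original address can't be recovered from it.
def obfuscate_email(email)
  "user-#{Digest::SHA256.hexdigest(email)[0, 8]}@example.test"
end
```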

Three ways to generate test data for your ruby automated tests

I like generating test data that is varied, but still realistic-looking and fun. These are my three favourite ways to generate test data for my Ruby automated tests.


Whenever I need some form of fake data, whether it be names, company names or email addresses etc., I use the brilliant faker gem. This gem makes it super easy to generate random fake data that still looks realistic (unlike a randomly generated word like ‘HSKHJKUWG’). My favourite method is Faker::Company.bs, which, as the name implies, generates some great BS!

require 'faker'

puts Faker::Name.name
# Nathanael Botsford
puts Faker::Company.name
# Labadie, Marvin and Kassulke
puts Faker::Company.catch_phrase
# Self-enabling bottom-line project
puts Faker::Company.bs
# grow B2C platforms


Whenever I need to input a piece of data that needs to be uniquely identifiable, I use the UUID (universally unique identifier) capability built in to Ruby 1.9.3. I prefer this to using the current date/time, as it requires less formatting to make it unique.

Ruby 1.9.3 has this built in; for earlier versions there’s the UUID gem.

require 'securerandom'

puts SecureRandom.uuid
# ffe71bd2-2650-4135-b366-f8da08b4b708


A relative newcomer (it was released last week) is my quoth gem, which generates random Wikiquote quotes. I used this in the Wikimedia example tests to append interesting content to my test user page. You could use it in tests that need to insert blocks of content where you’d like something varied and interesting.

require 'quoth'

puts Quoth.get
# If I have ventured wrongly, very well, life then helps me with its penalty.
# But if I haven't ventured at all, who helps me then? ~ Søren Kierkegaard


I find these three methods useful. What do you use to generate test data? Or do you use hard-coded test data?