Flaky tests are the bugbear of any automated test engineer; as Alister says, “insanity is running the same tests over and over again and getting different results”. Flaky tests cause no end of despair, but perhaps there’s no such thing as a flaky or non-flaky test; perhaps we need to look at this problem through a different lens. We should spend more time building deterministic, testable systems than we spend building resilient, persistent tests. Alister will share some examples of test flakiness hiding real problems in the underlying system, and of how it’s possible to solve test flakiness by building better systems.
I once heard the brilliant suggestion that you should never compare your insides with another person’s outsides, because they’re not the same thing. For example, just because someone seems happy and drives an expensive car, that’s only the outside view of that person; it doesn’t paint the full picture of that person’s insides, which is what you’re really, and dangerously, comparing with your own inner thoughts and feelings.
The same applies when comparing things at your organization to things you’ve heard about other organizations. Countless times, including just this week, I have heard managers and colleagues repeat things like: “Facebook don’t have testers”, “Google has 10,000+ engineers in 40 offices working on trunk” and “Flickr deploys to production 10 times a day so we can too”. These are all examples of comparing our insides to others’ outsides.
Yes, Google may have 10,000+ engineers committing to one branch, but having spoken to people who work at Google, it’s not quite as amazing as it seems. Firstly, the code base is broken down into projects (imagine the checkout time without this). Secondly, each and every change set must be code reviewed and have automated and manual tests performed against it (which can take hours or days) before it is even committed to trunk, let alone considered for a production release.
I didn’t realize it at the time but the keynote at GTAC last year captured this phenomenon perfectly:
Not only is it annoying and unhealthy for staff to constantly hear such comparisons, it’s also dangerous: doing something just because Google/Facebook/Twitter/Flickr does it, without knowing the inner workings of those organizations, will inevitably lead to failure, because you’ll be attempting it without their context and experience.
So next time you’re tempted to drop something you’ve heard at a conference or read in a blog post about how another company does something better than yours, or to use it to justify that your organization can or should do things that way, remember: never compare your organization’s insides with another organization’s outsides.
I was lucky enough to attend the Google Test Automation Conference (GTAC) at Google Kirkland in Washington last week. As usual, it was a very well run conference with an interesting mix of talks and attendees.
Whilst there wasn’t an official theme this year, I personally saw two themes emerge throughout the two days: dealing with flaky tests and running automated tests on real mobile devices.
Flaky Tests
There weren’t many talks that didn’t mention flaky automated tests (known as ‘flakes’) at some point. Whilst there were some suggestions for dealing with flaky tests (like Facebook running new tests x times to see if they fail, classifying them as flaky and assigning them to an owner to fix), there didn’t seem to be many solutions for avoiding the creation of flaky tests in the first place, which I would have liked to see.
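The classification idea Facebook described can be sketched very simply: run a candidate test repeatedly and flag it as flaky if the runs disagree. Here’s a minimal Ruby sketch of that idea; the run count and the block-based test interface are my assumptions for illustration, not Facebook’s actual tooling:

```ruby
# Run a candidate test several times and classify it.
# A test is 'flaky' if it both passes and fails across identical runs.
def classify_test(runs: 10, &test)
  results = Array.new(runs) do
    begin
      test.call
      :pass
    rescue StandardError
      :fail
    end
  end

  case results.uniq.sort
  when [:pass] then :stable   # passed every run
  when [:fail] then :broken   # failed every run (a real bug, not flakiness)
  else :flaky                 # mixed results: non-deterministic
  end
end
```

A test that only fails on, say, every second run would come back as `:flaky`, while a consistently failing test is classified `:broken` and treated as a genuine failure rather than quarantined.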
Real Mobile Devices
The obsession with running mobile automated tests on real devices continued from last year’s conference, with talks about mobile devices as a service. I personally think we’d be better off spending the time and effort making more realistic mobile emulators that we can scale, rather than continuing the real-device obsession.
My key takeaway was that even highly innovative companies like Google, Facebook and Netflix still struggle to balance software quality and velocity. These companies don’t have a strong presence in Australia, and the IT management of smaller companies here often like to say things like “Google does x” or “Facebook does y”. The problem is that they only know these companies from the outside. Ankit Mehta’s slides at the beginning of his keynote captured this perfectly, and hence were my favorite slides of the conference:
Recently our WebDriver tests that run in Chrome via a Windows service all suddenly stopped working, even though we hadn’t made any changes to our tests. It turned out Chrome had automatically updated itself on our WebDriver agents, introducing a Chromium 38 bug that stopped WebDriver working at all (full details here and here). Getting these tests running again has been very painful, mainly because Google doesn’t make standalone installers for previous versions of Chrome publicly available.
If you run any WebDriver tests I highly recommend you lock down your browser versions to stop this happening to you in the future. Here’s how:
Firefox is fantastic in this regard, as Mozilla makes every back version easily accessible, along with a simple way on all platforms to stop automatic upgrades. I tend to lock down to Firefox ESRs (Extended Support Releases), such as versions 24 and 31, which are listed on this comprehensive Wikipedia page.
To stop updates, open Preferences → Advanced → Update and select ‘Never’.
Chrome is a P.I.T.A. both to install as a previous version and to lock down once installed. Google prefers the Chrome web installer, which always installs the latest version; if you want a specific version you need the alternate (offline) installer (the ‘all users’ one if you run Chrome as a Windows service), but Google only provides the latest one. It’s hard, if not impossible, to find older alternate (offline) installers on the web; even oldapps.com can’t host them.
Once you have a version of Chrome on Windows that you want to keep, you need to download a group policy template and disable automatic updates before running Chrome (so it doesn’t update itself before you set the group policy). I won’t go into full details, but you should be able to find everything you need here. Some sites mention using a plugin to stop updates, but this doesn’t work, so you’ll need to go down the group policy path.
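For reference, the group policy ultimately boils down to a registry value read by Google Update. A minimal fragment looks something like the following — but verify the exact key and value names against the policy template you download, as Google may change them:

```reg
Windows Registry Editor Version 5.00

; Default update policy for all Google applications managed by Google Update.
; 0 = updates disabled (per Google's Update group policy template).
[HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Google\Update]
"UpdateDefault"=dword:00000000
```

Setting it via the group policy editor (with the template loaded) is the supported route; the registry fragment is just what that setting writes under the hood.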
Locking down browser versions means you won’t suddenly have to work out why your entire WebDriver test suite has stopped working.
One of the reasons why I, and many others, still refer to Selenium 2 as WebDriver is, I suspect, that it’s googleable. If you try to find information about Selenium 2 through Google, it constantly brings up results for Selenium RC or Selenium IDE, which are vastly different from Selenium 2. When you Google ‘WebDriver’ you know you’re getting Selenium 2, and only Selenium 2.
That’s why googleability is so important. Many open source projects have short, hipsterish names such as grape, gatling, bacon and hoe, which aren’t googleable. Watir is very googleable, as searching for watir returns Watir results, and even watir webdriver does pretty well. This blog’s name is very googleable.
It applies to people too. It should be no surprise that the first thing someone hiring does when looking at a CV is Google the candidate’s name. If the person has a generic name, e.g. Ben Smith, it’s going to be very hard to find them quickly. Since I have a fairly generic last name, Scott, we have purposely chosen interesting first names for our children (Finley, Orson) to increase their googleability (yes, I Google my yet-to-be-born children’s names).
So next time you’re naming an open source project, or baby, think of googleability.
Ari Shamash from Google talked about the persistent issue of non-deterministic (flaky) automated tests and how Google uses hermetic environments to highlight them. This involves creating 5-20 instances of an application and running tests repeatedly to identify inconsistent results.
James Waldrop from Twitter discussed their ongoing quest to eliminate the fail whale through performance testing. He covered production testing techniques: canaries (a small subset of users receiving new functionality), dark traffic (serving users from the existing app while sending a copy of some traffic to the new version and throwing away its responses), and tap compare (comparing the dark-traffic responses against production’s). He then talked about his homegrown performance-testing tool Iago (commonly misread as ‘Lago’ because of the capital I in sans-serif fonts).
Malini Das and David Burns from Mozilla discussed automated testing of the FirefoxOS mobile operating system and how it uses WebDriver extensively to test both the inner context (content) and outer context (chrome) of FirefoxOS. They have a neat pool of Panda Board (headless) devices, which can cause non-deterministic test failures due to hardware failure. One key point was how important volume/soak testing is: people don’t turn off their phones – they expect them to run without ever being rebooted or switched off.
Igor Dorovskikh and Kaustubh Gawande from Expedia discussed Expedia’s approach to test-driven continuous delivery. Interestingly, they use Ruby for their automated integration and acceptance tests even though the programmers write the web application in Java. Having a green build light is critical to them, which means a failed build rolls back automatically after 10 minutes: giving someone 10 minutes to check in a fix. To enable this, they have created a build coach role which is shared amongst the team; even project managers and directors can take on this role to keep the build green. They also said that running mobile web app tests on real devices and emulators (using WebDriver) has been beneficial, as has standard browser user-agent emulation to get around issues with multiple windows for features like Facebook authentication.
David Röthlisberger from YouView demonstrated automated set-top box testing using a video-capture comparison tool that matches against expected images – similar to Sikuli. The images are stored in a library, which must be updated should the application’s look and feel change.
Ken Kania from Google discussed ChromeDriver 2.0 and its advanced support for mobile Chrome browsers.
Simon Stewart from Facebook talked about Android application testing at Facebook. Originally Facebook used WebViews in Android and iOS, which enabled frequent deployment but resulted in a terrible user experience. They have since started developing native applications for each feature. Interestingly, every feature team has responsibility for all platforms — web, mobile web, Android and iOS — which enables feature parity across platforms. Facebook use their own build tool, BUCK, which enables faster builds. Simon also pointed out that engineers are entirely responsible for testing at Facebook: there is no test team, no QA department and no testers employed. Some engineers are passionate about testing, just as others are passionate about databases. Dogfooding is very common amongst engineers, which results in edge cases being discovered before release to production. A highly entertaining talk.
Google really knows how to run a conference. It’s hands-down the smoothest one I’ve attended, from the sign-in process to the schedule being adhered to. They even had stenographers and sign language interpreters.
Oh, and NYC is great. I went to the top of the Empire State Building yesterday: the view to lower Manhattan was amazing.