Etsy’s Journey to Continuous Integration for Mobile Apps
Positive app reviews can greatly help in user conversion and the image of a brand. on the other hand, bad reviews can have dramatic consequences; as Andy Budd puts it: “Mobile apps live and die by their ratings in an App Store”.
The above reviews are actual reviews of the Etsy iOS App. As an Etsy developer, it is sad to read them, but it’s a fact: bugs sometimes sneak through our releases. On the web stack, we use our not so secret weapon of Continuous Delivery as a safety net to quickly address bugs that make it to production. However, releasing mobile apps requires a 3rd party’s approval (the app store) , which takes five days on average; once an app is approved, users decide when to upgrade - so they may be stuck with older versions. Based on our analytics data, we currently have 5 iOS and 10 Android versions currently in use by the public.
Through Continuous Integration (CI), we can detect and fix major defects in the development and validation phase of the project, before they negatively impact user experience: this post explores Etsy’s journey to implementing our CI pipeline for our android and iOS applications.
“Every commit should build the mainline on an integration machine”
This fundamental CI principle is the first step to detecting defects as soon as they are introduced: failing to compile. Building your app in an IDE does not count as Continuous Integration. Thankfully, both iOS and Android are command line friendly: building a release of the iOS app is as simple as running:
xcodebuild -scheme "Etsy" archive
Provisioning integration machines
Integration machines are separate from developer machines - they provide a stable, controlled, reproducible environment for builds and tests. Ensuring that all the integration machines are identical is critical - using a provisioning framework to manage all the dependencies is a good solution to ensure uniformity and scalability.
At Etsy, we are pretty fond of Chef to manage our infrastructure - we naturally turned to it to provision our growing Mac Mini fleet. Equipped with the homebrew cookbook for installing packages and rbenv cookbook for managing the ruby environment in a relatively sane way, our sysops wizard Jon Cowie sprinkled a few hdiutil incantations (to manage disk images) and our cookbooks were ready. We are now able to programmatically install 95% of Xcode (some steps are still manual), Git, and all the Android packages required to build and run the tests for our apps.
Lastly, if you ever had to deal with iOS provisioning profiles, you can relate to how annoying they are to manage and keep up to date; having a centralized system that manages all our profiles saves a lot of time and frustration for our engineers.
Building on push and providing daily deploys
With our CI machines hooked up to our Jenkins server, setting up a plan to build the app on every git push is trivial. This simple step helps us detect missing files from commits or compilation issues multiple times a week - developers are notified in IRC or by email and build issues are addressed minutes after being detected. Besides building the app on push, we provide a daily build that any Etsy employee can install on their mobile device - the quintessence of dogfooding. An easy way to encourage our coworkers to install pre-release builds is to nag them when they use the app store version of the app.
Testing
iOS devices come in many flavors, with seven different iPads, five iPhones and a few iPods; when it comes to Android, the plethora of devices becomes overwhelming. Even when focusing on the top tier of devices, the goal of CI is to detect defects as soon as they are introduced: we can’t expect our QA team to validate the same features over and over on every push!
Our web stack boasts a pretty extensive collection of test suites and the test driven development culture is palpable. Ultimately, our mobile apps leverage a lot of our web code base to deliver content: data is retrieved from the API and many screens are web views. Most of the core logic of our apps rely on the UI layer - which can be tested with functional tests. As such, our first approach was to focus on some functional tests, given that the API was already tested on the web stack (with unit tests and smoke tests).
Functional tests for mobile apps are not new and the landscape of options is still pretty extensive; in our case, we settled down on Calabash and Cucumber. The friendly format and predefined steps of Cucumber + Calabash allows our QA team to write tests themselves without any assistance from our mobile apps engineers.
To date, our functional tests run on iPad/iPhone iOS 6 and 7 and Android, and cover our Tier 1 features, including:
- searching for listings and shops
- registering new accounts
- purchasing an item with a credit card or a gift card
Because functional tests mimic the steps of an actual user, the tests require that certain assumed resources exist. In the case of the Checkout test, these are the following:
- a dedicated test buyer account
- a dedicated seller test account
- a prepaid credit card associated with the account
Our checkout test then consists of:
- signing in to the app with our test buyer account
- searching for an item (in the seller test account shop)
- adding it to the cart
- paying for the item using the prepaid credit card
Once the test is over, an ad-hoc mechanism in our backend triggers an order cancellation and the credit card is refunded.
A great example of our functional tests catching bugs is highlighted in the following screenshot from our iPad app:
Our registration test navigates to this view, and fills out all the visible fields. Additionally, the test cycles through the "Female", "Male" and "Rather Not Say" options; in this case, the tests failed (since the "male" option was missing).
By running our test suite every time an engineer pushes code, we not only detect bugs as soon as they are introduced, we detect app crashes. Our developers usually test their work on the latest OS version but Jenkins has their back: our tests run simultaneously across different combinations of devices and OS versions.
Testing on physical devices
While our developers enjoy our pretty extensive device lab for manual testing, maintaining a set of devices and constantly running automated tests on them is a logistical nightmare and a full time job. After multiple attempts at developing an in-house solution, we decided to use Appthwack to run our tests on physical devices. We run our tests for every push on a set of dedicated devices, and run nightly regression on a broader range of devices by tapping into Appthwack cloud of devices. This integration is still very recent and we’re still working out some kinks related to testing on physical devices and the challenges of aggregating and reporting test status from over 200 devices.
Reporting: put a dashboard on it
With more than 15 Jenkins jobs to build and run the tests, it can be challenging to quickly surface critical information to the developers. A simple home grown dashboard can go a long way to communicating the current test status across all configurations:
Static analysis and code reviews
Automated tests cannot catch all bugs and potential crashes - similar to the web stack, developers rely heavily on code reviews prior to pushing their code. Like all code at Etsy, the apps are stored in Github Enterprise repositories, and code reviews consist of a pull request and an issue associated to it. By using the GitHub pull request builder Jenkins plugin, we are able to systematically trigger a build and do some static analysis (see static analysis with OCLint post ) on the review request and post the results to the Github issue:
Infrastructure overview summary
All in all, our current infrastructure looks like the following:
Challenges and next steps
Building our continuous integration infrastructure was strenuous and challenges kept appearing one after another, such as the inability to automate the installation of some software dependencies. Once stable, we always have to keep up with new releases (iOS 7, Mavericks) which tend to break the tests and the test harness. Furthermore, functional tests are flaky by nature, requiring constant care and optimization.
We are currently at a point where our tests and infrastructure are reliable enough to detect app crashes and tier 1 bugs on a regular basis. Our next step, from an infrastructure point of view, is to expand our testing to physical devices via our test provider Appthwack. The integration has just started but already raises some issues: how can we concurrently run the same checkout test (add an item to cart, buy it using a gift card) across 200 devices - will we create 200 test accounts, one per device? We will post again on our status 6 months from now, with hopefully more lessons learned and success stories - stay tuned!
You can follow Nassim on Twitter at @kepioo