…or how to avoid shutting down an app used by 10+ m
Shpock, the mobile marketplace that brings millions of private buyers & sellers and local businesses together, has learned a thing or two about maintenance in the past six years. With more than 10 million active users every month, one can imagine that redundancy is critical for a high-performance service like the Shpock API. To this day, the Shpock team has never turned on its own maintenance page.
In the following article, Stefan Lingler, CTO and Senior Backend Developer of Shpock, describes how to avoid maintenance windows from the standpoint of one of Europe’s largest shopping apps.
Don’t get me wrong: maintenance windows are a legitimate way to carry out important changes in a software project. Whether they work out depends on many factors, such as the type of software project. However, driven by our skyrocketing demand for availability, we had to make the right decisions early on. Of course, had we known some things earlier, we would sometimes have decided differently. But in a fast-evolving environment, technology decisions have to be taken quickly, with a strong focus on building an adaptable but robust stack to win the market.
Since the beginning, we have followed two core principles:
- Do not deploy multiple versions of the API! That sounds easy to achieve, but with more than 130 individual app versions on the client side, it does make a huge difference.
- Do not take the “easy” route – shutting down support for all the old clients so that the users have to update to the latest app version. For Shpock, every single user counts. Plus, some can’t even update their app as their operating system does not allow them to.
With that attitude, we have not needed any downtime due to maintenance in the past six years (except when we rolled out a major database upgrade which required it). Although we built an “under maintenance” page, we have never had to turn it on since then.
Of course, there are numerous legitimate reasons to work with maintenance downtimes. But users of your app do not and should not have to care about it. Being able to use a certain app at any given time is crucial for a high-level user experience. Every planned maintenance window leads to downtime which usually results in a high user churn rate – something you should care about as an app provider.
How to avoid maintenance windows
So here are a few examples of the measures we applied at Shpock to avoid maintenance downtime.
- Load balancer level routing
We use routing at the load balancer level not only to distribute traffic to our backend nodes but also to do staged rollouts of changes to the production environment. In fact, every deployment – and there are around 10 per day on average – follows that pattern. If something goes wrong, we just shut down the affected backend stack.
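One way to picture a staged rollout is a routing rule that sends a fixed share of clients to the new backend stack and can be switched off at once. The following is only a minimal sketch of that idea, not how the Shpock load balancer is configured; the function name `pick_stack`, the stack labels and the percentage parameter are assumptions made for illustration.

```python
import hashlib


def pick_stack(client_id: str, canary_percent: int, canary_enabled: bool = True) -> str:
    """Return "canary" for a stable share of clients, "stable" for everyone else.

    All names here are hypothetical; a real load balancer would express
    the same rule in its own configuration language.
    """
    if not canary_enabled or canary_percent <= 0:
        # Shutting down the affected stack: all traffic goes to the stable one.
        return "stable"
    # Hash the client id so the same client always lands in the same bucket (0-99).
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Hashing the client id rather than picking at random keeps each client on one stack for the duration of a rollout, which makes incidents easier to attribute.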
- Incremental upgrade
We often introduce new features that require new fields in our MongoDB cluster but also change existing ones. A simple way to make the transition fast is to enable your application code to deal with both the old and the new schema. We then have the option to schedule the migration of a single entity through our messaging queue or let a background task work through all entities. Important: Make sure you are able to stop the background task at any time and resume it later if required.
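The two ideas above, reading both schema versions and running a background task that can be stopped and resumed, can be sketched as follows. This is a simplified illustration using a plain dictionary in place of a MongoDB collection; the field names `price` and `price_cents`, the checkpoint mechanism and all function names are assumptions, not Shpock's actual schema or code.

```python
from typing import Callable, Dict, Optional


def price_in_cents(doc: dict) -> int:
    """Read the price regardless of which schema version the document uses."""
    if "price_cents" in doc:          # new schema (hypothetical field)
        return doc["price_cents"]
    return round(doc["price"] * 100)  # old schema (hypothetical field)


def upgrade(doc: dict) -> dict:
    """Idempotent upgrade: an already-migrated document is returned unchanged."""
    if "price_cents" in doc:
        return doc
    new_doc = {k: v for k, v in doc.items() if k != "price"}
    new_doc["price_cents"] = price_in_cents(doc)
    return new_doc


def run_migration(
    collection: Dict[int, dict],
    checkpoint: Optional[int],
    should_stop: Callable[[], bool],
) -> Optional[int]:
    """Migrate documents with an id above the checkpoint, in id order.

    Returns the id of the last migrated document so that a later run
    can resume from there.
    """
    for doc_id in sorted(collection):
        if checkpoint is not None and doc_id <= checkpoint:
            continue  # already handled in an earlier run
        if should_stop():
            break     # stop cleanly; the checkpoint marks where to resume
        collection[doc_id] = upgrade(collection[doc_id])
        checkpoint = doc_id
    return checkpoint
```

Because `price_in_cents` accepts both shapes, the application keeps working while the migration is only partly done, and because `upgrade` is idempotent, re-running the task after an interruption is harmless.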
- Shadow migration
The data for our front row app view – the discover screen – is delivered exclusively through an Elasticsearch index. The number of shards per index cannot be changed once set. Whenever we are required to change the mapping or shard configuration of an index we start indexing in multiple indexes until the data set is identical. After that, we simply update Elasticsearch aliases and deploy a small change to fully utilize the new index.
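The final alias switch can be illustrated with the request body for Elasticsearch's `_aliases` endpoint, which applies a list of `remove` and `add` actions in one atomic step. The sketch below only builds that body and shows a dual-write helper for the shadow phase; the alias and index names are made up, and sending the request (for example with an HTTP client) is left out.

```python
from typing import List


def shadow_write_targets(old_index: str, new_index: str, shadow_phase: bool) -> List[str]:
    """During the shadow phase every document is indexed into both indexes."""
    return [old_index, new_index] if shadow_phase else [old_index]


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for POST /_aliases that moves the alias to the new index.

    Both actions are sent in one request, so readers of the alias never
    see a moment in which it points at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
body = alias_swap_body("discover", "discover_v1", "discover_v2")
```

Since the application reads through the alias rather than a concrete index name, the switch itself needs no restart of the readers.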
Where to start?
When you plan to apply seamless migrations of data, features or even teams, you need to focus on a few essentials. It does not matter when you start with that, but you should not wait until you are stuck in a dead end.
- Fast re-indexing
If you are running an Elasticsearch index or any other kind of replicated database to increase performance, make sure you have tools to re-index fast enough. This is critical in many cases, for instance to recover from a data-impacting bug.
- Any kind of messaging queue
Find a technology like Apache Kafka, RabbitMQ or similar that enables you to process migration tasks on whatever kind of entity you are changing. Most important is that it scales well and allows you to work off jobs asynchronously and in parallel.
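To make "working off jobs asynchronously and in parallel" concrete, here is a small in-process stand-in built on Python's standard `queue` and `threading` modules. It is not a substitute for Kafka or RabbitMQ and says nothing about how either is used at Shpock; the function name `work_off` and the handler are assumptions for illustration.

```python
import queue
import threading
from typing import Callable, Iterable, List


def work_off(jobs: Iterable, handler: Callable, workers: int = 4) -> List:
    """Process all jobs with a pool of worker threads and collect the results.

    Results arrive in completion order, not in submission order.
    """
    pending: queue.Queue = queue.Queue()
    for job in jobs:
        pending.put(job)

    results: List = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = handler(job)
            with lock:  # protect the shared result list
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

A real message broker adds what this sketch lacks: persistence across restarts, retries for failed jobs, and workers spread over several machines.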
- Phased deployments
If necessary, split deployments to your production environment into small logical packages. It does no harm if you deploy three or four times to complete a transition if this gives you the option to monitor, approve or – in the worst case – revert a step. And yes, we do have a staging environment as well as unit and integration tests, but database-related changes usually behave quite differently in production.
- Clear documentation
It sounds obvious, but clear documentation reduces the effort of onboarding new developers as well as the risk of someone cleaning up seemingly obsolete legacy code that is still important.
- Centralized Logging
There are many great tools out there that require very little setup to start collecting events from your application code. Use them to estimate the impact of a change, follow your migration process, track down the random events behind a single user’s problem and, most importantly, get an idea of how your API is used.
- Quality code
Even if you just migrate a few thousand documents in your MongoDB cluster, apply the same quality standards as for your application code. You might have to run such migration tasks over and over again, which is very difficult and unreliable with just a bunch of code snippets.
- Not just “Plan A”
Prepare for plan A to fail and have at least B and C ready. The minimal approach to this is to go through all scenarios you can think of – ideally with your team members or developer friends.
- Knowing the calling client
If you run a private or public API, make sure to have some sort of information about the client – for example a request header with the client version. This helps you to understand the correlation between an incident and the client potentially causing it.
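As an illustration of such a header, the sketch below parses a client identifier of the form `platform/version`. The header name `X-Client-Version` and its format are assumptions chosen for this example; the article does not say which header Shpock's clients actually send.

```python
from typing import Mapping, Optional, Tuple


def parse_client(headers: Mapping[str, str]) -> Optional[Tuple[str, Tuple[int, ...]]]:
    """Return (platform, version tuple) from a hypothetical X-Client-Version header.

    Returns None when the header is missing or malformed, so callers can
    log the request as coming from an unknown client instead of failing.
    """
    raw = None
    for name, value in headers.items():
        if name.lower() == "x-client-version":  # header names are case-insensitive
            raw = value
            break
    if raw is None or "/" not in raw:
        return None
    platform, _, version = raw.partition("/")
    try:
        parts = tuple(int(piece) for piece in version.split("."))
    except ValueError:
        return None
    return platform.strip().lower(), parts
```

Returning the version as a tuple of integers makes comparisons such as `parts < (3, 0, 0)` straightforward when an incident needs to be narrowed down to a range of client versions.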
High user satisfaction has always been our main priority. The whole Shpock team is working hard to make the buying and selling experience as quick and as simple as possible. Avoiding maintenance windows plays a vital role in user experience. But apart from that, it is also very satisfying for developers to build a flexible code base that allows them to move ahead without being set back by an old-fashioned “Please come back later” 😉
DESCRIPTION OF THE SPEAKER
CTO and Senior Backend Developer
Stefan is a Senior Developer and part of the management team of Shpock, a mobile marketplace that brings millions of private buyers & sellers and local businesses together. Since the launch, he has contributed significantly to optimizing the performance of the rapidly growing platform and scaling it – making it one of Europe’s leading shopping apps with more than 10 million active users per month. Besides his 10+ years of full-stack experience – from conventional LAMP to modern NoSQL, Messaging, CRM, HA and high-traffic technologies – Stefan has built the backend team from the ground up. He holds a Diploma of Engineering Media Technology and has extensive experience in MongoDB, Elasticsearch, Nginx and System Engineering in general.