
Covid-19: Performance and scalability of digital services in the age of massive remote working

03/04/2020

Step by step, this article aims to help companies through this crisis, from the immediate emergency to six months later, in order to reconcile performance and scalability.

Monday, March 16th, 2020, 9 a.m. My daughter sits down in front of the family computer to open the Digital Workspace (ENT), the platform that connects students to their teachers and families to schools. The aim of the game: to continue her schooling at a distance, as announced by the Minister of Education. But all she gets in response is an error message! It follows her all day long: the ENT is saturated with requests. It was never planned, equipped, or designed to face an army of panicked schoolchildren!

The coronavirus crisis is forcing new ways of working around the world, for those who can. Massive teleworking is having a colossal impact on digital networks and services, making some of them inoperable because they are saturated. These new uses put an enormous load on the networks, as interactions that would usually happen in person are shifted online. As a digital service operator, you may find that your service can no longer keep up. Since confinement and remote working will last for a while yet, this article goes back to the basic principles of performance and scalability of a digital service. With quickly actionable solutions, we propose a list of actions that will allow you not only to cope with the emergency, but also to better anticipate and face this kind of crisis. Step by step, we help you through the emergency period, and up to 6 months after the end of the crisis, in order to reconcile performance and scalability. The aim here is not to list every possible action exhaustively, but to give you an overview. So follow the guide!

Identifying the weak link

This is the first step. A good digital service is like a good hi-fi system: it is always limited by its weakest link. The first challenge, then, is to identify the weakest link in the service. To do this, you need indicators on each of the elements: incoming traffic, RAM and disk usage, CPU usage, end-to-end HTTP response time, database query response time, internal network traffic... All this monitoring data will allow you to identify your weak link(s).

At this stage these indicators will probably come from multiple sources and for some of them collected manually. This is why the implementation of complete and integrated monitoring solutions will be one of the priorities, in order to efficiently direct efforts.
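Even before a full APM is in place, a few lines of code can turn raw log data into usable indicators. As a minimal sketch (the sample values below are illustrative), here is a nearest-rank percentile computation over response times, the kind of p50/p95/p99 figures any monitoring tool reports:

```python
import math

# Minimal sketch: derive p50/p95/p99 latency indicators from a list of
# response times in milliseconds (e.g. parsed from your access logs).
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    k = math.ceil(pct / 100 * len(ordered)) - 1   # nearest-rank index
    return ordered[max(0, k)]

response_times_ms = [120, 95, 340, 110, 980, 130, 105, 2200, 115, 125]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(response_times_ms, pct)} ms")
```

The gap between p50 and p99 is precisely what points at a weak link: a healthy median with a terrible tail usually means a saturated resource somewhere in the chain.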

The good news is that there are many solutions available to create an ecosystem around your application that can help it scale up. These include Caches, Proxies, Virtual Machines, Cloud, Virtual Waiting Rooms, Automated Deployment, and DevOps methodologies ... The table below lists various levers that can be activated in the more or less short term.

Table: the various levers that can be activated in the short or medium term

These different solutions are then inserted at different points in the request/response cycle between the browser and the servers.

Figure: the request/response cycle between the browser and the servers

In a hurry!

So we have a whole arsenal to deal with the problem. It remains to use it in the right order, depending on the situation and the level of urgency. Each of these levers comes with its own constraints. For example, it is harder to take full advantage of a CDN if static resources are not versioned. Horizontal scalability is not possible if the application has not been designed for it. And sometimes it's just a disk that's too small that blocks the whole system... In short, the first reflex should often be to see whether we can increase the size of the machine the application runs on. In the past, you had to go into the machine room and physically add a disk or RAM... Today it is often enough to connect to the VMware console and add a "virtual" CPU or RAM.

This can help you get through a peak, even if it is probably satisfying neither intellectually nor ecologically. But when you have two hours to fix the problem, you have to be efficient. We'll close our eyes this time, so go ahead!

In any case, it's important to have all the indicators you need to make this decision: if you have no indicators and don't know what to act on, the very first action is to install a modern monitoring tool such as Datadog, New Relic, and others. Most of them come today with their own set of system and application probes, capable of adapting to almost any context. These APM (Application Performance Monitoring) tools often extend monitoring all the way to the user's browser, providing a view of performance throughout the entire chain: from the execution of JavaScript in the browser, to the performance of database queries, via the network and the application. These tools will very quickly provide you with indicators that will allow you to identify your weak link(s). Most importantly, as you take action, these indicators will tell you what impact each action has had and which weaknesses to correct next, guiding your priorities.

Within a week

The next step, if your architecture doesn't include one yet, is to set up proxies/caches (e.g. Varnish). Start with the simplest kind: the ones you put in front of your application and that memorize requests and their responses.

Let's take the example of our student: the first time a student of class 4A wants to access the "homework" page of March 25th, the proxy queries the application, which searches the database for the content entered by the teacher. The proxy then stores this result; since it is the same for all the students of class 4A, it no longer needs to query either the application or the database. To caricature a bit, you have potentially divided the load on your application by 30!
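The mechanism can be sketched in a few lines. This is an illustrative in-memory stand-in for what an HTTP cache like Varnish does, not its actual implementation: identical requests within the TTL never reach the backend.

```python
import time

# Sketch of the HTTP-cache idea: the first request for a key hits the
# (slow) backend; identical requests within the TTL are served from memory.
class TinyCache:
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}          # key -> (expiry_timestamp, value)
        self.backend_hits = 0    # how many times we had to call the backend

    def get(self, key, fetch):
        now = time.time()
        entry = self.store.get(key)
        if entry and entry[0] > now:
            return entry[1]                       # cache hit
        self.backend_hits += 1
        value = fetch(key)                        # cache miss: ask the backend
        self.store[key] = (now + self.ttl, value)
        return value

cache = TinyCache(ttl_seconds=60)
homework = lambda key: f"homework page for {key}"  # stands in for app + database
for _ in range(30):                                # 30 students of class 4A
    cache.get("2020-03-25/4A", homework)
print(cache.backend_hits)  # 1 backend call instead of 30
```

The only subtlety in production is choosing the TTL and the cache key (URL, headers, cookies) so that users never receive someone else's personalized page.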

Another important lever is to relieve your digital service of the delivery of static resources as much as possible: depending on the profile of your application, these can represent up to 90% of incoming HTTP requests, since for each requested page you also need to load the CSS, the JS scripts, and potentially the images, icons, or other resources associated with that page. Setting up a CDN (Content Delivery Network) is a relatively simple operation today, and the effect of relieving your network and servers of this traffic will be immediate.
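For the CDN (and the browser) to cache aggressively, resources need to be versioned. A common technique, sketched here with a hypothetical helper, is to embed a content hash in the file name: the file can then be cached forever, because any change to its content produces a new name.

```python
import hashlib

# Sketch of resource versioning: embed a short content hash in the file
# name so CDNs and browsers can cache it indefinitely; a new build of the
# resource automatically yields a new name and bypasses stale caches.
def versioned_name(filename: str, content: bytes) -> str:
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{filename}.{digest}"

print(versioned_name("app.css", b"body { color: black; }"))
```

Build tools such as webpack do this automatically; the point is that versioned names are what make "cache forever" headers safe.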

This first week of operation under heavy traffic is also an opportunity to optimize your database(s) if they are among the most heavily solicited elements of the architecture: checking slow queries and adding or adjusting indexes on certain tables quickly prove to be profitable choices, gaining the precious milliseconds that, multiplied across all users, will relieve the service.
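To make the index point concrete, here is a small illustration using SQLite (the table and column names are invented for the example): the query planner reports a full table scan before the index exists, and an index search afterwards.

```python
import sqlite3

# Illustration with SQLite: the same lookup goes from a full table scan to
# an index search once an index exists on the filtered column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE homework (class_name TEXT, due_date TEXT, body TEXT)")
con.executemany("INSERT INTO homework VALUES (?, ?, ?)",
                [(f"4{c}", "2020-03-25", "...") for c in "ABCDEF"])

query = "SELECT body FROM homework WHERE class_name = '4A'"
before = con.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
con.execute("CREATE INDEX idx_homework_class ON homework(class_name)")
after = con.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(before)   # a full-table scan
print(after)    # a search using idx_homework_class
```

On a real production database, the equivalent workflow is the slow-query log plus `EXPLAIN` on the offenders; a scan over a large table hiding behind a popular page is one of the most common weak links.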

If your service is aimed at the general public, you don't know the potential number of users. It is therefore very important to control your incoming traffic rather than be subjected to it. Virtual waiting room services exist for this purpose: they keep users informed of their place in line, and thus don't lose them, until your application is able to process their requests. In this category we can mention, for example, https://queue-it.com or https://www.netacea.com/.
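The principle is simple even if those services implement it at a very different scale; this toy sketch (not any vendor's actual API) admits a fixed number of concurrent users and gives everyone else a queue position instead of an error page:

```python
from collections import deque

# Simplified sketch of the virtual-waiting-room idea: admit at most
# `capacity` concurrent users; everyone else is told their place in line
# and is promoted automatically when a slot frees up.
class WaitingRoom:
    def __init__(self, capacity):
        self.capacity = capacity
        self.active = set()
        self.queue = deque()

    def arrive(self, user):
        if len(self.active) < self.capacity:
            self.active.add(user)
            return "admitted"
        self.queue.append(user)
        return f"queued at position {len(self.queue)}"

    def leave(self, user):
        self.active.discard(user)
        if self.queue:                    # promote the next user in line
            self.active.add(self.queue.popleft())

room = WaitingRoom(capacity=2)
print(room.arrive("alice"))   # admitted
print(room.arrive("bob"))     # admitted
print(room.arrive("carol"))   # queued at position 1
room.leave("alice")           # carol is promoted automatically
```

The key difference with simply dropping requests is psychological and practical: a user who sees "you are number 1,024 in line" waits; a user who sees an error message leaves, or worse, hammers refresh.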

Finally, this first week will also have brought various roles around the table: project managers, product owners, developers, system administrators. This is the opportunity to start breaking down silos and to set up more direct collaboration processes.

The first month

You now have all the quickly activatable solutions in place. Over a slightly longer period, application developers and system engineers can also make relatively simple improvements to the application itself to further increase performance.

You have already set up Varnish caches upstream of the application, which act at the HTTP level. It is often very effective to add caching within the application itself, or by intercepting entities/queries to the database. Solutions such as Redis or Memcached can easily store, and invalidate on demand, user session information, catalog data on an e-commerce site, or other user preferences the application needs in order to process each HTTP request.
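The usual pattern here is cache-aside. In this sketch a plain dict stands in for Redis or Memcached (with redis-py you would use `get`/`set`/`delete` instead): reads go through the cache, and writes invalidate the entry so the next read is fresh.

```python
# Cache-aside sketch: a plain dict stands in for Redis/Memcached, and a
# second dict stands in for the database. Reads populate the cache on a
# miss; writes update the database and invalidate the cached entry.
cache = {}
database = {"user:42": {"name": "Alice", "theme": "dark"}}

def read(key):
    if key not in cache:                 # miss: load from the database
        cache[key] = database[key]
    return cache[key]

def write(key, value):
    database[key] = value
    cache.pop(key, None)                 # invalidate so the next read is fresh

read("user:42")                          # first read populates the cache
write("user:42", {"name": "Alice", "theme": "light"})
print(read("user:42")["theme"])          # light
```

Invalidation on write is the part worth getting right: a cache that is never invalidated serves stale data, and one that is flushed too broadly loses its benefit.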

This type of improvement is also a frequent prerequisite for horizontal scalability: moving this state into a shared store makes the information available to all instances of the application, and once the context is no longer stored on the application servers themselves, servers can be added or removed much more easily.

Depending on the type of application, it is also possible to implement, even in such a short time, certain forms of horizontal scalability, for example with a load balancer managing sticky sessions. Finally, with the Internet under heavy load, industrializing your front-end resources will improve the user experience with lighter pages that download faster. This industrialization involves optimizing static resources:

  • Minification of CSS and JavaScript, plus compression (gzip/Brotli),
  • Removal of dead or unused code,
  • Versioning of resources to make the best use of the browser's caching features,
  • Image optimization per device type (mobile, tablet, desktop, ...),
  • Video optimization according to the user's bandwidth; possibly, as the major streaming players are doing right now in a gesture of solidarity, stopping broadcasts in UHD formats to free up bandwidth for the network operators.
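The compression item alone is worth quantifying. A quick illustration with Python's standard gzip module on repetitive CSS-like text (real CSS and JS compress similarly well, typically 70-90%):

```python
import gzip

# Quick illustration of why HTTP compression matters: gzip a chunk of
# repetitive CSS-like text and compare sizes. Repetitive markup like CSS,
# HTML and JS is exactly what DEFLATE-based compression handles best.
css = (".menu-item { color: #333; padding: 4px 8px; margin: 0; }\n" * 200).encode()
compressed = gzip.compress(css)
print(len(css), "->", len(compressed), "bytes")
assert len(compressed) < len(css) // 5   # far smaller on repetitive text
```

In practice you don't compress in application code: you enable gzip or Brotli in the web server, proxy, or CDN configuration, and the saving applies to every response.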

In the following months

It is on this timescale that it becomes worthwhile to review your technical architecture and industrialize both the front end and the entire infrastructure so that it can adapt to the traffic. And as a bonus, review the application code so that it can evolve toward architectures that are more sustainable in terms of user traffic management: micro-services, asynchronous mechanisms, serverless or reactive architectures... there are now many ways to address these issues.

Beyond the production platform itself, it is the deployment and tooling chain that will have to be adapted:

  • Performance improvement is done incrementally, by treating the contention points one after the other and observing the results to decide on the next step.
  • During the crisis, mobilizing 6 people for several hours for each production run was possibly conceivable, but this is of course not viable in the long term.
  • The automation provided by a controlled chain of integration and continuous deployment will allow you to enter a truly agile approach.

This virtuous circle, initiated in response to a crisis, will become a real strength for your business by becoming the support of your innovation. Beyond the tools, it is the culture of the teams that will have been impacted, showing that it is possible to deploy changes quickly, to measure their effectiveness and to manage a project in a cross-disciplinary way. What used to be a crisis response strategy could well become one of your assets to win back your market in the longer term!

To sum up, to handle the emergency of access to your service:

  1. Set up monitoring.
  2. Increase memory, disk and CPU if the operation is simple and fast.
  3. Follow your monitoring indicators.
  4. Set up caches and proxies on the infrastructure side.
  5. Follow your monitoring indicators.
  6. If the service is still saturated, set up a virtual queue system.
  7. Follow your monitoring indicators.
  8. Set up application and/or database caches.
  9. Follow your monitoring indicators.
  10. If you use one, adapt your API aggregation system and modify your frontend accordingly.
  11. Follow your monitoring indicators.
  12. Start industrializing the infrastructure, frontend and backend, and undertake any architecture overhauls.