Covid-19: how to optimize the cost of your cloud services?
The health crisis we are currently living through has put many of our technical infrastructures to the test (training sites, video streaming, food e-commerce...), causing exceptional load that can be difficult to absorb.
Even if it is too early to draw a definitive assessment, it appears that platforms and services relying on the elasticity offered by public cloud providers have generally fared much better than average. Indeed, the sharp spikes in connections observed (x100 for some MOOC platforms) are much easier to handle when you are not limited by the physical capacity of your own datacenter, and when services are designed to scale dynamically.
The unprecedented economic crisis ahead will be an opportunity to leverage another benefit of the cloud: optimizing its costs, again through elasticity, but also through a few architectural choices. Here are some examples and tips on how to best optimize your cloud services.
The cloud brings such ease and promise of innovation that every company quickly finds itself with multiple environments, projects and resources that are sometimes difficult to map. Who has never had to track down the person who started (and then forgot) a "test" VM?
The sine qua non condition for acting on costs is being able to link resources to their owners (an individual, a project, a business unit, ...). Otherwise you quickly find yourself unable to act.
This requires setting up governance and deployment rules as early as possible: how, for example, will GCP projects be structured? How will the resources of different sub-projects sharing the same AWS account be tagged?
Depending on your cloud provider, different tools are also available to audit and automatically remediate missing information. For example, AWS Config can automatically shut down any VM that lacks the tag identifying its owning business unit.
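The audit logic behind such a rule is simple to sketch. Below, a minimal illustration in Python: the instance dictionaries mimic the shape returned by EC2 DescribeInstances, and the "BusinessUnit" tag key is an example convention of ours, not an AWS requirement.

```python
# Sketch of the tag audit that a tool like AWS Config can automate:
# flag every instance missing a required ownership tag.

def missing_required_tag(instances, required_key="BusinessUnit"):
    """Return the ids of instances that lack the required tag."""
    flagged = []
    for inst in instances:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        if not tags.get(required_key):
            flagged.append(inst["InstanceId"])
    return flagged


instances = [
    {"InstanceId": "i-0aaa", "Tags": [{"Key": "BusinessUnit", "Value": "ecommerce"}]},
    {"InstanceId": "i-0bbb", "Tags": [{"Key": "Name", "Value": "test"}]},
    {"InstanceId": "i-0ccc"},  # no tags at all: the forgotten "test" VM
]

print(missing_required_tag(instances))  # ['i-0bbb', 'i-0ccc']
```

A real remediation would then feed these ids to a stop or notification action; the point is that the rule is only enforceable once the tagging convention exists.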
Turn off & delete
Depending on your architecture, infrastructure services, and virtual machines in particular, can represent a significant part of the bill. On AWS it is quite straightforward to automate the scheduled shutdown and restart of these virtual machines and RDS instances using a Lambda function (e.g. https://aws.amazon.com/premiumsupport/knowledge-center/start-stop-lambda-cloudwatch/). Development servers can thus be shut down between 8pm and 7am, as well as on weekends, saving around 50% (storage, for example, continues to be billed).
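The savings are easy to verify. A minimal sketch of the scheduling predicate such a Lambda could evaluate before calling the EC2 stop/start APIs, using the hours from the example above:

```python
# The schedule from the example: dev servers run 7am-8pm on weekdays only.
# A Lambda triggered hourly could evaluate a predicate like this before
# calling ec2.stop_instances / ec2.start_instances.

def should_be_running(weekday, hour):
    """weekday: 0=Monday..6=Sunday, hour: 0..23."""
    return weekday < 5 and 7 <= hour < 20

# Fraction of the week the servers are actually on:
on_hours = sum(
    should_be_running(d, h) for d in range(7) for h in range(24)
)
print(f"{on_hours} h on out of 168 h -> {1 - on_hours / 168:.0%} compute saved")
# 65 h on out of 168 h -> 61% compute saved
```

About 61% of compute hours are cut; the overall saving lands closer to the 50% mentioned above because storage and other fixed resources keep being billed.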
More and more managed services offer these features as well (e.g. RDS and, more recently, Redshift, both AWS services).
Infrastructure as Code tools (Terraform, CloudFormation, Troposphere, CDK...) are also key here. They allow a complete environment to be rebuilt in a few minutes. If you can easily recreate, you can delete what is not immediately useful (for example a project frozen for 6 months).
In terms of elasticity, not all applications are equal; much depends on whether we are looking at a cloud-native or a "lift and shift" architecture. In the latter case, the infrastructure is often composed of virtual machines hosting monolithic applications and relational database services (RDS at AWS, Cloud SQL at GCP, ...) which cannot necessarily benefit from autoscaling.
It is therefore advisable to examine the resource consumption (CPU, memory, disk) of each server to identify potential savings.
Be careful to use the right tools for this: AWS Trusted Advisor, for example, relies on the CPU and network consumption of virtual machines, but does not take into account the memory consumption of applications. Its analysis must therefore be supplemented with the appropriate metrics, or with specialized tools.
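The caveat matters in practice: a CPU-idle server may still be memory-bound. A small sketch of a rightsizing filter that only flags an instance when both CPU and memory peak low; the 40% threshold and the metric names are illustrative choices, not an AWS recommendation.

```python
# Rightsizing sketch: memory metrics must come from elsewhere (e.g. the
# CloudWatch agent), since Trusted Advisor only sees CPU and network.
# An instance is a downsizing candidate only if BOTH peaks are low.

def downsizing_candidates(metrics, threshold=40.0):
    """metrics: {instance_id: {"cpu_peak": %, "mem_peak": %}}"""
    return [
        iid for iid, m in metrics.items()
        if m["cpu_peak"] < threshold and m["mem_peak"] < threshold
    ]


metrics = {
    "i-web": {"cpu_peak": 12.0, "mem_peak": 85.0},  # CPU-idle but memory-bound
    "i-batch": {"cpu_peak": 95.0, "mem_peak": 30.0},
    "i-dev": {"cpu_peak": 8.0, "mem_peak": 22.0},   # genuinely oversized
}
print(downsizing_candidates(metrics))  # ['i-dev']
```

Looking at CPU alone, "i-web" would have been wrongly shrunk.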
Several paths are possible. The first is usually to decouple the components of the application in order to make them scalable, as we explained here: https://www.ekino.fr/articles/performance-et-scalabilite-des-services-numeriques-a-lheure-du-teletravail. Beyond these steps, a deeper overhaul of the architecture can further optimize resources. Many managed cloud services charge on a per-use basis, and therefore much more linearly than an infrastructure that is permanently on:
- A GCP Cloud Storage bucket can in many cases replace a VM running Nginx
- API Gateway and AWS Lambda functions can replace a Java server
- An AWS Simple Queue Service queue can replace a RabbitMQ server.
This is obviously a simplified view (the functionalities are never fully equivalent), but depending on your consumption these opportunities are always worth studying. Beyond reducing the cloud bill, it is also a worthwhile medium- and long-term investment in the human costs of operations.
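"Depending on your consumption" can be made concrete with a break-even calculation. The prices below are illustrative placeholders, not current AWS list prices; the point is the shape of the comparison between a fixed monthly floor and a linear per-use bill.

```python
# Hypothetical break-even between an always-on server and a pay-per-use
# managed service. Prices are placeholders for illustration only.

SERVER_PER_MONTH = 30.0        # small always-on VM, $/month
SERVERLESS_PER_MILLION = 0.50  # all-in cost per million requests, $

def monthly_cost_serverless(requests_per_month):
    return requests_per_month / 1_000_000 * SERVERLESS_PER_MILLION

break_even = SERVER_PER_MONTH / SERVERLESS_PER_MILLION * 1_000_000
print(f"break-even at {break_even:,.0f} requests/month")
# break-even at 60,000,000 requests/month
```

Below that volume the per-use service is cheaper, and the bill tracks actual traffic instead of being a fixed floor; above it, the always-on server may win on raw cost, though the operational savings remain.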
Explore service options and pricing
For each service in use, the billing model and the options activated should also be reviewed. Google Cloud Storage offers four storage classes depending on access needs (Standard, Nearline, Coldline, Archive). Logs kept for several years for regulatory reasons can be stored in "Archive", dividing the cost by 6.
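This does not even require moving the data by hand: a lifecycle rule can apply the transition automatically. A minimal sketch of the JSON that `gsutil lifecycle set` accepts, moving objects to Archive after one year (the 365-day threshold is an example to adapt to your retention policy):

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
      "condition": {"age": 365}
    }
  ]
}
```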
AWS Relational Database Service offers high availability databases, but this option doubles the price of the service. Is it necessary to activate it in a development environment?
AWS also offers numerous pricing options for EC2 virtual machines: on-demand, reserved instances, Savings Plans, spot instances... This last option, particularly suited to batch processing, easily brings cost reductions of 60% or more.
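A quick order-of-magnitude comparison makes the choice tangible. The hourly price and discount rates below are illustrative, not quoted AWS prices: reserved instances and Savings Plans typically save tens of percent, spot often 60% or more.

```python
# Comparing EC2 pricing options for the same always-on workload,
# with hypothetical prices and discount rates.

ON_DEMAND_HOURLY = 0.10  # hypothetical $/hour
DISCOUNTS = {"on-demand": 0.0, "reserved (1yr)": 0.40, "spot": 0.70}
HOURS_PER_MONTH = 730

for option, discount in DISCOUNTS.items():
    cost = ON_DEMAND_HOURLY * (1 - discount) * HOURS_PER_MONTH
    print(f"{option:>15}: ${cost:6.2f}/month")
```

The catch, of course, is that spot instances can be reclaimed at short notice, which is why they suit interruptible batch jobs rather than customer-facing services.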
Finally, it is always useful to pay attention to network transfer costs. Depending on whether the flows are intra-zone, intra-region or public, and on the services they link, prices vary greatly. Whenever possible, optimizing the network architecture has the triple effect of reducing costs and improving performance and security. A classic example is enabling VPC endpoints (S3 in particular) on AWS.
AWS, GCP and Azure offer great flexibility thanks to their many services, but often at the price of technical and organisational complexity. For simple projects, Platform as a Service offerings such as Platform.sh or Clever Cloud are perfectly suited. Thanks to standardised environments, they are very quick to implement, offer git-based deployment workflows, and hide a large part of the complexity (and therefore the cost) of the run.
To get started
As we have seen, there are many avenues to investigate, especially since your cloud ecosystem is probably varied. One of the most effective approaches is to bring the technical leads and product owners of each platform together for a workshop to analyze the expenses. AWS Cost Explorer is an excellent tool for this purpose: in a few clicks it shows the distribution of costs and, if necessary, lets you drill down into great detail (all the more so if resources are correctly tagged).
Around this tool, and in small groups, a brainstorming session of about thirty minutes is held to identify ways of optimizing the architecture, starting with the largest cost centers. The ideas are then shared and debated to produce a plan to which everyone commits, in the manner of an agile sprint planning.
In this iterative way, each team can take ownership of the subject and responsibility for its financial impact, thus starting a "FinOps" approach.
The consequences of this crisis will be many, and are still hard to gauge, but it is clear that companies will have to focus even more than before on the value they bring to their customers. Services must make sense and be resilient, but more than ever they must also be economical, even frugal.
It will be interesting in the coming months to see how companies review their cloud strategy for the future: streamlining services to leverage economies of scale, investing in serverless technologies, focusing on "sovereign" clouds, ... many options are possible. In any case, the fundamental shift to the cloud remains inseparable from the agility needed to emerge from the crisis in the coming years.