AWS – Understanding the five pillars of the Well-Architected Framework

The AWS Well-Architected Framework is divided into five pillars that run the gamut from security to operations, providing best-practice guidelines for all aspects of developing and maintaining a solution in the cloud. Each pillar is described at length in a whitepaper that is available to download from the AWS site: https://aws.amazon.com/architecture/well-architected/.

Security

Security should be considered job zero when designing a cloud architecture. It should be your first consideration, not your last – which, unfortunately, is too often the case. If securing your infrastructure is an inconvenient afterthought, it will be difficult, if not impossible, to keep your customer data safe and private.

The security pillar of the Well-Architected Framework focuses on protecting systems and information. Systems are the physical and virtual resources that you have provisioned in your IT environment, and information is the data that flows throughout those systems.

The foundation of any secure platform is Identity and Access Management (IAM). You must be able to reliably authenticate your users, and once you know exactly who they are, you must provide mechanisms to authorize users to limit what they’re allowed to do. The same applies to applications that require access. 

Never hardcode credentials into your application code or into configuration files! One of the most common causes of data breaches is passwords or API keys that have been checked into source code. Use features such as cross-account roles, EC2 instance profiles, and AWS Secrets Manager to cover the use cases where you might be tempted to use hardcoded access keys.
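
For example, rather than reading a database password from a config file, an application can fetch it at runtime from AWS Secrets Manager. The following is a minimal sketch using Python and boto3; the secret name and its JSON layout are hypothetical:

    import json
    import boto3

    # The client picks up credentials from the instance profile or execution
    # role, so no access keys ever appear in the code.
    secrets = boto3.client("secretsmanager")

    def get_db_credentials(secret_id="prod/myapp/db"):  # hypothetical secret name
        """Fetch a JSON secret such as {"username": "...", "password": "..."}."""
        response = secrets.get_secret_value(SecretId=secret_id)
        return json.loads(response["SecretString"])

    creds = get_db_credentials()
    # Use creds["username"] and creds["password"] to open the database connection.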

A best practice for securing your cloud system is to adopt a policy of zero trust. Zero trust means that even when data is flowing between servers behind your firewall, on your internal network, you still apply stringent security measures. It helps to be a little paranoid and assume that a bad actor is on your network, watching everything that happens, and looking for a way to infiltrate further. An example of zero trust is encrypting communications between your web servers and your database server, regardless of the fact that they are sitting on your internal network.
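
As a small illustration of this principle, the following sketch opens an encrypted, certificate-verified connection from a web server to a PostgreSQL database on the internal network. It assumes the psycopg2 driver and the Amazon RDS certificate bundle are installed; the endpoint and credentials are placeholders:

    import psycopg2  # assumes the psycopg2 PostgreSQL driver is installed

    # Even though the database sits on a private subnet, require TLS and verify
    # the server certificate against the downloaded RDS certificate bundle.
    conn = psycopg2.connect(
        host="mydb.internal.example.com",            # placeholder endpoint
        dbname="appdb",
        user="app_user",
        password="retrieved-from-secrets-manager",   # never hardcode a real password
        sslmode="verify-full",                       # refuse unencrypted or unverified connections
        sslrootcert="/etc/ssl/rds-ca-bundle.pem",    # path to the CA bundle
    )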

One exception to this rule is network traffic that is flowing to and from your EC2 instances. Due to the design of the hypervisor, it is impossible for a third party to sniff that traffic. Unlike a traditional network where a sniffer running in promiscuous mode can see everything, the hypervisor prevents the delivery of packets to anywhere but the target. Regardless, many compliance regimes require all traffic to be encrypted, and it’s still good practice.

With Amazon CloudTrail, it’s possible to leave an audit trail that details every action taken within your account in all regions. If a security event occurs, auditors must have rapid access to log files that can help them find out who the culprit is, and how they broke in.

Always enable CloudTrail and send the logs to an S3 bucket in a different account, accessible only to security auditors. Placing the audit trails in a different security domain means that an attacker will have a harder time covering their tracks.
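
A sketch of enabling such a trail with Python and boto3 might look like the following; the trail and bucket names are placeholders, and the bucket policy in the auditing account (which must grant CloudTrail permission to write) is assumed to be configured separately:

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # Record activity in every region and ship the logs to a bucket owned by a
    # separate, locked-down auditing account.
    cloudtrail.create_trail(
        Name="org-audit-trail",                    # placeholder trail name
        S3BucketName="central-audit-logs-bucket",  # bucket in the auditing account
        IsMultiRegionTrail=True,
        EnableLogFileValidation=True,              # detect tampering with delivered logs
    )
    cloudtrail.start_logging(Name="org-audit-trail")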

Simulate security events frequently so that your operations team can practice incident response. When you identify patterns in those responses, automate them so that a human is not required to deal with the situation. This will improve both consistency and response time. An example is automatically responding to an S3 bucket being made public – instead of simply sending an email to alert an administrator to the problem, write a Lambda function that flips the bucket back to private immediately.
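
A minimal sketch of that remediation function follows. It assumes the function is triggered by an EventBridge rule matching CloudTrail events such as PutBucketAcl or PutBucketPolicy, and the event parsing reflects that assumption:

    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Pull the bucket name out of the CloudTrail event delivered by EventBridge.
        bucket = event["detail"]["requestParameters"]["bucketName"]

        # Immediately block every form of public access on the offending bucket.
        s3.put_public_access_block(
            Bucket=bucket,
            PublicAccessBlockConfiguration={
                "BlockPublicAcls": True,
                "IgnorePublicAcls": True,
                "BlockPublicPolicy": True,
                "RestrictPublicBuckets": True,
            },
        )
        return {"remediated_bucket": bucket}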

To learn more about the security pillar, read the official whitepaper: https://d1.awsstatic.com/whitepapers/architecture/AWS-Security-Pillar.pdf.

Operational excellence

The operational excellence pillar is often overlooked; even in companies that have a good grasp of architectural principles, this is an area that needs improvement. The key to operational excellence is to design your systems with operations in mind from day one. Just as with the security pillar, we must not wait until after the application is built to think about operations; we have to make sure that what we are constructing is operable.

When you add a new feature to your application, ask yourself these questions:

  • How will I know if users are using the new feature?
  • What are the Key Performance Indicators (KPIs) that tell me if the feature is working well?
  • Are there any alerts that I should send to the operations team if this feature is not working as intended?
  • Are there automated tests in place to make sure new code changes do not break this feature?
  • Is the feature documented thoroughly for both internal and external users?
  • Is the data that’s generated by this feature available to a reporting team for detailed analysis?

A feature is not done just because the code for the feature itself is complete. All of these aspects must be considered before calling it done and shipping it to production.
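
The first question in the preceding list, for instance, can be answered by emitting a custom CloudWatch metric every time the feature is used. The following sketch uses Python and boto3; the namespace, metric, and feature names are hypothetical:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def record_feature_use(feature_name="new-checkout-flow"):  # hypothetical feature
        # One data point per use; sum it on a dashboard, or alarm if it drops to zero.
        cloudwatch.put_metric_data(
            Namespace="MyApp/Features",  # hypothetical namespace
            MetricData=[{
                "MetricName": "FeatureUsed",
                "Dimensions": [{"Name": "Feature", "Value": feature_name}],
                "Value": 1,
                "Unit": "Count",
            }],
        )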

Another critical aspect of operational excellence is to adopt Infrastructure as Code (IaC). In Chapter 1, AWS Fundamentals, you were introduced to AWS CloudFormation, which allows you to construct your resources using JSON or YAML files, instead of logging in to the console and creating them manually, or via the command-line interface (CLI). IaC allows you to rapidly spin up new environments for development, testing, and disaster recovery. Treating infrastructure as code also allows you to adopt best practices from the software development industry, such as revision control and code reviews.
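
As a small illustration, the sketch below uses boto3 to create a stack from an inline YAML template and shows how it can be torn down again; the stack name is a placeholder, and in practice the template would live in revision control alongside your application code:

    import boto3

    cloudformation = boto3.client("cloudformation")

    # A deliberately tiny template: a single S3 bucket. Real templates describe
    # entire environments and are stored in version control, not inline strings.
    TEMPLATE = """
    Resources:
      ArtifactBucket:
        Type: AWS::S3::Bucket
    """

    cloudformation.create_stack(StackName="demo-environment", TemplateBody=TEMPLATE)

    # Tearing the environment down again is a single call:
    # cloudformation.delete_stack(StackName="demo-environment")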

When you experience an operational failure, don’t sweep it under the rug and pretend it never happened. Write a detailed report about exactly what happened, why it happened, what the root cause was, and how you solved it. Document the measures you took to make sure it never happens again, and then share that data with everyone in your organization. We all fail at some point. All systems, both human and machine, are fallible. Expect failure, prepare for it, and learn from it.

One of the best ways to prepare for operational events is to conduct what is called a game day. Take a copy of your infrastructure, assign your operations team to monitor it, and then purposefully throw them curveballs to see how they react. What happens if you stop a critical EC2 instance? How will they react if you misconfigure a security group and your web servers are no longer accessible? Can they recover from a faulty software release? The only way to know is to test your team frequently, document the results, and make changes to improve your performance.
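
One simple way to inject a failure during a game day is to stop a random instance in one of the test environment's Auto Scaling groups and watch how both the team and the automation respond. The following is a rough sketch; the group name is a placeholder, and this should only ever be pointed at a copy of your infrastructure:

    import random
    import boto3

    autoscaling = boto3.client("autoscaling")
    ec2 = boto3.client("ec2")

    # Look up the instances in the game-day Auto Scaling group (placeholder name).
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=["gameday-web-asg"]
    )["AutoScalingGroups"]

    instance_ids = [i["InstanceId"] for g in groups for i in g["Instances"]]

    if instance_ids:
        victim = random.choice(instance_ids)
        print(f"Stopping {victim} - observe the team's response and the group's recovery")
        ec2.stop_instances(InstanceIds=[victim])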

There will be times when you face a problem that you can’t solve alone. AWS Support is a fantastic resource that you should take advantage of to help you solve problems with your infrastructure.

At a bare minimum, purchase a Business Support contract on all production accounts!

Business Support gives you full access to AWS Trusted Advisor, an invaluable resource that can identify cost savings significant enough to more than make up for the support fees. Even more critical, a Business Support contract gives you a much shorter response-time service-level agreement (SLA), so when you need help, you can get someone on the line quickly.

The most common support call you will make is to increase your service limits. AWS sets soft limits on many resources to protect customers from accidentally provisioning too many resources and waking up to a huge bill. When you hit those limits, your application can grind to a halt. Without Business Support, you might not be able to meet the SLA you have with your own customers to get them back up and running in an acceptable time period.
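
Limit increases can also be requested programmatically through the Service Quotas API, as sketched below. The quota code shown is only illustrative; list the quotas first to find the code for the limit you are about to hit:

    import boto3

    quotas = boto3.client("service-quotas")

    # Browse the EC2 quotas to find the code for the limit you care about.
    for quota in quotas.list_service_quotas(ServiceCode="ec2")["Quotas"]:
        print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

    # Request an increase for a specific quota (code and value are illustrative).
    quotas.request_service_quota_increase(
        ServiceCode="ec2",
        QuotaCode="L-1216C47A",  # an example quota code; look up the one you need
        DesiredValue=256.0,
    )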

Operational excellence is the key to providing your customers with a smooth, consistent experience. It’s an area where you are never really done – there are always more things to learn and improvements to make, so keep studying the best practices and keep looking for ways to enhance your operational readiness.

To learn more about the operational excellence pillar, see the official white paper: https://d1.awsstatic.com/whitepapers/architecture/AWS-Operational-Excellence-Pillar.pdf.

Performance efficiency

On its surface, this pillar is all about speed. It’s a well-known fact that website performance has a huge impact on user adoption. Just a few milliseconds can be the difference between a happy user and someone who abandons your site for a competitor.

But the performance efficiency pillar of the Well-Architected Framework goes further than that. It covers the efficient use of computing resources and how to keep your systems performing at their best as the technology landscape changes and user demand fluctuates. This pillar is closely related to cost optimization because, in some cases, it’s possible to be over-provisioned, paying for resources that you don’t need. An efficient system does not necessarily have to be as fast as possible; instead, it should be as fast as it needs to be, using the most appropriate resources to accomplish the task at hand.

One of the easiest ways to improve your application’s performance and take advantage of the latest technologies is simply to make use of managed cloud services. AWS offers many services directly related to performance, such as Amazon CloudFront, that would be very difficult for a small technical team to implement on their own. Serverless architectures powered by AWS Lambda offer configurable performance (tuned through the memory allocation setting) and scalable compute resources at a fraction of the cost of self-hosting your own code on machines that you provision.

With AWS, it’s easy to choose the perfect tool for the job to maximize performance. For example, you can experiment with a variety of NoSQL databases, such as Amazon DynamoDB, and compare performance to more traditional databases such as PostgreSQL running on Amazon Aurora. CloudWatch dashboards make it easy to chart your performance so that you can make more informed choices.

Performance efficiency is a topic that requires a lot of data to inform good decisions. Make sure that your applications are instrumented to log as much performance data as possible. Automate your responses to applications that are performing poorly. For example, if a cluster of web servers running on EC2 is struggling to keep up with the load, configure an Auto Scaling group to automatically provision new servers until performance falls back in line with expectations.
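
A sketch of attaching such a target-tracking policy to an existing Auto Scaling group with boto3 follows; the group name and target value are placeholders to be tuned against your own performance data:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Keep average CPU across the group near 60%; the group adds or removes
    # instances automatically as the load rises and falls.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-tier-asg",  # placeholder group name
        PolicyName="keep-cpu-near-60",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 60.0,
        },
    )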

There are EC2 instance types to cover almost every possible performance scenario. Since many applications haven’t evolved to containers or serverless functions yet, it is very important for an AWS administrator to understand the plethora of options available. The following is a quick summary of the main EC2 instance types:

  • General-purpose: These are the go-to instance types that can handle most typical workloads, and are as follows:
    • A1: ARM-based AWS Graviton processors.
    • T3: Next-generation burstable instances for spiky workloads.
    • M5: The latest generation of standard, general-purpose instances. This should be the default starting point for most applications.
  • Compute-optimized: Choose one of these types if your application has excessive CPU requirements:
    • C5: The standard instance type for cost-effective compute-intensive workloads
    • C5n: Similar to C5, but with up to 100 Gbps networking
  • Memory-optimized: If your application eats a ton of RAM, choose an instance from this category:
    • R5: Configure an instance for up to 768 GiB.
    • X1e: These instances are made specifically for high-performance databases with extreme memory requirements and attached solid-state disks.
    • High memory: If you run a large SAP HANA installation, choose this instance to configure up to a whopping 12 TiB per machine!
    • Z1d: The Z family offers the highest core frequency available at the time of writing, that is, 4.0 GHz.
  • Accelerated computing: Many machine learning tasks require access to Graphical Processing Units (GPUs):
    • P3: The latest generation of GPU instances, offering the best bang for your buck currently offered by any provider.
    • F1: A few specific types of machine learning algorithms perform better with the field-programmable gate arrays (FPGAs) provided by this instance type.
  • Storage-optimized: Choose an instance from this category if the main requirement of your application is hard drive space and disk I/O performance:
    • H1: Up to 16 TB of local HDD storage for high-throughput use cases that utilize spinning disks.
    • I3: This family offers Non-Volatile Memory Express (NVMe) drives for high IOPS at a relatively low cost.
    • D2: These instances offer the best cost-to-throughput ratio for spinning hard drives, with up to 48 TB of HDD storage (spread across multiple drives) per instance.

As with any architecture, you should conduct extensive experiments with different instance types under production-equivalent loads to make sure you have made the correct choice.

This is by no means an exhaustive list. See the official AWS documentation for the complete current list of instances, which evolves rapidly: https://aws.amazon.com/ec2/instance-types/.

Of course, there is much more to optimizing performance than choosing the correct instance type, especially if you have moved on to serverless architectures. Evaluate your storage and networking needs, experiment, gather data, and continually refine your choices as the landscape evolves.

For more information on the performance efficiency pillar, see the official white paper: https://d1.awsstatic.com/whitepapers/architecture/AWS-Performance-Efficiency-Pillar.pdf.

Reliability

The reliability pillar has a number of crossovers with the operational excellence pillar. We live in a 24/7 world, and users expect applications to be available at all hours, running at full capacity. The reliability pillar of the AWS Well-Architected Framework focuses on techniques and practices that can help you achieve a zero-downtime infrastructure.

When we speak of reliability, be it uptime for a web server or the durability of saved data, we usually speak in terms of nines. For example, Amazon S3 offers a staggering 11 nines of durability for stored objects. That’s 99.999999999% durability, which is made possible by storing redundant copies throughout a region in various Availability Zones (AZs). Several services offer SLAs that guarantee a certain amount of uptime, such as Amazon DynamoDB, which offers five nines (99.999%) of uptime for global tables.

How exactly does that break down? Let’s take five nines and see what this means in terms of time per week. In a single week, we have the following:

7 days × 24 hours = 168 hours; 168 hours × 60 = 10,080 minutes; 10,080 minutes × 60 = 604,800 seconds

99.999% availability over 604,800 seconds leaves us with roughly 6 seconds of total downtime each week.

You have probably heard many vendors bragging that their service has five nines or better, but in reality, this is extremely difficult to achieve. It means there are only about 6 seconds each week, or roughly 5 minutes per year, during which a request to your application can fail due to an outage. And this includes scheduled maintenance!

To achieve levels of reliability anywhere close to five nines, it’s obvious that you are going to need a lot of redundancy, because software and hardware fail all the time. It’s just a fact of life. When you are running thousands of servers, something is almost always broken. The only way to hide that brokenness from your users is to double and triple up on everything so that, when the inevitable single failure happens, there is always a backup standing by to take its place.

This brings us to another concept: the single point of failure. Many architectures take care to replicate common resources such as web servers, but what about the router? The firewall? The database? A chain is only as strong as its weakest link. Be sure to inspect every component of your system to make sure you haven’t forgotten something that has no redundancy.

Deploying changes to an environment without downtime can be an extremely difficult problem to solve. For most traditional, monolithic applications, users had to get used to scheduled downtime for maintenance, be that a monthly OS patch update or a weekly new software build. The application is taken offline, changes are applied, and everyone crosses their fingers and hopes for the best when it comes back online. Software updates are almost always to blame when a complex software system has unexpected downtime.

Here are a few strategies that can help you to achieve zero-downtime updates:

  • Blue-green deployments: Spin up a copy of your workload behind an alternate URL. Test it thoroughly and, when it is ready, swap it with the prior version so that the new copy (green) is now live and the old copy (blue) is no longer in use. If you see errors in the new deployment under live traffic, simply swap them again to roll back.
  • Canary deployments: Spin up a new copy of your workload and start by sending small amounts of live traffic to it. If there are no errors, ramp up traffic until the new copy has all the traffic.
  • Feature flags: Deploy hidden features that can be rolled out slowly and rolled back if needed via configuration at runtime (see the sketch after this list).
  • Schema-less databases: Traditional relational databases often require downtime for significant schema changes, which are inevitable in any evolving application. Using a database with a flexible schema system, such as Amazon DynamoDB, can mitigate this issue.
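
As an illustration of the feature-flag approach mentioned above, the following sketch stores a flag in AWS Systems Manager Parameter Store and checks it at runtime; the parameter name is hypothetical, and many teams use a dedicated feature-flag service instead:

    import boto3

    ssm = boto3.client("ssm")

    def feature_enabled(flag_name="/myapp/flags/new-search"):  # hypothetical parameter
        """Return True if the flag parameter is set to 'true'."""
        value = ssm.get_parameter(Name=flag_name)["Parameter"]["Value"]
        return value.lower() == "true"

    if feature_enabled():
        print("serving the new code path")
    else:
        print("falling back to the existing behaviour")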

The key to any reliable system is testing. Test everything, before and after it is put into production. Automate as much of your testing as possible so that your tests can be run quickly and efficiently every time a change is applied.

In the cloud, it’s possible to run your tests with production amounts of data and traffic since it’s inexpensive to spin up resources for a short-lived test. One of the key principles of the reliability pillar is to stop guessing about your necessary capacity. Use data to inform your decisions about the resources you provision.

Use auto-scaling for any resource that offers it, such as EC2 and DynamoDB. Configure thresholds for expanding and contracting resources so that you are always using exactly as much as you need for the current workload.
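
For DynamoDB, auto-scaling is configured through the Application Auto Scaling service. A sketch for a table's read capacity follows; the table name and capacity limits are placeholders:

    import boto3

    appscaling = boto3.client("application-autoscaling")

    # Register the table's read capacity as a scalable target (placeholder values).
    appscaling.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId="table/Orders",
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        MinCapacity=5,
        MaxCapacity=500,
    )

    # Scale so that roughly 70% of the provisioned read capacity is consumed.
    appscaling.put_scaling_policy(
        PolicyName="orders-read-target-tracking",
        ServiceNamespace="dynamodb",
        ResourceId="table/Orders",
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
            },
            "TargetValue": 70.0,
        },
    )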

Finally, don’t forget to analyze your dependencies. It doesn’t matter how reliable your system is if it depends on a separate system that is not reliable. If you can’t control the reliability of the dependency, put a queue or a batch process in between so that the dependency is not called in real time.
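
A sketch of that decoupling with Amazon SQS is shown below: the producer enqueues work and returns immediately, while a separate worker drains the queue and calls the unreliable downstream system at its own pace. The queue URL is a placeholder, and the downstream call is a hypothetical stub:

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/downstream-work"  # placeholder

    def call_unreliable_dependency(payload):
        """Hypothetical stand-in for the real call to the downstream system."""
        print("processing", payload)

    def enqueue(order):
        # The caller returns immediately; the flaky dependency is no longer on
        # the critical path of the user request.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(order))

    def worker_loop():
        while True:
            messages = sqs.receive_message(
                QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
            ).get("Messages", [])
            for msg in messages:
                call_unreliable_dependency(json.loads(msg["Body"]))
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])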

This pillar also crosses over with cost optimization, since a fully redundant system operating at five nines can be quite expensive! You might need to make some trade-offs and supply your users with a system that has an acceptable level of downtime, such as three nines (roughly 8.8 hours of downtime per year), in exchange for a more affordable service.

As we mentioned in several chapters in this book, service limits are often a surprising cause of downtime and an excellent reason to maintain a Business Support contract, so that you can quickly get someone on the phone to bump up your soft limits if you reach capacity. Soft limits are there to protect you from accidentally over-provisioning resources and waking up to an expensive bill, but if you forget about them, they will cause downtime. Have a strategy for studying limits, raising them where appropriate, and monitoring your resources so that you know how much headroom you have left at any given time.

Refer to the official AWS white paper for more information: https://d1.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf.

Cost optimization

Most conversations about migrating to the cloud start with cost savings. Although, in reality, it’s not the most compelling reason to move to the cloud, it’s definitely the first thing on the minds of executives who are making the decisions and writing the checks. In an environment where you can spin up thousands of new resources in minutes at the click of a button, you can just as easily run up a high bill if you aren’t careful.

Study the cost optimization pillar of the AWS Well-Architected Framework to learn how to be careful.

If you have just started researching a move to the cloud and you have done a few simple back-of-the-napkin calculations, you may be scratching your head and wondering where the reported cost savings of the cloud come from. Optimizing an infrastructure for cost does not come easily, just as it’s not easy to create a compelling and useful application that users enjoy. It takes continuous effort and diligent study of all the options available to you, and you will need to move beyond a simple lift-and-shift of your on-premises applications onto EC2 instances, carried out by administrators who have only a superficial understanding of AWS.

In the Performance efficiency section, we reviewed the different EC2 instance families. As we know, choosing the right one will make a huge difference to your costs. But there are more than 140 services now offered by AWS, and EC2 is only one of them! Learning the service landscape and finding components that you can offload from instances is where the real cost savings start to happen. And for applications that still require EC2, learning how purchasing Reserved Instances or using Spot Instances can save you significant amounts of money is crucial.

When calculating your Total Cost of Ownership (TCO) in the cloud, don’t forget to add human resources to the cloud resources you have provisioned. Some roles change, some go away, and some are replaced by completely new functions that you need to understand in order to paint the entire picture when it comes to the money you are spending to support a workload.

Use the TCO calculator to compare the costs of on-premises applications to running those applications in the cloud: https://aws.amazon.com/tco-calculator/.

Use automation to shut down resources that you don’t need, especially if you have parts of your application that are not in use 24 hours per day. A perfect example of this is development environments and servers that are used for testing. When a developer creates a web-based IDE using AWS Cloud9, the environment is configured to automatically hibernate after it has gone unused for a specified amount of time. If you use AWS CodeBuild as your build environment, containers are only in use for the amount of time that it takes to complete the build. This is a huge advantage over on-premises development environments, where all of those resources require a big upfront investment and continuous maintenance.
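
A sketch of that kind of shutdown automation follows: a Lambda handler, invoked by a nightly EventBridge schedule, that stops every running instance tagged Environment=dev. The tag key and value are assumptions about your own tagging scheme:

    import boto3

    ec2 = boto3.client("ec2")

    def lambda_handler(event, context):
        # Find running instances tagged as development resources (assumed tag scheme).
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "tag:Environment", "Values": ["dev"]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]

        instance_ids = [
            instance["InstanceId"]
            for reservation in reservations
            for instance in reservation["Instances"]
        ]

        if instance_ids:
            ec2.stop_instances(InstanceIds=instance_ids)
        return {"stopped": instance_ids}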

It’s possible to categorize your AWS resources using tags. You can use resource tags to allocate costs to various departments. Ideally, you should implement some sort of chargeback system so that business owners are responsible for their portion of the cloud infrastructure.
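
Once resources carry a cost allocation tag such as Department (activated in the billing console), the spend per department can be pulled with the Cost Explorer API for a chargeback report. The following sketch assumes that tag key, and the date range is a placeholder:

    import boto3

    ce = boto3.client("ce")  # Cost Explorer

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2023-01-01", "End": "2023-02-01"},  # placeholder month
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "Department"}],  # assumed cost allocation tag
    )

    for group in response["ResultsByTime"][0]["Groups"]:
        department = group["Keys"][0]          # e.g. "Department$Engineering"
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(department, amount)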

Make use of AWS Trusted Advisor to identify areas where you are spending money on underutilized resources. Upgrade to a Business Support contract on all production accounts so that you can enjoy the full benefits that Trusted Advisor has to offer. Support is one service you definitely don’t want to skimp on. The Cost Optimization screen in Trusted Advisor is worth its weight in gold. If you don’t have a Business Support contract, that valuable information is hidden behind a prompt asking you to upgrade your support plan.

Do your best to estimate your future charges on AWS, but in the end, the only way to know exactly what an actual, running workload will cost is to test it. With IaC tools such as AWS CloudFormation, you can quickly spin up a test environment, run it for a short period of time to exercise your applications and gauge your costs, and then tear it down just as quickly.

To dive into all the many ways you can optimize your costs on AWS, read the official white paper: https://d1.awsstatic.com/whitepapers/architecture/AWS-Cost-Optimization-Pillar.pdf.
