Cloud Principles

This post explains some of the cloud principles to apply when working with Amazon Web Services. Though the references here are to AWS services, the principles can be used across multiple clouds.

Principles:
– Design for failure and nothing will fail:
  • What happens if a node in your system fails? How do you recognize that failure? How do you replace that node? What kind of scenarios do you have to plan for?
  • What are my single points of failure? If a load balancer is sitting in front of an array of application servers, what if that load balancer fails?
  • If there are master and slaves in your architecture, what if the master node fails? How does the failover occur and how is a new slave instantiated and brought into sync with the master?
  • What happens to my application if a dependent service changes its interface?
  • What if a downstream service times out or returns an exception?
  • What if the cache keys grow beyond the memory limit of an instance?
Best Practices:
  1. Fail over gracefully using Elastic IPs: An Elastic IP is a static IP address that is dynamically re-mappable. You can quickly remap it and fail over to another set of servers so that your traffic is routed to the new servers. It works well when you want to upgrade from old to new versions or in case of hardware failures.
  2. Utilize multiple Availability Zones: Availability Zones are conceptually like logical datacenters. By deploying your architecture to multiple Availability Zones, you can ensure high availability. Utilize Amazon RDS Multi-AZ [21] deployment functionality to automatically replicate database updates across multiple Availability Zones.
  3. Maintain an Amazon Machine Image so that you can restore and clone environments very easily in a different Availability Zone; maintain multiple database slaves across Availability Zones and set up hot replication.
  4. Utilize Amazon CloudWatch (or various real-time open source monitoring tools) to get more visibility and take appropriate actions in case of hardware failure or performance degradation. Set up an Auto Scaling group to maintain a fixed fleet size so that it replaces unhealthy Amazon EC2 instances with new ones.
  5. Utilize Amazon EBS and set up cron jobs so that incremental snapshots are automatically uploaded to Amazon S3 and data is persisted independent of your instances. 
  6. Utilize Amazon RDS and set the retention period for backups, so that it can perform automated backups.
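The fixed-fleet idea in item 4 can be sketched as a small reconciliation step: drop whatever is unhealthy, then launch replacements until the fleet is back at its target size. This is a plain-Python simulation, not the Auto Scaling API; the `Instance` class, the `reconcile` function and the `launch_from_image` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


def reconcile(fleet: List[Instance], desired: int,
              launch: Callable[[], Instance]) -> List[Instance]:
    """Drop unhealthy instances and launch replacements up to `desired`."""
    survivors = [inst for inst in fleet if inst.healthy]
    while len(survivors) < desired:
        survivors.append(launch())
    return survivors[:desired]


# Hypothetical launcher: a real one would start an instance from an AMI.
_ids = count(1)


def launch_from_image() -> Instance:
    return Instance(instance_id=f"i-{next(_ids):04d}")


fleet = [launch_from_image() for _ in range(3)]
fleet[1].healthy = False  # simulate a hardware failure on the second node
fleet = reconcile(fleet, desired=3, launch=launch_from_image)
print([inst.instance_id for inst in fleet])
```

In a real deployment the health flag would come from a monitoring service such as CloudWatch, and the reconciliation would run continuously rather than once.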
– Decouple your components:
The more loosely coupled the components of the system, the bigger and better it scales.
  • Which business component or feature could be isolated from the current monolithic application and run standalone?
  • And then how can I add more instances of that component without breaking my current system and at the same time serve more users?
  • How much effort will it take to encapsulate the component so that it can interact with other components asynchronously?
Best Practices:
  1. Use Amazon SQS to isolate components 
  2. Use Amazon SQS as buffers between components
  3. Design every component such that it exposes a service interface, is responsible for its own scalability in all appropriate dimensions, and interacts with other components asynchronously
  4. Bundle the logical construct of a component into an Amazon Machine Image so that it can be deployed more often
  5. Make your applications as stateless as possible. Store session state outside the component (in Amazon SimpleDB, if appropriate)
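To make the buffering idea concrete, here is a minimal sketch in which an in-process `queue.Queue` stands in for an Amazon SQS queue. The producer and consumer never call each other directly; they only share the queue, so either side can be scaled or replaced on its own. The message format and the function names are assumptions made for this example.

```python
import queue
import threading

# Stand-in for an Amazon SQS queue: a bounded in-process buffer.
buffer = queue.Queue(maxsize=100)
_DONE = object()  # sentinel telling the consumer to stop
results = []


def producer(n_jobs: int) -> None:
    """Front-end component: enqueue work without waiting for it to finish."""
    for job_id in range(n_jobs):
        buffer.put({"job_id": job_id, "payload": f"image-{job_id}.png"})
    buffer.put(_DONE)


def consumer() -> None:
    """Back-end component: pull work from the buffer at its own pace."""
    while True:
        message = buffer.get()
        if message is _DONE:
            break
        results.append(f"processed {message['payload']}")


worker = threading.Thread(target=consumer)
worker.start()
producer(5)
worker.join()
print(results)
```

With a real queue service the two sides would run on different machines, and the sentinel would be replaced by the consumer simply polling for new messages.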
– Implement elasticity. It can be implemented in three ways:
  1. Proactive Cyclic Scaling: Periodic scaling that occurs at fixed intervals (daily, weekly, monthly, quarterly)
  2. Proactive Event-based Scaling: Scaling just when you are expecting a big surge of traffic requests due to a scheduled business event (new product launch, marketing campaigns)
  3. Auto-scaling based on demand: By using a monitoring service, your system can send triggers to take appropriate actions so that it scales up or down based on metrics (utilization of the servers or network I/O, for instance)
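Demand-based scaling boils down to a rule that maps a metric to a fleet size. The sketch below is one simple way to write such a rule; the thresholds (70% and 30% CPU), the fleet limits and the function name `desired_capacity` are all assumptions picked for illustration, not defaults of any AWS service.

```python
def desired_capacity(current: int, cpu_percent: float,
                     scale_out_at: float = 70.0, scale_in_at: float = 30.0,
                     minimum: int = 2, maximum: int = 10) -> int:
    """Return the new fleet size after one evaluation of the metric."""
    if cpu_percent > scale_out_at:
        target = current + 1      # busy: add one server
    elif cpu_percent < scale_in_at:
        target = current - 1      # idle: remove one server
    else:
        target = current          # within the comfortable band
    # Never leave the allowed range, whatever the metric says.
    return max(minimum, min(maximum, target))


fleet_size = 2
for cpu in [45.0, 82.0, 91.0, 60.0, 12.0]:
    fleet_size = desired_capacity(fleet_size, cpu)
    print(f"cpu={cpu:5.1f}%  fleet={fleet_size}")
```

Keeping a gap between the scale-out and scale-in thresholds avoids flapping when the metric hovers around a single cut-off value.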
Automate Your Infrastructure:
  • Create a library of “recipes” – small frequently-used scripts (for installation and configuration)
  • Manage the configuration and deployment process using agents bundled inside an AMI 
  • Bootstrap your instances
Benefits of bootstrapping your instances:
  1. Recreate the (Dev, Staging, Production) environment with a few clicks and minimal effort
  2. More control over your abstract cloud-based resources
  3. Reduce human-induced deployment errors
  4. Create a self-healing and self-discoverable environment which is more resilient to hardware failure
Best Practices:
  1. Define Auto Scaling groups for different clusters using the Auto Scaling feature in Amazon EC2.
  2. Monitor your system metrics (CPU, Memory, Disk I/O, Network I/O) using Amazon CloudWatch and take appropriate actions (launching new instances from AMIs dynamically using the Auto Scaling service) or send notifications.
  3. Store and retrieve machine configuration information dynamically: Utilize Amazon DynamoDB to fetch config data during boot-time of an instance (e.g. database connection strings). SimpleDB may also be used to store information about an instance such as its IP address, machine name and role.
  4. Design a build process such that it dumps the latest builds to a bucket in Amazon S3; download the latest version of an application from that bucket during system startup.
  5. Invest in building resource management tools (automated scripts, pre-configured images) or use smart open source configuration management tools like Chef, Puppet, CFEngine or Genome.
  6. Bundle Just Enough Operating System (JeOS [22]) and your software dependencies into an Amazon Machine Image so that it is easier to manage and maintain. Pass configuration files or parameters at launch time and retrieve user data [23] and instance metadata after launch.
  7. Reduce bundling and launch time by booting from Amazon EBS volumes [24] and attaching multiple Amazon EBS volumes to an instance. Create snapshots of common volumes and share snapshots [25] among accounts wherever appropriate.
  8. Application components should not assume the health or location of the hardware they are running on. For example, dynamically attach the IP address of a new node to the cluster. Automatically fail over and start a new clone in case of a failure.
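Items 3 and 6 together describe a boot sequence: read the parameters passed at launch, work out the instance's role, then pull the matching configuration from a store. The sketch below simulates that flow with a dictionary standing in for a DynamoDB or SimpleDB table. The `key=value` user-data format, the store layout and every value in it are made up for this example.

```python
from typing import Dict

# Stand-in for a configuration table keyed by role (hypothetical contents).
CONFIG_STORE: Dict[str, Dict[str, str]] = {
    "web": {"db_url": "postgres://db.internal.example:5432/app",
            "build_key": "builds/web/latest.tar.gz"},
    "worker": {"db_url": "postgres://db.internal.example:5432/app",
               "build_key": "builds/worker/latest.tar.gz"},
}


def parse_user_data(user_data: str) -> Dict[str, str]:
    """Parse 'key=value' lines passed to the instance at launch time."""
    pairs = (line.split("=", 1) for line in user_data.splitlines() if "=" in line)
    return {key.strip(): value.strip() for key, value in pairs}


def bootstrap(user_data: str) -> Dict[str, str]:
    """Work out this instance's role and return its merged configuration."""
    launch_params = parse_user_data(user_data)
    role = launch_params["role"]
    config = dict(CONFIG_STORE[role])  # copy so the store is not mutated
    config.update(launch_params)       # launch-time values take precedence
    return config


config = bootstrap("role=worker\nenvironment=staging")
print(config)
```

Because the image itself carries no role-specific settings, the same image can boot as a web server or a worker depending only on what is passed at launch.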
– Think Parallel: The cloud makes parallelization effortless.
Best Practices:
  1. Multi-thread your Amazon S3 requests  
  2. Multi-thread your Amazon SimpleDB GET and BATCHPUT requests
  3. Create a JobFlow using the Amazon Elastic MapReduce Service for each of your daily batch processes (indexing, log analysis etc.) which will compute the job in parallel and save time.
  4. Use the Elastic Load Balancing service and spread your load across multiple web app servers dynamically
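Multi-threading requests, as in items 1 and 2, usually means handing a list of keys to a thread pool. Here is a minimal sketch using Python's standard `concurrent.futures`; `fetch_object` is a hypothetical stand-in for a single S3 GET and just sleeps to mimic network latency, and the key names are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_object(key: str) -> str:
    """Hypothetical stand-in for one S3 GET; sleeps to mimic network latency."""
    time.sleep(0.1)
    return f"contents of {key}"


keys = [f"logs/part-{n:03d}.gz" for n in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() returns results in input order even though the calls overlap.
    bodies = list(pool.map(fetch_object, keys))
elapsed = time.perf_counter() - start

print(f"fetched {len(bodies)} objects in {elapsed:.2f}s")
```

Run one after another, the eight calls would take roughly eight times the latency of one; run in a pool, the waits overlap, which is where the saving comes from for I/O-bound requests.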
– Keep Dynamic Data close to Compute and Static Data close to End User:
Best Practices:
  1. Ship your data drives to Amazon using the Import/Export service. It may be cheaper and faster to move large amounts of data by sneakernet [28] than to upload it over the Internet.
  2. Utilize the same Availability Zone to launch a cluster of machines
  3. Create a distribution of your Amazon S3 bucket and let Amazon CloudFront cache content in that bucket across its edge locations around the world
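Whether shipping drives beats uploading (item 1) is a matter of simple arithmetic. The sketch below estimates upload time for a given data size and link speed; the 80% link utilization and the example sizes are assumptions, and `upload_days` is a name made up for this example.

```python
def upload_days(data_terabytes: float, link_mbps: float,
                utilization: float = 0.8) -> float:
    """Days needed to upload the data over a link at the given utilization."""
    bits = data_terabytes * 1e12 * 8                   # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * utilization)   # usable bits per second
    return seconds / 86_400                            # seconds in a day


for mbps in (10, 100, 1000):
    print(f"10 TB over {mbps:>4} Mbps: {upload_days(10, mbps):6.1f} days")
```

Under these assumptions, 10 TB over a 100 Mbps link takes about 11.6 days, so if a courier can deliver the drives in less time than that, shipping is the faster option.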