Cloud Principles

This post explains some cloud principles to apply when working with Amazon Web Services. Although the references here are to AWS services, the principles can be used across multiple clouds.

Principles:
– Design for failure and nothing will fail:
  • What happens if a node in your system fails? How do you recognize that failure? How do you replace that node? What kinds of scenarios do you have to plan for?
  • What are your single points of failure? If a load balancer is sitting in front of an array of application servers, what happens if that load balancer fails?
  • If there are master and slave nodes in your architecture, what happens if the master fails? How does failover occur, and how is a new slave instantiated and brought into sync with the master?
  • What happens to your application if a dependent service changes its interface?
  • What if a downstream service times out or returns an exception?
  • What if the cache keys grow beyond the memory limit of an instance?
Best Practices:
  1. Fail over gracefully using Elastic IPs: an Elastic IP is a static IP address that can be dynamically remapped. You can quickly remap it to fail over to another set of servers so that traffic is routed to the new servers. This works well when upgrading from old to new versions, or in case of hardware failure.
  2. Utilize multiple Availability Zones: Availability Zones are conceptually like logical datacenters. By deploying your architecture across multiple Availability Zones, you can ensure high availability. Utilize the Amazon RDS Multi-AZ deployment feature to automatically replicate database updates across multiple Availability Zones.
  3. Maintain an Amazon Machine Image (AMI) so that you can restore and clone environments easily in a different Availability Zone; maintain multiple database slaves across Availability Zones and set up hot replication.
  4. Utilize Amazon CloudWatch (or other real-time open-source monitoring tools) to get more visibility and take appropriate action in case of hardware failure or performance degradation. Set up an Auto Scaling group to maintain a fixed fleet size, so that unhealthy Amazon EC2 instances are replaced with new ones.
  5. Utilize Amazon EBS and set up cron jobs so that incremental snapshots are automatically uploaded to Amazon S3 and data is persisted independently of your instances.
  6. Utilize Amazon RDS and set the retention period for backups, so that it can perform automated backups.
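The "what if a downstream service times out" question above is typically answered with a retry wrapper around the call. A minimal sketch, assuming the downstream call raises `TimeoutError` on failure (the `flaky_service` function is a hypothetical stand-in):

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1):
    """Call a flaky downstream operation, retrying with exponential
    backoff plus jitter; re-raise after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter spreads out retry storms.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Hypothetical downstream call that times out twice, then succeeds.
attempts = {"n": 0}
def flaky_service():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("downstream timed out")
    return "ok"

print(call_with_retries(flaky_service))  # ok
```

The jitter matters: if every client retries on the same schedule after a failure, the retries themselves can overwhelm the recovering service.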
– Decouple your components:
The more loosely coupled the components of a system are, the bigger and better it scales.
  • Which business component or feature could be isolated from the current monolithic application and run standalone?
  • How can I then add more instances of that component without breaking the current system, while serving more users?
  • How much effort will it take to encapsulate the component so that it can interact with other components asynchronously?
Best Practices:
  1. Use Amazon SQS to isolate components.
  2. Use Amazon SQS as a buffer between components.
  3. Design every component so that it exposes a service interface, is responsible for its own scalability in all appropriate dimensions, and interacts with other components asynchronously.
  4. Bundle the logical construct of a component into an Amazon Machine Image so that it can be deployed more often.
  5. Make your applications as stateless as possible. Store session state outside of the component (in Amazon SimpleDB, if appropriate).
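The buffering idea can be sketched locally with Python's `queue` module standing in for an Amazon SQS queue (the producer/consumer names and message shape are illustrative): the front end enqueues work instead of calling the back end directly, so the two components scale and fail independently.

```python
import queue
import threading

# A bounded in-memory queue standing in for an SQS queue between components.
buffer = queue.Queue(maxsize=100)

def producer(n):
    # The front-end component enqueues jobs instead of calling the
    # back end synchronously.
    for i in range(n):
        buffer.put({"job_id": i})
    buffer.put(None)  # sentinel: no more work

def consumer(results):
    # The back-end component drains the queue at its own pace.
    while True:
        msg = buffer.get()
        if msg is None:
            break
        results.append(msg["job_id"])

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer(5)
t.join()
print(results)  # [0, 1, 2, 3, 4]
```

With real SQS the queue also survives consumer crashes, which is exactly what makes it a safe seam between components.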
– Implement elasticity
  1. Proactive Cyclic Scaling: periodic scaling that occurs at a fixed interval (daily, weekly, monthly, quarterly).
  2. Proactive Event-based Scaling: scaling just when you are expecting a big surge of traffic due to a scheduled business event (a new product launch, marketing campaigns).
  3. Auto-scaling based on demand: using a monitoring service, your system can send triggers to scale up or down based on metrics (server utilization or network I/O, for instance).
Automate Your Infrastructure
  • Create a library of “recipes” – small frequently-used scripts (for installation and configuration)
  • Manage the configuration and deployment process using agents bundled inside an AMI 
  • Bootstrap your instances
Bootstrap Your Instances
  1. Recreate the environment (dev, staging, production) with a few clicks and minimal effort.
  2. Gain more control over your abstract, cloud-based resources.
  3. Reduce human-induced deployment errors.
  4. Create a self-healing and self-discoverable environment that is more resilient to hardware failure.
Best Practices:
  1. Define Auto Scaling groups for different clusters using the Auto Scaling feature in Amazon EC2.
  2. Monitor your system metrics (CPU, memory, disk I/O, network I/O) using Amazon CloudWatch and take appropriate action (such as launching new AMIs dynamically via the Auto Scaling service) or send notifications.
  3. Store and retrieve machine configuration information dynamically: utilize Amazon DynamoDB to fetch configuration data during boot time of an instance (e.g., database connection strings). SimpleDB may also be used to store information about an instance, such as its IP address, machine name, and role.
  4. Design a build process that dumps the latest builds to a bucket in Amazon S3; download the latest version of the application from Amazon S3 during system startup.
  5. Invest in building resource-management tools (automated scripts, pre-configured images), or use smart open-source configuration-management tools like Chef, Puppet, CFEngine, or Genome.
  6. Bundle a Just Enough Operating System (JeOS) and your software dependencies into an Amazon Machine Image so that it is easier to manage and maintain. Pass configuration files or parameters at launch time, and retrieve user data and instance metadata after launch.
  7. Reduce bundling and launch time by booting from Amazon EBS volumes and attaching multiple Amazon EBS volumes to an instance. Create snapshots of common volumes and share snapshots among accounts wherever appropriate.
  8. Application components should not assume the health or location of the hardware they run on. For example, dynamically attach the IP address of a new node to the cluster; automatically fail over and start a new clone in case of failure.
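Practice 6 above (passing configuration at launch time) often boils down to parsing simple key=value user data at boot. A minimal sketch; on a real EC2 instance the raw string would come from the instance metadata service, here it is inlined and the keys are illustrative:

```python
def parse_user_data(raw):
    """Parse simple key=value user data passed at instance launch
    into a configuration dictionary."""
    config = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

# Inlined stand-in for the string an instance would fetch at boot.
user_data = """
# application settings
db_host=db.example.internal
db_port=5432
role=web
"""
print(parse_user_data(user_data))
```

Keeping the AMI generic and injecting configuration this way is what lets one image serve dev, staging, and production.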
– Think Parallel: The cloud makes parallelization effortless.
Best Practices:
  1. Multi-thread your Amazon S3 requests  
  2. Multi-thread your Amazon SimpleDB GET and BATCHPUT requests
  3. Create a job flow using the Amazon Elastic MapReduce service for each of your daily batch processes (indexing, log analysis, etc.); it will run the job in parallel and save time.
  4. Use the Elastic Load Balancing service and spread your load across multiple web app servers dynamically
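Multithreading requests (practices 1 and 2) looks like this with a thread pool; the `fetch_object` function below is a stand-in for an S3 GET, since the real call is an I/O-bound HTTP request, which is exactly what benefits from threading:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_object(key):
    # Stand-in for an Amazon S3 GET; a real version would issue an
    # HTTP request and return the object body.
    return f"contents-of-{key}"

keys = [f"logs/2024-01-{day:02d}.gz" for day in range(1, 11)]

# Issue the I/O-bound requests concurrently instead of one by one;
# map() preserves the order of the input keys.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_object, keys))

print(len(results))  # 10
```

The same pattern applies to SimpleDB GET and BATCHPUT requests: the speedup comes from overlapping network waits, not from extra CPU.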
– Keep Dynamic Data close to Compute and Static Data close to End User:
Best Practices:
  1. Ship your data drives to Amazon using the Import/Export service. It may be cheaper and faster to move large amounts of data using this "sneakernet" than to upload it over the Internet.
  2. Utilize the same Availability Zone to launch a cluster of machines.
  3. Create an Amazon CloudFront distribution for your Amazon S3 bucket and let CloudFront cache the bucket's content across its edge locations around the world.

Amazon Web Services (AWS) Security – an outside view

Shared Responsibility Model:
Secure SDLC
     – static code analysis run as part of the build process
     – threat modeling
MFA
     – Google authenticator/RSA
MFA for AWS service APIs
     – e.g., terminating an EC2 instance
     – e.g., accessing sensitive data in an S3 bucket
Security of Access Keys
     – must be secured 
     – use IAM roles for EC2 management
Enable CloudTrail
Run Trusted Advisor
EC2:
– use encrypted file systems
– disable password-only access to your guests
– utilize some form of multi-factor authentication to gain access to instances (or, at a minimum, certificate-based SSH version 2 access)
– use a privilege-escalation mechanism with logging on a per-user basis
– utilize certificate-based SSHv2 to access the virtual instance
– disable remote root login
– use command-line logging
– use 'sudo' for privilege escalation
– generate your own key pairs
Firewall
– open only the ports that are required
– restrict access to specific CIDR blocks
– consider IPTables on the instance as an additional layer
EBS
– encrypt volumes
– use DoD methods to wipe volumes before deleting them
ELB
– is a particular cipher required, e.g. for PCI/SOX compliance?
– use Server Order Preference
– use Perfect Forward Secrecy
VPC
– VPC security groups
– IP range, Internet gateway, virtual private gateway
– requires the Secret Access Key of the account
– consider subnets and route tables
– consider firewalls/security groups
– Network ACLs: inbound/outbound rules at the subnet level within the VPC
– ENI: Elastic Network Interface, for a management network or a security appliance on the network
CloudFront:
– By default, you can deliver content to viewers over HTTPS by using https://dxxxxx.cloudfront.net/image.jpg. If you want to deliver your content over HTTPS using your own domain name and your own SSL certificate, you can use SNI Custom SSL or Dedicated IP Custom SSL.
– With Server Name Indication (SNI) Custom SSL, CloudFront relies on the SNI extension of the TLS protocol, which allows multiple domains to be served over a shared IP address; clients that do not support SNI cannot use it.
– With Dedicated IP Custom SSL, CloudFront dedicates IP addresses to your SSL certificate at each CloudFront edge location so that CloudFront can associate the incoming requests with the proper SSL certificate.
S3 security:
– Use IAM policies
– Use of ACL to grant read/write access to other AWS account users
– Bucket policies: grant or deny permissions across some or all of the objects within a single bucket.
– Restrict access to specific resources using policy keys: based on request time (Date condition), whether the request was sent using SSL (Boolean condition), the requester's IP address (IP address condition), or the requester's client (String condition).
– Use SSL endpoints for S3, whether access is over the Internet or from EC2.
– Use a client-side encryption library.
– Use server-side encryption (SSE): S3-managed encryption.
– Note that S3 metadata is not encrypted.
– Archive S3 data to Glacier at a regular frequency.
– Control S3 deletes via MFA Delete.
– CORS (cross-origin resource sharing): allows S3 objects to be referenced from HTML pages on other domains; without it, such requests are blocked as cross-origin.
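The policy-key conditions listed above combine in a bucket policy. A hedged example (the bucket name, CIDR block, and cutoff date are placeholders) that allows GETs only over SSL, only from one address range, and only before a given time:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGetFromOfficeOverSSL",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-bucket/*",
      "Condition": {
        "Bool": {"aws:SecureTransport": "true"},
        "IpAddress": {"aws:SourceIp": "203.0.113.0/24"},
        "DateLessThan": {"aws:CurrentTime": "2026-01-01T00:00:00Z"}
      }
    }
  ]
}
```

All conditions in a statement must hold for it to apply, which is what makes these keys composable.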
DynamoDB:
– DynamoDB resources and API permissions are controlled via IAM.
– Database-level permissions allow/deny access at the item (row) and attribute (column) level.
– Fine-grained access control allows you to specify, via policy, under what circumstances a user or application can access a DynamoDB table.
– An IAM policy can restrict access to individual items in a table, to attributes in those items, or both.
– Allow Web Identity Federation via AWS STS (Security Token Service) instead of using IAM users.
– Each request sent to DynamoDB must contain an HMAC-SHA256 signature in the header.
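The HMAC-SHA256 requirement above comes from AWS's request-signing scheme (Signature Version 4). The signing-key derivation step can be sketched with the standard library; the secret key, date, and string-to-sign below are fake placeholders, and a real signer would also build the canonical request that SigV4 specifies:

```python
import hashlib
import hmac

def sign(key, msg):
    """One HMAC-SHA256 step in the Signature Version 4 key derivation."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def derive_signing_key(secret_key, date, region, service):
    """Derive the per-day, per-region, per-service signing key
    used to sign each request (Signature Version 4 chain)."""
    k_date = sign(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = sign(k_date, region)
    k_service = sign(k_region, service)
    return sign(k_service, "aws4_request")

# Placeholder credentials; never hard-code real keys.
key = derive_signing_key("EXAMPLEKEY", "20240115", "us-east-1", "dynamodb")
signature = hmac.new(key, b"string-to-sign", hashlib.sha256).hexdigest()
print(len(signature))  # 64
```

Scoping the key to a date, region, and service is what limits the damage if a derived signing key ever leaks.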
Amazon RDS:
– Access control: master user account and password; create additional user accounts as needed. DB security groups, similar to EC2 security groups, default to "deny all"; access is granted by opening the database port in the firewall to a network IP range or an EC2 security group.
– Using IAM, further granular access can be granted.
– Network isolation in Multi-AZ deployments using DB subnet groups.
– An RDS instance in a VPC can be accessed from EC2 instances outside the VPC using an SSH bastion host and an Internet gateway.
– Transport encryption is available for RDS: an SSL certificate is installed on MySQL and SQL Server instances, so the application-to-database connection is secure.
– Encryption at rest is supported via TDE (Transparent Data Encryption) for SQL Server and Oracle Enterprise Edition.
– Encryption at rest is not supported natively for MySQL; the application must encrypt data before writing it if data-at-rest encryption is required.
– Point-in-time recovery via automated backups, with database and transaction logs stored for a user-specified retention period.
– Restore to any point within the last 5 minutes; backups can be retained for up to 35 days.
– During backup, storage I/O is suspended; with a Multi-AZ deployment, however, the backup is taken on the standby, so there is no performance impact.
AWS RedShift:
– The cluster is closed to everyone by default.
– Utilize security groups to control network access to the cluster.
– Database user permissions are granted per cluster rather than per table. A user can see data in table rows generated by their own activities; rows generated by others are not visible to them.
– The user who creates an object is its owner, and only the owner or a superuser can query, modify, or grant permissions on the object.
– Redshift data is spread across multiple compute nodes in a cluster. Snapshot backups are uploaded to S3 with a user-defined retention period.
– Four-tier Key Based architecture:
  • Data Encryption Keys: Encrypts Data Blocks in Cluster
  • Database Key: Encrypts Data Encryption Keys in Cluster
  • Cluster Key: Encrypts Database Keys in Cluster. Use AWS or HSM to store the cluster key.
  • Master Key: Encrypts Cluster Key, if stored in AWS. Encrypts the Cluster-Key-Encrypted-Database-Key if Cluster key is in HSM.
– RedShift uses Hardware-Accelerated SSL
– Offers strong cipher suites, including Elliptic Curve Diffie-Hellman Ephemeral (ECDHE), which enables Perfect Forward Secrecy (PFS).
AWS ElastiCache:
– Cache security groups act like a firewall.
– By default, network access is turned off.
– Use the Authorize Cache Security Group Ingress API/CLI call to authorize an EC2 security group (which in turn allows its EC2 instances).
– Backups/snapshots of an ElastiCache Redis cluster: point-in-time or scheduled backups.
AWS CloudSearch:
– Access to search domain’s endpoint is restricted by IP address so that only authorized hosts can submit documents and send search requests. 
– IP address authorization is used only to control access to the document and search endpoints.
AWS SQS:
– Access is based on the AWS account/IAM user; once authenticated, the user has full access to all user operations.
– By default, access to an individual queue is restricted to the AWS account that created it.
– Data stored in SQS is not encrypted by AWS, but can be encrypted/decrypted by the application.
AWS SNS:
– Amazon SNS delivers notifications to clients using a “push” mechanism that eliminates the need to periodically check or “poll” for new information and updates. Amazon SNS can be leveraged to build highly reliable, event-driven workflows and messaging applications without the need for complex middleware and application management. The potential uses for Amazon SNS include monitoring applications, workflow systems, time-sensitive information updates, mobile applications, and many others.
– SNS provides access-control mechanisms so that topics and messages are secured against unauthorized access.
– Topic owners can set policies on who can publish/subscribe to a topic.
AWS SWF:
– Access is granted based on an AWS account/IAM user. 
– Actors that participate in the execution of a workflow – deciders, activity workers, workflow administrators – must be IAM users under the AWS account that owns the SWF resources. Other AWS accounts cannot be granted access to SWF workflows.
AWS SES:
– AWS SES requires users to verify their email address or domain in order to confirm that they own it and to prevent others from using it. To verify a domain, Amazon SES requires the sender to publish a DNS record that Amazon SES supplies as proof of control over the domain. 
– SES uses content-filtering technologies to help detect and block messages containing viruses or malware before they can be sent.
– SES maintains complaint feedback loops with major ISPs.
– SES supports authentication mechanisms such as Sender Policy Framework (SPF) and DomainKeys Identified Mail (DKIM). When you authenticate an email, you provide evidence to ISPs that you own the domain.
– For SES over SMTP, the connection must be encrypted using TLS; supported mechanisms are STARTTLS and TLS Wrapper.
– For SES over HTTP, communication will be protected by TLS through AWS SES’s HTTPS endpoint.
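The SPF authorization mentioned above is a single DNS TXT record; a commonly documented form (the domain is a placeholder) that authorizes Amazon SES to send on the domain's behalf:

```text
; TXT record on example.com authorizing Amazon SES senders
example.com.  IN  TXT  "v=spf1 include:amazonses.com ~all"
```

The `~all` soft-fails mail from any host not covered by the `include`, which is the usual starting point before tightening to `-all`.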
AWS Kinesis:
– Logical access to Kinesis is controlled via AWS IAM, which determines which Kinesis operations users have permission to perform.
– By associating an EC2 instance with an IAM role, the credentials that come with the role are made available to applications on that instance, avoiding the use of long-term AWS security credentials.
AWS IAM:
– Allows you to create multiple users and manage permissions for each user within an AWS account.
– User permissions must be granted explicitly.
– IAM is integrated with AWS Marketplace to control software subscriptions, usage, and cost.
– A role uses temporary security credentials to delegate access to a user or service that normally doesn't have access to AWS resources.
– Temporary security credentials are short-lived (12 hours by default) and cannot be reused after expiry.
– Temporary security credentials consist of a security token, an access key ID, and a secret access key.
– Useful in situations such as:
  • Federated (non-AWS) user access:
    • Identity federation between AWS and non-AWS users in a corporate identity and authorization system.
    • Using SAML, AWS acts as the service provider and gives users federated single sign-on (SSO) to the AWS Management Console, or federated access to call AWS APIs.
  • Cross-account access: for organizations that use multiple AWS accounts to manage their resources, a role can give users who have permissions in one account access to resources in another account.
  • Applications running on an EC2 instance that need to access AWS resources: if an EC2 instance needs to call S3 or DynamoDB, it can utilize a role, which eases management of a large fleet of instances and auto scaling.
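Cross-account access is configured through the trust policy attached to a role. A hedged example (the account ID is a placeholder) allowing principals in another account to assume the role via STS:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
      "Action": "sts:AssumeRole"
    }
  ]
}
```

The trust policy only says who may assume the role; what the role can actually do is defined separately by the permission policies attached to it.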
AWS CloudHSM:
– A dedicated Hardware Security Module (HSM) appliance provides secure cryptographic key storage and operations within an intrusion-resistant, tamper-evident device.
– Supports a variety of use cases, such as database encryption, Digital Rights Management (DRM), Public Key Infrastructure (PKI), authentication and authorization, document signing, and transaction processing.
– Supports some of the strongest cryptographic algorithms available: AES, RSA, ECC, etc.
– Connections to CloudHSM are available from EC2 and VPC via SSL/TLS, using two-way digital certificate authentication.
– A cryptographic partition is a logical and physical security boundary that restricts access to keys, so only the owner of the keys can control them and perform operations on the HSM.
– CloudHSM's tamper detection erases the cryptographic key material and generates event logs if tampering (physical or logical) is detected. After three unsuccessful attempts to access an HSM partition with admin credentials, the HSM appliance erases that partition.
CloudTrail:
– Enabling CloudTrail sends events to an S3 bucket within about 5 minutes. Data captured: information about every API call and where it originated (console, CLI, or SDK). Console sign-in events are also captured: a log record is created every time the AWS account owner, a federated user, or an IAM user signs in.
– CloudTrail access can be limited to specific users via IAM.