- What happens if a node in your system fails? How do you recognize that failure? How do you replace that node? What kinds of scenarios do you have to plan for?
- What are your single points of failure? If a load balancer is sitting in front of an array of application servers, what happens if that load balancer fails?
- If there are master and slave nodes in your architecture, what happens if the master fails? How does failover occur, and how is a new slave instantiated and brought into sync with the master?
- What happens to your application if a dependent service changes its interface?
- What if a downstream service times out or returns an exception? (A defensive-call sketch follows these questions.)
- What if the cache keys grow beyond the memory limit of an instance?
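A minimal sketch of handling the downstream-timeout question above, using Python's `requests` library; the service URL, timeout value, and fallback shape are hypothetical illustrations, not a prescribed design:

```python
import requests

def fetch_profile(user_id: int) -> dict:
    try:
        resp = requests.get(
            f"https://profile-service.internal/users/{user_id}",  # hypothetical service
            timeout=2.0,  # fail fast instead of hanging the caller
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Degrade gracefully (timeout or exception) rather than propagating
        # the failure to every upstream caller.
        return {"user_id": user_id, "profile": None, "degraded": True}
```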
- Failover gracefully using Elastic IPs: an Elastic IP is a static IP address that can be dynamically remapped. You can quickly remap it to fail over to another set of servers so that your traffic is routed to the new servers (see the sketch after this list). This works well when upgrading from old to new versions or recovering from hardware failures.
- Utilize multiple Availability Zones: Availability Zones are conceptually like logical datacenters. By deploying your architecture to multiple Availability Zones, you can ensure high availability. Utilize Amazon RDS Multi-AZ deployment functionality to automatically replicate database updates across multiple Availability Zones.
- Maintain an Amazon Machine Image so that you can restore and clone environments easily in a different Availability Zone; maintain multiple database slaves across Availability Zones and set up hot replication.
- Utilize Amazon CloudWatch (or various real-time open source monitoring tools) to get more visibility and take appropriate action in case of hardware failure or performance degradation. Set up an Auto Scaling group to maintain a fixed fleet size so that it replaces unhealthy Amazon EC2 instances with new ones.
- Utilize Amazon EBS and set up cron jobs so that incremental snapshots are automatically uploaded to Amazon S3 and data is persisted independently of your instances.
- Utilize Amazon RDS and set the retention period for backups so that it can perform automated backups.
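As referenced in the Elastic IP item above, a minimal failover sketch using boto3; the allocation and instance IDs are hypothetical placeholders, and credentials/region are assumed to be configured:

```python
import boto3

ec2 = boto3.client("ec2")

ALLOCATION_ID = "eipalloc-0123456789abcdef0"  # hypothetical Elastic IP allocation
STANDBY_INSTANCE_ID = "i-0fedcba9876543210"   # hypothetical standby server

# Remap the Elastic IP to the standby instance; traffic addressed to the
# static IP now reaches the new server.
ec2.associate_address(
    AllocationId=ALLOCATION_ID,
    InstanceId=STANDBY_INSTANCE_ID,
    AllowReassociation=True,  # detach from the failed instance if still associated
)
```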
- Which business component or feature could be isolated from the current monolithic application and run standalone?
- How can I then add more instances of that component without breaking the current system, while serving more users at the same time?
- How much effort will it take to encapsulate the component so that it can interact with other components asynchronously?
- Use Amazon SQS to isolate components
- Use Amazon SQS as a buffer between components (see the sketch after this list)
- Design every component so that it exposes a service interface, is responsible for its own scalability in all appropriate dimensions, and interacts with other components asynchronously
- Bundle the logical construct of a component into an Amazon Machine Image so that it can be deployed more often
- Make your applications as stateless as possible. Store session state outside of the component (in Amazon SimpleDB, if appropriate)
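A minimal sketch of the SQS-as-buffer item above, using boto3; the queue name and message shape are hypothetical:

```python
import json
import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="orders-queue")  # hypothetical queue

# Producer component: enqueue work instead of calling the consumer directly,
# so a slow or failed consumer does not block the producer.
queue.send_message(MessageBody=json.dumps({"order_id": 42, "action": "ship"}))

# Consumer component: long-poll, process, then delete.
for message in queue.receive_messages(WaitTimeSeconds=10, MaxNumberOfMessages=5):
    task = json.loads(message.body)
    print(f"processing order {task['order_id']}")
    message.delete()  # acknowledge only after successful processing
```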
- Proactive Cyclic Scaling: periodic scaling that occurs at a fixed interval (daily, weekly, monthly, quarterly)
- Proactive Event-based Scaling: scaling just when you are expecting a big surge of traffic due to a scheduled business event (new product launch, marketing campaign)
- Auto-scaling based on demand: by using a monitoring service, your system can send triggers to take appropriate actions so that it scales up or down based on metrics (utilization of the servers or network I/O, for instance)
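A minimal sketch of the demand-based approach above with boto3: a scale-out policy on a hypothetical Auto Scaling group named "web-asg", fired by a CloudWatch CPU alarm; the thresholds are illustrative:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Add one instance whenever the policy is triggered.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",  # hypothetical group
    PolicyName="scale-out-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Fire the policy when average CPU stays above 70% for two 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="web-asg-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```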
- Create a library of “recipes” – small frequently-used scripts (for installation and configuration)
- Manage the configuration and deployment process using agents bundled inside an AMI
- Bootstrap your instances
- Recreate the (Dev, Staging, Production) environments with a few clicks and minimal effort
- Gain more control over your abstract, cloud-based resources
- Reduce human-induced deployment errors
- Create a self-healing, self-discoverable environment that is more resilient to hardware failure
- Define Auto-scaling groups for different clusters using the Amazon Auto-scaling feature in Amazon EC2.
- Monitor your system metrics (CPU, Memory, Disk I/O, Network I/O) using Amazon CloudWatch and take appropriate actions (launching new AMIs dynamically using the Auto-scaling service) or send notifications.
- Store and retrieve machine configuration information dynamically: utilize Amazon DynamoDB to fetch configuration data during an instance's boot time (e.g., database connection strings). SimpleDB may also be used to store information about an instance, such as its IP address, machine name, and role.
- Design your build process so that it dumps the latest builds into a bucket in Amazon S3; download the latest version of the application from that bucket during system startup (a boot-time sketch follows this list).
- Invest in building resource management tools (automated scripts, pre-configured images) or use smart open source configuration management tools like Chef, Puppet, CFEngine, or Genome.
- Bundle Just Enough Operating System (JeOS) and your software dependencies into an Amazon Machine Image so that it is easier to manage and maintain. Pass configuration files or parameters at launch time and retrieve user data and instance metadata after launch.
- Reduce bundling and launch time by booting from Amazon EBS volumes and attaching multiple Amazon EBS volumes to an instance. Create snapshots of common volumes and share snapshots among accounts wherever appropriate.
- Application components should not assume the health or location of the hardware they are running on. For example, dynamically attach the IP address of a new node to the cluster. Automatically fail over and start a new clone in case of failure.
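As referenced above, a minimal boot-time bootstrap sketch with boto3: fetch configuration from DynamoDB and pull the latest build from S3. The table name, key schema, bucket, and paths are all hypothetical:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

# Fetch this instance's configuration (e.g., a database connection string).
config = dynamodb.Table("instance-config").get_item(  # hypothetical table
    Key={"role": "web-server"}                        # hypothetical key schema
)["Item"]
print("db endpoint:", config["db_connection_string"])

# Pull the latest application build that the build process dumped into S3.
s3.download_file("builds-bucket", "app/latest.tar.gz", "/opt/app/latest.tar.gz")
```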
- Multi-thread your Amazon S3 requests (a threading sketch follows this list)
- Multi-thread your Amazon SimpleDB GET and BATCHPUT requests
- Create a JobFlow using the Amazon Elastic MapReduce Service for each of your daily batch processes (indexing, log analysis, etc.), which will compute the job in parallel and save time.
- Use the Elastic Load Balancing service and spread your load across multiple web app servers dynamically
- Ship your data drives to Amazon using the Import/Export service. It may be cheaper and faster to move large amounts of data via sneakernet than to upload them over the Internet.
- Utilize the same Availability Zone when launching a cluster of machines
- Create a distribution of your Amazon S3 bucket and let Amazon CloudFront cache the content of that bucket across all 14 edge locations around the world
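As referenced in the multi-threading items above, a minimal sketch of parallel S3 GETs with boto3 and a thread pool; the bucket and keys are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
keys = ["logs/part-0001.gz", "logs/part-0002.gz", "logs/part-0003.gz"]

def fetch(key: str) -> bytes:
    return s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()

# Aggregate S3 throughput scales with concurrent requests, so fan out the GETs.
with ThreadPoolExecutor(max_workers=8) as pool:
    bodies = list(pool.map(fetch, keys))
```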
- Data Encryption Keys: encrypt the data blocks in the cluster
- Database Key: encrypts the data encryption keys in the cluster
- Cluster Key: encrypts the database key in the cluster. Use AWS or an HSM to store the cluster key.
- Master Key: encrypts the cluster key if it is stored in AWS; encrypts the cluster-key-encrypted database key if the cluster key is in an HSM.
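An illustrative sketch of the four-tier envelope-encryption hierarchy above, using the `cryptography` package's Fernet as a stand-in cipher; this mirrors the wrap/unwrap chain, not the actual cluster internals:

```python
from cryptography.fernet import Fernet

# Each tier's key encrypts the key (or data) one level below it.
master_key = Fernet.generate_key()    # in practice held in AWS or an HSM
cluster_key = Fernet.generate_key()
database_key = Fernet.generate_key()
data_key = Fernet.generate_key()

wrapped_cluster_key = Fernet(master_key).encrypt(cluster_key)
wrapped_database_key = Fernet(cluster_key).encrypt(database_key)
wrapped_data_key = Fernet(database_key).encrypt(data_key)
ciphertext = Fernet(data_key).encrypt(b"a block of table data")

# Decryption unwraps the chain from the top down.
ck = Fernet(master_key).decrypt(wrapped_cluster_key)
dbk = Fernet(ck).decrypt(wrapped_database_key)
dk = Fernet(dbk).decrypt(wrapped_data_key)
assert Fernet(dk).decrypt(ciphertext) == b"a block of table data"
```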
- Federated (non-AWS) User access:
- Identity federation between AWS and non-AWS users in a corporate identity and authorization system.
- Using SAML, with AWS as the service provider, you can give users federated single sign-on (SSO) to the AWS Management Console or federated access to call AWS APIs.
- Cross-Account Access: for organizations that use multiple AWS accounts to manage their resources, a role can give users who have permissions in one account access to resources in another account (see the sketch after this list).
- Applications running on an EC2 instance that need to access AWS resources: if an EC2 instance needs to make calls to S3 or DynamoDB, it can utilize a role, which simplifies credential management for a large fleet of instances with auto scaling.
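A minimal cross-account sketch with boto3 and STS, as referenced above; the account ID and role name are hypothetical, and the role must already trust the calling account:

```python
import boto3

sts = boto3.client("sts")

# Assume a role defined in the other account.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::210987654321:role/CrossAccountReadOnly",  # hypothetical
    RoleSessionName="audit-session",
)["Credentials"]

# Use the temporary credentials to reach the other account's resources.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```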
That is probably the reason behind the AWS reboots across regions that were reported last week.
General Risk Management Model:
Step 1: Asset Identification
Identify and classify the assets, systems, and processes that need protection because they are vulnerable to threats.
Step 2: Threat Assessment
After identifying assets, you identify both the threats and the vulnerabilities associated with each asset, and the likelihood of their occurrence. All things have vulnerabilities; a key step is to examine the exploitable ones. Useful lists include CWE (from mitre.org), the SANS Top 25, and the OWASP Top 10.
Step 3: Impact Determination and Quantification:
An impact is the loss created when a threat is realized and exploits a vulnerability. A tangible impact results in financial loss or physical damage. For an intangible impact, such as damage to a company's reputation, assigning a financial value can be difficult.
Step 4: Control Design and Evaluation:
Design controls that mitigate the identified risks, and evaluate each control's effectiveness against the cost of implementing it.
Software Engineering Institute Model:
Examine the system, enumerating potential risks.
Convert the risk data gathered into information that can be used to make decisions. Evaluate the impact, probability, and timeframe of each risk. Classify and prioritize the risks (see the prioritization sketch after these steps).
Review and evaluate the risks and decide what actions to take to mitigate them. Implement the plan.
Monitor the risks and the mitigation plans. Review periodically to measure progress and identify new risks.
Make corrections for deviations from risk mitigation plans. Changes in business procedures may require adjustments in plans or actions, as do faulty plans and risks that become problems.
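A minimal sketch of the evaluate-and-prioritize step above, scoring each risk with the common exposure = probability × impact heuristic; the risk register and values are made up for illustration:

```python
# Hypothetical risk register; probability is 0..1, impact is a 1..10 scale.
risks = [
    {"name": "single load balancer in front of app tier", "probability": 0.2, "impact": 9},
    {"name": "master DB failover untested",               "probability": 0.4, "impact": 8},
    {"name": "cache keys exceed instance memory",         "probability": 0.6, "impact": 4},
]

for risk in risks:
    risk["exposure"] = risk["probability"] * risk["impact"]

# Highest exposure first drives the mitigation plan.
for risk in sorted(risks, key=lambda r: r["exposure"], reverse=True):
    print(f'{risk["exposure"]:.1f}  {risk["name"]}')
```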
Access Control Models:
The models discussed below are Discretionary Access Control (DAC), Mandatory Access Control (MAC), Role-based Access Control (RBAC), and Rule-based Access Control (RBA).
Bell-LaPadula Confidentiality Model:
The Bell-LaPadula security model is a combination of mandatory and discretionary access control mechanisms.
The first principle, known as the Simple Security Rule, states that no subject can read information from an object with a security classification higher than that possessed by the subject itself. This is also referred to as the “no-read-up” rule.
So arrange the access levels in hierarchical form, with defined higher and lower levels of access.
Bell-LaPadula was designed to preserve “confidentiality”, so it focuses on read and write access.
Reading material higher than the subject’s level is a form of unauthorized access.
The second principle, known as the *-property (star property), states that a subject can write to an object only if the subject’s security classification is less than or equal to the object’s security classification.
This is also known as the “no-write-down” principle.
It prevents the dissemination of information to users who do not have the appropriate level of access.
Usage example: preventing data leakage, such as publishing a bank balance to a public page. Both rules are sketched in code below.
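A minimal sketch of the two Bell-LaPadula rules, using a simple ordered classification lattice; the level names and helper functions are illustrative:

```python
LEVELS = {"public": 0, "confidential": 1, "secret": 2, "top-secret": 3}

def can_read(subject_level: str, object_level: str) -> bool:
    """Simple Security Rule: no read up."""
    return LEVELS[subject_level] >= LEVELS[object_level]

def can_write(subject_level: str, object_level: str) -> bool:
    """*-property: no write down."""
    return LEVELS[subject_level] <= LEVELS[object_level]

# A "secret" subject may read "confidential" data but not write to it,
# which blocks leaking secret information into less protected objects.
assert can_read("secret", "confidential") and not can_write("secret", "confidential")
assert not can_read("confidential", "secret") and can_write("confidential", "secret")
```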
- Built upon Graph Theory
- Distinct advantage: the ability to definitively determine rights, including unique rights (take and grant)
- Its value lies in the ability to analyze whether an implementation is complete or might be capable of leaking information.
Confidentiality: Confidentiality is the concept of preventing the disclosure of information to unauthorized parties. In layman’s terms, keeping a secret secret is confidentiality.
Integrity: Integrity is similar to confidentiality, except rather than protecting the data from unauthorized access, integrity refers to protecting data from unauthorized alteration.
Availability: Access to systems by authorized personnel can be expressed as the system’s availability.
Authentication: Authentication is the process of determining the identity of a user. Three general methods are used in authentication. In order to verify your identity, you can provide:
- Something you know
- Something you have
- Something about you (something that you are)
Authorization: Authorization is the process of applying access control rules to a user process, determining whether or not a particular user process can access an object. Three elements are used in discussion of authorization:
- A requester (sometimes referred to as the subject)
- The object
- The type or level of access to be granted.
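A minimal sketch of the requester/object/access-level check just described; the ACL table and helper are illustrative:

```python
# Hypothetical access-control list mapping (subject, object) to allowed access types.
ACL = {
    ("alice", "/reports/q3.pdf"): {"read"},
    ("bob",   "/reports/q3.pdf"): {"read", "write"},
}

def is_authorized(subject: str, obj: str, access: str) -> bool:
    """Apply access-control rules to a (subject, object, access) request."""
    return access in ACL.get((subject, obj), set())

assert is_authorized("bob", "/reports/q3.pdf", "write")
assert not is_authorized("alice", "/reports/q3.pdf", "write")
```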
Accounting (Auditing): Accounting is a means of measuring activity. In IT systems, this can be done by logging crucial elements of activity as they occur. With respect to data elements, accounting is needed when activity is deemed crucial enough that it may be audited at a later date.
*** A key element in audit logs is the employment of a monitoring, detection, and response process. Without mechanisms or processes to “trigger” alerts or notifications to admins based on particular logged events, the value of logging is diminished, or is relegated to a post-incident resource instead of contributing to alerting and incident prevention.
Non-repudiation: Non-repudiation is the concept of preventing a subject from denying a previous action with an object in a system. When authentication, authorization and auditing are properly configured, the ability to prevent repudiation by a specific subject with respect to an action and an object is ensured.
Session Management: Session management refers to the design and implementation of controls to ensure that communication channels are secured from unauthorized access and disruption of communication.