Monday, June 30, 2014

Installing VirtualBox on top of a RHEL EC2 instance fails

When you try to install a virtualization layer such as VirtualBox on top of an EC2 instance, you will likely see an error like the one below:



As per Amazon Support: "It's not possible to run VirtualBox within an EC2 instance, since we would be trying to run a hypervisor within a Xen-based hypervisor, which is unsupported. The required VT-x CPU extensions simply aren't exposed at the guest layer either. You can try to convert your appliance into a format supported by Xen (which is very similar to EC2) and then import the VM per the instructions provided."

Simple grep command to output lines before and after the search string

$ grep -C 3 "NullPointerException" *.log --color

The option "-C 3" will output 3 lines before and 3 lines after each occurrence of NullPointerException in the logs.
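Relatedly, -B and -A control the leading and trailing context independently. A quick throwaway demo (the temp file and search string are made up for illustration):

```shell
# Throwaway demo of asymmetric context: 1 line before, 2 lines after the match
log=$(mktemp)
printf 'one\ntwo\nERROR here\nthree\nfour\n' > "$log"
grep -B 1 -A 2 "ERROR" "$log"   # prints: two / ERROR here / three / four
rm -f "$log"
```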

Friday, June 27, 2014

Using jmap utility to take heap dump on an application running in VPC

Some secure environments have strict ingress/egress rules: you cannot open up JMX ports for tools like VisualVM, and you may not have started the JVM with parameters like -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.

In such cases you can take a heap dump with the jmap utility:

$ jmap -dump:format=b,file=/tmp/heapdump <pid>
Dumping heap to /tmp/heapdump ...
Heap dump file created

Once the dump has been generated, you can analyze the heap using the Eclipse Memory Analyzer (MAT).
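To find the target pid in the first place, jps (shipped with the JDK) or pgrep works. The sketch below just assembles the jmap command line for the first Java process it finds; the dump file name is an arbitrary choice:

```shell
# Assemble (not run) the jmap command for the first Java process found;
# pgrep may find nothing, in which case a <pid> placeholder is printed.
pid=$(pgrep -f java | head -n 1)
echo "jmap -dump:format=b,file=/tmp/heapdump.hprof ${pid:-<pid>}"
```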

Monday, June 23, 2014

AWS ELB cannot redirect users from http port to https port automatically

AWS ELB has no option to automatically redirect HTTP requests to HTTPS, unlike Apache httpd. So you will have to allow both ports (HTTP and HTTPS) on the ELB and then use a reverse proxy like httpd with a rewrite rule such as the one below.

  1. Have your ELB pass both HTTP and HTTPS traffic on to your backend server as HTTP traffic on port 80 (ELB listener HTTP -> backend HTTP, ELB listener HTTPS -> backend HTTP).
  2. Create a rewrite rule on your backend web server. For Apache:

*******
<VirtualHost *:80>
 ...
 RewriteEngine On
 RewriteCond %{HTTP:X-Forwarded-Proto} =http
 RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=permanent]
 ...
</VirtualHost>
*******

In the above rewrite rule we are utilizing the X-Forwarded-Proto header from the request to do the redirection. The X-Forwarded-Proto request header helps you identify the protocol (HTTP or HTTPS) that a client used to connect to your load balancer. Your server access logs contain only the protocol used between the server and the load balancer; they contain no information about the protocol used between the client and the load balancer. Elastic Load Balancing stores the client-side protocol in the X-Forwarded-Proto request header and passes the header along to your server, so your application or website can use it to render a response that redirects to the appropriate URL. More information on the X-Forwarded headers: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/TerminologyandKeyConcepts.html#x-forwarded-headers
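One way to sanity-check the rule is to replay what the ELB would send, i.e. a plain-HTTP request carrying X-Forwarded-Proto: http, and look for the 301. The host below is a placeholder, so this sketch only prints the command to run:

```shell
# Placeholder host; print the verification command rather than running it,
# since it needs a reachable backend.
host=backend.example.com
echo "curl -sI -H 'X-Forwarded-Proto: http' http://$host/"
```

A correctly configured backend should answer with an HTTP 301 and a Location: https://... header.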

Wednesday, June 18, 2014

Uploading SSL certificates on ELB through AWS CLI without providing console access

In certain situations (like when you are hosting sites for your customers), you may not want your customers to send you their SSL certificates and private keys, for security reasons; best practice dictates that private keys stay with the customer. In other cases, you could be hosting a site that your customer wants to CNAME to on their DNS server. Under those conditions, you don't want to handle the customer's SSL certificates and private keys at all. You can follow the steps below to have the customers upload the certs themselves:


  • Create a temporary IAM user called "temp-cert-user" in your AWS account and assign it the below custom IAM policy:-


{
 "Version": "2012-10-17",
 "Statement": [
  {
   "Effect": "Allow",
   "Action": ["iam:*AccessKey*"],
   "Resource": "arn:aws:iam::<AWS_ACCOUNT_ID>:*"
  },
  {
   "Effect": "Allow",
   "Action": ["iam:UploadServerCertificate"],
   "Resource": "arn:aws:iam::*"
  }
 ]
}


  • Next, ask your customer to install AWS CLI 
  • From the customer's machine, ask them to add the aws_access_key_id and aws_secret_access_key for temp-cert-user under a named profile in the $HOME/.aws/config file
  • Next, they can execute the below command

$aws iam --profile <profile>  upload-server-certificate --server-certificate-name test-cert --certificate-body file://testcert.pem --private-key file://testcertkey.pem --certificate-chain file://testcert.pem
{
    "ServerCertificateMetadata": {
        "Path": "/",
        "Arn": "arn:aws:iam::<AWS_ACCOUNT_ID>:server-certificate/test-cert",
        "ServerCertificateId": "ASC....QN7B4",
        "ServerCertificateName": "test-cert",
        "UploadDate": "2014-06-18T17:34:16.567Z"
    }
}
  • Now they can check if the certificate got uploaded by using the get-server-certificate command
C:\certs>aws iam --profile tempcert get-server-certificate --server-certificate-name test-cert
{
    "ServerCertificate": {
        "CertificateChain": "-----BEGIN CERTIFICATE-----
                                        -----END CERTIFICATE-----",
        "CertificateBody": "-----BEGIN CERTIFICATE-----
                                        -----END CERTIFICATE-----",
        "ServerCertificateMetadata": {
            "Path": "/",
            "Arn": "arn:aws:iam::<AWS_ACCOUNT_ID>:server-certificate/test-cert",
            "ServerCertificateId": "ASC...QN7B4",
            "ServerCertificateName": "test-cert",
            "UploadDate": "2014-06-18T17:34:16Z"
        }
    }
}

NOTE - To allow temp-cert-user to run get-server-certificate and delete-server-certificate as well, modify the IAM policy to include the statements below:-

{
   "Effect": "Allow",
   "Action": ["iam:GetServerCertificate"],
   "Resource": "arn:aws:iam::*"
  },
  {
   "Effect": "Allow",
   "Action": ["iam:DeleteServerCertificate"],
   "Resource": "arn:aws:iam::*"
  }
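Once the certificate is attached to the ELB, the temporary user's job is done. With the extended policy in place, the cleanup could look like the below (the profile and certificate names are the ones used in this example; the commands are printed rather than executed since they need live credentials):

```shell
# Print the verify-and-delete commands for the temporary user's certificate
profile=tempcert
cert=test-cert
echo "aws iam --profile $profile get-server-certificate --server-certificate-name $cert"
echo "aws iam --profile $profile delete-server-certificate --server-certificate-name $cert"
```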

Tuesday, June 10, 2014

Key ELB metrics to watch for in a load spike

When experiencing a surge in inbound requests, watch the ELB's CloudWatch metrics closely. The metrics are documented in MonitoringLoadBalancerWithCW

The key metrics are

  • RequestCount
  • Latency
  • HTTPCode_ELB_5XX
  • HTTPCode_Backend_5XX
  • SurgeQueueLength
Typically, you will see a linear relationship between the RequestCount and Latency metrics: when the load increases, latency increases correspondingly. With default settings, the ELB's 60-second timeout kicks in when latency exceeds that threshold.

The Latency metric indicates how long the ELB has to wait for a response from the instance to which it handed the request; if instances take longer to respond (whether with HTTP 200, 4xx or 5xx codes), latency goes up. HTTPCode_ELB_5XX counts the occasions when the ELB itself failed to handle an incoming request and sent an HTTP 5xx error code directly back to the client. HTTPCode_Backend_5XX counts the HTTP 5xx responses returned by the backend instances. SurgeQueueLength indicates the number of requests the ELB has queued up while waiting for a healthy instance to become available.

A CloudWatch graph of RequestCount and Latency showing this linear relationship will look like





NOTE - If all your instances behind the ELB are in the same zone, then you may want to disable the "cross-zone load balancing" feature of the ELB to reduce the performance overhead by a small amount.




You can also enable access logs and set up access log collection as per the AWS documentation links below:-

  1. http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/enable-access-logs.html
  2. http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/access-log-collection.html



Amazon questionnaire to request ELB pre-warming through AWS support

Amazon suggests informing them of any load tests against infrastructure fronted by ELBs. If any flash traffic is expected, then it makes sense to have the ELBs pre-warmed. You can read more about pre-warming in the AWS documentation - pre-warming

Questionnaire from AWS support which we need to answer for requesting ELB pre-warming:-

******************
  1. What is the DNS name for the ELB(s) that require manual scaling?
  2. An approximate increase percentage in traffic, or expected requests/sec that will go through the load balancer (whichever is easier to answer).
  3. The average amount of data passing through the ELB per request/response pair (size of packets in KB).
  4. The rate of traffic increase expected. This can be a qualitative value such as "this should increase steadily over the span of an hour," or "we expect traffic to increase suddenly once our sale/event is announced."
  5. Expected percent of traffic going through the ELB that will be using SSL termination.
  6. The number of Availability Zones added to the load balancer, or that will be added.
  7. Is the back-end currently scaled to the level it will be during the event?
  8. If not, when do you expect to add the required back-end instance count? Also what type of instance and how many ?
  9. When will your event start?
  10. When do you expect the event will end?
  11. A brief description of your use case, with a detailed explanation of your expected (if you're going to deviate from the normal request pattern) and normal traffic patterns.
  12. We're looking to glean details about what kinds of requests you expect. For example:
  • Are they long running requests?
  • Do you need connections to be open for an extended period of time?
  • Are these basic GET requests? POSTs? PUTs? Large file uploads/downloads?
  • How frequently do you expect surges in traffic, and how quickly traffic will ramp up to peak during these surges.
  • Are the back-end instances using persistent connections (aka: keep-alive)
  • Is your production application currently down due to this, or is your production application traffic severely impacted?
  • Is this an expected surge in traffic?

******************

Monday, June 9, 2014

ELBs front ending instances running in a private subnet of a VPC should reside in a subnet that has an internet gateway association

When creating an ELB to front instances running in a private subnet of a VPC, be sure to add the ELB to the subnet(s) that have an igw-* (internet gateway) association. If the ELB is added to a subnet with a NAT association, then inbound calls from the internet to the private instances will fail.

A subnet with an igw-* association will have the internet gateway as part of its route table


ELB's availability zone will look like


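To confirm from the CLI which gateway a given subnet routes through, you can filter route tables by subnet association. The subnet id below is a placeholder, so the command is printed rather than executed (it needs live credentials):

```shell
# Placeholder subnet id; print the command that inspects its route table.
# In the output, a 0.0.0.0/0 route whose GatewayId is igw-* means the
# subnet is public; a NAT instance target means it is private.
subnet=subnet-12345678
echo "aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=$subnet"
```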

Running multiple commands with ssh

You can run multiple OS commands over ssh with -t (force pseudo-tty allocation), such as

$ ssh -t -i <key> ec2-user@<EIP> 'sudo yum -y update openssl && sudo openssl version -a'

Friday, June 6, 2014

ELB soft limit of 20 per region in an AWS account

There is a soft limit of 20 ELBs per region in an AWS account. When you exceed that number of ELBs, you will get an error like the one below


For other types of limits, please refer to the Amazon docs at aws_service_limits. You can raise a ticket with AWS support to increase the soft limit if needed.

Thursday, June 5, 2014

Amazon says AWS infrastructure is impacted by OpenSSL CVE-2014-0224

Amazon issued a statement that the latest OpenSSL vulnerability, reported as CVE-2014-0224, exposes the infrastructure to a possible man-in-the-middle attack:

Amazon advisory

Amazon has updated the openssl package and suggests running "sudo yum update openssl". OpenSSL has issued its own advisory at the link below:-

Openssl advisory

Wednesday, June 4, 2014

Restarting the iptables service sometimes loads old firewall rules

Sometimes, on a running instance, we find our iptables rules get overridden by old rules from the /etc/sysconfig/iptables file after restarting the iptables service.

Whenever you add a new rule to iptables, be sure to persist it: "iptables-save" by itself only prints the current rules to stdout, so redirect its output to /etc/sysconfig/iptables (or run "service iptables save" on RHEL), otherwise the rules won't survive a service restart. Check whether the rules have been persisted in /etc/sysconfig/iptables

*********
# Generated by iptables-save v1.4.7 on Thu May  8 17:24:09 2014
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [41:3768]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
# Completed on Thu May  8 17:24:09 2014
*********

If for some reason you think the rules are not correct, you can run "sudo iptables --flush" to flush the rules, then add your new rules manually or load them from a file using the "iptables-restore" command, and persist them again.
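Putting the above together, adding and persisting a rule on RHEL could look like the below (port 8080 is an arbitrary example; the commands are printed rather than executed since both need root):

```shell
# Print the add-and-persist sequence; iptables-save writes to stdout,
# so persisting means redirecting its output into /etc/sysconfig/iptables.
echo "sudo iptables -A INPUT -p tcp -m tcp --dport 8080 -j ACCEPT"
echo "sudo sh -c 'iptables-save > /etc/sysconfig/iptables'"
```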

Monday, June 2, 2014

AWS EC2 service interruption in US-EAST-1b zone today

On some of the instances running in us-east-1b, the below errors appeared in the system log:-

********
Initialising Xen virtual ethernet driver.
microcode: CPU0 sig=0x206d7, pf=0x1, revision=0x70a
platform microcode: firmware: requesting intel-ucode/06-2d-07
microcode: CPU1 sig=0x206d7, pf=0x1, revision=0x70a
platform microcode: firmware: requesting intel-ucode/06-2d-07
Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
microcode: CPU0 update to revision 0x710 failed
microcode: CPU1 update to revision 0x710 failed
microcode: CPU0 update to revision 0x710 failed
microcode: CPU1 update to revision 0x710 failed
microcode: CPU0 update to revision 0x710 failed
microcode: CPU1 update to revision 0x710 failed
microcode: CPU0 update to revision 0x710 failed
microcode: CPU1 update to revision 0x710 failed
microcode: CPU0 update to revision 0x710 failed
microcode: CPU1 update to revision 0x710 failed
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
********

Amazon mentioned the problems in a single zone on the status dashboard:-

http://status.aws.amazon.com/







Later that evening, AWS Support informed us that the interruption had been resolved:-

*************
8:59 PM PDT Between 3:22PM -- 8:40PM PDT we experienced elevated API error rates and latencies launching and stopping instances, and attaching and detaching EBS volumes in a single Availability Zone in the US-EAST-1 Region. Running instances and volumes were not affected. The issue has been resolved and the service is operating normally.
************

Reasonable resource limits for production EC2 instances

Typically, you may want to tune your production EC2 instance's resource limits, because the defaults are fairly low for any practical purpose:


  • In /etc/security/limits.conf, you can set the hard and soft limits for the number of file descriptors and the max user processes. NOTE - If you set the max processes value in /etc/security/limits.d/90-nproc.conf, it will override the value in limits.conf. In both cases, log back in (or reboot the instance) for the new limits to take effect.
*******
*            hard    nofile    65535
*            soft    nofile    65535
@<user>      hard    nproc     16384
@<user>      soft    nproc     4096
*******

  • If you have many user processes running, then you may want to set the "kernel.pid_max" parameter in /etc/sysctl.conf (apply it with "sudo sysctl -p" or a reboot)
********
#Allow for more PIDs
kernel.pid_max = 65536
********