openswan ipsec in ec2

Posted by peter on August 24, 2011

This may be totally invalid given Amazon rolling out cross-region VPC a few weeks ago, but for those who still insist on rolling their own…

I was dealing with setting up IPsec (openswan) in EC2 for some folks, which included, among other things, cross-region EC2-instance-to-EC2-instance links. We had endless trouble with connections just suddenly dying. UDP isn’t the easiest thing to get right with NAT, and though it’s hard to be conclusive (especially when debugging Linux IPsec, which is not the easiest thing to follow in and out of the kernel), I point my blame-finger at bad interactions with double NAT between EC2 regions.

The problem was eventually solved with a combination of aggressive dead peer detection settings (dpddelay=4 dpdtimeout=16) and (the trickier setting to find) adding disable_port_floating=yes to the config setup section of ipsec.conf. That setting stops pluto from changing the port it communicates on, which, I assume, makes Amazon’s NAT’s job easier. It also means NAT-T probably won’t work with other vendors’ implementations in this setup, as pluto doesn’t listen on 4500 anymore, but we’re openswan everywhere, and it’s made our links stable.
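For anyone who wants to see roughly where those settings go, here is a minimal sketch of the relevant parts of ipsec.conf. Only dpddelay, dpdtimeout and disable_port_floating come from the setup described above. The connection name ec2-cross-region is made up, nat_traversal=yes and dpdaction=restart are my assumptions about what a setup like this would also carry, and the left/right addressing and authentication lines are left out entirely.

```conf
# Global pluto options. disable_port_floating is the setting described above;
# nat_traversal=yes is an assumption, not something quoted from the real config.
config setup
    nat_traversal=yes
    disable_port_floating=yes

# Hypothetical connection name. The DPD timers are the values from the post;
# dpdaction=restart is an assumption. Endpoint and auth settings are omitted.
conn ec2-cross-region
    dpddelay=4
    dpdtimeout=16
    dpdaction=restart
```

With these timers pluto probes an idle peer every 4 seconds and declares it dead after 16 seconds of silence, at which point the dpdaction decides what happens to the connection.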

ebs clappy award 6

Posted by peter on August 09, 2011

From Amazon’s status page regarding their recent outage in Dublin, there’s this alarming little snippet inside the wall of text (most of it having to do with failure due to a lightning strike) that could easily be missed.

3:11 PM PDT Separately, and independent from the power issue in the affected availability zone, we’ve discovered an error in the EBS software that cleans up unused snapshots. During a recent run of this EBS software in the EU-West Region, one or more blocks in a number of EBS snapshots were incorrectly deleted. The root cause was a software error that caused the snapshot references to a subset of blocks to be missed during the reference counting process. This process compares the blocks scheduled for deletion to the blocks referenced in customer snapshots. As a result of the software error, the EBS snapshot management system in the EU-West Region incorrectly thought some of the blocks were no longer being used and deleted them. We’ve addressed the error in the EBS snapshot system to prevent it from recurring. We have now also disabled all of the snapshots that contain these missing blocks.

We are in the process of creating a copy of the affected snapshots where we’ve replaced the missing blocks with empty block(s). Customers can then create a volume from that copy and run a recovery tool on it (e.g. a file system recovery tool like fsck); in some cases this may restore normal volume operation. We will email affected customers as soon as we have the copy of their snapshot available. You can tell if you have a snapshot that has been affected via the DescribeSnapshots API or via the AWS Management Console. The status for the snapshot will be shown as “error.” Alternately, if you have any older or more recent snapshots that were unaffected, you will be able to create a volume from those snapshots without error. We apologize for any potential impact it might have on customers applications.

Another clappy for the EBS team, and another reason not to use EBS for anything you can’t lose.