experiments with go and nagios

Posted by peter on October 25, 2011

I’ve been working on contract at the goog of late, and had the opportunity to play with Go. I ended up liking it a lot more than I expected to, so I decided to rewrite, in Go, a few small infrastructure pieces/shunts that were bothering me.

First is a replacement for ganglia-graphite.rb that runs about 15x faster. Our central gmond machine couldn’t keep up using rexml-based ganglia-graphite.rb, and was burning all its available cpu time spitting out data every 90 seconds or so.

Second piece is a replacement for nagios’ nsca which accepts check results (with multi-line output and as much perf data as you please) as json or protobuf. There are certainly other replacements out there, but this one is mine. I think it sucks only a little bit. I’m maybe letting this one go too early, as I’ve got plans to extend it to replace NRPE as well (a piece of nagios that also sucks), but I’m using it right now to fill a need to return a large number of passive informative checks from a few single hosts.

Both are up on my github page (the former here and the latter here). They’re a bit rough; I’m both new to the language and a shitty coder, so dragons, etc etc. Let me know if you find them useful, or alternately if you find them worthless.

openswan ipsec in ec2

Posted by peter on August 24, 2011

This may be totally invalid given amazon rolling out cross-region VPC a few weeks ago, but for those who still insist on rolling their own…

I was dealing with setting up ipsec (openswan) in EC2 for some folk which included, among other things, cross region EC2-instance-to-EC2-instance links. We had endless trouble with connections just suddenly dying. UDP isn’t the easiest thing to get right with NAT, and though it’s hard to be conclusive (especially when debugging linux ipsec- not the easiest thing to follow in and out of the kernel), I point my blame-finger at trouble caused by bad interactions with double-NAT between EC2 regions.

Problem was eventually solved with a combination of aggressive dead peer detection settings (dpddelay=4 dpdtimeout=16) and (the trickier setting to find) by adding disable_port_floating=yes to the config setup region of ipsec.conf. That setting stops pluto from changing what port it communicates on, which, I assume, makes an easier job for Amazon’s NAT. This also means NAT-T behavior is probably not going to work with other vendors’ implementations in this setup, as pluto doesn’t listen on 4500 anymore, but we’re openswan everywhere, and it’s made our links stable.
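For reference, here is roughly where those settings land in ipsec.conf. This is a sketch, not a copy of the real config: the connection name is made up, the DPD settings are placed in a conn section on the assumption that they are per-connection, dpdaction is left at its default, and the rest of the tunnel definition is omitted.

```
config setup
    # stop pluto from floating to another port during NAT traversal
    disable_port_floating=yes

# "ec2-cross-region" is a hypothetical connection name
conn ec2-cross-region
    # probe the peer every 4 seconds, declare it dead after 16
    dpddelay=4
    dpdtimeout=16
```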

ebs clappy award

Posted by peter on August 09, 2011

From Amazon’s status page regarding their recent outage in Dublin, there’s this alarming little snippet inside the wall of text (most of it having to do with failure due to lightning strike) that could easily be missed.

3:11 PM PDT Separately, and independent from the power issue in the affected availability zone, we’ve discovered an error in the EBS software that cleans up unused snapshots. During a recent run of this EBS software in the EU-West Region, one or more blocks in a number of EBS snapshots were incorrectly deleted. The root cause was a software error that caused the snapshot references to a subset of blocks to be missed during the reference counting process. This process compares the blocks scheduled for deletion to the blocks referenced in customer snapshots. As a result of the software error, the EBS snapshot management system in the EU-West Region incorrectly thought some of the blocks were no longer being used and deleted them. We’ve addressed the error in the EBS snapshot system to prevent it from recurring. We have now also disabled all of the snapshots that contain these missing blocks.

We are in the process of creating a copy of the affected snapshots where we’ve replaced the missing blocks with empty block(s). Customers can then create a volume from that copy and run a recovery tool on it (e.g. a file system recovery tool like fsck); in some cases this may restore normal volume operation. We will email affected customers as soon as we have the copy of their snapshot available. You can tell if you have a snapshot that has been affected via the DescribeSnapshots API or via the AWS Management Console. The status for the snapshot will be shown as “error.” Alternately, if you have any older or more recent snapshots that were unaffected, you will be able to create a volume from those snapshots without error. We apologize for any potential impact it might have on customers applications.

Another clappy for the EBS team, and another reason not to use EBS for anything you can’t lose.

chrome short host trick

Posted by peter on November 15, 2010

I usually visit sites hosted at the shortbus labs (my apartment) using a short hostname, but using Chrome this is sometimes problematic- it’ll attempt to search the Goog for results rather than simply visiting the site. It’ll then pop up an apology with a link to click to reach the short host, but this means hands leaving the keyboard to the trackpad, which makes me grumpy.

There is a trick, though- put a / at the end of the shorthost (i.e. shorthost/ as opposed to shorthost), and it won’t send your shorthost name to google for searches.

This is incredibly simple, and really doesn’t deserve a blog post, but it was such an annoyance for me that I’m giving it one.

disabling anti-aliased fonts in netbeans on osx

Posted by peter on December 08, 2009

it’s harder than you might think, and i haven’t found a perfect solution, but i’m pretty close.

in eclipse, likely because of swt’s use of jni to use os-native forms, you can simply set an application-specific anti-aliasing threshold for the application:

defaults write org.eclipse.eclipse AppleAntiAliasingThreshold 20

and this more or less works (thanks to tim for pointing this out here), but with netbeans, no joy.

our buddy netbeans uses swing, which doesn’t look at application property plists. since apple rolls their own ui elements for swing/awt on osx (best of my knowledge), we’re in strange waters, as well. there’s a hackier way to do this, documented here on the netbeans forums for netbeans 6.5, and involves appending the following flags to netbeans_default_options in the netbeans.conf file inside the netbeans application bundle- that’s .../NetBeans\ 6.7.1.app/Contents/Resources/NetBeans/etc/netbeans.conf

-J-Dswing.aatext=false -J-Dawt.useSystemAAFontSettings=false

and this works fine and dandy, if you’re using the standard “Monospace” font.

netbeans 6.7.1 using monospace on 10.6
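If you would rather not hand-edit the file, the append can be scripted. The sketch below works on a throwaway copy with a made-up one-line netbeans_default_options (the real line carries more flags), and it assumes both properties need the -J prefix so the launcher hands them to the JVM. Point conf at the real netbeans.conf only once the output looks right.

```shell
# Throwaway stand-in for netbeans.conf; the existing flag is invented.
conf=$(mktemp)
printf '%s\n' 'netbeans_default_options="-J-Xms32m"' > "$conf"

# Flags from the post, both with -J so they reach the JVM (assumption).
flags='-J-Dswing.aatext=false -J-Dawt.useSystemAAFontSettings=false'

# Insert the flags just before the closing quote; keeps a .bak copy.
sed -i.bak "s|^\(netbeans_default_options=\".*\)\"|\1 $flags\"|" "$conf"

cat "$conf"
```

The .bak file is the untouched original, so backing the change out is a single mv.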

However, this fails miserably if you’re using something else, like profont or (my pick) pragmata. in that case, at least if you’re on 10.6, you get a mess that looks like this:

netbeans-pragmata-badrender

note that fonts are being drawn on top of each other. selecting text and moving the cursor around makes for an even bigger mess.

i’ve noticed that apple’s release notes contain a note saying that the swing.aatext system property was ignored in apple’s first release of java6 on osx, so my guess is that either they didn’t get it quite right, or that netbeans is getting rendered glyph sizes wrong. my money is on this being an apple screwup, as it works great with the default monospaced font, but this is also me talking out my ass as i haven’t bothered even looking at how this works in another swing application.

this is the point where i give up and admit that i’m spending more time than i should on getting my special font working. if anyone else has any clues past this point, i’m all ears!

enable screen sharing remotely in osx

Posted by peter on August 26, 2009

I forget how to do this all the time, so I’m sticking the command here.

$ sudo /System/Library/CoreServices/RemoteManagement/ARDAgent.app\
/Contents/Resources/kickstart -activate -configure -access -on -users admin \
-privs -all -restart -agent -menu

will enable screen sharing for the user named admin (-users takes a comma-separated list of short names). More specific options are available from the kickstart command’s -help option.

one-window safari

Posted by peter on July 13, 2009

There is a way to force Safari to live with one window only- Apple has, true to form, hidden this functionality away in hidden preferences. Run the following in Terminal:

defaults write com.apple.Safari TargetedClicksCreateTabs -bool true

..and Safari will no longer create new windows unless you ask it to. (via eNik)

Apple’s X11 Keymap and virt-manager

Posted by peter on June 01, 2009

Apple’s default X11 key mapping causes trouble when using virt-manager on the local X server. For reasons unknown to me, Apple maps the option/alt key to Mode_switch, meaning that when virt-manager grabs the cursor, it’ll never let go, as there’s no way of sending Ctrl+Alt to it- apparently it works explicitly off of keysyms instead of mod maps. It’s been long enough since I’ve dealt with X11 configuration issues that I had to do a little hunting to figure things out, so I figured I’d throw my solution up here.

On the machine where you’ll be running your X server (your mac), stick this in ~/.Xmodmap

clear Mod1
keycode 66 = Alt_L
keycode 69 = Alt_R
add Mod1 = Alt_L
add Mod1 = Alt_R

This will make the option/alt keys on your keyboard emit (I think) more sensible events, while keeping Mod1 in the same place. Finally, no more restarting X11 when you want to break free of virt-manager.


UPDATE: The above still works, but it turns out there’s a better way. XQuartz, the open source project that eventually gets rolled back upstream to Apple, has a setting in preferences to switch between Apple default behavior and what the .Xmodmap above will get you. It’s also got other improvements, such as working properly with Spaces. Use it instead.

Auto-starting Xen/xVM domains in OpenSolaris

Posted by peter on May 19, 2009

Normally, when using Red Hat’s tools (which like everyone else Sun has ripped off wholesale) this is exposed in virt-manager and virsh, but the versions that Sun has brought over from Red Hat lag quite a bit behind upstream, so you have to dig in below the convenience tools. Shut down your guest, then use xm list -l with the guest’s name to dump the xen config file. There’s an on_xend_start property in here- change it from ignore to start, then dump it back into xen with xm new -F guest.sxp.

All at once:

pfexec xm list -l guest > guest.sxp
sed -i 's/on_xend_start ignore/on_xend_start start/' guest.sxp
pfexec xm new -F guest.sxp

..and you’ve got a domain that’ll auto-start when xend is brought up, which is pretty much what I usually want.
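If you want to see what the sed step changes before feeding anything back into xend, you can try it on a dummy file first. The sxp below is a trimmed, made-up stand-in for what xm list -l prints (a real dump is much longer), and the file names are placeholders.

```shell
# Work in a scratch directory so nothing real gets touched.
tmp=$(mktemp -d)

# Made-up, heavily trimmed sample of a domain config dump.
cat > "$tmp/guest.sxp" <<'EOF'
(domain
    (name guest)
    (on_xend_start ignore)
    (on_xend_stop ignore)
)
EOF

# Same substitution as above, written to a new file instead of in place.
sed 's/on_xend_start ignore/on_xend_start start/' \
    "$tmp/guest.sxp" > "$tmp/guest-autostart.sxp"

# Only the on_xend_start line should have changed.
grep on_xend "$tmp/guest-autostart.sxp"
```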

making a manifest for SMF

Posted by peter on October 22, 2008

Sun is an incredibly frustrating company. A number of the new features introduced in Solaris 10 are very excellent, but I find all the Solaris userland a huge shitpile compared to my familiar GNU tools. Furthermore, the documentation for the aforementioned excellent features is pretty lacking at times, meaning a lazy dude like me gets little use out of them. However, I just finally stumbled into SMF, and I am now a fan.

I wanted to move a few of the programs I run into SMF, one of the aforementioned awesome features. SMF is Sun’s replacement init process, which takes a more active approach to services than the traditional “tell it to start and hope it does.” It, and all the other new init replacements borrow a good bit from djb’s daemontools, which everyone seems to be realizing, ten years later, is actually a good idea.

SMF provides two primary benefits over traditional init script methods, in my mind, the first of which is process supervision. Process supervision means that when your app crashes, you don’t have to rely on monitoring scripts (or other people) to notice it’s gone, SMF notices immediately and attempts to restart the process. If it won’t stay started after a few tries, it’ll stop and switch the process to maintenance mode, meaning you get to fix it.

The second is its ability to break out instances of an application by essentially subclassing the main “default” instance and redefining configuration variables, such as which config file to read, where to keep the database, etc. This means you can, with a few commands, bring up another instance of apache for testing, or maybe a number of mongrels serving different apps, from the generic service definition.

Sun has a pretty good how-to online dealing with postgresql, which is helpful, but leaves out a few chunks. SMF expects you to provide it with start and stop methods, which is helpful if you just want to move a legacy service over to SMF, but some applications are simpler than that and work fine by just running them and sending them a kill signal when they need to go away. In this case, you can just give the command to launch the daemon as the start method, and fill the stop method with :kill. This will instruct SMF to send a kill signal to the process and children executed by the start method. Incidentally, if the app reloads configuration when sent -HUP, you can fill in :kill -HUP as the refresh method. This isn’t explicitly spelled out in any documentation I’ve read, but it’s used in a few places in the manifests that ship with Solaris.

Below is a manifest I whipped up for mt-daapd, which scans and shares out all of my spiffy tunes so itunes can play ’em.

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="manifest" name="mt-daapd">
    <service name="network/mt-daapd" type="service" version="1">
        <dependency name="network" grouping="require_all" restart_on="none" type="service">
            <service_fmri value="svc:/milestone/network:default"/>
        </dependency>
        <exec_method type="method" name="start" exec="/export/home/pjjw/opt/pkgs/mt-daapd/sbin/mt-daapd -y -c %{config/conffile}" timeout_seconds="10" />
        <exec_method type="method" name="refresh" exec=":kill -HUP" timeout_seconds="10" />
        <exec_method type="method" name="stop" exec=":kill" timeout_seconds="10" />
        <instance name="default" enabled="false">
            <property_group name='config' type='application'> 
                <propval name='conffile' type='astring' value='/export/home/pjjw/opt/pkgs/mt-daapd/etc/mt-daapd.conf' /> 
            </property_group>
        </instance>
        <stability value="Evolving"/>
        <template>
            <common_name>
                <loctext xml:lang="C">Firefly Media Server</loctext>
            </common_name>
            <documentation>
                <doc_link name="Firefly Media Server homepage" uri="http://fireflymediaserver.org"/>
            </documentation>
        </template>
    </service>
</service_bundle>
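Getting a manifest like this into SMF takes a validate, an import, and an enable. A minimal sketch, assuming the manifest is saved as mt-daapd.xml (a made-up file name) and that your account can run the commands through pfexec; otherwise run them as root. The smf_load helper is my own wrapper, not something that ships with Solaris.

```shell
# Hypothetical helper: validate a manifest, import it, then enable the
# given instance. Stops at the first step that fails.
smf_load() {
    manifest=$1
    fmri=$2
    pfexec svccfg validate "$manifest" &&
        pfexec svccfg import "$manifest" &&
        pfexec svcadm enable "$fmri"
}

# Usage, on the Solaris host:
#   smf_load mt-daapd.xml network/mt-daapd:default
#   svcs -x mt-daapd    # explains why, if it lands in maintenance
```

The enable step is needed because the manifest ships the default instance with enabled="false".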

I run two instances of this app- one which shares out the music that I’ve sorted and tagged properly, and then another one that shares out stuff that’s just arrived into torrent directories or the like. Ignore the ridiculous location I’ve installed this app to and note the property group and properties. I’ve got two config files, one for each instance. I can make the second instance from the first with the following commands:

svccfg -s network/mt-daapd add incoming
svccfg -s network/mt-daapd:incoming setprop config/conffile = astring: /export/home/pjjw/opt/pkgs/mt-daapd/etc/incoming.conf 

..and there it is, a new instance of the server.

[00:28:05][pjjw@push:~]$ svcs mt-daapd
STATE          STIME    FMRI
online         Oct_17   svc:/network/mt-daapd:incoming
online         Oct_17   svc:/network/mt-daapd:default

I’d actually just been running this in screen for about 6 months and restarting it when it crashed. This is much nicer now.