20 September 2013

win win habits

PMman was designed for long-lived VMs (in contrast to ephemeral AWS or OpenShift type instances), and so our communication needs differ from those of other cloud instance control interfaces.  This applies to the client tenant side views, but also to the sysadmin / devops side view.

One of the convenience features of the PMman cloud product we run is the ease of communicating with 'tenants' on a given dom0.  When I came in this morning, I had this notice waiting for me:
From: mdadm monitoring
To: root@kvm-nNNN.pmman.com
Subject: DegradedArray event on /dev/md0:kvm-nNNN.pmman.com

This is an automatically generated mail message from mdadm
running on kvm-nNNN.pmman.com

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.
It is not the end of the world -- there are four members in that raid array, all mutable data is backed up nightly, and it is easy enough to fix with a hot drive swap.  BUT, we did a drive swap for a different raid member on that same unit within the last month.  I want to totally remove that unit from production, and put it into 'on the bench' mode, so I can see if there is some deeper hardware issue.
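For readers who want to confirm a degraded array themselves, here is a minimal sketch that scans the text of /proc/mdstat for arrays with a missing member. It is an illustration only, not part of PMman; the function name `degraded_arrays` and the sample text are assumptions made up for this example.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout; not taken from the real host.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid10]
md0 : active raid10 sdd1[3] sdc1[2] sda1[0]
      1953258496 blocks super 1.2 512K chunks 2 near-copies [4/3] [U_UU]
md1 : active raid1 sdb2[1] sda2[0]
      524224 blocks [2/2] [UU]
unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[4/3] [U_UU]": expected members / active members.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live box one would pass in the contents of /proc/mdstat instead of the sample string.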

The neat part from my point of view is that our design included a way to contact the tenant VM owners on JUST that box.  It is as easy as clicking a couple of buttons and typing:
The following message will be sent to the list of users [on box: kvm-nNNN.pmman.com]. Here is the opportunity to fine tune the list.

From:     support@pmman.com
Subject:     Raid array member maintenance needed
Notice Level:     N
Message:     We have had some raid array member failures on the underlying dom0 of a machine you run at PMman. We replaced a drive hot three weeks ago on this same unit, and now have a new failure.

This may be a portent of a failing drive control subsystem, rather than the drive (although the previously removed drive tested bad and has been RMA'ed)

One new feature of PMMan of which you may not be aware is the 'no extra charge' weekly reboot and Level 0 backup. This is called a 'cyclic' backup with a 7 day repetition interval. To the extent you have NOT enabled this feature, I strongly suggest you test it and take advantage of this enhanced functionality of the interface

It is accessible off the 'Backups' button of the VM control web interface. Feedback is welcome of course. A nice side benefit from OUR point of view of offering this to customers is that it enables us to do invisible migrations from one dom0 to another as the backup and reboot are occurring, and so 'clear out' a dom0 host of running client instance VMs


-- Russ herrold
User list: ... [elided] ...
Emails go out, and a history item gets dropped into each VM's logfile as well as in our admin side logs.  Optionally I can have the tool turn on an 'attention needed' flag in the end tenant's console, which will persist until they acknowledge it.  We already do that for 'too long between updates', 'too long between reboots' and such.
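The per-host notice flow described above can be sketched roughly as follows. This is not the PMman implementation: the `VM` record, its fields, and `notify_host_tenants` are hypothetical names invented for illustration, and the sketch only builds the messages rather than sending them.

```python
from dataclasses import dataclass, field
from email.message import EmailMessage

@dataclass
class VM:
    # Hypothetical tenant VM record; the real PMman data model is not shown in this post.
    name: str
    host: str
    owner_email: str
    history: list = field(default_factory=list)
    attention_needed: bool = False

def notify_host_tenants(vms, host, subject, body,
                        sender="support@pmman.com", flag=False):
    """Build one notice per VM on `host`, log it, and optionally raise a flag."""
    messages = []
    for vm in vms:
        if vm.host != host:
            continue  # only tenants on the affected box are contacted
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = vm.owner_email
        msg["Subject"] = subject
        msg.set_content(body)
        messages.append(msg)
        vm.history.append(f"notice sent: {subject}")  # per-VM history item
        if flag:
            vm.attention_needed = True  # persists until the tenant acknowledges
    return messages
```

Handing the returned messages to a mailer (for example `smtplib.SMTP.send_message`) is left out, so the sketch has no side effects beyond the in-memory records.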

We can of course do invisible 'hot' migrations of machines around, but even safer is to encourage tenant VM owners into the good habit of taking Level zero backups (which we automatically test).
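To make the 7 day 'cyclic' interval concrete, here is a small sketch of the date arithmetic involved. The function names and the overdue rule are assumptions for illustration only; they are not taken from the PMman interface.

```python
from datetime import date, timedelta

def cyclic_schedule(first_run, count, interval_days=7):
    """Return `count` backup dates, spaced `interval_days` apart."""
    return [first_run + timedelta(days=interval_days * i) for i in range(count)]

def is_overdue(last_backup, today, interval_days=7):
    """True when more than one full interval has passed since the last backup."""
    return (today - last_backup).days > interval_days

if __name__ == "__main__":
    for run in cyclic_schedule(date(2013, 9, 20), 3):
        print(run.isoformat())
```

A check like `is_overdue` is the sort of test that could sit behind a 'too long between backups' flag, though how PMman decides that is not described here.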