02 October 2013

About those doughnuts ...

In a recent post, I closed:
Dollars to doughnuts, there is 'more than one roach' lurking.
I'll cover a bet that there are not tested backups in that shop
I had occasion to speak with that person again the other day. No backups, no thought to plan for such by that person's predecessor, and so by definition, no Disaster Recovery Plan
Staufs storefront
I am looking forward to getting a box of doughnuts -- from the DK would be nice. Please make sure two Apple Fritters are in there, as the other coffee vultures always grab the one I wanted. The Doughnut Kitchen is close to Staufs

20 September 2013

win win habits

PMman was designed for long lived VM's (compare contra: ephemeral AWS or OpenShift type instances) , and so our communication needs vary from other cloud instance control interfaces.  This applies for the client tenant side views, but also for the sysadmin / devop side view

One of the convenience features of the PMman cloud product we run is the ease of communicating with 'tenants' on a given dom0.  When I came in this morning, I had this notice waiting for me:
From: mdadm monitoring
To: root@kvm-nNNN.pmman.com
Subject: DegradedArray event on /dev/md0:kvm-nNNN.pmman.com

This is an automatically generated mail message from mdadm
running on kvm-nNNN.pmman.com

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.
It is not the end of the world -- there are four members in that raid array; all mutable data is backed up nightly, and it is easy enough to fix with a hot drive swap.  BUT, we did a drive swap for a different raid member on that same unit within the last month.  I want to totally remove that unit from production, and put it into 'on the bench' mode, so I can see if there is some deeper hardware issue

The neat part from my point of view is that part of our design included a way for contacting the tenant VM owners on JUST that box alone.  It is as easy as clicking a couple buttons and typing
The following message will be sent to the to the list of users [on box: kvm-nNNN.pmman.com]. Here is the opportunity to fine tune the list.

From:     support@pmman.com
Subject:     Raid array member maintenance needed
Notice Level:     N
Message:     We have had some raid array member failures on the underlying dom0 of a machine you run at PMman. We replaced a drive hot three weeks ago on this same unit, and now have a new failure.

This may be a portent of a failing drive control subsystem, rather than the drive (although the previously removed drive tested bad and has been RMA'ed)

One new feature of PMMan of which you may not be aware is the 'no extra charge' weekly reboot and Level 0 backup. This is called a 'cyclic' backup with a 7 day repetition interval. To the extent you have NOT enabled this feature, I strongly suggest you test it and take advantage of this enhanced functionality of the interface

It is accessible off the 'Backups' button of the VM control web interface. Feedback is welcome of course. A nice side benefit from OUR point of view of offering this to customers is that it enables us to do invisible migrations from one dom0 to another as the backup and reboot are occurring, and so 'clear out' a dom0 host of running client instance VMs


-- Russ herrold
User list: ... [elided] ...
Emails go out, and a history item gets dropped into each VM's logfile as well as in our admin side logs.  Optionally I can have the tool turn on an 'attention needed' flag in the end tenant's console, that will persist until they acknowledge it.  We already do that as to 'too long between updates' and' 'too long between reboots' and such

We can of course do invisible 'hot' migrations of machines around, but even safer is to encourage the good habit of encouraging tenant VM owners to  take (and we automatically test) Level zero backups


19 June 2013

I am not Harry Truman

I received a email from a customer, followed by a phone call, to the effect they had received huge number of email 'return' bounces to a general intake email address.  He and I have had this discussion before

I have written about email sender forgery (There is probably NOT an email account: godzilla@microsoft.com) and its fallout ("Customer: My cousin says that his email to me is not going through") before.  So let's take the time to think it through yet again

Takeaway: He wanted me to stop such pieces from cluttering their email box, but he is unwilling to have 'heavy' spam filtering

As a personal matter, and also wearing my sysadmin hat, I would like to stop seeing this cruft as well

But as a technical matter, it seems that it cannot easily be done without constant 'tuning' of rejection rules or some other rather serious matching of 'Message-ID' of pieces sent against return pieces offered.  An attempt to do so through filtering tools with no prior knowledge of Message-ID's sent, is to always 'play defense' against the spammers, without an ability ever score a 'win'.  The effort to match Message-Id's in offered return pieces is perhaps more promising

But, so far, no-one has been sufficiently vexed by it in the FOSS community to publish such a tool and to commit do doing do the ongoing 'tuning' of message parsers needed.  Perhaps we can design around it with existing tools, and amending our outgoing pieces by adding a certification that a given candidate email is truly from us

As a design matter, building a milter, writing some procmail rules, and parsing sendmail logs, probably into a database backend, as my first thought as to how I would approach the matter.  The database constraint is troubling, though.  I have other work that I need to attend to first, but I went through the thought process.  I memorialize that process in part in case someone is interested.  Even more, I will provide webspace, mailing list support, and a VCS gratis, if someone 'feels the itch'.  It would be useful to have, but is not urgent to attain -- Seven Habits Quadrant 2 or 4 stuff.  Absent such a volunteer effort or a paying customer, for me, Quadrant 4

Or, version two, a trusted cohort of outbound mailservers could build a MAC MIME Multipart attachment for each outbound message, and also a second MIME attachment that is cryptographically validated 'clearsign' of that MAC part.  Possibly bundle this up into a Multipart Related set of structured attachments.  Add these two new MIME attachments to all messages on every outbound piece.  The first part -- the MAC part -- would be based a hash of the message body, plus a timestamp of seconds since Epoch or such, and other optional entropy, to avoid forgery and replay attacks

Later then, when a putative return is offered, only accept for further processing those returns that had a validating pair of MIME attachments, produced based on a re-hash the message body in chief, and that MAC section's timestamp; and  that had previously clearsigned by it. Discard stale stuff, and non-validating content. This gets rid of the need for the database and simplifies the procmail rules.  A well-formed candidate return piece can carry around all that is needed to known to decide if one will pass a mail return message along to human eyes

Not free, as it will burn up compute cycles on every send, and a few more at return time, but also complete and under controllable locally so resistant to spammers.  Avoids the database requirement, so it can scale out. Most of the needed tools already exist as FOSS.  hmmm

The protocols governing what constitutes: email permit a sender to enter whatever 'return address, and 'sender address' they wish on a piece of email.  It is trivial to find a 'open' relay to accept email to send to any third party.  Consider the analogy:
  • All the while being careful to not leave a fingerprint or other biometric, I use cash to purchase a post card at the corner store, along with a stamp
  • I address it to someone of tender sensibilities, and assert that I noticed that their car was parked outside the local 'adult entertainment' establishment
  • I sign it: Harry S Truman
  • I enter a 'return address' of:

  Harry S Truman
  President Emeritus
  1600 Pennsylvania Avenue
  Washington DC  20500
  • I mail it

  • The recipient is outraged to find such a libelous assertion, visible for their letter carrier to see, and demands that the person who did so be identified, and stopped.  Also, for good measure, they want the Postal Service Inspectors to get on the matter to prevent such heartbreaking assertions to never happen again

    About all the Postal Service will offer to do in the usual case is to return the piece to its nominal sender. And he no longer receives mail at that address

    (I note parenthetically that the Postal Service DOES seem to scan images of ALL paper mail passing through their system)

    Stopping spam (here: bounce backsplatter and 'joe jobs') is just not going to turn out to have a durable, easy, and comprehensive solution, without re-thinking what we send looks like.  Spammers and legitimate receivers are in a 'arms race' and today's fix will rot if senders can re-engineer around the fixes.  If this state of affairs distresses a person greatly and until I can get that MIME solution going to test my hypothesis: stop reading email; hire a full time, 24x7 secretary to pre-read all email and toss the junk.; turn up the filtering and accept the false positives; grow a thick skin

    Or, of course, start coding and beat me to it

    13 June 2013

    Phone call: 'I've got this sick machine ...'

    me:  well, why it is sick?

    them:  yum complains about a missing signing key

    me: so install the key; it is down in /etc/pki/rpm-gpg/, and rpm --import ... will do the trick

    them: that directory is not there

    me: who set up the machine?

    them:  well, I was handed it, and ...

    me: so, take a level zero backup and then clean up the machine before trying to work on it, or deploy a new one

    them: well, I can't

    I just got off that call from a friend in a new employment situation

    The technical fix was outlined by me long ago, and I sent an email with the link along to the person calling

    BUT: Fixing the mindset inside the caller's head: do not try to work in a undefined (here: broken) environment is harder

    But the caller has a problem in their work-flow process; a fix has to be done; sooner is probably better than later; a broken machine in production is 'technical debt', pure and simple.  Fundamental expectations are not met; binary partition will not work well to isolate problems, as more than one piece is probably broken.  It will break again, and a perception may well form that the caller may be the problem, rather than the broken environment they were handed

    Be sure to make a note to yourself to also address the broken process that permitted that machine to escape into production.  Dollars to doughnuts, there is 'more than one roach' lurking.  I'll cover a bet that there are not tested backups in that shop

    04 January 2013

    Another pet died across the holidays

    I wrote before about un-maintained and orphaned WordPress sites being exploited.  That same frantic user from two months ago, called again.  The TL;DR summary is:
    • cPanel administration with multiple accounts in a single host without protections 
    • OS Updates not being run
    • WordPress updates not being run
    • Random add-on's being used without an awareness of security issues
    • No SELinux (disabled)
    An exploit un-gzip-ping a hostile payload from cache was used, and the machine taken over

    The absence of good sysadmin skills, well packaged content, and updates 'for the loss' ...

    30 October 2012

    disable IPv6 DNS results

    We had an end user appear in the main #centos IRC channel the other day with a IPv6 problem.  That person had leased a VPS somewhere, and their provider had included and enabled IPv6, at least partially.  Something was wrong in the network fabric, so that while some IPv6 services worked, others did not; DNS results returned with AAAA record results; but then the VM hoster was not transiting port 80 TCP traffic.  Very curious, and frustrating to the end user who just wanted yum to work so they could install updates and packages on their instance

    The culprit is the grafting in of IPv6 readiness in man 2 getaddrinfo.  This is the way of the future, so there is no fighting it on a long term basis, but tactically having a means to be IPv4 only is appealing for people just wanting to work in the older address space.  The TL; DR uptake of that man page is that in a properly functioning system, name resolution answers under IPv6 are preferred, and only if not available, does one fall back to the older IPv4.  But this places a premium on IPv6 actually working when present.  We've shipped a full native IPv6 setup for customers at PMMan for a couple of years ago, but I assure you that we had some head-scratching as we rolled it out, and found customers using tunnels from HE or SixXs were also leaking advertisements to other VM's.  We added rules to filter out the false traffic after a bit of tracing with TCPDUMP

    I have blogged about it before when IPv6 link-local address resolution (the ^FE family) was confusing distcc under Debian a couple of years ago.  There are links in the CentOS wiki for approaches on disabling IPv6 traffic, which vary between C5 and C6

    That last mentioned article has an outlink to a bugzilla ticket that offers food for thought.  It mentions in passing that one can direct a nameserver to NOT deliver IPv6 results with a fairly simple tweak
    Another option is to add this to /etc/sysconfig/named:

    ... so, ... it should be possible to set up a local cacheing nameserver on the localhost, configured to NOT return IVv6 content, and so workaround the issue.  This smells sort of 'hackish', but it would have the benefit of being a single method that should work in the general case and not be tied to any particular kernel version, or other variable

    27 September 2012

    Feeding the pet

    We had a frantic call from a sometimes customer today.  Their self-administered WordPress-based website had a Trojan in it, and it was saturating their website traffic allocation.  "THE SITE WAS DOWN!!"  They had signed up at a CPanel mediated, shared hosting firm, and a plug-in they had installed turned out to contain a well-known trojan

    We spent a couple of hours looking into it.  And then a couple hours looking into the WordPress security notification system.  Perhaps, I should say: non-notification system as to getting subscribed to a formal notification mailing list from the WordPress folks, proper

    The WordPress model seems to be: treat your WordPress site as though it is a pet that needs daily feeding.  And to be 'put down' when you lose interest in it, move on, or forget about it -- Oops.  Log in daily as an administrator, and look for a notification
    that you need to apply the 'latest and greatest' update.   Run the update process manually whenever it appears.  Oh yeah, did you remember to take a backup FIRST, and test that you can roll back to it if the 'update' breaks anything? Oops

    This of course RULES OUT using a packaged approach to managing such sites, as the lag for stabilizing a new RPM package, accounting for potential database changes, and the like 'take too long'. Just unroll a tarball, and trust that it will not break any local customizations

    I see fourteen open tabs in my browser panel still open, related to trying to track down a central and formal notification feed that I (or any person seeking to get 'push' notification) might subscribe containing only 'Security' notifications.  Weeding through the tabs, ...

    • The 'Famous 5-Minute Install' for WordPress -- Nope, no useful outlink for hardening, nor to subscribe to notifications, beyond a pointer to a third-party Ubuntu appliance with an 'automatic security updates'.  That appliance's page has pointers to a tool to enable taking database backups, adding PHPMyAdmin, and Webmin.  Not good choices for a person caring about security
    • Perhaps FAQ items tagged with: Security -- Nope, clearly incomplete, as for example a Google search turns up this third-party alert for version 3.3.2,  but the Release Notice does not get titled with: Security
    • This bug (#10253) lingered for three years with a Security tag in their Trac issue tracker as to the current release series (3.4), and was amended ten days ago; But the latest release (for 3.4.2) was twenty days ago when this is written.  Should an update have been release?  Who Knows?
    • Perhaps their FAQ Security -- Nope, no push notification link suggested there, but lots of clutter as to copyright infringement notification handling, and miscellaneous topics
    • Perhaps watch the Releases News in an RSS reader - Oops, no sub-tag feed offered, and there has not been an "Important" Security release since December 2010, if one used that approach
    • Run a Google search daily, and look for third-party commendary - Nope, although nuggets may be found, for it is not viable as: Not Authoritative, irregular and partial as to updates, and wading through search engine hit, or RSS feed clutter will kill your productivity
    Clearly, one MUST configure the webserver to NOT permit off-site access to the credentials and configuration file: wp-config.php but I'll be darned if I can see instructions on the WordPress site, showing a novice administrator how to do this. In a shared hosting environment without 'root' level control, it is probably not even doable.  There is not hint of this rather elementary precaution on the official write-up concerning editting the file

    A quick Google search for: turns up lots of vulnerable candidate installations, and a handy, dandy code fragment for parsing information out of potential victims so found, to automate take-overs. No criticism of the author of that code publishing his work; a knife can heal (as a scalpel), prepare dinner, or injure, depending on the intent of its holder

    I see an official  recovery outline  suggestion, anyway