Showing posts with label sysadmin. Show all posts

11 July 2014

The problem with older packaging managers in two lines ...

There was a post about the NetBSD-oriented 'pkgsrc' project on the Scientific Linux mailing list earlier this week. Lots of skilled geeks following their muse ... but from the #pkgsrc IRC channel ('nick' changed to protect the poster):
09:11 wxyz> Ugh. I think I should have chosen the X from pkgsrc.
09:11 wxyz> The one I have now lacks fonts that applications need.
Yeah. Package 'managers' that unroll tarballs but lack a strong requirements / dependency system are like that
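The contrast shows up at the command line. A minimal sketch, assuming an RPM-based host on one side and a bare tarball on the other (package and file names are illustrative, not taken from the poster's setup):

```shell
# A dependency-aware package manager records what a package needs,
# and the resolver (yum/dnf) pulls those pieces in before install:
rpm -q --requires xorg-x11-server-Xorg    # list declared requirements
# yum install xorg-x11-server-Xorg        # resolver satisfies them all

# A tarball-unrolling 'manager' just extracts files and hopes the
# runtime pieces (here, the fonts) happen to be present already:
# tar -xzf modular-xorg-server.tgz -C /usr/pkg
```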

02 October 2013

About those doughnuts ...

In a recent post, I closed:
Dollars to doughnuts, there is 'more than one roach' lurking.
I'll cover a bet that there are not tested backups in that shop
I had occasion to speak with that person again the other day. No backups, no planning for any by that person's predecessor, and so, by definition, no Disaster Recovery Plan
Staufs storefront
I am looking forward to getting a box of doughnuts -- from the DK would be nice. Please make sure two Apple Fritters are in there, as the other coffee vultures always grab the one I wanted. The Doughnut Kitchen is close to Staufs

20 September 2013

win win habits

PMman was designed for long-lived VMs (contrast: ephemeral AWS or OpenShift type instances), and so our communication needs differ from those of other cloud instance control interfaces.  This applies to the client tenant side views, but also to the sysadmin / devops side view

One of the convenience features of the PMman cloud product we run is the ease of communicating with 'tenants' on a given dom0.  When I came in this morning, I had this notice waiting for me:
From: mdadm monitoring
To: root@kvm-nNNN.pmman.com
Subject: DegradedArray event on /dev/md0:kvm-nNNN.pmman.com

This is an automatically generated mail message from mdadm
running on kvm-nNNN.pmman.com

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.
It is not the end of the world -- there are four members in that raid array; all mutable data is backed up nightly, and it is easy enough to fix with a hot drive swap.  BUT, we did a drive swap for a different raid member on that same unit within the last month.  I want to totally remove that unit from production, and put it into 'on the bench' mode, so I can see if there is some deeper hardware issue
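The recovery path for a degraded md array is routine. A sketch of the usual steps, with illustrative device names (not a transcript of this particular incident):

```shell
# See which member dropped out of the array:
cat /proc/mdstat
mdadm --detail /dev/md0

# Fail out and remove the dead member (sdX1 is illustrative):
mdadm /dev/md0 --fail /dev/sdX1 --remove /dev/sdX1

# After the hot swap, copy the partition layout from a healthy
# drive onto the new one and re-add it to the array; the resync
# then runs in the background:
sfdisk -d /dev/sda | sfdisk /dev/sdX
mdadm /dev/md0 --add /dev/sdX1
watch cat /proc/mdstat        # monitor the rebuild
```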

The neat part from my point of view is that our design included a way to contact the tenant VM owners on JUST that box alone.  It is as easy as clicking a couple of buttons and typing
The following message will be sent to the list of users [on box: kvm-nNNN.pmman.com]. Here is the opportunity to fine tune the list.

From:     support@pmman.com
Subject:     Raid array member maintenance needed
Notice Level:     N
Message:     We have had some raid array member failures on the underlying dom0 of a machine you run at PMman. We replaced a drive hot three weeks ago on this same unit, and now have a new failure.

This may be a portent of a failing drive control subsystem, rather than the drive (although the previously removed drive tested bad and has been RMA'ed)

One new feature of PMMan of which you may not be aware is the 'no extra charge' weekly reboot and Level 0 backup. This is called a 'cyclic' backup with a 7 day repetition interval. To the extent you have NOT enabled this feature, I strongly suggest you test it and take advantage of this enhanced functionality of the interface

It is accessible off the 'Backups' button of the VM control web interface. Feedback is welcome of course. A nice side benefit from OUR point of view of offering this to customers is that it enables us to do invisible migrations from one dom0 to another as the backup and reboot are occurring, and so 'clear out' a dom0 host of running client instance VMs

Thanks

-- Russ herrold
User list: ... [elided] ...
Emails go out, and a history item gets dropped into each VM's logfile as well as into our admin side logs.  Optionally I can have the tool turn on an 'attention needed' flag in the end tenant's console, which will persist until they acknowledge it.  We already do that as to 'too long between updates' and 'too long between reboots' and such

We can of course do invisible 'hot' migrations of machines around, but even safer is to encourage the good habit of tenant VM owners taking Level zero backups (which we automatically test)

Win-win

13 June 2013

Phone call: 'I've got this sick machine ...'

me:  well, why is it sick?

them:  yum complains about a missing signing key

me: so install the key; it is down in /etc/pki/rpm-gpg/, and rpm --import ... will do the trick

them: that directory is not there

me: who set up the machine?

them:  well, I was handed it, and ...

me: so, take a level zero backup and then clean up the machine before trying to work on it, or deploy a new one

them: well, I can't

I just got off that call from a friend in a new employment situation

The technical fix was outlined by me long ago, and I sent an email with the link along to the person calling
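For reference, the fix itself is short. A hedged sketch for a CentOS 6 era box (package and key file names are illustrative; adjust for the distribution actually at hand):

```shell
# Restore the vendor key files by reinstalling the release package:
yum reinstall centos-release      # repopulates /etc/pki/rpm-gpg/

# Import the signing key so rpm and yum can verify packages again:
rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

# Confirm the key is now known to the rpm database:
rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n'
```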

BUT: fixing the mindset inside the caller's head -- do not try to work in an undefined (here: broken) environment -- is harder

But the caller has a problem in their work-flow process; a fix has to be done, and sooner is probably better than later; a broken machine in production is 'technical debt', pure and simple.  Fundamental expectations are not met; binary partitioning will not work well to isolate problems, as more than one piece is probably broken.  It will break again, and a perception may well form that the caller is the problem, rather than the broken environment they were handed

Be sure to make a note to yourself to also address the broken process that permitted that machine to escape into production.  Dollars to doughnuts, there is 'more than one roach' lurking.  I'll cover a bet that there are not tested backups in that shop

04 January 2013

Another pet died across the holidays

I wrote before about un-maintained and orphaned WordPress sites being exploited.  That same frantic user from two months ago called again.  The TL;DR summary is:
  • cPanel administration with multiple accounts in a single host without protections 
  • OS Updates not being run
  • WordPress updates not being run
  • Random add-ons being used without an awareness of security issues
  • No SELinux (disabled)
An exploit un-gzip-ping a hostile payload from cache was used, and the machine was taken over

The absence of good sysadmin skills, well packaged content, and updates 'for the loss' ...

27 September 2012

Feeding the pet

We had a frantic call from a sometimes customer today.  Their self-administered WordPress-based website had a Trojan in it, and it was saturating their website traffic allocation.  "THE SITE WAS DOWN!!"  They had signed up at a cPanel mediated, shared hosting firm, and a plug-in they had installed turned out to contain a well-known trojan

We spent a couple of hours looking into it.  And then a couple more hours looking into the WordPress security notification system.  Perhaps I should say: non-notification system, as regards getting subscribed to a formal notification mailing list from the WordPress folks proper

The WordPress model seems to be: treat your WordPress site as though it is a pet that needs daily feeding.  And to be 'put down' when you lose interest in it, move on, or forget about it -- Oops.  Log in daily as an administrator, and look for a notification
that you need to apply the 'latest and greatest' update.   Run the update process manually whenever it appears.  Oh yeah, did you remember to take a backup FIRST, and test that you can roll back to it if the 'update' breaks anything? Oops

This of course RULES OUT using a packaged approach to managing such sites, as the lag for stabilizing a new RPM package, accounting for potential database changes, and the like 'take too long'. Just unroll a tarball, and trust that it will not break any local customizations

I see fourteen tabs still open in my browser panel, related to trying to track down a central and formal notification feed to which I (or any person seeking 'push' notification) might subscribe, carrying only 'Security' notices.  Weeding through the tabs, ...

  • The 'Famous 5-Minute Install' for WordPress -- Nope, no useful outlink for hardening, nor to subscribe to notifications, beyond a pointer to a third-party Ubuntu appliance with 'automatic security updates'.  That appliance's page has pointers to a tool to enable taking database backups, adding PHPMyAdmin, and Webmin.  Not good choices for a person caring about security
  • Perhaps FAQ items tagged with: Security -- Nope, clearly incomplete, as for example a Google search turns up this third-party alert for version 3.3.2,  but the Release Notice does not get titled with: Security
  • This bug (#10253) lingered for three years with a Security tag in their Trac issue tracker as to the current release series (3.4), and was amended ten days ago; but the latest release (for 3.4.2) was twenty days ago as this is written.  Should an update have been released?  Who knows?
  • Perhaps their FAQ Security -- Nope, no push notification link suggested there, but lots of clutter as to copyright infringement notification handling, and miscellaneous topics
  • Perhaps watch the Releases News in an RSS reader - Oops, no sub-tag feed offered, and there has not been an "Important" Security release since December 2010, if one used that approach
  • Run a Google search daily, and look for third-party commentary -- Nope; although nuggets may be found, it is not viable: not authoritative, irregular and partial as to updates, and wading through search engine hits, or RSS feed clutter, will kill your productivity
Clearly, one MUST configure the webserver to NOT permit off-site access to the credentials and configuration file wp-config.php, but I'll be darned if I can see instructions on the WordPress site showing a novice administrator how to do this. In a shared hosting environment without 'root' level control, it is probably not even doable.  There is no hint of this rather elementary precaution in the official write-up concerning editing the file
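Where one does control the webserver, the rule itself is small. A sketch for an Apache 2.2-style setup (the docroot path is illustrative, and this is my assumption of a typical install, not anything the WordPress docs show):

```shell
# Append a deny stanza to the site's .htaccess (2.2-era syntax;
# Apache 2.4 would use 'Require all denied' instead):
cat >> /var/www/html/.htaccess <<'EOF'
<Files wp-config.php>
    Order allow,deny
    Deny from all
</Files>
EOF
```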

A quick Google search turns up lots of vulnerable candidate installations, and a handy-dandy code fragment for parsing information out of the potential victims so found, to automate take-overs. No criticism of the author of that code for publishing his work; a knife can heal (as a scalpel), prepare dinner, or injure, depending on the intent of its holder

I see an official recovery outline suggestion, anyway

19 July 2012

Right, Left, down the middle

A couple weeks ago, a 'Derecho' blew through Columbus, on its way to the metro DC area. Amazon had some failures that cascaded through to people who did not have site redundancy. People know that the East Coast was hit hard, but as we are out in 'fly-over' country, they perhaps did not realize that we had several hundred thousand people around here without electricity for a couple of weeks as well

I've mentioned before that the primary datacenter we run our PMman product out of is at the Tier IV level -- multiply redundant cooling, power grid, power backup, fiber entrances, and carriers. The owner, a friend, is just a fiend about making sure he does not HAVE outages

Me, too. In our after-event review, I see that one of our secondary sites here in town fell back to its generators, but the rest were all fine. But all sites we use are well covered, all fiber, all multi-homed. Planning for failure was in our deployment planning checklist; we pay for (and we charge for) that coverage; and I consider it worth it

A national footprint customer based in Canada agrees. Their lead technical person reports that our connectivity is faster than their datacenter eighty miles from their home office. Not surprising, as our main DC is on a 'main line' fiber route between Chicago, NYC, DC, and Atlanta -- financial markets and federal government presences can help, that way

If the availability of your online presence matters to you, feel free to ask for a quote

13 April 2012

LOPSA at the PMman DC

I went up to our North datacenter for PMman, where a local group of system admins held a meeting, starting up a local LOPSA chapter. Food and soda were provided by the DC operator, along with salad ... since when did sysadmins start eating healthier food, rather than a diet of high-sugar, high-caffeine junk food?

The presentation slide deck was fine, and the presenter (a 'long timer' at a local credit-card clearance operation) ran through his bullet list of what to look for in the 'build vs. buy (lease space)' decision, and then a number of siting concerns.

Now I am familiar with his firm's site from prior visits, and it is adjacent to a major highway with regular closures for accidents; adjacent to a major rail yard where chemical spills have caused evacuations; and sole-sourced into the power company grid

Our North site was chosen after a survey of all offerings within a radius we were willing to drive for 'end of the world' 'hands on' intervention; it is jacked into two independent power grids along with the on-site generators, and is a premier demonstration location for the former Liebert (now owned by Emerson) power and site conditioning gear

I happen to drink coffee daily at Staufs with Liebert's representative here in town. My evaluation team suggested the location as a finalist, and when I checked, it turned out that I already knew the owner / developer from long, long ago telephony days; when I have time, I'll go up and 'shoot the bull' with him on Saturday mornings at the DC

We have had a grand total of ZERO power related outages, and only one network connectivity issue in the last three years, that lasting less than 15 minutes, and that due to human error in not handling a BGP fail-across migration properly [the cut-over protocol was changed, as I noticed the drop from my monitoring and called the owner's cell phone at once ;) ]. Well suited to our 'enterprise' customers

It is 'carrier neutral' and hugely connected -- multiple entrances of up to 88 x 100 G fiber spread across six or seven principal carriers. Native IPv6 to all drops we run through multiple carriers, along with the IPv4. I helped with the IPv6 design and cut-over some 18 months ago, and it has been seamless. The facility, and our services, just do not have outages except those that human error causes

It is not the cheapest in town ... but it is fairly priced for the value we have received

I had not sat down and reflected on how satisfied I was with that shift of our center of operations to the DC, but as I think about it, I am well pleased