Future Cloudy, Ask Again Later

Recently my pal Bill Schell and I were gassing on about the current and future state of IT employment, and he brought up the topic of IT jobs being “lost to the Cloud”.  In other words, if we’re to believe in the marketing hype of the Cloud Computing revolution, a great deal of processing is going to move out of the direct control of the individual organizations where it is currently being done.  One would expect IT jobs within those organizations that had previously been supporting that processing to disappear, or at least migrate over to the providers of the Cloud Computing resources.

I commented that the whole Cloud Computing story felt just like another turn in the epic cycle between centralized and decentralized computing.  He and I had both lived through the end of the mainframe era, into “Open Systems” on user desktops, back into centralized computing with X terminals and other “thin clients”, back out onto the desktops again with the rise of extremely powerful, extremely low cost commodity hardware, and now we’re harnessing that commodity hardware into giant centralized clusters that we’re calling “Clouds”.  It’s amazingly painful for the people whose jobs and lives are dislocated by these geologic shifts in computing practice, but the wheel keeps turning.

Bill brought up an economic argument for centralized computing that seems to crop up every time we’re heading back into the shift towards centralized computing.  Essentially the argument is summarized as follows:

  • As the capital cost of computing power declines, support costs tend to predominate.
  • Centralized support costs less than decentralized support.
  • Therefore centralized computing models will ultimately win out.

If you believe this argument, by now we should have all embraced a centralized computing model.  Yet instead we’ve seen this cycle between centralized and decentralized computing.  What’s driving the cycle?  It seems to me that there are other factors that work in opposition and keep the wheel turning.

First, it’s generally been a truism that centralized computing power costs more than decentralized computing.  In other words, it’s more expensive to hook 64 processors and 128GB of RAM onto the same backplane than it is to purchase 64 uniprocessor machines each with 2GB of RAM.  The Cloud Computing enthusiasts are promising to crack that problem by “loosely coupling” racks of inexpensive machines into a massive computing array. Though when “loose” is defined as Infiniband switch fabrics and the like, you’ll forgive me if I suspect they may be playing a little Three Card Monte with the numbers on the cost spreadsheets.  The other issue to point out here is that if your “centralized” computing model is really just a rack of “decentralized” servers, you’re giving up some of the savings in support costs that the centralized computing model was supposed to provide.

Another issue that rises to the fore when you move to a centralized computing model is the cost to the organization to maintain their access to the centralized computing resource.  One obvious cost area is basic “plumbing” like network access– how much is it going to cost you to get all the bandwidth you need (in both directions) at appropriately low latency?  Similarly, when your compute power is decentralized it’s easier to hide environmental costs like power and cooling, as opposed to when all of those machines are racked up together in the same room.  However, a less obvious cost is the cost of keeping the centralized computing resource up and available all the time, because now with all of your “eggs in one basket” as it were your entire business can be impacted by the same outage.  “Five-nines” uptime is really, really expensive.  Back when your eggs were spread out across multiple baskets, you didn’t necessarily care as much about the uptime of any single basket and the aggregate cost of keeping all the baskets available when needed was lower.
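To put a rough number on what “five-nines” actually demands, here is a small Python sketch of the downtime budget that each availability level allows per year. The function name is my own and the 365-day year is an assumption; the sketch says nothing about what achieving that uptime costs, only how little slack it leaves.

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    levels = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, availability in levels:
        minutes = downtime_minutes_per_year(availability)
        print(f"{label}: {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of total outage per year, which is why a single centralized basket gets so expensive to keep available.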

The centralized vs. decentralized cycle keeps turning because in any given computing epoch the costs of all of the above factors rise and fall.  This leads IT folks to optimize one factor over another, which promotes shifts in computing strategy, and the wheel turns again.

Despite what the marketeers would have you believe, I don’t think the Cloud Computing model has proven itself to the point where there is a massive impact on the way mainstream business is doing IT.  This may happen, but then again it may not.  The IT job loss we’re seeing now has a lot more to do with the general problems in the world-wide economy than jobs being “lost to the Cloud”.  But it’s worth remembering that massive changes in computing practice do happen on a regular basis, and IT workers need to be able to read the cycles and position themselves appropriately in the job market.

Making Mentoring a Priority

 

I always appreciate (and am in search of) tips for how to be a better sysadmin. I’ve never had the opportunity … to be in a large IT org. I think I miss out on a lot of learning opportunities by not being a part of a large IT org.

from a comment by “Joe” to “Queue Inversion Week”

This comment reflects an industry trend that I’ve been worrying about for a while now.  Back in the 80’s when I was first learning to do IT Operations, it seemed like there were more opportunities to come up as a junior member of a larger IT organization and be mentored by the more senior members of the team.  It’s not overstating the case to say that I wouldn’t appear to be the “expert” that I seem to be today without liberal application of the “clue bat” by those former co-workers (and thanks to all of you– some of you don’t even know how much you helped me).

These days, however, it seems like there are a lot more “one person shops” in the IT world.  And a lot of IT workers are learning in a less structured way on their own– either on the job, or by fooling around with systems at home.  When they get stuck, their only fallback may be Google.  This has to lead to some less-than-optimal solutions and a lot of frustration and burn-out.

So if you’re a one person shop and you’re feeling the lack of mentoring, let me give you some suggestions for finding a support network.

Local User Groups

See if you can find a user group in your area.  Aside from the fact that most local groups sponsor informative talks, they’re also a good way to “network” with other IT folks in your area.  These are people you can call on when you get stuck on a problem.  There’s also the pure “group therapy” aspect of being able to be in a room with people who are living with the same day-to-day problems that you are and understand your language without need of Star Trek technology translation devices.

Google can help you find groups in your area.  Both SAGE and LOPSA also track local IT groups that are affiliated with those organizations.

If you can’t find an existing local group in your area, you might consider starting one.  I’ve found LinkedIn to be helpful for finding other IT people in my geographic area and contacting them.

Mailing Lists and Internet Forums

I subscribe to several IT-related mailing lists with world-wide memberships.  Some of the most active and useful mailing lists for getting questions answered seem to be the SAGE, LOPSA, and GIAC mailing lists, though there are membership costs and/or conference fees associated with getting access to these lists.  Also, there’s nothing that says you can’t subscribe to the mailing lists for various local user groups, even if you’re not actually close enough to attend their meetings.

There are of course different Internet forums where you can post questions and where you might actually get questions answered occasionally.  I haven’t done an exhaustive survey here, but I have found good Linux advice at the Ubuntu Forums and LinuxQuestions.org.  If you have favorites, you might mention them in the comments section.

Live Mentoring

This one is scary for most people, but you might consider contacting somebody who you think is an “expert” and asking them out to coffee/beer/lunch/dinner.  If they’re too busy, they’ll tell you.  But if you don’t ask you’ll never know, and you might be missing out on a great opportunity.

You must understand that my expectation is that if somebody helps you in this way, you are morally obligated to help someone else in a similar fashion in the future.  This is why I think you’ll find that most “experts” worth their salt are more than willing to extend this courtesy to you– somebody in their past provided them with guidance, and they’re just “paying back” by helping you.

Teaching Others

If my last idea was scary, this one will probably make you want to hide under a rock.  But teaching others is a great way to motivate yourself to learn.  I find that I don’t really master a subject until I have to organize my thoughts well enough to convey it to others.

Can’t locate anybody nearby to teach at?  Start a blog and write down your expertise for others to read.  Answer questions for some of the users on the Internet forums mentioned above.  Submit articles to technical journals (as the former Technical Editor for Sys Admin Magazine, I can attest that most of these publications are absolutely desperate for content)– some of them even pay money for articles.

If you’ve taken a SANS course and obtained your GIAC certification, you may be eligible to become a SANS mentor.  This can be an entrée into becoming a SANS Instructor, and is therefore well worth pursuing.

In Conclusion

It’s unfortunate that there are so many folks out there without the built-in support network of working in a large IT organization.  But if you search diligently, I think you may be able to find some other people in your area to network with and get guidance from.  Remember that we all have different levels of expertise in different areas, so sometimes you’re the apprentice and sometimes you’re the “expert” (I’m constantly learning things from my students– yet another reason to teach others).

For the Senior IT folks who are reading this blog, I ask you to please make it a priority to reach out to the more junior members of our profession and help bring them along.  Somebody did it for you, and now it’s your turn.

Lessons from the Metasploit DDoS

This is the second in a series of posts summarizing some of the information I presented in the recent SANS roundtable webcast that I participated in with fellow SANS instructors Ed Skoudis and Mike Poor.  Hopefully this written summary will be of use to those of you who lack the time to sit down and listen to the webcast in its entirety.

During the webcast we were discussing the recent distributed denial-of-service (DDoS) attacks against the Immunity, Metasploit, Milw0rm, and Packet Storm web sites.  H.D. Moore has been posting brief updates about the attacks on his blog, and there are some interesting lessons to be extracted from his commentary.  It’s also good reading for H.D.’s scathing commentary on the quality of the attack.

To summarize, the attacks began around 9pm CST on Friday 2/6. The attack was a SYN-only connection attack on H.D.’s web servers on 80/tcp.  H.D. observed peak traffic loads of approximately 80,000 connections per second.

When faced with a DDoS like this, one of the first questions is whether the source IPs on the traffic are real or spoofed.  H.D. was able to observe his DNS server-related logs and see that about 95% of the source addresses were resolving metasploit.com periodically.  This tends to indicate that the attacker was not bothering to spoof source IP addresses in the traffic.
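The check described above amounts to comparing two sets of addresses. Here is a minimal Python sketch of that comparison. It is not H.D.’s actual tooling: the function name and the sample addresses (taken from the reserved documentation ranges) are my own, and it assumes you have already extracted the source IPs from your packet captures and the client IPs from your DNS query logs. It also assumes the attacking hosts query your DNS server directly; hosts that go through a separate recursive resolver would not match.

```python
def fraction_resolving(attack_sources, dns_clients):
    """Fraction of distinct attack source IPs that also appear as DNS clients.

    A high fraction suggests the sources are real hosts looking up the
    target by name; a low fraction is consistent with spoofed addresses.
    """
    sources = set(attack_sources)
    if not sources:
        return 0.0
    return len(sources & set(dns_clients)) / len(sources)


if __name__ == "__main__":
    # Hypothetical sample data; real input would come from your own logs.
    syn_sources = ["192.0.2.1", "192.0.2.2", "192.0.2.2", "198.51.100.7", "203.0.113.9"]
    dns_queriers = ["192.0.2.1", "192.0.2.2", "198.51.100.7"]
    share = fraction_resolving(syn_sources, dns_queriers)
    print(f"{share:.0%} of attack sources also resolved the site name")
```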

It also led to the first corrective action to thwart the DDoS: H.D. simply changed the A record for metasploit.com to 127.0.0.1.  The downside to this action is that now legitimate users couldn’t find metasploit.com, and in particular it broke the standard update mechanism for the Metasploit Framework.  However, since the attack was only targeting the main metasploit.com web site, H.D. was able to move his other services (blog.metasploit.com, etc) to an alternate IP address and keep operating for the most part.  He was also able to use various “Social Media” outlets (like Twitter, his blog, etc) to let people know about the changes and help them find the relocated services (I can hear the Web 2.0 press organs hooting now, “Twitter Foils DDoS Attack!”– *blech*).

Actually, H.D. is now in sort of an odd situation.  The attackers have hit him with a fire hose of data, but effectively left him controlling the direction of the nozzle.  If H.D. were less ethical than he is, he simply could have changed his A record to point at a routable IP address and used the attack to DDoS somebody else.  It’s unlikely that the folks at the re-targeted site would even have realized why this traffic was suddenly being directed at them, since the attacker wasn’t bothering to send actual HTTP transactions with information about the original target site.

In any event, the attack finally abated at about 9pm CST on Sunday, 2/8.  However, it restarted again Monday morning (read H.D.’s comment regarding the timing of the attack– priceless).  This time, the attacker appears to be using the hard-coded IP addresses of both the original metasploit.com web site and the alternate IP that H.D. had relocated to during the initial attack.  So changing the A record would no longer be effective at stopping the attack.

However, the attackers were still only targeting port 80/tcp.  All H.D. had to do was shut down the web servers on 80/tcp and restart them on 8000/tcp (again notifying his normal users via Twitter, etc).  H.D. notes that his SSL traffic on 443/tcp was completely unaffected by the DDoS, which again was a pretty poor showing by the attackers.

And so the attack continues.  By Tuesday, the SYN traffic has reached 15Mbps, and H.D.’s ISP is starting to get worried.  However, by this point the DDoS sources are apparently back to using DNS resolution because H.D. notes he’s pointing the metasploit.com A record back to 127.0.0.1 and moving his services to metasploit.org.  Of course, at some point the attackers are going to target H.D. at his new domain.  Maybe H.D. should consider deploying the Metasploit web presence into a fast-flux network to continue evading the DDoS weasels?

In any event, I think the countermeasures that H.D. took against the DDoS (changing the A record, changing IPs, changing ports) are all good stop-gap approaches to consider if you’re ever targeted by a DDoS attack like this.  I also commend H.D. for his use of “social media” as an out-of-band communications mechanism with his regular constituents.  And finally, the transparency of blogging about the attack to inform the rest of the community is certainly helpful to others.  Good show, H.D.!

We’re In Your Kernel, Escalating Our Privs

Ed Skoudis and Mike Poor were kind enough to invite me to sit in on their recent SANS webcast round-table about emerging security threats.  During the webcast I was discussing some emerging attack trends against the Linux kernel, which I thought I would also jot down here for those of you who don’t have time to sit down and listen to the webcast recording.

Over the last several months, I’ve been observing a noticeable uptick in the number of denial-of-service (DoS) conditions reported in the Linux kernel.  What that says to me is that there are groups out there who are scrutinizing the Linux kernel source code looking for vulnerabilities.  Frankly, I doubt they’re after DoS attacks– it’s much more interesting to find an exploit that gives you control of computing resources rather than one that lets you take them away from other people.

Usually when people go looking for vulnerabilities in an OS kernel they’re looking for privilege escalation attacks.  The kernel is often the easiest way to get elevated privileges on the system.  Indeed, in the past few weeks there have been a couple [1] [2] of fixes for local privilege escalation vulnerabilities checked into the Linux kernel code.  So not only are these types of vulnerabilities being sought after, they’re being found (and probably used).

Now “local privilege escalation” means that the attacker has already found their way into the system as an unprivileged user.  Which raises the question, how are the attackers achieving their first goal of unprivileged access?  Well certainly there are enough insecure web apps running on Linux systems for attackers to have a field day.  But as I was pondering possible attack vectors, I had an uglier thought.

A lot of the public Cloud Computing providers make virtualized Linux images available to their customers.  The Cloud providers have to allow essentially unlimited open access to their services to anybody who wants it– this is, after all, their entire business model.  So in this scenario, the attacker doesn’t need an exploit to get unprivileged access to a Unix system: they get it as part of the Terms of Service.

What worries me is attackers that pair their local privilege escalation exploits with some sort of “virtualization escape” exploit, allowing them hypervisor level access to the Cloud provider’s infrastructure.  That’s a nightmare scenario, because now the attacker potentially has access to other customers’ jobs running in that computing infrastructure in a way that will likely be largely undetectable by those customers.

Now please don’t mistake me.  As far as we know, this scenario has not occurred.  Furthermore, I’m willing to believe that the Cloud providers supply generally higher levels of security than many of their customers could do on their own (the Cloud providers having the resources to get the “pick of the litter” when it comes to security expertise).  At the same time, the scenario I paint above has got to be an attractive one for attackers, and it’s possible we’re seeing the precursor traces of an effort to mount such an attack in the future.

So to all of you playing around in the Clouds I say, “Watch the skies!”

Good show!

Gene Kim, Kevin Behr, and I have been having some fun on Twitter (where we’re @RealGeneKim, @kevinbehr, and @hal_pomeranz respectively) with the #itfail thread.  Every time we come across an example of poor IT practice, one of us throws out a tweet.  It’s driven some entertaining discussions, and generally been a good way to blow off steam.

But I also think folks with good IT practices should be recognized in some way.  So I just wanted to send out this quick post to recognize the proactive IT process demonstrated by our hosts here at WordPress.com.  When I logged into my dashboard today, I found the following message waiting for me:

We will be making some code changes in about 42 hours which will log you out of your WordPress.com account. They should only take a few seconds and you should be able to log in afterwards without any problems. (more info)

It seems like such a simple thing, but to me it demonstrates some excellent IT process competencies:

  • Planned changes: they know well in advance when change is occurring and how long they expect it to take
  • Situational awareness: they know who is going to be impacted and in what fashion
  • Communication: they have a mechanism for alerting the affected parties in a reasonable time frame
  • Transparency:  they’re willing to alert their user community that they will be inconvenienced rather than just let it happen and hope nobody notices

While these all may seem like trivial or “of course” items to many of you reading this blog, let me tell you that many of the IT shops that I visit as part of my consulting practice regularly #itfail in some or all of the above areas.

So kudos to the folks at WordPress!  Good show!

Queue Inversion Week

Reliving the last story from my days at the mid-90’s Internet skunkworks reminded me of another bit of tactical IT advice I learned on that job, and which has become a proven strategy that I’ve used on other engagements.  I call it “Queue Inversion Week”.

One aspect of our operations religion at the skunkworks was, “All work must be ticketed” (there’s another blog post behind that mantra, which I’ll get to at some point).  We lived and died by our trouble-ticketing system, and ticket priority values generally drove the order of our work-flow in the group.

The problem that often arises for organizations in this situation, however, is what I refer to as the “tyranny of the queue”.  Everybody on the team is legitimately working on the highest-priority items.  However, due to limited resources in the Operations group, there are lower priority items that tend to collect at the bottom of the queue and never rise to the level of severity that would get them attention.  The users who have submitted these low-priority tickets tend to be very understanding (at least they were at the skunkworks) and would wait for weeks or months for somebody in my group to get around to resolving their minor issues.  I suspect that during those weeks/months the organization was actually losing a noticeable amount of worker productivity due to these “minor” issues, but we never quantified how much.

What did finally penetrate was a growing rumble of unhappiness from our internal customers.  “We realize you guys are working on bigger issues,” they’d tell me in staff meetings, “but after a few months even a minor issue becomes really irritating to the person affected.”  The logic was undeniable.

I took the feedback back to my team and we started kicking around ideas.  One solution that had a lot of support was to simply include time as a factor in the priority of the item: after the ticket had sat in the queue for some period of time, the ticket would automatically be bumped up one priority level.  The problem is that when we started modeling the idea, we realized it wouldn’t work.  All of the “noise” from the bottom of the queue would eventually get promoted to the point where it would be interfering with critical work.
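For the curious, the flaw in the time-based idea is easy to see in a few lines of Python. This is only a sketch of the kind of model we were kicking around, not the actual ticketing system: the function name, the four-level priority scale (1 is most urgent), the two-week bump interval, and the assumption that a ticket keeps getting bumped for every interval it waits are all mine.

```python
def aged_priority(base_priority: int, days_open: int, bump_every_days: int = 14) -> int:
    """Priority after aging, where 1 is the most urgent level.

    Each full interval spent in the queue promotes the ticket one level,
    and it can never rise above level 1.
    """
    return max(1, base_priority - days_open // bump_every_days)


if __name__ == "__main__":
    # A lowest-priority (level 4) annoyance left alone for six weeks.
    for days in (0, 14, 28, 42):
        print(f"day {days}: priority {aged_priority(4, days)}")
```

After six weeks the minor annoyance sits at the same level as a genuine emergency, and so does every other old ticket from the bottom of the queue.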

Then my guy Josh Smift, who basically “owned” the trouble ticketing system as far as customization and updates was concerned, had the critical insight: let’s just “invert” the queue for a week.  In other words, the entire Operations crew would simply work items from the bottom of the queue for a week rather than the top.  It was simple and it was brilliant.
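In code terms, inverting the queue is nothing more than flipping the sort order for a while. Here is a minimal Python sketch under my own assumptions: the Ticket fields, the convention that 1 is the most urgent priority, and the oldest-first tie-break are hypothetical and are not taken from the system Josh maintained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    ticket_id: int
    priority: int   # 1 = most urgent (assumed convention)
    days_open: int


def work_order(tickets, inverted=False):
    """Return tickets in the order they should be worked.

    Normal weeks: most urgent first, oldest first within a level.
    Inversion week: least urgent first, still oldest first within a level.
    """
    if inverted:
        return sorted(tickets, key=lambda t: (-t.priority, -t.days_open))
    return sorted(tickets, key=lambda t: (t.priority, -t.days_open))


if __name__ == "__main__":
    queue = [Ticket(1, 1, 2), Ticket(2, 4, 90), Ticket(3, 4, 30), Ticket(4, 2, 5)]
    print("normal:  ", [t.ticket_id for t in work_order(queue)])
    print("inverted:", [t.ticket_id for t in work_order(queue, inverted=True)])
```

Nothing about the tickets or their priorities changes, which is the appeal: the long-waiting minor items get worked without permanently polluting the top of the queue.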

So we looked at the project schedule and identified what looked like a “slack” week and declared it to be “Queue Inversion Week.”  We notified our user community and encouraged them to submit tickets for any minor annoyances that they’d been reluctant to bring up for whatever reason.

To say that “Queue Inversion Week” was a raging success is to put it mildly indeed.  Frankly, all I wanted out of the week was to clear our ticket backlog and get our customers off our backs, but the whole experience was a revelation.  First, the morale of my Operations team went through the roof.  Analyzing the reasons why, I came to several conclusions:

  • It got my folks out among the user community and back in touch with the rest of the company, rather than being locked up in the data center all day long.  The people who my folks helped were grateful and expressed that to my team, which makes a nice change from the usual, mostly negative feedback IT Ops people tend to get.
  • The tickets from the bottom of the queue generally required only the simplest tactical resolutions.  Each member of my team could resolve dozens of these items during the week (in fact, a friendly competition arose to see who could close the most tickets), and feel terrific afterwards because there was so much concrete good stuff they could see that they’d done.
  • Regardless of what outsiders think, I believe most people in IT Operations really want to help the people who are their customers.  It’s depressing to know that there are items languishing at the bottom of the queue that will never get worked on.  This week gave my team an excuse to work on these issues.

I think I can reasonably claim that Queue Inversion Week also had a noticeable impact on the morale of the skunkworks as a whole.  After all, many of the annoying problems that our users had been just doing work-arounds for were now removed as obstacles.  Like a good spring cleaning, everybody could breathe a little easier and enjoy the extra sunshine that appeared through the newly cleaned windows.

We repeated Queue Inversion Week periodically during my tenure at the skunkworks, and every time it was a positive experience that everybody looked forward to and got much benefit from.  You can’t necessarily have it happen on a rigid schedule, because other operational priorities interfere, but any time it looks like you have a little “slack” in the project schedule coming up and the bottom of your queue is full of little annoying tasks, consider declaring your own “Queue Inversion Week” and see if it doesn’t do you and your organization a world of good.

How I Learned to Start Loving Implementation Plans

Lately I’ve been seeing multiple articles pop up arguing for the importance of plans and checklists in many walks of life, but especially in IT process.  I couldn’t agree more.  In fact, when an IT person believes something as strongly as I believe in the power of implementation plans and checklists, you know there must be a story involved.  Let me tell you mine.

Back in the mid-90’s my good friend Jim Hickstein and I were working for the Internet skunkworks of a huge East Coast direct mail marketing company.  At the time I think I was the Operations Manager and Jim was my tech lead in the Sys Admin group, but since we were a small operation both Jim and I were still deeply involved in the tactical implementation and roll-out of new technologies.

In particular, there was one instance where Jim and I were scheduled to roll out a new storage array into our production ecommerce infrastructure.  Our production environment was hosted at company HQ down in Connecticut (our offices were in the Boston area), and in some sense this IT activity was part of an ongoing evangelism on our part of new “open” technology in what was largely a mainframe shop up until our arrival.  We wanted to show that we were better, faster, cheaper, and just as reliable as the platforms that they were used to.

I remember talking to Jim and suggesting that we come up with an implementation plan prior to the installation activity.  But clearly neither one of us had gotten the religion yet, because after talking about it for a while we decided that we “knew what we were doing” (despite never having deployed a unit of this type before) and didn’t need a plan.  So with our figurative cowboy hats firmly in place, we had the equipment shipped and a couple of weeks later we hopped in Jim’s car and followed it down to wildest Connecticut.

Upon arrival, we met up with a couple of the IT folks from HQ who were curious about what we were doing and wanted the chance to observe how we operated.  That was cool with us, because we wanted to evangelize as much as possible, remember?  Off we all went to the data center, unpacked and inspected the equipment, and generally killed time waiting for our three-hour outage window to commence.  Once the outage window arrived, Jim and I began our work.

Crisis set in almost immediately.  I can’t even remember at this point what started going wrong, but I think almost every person who’s done production Operations recognizes that “bottom drops out” feeling you get in your stomach when you suddenly realize what you thought was routine work has gone horribly awry.  As I recall, I was working the back of the machine hooking up the storage array and Jim was on the system console and I clearly remember staring across the rack at Jim and seeing the same horrified look reflected in his eyes.

I don’t remember much of the next two and a half hours.  I know we didn’t panic, but started working through the problems methodically.  And we did end up regaining positive control and actually got the work completed– just barely– within our outage window.  But frankly we had forgotten anything else in the world existed besides our troubled system.

When the crisis was resolved and we returned to something resembling the real world, we remembered that our actions were being scrutinized by our colleagues from HQ.  I turned to acknowledge their presence, expecting to see (deservedly) self-satisfied smiles on their faces from seeing us nearly fall on our faces in a very public fashion.  Well boy was I surprised: they were actually standing there in slack-jawed amazement.  “That was the most incredible thing I’ve ever seen!  You guys are awesome!” one of them said to me.

What could we do?  Jim and I thanked them very much for their kind words, used some humor to blow off some of our excess adrenaline, and extracted ourselves from the situation as gracefully as possible.  After leaving the data center, Jim and I walked silently back to his car, climbed in, and just sat there quietly for a minute.  Finally, Jim turned to me and said, “Let’s never do that again.”

And that was it.  From that point on our operations policy playbook– enforced religiously by both Jim and myself– included the requirement that all planned work include a detailed implementation and rollback plan that must be reviewed by at least one other member of the operations staff before any work would be approved.  And to those who are thinking that this policy must have slowed down our ability to roll out changes in our production infrastructure, you couldn’t be more wrong.  In fact, our ability to make changes improved dramatically because our percentage of “successful” change activity (defined as not requiring any unplanned work or firefighting as a result of the change) got to better than 99%.  We simply weren’t wasting time on unplanned work, so as a consequence we had more time to make effective changes to our infrastructure.
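The policy itself is simple enough to express as a gate that a change must pass before work is approved. The Python sketch below is purely illustrative and was not part of our playbook: the ChangePlan fields, the function name, and the sample names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangePlan:
    author: str
    implementation_steps: list = field(default_factory=list)
    rollback_steps: list = field(default_factory=list)
    reviewers: list = field(default_factory=list)


def approval_blockers(plan: ChangePlan) -> list:
    """List the reasons a plan cannot be approved; empty means good to go."""
    blockers = []
    if not plan.implementation_steps:
        blockers.append("no implementation steps")
    if not plan.rollback_steps:
        blockers.append("no rollback steps")
    # The review must come from somebody other than the author.
    if not any(reviewer != plan.author for reviewer in plan.reviewers):
        blockers.append("no review by another member of the operations staff")
    return blockers


if __name__ == "__main__":
    cowboy = ChangePlan(author="hal")
    print(approval_blockers(cowboy))
```

Our cowboy-hat trip to Connecticut would have failed all three checks.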

That’s it.  Something as humble as a checklist can make your life immeasurably better.  But you have to be willing to admit that you need the help of a checklist in the first place, which compromises our need for the “expert audacity” that so many of us in IT feed on.  It’s a tough change to make in oneself.  I only hope you won’t have to learn this lesson the hard way, like Jim and I did.