Sorry for the slowness for the last few hours.
My webserver was standing up to the load just fine, I just did something stupid.
As some background, I almost always leave a terminal multiplexer (screen) running on my vps. It eats a few megs of memory, but I have enough to spare that it isn’t an issue.
And sometimes, when I’m bored, I’ll run a command like:
$tail -f apache_access_log.log| grep 'GET article-url'
It basically shows me real time access to a particular url - every time someone reads that article I see their user-agent string, referrer, and a bit of other useful information.
Of course I have real analytics software too, but sometimes it’s just fun to watch the hits scroll by - it’s like a pretty matrix-esque screen saver.
Well, I was doing that this morning, and then I left my computer. I disconnected my screen session and ssh, and went about life in the real world. When I returned ~6 hours later, I noticed my brand new blog was down. Uh-oh.
After connecting to a console to investigate (ssh was timing out), I was greeted by this beautiful sight:
#uptime [...]load average: 20.84, 22.16, 23.84
For the record, that’s bad. I don’t think I’ve ever seen a load that high on a single cpu machine …
I was particularly surprised since I had spent a few hours the previous day setting up the site (caching, php compiling, tweaking apache/mysql, etc) to handle slashdot level loads. I had simulated well over a 1000 concurrent users without a hiccup.
After looking at ps, I realized that my little log trick was the culprit. I killed the job, and load dropped right down to 0.
I’m still looking into why it pegged the CPU so hard though - I’ve left that same command running for hours on a terminal I’m connected to with no discernible effect. My best guess is having that much information going to STDOUT in a disconnected screen session triggered some sort of pathological edge case … I’ll investigate more thoroughly tonight.
Anyways, my apologies to anyone who tried to access the site when it’s down. I’m always the first person to criticize twitter when I hear it’s down (again) - maybe there’s something is to be said about glass houses …
Has anyone else ever made a stupid mistake like this that brought a server to its knees? If so, I’d love to hear about it. I think I need a little pick me up ![]()
Comments 7
I doubt screen was the culprit. Screen only maintains a buffer for the most recent N lines (configurable via defscrollback, defaults to 100). That’s it.
I would blame the VPS.
Posted 07 Sep 2008 at 3:53 pm ¶I worked a sales-marketing company a decade ago. We had call centers which relied on SCO Unix running on 486’s.
These boxes (rhymes with ACK) came with dedicated hardware and software to allow agents to make/receive calls and input data.
Anyhoo, the SW ordinarily kept things pretty tight, and running anything else on the machine would run up loads from 12’s into the low 20’s.
The worst happened one day when one system was untaring from DAT, running the SW on an active outbound call floor, unsanctioned DB lookups on active customer files, NFS’g several other files across a WAN, and FTP’g data directly to a client system.
Damn thing hit 32 before it started scrambling streams, and response was worse than glacial at the console. Lost everything on HDD’s, had to send dozens of agents home, one client fired us, and had to rebuild system from week- and 2-day old backups to get it right again.
I had warned by boss this was likely to happen, but kept my mouth shut as I rebuilt the smoking hulk. Programmer’s group and client services both looked pretty dumb for a while…
Posted 07 Sep 2008 at 4:00 pm ¶Something else going on, I suspect. I do this trick all the time. Currently doing the same thing, with screen, in am emacs shell buffer while I hit my (WebSphere) server with requests.
16:44:23 up 16 days, 12:07, 4 users, load average: 0.54, 0.14, 0.04
Posted 07 Sep 2008 at 4:45 pm ¶Oh, in case it matters to you:
uname -a =>
Linux 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007 i686 i686 i386 GNU/Linux
Posted 07 Sep 2008 at 4:47 pm ¶I would hardly call this a stupid mistake. It seems more than reasonable to expect a tail + grep in a detached screen not to eat up your cpu. Load average is the # of processes in the run queue, so a single process hammering the cpu wouldn’t generate such a high load, unless other active processes were being starved. What were those other processes? Pre-forked apache child processes?
Posted 07 Sep 2008 at 6:04 pm ¶Yeah, I’ve taken a production machine offline myself, with a two-line command. This was over twenty-five years ago, at a company called MAI Basic/Four in Irvine, CA. I was a member of the very first technology (computer) oriented Eagle Scout group learning programming on a mainframe.
I was on a standard coding workstation (you know, the ones that have been put into a sandbox, right?) when I was goofing off playing with different commands.
Two in particular, when put together in a particular way, had unanticipated consequences: they crashed THE mainframe for the entire corporation. It took 3 teams of engineers working 24-hours per day five entire days to get the mainframe back up. The cost to the company in lost productivity was never really calculable. The official loss record was in the millions (in 1982 dollars).
Sadly, our Eagle Scout group was no longer welcome after that (and IIRC about 30 testers had Very Bad Things happen to them for not catching that bug.)
Hope that made you smile,
Posted 08 Sep 2008 at 10:37 am ¶Robert~
I’m guessing the log file rolled over and tail -f held on to the original file inode, somehow causing the entire file associated with the original inode to be fed through grep every second.
Posted 08 Sep 2008 at 2:42 pm ¶Post a Comment