[ubuntu] 10.04 x64 server periodically unresponsive to SSH, IMAP, etc...



danep
May 8th, 2010, 04:26 PM
Periodically (about once per day, for 10-30 minutes at a time) my server becomes completely unresponsive to SSH and IMAP requests and to other commands such as svn commit/update, though existing SSH connections seem to be unaffected. These episodes are always accompanied by an error in kern.log that I almost hesitate to post: similar ones show up all over the web, but they are mostly claimed to be fixed and treated as problems in their own right, whereas I think the error is more of a symptom in my case:


May 5 20:53:22 avatar kernel: [38400.383881] INFO: task cron:6459 blocked for more than 120 seconds.
May 5 20:53:23 avatar kernel: [38400.396057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 5 20:53:23 avatar kernel: [38400.419698] cron D 00000000ffffffff 0 6459 902 0x00000000
May 5 20:53:23 avatar kernel: [38400.419704] ffff8800c89cbd10 0000000000000086 0000000000015bc0 0000000000015bc0
May 5 20:53:23 avatar kernel: [38400.419711] ffff8800c8bbc890 ffff8800c89cbfd8 0000000000015bc0 ffff8800c8bbc4d0
May 5 20:53:23 avatar kernel: [38400.419717] 0000000000015bc0 ffff8800c89cbfd8 0000000000015bc0 ffff8800c8bbc890
May 5 20:53:23 avatar kernel: [38400.419723] Call Trace:
May 5 20:53:23 avatar kernel: [38400.419728] [<ffffffff8155593d>] schedule_timeout+0x22d/0x300
May 5 20:53:23 avatar kernel: [38400.419734] [<ffffffff81134c52>] ? __slab_alloc+0x92/0x2d0
May 5 20:53:23 avatar kernel: [38400.419738] [<ffffffff81064141>] ? copy_signal+0x51/0x3a0
May 5 20:53:23 avatar kernel: [38400.419742] [<ffffffff81554bf6>] wait_for_common+0xd6/0x170
May 5 20:53:23 avatar kernel: [38400.419747] [<ffffffff810634d4>] ? check_preempt_wakeup+0x1c4/0x3c0
May 5 20:53:23 avatar kernel: [38400.419751] [<ffffffff8105b280>] ? default_wake_function+0x0/0x20
May 5 20:53:23 avatar kernel: [38400.419756] [<ffffffff81554d4d>] wait_for_completion+0x1d/0x20
May 5 20:53:23 avatar kernel: [38400.419760] [<ffffffff810660dc>] do_fork+0x14c/0x430
May 5 20:53:23 avatar kernel: [38400.419764] [<ffffffff8115edb0>] ? mntput_no_expire+0x30/0x110
May 5 20:53:23 avatar kernel: [38400.419769] [<ffffffff8107e3c5>] ? set_one_prio+0x75/0xd0
May 5 20:53:23 avatar kernel: [38400.419773] [<ffffffff8101afb5>] sys_vfork+0x25/0x30
May 5 20:53:23 avatar kernel: [38400.419778] [<ffffffff81013513>] stub_vfork+0x13/0x20
May 5 20:53:23 avatar kernel: [38400.419782] [<ffffffff810131b2>] ? system_call_fastpath+0x16/0x1b
These cron tasks run every 5 minutes and some of them check an IMAP mailbox for new mail, which I think causes them to block.
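
To check whether cron is really the only task that ends up blocked, the hung-task warnings can be tallied by task name. The following is a minimal sketch, assuming the log lines look like the one quoted above; the function name `blocked_tasks` and the regular expression are my own illustration, not part of any existing tool.

```python
import re
from collections import Counter

# Matches the kernel's hung-task warning, e.g.
#   "... INFO: task cron:6459 blocked for more than 120 seconds."
# The greedy name group tolerates task names that themselves contain ':'.
HUNG_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def blocked_tasks(lines):
    """Count hung-task warnings per task name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = HUNG_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

It could be run as `blocked_tasks(open("/var/log/kern.log", errors="replace"))`, where the path is an assumption about the syslog setup. If names other than cron show up, the IMAP check is less likely to be the cause.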

This has been happening ever since a kernel update in Karmic last year (sorry, I can't recall which one). Even after booting into the old kernel, however, the errors continued to appear. After upgrading to Lucid I feel like they might have decreased slightly in frequency and length, but it's hard to say. All packages are up to date with respect to the Karmic/Lucid repos.

I've been very patient so far, hoping that this would be fixed in Lucid, but now I'm beginning to beat my head against the wall... any help would be appreciated :)

danep
June 5th, 2010, 03:38 PM
This is still happening, and I've learned a few more things about it. For one, it doesn't seem to be correlated with any other cron task in particular (for instance, the PHP jobs that rotate logs and clear sessions). One very peculiar thing is that 95% of the time it occurs between 8PM and 3AM :confused: As I mentioned, I can shut down any cron tasks that run regularly during that window and the error still occurs. The server is located in an enterprise-level data center, so I don't think the server's physical environment is changing with the time of day, yet the problem does seem to be strongly correlated with time of day. I've been trying to think of what happens around 8-9PM that could regularly cause this, but I'm really at a loss.
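
The time-of-day pattern can be checked against the log itself rather than by impression. Here is a rough sketch that buckets the hung-task warnings by hour, assuming the standard syslog timestamp prefix seen in the kern.log excerpt from the first post; `hangs_by_hour` and `format_histogram` are hypothetical helper names. Note that it counts warning lines, not distinct outages, so one long hang can contribute several marks.

```python
import re
from collections import Counter

# Syslog prefix "May 5 20:53:22 host ..." followed by the hung-task warning.
LINE_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+(?P<hour>\d{2}):\d{2}:\d{2}\s"
    r".*blocked for more than \d+ seconds"
)


def hangs_by_hour(lines):
    """Count hung-task warnings per hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[int(match.group("hour"))] += 1
    return counts


def format_histogram(counts):
    """Render one text row per hour, one '#' per warning."""
    return "\n".join(
        f"{hour:02d}:00 {'#' * counts.get(hour, 0)}" for hour in range(24)
    )
```

Printing `format_histogram(hangs_by_hour(open("/var/log/kern.log", errors="replace")))` would show whether the warnings really cluster between 8PM and 3AM; the log path is an assumption, and rotated logs would need to be fed in as well.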