server crashing once a day

**tgalati4** · March 17th, 2010

Rather than waiting for the server to crash, monitor the swap file. When it passes some predetermined value (say 500 MB) then start pruning processes and examining log files.

Write a script that examines your swap file every 10 minutes. If it exceeds 500 MB then dump some vmstat data and send an email or wall message to take action.

cat /proc/swaps

man gawk

cat /proc/swaps | gawk '/sda3/ {print $4}'

If the value printed above is greater than 500 MB, then do something. Note that /proc/swaps reports the Used column in kilobytes, so 500 MB is roughly 512000. Your swap device will be different from mine (sda3).

echo "My server is about to crash!!" | wall
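Putting those pieces together, here is a minimal sketch of such a script. It is only an illustration: the function names, the log path and the `run` argument are my own placeholders, and it uses plain awk instead of gawk and sums every swap device rather than matching sda3.

```shell
#!/bin/sh
# Sketch of a swap watchdog; names and paths here are illustrative assumptions.

THRESHOLD_KB=512000          # 500 MB expressed in kB, the unit /proc/swaps uses
LOG_FILE=/tmp/swapwatch.log  # hypothetical location for the vmstat dump

# Print the total "Used" column (field 4) of a /proc/swaps-style file,
# skipping the header line and summing across all swap devices.
swap_used_kb() {
    awk 'NR > 1 { used += $4 } END { print used + 0 }' "${1:-/proc/swaps}"
}

# Succeed (exit status 0) when the given usage exceeds the threshold.
over_threshold() {
    [ "$1" -gt "$THRESHOLD_KB" ]
}

# Check once: if swap use is too high, save vmstat output and warn via wall.
check_swap() {
    used=$(swap_used_kb "${1:-/proc/swaps}")
    if over_threshold "$used"; then
        { date; vmstat 1 5; } >> "$LOG_FILE" 2>&1
        echo "Swap use is ${used} kB; server may be about to crash!" | wall
    fi
}

# Only act when invoked with "run", so the functions can be sourced safely.
if [ "${1:-}" = "run" ]; then
    check_swap
fi
```

To run it every 10 minutes, a crontab entry along the lines of `*/10 * * * * /usr/local/bin/swapwatch.sh run` should do (the install path is just an example).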

**flipybcn** · March 18th, 2010

First, thanks for all your tips/suggestions.

It keeps crashing, no matter what I do. It is pretty much unusable since I can't log in to it (neither remotely nor locally).

Looking at the logs the only thing I see for all the sites we're hosting is a bunch of "file does not exist", but that shouldn't hurt, right?

However, looking at the php configuration I've found that the variable memory_limit was set to 128M. I've never used such a high value before, so it must have been a suggestion from a client. I've decided to change it to a safer value (32M).

Since this is pretty critical, I'll try to deploy a HA Cluster with 2 servers.
But, if the error is caused by some code, it will be useless.

I'll keep monitoring this week, checking swap and log files (I'd love to find the problem, but it seems that is not going to happen...).

Thanks!

**ploum** · March 23rd, 2010

I have exactly the same problem. I monitor swap and memory and nothing special happens beforehand. It's very sudden, as I can see on the graphs.

So not a leak but more like a fork-bomb.

It's completely random, but not once a day. More like 3 times a week, then nothing for a month.

No idea of what it could be.

My log:

Mar 23 16:56:41 localhost kernel: Out of memory: kill process 16678 (apache2) score 17006 or a child
Mar 23 16:56:41 localhost kernel: Killed process 8674 (apache2)
Mar 23 16:56:41 localhost kernel: fail2ban-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Mar 23 16:56:41 localhost kernel: fail2ban-server cpuset=/ mems_allowed=0
Mar 23 16:56:41 localhost kernel: Pid: 22547, comm: fail2ban-server Not tainted 2.6.32.2-xxxx-std-ipv4-32 #1
Mar 23 16:56:41 localhost kernel: Call Trace:
Mar 23 16:56:41 localhost kernel: [] oom_kill_process+0xa0/0x2b0
Mar 23 16:56:41 localhost kernel: [] ? select_bad_process+0xab/0xe0
Mar 23 16:56:41 localhost kernel: [] __out_of_memory+0x4e/0xb0
Mar 23 16:56:41 localhost kernel: [] out_of_memory+0x52/0xa0
Mar 23 16:56:41 localhost kernel: [] __alloc_pages_nodemask+0x527/0x540
Mar 23 16:56:41 localhost kernel: [] __do_page_cache_readahead+0xd2/0x1d0
Mar 23 16:56:41 localhost kernel: [] ra_submit+0x28/0x40
Mar 23 16:56:41 localhost kernel: [] filemap_fault+0x3b0/0x3c0
Mar 23 16:56:41 localhost kernel: [] __do_fault+0x4c/0x460
Mar 23 16:56:41 localhost kernel: [] ? filemap_fault+0x0/0x3c0
Mar 23 16:56:41 localhost kernel: [] handle_mm_fault+0x13c/0x7f0
Mar 23 16:56:41 localhost kernel: [] ? finish_task_switch+0x3a/0xb0
Mar 23 16:56:41 localhost kernel: [] ? ktime_get_ts+0xed/0x110
Mar 23 16:56:41 localhost kernel: [] ? poll_select_copy_remaining+0xc7/0x110
Mar 23 16:56:41 localhost kernel: [] do_page_fault+0x121/0x300
Mar 23 16:56:41 localhost kernel: [] ? sys_select+0x3d/0xb0
Mar 23 16:56:41 localhost kernel: [] ? do_page_fault+0x0/0x300
Mar 23 16:56:41 localhost kernel: [] error_code+0x66/0x6c
Mar 23 16:56:41 localhost kernel: [] ? do_page_fault+0x0/0x300
Mar 23 16:56:41 localhost kernel: Mem-Info:
Mar 23 16:56:41 localhost kernel: DMA per-cpu:
Mar 23 16:56:41 localhost kernel: CPU 0: hi: 0, btch: 1 usd: 0
Mar 23 16:56:41 localhost kernel: CPU 1: hi: 0, btch: 1 usd: 0
Mar 23 16:56:41 localhost kernel: Normal per-cpu:
Mar 23 16:56:41 localhost kernel: CPU 0: hi: 186, btch: 31 usd: 61
Mar 23 16:56:41 localhost kernel: CPU 1: hi: 186, btch: 31 usd: 87
Mar 23 16:56:41 localhost kernel: HighMem per-cpu:
Mar 23 16:56:41 localhost kernel: CPU 0: hi: 42, btch: 7 usd: 27
Mar 23 16:56:41 localhost kernel: CPU 1: hi: 42, btch: 7 usd: 13
Mar 23 16:56:41 localhost kernel: active_anon:113730 inactive_anon:113864 isolated_anon:0
Mar 23 16:56:41 localhost kernel: active_file:555 inactive_file:749 isolated_file:0
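To see which processes the OOM killer keeps picking, the "Killed process" lines can be tallied per process name. This is a rough sketch only: the function name is made up, the default log path /var/log/syslog is an assumption about where kernel messages end up, and the pattern is based on the message format in the log above.

```shell
# Count OOM-killer victims by process name in a syslog-style file.
# Relies on lines of the form "kernel: Killed process <pid> (<name>)".
oom_victims() {
    grep 'kernel: Killed process' "${1:-/var/log/syslog}" \
        | sed 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/' \
        | sort | uniq -c | sort -rn
}
```

Calling `oom_victims /path/to/logfile` prints one line per process name with the number of kills in front, most frequent first.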

**ploum** · April 19th, 2010

It looks like I was simply targeted by some slowloris bots.

Since installing libapache2-mod-antiloris, I haven't had to reboot anymore.

**jhetrick62** · June 4th, 2010

Looks to me like you are running zoneminder, is that possible? I'm having the same issues, so the zoneminder code may be the cause, though I haven't found the problem yet.

Jeff

**tgalati4** · June 4th, 2010

Yeah, zoneminder is not exactly enterprise-class code. If you are running zoneminder, then move it to another machine or stop it for a month. If a camera stream gets interrupted, zoneminder (or the video modules) don't exit gracefully, and you get kernel panics!