[all variants] I give you bondage_a_gogo - network bonding monitoring and reporting



bitmonkey
October 10th, 2010, 09:09 PM
I recently set up network bonding on my file server and wanted the system to send an email if one of the links went down - there doesn't seem to be a lot of point in having redundancy if it can fail without my knowing about it. I couldn't find anything to monitor the kernel bonding driver, so I've written a little script. It's a bit of a hack, but it works. If anyone here knows a bit more than I do about how to daemonise, standardise and otherwise get it into a fit state to be released as a distro package, I'd appreciate a pointer to a beginner's guide to such things.

So, without further ado, Ladies and Gentlemen, I give you:
BONDAGE A GOGO:

#!/bin/bash

# bondage_a_gogo script by Paul Bradley (paul@paul-bradley.com)
# Version 0.1 - October 2010
# Run as a cron job to check the status of NIC bonding and send notifications if a problem is found
# NB this script assumes miimon monitoring is being used (not arp_interval).
# If you are using arp_interval it's probably not a big thing to hack that in.
# Cron this script to run as often as you want to have your bond checked - I run it once per minute.
# Just cron up more than one instance with different bond_interface settings if you have multiple bonds to monitor.
# NB NO WARRANTIES GIVEN OR IMPLIED - USE AT YOUR OWN RISK

# User settable parameters:
bond_interface="bond0" # the name of the bond to monitor
notification_email_address="you@yourdomain.com" # address to send alerts to when a bond is found degraded or offline
repeat_notification_interval=24 # interval in hours between repeat notifications if a bond stays degraded
pidfile="/var/run/bondage_a_gogo.$bond_interface.pid"
logfile="/var/log/bondage_a_gogo.log"

#####################################################################################################################

DATE="$(date)"
if [ ! -f $logfile ]
then
echo "$DATE - Starting up. B~O~N~D~A~G~E ~A~ G~O~G~O !" > $logfile
fi

# We establish the numbers of links up and down (correcting for the extra reported for the overall bond)
status_file="/proc/net/bonding/$bond_interface"
links_up=$(grep -c "MII Status: up" "$status_file")
links_down=$(grep -c "MII Status: down" "$status_file")
# The first "MII Status" line describes the bond itself, so subtract it from whichever count includes it
if [ "$links_up" -eq 0 ]
then
let "links_down = links_down - 1"
else
let "links_up = links_up - 1"
fi
# Set the total in both cases so the log and email messages never see an empty value
let "links_total = links_up + links_down"


# Maybe everything is OK?
if test -z "$(cat /proc/net/bonding/$bond_interface | grep "MII Status: down")"
then
if [ -f $pidfile ]
then
rm -f "$pidfile"
echo "$DATE - Found $bond_interface OK with $links_up of $links_total links strapped together. FULL BONDAGE MODE IS IN EFFECT" >>$logfile
exit 0
fi
exit 0
fi


# Then we test for a new failure in case the bond is down and there is no pidfile yet
if test -n "$(cat /proc/net/bonding/$bond_interface | grep "MII Status: down")"
then
if ! [ -f $pidfile ]
then
if [ $links_down = "1" ]
then
if [ $links_up != "1" ]
then
echo -e "\n\nBONDAGE MALFUNCTION!\n\n$links_down of the interfaces in the $bond_interface group on $HOSTNAME is offline\n$links_up links remain active on this bond\n\n$(cat /proc/net/bonding/$bond_interface)\n\n\nEmail generated by the $0 script running on $HOSTNAME at $DATE" |mail -s "BONDAGE MALFUNCTION! - $HOSTNAME" $notification_email_address
touch $pidfile
fi
fi

if [ $links_up = "1" ]
then
if [ $links_down != "1" ]
then
echo -e "\n\nBONDAGE MALFUNCTION!\n\n$links_down of the interfaces in the $bond_interface group on $HOSTNAME are offline\n$links_up link remains active on this bond\n\n$(cat /proc/net/bonding/$bond_interface)\n\n\nEmail generated by the $0 script running on $HOSTNAME at $DATE" |mail -s "BONDAGE MALFUNCTION! - $HOSTNAME" $notification_email_address
touch $pidfile
else
echo -e "\n\nBONDAGE MALFUNCTION!\n\n$links_down of the interfaces in the $bond_interface group on $HOSTNAME is offline\n$links_up link remains active on this bond\n\n$(cat /proc/net/bonding/$bond_interface)\n\n\nEmail generated by the $0 script running on $HOSTNAME at $DATE" |mail -s "BONDAGE MALFUNCTION! - $HOSTNAME" $notification_email_address
touch $pidfile
fi
fi

if [ $links_up != "1" ] && [ $links_down != "1" ]
then
echo -e "\n\nBONDAGE MALFUNCTION!\n\n$links_down of the interfaces in the $bond_interface group on $HOSTNAME are offline\n$links_up links remain active on this bond\n\n$(cat /proc/net/bonding/$bond_interface)\n\n\nEmail generated by the $0 script running on $HOSTNAME at $DATE" |mail -s "BONDAGE MALFUNCTION! - $HOSTNAME" $notification_email_address
touch $pidfile
fi
echo "$DATE - BONDAGE ALERT - $links_down of $links_total links in the $bond_interface bondage group were found down. BONDAGE FAILURE">>$logfile
exit 0
fi
fi



# Then we check to see if there is a pidfile older than repeat_notification_interval hours and if so send repeat notification
if [ -f $pidfile ]
then
if [ `stat --format=%Y $pidfile` -le $(( `date +%s` - (3600*$repeat_notification_interval) )) ]; # tests if the pidfile is old enough that we need to send a repeat notification
then
if [ $links_down = "1" ]
then
if [ $links_up = "1" ]
then
echo -e "\n\nBONDAGE MALFUNCTION!\n\nRepeat notification from $0 on $HOSTNAME\n\n$links_down of the interfaces in the $bond_interface group on $HOSTNAME is still offline\n$links_up link remains active on this bond\n\n\n $(cat /proc/net/bonding/$bond_interface)\n\n\nEmail generated by the $0 script running on $HOSTNAME at $DATE" |mail -s "BONDAGE MALFUNCTION! - $HOSTNAME - REPEAT NOTIFICATION" $notification_email_address
echo -e "$0 sent an email to $notification_email_address at $DATE" >> $pidfile
else
echo -e "\n\nBONDAGE MALFUNCTION!\n\nRepeat notification from $0 on $HOSTNAME\n\n$links_down of the interfaces in the $bond_interface group on $HOSTNAME is still offline\n$links_up links remain active on this bond\n\n\n $(cat /proc/net/bonding/$bond_interface)\n\n\nEmail generated by the $0 script running on $HOSTNAME at $DATE" |mail -s "BONDAGE MALFUNCTION - $HOSTNAME - REPEAT NOTIFICATION" $notification_email_address
touch $pidfile
fi
fi

if [ $links_up = "1" ]
then
if [ $links_down != "1" ]
then
echo -e "\n\nBONDAGE MALFUNCTION!\n\nRepeat notification from $0 on $HOSTNAME\n\n$links_down of the interfaces in the $bond_interface group on $HOSTNAME are still offline\n$links_up link remains active on this bond\n\n\n $(cat /proc/net/bonding/$bond_interface)\n\n\nEmail generated by the $0 script running on $HOSTNAME at $DATE" |mail -s "BONDAGE MALFUNCTION - $HOSTNAME - REPEAT NOTIFICATION" $notification_email_address
touch $pidfile
fi
fi

if [ $links_up != "1" ] && [ $links_down != "1" ]
then
echo -e "\n\nBONDAGE MALFUNCTION!\n\nRepeat notification from $0 on $HOSTNAME\n\n$links_down of the interfaces in the $bond_interface group on $HOSTNAME are still offline\n$links_up links remain active on this bond\n\n\n $(cat /proc/net/bonding/$bond_interface)\n\n\nEmail generated by the $0 script running on $HOSTNAME at $DATE" |mail -s "BONDAGE MALFUNCTION - $HOSTNAME - REPEAT NOTIFICATION" $notification_email_address
touch $pidfile
fi
echo "$DATE - BONDAGE ALERT - $bond_interface still degraded with $links_down of $links_total unstrapped - Repeat email alert sent - BONDAGE FAILURE">>$logfile
fi
fi
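
For anyone who wants to tinker: the script counts every "MII Status" line and then subtracts one for the line that describes the bond itself. A possible alternative is to count only the status lines that follow a "Slave Interface:" line. The sketch below is untested against real hardware and assumes the usual /proc/net/bonding layout; the function name count_bond_links and the sample text are made up for illustration and are not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Reads bonding status text on stdin and prints "<up> <down>" for slave links only.
count_bond_links() {
    awk '
        /^Slave Interface:/ { in_slave = 1; next }   # a per-slave section begins
        in_slave && /^MII Status:/ {
            if ($3 == "up") up++; else down++        # anything other than "up" counts as down
            in_slave = 0                             # only one status line per slave
        }
        END { printf "%d %d\n", up, down }
    '
}

# Made-up sample resembling /proc/net/bonding/bond0 (assumed layout)
sample='Bonding Mode: fault-tolerance (active-backup)
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down'

read -r links_up links_down <<< "$(printf '%s\n' "$sample" | count_bond_links)"
echo "up=$links_up down=$links_down total=$((links_up + links_down))"
```

In the real script the function would be fed with something like count_bond_links < /proc/net/bonding/$bond_interface instead of the sample text, and the bond-level status line would no longer need correcting for.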