simple UNIX commands to cull out duplicates from huge merged files.
Hi Rocksockdoc
I'll confess I'd not seen this technique before: what a great way to deal with some of the spam!
How about this to address your management problems:
Make a file /etc/hosts.real containing your real entries (localhost, your own machines), and a directory /etc/hosts.d to hold the block lists. Just put the files you download from others in there, along with any lists that you maintain yourself.
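If you want to try that layout somewhere harmless first, something like this should do. It is only a sketch: setup_hosts_layout is a name I made up, and it assumes your current /etc/hosts is a sensible starting point for hosts.real. With no argument it works under /tmp.

```shell
#!/bin/sh
# Trial layout: <dir>/hosts.real for your own entries,
# <dir>/hosts.d for the downloaded block lists.
# setup_hosts_layout is a hypothetical helper name.
setup_hosts_layout() {
    dir=$1
    mkdir -p "$dir/hosts.d" || return 1
    # Seed hosts.real once; never overwrite one that already exists.
    if [ ! -e "$dir/hosts.real" ]; then
        if [ -r /etc/hosts ]; then
            cp /etc/hosts "$dir/hosts.real"
        else
            # Assumption: a bare localhost entry is an acceptable fallback.
            printf '127.0.0.1 localhost\n' > "$dir/hosts.real"
        fi
    fi
}

# Default to /tmp so nothing under /etc is touched.
setup_hosts_layout "${1:-/tmp}"
```

Afterwards drop the downloaded lists into the hosts.d directory it created.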
The script below then creates a new hosts file for you in /etc which looks like
Code:
# hosts file made Mon Nov 21 16:18:44 GMT 2011
# contents of /tmp/hosts.real
127.0.0.1 localhost
127.0.1.1 laptop
192.168.0.32 myserver
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# contents of /tmp/hosts.d/*
# total 1200
# -rw-rw-r-- 1 jonathan jonathan 612606 2011-11-21 12:37 hosts2.txt
# -rw-rw-r-- 1 jonathan jonathan 612606 2011-10-13 18:49 hosts.txt
127.0.0.1 ads.7days.ae
127.0.0.1 ads.angop.ao
127.0.0.1 smartad.mercadolibre.com.ar
[14,000 more...]
Is that what you were looking for? The script is unnecessarily elegant regarding the sorting: it sorts on the domain read from right to left, so that bad.spam.com ends up next to awful.spam.com rather than next to badman.com. Perhaps you'll enjoy that!
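If you want to play with the sorting idea on its own, here is a cut-down sketch of it. The function names reverse_labels and sort_by_domain are mine, the awk builds the reversed name without a trailing dot (so no sed is needed), and LC_ALL=C is an addition to make the order the same on every system. Treat it as an illustration rather than a replacement for the full script.

```shell
#!/bin/sh
# Reverse the dot-separated labels on each line:
# www.example.com -> com.example.www
reverse_labels() {
    awk -F. '{ out = $NF; for (i = NF - 1; i > 0; --i) out = out "." $i; print out }'
}

# Sort by domain read right to left and drop duplicate names.
# LC_ALL=C is an assumption made here for a reproducible order.
sort_by_domain() {
    reverse_labels | LC_ALL=C sort -u | reverse_labels
}

# Four names in, one of them a duplicate.
printf '%s\n' bad.spam.com badman.com awful.spam.com bad.spam.com | sort_by_domain
```

On that input it prints badman.com, awful.spam.com and bad.spam.com, one per line: the duplicate is gone and the two spam.com hosts sit together.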
You could have a cronjob do wget to pick up http://winhelp2002.mvps.org/hosts.txt periodically and rebuild your /etc/hosts, or just run it manually when you like. As listed here the script works on /tmp/hosts.d and /tmp/hosts.real and produces /tmp/hosts. Change the dir=/tmp line to dir=/etc to get it to do it for real (which will probably require root permission to create /etc/hosts.new and overwrite /etc/hosts).
Code:
#!/bin/sh
set -e
dir=/tmp
echo "# hosts file made `date`" > $dir/hosts.new
echo "# contents of $dir/hosts.real" >> $dir/hosts.new
cat $dir/hosts.real >> $dir/hosts.new
echo "# contents of $dir/hosts.d/*" >> $dir/hosts.new
ls -l $dir/hosts.d | awk '{print "# ", $0;}' >> $dir/hosts.new
# The pipeline below, stage by stage:
# tr to lowercase
# remove any cr in the files
# delete comments and trailing spaces, keep the name from "address name" lines
# lose the line which is 'localhost'
# print domain backwards (www.example.com -> com.example.www)
# lose extra trailing dot
# sort and uniq
# backwards back to normal order (www.example.com)
# lose extra trailing dot
# print with address, name right justified in 40 columns
# into the file
#
cat $dir/hosts.d/* \
| tr A-Z a-z \
| tr -d '\r' \
| awk '{gsub("#.*$", ""); gsub(" *$",""); if (NF==2) {print $2;}}' \
| grep -v '^localhost$' \
| awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
| sed 's/\.$//' \
| sort | uniq \
| awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
| sed 's/\.$//' \
| awk '{printf("127.0.0.1 %40s\n", $1);}' \
>> $dir/hosts.new
mv $dir/hosts.new $dir/hosts
# end
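For the periodic download, a minimal sketch might look like this. fetch_list is a hypothetical helper name, and the idea of keeping the old copy when the download fails or comes back empty is my assumption about what you would want, not something the script above does.

```shell
#!/bin/sh
# fetch_list URL DEST: download URL and install it as DEST only if the
# download succeeded and is not empty, so a failed fetch never clobbers
# the previous copy. fetch_list is a hypothetical helper name.
fetch_list() {
    url=$1
    dest=$2
    # Download outside hosts.d so a half-written file is never merged.
    tmp=$(mktemp "${TMPDIR:-/tmp}/hosts.XXXXXX") || return 1
    if wget -q -O "$tmp" "$url" && [ -s "$tmp" ]; then
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example call, commented out so nothing is downloaded by accident:
# fetch_list http://winhelp2002.mvps.org/hosts.txt /tmp/hosts.d/hosts.txt
```

Put the fetch_list call and then the rebuild script into one small wrapper and have cron run that wrapper, say once a week. The wrapper's name and location are up to you.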
Hope that's helpful.
Kind regards,
Jonathan.