Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
Can you suggest improvements to this process of maintaining a large hosts file?
When I find an objectionable site (advertisement, redirect, flash content, etc.) I simply add the domain to my hosts file and never have to see that content again.
But every few months, I find a few more unwanted parasites slipping through, so I pick up a new (huge) hosts file, from, say:
Code:
http://winhelp2002.mvps.org/hosts.htm
What I do is save the hosts.txt file to /tmp/hosts.txt and then strip out all but the redirects (remove double spaces, remove comments, & remove the localhost line)
Code:
http://winhelp2002.mvps.org/hosts.txt
grep -v '#' hosts.txt | sed -e 's/  */ /g' | grep -v '127.0.0.1 localhost$' | sort -u >> /etc/hosts
NOTE: The sed is removing redundant spaces so it's consistent with my existing hosts file.
My first problem is getting the syntax right for removing the localhost line (I actually have to delete that one line manually, because some of the valid lines to keep also have localhost in the domain name).
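For the record, here's the anchored pattern I've been experimenting with (just a sketch; the sample file below is made up for illustration). Anchoring at both ends keeps it from matching domains that merely contain "localhost":

```shell
# Build a tiny test input resembling the downloaded hosts.txt
printf '127.0.0.1 localhost\n127.0.0.1 ads.localhost-tracker.example\n' > /tmp/hosts.txt

# Anchor the pattern at both ends so only the bare localhost redirect
# is removed; domains merely containing "localhost" survive.
grep -v '^127\.0\.0\.1[[:space:]]*localhost$' /tmp/hosts.txt
```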
Then, I try to cull out duplicates, using:
Code:
sort -u /etc/hosts -o /etc/hosts
Unfortunately, that process moves the localhost and other header lines down into the sorted output, and I then have to manually move them back to the top to keep a semblance of the original hosts-file order.
Code:
127.0.0.1 machine localhost.localdomain localhost
127.0.1.1 machine
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# begin parasitic blocks
This process works - but it could use an improvement.
Any ideas?
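For reference, one marker-based approach I've been sketching (assuming the "# begin parasitic blocks" comment is always present) would sort only the lines below the marker and leave the real header untouched:

```shell
# Sketch only -- operates on a copy in /tmp, not the live /etc/hosts.
# Everything up to and including the marker stays as-is;
# everything after it is sorted uniquely.
hosts=/tmp/hosts.test
printf '%s\n' '127.0.0.1 machine localhost' '# begin parasitic blocks' \
    '127.0.0.1 zzz.example' '127.0.0.1 ads.example' '127.0.0.1 ads.example' > "$hosts"

sed -n '1,/^# begin parasitic blocks$/p' "$hosts" > /tmp/hosts.sorted
sed '1,/^# begin parasitic blocks$/d' "$hosts" | sort -u >> /tmp/hosts.sorted
cat /tmp/hosts.sorted
```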
Re: Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
The easiest way is to give up and use DynDNS or OpenDNS instead of trying to do it yourself.
Re: Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
Quote:
Originally Posted by
HermanAB
The easiest way is to give up and use DynDNS or OpenDNS instead of trying to do it yourself.
I guess I asked the wrong question ... because I was just hoping to find an easier way to merge multiple huge files while culling out the duplicates ...
But since both suggestions were unknown to me, I looked up DynDNS and OpenDNS on Wikipedia and found:
- According to Wikipedia, DynDNS apparently provides two free services
- It regularly changes your apparent IP address
- It also seems to have a loosely defined Internet content-filtering service
- According to Wikipedia, OpenDNS apparently provides ad-supported services
- It seems to do free content filtering, but it's supported by advertisements
- These ads seem to show up in strange places (apparently when unresolvable URLs are entered)
The whole point is to easily eliminate millions of advertisements with a single hosts file, so there doesn't seem to be much sense in exploring OpenDNS further (it's supported by the very advertisements I'm trying to eliminate!).
But the DynDNS service does seem intriguing at first glance. I actually like BOTH things it does (i.e., change your IP address regularly and perform some type of content filtering, whatever that means to it).
I'll explore DynDNS further - but what I was 'really' looking for were simple UNIX commands to cull out duplicates from huge merged files.
Re: Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
Quote:
simple UNIX commands to cull out duplicates from huge merged files.
Hi Rocksockdoc
I'll confess I'd not seen this technique before: what a great way to deal with some of the spam!
How about this to address your management problems:
Make a file /etc/hosts.real with your real things, and a directory /etc/hosts.d to contain the files you get from others. Just put the files you download in there, and any that you maintain yourself.
The script below then creates a new hosts file for you in /etc which looks like
Code:
# hosts file made Mon Nov 21 16:18:44 GMT 2011
# contents of /tmp/hosts.real
127.0.0.1 localhost
127.0.1.1 laptop
192.168.0.32 myserver
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# contents of /tmp/hosts.d/*
# total 1200
# -rw-rw-r-- 1 jonathan jonathan 612606 2011-11-21 12:37 hosts2.txt
# -rw-rw-r-- 1 jonathan jonathan 612606 2011-10-13 18:49 hosts.txt
127.0.0.1 ads.7days.ae
127.0.0.1 ads.angop.ao
127.0.0.1 smartad.mercadolibre.com.ar
[14,000 more...]
Is that what you were looking for? The script is unnecessarily elegant regarding the sorting (so that bad.spam.com is next to awful.spam.com, not badman.com), but perhaps you'll enjoy that!
You could have a cronjob do wget to pick up http://winhelp2002.mvps.org/hosts.txt periodically and rebuild your /etc/hosts, or just run it manually when you like. As listed here it works on /tmp/hosts.d and /tmp/hosts.real and produces /tmp/hosts: edit the dir=/tmp line to make it do it for real (which will probably require root permission to overwrite /etc/hosts and create /etc/hosts.new)
Code:
#!/bin/sh
set -e
dir=/tmp
echo "# hosts file made `date`" > $dir/hosts.new
echo "# contents of $dir/hosts.real" >> $dir/hosts.new
cat $dir/hosts.real >> $dir/hosts.new
echo "# contents of $dir/hosts.d/*" >> $dir/hosts.new
ls -l $dir/hosts.d | awk '{print "# ", $0;}' >> $dir/hosts.new
# tr to lowercase
# remove any cr in the files
# delete comments and trailing newlines
# lose line which is 'localhost'
# print domain backwards (www.example.com -> com.example.www)
# lose extra trailing dot
# sort and uniq
# backwards back to normal order (www.example.com)
# lose extra trailing dot
# print with address, right justified
# into the file
#
cat $dir/hosts.d/* \
| tr A-Z a-z \
| tr -d '\r' \
| awk '{gsub("#.*$", ""); gsub(" *$",""); if (NF==2) {print $2;}}' \
| grep -v '^localhost$' \
| awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
| sed 's/\.$//' \
| sort | uniq \
| awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
| sed 's/\.$//' \
| awk '{printf("127.0.0.1 %40s\n", $1);}' \
>> $dir/hosts.new
mv $dir/hosts.new $dir/hosts
# end
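To see the "sort from the right" trick in isolation (a toy demonstration, separate from the script itself): reversing the labels before sorting groups domains by TLD and registrant, so related spam hosts sit together.

```shell
# Reverse each domain's labels, sort, then reverse back:
# bad.spam.com and awful.spam.com end up adjacent, badman.com elsewhere.
printf '%s\n' badman.com bad.spam.com awful.spam.com \
  | awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);} printf("\n");}' \
  | sed 's/\.$//' \
  | sort \
  | awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);} printf("\n");}' \
  | sed 's/\.$//'
```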
Hope that's helpful.
Kind regards,
Jonathan.
Re: Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
Quote:
Originally Posted by
Jonathan L
Hi Rocksockdoc
I'll confess I'd not seen this technique before: what a great way to deal with some of the spam!
How about this to address your management problems:
Make a file /etc/hosts.real with your real things, and a directory /etc/hosts.d to contain the files you get from others. Just put the files you download in there, and any that you maintain yourself.
The script below then creates a new hosts file for you in /etc which looks like
Hope that's helpful.
Kind regards,
Jonathan.
I did not critique your entire script, but there is no need to pipe sort to uniq; just use sort -u. See man sort for details.
I suspect you could simplify the script further, as there is little need to pipe cat to awk either. Depending on the format of your source hosts files, you could probably do all this with a one-liner.
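A trivial check of that equivalence (nothing more than a sanity test):

```shell
# sort -u produces the same result as sort | uniq on plain text input.
printf 'b\na\nb\na\n' | sort | uniq > /tmp/sortuniq.out
printf 'b\na\nb\na\n' | sort -u      > /tmp/sortu.out
cmp /tmp/sortuniq.out /tmp/sortu.out && echo identical
```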
Re: Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
Thanks for your comments Bodhi:
Quote:
but no need to X / could simplify the script / you can probably do all this with a one liner.
Indeed, of course. A script with the same overall effect (but different sort order and formatting) is:
Code:
dir=/tmp
(cat $dir/hosts.real
awk '{gsub("\r",""); gsub("#.*$", ""); gsub(" *$",""); if ((NF==2)&&($2!="localhost")) {print "127.0.0.1", tolower($2);}}' $dir/hosts.d/* | sort -u) > $dir/hosts.new && mv $dir/hosts.new $dir/hosts
Personally I find a style which is modular and easy to debug much better than one which is more "efficient" for the computer. In particular, the "cat-at-top" style allows you to insert a new stage between any two lines, or add a tr somewhere, or another sort, or another awk. (As I'm sure some readers know and some don't: awk will take any number of files, tr doesn't take filenames, and some versions of sort don't allow -u.) If you use this style of "big modular pipeline" you can build things up quickly, adding short one-liners to solve each part of the puzzle. In this particular case, some of the tr was added late, to fix the spurious carriage returns, and it's irritating to have to move the filenames around.
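As a small illustration of that debugging convenience (the tee and the file name here are just for the example), any stage's intermediate output can be captured without disturbing the rest of the pipeline:

```shell
# Insert a tee between two stages to snapshot the stream mid-pipeline;
# the final output is unchanged, but /tmp/stage1.out holds the
# intermediate (lowercased, unsorted) stream for inspection.
printf 'B\na\nB\n' \
  | tr A-Z a-z \
  | tee /tmp/stage1.out \
  | sort -u
```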
And again, I was trying to make something intelligible for rocksockdoc to get some ideas from.
Just a matter of what I've found works well over the years: but to each their own.
Kind regards
Jonathan.
Re: Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
Nice one-liner, and thank you for your explanations.
I like the idea of being modular, and I was not trying to be critical; sorry if I came across that way.
I used to do the same sort of thing:
Code:
cat | grep
sort | uniq
awk is similarly a powerful tool ;)
Re: Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
Quote:
Originally Posted by
Jonathan L
Code:
cat $dir/hosts.d/* \
| tr A-Z a-z \
| tr -d '\r' \
| awk '{gsub("#.*$", ""); gsub(" *$",""); if (NF==2) {print $2;}}' \
| grep -v '^localhost$' \
| awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
| sed 's/\.$//' \
| sort | uniq \
| awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
| sed 's/\.$//' \
| awk '{printf("127.0.0.1 %40s\n", $1);}' \
>> $dir/hosts.new
Try this addition; it sorts while keeping like domains together
Code:
| tr A-Z a-z \
| tr -d '\r' \
| awk '{gsub("#.*$", ""); gsub(" *$",""); if (NF==2) {print $2;}}' \
| grep -v '^localhost$' \
| awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
| sed 's/\.$//' \
| awk -F\. '{ORS=""; for(I=NF; I>=1;I--){print $I; if(I!=1){print "."}}; print "\n"}' \
| sort | uniq \
| awk -F\. '{ORS=""; for(I=NF; I>=1;I--){print $I; if(I!=1){print "."}}; print "\n"}' \
| awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
| sed 's/\.$//' \
| awk '{printf("127.0.0.1 %40s\n", $1);}' \
Disregard this code... blah
Re: Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
Hi Emiller
Thanks for your suggestion, but I think you'll find that the original script already sorts the domains from the right (i.e., all the com together, all the uk together, etc.), which is the purpose (as noted in the comments) of the two awks around the sort; your additions nullify that effect.
For what it's worth: I recommend printf over print and echo in various languages. The main reason for this is portability and flexibility, for the price of a bit of learning. As C, perl, php, awk, java and many other languages have printf, you can learn it once and use it in lots of places.
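A small illustration of that portability (the same format string works unchanged in the shell's printf and in awk's):

```shell
# Identical format string, two languages: shell printf and awk printf
# both right-justify the domain in a 15-character field.
printf '127.0.0.1 %15s\n' ads.example.com
echo ads.example.com | awk '{printf("127.0.0.1 %15s\n", $1);}'
```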
Rocksockdoc: hope you're finding what you need in this thread.
Kind regards,
Jonathan.
Re: Easiest way to meld multiple huge hosts files (block unwanted parasitic sites)
Quote:
Originally Posted by
Jonathan L
Hi Emiller
Thanks for your suggestion, but I think you'll find that the original script already sorted the domains from the right (ie, all the com together, all the uk together etc), which is the purpose (as noted in the comments) of the two awks around the sort; your additions nullify that effect.
For what it's worth: I recommend printf over print and echo in various languages. The main reason for this is portability and flexibility, for the price of a bit of learning. As C, perl, php, awk, java and many other languages have printf, you can learn it once and use it in lots of places.
Rocksockdoc: hope you're finding what you need in this thread.
Kind regards,
Jonathan.
Ah yes, I see now. I don't know how I missed that before. As I was testing your script I thought it was sorting the URLs from left to right (a normal sort). My apologies.