
Thread: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sites)

  1. #1
    Join Date
    Aug 2010
    Beans
    407
    Distro
    Ubuntu 10.04 Lucid Lynx

    Easiest way to meld multiple large huge hosts files (block unwanted parasitic sites)

    Can you suggest improvements to this process of maintaining a large hosts file?

    When I find an objectionable site (advertisement, redirect, flash content, etc.) I simply add the domain to my hosts file and never have to see that content again.

    But every few months, I find a few more unwanted parasites slipping through, so I pick up a new (huge) hosts file, from, say:
    Code:
    http://winhelp2002.mvps.org/hosts.htm
    What I do is save the hosts.txt file to /tmp/hosts.txt and then strip out all but the redirects (remove double spaces, remove comments, & remove the localhost line)
    Code:
    http://winhelp2002.mvps.org/hosts.txt
    grep -v \# hosts.txt | sed -e 's/  / /g' | grep -v 127.0.0.1 localhost\$ | sort -u >> /etc/hosts
    
    NOTE: The sed is removing redundant spaces so it's consistent with my existing hosts file.
    My first problem is getting the syntax right for the removal of the localhost line (I actually have to delete that one line manually because some of the valid lines to keep also have localhost in the domain name).
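
One possible fix, as a minimal sketch: in the `grep -v 127.0.0.1 localhost\$` attempt above the pattern is split into two arguments, so grep reads `localhost$` as a file name. Quoting the whole pattern as one argument and anchoring it at both ends removes only the line whose hostname is exactly `localhost`. The function name `clean_hosts` and the sample hostnames are made up for illustration, and the sketch assumes the blocked entries all use the 127.0.0.1 address.

```shell
# Sketch only: clean a downloaded hosts file read from stdin.
clean_hosts() {
    # 1. drop carriage returns
    # 2. strip comments, squeeze runs of blanks to one space, trim a trailing space
    # 3. drop empty lines and the line that is exactly "127.0.0.1 localhost"
    # 4. sort and remove duplicates
    tr -d '\r' \
    | sed -e 's/#.*$//' -e 's/[[:space:]][[:space:]]*/ /g' -e 's/ $//' \
    | grep -v -e '^$' -e '^127\.0\.0\.1 localhost$' \
    | sort -u
}

# Hypothetical sample input: the second entry contains "localhost" but must be kept.
printf '%s\n' \
    '# comment line' \
    '127.0.0.1  localhost' \
    '127.0.0.1  ads.localhost.example  # keep me' \
    '127.0.0.1  ads.example.com' \
    '127.0.0.1  ads.example.com' \
| clean_hosts
```

The cleaned output could then be appended to the hosts file in place of the original pipeline's output.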

    Then, I try to cull out duplicates, using:
    Code:
    sort -u /etc/hosts -o /etc/hosts
    Unfortunately, that process moves the localhost and other top-level lines to a lower level which I then have to manually bring back to the top to keep a semblance of the original hosts order.

    Code:
    127.0.0.1 machine localhost.localdomain localhost
    127.0.1.1 machine
    # The following lines are desirable for IPv6 capable hosts
    ::1     localhost ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    ff02::3 ip6-allhosts
    # begin parasitic blocks
    This process works - but it could use an improvement.

    Any ideas?
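
One way to keep the top-level lines in place, sketched under the assumption that the hosts file contains the `# begin parasitic blocks` marker line shown above: copy everything up to the marker unchanged, and sort only what follows it. The function name `dedupe_below_marker` is hypothetical, and the demonstration runs on a temporary file rather than on /etc/hosts.

```shell
# Sketch: de-duplicate only the lines below the marker, leaving the header untouched.
dedupe_below_marker() {
    file=$1
    marker='# begin parasitic blocks'
    tmp=$(mktemp) || return 1
    {
        # header: everything up to and including the marker, in original order
        sed "/^$marker\$/q" "$file"
        # body: everything after the marker, sorted with duplicates removed
        sed "1,/^$marker\$/d" "$file" | sort -u
    } > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstration on a throwaway copy with made-up entries.
demo=$(mktemp)
printf '%s\n' \
    '127.0.0.1 machine localhost.localdomain localhost' \
    '::1     localhost ip6-localhost ip6-loopback' \
    '# begin parasitic blocks' \
    '127.0.0.1 b.example.com' \
    '127.0.0.1 a.example.com' \
    '127.0.0.1 b.example.com' > "$demo"
dedupe_below_marker "$demo"
cat "$demo"
rm -f "$demo"
```

If the marker is missing, the first sed prints the whole file and the second prints nothing, so the file is left as it was.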

  2. #2
    Join Date
    Oct 2005
    Location
    Lab, Slovakia
    Beans
    10,790

    Re: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sit

    The easiest way is to give up and use DynDNS or OpenDNS instead of trying to do it yourself.

  3. #3
    Join Date
    Aug 2010
    Beans
    407
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sit

    Originally posted by HermanAB:
    The easiest way is to give up and use DynDNS or OpenDNS instead of trying to do it yourself.
    I guess I asked the wrong question ... because I was just hoping to find an easier way to merge multiple huge files culling out the duplicates ...

    But since both suggestions were unknown to me, I looked up DynDNS and OpenDNS on Wikipedia and found:

    • According to Wikipedia, DynDNS apparently provides two free services
      • It regularly changes your apparent IP address
      • It also seems to have a poorly defined Internet content filtering service
    • According to Wikipedia, OpenDNS apparently provides ad-supported services
      • It seems to do free content filtering but it's supported by advertisements
      • These ads seem to show up in strange places (when undecipherable URLs are entered???)

    The whole point is to easily eliminate millions of advertisements with a single hosts file, so there doesn't appear to be much sense in further exploring OpenDNS (because it seems to be supported by advertisements, which is exactly what I'm trying to eliminate!).

    But the DynDNS service does seem intriguing at first glance. I actually like BOTH things it does (i.e., change your IP address regularly and perform some type of content filtering, whatever that means to it).

    I'll explore DynDNS further - but what I was 'really' looking for were simple UNIX commands to cull out duplicates from huge merged files.

  4. #4
    Join Date
    Sep 2011
    Location
    London
    Beans
    384

    Re: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sit

    simple UNIX commands to cull out duplicates from huge merged files.
    Hi Rocksockdoc

    I'll confess I'd not seen this technique before: what a great way to deal with some of the spam!

    How about this to address your management problems:

    Make a file /etc/hosts.real with your real things, and a directory /etc/hosts.d to contain the files you get from others. Just put the files you download in there, and any that you maintain yourself.

    The script below then creates a new hosts file for you in /etc which looks like
    Code:
    # hosts file made Mon Nov 21 16:18:44 GMT 2011
    # contents of /tmp/hosts.real
    127.0.0.1    localhost
    127.0.1.1    laptop
    192.168.0.32    myserver
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     localhost ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    ff02::3 ip6-allhosts
    # contents of /tmp/hosts.d/*
    #  total 1200
    #  -rw-rw-r-- 1 jonathan jonathan 612606 2011-11-21 12:37 hosts2.txt
    #  -rw-rw-r-- 1 jonathan jonathan 612606 2011-10-13 18:49 hosts.txt
    127.0.0.1                             ads.7days.ae
    127.0.0.1                             ads.angop.ao
    127.0.0.1              smartad.mercadolibre.com.ar
    [14,000 more...]
    Is that what you were looking for? The script is unnecessarily elegant regarding the sorting (so that bad.spam.com is next to awful.spam.com, not badman.com), but perhaps you'll enjoy that!
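
The sorting idea can be tried on its own, as a rough sketch: reverse the dot-separated labels of each name, sort, then reverse them back. The function names `reverse_labels` and `sort_by_domain` are made up here, and `LC_ALL=C` is an added assumption to make the sort order independent of the locale.

```shell
# Reverse the dot-separated labels of each hostname on stdin
# (www.example.com -> com.example.www).
reverse_labels() {
    awk -F. '{ out = $NF; for (i = NF - 1; i > 0; --i) out = out "." $i; print out }'
}

# Sort hostnames by domain, comparing from the right-hand label inwards,
# and drop duplicates along the way.
sort_by_domain() {
    reverse_labels | LC_ALL=C sort -u | reverse_labels
}

printf '%s\n' bad.spam.com badman.com awful.spam.com ads.example.org \
| sort_by_domain
```

With this input, awful.spam.com and bad.spam.com come out next to each other, after badman.com and before ads.example.org.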

    You could have a cron job run wget to pick up http://winhelp2002.mvps.org/hosts.txt periodically and rebuild your /etc/hosts, or just run it manually when you like. As listed here, it works on /tmp/hosts.d and /tmp/hosts.real and produces /tmp/hosts. To do it for real, change the dir=/tmp line (the red bit in the original post) to dir=/etc; that will probably require root permission to overwrite /etc/hosts and create /etc/hosts.new.
    Code:
    #!/bin/sh
    
    set -e
    dir=/tmp
    
    echo "# hosts file made `date`" > $dir/hosts.new
    echo "# contents of $dir/hosts.real" >> $dir/hosts.new
    
    cat $dir/hosts.real >> $dir/hosts.new
    
    echo "# contents of $dir/hosts.d/*" >> $dir/hosts.new
    ls -l $dir/hosts.d | awk '{print "# ", $0;}' >> $dir/hosts.new
    
    # tr to lowercase
    # remove any cr in the files
    # delete comments and trailing spaces, keep only the hostname field
    # lose line which is 'localhost'
    # print domain backwards (www.example.com -> com.example.www)
    # lose extra trailing dot
    # sort and uniq
    # backwards back to normal order (www.example.com)
    # lose extra trailing dot
    # print with address, right justified
    # into the file
    #
    cat $dir/hosts.d/* \
    | tr A-Z a-z \
    | tr -d '\r' \
    | awk '{gsub("#.*$", ""); gsub(" *$",""); if (NF==2) {print $2;}}' \
    | grep -v '^localhost$' \
    | awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
    | sed 's/\.$//' \
    | sort | uniq  \
    | awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
    | sed 's/\.$//' \
    | awk '{printf("127.0.0.1 %40s\n", $1);}' \
    >> $dir/hosts.new
    
    mv $dir/hosts.new $dir/hosts
    
    # end
    Hope that's helpful.

    Kind regards,
    Jonathan.
    Last edited by Jonathan L; November 21st, 2011 at 06:07 PM. Reason: Adverb police

  5. #5
    Join Date
    Apr 2006
    Location
    Montana
    Beans
    Hidden!
    Distro
    Kubuntu Development Release

    Re: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sit

    Originally posted by Jonathan L:
    Hi Rocksockdoc

    I'll confess I'd not seen this technique before: what a great way to deal with some of the spam!

    How about this to address your management problems:

    Make a file /etc/hosts.real with your real things, and a directory /etc/hosts.d to contain the files you get from others. Just put the files you download in there, and any that you maintain yourself.

    The script below then creates a new hosts file for you in /etc which looks like
    Code:
    | sort | uniq  \
    Hope that's helpful.

    Kind regards,
    Jonathan.
    I did not critique your entire script, but there is no need to pipe sort to uniq; just use sort -u. See man sort for details.

    I suspect you could simplify the script further as there is little need to pipe cat to awk either. Depending on the format of your source hosts file you can probably do all this with a one liner.
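
A quick way to check the equivalence, as a small sketch with made-up hostnames:

```shell
# Three sample lines, one of them duplicated.
sample=$(printf '%s\n' b.example.com a.example.com b.example.com)

# The two-process form and the single-process form.
with_uniq=$(printf '%s\n' "$sample" | sort | uniq)
with_u=$(printf '%s\n' "$sample" | sort -u)

# Compare the two results.
if [ "$with_uniq" = "$with_u" ]; then
    echo "same result"
else
    echo "results differ"
fi
```
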
    There are two mistakes one can make along the road to truth...not going all the way, and not starting.
    --Prince Gautama Siddharta

    #ubuntuforums web interface

  6. #6
    Join Date
    Sep 2011
    Location
    London
    Beans
    384

    Re: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sit

    Thanks for your comments Bodhi:

    but no need to X / could simplify the script / you can probably do all this with a one liner.
    Indeed, of course. A script with the same overall effect (but different sort order and formatting) is:
    Code:
    dir=/tmp
    
    (cat "$dir/hosts.real"
    awk '{gsub("\r",""); gsub("#.*$", ""); gsub(" *$",""); host=tolower($2); if ((NF==2)&&(host!="localhost")) {print "127.0.0.1", host;}}' "$dir"/hosts.d/* | sort -u) > "$dir/hosts.new" && mv "$dir/hosts.new" "$dir/hosts"
    Personally I find it much better to have a style which is more modular and easier to debug than one which is more "efficient" for the computer. In particular, the "cat-at-top" style allows you to insert
    Code:
    | tee tmpfile \
    in between any two lines, or add 'tr' somewhere or another sort or another awk. (As I'm sure some readers know and some don't: awk will take any number of files, tr doesn't take filenames, and some versions of sort don't allow -u.) If you use this style of "big modular pipeline" you can build things up quickly, adding short one-liners to solve each part of the puzzle. In this particular case, some of the tr was added late, to fix the spurious carriage returns, and it's irritating to have to move the filenames around.
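
As an illustration of the tee technique, here is a small sketch that taps a pipeline at two points. The temporary directory, the snapshot file names, and the sample input are all assumptions made for the example.

```shell
# Scratch directory for the intermediate snapshots.
workdir=$(mktemp -d)

# Two sample lines with DOS line endings that differ only in case.
printf '127.0.0.1 Ads.Example.com\r\n127.0.0.1 ads.example.com\r\n' \
| tr A-Z a-z \
| tee "$workdir/after-lowercase.txt" \
| tr -d '\r' \
| tee "$workdir/after-cr-strip.txt" \
| sort -u

# Inspect a snapshot byte by byte to see whether the carriage returns are still there.
od -c "$workdir/after-lowercase.txt"
```

Each tee line can be deleted again once that stage is known to work, without touching the stages around it.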

    And again, I was trying to make something intelligible for Rocksockdoc to get some ideas from.

    Just a matter of what I've found works well over the years: but to each their own.

    Kind regards
    Jonathan.
    Last edited by Jonathan L; November 22nd, 2011 at 10:00 AM.

  7. #7
    Join Date
    Apr 2006
    Location
    Montana
    Beans
    Hidden!
    Distro
    Kubuntu Development Release

    Re: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sit

    Nice one liner, and thank you for your explanations, they are nice

    I like the idea of being modular, and I was not trying to be critical, sorry if I came across that way.

    I used to write:

    cat | grep
    sort | uniq

    awk is similarly a powerful tool.

  8. #8
    Join Date
    Jun 2010
    Location
    ~
    Beans
    Hidden!

    Re: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sit

    Originally posted by Jonathan L:
    Code:
    cat $dir/hosts.d/* \
    | tr A-Z a-z \
    | tr -d '\r' \
    | awk '{gsub("#.*$", ""); gsub(" *$",""); if (NF==2) {print $2;}}' \
    | grep -v '^localhost$' \
    | awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
    | sed 's/\.$//' \
    | sort | uniq  \
    | awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
    | sed 's/\.$//' \
    | awk '{printf("127.0.0.1 %40s\n", $1);}' \
    >> $dir/hosts.new
    try this addition, it sorts keeping like-domains together
    Code:
    | tr A-Z a-z \
    | tr -d '\r' \
    | awk '{gsub("#.*$", ""); gsub(" *$",""); if (NF==2) {print $2;}}' \
    | grep -v '^localhost$' \
    | awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
    | sed 's/\.$//' \
    | awk -F\. '{ORS=""; for(I=NF; I>=1;I--){print $I; if(I!=1){print "."}}; print "\n"}' \
    | sort | uniq  \
    | awk -F\. '{ORS=""; for(I=NF; I>=1;I--){print $I; if(I!=1){print "."}}; print "\n"}' \
    | awk -F. '{for(i=NF;i>0;--i){printf("%s.",$i);}{printf("\n");}}' \
    | sed 's/\.$//' \
    | awk '{printf("127.0.0.1 %40s\n", $1);}' \
    Disregard this code... blah
    Last edited by emiller12345; November 23rd, 2011 at 04:47 PM. Reason: its a mistake
    CADWEB Advance Toolkit Utility: http://cad.webatu.com/
    Homesite: http://digitalmagican.comze.com/

  9. #9
    Join Date
    Sep 2011
    Location
    London
    Beans
    384

    Re: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sit

    Hi Emiller

    Thanks for your suggestion, but I think you'll find that the original script already sorted the domains from the right (ie, all the com together, all the uk together etc), which is the purpose (as noted in the comments) of the two awks around the sort; your additions nullify that effect.

    For what it's worth: I recommend printf over print and echo in various languages. The main reason for this is portability and flexibility, for the price of a bit of learning. As C, perl, php, awk, java and many other languages have printf, you can learn it once and use it in lots of places.
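
In the shell, the portability point looks roughly like this. It is a sketch with a made-up hostname, not a full comparison: the behaviour of echo with arguments such as -n or with backslashes differs between shells, while printf with an explicit format string prints its data as given.

```shell
host='ads.example.com'

# A fixed format string: the newline appears only where the format says so.
printf '127.0.0.1 %s\n' "$host"

# The same idea as the script's last awk stage: right-justify the name in 40 columns.
printf '127.0.0.1 %40s\n' "$host"

# Data that starts with a dash or contains a backslash is passed through %s unchanged.
printf '%s\n' '-n' 'back\slash'
```

The format strings used here carry over largely unchanged to awk's printf, which is what the script above relies on.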

    Rocksockdoc: hope you're finding what you need in this thread.

    Kind regards,
    Jonathan.

  10. #10
    Join Date
    Jun 2010
    Location
    ~
    Beans
    Hidden!

    Re: Easiest way to meld multiple large huge hosts files (block unwanted parasitic sit

    Originally posted by Jonathan L:
    Hi Emiller

    Thanks for your suggestion, but I think you'll find that the original script already sorted the domains from the right (ie, all the com together, all the uk together etc), which is the purpose (as noted in the comments) of the two awks around the sort; your additions nullify that effect.

    For what it's worth: I recommend printf over print and echo in various languages. The main reason for this is portability and flexibility, for the price of a bit of learning. As C, perl, php, awk, java and many other languages have printf, you can learn it once and use it in lots of places.

    Rocksockdoc: hope you're finding what you need in this thread.

    Kind regards,
    Jonathan.
    Ah yes, I see now. I don't know how I missed that before. As I was testing your script, I thought it was sorting the URLs from left to right (a normal sort). My apologies.
