August 13th, 2009, 10:31 AM
Good day!
I have a problem here: I have a big Squid blacklist file in which some lines partly match others. For instance:

.addition.somebadsite.com

Is it possible to parse this file with sed to find lines with similarities and delete the longer ones? Keep in mind that I don't know the pattern in advance.
Thank you in advance!

August 13th, 2009, 01:56 PM
I don't know how to do this with just sed, but here is a solution in Python:

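#!/usr/bin/env python
# Load every entry of the blacklist into memory.
sites=[site.strip() for site in open('blacklist','r').read().split()]
longer=set()
for pattern in open('blacklist','r'):
    pattern=pattern.strip()
    if not pattern:
        continue  # an empty line would match every site
    for site in sites:
        if pattern!=site:
            idx=site.find(pattern)
            if idx>-1:
                # pattern is in site, so site is the longer line
                longer.add(site)
result=[site for site in sites if site not in longer]
print('\n'.join(result))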

Save it to ~/bin/blacklist_filter.py
Make it executable:

chmod +x ~/bin/blacklist_filter.py

Run it in the directory that contains the file named blacklist, like this (this assumes ~/bin is on your PATH):

blacklist_filter.py > new-blacklist

Note that this program requires enough memory to load the entire blacklist twice.
If that is too onerous, I could modify the program to write partial results to disk.
This could save on memory, at the expense of more disk I/O.
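
If memory or run time does become a problem, another option is to avoid comparing every line with every other line. The sketch below is only one possible variation, not tested against a real Squid list. It assumes Python 3.7 or later and that entries are domain names in the leading-dot form shown in the question. The function name prune_blacklist and the sample domains in the comments are made up for illustration. It differs from the script above in two ways: it only treats a line as "longer" when a listed entry is a whole-label suffix of it (so a match in the middle of a name does not count), and it drops exact duplicate lines.

```python
def prune_blacklist(lines):
    """Return the entries whose parent domain is not itself listed.

    Assumes one domain per line, with parents written in leading-dot
    form (for example a hypothetical ".somebadsite.com").
    """
    # Strip whitespace, skip blank lines, and drop exact duplicates
    # while keeping the original order (dicts preserve insertion order).
    entries = list(dict.fromkeys(
        line.strip() for line in lines if line.strip()
    ))
    listed = set(entries)  # set membership tests are O(1) on average

    kept = []
    for entry in entries:
        labels = entry.lstrip('.').split('.')
        # Every shorter suffix of the entry, in leading-dot form:
        # "a.b.com" -> ".b.com", ".com"
        parents = ('.' + '.'.join(labels[i:]) for i in range(1, len(labels)))
        if not any(parent in listed for parent in parents):
            kept.append(entry)
    return kept
```

To run it on the file, something like print('\n'.join(prune_blacklist(open('blacklist')))) should do. Each line is checked against a set instead of against the whole list, so the work grows roughly with the number of lines rather than with its square.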

August 13th, 2009, 03:19 PM
Thank you, buddy! Beer's on me if you ever happen to visit Russia! :)