Can sed do this?



PryGuy
August 13th, 2009, 10:31 AM
Good day!
I have a problem here: I have a big Squid blacklist file in which some lines partly match others. For instance:

.somebadsite.com
.somebadsite.com/with/additions

or

.somebadsite.com
.addition.somebadsite.com

Is it possible to parse this file with sed, find the lines that contain another line, and delete the longer ones? Keep in mind that I don't know the patterns in advance.
Thank you in advance!

unutbu
August 13th, 2009, 01:56 PM
I don't know how to do this with just sed, but here is a solution in Python:


#!/usr/bin/env python
# Drop every blacklist entry that contains another, different entry.
with open('blacklist', 'r') as f:
    sites = [site.strip() for site in f.read().split()]
result = sites[:]
for pattern in sites:
    # Loop over a copy, because result is modified inside the loop
    for site in result[:]:
        if pattern != site and pattern in site:
            # pattern is in site, so site is the longer line
            result.remove(site)
print('\n'.join(result))

Save it to ~/bin/blacklist_filter.py
Make it executable:

chmod +x ~/bin/blacklist_filter.py

Run it in the directory that contains blacklist like this:


blacklist_filter.py > new-blacklist

Note that this program requires enough memory to load the entire blacklist twice.
If that is too onerous, I could modify the program to write partial results to disk.
This could save on memory, at the expense of more disk I/O.
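
For what it's worth, here is a rough sketch of the same idea wrapped in a function and tried on the sample lines from the question. The name filter_blacklist is made up for this example, and unlike the script above it also drops exact duplicates and returns the entries sorted shortest first, which is an assumption about what is acceptable for the blacklist. It still uses plain substring matching, so it knows nothing about how Squid itself interprets domains.

```python
def filter_blacklist(entries):
    """Return the entries that do not contain any other, different entry."""
    # Unique, non-empty entries, shortest first, so that a possible
    # substring is always examined before the strings that contain it.
    cleaned = set(e.strip() for e in entries if e.strip())
    unique = sorted(cleaned, key=lambda e: (len(e), e))
    kept = []
    for entry in unique:
        # Drop the entry if an already-kept shorter one occurs inside it.
        if not any(short in entry for short in kept):
            kept.append(entry)
    return kept


# Sample lines taken from the question (hypothetical site names).
sample = [
    ".somebadsite.com",
    ".somebadsite.com/with/additions",
    ".somebadsite.com",
    ".addition.somebadsite.com",
]
print("\n".join(filter_blacklist(sample)))  # prints .somebadsite.com
```

Sorting by length first means each line only has to be compared against the lines already kept, which should help on a big file, although the worst case is still quadratic.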

PryGuy
August 13th, 2009, 03:19 PM
Thank you, buddy! Beer's on me if you ever happen to visit Russia! :)