Hi everyone.
I'm posting here because I'm all out of ideas. I've done a lot of research into improving my script's performance, and all I managed to squeeze out of it was an extra 15% speed, at the arguable cost of code readability. So I'm asking for your help. Allow me to explain the situation.
I am programming in Python. I have built a script that takes two files. One is a (relatively) small list of potentially interesting information, with each line as an entry. The other is a very large list with a lot of uninteresting entries, along with relevant information about each entry that is not available in the first file; the second file likewise has one entry per line. (Just FYI, the "small" file is 63 MB and the large file is 3.2 GB.)
Here is what my script does:
It matches the entry names in the first file to the entry names in the second file. When it finds a match, it checks the information about that entry in the second file; if it is interesting, the entry is written to a shortlist, which is the output file.
Using "cProfile" I found that (not surprisingly) the operation that takes the longest is checking whether the lines from the first file exist in the second file.
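For context, here is a minimal way to reproduce that kind of profile. The data and the Indexer body are made-up stand-ins for the real files and script, just to show the cProfile/pstats setup:

```python
import cProfile
import io
import pstats

# Hypothetical stand-in data; the real files are far larger.
reference = [f"entry{i}\n" for i in range(2000)]
entries = [f"entry{i}\n" for i in range(0, 2000, 7)]

def Indexer(reference, entries):
    # Same logic as the script: keep the index of each matching line.
    return [reference.index(line) for line in entries if line in reference]

profiler = cProfile.Profile()
profiler.enable()
Indexer(reference, entries)
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
stats_text = stream.getvalue()
print("function calls" in stats_text)  # True: the report lists call counts and times
```

Sorting by cumulative time is what surfaces the membership check as the hot spot.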
For simplicity I will call the small file "list" and the large file "reference".
Originally I had something like:
Code:
def Indexer(reference, list):
    indeces = []
    for lines in list:
        if lines in reference:
            indeces.append(reference.index(lines))  # I only need the index number, not the line itself.
    return indeces
Which, after some research, I changed to a "map" version that granted me the 15% performance bonus; it looks like this:
Code:
def Indexer(reference, list):
    names = tuple(filter(lambda x: x in reference, list))
    indeces = tuple(map(reference.index, names))
    return indeces
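For anyone following along, a quick sanity check on toy data (the entry names below are invented) confirms the two versions find the same indices, so the 15% came purely from the rewrite, not from changed behavior:

```python
# Hypothetical sample data: "omega" is deliberately absent from reference.
reference = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
entries = ["beta\n", "delta\n", "omega\n"]

def indexer_loop(reference, entries):
    # Original loop version.
    indeces = []
    for line in entries:
        if line in reference:
            indeces.append(reference.index(line))
    return indeces

def indexer_map(reference, entries):
    # filter/map version; returns a tuple instead of a list.
    names = tuple(filter(lambda x: x in reference, entries))
    return tuple(map(reference.index, names))

print(indexer_loop(reference, entries))  # [1, 3]
print(indexer_map(reference, entries))   # (1, 3)
```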
But this is still not nearly enough. My original program has been running for more than 48 hours now, and since I will have to run it very often and with (sigh!) even bigger files, I definitely need to improve its performance.
The only idea I have left is to try and multi-thread the program, but I read that Python was not very good at it, and that it would actually be kind of pointless (here and here, but there were more).
So what else can I do to dramatically improve the run times?
Have I reached Python's limit, speed-wise? I want to think I haven't. =-)
Notes:
If the code looks weird, it's python3.
I am already using tuples instead of lists everywhere I can think of. This did have a very positive effect on performance, but I didn't measure it at the time.
In order to test this I'm using clipped versions of my files ("list" is 250 kB and "reference" is 5.2 MB).
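In case it helps anyone reproduce the test setup, clipping a file to its first N lines can be done with itertools.islice. The filenames below are made up for the example:

```python
from itertools import islice

def clip_file(src_path, dst_path, n_lines):
    """Copy the first n_lines of src_path into dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(islice(src, n_lines))

# Demonstration with a throwaway 100-line file.
with open("reference_full.txt", "w") as f:
    f.writelines(f"entry{i}\n" for i in range(100))

clip_file("reference_full.txt", "reference_clip.txt", 10)

with open("reference_clip.txt") as f:
    print(sum(1 for _ in f))  # 10
```

islice streams the file, so this works even when the source is too big to fit in memory.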
Any help is greatly appreciated.