Python replace text



idn
May 1st, 2008, 04:42 PM
Hi, I am trying to develop a script that makes a reasonable attempt at working out the word count of my LaTeX university essays. I have lots of references - like a good student :) - in my essays that shouldn't actually contribute to the word count. They are all BibTeX-style citations, for example:

\cite{Nielsen:Overgaard:Pedersen:Stage:Stenild:2006}

I was wondering whether there is a way to remove this text from a string in Python. I have already written all the code to replace a bunch of other stuff (see attachment), but I am completely useless with regular expressions and can't figure out how to do this at all.

Any help would be much appreciated

Jon

tseliot
May 1st, 2008, 04:58 PM
You don't have to use regular expressions for such things. If you're reading the file line by line, you can keep only the lines for which the following is true, for example:

not line.strip().startswith(r'\cite{')
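As a rough sketch of that line-by-line check applied to a whole file's text (the function name drop_cite_lines is made up for illustration and is not in the attached script): it only drops lines that begin with \cite{, so a citation in the middle of a sentence is left alone.

```python
def drop_cite_lines(text):
    # drop_cite_lines is a hypothetical helper name.
    # Keep every line that does not begin with \cite{ (ignoring leading
    # whitespace). splitlines(True) keeps the line endings intact.
    kept = [line for line in text.splitlines(True)
            if not line.strip().startswith(r'\cite{')]
    return ''.join(kept)
```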

EDIT: I didn't notice the attached file

WW
May 1st, 2008, 05:36 PM
I agree with tseliot--you don't need a regular expression.

Perhaps something like this:


# Remove all occurrences of \cite{*}
while lines.find(r"\cite{") != -1:
    # Split around the first \cite{, then around the brace that closes it
    pre, copen, rest = lines.partition(r"\cite{")
    citation, cclose, post = rest.partition("}")
    lines = pre + post

It uses the partition() string method, which was added in Python 2.5. If you are using 2.4, you could do something equivalent by using find() and slicing the string apart.
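A find()-based equivalent might look like the sketch below. The name strip_cites is made up for illustration, and it assumes a citation never contains nested braces. Like the partition() loop, it drops everything after an unterminated \cite{.

```python
def strip_cites(text, marker=r"\cite{"):
    # strip_cites is a hypothetical helper name, not part of the original
    # script. It uses only find() and slicing, so it needs no partition().
    start = text.find(marker)
    while start != -1:
        end = text.find("}", start + len(marker))
        if end == -1:
            # No closing brace: drop the rest, as the partition() loop does.
            text = text[:start]
            break
        # Cut out the citation and keep searching from the same position.
        text = text[:start] + text[end + 1:]
        start = text.find(marker, start)
    return text
```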

tseliot
May 1st, 2008, 05:46 PM
If you want to use re, try this:

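import re

# match() only tests the start of the string, and the pattern expects
# the citation to be followed by a newline.
pattern = r'\\cite{.*}\n'
pat1 = re.compile(pattern)
m1 = pat1.match(lines)
if m1:
    lines = lines.replace(m1.group(0), '')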

WW
May 1st, 2008, 05:47 PM
P.S. In your code, you can replace this


# Seek to the start position in the file
fileHandle.seek(0)
fileList = fileHandle.readlines()
# Loop through each line in the file
for fileLine in fileList:
    lines += fileLine

with this:


lines = fileHandle.read()
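In current Python the same one-call read is usually wrapped in a with block, so the file gets closed automatically. A minimal sketch; the function name read_essay and the UTF-8 encoding are assumptions, not taken from the attached script.

```python
def read_essay(path, encoding="utf-8"):
    # read_essay is a hypothetical helper; the encoding is an assumption.
    # The with block closes the file even if read() raises an error.
    with open(path, encoding=encoding) as file_handle:
        return file_handle.read()
```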

WW
May 1st, 2008, 06:17 PM
tseliot's re version appears to assume that each \cite{} is on a line by itself. This is not a requirement in LaTeX.

A really concise way to get rid of \cite{...} is to use the sub function in the re module:


lines = re.sub(r'\\cite{.*}','',lines)

(Add the line import re somewhere in the beginning of your file.) The first argument to sub is the same pattern as used by tseliot, but without the trailing newline.

Although I said above that you don't need a regular expression, I like this version better than my first one. :)

ghostdog74
May 2nd, 2008, 01:11 AM
you might also want to check for non-greediness in the regexp.

WW
May 2nd, 2008, 01:57 AM
Good point, ghostdog74! If a line contains two \cite{} commands, the re.sub call that I gave above will delete both of them and all the text between them, because .* runs on to the last } on the line. This version avoids that:


lines = re.sub(r'\\cite{[^}]*}','',lines)
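To see the difference, here is a small comparison on a made-up sentence (the citation keys are invented for the example):

```python
import re

# Made-up sample line with two citations on it.
sample = r"Usability \cite{Nielsen:2006} matters \cite{Stage:2004} here."

# Greedy: .* runs to the last } on the line, swallowing " matters " too.
greedy = re.sub(r'\\cite{.*}', '', sample)

# Negated character class: each match stops at the first closing brace.
per_citation = re.sub(r'\\cite{[^}]*}', '', sample)

print(greedy)        # Usability  here.
print(per_citation)  # Usability  matters  here.
```

Both results keep the spaces that surrounded each citation, which does no harm if the text is later split on whitespace to count words.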

nanotube
May 2nd, 2008, 04:08 AM
Better yet (or at least, alternatively), instead of WW's [^}]* version, use non-greedy matching:


lines = re.sub(r'\\cite{.*?}','',lines)
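One difference between the two that may matter: . does not match a newline by default, so the non-greedy pattern skips a citation that is wrapped across two lines unless re.DOTALL is passed, while [^}]* crosses line breaks on its own. A small sketch with an invented sample:

```python
import re

# Made-up sample: the second citation is wrapped across two lines.
sample = "First \\cite{A:2001} and second \\cite{B:2002,\nC:2003} end."

# Non-greedy, default flags: the wrapped citation is not matched.
lazy = re.sub(r'\\cite{.*?}', '', sample)

# Non-greedy with DOTALL: . also matches the newline.
lazy_dotall = re.sub(r'\\cite{.*?}', '', sample, flags=re.DOTALL)

# Negated character class: a newline is simply "not a closing brace".
negated = re.sub(r'\\cite{[^}]*}', '', sample)

print(lazy == sample.replace("\\cite{A:2001}", ""))  # True
print(lazy_dotall == negated)                        # True
```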

ghostdog74
May 2nd, 2008, 04:12 AM
The usual qualifier for non-greediness is ?; the regex HOWTO (http://www.amk.ca/python/howto/regex/regex.html#SECTION000730000000000000000) is a good read. So another way is the pattern below (I am not sure what the outcome of .sub() will be, so I'd have to test it):


r'\\cite{.*?}'

nanotube
May 2nd, 2008, 04:20 AM
Ha, beat you to it :)

ghostdog74
May 2nd, 2008, 05:09 AM
well done, but I am too poor to give you a prize for that :)

nanotube
May 2nd, 2008, 04:43 PM
hehe, and i'm too poor to expect one! :)