I'm writing a script that takes a text file (like a novel) as input and outputs it with HTML entity names instead of special characters (or other substitutes for them). This is to make a zipped HTML file to submit to Kindle Direct Publishing. (Yes, I don't want to use a Word document, contrary to what is popular.)
Anyway, I don't really know how to use regular expressions. It looks like a lot of reading through to pick stuff out and memorizing to me, until you get it down. So, I was wondering if someone who already knows could help me on my way by telling me how to do certain specific tasks, specifically in Python.
What I want to do is this:
Find all hyphens surrounded by numbers and substitute those hyphens (not the numbers, too) with en dashes (more specifically, with the HTML entity name:).Code:–
So, instead of having something like this: 1-10 (or more properly, 1–10), I'd have this:Anyway, that's my question.Code:1–10
However, it would be extra-nice if I could distinguish ranges of numbers with things like phone numbers and other codes. I don't want en dashes in phone numbers. I think I'll have to settle on hyphens instead of figure dashes for the phone numbers, however, since I don't think KDP supports figure dashes, yet (it uses Latin-1). The way I was thinking to distinguish them was to rule out cases where there is more than one hyphen number combination in the same 'word'. By word, I mean a group of characters that doesn't involve punctuation (sans hyphens) and such.
Another (and far more important) extra nice thing you could tell me, if you like, is this: How to replace an arbitrary number of things in a string each with its specific thing.
I mean, instead of calling myString.replace("thing1","thing2") a whole bunch of times, I was wondering if there's a way to do it without taking so much processer power, and thus speed it up a lot. I mean, if I'm inputting a huge book, this is going to take a while with all those replacements. If there's a function that only cycles through the book once (or less than a lot of times) for all the replacing, that would be awesome.
As an alternative (if what I wanted isn't practical), I could program in Java, Lua or Vala/Genie to make it faster, perhaps. I haven't tested it to see if it actually would be, but from what I know about those languages, it should be.