rename and regex

**Vaphell** · June 3rd, 2013

Code:

$ rename -nv 's/.*([a-zA-Z][0-9]+[a-zA-Z][0-9]+).*[.]([^.]+)$/\L$1.$2/' really*
really.long.filename.with.extra.stuff.A12B33.and.more.stuff.extension renamed as a12b33.extension

**rebeltaz** · June 3rd, 2013

Originally Posted by Vaphell

Code:

$ rename -nv 's/.*([a-zA-Z][0-9]+[a-zA-Z][0-9]+).*[.]([^.]+)$/\L$1.$2/' really*
really.long.filename.with.extra.stuff.A12B33.and.more.stuff.extension renamed as a12b33.extension

Wow... regex may be REGULAR, but it makes my head hurt! OK, so...

I understand the .*([a-zA-Z][0-9]+[a-zA-Z][0-9]+).* but why the next [.] after that? How does [^.]+ match the extension? If I understand regex correctly (and I do not!), [^.] matches any character that is not a . (which is confusing, because ^ without the brackets indicates the start of the line!). Would that not be every alphanumeric character, including the extension?

I promise I am not looking JUST to have this written for me. I really do want to understand it so I can do this on my own next time. Yeah, I know... rather presumptuous of me.

But I do appreciate all of the help!

**Vaphell** · June 3rd, 2013

[^.]+ = not-dot 1-or-more times
general idea:
anything(char-digits-char-digits)anything[dot](not-dots)end-of-line => $1.$2
Obviously the 2nd parenthesis can only store the extension, because it is forced to match everything between the last dot and end-of-line. In the replacement you invoke the content of that parenthesis with $2.

your current code simply finds the episode number and takes everything after it verbatim, that's why any garbage that happens to be there after the number gets to the final name (first part up to the number gets transformed but anything after gets through).

(something.s01e01 => s01e01).garbage.ext (the parentheses show the scope of your regex substitution)

In my regex I match the whole name from start to end, with .*[.] consuming any garbage up to the last dot, leaving only the extension to be captured and used to construct the final name.

something.s01e01.garbage.ext => s01e01.ext
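For anyone who wants to poke at the whole-name idea outside of rename, here is a minimal sketch in Python. It assumes Python's re module treats these constructs (greedy .*, character classes, $) the same way Perl does, which it does as far as I know. The function name short_name is made up for illustration, and since Python has no \L, the lowercasing is done explicitly.

```python
import re

# Same pattern as in the rename command above.
PATTERN = re.compile(r".*([a-zA-Z][0-9]+[a-zA-Z][0-9]+).*[.]([^.]+)$")


def short_name(filename):
    """Return 'episode.ext' in lowercase, or the name unchanged if no match.

    Hypothetical helper, not part of rename.
    """
    # Python has no \L, so lowercase the rebuilt name by hand.
    return PATTERN.sub(lambda m: f"{m.group(1)}.{m.group(2)}".lower(), filename)


long_name = "really.long.filename.with.extra.stuff.A12B33.and.more.stuff.extension"
print(short_name(long_name))                       # a12b33.extension
print(short_name("something.s01e01.garbage.ext"))  # s01e01.ext
```

Because the pattern covers the name from start to end, the replacement swaps out the entire name, so nothing after the episode number can leak through.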

**rebeltaz** · June 3rd, 2013

I think what I don't understand is why .*[.] stops at the LAST [.] instead of the first [.] it comes to...

**ofnuts** · June 3rd, 2013

Originally Posted by rebeltaz

I think what I don't understand is why .*[.] stops at the LAST [.] instead of the first [.] it comes to...

Because that's the normal "greedy" behavior in regular expressions. In "aaabbbcccaaabbbccc" you have three possible matches for "aaa.*ccc": the first "aaabbbccc", the second one, or the whole string. I know, you are going to ask, "But why is the default behavior to match the whole string?" And the answer is: because in that case it is a lot easier to prevent that behavior[*] and write an expression that matches only the first or last small string than it would be, if the "frugal" behavior were the default, to construct a regexp that matches the whole string.

Some regexp syntaxes have a modifier (*?, +?) that lets you specify the shortest match. But Real Men don't use it.

[*] "aaa[^a]*ccc"
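The three flavours mentioned here (greedy, the *? modifier, and the negated-class trick from the footnote) can be compared side by side. This is only a sketch using Python's re module, on the assumption that it behaves like Perl for these patterns. The variable names are mine.

```python
import re

text = "aaabbbcccaaabbbccc"

greedy = re.search(r"aaa.*ccc", text).group(0)      # longest possible match
lazy = re.search(r"aaa.*?ccc", text).group(0)       # shortest match from the first "aaa"
negated = re.search(r"aaa[^a]*ccc", text).group(0)  # cannot run past the next "a"

print(greedy)   # aaabbbcccaaabbbccc
print(lazy)     # aaabbbccc
print(negated)  # aaabbbccc
```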

**Vaphell** · June 3rd, 2013

Just like ofnuts said, by default regexes try to consume as much as possible. Besides, when you write the regex .*[.][^.]+$ there is no other option. The $ settles it: the line has to end with [.][^.]+. If only non-dots are allowed after that dot, then it has to be the last one.

Code:

abc.def.ghi / .*[.][^.]+$

with [.] on the first dot there is no way it will ever match: the second dot would have to be consumed by [^.]+, which is impossible

Code:

abc.def.ghi / .*[.][^.]+$

with [.] on the last dot everything is fine
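To see which dot [.] actually lands on, one option is to wrap .* in an extra pair of parentheses and print what each part consumed. A rough sketch in Python follows (assuming its re module agrees with Perl here). The extra group and the second pattern, which pins [.] to the first dot by using [^.]* in front of it, are my additions for illustration.

```python
import re

name = "abc.def.ghi"

m = re.search(r"(.*)[.]([^.]+)$", name)
print(m.group(1))  # abc.def  (what .* consumed)
print(m.group(2))  # ghi      (what [^.]+ consumed)

# Pin [.] to the first dot: [^.]* cannot cross a dot, so the literal dot
# must be the first one, and then [^.]+ can never reach the end of the line.
first_dot_only = re.search(r"^[^.]*[.][^.]+$", name)
print(first_dot_only)  # None
```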

**rebeltaz** · June 4th, 2013

Oh! I didn't see the dollar sign at the end of that equation. Now I get it.

Aren't the two examples above (abc.def.ghi / .*[.][^.]+$) the same?

Thank you all. You have been a great help!

**trent.josephsen** · June 4th, 2013

Originally Posted by rebeltaz

Oh! I didn't see the dollar sign at the end of that equation. Now I get it.

That's good, but it's important to note that the pattern will match (for the example strings) in exactly the same way without the dollar sign, because of the greediness of the * and + quantifiers and because [^.]+ matches only non-dots.
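This is easy to check. Below is a small sketch in Python (assumed to match Perl's behavior for these patterns). The capture group around [^.]+ and the second test string with a trailing dot are my own additions, the latter to show a case where the dollar sign does make a difference.

```python
import re

name = "abc.def.ghi"
with_anchor = re.search(r".*[.]([^.]+)$", name)
without_anchor = re.search(r".*[.]([^.]+)", name)

print(with_anchor.group(0), with_anchor.group(1))        # abc.def.ghi ghi
print(without_anchor.group(0), without_anchor.group(1))  # abc.def.ghi ghi

# A name ending in a dot: anchored, nothing matches; unanchored, the engine
# backtracks to the previous dot and stops short of the end.
odd = "abc.def."
odd_anchored = re.search(r".*[.]([^.]+)$", odd)
odd_unanchored = re.search(r".*[.]([^.]+)", odd)
print(odd_anchored)             # None
print(odd_unanchored.group(0))  # abc.def
```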

Aren't the two examples above (abc.def.ghi / .*[.][^.]+$) the same?

What Vaphell was pointing out was that some parts of the pattern match different parts of the string, which affects the substrings ($1 and $2) captured by the parentheses () in the original pattern.

E.g. when matching against the string "hello, world":

/([a-z]*)/ will match the first 5 characters of the string, putting 'hello' into $1;
/.*([a-z]*)/ will match the whole string, putting '' (the empty string) into $1;
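/.*?([a-z]*)/ will match the first 5 characters of the string, putting 'hello' into $1 (the non-greedy .*? is happy to match nothing, so it never skips ahead to 'world');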
/([a-z]*)$/ will match the last 5 characters of the string, putting 'world' into $1.

The first three patterns match all the same strings -- it's not possible to construct a string for which one of them succeeds but another fails. (The fourth pattern also always succeeds, but it is anchored to the end of the string, so for a string like "hello, world4" it can only match the empty string at the very end, leaving $1 empty.) The differences lie in which parts of the string they match first, and how quickly that happens. (In many cases, /.*PATTERN/ matches the same thing as /PATTERN$/, but is likely to do it faster -- sometimes much faster.)
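If you would rather see the captures than take anyone's word for it, the four patterns can be run directly. This is a sketch in Python rather than Perl, on the assumption that the two engines agree on these constructs. The helper name captures is made up. Note that, in Python at least, the non-greedy variant captures 'hello', because .*? starts by matching nothing.

```python
import re

PATTERNS = [r"([a-z]*)", r".*([a-z]*)", r".*?([a-z]*)", r"([a-z]*)$"]


def captures(text):
    """Return (pattern, whole match, group 1) for each pattern (hypothetical helper)."""
    results = []
    for pattern in PATTERNS:
        m = re.search(pattern, text)
        results.append((pattern, m.group(0), m.group(1)))
    return results


for pattern, whole, group1 in captures("hello, world"):
    print(f"{pattern:<14} matched {whole!r:<16} group 1 = {group1!r}")
```

Running it on "hello, world4" as well shows that every pattern still succeeds, only with different (sometimes empty) captures.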

This is stuff I picked up from the Camel book, and happens to apply because rename is written in Perl. Other languages and regex engines have slightly different rules, syntaxes and performance profiles, but the general concepts (like greediness) are the same.

**rebeltaz** · June 4th, 2013

I think I understand, but I may need to take a college course on regex if I ever attempt this again!

**ofnuts** · June 4th, 2013

No need for a college course. Everything is there: http://www.amazon.com/Mastering-Regu...dp/0596528124/
