View Full Version : [SOLVED] Regular expression question
sha1sum
January 6th, 2014, 10:55 PM
Hello guys, I need help with regular expressions.
Quick recap: we know that with advanced regular expressions, you can put something in parenteses to save a part of a regular expression. For example, the regular expression ^(.)(.)\2\1$ will match the words like noon and deed. However, it wil also match a string like 'aaaa' . I don't want that. Is there a way to specify that \1 and \2 cannot be the same?
Additional info: the example is a simplification of the actual problem. A workaround with egrep -v won't work.
sha1sum
January 7th, 2014, 12:33 AM
Okay, I´ve searched a lot on the internet and in various reference books, but I really couldn´t find a way to do this. Then I took a break, and ingested a copious amount of caffeine and sugar, and I came up with a solution for my own problem.
I´m going to try to write a script that counts the unique characters in a string. For a word like deed that would be 2, but for aaaa it would be only 1. Then I can filter based on that. That should also work for my bigger problem.
ofnuts
January 7th, 2014, 12:46 AM
For example, the regular expression ^(.)(.)\2\1$ will match the words like noon and deed..
It also, surprisingly, matches "boob", and I don't know why that word crossed my mind... :)
sha1sum
January 7th, 2014, 02:26 AM
It also, surprisingly, matches "boob", and I don't know why that word crossed my mind... :)
lol it did for me too. The original example I wanted to write was: "I need a regular expression that matches book, but not boob". But then I realized; who in their right mind would prefer books over boob-s? So I changed it. :p
ofnuts
January 7th, 2014, 03:11 AM
That explains how you came up with the '(.)(.)' syntax :)
steeldriver
January 7th, 2014, 03:32 AM
Apparently you can do it with negative lookahead --> http://stackoverflow.com/a/8057827
Based on that answer,
cat file
noon
book
deed
boob
aaaa
$ grep -Po '(.)((?!\1).)\2\1' file
noon
deed
boob
sha1sum
January 7th, 2014, 12:50 PM
steeldriver you are awesome.
sha1sum
January 7th, 2014, 01:00 PM
That explains how you came up with the '(.)(.)' syntax :smile:
ROFL! Never thought that regular expressions could get so... Freudian. LOL
kurum!
January 7th, 2014, 04:21 PM
I´m going to try to write a script that counts the unique characters in a string. For a word like deed that would be 2, but for aaaa it would be only 1. Then I can filter based on that. That should also work for my bigger problem.
# echo "deed" | ruby -e 'puts gets.chomp.split("").uniq.size'
2
# echo "aaaa" | ruby -e 'puts gets.chomp.split("").uniq.size'
1
sha1sum
January 7th, 2014, 08:25 PM
I made the following awk script:
# charcount: an awk script that filters based on the number of unique characters in a line
BEGIN{ FS="" } # Make every individual character a field
{ b=0; delete a # delete variable b and array a from the previous line
for(i=1;i<=NF;i++) a[$i]++ # Make an array entry for every character in the line
for (i in a) b++ } # count the number of array entries: ie the number of unique characters
b==2 # print the line if the number of characters equals 2
So then the command looks like this:
cat list.txt | egrep '^(.)(.)\2\1$' | awk -f charcount
This line will print words like deed, noon and boob, but not words like book, blob, and bbbb.
With that I was able to determine that the only four letter palindromes in the english language are the following:
boob
deed
kook
noon
peep
poop
sees
toot
I don't know what "kook" is, but it was listed in /usr/share/dict/american-english
ofnuts
January 7th, 2014, 09:39 PM
I don't know what "kook" is, but it was listed in /usr/share/dict/american-english
http://en.wiktionary.org/wiki/kook
Powered by vBulletin® Version 4.2.2 Copyright © 2024 vBulletin Solutions, Inc. All rights reserved.