
Thread: Searching for repeat words in text files.

  1. #1
    Join Date
    Jul 2011
    Beans
    7

    Searching for repeat words in text files.

    Hi All,

    For a while I have been considering writing a simple programme to search for repeated words, such as "and and", in documents I produce in LaTeX. I always seem to miss one when proofreading my work. Unfortunately, I have no idea where to start or which language would be best.

    Any ideas or do similar algorithms already exist?

    Moa

  2. #2
    Join Date
    Sep 2009
    Location
    Canada, Montreal QC
    Beans
    1,809
    Distro
    Ubuntu 11.10 Oneiric Ocelot

    Re: Searching for repeat words in text files.

    So you basically want to find cases similar to this:

    Code:
    There was a rabbit and and a fox fox ...
    ?

    Perl is pretty good for this kind of task.
    I know not with what weapons World War III will be fought, but World War IV will be fought with sticks and stones.
    Freedom is measured in Stallmans.
    Projects: gEcrit

  3. #3
    Join Date
    Jul 2011
    Beans
    7

    Re: Searching for repeat words in text files.

    Quote Originally Posted by cgroza
    So you basically want to find cases similar to this:

    Code:
    There was a rabbit and and a fox fox ...
    ?

    Perl is pretty good for this kind of task.
    Exactly. I have been considering learning Perl; maybe this problem will encourage me.

  4. #4
    Join Date
    Sep 2009
    Location
    Canada, Montreal QC
    Beans
    1,809
    Distro
    Ubuntu 11.10 Oneiric Ocelot

    Re: Searching for repeat words in text files.

    Quote Originally Posted by moadeep
    Exactly. I have been considering learning Perl; maybe this problem will encourage me.
    If you only want to find the lines where these words occur, you can use grep -E, which matches each line against an extended regular expression.

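    I put together this Perl regular expression, which worked when I tested it:
    Code:
    m/\b(\w+)\s+\1\b/ig
    The leading \b stops a match from starting in the middle of a word, so "cat at" is not reported.
    Note: I tested it only with Perl, so I am not sure it works with grep.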
    Last edited by cgroza; August 3rd, 2011 at 04:32 PM.

  5. #5
    Join Date
    Jul 2011
    Beans
    7

    Re: Searching for repeat words in text files.

    Quote Originally Posted by cgroza
    If you only want to find the lines where these words occur, you can use grep -E, which matches each line against an extended regular expression.

    I put together this Perl regular expression, which worked when I tested it:
    Code:
    m/\b(\w+)\s+\1\b/ig
    Note: I tested it only with Perl, so I am not sure it works with grep.

    Thanks for your help. Unfortunately I have zero knowledge of Perl (at the moment) and have no idea how to execute the command. I love the simplicity in the expression and will definitely be investigating the language when I have some free time next week.

    Moa

  6. #6
    Join Date
    Sep 2009
    Location
    Canada, Montreal QC
    Beans
    1,809
    Distro
    Ubuntu 11.10 Oneiric Ocelot

    Re: Searching for repeat words in text files.

    Quote Originally Posted by moadeep
    Thanks for your help. Unfortunately I have zero knowledge of Perl (at the moment) and have no idea how to execute the command. I love the simplicity in the expression and will definitely be investigating the language when I have some free time next week.

    Moa
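    The expression converts easily to the extended regular expression syntax that grep -E uses.
    EDIT: Here is the grep command (\w, \s, \b and the \1 back-reference are GNU grep extensions, which Ubuntu's grep supports; -i ignores case, like the i flag in Perl):
    Code:
    grep -Ei '\b(\w+)\s+\1\b' yourfile.txt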
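
    One limitation worth knowing: grep checks one line at a time, so a repeat that is split across a line break ("... and" at the end of one line, "and ..." at the start of the next) goes unnoticed, which can happen in hard-wrapped LaTeX source. The sketch below is one possible way around that using awk, which remembers the previous word between lines. The function name find_repeats_awk is invented for this example, and the word handling is a simplification: only ASCII letters, digits and underscores count as word characters, and LaTeX commands are not treated specially.

```shell
# Hypothetical helper (the name is an assumption, not part of any tool).
# Prints "line: word" for each repeated word, including repeats that
# straddle a line break. Comparison is case-insensitive.
find_repeats_awk() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        tok = tolower($i)

        # Drop trailing punctuation so that "fox fox." is still caught.
        core = tok
        sub(/[^a-z0-9_]+$/, "", core)
        if (core ~ /^[a-z0-9_]+$/ && core == prev)
          printf "%d: %s\n", FNR, core

        # Remember this token for the next comparison only if no
        # punctuation follows it; leading punctuation is ignored.
        head = tok
        sub(/^[^a-z0-9_]+/, "", head)
        prev = (head ~ /^[a-z0-9_]+$/) ? head : ""
      }
    }
  ' "$@"
}
```

    Run it as find_repeats_awk yourfile.tex; the number printed is the line on which the second copy of the word appears.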
    Last edited by cgroza; August 3rd, 2011 at 07:29 PM.
