Thread: Beginners programming challenge #24

  1. #1
    Join Date
    Jun 2009
    Location
    Land of Paranoia and Guns
    Beans
    194
    Distro
    Ubuntu 12.10 Quantal Quetzal

    Beginners programming challenge #24

    Welcome to the 24th Beginners programming challenge.

    The challenge is to create a text content analyzer. This is a tool used by writers to find statistics such as word and sentence count on essays or articles they are writing.

    Write a program that analyzes input from a file, essay.txt, and compiles statistics on it.
    The program should output:

    1. The total word count
    2. The count of unique words
    3. The number of sentences

    Example output, using the attached essay.txt:
    Code:
    analyze
    Total word count: 468
    Unique words: 223
    Sentences: 38

    Cookie points:

    Cookie points will be awarded for the following extras:

    1. The ability to calculate the average sentence length in words
    2. The ability to find often used phrases (a phrase of 3 or more words used over 3 times)
    3. A list of words used, in order of descending frequency
    4. The ability to accept input from STDIN, or from a file specified on the command line.
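
For anyone unsure where to start on the first two extras, here is a rough sketch and not a reference solution. It assumes a sentence ends at any run of ".", "!" or "?", it only looks at phrases of exactly three words, and it lets phrases run across sentence boundaries. The function names (`split_sentences`, `tokenize`, `average_sentence_length`, `frequent_phrases`) are made up for this illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on runs of '.', '!' or '?' (an assumed definition of a sentence)."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(text):
    """Lower-case the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def average_sentence_length(text):
    """Mean number of words per sentence, or 0.0 for empty input."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def frequent_phrases(text, size=3, min_count=4):
    """Return phrases of `size` words that occur at least `min_count` times.

    min_count=4 is one reading of "used over 3 times".
    """
    words = tokenize(text)
    counts = Counter(
        " ".join(words[i:i + size]) for i in range(len(words) - size + 1)
    )
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


sample = "I like cats. I like cats! I like cats? I like cats. Dogs are fine."
print(average_sentence_length(sample))  # 3.0
print(frequent_phrases(sample))         # {'i like cats': 4}
```

Longer phrases could be handled by calling `frequent_phrases` again with a larger `size`.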


    Disqualified Entries:

    Any overly obfuscated code will be immediately disqualified, regardless of the programmer's skill. Please remember that these challenges are for beginners and therefore the code should be easily readable and well commented.

    Any non-beginner entries will not be judged. Please use common sense when posting code examples. Please do not give beginners a copy paste solution before they have had a chance to try this for themselves.

    BASH programmers are NOT allowed to use wc.

    If your program calculates the lexical density or the Gunning fog index, then you are not a beginner and your entry won't be judged.


    Assistance:

    If you require any help with this challenge please do not hesitate to come and chat to the development focus group. They have a channel on irc.freenode.net #ubuntu-beginners-dev
    Don't use W3Schools as a resource! (Inconsequential foul language at the jump)
    Open Linux Forums (More foul language, but well worth it for the quality of support and good humor.)
    If you want to discuss W3Schools, please PM me instead of posting.

  2. #2
    Join Date
    Apr 2007
    Location
    NorCal
    Beans
    1,149
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Beginners programming challenge #24

    What defines a "sentence"? A string that ends in [.?!] possibly inside quotation marks?
    Posting code? Use the [code] or [php] tags.
    I don't care, I'm still free. You can't take the sky from me.

  3. #3
    Join Date
    Jun 2007
    Location
    Porirua, New Zealand
    Beans
    Hidden!
    Distro
    Ubuntu

    Re: Beginners programming challenge #24

    Quote: Originally posted by schauerlich
    What defines a "sentence"? A string that ends in [.?!] possibly inside quotation marks?
    That's part of the fun/challenge of coming up with solutions, deciding how to define how your program copes with different situations.
    Forum DOs and DON'Ts
    Never assume that information you find using a search engine is up-to-date.

  4. #4
    Join Date
    Apr 2007
    Location
    NorCal
    Beans
    1,149
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Beginners programming challenge #24

    Quote: Originally posted by lisati
    That's part of the fun/challenge of coming up with solutions, deciding how to define how your program copes with different situations.
    But should this count as one sentence or two?

    "'This is Unix, I know this!' exclaimed the girl as the raptors clawed at the door, bringing her ever closer to her inevitable demise."
    It could be argued either way, and if we're being judged on correctness, having a definition of correct is important.
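
To make the ambiguity concrete, here is a small sketch of two possible rules applied to a sentence like the one above. Neither rule is the challenge's official definition; both function names are invented for this illustration, and the second rule (only count a terminator that is followed by a capital letter or the end of the text) is just one plausible alternative.

```python
import re

text = ("\"'This is Unix, I know this!' exclaimed the girl as the raptors "
        "clawed at the door, bringing her ever closer to her inevitable "
        "demise.\"")


def count_terminators(s):
    """Rule A: count every run of '.', '!' or '?', wherever it appears."""
    return len(re.findall(r"[.!?]+", s))


def count_terminators_before_capital(s):
    """Rule B: count a run of terminators (plus optional closing quotes)
    only when it is followed by whitespace and a capital letter, or by the
    end of the text."""
    return len(re.findall(r"[.!?]+['\"]*(?=\s+[A-Z]|\s*$)", s))


print(count_terminators(text))                 # 2
print(count_terminators_before_capital(text))  # 1
```

Rule A treats the quoted exclamation as its own sentence, while rule B folds it into the surrounding one.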

  5. #5
    Join Date
    Apr 2007
    Location
    NorCal
    Beans
    1,149
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Beginners programming challenge #24

    No wc you say? Challenge accepted.

    Code:
    #!/bin/bash
    
    # NF contains the number of "fields" (whitespace delimited strings) in each line. Summing
    # this for every line will give you the word count.
    awk '{w += NF} END {print "Total word count: " w}' "$1"
    
    # For every word on each line, print it (i.e. put each word on its own line of stdout)
    # change all upper case to lower case
    # remove all punctuation
    # sort it
    # remove duplicates
    # print the number of words left
    awk '{for (i = 1; i <= NF; i++) print $i}' "$1" | tr '[A-Z]' '[a-z]' | tr -d '[:punct:]' | sort | uniq | awk '{l += 1} END {print "Unique words: " l}'
    
    # find all occurrences of '.', '!' or '?' optionally followed by any number of quotes. Print
    # how many there are.
    grep -o '[.!?]['\''"]*' "$1" | awk '{l += 1} END {print "Sentences: " l}'
    Code:
    $ 24.sh essay.txt 
    Total word count: 468
    Unique words: 223
    Sentences: 38
    Last edited by schauerlich; December 17th, 2011 at 07:22 AM.

  6. #6
    Join Date
    Apr 2007
    Location
    NorCal
    Beans
    1,149
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Beginners programming challenge #24

    And in Python:

    Code:
    import sys
    import re
    import string
    
    def clean_text(s):
        """Make text lower case, and remove all punctuation."""
        return re.sub("[%s]" % re.escape(string.punctuation), '', s.lower())
    
    def main():
        words = []
    
        # Make a list of each word in the file
        with open(sys.argv[1], "r") as f:
            for line in f:
                words.extend(line.split())
    
        print("Total word count:", len(words))
    
        # We need to "clean" the text in order to remove potential duplicates.
        # For example, "Cat", "cat", and "cat." would all show up as different
        # when for our purposes they are the same. Applying clean_text will make
        # all of these "cat" so they appear identical.
        words_clean = map(clean_text, words)
    
        # By turning it into a set, we remove all duplicates. Therefore, the
        # length of this set is the number of unique words in the file.
        print("Unique words:", len(set(words_clean)))
    
        sentences = 0
        for word in words:
            # If a word ends in ".", "!" or "?" (potentially followed by any
            # number of quotation marks), we should count it as the end of a
            # sentence. The trailing $ anchors the match to the end of the word.
            if re.search(r"""[.!?]['"]*$""", word):
                sentences += 1
    
        print("Sentences:", sentences)
    
    if __name__ == "__main__":
        main()
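
The entry above only reads the file named in `sys.argv[1]`. As a possible starting point for cookie point 4, here is a minimal sketch that falls back to standard input when no file name is given. The helper name `read_text` and the script name `analyze.py` are assumptions for this example, not part of the entry.

```python
import io
import sys


def read_text(argv, stdin=sys.stdin):
    """Return the contents of the file named in argv[1], else all of stdin."""
    if len(argv) > 1:
        with open(argv[1], "r") as f:
            return f.read()
    return stdin.read()


# Demonstration with an in-memory stream standing in for piped input.
fake_stdin = io.StringIO("One sentence. Two sentences.")
text = read_text(["analyze.py"], stdin=fake_stdin)
print(len(text.split()))  # 4
```

In a real script the call would be `read_text(sys.argv)`, so both `analyze.py essay.txt` and `cat essay.txt | analyze.py` would work.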

  7. #7
    Join Date
    Jun 2009
    Location
    Land of Paranoia and Guns
    Beans
    194
    Distro
    Ubuntu 12.10 Quantal Quetzal

    Re: Beginners programming challenge #24

    Quote: Originally posted by schauerlich
    But should this count as one sentence or two?

    For the purposes of this challenge, I'll count it as two sentences.
    @schauerlich If your earlier entries differ, don't worry about it. The whole point is to get people programming, and not to worry about insignificant discrepancies between the entries.

  8. #8
    Join Date
    Sep 2009
    Location
    Canada, Montreal QC
    Beans
    1,809
    Distro
    Ubuntu 11.10 Oneiric Ocelot

    Re: Beginners programming challenge #24

    My solution in Haskell.
    Compile with ghc filename.hs
    Written in point-free style as much as possible.
    Code:
    module Main where 
    import Data.Char
    import Data.List
    import System.IO
    import System.Environment (getArgs)
    
    -- Removes punctuation marks; otherwise `words` would count stray marks as words.
    removePunctuation :: String -> String
    removePunctuation = filter (not . isPunctuation)
    
    countWords, countUniqueWords, countSentences :: String -> Int
    -- Counts words, removing punctuation first.
    countWords = length . words . removePunctuation
    
    -- Counts unique words, removing punctuation first.
    countUniqueWords = length . nub . words . map toLower . removePunctuation
    
    -- Counts sentences by looking for ";.?!" marks.
    countSentences = length . filter (`elem` ";.?!")
    
    -- Analyzes input using above functions.
    analyze :: String -> [(String, Int)]
    analyze = zip attributes . sequence [countWords, countUniqueWords, countSentences]
                where attributes = ["Total Word Count: ", "Unique Words: ", "Sentences: "]
    
    main :: IO ()
    main = do args <- getArgs
              analyzeAndPrint =<< (if null args then hGetContents stdin else readFile $ head args)
                where analyzeAndPrint :: String -> IO ()
                      analyzeAndPrint = mapM_ (putStrLn . \(a,v) -> a ++ show v) . analyze
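
For readers who don't know Haskell: in `analyze`, `sequence` applied to a list of functions feeds the same input string to every function and collects the results, which are then zipped with the labels. A rough Python analogue is sketched below. It is simplified on purpose (it skips the punctuation stripping and the unique-word count), and the names are chosen for this example only.

```python
def count_words(text):
    """Number of whitespace-separated words."""
    return len(text.split())


def count_sentences(text):
    """Number of ';', '.', '?' and '!' characters, as in the Haskell entry."""
    return sum(text.count(mark) for mark in ";.?!")


def analyze(text, analyses):
    """Apply every (label, function) pair to the same text."""
    return [(label, function(text)) for label, function in analyses]


analyses = [("Total Word Count: ", count_words), ("Sentences: ", count_sentences)]
for label, value in analyze("Cats purr. Dogs bark!", analyses):
    print(label + str(value))
# Total Word Count: 4
# Sentences: 2
```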
    Last edited by cgroza; December 17th, 2011 at 09:12 PM.
    I know not with what weapons World War III will be fought, but World War IV will be fought with sticks and stones.
    Freedom is measured in Stallmans.
    Projects: gEcrit

  9. #9
    Join Date
    Apr 2008
    Location
    Ireland
    Beans
    286
    Distro
    Ubuntu 11.10 Oneiric Ocelot

    Re: Beginners programming challenge #24

    My entry in Perl:

    Code:
    #!/usr/bin/perl -s
    use warnings;
    
    # Buffer the input here
    my $buffer;
    
    # Function to process the buffer and output statistics
    sub process {
        # Create an array of words, including ones with ' and - in them.
        my @arr = $buffer =~ m/\w+'?-?\w*/g;
        print "Words: " . scalar @arr . "\n"; # print number of array elements
    
        # Count each word in a hash. Since hash keys (which are converted to lowercase)
        # cannot have duplicates, the number of keys is the number of unique words.
        # The value for each key is the number of occurrences of the word.
        my %hash;
        $hash{ lc($_) }++ for @arr;
        print "Unique Words: " . keys (%hash) . "\n"; # print number of keys
    
        # The number of sentences is given by the number of regex matches.
        # Match at least one of ".!?", but potentially more consecutive
        # occurrences, so that we don't count stuff like "..." or "!?"
        # as multiple sentences. =()= is the goatse operator used to
        # assign values coming from the right to an anonymous array and
        # then getting the number of elements of that array from the left.
        my $sentences =()= $buffer =~ m/[\.\!\?]+/g;
        print "Sentences: $sentences\n";
    
        # Print the most used words if the user called
        # the program with the -f switch at command line
        if ($f) {
            print "Most used words:\n";
            # reverse sort the hash of words by value, which is the number of occurrences
            foreach (reverse sort { $hash{$a} <=> $hash{$b} } keys %hash) {
                print "$_\t$hash{$_}\n";
                # break out of the foreach loop if decrementing
                # the value of $f makes it zero
                last unless --$f;
            }
        }
    }
    
    # Check type of invocation
    if (@ARGV) {
        # There were some command line arguments other than the optional -f
        # So we use perl's diamond operator to get the content of the files
        # pointed to by the given filenames and buffer them.
        while (<>) {
            $buffer .= $_;
        }
        process $buffer;
    } elsif (! -t STDIN) {
        # Looks like STDIN is connected to something, let's buffer up that input
        while (<STDIN>) {
            $buffer .= $_;
        }
        process $buffer;
    } else {
        # Incorrect invocation - print help
        print "Usage: $0 [-f[={number}]] file.txt\n";
        print "or:    echo \"text\" | $0 [-f={number}]\n\n";
        print "OPTIONS: -f          : print the most used word\n";
        print "         -f={number} : show the list of {number} most used words\n";
    }
    Sample executions:

    Code:
    roccivic@roccivic-pc:~/Desktop$ ./wa.pl
    Usage: ./wa.pl [-f[={number}]] file.txt
    or:    echo "text" | ./wa.pl [-f={number}]
    
    OPTIONS: -f          : print the most used word
             -f={number} : show the list of {number} most used words
    Code:
    roccivic@roccivic-pc:~/Desktop$ ./wa.pl essay.txt
    Words: 468
    Unique Words: 223
    Sentences: 38
    Code:
    roccivic@roccivic-pc:~/Desktop$ ./wa.pl -f essay.txt
    Words: 468
    Unique Words: 223
    Sentences: 38
    Most used words:
    cats	25
    Code:
    roccivic@roccivic-pc:~/Desktop$ ./wa.pl -f=4 essay.txt
    Words: 468
    Unique Words: 223
    Sentences: 38
    Most used words:
    cats	25
    the	20
    a	15
    are	13
    [EDIT]: fixed bug where the total count of words was incorrectly computed
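
For comparison, the two main ideas in the Perl entry (treating a run like "..." or "!?" as one sentence end, and listing words by descending frequency) could be sketched in Python roughly as follows. This is an illustration and not a port of the script: the function names are made up, and `limit` plays the role of the `-f={number}` switch.

```python
import re
from collections import Counter


def word_frequencies(text, limit=None):
    """Return (word, count) pairs, most frequent first.

    Uses the same word pattern as the Perl entry, on lower-cased text.
    """
    words = re.findall(r"\w+'?-?\w*", text.lower())
    return Counter(words).most_common(limit)


def count_sentences(text):
    """Count runs of '.', '!' or '?', so '...' and '!?' count once each."""
    return len(re.findall(r"[.!?]+", text))


sample = "Cats nap... Cats eat!? Dogs bark."
print(word_frequencies(sample, limit=1))  # [('cats', 2)]
print(count_sentences(sample))            # 3
```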
    Last edited by roccivic; December 23rd, 2011 at 08:28 PM. Reason: switched from PHP to CODE tags with custom formatting, as regexes were not displayed correctly

  10. #10
    Join Date
    Jun 2007
    Location
    Porirua, New Zealand
    Beans
    Hidden!
    Distro
    Ubuntu

    Re: Beginners programming challenge #24

    Quote: Originally posted by sh228
    For the purposes of this challenge, I'll count it as two sentences.
    @schauerlich If your earlier entries differ, don't worry about it. The whole point is to get people programming, and not to worry about insignificant discrepancies between the entries.
    Exactly why I didn't go into more details with my comment earlier.
