OK, now I need to scale it up a little. This part is a small part of a bigger problem - full details here: http://acookson.org/sms-voting-campaign/
My elementary script now loops through a file and picks out selected field values with the use of the substr construct:
Code:
#!/usr/bin/perl
use warnings;
use strict;
use DBI();
# Connect to the database.
my $dbh = DBI->connect("DBI:mysql:database=voting_campaigns;host=localhost",
"cookie", "*********",
{'RaiseError' => 1});
open (LOGFILE, '/var/www/voting/votes.txt') or die("Could not open log file.");
foreach my $line (<LOGFILE>) {
my $vote;
my $epoch;
my $campaign;
my $validity;
my $choice;
my $CONN;
my $MSISDN;
my $GUID;
my $Shortcode;
($vote, $epoch, $campaign, $validity, $choice, $CONN, $MSISDN, $GUID, $Shortcode) = split(' ', $line);
# parse the field:value entries...
$campaign = substr $campaign, 9, 11, ''; # ssss_uk_01C
$validity = substr $validity, 9, 6, ''; # during
$choice = substr $choice, 7, 10, ''; # Brown
my $sth = $dbh->prepare("INSERT INTO voting (vote, epoch, campaign, validity, choice,
CONN, MSISDN, GUID, Shortcode) VALUES (?,?,?,?,?,?,?,?,?)");
$sth->execute($vote, $epoch, $campaign, $validity, $choice, $CONN, $MSISDN, $GUID, $Shortcode);
# debug
print "$campaign $validity $choice \n"; # ssss_uk_01B during Green
}
close (LOGFILE);
All good so far, the script executes successfully. Now I need to focus on:
Any lines that are not well-formed should be discarded. Your script should assume the data is in uncompressed plain text.
A brief explanation of what we would need to consider if we wanted to import a large number of similar files (millions of votes)
- ‘Choice:’ indicates the candidate the user is voting for. In every campaign there will be a limited set of candidates that can be voted for. If Choice is blank, it means the system could not identify the chosen candidate from the text of the SMS message.
All such messages should be counted together as ‘errors’, irrespective of their Validity values. There is a limited set of values for ‘Choice’, each of which represents a candidate who can be voted for.
I'm thinking a regular expression is the way forward:
Code:
my $MSISDN;
my $GUID;
my $Shortcode;
# Skip lines that have empty field values...
next if !~ /^VOTE\sCampaign:\w{11,}\sValidity:\w{6,}\sChoice:\w{2,30}\s $/;
# Keep count of errors or malformed lines
my $errors++;
($vote, $epoch, $campaign, $validity, $choice, $CONN, $MSISDN, $GUID, $Shortcode) = split(' ', $line);
# parse the field:value entries...
$campaign = substr $campaign, 9, 11, ''; # ssss_uk_01C
I've not ran it yet. How bad is it? I have my doubts over the
Code:
next if
my $errors++;
construct, It's not going to fly is it? I'd expect $errors to increment regardless of the next statement. A more logical way for me would be:
Code:
my $regex = /^VOTE\sCampaign:\w{11,}\sValidity:\w{6,}\sChoice:\w{2,30}\s $/;
if(!$regex){
# bad line, increment $errors..
$errors++;
# skip to next line
next
}
#continue loop
but I'm not using Perl now, I'm using psuedo code.
Bookmarks