Iterative Development

UPDATE 1903 hours CDT:

I’ve polished it up a bit and run perltidy:

#!/usr/bin/perl
use strict;
use warnings;

##############################################################################
#                                                                            #
# This program is intended to take a plain text document filename as an      #
# argument, read the file into memory, and perform linear-weight word        #
# frequency analysys.  As an option, it will accept a comma-delimited list   #
# of domain-specific keywords that will weigh higher on a linear scale if    #
# found in the document.                                                     #
#                                                                            #
##############################################################################

#                             Perl, 80 columns                               #

##
# NOTES:
# I will use a linear ratio of 10:1 for text:keywords for this iteration of  #
# the build.  Files under 10 words long will have only one keyword.          #
##

# options
my $debug            = 0;    # 1=enable debug messages, 0=disable debug messages
my $DEBUG_OPTION     = "-D"; # Command-line switch to enable debugging
my $STOPWORDS_OPTION = "-w"; # Command-line argument to specift stopwords file
my $DEFAULT_STOPWORDS = "/usr/local/etc/stopwords";

# default stopwords filename

# blibals
my $text;                    # the plain text of the document
my $numkeys;                 # the number of keywords to extract
my $textlen;                 # the number of words in the text
my $inputfile;               # the name of the input file
my $stopwords_file = $DEFAULT_STOPWORDS;

# the name of the file containing the stopwords
my %table;                   # the keywords->counter table
my @words;                   # the keywords array

sub handle_args {

    # filename argument is always the last one in the array
    my $i;
    my $arg;    # since a modifiable @ARGV isn't a given, we need a temp
    my $numargs = $#ARGV + 1;
    die "wrong number of arguments" if $numargs < 1;

    $inputfile = pop;
    $numargs--;
    print "input file: $inputfile\n" if $debug;
    die "empty input file"           if length $inputfile < 1;    # sanity check

    # now iterate through argument vector to parse the rest
    for ( $i = 0 ; $i < $numargs ; $i++ ) {
        $arg = $ARGV[$i];
        chomp $arg;

        if ( $arg eq $DEBUG_OPTION ) {
            $debug = 1;
            next;
        }

        if ( $arg eq $STOPWORDS_OPTION ) {

            # roll forward and check for last arg
            die "too few arguments" if ++$i >= $numargs;

            # get teh stopwords file
            $stopwords_file = $ARGV[$i];
            next;
        }

        # this is a fallthrough for un-handled args
        die "I don't know what $arg means.";
    }

}

sub slurp {
    my $document_text;    # the contents of the text file

    # argument is the file name
    my $numargs = $#_ + 1;
    print "We have $numargs arguments.\n" if $debug;
    die "Wrong number of arguments." unless $numargs == 1;

    # read the file into a bugger
    local $/ = undef;     # unset the input delimiter
    open( DOC, $_[0] ) or die $!;
    $document_text = <DOC>;
    close DOC;

    if ($debug) {
        my $bytes = length $document_text;
        print "Slurped 0x";
        print hex $bytes;
        print " bytes into buffer.\n";

        print $document_text . "\n";
    }

    $document_text;

}

sub debone {
    my $numargs = $#_ + 1;
    die "wrong number of arguments" unless $numargs == 1;
    my $document_text = pop;
    my $stopword;
    my @stopwords;

    #DONE implement command-line option for stopwords file
    open( STOPWORDS, $stopwords_file ) or die $!;
    @stopwords = <STOPWORDS>;
    close STOPWORDS;
    chomp $_ foreach @stopwords;

    foreach $stopword (@stopwords) {
        $document_text =~ s/\b$stopword\b//ig;
    }

    $document_text =~ s/\W/ /g;
    $document_text =~ s/\s+/ /g;

    $document_text;
}

sub pack {

    # takes a single argument - the document text
    my $numargs = $#_ + 1;
    die "wrong number of arguments" unless $numargs == 1;
    my $document_text = pop;
    my @wordlist;
    my @buffer;
    my %grapes;

    @buffer = split(/ /, $document_text );
    foreach (@buffer) {
        push( @wordlist, $_ ) if length $_ > 1;
    }

    @grapes{@wordlist} = (0);
    @wordlist = sort keys %grapes;
}

sub freq {

    # takes two arguments - the keywords list and the document text
    my $numargs = $#_ + 1;
    die "wrong number of arguments" unless $numargs > 1;
    my $document_text = pop;
    my $keyword;
    $textlen = 0;
    my @wordlist = @_;
    my %grapes;

    for ( split(/ /, $document_text ) ) {
        $textlen++;
    }

    foreach $keyword (@wordlist) {
        my $count = 0;
        ++$count while $document_text =~ m/\b$keyword\b/g;
        $grapes{$keyword} = $count / $textlen;
    }

    %grapes;
}

# scaffold
if ($debug) {
    my $numargs = $#ARGV + 1;
    print "/usr/bin/perl got $numargs arguments from the shell.\n";
}

&handle_args(@ARGV);
$text  = &slurp($inputfile);
$text  = &debone($text);
@words = &pack($text);
%table = &freq( @words, $text );

#print $text . "\n" if $debug;
#print $_ . "\n" foreach (@words) if $debug;
my $key;
my $maxkeys = $textlen / 100;
my $count   = 0;
foreach $key ( sort { $table{$b} <=> $table{$a} } keys %table ) {
    if ($debug) {
        print "$key: $table{$key}\n";
    }

    else {
        print "$key";
    }

    $count++;
    if ( $count > $maxkeys ) {
        last;
    }

    else {
        print ',';
    }
}

print "\n" unless $debug;

UPDATE 0900 hours CDT:

Iteration 1 is finished.  Here’s the code:

<snipped due to newer code above.>

When I run it against A Suggested Festival, it yields these results:

Horde 755% ./naive.pl /tmp/a-suggested-festival
/usr/bin/perl got 1 arguments from the shell.
We have 1 arguments.
Slurped 0x1633 bytes into buffer.
June: 0.0263157894736842
propose: 0.0175438596491228
get: 0.0175438596491228
house: 0.0175438596491228
Teh: 0.0175438596491228
drunk: 0.0087719298245614
know: 0.0087719298245614
Gathering: 0.0087719298245614
aged: 0.0087719298245614
need: 0.0087719298245614
whatever: 0.0087719298245614
Peoples: 0.0087719298245614

UPDATE 1236 hours CDT:

This is what I’ve got in the 10 minutes I’ve been working on this today:

<snipped due to newer code>

With my terminal colors the way they are, writing Perl reminds me of that pastel colored candy they pass out at Easter time:

Colorful syntax highlighting

Colorful syntax highlighting

Pastel Candy

I’ve finished two of the projects I was working on:  A Google Calendar Sync engine in Java and a GECOS phonebook in Perl.  I now have another project:  A document keyword suggester.

Here are the iterations I had in mind:

  1. A simple textual frequency analysis engine based on linear scale with optional weighting inputs of common domain-specific words.
  2. A more complex frequency analysis engine based on logarithmic scale with full domain weighting.
  3. A Bayes-complete multi-document summary engine with tagging and cross-reference keying.

I was planning on implementing at least the first two in Perl.  The third may require either Enterprise Java or straight C due to the hard math involved in even relatively simple Bayesian learning algorithms (as evidenced by the amount of CPU SpamAssassin uses in Bayes mode).

I know this is reinventing the wheel as there are already a number of these exact type of programs out there but they’re all commercial and all cost lots of money.  I have yet to find a decent open source implementation out there.

Advertisements
  1. #1 by Chadwick on June 8, 2010 - 9:26 AM

    So, I take it that this is a program/process that goes through existing files/pages/documents, and by examining their contents, makes assumptions about the docs to automatically tag them for you (I may be completely off-base here). So where and how are you planning to implement this once it’s complete? And (assuming everything works correctly), does this then notably simplify some part of your (or another person’s) life?

    I always see you working on these projects, and I’m never (well, not never, I can sort out a reasonable chunk of the calendar sync and the phonebook) sure how this stuff gets put to use post-completion.

    • #2 by Joshua on June 8, 2010 - 11:56 AM

      Your assessment is dead on. The first place I plan to use it is to tag all of my tagless posts on this blog. The second is to implement tagging on the Lake Lodge and First Lutheran websites so they get picked up better by facebook and such.

      As far as getting put to use, so far two of them have been put to use: The Google Calendar sync has been alpha tested and is awaiting inspection by the Grand Lodge’s District 12 Officers before being beta tested on the District 12 Google Calendar. The phonebook is already implemented in the Lake Lodge website and is being reviewed by the Lodge to determine if it runs afoul of privacy laws and such.

      The other projects (like the CHadowrun thing) are nice-to-haves. THey are big hairy things and without external support, I’m disinclined to work on them. I know they’d be popular projects if I put them on say, sourceforge or somewhere but I don’t know how much time I could devote to championing them, reviewing code and documentation, and getting the community developers involved.

    • #3 by Joshua on June 8, 2010 - 11:58 AM

      And all of my system admin scripts are ones that I’ve hacked together or gotten from sysadmin manuals and websites like Stackoverflow. Since I find them useful for my day-to-day server admin use, I figure someone else might also.

    • #4 by Joshua on June 8, 2010 - 12:01 PM

      BTW, you’d appreciate the Bayes Theorem. It has infinite uses in low-end pattern recognition systems where a little slop is tolerated. Much more complex systems are needed for things like enterprise resource planning and scientific modeling. Bayesian learning works just fine for spamfiltering, low-level security stuff, and demographic infometrics.

  2. #5 by Joshua on June 8, 2010 - 12:41 PM

    I have comments like “blibals” and “slurp the file into a bugger” because I preserve all unintentional typos in comments. It’s just something I’ve done since ADCOMPROG/C++ in high school. I remember writing a program once and mistyping “menu” as “mune” almost every time in the comments. I actually got marked down on one assignment by refusing to correct the hundreds of cout << lines that I mistyped as cotu << and putting in a #define cotu cout line at the top.

    • #6 by Joshua on June 8, 2010 - 12:44 PM

      I also always type “permissions” as “persimmons” in admin scripts because of this cartoon and the one right after it.

    • #7 by Chadwick on June 8, 2010 - 1:40 PM

      For everything I write on this blog, I make a point to leave every “teh” typo alone, given the title of this place.

      • #8 by Joshua on June 8, 2010 - 2:27 PM

        I’ve noticed. I like it.

      • #9 by Joshua on June 8, 2010 - 2:29 PM

        I particularly like the buffer->bugger one. I make that one a lot. I try not to but it often happens when I’m typing operators using the , and . keys and then go to type “buffer.”

  3. #10 by Joshua on June 8, 2010 - 12:54 PM

    And despite the tile of the terminal window (root@gateway~), I wasn’t really root on gateway – I was imbrius on Horde but CSH doesn’t update the window title like Bash does so it kept the last title that was set when I was copying my firewall settings from gateway.

  4. #11 by Joshua on June 9, 2010 - 9:08 AM

    %grapes is what I name hashes that I can’t think of a name for because % kinda looks like a bunch of grapes.

  5. #12 by Joshua on June 9, 2010 - 9:11 AM

    BTW, the numbers after the keywords are raw density scores – in other words, for each word, the next word has a probability X of being the given word, where X is the density score of the word.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: