How I make my tags

For everything I post here, I try to put relevant keywords in the tags field.  I get them from naive arithmetic frequency analysis using a Perl script that I run each article through prior to posting:

#!/usr/bin/perl
use strict;
use warnings;

##############################################################################
#                                                                            #
# This program is intended to take a plain text document filename as an      #
# argument, read the file into memory, and perform linear-weight word        #
# frequency analysys.  As an option, it will accept a comma-delimited list   #
# of domain-specific keywords that will weigh higher on a linear scale if    #
# found in the document.                                                     #
#                                                                            #
##############################################################################

#                             Perl, 80 columns                               #

##
# NOTES:
# I will use a linear ratio of 10:1 for text:keywords for this iteration of  #
# the build.  Files under 10 words long will have only one keyword.          #
##

# options
my $debug            = 0;    # 1=enable debug messages, 0=disable debug messages
my $DEBUG_OPTION     = "-D"; # Command-line switch to enable debugging
my $STOPWORDS_OPTION = "-w"; # Command-line argument to specift stopwords file
my $DEFAULT_STOPWORDS = "/usr/local/etc/stopwords";

# default stopwords filename

# blibals
my $text;                    # the plain text of the document
my $numkeys;                 # the number of keywords to extract
my $textlen;                 # the number of words in the text
my $inputfile;               # the name of the input file
my $stopwords_file = $DEFAULT_STOPWORDS;

# the name of the file containing the stopwords
my %table;                   # the keywords->counter table
my @words;                   # the keywords array

sub handle_args {

    # filename argument is always the last one in the array
    my $i;
    my $arg;    # since a modifiable @ARGV isn't a given, we need a temp
    my $numargs = $#ARGV + 1;
    die "wrong number of arguments" if $numargs < 1;

    $inputfile = pop;
    $numargs--;
    print "input file: $inputfile\n" if $debug;
    die "empty input file"           if length $inputfile < 1;    # sanity check

    # now iterate through argument vector to parse the rest
    for ( $i = 0 ; $i < $numargs ; $i++ ) {
        $arg = $ARGV[$i];
        chomp $arg;

        if ( $arg eq $DEBUG_OPTION ) {
            $debug = 1;
            next;
        }

        if ( $arg eq $STOPWORDS_OPTION ) {

            # roll forward and check for last arg
            die "too few arguments" if ++$i >= $numargs;

            # get teh stopwords file
            $stopwords_file = $ARGV[$i];
            next;
        }

        # this is a fallthrough for un-handled args
        die "I don't know what $arg means.";
    }

}

sub slurp {
    my $document_text;    # the contents of the text file

    # argument is the file name
    my $numargs = $#_ + 1;
    print "We have $numargs arguments.\n" if $debug;
    die "Wrong number of arguments." unless $numargs == 1;

    # read the file into a bugger
    local $/ = undef;     # unset the input delimiter
    open( DOC, $_[0] ) or die $!;
    $document_text = <DOC>;
    close DOC;

    if ($debug) {
        my $bytes = length $document_text;
        print "Slurped 0x";
        print hex $bytes;
        print " bytes into buffer.\n";

        print $document_text . "\n";
    }

    $document_text;

}

sub debone {
    my $numargs = $#_ + 1;
    die "wrong number of arguments" unless $numargs == 1;
    my $document_text = pop;
    my $stopword;
    my @stopwords;

    #DONE implement command-line option for stopwords file
    open( STOPWORDS, $stopwords_file ) or die $!;
    @stopwords = <STOPWORDS>;
    close STOPWORDS;
    chomp $_ foreach @stopwords;

    foreach $stopword (@stopwords) {
        $document_text =~ s/\b$stopword\b//ig;
    }

    $document_text =~ s/\W/ /g;
    $document_text =~ s/\s+/ /g;

    $document_text;
}

sub pack {

    # takes a single argument - the document text
    my $numargs = $#_ + 1;
    die "wrong number of arguments" unless $numargs == 1;
    my $document_text = pop;
    my @wordlist;
    my @buffer;
    my %grapes;

    @buffer = split(/ /, $document_text );
    foreach (@buffer) {
        push( @wordlist, $_ ) if length $_ > 1;
    }

    @grapes{@wordlist} = (0);
    @wordlist = sort keys %grapes;
}

sub freq {

    # takes two arguments - the keywords list and the document text
    my $numargs = $#_ + 1;
    die "wrong number of arguments" unless $numargs > 1;
    my $document_text = pop;
    my $keyword;
    $textlen = 0;
    my @wordlist = @_;
    my %grapes;

    for ( split(/ /, $document_text ) ) {
        $textlen++;
    }

    foreach $keyword (@wordlist) {
        my $count = 0;
        ++$count while $document_text =~ m/\b$keyword\b/g;
        $grapes{$keyword} = $count / $textlen;
    }

    %grapes;
}

# scaffold
if ($debug) {
    my $numargs = $#ARGV + 1;
    print "/usr/bin/perl got $numargs arguments from the shell.\n";
}

&handle_args(@ARGV);
$text  = &slurp($inputfile);
$text  = &debone($text);
@words = &pack($text);
%table = &freq( @words, $text );

#print $text . "\n" if $debug;
#print $_ . "\n" foreach (@words) if $debug;
my $key;
my $maxkeys = $textlen / 100;
my $count   = 0;
foreach $key ( sort { $table{$b} <=> $table{$a} } keys %table ) {
    if ($debug) {
        print "$key: $table{$key}\n";
    }

    else {
        print "$key";
    }

    $count++;
    if ( $count > $maxkeys ) {
        last;
    }

    else {
        print ',';
    }
}

print "\n" unless $debug;

The tags for this article were generated by feeding the program a copy of itself. 🙂

Advertisements

, , , , ,

  1. #1 by Chadwick on December 22, 2010 - 3:19 PM

    What, and it didn’t spit out the “recursive” tag? Some program…

    • #2 by Joshua on December 22, 2010 - 3:26 PM

      I had been working on a better algebraic frequency analysis algorithm that used word-relational relevance patterns (like if “red” appears within two words of “apple” multiple times within the article, they’d get higher scores). But that fell by the wayside when I picked up the freelance Degree Scheduler gig that I’m tracking in the “Development” post.

    • #3 by Andrew D. on December 22, 2010 - 10:57 PM

      It’s not recursive, it’s iterative. This script fills a buffer, reads the buffer, fills a buffer, reads a buffer… there’s no serious recursion here. Sorry.

      • #4 by Joshua on December 23, 2010 - 7:19 AM

        I think he’s meaning that the program should recognize when it’s being fed a copy of itself as input. In fact, I think I’m going to implement that just to please Chad.

        • #5 by Chadwick on December 23, 2010 - 9:40 AM

          That was, in fact, my intent.

      • #6 by Joshua on December 23, 2010 - 7:21 AM

        And, in my case, it fills a “bugger” and reads it. And declares “blibal” variables. I purposely leave typos in comments. “Bugger” is supposed to be “buffer” and “blibal” is supposed to be “global.”

      • #7 by Joshua on December 23, 2010 - 7:46 AM

        I knew the Dorney knew C++ but I didn’t know he knew Perl and Python. I don’t know C++ well. I know x86 Assembly, C, Java, Perl, Python, Bourne shell, IBM 370 JCL, and Ruby and COBOL from cookbook. I’d like to learn LISP but can’t think of a use for it.

        • #8 by Joshua on December 23, 2010 - 7:47 AM

          And I used to know assembly for SPARC, IBM/370, and AS/400 but it’s been years since I used them.

        • #9 by Joshua on December 23, 2010 - 7:48 AM

          Oh and PHP. I keep forgetting that one even though it’s arguable the one I use the most. And the usual complement of SQL.

        • #10 by Andrew D. on December 23, 2010 - 8:37 AM

          Yep. I’m highly fluent in Bash and various markup languages (ReStructuredText comes to mind), moderately fluent in C++, Python, and PHP, and barely fluent in C#/.NET.

          I use Bash a lot for automating tasks, and the markup languages come from regular use at work and home. My C++ and C# knowledge come from code reviewing at work, as well as a UWM class here or there. Python and PHP are my “I’m bored; let’s go hack something neat together” languages, but I don’t use them outside of hobbyist realms.

          I realized a while back that my brain thinks in strings (which was pretty obvious once I remembered I had a degree in English). Therefore languages which rely heavily on strings and indentation work well with how my head operates. Languages like C++ and MySQL that rely on other types of data structures are simply harder for me to learn. Not impossible, just a much higher barrier to entry.

          • #11 by Joshua on December 23, 2010 - 8:40 AM

            See and my brain tends to operate on trees, hashtables, and linked lists. That’s why things like assembly, C, and LISP appeal to me.

  2. #12 by Joshua on December 22, 2010 - 3:29 PM

    Here’s an interesting question: Since the Chad doesn’t know Java or Perl but does know the basics of procedural programming, which does he think is easier to read? I think Java is much easier to read but harder to see the big picture.

    • #13 by Joshua on December 22, 2010 - 3:32 PM

      Perl syntax was certainly harder to learn. Python was easier than Perl. Ruby syntax continues to baffle me and when I do Ruby (which I do sometimes) it’s entirely by voodoo and cookbook (which is a very bad way to code).

    • #14 by Chadwick on December 22, 2010 - 4:35 PM

      I think you’re about right, there. Certainly Java looks a lot cleaner.

      • #15 by Joshua on December 22, 2010 - 4:43 PM

        Yeah. Totally. Perl’s sigils (those little $, @, and %) are sort of notorious for lending a plate-of-hash quality to the code.

        • #16 by Chadwick on December 22, 2010 - 4:44 PM

          Yeah.

        • #17 by Joshua on December 22, 2010 - 4:49 PM

          Python is basically Perl but with a much cleaner and easier to read syntax.

          • #18 by Andrew D. on December 22, 2010 - 10:55 PM

            I ❤ Python.

          • #19 by Joshua on December 23, 2010 - 7:18 AM

            I’m not a fan of Python. It wouldn’t be so bad if it (say it with me now) used block delimiters other than whitespace. Reminds me wayyyy too much of COBOL.

          • #20 by Joshua on December 23, 2010 - 7:51 AM

            It also means that you can’t implement a perfect quine in Python since it can’t be rewritten entirely on a single line.

  3. #21 by Joshua on December 23, 2010 - 7:39 AM

    I just noticed a few lines that could be better optimized:

    119     chomp $_ foreach @stopwords;  
    
    120    
    
    121     foreach $stopword (@stopwords) {  
    
    122         $document_text =~ s/\b$stopword\b//ig;  
    
    123     }  
    

    Could be better optimized to:

    foreach $stopword (@stopwords) {
         chomp;
         $document_text =~ s/\b$stopword\b//ig;
    }
    
    • #22 by Joshua on December 23, 2010 - 7:40 AM

      Not just in lines, but the current implementation has the script walking the stopwords vector twice: once to chomp and once to prune the document text.

  1. Perl scripts are a sysadmin’s best friend « Around Teh Table

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: