
Compression of Text With Perl - Part 1




In this article I show how to use Perl to extract keywords from a text file, build an index of those keywords, and reassemble the text in a simplified form. Each distinct word is stored once in the index, and the text itself becomes a list of index numbers, which gives you compression as well as control over the stored data. I used The Snow Queen with the simplified character set of the Odyssey 2 in mind, but the technique also applies to searching and extracting data. Because only word characters and a fixed set of punctuation placeholders survive, non-word characters are strictly controlled, which is useful for security reasons when accepting form text. For more information on my Odyssey 2 projects, see this page. Here is the script:

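# %found maps each upper-cased word to its index number;
# %index maps each index number back to its word.
%found = ();
%index = ();
$i = 0;
# Pass 1: build the index file, one distinct word per line.
open (SQ, "< snowqueen") or die "Cannot read snowqueen: $!";
open (SQI, "> snowqueeni") or die "Cannot write snowqueeni: $!";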
while (<SQ>){
if (/^$/ ) {
s/^/ endofparagraph/g;
}
s/\./ endofsentence/g;
s/--/ doubledash /g;
s/-/ singledash /g;
s/\"/ doublequote /g;
s/\,/ commainsentence/g;
s/:/ coloninsentence/g;
s/\?/ questioninsentence/g;
s/!/ banginsentence/g;
while ( /(\w['\w-]*)/g ){
if (!(exists $found{uc $1})){
$found{uc $1}=$i;
print SQI (uc $1),"\n";
$i++;
}
}
}
close (SQ);
close (SQI);
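# Pass 2: write the index number of every word, one per line.
open (SQC, "> snowqueenc") or die "Cannot write snowqueenc: $!";
open (SQ, "< snowqueen") or die "Cannot read snowqueen: $!";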
while (<SQ>){
if (/^$/ ) {
s/^/ endofparagraph/g;
}
s/\./ endofsentence/g;
s/--/ doubledash /g;
s/-/ singledash /g;
s/\"/ doublequote /g;
s/\,/ commainsentence/g;
s/:/ coloninsentence/g;
s/\?/ questioninsentence/g;
s/!/ banginsentence/g;
while ( /(\w['\w-]*)/g ){
print SQC $found{uc $1},"\n";
}
}
close SQC;
close SQ;
# Read the index back into %index, keyed by line number.
open (SQI, "< snowqueeni") or die "Cannot read snowqueeni: $!";
$i=0;
while (<SQI>){
chomp;
$index{$i}=$_;
$i++;
}
close SQI;
# Render the text. Placeholder words become punctuation again;
# a comma prints as "/" and a double quote as ">".
%punct = (
"COMMAINSENTENCE" => "/",
"COLONINSENTENCE" => ":",
"QUESTIONINSENTENCE" => "?",
"BANGINSENTENCE" => "!",
"ENDOFSENTENCE" => ".",
"DOUBLEQUOTE" => ">",
"DOUBLEDASH" => "--",
"SINGLEDASH" => "-",
"ENDOFPARAGRAPH" => "\n\n",
);
open (SQC, "< snowqueenc") or die "Cannot read snowqueenc: $!";
while (<SQC>){
chomp;
$word = $index{$_};
# Ordinary words are printed with a leading space.
if (exists $punct{$word}) { print $punct{$word}; }
else { print " ".$word; }
}
close SQC;
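
The three phases of the script (tokenize, index, render) can also be sketched on an in-memory string, which makes it easier to experiment without the snowqueen files. This is only a minimal sketch under a few assumptions: the helper names tokenize, compress, and render are hypothetical, the index and the codes are built in a single pass rather than two, and every placeholder is padded with a space on both sides.

```perl
use strict;
use warnings;

# Punctuation and its placeholder word, in the order the script applies
# them ("--" has to be handled before "-").
my @subs = (
    [ qr/\./, 'endofsentence'      ],
    [ qr/--/, 'doubledash'         ],
    [ qr/-/,  'singledash'         ],
    [ qr/"/,  'doublequote'        ],
    [ qr/,/,  'commainsentence'    ],
    [ qr/:/,  'coloninsentence'    ],
    [ qr/\?/, 'questioninsentence' ],
    [ qr/!/,  'banginsentence'     ],
);

# Characters printed for each placeholder, copied from the rendering step.
my %glyph = (
    COMMAINSENTENCE => '/',  COLONINSENTENCE => ':',
    QUESTIONINSENTENCE => '?', BANGINSENTENCE => '!',
    ENDOFSENTENCE => '.',    DOUBLEQUOTE => '>',
    DOUBLEDASH => '--',      SINGLEDASH => '-',
    ENDOFPARAGRAPH => "\n\n",
);

# tokenize(): hypothetical helper, turns one line into upper-case tokens.
sub tokenize {
    my ($line) = @_;
    $line =~ s/^/ endofparagraph/ if $line =~ /^$/;   # blank line
    for my $s (@subs) {
        my ($re, $word) = @$s;
        $line =~ s/$re/ $word /g;
    }
    return map { uc $_ } $line =~ /(\w['\w-]*)/g;
}

# compress(): returns (\@index, \@codes) for a list of lines.
sub compress {
    my @lines = @_;
    my (%found, @index, @codes);
    for my $line (@lines) {
        for my $tok (tokenize($line)) {
            if (!exists $found{$tok}) {
                $found{$tok} = scalar @index;   # next free index number
                push @index, $tok;
            }
            push @codes, $found{$tok};
        }
    }
    return (\@index, \@codes);
}

# render(): rebuilds the simplified text from the index and the codes.
sub render {
    my ($index, $codes) = @_;
    my $out = '';
    for my $code (@$codes) {
        my $word = $index->[$code];
        $out .= exists $glyph{$word} ? $glyph{$word} : " $word";
    }
    return $out;
}

my ($index, $codes) = compress(qq{"Oh, how long I have stayed!" said the little girl.\n});
print render($index, $codes), "\n";
# prints: > OH/ HOW LONG I HAVE STAYED!> SAID THE LITTLE GIRL.
```

Keeping the substitutions in one list means the indexing and encoding steps cannot drift apart, which is the main risk of repeating the same block twice.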

Here is what the compressed file looks like:

u-1@srv-1 sq $ head snowqueenc -n 20
0
1
2
3
4
5
6
7
8
9
10
11
12
9
0
13
3
14
15
16

Here is what the index file looks like:

u-1@srv-1 sq $ head snowqueeni -n 20
THE
SNOW
QUEEN
ENDOFPARAGRAPH
FIRST
STORY
ENDOFSENTENCE
WHICH
TREATS
OF
A
MIRROR
AND
SPLINTERS
NOW
THEN
COMMAINSENTENCE
LET
US
BEGIN
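
The compressed file stores each index number as decimal text followed by a newline, so how much space it saves depends on the text. One possible refinement, which is not part of the script above, is to pack each number into two bytes with Perl's pack function. The sketch below uses the twenty codes from the listing and assumes there are fewer than 65,536 distinct tokens, so that every code fits in 16 bits.

```perl
use strict;
use warnings;

# The first 20 codes shown in the snowqueenc listing.
my @codes = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 9, 0, 13, 3, 14, 15, 16);

# Size as the script writes them: decimal digits plus one newline each.
my $text_bytes = 0;
$text_bytes += length($_) + 1 for @codes;

# Assumption: every code fits in 16 bits (fewer than 65,536 distinct tokens).
# 'n*' packs each value as an unsigned 16-bit big-endian integer.
my $packed = pack 'n*', @codes;

# Unpacking restores the same list of codes.
my @restored = unpack 'n*', $packed;

printf "as text: %d bytes, packed: %d bytes\n", $text_bytes, length $packed;
# prints: as text: 47 bytes, packed: 40 bytes
```

For these small numbers the difference is minor. It grows once the index numbers reach three or more digits, because the packed form stays at two bytes per word.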

Here is a section of text from the original story:

"Oh, how long I have stayed!" said the little girl. "I intended to look for
Kay! Don't you know where he is?" she asked of the roses. "Do you think he is
dead and gone?"
"Dead he certainly is not," said the Roses. "We have been in the earth where
all the dead are, but Kay was not there."
"Many thanks!" said little Gerda; and she went to the other flowers, looked
into their cups, and asked, "Don't you know where little Kay is?"
But every flower stood in the sunshine, and dreamed its own fairy tale or its
own story: and they all told her very many things, but not one knew anything

Here is the same section as rendered by the last part of the script:

> OH/ HOW LONG I HAVE STAYED!> SAID THE LITTLE GIRL.> I INTENDED TO
LOOK FOR KAY! DON'T YOU KNOW WHERE HE IS?> SHE ASKED OF THE ROSES.> DO
YOU THINK HE IS DEAD AND GONE?>
> DEAD HE CERTAINLY IS NOT/> SAID THE ROSES.> WE HAVE BEEN IN THE EARTH
WHERE ALL THE DEAD ARE/ BUT KAY WAS NOT THERE.>
> MANY THANKS!> SAID LITTLE GERDA AND SHE WENT TO THE OTHER FLOWERS/
LOOKED INTO THEIR CUPS/ AND ASKED/> DON'T YOU KNOW WHERE LITTLE KAY IS?>
BUT EVERY FLOWER STOOD IN THE SUNSHINE/ AND DREAMED ITS OWN FAIRY TALE OR ITS
OWN STORY: AND THEY ALL TOLD HER VERY MANY THINGS/ BUT NOT ONE KNEW ANYTHING

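The characters on the Odyssey 2 are limited, so I've had to do some interesting things with punctuation: in the rendered text a comma appears as "/" and a double quote as ">", and punctuation the script has no placeholder for, such as the semicolon after "Gerda", is dropped. In this article I used a different color for quoted strings and rendered The Snow Queen.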
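
Since punctuation without a placeholder disappears silently, it can help to check a text before compressing it. The following is a rough sketch, not part of the original script: unhandled_chars is a hypothetical helper that counts the characters the script would drop. As a simplification it treats every apostrophe as kept, although the script only keeps apostrophes that appear inside a word.

```perl
use strict;
use warnings;

# Characters the script turns into placeholder words.
my %handled = map { $_ => 1 } ('.', '-', '"', ',', ':', '?', '!');

# unhandled_chars(): hypothetical helper that counts characters which are
# neither whitespace, word characters, apostrophes, nor handled punctuation.
sub unhandled_chars {
    my ($text) = @_;
    my %count;
    for my $ch ($text =~ /([^\w\s'])/g) {
        $count{$ch}++ unless $handled{$ch};
    }
    return %count;
}

my %lost = unhandled_chars(qq{"Many thanks!" said little Gerda; and she went});
print "'$_' appears $lost{$_} time(s) and would be dropped\n" for sort keys %lost;
# prints: ';' appears 1 time(s) and would be dropped
```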



This article comes from NetAdminTools:
http://www.netadmintools.com/

The URL for this story is:
http://www.netadmintools.com/art350.html

Copyright 1997-2008 NetAdminTools.com.