CLIPS Corpus Filtering Toolkit

- Manual Pages -

Brigitte Bigi
21st November 2003

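Filtering programs and scripts:
  • FiltreN.awk
  • ReplaceSpecialChars.awk
  • Tokenize.awk
  • StickWords.awk
  • RemoveWords.awk

Utility programs and scripts:
  • ToLower.awk
  • Text2wfreq.awk

Example:
  • Example of use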



Filtering programs and scripts



NAME

FiltreN.awk

Filtre1.awk - Extracts the years 1987 to 1996
Filtre2.awk - Extracts the year 1998
Filtre3.awk - Extracts the years 2000 to 2002
Filtre4.awk - Extracts the years 1997 and 1999
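
Since each filter covers a different set of years, a driver script has to pick the right one per year. The following is only a sketch of that lookup; the function name filter_for_year is hypothetical and is not part of the toolkit.

```sh
# Hypothetical helper (not part of the toolkit): print the name of the
# filter script documented for a given year of the corpus.
filter_for_year() {
    case "$1" in
        198[7-9]|199[0-6]) echo Filtre1.awk ;;   # 1987 to 1996
        1998)              echo Filtre2.awk ;;
        200[0-2])          echo Filtre3.awk ;;   # 2000 to 2002
        1997|1999)         echo Filtre4.awk ;;
        *) echo "no filter documented for year $1" >&2
           return 1 ;;
    esac
}
```

For example, filter_for_year 1999 prints Filtre4.awk, and a year outside 1987 to 2002 returns a non-zero status.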

DESCRIPTION

Data filtering programs for the "Le Monde" newspaper corpus (corpus provided by ELRA). Each program includes the following tags in its output:


NAME

ReplaceSpecialChars.awk

DESCRIPTION

ReplaceSpecialChars replaces some characters and adds sentence-start and sentence-end markers (<s> and </s> respectively). This script replaces:
  • "%" by "pour_cent",
  • "," by "virgule" in numbers,
  • "°c" by "degré(s) celcius",
and other special characters like CTRL+...
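
The two simplest replacements and the sentence markers can be sketched as follows. This is not the actual ReplaceSpecialChars.awk: the function name replace_special_chars is hypothetical, the sketch assumes one sentence per input line, and it leaves out the "°c" and control-character rules.

```sh
# Sketch only: assumes one sentence per input line.
replace_special_chars() {
    awk '
    {
        # "12,5" -> "12 virgule 5": only a comma between two digits is rewritten.
        while (match($0, /[0-9],[0-9]/))
            $0 = substr($0, 1, RSTART) " virgule " substr($0, RSTART + 2)
        # "%" -> "pour_cent", separated from the preceding number.
        gsub(/%/, " pour_cent")
        # Collapse repeated spaces and trim both ends.
        gsub(/  +/, " ")
        sub(/^ /, ""); sub(/ $/, "")
        # Add the sentence-start and sentence-end markers.
        print "<s> " $0 " </s>"
    }'
}
```

A comma followed by a space, as in ordinary punctuation, is left untouched because the pattern requires a digit on both sides.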



NAME

Tokenize.awk

DESCRIPTION

Tokenizes text using a vocabulary. The vocabulary file is passed with -v fichier_vocab=FILE, as in the example of use.
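
The manual does not describe the tokenization rules, so the following is one possible reading rather than the toolkit's algorithm. It assumes a vocabulary file with one entry per line: a token found in the vocabulary is kept whole, and any other token has its punctuation marks separated by spaces. The function name tokenize is hypothetical.

```sh
# Sketch only. Usage: tokenize VOCAB_FILE < text
tokenize() {
    awk -v fichier_vocab="$1" '
    BEGIN {
        # Load the vocabulary, assumed to hold one entry per line.
        while ((getline w < fichier_vocab) > 0) vocab[w] = 1
        close(fichier_vocab)
    }
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            tok = $i
            # Unknown token: put spaces around punctuation marks.
            if (!(tok in vocab))
                gsub(/[.,;:!?()"]/, " & ", tok)
            out = out " " tok
        }
        # Collapse repeated spaces and trim both ends.
        gsub(/  +/, " ", out)
        sub(/^ /, "", out); sub(/ $/, "", out)
        print out
    }'
}
```

With a vocabulary containing "M.", the abbreviation keeps its full stop while the final full stop of a sentence becomes a separate token.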



NAME

StickWords.awk

DESCRIPTION

Joins words of a corpus with the _ character. This program can be used to introduce phrases; the phrases to join are listed in a vocabulary file passed with -v fichier_vocab=FILE.
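
A minimal sketch of the joining step is given below. The vocabulary format is an assumption: one phrase per line, already written with "_" (for example "pomme_de_terre"). The function name stick_words is hypothetical and the real StickWords.awk may work differently.

```sh
# Sketch only. Usage: stick_words VOCAB_FILE < text
stick_words() {
    awk -v fichier_vocab="$1" '
    BEGIN {
        # Keep only entries containing "_" and derive their spaced form.
        while ((getline w < fichier_vocab) > 0)
            if (index(w, "_") > 0) {
                n++
                joined[n] = w
                spaced[n] = w
                gsub(/_/, " ", spaced[n])
            }
        close(fichier_vocab)
    }
    {
        # Pad with spaces so that phrases only match on word boundaries.
        line = " " $0 " "
        for (k = 1; k <= n; k++) {
            pat = " " spaced[k] " "
            while ((p = index(line, pat)) > 0)
                line = substr(line, 1, p) joined[k] substr(line, p + length(pat) - 1)
        }
        # Remove the padding added above.
        print substr(line, 2, length(line) - 2)
    }'
}
```

Plain string search (index) is used instead of a regular expression so that characters in the vocabulary are never interpreted as regex operators.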



NAME

RemoveWords.awk

DESCRIPTION

Removes words from a corpus. This program can be used to remove punctuation, for example; the words to remove are listed in a vocabulary file passed with -v fichier_vocab=FILE.
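
The removal step can be sketched as a simple lookup. The sketch assumes that the vocabulary file lists one word to remove per line and that the text is already tokenized, so punctuation marks are separate tokens. The function name remove_words is hypothetical.

```sh
# Sketch only. Usage: remove_words VOCAB_FILE < text
remove_words() {
    awk -v fichier_vocab="$1" '
    BEGIN {
        # Load the words to remove, assumed to be one per line.
        while ((getline w < fichier_vocab) > 0) drop[w] = 1
        close(fichier_vocab)
    }
    {
        # Rebuild the line from the tokens that are not in the list.
        out = ""
        for (i = 1; i <= NF; i++)
            if (!($i in drop))
                out = (out == "" ? $i : out " " $i)
        print out
    }'
}
```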



Utility programs and scripts


NAME

ToLower.awk

DESCRIPTION

ToLower converts a corpus to lower case, except for the <ART> and </ART> tags and the topics.
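
A rough sketch of the case conversion follows. It protects any token of the form <...>, which covers <ART> and </ART>; it does not try to detect topics, because their format is not described here. The function name to_lower is hypothetical. Note that tolower() may leave accented letters unchanged, depending on the awk implementation and the locale.

```sh
# Sketch only: lower-case every token except those that look like tags.
to_lower() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i !~ /^<.*>$/) $i = tolower($i)
        # Assigning to a field rebuilds the line with single spaces.
        print
    }'
}
```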


NAME

Text2wfreq.awk

DESCRIPTION

Text2wfreq estimates the word frequencies of a corpus (words are split at spaces). The output is a two-column list of words ranked by frequency.
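
The counting can be sketched as below. The column order (word first, count second) and the alphabetical tie-break are assumptions, and awk's default field splitting treats tabs as separators as well as spaces. The function name text2wfreq is hypothetical.

```sh
# Sketch only: count tokens, then rank them by decreasing frequency.
text2wfreq() {
    awk '
    { for (i = 1; i <= NF; i++) count[$i]++ }
    END { for (w in count) print w, count[w] }' |
    # Sort on the count (numeric, descending), then on the word.
    LC_ALL=C sort -k2,2nr -k1,1
}
```

The sort is done outside awk because POSIX awk does not define the order in which "for (w in count)" visits the array.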



Example of use


#!/bin/csh
 
echo "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
echo
echo "          L E   M O N D E   1987 - 2002   "
echo
echo "          distributed by ELRA for ESTER"
echo
echo "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
 
 
echo "Data filtering"
echo "--------------"
 
echo -n "year 1987 : "
foreach i (`find ../Originaux/1987 -type f`)
        cat $i | Filtre1.awk | ReplaceSpecialChars.awk | Tokenize.awk -v fichier_vocab=Lexique.vocab | Num2Text.awk > `basename $i`.tok
end
echo [ OK ]



echo "Data formatting"
echo "---------------"

foreach i (`ls *.tok`)
    cat $i | StickWords.awk -v fichier_vocab=MotsComposes.vocab | ToLower.awk | RemoveWords.awk -v fichier_vocab=Ponctuation.vocab > `basename $i`.txt
end

echo
echo "That's all folks !"
echo