Filtering programs and scripts:
Utility programs and scripts:
Example:
Filtering programs and scripts
NAME
FiltreN.awk
Filtre1.awk - Extracts 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996
Filtre2.awk - Extracts 1998
Filtre3.awk - Extracts 2000, 2001, 2002
Filtre4.awk - Extracts 1997, 1999
DESCRIPTION
Data filtering programs for the newspaper corpus "Le
Monde" (corpus provided by ELRA). Each program includes the following
tags in its output:
- <ART> and </ART> for articles
- <TITLE> and </TITLE> for the title of an article (if defined)
- <TOPICS> and </TOPICS> for the topics of an article (if defined)
- <p> and </p> for paragraphs
- <s> and </s> for sentences (only at the beginning and end of paragraphs)
NAME
ReplaceSpecialChars.awk
DESCRIPTION
ReplaceSpecialChars replaces some characters and adds the sentence
start and end markers (<s> and </s>, respectively). This script replaces:
- "%" by "pour_cent",
- "," by "virgule" in numbers,
- "°c" by "degré(s) celcius",
and other special characters like CTRL+...
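The three replacements listed above can be sketched in a few lines of AWK. This is only an illustration, not the actual ReplaceSpecialChars.awk: the function name replace_special_chars, the spacing around the replacement words, and the choice to treat each input line as one sentence are all assumptions. The other special characters (CTRL+...) are not handled here.

```sh
# Hypothetical sketch; the real ReplaceSpecialChars.awk may behave differently.
replace_special_chars() {
  awk '
  {
    # "," between two digits becomes " virgule " (loop handles 1,2,3)
    while (match($0, /[0-9],[0-9]/))
      $0 = substr($0, 1, RSTART) " virgule " substr($0, RSTART + 2)

    gsub(/°c/, " degré(s) celcius")   # replacement string taken from the list above
    gsub(/%/, " pour_cent")
    gsub(/  +/, " ")                  # squeeze runs of spaces left by the edits

    # Assumption: one sentence per input line
    print "<s> " $0 " </s>"
  }'
}
```

Under these assumptions, the function reads text on standard input, for example: printf '%s\n' 'Il fait 20°c' | replace_special_chars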
NAME
Tokenize.awk
DESCRIPTION
Tokenizes text using a vocabulary.
NAME
StickWords.awk
DESCRIPTION
Joins words of a corpus using the _ character. This program can be used to introduce phrases (several words treated as one token).
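As an illustration of joining words with the _ character, the following is a minimal sketch and not the actual StickWords.awk. It assumes a hypothetical phrase file with one space-separated phrase per line and a corpus on standard input; the function names stick_words and replace_all are invented for this example. Phrases are matched only on whole words, and the order in which overlapping phrases are applied is not defined.

```sh
# Hypothetical sketch; usage: stick_words PHRASE_FILE < corpus
stick_words() {
  awk '
  # Replace every literal occurrence of " from " in s, keeping the trailing
  # space so that back-to-back occurrences are both found.
  function replace_all(s, from, to,    out, i) {
    out = ""
    while ((i = index(s, from)) > 0) {
      out = out substr(s, 1, i - 1) to
      s = substr(s, i + length(from) - 1)
    }
    return out s
  }
  NR == FNR {                       # first file: the phrase list
    if ($0 != "") { joined = $0; gsub(/ /, "_", joined); phrase[$0] = joined }
    next
  }
  {                                 # second input: the corpus
    line = " " $0 " "               # pad so phrases match on word boundaries
    for (p in phrase)
      line = replace_all(line, " " p " ", " " phrase[p])
    print substr(line, 2, length(line) - 2)
  }' "$1" -
}
```

With a phrase file containing "pomme de terre", the line "une pomme de terre cuite" would come out as "une pomme_de_terre cuite".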
NAME
RemoveWords.awk
DESCRIPTION
Removes words from a corpus. This program can be used to remove punctuation, etc.
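Word removal of this kind can be sketched as follows. This is not the actual RemoveWords.awk: it assumes a hypothetical word file with one word (or punctuation mark) per line and a corpus on standard input whose tokens are already separated by spaces. The function name remove_words is invented for this example.

```sh
# Hypothetical sketch; usage: remove_words WORD_FILE < corpus
remove_words() {
  awk '
  NR == FNR {                       # first file: words to drop
    if ($1 != "") drop[$1] = 1
    next
  }
  {                                 # second input: the corpus
    out = ""
    for (i = 1; i <= NF; i++)
      if (!($i in drop))
        out = out (out == "" ? "" : " ") $i
    print out
  }' "$1" -
}
```

With a word file containing "," and ".", the line "Bonjour , le monde ." would come out as "Bonjour le monde". Because the sketch compares whole space-separated tokens, punctuation attached to a word ("monde.") is left untouched.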