Thorsten Trippel, Nils Jahn, Dafydd Gibbon (U Bielefeld)
Soma Ouattara (U Cocody, Abidjan)
DOBES Technical Report n' (Ega)
(Status: RFC draft, January 2001 -- printed January 12, 2001)
Specification, Design and Implementation of an Audio Concordance
The Portable Audio Concordance (PAC) is a basic tool required for spoken language lexicography. The present tool is tailored to the needs of training local corpus analysts and lexicographers concerned with the documentation of endangered languages in their own countries using low-end hardware.
The tool is specified as a deliverable in the DOBES project Ega: a documentation model for an endangered Ivorian language.
It is anticipated that this tool will be used in a hybrid lingware development environment together with tools such as Praat or Transcriber and Shoebox. Compatibility with other tools is ensured by providing ASCII interfaces, in particular transcriptions in X-SAMPA and with XML markup, and by training local computer science personnel in automatic text processing techniques in order to interface tools in a hybrid environment.
The educational and local involvement aim has priority: the tool is not intended to be an industry standard implementation, but may of course be used as a specification and proof of concept by other implementers.
We present a requirements specification, a system design specification, a project design specification, an implementation description, and an outline of the planned evaluation, distribution and maintenance policies. Object specifications in a standard format, and source code of a proof-of-concept implementation in Perl are included in the Appendices.
Perl is not an ideal language for software development but was selected because of built-in efficient regular expression processing capability, seamless integration with CGI and other interactive interfaces, and the portability of the code from UNIX/Linux to Windows and Macintosh environments, enabling rapid prototyping and efficient cyclical development.
This is a Request for Comments document. The specification is incomplete and we are aware that a number of errors and inconsistencies remain at the time of in-consortium distribution. Comments in the form of critique, corrections and suggestions for improvement and extension are welcome.
An important workhorse tool for language documentation, in particular for lexicography, is the concordance. There are many varieties of concordance, from the traditional `keyword in context' concordance on paper to electronic hyperconcordances which are statically pre-compiled or dynamically compiled on the fly. To be useful with unwritten languages (and indeed all forms of spoken language), a concordance needs to include audio indexing. For these reasons, the design and proof-of-concept implementation of an audio concordance was included in our proposal Ega: a documentation model for an endangered Ivorian language.
The overall goal of the present activity is to present in relatively informal outline form
for a portable audio concordance for use in lexicography within the framework of the documentation of endangered languages. This does not exclude utility in other contexts, but the present minimal specification is specifically formulated to ensure efficient initial language documentation. By `Portable Audio Concordance' (PAC) we mean concordance shell software which can be used on as many platforms as possible, in particular in an offline laptop environment in the field or with low-end or older hardware.
The specific goals are
We proceed by specifying requirements in terms of first user groups, second application areas, third user needs.
The potential user groups for an audio concordance include:
The main intended application of PAC is in the lexicography of endangered languages. Consequently, the main considerations are ergonomic use by the language documenter rather than other users:
Our implementation perspective leans strongly towards practice in computational linguistics, and involves
The present tool is straightforward and does not involve essentially new concepts; it takes up ideas and experience from the VerbMobil and EAGLES projects and applies them to the specific area of the documentation of endangered languages.
The further technical requirements for PAC are summarised informally as follows:
A specific lingware requirement is that PAC should be interfaced with
Interfaces to standard ASCII based formats for text components of corpora and lexica need to be specified (e.g. widely used UNIX databases, archives with ASCII markup such as HTML, XML, Shoebox files).
By `corpus' we mean a set of related primary language data sources, following EAGLES recommendations (cf. [Gibbon, Moore & Winski 1997]), including the following minimum
The following options may also be included:
Finally, the audio concordance should be as language independent as possible. This means in particular that it should take into account more than one language and must be extensible to other languages, provided that the data are available in some standard format. This requirement applies to the proof-of-concept implementation.
In this design specification we outline both system design and project design.
We define a Concordance System from a declarative point of view as a pair of functions
(1) |
The acquisition function maps a corpus into a concordance consisting of a set of pairs of keyword and keyword-in-context set:
(2) |
where
(3) |
(4) |
(5) |
The acquisition function creates a list of keys from a given -- possibly marked up -- text. This list of keys is to be used as the set of access criteria to the contexts of these keys, i.e. the Key Words In Context (KWIC).
The consultation function maps a pair of keys (often just one) and a corpus into a keyword-in-context concordance:
(6) |
where
(7) |
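The pair of functions can be illustrated with a minimal Perl sketch. It is not the project code: the names `acquire` and `consult`, the representation of a corpus as a list of sentence strings, whitespace tokenisation, and the reading of a pair of keys as "all keys must occur in the context" are assumptions made for this example only.

```perl
use strict;
use warnings;

# Assumption: a corpus is a reference to a list of sentences,
# each sentence a string of whitespace-separated words.

# Acquisition: corpus -> concordance, i.e. a hash that maps each
# keyword to the list of sentences (contexts) in which it occurs.
sub acquire {
    my ($corpus) = @_;
    my %concordance;
    for my $sentence (@$corpus) {
        my %seen;    # list a sentence only once per keyword
        for my $word (split ' ', $sentence) {
            push @{ $concordance{$word} }, $sentence unless $seen{$word}++;
        }
    }
    return \%concordance;
}

# Consultation: (keys, corpus) -> the sentences that contain every key.
sub consult {
    my ($keys, $corpus) = @_;
    my @hits;
    SENTENCE: for my $sentence (@$corpus) {
        my %words = map { $_ => 1 } split ' ', $sentence;
        for my $key (@$keys) {
            next SENTENCE unless $words{$key};
        }
        push @hits, $sentence;
    }
    return \@hits;
}

my $corpus = [ 'this is sentence one', 'this is sentence two' ];
my $conc   = acquire($corpus);
print join("\n", sort keys %$conc), "\n";           # the keyword list
print "$_\n" for @{ consult(['two'], $corpus) };     # contexts of 'two'
```

In this reading, the keys of the hash returned by `acquire` correspond to the keyword list, and `consult` computes the contexts on demand from the corpus.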
The main dependencies involving both functions are illustrated in condensed form in Figure 1.
Two different types of electronic concordance were taken into consideration from a procedural point of view:
We note that there is a logical dependency between static and dynamic concordances: a static concordance is a subset of the output of a dynamic concordance. Consequently, PAC design starts with the dynamic concordance. The modules of the dynamic concordance are shown in Figure 2 and the architecture of a system designed to realise these functions is shown in Figure 3.
The hyperconcordance which is the output of the software should be in a browser-accessible format. A user interface event should result in a list of occurrences of the keyword in context and an audio rendering of the context, including the keyword.
It is intended to use the concordance as a source of contextual lexical information within a lexicon as described by [Adouakou & Schulte 2000]. Further information on a microstructure for a suitable lexicon can be found in [Gibbon 2001].
The lexicon microstructure required for Ega, a tone language putatively with vowel harmony, consists of the following items:
For the concordance a simpler microstructure subset is used:
The tagging hierarchy for use with the concordance is shown in Figure 4.
The system should require the following files:
Required scripts/modules are:
Generated, static files include:
Generated, not static files include:
Three user interfaces are being included:
Further interface design specifications will follow in a later version of this document.
In order to test the functionality, the PAC system was subjected to initial informal evaluation using two different Ivorian languages: Koulango (Gur/Senoufo) and Anyi (Kwa/Tano).
Currently the first Ega data is arriving from Abidjan and is being incorporated into the evaluation.
Further corpus data specifications will follow in a later version of this document.
Task | Who |
Specification and design for an audio-concordance | Trippel, Ouattara, Jahn |
Design and definition of markup | Trippel, Ouattara, Schulte |
Coordination, collation | Trippel, Schulte, Adouakou |
Evaluation | Trippel, Adouakou |
Module definition | Trippel, Ouattara, Jahn |
ASCII to Markup converter | Trippel, Ouattara, Jahn |
Search function | Ouattara, Jahn |
User interface design | Trippel, Jahn |
CD-Rom production | Trippel, Adouakou |
The following main tasks have been defined:
The tasks are coordinated closely (and some shared) with the DAAD project Encyclopaedia Design for Ivory Coast Languages until the end of that project (December 2000).
The basic conditions for implementation are as follows:
The design of the `container tree' of elements and tag types was specified above (cf. Figure 4): a text element is the container element for sentences, which are in turn container elements for words and sentence-end punctuation such as periods, question marks, exclamation marks. These are included because they have a semantic function for orthographic texts. Tone-language-specific prosodic markup will be included at a later stage.
The DTD is deliberately minimal and subject to revision with respect to the distinction between elements and attribute-value pairs in consultation with other teams.
<!-- DTD for the concordance markup Concordance Tagged Text (CTT) -->
<!-- Developed 2000 by Thorsten Trippel, Soma Ouattara, Nils Jahn
     at the University of Bielefeld, Germany -->

<!-- root element is ctt -->
<!ELEMENT ctt - - (concinf, conctext)>

<!-- head element with general information -->
<!ELEMENT concinf  - - (title, author, date, language?, changes*)>
<!ELEMENT title    - - (#PCDATA)>
<!ELEMENT author   - - (#PCDATA)>
<!ELEMENT date     - - (#PCDATA)>
<!ELEMENT language - - (#PCDATA)>
<!ELEMENT changes  - - (#PCDATA)>

<!-- body element with marked up text -->
<!ELEMENT conctext - - (concsentence)* >

<!-- Element to tag single sentences with id attributes -->
<!ELEMENT concsentence - - ((concword+),concsentend) >
<!ATTLIST concsentence sentencenumber ID #REQUIRED>

<!-- Element to tag single words with id attributes -->
<!ELEMENT concword - - (#PCDATA)>
<!ATTLIST concword wordnumber ID #REQUIRED>

<!-- Element to tag sentence end punctuation such as . ! ? -->
<!ELEMENT concsentend - - (#PCDATA)>
The following sample text illustrates the CTT format:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE ctt public "-//UBI//DTD CONCORDANCE 0.1a//EN" >
<ctt>
  <concinf>
    <title>Testtext</title>
    <author>Trippel</author>
    <date>08 Oct 2000</date>
    <language>English</language>
    <changes>08 Oct 2000</changes>
  </concinf>
  <conctext>
    <concsentence sentencenumber="sentence1">
      <concword wordnumber="word1">This</concword>
      <concword wordnumber="word2">is</concword>
      <concword wordnumber="word3">sentence</concword>
      <concword wordnumber="word4">1</concword>
      <concsentend>.</concsentend>
    </concsentence>
  </conctext>
</ctt>
The normalisation function converts a SAMPA text into a marked-up text. Every sentence receives a unique identification number. Within the sentences each word receives an identification which is composed of the number of the current sentence and the number of its position in the sentence. The SAMPA text does not contain any punctuation and the end of a sentence is marked up with the line-feed symbol. The normalisation function will be invoked once per text.
The normalisation function (cf. Table 2) expects two arguments: the name of the input file and the name of the output file. Each line read from the input file is stored in a string variable and then split into an array of words. Two integer variables are used as index variables: the first holds the number of the current sentence and the second the number of the current word within the sentence.
The algorithm first opens the input file and the output file. The first thing written to the output file is the markup header, i.e. the root element and the information about title, author, and date of the text. The two index variables are initialised and the input file is read line by line. Each line is split into an array, and the sentence and the words of the sentence are marked up with tags and receive identification numbers which are provided by the index variables sentence-number and word-index. When the end of the input file is reached, the end tags are written to the output file and the files are closed.
For the provisional proof-of-concept code see Appendix B.
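The per-line markup step described above can be sketched in isolation as follows. This is a minimal illustration, not the appendix code: the helper name `markup_sentence` is hypothetical, and the identifier scheme `word<position>.<sentence>` is taken over from the appendix listing.

```perl
use strict;
use warnings;

# Hypothetical helper: marks up one line of SAMPA text as a sentence.
# Returns the empty string for an empty line.
sub markup_sentence {
    my ($line, $sentence_no) = @_;
    my @words = split ' ', $line;         # whitespace-separated words
    return '' unless @words;              # nothing to mark up
    my $xml = qq{<concsentence sentencenumber="$sentence_no">\n};
    my $pos = 1;                          # position of the word in the sentence
    for my $word (@words) {
        $xml .= qq{<concword wordnumber="word$pos.$sentence_no">$word</concword>\n};
        $pos++;
    }
    return $xml . "</concsentence>\n";
}

# One sentence per input line; empty lines do not consume a number.
my $sentence_no = 1;
for my $line ("a ba", "", "ka") {
    my $xml = markup_sentence($line, $sentence_no);
    next if $xml eq '';
    print $xml;
    $sentence_no++;
}
```

Since X-SAMPA uses characters such as `&` and `<`, a production version would also need to escape them before writing XML; the sketch omits this step.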
The acquisition module produces a list of the keywords which occur in the input texts. The list is in alphabetical order and does not contain any duplicates. The list is in ASCII and the words are separated by a line-feed symbol.
The acquisition module searches the data directory and processes each file in this directory. The module first stores the files of the mentioned directory in an array. Then a file is opened, its contents are read line by line and the extracted words are stored in an array. After that the file is closed. This procedure is repeated until all files in the directory have been processed.
The array which contains the words of all files is sorted alphabetically. After that the first element of the array is copied into a second array. The module checks whether the next element is equal to the one just copied. If it is equal, the module moves on to the following element; if it is not equal, the element is copied to the second array. This is repeated until each element of the first array has been checked. The second array then consists of unique words. It is joined into a string variable in which the words are separated by newlines, and the string is stored in the target file.
For the provisional proof-of-concept code see Appendix D.
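The same keyword list can also be obtained with a hash instead of the sort-and-compare loop. The following sketch is an alternative technique, not the method of the appendix code; the helper name `keyword_list` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the sorted list of distinct words
# found in a string of CTT markup. The hash keys are unique by
# construction, so no pairwise comparison is needed.
sub keyword_list {
    my ($ctt) = @_;
    my %seen;
    while ($ctt =~ m{<concword wordnumber="[^"]*">(\S+?)</concword>}g) {
        $seen{$1} = 1;
    }
    my @words = sort keys %seen;
    return @words;
}

my $ctt = <<'END_CTT';
<concword wordnumber="word1.1">ba</concword>
<concword wordnumber="word2.1">a</concword>
<concword wordnumber="word1.2">ba</concword>
END_CTT

print join("\n", keyword_list($ctt)), "\n";    # prints a and ba, one per line
```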
The consultation module searches a given text for a specific keyword. If the keyword is found, all sentences in which it occurs are shown. If there are no matches, the module returns a message that no matches were found. It is also possible to search more than one text at the same time; in that case the results are sorted according to the source texts.
The consultation module expects as arguments the keyword to be searched, the output file where the results will be stored and a list of input files which could be as long as necessary.
The module first checks whether the number of arguments is less than 3. In that case it prints an error message stating which arguments are expected and terminates. The output file and the first input file are opened. The whole input file is read into one variable. This variable is searched for the sentence markup pattern, and the line number and the sentence are stored in variables.
Then the variable which contains the sentence is checked for the keyword; if the keyword is found, the whole sentence is written to the output file together with its line number. This is repeated until the whole input file has been checked, and the input file is then closed. Then the next file from the input list is processed, until the whole list has been processed. At the end the module checks whether there were any matches; if not, a message is printed to the screen. Finally the output file is closed.
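The search step can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `find_sentences` is hypothetical, and the use of `quotemeta` is an addition not described in the text; it prevents symbols such as `?` or `\` in a transcription from being interpreted as regular expression operators.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [sentence number, sentence markup]
# pairs for every sentence in the CTT text that contains $keyword
# as a complete word.
sub find_sentences {
    my ($ctt, $keyword) = @_;
    my $quoted = quotemeta $keyword;    # match the keyword literally
    my @hits;
    while ($ctt =~ m{<concsentence sentencenumber="(\d+)">(.+?)</concsentence>}gs) {
        my ($number, $content) = ($1, $2);
        push @hits, [ $number, $content ] if $content =~ m{>$quoted<};
    }
    return @hits;
}

my $ctt = <<'END_CTT';
<concsentence sentencenumber="1">
<concword wordnumber="word1.1">?a</concword>
<concword wordnumber="word2.1">ba</concword>
</concsentence>
<concsentence sentencenumber="2">
<concword wordnumber="word1.2">ba</concword>
</concsentence>
END_CTT

for my $hit (find_sentences($ctt, '?a')) {
    print "line $hit->[0]\n";           # sentence numbers of the matches
}
```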
The consultation function is given a CGI front end for user access. After selecting a language from a pick list on the introductory page (see Figure 6), users select a word and a language-specific corpus from which the contexts are to be taken by means of two further pick lists. It is also possible to select all corpora at the same time.
After the consultation request, a list of occurrences with accompanying line numbers and contexts is given. All of this is generated on the fly.
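How a single hit might be rendered on the fly, together with its audio link, is sketched below. The helper name `result_row` is hypothetical and the HTML is reduced to a minimum; the path scheme `../CORPUS/AUDIO/<text name><line number>.wav` is taken over from the appendix code.

```perl
use strict;
use warnings;

# Hypothetical helper: formats one hit as an HTML table row with a
# link to the audio file that belongs to the sentence.
sub result_row {
    my ($text_name, $line_no, $context) = @_;
    my $audio = "../CORPUS/AUDIO/${text_name}${line_no}.wav";
    return qq{<tr><td>text: ${text_name}, line ${line_no}:<br />${context}</td>}
         . qq{<td><a href="${audio}">audio</a></td></tr>\n};
}

print result_row("story", 3, "a <b>ba</b> ka");
```

Under this scheme each sentence of a text needs its own audio file, named after the text and the line number.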
Three user interfaces are being incorporated; design and implementation (except for command line access) are still in progress. The current implementation of the graphical user interface forms is shown in Figure 6.
For provisional proof-of-concept code see Appendix F.
The testing programme follows EAGLES recommendations for language and speech technology ([Gibbon, Mertins & Moore 2000]) and involves
The software and documentation are distributed continuously within the Ega project between Bielefeld and Abidjan, and the present document makes them available in preliminary form to the partners in the DOBES project.
The documentation will simultaneously be made available on the Ega project website.
Initial object specifications follow for
#!/vol/bin/perl -w
# normalisation.pl
# version: 0.9b
# N. Jahn, S. Ouattara, T. Trippel
# November 2000, University of Bielefeld, Germany
# [jahn,soma,ttrippel]@spectrum.uni-bielefeld.de
# Functionality: numbers every non-empty line of the input,
# breaks it into words, gives every word a unique identifier.
# Syntax: normalisation.pl <INFILE> <OUTFILE>
# Additional information: the user will be prompted for
# title, author, date and language of, and changes to, the document

($input, $output) = @ARGV;    # store the arguments in $input and $output
if ($#ARGV < 1) {             # fewer than 2 arguments: tell the user and exit
    print "usage: normalisation.pl <input> <output>\n";
    exit;
}

open(IN, "< $input")          # open $input for reading
    or die "\n Input file couldn't be opened!!\n";
open(OUT, "> $output")        # open $output for writing
    or die "\n Output file couldn't be created!!\n";

print "Please give the title: \n";
chomp($title = <STDIN>);
print "Please give the authors name: \n";
chomp($author = <STDIN>);
print "Please give the date when the text was created:\n";
chomp($date = <STDIN>);
print "Please give the language of the text:\n";
chomp($language = <STDIN>);
print "Please give the changes to the text:\n";
chomp($changes = <STDIN>);

# print the document header into $output
print OUT qq%<?xml version="1.0" standalone="no"?>\n%;
print OUT "<!DOCTYPE ctt public \"-//UBI//DTD CONCORDANCE 0.1a//EN\" >\n\n";
print OUT "<ctt>\n\n";
print OUT "<concinf>\n";
print OUT "<title>$title</title>\n";
print OUT "<author>$author</author>\n";
print OUT "<date>$date</date>\n";
print OUT "<language>$language</language>\n";
print OUT "<changes>$changes</changes>\n";
print OUT "</concinf>\n\n";
print OUT "<conctext>\n";

$line = 1;                      # number of the current sentence
while (<IN>) {
    chomp;                      # removes the trailing line feed
    @array = split(" ", $_);    # splits the line into an array of words
    next unless @array;         # skips empty lines
    print OUT "<concsentence sentencenumber=\"$line\">\n";
    $count = 1;                 # position of the current word in the sentence
    foreach $word (@array) {    # for each word of the current line
        print OUT "<concword wordnumber=\"word$count.$line\">$word</concword>\n";
        $count++;
    }
    print OUT "</concsentence>\n";
    $line++;                    # only non-empty lines are counted
}

print OUT "</conctext>\n";      # close the body element opened above
print OUT "</ctt>\n";
close(IN);
close(OUT);
#!/vol/bin/perl
use Tk;

# Opens a window reporting that a file name is missing.
sub openError {
    $nf = MainWindow->new();
    $nf->Frame(-label => "\nNo input or output file specified!\n")->pack();
    $nf->Button(-text => "ok", -command => sub { $nf->withdraw() })->pack();
}

# Reads the entry fields and writes the marked-up text.
sub norm {
    $input   = $e5->get();
    $output  = $e6->get();
    $title   = $e1->get();
    $author  = $e2->get();
    $date    = $e3->get();
    $changes = $e4->get();
    if ($input eq "" or $output eq "") {    # a file name is missing
        openError();
        return;
    }
    open(IN, "< $input")   or die "Input file couldn't be opened!\n";
    open(OUT, "> $output") or die "Output file couldn't be opened!\n";

    print OUT qq%<?xml version="1.0" standalone="no"?>\n%;
    print OUT "<!DOCTYPE ctt public \"-//UBI//DTD CONCORDANCE 0.1a//EN\" >\n\n";
    print OUT "<ctt>\n\n";
    print OUT "<concinf>\n";
    print OUT "<title>$title</title>\n";
    print OUT "<author>$author</author>\n";
    print OUT "<date>$date</date>\n";
    print OUT "<changes>$changes</changes>\n";
    print OUT "</concinf>\n\n";
    print OUT "<conctext>\n";

    $line = 1;                      # number of the current sentence
    while (<IN>) {
        chomp;                      # removes the trailing line feed
        @array = split(" ", $_);    # splits the line into an array of words
        next unless @array;         # skips empty lines
        print OUT "<concsentence sentencenumber=\"$line\">\n";
        $count = 1;                 # position of the current word in the sentence
        foreach $word (@array) {    # for each word of the current line
            print OUT "<concword wordnumber=\"word$count.$line\">$word</concword>\n";
            $count++;
        }
        print OUT "</concsentence>\n";
        $line++;                    # only non-empty lines are counted
    }

    print OUT "</conctext>\n";      # close the body element opened above
    print OUT "</ctt>\n";
    close(IN);
    close(OUT);
    exit;
}

# Build the form: one labelled entry field per item of information.
$mw = MainWindow->new();
$mw->Label(-text => "Title")->pack();
$e1 = $mw->Entry()->pack();
$mw->Label(-text => "Author")->pack();
$e2 = $mw->Entry()->pack();
$mw->Label(-text => "Date")->pack();
$e3 = $mw->Entry()->pack();
$mw->Label(-text => "Last changes")->pack();
$e4 = $mw->Entry()->pack();
$mw->Label(-text => "")->pack();
$mw->Label(-text => "Input File")->pack();
$e5 = $mw->Entry()->pack();
$mw->Label(-text => "Output File")->pack();
$e6 = $mw->Entry()->pack();
$mw->Label(-text => "")->pack();
$mw->Button(-text => "Normalise Text", -command => \&norm)->pack();
$mw->Button(-text => "Quit", -command => sub { exit })->pack();
MainLoop;
#!/vol/bin/perl
# authors: Jahn & Ouattara
# Program: acquire
# Reads the <ctt> texts in the data directory, then creates a keyword
# list which is sorted and free of duplicates.

$datadir = "/project/langdoc/SOFTWARE/CONCORDANCE/DATA";

# Returns the names of all .ctt files in the data directory.
sub getDir {
    opendir(ETC, "$datadir/") or die "Cannot open it!";
    while ($toc = readdir(ETC)) {
        if ($toc =~ m/\S+?\.ctt$/) {
            push(@inh, $toc);
        }
    }
    closedir(ETC);
    return @inh;
}

@array  = ();
@narray = ();
@dir    = getDir();

foreach $datei (@dir) {
    open(DATEI, "< $datadir/$datei") or die "Cannot open $datei!";
    while (<DATEI>) {
        # extracts every word on the line and pushes it onto an array
        while (m/<concword wordnumber=\"word\d+\.\d+\">(\S+?)<\/concword>/g) {
            push(@array, $1);
        }
    }
    close(DATEI);
}

@narray = sort(@array);    # sorts the array

$index   = 0;
$current = 0;
@rarray  = ();
while ($index <= $#narray) {           # until the end of the sorted array
    push(@rarray, $narray[$index]);    # pushes the word onto the result array
    $index++;
    # skips words equal to the one just copied, without reading past the end
    while ($index <= $#narray and $rarray[$current] eq $narray[$index]) {
        $index++;
    }
    $current++;
}

open(OUT, "> $datadir/wortliste.wl") or die "Cannot write the word list!";
print OUT join("\n", @rarray);    # converts the array into a string
close(OUT);
#!/vol/bin/perl -w
use CGI qw(:standard);

my $language = param("language");    # the language to be investigated
$defaultpath = "../html-data/DATA/" . "$language" . "/";

%titlefile = ();    # maps each file name to the title of its text
@array     = ();
@narray    = ();
@dir       = getDir();

foreach $datei (@dir) {
    open(DATEI, "< /project/langdoc/SOFTWARE/CONCORDANCE/DATA/$language/$datei");
    while (<DATEI>) {
        if (m/<concword wordnumber=\"word\d+\.\d+\">(\S+?)<\/concword>/) {
            push(@array, $1);    # extracts a word and pushes it onto an array
        }
        elsif (m/<title>(.+?)<\/title>/) {
            $titlefile{"$datei"} = $1;    # remembers the title of the text
        }
    }
    close(DATEI);
}

@narray = sort(@array);    # sorts the array

$index   = 0;
$current = 0;
@rarray  = ();
while ($index <= $#narray) {           # until the end of the sorted array
    push(@rarray, $narray[$index]);    # pushes the word onto the result array
    $index++;
    # skips words equal to the one just copied, without reading past the end
    while ($index <= $#narray and $rarray[$current] eq $narray[$index]) {
        $index++;
    }
    $current++;
}

print header();
print <<END_of_HEAD;
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>AUDIO-Concordance Wordlist and Userinterface</title>
<link rev="MADE" href="mailto:ttrippel\@spectrum.uni-bielefeld.de" />
<base href="http://coral.lili.uni-bielefeld.de/langdoc/cgi-bin/acquisition.pl" />
<meta name="copyright" content="University of Bielefeld, Computational Linguistics and Spoken Language" />
<meta name="author" content="Thorsten Trippel" />
<meta name="description" content="Wordlist for the audio concordance and Userinterface" />
<meta name="date" content="23 Nov 2000" />
<link rel="stylesheet" href="../langdoc.css" />
<script language="JavaScript">
<!--
function doList() {
  var counter = 0;
  for (var i = 0; i < document.forms["consultationstart"].infile.options.length; i++) {
    if (document.forms["consultationstart"].infile.options[i].selected) {
      counter++;
    }
  }
  if (counter == 0) {
    alert("Don't want to proceed?\\n There are no files selected!");
    return;
  } else {
    document.forms["consultationstart"].submit();
  }
}
function select_all(formList) {
  for (var i = 0; i < formList.options.length; i++) {
    formList.options[i].selected = true;
  }
}
function deselect_all(formList) {
  for (var i = 0; i < formList.options.length; i++) {
    formList.options[i].selected = false;
  }
}
// -->
</script>
</head>
<body link="#ffffff" vlink="#fafafa" alink="#ff0000">
END_of_HEAD

print <<HEAD_of_TABLE;
<form name="consultationstart" action="http://coral.lili.uni-bielefeld.de/langdoc/cgi-bin/consultation.pl" method="post">
<input type="hidden" name="language" value="$language" />
<table class="intern" >
<tr><!-- row 1: empty, spacer images only -->
<td class="background" width="137"> <img src="../IMAGES/1pix.gif" width="1" height="1" alt="" hspace="68" vspace="1" /></td>
<td class="background" width="300%" colspan="3"><img src="../IMAGES/1pix.gif" width="1" height="1" alt="" hspace="137" vspace="1" /></td>
<td class="background" width="137"><img src="../IMAGES/1pix.gif" width="1" height="1" alt="" hspace="68" vspace="1" /></td>
</tr>
<tr><!-- row 2: table headings -->
<td class="tablehead">Contents</td>
<td class="tablehead" colspan="3">Search for words in one or more text(s) <br /> in the language <b>$language</b>.</td>
<td class="tablehead">Links</td>
</tr>
<tr> <!-- row 3: first row of the table content -->
<td class="content" rowspan="3">
<p><a href="../LangDoc/index.html">Language Documentation Notes</a></p>
<p><a href="../index.html">Introductory page</a></p>
<!-- <p><a href="../acquisition.pl">Search the concordance</a></p> -->
<p><a href="../SPECIFICATION/">Specification of the audio concordance</a></p>
<p>E-mail: <a href="mailto:langdoc\@spectrum.uni-bielefeld.de">langdoc\@spectrum.uni-bielefeld.de</a></p>
<p><a href="../about.html">About the project</a></p>
<p>Designed: November 2000</p>
<!-- <img src="../IMAGES/1pix.gif" width="1" height="1" alt="" hspace="1" vspace="100" /> -->
</td>
<td class="body" rowspan="2">
<!-- <table class="intern" border="0" cellspacing="0" cellpadding="0"> -->
HEAD_of_TABLE

print <<SELECT_END;
<!-- <tr> <td class="body" rowspan="2" > -->
<strong>Select word:</strong><br />
<select name="word" size="20" multiple="multiple">
SELECT_END

# one option per keyword
for ($word = 0; $word <= $#rarray; $word++) {
    print "<option value=\"$rarray[$word]\">$rarray[$word]</option>\n";
}

print <<CONTENT_START;
</select>
</td>
<td align="center" class="body" colspan="2" rowspan="1">
<strong>Select corpus:</strong><br />
<select name="infile" size="3" multiple="multiple">
CONTENT_START

# one option per text, labelled with the title of the text
while (($file, $title) = each(%titlefile)) {
    print "<option value=\"$defaultpath$file\">$title</option>\n";
}

print <<CONTENT_END;
</select>
</td>
<td class="linklist" rowspan="3"> </td>
</tr>
<tr> <!-- row 4: contents and link list columns are used up, column 2 as well; columns 3 and 4 remain -->
<td align="center" class="body" colspan="2">
<input type="button" value="Select All Files" onclick="select_all(form.infile)" />
<!-- </td> <td align="center" class="body" > -->
<input type="button" value="Deselect All Files" onclick="deselect_all(form.infile)" />
<br /><input type="button" value="Select All Words" onclick="select_all(form.word)" />
<!-- </td> <td align="center" class="body" > -->
<input type="button" value="Deselect All Words" onclick="deselect_all(form.word)" /></td>
</tr>
<tr><!-- row 5: contents and link list columns are used up, the rest is not -->
<td align="center" class="body" colspan="3">
<input type="button" value="Search for word" onclick="doList(form)" />
<!-- <input type="submit" value="Search for word" / > </td> <td align="center" class="body" >-->
<input type="reset" value="Reset" /></td>
</tr>
CONTENT_END

print <<END_of_TABLE;
<!-- </table> </td> </tr> -->
<tr>
<td class="tablehead"> </td>
<td class="tablehead" colspan="3"> </td>
<td class="tablehead"> </td>
</tr>
</table>
</form>
</body>
</html>
END_of_TABLE

# Returns the names of all .ctt files in the data directory of the language.
sub getDir {
    opendir(ETC, "/project/langdoc/SOFTWARE/CONCORDANCE/DATA/$language/")
        or die "Cannot open the DATA directory!";
    while ($toc = readdir(ETC)) {
        if ($toc =~ m/\S+?\.ctt$/) {
            push(@inh, $toc);
        }
    }
    closedir(ETC);
    return @inh;
}
#!/vol/bin/perl -w
# authors: Jahn & Ouattara
# Program: search.pl
# Looks for a given word in the given texts and prints out all matches.
# A result is composed of the line number and the contents of that line.

undef $/;    # read each input file as a whole

if ($#ARGV < 2) {
    print "Usage: consultation.pl search outputfile inputfile(1) ... inputfile(n)\n";
    exit 1;
}

$word   = $ARGV[0];            # the word to be searched
$output = $ARGV[1];            # the output file
@input  = @ARGV[2..$#ARGV];    # the input files to look through
$found  = 0;                   # boolean variable which is 0 if there aren't any matches

open(OUT, "> $output")
    or die "\n Output file couldn't be created!!\n";

foreach $dat (@input) {
    open(IN, "< $dat") or die "$dat couldn't be opened!!\n";
    $text = <IN>;
    close(IN);
    # matches the sentence number and the contents of each sentence
    while ($text =~ m/<concsentence sentencenumber=\"(\d+)\">(.+?)<\/concsentence>/gs) {
        $zeile  = $1;
        $inhalt = $2;
        # \Q...\E matches the word literally, even if it contains
        # characters with a special meaning in regular expressions
        if ($inhalt =~ m/>\Q$word\E</) {
            $found = 1;
            $inhalt =~ s/concword wordnumber/a name/g;
            $inhalt =~ s/concword>/a>/g;
            print OUT "line $zeile\n";    # prints the sentence number into the file
            print OUT "$inhalt\n";        # prints the contents of the sentence into the file
        }
    }
}

if ($found == 0) {
    print "No matches found !!\n";
}
close(OUT);
#!/vol/bin/perl -w
use CGI qw(:standard);

my @word     = param("word");        # the words to be searched
my @input    = param("infile");      # the input files to look through
my $language = param("language");    # the language to be investigated

undef $/;    # read each input file as a whole

print header();
print <<END_of_HEAD;
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>AUDIO-Concordance output</title>
<link rev="MADE" href="mailto:ttrippel\@spectrum.uni-bielefeld.de" />
<base href="http://coral.lili.uni-bielefeld.de/langdoc/cgi-bin/test.pl" />
<meta name="copyright" content="University of Bielefeld, Computational Linguistics and Spoken Language" />
<meta name="author" content="Thorsten Trippel" />
<meta name="description" content="Results from the audio concordance query" />
<meta name="date" content="23 Nov 2000" />
<link rel="stylesheet" href="../langdoc.css" />
</head>
<body link="#ffffff" vlink="#fafafa" alink="#fa1340">
END_of_HEAD

print <<HEAD_of_TABLE;
<table class="intern" >
<tr>
<td class="background" width="137"> <img src="../IMAGES/1pix.gif" width="1" height="1" alt="" hspace="68" vspace="1" /></td>
<td class="background" width="300%"><img src="../IMAGES/1pix.gif" width="1" height="1" alt="" hspace="137" vspace="1" /></td>
<td class="background" width="137"><img src="../IMAGES/1pix.gif" width="1" height="1" alt="" hspace="68" vspace="1" /></td>
</tr>
<tr>
<td class="tablehead" bgcolor="#CCCCCC" >Contents</td>
<td class="tablehead">Hits: search for <b>@word</b> <br />in $language
<!-- text \$filename as a corpus. -->
</td>
<td class="tablehead">Links</td>
</tr>
<tr>
<td class="content">
<p><a href="/LangDoc/index.html">Language Documentation Notes</a></p>
<p><a href="../index.html">Introductory page</a></p>
<!-- <p><a href="acquisition.pl">Search the concordance</a></p> -->
<p><a href="../SPECIFICATION/">Specification of the audio concordance</a></p>
<p>E-mail: <a href="mailto:langdoc\@spectrum.uni-bielefeld.de">langdoc\@spectrum.uni-bielefeld.de</a></p>
<p><a href="about.html">About the project</a></p>
<p>Designed: November 2000</p>
</td>
<td>
<table class="intern" border="0" cellspacing="0" cellpadding="0">
HEAD_of_TABLE

foreach $file (@input) {
    open(IN, "< $file")
        or die "\n Input file couldn't be opened!!\n";
    $text = <IN>;
    close(IN);

    # strip directory and extension to obtain the name of the text
    $filename = $file;
    $filename =~ s/\.\.\/html-data\/DATA\/\Q$language\E\///;
    $filename =~ s/\.ctt//;

    # matches the sentence number and the contents of each sentence
    while ($text =~ m/<concsentence sentencenumber=\"(\d+)\">(.+?)<\/concsentence>/gs) {
        $zeile  = $1;
        $inhalt = $2;
        foreach $word (@word) {
            # \Q...\E matches the word literally
            if ($inhalt =~ m/>\Q$word\E</) {
                $inhalt =~ s/concword wordnumber/a name/g;
                $inhalt =~ s/concword>/a>/g;
                $inhalt =~ s/>\Q$word\E</><b>$word<\/b></g;    # highlights the keyword
                # one table row per hit: context plus link to the audio file
                print "<tr><td class=\"body\">text: $filename, line $zeile: <br /> $inhalt\n</td>",
                      "<td class=\"body\"><a href=\"../CORPUS/AUDIO/$filename$zeile.wav\"> ",
                      "<img src=\"../IMAGES/speaker.gif\" alt=\"Link to audio\" /></a> </td> </tr>\n";
            }
        }
    }
}

print <<END_of_TABLE;
</table>
</td>
<td class="linklist"> </td>
</tr>
<tr>
<td class="tablehead"> </td>
<td class="tablehead"> </td>
<td class="tablehead"> </td>
</tr>
</table>
</body>
</html>
END_of_TABLE