Computational Lexicography

2008-1-15

Summaries

This lecture is actually quite hard to summarise, as the subject is completely new to me. For that reason, I will review what we have learned by answering the questions from the "Quiz" part.

Criteria for Good Lexicography

* Quantity:

o Completeness of coverage:

+ extensional coverage: number of entries

+ intensional coverage: number of types of lexical information

* Quality:

o Correctness of information:

+ Types of lexical information

o Consistency of structure:

+ Macrostructure

+ Microstructure

+ Mesostructure
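
The two quantity criteria can be made concrete with a small sketch. It assumes a toy lexicon stored as a Python dictionary; the entries, the category names and the two function names are invented for illustration and do not come from the lecture.

```python
# Toy lexicon: each entry maps to the types of lexical information it records.
# Entries and category names are invented for illustration only.
lexicon = {
    "cat": {"pos": "noun", "gloss": "small domestic feline"},
    "run": {"pos": "verb", "gloss": "move fast on foot", "pronunciation": "/rʌn/"},
    "quickly": {"pos": "adverb"},
}

def extensional_coverage(lex):
    """Extensional coverage: the number of entries."""
    return len(lex)

def intensional_coverage(lex):
    """Intensional coverage: the number of distinct types of lexical information."""
    return len({category for info in lex.values() for category in info})

print(extensional_coverage(lexicon))  # 3
print(intensional_coverage(lexicon))  # 3 (pos, gloss, pronunciation)
```

Note that "quickly" lacks a gloss here, which is the kind of gap a consistency check on the microstructure would report.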

Quizzes & Answers

  • 1. What is a KWIC concordance?

    A KWIC (KeyWord In Context) concordance is a special kind of preliminary, corpus-based dictionary:
    - each word in a text corpus is paired with its contexts of occurrence in this corpus.

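    A Google results page, which shows each search term together with a snippet of its surrounding text, can be seen as a special form of KWIC concordance.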

  • 2. Which are the two main components of lexicon construction based on empirical data?

    Information retrieval and Linguistic analysis.

  • 3. Which layers of abstraction are involved in corpus acquisition?

    Layer 1: Primary data (audio / video recordings)
    Layer 2: Secondary data (transcription, annotation, metadata)

  • 4. Which layers of abstraction are involved in lexicon construction? Describe them.

    Layer 1: Corpus lexicon (wordlist, concordance).
    Layer 2: Lexicon matrix (entries x data categories, no generalisations).
    Layer 3: Lexicon with selected generalisations (procedurally optimised: semasiological, onomasiological)
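
As a rough illustration of the three layers, the sketch below builds each one from toy data. It assumes the usual reading of the two terms: semasiological means looking up from word form to meaning, onomasiological means looking up from meaning to word forms. The corpus, the matrix rows and all variable names are invented for this example.

```python
from collections import Counter, defaultdict

# Layer 1: corpus lexicon, here just a wordlist with frequencies.
corpus = "the dog barks the cat sleeps".split()
wordlist = Counter(corpus)

# Layer 2: lexicon matrix, one row per entry and one column per data category.
# No generalisations yet; the values are invented for illustration.
matrix = {
    "dog":   {"pos": "noun", "meaning": "canine"},
    "hound": {"pos": "noun", "meaning": "canine"},
    "cat":   {"pos": "noun", "meaning": "feline"},
}

# Layer 3, semasiological view: from word form to meaning.
semasiological = {form: row["meaning"] for form, row in matrix.items()}

# Layer 3, onomasiological view: from meaning to word forms.
onomasiological = defaultdict(list)
for form, row in matrix.items():
    onomasiological[row["meaning"]].append(form)

print(wordlist["the"])            # 2
print(semasiological["hound"])    # canine
print(onomasiological["canine"])  # ['dog', 'hound']
```

Both Layer 3 views are derived from the same matrix, which is why the matrix is the neutral layer and the views are the procedurally optimised ones.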

  • 5. Which layer do standard dictionary types typically belong to?

    Layer 3: Lexicon with selected generalisations (procedurally optimised: semasiological, onomasiological)

  • 6. What are the 6 main steps in KWIC concordance construction?
    • Corpus creation/collation - get the corpus, e.g. texts.
    • Tokenisation - normalise the text, e.g. change upper-case letters into lower-case letters, remove punctuation marks (telling an end-of-sentence full stop from an abbreviation full stop), deal with numbers.
    • Keyword list extraction - create a list of the words that occur in the text.
    • Context collation - pick a context unit, e.g. the keyword with three words of context on the left side and three words on the right side.
    • Keyword search - look for the keyword in its context.
    • Output formatting - make the output look nice and understandable to the user.
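
The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not the demonstration software from the lecture: the function names and the toy corpus are invented, and the tokeniser simply drops all punctuation instead of telling sentence ends from abbreviations.

```python
import re

def tokenise(text):
    """Lowercase the text and keep only runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def keyword_list(tokens):
    """Sorted list of the distinct words in the token sequence."""
    return sorted(set(tokens))

def kwic(tokens, keyword, width=3):
    """Return (left, keyword, right) triples for every occurrence of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), token, " ".join(right)))
    return hits

def format_hits(hits):
    """Right-align the left contexts so that the keywords line up in a column."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    return "\n".join(f"{left.rjust(pad)}  [{kw}]  {right}" for left, kw, right in hits)

# 1. Corpus creation: a toy one-text corpus, invented for this example.
corpus = "The cat sat on the mat. The dog saw the cat."
tokens = tokenise(corpus)            # 2. Tokenisation
words = keyword_list(tokens)         # 3. Keyword list extraction
hits = kwic(tokens, "cat", width=2)  # 4. and 5. Context collation and keyword search
print(format_hits(hits))             # 6. Output formatting
```

Here the context unit is a fixed window of words; a sentence or a line would be an equally valid choice of context unit.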
  • 7. In which programming languages could the concordance software be implemented?

    Perl, Unix shell scripts, Python, or the LaTeX formatting language.

  • 8. What are the problems with the demonstration software which need to be removed in a later realistic project?

    The program will have to allow flexible handling of contexts and filenames, treat more than one text, and have a modular structure/organisation.
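
One possible way to meet these requirements is sketched below, assuming a command-line tool written in Python. The option names (`--left`, `--right`), the function names and the file names in the usage comment are hypothetical; they only show how context size and filenames could become parameters and how the steps could live in separate functions.

```python
import argparse
import re
from pathlib import Path

def tokenise(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def concordance(tokens, keyword, left=3, right=3):
    """Yield (left context, keyword, right context) for each hit."""
    for i, token in enumerate(tokens):
        if token == keyword:
            yield (" ".join(tokens[max(0, i - left):i]), token,
                   " ".join(tokens[i + 1:i + 1 + right]))

def build_parser():
    """Command-line interface: context sizes and filenames are not hard-coded."""
    parser = argparse.ArgumentParser(description="Toy KWIC concordancer")
    parser.add_argument("keyword")
    parser.add_argument("files", nargs="+", type=Path, help="one or more text files")
    parser.add_argument("--left", type=int, default=3, help="words of left context")
    parser.add_argument("--right", type=int, default=3, help="words of right context")
    return parser

def main(argv=None):
    """Run the concordancer over every file given on the command line."""
    args = build_parser().parse_args(argv)
    for path in args.files:
        tokens = tokenise(path.read_text(encoding="utf-8"))
        for lhs, kw, rhs in concordance(tokens, args.keyword.lower(), args.left, args.right):
            print(f"{path.name}: {lhs} [{kw}] {rhs}")

# Example call with hypothetical file names:
# main(["cat", "text1.txt", "text2.txt", "--left", "2"])
```

Keeping tokenisation, search and the command-line layer in separate functions is what makes it easy to swap in a better tokeniser or a different output format later.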

9. The Status of Dictionaries

The dictionary is

  • one of the three main components of language documentation:
    • corpus of recordings and texts
    • dictionary
    • sketch grammar
  • the central component of any linguistic description
  • the most useful linguistic product for use by the speech community, or non-linguists in general.

All of the above is quoted from the notes of the lecture given by Dr. Gibbon.

Evaluation

As I said at the very beginning, this lecture is really very difficult, not only because computational linguistics is completely new to me, but also because there are many new terms to learn by heart.

Reference

  • Gibbon, Dafydd. "Computational Lexicography." 14.01.2008. University of Bielefeld. 15.01.2008 <http://wwwhomes.uni-bielefeld.de/~gibbon/Classes/Classes2007WS/HTMD/htmd10-computationallexicography.pdf>.