Visualisation of distances in language quality spaces

DistGraph, a teaching tool for language typology data mining

D. Gibbon, U Bielefeld

This is a test prototype. It includes extensive error trapping (with notifications in red), but if you see a weird 'Traceback' message this means that there is an untrapped error in the dataflow. If this happens, please send me an email with details of the data and settings which triggered it. Thanks!

Kru languages of Côte d'Ivoire

  • Source of map: Marchese, L. 1984. Atlas Linguistique des Langues Kru. Abidjan: Institut de Linguistique Appliquée, Agence de Coopération Culturelle et Technique.
  • Ethnologue
  • Wikipedia
The input table in CSV (character-separated values) format contains the consonant inventories of the languages described in the Atlas. The width of the table corresponds to the size of the set of all the consonants of all the languages. A consonant from this set which is not present in a given language is marked by the underscore "_".
The most interesting outputs are the parametrised LED (Levenshtein Edit Distance) graph and the HTML output of the LED distance matrix.
The most interesting parameters to adjust are the range of distances and the random seed (which determines the shape of the map).
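For readers who want to see what an LED distance matrix involves, here is a minimal sketch in Python. It assumes that each language is represented by its list of table cells and that every insertion, deletion and substitution of a cell costs 1; whether DistGraph itself computes the distance in exactly this way is not stated here. The language names and inventories are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; each edit operation costs 1."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def distance_matrix(table):
    """Map every ordered pair of object names to their edit distance."""
    names = list(table)
    return {(x, y): levenshtein(table[x], table[y]) for x in names for y in names}


# Hypothetical inventories, padded with "_" where a consonant is absent.
table = {
    "LangA": ["p", "b", "t", "_"],
    "LangB": ["p", "_", "t", "d"],
    "LangC": ["p", "b", "t", "d"],
}
matrix = distance_matrix(table)
print(matrix[("LangA", "LangB")])
```

Because the rows are padded to the same width, the distance here simply counts the positions at which two inventories differ.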
IO parameters
Input table CSV separator:

Graphics format:
Output type:
parametrised LED graph
    (properties of same attributes in same field position)
parametrised SIRD graph
    (use only if properties in different fields are different, i.e. sets)
formatted input data (CSV, HTML or XML)
output of LED distance matrix (CSV, HTML or XML)
output of LED distance triples (CSV, HTML or XML)

Graph parameters
Graph engines (from AT&T GraphViz package):
neato spring model
dot undirected graph model
twopi centred circle model
circo circle model
Numerical parameters:
... range of distances to be processed
(check distance matrix for full data range)
random seed for neato spring model (trial and error)
% graph width (percent of window)
title, comment, etc. (HTML formatting permitted)
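To illustrate how the range of distances and the random seed could shape a graph, the following sketch writes GraphViz DOT source from a set of pairwise distances. The function name to_dot, the use of the GraphViz attributes start (seed for the initial neato layout) and len (preferred edge length), and the sample data are assumptions made for illustration; they are not taken from the DistGraph implementation.

```python
def to_dot(distances, min_dist, max_dist, seed=1):
    """Return DOT source for an undirected graph.

    distances maps (name_a, name_b) pairs to integer distances.
    Only pairs with min_dist <= distance <= max_dist become edges.
    """
    lines = ["graph distgraph {", f"  start={seed};"]
    # Declare every object as a node so that isolated objects stay visible.
    for name in sorted({name for pair in distances for name in pair}):
        lines.append(f'  "{name}";')
    for (a, b), d in sorted(distances.items()):
        if min_dist <= d <= max_dist:
            # len must be positive, so a distance of 0 is drawn with len=1.
            lines.append(f'  "{a}" -- "{b}" [label="{d}", len={max(d, 1)}];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical distances between three objects.
sample = {("LangA", "LangB"): 2, ("LangA", "LangC"): 5, ("LangB", "LangC"): 4}
dot_source = to_dot(sample, min_dist=1, max_dist=4, seed=7)
print(dot_source)
```

Saved to a file, the output can be rendered with one of the engines listed above, for example with neato -Tpng. Narrowing the range removes edges, which is why the graph falls apart into subgraphs at small maximum distances.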

Please note that calculation may take some time, depending on data quantity and graph settings.
Data input field

Paste the CSV object-property matrix into the following field (select the CSV separator above; default: ";"):

Data formatting

In this demo, the underscore '_' is used for empty cells. This is advisable because it keeps records equal in length and keeps comparable items at the same position in the property lists. Equal record lengths are not strictly required, but with unequal record lengths the distances are harder to interpret. Notifications of unequal record lengths are generated in case the inequality is not intentional.
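If you want to check record lengths yourself before pasting the data, a check along the following lines will do. This is only a sketch: the function name, the wording of the warnings and the choice of the most frequent length as the expected one are my assumptions, not the notifications the tool actually produces.

```python
from collections import Counter


def unequal_length_warnings(rows):
    """Return one warning per row whose length differs from the most common length."""
    lengths = Counter(len(row) for row in rows)
    if len(lengths) <= 1:
        return []  # all rows have the same length (or there are no rows)
    expected = lengths.most_common(1)[0][0]
    warnings = []
    for number, row in enumerate(rows, start=1):
        if len(row) != expected:
            name = row[0] if row else "<empty>"
            warnings.append(
                f"row {number} ({name}): {len(row)} fields, expected {expected}"
            )
    return warnings


# Hypothetical rows; the second one is one field short.
rows = [["LangA", "p", "b"], ["LangB", "p"], ["LangC", "p", "_"]]
for warning in unequal_length_warnings(rows):
    print(warning)
```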

Prepare the data according to the following instructions, and remember the old technicians' and programmers' principles:

    "Garbage in - garbage out!" - "If it ain't tested, it don't work!"
So if you get weird output, no output, or threatening messages, check your data formatting very carefully!

Currently only ASCII-encoded text data are accepted; non-ASCII characters (for example, UTF-8 encoded IPA symbols) will generate garbage. For IPA phonetic transcriptions which must be both human-readable and machine-readable, the X-SAMPA ASCII encoding is still widely used where Unicode is not easily available.
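A quick way to find offending characters before pasting is to list every character outside the ASCII range together with its position. The sketch below is one possible check, with a hypothetical function name and hypothetical sample data.

```python
def non_ascii_positions(text):
    """Return (line, column, character) for every non-ASCII character, counting from 1."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                found.append((line_no, col, ch))
    return found


# Hypothetical data: the second row contains an IPA symbol (U+0253).
data = "LangA;p;b\nLangB;\u0253;_"
for line_no, col, ch in non_ascii_positions(data):
    print(f"line {line_no}, column {col}: {ch!r} is not ASCII")
```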

Each row in the property matrix begins with the name of the object to which the properties are assigned. The object-property matrix should have rows of equal length; pad the 'empty' cells with a string which does not occur in the data. It is convenient to prepare the data using a spreadsheet, and to export in 'csv' (character-separated values) format for pasting into the data field.
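The row layout described above can be read with a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, the separator and padding string default to the ";" and "_" used in this demo, and the sample data are invented.

```python
import csv
import io


def read_property_matrix(text, separator=";", pad="_"):
    """Parse separator-delimited rows into {object name: list of properties}.

    Short rows are padded on the right so that all property lists are equally long.
    """
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    rows = [row for row in reader if row]  # skip blank lines
    width = max((len(row) for row in rows), default=0)
    table = {}
    for row in rows:
        name, properties = row[0], row[1:]
        table[name] = properties + [pad] * (width - len(row))
    return table


# Hypothetical input with rows of unequal length.
text = "LangA;p;b;t\nLangB;p;_;t;d\nLangC;p\n"
table = read_property_matrix(text)
print(table["LangA"])
```

Note that padding on the right only equalises the row lengths. It does not put comparable items back into the same position, so it is still better to insert the padding string in the right cells when preparing the spreadsheet.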


Suggested assignments

  1. Start with the dot engine and compare the results for data ranges with maxdist 1, ..., 6. What determines the appearance of subgraphs? What do the subgraphs tell you about the languages in different subgraphs?
  2. Compare the results for the dot and the neato graph engines. Which is more informative? Explain.
  3. Compare results for the other graph engines. Which do you prefer, and for what reason?
  4. Compare the consonant graphs with the graphs for nasal vowels and non-nasal vowels. Are the graphs for these different units comparable or different? Explain the differences.
  5. Take any other dataset (e.g. properties of different people, pets, towns, ...) and examine the graphs.
  6. Apply the methods used in these assignments to a language of your own choice.

Data source for the demo data:

CGI implementation using GraphViz library.
D. Gibbon. Updated Monday, July 7, 2014 12:04:35 PM CEST