2. COMPOSITION OF CORPUS
The final total number of words collected was 52,637. In line with the conventions used in the LOB corpus project, each sample text is assigned to an overall category (indicated by a letter) and identified by a "part number" (indicated by a digit). In addition, each text is given an absolute number to indicate its position in the corpus as a whole.
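As a rough illustration of this numbering scheme, the sketch below splits a text identifier such as B03 into its category letter and part number, and derives the absolute number by counting the texts in the preceding categories. The function names and the `TEXTS_PER_CATEGORY` table are assumptions made for this example; the counts themselves are taken from the listing of texts later in this section.

```python
# Number of texts in each category, in corpus order (counts taken from the
# listing of texts in this section). TEXTS_PER_CATEGORY is a hypothetical name.
TEXTS_PER_CATEGORY = {
    "A": 12, "B": 4, "C": 1, "D": 3, "E": 2, "F": 4,
    "G": 5, "H": 5, "J": 6, "K": 2, "M": 9,
}


def parse_text_id(text_id):
    """Split an identifier such as 'B03' into ('B', 3)."""
    category, part = text_id[0], int(text_id[1:])
    if category not in TEXTS_PER_CATEGORY:
        raise ValueError(f"unknown category: {category!r}")
    if not 1 <= part <= TEXTS_PER_CATEGORY[category]:
        raise ValueError(f"part number out of range: {text_id!r}")
    return category, part


def absolute_number(text_id):
    """Return the position of a text in the corpus as a whole."""
    category, part = parse_text_id(text_id)
    preceding = 0
    # Dictionaries keep insertion order, so this walks the categories
    # in corpus order and stops at the one we are interested in.
    for letter, count in TEXTS_PER_CATEGORY.items():
        if letter == category:
            break
        preceding += count
    return preceding + part


if __name__ == "__main__":
    for text_id in ("A01", "B01", "J06", "M09"):
        print(text_id, absolute_number(text_id))
```

Under these assumptions B01 receives absolute number 13, because category A contains twelve texts.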
The composition of the corpus is shown in Figure 1, with total number of words in each category, and this figure as a percentage of the total number of words in the corpus.
Category | Total | %Total
A Commentary | 9066 | 17
B News broadcast | 5235 | 10
C Lecture type I - aimed at general audience | 4471 | 8
D Lecture type II - aimed at restricted audience | 7451 | 14
E Religious broadcast - including liturgy | 1503 | 3
F Magazine-style reporting | 4710 | 9
G Fiction | 7299 | 14
H Poetry | 1292 | 2
J Dialogue | 6826 | 13
K Propaganda | 1432 | 3
M Miscellaneous | 3352 | 6
Grand total: | 52637 |
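The percentage column can be recomputed from the word counts. The following is a minimal sketch, assuming each percentage is the category's share of the grand total rounded to the nearest whole number; the names `WORDS` and `percentage_shares` are illustrative only.

```python
# Word counts per category, copied from the composition table.
WORDS = {
    "A": 9066, "B": 5235, "C": 4471, "D": 7451, "E": 1503, "F": 4710,
    "G": 7299, "H": 1292, "J": 6826, "K": 1432, "M": 3352,
}


def percentage_shares(counts):
    """Return each category's share of the total, rounded to whole per cent."""
    total = sum(counts.values())
    return {cat: round(100 * n / total) for cat, n in counts.items()}


if __name__ == "__main__":
    print("Grand total:", sum(WORDS.values()))
    for cat, share in percentage_shares(WORDS).items():
        print(cat, WORDS[cat], share)
```

Because each figure is rounded independently, the rounded percentages need not sum to exactly 100 (here they sum to 99).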
Category A - Commentary
News reports on events happening around the world. The texts are more informal than those in category B. In Perspective covers a religious topic; the From our own Correspondent reports all deal with overseas news events.
Category B - News Broadcasts
News reports of current and recent events in Great Britain and abroad. B04 contains one speaker; B01-B03 each contain a main newsreader and additional reporters. The style of the main newsreaders is more formal than that of the reporters.
Category C - Lecture type I
A lecture on economics addressed to the general public and entitled "Needs, Centralism, and Autarchy".
Category D - Lecture type II
Lectures designed to be used as part of an Open University course. D01 covers the Berlin Dada movement in Germany, and so contains some German names and words. D02 is a discussion of theology and science in 18th century France. D03 describes the development in the notation used in the representation of fractions, and contains some simple mathematical formulae.
Category E - Religious Broadcast
Religious services (the hymns have been edited out).
Category F - Magazine-style reporting
Magazine-style in-depth reporting of financial news. Topics covered are: The perks of owning shares; the upgrading of state benefits; listeners' trusts; and Building Society rates.
Category G - Fiction (general)
All these texts are general fiction. G01 is a story aimed at an adult audience - "Through the Tunnel", by Doris Lessing. G02 is aimed at children aged between 5 and 10 - "Lion at School", by Philippa Pearce. G03 and G04 are stories taken from the ELT course book "Streamline English". G05 is aimed at adults - "What shall we do if it rains?", by Graham Seal.
Category H - Poetry
Poetry readings of John Betjeman and Sir Henry Newbolt's poetry. H01-H03 are John Betjeman's readings of his own poems: Eunice, The Last of Her Order, and Harrow-on-the-Hill. H04 and H05 are actors reading Sir Henry Newbolt's poems: The Linnet's Nest, and The Nightjar.
Category J - Dialogue
Dialogues of varying degrees of informality. J01 consists of a radio discussion of notable sports events of 1986. J02-J05 are dialogues contrived to illustrate a particular facet of English (although this is not immediately noticeable) for the Streamline English ELT course. J06 is an informal dialogue between two MA students about working abroad.
Category K - Propaganda
Charity appeals.
Category M - Miscellaneous
M01 is a sample of John Betjeman reading a section of prose: An Unpleasant Nursemaid. M02 and M07 consist of reports on road conditions. M03 and M08 are weather forecasts. M04 and M09 give details of forthcoming programmes on Radio 4. M05 and M06 are speeches delivered at degree ceremonies before the presentation of honorary degrees to Nelson Mandela and Tom Stephenson.
When selecting material for the corpus, we chose only those speakers whose accent was as close to RP as possible. This was relatively easy where material from the BBC programmes From our own Correspondent and the News was concerned, as the BBC itself requires similar standards from its presenters of news or news commentary programmes. Most of the speakers in the corpus have an accent which is close to RP; speakers with a particularly strong, definable regional accent were not included in the corpus.
Of the 53 texts in the corpus, 17 contain female speakers. This represents 30 per cent of the corpus. We have tried where possible to achieve a balance between male and female speakers, and in the highly-stylised texts - poetry, religious broadcast, propaganda, and dialogue - we have managed to do this. The higher percentage of male speakers in the News and Commentary categories reflects the tendency of the BBC to use mainly male speakers in these types of programmes.
The following lists the speakers in the corpus by category.
Cat | Speaker(s)
A01 | Rosemary Hartill
A02 | Gerald Butt
A03 | Jon Silverman
A04 | John Carlin
A05 | James Morgan
A06 | David Smeeton
A07 | Laurie Margolis
A08 | Keith Graves
A09 | Graham Leach
A10 | Alan MacDonald
A11 | Peter Ruff
A12 | Jim Biddulph
B01 | Brian Perkins, Mike Wooldridge, Laurie Margolis, Janet Trewin
B02 | Brian Perkins, James Cox, Clive Small, Kevin Bocquet, Ann Cadwallader, Peter Burden
B03 | David Geary, Peter Smith, Colin Blane, David Davis
B04 | Peter Bragg
C01 | David Henderson
D01 | Dawn Adies
D02 | Dr Robert Fox
D03 | Graham Flegg
E01 | Frances Gumley
E02 | Rev Stephen Oliver
F01 | Louise Botting, Peter Day
F02 | Louise Botting, Vincent Duggleby
F03 | Louise Botting, Juliet Johnson, Frances McDonald
F04 | Kevin Geary, Chris Florence, Harry Peart, Simon Taylor, Linda Spurr, Chris Poole, Mike Costello
G01 | Elizabeth Bell
G02 | John Hollis
G03 | Male speaker
G04 | Female speaker
G05 | Martin Jarvis
H01 | John Betjeman
H02 | John Betjeman
H03 | John Betjeman
H04 | Isabelle Dean
H05 | Martin Jarvis
J01 | Kevin Geary, Martin Fookes, Paddy Feeny
J02 | Male speakers
J03 | Male speakers
J04 | Male speaker, Female speaker
J05 | Male speaker, Female speaker
J06 | Heather Kempson, Rita Green
K01 | Brian Redhead
K02 | Susan Hampshire
M01 | John Betjeman
M02 | Male speaker
M03 | Male speaker
M04 | Male speaker
M05 | Colin Lyas
M06 | Colin Lyas
M07 | Male speaker
M08 | Male speaker
M09 | Male speaker
For material obtained from BBC radio programmes the date of broadcast is given:
Cat | Programme | Date | ||
A01 | In Perspective | November 24th, 1984 | |
A02-A06 | From our own Correspondent | November 24th, 1984 | |
A07-A12 | From our own Correspondent | June 22nd, 1985 | |
B01 | News (R4) | November 24th, 1984 | |
B02 | News (R4) | June 22nd, 1985 | |
B03 | News (R2) | December 5th, 1985 | |
B04 | News (R3) | January 14th, 1986 | |
C01 | The Reith Lectures III | November 20th, 1985 | |
E01 | Daily Service | November 26th, 1985 | |
E02 | Daily Service | November 27th, 1985 | |
F01 | Money Box | November 24th, 1984 | |
F02-F03 | Money Box | June 22nd, 1985 | |
F04 | Review of the Year | December, 1986 | |
G01 | Story Time | June 25th, 1985 | |
G02 | Listening & Reading | January 28th, 1987 | |
G05 | Morning Story | November 26th, 1986 | |
H04-H05 | Time for Verse | November 26th, 1986 [1] | |
J01 | Review of the Year | December, 1986 | |
K01 | Week's Good Cause | January 18th, 1987 | |
K02 | Week's Good Cause | January 25th, 1987 | |
M02 | Motoring News | January 18th, 1987 | |
M03 | Weather Forecast | January 18th, 1987 | |
M04 | Programme News | January 18th, 1987 | |
M07 | Travel Roundup | January 25th, 1987 | |
M08 | Weather Forecast | January 25th, 1987 | |
M09 | Programme News | January 25th, 1987 |
The samples from "Betjeman Reads Betjeman" are all dated as "since 1954":
H01 | Eunice | ||
H02 | The Last of Her Order | ||
H03 | Harrow-on-the-Hill | ||
M01 | An Unpleasant Nursemaid |
For material prepared at the Media Services Unit, University of Lancaster, actual recording date is given:
J06 | Kempson & Green dialogue | March 11th, 1987 | |
M05 | Nelson Mandela speech | January 29th, 1987 | |
M06 | Tom Stephenson speech | January 29th, 1987 |
The Streamline English texts were first published in 1982. The relevant categories are:
G03 | SE Unit 10 - "A funny thing happened to me ..." | |
G04 | SE Unit 19 - "Night flight" | |
J02 | SE Unit 16 - "Inside Story" | |
J03 | SE Unit 25 - "Murder at Gurney Manor" | |
J04 | SE Unit 72 - "Getting things done" | |
J05 | SE Unit 75 - "Messages" |
The material obtained from the Open University unfortunately contained no information on date of composition or publication. The categories are:
D01 | OU Modern Art | |
D02 | OU Science & Belief in 18th C. France | |
D03 | OU Development of Fractions |
The total duration of the corpus is 339 minutes 18 seconds. The average length of a sample is about 6 minutes, but an individual text may vary greatly from this. Texts are not of equal length because we wanted to study complete sections of speech: a predetermined cut-off point based on number of words or length of extract would have resulted in an unnatural-sounding endpoint to a sample of speech. The following gives details of extract lengths in the corpus.
Cat | m:s | Category total
A01 | 15:00 |
A02 | 4:28 |
A03 | 4:01 |
A04 | 5:41 |
A05 | 4:48 |
A06 | 4:32 |
A07 | 3:54 |
A08 | 4:08 |
A09 | 5:12 |
A10 | 4:26 |
A11 | 4:15 |
A12 | 4:05 | Tot A 64:30
B01 | 9:32 |
B02 | 9:40 |
B03 | 5:00 |
B04 | 5:00 | Tot B 29:12
C01 | 30:00 | Tot C 30:00
D01 | 19:00 |
D02 | 19:00 |
D03 | 19:00 | Tot D 57:00
E01 | 6:48 |
E02 | 4:30 | Tot E 11:18
F01 | 3:48 |
F02 | 3:32 |
F03 | 4:54 |
F04 | 13:16 | Tot F 25:30
G01 | 20:00 |
G02 | 8:56 |
G03 | 2:39 |
G04 | 5:30 |
G05 | 9:20 | Tot G 46:25
H01 | 1:41 |
H02 | 2:03 |
H03 | 1:00 |
H04 | 2:59 |
H05 | 1:17 | Tot H 9:00
J01 | 7:58 |
J02 | 1:31 |
J03 | 2:04 |
J04 | 0:27 |
J05 | 1:28 |
J06 | 24:00 | Tot J 37:28
K01 | 4:32 |
K02 | 4:09 | Tot K 8:41
M01 | 0:41 |
M02 | 1:10 |
M03 | 0:43 |
M04 | 1:40 |
M05 | 4:33 |
M06 | 7:05 |
M07 | 1:06 |
M08 | 0:47 |
M09 | 2:24 | Tot M 20:14
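The total duration and the average sample length can be checked by adding up the category totals. This is a minimal sketch, assuming durations are written as minutes:seconds strings; the helper names `to_seconds`, `to_mmss`, `total_duration` and `average_minutes` are illustrative only.

```python
# Category totals (minutes:seconds), copied from the table of extract lengths.
CATEGORY_TOTALS = {
    "A": "64:30", "B": "29:12", "C": "30:00", "D": "57:00", "E": "11:18",
    "F": "25:30", "G": "46:25", "H": "9:00", "J": "37:28", "K": "8:41",
    "M": "20:14",
}

NUMBER_OF_TEXTS = 53  # number of texts in the corpus


def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds):
    """Convert a number of seconds back to a 'minutes:seconds' string."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


def total_duration(durations):
    """Sum an iterable of 'minutes:seconds' strings and return the same format."""
    return to_mmss(sum(to_seconds(d) for d in durations))


def average_minutes(durations, count):
    """Return the mean duration in minutes over `count` texts."""
    return sum(to_seconds(d) for d in durations) / count / 60


if __name__ == "__main__":
    print("Total:", total_duration(CATEGORY_TOTALS.values()))
    print("Average (min):",
          round(average_minutes(CATEGORY_TOTALS.values(), NUMBER_OF_TEXTS), 1))
```

The same `total_duration` helper can be applied to the individual extract lengths of a category to check its subtotal, for example the four category B extracts.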
The following samples were obtained with the permission of the BBC and are covered by a copyright contract with them:
A01 | In Perspective | |
A02-A12 | From our own Correspondent | |
B01-B04 | News | |
C01 | The Reith Lectures - III | |
E01-E02 | Daily Service | |
F01-F03 | Money Box | |
F04 | Review of the Year | |
G01 | Story Time | |
G02 | Listening & Reading | |
G05 | Morning Story | |
H04-H05 | Time for Verse | |
J01 | Review of the Year | |
K01-K02 | Week's Good Cause | |
M02 | Motoring News | |
M03 | Weather Forecast | |
M04 | Programme News | |
M07 | Travel Roundup | |
M08 | Weather Forecast | |
M09 | Programme News |
In addition to the copyright agreement with the BBC, permission to use material had to be obtained from any individuals not employed by the BBC.
The Open University Educational Enterprises Ltd gave permission for the inclusion of samples D01-D03.
Oxford University Press gave permission for the Streamline English material to be included - texts G03, G04, J02, J03, J04, and J05.
Decca International gave permission for the samples of John Betjeman's work to be included - texts H01, H02, H03 and M01.
The remaining material was recorded at the University of Lancaster's Media Services Unit using volunteer speakers - M05 and M06 speeches, and the J06 dialogue.
The following table gives details of programme title (where relevant), speaker(s), and number of words for each text in the corpus.
Cat | A | Programme title | Speaker(s) | Words
Category A - Commentary: 9066 words
A01 | 1 | In Perspective | Rosemary Hartill | 793
A02 | 2 | From our own Correspondent | Gerald Butt | 734
A03 | 3 | From our own Correspondent | Jon Silverman | 620
A04 | 4 | From our own Correspondent | John Carlin | 977
A05 | 5 | From our own Correspondent | James Morgan | 804
A06 | 6 | From our own Correspondent | David Smeeton | 828
A07 | 7 | From our own Correspondent | Laurie Margolis | 716
A08 | 8 | From our own Correspondent | Keith Graves | 618
A09 | 9 | From our own Correspondent | Graham Leach | 787
A10 | 10 | From our own Correspondent | Alan MacDonald | 800
A11 | 11 | From our own Correspondent | Peter Ruff | 785
A12 | 12 | From our own Correspondent | Jim Biddulph | 604
Category B - News Broadcasts: 5235 words
B01 | 13 | Radio 4 News | Brian Perkins | 1722
B02 | 14 | Radio 4 News | Brian Perkins | 1720
B03 | 15 | Radio 2 News | David Geary | 940
B04 | 16 | Radio 3 News | Peter Bragg | 853
Category C - Lecture Type I: 4471 words
C01 | 17 | The Reith Lectures - III | David Henderson | 4471
Category D - Lecture Type II: 7451 words
D01 | 18 | OU Modern Art | Dawn Adies | 2410
D02 | 19 | OU Science & Belief | Dr Robert Fox | 2434
D03 | 20 | OU Development of Fractions | Graham Flegg | 2607
Category E - Religious Broadcast: 1503 words
E01 | 21 | Daily Service | Frances Gumley | 915
E02 | 22 | Daily Service | Rev Stephen Oliver | 588
Category F - Magazine-style reporting: 4710 words
F01 | 23 | Money Box | Louise Botting | 671
F02 | 24 | Money Box | Louise Botting | 667
F03 | 25 | Money Box | Louise Botting | 850
F04 | 26 | Review of the Year | Kevin Geary | 2522
Category G - Fiction: 7299 words
G01 | 27 | Story Time | Elizabeth Bell | 3163
G02 | 28 | Listening & Reading | John Hollis | 1221
G03 | 29 | "A funny thing happened..." | Male | 442
G04 | 30 | "Night Flight" | Female | 810
G05 | 31 | Morning Story | Martin Jarvis | 1663
Category H - Poetry: 1292 words
H01 | 32 | "Eunice" | John Betjeman | 248
H02 | 33 | "The Last of Her Order" | John Betjeman | 286
H03 | 34 | "Harrow-on-the-Hill" | John Betjeman | 157
H04 | 35 | "The Linnet's Nest" | Isabelle Dean | 405
H05 | 36 | "The Nightjar" | Martin Jarvis | 196
Category J - Dialogue: 6826 words
J01 | 37 | Review of the Year | Kevin Geary | 1674
J02 | 38 | "Inside Story" | Males | 279
J03 | 39 | "Murder at Gurney Manor" | Males | 375
J04 | 40 | "Getting things done" | Male & Female | 74
J05 | 41 | "Messages" | Male & Female | 277
J06 | 42 | Kempson & Green dialogue | Rita Green | 4147
Category K - Propaganda: 1432 words
K01 | 43 | Week's Good Cause | Brian Redhead | 798
K02 | 44 | Week's Good Cause | Susan Hampshire | 634
Category M - Miscellaneous: 3352 words
M01 | 45 | "An Unpleasant Nursemaid" | John Betjeman | 93
M02 | 46 | Motoring News | Male | 200
M03 | 47 | Weather Forecast | Male | 140
M04 | 48 | Programme News | Male | 298
M05 | 49 | Nelson Mandela speech | Colin Lyas | 738
M06 | 50 | Tom Stephenson speech | Colin Lyas | 1112
M07 | 51 | Travel Roundup | Male | 187
M08 | 52 | Weather Forecast | Male | 143
M09 | 53 | Programme News | Male | 441
Column A = Absolute number of text in corpus
[1] Date of composition:
H04 The Linnet's Nest - May, 1924
H05 The Nightjar - May, 1925