2. COMPOSITION OF CORPUS

2.1 Breakdown into categories

2.2 Speakers in the Corpus

2.3 Dates of Texts

2.4 Duration of Extracts

2.5 Source of material

2.6 SEC text details

The final total number of words collected was 52,637. In line with the conventions used in the LOB corpus project, each sample text is assigned to an overall category (indicated by a letter) and identified by a "part number" (indicated by a digit). In addition, each text is given an absolute number to indicate its position in the corpus as a whole.
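
As an illustration of this numbering scheme, the sketch below derives a text's absolute number from its category letter and part number. It assumes that absolute numbers run consecutively through the categories in the order A to M, and that the number of texts in each category is as listed in section 2.6. The names `TEXTS_PER_CATEGORY` and `absolute_number` are hypothetical and are not part of the corpus materials.

```python
# Number of texts in each category, in corpus order (taken from section 2.6).
# Python dictionaries preserve insertion order, which the offset loop relies on.
TEXTS_PER_CATEGORY = {
    "A": 12, "B": 4, "C": 1, "D": 3, "E": 2, "F": 4,
    "G": 5, "H": 5, "J": 6, "K": 2, "M": 9,
}


def absolute_number(text_id):
    """Map an identifier such as 'B01' to its position in the whole corpus."""
    category, part = text_id[0], int(text_id[1:])
    if category not in TEXTS_PER_CATEGORY:
        raise ValueError(f"unknown category: {category}")
    if not 1 <= part <= TEXTS_PER_CATEGORY[category]:
        raise ValueError(f"part number out of range: {text_id}")
    # Count every text in the categories that precede this one.
    offset = 0
    for cat, count in TEXTS_PER_CATEGORY.items():
        if cat == category:
            break
        offset += count
    return offset + part


print(absolute_number("B01"))
```

Under these assumptions B01 is the 13th text in the corpus, which agrees with the entry for B01 in section 2.6.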

The composition of the corpus is shown in Figure 1, with total number of words in each category, and this figure as a percentage of the total number of words in the corpus.

Category                                            Total   %Total
A  Commentary                                        9066       17
B  News broadcast                                    5235       10
C  Lecture type I - aimed at general audience        4471        8
D  Lecture type II - aimed at restricted audience    7451       14
E  Religious broadcast - including liturgy           1503        3
F  Magazine-style reporting                          4710        9
G  Fiction                                           7299       14
H  Poetry                                            1292        2
J  Dialogue                                          6826       13
K  Propaganda                                        1432        3
M  Miscellaneous                                     3352        6
Grand total:                                        52637
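
The percentage column can be checked against the word counts. The following is a minimal sketch, assuming each percentage is the category's share of the grand total rounded to the nearest whole number. The names `CATEGORY_WORDS` and `percent_of_total` are illustrative only.

```python
# Word counts per category, copied from Figure 1.
CATEGORY_WORDS = {
    "A": 9066, "B": 5235, "C": 4471, "D": 7451, "E": 1503, "F": 4710,
    "G": 7299, "H": 1292, "J": 6826, "K": 1432, "M": 3352,
}


def percent_of_total(counts):
    """Return each category's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {cat: round(100 * n / total) for cat, n in counts.items()}


total_words = sum(CATEGORY_WORDS.values())
shares = percent_of_total(CATEGORY_WORDS)
for cat, pct in shares.items():
    print(f"{cat}  {CATEGORY_WORDS[cat]:>5}  {pct:>3}")
print("Grand total:", total_words)
```

Because each figure is rounded independently, the percentages sum to 99 rather than 100.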

2.1 Breakdown into categories

Category A - Commentary

A01 In Perspective
A02 - A12 From our own Correspondent

News reports on events happening around the world. The texts are more informal than those in category B. In Perspective covers a religious topic; the From our own Correspondent reports all deal with overseas news events:

A02 The PNC meetings in Lebanon
A03 Westmorland and Sharon suing Time magazine
A04 The conflict in El Salvador
A05 The economic climate in Rumania
A06 The plight of Turks living in Huttenheim, Germany
A07 The hijack of the TWA passengers and the Shi'ites' release
A08 Security checks at Athens airport
A09 The new government in Namibia
A10 The plight of the Tamil refugees
A11 Report on Gorbachev and his government
A12 The financial state of banks in Hong Kong

Category B - News Broadcasts

B01 - B02 Radio 4 News
B03 Radio 2 News
B04 Radio 3 News

News reports of current and recent events in Great Britain and abroad. B04 contains one speaker; B01-B03 each contain a main newsreader and additional reporters. The style of the main newsreaders is more formal than that of the reporters.

Category C - Lecture type I

C01 The Reith Lectures - III

A lecture on economics addressed to the general public and entitled "Needs, Centralism, and Autarchy".

Category D - Lecture type II

D01 Open University - Modern Art
D02 Open University - Science & Belief
D03 Open University - Development of Fractions

Lectures designed to be used as part of an Open University course. D01 covers the Berlin Dada movement in Germany, and so contains some German names and words. D02 is a discussion of theology and science in 18th century France. D03 describes the development in the notation used in the representation of fractions, and contains some simple mathematical formulae.

Category E - Religious Broadcast

E01 - E02 Daily Service

Religious services (the hymns have been edited out).

Category F - Magazine-style reporting

F01 - F03 Money Box
F04 Review of the Year

Magazine-style in-depth reporting of financial news. Topics covered are: The perks of owning shares; the upgrading of state benefits; listeners' trusts; and Building Society rates.

Category G - Fiction (general)

G01 Story Time
G02 Listening & Reading
G03 - G04 Streamline English course samples
G05 Morning Story

All these texts are general fiction. G01 is a story aimed at an adult audience - "Through the Tunnel", by Doris Lessing. G02 is aimed at children aged between 5 and 10 - "Lion at School", by Philippa Pearce. G03 and G04 are stories taken from the ELT course book "Streamline English". G05 is aimed at adults - "What shall we do if it rains?", by Graham Seal.

Category H - Poetry

H01 - H03 John Betjeman
H04 - H05 Time for Verse

Readings of poems by John Betjeman and Sir Henry Newbolt. H01-H03 are John Betjeman's readings of his own poems: Eunice, The Last of Her Order, and Harrow-on-the-Hill. H04 and H05 are actors reading Sir Henry Newbolt's poems: The Linnet's Nest, and The Nightjar.

Category J - Dialogue

J01 Review of the Year
J02 - J05 Streamline English course samples
J06 Kempson & Green dialogue

Dialogues of varying degrees of informality. J01 consists of a radio discussion of notable sports events of 1986. J02-J05 are dialogues contrived to illustrate a particular facet of English (although this is not immediately noticeable) for the Streamline English ELT course. J06 is an informal dialogue between two MA students about working abroad.

Category K - Propaganda

K01 - K02 Week's Good Cause

Charity appeals.

Category M - Miscellaneous

M01 John Betjeman
M02 Motoring News
M03 Weather Forecast
M04 Programme News
M05 - M06 Oratory
M07 Travel Roundup
M08 Weather Forecast
M09 Programme News

M01 is a sample of John Betjeman reading a section of prose: An Unpleasant Nursemaid. M02 and M07 consist of reports on road conditions. M03 and M08 are weather forecasts. M04 and M09 give details of forthcoming programmes on Radio 4. M05 and M06 are speeches delivered at degree ceremonies before the presentation of honorary degrees to Nelson Mandela and Tom Stephenson.

 

2.2 Speakers in the Corpus

When selecting material for the corpus, we chose only those speakers whose accent was as close to RP as possible. This was relatively easy where material from the BBC programmes From our own Correspondent and the News was concerned, as the BBC themselves require similar standards from their presenters of news or news commentary programmes. Most of the speakers in the corpus have an accent which is close to RP; if a speaker used a particularly strong definable regional accent, they were not included in the corpus.

Of the 53 texts in the corpus, 17 contain female speakers. This represents 30 per cent of the corpus. We have tried where possible to achieve a balance between male and female speakers, and in the highly stylised texts (poetry, religious broadcast, propaganda, and dialogue) we have managed to do this. The higher percentage of male speakers in the News and Commentary categories reflects the tendency of the BBC to use mainly male speakers in these types of programmes.

The following lists the speakers in the corpus by category.

A01   Rosemary Hartill
A02   Gerald Butt
A03   Jon Silverman
A04   John Carlin
A05   James Morgan
A06   David Smeeton
A07   Laurie Margolis
A08   Keith Graves
A09   Graham Leach
A10   Alan MacDonald
A11   Peter Ruff
A12   Jim Biddulph

B01   Brian Perkins, Mike Wooldridge, Laurie Margolis, Janet Trewin
B02   Brian Perkins, James Cox, Clive Small, Kevin Bocquet, Ann Cadwallader, Peter Burden
B03   David Geary, Peter Smith, Colin Blane, David Davis
B04   Peter Bragg

C01   David Henderson

D01   Dawn Adies
D02   Dr Robert Fox
D03   Graham Flegg

E01   Frances Gumley
E02   Rev Stephen Oliver

F01   Louise Botting, Peter Day
F02   Louise Botting, Vincent Duggleby
F03   Louise Botting, Juliet Johnson, Frances McDonald
F04   Kevin Geary, Chris Florence, Harry Peart, Simon Taylor, Linda Spurr, Chris Poole, Mike Costello

G01   Elizabeth Bell
G02   John Hollis
G03   Male speaker
G04   Female speaker
G05   Martin Jarvis

H01   John Betjeman
H02   John Betjeman
H03   John Betjeman
H04   Isabelle Dean
H05   Martin Jarvis

J01   Kevin Geary, Martin Fookes, Paddy Feeny
J02   Male speakers
J03   Male speakers
J04   Male speaker, Female speaker
J05   Male speaker, Female speaker
J06   Heather Kempson, Rita Green

K01   Brian Redhead
K02   Susan Hampshire

M01   John Betjeman
M02   Male speaker
M03   Male speaker
M04   Male speaker
M05   Colin Lyas
M06   Colin Lyas
M07   Male speaker
M08   Male speaker
M09   Male speaker

2.3 Dates of Texts

For material obtained from BBC radio programmes the date of broadcast is given:

Cat       Programme                     Date
A01       In Perspective                November 24th, 1984
A02-A06   From our own Correspondent    November 24th, 1984
A07-A12   From our own Correspondent    June 22nd, 1985
B01       News (R4)                     November 24th, 1984
B02       News (R4)                     June 22nd, 1985
B03       News (R2)                     December 5th, 1985
B04       News (R3)                     January 14th, 1986
C01       The Reith Lectures - III      November 20th, 1985
E01       Daily Service                 November 26th, 1985
E02       Daily Service                 November 27th, 1985
F01       Money Box                     November 24th, 1984
F02-F03   Money Box                     June 22nd, 1985
F04       Review of the Year            December, 1986
G01       Story Time                    June 25th, 1985
G02       Listening & Reading           January 28th, 1987
G05       Morning Story                 November 26th, 1986
H04-H05   Time for Verse                November 26th, 1986 (note 1)
J01       Review of the Year            December, 1986
K01       Week's Good Cause             January 18th, 1987
K02       Week's Good Cause             January 25th, 1987
M02       Motoring News                 January 18th, 1987
M03       Weather Forecast              January 18th, 1987
M04       Programme News                January 18th, 1987
M07       Travel Roundup                January 25th, 1987
M08       Weather Forecast              January 25th, 1987
M09       Programme News                January 25th, 1987

The samples from "Betjeman Reads Betjeman" are all dated as "since 1954":

H01   Eunice  
H02   The Last of Her Order  
H03   Harrow-on-the-Hill  
M01   An Unpleasant Nursemaid  

For material prepared at the Media Services Unit, University of Lancaster, actual recording date is given:

J06 Kempson & Green dialogue   March 11th, 1987
M05 Nelson Mandela speech   January 29th, 1987
M06 Tom Stephenson speech   January 29th, 1987

 

The Streamline English texts were first published in 1982. The relevant texts are:

G03   SE Unit 10 - "A funny thing happened to me ..."
G04   SE Unit 19 - "Night flight"
J02   SE Unit 16 - "Inside Story"
J03   SE Unit 25 - "Murder at Gurney Manor"
J04   SE Unit 72 - "Getting things done"
J05   SE Unit 75 - "Messages"

The material obtained from the Open University unfortunately contained no information on date of composition or publication. The categories are:

D01   OU Modern Art
D02   OU Science & Belief in 18th-century France
D03   OU Development of Fractions

2.4 Duration of extracts

The total duration of the corpus is 339 minutes 18 seconds. The average length of a sample is about 6 minutes, but an individual text may vary greatly from this. Texts are not of equal length because we wanted to study complete sections of speech: a predetermined cut-off point based on number of words or length of extract would have resulted in an unnatural-sounding endpoint to a sample of speech. The following gives details of extract lengths in the corpus.

Cat     m:s    Category total
A01   15:00
A02    4:28
A03    4:01
A04    5:41
A05    4:48
A06    4:32
A07    3:54
A08    4:08
A09    5:12
A10    4:26
A11    4:15
A12    4:05    [Tot A 64:30]

B01    9:32
B02    9:40
B03    5:00
B04    5:00    [Tot B 29:12]

C01   30:00    [Tot C 30:00]

D01   19:00
D02   19:00
D03   19:00    [Tot D 57:00]

E01    6:48
E02    4:30    [Tot E 11:18]

F01    3:48
F02    3:32
F03    4:54
F04   13:16    [Tot F 25:30]

G01   20:00
G02    8:56
G03    2:39
G04    5:30
G05    9:20    [Tot G 46:25]

H01    1:41
H02    2:03
H03    1:00
H04    2:59
H05    1:17    [Tot H 9:00]

J01    7:58
J02    1:31
J03    2:04
J04    0:27
J05    1:28
J06   24:00    [Tot J 37:28]

K01    4:32
K02    4:09    [Tot K 8:41]

M01    0:41
M02    1:10
M03    0:43
M04    1:40
M05    4:33
M06    7:05
M07    1:06
M08    0:47
M09    2:24    [Tot M 20:14]
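
The total duration can be cross-checked from the category totals. This is a sketch only: it takes the bracketed category totals as printed above and assumes the corpus contains 53 texts (section 2.2). The helper names `to_seconds` and `to_mmss` are introduced here purely for illustration.

```python
# Category totals as printed in the duration table (minutes:seconds).
CATEGORY_DURATIONS = {
    "A": "64:30", "B": "29:12", "C": "30:00", "D": "57:00", "E": "11:18",
    "F": "25:30", "G": "46:25", "H": "9:00", "J": "37:28", "K": "8:41",
    "M": "20:14",
}
NUM_TEXTS = 53  # number of texts in the corpus, as stated in section 2.2


def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a whole number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds):
    """Format a whole number of seconds as 'minutes:seconds'."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


corpus_seconds = sum(to_seconds(d) for d in CATEGORY_DURATIONS.values())
mean_seconds = corpus_seconds // NUM_TEXTS  # rounded down to a whole second

print("Total duration:", to_mmss(corpus_seconds))
print("Mean per text: ", to_mmss(mean_seconds))
```

The category totals sum to 339:18, matching the stated total, and the mean works out at roughly 6 minutes 24 seconds per text, consistent with the average of about 6 minutes quoted above.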

2.5 Source of Material

The following samples were obtained with the permission of the BBC and are covered by a copyright agreement with them:

A01   In Perspective
A02-A12   From our own Correspondent
B01-B04   News
C01   The Reith Lectures - III
E01-E02   Daily Service
F01-F03   Money Box
F04   Review of the Year
G01   Story Time
G02   Listening & Reading
G05   Morning Story
H04-H05   Time for Verse
J01   Review of the Year
K01-K02   Week's Good Cause
M02   Motoring News
M03   Weather Forecast
M04   Programme News
M07   Travel Roundup
M08   Weather Forecast
M09   Programme News

In addition to the copyright agreement with the BBC, permission to use material had to be obtained from any individuals not employed by the BBC.

The Open University Educational Enterprises Ltd gave permission for the inclusion of samples D01-D03.

Oxford University Press gave permission for the Streamline English material to be included - texts G03, G04, J02, J03, J04, and J05.

Decca International gave permission for the samples of John Betjeman's work to be included - texts H01, H02, H03 and M01.

The remaining material was recorded at the University of Lancaster's Media Services Unit using volunteer speakers - M05 and M06 speeches, and the J06 dialogue.

2.6 SEC Text Details

The following table gives details of programme title (where relevant), speaker(s), and number of words for each text in the corpus.

Cat   No.   Programme title               Speaker(s)          Words

Category A: Commentary: 9066 words

A01   1     In Perspective                Rosemary Hartill      793
A02   2     From our own Correspondent    Gerald Butt           734
A03   3     From our own Correspondent    Jon Silverman         620
A04   4     From our own Correspondent    John Carlin           977
A05   5     From our own Correspondent    James Morgan          804
A06   6     From our own Correspondent    David Smeeton         828
A07   7     From our own Correspondent    Laurie Margolis       716
A08   8     From our own Correspondent    Keith Graves          618
A09   9     From our own Correspondent    Graham Leach          787
A10   10    From our own Correspondent    Alan MacDonald        800
A11   11    From our own Correspondent    Peter Ruff            785
A12   12    From our own Correspondent    Jim Biddulph          604

Category B: News Broadcasts: 5235 words

B01   13    Radio 4 News                  Brian Perkins        1722
                                          Mike Wooldridge
                                          Laurie Margolis
                                          Janet Trewin
B02   14    Radio 4 News                  Brian Perkins        1720
                                          James Cox
                                          Clive Small
                                          Kevin Bocquet
                                          Ann Cadwallader
                                          Peter Burden
B03   15    Radio 2 News                  David Geary           940
                                          Peter Smith
                                          Colin Blane
                                          David Davis
B04   16    Radio 3 News                  Peter Bragg           853

Category C: Lecture type I: 4471 words

C01   17    The Reith Lectures - III      David Henderson      4471

Category D: Lecture type II: 7451 words

D01   18    OU Modern Art                 Dawn Adies           2410
D02   19    OU Science & Belief           Dr Robert Fox        2434
D03   20    OU Development of Fractions   Graham Flegg         2607

Category E: Religious Broadcast: 1503 words

E01   21    Daily Service                 Frances Gumley        915
E02   22    Daily Service                 Rev Stephen Oliver    588

Category F: Magazine-style reporting: 4710 words

F01   23    Money Box                     Louise Botting        671
                                          Peter Day
F02   24    Money Box                     Louise Botting        667
                                          Vincent Duggleby
F03   25    Money Box                     Louise Botting        850
                                          Juliet Johnson
                                          Frances MacDonald
F04   26    Review of the Year            Kevin Geary          2522
                                          Chris Florence
                                          Harry Peart
                                          Simon Taylor
                                          Linda Spurr
                                          Chris Poole
                                          Mike Costello

Category G: Fiction: 7299 words

G01   27    Story Time                    Elizabeth Bell       3163
G02   28    Listening & Reading           John Hollis          1221
G03   29    "A funny thing happened..."   Male                  442
G04   30    "Night Flight"                Female                810
G05   31    Morning Story                 Martin Jarvis        1663

Category H: Poetry: 1292 words

H01   32    "Eunice"                      John Betjeman         248
H02   33    "The Last of Her Order"       John Betjeman         286
H03   34    "Harrow-on-the-Hill"          John Betjeman         157
H04   35    "The Linnet's Nest"           Isabelle Dean         405
H05   36    "The Nightjar"                Martin Jarvis         196

Category J: Dialogue: 6826 words

J01   37    Review of the Year            Kevin Geary          1674
                                          Martin Fookes
                                          Paddy Feeny
J02   38    "Inside Story"                Males                 279
J03   39    "Murder at Gurney Manor"      Males                 375
J04   40    "Getting things done"         Male & Female          74
J05   41    "Messages"                    Male & Female         277
J06   42    Kempson & Green dialogue      Rita Green           4147
                                          Heather Kempson

Category K: Propaganda: 1432 words

K01   43    Week's Good Cause             Brian Redhead         798
K02   44    Week's Good Cause             Susan Hampshire       634

Category M: Miscellaneous: 3352 words

M01   45    "An Unpleasant Nursemaid"     John Betjeman          93
M02   46    Motoring News                 Male                  200
M03   47    Weather Forecast              Male                  140
M04   48    Programme News                Male                  298
M05   49    Nelson Mandela speech         Colin Lyas            738
M06   50    Tom Stephenson speech         Colin Lyas           1112
M07   51    Travel Roundup                Male                  187
M08   52    Weather Forecast              Male                  143
M09   53    Programme News                Male                  441

No. = Absolute number of text in corpus


Note 1. Date of composition:
H04 The Linnet's Nest - May, 1924
H05 The Nightjar - May, 1925