1. Who was involved?
The building of this Corpus was a collaborative effort led by Dr Yunhyong Kim and Prof Seamus Ross at the Humanities Advanced Technology and Information Institute at the University of Glasgow between 2005 and 2008. It was part of the Digital Curation Centre initiative on automated metadata extraction from text documents. The PICS genre classification system that facilitated the collection of documents was created by Andrew McHugh and Adam Rusbridge. The documents themselves were collected by students of the University of Glasgow. The administrative organisation and analysis of the Corpus was supported by research assistants Laura Brouard and Vera Berninger. Two people were employed to perform large-scale reclassification of the documents, and other volunteer labellers provided small-scale reclassification of some documents.
2. What does it contain?
The KRYS I Corpus contains over 6300 PDF documents classified into one of 70 genres (see table 1.1 for the genre
classification schema).
Table 1.1. Scope of genres under examination
| Genre Group | Genre |
| --- | --- |
| Book | Academic Monograph; Poetry Book; Book of Fiction; Other Book; Handbook |
| Article | Abstract; Magazine Article; Scientific Article; Other Research Article; News Report |
| Short Composition | Poem; Fictional Piece; Dramatic Script; Essay; Short Biographical Sketch; Review |
| Serial | Periodicals (Newspaper, Magazine); Journals; Conference Proceedings; Newsletter |
| Correspondence | Letter; Memo; Telegram |
| Treatise | Thesis; Business or Operational Report; Technical Report; Miscellaneous Report; Technical Manual |
| Information Structure | List; Catalogue; Raw Data; Table/Calendar; Menu; Form; Programme; Questionnaire; FAQ |
| Evidential Document | Minutes; Legal Proceedings; Financial Record; Receipt; Slips; Contract |
| Visually Dominant Document | Artwork; Card; Chart; Graph; Diagram; Sheet Music; Poster; Comics |
| Other Functional Document | Guideline; Regulations; Manual; Grant or Project Proposal; Legal Appeal, Proposal or Order; Job, Course or Project Description; Product or Application Description; Advertisement; Announcement; Appeal or Propaganda; Exam or Worksheet; Fact Sheet; Forum Discussion; Interview; Notice; Resume/CV; Slides; Speech Transcript |
The original Corpus contained 6494 documents, but some were removed at the request of copyright owners.
The Corpus was collected from the internet by students of the University of Glasgow using the following guidelines. Students were:
- assigned genre classes from a pre-constructed set of seventy genres;
- asked to retrieve as many documents as possible belonging to their assigned genre, up to a maximum target of 100;
- asked to retrieve only documents in PDF format and in English;
- asked to give reasons for including these particular samples;
- not given any definitions of the genre classes apart from the genre label.
A large number of documents were erroneously submitted by students to the KRYS I corpus. For instance, there were:
- documents that are not examples of the genre but whose topic relates to the genre (e.g. instead of actual emails, research articles about email were found labelled as email) [Error type I];
- empty templates included as examples of the genre (e.g. instead of selecting ‘actual’ receipts, empty receipt forms were found labelled as receipts) [Error type II];
- entire magazines, conference proceedings or journals included as research articles, and vice versa [Error type III].
These documents were not removed from the database for two reasons:
- removing them would introduce the bias of the remover into the corpus;
- false classifications would be filtered out by the reclassification.
Further analysis of this “false” classification is expected to be made available in an upcoming publication. A reference will be made available on this web site.
About 5500 of the documents submitted by the students were reclassified by one or more additional assessors. This was done under the following conditions:
- The assessors were not allowed to confer with each other in making their classifications.
- The documents were presented to each secretary in a random order without the original label.
- No definition of the genres was given before the reclassification.
A Human Labelling Experiment was carried out to measure the agreement between the initial class assigned to each document by the students and the classes subsequently assigned to the same document by two secretaries, who were asked to reclassify the documents without knowledge of the prior classifications. The results showed that the labels differed in many cases, as shown in table 1.2.
Table 1.2. Human agreement analysis
| Labeller group | Agreed |
| --- | --- |
| Student & Secretary I | 2,745* |
| Student & Secretary II | 2,852* |
| Secretary I & II | 2,422* |
| All Three Labellers | 2,008* |

* out of 5,305
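To make the figures in table 1.2 easier to compare, the counts can be converted into percentages of the 5,305 documents. The following is a minimal sketch of that arithmetic only; the variable and function names are illustrative and are not part of the Corpus distribution.

```python
# Agreement counts copied from table 1.2.
# TOTAL is the denominator stated with the table ("out of 5,305").
TOTAL = 5305
agreed = {
    "Student & Secretary I": 2745,
    "Student & Secretary II": 2852,
    "Secretary I & II": 2422,
    "All Three Labellers": 2008,
}


def agreement_rates(counts, total):
    """Return each agreement count as a percentage of the total."""
    return {group: 100.0 * n / total for group, n in counts.items()}


rates = agreement_rates(agreed, TOTAL)
for group, pct in rates.items():
    print(f"{group}: {pct:.1f}%")
```

On these figures, pairwise agreement lies between roughly 46% and 54%, and all three labellers agreed on roughly 38% of the documents.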
The entire set of labels attributed to each document by the labellers will be distributed with the Corpus.
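Given a full set of labels per document, agreement counts of the kind reported in table 1.2 could be recomputed along the following lines. This is only a sketch under stated assumptions: the format in which the labels are distributed is not described here, so the record layout, the document identifiers, the labeller keys and the sample labels below are all hypothetical.

```python
from itertools import combinations

# Hypothetical records: document id -> label assigned by each labeller.
# The real distribution format of the Corpus labels may differ.
labels = {
    "doc001": {"student": "Receipt", "secretary_1": "Form", "secretary_2": "Form"},
    "doc002": {"student": "Poem", "secretary_1": "Poem", "secretary_2": "Poem"},
    "doc003": {"student": "Memo", "secretary_1": "Letter", "secretary_2": "Memo"},
}


def count_agreement(records, labellers):
    """Count documents on which all the given labellers assigned the same label."""
    return sum(
        1
        for doc in records.values()
        if len({doc[name] for name in labellers}) == 1
    )


names = ["student", "secretary_1", "secretary_2"]

# Agreement for every pair of labellers, then for all three together.
pairwise = {pair: count_agreement(labels, pair) for pair in combinations(names, 2)}
all_three = count_agreement(labels, names)

for pair, n in pairwise.items():
    print(" & ".join(pair), n)
print("all three", all_three)
```

Counting exact label matches is the simplest measure of agreement and corresponds to the "Agreed" column of table 1.2; it does not correct for agreement that would be expected by chance.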
An extended analysis of the corpus and the agreement of labels is available in the following publications:
A list of publications focusing on the automated experiments on this corpus is available on this page. If you make use of this corpus, please consider citing these publications.
Other discussions by a wide range of experts in genre classification (with emphasis on webpage genre) and its automation are linked at http://purl.org/net/webgenres.
©Copyright HATII, University of
Glasgow. Last updated by Yunhyong Kim, May 2013.