Building the KRYS Corpus

  1. Who was involved?
  2. What does it contain?
  3. How was it built?
  4. Human agreement in genre classification
  5. Related work and publications

1. Who was involved?

The building of this Corpus was a collaborative effort led by Dr Yunhyong Kim and Prof Seamus Ross at the Humanities Advanced Technology and Information Institute at the University of Glasgow between 2005 and 2008.

It was part of the Digital Curation Centre initiative on automated metadata extraction from text documents. The PICS genre classification system that facilitated the collection of documents was created by Andrew McHugh and Adam Rusbridge. The documents themselves were collected by students of the University of Glasgow. The administrative organisation and analysis of the Corpus were supported by research assistants Laura Brouard and Vera Berninger. Two people were employed to perform large-scale reclassification of the documents, and other volunteer labellers provided small-scale reclassification of some documents.

2. What does it contain?

The KRYS I Corpus contains over 6300 PDF documents classified into one of 70 genres (see table 1.1 for the genre classification schema).

Table 1.1. Scope of genres under examination

Genre Group | Genres
Book | Academic Monograph; Poetry Book; Book of Fiction; Other Book; Handbook
Article | Abstract; Magazine Article; Scientific Article; Other Research Article; News Report
Short Composition | Poem; Fictional Piece; Dramatic Script; Essay; Short Biographical Sketch; Review
Serial | Periodicals (Newspaper, Magazine); Journals; Conference Proceedings; Newsletter
Correspondence | Email; Letter; Memo; Telegram
Treatise | Thesis; Business or Operational Report; Technical Report; Miscellaneous Report; Technical Manual
Information Structure | List; Catalogue; Raw Data; Table/Calendar; Menu; Form; Programme; Questionnaire; FAQ
Evidential Document | Minutes; Legal Proceedings; Financial Record; Receipt; Slips; Contract
Visually Dominant Document | Artwork; Card; Chart; Graph; Diagram; Sheet Music; Poster; Comics
Other Functional Document | Guideline; Regulations; Manual; Grant or Project Proposal; Legal Appeal, Proposal or Order; Job, Course or Project Description; Product or Application Description; Advertisement; Announcement; Appeal or Propaganda; Exam or Worksheet; Fact Sheet; Forum Discussion; Interview; Notice; Resume/CV; Slides; Speech Transcript
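For readers who want to work with the schema programmatically, the following is a minimal sketch of Table 1.1 as a Python mapping from genre group to genres. The names GENRE_SCHEMA, group_of and total_genres are illustrative and are not part of the Corpus distribution. The sketch assumes that "Business or Operational Report" is a single genre, which is the reading under which the table lists exactly 70 genres.

```python
# Genre groups and genres transcribed from Table 1.1.
# Assumption: "Business or Operational Report" is one genre, not two.
GENRE_SCHEMA = {
    "Book": [
        "Academic Monograph", "Poetry Book", "Book of Fiction",
        "Other Book", "Handbook",
    ],
    "Article": [
        "Abstract", "Magazine Article", "Scientific Article",
        "Other Research Article", "News Report",
    ],
    "Short Composition": [
        "Poem", "Fictional Piece", "Dramatic Script", "Essay",
        "Short Biographical Sketch", "Review",
    ],
    "Serial": [
        "Periodicals (Newspaper, Magazine)", "Journals",
        "Conference Proceedings", "Newsletter",
    ],
    "Correspondence": ["Email", "Letter", "Memo", "Telegram"],
    "Treatise": [
        "Thesis", "Business or Operational Report", "Technical Report",
        "Miscellaneous Report", "Technical Manual",
    ],
    "Information Structure": [
        "List", "Catalogue", "Raw Data", "Table/Calendar", "Menu",
        "Form", "Programme", "Questionnaire", "FAQ",
    ],
    "Evidential Document": [
        "Minutes", "Legal Proceedings", "Financial Record",
        "Receipt", "Slips", "Contract",
    ],
    "Visually Dominant Document": [
        "Artwork", "Card", "Chart", "Graph", "Diagram",
        "Sheet Music", "Poster", "Comics",
    ],
    "Other Functional Document": [
        "Guideline", "Regulations", "Manual",
        "Grant or Project Proposal", "Legal Appeal, Proposal or Order",
        "Job, Course or Project Description",
        "Product or Application Description", "Advertisement",
        "Announcement", "Appeal or Propaganda", "Exam or Worksheet",
        "Fact Sheet", "Forum Discussion", "Interview", "Notice",
        "Resume/CV", "Slides", "Speech Transcript",
    ],
}


def group_of(genre):
    """Return the genre group that contains the given genre, or None."""
    for group, genres in GENRE_SCHEMA.items():
        if genre in genres:
            return group
    return None


# Total number of genres across all groups.
total_genres = sum(len(genres) for genres in GENRE_SCHEMA.values())
```

A lookup such as group_of("Thesis") returns the group name, and total_genres can be checked against the 70 genres stated above.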


The original Corpus contained 6494 documents, but some were removed at the request of copyright owners.

3. How was it built?

The Corpus was collected from the internet by students of the University of Glasgow using the following guidelines. Students were:

A large number of documents were erroneously submitted by students to the KRYS I corpus. For instance, there were:


These documents were not removed from the database for two reasons:

Further analysis of this “false” classification is expected to be made available in an upcoming publication. A reference will be made available on this web site.

About 5500 of the documents submitted by the students were reclassified by one or more additional assessors. This was done under the following conditions:

4. Human agreement in genre classification

A Human Labelling Experiment was carried out to measure the agreement between the initial class assigned to each document by the students and the classes subsequently assigned to the same document by two secretaries, who were asked to reclassify the documents without knowledge of the prior classifications. The results showed that the labels differed in many cases, as shown in table 1.2.

Table 1.2. Human agreement analysis

Labeller group | Agreed
Student & Secretary I | 2,745*
Student & Secretary II | 2,852*
Secretary I & II | 2,422*
All Three Labellers | 2,008*

* out of 5,305
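The counts in Table 1.2 can be turned into raw agreement rates by dividing each by the 5,305 documents considered. The following is a minimal sketch of that arithmetic; the names AGREED, TOTAL_DOCUMENTS and agreement_rate are illustrative only. Note that this is simple percentage agreement and is not corrected for chance agreement, which would require the per-document labels rather than the summary counts.

```python
# Counts transcribed from Table 1.2 (documents on which the labellers agreed).
TOTAL_DOCUMENTS = 5305
AGREED = {
    "Student & Secretary I": 2745,
    "Student & Secretary II": 2852,
    "Secretary I & II": 2422,
    "All Three Labellers": 2008,
}


def agreement_rate(agreed, total=TOTAL_DOCUMENTS):
    """Return the fraction of documents on which the labellers agreed."""
    return agreed / total


# Raw (not chance-corrected) agreement rate for each labeller group.
rates = {group: agreement_rate(count) for group, count in AGREED.items()}

for group, rate in rates.items():
    print(f"{group}: {rate:.1%}")
```

On these figures, each pair of labellers agreed on roughly half of the documents, and all three agreed on fewer than two in five.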

The entire set of labels attributed to each document by the classifiers will be distributed with the Corpus.

5. Related work and publications

An extended analysis of the corpus and the agreement of labels is available in the following publications:

A list of publications focusing on the automated experiments on this corpus is available on this page. If you make use of this corpus, please consider citing these publications.

Other discussions by a wide range of experts in genre classification (with emphasis on webpage genre) and its automation are linked at http://purl.org/net/webgenres.



©Copyright HATII, University of Glasgow. Last updated by Yunhyong Kim, May 2013.