Welcome to the website of the KRYS I Corpus.
The KRYS I Corpus is a collection of over 6300 documents labelled with their genre classes. It was constructed as part of a research initiative, driven by the Digital Curation Centre, to automate document genre classification. The work was carried out at the Humanities Advanced Technology and Information Institute (HATII), University of Glasgow, between 2005 and 2008.
The notion of genre is deeply embedded in the way humans organise information. Identifying the genre of a document helps to characterise the physical and conceptual structure of the text, and to capture its style and the location of further information within it. Very few genre-labelled corpora have been available to the research community. Our corpus is made available here to fill this gap and to serve as a valuable resource for researchers in:
- metadata extraction,
- digital curation,
- text classification,
- text mining,
- computational linguistics,
- and pattern recognition.
To access the Corpus, please register first by
going to the page Registration/Login. By registering to access the Corpus, you are agreeing to the
retained by the original copyright owners, and their permission
might be required before you copy, use, or distribute any of the
content. Please note that documents will be removed upon
the request of the copyright owners without prior notice. Also, note that access to the corpus could be withdrawn should any misuse of the Corpus be detected.
Documents within the KRYS I Corpus have been collected from the Internet and are, thus, publicly available. While the authors tried to ensure that no copyright law was violated, not all document owners could be contacted. Should you find that your document is being used without your permission within this collection, please contact the KRYS I Corpus Manager at firstname.lastname@example.org, quoting the document ID number within the collection, and the document will be removed immediately. Please be assured that no content within the document has been altered except where this has been explicitly requested by the copyright owner.
There is more information about the construction method and composition of the corpus on the Information page. Automated experiments we have conducted using the corpus are reported in the published papers listed on this page.
If you would like to further help this corpus building initiative, please go to our Genre Classification System (PICS), register, and classify and submit your own documents.
If you have any queries please
© Copyright HATII, University of Glasgow. Last updated by Yunhyong Kim, November 2008.