corpora
Annotated Corpora
-
Genia (University of Tokyo)
- 2000 abstracts from Medline
- hand-annotated for biological terms
- articles with the MeSH terms human, blood cell and transcription factor
-
Genia Treebank
- A collection of parsed Medline abstracts (using an HPSG approach).
-
NOT Manually verified
- The web site releases only 200, but we have a CD that contains 100000. (The CD was distributed at BioNLP04, COLING, Geneva.)
-
Craven's IE Data Sets: three datasets, each focusing on one of the relations described below. The labelling was done using a completely automated method.
- subcellular-localization(PROTEIN, LOCATION). The relation tuples were gathered from the (now defunct) Yeast Proteome Database (YPD).
- disease-association(GENE, DISEASE). The relation tuples were gathered from the Online Mendelian Inheritance in Man (OMIM) database.
- protein-interaction(PROTEIN, PROTEIN). This data was collected from the MIPS Comprehensive Yeast Genome Database.
- PASTA Corpora (University of Sheffield)
-
Medstract Corpus (Brandeis University): built for two target applications, acronym identification and entity anaphora resolution.
- Integrated Annotation of Biomedical Text at the University of Pennsylvania
- started in 2003
- integrates different types of annotation: syntactic (Treebank), predicate-argument structure (Propbank), domain entities and co-reference.
- first results are expected in early 2004 ( Check )