Taggers and Chunkers
Bio Taggers
- GeneTaggerCRF (UPenn)
- uses a machine learning technique called conditional random fields
- Yapex a protein name tagger (ref:franzen:ijmi02)
- KeX freely available source code (ref:fukuda:PSB1998)
- AbGene simple gene finder in Medline documents
- GAPSCORE identify names of genes and proteins
- LingPipe generic IE tools, now applied to TREC Genomics
- NLProt is a tool for finding protein names in natural-language text. It is based on Support Vector Machines (SVMs), which are trained on contextual features of named entities in scientific language. Additionally, simple filtering rules and a protein-name dictionary are used to increase performance. NLProt reached a precision (accuracy) of 70% at a recall (coverage) of 85% after running it on the 166 most recent abstracts of EMBL and Cell (Nov/Dec 2003). When run from the command line, NLProt takes about 1 second per abstract to finish.
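The precision and recall figures quoted for the taggers above can be read as follows: precision is the share of predicted names that are correct, recall is the share of true names that were found. A minimal sketch of that computation, assuming mentions are compared as exact `(abstract_id, start, end)` spans; the function name `precision_recall` and the sample spans are hypothetical and not taken from any of the listed tools:

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of entity mentions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)  # mentions found in both sets
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical mentions as (abstract_id, start, end) character spans.
gold = {(1, 0, 4), (1, 10, 15), (2, 3, 9), (2, 20, 26)}
predicted = {(1, 0, 4), (1, 10, 15), (2, 3, 9), (2, 30, 35), (3, 1, 5)}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Exact span matching is only one possible convention; published evaluations may also count partial overlaps, so numbers from different tools are not always directly comparable.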
MUC-style NE Recognizers
- Biomedical Named Entity Recognition at A*STAR Can recognize the following classes: Virus, Tissue, RNA, Protein, Polynucleotide, Peptide, OtherOrganicCompound, OtherName, OtherArtificialSource, Organism, Nucleotide, MultiCell, MonoCell, Lipid, Inorganic, DNA, CellType, CellLine, CellComponent, Carbohydrate, BodyPart, Atom, AminoAcidMonomer.
Generic POS Taggers
- Brill POS Tagger
- LT POS tagger (maximum entropy tagger)
- Junk Tagger
Generic Chunker
- NP Chunker by Mark Greenwood (University of Sheffield)
Coreference Resolution
Various tools by Patrick Ruch
- Ruch
- seem to be Windows only
- for us MeSHMap might be useful
Semantic Gene Organizer
Semantic Gene Organizer (SGO) is an automated method to cluster genes based on conceptual relationships derived from MEDLINE abstracts. It uses a variant of the vector-space model called Latent Semantic Indexing (LSI) to represent genes as vectors in a lower-dimensional (concept) space. The relationship between genes is deduced from the cosine of the angle between gene document vectors. A gene document is a concatenation of the MEDLINE titles and abstracts identified in the LocusLink entry for each gene.
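The general LSI idea can be sketched with a truncated singular value decomposition of a term-by-gene-document matrix, followed by cosine similarity in the reduced space. This is an illustration only, not SGO's actual code: the toy counts, the gene and term names, the choice of `k=2`, and the helper names `lsi_gene_vectors` and `cosine` are all assumptions, and any term weighting a real system would apply is omitted.

```python
import numpy as np

# Hypothetical term-by-gene-document count matrix
# (rows: terms, columns: gene documents).
terms = ["kinase", "phosphorylation", "apoptosis", "caspase"]
genes = ["geneA", "geneB", "geneC"]
counts = np.array([
    [3.0, 2.0, 0.0],
    [2.0, 3.0, 0.0],
    [0.0, 0.0, 4.0],
    [0.0, 1.0, 3.0],
])


def lsi_gene_vectors(matrix, k):
    """Project each column (gene document) into a k-dimensional concept space."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values; row i of the result is gene i.
    return (np.diag(s[:k]) @ vt[:k, :]).T


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vectors = lsi_gene_vectors(counts, k=2)
sim_ab = cosine(vectors[0], vectors[1])  # geneA vs geneB (shared vocabulary)
sim_ac = cosine(vectors[0], vectors[2])  # geneA vs geneC (little overlap)
print(f"{genes[0]}-{genes[1]}: {sim_ab:.3f}")
print(f"{genes[0]}-{genes[2]}: {sim_ac:.3f}")
```

In this toy matrix geneA and geneB share most of their vocabulary, so their concept-space vectors point in nearly the same direction, while geneC lies along a different concept. Clustering genes would then operate on these pairwise cosine values.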