The Oslo-Bergen Tagger and the Nomen Nescio Project
Janne Bondi Johannessen (jannebj@mail.hf.uio.no)

Outline
  Addresses
  Which tagset
  How the tagger treats compounds, names, unknown words etc.
  Its performance
  Input, preprocessing, output
  Names and Nomen Nescio
  Norsk Ordbank

Read more!
  On the Oslo Corpus and the tagger: http://www.tekstlab.uio.no/norsk/bokmaal/
  Try the Oslo-Bergen Tagger: http://decentius.hit.uib.no:8005/cl/cgp/test.html
  The Nomen Nescio Project: http://scrooge.spraakdata.gu.se/nn/

Multitagged text
"<lang>"
    "lang" adj pos mask fem ub ent
    "lange" verb imp <trans1>
"<tradisjon>"
    "tradisjon" subst mask appell ent ub
"<$ >"
    "$ " CLB <OVERSKRIFT>
"<*i>"
    "i" prep
"<over>"
    "over" prep
"<hundre>"
    "hundre" det kvant fl
    "hundre" subst nøyt appell ent ub
    "hundre" subst nøyt appell fl ub
"<år>"
    "år" subst fem appell ent ub
    "år" subst mask appell ent ub
    "år" subst nøyt appell ent ub
    "år" subst nøyt appell fl ub
"<har>"
    "ha" verb pres <trans6><auxp>
"<sportsfiskere>"
    "sportsfisker" subst mask appell fl ub
"<og>"
    "og" CLB konj
    "og" adv
    "og" konj
"<elveeiere>"
    "elveeier" subst mask appell fl ub
"<prøvd>"
    "prøve" adj <perf-part> mask fem ub ent <trans1><trans3>
    "prøve" adj <perf-part> nøyt ub ent <trans1><trans3>
    "prøve" verb perf-part <trans1><trans3>
"<å>"
    "å" inf-merke
    "å" interj
    "å" subst fem appell ent ub
    "å" subst mask appell ent ub
"<hjelpe>"
    "hjelpe" verb inf <trans1>
"<laksen>"
    "laks" subst mask appell ent be
"<til>"
    "til" prep
"<å>"
    "å" inf-merke
    "å" interj
    "å" subst fem appell ent ub
    "å" subst mask appell ent ub
"<formere>"
    "former" subst mask appell fl ub
    "formere" verb inf <trans1><refl4>
"<seg>"
    "seg" pron refl ent/fl akk
    "sige" verb pret <intrans2>
"<I>"
    "i" prep
"<norske>"
    "norsk" adj pos mask fem nøyt be ent
    "norsk" adj pos ub be fl
    "norske" verb inf <trans1>
"<elver>"
    "elv" subst fem appell fl ub
    "elv" subst mask appell fl ub
"<$.>"
    "$." CLB <PUNKT> opp

Disambiguated text
"<*lang>"
    "lang" adj pos mask fem ub ent
"<tradisjon>"
    "tradisjon" subst mask appell ent ub
"<$ >"
    "$ " CLB <OVERSKRIFT>
"<*i>"
    "i" prep
"<over>"
    "over" prep
"<hundre>"
    "hundre" det kvant fl
"<år>"
    "år" subst nøyt appell fl ub
"<har>"
    "ha" verb pres <trans6> <auxp>
"<sportsfiskere>"
    "sportsfisker" subst mask appell fl ub
"<og>"
    "og" konj
"<elveeiere>"
    "elveeier" subst mask appell fl ub
"<prøvd>"
    "prøve" verb perf-part <trans1> <trans3>
"<å>"
    "å" inf-merke
"<hjelpe>"
    "hjelpe" verb inf <trans1>
"<laksen>"
    "laks" subst mask appell ent be
"<til>"
    "til" prep
"<å>"
    "å" inf-merke
"<formere>"
    "formere" verb inf <trans1> <refl4>
"<seg>"
    "seg" pron refl ent/fl akk
"<i>"
    "i" prep
"<norske>"
    "norsk" adj pos ub be fl
"<elver>"
    "elv" subst fem appell fl ub
    "elv" subst mask appell fl ub
"<$.>"
    "$." CLB <PUNKT> opp

Syntactic tags
"<*lang>"           @ADJ>
"<tradisjon>"       @SUBJ @OBJ @LØS-NP
"<$ >"
"<*i>"              @ADV
"<over>"            @ADV
"<hundre>"          @DET>
"<år>"              @<P-UTFYLL
"<har>"             @FV
"<sportsfiskere>"   @SUBJ
"<og>"              @KON
"<elveeiere>"       @SUBJ @OBJ @I-OBJ
"<prøvd>"           @IV
"<å>"               @OBJ
"<hjelpe>"          @IV
"<laksen>"          @OBJ
"<til>"             @ADV
"<å>"               @<P-UTFYLL
"<formere>"         @IV
"<seg>"             @OBJ
"<i>"               @ADV
"<norske>"          @ADJ>
"<elver>"           @<P-UTFYLL
"<$.>"
"<*naturen>"        @SUBJ
"<er>"              @FV
"<nemlig>"          @ADV
"<knallhard>"       @S-PRED
"<mot>"             @ADV
"<lakseavkommet>"   @<P-UTFYLL
"<$->"
"<bare>"            @ADV
"<noen>"            @DET>
"<få>"              @ADJ>
"<prosent>"         @SUBJ @OBJ
"<av>"              @ADV
"<den>"             @DET>
"<yngelen>"         @<P-UTFYLL
"<som>"             @<SBU-REL
"<klekkes>"         @FV
"<naturlig>"        @S-PRED @O-PRED @ADV
"<i>"               @ADV
"<elvene>"          @<P-UTFYLL
"<overlever>"       @FV
"<så>"              @ADV>
"<lenge>"           @ADV
"<at>"              @SUBJ @OBJ
"<de>"              @SUBJ
"<begynner>"        @FV
"<å>"               @OBJ
"<ta>"              @IV
"<til>"             @ADV
"<seg>"             @<P-UTFYLL
"<føde>"            @OBJ @I-OBJ
"<$.>"

Compounds
  kuldekammer (cold chamber)
  frontruteareal (windscreen area)
  30-årene (the thirties)
  livssammenhengen (the life style connection)
  foreldreforberedende (parent preparing)
  fødselsopplevelsen (the birth experience)
  sædkvalitet (sperm quality)
  laparoskopi (laparoscopy)
  kjempespent (very excited)
  spontanaborterte (miscarried)

Compounds with tags
kuldekammer
    "kuldekam" subst mask appell fl ub samset
    "kuldekammer" subst nøyt appell ent ub samset
    "kuldekammer" subst nøyt appell fl ub samset
frontruteareal
    "frontruteareal" subst nøyt appell ent ub samset
    "frontruteareal" subst nøyt appell fl ub samset
spontanaborterte
    "spontanabortere" verb pret i2 tr1 samset
foreldreforberedende
    "foreldreforberedende" adj pos nøyt ub ent samset
laparoskopi
    "laparoskopi" subst mask appell ent ub samset
kortikosteroider
    "kortikosteroid" subst nøyt appell fl ub samset

Compound summary
  Analyses inflected word forms
  Accepts unknown first member
  Analyses productive derivations
  Gives several analyses if they are equally probable

What to do with unknown words (those not in the lexicon)
  See if they can be analysed as compounds or derivations
  If not, mark them as unknown

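The fallback described above can be illustrated with a minimal sketch. It is not the tagger's actual compound analyser: the toy `LEXICON`, the function names, the minimum member length and the `ukjent` (unknown) label are all assumptions made for illustration. The sketch looks a word up in the lexicon, then tries to split it into an arbitrary first member plus a known last member, and finally marks it as unknown.

```python
# Hypothetical toy lexicon: word form -> list of (lemma, tags) readings.
LEXICON = {
    "kammer": [("kammer", "subst nøyt appell ent ub"),
               ("kammer", "subst nøyt appell fl ub")],
    "areal": [("areal", "subst nøyt appell ent ub")],
}


def analyse_compound(word, min_len=3):
    """Return readings for `word` split as <any first member> + <known last member>."""
    readings = []
    for i in range(min_len, len(word) - min_len + 1):
        first, last = word[:i], word[i:]
        # The first member may be unknown; only the last member must be in the lexicon.
        for lemma, tags in LEXICON.get(last, []):
            # The compound inherits the tags of its last member, plus "samset".
            readings.append((first + lemma, tags + " samset"))
    return readings


def tag_word(word):
    """Lexicon lookup, then compound analysis, then the 'unknown' fallback."""
    if word in LEXICON:
        return LEXICON[word]
    readings = analyse_compound(word.lower())
    # "ukjent" is an assumed label for words that get no analysis.
    return readings or [(word, "ukjent")]


for w in ("kuldekammer", "frontruteareal", "benia"):
    print(w, tag_word(w))
```

Because any first member is accepted, a sketch like this will also produce wrong compound analyses for misspelt or foreign words whose ending happens to match a lexicon entry.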
Unknown words with (sometimes wrong) compound analysis
  Misspelt words: instititutt, johannes, cowboy-k re, veerdensbasis
  Name at beginning of sentence: Aslaug
  Foreign words: great

Words marked as unknown
  Misspellings:
    benia (beina - the legs)
    eli (Eli)
    kommme (komme - come)
    allerde (allerede - already)
  Non-standard writing conventions:
    du a (du henne - you her / du da - you then)
    noesomhelst (noe som helst - anything at all)
    peppern (pepperen - the pepper)
  Foreign or dialect words: workout, con, tipica, chat

How to improve results with unknown words
  Expand the lexicon with non-standard words (the Oslo-tagger uses this strategy)
  Guess

Some words in the expanded Oslo-lexicon (Norsk Ordbank), marked by "unormert" (non-standard)
  Words with old-fashioned spelling: hverken (verken - neither), turde (torte - dared), syv (sju - seven)
  Inflections outside the norm: faxer (pl. of fax), mann (singular form used as pl. of man)
  Foreign words: catwalk, management
  Common mistakes: jævli (jævlig - swear word), maks (maks. - max.), sjæl (sjøl, selv - him/her/yourself)

The performance of the tagger
  Performance of the morphological part of the tagger (measured on a 100,000-word training corpus of very varied text types):
    Recall: 99.2%
    Precision: 96.8%
  (New testing on an unseen corpus will be done soon, plus syntactic evaluation and improvement.)

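To make the two figures concrete, here is a small sketch of how recall and precision are customarily computed for a tagger that may leave more than one reading per word. The slide does not state its definitions, so this assumes the usual Constraint Grammar ones: recall is the share of words whose correct reading survives, and precision is the share of surviving readings that are correct. The function name and the sample data are hypothetical.

```python
def recall_precision(kept, gold):
    """Compute (recall, precision) for a tagger that may leave ambiguity.

    kept: list with one set of retained readings per token.
    gold: list with the single correct reading per token.
    """
    if len(kept) != len(gold) or not gold:
        raise ValueError("kept and gold must be non-empty and of equal length")
    # A token counts as correct if the right reading is among those retained.
    correct = sum(1 for readings, right in zip(kept, gold) if right in readings)
    total_kept = sum(len(readings) for readings in kept)
    recall = correct / len(gold)
    precision = correct / total_kept if total_kept else 0.0
    return recall, precision


# Two tokens: one fully disambiguated, one left with two readings.
kept = [
    {"adj pos ub be fl"},
    {"subst fem appell fl ub", "subst mask appell fl ub"},
]
gold = ["adj pos ub be fl", "subst fem appell fl ub"]
recall, precision = recall_precision(kept, gold)
print(recall, precision)
```

Under these definitions, leaving a word ambiguous keeps recall high but lowers precision, which is the trade-off the two numbers on the slide reflect.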
Input, preprocessing, output
  Input: pure text (or SGML, HTML, XML)
  Preprocessing:
    Abbreviations
    Sentence boundaries
    Headlines
    Compounds
    Names
    Dates
    Multitagging
  Output: pure text in CG-format

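Since the output is plain text in CG format, it is easy to post-process. The following is a minimal sketch of a reader for that format, assuming the common layout in which each word form stands on its own line as `"<form>"` and each reading follows on an indented line beginning with the quoted lemma. The function name `parse_cg` and the sample text are illustrative, not part of the tagger.

```python
import re


def parse_cg(text):
    """Parse CG-format text into a list of (wordform, [(lemma, [tags])]) cohorts."""
    cohorts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # A cohort starts with the word form written as "<form>".
        m = re.fullmatch(r'"<(.*)>"', line)
        if m:
            cohorts.append((m.group(1), []))
            continue
        # A reading is a quoted lemma followed by whitespace-separated tags.
        m = re.match(r'"([^"]*)"\s*(.*)', line)
        if m and cohorts:
            cohorts[-1][1].append((m.group(1), m.group(2).split()))
    return cohorts


SAMPLE = '''
"<norske>"
    "norsk" adj pos ub be fl
"<elver>"
    "elv" subst fem appell fl ub
    "elv" subst mask appell fl ub
'''

cohorts = parse_cg(SAMPLE)
# Word forms that still carry more than one reading after disambiguation.
ambiguous = [form for form, readings in cohorts if len(readings) > 1]
print(ambiguous)
```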
NAMES (THE NOMEN NESCIO PROJECT, work by Paul Meurer and JBJ)
DIFFICULT NAMES I
Only the first letter is capital - a noun phrase name:
  a. Den norske stat
  b. Institutt for lingvistiske fag
  c. Direktoratet for naturforvaltning
  d. Det historisk-filosofiske fakultet

DIFFICULT NAMES II
One of the words is a proper name:
  a. Mjær ungdomsskole
  b. Gjerdrum likningskontor og folkeregister
  c. Hungerholt gruppebolig
  d. Universitetet i Oslo
  e. Sentralsykehuset i Akershus

How are difficult names solved:
  Regular expressions based on morphologically tagged words:
    Universitetet i Oslo
    (:seq (:and subst prop be) "i" (:and subst prop))
  Document-centered approach: makes it possible to recognise phrasal names (with only one capital letter) even after a full stop: Den norske kirke

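The pattern `(:seq (:and subst prop be) "i" (:and subst prop))` can be read as a regular expression over tagged tokens rather than over characters. The sketch below mimics that idea in Python. It is not the project's rule engine: the helper names `AND`, `WORD` and `find_seq`, the token representation and the tags assigned to the sample tokens are all assumptions for illustration.

```python
def AND(*tags):
    """Predicate: the token carries all of the given tags (like :and)."""
    required = set(tags)
    return lambda tok: required <= tok["tags"]


def WORD(form):
    """Predicate: the token's word form equals `form`, ignoring case."""
    return lambda tok: tok["form"].lower() == form


def find_seq(tokens, pattern):
    """Return (start, end) spans where consecutive tokens match `pattern` (like :seq)."""
    n = len(pattern)
    spans = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(pred(tok) for pred, tok in zip(pattern, window)):
            spans.append((i, i + n))
    return spans


# Python rendering of (:seq (:and subst prop be) "i" (:and subst prop)).
NAME_PATTERN = [AND("subst", "prop", "be"), WORD("i"), AND("subst", "prop")]

# Hypothetical tagged input; the tags are chosen to fit the pattern.
tokens = [
    {"form": "Hun", "tags": {"pron"}},
    {"form": "studerer", "tags": {"verb", "pres"}},
    {"form": "ved", "tags": {"prep"}},
    {"form": "Universitetet", "tags": {"subst", "prop", "be"}},
    {"form": "i", "tags": {"prep"}},
    {"form": "Oslo", "tags": {"subst", "prop"}},
]

spans = find_seq(tokens, NAME_PATTERN)
names = [" ".join(tok["form"] for tok in tokens[a:b]) for a, b in spans]
print(names)
```

Matching on tags rather than on spelling is what lets one rule cover "Universitetet i Oslo" and "Sentralsykehuset i Akershus" alike.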
NE recognition
Ongoing work by:
  Åsne Haaland (statistical methods)
  Andra Björk Jonsdottir (linguistic, rule-based methods)
Six categories:
  person names
  location names
  organization names
  publication names
  events
  other

Norsk Ordbank
Oracle database at UiO containing:
  Bokmålsordboka
  Nynorskordboka
  IBM-lexicon
  + expanded by non-standard words (marked as such)