INF2820 Datalingvistikk V2012. Jan Tore Lønning

INF2820 Datalingvistikk V2012 Jan Tore Lønning

MER OM PARSING, SÆRLIG TABELLPARSING 20. februar 2012 2

I dag Oppsummering og utfylling fra sist: Recursive-descent parser (top-down) Shift-reduce parser (bottom-up) Svakheter ved disse Mellomspill: Chomsky Normal Form Tabellparsing: Dynamisk programmering CKY-algoritmen 20. februar 2012 3

Parsing Gitt en grammatikk G og streng s Spm1: Er s L(G) Spørsmål om anerkjennelse ( recognition ) Spm2: Vi er interessert i (frase)strukturen til s Hvorfor er s L(G)? Finn alle trær i T(G) som har s som utkomme ( yield ) Ekvivalent: Finn alle høyreavledninger av s. Finn alle venstreavledninger av s. Parsing 20. februar 2012 4

Venstre- og høyreavledning Venstreavledning: S NP VP Det N VP the N VP the dog VP the dog V NP PP the dog saw NP PP the dog saw Det N PP the dog saw a N PP the dog saw a man PP the dog saw a man P NP the dog saw a man in NP the dog saw a man in Det N the dog saw a man in the N the dog saw a man in the park Høyreavledning: S NP VP NP V NP PP NP V NP P NP NP V NP P Det N NP V NP P Det park NP V NP P the park NP V NP in the park NP V Det N in the park NP V Det man in the park NP V a man in the park NP saw a man in the park Det N saw a man in the park Det man saw a man in the park the dog saw a man in the park 20. februar 2012 5

Recursive descent parser Lager en venstreavledning Bygger et tre: Fra toppen ( top-down ) Fra venstre mot høyre Tilstrekkelig å bare se på venstreavledninger fordi de svarer til trær Streber mot tidligst mulig å sjekke mot input-data Et trinn er ikke-deterministisk: Velg en regel! Dette gir et søkerom å holde orden på 20. februar 2012 6

Datastruktur Venstreavledning S NP VP Det N VP the N VP the dog VP the dog V NP PP the dog saw NP PP the dog saw Det N PP the dog saw a N PP the dog saw a man PP the dog saw a man P NP the dog saw a man in NP the dog saw a man in Det N the dog saw a man in the N the dog saw a man in the park Datastruktur S the dog saw a man in the park NP VP the dog saw a man in the park Det N VP the dog saw a man in the park N VP dog saw a man in the park VP saw a man in the park V NP PP saw a man in the park NP PP a man in the park Det N PP a man in the park N PP man in the park PP in the park P NP in the park NP the park Det N the park N park # # 20. februar 2012 7

Problemer for RD-parsing 1. Venstrerekursjon: Hvordan takler parseren N AP N N N PP? 2. Dobbeltarbeid: Som en del av en overordnet gal analyse kan den finne riktige deler, men disse blir glemt 3. Prøving og feiling som er litt blind 20. februar 2012 8

Datastruktur Høyreavledning S NP VP NP V NP PP NP V NP P NP NP V NP P Det N NP V NP P Det park NP V NP P the park NP V NP in the park NP V Det N in the park NP V Det man in the park NP V a man in the park NP saw a man in the park Det N saw a man in the park Det man saw a man in the park the dog saw a man in the park Datastruktur S # NP VP # N V NP PP # N V NP P NP # NP V NP P Det N # NP V NP P Det park NP V NP P the park NP V NP in the park NP V Det N in the park NP V Det man in the park NP V a man in the park NP saw a man in the park Det N saw a man in the park Det dog saw a man in the park # the dog saw a man in the park 20. februar 2012 10

Bottom-up: Shift reduce parser Words, Der strenger (Words resten av input, Der det som er funnet så langt) Vanlig notasjon: Der Words Start: Words:= inputstreng Der := ε Løkke: Hvis Words=ε og Der= S stopp med suksess! Hvis mulig, gjør en av følgende: (Shift:) Hvis Words=/=ε, La Der:=Der first(words) og Words:=rest(Words) (Reduce:) Hvis det fins α, β, B, en regel B β og Der = α β: la Der= α B 20. februar 2012 11

Bottom-up: Shift reduce parser ε Kim saw the girl with the telescope NP V Det girl with the telescope NP V Det girl with the telescope (SHIFT) NP V Det N with the telescope (REDUCE) NP V NP with the telescope (REDUCE) 20. februar 2012 12

Algoritme ikke-deterministisk Denne kan lett utvides til parser ved at vi legger inn deltrær i Der i stedet for bare kategorier. SR-parsere har problemer med tomme høyresider To plasser for valg/ikke-determinisme: Skal vi flytte eller redusere? Hva skal vi velge når vi har flere valg for reduksjon? Den kan gjøres mer effektiv hvis vi vet at høyresidene ikke blander terminaler og ikke-terminaler: Når vi flytter, gjør vi samtidig en unær reduksjon. Svarer til regel på formen B t (Vi kan også gjøre den mer effektiv hvis grammatikken er på CNF: Bare se på de to siste symbolene i Der når vi reduserer Svarer til regel på formen A BC) 20. februar 2012 13

def recognize(grammar, stack, rest, trace): if trace > 0: print stack, rest if rest==[] and len(stack)==1 and stack[0]==grammar.start(): return True else: for p in grammar.productions(): if not p.is_lexical(): rhs = list(p.rhs()) n = len(rhs) if stack[-n:] == rhs: newstack = stack[0:-n] newstack.append(p.lhs()) if recognize(grammar, newstack, rest, trace): return True if not len(rest) == 0: word = rest[0] for p in grammar.productions(): if p.is_lexical() and rest[0]==p.rhs()[0]: newst = stack[:] newst.append(p.lhs()) if recognize(grammar, newst, rest[1:], trace): return True 20. februar 2012 14

def parse(grammar, stack, rest, trees, trace): if rest == [] and len(stack)==1 and stack[0][0]==grammar.start(): trees.append(stack[0]) else: for p in grammar.productions(): if not p.is_lexical(): rhs = list(p.rhs()) n = len(rhs) top = [node[0] for node in stack[-n:]] if top == rhs: newstack = stack[0:-n] newstack.append((p.lhs(), stack[-n:])) parse(grammar, newstack, rest, trees, trace) if not len(rest) == 0: word = rest[0] for p in grammar.productions(): if p.is_lexical() and rest[0]==p.rhs()[0]: cat = p.lhs() newstack = stack[:] newstack.append((cat, [word])) newrest = rest[1:] parse(grammar,newstack, newrest, trees, trace) return trees 20. februar 2012 15

Problemer spesielt for Shift-Reduce Unære produksjonsregler: Shift-Reduce kan tillate disse, men en må sjekke at det ikke er cykler av unære regler i grammatikken: A B B A Tomme produksjonsregler: NP DET N PPS PPS PP PPS PPS # Når skulle vi foreslå dem? Hvor mange? Iterasjon? 16

Problem for både RD og SR Ineffektivitet Så på eksempler sist RD: Eksempel: S NP VP Noen valg under NP Noen valg under VP Vi foretar valgene for VP på nytt for hvert alternativ under NP Tilsvarende for SR For hvert valg vi foretar må vi se på alle muligheter for resten av strengen på nytt 20. februar 2012 17

Chomsky-normalform (CNF) En grammatikk er på Chomsky-normalform hvis alle reglene er på en av følgende former: A B C A t (t en terminal) Enhver CFG G hvor ε L(G), er svakt ekvivalent til en G på CNF. Altså L(G) = L(G ) 20. februar 2012 19

Chomsky-normalform (CNF) Enhver CFG G hvor ε L(G) er svakt ekvivalent til en G på CNF Bevisskisse: 1. Erstatt alle regler på formen A ε 2. Erstatt alle regler på formen A B for B en ikke-terminal 3. For hver regel A β, der β > 1 og β inneholder en eller flere terminaler t1,, tn: 1. Innfør nye ikke-terminaler T1,, Tn. 2. Erstatt ti med Ti 3. Innfør reglene Ti ti for i = 1,, n 4. For hver regel A B1 B2 Bn, n > 2, : 1. Innfør nye ikke-terminaler C1,, Cn-1 og regler 2. A B1 C1 3. Ci Bi+1 Ci+1 for i = 1,, n-1 20. februar 2012 20

To trinn til: 2. Erstatt alle regler på formen A B for B en ikke-terminal: For enhver ikke-terminal B og For enhver A s.a. A B eller A + B og For enhver β s.a. B β: Innfør A β, hvis den ikke alt finnes Fjern alle unære regler med ikke-terminal h.s. 1. Erstatt alle regler på formen A ε unntatt evt. S ε: Identifiser alle ikke-terminaler A, s.a. A + ε For hver regel B β: Legg til alle regler på formen B β der β fremkommer ved å stryke en eller flere ikke-terminaler som avleder ε, fra β. Stryk alle regler på formen A ε 20. februar 2012 21

Eksempel S NP VP NP PN NP Det As N As ε As A As VP løp smilte PN Kari Ola N gutt jente mor far A snill gammel trøtt 20. februar 2012 22

Dynamic Programming DP search methods fill tables with partial results and thereby Avoid doing avoidable repeated work Solve exponential problems in polynomial time (well, no not really) Efficiently store ambiguous structures with shared sub-parts. Vi skal se på: CKY Chartparsing (Earleys algoritme svarer til en variant av chartparsing) 2/20/2012 Speech and Language Processing - Jurafsky and Martin 24

CKY Parsing First we ll limit our grammar to CNF Consider the rule A BC If there is an A somewhere in the input then there must be a B followed by a C in the input. If the A spans from i to j in the input then there must be some k st. i<k<j Ie. The B splits from the C someplace. 2/20/2012 Speech and Language Processing - Jurafsky and Martin 25

Sample L1 Grammar 2/20/2012 Speech and Language Processing - Jurafsky and Martin 26

CNF Conversion 2/20/2012 Speech and Language Processing - Jurafsky and Martin 27

CKY So let s build a table so that an A spanning from i to j in the input is placed in cell [i,j] in the table. So a non-terminal spanning an entire string will sit in cell [0, n] Hopefully an S If we build the table bottom-up, we ll know that the parts of the A must go from i to k and from k to j, for some k. 2/20/2012 Speech and Language Processing - Jurafsky and Martin 28

CKY Meaning that for a rule like A B C we should look for a B in [i,k] and a C in [k,j]. In other words, if we think there might be an A spanning i,j in the input AND A B C is a rule in the grammar THEN There must be a B in [i,k] and a C in [k,j] for some i<k<j 2/20/2012 Speech and Language Processing - Jurafsky and Martin 29

CKY So to fill the table loop over the cell[i,j] values in some systematic way What constraint should we put on that systematic search? For each cell, loop over the appropriate k values to search for things to add. 2/20/2012 Speech and Language Processing - Jurafsky and Martin 30

Example 2/20/2012 Speech and Language Processing - Jurafsky and Martin 31

Example Filling column 5 2/20/2012 Speech and Language Processing - Jurafsky and Martin 33

CKY Algorithm 2/20/2012 Speech and Language Processing - Jurafsky and Martin 38

CKY def cky(words, cfg): tabl = [[[] for i in range(len(words)+1) ] for j in range(len(words))] for i in range(len(words)): tabl[i][i+1] = [p.lhs() for p in cfg.productions() if p.is_lexical() and p.rhs()[0] == words[i]] for j in range(i-1,-1,-1): for k in range(j+1, i+1, 1): for p in grammar.productions(): if (p.rhs()[0] in tabl[j][k] and p.rhs()[1] in tabl[k][i+1]): if not p.lhs() in tabl[j][i+1]: tabl[j][i+1].append(p.lhs()) return tabl 39

Display def cellstring(cell): return (',').join(nt.symbol() for nt in cell) def display(tabl, words): n = len(words) cellstrings = [cellstring(tabl[i][j]) for i in range(n) for j in range(n+1)] space = max([len(cell) for cell in cellstrings + words])+2 for i in range(n): print words[i].ljust(space), print(' ') for start in range(n): for count in range(start): print " ".ljust(space), for end in range(start+1, n+1): cell = tabl[start][end] print ("["+cellstring(cell)+"]").ljust(space), print(' ') 40

Egenskaper Grammatikk: S NP VP NP Det Nsg NP Npl VP IV VP TV NP Det en NP en Nsg fisker maler Npl fisker maler snurrer IV fisker maler snurrer TV fisker maler en fisker maler snurrer Det NP Plass til vilkårlig mange i en celle NLTKs wfst har bare plass til en NP S Nsg Npl NP IV VP TV S S VP Nsg Npl NP IV VP TV S S S VP Npl NP IV VP 41

CKY Parsing Is that really a parser? 2/20/2012 Speech and Language Processing - Jurafsky and Martin 42