Dynamic Programming Longest Common Subsequence. Class 27


Protein
a protein is a complex molecule composed of long single-strand chains of amino acid molecules
there are 20 amino acids that make up proteins
the primary structure of a protein is the sequence of its amino acids
given a new unknown protein, it is relatively easy to determine the primary structure
since there are 20 amino acids and 26 Roman letters, we represent a protein's primary structure as a string of letters
A alanine, C cysteine, D aspartic acid, E glutamic acid, etc.

A Protein human breast cancer susceptibility protein 1: MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQCPLCKNDITKRSLQESTRFSQLVEELLKIICAFQ LDTGLEYANSYNFAKKENNSPEHLKDEVSIIQSMGYRNRAKRLLQSEPENPSLQETSLSVQLSNLGTVRTLRTKQRIQPQKTSVYIELGSDSSE DTVNKATYCSVGDQELLQITPQGTRDEISLDSAKKAACEFSETDVTNTEHHQPSNNDLNTTEKRAAERHPEKYQGSSVSNLHVEPCGTNTHASS LQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQKLPCSENPRDTEDVPWITLN SSIQKVNEWFSRSDELLGSDDSHDGESESNAKVADVLDVLNEVDEYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKAS LPNLSHVTENLIIGAFVTEPQIIQERPLTNKLKRKRRPTSGLHPEDFIKKADLAVQKTPEMINQGTNQTEQNGQVMNITNSGHENKTKGDSIQN EKNPNPIESLEKESAFKTKAEPISSSISNMELELNIHNSKAPKKNRLRRKSSTRHIHALELVVSRNLSPPNCTELQIDSCSSSEEIKKKKYNQM PVRHSRNLQLMEGKEPATGAKKSNKPNEQTSKRHDSDTFPELKLTNAPGSFTKCSNTSELKEFVNPSLPREEKEEKLETVKVSNNAEDPKDLML SGERVLQTERSVESSSISLVPGTDYGTQESISLLEVSTLGKAKTEPNKCVSQCAAFENPKGLIHGCSKDNRNDTEGFKYPLGHEVNHSRETSIE MEESELDAQYLQNTFKVSKRQSFAPFSNPGNAEEECATFSAHSGSLKKQSPKVTFECEQKEENQGKNESNIKPVQTVNITAGFPVVGQKDKPVD NAKCSIKGGSRFCLSSQFRGNETGLITPNKHGLLQNPYRIPPLFPIKSFVKTKCKKNLLEENFEEHSMSPEREMGNENIPSTVSTISRNNIREN VFKEASSSNINEVGSSTNEVGSSINEIGSSDENIQAELGRNRGPKLNAMLRLGVLQPEVYKQSLPGSNCKHPEIKKQEYEEVVQTVNTDFSPYL ISDNLEQPMGSSHASQVCSETPDDLLDDGEIKEDTSFAENDIKESSAVFSKSVQKGELSRSPSPFTHTHLAQGYRRGAKKLESSEENLSSEDEE LPCFQHLLFGKVNNIPSQSTRHSTVATECLSKNTEENLLSLKNSLNDCSNQVILAKASQEHHLSEETKCSASLFSSQCSELEDLTANTNTQDPF LIGSSKQMRHQSESQGVGLSDKELVSDDEERGTGLEENNQEEQSMDSNLGEAASGCESETSVSEDCSGLSSQSDILTTQQRDTMQHNLIKLQQE MAELEAVLEQHGSQPSNSYPSIISDSSALEDLRNPEQSTSEKAVLTSQKSSEYPISQNPEGLSADKFEVSADSSTSKNKEPGVERSSPSKCPSL DDRWYMHSCSGSLQNRNYPSQEELIKVVDVEEQQLEESGPHDLTETSYLPRQDLEGTPYLESGISLFSDDPESDPSEDRAPESARVGNIPSSTS ALKVPQLKVAESAQSPAAAHTTDTAGYNAMEESVSREKPELTASTERVNKRMSMVVSGLTPEEFMLVYKFARKHHITLTNLITEETTHVVMKTD AEFVCERTLKYFLGIAGGKWVVSYFWVTQSIKERKMLNEHDFEVRGDVVNGRNHQGPKRARESQDRKIFRGLEICCYGPFTNMPTDQLEWMVQL CGASVVKELSSFTLGTGVHPIVVVQPDAWTEDNGFHAIGQMCEAPVVTREWVLDSVALYQCQELDTYLIPQIPHSHY

Similarity
however, while the primary sequence is essential to understanding function, it is not sufficient
an approach to understanding function is to compare the unknown sequence with the sequences of proteins whose functions are known
two proteins with similar sequences may have similar functions
the question is the same as with string alignment: what is similar? what metric do we use?
we can use string alignment for proteins; another common similarity metric is the length of the longest common subsequence

Longest Common Subsequence
what is the longest list of letters, in order, in both strings?
H S R E T S I E M E E S
T V S T I S R N N I R E
a longest common subsequence: S T S I E
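To make "in order, in both strings" concrete, here is a minimal C++ sketch that checks whether one string is a subsequence of another. The helper name `is_subsequence` is an assumption made for illustration; it does not come from the slides.

```cpp
#include <cstddef>
#include <string>

// Returns true if every character of `candidate` appears in `text`
// in the same order, though not necessarily next to each other.
// `is_subsequence` is a hypothetical helper name, not from the slides.
bool is_subsequence(const std::string& candidate, const std::string& text) {
    std::size_t matched = 0;  // how many characters of candidate have been found
    for (char c : text) {
        if (matched < candidate.size() && c == candidate[matched]) {
            ++matched;
        }
    }
    return matched == candidate.size();
}
```

With the two strings from the slide, `is_subsequence("STSIE", "HSRETSIEMEES")` and `is_subsequence("STSIE", "TVSTISRNNIRE")` both hold, so STSIE is a common subsequence. Checking that no longer one exists is what the dynamic program is for.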

Step 1
given two sequences $s[0..n-1]$ and $t[0..m-1]$, we wish to find a longest common subsequence, i.e., a subsequence of s
$s[i_0], \ldots, s[i_{k-1}]$
and a subsequence of t
$t[j_0], \ldots, t[j_{k-1}]$
such that
$s[i_0] = t[j_0], \ldots, s[i_{k-1}] = t[j_{k-1}]$
where k is as large as possible
let opt(i, j) be the optimum length of a common subsequence of $s[0..i]$ and $t[0..j]$, where $0 \le i < n$ and $0 \le j < m$

Step 2
we need a recurrence relation for opt(i, j)
there are two cases:
1. $s[i] = t[j]$
2. $s[i] \ne t[j]$
let's look at case 1
if the ith character of s and the jth character of t match, then the lcs of the strings without those two characters is one shorter than their lcs with those characters
[Figure: the strings X X X X R Z Z Z and Y Y Y Y Y R Z Z Z, with index i marked on the first string and index j on the second]

Step 2
case 2: $s[i] \ne t[j]$
whatever the lcs of s[0..i] and t[0..j] is, it is not true that both s[i] and t[j] are part of it (why?)
therefore, the lcs of s[0..i] and t[0..j] must be either
an lcs of s[0..i-1] and t[0..j], or
an lcs of s[0..i] and t[0..j-1]
now there is enough information to construct a recurrence relation that defines opt(i, j)

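Step 2
\[
\mathrm{opt}(i, j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
\mathrm{opt}(i-1, j-1) + 1 & \text{if } s[i] = t[j] \\
\max\{\mathrm{opt}(i-1, j),\ \mathrm{opt}(i, j-1)\} & \text{if } s[i] \ne t[j]
\end{cases}
\]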
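The recurrence can also be evaluated bottom-up, filling a table row by row. The sketch below is one way to do that and is not taken from the slides. It assumes `std::vector` for the table, uses the hypothetical name `lcs_length`, and lets row 0 and column 0 stand in for the dummy blank character, so `table[i][j]` covers the first i characters of s and the first j characters of t.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Bottom-up evaluation of the opt(i, j) recurrence.
// `lcs_length` is a hypothetical name chosen for this sketch.
unsigned lcs_length(const std::string& s, const std::string& t) {
    const std::size_t n = s.size();
    const std::size_t m = t.size();
    // (n + 1) x (m + 1) table; row 0 and column 0 are the base case (all zeros).
    std::vector<std::vector<unsigned>> table(n + 1, std::vector<unsigned>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            if (s[i - 1] == t[j - 1]) {
                // Case 1: the characters match, extend the diagonal entry.
                table[i][j] = table[i - 1][j - 1] + 1;
            } else {
                // Case 2: drop one character from either string, keep the better.
                table[i][j] = std::max(table[i - 1][j], table[i][j - 1]);
            }
        }
    }
    return table[n][m];
}
```

For example, `lcs_length("GVCEKST", "GDVEGTA")` returns 4. Both loops run over every cell once, so time and space are proportional to n times m.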

Step 3
as before, put a dummy blank character onto the beginning of each string
what is the shape of the memo table?
what does an entry in the memo matrix represent?
what is the minimum value in the memo matrix?

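Step 3

    #include <algorithm>  // max
    #include <climits>    // UINT_MAX
    #include <string>
    using namespace std;
    // Matrix<unsigned> is the course's matrix class; its header is not shown here.

    // Memoized LCS length of s[0..i] and t[0..j].
    // An entry equal to UINT_MAX has not been computed yet.
    // s and t each carry a dummy blank character at index 0.
    unsigned opt(size_t i, size_t j, Matrix<unsigned> & memo,
                 const string & s, const string & t)
    {
        if (memo.at(i, j) == UINT_MAX) {
            if (i == 0 || j == 0) {
                memo.at(i, j) = 0;
            } else if (s.at(i) == t.at(j)) {
                memo.at(i, j) = opt(i - 1, j - 1, memo, s, t) + 1;
            } else {
                memo.at(i, j) = max(opt(i, j - 1, memo, s, t),
                                    opt(i - 1, j, memo, s, t));
            }
        }
        return memo.at(i, j);
    }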
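The function above depends on the course's Matrix class, which is not reproduced in these slides. As a rough, self-contained sketch of how it might be driven, the version below swaps in a `std::vector` of vectors (the alias `Memo`) for Matrix. The names `opt_memo` and `lcs_top_down` are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the course's Matrix<unsigned>; an assumption of this sketch.
using Memo = std::vector<std::vector<unsigned>>;

// Same logic as opt in the slides. s and t must already carry the
// dummy blank character at index 0.
unsigned opt_memo(std::size_t i, std::size_t j, Memo& memo,
                  const std::string& s, const std::string& t) {
    if (memo[i][j] == UINT_MAX) {  // sentinel: entry not computed yet
        if (i == 0 || j == 0) {
            memo[i][j] = 0;
        } else if (s[i] == t[j]) {
            memo[i][j] = opt_memo(i - 1, j - 1, memo, s, t) + 1;
        } else {
            memo[i][j] = std::max(opt_memo(i, j - 1, memo, s, t),
                                  opt_memo(i - 1, j, memo, s, t));
        }
    }
    return memo[i][j];
}

// Hypothetical driver: pads both strings with the dummy blank, fills the
// memo with the sentinel, and asks for the bottom-right entry.
unsigned lcs_top_down(const std::string& a, const std::string& b) {
    const std::string s = " " + a;
    const std::string t = " " + b;
    Memo memo(s.size(), std::vector<unsigned>(t.size(), UINT_MAX));
    return opt_memo(s.size() - 1, t.size() - 1, memo, s, t);
}
```

After padding, the memo has one row per character of the padded s and one column per character of the padded t, so a string of length n contributes n + 1 rows. The recursion depth grows with the combined length of the strings, so very long inputs may be better served by an iterative table fill.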

LCS
build the LCS memo table for these two sequences:
G V C E K S T
G D V E G T A

Longest Common Subsequence
an LCS matrix

            -   G   V   C   E   K   S   T
            0   1   2   3   4   5   6   7
    -   0   0   0   0   0   0   0   0   0
    G   1   0   1   1   1   1   1   1   1
    D   2   0   1   1   1   1   1   1   1
    V   3   0   1   2   2   2   2   2   2
    E   4   0   1   2   2   3   3   3   3
    G   5   0   1   2   2   3   3   3   3
    T   6   0   1   2   2   3   3   3   4
    A   7   0   1   2   2   3   3   3   4

Longest Common Subsequence
an LCS matrix and traceback
[Figure: the same matrix with the traceback path marked]
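A traceback recovers an actual subsequence from the matrix by walking from the bottom-right corner back toward the top-left. The sketch below shows one common way to do this and may differ from the path drawn on the slide. The name `lcs_traceback` and the tie-breaking rule (move up when the two neighbours are equal) are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Builds the LCS table bottom-up, then traces back to recover one LCS.
// `lcs_traceback` is a hypothetical name chosen for this sketch.
std::string lcs_traceback(const std::string& s, const std::string& t) {
    const std::size_t n = s.size();
    const std::size_t m = t.size();
    std::vector<std::vector<unsigned>> table(n + 1, std::vector<unsigned>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            table[i][j] = (s[i - 1] == t[j - 1])
                              ? table[i - 1][j - 1] + 1
                              : std::max(table[i - 1][j], table[i][j - 1]);
        }
    }

    std::string result;
    std::size_t i = n;
    std::size_t j = m;
    while (i > 0 && j > 0) {
        if (s[i - 1] == t[j - 1]) {
            // Matching characters belong to the LCS; step diagonally.
            result.push_back(s[i - 1]);
            --i;
            --j;
        } else if (table[i - 1][j] >= table[i][j - 1]) {
            --i;  // the entry above is at least as good (ties go this way)
        } else {
            --j;  // the entry to the left is strictly better
        }
    }
    // Characters were collected from the end, so reverse them.
    std::reverse(result.begin(), result.end());
    return result;
}
```

For the two sequences in the matrix, `lcs_traceback("GVCEKST", "GDVEGTA")` returns "GVET", whose length matches the 4 in the bottom-right cell. When several longest common subsequences exist, a different tie-breaking rule can return a different one of the same length.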