
ICS 161: Design and Analysis of Algorithms

Lecture notes for February 29, 1996

Longest Common Subsequences

In this lecture we examine another string matching problem, of finding the longest common subsequence of two strings.

This is a good example of the technique of dynamic programming, which is the following very simple idea: start with a recursive algorithm for the problem, which may be inefficient because it calls itself repeatedly on a small number of subproblems. Simply remember the solution to each subproblem the first time you compute it, then after that look it up instead of recomputing it. The overall time bound then becomes (typically) proportional to the number of distinct subproblems rather than the larger number of recursive calls. We already saw this idea briefly in the first lecture.

As we'll see, there are two ways of doing dynamic programming, top-down and bottom-up. The top-down (memoizing) method is closer to the original recursive algorithm, so easier to understand, but the bottom-up method is usually a little more efficient.

Subsequence testing

Before we define the longest common subsequence problem, let's start with an easy warmup. Suppose you're given a short string (pattern) and long string (text), as in the string matching problem. But now you want to know if the letters of the pattern appear in order (but possibly separated) in the text. If they do, we say that the pattern is a subsequence of the text.

As an example, is "nano" a subsequence of "nematode knowledge"? Yes, and in only one way. The easiest way to see this example is to capitalize the subsequence: "NemAtode kNOwledge".

In general, we can test this as before using a finite state machine. Draw circles and arrows as before, corresponding to partial subsequences (prefixes of the pattern), but now there is no need for backtracking.

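Equivalently, it is easy to write code or pseudocode for this:

    /* Returns nonzero if pattern P is a subsequence of text T. */
    int subseq(const char *P, const char *T)
    {
        if (*P == '\0') return 1;       /* the empty pattern always matches */
        while (*T != '\0')
            if (*P == *T++ && *++P == '\0')
                return 1;               /* matched every pattern character */
        return 0;
    }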
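
As a small illustration of the same greedy scan (my own sketch, not part of the routine above), the following variant also records which text letters were matched by upper-casing them, which reproduces the capitalized "nano" example. The name mark_subseq and the output buffer convention are assumptions made for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies T into out (which must have room for
   strlen(T) + 1 bytes), upper-casing the letters matched greedily
   against P. Returns 1 if all of P was matched, 0 otherwise. */
int mark_subseq(const char *P, const char *T, char *out)
{
    size_t k;
    for (k = 0; T[k] != '\0'; k++) {
        if (*P != '\0' && *P == T[k]) {
            out[k] = (char)toupper((unsigned char)T[k]);
            P++;                      /* advance to the next pattern letter */
        } else {
            out[k] = T[k];
        }
    }
    out[k] = '\0';
    return *P == '\0';
}
```

Called with the pattern "nano" and the text "nematode knowledge", this should leave "NemAtode kNOwledge" in the output buffer. Matching each pattern letter at its first opportunity never hurts, which is why no backtracking is needed.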

Longest common subsequence problem

What if the pattern does not occur in the text? It still makes sense to find the longest subsequence that occurs both in the pattern and in the text. This is the longest common subsequence problem. Since the pattern and text have symmetric roles, from now on we won't give them different names but just call them strings A and B. We'll use m to denote the length of A and n to denote the length of B.

Note that the automata-theoretic method above doesn't solve the problem; instead it gives the longest prefix of A that's a subsequence of B. But the longest common subsequence of A and B is not always a prefix of A.

Why might we want to solve the longest common subsequence problem? There are several motivating applications.

Molecular biology. DNA sequences (genes) can be represented as sequences of four letters ACGT, corresponding to the four submolecules forming DNA. When biologists find a new sequence, they typically want to know what other sequences it is most similar to. One way of computing how similar two sequences are is to find the length of their longest common subsequence.

File comparison. The Unix program "diff" is used to compare two different versions of the same file, to determine what changes have been made to the file. It works by finding a longest common subsequence of the lines of the two files; any line in the subsequence has not been changed, so what it displays is the remaining set of lines that have changed. In this instance of the problem we should think of each line of a file as being a single complicated character in a string.

Screen redisplay. Many text editors like "emacs" display part of a file on the screen, updating the screen image as the file is changed. For slow dial-in terminals, these programs want to send the terminal as few characters as possible to cause it to update its display correctly. It is possible to view the computation of the minimum length sequence of characters needed to update the terminal as being a sort of common subsequence problem (the common subsequence tells you the parts of the display that are already correct and don't need to be changed).

(As an aside, it is natural to define a similar longest common substring problem, asking for the longest substring that appears in two input strings. This problem can be solved in linear time using a data structure known as the suffix tree but the solution is extremely complicated.)

Recursive solution

So we want to solve the longest common subsequence problem by dynamic programming. To do this, we first need a recursive solution. The dynamic programming idea doesn't tell us how to find this, it just gives us a way of making the solution more efficient once we have one.

Let's start with some simple observations about the LCS problem. If we have two strings, say "nematode knowledge" and "empty bottle", we can represent a subsequence as a way of writing the two so that certain letters line up:

    nematode know ledge
     || |   |  |  ||
     empty   b ottle

If we draw lines connecting the letters in the first string to the corresponding letters in the second, no two lines cross (the top and bottom endpoints occur in the same order, the order of the letters in the subsequence). Conversely, any set of lines drawn like this, without crossings, represents a subsequence.

From this we can observe the following simple fact: if the two strings start with the same letter, it's always safe to choose that starting letter as the first character of the subsequence. This is because, if you have some other subsequence, represented as a collection of lines as drawn above, you can "push" the leftmost line to the start of the two strings, without causing any other crossings, and get a representation of an equally long subsequence that does start this way.

On the other hand, suppose that, like the example above, the two first characters differ. Then it is not possible for both of them to be part of a common subsequence: one or the other (or maybe both) will have to be removed.

Finally, observe that once we've decided what to do with the first characters of the strings, the remaining subproblem is again a longest common subsequence problem, on two shorter strings. Therefore we can solve it recursively.

Rather than finding the subsequence itself, it turns out to be more efficient to find the length of the longest subsequence. Then in the case where the first characters differ, we can determine which subproblem gives the correct solution by solving both and taking the max of the resulting subsequence lengths. Once we turn this into a dynamic program we will see how to get the sequence itself.

These observations give us the following, very inefficient, recursive algorithm.

Recursive LCS:

    static int max(int a, int b) { return a > b ? a : b; }

    int lcs_length(const char *A, const char *B)
    {
        if (*A == '\0' || *B == '\0') return 0;
        else if (*A == *B) return 1 + lcs_length(A+1, B+1);
        else return max(lcs_length(A+1, B), lcs_length(A, B+1));
    }

This is a correct solution but it's very time consuming. For example, if the two strings have no matching characters, so the last line always gets executed, then the time bounds are binomial coefficients, which (if m=n) are at least 2^n.
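
To get a feel for this blow-up, here is a rough sketch (my own addition, with the hypothetical name lcs_calls) that counts how many calls the plain recursion makes instead of computing the LCS length. When no characters match, the count works out to 2*C(m+n,m) - 1, where C denotes a binomial coefficient.

```c
/* Counts the calls made by the plain recursive algorithm on A and B.
   This mirrors its three cases but returns a call count, not a length. */
long lcs_calls(const char *A, const char *B)
{
    if (*A == '\0' || *B == '\0') return 1;             /* base case */
    if (*A == *B) return 1 + lcs_calls(A + 1, B + 1);   /* one recursive call */
    return 1 + lcs_calls(A + 1, B) + lcs_calls(A, B + 1);
}
```

For two strings of eight characters each with no letters in common, this gives 25739 calls, even though there are only 9*9 = 81 distinct pairs of suffixes.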

Memoization

The problem with the recursive solution is that the same subproblems get called many different times. A subproblem consists of a call to lcs_length, with the arguments being two suffixes of A and B, so there are exactly (m+1)(n+1) possible subproblems (a relatively small number). If there are 2^n or more recursive calls, some of these subproblems must be being solved over and over.

The dynamic programming solution is to check, whenever we want to solve a subproblem, whether we've already done it before. If so we look up the solution instead of recomputing it. Implemented in the most direct way, we just add some code to our recursive algorithm to do this lookup. This "top-down", recursive version of dynamic programming is known as "memoization".

In the LCS problem, subproblems consist of a pair of suffixes of the two input strings. To make it easier to store and look up subproblem solutions, I'll represent these by the starting positions in the strings, rather than (as I wrote it above) character pointers.

Recursive LCS with indices:

    const char *A;
    const char *B;

    int subproblem(int i, int j);   /* defined below */

    int lcs_length(const char *AA, const char *BB)
    {
        A = AA; B = BB;
        return subproblem(0, 0);
    }

    int subproblem(int i, int j)
    {
        if (A[i] == '\0' || B[j] == '\0') return 0;
        else if (A[i] == B[j]) return 1 + subproblem(i+1, j+1);
        else return max(subproblem(i+1, j), subproblem(i, j+1));
    }

Now to turn this into a dynamic programming algorithm we need only use an array to store the subproblem results. When we want the solution to a subproblem, we first look in the array, and check if there already is a solution there. If so we return it; otherwise we perform the computation and store the result. In the LCS problem, no result is negative, so we'll use -1 as a flag to tell the algorithm that nothing has been stored yet.

Memoizing LCS:

    char * A;
    char * B;
    array L;

    int lcs_length(char * AA, char * BB)
    {
        A = AA; B = BB;
        allocate storage for L;         /* (m+1) by (n+1) entries */
        for (i = 0; i <= m; i++)
            for (j = 0; j <= n; j++)
                L[i,j] = -1;            /* -1 means "not yet computed" */

        return subproblem(0, 0);
    }

    int subproblem(int i, int j)
    {
        if (L[i,j] < 0) {
            if (A[i] == '\0' || B[j] == '\0') L[i,j] = 0;
            else if (A[i] == B[j]) L[i,j] = 1 + subproblem(i+1, j+1);
            else L[i,j] = max(subproblem(i+1, j), subproblem(i, j+1));
        }
        return L[i,j];
    }
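
The pseudocode above leaves the array allocation and the two-dimensional indexing abstract. One possible way to fill in those details in real C is sketched below; the flattened one-dimensional table, the name lcs_length_memo, and the -1 return value on allocation failure are all choices made for this example rather than part of the algorithm.

```c
#include <stdlib.h>
#include <string.h>

static const char *A, *B;   /* the two input strings */
static int n;               /* length of B; each row of L has n+1 entries */
static int *L;              /* flattened (m+1) x (n+1) table, -1 = unknown */

static int max_int(int a, int b) { return a > b ? a : b; }

static int subproblem(int i, int j)
{
    int *cell = &L[i * (n + 1) + j];    /* plays the role of L[i,j] */
    if (*cell < 0) {
        if (A[i] == '\0' || B[j] == '\0') *cell = 0;
        else if (A[i] == B[j]) *cell = 1 + subproblem(i + 1, j + 1);
        else *cell = max_int(subproblem(i + 1, j), subproblem(i, j + 1));
    }
    return *cell;
}

/* Returns the LCS length, or -1 if the table cannot be allocated. */
int lcs_length_memo(const char *AA, const char *BB)
{
    int m = (int)strlen(AA);
    int cells, result;

    A = AA; B = BB; n = (int)strlen(BB);
    cells = (m + 1) * (n + 1);
    L = malloc((size_t)cells * sizeof *L);
    if (L == NULL) return -1;
    for (int k = 0; k < cells; k++) L[k] = -1;

    result = subproblem(0, 0);
    free(L);
    return result;
}
```

On "nematode knowledge" and "empty bottle" this should return 7. The global variables follow the style of the pseudocode; a version meant for reuse would more likely pass the table around explicitly.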

Time analysis: each call to subproblem takes constant time. We call it once from the main routine, and at most twice every time we fill in an entry of array L. There are (m+1)(n+1) entries, so the total number of calls is at most 2(m+1)(n+1)+1 and the time is O(mn).

As usual, this is a worst case analysis. The time might sometimes be better, if not all array entries get filled out. For instance if the two strings match exactly, we'll only fill in diagonal entries and the algorithm will be fast.

Bottom-up dynamic programming

We can view the code above as just being a slightly smarter way of doing the original recursive algorithm, saving work by not repeating subproblem computations. But it can also be thought of as a way of computing the entries in the array L. The recursive algorithm controls what order we fill them in, but we'd get the same results if we filled them in in some other order. We might as well use something simpler, like a nested loop, that visits the array systematically. The only thing we have to worry about is that when we fill in a cell L[i,j], we need to already know the values it depends on, namely in this case L[i+1,j], L[i,j+1], and L[i+1,j+1]. For this reason we'll traverse the array backwards, from the last row working up to the first and from the last column working up to the first. This is iterative (because it uses nested loops instead of recursion) or bottom-up (because the order we fill in the array is from smaller simpler subproblems to bigger more complicated ones).

Iterative LCS:

    int lcs_length(char * A, char * B)
    {
        allocate storage for array L;
        for (i = m; i >= 0; i--)
            for (j = n; j >= 0; j--)
            {
                if (A[i] == '\0' || B[j] == '\0') L[i,j] = 0;
                else if (A[i] == B[j]) L[i,j] = 1 + L[i+1, j+1];
                else L[i,j] = max(L[i+1, j], L[i, j+1]);
            }
        return L[0,0];
    }
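
For readers who want to run the iterative version, here is one way it might look in real C. This is only a sketch: the flattened table with row width w = n+1, the name lcs_length_iter, and the -1 error return are assumptions of this example.

```c
#include <stdlib.h>
#include <string.h>

/* Bottom-up LCS length; returns -1 if the table cannot be allocated. */
int lcs_length_iter(const char *A, const char *B)
{
    int m = (int)strlen(A), n = (int)strlen(B);
    int w = n + 1;                       /* entries per row */
    int *L = malloc((size_t)(m + 1) * (size_t)w * sizeof *L);
    int result;

    if (L == NULL) return -1;
    for (int i = m; i >= 0; i--) {       /* last row first */
        for (int j = n; j >= 0; j--) {   /* last column first */
            if (i == m || j == n) {
                L[i * w + j] = 0;        /* one of the suffixes is empty */
            } else if (A[i] == B[j]) {
                L[i * w + j] = 1 + L[(i + 1) * w + j + 1];
            } else {
                int down = L[(i + 1) * w + j], right = L[i * w + j + 1];
                L[i * w + j] = down > right ? down : right;
            }
        }
    }
    result = L[0];                       /* this is L[0,0] */
    free(L);
    return result;
}
```

The tests i == m and j == n are the same as checking for the terminating '\0' in A[i] or B[j], since m and n are the string lengths.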

Advantages of this method include the fact that iteration is usually faster than recursion, we don't need to initialize the matrix to all -1's, and we save three if statements per iteration since we don't need to test whether L[i,j], L[i+1,j], and L[i,j+1] have already been computed (we know in advance that the answers will be no, yes, and yes). One disadvantage over memoizing is that this fills in the entire array even when it might be possible to solve the problem by looking at only a fraction of the array's cells.

The subsequence itself

What if you want the subsequence itself, and not just its length? This is important for some but not all of the applications we mentioned. Once we have filled in the array L described above, we can find the sequence by working forwards through the array.

    sequence S = empty;
    i = 0;
    j = 0;
    while (i < m && j < n)
    {
        if (A[i] == B[j])
        {
            add A[i] to end of S;
            i++; j++;
        }
        else if (L[i+1,j] >= L[i,j+1]) i++;
        else j++;
    }
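
Putting the table fill and the forward walk together, a complete routine might look like the following sketch. The name lcs_string, the caller-supplied output buffer, and the -1 error return are assumptions of this example; the tie-breaking rule is the one in the loop above.

```c
#include <stdlib.h>
#include <string.h>

/* Writes an LCS of A and B into S, which must have room for
   min(strlen(A), strlen(B)) + 1 bytes. Returns the LCS length,
   or -1 if the table cannot be allocated. */
int lcs_string(const char *A, const char *B, char *S)
{
    int m = (int)strlen(A), n = (int)strlen(B), w = n + 1;
    int i, j, len = 0;
    int *L = malloc((size_t)(m + 1) * (size_t)w * sizeof *L);

    if (L == NULL) return -1;

    /* Fill the table bottom-up, exactly as in the iterative algorithm. */
    for (i = m; i >= 0; i--)
        for (j = n; j >= 0; j--) {
            if (i == m || j == n) L[i * w + j] = 0;
            else if (A[i] == B[j]) L[i * w + j] = 1 + L[(i + 1) * w + j + 1];
            else {
                int down = L[(i + 1) * w + j], right = L[i * w + j + 1];
                L[i * w + j] = down > right ? down : right;
            }
        }

    /* Walk forwards from L[0,0], collecting the matched characters. */
    i = 0; j = 0;
    while (i < m && j < n) {
        if (A[i] == B[j]) { S[len++] = A[i]; i++; j++; }
        else if (L[(i + 1) * w + j] >= L[i * w + j + 1]) i++;   /* ties move down */
        else j++;
    }
    S[len] = '\0';
    free(L);
    return len;
}
```

With A = "empty bottle" and B = "nematode knowledge", this should produce the seven-character string "emt ole" (the space counts as a matched character).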

Let's see an example of this. Here's the array for the earlier example (with "_" standing for the space character):

      n e m a t o d e _ k n o w l e d g e
    e 7 7 6 5 5 5 5 5 4 3 3 3 2 2 2 1 1 1 0
    m 6 6 6 5 5 4 4 4 4 3 3 3 2 2 1 1 1 1 0
    p 5 5 5 5 5 4 4 4 4 3 3 3 2 2 1 1 1 1 0
    t 5 5 5 5 5 4 4 4 4 3 3 3 2 2 1 1 1 1 0
    y 4 4 4 4 4 4 4 4 4 3 3 3 2 2 1 1 1 1 0
    _ 4 4 4 4 4 4 4 4 4 3 3 3 2 2 1 1 1 1 0
    b 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 1 1 1 0
    o 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 1 1 1 0
    t 3 3 3 3 3 2 2 2 2 2 2 2 2 2 1 1 1 1 0
    t 3 3 3 3 3 2 2 2 2 2 2 2 2 2 1 1 1 1 0
    l 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 0
    e 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(You can check that each entry is computed correctly from the entries below and to the right.) In this array the rows are indexed by i, with "empty bottle" playing the role of A, and the columns by j, with "nematode knowledge" playing the role of B.

To find the longest common subsequence, look at the first entry L[0,0]. This is 7, telling us that the sequence has seven characters. L[0,0] was computed as max(L[0,1],L[1,0]), corresponding to the subproblems formed by deleting either the "n" from the first string or the "e" from the second. Deleting the "n" gives a subsequence of length L[0,1]=7, but deleting the "e" only gives L[1,0]=6, so we can only delete the "n". Now let's look at the entry L[0,1] coming from this deletion. A[0]=B[1]="e" so we can safely include this "e" as part of the subsequence, and move to L[1,2]=6. Similarly this entry gives us an "m" in our sequence. Continuing in this way (and breaking ties as in the algorithm above, by moving down instead of across) gives the common subsequence "emt ole".

So we can find longest common subsequences in time O(mn). Actually, if you look at the matrix above, you can tell that it has a lot of structure: the numbers in the matrix form large blocks in which the value is constant, with only a small number of "corners" at which the value changes. It turns out that one can take advantage of these corners to speed up the computation. The current (theoretically) fastest algorithm for longest common subsequences (due to myself and co-authors) runs in time O(n log s + c log log min(c,mn/c)) where c is the number of these corners, and s is the number of characters appearing in the two strings.

Relation to paths in graphs

Let's draw a directed graph, with vertices corresponding to entries in the array L, and an edge connecting an entry to one it depends on: either one edge to L[i+1,j+1] if A[i]=B[j], or two edges to L[i+1,j] and L[i,j+1] otherwise. Don't draw edges from the bottom right fringe of the array (since those entries don't depend on any others).

      n e m a t o d e _ k n o w l e d g e
    e o-o o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o
      |  \| | | | |  \| | | | | |  \| |  \|
    m o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
      | |  \| | | | | | | | | | | | | | | |
    p o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
      | | | | | | | | | | | | | | | | | | |
    t o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
      | | | |  \| | | | | | | | | | | | | |
    y o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
      | | | | | | | | | | | | | | | | | | |
    _ o-o-o-o-o-o-o-o-o o-o-o-o-o-o-o-o-o-o
      | | | | | | | |  \| | | | | | | | | |
    b o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
      | | | | | | | | | | | | | | | | | | |
    o o-o-o-o-o-o o-o-o-o-o-o o-o-o-o-o-o-o
      | | | | |  \| | | | |  \| | | | | | |
    t o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
      | | | |  \| | | | | | | | | | | | | |
    t o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
      | | | |  \| | | | | | | | | | | | | |
    l o-o-o-o-o-o-o-o-o-o-o-o-o-o o-o-o-o-o
      | | | | | | | | | | | | |  \| | | | |
    e o-o o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o
      |  \| | | | |  \| | | | | |  \| |  \|
      o o o o o o o o o o o o o o o o o o o

Then if you look at any path in the graph, the diagonal edges form a subsequence of the two strings. Conversely, if you define the horizontal and vertical edges to have length zero, and the diagonal edges to have length one, the longest common subsequence corresponds to the longest path from the top left corner to one of the bottom right vertices. This graph is acyclic, so we can compute longest paths in time linear in the size of the graph, here O(mn).

Where did these edge lengths come from? They're just the amount by which the LCS length increases compared to the length at the corresponding subproblem. If A[i]=B[j], then L[i,j]=L[i+1,j+1]+1, and we use that last "+1" as the edge length. Otherwise, L[i,j]=max(L[i+1,j],L[i,j+1])+0, so we use zero as the edge length.

This sort of phenomenon, in which a dynamic programming problem turns out to be equivalent to a shortest or longest path problem, does not always happen with other problems, but it is reasonably common. This idea doesn't really help compute the single longest common subsequence, but one of my papers uses similar graph-theoretic ideas to find multiple long common subsequences (and multiple solutions to many other problems).

Reduced space complexity

One disadvantage of the dynamic programming methods we've described, compared to the original recursion, is that they use a lot of space: O(mn) for the array L (the recursion only uses O(n+m)). But the iterative version can be easily modified to use less space. The observation is that once we've computed row i of array L, we no longer need the values in row i+1.

Space-efficient LCS:

    int lcs_length(char * A, char * B)
    {
        allocate storage for one-dimensional arrays X and Y;   /* n+1 entries each */
        for (i = m; i >= 0; i--)
        {
            for (j = n; j >= 0; j--)
            {
                if (A[i] == '\0' || B[j] == '\0') X[j] = 0;
                else if (A[i] == B[j]) X[j] = 1 + Y[j+1];
                else X[j] = max(Y[j], X[j+1]);
            }
            Y = X;      /* copy the contents of row X into row Y */
        }
        return X[0];
    }

This takes roughly the same amount of time as before, O(mn). It uses a little more time to copy X into Y but this only increases the time by a constant (and can be avoided with some more care). The space is either O(m) or O(n), whichever is smaller (switch the two strings if necessary so there are more rows than columns). Unfortunately, this solution does not leave you with enough information to find the subsequence itself, just its length.

In 1975, Dan Hirschberg showed how to find not just the length, but the longest common subsequence itself, in linear space and O(mn) time. The idea is as above to use one-dimensional arrays X and Y to store the rows of the larger two-dimensional array L. But Hirschberg's method treats the middle row of array L specially: for all i<m/2, he stores along with the numbers in X and Y the place where some path (corresponding to a subsequence with that many characters) crosses the middle row. These crossing places can be updated along with the array values, by copying them from X[j+1], Y[j], or Y[j+1] as appropriate.

Then when the algorithm above has finished with the LCS length in X[0], Hirschberg finds the corresponding crossing place (m/2,k). He then solves recursively two LCS problems, one for A[0..m/2-1] and B[0..k-1] and one for A[m/2..m] and B[k..n]. The longest common subsequence is the concatenation of the sequences found by these two recursive calls.
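
Here is one way the two-row idea might be written in real C. This sketch swaps the row pointers rather than copying X into Y, and it first exchanges the strings if needed so that the rows are as short as possible. The name lcs_length_rows and the -1 error return are assumptions of this example.

```c
#include <stdlib.h>
#include <string.h>

/* LCS length using two rows of min(m, n) + 1 ints.
   Returns -1 if the rows cannot be allocated. */
int lcs_length_rows(const char *A, const char *B)
{
    int m = (int)strlen(A), n = (int)strlen(B);

    if (n > m) {                        /* make B the shorter string */
        const char *ts = A; A = B; B = ts;
        int tn = m; m = n; n = tn;
    }

    int *buf = malloc((size_t)2 * (size_t)(n + 1) * sizeof *buf);
    if (buf == NULL) return -1;
    int *X = buf;                       /* row i, being filled in */
    int *Y = buf + (n + 1);             /* row i+1, already complete */

    for (int i = m; i >= 0; i--) {
        for (int j = n; j >= 0; j--) {
            if (i == m || j == n) X[j] = 0;
            else if (A[i] == B[j]) X[j] = 1 + Y[j + 1];
            else X[j] = Y[j] > X[j + 1] ? Y[j] : X[j + 1];
        }
        int *t = X; X = Y; Y = t;       /* row i becomes the "previous" row */
    }

    int result = Y[0];                  /* after the last swap, Y holds row 0 */
    free(buf);
    return result;
}
```

Exchanging the strings is safe because the LCS length does not depend on which string is called A.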
It is not hard to see that this method uses linear space. What about time complexity? This is a recursive algorithm, with a time recurrence

    T(m,n) = O(mn) + T(m/2,k) + T(m/2,n-k)

You can think of this as sort of like quicksort: we're breaking both strings into parts. But unlike quicksort it doesn't matter that the second string can be broken unequally. No matter what k is, the recurrence solves to O(mn). The easiest way to see this is to think about what it's doing in the array L. The main part of the algorithm visits the whole array, then the two calls visit two subarrays, one above and left of (m/2,k) and the other below and to the right. No matter what k is, the total size of these two subarrays is roughly mn/2. So instead we can write a simplified recurrence

    T(mn) = O(mn) + T(mn/2)

in which the work at successive levels forms a geometric series, mn + mn/2 + mn/4 + ..., which is less than 2mn. So this solves to O(mn) time total.
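
To make the crossing-place idea concrete, here is a sketch of how it might be coded in C. This is my own reading of the description above, not Hirschberg's original formulation: alongside rows X and Y it keeps rows XC and YC holding, for each entry at or above the middle row, the column where a best path from that entry meets the middle row. The function names, the shared buffer of four rows, and the special case for a one-letter string are all assumptions of this example.

```c
#include <stdlib.h>
#include <string.h>

/* Appends an LCS of A[0..m) and B[0..n) to out. buf provides four rows
   of at least n + 1 ints each and is reused by the recursive calls. */
static void solve(const char *A, int m, const char *B, int n,
                  int *buf, char *out, int *len)
{
    if (m == 0 || n == 0) return;
    if (m == 1) {                       /* one letter of A: look for it in B */
        if (memchr(B, A[0], (size_t)n) != NULL) out[(*len)++] = A[0];
        return;
    }

    int mid = m / 2;
    int *X = buf, *Y = buf + (n + 1);                      /* rows i and i+1 */
    int *XC = buf + 2 * (n + 1), *YC = buf + 3 * (n + 1);  /* crossing columns */

    for (int i = m; i >= 0; i--) {
        for (int j = n; j >= 0; j--) {
            if (i == m || j == n)      { X[j] = 0;            XC[j] = j; }
            else if (A[i] == B[j])     { X[j] = 1 + Y[j + 1]; XC[j] = YC[j + 1]; }
            else if (Y[j] >= X[j + 1]) { X[j] = Y[j];         XC[j] = YC[j]; }
            else                       { X[j] = X[j + 1];     XC[j] = XC[j + 1]; }
            if (i >= mid) XC[j] = j;    /* on or below the middle row */
        }
        int *t = X; X = Y; Y = t;       /* row i becomes the "previous" row */
        t = XC; XC = YC; YC = t;
    }

    int k = YC[0];      /* a best path from (0,0) crosses row mid at column k */
    solve(A, mid, B, k, buf, out, len);
    solve(A + mid, m - mid, B + k, n - k, buf, out, len);
}

/* Writes an LCS into out (room for min(m, n) + 1 bytes) and returns its
   length, or -1 if the row buffer cannot be allocated. */
int lcs_hirschberg(const char *A, const char *B, char *out)
{
    int n = (int)strlen(B), len = 0;
    int *buf = malloc((size_t)4 * (size_t)(n + 1) * sizeof *buf);
    if (buf == NULL) return -1;
    solve(A, (int)strlen(A), B, n, buf, out, &len);
    out[len] = '\0';
    free(buf);
    return len;
}
```

The one-letter case is handled directly because m/2 would be 0 there and the split would make no progress. When several longest common subsequences exist, the one returned depends on how ties are broken, so it need not be the same string that the full-table walk produces, only one of the same length.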

ICS 161, Dept. Information & Computer Science, UC Irvine