• 1.58 MB
  • 2022-04-29 14:31:54 发布

数据库系统概念全套配套课件PPT ch19.ppt

  • 95页
  • 当前文档由用户上传发布,收益归属用户
  1. 1、本文档共5页,可阅读全部内容。
  2. 2、本文档内容版权归属内容提供方,所产生的收益全部归内容提供方所有。如果您对本文有版权争议,可选择认领,认领后既往收益都归您。
  3. 3、本文档由用户上传,本站不保证质量和数量令人满意,可能有诸多瑕疵,付费之前,请仔细先通过免费阅读内容等途径辨别内容交易风险。如存在严重挂羊头卖狗肉之情形,可联系本站下载客服投诉处理。
  4. 文档侵权举报电话:19940600175。
'Chapter19:DistributedDatabases Chapter19:DistributedDatabasesHeterogeneousandHomogeneousDatabasesDistributedDataStorageDistributedTransactionsCommitProtocolsConcurrencyControlinDistributedDatabasesAvailabilityDistributedQueryProcessingHeterogeneousDistributedDatabasesDirectorySystems DistributedDatabaseSystemAdistributeddatabasesystemconsistsoflooselycoupledsitesthatsharenophysicalcomponentDatabasesystemsthatrunoneachsiteareindependentofeachotherTransactionsmayaccessdataatoneormoresites HomogeneousDistributedDatabasesInahomogeneousdistributeddatabaseAllsiteshaveidenticalsoftwareAreawareofeachotherandagreetocooperateinprocessinguserrequests.EachsitesurrenderspartofitsautonomyintermsofrighttochangeschemasorsoftwareAppearstouserasasinglesystemInaheterogeneousdistributeddatabaseDifferentsitesmayusedifferentschemasandsoftwareDifferenceinschemaisamajorproblemforqueryprocessingDifferenceinsoftwareisamajorproblemfortransactionprocessingSitesmaynotbeawareofeachotherandmayprovideonly limitedfacilitiesforcooperationintransactionprocessing DistributedDataStorageAssumerelationaldatamodelReplicationSystemmaintainsmultiplecopiesofdata,storedindifferentsites,forfasterretrievalandfaulttolerance.FragmentationRelationispartitionedintoseveralfragmentsstoredindistinctsitesReplicationandfragmentationcanbecombinedRelationispartitionedintoseveralfragments:systemmaintainsseveralidenticalreplicasofeachsuchfragment. DataReplicationArelationorfragmentofarelationisreplicatedifitisstoredredundantlyintwoormoresites.Fullreplicationofarelationisthecasewheretherelationisstoredatallsites.Fullyredundantdatabasesarethoseinwhicheverysitecontainsacopyoftheentiredatabase. DataReplication(Cont.)AdvantagesofReplicationAvailability:failureofsitecontainingrelationrdoesnotresultinunavailabilityofrisreplicasexist.Parallelism:queriesonrmaybeprocessedbyseveralnodesinparallel.Reduceddatatransfer:relationrisavailablelocallyateachsitecontainingareplicaofr.DisadvantagesofReplicationIncreasedcostofupdates:eachreplicaofrelationrmustbeupdated.Increasedcomplexityofconcurrencycontrol:concurrentupdatestodistinctreplicasmayleadtoinconsistentdataunlessspecialconcurrencycontrolmechanismsareimplemented.Onesolution:chooseonecopyasprimarycopyandapplyconcurrencycontroloperationsonprimarycopy DataFragmentationDivisionofrelationrintofragmentsr1,r2,…,rnwhichcontainsufficientinformationtoreconstructrelationr.Horizontalfragmentation:eachtupleofrisassignedtooneormorefragmentsVerticalfragmentation:theschemaforrelationrissplitintoseveralsmallerschemasAllschemasmustcontainacommoncandidatekey(orsuperkey)toensurelosslessjoinproperty.Aspecialattribute,thetuple-idattributemaybeaddedtoeachschematoserveasacandidatekey. HorizontalFragmentationofaccountRelationbranch_nameaccount_numberbalanceHillsideHillsideHillsideA-305A-226A-15550033662account1=branch_name=“Hillside”(account)branch_nameaccount_numberbalanceValleyviewValleyviewValleyviewValleyviewA-177A-402A-408A-639205100001123750account2=branch_name=“Valleyview”(account) VerticalFragmentationofemployee_infoRelationbranch_namecustomer_nametuple_idHillsideHillsideValleyviewValleyviewHillsideValleyviewValleyviewLowmanCampCampKahnKahnKahnGreendeposit1=branch_name,customer_name,tuple_id(employee_info)1234567account_numberbalancetuple_id500336205100006211237501234567A-305A-226A-177A-402A-155A-408A-639deposit2=account_number,balance,tuple_id(employee_info) AdvantagesofFragmentationHorizontal:allowsparallelprocessingonfragmentsofarelationallowsarelationtobesplitsothattuplesarelocatedwheretheyaremostfrequentlyaccessedVertical:allowstuplestobesplitsothateachpartofthetupleisstoredwhereitismostfrequentlyaccessedtuple-idattributeallowsefficientjoiningofverticalfragmentsallowsparallelprocessingonarelationVerticalandhorizontalfragmentationcanbemixed.Fragmentsmaybesuccessivelyfragmentedtoanarbitrarydepth. DataTransparencyDatatransparency:DegreetowhichsystemusermayremainunawareofthedetailsofhowandwherethedataitemsarestoredinadistributedsystemConsidertransparencyissuesinrelationto:FragmentationtransparencyReplicationtransparencyLocationtransparency NamingofDataItems-Criteria1.Everydataitemmusthaveasystem-wideuniquename.2.Itshouldbepossibletofindthelocationofdataitemsefficiently.3.Itshouldbepossibletochangethelocationofdataitemstransparently.4.Eachsiteshouldbeabletocreatenewdataitemsautonomously. CentralizedScheme-NameServerStructure:nameserverassignsallnameseachsitemaintainsarecordoflocaldataitemssitesasknameservertolocatenon-localdataitemsAdvantages:satisfiesnamingcriteria1-3Disadvantages:doesnotsatisfynamingcriterion4nameserverisapotentialperformancebottlenecknameserverisasinglepointoffailure UseofAliasesAlternativetocentralizedscheme:eachsiteprefixesitsownsiteidentifiertoanynamethatitgeneratesi.e.,site17.account.Fulfillshavingauniqueidentifier,andavoidsproblemsassociatedwithcentralcontrol.However,failstoachievenetworktransparency.Solution:Createasetofaliasesfordataitems;Storethemappingofaliasestotherealnamesateachsite.Theusercanbeunawareofthephysicallocationofadataitem,andisunaffectedifthedataitemismovedfromonesitetoanother. DistributedTransactionsTransactionmayaccessdataatseveralsites.Eachsitehasalocaltransactionmanagerresponsiblefor:MaintainingalogforrecoverypurposesParticipatingincoordinatingtheconcurrentexecutionofthetransactionsexecutingatthatsite.Eachsitehasatransactioncoordinator,whichisresponsiblefor:Startingtheexecutionoftransactionsthatoriginateatthesite.Distributingsubtransactionsatappropriatesitesforexecution.Coordinatingtheterminationofeachtransactionthatoriginatesatthesite,whichmayresultinthetransactionbeingcommittedatallsitesorabortedatallsites. TransactionSystemArchitecture SystemFailureModesFailuresuniquetodistributedsystems:Failureofasite.LossofmassagesHandledbynetworktransmissioncontrolprotocolssuchasTCP-IPFailureofacommunicationlinkHandledbynetworkprotocols,byroutingmessagesviaalternativelinksNetworkpartitionAnetworkissaidtobepartitionedwhenithasbeensplitintotwoormoresubsystemsthatlackanyconnectionbetweenthemNote:asubsystemmayconsistofasinglenodeNetworkpartitioningandsitefailuresaregenerallyindistinguishable. CommitProtocolsCommitprotocolsareusedtoensureatomicityacrosssitesatransactionwhichexecutesatmultiplesitesmusteitherbecommittedatallthesites,orabortedatallthesites.notacceptabletohaveatransactioncommittedatonesiteandabortedatanotherThetwo-phasecommit(2PC)protocoliswidelyusedThethree-phasecommit(3PC)protocolismorecomplicatedandmoreexpensive,butavoidssomedrawbacksoftwo-phasecommitprotocol.Thisprotocolisnotusedinpractice. TwoPhaseCommitProtocol(2PC)Assumesfail-stopmodel–failedsitessimplystopworking,anddonotcauseanyotherharm,suchassendingincorrectmessagestoothersites.Executionoftheprotocolisinitiatedbythecoordinatorafterthelaststepofthetransactionhasbeenreached.TheprotocolinvolvesallthelocalsitesatwhichthetransactionexecutedLetTbeatransactioninitiatedatsiteSi,andletthetransactioncoordinatoratSibeCi Phase1:ObtainingaDecisionCoordinatorasksallparticipantstopreparetocommittransactionTi.CiaddstherecordstothelogandforceslogtostablestoragesendsprepareTmessagestoallsitesatwhichTexecutedUponreceivingmessage,transactionmanageratsitedeterminesifitcancommitthetransactionifnot,addarecordtothelogandsendabortTmessagetoCiifthetransactioncanbecommitted,then:addtherecordtothelogforceallrecordsforTtostablestoragesendreadyTmessagetoCi Phase2:RecordingtheDecisionTcanbecommittedofCireceivedareadyTmessagefromalltheparticipatingsites:otherwiseTmustbeaborted.Coordinatoraddsadecisionrecord,or,tothelogandforcesrecordontostablestorage.Oncetherecordstablestorageitisirrevocable(eveniffailuresoccur)Coordinatorsendsamessagetoeachparticipantinformingitofthedecision(commitorabort)Participantstakeappropriateactionlocally. HandlingofFailures-SiteFailureWhensiteSirecovers,itexaminesitslogtodeterminethefateoftransactionsactiveatthetimeofthefailure.Logcontainrecord:siteexecutesredo(T)Logcontainsrecord:siteexecutesundo(T)Logcontainsrecord:sitemustconsultCitodeterminethefateofT.IfTcommitted,redo(T)IfTaborted,undo(T)ThelogcontainsnocontrolrecordsconcerningTrepliesthatSkfailedbeforerespondingtotheprepareTmessagefromCisincethefailureofSkprecludesthesendingofsucha responseC1mustabortTSkmustexecuteundo(T) HandlingofFailures-CoordinatorFailureIfcoordinatorfailswhilethecommitprotocolforTisexecutingthenparticipatingsitesmustdecideonT’sfate:Ifanactivesitecontainsarecordinitslog,thenTmustbecommitted.Ifanactivesitecontainsanrecordinitslog,thenTmustbeaborted.Ifsomeactiveparticipatingsitedoesnotcontainarecordinitslog,thenthefailedcoordinatorCicannothavedecidedtocommitT.CanthereforeabortT.Ifnoneoftheabovecasesholds,thenallactivesitesmusthavearecordintheirlogs,butnoadditionalcontrolrecords(suchasof).InthiscaseactivesitesmustwaitforCitorecover,tofinddecision.Blockingproblem:activesitesmayhavetowaitforfailedcoordinatortorecover. HandlingofFailures-NetworkPartitionIfthecoordinatorandallitsparticipantsremaininonepartition,thefailurehasnoeffectonthecommitprotocol.Ifthecoordinatoranditsparticipantsbelongtoseveralpartitions:Sitesthatarenotinthepartitioncontainingthecoordinatorthinkthecoordinatorhasfailed,andexecutetheprotocoltodealwithfailureofthecoordinator.Noharmresults,butsitesmaystillhavetowaitfordecisionfromcoordinator.Thecoordinatorandthesitesareinthesamepartitionasthecoordinatorthinkthatthesitesintheotherpartitionhavefailed,andfollowtheusualcommitprotocol.Again,noharmresults RecoveryandConcurrencyControlIn-doubttransactionshavea,butneithera ,noranlogrecord.Therecoveringsitemustdeterminethecommit-abortstatusofsuchtransactionsbycontactingothersites;thiscanslowandpotentiallyblockrecovery.Recoveryalgorithmscannotelockinformationinthelog.Insteadof,writeoutL=listoflocksheldbyTwhenthelogiswritten(readlockscanbeomitted).Foreveryin-doubttransactionT,allthelocksnotedinthe logrecordarereacquired.Afterlockreacquisition,transactionprocessingcanresume;thecommitorrollbackofin-doubttransactionsisperformedconcurrentlywiththeexecutionofnewtransactions. AlternativeModelsofTransactionProcessingNotionofasingletransactionspanningmultiplesitesisinappropriateformanyapplicationsE.g.,transactioncrossinganorganizationalboundaryNoorganizationwouldliketopermitanexternallyinitiatedtransactiontoblocklocaltransactionsforanindeterminateperiodAlternativemodelscarryouttransactionsbysendingmessagesCodetohandlemessagesmustbecarefullydesignedtoensureatomicityanddurabilitypropertiesforupdatesIsolationcannotbeguaranteed,inthatintermediatestagesarevisible,butcodemustensurenoinconsistentstatesresultduetoconcurrencyPersistentmessagingsystemsaresystemsthatprovidetransactionalpropertiestomessagesMessagesareguaranteedtobedeliveredexactlyonceWilldiscussimplementationtechniqueslater AlternativeModels(Cont.)Motivatingexample:fundstransferbetweentwobanksTwophasecommitwouldhavethepotentialtoblockupdatesontheaccountsinvolvedinfundstransferAlternativesolution:DebitmoneyfromsourceaccountandsendamessagetoothersiteSitereceivesmessageandcreditsdestinationaccountMessaginghaslongbeenusedfordistributedtransactions(evenbeforecomputerswereinvented!)Atomicityissueoncetransactionsendingamessageiscommitted,messagemustguaranteedtobedeliveredGuaranteeaslongasdestinationsiteisupandreachable,codetohandleundeliverablemessagesmustalsobeavailablee.g.,creditmoneybacktosourceaccount.Ifsendingtransactionaborts,messagemustnotbesent ErrorConditionswithPersistentMessagingCodetohandlemessageshastotakecareofvarietyoffailuresituations(evenassumingguaranteedmessagedelivery)E.g.,ifdestinationaccountdoesnotexist,failuremessagemustbesentbacktosourcesiteWhenfailuremessageisreceivedfromdestinationsite,ordestinationsiteitselfdoesnotexist,moneymustbedepositedbackinsourceaccountProblemifsourceaccounthasbeenclosedgethumanstotakecareofproblemUsercodeexecutingtransactionprocessingusing2PCdoesnothavetodealwithsuchfailuresTherearemanysituationswhereextraeffortoferrorhandlingisworththebenefitofabsenceofblockingE.g.,prettymuchalltransactionsacrossorganizations PersistentMessagingandWorkflowsWorkflowsprovideageneralmodeloftransactionalprocessinginvolvingmultiplesitesandpossiblyhumanprocessingofcertainstepsE.g.,whenabankreceivesaloanapplication,itmayneedtoContactexternalcredit-checkingagenciesGetapprovalsofoneormoremanagersandthenrespondtotheloanapplicationWestudyworkflowsinChapter25Persistentmessagingformstheunderlyinginfrastructureforworkflowsinadistributedenvironment ConcurrencyControlModifyconcurrencycontrolschemesforuseindistributedenvironment.Weassumethateachsiteparticipatesintheexecutionofacommitprotocoltoensureglobaltransactionautomicity.WeassumeallreplicasofanyitemareupdatedWillseehowtorelaxthisincaseofsitefailureslater Single-Lock-ManagerApproachSystemmaintainsasinglelockmanagerthatresidesinasinglechosensite,saySiWhenatransactionneedstolockadataitem,itsendsalockrequesttoSiandlockmanagerdetermineswhetherthelockcanbegrantedimmediatelyIfyes,lockmanagersendsamessagetothesitewhichinitiatedtherequestIfno,requestisdelayeduntilitcanbegranted,atwhichtimeamessageissenttotheinitiatingsite Single-Lock-ManagerApproach(Cont.)Thetransactioncanreadthedataitemfromanyoneofthesitesatwhichareplicaofthedataitemresides.WritesmustbeperformedonallreplicasofadataitemAdvantagesofscheme:SimpleimplementationSimpledeadlockhandlingDisadvantagesofschemeare:Bottleneck:lockmanagersitebecomesabottleneckVulnerability:systemisvulnerabletolockmanagersitefailure. DistributedLockManagerInthisapproach,functionalityoflockingisimplementedbylockmanagersateachsiteLockmanagerscontrolaccesstolocaldataitemsButspecialprotocolsmaybeusedforreplicasAdvantage:workisdistributedandcanbemaderobusttofailuresDisadvantage:deadlockdetectionismorecomplicatedLockmanagerscooperatefordeadlockdetectionMoreonthislaterSeveralvariantsofthisapproachPrimarycopyMajorityprotocolBiasedprotocolQuorumconsensus PrimaryCopyChooseonereplicaofdataitemtobetheprimarycopy.SitecontainingthereplicaiscalledtheprimarysiteforthatdataitemDifferentdataitemscanhavedifferentprimarysitesWhenatransactionneedstolockadataitemQ,itrequestsalockattheprimarysiteofQ.ImplicitlygetslockonallreplicasofthedataitemBenefitConcurrencycontrolforreplicateddatahandledsimilarlytounreplicateddata-simpleimplementation.DrawbackIftheprimarysiteofQfails,Qisinaccessibleeventhoughothersitescontainingareplicamaybeaccessible. MajorityProtocolLocallockmanagerateachsiteadministerslockandunlockrequestsfordataitemsstoredatthatsite.WhenatransactionwishestolockanunreplicateddataitemQresidingatsiteSi,amessageissenttoSi‘slockmanager.IfQislockedinanincompatiblemode,thentherequestisdelayeduntilitcanbegranted.Whenthelockrequestcanbegranted,thelockmanagersendsamessagebacktotheinitiatorindicatingthatthelockrequesthasbeengranted. MajorityProtocol(Cont.)IncaseofreplicateddataIfQisreplicatedatnsites,thenalockrequestmessagemustbesenttomorethanhalfofthensitesinwhichQisstored.ThetransactiondoesnotoperateonQuntilithasobtainedalockonamajorityofthereplicasofQ.Whenwritingthedataitem,transactionperformswritesonallreplicas.BenefitCanbeusedevenwhensomesitesareunavailabledetailsonhowhandlewritesinthepresenceofsitefailurelaterDrawbackRequires2(n/2+1)messagesforhandlinglockrequests,and(n/2+1)messagesforhandlingunlockrequests.Potentialfordeadlockevenwithsingleitem-e.g.,eachof3transactionsmayhavelockson1/3rdofthereplicasofadata. BiasedProtocolLocallockmanagerateachsiteasinmajorityprotocol,however,requestsforsharedlocksarehandleddifferentlythanrequestsforexclusivelocks.Sharedlocks.WhenatransactionneedstolockdataitemQ,itsimplyrequestsalockonQfromthelockmanageratonesitecontainingareplicaofQ.Exclusivelocks.WhentransactionneedstolockdataitemQ,itrequestsalockonQfromthelockmanageratallsitescontainingareplicaofQ.Advantage-imposeslessoverheadonreadoperations.Disadvantage-additionaloverheadonwrites QuorumConsensusProtocolAgeneralizationofbothmajorityandbiasedprotocolsEachsiteisassignedaweight.LetSbethetotalofallsiteweightsChoosetwovaluesreadquorumQrandwritequorumQwSuchthatQr+Qw>Sand2*Qw>SQuorumscanbechosen(andScomputed)separatelyforeachitemEachreadmustlockenoughreplicasthatthesumofthesiteweightsis>=QrEachwritemustlockenoughreplicasthatthesumofthesiteweightsis>=QwFornowweassumeallreplicasarewrittenExtensionstoallowsomesitestobeunavailabledescribedlater TimestampingTimestampbasedconcurrency-controlprotocolscanbeusedindistributedsystemsEachtransactionmustbegivenauniquetimestampMainproblem:howtogenerateatimestampinadistributedfashionEachsitegeneratesauniquelocaltimestampusingeitheralogicalcounterorthelocalclock.Globaluniquetimestampisobtainedbyconcatenatingtheuniquelocaltimestampwiththeuniqueidentifier. Timestamping(Cont.)AsitewithaslowclockwillassignsmallertimestampsStilllogicallycorrect:serializabilitynotaffectedBut:“disadvantages”transactionsTofixthisproblemDefinewithineachsiteSialogicalclock(LCi),whichgeneratestheuniquelocaltimestampRequirethatSiadvanceitslogicalclockwheneverarequestisreceivedfromatransactionTiwithtimestampandxisgreaterthatthecurrentvalueofLCi.Inthiscase,siteSiadvancesitslogicalclocktothevaluex+1. ReplicationwithWeakConsistencyManycommercialdatabasessupportreplicationofdatawithweakdegreesofconsistency(i.e.,withoutaguaranteeofserializabiliy)E.g.,master-slavereplication:updatesareperformedatasingle“master”site,andpropagatedto“slave”sites.Propagationisnotpartoftheupdatetransaction:itsisdecoupledMaybeimmediatelyaftertransactioncommitsMaybeperiodicDatamayonlybereadatslavesites,notupdatedNoneedtoobtainlocksatanyremotesiteParticularlyusefulfordistributinginformationE.g.,fromcentralofficetobranch-officeAlsousefulforrunningread-onlyqueriesofflinefromthemaindatabase ReplicationwithWeakConsistency(Cont.)Replicasshouldseeatransaction-consistentsnapshotofthedatabaseThatis,astateofthedatabasereflectingalleffectsofalltransactionsuptosomepointintheserializationorder,andnoeffectsofanylatertransactions.E.g.,OracleprovidesacreatesnapshotstatementtocreateasnapshotofarelationorasetofrelationsataremotesitesnapshotrefresheitherbyrecomputationorbyincrementalupdateAutomaticrefresh(continuousorperiodic)ormanualrefresh MultimasterandLazyReplicationWithmultimasterreplication(alsocalledupdate-anywherereplication)updatesarepermittedatanyreplica,andareautomaticallypropagatedtoallreplicasBasicmodelindistributeddatabases,wheretransactionsareunawareofthedetailsofreplication,anddatabasesystempropagatesupdatesaspartofthesametransactionCoupledwith2phasecommitManysystemssupportlazypropagationwhereupdatesaretransmittedaftertransactioncommitsAllowsupdatestooccurevenifsomesitesaredisconnectedfromthenetwork,butatthecostofconsistency DeadlockHandlingConsiderthefollowingtwotransactionsandhistory,withitemXandtransactionT1atsite1,anditemYandtransactionT2atsite2:T1:write(X)write(Y)T2:write(Y)write(X)X-lockonXwrite(X)X-lockonYwrite(Y)waitforX-lockonXWaitforX-lockonYResult:deadlockwhichcannotbedetectedlocallyateithersite CentralizedApproachAglobalwait-forgraphisconstructedandmaintainedinasinglesite;thedeadlock-detectioncoordinatorRealgraph:Real,butunknown,stateofthesystem.Constructedgraph:Approximationgeneratedbythecontrollerduringtheexecutionofitsalgorithm.theglobalwait-forgraphcanbeconstructedwhen:anewedgeisinsertedinorremovedfromoneofthelocalwait-forgraphs.anumberofchangeshaveoccurredinalocalwait-forgraph.thecoordinatorneedstoinvokecycle-detection.Ifthecoordinatorfindsacycle,itselectsavictimandnotifiesallsites.Thesitesrollbackthevictimtransaction. LocalandGlobalWait-ForGraphsLocalGlobal ExampleWait-ForGraphforFalseCyclesInitialstate: FalseCycles(Cont.)Supposethatstartingfromthestateshowninfigure.1.T2releasesresourcesatS1resultinginamessageremoveT1T2messagefromtheTransactionManageratsiteS1tothecoordinator)2.AndthenT2requestsaresourceheldbyT3atsiteS2resultinginamessageinsertT2T3fromS2tothecoordinatorSupposefurtherthattheinsertmessagereachesbeforethedeletemessagethiscanhappenduetonetworkdelaysThecoordinatorwouldthenfindafalsecycleT1T2T3T1Thefalsecycleaboveneverexistedinreality.Falsecyclescannotoccuriftwo-phaselockingisused. UnnecessaryRollbacksUnnecessaryrollbacksmayresultwhendeadlockhasindeedoccurredandavictimhasbeenpicked,andmeanwhileoneofthetransactionswasabortedforreasonsunrelatedtothedeadlock.Unnecessaryrollbackscanresultfromfalsecyclesintheglobalwait-forgraph;however,likelihoodoffalsecyclesislow. AvailabilityHighavailability:timeforwhichsystemisnotfullyusableshouldbeextremelylow(e.g.,99.99%availability)Robustness:abilityofsystemtofunctionspiteoffailuresofcomponentsFailuresaremorelikelyinlargedistributedsystemsToberobust,adistributedsystemmustDetectfailuresReconfigurethesystemsocomputationmaycontinueRecovery/reintegrationwhenasiteorlinkisrepairedFailuredetection:distinguishinglinkfailurefromsitefailureishard(partial)solution:havemultiplelinks,multiplelinkfailureislikelyasitefailure ReconfigurationReconfiguration:AbortalltransactionsthatwereactiveatafailedsiteMakingthemwaitcouldinterferewithothertransactionssincetheymayholdlocksonothersitesHowever,incaseonlysomereplicasofadataitemfailed,itmaybepossibletocontinuetransactionsthathadaccesseddataatafailedsite(moreonthislater)Ifreplicateddataitemswereatfailedsite,updatesystemcatalogtoremovethemfromthelistofreplicas.Thisshouldbereversedwhenfailedsiterecovers,butadditionalcareneedstobetakentobringvaluesuptodateIfafailedsitewasacentralserverforsomesubsystem,anelectionmustbeheldtodeterminethenewserverE.g.,nameserver,concurrencycoordinator,globaldeadlockdetector Reconfiguration(Cont.)Sincenetworkpartitionmaynotbedistinguishablefromsitefailure,thefollowingsituationsmustbeavoidedTwooremorecentralserverselectedindistinctpartitionsMorethanonepartitionupdatesareplicateddataitemUpdatesmustbeabletocontinueevenifsomesitesaredownSolution:majoritybasedapproachAlternativeof“readonewriteallavailable”istantalizingbutcausesproblems Majority-BasedApproachThemajorityprotocolfordistributedconcurrencycontrolcanbemodifiedtoworkevenifsomesitesareunavailableEachreplicaofeachitemhasaversionnumberwhichisupdatedwhenthereplicaisupdated,asoutlinedbelowAlockrequestissenttoatleast½thesitesatwhichitemreplicasarestoredandoperationcontinuesonlywhenalockisobtainedonamajorityofthesitesReadoperationslookatallreplicaslocked,andreadthevaluefromthereplicawithlargestversionnumberMaywritethisvalueandversionnumberbacktoreplicaswithlowerversionnumbers(noneedtoobtainlocksonallreplicasforthistask) Majority-BasedApproachMajorityprotocol(Cont.)Writeoperationsfindhighestversionnumberlikereads,andsetnewversionnumbertooldhighestversion+1WritesarethenperformedonalllockedreplicasandversionnumberonthesereplicasissettonewversionnumberFailures(networkandsite)causenoproblemsaslongasSitesatcommitcontainamajorityofreplicasofanyupdateddataitemsDuringreadsamajorityofreplicasareavailabletofindversionnumbersSubjecttoabove,2phasecommitcanbeusedtoupdatereplicasNote:readsareguaranteedtoseelatestversionofdataitemReintegrationistrivial:nothingneedstobedoneQuorumconsensusalgorithmcanbesimilarlyextended ReadOneWriteAll(Available)BiasedprotocolisaspecialcaseofquorumconsensusAllowsreadstoreadanyonereplicabutupdatesrequireallreplicastobeavailableatcommittime(calledreadonewriteall)Readonewriteallavailable(ignoringfailedsites)isattractive,butincorrectIffailedlinkmaycomebackup,withoutadisconnectedsiteeverbeingawarethatitwasdisconnectedThesitethenhasoldvalues,andareadfromthatsitewouldreturnanincorrectvalueIfsitewasawareoffailurereintegrationcouldhavebeenperformed,butnowaytoguaranteethisWithnetworkpartitioning,sitesineachpartitionmayupdatesameitemconcurrentlybelievingsitesinotherpartitionshaveallfailed SiteReintegrationWhenfailedsiterecovers,itmustcatchupwithallupdatesthatitmissedwhileitwasdownProblem:updatesmaybehappeningtoitemswhosereplicaisstoredatthesitewhilethesiteisrecoveringSolution1:haltallupdatesonsystemwhilereintegratingasiteUnacceptabledisruptionSolution2:lockallreplicasofalldataitemsatthesite,updatetolatestversion,thenreleaselocksOthersolutionswithbetterconcurrencyalsoavailable ComparisonwithRemoteBackupRemotebackup(hotspare)systems(Section17.10)arealsodesignedtoprovidehighavailabilityRemotebackupsystemsaresimplerandhaveloweroverheadAllactionsperformedatasinglesite,andonlylogrecordsshippedNoneedfordistributedconcurrencycontrol,or2phasecommitUsingdistributeddatabaseswithreplicasofdataitemscanprovidehigheravailabilitybyhavingmultiple(>2)replicasandusingthemajorityprotocolAlsoavoidfailuredetectionandswitchovertimeassociatedwithremotebackupsystems CoordinatorSelectionBackupcoordinatorssitewhichmaintainsenoughinformationlocallytoassumetheroleofcoordinatoriftheactualcoordinatorfailsexecutesthesamealgorithmsandmaintainsthesameinternalstateinformationastheactualcoordinatorfailsexecutesstateinformationastheactualcoordinatorallowsfastrecoveryfromcoordinatorfailurebutinvolvesoverheadduringnormalprocessing.ElectionalgorithmsusedtoelectanewcoordinatorincaseoffailuresExample:BullyAlgorithm-applicabletosystemswhereeverysitecansendamessagetoeveryothersite. BullyAlgorithmIfsiteSisendsarequestthatisnotansweredbythecoordinatorwithinatimeintervalT,assumethatthecoordinatorhasfailedSitriestoelectitselfasthenewcoordinator.Sisendsanelectionmessagetoeverysitewithahigheridentificationnumber,SithenwaitsforanyoftheseprocessestoanswerwithinT.IfnoresponsewithinT,assumethatallsiteswithnumbergreaterthanihavefailed,Sielectsitselfthenewcoordinator.IfanswerisreceivedSibeginstimeintervalT’,waitingtoreceiveamessagethatasitewithahigheridentificationnumberhasbeenelected. BullyAlgorithm(Cont.)IfnomessageissentwithinT’,assumethesitewithahighernumberhasfailed;Sirestartsthealgorithm.Afterafailedsiterecovers,itimmediatelybeginsexecutionofthesamealgorithm.Iftherearenoactivesiteswithhighernumbers,therecoveredsiteforcesallprocesseswithlowernumberstoletitbecomethecoordinatorsite,evenifthereisacurrentlyactivecoordinatorwithalowernumber. DistributedQueryProcessingForcentralizedsystems,theprimarycriterionformeasuringthecostofaparticularstrategyisthenumberofdiskaccesses.Inadistributedsystem,otherissuesmustbetakenintoaccount:Thecostofadatatransmissionoverthenetwork.Thepotentialgaininperformancefromhavingseveralsitesprocesspartsofthequeryinparallel. QueryTransformationTranslatingalgebraicqueriesonfragments.ItmustbepossibletoconstructrelationrfromitsfragmentsReplacerelationrbytheexpressiontoconstructrelationrfromitsfragmentsConsiderthehorizontalfragmentationoftheaccountrelationintoaccount1=branch_name=“Hillside”(account)account2=branch_name=“Valleyview”(account)Thequerybranch_name=“Hillside”(account)becomesbranch_name=“Hillside”(account1account2)whichisoptimizedintobranch_name=“Hillside”(account1)branch_name=“Hillside”(account2) ExampleQuery(Cont.)Sinceaccount1hasonlytuplespertainingtotheHillsidebranch,wecaneliminatetheselectionoperation.Applythedefinitionofaccount2toobtainbranch_name=“Hillside”(branch_name=“Valleyview”(account)Thisexpressionistheemptysetregardlessofthecontentsoftheaccountrelation.FinalstrategyisfortheHillsidesitetoreturnaccount1astheresultofthequery. SimpleJoinProcessingConsiderthefollowingrelationalalgebraexpressioninwhichthethreerelationsareneitherreplicatednorfragmentedaccountdepositorbranchaccountisstoredatsiteS1depositoratS2branchatS3ForaqueryissuedatsiteSI,thesystemneedstoproducetheresultatsiteSI PossibleQueryProcessingStrategiesShipcopiesofallthreerelationstositeSIandchooseastrategyforprocessingtheentirelocallyatsiteSI.ShipacopyoftheaccountrelationtositeS2andcomputetemp1=accountdepositoratS2.Shiptemp1fromS2toS3,andcomputetemp2=temp1branchatS3.Shiptheresulttemp2toSI.Devisesimilarstrategies,exchangingtherolesS1,S2,S3Mustconsiderfollowingfactors:amountofdatabeingshippedcostoftransmittingadatablockbetweensitesrelativeprocessingspeedateachsite SemijoinStrategyLetr1bearelationwithschemaR1storesatsiteS1Letr2bearelationwithschemaR2storesatsiteS2Evaluatetheexpressionr1r2andobtaintheresultatS1.1.Computetemp1R1R2(r1)atS1.2.Shiptemp1fromS1toS2.3.Computetemp2r2temp1atS24.Shiptemp2fromS2toS1.5.Computer1temp2atS1.Thisisthesameasr1r2. FormalDefinitionThesemijoinofr1withr2,isdenotedby:r1r2itisdefinedby:R1(r1r2)Thus,r1r2selectsthosetuplesofr1thatcontributedtor1r2.Instep3above,temp2=r2r1.Forjoinsofseveralrelations,theabovestrategycanbeextendedtoaseriesofsemijoinsteps. JoinStrategiesthatExploitParallelismConsiderr1r2r3r4whererelationriisstoredatsiteSi.TheresultmustbepresentedatsiteS1.r1isshippedtoS2andr1r2iscomputedatS2:simultaneouslyr3isshippedtoS4andr3r4iscomputedatS4S2shipstuplesof(r1r2)toS1astheyproduced;S4shipstuplesof(r3r4)toS1Oncetuplesof(r1r2)and(r3r4)arriveatS1(r1r2)(r3r4)iscomputedinparallelwiththecomputationof(r1r2)atS2andthecomputationof(r3r4)atS4. HeterogeneousDistributedDatabasesManydatabaseapplicationsrequiredatafromavarietyofpreexistingdatabaseslocatedinaheterogeneouscollectionofhardwareandsoftwareplatformsDatamodelsmaydiffer(hierarchical,relational,etc.)TransactioncommitprotocolsmaybeincompatibleConcurrencycontrolmaybebasedondifferenttechniques(locking,timestamping,etc.)System-leveldetailsalmostcertainlyaretotallyincompatible.Amultidatabasesystemisasoftwarelayerontopofexistingdatabasesystems,whichisdesignedtomanipulateinformationinheterogeneousdatabasesCreatesanillusionoflogicaldatabaseintegrationwithoutanyphysicaldatabaseintegration AdvantagesPreservationofinvestmentinexistinghardwaresystemsoftwareApplicationsLocalautonomyandadministrativecontrolAllowsuseofspecial-purposeDBMSsSteptowardsaunifiedhomogeneousDBMSFullintegrationintoahomogeneousDBMSfacesTechnicaldifficultiesandcostofconversionOrganizational/politicaldifficultiesOrganizationsdonotwanttogiveupcontrolontheirdataLocaldatabaseswishtoretainagreatdealofautonomy UnifiedViewofDataAgreementonacommondatamodelTypicallytherelationalmodelAgreementonacommonconceptualschemaDifferentnamesforsamerelation/attributeSamerelation/attributenamemeansdifferentthingsAgreementonasinglerepresentationofshareddataE.g.,datatypes,precision,CharactersetsASCIIvsEBCDICSortordervariationsAgreementonunitsofmeasureVariationsinnamesE.g.,KölnvsCologne,MumbaivsBombay QueryProcessingSeveralissuesinqueryprocessinginaheterogeneousdatabaseSchematranslationWriteawrapperforeachdatasourcetotranslatedatatoaglobalschemaWrappersmustalsotranslateupdatesonglobalschematoupdatesonlocalschemaLimitedquerycapabilitiesSomedatasourcesallowonlyrestrictedformsofselectionsE.g.,webforms,flatfiledatasourcesQuerieshavetobebrokenupandprocessedpartlyatthesourceandpartlyatadifferentsiteRemovalofduplicateinformationwhensiteshaveoverlappinginformationDecidewhichsitestoexecutequeryGlobalqueryoptimization MediatorSystemsMediatorsystemsaresystemsthatintegratemultipleheterogeneousdatasourcesbyprovidinganintegratedglobalview,andprovidingqueryfacilitiesonglobalviewUnlikefullfledgedmultidatabasesystems,mediatorsgenerallydonotbotherabouttransactionprocessingButthetermsmediatorandmultidatabasearesometimesusedinterchangeablyThetermvirtualdatabaseisalsousedtorefertomediator/multidatabasesystems DirectorySystemsTypicalkindsofdirectoryinformationEmployeeinformationsuchasname,id,email,phone,officeaddr,..Evenpersonalinformationtobeaccessedfrommultipleplacese.g.,WebbrowserbookmarksWhitepagesEntriesorganizedbynameoridentifierMeantforforwardlookuptofindmoreaboutanentryYellowpagesEntriesorganizedbypropertiesForreverselookuptofindentriesmatchingspecificrequirementsWhendirectoriesaretobeaccessedacrossanorganizationAlternative1:Webinterface.NotgreatforprogramsAlternative2:SpecializeddirectoryaccessprotocolsCoupledwithspecializeduserinterfaces DirectoryAccessProtocolsMostcommonlyuseddirectoryaccessprotocol:LDAP(LightweightDirectoryAccessProtocol)SimplifiedfromearlierX.500protocolQuestion:WhynotusedatabaseprotocolslikeODBC/JDBC?Answer:Simplifiedprotocolsforalimitedtypeofdataaccess,evolvedparalleltoODBC/JDBCProvideanicehierarchicalnamingmechanismsimilartofilesystemdirectoriesDatacanbepartitionedamongstmultipleserversfordifferentpartsofthehierarchy,yetgiveasingleviewtouserE.g.,differentserversforBellLabsMurrayHillandBellLabsBangaloreDirectoriesmayusedatabasesasstoragemechanism LDAP:LightweightDirectoryAccessProtocolLDAPDataModelDataManipulationDistributedDirectoryTrees LDAPDataModelLDAPdirectoriesstoreentriesEntriesaresimilartoobjectsEachentrymusthaveuniquedistinguishedname(DN)DNmadeupofasequenceofrelativedistinguishednames(RDNs)E.g.,ofaDNcn=Silberschatz,ou-BellLabs,o=Lucent,c=USAStandardRDNs(canbespecifiedaspartofschema)cn:commonnameou:organizationalunito:organizationc:countrySimilartopathsinafilesystembutwritteninreversedirection LDAPDataModel(Cont.)EntriescanhaveattributesAttributesaremulti-valuedbydefaultLDAPhasseveralbuilt-intypesBinary,string,timetypesTel:telephonenumberPostalAddress:postaladdressLDAPallowsdefinitionofobjectclassesObjectclassesspecifyattributenamesandtypesCanuseinheritancetodefineobjectclassesEntrycanbespecifiedtobeofoneormoreobjectclassesNoneedtohavesinglemost-specifictype LDAPDataModel(cont.)EntriesorganizedintoadirectoryinformationtreeaccordingtotheirDNsLeaflevelusuallyrepresentspecificobjectsInternalnodeentriesrepresentobjectssuchasorganizationalunits,organizationsorcountriesChildrenofanodeinherittheDNoftheparent,andaddonRDNsE.g.,internalnodewithDNc=USAChildrennodeshaveDNstartingwithc=USAandfurtherRDNssuchasoorouDNofanentrycanbegeneratedbytraversingpathfromrootLeaflevelcanbeanaliaspointingtoanotherentryEntriescanthushavemorethanoneDNE.g.,personinmorethanoneorganizationalunit LDAPDataManipulationUnlikeSQL,LDAPdoesnotdefineDDLorDMLInstead,itdefinesanetworkprotocolforDDLandDMLUsersuseanAPIorvendorspecificfrontendsLDAPalsodefinesafileformatLDAPDataInterchangeFormat(LDIF)Queryingmechanismisverysimple:onlyselection&projection LDAPQueriesLDAPquerymustspecifyBase:anodeintheDITfromwheresearchistostartAsearchconditionBooleancombinationofconditionsonattributesofentriesEquality,wild-cardsandapproximateequalitysupportedAscopeJustthebase,thebaseanditschildren,ortheentiresubtreefromthebaseAttributestobereturnedLimitsonnumberofresultsandonresourceconsumptionMayalsospecifywhethertoautomaticallydereferencealiasesLDAPURLsareonewayofspecifyingqueryLDAPAPIisanotheralternative LDAPURLsFirstpartofURLspecifisserverandDNofbaseldap:://aura.research.bell-labs.com/o=Lucent,c=USAOptionalfurtherpartsseparatedby?symbolldap:://aura.research.bell-labs.com/o=Lucent,c=USA??sub?cn=KorthOptionalpartsspecifyattributestoreturn(emptymeansall)Scope(subindicatesentiresubtree)Searchcondition(cn=Korth) CCodeusingLDAPAPI#include #include main(){ LDAP*ld;LDAPMessage*res,*entry; char*dn,*attr,*attrList[]={“telephoneNumber”,NULL};BerElement*ptr;intvals,i; //Openaconnectiontoserverld=ldap_open(“aura.research.bell-labs.com”,LDAP_PORT);ldap_simple_bind(ld,“avi”,“avi-passwd”);…actualquery(nextslide)…ldap_unbind(ld); } CCodeusingLDAPAPI(Cont.)ldap_search_s(ld,“o=Lucent,c=USA”,LDAP_SCOPE_SUBTREE, “cn=Korth”,attrList,/*attrsonly*/0,&res); /*attrsonly=1=>returnonlyschemanotactualresults*/printf(“found%dentries”,ldap_count_entries(ld,res)); for(entry=ldap_first_entry(ld,res);entry!=NULL;entry=ldap_next_entry(id,entry)){dn=ldap_get_dn(ld,entry);printf(“dn:%s”,dn);/*dn:DNofmatchingentry*/ldap_memfree(dn);for(attr=ldap_first_attribute(ld,entry,&ptr);attr!=NULL;attr=ldap_next_attribute(ld,entry,ptr)) {//foreachattributeprintf(“%s:”,attr);//printnameofattributevals=ldap_get_values(ld,entry,attr); for(i=0;vals[i]!=NULL;i++)printf(“%s”,vals[i]);//sinceattrscanbemultivaluedldap_value_free(vals); } }ldap_msgfree(res); LDAPAPI(Cont.)LDAPAPIalsohasfunctionstocreate,updateanddeleteentriesEachfunctioncallbehavesasaseparatetransactionLDAPdoesnotsupportatomicityofupdates DistributedDirectoryTreesOrganizationalinformationmaybesplitintomultipledirectoryinformationtreesSuffixofaDITgivesRDNtobetaggedontotoallentriestogetanoverallDNE.g.,twoDITs,onewithsuffixo=Lucent,c=USA andanotherwithsuffixo=Lucent,c=IndiaOrganizationsoftensplitupDITsbasedongeographicallocationorbyorganizationalstructureManyLDAPimplementationssupportreplication(master-slaveormulti-masterreplication)ofDITs(notpartofLDAP3standard)AnodeinaDITmaybeareferraltoanodeinanotherDITE.g.,Ou=BellLabsmayhaveaseparateDIT,andDITforo=Lucentmayhavealeafwithou=BellLabscontainingareferraltotheBellLabsDITReferallsarethekeytointegratingadistributedcollectionofdirectoriesWhenaservergetsaqueryreachingareferralnode,itmayeitherForwardquerytoreferredDITandreturnanswertoclient,orGivereferralbacktoclient,whichtransparentlysendsquerytoreferredDIT(withoutuserintervention) EndofChapter ThreePhaseCommit(3PC)Assumptions:NonetworkpartitioningAtanypoint,atleastonesitemustbeup.AtmostKsites(participantsaswellascoordinator)canfailPhase1:ObtainingPreliminaryDecision:Identicalto2PCPhase1.EverysiteisreadytocommitifinstructedtodosoPhase2of2PCissplitinto2phases,Phase2andPhase3of3PCInphase2coordinatormakesadecisionasin2PC(calledthepre-commitdecision)andrecordsitinmultiple(atleastK)sitesInphase3,coordinatorsendscommit/abortmessagetoallparticipatingsites,Under3PC,knowledgeofpre-commitdecisioncanbeusedtocommitdespitecoordinatorfailureAvoidsblockingproblemaslongas