Mining Heterogeneous Information Networks

ACM SIGKDD Conference Tutorial, Washington, D.C., July 25, 2010
Jiawei Han, Yizhou Sun, Xifeng Yan, Philip S. Yu
University of Illinois at Urbana-Champaign, University of California at Santa Barbara, University of Illinois at Chicago
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), Microsoft, IBM, Yahoo!, Google, HP Lab & Boeing
July 12, 2010

Outline
Motivation: Why Mining Heterogeneous Information Networks?
Part I: Clustering, Ranking and Classification
  Clustering and Ranking in Information Networks
  Classification of Information Networks
Part II: Data Quality and Search in Information Networks
  Data Cleaning and Data Validation by InfoNet Analysis
  Similarity Search in Information Networks
Part III: Advanced Topics on Information Network Analysis
  Role Discovery and OLAP in Information Networks
  Mining Evolution and Dynamics of Information Networks
Conclusions

What Are Information Networks?
Information network: A network where each node represents an entity (e.g., actor in a social network) and each link (e.g., tie) a relationship between entities
  Each node/link may have attributes, labels, and weights
  Link may carry rich semantic information
Homogeneous vs. heterogeneous networks
  Homogeneous networks
    Single object type and single link type
    Single-model social networks (e.g., friends)
    WWW: a collection of linked Web pages
  Heterogeneous, multi-typed networks
    Multiple object and link types
    Medical network: patients, doctors, disease, contacts, treatments
    Bibliographic network: publications, authors, venues

Ubiquitous Information Networks
Graphs and substructures
  Chemical compounds, computer vision objects, circuits, XML
Biological networks
Bibliographic networks: DBLP, ArXiv, PubMed, ...
Social networks: Facebook > 100 million active users
World Wide Web (WWW): > 3 billion nodes, > 50 billion arcs
Cyber-physical networks
[Figures: An Internet Web; Yeast protein interaction network; Co-author network; Social network sites]

Homogeneous vs. Heterogeneous Networks
[Figures: Co-author Network; Conference-Author Network]

DBLP: An Interesting and Familiar Network
DBLP: A computer science publication bibliographic database
  1.4M records (papers), 0.7M authors, 5K conferences, ...
Will this database disclose interesting knowledge about computer science research?
  What are the popular research fields/subfields in CS?
  Who are the leading researchers on DB or XQueries?
  How do the authors in this subfield collaborate and evolve?
  How many Wei Wangs are in DBLP, and which paper was written by which?
  Who is Sergey Brin's supervisor and when?
  Who are very similar to Christos Faloutsos?
All these kinds of questions, and potentially many more, can be nicely answered by the DBLP InfoNet
  How? Exploring the power of links in information networks!

Homo. vs. Hetero.: Differences in DB InfoNet Mining
Homogeneous networks can often be derived from their original heterogeneous networks
  Co-author networks can be derived from author-paper-conference networks by projection on authors only
  Paper citation networks can be derived from a complete bibliographic network with papers and citations projected
Heterogeneous DB InfoNet carries richer information than its corresponding projected homogeneous networks
Typed heterogeneous InfoNet vs. non-typed hetero. InfoNet (i.e., not distinguishing different types of nodes)
  Typed nodes and links imply a more structured InfoNet, and thus often lead to more informative discovery
Our emphasis: Mining "structured" information networks!

Why Mining Heterogeneous Information Networks?
Most datasets can be "organized" or "transformed" into a "structured" heterogeneous information network!
  Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, ...
  Structures can be progressively extracted from less organized datasets by information network analysis
  Information-rich, inter-related, organized datasets form one or a set of gigantic, interconnected, multi-typed heterogeneous information networks
Surprisingly rich knowledge can be derived from such structured heterogeneous information networks
Our goal: Uncover knowledge hidden from "organized" data
  Exploring the power of multi-typed, heterogeneous links
  Mining "structured" heterogeneous information networks!

Clustering and Ranking in Information Networks
Integrated Clustering and Ranking of Heterogeneous Information Networks
Clustering of Homogeneous Information Networks
  LinkClus: Clustering with link-based similarity measure
  SCAN: Density-based clustering of networks
  Others
    Spectral clustering
    Modularity-based clustering
    Probabilistic model-based clustering
User-Guided Clustering of Information Networks

Clustering and Ranking in Heterogeneous Information Networks
Ranking & clustering each provides a new view over a network
Ranking globally without considering clusters: dumb
  Ranking DB and Architecture confs. together?
Clustering authors in one huge cluster without distinction?
  Dull to view thousands of objects (this is why PageRank!)
RankClus: Integrates clustering with ranking
  Conditional ranking relative to clusters
  Uses highly ranked objects to improve clusters
  Qualities of clustering and ranking are mutually enhanced
Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09.

Global Ranking vs. Cluster-Based Ranking
A Toy Example: Two areas with 10 conferences and 100 authors in each area

RankClus: A New Framework
[Figure: the RankClus framework, alternating between sub-network ranking and clustering]

The RankClus Philosophy
Why integrated Ranking and Clustering?
Ranking and clustering can be mutually improved
  Ranking: Once a cluster becomes more accurate, ranking will be more reasonable for such a cluster and will be the distinguished feature of the cluster
  Clustering: Once the rankings of different clusters become more distinct from each other, the clusters can be adjusted to get more accurate results
Not every object should be treated equally in clustering!
Objects preserve similarity under new measure space
  E.g., VLDB vs. SIGMOD

RankClus: Algorithm Framework
Step 0. Initialization
  Randomly partition target objects into K clusters
Step 1. Ranking
  Ranking for each sub-network induced from each cluster, which serves as feature for each cluster
Step 2. Generating new measure space
  Estimate mixture model coefficients for each target object
Step 3. Adjusting cluster
Step 4. Repeating Steps 1-3 until stable

Focus on a Bi-Typed Network Case
Conference-author network, links can exist between
  Conference (X) and author (Y)
  Author (Y) and author (Y)
Use W to denote the links and their weights
  $$W = \begin{pmatrix} W_{XX} & W_{XY} \\ W_{YX} & W_{YY} \end{pmatrix}$$

Step 1: Ranking: Feature Extraction
Two ranking strategies: Simple ranking vs. authority ranking
Simple Ranking
  Proportional to degree counting for objects, e.g., # of publications of an author
  Considers only immediate neighborhood in the network
Authority Ranking: Extension to HITS in weighted bi-type network
  Rule 1: Highly ranked authors publish many papers in highly ranked conferences
  Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
  Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors

Encoding Rules in Authority Ranking
Rule 1: Highly ranked authors publish many papers in highly ranked conferences
Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors

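The three rules can be read as a mutual-reinforcement iteration over the link matrices. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `authority_ranking`, the mixing weight `alpha` between Rule 1 and Rule 3, and the toy matrices are all hypothetical.

```python
def authority_ranking(w_xy, w_yy, alpha=0.95, iters=50):
    """Iterate rank scores for conferences (X) and authors (Y).

    w_xy[i][j]: number of papers author j published in conference i.
    w_yy[j][l]: number of papers co-authored by authors j and l.
    alpha: hypothetical weight balancing Rule 1 against Rule 3.
    """
    n_x, n_y = len(w_xy), len(w_xy[0])
    r_x = [1.0 / n_x] * n_x
    r_y = [1.0 / n_y] * n_y

    def normalize(v):
        s = sum(v)
        return [a / s for a in v] if s > 0 else v

    for _ in range(iters):
        # Rule 2: a conference is ranked by the authors publishing in it.
        r_x = normalize([sum(w_xy[i][j] * r_y[j] for j in range(n_y))
                         for i in range(n_x)])
        # Rule 1 (conference term) plus Rule 3 (co-author term).
        r_y = normalize([
            alpha * sum(w_xy[i][j] * r_x[i] for i in range(n_x))
            + (1 - alpha) * sum(w_yy[j][l] * r_y[l] for l in range(n_y))
            for j in range(n_y)])
    return r_x, r_y


# Toy network: 2 conferences, 3 authors (values are illustrative only).
w_xy = [[3, 1, 0],
        [0, 1, 1]]
w_yy = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
conf_rank, author_rank = authority_ranking(w_xy, w_yy)
```

Both score vectors are normalized to sum to 1 at every step, so they can be read as ranking distributions.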
Example: Authority Ranking in the 2-Area Conference-Author Network
The rankings of authors are quite distinct from each other in the two clusters

Step 2: Generate New Measure Space: A Mixture Model Method
Consider each target object's links as generated under a mixture distribution of ranking from each cluster
  Consider ranking as a distribution: r(Y) → p(Y)
  Each target object x_i is mapped into a K-vector (π_{i,k})
  Parameters are estimated using the EM algorithm
    Maximize the log-likelihood given all the observations of links

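To make the mapping into a K-vector concrete, here is a simplified per-object EM sketch. It assumes that each link of one target object is drawn from a mixture of the K cluster ranking distributions; this is an illustrative simplification and not necessarily the exact estimator used in RankClus. The name `mixture_coefficients` and the toy numbers are hypothetical.

```python
def mixture_coefficients(links, rank_dists, iters=100):
    """Estimate the K mixture coefficients of one target object.

    links[j]: weight of the object's link to attribute object y_j.
    rank_dists[k][j]: p_k(y_j), ranking distribution of cluster k.
    """
    k_num = len(rank_dists)
    pi = [1.0 / k_num] * k_num
    for _ in range(iters):
        new_pi = [0.0] * k_num
        for j, w in enumerate(links):
            if w == 0:
                continue
            # E-step: responsibility of each cluster for link j.
            joint = [pi[k] * rank_dists[k][j] for k in range(k_num)]
            z = sum(joint)
            if z == 0:
                continue
            for k in range(k_num):
                new_pi[k] += w * joint[k] / z
        # M-step: renormalize the expected link mass per cluster.
        s = sum(new_pi)
        if s > 0:
            pi = [v / s for v in new_pi]
    return pi


# Two clusters over four authors; the conference links mostly to authors 0 and 1.
rank_dists = [[0.6, 0.3, 0.05, 0.05],
              [0.05, 0.05, 0.3, 0.6]]
links = [5, 2, 0, 1]
pi = mixture_coefficients(links, rank_dists)
```

The resulting vector `pi` is the object's position in the new K-dimensional measure space.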
Example: 2-D Coefficients in the 2-Area Conference-Author Network
The conferences are well separated in the new measure space
[Figure: Scatter plots of two conferences and component coefficients]

Step 3: Cluster Adjustment in New Measure Space
Cluster center in new measure space
  Vector mean of objects in the cluster (K-dimensional)
Cluster adjustment
  Distance measure: 1 - Cosine similarity
  Assign to the cluster with the nearest center
Why Better Ranking Function Derives Better Clustering?
  Consider the measure space generation process
    Highly ranked objects in a cluster play a more important role in deciding a target object's new measure
  Intuitively, if we can find the highly ranked objects in a cluster, equivalently, we get the right cluster

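The adjustment step can be sketched as follows: compute each cluster center as the vector mean of its members, then reassign every object to the center with the smallest 1 - cosine distance. The function names and the toy vectors are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity (1.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def adjust_clusters(vectors, labels, k_num):
    """Recompute centers from current labels and reassign every object."""
    dim = len(vectors[0])
    centers = []
    for k in range(k_num):
        members = [v for v, lab in zip(vectors, labels) if lab == k]
        if members:
            centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            centers.append([0.0] * dim)  # empty cluster: placeholder center
    return [min(range(k_num), key=lambda k: cosine_distance(v, centers[k]))
            for v in vectors]


# Four objects in a 2-D measure space; the last one starts in the wrong cluster.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
new_labels = adjust_clusters(vectors, labels=[0, 0, 1, 0], k_num=2)
```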
Step-by-Step Running Case Illustration
[Figure: ranking distributions and cluster scatter plots over successive iterations, with captions:]
  Initially, ranking distributions are mixed together
  Two clusters of objects mixed together, but preserve similarity somehow
  Improved a little
  Two clusters are almost well separated
  Improved significantly
  Well separated
  Stable

Time Complexity: Linear to # of Links
At each iteration, |E|: edges in network, m: number of target objects, K: number of clusters
  Ranking for sparse network: ~O(|E|)
  Mixture model estimation: ~O(K|E| + mK)
  Cluster adjustment: ~O(mK^2)
In all, linear to |E|: ~O(K|E|)
Note: SimRank will be at least quadratic at each iteration since it evaluates distance between every pair in the network

Case Study: Dataset: DBLP
All the 2676 conferences and 20,000 authors with most publications, from the time period of year 1998 to year 2007
Both conference-author relationships and co-author relationships are used
K = 15 (select only 5 clusters here)

NetClus: Ranking & Clustering with Star Network Schema
Beyond bi-typed information network: A Star Network Schema
Split a network into different layers, each represented by a net-cluster

StarNet: Schema & Net-Cluster
Star Network Schema
  Center type: Target type
    E.g., a paper, a movie, a tagging event
    A center object is a co-occurrence of a bag of different types of objects, which stands for a multi-relation among different types of objects
  Surrounding types: Attribute (property) types
Net-Cluster
  Given an information network G, a net-cluster C contains two pieces of information:
    Node set and link set as a sub-network of G
    Membership indicator for each node x: P(x in C)
  Given an information network G and cluster number K, a clustering for G is a set of net-clusters such that, for each node x, the sum of x's probability distribution over all K net-clusters is 1

StarNet of DBLP
[Figure: star schema with center type Research Paper; surrounding types Venue (Publish), Author (Write), Term (Contain)]

StarNet of Delicious.com
[Figure: star schema with center type Tagging Event; surrounding types Web Site, User, Tag (Contain)]

StarNet for IMDB
[Figure: star schema with center type Movie; surrounding types Actor/Actress (Star in), Director (Direct), Title/Plot (Contain)]

Ranking Functions
Ranking an object x of type T_x in a network G, denoted as p(x | T_x, G)
  Give a score to each object, according to its importance
Different rules define different ranking functions:
  Simple Ranking
    Ranking score is assigned according to the degree of an object
  Authority Ranking
    Ranking score is assigned according to the mutual enhancement by propagations of score through links
    "Highly ranked conferences accept many good papers published by many highly ranked authors, and highly ranked authors publish many good papers in highly ranked conferences"

Ranking Function (Cont.)
Priors can be added:
  $$P_P(X \mid T_X, G_k) = (1 - \lambda_P)\, P(X \mid T_X, G_k) + \lambda_P\, P_0(X \mid T_X, G_k)$$
  P_0(X | T_X, G_k) is the prior knowledge, usually given as a distribution, denoted by only several words
  λ_P is the parameter expressing how much we believe in the prior distribution
Ranking distribution
  Normalize ranking scores to 1, giving them a probabilistic meaning
  Similar to the idea of PageRank

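As a small worked example of the prior-smoothing formula, the sketch below blends a ranking distribution with a prior given over only a few words. The function name `rank_with_prior`, the weight value 0.2, and the term scores are hypothetical.

```python
def rank_with_prior(ranking, prior, lam):
    """Blend a ranking distribution with a prior: (1 - lam) * P + lam * P0."""
    objects = set(ranking) | set(prior)
    return {x: (1 - lam) * ranking.get(x, 0.0) + lam * prior.get(x, 0.0)
            for x in objects}


# Term ranking within one net-cluster, and a prior naming only two words.
ranking = {"database": 0.5, "query": 0.3, "learning": 0.2}
prior = {"database": 0.5, "query": 0.5}
smoothed = rank_with_prior(ranking, prior, lam=0.2)
```

Because both inputs sum to 1, the blended scores also sum to 1 and remain a ranking distribution.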
NetClus: Algorithm Framework
Map each target object into a new low-dimensional feature space according to current net-clustering, and adjust the clustering further in the new measure space
  Step 0: Generate initial random clusters
  Step 1: Generate ranking-based generative model for target objects for each net-cluster
  Step 2: Calculate posterior probabilities for target objects, which serve as the new measure, and assign target objects to the nearest cluster accordingly
  Step 3: Repeat steps 1 and 2, until clusters do not change significantly
  Step 4: Calculate posterior probabilities for attribute objects in each net-cluster

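Generative Model for Target Objects Given a Net-cluster
Each target object stands for a co-occurrence of a bag of attribute objects
Define the probability of a target object <=> define the probability of the co-occurrence of all the associated attribute objects
Generative probability P(d | G_k) for target object d in cluster C_k:
  $$P(d \mid G_k) = \prod_{x \in N_G(d)} P(x \mid T_x, G_k)^{W_{d,x}}\, P(T_x \mid G_k)^{W_{d,x}}$$
  where N_G(d) is the set of attribute objects linked to d, W_{d,x} is the link weight, P(x | T_x, G_k) is the ranking function, and P(T_x | G_k) is the type probability
Two assumptions of independence
  The probabilities to visit objects of different types are independent of each other
  The probabilities to visit two objects within the same type are independent of each other
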
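Under the two independence assumptions, the probability of a target object factorizes over its linked attribute objects, so it can be computed as a sum of logs. The sketch below assumes that product form and assumes all ranking scores are smoothed to be nonzero; the data layout, function names, and numbers are hypothetical.

```python
import math


def log_generative_prob(target, rank, type_prob):
    """log P(d | G_k) for one target object.

    target: list of (type, attribute object, link weight) triples.
    rank[t][x]: P(x | T_x = t, G_k); type_prob[t]: P(T_x = t | G_k).
    """
    return sum(w * (math.log(rank[t][x]) + math.log(type_prob[t]))
               for t, x, w in target)


def posterior(target, clusters, priors):
    """P(k | d) proportional to P(d | G_k) * P(k), computed stably in log space."""
    logs = [math.log(p) + log_generative_prob(target, c["rank"], c["type_prob"])
            for c, p in zip(clusters, priors)]
    m = max(logs)
    exps = [math.exp(v - m) for v in logs]
    s = sum(exps)
    return [e / s for e in exps]


clusters = [
    {"rank": {"venue": {"SIGMOD": 0.6, "KDD": 0.1},
              "term": {"query": 0.5, "mining": 0.1}},
     "type_prob": {"venue": 0.3, "term": 0.7}},
    {"rank": {"venue": {"SIGMOD": 0.1, "KDD": 0.6},
              "term": {"query": 0.1, "mining": 0.5}},
     "type_prob": {"venue": 0.3, "term": 0.7}},
]
paper = [("venue", "SIGMOD", 1), ("term", "query", 2)]
post = posterior(paper, clusters, priors=[0.5, 0.5])
```

The posterior vector is the K-dimensional measure that the cluster adjustment step operates on.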
Cluster Adjustment
Using posterior probabilities of target objects as new feature space
  Each target object => K-dimension vector
  Each net-cluster center => K-dimension vector
    Average on the objects in the cluster
Assign each target object to the nearest cluster center (e.g., cosine similarity)
A sub-network corresponding to a new net-cluster is then built by extracting all the target objects in that cluster and all linked attribute objects

Experiments: DBLP and Beyond
Data Set:
  DBLP "all-area" data set
    All conferences + "Top" 50K authors
  DBLP "four-area" data set
    20 conferences from DB, DM, ML, IR
    All authors from these conferences
    All papers published in these conferences
Running case illustration

Accuracy Study: Experiments
Accuracy, compared with PLSA, a pure text model: no other types of objects and links are used; uses the same prior as in NetClus
  [Table: Accuracy of Paper Clustering Results]
Accuracy, compared with RankClus, a bi-typed clustering method on only one type
  [Table: Accuracy of Conference Clustering Results]

NetClus: Distinguishing Conferences
Conference   Cluster 1     Cluster 2     Cluster 3     Cluster 4     Cluster 5
AAAI         0.0022667     0.00899168    0.934024      0.0300042     0.0247133
CIKM         0.150053      0.310172      0.00723807    0.444524      0.0880127
CVPR         0.000163812   0.00763072    0.931496      0.0281342     0.032575
ECIR         3.47023e-05   0.00712695    0.00657402    0.978391      0.00787288
ECML         0.00077477    0.110922      0.814362      0.0579426     0.015999
EDBT         0.573362      0.316033      0.00101442    0.0245591     0.0850319
ICDE         0.529522      0.376542      0.00239152    0.0151113     0.0764334
ICDM         0.000455028   0.778452      0.0566457     0.113184      0.0512633
ICML         0.000309624   0.050078      0.878757      0.0622335     0.00862134
IJCAI        0.00329816    0.0046758     0.94288       0.0303745     0.0187718
KDD          0.00574223    0.797633      0.0617351     0.067681      0.0672086
PAKDD        0.00111246    0.813473      0.0403105     0.0574755     0.0876289
PKDD         5.39434e-05   0.760374      0.119608      0.052926      0.0670379
PODS         0.78935       0.113751      0.013939      0.00277417    0.0801858
SDM          0.000172953   0.841087      0.058316      0.0527081     0.0477156
SIGIR        0.00600399    0.00280013    0.00275237    0.977783      0.0106604
SIGMOD       0.689348      0.223122      0.0017703     0.00825455    0.0775055
VLDB         0.701899      0.207428      0.00100012    0.0116966     0.0779764
WSDM         0.00751654    0.269259      0.0260291     0.683646      0.0135497
WWW          0.0771186     0.270635      0.029307      0.451857      0.171082

NetClus: Database System Cluster
Terms: database 0.0995511, databases 0.0708818, system 0.0678563, data 0.0214893, query 0.0133316, systems 0.0110413, queries 0.0090603, management 0.00850744, object 0.00837766, relational 0.0081175, processing 0.00745875, based 0.00736599, distributed 0.0068367, xml 0.00664958, oriented 0.00589557, design 0.00527672, web 0.00509167, information 0.0050518, model 0.00499396, efficient 0.00465707
Conferences: VLDB 0.318495, SIGMOD Conf. 0.313903, ICDE 0.188746, PODS 0.107943, EDBT 0.0436849
Authors: Surajit Chaudhuri 0.00678065, Michael Stonebraker 0.00616469, Michael J. Carey 0.00545769, C. Mohan 0.00528346, David J. DeWitt 0.00491615, Hector Garcia-Molina 0.00453497, H. V. Jagadish 0.00434289, David B. Lomet 0.00397865, Raghu Ramakrishnan 0.0039278, Philip A. Bernstein 0.00376314, Joseph M. Hellerstein 0.00372064, Jeffrey F. Naughton 0.00363698, Yannis E. Ioannidis 0.00359853, Jennifer Widom 0.00351929, Per-Ake Larson 0.00334911, Rakesh Agrawal 0.00328274, Dan Suciu 0.00309047, Michael J. Franklin 0.00304099, Umeshwar Dayal 0.00290143, Abraham Silberschatz 0.00278185

[Figure: Ranking authors in XML]

NetClus: StarNet-Based Ranking and Clustering
A general framework in which ranking and clustering are successfully combined to analyze InfoNets
Ranking and clustering can mutually reinforce each other in information network analysis
NetClus, an extension to RankClus that integrates ranking and clustering and generates net-clusters in a star network with arbitrary number of types
Net-cluster: heterogeneous information sub-networks comprised of multiple types of objects
Go well beyond DBLP, and structured relational DBs
  Flickr: query "Raleigh" derives multiple clusters

iNextCube: Information-Network-Enhanced Text Cube (VLDB'09 Demo)
Demo: iNextCube.cs.uiuc.edu
[Figure: Architecture of iNextCube]
[Figure: Dimension hierarchies generated by NetClus. Net-cluster hierarchy: All; DB and IS, Theory, Architecture; DB, DM, IR; XML, Distributed DB]
[Figure text, truncated in the source: Author/conferen... term ranking for research area. Th... research areas ca... at different level...]

Link-Based Clustering: Why Useful?
[Figure: a three-layer network. Authors: Tom, Mike, Cathy, John, Mary. Proceedings: sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05, aaai04, aaai05. Conferences: sigmod, vldb, aaai]
Questions:
  Q1: How to cluster each type of objects?
  Q2: How to define similarity between each type of objects?

SimRank: Link-Based Similarities
Two objects are similar if linked with the same or similar objects
Jeh & Widom, 2002: SimRank
  Similarity between two objects a and b, S(a, b) = the average similarity between objects linked with a and those with b:
  $$S(a, b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} S(I_i(a), I_j(b))$$
  where I(v) is the set of in-neighbors of the vertex v and C is a decay constant between 0 and 1.
But: It is expensive to compute:
  For a dataset of N objects and M links, it takes O(N^2) space and O(M^2) time to compute all similarities.
[Figure: example with authors Tom, Mike, Cathy, John, Mary; proceedings sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05; conferences sigmod, vldb]

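A naive SimRank iteration makes the quadratic cost visible: every pair of objects is updated from every pair of their neighbors. This is a minimal sketch; the function name, the decay value 0.8, and the tiny author-proceedings graph (links treated as bidirectional) are assumptions for illustration.

```python
def simrank(in_neighbors, decay=0.8, iters=10):
    """Naive SimRank over all node pairs.

    in_neighbors[v]: list of in-neighbors of v (every node must be a key).
    """
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iters):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[a][b] = 0.0
                    continue
                # Average similarity over all pairs of in-neighbors.
                total = sum(sim[u][v] for u in ia for v in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


graph = {
    "Tom": ["sigmod03"], "Mary": ["sigmod03"], "Mike": ["vldb03"],
    "sigmod03": ["Tom", "Mary"], "vldb03": ["Mike"],
}
sim = simrank(graph)
```

Tom and Mary share their only proceedings, so their similarity equals the decay constant, while Tom and Mike have no linked objects in common at any distance and stay at zero.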
Observation 1: Hierarchical Structures
Hierarchical structures often exist naturally among objects (e.g., taxonomy of animals)
[Figure: A hierarchical structure of products in Walmart: All; grocery, electronics, apparel; TV, DVD, camera]
[Figure: Relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004); axes: Articles, Words]

Observation 2: Distribution of Similarity
[Figure: Distribution of SimRank similarities among DBLP authors; x-axis: similarity value, y-axis: portion of entries]
Power law distribution exists in similarities
  56% of similarity entries are in [0.005, 0.015]
  1.4% of similarity entries are larger than 0.1
Our goal: Design a data structure that stores the significant similarities and compresses insignificant ones

Our Data Structure: SimTree
[Figure: a SimTree with nodes n1-n9; edge labels mark the similarity between two sibling nodes n1 and n2 and the adjustment ratio for node n7]
Each leaf node represents an object
Each non-leaf node represents a group of similar lower-level nodes
Path-based node similarity
  sim_p(n7, n8) = s(n7, n4) x s(n4, n5) x s(n5, n8)
Similarity between two nodes is the average similarity between objects linked with them in other SimTrees
Adjustment ratio for x = (Average similarity between x and all other nodes) / (Average similarity between x's parent and all other nodes)

LinkClus: SimTree-Based Hierarchical Clustering
Initialize a SimTree for objects of each type
Repeat
  For each SimTree, update the similarities between its nodes using similarities in other SimTrees
    Similarity between two nodes a and b is the average similarity between objects linked with them
  Adjust the structure of each SimTree
    Assign each node to the parent node that it is most similar to

Initialization of SimTrees
Finding tight groups is reduced to frequent pattern mining
  The tightness of a group of nodes is the support of a frequent pattern
[Figure: nodes n1-n4 forming groups g1 and g2, reduced to transactions 1-9: {n1}, {n1, n2}, {n2}, {n1, n2}, {n1, n2}, {n2, n3, n4}, {n4}, {n3, n4}, {n3, n4}]
Initializing a tree:
  Start from leaf nodes (level 0)
  At each level l, find non-overlapping groups of similar nodes with frequent pattern mining

Complexity: LinkClus vs. SimRank
After initialization, iteratively (1) for each SimTree update the similarities between its nodes using similarities in other SimTrees, and (2) adjust the structure of each SimTree
Computational complexity: For two types of objects, N in each, and M linkages between them

Step                       Time              Space
Updating similarities      O(M (log N)^2)    O(M + N)
Adjusting tree structures  O(N)              O(N)

Method     Time              Space
LinkClus   O(M (log N)^2)    O(M + N)
SimRank    O(M^2)            O(N^2)

Performance Comparison: Experiment Setup
DBLP dataset: 4170 most productive authors, and 154 well-known conferences with most proceedings
  Manually labeled research areas of 400 most productive authors according to their home pages (or publications)
  Manually labeled areas of 154 conferences according to their call for papers
Approaches Compared:
  SimRank (Jeh & Widom, KDD 2002)
    Computing pair-wise similarities
  SimRank with FingerPrints (F-SimRank)
    Fogaras & Racz, WWW 2005
    Pre-computes a large sample of random paths from each object and uses samples of two objects to estimate SimRank similarity
  ReCom (Wang et al. SIGIR 2003)
    Iteratively clustering objects using cluster labels of linked objects

DBLP Data Set: Accuracy and Computation Time
[Figures: accuracy vs. #iteration for Authors and for Conferences, comparing LinkClus, SimRank, ReCom, F-SimRank]

Approach    Accuracy (author)   Accuracy (conf.)   Average time
LinkClus    0.957               0.723              76.7
SimRank     0.958               0.760              1020
ReCom       0.907               0.457              43.1
F-SimRank   0.908               0.583              83.6

Email Dataset: Accuracy and Time
F. Nielsen. Email dataset. http://www.imm.dtu.dk/rem/data/Email1431.zip
370 emails on conferences, 272 on jobs, and 789 spam emails
Why is LinkClus even better than SimRank in accuracy?
  Noise filtering due to frequent-pattern-based preprocessing

Approach    Accuracy   Total time (sec)
LinkClus    0.8026     1579.6
SimRank     0.7965     39160
ReCom       0.5711     74.6
F-SimRank   0.3688     479.7
CLARANS     0.4768     8.55

SCAN: Density-Based Network Clustering
Networks made up of the mutual relationships of data elements usually have an underlying structure. Clustering: A structure discovery problem
Given simply information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
Questions to be answered: How many clusters? What size should they be? What is the best partitioning? Should some points be segregated?
SCAN: An interesting density-based algorithm
X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", Proc. 2007 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD'07), San Jose, CA, Aug. 2007

Social Network and Its Clustering Problem
Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group.
Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups.
Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group.

Structure Similarity
Define Γ(v) as the immediate neighborhood of a vertex v.
The desired features tend to be captured by a measure σ(v, w), called Structural Similarity:
  $$\sigma(v, w) = \frac{|\Gamma(v) \cap \Gamma(w)|}{\sqrt{|\Gamma(v)|\,|\Gamma(w)|}}$$
Structural similarity is large for members of a clique and small for hubs and outliers.

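A short sketch of the structural similarity measure follows. It assumes, as in the SCAN paper, that the neighborhood of a vertex includes the vertex itself; the function name and the toy graph (a triangle with one pendant vertex) are hypothetical.

```python
import math


def structural_similarity(adj, v, w):
    """Shared neighborhood size, normalized by the geometric mean of sizes."""
    gamma_v = set(adj[v]) | {v}  # assumption: a vertex is in its own neighborhood
    gamma_w = set(adj[w]) | {w}
    return len(gamma_v & gamma_w) / math.sqrt(len(gamma_v) * len(gamma_w))


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
inside_clique = structural_similarity(adj, 0, 1)
to_pendant = structural_similarity(adj, 2, 3)
```

Two members of the triangle share their whole neighborhood, while the edge to the pendant vertex gets a noticeably lower score.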
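Structural Connectivity [1]
ε-Neighborhood:
  $$N_\varepsilon(v) = \{ w \in \Gamma(v) \mid \sigma(v, w) \ge \varepsilon \}$$
Core:
  $$CORE_{\varepsilon,\mu}(v) \Leftrightarrow |N_\varepsilon(v)| \ge \mu$$
Direct structure reachable:
  $$DirREACH_{\varepsilon,\mu}(v, w) \Leftrightarrow CORE_{\varepsilon,\mu}(v) \wedge w \in N_\varepsilon(v)$$
Structure reachable: transitive closure of direct structure reachability
Structure connected:
  $$CONNECT_{\varepsilon,\mu}(v, w) \Leftrightarrow \exists u \in V : REACH_{\varepsilon,\mu}(u, v) \wedge REACH_{\varepsilon,\mu}(u, w)$$
[1] M. Ester, H. P. Kriegel, J. Sander, & X. Xu (KDD'96), "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases"
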
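The definitions above suggest a simple procedure: find a core vertex, then grow a cluster by following direct structure reachability. The sketch below is a simplified illustration rather than the published SCAN algorithm (it does not separate hubs from outliers); it assumes the neighborhood of a vertex includes the vertex itself, and the graph and function name are hypothetical.

```python
import math
from collections import deque


def scan_clusters(adj, eps, mu):
    """Return {vertex: cluster id}; vertices left out are hubs or outliers."""
    def gamma(v):
        return set(adj[v]) | {v}

    def sigma(v, w):
        gv, gw = gamma(v), gamma(w)
        return len(gv & gw) / math.sqrt(len(gv) * len(gw))

    def eps_neighborhood(v):
        return {w for w in gamma(v) if sigma(v, w) >= eps}

    labels, next_id = {}, 0
    for v in adj:
        if v in labels or len(eps_neighborhood(v)) < mu:
            continue  # already clustered, or not a core
        labels[v] = next_id
        queue = deque([v])
        while queue:
            u = queue.popleft()
            nbrs = eps_neighborhood(u)
            if len(nbrs) < mu:
                continue  # border vertex: belongs to the cluster, does not expand it
            for w in nbrs:
                if w not in labels:
                    labels[w] = next_id
                    queue.append(w)
        next_id += 1
    return labels


# Two triangles (0-1-2 and 4-5-6) bridged by vertex 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4],
       4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
labels = scan_clusters(adj, eps=0.7, mu=2)
```

With these parameters the two triangles become separate clusters, and the bridging vertex 3 is left unassigned.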
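Structure-Connected Clusters
Structure-connected cluster C
  Connectivity:
    $$\forall v, w \in C : CONNECT_{\varepsilon,\mu}(v, w)$$
  Maximality:
    $$\forall v, w \in V : v \in C \wedge REACH_{\varepsilon,\mu}(v, w) \Rightarrow w \in C$$
Hubs:
  Do not belong to any cluster
  Bridge to many clusters
Outliers:
  Do not belong to any cluster
  Connect to fewer clusters
[Figure: example network with a hub and an outlier marked]
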
Algorithm
[Figures: two snapshots of the algorithm running on an example network with vertices 0-13, μ = 2, ε = 0.7; values shown on edges: 0.67, 0.75, 0.82 in the first snapshot and 0.51, 0.68, 0.51 in the second]

Running Time
Running time: O(|E|)
For sparse networks: O(|V|)
[2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004).

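Spectral Clustering
Spectral clustering: Find the best cut that partitions the network
  Different criteria to decide "best"
    Min cut, ratio cut, normalized cut, Min-Max cut
Using Min cut as an example [Wu et al. 1993]
  Assign each node i an indicator variable: q_i = 1 if i is in cluster A, q_i = -1 if i is in cluster B
  Represent the cut size using indicator vector and adjacency matrix
    $$CutSize = \frac{1}{4} \sum_{(i,j) \in E} w_{ij} (q_i - q_j)^2 = \frac{1}{4}\, q^T (D - W)\, q$$
  Minimize the objective function through solving an eigenvalue system
    Relax the discrete value of q to continuous value
    Map continuous value of q into discrete ones to get cluster labels
    Use the eigenvector of the second smallest eigenvalue for q
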
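The relaxation can be illustrated with a small bisection sketch. To stay dependency-free it swaps a full eigen-solver for power iteration on c*I - (D - W), restricted to vectors orthogonal to the constant vector; that substitution, the function names, and the example graph are assumptions made for illustration only.

```python
import math


def fiedler_vector(w, iters=500):
    """Approximate the eigenvector of the second smallest eigenvalue of D - W."""
    n = len(w)
    deg = [sum(row) for row in w]
    c = 2.0 * max(deg)  # upper bound on the largest Laplacian eigenvalue
    q = [math.sin(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iters):
        mean = sum(q) / n
        q = [x - mean for x in q]  # remove the trivial constant component
        # Multiply by (c*I - L): the smallest nontrivial eigenvalue of L dominates.
        q = [(c - deg[i]) * q[i] + sum(w[i][j] * q[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in q))
        q = [x / norm for x in q]
    return q


def spectral_bisect(w):
    """Map the continuous q back to discrete labels +1 / -1 by sign."""
    return [1 if x >= 0 else -1 for x in fiedler_vector(w)]


# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
w = [[0] * 6 for _ in range(6)]
for a, b in edges:
    w[a][b] = w[b][a] = 1
parts = spectral_bisect(w)
```

The sign pattern of the resulting vector separates the two triangles, cutting only the bridging edge.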
Modularity-Based Clustering
Modularity-based clustering
  Find the best clustering that maximizes the modularity function
Q-function [Newman et al., 2004]
  Let e_ij be half of the fraction of edges between group i and group j
    e_ii is the fraction of edges within group i
  Let a_i be the fraction of all ends of edges attached to vertices in group i
  Q is then defined as the sum of differences between within-group edges and expected within-group edges
    $$Q = \sum_i (e_{ii} - a_i^2)$$
  Maximize Q
    One possible solution: hierarchically merge clusters resulting in greatest increase in Q function [Newman et al., 2004]

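As a worked example of the Q-function, the sketch below computes Q for a given partition of an undirected, unweighted edge list. The function name and the example graph are hypothetical.

```python
def modularity(edges, community):
    """Q = sum over groups of (e_ii - a_i^2) for an undirected edge list."""
    m = len(edges)
    groups = set(community.values())
    e_in = {g: 0.0 for g in groups}   # fraction of edges inside each group
    a = {g: 0.0 for g in groups}      # fraction of edge ends in each group
    for u, v in edges:
        gu, gv = community[u], community[v]
        if gu == gv:
            e_in[gu] += 1.0 / m
        a[gu] += 0.5 / m
        a[gv] += 0.5 / m
    return sum(e_in[g] - a[g] ** 2 for g in groups)


# Two triangles joined by one edge, split into their natural groups.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q_split = modularity(edges, split)
q_single = modularity(edges, {v: "A" for v in range(6)})
```

Here each group holds 3 of the 7 edges and half of the edge ends, so Q = 2 * (3/7 - 1/4) = 5/14, whereas putting everything into one group gives Q = 0.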
Probabilistic Model-Based Clustering
Probabilistic model-based clustering
  Build generative models for links based on hidden cluster labels
  Maximize the log-likelihood of all the links to derive the hidden cluster membership
An example: Airoldi et al., Mixed membership stochastic block models, 2008
  Define a group interaction probability matrix B (K x K)
    B(g, h) denotes the probability of link generation between group g and group h
  Generative model for a link
    For each node, draw a membership probability vector from a Dirichlet prior
    For each pair of nodes, draw cluster labels according to their membership probability (supposing g and h), then decide whether to have a link according to probability B(g, h)
  Derive the hidden cluster label by maximizing the likelihood given B and the prior

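The generative story can be simulated directly, which helps to see what the hidden variables are. This is a sampling sketch only (it does not perform the inference step); the function name, the seed, and the parameter values are assumptions.

```python
import random


def sample_mmsb(n, alpha, b, seed=0):
    """Sample memberships and directed links from a mixed-membership block model.

    alpha: Dirichlet parameters (length K); b[g][h]: link probability
    between group g and group h.
    """
    rng = random.Random(seed)

    def dirichlet():
        # A Dirichlet draw is a normalized vector of Gamma draws.
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        s = sum(g)
        return [x / s for x in g]

    def categorical(p):
        r, acc = rng.random(), 0.0
        for i, p_i in enumerate(p):
            acc += p_i
            if r < acc:
                return i
        return len(p) - 1

    theta = [dirichlet() for _ in range(n)]  # one membership vector per node
    links = {}
    for p in range(n):
        for q in range(n):
            if p == q:
                continue
            g = categorical(theta[p])  # label node p takes for this pair
            h = categorical(theta[q])  # label node q takes for this pair
            links[(p, q)] = 1 if rng.random() < b[g][h] else 0
    return theta, links


b = [[0.9, 0.05],
     [0.05, 0.8]]
theta, links = sample_mmsb(n=6, alpha=[0.5, 0.5], b=b)
```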
User-Guided Clustering in DB InfoNet
[Figure: relational schema with the labels "User hint" and "Target of clustering"]
  Work-In (person, group)
  Professor (name, office, position)
  Open-course (course, semester, instructor)
  Course (course-id, name, area)
  Group (name, area)
  Advise (professor, student, degree)
  Publish (author, title)
  Publication (title, year, conf)
  Student
  Register (student, course, semester, unit, grade)
User usually has a goal of clustering, e.g., clustering students by research area
User specifies the clustering goal to a DB InfoNet clustering method: CrossClus

Classification vs. User-Guided Clustering
User-specified feature (in the form of attribute) is used as a hint, not class labels
  The attribute may contain too many or too few distinct values
    E.g., a user may want to cluster students into 20 clusters instead of 3
  Additional features need to be included in cluster analysis
[Figure labels: User hint; All tuples for clustering]

User-Guided Clustering vs. Semi-supervised Clustering
Semi-supervised clustering [Wagstaff, et al. 01, Xing, et al. 02]
  User provides a training set consisting of "similar" and "dissimilar" pairs of objects
User-guided clustering
  User specifies an attribute as a hint, and more relevant features are found for clustering
[Figure: Semi-supervised clustering vs. User-guided clustering]

Why Not Typical Semi-Supervised Clustering?
Why not do typical semi-supervised clustering?
  Much information (in multiple relations) is needed to judge whether two tuples are similar
  A user may not be able to provide a good training set
It is much easier for a user to specify an attribute as a hint, such as a student's research area
[Figure: tuples to be compared (Tom Smith, SC1211, TA; Jane Chang, BI205, RA) and the user hint]

CrossClus: An Overview
CrossClus: Framework
  Search for good multi-relational features for clustering
  Measure similarity between features based on how they cluster objects into groups
  User guidance + heuristic search for finding pertinent features
  Clustering based on a k-medoids-based algorithm
CrossClus: Major advantages
  User guidance, even in a very simple form, plays an important role in multi-relational clustering
  CrossClus finds pertinent features by computing similarities between features

Selection of Multi-Relational Features
A multi-relational feature is defined by:
  A join path. E.g., Student → Register → OpenCourse → Course
  An attribute. E.g., Course.area
  (For numerical feature) an aggregation operator. E.g., sum or average
Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]: areas of courses of each student

Areas of courses (counts)     Values of feature f
Tuple   DB   AI   TH          Tuple    DB    AI    TH
t1      5    5    0           f(t1)    0.5   0.5   0
t2      0    3    7           f(t2)    0     0.3   0.7
t3      1    5    4           f(t3)    0.1   0.5   0.4
t4      5    0    5           f(t4)    0.5   0     0.5
t5      3    3    4           f(t5)    0.3   0.3   0.4

Numerical feature, e.g., average grades of students
  h = [Student → Register, Register.grade, average]
  E.g., h(t1) = 3.5

Similarity Between Features
Values of features f and g

         Feature f (course)      Feature g (group)
Tuple    DB    AI    TH          Info sys   Cog sci   Theory
t1       0.5   0.5   0           1          0         0
t2       0     0.3   0.7         0          0         1
t3       0.1   0.5   0.4         0          0.5       0.5
t4       0.5   0     0.5         0.5        0         0.5
t5       0.3   0.3   0.4         0.5        0.5       0

[Figures: surface plots of the similarity vectors V^f and V^g over pairs of tuples]
Similarity between two features: cosine similarity of two vectors
  $$Sim(f, g) = \frac{V^f \cdot V^g}{|V^f|\,|V^g|}$$

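Using the values of f and g from the table above, feature similarity can be sketched as the cosine between two vectors of pairwise tuple similarities. The sketch assumes that the similarity of two tuples under a categorical feature is the dot product of their value distributions; the function names are hypothetical.

```python
import math


def similarity_vector(feature):
    """Flatten the pairwise tuple similarities under one feature into a vector."""
    n = len(feature)
    return [sum(a * b for a, b in zip(feature[i], feature[j]))
            for i in range(n) for j in range(n)]


def feature_similarity(f, g):
    """Cosine similarity between the similarity vectors of two features."""
    vf, vg = similarity_vector(f), similarity_vector(g)
    dot = sum(a * b for a, b in zip(vf, vg))
    norm_f = math.sqrt(sum(a * a for a in vf))
    norm_g = math.sqrt(sum(b * b for b in vg))
    return dot / (norm_f * norm_g)


# Rows are tuples t1..t5; columns are the feature values from the table.
f = [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7], [0.1, 0.5, 0.4],
     [0.5, 0.0, 0.5], [0.3, 0.3, 0.4]]
g = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
sim_fg = feature_similarity(f, g)
sim_ff = feature_similarity(f, f)
```

Two features that group the tuples in the same way yield a value near 1, regardless of how their individual values are named.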
Similarity between Categorical & Numerical Features
$$V^h \cdot V^f = 2 \sum_{i=1}^{N} \sum_{j<i} sim_h(t_i, t_j)\, sim_f(t_i, t_j)$$
$$= 2 \sum_{i=1}^{N} \sum_{k=1}^{l} f(t_i).p_k\, (1 - h(t_i)) \sum_{j<i} f(t_j).p_k \;+\; 2 \sum_{i=1}^{N} \sum_{k=1}^{l} f(t_i).p_k \sum_{j<i} h(t_j)\, f(t_j).p_k$$
  The factors f(t_i).p_k (1 - h(t_i)) and f(t_i).p_k depend only on t_i
  The inner sums over j < i depend on all t_j with j < i
[Figure: objects ordered by feature h (values 2.7 to 3.9) with their feature f values (DB, AI, TH), marking the parts depending on t_i and the parts depending on all t_j with j < i]

Searching for Pertinent Features
Different features convey different aspects of information
  Research area: research group area, conferences of papers, advisor
  Academic performances: GPA, GRE score, number of papers
  Demographic info: permanent address, nationality
Features conveying same aspect of information usually cluster objects in more similar ways
  Research group areas vs. conferences of publications
Given user-specified feature
  Find pertinent features by computing feature similarity

Heuristic Search for Pertinent Features
Overall procedure
  1. Start from the user-specified feature
  2. Search in neighborhood of existing pertinent features
  3. Expand search range gradually
[Figure: relational schema with the labels "User hint" and "Target of clustering"]
  Work-In (person, group)
  Professor (name, office, position)
  Open-course (course, semester, instructor)
  Course (course-id, name, area)
  Group (name, area)
  Advise (professor, student, degree)
  Publish (author, title)
  Publication (title, year, conf)
  Student
  Register (student, course, semester, unit, grade)
Tuple ID propagation [Yin, et al. 04] is used to create multi-relational features
  IDs of target tuples can be propagated along any join path, from which we can find tuples joinable with each target tuple

Clustering with Multi-Relational Feature
Given a set of L pertinent features f_1, ..., f_L, similarity between two objects
  $$sim(t_1, t_2) = \sum_{i=1}^{L} sim_{f_i}(t_1, t_2) \cdot f_i.weight$$
  Weight of a feature is determined in feature search by its similarity with other pertinent features
For clustering, we use CLARANS, a scalable k-medoids [Ng & Han 94] algorithm

Experiments: Compare CrossClus with Existing Methods
Baseline: Only use the user-specified feature
PROCLUS [Aggarwal, et al. 99]: a state-of-the-art subspace clustering algorithm
  Use a subset of features for each cluster
  We convert relational database to a table by propositionalization
  User-specified feature is forced to be used in every cluster
RDBC [Kirsten and Wrobel 00]
  A representative ILP clustering algorithm
  Use neighbor information of objects for clustering
  User-specified feature is forced to be used

Measuring Clustering Accuracy
To verify that CrossClus captures user's clustering goal, we define "accuracy" of clustering
Given a clustering task
  Manually find all features that contain information directly related to the clustering task: the standard feature set
  E.g., Clustering students by research areas
    Standard feature set: research group, group areas, advisors, conferences of publications, course areas
  Accuracy of clustering result: how similar it is to the clustering generated by the standard feature set
  $$deg(C \subset C') = \frac{\sum_{i=1}^{n} \max_{1 \le j \le n'} |c_i \cap c'_j|}{\sum_{i=1}^{n} |c_i|}$$
  $$sim(C, C') = \frac{deg(C \subset C') + deg(C' \subset C)}{2}$$

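The accuracy measure can be checked on a small example. The sketch below represents each clustering as a list of Python sets; the function names and the two toy clusterings are hypothetical.

```python
def deg(c, c_prime):
    """Degree to which clustering c is contained in clustering c_prime."""
    total = sum(len(ci) for ci in c)
    best_overlap = sum(max(len(ci & cj) for cj in c_prime) for ci in c)
    return best_overlap / total


def clustering_similarity(c, c_prime):
    """Symmetric similarity: average of the two containment degrees."""
    return (deg(c, c_prime) + deg(c_prime, c)) / 2


produced = [{1, 2, 3}, {4, 5, 6}]
standard = [{1, 2}, {3, 4, 5, 6}]
accuracy = clustering_similarity(produced, standard)
```

Here each direction matches 5 of the 6 objects, so the similarity is 5/6; identical clusterings give 1.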
Clustering Professors: CS Dept Dataset
[Figure: Clustering Accuracy - CS Dept; bar chart of accuracy (0 to 1) for feature sets Group, Course, Group+Course; methods: CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, RDBC]
(Theory): J. Erickson, S. Har-Peled, L. Pitt, E. Ramos, D. Roth, M. Viswanathan
(Graphics): J. Hart, M. Garland, Y. Yu
(Database): K. Chang, A. Doan, J. Han, M. Winslett, C. Zhai
(Numerical computing): M. Heath, T. Kerkhoven, E. de Sturler
(Networking & QoS): R. Kravets, M. Caccamo, J. Hou, L. Sha
(Artificial Intelligence): G. Dejong, M. Harandi, J. Ponce, L. Rendell
(Architecture): D. Padua, J. Torrellas, C. Zilles, S. Adve, M. Snir, D. Reed, V. Adve
(Operating Systems): D. Mickunas, R. Campbell, Y. Zhou

DBLP Dataset
[Figure: Clustering Accuracy - DBLP; bar chart of accuracy (0 to 1) for feature sets Conf, Word, Coauthor, Conf+Word, Conf+Coauthor, Word+Coauthor, All three; methods: CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, RDBC]

Classification of Information Networks
Classification of Heterogeneous Information Networks:
  Graph-regularization-Based Method (GNetMine)
  Multi-Relational-Mining-Based Method (CrossMine)
  Statistical Relational Learning-Based Method (SRL)
Classification of Homogeneous Information Networks

Why Classifying Heterogeneous InfoNet?
Sometimes, we do have prior knowledge for part of the nodes/objects!
Input: Heterogeneous information network structure + class labels for some objects/nodes
Goal: Classify the heterogeneous networked data into classes, each of which is composed of multi-typed data objects sharing a common topic.
  Natural generalization of classification on homogeneous networked data
Email network + several suspicious users/words/emails
  Find out the terrorists, their emails and frequently used words! Class: terrorism
Military network + which military camp several soldiers/commanders belong to
  Find out the soldiers and commanders belonging to that camp! Class: military camp
89
Classification: Knowledge Propagation
[Figure: Classifier]
90
GNetMine: Methodology
Classification of networked data can be essentially viewed as a process of knowledge propagation, where information is propagated from labeled objects to unlabeled ones through links until a stationary state is achieved.
A novel graph-based regularization framework to address the classification problem on heterogeneous information networks.
Respect the link type differences by preserving consistency over each relation graph corresponding to each type of links separately
Mathematical intuition: Consistency assumption
  The confidence (f) of two objects (x_ip and x_jq) belonging to class k should be similar if x_ip and x_jq are linked (R_ij,pq > 0)
  f should be similar to the given ground truth
91
GNetMine: Graph-Based Regularization
Minimize the objective function
\[ J(f_1^{(k)}, \ldots, f_m^{(k)}) = \sum_{i,j=1}^{m} \lambda_{ij} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} R_{ij,pq} \left( \frac{1}{\sqrt{D_{ij,pp}}} f_{ip}^{(k)} - \frac{1}{\sqrt{D_{ji,qq}}} f_{jq}^{(k)} \right)^2 + \sum_{i=1}^{m} \alpha_i \left( f_i^{(k)} - y_i^{(k)} \right)^T \left( f_i^{(k)} - y_i^{(k)} \right) \]
User preference (\lambda_{ij}, \alpha_i): how much do you value this relationship / ground truth?
Smoothness constraints: objects linked together should share similar estimations of confidence belonging to class k
Normalization term applied to each type of link separately: reduce the impact of popularity of nodes
Confidence estimation on labeled data and their pre-given labels should be similar
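To make the two parts of the objective concrete, here is a minimal NumPy sketch that evaluates it for one class k on a toy network. It assumes the smoothness term divides each confidence by the square root of the object's degree in that relation, and that a weight per relation (`lam`) and per object type (`alpha`) is supplied by the user. The function name, the dictionary layout and the toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np


def gnetmine_objective(R, f, y, lam, alpha):
    """Evaluate the regularization objective for a single class.

    R[(i, j)] : n_i x n_j non-negative relation matrix between types i and j
    f[i], y[i]: confidence and label vectors (length n_i) for type i
    lam, alpha: user-chosen weights (assumed names)
    """
    J = 0.0
    for (i, j), Rij in R.items():
        d_i = Rij.sum(axis=1)  # degree of each type-i object in relation (i, j)
        d_j = Rij.sum(axis=0)  # degree of each type-j object in relation (i, j)
        for p, q in zip(*np.nonzero(Rij)):
            diff = f[i][p] / np.sqrt(d_i[p]) - f[j][q] / np.sqrt(d_j[q])
            J += lam[(i, j)] * Rij[p, q] * diff ** 2  # smoothness over links
    for i in f:
        r = f[i] - y[i]
        J += alpha[i] * float(r @ r)  # fit to the given labels
    return J


# Toy network: type 0 = papers (2 objects), type 1 = authors (2 objects).
R = {(0, 1): np.array([[1.0, 0.0], [1.0, 1.0]])}
f = {0: np.array([1.0, 0.5]), 1: np.array([0.8, 0.2])}
y = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 0.0])}
print(gnetmine_objective(R, f, y, lam={(0, 1): 1.0}, alpha={0: 0.5, 1: 0.5}))
```

Raising a relation's weight makes disagreement across that link type more costly; raising a type's label weight pulls its confidences toward the given labels.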

92
Experiments on DBLP
Class: Four research areas (communities)
  Database, data mining, AI, information retrieval
Four types of objects
  Paper (14376), Conf. (20), Author (14475), Term (8920)
Three types of relations
  Paper-conf., paper-author, paper-term
Algorithms for comparison
  Learning with Local and Global Consistency (LLGC) [Zhou et al. NIPS 2003]: also the homogeneous version of our method
  Weighted-vote Relational Neighbor classifier (wvRN) [Macskassy et al. JMLR 2007]
  Network-only Link-based Classification (nLB) [Lu et al. ICML 2003, Macskassy et al. JMLR 2007]
93
Classification Accuracy: Labeling a Very Small Portion of Authors and Papers

Comparison of classification accuracy on authors (%)
(a%, p%)       nLB (A-A)  nLB (A-C-P-T)  wvRN (A-A)  wvRN (A-C-P-T)  LLGC (A-A)  LLGC (A-C-P-T)  GNetMine (A-C-P-T)
(0.1%, 0.1%)   25.4       26.0           40.8        34.1            41.4        61.3            82.9
(0.2%, 0.2%)   28.3       26.0           46.0        41.2            44.7        62.2            83.4
(0.3%, 0.3%)   28.4       27.4           48.6        42.5            48.8        65.7            86.7
(0.4%, 0.4%)   30.7       26.7           46.3        45.6            48.7        66.0            87.2
(0.5%, 0.5%)   29.8       27.3           49.0        51.4            50.6        68.9            87.5

Comparison of classification accuracy on papers (%)
(a%, p%)       nLB (P-P)  nLB (A-C-P-T)  wvRN (P-P)  wvRN (A-C-P-T)  LLGC (P-P)  LLGC (A-C-P-T)  GNetMine (A-C-P-T)
(0.1%, 0.1%)   49.8       31.5           62.0        42.0            67.2        62.7            79.2
(0.2%, 0.2%)   73.1       40.3           71.7        49.7            72.8        65.5            83.5
(0.3%, 0.3%)   77.9       35.4           77.9        54.3            76.8        66.6            83.2
(0.4%, 0.4%)   79.1       38.6           78.1        54.4            77.9        70.5            83.7
(0.5%, 0.5%)   80.7       39.3           77.9        53.5            79.0        73.5            84.1

Comparison of classification accuracy on conferences (%)
(a%, p%)       nLB (A-C-P-T)  wvRN (A-C-P-T)  LLGC (A-C-P-T)  GNetMine (A-C-P-T)
(0.1%, 0.1%)   25.5           43.5            79.0            81.0
(0.2%, 0.2%)   22.5           56.0            83.5            85.0
(0.3%, 0.3%)   25.0           59.0            87.0            87.0
(0.4%, 0.4%)   25.0           57.0            86.5            89.5
(0.5%, 0.5%)   25.0           68.0            90.0            94.0
94
Knowledge Propagation: List Objects with the Highest Confidence Measure Belonging to Each Class

Top-5 terms related to each area
No.  Database   Data Mining      Artificial Intelligence  Information Retrieval
1    data       mining           learning                 retrieval
2    database   data             knowledge                information
3    query      clustering       Reinforcement            web
4    system     learning         reasoning                search
5    xml        classification   model                    document

Top-5 authors concentrated in each area
No.  Database             Data Mining         Artificial Intelligence  Information Retrieval
1    Surajit Chaudhuri    Jiawei Han          Sridhar Mahadevan        W. Bruce Croft
2    H. V. Jagadish       Philip S. Yu        Takeo Kanade             Iadh Ounis
3    Michael J. Carey     Christos Faloutsos  Andrew W. Moore          Mark Sanderson
4    Michael Stonebraker  Wei Wang            Satinder P. Singh        ChengXiang Zhai
5    C. Mohan             Shusaku Tsumoto     Thomas S. Huang          Gerard Salton

Top-5 conferences concentrated in each area
No.  Database  Data Mining  Artificial Intelligence  Information Retrieval
1    VLDB      KDD          IJCAI                    SIGIR
2    SIGMOD    SDM          AAAI                     ECIR
3    PODS      PAKDD        CVPR                     WWW
4    ICDE      ICDM         ICML                     WSDM
5    EDBT      PKDD         ECML                     CIKM
95
96
Multi-Relation to Flat Relation Mining?
Folding multiple relations into a single "flat" one for mining?
[Figure: Doctor, Patient and Contact relations flattened into one relation]
Cannot be a solution due to problems:
  Lose information of linkages and relationships, no semantics preservation
  Cannot utilize information of database structures or schemas (e.g., ER modeling)

97
One Approach: Inductive Logic Programming (ILP)
Find a hypothesis that is consistent with background knowledge (training data)
  FOIL, Golem, Progol, TILDE, ...
Background knowledge
  Relations (predicates), Tuples (ground facts)
Inductive Logic Programming (ILP)
  Hypothesis: The hypothesis is usually a set of rules, which can predict certain attributes in certain relations
  Daughter(X, Y) ← female(X), parent(Y, X)
Training examples
  Daughter(mary, ann)  +
  Daughter(eve, tom)   +
  Daughter(tom, ann)   -
  Daughter(eve, ann)   -
Background knowledge
  Parent(ann, mary)  Parent(ann, tom)  Parent(tom, eve)  Parent(tom, ian)
  Female(ann)  Female(mary)  Female(eve)
98
Inductive Logic Programming Approach to Multi-Relation Classification
ILP Approaches to Multi-Relation Classification
  Top-down Approaches (e.g., FOIL)
    while (enough examples left)
      generate a rule
      remove examples satisfying this rule
  Bottom-up Approaches (e.g., Golem)
    Use each example as a rule
    Generalize rules by merging rules
  Decision Tree Approaches (e.g., TILDE)
ILP Approach: Pros and Cons
  Advantages: Expressive and powerful, and rules are understandable
  Disadvantages: Inefficient for databases with complex schemas, and inappropriate for continuous attributes
99
FOIL: First-Order Inductive Learner (Rule Generation)
Find a set of rules consistent with training data
A top-down, sequential covering learner
Build each rule by heuristics
  Foil gain: a special type of information gain
[Figure: all examples, divided into positive and negative examples; the positive examples are covered in turn by Rule 1, Rule 2 and Rule 3; one rule is grown as A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5]
To generate a rule
  while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break
100
Find the Best Predicate: Predicate Evaluation
All predicates in a relation can be evaluated based on propagated IDs
Use foil-gain to evaluate predicates
  Suppose the current rule is r. For a predicate p,
\[ \text{foil-gain}(p) = P(r+p) \left[ -\log \frac{P(r)}{P(r) + N(r)} + \log \frac{P(r+p)}{P(r+p) + N(r+p)} \right] \]
Categorical Attributes
  Compute foil-gain directly
Numerical Attributes
  Discretize with every possible value
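As a worked illustration of predicate evaluation, the sketch below computes foil-gain from four counts: the positive and negative examples satisfying the current rule r, and those still satisfying it after predicate p is appended. The base-2 logarithm, the function name and the zero-coverage guard are assumptions made for this example; the counts come from a toy setting with 2 positive and 2 negative target tuples.

```python
import math


def foil_gain(pos_r, neg_r, pos_rp, neg_rp):
    """Foil-gain of appending predicate p to rule r.

    pos_r, neg_r  : positive / negative examples satisfying r
    pos_rp, neg_rp: positive / negative examples satisfying r + p
    """
    if pos_rp == 0:
        return 0.0  # assumed convention: a predicate covering no positives gains nothing
    info_before = -math.log2(pos_r / (pos_r + neg_r))
    info_after = -math.log2(pos_rp / (pos_rp + neg_rp))
    return pos_rp * (info_before - info_after)


# Current (empty) rule covers 2 positives and 2 negatives.
print(foil_gain(2, 2, 2, 1))  # a predicate keeping 2+ and 1-
print(foil_gain(2, 2, 2, 0))  # a predicate keeping 2+ and 0-
```

The predicate that removes every negative example while keeping both positives gets the larger gain, so it would be added to the rule first.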

101
Loan Applications: Backend Database
Target relation: Loan. Each tuple has a class label, indicating whether a loan is paid on time.
  Loan: loan-id, account-id, date, amount, duration, payment
  Account: account-id, district-id, frequency, date
  District: district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96
  Card: card-id, disp-id, type, issue-date
  Transaction: trans-id, account-id, date, type, operation, amount, balance, symbol
  Disposition: disp-id, account-id, client-id
  Order: order-id, account-id, bank-to, account-to, amount, type
  Client: client-id, birth-date, gender, district-id
How to make decisions on loan applications?
102
CrossMine: An Effective Multi-relational Classifier
Methodology
  Tuple-ID propagation: an efficient and flexible method for virtually joining relations
  Confine the rule search process in promising directions
  Look-one-ahead: a more powerful search strategy
  Negative tuple sampling: improve efficiency while maintaining accuracy
103
Tuple ID Propagation

Loan (target relation; applicants #1 to #4)
Loan ID  Account ID  Amount  Duration  Decision
1        124         1000    12        Yes
2        124         4000    12        Yes
3        108         10000   24        No
4        45          12000   36        No

Account (with propagated IDs)
Account ID  Frequency  Open date  Propagated ID  Labels
124         monthly    02/27/93   1, 2           2+, 0-
108         weekly     09/23/97   3              0+, 1-
45          monthly    12/09/96   4              0+, 1-
67          weekly     01/01/97   Null           0+, 0-

Possible predicates:
  Frequency = 'monthly': 2+, 1-
  Open date < 01/01/95: 2+, 0-
Propagate tuple IDs of the target relation to non-target relations
Virtually join relations to avoid the high cost of physical joins
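The propagation step can be sketched in a few lines of Python using the toy Loan and Account data from this slide. Each account row receives the IDs of the loan tuples that join with it, and a candidate predicate on Account is then scored by counting the class labels behind the propagated IDs, without materializing the join. The function and variable names are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Target relation Loan: (loan_id, account_id, class label)
loans = [(1, 124, "Yes"), (2, 124, "Yes"), (3, 108, "No"), (4, 45, "No")]
labels = {loan_id: label for loan_id, _, label in loans}

# Non-target relation Account: account_id -> attributes
accounts = {
    124: {"frequency": "monthly", "open_date": date(1993, 2, 27)},
    108: {"frequency": "weekly", "open_date": date(1997, 9, 23)},
    45: {"frequency": "monthly", "open_date": date(1996, 12, 9)},
    67: {"frequency": "weekly", "open_date": date(1997, 1, 1)},
}


def propagate_ids(loans):
    """Attach to each account the IDs of the target tuples joinable with it."""
    ids = defaultdict(list)
    for loan_id, account_id, _ in loans:
        ids[account_id].append(loan_id)
    return ids


def predicate_counts(predicate, ids):
    """Count positive / negative target tuples reachable through matching accounts."""
    pos = neg = 0
    for account_id, row in accounts.items():
        if predicate(row):
            for loan_id in ids.get(account_id, []):
                if labels[loan_id] == "Yes":
                    pos += 1
                else:
                    neg += 1
    return pos, neg


ids = propagate_ids(loans)
print(predicate_counts(lambda r: r["frequency"] == "monthly", ids))
print(predicate_counts(lambda r: r["open_date"] < date(1995, 1, 1), ids))
```

Only integer IDs travel between relations, which is why the approach stays cheap in time and space compared with a physical join.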

104
Tuple ID Propagation (Idea Outlined)
Efficient
  Only propagate the tuple IDs
  Time and space usage is low
Flexible
  Can propagate IDs among non-target relations
  Many sets of IDs can be kept on one relation, which are propagated from different join paths
[Figure: IDs propagated from the target relation to relations R1, R2 and R3]
105
Rule Generation: Example
[Figure: the loan database schema with Loan as the target relation and Account, District, Card, Transaction, Disposition, Order and Client as non-target relations; the first predicate is found on Account and the second predicate on District]
Rule Generation
  Start at the target relation
  Repeat
    Search in all active relations
    Search in all relations joinable to active relations
    Add the best predicate to the current rule
    Set the involved relation to active
  Until
    The best predicate does not have enough gain
    Current rule is too long
106
Look-one-ahead in Rule Generation
Two types of relations: Entity and Relationship
Often cannot find useful predicates on relations of relationship
[Figure: the target relation joined to a relationship relation on which no good predicate is found]
Solution of CrossMine: When propagating IDs to a relation of relationship, propagate one more step to the next relation of entity

107
Negative Tuple Sampling
Each time a rule is generated, covered positive examples are removed
After generating many rules, there are far fewer positive examples than negative ones
  Cannot build good rules (low support)
  Still time consuming (large number of negative examples)
Solution: Sampling on negative examples
  Improve efficiency without affecting rule quality
[Figure: scatter of positive (+) and negative examples before and after sampling the negatives]
108
Real Dataset
PKDD Cup 99 dataset: Loan Application
Method     Accuracy  Time (per fold)
FOIL       74.0%     3338 sec
TILDE      81.3%     2429 sec
CrossMine  90.7%     15.3 sec

Mutagenesis dataset: only 4 relations, so TILDE does a good job, though slow
Method     Accuracy  Time (per fold)
FOIL       79.7%     1.65 sec
TILDE      89.4%     25.6 sec
CrossMine  87.7%     0.83 sec
109
110
Probabilistic Relational Models in Statistical Relational Learning
Goal: model the distribution of data in relational databases
  Treat both entities and relations as classes
  Intuition: objects are no longer independent of each other
  Build statistical networks according to the dependency relationships between attributes of different classes
A Probabilistic Relational Model (PRM) consists of
  Relational schema (from databases)
  Dependency structure (between attributes)
  Local probability model (conditional probability distribution)
Three major methods of probabilistic relational models
  Relational Bayesian Network (RBN, Lise Getoor et al.)
  Relational Markov Networks (RMN, Ben Taskar et al.)
  Relational Dependency Networks (RDN, Jennifer Neville et al.)
111
Relational Bayesian Networks (RBN)
Extend Bayesian network to consider entities, properties and relationships in a DB scenario
Three different uncertainties
  Attribute uncertainty
    Model conditional probability for an attribute given its parent variables
  Structural uncertainty
    Model conditional probability for a reference or link existence given its parent variables
  Class uncertainty
    Refine the conditional probability by considering sub-classes or hierarchy of classes
112
Rel. Markov Networks & Rel. Dependency Networks
Similar ideas to Relational Bayesian Networks
Relational Markov Networks
  Extend from Markov Network
  Undirected link to model dependency relation instead of directed links as in Bayesian networks
Relational Dependency Networks
  Extend from Dependency Network
  Undirected link to model dependency relation
  Use pseudo-likelihood instead of exact likelihood
    Efficient in learning
113
114
Transductive Learning in the Graph
Problem: for a set of nodes in the graph, the class labels are given for part of the nodes; the task is to learn the labels of the unlabeled nodes
Methods
  Label propagation algorithm [Zhu et al. 2002, Zhou et al. 2004, Szummer et al. 2001]
    Iteratively propagate labels to neighbors, according to the transition probability defined by the network
  Graph-regularization-based algorithm [Zhou et al. 2004]
    Intuition: trade-off between (1) consistency with the labeled data and (2) consistency between linked objects
    A quadratic optimization problem
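A minimal NumPy sketch of iterative label propagation in the style of the local-and-global-consistency method of Zhou et al.: labels spread over the symmetrically normalized adjacency matrix and are pulled back toward the initial labels at every step. The value of `alpha`, the iteration count and the toy path graph are assumptions chosen for illustration.

```python
import numpy as np


def propagate_labels(W, Y, alpha=0.9, iters=200):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y with S = D^{-1/2} W D^{-1/2}.

    W: symmetric adjacency matrix (every node needs at least one link)
    Y: one row per node, one column per class; all-zero rows are unlabeled
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(axis=1)


# Toy path graph 0-1-2-3-4-5; node 0 is labeled class 0, node 5 class 1.
W = np.zeros((6, 6))
for u in range(5):
    W[u, u + 1] = W[u + 1, u] = 1.0
Y = np.zeros((6, 2))
Y[0, 0] = 1.0
Y[5, 1] = 1.0
print(propagate_labels(W, Y))
```

Each unlabeled node ends up with the class of the labeled node it is closer to, which is the consistency-between-linked-objects intuition in action.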

115
Data Cleaning by Link Analysis
Object reconciliation vs. object distinction as data cleaning tasks
Link analysis may take advantage of redundancy and facilitate entity cross-checking and validation
Object distinction: Different people/objects do share names
  In AllMusic.com, 72 songs and 3 albums are named "Forgotten" or "The Forgotten"
  In DBLP, 141 papers are written by at least 14 different authors named "Wei Wang"
New challenges of object distinction:
  Textual similarity cannot be used
DISTINCT: Object distinction by information network analysis
  X. Yin, J. Han, and P. S. Yu, "Object Distinction: Distinguishing Objects with Identical Names by Link Analysis", ICDE'07
117
Entity Distinction: The "Wei Wang" Challenge in DBLP
[Figure: DBLP papers authored by "Wei Wang" (coauthor lists with venue and year, e.g., VLDB 1997, SIGMOD 2002, ICDE 2005, KDD 2004), grouped by the four distinct people below]
(1) Wei Wang at UNC
(2) Wei Wang at UNSW, Australia
(3) Wei Wang at Fudan Univ., China
(4) Wei Wang at SUNY Buffalo
118
The DISTINCT Methodology
Measure similarity between references
  Link-based similarity: Linkages between references
    References to the same object are more likely to be connected (using random walk probability)
  Neighborhood similarity
    Neighbor tuples of each reference can indicate similarity between their contexts
Self-boosting: Training using the "same" bulky data set
Reference-based clustering
  Group references according to their similarities
119
Training with the "Same" Data Set
Build a training set automatically
  Select distinct names, e.g., Johannes Gehrke
  The collaboration behavior within the same community shares some similarity
  Train parameters using a typical and large set of "unambiguous" examples
Use SVM to learn a model for combining different join paths
  Each join path is used as two attributes (with link-based similarity and neighborhood similarity)
  The model is a weighted sum of all attributes
120
Clustering: Measure Similarity between Clusters
Single-link (highest similarity between points in two clusters)?
  No, because references to different objects can be connected.
Complete-link (minimum similarity between them)?
  No, because references to the same object may be weakly connected.
Average-link (average similarity between points in two clusters)?
  A better measure
Refinement: Average neighborhood similarity and collective random walk probability
121
Real Cases: DBLP Popular Names
Name                Num_authors  Num_refs  accuracy  precision  recall  f-measure
Hui Fang            3            9         1.0       1.0        1.0     1.0
Ajay Gupta          4            16        1.0       1.0        1.0     1.0
Joseph Hellerstein  2            151       0.81      1.0        0.81    0.895
Rakesh Kumar        2            36        1.0       1.0        1.0     1.0
Michael Wagner      5            29        0.395     1.0        0.395   0.566
Bing Liu            6            89        0.825     1.0        0.825   0.904
Jim Smith           3            19        0.829     0.888      0.926   0.906
Lei Wang            13           55        0.863     0.92       0.932   0.926
Wei Wang            14           141       0.716     0.855      0.814   0.834
Bin Yu              5            44        0.658     1.0        0.658   0.794
Average                                    0.81      0.966      0.836   0.883
122
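Distinguishing Different "Wei Wang"s
[Figure: clusters of references found for "Wei Wang", labeled by affiliation with the number of references in parentheses: UNC-CH (57), Fudan U, China (31), UNSW, Australia (19), SUNY Buffalo (5), NU Singapore (5), Harbin U, China (5), Zhejiang U, China (3), Nanjing Normal, China (3), Beijing Polytech (3), SUNY Binghamton (2), Ningbo Tech, China (2), Purdue (2), Chongqing U, China (2), Beijing U Com, China (2); two edges between clusters are labeled 5 and 2]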
123
Truth Validation by Info. Network Analysis
The trustworthiness problem of the web (according to a survey):
  54% of Internet users trust news web sites most of the time
  26% for web sites that sell products
  12% for blogs
TruthFinder: Truth discovery on the Web by link analysis
  Among multiple conflicting results, can we automatically identify which one is likely the true fact?
Veracity (conformity to truth):
  Given a large amount of conflicting information about many objects, provided by multiple web sites (or other information providers), how to discover the true fact about each object?
Our work: Xiaoxin Yin, Jiawei Han, Philip S. Yu, "Truth Discovery with Multiple Conflicting Information Providers on the Web", TKDE'08
125
Conflicting Information on the Web
Different web sites often provide conflicting info. on a subject, e.g., Authors of "Rapid Contextual Design"
Online Store      Authors
Powell's books    Holtzblatt, Karen
Barnes & Noble    Karen Holtzblatt, Jessamyn Wendell, Shelley Wood
A1 Books          Karen Holtzblatt, Jessamyn Burns Wendell, Shelley Wood
Cornwall books    Holtzblatt-Karen, Wendell-Jessamyn Burns, Wood
Mellon's books    Wendell, Jessamyn
Lakeside books    WENDELL, JESSAMYNHOLTZBLATT, KARENWOOD, SHELLEY
Blackwell online  Wendell, Jessamyn, Holtzblatt, Karen, Wood, Shelley
126
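Our Setting: Info. Network Analysis
Each object has a set of conflicting facts
  E.g., different author names for a book
And each web site provides some facts
How to find the true fact for each object?
[Figure: three-layer network linking web sites (w1 to w4) to the facts they provide (f1 to f5), and facts to the objects they describe (o1, o2)]
127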
Basic Heuristics for Problem Solving
1. There is usually only one true fact for a property of an object
2. This true fact appears to be the same or similar on different web sites
   E.g., "Jennifer Widom" vs. "J. Widom"
3. The false facts on different web sites are less likely to be the same or similar
   False facts are often introduced by random factors
4. A web site that provides mostly true facts for many objects will likely provide true facts for other objects
128
Overview of the TruthFinder Method
Confidence of facts ↔ Trustworthiness of web sites
  A fact has high confidence if it is provided by (many) trustworthy web sites
  A web site is trustworthy if it provides many facts with high confidence
The TruthFinder mechanism, an overview:
  Initially, each web site is equally trustworthy
  Based on the above four heuristics, infer fact confidence from web site trustworthiness, and then backwards
  Repeat until achieving a stable state
129
Analogy to Authority-Hub Analysis
Facts ↔ Authorities, Web sites ↔ Hubs
[Figure: a web site w1 (hub, high trustworthiness) linked to a fact f1 (authority, high confidence)]
Difference from authority-hub analysis
  Linear summation cannot be used
    A web site is trustable if it provides accurate facts, instead of many facts
    Confidence is the probability of being true
  Different facts of the same object influence each other
130
Inference on Trustworthiness
Inference of web site trustworthiness & fact confidence
[Figure: web sites w1 to w4 linked to facts f1 to f4, which belong to objects o1 and o2]
True facts and trustable web sites will become apparent after some iterations
131
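Computational Model: t(w) and s(f)
The trustworthiness of a web site w: t(w)
  Average confidence of the facts it provides
\[ t(w) = \frac{\sum_{f \in F(w)} s(f)}{|F(w)|} \]
  where the numerator is the sum of fact confidence and F(w) is the set of facts provided by w
The confidence of a fact f: s(f)
  One minus the probability that all web sites providing f are wrong
\[ s(f) = 1 - \prod_{w \in W(f)} \bigl(1 - t(w)\bigr) \]
  where 1 - t(w) is the probability that w is wrong and W(f) is the set of web sites providing f
[Figure: web sites w1, w2 with trustworthiness t(w1), t(w2) linked to fact f1 with confidence s(f1)]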
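The two update rules on this slide can be iterated directly. The sketch below is a simplified version of that loop: it omits the influence between different facts about the same object and any dampening, both of which the full method would need. The initial trustworthiness of 0.9, the fixed iteration count and the toy data are assumptions made for illustration.

```python
def truthfinder(provides, t0=0.9, iters=20):
    """Alternate between fact confidence s(f) and site trustworthiness t(w).

    provides: dict mapping each web site to the set of facts it provides
    t0      : assumed initial trustworthiness shared by all sites
    """
    facts = {f for fs in provides.values() for f in fs}
    t = {w: t0 for w in provides}
    s = {}
    for _ in range(iters):
        for f in facts:
            all_wrong = 1.0
            for w, fs in provides.items():
                if f in fs:
                    all_wrong *= 1.0 - t[w]  # probability that w is wrong
            s[f] = 1.0 - all_wrong
        for w, fs in provides.items():
            t[w] = sum(s[f] for f in fs) / len(fs)  # average fact confidence
    return t, s


# Hypothetical input: two sites agree on f1, a third site claims f2.
provides = {"w1": {"f1"}, "w2": {"f1"}, "w3": {"f2"}}
t, s = truthfinder(provides)
print(s["f1"] > s["f2"], t["w1"] > t["w3"])
```

The fact backed by two sites ends with higher confidence than the fact backed by one, and the sites providing it become more trustworthy in turn.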

132
Experiments: Finding Truth of Facts
Determining authors of books
  Dataset contains 1265 books listed on abebooks.com
  We analyze 100 random books (using book images)
Case                      Voting  TruthFinder  Barnes & Noble
Correct                   71      85           64
Miss author(s)            12      2            4
Incomplete names          18      5            6
Wrong first/middle names  1       1            3
Has redundant names       0       2            23
Add incorrect names       1       5            5
No information            0       0            2
133
Experiments: Trustable Info Providers
Finding trustworthy information sources
  Most trustworthy bookstores found by TruthFinder vs. top-ranked bookstores by Google (query "bookstore")
TruthFinder
Bookstore          trustworthiness  #book  Accuracy
TheSaintBookstore  0.971            28     0.959
MildredsBooks      0.969            10     1.0
Alphacraze.com     0.968            13     0.947
Google
Bookstore       Google rank  #book  Accuracy
Barnes & Noble  1            97     0.865
Powell's books  3            42     0.654
134
Similarity Search in Information Networks
Structural similarity vs. semantic similarity
  Structural similarity: Based on structural/isomorphic similarity of subgraph/subnetwork structures
  Semantic similarity: influenced by similar network structures
Graph-structure-based indexing and similarity search
  Structure-based indexing, e.g., gIndex, SPath, ...
  Use index to search for similar graph/network structures
Substructure indexing methods:
  Key problem: What substructures are good indexing features?
  gIndex [Yan, Yu & Han, SIGMOD'04]: Find frequent and discriminative subgraphs (by graph pattern mining)
  SPath [Zhao & Han, VLDB'10]: Use decomposed shortest paths as basic indexing features
136
Why SPath as Indexing Features?
Shortest paths as neighborhood signatures of vertices (indexing features): scalable, and prune the search space effectively
Processing (by query decomposition): Decompose the query graph into a set of indexed shortest paths in SPath
[Figure: a network, a query, a global lookup table, and the neighborhood signature of v3]
Semantics-Based Similarity Search in InfoNet
Search top-k similar objects of the same type for a query
  Find researchers most similar to "Christos Faloutsos"
Two critical concepts to define a similarity
  Feature space
    Traditional data: attributes denoted as numerical value/vector, set, etc.
    Network data: a relation sequence called "path schema"
      Existing homogeneous-network-based similarity does not deal with this problem
  Measure defined on the feature space
    Cosine, Euclidean distance, Jaccard coefficient, etc.
    PathSim
138
Path Schema for DBLP Queries
Path schema: A path of the InfoNet schema, e.g., APC, APA
Who are most similar to Christos Faloutsos?
139
Flickr: Which Pictures Are Most Similar?
Some path schemas lead to similarity closer to human intuition
But some others do not
140
Not All Similarity Measures Are Good
[Figure: results of alternative similarity measures, annotated "Favor highly visible objects" and "Not reasonable"]
141
Long Path Schema May Not Be Good
Repeat the path schema 2, 4, and infinite times for a conference similarity query
142
PathSim: Definition & Properties
Commuting matrix M_P corresponding to a path schema P
  Product of the adjacency matrices of the relations in the path schema
  The element M_P(i,j) denotes the strength between object i and object j in the semantics of path schema P
    If the adjacency matrices are unweighted, it denotes the number of path instances following path schema P
PathSim: \( s(i,j) = \frac{2 M_P(i,j)}{M_P(i,i) + M_P(j,j)} \)
Properties:
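A short NumPy sketch of PathSim on a symmetric path schema. It assumes an author-by-venue count matrix `W` (number of papers each author published in each venue), so the commuting matrix for the author-venue-author path is `W @ W.T`. The authors and counts are made-up illustrative data.

```python
import numpy as np


def pathsim(M):
    """PathSim scores from a commuting matrix M of a symmetric path schema.

    Every object is assumed to have at least one path instance to itself,
    so no denominator is zero.
    """
    diag = np.diag(M)
    return 2 * M / (diag[:, None] + diag[None, :])


authors = ["Mike", "Jim", "Mary", "Bob", "Ann"]
# Rows: authors; columns: four hypothetical venues.
W = np.array([
    [2, 1, 0, 0],
    [50, 20, 0, 0],
    [2, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 0, 1, 1],
])
M = W @ W.T  # commuting matrix: path instances author -> venue -> author
S = pathsim(M)
for name, score in zip(authors, S[0]):
    print(f"PathSim(Mike, {name}) = {score:.4f}")
```

Bob, who has the same publication profile as Mike, scores 1, while the far more prolific Jim scores low even though he shares many path instances with Mike; the normalization by self-visibility is what keeps highly visible objects from dominating.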
143
Co-Clustering-Based Pruning Algorithm
Store commuting matrices for short path schemas & compute top-k queries on-line
Framework
  Generate co-clusters for materialized commuting matrices, for feature objects and target objects
  Derive upper bounds for the similarity between object and target cluster, and between object and object: safely prune target clusters and objects if the upper bound similarity is lower than the current threshold
  Dynamically update the top-k threshold
Performance: Baseline vs. pruning
144
Role Discovery in Network: Why It Matters?
[Figure: an (imaginary) army communication network from which the roles Commander, Captain and Soldier are automatically inferred]
146
Role Discovery: Extracting Semantic Information from Links
Objective: Extract semantic meaning from plain links to finely model and better organize information networks
Challenges
  Latent semantic knowledge
  Interdependency
  Scalability
Opportunity
  Human intuition
  Realistic constraint
  Cross-check with collective intelligence
Methodology: propagate simple intuitive rules and constraints over the whole network
147
Discovery of Advisor-Advisee Relationships in DBLP Network
Input: DBLP research publication network
Output: Potential advising relationship and its ranking (r, [st, ed])
Ref. C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research Publication Networks", SIGKDD 2010
[Figure: input temporal collaboration network (authors Ada, Bob, Jerry, Ying, Smith; years 1999 to 2004); output relationship analysis with candidate edges scored as (r, [st, ed]), e.g., (0.9, [/, 1998]), (0.8, [1999, 2000]), (0.7, [2000, 2001]), (0.65, [2002, 2004]), (0.2, [2001, 2003]); visualized chronological hierarchies]
148
Overall Framework
a_i: author i
p_j: paper j
py: paper year
pn: paper #
st_{i,y_i}: starting time
ed_{i,y_i}: ending time
r_{i,y_i}: ranking score
149
Time-Constrained Probabilistic Factor Graph (TPFG)
y_x: a_x's advisor
st_{x,y_x}: starting time
ed_{x,y_x}: ending time
g(y_x, st_x, ed_x) is a predefined local feature
f_x(y_x, Z_x) = max g(y_x, st_x, ed_x) under time constraint
Objective function: \( P(\{y_x\}) = \prod_x f_x(y_x, Z_x) \)
  Z_x = {z | x ∈ Y_z}
  Y_x: set of potential advisors of a_x
150
Experiment Results
DBLP data: 654,628 authors, 1,076,946 publications, years provided
Labeled data: Math Genealogy Project; AI Genealogy Project; Homepage
Datasets  RULE   SVM    IndMAX (empirical)  IndMAX (optimized)  TPFG (empirical)  TPFG (optimized)
TEST1     69.9%  73.4%  75.2%               78.9%               80.2%             84.4%
TEST2     69.8%  74.6%  74.6%               79.0%               81.5%             84.3%
TEST3     80.6%  86.7%  83.1%               90.9%               88.8%             91.3%
RULE: heuristics; SVM: supervised learning; IndMAX and TPFG are each reported with an empirical parameter and an optimized parameter
151
Case Study & Scalability
Advisee        Top Ranked Advisor    Time   Note
David M. Blei  1. Michael I. Jordan  01-03  PhD advisor, 2004 grad
               2. John D. Lafferty   05-06  Postdoc, 2006
Hong Cheng     1. Qiang Yang         02-03  MS advisor, 2003
               2. Jiawei Han         04-08  PhD advisor, 2008
Sergey Brin    1. Rajeev Motwani     97-98  "Unofficial advisor"
152
Graph/Network Summarization: Graph Compression
Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes
153
OLAP on Information Networks
Why OLAP information networks?
Advantages of OLAP: Interactive exploration of multi-dimensional and multi-level space in a data cube (InfoNet)
  Multi-dimensional: Different perspectives
  Multi-level: Different granularities
InfoNet OLAP: Roll-up/drill-down and slice/dice on information network data
  Traditional OLAP cannot handle this, because it ignores links among data objects
Handling two kinds of InfoNet OLAP
  Informational OLAP
  Topological OLAP
154
Informational OLAP
In the DBLP network, study the collaboration patterns among researchers
Dimensions come from informational attributes attached at the whole snapshot level, so-called Info-Dims
I-OLAP Characteristics:
  Overlay multiple pieces of information
  No change on the objects whose interactions are being examined
    In the underlying snapshots, each node is a researcher
    In the summarized view, each node is still a researcher
155
Topological OLAP
Dimensions come from the node/edge attributes inside individual networks, so-called Topo-Dims
T-OLAP Characteristics
  Zoom in/Zoom out
  Network topology changed: "generalized" nodes and "generalized" edges
    In the underlying network, each node is a researcher
    In the summarized view, each node becomes an institute that comprises multiple researchers
156
InfoNet OLAP: Operations & Framework
Roll-up
  InfoNet I-OLAP: Overlay multiple snapshots to form a higher-level summary via an I-aggregated network
  InfoNet T-OLAP: Shrink the topology & obtain a T-aggregated network that represents a compressed view, with topological elements (i.e., nodes and/or edges) merged and replaced by corresponding higher-level ones
Drill-down
  InfoNet I-OLAP: Return to the set of lower-level snapshots from the higher-level overlaid (aggregated) network
  InfoNet T-OLAP: A reverse operation of roll-up
Slice/dice
  InfoNet I-OLAP: Select a subset of qualifying snapshots based on Info-Dims
  InfoNet T-OLAP: Select a subnetwork based on Topo-Dims
Measure is an aggregated graph & other measures like node count, average degree, etc. can be treated as derived
Graph plays a dual role: (1) data source, and (2) aggregate measure
Measures could be complex, e.g., maximum flow, shortest path, centrality
It is possible to combine I-OLAP and T-OLAP into a hybrid case
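As a small illustration of a topological roll-up, the sketch below merges researcher nodes into institute nodes and sums the weights of the collaboration edges between institutes, producing the aggregated graph as the measure. Dropping edges inside one institute is a design choice assumed here (they could instead be kept as self-loops); the names and data are hypothetical.

```python
from collections import Counter


def rollup(edges, group_of):
    """Merge nodes into their groups and sum edge weights between groups.

    edges   : dict mapping an undirected edge (u, v) to its weight
    group_of: dict mapping each node to its higher-level node
    """
    agg = Counter()
    for (u, v), w in edges.items():
        gu, gv = group_of[u], group_of[v]
        if gu != gv:  # assumed: collaborations inside one institute are dropped
            agg[tuple(sorted((gu, gv)))] += w
    return dict(agg)


# Hypothetical co-authorship counts between researchers.
edges = {("ann", "bob"): 2, ("ann", "carl"): 1, ("bob", "dave"): 3, ("carl", "dave"): 4}
group_of = {"ann": "UIUC", "bob": "UIUC", "carl": "UCSB", "dave": "UCSB"}
print(rollup(edges, group_of))
```

Drill-down is the reverse lookup: keeping `edges` and `group_of` around lets a view of one institute pair be expanded back into the researcher-level edges that produced it.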

157
Mining Evolution and Dynamics of InfoNet
Many networks carry time information
  E.g., according to paper publication year, DBLP networks can form network sequences
Motivation: Model the evolution of communities in a heterogeneous network
  Automatically detect the best number of communities in each timestamp
  Model the smoothness between communities of adjacent timestamps
  Model the evolution structure explicitly
    Birth, death, split
159
Evolution: Idea Illustration
From network sequences to evolutionary communities
160
Graphical Model: A Generative Model
Dirichlet Process Mixture Model-based generative model
At each timestamp, a community is dependent on historical communities and the background community distribution
161
Generative Model & Model Inference
To generate a new paper o_i
  Decide whether to join an existing community or a new one
    Join an existing community k with prob. n_k / (i - 1 + α)
    Join a new community k with prob. α / (i - 1 + α): decide its prior, either from a background distribution (λ) or historical communities ((1 - λ)π_k), with different probabilities, and draw the attribute distribution from the prior
  Generate o_i according to the attribute distribution
Greedy inference for each timestamp: collapsed Gibbs sampling, which tries to sample a cluster label for each target object (e.g., paper)
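The join-or-create step of this generative process follows the usual Chinese-restaurant-process form, sketched below. It assumes the new object is the i-th one, so the existing community sizes sum to i - 1, and that a single concentration parameter (called `alpha` here) controls how readily a new community is opened. The choice of prior for a new community and the attribute generation are left out.

```python
def community_probabilities(counts, alpha):
    """Probabilities for a new object to join each existing community or a new one.

    counts: sizes n_k of the existing communities (they sum to i - 1)
    alpha : assumed name for the concentration parameter
    """
    denom = sum(counts) + alpha
    join_existing = [n_k / denom for n_k in counts]
    join_new = alpha / denom
    return join_existing, join_new


# Hypothetical state: two communities holding 3 papers and 1 paper.
existing, new = community_probabilities([3, 1], alpha=1.0)
print(existing, new)
```

Large communities attract new papers in proportion to their size, while the concentration parameter sets how often a community is born, which is how the number of communities per timestamp is determined by the data rather than fixed in advance.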
162
Accuracy Study
The more types of objects used, the better the accuracy
Historical prior results in better accuracy
163
Case Study on DBLP
Tracking database community evolution
[Figure: DBLP schema and the database community at timestamps 1991, 1994, 1997 and 2000, each shown with its top venues and terms]
  1991: DEXA, Comm. ACM, ICDE, VLDB, Int. J. MM Studies; systems, object, database, information, oriented
  1994: DEXA, VLDB, SIGMOD Conf., CIKM, TKDE; object, oriented, database, systems, databases
  1997: DEXA Workshop, SIGMOD Conf., VLDB, DEXA, ICDE; data, databases, database, object, systems
  2000: VLDB, ICDE, SIGMOD Conf., DEXA, IDEAS; data, databases, database, query, web
  Related communities (1993): venues CHI Conf., AAAI, TREC, SIGIR, PKDD, SIGKDD Explor., KDD with terms data, retrieval, information, mining, text; venues Comm. ACM, W. Simu. Conf., SIGCSE with terms software, systems, design, knowledge, analysis
164
Case Study on Delicious.com
[Figure: Delicious schema; plot of event count (150 to 550) against week (1 to 4) for communities C1, C2 and C3]
165
Conclusions
Rich knowledge can be mined from information networks
What is the magic?
  Heterogeneous, structured information networks!
Clustering, ranking and classification: Integrated clustering, ranking and classification: RankClus, NetClus, GNetMine, ...
Data cleaning, validation, and similarity search
Role discovery, OLAP, and evolutionary analysis
Knowledge is power, but knowledge is hidden in massive links!
Mining heterogeneous information networks: Much more to be explored!!
167
Future Research
From mining the current single-star network schema to ranking, clustering, ..., in multi-star, multi-relational databases
Mining information networks formed by structured data linking with unstructured data (text, multimedia and Web)
Mining cyber-physical networks (networks formed by dynamic sensors, image/video cameras, with information networks)
Enhancing the power of knowledge discovery by transforming massive unstructured data: Incremental information extraction, role discovery, ..., multi-dimensional structured info-net
Mining noisy, uncertain, untrustable massive datasets by information network analysis approach
Turning Wikipedia and/or Web into structured or semi-structured databases by heterogeneous information network analysis
168
References:BooksonNetworkAnalysis
A.L.Barabasi.Linked:HowEverythingIsConnectedtoEverythingElseandWhatItMeans.Plume, 2003. M.Buchanan.Nexus:SmallWorldsandtheGroundbreakingTheoryofNetworks.W.W.Norton& Company,2003. D.J.CookandL.B.Holder.MiningGraphData.JohnWiley&Sons,2007 S.Chakrabarti.MiningtheWeb:DiscoveringKnowledgefromHypertextData.MorganKaufmann, 2003 A.Degenne andM.Forse.IntroducingSocialNetworks.SagePublications,1999 P.J.Carrington,J.Scott,andS.Wasserman.ModelsandMethods inSocialNetworkAnalysis. CambridgeUniversityPress,2005. J.Davies,D.Fensel,andF.vanHarmelen.TowardstheSemanticWeb:OntologyDriven KnowledgeManagement.JohnWiley&Sons,2003. D.Fensel,W.Wahlster,H.Lieberman,andJ.Hendler.SpinningtheSemanticWeb:Bringingthe WorldWideWebtoItsFullPotential.MITPress,2002. L.Getoor andB.Taskar (eds.).Introductiontostatisticallearning.InMITPress,2007. B.Liu.WebDataMining:ExploringHyperlinks,Contents,andUsageData.Springer,2006. J.P.Scott.SocialNetworkAnalysis:AHandbook.SagePublications,2005. J.Watts.SixDegrees:TheScienceofaConnectedAge.W.W.Norton&Company,2003. D.J.Watts.SmallWorlds:TheDynamicsofNetworksbetweenOrderandRandomness.Princeton UniversityPress,2003. S.WassermanandK.Faust.SocialNetworkAnalysis:MethodsandApplications.Cambridge UniversityPress,1994.
References:SomeOverviewPapers
T.BernersLee,J.Hendler,andO.Lassila.Thesemanticweb.ScientificAmerican,May 2001. C.CooperandAFrieze.Ageneralmodelofwebgraphs.Algorithms,22,2003. S.Chakrabarti andC.Faloutsos.Graphmining:Laws,generators,andalgorithms.ACM Comput.Surv.,38,2006. T.Dietterich,P.Domingos,L.Getoor,S.Muggleton,andP.Tadepalli.Structured machinelearning:Thenexttenyears.MachineLearning,73,2008 S.Dumais andH.Chen.Hierarchicalclassificationofwebcontent.SIGIR'00. S.Dzeroski.Multirelational datamining:Anintroduction.ACMSIGKDDExplorations, July2003. L.Getoor.Linkmining:anewdataminingchallenge.SIGKDDExplorations, 5:84{89, 2003. L.Getoor,N.Friedman,D.Koller,andB.Taskar.Learningprobabilisticmodelsof relationalstructure.ICML'01 D.JensenandJ.Neville.Datamininginnetworks.InPapersof theSymp.Dynamic SocialNetworkModelingandAnalysis,NationalAcademyPress,2002. T.Washio andH.Motoda.Stateoftheartofgraphbaseddatamining.SIGKDD Explorations,5,2003.
References: Some Influential Papers
A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. Computer Networks, 33, 2000.
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. WWW'98.
S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. M. Kleinberg. Mining the web's link structure. COMPUTER, 32, 1999.
M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. ACM SIGCOMM'99.
M. Girvan and M. E. J. Newman. Community structure in social and biological networks. In Proc. Natl. Acad. Sci. USA 99, 2002.
B. A. Huberman and L. A. Adamic. Growth dynamics of world wide web. Nature, 399:131, 1999.
G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. KDD'02.
J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measurements, models, and methods. COCOON'99.
D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. KDD'03.
J. M. Kleinberg. Small world phenomena and the dynamics of information. NIPS'01.
R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. FOCS'00.
M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45, 2003.
References: Clustering and Ranking (1)
E. Airoldi, D. Blei, S. Fienberg, and E. Xing, Mixed Membership Stochastic Blockmodels, JMLR'08
Liangliang Cao, Andrey Del Pozo, Xin Jin, Jiebo Luo, Jiawei Han, and Thomas S. Huang, RankCompete: Simultaneous Ranking and Clustering of Web Photos, WWW'10
G. Jeh and J. Widom, SimRank: a measure of structural-context similarity, KDD'02
Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han, Community Outliers and their Efficient Detection in Information Networks, KDD'10
M. E. J. Newman and M. Girvan, Finding and evaluating community structure in networks, Physical Review E, 2004
M. E. J. Newman and M. Girvan, Fast algorithm for detecting community structure in networks, Physical Review E, 2004
J. Shi and J. Malik, Normalized cuts and image segmentation, CVPR'97
Yizhou Sun, Yintao Yu, and Jiawei Han, Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema, KDD'09
Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu, RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis, EDBT'09
References: Clustering and Ranking (2)
Yizhou Sun, Jiawei Han, Jing Gao, and Yintao Yu, iTopicModel: Information Network-Integrated Topic Modeling, ICDM'09
Xiaoxin Yin, Jiawei Han, and Philip S. Yu, LinkClus: Efficient Clustering via Heterogeneous Semantic Links, VLDB'06
Yintao Yu, Cindy X. Lin, Yizhou Sun, Chen Chen, Jiawei Han, Binbin Liao, Tianyi Wu, ChengXiang Zhai, Duo Zhang, and Bo Zhao, iNextCube: Information Network-Enhanced Text Cube, VLDB'09 (demo)
A. Wu, M. Garland, and J. Han, Mining scale-free networks using geodesic clustering, KDD'04
Z. Wu and R. Leahy, An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 1993
X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, SCAN: A structural clustering algorithm for networks, KDD'07
X. Yin, J. Han, and P. S. Yu, Cross-relational clustering with user's guidance, KDD'05
References: Network Classification (1)
A. Appice, M. Ceci, and D. Malerba, Mining model trees: A multi-relational approach, ILP'03
Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, and Jiawei Han, Bipartite Graph-based Consensus Maximization among Supervised and Unsupervised Models, NIPS'09
L. Getoor, N. Friedman, D. Koller, and B. Taskar, Learning Probabilistic Models of Link Structure, JMLR'02
L. Getoor, E. Segal, B. Taskar, and D. Koller, Probabilistic Models of Text and Link Structure for Hypertext Classification, IJCAI WS Text Learning: Beyond Classification, 2001
L. Getoor, N. Friedman, D. Koller, and A. Pfeffer, Learning Probabilistic Relational Models, chapter in Relational Data Mining, eds. S. Dzeroski and N. Lavrac, 2001
M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao, Graph-based classification on heterogeneous information networks, ECML PKDD'10
Q. Lu and L. Getoor, Link-based classification, ICML'03
D. Liben-Nowell and J. Kleinberg, The link prediction problem for social networks, CIKM'03
References: Network Classification (2)
J. Neville, B. Gallagher, and T. Eliassi-Rad, Evaluating statistical tests for within-network classifiers of relational data, ICDM'09
J. Neville, D. Jensen, L. Friedland, and M. Hay, Learning relational probability trees, KDD'03
Jennifer Neville and David Jensen, Relational Dependency Networks, JMLR'07
M. Szummer and T. Jaakkola, Partially labeled classification with Markov random walks, In NIPS, volume 14, 2001
M. J. Rattigan, M. Maier, and D. Jensen, Graph clustering with network structure indices, ICML'07
P. Sen, G. M. Namata, M. Galileo, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, Collective classification in network data, AI Magazine, 29, 2008
B. Taskar, E. Segal, and D. Koller, Probabilistic classification and clustering in relational data, IJCAI'01
B. Taskar, P. Abbeel, M. F. Wong, and D. Koller, Relational Markov Networks, chapter in L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning, 2007
X. Yin, J. Han, J. Yang, and P. S. Yu, CrossMine: Efficient Classification across Multiple Database Relations, ICDE'04
D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, Learning with local and global consistency, In NIPS 16, Vancouver, Canada, 2004
X. Zhu and Z. Ghahramani, Learning from labeled and unlabeled data with label propagation, Technical Report, 2002
References: Social Network Analysis
B. Aleman-Meza, M. Nagarajan, C. Ramakrishnan, L. Ding, P. Kolari, A. P. Sheth, I. B. Arpinar, A. Joshi, and T. Finin. Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection. WWW'06
R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu. Mining newsgroups using networks arising from social behavior. WWW'03
P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. WWW'04
D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Community mining from multi-relational networks. PKDD'05
P. Domingos. Mining social networks for viral marketing. IEEE Intelligent Systems, 20, 2005.
P. Domingos and M. Richardson. Mining the network value of customers. KDD'01
P. DeRose, W. Shen, F. Chen, A. Doan, and R. Ramakrishnan. Building structured web community portals: A top-down, compositional, and incremental approach. VLDB'07
G. Flake, S. Lawrence, C. L. Giles, and F. Coetzee. Self-organization and identification of web communities. IEEE Computer, 35, 2002.
J. Kubica, A. Moore, and J. Schneider. Tractable group detection on large link data sets. ICDM'03
References: Data Quality & Search in Networks
I. Bhattacharya and L. Getoor, Iterative record linkage for cleaning and integration, Proc. SIGMOD 2004 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'04)
Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava, Integrating conflicting data: The role of source dependence, PVLDB, 2(1):550-561, 2009.
Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava, Truth discovery and copying detection in a dynamic world, PVLDB, 2(1):562-573, 2009.
H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis, Two supervised learning approaches for name disambiguation in author citations, ICDL'04.
Y. Sun, J. Han, T. Wu, X. Yan, and Philip S. Yu, PathSim: Path Schema-Based Top-K Similarity Search in Heterogeneous Information Networks, Technical report, CS, UIUC, July 2010.
X. Yin, J. Han, and P. S. Yu, Object Distinction: Distinguishing Objects with Identical Names by Link Analysis, ICDE'07
X. Yin, J. Han, and P. S. Yu, Truth Discovery with Multiple Conflicting Information Providers on the Web, IEEE TKDE, 20(6):796-808, 2008
P. Zhao and J. Han, On Graph Query Optimization in Large Networks, VLDB'10.
References: Role Discovery, Summarization and OLAP
D. Archambault, T. Munzner, and D. Auber. Topolayout: Multilevel graph layout by topological features. IEEE Trans. Vis. Comput. Graph, 2007.
Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, "Graph OLAP: Towards Online Analytical Processing on Graphs", ICDM 2008
Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, "Graph OLAP: A Multi-Dimensional Framework for Graph Data Analysis", KAIS 2009.
Xin Jin, Jiebo Luo, Jie Yu, Gang Wang, Dhiraj Joshi, and Jiawei Han, iRIN: Image Retrieval in Image-Rich Information Networks, WWW'10 (demo paper)
Lu Liu, Feida Zhu, Chen Chen, Xifeng Yan, Jiawei Han, Philip Yu, and Shiqiang Yang, Mining Diversity on Networks, DASFAA'10
Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. SIGMOD'08
Chi Wang, Jiawei Han, Yuntao Jia, Jie Tang, Duo Zhang, Yintao Yu, and Jingyi Guo, Mining Advisor-Advisee Relationships from Research Publication Networks, KDD'10
Zhijun Yin, Manish Gupta, Tim Weninger, and Jiawei Han, LINKREC: A Unified Framework for Link Recommendation with User Attributes and Graph Structure, WWW'10
References: Network Evolution
L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. KDD'06
M. S. Kim and J. Han. A particle-and-density based evolutionary clustering method for dynamic networks. VLDB'09
J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. KDD'05
Yizhou Sun, Jie Tang, Jiawei Han, Manish Gupta, and Bo Zhao, Community Evolution Detection in Dynamic Heterogeneous Information Networks, KDD MLG'10


Yeast protein interaction network

Social network sites

Each target object $x_i$ is mapped into a K-vector ($\pi_{i,k}$)
Parameters are estimated using the EM algorithm
Maximize the log-likelihood given all the observations of links

In all, linear in |E|: ~O(K|E|)
Note: SimRank will be at least quadratic at each iteration, since it evaluates the similarity between every pair in the network

NetClus: Ranking & Clustering with Star Network Schema

[Figure: star network schema examples. DBLP: Contain, Term. Delicious.com: Contain, Tag. IMDB: Actor/Actress, Star in; Contain, Title/Plot.]

Generative Model for Target Objects Given a Net-cluster

Accuracy of Paper Clustering Results
Accuracy, compared with RankClus, a bi-typed clustering method on only one type

Ranking authors in XML

Author/conferen term ranking for research area. Th research areas ca at different level

[Figure: example network linking Tom, Mike, Cathy, John, Mary to sigmod05, vldb03, vldb04, vldb05, aaai04, aaai05]

Questions:
Q1: How to cluster each type of objects?
Q2: How to define similarity between each type of objects?

[Figure: example network linking Tom, Mike, Cathy, John to sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05]

where I(v) is the set of in-neighbors of the vertex v.
But: It is expensive to compute:

For a dataset of N objects and M links, it takes $O(N^2)$ space and $O(M^2)$ time to compute all similarities.
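
To make the cost concrete, here is a minimal sketch of the naive SimRank iteration. It assumes the usual recurrence, in which the similarity of two distinct vertices is a decay factor times the average similarity over all pairs of their in-neighbors; the decay value 0.8, the function name, and the toy graph are illustrative assumptions, not taken from the slides.

```python
from itertools import product

def simrank(in_neighbors, c=0.8, iterations=10):
    """Naive SimRank over all vertex pairs.

    in_neighbors maps each vertex v to the list I(v) of its in-neighbors.
    c is the decay factor (0.8 is an assumed value, not from the slides).
    """
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):  # every pair: the quadratic part
            ia, ib = in_neighbors[a], in_neighbors[b]
            if a == b:
                new_sim[(a, b)] = 1.0
            elif not ia or not ib:
                new_sim[(a, b)] = 0.0
            else:
                total = sum(sim[(x, y)] for x in ia for y in ib)
                new_sim[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new_sim
    return sim

# Hypothetical toy graph: vertex "u" links to both "a" and "b".
toy = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(toy)
print(scores[("a", "b")])
```

The dictionary holds one entry per vertex pair, which is the quadratic space cost, and each iteration touches every pair of in-neighbor pairs.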

Distribution of SimRank similarities among DBLP authors

Power law distribution exists in similarities
56% of similarity entries are in [0.005, 0.015]
1.4% of similarity entries are larger than 0.1
Our goal: Design a data structure that stores the significant similarities and compresses insignificant ones

Each non-leaf node represents a group of similar lower-level nodes

Adjustment ratio for node n7

The tightness of a group of nodes is the support of a frequent pattern

Initializing a tree:
Start from leaf nodes (level 0)
At each level l, find non-overlapping groups of similar nodes with frequent pattern mining

[Figure: plots comparing LinkClus, SimRank, ReCom, and F-SimRank]

Approaches    Accr-Author    Accr-Conf    Average time
LinkClus      0.957          0.723        76.7
SimRank       0.958          0.760        1020
ReCom         0.907          0.457        43.1
F-SimRank     0.908          0.583        83.6

$\sigma(v, w) = \frac{|\Gamma(v) \cap \Gamma(w)|}{\sqrt{|\Gamma(v)|\,|\Gamma(w)|}}$

Structural similarity is large for members of a clique and small for hubs and outliers.
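
A small sketch of this measure, assuming the form $\sigma(v, w) = |\Gamma(v) \cap \Gamma(w)| / \sqrt{|\Gamma(v)|\,|\Gamma(w)|}$ with $\Gamma(v)$ taken to be v together with its adjacent vertices (as in SCAN). The function name and the toy graph are hypothetical.

```python
import math

def structural_similarity(adj, v, w):
    """Structural similarity of v and w in an undirected graph.

    adj maps each vertex to a list of adjacent vertices. The neighborhood
    is assumed to include the vertex itself.
    """
    gv = set(adj[v]) | {v}
    gw = set(adj[w]) | {w}
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

# Hypothetical toy graph: clique {a, b, c}, with c also tied to hub h, and h to d.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "h"],
    "h": ["c", "d"],
    "d": ["h"],
}
clique_pair = structural_similarity(graph, "a", "b")
hub_pair = structural_similarity(graph, "c", "h")
print(clique_pair, hub_pair)
```

In this toy graph the two clique members share their whole neighborhood, while the edge to the hub shares only its two endpoints, so the first score is the larger one.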

Direct structure reachable:
Structure reachable: transitive closure of direct structure reachability
Structure-connected:

Represent the cut size using indicator vector and adjacency matrix
Cut size =
Minimize the objective function through solving eigenvalue system
Relax the discrete value of q to continuous value
Map continuous value of q into discrete ones to get cluster labels
Use second smallest eigenvector for q
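
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the unnormalized Laplacian L = D - A, for which a ±1 indicator vector q gives cut size $q^T L q / 4$, and it finds the eigenvector of the second smallest eigenvalue by shifted power iteration rather than a library eigensolver. Function names and the example graph are hypothetical.

```python
def fiedler_partition(adj_matrix, iterations=500):
    """Two-way split from the second smallest Laplacian eigenvector."""
    n = len(adj_matrix)
    degree = [sum(row) for row in adj_matrix]
    shift = 2.0 * max(degree)  # upper bound on the largest eigenvalue of L
    q = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        mean = sum(q) / n
        q = [x - mean for x in q]  # project out the trivial constant eigenvector
        # Multiply by (shift * I - L), where L = D - A.
        q = [
            shift * q[i]
            - (degree[i] * q[i] - sum(adj_matrix[i][j] * q[j] for j in range(n)))
            for i in range(n)
        ]
        norm = sum(x * x for x in q) ** 0.5
        q = [x / norm for x in q]
    # Map the relaxed continuous values back to discrete cluster labels.
    return [1 if x >= 0 else -1 for x in q]

def cut_size(adj_matrix, labels):
    """Number of edges whose endpoints carry different labels."""
    n = len(adj_matrix)
    return sum(
        adj_matrix[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if labels[i] != labels[j]
    )

# Hypothetical example: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
A = [[0] * 6 for _ in range(6)]
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u][v] = A[v][u] = 1
labels = fiedler_partition(A)
print(labels, cut_size(A, labels))
```

Thresholding at zero is one simple way to discretize q; other rounding rules are possible.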

Maximize Q
One possible solution: hierarchically merge the clusters resulting in the greatest increase in the Q function [Newman et al., 2004]
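
A rough sketch of the hierarchical merging idea, assuming Q is the Newman-Girvan modularity (the fraction of edges inside communities minus the fraction expected at random, so larger is better). It recomputes Q from scratch for every candidate merge, which is far slower than the published algorithm; all names and the example graph are illustrative.

```python
def modularity(edges, community):
    """Modularity Q of an undirected, unweighted graph partition."""
    m = len(edges)
    inside, degree_sum = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(
        inside.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree_sum.items()
    )

def greedy_merge(edges, nodes):
    """Start from singletons; repeatedly apply the merge that raises Q most."""
    community = {v: v for v in nodes}
    best_q = modularity(edges, community)
    while True:
        labels = sorted(set(community.values()))
        best = None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                trial = {v: (a if c == b else c) for v, c in community.items()}
                q = modularity(edges, trial)
                if q > best_q + 1e-12 and (best is None or q > best[0]):
                    best = (q, trial)
        if best is None:  # no merge increases Q any further
            return community, best_q
        best_q, community = best

# Hypothetical example: two triangles joined by a single edge.
edge_list = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
parts, q_value = greedy_merge(edge_list, range(6))
print(parts, q_value)
```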

User usually has a goal of clustering, e.g., clustering students by research area
User specifies his clustering goal to a DB-InfoNet cluster: CrossClus

[Figure: Values of Feature f and g. All tuples for clustering (e.g., Tom Smith, Jane Chang), with feature values f(t1), f(t2), f(t3), f(t4), f(t5)]

Similarity between two features: cosine similarity of two vectors

$= 2 \sum_{i} \sum_{j<i} sim_h(t_i, t_j) \cdot sim_f(t_i, t_j)$

Depend on all $t_j$ with j < i

Objects (ordered by h); Feature h

Parts depending on $t_i$; parts depending on all $t_j$ with j < i

Academic Performances Demographic info

Features conveying the same aspect of information usually cluster objects in more similar ways
research group areas vs. conferences of publications
Given a user-specified feature
Find pertinent features by computing feature similarity

1. Start from the user-specified feature
2. Search in the neighborhood of existing pertinent features
3. Expand the search range gradually
[Figure labels: Group, name, area]

$sim(t_1, t_2) = \sum_{i} sim_{f_i}(t_1, t_2) \cdot f_i.weight$

Weight of a feature is determined in feature search by its similarity with other pertinent features
For clustering, we use CLARANS, a scalable k-medoids algorithm [Ng & Han 94]
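
The two ideas above (feature similarity as a cosine, tuple similarity as a weighted sum over pertinent features) can be sketched as follows. The sketch assumes each feature is represented by an N x N matrix of tuple-to-tuple similarities whose flattened form is the vector being compared; that representation, the function names, and the toy matrices are assumptions made for illustration.

```python
import math

def feature_similarity(sim_f, sim_g):
    """Cosine similarity between two flattened tuple-similarity matrices."""
    vf = [x for row in sim_f for x in row]
    vg = [x for row in sim_g for x in row]
    dot = sum(a * b for a, b in zip(vf, vg))
    norm_f = math.sqrt(sum(a * a for a in vf))
    norm_g = math.sqrt(sum(b * b for b in vg))
    return dot / (norm_f * norm_g)

def tuple_similarity(i, j, features):
    """Weighted sum of per-feature similarities between tuples i and j.

    features is a list of (sim_matrix, weight) pairs for the pertinent features.
    """
    return sum(weight * sim[i][j] for sim, weight in features)

# Hypothetical features over three tuples.
sim_f = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # groups tuples 0 and 1 together
sim_g = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # clusters tuples the same way as f
sim_h = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # separates every tuple

same = feature_similarity(sim_f, sim_g)
different = feature_similarity(sim_f, sim_h)
pair_score = tuple_similarity(0, 1, [(sim_f, 0.5), (sim_h, 0.25)])
print(same, different, pair_score)
```

Features that cluster the tuples identically score a cosine near 1, so such a feature would receive a high weight when searching from the user-specified one.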

Experiments: Compare CrossClus with Existing Methods

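$\deg(C \subset C') = \frac{\sum_{i=1}^{n} \max_{1 \le j \le n'} |c_i \cap c'_j|}{\sum_{i=1}^{n} |c_i|}$

$sim(C, C') = \frac{\deg(C \subset C') + \deg(C' \subset C)}{2}$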

$= \sum_{i,j=1}^{m} \lambda_{ij} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} R_{ij,pq} \left( \frac{1}{\sqrt{D_{ij,pp}}} f_{ip}^{(k)} - \frac{1}{\sqrt{D_{ji,qq}}} f_{jq}^{(k)} \right)^2$

Smoothness constraints: objects linked together should share similar estimations of confidence belonging to class k
Normalization term applied to each type of link separately: reduce the impact of popularity of nodes
Confidence estimation on labeled data and their pre-given labels should be similar

Cannot be a solution due to problems:
Lose information of linkages and relationships, no semantics preservation
Cannot utilize information of database structures or schemas (e.g., ER modeling)

[Figure: rules built by adding predicates: A3=1; A3=1 && A1=2; A3=1 && A1=2 && A8=5]

To generate a rule
  while (true)
    find the best predicate p
    if foilgain(p) > threshold then add p to current rule
    else break
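
The loop above can be made runnable with a small sketch. It assumes the standard FOIL gain, i.e. the number of positives still covered times the drop in information needed to identify a positive, and restricts predicates to attribute = value tests on a single table. The threshold, function names, and toy data are hypothetical.

```python
import math

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL gain of adding a predicate to the current rule."""
    if pos_new == 0:
        return 0.0
    info_old = -math.log2(pos / (pos + neg))
    info_new = -math.log2(pos_new / (pos_new + neg_new))
    return pos_new * (info_old - info_new)

def generate_rule(examples, candidates, threshold=0.5):
    """examples: list of (attribute dict, label); candidates: (attr, value) pairs."""
    rule, covered = [], list(examples)
    while True:
        pos = sum(1 for _, label in covered if label)
        neg = len(covered) - pos
        best, best_gain = None, 0.0
        for attr, value in candidates:  # find the best predicate p
            if (attr, value) in rule:
                continue
            subset = [(x, label) for x, label in covered if x.get(attr) == value]
            p = sum(1 for _, label in subset if label)
            gain = foil_gain(pos, neg, p, len(subset) - p)
            if gain > best_gain:
                best, best_gain = (attr, value), gain
        if best is None or best_gain <= threshold:
            break
        rule.append(best)  # add p to the current rule
        covered = [(x, label) for x, label in covered if x.get(best[0]) == best[1]]
    return rule

# Hypothetical training tuples using the attribute names from the figure.
data = [
    ({"A3": 1, "A1": 2}, True),
    ({"A3": 1, "A1": 2}, True),
    ({"A3": 1, "A1": 2}, True),
    ({"A3": 1, "A1": 0}, False),
    ({"A3": 0, "A1": 2}, False),
    ({"A3": 0, "A1": 0}, False),
]
learned = generate_rule(data, [("A3", 1), ("A1", 2)])
print(learned)
```

Ties in gain are broken by candidate order here, which is an arbitrary choice for the sketch.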