
ACM SIGKDD Conference Tutorial, Washington, D.C., July 25, 2010

Mining Heterogeneous Information Networks
Jiawei Han, Yizhou Sun, Xifeng Yan, Philip S. Yu
University of Illinois at Urbana-Champaign, University of California at Santa Barbara, University of Illinois at Chicago
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), Microsoft, IBM, Yahoo!, Google, HP Lab & Boeing

July 12, 2010

Outline
Motivation: Why Mining Heterogeneous Information Networks?
Part I: Clustering, Ranking and Classification
  Clustering and Ranking in Information Networks
  Classification of Information Networks
Part II: Data Quality and Search in Information Networks
  Data Cleaning and Data Validation by InfoNet Analysis
  Similarity Search in Information Networks
Part III: Advanced Topics on Information Network Analysis
  Role Discovery and OLAP in Information Networks
  Mining Evolution and Dynamics of Information Networks
Conclusions

What Are Information Networks?
Information network: a network where each node represents an entity (e.g., an actor in a social network) and each link (e.g., a tie) a relationship between entities
  Each node/link may have attributes, labels, and weights
  Links may carry rich semantic information
Homogeneous vs. heterogeneous networks
  Homogeneous networks
    Single object type and single link type
    Single-model social networks (e.g., friends)
    WWW: a collection of linked Web pages
  Heterogeneous, multi-typed networks
    Multiple object and link types
    Medical network: patients, doctors, disease, contacts, treatments
    Bibliographic network: publications, authors, venues

Ubiquitous Information Networks
Graphs and substructures
  Chemical compounds, computer vision objects, circuits, XML
Biological networks
Bibliographic networks: DBLP, ArXiv, PubMed, ...
Social networks: Facebook > 100 million active users
World Wide Web (WWW): > 3 billion nodes, > 50 billion arcs
Cyber-physical networks
[Figures: an Internet Web; yeast protein interaction network; co-author network; social network sites]

Homogeneous vs. Heterogeneous Networks
[Figures: co-author network; conference-author network]

DBLP: An Interesting and Familiar Network
DBLP: a computer science publication bibliographic database
  1.4 M records (papers), 0.7 M authors, 5 K conferences, ...
Will this database disclose interesting knowledge about computer science research?
  What are the popular research fields/subfields in CS?
  Who are the leading researchers on DB or XQueries?
  How do the authors in this subfield collaborate and evolve?
  How many Wei Wangs are in DBLP, and which paper was written by which?
  Who is Sergey Brin's supervisor and when?
  Who are very similar to Christos Faloutsos?
All these kinds of questions, and potentially many more, can be nicely answered by the DBLP InfoNet
  How? By exploring the power of links in information networks!

Homo. vs. Hetero.: Differences in DB-InfoNet Mining
Homogeneous networks can often be derived from their original heterogeneous networks
  Co-author networks can be derived from author-paper-conference networks by projection on authors only
  Paper citation networks can be derived from a complete bibliographic network with papers and citations projected
A heterogeneous DB-InfoNet carries richer information than its corresponding projected homogeneous networks
Typed heterogeneous InfoNet vs. non-typed hetero. InfoNet (i.e., not distinguishing different types of nodes)
  Typed nodes and links imply a more structured InfoNet, and thus often lead to more informative discovery
Our emphasis: mining structured information networks!
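The projection from a heterogeneous author-paper network to a homogeneous co-author network can be sketched as follows. This is only an illustration under assumed inputs: the function name `project_coauthors` and the toy `papers` dictionary are hypothetical and do not come from the tutorial.

```python
from collections import defaultdict
from itertools import combinations


def project_coauthors(paper_authors):
    """Project an author-paper network onto authors only.

    paper_authors: dict mapping a paper id to a list of author names
    (hypothetical toy input). Returns a dict mapping a sorted author
    pair to the number of papers the two authors share.
    """
    weights = defaultdict(int)
    for authors in paper_authors.values():
        # Every pair of distinct authors on one paper gets one co-author link.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Toy heterogeneous network: three papers and their authors.
papers = {"p1": ["Ann", "Bob"], "p2": ["Ann", "Bob", "Cy"], "p3": ["Cy"]}
coauthor = project_coauthors(papers)
```

Note that the projection keeps only pair counts: which paper or venue produced a link is no longer recoverable, which is the information loss the slide refers to.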

Why Mining Heterogeneous Information Networks?
Most datasets can be organized or transformed into a structured heterogeneous information network!
  Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, ...
  Structures can be progressively extracted from less organized datasets by information network analysis
  Information-rich, inter-related, organized datasets form one or a set of gigantic, interconnected, multi-typed heterogeneous information networks
Surprisingly rich knowledge can be derived from such structured heterogeneous information networks
Our goal: uncover knowledge hidden in organized data
  Exploring the power of multi-typed, heterogeneous links
  Mining structured heterogeneous information networks!


Clustering and Ranking in Information Networks
Integrated Clustering and Ranking of Heterogeneous Information Networks
Clustering of Homogeneous Information Networks
  LinkClus: Clustering with link-based similarity measure
  SCAN: Density-based clustering of networks
  Others
    Spectral clustering
    Modularity-based clustering
    Probabilistic model-based clustering
User-Guided Clustering of Information Networks

Clustering and Ranking in Heterogeneous Information Networks
Ranking and clustering each provide a new view over a network
Ranking globally without considering clusters is dumb
  Ranking DB and Architecture confs. together?
Clustering authors in one huge cluster without distinction?
  Dull to view thousands of objects (this is why PageRank!)
RankClus: integrates clustering with ranking
  Conditional ranking relative to clusters
  Uses highly ranked objects to improve clusters
  Qualities of clustering and ranking are mutually enhanced
Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09.

Global Ranking vs. Cluster-Based Ranking
A toy example: two areas with 10 conferences and 100 authors in each area

RankClus: A New Framework
[Figure: RankClus framework; labels: Sub-Network, Ranking, Clustering]

The RankClus Philosophy
Why integrated ranking and clustering?
  Ranking and clustering can be mutually improved
  Ranking: once a cluster becomes more accurate, ranking will be more reasonable for such a cluster and will be the distinguished feature of the cluster
  Clustering: once the rankings are more distinguished from each other, the clusters can be adjusted to get more accurate results
Not every object should be treated equally in clustering!
Objects preserve similarity under the new measure space
  E.g., VLDB vs. SIGMOD

RankClus: Algorithm Framework
Step 0. Initialization
  Randomly partition target objects into K clusters
Step 1. Ranking
  Ranking for each sub-network induced from each cluster, which serves as the feature for each cluster
Step 2. Generating new measure space
  Estimate mixture model coefficients for each target object
Step 3. Adjusting clusters
Step 4. Repeat Steps 1-3 until stable

Focus on a Bi-Typed Network Case
Conference-author network; links can exist between
  Conference (X) and author (Y)
  Author (Y) and author (Y)
Use W to denote the links and their weights
  W = [block matrix not preserved in the extracted slide]

Step 1: Ranking: Feature Extraction
Two ranking strategies: simple ranking vs. authority ranking
Simple ranking
  Proportional to degree counting for objects, e.g., # of publications of an author
  Considers only the immediate neighborhood in the network
Authority ranking: extension of HITS to a weighted bi-typed network
  Rule 1: Highly ranked authors publish many papers in highly ranked conferences
  Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
  Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors

Encoding Rules in Authority Ranking
Rule 1: Highly ranked authors publish many papers in highly ranked conferences
Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
[The formula encoding each rule is not preserved in the extracted slide]
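One way the three rules might be encoded is as an iterative, HITS-style propagation over the link weights. The sketch below is an assumption-laden illustration, not the tutorial's own formulas (which did not survive extraction): `w_xy` holds conference-by-author paper counts, `w_yy` holds co-author counts, and `alpha` is a hypothetical weight balancing Rule 1 against Rule 3.

```python
def normalize(v):
    """Scale a non-negative score list so that it sums to 1."""
    s = sum(v)
    return [x / s for x in v] if s > 0 else list(v)


def authority_ranking(w_xy, w_yy, alpha=0.95, iters=50):
    """Iterate the three rules on a bi-typed conference (X) / author (Y) network."""
    n_x, n_y = len(w_xy), len(w_xy[0])
    r_x = [1.0 / n_x] * n_x
    r_y = [1.0 / n_y] * n_y
    for _ in range(iters):
        # Rule 2: a conference scores high if highly ranked authors publish there.
        r_x = normalize([sum(w_xy[i][j] * r_y[j] for j in range(n_y))
                         for i in range(n_x)])
        # Rule 1: an author scores high by publishing in highly ranked conferences.
        from_conf = [sum(w_xy[i][j] * r_x[i] for i in range(n_x))
                     for j in range(n_y)]
        # Rule 3: an author's score is enhanced by highly ranked co-authors.
        from_coauthors = [sum(w_yy[j][k] * r_y[k] for k in range(n_y))
                          for j in range(n_y)]
        r_y = normalize([alpha * c + (1 - alpha) * a
                         for c, a in zip(from_conf, from_coauthors)])
    return r_x, r_y


# Hypothetical toy data: 2 conferences, 3 authors.
w_xy = [[5, 1, 0],
        [0, 1, 1]]
w_yy = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
r_conf, r_auth = authority_ranking(w_xy, w_yy)
```

Normalizing after each update keeps both score vectors interpretable as ranking distributions, which is how the later steps of RankClus use them.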

Example: Authority Ranking in the 2-Area Conference-Author Network
The rankings of authors are quite distinct from each other in the two clusters

Step 2: Generate New Measure Space: A Mixture Model Method
Consider each target object's links as generated under a mixture distribution of the rankings from each cluster
  Consider ranking as a distribution: r(Y) → p(Y)
Each target object x_i is mapped into a K-vector (π_{i,k})
Parameters are estimated using the EM algorithm
  Maximize the log-likelihood given all the observations of links

Example: 2-D Coefficients in the 2-Area Conference-Author Network
The conferences are well separated in the new measure space
[Figure: scatter plots of two conferences and component coefficients]

Step 3: Cluster Adjustment in New Measure Space
Cluster center in the new measure space
  Vector mean of the objects in the cluster (K-dimensional)
Cluster adjustment
  Distance measure: 1 - cosine similarity
  Assign each object to the cluster with the nearest center
Why does a better ranking function derive better clustering?
  Consider the measure space generation process
    Highly ranked objects in a cluster play a more important role in deciding a target object's new measure
  Intuitively, if we can find the highly ranked objects in a cluster, equivalently, we get the right cluster
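The cluster adjustment step can be sketched as below, assuming each target object has already been mapped to a K-dimensional coefficient vector. The function names and the toy vectors are hypothetical; only the center definition (vector mean) and the distance (1 - cosine similarity) come from the slide.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def adjust_clusters(vectors, labels, k):
    """Recompute each cluster center as a vector mean, then reassign objects."""
    dim = len(vectors[0])
    centers = []
    for c in range(k):
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        if members:
            centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            centers.append([0.0] * dim)  # empty cluster: never the nearest center
    return [min(range(k), key=lambda c: cosine_distance(v, centers[c]))
            for v in vectors]


# Hypothetical 2-D coefficients for four conferences; the third is misassigned.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
new_labels = adjust_clusters(vectors, [0, 0, 0, 1], k=2)
```

In this toy case the misassigned third object moves to the second cluster after one adjustment.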

Step-by-Step Running Case Illustration
[Figure: ranking distributions and cluster scatter plots over successive iterations; panel captions:]
  Initially, ranking distributions are mixed together
  Two clusters of objects mixed together, but preserve similarity somehow
  Two clusters are almost well separated
  Improved significantly
  Well separated
  Improved a little
  Stable

Time Complexity: Linear in # of Links
At each iteration, |E|: edges in network, m: number of target objects, K: number of clusters
  Ranking for sparse network: ~O(|E|)
  Mixture model estimation: ~O(K|E| + mK)
  Cluster adjustment: ~O(mK^2)
In all, linear in |E|: ~O(K|E|)
Note: SimRank will be at least quadratic at each iteration since it evaluates the distance between every pair in the network

Case Study: Dataset: DBLP
All the 2676 conferences and 20,000 authors with the most publications, from the time period of year 1998 to year 2007
Both conference-author relationships and co-author relationships are used
K = 15 (select only 5 clusters here)

NetClus: Ranking & Clustering with Star Network Schema
Beyond bi-typed information networks: a star network schema
Split a network into different layers, each represented by a net-cluster

StarNet: Schema & Net-Cluster
Star network schema
  Center type: target type
    E.g., a paper, a movie, a tagging event
    A center object is a co-occurrence of a bag of different types of objects, which stands for a multi-relation among different types of objects
  Surrounding types: attribute (property) types
Net-cluster
  Given an information network G, a net-cluster C contains two pieces of information:
    Node set and link set as a sub-network of G
    Membership indicator for each node x: P(x in C)
  Given an information network G and cluster number K, a clustering for G is a set of net-clusters such that, for each node x, the sum of x's probability distribution over all K net-clusters is 1

StarNet of DBLP
[Figure: star schema; center type: Research Paper; attribute types and relations: Venue (Publish), Author (Write), Term (Contain)]

StarNet of Delicious.com
[Figure: star schema; center type: Tagging Event; attribute types: Web Site, User, Tag (Contain)]

StarNet for IMDB
[Figure: star schema; center type: Movie; attribute types and relations: Actor/Actress (Star in), Director (Direct), Title/Plot (Contain)]

Ranking Functions
Ranking an object x of type T_x in a network G, denoted as p(x | T_x, G)
  Give a score to each object, according to its importance
  Different rules define different ranking functions:
    Simple ranking
      Ranking score is assigned according to the degree of an object
    Authority ranking
      Ranking score is assigned according to the mutual enhancement by propagation of scores through links
      "Highly ranked conferences accept many good papers published by many highly ranked authors, and highly ranked authors publish many good papers in highly ranked conferences"

Ranking Function (Cont.)
Priors can be added:
  $P_P(X \mid T_x, G_k) = (1 - \lambda_P)\, P(X \mid T_x, G_k) + \lambda_P\, P_0(X \mid T_x, G_k)$
  P_0(X | T_x, G_k) is the prior knowledge, usually given as a distribution, denoted by only several words
  λ_P is the parameter expressing how much we believe in the prior distribution
Ranking distribution
  Normalize ranking scores to 1, giving them a probabilistic meaning
  Similar to the idea of PageRank
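As a small worked example of mixing a prior into a ranking distribution, the sketch below applies the convex combination from this slide to a toy term ranking. The function name `smooth_with_prior`, the symbol `lam` for the belief parameter, and all numbers are hypothetical.

```python
def smooth_with_prior(ranking, prior, lam):
    """Mix a ranking distribution with a prior distribution.

    Both inputs map objects to probabilities that sum to 1; lam in [0, 1]
    is the weight placed on the prior.
    """
    objects = set(ranking) | set(prior)
    return {x: (1 - lam) * ranking.get(x, 0.0) + lam * prior.get(x, 0.0)
            for x in objects}


# Toy term ranking for one net-cluster, and a prior given by a single word.
ranking = {"database": 0.5, "query": 0.3, "learning": 0.2}
prior = {"database": 1.0}
smoothed = smooth_with_prior(ranking, prior, lam=0.2)
```

Because both inputs sum to 1, the mixture also sums to 1, so the result keeps its probabilistic meaning.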

NetClus: Algorithm Framework
Map each target object into a new low-dimensional feature space according to the current net-clustering, and adjust the clustering further in the new measure space
  Step 0: Generate initial random clusters
  Step 1: Generate a ranking-based generative model for target objects for each net-cluster
  Step 2: Calculate posterior probabilities for target objects, which serve as the new measure, and assign target objects to the nearest cluster accordingly
  Step 3: Repeat steps 1 and 2 until clusters do not change significantly
  Step 4: Calculate posterior probabilities for attribute objects in each net-cluster

Generative Model for Target Objects Given a Net-cluster
Each target object stands for a co-occurrence of a bag of attribute objects
Define the probability of a target object ⇔ define the probability of the co-occurrence of all the associated attribute objects
Generative probability P(d | G_k) for target object d in cluster C_k:
  [Formula not preserved in the extracted slide]
  where P(x | T_x, G_k) is the ranking function and P(T_x | G_k) is the type probability
Two assumptions of independence
  The probabilities of visiting objects of different types are independent of each other
  The probabilities of visiting two objects within the same type are independent of each other
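Under the two independence assumptions, one plausible reading is that the probability of a target object is a product, over its linked attribute objects, of the ranking probability times the type probability, each raised to the link weight. The sketch below computes that product in log space. The exact form is an assumption reconstructed from the definitions on the slide, and the function name and toy numbers are hypothetical.

```python
import math


def log_prob_target(links, rank, type_prob):
    """Log-probability of a target object given one net-cluster.

    links: list of (attribute object, its type, link weight)
    rank[type][object]: ranking probability P(x | T_x, G_k), assumed > 0
        (e.g., after smoothing)
    type_prob[type]: type probability P(T_x | G_k)
    """
    total = 0.0
    for obj, typ, weight in links:
        # Independence lets the joint probability factor into a product,
        # which becomes a weighted sum of logs.
        total += weight * (math.log(rank[typ][obj]) + math.log(type_prob[typ]))
    return total


# Toy paper linked to one author, one venue, and one term occurring twice.
links = [("a1", "author", 1), ("v1", "venue", 1), ("t1", "term", 2)]
rank = {"author": {"a1": 0.5}, "venue": {"v1": 0.25}, "term": {"t1": 0.1}}
type_prob = {"author": 0.3, "venue": 0.2, "term": 0.5}
log_p = log_prob_target(links, rank, type_prob)
```

Working in log space avoids underflow when a target object links to many attribute objects.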

Cluster Adjustment
Using posterior probabilities of target objects as the new feature space
  Each target object => K-dimension vector
  Each net-cluster center => K-dimension vector
    Average over the objects in the cluster
Assign each target object to the nearest cluster center (e.g., by cosine similarity)
A sub-network corresponding to a new net-cluster is then built by extracting all the target objects in that cluster and all linked attribute objects

Experiments: DBLP and Beyond
Data set: DBLP "all-area" data set
  All conferences + "top" 50K authors
DBLP "four-area" data set
  20 conferences from DB, DM, ML, IR
  All authors from these conferences
  All papers published in these conferences
Running case illustration

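Accuracy Study: Experiments
Accuracy, compared with PLSA, a pure text model: no other types of objects and links are used; the same prior as in NetClus is used
  [Table: Accuracy of Paper Clustering Results; values not preserved in the extracted slide]
Accuracy, compared with RankClus, a bi-typed clustering method on only one type
  [Table: Accuracy of Conference Clustering Results; values not preserved in the extracted slide]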

NetClus: Distinguishing Conferences
Membership values of each conference in five net-clusters (column headers are not preserved in the extracted slide)

  Conference   Cluster 1     Cluster 2    Cluster 3    Cluster 4    Cluster 5
  AAAI         0.0022667     0.00899168   0.934024     0.0300042    0.0247133
  CIKM         0.150053      0.310172     0.00723807   0.444524     0.0880127
  CVPR         0.000163812   0.00763072   0.931496     0.0281342    0.032575
  ECIR         3.47023e-05   0.00712695   0.00657402   0.978391     0.00787288
  ECML         0.00077477    0.110922     0.814362     0.0579426    0.015999
  EDBT         0.573362      0.316033     0.00101442   0.0245591    0.0850319
  ICDE         0.529522      0.376542     0.00239152   0.0151113    0.0764334
  ICDM         0.000455028   0.778452     0.0566457    0.113184     0.0512633
  ICML         0.000309624   0.050078     0.878757     0.0622335    0.00862134
  IJCAI        0.00329816    0.0046758    0.94288      0.0303745    0.0187718
  KDD          0.00574223    0.797633     0.0617351    0.067681     0.0672086
  PAKDD        0.00111246    0.813473     0.0403105    0.0574755    0.0876289
  PKDD         5.39434e-05   0.760374     0.119608     0.052926     0.0670379
  PODS         0.78935       0.113751     0.013939     0.00277417   0.0801858
  SDM          0.000172953   0.841087     0.058316     0.0527081    0.0477156
  SIGIR        0.00600399    0.00280013   0.00275237   0.977783     0.0106604
  SIGMOD       0.689348      0.223122     0.0017703    0.00825455   0.0775055
  VLDB         0.701899      0.207428     0.00100012   0.0116966    0.0779764
  WSDM         0.00751654    0.269259     0.0260291    0.683646     0.0135497
  WWW          0.0771186     0.270635     0.029307     0.451857     0.171082

NetClus: Database System Cluster
Terms
  database 0.0995511
  databases 0.0708818
  system 0.0678563
  data 0.0214893
  query 0.0133316
  systems 0.0110413
  queries 0.0090603
  management 0.00850744
  object 0.00837766
  relational 0.0081175
  processing 0.00745875
  based 0.00736599
  distributed 0.0068367
  xml 0.00664958
  oriented 0.00589557
  design 0.00527672
  web 0.00509167
  information 0.0050518
  model 0.00499396
  efficient 0.00465707
Venues
  VLDB 0.318495
  SIGMOD Conf. 0.313903
  ICDE 0.188746
  PODS 0.107943
  EDBT 0.0436849
Authors
  Surajit Chaudhuri 0.00678065
  Michael Stonebraker 0.00616469
  Michael J. Carey 0.00545769
  C. Mohan 0.00528346
  David J. DeWitt 0.00491615
  Hector Garcia-Molina 0.00453497
  H. V. Jagadish 0.00434289
  David B. Lomet 0.00397865
  Raghu Ramakrishnan 0.0039278
  Philip A. Bernstein 0.00376314
  Joseph M. Hellerstein 0.00372064
  Jeffrey F. Naughton 0.00363698
  Yannis E. Ioannidis 0.00359853
  Jennifer Widom 0.00351929
  Per-Ake Larson 0.00334911
  Rakesh Agrawal 0.00328274
  Dan Suciu 0.00309047
  Michael J. Franklin 0.00304099
  Umeshwar Dayal 0.00290143
  Abraham Silberschatz 0.00278185

Ranking authors in XML

NetClus: StarNet-Based Ranking and Clustering
A general framework in which ranking and clustering are successfully combined to analyze InfoNets
Ranking and clustering can mutually reinforce each other in information network analysis
NetClus, an extension to RankClus that integrates ranking and clustering and generates net-clusters in a star network with an arbitrary number of types
Net-cluster: heterogeneous information sub-networks comprised of multiple types of objects
Go well beyond DBLP and structured relational DBs
  Flickr: query "Raleigh" derives multiple clusters

iNextCube: Information Network-Enhanced Text Cube (VLDB'09 Demo)
Demo: iNextCube.cs.uiuc.edu
Dimension hierarchies generated by NetClus
Architecture of iNextCube
Author/conferen... term ranking for ... research area. Th... research areas ca... at different level...
[Figure: net-cluster hierarchy with nodes All, DB and IS, Theory, Architecture, DB, DM, IR, XML, Distributed DB]


Link-Based Clustering: Why Useful?
[Figure: three-layer network; Authors: Tom, Mike, Cathy, John, Mary; Proceedings: sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05, aaai04, aaai05; Conferences: sigmod, vldb, aaai]
Questions:
  Q1: How to cluster each type of objects?
  Q2: How to define similarity between each type of objects?

SimRank: Link-Based Similarities
Two objects are similar if linked with the same or similar objects (Jeh & Widom, 2002: SimRank)
Similarity between two objects a and b, S(a, b) = the average similarity between objects linked with a and those with b:
  [Formula not preserved in the extracted slide]
  where I(v) is the set of in-neighbors of the vertex v
But: it is expensive to compute
  For a dataset of N objects and M links, it takes O(N^2) space and O(M^2) time to compute all similarities
[Figure: authors (Tom, Mike, Cathy, John, Mary) linked to proceedings (sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05), which are linked to conferences (sigmod, vldb)]
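A naive SimRank iteration can be sketched as below, which also makes the cost visible: every pair of nodes is revisited in every round. The decay constant `c` follows the Jeh & Widom formulation; its value 0.8, the iteration count, and the toy graph (treated as undirected, with hypothetical links) are assumptions for illustration.

```python
def simrank(neighbors, c=0.8, iters=10):
    """Naive SimRank over a dict mapping each node to its list of neighbors."""
    nodes = list(neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iters):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    new[a][b] = 0.0
                else:
                    # Average similarity over all pairs of neighbors, decayed by c.
                    total = sum(sim[x][y]
                                for x in neighbors[a] for y in neighbors[b])
                    new[a][b] = c * total / (len(neighbors[a]) * len(neighbors[b]))
        sim = new
    return sim


# Hypothetical toy graph: Tom and Mike share a proceedings; Mary does not.
graph = {
    "Tom": ["sigmod03"],
    "Mike": ["sigmod03"],
    "Mary": ["aaai04"],
    "sigmod03": ["Tom", "Mike"],
    "aaai04": ["Mary"],
}
scores = simrank(graph)
```

Here Tom and Mike end up similar because they link to the same proceedings, while Tom and Mary stay at zero because nothing connects their neighborhoods.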

Observation 1: Hierarchical Structures
Hierarchical structures often exist naturally among objects (e.g., taxonomy of animals)
[Figure: a hierarchical structure of products in Walmart; nodes: All, grocery, electronics, apparel, TV, DVD, camera]
[Figure: relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004); axes: Articles, Words]

Observation 2: Distribution of Similarity
[Figure: distribution of SimRank similarities among DBLP authors; x-axis: similarity value, y-axis: portion of entries]
Power-law distribution exists in similarities
  56% of similarity entries are in [0.005, 0.015]
  1.4% of similarity entries are larger than 0.1
Our goal: design a data structure that stores the significant similarities and compresses insignificant ones

Our Data Structure: SimTree
Each leaf node represents an object
Each non-leaf node represents a group of similar lower-level nodes
Path-based node similarity
  sim_p(n7, n8) = s(n7, n4) x s(n4, n5) x s(n5, n8)
  Similarity between two nodes is the average similarity between objects linked with them in other SimTrees
Adjustment ratio for x = (average similarity between x and all other nodes) / (average similarity between x's parent and all other nodes)
[Figure: SimTree with nodes n1 to n9 and similarity values on its edges; labels: similarity between two sibling nodes n1 and n2; adjustment ratio for node n7]
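The path-based similarity can be sketched as a walk up the tree from both leaves until they meet under a common parent, multiplying the stored values along the way. The data layout below (`parent`, `edge_sim`, `sibling_sim`) and all numeric values are hypothetical, and the sketch assumes both leaves sit at the same depth and are distinct.

```python
def path_similarity(a, b, parent, edge_sim, sibling_sim):
    """Multiply similarities along the tree path between two distinct leaves.

    parent[n]: parent of node n
    edge_sim[(child, parent)]: similarity between a node and its parent
    sibling_sim[frozenset((x, y))]: stored similarity of two sibling nodes
    """
    sim = 1.0
    # Climb from both sides until the two nodes are siblings.
    while parent[a] != parent[b]:
        sim *= edge_sim[(a, parent[a])] * edge_sim[(b, parent[b])]
        a, b = parent[a], parent[b]
    return sim * sibling_sim[frozenset((a, b))]


# Hypothetical fragment of a SimTree: n7 under n4, n8 under n5, both under n2.
parent = {"n7": "n4", "n8": "n5", "n4": "n2", "n5": "n2"}
edge_sim = {("n7", "n4"): 0.8, ("n8", "n5"): 0.9}
sibling_sim = {frozenset(("n4", "n5")): 0.3}
sim_78 = path_similarity("n7", "n8", parent, edge_sim, sibling_sim)
```

Only sibling similarities and node-to-parent values are stored, which is what lets the structure avoid the quadratic table that plain SimRank needs.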

LinkClus: SimTree-Based Hierarchical Clustering
Initialize a SimTree for objects of each type
Repeat
  For each SimTree, update the similarities between its nodes using similarities in other SimTrees
    Similarity between two nodes a and b is the average similarity between objects linked with them
  Adjust the structure of each SimTree
    Assign each node to the parent node that it is most similar to

Initialization of SimTrees
Finding tight groups, reduced to frequent pattern mining
  The tightness of a group of nodes is the support of a frequent pattern
  [Figure: groups g1 and g2 over nodes n1 to n4, with the transactions below]
  Transactions:
    1 {n1}
    2 {n1, n2}
    3 {n2}
    4 {n1, n2}
    5 {n1, n2}
    6 {n2, n3, n4}
    7 {n4}
    8 {n3, n4}
    9 {n3, n4}
Initializing a tree:
  Start from leaf nodes (level 0)
  At each level l, find non-overlapping groups of similar nodes with frequent pattern mining

Complexity: LinkClus vs. SimRank
After initialization, iteratively (1) for each SimTree, update the similarities between its nodes using similarities in other SimTrees, and (2) adjust the structure of each SimTree
Computational complexity: for two types of objects, N in each, and M linkages between them

  Step                        Time             Space
  Updating similarities       O(M (log N)^2)   O(M + N)
  Adjusting tree structures   O(N)             O(N)

  Method      Time             Space
  LinkClus    O(M (log N)^2)   O(M + N)
  SimRank     O(M^2)           O(N^2)

Performance Comparison: Experiment Setup
DBLP dataset: 4170 most productive authors, and 154 well-known conferences with most proceedings
  Manually labeled research areas of the 400 most productive authors according to their home pages (or publications)
  Manually labeled areas of the 154 conferences according to their calls for papers
Approaches compared:
  SimRank (Jeh & Widom, KDD 2002)
    Computing pair-wise similarities
  SimRank with FingerPrints (F-SimRank)
    Fogaras & Racz, WWW 2005
    Pre-computes a large sample of random paths from each object and uses samples of two objects to estimate SimRank similarity
  ReCom (Wang et al., SIGIR 2003)
    Iteratively clustering objects using cluster labels of linked objects

DBLP Data Set: Accuracy and Computation Time
[Figures: accuracy vs. #iteration, one panel for Authors and one for Conferences; methods: LinkClus, SimRank, ReCom, F-SimRank]

  Approach     Accr-Author   Accr-Conf   Average time
  LinkClus     0.957         0.723       76.7
  SimRank      0.958         0.760       1020
  ReCom        0.907         0.457       43.1
  F-SimRank    0.908         0.583       83.6

Email Dataset: Accuracy and Time
F. Nielsen. Email dataset. http://www.imm.dtu.dk/rem/data/Email1431.zip
  370 emails on conferences, 272 on jobs, and 789 spam emails
Why is LinkClus even better than SimRank in accuracy?
  Noise filtering due to frequent-pattern-based preprocessing

  Approach     Accuracy   Total time (sec)
  LinkClus     0.8026     1579.6
  SimRank      0.7965     39160
  ReCom        0.5711     74.6
  F-SimRank    0.3688     479.7
  CLARANS      0.4768     8.55


SCAN: Density-Based Network Clustering
Networks made up of the mutual relationships of data elements usually have an underlying structure. Clustering: a structure discovery problem
Given simply information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
Questions to be answered: How many clusters? What size should they be? What is the best partitioning? Should some points be segregated?
SCAN: an interesting density-based algorithm
  X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", Proc. 2007 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD'07), San Jose, CA, Aug. 2007

Social Network and Its Clustering Problem
Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group.
Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups.
Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group.

Structure Similarity
Define Γ(v) as the immediate neighborhood of a vertex v.
The desired features tend to be captured by a measure σ(v, w) called structural similarity:
  $\sigma(v, w) = \dfrac{|\Gamma(v) \cap \Gamma(w)|}{\sqrt{|\Gamma(v)|\,|\Gamma(w)|}}$
Structural similarity is large for members of a clique and small for hubs and outliers.

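Structural Connectivity [1]
ε-Neighborhood: $N_\varepsilon(v) = \{ w \in \Gamma(v) \mid \sigma(v, w) \ge \varepsilon \}$
Core: $CORE_{\varepsilon,\mu}(v) \Leftrightarrow |N_\varepsilon(v)| \ge \mu$
Direct structure reachable: $DirREACH_{\varepsilon,\mu}(v, w) \Leftrightarrow CORE_{\varepsilon,\mu}(v) \wedge w \in N_\varepsilon(v)$
Structure reachable: transitive closure of direct structure reachability
Structure connected: $CONNECT_{\varepsilon,\mu}(v, w) \Leftrightarrow \exists u \in V : REACH_{\varepsilon,\mu}(u, v) \wedge REACH_{\varepsilon,\mu}(u, w)$
[1] M. Ester, H. P. Kriegel, J. Sander, & X. Xu (KDD'96), "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases"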
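Structural similarity, the ε-neighborhood, and the core test can be sketched as follows. One assumption is made explicit: Γ(v) is taken here as the closed neighborhood (the neighbors of v plus v itself), as in the SCAN paper. The function names and the toy graph are hypothetical.

```python
import math


def gamma(adj, v):
    """Closed neighborhood of v: its neighbors plus v itself (assumption)."""
    return set(adj[v]) | {v}


def sigma(adj, v, w):
    """Structural similarity: shared neighborhood size, normalized."""
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))


def eps_neighborhood(adj, v, eps):
    """Members of gamma(v) whose structural similarity to v is at least eps."""
    return {w for w in gamma(adj, v) if sigma(adj, v, w) >= eps}


def is_core(adj, v, eps, mu):
    """A vertex is a core if its eps-neighborhood has at least mu members."""
    return len(eps_neighborhood(adj, v, eps)) >= mu


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
core_flags = {v: is_core(adj, v, eps=0.8, mu=3) for v in adj}
```

In this toy graph the triangle vertices come out as cores, while the pendant vertex does not, matching the intuition that structural similarity is small for outliers.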

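Structure-Connected Clusters
Structure-connected cluster C
  Connectivity: $\forall v, w \in C : CONNECT_{\varepsilon,\mu}(v, w)$
  Maximality: $\forall v, w \in V : v \in C \wedge REACH_{\varepsilon,\mu}(v, w) \Rightarrow w \in C$
Hubs:
  Do not belong to any cluster
  Bridge to many clusters
Outliers:
  Do not belong to any cluster
  Connect to fewer clusters
[Figure: example network with a hub and an outlier marked]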

Algorithm
[Figures: two snapshots of SCAN running on an example network with vertices 0 to 13, μ = 2, ε = 0.7; structural similarities shown include 0.67, 0.75, 0.82 (first snapshot) and 0.51, 0.68 (second snapshot)]

Running Time
Running time: O(|E|)
For sparse networks: O(|V|)

[2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004).


Spectral Clustering
Spectral clustering: find the best cut that partitions the network
  Different criteria to decide "best"
    Min cut, ratio cut, normalized cut, Min-Max cut
Using min cut as an example [Wu et al. 1993]
  Assign each node i an indicator variable
    [Formula not preserved in the extracted slide]
  Represent the cut size using the indicator vector and adjacency matrix
    Cut size = [formula not preserved in the extracted slide]
  Minimize the objective function through solving an eigenvalue system
    Relax the discrete value of q to a continuous value
    Map the continuous value of q into discrete ones to get cluster labels
    Use the eigenvector of the second smallest eigenvalue for q

Modularity-Based Clustering
Modularity-based clustering
  Find the best clustering that maximizes the modularity function
Q-function [Newman et al., 2004]
  Let e_ij be one half of the fraction of edges between group i and group j
    e_ii is the fraction of edges within group i
  Let a_i be the fraction of all ends of edges attached to vertices in group i
  Q is then defined as the sum of differences between within-group edges and expected within-group edges
    $Q = \sum_i \left( e_{ii} - a_i^2 \right)$
Maximize Q
  One possible solution: hierarchically merge the clusters resulting in the greatest increase in the Q function [Newman et al., 2004]
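The Q function can be computed directly from an edge list and a group assignment, as in this sketch for an unweighted, undirected graph. The function name and the toy graph (two triangles joined by a single edge) are hypothetical.

```python
def modularity(edges, group):
    """Modularity Q = sum over groups of (e_ii - a_i^2).

    edges: list of undirected edges (u, v)
    group: dict mapping each vertex to its group label
    """
    m = len(edges)
    within = {}  # e_ii: fraction of edges inside group i
    ends = {}    # a_i: fraction of edge ends attached to group i
    for u, v in edges:
        gu, gv = group[u], group[v]
        # Each edge has two ends; each end contributes 1 / (2m).
        ends[gu] = ends.get(gu, 0.0) + 0.5 / m
        ends[gv] = ends.get(gv, 0.0) + 0.5 / m
        if gu == gv:
            within[gu] = within.get(gu, 0.0) + 1.0 / m
    return sum(within.get(g, 0.0) - ends[g] ** 2 for g in ends)


# Two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {v: "A" for v in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting along the bridge gives Q = 6/7 - 1/2, while putting everything into one group gives Q = 0, so a greedy merge procedure would stop before merging the two triangles.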

Probabilistic Model-Based Clustering
Probabilistic model-based clustering
  Build generative models for links based on hidden cluster labels
  Maximize the log-likelihood of all the links to derive the hidden cluster membership
An example: Airoldi et al., Mixed membership stochastic block models, 2008
  Define a group interaction probability matrix B (K x K)
    B(g, h) denotes the probability of link generation between group g and group h
  Generative model for a link
    For each node, draw a membership probability vector from a Dirichlet prior
    For each pair of nodes, draw cluster labels according to their membership probabilities (supposing g and h), then decide whether to have a link according to probability B(g, h)
  Derive the hidden cluster labels by maximizing the likelihood given B and the prior
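The generative process described above can be simulated in a few lines. This sketch covers only the forward (sampling) direction, not the inference step; the function names, the Dirichlet parameter `alpha`, the matrix `b`, and the seed are hypothetical choices for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_group(membership, rng):
    """Draw one group label according to a membership probability vector."""
    return rng.choices(range(len(membership)), weights=membership, k=1)[0]


def generate_links(n, alpha, b, seed=0):
    """Sample memberships for n nodes, then a 0/1 link for each ordered pair."""
    rng = random.Random(seed)
    memberships = [sample_dirichlet(alpha, rng) for _ in range(n)]
    links = {}
    for p in range(n):
        for q in range(n):
            if p == q:
                continue
            # Each node picks a group for this particular interaction.
            g = sample_group(memberships[p], rng)
            h = sample_group(memberships[q], rng)
            links[(p, q)] = 1 if rng.random() < b[g][h] else 0
    return memberships, links


# Two groups that link mostly within themselves.
b = [[0.9, 0.1],
     [0.1, 0.9]]
memberships, links = generate_links(n=4, alpha=[0.5, 0.5], b=b)
```

Because a node draws a fresh group label for every pair, it can behave as a member of different groups in different interactions, which is the "mixed membership" aspect.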


User-Guided Clustering in DB InfoNet
[Figure: relational schema with the relations below; labels: User hint, Target of clustering]
  Work-In (person, group)
  Professor (name, office, position)
  Open-course (course, semester, instructor)
  Course (course-id, name, area)
  Group (name, area)
  Advise (professor, student, degree)
  Publish (author, title)
  Publication (title, year, conf)
  Student (name, office, position)
  Register (student, course, semester, unit, grade)
A user usually has a goal of clustering, e.g., clustering students by research area
The user specifies the clustering goal to a DB-InfoNet cluster: CrossClus

Classification vs. User-Guided Clustering
User-specified feature (in the form of an attribute) is used as a hint, not as class labels
  The attribute may contain too many or too few distinct values
    E.g., a user may want to cluster students into 20 clusters instead of 3
  Additional features need to be included in cluster analysis
[Figure: user hint over all tuples for clustering]

User-Guided Clustering vs. Semi-supervised Clustering
Semi-supervised clustering [Wagstaff, et al. 01; Xing, et al. 02]
  User provides a training set consisting of "similar" and "dissimilar" pairs of objects
User-guided clustering
  User specifies an attribute as a hint, and more relevant features are found for clustering
[Figure: semi-supervised clustering vs. user-guided clustering, each over all tuples for clustering]

Why Not Typical Semi-Supervised Clustering?
Why not do typical semi-supervised clustering?
  Much information (in multiple relations) is needed to judge whether two tuples are similar
  A user may not be able to provide a good training set
It is much easier for a user to specify an attribute as a hint, such as a student's research area
[Figure: tuples to be compared, (Tom Smith, SC1211, TA) and (Jane Chang, BI205, RA), with the user hint]

CrossClus: An Overview
CrossClus: framework
  Search for good multi-relational features for clustering
  Measure similarity between features based on how they cluster objects into groups
  User guidance + heuristic search for finding pertinent features
  Clustering based on a k-medoids-based algorithm
CrossClus: major advantages
  User guidance, even in a very simple form, plays an important role in multi-relational clustering
  CrossClus finds pertinent features by computing similarities between features

Selection of Multi-Relational Features
A multi-relational feature is defined by:
  A join path. E.g., Student ⋈ Register ⋈ OpenCourse ⋈ Course
  An attribute. E.g., Course.area
  (For a numerical feature) an aggregation operator. E.g., sum or average
Categorical feature f = [Student ⋈ Register ⋈ OpenCourse ⋈ Course, Course.area, null]: areas of courses of each student

  Areas of courses (counts)     Values of feature f
  Tuple   DB   AI   TH          Tuple   DB    AI    TH
  t1      5    5    0           t1      0.5   0.5   0
  t2      0    3    7           t2      0     0.3   0.7
  t3      1    5    4           t3      0.1   0.5   0.4
  t4      5    0    5           t4      0.5   0     0.5
  t5      3    3    4           t5      0.3   0.3   0.4

Numerical feature, e.g., average grades of students
  h = [Student ⋈ Register, Register.grade, average]
  E.g., h(t1) = 3.5

Similarity Between Features
Values of features f and g

          Feature f (course)      Feature g (group)
  Tuple   DB    AI    TH          Info sys   Cog sci   Theory
  t1      0.5   0.5   0           1          0         0
  t2      0     0.3   0.7         0          0         1
  t3      0.1   0.5   0.4         0          0.5       0.5
  t4      0.5   0     0.5         0.5        0         0.5
  t5      0.3   0.3   0.4         0.5        0.5       0

[Figures: surface plots of the tuple-similarity vectors V^f and V^g]
Similarity between two features: cosine similarity of the two vectors
  $Sim(f, g) = \dfrac{V^f \cdot V^g}{|V^f|\,|V^g|}$
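The feature values and the feature-to-feature similarity can be reproduced from the numbers on these slides. One assumption is made: the similarity of two tuples under a categorical feature is taken as the dot product of their value distributions, and V^f collects that similarity over all tuple pairs. The function names are hypothetical.

```python
import math

# Course-area counts per student (DB, AI, TH), copied from the feature slide.
counts = {"t1": [5, 5, 0], "t2": [0, 3, 7], "t3": [1, 5, 4],
          "t4": [5, 0, 5], "t5": [3, 3, 4]}
# Feature g (Info sys, Cog sci, Theory) values, copied from the slide.
g = {"t1": [1, 0, 0], "t2": [0, 0, 1], "t3": [0, 0.5, 0.5],
     "t4": [0.5, 0, 0.5], "t5": [0.5, 0.5, 0]}


def to_distribution(row):
    """Turn raw counts into fractions that sum to 1."""
    total = sum(row)
    return [c / total for c in row]


f = {t: to_distribution(row) for t, row in counts.items()}


def tuple_similarity(feature, a, b):
    """Assumed definition: dot product of the two value distributions."""
    return sum(p * q for p, q in zip(feature[a], feature[b]))


def similarity_vector(feature, tuples):
    """V^feature: similarity of every ordered pair of tuples."""
    return [tuple_similarity(feature, a, b) for a in tuples for b in tuples]


def feature_similarity(f1, f2, tuples):
    """Cosine similarity between the two tuple-similarity vectors."""
    v1, v2 = similarity_vector(f1, tuples), similarity_vector(f2, tuples)
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return dot / norm


tuples = sorted(counts)
sim_fg = feature_similarity(f, g, tuples)
```

Two features score high when they group the same tuples together, even if their value domains (course areas vs. research groups) are entirely different.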

Similarity between Categorical & Numerical Features
$V^h \cdot V^f = 2 \sum_{i=1}^{N} \sum_{j<i} sim_h(t_i, t_j) \cdot sim_f(t_i, t_j)$
$= 2 \sum_{i=1}^{N} \sum_{k=1}^{l} f(t_i).p_k \,(1 - h(t_i)) \sum_{j<i} f(t_j).p_k \;+\; 2 \sum_{i=1}^{N} \sum_{k=1}^{l} f(t_i).p_k \sum_{j<i} h(t_j) \cdot f(t_j).p_k$
  The factors f(t_i).p_k (1 - h(t_i)) and f(t_i).p_k depend only on t_i
  The inner sums over j < i depend on all t_j with j < i
[Figure: objects ordered by feature h (2.7, 2.9, 3.1, 3.3, 3.5, 3.7, 3.9) against feature f (DB, AI, TH), marking the parts depending on t_i and the parts depending on all t_j with j < i]

Searching for Pertinent Features
Different features convey different aspects of information
  Research area: research group area, conferences of papers, advisor
  Academic performances: GPA, GRE score, number of papers
  Demographic info: permanent address, nationality
Features conveying the same aspect of information usually cluster objects in more similar ways
  Research group areas vs. conferences of publications
Given a user-specified feature
  Find pertinent features by computing feature similarity

Heuristic Search for Pertinent Features
Overall procedure
  1. Start from the user-specified feature
  2. Search in the neighborhood of existing pertinent features
  3. Expand the search range gradually
[Figure: relational schema (Work-In, Professor, Open-course, Course, Group, Advise, Publish, Publication, Student, Register) with the user hint and the target of clustering marked]
Tuple ID propagation [Yin, et al. 04] is used to create multi-relational features
  IDs of target tuples can be propagated along any join path, from which we can find tuples joinable with each target tuple

Clustering with Multi-Relational Feature
Given a set of L pertinent features f_1, ..., f_L, the similarity between two objects is
  $sim(t_1, t_2) = \sum_{i=1}^{L} sim_{f_i}(t_1, t_2) \cdot f_i.weight$
  The weight of a feature is determined in feature search by its similarity with other pertinent features
For clustering, we use CLARANS, a scalable k-medoids algorithm [Ng & Han 94]

Experiments: Compare CrossClus with Existing Methods
Baseline: only use the user-specified feature
PROCLUS [Aggarwal, et al. 99]: a state-of-the-art subspace clustering algorithm
  Uses a subset of features for each cluster
  We convert the relational database to a table by propositionalization
  The user-specified feature is forced to be used in every cluster
RDBC [Kirsten and Wrobel 00]
  A representative ILP clustering algorithm
  Uses neighbor information of objects for clustering
  The user-specified feature is forced to be used

Measuring Clustering Accuracy
To verify that CrossClus captures the user's clustering goal, we define the accuracy of clustering
Given a clustering task
  Manually find all features that contain information directly related to the clustering task: the standard feature set
  E.g., Clustering students by research areas
    Standard feature set: research group, group areas, advisors, conferences of publications, course areas
Accuracy of a clustering result: how similar it is to the clustering generated by the standard feature set

\[ \deg(C \subset C') = \frac{\sum_{i=1}^{n} \max_{1 \le j \le n'} |c_i \cap c'_j|}{\sum_{i=1}^{n} |c_i|} \]

\[ \mathrm{sim}(C, C') = \frac{\deg(C \subset C') + \deg(C' \subset C)}{2} \]
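A minimal sketch of this accuracy measure, assuming each clustering is given as a list of Python sets of object IDs. The function names and the toy clusterings are illustrative only.

```python
def deg(C, C_prime):
    """Degree to which clustering C is contained in clustering C_prime."""
    # For every cluster c in C, take its largest overlap with any cluster of C_prime.
    overlap = sum(max(len(c & c2) for c2 in C_prime) for c in C)
    return overlap / sum(len(c) for c in C)


def clustering_similarity(C, C_prime):
    """Symmetric similarity: average of the two containment degrees."""
    return (deg(C, C_prime) + deg(C_prime, C)) / 2


# Toy example: the same six objects clustered in two different ways.
result = [{1, 2, 3}, {4, 5, 6}]
standard = [{1, 2}, {3, 4}, {5, 6}]
accuracy = clustering_similarity(result, standard)
```

Here deg(result, standard) is 4/6 and deg(standard, result) is 5/6, so the similarity is 0.75. Identical clusterings score 1.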

Clustering Professors: CS Dept Dataset
[Figure: bar chart "Clustering Accuracy - CS Dept" (accuracy from 0 to 1) for the feature sets Group, Course, and Group+Course, comparing CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, and RDBC]
(Theory): J. Erickson, S. Har-Peled, L. Pitt, E. Ramos, D. Roth, M. Viswanathan
(Graphics): J. Hart, M. Garland, Y. Yu
(Database): K. Chang, A. Doan, J. Han, M. Winslett, C. Zhai
(Numerical computing): M. Heath, T. Kerkhoven, E. de Sturler
(Networking & QoS): R. Kravets, M. Caccamo, J. Hou, L. Sha
(Artificial Intelligence): G. Dejong, M. Harandi, J. Ponce, L. Rendell
(Architecture): D. Padua, J. Torrellas, C. Zilles, S. Adve, M. Snir, D. Reed, V. Adve
(Operating Systems): D. Mickunas, R. Campbell, Y. Zhou

DBLP Dataset
[Figure: bar chart "Clustering Accuracy - DBLP" (accuracy from 0 to 1) for the feature sets Conf, Word, Coauthor, Conf+Word, Conf+Coauthor, Word+Coauthor, and All three, comparing CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, and RDBC]


Classification of Information Networks
Classification of Heterogeneous Information Networks:
  Graph-regularization-Based Method (GNetMine)
  Multi-Relational-Mining-Based Method (CrossMine)
  Statistical Relational Learning-Based Method (SRL)
Classification of Homogeneous Information Networks

Why Classifying Heterogeneous InfoNet?
Sometimes, we do have prior knowledge for part of the nodes/objects!
Input: Heterogeneous information network structure + class labels for some objects/nodes
Goal: Classify the heterogeneous networked data into classes, each of which is composed of multi-typed data objects sharing a common topic.
  Natural generalization of classification on homogeneous networked data
Email network + several suspicious users/words/emails
  Find out the terrorists, their emails and frequently used words! Class: terrorism
Military network + which military camp several soldiers/commanders belong to
  Find out the soldiers and commanders belonging to that camp! Class: military camp

Classification: Knowledge Propagation
[Figure label: Classifier]

GNetMine: Methodology
Classification of networked data can be essentially viewed as a process of knowledge propagation, where information is propagated from labeled objects to unlabeled ones through links until a stationary state is achieved.
A novel graph-based regularization framework to address the classification problem on heterogeneous information networks.
  Respect the link type differences by preserving consistency over each relation graph corresponding to each type of links separately
Mathematical intuition: Consistency assumption
  The confidence (f) of two objects (x_ip and x_jq) belonging to class k should be similar if x_ip ↔ x_jq (R_ij,pq > 0)
  f should be similar to the given ground truth

GNetMine: Graph-Based Regularization
Minimize the objective function

\[ J(f_1^{(k)}, \ldots, f_m^{(k)}) = \sum_{i,j=1}^{m} \lambda_{ij} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} R_{ij,pq} \left( \frac{1}{\sqrt{D_{ij,pp}}} f_{ip}^{(k)} - \frac{1}{\sqrt{D_{ji,qq}}} f_{jq}^{(k)} \right)^2 + \sum_{i=1}^{m} \alpha_i (f_i^{(k)} - y_i^{(k)})^T (f_i^{(k)} - y_i^{(k)}) \]

User preference (\lambda_{ij}, \alpha_i): how much do you value this relationship / ground truth?
Smoothness constraints: objects linked together should share similar estimations of confidence belonging to class k
Normalization term applied to each type of link separately: reduce the impact of popularity of nodes
Confidence estimation on labeled data and their pre-given labels should be similar
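To make the two parts of the objective concrete, the sketch below evaluates the smoothness term for a single relation matrix and the fitting term for a single object type. It assumes that D_ij,pp is the p-th row sum of R_ij and D_ji,qq the q-th column sum, which the slide does not spell out. The function names and the toy relation are hypothetical.

```python
from math import sqrt


def smoothness(R, f_i, f_j):
    """Smoothness term for one relation matrix R (n_i x n_j) between types i and j."""
    row_deg = [sum(row) for row in R]        # assumed D_ij,pp: p-th row sum of R
    col_deg = [sum(col) for col in zip(*R)]  # assumed D_ji,qq: q-th column sum of R
    total = 0.0
    for p, row in enumerate(R):
        for q, weight in enumerate(row):
            if weight > 0:
                diff = f_i[p] / sqrt(row_deg[p]) - f_j[q] / sqrt(col_deg[q])
                total += weight * diff ** 2
    return total


def fit(f, y):
    """Squared distance between confidence estimates f and the given labels y."""
    return sum((a - b) ** 2 for a, b in zip(f, y))


# Toy paper-conference relation: paper 0 appears in conf 0, paper 1 in conf 1.
R_paper_conf = [[1, 0], [0, 1]]
consistent = smoothness(R_paper_conf, [1.0, 0.0], [1.0, 0.0])    # linked objects agree
inconsistent = smoothness(R_paper_conf, [1.0, 0.0], [0.0, 1.0])  # linked objects disagree
```

The full objective would weight each relation's smoothness term by its lambda and each type's fitting term by its alpha, then sum them. Estimates that agree across links give a smaller value than estimates that disagree.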

Experiments on DBLP
Class: Four research areas (communities)
  Database, data mining, AI, information retrieval
Four types of objects
  Paper (14376), Conf. (20), Author (14475), Term (8920)
Three types of relations
  Paper-conf., paper-author, paper-term
Algorithms for comparison
  Learning with Local and Global Consistency (LLGC) [Zhou et al. NIPS 2003], also the homogeneous version of our method
  Weighted-vote Relational Neighbor classifier (wvRN) [Macskassy et al. JMLR 2007]
  Network-only Link-based Classification (nLB) [Lu et al. ICML 2003, Macskassy et al. JMLR 2007]

Classification Accuracy: Labeling a Very Small Portion of Authors and Papers

Comparison of classification accuracy on authors (%)
(a%, p%)       nLB     nLB         wvRN    wvRN        LLGC    LLGC        GNetMine
               (A-A)   (A-C-P-T)   (A-A)   (A-C-P-T)   (A-A)   (A-C-P-T)   (A-C-P-T)
(0.1%, 0.1%)   25.4    26.0        40.8    34.1        41.4    61.3        82.9
(0.2%, 0.2%)   28.3    26.0        46.0    41.2        44.7    62.2        83.4
(0.3%, 0.3%)   28.4    27.4        48.6    42.5        48.8    65.7        86.7
(0.4%, 0.4%)   30.7    26.7        46.3    45.6        48.7    66.0        87.2
(0.5%, 0.5%)   29.8    27.3        49.0    51.4        50.6    68.9        87.5

Comparison of classification accuracy on papers (%)
(a%, p%)       nLB     nLB         wvRN    wvRN        LLGC    LLGC        GNetMine
               (P-P)   (A-C-P-T)   (P-P)   (A-C-P-T)   (P-P)   (A-C-P-T)   (A-C-P-T)
(0.1%, 0.1%)   49.8    31.5        62.0    42.0        67.2    62.7        79.2
(0.2%, 0.2%)   73.1    40.3        71.7    49.7        72.8    65.5        83.5
(0.3%, 0.3%)   77.9    35.4        77.9    54.3        76.8    66.6        83.2
(0.4%, 0.4%)   79.1    38.6        78.1    54.4        77.9    70.5        83.7
(0.5%, 0.5%)   80.7    39.3        77.9    53.5        79.0    73.5        84.1

Comparison of classification accuracy on conferences (%)
(a%, p%)       nLB         wvRN        LLGC        GNetMine
               (A-C-P-T)   (A-C-P-T)   (A-C-P-T)   (A-C-P-T)
(0.1%, 0.1%)   25.5        43.5        79.0        81.0
(0.2%, 0.2%)   22.5        56.0        83.5        85.0
(0.3%, 0.3%)   25.0        59.0        87.0        87.0
(0.4%, 0.4%)   25.0        57.0        86.5        89.5
(0.5%, 0.5%)   25.0        68.0        90.0        94.0
Knowledge Propagation: List Objects with the Highest Confidence Measure Belonging to Each Class

Top-5 terms related to each area
No.   Database   Data Mining      Artificial Intelligence   Information Retrieval
1     data       mining           learning                  retrieval
2     database   data             knowledge                 information
3     query      clustering       Reinforcement             web
4     system     learning         reasoning                 search
5     xml        classification   model                     document

Top-5 authors concentrated in each area
No.   Database              Data Mining          Artificial Intelligence   Information Retrieval
1     Surajit Chaudhuri     Jiawei Han           Sridhar Mahadevan         W. Bruce Croft
2     H. V. Jagadish        Philip S. Yu         Takeo Kanade              Iadh Ounis
3     Michael J. Carey      Christos Faloutsos   Andrew W. Moore           Mark Sanderson
4     Michael Stonebraker   Wei Wang             Satinder P. Singh         ChengXiang Zhai
5     C. Mohan              Shusaku Tsumoto      Thomas S. Huang           Gerard Salton

Top-5 conferences concentrated in each area
No.   Database   Data Mining   Artificial Intelligence   Information Retrieval
1     VLDB       KDD           IJCAI                     SIGIR
2     SIGMOD     SDM           AAAI                      ECIR
3     PODS       PAKDD         CVPR                      WWW
4     ICDE       ICDM          ICML                      WSDM
5     EDBT       PKDD          ECML                      CIKM


Multi-Relation to Flat Relation Mining?
Folding multiple relations into a single "flat" one for mining?
[Figure: the relations Doctor, Patient, and Contact are flattened into one table]
Cannot be a solution due to problems:
  Lose information of linkages and relationships, no semantics preservation
  Cannot utilize information of database structures or schemas (e.g., E-R modeling)

One Approach: Inductive Logic Programming (ILP)
Find a hypothesis that is consistent with background knowledge (training data)
  FOIL, Golem, Progol, TILDE, ...
Background knowledge
  Relations (predicates), tuples (ground facts)
Inductive Logic Programming (ILP)
  Hypothesis: The hypothesis is usually a set of rules, which can predict certain attributes in certain relations
  Daughter(X, Y) ← female(X), parent(Y, X)
Training examples
  Daughter(mary, ann)   +
  Daughter(eve, tom)    +
  Daughter(tom, ann)    -
  Daughter(eve, ann)    -
Background knowledge
  Parent(ann, mary)   Female(ann)
  Parent(ann, tom)    Female(mary)
  Parent(tom, eve)    Female(eve)
  Parent(tom, ian)
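As a quick check that the hypothesis is consistent with the data, the sketch below encodes the background knowledge as Python sets and evaluates the rule on the four training examples. It reads the first two examples as positive and the last two as negative, which is an assumption about the slide's labels.

```python
# Background knowledge as ground facts.
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}


def daughter(x, y):
    """Hypothesis: Daughter(X, Y) <- female(X), parent(Y, X)."""
    return x in female and (y, x) in parent


# Training examples with their assumed labels (True = positive example).
examples = {
    ("mary", "ann"): True,
    ("eve", "tom"): True,
    ("tom", "ann"): False,
    ("eve", "ann"): False,
}

# The rule is consistent if it covers every positive and no negative example.
consistent = all(daughter(x, y) == label for (x, y), label in examples.items())
```

An ILP system searches for such a rule automatically. This snippet only verifies one candidate.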

Inductive Logic Programming Approach to Multi-Relation Classification
ILP Approaches to Multi-Relation Classification
  Top-down Approaches (e.g., FOIL)
    while (enough examples left)
      generate a rule
      remove examples satisfying this rule
  Bottom-up Approaches (e.g., Golem)
    Use each example as a rule
    Generalize rules by merging rules
  Decision Tree Approaches (e.g., TILDE)
ILP Approach: Pros and Cons
  Advantages: Expressive and powerful, and rules are understandable
  Disadvantages: Inefficient for databases with complex schemas, and inappropriate for continuous attributes

FOIL: First-Order Inductive Learner (Rule Generation)
Find a set of rules consistent with training data
A top-down, sequential covering learner
Build each rule by heuristics
  Foil gain: a special type of information gain
[Figure: all examples, divided into positive and negative examples, with the examples covered by Rule 1, Rule 2, and Rule 3 marked. A rule grows one predicate at a time: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5]
To generate a rule
  while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break

Find the Best Predicate: Predicate Evaluation
All predicates in a relation can be evaluated based on propagated IDs
Use foil-gain to evaluate predicates
  Suppose the current rule is r. For a predicate p,

\[ \text{foil-gain}(p) = P(r+p) \left[ -\log \frac{P(r)}{P(r) + N(r)} + \log \frac{P(r+p)}{P(r+p) + N(r+p)} \right] \]

Categorical Attributes
  Compute foil-gain directly
Numerical Attributes
  Discretize with every possible value
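A small sketch of the foil-gain computation, assuming P(.) and N(.) are the numbers of positive and negative target tuples satisfying a rule and that the logarithm is base 2 (the slide does not fix the base). The counts in the example are illustrative.

```python
from math import log2


def foil_gain(pos_r, neg_r, pos_rp, neg_rp):
    """Foil gain of adding predicate p to rule r.

    pos_r, neg_r:   positive/negative examples satisfying r
    pos_rp, neg_rp: positive/negative examples satisfying r + p
    """
    if pos_rp == 0:
        return 0.0  # a predicate covering no positive example gains nothing
    before = log2(pos_r / (pos_r + neg_r))
    after = log2(pos_rp / (pos_rp + neg_rp))
    return pos_rp * (after - before)


# Current rule covers 2 positive and 2 negative examples.
gain_mixed = foil_gain(2, 2, 2, 1)  # predicate keeps 2 positives, 1 negative
gain_pure = foil_gain(2, 2, 2, 0)   # predicate keeps 2 positives, 0 negatives
```

The purer predicate gets the higher gain (2.0 versus about 0.83), so a top-down learner would add it to the rule first.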

Loan Applications: Backend Database
[Figure: schema of the loan database]
  Loan (target relation): loan-id, account-id, date, amount, duration, payment
    Each tuple has a class label, indicating whether a loan is paid on time.
  Account: account-id, district-id, frequency, date
  District: district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96
  Card: card-id, disp-id, type, issue-date
  Transaction: trans-id, account-id, date, type, operation, amount, balance, symbol
  Disposition: disp-id, account-id, client-id
  Order: order-id, account-id, bank-to, account-to, amount, type
  Client: client-id, birth-date, gender, district-id
How to make decisions on loan applications?

CrossMine: An Effective Multi-relational Classifier
Methodology
  Tuple-ID propagation: an efficient and flexible method for virtually joining relations
  Confine the rule search process in promising directions
  Look-one-ahead: a more powerful search strategy
  Negative tuple sampling: improve efficiency while maintaining accuracy

Tuple ID Propagation
Loan (target relation; Applicants #1 to #4)
Loan ID   Account ID   Amount   Duration   Decision
1         124          1000     12         Yes
2         124          4000     12         Yes
3         108          10000    24         No
4         45           12000    36         No

Account
Account ID   Frequency   Open date   Propagated ID   Labels
124          monthly     02/27/93    1, 2            2+, 0-
108          weekly      09/23/97    3               0+, 1-
45           monthly     12/09/96    4               0+, 1-
67           weekly      01/01/97    Null            0+, 0-

Possible predicates:
  Frequency = 'monthly': 2+, 1-
  Open date < 01/01/95: 2+, 0-
Propagate tuple IDs of target relation to non-target relations
Virtually join relations to avoid the high cost of physical joins
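The propagation in this example can be sketched as follows, using the loan and account tuples from the slide. The function names and data layout are illustrative and are not CrossMine's actual implementation.

```python
# Target relation: (loan_id, account_id, paid_on_time)
loans = [(1, 124, True), (2, 124, True), (3, 108, False), (4, 45, False)]
# Non-target relation: (account_id, frequency)
accounts = [(124, "monthly"), (108, "weekly"), (45, "monthly"), (67, "weekly")]


def propagate_ids(loans, accounts):
    """Attach to each account the IDs of the loans joinable with it."""
    ids = {account_id: [] for account_id, _ in accounts}
    for loan_id, account_id, _ in loans:
        if account_id in ids:
            ids[account_id].append(loan_id)
    return ids


def label_counts(loan_ids, loans):
    """Count (positive, negative) class labels among the given loan IDs."""
    labels = {loan_id: paid for loan_id, _, paid in loans}
    positives = sum(1 for loan_id in loan_ids if labels[loan_id])
    return positives, len(loan_ids) - positives


propagated = propagate_ids(loans, accounts)

# Evaluate the predicate Frequency = 'monthly' without a physical join:
# collect the IDs propagated to every account that satisfies it.
monthly_ids = sorted(
    {i for acc, freq in accounts if freq == "monthly" for i in propagated[acc]}
)
monthly_counts = label_counts(monthly_ids, loans)
```

Only the small ID lists travel along the join, which is why the approach stays cheap in time and space.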

Tuple ID Propagation (Idea Outlined)
Efficient
  Only propagate the tuple IDs
  Time and space usage is low
Flexible
  Can propagate IDs among non-target relations
  Many sets of IDs can be kept on one relation, which are propagated from different join paths
[Figure: the target relation linked to relations R1, R2, and R3]

Rule Generation: Example
[Figure: the loan database schema (Loan, Account, District, Card, Transaction, Disposition, Order, Client), annotated with "Target relation", "First predicate", and "Second predicate"]
Rule Generation
  Start at the target relation
  Repeat
    Search in all active relations
    Search in all relations joinable to active relations
    Add the best predicate to the current rule
    Set the involved relation to active
  Until
    The best predicate does not have enough gain
    Current rule is too long

Look-one-ahead in Rule Generation
Two types of relations: Entity and Relationship
Often cannot find useful predicates on relations of relationship
[Figure: a relationship relation joined to the target relation, marked "No good predicate"]
Solution of CrossMine:
  When propagating IDs to a relation of relationship, propagate one more step to the next relation of entity

Negative Tuple Sampling
Each time a rule is generated, covered positive examples are removed
After generating many rules, there are far fewer positive examples than negative ones
  Cannot build good rules (low support)
  Still time consuming (large number of negative examples)
Solution: Sampling on negative examples
  Improve efficiency without affecting rule quality
[Figure: scatter plots of positive (+) and negative examples]

Real Dataset
PKDD Cup 99 dataset: Loan Application
            Accuracy   Time (per fold)
FOIL        74.0%      3338 sec
TILDE       81.3%      2429 sec
CrossMine   90.7%      15.3 sec

Mutagenesis dataset (4 relations): Only 4 relations, so TILDE does a good job, though slow
            Accuracy   Time (per fold)
FOIL        79.7%      1.65 sec
TILDE       89.4%      25.6 sec
CrossMine   87.7%      0.83 sec


Probabilistic Relational Models in Statistical Relational Learning
Goal: model the distribution of data in relational databases
  Treat both entities and relations as classes
  Intuition: objects are no longer independent of each other
  Build statistical networks according to the dependency relationships between attributes of different classes
A Probabilistic Relational Model (PRM) consists of
  Relational schema (from databases)
  Dependency structure (between attributes)
  Local probability model (conditional probability distribution)
Three major methods of probabilistic relational models
  Relational Bayesian Network (RBN, Lise Getoor et al.)
  Relational Markov Networks (RMN, Ben Taskar et al.)
  Relational Dependency Networks (RDN, Jennifer Neville et al.)

Relational Bayesian Networks (RBN)
Extend Bayesian network to consider entities, properties and relationships in a DB scenario
Three different uncertainties
  Attribute uncertainty
    Model conditional probability for an attribute given its parent variables
  Structural uncertainty
    Model conditional probability for a reference or link existence given its parent variables
  Class uncertainty
    Refine the conditional probability by considering sub-classes or hierarchy of classes

Rel. Markov Networks & Rel. Dependency Networks
Similar ideas to Relational Bayesian Networks
Relational Markov Networks
  Extend from Markov Network
  Undirected link to model dependency relation instead of directed links as in Bayesian networks
Relational Dependency Networks
  Extend from Dependency Network
  Undirected link to model dependency relation
  Use pseudo-likelihood instead of exact likelihood
    Efficient in learning


Transductive Learning in the Graph
Problem: for a set of nodes in the graph, class labels are given for only part of the nodes; the task is to learn the labels of the unlabeled nodes
Methods
  Label propagation algorithm [Zhu et al. 2002, Zhou et al. 2004, Szummer et al. 2001]
    Iteratively propagate each node's labels to its neighbors, according to the transition probability defined by the network
  Graph regularization-based algorithm [Zhou et al. 2004]
    Intuition: trade-off between (1) consistency with the labeled data and (2) consistency between linked objects
    A quadratic optimization problem
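The sketch below illustrates label propagation on a small homogeneous graph, using the commonly cited iteration F <- alpha * S * F + (1 - alpha) * Y with S = D^(-1/2) W D^(-1/2). The graph, the value of alpha, and the iteration count are assumptions chosen for the example.

```python
from math import sqrt


def propagate_labels(W, Y, alpha=0.9, iterations=200):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y and return the predicted class per node.

    W: symmetric non-negative adjacency matrix (list of lists)
    Y: initial label matrix, Y[i][c] = 1 if node i is labeled with class c, else 0
    """
    n, k = len(W), len(Y[0])
    deg = [sum(row) for row in W]
    # Symmetrically normalized adjacency matrix S = D^(-1/2) W D^(-1/2).
    S = [[W[i][j] / sqrt(deg[i] * deg[j]) if W[i][j] else 0.0 for j in range(n)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iterations):
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n)) + (1 - alpha) * Y[i][c]
              for c in range(k)]
             for i in range(n)]
    return [max(range(k), key=lambda c: F[i][c]) for i in range(n)]


# Path graph 0 - 1 - 2 - 3; node 0 is labeled class 0, node 3 is labeled class 1.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
Y = [[1, 0], [0, 0], [0, 0], [0, 1]]
labels = propagate_labels(W, Y)
```

Each unlabeled node ends up with the class of the labeled node it is closer to, which is the intended effect of trading off label fit against smoothness over links.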


Data Cleaning by Link Analysis
Object reconciliation vs. object distinction as data cleaning tasks
Link analysis may take advantage of redundancy and facilitate entity cross-checking and validation
Object distinction: Different people/objects do share names
  In AllMusic.com, 72 songs and 3 albums are named "Forgotten" or "The Forgotten"
  In DBLP, 141 papers are written by at least 14 different "Wei Wang"s
New challenges of object distinction:
  Textual similarity cannot be used
DISTINCT: Object distinction by information network analysis
  X. Yin, J. Han, and P. S. Yu, "Object Distinction: Distinguishing Objects with Identical Names by Link Analysis", ICDE'07

Entity Distinction: The "Wei Wang" Challenge in DBLP
[Figure: DBLP references by four different authors named Wei Wang, connected through shared coauthors (e.g., Jiong Yang, Hongjun Lu, Jian Pei, Aidong Zhang). Sample references per identity:
  (1) Wei Wang, Jiong Yang, Richard Muntz, VLDB 1997
  (2) Wei Wang, Haifeng Jiang, Hongjun Lu, Jeffrey Yu, VLDB 2004
  (3) Wei Wang, Jian Pei, Jiawei Han, CIKM 2002
  (4) Aidong Zhang, Yuqing Song, Wei Wang, WWW 2003]
(1) Wei Wang at UNC
(2) Wei Wang at UNSW, Australia
(3) Wei Wang at Fudan Univ., China
(4) Wei Wang at SUNY Buffalo

The DISTINCT Methodology
Measure similarity between references
  Link-based similarity: Linkages between references
    References to the same object are more likely to be connected (using random walk probability)
  Neighborhood similarity
    Neighbor tuples of each reference can indicate similarity between their contexts
Self-boosting: Training using the "same" bulky data set
Reference-based clustering
  Group references according to their similarities

Training with the "Same" Data Set
Build a training set automatically
  Select distinct names, e.g., Johannes Gehrke
  The collaboration behavior within the same community shares some similarity
  Train parameters using a typical and large set of "unambiguous" examples
Use SVM to learn a model for combining different join paths
  Each join path is used as two attributes (with link-based similarity and neighborhood similarity)
  The model is a weighted sum of all attributes

Clustering: Measure Similarity between Clusters
Single-link (highest similarity between points in two clusters)?
  No, because references to different objects can be connected.
Complete-link (minimum similarity between them)?
  No, because references to the same object may be weakly connected.
Average-link (average similarity between points in two clusters)?
  A better measure
Refinement: Average neighborhood similarity and collective random walk probability

Real Cases: DBLP Popular Names
Name                 Num_authors   Num_refs   accuracy   precision   recall   f-measure
Hui Fang             3             9          1.0        1.0         1.0      1.0
Ajay Gupta           4             16         1.0        1.0         1.0      1.0
Joseph Hellerstein   2             151        0.81       1.0         0.81     0.895
Rakesh Kumar         2             36         1.0        1.0         1.0      1.0
Michael Wagner       5             29         0.395      1.0         0.395    0.566
Bing Liu             6             89         0.825      1.0         0.825    0.904
Jim Smith            3             19         0.829      0.888       0.926    0.906
Lei Wang             13            55         0.863      0.92        0.932    0.926
Wei Wang             14            141        0.716      0.855       0.814    0.834
Bin Yu               5             44         0.658      1.0         0.658    0.794
Average                                       0.81       0.966       0.836    0.883

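Distinguishing Different "Wei Wang"s
[Figure: clusters of references found for different Wei Wangs, with the number of references in each: UNC-CH (57), Fudan U, China (31), UNSW, Australia (19), SUNY Buffalo (5), NU Singapore (5), Harbin U, China (5), Zhejiang U, China (3), Nanjing Normal, China (3), Beijing Polytech (3), SUNY Binghamton (2), Ningbo Tech, China (2), Purdue (2), Chongqing U, China (2), Beijing U Com, China (2). Two links between clusters are labeled 5 and 2.]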


Truth Validation by Info. Network Analysis
The trustworthiness problem of the web (according to a survey):
  54% of Internet users trust news web sites most of the time
  26% for web sites that sell products
  12% for blogs
TruthFinder: Truth discovery on the Web by link analysis
  Among multiple conflicting results, can we automatically identify which one is likely the true fact?
Veracity (conformity to truth):
  Given a large amount of conflicting information about many objects, provided by multiple web sites (or other information providers), how to discover the true fact about each object?
Our work: Xiaoxin Yin, Jiawei Han, Philip S. Yu, "Truth Discovery with Multiple Conflicting Information Providers on the Web", TKDE'08

Conflicting Information on the Web
Different web sites often provide conflicting info. on a subject, e.g., Authors of "Rapid Contextual Design"
Online Store       Authors
Powell's books     Holtzblatt, Karen
Barnes & Noble     Karen Holtzblatt, Jessamyn Wendell, Shelley Wood
A1 Books           Karen Holtzblatt, Jessamyn Burns Wendell, Shelley Wood
Cornwall books     Holtzblatt-Karen, Wendell-Jessamyn Burns, Wood
Mellon's books     Wendell, Jessamyn
Lakeside books     WENDELL, JESSAMYNHOLTZBLATT, KARENWOOD, SHELLEY
Blackwell online   Wendell, Jessamyn, Holtzblatt, Karen, Wood, Shelley

Our Setting: Info. Network Analysis
Each object has a set of conflicting facts
  E.g., different author names for a book
And each web site provides some facts
How to find the true fact for each object?
[Figure: network linking web sites w1 to w4 to the facts f1 to f5 they provide, and the facts to the objects o1 and o2 they describe]

Basic Heuristics for Problem Solving
1. There is usually only one true fact for a property of an object
2. This true fact appears to be the same or similar on different web sites
   E.g., "Jennifer Widom" vs. "J. Widom"
3. The false facts on different web sites are less likely to be the same or similar
   False facts are often introduced by random factors
4. A web site that provides mostly true facts for many objects will likely provide true facts for other objects

Overview of the TruthFinder Method
Confidence of facts ↔ Trustworthiness of web sites
  A fact has high confidence if it is provided by (many) trustworthy web sites
  A web site is trustworthy if it provides many facts with high confidence
The TruthFinder mechanism, an overview:
  Initially, each web site is equally trustworthy
  Based on the above four heuristics, infer fact confidence from web site trustworthiness, and then backwards
  Repeat until achieving stable state

Analogy to Authority-Hub Analysis
Facts ↔ Authorities, Web sites ↔ Hubs
[Figure: a web site w1 (hub, high trustworthiness) linked to a fact f1 (authority, high confidence)]
Difference from authority-hub analysis
  Linear summation cannot be used
    A web site is trustable if it provides accurate facts, instead of many facts
    Confidence is the probability of being true
  Different facts of the same object influence each other

Inference on Trustworthiness
Inference of web site trustworthiness & fact confidence
[Figure: web sites w1 to w4 linked to facts f1 to f4, which describe objects o1 and o2]
True facts and trustable web sites will become apparent after some iterations

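Computational Model: t(w) and s(f)
The trustworthiness of a web site w: t(w)
  Average confidence of facts it provides

\[ t(w) = \frac{\sum_{f \in F(w)} s(f)}{|F(w)|} \]

  The numerator is the sum of fact confidence; F(w) is the set of facts provided by w
The confidence of a fact f: s(f)
  One minus the probability that all web sites providing f are wrong

\[ s(f) = 1 - \prod_{w \in W(f)} (1 - t(w)) \]

  1 - t(w) is the probability that w is wrong; W(f) is the set of web sites providing f
[Figure: web sites w1 and w2 with trustworthiness t(w1) and t(w2), linked to fact f1 with confidence s(f1)]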
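A minimal sketch of the mutual update between t(w) and s(f), using only the two basic formulas above. It deliberately leaves out the influence between conflicting facts, and the initial trust of 0.5, the number of rounds, and the site and fact names are assumptions made for illustration.

```python
def truthfinder_basic(provides, rounds=5, initial_trust=0.5):
    """Alternate between fact confidence s(f) and site trustworthiness t(w).

    provides: dict mapping each web site to the non-empty list of facts it provides
    """
    # Invert the mapping once: which sites provide each fact.
    sites_for = {}
    for w, facts in provides.items():
        for f in facts:
            sites_for.setdefault(f, []).append(w)

    trust = {w: initial_trust for w in provides}
    confidence = {}
    for _ in range(rounds):
        # s(f) = 1 - product over providing sites of (1 - t(w))
        for f, sites in sites_for.items():
            all_wrong = 1.0
            for w in sites:
                all_wrong *= 1.0 - trust[w]
            confidence[f] = 1.0 - all_wrong
        # t(w) = average confidence of the facts w provides
        trust = {w: sum(confidence[f] for f in facts) / len(facts)
                 for w, facts in provides.items()}
    return trust, confidence


# Two sites agree on fact_a, one site claims the conflicting fact_b.
provides = {"w1": ["fact_a"], "w2": ["fact_a"], "w3": ["fact_b"]}
trust_1, conf_1 = truthfinder_basic(provides, rounds=1)
trust_5, conf_5 = truthfinder_basic(provides, rounds=5)
```

After one round fact_a has confidence 0.75 and fact_b 0.5, and the gap widens in later rounds because the sites backing fact_a gain trust.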

Experiments: Finding Truth of Facts
Determining authors of books
  Dataset contains 1265 books listed on abebooks.com
  We analyze 100 random books (using book images)

Case                       Voting   TruthFinder   Barnes & Noble
Correct                    71       85            64
Miss author(s)             12       2             4
Incomplete names           18       5             6
Wrong first/middle names   1        1             3
Has redundant names        0        2             23
Add incorrect names        1        5             5
No information             0        0             2

Experiments: Trustable Info Providers
Finding trustworthy information sources
  Most trustworthy bookstores found by TruthFinder vs. top-ranked bookstores by Google (query "bookstore")

TruthFinder
Bookstore           trustworthiness   #book   Accuracy
TheSaintBookstore   0.971             28      0.959
MildredsBooks       0.969             10      1.0
Alphacraze.com      0.968             13      0.947

Google
Bookstore        Google rank   #book   Accuracy
Barnes & Noble   1             97      0.865
Powell's books   3             42      0.654


Similarity Search in Information Networks
Structural similarity vs. semantic similarity
  Structural similarity: Based on structural/isomorphic similarity of subgraph/subnetwork structures
  Semantic similarity: influenced by similar network structures
Graph-structure-based indexing and similarity search
  Structure-based indexing, e.g., gIndex, SPath, ...
  Use index to search for similar graph/network structures
Substructure indexing methods:
  Key problem: What substructures are good indexing features?
  gIndex [Yan, Yu & Han, SIGMOD'04]: Find frequent and discriminative subgraphs (by graph pattern mining)
  SPath [Zhao & Han, VLDB'10]: Use decomposed shortest paths as basic indexing features

Why SPath as Indexing Features?
Shortest paths as neighborhood signatures of vertices (indexing features): scalable, and prune the search space effectively
Processing (by query decomposition): Decompose the query graph into a set of indexed shortest paths in SPath
[Figure: a network and a query graph, a global lookup table, and the neighborhood signature of v3]
Semantics-Based Similarity Search in InfoNet
Search top-k similar objects of the same type for a query
  Find researchers most similar to "Christos Faloutsos"
Two critical concepts to define a similarity
  Feature space
    Traditional data: attributes denoted as numerical value/vector, set, etc.
    Network data: a relation sequence called "path schema"
      Existing homogeneous-network-based similarity does not deal with this problem
  Measure defined on the feature space
    Cosine, Euclidean distance, Jaccard coefficient, etc.
    PathSim

Path Schema for DBLP Queries
Path schema: A path of the InfoNet schema, e.g., APC, APA
Who are most similar to Christos Faloutsos?

Flickr: Which Pictures Are Most Similar?
Some path schemas lead to similarity closer to human intuition
But others do not

Not All Similarity Measures Are Good
[Figure annotations: "Favor highly visible objects"; "Not reasonable"]

Long Path Schema May Not Be Good
Repeat the path schema 2, 4, and infinite times for conference similarity query

PathSim: Definition & Properties
Commuting matrix M_P corresponding to a path schema P
  Product of the adjacency matrices of the relations in the path schema
  The element M_P(i, j) denotes the strength between object i and object j in the semantics of path schema P
    If the adjacency matrices are unweighted, it denotes the number of path instances following path schema P
PathSim

\[ s(i, j) = \frac{2 M_P(i, j)}{M_P(i, i) + M_P(j, j)} \]

Properties:
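A short sketch of PathSim for a symmetric path schema built from a single relation, for instance author-venue-author, where the commuting matrix is W times its transpose. The author names and publication counts are made up for illustration.

```python
def commuting_matrix(W):
    """M = W * W^T for a symmetric path schema built from one relation matrix W."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in W] for row_i in W]


def pathsim(M, i, j):
    """PathSim s(i, j) = 2 * M(i, j) / (M(i, i) + M(j, j))."""
    return 2 * M[i][j] / (M[i][i] + M[j][j])


# Hypothetical author-by-venue publication counts (rows: authors, columns: two venues).
authors = ["Mike", "Jim", "Bob"]
W = [[2, 1],    # Mike
     [50, 20],  # Jim: same venues as Mike, but far more papers
     [2, 1]]    # Bob: same profile as Mike
M = commuting_matrix(W)

sim_mike_bob = pathsim(M, 0, 2)
sim_mike_jim = pathsim(M, 0, 1)
```

Mike and Bob get similarity 1.0, while Mike and Jim get only about 0.08: the normalization by self-visibility keeps highly visible objects from dominating. The function assumes every object has at least one path to itself, otherwise the denominator can be zero.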

Co-Clustering-Based Pruning Algorithm
Store commuting matrices for short path schemas & compute top-k queries online
Framework
  Generate co-clusters for materialized commuting matrices, for feature objects and target objects
  Derive upper bound for similarity between object and target cluster, and between object and object: Safely prune target clusters and objects if the upper bound similarity is lower than the current threshold
  Dynamically update the top-k threshold
Performance: Baseline vs. pruning

Role Discovery in Network: Why It Matters?
[Figure: an (imaginary) army communication network, from which the roles Commander, Captain, and Soldier are automatically inferred]

Role Discovery: Extracting Semantic Information from Links
Objective: Extract semantic meaning from plain links to finely model and better organize information networks
Challenges
  Latent semantic knowledge
  Interdependency
  Scalability
Opportunity
  Human intuition
  Realistic constraint
  Cross-check with collective intelligence
Methodology: propagate simple intuitive rules and constraints over the whole network

Discovery of Advisor-Advisee Relationships in DBLP Network
Input: DBLP research publication network
Output: Potential advising relationship and its ranking (r, [st, ed])
Ref. C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research Publication Networks", SIGKDD 2010
[Figure: the input temporal collaboration network (authors Ada, Bob, Jerry, Ying, and Smith, with papers from 1999 to 2004); the output relationship analysis, in which each candidate advising edge carries a ranking score and time interval such as (0.9, [/, 1998]) or (0.8, [1999, 2000]); and the visualized chronological hierarchies]

Overall Framework
[Figure: overall framework. Notation: a_i: author i; p_j: paper j; py: paper year; pn: paper #; st_{i,y_i}: starting time; ed_{i,y_i}: ending time; r_{i,y_i}: ranking score]

Time-Constrained Probabilistic Factor Graph (TPFG)
y_x: a_x's advisor
st_{x,y_x}: starting time
ed_{x,y_x}: ending time
g(y_x, st_x, ed_x) is a predefined local feature
f_x(y_x, Z_x) = max g(y_x, st_x, ed_x) under time constraint
Objective function
  P({y_x}) = \prod_x f_x(y_x, Z_x)
  Z_x = {z | x \in Y_z}
  Y_x: set of potential advisors of a_x

Experiment Results
DBLP data: 654,628 authors, 1,076,946 publications, years provided
Labeled data: Math Genealogy Project; AI Genealogy Project; Homepage

Datasets   RULE     SVM      IndMAX        IndMAX        TPFG          TPFG
                             (empirical    (optimized    (empirical    (optimized
                             parameter)    parameter)    parameter)    parameter)
TEST1      69.9%    73.4%    75.2%         78.9%         80.2%         84.4%
TEST2      69.8%    74.6%    74.6%         79.0%         81.5%         84.3%
TEST3      80.6%    86.7%    83.1%         90.9%         88.8%         91.3%

RULE: heuristics; SVM: supervised learning

Case Study & Scalability
Advisee         Top Ranked Advisor     Time    Note
David M. Blei   1. Michael I. Jordan   01-03   PhD advisor, 2004 grad
                2. John D. Lafferty    05-06   Postdoc, 2006
Hong Cheng      1. Qiang Yang          02-03   MS advisor, 2003
                2. Jiawei Han          04-08   PhD advisor, 2008
Sergey Brin     1. Rajeev Motwani      97-98   "Unofficial advisor"

Graph/Network Summarization: Graph Compression
Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes

OLAP on Information Networks
Why OLAP information networks?
Advantages of OLAP: Interactive exploration of multi-dimensional and multi-level space in a data cube (InfoNet)
  Multi-dimensional: Different perspectives
  Multi-level: Different granularities
InfoNet OLAP: Roll-up/drill-down and slice/dice on information network data
  Traditional OLAP cannot handle this, because it ignores links among data objects
Handling two kinds of InfoNet OLAP
  Informational OLAP
  Topological OLAP

Informational OLAP
In the DBLP network, study the collaboration patterns among researchers
Dimensions come from informational attributes attached at the whole snapshot level, so-called Info-Dims
I-OLAP Characteristics:
  Overlay multiple pieces of information
  No change on the objects whose interactions are being examined
    In the underlying snapshots, each node is a researcher
    In the summarized view, each node is still a researcher

Topological OLAP
Dimensions come from the node/edge attributes inside individual networks, so-called Topo-Dims
T-OLAP Characteristics
  Zoom in/Zoom out
  Network topology changed: "generalized" nodes and "generalized" edges
    In the underlying network, each node is a researcher
    In the summarized view, each node becomes an institute that comprises multiple researchers

InfoNet OLAP: Operations & Framework
Roll-up
  InfoNet I-OLAP: Overlay multiple snapshots to form a higher-level summary via I-aggregated network
  InfoNet T-OLAP: Shrink the topology & obtain a T-aggregated network that represents a compressed view, with topological elements (i.e., nodes and/or edges) merged and replaced by corresponding higher-level ones
Drill-down
  InfoNet I-OLAP: Return to the set of lower-level snapshots from the higher-level overlaid (aggregated) network
  InfoNet T-OLAP: A reverse operation of roll-up
Slice/dice
  InfoNet I-OLAP: Select a subset of qualifying snapshots based on Info-Dims
  InfoNet T-OLAP: Select a subnetwork based on Topo-Dims

Measure is an aggregated graph & other measures like node count, average degree, etc. can be treated as derived
Graph plays a dual role: (1) data source, and (2) aggregate measure
Measures could be complex, e.g., maximum flow, shortest path, centrality
It is possible to combine I-OLAP and T-OLAP into a hybrid case
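The two kinds of roll-up can be illustrated on a toy collaboration network stored as edge-weight dictionaries. The researcher names, institute names, and the choice to drop edges inside one institute are all assumptions made for this sketch.

```python
from collections import Counter


def overlay(snapshots):
    """I-OLAP roll-up: overlay snapshots by summing edge weights; the nodes stay the same."""
    total = Counter()
    for snapshot in snapshots:
        total.update(snapshot)
    return total


def roll_up(edges, group_of):
    """T-OLAP roll-up: merge nodes into groups and aggregate the edges between groups."""
    merged = Counter()
    for (u, v), weight in edges.items():
        gu, gv = group_of[u], group_of[v]
        if gu != gv:  # assumed design choice: collaborations inside one group are dropped
            merged[tuple(sorted((gu, gv)))] += weight
    return merged


# Hypothetical yearly collaboration snapshots: edge -> number of joint papers.
snapshot_2009 = {("ann", "bob"): 2, ("bob", "carl"): 1}
snapshot_2010 = {("ann", "bob"): 1, ("ann", "dina"): 3}
institute = {"ann": "Inst-A", "bob": "Inst-A", "carl": "Inst-B", "dina": "Inst-B"}

researcher_view = overlay([snapshot_2009, snapshot_2010])  # nodes are still researchers
institute_view = roll_up(researcher_view, institute)       # nodes become institutes
```

In the overlaid view each node is still a researcher, while in the rolled-up view the four researchers collapse into two institutes joined by one aggregated edge.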


Mining Evolution and Dynamics of InfoNet
Many networks come with time information
  E.g., according to paper publication year, DBLP networks can form network sequences
Motivation: Model the evolution of communities in a heterogeneous network
  Automatically detect the best number of communities in each timestamp
  Model the smoothness between communities of adjacent timestamps
  Model the evolution structure explicitly
    Birth, death, split
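Forming a network sequence from timestamped records can be sketched as below. The paper records, field names, and the helper `build_network_sequence` are hypothetical; the only idea taken from the slide is that papers are grouped by publication year so that each year yields one heterogeneous snapshot.

```python
from collections import defaultdict

def build_network_sequence(papers):
    """Group typed paper links by year; return [(year, links), ...] in time order."""
    by_year = defaultdict(set)
    for paper in papers:
        links = by_year[paper["year"]]
        paper_node = ("paper", paper["id"])
        # Each link connects the paper to one typed object.
        links.add((paper_node, ("venue", paper["venue"])))
        for author in paper["authors"]:
            links.add((paper_node, ("author", author)))
        for term in paper["terms"]:
            links.add((paper_node, ("term", term)))
    return [(year, by_year[year]) for year in sorted(by_year)]

# Hypothetical records for illustration.
papers = [
    {"id": "p1", "year": 1994, "venue": "VLDB", "authors": ["a1", "a2"], "terms": ["query"]},
    {"id": "p2", "year": 1991, "venue": "ICDE", "authors": ["a1"], "terms": ["object"]},
    {"id": "p3", "year": 1994, "venue": "VLDB", "authors": ["a3"], "terms": ["query"]},
]
sequence = build_network_sequence(papers)
```

Community detection would then run on each snapshot in order, carrying information from one timestamp to the next.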

Evolution: Idea Illustration
From network sequences to evolutionary communities

Graphical Model: A Generative Model
Dirichlet Process Mixture Model-based generative model
At each timestamp, a community depends on historical communities and the background community distribution

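Generative Model & Model Inference
To generate a new paper $o_i$:
  Decide whether to join an existing community or a new one
    Join an existing community $k$ with probability $n_k/(i-1+\alpha)$, where $n_k$ is the number of papers already in community $k$ and $\alpha$ is the concentration parameter of the Dirichlet process
    Join a new community $k$ with probability $\alpha/(i-1+\alpha)$: decide its prior, either the background distribution with one weight, or historical community $k$ with the complementary weight scaled by a community-specific factor, then draw the attribute distribution from the prior
  Generate $o_i$ according to the attribute distribution

Greedy inference for each timestamp: collapsed Gibbs sampling, which samples a cluster label for each target object (e.g., a paper)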
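The choice between joining an existing community and opening a new one follows the standard Dirichlet-process (Chinese restaurant) rule: an existing community $k$ holding $n_k$ of the first $i-1$ objects is chosen with probability $n_k/(i-1+\alpha)$, and a new one with probability $\alpha/(i-1+\alpha)$. The sketch below covers only this step. It does not model the choice of prior or the Gibbs inference, and the symbol name `alpha`, the label `NEW`, and both helper names are assumptions for illustration.

```python
import random

NEW = "<new>"  # hypothetical label for "open a new community"

def crp_probabilities(counts, alpha):
    """Return assignment probabilities for the next object.

    counts: dict mapping community -> n_k over the first i-1 objects.
    alpha:  concentration parameter (assumed symbol name).
    """
    total = sum(counts.values()) + alpha  # equals i - 1 + alpha
    probs = {k: n_k / total for k, n_k in counts.items()}
    probs[NEW] = alpha / total
    return probs

def sample_assignment(counts, alpha, rng):
    """Draw one community label according to crp_probabilities."""
    probs = crp_probabilities(counts, alpha)
    labels = list(probs)
    weights = [probs[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]

# Four papers seen so far: three in c1, one in c2.
counts = {"c1": 3, "c2": 1}
probs = crp_probabilities(counts, alpha=1.0)
choice = sample_assignment(counts, alpha=1.0, rng=random.Random(0))
```

Larger values of `alpha` make new communities more likely, which is how the model can let the number of communities vary per timestamp instead of fixing it in advance.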

Accuracy Study
The more types of objects used, the better the accuracy
Using the historical prior results in better accuracy

Case Study on DBLP
Tracking database community evolution
[Figure: the database community at timestamps 1991, 1994, 1997, and 2000, showing each community's top venues (e.g., VLDB, ICDE, SIGMOD Conf., DEXA) and top terms (e.g., database, systems, object, data, query, web), together with related communities and the DBLP schema.]

Case Study on Delicious.com
[Figure: event count (y-axis, roughly 150 to 550) per week (x-axis, weeks 1 to 4) for three series labeled C1, C2, and C3, shown alongside the Delicious schema.]


Conclusions
Rich knowledge can be mined from information networks
What is the magic?
  Heterogeneous, structured information networks!
Clustering, ranking and classification: Integrated clustering, ranking and classification: RankClus, NetClus, GNetMine, ...
Data cleaning, validation, and similarity search
Role discovery, OLAP, and evolutionary analysis
Knowledge is power, but knowledge is hidden in massive links!
Mining heterogeneous information networks: Much more to be explored!!

Future Research
From mining the current single star network schema to ranking, clustering, ... in multi-star, multi-relational databases
Mining information networks formed by structured data linked with unstructured data (text, multimedia and Web)
Mining cyber-physical networks (networks formed by dynamic sensors and image/video cameras, together with information networks)
Enhancing the power of knowledge discovery by transforming massive unstructured data: incremental information extraction, role discovery, multi-dimensional structured info-net
Mining noisy, uncertain, untrustworthy massive datasets by an information network analysis approach
Turning Wikipedia and/or the Web into structured or semi-structured databases by heterogeneous information network analysis

References: Books on Network Analysis
A.-L. Barabasi. Linked: How Everything Is Connected to Everything Else and What It Means. Plume, 2003.
M. Buchanan. Nexus: Small Worlds and the Groundbreaking Theory of Networks. W. W. Norton & Company, 2003.
D. J. Cook and L. B. Holder. Mining Graph Data. John Wiley & Sons, 2007.
S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2003.
A. Degenne and M. Forse. Introducing Social Networks. Sage Publications, 1999.
P. J. Carrington, J. Scott, and S. Wasserman. Models and Methods in Social Network Analysis. Cambridge University Press, 2005.
J. Davies, D. Fensel, and F. van Harmelen. Towards the Semantic Web: Ontology-Driven Knowledge Management. John Wiley & Sons, 2003.
D. Fensel, W. Wahlster, H. Lieberman, and J. Hendler. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2002.
L. Getoor and B. Taskar (eds.). Introduction to Statistical Relational Learning. MIT Press, 2007.
B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2006.
J. P. Scott. Social Network Analysis: A Handbook. Sage Publications, 2005.
D. J. Watts. Six Degrees: The Science of a Connected Age. W. W. Norton & Company, 2003.
D. J. Watts. Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton University Press, 2003.
S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

References: Some Overview Papers
T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, May 2001.
C. Cooper and A. Frieze. A general model of web graphs. Algorithms, 22, 2003.
S. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38, 2006.
T. Dietterich, P. Domingos, L. Getoor, S. Muggleton, and P. Tadepalli. Structured machine learning: The next ten years. Machine Learning, 73, 2008.
S. Dumais and H. Chen. Hierarchical classification of web content. SIGIR'00.
S. Dzeroski. Multirelational data mining: An introduction. ACM SIGKDD Explorations, July 2003.
L. Getoor. Link mining: a new data mining challenge. SIGKDD Explorations, 5:84-89, 2003.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. ICML'01.
D. Jensen and J. Neville. Data mining in networks. In Papers of the Symp. Dynamic Social Network Modeling and Analysis, National Academy Press, 2002.
T. Washio and H. Motoda. State of the art of graph-based data mining. SIGKDD Explorations, 5, 2003.

References: Some Influential Papers
A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. Computer Networks, 33, 2000.
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. WWW'98.
S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. M. Kleinberg. Mining the web's link structure. COMPUTER, 32, 1999.
M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. ACM SIGCOMM'99.
M. Girvan and M. E. J. Newman. Community structure in social and biological networks. In Proc. Natl. Acad. Sci. USA 99, 2002.
B. A. Huberman and L. A. Adamic. Growth dynamics of world-wide web. Nature, 399:131, 1999.
G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. KDD'02.
J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measurements, models, and methods. COCOON'99.
D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. KDD'03.
J. M. Kleinberg. Small world phenomena and the dynamics of information. NIPS'01.
R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. FOCS'00.
M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45, 2003.

References: Clustering and Ranking (1)
E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed Membership Stochastic Blockmodels. JMLR'08.
Liangliang Cao, Andrey Del Pozo, Xin Jin, Jiebo Luo, Jiawei Han, and Thomas S. Huang. RankCompete: Simultaneous Ranking and Clustering of Web Photos. WWW'10.
G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. KDD'02.
Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han. Community Outliers and their Efficient Detection in Information Networks. KDD'10.
M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 2004.
M. E. J. Newman and M. Girvan. Fast algorithm for detecting community structure in networks. Physical Review E, 2004.
J. Shi and J. Malik. Normalized cuts and image segmentation. CVPR'97.
Yizhou Sun, Yintao Yu, and Jiawei Han. Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema. KDD'09.
Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu. RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis. EDBT'09.

References: Clustering and Ranking (2)
Yizhou Sun, Jiawei Han, Jing Gao, and Yintao Yu. iTopicModel: Information Network-Integrated Topic Modeling. ICDM'09.
Xiaoxin Yin, Jiawei Han, and Philip S. Yu. LinkClus: Efficient Clustering via Heterogeneous Semantic Links. VLDB'06.
Yintao Yu, Cindy X. Lin, Yizhou Sun, Chen Chen, Jiawei Han, Binbin Liao, Tianyi Wu, ChengXiang Zhai, Duo Zhang, and Bo Zhao. iNextCube: Information Network-Enhanced Text Cube. VLDB'09 (demo).
A. Wu, M. Garland, and J. Han. Mining scale-free networks using geodesic clustering. KDD'04.
Z. Wu and R. Leahy. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 1993.
X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A structural clustering algorithm for networks. KDD'07.
X. Yin, J. Han, and P. S. Yu. Cross-relational clustering with user's guidance. KDD'05.

References: Network Classification (1)
A. Appice, M. Ceci, and D. Malerba. Mining model trees: A multi-relational approach. ILP'03.
Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, and Jiawei Han. Bipartite Graph-based Consensus Maximization among Supervised and Unsupervised Models. NIPS'09.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning Probabilistic Models of Link Structure. JMLR'02.
L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic Models of Text and Link Structure for Hypertext Classification. IJCAI WS Text Learning: Beyond Classification, 2001.
L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Learning Probabilistic Relational Models. Chapter in Relational Data Mining, eds. S. Dzeroski and N. Lavrac, 2001.
M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao. Graph-based classification on heterogeneous information networks. ECML-PKDD'10.
Q. Lu and L. Getoor. Link-based classification. ICML'03.
D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. CIKM'03.

References: Network Classification (2)
J. Neville, B. Gallagher, and T. Eliassi-Rad. Evaluating statistical tests for within-network classifiers of relational data. ICDM'09.
J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning relational probability trees. KDD'03.
Jennifer Neville and David Jensen. Relational Dependency Networks. JMLR'07.
M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In NIPS, volume 14, 2001.
M. J. Rattigan, M. Maier, and D. Jensen. Graph clustering with network structure indices. ICML'07.
P. Sen, G. M. Namata, M. Galileo, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29, 2008.
B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. IJCAI'01.
B. Taskar, P. Abbeel, M. F. Wong, and D. Koller. Relational Markov Networks. Chapter in L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning, 2007.
X. Yin, J. Han, J. Yang, and P. S. Yu. CrossMine: Efficient Classification across Multiple Database Relations. ICDE'04.
D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf. Learning with local and global consistency. In NIPS 16, Vancouver, Canada, 2004.
X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report, 2002.

References: Social Network Analysis
B. Aleman-Meza, M. Nagarajan, C. Ramakrishnan, L. Ding, P. Kolari, A. P. Sheth, I. B. Arpinar, A. Joshi, and T. Finin. Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection. WWW'06.
R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu. Mining newsgroups using networks arising from social behavior. WWW'03.
P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. WWW'04.
D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Community mining from multi-relational networks. PKDD'05.
P. Domingos. Mining social networks for viral marketing. IEEE Intelligent Systems, 20, 2005.
P. Domingos and M. Richardson. Mining the network value of customers. KDD'01.
P. DeRose, W. Shen, F. Chen, A. Doan, and R. Ramakrishnan. Building structured web community portals: A top-down, compositional, and incremental approach. VLDB'07.
G. Flake, S. Lawrence, C. L. Giles, and F. Coetzee. Self-organization and identification of web communities. IEEE Computer, 35, 2002.
J. Kubica, A. Moore, and J. Schneider. Tractable group detection on large link data sets. ICDM'03.

References: Data Quality & Search in Networks
I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. Proc. SIGMOD 2004 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'04).
Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Integrating conflicting data: The role of source dependence. PVLDB, 2(1):550-561, 2009.
Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562-573, 2009.
H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. ICDL'04.
Y. Sun, J. Han, T. Wu, X. Yan, and Philip S. Yu. PathSim: Path Schema-Based Top-K Similarity Search in Heterogeneous Information Networks. Technical report, CS, UIUC, July 2010.
X. Yin, J. Han, and P. S. Yu. Object Distinction: Distinguishing Objects with Identical Names by Link Analysis. ICDE'07.
X. Yin, J. Han, and P. S. Yu. Truth Discovery with Multiple Conflicting Information Providers on the Web. IEEE TKDE, 20(6):796-808, 2008.
P. Zhao and J. Han. On Graph Query Optimization in Large Networks. VLDB'10.

References: Role Discovery, Summarization and OLAP
D. Archambault, T. Munzner, and D. Auber. Topolayout: Multilevel graph layout by topological features. IEEE Trans. Vis. Comput. Graph, 2007.
Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu. Graph OLAP: Towards Online Analytical Processing on Graphs. ICDM 2008.
Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu. Graph OLAP: A Multi-Dimensional Framework for Graph Data Analysis. KAIS 2009.
Xin Jin, Jiebo Luo, Jie Yu, Gang Wang, Dhiraj Joshi, and Jiawei Han. iRIN: Image Retrieval in Image-Rich Information Networks. WWW'10 (demo paper).
Lu Liu, Feida Zhu, Chen Chen, Xifeng Yan, Jiawei Han, Philip Yu, and Shiqiang Yang. Mining Diversity on Networks. DASFAA'10.
Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. SIGMOD'08.
Chi Wang, Jiawei Han, Yuntao Jia, Jie Tang, Duo Zhang, Yintao Yu, and Jingyi Guo. Mining Advisor-Advisee Relationships from Research Publication Networks. KDD'10.
Zhijun Yin, Manish Gupta, Tim Weninger, and Jiawei Han. LINKREC: A Unified Framework for Link Recommendation with User Attributes and Graph Structure. WWW'10.

References: Network Evolution
L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. KDD'06.
M.-S. Kim and J. Han. A particle-and-density based evolutionary clustering method for dynamic networks. VLDB'09.
J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. KDD'05.
Yizhou Sun, Jie Tang, Jiawei Han, Manish Gupta, and Bo Zhao. Community Evolution Detection in Dynamic Heterogeneous Information Networks. KDD MLG'10.
