Big Data Analytics: Turning Big Data into Big Money

By Frank Ohlhorst
Copyright 2013 by John Wiley & Sons, Inc.


Abstraction tools, 54
Access to data, 32, 50, 63, 69
Accuracy of data, 103–104, 117–118
Activity logs, 64
Algorithms
  accuracy, 18, 103–104
  anomalies, 103
  data mining, 119
  evolution of, 90–91
  real-time results, 104
  scenarios, 5
  statistical applications, 4
  text analytics, 59
Amazon, 11–12, 27
Amazon S3, 44
Analysis of data. See Data analysis
Anomalies, value of, 101–103
Apple, 102
Applications, 4, 52–57, 70
Archives, 51
Artificial intelligence, 86–87, 116, 120
Astronomy, 114
Auto-categorization, 57
Automated metadata acquisition systems, 116–117
Availability of data, 63, 71–72

BA. See Business analytics (BA)
BackType, 18
Backup systems, 66–67
Batch processing, 53, 54
Behavioral analytics, 17–18
Benefits analysis, 23–24
Best practices, 93–109
  anomalies, 101–103
  expediency-accuracy tradeoff, 103–104
  high-value opportunities focus, 94–95
  in-memory processing, 104–109
  project management processes, 98–101
  project prerequisites, 93–94
  thinking big, 95–96
  worst practice avoidance, 96–98
BI. See Business intelligence (BI)
Big Data and Big Data analytics
  analysis categories, 4–5
  application platforms, 52–57
  best practices, 93–109
  business case development, 21–28
  challenges, 6–7, 112
  classifications, 5–6
  components, 47
  defined, 1–2, 21–22, 78
  evolution of, 77–91
  examples of, 113–115
  4Vs of, 3–4
  goal setting, 39–40
  introduction, ix–xi


Big Data and Big Data analytics (continued)
  investment in, 113
  path to, 112–113
  phases of, 115–121
  potential of, 111
  privacy issues, 122–123
  processing, 59–61, 104–109
  role of, 2–3
  security (See Security)
  sources of, 37–46
  storage, 7, 25–28, 47–52
  team development, 29–35
  technologies (See Technologies)
  value of, 11–19
  visualizations, 121–122
Big Science, 78–80
BigSheets, 43–44
Bigtable, 26, 27
Bioinformatics, 16, 114, 118
Biomedical industry, 16
Blekko, 14
Business analytics (BA), 9, 39
Business case
  best practices, 96
  data collection and storage options, 25–28
  elements of, 22–25
  introduction, 21–22
Business intelligence (BI)
  as Big Data analytics foundation, 112–113
  Big Data analytics team incorporation, 30, 34
  Big Data impact, 21, 26
  defined, 4
  extract, transform, and load (ETL), 9
  information technology and, 30
  in-memory processing, 107
  limitations of, 96
  marketing campaigns, 46
  risk analysis, 24–25
  storage capacity issues, 48
  unstructured data, 5–6
  visualizations, 121
Business leads, 99
Business logic, 54–55
Business objectives, 45, 94–95, 99
Business rules, 99–100

Capacity of storage systems, 48–49
Cassandra, 27
Census data, 77–78
CERN, 13, 16
Citi, 86–87
Classification of data, 5–6, 65
Cleaning, 117, 118
Click-stream data, 16, 17
Cloud computing, 28, 56, 60–61
Cloudera, 28
Combs, Nick, 83
Commodity hardware, 52
Common Crawl Corpus, 44
Communication, 98
Competition, 22, 109
Compliance, 50–51, 67–72
Computer security officers (CSOs), 73–74
Consulting firms, 33
Core capabilities, data analytics team, 32
Costs, 24, 51–52, 107
Counterintelligence mind-set, 75
CRUD (create, retrieve, update, delete) applications, 54
Cryptographic keys, 70
Culture, corporate, 34–35, 44

Customer needs, 108
Cutting, Doug, 26

Data
  defined, ix
  growth in volume of, 2, 14–15, 25–26, 40–42, 46, 80, 84, 104
  value of, 11–19
  See also Big Data and Big Data analytics
Data analysis
  categories, 4–5
  challenges, 112, 115–116, 118
  complexity of, 17–18
  as critical skill for team members, 32, 33
  data accuracy, 117–118
  evolution of, 79
  importance of, 113
  process, 118–121
  technologies, 43–44, 96
Database design, 118
Data classification, 5–6, 65
Data discovery, 33, 42
Data extraction, 9, 59, 117
Data integration
  technologies, 6, 53, 58
  value creation, 112
Data interpretation, 120–122
Data manipulation, 33
Data migration, 50
Data mining
  components, 38
  as critical skill for team members, 32, 33
  defined, 4, 19
  examples, 84, 85
  methods, 118–119
  technologies, 43
Data modeling, 5, 115–116
Data protection. See Security
Data retention, 64–65
Data scientists, 29–30, 31, 32–33
Data sources, 37–46
  growth of, 40–43
  identification of, 37, 38–40, 43–44
  importation of data into platform, 38
  public information, 43–44
Data visualization, 33, 43–44, 121–122
Data warehouses, 25, 70–71, 96
DevOps, 45
Discovery of data, 33, 42
Disk cloning, 51
Disruptive technologies, 21
Distributed file systems, 26, 60. See also Hadoop
Dynamo, 27

e-commerce, 17, 102
Economist, 104
e-discovery, 42
Education, 114
80Legs, 43, 44
Electronic medical records
  compliance, 68
  data errors, 119
  data extraction, 117
  privacy issues, 122
  trends, 41
Electronic transactions, 38
EMC Corporation, 83
Employees
  data analytics team membership, 29–35
  monitoring of, 74–75
  training, 45, 74

Encryption, 70, 72
Entertainment industry, 41
Entity extraction, 59
Entity relation extraction, 59
Errors, 119
Event-driven data distribution, 56
Evidence-based medicine, 88
Evolution of Big Data, 77–91
  algorithms, 90–91
  current issues, 84–85
  future developments, 85–90
  modern era, 80–83
  origins of, 77–80
Expectations, 98
Expediency-accuracy tradeoff, 103–104
External data, 38, 40
Extract, transform, and load (ETL), 9
Extractiv, 43

Facebook, 12, 27
Filters, 116
Financial controllers, 109
Financial sector, 16–17, 91, 109
Financial transactions, 42
Flexibility of storage systems, 50
4Vs of Big Data, 3–4

Gartner, 104
General Electric (GE), 82–83
Gephi, 44
Goal setting, 39–40
Google, 12–13, 26, 111
Google Books Ngrams, 44
Google Refine, 43
Governance, 32, 96
Government agencies, 41
Grep, 43

Hadoop
  advantages and disadvantages of, 7–10, 26, 28, 46, 60, 69, 85
  design and function of, 7–8, 84–85, 103
  event-processing framework, 53
  future, 85
  origins of, 26
  vendor support, 46
  Yahoo's use, 26–27
HANA, 85
HBase, 9–10
HDFS, 9–10
Health care
  Big Data analytics opportunities, 114
  Big Data trends, 41
  compliance, 68
  evolution of Big Data, 87–90
  See also Electronic medical records
Hibernate, 54
High-value opportunities, 94–95
History. See Evolution of Big Data
Hive, 9, 54
Hollerith Tabulating System, 78
Hortonworks, 28

IBM, 2, 86–87, 116
IDC (International Data Corporation), 84, 104
IDC Digital Universe Study, 46
Information professionals, 32–33, 97–98, 118
Information technology (IT)
  Big Data analytics team incorporation, 30, 34
  business value focus, 96

  database management as percentage of budget, 107
  data governance, 32
  evolution of, 79
  in-memory processing impact, 107
  pilot programs, 61
  user analysis, 100–101
In-memory processing, 55–56, 104–109, 116
Input-output operations per second (IOPS), 50
Integration of data, 6, 53, 58, 112
Intellectual property, 72–75
Interconnected data, 119
Internal data, 38, 39–40
International Biological Program, 79
International Data Corporation, 84, 104
International Geophysical Year project, 79
Interpretation of data, 120–122

Jahanian, Farnam, 80–82
JPA, 54

Kelly, Nuala O'Connor, 82–83
Kogan, Caron, 83

Labeling of confidential information, 74
Latency of storage systems, 49–50
Legal issues, 42, 64, 75
LexisNexis Risk Solutions, 83
Liability, 64
Life sciences, 41
LivingSocial, 19
Location-based services, 122–123
Lockheed Martin, 83
Log-in screens, 74
Logistics, 41, 84
Logs, activity, 64
Loyalty programs, 16

Maintenance plans, 100
Manhattan Project, 78–79
Manipulation of data, 33
Manufacturing, in-memory processing technology, 109
Mapping tools, 54
MapR, 28
MapReduce
  advantages, 46, 54, 55
  built-in support for integration, 53
  defined, 8
  Hadoop, 9
  relational database management systems, 58
Marketing campaigns, 46
Memory, brain's capacity, ix
Metadata, 49, 57, 116–117
Metrics, 35
Mining. See Data mining
Mobile devices, 43, 107–108
Modeling, 5, 115–116
Moore's Law, 25–26
Mozenda, 43

NAS, 7
National Oceanic and Atmospheric Administration (NOAA), 2
National Science Foundation (NSF), 80, 82
Natural language recognition, 57, 59
New York Times, 2

Noisy data, 119
NoSQL (Not only SQL), 53, 58–59, 69

Object-based storage systems, 49
OLAP systems, 120
OOZIE, 9
OpenHeatMap, 44
Open source technologies
  availability, 28
  options, 43–44, 45
  pilot projects, 61
  See also Hadoop
Organizational structure, 30–31
Outsourcing, 61

Parallel processing, 55
Patents, 72–75
Pentaho, 9
Performance measurement, 35
Performance-security tradeoff, 63–64, 71–72
Perlowitz, Bill, 83
Pharmaceutical companies, 2
Pig, 9
Pilot projects, 9–10, 61
Planning, 44–46, 93–94, 99
Point-of-sale (POS) data, 16
Predictive analysis, 4–5, 40
Privacy, 122–123
Problem identification, 45
Processing, 59–61, 104–109
Project management processes, 98–101
Project planning, 44–46, 93–94, 99
Public information sources, 43–44
Purging of data, 64–65

Queries, 100, 118–119, 120

RAM-based devices, 55–56
Real-time analytics, 53, 104–109, 116
Recruitment of data analytics personnel, 32–33
Red Hat, 54
Relational database management system (RDBMS), 58, 67
Research and development (R&D), 82
Resource description framework (RDF), 58–59
Results, 121
  anomalies, 102
  Big Data use, 84, 109
  click-stream data, 16, 17
  data sources, 41
  goal setting, 39–40
  in-memory processing technology, 109
  organizational culture, 34
Retention of data, 64–65
Return on investment (ROI), 25
Risk analysis, 24–25

SANS, 7
SAP, 85
Scale-out storage solutions, 48–49
Scaling, 95, 120
Scenarios, 5, 121–122
Schmidt, Erik, 80
Science, 16, 78–83, 114
Scope of project, 24
Scrubbing programs, 101

Security
  backup systems, 66–67
  challenges, 49, 63–64
  compliance issues, 67–72
  data classification, 65
  data retention, 64–65
  intellectual property, 72–75
  rules, 71–72
  technologies, 69–70, 74
Semantics
  event-driven data distribution support, 56
  mapping of, 55
  technologies, 58–59
  trends, 57
Semistructured data, 6
Sensor data
  filtering, 116
  growth of, 15, 46
  types, 2, 41, 84
Silos, 65
Sloan Digital Sky Survey, 114
Small and medium businesses (SMBs), 13–15, 31, 52
Smart meters, 43
Smartphones, 43
Snapshots, 51
Social media, 42, 59, 71, 102
Software. See Technologies
Sources of data. See Data sources
Space program, 78–79
Specificity of information, 108
Speed-accuracy tradeoff, 103–104
Spring Data, 54
  limitations, 55
  NoSQL integration, 53, 58
  scaling, 95, 120
Stale data, 49
Statistical applications, 4
Storage, 7, 25–28, 47–52
Storm, 18–19
Structured data, 5–6, 17
Success, measurement of, 35
Supplementary information, 121
Supply chain, 108–109

Tableau Public, 44
Taxonomies, 57
Team members, 29–35, 97–98
Technologies
  application platforms, 52–57
  Cassandra, 27
  cloud computing, 28, 56, 60–61
  commodity hardware, 52
  decision making, 28, 45–46
  processing power, 59–61, 104–109
  security, 69–70, 74
  storage, 7, 25–28, 47–52
  Web-based tools, 43–44
  worst practices, 96–97
  See also Hadoop
Telecommunications, 41
Text analytics, 59
Thin provisioning, 51
T-Mobile, 85
Training, 45, 74
Transportation, 41
Trends, 102
Trusted applications, 70
Turk, 43
Twitter, 18–19, 103

United Parcel Service (UPS), 84
Unstructured data
  complexity of, 17

Unstructured data (continued)
  defined, 6
  forms, 57
  growth of, 5, 87, 104
  project goal setting, 39–40
  social media's collection, 71
  technologies, 18–19, 57–59
  varieties of, 3
U.S. census, 77–78
User analysis, 100–101
Utilities sector, 41

Value, extraction of, 3–5, 57
Variety, 3
Velocity, 3
Vendor lock-in, 72
Veracity, 3
Videos, 59
Video surveillance, 41
Villanustre, Flavio, 83
Visualization, 33, 43–44, 121–122
Volume, 3

Walt Disney Company, 2
Watson, 86–87, 116
Web-based technologies, 43–44
Web sites
  click-stream data, 16, 17
  logs, 38
  traffic distribution, 14
White-box systems, 52
Worst practices, 96–98
Wyle Laboratories, 83

XML, 6, 58

Yahoo, 26–27

