Analyzing Petabytes
Suchi Raman
Netezza Corp.
http://www.netezza.com/
Macro-analytic queries
> Identify trends and patterns
> Very large data volumes
> Query times dominated by disk scan times
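To illustrate why scan time dominates at this scale, here is a back-of-the-envelope sketch. The disk count and per-disk bandwidth are hypothetical round numbers chosen for the arithmetic, not Netezza specifications, and the helper name `scan_seconds` is made up for this example.

```python
def scan_seconds(data_bytes: float, disks: int, mb_per_sec_per_disk: float) -> float:
    """Time to scan data spread evenly across disks that are read in parallel."""
    aggregate_bytes_per_sec = disks * mb_per_sec_per_disk * 1e6
    return data_bytes / aggregate_bytes_per_sec


PETABYTE = 1e15  # bytes

# Hypothetical figures: 100 MB/s sustained per disk.
one_disk = scan_seconds(PETABYTE, disks=1, mb_per_sec_per_disk=100)
many_disks = scan_seconds(PETABYTE, disks=1000, mb_per_sec_per_disk=100)

print(f"1 disk:     {one_disk / 86400:.1f} days")
print(f"1000 disks: {many_disks / 3600:.1f} hours")
```

Under these assumptions a full scan only becomes practical when many disks are read at once, which is why aggregate disk bandwidth is the figure to watch.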
Micro-analytic queries
> Short running queries
> Query run once and stored
> Pre-computed summaries
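The pre-computed summary idea can be sketched as follows. This is a minimal illustration with made-up data and hypothetical names (`build_summary`, `total_for`); it only shows the pattern of aggregating once and answering later queries from the stored result.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, sale_amount).
sales = [("east", 120.0), ("west", 80.0), ("east", 30.0), ("west", 45.0)]


def build_summary(rows):
    """Run the expensive aggregation once and keep the result."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


# Computed once (for example after each load), then reused.
summary = build_summary(sales)


def total_for(region):
    """Short-running lookup that never touches the detail rows."""
    return summary.get(region, 0.0)


print(total_for("east"))
```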
Data management
> ETL load/unload
> Backup/restore
Netezza Confidential
[Figure: system architecture diagram. Labels: Client (TRU64, HP-UX, Windows, Linux); Source Systems; 3rd Party Apps; ETL Server; DBA CLI; Admin; High-speed Loader/Unloader; High Performance Loader; Front End; SMP Host; DBOS; Query Plan; Optimize; Gigabit Ethernet; 1 ... 1000+]
Software challenges
> Optimal data layouts
> Data compression
> Increased effective disk bandwidth (and reliability!)
> Upgrades and evolution of on-disk formats
> Minimize disk reads (indexes, caches)
> Skew avoidance algorithms
> Scheduling among queries, especially with mixed workloads combining large and small queries
> System monitoring during busy periods
> Accurate profiling techniques
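As one way to make the skew item concrete, the sketch below measures how unevenly a distribution key spreads rows across parallel slices. It assumes rows are placed by hashing a key, which the slides do not state; `slice_for` and `skew_ratio` are hypothetical helpers, and CRC32 is used only because it is deterministic across runs.

```python
import zlib
from collections import Counter


def slice_for(key: str, num_slices: int) -> int:
    """Map a distribution key to a slice with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_slices


def skew_ratio(keys, num_slices: int) -> float:
    """Rows on the busiest slice divided by the mean rows per slice (1.0 is even)."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    mean = len(keys) / num_slices
    return max(counts.values()) / mean


# Hypothetical data: a high-cardinality key versus a low-cardinality one.
customer_ids = [f"cust-{i}" for i in range(10_000)]
countries = ["US"] * 9_000 + ["CA"] * 1_000

print("customer_id skew:", skew_ratio(customer_ids, 8))
print("country skew:    ", skew_ratio(countries, 8))
```

A parallel scan finishes only when its busiest slice finishes, so a ratio well above 1.0 translates directly into longer query times.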
System Monitoring/profiling
> High speed data path in/out of NPS system
> Efficient/flexible data formats for load/unload
> Infrastructure challenge: fast external devices for sourcing/sinking data
> Custom functions (UDFs/UDAs) implemented within the system
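For the UDA item, a user-defined aggregate that runs inside a parallel system generally needs a way to combine partial results from each data partition. The sketch below shows that generic accumulate/merge/finalize pattern for a mean. The function names and the `MeanState` type are hypothetical and do not represent the NPS programming interface.

```python
from dataclasses import dataclass


@dataclass
class MeanState:
    total: float = 0.0
    count: int = 0


def accumulate(state: MeanState, value: float) -> MeanState:
    """Fold one input row into a partition-local state."""
    state.total += value
    state.count += 1
    return state


def merge(a: MeanState, b: MeanState) -> MeanState:
    """Combine two partial states produced on different partitions."""
    return MeanState(a.total + b.total, a.count + b.count)


def finalize(state: MeanState):
    """Produce the final aggregate value, or None for empty input."""
    return state.total / state.count if state.count else None


# Hypothetical partitions, each aggregated independently.
partitions = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
partials = []
for rows in partitions:
    state = MeanState()
    for value in rows:
        accumulate(state, value)
    partials.append(state)

# The partial states are then merged in a single place.
combined = MeanState()
for partial in partials:
    combined = merge(combined, partial)

result = finalize(combined)
print(result)
```

Keeping a sum and a count in the state, rather than a running mean, is what makes the merge step exact.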
Hardware challenges
> Increased effective disk bandwidth (and reliability!)
> Multi-core technology
> Balancing CPU-to-disk ratio
> Specialized engines (e.g., FPGA-based filtering)
> Faster internal and external connectivity
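The link between effective disk bandwidth and the CPU-to-disk ratio can be sketched with a simple model. It assumes that compressed data is read from disk and then decompressed by a processor with a fixed throughput; all numbers are hypothetical and `effective_mb_per_sec` is a made-up helper.

```python
def effective_mb_per_sec(raw_mb_per_sec: float,
                         compression_ratio: float,
                         decompress_mb_per_sec: float) -> float:
    """Uncompressed MB/s delivered to the query.

    The disk supplies raw * ratio MB/s of logical data, but the processor
    can emit at most decompress_mb_per_sec, so the slower stage wins.
    """
    return min(raw_mb_per_sec * compression_ratio, decompress_mb_per_sec)


# Hypothetical: 100 MB/s disk, 2x compression, fast decompression (disk-bound).
disk_bound = effective_mb_per_sec(100, 2.0, 500)

# Hypothetical: 4x compression, slower decompression (CPU-bound).
cpu_bound = effective_mb_per_sec(100, 4.0, 300)

print(disk_bound, cpu_bound)
```

In the second case extra compression no longer helps until processing capacity per disk is raised, which is the balancing problem the list refers to.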
Platform improvements
> Disk performance and reliability
> FPGA filtering algorithms
> Faster interconnect networks
> Power and cooling improvements
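The filtering item can be illustrated in software, assuming that "filtering" means discarding unwanted rows and columns while data streams off disk, so that less data travels further into the system. This is only an analogy in Python with made-up data; it says nothing about how the FPGA logic is built.

```python
def filter_stream(records, predicate, columns):
    """Yield only the matching rows, reduced to the requested columns."""
    for rec in records:
        if predicate(rec):
            yield {c: rec[c] for c in columns}


# Hypothetical table: 1000 rows, one in four belongs to region "east".
rows = [
    {"id": i,
     "region": "east" if i % 4 == 0 else "west",
     "amount": i * 1.5,
     "notes": "x" * 50}
    for i in range(1000)
]

passed = list(filter_stream(rows,
                            lambda r: r["region"] == "east",
                            ["id", "amount"]))

print(len(rows), "rows scanned,", len(passed), "rows passed on")
```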