
Jon Cohn Exton PA - Corporate Data Architecture

A complete revisiting of the corporate data architecture and its respective best
practices is in order because of cloud computing and big changes in computing
technology and software development. In some cases, a complete inversion has
occurred (in the best way) to solve a particular problem. To be competitive,
organizations need to take advantage of these new ways of doing things.
Massive amounts of data and information are out there if we can grasp them. Below are some
principles and practices for how we can better deal with data going forward.
Store First, Analyze Later: Disk is cheap. We can't always predict what data will
be important later. Store first and ask questions later. With scalable
infrastructure and today's hardware economics, it's okay if a piece of data turns
out never to be used. The schema flexibility of NoSQL technology facilitates this.
For example, with a customer document, adding additional fields of information
at a later date is easy, even if they were not envisioned initially.
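The customer-document idea above can be sketched in plain Python; this is an illustrative stand-in for a document database, and the field names (`loyalty_tier`, `tags`) are hypothetical:

```python
# A minimal sketch of schema flexibility with document-style records.
# Plain dicts stand in for documents; field names are assumptions.
customer = {"_id": 1, "name": "Acme Corp", "city": "Exton"}

# Months later, new documents gain fields without a schema migration;
# older documents simply lack them.
customer_v2 = {"_id": 2, "name": "Widget Inc", "city": "Malvern",
               "loyalty_tier": "gold", "tags": ["wholesale"]}

def loyalty_tier(doc):
    # Readers tolerate missing fields instead of failing on them.
    return doc.get("loyalty_tier", "standard")

print(loyalty_tier(customer))     # older document: default applies
print(loyalty_tier(customer_v2))  # newer document: stored value
```

The point is that the reading code, not a fixed schema, decides how to handle the absent field.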
Default to Real-time: Historically, data processing and analysis have been done via
batch processing. We defaulted to batch processing because it is computationally
efficient; however, given Moore's law and the passage of time, we now have much
more power at our disposal. We can afford to do more work to get real-time
answers instead of answers tomorrow. NoSQL and fast storage technologies
(such as solid state disk) make real-time possible. Your organization should
deliver recommendations, personalization and business metrics immediately.
Default to real-time and go to batch only when necessary.
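A small sketch of the batch-versus-real-time contrast: a metric (here, a running mean) updated per event as it arrives versus recomputed over the whole batch later. The class and names are illustrative, not from any specific product:

```python
class RunningMean:
    """Update the metric incrementally, per event (the real-time path)."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def observe(self, value):
        self.n += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.n if self.n else 0.0

events = [12.0, 8.0, 10.0]

rt = RunningMean()
for v in events:
    rt.observe(v)  # an up-to-date answer is available after each event

# The batch path gives the same answer, but only after the batch job runs.
batch_mean = sum(events) / len(events)
assert abs(rt.mean - batch_mean) < 1e-9
```

The incremental path does slightly more bookkeeping per event, which is exactly the trade the section describes: spend compute continuously to have the answer now.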
Structure Shouldn't Hold You Back: It's easy to store basic stock information, for
example (ticker, high, low, close), in any database. What about a complete
derivative security? How do we store that in the database, especially given that
new securities are invented all the time? A legal contract's terms? How do we
store polymorphic information, or data we weren't aware of a priori?
Historically a few methods have been most common: the relational database for
data with very precise structuring; completely unstructured data (BLOBs); and
things in the middle, such as spreadsheets. The latter two formats are mostly
useless for integration into your applications, yet the volume of such data is
massive. With the rise of dynamic document-oriented data models (using JSON),
semi-structured, complex structured, and polymorphic data can be stored,
accessed and organized just as efficiently as the more rigidly structured data
that has been in databases traditionally.
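The polymorphic-securities problem above can be sketched with JSON-style documents; each document carries only the fields its security type needs, and the field names here are hypothetical:

```python
# Sketch: polymorphic securities stored side by side as documents.
# A single relational table would need a fixed column set; here each
# document has its own shape.
securities = [
    {"type": "stock", "ticker": "XYZ",
     "high": 101.2, "low": 99.5, "close": 100.4},
    {"type": "option", "underlying": "XYZ", "strike": 105.0,
     "expiry": "2025-12-19", "style": "american"},
    {"type": "swap", "legs": [{"pay": "fixed", "rate": 0.042},
                              {"receive": "floating", "index": "SOFR"}]},
]

# Queries can still span the whole collection despite the differing shapes.
on_xyz = [s for s in securities
          if s.get("ticker") == "XYZ" or s.get("underlying") == "XYZ"]
print(len(on_xyz))  # 2
```

When a new security type is invented, it becomes one more document shape, not a schema change rippling across tables.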
Agility Is Key: The software development world has moved from classic
waterfall software development lifecycles to more agile, or iterative,
methodologies (for example, Scrum). These methods' rapid iteration allows
organizations to deliver features and enhancements to end users quickly and
effectively. To work this way, we need tools that are agile-compatible:
version control, continuous integration, and programming languages have
already adapted. We need similar adaptation from the database if we want to
make software development nimble and productive. NoSQL technologies facilitate
iteration in the data model in much the same way as you iterate with your code.
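One way iterating on the data model plays out in practice is a lazy, in-application upgrade: old-shape documents are brought to the current shape as they are read, so no big-bang migration blocks a sprint. The version field and document shapes below are assumptions for illustration:

```python
def upgrade(doc):
    """Bring a document to the current shape, whatever version it is.

    Hypothetical example: v1 stored a single 'phone' string; v2 stores
    a list of contacts. Documents already at v2 pass through unchanged.
    """
    if doc.get("version", 1) == 1:
        phone = doc.get("phone")
        doc = {k: v for k, v in doc.items() if k != "phone"}
        doc["version"] = 2
        doc["contacts"] = [{"kind": "phone", "value": phone}]
    return doc

old = {"name": "Acme", "phone": "555-0100"}   # written by last sprint's code
new = {"name": "Widget", "version": 2,
       "contacts": [{"kind": "email", "value": "hi@widget.test"}]}

assert upgrade(old)["version"] == 2
assert upgrade(new)["contacts"][0]["kind"] == "email"
```

Each sprint can add a new branch to `upgrade` instead of scheduling a schema-change window.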
One Size Doesn't Fit All: One-size-fits-all is over. Use multiple database
technologies as part of your standard enterprise technology platform. You won't
want dozens (that would be far too complex), but more than one is optimal. A
good model for the future is to have three primary tools: an RDBMS, a relational
data warehouse and a NoSQL database. For each project or sub-problem, use
whichever tool is best. Augment with niche tools (e.g., a time series database)
for special cases. The above approach is highly compatible with service-oriented
architectures, which you should be using. Monolithic hub-and-spoke architectures
lead to late projects and unchangeable systems. Instead, build web services,
with each one potentially having its own database or data mart behind it.
Go Commodity: The rise of commodity hardware as a viable production platform
has made it possible to deploy multi-node systems quickly. Newer database
technologies are designed with commodity servers in mind. Companies are
moving away from big iron servers and embracing this approach. By adopting a
commodity server deployment model, there is less of a dependency on
proprietary mechanisms, and vendor lock-in is often avoided. Find the sweet spot
on the price-performance curve and buy servers of that size. Don't buy $1k
servers; you'll have too many to manage (or even plug in!). But don't go too big
either. Many organizations are standardizing on $10k commodity Xeon (or AMD)
based servers with gigabit Ethernet.
Use Solid State Drives a Lot: Traditional spinning disks have increased in
capacity and data transfer rates by a factor of one thousand, yet random I/O
times have barely budged over a decade. If you are doing any random I/O at all,
you should use SSDs instead. Commodity SATA-style SSDs can work surprisingly
well. Be sure to mirror them; they still fail even though there are no moving
parts (except electrons!). Reserve 20%+ of the disk's space as un-partitioned to
give the drive room to optimize random writes and avoid excess write
amplification.
For sequential I/O, stick to spinning disks. Thus, use spinning disks for Hadoop
batch processing and for backups. Some have predicted that eventually 99% of data
may be stored on spinning disks, yet 99% of accesses will happen on SSDs.
With spinning disks being the main place for backups, that is conceivable.
Recommended by: Jon Cohn, CTO, VP IT Architecture
