
Bill Inmon

Forest Rim Technology


PO Box 210
200 Wilcox Street
Castle Rock, CO
80104
303-814-8970
whinmon@msn.com

TEXTUAL BUSINESS INTELLIGENCE


By W H Inmon
In the beginning there were application systems. And for a variety of reasons, not the
least of which was the lack of corporate data, organizations began to build data
warehouses. In a short amount of time data warehouses began to spread around the world.


And with data warehouses came the corporation’s opportunity to look at information
across the corporation. But building a data base of integrated, historical, granular data
was not enough. As powerful as the data inside a data warehouse can be, unless the end
user can unleash the potential of the data warehouse, there is not much value in building
one.

CLASSICAL BUSINESS INTELLIGENCE


Soon Business Intelligence (BI) appeared. BI is the software that is needed to go into a
data warehouse and examine the data, the relationships, and the information found
there. Once BI appeared, organizations could make sense of the data found in the data
warehouse.


With BI, organizations could create reports, transactions, and very sophisticated analyses
of the data found in the data warehouse. Among other things, graphical displays of
information were popular. The granular data found in the data warehouse provided a very
firm foundation for the analysis and discovery of corporate information.

BI was designed to operate on the data found in the data warehouse. And exactly what
was the essential nature of the data found in the data warehouse? The data found in the
classical data warehouse was –

- numeric, where numbers can be added and subtracted
- repetitive, where the same type of values occur over and over
- accompanied by supporting data (points of interest) that surrounds and is attached to
the numeric data.


CONTENTS OF THE CLASSICAL DATA WAREHOUSE


Let’s take a look at what the typical contents of a data warehouse look like. Suppose an
organization has an integrated list of the checks written by individuals, the date of the
check, and the location where the check was written.

Date          Name            Location          Amount
Aug 3, 2010   Sarah Inmon     Lima, Peru         2,347.70
Aug 3, 2010   Brook Sadler    La Pax, Mx         3,346.87
Aug 3, 2010   Beasley Smith   Bogota, Chile     10,245.36
Aug 3, 2010   Tony Velez      Caracas, VZ        4,556.12
Aug 4, 2010   Nancy Jones     Juarez, Mx         7,114.09
Aug 4, 2010   Rbta Ross       Rio, Brasil        9,109.23
Aug 4, 2010   Joan Jett       Sao Paolo, Bz      3,339.87
Aug 4, 2010   Glen Frey       Brasilia, Bz       2,109.25
...

Such an arrangement of data might be typical for the contents of a classical data
warehouse. Consider how the data might be used for analysis. Some data will be used for
selecting data and discriminating data from other data. Other data – numeric data – will
be used in calculations and comparisons. Given the typical data base that has been
described, the analyst could examine such things as –

- how many checks were written on a given day
- how much money was spent in Peru
- how much money changed hands in South America
- what was the largest check written
- and so forth.
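The kinds of questions listed above can be sketched in a few lines of Python. This is only an illustration of selection, grouping, and calculation over the sample check data, not a depiction of any particular BI tool. The function names and the tuple layout are assumptions made for this sketch; the values are copied from the sample data as printed.

```python
from collections import Counter
from datetime import date
from decimal import Decimal

# Sample check records copied from the table: (date, name, location, amount).
# The tuple layout is an assumption made for this sketch.
CHECKS = [
    (date(2010, 8, 3), "Sarah Inmon", "Lima, Peru", Decimal("2347.70")),
    (date(2010, 8, 3), "Brook Sadler", "La Pax, Mx", Decimal("3346.87")),
    (date(2010, 8, 3), "Beasley Smith", "Bogota, Chile", Decimal("10245.36")),
    (date(2010, 8, 3), "Tony Velez", "Caracas, VZ", Decimal("4556.12")),
    (date(2010, 8, 4), "Nancy Jones", "Juarez, Mx", Decimal("7114.09")),
    (date(2010, 8, 4), "Rbta Ross", "Rio, Brasil", Decimal("9109.23")),
    (date(2010, 8, 4), "Joan Jett", "Sao Paolo, Bz", Decimal("3339.87")),
    (date(2010, 8, 4), "Glen Frey", "Brasilia, Bz", Decimal("2109.25")),
]


def checks_per_day(checks):
    """Grouping: count the checks written on each date."""
    return Counter(written_on for written_on, _, _, _ in checks)


def total_for_country(checks, country):
    """Selection plus addition: sum the amounts whose location ends in `country`."""
    return sum(
        (amount for _, _, location, amount in checks
         if location.split(", ")[-1] == country),
        Decimal("0"),
    )


def largest_check(checks):
    """Comparison: return the record with the largest amount."""
    return max(checks, key=lambda check: check[3])
```

The nonnumeric fields (date, location) are used only to select and group records, while the numeric field (amount) is the only one that is added or compared.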

[Figure: the same check data, annotated with the operations each kind of data supports.
Nonnumeric data: selection, grouping, discrete reporting.
Numeric data: addition, subtraction, multiplication, comparison.]

In order to do the analysis and the calculations, a BI tool could be used on top of the data.

For many years data warehouses and BI worked more or less as described. And, even
though few people were aware of it, BI tools focused on operating on repetitive numeric
data for calculation and comparison, while nonnumeric data served the purpose of allowing
data to be selected and grouped together. BI tools simply expected to find repetitive
data in the data warehouse, where the same type of data was repeated over and over.

ENTER THE UNSTRUCTURED DATA WAREHOUSE


But a profound change in data warehousing has occurred. Today it is possible to build an
entirely different kind of data warehouse. Today it is possible to build a data warehouse
that is based on text. And text has an entirely different set of properties than data that
classically has been placed in a data warehouse.

Now it is possible to build a data warehouse using textual ETL, such as that provided by
Forest Rim Technology. Now unstructured text can be read into Textual ETL and a new
type of data warehouse can be built.


The contents of an unstructured data warehouse are very different from those of a classical
data warehouse. The contents of an unstructured data warehouse are, not surprisingly,
text. However, the text that arrives in the unstructured data warehouse is formatted into a
standard relational data base. For years organizations have been able to place text in a
relational data base in the form of blobs. But once text is placed into a relational data base
in the form of a blob, there is not a lot that can be done with it.

Instead textual ETL passes the text through a myriad of algorithms before the text is
placed in the relational data base. (NOTE: most of the important algorithms are patent
pending. See Forest Rim Technology for licensing opportunities.) The net result is a
relational data base that can be used for analytical purposes.
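The difference between storing text as a blob and storing it in an analyzable relational form can be illustrated with a minimal sketch. The code below is not Forest Rim Technology's Textual ETL and does not reproduce any of its algorithms; it assumes a plain regular-expression tokenizer and a hypothetical table named `word_occurrence`, using Python's standard `sqlite3` module.

```python
import re
import sqlite3


def load_text(conn, doc_id, text):
    """Store one row per word: document id, byte offset, lowercased word.

    A stand-in for the idea of turning text into relational rows;
    the table and column names are hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_occurrence ("
        "doc_id TEXT, byte_offset INTEGER, word TEXT)"
    )
    rows = []
    for match in re.finditer(r"[A-Za-z']+", text):
        # Offsets are measured in bytes of the UTF-8 encoding, not characters.
        offset = len(text[:match.start()].encode("utf-8"))
        rows.append((doc_id, offset, match.group().lower()))
    conn.executemany("INSERT INTO word_occurrence VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
load_text(conn, "comment-1", "Your products are great but your service is lousy")
load_text(conn, "comment-2", "I want my money back")

# Once the words are rows, ordinary SQL can locate them.
service_rows = conn.execute(
    "SELECT doc_id, byte_offset FROM word_occurrence WHERE word = 'service'"
).fetchall()
```

Unlike a blob, each word is now an addressable row, so it can be selected, joined, and counted with ordinary SQL.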

[Figure: unstructured text passing through Textual ETL. The figure shows overlapping
samples of text, including the following.]

Customer comments:

“...As we see it, the event was a success...”
“...Thanks for the sale. I enjoyed it greatly...”
“...I think you ought to know about what went...”
“...I went down the street and saw the same thing...”
“...I want to return my purchase...”
“...I found a stain on the bottom of the...”
“...Let me tell you how pleased I was when I...”
“...Your salesperson is outrageous. Do you know...”
“...Your products are great but your service is lousy...”
“...I want my money back...”

A narrative passage:

“We went for a walk down the park lane. It went by the river and there was a bridge to
cross the creek at one point. She was talking so intently that she never even noticed the
bridge. She was wrapped up in her thoughts. First there had been the [...]oss at graduate
school. Then there was the Derek. And Tally. She just couldn’t take it any more. There
had to be another answer. Maybe she needed a change. Maybe a change of climates was
what was called for. The ducks in the pond went by and were followed by a brood of six
small paddlers, each mimicking the mother....”

A stock offering notice:

“The offering includes stock option rights exercisable at .10 per share by Nov 15. In
addition, there are warrants as well. The entire equity was held by three people. Now that
there was to be a distribution, they would all profit.”

A car dealership description:

“The dealership offered the latest model. Of course you can order off of the Internet and
pick up the car at the dealership. When you do it this way you get to choose all the
options you want. But the price is not negotiable and you are responsible for selling your
car....”

But creating a new form of data warehouse leads to its own challenges (as well as
opportunities). The first thing the organization discovers is that classical BI does not
work very well with an unstructured data warehouse. What is needed is an entirely
different kind of BI. What is needed is Textual BI.


TEXTUAL BUSINESS INTELLIGENCE


Why is there a need for a new and different kind of BI? The answer is simple: the data
in an unstructured data warehouse is fundamentally different from the data found in a
classical data warehouse. Let’s start with numeric data. One of the essential properties of
textual data is that it is decidedly not numeric. Textual data consists of words, and you
cannot add or subtract words. About the best you can do is to count words. So one of the
major differences between a structured data warehouse and an unstructured data
warehouse is the ability to do calculations and comparisons against the basic data found
in the data warehouse.

But there is another important difference as well. Universally, classical data warehouses
contain repetitive data. In a classical data warehouse, the same type of data appears over
and over. But in an unstructured data warehouse there may or may not be any repetition.
In this regard an unstructured data warehouse is fundamentally different from a classical
data warehouse. And the lack of or existence of repetition makes a big difference in the
type of BI that can be done against the data warehouse, as shall be seen.
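The point that words can be counted but not added or subtracted can be shown with a short sketch. It assumes a simple regular-expression notion of a "word"; the function name `word_counts` is hypothetical, and the sample sentence combines two of the customer comments shown earlier.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each word occurs, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


counts = word_counts("I want my money back. I want to return my purchase.")
```

Counting is the only arithmetic involved. There is no meaningful sum or difference of "money" and "purchase" in the way there is for two check amounts.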


Under normal circumstances data is not repetitive in an unstructured data warehouse. But
there are some circumstances, for some types of data, where there is a certain amount
of repetition in an unstructured data warehouse.

UNSTRUCTURED DATA – REPETITIVE/NON REPETITIVE


As an example of repetition occurring in an unstructured data warehouse, consider
contracts. Suppose there is a collection of oil and gas leases, a form of contract. The
first contract is for landowner ABC, the second contract is for landowner BCD, and so
forth. Each contract is different from every other contract. But taken as a whole, there is
a great similarity among the contracts found in the collection. The structure of
the contracts, much of the fine print, and so forth are common among all the contracts. So
the contracts in the collection are collectively structurally repetitive, even if the text is
not exactly the same.
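One rough way to see this kind of structural repetition is to compare documents word by word. The sketch below uses Python's standard `difflib.SequenceMatcher` as a crude proxy for similarity; it is an assumption made for illustration, not a method described by the author, and the two lease sentences are invented. The two comment sentences are taken from the customer samples shown earlier.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Return a 0..1 ratio of how much two texts share, compared word by word."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


# Invented lease sentences that differ only in the landowner.
lease_abc = "This lease is made between landowner ABC and the lessee"
lease_bcd = "This lease is made between landowner BCD and the lessee"

# Two unrelated free-form comments.
comment_1 = "Thanks for the sale I enjoyed it greatly"
comment_2 = "I want my money back"

lease_score = similarity(lease_abc, lease_bcd)
comment_score = similarity(comment_1, comment_2)
```

The two leases score high because only the landowner differs, while the two free-form comments share almost nothing.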

Now consider the transcripts from a call center. Certainly a person participating in the
call center conversation can say whatever he/she wants to say. But most operators
working a call center have been carefully trained to structure the conversation. As a result
there is a certain similarity to the structure of each call.

And there are plenty of other examples of structural repetition in the world of text.

But there are plenty of examples where there is no structural repetition in text. Consider
emails. In emails, a person can say whatever he/she wants to say. The email can be short
or long. The email can be formal or informal. The email can be in any language, and so
forth. There simply is no structural conformity of text when it comes to email.

Repetitive             Non repetitive

contracts              email
call center calls      law
insurance claims       medical records
warranty claims        depositions
log records            doctors notes
real estate filings

REPETITIOUS CONTRACTS
Pictured below are three different contracts. There is great similarity between the
contracts but each contract is certainly different from any other contract.

NON REPETITIOUS LAW


As an example of no structural uniformity, consider the law that is shown. Below are two
sections of the 848-page Dodd-Frank law, passed in 2010. There is no structural
uniformity whatsoever among the different sections of the law.

BI AND REPETITIVE DATA


There is a relationship between the existence or nonexistence of repetition in a document
and the type of BI that can be used. When it comes to text, if the text is repetitive, then
both classical BI and textual BI can be used against the text. But if there is no repetition,
then only textual BI can be used, as seen below.

[Figure: repetitive text can be analyzed with both classical BI and textual BI; non
repetitive text can be analyzed only with textual BI.]

Another way of looking at this concept is that an unstructured data warehouse can have
two types of BI used, as in the case of contracts, while completely non repetitive text can
have only textual BI used against it, as in the case of the Dodd Frank law. The diagram
below makes this point.

[Figure: contracts (repetitive text) are served by both classical BI and textual BI; law
(non repetitive text) is served only by textual BI.]

TEXTUAL BI – AN EXAMPLE
So what does Textual BI look like? Consider the following example of Textual BI, from
Forest Rim Technology.

In the diagram below, the basic screen is shown. There is basic query management, there
is parametric control of the query, there is execution of the query, and there is the
display of the results of the query. In many ways this screen is analogous to a SQL
statement. The difference is that this elaborate query management tool is built
specifically for the management of textual data, not for general purpose access and
analysis of a relational data base.

The most interesting part of the textual Business Intelligence query is in the execution.
The screen below shows a simple query where there is a search for all contracts where
there is a mention of “naphtha” and “helium”.

The query is executed and there are six occurrences of contracts in which “naphtha” and
“helium” are mentioned.

Now that the query has been executed, the results are displayed. First the basic
parameters of the query are shown. Note that the results can be displayed in four ways –

- showing the contracts where the text is found,
- showing the byte locations in the contracts where the references are found,
- showing “snippets” of text where the references are found,
- showing the entire contract where the references are found.
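The four ways of displaying results can be sketched in Python. This is not the Forest Rim Technology product or its query screen; it is a minimal illustration under stated assumptions. The contract texts, the identifiers `lease-001` and `lease-002`, the function names, and the `>>` `<<` highlight markers are all invented for the example.

```python
import re

# Hypothetical contract texts, invented for illustration only.
CONTRACTS = {
    "lease-001": "Lessee may extract naphtha and helium from the leased land.",
    "lease-002": "Lessee may extract crude oil only.",
}


def _pattern(terms):
    """Compile a whole-word, case-insensitive pattern matching any of `terms`."""
    return re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )


def byte_locations(text, term):
    """View 2: the UTF-8 byte offset of every match of `term` in `text`."""
    return [len(text[:m.start()].encode("utf-8"))
            for m in _pattern([term]).finditer(text)]


def search(contracts, terms):
    """View 1: the contracts that mention ALL terms, with their byte locations."""
    hits = {}
    for name, text in contracts.items():
        positions = {term: byte_locations(text, term) for term in terms}
        if all(positions.values()):  # an empty list means the term is missing
            hits[name] = positions
    return hits


def snippets(text, term, width=20):
    """View 3: each match with up to `width` characters of context either side."""
    return [text[max(0, m.start() - width):m.end() + width]
            for m in _pattern([term]).finditer(text)]


def highlight(text, terms):
    """View 4: the entire text with every match wrapped in >> and <<."""
    return _pattern(terms).sub(lambda m: ">>" + m.group() + "<<", text)


hits = search(CONTRACTS, ["naphtha", "helium"])
```

A single pass that records match positions is enough to drive all four views; they differ only in how much surrounding text is shown.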

Suppose the analyst merely wants to find the contracts where the references are found.
The results would look like –

Or suppose the analyst wants to find the exact byte location where the references are
found. The results would look like –

Or suppose the analyst wants to see snippets of text where the references are found. The
results would look like –

There are more snippets to be shown. They look like –

Or suppose the analyst wants to see the entire document and at a glance see where the
references are found in the document. The results would look like –

There are then many different ways to look at and analyze text using Textual Business
Intelligence.

The example that has been shown was selected for its simplicity. Textual Business
Intelligence can look at text in many different ways, other than the simple example that
has been shown. The diagram below depicts just some of the many sophisticated ways
that analysis can be done with textual Business Intelligence.


Text search can be done in many ways –

- by a word
- by words (AND/OR)
- by categories of words
- by indexes of words
- by words in proximity
- many other ways
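Several of the search styles listed above can be sketched with a few small functions. This is an illustration only: the function names, the word-gap definition of proximity, and the `complaint` category with its word list are assumptions made for the example, and searching by index is not covered.

```python
import re


def tokenize(text):
    """Split text into lowercased words."""
    return re.findall(r"[a-z']+", text.lower())


def has_word(text, word):
    """Search by a single word."""
    return word.lower() in tokenize(text)


def has_all(text, words):
    """Search by words combined with AND."""
    tokens = set(tokenize(text))
    return all(w.lower() in tokens for w in words)


def has_any(text, words):
    """Search by words combined with OR."""
    tokens = set(tokenize(text))
    return any(w.lower() in tokens for w in words)


# Hypothetical category: a name standing for a set of related words.
CATEGORIES = {"complaint": {"lousy", "outrageous", "stain", "return"}}


def has_category(text, category):
    """Search by category: match if any word in the category occurs."""
    return has_any(text, CATEGORIES[category])


def within_proximity(text, first, second, max_gap):
    """Search by proximity: both words occur at most `max_gap` words apart."""
    tokens = tokenize(text)
    first_at = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_at = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= max_gap for i in first_at for j in second_at)
```

For example, in the comment “Your products are great but your service is lousy”, the words “service” and “lousy” are two words apart, so a proximity search with a gap of 2 matches, while “products” and “lousy” do not.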

COMPLEMENTARY BUSINESS INTELLIGENCE


As a final note, is textual BI a replacement for classical BI? Not at all.
Textual BI and classical BI are complementary. A sophisticated organization is going to
need BOTH forms of Business Intelligence.


References

- BUILDING THE UNSTRUCTURED DATA WAREHOUSE, W H Inmon, Krish Krishnan,
TechnicsPubs, 2011
- TAPPING INTO UNSTRUCTURED INFORMATION, W H Inmon, Tony Nesavich,
Pearson Publications, 2008
- DW 2.0 ARCHITECTURE FOR THE NEXT GENERATION OF DATA WAREHOUSING,
W H Inmon, Morgan Kaufmann, 2009
