
Types of Digital Data

Types and Distribution of Digital Data


Unstructured
80%-90%
Does not conform to a data model
Not in a format that a program can use directly
Chat, ppt, images, videos, letters, research papers, body of an email
Semi-structured
Does not conform to a data model but has some structure
Not in a format that a program can use directly
Emails, XML, Markup languages
Structured
Data is in an organized form
Data can be easily accessed by a computer program
Relationships exist between data items
Databases (DB)
GOODLIFE Database
Snapshot of structured data
Patient Index Card
Patient ID
Name
Age
Nurse name
Temp
BP
Date
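The Patient Index Card above can be sketched as a relational table. This is a minimal illustration using Python's built-in sqlite3 module; the table name, column types, units and sample rows are assumptions made for the example and are not taken from the GOODLIFE database.

```python
import sqlite3

# Hypothetical schema: column names mirror the fields on the Patient Index Card.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patient_index_card (
        patient_id  INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        age         INTEGER,
        nurse_name  TEXT,
        temp        REAL,   -- body temperature (unit assumed: degrees Fahrenheit)
        bp          TEXT,   -- blood pressure, e.g. '120/80'
        date        TEXT    -- ISO date string
    )
    """
)

# Made-up sample rows, for illustration only.
rows = [
    (101, "A. Patient", 42, "Nurse One", 98.6, "120/80", "2024-01-15"),
    (102, "B. Patient", 67, "Nurse Two", 100.2, "140/90", "2024-01-15"),
]
conn.executemany(
    "INSERT INTO patient_index_card VALUES (?, ?, ?, ?, ?, ?, ?)", rows
)

# Every row has the same fixed fields, so a program can query them directly.
feverish = conn.execute(
    "SELECT name, temp FROM patient_index_card WHERE temp > 99.0"
).fetchall()
print(feverish)
```

Each card becomes one row and each field one column, which is what makes the data structured.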
Structured data

Characteristics
Conforms to data model
Data stored in rows and columns
Data resides in fixed fields
Definition, format and meaning of data are explicitly known
Attributes in a group are the same
Similar entities are grouped
Structured Data
Sources
Databases
Spreadsheets
SQL
OLTP systems
Ease with structured data
Storage
Scalability
Security
Update and delete
Structured data
Ease of retrieval
Retrieving information
Indexing
Searching
Mining
BI Operations
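Indexing, searching and a simple BI-style aggregate on structured data can be sketched as follows. The example uses Python's built-in sqlite3 module; the visits table, the index name and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (patient_id INTEGER, nurse_name TEXT, temp REAL)"
)
# Made-up sample data, for illustration only.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "Nurse One", 98.4), (2, "Nurse Two", 101.1), (3, "Nurse One", 99.5)],
)

# Indexing: an index on nurse_name lets the engine look rows up by that column.
conn.execute("CREATE INDEX idx_visits_nurse ON visits (nurse_name)")

# Searching: retrieve rows by the value of a fixed field.
by_nurse = conn.execute(
    "SELECT patient_id FROM visits WHERE nurse_name = ? ORDER BY patient_id",
    ("Nurse One",),
).fetchall()

# BI-style operation: average temperature recorded per nurse.
avg_temp = conn.execute(
    "SELECT nurse_name, AVG(temp) FROM visits "
    "GROUP BY nurse_name ORDER BY nurse_name"
).fetchall()
print(by_nurse, avg_temp)
```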
Unstructured Data
GOOD LIFE Health care system
About 80-85% of the data in any organization is in unstructured format
Characteristics
Does not conform to any data model
Cannot be stored in rows and columns
Not in any particular format
Not easily used by program
Does not follow any rules or semantics
Has no easily identifiable structure
Unstructured Data
Sources
Web pages
Memos
Videos
Images
Body of an email
Word document, PPT
Chat
Reports
White papers
Surveys
Unstructured Data
Managing Unstructured data
Indexing
Tags/Metadata
Classification/Taxonomy
CAS (Content Addressable Storage)
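The idea behind content addressable storage can be sketched in a few lines: an object is stored and retrieved by a hash of its content rather than by a file name or location. The class below is a toy in-memory illustration, not a real CAS product; the class name and the choice of SHA-256 are assumptions.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory CAS: an object's address is the SHA-256 of its bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        address = hashlib.sha256(content).hexdigest()
        # Identical content always maps to the same address, so it is stored once.
        self._objects[address] = content
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentAddressableStore()
a1 = store.put(b"Discharge summary for patient 101")
a2 = store.put(b"Discharge summary for patient 101")  # duplicate content
a3 = store.put(b"Placeholder bytes standing in for an X-ray image")

print(a1 == a2, len(store))
```

Because the address is derived from the content, duplicates collapse to one stored object and any change to the content yields a different address.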
Unstructured Data
Challenges
Storage space
Scalability
Information retrieval
Security
Update and delete
Indexing and searching
Solutions
Changing format
Developing new hardware
Storing in RDBMS/BLOBs
Storing in XML format
CAS
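Storing unstructured content in an RDBMS as a BLOB, with descriptive metadata in ordinary columns, might look like the sketch below. It uses Python's built-in sqlite3 module; the documents table, its columns and the placeholder payload are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        id        INTEGER PRIMARY KEY,
        filename  TEXT,
        mime_type TEXT,
        tags      TEXT,   -- comma-separated tags (a simplification)
        body      BLOB    -- the raw, unstructured content
    )
    """
)

# Placeholder bytes standing in for a real image file.
payload = b"\x89PNG placeholder bytes, not a real image"
conn.execute(
    "INSERT INTO documents (filename, mime_type, tags, body) VALUES (?, ?, ?, ?)",
    ("scan_001.png", "image/png", "xray,chest", payload),
)

# The BLOB itself is opaque to the database; retrieval relies on the metadata.
filename, body = conn.execute(
    "SELECT filename, body FROM documents WHERE tags LIKE ?", ("%xray%",)
).fetchone()
print(filename, len(body))
```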
Unstructured Data
Challenges in extracting information
Interpretation
Tags
Indexing
Deriving meaning
File formats
Classification/taxonomy
Solutions
Tags
Text mining
Application platforms
Classification/taxonomy
Naming conventions/standards
Unstructured Information Management Architecture (UIMA)
Open-source platform originally developed by IBM (now Apache UIMA)
Integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from unstructured data
It stores the extracted information in a structured format
Various analysis engines analyze data in different ways, such as:
Breaking up documents into separate words
Grouping and classifying according to a taxonomy
Detecting parts of speech (POS), grammar and synonyms
Detecting events and times
Detecting relationships between various elements
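The general idea of chaining analysis engines, each adding structured annotations to a document, can be sketched as below. This is plain Python and does not use or imitate the actual UIMA API; the engine functions, the taxonomy and the sample sentence are all hypothetical.

```python
import re


def tokenize(doc):
    """Engine 1: break the document into separate lowercase words."""
    doc["tokens"] = re.findall(r"[A-Za-z]+", doc["text"].lower())
    return doc


# Hypothetical taxonomy: category name -> keywords that signal it.
TAXONOMY = {
    "cardiology": {"bp", "heart"},
    "general": {"fever", "temperature"},
}


def classify(doc):
    """Engine 2: assign categories whose keywords appear among the tokens."""
    tokens = set(doc["tokens"])
    doc["categories"] = sorted(
        name for name, keywords in TAXONOMY.items() if tokens & keywords
    )
    return doc


def detect_dates(doc):
    """Engine 3: detect ISO-style dates (YYYY-MM-DD) in the raw text."""
    doc["dates"] = re.findall(r"\d{4}-\d{2}-\d{2}", doc["text"])
    return doc


def run_pipeline(text, engines):
    """Run each engine in order; the result is a structured record."""
    doc = {"text": text}
    for engine in engines:
        doc = engine(doc)
    return doc


result = run_pipeline(
    "Patient reported fever on 2024-01-15; BP was high.",
    [tokenize, classify, detect_dates],
)
print(result["tokens"], result["categories"], result["dates"])
```

The output is a dictionary with fixed keys, which mirrors the point that the results of analysis are kept in a structured format.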
Semi-structured data
Characteristics
Does not conform to a data model but contains tags and elements
Cannot be stored in rows and columns
Tags and elements describe the data
Metadata is insufficient
Attributes in a group may not be the same
Similar entities are grouped
Semi-structured data
Example
Blood test report
Sources
Email
XML
TCP/IP packets
Zipped files
Binary executables
Mark-up languages
Integration of data from heterogeneous sources
Semi-structured data
Managing
Schemas
Graph based data models
XML
Storing challenges
Cost, RDBMS
Irregular, implicit and partial structure
Evolving schemas
Blurred distinction between schema and data
Solution
XML, RDBMS, Special purpose DBMS
Object exchange Model (OEM)
Storing and exchanging semi-structured data
Graph-like structure: objects are entities, labels are attributes, leaves contain data
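A graph-like structure in the spirit of OEM can be sketched with a small Python class: each object is either a leaf holding an atomic value or a node with labeled edges to child objects. The class name, method names and the blood test data are hypothetical, and real OEM implementations also carry object identifiers, which are omitted here.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class OEMObject:
    """A node: holds an atomic value (leaf) or labeled edges to children."""

    value: Union[str, int, float, None] = None
    children: list = field(default_factory=list)  # list of (label, OEMObject)

    def add(self, label, child):
        self.children.append((label, child))
        return child

    def find(self, label):
        return [child for (lbl, child) in self.children if lbl == label]


# Made-up blood test report, for illustration only.
report = OEMObject()
report.add("patient", OEMObject("A. Patient"))

test1 = report.add("test", OEMObject())
test1.add("name", OEMObject("Hemoglobin"))
test1.add("value", OEMObject(13.5))

test2 = report.add("test", OEMObject())
test2.add("name", OEMObject("Platelets"))
# No "value" edge on test2: irregular structure is allowed.

names = [t.find("name")[0].value for t in report.find("test")]
print(names, test2.find("value"))
```

The two test objects share a label but not the same set of attributes, which is the irregular, partial structure that makes the data semi-structured.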
Semi-structured Data
Challenges- extracting information
Flat files
Heterogeneous sources
Incomplete or irregular structures
Solution
Indexing
OEM
XML
Mining tools
Solution for semi-structured data management: XML

Extensible Markup Language (XML) is an open-standard markup language
Written in plain text
Independent of hardware and software
Designed to store and transport data over the internet
It allows data to be stored in a hierarchical or nested format
Users can define their own tags to store data
Separation of content (XML) and presentation (XSL)
XML
No predefined tags
Tags in angle brackets (< >) are user-defined
It is a standard for exchanging data over the internet
DTDs provide a partial schema
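A short example of user-defined tags and nested data, parsed with Python's standard xml.etree.ElementTree module. The tag names and values are made up for illustration, loosely following the blood test report mentioned earlier.

```python
import xml.etree.ElementTree as ET

# User-defined tags: none of these names are predefined by XML itself.
xml_text = """
<bloodTestReport>
  <patient id="101">A. Patient</patient>
  <test>
    <name>Hemoglobin</name>
    <value unit="g/dL">13.5</value>
  </test>
  <test>
    <name>Platelets</name>
  </test>
</bloodTestReport>
"""

root = ET.fromstring(xml_text.strip())

patient = root.find("patient")
# findtext returns None when a child element is missing,
# which is how the irregular second <test> shows up.
tests = [(t.findtext("name"), t.findtext("value")) for t in root.findall("test")]

print(patient.get("id"), patient.text, tests)
```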
Semi-structured data vs XML
Semi-structured data
Consists of attributes, objects, atomic values
XML
Consists of tags, elements, CDATA
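The correspondence between the two views can be sketched by converting a nested record of attributes and atomic values into XML elements. The to_xml helper and the sample record are hypothetical, and the mapping shown (keys become tags, atomic values become element text) is one simple choice among several.

```python
import xml.etree.ElementTree as ET


def to_xml(tag, data):
    """Map a nested dict to an XML element; atomic values become text."""
    elem = ET.Element(tag)
    if isinstance(data, dict):
        for key, value in data.items():
            # A list value becomes repeated child elements with the same tag.
            items = value if isinstance(value, list) else [value]
            for item in items:
                elem.append(to_xml(key, item))
    else:
        elem.text = str(data)
    return elem


# Made-up semi-structured record: the second test has no "value" attribute.
record = {
    "patient": "A. Patient",
    "test": [
        {"name": "Hemoglobin", "value": 13.5},
        {"name": "Platelets"},
    ],
}

xml_string = ET.tostring(to_xml("report", record), encoding="unicode")
print(xml_string)
```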