FOR USE WITH COURSE 3Z100 ONLY
Student Notebook
ERC 1.3
Trademarks
IBM and the IBM logo are registered trademarks of International Business Machines
Corporation.
The following are trademarks of International Business Machines Corporation, registered in
many jurisdictions worldwide:
AIX, DB2, InfoSphere, Initiate Master Data Service, Initiate Systems, Initiate, RDN, WebSphere
Intel and Pentium are trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
both.
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of
Oracle and/or its affiliates.
VMware and the VMware "boxes" logo and design, Virtual SMP and VMotion are registered
trademarks or trademarks (the "Marks") of VMware, Inc. in the United States and/or other
jurisdictions.
Other product and service names might be trademarks of IBM or other companies.
Contents
Trademarks
Course description
Agenda
Derivation
Selection
Tasks
Performance targets
Data Extract Guide
Data description
File format
Assumptions
Customization
Sample data
Preventing extract errors
Online data transmission
Security
Inbound message requirements appendix
Introduction
Components overview
Configuration details
Initiate Technical Boot Camp Copyright IBM Corp. 2010, 2011
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
V5.4.0.3
Course description
Purpose
The Technical Boot Camp will prepare an implementation team
member for the types of tasks that they would be expected to perform
on their first project. The Boot Camp is not focused on the Marketing
or End-User side of the IBM Initiate Master Data Service platform, but
rather on the behind-the-scenes processes that take place in a typical
implementation.
Prerequisites
Students should have reviewed the General Product Overview and the
Boot Camp Training Kit prior to class.
Objectives
Your goal is to learn the individual steps that make up the
implementation process. Many of the activities in the Technical Boot
Camp have interdependencies and prerequisites. The following
pages outline the general flow of the Technical Boot Camp and
note the dependencies. In this course, you will:
Install and navigate Workbench
Learn to create an Initiate member model
Work with a data extract and CloverETL graphs
Design and build an algorithm
Perform an initial data load and bulk cross match
Analyze the quality of the data in your database
Perform threshold and bucketing analysis
Explore the weight generation
Test the configuration using a CloverETL graph
Configure LDAP
Configure IBM Initiate Inspector
Agenda
Day 1
Welcome
Unit 1: Introduction to the Boot Camp project
Unit 2: Installing Workbench
Unit 3: Configuring the Initiate member model
Unit 4: Configuring the algorithm
Day 2
Unit 4: Configuring the algorithm, cont.
Unit 5: Cleaning the data extract
Unit 6: Deploying the instance
Unit 7: Overview of the Initiate member model
Unit 8: Deriving data
Day 3
Unit 9: Generating weights
Unit 10: Running a bulk cross match
Unit 11: Analyzing thresholds and matched pairs
Unit 12: Analyzing buckets and frequency based bucketing
Day 4
Unit 13: Reiterating the process
Unit 14: Managing users, groups, and permissions
Unit 15: Configuring and deploying Inspector
Day 5
Unit 16: Testing the hub configuration
Conclusion
Day 6 - 10
Sample Implementation
Unit 1. Introduction to the Boot Camp project
Overview
This unit will review milestones, data requirements, and general configuration needs for the
Boot Camp project. We will also review an Implementation Approach document that
explains the course as a project.
Dependencies
Students should have reviewed the General Product Overview and the Boot Camp Training
Kit prior to class.
Topics
This unit will cover:
Concepts and terms
Implementing the IBM Initiate Master Data Service
The Technical Boot Camp process
General rules of implementation
Common process dependencies
What is MDM?
Master Data Management (MDM) provides consistent and comprehensive core information
across an enterprise.
What is an EMPI?
Enterprise Master Person Index (EMPI) is the process of identifying each unique patient
within healthcare systems and assigning them an Enterprise Identifier (EID) so that
Electronic Medical Records can be cross-referenced to produce a full-bodied picture of a
patient's medical history.
What is CDI?
Customer Data Integration (CDI) is the process of determining a distinct set of customer
records and creating a single, unified view of the information across all sources.
What is a hub?
For most implementations, the IBM Initiate Master Data Service is the central point where
you go to locate information. We casually refer to the IBM Initiate Master Data Service as
"the Hub." Hubs are designed with regard to specific data and data relationship domains,
such as consumers, organizations, locations, patients, vehicles, households and
hierarchies. For example, an IBM Initiate Provider Hub tracks information about Doctors
and Facilities that provide healthcare to patients.
1. Review Customer Requirements: Review the requirements that the customer has defined for the Data Extract.
2. Configure Data Model: Configure and load an Initiate Member Model to fit the project needs. The data loaded includes metadata (like sources and attributes), validation lists, and lookup tables in Workbench.
3. Configure Algorithm: Design and build an algorithm to address your attributes, comparisons, search requirements, and bucketing design needs.
4. Clean Data Extract: Analyze the data to ensure it conforms to the specifications and fix any problems that are found.
5. Deploy Instance: Upload the data dictionary and algorithm configured and stored on Workbench to the IBM Initiate Master Data Service.
6. Derive Data: Parse a data extract to fit the Initiate data dictionary. You will create binary files, comparison strings, and bucket assignments.
7. Generate Weights: Measure the frequency of values (ignoring the anonymous values), and then assign weights accordingly.
8. Bulk Cross Match: Compare records, calculate comparison scores based on the weights, and then link the records that match.
9. Analyze and Review: Perform data analytics that assess attribute completeness, duplication rates, and threshold settings.
10. Reiterate: Tweak your settings or algorithm until you get the desired results.
1 - Review customer requirements
Prior to implementation, the requirements for the Data Extract have been defined. The
Data Extract is a subset of the data that is used to configure the initial implementation.
Dependencies
You will need a clear understanding of the data sources involved, the data fields
needed, and how to gather that data in a form that the IBM Initiate Master Data Service
software can consume. This helps you build your data dictionary.
2 - Configure data model
The Initiate member model (.imm) defines the way that the IBM Initiate Master Data
Service software stores, manages, and validates data. You will build a Data Dictionary from
scratch in class, but normally you will begin with a predefined dictionary.
Dependencies
The Data Extract Guide outlines the specific attributes and fields, and the Implementation
Approach defines additional data dictionary requirements.
3 - Configure algorithm
Typically, a generic algorithm is imported along with your project configuration, but in class
we will begin with an empty algorithm. This algorithm will need to be configured to address
the attributes you are using, the comparisons that you would like to use, and the bucketing
strategy that you would like to employ. You can use Workbench to make your edits. The
tool will validate your design and present you with a list of errors if there are any
inaccuracies in your algorithm design.
Dependencies
The algorithm is the brain of the IBM Initiate Master Data Service software. Therefore, the
proper data elements must be in place before you can fully develop the algorithm. After
your first pass at the implementation, you can make some tweaks to the algorithm. After
making those changes, you will need to derive data, generate weights and/or perform a
Bulk Cross Match again.
4 - Clean data extract
The Data Extract is a sampling of the data. You will test this data for basic adherence to the
Data Extract Guide's specifications and run CloverETL graphs against it to ensure proper
data format.
Dependencies
The Data Extract Guide outlines the data requirements. You will also need access to
Workbench and the CloverETL tool to perform the data cleansing.
5 - Deploy instance
Dependencies
The only real dependency is that you need to have a supported database platform.
Dependencies
You need to have the proper software installation files for your operating system (for
example, Windows 64-bit, Linux 64-bit, and so on) and an empty database in order to
create your instance.
Dependencies
You will need to have the IBM Initiate Master Data Service software installed. You will also
need to access the empty database and your hub instance.
Dependencies
You will need to have the IBM Initiate Master Data Service engine with a bootstrapped
database running before you can import the data dictionary. You can start modifying the
import files, though, as soon as you know the core needs and the fields that are to be
referenced in the hub. The Data Extract Guide and the Implementation Approach can guide
you through the configuration.
6 - Derive data
Derived data is essentially data that has been processed by the algorithm. The data
derivation process includes four main events:
1. Raw data is parsed into segment-specific unload files
2. Comparison strings are built from standardized data
3. Members are assigned bucket hashes
4. Binary files are created for faster computation
There are multiple methods to derive data. For example, the Derive Data and Create UNLs
(mpxdata) job takes raw data and builds member unload files, generates comparison
strings, assigns bucket hashes, and creates binary files for faster comparison. In contrast,
the Derive Data from UNLs (mpxfsdvd) job uses pre-existing member unload files to extract
and create comparison strings, bucket hashes, and binaries. After you have derived your
data, you will load the results into your database.
A configuration file (.cfg) acts as a map between the Data Extract and the Data Dictionary.
It literally indicates which field in an extract row goes to which attribute in the database. An
engine utility uses the .cfg file to parse a raw data extract into the table structure that Derive
Data and Create UNLs (mpxdata) expects.
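The mapping idea can be sketched in a few lines of Python. This is only an illustration and does not reproduce the real .cfg syntax: the pipe delimiter, the FIELD_MAP layout, the field names, and the sample row are all assumptions. Only the segment and attribute names (MEMNAME, MEMDATE, LGLNAME, BIRTHDT) come from this notebook.

```python
from collections import defaultdict

# Hypothetical map: extract column index -> (segment, attribute, field name).
# The real .cfg file uses its own syntax; this only shows the mapping concept.
FIELD_MAP = {
    3: ("MEMNAME", "LGLNAME", "last_name"),
    4: ("MEMNAME", "LGLNAME", "first_name"),
    7: ("MEMDATE", "BIRTHDT", "value"),
}


def parse_row(line, delimiter="|"):
    """Split one extract row and route each mapped field to its segment/attribute."""
    fields = line.rstrip("\n").split(delimiter)
    record = defaultdict(dict)
    for index, (segment, attribute, name) in FIELD_MAP.items():
        # Skip positions that are missing or empty in this row.
        if index < len(fields) and fields[index].strip():
            record[(segment, attribute)][name] = fields[index].strip()
    return dict(record)


# Assumed column order: source, source ID, gender, last, first, middle, suffix, birth date.
row = "A|1001|F|Jones|Mary|Ann||1980-02-14"
parsed = parse_row(row)
```

Each key of `parsed` pairs a segment with an attribute, which mirrors how a row in the extract ends up spread across several segment tables.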
Dependencies
You will need to have most of the components installed and configured, like the hub engine,
member model, .cfg file, and the algorithm. If changes are made to the algorithm, then data
will need to be re-derived.
You will need to know the order in which the fields appear in the Data Extract and the
corresponding Attribute names in the member model. You can build your .cfg file from
project documentation if the real files are not yet available. Check for accuracy before
deployment.
7 - Generate weights
The weight generation process is an integrated utility that goes through multiple steps to
measure the frequency of individual values in the database and then assigns weights to those
values, with the most common values weighing less and the rarest weighing more. The weight
generation process creates unload files and loads them into the database.
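The principle that common values weigh less and rare values weigh more can be illustrated with a small sketch. The logarithmic formula, the function name, and the sample names are assumptions made for illustration; the product's actual weight calculation is more involved and is not shown in this notebook.

```python
import math
from collections import Counter


def frequency_weights(values, anonymous=frozenset()):
    """Assign higher weights to rarer values, ignoring anonymous values."""
    counted = Counter(v for v in values if v not in anonymous)
    total = sum(counted.values())
    # Assumed formula: weight = log10(total / count), so rarer values score higher.
    return {value: math.log10(total / count) for value, count in counted.items()}


# Hypothetical last names; "TEST" stands in for an anonymous value.
last_names = ["SMITH"] * 6 + ["JONES"] * 3 + ["ZELENKA"] + ["TEST"] * 5
weights = frequency_weights(last_names, anonymous={"TEST"})
```

In this sample, ZELENKA appears once and receives the largest weight, while SMITH appears most often and receives the smallest.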
Dependencies
You must have the engine installed and your algorithm configured. If you have already
derived your data then weight generation will take less time, but the weight generation
utility can derive data for its own use. You should always check your weights before loading
them into the database.
8 - Bulk cross match
The bulk cross match (BXM) is a process that allows you to compare and link thousands of
records per second. The BXM is most commonly performed in the initial stage of the
implementation and again right before the system goes live. The BXM process is made up
of two primary jobs: Compare Members in Bulk (mpxcomp) and Link Entities (mpxlink).
After running the compare and link jobs, you will need to load the results into the database.
Dependencies
You must have derived data and weights before you perform the bulk cross match. That
also means that the engine, algorithm, and data dictionary must be in place. The BXM
process uses the weights to generate an aggregate comparison score. That score is
compared to the thresholds to determine auto-linking and task generation.
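The compare-then-link idea can be sketched as follows. This is a simplified illustration, not the mpxcomp or mpxlink implementation: the scoring rule (sum the weights of attributes that agree exactly), the weights, the threshold, and the sample records are all assumptions. A real run compares only members that share a bucket; the sketch compares every pair for brevity.

```python
from itertools import combinations


def score(a, b, weights):
    """Sum the weights of attributes on which two records agree (assumed rule)."""
    return sum(w for attr, w in weights.items()
               if a.get(attr) and a.get(attr) == b.get(attr))


def link_entities(records, weights, autolink_threshold):
    """Compare all pairs and group records whose score reaches the threshold."""
    parent = {rid: rid for rid in records}

    def find(rid):
        # Follow parent pointers to the group representative (with path halving).
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for left, right in combinations(records, 2):
        if score(records[left], records[right], weights) >= autolink_threshold:
            parent[find(left)] = find(right)

    entities = {}
    for rid in records:
        entities.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in entities.values())


records = {
    "A:1": {"name": "MARY JONES", "dob": "1980-02-14", "zip": "85001"},
    "B:7": {"name": "MARY JONES", "dob": "1980-02-14", "zip": "85002"},
    "B:9": {"name": "JOHN DOE", "dob": "1975-07-01", "zip": "85001"},
}
weights = {"name": 4.0, "dob": 5.0, "zip": 2.0}  # hypothetical weights
entities = link_entities(records, weights, autolink_threshold=8.0)
```

Here the two Mary Jones records agree on name and birth date, score 9.0, and are grouped into one entity; the third record shares only a zip code and stays on its own.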
9 - Analyze and review
Once your data is fully loaded into the IBM Initiate Master Data Service software, you
should run tests to establish how well your system and the data are performing. Through
the analysis tools in Workbench, you can review attribute completeness, score distribution,
and entity and bucket sizes, and perform threshold analysis.
Dependencies
Your core engine and data must be fully loaded in order to run the data analytics. Analysis
can all be done within Workbench.
10 - Reiterate
You will take the results of the analysis and make tweaks to your algorithm and data
dictionary, if necessary. After your edits you will usually re-derive your data, run another
BXM, and analyze the results again.
Dependencies
Bucket design changes usually require re-deriving, but not another BXM. Comparison
changes require new weights, re-derivation, and a new BXM. Some small tweaks only
require an engine restart or simply redeploying your configuration.
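The rules of thumb above can be written down as a small lookup, which is handy when several changes are made at once. The change-type keys, the step labels, and the ordering of the steps are assumptions for illustration; the text says these rules hold "usually", so treat the result as a starting point rather than a guarantee.

```python
# Steps to repeat after each kind of change, as summarized in the text above.
# The keys are hypothetical labels, not product terminology.
RERUN_STEPS = {
    "bucket_design": ["derive data"],
    "comparison": ["generate weights", "derive data", "bulk cross match"],
    "minor_setting": ["redeploy configuration or restart engine"],
}

# Assumed order in which the repeated steps would be carried out.
PROCESS_ORDER = [
    "redeploy configuration or restart engine",
    "derive data",
    "generate weights",
    "bulk cross match",
]


def steps_to_rerun(changes):
    """Return the de-duplicated steps to repeat, in process order."""
    needed = {step for change in changes for step in RERUN_STEPS[change]}
    return [step for step in PROCESS_ORDER if step in needed]


plan = steps_to_rerun(["bucket_design", "comparison"])
```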
You will test your hub configuration using a CloverETL graph designed to perform a
MEMPUT operation.
Dependencies
Your configuration and data must be fully loaded into the IBM Initiate Master Data Service
software.
Important
Implementation Approach documents vary in content and format. This sample contains
elements common to all Implementation Approach documents, but does not contain every
element used in every project.
Application overview
High-level requirements
Initiate University will use Initiate software to achieve these key objectives:
Discover data quality errors - The IBM Initiate Master Data Service identifies
potential duplicates and potential linkages so that you can review and correct them.
Establish an enterprise-wide student identifier - The IBM Initiate Master Data
Service identifies and links student records across your enterprise using an Enterprise
ID (EID). Initiate University wants to use this capability to assign an
enterprise-wide student identifier to student records.
Solution architecture
Overview
The IBM Initiate Master Data Service software will manage member demographic data
from Archway Center and Bellwood Center.
Source data
Outlined below are the data elements Initiate University will store in the IBM Initiate Master
Data Service software:
Source
Source ID
Gender
Last Name
First Name
Middle Name
Suffix
Birth Date
Address Line 1
City
State
Zip Code
Home Phone
Mobile Phone
Social Security #
For more details on the source data, please see the Data Extract Guide.
Entity types
Outlined below are the entity types for your configuration:
Table 1-4: Entity types
Entity  Entity          Member  Comparison  Has     Async Entity  Cross-Linked  Same Source
Type    Label/Category  Type    Algorithm   Links?  Mgmt?         Members       Autolinks
id      Identity        PERSON  CMPID       Y       Y             Active        Y
Source attributes
A source is a separate system/database with which the IBM Initiate Master Data Service
software interacts and receives member information and updates.
The following sources have been defined for Initiate University:
Table 1-5: Source attributes
Source Name  Source/Physical Code  Source Type   Member Type
Archway      A                     Definitional  PERSON
Bellwood     B                     Definitional  PERSON
The following outside source has been defined for Initiate University:
Table 1-6: Source attributes
Source Name                     Source/Physical Code  Source Type    Member Type
Social Security Administration  SSA                   Informational  PERSON
Member attributes
The IBM Initiate Master Data Service software captures member-level information in
different attributes. Every member attribute is stored in a particular form or structure
(segment) that corresponds to a database table. For example, names are stored in the
MEMNAME segment or the mpi_memname table, dates are stored in the MEMDATE
segment or the mpi_memdate table, and so on.
Some attributes require a predefined pick list of values to facilitate queries through
end-user applications. These pick lists are defined and associated with individual attributes
using type code (EDT Code) values.
Attribute specific definitions also include Number of Active Attributes (Number Active), and
Number of Historical attributes (Number Exists).
Number of Active Attributes (Number Active) - Usually the most current attribute value
is set up as the Active value for an attribute. In this case, the field would be set to 1.
Should you have a case where you need more than one active value for the same
attribute, increase the number active value. Example: Marital status - At any given point
in time, a person should only have one active status, such as Married.
Number of Historical attributes (Number Exists) - Over a period of time, an attribute
value might change. The number exists value determines how many historical or
previous values along with the current active value(s) should be stored in the Initiate
database. For example, if number exists is set to 3 and number active is set to 1 for
name and Mary Jones gets married to Jonathan Smith, her name entries might look like
the following (oldest to most current):
- Mary Jonas - Status=Inactive
- Mary Jones - Status=Inactive
- Mary Smith - Status=Active
If Mary gets married again, then the very first name or the oldest value would get
purged. In the example above, Mary Jonas is removed and the resulting entries look
like the following:
- Mary Jones - Status=Inactive
- Mary Smith - Status=Inactive
- Mary Johnson - Status=Active
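The interaction between number exists and number active in the Mary example can be reproduced with a short sketch. The function names are hypothetical, and the sketch assumes that the newest values are the active ones and that values are kept oldest first.

```python
def add_value(values, new_value, number_exists=3):
    """Append a new value and purge the oldest ones beyond number_exists."""
    return (values + [new_value])[-number_exists:]


def with_status(values, number_active=1):
    """Mark the newest number_active values Active and the rest Inactive."""
    first_active = len(values) - number_active
    return [(value, "Active" if i >= first_active else "Inactive")
            for i, value in enumerate(values)]


# Oldest to most current, as in the example above.
names = ["Mary Jonas", "Mary Jones", "Mary Smith"]
names = add_value(names, "Mary Johnson")
entries = with_status(names)
```

After the new name is added, Mary Jonas is purged, Mary Smith becomes inactive, and Mary Johnson is the single active value.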
* Social Security Number will have the Review Identifier flag enabled in the Algorithm.
Derivation
Nicknames
The IBM Initiate Master Data Service software includes nickname translations that we
developed over time based on our algorithm and matching experience. Some customer
data sets might require additional nickname translations. Your installation uses our
standard person nicknames. The nicknames used for your installation are attached:
<document file would be attached here>
Standardization
The IBM Initiate Master Data Service software algorithms use standardization routines
tailored to meet your data needs. A standardization routine consists of a Function
designed to standardize a specific attribute type and a set of Anonymous values. For
example, the USZIP function is designed to standardize zip codes into 5-digit US postal
code representations by removing the zip+4 extensions. The ZIPCODE anonymous value
contains the list of zip codes that are not meaningful during matching, for example '11111'.
See Anonymous values below under Selection for more information.
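A standardization routine of this kind can be sketched as a function plus an anonymous value list. The code below is not the USZIP function itself: the function name, the handling of malformed input, and the contents of the anonymous list (only '11111' is named in the text) are assumptions.

```python
import re

# Only '11111' is named in the text; a real anonymous value list would be longer.
ANONYMOUS_ZIPS = {"11111"}


def standardize_zip(raw):
    """Reduce a US zip code to 5 digits; return None for unusable or anonymous values."""
    digits = re.sub(r"\D", "", raw or "")
    # Accept a plain 5-digit zip or a 9-digit zip+4; reject anything else (assumed rule).
    if len(digits) not in (5, 9):
        return None
    zip5 = digits[:5]
    return None if zip5 in ANONYMOUS_ZIPS else zip5
```

For example, `standardize_zip("85001-1234")` yields `"85001"`, while `standardize_zip("11111")` yields `None` because the value carries no meaning for matching.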
Selection
Bucketing
Your bucketing configuration, a part of your derived data specification, determines which
members are candidates for comparison. For example, you might compare any members
that have similar names and share the same zip code or have identical phone numbers.
We use an 'OR' condition, instead of an 'AND' condition, during candidate selection to
maximize the number of relevant candidates returned and so reduce the likelihood of
leaving out a genuine candidate. We optimize your bucketing configuration based on your
data volumes, profile, and business objectives. Your bucketing configuration is as follows:
Table 1-10: Bucketing definitions
Bucket                 Attr        Max/Min  Bucket  Bucket    Phonetic    Equiv String  Max/Min
                                   Tokens   Func    Gen Type  Derivation  Code          Bucket
                                                              Argument                  Tokens
SSN                    SSN         1/0      ATTR    Sorted    n/a         n/a           1/1
NAME + DOB             LGLNAME     1/1      PXNM    EQMETA    NORMPHONE   NICKNAME      2/2
                       BIRTHDT     1/1      DATE    DTY4MM    n/a         n/a           2/2
LAST NAME + ZIP        LGLNAME     1/1      PXNM    EQMETA    NORMPHONE   NICKNAME      2/2
                       HOMEADDR    1/1      ATTR    As Is     n/a         n/a           2/2
PHONE (HOME + MOBILE)  HOMEPHON    1/0      ATTR    Sorted    n/a         n/a           1/1
                       MOBILEPHON  1/0      ATTR    Sorted    n/a         n/a           1/1
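Candidate selection with an 'OR' condition can be sketched as follows: each member gets several bucket keys, and two members become candidates if they share any one of them. This is a simplified illustration, not the product's bucketing functions. The field names, the key formats, and the use of CRC32 as the hash are assumptions, and the sketch uses exact upper-cased strings where the real configuration applies phonetic and nickname handling.

```python
import zlib
from collections import defaultdict


def bucket_hashes(member):
    """Build illustrative bucket hashes for one member."""
    keys = []
    if member.get("ssn"):
        keys.append("SSN:" + member["ssn"])
    if member.get("last_name") and member.get("zip"):
        keys.append("LASTZIP:" + member["last_name"].upper() + ":" + member["zip"])
    for phone in (member.get("home_phone"), member.get("mobile_phone")):
        if phone:
            keys.append("PHONE:" + phone)
    return [zlib.crc32(key.encode("utf-8")) for key in keys]


def candidate_pairs(members):
    """Return every pair of members that shares at least one bucket (OR condition)."""
    buckets = defaultdict(set)
    for member_id, member in members.items():
        for bucket in bucket_hashes(member):
            buckets[bucket].add(member_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                pairs.add((left, right))
    return pairs


# Hypothetical members from the two sources.
members = {
    "A:1": {"ssn": "999001111", "last_name": "Jones", "zip": "85001",
            "home_phone": "6025550100"},
    "B:7": {"last_name": "Jones", "zip": "85001", "mobile_phone": "6025550199"},
    "B:9": {"last_name": "Smith", "zip": "90210", "home_phone": "6025550100"},
    "B:3": {"last_name": "Doe", "zip": "10001"},
}
pairs = candidate_pairs(members)
```

Member A:1 becomes a candidate for B:7 through the last name and zip bucket and for B:9 through the phone bucket. With an 'AND' condition neither pair would have been selected.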
Anonymous values
Anonymous values are common values that do not have sufficient meaning for use in
matching. For example, anonymous values include data that come from testing, such as
the name Test Customer. They also can include values that you use as defaults such as
the birth date of 1/1/1900. The attached file represents the data contained in the
mpi_stranon table. <document file would be attached here> The last two fields indicate the
type of anonymous value and the value itself. This file is included as a baseline for
reference, should you ever desire that additional anonymous values be entered, or existing
ones removed.
Scoring
Comparison
We use comparison routines tailored to meet your data needs. Comparison consists of
a Function that indicates which data elements are compared, a list of Nicknames for
comparison (see Nicknames above), and a set of weights that dictate the match scores for
each attribute (see Weights below).
Weights
Weights dictate the comparison scores generated by the IBM Initiate Master Data Service
software algorithm. These weights were determined via analysis of your data and are
included here <document file would be attached here> as a baseline, in case you should
ever request that Initiate perform additional analysis on your data to regenerate them.
Linking
Threshold review
The IBM Initiate Master Data Service software employs two thresholds in determining the
final outcome of a comparison, the clerical-review (CR) threshold and the auto-link (AL)
threshold. If a pair of records scores above the AL threshold, the IBM Initiate Master Data
Service software automatically links the pair of records together as a single entity. If they
score above the CR threshold but below the AL threshold, the IBM Initiate Master Data
Service software places the pair of records into an electronic queue for manual review.
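The two-threshold rule described above can be sketched in a few lines. The function name, the outcome labels, and the example threshold values are assumptions for illustration. How the software treats a score that is exactly equal to a threshold is also an assumption here.

```python
def comparison_outcome(score, cr_threshold, al_threshold):
    """Classify a pair score against the clerical-review and auto-link thresholds."""
    if cr_threshold > al_threshold:
        raise ValueError("The CR threshold must not exceed the AL threshold")
    if score >= al_threshold:
        return "auto-link"          # linked together as a single entity
    if score >= cr_threshold:
        return "clerical-review"    # queued for manual review
    return "no-action"              # records are left unlinked


# Hypothetical thresholds: CR = 6.0, AL = 11.0
for score in (12.5, 8.0, 3.0):
    print(score, comparison_outcome(score, 6.0, 11.0))
```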
Setting these thresholds correctly is essential to the performance and accuracy of the IBM
Initiate Master Data Service software. The values for these thresholds depend upon
several factors:
The size of the data file
The richness of the underlying data (that is, the number of attributes available for matching)
The tolerable false-positive error rate (which describes the number of records which
would be incorrectly linked)
The desirable false-negative rate (which is the number of missed linkages) or the
resources available for processing manual review
Optionally, we establish thresholds for the comparison of members within each source
system and for comparison of members across your source systems.
Thresholds
Initiate's probabilistic matching algorithms provide highly accurate results that can be tuned
to meet your business requirements. The Initiate algorithms assign a probability score that
indicates the likelihood that a given record matches the search criteria. By performing a
thorough Threshold Analysis you determine a specific threshold score that produces
optimum results for your specific data characteristics and requirements. For a given
Search, Initiate's probability score is based on how many of the attributes on the returned
record match the input criteria and on how closely those attributes match the input criteria.
For more details on Initiate's matching algorithms, see Identity hub Overview Guide. In
general, higher scores indicate higher confidence matches, where most or all of the Search
criteria are met. Lower scores reveal less confident matches, where one or more of the
search criteria are either different or missing from the returned record.
Therefore, if you set a high Threshold score, your Search results will yield only the highest
confidence matches and you are unlikely to return False-Positives, or incorrect matches.
However, at very high Threshold scores, you risk having False-Negatives, or missed
opportunities for matching. For example, if someone has moved and their ZIP code no
longer matches the input criteria, that record will score lower than if you searched using the
old address criteria. In order to obtain optimum Search results, you must balance the
trade-off between False-Negatives and False-Positives. You do this by selecting a score
threshold that meets your specific business needs.
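To make the trade-off concrete, the sketch below counts False-Positives and False-Negatives at several candidate thresholds. The scored pairs and their true match status are invented for illustration; a real Threshold Analysis uses sampled pairs from your own data that have been reviewed manually.

```python
# Hypothetical (score, is_true_match) results for reviewed record pairs.
scored_pairs = [
    (14.2, True), (11.8, True), (9.5, False), (9.1, True),
    (6.3, False), (4.0, True), (2.2, False),
]


def error_counts(pairs, threshold):
    """Return (false_positives, false_negatives) for a given score threshold."""
    false_positives = sum(1 for score, match in pairs if score >= threshold and not match)
    false_negatives = sum(1 for score, match in pairs if score < threshold and match)
    return false_positives, false_negatives


for threshold in (3.0, 9.0, 12.0):
    fp, fn = error_counts(scored_pairs, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Raising the threshold in this sample removes False-Positives but adds False-Negatives, which is the balance that the chosen threshold has to strike.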
Results
We configured the IBM Initiate Master Data Service based on our knowledge of your
industry and your business goals combined with an analysis of your data. During threshold
analysis with your business group, the following threshold levels were chosen:
Table 1-13: Thresholds
Source | Clerical Review Threshold | Auto-link Threshold
Archway | 9.0 | 9.0
Bellwood | 9.0 | 9.0
Tasks
There are 4 types of tasks that can be created by the IBM Initiate Master Data Service:
Potential Duplicate: Two records from the same source that score between the clerical
review and auto-link threshold.
Potential Linkage: Two records from different sources that score between the clerical
review and auto-link threshold.
Review Identifier: Two records from the same source that have the same unique
identifier (the attribute used for Review Identifier tasks is identified in the Member Attributes
section above).
The following tasks are not utilized by Initiate University, but must be active for the IBM
Initiate Master Data Service software to function correctly:
Potential Overlay: A record received an update with information that is radically different
than the data that was already there.
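As a rough illustration of how the first two task types differ, the sketch below picks a task type from the two sources and the pair score. The function name, the example thresholds, and the handling of scores exactly on a threshold are assumptions; Review Identifier and Potential Overlay tasks depend on other conditions and are not modeled.

```python
def score_based_task(source_a, source_b, score, cr_threshold, al_threshold):
    """Return the task type for a scored pair, or None when no task applies."""
    if not (cr_threshold <= score < al_threshold):
        return None  # below clerical review, or high enough to auto-link
    if source_a == source_b:
        return "Potential Duplicate"
    return "Potential Linkage"


# Hypothetical thresholds: CR = 6.0, AL = 11.0
print(score_based_task("Archway", "Archway", 8.0, 6.0, 11.0))
print(score_based_task("Archway", "Bellwood", 8.0, 6.0, 11.0))
```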
Data volumes
Initiate University maintains a database of roughly 500,000 records. Detailed data
volumes are listed below:
Table 1-14: Data volumes
Description | Records
Day 1 Volume | ~ 500,000
Average daily update volume | 20,000 (10,000 from Archway, 10,000 from Bellwood)
Total volume to support | ~ 500,000
Data description
For each section below, please describe your data by providing responses to the questions
and provide any additional information that you believe might be helpful in understanding
your data environment and processes.
The IBM Initiate Master Data Service manages data that you collect from these sources:
Table 1-15: Data sources
Data Source | Description | Approximate Record Count
Archway | Registration source for Archway Training Center | ~ 250,000
Bellwood | Registration source for Bellwood Training Center | ~ 250,000
Please describe the unique identifiers from your sources:
Table 1-16: Unique identifiers
Question | Description | Response
What is your primary qualifier? | The number used to uniquely identify a record. | Records are uniquely identified by a combination of Source and Source ID, students are uniquely identified by Member Record Number.
Does the primary ID have a meaningful prefix, suffix, or any other characteristic within the identifier? | Describe any values in the identifier that represent a distinct population | X No / _ Yes, please describe:
Can you extract one record per primary identifier? | When you perform your data extract, are you able to pull one record per unique identifier? | X Yes / _ No
Do you have records without a primary identifier? | If you want to include these in your evaluation, a primary identifier needs to be assigned prior to submitting the file, or Initiate can assign an identifier during processing. Briefly describe. | _ Yes / X No
Assumptions
When preparing your data extract, please ensure the following:
Source Identifier is unique within the source
All fields are Text data type
Phone numbers are in a single format
Data file fields are alphanumeric characters and left-justified
Customization
The IBM Initiate Master Data Service software is configurable to support additional
attributes or different formats (changed field order, different delimiters, et cetera). These
changes should be approved and documented as revisions to the above format prior to file
submission so we can properly configure the software. Certain customizations could
require a change to project pricing or schedules. Please coordinate any extract change
requests with your Initiate Project Manager.
Sample data
Based on our conversations, the following are representative samples of the data that you
provide to Initiate.
Source|MemId|Gender|LastName|FirstName|MiddleInitial|Suffix|BirthDate|HomePhone|CellPhone|SSN
WEB|435263|F|GRIMM|JEANICE|I||1946-10-04|(480)312-2086|(480)217-2304|617-63-4723
WEB|436287|M|ADOLPHSEN|KEENAN|O|JR||(480)186-3228||421-91-9316
REG|M-1509|M|MORING|WADE|R||1963-12-13|(928)103-9712|(928)302-0913|262-21-1509
REG|G-9637|F|GALGANO|SARAI|L||1990-06-07||(480)100-1377|285-42-9637
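A pipe-delimited extract in this layout can be read with standard tooling. The following is a minimal sketch using Python's csv module on the first two sample rows above; treating empty fields as missing values (None) is an assumption about how you might want to handle blanks, not a rule of the IBM Initiate Master Data Service software.

```python
import csv
import io

# Header and the first two rows of the sample extract shown above.
sample = """Source|MemId|Gender|LastName|FirstName|MiddleInitial|Suffix|BirthDate|HomePhone|CellPhone|SSN
WEB|435263|F|GRIMM|JEANICE|I||1946-10-04|(480)312-2086|(480)217-2304|617-63-4723
WEB|436287|M|ADOLPHSEN|KEENAN|O|JR||(480)186-3228||421-91-9316
"""

reader = csv.DictReader(io.StringIO(sample), delimiter="|")

# Convert empty strings to None so that blank fields are easy to detect.
records = [{name: (value or None) for name, value in row.items()} for row in reader]

for rec in records:
    print(rec["Source"], rec["MemId"], rec["LastName"], rec["BirthDate"])
```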
Security
Initiate adheres to strict confidentiality standards. We take this responsibility very seriously
and enforce regulatory standards relating to the distribution, disclosure and retention of
personal data. Unless otherwise instructed, Initiate destroys client media in accordance
with an agreed upon time frame.
Introduction
Initiate Inbound Message Based Transaction Services (Inbound Broker) enables you to
submit record updates from your source systems to the Identity hub via your existing
messaging infrastructure. The components of Inbound Broker are designed to manage
your specific XML and delimited messages and can be customized to your requirements.
Outlined below are the components required for an inbound message, some message
samples, and then your inbound map.
Components overview
The Inbound Broker consists of the following two components:
Inbound Message Reader
Inbound Message Broker
The Inbound Message Reader is a stand-alone process that receives messages through a
TCP/IP connection. It listens on a port, validates that each message is formatted correctly,
and then writes the message to a queue from which it is consumed by the Inbound
Message Manager.
The Inbound Message Broker process consumes messages from the Inbound Message
Reader queue and then sends each message via TCP/IP to the hub. The engine attempts to
process the message and sends a notification of success or failure back to the Inbound
Message Manager. After an acknowledgement is received from the IBM Initiate Master Data
Service, the message is placed in either the 'success*.dat' or the 'reject*.dat' file.
The Inbound Broker uses a configuration file to map message fields to the Identity hub data
attributes. The inbound.ini identifies how each attribute will be stored in the IBM Initiate
Master Data Service, error conditions for incoming values, and any special formatting or
processing required. It provides detailed descriptions of valid values and indicates
formatting requirements for processing messages read by the message reader.
Configuration details
The following sections outline the specific configuration options for your implementation. They
include how the messages should be formatted, valid values for specific XML tags, and
the expected data to be included.
Table 1-19: Segment definitions
Field Number | Field Name | Segment | Label | Special Processing
1 | Source | MEMHEAD | srcCode | Reject the message if Source is blank
2 | Source ID | MEMHEAD | memIdnum | Reject the message if Source ID is blank
3 | Gender | SEX | attrVal |
4 | Last Name | LGLNAME | onmLast |
5 | First Name | LGLNAME | onmFirst |
6 | Middle Name | LGLNAME | onmMiddle |
7 | Suffix | LGLNAME | onmSuffix |
8 | Birth Date | BIRTHDT | dateVal |
9 | Address Line 1 | HOMEADDR | stLine1 |
10 | City | HOMEADDR | city |
11 | State | HOMEADDR | state |
12 | Zip Code | HOMEADDR | zipCode |
13 | Home Phone | HOMEPHON | phNumber |
14 | Mobile Phone | MOBILEPHON | phNumber |
15 | Social Security Number | SSN | idNumber | Ignore segment if SSN is blank
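The segment definitions can be read as a positional mapping from message fields to segments and labels. The sketch below expresses that mapping in Python to show the idea. It is not the inbound.ini syntax and not the Inbound Broker implementation; the function and exception names, the pipe delimiter, and the address values in the example message are assumptions for illustration.

```python
# (segment, label) for fields 1 to 15, in the order given in Table 1-19.
FIELD_MAP = [
    ("MEMHEAD", "srcCode"), ("MEMHEAD", "memIdnum"), ("SEX", "attrVal"),
    ("LGLNAME", "onmLast"), ("LGLNAME", "onmFirst"), ("LGLNAME", "onmMiddle"),
    ("LGLNAME", "onmSuffix"), ("BIRTHDT", "dateVal"), ("HOMEADDR", "stLine1"),
    ("HOMEADDR", "city"), ("HOMEADDR", "state"), ("HOMEADDR", "zipCode"),
    ("HOMEPHON", "phNumber"), ("MOBILEPHON", "phNumber"), ("SSN", "idNumber"),
]


class RejectedMessage(ValueError):
    """Raised when a message fails one of the 'reject' rules."""


def map_message(line, delimiter="|"):
    """Group the fields of one delimited message into segments."""
    values = line.split(delimiter)
    if len(values) != len(FIELD_MAP):
        raise RejectedMessage(f"expected {len(FIELD_MAP)} fields, got {len(values)}")
    segments = {}
    for (segment, label), value in zip(FIELD_MAP, values):
        segments.setdefault(segment, {})[label] = value.strip()
    head = segments["MEMHEAD"]
    if not head["srcCode"] or not head["memIdnum"]:
        raise RejectedMessage("Source and Source ID must not be blank")
    if not segments["SSN"]["idNumber"]:
        del segments["SSN"]  # ignore the segment if SSN is blank
    return segments


# Hypothetical 15-field message with a blank Suffix and a blank SSN.
message = ("WEB|435263|F|GRIMM|JEANICE|I||1946-10-04|12 MAIN ST|PHOENIX|AZ|85001|"
           "(480)312-2086|(480)217-2304|")
mapped = map_message(message)
print(sorted(mapped))
```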
Overview
You configure the Initiate member model (.imm) using Workbench, which is also where we will build
the data dictionary. Workbench is the main user and configuration management tool of the
IBM Initiate Master Data Service.
Dependencies
The computer images have Workbench installed.
Topics
This unit will cover:
Workbench basic functionality
General Workbench navigation
Workbench overview
Workbench is a graphical user interface that provides user management and configuration
management tools for IBM Initiate Master Data Service. Simply put, it allows you to view
and manage the configuration for Data or Relationship hubs.
Using Workbench, a hub's data model, algorithm, and thresholds can be easily adjusted to
your requirements using a single toolset. Graphical analytics are available to correctly
adjust the algorithms and thresholds to increase the accuracy and performance of a
particular hub. These features make it much easier for algorithms to be tuned for
performance, and to improve the accuracy of matching based on analytics returned from
Workbench.
One Workbench project contains all this information for ease of versioning and
management by system administrators to support the standard IT processes.
Basic functionality
Workbench projects can be created and configured without a hub instance or a data
source. This is different from previous versions of configuration tools, like Identity hub
Manager, where the engine and databases were written to directly. Instead, Workbench
saves the configuration and uploads it to the engine, allowing changes to be made offline.
Workbench allows users to perform many tasks that were once completed using scripts or
multiple Initiate software packages. Some of the key functionality includes:
Creating, configuring, and editing member model dictionaries
Creating, configuring, and editing algorithms
Cleaning and de-duplicating data in the data extract
Bucket analysis
Threshold analysis
Entity analysis
5. Set the Command Prompt (Start > All Programs > Accessories > Command
Prompt) to run as administrator using the steps above.
Perspectives
The Workbench Editor pane has multiple perspectives for viewing multiple file types. There
are perspectives for configuration, mapping, bucket analysis, and working with Clover
ETLs. On the screen capture above, Call out 1 shows what perspective Workbench is
currently in and where perspectives can be changed.
Exercise
Now it is time to perform Exercise 1, taking approximately 10 minutes.
Overview
You will now learn how to configure the Initiate member model (.imm) by building a Data
Dictionary in Workbench. You will build a Data Dictionary from scratch in the exercises, but
normally you will begin with a predefined dictionary.
Dependencies
You must understand how the Initiate member model (.imm) defines the way that the IBM
Initiate Master Data Service software stores, manages, and validates data, and you must
have Workbench installed on your computer.
Topics
This unit will cover:
Working with the Initiate Member Model file to create the data dictionary:
- Adding a New Member type
- Adding Attributes
- Adding an Entity Type
- Adding a Composite Source
- Adding Sources
- Adding an Algorithm (Name Only)
- Adding Information Sources
Copyright IBM Corp. 2010, 2011 Unit 3. The Initiate member model 3-1
Adding attributes
Attributes are the pieces of information that we know about members, such as name, rank, and
serial number. As you are setting up your data dictionary, you need to find out what data is
being collected and stored in your source systems.
For Boot Camp, we are going to use the following information, or attributes for the
members in our sources:
Name
Home Address
Gender
Social Security Number
Birth Date
Home Phone
Mobile Phone
In this step, we will create the attributes using their common titles above, then define their
attribute code (attrcode) and the type of attribute. The attribute code is the shortened name
that will be used by the hub to define the attribute. The attrcode is user defined and should
be as descriptive yet as brief as possible. It cannot contain spaces.
The attribute type, sometimes called a segment, is predefined and must be selected from a
table. This code defines how the information will be treated in the hub.
Adding strings
Strings enable you to create rules or guidelines that instruct the algorithm on how to handle
certain incoming data values. To save some time in Boot Camp, we will not create the
string value files from scratch. We will copy string files to our project, and then create a new
string.
Exercise
Now it is time to perform Exercise 2, taking approximately 45 minutes.
Overview
Typically, a generic algorithm is imported along with your project configuration, but in class
we will begin with an empty algorithm. This algorithm will need to be configured to address
the attributes you are using, the comparisons that you would like to use, and the bucketing
strategy that you would like to employ. You can use Workbench to make your edits. The
tool will validate your design and present you with a list of errors if there are any
inaccuracies in your algorithm design.
Dependencies
The algorithm is the brain of the IBM Initiate Master Data Service software. Therefore, the
proper data elements must be in place before you can fully develop the algorithm. After
your first pass at the implementation, you can make some tweaks to the algorithm. After
making those changes, you will need to derive data, generate weights and/or perform a
Bulk Cross Match again.
Topics
This unit will cover:
Algorithms: The secret sauce
- Introduction to algorithm components
- Standardization Introduction
- Bucketing Introduction
- Comparing data introduction
Copyright IBM Corp. 2010, 2011 Unit 4. Configuring the algorithm 4-1
Standardization
Standardization is a process applied to attributes that cleans the values for easier
comparison. It reduces the variance between data (for example, converting characters to
UPPERCASE) and removes extraneous information that won't help comparison (for
example, anonymous values, generic descriptors like Road and Street
in addresses, and special characters like (), -, and so on).
Note
Most standardization functions have the ability to filter out values that have been deemed
as anonymous (for example, SSN of 999-99-9999 or Zip Code of 11111). Likewise, many
functions give you the ability to substitute codes or standard values in place of predefined
strings (for example, Nicknames like Jimmy or Jaime both become James and Genders
are converted from Male and Female to M and F).
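The two behaviors in this note, anonymous value filtering and equivalent string substitution, can be sketched as simple table lookups. The tables below only contain the examples from the note, and the function name and parameters are hypothetical. In the product these values live in the Strings section of the Initiate member model, not in Python dictionaries.

```python
# Illustrative lookup tables built from the examples in the note above.
ANONYMOUS = {
    "SSN": {"999999999"},
    "ZIP": {"11111"},
}
EQUIVALENTS = {
    "NICKNAME": {"JIMMY": "JAMES", "JAIME": "JAMES"},
    "GENDER": {"MALE": "M", "FEMALE": "F"},
}


def standardize(value, anon_type=None, equiv_type=None):
    """Uppercase, strip special characters, drop anonymous values, apply equivalents."""
    token = "".join(ch for ch in value.upper() if ch.isalnum())
    if not token:
        return None
    if anon_type and token in ANONYMOUS.get(anon_type, set()):
        return None  # anonymous values carry no meaning for matching
    if equiv_type:
        token = EQUIVALENTS.get(equiv_type, {}).get(token, token)
    return token


print(standardize("999-99-9999", anon_type="SSN"))
print(standardize("Jimmy", equiv_type="NICKNAME"))
print(standardize("Female", equiv_type="GENDER"))
```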
Abstract code
The following table contains a list of the current Abstract Code Standardization functions:
ABSCODE: Formats abstract codes using the equivalent string tables. (Accepts only 1 field.)
Example
The ABSCODE function could perform the following standardization on a Provider Name
(Name/Provider), if the proper validation tables are created in the Strings section of your
Initiate Member Model:
Mike L. Goodman, Allergist becomes ALG
Date
The following table contains a list of the current Date Standardization functions:
Table 4-1: Date standardization functions
DATE1 (Standard): Formats alphanumeric date values as YYYYMMDD and removes special characters. (Accepts only 1 field.) This is the Initiate preferred date format.
DATE2 (Partial): If year, month or day is invalid, those portions of the date are replaced by a string of 0's (zeros) and the comparison function ignores the 0 string. The function is configurable with a minimum and maximum year. When the date falls outside of those ranges, the entire date is treated as an anonymous value. After applying the year range filter, DATE2 checks the date against a configurable date standardization table.
GRDATE (Fixed): Formats a date as YYYYMMDD and treats dates earlier than 1890 and later than 2020 as anonymous values. (Accepts only 1 field.)
AGE (Age): Formats an age into a birth year by removing non-integers and treating values less than 8 or greater than 100 as anonymous. (Accepts only 1 field.)
Date example
The Date/Standard function would perform the following standardization on a Date of
Birth:
1962-10-27 becomes 19621027
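The following sketch approximates the Standard and Partial behaviors described in Table 4-1. It assumes the input is already in year, month, day order, and the function names and default year range are hypothetical. The date standardization table lookup that DATE2 performs is not modeled.

```python
import re


def date_standard(value):
    """Approximate Date/Standard: strip special characters from a Y-M-D value."""
    return re.sub(r"[^0-9A-Za-z]", "", value)


def date_partial(value, min_year=1890, max_year=2020):
    """Approximate Date/Partial: zero out an invalid month or day.

    Returns None when the date is treated as anonymous. The default year
    range is a placeholder; in the product the range is configurable.
    """
    digits = re.sub(r"\D", "", value)
    if len(digits) != 8:
        return None
    year, month, day = int(digits[:4]), int(digits[4:6]), int(digits[6:])
    if not (min_year <= year <= max_year):
        return None  # outside the configured range: anonymous value
    month_part = digits[4:6] if 1 <= month <= 12 else "00"
    day_part = digits[6:] if 1 <= day <= 31 else "00"
    return digits[:4] + month_part + day_part


print(date_standard("1962-10-27"))
print(date_partial("1962-13-27"))
```

The second call keeps the year and day but replaces the invalid month with zeros, so a later comparison can ignore only that portion of the date.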
Attribute
The following table contains a list of the current Attribute Standardization functions:
Table 4-2: Generic attribute standardization functions
ATTR (Alphanumeric): Keeps alphanumeric characters and removes any special characters like !@#$%^&*-(). Often used for single-field attributes like Gender and defined attributes. (Accepts only 1 field.)
ATTRA (Alphabetic): Formats alphabetic data only, so removes all numbers and special characters. (Accepts only 1 field.)
ATTRN (Numeric): Formats numeric data, so removes all alphabetic data and special characters. (Accepts only 1 field.)
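The three generic attribute functions differ only in which character classes they keep. A minimal sketch of that idea follows; the Python function names are hypothetical stand-ins for ATTR, ATTRA, and ATTRN, and treating spaces as special characters is an assumption.

```python
def attr_alphanumeric(value):
    """Keep letters and digits only (in the spirit of ATTR)."""
    return "".join(ch for ch in value if ch.isalnum())


def attr_alphabetic(value):
    """Keep letters only (in the spirit of ATTRA)."""
    return "".join(ch for ch in value if ch.isalpha())


def attr_numeric(value):
    """Keep digits only (in the spirit of ATTRN)."""
    return "".join(ch for ch in value if ch.isdigit())


print(attr_alphanumeric("617-63-4723"))
print(attr_alphabetic("O'Brien 3rd"))
print(attr_numeric("(480)312-2086"))
```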
Phone
The following table contains a list of the current Phone Number Standardization functions:
Table 4-3: Phone standardization functions
PHONE1 (North America): Formats a phone number that is 7 or more characters to a 7-digit United States/Canada number, by shaving off leading 1's, area codes, and special characters. (Accepts only 1 field.)
PHONE2 (Full): Formats a phone number by concatenating the phone data in multiple fields, then reducing it to a 7-digit United States/Canada number. This function can work with numbers that only have the minimum 7 digits. (Accepts up to 3 fields.)
PHONEEND (Last Digits): This function standardizes phone numbers regardless of country format.
INTPHONE1 (International): Formats a phone number to a 10-digit International number. Analyzes country calling codes for more accuracy. (Accepts up to 3 fields.)
AUSTPH (Australia): Formats an Australian phone number to an 8-digit local number. (Accepts only 1 field.)
Phone examples
The Phone/Full function would perform the following standardization on a US Phone
Number:
1 (404) 871-1316 becomes 8711316
The Phone/International function would perform the following standardization on a UK
Phone Number:
+44 (0) 282 995 7182 becomes 2829957182
Logic for standardizing US/CN phone numbers is outlined below:
If the leading character is a 1
Then remove it
If the resulting string length is >=10 (assuming an area code)
Then skip the first 3 digits and use the next 7
Else If the resulting string has a length >=7
Then use the first 7 digits
Else treat the phone number as if it is NULL
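The logic above translates directly into a short function. This sketch reads the outline as "remove a leading 1 first, then apply the length checks", and it strips non-digit characters before anything else. The function name is hypothetical, and the code is an interpretation of the outline, not the product's implementation.

```python
import re


def standardize_na_phone(raw):
    """Reduce a US/Canada phone number to 7 digits, or None if too short."""
    digits = re.sub(r"\D", "", raw)   # drop spaces, parentheses, dashes
    if digits.startswith("1"):
        digits = digits[1:]           # remove the leading 1
    if len(digits) >= 10:
        return digits[3:10]           # skip the area code, use the next 7
    if len(digits) >= 7:
        return digits[:7]             # no area code: use the first 7
    return None                       # treat as NULL


print(standardize_na_phone("1 (404) 871-1316"))
```

The call reproduces the Phone/Full example given earlier: 1 (404) 871-1316 becomes 8711316.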
Postal code
The following table contains a list of the current Postal Code Standardization functions:
Table 4-4: Postal code standardization functions
AUSTPOST (Australia): Formats an Australian postal code to 4 digits. (Accepts only 1 field.)
CNZIP (Canada): Formats a Canadian zip code. (Accepts only 1 field.)
INTZIP (International): Formats an international zip code. (Accepts only 1 field.)
NAZIP (North America): Formats a US/CAN zip code. (Accepts only 1 field.)
UKZIP (United Kingdom): Formats a United Kingdom zip code. (Accepts only 1 field.)
USZIP (United States): Formats a United States zip code. (Accepts only 1 field.)
Name
The following table contains a list of the current Name Standardization functions:
Table 4-5: Name standardization functions
PXNM (Person): Formats a list of strings applying rules for Person names including First Name, Middle Name, Last Name, Prefix, Suffix, Title, and Degree. This function can also take hyphenated names and split them into multiple tokens. (Accepts up to 6 fields.)
BXNM (Provider): Formats a list of strings for Health care Providers (Individual Doctors, Physician Offices, Medical Practices, and so on) using equivalent tokens and filtering out anonymous values to create the derived data. BXNM is the only function that creates two roles, one for all of the name tokens and a second role that denotes the specialty of the provider. Specialty is translated from Titles, Degrees, or from words in the name. This function cannot break apart hyphenated names, it simply removes the hyphen. (Accepts up to 6 fields.)
CXNM (Company): Formats a list of strings applying rules for Company names, like removing Inc. and Co. from names. This function cannot break apart hyphenated names, it simply removes the hyphen. (Accepts only 1 field.)
CJKCXNM (Company Unicode): Formats a list of strings applying rules for Unicode Company names. Uses rules for Chinese, Japanese, and Korean tokenization. (Accepts only 1 field.)
UCSFREQXNM (Person Unicode): Incorporates the FORGNXNM and JAPXNM functionality.
Name examples
The Name/Person function would perform the following standardizations:
Howard K. Pingston, Jr. becomes PINGSTON:HOWARD:K:JR:.
Anne M. Fuller-Kline, DDS becomes FULLER:KLINE:ANNE:M:DDS.
The Name/Provider function would perform the following standardizations:
Geoff M. Locke, Family & Pediatric becomes
Role 1: LOCKE:FAMILY:PEDIATRIC:JEFFREY:M
Role 2: FAM:PED (Conversion from mpi_strEqui table)
Martin's Foot & Ankle Clinic, LLP becomes
Role 1: MARTINS:FOOT:ANKLE:CLINIC
Role 2: POD:ORT (Foot = Podiatry = POD and Ankle = Orthopedics = ORT in the
mpi_strEqui table)
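The Name/Person examples above suggest a token order of last name, first name, middle name, then suffix or degree, with hyphenated names split into separate tokens. The sketch below reproduces only that tokenization under those assumptions. The function name and its parameters are hypothetical, and the real PXNM function also applies nickname equivalence and anonymous value filtering, which are not modeled here.

```python
import re


def person_name_tokens(last, first="", middle="", suffix=""):
    """Join uppercase name tokens with ':' in last, first, middle, suffix order."""
    tokens = []
    for field in (last, first, middle, suffix):
        # Split on whitespace and hyphens so Fuller-Kline yields two tokens.
        for part in re.split(r"[\s\-]+", field):
            token = re.sub(r"[^A-Za-z]", "", part).upper()
            if token:
                tokens.append(token)
    return ":".join(tokens)


print(person_name_tokens("Fuller-Kline", "Anne", "M.", "DDS"))
print(person_name_tokens("Pingston", "Howard", "K.", "Jr."))
```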
Address
The following table contains a list of the current Address Standardization functions:
Table 4-6: Address standardization functions
Canada: (CNADDR) Formats an address string, dividing it into the address subcomponents
    for Canada. (Accepts up to 4 fields.)
Canada - Expanded: (CNADDR2) Formats an address string, dividing it into the address
    subcomponents for Canada with region and postal code. (Accepts up to 7 fields.)
International: (INTADDR2) Formats an address string, dividing it into the address
    subcomponents for International with region and postal code. (Accepts up to 7 fields.)
North America: (NAADDR2) Formats an address string, dividing it into the address
    subcomponents for North America with region and postal code. (Accepts up to 7 fields.)
United Kingdom: (UKADDR2) Formats an address string, dividing it into the address
    subcomponents for UK with region and postal code. (Accepts up to 7 fields.)
United States: (USADDR) Formats an address string, dividing it into the address
    subcomponents for United States. (Accepts up to 4 fields.)
(USADDR2) Formats an address string, dividing it into the address subcomponents for
    United States with region and postal code. The ADDR2 standardization function has
    3 components: street lines, region, and postal code. You can omit the region and
    postal code, or just the postal code, but the number of fields for each component
    is fixed: four for street lines, two for region, and one for postal code. For example:
    (Accepts up to 7 fields.)
Universal Character Set: (UCSFREQADDR) Incorporates the FORGADDR and JAPADDR
    functionality.
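The fixed field layout described for the ADDR2 functions (four street lines, two region fields, one postal code) can be illustrated with a short Python sketch. This is only an illustration of the layout, not the engine's implementation: the function name addr2_fields and the use of empty strings to pad unused slots are assumptions.

```python
def addr2_fields(street_lines, region=None, postal_code=None):
    """Lay out address input in the fixed ADDR2 slots: 4 street, 2 region, 1 postal.

    Hypothetical helper; padding unused slots with "" is an assumption.
    """
    if len(street_lines) > 4:
        raise ValueError("at most 4 street lines are allowed")
    if postal_code is not None and region is None:
        # The text allows omitting region and postal code, or just the postal code.
        raise ValueError("a postal code requires the region fields")

    # Street lines always occupy exactly four slots.
    fields = list(street_lines) + [""] * (4 - len(street_lines))
    if region is None:
        return fields

    region = list(region)
    if len(region) > 2:
        raise ValueError("at most 2 region fields are allowed")
    # Region always occupies exactly two slots.
    fields += region + [""] * (2 - len(region))

    if postal_code is not None:
        fields.append(postal_code)
    return fields


full = addr2_fields(["1285 W Main"], ["Phoenix", "AZ"], "85001")
street_only = addr2_fields(["1285 W Main", "Suite 4"])
```

Here full has seven entries, with the region starting at the fifth position even though only one street line was supplied.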
4-12 Initiate Technical Boot Camp Copyright IBM Corp. 2010, 2011
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
V5.4.0.3
FOR USE WITH COURSE 3Z100 ONLY Student Notebook
Biometric
The following table contains a list of the current Biometric Standardization functions:
Table 4-8: Biometric standardization functions
Hair Color: (HAIRCOLOR) Formats hair color as alphanumeric data (identical to ATTR).
    (Accepts only 1 field.)
Eye Color: (EYECOLOR) Formats eye color as alphanumeric data (identical to ATTR).
    (Accepts only 1 field.)
Race: (RACE) Formats race as alphanumeric data (identical to ATTR). (Accepts only 1 field.)
Height: (HEIGHT) Formats height into inches. Input must be in the format FII, where F is
    feet and II is inches. Values less than 36 or greater than 90 are treated as
    anonymous. (Accepts only 1 field.)
Weight: (WEIGHT) Formats weight by removing non-integers and treats values less than 60
    or greater than 500 as anonymous. (Accepts only 1 field.)
Biometric examples
The Biometric/Hair Color function could perform the following standardization on a Hair
Color, if the proper validation tables are created:
BROWN becomes BR
BLONDE becomes BL
BALD becomes BD
BLACK becomes BK
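The HEIGHT and WEIGHT rules from the table above can be sketched in Python as follows. This is a minimal illustration, not the engine's code: the function names are hypothetical, an anonymous value is represented as None, and the 36 to 90 range is assumed to apply to the converted height in inches.

```python
import re


def standardize_height(raw):
    """Convert FII input (F = feet, II = inches) to inches; None means anonymous."""
    digits = re.sub(r"\D", "", raw)  # assumption: ignore anything that is not a digit
    if len(digits) != 3:
        return None
    inches = int(digits[0]) * 12 + int(digits[1:])
    if inches < 36 or inches > 90:
        return None
    return inches


def standardize_weight(raw):
    """Keep only digits; values below 60 or above 500 are treated as anonymous."""
    digits = re.sub(r"\D", "", raw)
    if not digits:
        return None
    weight = int(digits)
    if weight < 60 or weight > 500:
        return None
    return weight


height = standardize_height("510")       # 5 feet 10 inches
weight = standardize_weight("180 lbs")
```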
Geocode
The following table contains a list of the current Geocode Standardization functions:
Table 4-9: Geocode standardization functions
GEO: The GEO standardization function converts latitude/longitude location coordinates
    into a standardized format that can be consumed by GEO comparison and bucket
    functions.
Identifier
The following table contains a list of the current Identifier Standardization functions:
Table 4-10: Identifier standardization functions
Numeric: (IDENT1) Keeps alphanumeric characters in an identifier and removes any special
    characters, such as !@#$%^&*-(). This is often used for Medical Record Numbers
    (MRNs). (Accepts only 1 field.)
Alphabetic: (IDENT1A) Formats alphabetic identifiers only, so it removes all numbers and
    special characters. (Accepts only 1 field.)
Numeric: (IDENT1N) Formats numeric identifiers and removes all alphabetic data and
    special characters. This function works well with Social Security Numbers.
    (Accepts only 1 field.)
Identifier examples
The Identifier/Numeric function would perform the following standardization on a Social
Security Number:
902-83-1386 becomes 902831386
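The character filtering performed by the three identifier functions can be approximated with regular expressions. The Python sketch below only mirrors the behavior described in the table; the function names are hypothetical, and restricting the character classes to ASCII is an assumption.

```python
import re


def ident1(value):
    """IDENT1-style: keep letters and digits, drop special characters."""
    return re.sub(r"[^A-Za-z0-9]", "", value)


def ident1a(value):
    """IDENT1A-style: keep letters only."""
    return re.sub(r"[^A-Za-z]", "", value)


def ident1n(value):
    """IDENT1N-style: keep digits only."""
    return re.sub(r"[^0-9]", "", value)


ssn = ident1n("902-83-1386")
mrn = ident1("MRN#12-AB")
```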
Important
Data must be stored in the MEMIDENT table, and the appropriate informational sources
must be configured for these functions to work.
Passthrough
The following table contains a list of the current Passthrough Standardization functions:
Table 4-12: Passthrough standardization functions
PASSTHRU: This standardization function does not change the input and simply passes it
    through the Master Data Engine processes. Use of this function is discouraged
    because it was originally designed as a temporary workaround and might not always
    perform as described. For example, if you attempt to standardize a value with
    certain special characters (for example, a colon, caret, or period), you risk a
    negative impact on comparisons with the given member.
What is bucketing?
Bucketing is used to select candidates for comparison. If you performed a search
without bucketing, you would compare against every member in the database, which
would make searching time-consuming and reduce the overall performance of the system.
By employing buckets, we can select a smaller list of candidates to compare against (for
example, fewer than 2,000) and still be confident that we are going to find the right member
records to match with.
In the diagram below, notice that Paula Montauk's member record (167289) is being
bucketed by First Name, First Name + Last Name, Address, and Last Name + Zip Code.
The Bucketing Functions define how her record will be assigned hashes.
Note
Instead of the Buckets being listed by name, they are assigned a hash number, which
allows the Master Data Engine to retrieve members faster during comparison.
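The idea of hashing bucket values and using the hashes to pull a short candidate list can be sketched in Python. This is a simplified illustration under stated assumptions, not the Master Data Engine's hashing scheme: the CRC32 hash, the bucket names, the field names, and all member data other than the name Paula Montauk and the ID 167289 are hypothetical.

```python
import zlib
from collections import defaultdict

# Hypothetical bucket definitions mirroring the four buckets named in the text.
BUCKET_DEFINITIONS = {
    "FIRST": ("first",),
    "FIRST+LAST": ("first", "last"),
    "ADDRESS": ("address",),
    "LAST+ZIP": ("last", "zip"),
}


def bucket_hashes(member):
    """Return one hash per bucket definition for a member record."""
    hashes = set()
    for name, fields in BUCKET_DEFINITIONS.items():
        key = name + "|" + "+".join(member[f].upper() for f in fields)
        hashes.add(zlib.crc32(key.encode("utf-8")))  # assumption: any stable hash works
    return hashes


def build_index(members):
    """Map each bucket hash to the set of member IDs stored under it."""
    index = defaultdict(set)
    for member_id, member in members.items():
        for h in bucket_hashes(member):
            index[h].add(member_id)
    return index


def candidates(index, probe):
    """Collect every member that shares at least one bucket hash with the probe."""
    found = set()
    for h in bucket_hashes(probe):
        found |= index.get(h, set())
    return found


members = {
    167289: {"first": "Paula", "last": "Montauk", "address": "12 Elm St", "zip": "85001"},
    200001: {"first": "Paul", "last": "Montauk", "address": "12 Elm St", "zip": "85001"},
    200002: {"first": "Ann", "last": "Smith", "address": "99 Oak Ave", "zip": "10001"},
}
index = build_index(members)
probe = {"first": "Paula", "last": "Montauk", "address": "7 Pine Rd", "zip": "85001"}
selected = candidates(index, probe)
```

Only members that share a bucket with the probe are returned for comparison; member 200002 is never compared.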
If you require two tokens per bucket, what combinations of buckets would be created? List
each unique combination of values only once (that is, you do not need both State+Zip
and Zip+State; one combination is sufficient):
If you require three tokens per bucket, what combinations of buckets would be created?
What about four tokens?
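Listing order-independent combinations of a fixed size is what itertools.combinations does in Python. The sketch below uses four hypothetical token names (only State and Zip appear in the text) to show how the number of buckets changes with the required tokens per bucket.

```python
from itertools import combinations


def bucket_combinations(tokens, tokens_per_bucket):
    """Return each unordered combination once, joined with '+'."""
    return ["+".join(combo) for combo in combinations(tokens, tokens_per_bucket)]


tokens = ["Street", "City", "State", "Zip"]  # hypothetical token names
pairs = bucket_combinations(tokens, 2)
triples = bucket_combinations(tokens, 3)
quads = bucket_combinations(tokens, 4)
```

With four tokens, this yields six pairs, four triples, and a single four-token bucket.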
Comparing data
A comparison function allows the algorithm to look at two sets of data and determine how
similar they are to one another. The IBM Initiate Master Data Service platform has several
methods for comparing data, from a simple exact match comparison to a three-dimensional
edit distance comparison.
In the case of an exact match comparison, you simply want to know if x = y. So, if Phillip =
Phillip, you receive a positive score, but because Phillip <> Phillips, you receive a
negative score. That is where an edit distance can help.
In the case of a three-dimensional edit distance comparison, you have three values, such as
zip + address + phone, and you compare x(zip + address + phone) to y(zip + address + phone).
The edit distance is the number of edits that would need to be made in order for the two
strings to match exactly.
So, in the case of Phillip versus Phillips (which is a one-dimensional edit distance), there is
an edit distance of 1, which would still offer a positive score because the values are close to
being the same. For a three-dimensional comparison, let's look at these two strings:
89814+1285_W_Main+2239871 versus 89814+1285_Main+2239817
If we break them down into their single elements, we see that 89814 = 89814, so that
is an edit distance of 0, which earns the highest score possible. 1285_W_Main and
1285_Main have an edit distance of 2 because you need to add or remove the 2 characters
'W_' to make them match. 2239871 and 2239817 are almost the same, but there is a
transposition of the last two digits, 71 versus 17. This is an edit distance of 1, which is
still a high score.
Comparing data determines the structure of the memcompd table. Weight tables, like the 3
Dim (mpi_wgt3dim) weight table, can take into account the comparison function's
assessment and provide a score. In the case of an exact match on zip code, a distance of
2 on address, and a distance of 1 on phone, the weight table can provide the correct weight
value to assign to the comparison. Altogether, these two strings would score positively in a
comparison.
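The distances quoted in this section can be reproduced with a small Python sketch. It assumes an edit distance that counts insertions, deletions, substitutions, and adjacent transpositions as one edit each (the optimal string alignment variant); the source does not say which variant the engine uses, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Edit distance where an adjacent transposition counts as a single edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def three_dim_distance(x, y):
    """Compare '+'-separated values element by element, one distance per dimension."""
    return tuple(edit_distance(p, q) for p, q in zip(x.split("+"), y.split("+")))


name_distance = edit_distance("Phillip", "Phillips")
distances = three_dim_distance(
    "89814+1285_W_Main+2239871",
    "89814+1285_Main+2239817",
)
```

The resulting tuple (zip, address, phone) is the kind of triple that a three-dimensional weight table would then translate into a score.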
String pair
This function is used to handle any kind of specialty (abstract) codes used by providers.
Table 4-16: String pair comparisons
ATTR2S: ATTR2S can be used to compare any combination of two string-valued attributes.
    This function replaces the previously used EXH function, but can extend beyond
    just eye and hair color comparison.
Name
Name, or XNM, comparisons use a combination of techniques to compare the data. For
example, name comparisons use Phonetics, Nicknames, Substitutions, and can even pull
in Edit Distance processes. Depending upon the type of data that you are working with, you
would use one of four types of Name (XNM) comparisons.
Table 4-18: Name (XNM) comparisons
Provider: (BXNM) Value-based provider name comparison; 1 role/1 dimension. Used to
    compare business names. Uses exact match, nicknames, and phonetics when comparing
    tokens.
Company: (CXNM) Value-based company/organization name comparison; 1 role/1 dimension.
    Used to compare business names. Final weight is based on total similarity, not
    total token weights.
Company (context sensitive): (CXNM_CS) Value-based company/organization name comparison;
    1 role/1 dimension. The comparison functions enable business name token weights to
    reflect locality. For example, within the locale of the Phoenix metropolitan area,
    the token Phoenix will have a high frequency and a low weight.
    Example:
    Phoenix Flower Shop, 7611 E. Phoenix Road, Phoenix, AZ
    Phoenix Pizza, 7611 E. Phoenix Road, Phoenix, AZ
    In the above example, the comprole2 setting would enable a start location after
    Phoenix. In this case, the Master Data Service could start with Flower and Pizza
    when comparing the name of the company.
Person: (PXNM) Value-based person name comparison; 1 role/1 dimension. Used to compare
    person names. Uses exact match, nicknames, and phonetics when comparing tokens.
Person (comprehensive): (QXNM) Value-based person name comparison; 1 role/1 dimension.
    Used to compare person names on four dimensions (Q is for Quad). Uses exact match,
    nicknames, phonetics, and edit distance when comparing tokens. The PARM weight
    table contains a limit on the total weight for the name match.
Date
Several elements can play into date comparisons, namely exact match and edit distance.
The comparison functions for dates apply a large penalty for year disagreement, but also
use edit distance to look for transpositions and typos.
Table 4-19: Date comparisons
Date: (DATE) Value-based date comparison; 1 role/1 dimension. Used to compare dates. If
    the result is an exact match, weight is by birth year; otherwise, weight is by edit
    distance.
Date or Age: (DATE2) Value-based date comparison; 1 role/1 dimension. This function is
    intended for single date comparison, such as birthdays or any event that occurs on
    a specific day.
Date or Year: (DOBA) Value-based date/age comparison; 1 role/1 dimension. Used to compare
    dates and/or birth years. If both values are dates, the logic is identical to DATE.
    Otherwise, the comparison is the difference in birth years.
Equivalency
Equivalency comparisons are simply looking for exact matches. Therefore, they can be
excellent comparisons for demographics like Gender, Eye Color, and Birth Year.
Table 4-22: Equivalency comparisons
Alpha/Numeric: (EQVD) Value-based simple string comparison; 1 role/1 dimension. Simple
    true/false comparison of string values that returns MATCH or NO_MATCH. Uses the
    weight specified in configuration from the sval weight table. Gender and Eye Color
    are often compared with EQVD.
Numeric: (EQVN) Value-based numeric comparison; 1 role/1 dimension. Used to compare two
    integers. Result is match or non-match. Exact match values are looked up by value.
    Uses the weight specified in the nval weight table.
Geocode
Input for the GEO comparison function must come from an attribute standardized by the
GEO standardization function.
The GEO comparison operates by calculating the great-circle distance between the two
locations being compared. The distance is calculated in meters. This distance is converted
into a similarity measure by taking the base-10 logarithm according to the formula:
Similarity = 2 * log10(distance / resolution)
For distances less than the resolution, the similarity is set to 0. The default resolution is 1
meter. The resolution can be changed by providing an argument to the comparison
function. The argument specifies the resolution to be used (in meters).
The GEO comparison function uses a single 1Dim weight table. The weights are indexed
by the similarity measure just described. The similarity is offset by 1 so that index 0 can be
used for missing data.
Table 4-24: Geocode comparisons
GEO: The GEO comparison function compares locations by calculating the distance between
    them. Locations that are closer together are considered more similar than locations
    farther apart.
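The distance and similarity calculation described above can be sketched in Python. The similarity function follows the stated formula directly. The great-circle helper uses the haversine formula with a mean Earth radius of 6,371,000 meters; both of those choices are assumptions, because the source does not say how the engine computes the distance. How the similarity is rounded into a weight table index is also not stated, so the sketch stops at the similarity value.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumption: mean Earth radius in meters


def great_circle_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def geo_similarity(distance_m, resolution_m=1.0):
    """Similarity = 2 * log10(distance / resolution); 0 below the resolution."""
    if distance_m < resolution_m:
        return 0.0
    return 2 * math.log10(distance_m / resolution_m)


one_degree = great_circle_m(0.0, 0.0, 0.0, 1.0)   # one degree of longitude at the equator
default_resolution = geo_similarity(100.0)         # resolution of 1 meter
coarse_resolution = geo_similarity(100.0, 10.0)    # resolution of 10 meters
```

Note that the measure grows with distance, so a smaller value indicates locations that are closer together.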
US zipcode
Address-based comparisons are especially vital to the householding process. But even if
you are not using a household entity, being able to draw correlations between address
components can help you match members that have a history of moving frequently or
might have entered their address in different formats.
Table 4-26: Address comparisons
USZIP: Value-based United States Zip code comparison; 1 role/1 dimension. Returns exact
    if the first five digits match and partial if the first three digits match;
    otherwise, NO_MATCH is returned.
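The USZIP rule can be written as a short Python sketch. The result labels EXACT and PARTIAL are placeholder names for the "exact" and "partial" outcomes (only NO_MATCH is named in the text), and treating a value with fewer than five digits as NO_MATCH is an assumption.

```python
def uszip_compare(zip_a, zip_b):
    """Compare two US Zip codes on their first five and first three digits."""
    a = "".join(ch for ch in zip_a if ch in "0123456789")
    b = "".join(ch for ch in zip_b if ch in "0123456789")
    if len(a) < 5 or len(b) < 5:
        return "NO_MATCH"  # assumption: incomplete Zip codes do not match
    if a[:5] == b[:5]:
        return "EXACT"
    if a[:3] == b[:3]:
        return "PARTIAL"
    return "NO_MATCH"


same_zip = uszip_compare("85001-1234", "85001")
nearby_zip = uszip_compare("85001", "85099")
different_zip = uszip_compare("85001", "10001")
```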
Tuning search
Previously, an implementation that required a low false-positive rate and employed one
algorithm to direct the search, match, and link process was not always efficient because
of the false-positive filter (FPF) configured within that single algorithm. The tuning
functionality enables you to configure a single algorithm while specifying which
comparison functions (cmpfuncs) are used during a search operation or a match operation
(memsearch versus memput). By specifying a comparison mode (cmpmode), algorithms
can be configured to:
Search only,
Match and Link only, or
Search, Match, and Link in one call
If your implementation requires broader search criteria and tighter matching criteria,
your algorithm should be configured to have a Search only comparison and a Match and
Link comparison. This replaces the previous concept of creating two algorithms. Search,
Match and Link is the default setting, and if all comparison functions are set to this, the
previous behavior is retained.
Using the example of an implementation requiring low false-positive rates, you would
configure a single algorithm, flag the FPF comparison function role as Match and Link,
and flag the remaining comparison functions as Search, Match and Link. When a new
member is added to the hub under the ID entity, the algorithm uses all comparison roles
(including the FPF). When a search request is executed against the ID entity, the algorithm
does not apply the FPF to the returned search results. However, if the API requests that
the data be returned as entities, those entities are the ones formed using the FPF.
You might also want to use attributes in searching that you do not want to contribute to the
matching/linking score. As an example, using the next-of-kin value in the matching
algorithm might contribute to family false-positives during the matching. However, you
might still want to use the fuzzy name matching to search on next-of-kin. For this, you
would still configure a single algorithm, but would flag the QXNM cmpspec applied to the
next-of-kin name field as Search Only while the rest would be flagged Search, Match and
Link.
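The effect of flagging comparison functions with a comparison mode can be modeled in a few lines of Python. This is a conceptual sketch only: the CmpMode enum, the comparison names, and the active_comparisons helper are hypothetical and do not reflect how the engine stores or evaluates cmpmode.

```python
from enum import Flag, auto


class CmpMode(Flag):
    """Hypothetical model of the three comparison modes."""
    SEARCH = auto()
    MATCH_LINK = auto()
    SEARCH_MATCH_LINK = SEARCH | MATCH_LINK  # the default mode


# One algorithm; each comparison carries its own mode flag.
COMPARISONS = {
    "member name (QXNM)": CmpMode.SEARCH_MATCH_LINK,
    "next-of-kin name (QXNM)": CmpMode.SEARCH,
    "false-positive filter": CmpMode.MATCH_LINK,
}


def active_comparisons(comparisons, operation):
    """Return the comparisons that take part in the given operation."""
    return [name for name, mode in comparisons.items() if operation in mode]


during_search = active_comparisons(COMPARISONS, CmpMode.SEARCH)     # memsearch
during_match = active_comparisons(COMPARISONS, CmpMode.MATCH_LINK)  # memput
```

In this model, the next-of-kin comparison widens a search without affecting linking, while the false-positive filter tightens linking without narrowing a search.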
Note
If you currently use two algorithms to address this scenario, Initiate recommends that you
remove the algorithms and utilize the new tuning capability as described. This will simplify
your deployment.
Query roles
Query roles are seldom used. They bypass the algorithm and allow you to perform a strict,
deterministic search on an attribute and are usually associated with task searches.
Note
The paths above show how a Social Security Number can be processed in the Algorithm.
The first three elements include the Attribute of SSN, a Standardization Function that only
keeps numbers (removes special characters and letters), and simply moves the data into
Comparison Role 2 (the second one in the algorithm, the number has no significance).
Those three elements correspond to the memcompd table.
From there, the SSN is Bucketed by sorting the numbers and then a Bucket Group is
created to hold the tokens. These two elements correspond to the membktd table.
The Comparison Path shares the same first three elements, then rolls into a 1-Role,
1-Dimension Quick Edit Distance. These elements correspond to the weight table.
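The standardization and bucketing steps of the SSN path can be sketched in Python. One plausible reading of "bucketed by sorting the numbers" is that the digits are sorted before the bucket token is built, so that values differing only by transposed digits land in the same bucket; that interpretation and the function names are assumptions.

```python
def standardize_ssn(raw):
    """Keep digits only, removing special characters and letters."""
    return "".join(ch for ch in raw if ch in "0123456789")


def ssn_bucket_token(raw):
    """Build a bucket token from the sorted digits of the standardized value."""
    return "".join(sorted(standardize_ssn(raw)))


token = ssn_bucket_token("902-83-1386")
transposed = ssn_bucket_token("902-83-1368")  # last two digits swapped
```

Both values produce the same token, so a record with a transposition would still be selected as a candidate and then scored by the edit distance comparison.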
Configuring an algorithm
Algorithms are created in the Algorithm Editor. After the data dictionary has been created
(Attributes and Entity Types especially), the Palette is fully populated with the
components of an algorithm.
Attributes: Analyze the data from any member attribute in your dictionary. You do not
    have to include all attributes in your algorithm, as some demographics are for
    display purposes only. You can add an attribute to the Algorithm Editor only once.
Standardization Functions: Transform the attribute values into more consistent sets
    of data. You can remove specified characters, trim the length of the field, and even
    break apart one value into many smaller values. You can standardize a single
    attribute more than once.
Comparison Roles: Define how a comparison function is used in the algorithm.
Query Roles: Define how a query function is used in the algorithm.
Comparison Functions: Analyze two sets of data and determine how similar they are
    to each other. You can choose a number of comparisons to run against the data.
Bucketing Functions: Identify bucketing data, which identifies groups of shared
    information. You can use buckets for names, addresses, identifiers, and so on.
Bucketing Groups: Define how a comparison function is used in the algorithm.
Exercise
Now it is time to perform Exercise 3, taking approximately 60 minutes.
Overview
The Data Extract is a sampling of the data. You will test this data for basic adherence to the
Data Extract Guide's specifications and run CloverETL graphs against it to ensure proper
data format.
Dependencies
The Data Extract Guide outlines the data requirements. You will also need access to
Workbench and the CloverETL tool to perform the data cleansing.
Topics
This unit will cover:
Data Extracts
- Reviewing the Data Extract Guide and Extracts
Clover ETL
Copyright IBM Corp. 2010, 2011 Unit 5. Cleaning the data extract 5-1
Data extracts
The Data Extract is the heart of every Initiate software implementation. Before Initiate can
begin to solve a problem, we must first get a data extract so that we can verify the quality of
the data, load the data into the software, configure an algorithm to manage the data, and
perform an initial data load to see how the records compare, match, and link. Each project
has a Data Extract Guide, which highlights the necessary file format, data elements, and
protocols for delivering the file to Initiate.
To ensure that the data extract adheres to Initiate standards, a Data Extract Guide is
prepared for each project. Much of the guide is standard issue, but can be customized (for
example, each algorithm requires data of a slightly different type, therefore we ask for the
appropriate data to fulfill the project goal). The extract should be 10% of the total number of
records or approximately 1,000,000, whichever is larger.
Be prepared!
Data collection methods are unique. Some produce separate extract files for each source,
and many times there is not one clear extract file. Common scenarios include:
Source-specific attributes keep files from being merged
Demographic Data, Address Information, and Alias Name information can be stored in
separate tables
Duplicate record identifiers might be in data on purpose to retain historical data (for
example, Previous Address or Phone Number)
Important
Encountering problems with the data in the Data Extract does not necessarily mean that
the project loses momentum. In most cases, the data file can be scrubbed using Clover
graphs that help convert data into the proper formats and reduce duplicate records. Once
the data has been fixed the initial data load can commence.
Tidbit:
Java ETL can be shortened to JETL. The word jetel in Czech means clover.
What is ETL?
ETL stands for Extract, Transform, and Load. It is a three-step process for data
warehousing that:
Extracts data from outside sources
Transforms it to fit business needs
Loads it into the data warehouse
Clover graphs
Graphs, or transformation graphs, are a visual representation of the process of
transforming data from one form to another. A graph consists of at least three elements:
Components: Perform various data transformations or functions.
Edges: Connect the nodes by passing data.
Metadata: Define the data structure.
Hint
We will construct the graph above in the next few pages. Let us begin!
This expression starts by taking the whole record as one field and appending or
concatenating a space to the end - see concat($0.fieldName, " "). The extra space is
added to account for records which have no value for the last field. Having a null value
is not the same as having the wrong number of fields. This space will eventually need to
be shaved off of the last field before the records are brought into the hub by the Derive
Data and Create UNLs (mpxdata) process.
Note
The concatenation can be removed for data extracts that already have a closing pipe
delimiter at the end of each record.
The next process in the expression splits the string at each occurrence of a pipe "|" -
see split(concat($0.fieldName, " "), "\\|"). In order for Clover to properly recognize the
pipe character, it needs to have two backslashes in front of it (this has to do with the
way the expression is converted to Java before processing).
The third step uses the length function to count the number of elements in the list
returned by split - see the length() function at the beginning. In effect, it counts the
number of fields separated by the pipes.
The final step is to compare the field count with the number of fields that you expect
each row to have - see the == # of Fields element at the end. This number is the
number of fields you should have, not the number of pipe delimiters. The double-equal
sign is required in the Clover language, and the number does not need to have any
special qualifiers around it, like '3' or "3"; simply enter == 3.
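The four steps described above (pad with a space, split on the pipe, count the pieces, compare with the expected number) can be mirrored in Python to show the logic. This is an analogue for illustration, not Clover syntax, and the function name is hypothetical. Python's split keeps a trailing empty field, so the padding space is not strictly needed here; it is kept only to follow the same steps as the expression.

```python
import re


def has_expected_fields(record, expected_fields):
    """Return True when a pipe-delimited record has the expected number of fields."""
    padded = record + " "             # step 1: append a space to the whole record
    fields = re.split(r"\|", padded)  # step 2: split on the (escaped) pipe character
    count = len(fields)               # step 3: count the resulting fields
    return count == expected_fields   # step 4: compare with the expected number


complete = has_expected_fields("167289|MONTAUK|PAULA", 3)
empty_last = has_expected_fields("167289|MONTAUK|", 3)   # last field has no value
too_short = has_expected_fields("167289|MONTAUK", 3)
```

A record with an empty last field still passes, while a record with a missing field does not, which matches the distinction the text draws between a null value and the wrong number of fields.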
Exercise
Now it is time to perform Exercise 4, taking approximately 60 minutes.
Overview
You need to create a database, install the IBM Initiate Master Data Service engine,
configure the ODBC connection, create the instance directory, and establish the Windows
service.
Dependencies
You need to have a supported database platform and the proper software installation files
for your operating system. The computer images provided have DB2 and will support the
IBM Initiate Master Data Service.
Topics
This unit will cover:
Deploying the instance overview
Copyright IBM Corp. 2010, 2011 Unit 6. Deploying the instance 6-1
Exercise
Now it is time to perform Exercise 5, taking approximately 30 minutes.
Overview
The Initiate data model is the physical way the IBM Initiate Master Data Service software
stores data. The Initiate member model (.imm) is a metadata layer that classifies and
organizes the components used to track and match member data elements.
Dependencies
The Data Extract Guide outlines the specific attributes and fields and the Implementation
Approach defines additional data dictionary requirements.
Topics
This unit will cover:
The generic data model
Attributes
The data dictionary
Entities
Copyright IBM Corp. 2010, 2011 Unit 7. Overview of the Initiate data model 7-1
Audit tables
The IBM Initiate Master Data Service has built-in capabilities to track activity in the hub.
The level at which the audit tables track activity is determined on the Security tab in
Workbench. Auditing is enabled at the Interaction (API function call) level.
There are three options for auditing:
None: The interaction will not be tracked
Activity: Who called the interaction and when
Member: Who, when, and what member records were involved in the interaction
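The three auditing options can be pictured as progressively richer records. The following is a minimal sketch only: the `AuditLevel` enum, the `audit_record` function, and the dictionary keys are hypothetical names chosen for illustration and do not reflect the actual audit table layout.

```python
from enum import Enum


class AuditLevel(Enum):
    """The three auditing options listed above (names are illustrative)."""
    NONE = "None"
    ACTIVITY = "Activity"
    MEMBER = "Member"


def audit_record(level, user, timestamp, member_ids):
    """Return what would be tracked for one interaction at the given level."""
    if level is AuditLevel.NONE:
        return None  # the interaction is not tracked
    record = {"user": user, "when": timestamp}  # who and when
    if level is AuditLevel.MEMBER:
        record["members"] = list(member_ids)  # what member records were involved
    return record
```

Each level adds to the one before it, so Member auditing captures everything Activity auditing does plus the member records involved.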
The MemIdNum is not generated by the hub but instead is the primary identifier generated
by the source system.
These tables comprise the core member segments in the IBM Initiate Master Data Service
software. Member segments are used as storage for what is normally thought of as data
records. These data types store data in a normalized form, and are generally more
complex in nature than the typical, low-level database offerings of numeric, string, and
date-oriented types.
Table 7-2: Member tables
mpi_memaddr Postal address information from original data.
mpi_memappt Appointment information from original data.
mpi_memattr Generic information, uninterpreted strings.
mpi_membktd Derived bucket hash assignments (from Derive Data and Create UNLs (mpxdata)).
mpi_memcmpd Derived comparison strings (from Derive Data and Create UNLs (mpxdata)).
mpi_memcont Provider contract information from original data.
mpi_memdate Date and datetime information from original data.
mpi_memdrug Prescription information from original data.
mpi_memelig Eligibility information from original data.
mpi_memhead Member header information (including original source IDs).
mpi_memident Identifier information from original data.
mpi_memname Name information from original data.
mpi_memnote Notes about members.
mpi_memphone Phone number information from original data.
mpi_memqryd Direct-access browsing query information (rarely used).
mpi_memoref Reserved for future use.
mpi_memrule Member rule information.
mpi_memtext Reserved for future use.
mpi_memextb Extension attributes (B).
mpi_memextc Extension attributes (C).
mpi_memextd Extension attributes (D).
mpi_memexte Extension attributes (E).
mpi_memlink Relationship linking information.
Note
Tables can be viewed in DB2 Control Center. This can sometimes help with
troubleshooting later on in the implementation.
To view tables:
1. Open DB2 Control Center.
2. Expand the All Databases node, the Bootcamp node, and then click the Tables
node.
3. Click the table in the tables list to the right and view the data in the display below.
Overview
Derived data is essentially data that has been processed by the algorithm. You will derive
your data using the Derive Data and Create UNLs (mpxdata) job and then load the data
into the database.
Dependencies
You will need to have most of the components installed and configured, like the hub engine,
member model, .cfg file, and the algorithm. If changes are made to the algorithm, then data
will need to be re-derived.
You will need to know the order in which the fields appear in the Data Extract and the
corresponding Attribute names in the member model. You can build your .cfg file from
project documentation if the real files are not yet available. Check for accuracy before
deployment.
Topics
This unit will cover:
Data derivation overview
Data analytics overview
Comparison strings
Comparison Strings are caret (^) delimited strings with standardized data for each
field/attribute referenced in the algorithm. Each member record correlates to a single
comparison string, stored in the mpi_memcmpd table.
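As an illustration of the caret-delimited layout, the sketch below splits a comparison string into named parts. The sample string, the field names, and the `split_comparison_string` helper are all hypothetical; the real content and order of the fields depend on the algorithm configuration.

```python
def split_comparison_string(cmp_string, field_names):
    """Split a caret-delimited comparison string into a name -> value mapping."""
    values = cmp_string.split("^")
    if len(values) != len(field_names):
        raise ValueError("field count does not match the comparison string")
    return dict(zip(field_names, values))


# Hypothetical example: three standardized fields in algorithm order.
example = split_comparison_string(
    "SMITH:JOHN^19700101^5551234",
    ["name", "dob", "phone"],
)
```

An empty field would appear as two adjacent carets and come back as an empty string in the mapping.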
Table 8-1: Excerpt of mpi_memcmpd table
Bucket hashes
Bucket Hashes are numbers that represent the buckets that each member belongs to.
Each member can have multiple hash assignments stored in the mpi_membktd table.
Table 8-2: Excerpt of mpi_membktd table
memrecno srcrecno bkthash
107247 10 2906828572220351548
107247 10 7872683519891125240
107247 10 8417370951307185476
107247 10 7876659353937979316
107247 10 8080639425811308900
107247 10 -8626021308968924882
107247 10 -6612914640955687282
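The values in the excerpt fit in a signed 64-bit integer, which is why some are negative. Members that share a bucket hash become candidates for comparison. The following sketch shows that idea under simple assumptions: the `candidate_pairs` helper is hypothetical, and member 107300 is an invented row added so that one bucket is shared.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(rows):
    """Group (memrecno, bkthash) rows by hash and pair members sharing a bucket."""
    members_by_bucket = defaultdict(set)
    for memrecno, bkthash in rows:
        members_by_bucket[bkthash].add(memrecno)
    pairs = set()
    for members in members_by_bucket.values():
        # Every pair of distinct members in the same bucket is a candidate.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


rows = [
    (107247, 2906828572220351548),   # from the excerpt above
    (107247, -8626021308968924882),  # from the excerpt above
    (107300, 2906828572220351548),   # hypothetical second member, same bucket
]
```

A pair is reported once even if the two members share several buckets, because the result is a set.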
Prepare Binary Files (mpxprep): This method is usually used with an Incremental Cross
Match (IXM). The job loads up the records from each of the sources separately, then runs
Prepare Binary Files (mpxprep) to compile the Binary files. If you ran each source
separately, your binary files would not be complete. Running Prepare Binary Files
(mpxprep) as a final step reads all of the records from the member tables and builds a
complete set of BXM files.
Member Model Transform Graph: This is a CloverETL-enabled wizard that guides you
through the process of creating member UNL files from your data extract. The advantage is
that it uses the existing metadata from the extract, so you do not have to design a
configuration file. You would follow up this step with Derive Data from UNLs (mpxfsdvd) to
create bucket hashes, comparison strings, and binary files.
2. MEMPUT: This mode deposits data directly into the database from the extract file. The
parsing of the data is simply a step in the process as Derive Data and Create UNLs
(mpxdata) loads the information into the hub's data structure.
Attribute
This column is used to identify the attribute code associated with the attribute (or data
element) inside the database you are going to populate. This name must match the
attribute's ATTRCODE as found in the mpi_segattr table.
#
This column defines the instance variable (IVAR) number which indicates how many of the
same named data elements are in this customer record. For example, if the extract
contains three phone numbers and inside mpi_segattr you only have one attribute defined
as PHONE (rather than WORKPHN, HOMEPHN, CELLPHN), you could increment the
IVAR column for each PHONE entry you have in the configuration file.
This is not the best way to handle multiple attributes of similar type. It is recommended to
create three separate attributes inside mpi_segattr to handle the three phone types
mentioned above. IVAR is also used in conjunction with the asaidxno (the attribute sparse
array index number) of each data element. If you are using asaidxno, you will need to
increment the IVAR to correspond with each increment of the asaidxno you are assigning
to the data element. This is a feature best handled through consultation with an Initiate
SME.
Position
This column is the physical position of the field within the customer extract (for example,
the third field has an offset of 3, the fourth field has an offset of 4, and so on). An offset of 0
indicates we will be inserting a constant value in this field, and not pulling the value from
the extract. The example above is inserting a constant value of SSA as the ID Issuer of the
SSN, which is represented by a 0, since it is not really a field in the import data set. Simply
skip fields that you do not want to import into the database (for example, if the 15th field is
unimportant, go from 14 to 16).
Transform
This column allows an optional edit method or transformation method you can apply to a
string field in the customer data element before inserting it into the Initiate database. These
are very basic commands that can be used to remove blanks or zeros from the right or left
side of the data element if necessary. More advanced edits can be conducted by using the
Algorithm's Standardization Functions to ensure that data is compared more effectively.
NA Do nothing
TR Trim leading and trailing blanks, allow empty string
BL Trim blanks from the left, allow empty string
ZL Trim zeroes from the left, allow empty string
B1 Trim blanks from the left, leave at most 1
Z1 Trim zeroes from the left, leave at most 1
BR Trim blanks from the right, allow empty string
ZR Trim zeroes from the right, allow empty string
ZX Trim zeroes from the right, if all zeroes make NULL
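The transform codes can be approximated with ordinary string operations. This is a sketch of one possible reading of the list above, not the product's implementation: the `apply_transform` function is hypothetical, and "leave at most 1" is interpreted here as keeping a single blank or zero when the value would otherwise become empty.

```python
def apply_transform(code, value):
    """Apply one transform code to a raw extract field (illustrative reading)."""
    if code == "NA":
        return value
    if code == "TR":
        return value.strip(" ")
    if code == "BL":
        return value.lstrip(" ")
    if code == "ZL":
        return value.lstrip("0")
    if code == "B1":
        trimmed = value.lstrip(" ")
        return trimmed if trimmed or not value else " "  # keep one blank
    if code == "Z1":
        trimmed = value.lstrip("0")
        return trimmed if trimmed or not value else "0"  # keep one zero
    if code == "BR":
        return value.rstrip(" ")
    if code == "ZR":
        return value.rstrip("0")
    if code == "ZX":
        trimmed = value.rstrip("0")
        return None if value and not trimmed else trimmed  # all zeroes -> NULL
    raise ValueError(f"unknown transform code: {code}")
```

For example, `apply_transform("ZL", "00120")` strips only the leading zeroes, while `apply_transform("ZX", "000")` yields `None` to stand in for NULL.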
Assignment
This column defines the Initiate Database Method used to populate the Initiate record with
the customer data. Each data type has its own set methods. You must take care to select
the proper one for the data element you are inserting. The following list shows many of the
common methods used.
SetString Defines the value as a text string
SetNumber Defines the value as a numeric field
SetDate_Y4MD Defines a date field as YYYYMMDD
SetDate_MDY4 Defines a date field as MMDDYYYY
SetDate_MDY2 Defines a date field as MMDDYY
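The date methods differ only in the layout they expect from the extract. The sketch below maps each method name to a Python `strptime` format to show the difference; the `apply_assignment` function is hypothetical, and treating SetNumber values as integers is an assumption made for brevity.

```python
from datetime import datetime

# Method name -> strptime format for the layouts listed above.
DATE_FORMATS = {
    "SetDate_Y4MD": "%Y%m%d",  # YYYYMMDD
    "SetDate_MDY4": "%m%d%Y",  # MMDDYYYY
    "SetDate_MDY2": "%m%d%y",  # MMDDYY
}


def apply_assignment(method, raw):
    """Convert a raw extract string according to the assignment method."""
    if method == "SetString":
        return raw
    if method == "SetNumber":
        return int(raw)  # assumption: whole numbers only
    if method in DATE_FORMATS:
        return datetime.strptime(raw, DATE_FORMATS[method]).date()
    raise ValueError(f"unknown assignment method: {method}")
```

Choosing the wrong date method for a field typically either fails to parse or silently produces a different date, which is why the method must match the extract layout.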
Field
This column is the Initiate Database Field where the value will be inserted. The
ATTRCODE defined in column one allows you to direct data to a specific field within a
Segment (for example, the DOB Attribute is found in the MEMDATE Segment and the
dateVal field is the column where dates are stored). You can find the available Fields in
Workbench by going to the .imm file and selecting Attribute Types.
Constant
This is the constant data value that will be inserted into the Initiate database when the data
element has a Position value of 0.
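The seven columns described in this section can be modeled as one row per data element. The sketch below is an in-memory illustration only and does not reproduce the actual configuration file syntax: the `ConfigRow` class, the `source_value` helper, the sample record, and the `idIssuer` field name are hypothetical. The `dateVal` field and the SSA constant come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ConfigRow:
    attribute: str   # ATTRCODE from mpi_segattr
    ivar: int        # the "#" column
    position: int    # 1-based field position in the extract, 0 = constant
    transform: str   # for example NA or TR
    assignment: str  # for example SetString or SetDate_Y4MD
    field: str       # target field within the segment
    constant: Optional[str] = None


def source_value(row: ConfigRow, record: Sequence[str]) -> Optional[str]:
    """Return the raw value a row refers to: a constant or an extract field."""
    if row.position == 0:
        return row.constant
    return record[row.position - 1]  # positions count from 1


record = ["1001", "SMITH", "19700101", "123456789"]
dob_row = ConfigRow("DOB", 1, 3, "NA", "SetDate_Y4MD", "dateVal")
issuer_row = ConfigRow("SSN", 1, 0, "NA", "SetString", "idIssuer", "SSA")
```

Here `dob_row` reads the third field of the extract record, while `issuer_row` ignores the record and supplies the constant SSA.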
Configuration template
Fill in the template based on the information on the previous page. If you need additional
reference, you can check the sample config file earlier in this unit.
Table 8-3: Practice configuration template
Attribute # Position Transform Assignment Field Constant
Deriving data
Before we can derive our data by running the Final Extract file through the Derive Data and
Create UNLs (mpxdata) job, we need to move the file to the following folder:
C:\<Engine_Directory>\inst\mpinet_<Hub_Instance>\work\<Workbench_Project_Name>\work
The work folder represents the folder on the server where any job that connects to a server
will write its output.
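The folder name is assembled from three installation-specific values. As a small sketch, the hypothetical `work_folder` helper below builds the path from those placeholders; the argument values in the usage note are invented and must be replaced with the ones from your own installation.

```python
from pathlib import PureWindowsPath


def work_folder(engine_directory, hub_instance, project_name):
    """Build the server work folder path from the three placeholders above."""
    return (
        PureWindowsPath(engine_directory)
        / "inst"
        / f"mpinet_{hub_instance}"
        / "work"
        / project_name
        / "work"
    )
```

For example, `work_folder(r"C:\Engine", "hub", "Proj")` produces a path that ends in `inst\mpinet_hub\work\Proj\work`.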
Prerequisites
Before running data analytics, you must have already done the following:
Installed Workbench
Installed the Master Data Engine
Created a database
Loaded the database with specific data dictionary, algorithm, anonymous values,
weights, member data
Performed a bulk cross-match and loaded entity data into the database (required before
using Score Distribution and Potential Overlay).
The icons at the top of the Analytics view provide the tools for accessing analytics data.
Table 8-4: Analytics icons
Name                  Function
Set the Data Source   To connect to the hub from which analytics data is drawn. (Data is taken directly from the hub's database.)
Add New View          To create a new empty view within the perspective.
Pin Query             To pin the query results to the current view and prevent drilldowns from changing the contents of the view.
Exercise
Now it is time to perform Exercise 6, taking approximately 45 minutes.
Overview
The weight generation process is an integrated utility that goes through multiple steps to
measure the frequency of individual values in the database and then assigns weights to
those values with the most common values weighing less and rare values weighing more.
The weight generation process creates unload files which will later be loaded into the
database.
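The relationship between frequency and weight can be shown with a toy calculation. The formula below (base-10 logarithm of the inverse relative frequency) is a stand-in chosen for illustration and is not the formula used by the weight generation utility; the `illustrative_weights` function and the sample names are hypothetical.

```python
import math
from collections import Counter


def illustrative_weights(values):
    """Give rarer values a larger weight than common ones (toy formula)."""
    counts = Counter(values)
    total = sum(counts.values())
    # Stand-in formula: log10(total / count). Not the product's calculation.
    return {value: math.log10(total / count) for value, count in counts.items()}


weights = illustrative_weights(["SMITH", "SMITH", "SMITH", "ZELINSKI"])
```

With this sample, the rare surname receives a higher weight than the common one, which mirrors the behavior described above.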
Dependencies
You must have the hub engine installed and your algorithm configured. If you have already
derived your data, then weight generation will take less time, but the weight generation
utility can also derive data. You should always check your weights before loading them into
the database.
Topics
This unit will cover:
Weights overview
Troubleshooting weights
Exercise
Now it is time to perform Step 1 in Exercise 7, taking approximately 5 minutes. The job will
take up to an hour to run.
5. Select the Insert Chart feature, select a Line Chart option and click Finish.
Counting dimensions
Dimension weights are based on the number of characters in the attribute that you are
measuring, plus two additional values. For a US phone number the standardized value
would have 7 digits. You add to that a weight for Missing and Exact Match and you end up
with 9 dimensions of weight numbered 0-8.
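The counting rule and the meaning of the first few dimensions can be sketched as follows. The function names are hypothetical, and mapping an edit distance of d to dimension d + 1 is an assumption that extends the pattern of dimensions 1 and 2 to larger distances.

```python
def edit_distance(a, b):
    """Levenshtein distance: additions, changes, or removals needed."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def dimension_count(n_chars):
    """Characters in the standardized value plus Missing and Exact Match."""
    return n_chars + 2


def dimension_index(a, b, n_chars):
    """Pick the dimension for a comparison (assumed mapping, see lead-in)."""
    if not a or not b:
        return 0  # a value is missing
    if a == b:
        return 1  # exact match
    return min(edit_distance(a, b) + 1, n_chars + 1)
```

For a 7-digit phone number, `dimension_count(7)` gives 9, matching dimensions 0 through 8, and two numbers that differ in one digit fall into dimension 2.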
Table 9-6: Counting dimensions
Dimension #  Meaning / Value
0  Meaning: A value is missing (or anonymous) from one or both of the member records compared. For example, no phone number.
   Value: Usually zero, it sometimes has a negative penalty.
1  Meaning: The values in both member records are exactly the same.
   Value: Highest score, usually between 4 and 6.5.
2  Meaning: The values are one edit distance different. You would have to add, change, or remove one character to make the records the same.
   Value: Slightly lower than exact match, but still positive.
The further your edit distance gets,
the more penalty there is for Usually a negative number around an
Exercise
Now it is time to perform the remaining steps in Exercise 7, taking approximately 45
minutes.
Overview
The bulk cross match (BXM) is a process that allows you to compare and link thousands of
records per second. The BXM is most commonly performed in the initial stage of the
implementation and again right before the system goes live. The BXM process is made up
of two primary jobs: Compare Members in Bulk (mpxcomp) and Link Entities (mpxlink).
After running the compare and link, the data will need to be loaded into the database.
Dependencies
You must have derived data and generated weights before you perform the bulk cross
match. That also means that the hub engine, algorithm, and data dictionary must be in
place.
Topics
This unit will cover:
Bulk Cross Match overview
Copyright IBM Corp. 2010, 2011 Unit 10. Running a bulk cross match 10-1
Exercise
Now it is time to perform Exercise 8, taking approximately 10 minutes. The job will take
approximately 45 minutes to run.
1 If you are not using Entities in your implementation, you will not need to perform this step.
BXM steps
The starting point is a large file of customer data and the ending point is all the customer
data (plus the Initiate derived data and entity/task assignments) loaded into the database.
The interim steps are designed to load large amounts of data into memory and process
them without ever hitting the disk sub-system.
BXM refers to the entire process, but is also often used to refer to just the compare and link
steps.
The example illustrated here assumes that the hub configuration, algorithm, and weights
have been created and only the customer data remains to be dealt with.
Step 1
Derive Data and Create UNLs (mpxdata) utility: This is the utility you would typically use
for an initial load (starting from scratch).
This utility takes the file of customer data, and converts the core data into the Identity hub
data model. For each table in the Identity hub data model, a new unl text file
(mpi_memname.unl, for example) is created which is eventually loaded into the database.
This utility also creates the derived data (buckets, comparison strings) and puts the
derived data into two formats:
unl text file, to be loaded into the database later
Binary files, which will be used in subsequent processing steps
Step 2
Compare Members in Bulk (mpxcomp) utility: This utility iterates through all the buckets
(selects candidates) and performs the comparison calculations for all the members in each
bucket.
The input is the binary files of derived data (bucket and comparison string binary files) from
the previous step. The binary files are read into memory to speed up all the comparison
calculations.
The output is additional binary files that represent the entity link and task groupings.
Step 3
Link Entities (mpxlink) utility: This utility takes the comparison results and creates entity
link and task files that can be loaded into the database.
The input is the binary files of comparison results (entity link and task groupings) from the
previous step. These files are read into memory for faster processing.
The output is additional unl text files that contain the EID assignments (entlink), tasks
(enttsk), and EID history (entxeia).
The Compare Members in Bulk (mpxcomp) and Link Entities (mpxlink) utilities must be run
once for each type of entity (for example: 'identity' and 'household'), as the outcome will be
different per entity type.
Step 4
Load UNLs to DB (madunlload) utility: This utility takes the 'unl' text files created during
previous steps and loads them into the database.
The Load UNLs to DB (madunlload) utility loads the core member data 'unl' files and the
derived data 'unl' files (the outputs from the Create Core Data and Create Derived Data
steps).
After the database is loaded, the Identity hub engine is started and real-time operation
begins.
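The order of the four steps, including the repetition of compare and link for each entity type, can be summarized in a short sketch. The `bxm_job_sequence` function is hypothetical, and the tuples it returns are labels for the steps rather than real command lines or options.

```python
def bxm_job_sequence(entity_types):
    """List the BXM jobs in order; compare and link repeat per entity type."""
    steps = [("mpxdata", None)]  # Step 1: derive data and create UNLs
    for entity_type in entity_types:
        steps.append(("mpxcomp", entity_type))  # Step 2: compare members
        steps.append(("mpxlink", entity_type))  # Step 3: link entities
    steps.append(("madunlload", None))  # Step 4: load UNLs to the database
    return steps


sequence = bxm_job_sequence(["identity", "household"])
```

With two entity types, the sequence contains six entries: one derive step, two compare and link rounds, and one load step.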
Overview
Once your data is fully loaded into the IBM Initiate Master Data Service software, you
should run tests to establish how well your system and the data are performing. Through
the Generating Matched Pairs job and the Threshold Calculator in Workbench, you can
analyze your thresholds.
Dependencies
Your core engine and data must be fully loaded in order to run the data analytics. Threshold
analysis can be done within Workbench.
Topics
This unit will cover:
Threshold analysis overview
Analyzing Matched Pairs
Copyright IBM Corp. 2010, 2011 Unit 11. Analyzing thresholds and matched pairs 11-1
Threshold overview
Your hub uses set scores to determine what action to take as a result of a comparison.
These scores are referred to as thresholds.
The Autolink (AL) Threshold is the cutoff score where two members will be linked
together by the hub. This upper threshold reflects the score at which the organization is
confident that a match represents the same person. We set this based on the
organization's tolerance for false positives.
The Clerical Review (CR) Threshold is the cutoff score where two members will be
assigned a task for manual review by a user. This lower threshold reflects the score at
which the organization wants to manually review matches. We set this based both on the
organization's tolerance for false negatives, and on the number of matches they are willing
to review based on the cost of manually reviewing these tasks.
Members whose scores fall below the CR threshold will not be linked or assigned to a task
- it is assumed that these members are not related given the information currently
available. If the member records are updated causing the scores to improve, then the hub
might assign them to a task or link them.
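The effect of the two thresholds on a comparison score can be sketched as a simple classification. The `action_for_score` function and the threshold values used in the usage note are hypothetical, and treating a score equal to a threshold as meeting it is an assumption.

```python
def action_for_score(score, clerical_review, autolink):
    """Decide what happens to a member pair given its comparison score."""
    if clerical_review > autolink:
        raise ValueError("the CR threshold should not exceed the AL threshold")
    if score >= autolink:
        return "link"  # at or above the Autolink threshold
    if score >= clerical_review:
        return "task"  # between the thresholds: manual review
    return "none"      # below the CR threshold: no link, no task
```

With hypothetical thresholds of 8.0 (CR) and 11.0 (AL), a score of 12.5 links the members, a score of 9.0 creates a task, and a score of 3.0 does neither.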
Importance of thresholds
The accuracy of our solution is measured in terms of two potential types of errors and we
establish the two thresholds to minimize both types of errors: False Negatives and False
Positives.
The Autolink threshold minimizes False Positive matches. These occur when you match
two records that do not represent the same person.
The Clerical Review threshold minimizes False Negative matches. These occur when you
fail to match two records that represent the same person.
False negatives
The empty arrows on the left are members that should be linked but are not because of
insufficient data. These are False Negatives. If all that is known is a name, then there is not
enough data to make a sound linking judgment. False negatives can possibly be reduced
by tweaking the algorithm, but are most commonly the result of poor data collection.
False positives
The dark arrow on the right marks a link that should not exist. It is the result of data that
is very similar, usually a set of twins, or parent and child records. This is a False Positive. There are
several ways to deal with false positives. Most commonly, we use the False Positive Filter
to issue weight penalties for subtle differences between records.
Now what?
At this point, threshold analysis becomes iterative, and the specific steps will vary from
organization to organization.
You will most likely need to make adjustments to your threshold settings, run another bulk
cross-match, and show the organization another set of matched pairs highlighting the
difference between the new and old threshold settings.
Exercise
Now it is time to perform Exercise 9, taking approximately 45 minutes.
Overview
Once your data is fully loaded into the IBM Initiate Master Data Service software, you
should run tests to establish how well your system and the data are performing. You can
assess score distribution, and entity and bucket size through the analysis tools in
Workbench.
Dependencies
Your core engine and data must be fully loaded in order to run the data analytics. Analysis
can all be done within Workbench.
Topics
This unit will cover:
Analyzing buckets overview
Configuring Frequency Based Bucketing
Choosing attributes
An attribute can be used in multiple buckets. For example, using a 2-token equi-name
bucket (using the name as is) as well as a single-token phonetic name + DOB/zip bucket
might be helpful. Having both buckets allows users to find a specific record even if
Frequency-Based Bucketing (FBB) has eliminated one of the buckets.
Also remember that as population increases, techniques such as phonetics or sorting can
result in an exponential increase in members that share the same bucket. Because of this,
sorting of phone and SSN should usually be avoided for large populations. Instead, it might
be better to use ASIS bucketing for SSN, phone, or other numeric identifiers. You lose the
ability to compensate for typing errors because of this, but usually the performance gain is
more important.
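The trade-off between ASIS and sorted bucketing can be seen in a small sketch. This is not the engine's bucketing code: the function names asis_key, sorted_key, and build_buckets and the sample phone numbers are assumptions made for illustration. A sorted key places every rearrangement of the same digits in one bucket, which tolerates transposition errors but also groups unrelated numbers together.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def asis_key(value: str) -> str:
    """Bucket on the value exactly as entered."""
    return value


def sorted_key(value: str) -> str:
    """Bucket on the sorted characters, so transposed digits still collide."""
    return "".join(sorted(value))


def build_buckets(values: Iterable[str], key: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group values by their bucket key."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        buckets[key(value)].append(value)
    return dict(buckets)


# Made-up phone numbers: the second is a typo of the first, the third is unrelated.
phones = ["5551234", "5551243", "4321555"]
asis = build_buckets(phones, asis_key)
by_sorted = build_buckets(phones, sorted_key)
print(len(asis), "ASIS buckets;", len(by_sorted), "sorted bucket(s)")
```

With ASIS keys the typo lands in its own bucket and is missed; with sorted keys the typo is caught, but the unrelated number joins the same bucket. As the population grows, that second effect is what drives bucket sizes up.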
Note
Each blue dot on the graph is a link to the bucket that it represents. Double-click a point on
the graph to pull up the details about that bucket.
Note
The View Bucket and View Algorithm buttons on the right allow you to see more
information about the members in a particular bucket as well as the element in the
algorithm that defines the bucket.
Note
An average of around 1000 members compared per search still returns results in
subsecond time in most environments. In training, our machines might perform a little
slower.
Note
Frequency-Based Bucketing is not a dynamic monitoring tool. You will need to invoke the
FBB analysis process every 6 to 12 months to keep on top of your largest buckets.
Exercise
Now it is time to perform Exercise 10, taking approximately 40 minutes.
Unit 13. Reiterating the process
Overview
You will take the results of the analysis and make tweaks to your algorithm and data
dictionary, if necessary. After your edits, you will usually re-derive your data, run another
bulk cross-match (BXM), and analyze the results again.
Dependencies
Bucket design changes usually require re-deriving, but not another BXM. Comparison
changes require new weights, re-derivation, and a new BXM. Some small tweaks only
require an engine restart or simply redeploying your configuration.
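The dependencies above can be written down as a small lookup, which is handy when several changes are made in one iteration. This is only a sketch: the change categories and step labels are hypothetical names chosen for this example, and the ordering simply follows the order in which the paragraph lists the steps.

```python
from typing import Dict, List

# Follow-up steps per kind of change, taken from the paragraph above.
# Keys and step labels are hypothetical names for this sketch.
FOLLOW_UP_STEPS: Dict[str, List[str]] = {
    "bucket design": ["re-derive data"],
    "comparison": ["generate weights", "re-derive data", "run bulk cross-match"],
    "small tweak": ["redeploy configuration or restart engine"],
}

# Order in which merged steps are reported.
STEP_ORDER: List[str] = [
    "generate weights",
    "re-derive data",
    "run bulk cross-match",
    "redeploy configuration or restart engine",
]


def steps_for(changes: List[str]) -> List[str]:
    """Merge the follow-up steps for several changes without repeating a step."""
    needed = {step for change in changes for step in FOLLOW_UP_STEPS[change]}
    return [step for step in STEP_ORDER if step in needed]


print(steps_for(["bucket design", "comparison"]))
```

Because the text says bucket design changes usually require re-deriving, treat the lookup as a starting checklist rather than a rule.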
Topics
This unit will cover:
Deploying a new configuration
Deriving data again
Rerunning bulk cross match
Running Entity Analysis
[Figure: process steps Generate Weights, Save and Deploy, MAD UNLLOAD, MPX Data, MPX REDVD, MPX FSDVD, MPX COMP, MPX PREP, and MPX LINK. Legend: X = Mandatory, O = Optional.]
Re-deriving data
There are multiple ways to re-derive your data when you have made changes to your data
model or algorithm, such as adding new Anonymous Values, designing new Bucket
strategies, or changing Standardization functions.
Derive Data and Create UNLs (mpxdata) starts with raw data in a flat file. You can
choose to assign bucket hashes, create comparison strings, and compile binary files. You
can also choose between MEMCOMPUTE and MEMPUT modes. MEMCOMPUTE
writes the results of Derive Data and Create UNLs (mpxdata) to UNLs (pipe-delimited
unload files). MEMPUT writes directly to the database without creating a set of UNL files.
MEMCOMPUTE is a good way to document your progress, but it does not allow any
duplication of the Source and MemIdNum (basically only one record per unique individual).
MEMPUT lets you process multiple records for the same Source and
MemIdNum (such as transactional data or historical information about the record). So, if
you need to import history into the database, MEMPUT is the best way to add that data in
bulk (other than using the API or Message Brokers).
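Because MEMCOMPUTE does not allow more than one record per Source and MemIdNum, it can help to check an extract for repeated keys before choosing a mode. The sketch below rests on assumptions: the pipe-delimited layout with the source code and member ID in the first two fields is hypothetical and is not the mpxdata input format, and the function names are made up for this example.

```python
from collections import Counter
from typing import List, Tuple


def duplicate_keys(lines: List[str]) -> List[Tuple[str, str]]:
    """Return the (source, member id) keys that occur more than once."""
    counts = Counter()
    for line in lines:
        # Assumed layout: source|memidnum|remaining fields...
        source, mem_id = line.split("|")[:2]
        counts[(source, mem_id)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


def suggested_mode(lines: List[str]) -> str:
    """Suggest MEMPUT when rows repeat a key, otherwise MEMCOMPUTE."""
    return "MEMPUT" if duplicate_keys(lines) else "MEMCOMPUTE"


extract = [
    "CRM|1001|SMITH|JOHN",
    "CRM|1002|JONES|MARY",
    "CRM|1001|SMYTH|JOHN",  # historical row for the same member
]
print(duplicate_keys(extract), suggested_mode(extract))
```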
Derive Data from UNLs (mpxfsdvd) re-derives from the UNL files to create new buckets,
comparison strings, and binary files. Derive Data from UNLs (mpxfsdvd) is good when you
have made changes to your algorithm, but you are working with a static set of records (like
a sample extract before go live). You can select the specific elements (buckets, comparison
strings, or binaries) you want to re-derive.
Derive Data from Hub (mpxredvd) looks at the data in the database tables and goes line
by line to create new buckets, comparison strings, and binary files. Derive Data from Hub
(mpxredvd) is good to use when you have consumed inbound broker messages or API
inputs that did not exist in your original data extract. You can select the specific elements
(buckets, comparison strings, or binaries) you want to re-derive.
Prepare Binary Files (mpxprep) is used for incremental bulk cross matches (ixm). When
multiple systems have loaded data, Prepare Binary Files (mpxprep) will rebuild the binary
files since there is no single extract file.
Use the table below to determine which method of deriving data is best.
O = Optional, X = Default

                                            Parse    Assign   Create   Compile
                                            Member   Bucket   Comp     Binary
Process                                     UNLs     Hashes   String   Files
Derive Data and Create UNLs (mpxdata)       X        O        O        O
  (reads data from an Extract File)
Derive Data from UNLs (mpxfsdvd)                     O        O        O
  (reads existing UNL Files)
Derive Data from Hub (mpxredvd)                      O        O        O
  (reads data in the MEM tables)
Prepare Binary Files (mpxprep)                                         X
  (reads data in the MEM tables)
Member Model Transform Graph                X
  (uses Clover ETL)
Comparing members
Member Comparisons are a great way to see how your member records score against one
another. This not only lets you see why two records did or did not join, but also lets you see
how one attribute might be taking too much control in the matching and linking process.
Score distribution
Score Distribution provides the volume of matched pairs by score for potential duplicates
within all sources.
This information is often used with organizations during Threshold Analysis so they can get
a better sense of how many tasks and autolinks could result from a particular threshold
setting.
In the first example on the next page, we can see how the raw results of our Score
Distribution query will appear. Generally, we take those results and put them into a chart
and present them to an organization as part of Threshold Analysis.
In the sample chart, we can see higher volumes of record pairs scoring at 10 and below.
When reviewing this chart with an organization, we would explain that they should use this
information to understand how setting thresholds will affect the numbers of tasks and
autolinks that get generated by the hub. If we set our Clerical Review threshold at 7 and our
Autolink threshold at 11, all of those record pairs between 7 and 11 would result in tasks
that the organization would need to resolve. Everything above 11 would be Autolinked
together.
Score Distribution is not the only data point used for threshold analysis - we would also
review sets of sample matched pairs, and do some statistical analysis to determine where
to set thresholds. But Score Distribution gives us an important piece of information about
how our thresholds will impact the amount of work the organization needs to do once the
hub goes into production.
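The workload implied by a pair of thresholds can be read straight off the score distribution. The sketch below uses the Clerical Review threshold of 7 and the Autolink threshold of 11 from the example above; the distribution itself is made up, the function name workload is an assumption, and the sketch assumes that a pair scoring exactly at a threshold counts toward it, which you should confirm for your own hub.

```python
from typing import Dict, Tuple


def workload(distribution: Dict[int, int], clerical_review: int, autolink: int) -> Tuple[int, int]:
    """Return (task count, autolink count) implied by a score distribution.

    distribution maps a score to the number of matched pairs at that score.
    Assumption: a pair scoring exactly at a threshold counts toward it.
    """
    tasks = sum(n for score, n in distribution.items() if clerical_review <= score < autolink)
    autolinks = sum(n for score, n in distribution.items() if score >= autolink)
    return tasks, autolinks


# Made-up distribution with higher volumes at 10 and below.
sample = {5: 900, 6: 700, 7: 500, 8: 400, 9: 300, 10: 250, 11: 80, 12: 60, 13: 40}
tasks, autolinks = workload(sample, clerical_review=7, autolink=11)
print(f"tasks={tasks} autolinks={autolinks}")
```

Rerunning the function with a different threshold pair shows at once how many more or fewer tasks the organization would have to resolve.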
Note
Values on the x-axis that do not show a bar have at least 1 entity matching the size
specified on the x-axis but not enough members to make the bar visible in the chart.
Member overlap
Member Overlap provides the number of entities that have member records in multiple
sources. Member Overlap can be expressed as a total number of entities and also as a
percentage of the total number of records in each source.
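As a rough illustration of both ways of expressing Member Overlap, the sketch below counts entities whose members come from more than one source, then divides the number of overlapping entities that include a source by the number of records in that source. That reading of the percentage, the (source, member id) representation, and the sample data are all assumptions; the Workbench report may compute the figure differently.

```python
from collections import Counter
from typing import Dict, List, Tuple

Member = Tuple[str, str]  # (source code, member id); hypothetical representation


def member_overlap(entities: Dict[int, List[Member]]) -> Tuple[int, Dict[str, float]]:
    """Return the overlap entity count and a per-source overlap percentage."""
    records_per_source = Counter()
    overlap_per_source = Counter()
    overlap_entities = 0
    for members in entities.values():
        sources = {source for source, _ in members}
        records_per_source.update(source for source, _ in members)
        if len(sources) > 1:  # entity has members in multiple sources
            overlap_entities += 1
            overlap_per_source.update(sources)
    percentages = {
        source: 100.0 * overlap_per_source[source] / total
        for source, total in records_per_source.items()
    }
    return overlap_entities, percentages


sample_entities = {
    1: [("CRM", "1001"), ("BILLING", "77")],
    2: [("CRM", "1002")],
    3: [("CRM", "1003"), ("CRM", "1004")],  # duplicates within one source, not overlap
    4: [("BILLING", "78"), ("CRM", "1005")],
}
count, pct = member_overlap(sample_entities)
print(count, pct)
```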
Exercise
Now it is time to perform Exercise 11, taking approximately 90 minutes.
Unit 14. Managing users, groups, and permissions
Overview
Initiate supports user and group management through the LDAPv3 standard. The Master
Data Service includes a default LDAP repository and a Workbench module for creating and
managing groups and users. You have the choice of using the default LDAP repository, or
you can optionally integrate with a separate enterprise directory server to manage your
users and groups.
Dependencies
You must have Workbench and the Master Data Engine software installed with an LDAP
server enabled.
Topics
This unit will cover:
Initiate's Model for Managing Users, Groups, and Permissions
Sample LDAP Configurations
Default Initiate Groups and Users
Components
Several components are involved in managing users and groups in the Master Data
Service.
Workbench
Within the LDAP perspective in Workbench, you have the ability to connect to your LDAP
server to create and maintain users and groups.
Within the Configuration perspective in Workbench, the Groups tab allows you to
synchronize the Groups in your LDAP server with the hub database, and assign specific
permissions to those Groups.
Data model
Several Initiate database tables play a role in user access and security. While the LDAP
repository manages user accounts and group assignments, specific permissions and audit
trail information are stored in the Initiate database.
mpi_grphead
- The names of default Initiate groups as well as any groups created in an external
LDAP repository are stored in mpi_grphead.
mpi_grpxseg, mpi_grpxixn, mpi_grpxcvw, mpi_grpxapp
- The individual permission settings for each group are stored in these tables. Specific
permissions can be set for read/write access to segments and sources, access to
perform specific interactions, access to composite views, and access to Web
reports.
mpi_usrhead
- Whenever a user logs in to an Initiate application for the first time, their user name
will be stored in the mpi_usrhead table for auditing purposes.
User passwords are not stored in the hub database; they are stored in the LDAP repository.
Note
External LDAP DN settings must be added manually to this property file before
configuring LDAP connections in Workbench. Instructions for configuring the Master Data
Engine to communicate to an external server are provided in the Master Data Engine
Installation Guide. The default external LDAP settings in this file have been optimized for
integrating with Microsoft Active Directory.
Initiate groups
LDAP records are stored hierarchically, similar to DNS or Unix file trees. A record's
Distinguished Name (DN) is read upwards to the top level, or base. Each DN has two parts:
the Common Name (CN) and its location within the directory in which it resides. The location is
determined by the Organizational Unit (OU) and the Domain Component (DC).
Each user will have a Bind DN, which represents the user's name and the group they
belong to. In our systems, it will look like this:
cn=system,ou=System,ou=Users,dc=Initiatesystems,dc=com
In the example above, the user system is part of the System subgroup of the Users group,
which is part of the Initiatesystems.com domain.
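To make the structure of the Bind DN concrete, the sketch below splits the example DN into its components. It is a simplified illustration: the function name parse_dn is an assumption, and the parser ignores escaped commas and multi-valued components, which real DNs may contain.

```python
from typing import List, Tuple


def parse_dn(dn: str) -> List[Tuple[str, str]]:
    """Split a DN into (attribute, value) pairs, most specific first.

    Simplification: assumes no escaped commas or multi-valued components.
    """
    pairs = []
    for component in dn.split(","):
        attribute, value = component.strip().split("=", 1)
        pairs.append((attribute.lower(), value))
    return pairs


dn = "cn=system,ou=System,ou=Users,dc=Initiatesystems,dc=com"
parts = parse_dn(dn)
common_name = parts[0][1]
org_units = [value for attr, value in parts if attr == "ou"]
domain = ".".join(value for attr, value in parts if attr == "dc")
print(common_name, org_units, domain)
```

Reading the pairs from left to right walks from the user up to the base of the directory, which is the order in which the DN is read.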
If an external LDAP server is going to be used, the groups must be defined prior to
implementation and the appropriate DN setting must be added to the ldap.properties file by
hand before making any configurations in Workbench.
Password changes
Users can change their passwords via the Inspector application only when the Master Data
Engine uses the internal Initiate LDAP directory server. If the password is being changed when
the Master Data Engine is configured for an external directory server, the request is not
supported and will generate an error. Externally authenticated users
will need to follow their corporate operating procedures to change passwords.
Administrators
The Administrators group has full access to all interactions, operations, composite views,
and attributes (segments).
The Administrators group is preconfigured to have full access to Workbench to import and
deploy hub configurations, run Analytics reports, set user group permissions and execute
jobs on the hub. Any users you want to have this access must be added to the
Administrators group. To be part of the Administrators group, the user has to be present in
the internal directory server.
The LDAP server comes with a preconfigured user, system (usrrecno = 1), which has
membership in the Administrators group. You will use this login initially to add additional
users and groups. This group cannot be configured in Workbench.
Important
In prior software versions, ALIGNDEX was defined as the system-level user. This has
been removed and replaced by system. Your first steps in user administration should be
to create a new user with Administrators group access, using system as your basis, and
then to delete system for security purposes. Do not delete or rename the Administrators
group.
Default
The Default group has access to interactions USRGETINFO, GRPGETINFO,
USRSETPASS and has read only access to segments USRHEAD, GRPHEAD, GRPXAPP,
GRPXCVW, GRPXIXN, and GRPXSEG.
The Default group is assigned to a user at login if the user is not currently a member of
any other defined group. This group cannot be granted permissions through the
Workbench configuration editor. This group cannot be configured in Workbench.
All interactions
The All Interactions group has access to all interactions.
Will this Initiate master data engine instance use an embedded Initiate LDAP server?
The answer to this prompt will determine if you will be using an embedded or standalone
LDAP server and will present a different set of prompts.
No for standalone
If the embedded option is not chosen, the LDAP server will be an internally or externally
managed standalone server. If organizations use existing LDAP directories this would
typically be the option used.
When selecting standalone servers you will also be prompted:
Enter the Initiate LDAP Server host name:
Enter the Initiate LDAP Server port number:
When running the MADCONFIG script to create an instance, it is best to have the
standalone LDAP servers already created. Use the MADCONFIG create_ldap script to
create an Initiate LDAP server.
Important
A standalone LDAP service will need to be started when the Master Data Engine service
is started.
Will this Initiate LDAP server be clustered with other Initiate LDAP servers?
If you selected to have an embedded LDAP server in your Master Data Engine, you will
also be asked if the server will be part of a cluster. The cluster will contain a standalone
LDAP server used for high availability which will be internally or externally managed.
Answering Yes means there will be a standalone LDAP and you will also be prompted:
Enter the Initiate LDAP Server replication port number:
Enter a cluster peer Initiate LDAP Server host name:
Enter a cluster peer Initiate LDAP Server port number:
Enter a cluster peer Initiate LDAP Server replication port number:
It is best to have the standalone server created using the create_ldap script, but it can be
created later as long as you use the same information you entered for the prompts above.
Exercise
Now it is time to perform Exercise 12, taking approximately 20 minutes.
Unit 15. Configuring and deploying Inspector
Overview
This module will cover deploying and configuring IBM Initiate Inspector.
Dependencies
You need to have a supported database platform and the proper software installation files
for your operating system. The computer images provided will support Inspector.
Topics
This unit will cover:
Introduction to Inspector
The inspector.properties file
Inspector configuration in Workbench
Introduction to Inspector
Inspector is a Web-based, integrated data stewardship and governance application that
enables data stewards to perform three main tasks.
Data resolution
Inspector enables data stewards to understand and resolve data quality issues using a
simple, drag-and-drop interface.
Relationship management
Inspector enables data stewards to view, manage, and modify complex master data
relationships, including hierarchical relationships, using innovative relationship visualization
technology. Inspector lets organizations gain insight into these relationships for purposes
such as identifying top accounts and determining pricing eligibility.
Data management
Inspector lets data stewards manage additions, changes, and deletions to master data.
Inspector enhances Initiate's existing support for transactional implementation styles by
using the IBM Initiate Master Data Service as the master data source.
The large volume of data stored across multiple source systems and the often dynamic
state of that data can present organizations with challenging integrity and profiling issues.
IBM Initiate Master Data Service software and associated applications enable you to
combine, compare, review, and resolve potential data issues. Using the software adds
value to your organization by increasing the integrity of your data and providing a
360-degree view of a record.
Designed to store and manage data from multiple sources, the software and algorithms
(configured specifically for your data and business environment) compare the records
and attributes contained therein to identify potential data issues. Inspector is the end-user
application that enables you to locate data issues, review the records involved, view
relationships between entities, and make appropriate adjustments to correct errors.
Inspector is an integrated data stewardship platform that combines three functions in a
single platform: data resolution, relationship management, and data management. This tool is
based on the premise that understanding relationships in data helps data stewards to manage and
resolve quality issues.
Attribute display
Much like it sounds, this pane configures which Attributes are displayed for each
member in Inspector, and how. You can add and remove Attributes by Member Type and their
corresponding Attribute Types, and configure which fields are displayed.
General preferences
The General Preferences pane also does what its name suggests. Use this pane to
configure the number of search results that are returned, page sizes, date formatting, and
more.
Search forms
The Search Forms pane will configure which Attributes and corresponding fields you will be
able to use as search criteria.
Exercise
Now it is time to perform Exercise 13, taking approximately 30 minutes.
Unit 16. Testing the hub configuration
Overview
The purpose of this unit is to cover testing approaches and to test your hub configuration
using a CloverETL graph designed to perform a MEMPUT operation.
Prerequisites
Once your configuration and data are fully loaded into the IBM Initiate Master Data Service
software, you should test your hub configuration.
Topics
Testing philosophy
Testing your configuration using CloverETL
Testing philosophy
Testing should be performed by the implementers throughout the entire process. The
testing should focus on six categories:
General
Data
Algorithm
Application
Integration
Performance
General tests
Every time you click Save in Workbench, you should check the Problems tab. These
messages provide information on what is not correct in your configuration and will more
than likely cause problems for you down the line. Most messages are context sensitive, so
you can go directly to the location of the problem by double-clicking it.
Workbench also validates your project prior to deploying it to the hub, which will not be
suspended during this time, meaning requests can still be submitted to the engine while the
configuration check is taking place. If your configuration has any errors that will prevent your
hub from working correctly, you should check your problem messages and fix the problems
listed.
Application tests
Implementers should test the interactions between the hub and Initiate applications.
Test data in the applications such as IBM Initiate Inspector or Enterprise Viewer against
the data in the data extract by selecting random records and looking them up in the
application.
Integration tests
Because there are so many ways to integrate into the hub and existing customer software
and services, there are no specific tests that should be run. Implementers should test the
interaction by sending messages to the hub and having Initiate check the log files to
confirm that the expected information is pushed.
Test the engine callouts used in the integration by creating and resolving tasks such as
potential duplicates or overlays.
Performance tests
Test the implementation's performance by performing searches and other tasks to see if it
meets the benchmarks. Initiate has performance testing APIs available that will measure
how efficiently the system is running. Contact your Initiate project manager for the APIs that
best apply to your project.
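Before the Initiate performance testing APIs are available to you, a simple timing harness can give a first impression of whether searches meet a benchmark. The sketch below does not use those APIs: fake_search is a stand-in for whatever search call your project uses, and the function names and the one-second benchmark are assumptions for illustration.

```python
import time
from statistics import mean
from typing import Callable, List


def time_searches(search: Callable[[str], object], queries: List[str]) -> List[float]:
    """Run each query once and return the elapsed seconds per call."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        timings.append(time.perf_counter() - start)
    return timings


def meets_benchmark(timings: List[float], max_average_seconds: float) -> bool:
    """Check the average elapsed time against the agreed benchmark."""
    return mean(timings) <= max_average_seconds


def fake_search(name: str) -> list:
    """Stand-in for a real hub search call; replace with your own client code."""
    return [name.upper()]


timings = time_searches(fake_search, ["smith", "jones", "garcia"])
print(meets_benchmark(timings, max_average_seconds=1.0))
```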
Understanding MEMPUT
A MemPut interaction inserts or updates member data in the hub database. The CloverETL
Initiate MEMPUT component processes and inserts or updates data in the same manner
as the MemPut API interaction. Arguments are supplied to the CloverETL Initiate MEMPUT
component in the form of component parameters.
The CloverETL Initiate MEMPUT component requires input from a CloverETL Reader. The
Reader supplies a connection to the external source of the data, and can also apply
filtering or other criteria to determine which data is passed on to the MemPut component.
CloverETL provides a number of Readers that can read data from databases, text files,
LDAP repositories, and so on.
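The flow described above, in which a Reader filters the data and passes it to a component that inserts or updates members, can be pictured with a small in-memory sketch. This is not CloverETL or the MemPut API: the dictionary standing in for the hub database, the field names source and memidnum, and the functions reader and mem_put are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Record = Dict[str, str]
Key = Tuple[str, str]  # (source code, member id)


def reader(rows: Iterable[Record], keep: Callable[[Record], bool]) -> Iterator[Record]:
    """Stand-in for a Reader: yields only the rows that pass the filter."""
    return (row for row in rows if keep(row))


def mem_put(hub: Dict[Key, Record], record: Record) -> str:
    """Insert the member if its key is new, otherwise update it in place."""
    key = (record["source"], record["memidnum"])
    action = "update" if key in hub else "insert"
    hub[key] = {**hub.get(key, {}), **record}
    return action


hub: Dict[Key, Record] = {}
rows = [
    {"source": "CRM", "memidnum": "1001", "name": "JOHN SMITH"},
    {"source": "CRM", "memidnum": "", "name": "NO ID"},  # filtered out by the reader
    {"source": "CRM", "memidnum": "1001", "name": "JON SMITH"},  # updates the first row
]
actions = [mem_put(hub, row) for row in reader(rows, keep=lambda r: bool(r["memidnum"]))]
print(actions, hub)
```

Keeping the filter in the reader and the insert-or-update decision in mem_put mirrors the division of work between the Reader and the MEMPUT component.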
Exercise
Now it is time to perform Exercise 14, taking approximately 10 minutes.
Overview
This module outlines the design and configuration of the Sample Implementation project for
week two of the Initiate Technical Boot Camp. In this project, you will design an instance
that will track customers for a fictitious company, Capital Aviation.
Dependencies
You need to have completed the first week of Boot Camp and successfully created the
instance outlined in this book.
Topics
This module will cover:
Overview and Objectives
Implementation Goal
Solution Architecture
Initiate Systems Software Components
Use Cases
Initiate Configuration
IBM Initiate Inspector Configuration
Data Extract Guide
Create New DB2 Database and User | Create a new db called capmed; create a new Windows user called capmed and assign to capmed db | 10 min
Create New Hub Instance Using MADCONFIG | Madconfig create_datasource, followed by successful madconfig test_datasource; madconfig create_instance; should result in c:\initiate\projects\capmed directory created; capmed db should have Hub tables created | 15 min
Deploy Workbench Project | Baseline dictionary and algorithm settings deployed to Hub database | 15 min
Tuesday
Review and Clean Customer Data Extract | Clover graph updated and run against data extract; data extract ready to be derived with MPXDATA | 1 hr
Derive Member Data with MPXDATA | Derived data (bin, unl files) | 2 hrs
Load Derived Data | Member UNL files loaded into Hub database | 30 min
Implementation goals
We designed this integration to address these objectives and achieve specific goals:
Discover data quality errors: IBM Initiate Master Data Service identifies potential
duplicates and potential linkages so that you can review and correct them.
Uniquely identify a customer across multiple sources of data: IBM Initiate Master
Data Service identifies and links customer records across your enterprise so you can
have a single view of your customers' profiles available for real-time searching, or for
extraction to a data warehouse.
Solution architecture
The figure below presents a proposed architecture that addresses these goals.
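[Figure: proposed solution architecture. Labeled components: Enterprise Viewer, Operational Reports, Inspector, Master Data Extract, Workbench, Enterprise Integrator Toolkit, Initiate Master Data Service (memPut, memSearch, memGet), DF Adapters, Source System, TRAVEL AGENCY. Notation indicators 1, 2, and 3 mark the parts of the figure described in the table that follows.]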
Notation Indicator | Description | Owner
1 | Capital develops interfaces using Initiate Java SDK that calls the Initiate Put, Search and Get APIs to add, update and retrieve information from the IBM Initiate Master Data Service Hub. | Capital
2 | The Data Hub is configured to establish identity relationships and tuned to meet our matching goals. | Initiate
3 | Data stewards use Inspector for Data Resolution to resolve identities in their workflow queue. | Capital
System | Number of records | Contributes (1) | Consumes (2) | Notes
CAPITAL RESERVATION SYSTEM | ~1,000,000 | | | Contributes customer data to Initiate via real-time messages.
CAPITAL TRAVEL AGENCY | ~27,000 | | | Contributes customer data to Initiate via real-time messages.
1. Is a source of customer records that contributes information to the Master Data Service.
2. Consumes information from Initiate, for example, searches the Master Data Service
using Initiate's APIs, or consumes Enterprise ID change notifications.
Initiate configuration
This section describes the data that you store in the Hub, the algorithm that you use to
compare records, and the thresholds you use to determine whether to automatically link
records or to identify potential duplicate and potential linkage tasks.
Member attributes
The Hub stores customer information as attributes. The table below lists the attributes
you store in the Hub and the settings for each attribute.
Attribute Code | Attribute Name/Label/Description | Member Type | Segment | # Exists (1) | # Active (2) | Review Identifier Checks? (3)
LGLNAME | Customer Name | PERSON | MEMNAME | 5 | 1 |
BIRTHDT | Date of Birth | PERSON | MEMDATE | 1 | 1 |
SEX | Gender | PERSON | MEMATTR | 1 | 1 |
HOMEADDR | Home Address | PERSON | MEMADDR | 10 | 1 |
HOMEPHONE | Home/Evening Phone Number | PERSON | MEMPHONE | 10 | 1 |
SSN | Social Security Number | PERSON | MEMIDENT | 1 | 1 |
CUSTID | Customer Account Number | PERSON | MEMATTR | 1 | 1 |
1. Indicates the number of versions of each attribute that the Hub will keep. For example,
the Hub stores up to 10 addresses per customer, but only one SSN per customer.
2. Indicates the number of active versions of each attribute that the Hub will keep. For
example, the Hub stores up to 10 addresses per customer, but only one address (the
most recently provided address) will be active.
3. Indicates whether the Hub should create a task if two different records share the same
value. For example, the Hub creates a task if two different records share the same
SSN.
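The three footnotes above can be illustrated with a short Python sketch. It assumes a simplified in-memory model rather than the Hub's actual storage: the `MAX_EXISTS` and `MAX_ACTIVE` dictionaries and the `add_value` and `review_identifier_tasks` functions are hypothetical names.

```python
from collections import defaultdict

# Hypothetical settings mirroring the "# Exists" and "# Active" columns.
MAX_EXISTS = {"HOMEADDR": 10, "SSN": 1}
MAX_ACTIVE = {"HOMEADDR": 1, "SSN": 1}

def add_value(history, attr, value):
    """Store a new value, keep at most MAX_EXISTS versions, return the active ones."""
    versions = history[attr]
    versions.append(value)
    del versions[:-MAX_EXISTS[attr]]       # drop the oldest versions beyond the limit
    return versions[-MAX_ACTIVE[attr]:]    # the most recently provided values are active

def review_identifier_tasks(members, attr="SSN"):
    """Return groups of member keys whose records share the same identifier value."""
    by_value = defaultdict(list)
    for key, attrs in members.items():
        if attrs.get(attr):
            by_value[attrs[attr]].append(key)
    return [keys for keys in by_value.values() if len(keys) > 1]

history = defaultdict(list)
for i in range(12):
    active = add_value(history, "HOMEADDR", f"{i} Main St")
print(len(history["HOMEADDR"]), active)

members = {
    ("RES", "1"): {"SSN": "999-99-9999"},
    ("RES", "2"): {"SSN": "999-99-9999"},
    ("AGY", "7"): {"SSN": "123-45-6789"},
}
shared = review_identifier_tasks(members)
print(shared)
```

After twelve addresses are supplied, only the ten most recent remain and only the latest is active; the two records that share an SSN are grouped for review.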
Algorithm
Algorithms specify how the Hub compares records to determine whether multiple records
represent the same customer. Your algorithm compares customer records using these data
elements:
Customer Name
Date of Birth
Gender
Street address (a component of the Home Address attribute)
Zip code (a component of the Home Address attribute)
Phone
Social Security Number
Customer Account Number
Thresholds
While the algorithm determines how to compare and score records, thresholds interpret
those scores to determine whether to automatically link records or to mark them as
potential duplicates that should be manually reviewed. Your Hub implementation uses the
following thresholds:
Source | Clerical Review Threshold | Auto-link Threshold
CAPITAL RESERVATION SYSTEM | 7.0 | 7.0
CAPITAL TRAVEL AGENCY | 7.0 | 7.0
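As a rough sketch of how thresholds turn a score into a decision, the Python below applies the two values from the table above. The `classify` helper and its return labels are hypothetical, and treating the thresholds as inclusive is an assumption, not documented Hub behavior.

```python
# Thresholds from the table above: (clerical review, auto-link) per source.
THRESHOLDS = {
    "CAPITAL RESERVATION SYSTEM": (7.0, 7.0),
    "CAPITAL TRAVEL AGENCY": (7.0, 7.0),
}

def classify(score, source):
    """Interpret a comparison score for a pair of records (hypothetical helper)."""
    clerical_review, auto_link = THRESHOLDS[source]
    if score >= auto_link:           # assumption: thresholds are inclusive
        return "auto-link"
    if score >= clerical_review:
        return "manual review"
    return "no action"

print(classify(9.2, "CAPITAL TRAVEL AGENCY"))
print(classify(5.5, "CAPITAL TRAVEL AGENCY"))
```

Because both thresholds are set to 7.0 here, no score can fall between them in this sketch; the "manual review" branch only comes into play when the clerical review threshold is lower than the auto-link threshold.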
Inspector configuration
The Initiate Systems Implementation Project Team configures Inspector to display the
fields that will help you quickly review and resolve tasks. The following table details the
Inspector configuration.
Attribute | Searchable? (1) | Search Results? (2) | Task Search Results? (3)
LGLNAME
BIRTHDT
SEX
HOMEADDR
HOMEPHON
SSN
CUSTID
1. Indicates whether the attribute will be part of the member search dialog box.
2. Indicates whether the attribute will appear in the Entity Search Results section of
Inspector.
3. Indicates whether the attribute will appear in the Task Search Results section of
Inspector.
As you review and resolve tasks in Inspector, you assign workflow statuses to indicate your
progress or final decisions. The following workflow statuses will be available in your
Inspector implementation:
Workflow Status | Description | Action
Unexamined | Initial task status | Unexamined: The task is awaiting review and resolution by an end-user.
Not Same Person | The members in the task do not represent the same person. | Resolved: the Hub removes the task from the work queue and creates a non-identity rule.
Same Person | The members in the task represent the same person. The end-user must also assign a common Enterprise ID to the resolved records. | Resolved: the Hub removes the task from the work queue and creates an identity rule.
Not Enough Information | The end-user cannot determine at this time whether the records are, in fact, duplicates. | Deferred: The task remains until an end-user resolves it.
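The workflow statuses above amount to a small lookup from status to outcome. The Python sketch below models that lookup; the `WORKFLOW` dictionary, the `assign_status` function, and the list used as a work queue are hypothetical and only mirror the table, not Inspector itself.

```python
# Workflow statuses from the table above: status -> (task state, rule created).
WORKFLOW = {
    "Unexamined": ("Unexamined", None),
    "Not Same Person": ("Resolved", "non-identity rule"),
    "Same Person": ("Resolved", "identity rule"),
    "Not Enough Information": ("Deferred", None),
}

def assign_status(work_queue, task_id, status):
    """Apply a workflow status to a task and return (state, rule to create)."""
    state, rule = WORKFLOW[status]
    if state == "Resolved":
        work_queue.remove(task_id)   # resolved tasks leave the work queue
    return state, rule

queue = ["task-1", "task-2"]
print(assign_status(queue, "task-1", "Same Person"))
print(assign_status(queue, "task-2", "Not Enough Information"))
print(queue)
```

Only the two "Resolved" outcomes remove a task from the queue; a deferred task stays there until someone resolves it.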
Data description
For each section below, please describe your data by providing responses to the questions
and provide any additional information that you believe might be helpful in understanding
your data environment and processes.
The IBM Initiate Master Data Service manages data from these sources:
Data Source | Description | Record Count
CAPITAL RESERVATION SYSTEM | Reservation system for Capital Aviation | ~1,000,000
CAPITAL TRAVEL AGENCY | Customer system for Capital Travel | ~27,000
File formats
Please provide data extract files as pipe-delimited ASCII files (alternatively, you can
arrange to provide fixed-width files). Each record should be CRLF terminated. The extracts
should conform to the formats outlined below (required fields are in bold):
Question | Response
What is your primary identifier (for example, MRN, Corporate ID, Account Number)? | Records in the extract are uniquely identified by the combination of Source and Source ID.
Does the primary ID have a meaningful prefix, suffix, or any other characteristic within the identifier? | No / Yes, please describe:
How is the primary ID assigned? | Sequentially / Algorithmically, describe:
Can you extract one record per primary identifier? | Yes
Do you assign a secondary identifier? | No
Do you have multiple sources (i) sharing the same identifier or (ii) assigning a unique identifier by source? | Only one source. / Multiple sources sharing a pool of identifiers. / Multiple, using unique pools of identifiers.
Do you have records without a primary identifier? | No
Will the source file contain non-surviving, merged records? That is, can you avoid sending obsolete records? | File will not contain obsolete records. / File might contain obsolete records.
Sample record
Records in the data extract should be formatted as follows:
Source|Source ID|Last Name|First Name|Middle Name|Birth Date|Area Code|Phone Number|SSN|Gender|Address Line 1|Address Line 2|City|State|Zip Code|Customer Account Number
1|1|Kennedy|John|F|1917-05-29|202|4561414|999-99-9999|F|1600 Pennsylvania Ave||Washington|DC|20171|I-74036598
Assumptions
When preparing your data extract, please ensure the following:
Source Identifier is unique within the source
Files are pipe delimited
All fields are Text data type
Data file fields are alphanumeric characters and left-justified
All dates are in a CCYYMMDD format
Max Length indicates the maximum allowable length for data in the field (excess
characters are dropped)
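To check an extract against the layout above before submitting it, a small script can split each line on the pipe character and confirm the field count. The following Python is a minimal sketch under that assumption; `FIELDS` copies the header shown in the sample record, and `parse_record` is a hypothetical helper, not part of any Initiate tool.

```python
# Field names copied from the sample record layout, in order.
FIELDS = [
    "Source", "Source ID", "Last Name", "First Name", "Middle Name",
    "Birth Date", "Area Code", "Phone Number", "SSN", "Gender",
    "Address Line 1", "Address Line 2", "City", "State", "Zip Code",
    "Customer Account Number",
]

def parse_record(line):
    """Split one pipe-delimited, CRLF-terminated line into a field dictionary."""
    values = line.rstrip("\r\n").split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, found {len(values)}")
    # All fields are treated as text; surrounding padding is removed.
    return dict(zip(FIELDS, (value.strip() for value in values)))

sample = ("1|1|Kennedy|John|F|1917-05-29|202|4561414|999-99-9999|F|"
          "1600 Pennsylvania Ave||Washington|DC|20171|I-74036598\r\n")
record = parse_record(sample)
print(record["Last Name"], record["Customer Account Number"])
```

An empty field, such as Address Line 2 in the sample, is kept as an empty string so that the remaining fields stay aligned with their names.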
Security
Initiate adheres to strict confidentiality standards. We take this responsibility very seriously
and enforce regulatory standards relating to the distribution, disclosure and retention of
personal data. Unless otherwise instructed, Initiate destroys client media in accordance
with an agreed-upon timeframe.
Copyright IBM Corp. 2010, 2011 Appendix B. Working with relationships B-1
Relationship scenarios
There are two basic types of relationships, hierarchical and peer-to-peer. In hierarchical
relationships there is a parent record and then a sub-set of child records that belong to the
parent. In non-hierarchical relationships there are a number of records that could be used
as the parent depending on the organization and their needs.
Peer-to-peer relationships
Peer-to-peer relationships are created by two entities that share enough common
information to be linked to each other.
These peer-to-peer relationships are also called asymmetrical because they do not
necessarily have a hierarchical or parent/child connection.
One-to-one
One-to-many
Many-to-many
The graphic below shows some of the relationships that exist in the health care field. The
group practice works with individual practitioners, holds the records for the practitioners'
patients, and maintains relationships with HMOs and health insurance providers. Each
practitioner maintains relationships with patients and with the hospitals where they have
privileges. The patient has a relationship with the doctor, the practice, the health insurance
provider, and so on.
As you can see, there are many different relationships which could possibly exist here, and
depending on your perspective, the hierarchical structure would be different.
Hierarchical relationships
Hierarchical relationships have parent records and child records. Think about the
organization of your company. At the top is the President or CEO. Below the CEO are direct
reports like the CFO and COO. The next level down are the executive board's direct
reports like VPs and Directors. The organizational chart branches out showing their direct
reports like group managers all the way down to entry level positions.
A corporation that has subsidiaries also has a hierarchical structure. Below we see an
example of the corporate giant Really Big Corporation. RBC makes everything from light
bulbs to locomotives, and has many different subsidiaries, or children, to the parent
company.
Relationship sources
The Linker uses reference data that comes either from inside an organization, called
inter-source data, or from an outside source, called intra-source data.
Inter-source
Inter-source uses established relationships in an organization's internal data. The data
model will generally include a combination of IDs to map out the relationships within the
organization. For example:
A commercial organization might have the following identifiers:
- employee_id
- department_id
- business_group_id
A health care provider might have the following identifiers:
- patient_id
- physician_id
- hospital_id
Intra-source
Intra-source uses globally accepted identifiers within the data. The data model would
contain identifiers that are commonly used by many organizations. For example:
SSN - Social Security Number
NPI - National Provider Identifier
VIN - Vehicle Identification Number
TIN - Taxpayer Identification Number
EIN - Employer Identification Number
Trusted sources
A trusted source can be designated when a source is known to have the most accurate
information. A trusted source is not required for relationships, but designating one is
considered a best practice when creating relationships. Designating a trusted source can
help avoid circular relationships in hierarchical implementations.
Relationship tasks
When the Linker finds a relationship that falls outside the constraints of the rules created
within Workbench, a task is created. Common relationship tasks are:
Relationship Multiplicity Task - An entity is involved in too many or too few
relationships according to the multiplicity constraints. For example, a patient is assigned
to more than one primary care physician despite having a one-to-one multiplicity
setting.
Missing Relationship Task - An entity does not have a required relationship type. For
example, a patient is not assigned to a physician despite having a parent required
setting.
Invalid Reference Task - An entity should have a relationship but it does not. For
example, a patient is assigned to a physician who no longer belongs to that practice
leaving the patient without a parent.
Relationship Creation Task - This task type is triggered when the master data engine
does not agree with a relationship that was manually created. An example would be
when a company is said to be owned by another, but the data underneath does not
support that setting.
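The first two task types above can be sketched as a simple constraint check. The Python below assumes a one-to-one multiplicity and a parent required setting for the patient-to-physician example; `find_relationship_tasks` and its parameters are hypothetical names, not the Linker's implementation.

```python
def find_relationship_tasks(patients, assignments, max_parents=1, parent_required=True):
    """Return (patient, task type) pairs for patients that break the constraints."""
    tasks = []
    for patient in patients:
        physicians = assignments.get(patient, [])
        if len(physicians) > max_parents:
            # Too many relationships for the multiplicity setting.
            tasks.append((patient, "Relationship Multiplicity Task"))
        elif parent_required and not physicians:
            # A required relationship type is missing.
            tasks.append((patient, "Missing Relationship Task"))
    return tasks

patients = ["P1", "P2", "P3"]
assignments = {"P1": ["Dr. A"], "P2": ["Dr. A", "Dr. B"]}   # P3 has no physician
tasks = find_relationship_tasks(patients, assignments)
print(tasks)
```

Patient P1 satisfies both constraints, P2 has more than one primary care physician, and P3 has none.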
Data model
To accommodate relationship linking, the data model adds new tables and adds fields to
existing tables.
Added tables:
mpi_relrule to store rules that govern relationship creation
mpi_relxtsk to store relationship information
mpi_relsegattr to support relationship segment-to-relationship attribute definitions
Added fields to existing tables:
rellinkno, relseqno, rulerecno, and relflag to mpi_rellink
relusrflag, relengflag, requiredleft, requiredright and requiredhierarchy to mpi_reltype
tskkind and tsktype to mpi_tsktype
Workbench
The relationship rules the Relationship Linker uses to create and delete relationships are
maintained on the Relationship Type pane in Workbench. The rules are set up by creating:
Relationship Types: Gives the description, direction, multiplicity, and entities, and
defines the connection between the parties in the relationship.
Relationship Type Attributes: Defines attributes used strictly for relationships.
Relationship Rules: Defines how the relationships are created.
Relationship rules are based on making connections between an entity on the left with an
entity on the right. In other words, the record of the left entity should equal the record of the
right entity.
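One way to picture a left-equals-right rule is as a join between two attributes. The Python sketch below links each entity whose left attribute value equals another entity's right attribute value. The `link_by_rule` function and the `supervisor_id` and `employee_id` attribute names are assumptions chosen for illustration; this is not how the Relationship Linker is implemented.

```python
def link_by_rule(entities, left_attr, right_attr):
    """Return (left entity, right entity) pairs where left_attr equals right_attr."""
    # Index the right-hand side so each left value is looked up only once.
    index = {}
    for entity_id, attrs in entities.items():
        value = attrs.get(right_attr)
        if value is not None:
            index.setdefault(value, []).append(entity_id)
    links = []
    for entity_id, attrs in entities.items():
        for match in index.get(attrs.get(left_attr), []):
            if match != entity_id:          # an entity is not linked to itself
                links.append((entity_id, match))
    return links

entities = {
    "E1": {"employee_id": "100", "supervisor_id": None},
    "E2": {"employee_id": "200", "supervisor_id": "100"},
    "E3": {"employee_id": "300", "supervisor_id": "100"},
}
links = link_by_rule(entities, left_attr="supervisor_id", right_attr="employee_id")
print(links)
```

Here E2 and E3 are each linked to E1 because their supervisor value equals E1's employee value, while E1 has no supervisor value and is not linked upward.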
Workbench also has a Relationship Linker job that you can run on an ad hoc basis to create
relationships based on the rules you have defined. This job can be used for the initial
data load or for very large batch loads. It works similarly to the bulk-cross-match job:
if your database already contains relationships, it runs a re-dvd, which replaces the
relationship data along with the bucket and comparison data.
Inspector
Records and relationships are managed in Inspector. Entities can be added and removed
from a hierarchy, or dragged and dropped to a different location within the hierarchy. As
information in the records changes, the relationships automatically change in real time
using the Relationship Linker in the MDE.
When tasks are created, they are resolved by correcting data or, in some cases, by
overriding relationship rules in Inspector.
Initial load
When a project is being implemented with relationships, it follows similar steps to an
implementation without relationships.
1. Derive the Data with the Generate Query BXM option enabled.
2. Generate weights.
3. Run MPXCOMP which uses the source tables to determine how to compare sources.
4. Run MPXLINK using the same output folder for BXM data as the derived data.
5. If processing Cross Entities, run the MPXFSDVD job to reconcatenate all the member
types.
6. Run the Relationship Linker job in Workbench.
7. Load the data into the hub.
Maintenance
After the initial load, additions and changes are added to the hub and the relationships are
processed in real time.
1. Obtain the Source Extracts from the customer and outside source (Dun & Bradstreet,
for example).
2. Standardize and run a MEMPUT job in CloverETL.
3. Entity Management and Relationship Linker run.
Setting up relationships
Relationships are managed in Workbench in the Relationship Type pane. But first you need
to plan how your relationship will be created by defining your rules based on the
information the customer needs. For this exercise we are going to build upon the Boot
Camp data by adding another attribute named Supervisor ID.
To add this new data to the data set, we will run a MEMPUT job that appends data to the hub
so that we can create sample relationships.
The Supervisor ID will be the attribute the Relationship Linker uses to create connections
between entities. Follow the procedures below to implement relationships with our Boot
Camp data.
__ 2. Select Person in the Member Type drop-down list if it is not already selected.
__ 3. Click the Add button to the right of the fields.
__ 4. Enter the following information for the Supervisor ID attribute.
Viewing relationships
Now that we have added our relationship data, we can view it in Inspector.
__ 1. Search for Perry Brose and view the results by clicking the Inspect icon.
__ 2. Click the Entities tab to view the record's relationships.
Source
A source is a file or system where your data comes from. Sources can be flat files,
databases, web services, or other tools that store information.
Member
A member is simply a record from a source. The term member comes from the concept
that a member record belongs to an entity, like a person belongs to a club or organization.
Attribute
An attribute is a piece of demographic information about a member. These attributes are
specific, like a Home Address or a Work Address.
Entity
An entity is a unique person or organization. Multiple members might hold data about one
entity. Entities are represented by an Enterprise Identifier (EID), which is assigned by the
hub.
Algorithm
An algorithm is a series of computational processes that analyze member records. The
algorithm has three main processes: Standardize, Bucket, and Compare.
Standardization
Standardization, the first step in the algorithm, reformats data into consistent chunks. For
example, phone numbers are boiled down to the last 7 digits: 1 (312) 832-1231 =>
8321231.
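The phone number example above can be reproduced with a few lines of Python. This is a minimal sketch of the idea only; `standardize_phone` is a hypothetical function, and the hub's own standardization routines are not shown here.

```python
import re

def standardize_phone(raw):
    """Strip everything except digits and keep the last 7 digits."""
    digits = re.sub(r"\D", "", raw)
    return digits[-7:]

print(standardize_phone("1 (312) 832-1231"))  # 8321231
```

Differently formatted versions of the same number, such as `312.832.1231` and `832-1231`, reduce to the same seven digits and can then be compared directly.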
Bucket
A bucket is a means of organizing members, based on data they share in common, so they
can be found more quickly. Is it easier to find a needle in a haystack or in a jar labeled
"needles"?
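As an illustration of bucketing, the Python sketch below groups members by a shared value so that candidates for comparison can be found with a single lookup instead of a scan of every member. The `build_buckets` function, the member keys, and the choice of a standardized phone number as the bucket key are assumptions made for this example.

```python
from collections import defaultdict

def build_buckets(members, key_func):
    """Group member IDs by the bucket key computed from their attributes."""
    buckets = defaultdict(list)
    for member_id, attrs in members.items():
        buckets[key_func(attrs)].append(member_id)
    return buckets

members = {
    "RES:1": {"phone": "8321231"},
    "AGY:9": {"phone": "8321231"},
    "RES:2": {"phone": "4561414"},
}
buckets = build_buckets(members, lambda attrs: attrs["phone"])

# Candidates for RES:1 are the other members in the same bucket.
candidates = [m for m in buckets[members["RES:1"]["phone"]] if m != "RES:1"]
print(candidates)
```

Only AGY:9 shares a bucket with RES:1, so it is the only member that needs to be compared against it.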
Comparison function
A comparison function is a means for finding similarity or difference between two attributes.
Comparison types include: Exact Match, Starts With, Edit Distance, Phonetics, and
Equivalency.
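Three of the comparison types named above can be sketched in Python as follows. These are textbook versions written for illustration (edit distance here is the standard Levenshtein distance); they are not the hub's comparison functions, and the function names are hypothetical.

```python
def exact_match(a, b):
    """True when the two values are identical."""
    return a == b

def starts_with(a, b):
    """True when one value begins with the other, for example an initial and a full name."""
    return a.startswith(b) or b.startswith(a)

def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

print(exact_match("KENNEDY", "KENNEDY"))
print(starts_with("J", "JOHN"))
print(edit_distance("KENNEDY", "KENEDY"))
```

A small edit distance, such as the single dropped letter in the last call, suggests a likely typing error rather than a different name.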
Weight
A weight is a number that represents the hub's confidence that a single value is a good
identifier. Rarer values have higher weights, while more common values have lower
weights.
Comparison score
A comparison score is the aggregate of individual attribute weight scores when two
members are compared. Each set of attributes has a score (positive or negative) that adds
to the overall score.
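The aggregation described above can be sketched as a sum of per-attribute contributions. In the Python below, the weight values, the `comparison_score` function, and the decision to skip attributes with missing data are all assumptions made for illustration; real weights are generated from your data.

```python
def comparison_score(a, b, weights):
    """Sum a positive weight for each agreeing attribute and a negative one for each disagreement."""
    score = 0.0
    for attr, (agree, disagree) in weights.items():
        if a.get(attr) is None or b.get(attr) is None:
            continue  # assumption: missing data contributes nothing
        score += agree if a[attr] == b[attr] else disagree
    return score

# Hypothetical (agreement, disagreement) weights per attribute.
weights = {"SSN": (5.0, -3.0), "BIRTHDT": (3.0, -2.0), "SEX": (0.5, -1.0)}
a = {"SSN": "999-99-9999", "BIRTHDT": "19170529", "SEX": "M"}
b = {"SSN": "999-99-9999", "BIRTHDT": "19170529", "SEX": "F"}
score = comparison_score(a, b, weights)
print(score)
```

The matching SSN and birth date add to the score and the mismatched gender subtracts from it; the total is what the thresholds then interpret.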
Threshold
A threshold is a number on a scale that is used to make a decision. There are two main
thresholds in the hub: Clerical Review (CR) and Auto Link (AL).
Tasks
A task is simply something that a human being has to do. Typically this involves making a
decision about joining two records that fall between the CR and AL thresholds. There are
four main tasks:
Potential overlay
A record received an update with information that is radically different from the data that
was already there. This task is considered the most urgent to resolve.
Potential duplicate
Two records are in the same source and appear to represent the same person or
organization.
Review identifier
Two records from the same source seem to be using the same identifier (like SSN or
Passport Number).
Relationships
When two entities have something in common, they have a relationship. That relationship
might be part of a hierarchy, such as a boss and a subordinate, or part of a group, such
as patients of a medical practice.