
V5.4.0.3
FOR USE WITH COURSE 3Z100 ONLY

Front cover

Initiate Technical Boot Camp

(Course code 3Z100)

Student Notebook
ERC 1.3

Trademarks
IBM and the IBM logo are registered trademarks of International Business Machines
Corporation.
The following are trademarks of International Business Machines Corporation, registered in
many jurisdictions worldwide:
AIX, DB2, InfoSphere, Initiate, Initiate Master Data Service, Initiate Systems,
RDN, WebSphere
Intel and Pentium are trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
both.
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of
Oracle and/or its affiliates.
VMware and the VMware "boxes" logo and design, Virtual SMP and VMotion are registered
trademarks or trademarks (the "Marks") of VMware, Inc. in the United States and/or other
jurisdictions.
Other product and service names might be trademarks of IBM or other companies.

April 2011 edition


The information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis
without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a
customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational
environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the
same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so
at their own risk.

© Copyright International Business Machines Corporation 2010, 2011.


This document may not be reproduced in whole or in part without the prior written permission of IBM.
Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions
set forth in GSA ADP Schedule Contract with IBM Corp.

Contents
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Course description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

Agenda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii

Unit 1. Introduction to the Boot Camp project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
Core concepts and terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
What is MDM? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
What is an EMPI? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
What is CDI? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
What is a hub? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
What is a registry-style hub? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
What is a transactional-style hub? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
What is a hybrid-style hub? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
What is deterministic matching? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
What is probabilistic matching? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
Solutions versus tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
Implementing IBM Initiate Master Data Service software . . . . . . . . . . . . . . . . . . . . 1-6
The platform implementation process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-7
1 - Review the customer requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-7
2 - Configure the Initiate member data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8
3 - Configure the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-9
4 - Clean the data extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-10
5 - Deploy the instance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-11
6 - Derive data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-13
7 - Generate weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-14
8 - Perform bulk cross match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-15
9 - Analyze and review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-16
10 - Reiterate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-17
11 - Test the hub configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-18
General rules of implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-19
Common process dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-19
Boot Camp best practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-20
The implementation approach/summary document . . . . . . . . . . . . . . . . . . . . . . . . . 1-21
Application overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-22
High-level requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-22
Solution architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-22
Source data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-24
Client dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-25
Algorithm configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-29

Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-29
Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-31
Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-36
Performance targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-37
Data Extract Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-38
Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-38
File format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-39
Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-39
Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-40
Sample data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-40
Preventing extract errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-40
Online data transmission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-42
Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-42
Inbound message requirements appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-43
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-43
Components overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-43
Configuration details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-44

Unit 2. Workbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-1
Workbench overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-2
Basic functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-2
Workbench general navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3
Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3

Unit 3. The Initiate member model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-1
Creating the data dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-2
Creating a new Initiate default project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
Adding a new member type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-4
Adding attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-4
Adding an entity type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-5
Adding composite views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-6
Adding sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7
Adding informational sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-8
Adding strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-8
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-8

Unit 4. Configuring the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-1
Algorithms: The secret sauce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-2

How do algorithms relate to members and entities? . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
What can standardization do? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
What does standardized data look like? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Where are standardization functions stored? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Standardization functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
Abstract code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
Date . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-7
Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
Phone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9
Phone examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9
Postal code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10
Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-11
Address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-12
Email address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
Biometric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
Geocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
Identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
Multiple attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-15
Passthrough . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-15
Working with buckets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
What is bucketing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
Organizing member records into buckets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
How do buckets impact searching? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-17
How do you locate the values associated with a hash? . . . . . . . . . . . . . . . . . . . . 4-17
Reviewing bucket hash values and the underlying real values . . . . . . . . . . . . . . . 4-18
Designing multi-token buckets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-19
How are buckets formed? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-19
Where are bucketing functions stored? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-20
Bucketing functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-21
Available bucket functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-21
What is the role of bucketing generation types? . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
Why use bucketing generation types? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
What bucket generation types are available? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23
Comparing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-24
What is the best way to compare data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-24
Comparison functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-25
String pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-25
Address & phone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-26
Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-27
Date . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-28
Edit distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-29
Equivalency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-32
False positive filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-33
Geocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-34
Height & weight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-34
US zipcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-34
New comparison functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-35

Tuning search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-35


Query roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-36
Reading the flow of an algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-37
Configuring an algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-38
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-38

Unit 5. Cleaning the data extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-1
Data extracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-2
Data extract formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-2
Reviewing extracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-3
Sampling a large extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-3
Looking for trends in the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-3
Check the whole file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-4
Checking a sample of the extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-4
Cleaning the data extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-4
Clover ETL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
What is ETL? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
Clover graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
Understanding the clover components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-6
Clover status symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-7
Creating a cleanup graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-8
Creating a filtering expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-9
Adding the second phase to your graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-11
Laying out the second phase components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-11
Configuring the second phase ext filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-12
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13

Unit 6. Deploying the instance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-1
Deploying the instance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-2
Creating a hub instance using MADCONFIG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-2
Creating a data source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-2
Creating a new hub instance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-2
Starting the IBM Initiate Master Data Service engine . . . . . . . . . . . . . . . . . . . . . . . .6-2
Deploying hub configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-3
Registering a hub to a project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-3
Deploying the configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-3
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-3

Unit 7. Overview of the Initiate data model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-1

What is the Initiate member model? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Features of the data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Extensibility is at the core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
Common naming conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
Core database tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
Audit tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
Member tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7
Entity and relationship tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-9
The Data Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
The dictionary tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
Algorithm tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
Base configuration tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Metadata tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
Runtime configuration tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
String handling tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
Weight definition tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16

Unit 8. Deriving data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
What is derived data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
Derived data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
When is data derived? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
The derivation process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
Data derivation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
How does derived data impact the database? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
Derive data and create UNLs (mpxdata) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
Data model exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
The configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
What does a config file look like? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Designing a configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
Configuration file arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
Position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Special considerations for configuration files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Working with identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Working with compound attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Working with attributes that have many valid forms . . . . . . . . . . . . . . . . . . . . . . . 8-14
Where should my configuration file be stored? . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Exercise: Creating a configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
Attribute codes and segment fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
Data extract excerpt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16

Configuration template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-16


Deriving data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-16
Data analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-17
Why are data analytics produced? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-17
How do you produce them? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-17
Workbench analytics perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-18
Attribute completeness by source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-19
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-19

Unit 9. Generating weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-1
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-2
The importance of weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-2
What makes you think these records match? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-2
What about Bob? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-3
What are the odds? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-3
Thinking more about weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-3
Identifying the weight tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-4
The 80/20 weight rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-4
When should you recalculate weights? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-5
The magical weight formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-6
Calculating weights with matched pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-6
What are unmatched pairs? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-6
Basic weight calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-7
Running weight generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-8
Weight generation overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-9
Troubleshooting your weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-10
Good weight distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-11
Bad weight distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-12
Hand editing AXP weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-13
Reading multi-dimensional weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-14
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-15

Unit 10. Running a bulk cross match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-1
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-2
Bulk cross match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-2
How are records grouped? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-2
Enterprise ID (EID) assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-3
BXM steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-3
Step 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-4
Step 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-4
Step 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-5
How does the BXM impact the database? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-5

Unit 11. Analyzing thresholds and matched pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1
Threshold overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
Importance of thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
Possible matching errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Threshold analysis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-4
Objectives of threshold analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-4
Conducting a pre-threshold analysis internal review . . . . . . . . . . . . . . . . . . . . . . . 11-4
Conducting the internal review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-4
Looking for false positives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-5
Resolving false positives with the false positive filter . . . . . . . . . . . . . . . . . . . . . . 11-5
Conducting threshold analysis with organizations . . . . . . . . . . . . . . . . . . . . . . . . . 11-5
Reviewing the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-6
How long does it take to review matched pairs? . . . . . . . . . . . . . . . . . . . . . . . . . . 11-6
Initiate pair manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-7
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-7

Unit 12. Analyzing buckets and frequency based bucketing . . . . . . . . . . . . . . . . . 12-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1
Analyzing buckets overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Data set size ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Choosing attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
How to bucket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
Generating a bucket analysis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
Members not in a bucket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
Large buckets: 2,000+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
How to look up the actual value of the bucket . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-5
Visualizing bucket size distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-6
Seeing member bucket frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-8
Viewing a member's bucket values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Member comparison distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Frequency based bucketing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-11
When and why is FBB used? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-11
How is FBB implemented? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12

Unit 13. Reiterating the process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
Reiterating after configuration changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
Redeploying the configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-3
Re-deriving data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-3
Running entity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-4

Viewing entity size distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-4


Viewing entity composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-5
Comparing members . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-5
Member comparison output (MCC codes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-5
Score distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-6
Member overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-8
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-9

Unit 14. Managing users, groups, and permissions . . . . . . . . . . . . . . . . . . . . . . . . .14-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-1
Managing groups and users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-2
Working with LDAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-3
Typical server configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-3
Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-4
The ldap.properties file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-5
Initiate groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-6
Using MADCONFIG to create an instance with LDAP . . . . . . . . . . . . . . . . . . . . . . .14-10
Will this Initiate master data engine instance use an embedded Initiate LDAP
server? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-10
Will this Initiate LDAP server be clustered with other Initiate LDAP servers? . . .14-11
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-11

Unit 15. Configuring and deploying Inspector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-1
Introduction to Inspector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-2
The inspector.properties file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-3
Inspector configuration in Workbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-4
Attribute display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-4
Custom task summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-5
General preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-5
Member and entity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-6
Search forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-6
Search results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-7
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-7

Unit 16. Testing the hub configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-1
Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-1
Testing philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-2
General tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-2
Data validity tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-2
Algorithm tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-3
Application tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-3

Integration tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-3
Performance tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-3
Testing your configuration using CloverETL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-4
Understanding MEMPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-4
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-4

Appendix A. Sample implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
Schedule for the sample project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
Overview and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
Implementation goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
Solution architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
Initiate Systems software components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
Contributing and consuming systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
Use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
Creating a new customer record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
Updating customer records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
Resolving potential duplicate tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-9
Resolving potential linkage tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-10
Resolving review identifier tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-11
Initiate configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-12
Member attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-12
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-13
Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-13
Inspector configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-14
Data extract guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-15
Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-15
File formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-16
File formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-17
Sample record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-17
Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-18
Verifying data extracts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-18
Data transmission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-19
Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-19
Resolving data extract questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-19

Appendix B. Working with relationships. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1


What are relationships? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
Why are relationships important? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
How does understanding relationships help? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
Relationship scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
Peer-to-peer relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
One-to-one . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
One-to-many . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3
Many-to-many . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3
Hierarchical relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-4

How does relationship management work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-5


Relationship sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-6
Relationship rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-7
Relationship tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-8
Relationship components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-9
Data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-9
Master data engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-9
Workbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-10
Inspector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-10
The relationship linking process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11
Initial load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11
Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11
Setting up relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-12
Adding the relationship attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-12
Creating relationship rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-13
Adding relationship data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14
Viewing relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14

Appendix C. Glossary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1


Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Member . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Attribute type (segment) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Entity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Bucket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Comparison function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Weight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Comparison score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Potential overlay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Potential duplicate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Potential linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-3
Review identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-3
Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-3


Trademarks
The reader should recognize that the following terms, which appear in the content of this
training document, are official trademarks of IBM or other companies:
IBM and the IBM logo are registered trademarks of International Business Machines
Corporation.
The following are trademarks of International Business Machines Corporation, registered in
many jurisdictions worldwide:
AIX DB2 InfoSphere
Initiate Master Data Service Initiate Systems Initiate
RDN WebSphere
Intel and Pentium are trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
both.
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of
Oracle and/or its affiliates.
VMware and the VMware "boxes" logo and design, Virtual SMP and VMotion are registered
trademarks or trademarks (the "Marks") of VMware, Inc. in the United States and/or other
jurisdictions.
Other product and service names might be trademarks of IBM or other companies.


Course description


Initiate Technical Boot Camp

Duration: 5 days (10 days with optional sample implementation)

Purpose
The Technical Boot Camp will prepare an implementation team
member for the types of tasks that they would be expected to perform
on their first project. The Boot Camp is not focused on the Marketing
or End-User side of the IBM Initiate Master Data Service platform, but
rather on the behind-the-scenes processes that take place in a typical
implementation.

Prerequisites
Students should have reviewed the General Product Overview and the
Boot Camp Training Kit prior to class.

Objectives
Your goal is to learn the independent steps that make up the
implementation process. Many of the activities in the Technical Boot
Camp have interdependencies and prerequisites. In the following
pages, we will outline the general flow of the Technical Boot Camp and
note the dependencies.
Install and navigate Workbench
Learn to create an Initiate member model
Work with a data extract and CloverETL graphs
Design and build an algorithm
Perform an initial data load and bulk cross match
Analyze the quality of the data in your database
Perform threshold and bucketing analysis
Explore the weight generation process
Test the configuration using a CloverETL graph
Configure LDAP
Configure IBM Initiate Inspector

IBM Initiate technical publications


You can find technical reference documentation on the Information
Center:
http://publib.boulder.ibm.com/infocenter/initiate/v9r5/index.jsp



Agenda
Day 1
Welcome
Unit 1: Introduction to the Boot Camp project
Unit 2: Installing Workbench
Unit 3: Configuring the Initiate member model
Unit 4: Configuring the algorithm

Day 2
Unit 4: Configuring the algorithm, cont.
Unit 5: Cleaning the data extract
Unit 6: Deploying the instance
Unit 7: Overview of the Initiate member model
Unit 8: Deriving data

Day 3
Unit 9: Generating weights
Unit 10: Running a bulk cross match
Unit 11: Analyzing thresholds and matched pairs
Unit 12: Analyzing buckets and frequency based bucketing

Day 4
Unit 13: Reiterating the process
Unit 14: Managing users, groups, and permissions
Unit 15: Configuring and deploying Inspector

Day 5
Unit 16: Testing the hub configuration
Conclusion

Days 6 - 10
Sample Implementation


Unit 1. Introduction to the Boot Camp project

Overview
This unit will review milestones, data requirements, and general configuration needs for the
Boot Camp project. We will also review an Implementation Approach document that
explains the course as a project.

Dependencies
Students should have reviewed the General Product Overview and the Boot Camp Training
Kit prior to class.

Topics
This unit will cover:
Concepts and terms
Implementing the IBM Initiate Master Data Service
The Technical Boot Camp process
General rules of implementation
Common process dependencies


Core concepts and terms


The following core concepts are integral to understanding what Initiate software does. In
this section we will discuss how we help solve data management problems, how our
products approach those problems, and the common terms that you will hear throughout
class.

What is MDM?
Master Data Management (MDM) provides consistent and comprehensive core information
across an enterprise.

What is an EMPI?
An Enterprise Master Person Index (EMPI) identifies each unique patient within healthcare systems and assigns an Enterprise Identifier (EID) so that Electronic Medical Records can be cross-referenced to produce a complete picture of a patient's medical history.

What is CDI?
Customer Data Integration (CDI) is the process of determining a distinct set of customer
records and creating a single, unified view of the information across all sources.

What is a hub?
For most implementations, the IBM Initiate Master Data Service is the central point where
you go to locate information. We casually refer to the IBM Initiate Master Data Service as
"the Hub." Hubs are designed with regard to specific data and data relationship domains,
such as consumers, organizations, locations, patients, vehicles, households and
hierarchies. For example, an IBM Initiate Provider Hub tracks information about Doctors
and Facilities that provide healthcare to patients.


What is a registry-style hub?


Registry style: Allows the original source systems to retain ownership of the data, although
key comparison attributes might be stored in the Hub. The Hub keeps a record of each
unique party in the data and assigns an Enterprise Identifier that is source independent.
Some industries will recognize this model as the classic EMPI (Enterprise Master Person
Index) implementation. In this deployment, the record of truth is a virtual record delivered
by the Hub. The registry is a reflection of linked data coming from contributing systems. A
registry style provides the fastest implementations because it doesn't require expensive
changes to source systems. This style works well for organizations seeking to minimally
impact their source systems and attain quick time-to-value.

What is a transactional-style hub?


Transactional (or centralized or persistent or mastered) style: writes and maintains a
golden record in the Hub. The Hub retains the master data records, and becomes the
system of record. Although the source systems still operate as data entry points, the central
Hub owns the records. As updates come in from the sources, or from the central Hub, the
multiple copies must be kept in sync. This is done by propagating each record's Enterprise
ID (EID), which identifies the record's Entity, to source systems. It is also possible to send
data changes, like updates to an address or phone number, at the same time that the EID
is synchronized.


What is a hybrid-style hub?


Hybrid style: Combines the registry style with the transactional style and enables your organization to flexibly maintain and view your data. Hybrid styles let some consumers use the Hub as a master, while it continues to serve the function of a registry to others. Hybrid styles let you evolve from a registry style to a master style without having to convert all of your applications at once, or a hybrid style can be your end destination.

What is deterministic matching?


Deterministic matching uses a set of rules, like nested if statements, to run a series of logical tests on the data sets. This is how we determine relationships, hierarchies, and householding within your data set. Typically, these are straightforward assessments of equality (for example, does Parent Company for Record 1 = Company Name for Record 2, or does Record 1's Address = Record 2's Address). The deterministic model relies on key indices to determine a match, so unlike the probabilistic method, it does not factor in the overall similarity or difference of the records. It is best used when you have a clearly defined data element connecting two records.
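To make the idea concrete, the following sketch expresses a deterministic rule set in Python. This is purely illustrative; the field names and rules are hypothetical and this is not how the engine implements its rules.

```python
# Illustrative sketch only -- not the engine's implementation.
# A deterministic rule either fires or it does not; there is no partial credit.
def deterministic_match(rec1: dict, rec2: dict) -> bool:
    """Return True if any hand-written linking rule fires."""
    # Rule 1: Record 1's Parent Company equals Record 2's Company Name
    if rec1.get("parent_company") and rec1["parent_company"] == rec2.get("company_name"):
        return True
    # Rule 2: the two records share an identical address
    if rec1.get("address") and rec1["address"] == rec2.get("address"):
        return True
    return False
```

Note that each test is a yes/no decision: two records that differ by one character in every field fail just as completely as two records with nothing in common.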

What is probabilistic matching?


Probabilistic matching measures the statistical likelihood that two records are the same. By rating the "matchiness" of the two records, the probabilistic method is able to find non-obvious correlations between data. The correlations drawn offer more flexibility in data analysis because they make fewer assumptions about what is right and what is wrong.
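As a contrast to the deterministic sketch above, the following Python sketch shows the flavor of probabilistic scoring: each attribute contributes a positive weight when it agrees and a negative weight when it disagrees, and the sum expresses overall likeness. The weights and field names here are invented for illustration; the real weights are generated from your data, as described later in this unit.

```python
# Illustrative sketch only. Real weights are derived from value frequencies
# in your data; these numbers and field names are invented for the example.
AGREE = {"name": 4.1, "birth_date": 3.2, "zip": 1.5}        # hypothetical agreement weights
DISAGREE = {"name": -2.0, "birth_date": -1.5, "zip": -0.5}  # hypothetical disagreement weights

def comparison_score(rec1: dict, rec2: dict) -> float:
    """Sum per-attribute agreement/disagreement weights into one aggregate score."""
    score = 0.0
    for attr in AGREE:
        a, b = rec1.get(attr), rec2.get(attr)
        if not a or not b:
            continue  # missing data neither supports nor refutes a match
        score += AGREE[attr] if a == b else DISAGREE[attr]
    return score
```

Because the result is a score rather than a verdict, two records can be "probably the same" without agreeing on every field.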



Solutions versus tools


When implementing and using the IBM Initiate Master Data Service you will encounter
several applications and tools used to work with these specific configurations.
Table 1-1: Initiate applications and tools

Clients

IBM Initiate Inspector
- Search and inspect entities and member records
- Resolve tasks, like Potential Duplicates and Potential Linkages
- Visualize and manage relationships between entities
- Add and edit member record information

Workbench
- Configure and deploy your hub's member model
- Build and edit an algorithm
- Manage supporting data like nicknames, anonymous values, etc.
- Use CloverETL to extract, transform, and load data
- Analyze your buckets, entities, and thresholds

Enterprise Viewer
- Search for members and entities in a web browser

Pair Manager
- View sample matched pairs displayed side by side; pairs can be sorted
- Indicate whether a pair is or is not a match, or if it might be a match

Integration tools

Message Broker Suite
- Facilitates inbound and outbound communication between systems via TCP/IP ports

EID Synchronization
- Uses TCP/IP ports to propagate Enterprise Identifiers (and attribute updates) to consuming systems

HL7 Query Adapter
- Processes HL7 queries (PIX and PDQ) synchronously

Master Data Extract
- Uses the power of CloverETL to migrate data from the hub, reformat it into denormalized table structures, and then load it into another database

Enterprise Integrator
- Point-of-service terminal emulator that allows you to bypass a legacy application's search mechanism and search the hub instead

Composer
- A unified development environment for quickly building lightweight but robust applications to display, access, and manipulate the data managed by the Master Data Service

ESOA Toolkit
- Allows developers who are integrating the Initiate Master Data Service search API calls to design custom web services that fit with their pre-existing data models


Implementing IBM Initiate Master Data Service software


The following table outlines the high-level implementation process for deploying the IBM Initiate Master Data Service platform. The overall project has additional processes that we will discuss next.

Table 1-2: The general implementation process

Review Customer Requirements | Review the requirements that have been defined for the Data Extract by the customer.
Configure Data Model | Configure and load an Initiate Member Model to fit the project needs. Data loaded includes metadata (like sources and attributes), validation lists, and lookup tables in Workbench.
Configure Algorithm | Design and build an algorithm to address your attributes, comparisons, search requirements, and bucketing design needs.
Clean Data Extract | Analyze the data to ensure it conforms to the specifications and fix any problems that are found.
Deploy Instance | Upload the data dictionary and algorithm configured and stored in Workbench to the IBM Initiate Master Data Service.
Derive Data | Parse a data extract to fit the Initiate data dictionary. You will create binary files, comparison strings, and bucket assignments.
Generate Weights | Measure the frequency of values (ignoring the anonymous values), and then assign weights accordingly.
Bulk Cross Match | Compare records, calculate comparison scores based on the weights, and then link the records that match.
Analyze and Review | Perform data analytics that assess attribute completeness, duplication rates, and threshold settings.
Reiterate | Tweak your settings or algorithm until you get the desired results.


The platform implementation process


The following section outlines the tasks that we will perform in class to implement the IBM
Initiate Master Data Service platform. While many of these tasks have dependencies, they
can be conducted in alternate sequences. For example, we need to have a clean Data
Extract before we import it into the database. However, the necessary files can be prepared
from specifications in the Data Extract Guide appendix of the Implementation Summary,
while you are waiting for the data to arrive.

1 - Review the customer requirements

Prior to implementation, the requirements for the Data Extract have been defined. The
Data Extract is a subset of the data that is used to configure the initial implementation.

Dependencies
You will need to have a clear understanding of the data sources involved, the data fields needed, and how to gather that data in a way that the IBM Initiate Master Data Service software can consume it. This helps you build your data dictionary.


2 - Configure the Initiate member data model



The Initiate member model (.imm) defines the way that the IBM Initiate Master Data
Service software stores, manages, and validates data. You will build a Data Dictionary from
scratch in class, but normally you will begin with a predefined dictionary.

Dependencies
The Data Extract Guide outlines the specific attributes and fields and the Implementation
Approach defines additional data dictionary requirements.


3 - Configure the algorithm

Typically, a generic algorithm is imported along with your project configuration, but in class
we will begin with an empty algorithm. This algorithm will need to be configured to address
the attributes you are using, the comparisons that you would like to use, and the bucketing
strategy that you would like to employ. You can use Workbench to make your edits. The
tool will validate your design and present you with a list of errors if there are any
inaccuracies in your algorithm design.

Dependencies
The algorithm is the brain of the IBM Initiate Master Data Service software. Therefore, the
proper data elements must be in place before you can fully develop the algorithm. After
your first pass at the implementation, you can make some tweaks to the algorithm. After
making those changes, you will need to derive data, generate weights and/or perform a
Bulk Cross Match again.


4 - Clean the data extract

The Data Extract is a sampling of the data. You will test this data for basic adherence to the Data Extract Guide's specifications and run CloverETL graphs against it to ensure proper data format.

Dependencies
The Data Extract Guide outlines the data requirements. You will also need access to Workbench and the CloverETL tool to perform the data cleansing.


5 - Deploy the instance

5.1 - Create an empty database


You will create a new database for the IBM Initiate Master Data Service software to reference. In class, we will use DB2, but the product also supports Oracle and Microsoft SQL Server databases.

Dependencies
The only real dependency is that you need to have a supported database platform.

5.2 - Create hub instance


You will install the IBM Initiate Master Data Service engine (a typical InstallShield executable) and use the madconfig utility to configure the ODBC connection, create the instance directory, and establish the Windows service. The following platforms are supported: Windows Server, UNIX, Linux, AIX, HP-UX, and Solaris.

Dependencies
You need to have the proper software installation files for your operating system (for
example, Windows 64-bit, Linux 64-bit, and so on) and an empty database in order to
create your instance.

5.3 - Bootstrap database


Bootstrapping your database involves creating the core database tables, defining the field
properties, and indexing the tables. During the bootstrap process several of the data
dictionary tables will be populated with default settings. Your database will be bootstrapped
as part of the instance creation or it can be done separately.

Dependencies
You will need to have the IBM Initiate Master Data Service software installed. You will also
need to access the empty database and your hub instance.


5.4 - Deploy your configuration


The Data Dictionary is one part of the overall Initiate Member Model. The dictionary tables
control validation rules, application properties, attributes, sources, nicknames, and the core
algorithm settings. The dictionary can be populated by using a combination of engine
utilities and Workbench jobs. We will build a dictionary from scratch, but it is common to
import a baseline configuration.

Dependencies
You will need to have the IBM Initiate Master Data Service engine with a bootstrapped
database running before you can import the data dictionary. You can start modifying the
import files, though, as soon as you know the core needs and the fields that are to be
referenced in the hub. The Data Extract Guide and the Implementation Approach can guide
you through the configuration.


6 - Derive data

Derived data is essentially data that has been processed by the algorithm. The data
derivation process includes four main events:
1. Raw data is parsed into segment specific unload files
2. Comparison strings are built from standardized data
3. Members are assigned bucket hashes
4. Binary files are created for faster computation
There are multiple methods to derive data. For example, the Derive Data and Create UNLs
(mpxdata) job takes raw data and builds member unload files, generates comparison
strings, assigns bucket hashes, and creates binary files for faster comparison. In contrast,
the Derive Data from UNLs (mpxfsdvd) job uses pre-existing member unload files to extract
and create comparison strings, bucket hashes, and binaries. After you have derived your
data, you will load the results into your database.
A configuration file (.cfg) acts as a map between the Data Extract and the Data Dictionary.
It literally indicates which field in an extract row goes to which attribute in the database. An
engine utility uses the .cfg file to parse a raw data extract into the table structure that Derive
Data and Create UNLs (mpxdata) expects.
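To illustrate the mapping concept (not the actual .cfg syntax, which is covered in the product documentation), the sketch below maps a few extract field positions from Table 1-17 to member model attributes and fields from Tables 1-7 and 1-8:

```python
# Conceptual sketch of what the .cfg file expresses: extract field position
# mapped to a member-model attribute/field. This is NOT the real .cfg syntax.
FIELD_MAP = {
    1: "srcCode",            # Source
    2: "memIdnum",           # Source ID
    3: "SEX",                # Gender
    4: "LGLNAME.onmLast",    # Last Name
    5: "LGLNAME.onmFirst",   # First Name
    8: "BIRTHDT",            # Birth Date
    15: "SSN",               # Social Security Number
}

def parse_record(line: str) -> dict:
    """Split one pipe-delimited extract row and map populated fields to attributes."""
    fields = line.rstrip("\r\n").split("|")
    return {attr: fields[pos - 1]
            for pos, attr in FIELD_MAP.items()
            if pos <= len(fields) and fields[pos - 1] != ""}
```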

Dependencies
You will need to have most of the components installed and configured, like the hub engine,
member model, .cfg file, and the algorithm. If changes are made to the algorithm, then data
will need to be re-derived.
You will need to know the order in which the fields appear in the Data Extract and the
corresponding Attribute names in the member model. You can build your .cfg file from
project documentation if the real files are not yet available. Check for accuracy before
deployment.


7 - Generate weights

The weight generation process is an integrated utility that goes through multiple steps to measure the frequency of individual values in the database, and then assigns weights to those values: the most common values weigh less and the rarest weigh more. The weight generation process creates unload files and loads them into the database.
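The core idea, that common values carry less matching information than rare ones, can be sketched with a simple inverse-frequency calculation. This is only an illustration of the principle; the actual utility performs a more involved computation.

```python
import math
from collections import Counter

# Illustration of frequency-based weighting: a value's weight grows as its
# frequency shrinks. The real weight generation utility is more sophisticated.
def value_weights(values: list[str]) -> dict[str, float]:
    counts = Counter(v for v in values if v)  # assume anonymous values were already removed
    total = sum(counts.values())
    return {value: math.log2(total / count) for value, count in counts.items()}

weights = value_weights(["SMITH", "SMITH", "SMITH", "ZALUTSKY"])
# "SMITH" (3 of 4 records) weighs about 0.42; "ZALUTSKY" (1 of 4) weighs 2.0
```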

Dependencies
You must have the engine installed and your algorithm configured. If you have already
derived your data then weight generation will take less time, but the weight generation
utility can derive data for its own use. You should always check your weights before loading
them into the database.


8 - Perform bulk cross match

The bulk cross match (BXM) is a process that allows you to compare and link thousands of
records per second. The BXM is most commonly performed in the initial stage of the
implementation and again right before the system goes live. The BXM process is made up
of two primary jobs: Compare Members in Bulk (mpxcomp) and Link Entities (mpxlink).
After running the compare and link, the data will need to be loaded into the database.

Dependencies
You must have derived data and weights before you perform the bulk cross match. That
also means that the engine, algorithm, and data dictionary must be in place. The BXM
process uses the weights to generate an aggregate comparison score. That score is
compared to the thresholds to determine auto-linking and task generation.


9 - Analyze and review

Once your data is fully loaded into the IBM Initiate Master Data Service software, you
should run tests to establish how well your system and the data are performing. Through
the analysis tools in Workbench, you can assess attribute completeness, score distribution,
entity and bucket size, and threshold analysis.

Dependencies
Your core engine and data must be fully loaded in order to run the data analytics. Analysis
can all be done within Workbench.


10 - Reiterate

You will take the results of the analysis and make tweaks to your algorithm and data
dictionary, if necessary. After your edits you will usually re-derive your data, run another
BXM, and analyze the results again.

Dependencies
Bucket design changes usually require re-deriving, but not another BXM. Comparison
changes require new weights, re-derivation, and a new BXM. Some small tweaks only
require an engine restart or simply redeploying your configuration.


11 - Test the hub configuration

You will test your hub configuration using a CloverETL graph designed to perform a
MEMPUT operation.

Dependencies
Your configuration and data must be fully loaded into the IBM Initiate Master Data Service
software.


General rules of implementation


Every implementation is unique; however, several processes are common to most implementations. Here are some general rules that we should address first:
There are several micro processes that make up the larger implementation process.
The processes that we will go through do not need to be run in the exact order that we
will use in class.
An implementation is an iterative process. You will refine your settings and redeploy the
software more than once during implementation.
Many processes have dependencies. We have tried to note the major dependencies in
this manual.

Common process dependencies


Here are some of the main dependencies that exist during implementation:
You must bootstrap your database before data can be loaded into it. Bootstrapping is
the process of creating the core database tables that the IBM Initiate Master Data
Service platform will reference.
You must know what fields are in an extract before you can configure the data dictionary
or import any member data.
You must configure an algorithm before you can derive data and generate weights.
You must have weights before you can compare and link records.
You must start with default thresholds in order to generate data that allows you to
optimize your thresholds.


Boot Camp best practices


Below are a few tips and tricks to keep in mind as you go through the Boot Camp process.
While these are not hard and fast rules, following them will make your life easier during
class. Many of these tips have been learned the hard way by your fellow students, so heed
the advice of those who have gone before you.
Never use spaces when naming files, projects, databases, users... basically anything.
Use ALL CAPS, no caps, or camelCase. Any method is fine, just use capitalization consistently.
Save early and often. Check the Problems tab for configuration errors that need to be fixed.
Watch for an asterisk on Workbench tabs, which indicates unsaved changes.
Check the Problems tab in Workbench after saving to prevent problems down the line.
Check the log files after Workbench jobs to ensure they ran properly. A job coming back as Successful does not necessarily mean everything ran correctly.
When using drop-down menus, the blank option at the top of the list will leave the field
empty.


The implementation approach/summary document


The following section contains a sample implementation document. These documents are
created to determine how the IBM Initiate Master Data Service will be implemented. The
document provides an overview of the needs, and the architecture and configuration
needed to achieve your goals.
Also contained in the documentation is a Data Extract Guide. This guide provides details of
the data extract such as the expected file format and the data contained in the fields.
Use the Implementation Approach Document starting on the next page as a reference as
you set up your implementation throughout the Boot Camp.

Important

Implementation Approach Documents vary in content and format. This sample contains elements common to all Implementation Approach documents, but does not contain all elements used in all projects.


Application overview

High-level requirements
Initiate University will use Initiate software to achieve these key objectives:
Discover data quality errors: The IBM Initiate Master Data Service identifies potential duplicates and potential linkages so that you can review and correct them.
Establish an enterprise-wide student identifier: The IBM Initiate Master Data Service identifies and links student records across your enterprise using an Enterprise ID (EID). Initiate University desires to leverage this capability to assign an enterprise-wide student identifier to student records.

Solution architecture
Overview
The IBM Initiate Master Data Service software will manage member demographic data
from Archway Center and Bellwood Center.


Components and interactions


This Initiate Software Implementation Architecture includes these Initiate products:
Initiate University Source Systems: The following Boot Camp systems will provide
member data to the IBM Initiate Master Data Service:
- Archway Center
- Bellwood Center
IBM Initiate Master Data Service Software: The Identity hub Engine is composed of the database, business rules, and linkage logic (algorithms) that support required functionality in all modules and applications.
Algorithm: The algorithm provides the heart of IBM Initiate Master Data Service
software. Combining specialized routines for data standardization, derivation, and
comparison, the algorithm enables candidate selection, member identification, and
linking of enterprise records.
Inbound Broker: The Initiate Inbound Message-Based Transaction Service is a
generic interface designed to manage client-specific messages between the source
systems and the Identity hub Engine and database. When information about a member
is entered or updated in a source system, this information is made available to the Initiate database via the Inbound Message-Based Transaction Service through a TCP/IP connection.


Source data
Outlined below are the data elements Initiate University will store in the IBM Initiate Master
Data Service software:
Source
Source ID
Gender
Last Name
First Name
Middle Name
Suffix
Birth Date
Address Line 1
City
State
Zip Code
Home Phone
Mobile Phone
Social Security #
For more details on the source data, please see the Data Extract Guide.


Client dictionary


Member types
Outlined below are the member types for your configuration:
Table 1-3: Member types
Member Type | Member Label | Member Category | Data Derivation Code
PERSON | PERSON | PERSON | DVDPERSON

Entity types
Outlined below are the entity types for your configuration:
Table 1-4: Entity types
Entity Type | Entity Label/Category | Member Type | Comparison Algorithm | Has Links? | Async Entity Mgmt? | Source Autolinks | Same Entity Cross-Linked Members
id | Identity | PERSON | CMPID | Y | Y | Active | Y

Source attributes
A source is a separate system/database with which the IBM Initiate Master Data Service
software interacts and receives member information and updates.
The following sources have been defined for Initiate University:
Table 1-5: Source attributes
Source Name | Source/Physical Code | Source Type | Member Type
Archway | A | Definitional | PERSON
Bellwood | B | Definitional | PERSON

The following outside source has been defined for Initiate University:
Table 1-6: Source attributes
Source Name | Source/Physical Code | Source Type | Member Type
Social Security Administration | SSA | Informational | PERSON


Member attributes
The IBM Initiate Master Data Service software captures member level information in
different attributes. Every member attribute is stored in a particular form or structure
(segment) which corresponds to a database table, for example, names are stored in the
MEMNAME segment or the mpi_memname table, dates are stored in the MEMDATE
segment or the mpi_memdate table, and so on.
Some attributes require a predefined pick list of values to facilitate queries through end-user applications. These pick lists are defined and associated with individual attributes using type code (EDT Code) values.
Attribute specific definitions also include Number of Active Attributes (Number Active), and
Number of Historical attributes (Number Exists).
Number of Active Attributes (Number Active) - Usually the most current attribute value is set up as the Active value for an attribute. In this case, the field would be set to 1. Should you have a case where you need more than one active value for the same attribute, increase the number active value. Example: Marital status - at any given point in time, a person should only have one active status, such as Married.
Number of Historical attributes (Number Exists) - Over a period of time, an attribute
value might change. The number exists value determines how many historical or
previous values along with the current active value(s) should be stored in the Initiate
database. For example, if number exists is set to 3 and number active is set to 1 for
name and Mary Jones gets married to Jonathan Smith, her name entries might look like
the following (oldest to most current):
- Mary Jonas - Status=Inactive
- Mary Jones - Status=Inactive
- Mary Smith - Status=Active
If Mary gets married again and takes the name Mary Johnson, then the very first name, or the oldest value, would get purged. In the example above, Mary Jonas is removed and the resulting entries look like the following:
- Mary Jones - Status=Inactive
- Mary Smith - Status=Inactive
- Mary Johnson - Status=Active


Your attributes are defined as below:


Table 1-7: Attribute definitions
Attribute Name/Label/Description | Code | Member Type | Type (Segment) | Maximum Active Values | Maximum Existing Values
Name | LGLNAME | PERSON | MEMNAME | 1 | 0
Home Address | HOMEADDR | PERSON | MEMADDR | 1 | 10
Gender | SEX | PERSON | MEMATTR | 1 | 1
Social Security Number* | SSN | PERSON | MEMIDENT | 1 | 1
Birth Date | BIRTHDT | PERSON | MEMDATE | 1 | 1
Home Phone | HOMEPHON | PERSON | MEMPHONE | 1 | 10
Mobile Phone | MOBILEPHON | PERSON | MEMPHONE | 1 | 10

* Social Security Number will have the Review Identifier flag enabled in the Algorithm.


Implementation defined segments


The following segment definitions were created to store all data.
Table 1-8: Segment definitions
Segment | Field Number | Label | Field Name | Data Type | Length | Required
MEMHEAD | 1 | Source | srcCode | CHAR | 12 | Y
MEMHEAD | 2 | Source ID | memIdnum | CHAR | 60 | Y
MEMATTR | 3 | Gender | attrVal | CHAR | 128 | N
MEMNAME | 4 | Last Name | onmLast | CHAR | 75 | N
MEMNAME | 5 | First Name | onmFirst | CHAR | 30 | N
MEMNAME | 6 | Middle Name | onmMiddle | CHAR | 30 | N
MEMNAME | 7 | Suffix | onmSuffix | CHAR | 10 | N
MEMDATE | 8 | Birth Date | dateVal | CHAR | 19 | N
MEMADDR | 9 | Address Line 1 | stLine1 | CHAR | 75 | N
MEMADDR | 10 | City | city | CHAR | 30 | N
MEMADDR | 11 | State | state | CHAR | 15 | N
MEMADDR | 12 | Zip Code | zipCode | CHAR | 10 | N
MEMPHONE | 13 | Home Phone | phNumber | CHAR | 20 | N
MEMPHONE | 14 | Mobile Phone | phNumber | CHAR | 20 | N
MEMIDENT | 15 | Social Security Number | idNumber | CHAR | 40 | N


Algorithm configuration


IBM Initiate Master Data Service software algorithms make use of data that is specifically
tailored for your installation. This section describes the process we use to configure the
data to produce results that meet your business requirements.
Your IBM Initiate Master Data Service software configuration is described below. If you
change any of these settings, you should record a log of those changes and evaluate the
behavioral differences in the Initiate software. For example, if you change your Anonymous
values, you would expect to see a change in the scoring of members and possibly a
change in the number of tasks and linkages. Or if you decrease your thresholds, you would
expect to see an increase in the number of linked members and/or tasks.
Your IBM Initiate Master Data Service software implementation has settings for each
component of the Initiate algorithm that are optimized for your industry and for your specific
business.

Derivation
Nicknames
The IBM Initiate Master Data Service software includes nickname translations that we
developed over time based on our algorithm and matching experience. Some customer
data sets might require additional nickname translations. Your installation uses our
standard person nicknames. The nicknames used for your installation are attached:
<document file would be attached here>

Standardization
The IBM Initiate Master Data Service software algorithms use standardization routines
tailored to meet your data needs. Standardization routines consist of a Function designed to standardize a specific attribute type and a set of Anonymous values. For example, the USZIP function standardizes zip codes into 5-digit US Postal code representations by removing the zip+4 extensions. The ZIPCODE anonymous value list contains the zip codes that are not meaningful during matching, for example '11111' (see Anonymous values under Selection below for more information).


Your standardization rules are as follows:


Table 1-9: Standardization rules

Attribute: Social Security Number | AttrCode: SSN | Function: IDENT1N | Anon: None | Record Status: Active, Inactive
Comment: Formats identifier without issuer by removing all non-numeric characters.

Attribute: Gender | AttrCode: SEX | Function: ATTR | Anon: None | Record Status: Active, Inactive
Comment: Formats a value by removing all non-alphanumeric characters and non-digit characters. Performs anonymous checks if an ANON code is specified. Returns comparison value, query value, and single bucket value (all the same).

Attribute: Legal Name (onmLast, onmFirst, onmMiddle, onmSuffix) | AttrCode: LGLNAME | Function: PXNM | Anon: ANAME | Record Status: Active, Inactive
Comment: Formats a list of strings along with their prefixes/suffixes so that they can be checked consistently for scoring and data derivation. Performs anonymous checks if an ANON code is specified. For comparison and query values, returns a concatenated string of name tokens. Returns one bucket value for every name token; separates prefixes, suffixes, and degrees.

Attribute: Legal Name (onmLast) | AttrCode: LGLNAME | Function: PXNM | Anon: ANAME | Record Status: Active, Inactive
Comment: Formats a list of strings along with their prefixes/suffixes so that they can be checked consistently for scoring and data derivation. Performs anonymous checks if an ANON code is specified. For comparison and query values, returns a concatenated string of name tokens. Returns one bucket value for every name token; separates prefixes, suffixes, and degrees.

Attribute: Birth Date | AttrCode: BIRTHDT | Function: DATE1 | Anon: None | Record Status: Active, Inactive
Comment: Standardizes dates. The output is then checked for length. If 0, it is treated as ANON, and if over 8, it is truncated to 8.

Attribute: Home Address (zip) | AttrCode: HOMEADDR | Function: USZIP | Anon: None | Record Status: Active, Inactive
Comment: Formats a United States zip code. (Accepts only 1 field.)

Attribute: Home Address (StLine1, City, State, ZipCode) | AttrCode: HOMEADDR | Function: USADDR2 | Anon: None | Record Status: Active, Inactive
Comment: Formats an address string, dividing it into the address subcomponents for United States with region and postal code. (Accepts up to 7 fields.)

Attribute: Home Phone | AttrCode: HOMEPHON | Function: PHONE1 | Anon: None | Record Status: Active, Inactive
Comment: Formats a phone number that is 7 or more characters to a 7-digit United States/Canada number, by shaving off leading 1's, area codes, and special characters. (Accepts only 1 field.)

Attribute: Mobile Phone | AttrCode: MOBILEPHON | Function: PHONE1 | Anon: None | Record Status: Active, Inactive
Comment: Formats a phone number that is 7 or more characters to a 7-digit United States/Canada number, by shaving off leading 1's, area codes, and special characters. (Accepts only 1 field.)


Selection
Bucketing
Your bucketing configuration, a part of your derived data specification, determines which
members are candidates for comparison. For example, you might compare any members
that have similar names and share the same zip code or have identical phone numbers.
We use an 'OR' condition, instead of an 'AND' condition, during candidate selection to maximize the number of relevant candidates returned, and hence reduce the likelihood of leaving out a genuine candidate. We optimize your bucketing configuration based on your
data volumes, profile, and business objectives. Your bucketing configuration is as follows:
Table 1-10: Bucketing definitions
Bucket | Derivation Attr | Max/Min Tokens | Bucket Func | Bucket Gen Type | Phonetic Argument | Equiv String Code | Max/Min Bucket Tokens
SSN | SSN | 1/0 | ATTR | Sorted | n/a | n/a | 1/1
NAME + DOB | LGLNAME | 1/1 | PXNM | EQMETA | NORMPHONE | NICKNAME | 2/2
NAME + DOB | BIRTHDT | 1/1 | DATE | DTY4MM | n/a | n/a | 2/2
LAST NAME + ZIP | LGLNAME | 1/1 | PXNM | EQMETA | NORMPHONE | NICKNAME | 2/2
LAST NAME + ZIP | HOMEADDR | 1/1 | ATTR | As Is | n/a | n/a | 2/2
PHONE (HOME + MOBILE) | HOMEPHON | 1/0 | ATTR | Sorted | n/a | n/a | 1/1
PHONE (HOME + MOBILE) | MOBILEPHON | 1/0 | ATTR | Sorted | n/a | n/a | 1/1

Anonymous values
Anonymous values are common values that do not have sufficient meaning for use in
matching. For example, anonymous values include data that come from testing, such as
the name Test Customer. They also can include values that you use as defaults such as
the birth date of 1/1/1900. The attached file represents the data contained in the
mpi_stranon table. <document file would be attached here> The last two fields indicate the
type of anonymous value and the value itself. This file is included as a baseline for
reference, should you ever desire that additional anonymous values be entered, or existing
ones removed.
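A minimal sketch of how an anonymous check behaves follows; the example entries are invented for illustration, while the shipped list lives in the mpi_stranon table referenced above.

```python
# Hypothetical anonymous-value check: values on the list are dropped so they
# never contribute to bucketing or scoring. Entries here are examples only.
ANONYMOUS = {
    "SSN": {"111111111", "999999999"},
    "BIRTHDT": {"19000101"},   # the 1/1/1900 default mentioned above
}

def scrub(attr_code: str, value: str):
    """Return None for anonymous values, otherwise the value unchanged."""
    return None if value in ANONYMOUS.get(attr_code, set()) else value
```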


Scoring

Comparison
We use comparison routines tailored to meet your data needs. Comparison consists of a Function that indicates which data elements are compared, a list of Nicknames for comparison (see Nicknames above), and a set of weights that dictate the match scores for each attribute (see Weights below).


Your comparison rules are as follows:


Table 1-11: Comparison rules

Attributes: SSN | Function: DR1D1B | Nicknames: None | Historical Comparison: Active, Inactive | Weight Table: 1dim
Comment: DR1D1B is used to compare string data. The result of a DR1D1B is an integer that represents how close the two strings are to each other. This integer is called edit-distance and is used for distance-based comparisons. Examples include SSN, ID Number, et cetera.

Attributes: Gender | Function: EQVD | Nicknames: None | Historical Comparison: Active, Inactive | Weight Table: sval
Comment: This is a simple string comparison, used for a simple MATCH or NO-MATCH type of comparison with different match weights for different match values.

Attributes: Legal Name (onmLast, onmFirst, onmMiddle) | Function: QXNM | Nicknames: NICKNAME | Historical Comparison: Active, Inactive | Weight Table: sval
Comment: The QXNM routine is used for name comparisons. QXNM uses the cmpargs argument to specify the type of metaphone algorithm to be applied (refer to AXP).

Attributes: Birth Date | Function: DATE | Nicknames: None | Historical Comparison: Active, Inactive | Weight Table: 1dim + nval
Comment: The DATE comparison function formats alphanumeric date data.

Attributes: Home Address + Home Phone + Mobile Phone | Function: AXP | Nicknames: None | Historical Comparison: Active, Inactive | Weight Table: 1dim + 2dim + sval
Comment: The AXP address and phone comparison function is based on information content and similarity. If the address consists of street information and postal codes, this information is used for comparison. When the postal code is not present, the City, State (or city, country, et cetera) is used in the comparison.

Weights
Weights dictate the comparison scores generated by the IBM Initiate Master Data Service
software algorithm. These weights were determined via analysis of your data and are included here <document file would be attached here> as a baseline, in case you should ever request that Initiate perform additional analysis on your data to regenerate them.

False positive filter


None


Linking

Threshold review
The IBM Initiate Master Data Service software employs two thresholds in determining the
final outcome of a comparison, the clerical-review (CR) threshold and the auto-link (AL)
threshold. If a pair of records scores above the AL threshold, the IBM Initiate Master Data
Service software automatically links the pair of records together as a single entity. If they
score above the CR threshold but below the AL threshold, the IBM Initiate Master Data
Service software places the pair of records into an electronic queue for manual review.
Setting these thresholds correctly is essential to the performance and accuracy of the IBM
Initiate Master Data Service software. The values for these thresholds depend upon
several factors:
The size of the data file
The richness of the underlying data (i.e., the number of attributes available for matching)
The tolerable false-positive error rate (which describes the number of records which
would be incorrectly linked)
The desirable false-negative rate (which is the number of missed linkages) or the
resources available for processing manual review
Optionally, we establish thresholds for the comparison of members within each source
system and for comparison of members across your source systems.
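The two-threshold decision can be sketched as follows. The 9.0/9.0 values come from Table 1-13 later in this section; because the CR threshold equals the AL threshold there, the clerical-review band is empty in this particular configuration.

```python
# Sketch of the two-threshold outcome described above.
CR_THRESHOLD = 9.0   # clerical review threshold
AL_THRESHOLD = 9.0   # auto-link threshold (CR == AL leaves no review band here)

def outcome(score: float) -> str:
    if score >= AL_THRESHOLD:
        return "auto-link"          # records linked into a single entity
    if score >= CR_THRESHOLD:
        return "clerical review"    # queued as a task for manual review
    return "no link"
```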

Thresholds
Initiate's probabilistic matching algorithms provide highly accurate results that can be tuned
to meet your business requirements. The Initiate algorithms assign a probability score that
indicates the likelihood that a given record matches the search criteria. By performing a
thorough Threshold Analysis you determine a specific threshold score that produces
optimum results for your specific data characteristics and requirements. For a given
Search, Initiate's probability score is based on how many of the attributes on the returned
record match the input criteria and on how closely those attributes match the input criteria.
For more details on Initiate's matching algorithms, see Identity hub Overview Guide. In
general, higher scores indicate higher confidence matches, where most or all of the Search
criteria are met. Lower scores reveal less confident matches, where one or more of the
search criteria are either different or missing from the returned record.
Therefore, if you set a high Threshold score, your Search results will yield only the highest confidence matches and will rarely, if ever, return False-Positives, or incorrect matches. However, at very high Threshold scores, you risk having False-Negatives, or missed opportunities for matching. For example, if someone has moved and their ZIP code no longer matches the input criteria, that record will score lower than it would if you searched using the old address criteria. In order to obtain optimum Search results, you must balance the trade-off between False-Negatives and False-Positives. You do this by selecting a score threshold that meets your specific business needs.


Input criteria


The following factors drive the threshold setting process.
Table 1-12: Input criteria

Factor: Data size
Description: The size of the data set, in terms of number of records, that you are searching against.
Criteria: Expected volume after 6 months to 1 year of operations: TBD

Factor: Data richness
Description: The richness of the underlying data (i.e., the number of attributes available for matching and how those attribute values vary).
Criteria: Name, Birth Date, Address, Gender. Variations for each of these attributes estimated from Initiate industry standards.

Factor: False positive toleration
Description: The tolerable false-positive error rate (which describes the number of records which would be incorrectly linked).
Criteria: Very low tolerance for false-positives, in the range of 1 in 100,000 searches to 1 in 1 million searches.

Factor: False-negative goal
Description: The desired false-negative rate (which is the number of missed linkages) or the resources available for processing manual review.
Criteria: Minimize false negatives to the extent possible without compromising false-positive tolerance. Target 10-20% maximum false negative rates.

Results
We configured the IBM Initiate Master Data Service based on our knowledge of your
industry and your business goals combined with an analysis of your data. During threshold
analysis with your business group, the following threshold levels were chosen:
Table 1-13: Thresholds
Source | Clerical Review Threshold | Auto-link Threshold
Archway | 9.0 | 9.0
Bellwood | 9.0 | 9.0


Tasks
There are 4 types of tasks that can be created by the IBM Initiate Master Data Service:
Potential Duplicate: Two records from the same source that score between the clerical
review and auto-link threshold.
Potential Linkage: Two records from different sources that score between the clerical
review and auto-link threshold.
Review Identifier: Two records from the same source that have the same unique identifier (the attribute used for Review Identifier tasks is identified in the Member Attributes section above).
The following tasks are not utilized by Initiate University, but must be active for the IBM
Initiate Master Data Service software to function correctly:
Potential Overlay: A record received an update with information that is radically different
than the data that was already there.


Performance targets


We size hardware by analyzing expected transaction volume (logical and physical) by
transaction type over time, given the number of member records in the database. We take into account customer technical architecture preferences, integration method, ratio of searches to updates, and so on, when making recommendations.

Data volumes
Initiate University maintains a database of roughly 500,000 records. Detailed data volumes are listed below:
Table 1-14: Data volumes
Description | Records
Day 1 Volume | ~ 500,000
Average daily update volume | 20,000 (10,000 from Archway, 10,000 from Bellwood)
Total volume to support | ~ 500,000


Data Extract Guide


The IBM Initiate Master Data Service software is the most accurate solution available for matching customer data. In order to provide the most accurate matching possible, we analyze your actual data to configure and test an algorithm that meets your specific needs. Your data extracts provide Initiate with the attributes required for matching and any additional attributes that aid in validating results or provide additional context to the data. The richness of the attributes and the degree to which they are populated factor into the strength of our accuracy.
Data description
For each section below, please describe your data by providing responses to the questions
and provide any additional information that you believe might be helpful in understanding
your data environment and processes.
The IBM Initiate Master Data Service manages data that you collect from these sources:
Table 1-15: Data sources
Archway: Registration source for Archway Training Center (~250,000 records)
Bellwood: Registration source for Bellwood Training Center (~250,000 records)
Please describe the unique identifiers from your sources:
Table 1-16: Unique identifiers
Question: What is your primary qualifier? (The number used to uniquely identify a record.)
Response: Records are uniquely identified by a combination of Source and Source ID; students are uniquely identified by Member Record Number.
Question: Does the primary ID have a meaningful prefix, suffix, or any other characteristic within the identifier? (Describe any values in the identifier that represent a distinct population.)
Response: X No; _ Yes, please describe:
Question: Can you extract one record per primary identifier? (When you perform your data extract, are you able to pull one record per unique identifier?)
Response: X Yes; _ No
Question: Do you have records without a primary identifier? (If you want to include these in your evaluation, a primary identifier needs to be assigned prior to submitting the file, or Initiate can assign an identifier during processing. Briefly describe.)
Response: _ Yes; X No
File format
Data extract files you provide are pipe-delimited (the | character) ASCII files. Each record occupies a single line in your file, ends with a final '|', and is CRLF-terminated. The extract(s) conform to the following format (the various fields from the sources are mapped accordingly, and any fields not used by a source are left null but still have a pipe delimiter to preserve consistent field positions):
Table 1-17: File format
1. Source - The source from which this record originated (max length 12)
2. Source ID - Unique identifier for the source (max length 60)
3. Gender - Student's gender (max length 128)
4. Last Name - Student's last name (max length 75)
5. First Name - Student's first name (max length 30)
6. Middle Name - Student's middle name (max length 30)
7. Suffix - Student's suffix (max length 10)
8. Birth Date - Student's birth date (max length 19)
9. Address Line 1 - Student's street address, line 1 (max length 75)
10. City - Student's city (max length 30)
11. State - Student's state (max length 15)
12. Zip Code - Student's zip code (max length 10)
13. Home Phone - Student's home phone (max length 20)
14. Mobile Phone - Student's mobile phone (max length 20)
15. Social Security Number - Student's Social Security number (max length 40)
Assumptions
When preparing your data extract, please ensure the following:
Source Identifier is unique within the source
All fields are Text data type
Phone numbers are in a single format
Data file fields are alphanumeric characters and left-justified
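A script can verify these assumptions line by line before you submit an extract. The following is a minimal Python sketch, not an Initiate tool; the field count and maximum-length values are taken from Table 1-17 above, and the record passed in at the end is an illustrative example.

# Sketch only: validate one extract record against the agreed format.
MAX_LENGTHS = [12, 60, 128, 75, 30, 30, 10, 19, 75, 30, 15, 10, 20, 20, 40]

def check_record(raw_line):
    line = raw_line.rstrip("\r\n")
    errors = []
    if not line.endswith("|"):
        errors.append("record must end with a final '|'")
        fields = line.split("|")
    else:
        fields = line[:-1].split("|")      # drop the empty piece after the final '|'
    if len(fields) != len(MAX_LENGTHS):
        errors.append("expected %d fields, got %d" % (len(MAX_LENGTHS), len(fields)))
    for pos, (value, limit) in enumerate(zip(fields, MAX_LENGTHS), start=1):
        if len(value) > limit:
            errors.append("field %d exceeds max length %d" % (pos, limit))
    return errors

# Illustrative record; an empty list means the line passed the checks.
print(check_record("A|123456|F|GRIMM|JEANICE|I||1946-10-04|12 ELM ST|MESA|AZ|85217|4803122086|4802172304|617634723|"))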
Customization
The IBM Initiate Master Data Service software is configurable to support additional
attributes or different formats (changed field order, different delimiters, et cetera). These
changes should be approved and documented as revisions to the above format prior to file
submission so we can properly configure the software. Certain customizations could
require a change to project pricing or schedules. Please coordinate any extract change
requests with your Initiate Project Manager.
Sample data
Based on our conversations, the following are representative samples of the data that you provide to Initiate.
Source|MemId|Gender|LastName|FirstName|MiddleInitial|Suffix|BirthDate|HomePhone|CellPhone|SSN
WEB|435263|F|GRIMM|JEANICE|I||1946-10-04|(480)312-2086|(480)217-2304|617-63-4723
WEB|436287|M|ADOLPHSEN|KEENAN|O|JR||(480)186-3228||421-91-9316
REG|M-1509|M|MORING|WADE|R||1963-12-13|(928)103-9712|(928)302-0913|262-21-1509
REG|G-9637|F|GALGANO|SARAI|L||1990-06-07||(480)100-1377|285-42-9637
Preventing extract errors
We do not process data that fail to meet the criteria and format specified in this document.
We recognize that any large volume of data is likely to have unforeseen characteristics, so
we recommend that you review your extracts for accuracy before delivering them to us.
The data submission checklist below includes some validation guidelines you could use to
verify your extracts.
Data submission checklist
When we receive your data, we perform the following series of checks to ensure that the quality, size, and format of the data are what we expected and agreed upon. When we encounter errors, we provide you with a reject log of the records we could not process. If there are a significant number of errors (for example, greater than 0.05%), you might be required to provide another extract. Reprocessing your data due to excessive extract errors might impact the project schedule and/or price.
Table 1-18: Data submission checklist (mark Pass: _ Yes _ No for each step)
Readable: Are the files readable? If they are zipped, can they be extracted and read in a plain text viewer?
File Count: What is the record count of each file? Does that match our expected record count?
Primary ID: Is the primary identifier unique throughout the source file?
Date Format: Does the date format for all date fields match our expected date format, including only valid dates?
Name Format: Is the name format divided into first, middle, and last?
Non-printable Characters: Does the file contain printable characters only (that is, no non-printable characters)?
End of Line Terminator: Does the end of line terminator match the CRLF that we are expecting?
Format: Does the file adhere to the agreed upon format? Is the correct delimiter (|) used?
Source: Does the file contain a column that identifies each record as pertaining to a particular source?
Null Value: Character(s) are not used to indicate a null value in any field.
Extra Characters: Are there any extra characters or fields that we did not agree to in the format?
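Several of these checks lend themselves to automation. Below is a minimal Python sketch (not an Initiate utility) that audits an extract for record count, CRLF termination, printable characters, and primary-identifier uniqueness (Source + Source ID); the file name and expected count in the final call are illustrative assumptions.

# Sketch only: automate a few checklist items for a pipe-delimited extract.
def audit_extract(path, expected_count):
    seen, duplicates, total = set(), 0, 0
    with open(path, encoding="ascii", errors="replace", newline="") as fh:
        for raw in fh:
            total += 1
            if not raw.endswith("\r\n"):
                print("line %d: not CRLF-terminated" % total)
            line = raw.rstrip("\r\n")
            if not line.isprintable():
                print("line %d: contains non-printable characters" % total)
            fields = line.split("|")
            if len(fields) < 2:
                print("line %d: too few fields" % total)
                continue
            key = (fields[0], fields[1])   # Source + Source ID
            if key in seen:
                duplicates += 1
            seen.add(key)
    print("records: %d (expected %d), duplicate primary IDs: %d"
          % (total, expected_count, duplicates))

audit_extract("archway_extract.dat", 250000)   # hypothetical file name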
Online data transmission
Initiate offers electronic transfer of files through our secure ftp site. Each client receives a dedicated, secure folder for data transmission. Once the transmission is complete, Initiate's staff removes all data from the ftp site and stores it on a secure server to perform our data analysis. Only your Initiate project team has access to this data. The default ftp transmission mode is ASCII, so you must specify BINARY mode for transmission of compressed files. Please coordinate with your project manager to arrange for data encryption, if necessary.
Security
Initiate adheres to strict confidentiality standards. We take this responsibility very seriously
and enforce regulatory standards relating to the distribution, disclosure and retention of
personal data. Unless otherwise instructed, Initiate destroys client media in accordance
with an agreed upon time frame.
Inbound message requirements appendix
Introduction
Initiate Inbound Message Based Transaction Services (Inbound Broker) enables you to
submit record updates from your source systems to the Identity hub via your existing
messaging infrastructure. The components of Inbound Broker are designed to manage
your specific XML and delimited messages and can be customized to your requirements.
Outlined below are the components required for an inbound message, some message
samples, and then your inbound map.
Components overview
The Inbound Broker consists of the following two components:
Inbound Message Reader
Inbound Message Broker
The Inbound Message Reader is a stand-alone process that receives messages through a TCP/IP connection: it listens on a port, validates that the message is formatted correctly, and then writes the message to a queue, from which it is consumed by the Inbound Message Broker.
The Inbound Message Broker process consumes messages from the Inbound Message Reader queue and then sends the message via TCP/IP to the hub. The engine attempts to process the messages and sends a notification of success or failure back to the Inbound Message Broker. After receiving an acknowledgement from the IBM Initiate Master Data Service, the message is placed in either the 'success*.dat' or 'reject*.dat' files.
The Inbound Broker uses a configuration file to map message fields to the Identity hub data attributes. The inbound.ini file identifies how each attribute will be stored in the IBM Initiate Master Data Service, error conditions for incoming values, and any special formatting or processing required. It provides detailed descriptions of valid values and indicates formatting requirements for processing messages read by the message reader.
The configuration file can be categorized into three sections:
Data - This section is used to define the message format; it defines how the data will be extracted from the message.
Event - This optional section is used to edit, modify, or scrub data elements before sending to the IBM Initiate Master Data Service. These are global events, and not to be confused with the 'EVENT' tag included in your Data section.
Evaluator Broker - This optional section is used to define evaluator logic on the data elements. This section is predominantly used when you are unable to process all your data massaging requirements in the Event section. In other words, this section allows for more complex processing of message fields. The evaluator broker is used in conjunction with an event to conditionally manage data elements of a message. Your requirements do not require using this section.
Refer to the IBM Initiate Master Data Service Software Installation and Configuration Guide
and the Initiate Operations Guide for more details about the Inbound Broker components.
Configuration details
The following sections outline the specific configuration options for your implementation, including how the messages should be formatted, valid values for specific XML tags, and the expected data to be included.
Table 1-19: Segment definitions
1. Source - segment MEMHEAD, label srcCode. Special processing: reject the message if Source is blank.
2. Source ID - segment MEMHEAD, label memIdnum. Special processing: reject the message if Source ID is blank.
3. Gender - segment SEX, label attrVal
4. Last Name - segment LGLNAME, label onmLast
5. First Name - segment LGLNAME, label onmFirst
6. Middle Name - segment LGLNAME, label onmMiddle
7. Suffix - segment LGLNAME, label onmSuffix
8. Birth Date - segment BIRTHDT, label dateVal
9. Address Line 1 - segment HOMEADDR, label stLine1
10. City - segment HOMEADDR, label city
11. State - segment HOMEADDR, label state
12. Zip Code - segment HOMEADDR, label zipCode
13. Home Phone - segment HOMEPHON, label phNumber
14. Mobile Phone - segment MOBILEPHON, label phNumber
15. Social Security Number - segment SSN, label idNumber. Special processing: ignore the segment if SSN is blank.
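To illustrate how these segment definitions drive message processing, here is a minimal Python sketch. It mirrors Table 1-19 as a plain mapping and applies the two special-processing rules; it is not the actual inbound.ini syntax or the Inbound Broker code.

# Sketch only: map the 15 positional message fields to hub segments.
SEGMENT_MAP = {
    1: ("MEMHEAD", "srcCode"),     2: ("MEMHEAD", "memIdnum"),
    3: ("SEX", "attrVal"),         4: ("LGLNAME", "onmLast"),
    5: ("LGLNAME", "onmFirst"),    6: ("LGLNAME", "onmMiddle"),
    7: ("LGLNAME", "onmSuffix"),   8: ("BIRTHDT", "dateVal"),
    9: ("HOMEADDR", "stLine1"),   10: ("HOMEADDR", "city"),
    11: ("HOMEADDR", "state"),    12: ("HOMEADDR", "zipCode"),
    13: ("HOMEPHON", "phNumber"), 14: ("MOBILEPHON", "phNumber"),
    15: ("SSN", "idNumber"),
}

def map_message(fields):
    """Apply the Table 1-19 rules to a list of 15 field values."""
    if not fields[0]:
        raise ValueError("reject: Source is blank")
    if not fields[1]:
        raise ValueError("reject: Source ID is blank")
    segments = []
    for pos, value in enumerate(fields, start=1):
        segment, label = SEGMENT_MAP[pos]
        if segment == "SSN" and not value:
            continue                      # ignore the SSN segment when blank
        segments.append((segment, label, value))
    return segments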
Unit 2. Workbench
Overview
You configure the Initiate member model (.imm) using Workbench, which is where we will build the data dictionary. Workbench is the main user and configuration management tool of the IBM Initiate Master Data Service.
Dependencies
The computer images have Workbench installed.
Topics
This unit will cover:
Workbench basic functionality
General Workbench navigation
Workbench overview
Workbench is a graphical user interface that provides user management and configuration management tools for the IBM Initiate Master Data Service. Simply put, it allows you to view and manage the configuration for Data or Relationship hubs.
Using Workbench, a hub's data model, algorithm, and thresholds can be easily adjusted to your requirements using a single toolset. Graphical analytics are available to correctly adjust the algorithms and thresholds to increase the accuracy and performance of a particular hub. These features make it much easier to tune algorithms for performance and to improve the accuracy of matching based on the analytics returned by Workbench.
One Workbench project contains all this information for ease of versioning and management by system administrators, supporting standard IT processes.
Basic functionality
Workbench projects can be created and configured without a hub instance or a data source. This is different from previous versions of configuration tools, like Identity Hub Manager, where the engine and databases were written to directly. Instead, Workbench saves the configuration and uploads it to the engine, allowing changes to be made offline.
Workbench allows users to perform many tasks that were once completed using scripts or multiple Initiate software packages. Some of the key functionality includes:
Creating, configuring, and editing member model dictionaries
Creating, configuring, and editing algorithms
Cleaning and de-duplicating data in the data extract
Bucket analysis
Threshold analysis
Entity analysis
Workbench general navigation
Workbench is separated into five different informational areas:
1 - Perspective Bar: This tab bar shows the available perspectives you can work in.
2 - Navigation Pane: This pane resembles Windows Explorer. It lists the projects and
the associated folders and files.
3 - Configuration Pane: This pane opens the various sections needed when
configuring the hub.
4 - Editor Pane: This pane is used to configure and edit project files and graphs.
5 - Information Panes: These panes have multiple tabs which can be used to set
properties, view errors in the project, view job results, and more.
Perspectives
The Workbench Editor pane has multiple perspectives for viewing multiple file types. There are perspectives for configuration, mapping, bucket analysis, and working with CloverETL graphs. In the screen capture above, callout 1 shows which perspective Workbench is currently in and where perspectives can be changed.
Exercise
Now it is time to perform Exercise 1, taking approximately 10 minutes.
Unit 3. The Initiate member model
Overview
You will now learn how to configure the Initiate member model (.imm) by building a Data
Dictionary in Workbench. You will build a Data Dictionary from scratch in the exercises, but
normally you will begin with a predefined dictionary.
Dependencies
You must understand how the Initiate member model (.imm) defines the way that the IBM
Initiate Master Data Service software stores, manages, and validates data, and you must
have Workbench installed on your computer.
Topics
This unit will cover:
Working with the Initiate Member Model file to create the data dictionary:
- Adding a New Member type
- Adding Attributes
- Adding an Entity Type
- Adding a Composite Source
- Adding Sources
- Adding an Algorithm (Name Only)
- Adding Information Sources
Creating the data dictionary
The data dictionary configuration is stored in the Initiate Member Model (*.imm) file, which Workbench places in the Navigator pane automatically when a project is created. The file is given the same name as the project, and additional information is populated based on the template option you chose during the project creation process.
In this unit, we will define the information the hub needs to link to other sources and read the necessary data attributes, create the algorithm that will match and link the individual records into entities, and provide the information the end users expect to see. The information is entered and configured on five separate tabs of the Member Types pane:
Attributes: Define the attribute name, attribute code (AttrCde), and attribute type (AttrType), and configure the functionality.
Entity Types: Define the entity name and label, and configure the functionality.
Composite Views: Define and describe the composite view, define the type of entity, and configure the functionality.
Sources: Define the source name, the source code, the source's physical location, and the attributes used from the source, and configure the functionality.
Algorithms: (Module 5) Create an algorithm to define how the hub will search, link, and match records.
We will also define an informational source for our Social Security Numbers, which is a separate pane in the .imm file.
There are multiple steps involved in creating the data dictionary. They include:
Adding a New Member type
Adding Attributes
Adding an Entity Type
Adding Sources
Adding an Algorithm (Name Only)
Adding Information Sources
But first, we need to open the *.imm file.
Creating a new Initiate default project
Before we can do anything, we need to create a project in Workbench for Boot Camp. A
project contains all the configurations that will later be deployed to the hub including the
data dictionary, the algorithm, CloverETL graphs, user and security management,
relationship rules, and files generated from running jobs.
Your project will appear in the navigation pane on the left side of Workbench. The project
and related folders and files will function like a typical folder/file structure that can be
expanded and collapsed. This allows you to manage multiple instances in Workbench,
while maintaining organization and structure.
Adding a new member type
Members are individual records from a source containing all the attributes of a person. Member types are the definition of what data is stored in the hub for each individual member, such as their attributes and entity types.
This step is much like creating a folder that will contain multiple related files. We need to give the folder an appropriate name so that it will make sense to us and others. For Boot Camp, we will name our member type Person.
Adding attributes
Attributes are the pieces of information that we know about members: name, rank, and serial number. As you are setting up your data dictionary, you need to find out what data is being collected and stored in your source systems.
For Boot Camp, we are going to use the following information, or attributes, for the members in our sources:
Name
Home Address
Gender
Social Security Number
Birth Date
Home Phone
Mobile Phone
In this step, we will create the attributes using their common titles above, then define their attribute code (attrcode) and the type of attribute. The attribute code is the shortened name that will be used by the hub to define the attribute. The attrcode is user-defined and should be as descriptive yet as brief as possible. It cannot contain spaces.
The attribute type, sometimes called a segment, is predefined and must be selected from a table. This code defines how the information will be treated in the hub.
Maximum active and existing values
We will also define the Maximum Active and Existing Values, which determine how many nuggets of information the hub will store for each attribute.
We have all had more than one phone number in our lives. Back in the day, you were given a new number every time you moved. Before cell phone numbers could be ported, you were assigned a new one when you switched carriers.
If you think really hard, you could probably remember all of them, but luckily we have databases to do that for us.
The cell phone number you currently use would be your active number. In most cases, the Maximum Active value will be set to 1, but in cases where we want to store more than one mobile phone number for an attribute, we would set that value to 2 or higher.
It would also be helpful to keep some of these past phone numbers in the record by setting the Maximum Existing value to 1 or higher. This value determines how many past numbers are kept as historical.
Adding an entity type
Members and the information associated with them make up entities. The records' attributes create a logical relationship between two or more member records, represented as records sharing an Enterprise ID.
Entity types define how the entities are viewed and linked by using specific algorithm configurations. The entity types are identity, household, group, and organization.
Identity: The identity entity type represents an individual made up of a single record or multiple records based on attribute similarity.
Household: The household entity type represents multiple individuals who are associated with a single physical location. These individuals share addresses or phone numbers, but are not necessarily related in any other way, similar to dorm roommates. Household members will share an Enterprise ID based on the physical location.
Group: The group entity type allows records to have multiple entity record numbers within a single entity type if the members match above the auto-link threshold.
Organization: The organization entity type represents multiple individuals who are associated with an entity, similar to employees of a company, and will all share a common Enterprise ID.
As you create the entity types, you have a few important configuration options to consider.
Synchronous/asynchronous entity management
Entity management defines how record comparison, data linking, and task creation occur after the data is derived. There are two settings, defined by setting the Asynchronous field to True or False:
Synchronous: When a change is made to a record, the update is immediately cross-matched.
Asynchronous: Changes are made, and the subsequent cross-match is placed in a queue based on priority.
Allow same source linking
This setting allows two records, presumably duplicates, scoring above the auto-link threshold in the same source to be linked and assigned an Enterprise ID.
During Boot Camp, we will use the Identity entity type, and your entities will be called 'id'. You will allow same source linking, and entity management will be done asynchronously.
Adding composite views
Composite views define how attribute information is displayed and represent the snapshot of a member and its attributes. In other words, a composite view controls what information Initiate applications display, and can be configured to hide certain information from end users, such as a member's Social Security Number.
Entity Most Current Attribute (EMCA) and Member Most Current Attribute (MMCA) are the default views expected by IBM Initiate Inspector and are what we will use for Boot Camp.
An EMCA composite view displays the most current attributes for any member that shares a common Enterprise ID. The EMCA combines the attributes so that the view of the entity (for example, a person) is a conglomeration of attributes from your various source systems. MMCA is the same as EMCA, only it provides a view of the member instead of the entity.
Adding sources
Sources are basically what they sound like: a system or database that sends data to the hub and receives information and updates. This pane allows you to add and configure definitional sources, which are sources where record information is created, stored, and updated.
For these definitional sources we want to assign identifying codes. The hub will refer to the source code and will assign it as a prefix on all source-assigned record numbers from that source. We will assign the letters A and B to our Boot Camp sources, so if a record from Archway was assigned a record number of 123456, the hub will use the source code to prefix the record number to read A-123456.
The physical code allows an ID to be assigned to a group of sources without assigning an individual code to each. In our Boot Camp example, we only have two sources, but let us suppose that there are three separate sources from Archway: one from the ER, one from surgery, and one from the lab.
By assigning the physical code of ARCH to the Archway sources, those three sources will be under the Archway umbrella but will retain their individual code IDs. We will know, by looking at their physical code, where each source physically resides.
We can also configure which attributes are used from each source. In general, more information is better, but when you know that an attribute from a particular source is incorrect or sparsely populated, you can turn off that attribute to avoid problems when we start matching and linking.
We can also determine whether we want the hub to review identifiers from the sources to check for duplicates in that source, and also across the other sources. Each identifier should be unique, but when two match and the other attributes do not exceed the auto-link threshold, a Review Identifier task will be triggered.
Adding informational sources
An informational source identifies the issuer of an identifier. Informational sources are used exclusively with the identity attribute types.
For driver's licenses, you would add each state so that the license numbers from one state are not confused with the license numbers from another. It is possible that both Virginia and Hawaii could issue the same driver's license number, but because the numbers come from different states, they are not really a case of two people using the same identifier. The issuer makes them unique.
Definitional sources versus informational sources
Simply put, the definitional sources contain your attributes, while the informational sources provide information from a third party used for validation or clarification of the attributes stored in the MEMIDENT table.
Again, if you are using driver's license numbers in the MEMIDENT table and end up with two identical license numbers, a correlation would be made between the two records. By using informational sources to track who issued the licenses and assigning a state code to each number, you might find that one license was issued by Georgia and the other by Minnesota, resulting in less correlation than would normally be expected.
In Boot Camp, Source A and Source B are definitional sources and are reflected by the srctype 'D' in the mpi_SrcHead table. Informational sources are stored in mpi_SrcHead as well, but with a srctype of 'I'.
Adding strings
Strings enable you to create rules or guidelines that instruct the algorithm on how to handle
certain incoming data values. To save some time in Boot Camp, we will not create the
string value files from scratch. We will copy string files to our project, and then create a new
string.
Exercise
Now it is time to perform Exercise 2, taking approximately 45 minutes.
Unit 4. Configuring the algorithm
Overview
Typically, a generic algorithm is imported along with your project configuration, but in class
we will begin with an empty algorithm. This algorithm will need to be configured to address
the attributes you are using, the comparisons that you would like to use, and the bucketing
strategy that you would like to employ. You can use Workbench to make your edits. The
tool will validate your design and present you with a list of errors if there are any
inaccuracies in your algorithm design.
Dependencies
The algorithm is the brain of the IBM Initiate Master Data Service software. Therefore, the
proper data elements must be in place before you can fully develop the algorithm. After
your first pass at the implementation, you can make some tweaks to the algorithm. After
making those changes, you will need to derive data, generate weights and/or perform a
Bulk Cross Match again.
Topics
This unit will cover:
Algorithms: The secret sauce
- Introduction to algorithm components
- Standardization Introduction
- Bucketing Introduction
- Comparing data introduction
Algorithms: The secret sauce
An algorithm is a process that analyzes member records whenever a user executes a search, a member record is updated, a bulk cross match utility is executed, or integrated weight generation runs. The end products of an algorithm are derived data (both comparison strings and bucket hashes) and a comparison score. Algorithms are tailored to a data set, specifically using the attributes in the data that will help the engine assess the likelihood that two members are a match. Therefore, the derived data and the comparison score outputs are unique to each implementation.
An algorithm works in stages:
Standardization: Data is scrubbed, formatted, and derived into a comparison string.
Bucketing: Candidates are selected for comparison and assigned a bucket hash number.
Comparison: Members are compared at the attribute-by-attribute level and scored.
Member comparison scores can be used to return search results to the user, or the system can determine, via thresholds, whether or not to link members together.
How do algorithms relate to members and entities?
An algorithm is defined for each member type. The standardization and bucketing functions are applied to the member type, but the comparison functions are applied to the entity type.
Standardization
Standardization is a process applied to attributes that cleans the values for easier comparison. It reduces the variance between data (for example, converting characters to UPPERCASE) and removes extraneous information that won't help comparison (for example, removing anonymous values, removing generic descriptors like Road and Street from addresses, and removing special characters like parentheses and hyphens).
What can standardization do?
Standardization can perform a wide variety of functions. The following list illustrates some of the most common types of standardization that are available:
Case Conversion: Converts Karen Jones to KAREN JONES
Truncation of Values: Converts (312) 832-1212 to 8321212
Anonymous Values: Removes phones with (000) 000-0000
Validating Data Length: Rejects SSNs that do not have 9 digits
Validating Data Format: Rejects values that do not meet the requirements, like e-mail addresses without an @ sign
Nickname/Equivalency Translation: Converts equivalent values like Apartment to APT or Male to M
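The sketch below imitates a few of these behaviors in Python. It is illustrative only; the anonymous-value list and the exact truncation rule are assumptions, not the engine's standardization functions.

import re

ANONYMOUS_PHONES = {"0000000000"}          # illustrative anonymous-value list

def standardize_name(raw):
    return raw.upper()                     # case conversion

def standardize_phone(raw):
    digits = re.sub(r"\D", "", raw)        # strip special characters
    if digits in ANONYMOUS_PHONES:
        return None                        # anonymous value removed
    return digits[-7:]                     # e.g. (312) 832-1212 -> 8321212

def standardize_ssn(raw):
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 9 else None   # validate data length

print(standardize_name("Karen Jones"))       # KAREN JONES
print(standardize_phone("(000) 000-0000"))   # None
print(standardize_ssn("902-83-1386"))        # 902831386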
What does standardized data look like?
When data has been standardized with the algorithm, the derived data is stored in a comparison string. The comparison string lives in the mpi_memcmpd table of the database.
Derived comparison strings have the following properties:
Attributes are delimited by carets (^)
Tokens within the attributes are delimited by colons (:)
"Or" is indicated by a tilde (~)
Strings must be less than 256 characters in length
If there are more than 256 characters, a second line is created
Below is an excerpt of what the cmpval field in the mpi_memcmpd table looks like (cmpval contains: Last Name, First Name, Middle Initial, Suffix, Social Security Number, Gender, Birth Date, Street Address, Home Phone, Cell Phone, Eye Color):
PINCHON:ARMAND:N^228073164^M^85022^N-32537^S-REPUBULIC^2621259~2022647^BROWN
HAYWOOD:DALE:Y:JR^240251125^M^85217^N-2835^S-PERRIER^1090013~1217253^BROWN
SCAGLIONE:STEPHANIE:A^275311207^F^85032^N-20085^S-TOULOUSE^2611037~2032074^BLUE
TEW:WENDY:H^242611929^F^85219^N-1769^S-MARKET^1998719~1172916^BROWN
GRIMM:JEANICE:I^617634723^F^85217^N-22960^S-CANAL^3122086~2172304^GREEN
WEIDLER:MARK:I:III^288915381^M^85007^N-1848^S-EXPOSITION^9248645~9007819^BROWN
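Given those delimiter rules, a derived comparison string can be pulled apart mechanically. Here is a small Python sketch (not engine code) that splits a cmpval row into attributes, tokens, and alternate "or" values:

def parse_cmpval(cmpval):
    attributes = []
    for attr in cmpval.split("^"):          # attributes delimited by carets
        tokens = []
        for token in attr.split(":"):       # tokens delimited by colons
            tokens.append(token.split("~")) # tildes separate "or" values
        attributes.append(tokens)
    return attributes

row = "PINCHON:ARMAND:N^228073164^M^85022^N-32537^S-REPUBULIC^2621259~2022647^BROWN"
parsed = parse_cmpval(row)
print(parsed[0])   # [['PINCHON'], ['ARMAND'], ['N']] -- name tokens
print(parsed[6])   # [['2621259', '2022647']] -- two alternate phone values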
Where are standardization functions stored?
The Standardization Functions are part of the Initiate Master Data Engine; however, you can access the list of Standardization Functions by referring to the following locations:
mpi_stdfunc - This table stores the descriptions and names of your standardization functions.
The Algorithm Editor Palette - You will find your standardization functions in the palette in the Algorithm editor. Hover your mouse over a function to read a description of the function. You can also right-click the Palette and change your layout to include detailed descriptions in the list.
Standardization functions
The following section covers the current Attribute Standardization functions. There are
several Standardization functions available that can process common attribute types.
Abstract Code
Date
Attribute
Phone
Postal Code
Name
Address
Email Address
Biometric
Geocode
Identifier
Multiple Attribute
Passthrough
Note
Most standardization functions have the ability to filter out values that have been deemed anonymous (for example, an SSN of 999-99-9999 or a Zip Code of 11111). Likewise, many functions give you the ability to substitute codes or standard values in place of predefined strings (for example, nicknames like Jimmy or Jaime both become James, and genders are converted from Male and Female to M and F).
Abstract code
The following table contains a list of the current Abstract Code Standardization functions:
Formats abstract codes using the equivalent string tables.
ABSCODE
(Accepts only 1 field.)

Example
The ABSCODE function could perform the following standardization on a Provider Name
(Name/Provider), if the proper validation tables are created in the Strings section of your
Initiate Member Model:
Mike L. Goodman, Allergist becomes ALG
Date
The following table contains a list of the current Date Standardization functions:
Table 4-1: Date standardization functions
Standard (DATE1): Formats alphanumeric date values as YYYYMMDD and removes special characters. (Accepts only 1 field.) This is the Initiate preferred date format.
Partial (DATE2): If the year, month, or day is invalid, those portions of the date are replaced by a string of 0's (zeros), and the comparison function ignores the 0 string. The function is configurable with a minimum and maximum year. When the date falls outside of those ranges, the entire date is treated as an anonymous value. After applying the year range filter, DATE2 checks the date against a configurable date standardization table.
Fixed (GRDATE): Formats a date as YYYYMMDD and treats dates earlier than 1890 and later than 2020 as anonymous values. (Accepts only 1 field.)
Age (AGE): Formats an age into a birth year by removing non-integers and treating values less than 8 or greater than 100 as anonymous. (Accepts only 1 field.)
Date example
The Date/Standard function would perform the following standardization on a Date of
Birth:
1962-10-27 becomes 19621027
Attribute
The following table contains a list of the current Attribute Standardization functions:
Table 4-2: Generic attribute standardization functions
Alphanumeric (ATTR): Keeps alphanumeric characters and removes any special characters like !@#$%^&*-(). Often used for single-field attributes like Gender and defined attributes. (Accepts only 1 field.)
Alphabetic (ATTRA): Formats alphabetic data only, removing all numbers and special characters. (Accepts only 1 field.)
Numeric (ATTRN): Formats numeric data, removing all alphabetic data and special characters. (Accepts only 1 field.)
Phone
The following table contains a list of the current Phone Number Standardization functions:
Table 4-3: Phone standardization functions
North America (PHONE1): Formats a phone number that is 7 or more characters to a 7-digit United States/Canada number by shaving off leading 1's, area codes, and special characters. (Accepts only 1 field.)
Full (PHONE2): Formats a phone number by concatenating the phone data in multiple fields, then reducing it to a 7-digit United States/Canada number. This function can work with numbers that only have the minimum 7 digits. (Accepts up to 3 fields.)
Last Digits (PHONEEND): Standardizes phone numbers regardless of country format.
International (INTPHONE1): Formats a phone number to a 10-digit international number. Analyzes country calling codes for more accuracy. (Accepts up to 3 fields.)
Australia (AUSTPH): Formats an Australian phone number to an 8-digit local number. (Accepts only 1 field.)
Phone examples
The Phone/Full function would perform the following standardization on a US Phone
Number:
1 (404) 871-1316 becomes 8711316
The Phone/International function would perform the following standardization on a UK
Phone Number:
+44 (0) 282 995 7182 becomes 2829957182
The logic for standardizing US/CN phone numbers (after special characters are removed) is outlined below:
If the leading character is a 1, then remove it.
If the resulting string length is >= 10 (assuming an area code), then skip the first 3 digits and use the next 7.
Else, if the resulting string still has a length >= 7, then use the first 7 digits.
Else, treat the phone number as if it is NULL.
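Expressed as a Python sketch (an approximation of the outline above, not the PHONE1 source code), the logic looks like this:

import re

def phone1(raw):
    digits = re.sub(r"\D", "", raw)        # drop special characters first
    if digits.startswith("1"):
        digits = digits[1:]                # shave the leading 1
    if len(digits) >= 10:                  # assume the first 3 digits are an area code
        return digits[3:10]
    if len(digits) >= 7:
        return digits[:7]
    return None                            # too short: treat as NULL

print(phone1("1 (404) 871-1316"))          # 8711316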
Postal code
The following table contains a list of the current Postal Code Standardization functions:
Table 4-4: Postal code standardization functions
Australia (AUSTPOST): Formats an Australian postal code to 4 digits. (Accepts only 1 field.)
Canada (CNZIP): Formats a Canadian zip code. (Accepts only 1 field.)
International (INTZIP): Formats an international zip code. (Accepts only 1 field.)
North America (NAZIP): Formats a US/CAN zip code. (Accepts only 1 field.)
United Kingdom (UKZIP): Formats a United Kingdom zip code. (Accepts only 1 field.)
United States (USZIP): Formats a United States zip code. (Accepts only 1 field.)
Name
The following table contains a list of the current Name Standardization functions:
Table 4-5: Name standardization functions
Person (PXNM): Formats a list of strings applying rules for person names, including First Name, Middle Name, Last Name, Prefix, Suffix, Title, and Degree. This function can also take hyphenated names and split them into multiple tokens. (Accepts up to 6 fields.)
Provider (BXNM): Formats a list of strings for health care providers (individual doctors, physician offices, medical practices, and so on) using equivalent tokens and filtering out anonymous values to create the derived data. BXNM is the only function that creates two roles: one for all of the name tokens, and a second role that denotes the specialty of the provider. Specialty is translated from titles, degrees, or words in the name. This function cannot break apart hyphenated names; it simply removes the hyphen. (Accepts up to 6 fields.)
Company (CXNM): Formats a list of strings applying rules for company names, like removing Inc. and Co. from names. This function cannot break apart hyphenated names; it simply removes the hyphen. (Accepts only 1 field.)
Company Unicode (CJKCXNM): Formats a list of strings applying rules for company names. Uses rules for Chinese, Japanese, and Korean tokenization. (Accepts only 1 field.)
Person Unicode (UCSFREQXNM): Incorporates the FORGNXNM and JAPXNM functionality.
Name examples
The Name/Person function would perform the following standardizations:
Howard K. Pingston, Jr. becomes PINGSTON:HOWARD:K:JR:.
Anne M. Fuller-Kline, DDS becomes FULLER:KLINE:ANNE:M:DDS.
The Name/Provider function would perform the following standardizations:
Geoff M. Locke, Family & Pediatric becomes
Role 1: LOCKE:FAMILY:PEDIATRIC:JEFFREY:M
Role 2: FAM:PED (Conversion from mpi_strEqui table)
Martin's Foot & Ankle Clinic, LLP becomes
Role 1: MARTINS:FOOT:ANKLE:CLINIC
Role 2: POD:ORT (Foot = Podiatry = POD and Ankle = Orthopedics = ORT in the
mpi_strEqui table)
Address
The following table contains a list of the current Address Standardization functions:
Table 4-6: Address standardization functions
Canada (CNADDR): Formats an address string, dividing it into the address subcomponents for Canada. (Accepts up to 4 fields.)
Canada - Expanded (CNADDR2): Formats an address string, dividing it into the address subcomponents for Canada with region and postal code. (Accepts up to 7 fields.)
International (INTADDR2): Formats an address string, dividing it into the address subcomponents for International with region and postal code. (Accepts up to 7 fields.)
North America (NAADDR2): Formats an address string, dividing it into the address subcomponents for North America with region and postal code. (Accepts up to 7 fields.)
United Kingdom (UKADDR2): Formats an address string, dividing it into the address subcomponents for the UK with region and postal code. (Accepts up to 7 fields.)
United States (USADDR): Formats an address string, dividing it into the address subcomponents for the United States. (Accepts up to 4 fields.)
United States - Expanded (USADDR2): Formats an address string, dividing it into the address subcomponents for the United States with region and postal code. The ADDR2 standardization functions have 3 components: street lines, region, and postal code. You can omit the region and postal code, or just the postal code, but the number of fields for each component is fixed: four street lines, two region fields, and one postal code field. For example:
stline1,stline2,stline3,stline4 - OK
stline1,stline2,city,state,zip - NOT OK
stline1,stline2,stline3,stline4,city,state - OK
stline1,stline2,stline3,stline4,zipcode - NOT OK
stline1,stline2,stline3,stline4,city,state,zipcode - OK
(Accepts up to 7 fields.)
Universal Character Set (UCSFREQADDR): Incorporates the FORGADDR and JAPADDR functionality.
Address (ADDR2) example
208 W. Chicago Ave. Apartment 3, Oak Park, IL 60302 becomes
N-208:S-W:S-CHICAGO:S-AVE:S-APT:N-3:S-OAK:S-PARK:S-IL:N-60302
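The N-/S- form above tags each token as numeric or string. The following Python sketch approximates that tokenization for this one example; the equivalency table (Apartment -> APT, and so on) is an illustrative stand-in for the engine's string tables.

EQUIV = {"APARTMENT": "APT", "AVE.": "AVE", "W.": "W"}   # illustrative equivalencies

def addr_tokens(address):
    out = []
    for word in address.replace(",", " ").split():
        token = word.upper()
        token = EQUIV.get(token, token.rstrip("."))
        # numeric tokens get an N- prefix, string tokens an S- prefix
        out.append(("N-" if token.isdigit() else "S-") + token)
    return ":".join(out)

print(addr_tokens("208 W. Chicago Ave. Apartment 3, Oak Park, IL 60302"))
# N-208:S-W:S-CHICAGO:S-AVE:S-APT:N-3:S-OAK:S-PARK:S-IL:N-60302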
Email address
The following table contains a list of the current Miscellaneous Standardization functions:
Table 4-7: Miscellaneous standardization functions
EMAIL: Formats an e-mail address by looking for an @ symbol followed by a period. The function strips off the suffix after the period. (Accepts only 1 field.)
Biometric
The following table contains a list of the current Biometric Standardization functions:
Table 4-8: Biometric standardization functions
Hair Color (HAIRCOLOR): Formats hair color as alphanumeric data (identical to ATTR). (Accepts only 1 field.)
Eye Color (EYECOLOR): Formats eye color as alphanumeric data (identical to ATTR). (Accepts only 1 field.)
Race (RACE): Formats race as alphanumeric data (identical to ATTR). (Accepts only 1 field.)
Height (HEIGHT): Formats height into inches. Input must be in the format FII, where F is feet and II is inches. Values less than 36 or greater than 90 are treated as anonymous. (Accepts only 1 field.)
Weight (WEIGHT): Formats weight by removing non-integers and treating values less than 60 and greater than 500 as anonymous. (Accepts only 1 field.)
Biometric examples
The Biometric/Hair Color function could perform the following standardizations on a hair color, if the proper validation tables are created:
BROWN becomes BR
BALD becomes BD
BLONDE becomes BL
BLACK becomes BK
Geocode
The following table contains a list of the current Geocode Standardization functions:
Table 4-9: Geocode standardization functions
GEO: Converts latitude/longitude location coordinates into a standardized format that can be consumed by the GEO comparison and bucket functions.
Identifier
The following table contains a list of the current Identifier Standardization functions:
Table 4-10: Identifier standardization functions
Alphanumeric (IDENT1): Keeps alphanumeric characters in an identifier and removes any special characters like !@#$%^&*-(). This is often used for Medical Record Numbers (MRNs). (Accepts only 1 field.)
Alphabetic (IDENT1A): Formats alphabetic identifiers only, removing all numbers and special characters. (Accepts only 1 field.)
Numeric (IDENT1N): Formats numeric identifiers, removing all alphabetic data and special characters. This function works well with Social Security Numbers. (Accepts only 1 field.)
Identifier examples
The Identifier/Numeric function would perform the following standardization on a Social
Security Number:
902-83-1386 becomes 902831386
Important
Data must be stored in the MEMIDENT table, and the appropriate informational sources must be configured for these functions to work.
Multiple attribute
The following table contains a list of the current Multiple Attribute Standardization functions:
Table 4-11: Multiple attribute standardization functions
MULTIDIM: Standardizes attributes that have multiple fields. It is intended for use with the DR1D[234][ABC] comparison functions.
Passthrough
The following table contains a list of the current Passthrough Standardization functions:
Table 4-12: Passthrough standardization functions
PASSTHRU: This standardization function does not change the input and simply passes it through the Master Data Engine processes. Please note that use of this function is discouraged, as it was originally designed as a temporary workaround and might not always perform as described. For example, if you attempt to standardize a value with certain special characters (for example, a colon, caret, or period), you risk a negative impact on comparisons with the given member.
Working with buckets
Imagine a young girl walking along a beach with a number of pails collecting things she
finds along the way. One pail is for shells. Another is for sea glass. A third for rocks, and so
on.
The concept of buckets is exactly the same. The buckets collect items that look similar that
make it easier to find and compare later on.
What is bucketing?
Bucketing is used to select candidates for comparison. If you wanted to perform a search
without bucketing, you would be comparing against every member in the database. This
would make searching time consuming and reduce the overall performance of the system.
By employing buckets, we can select a smaller list of candidates to compare against (for
example, less than 2000) and still be confident that we are going to find the right member
records to match with.
Organizing member records into buckets
Buckets use a process called Hashing in which the attributes (and combinations of
attributes) in member records are converted to a hash number that acts as a fingerprint of
that value. Hash values allow the system to convert large strings into small numbers.
Through this conversion, hash numbers reduce the retrieval time needed to pull candidates
for selection.
You can think of the bucketing process as a set of mail slots. Just as you would organize
the mail into the respective cubby holes for each person, the bucketing process records
which hash numbers (or buckets) a particular member record would belong to.
In the diagram below, notice that Paula Montauk's member record (167289) is being bucketed by First Name, First Name + Last Name, Address, and Last Name + Zip Code. The Bucketing Functions define how her record will be assigned hashes.
Note
Instead of the buckets being listed by name, they are assigned a hash number, which allows the Master Data Engine to retrieve members faster during comparison.
How do buckets impact searching?
Buckets set some ground rules about how you can search. Users can only search on attributes that are referenced in a bucket design. If you do not bucket the information, you cannot use it to search; however, you can still use comparison scores from non-bucketed attributes to help feed the overall comparison score. When more attributes have comparison scores, your results are more accurate.
How do you locate the values associated with a hash?
When looking at hash values, you will not be able to decode the hash to understand what the original value was for the bucket. This is where the Bucket Analysis tool comes in handy.
Reviewing bucket hash values and the underlying real values
After running Prepare Bucket Analysis to generate the output of a member's buckets, you
should examine the bucket hashes and bucket roles to see what the data means.
Here is how the following record was Bucketed:
First Name: Alec
Last Name: Speegle
SSN: 780547291
DOB: 19820502
Zip: 85038
Phone: 2716291
Table 4-13: Bucket hashes and values
Hash 4540265678601088809, value 012457789: SSN of 780547291 appears in sorted order as 012457789.
Hash -8809730092881002640, value 1122679: Phone of 2716291 appears in sorted order as 1122679.
Hash 6085846524090996483, value 198205+ALK: DOB of May 2, 1982 is added to a phonetic form of Alec.
Hash 6394223249508127562, value 198205+ALKSNT: DOB is added to a phonetic form of Alexander (the formal name for Alec).
Hash 6312547120776979299, value 198205+SPKL: DOB is added to a phonetic form of Speegle.
Hash 5256576921179425119, value ALK+SPKL: Phonetic form of Alec is added to a phonetic form of Speegle.
Hash 6590837625867207706, value ALKSNT+SPKL: Phonetic form of Alexander (the formal name for Alec) is added to a phonetic form of Speegle.
Hash 8128495530969973259, value 85038+SPKL: A sorted Zip Code (85038 > 03588) is added to a phonetic form of Speegle.
Designing multi-token buckets
Bucketing strategies become more dynamic when you create multi-token buckets. This
simply involves building buckets from attributes with more than one field, like names or
addresses. When you combine multiple attributes into one bucket, you will need to decide
how many tokens from each attribute will make up the bucket definition. Based on your
settings, the IBM Initiate Master Data Service will create the appropriate buckets that fit
your set of data.

How are buckets formed?


When you bucket on data that has multiple tokens, you can control how many of the tokens
are used when creating the buckets. For example, if an address has the four following
tokens: Street Line 1, City, State, and Zip Code, then there are four possible single-token
buckets.

If you require two tokens per bucket, what combinations of buckets would be created? You
only need to list unique combinations of values once (that is, you do not need both
State+Zip and Zip+State; one combination is sufficient):

If you require three tokens per bucket, what combinations of buckets would be created?
What about four tokens?
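As a sketch of the enumeration (assuming the four address tokens named above; order does not matter, so State+Zip and Zip+State count once), Python's itertools can list the combinations:

from itertools import combinations

tokens = ["StreetLine1", "City", "State", "Zip"]

for size in (2, 3, 4):
    combos = ["+".join(c) for c in combinations(tokens, size)]
    print(size, "tokens per bucket ->", len(combos), "buckets:", combos)

# 2 tokens -> 6 buckets, 3 tokens -> 4 buckets, 4 tokens -> 1 bucket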


Where are bucketing functions stored?


The Bucketing Functions are part of the IBM Initiate Master Data Service engine, but you
can access the list of Bucketing Functions by referring to the following locations:
mpi_bktfunc - This table stores the descriptions and names of your bucket functions.
mpi_genfunc - Defines the function name for bucketing functions. This enables the
functions to be externalized in the library.
The Algorithm Editor - You will find your bucketing functions in the palette in the
Algorithm Editor. Hover your mouse over a function to read a description of the function.
You will find the bucket generation type settings on the Properties tab after you have
highlighted the bucketing function in the editor.


Bucketing functions


Bucketing functions enable flexibility for bucketing routines and assist in candidate
selection by identifying groups of shared information, leading to more accurate
comparisons.
The creation of bucket values is a two-step process. First, a bucket function is applied to
the compare value that was generated by the standardization function. The second step is
generating the bucket's value. A bucket generation type takes the output of the bucket
function and then modifies it to produce the bucket values. The bucket generation type
formats the "name" of the bucket. "Name" appears in quotation marks because the
bucket's name will really be converted to a hash number, but that hash does represent a
real value. We will see how bucket generation types work later in this module.

Available bucket functions


The following table contains a list of the current bucket functions:
Table 4-14: Available bucket functions
Attribute - (ATTR) can be used with most attributes, although the other bucketing
functions are designed to work with specific output from standardization functions. This
generic bucketing function generates a single value.
Date - (DATE) can be used when the standardization function is DATE or GRDATE. You
can use the ATTR function with date attributes to apply a wider variety of Bucket
Generation Types. This function generates a single date value.
Name - (PXNM) is normally used when the standardization function is PXNM. This
function generates a list of personal name tokens. (CXNM) is normally used when the
standardization function is CXNM. Generates a list of business name tokens. (BXNM) is
normally used when the standardization function is BXNM. Generates a list of business
name tokens.
Address - (ADDR2) is normally used when the standardization function is ADDR2.
Generates a list of address tokens.


What is the role of bucketing generation types?


Bucketing Generation Types control how buckets are formed by using a standard set of
conversion methods. What this means is that you can leverage the power of common
standardization functions and comparison functions to make your buckets more accepting
of data discrepancies.
For example, if you created a bucket on phone number that uses a Sorted generation type,
the digits of each phone number would be sorted before the bucket value is generated, as
described below.

Why use bucketing generation types?


In the example above, the phone number 4452983 would be converted to 2344589, but
inside the bucket the data remains in its standardized format. Therefore, if a second
phone number of 8349425 was added to your hub, it would also be bucketed under
2344589, but upon comparison the two numbers would not be a good match.
The real purpose of using a Sorted generation type is to take two values with a simple
transposition and make sure that a typo does not exclude data that might be a good match
(for example, 4452983 and 4452938 could just be an accidental typo).
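A minimal sketch of the SORTED generation type's effect (illustrative only; the engine applies this internally): the characters of each token are sorted, so simple transpositions land in the same bucket.

def sorted_value(token):
    # Sort the characters of a token, as the SORTED generation type does.
    return "".join(sorted(token))

print(sorted_value("4452983"))  # -> 2344589
print(sorted_value("4452938"))  # -> 2344589 (same bucket despite the transposition)
print(sorted_value("8349425"))  # -> 2344589 (same bucket, but a poor match on comparison)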


What bucket generation types are available?


Since the bucket generation type determines what the name of the bucket is, there are
several ways that you can format the name of a bucket. For example, you might want to
bucket using nicknaming, where all nicknames are rolled into a bucket for the proper name
(for example, Shellie, Michele, and Shelly all get assigned to the MICHELLE bucket).
Similarly, we could use the phonetic equivalent of the value; the MICHELLE bucket would
then really become the MXL bucket and would include all names that phonetically become
MXL (Michelle, Michael, Mike, Shelly, Mikhail, Mickey, and so on would all get converted to
the MXL phonetic marker). The following table displays the list of bucket generation types
that are available.
Table 4-15: Bucket generation types
The following types can be used with the Address, Attribute, Name/Provider,
Name/Company, or Name/Person functions:
As Is (ASIS) - Leaves each token as is. The ASIS generation type can be used with any
attribute or bucketing function. For example, a member with the zip code 77392 is put into
a 77392 bucket.
Phonetic Metaphone (METAPHONE) - Used to index words by sound.
First Three Characters (PREFIXMAP) - Uses a truncation mapping such that for each
token, the truncation function outputs the first three characters. If the token is less than
three characters, the entire token is output.
Initiate (IDENTAPHONE) - Uses the Initiate proprietary phonetic algorithm.
Arabic Name (ARABPHONE) - Used to apply phonetics to the English translation of
Arabic names (does not work on Arabic character sets).
French Name (FRPHONE) - Applies a proprietary transformation similar to
IDENTAPHONE, but is suited to predominantly French names.
Equivalence (EQUI) - Applies a string equivalency (a nickname, code, or abbreviation)
from a lookup table to each token. Jim becomes James and Foot becomes POD.
Equivalence & Phonetic (EQMETA) - Applies a Metaphone translation function to the
EQUI lookup for each token.
Sorted (SORTED) - Sorts the contents of each token. 3014201324 becomes 0011223344
and 52431 becomes 12345.
Numeric Range (NRANGE) - Applies a numeric range transformation to each token. This
is commonly used with height, weight, and other numerical data. Weights from 150-159
pounds are in the 150 bucket.
String Range (SRANGE) - Applies a string range transformation to each token. This can
be used with full dates and years if they are bucketed together. 1983 and 19830123 are
not close numerically, but are alphabetically.
The following types are for use with the Date function:
YYYYMMDD (DTY4SMD) - Applies canonical date (YYYYMMDD) transformations to each
date. Basically, each distinct date gets a bucket in the YYYYMMDD format (for example,
19280912, 19280913, 19280914, and so on).
YYYYMM (DTY4MM) - Uses only the YYYYMM portion of the date token to group dates
by month and year. 19280912 and 19280930 become 192809.
MMDD (DTMMDD) - Uses only the MMDD portion of the date token to group dates by
month and day, but ignores the year. 19280912 becomes 0912.


Comparing data
A comparison function allows the algorithm to look at two sets of data and determine how
similar they are to one another. The IBM Initiate Master Data Service platform has several
methods for comparing data, from a simple exact match comparison to a three-dimensional
edit distance comparison.
In the case of an exact match comparison, you simply want to know if x = y. So, if Phillip =
Phillip then you have a positive score, but since Phillip <> Phillips then you will receive a
negative score. That is where an edit distance can help.
In the case of a three-dimensional edit distance comparison, you have three values, like
zip + address + phone, and you compare x(zip + address + phone) to y(zip + address +
phone). The edit distance is the number of edits that would need to be made in order for
the two strings to match exactly.
So, in the case of Phillip versus Phillips (which is a one-dimensional edit distance), there is
an edit distance of 1, which would still offer a positive score because they are close to
being the same. For a three-dimensional comparison, let's look at these two strings:
89814+1285_W_Main+2239871 versus 89814+1285_Main+2239817
If we break them down into single elements, we see that 89814 = 89814, an edit distance
of 0, which is the highest score possible. 1285_W_Main and 1285_Main have an edit
distance of 2 because you need to add or remove the two characters 'W_' to make them
match. 2239871 and 2239817 are almost the same, but there is a transposition of the last
two numbers, 71 versus 17. This is an edit distance of 1, which is still a high score.
Comparing data determines the structure of the memcompd table. Weight tables, like the 3
Dim (mpi_wgt3dim) weight table, can take into account the comparison function's
assessment and provide a score. In the case of an exact match on zip code, a distance of
2 on address, and a distance of 1 on phone, the weight table can provide the correct weight
value to assign to the comparison. All together, these two strings would score positively in a
comparison.
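The arithmetic above can be reproduced with a Damerau-Levenshtein style edit distance, in which insertions, deletions, replacements, and adjacent transpositions each cost 1. The following is a minimal sketch for illustration; the Master Data Service's own comparison functions are proprietary and more elaborate.

def edit_distance(a, b):
    # Damerau-Levenshtein distance with adjacent transpositions.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # replacement
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("89814", "89814"))            # 0 (exact)
print(edit_distance("1285_W_Main", "1285_Main"))  # 2 (remove 'W_')
print(edit_distance("2239871", "2239817"))        # 1 (transpose '71')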

What is the best way to compare data?


There is no one best way to compare data. That is why Initiate's data scientists have
created an array of comparison functions which allow you to compare your data in the most
appropriate way, based on the type of data that you are comparing.


Comparison functions


The following list breaks down the different categories of comparison functions:
String Pair
Address & Phone
Name
Date
Edit Distance
Equivalency
False Positive Filter
Geocode
Height & Weight
US Zipcode

String pair
This function is used to handle any kind of specialty (abstract) codes used by providers.
Table 4-16: String pair comparisons
ATTR2S - Can be used to compare any combination of two string-valued attributes. This
function replaces the previously used EXH function, but can extend beyond just eye and
hair color comparison.


Address & phone


Address-based comparisons are especially vital to the householding process. But even if
you are not using a household entity, being able to draw correlations between address
components can help you match members that have a history of moving frequently or
might have entered their address in different formats.
Table 4-17: Address & phone comparisons
AXP - This comparison function allows you to compare addresses and phone numbers
and apply more emphasis on specific parts. Instead of relying on a 3 Dimensional weight
table like a Zip-Address-Phone comparison, the AXP comparison function takes the output
of USADDR2, CNADDR2, UKADDR2, and INTADDR2 and performs edit distance,
phonetic, and frequency-based analysis on the data in an address, and a quick edit
distance on the phone number. Essentially, each value in the address is tokenized into
strings or numbers. Then each string is compared with other strings and each number is
compared with other numbers. Therefore, if your address was 46064 N. Pendleton
Avenue, Pendleton, IN 46064, and you had two member records that were being
compared, the house number and the zip codes would cross compare between the two
member records. Likewise, Pendleton Avenue and the city of Pendleton would cross
compare between the two members in a comparison.


Name
Name, or XNM, comparisons use a combination of techniques to compare the data. For
example, name comparisons use phonetics, nicknames, substitutions, and can even pull
in edit distance processes. Depending upon the type of data that you are working with, you
would use one of the following Name (XNM) comparisons.
Table 4-18: Name (XNM) comparisons
Provider - (BXNM) Value-based provider name comparison; 1 role/1 dimension. Used to
compare business names. Uses exact match, nicknames, and phonetics when comparing
tokens.
Company - (CXNM) Value-based company/organization name comparison; 1 role/1
dimension. Used to compare business names. Final weight is based on total similarity, not
total token weights.
Company (context sensitive) - (CXNM_CS) Value-based company/organization name
comparison; 1 role/1 dimension. The comparison function enables business name token
weights to reflect locality. For example, within the locale of the Phoenix metropolitan area,
the token Phoenix will have a high frequency and a low weight.
Example:
Phoenix Flower Shop, 7611 E. Phoenix Road, Phoenix, AZ
Phoenix Pizza, 7611 E. Phoenix Road, Phoenix, AZ
In the above example, the comprole2 setting would enable a start location after Phoenix.
In this case, the Master Data Service could start with Flower and Pizza when comparing
the name of the company.
Person - (PXNM) Value-based person name comparison; 1 role/1 dimension. Used to
compare person names. Uses exact match, nicknames, and phonetics when comparing
tokens.
Person (comprehensive) - (QXNM) Value-based person name comparison; 1 role/1
dimension. Used to compare person names on four dimensions (Q is for Quad). Uses
exact match, nicknames, phonetics, and edit distance when comparing tokens. The PARM
weight table contains a limit on the total weight for the name match.


Date
There are several elements that can play into date comparisons, namely exact match and
edit distance. The comparison functions for dates use a large penalty for year
disagreement, but also use edit distance to look for transpositions and typos.
Table 4-19: Date comparisons
Date - (DATE) Value-based date comparison; 1 role/1 dimension. Used to compare dates.
If the result is an exact match, weight is by birth year; otherwise weight is by edit distance.
Date or Age - (DATE2) Value-based date comparison; 1 role/1 dimension. This function is
intended for single date comparison, such as birthdays or any event that occurs on a
specific day.
Date or Year - (DOBA) Value-based date/age comparison; 1 role/1 dimension. Used to
compare dates and/or birth years. If both values are dates, logic is identical to DATE.
Otherwise, the comparison is the difference in birth years.


Edit distance


Edit Distance is a measure of the similarity of two strings. Edit Distance functions work to
answer the question: how many insertions, deletions, or transpositions of characters would
it take to make two strings match? To achieve those ends, there are a few things we need
to understand about the Edit Distance functions:
What do the edit distance numbers mean?
How is edit distance calculated?
How do you know which edit distance function to use?

What do the numbers mean?


When talking about edit distance there are several important numbers to note. The
following table explains some of the most common edit distance numbers.
Table 4-20: Edit distance
0 - Both strings match exactly (for example, Michael versus Michael).
1 - There is one edit (insertion, deletion, or transposition) that must be made to make the
strings match (for example, Michael versus Michel, Michaell, or Micahel).
2 - There are two edits that must be made to make the strings match (for example,
Michael versus Micae, Michwale, Nickael, Mitchel).
3 - There are three edits that must be made to make the strings match (for example,
Michael versus Mihcelan, Imichela, Meckeel).
>7 - All 7 characters in 'Michael' would need to change in order to make the strings match.
You could be more than 7 different, but at that point the difference becomes negligible (for
example, Michael versus Persephone).
>12 - Typically, when you get beyond an edit distance of 10, you know you have a
mismatch. You will commonly see plateaus in your weights at this point because you
cannot be much more different.

The drawbacks of edit distance


Edit Distance comparisons are a great way of calculating the sameness of two or more
attributes. However, when employing Edit Distance comparisons, you are asking your hub
to process data in a much more analytical way than many of the other comparison
functions do. The added computational power needed to calculate edit distances should
limit your widespread use of them. Simply put, do not use an Edit Distance calculation for
too many attributes in your algorithm if you have performance concerns.


How do you calculate distance?


There are several nuances to counting edit distance, but there are four main markers that
you can use to get a rough number (see the sketch after this list):
Insertions - If there is a missing character that should be inserted, add 1 (for example,
Laurie versus Lauri).
Deletions - If there is an extra character that should be removed, add 1 (for example,
5276227 versus 52776227).
Transpositions - If there are two adjacent characters that should be switched, add 1
(for example, Feinman versus Fienman).
Replacements - If a character should be replaced by another character, add 1 (for
example, Jimmy versus Timmy).
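Reusing the edit_distance sketch from the Comparing data section earlier in this unit, each marker above can be checked directly (each pair differs by exactly one edit):

print(edit_distance("Laurie", "Lauri"))      # 1 (insertion)
print(edit_distance("5276227", "52776227"))  # 1 (deletion)
print(edit_distance("Feinman", "Fienman"))   # 1 (transposition)
print(edit_distance("Jimmy", "Timmy"))       # 1 (replacement)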

What is the difference?


Knowing what you have just learned, take some time to try to figure out the edit distances
between the following strings:
Table 4-21: What is the edit difference?
String 1        | String 2        | Distance
2171_W_KLINE    | 1217_N_KLINE    |
MARY            | GARY            |
FELICIA         | FELIX           |
818121463       | 818211436       |
NGUYEN_9617852  | NUGYEN_712814   |
ROBERT          | NORBERT         |
839769198362    | 177172824194    |
KLEINFELTER     | LINGENFELTER    |


Choosing an edit distance function


There are several Edit Distance functions to choose from. The best way to learn what they
each do is to understand how they are named.
Edit Distance function names are formatted as: DR<r>D<d><t>
DR<r> - <r> represents the number of Roles (attributes that have been standardized) that
are being compared. This number should be between 1 and 4. For example, if you are
only using Phone Number, then there is only one role. If you are using Zip
Code+Address+Phone, then you have 3 roles.
D<d> - <d> represents the number of Dimensions (tokens/fields that make up an
attribute) that are being compared in each role. In the majority of situations, this value
should be set to 1. This becomes more relevant when we use Arabic names in
comparisons, because name tokens in Arabic have multiple parts beyond the typical
fields encountered in Westernized names.
<t> - <t> represents the type of comparison that you want to perform. There are three
possible choices: A, B, and C.
A - Simple Edit Distance, where a match or a non-match is discovered.
B - Quick Edit Distance, which returns an integer that represents the edit distance. This
number is usually a little higher than the true edit distance, but it can be processed
quickly. This is recommended for long strings.
C - Real Edit Distance, which returns the true edit distance. This calculation is more
resource intensive than the other types, but is the most accurate. This is recommended
for shorter strings or situations where the comparison is more vital.
SRC - The addition of SRC at the beginning (or appended to the end) of the Edit
Distance function name indicates that the source code should be added to the string
before it is compared (for example, SRCDR1D1B or DR1D1B_SRC, as both formats are
in the Palette). Many systems use the same conventions for assigning account
numbers and primary identifiers (specifically the use of auto-numbers). Therefore, the
likelihood that two systems would have the same identifier in common can be great. For
example, the customer assigned to number 4231 in the CRM is not necessarily the
same student assigned to 4231 in the Learning Management System (LMS). So, by
appending source codes to the identifiers, like CRM4231 and LMS4231, we create
more distance between these two member records. The extra distance helps keep the
score lower since these members are most likely not the same person.


Examples of edit distance


There are several Edit Distance comparison functions that you can choose from. The
following examples can help explain the situations where you might use one over the other.
DR1D1A - One Role, One Dimension, and uses a simple "yes, they match" or "no, they
do not match" comparison. This is a good match for use on an attribute like E-mail
Address.
DR1D4B - This is the recommended role for working with Arabic names.
DR2D1C - Two Roles, One Dimension, and uses the Real Edit Distance. This is a good
match for combinations like Last_Name + Zip Code where the strings are usually fairly
short.
SRCDR1D1B - One Role, One Dimension, and uses a Quick Edit Distance. The real
power is that the Source Code is appended to the value before the comparison is made,
so if you had similar medical record numbers in two sources, the addition of the Source
Code prefix would throw their edit distances further apart.

Equivalency
Equivalency comparisons are simply looking for exact matches. Therefore, they can be
excellent comparisons for demographics like Gender, Eye Color, and Birth Year.
Table 4-22: Equivalency comparisons
Alpha/Numeric - (EQVD) Value-based simple string comparison; 1 role/1 dimension.
Simple true/false comparison of string values that returns MATCH or NO_MATCH. Uses
the weight specified in configuration from the sval weight table. Gender and Eye Color are
often compared with EQVD.
Numeric - (EQVN) Value-based numeric comparison; 1 role/1 dimension. Used to
compare two integers. Result is match or non-match. Exact match values are looked up
by value. Uses the weight specified in the nval weight table.


False positive filter


Simply put, a false positive is a pair of linked records that should not have been linked.
There are several common scenarios where members that should not be linked end up
linked because their data is so similar. For example, twins who have similar names are
often false positives because they have the same address, same phone, same parents,
same date of birth, and their SSNs are usually sequential. Father and son (Sr. and Jr.)
combinations can also create false positives, especially if one of the records is missing the
SSN or DOB.
Finally, many babies are listed under the mother's name or an anonymous name of
BabyBoy or BabyGirl if a name has not been officially assigned to the baby. Anonymous
values are treated as zeros in comparisons, so they do not count negatively - only neutrally.
Table 4-23: False positive filters
Expanded=false - (FPF) False positive filter is a rules-based comparison. Using all of
these parameters, the algorithm attempts to determine if a false positive has occurred by
looking at XNM, DOB, and SEX. For example, if the birth date or sex does not match, this
returns an indicator of false positive. FPF does not look at the gender or birth date
comparisons unless there is a partial name match. The FPF weights are generally hand
edited in the mpi_wgt3dim table to create stronger penalties for disagreement.
Note: FPF is rarely used in implementations, because FPF2 is a better gauge of false
positives.
Expanded=true - (FPF2) This function uses the comparison codes of XNM, SEX, DOB,
and SSN. It looks at the results of the name comparison, gender comparison, birth date
comparison, SSN comparison, and the difference in birth year. The algorithm attempts to
determine if a false positive has occurred. The FPF2 weights are generally hand edited in
the mpi_wgt4dim table to create stronger penalties for disagreement.


Geocode
Input for the GEO comparison function must come from an attribute standardized by the
GEO standardization function.
The GEO comparison operates by calculating the great-circle distance between the two
locations being compared. The distance is calculated in meters. This distance is converted
into a similarity measure by taking the base-10 logarithm according to the formula:
Similarity = 2 * log10(distance / resolution)
For distances less than the resolution, the similarity is set to 0. The default resolution is 1
meter. The resolution can be changed by providing an argument to the comparison
function. The argument specifies the resolution to be used (in meters).
The GEO comparison function uses a single 1Dim weight table. The weights are indexed
by the similarity measure just described. The similarity is offset by 1 so that index 0 can be
used for missing data.
Table 4-24: Geocode comparisons
GEO - The GEO comparison function compares locations by calculating the distance
between them. Locations that are closer together are considered more similar than
locations farther apart.
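A minimal sketch of the similarity calculation described above, assuming a haversine great-circle distance (the engine's exact distance implementation is not documented here):

import math

def great_circle_m(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in meters.
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geo_similarity(distance_m, resolution_m=1.0):
    # Similarity = 2 * log10(distance / resolution); 0 below the resolution.
    if distance_m < resolution_m:
        return 0.0
    return 2 * math.log10(distance_m / resolution_m)

d = great_circle_m(41.88, -87.63, 41.89, -87.62)  # two nearby points
print(round(d), round(geo_similarity(d), 2))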

Height & weight


On their own, many biometric descriptors are simply too broad and match the description of
too many members in any given population. But, by combining two biometric
demographics, we can get a better comparison of two members.
Table 4-25: Biometric comparisons
HXW - Height and weight comparison. There are two weight tables associated with this
comparison: an sval table used for exact matches and a 2-dim table used for
disagreements.

US zipcode
Address-based comparisons are especially vital to the householding process. But even if
you are not using a household entity, being able to draw correlations between address
components can help you match members that have a history of moving frequently or
might have entered their address in different formats.
Table 4-26: Address comparisons
USZIP - Value-based United States Zip code comparison; 1 role/1 dimension. Returns
exact if the first five digits match and partial if the first three digits match; otherwise a
NO_MATCH is returned.
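The USZIP decision logic reduces to a few string comparisons; a minimal sketch follows (the function name is illustrative, not the engine's API):

def uszip_compare(zip1, zip2):
    # Mirrors the USZIP rule described above.
    if zip1[:5] == zip2[:5]:
        return "EXACT"
    if zip1[:3] == zip2[:3]:
        return "PARTIAL"
    return "NO_MATCH"

print(uszip_compare("85038", "85038"))  # EXACT
print(uszip_compare("85038", "85032"))  # PARTIAL (first three digits match)
print(uszip_compare("85038", "60601"))  # NO_MATCH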


New comparison functions


You will find that the vast majority of data types can be compared using our standard set of
comparison functions. If, by chance, you encounter a need for a new comparison function,
you can enter a ticket with the Initiate Customer Support Group (CSG) and they will help
you work with our Data Scientists and R&D to develop a comparison function that meets
your needs.

Tuning search
In the past, an implementation requiring a low false-positive rate and employing one
algorithm that directed the search, match, and link process was not always efficient
because of the false-positive filter (FPF) configured within the single algorithm. The tuning
functionality enables you to configure a single algorithm while specifying which
comparison functions (cmpfuncs) are used during a search operation versus a match
operation (memsearch versus memput). By specifying a comparison mode (cmpmode),
algorithms can be configured to:
Search only,
Match and Link only, or
Search, Match, and Link in one call
If your implementation requires broader search criteria and tighter matching criteria, your
algorithm should be configured to have a Search only comparison and a Match and Link
comparison. This replaces the previous concept of creating two algorithms. Search,
Match and Link is the default setting, and if all comparison functions are set to this, the
previous behavior is retained.
Using the example of an implementation requiring low false-positive rates, you would
configure a single algorithm and flag the FPF comparison function role as Match and Link,
and flag the remaining comparison functions as Search, Match and Link. When a new
member is added to the hub under the ID entity, the algorithm uses all comparison roles
(including the FPF). When a search request is executed against the ID entity, the algorithm
does not apply the FPF to the returned search results. However, if the API requested that
the data be returned as entities, those entities would be those formed using the FPF.


You might also want to use attributes in searching that you do not want to contribute to the
matching/linking score. As an example, using the next-of-kin value in the matching
algorithm might contribute to family false-positives during the matching. However, you
might still want to use the fuzzy name matching to search on next-of-kin. For this, you
would still configure a single algorithm, but would flag the QXNM cmpspec applied to the
next-of-kin name field as Search Only while the rest would be flagged Search, Match and
Link.

Note

If you currently use two algorithms to address this scenario, Initiate recommends that you
remove the algorithms and utilize the new tuning capability as described. This will simplify
your deployment.

Query roles
Query roles are seldom used. They bypass the algorithm and allow you to perform a strict,
deterministic search on an attribute and are usually associated with task searches.


Reading the flow of an algorithm


It can be challenging to read an existing algorithm, but each algorithm uses the same basic
flow pattern. When you read an algorithm, there are two main paths that you should be
aware of:
Bucketing Path - The bucketing path always starts with a Standardization Function,
then moves to a Comparison Role, then a Bucketing Function, and finally a Bucketing
Group (even if there is only one attribute or token).
Comparison Path - The comparison path starts with the same Standardization
Function and Comparison Role as the Bucketing Path, but then terminates at the
Comparison Function (which determines the method by which to compare the
attributes).
In the diagram, the Bucketing Path is drawn as a straight line and the Comparison Path as
a curved line.

Note

The paths above show how a Social Security Number can be processed in the algorithm.
The first three elements include the SSN attribute, a Standardization Function that only
keeps numbers (removes special characters and letters), and simply moves the data into
Comparison Role 2 (the second one in the algorithm; the number has no significance).
Those three elements correspond to the memcompd table.
From there, the SSN is bucketed by sorting the numbers, and then a Bucket Group is
created to hold the tokens. These two elements correspond to the membktd table.
The Comparison Path shares the same first three elements, then rolls into a 1-Role,
1-Dimension Quick Edit Distance. These elements correspond to the weight table.


Configuring an algorithm
Algorithms are created in the Algorithm Editor. After the data dictionary has been created,
Attributes and Entity Types especially, the Palette will be fully populated with the
components of an algorithm.
Attributes - Analyze the data from any member attribute in your dictionary. You do not
have to include all attributes in your algorithm, as some demographics are for display
purposes only. You can only add an attribute to the Algorithm Editor once.
Standardization Functions - Transform the attribute values into more consistent sets
of data. You can remove specified characters, trim the length of the field, and even
break apart one value into many smaller values. You can standardize a single attribute
more than once.
Comparison Roles - Define how a comparison function is used in the algorithm.
Query Roles - Define how a query function is used in the algorithm.
Comparison Functions - Analyze two sets of data and determine how similar they are
to each other. You can choose a number of comparisons to run against the data.
Bucketing Functions - Identify bucketing data, which identifies groups of shared
information. You can use buckets for names, addresses, identifiers, and so on.
Bucketing Groups - Group the tokens produced by bucketing functions into the
buckets used for candidate selection.

Exercise
Now it is time to perform Exercise 3, taking approximately 60 minutes.


Unit 5. Cleaning the data extract

Overview
The data extract is a sampling of the data. You will test this data for basic adherence to the
Data Extract Guide's specifications and run CloverETL graphs against it to ensure proper
data format.

Dependencies
The Data Extract Guide outlines the data requirements. You will also need access to
Workbench and the CloverETL tool to perform the data cleansing.

Topics
This unit will cover:
Data Extracts
- Reviewing the Data Extract Guide and Extracts
Clover ETL


Data extracts
The data extract is the heart of every Initiate software implementation. Before Initiate can
begin to solve a problem, we must first get a data extract so that we can verify the quality of
the data, load the data into the software, configure an algorithm to manage the data, and
perform an initial data load to see how the records compare, match, and link. To ensure
that the extract adheres to Initiate standards, a Data Extract Guide is prepared for each
project; it highlights the necessary file format, data elements, and protocols for delivering
the file to Initiate. Much of the guide is standard issue, but it can be customized (for
example, each algorithm requires data of a slightly different type, so we ask for the
appropriate data to fulfill the project goal). The extract should be 10% of the total number of
records or approximately 1,000,000 records, whichever is larger.

Data extract formats


Data Extract Files can be processed in many formats. The most popular formats are:
Pipe (|) Delimited Text Files
Comma Separated Values (CSV)
Tab Delimited Text Files
Fixed-length Text Files
Initiate typically recommends using a Pipe Delimited Text File. Pipes are rarely used in
common data and, therefore, can help avoid the common pitfalls of CSV, namely that many
fields might contain commas that would affect field counts.

Be prepared!
Data collection methods vary. Some projects have separate extract files for each source,
and many times there is not one clear extract file. Common scenarios include:
Source-specific attributes keep files from being merged
Demographic data, address information, and alias name information can be stored in
separate tables
Duplicate record identifiers might be in the data on purpose to retain historical data (for
example, a previous address or phone number)


Reviewing extracts


Upon receiving a new extract file, your first action should be to confirm that the extract
conforms to the layout and formatting defined in the project's Data Extract Guide document.
Since we usually deal with very large extracts, you might need to use a more advanced text
editor such as TextPad to view and/or edit the files.

Sampling a large extract


If an extract is too large to open with something like TextPad, then you should sample the
extract. An easy way to do this is by using the head or tail commands (included with the
MKS Toolkit or available on Unix) to take the first and last 10,000 or so rows.
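Where head and tail are not available, the same sampling can be sketched in Python (a hypothetical helper; for files shorter than 2 x n lines the head and tail portions may overlap):

from collections import deque

def sample_extract(path, out_path, n=10000):
    # Write the first and last n lines of a large extract to a sample file
    # without loading the whole file into memory.
    head, tail = [], deque(maxlen=n)
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for i, line in enumerate(f):
            if i < n:
                head.append(line)
            tail.append(line)
    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(head)
        out.writelines(tail)

sample_extract("extract.txt", "extract_sample.txt")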

Looking for trends in the data


Once you have the extract open in a text editor, you can look for trends within the data. At
this early stage, you are just trying to confirm that the extract as a whole is worth moving
forward with, not identifying each specific bad row. Try starting by noting the following
things:
Does the file meet the format set forth in the data guide? This is usually pipe
delimited or fixed length.
Does the extract have the same number of fields or characters as set forth in the
data guide? Stray fields or characters could be evidence that the extraction process
has errors, which might lead to problems in the future.
Does the data in each field match the list in the guide? For example, does field
position 17 hold phone numbers as the Data Extract Guide defines?
Do you see any non-ASCII (non-text) characters? These often show up as thick
black bars in TextPad or other strange characters and must be removed before import.
Does each row have a source and source ID value? A few empty ones in the file
might not be a problem, but seeing more could signify an error.
Do you see each attribute populated at least a few times across the entire
extract? If you see a key field, like Last Name, that is empty in most records, it might
indicate that something is wrong.
Are you able to successfully complete the Data Submission Checklist? The Data
Extract Guide contains a table with additional items that should be checked.
Does the extract contain the same number of files as specified?
Are the proper end-of-line characters present? Zipped UNIX files might not convert
to the DOS format.


Check the whole file


Ensure that you skim through the entire file and confirm that the rows look consistent from
the beginning of the file to the end. This is especially important if you are dealing with a
single extract file that contains data from multiple sources. For example, the first 100k rows
of a file might be comprised of only records from Source A, while the remaining records are
from Source B. Source B records might contain errors nonexistent in Source A.

Important

Encountering problems with the data in the data extract does not necessarily mean that
the project loses momentum. In most cases, the data file can be scrubbed using Clover
graphs that help convert data into the proper formats and reduce duplicate records. Once
the data has been fixed, the initial data load can commence.

Checking a sample of the extract


Try loading the first 50 records of your data extract. If you can work with a sample of your
data, then you can be more confident that the larger file will run properly.

Cleaning the data extract


In Boot Camp, we use CloverETL to clean the data extract since it is part of Workbench,
but there are other tools that can be used to get the job done. For example:
PERL and other scripting languages
Datanomic Director
Extract, Transform, and Load (ETL) tools


Clover ETL


CloverETL is a Java-based, platform-independent extract, transform, and load (ETL)
framework used to transform structured data in multiple file formats. The CloverETL
graphical user interface is integrated with Workbench and allows you to create and edit
Clover graphs.

Tidbit:
Java ETL can be shortened to JETL. The word jetel in Czech means clover.

What is ETL?
ETL stands for Extract, Transform, and Load. It is a three step process for data
warehousing that:
Extracts data from outside sources
Transforms it to fit business needs
Loads it into the data warehouse

Clover graphs
Graphs, or transformation graphs, are a visual representation of the process of
transforming data from one form to another. A graph consists of at least three elements:
Components: Perform various data transformations or functions.
Edges: Connect the nodes by passing data.
Metadata: Define the data structure.


Understanding the clover components


Clover graphs contain multiple components that perform specialized operations on the
data. Each component has at least one and sometimes multiple inputs and outputs, with
mandatory and optional parameters.
The four main types of components are:
Readers: Read data from files or database tables.
Writers: Write the transformed data to files or database tables.
Transformers: Perform functions on the data that evaluate it, reformat it, and sort it.
Joiners: Combine data from different flows and reformat it.
Components can all be enabled, disabled, or set to pass through.
Enabled: Performs the function as specified.
Disabled: Does not perform the function and ends the graph process.
Pass Through: Works like an edge. Does not perform the function, but passes the data
along, continuing the graph process.

Hint

For more information: http://wiki.cloveretl.org


Clover status symbols


The table below contains common status symbols used in Clover graphs and what they
mean.
Table 5-1: Clover status symbols
Red X - The component contains an error in its properties.
Green Play symbol - The component has started processing.
Blue Check mark - The component has completed its process successfully.
In addition, the editor marks required data elements and items that still need to be
configured.


Creating a cleanup graph


The CloverETL tool gives you a wide variety of options to extract, transform and load (ETL)
data into your hub. In this section, we will focus primarily on the transform components of
CloverETL which allow you to take a raw data extract and scan it to ensure that it adheres
to your expected data requirements.
In the following graph, there are three main processes taking place:
Checking records for the proper number of fields - We will read the original extract,
treat each record as if it is one field, and then assess the number of delimiters to ensure
that each record has the proper field count.
Checking fields for proper data types and values - We will take the records with the
proper field counts and process them again using the field properties to filter out the
records that do not match our data standards.
Checking for true duplicate records - From the records that meet our standards, we
will check for duplicate records. These are not the same type of normal duplicates that
we work with in the IBM Initiate Inspector tasks, but rather records that were mistakenly
added to our extract more than once during the extract generation process.

We will construct the graph above in the next few pages. Let us begin!


Creating a filtering expression


The Ext Filter component uses logical expressions written in the Clover Transformation
Language that assume a true or false answer. Records that are deemed to be true pass on
to the first outbound port (port 0). Records that fail to pass the expression are sent to the
second outbound port (port 1). For more information about the Clover Transformation
Language functions, see the CloverETL Wiki page at: http://wiki.cloveretl.org
In this example, we will add an expression that is designed to count the number of fields in
your extract file and assess if there are the proper number of fields in each record.
length(split(concat($0.fieldName, " "), "\\|")) == # of Fields

This expression starts by taking the whole record as one field and appending
(concatenating) a space to the end - see concat($0.fieldName, " "). The extra space is
added to account for records which have no value for the last field. Having a null value
is not the same as having the wrong number of fields. This space will eventually need to
be shaved off of the last field before the records are brought into the hub by the Derive
Data and Create UNLs (mpxdata) process.

Note

The concatenation can be removed for data extracts that already have a closing pipe
delimiter at the end of each record.

The next process in the expression splits the string at each occurrence of a pipe "|" -
see split(concat($0.fieldName, " "), "\\|"). In order for Clover to properly recognize the
pipe character, it needs to have two backslashes in front of it (this has to do with the
way the expression is converted to Java before processing).
The third step uses the length function to count the elements of the resulting split - see
the length() function at the beginning. Basically, it counts the number of fields between
the pipes.
The final step is to compare that field count with the number of fields that you expect
each row to have - see the == # of Fields element at the end. This number is the
number of fields you should have, not the number of pipe delimiters. The double-equal
sign is required in the Clover language, and the number does not need any special
qualifiers around it, like '3' or "3"... simply enter == 3.


Below is a sample of how this expression can be used:


Source|MemIdNum|Sex|LastName|FirstName|BirthDate|HomePhone|EyeColor
A|435472|M|VANE|BABY|1903-01-17|GLENDALE|AZ|85301||6232017317|2657973|BL
UE
B|435473|M|CAJIGAS|SEAN|1908-07-28|4802435217|BROWN
C|435474|F|AQUIRRE|MAGAN|1929-06-06|4807473468|
D|435475|M|ARO||LLOYD||1958-02-19|245143015|HAZEL||
E|435476|M|SCHEELER|AHMAD|2002-05-30|BLUE

length(split(concat($0.field1, " "), "\\|")) == 8
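The same field-count check can be prototyped outside of Clover; here is an illustrative Python re-creation applied to the sample rows above (in practice, the Ext Filter component runs the CTL expression inside the graph):

EXPECTED_FIELDS = 8

rows = [
    'A|435472|M|VANE|BABY|1903-01-17|GLENDALE|AZ|85301||6232017317|2657973|BLUE',
    'B|435473|M|CAJIGAS|SEAN|1908-07-28|4802435217|BROWN',
    'C|435474|F|AQUIRRE|MAGAN|1929-06-06|4807473468|',
    'D|435475|M|ARO||LLOYD||1958-02-19|245143015|HAZEL||',
    'E|435476|M|SCHEELER|AHMAD|2002-05-30|BLUE',
]

for row in rows:
    # Append a space so a trailing empty field still counts as a field,
    # then split on the pipe delimiter (mirrors concat/split/length).
    count = len((row + " ").split("|"))
    print("pass" if count == EXPECTED_FIELDS else "reject", "-", count, "fields")

# Only rows B and C pass; A, D, and E have the wrong field count.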


Adding the second phase to your graph


In the second phase of our graph, we will run the data through a more rigorous cleansing
process, in which we will sort the records, validate specific field properties, and then pull
out records that have duplicate IDs.
The most important step in this process is deduping the data, which checks for records
that might have been accidentally added to the extract more than once.
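Conceptually, deduping keeps the first record for each key and routes the rest to a duplicates output; here is a minimal sketch of that behavior outside of Clover (the Dedup component does this inside the graph on sorted input):

def dedup(records):
    # Keep the first record per (Source, MemId) key; collect the rest as duplicates.
    seen, kept, dupes = set(), [], []
    for rec in records:
        key = (rec["Source"], rec["MemId"])
        if key in seen:
            dupes.append(rec)
        else:
            seen.add(key)
            kept.append(rec)
    return kept, dupes

records = [
    {"Source": "A", "MemId": "435472", "LastName": "VANE"},
    {"Source": "A", "MemId": "435472", "LastName": "VANE"},   # true duplicate
    {"Source": "B", "MemId": "435473", "LastName": "CAJIGAS"},
]
kept, dupes = dedup(records)
print(len(kept), "kept;", len(dupes), "duplicate")  # 2 kept; 1 duplicate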
To add a second graph phase:
Lay out the components of your graph.
Configure your Reader to read the Proper Count text file.
Create a new metadata file using the attribute columns.
Configure your Ext Filter component to remove records with non-standard characters.
Configure your Ext Sort component to sort by the Source and Member ID.
Configure your Dedup component to find duplicate Member IDs.
Configure your Writers to create files for records with non-standard characters,
duplicate records, and last but not least, the cleaned extract.
The end result will look like the graph below:

Laying out the second phase components


1. Add the following components to your graph by clicking them in the palette and placing
them into your graph.
- Universal Data Reader
- Transformers
Ext Filter
Ext Sort
Dedup
- Three Universal Data Writers
2. Arrange the components as shown in the graphic above.


Configuring the second phase ext filter


We will enter the filter expression below, but before we do so, let us discuss what the
elements of this expression mean.
($0.Source == "A" OR $0.Source == "B") AND
($0.MemId ~= "[0-9]+") AND
(isnull($0.Gender) OR length($0.Gender) == 1) AND
(isnull($0.State) OR length($0.State) == 2) AND
(isnull($0.BirthDate) OR length($0.BirthDate) == 10) AND
(isnull($0.LastName) OR ($0.LastName ~= "[a-zA-Z]+")) AND
(isnull($0.FirstName) OR ($0.FirstName ~= "[a-zA-Z]+")) AND
(isnull($0.StreetLine1) OR ($0.StreetLine1 ~= "[\"/'-., 0-9A-Za-z]+"))
AND
(isnull($0.HomePhone) OR ($0.HomePhone ~= "[0-9]+")) AND
(isnull($0.CellPhone) OR ($0.CellPhone ~= "[0-9]+")) AND
(isnull($0.SSN) OR ($0.SSN ~= "[0-9]+"))

Expression filter examples:


Validating Against Specific Values
You can design your filter expression so that it scans for specific values in the data. The
IBM Initiate Master Data Service must receive a Source Code for each record. These
source codes must pass validation against the list of Source Codes that you added to your
Initiate member model (.imm) file. For example, if you want to validate that your source
code is an "A", then try the following:
$0.Source == "A"
If you want to validate against two codes, you can add an 'OR' to your formula. The
resulting expression will throw out all of the records that do not have an A or B in the source
field. This includes records that have a NULL or empty value.
$0.Source == "A" OR $0.Source == "B"
If you want to make sure that this formula is processed as one function in the expression,
use parentheses around the clause.
($0.Source == "A" OR $0.Source == "B")


Validating character sets


Many times the validation process is simply checking to see that there are numbers,
letters, special characters, or a combination of the three. CloverETL uses regular
expressions to extend the power of the built-in Clover Transformation Language functions.
Within Clover, the operator '~=' indicates that a regular expression will be used. Any
record that does not adhere to the expression's standards is sent to the list of rejected
records. Below are a few examples of checking for letters only, numbers only, or any
combination including special characters.
$0.HomePhone ~= "[0-9]+"
$0.LastName ~= "[a-zA-Z]+"
$0.StreetLine1 ~= "[\"/'-., 0-9A-Za-z]+"
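The equivalent checks can be prototyped with standard regular expressions; here is a sketch whose character classes mirror the CTL patterns above (note that Python's re.fullmatch requires the whole field to match, which may be stricter than Clover's ~= operator):

import re

PATTERNS = {
    "HomePhone":   r"[0-9]+",
    "LastName":    r"[a-zA-Z]+",
    "StreetLine1": r"[\"/'\-., 0-9A-Za-z]+",
}

def field_ok(name, value):
    # Empty/NULL values pass; otherwise the field must match its pattern,
    # mirroring the isnull(...) OR ... structure used in the filter expression.
    if value is None or value == "":
        return True
    return re.fullmatch(PATTERNS[name], value) is not None

print(field_ok("HomePhone", "4802435217"))  # True
print(field_ok("HomePhone", "480-243x"))    # False
print(field_ok("LastName", ""))             # True (empty allowed)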

Allowing empty or NULL values


There are many fields where it is acceptable to have an empty value. For example, a
missing home phone number is usually not a required field. If you are going to perform a
validation check on a field like home phone, you will also need to allow a NULL value. The
expression would be structured like the following:
(isnull($0.HomePhone) OR ($0.HomePhone ~= "[0-9]+"))

Checking field length


Having the right number of characters can be an important form of validation. As long as
the data you are analyzing is in String format, you can use the length function to validate
the number of characters. In this example, we want to make sure that all of the State codes
have only 2 characters.
length($0.State) == 2

Exercise
Now it is time to perform Exercise 4, taking approximately 60 minutes.



Unit 6. Deploying the instance

Overview
You need to create a database, install the IBM Initiate Master Data Service engine,
configure the ODBC connection, create the instance directory, and establish the Windows
service.

Dependencies
You need to have a supported database platform and the proper software installation files
for your operating system. The computer images provided have DB2 and will support the
IBM Initiate Master Data Service.

Topics
This unit will cover:
Deploying the instance overview


Deploying the instance


You will create a new database for the IBM Initiate Master Data Service software to
reference. You will then install the IBM Initiate Master Data Service engine (a typical
InstallShield executable) and use the madconfig utility to configure the ODBC connection,
create the instance directory, and establish the Windows service.
Bootstrapping your database involves creating the core database tables, defining the field
properties, and indexing the tables. During the bootstrap process, several of the data
dictionary tables will be populated with default settings. Your database will be bootstrapped
as part of the instance creation, but you can also bootstrap separately.
The data dictionary is one part of the overall Initiate member model. The dictionary tables
control validation rules, application properties, attributes, sources, nicknames, and the core
algorithm settings. The dictionary can be populated by using a combination of engine
utilities and Workbench jobs. We will build a dictionary from scratch, but it is common to
import a baseline configuration.

Creating a hub instance using MADCONFIG


MADCONFIG helps you configure and manage instances of your hub. Run this utility each
time you want to create a new hub instance; you can also run it to remove instances of
your hub.

Creating a data source


The IBM Initiate Master Data Service platform connects to the database using an ODBC
connection. To ensure that you have the most stable connection possible, Initiate
has chosen to deploy drivers through DataDirect. The following steps will use the
MADCONFIG utility to create a new ODBC data source.

Creating a new hub instance


One server can house multiple instances of Initiate software. This allows you to have
Production, Test, and Development environments all on one box. To operate these
instances independently, the MADCONFIG utility is used to configure the engine service,
TCP/IP port, and other pertinent connection settings.

Starting the IBM Initiate Master Data Service engine


Now that the instance has been created, we have to start the engine. Use the Services tool
in Windows Administrative Tools to start, restart, or stop the instance.


Deploying hub configuration


After you have created your data dictionary, your algorithm, and a hub instance, it is time
to put everything into action and really start working with the system and the data. The
first step is registering the hub you created to the bootcamp project and then uploading
the configuration to the hub.

Registering a hub to a project


Think back to when we first created the bootcamp project and did not assign a hub to
the project. Now that we have created a hub instance, we can assign it to the project. This
function can also be used to assign additional hubs to an existing project.

Deploying the configuration


It is now time to deploy the configuration that you have created to the hub. During this
process, the hub is suspended while the configuration is verified and updated, but it
remains live for requests sent to it. After the configuration has been uploaded, operations
resume.

Exercise
Now it is time to perform Exercise 5, taking approximately 30 minutes.



Unit 7. Overview of the Initiate data model

Overview
The Initiate data model is the physical way the IBM Initiate Master Data Service software
stores data. The Initiate member model (.imm) is a metadata layer that classifies and
organizes the components used to track and match member data elements.

Dependencies
The Data Extract Guide outlines the specific attributes and fields and the Implementation
Approach defines additional data dictionary requirements.

Topics
This unit will cover:
The generic data model
Attributes
The data dictionary
Entities


What is the Initiate member model?


Before you can fully understand the Initiate member model, it is important to have an
overall understanding of the data model. The data model ensures that data is placed
properly. When changes have to be made manually, or when working with Application
Programming Interfaces (APIs), it is helpful to know where the data resides.
The IBM Initiate Master Data Service software data model design is rooted in a complex
relational database that divides data from a single extract file and parses it into multiple
tables, each one holding a different type of data.
By storing the data this way, the IBM Initiate Master Data Service software engine can
process information much faster and more accurately than if the data resided in
one single table. To add data to the IBM Initiate Master Data Service database, the
data dictionary must be configured to accept the data and process it appropriately.
The purpose of the database design is:
To audit access to data and allow its governance
To support very fast access to a small number of targeted records in a vast storage array
To allow broad storage and configuration options for data
To support a flexible workflow model that adapts to multiple applications

Features of the data model


The key to the Initiate software model is its database design, which makes it possible to
create complex relationships and search requests that cannot be satisfied with ordinary
relational SQL.
To these ends, the database has been carefully designed to have:
Simple relationships
Highly efficient, proprietary indexing strategies
Traces of member and entity access and alteration
A highly configurable entity algorithm scheme
Compound attribute types that match common, real-world entities
Highly flexible and configurable workflow types and statuses


Extensibility is at the core


The hub's database architecture has been designed to anticipate varying needs by offering
an extensible table structure with configurable data modeling. Since each project brings a
unique set of data, containing different fields as well as different processing needs, the
database has been designed with inherent flexibility.
Although our data model is highly flexible, we also recognize that there are similarities
between many data sets. Therefore, our design instantiates the IBM Initiate Master Data
Service software with a set of core tables that can hold and process a large variety of key
identifying data.


Common naming conventions


There are several common naming conventions used in the IBM Initiate Master Data
Service database's table structure. Below are some of the more common items you will run
across in the data model:
mpi The mpi prefix stands for master person index.
head If the table name has head in it, then it is a core table which maps and controls
other tables with a similar prefix (for example, the mpi_memhead table controls the
other mpi_mem**** tables).
x When you see an x in the name of your table, it usually means cross or by (for
example, mpi_srcxsrc holds source-by-source thresholds).
eia The acronym eia stands for Enterprise Id Assignment and typically means that the
table helps maintain relationships between members and the entities that they are a
part of.
_id The suffix on your entity tables (the ones that begin with mpi_ent) indicates which
type of entity the table is referencing. You can have more than one type of entity, like
Identity and Householding.
Other common suffixes:
anon: Anonymous values are treated as NULL values
attr: Attributes are specific demographics
aud: Audit tables track user and record level activity
bkt: Buckets organize data for searching
cmp: Comparisons check for similarity
dvd: Derived data has been processed by the algorithm
ent: Entities are distinct individuals or organizations
equi: Equivalent strings that are interchangeable
func: Functions are components used in the algorithm
mem: Members are individual records from a source
rel: Relationships are links between entities, not records
seg: Segments are tables that contain similar data
src: Sources are systems where the hub data came from
std: Standardization changes data to consistent formats
str: Text strings are reference lists used by the algorithm
tsk: Tasks are items that require human review
wgt: Weights are lookup tables used for scoring


Core database tables


Initiate's data model is made up of approximately one hundred tables which can be divided
into four basic categories. The tables range from ones that store transactional history of
data changes so that they can be audited, to tables that manage the complex relationships
between entities and members. Below is a list of the four categories of tables:

Table 7-1: Database category table

Audit Tables: This is data related to auditing who has accessed or modified what data.
Auditing can track data changes made by internal system users and updates from source
systems. Most of the audit tables begin with the prefix mpi_aud.

Entity Tables: This data is related to the linking of members into entities. By tracking
entity relationships in separate tables, we can link and unlink members without adding
unnecessary fields. It also allows us to review the history of entity linkages. The entity
tables begin with the prefix mpi_ent and end with a suffix of the entity type (for
example, id for Identity, hh for Household, and so on).

Member Tables: This is data explicitly related to information associated with members
within the hub's database. Member data is stored by segment (for example, Name, Date,
and ID). The member tables begin with the prefix mpi_mem.

Data Dictionary Tables: These tables define data types and values provided by various
source systems, configuration settings that allow the IBM Initiate Master Data Service
engine to run, application properties and layout, and the algorithm configuration
definitions. The data dictionary includes the segment, attribute, and field definitions as
well as the definitions that identify sources, member types, entity types, and supporting
information like string equivalents and anonymous values.
The Data Dictionary is by far the most elaborate of the table categories. The vast majority
of implementations begin by importing the core configuration of the data dictionary. Doing
so can save considerable time in the implementation process. We will explore the Data
Dictionary import process in more depth in the following sections of this module.


Audit tables
The IBM Initiate Master Data Service has built-in capabilities to track activity in the hub.
The level at which the audit tables track activity is determined on the Security tab in
Workbench. Auditing is enabled at the Interaction (API function call) level.
There are three options for auditing:
None: The interaction will not be tracked
Activity: Who called the interaction and when
Member: Who, when, and what member records were involved in the interaction


Member tables


Member data is stored in a process that we call verticalization. Initiate takes information
that would normally be stored in a horizontal structure (one record with many fields) and
divides the record into multiple silos of data based on the type of information. For example,
phone numbers are stored in the mpi_memphone table, while addresses are stored in the
mpi_memaddr table.
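
As an illustrative sketch (the memrecno value here is hypothetical and column names are simplified), a single horizontal extract record such as
PHRM|00045|Lockhart|Eldora|6231835023
would be verticalized roughly as follows:
mpi_memhead   memrecno=101  srcCode=PHRM      memIdNum=00045
mpi_memname   memrecno=101  onmLast=LOCKHART  onmFirst=ELDORA
mpi_memphone  memrecno=101  phNumber=6231835023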

The MemIdNum is not generated by the hub but instead is the primary identifier generated
by the source system.


These tables comprise the core member segments in the IBM Initiate Master Data Service
software. Member segments are used as storage for what is normally thought of as data
records. These data types store data in a normalized form, and are generally more
complex in nature than the typical, low-level database offerings of numeric, string, and
date-oriented types.
Table 7-2: Member tables
mpi_memaddr Postal address information from original data.
mpi_memappt Appointment information from original data.
mpi_memattr Generic information, uninterpreted strings.
mpi_membktd Derived bucket hash assignments (from Derive Data and Create UNLs (mpxdata)).
mpi_memcmpd Derived comparison strings (from Derive Data and Create UNLs (mpxdata)).
mpi_memcont Provider contract information from original data.
mpi_memdate Date and datetime information from original data.
mpi_memdrug Prescription information from original data.
mpi_memelig Eligibility information from original data.
mpi_memhead Member header information (including original source IDs).
mpi_memident Identifier information from original data.
mpi_memname Name information from original data.
mpi_memnote Notes about members.
mpi_memphone Phone number information from original data.
mpi_memqryd Direct-access browsing query information (rarely used).
mpi_memoref Reserved for future use.
mpi_memrule Member rule information.
mpi_memtext Reserved for future use.
mpi_memextb Extension attributes (B).
mpi_memextc Extension attributes (C).
mpi_memextd Extension attributes (D).
mpi_memexte Extension attributes (E).
mpi_memlink Relationship linking information.

Note

Tables can be viewed in DB2 Control Center. This can sometimes help with
troubleshooting later on in the implementation.
To view tables:
1. Open DB2 Control Center.
2. Expand the All Databases node, the Bootcamp node, and then click the Tables
node.
3. Click the table in the tables list to the right and view the data in the display below.


Entity and relationship tables


These tables store linkage information between members. They also assist the entity
manager, the process that facilitates comparison and entity definition, while it processes
the queue of members that need to be linked.
Table 7-3: Entity and relationship tables
mpi_entlink_xx Entity-to-member linkage.
mpi_entrule_xx Entity rule information for records that were manually linked/unlinked.
mpi_entnote_xx Entity notes are stored here when added by data stewards.
mpi_entxeia_xx Entity linkage history showing members and their assigned entities.
mpi_entxtsk_xx Entity task relationship data.
mpi_entique_xx Entity manager input queue (used by the engine).
mpi_entlkem_xx Entity manager lock table (not in use).
mpi_entlkiq_xx Entity manager input queue lock table (not in use).
mpi_entlkoq_xx Entity manager queue lock table (not in use).
mpi_entoque_xx Entity manager output queue (used by the engine).
mpi_rellink Entity relationship linkages (used in IBM Initiate Inspector).
mpi_relrule Relationship rule definition.
mpi_reltype Relationship type definition.
mpi_relsegattr Relationship segment to relationship attribute definitions.
mpi_relxtsk Relationship tasks.


The Data Dictionary


The Data Dictionary is the largest section of the IBM Initiate Master Data Service Data
Model. It supports the core table structure that configures the system, defines the types of
data that are permissible, and holds the supporting information that is fed into common
drop-down menus.

The dictionary tables


The Data Dictionary can be further subdivided into six logical groupings of tables. These six
subcategories of tables are listed below:
Table 7-4: Dictionary tables

Algorithm: These tables control the definition and processing of the core algorithm. You
will find the definitions for standardization functions, comparison functions, and
bucketing functions.

Base Configuration: These tables hold the core definitions of common elements like user
profiles, group permissions, interaction codes, task types, and statuses.

Metadata: These tables define the relationships between member types, segments,
attributes, and fields. Source definitions are also included here.

Runtime: These tables define how the applications will interact with a user at runtime.
The settings include application properties, composite view permissions, group
permissions, and enumerated data (drop-down/validation lists).

String Handling: These tables hold information that pertains to specific strings (or text)
that the system should be aware of. You will find anonymous values, equivalencies
(nicknames), and string-to-string conversions for standardization.

Weight Definition: These tables control how your weights are accessed for calculating
comparison scores. The weight tables are essentially lookup tables. These include
1-dimensional through 4-dimensional weights, weights for strings (like names), and
weights for numbers (like height).


Algorithm tables


The following tables feed into the algorithm definitions. These are vital to the processing of
raw data that is used to control the matching and linking processes in the hub.
Table 7-5: Algorithm tables
mpi_bktfunc List of bucketing function definitions.
mpi_bkttype Bucket generation type definitions.
mpi_bktxgen Bucket generation types per bucket function.
mpi_cmphead Comparison strategy definitions.
mpi_cmpfunc List of comparison function definitions.
mpi_cmpspec Comparison role definitions.
mpi_dvdhead Derived data strategy definitions.
mpi_dvdxbkt Derived data bucket role definitions and association to bucket groups.
mpi_dvdxcmp Derived data association to comparison roles and standardization roles.
mpi_dvdxqry Derived data association to query roles and standardization roles.
mpi_dvdxstd Derived data definitions of standardization roles and their association to member attributes and strategies.
mpi_dvdybkt Derived data bucket group definitions and association with cmproles.
mpi_enttype Entity to member type/comparison strategy definitions.
mpi_libhead Comparison, derivation, and standardization library definitions.
mpi_srcxsrc Source-to-source threshold definitions (this is where you set thresholds).
mpi_srcxent Source-to-entity relationship definitions.
mpi_stdfunc List of standardization function definitions.
mpi_stdtype Standard data type management.


Base configuration tables


These tables define the baseline information for the IBM Initiate Master Data Service
software to initialize and load dependent components and allow access for modification of
runtime configuration information.
Table 7-6: Base configuration tables
mpi_cvwhead Composite view definitions.
mpi_dvdfunc Derived data function definitions.
mpi_eiastat Entity task status resolution definitions.
mpi_eiatype Entity task type resolution definitions.
mpi_grphead Group definitions.
mpi_ixnhead Interaction function definitions.
mpi_seghead Segment-to-database table definitions.
mpi_seqgen Sequence number generator.
mpi_syskey Defines system key (version) information.
mpi_sysprop Defines system properties.
mpi_tskstat Task status definitions.
mpi_tsktype Task type definitions.
mpi_usrhead Holds hub user information after first log in.


Metadata tables


These tables provide the data definitions for the source data being consumed by the IBM
Initiate Master Data Service software. These tables provide the basic information to
associate the source data to the member data.
Table 7-7: Metadata tables
mpi_evttype Event type definitions (add, change, merge).
mpi_memtype Member-to-derivation strategy definitions.
mpi_segattr Lists the attributes that belong to each segment table.
mpi_segxfld Member segment field-to-database field definitions.
mpi_srcattr Member source attribute information.
mpi_srchead Member source identification header.
mpi_tag A tag can be applied to a task via automatic assignment by the Master Data Engine, based on the rules defined for the tag, or by Inspector end users.
mpi_tagmon Metadata used by the Master Data Engine processes to evaluate whether tags should be applied to tasks.
mpi_tagtype Defines tag types. Tag types with empty rule sets must be assigned manually by an end user. Tag types with at least one rule in the rule set are automatically assigned by the Master Data Engine tag process.


Runtime configuration tables


These are lookup tables providing information for various defined categories. The data in
these tables are highly volatile, session-oriented, used for reporting and tracking purposes,
and can be expected to grow over time at a given installation.
Table 7-8: Runtime configuration tables
mpi_appdata Application specific data.
mpi_apphead Application definition information.
mpi_appprop Application specific properties (used by Enterprise Viewer).
mpi_cvwxseg Attributes and segments for a composite view.
mpi_edthead Enumerated data type definitions.
mpi_edtelem Enumerated data type element lists.
mpi_usrprop User specific property definitions.
mpi_usrxgrp User-to-group definitions.
mpi_grpxcvw Group-to-composite view permission definitions.
mpi_grpxixn Group-to-interaction permission definitions.
mpi_grpxseg Group-to-segment permission definition.
mpi_grpxapp Enables restriction on the operations that a user group can perform on a given application.
mpi_rule Rules are used to direct the Master Data Engine on whether a tag should be applied to a task. This table is used to define the rules for a given rule set.
mpi_rulearg Defines the configuration arguments for a given rule.
mpi_ruleset A set of rules for a given tag type which determines if a tag should be applied to a task.


String handling tables


These tables help the IBM Initiate Master Data Service deal with text strings in several
ways. From treating certain text values as anonymous, to using equivalencies to process
nicknames, the string handling tables help data become standardized, which eases the
comparison process.
Table 7-9: String handling tables
mpi_stranon Anonymous string values.
mpi_strcmap Defines character mapping.
mpi_stredit String table edit patterns.
mpi_strequi Equivalent string values like Nicknames.
mpi_strfreq String frequency cut-off for bucketing.
mpi_strhead String definitions.
mpi_strnbkt Numerical value-to-string value bucket substitution values for ranges.
mpi_strsbkt String value-to-string value bucket substitution values for ranges.
mpi_strtype String type management.
mpi_strword String-to-string conversion for standardization.
mpi_strxstr String code cross connect definitions (replaces mpi_strhead entries).


Weight definition tables


These tables provide the weight values that are used to generate the comparison scores.
Without properly calibrated weight values, comparisons would not accurately reflect the
probability of good matches. Also included are weight management tables that help define
which weights are used by which comparison functions.
Table 7-10: Weight definition tables
mpi_wgthead Weight table definitions.
mpi_wgt1dim Calculated weight value 1 dim definitions.
mpi_wgt2dim Calculated weight value 2 dim definitions.
mpi_wgt3dim Calculated weight value 3 dim definitions.
mpi_wgt4dim Calculated weight value 4 dim definitions.
mpi_wgtnval Calculated weight numerical definitions.
mpi_wgtsval Calculated weight string definitions.
mpi_wgttype Weight type management.
mpi_wgtxwgt Weight cross connect definitions (replaces mpi_wgthead entries).


Unit 8. Deriving data

Overview
Derived data is essentially data that has been processed by the algorithm. You will derive
your data using the Derive Data and Create UNLs (mpxdata) job and then load the data
into the database.

Dependencies
You will need to have most of the components installed and configured, like the hub engine,
member model, .cfg file, and the algorithm. If changes are made to the algorithm, then data
will need to be re-derived.
You will need to know the order in which the fields appear in the Data Extract and the
corresponding Attribute names in the member model. You can build your .cfg file from
project documentation if the real files are not yet available. Check for accuracy before
deployment.

Topics
This unit will cover:
Data derivation overview
Data analytics overview


What is derived data?


Derived data is information that has gone through the standardization, comparison, and
bucketing processes in the algorithm so that it is configured for matching and scoring.
The standardization process strips out attribute tokens identified as Anonymous Values
and makes common conversions to ensure consistent formatting. For example:
Special characters are removed from Identifiers, like Hyphens (-) in SSN values
Area Code is shaved off of United States Phone Numbers, leaving only a 7-digit number
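
An illustrative before-and-after of these two conversions (values hypothetical):
SSN:   813-12-1147     ->  813121147
Phone: (623) 183-5023  ->  1835023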

Derived data types


There are two types of derived data that we'll discuss here:
Comparison Strings
Bucket Hashes

Comparison strings
Comparison Strings are caret (^) delimited strings with standardized data for each
field/attribute referenced in the algorithm. Each member record correlates to a single
comparison string, stored in the mpi_memcmpd table.
Table 8-1: Excerpt of mpi_memcmpd table


How does the algorithm determine the comparison string?


The number assigned to each comparison role is determined by the order in which the
attributes are listed in the algorithm, from top to bottom. Suppose the algorithm contains
the following attributes:
1. Name
2. Home Phone
3. Cell Phone
4. Gender
5. Zip Code
6. Address
7. Member ID
The comparison string will be:
Name^HomePhone^CellPhone^Gender^ZipCode^Address^MemberID
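As a purely illustrative example (the values are hypothetical, and actual token formatting depends on your standardization functions), a member with no cell phone on file might derive a string such as:
ROBERT:KOGNOSKI^1835023^^M^60609^289:W:JAMES^00045
Note the empty third position: an attribute with no value still occupies its caret-delimited slot, so the positions stay aligned with the comparison roles.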

Bucket hashes
Bucket Hashes are numbers that represent the buckets that each member belongs to.
Each member can have multiple hash assignments stored in the mpi_membktd table.
Table 8-2: Excerpt of mpi_membktd table
memrecno srcrecno bkthash
107247 10 2906828572220351548
107247 10 7872683519891125240
107247 10 8417370951307185476
107247 10 7876659353937979316
107247 10 8080639425811308900
107247 10 -8626021308968924882
107247 10 -6612914640955687282


When is data derived?


There are two typical ways that the data derivation process is triggered: automatically on
an insert/update of a member record, or manually using an Engine Utility Command in the
command line. Let us learn more about them.
Automatic: When member records are added or updated, the algorithm will derive the
data, create comparison strings, and then bucket the members. Once the members are
bucketed, an actual member comparison can take place where the comparison strings of
two members are reviewed for matching. If they score high enough, then they link and
become part of the same entity.
Manual: The manual derivation process is most commonly performed when running an
initial load of the database, while generating weights, or when you want to propagate a
change to your algorithm in the database. All four data derivation engine utilities can derive
your data manually.


The derivation process


The table structure of the hub's database is highly normalized. We take data from the
extract file and parse it out to segment specific tables. We also assign bucket hashes,
create comparison strings, and build binary files during the derivation process.
Data Derivation includes four steps:
1. Member Record Numbers are assigned and raw data is parsed into segment-specific
unload files. (Record phone number is added to the MEMPHONE unl file.)
2. Comparison strings are built from standardized data for each record and stored in the
MEMCMPD unl file.
3. Bucket hashes for members are assigned and stored in the MEMBKTD unl file.
4. Binary files of the member, bucket, and comparison data are compiled and stored as
Bulk Cross Match (BXM) files.
The unofficial fifth step of the derivation process is loading the UNL files into the database,
but before doing so you should verify that the derivation worked properly. Let us learn more
about the different derivation options.

Data derivation methods


There are multiple methods to derive data.
Derive Data and Create UNLs (mpxdata): This method is typically used the first time that
you derive your data for the initial load (before you go live) while your data extract is still a
static view of the data. The job takes raw data and builds member unload files, generates
comparison strings, assigns bucket hashes, and creates binary files for faster comparison.
Derive Data from UNLs (mpxfsdvd): This method is typically used during the
implementation process (before you go live) when you have made changes to the
algorithm but have not received new records since the last time you ran Derive Data and
Create UNLs (mpxdata). FS stands for file source, referring to how the job reads the
existing UNL files. It uses these pre-existing member unload files to extract and create
comparison strings, bucket hashes, and binaries.
Derive Data from Hub (mpxredvd): This method is typically used after you have gone live
and have changed your bucketing or comparison strategy in the algorithm. The job will
reassign your bucket hashes and comparison strings by reading the data in the database
tables since they include all of the updates up to that point in time through brokers or the
API. Your old UNL files would not reflect those updates, but the tables would.


Prepare Binary Files (mpxprep): This method is usually used with an Incremental Cross
Match (IXM). You load the records from each of the sources separately, then run
Prepare Binary Files (mpxprep) to compile the binary files. If you ran each source
separately, your binary files would not be complete; running Prepare Binary Files
(mpxprep) as a final step reads all of the records from the member tables and builds a
complete set of BXM files.
Member Model Transform Graph: This is a CloverETL enabled wizard that guides you
through the process of creating member UNL files from your data extract. The advantage is
using the existing metadata from the extract instead of designing a configuration file. You
would follow up this step with Derive Data from UNLs (mpxfsdvd) to create bucket hashes,
comparison strings, and binary files.

How does derived data impact the database?


The scripts run externally to the database, so you will not see any changes to the tables
until the final step of the script. At that point, the comparison strings and bucket
assignments are imported into the mpi_memcmpd and mpi_membktd tables.


Derive data and create UNLs (mpxdata)


The Derive Data and Create UNLs (mpxdata) utility performs several steps: building
member unload files, generating comparison strings, assigning bucket hashes, and
creating binary files. The binary files allow us to compare data faster than scanning through
strings.
The main feature that we will explore below is how Derive Data and Create UNLs
(mpxdata) parses raw data extracts into attribute-specific sets of data. In other words, how
the engine utility takes a single record for a person and creates a record for the SSN,
another for the Name elements, and a third for the Phone Numbers, and so on. This allows
the hub to store multiple iterations of active and inactive data (like a former address or
phone number) and increases responsiveness when searching and comparing.
The Derive Data and Create UNLs (mpxdata) utility can be run in the following interaction
modes:
1. MEMCOMPUTE This mode creates unload (UNL) files that can later be imported into
the database using the madhubload utility. This is the most common method for an initial
load because you can track the process through log files.

2. MEMPUT This mode deposits data directly into the database from the extract file. The
parsing of the data is simply a step in the process as Derive Data and Create UNLs
(mpxdata) loads the information into the hub's data structure.


Data model exercise


Parse the following records into the appropriate tables to simulate how data is placed into
the hub.
How would the hub identify the records, and where would the fields be stored in the
database? To give you some clues, we've listed the tables that need to be populated below.


The configuration file


The configuration file (or config, as it is commonly called) is used by the Derive Data and
Create UNLs (mpxdata) job as a map for reading the customer's extract file and a legend
for converting the data to the Initiate database table layout. The configuration file tells
Derive Data and Create UNLs (mpxdata) where each field is located in the extract, and
how to migrate it into the proper Initiate data format. The Derive Data and Create UNLs
(mpxdata) command can be run as a standalone utility from the command window or as a
job in Workbench.

What does a config file look like?


Configuration files are written for each unique data extract file. Therefore, the config files
are typically designed by following the Data Extract Guide that the customer agreed to.
Config files define several elements including:
Attribute to field number mapping in the extract file
String Cook options that remove leading and trailing data
Methods that set the data type, like String or Date
Field to field mapping for Attribute parts, like Last Name

Sample data extract file

Corresponding config file


Designing a configuration file


Since the config file is a map to help Derive Data and Create UNLs (mpxdata) parse out the
data extract, we will now explore the arguments that go into a config file so that you can
see how the map ties back to the data.
Below is a sample config file to review.
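As an illustrative sketch only (the field positions, transform choices, and values here are hypothetical, and exact .cfg syntax can vary by release), a config for a pipe-delimited extract containing source code, MRN, last name, first name, DOB, phone, and SSN might look like this:

Attribute  #  Position  Transform  Assignment    Field      Constant
MEMHEAD    1  1         NA         SetString     srcCode
MEMHEAD    1  2         TR         SetString     memIdNum
NAME       1  3         TR         SetString     onmLast
NAME       1  4         TR         SetString     onmFirst
DOB        1  5         NA         SetDate_MDY4  dateVal
PHONE      1  6         ZL         SetString     phNumber
SSN        1  7         TR         SetString     idNumber
SSN        1  0         NA         SetString     idIssuer   SSA

Note the last line: a Position of 0 tells the job to insert the Constant value (here, SSA as the ID Issuer) rather than read a field from the extract.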


Configuration file arguments


In the following pages you will find an explanation of each of the columns in the config
file, known as arguments.

Attribute
This column is used to identify the attribute code associated with the attribute (or data
element) inside the database you are going to populate. This name must match the
attribute's ATTRCODE as found in the mpi_segattr table.

#
This column defines the instance variable (IVAR) number which indicates how many of the
same named data elements are in this customer record. For example, if the extract
contains three phone numbers and inside mpi_segattr you only have one attribute defined
as PHONE (rather than WORKPHN, HOMEPHN, CELLPHN), you could increment the
IVAR column for each PHONE entry you have in the configuration file.
This is not the best way to handle multiple attributes of similar type. It is recommended to
create three separate attributes inside mpi_segattr to handle the three phone types
mentioned above. IVAR is also used in conjunction with the asaidxno (the attribute sparse
array index number) of each data element. If you are using asaidxno, you will need to
increment the IVAR to correspond with each increment of the asaidxno you are assigning
to the data element. This is a feature best handled through consultation with an Initiate
SME.
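
To make the IVAR mechanics concrete, here is a hypothetical excerpt (positions invented) that maps three phone fields from an extract onto a single PHONE attribute by incrementing the # column:

Attribute  #  Position  Transform  Assignment  Field
PHONE      1  8         NA         SetString   phNumber
PHONE      2  9         NA         SetString   phNumber
PHONE      3  10        NA         SetString   phNumber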


Position
This column is the physical position of the field within the customer extract (for example,
the third field has an offset of 3, the fourth field has an offset of 4, and so on). An offset of 0
indicates we will be inserting a constant value in this field, and not pulling the value from
the extract. The example above is inserting a constant value of SSA as the ID Issuer of the
SSN, which is represented by a 0, since it is not really a field in the import data set. Simply
skip fields that you do not want to import into the database (for example, if the 15th field is
unimportant, go from 14 to 16).

Transform
This column allows an optional edit method or transformation method you can apply to a
string field in the customer data element before inserting it into the Initiate database. These
are very basic commands that can be used to remove blanks or zeros from the right or left
side of the data element if necessary. More advanced edits can be conducted by using the
Algorithm's Standardization Functions to ensure that data is compared more effectively.
NA Do nothing
TR Trim leading and trailing blanks, allow empty string
BL Trim blanks from the left, allow empty string
ZL Trim zeroes from the left, allow empty string
B1 Trim blanks from the left, leave at most 1
Z1 Trim zeroes from the left, leave at most 1
BR Trim blanks from the right, allow empty string
ZR Trim zeroes from the right, allow empty string
ZX Trim zeroes from the right, if all zeroes make NULL


Assignment
This column defines the Initiate Database Method used to populate the Initiate record with
the customer data. Each data type has its own set of methods. You must take care to select
the proper one for the data element you are inserting. The following list shows many of the
common methods used.
SetString Defines the value as a text string
SetNumber Defines the value as a numeric field
SetDate_Y4MD Defines a date field as YYYYMMDD
SetDate_MDY4 Defines a date field as MMDDYYYY
SetDate_MDY2 Defines a date field as MMDDYY

Field
This column is the Initiate Database Field where the value will be inserted. The
ATTRCODE defined in column one allows you to direct data to a specific field within a
Segment (for example, the DOB Attribute is found in the MEMDATE Segment and the
dateVal field is the column where dates are stored). You can find the available Fields in
Workbench by going to the .imm file and selecting Attribute Types.

Constant
This is the constant data value that will be inserted into the Initiate database when the data
element has a Position value of 0.


Special considerations for configuration files


There are a few special considerations to keep in mind when working with
configuration files.

Working with identifiers


Some data elements require more than one piece of information to constitute a complete
Initiate record ready for insertion into the database. For example: A Social Security Number
requires both the actual ID and the ID Issuer. '111223333' would be the ID and 'SSA' would
be the ID Issuer.

Working with compound attributes


Some data elements can be inserted using one or more data elements to make up a
complete record. For example, a Name attribute can have up to seven elements that can
be used to make up a single name record. Likewise, the Address attribute can hold multiple
elements from Street Line 1 through 4 to City, State, and Zip. If used, each one of these
elements will have to be mapped from the source file to a database field inside an Initiate
database record.

Working with attributes that have many valid forms


Some fields, like date fields, can be represented in different ways (MMDDYYYY,
YYYYMMDD, and so on) and Derive Data and Create UNLs (mpxdata) must be told how
the field is formatted inside the customer extract. If you find that an extract has Dates that
change format with different sources, you might need to run CloverETL or PERL scripts to
clean up the extracts or divide the extract into multiple input files, each with their own config
file.

Where should my configuration file be stored?


There are several options for storing your configuration file, but in general the following
best practices should be followed.
Name your config file after the data extract file (for example, if the data extract is named
CountyHospitalPatients.txt, then you should name your config file
CountyHospitalPatients.cfg).
When you run the Derive Data and Create UNLs (mpxdata) job, a copy of the .cfg file
will be stored in the server's work directory at:
C:\Initiate\instance\bootcamp\inst\mpinet_bootcamp\work\bootcamp\workmpxdata.cfg


Exercise: Creating a configuration file


The goal of this exercise is to help you think more about configuration files. In this exercise,
you will:
Review an Extract Excerpt
Complete the Config File Worksheet

Attribute codes and segment fields


Each of the following Attribute Codes appears with the list of possible Segment fields that
could be used in your Configuration Template:

ADDR: stLine1, stLine2, stLine3, city, state, zip
MEMHEAD: srcCode, memIdNum
NAME: onmFirst, onmMiddle, onmLast, onmTitle, onmPrefix, onmSuffix, onmDegree
SSN: idIssuer, idNumber
DOB: dateVal
PHONE: phNumber


Data extract excerpt


The data extract excerpt below represents a snapshot of the data that the customer has
identified after going through the Data Extract Guide. Please review this excerpt and fill out
the Configuration Template on the following page based on the Attribute Codes, Segment
Fields, and this Excerpt.
Source MRN   Title Last Name First Name MI Degree DOB        Phone      Address          SSN
PHRM   00045 Dr    Lockhart  Eldora     C  PHD    08-16-1953 6231835023 289 W. James St. 214232158
PHRM   00244 Mr    Tougas    Cyrus      U         09-08-1897 4804105130 90 S Newstead Rd 226241361
OUTP   00673 Mrs   Polhemus  Ivana      L  ESQ    07-17-1918 6231931923 2930 N Hwy 30    523831068
OUTP   00827 Dr    Stills    Junior        MD     07-03-1945 4801176228 RR 6 Box 2883    265662872
PHRM   00050 Ms    Peppers   Maryetta   P  CPA    04-04-1982 4809668324                  265171522
OUTP   00393 Miss  Mantilla  Polly      N  RN     05-19-1938 6232926226 7868 Miller Ct   261161031
PHRM   00425 Mr    Sander    Jeffrey    N  RHIA   04-23-1929 4802079425                  241432207

Configuration template
Fill in the template based on the information on the previous page. If you need additional
reference, you can check the sample config file earlier in this unit.
Table 8-3: Practice configuration template
Attribute # Position Transform Assignment Field Constant

Deriving data
Before we can derive our data by running the Final Extract file through the Derive Data and
Create UNLs (mpxdata) job, we need to move our data to the folder below:
C:\<Engine_Directory>\inst\mpinet_<Hub_Instance>\work\<Workbench_Project_Name>\work
The work folder is the folder on the server where any job that connects to a server
writes its output.


Data analytics


Data analytics are facts about a data set identified by the hub.
Data analytics help us answer questions such as:
How many records have I added to the hub?
What percentage of attributes in my data set has valid values?
How many duplicate customer/patient records exist in my data set?
How many tasks and links did the hub create for my data set?
How many entities, or individual customers/patients, exist in my data set?
How well is my algorithm performing?

Why are data analytics produced?


They are used to help you better understand the data that is going into and out of the hub.
Analytics are used in a variety of situations:
To help you understand the quality of your data
To help you make better decisions about how to tune the performance of the hub

How do you produce them?


Data analytics are produced by running a series of queries in Workbench connected to a
hub loaded with data and configured algorithms and weights. The output of these queries
can be viewed in Workbench.

Prerequisites
Before running data analytics, you must have already done the following:
Installed Workbench
Installed the Master Data Engine
Created a database
Loaded the database with your specific data dictionary, algorithm, anonymous values,
weights, and member data
Performed a bulk cross-match and loaded entity data into the database before running
Score Distribution and Potential Overlay


Workbench analytics perspective


When you open the Analytics perspective you will find four panes for running analytic
queries.

The icons at the top of the Analytics view provide the tools for accessing analytics data.
Table 8-4: Analytics icons
Icon Name Function
Set the Data Source To connect to the hub from which analytics data is drawn. (Data is taken from the hub's database directly.)

Add a New Query To add a query to the view.

Clear Queries To remove all queries from the view.

Export Query To save the currently displayed query to a CSV file.

Add New View To create a new empty view within the perspective.

Pin Query To pin the query results to the current view and prevent drilldowns from
changing the contents of the view.


Attribute completeness by source


Attribute Completeness provides the percentage of records in the database that do not
have an anonymous or otherwise invalid value. This number is calculated for attributes
that are represented in your algorithm. Attribute Completeness does not tell you whether
the data is accurate; it only tells you how well populated your data is with valid values.
The results are sorted by attribute, but can also be sorted by column to help you answer
some important questions as you move forward:
How complete is the attribute data in your records?
What high percentage attributes are good candidates for use in the algorithm?
Is a source or attribute missing a large percentage of data?
The source columns also indicate the percentage of records from all member attributes
coming from that source
This information can help you better understand the quality of your data and take corrective
action. For example, if only 50% of the SSNs in a particular source are valid, then the data
owners might want to alter their business processes to do a better job of collecting SSNs.
This also shows that you probably won't be able to find a person in this data set by
searching with SSN alone. Any search used with this data will need to cover other
attributes, individually or in combination, in order to find a person.

Exercise
Now it is time to perform Exercise 6, taking approximately 45 minutes.



Unit 9. Generating weights

Overview
The weight generation process is an integrated utility that goes through multiple steps to
measure the frequency of individual values in the database and then assigns weights to
those values, with the most common values weighing less and rare values weighing more.
The weight generation process creates unload files which are later loaded into the
database.

Dependencies
You must have the hub engine installed and your algorithm configured. If you have already
derived your data, then weight generation will take less time, but the weight generation
utility can also derive data. You should always check your weights before loading them into
the database.

Topics
This unit will cover:
Weights overview
Troubleshooting weights


Exercise
Now it is time to perform Step 1 in Exercise 7, taking approximately 5 minutes. The job will
take up to an hour to run.

The importance of weights


Weights are a measure of our confidence that a particular set of attributes is a match or a
non-match, based on the results of a comparison. Let us look at a sample pair.

What makes you think these records match?


Take a look at these records...
Table 9-1: Same person?
        Member A            Member B
Name    Robert L. Kognoski  Bob L. Kognosky
SSN     813-12-1147         813-12-1174
Gender  Male                Male
What makes you confident that both of these members are the same person?
Is it the fact that Bob is a nickname for Robert?
Is it that they share the Middle Initial of L.?
Is it that Kognoski and Kognosky sound alike?
Is it the fact that the SSN values just have a simple typo?
Is it the fact that they both are Male?
Could it be a combination of all of these elements?


What about Bob?


Look at the following questions and rate your answer on a scale of 1 to 10 (10 being very
confident) that you have found a match:
Table 9-2: What about Bob?
Question Rating
If you see two Female records does that mean that they are the
same woman?
If you see two people who both have the same SSN, are they the
same person?
If you see one record named Sue Chaudray-Patel and another
named Susan C. Patel, are they the same person?
If you see one record for John Smith born on June 5, 1938 and
another record for a John Smith with no birth date, are they the
same person?
If you see one record with a DOB of 1962-04-23 and a SSN of
718-12-0921 and another record with a DOB of 1962-04-13 and a
SSN of 718-21-0921, are they the same person?
Do you see how different elements, such as Gender versus SSN, would give you different
levels of confidence? This is because the more frequently a value appears, the less weight
it has; the rarer a value is, the more weight it has. Platinum and aluminum are both
metals, but the rarity of platinum gives it more value.
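
The course materials do not give the underlying formula, but frequency-based match weights of this kind are classically computed as a log-likelihood ratio, in the style of the Fellegi-Sunter model of probabilistic record linkage (a background sketch, not necessarily the exact Initiate implementation):

w = \log_2 \frac{P(\text{values agree} \mid \text{records match})}{P(\text{values agree} \mid \text{records do not match})}

For a value whose relative frequency in the population is f, two randomly paired non-matching records agree on it with probability roughly f, so the denominator shrinks and the weight grows as the value gets rarer. Agreement on a rare surname therefore scores far higher than agreement on Gender, which a random pair matches about half the time.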

What are the odds?


Let us put it another way, if you were playing the Lottery and the winning numbers were
4-8-7-2-9-0, would you still win some money if you had 4-8-7-2-9-3? Of course, you would,
but having a one number difference would mean that you would not win as much as you
would have if you had a perfect match. Like with horseshoes and hand grenades, being
close still has value. Weights can take close into account and assign value based on their
proximity.

Thinking more about weights


When your brain goes through the cognitive process of comparing two records, you can
make the judgment that these two members are the same person because you have ample
information and you know that some data elements are more important than others. That is
similar to how weights work.
But, what if your information is less than complete? For example, if you had two cell phone
numbers that were exactly the same, you would reasonably expect them to belong to the
same person. But if you saw that the Gender, Birth Date, and Name of the members were
different, your confidence would drop. That is also how weights work.


Identifying the weight tables


There are several tables that come into play when dealing with weights. The dimensional weight tables correspond directly to the algorithm: the number of dimensions reflects how many values are compared together in the same comparison function. If two attributes, like Last Name and Zip, are compared together, then the weights for Last Name + Zip will appear in the mpi_wgt2dim table.
The weight tables are described in the following table:
Table 9-3: Weight tables
mpi_wgthead: Holds core definitions of the weights, including the comparison specification string and the weight type (such as 1dim or sval).
mpi_wgt1dim: Holds the weight values for comparison functions that have a single comparison attribute (such as SSN and Date of Birth).
mpi_wgt2dim: Holds the weight values for comparison functions that use two attributes (such as Eye Color + Hair Color).
mpi_wgt3dim: Holds the weight values for comparison functions that use three attributes (such as Zip Code + Address + Phone Number).
mpi_wgt4dim: Holds the weight values for the False Positive Filter, which uses four separate attributes to control situations like twins or Jr.'s and Sr.'s who are mistakenly linked.
mpi_wgtsval: Holds common string (text) weight values based purely on frequency. This is where you will see the weights for people's names and for attributes that use a simple match/no-match comparison, like gender.
mpi_wgtnval: Holds common numeric weight values based purely on frequency. You will typically see date information, like birth year, weighted here.
mpi_stranon: Holds the anonymous values that have been established by the Anonymous Value Utility or were imported into your system. While this is not a weight table per se, it is commonly associated with weights because of the role that anonymous values play in measuring frequency.

The 80/20 weight rule


Your weights operate on the 80/20 rule. When the hub calculates weights, it typically generates specific weight values only for the most common values in the population, up to the point where 80% of the population is accounted for. The remaining 20% of the population have the rarest values and can therefore all be given the same default weight score, which is the highest possible weight for that attribute/token.
Remember, the rarer the value, the higher its weight. Therefore, the rarest values all receive the highest possible weight score. The beauty of this approach is that when a distinctly new value is added to the hub, it can still be assigned a weight value, even though the system has never seen that value before.
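The idea is easy to sketch in code. The fragment below is an illustration only, not the hub's implementation: it assumes the simple frequency-based weight log10(C / frequency) developed later in this unit, with an invented matched-pair agreement rate C.

import math
from collections import Counter

def assign_weights(values, coverage=0.80, C=0.92):
    """Sketch of the 80/20 rule: explicit weights for the common values,
    one shared default (maximum) weight for the rare tail. C is an
    assumed matched-pair agreement rate, not a real product constant."""
    counts = Counter(values)
    total = len(values)
    weights, covered = {}, 0
    # Walk values from most to least common until 80% of the population
    # is accounted for.
    for value, n in counts.most_common():
        if covered >= coverage * total:
            break
        covered += n
        weights[value] = math.log10(C / (n / total))
    # The rare tail, and any value never seen before, gets the weight of
    # a value that occurs exactly once in the population.
    default_weight = math.log10(C * total)
    return weights, default_weight

names = ["smith"] * 9 + ["gonzales"] * 3 + ["chitsumungo"]
w, default_w = assign_weights(names)  # chitsumungo falls into the tail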


When should you recalculate weights?


There is no set rule that dictates when you need to recalculate your weights, but there are
a few scenarios that merit running weights again. That said, you might find that the change
in weights was so negligible that you can continue using your old weights.
Reasons to check for updated weights:
You have added or changed a comparison function or comparison code in your algorithm.
Your population has grown by more than 20% of the original population (remember the 80/20 rule: now you have a different 20%).
It has been two years since you last calculated weights.
You have added a new source, especially one that comes from a different geographical area (the East Coast has different name, phone, and address distributions than the West Coast).
You are upgrading to a new version of the IBM Initiate Master Data Service.


The magical weight formula


The process of calculating weights is based on solid mathematical and statistical research.
Our data scientists have leveraged empirical research and theory to create the formulas
that measure and calculate weight. The basic weight generation formula is shown below.
If R is the result of a comparison function, then the weight for R is:
log10( Probability(R in matched pairs) / Probability(R in unmatched pairs) )
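As a minimal sketch (not the hub's actual implementation), the formula translates directly into code:

import math

def weight(p_matched: float, p_unmatched: float) -> float:
    """Weight for a comparison result R, per the formula above."""
    return math.log10(p_matched / p_unmatched)

weight(0.9, 0.0001)  # agreement rare among random pairs: strongly positive (~3.95)
weight(0.5, 0.5)     # agreement equally likely either way: 0.0, no evidence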

Calculating weights with matched pairs


Matched Pairs are sets of records that likely represent the same person. The process is not concerned with duplicates or linkages at this time, and matched pairs do not need to match exactly on every attribute, since the probabilistic algorithm is used.
Below is an example of a matched pair:

What are unmatched pairs?


Unmatched Pairs are sets of records that are pulled randomly from the database. Because
they are random, it is most likely that they will not match, although it is possible they will.
Each attribute will be compared for agreement and frequency.
Below is an example of an unmatched pair:


Basic weight calculations


The simplest form of weight calculation is the frequency-based weight. Below is an example that uses three names and shows that, in a given population, these names will have weights that are inversely proportionate to their frequency. Remember, not all weights are calculated this way. Let us take the weight formula and boil it down to its most elemental parts:
log10( Probability(R in matched pairs) / Probability(R in unmatched pairs) ), which can be calculated as log10(C * A / B).
A = Population Frequency: A is the number of times that a value occurs in the population, divided by the total population. So, if 9 of 100 people had the name Smith, A would be 9%.
B = (Population Frequency)²: B is the likelihood that two random members would both have the name Smith. This is 9% of 9% of the random pairs, or 0.81%.
C = Matched Frequency: C is the fraction of matched pairs whose records agreed (not just the Smiths, but any pair that simply had the same name). These matched pairs are identified by the algorithm in a similar process to the way you generated threshold analysis sample pairs. Typically, this is a high number, between 85-99%. In our example below, 92.56% of matched pairs agreed.
Probability(R in matched pairs), or C * A: The probability that a matched pair shares the name Smith depends on two things: 1) how often the name Smith appears in the general population, and 2) what percentage of matched pairs have the same value for the name (some records might have empty or wrong information). When we multiply these two elements, we get the probability that two matched records would be named Smith (92.56% * 9.3% ≈ 8.6%).


Probability(R in unmatched pairs), or B: The statistical likelihood that two random records pulled from your database would share the value you are measuring is based on how common that value is. If 9% of the population has the name Smith, then each of the two members in the random pairing has a 9% chance of being named Smith. So, to use it in the formula, the unmatched probability is the square of the value's population frequency, or 0.81%.
Table 9-4: Weight calculations
                       Smith        Gonzales     Chitsumungo
Value Frequency        92,982       6,372        39
Total Population       1,000,000    1,000,000    1,000,000
Population Freq. (A)   0.09298      0.00637      0.00004
Unmatched that Agree   432,283      2,030        0.08
Unmatched Pairs        50,000,000   50,000,000   50,000,000
Unmatched Freq. (B)    0.008645652  0.000040602  0.000000002
Matched that Agree     4,628,101    4,628,101    4,628,101
Matched Pairs          5,000,000    5,000,000    5,000,000
Matched Freq. (C)      0.92562      0.92562      0.92562
Formula Used           log10(C × A / A²), per column
Resulting Weight       0.998        2.162        4.375
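The arithmetic in Table 9-4 is easy to verify. The short sketch below (Python, purely for illustration) reproduces the resulting weights from the raw counts, taking B as the square of the population frequency, just as the table's formula does:

import math

C = 0.92562  # matched frequency: share of matched pairs agreeing on the name
population = 1_000_000
value_counts = {"Smith": 92_982, "Gonzales": 6_372, "Chitsumungo": 39}

for name, count in value_counts.items():
    A = count / population  # population frequency
    B = A ** 2              # unmatched frequency
    print(f"{name}: {math.log10(C * A / B):.3f}")
# Smith: 0.998, Gonzales: 2.162, Chitsumungo: 4.375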

Running weight generation


Luckily, the Weight Generation tool will calculate the weights of our data automatically, so you are off the hook for sitting down with a calculator. This process can take several hours depending upon the size of your database, the complexity of your algorithm, and the number of attributes you have. For a data set with 1,000,000 member records and 6-8 attributes, you should plan on a minimum of 2 hours of processing time. Larger record sets will take more time to process, but the time estimate depends on multiple variables (for example, hardware, data set size, and algorithm complexity). For this reason, weight generation is most frequently started at the end of the day, so that it can process overnight and create less of a disruption for other applications that might be running on the same server.


Weight generation overview


There are seven steps to weight generation.
Table 9-5: Weight generation steps
Step 1: Delete artifacts from the previous run (removes the contents of the weights directory).
Step 2: Generate counts for all attribute values (uses the Generate Frequency Stats (mpxfreq) utility).
Step 3: Generate random pairs of members.
Step 4: Derive random data by comparing random members.
Step 5: Perform matched candidate pairs reduction (similar to the Threshold pairs process).
Step 6: Generate the matched set, matched statistics, and initial weights.
Step 7: Iterate over the previous step and check for convergence of weights.
These steps can be run independently or in sequence, and they offer additional settings for optimizing your weight generation and output.
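Step 7 is essentially a fixed-point iteration: the matched statistics produce new weights, the new weights select a slightly different matched set, and the cycle repeats until the weights stop moving. The sketch below shows only that control flow; select_matched_pairs and weights_from_matched_stats are hypothetical placeholders, not the utility's real internals, and the weight keys are assumed stable between rounds.

def generate_weights(candidate_pairs, weights, tolerance=0.01, max_rounds=20):
    """Iterate weight estimation until the weights converge (Step 7)."""
    for _ in range(max_rounds):
        # Re-score the candidate pairs with the current weights and keep
        # the high-scoring ones as the new matched set.
        matched_set = select_matched_pairs(candidate_pairs, weights)
        new_weights = weights_from_matched_stats(matched_set)
        # Converged when no single weight moved more than the tolerance.
        delta = max(abs(new_weights[k] - weights[k]) for k in new_weights)
        weights = new_weights
        if delta < tolerance:
            break
    return weights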


Troubleshooting your weights


Bad weights are most often the result of having bad or insufficient data. In many cases, attempts to create fake or test data fail to generate weights because they do not represent the real distribution of errors that are found in real life. Most commonly you will generate bad weights if the following conditions exist:
The input file size is too small. Possibly there will be too many matched pairs in the random pair sample, or the size of the matched set is too small for reliable matched set statistics.
The duplication rate of the input file is too small. The size of the matched set is too small for reliable matched set statistics.
The duplication rate of the input file is too large. Possibly there will be too many matched pairs in the random pair sample, or they are from the same source.
One or more of the match attributes is sparsely populated. There are not enough pairs for reliable statistics. If an attribute is only populated in 10% of records, only about 1% of matched pairs will have members who both have the attribute.
There are too few match attributes. If there are only two attributes, say Name and Address, then to derive the matched statistics for Name we have only Address, and you cannot derive a good matched set from address alone.


Good weight distribution


This graph represents good weights for Social Security Numbers. Notice that the exact
match (on the left) has the highest score, while number 10 (which represents that all 9
digits would need to change in order to match) has the lowest score. To read the Index:
1 = Exact Match
2 = 1 Edit Required to Match
3 = 2 Edits Required to Match, and so on
An edit could be as simple as 399-90-1314 changing to 339-90-1314.
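To make the index concrete, here is a generic sketch (a simple Levenshtein distance; the hub's own comparison functions are not shown) of how an SSN pair maps onto that index:

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ssn_weight_index(a: str, b: str) -> int:
    """1 = exact match, 2 = one edit required, 3 = two edits, and so on."""
    return 1 + edit_distance(a.replace("-", ""), b.replace("-", ""))

ssn_weight_index("399-90-1314", "399-90-1314")  # 1: exact match
ssn_weight_index("399-90-1314", "339-90-1314")  # 2: one edit required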


Bad weight distribution


This graph illustrates a situation where weights are going to cause problems. You will see that the PIN (access code) weights for index values 8 through 10 are positive. In fact, according to these weights it is better to have an edit distance of 8 or 9 than it is to have an exact match. This means that having a completely different PIN would be considered a better match than two PINs that are exactly the same.

5. Select the Insert Chart feature, select a Line Chart option and click Finish.

6. Review the weight distribution.


Hand editing AXP weights


Two-dimensional weights, like AXP in the wgt2dim file, should show the highest score when the Phone is an exact match and the Address is an exact match. As we deviate from there (say, the Phone is 3 edits off and the Address is 2 edits off), the score should be somewhat lower.
As you reach the lower-right corner of the weight values, you should see the lowest score of all, where the Address and Phone are a complete mismatch. Unfortunately, the data that we are using in the Boot Camp does not have a realistic enough distribution of values to generate good-looking weights. So, in this exercise, we will edit the AXP weights by hand to smooth them out.
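Hand editing simply means adjusting values in the weight file. The sketch below shows the underlying idea under one stated assumption: the grid should never increase as either edit distance grows. It is an illustration of the smoothing goal, not a prescribed procedure.

def smooth_monotonic(grid):
    """Clamp a 2-dim weight grid so it never increases as edit distance
    grows. grid[i][j] holds the weight at Address edit-distance i and
    Phone edit-distance j (0 = exact match here, for simplicity)."""
    for i in range(len(grid)):
        for j in range(len(grid[i])):
            if i > 0:
                grid[i][j] = min(grid[i][j], grid[i - 1][j])
            if j > 0:
                grid[i][j] = min(grid[i][j], grid[i][j - 1])
    return grid

axp = [
    [6.1, 4.9, 3.2],
    [5.0, 5.3, 1.0],   # 5.3 is a bump: more edits, yet a higher weight
    [2.8, 0.4, -1.5],
]
smooth_monotonic(axp)  # the 5.3 bump is clamped down to 4.9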


Reading multi-dimensional weights


Weights in the mpi_wgtsval, mpi_wgt1dim, and mpi_wgtnval tables are fairly straightforward. The multi-dimensional weight tables are a bit more difficult to read. In this section we will look at how to read and interpret the data in the mpi_wgt2dim and mpi_wgt3dim tables.
To read the weights you must understand a few basic concepts:
Dimension counting structure
Trends to look for in the data

Counting dimensions
Dimension weights are based on the number of characters in the attribute that you are measuring, plus two additional values. For a US phone number, the standardized value has 7 digits. You add to that a weight for Missing and one for Exact Match, and you end up with 9 dimensions of weight, numbered 0-8.
Table 9-6: Counting dimensions
Dimension 0: A value is missing (or anonymous) from one or both of the member records compared; for example, no phone number. Usually zero; it sometimes has a negative penalty.
Dimension 1: The values in both member records are exactly the same. Highest score, usually between 4 and 6.5.
Dimension 2: The values are one edit distance different; you would have to add, change, or remove one character to make the records the same. Slightly lower than an exact match, but still positive.
Dimensions 3 and up: The further your edit distance gets, the more penalty there is for the difference; the more different the values, the lower the score. The weight usually goes negative around an edit distance of 4 or 5.



Zip dimensions go down (0, 0, 0, 0, 1, 1, 1, 1) in column B.
Address dimensions go down (0, 1, 2, 3, 4, 5, 6) in column C.
Phone dimensions start with column D and are represented by the actual weights:
- Column D = 0 (missing)
- Column E = 1 (exact match)
- Column F = 2 (edit distance of 1)
- Column G = 3 (edit distance of 2)
You should see a trend on the diagonal that starts high and moves down (cell E18 is the highest score because it represents an exact match on all 3 dimensions).
You should also see a trend from left to right, and from top to bottom.

Exercise
Now it is time to perform the remaining steps in Exercise 7, taking approximately 45
minutes.



Unit 10. Running a bulk cross match

Overview
The bulk cross match (BXM) is a process that allows you to compare and link thousands of
records per second. The BXM is most commonly performed in the initial stage of the
implementation and again right before the system goes live. The BXM process is made up
of two primary jobs: Compare Members in Bulk (mpxcomp) and Link Entities (mpxlink).
After running the compare and link, the data will need to be loaded into the database.

Dependencies
You must have derived data and generated weights before you perform the bulk cross
match. That also means that the hub engine, algorithm, and data dictionary must be in
place.

Topics
This unit will cover:
Bulk Cross Match overview


Exercise
Now it is time to perform Exercise 8, taking approximately 10 minutes. The job will take
approximately 45 minutes to run.

Bulk cross match


The Bulk Cross Match (BXM) compares your member records in binary form and then measures the comparison scores against your thresholds to create entity¹ assignments. This is a critical step in ensuring that your thresholds are set at the proper level.
The typical implementation performs the BXM process to establish entities. Matching and linking in bulk the first time around means that the hub only needs to focus on matching and linking the new records that come into the database once the hub is in production.
In production, new and updated source system records are processed through Entity Management to maintain entities going forward. The set of logical steps necessary to process a source system record and end up with the correct entity assignment (EID) is the same whether the source records are added in real time or in batch.

How are records grouped?


For a single member record (Member A), how do we decide whether that member should be grouped in an entity with other members?
First, take all the information about that member and derive two additional sets of information: buckets and a comparison string.
Buckets are groups of members who only need to have a little information in common. For example, all the members whose phone number is 678-3233, or all the members whose name is phonetically similar to John+Wilson, or all the members whose last name is Wilson in the zip code 60657. Buckets are used to select a wide set of candidates who might belong together in the same entity.
Next, compare the comparison strings of all the candidates selected from the buckets in common with Member A against the comparison string of Member A. Only those candidates with a high enough comparison score (above a high-confidence threshold level) are deemed to be in the same entity as Member A.
Finally, assign a common EID value to Member A and the other members who scored high enough. Members who score just below the threshold will not be linked into the entity, but a task will be created which can be manually reviewed later. A simplified sketch of this flow follows the note below.

¹ If you are not using Entities in your implementation, you will not need to perform this step.
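As a simplified sketch of that select-compare-link flow (compare() is a hypothetical helper standing in for the algorithm's comparison functions, and the threshold value is an invented example):

AL_THRESHOLD = 11.0  # invented example value; set per project

def candidates_for(member, bucket_index):
    """Gather every member sharing at least one bucket with this member."""
    found = set()
    for bucket_hash in member["buckets"]:
        found.update(bucket_index.get(bucket_hash, ()))
    found.discard(member["memrecno"])
    return found

def entity_members(member, all_members, bucket_index):
    """Keep only the candidates whose score clears the autolink threshold."""
    return [c for c in candidates_for(member, bucket_index)
            if compare(member["cmpstring"],
                       all_members[c]["cmpstring"]) >= AL_THRESHOLD]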


Enterprise ID (EID) assignment


When a member is created in the Master Data Engine, it is assigned an internal member record number (MemRecno) and an Enterprise ID (EID and/or EntRecno). During Master Data Engine and entity management processing, member records are linked into entities with other records. As members are added/updated, linked/unlinked, merged/unmerged, or removed from a hub, the makeup of entities can change. This might result in a change of a member's Enterprise ID.
Enterprise IDs are assigned in sequence via the mpi_seqgen table when members are added, linked/unlinked, or merged/unmerged. This method of assignment keeps the number of changes to a minimum. EIDs are non-transitive, which means that once a change separates members, there is no longer any connection between them.
For example: Bert and Ernie were roommates until Bert moved out. Even though they once shared an address, there is no longer any connection between them.

BXM steps
The starting point is a large file of customer data and the ending point is all the customer
data (plus the Initiate derived data and entity/task assignments) loaded into the database.
The interim steps are designed to load large amounts of data into memory and process
them without ever hitting the disk sub-system.
BXM refers to the entire process, but is also often used to refer to just the compare and link
steps.

The example illustrated here assumes that the hub configuration, algorithm, and weights
have been created and only the customer data remains to be dealt with.


Step 1

Derive Data and Create UNLs (mpxdata) utility: This is the utility you would typically use for an initial load (starting from scratch).
This utility takes the file of customer data and converts the core data into the Identity hub data model. For each table in the Identity hub data model, a new unl text file (mpi_memname.unl, for example) is created, which is eventually loaded into the database.
This utility also creates the derived data (buckets, comparison string) and puts the derived data into two formats:
unl text files, to be loaded into the database later
Binary files, which will be used in subsequent processing steps

Step 2

Compare Members in Bulk (mpxcomp) utility: This utility iterates through all the buckets
(selects candidates) and performs the comparison calculations for all the members in each
bucket.
The input is the binary files of derived data (bucket and comparison string binary files) from
the previous step. The binary files are read into memory to speed up all the comparison
calculations.
The output is additional binary files that represent the entity link and task groupings.


Step 3

Link Entities (mpxlink) utility: This utility takes the comparison results and creates entity
link and task files that can be loaded into the database.
The input is the binary files of comparison results (entity link and task groupings) from the
previous step. These files are read into memory for faster processing.
The output is additional unl text files that contain the EID assignments (entlink), tasks
(enttsk), and EID history (entxeia).
The Compare Members in Bulk (mpxcomp) and Link Entities (mpxlink) utilities must be run
once for each type of entity (for example: 'identity' and 'household'), as the outcome will be
different per entity type.

Step 4

Load UNLs to DB (madunlload) utility: This utility takes the 'unl' text files created during
previous steps and loads them into the database.
The Load UNLs to DB (madunlload) utility loads the core member data 'unl' files and the
derived data 'unl' files (the outputs from the Create Core Data and Create Derived Data
steps).
After the database is loaded, the Identity hub engine is started and real-time operation
begins.

How does the BXM impact the database?


The BXM process is conducted external to the hub, using the binary files that came out of the Derive Data and Create UNLs (mpxdata) derivation process, and establishes entities and tasks at thousands of records per second. By conducting this process outside of the database, we reduce the drag on the server. Loading the entity and task tables also allows us to understand the volume of work that the hub has performed automatically and the volume of tasks that will need to be reviewed by human eyes.



Unit 11. Analyzing thresholds and matched pairs

Overview
Once your data is fully loaded into the IBM Initiate Master Data Service software, you
should run tests to establish how well your system and the data are performing. Through
the Generating Matched Pairs job and the Threshold Calculator in Workbench, you can
perform your threshold analysis.

Dependencies
Your core engine and data must be fully loaded in order to run the data analytics. Threshold
analysis can be done within Workbench.

Topics
This unit will cover:
Threshold analysis overview
Analyzing Matched Pairs


Threshold overview
Your hub uses set scores to determine what action to take as a result of a comparison.
These scores are referred to as thresholds.
The Autolink (AL) Threshold is the cutoff score where two members will be linked
together by the hub. This upper threshold reflects the score at which the organization is
confident that a match represents the same person. We set this based on the
organization's tolerance for false positives.
The Clerical Review (CR) Threshold is the cutoff score where two members will be
assigned a task for manual review by a user. This lower threshold reflects the score at
which the organization wants to manually review matches. We set this based both on the
organization's tolerance for false negatives, and on the number of matches they are willing
to review based on the cost of manually reviewing these tasks.
Members whose scores fall below the CR threshold will not be linked or assigned to a task
- it is assumed that these members are not related given the information currently
available. If the member records are updated causing the scores to improve, then the hub
might assign them to a task or link them.
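A sketch of the decision the hub makes with these two cutoffs follows; the threshold values here are invented examples, since real values are set per project:

AL_THRESHOLD = 11.0   # autolink cutoff (example value)
CR_THRESHOLD = 7.5    # clerical review cutoff (example value)

def disposition(score: float) -> str:
    """Apply the two thresholds to a comparison score."""
    if score >= AL_THRESHOLD:
        return "autolink"         # linked automatically by the hub
    if score >= CR_THRESHOLD:
        return "clerical review"  # a task is created for manual review
    return "no action"            # assumed unrelated, for now

disposition(12.3)  # 'autolink'
disposition(9.0)   # 'clerical review'
disposition(3.1)   # 'no action'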

Importance of thresholds
The accuracy of our solution is measured in terms of two potential types of error, and we establish the two thresholds to minimize both: False Negatives and False Positives.
The Auto Link threshold minimizes False Positive matches. These occur when you match two records that do not represent the same person.
The Clerical Review threshold minimizes False Negative matches. These occur when you fail to match two records that represent the same person.


Possible matching errors


Let us learn more about the possible matching errors that can take place based on the
available information.

False negatives
The empty arrows on the left are members that should be linked but are not because of
insufficient data. These are False Negatives. If all that is known is a name, then there is not
enough data to make a sound linking judgment. False negatives can possibly be reduced
by tweaking the algorithm, but are most commonly the result of poor data collection.

False positives
The dark arrow on the right that should not be linked is the result of data that is very similar;
usually a set of twins, or parent and child records. This is a False Positive. There are
several ways to deal with false positives. Most commonly, we use the False Positive Filter
to issue weight penalties for subtle differences between records.


Threshold analysis overview


In order to determine where to set our thresholds, we need to examine a sample set of matched pairs from our Bulk Cross Match. As we go through the pairs, we will see patterns in them. We will see where the pairs transition from no to maybe, and from maybe to yes. It is at these transitions that we want to set our thresholds.

Objectives of threshold analysis


We perform threshold analysis for various reasons:
Determine the threshold at which you are confident that two records represent the same
person
Determine the threshold at which you want to manually review potential duplicates
Validate that the algorithm we have configured is producing meaningful matches
Confirm that the weights we generated make sense
Identify general data quality issues

Conducting a pre-threshold analysis internal review


Before having others review your sample pairs, you should do your own review to check
the quality of your algorithm and weights. In general, your internal review will probably be
more extensive and include more data than what you will ultimately give to your
organization for their review. For your internal review, plan on generating up to 20 sample
pairs per tenth point.
Some of the key things to look for are:
Are there False Positives in your data set?
Are there additional anonymous values or nicknames that should be added?
Overall, do the scores make sense?
Do higher scoring pairs have more attributes in common than lower scoring pairs?
Is your algorithm comparing and scoring members as you would expect?

Conducting the internal review


You usually want at least two reviewers, ideally a Project Manager and a Technical Analyst, to review the pairs and then compare notes. There should always be a small score range where you see a switch from a mixture of good and bad links to scores where 99% of the links are good ones. That is your first approximation of where the AL threshold should be located.


Looking for false positives


Looking for False Positives usually means eyeballing the sample pairs and looking for
certain telltale signs. In health care situations in particular, you will be looking for high
scoring pairs with attribute differences between name, date of birth, and/or SSN.
It is a given that your lower scores are going to have a lot of false positives, so our goal is to
locate the highest scoring false positives, usually 10s or higher, depending on the project. If
we can eliminate the higher scoring false positives, then the organization will most likely be
able to implement a lower AL threshold.

Resolving false positives with the false positive filter


Usually you will be able to identify trends in the false positives, such as twins, father/son, or husband/wife linkages, that can be penalized and eliminated with the False Positive Filter, or FPF. The FPF uses the following rules (sketched in code after this list):
If there is a complete mismatch on the name, the FPF is set to true.
If there is a phonetic or partial name match AND either DOB or SEX disagrees, the FPF is set to true. (Note that FPF looks at DOB or SEX only if there is a partial match.)
If a name suffix disagrees, the FPF is set to true.
ANON values are treated as a 0 in comparisons and are treated neutrally by FPF.
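As a rough sketch of those rules (compare_names is a hypothetical helper standing in for the algorithm's name comparison, and anonymous values are assumed to have been mapped to None beforehand so they stay neutral):

def disagrees(x, y) -> bool:
    """Two values disagree only when both are present and different."""
    return x is not None and y is not None and x != y

def false_positive_filter(a: dict, b: dict) -> bool:
    """Return True when the pair should be penalized as a likely
    false positive, per the rules above."""
    # compare_names returns 'exact', 'partial' (including phonetic),
    # or 'mismatch'; a hypothetical helper, not a real hub API.
    name_result = compare_names(a["name"], b["name"])
    if name_result == "mismatch":
        return True                          # complete mismatch on the name
    if name_result == "partial" and (disagrees(a["dob"], b["dob"])
                                     or disagrees(a["sex"], b["sex"])):
        return True                          # partial name + DOB/SEX conflict
    if disagrees(a.get("suffix"), b.get("suffix")):
        return True                          # Jr. versus Sr., for example
    return False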
As you identify the trends creating false positives, start an iterative process in which you augment the algorithm as necessary by adding or updating the False Positive Filter settings. Then rerun the compare and link steps and review the new linkages, confirming that the false positives you targeted from the last run are no longer being linked.
Eventually, you will become confident that you have eliminated as many false positive trends as you can, and you are ready to pass the samples on to the customer for review.
This process varies from project to project, and depending upon your organization's tolerance for false positives, you might need to involve them in these decisions. It is important to remember that it is possible to introduce false negatives while trying to eliminate false positives when applying the False Positive Filter.

Conducting threshold analysis with organizations


Besides setting the ground rules, preparing organizations for threshold analysis really means educating them about our software and setting their expectations for what they will encounter when they review the sample pairs. When generating sample pairs for organization review, generate 8 sample pairs per tenth point as a minimum, or 10-12 pairs per tenth point for a more thorough analysis.


Reviewing the data


Organizations should understand that some of the matches will be obvious and easy to
assess, particularly at the low scores and the high scores. We intentionally include a broad
range of scores, from bad matches to ambiguous matches to good matches, to make sure
organizations see different types of data. This helps us ensure that the thresholds are
statistically sound.
They will see matches that do not have enough information to suggest they are or are not
the same person. In these cases, remind organizations that they can mark these as not
having enough information to decide if there is a match or not. They should not guess or try
to research whether or not these pairs represent a match.
They will also occasionally see anonymous or invalid values, such as an SSN of 000-00-0000. They should understand that our algorithm ignores these invalid values (it treats them the same as missing values).

How long does it take to review matched pairs?


We find that reviewers can usually review an average of about 10 matches per minute, or
roughly 1,000 matches in 2 hours. Organizations should understand that before getting
started so they know to focus on reviewing the pairs, and not trying to fix problems.

Interpreting results from organization sample pair review


Once they have evaluated and scored the sample matches, we can make algorithm
improvements and perform another round of threshold analysis. We then use the
organization's evaluations to determine the appropriate thresholds.
Usually, we have multiple people on the organization side review the same batch of sample pairs. So the first thing you want to do when you get their results back is review them and ask yourself whether you agree with their decisions. Write down questions about decisions you find curious and save them for future discussion. Also compare each person's results with the others'. In general, are these people making consistent decisions? If not, you will need to discuss the differences with them and hopefully reach a consensus on same person/not same person decisions.

Now what?
At this point, threshold analysis becomes iterative, and the specific steps will vary from
organization to organization.
You will most likely need to make adjustments to your threshold settings, run another bulk
cross-match, and show the organization another set of matched pairs highlighting the
difference between the new and old threshold settings.


Initiate pair manager


Pair Manager is a standalone tool that reads the sample pair file (.xls) created from the
Threshold Analysis Pair Generation job. The Threshold Analysis Pair Generation job
supports dividing the generated sample pairs into multiple .xls files.
The Pair Manager interface shows each pair side by side with the matching data highlighted. For each pair viewed, you can mark whether it is a match, a non-match, or maybe a match. Next/Previous buttons allow easy scrolling through pairs, and a filtering feature controls which pairs you review. You can choose to view:
All pairs
Undecided pairs
Matching pairs
Non-matched pairs
Maybe pairs
By default, the Pair Manager displays all the data fields contained in the sample pair file.
However, you can easily reconfigure the interface to display only the fields you want to
compare and the order in which to display them. The easy access enables you to start and
stop your review as needed, and to repeat the review process multiple times if necessary.
Once the review is complete, you can import the sample pair .xls files into the Workbench
Threshold Calculator to configure your thresholds.

Exercise
Now it is time to perform Exercise 9, taking approximately 45 minutes.



Unit 12. Analyzing buckets and frequency-based bucketing

Overview
Once your data is fully loaded into the IBM Initiate Master Data Service software, you
should run tests to establish how well your system and the data are performing. You can
assess score distribution, entity size, and bucket size through the analysis tools in Workbench.

Dependencies
Your core engine and data must be fully loaded in order to run the data analytics. Analysis
can all be done within Workbench.

Topics
This unit will cover:
Analyzing buckets overview
Configuring Frequency Based Bucketing


Analyzing buckets overview


An organization's searching requirements, performance goals, and population size are key when designing bucketing algorithms. Bucketing technologies and techniques should be selected after a careful review of the organization's needs and source data characteristics. Improperly applied bucketing methods can lead to excessively large buckets. Also consider how the population will grow once in production, and try to design your bucketing to compensate for future growth.

Data set size ranges


The following guidelines might be useful when designing a bucket strategy.

Data sets less than 500K members


Sorted phone and SSN buckets as well as 2-token meta name buckets can typically be
used to create reasonable size buckets. It's usually safe to design the bucketing so that
Search Only can be performed with only a single attribute populated. This is unlike larger
population sizes, which usually require a combination of attributes to avoid large buckets.
In this size range the potential for excessively large buckets is less of a concern. Still be
sure to carefully review the sizes and monitor them as appropriate for the data.

Data sets 500K or larger


Larger populations require very careful bucket design. Generally, you must combine
attributes that are more commonly shared between records, such as first and last name
and date of birth, to create a more unique bucketing pair. This is especially important when
using phonetic name bucketing. A 2-token phonetic name-only bucket can result in very
large buckets for larger populations. This happens because phonetic equivalents are often
the same across many different names, so it's usually best to implement a 1 token phonetic
name + DOB or 1 token phonetic name + zip bucket so that overall bucket sizes are limited.
Extremely large populations of a million or more might require 2 name tokens + zip.

Choosing attributes
An attribute can be used in multiple buckets. For example, using a 2-token equi-name
bucket (using the name as is) as well as a single-token phonetic name + DOB/zip bucket
might be helpful. Having both buckets will allow users to search for a specific record even
though Frequency-Based Bucketing (FBB) might have eliminated either one of the buckets.
Also remember that as population increases, techniques such as phonetics or sorting can
result in an exponential increase in members that share the same bucket. Because of this,
sorting of phone and SSN should usually be avoided for large populations. Instead, it might
be better to use ASIS bucketing for SSN, phone, or other numeric identifiers. You lose the
ability to compensate for typing errors because of this, but usually the performance gain is
more important.


How to bucket


Person Name: You can choose to bucket on one or more name tokens. The more tokens you bucket on, the more unique your bucket string becomes, resulting in smaller buckets. Sorting or phonetics reduces the uniqueness of your strings, resulting in larger buckets. For larger populations, try combining a single name token with phone, DOB, or zip to reduce bucket size (see the sketch after this list).
Street Address: Bucketing on the street address line is typically not recommended.
Phone: Although area codes are generally dropped from the bucketing string, the remaining 7 digits are still fairly unique across even a large population. Phone is usually bucketed by itself. Avoid sorting in large populations, as it will result in large buckets.
SSN/DLN/other unique identifiers: Usually bucketed by themselves. Avoid sorting in large populations, as it will result in large buckets.
Gender: Not usually used in bucketing.
DOB: Usually used in conjunction with one or more name tokens.
Zip: Usually used in conjunction with one or more name tokens.
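To make the composition idea concrete, here is an illustrative sketch. The hub derives and hashes bucket strings internally, so the md5 hash and the field names used here are stand-ins, not the product's actual scheme:

import hashlib

def bucket_hash(*tokens):
    """Combine tokens into one bucket hash; a missing token means
    no bucket is produced at all."""
    if any(not t for t in tokens):
        return None
    joined = "+".join(str(t).upper() for t in tokens)
    return hashlib.md5(joined.encode()).hexdigest()[:12]

member = {"first": "Maria", "last": "Sanchez", "zip": "10001",
          "phone": "2125553233", "dob": "1975-03-02"}

# One hash per bucket definition; a member usually lands in several buckets.
buckets = [
    bucket_hash(member["first"], member["zip"]),   # name token + zip
    bucket_hash(member["last"], member["dob"]),    # name token + DOB
    bucket_hash(member["phone"][-7:]),             # phone without area code
]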

Generating a bucket analysis overview


When you first access the Analytics Perspective, the four quadrants will all appear dormant. You will need to load the data from your hub into the bucket analysis, which is done by running a series of jobs. Because these queries can take a while to process, the bucket analysis statistics can be stored locally as a snapshot. This will save considerable time when you review your buckets later.


Members not in a bucket


When a member is not contained in any bucket, the only way to retrieve it is by unique identifier; it will never be returned by a standard attribute search. Usually this happens with very sparsely populated members or members composed mainly of junk or anonymous values.
It is important to confirm that members with no rows in mpi_membktd are in fact made up of missing or anonymous values. Verify that the missing rows are a product of correct behavior. If a large number of legitimate members are missing from mpi_membktd, perhaps it is because you are bucketing on fields that are missing from those members. For example, if they have only name and address information while you are bucketing on name+phone or name+DOB, no buckets will be created, since the DOB or phone data is missing. If this is not acceptable, the algorithm should be updated so that buckets are created for the affected members.

Large buckets: 2,000+


Large buckets are generally defined as those with 2,000 members or more. Simply using FBB to eliminate the largest buckets is not sufficient. Check for cases where there are hundreds or thousands of buckets with 1,000-2,000 members; buckets in that range would just slip under what was targeted with FBB. If a large quantity of medium to large-sized buckets is left after applying FBB, then the algorithm might need to be tuned, or the FBB maxbucketsize setting might need to be lowered. Having a large number of medium to large-sized buckets can affect performance just as adversely as having a few huge buckets.


How to look up the actual value of the bucket


After you reload your data, you might be wondering what the bucket hash value represents.
You can see the Bucket Value in the results.


Visualizing bucket size distribution


The Bucket Size Distribution graph gives you the ability to visualize the size of your buckets and see how your buckets could hinder performance. The gradation from green to red indicates that buckets on the far right, in the red area, are larger than recommended. Buckets in the yellow are ones to monitor, as they might soon exceed the recommended size.

Note

Each blue dot on the graph is a link to the bucket that it represents. Double-click a point on
the graph to pull up the details about that bucket.



Note

The View Bucket and View Algorithm buttons on the right allow you to see more information about the members in a particular bucket, as well as the element in the algorithm that defines the bucket.


Seeing member bucket frequency


Member Bucket Frequency tells you how many buckets a particular member is in. As the bars go toward the right, you see an increase in the number of buckets that one member has. The height of each bar in the graph represents the number of members that have that volume of buckets.
EQMETA is the culprit behind your members with the most buckets. This is caused by nicknames that have multiple formal names associated with them. As with the other graph, you can click a bar to see more details.


Viewing a member's bucket values


The Member Bucket Values tab gives you the ability to see which buckets a particular member is associated with. This does require the member record number as input. Simply enter the memrecno and click Reload.


Member comparison distribution


The member comparison distribution displays the average number (Median and Mean) of
comparisons that are performed when you search for a member. This is derived as part of
the bucket analysis because the more members you have in a single bucket, the more
comparisons are made during a search. The more members you compare, the longer the
delay is before the results are returned.

Note

An average of around 1000 members compared per search still returns results in
subsecond time in most environments. In training, our machines might perform a little
slower.


Frequency-based bucketing


At one time or another, we have all been bombarded by too much information. Imagine a Google search that returns 400 results. Finding the information you need becomes the proverbial needle in the haystack.
You need to work smarter, not harder, and refine your search so the information is easier to find.
The same problem occurs in bucketing. The usefulness of a bucket is lessened when it has too many members. When extremely large buckets occur, we can use frequency-based bucketing (FBB) to put a lid on the bucket and prevent additional members from being added.

When and why is FBB used?


Frequency-based bucketing cuts off searching against a particular bucket hash once the number of members in that hash becomes too high to provide meaningful and/or fast enough results. If there are 10,000 members in the system whose derived data includes a particularly common bucket, then whenever your search includes that bucket, the system can take a long time to retrieve the attribute data for all those members.
Imagine you are bucketing on first name+zip. How many Marias do you think there are in a New York zip code?
Searching a New York Initiate database for Maria Sanchez 10001 will create a buckethash out of MR+10001 and do a select on mpi_membktd. The result might return 50,000 candidates against that hash and, depending on the speed of your system, might take upwards of 5 minutes to return results.
The IBM Initiate Master Data Service engine retrieves all the candidates and then performs the scoring and filtering across all of them. Over time, organizations might start to notice performance problems associated with searches and cross-matching. Raw bucket sizes will usually grow with data sizes. If they grow faster, there is likely a different problem, such as anonymous values that need to be added to mpi_stranon.

Note

Frequency-Based Bucketing is not a dynamic monitoring tool. You will need to invoke the
FBB analysis process every 6 to 12 months to keep on top of your largest buckets.

The average recommended Maximum Bucket Size for Frequency-Based Bucketing is 2,000.


How is FBB implemented?


There are a few elements that combine to make FBB work (a conceptual sketch follows this list):
In the Bucket Group properties in the algorithm editor, determine a cutoff for your maximum bucket size, which is the total number of acceptable members in any given bucket.
Determine the minimum attribute tokens that must exist for the bucket to be created.
Determine the maximum attribute tokens that combine to make a bucket.
Run the Generate Frequency Stats (mpxfreq) job in FBB mode; the system notes the buckets that have too many members and turns them off. No additional members will be added, and the bucket will not be searchable.
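Conceptually, the FBB pass boils down to a frequency count and a cutoff, as in this sketch (an illustration of the idea, not the mpxfreq implementation itself):

from collections import Counter

def oversized_buckets(member_buckets, max_bucket_size=2000):
    """Count members per bucket hash and flag every hash that exceeds
    the cutoff, so those buckets can be disabled."""
    counts = Counter()
    for bucket_hashes in member_buckets.values():
        counts.update(bucket_hashes)
    return {h for h, n in counts.items() if n > max_bucket_size}

# member_buckets maps memrecno -> set of bucket hashes for that member.
# A disabled hash stops accepting members and is skipped in searches.
disabled = oversized_buckets(member_buckets={}, max_bucket_size=2000)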

Exercise
Now it is time to perform Exercise 10, taking approximately 40 minutes.


Unit 13. Reiterating the process

Overview
You will take the results of the analysis and make tweaks to your algorithm and data
dictionary, if necessary. After your edits, you will usually re-derive your data, run another
BXM, and analyze the results again.

Dependencies
Bucket design changes usually require re-deriving, but not another BXM. Comparison
changes require new weights, re-derivation, and a new BXM. Some small tweaks only
require an engine restart or simply redeploying your configuration.

Topics
This unit will cover:
Deploying a new configuration
Deriving data again
Rerunning bulk cross match
Running Entity Analysis


Reiterating after configuration changes


Depending on the changes made while analyzing and reviewing your implementation and data, you will need to identify which jobs are affected by those changes and then rerun them. The table below shows which jobs need to be run after each kind of change, using abbreviated job names; refer to Re-deriving data later in this unit for the full names.
Table 13-1: Configuration changes/reiteration matrix
X = Mandatory, O = Optional. Job columns, in pipeline order: Save and Deploy, MPX Data, MPX FSDVD, MPX REDVD, MPX PREP, MAD UNLLOAD, Generate Weights, MPX COMP, MPX LINK, MAD UNLLOAD.
Initial Load and Extract            X X X X X X X
Changed Algorithm Bucket Only       X X O X O O O
Changed Algorithm Comparison Only   X X O X X X X X
Added a New Source                  X X X X O X X X
Added a New Attribute               X X O X X X X X
Update Thresholds                   X X X X


Redeploying the configuration


After all the changes you have made, you need to redeploy the configuration that you have created to the hub.

Re-deriving data
There are multiple ways to re-derive your data when you have made changes to your data
model or algorithm, such as adding new Anonymous Values, designing new Bucket
strategies, or changing Standardization functions.
Derive Data and Create UNLs (mpxdata) starts with raw data in a flat file. You can choose to assign bucket hashes, create comparison strings, and compile binary files. You can also choose between MEMCOMPUTE and MEMPUT modes. MEMCOMPUTE writes the results of Derive Data and Create UNLs (mpxdata) to UNLs (pipe-delimited unload files); MEMPUT writes directly to the database without creating a set of UNL files. MEMCOMPUTE is a good option for documenting your progress, but it does not allow any duplication of the Source and MemIdNum combination (essentially one record per unique individual). MEMPUT, by contrast, can process multiple records for the same Source and MemIdNum (such as transactional data or historical information about the record). So, if you need to import history into the database in bulk (beyond using the API or message brokers), MEMPUT is the best way to add that data.
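
The MEMCOMPUTE constraint can be pictured with a small sketch. The following is illustrative only, not Initiate code; the record fields are hypothetical. It flags extract rows that share a Source and MemIdNum combination, which MEMCOMPUTE would reject and MEMPUT would accept:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative pre-check: MEMCOMPUTE accepts only one record per unique
// Source + MemIdNum combination, so duplicate keys must be resolved first
// (or the data loaded with MEMPUT instead).
public class ExtractCheck {
    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[]{"SRC_A", "1001", "DOE,JANE"},
                new String[]{"SRC_A", "1001", "DOE,JANE E"},  // duplicate key
                new String[]{"SRC_B", "1001", "SMITH,TOM"});
        Set<String> seen = new HashSet<>();
        for (String[] rec : records) {
            String key = rec[0] + "|" + rec[1];               // Source + MemIdNum
            if (!seen.add(key)) {
                System.out.println("Duplicate Source/MemIdNum " + key
                        + ": use MEMPUT or de-duplicate before MEMCOMPUTE");
            }
        }
    }
}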


Derive Data from UNLs (mpxfsdvd) re-derives from the UNL files to create new buckets,
comparison strings, and binary files. Derive Data from UNLs (mpxfsdvd) is good when you
have made changes to your algorithm, but you are working with a static set of records (like
a sample extract before go live). You can select the specific elements (buckets, comparison
strings, or binaries) you want to re-derive.
Derive Data from Hub (mpxredvd) looks at the data in the database tables and goes line
by line to create new buckets, comparison strings, and binary files. Derive Data from Hub
(mpxredvd) is good to use when you have consumed inbound broker messages or API
inputs that did not exist in your original data extract. You can select the specific elements
(buckets, comparison strings, or binaries) you want to re-derive.
Prepare Binary Files (mpxprep) is used for incremental bulk cross matches (ixm). When
multiple systems have loaded data, Prepare Binary Files (mpxprep) will rebuild the binary
files since there is no single extract file.
Use the table below to determine which method of deriving data is best (O = Optional, X = Default).

Derive Data and Create UNLs (mpxdata), reads data from an extract file:
  Parse Member UNLs: X; Assign Bucket Hashes: O; Create Comp String: O; Compile Binary Files: O
Derive Data from UNLs (mpxfsdvd), reads existing UNL files:
  Assign Bucket Hashes: O; Create Comp String: O; Compile Binary Files: O
Derive Data from Hub (mpxredvd), reads data in the MEM tables:
  Assign Bucket Hashes: O; Create Comp String: O; Compile Binary Files: O
Prepare Binary Files (mpxprep), reads data in the MEM tables:
  Compile Binary Files: X
Member Model Transform Graph, uses CloverETL:
  Parse Member UNLs: X

Running entity analysis


Entity Analysis can be viewed in the Analytics Perspective and can be used to see how
your entity information might impact performance, especially during linking and task
creation.

Viewing entity size distribution


Entity Size Distribution gives you insight into the overall composition of your entities. The entity size distribution graph is accessed from the Analytics perspective.


Viewing entity composition


Entity Composition gives you the ability to see which records make up a particular entity.
You also have the ability to see the comparison string data about that member, grouped by
the comparison role number.

Comparing members
Member Comparisons are a great way to see how your member records score against one another. They not only let you see why two records did or did not join, but also how one attribute might be exerting too much influence in the matching and linking process.

Member comparison output (MCC codes)


Code       Value
MISSING    M
DIFFERENT  D
EQUAL      E
PARTIAL    P
TRUE (1)   T
FALSE (2)  F

For the bottom section, where things are compared on a token-by-token or word basis, the codes can be:

Code       Value  Description
EQWRD      X      Equal at a word or token level.
EQINI      Y      Equal at a single-character initial level (B = B, not Bill = B).
PHONETIC   P      Phonetic match.
NICKNAME   N      Nickname match.
NICKMETA   O      Phonetic nickname match.
INITIAL    I      Equal at an initial level, but not a direct match of a single character; for instance, Bill = B.
DIFFERENT  D      Different.
MISSING    M      One member is missing a value.
EDITDIST   E      Edit distance comparison.
ACRONYM    A      For example, ABS = American Bakery Supplies.

(1) Used by the False Positive Filter
(2) Used by the False Positive Filter
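
To make the token-level codes concrete, here is a minimal, hypothetical classifier for a pair of name tokens that follows the EQWRD, EQINI, INITIAL, DIFFERENT, and MISSING definitions above. It is an illustration, not Initiate's comparison code, and omits the phonetic, nickname, edit-distance, and acronym cases:

// Hypothetical classifier for two name tokens, following the MCC code
// definitions above. This is an illustration, not Initiate's implementation.
public final class TokenCompare {

    public static String classify(String a, String b) {
        if (a == null || a.isEmpty() || b == null || b.isEmpty()) {
            return "MISSING";                         // one member is missing a value
        }
        a = a.toUpperCase();
        b = b.toUpperCase();
        if (a.equals(b)) {
            // B = B is a single-character initial match (EQINI);
            // BILL = BILL is a full word/token match (EQWRD).
            return (a.length() == 1) ? "EQINI" : "EQWRD";
        }
        // Bill = B: one side is a single character matching the other's initial.
        if (a.length() == 1 && b.charAt(0) == a.charAt(0)) return "INITIAL";
        if (b.length() == 1 && a.charAt(0) == b.charAt(0)) return "INITIAL";
        return "DIFFERENT";
    }

    public static void main(String[] args) {
        System.out.println(classify("B", "B"));       // EQINI
        System.out.println(classify("BILL", "BILL")); // EQWRD
        System.out.println(classify("BILL", "B"));    // INITIAL
        System.out.println(classify("BILL", "TED"));  // DIFFERENT
    }
}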


Score distribution
Score Distribution provides the volume of matched pairs by score for potential duplicates
within all sources.
This information is often used with organizations during Threshold Analysis so they can get
a better sense of how many tasks and autolinks could result from a particular threshold
setting.
In the first example, we can see how the raw results of our Score Distribution query appear. Generally, we take those results, put them into a chart, and present them to an organization as part of Threshold Analysis.
In the sample chart, we can see higher volumes of record pairs scoring at 10 and below.
When reviewing this chart with an organization, we would explain that they should use this
information to understand how setting thresholds will affect the numbers of tasks and
autolinks that get generated by the hub. If we set our Clerical Review threshold at 7 and our
Autolink threshold at 11, all of those record pairs between 7 and 11 would result in tasks
that the organization would need to resolve. Everything above 11 would be Autolinked
together.
Score Distribution is not the only data point used for threshold analysis - we would also
review sets of sample matched pairs, and do some statistical analysis to determine where
to set thresholds. But Score Distribution gives us an important piece of information about
how our thresholds will impact the amount of work the organization needs to do once the
hub goes into production.
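
To make the threshold arithmetic concrete, the sketch below counts, for a hypothetical list of pair scores, how many pairs would become clerical review tasks and how many would autolink, using the example thresholds of 7 and 11 from the discussion above (the handling of a score exactly at the autolink threshold is an assumption here):

import java.util.List;

// Illustrative only: counts tasks vs. autolinks for sample pair scores,
// using the threshold semantics described above.
public class ThresholdCount {
    public static void main(String[] args) {
        double clericalReview = 7.0;   // example Clerical Review threshold
        double autolink = 11.0;        // example Autolink threshold
        List<Double> pairScores = List.of(5.2, 7.4, 9.9, 11.0, 12.3, 6.9);

        long tasks = pairScores.stream()
                .filter(s -> s >= clericalReview && s < autolink)
                .count();              // pairs routed to clerical review
        long autolinks = pairScores.stream()
                .filter(s -> s >= autolink)
                .count();              // pairs linked automatically

        System.out.println("Tasks: " + tasks + ", Autolinks: " + autolinks);
    }
}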


The score distribution query


This query shows the distribution of scores for all the record pairs in the system. Single-member entities or entities with more than two member records are not included in the results.
The number of pairs for each score is actually the sum of all counts in a given score range.
For example, an x-axis score value of 27 represents all pairs that score between 26.1 and
27.0.
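
Put differently, a pair's score is counted under the x-axis value equal to its ceiling. A minimal sketch of that binning rule (illustrative, not the product's actual query):

import java.util.Map;
import java.util.TreeMap;

// Illustrative binning: a pair scoring s is counted under the x-axis value
// ceil(s), so 26.1 through 27.0 all land in the bucket labeled 27.
public class ScoreHistogram {
    public static void main(String[] args) {
        double[] scores = {26.1, 26.7, 27.0, 27.1, 5.3};  // sample pair scores
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (double s : scores) {
            int bucket = (int) Math.ceil(s);
            histogram.merge(bucket, 1, Integer::sum);
        }
        System.out.println(histogram);  // {6=1, 27=3, 28=1}
    }
}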
The view can be filtered to show entities from the checked sources only. If an entity comprises members in both checked and unchecked sources, the size shown for the entity is a count of the member records in the checked sources only.
If no results show for a particular linkage type, there might not be any entities meeting the criteria for that linkage type and/or set of selected sources.

Note

Values on the x-axis that do not show a bar have at least 1 entity matching the size
specified on the x-axis but not enough members to make the bar visible in the chart.


Member overlap
Member Overlap provides the number of entities that have member records in multiple
sources. Member Overlap can be expressed as a total number of entities and also as a
percentage of the total number of records in each source.

Member overlap query


This query provides information on the number of overlaps in the hub.
An overlap exists when an entity has records from multiple sources. For example, if an
entity with three records exists, and each record is in a separate source system, then each
source would be said to have two overlaps in it (A with B, A with C, and so on).
The first column group just shows the number of unique entities represented in the
specified source as well as the percentage of all entities that are represented by a record in
that source.
The second column group shows the count and percent of those entities that have overlaps
in at least one other source (those entities have at least one record in another source).
Entities with overlaps in multiple other sources are only counted once in these two
columns.
The remaining column groups are for each source by source combination. When the row
and column source is the same, the count is simply the count of entities in that source, and
percent will always be 100%. However, when the row and column sources are unique, the
count represents the number of overlaps that exist between the row source system and the
column source system. The percent value then represents the percent of entities in the row
source that have overlaps in the column source.
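
As a concrete illustration of these counting rules (not the product's actual query), the sketch below computes per-source entity counts and pairwise overlap counts from a hypothetical mapping of entities to the sources that hold their records:

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative overlap counting: an entity "overlaps" between two sources
// when it has at least one record in each of them.
public class MemberOverlap {
    public static void main(String[] args) {
        // Hypothetical data: each entry is the set of sources holding an entity's records.
        List<Set<String>> entities = List.of(
                Set.of("A", "B", "C"),   // one entity spread across 3 sources
                Set.of("A", "B"),
                Set.of("C"));

        Map<String, Integer> perSource = new TreeMap<>();
        Map<String, Integer> pairwise = new TreeMap<>();
        for (Set<String> sources : entities) {
            for (String s : sources) {
                perSource.merge(s, 1, Integer::sum);
                for (String t : sources) {
                    if (!s.equals(t)) {
                        pairwise.merge(s + "-" + t, 1, Integer::sum);
                    }
                }
            }
        }
        System.out.println("Entities per source: " + perSource);      // {A=2, B=2, C=2}
        System.out.println("Overlaps per source pair: " + pairwise);
        // {A-B=2, A-C=1, B-A=2, B-C=1, C-A=1, C-B=1}
    }
}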


Exercise
Now it is time to perform Exercise 11, taking approximately 90 minutes.


Unit 14. Managing users, groups, and permissions

Overview
Initiate supports user and group management through the LDAPv3 standard. The Master
Data Service includes a default LDAP repository and a Workbench module for creating and
managing groups and users. You have the choice of using the default LDAP repository, or
you can optionally integrate with a separate enterprise directory server to manage your
users and groups.

Dependencies
You must have Workbench and the Master Data Engine software installed with an LDAP
server enabled.

Topics
This unit will cover:
Initiate's Model for Managing Users, Groups, and Permissions
Sample LDAP Configurations
Default Initiate Groups and Users


Managing groups and users


The Initiate Master Data Engine gives administrators the ability to control user access to data and system operations at a very granular level. Using Initiate's security model, you have the ability to control access to:
Specific sources, member types, and attributes
Specific relationship types
Composite views
Interactions
Initiate reports or particular Workbench features
Specific permissions are configured in the User Management perspective.


Working with LDAP


LDAP is the Lightweight Directory Access Protocol. This protocol is used to access information stored in a directory, which holds a wide range of data including user names, user groups, security access, and passwords. LDAP is not a relational database; instead, it processes queries and updates against an information directory.
LDAP works on many different platforms and with numerous applications because it is an Internet standard. Interacting with any LDAP server uses the same protocols, connection packages, and query commands, which makes LDAP servers easy to install, maintain, and optimize.
LDAP securely delegates read and modification authority using an Access Control List. This list controls what information users can see and modify across numerous user applications and locations. For example, you can give a user read-only rights in IBM Initiate Inspector.
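
Because any LDAPv3-compliant server speaks the same protocol, a standard client library is all that is needed to talk to one. Here is a minimal sketch using Java's built-in JNDI LDAP provider; the host, port, password, and search base are hypothetical placeholders, and the Bind DN follows the format shown later in this unit:

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class LdapQuery {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://localhost:389");   // hypothetical host/port
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL,                       // Bind DN (format from this unit)
                "cn=system,ou=System,ou=Users,dc=initiatesystems,dc=com");
        env.put(Context.SECURITY_CREDENTIALS, "password");        // placeholder credential

        InitialDirContext ctx = new InitialDirContext(env);       // binds (authenticates)
        SearchControls sc = new SearchControls();
        sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
        NamingEnumeration<SearchResult> results =
                ctx.search("ou=Groups,dc=initiatesystems,dc=com", "(cn=*)", sc);
        while (results.hasMore()) {
            System.out.println(results.next().getNameInNamespace());
        }
        ctx.close();
    }
}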

Typical server configurations


The LDAP servers can be embedded into the Master Data Engine or be standalone. The LDAP server might be internally managed using Workbench, or externally managed using the organization's existing LDAP management tool. Some common configurations:
Embedded LDAP: The LDAP server is part of the hub. In the MADCONFIG create_instance script you would enable the embedded option, but not the cluster option.
Standalone LDAP: The LDAP servers are not part of the hub but are internal or external standalone servers. In the MADCONFIG create_instance script you would not enable the embedded option and would point to the existing LDAP servers when prompted. If LDAP servers do not already exist, the MADCONFIG create_ldap script should be run prior to starting the hub's service.
Embedded LDAP/Standalone Combinations: The hub server has an embedded LDAP server and is also connected to a standalone internal or external LDAP server.
- Internal: Implementations with multiple Master Data Engine instances will often use a cluster of embedded and standalone LDAP servers.
- External: Implementations where the organization will be using their corporate directory server will use a combination of the external LDAP server and an internal Initiate server. The internal server is required because the hub has internal system users, defined in the internal server, that are required for the engine to operate.
- If you use an external corporate directory server, the hub communicates with the external directory server via the LDAP protocol to request user/group information and perform authorization. Initiate supports the use of any LDAPv3-compliant server.


Components
Several components are involved in managing users and groups in the Master Data
Service.

Workbench
Within the LDAP perspective in Workbench, you have the ability to connect to your LDAP
server to create and maintain users and groups.
Within the Configuration perspective in Workbench, the Groups tab allows you to
synchronize the Groups in your LDAP server with the hub database, and assign specific
permissions to those Groups.

Data model
Several Initiate database tables play a role in user access and security. While the LDAP repository manages user accounts and group assignments, specific permissions and audit trail information are stored in the Initiate database.
mpi_grphead
- The names of default Initiate groups as well as any groups created in an external
LDAP repository are stored in mpi_grphead.
mpi_grpxseg, mpi_grpxixn, mpi_grpxcvw, mpi_grpxapp
- The individual permission settings for each group are stored in these tables. Specific
permissions can be set for read/write access to segments and sources, access to
perform specific interactions, access to composite views, and access to Web
reports.
mpi_usrhead
- Whenever a user logs in to an Initiate application for the first time, their user name
will be stored in the mpi_usrhead table for auditing purposes.
User passwords are not stored in the hub database; they are stored in the LDAP repository.


The ldap.properties file


During creation of a Master Data Engine instance, an ldap.properties file is stored in the
<instanceName>\inst\mpinet_<instanceName>\conf directory.
The order of items in the ldap.properties file represents the path in which group authorization flows: embedded LDAP server > internal (standalone) LDAP server > external LDAP server.
The answers you chose when configuring your Master Data Engine via madconfig provide the settings for the embedded and/or internal LDAP directory server.
For a detailed description of each property in the ldap.properties file, refer to the Master Data Engine Installation Guide, Appendix A.
Some items of interest from the file include:
embedded.ldap.enabled: If you specified y for the madconfig prompt Will this Initiate Master Data Engine instance use an embedded Initiate LDAP Server?, then this property will be set to true. Changing this setting could result in a failure in the authentication process.
internal.ldap.enabled: This property will always be set to true.
The hostname and port for the internal LDAP server are defined here.
The hostname and port for any external LDAP server being used are also defined here.

Note

External LDAP DN settings must be added manually to this property file before configuring LDAP connections in Workbench. Instructions for configuring the Master Data Engine to communicate with an external server are provided in the Master Data Engine Installation Guide. The default external LDAP settings in this file have been optimized for integrating with Microsoft's Active Directory.


Initiate groups
LDAP records are stored hierarchically, similar to DNS or Unix file trees. A record's Distinguished Name (DN) is read upwards to the top level, or base. Each DN has two parts: the Common Name (CN) and its location within the directory in which it resides. The location is determined by the Organizational Unit (OU) and the Domain Component (DC).
Each user has a Bind DN, which represents the user's name and where the user resides in the directory. In our systems, they will look like this:
cn=system,ou=System,ou=Users,dc=Initiatesystems,dc=com
In the example above, the user system is in the System organizational unit under Users, within the initiatesystems.com domain.
If an external LDAP server is going to be used, the groups must be defined prior to implementation, and the appropriate DN setting must be added to the ldap.properties file by hand before making any configurations in Workbench.
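
A DN can also be decomposed programmatically with standard APIs. A small sketch using javax.naming.ldap.LdapName (illustrative only):

import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

public class DnParts {
    public static void main(String[] args) throws Exception {
        LdapName dn = new LdapName(
                "cn=system,ou=System,ou=Users,dc=initiatesystems,dc=com");
        // RDNs are stored base-first: dc=com, dc=initiatesystems, ou=Users, ...
        for (Rdn rdn : dn.getRdns()) {
            System.out.println(rdn.getType() + " = " + rdn.getValue());
        }
    }
}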

Default Initiate LDAP structure


Preconfigured Initiate user groups


The LDAP server comes preconfigured with seven groups:
Administrators
cn=Administrators,ou=System,ou=Groups,dc=initiatesystems,dc=com
Default
cn=Default,ou=System,ou=Groups,dc=initiatesystems,dc=com
All Application Operations
cn=All Application Operations,ou=System,ou=Groups,dc=initiatesystems,dc=com
All Composite Views
cn=All Composite Views,ou=System,ou=Groups,dc=initiatesystems,dc=com
All Interactions
cn=All Interactions,ou=System,ou=Groups,dc=initiatesystems,dc=com
All Segments Read Write
cn=All Segments Read Write,ou=System,ou=Groups,dc=initiatesystems,dc=com
All Segments Read Only
cn=All Segments Read Only,ou=System,ou=Groups,dc=initiatesystems,dc=com
Users, groups and associated permissions defined in earlier versions of the Master Data
Engine data model will be promoted to the new LDAP Directory Server during the upgrade
process.

Password changes
When users change their passwords via the Inspector application, the interaction is supported against the internal Initiate LDAP directory server. If the password is being changed when the Master Data Engine is configured for an external directory server, the request is not supported and will generate an error. Password changes for externally authenticated users need to follow their corporate operating procedures.


Administrators
The Administrators group has full access to all interactions, operations, composite views,
and attributes (segments).
The Administrators group is preconfigured to have full access to Workbench to import and
deploy hub configurations, run Analytics reports, set user group permissions and execute
jobs on the hub. Any users you want to have this access must be added to the
Administrators group. To be part of the Administrators group, the user has to be present in
the internal directory server.
The LDAP server comes with a preconfigured user, system (usrrecno = 1), which has membership in the Administrators group. You will use this login initially to add additional users and groups. This group cannot be configured in Workbench.

Important

In prior software versions, ALIGNDEX was defined as the system-level user. This has
been removed and replaced by system. Your first steps in user administration should be
to create a new user with Administrators group access using system as your basis and
then deleting system for security purposes. Do not delete or rename the Administrators
group.

Default
The Default group has access to the interactions USRGETINFO, GRPGETINFO, and USRSETPASS, and has read-only access to the segments USRHEAD, GRPHEAD, GRPXAPP, GRPXCVW, GRPXIXN, and GRPXSEG.
The Default group is assigned to users at login if they are not currently members of any other defined group. This group cannot be granted permissions through the Workbench configuration editor, and cannot otherwise be configured in Workbench.


All application operations


The All Application Operations group has access to all operations.

All composite views


The All Composite Views group has access to all composite views.

All interactions
The All Interactions group has access to all interactions.

All segments read write


The All Segments Read Write group has read and write access to all attributes.

All segments read only


The All Segments Read Only group has read-only access to all attributes.
If you want a certain user to always have read/write permissions to all attributes, for example, you can assign that user to the All Segments Read Write group. Doing so gives the user permission to read and write all attributes regardless of any permissions that might also be granted via the Security: Groups tab. This is helpful for non-human users, such as those created to enable handlers, custom applications, and so forth.


Using MADCONFIG to create an instance with LDAP


Do you remember the Choose Your Own Adventure books from your youth? Those were the books where, numerous times throughout the story, you were faced with a decision.
Should I fight the dragon and save the princess?
Yes. Continue reading on page 26.
No. Continue reading on page 30.
The outcome of the story depended on the choices you made. Similarly, the choices you make regarding creating an embedded LDAP server when creating an instance of the Master Data Engine determine the subsequent prompts you will see.

Will this Initiate Master Data Engine instance use an embedded Initiate LDAP server?
The answer to this prompt will determine if you will be using an embedded or standalone
LDAP server and will present a different set of prompts.

Yes for embedded


Answering Yes means the LDAP server will be embedded and will reside directly on the
MDE server. You will then have to enter the Initiate LDAP Server port number.
Selecting the embedded option will also give you the option to have the LDAP server be
part of a cluster. We will cover what this means shortly.

No for standalone
If the embedded option is not chosen, the LDAP server will be an internally or externally managed standalone server. Organizations that use existing LDAP directories would typically choose this option.
When selecting standalone servers you will also be prompted:
Enter the Initiate LDAP Server host name:
Enter the Initiate LDAP Server port number:
When running the MADCONFIG script to create an instance, it is best to have the
standalone LDAP servers already created. Use the MADCONFIG create_ldap script to
create an Initiate LDAP server.

Important

A standalone LDAP service will need to be started when the Master Data Engine service
is started.


Will this Initiate LDAP server be clustered with other Initiate LDAP
servers?
If you selected to have an embedded LDAP server in your Master Data Engine, you will
also be asked if the server will be part of a cluster. The cluster will contain a standalone
LDAP server used for high availability which will be internally or externally managed.
Answering Yes means there will be a standalone LDAP and you will also be prompted:
Enter the Initiate LDAP Server replication port number:
Enter a cluster peer Initiate LDAP Server host name:
Enter a cluster peer Initiate LDAP Server port number:
Enter a cluster peer Initiate LDAP Server replication port number:
It is best to have the standalone server created using the create_ldap script, but it can be
created later as long as you use the same information you entered for the prompts above.

Exercise
Now it is time to perform Exercise 12, taking approximately 20 minutes.


Unit 15. Configuring and deploying Inspector

Overview
This module will cover deploying and configuring IBM Initiate Inspector.

Dependencies
You need to have a supported database platform and the proper software installation files
for your operating system. The computer images provided will support Inspector.

Topics
This unit will cover:
Introduction to Inspector
The inspector.properties file
Inspector configuration in Workbench


Introduction to Inspector
Inspector is a Web-based, integrated data stewardship and governance application that
enables data stewards to perform three main tasks.

Data resolution
Inspector enables data stewards to understand and resolve data quality issues using a
simple, drag-and-drop interface.

Relationship management
Inspector enables data stewards to view, manage, and modify complex master data relationships, including hierarchical relationships, using innovative relationship visualization technology. Inspector lets organizations gain insight into these relationships for purposes such as identifying top accounts and determining pricing eligibility.

Data management
Inspector lets data stewards manage additions, changes, and deletions to master data. Inspector enhances Initiate's existing support for transactional implementation styles by using the IBM Initiate Master Data Service as the master data source.
The large volume of data stored across multiple source systems, and the often dynamic state of that data, can present organizations with challenging integrity and profiling issues. IBM Initiate Master Data Service software and associated applications enable you to combine, compare, review, and resolve potential data issues. Using the software adds value to your organization by increasing the integrity of your data and providing a 360-degree view of a record.
Designed to store and manage data from multiple sources, the software, with algorithms configured specifically for your data and business environment, compares the records and attributes contained therein to identify potential data issues. Inspector is the end-user application that enables you to locate data issues, review the records involved, view relationships between entities, and make appropriate adjustments to correct errors.
Inspector is an integrated data stewardship platform combining three functions: data resolution, relationship management, and data management. This tool is based on the premise that understanding relationships in data helps data stewards manage and resolve quality issues.


The inspector.properties file


Inspector uses the inspector.properties file to set up the application properties. The location of this file is determined by the Web server and might differ depending on which version of Web server you are running. Below is an example of the properties file.
HostName=localhost
HostPort=16000
UseSSL=false
MaxContext=10
InitContext=5
TimeOut=30

Hub engine connection properties (required)

Table 15-1: Hub engine connection properties
Property   Definition
HostName   The name of the server running the Initiate hub engine.
HostPort   The port number of the Initiate hub engine.
UseSSL     Indicates whether secure sockets layer (SSL) is to be used to communicate with the engine. Default is false.

General connection properties (optional)

Table 15-2: General connection properties
Property     Definition
MaxContext   The number of connections that will be maintained to the MDS engine. 10 is the default; 0 means no limit (not recommended).
InitContext  The number of connections for Inspector to grab upon startup. The default is 5.
TimeOut      The number of seconds to wait for a free context (or connection) to come into the pool. 30 is the default.
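
As an illustration of how these settings would typically be consumed (a sketch, not Inspector's actual code), the snippet below loads inspector.properties and applies the documented defaults:

import java.io.FileInputStream;
import java.util.Properties;

public class InspectorConfig {
    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        try (FileInputStream in = new FileInputStream("inspector.properties")) {
            p.load(in);
        }
        // Defaults match the sample file and the tables above.
        String hostName = p.getProperty("HostName", "localhost");
        int hostPort = Integer.parseInt(p.getProperty("HostPort", "16000"));
        boolean useSsl = Boolean.parseBoolean(p.getProperty("UseSSL", "false"));
        int maxContext = Integer.parseInt(p.getProperty("MaxContext", "10"));
        int initContext = Integer.parseInt(p.getProperty("InitContext", "5"));
        int timeOut = Integer.parseInt(p.getProperty("TimeOut", "30"));
        System.out.printf("Engine %s:%d (SSL=%b), pool %d-%d, timeout %ds%n",
                hostName, hostPort, useSsl, initContext, maxContext, timeOut);
    }
}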


Inspector configuration in Workbench


You will need to customize Inspector to make it work for your data set. This configuration is
done using Workbench and is divided into five categories, each displayed on separate
panes:
Attribute Display
General Preferences
Member and Entity
Search Forms
Search Results

Attribute display
Much like it sounds, this pane configures which attributes are displayed, and how, for each member in Inspector. You can add and remove attributes by member type and their corresponding attribute types, and configure which fields are displayed.


Custom task summary


This pane configures custom task summary views if they are being used instead of the default search result views. Similar to the Attribute Display and Search Results panes, the Custom Task Summary pane allows you to hide attributes or configure which fields are displayed.

General preferences
The General Preferences pane, again, does much what its name suggests. Use this pane to configure the number of search results that are returned, page sizes, date formatting, and more.


Member and entity


Use the Member and Entity pane to determine the member and entity type settings and
add fields and label patterns. If you have Composite Views and/or Hierarchy Relationships,
they must be assigned to the Member and Entity types. The Label Pattern will appear on
the tab when viewing an Entity.

Search forms
The Search Forms pane will configure which Attributes and corresponding fields you will be
able to use as search criteria.


Search results


This pane is where you configure how the member information will be returned to you after a successful search. You can select which attributes will or will not be displayed.

Exercise
Now it is time to perform Exercise 13, taking approximately 30 minutes.


Unit 16. Testing the hub configuration

Overview
The purpose of this unit is to cover testing approaches and to test your hub configuration
using a CloverETL graph designed to perform a MEMPUT operation.

Prerequisites
Once your configuration and data are fully loaded into the IBM Initiate Master Data Service software, you should test your hub configuration.

Topics
Testing philosophy
Testing your configuration using CloverETL


Testing philosophy
Testing should be performed by the implementers throughout the entire process. The
testing should focus on six categories:
General
Data
Algorithm
Application
Integration
Performance

General tests
Every time you click Save in Workbench, you should check the Problems tab. These messages provide information on what is not correct in your configuration and will more than likely cause problems for you down the line. Most messages are context sensitive, so you can go directly to the location of the problem by double-clicking it.
Workbench also validates your project prior to deploying it to the hub. The deployment is suspended during this check, meaning requests can still be submitted to the engine while the configuration check is taking place. If your project has any errors that would prevent your hub from working correctly, check the problem messages and fix the problems listed.

Data validity tests


Throughout the Boot Camp process, we have already performed some of the iterative tests that ensure data validity. After you derived your data, we ran a few tests to verify that the data was crunched accurately.
Verify that the mpi_membktd and mpi_memcmpd tables are populated. Then test that the algorithm is passing the data through properly.
Compare data in the mpi_memhead table to your data extract. Verify that the number of records in the table matches your extract file (a scripted version of this check is sketched below).
Compare data in the mpi_memcmpd table to your data extract. Verify that the number of rows in the table matches your extract file.
Compare data in the mpi_segattr table to your data extract. Verify that the attributes in the table match your extract file.
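
A quick way to script the record-count check is a small JDBC program. The connection URL, credentials, and extract file name below are hypothetical; the mpi_memhead table comes from the list above:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CountCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical DB2 connection string and credentials.
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://localhost:50000/capmed", "capmed", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM mpi_memhead")) {
            rs.next();
            long hubRecords = rs.getLong(1);
            // One record per line in the extract file (hypothetical file name).
            long extractRecords = Files.readAllLines(Paths.get("extract.txt")).size();
            System.out.println(hubRecords == extractRecords
                    ? "Counts match: " + hubRecords
                    : "Mismatch: hub=" + hubRecords + " extract=" + extractRecords);
        }
    }
}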


Algorithm tests


After you have run Weight Generation, there are a number of tests you can run to see if
your algorithm is working correctly.
Open the Bucket Value analysis, select a bucket role, and open the algorithm to verify that the roles match up and the appropriate data is populated.
If your algorithm uses QXNM or AXP, find a record that will use a nickname, such as Robert (Rob, Bob) or Anthony (Tony). Run a Member Comparison analysis of that record to ensure that the phonetics and nicknames are applied properly.
Run a Member Comparison on a record and verify that the comparison function has represented the attributes and assigned scores to them properly.

Application tests
Implementers should test the interactions between the hub and Initiate applications.
Test data in applications such as IBM Initiate Inspector or Enterprise Viewer against the data in the data extract by selecting random records and looking them up in the application.

Integration tests
Because there are so many ways to integrate the hub with existing customer software and services, there are no specific tests that must be run. Implementers should test the integration by sending messages to the hub and checking the log files to verify that the expected information is pushed.
Test the engine callouts used in the integration by creating and resolving tasks such as potential duplicates or overlays.

Performance tests
Test the implementation's performance by performing searches and other tasks to see if it meets the benchmarks. Initiate has performance-testing APIs available that measure how efficiently the system is running. Contact your Initiate project manager for the APIs that best apply to your project.


Testing your configuration using CloverETL


You can test your Boot Camp implementation by creating a MEMPUT graph in CloverETL.

Understanding MEMPUT
A MemPut interaction inserts or updates member data in the hub database. The CloverETL
Initiate MEMPUT component processes and inserts or updates data in the same manner
as the MemPut API interaction. Arguments are supplied to the CloverETL Initiate MEMPUT
component in the form of component parameters.
The CloverETL Initiate MEMPUT component requires input from a CloverETL Reader. The
Reader supplies a connection to the external source of the data, and can also apply
filtering or other criteria to determine which data is passed on to the MemPut component.
CloverETL provides a number of Readers that can read data from databases, text files,
LDAP repositories, and so on.

Exercise
Now it is time to perform Exercise 14, taking approximately 10 minutes.


Appendix A. Sample implementation

Overview
This module outlines the design and configuration of the Sample Implementation project for
week two of the Initiate Technical Boot Camp. In this project, you will design an instance
that will track customers for a fictitious company, Capital Aviation.

Dependencies
You need to have completed the first week of Boot Camp and successfully created the
instance outlined in this book.

Topics
This module will cover:
Overview and Objectives
Implementation Goal
Solution Architecture
Initiate Systems Software Components
Use Cases
Initiate Configuration
IBM Initiate Inspector Configuration
Data Extract Guide


Schedule for the sample project


Monday
- Boot Camp Review and Project Kickoff. (2 hrs)
- Configure Baseline Dictionary and Algorithm: create a new Initiate project called capmed; configure the dictionary and algorithm according to the Data Requirements spreadsheet, using the Boot Camp algorithm as a starting point. (2 hrs)
- Create New DB2 Database and User: create a new database called capmed; create a new Windows user called capmed and assign it to the capmed database. (10 min)
- Create New Hub Instance Using MADCONFIG: madconfig create_datasource, followed by a successful madconfig test_datasource, then madconfig create_instance. Should result in the c:\initiate\projects\capmed directory being created; the capmed database should have the Hub tables created. (15 min)
- Deploy Workbench Project: baseline dictionary and algorithm settings deployed to the Hub database. (15 min)

Tuesday
- Review and Clean Customer Data Extract: Clover graph updated and run against the data extract; data extract ready to be derived with MPXDATA. (1 hr)
- Derive Member Data with MPXDATA: derived data (bin, unl files). (2 hrs)
- Load Derived Data: member UNL files loaded into the Hub database. (30 min)
- Check Attribute Validity: Attribute Validity analysis completed. (30 min)
- Weight Generation: weight generation process completed; weights deployed to the Hub database. (2 hrs)

Wednesday
- Bulk Cross Match: MPXCOMP, MPXLINK, MADUNLLOAD jobs run; entity UNL files loaded into the Hub database. (2 hrs)
- Threshold Analysis: matched pairs reviewed; threshold settings revised and deployed to the Hub. (1 hr)
- Bucket Analysis, Entity Size Analysis: bucket analysis and entity size analysis completed. (1 hr)

Thursday
- 2nd Iteration: update algorithm; update thresholds; rerun comp/link; rerun Data Analytics. (3 hrs)
- Configure Inspector: Inspector configured to search the Hub and resolve tasks. (2 hrs)
- Test add and change messages with MEMPUT Graph: verified that data is being received and processed by the Hub. (2 hrs)

Friday
- Final Test. (1 hr)

Overview and objectives


Capital Aviation wants to obtain a complete view of its customers, to recognize its
customers regardless of the carrier through which they book flights and travel, and to
provide the level of customer service that it desires.
Capital Aviation has chosen to use an MDM solution to manage customer identities across
its operations. Capital Aviation will use IBM Initiate Master Data Service initially to
passively identify and resolve data quality errors such as duplicate customer records. Over
time, they expect to move to an active integration where customer service representatives
will be adding, updating, and searching for customer data against their Initiate Hub while
performing day-to-day transactions.

Implementation goals
We designed this integration to address these objectives and achieve specific goals:
Discover data quality errors: IBM Initiate Master Data Service identifies potential duplicates and potential linkages so that you can review and correct them.
Uniquely identify a customer across multiple sources of data: IBM Initiate Master Data Service identifies and links customer records across your enterprise so you can have a single view of your customers' profiles available for real-time searching, or for extraction to a data warehouse.


Solution architecture
The figure below presents a proposed architecture that addresses these goals.

(Architecture diagram: the Capital Aviation Reservation System and Travel Agency source systems connect through DF Adapters and the Java API (memPut, memSearch, memGet) to the Initiate Master Data Service, which exposes inbound, outbound, query, routing, mapping, and synchronization brokers; Inspector, Enterprise Viewer, Operational Reports, Workbench, the Enterprise Integrator Toolkit, and Master Data Extract sit on top of the hub. The numbered indicators in the diagram are described below.)

Notation
Indicator 1 (owner: Capital): Capital develops interfaces using the Initiate Java SDK that call the Initiate Put, Search, and Get APIs to add, update, and retrieve information from the IBM Initiate Master Data Service Hub.
Indicator 2 (owner: Initiate): The Data Hub is configured to establish identity relationships and tuned to meet our matching goals.
Indicator 3 (owner: Capital): Data stewards use Inspector for Data Resolution to resolve identities in their workflow queue.


Initiate Systems software components


This Initiate Systems Software Implementation Architecture includes these Initiate Systems
products:
IBM Initiate Master Data Service: stores and indexes customer data, links customer records across sources, and identifies data quality issues such as duplicate records.
Inspector: allows end users to review and resolve data quality issues such as duplicate records.
Java SDK: used to develop interfaces that add, update, and retrieve data from the hub for external applications.


Contributing and consuming systems


The following sources will provide data to and/or consume data from the Initiate software:

System: CAPITAL RESERVATION SYSTEM
  Number of records: ~1,000,000
  Notes: Contributes customer data to Initiate via real-time messages.
System: CAPITAL TRAVEL AGENCY
  Number of records: ~27,000
  Notes: Contributes customer data to Initiate via real-time messages.

1. Contributes: is a source of customer records that contributes information to the Master Data Service.
2. Consumes: consumes information from Initiate, for example, searches the Master Data Service using Initiate's APIs, or consumes Enterprise ID change notifications.


Use cases


This solution architecture is designed to address the following use cases:

Creating a new customer record


Summary: An end-user creates a new customer record, which is then sent to IBM Initiate
Master Data Service via the Initiate memPut API.

Basic sequence of events:

End-user creates a new record:
- The end-user creates the customer record in your source system and enters the customer's information.
- Your source system generates a Customer ID (your unique identifier).
- Your source system sends the customer record, including the Customer ID, to the Initiate Inbound Broker via a TCP/IP message.

Initiate stores the record:
- The Initiate Inbound Broker receives the message, retrieves the appropriate customer demographic information, and sends it to the Hub.
- IBM Initiate Master Data Service stores the appropriate customer demographic information.

The Hub assigns an Enterprise ID:
- IBM Initiate Master Data Service compares the new customer record to existing customer records to identify other records that represent the same customer.
- If IBM Initiate Master Data Service finds matching customer records, it assigns the customer's existing Enterprise ID to the new customer record.
- If IBM Initiate Master Data Service does not find matching customer records, it assigns a new Enterprise ID to the new customer record.
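
The TCP/IP message step can be pictured with a small client sketch. The broker host, port, and the pipe-delimited payload layout here are hypothetical; a real integration would use the message format configured for the Inbound Broker:

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SendCustomerRecord {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker endpoint and pipe-delimited record layout.
        String record = "CAPITAL RESERVATION SYSTEM|CUST-000123|DOE,JANE|19800101\r\n";
        try (Socket socket = new Socket("broker-host", 7000);
             OutputStream out = socket.getOutputStream()) {
            out.write(record.getBytes(StandardCharsets.US_ASCII));
            out.flush();
        }
    }
}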


Updating customer records


Summary: An end-user updates a customer record, which is then sent to IBM Initiate
Master Data Service via the Initiate memPut API.

Basic sequence of events:

End-user updates a record:
- The end-user updates the customer record in your source system (for example, changes the customer's address).
- Your source system sends the customer record, including the Customer ID, to the Inbound Broker via a TCP/IP message.

The Hub updates its version of the record:
- The Inbound Broker receives the message, retrieves the appropriate customer demographic information, and sends it to the Hub.
- The Hub updates its version of the record with the new customer demographic information, maintaining a history of the previous values where appropriate.

The Hub assigns an Enterprise ID:
- The Hub compares the updated customer record to other customer records to identify whether other records now match the updated record.
- If the Hub finds matching customer records, it changes either the Enterprise ID of the updated record or of the matching record(s) so that they share a common Enterprise ID.
- If the Hub determines that records that previously matched the updated record (before the update) no longer match, it changes the Enterprise ID of the updated record or of the previously matching record(s) so that they no longer share a common Enterprise ID.


Resolving potential duplicate tasks


Summary: An end-user uses Inspector to review a Potential Duplicate task. The end-user
confirms the records are duplicates and links them.

Basic sequence of events:

End-user reviews tasks in Inspector:
- An end-user uses Inspector to query for Potential Duplicate tasks.
- The end-user reviews the records in the tasks, notes the similarities and differences between the records, and decides whether they are, in fact, duplicates.

End-user links the records:
- Assuming the records are, in fact, duplicates, the end-user links the records by assigning the Enterprise ID of one of the records to the other(s).
- The Hub links the records and creates an identity rule indicating that these two records should continue to be considered the same customer, regardless of subsequent updates that might occur to any of the records. (Note that if the end-user had determined that the potential duplicates were not, in fact, duplicates, the Hub would have created a similar non-identity rule.)


Resolving potential linkage tasks


Summary: An end-user uses Inspector to review a Potential Linkage task. The end-user
confirms the records represent the same person and links them.

Basic sequence of events:

End-user reviews tasks in Inspector:
- An end-user uses Inspector to query for Potential Linkage tasks.
- The end-user reviews the records in the tasks, notes the similarities and differences between the records, and decides whether they are, in fact, the same person.

End-user links the records:
- Assuming the records are, in fact, the same person, the end-user links the records by assigning the Enterprise ID of one of the records to the other(s).
- The Hub links the records and creates an identity rule indicating that these two records should continue to be considered the same customer, regardless of subsequent updates that might occur to any of the records. (Note that if the end-user had determined that the potential linkages were not, in fact, the same person, the Hub would have created a similar non-identity rule.)


Resolving review identifier tasks


Summary: An end-user uses Inspector to review a Review Identifier task. The end-user
confirms the records represent the same person and links them.
Basic sequence of events:

End-user reviews tasks in Inspector:
- An end-user uses Inspector to query for Review Identifier tasks.
- The end-user reviews the records in the tasks, notes the similarities and differences between the records, and decides whether they are, in fact, the same person.

End-user links the records:
- Assuming the records are, in fact, the same person, the end-user works with Customer Records to resolve the discrepancy in the source system. If the records are not the same person, the end-user reports this situation for further investigation.


Initiate configuration
This section describes the data that you store in the Hub, the algorithm that you use to
compare records, and the thresholds you use to determine whether to automatically link
records or to identify potential duplicate and potential linkage tasks.

Member attributes
The Hub stores customer information as attributes. The table below details the attributes you store in the Hub.

Attribute Code  Attribute Name/Label       Member Type  Segment   # Exists(1)  # Active(2)
LGLNAME         Customer Name              PERSON       MEMNAME   5            1
BIRTHDT         Date of Birth              PERSON       MEMDATE   1            1
SEX             Gender                     PERSON       MEMATTR   1            1
HOMEADDR        Home Address               PERSON       MEMADDR   10           1
HOMEPHONE       Home/Evening Phone Number  PERSON       MEMPHONE  10           1
SSN             Social Security Number     PERSON       MEMIDENT  1            1
CUSTID          Customer Account Number    PERSON       MEMATTR   1            1

1. Indicates the number of versions of each attribute that the Hub will keep. For example, the Hub stores up to 10 addresses per customer, but only one SSN per customer.
2. Indicates the number of active versions of each attribute that the Hub will keep. For example, the Hub stores up to 10 addresses per customer, but only one address (the most recently provided address) will be active.
3. Review Identifier Checks: indicates whether the Hub should create a task if two different records share the same value. For example, the Hub creates a task if two different records share the same SSN.


Algorithm
Algorithms specify how the Hub compares records to determine whether multiple records
represent the same customer. Your algorithm compares customer records using these data
elements:
Customer Name
Date of Birth
Gender
Street address (a component of the Home Address attribute)
Zip code (a component of the Home Address attribute)
Phone
Social Security Number
Customer Account Number

Thresholds
While the algorithm determines how to compare and score records, thresholds interpret
those scores to determine whether to automatically link records or to mark them as
potential duplicates that should be manually reviewed. Your Hub implementation uses the
following thresholds:
Source                       Clerical Review Threshold  Auto-link Threshold
CAPITAL RESERVATION SYSTEM   7.0                        7.0
CAPITAL TRAVEL AGENCY        7.0                        7.0


Inspector configuration
The Initiate Systems Implementation Project Team configures Inspector to display the fields that will help you quickly review and resolve tasks. For each of the attributes LGLNAME, BIRTHDT, SEX, HOMEADDR, HOMEPHONE, SSN, and CUSTID, the configuration records whether the attribute is:
1. Searchable: whether the attribute will be part of the member search dialog box.
2. Search Results: whether the attribute will appear in the Entity Search Results section of Inspector.
3. Task Search Results: whether the attribute will appear in the Task Search Results section of Inspector.
As you review and resolve tasks in Inspector, you assign workflow statuses to indicate your
progress or final decisions. The following workflow statuses will be available in your
Inspector implementation:
Workflow Status         Description                                Action
Unexamined              Initial task status.                       The task is awaiting review and
                                                                   resolution by an end-user.
Not Same Person         The members in the task do not             Resolved: the Hub removes the task
                        represent the same person.                 from the work queue and creates a
                                                                   non-identity rule.
Same Person             The members in the task represent the      Resolved: the Hub removes the task
                        same person. The end-user must also        from the work queue and creates an
                        assign a common Enterprise ID to the       identity rule.
                        resolved records.
Not Enough Information  The end-user cannot determine at this      Deferred: the task remains until an
                        time whether the records are, in fact,     end-user resolves it.
                        duplicates.
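
Each resolution corresponds to a hub action, as the table shows. A small sketch of that mapping (hypothetical names in Python; this is not the Inspector API):

# Hypothetical mapping of Inspector workflow statuses to hub actions.
ACTIONS = {
    "Unexamined":             "leave task in work queue",
    "Not Same Person":        "remove task; create non-identity rule",
    "Same Person":            "remove task; create identity rule; assign common EID",
    "Not Enough Information": "defer task until more information arrives",
}

def resolve(task_id, status):
    action = ACTIONS.get(status)
    if action is None:
        raise ValueError("unknown workflow status: " + status)
    print("task %s: %s -> %s" % (task_id, status, action))

resolve(1042, "Same Person")   # task 1042: Same Person -> remove task; ...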


Data extract guide


The Hub software is the most accurate solution available for matching organizational data.
To provide the most accurate matching possible, we configure and test an algorithm
based on your specific data. Your data extracts provide Initiate Systems with the attributes
required for matching, plus any additional attributes that help validate results or provide
additional context to the data. The richness and quality of the data you provide directly
determine the accuracy of the matching results.

Data description
For each section below, please describe your data by providing responses to the questions
and provide any additional information that you believe might be helpful in understanding
your data environment and processes.
The IBM Initiate Master Data Service manages data from these sources:
Data Source                 Description                              Record Count
CAPITAL RESERVATION SYSTEM  Reservation system for Capital Aviation    ~1,000,000
CAPITAL TRAVEL AGENCY       Customer system for Capital Travel            ~27,000


Question: What is your primary identifier (for example, MRN, Corporate ID, Account
Number)?
Response: Records in the extract are uniquely identified by a combination of Source and
Source ID.

Question: Does the primary ID have a meaningful prefix, suffix, or any other characteristic
within the identifier?
Response options: No / Yes, please describe:

Question: How is the primary ID assigned?
Response options: Sequentially / Algorithmically, describe:

Question: Can you extract one record per primary identifier?
Response: Yes

Question: Do you assign a secondary identifier?
Response: No

Question: Do you have multiple sources (i) sharing the same identifier or (ii) assigning a
unique identifier by source?
Response options: Only one source. / Multiple sources sharing a pool of identifiers. /
Multiple sources using unique pools of identifiers.

Question: Do you have records without a primary identifier?
Response: No

Question: Will the source file contain non-surviving, merged records? That is, can you
avoid sending obsolete records?
Response options: File will not contain obsolete records. / File might contain obsolete
records.


File formats


Please provide data extract files as pipe-delimited ASCII files (alternatively, you can
arrange to provide fixed-width files). Each record should be CRLF terminated. The extracts
should conform to the format outlined below (required fields are in bold):
Field #  Field name               Description                                    Max Length
1        Source                   The source from which this record originated      12
2        Source ID                Unique record identifier within the source        60
3        Last name                Customer last name                                75
4        First name               Customer first name                               30
5        Middle name              Customer middle name                              30
6        Birth date               Customer date of birth                            19
7        Area Code                Customer's area code                               5
8        Phone Number             Customer's phone number                           20
9        Social Security Number   Customer Social Security Number                    9
10       Gender                   Customer's gender                                  1
11       Address Line 1           Customer's street address line 1                  75
12       Address Line 2           Customer's street address line 2                  75
13       City                     Customer's city                                   30
14       State                    Customer's state                                  15
15       Zip Code                 Customer's zip code                               10
16       Customer Account Number  Customer's account number                         10

Sample record
Records in the data extract should be formatted as follows:
Source|Source ID|Last Name|First Name|Middle Name|Birth Date|Area Code|Phone
Number|SSN|Gender|Address Line 1|Address Line 2|City|State|Zip Code|Customer
Account Number

1|1|Kennedy|John|F|1917-05-29|202|4561414|999-99-9999|F|1600 Pennsylvania
Ave||Washington|DC|20171|I-74036598
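
For illustration, the following minimal Python sketch (our own, not an Initiate utility) splits one such record into named fields and checks the field count:

# Parse one pipe-delimited extract record; field names follow the layout above.
FIELDS = [
    "Source", "Source ID", "Last Name", "First Name", "Middle Name",
    "Birth Date", "Area Code", "Phone Number", "SSN", "Gender",
    "Address Line 1", "Address Line 2", "City", "State", "Zip Code",
    "Customer Account Number",
]

def parse_record(line):
    values = line.rstrip("\r\n").split("|")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))

rec = parse_record("1|1|Kennedy|John|F|1917-05-29|202|4561414|999-99-9999|F|"
                   "1600 Pennsylvania Ave||Washington|DC|20171|I-74036598")
print(rec["Last Name"], rec["Zip Code"])   # Kennedy 20171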


Assumptions
When preparing your data extract, please ensure the following:
- Source Identifier is unique within the source
- Files are pipe delimited
- All fields are Text data type
- Data file fields are alphanumeric characters and left-justified
- All dates are in a CCYYMMDD format
- Max Length indicates the maximum allowable length for data in the field (excess
characters are dropped)

Verifying data extracts


We recognize that any large volume of data is likely to have unforeseen characteristics, so
we recommend that you perform thorough quality assurance on your extracts before
delivering them to us. The data submission checklist below includes some validation
guidelines you should use to verify your extracts.
Check                     Description                                           Result
Readable                  Are the files readable? If they are zipped, can       No / Yes
                          they be extracted and read in a plain text viewer?
Record count              What is the record count of each file? Does that      No / Yes
                          match our expected record count?
Non-printable characters  Does the file contain printable characters only?      No / Yes
                          In other words, it does not include any
                          non-printable characters.
End of line terminator    Does the end of line terminator match the CRLF        No / Yes
                          that we are expecting?
Format                    Does the file adhere to the agreed upon format?       No / Yes
                          Is the correct delimiter used?
Null value                Special characters are not used to indicate a null    No / Yes
                          value in any field.
Extra characters          Are there any extra characters or fields that we      No / Yes
                          did not agree to in the format?
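
Several of these checks are easy to automate before delivery. The sketch below (our own illustration, not an Initiate-supplied validator; the file name is hypothetical) runs the field-count, terminator, and printable-character checks over an extract:

# Illustrative pre-delivery checks for a pipe-delimited, CRLF-terminated extract.
import string

EXPECTED_FIELDS = 16
PRINTABLE = set(string.printable)

def validate(path):
    problems = []
    with open(path, "rb") as f:
        raw = f.read()
    for i, line in enumerate(raw.split(b"\r\n"), start=1):
        if not line:
            continue
        if b"\n" in line:   # a bare LF means the terminator is not CRLF
            problems.append("line %d: end-of-line terminator is not CRLF" % i)
        text = line.decode("ascii", errors="replace")
        if any(ch not in PRINTABLE for ch in text):
            problems.append("line %d: non-printable characters present" % i)
        if text.count("|") != EXPECTED_FIELDS - 1:
            problems.append("line %d: expected %d fields" % (i, EXPECTED_FIELDS))
    return problems

for problem in validate("capital_travel_extract.txt"):   # hypothetical file name
    print(problem)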


Data transmission


Initiate Systems offers electronic transfer of files through our secure FTP site. Each client
receives a dedicated, secure folder for data transmission. Once the transmission is
complete, Initiate Systems removes all data from the FTP site and stores it on a secure
server to perform the data analysis. The default FTP transmission mode is ASCII, so you
must specify binary mode for transmission of zipped or compressed files. Please
coordinate with your project manager to arrange for data encryption, if necessary.

Security
Initiate adheres to strict confidentiality standards. We take this responsibility very seriously
and enforce regulatory standards relating to the distribution, disclosure, and retention of
personal data. Unless otherwise instructed, Initiate destroys client media within an
agreed-upon timeframe.

Resolving data extract questions


Please contact a member of the Initiate Systems project team if you have any questions
about the extract requirements outlined above.


Appendix B. Working with relationships


Relationship linking is a feature of the Master Data Engine. It is configured in Workbench,
and relationships can be viewed and modified in IBM Initiate Inspector.

What are relationships?


A relationship is a connection between entities. That connection can be between two peer
entities, or between parent and child entities. The entities themselves can be individual
people or groups of entities, such as a household or a company.

Why are relationships important?


Imagine your company did not have an organization chart to refer to. How would you know
who reports to whom? How would you know what the chain of command was? How would
you, as an employee, know how you fit into the overall structure of the company you work
for?
Many large organizations are layered in the same manner. A large corporation can have
many different organizations and subsidiaries. Adding to the confusion are the wholly
owned subsidiaries and the OEMs that make products branded by that large corporation.
Now imagine working for a company that does business with that large corporation and a
number of its subsidiaries. In your records, there is no hierarchy showing that the
subsidiaries and OEMs you do business with are actually part of the large corporation.
Without knowing the relationships between the entities, how would you know how much of
your revenue is coming from the large corporation?

How does understanding relationships help?


Understanding the relationships between entities can create opportunities and prevent
problems. If you know that a customer is actually a subsidiary of a large corporation, that
customer might be eligible for a discount based on economies of scale, because you can
accurately attribute where your revenue is coming from.
Conversely, understanding a relationship might minimize risk exposure. If a large
corporation is in financial trouble and its assets are frozen, knowing that your customer is a
subsidiary might make you think twice about creating another open invoice with them.
Knowing relationships helps you understand your total business with, and exposure to, an
organization.


Relationship scenarios
There are two basic types of relationships: hierarchical and peer-to-peer. In hierarchical
relationships there is a parent record and a set of child records that belong to the parent.
In peer-to-peer relationships there are a number of records, any of which could serve as
the parent, depending on the organization and its needs.

Peer-to-peer relationships
Peer-to-peer relationships are created by two entities who share enough common
information to be linked to each other. For example:
- A husband and wife can share a last name.
- Roommates share an address.
Entities extend beyond people. For example, a family of four will share the same address.
This common address is referred to as a household and can be treated as an entity, with
the individual family members becoming part of that entity.

These peer-to-peer relationships are also called asymmetrical because they do not
necessarily have a hierarchical or parent/child connection.

One-to-one

In theory, a doctor and a patient can have a one-to-one relationship. The connection is that
her file lists him as her General Practitioner, and his records show her as his patient.


One-to-many

In reality, a doctor will have more than one patient. In this case, the doctor has the same
type of doctor/patient relationship with many other people.

Many-to-many
The graphic below shows some of the relationships that exist in the health care field. The
group practice works with individual practitioners, holds the records for the practitioners'
patients, and maintains relationships with HMOs and health insurance providers. Each
practitioner maintains relationships with patients and with the hospitals at which they have
privileges. The patient has a relationship with the doctor, the practice, and the health
insurance provider, and so on.

As you can see, many different relationships can exist here, and depending on your
perspective, the hierarchical structure would be different.


Hierarchical relationships
Hierarchical relationships have parent records and child records. Think about the
organization of your company. At the top is the President or CEO. Below that are the direct
reports, such as the CFO and COO. The next level down holds the executive board's direct
reports, such as VPs and Directors. The organizational chart branches out to show their
direct reports, such as group managers, all the way down to entry-level positions.

A corporation that has subsidiaries also has a hierarchical structure. Below we see an
example of the corporate giant Really Big Corporation. RBC makes everything from light
bulbs to locomotives, and has many different subsidiaries, or children, of the parent
company.


How does relationship management work?


Relationships are created by the Relationship Linker, which uses record-level data within
entities to deterministically derive relationships between entities. The Linker does not use
matching algorithms; instead it uses relationship rules defined in Workbench. Relationships
are created based on those rules, which create linkages and identify conflicts or violations
that then become tasks.
The relationships are initially built in a batch run and are then created and deleted in real
time as changes are made to the data. The Relationship Linker can also be run as a batch
job through Workbench. In real-time processing, the data goes to a queue where the
information is processed by the Relationship Linker, similarly to the Entity Linker.


Relationship sources
The Linker uses reference sources that come from inside an organization (inter-source
data) or from an outside source (intra-source data).

Inter-source
Inter-source data uses established relationships in an organization's internal data. The
data model will generally include a combination of IDs to map out the relationships within
the organization. For example:
A commercial organization might have the following identifiers:
- employee_id
- department_id
- business group_id
A health care provider might have the following identifiers:
- patient_id
- physician_id
- hospital_id

Intra-source
Intra-source data uses globally accepted identifiers within the data. The data model would
contain identifiers that are commonly used by many organizations. For example:
- SSN - Social Security Number
- NPI - National Provider Identifier
- VIN - Vehicle Identification Number
- TIN - Taxpayer Identification Number
- EIN - Employer Identification Number

Trusted sources
A trusted source can be designated when a source is known to have the most accurate
information. A trusted source is not necessary for relationship linking, but designating one
is considered a best practice because it can help you avoid creating circular relationships
in hierarchical implementations.


Relationship rules


The Relationship Linker uses the source reference information and creates relationships
based on rules:
Directed: Indicates whether the relationship is one-way and non-reversible. For example,
Dick is the husband of Jane, but Jane is not the husband of Dick, so Directed would be
true. However, if the relationship were changed to spouse, Directed could be false.
Multiplicity: Indicates the number of relationships an entity is allowed to have. An
entity can have a multiplicity of 1 to 1, 1 to Many, or Many to Many. For example, a
patient can have only one physician, one physician can have many patients, and many
patients can have many physicians.
Relationship Requirements: An entity can have a required relationship type that will
create a task if it is not met. There are three types of relationship requirements:
- Parent Required - An entity must have a parent record. For example, each patient
must have a primary care physician.
- Hierarchy Required - An entity must exist within a hierarchical structure. For
example, an organization must either own or be owned by another organization.
- Sided - An entity must have a parent relationship or a child relationship, but not the
opposite. For example, every patient must have a physician, but physicians do not
have to have patients assigned to them.
The Relationship Linker can process multiple rules and will run through all of them. If any
rule evaluates to true, a relationship is created; rules that evaluate to false do not negate
the relationship (see the sketch below).
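
A minimal sketch of that evaluation model (hypothetical structures in Python, not the Linker's actual implementation): each rule is tested independently, and a single matching rule is enough to create the relationship.

# Sketch of OR-style rule evaluation: any matching rule creates the link.
def rule_matches(rule, left_record, right_record):
    # A rule compares an attribute field on the left to one on the right.
    return left_record.get(rule["attr_left"]) == right_record.get(rule["attr_right"])

def creates_relationship(rules, left_record, right_record):
    return any(rule_matches(r, left_record, right_record) for r in rules)

rules = [{"attr_left": "supervisor_id", "attr_right": "member_id"}]
employee   = {"member_id": "E2", "supervisor_id": "E1"}
supervisor = {"member_id": "E1"}
print(creates_relationship(rules, employee, supervisor))   # True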

Sample relationship criteria


Hierarchical and peer-to-peer relationships have certain criteria.
Table B-1: Relationship criteria
Relationship Type  Directed  Entity Type                      Multiplicity
Hierarchical       Yes       Same on both ends required.      1 to 1 or 1 to Many
Peer               No        Same on both ends not required.  1 to 1, 1 to Many, or Many to Many


Relationship tasks
When the linker finds a relationship that falls outside the constraints of the rules created
within Workbench, a task is created. Common relationship tasks are:
Relationship Multiplicity Task - An entity is involved in too many or too few
relationships according to the multiplicity constraints. For example, a patient is assigned
to more than one primary care physician despite having a one-to-one multiplicity
setting.
Missing Relationship Task - An entity does not have a required relationship type. For
example, a patient is not assigned to a physician despite having a parent required
setting.
Invalid Reference Task - An entity references a related entity that is no longer valid. For
example, a patient is assigned to a physician who no longer belongs to that practice,
leaving the patient without a parent.
Relationship Creation Task - This task type is triggered when the master data engine
does not agree with a relationship that was manually created. An example would be
when a company is said to be owned by another, but the data underneath does not
support that setting.


Relationship components


Relationships are created in the Master Data Engine using settings in Workbench and are
managed by data stewards using Inspector.

Data model
To accommodate relationship linking, the following tables were added to the data model:
- mpi_relrule, to store rules that govern relationship creation
- mpi_relxtsk, to store relationship information
- mpi_relsegattr, to support relationship segment-to-relationship attribute definitions
The following fields were added to existing tables:
- rellinkno, relseqno, rulerecno, and relflag to mpi_rellink
- relusrflag, relengflag, requiredleft, requiredright, and requiredhierarchy to mpi_reltype
- tskkind and tsktype to mpi_tsktype

Master data engine


When running the MADCONFIG script, you are prompted to enable the entity manager.
You must answer Yes to enable the entity manager in order to use the Relationship Linker;
the prompt to enable relationship linking itself comes later in the MADCONFIG script.


Workbench
The relationship rules that the Relationship Linker uses to create and delete relationships
are maintained on the Relationship Types pane in Workbench. The rules are set up by
creating:
Relationship Types: Gives the description, the direction, the multiplicity, and the entities,
and defines the connection between the parties in the relationship.
Relationship Type Attributes: Defines attributes used strictly for relationships.
Relationship Rules: Defines how the relationships are created.
Relationship rules are based on making connections between an entity on the left and an
entity on the right. In other words, the record of the left entity should equal the record of the
right entity.
Workbench also has a Relationship Linker job that can be used on an ad hoc basis to
create relationships based on the rules you have defined. This job can be used for the
initial data load or for very large batch loads. It works similarly to the bulk cross-match job
in that, if you have existing relationships in your database, it re-derives the data, replacing
the relationship data along with the bucket and comparison data.

Inspector
Records and relationships are managed in Inspector. Entities can be added to and
removed from a hierarchy, or dragged and dropped to a different location within the
hierarchy. As information in the records changes, the relationships change automatically in
real time through the Relationship Linker in the MDE.
When tasks are created, they are resolved by correcting data or, in some cases, by
overriding relationship rules in Inspector.


The relationship linking process


As you set up your instance, you should already have planned for relationships by setting
up the Initiate Member Model to include the relationship rules and the attributes used in
those rules. After the groundwork is laid, the hub processes relationships in the initial load
and then maintains them.

Initial load
When a project is implemented with relationships, it follows steps similar to an
implementation without relationships.
1. Derive the data with the Generate Query BXM option enabled.
2. Generate weights.
3. Run MPXCOMP, which uses the source tables to determine how to compare sources.
4. Run MPXLINK using the same output folder for BXM data as the derived data.
5. If processing cross entities, run the MPXFSDVD job to reconcatenate all the member
types.
6. Run the Relationship Linker job in Workbench.
7. Load the data into the hub.

Maintenance
After the initial load, additions and changes are added to the hub and the relationships are
processed in real time.
1. Obtain the Source Extracts from the customer and outside source (Dun & Bradstreet,
for example).
2. Standardize and run a MEMPUT job in CloverETL.
3. Entity Management and Relationship Linker run.


Setting up relationships
Relationships are managed in Workbench in the Relationship Types pane. First, however,
you need to plan how your relationships will be created by defining your rules based on the
information the customer needs. For this exercise we are going to build on the Boot Camp
data by adding another attribute named Supervisor ID.
To add this new data to the data set, we will run a MEMPUT job to append data to the hub
to create sample relationships.
The Supervisor ID will be the attribute that the Relationship Linker uses to create
connections between entities. Follow the procedures below to implement relationships with
our Boot Camp data.

Adding the relationship attributes


We will need to add attribute information for our new relationship data.

Adding member type attribute information


Use the following procedure to add a new Member Type attribute for relationships.
__ 1. Click the Attributes tab in the Member Types pane.
__ 2. Select Person in the Member Type drop-down list if it is not already selected.
__ 3. Click the Add button to the right of the fields.
__ 4. Enter the following information for the Supervisor ID attribute.

Table B-2: Supervisor ID attribute
Attribute  Supervisor ID
Code       SUPERID
Type       MEMATTR

__ 5. Click the Save button.


Creating relationship rules


Now that we have the attribute necessary to create relationships, let us map out how these
relationships will be created. The relationship will show the hierarchical connections
between employees and their supervisors.
__ 1. Open the Relationship Types pane.
__ 2. Click the Add button to the right of the fields.
__ 3. Type or use the drop-down menus to create the relationship type following the
table below.

Table B-3: Relationship types
Type/Name  Directed  Multiplicity  Entity 1 and 2
Manages    true      1 to Many     id

__ 4. Click the Save button.
__ 5. Click the Rules tab underneath the Relationship Types fields.
__ 6. Click the Add button four times.
__ 7. Type or use the drop-down menus to create the relationship rules following the table
below.

Table B-4: Relationship rules
Source 1  Attribute 1    Field 1  Comparison  Source 2  Attribute 2  Field 2
Archway   Supervisor ID  attrval  equals      Archway   memidnum     blank
Bellwood  Supervisor ID  attrval  equals      Bellwood  memidnum     blank
Archway   Supervisor ID  attrval  equals      Bellwood  memidnum     blank
Bellwood  Supervisor ID  attrval  equals      Archway   memidnum     blank

__ 8. Click the Save button.
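
To see what these four rules do, here is a toy Python illustration (simplified stand-in records, not the Hub's internal representation): a record whose Supervisor ID attribute value (attrval) equals another record's member identifier (memidnum) is linked to it, within and across the Archway and Bellwood sources.

# Toy illustration of Table B-4: link records where one record's Supervisor ID
# equals another record's member identifier, within and across sources.
records = [
    {"source": "Archway",  "memidnum": "100", "supervisor_id": None},   # the boss
    {"source": "Archway",  "memidnum": "101", "supervisor_id": "100"},
    {"source": "Bellwood", "memidnum": "202", "supervisor_id": "100"},  # cross-source
]

for child in records:
    if not child["supervisor_id"]:
        continue
    for parent in records:
        if parent["memidnum"] == child["supervisor_id"]:
            print("%s:%s manages %s:%s" % (parent["source"], parent["memidnum"],
                                           child["source"], child["memidnum"]))
# Archway:100 manages Archway:101
# Archway:100 manages Bellwood:202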


Adding relationship data


Now that our Initiate Member Model has been set up, we can append the new Supervisor
ID attribute to the hub using MEMPUT.
__ 1. Navigate to the MEMPUT graph and open it.
__ 2. Run the MEMPUT graph.

Viewing relationships
Now that we have added our relationship data, we can view it in Inspector.
__ 1. Search for Perry Brose and view the results by clicking the Inspect icon.
__ 2. Click the Entities tab to view the record's relationships.


Appendix C. Glossary


The following terms are key to understanding how our products work and what they do.
Many of these terms are covered in more depth during the class.

Source
A source is a file or system where your data comes from. Sources can be flat files,
databases, web services, or other tools that store information.

Member
A member is simply a record from a source. The term member comes from the concept
that a member record belongs to an entity, like a person belongs to a club or organization.

Attribute
An attribute is a piece of demographic information about a member. These attributes are
specific, like a Home Address or a Work Address.

Attribute type (segment)


An Attribute Type, or Segment, is the table in the database where like attributes are stored.
Addresses are stored in the MEMADDR segment, phone numbers in MEMPHONE, and so
on.

Entity
An entity is a unique person or organization. Multiple members might hold data about one
entity. Entities are represented by an Enterprise Identifier (EID), which is assigned by the
hub.

Algorithm
An algorithm is a series of computational processes that analyze member records. The
algorithm has three main processes: Standardize, Bucket, and Compare.

Standardization
Standardization, the first step in the algorithm, reformats data into consistent chunks. For
example, phone numbers are boiled down to the last 7 digits: 1 (312) 832-1231 =>
8321231.
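
As a toy illustration of that phone example (our own sketch in Python, not the Hub's actual standardization routine):

# Toy phone standardization: strip non-digits, keep the last 7 digits.
def standardize_phone(raw):
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-7:]

print(standardize_phone("1 (312) 832-1231"))   # 8321231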


Bucket
A bucket is a means of organizing members, based on data they share in common, so they
can be found more quickly. Is it easier to find a needle in a haystack or in a jar labeled
needles?

Comparison function
A comparison function is a means for finding similarity or difference between two attributes.
Comparison types include: Exact Match, Starts With, Edit Distance, Phonetics, and
Equivalency.

Weight
A weight is a number that represents the hub's confidence that a single value is a good
identifier. The rarest values have higher weights, while more common values have lower
weights.

Comparison score
A comparison score is the aggregate of the individual attribute weight scores when two
members are compared. Each set of attributes contributes a score (positive or negative) to
the overall score.

Threshold
A threshold is a number on a scale that is used to make a decision. There are two main
thresholds in the hub: Clerical Review (CR) and Auto Link (AL).

Tasks
A task is simply something that a human being has to do. Typically this involves making a
decision about joining two records that fall between the CR and AL thresholds. There are
four main tasks:

Potential overlay
A record received an update with information that is radically different from the data that
was already there. This task is considered the most urgent to resolve.

Potential duplicate
Two records are in the same source and appear to represent the same person or
organization.


Potential linkage


Two records are in different sources and appear to represent the same person or
organization.

Review identifier
Two records from the same source appear to be using the same identifier (like SSN or
Passport Number).

Relationships
When two entities have something in common, they have a relationship. That relationship
might be part of a hierarchy, such as a boss and a subordinate, or part of a group, such as
patients of a medical practice.
