Sunday, 17 August 2014

IBM 'Big Match'

IBM Big Match gives customers a way to derive master data from the Big Data stored in IBM BigInsights, using the IBM MDM Probabilistic Matching Engine.

The Probabilistic Matching Engine (PME) uses standardization, comparison, scoring and linking techniques to decide whether two records refer to the same entity. Used together with BigInsights, the PME provides the Big Match functionality.

Using the IBM MDM Workbench, the Probabilistic Matching Engine can be configured and exported for use with IBM InfoSphere BigInsights.

Big Match is massively scalable and can perform fast, real-time matching, which helps the customer obtain a 360-degree view of entities quickly.





Links
Technical Overview of Big Data Matching
Harness big data and use actionable insights to provide data confidence
IBM Big Match For Hadoop


Example of Probabilistic Matching
A simple example of MDM Probabilistic Matching is provided here.

Lakshmi and Sevaal go to a showroom that sells dress materials of a specific brand, and each purchases a couple of dress materials. Both are given membership ids, so their details, including customer id, mobile number and address, are stored in the database the store maintains.

Assume the following records:
Customer Id: 2342
Name: Lakshmi Srinivasan   
DoB: Not provided
PAN: SSCCI222M
Address: 18, Silver Street, Oak Rd, Chennai - 59
Mobile: 091-4423423482

Customer Id: 2343
Name: Sevaal V
DoB: 09-07-78
PAN: SSDPG2433V
Address: 19, Silver Street, Oak Road Chennai 600059
Mobile: 9234223143

Both of them like different types of dress materials, mark a Like on Facebook for the types they prefer, and also order dress materials of that brand through online shopping sites.

Sevaal changes her mobile number and misplaces her membership card. After a few months, she goes to a different branch of the same showroom and ends up getting a new membership card with a different customer id.

New Details given:

Customer Id: 2545
Name: Sevaal Vasudevan
DoB: 09-07-1978
PAN: SSDPG2433V
Address: 19, Silver Street, Oak Rd. Chennai 600059
Mobile: 08242479824

Lakshmi moves to Bangalore, and the showroom gives her a different customer id and membership id.

Customer Id: 3454
Name: Lakshmi S
DoB: 30-01-1983
PAN: SSCCI222M
Address: 88, Jaya Nagar, Blr 560089
Mobile: 9283425839
 
These are the details of just two customers. The store has thousands of customers; most have only one customer id, but some have two. Most of these customers are on Facebook and have given their feedback there. Some have filled in feedback in the store, and some also shop online.

The store decides to give a 5% discount to its regular customers. With thousands of customer records (volume), lakhs of purchase records (volume), thousands of feedback records arriving from the Internet every day (velocity), and a few thousand customers having more than one customer id (veracity), this is a challenge for the store. The variety component of the Big Data here is the mix of customer details, feedback, and details that come from the Internet. The store now has Big Data. It needs to match records to find its customers and their details. Customers having two ids have to be identified and their records merged into one.


The MDM Probabilistic Matching Engine can now be used with BigInsights for Big Match. The engine performs standardization, matching, scoring and linking to resolve the records into individual customers.

Standardization
We sometimes write Rd. as the short form of Road. Occasionally, we add the country code before our phone number.

Standardization ensures that the data follows certain standards. It applies rules such as changing Rd. to Road, and it can be customized to remove the country code and any hyphens from a telephone number.

So the records for Sevaal and Lakshmi will be stored as follows.

Customer Id: 2342
Name: Lakshmi Srinivasan   
DoB:
PAN: SSCCI222M
Address: 18, Silver Street, Oak Road, Chennai 600059
Mobile: 4423423482

Customer Id: 3454
Name: Lakshmi S   
DoB: 30-01-1983
PAN: SSCCI222M
Address: 88, Jaya Nagar, Bangalore 560089
Mobile: 9283425839

Customer Id: 2343
Name: Sevaal V
DoB: 09-07-1978
PAN: SSDPG2433V
Address: 19, Silver Street, Oak Road Chennai 600059
Mobile: 9234223143

Customer Id: 2545
Name: Sevaal Vasudevan
DoB: 09-07-1978
PAN: SSDPG2433V
Address: 19, Silver Street, Oak Road Chennai 600059
Mobile: 8242479824

Thus standardization makes the comparison process easy.
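
The rules described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the PME implements standardization. The abbreviation table, the assumption that a mobile number is its last 10 digits, and the assumption that two-digit years belong to the 1900s are all hypothetical choices made for this example.

```python
import re

# Hypothetical abbreviation table; a real configuration would define its own rules.
ABBREVIATIONS = {"rd": "Road", "blr": "Bangalore"}
ABBREVIATION_PATTERN = re.compile(
    r"\b(" + "|".join(ABBREVIATIONS) + r")\b\.?", re.IGNORECASE
)


def standardize_address(address: str) -> str:
    """Expand known abbreviations such as 'Rd.' to their full form."""
    return ABBREVIATION_PATTERN.sub(
        lambda match: ABBREVIATIONS[match.group(1).lower()], address
    )


def standardize_mobile(mobile: str) -> str:
    """Drop hyphens and any country code or leading zero.

    Assumption: the subscriber number is the last 10 digits.
    """
    digits = re.sub(r"\D", "", mobile)
    return digits[-10:]


def standardize_dob(dob: str) -> str:
    """Return DD-MM-YYYY, or an empty string when no date was provided.

    Assumption: two-digit years belong to the 1900s.
    """
    parts = dob.split("-")
    if len(parts) != 3:
        return ""
    day, month, year = parts
    if len(year) == 2:
        year = "19" + year
    return f"{day}-{month}-{year}"


print(standardize_address("88, Jaya Nagar, Blr 560089"))
print(standardize_mobile("091-4423423482"))
print(standardize_dob("09-07-78"))
```

Completing a short PIN code such as "59" to "600059" would need a lookup table and is left out of this sketch.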

Matching
When the system compares the name of customer 2343 with that of customer 2545 using a plain equality check, the names are unequal, and the decision would be that they are different customers.

However, their PAN matches, and the PAN of every individual is unique. Sevaal is not a common name in India. Moreover, their Date of Birth also matches. Hence there is a strong possibility that the two records refer to the same customer.

This way of matching is very similar to what the Probabilistic Matching Engine provided by IBM Master Data Management does. Attributes in the records (here Name, DoB, PAN, Address and Mobile) are compared. Each attribute receives a positive or a negative score depending on how well it matches, and the linking is based on the total score a pair of records obtains.
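
As a rough illustration of attribute-by-attribute comparison, the sketch below scores Sevaal's two standardized records. The weights, the `name_similarity` helper and the rule that a missing value contributes nothing are assumptions made for this example; they are not the values or comparison functions used by the PME.

```python
# Hypothetical (agreement, disagreement) weights per attribute; not IBM's values.
WEIGHTS = {
    "name": (3.0, -2.0),
    "dob": (4.0, -3.0),
    "pan": (8.0, -6.0),
    "address": (2.0, -1.0),
    "mobile": (3.0, -1.0),
}


def name_similarity(a: str, b: str) -> float:
    """Return 1.0 for equal names, 0.5 when the first names agree and one
    last name is a prefix of the other (for example an initial), else 0.0."""
    a_parts, b_parts = a.lower().split(), b.lower().split()
    if a_parts == b_parts:
        return 1.0
    same_first = a_parts[0] == b_parts[0]
    compatible_last = (a_parts[-1].startswith(b_parts[-1])
                       or b_parts[-1].startswith(a_parts[-1]))
    return 0.5 if same_first and compatible_last else 0.0


def compare_records(a: dict, b: dict) -> dict:
    """Give every attribute a positive or negative score for this pair."""
    scores = {}
    for attr, (agree, disagree) in WEIGHTS.items():
        left, right = a.get(attr, ""), b.get(attr, "")
        if not left or not right:
            # Assumption: a missing value neither helps nor hurts.
            scores[attr] = 0.0
        elif attr == "name":
            similarity = name_similarity(left, right)
            scores[attr] = agree * similarity if similarity > 0 else disagree
        else:
            scores[attr] = agree if left == right else disagree
    return scores


record_2343 = {"name": "Sevaal V", "dob": "09-07-1978", "pan": "SSDPG2433V",
               "address": "19, Silver Street, Oak Road Chennai 600059",
               "mobile": "9234223143"}
record_2545 = {"name": "Sevaal Vasudevan", "dob": "09-07-1978", "pan": "SSDPG2433V",
               "address": "19, Silver Street, Oak Road Chennai 600059",
               "mobile": "8242479824"}

scores = compare_records(record_2343, record_2545)
print(scores, "total:", sum(scores.values()))
```

With these made-up weights, the changed mobile number costs a little, but the matching PAN, date of birth and address outweigh it.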

Scoring
Another important characteristic of matching is that the score is based not only on the degree of match but also on the frequency of occurrence. Sevaal is not a commonly found name, whereas Lakshmi is. Hence the score when two records share the first name Sevaal is higher than when they share the first name Lakshmi.

PAN is unique for each individual, so the score given when the PAN of two records matches could be the highest.

In the example we have considered, the records for Sevaal would have a higher score than those for Lakshmi, since the name is uncommon and both the PAN and the Date of Birth match.

The records for Lakshmi would also have a good score, since the PAN matches. However, since the Date of Birth is not provided in one record, the score would be lower.
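
One common way to make rarer values count for more is to weight an agreement by the logarithm of the inverse frequency of the value. The sketch below uses that idea with invented name counts. Both the formula and the counts are assumptions for illustration and are not taken from the PME.

```python
import math
from collections import Counter


def agreement_weight(value: str, counts: Counter, total: int) -> float:
    """Weight for two records agreeing on `value`: rarer values score higher.

    Assumption: weight = log2(total / occurrences of the value).
    """
    occurrences = max(counts[value], 1)  # treat an unseen value as seen once
    return math.log2(total / occurrences)


# Hypothetical first-name counts in a customer file of 64 records.
first_names = ["Lakshmi"] * 16 + ["Sevaal"] * 1 + ["Priya"] * 47
counts = Counter(first_names)
total = len(first_names)

print("Sevaal :", agreement_weight("Sevaal", counts, total))
print("Lakshmi:", agreement_weight("Lakshmi", counts, total))
```

Under this scheme a value that is unique across the file, as a PAN is, would receive the highest possible agreement weight.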

Linking
Now that the scores are obtained, linking has to take place, that is, deciding whether the two records are the same and consolidating all details under one customer id. In general, an upper threshold is configured, and when matching yields a score above it, the system automatically decides that the two records are the same. Below that lies a range of scores in which the system cannot decide whether the records are the same (a potential match), and a data steward needs to decide (in our example, Lakshmi's records). Finally, there is a lower threshold, below which the system automatically decides that the two records are not the same.

Hence after linking, Sevaal has only one customer id, 2545, and all her details are added to it. The two records for Lakshmi, 2342 and 3454, would be shown to the data steward to decide whether they point to the same customer.

In the case of Big Match, the engine performs only the automatic linking and does not generate tasks for records that are a potential match.
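
The three outcomes of linking can be sketched as a simple threshold check. The threshold values and the example scores below are hypothetical; in practice they would come from the matching configuration.

```python
def link_decision(score: float, upper: float = 12.0, lower: float = 6.0) -> str:
    """Classify a pair of records by its total score.

    Assumption: `upper` and `lower` are made-up thresholds for this example.
    """
    if score > upper:
        return "match"            # linked automatically
    if score >= lower:
        return "potential match"  # left for a data steward to review
    return "no match"             # kept as separate customers


# Hypothetical total scores for the two pairs in the example.
print("Sevaal  (2343, 2545):", link_decision(14.5))
print("Lakshmi (2342, 3454):", link_decision(7.5))
```

In a Big Match style run, only the pairs classified as "match" would be linked, and the "potential match" pairs would not become steward tasks.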


4 comments:

Pranab Mukherjee said...

Hello Chitra,

I am a regular reader of your blog. I especially really like all of your posting related to IBM MDM.

Do you have any white paper write-up on “how best we could implement multiple composite view”? Basically I am looking for some pros and cons on implementing multiple composite views on our project.

Any help on this front would be greatly appreciated. Looking forward to hearing back from you.

Best Regards,

Pranab
pranab_mukherjee@hotmail.com

Pranab Mukherjee said...

IBM 'Big Match' is a part of Advance Edition 11x could you pl clarify it -Regards

Mubeen Sulthana said...

Hi Chitra,
Can you help me find the correct answer for the below question.

Q) A client wants to determine if Social Media Sentiment information is coming from known customers and can be related to a known transaction. Which solution should be considered for identifying the customer who posted the comment?
 A. IBM Big Match
 B. IBM InfoSphere Streams
 C. IBM InfoSphere BigInsights
 D. IBM InfoSphere Information Server
