Skip to main content

Dynamics Company Name Matching Using R for Data Science

I was given a list of over 2000 company names and asked if it was possible to find which of those we have as accounts in our Dynamics CRM. There were no unique identifiers to directly match the companies apart from their names which could also potentially be spelt differently.

My first initial thought was that to use the CRM SDK to query dynamics checking if they exist, but that could have resulted in a few calls which I wanted to avoid.

The next idea was to export accounts in dynamics into excel. But after taken a while just to read that excel file into c# with 80,000+ accounts and nearly running out of memory,  maybe c# wasn’t the best for this task. I then thought of R which I have used a few times such as in this post for predictive modeling, would be perfect for this due to the volume of data needing to be processed. This guide describes the steps i followed.

R is available as a Visual Studio workload through the installer, which I use currently, but I recommend R-Studio for simplicity sake, much easier to use without the additional Visual Studio features.

You can start by downloading R and then RStudio. There is a quick guide on how to get started with R with a few links to learning resources. But for now an even quicker guide to get you started is this presentation from Princeton.

Create a new project, and add a new R script. Set working directory to the location of your datasets, in my case I have  “Match Accounts.xls” which are my  2000 names in need to attempt to find a match for and “Dynamics Accounts.xlsx” companies exported from dynamics.

Datasets

First step is to load in both data sets. We`ll need to install the “readxl” package.  Use install. package to install the package and then library or require to use functions from the package. Both excel files are then read into R as data frames using read_excel.

install.packages("readxl")
library("readxl")
#load data
inputDynamicsAccounts = as.data.frame(read_excel("Dynamics Accounts.xlsx"))
inputMatchAccounts = as.data.frame(read_excel("Match Accounts.xls"))

There are a few functions to quickly explore your data such as ; names returns column names in the dataset, nrow for the number or records/rows and head  to  preview the first few lines. You can either add these functions into your r script and use Ctrl+Enter to execute each line at a time or directly in the console.

names(inputDynamicsAccounts) # get column names
names(inputMatchAccounts)
nrow(inputDynamicsAccounts) # get number of records
nrow(inputMatchAccounts)
head(inputDynamicsAccounts) # preview data

Pre-Processing

Before any processing I create a unique id or index for my datasets so I can always refer back to each record. Data exported from dynamics contains the account guid in a column called “(Do Not Modify) Account”, so I just use that and renam it to AccountID . The match dataset however does not have a unique identifier so I create a new column called id and fill it with numbers from 1 to the number of rows in the data set.  The seq function creates a sequence of numbers from a start to an end point as an array or list.

#Rename Account id
colnames(inputDynamicsAccounts)[colnames(inputDynamicsAccounts) == "(Do Not Modify) Account"] = "AccountID"
inputMatchAccounts$ID = seq(1:nrow(inputMatchAccounts)) #create new ID column and populate

Cleansing

Before trying to find matching names,i clean up the datasets making sure there is the best possible chance of joining companies by name, making sure they are as close together as possible by eliminating any user data input errors.

Firstly made a copy of the datasets to a new data frame, this will allow referring back to the original names later. The gsub function when applied to a column allows searching for a pattern to be replaced by a value in all rows . gsub allows using a string as a pattern or regex expressions so using “\\s+” to find and replace all spaces, “[[:punct:]]” to remove all punctuation characters. As shown in the R code these are replaced with (“”) which is blank.

#copy to new new data frame
DynamicsAccounts = inputDynamicsAccounts
MatchAccounts = inputMatchAccounts
#remove all spaces
DynamicsAccounts$Name = gsub("\\s+", "", DynamicsAccounts$Name)
MatchAccounts$Name = gsub("\\s+", "", MatchAccounts$Name)
#remove punctuation
DynamicsAccounts$Name = gsub("[[:punct:]]", "", DynamicsAccounts$Name)
MatchAccounts$Name = gsub("[[:punct:]]", "", MatchAccounts$Name)
#change all characters to lowercase
DynamicsAccounts$Name = tolower(DynamicsAccounts$Name)
MatchAccounts$Name = tolower(MatchAccounts$Name)

By conducting these replacements eliminated several different naming possibilities such as for example someone may have entered “Charles company” as “Charle’s Company”, “CharlesCompany” or “Charles.Company” at the end of or cleansing the account name will be “charlescompany” regardless.

Matching

To find matches i use the “left_join” function which is in the “dplyr” package. This function will return all rows from the left table (match dataset) as well as return all columns from both dataset. Rows without matches will have NA`s in right dataset, (Dynamics Accounts) columns. Not the function will also duplicate rows if multiple matches are found for the same name.

#get required package
install.packages("dplyr")
library("dplyr")
#join match accounts to dynamics accounts using the Name column
namematch = left_join(MatchAccounts[, which(names(MatchAccounts) %in% c("Name", "ID"))], DynamicsAccounts[, which(names(DynamicsAccounts) %in% c("Name", "AccountID"))], by = "Name")
#subset ,namematch to only contain rows with matches
namematch = subset(namematch, !(is.na(namematch$AccountID)))
#subset rows without matches
remainingAccounts = subset(MatchAccounts, !(MatchAccounts$Name %in% namematch$Name))

Left join the match accounts, selecting only the name and id columns in both data sets. Checking the new column names using names(namematch) after the join shows there are only 3 columns in the new data set the name ,ID and AccountID.

Currently the namematch dataset includes both rows with matches as well as ones without, so using the subset function to select only rows where the account id. Subset can also be used to get a list of remaining accounts, these are names where matches were not found by getting a subset of our original match accounts that do not exist in the namematch dataset. This remaining account list can be used for any additional matching methods later or just as an output. nrow(namematch) also tells show that i found 361 matches through the join.

Final step is to join the matches back to the first input dataset to get the original names of accounts as well as any other data that was available in them orginally. The current name column in the namematch data frame is not very useful as it now has no spaces or punctuation so remove that from the data set leaving the unique identifiers ID and AccountID.

Using  the familiar left_left join function again to join to the original inputMatchaccount dataset using the ID column to copy across all the columns. A second left join to the Input dynamics accounts using the account id, to allow viewing the matched accounts on the same row. The additional parameter suffix = c(“.Match”, “.Dyn”) is added, this is to label correctly any columns that have the same name in both datasets for example the name column is a duplicate ,in the match data set will now be named Name.Match and the dynamics account Name.Dyn. Final step is to write the list of matches to a csv file using write.csv

#Select only ID columns
namematch = namematch[, which(names(namematch) %in% c("ID", "AccountID"))]
#join the data back again to the original datasources
namematch = left_join(namematch, inputMatchAccounts, by = "ID")
#suffix is added so the name column is suffised .match .dyn
namematch = left_join(namematch, inputDynamicsAccounts, by = "AccountID", suffix = c(".Match", ".Dyn"))
write.csv(namematch[, - which(names(namematch) %in% c("AccountID", "ID"))], "NameMatchDemo.csv", na = "", row.names = FALSE)

At this stage you can then decide to expand this further in several ways. You can use fuzzy matching to attempt to find matches with similar names rather exact , the fuzzyjoin package provides the necessary function for that. My method was to use fuzzy join with a high tolerance and then compare the postcodes of those accounts to see if there are any that are exact, this got me a few more matches.

This can also be used to bulk update data in dynamics for example if you have a list of accounts you need to update a specific attribute but you lack the GUID, you can run this,as long as you preserve the hidden do not modify columns in your dynamics export, you can then re import those accounts back into dynamics after adding your modifications. The full R script below.

install.packages("readxl")
library("readxl")
#load data
inputDynamicsAccounts = as.data.frame(read_excel("Dynamics Accounts.xlsx"))
inputMatchAccounts = as.data.frame(read_excel("Match Accounts.xls"))

#Explore Data
names(inputDynamicsAccounts) # get column names
names(inputMatchAccounts)
nrow(inputDynamicsAccounts) # get number of records
nrow(inputMatchAccounts)
head(inputDynamicsAccounts) # preview data

#Pre-processing
#rename Account id
colnames(inputDynamicsAccounts)[colnames(inputDynamicsAccounts) == "(Do Not Modify) Account"] = "AccountID"
inputMatchAccounts$ID = seq(1:nrow(inputMatchAccounts)) #create new ID column and populate

#Data cleansing
#copy to new new data frame
DynamicsAccounts = inputDynamicsAccounts
MatchAccounts = inputMatchAccounts
#remove all spaces
DynamicsAccounts$Name = gsub("\\s+", "", DynamicsAccounts$Name)
MatchAccounts$Name = gsub("\\s+", "", MatchAccounts$Name)
#remove punctuation
DynamicsAccounts$Name = gsub("[[:punct:]]", "", DynamicsAccounts$Name)
MatchAccounts$Name = gsub("[[:punct:]]", "", MatchAccounts$Name)
#change all characters to lowercase
DynamicsAccounts$Name = tolower(DynamicsAccounts$Name)
MatchAccounts$Name = tolower(MatchAccounts$Name)

#direct name match
#get required package
install.packages("dplyr")
library("dplyr")
#join match accounts to dynamics accounts using the Name column
namematch = left_join(MatchAccounts[, which(names(MatchAccounts) %in% c("Name", "ID"))], DynamicsAccounts[, which(names(DynamicsAccounts) %in% c("Name", "AccountID"))], by = "Name")
#subset ,namematch to only contain rows with matches
namematch = subset(namematch, !(is.na(namematch$AccountID)))
#subset rows without matches
remainingAccounts = subset(MatchAccounts, !(MatchAccounts$Name %in% namematch$Name))

#explore namemateches
names(namematch)
nrow(namematch)

#Join to original dataset
#Select only ID columns
namematch = namematch[, which(names(namematch) %in% c("ID", "AccountID"))]
#join the data back again to the original datasources
namematch = left_join(namematch, inputMatchAccounts, by = "ID")
#suffix is added so the name column is suffised .match .dyn
namematch = left_join(namematch, inputDynamicsAccounts, by = "AccountID", suffix = c(".Match", ".Dyn"))

write.csv(namematch[, - which(names(namematch) %in% c("AccountID", "ID"))], "NameMatchDemo.csv", na = "", row.names = FALSE)

Charles Osei

Dynamics 365 Consultant @ RSM MCSE Business Applications Computer Science – University of Lincoln 2017 Nottingham D365/CRM user group Chapter Leader TDG – Community leader Twitter @charliedevxyz LinedIn https://www.linkedin.com/in/charlesosei/

Leave a Reply

%d bloggers like this: