ReMaDDer Software Tutorial (v2.0) - Fuzzy Match Record Linkage and Data Deduplication

Homepage: http://ReMaDDersoft.wix.com/ReMaDDer
ReMaDDer Software Tutorial

How to use ReMaDDer software for successful records matching, data
cleansing and data deduplication projects
11/20/2016
Revision 2.0.
Table of Contents
Introduction
  What Is ReMaDDer Software
  Fuzzy Match
  Records Linkage
  Data Deduplication
  ReMaDDer Software Advantages
  Prerequisites
  Revision History
Projects
  Projects Page
  Concept of Left and Right Dataset
  Record Matching Project vs. Data Deduplication Projects
  Copy A Project
Raw Data Import
  Left and Right datasets
  Import Raw Data
    Browse And Choose CSV files
    Register CSV Files
    Determine And Convert CSV File To UTF-8
    Edit Raw Datasource Schema Information
    Pre-process Raw Datasource
    Import Data From Raw Datasources
Solution Definition
  How ReMaDDer performs record linkage and data deduplication
  Solution Definition Header
    Solution Basic Information
    Machine Learning Strictness
    Join Type
    Return Only Best Matching Records
  Solution Definition Details
    Fields Picker
    Solution Constraints
Solution Execution
  Solution Execution In One Step
  Solution Execution In Two Major Steps
  Solution Execution In Several Minor Steps
Data Retrieving And Storing
  Execute Resultset Retrieval SQL Query
  Solution Status Info
  Save And Load Resultset
  Review And Edit Resultset
    Resultset Browsing
    Resultset Edit And Review
    Exporting Resultset
Customize Data Grids
Customize Splitters
ReMaDDer Software Trial
Commercial Release Code Purchase And Activation

Introduction
What Is ReMaDDer Software
ReMaDDer is record linkage and data cleansing software with powerful fuzzy record matching and data
deduplication capabilities, based on state-of-the-art machine learning and data processing techniques.
As a client-server application, ReMaDDer consists of two parts: a client front-end and a server-side part.
The client front-end provides a user-friendly graphical interface with intuitive means for creating projects,
importing raw data and defining solutions, while the server-side part provides a powerful data processing
engine that can complete even the most complex fuzzy match analysis in reasonable time.
By combining advanced artificial intelligence with clever blocking techniques and multiple string similarity
metrics, ReMaDDer provides a unique solution for fully automatic records matching and data deduplication
projects.
Traditionally, fuzzy records matching software requires substantial human intervention, either to provide
various parameters and threshold values, or to perform extensive clerical review and supervised
machine learning training. A unique property of the ReMaDDer software is that it does not require any such
human assistance beyond project definition. There are no thresholds or other input parameters that the
user must provide to enable the software to distinguish between matches and non-matches; the
ReMaDDer software is capable of inferring and learning everything by itself.
As far as we are aware, ReMaDDer might be the only software currently available that is capable of performing
fully automatic fuzzy record matching without human expert intervention, while attaining the accuracy of
human clerical review. This is accomplished by utilizing various advanced machine learning techniques and
approaches.
The name ReMaDDer is an acronym for Records Matching and Data Deduplication Software.
Homepage: http://ReMaDDersoft.wix.com/ReMaDDer
Fuzzy Match
The term fuzzy match refers to methods of identifying related records by measuring how similar they are. It
is used in cases where no unique identifier or exact match relation exists between two sets of data.
Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity.
Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with
probabilities below the threshold are considered to be non-matches.
Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold
matching percentage set by the application.
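The generic threshold idea described above can be sketched in a few lines of Python. This is only an
illustration of the concept, not ReMaDDer's own engine: the similarity measure (difflib's SequenceMatcher),
the function names and the 0.8 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Classify a pair as a match when its similarity reaches the threshold.

    The threshold value is an arbitrary assumption for this sketch.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    for left, right in [("Jon Smith", "John Smith"), ("Jon Smith", "Mary Jones")]:
        print(left, "|", right, "|", round(similarity(left, right), 2), is_match(left, right))
```

Raising the threshold makes the classification stricter (fewer false matches, more missed matches), while
lowering it has the opposite effect.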
Records Linkage
Record linkage refers to the task of finding records in a data set that refer to the same entity across different
data sources, i.e. to identify related records in two separate data sets.
Record linkage is necessary when joining data sets is based on entities that may or may not share a common
identifier, as may be the case due to differences in record shape, storage location, and/or curator style or
preference.
There are many business cases where record linkage has to be performed. Some typical examples are
product price lists, partner lists, book and movie catalogs, customer loyalty databases and medical records.
Data Deduplication
Data deduplication refers to identifying duplicate records in a dataset and cleansing the dataset of
redundant information.
ReMaDDer Software Advantages

Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic
papers. Some researchers even build their own software, but those programs tend to be complex and
require an understanding of advanced mathematics and algorithms before they can be used. This cannot
be expected from an average user who faces a data linkage problem and urgently needs to solve it in a
matter of hours or days.
On the other hand, there are huge corporate entity resolution frameworks, produced by big
software companies and oriented towards large corporate customers. These solutions are often very complex
and affordable only to big companies and corporate users.
ReMaDDer places itself in the middle and provides a powerful fuzzy match records linkage solution for mere
mortals and regular office users.
By allowing users to define exact matching constraints, fuzzy matching constraints and all other constraints
in a visual and intuitive way, ReMaDDer hides the complexity of fuzzy match analysis from the user, who
can focus on the business case rather than on technical issues. That is where ReMaDDer software really shines
and clearly distinguishes itself from the competition.
Traditionally, fuzzy record matching software requires immense user involvement in project
parameterization and clerical review. The user is required either to provide various input parameters and
threshold values, or to perform machine learning training and provide examples of
matches and non-matches. In both cases, considerable user involvement and expertise is a prerequisite for
successful analysis.
In contrast, the ReMaDDer software does not require such heavy user involvement, since it can determine
optimal parameter values automatically, all by itself. This is accomplished by advanced artificial intelligence
utilizing various state-of-the-art machine learning techniques.
To summarize: utilization of advanced artificial intelligence, accompanied by an intuitive graphical user
interface and low pricing, is what makes ReMaDDer a superb fuzzy match records linkage solution.
Prerequisites
The major prerequisite for using ReMaDDer is an active internet connection, since raw data is imported to a
remote server where it is processed. After the trial period expires, you are required to purchase a commercial
release code in order to continue using the remote server.
However, project and solution creation and editing can be performed even without an established connection
or a purchased release code, since these data are stored locally on your computer.
The ReMaDDer front-end client is available as an executable for Windows and Linux systems. Executables for
other systems can be provided on demand.
ReMaDDer does not operate directly on original data sources, but requires data to be imported from CSV
(comma separated values) flat files to the server, where corresponding left and right database tables are
then created and processed. Therefore, you will have to provide source datasets as flat CSV files, encoded in
UTF-8, preferably with comma (,) or semicolon (;) field separators.
Revision History
Revision: 1.0.
Date: 3/20/2016
Change Description:
Initial release. Tutorial covers ReMaDDer version 1.0.

Revision: 1.1.
Date: 5/10/2016
Change Description:
Document is updated to reflect changes and improvements brought by ReMaDDer version 1.1.
The new version brings many improvements and simplifies solution definition. Instead of separately
choosing and defining thresholds for trigram similarity and Levenshtein distance functions, a new, combined
similarity function (ReMaDDer_similarity) is introduced that combines both trigram and Levenshtein
similarity properties. This reduces complexity and uncertainty in solution definition, while retaining
ReMaDDer's strengths and advantages.
The previous ReMaDDer version output all columns from the left and right dataset into the resultset. Now
you can choose which fields are to be included in the resultset.
The raw data import process is also much improved, especially regarding importing data from Excel files
(in CSV format) where column names contain non-ASCII characters and blanks.

Revision: 2.0.
Date: 11/20/2016
Change Description:
There are many small performance improvements and several bugfixes that will improve user experience
when using the ReMaDDer software for data match analysis.
Document is updated to reflect major changes and improvements brought by ReMaDDer version 2.0.
The main changes are:
Instead of using only Levenshtein and trigram similarity functions, multiple other similarity metrics are
added to the server engine.
Matches and non-matches are no longer based on similarity thresholds. Instead, ReMaDDer now utilizes
machine learning techniques. Advanced algorithms infer and automatically detect duplicates and record
matches.
Threshold parameters are removed as obsolete.
The "Use composite field" parameter is removed as obsolete.
The "Use inclusive OR" parameter is removed as obsolete.
A new parameter, Machine Learning Strictness, is introduced. The parameter defines how strictly the
artificial intelligence will distinguish between matches and non-matches. The options are: match, strict
match and potential match.
A new parameter, Join Type, is introduced. The Join Type attribute determines how SQL joins between the left
and right tables will be established, via the solution base table. There are three options of joining:
a) inner join, b) left outer join, c) right outer join.
The "inner join" option is the default behavior, meaning that the resultset will contain all rows from the left
and right datasets which meet the matching criteria.
With the "left outer join" option, the resultset will contain all rows from the left dataset and only those rows
from the right dataset that satisfy the matching criteria.
With the "right outer join" option, the resultset will contain all rows from the right dataset and only those
rows from the left dataset that satisfy the matching criteria.
A new parameter, Return Only Best Match, is introduced. The parameter can have a True or False value and
determines whether the SQL query will return only the best matching record or multiple records satisfying
the similarity criteria.
Check this option if you wish to return only the best matching records for each left or right record, when
using the corresponding left or right outer joins.
If this option is unchecked (default), multiple matching rows will be returned.
Projects
Projects Page
A project is the basic entity in ReMaDDer software. Each project contains the definition of two source datasets
to be imported and analyzed (the so-called "left dataset" and "right dataset"), as well as a variable number of
corresponding solutions, which are stored definitions of how to perform fuzzy match analysis.
On creation, each project is assigned a unique project tag. When raw data is imported to the server, the
corresponding input tables get that tag appended to their names. This way, imported tables are always tagged
per project, which ensures their uniqueness.
The Projects page consists of two sections separated by a movable splitter. The upper section contains
a datagrid view where you can browse and edit projects, while the lower section contains a form view
of the currently selected project. The same concept of datagrids and form views is implemented throughout the
application.
You can easily create new projects, edit and browse existing projects, by using navigator buttons.
Concept of Left and Right Dataset

Throughout the ReMaDDer application and this manual, we use the terms left and right dataset (or table).
In every fuzzy match project, we always compare two tables, i.e. two datasets, inspecting the similarity of
their rows. For convenience, we call them the left and right table.
The purpose of entity resolution software such as ReMaDDer is to identify which records from the left
dataset correspond to which records from the right dataset.
ReMaDDer does not operate on original data sources directly, but requires data to be imported from source
CSV (comma separated values) flat files to server, where corresponding left and right database tables are
then created and processed.
Record Matching Project vs. Data Deduplication Projects

In ReMaDDer software, there is no fundamental difference between data deduplication and records
matching projects. In both cases we compare two datasets, trying to infer which records from left dataset
correspond to which records in right dataset.
The only difference between the two is that in case of records matching project we have two different input
datasets to be compared, while in case of data deduplication project we have to compare a dataset with
itself, in order to identify duplicate records in the dataset.
Since ReMaDDer software always compares two datasets, left and right, a data deduplication project
requires importing the same original CSV file twice: first as the left dataset and then as the
right dataset. The ReMaDDer software will thus create two identical tables with different names in the
underlying database.
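The idea of deduplication as "comparing a dataset with itself" can be sketched as follows. This is a
conceptual illustration only: the sample records, the difflib-based similarity and the 0.85 threshold are
assumptions made for the example and do not reflect how the ReMaDDer server decides on duplicates.

```python
from difflib import SequenceMatcher

# Hypothetical sample data; in ReMaDDer the same CSV file would be imported twice.
records = ["Acme Corp.", "ACME Corp", "Globex Inc.", "Initech"]


def find_duplicates(rows, threshold=0.85):
    """Compare a dataset with itself and return index pairs of likely duplicates."""
    left, right = rows, rows  # the same data plays both the left and the right role
    pairs = []
    for i, left_value in enumerate(left):
        for j, right_value in enumerate(right):
            if i >= j:
                continue  # skip a row paired with itself and mirrored pairs
            score = SequenceMatcher(None, left_value.lower(), right_value.lower()).ratio()
            if score >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    print(find_duplicates(records))
```

Skipping pairs where the left index is not smaller than the right index avoids reporting every row as a
duplicate of itself and reporting each duplicate pair twice.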
Copy A Project
Instead of manually entering all the parameters for a new project, ReMaDDer allows you to copy an existing
project into another project. This action copies raw data import specifications as well as solution definitions.
Raw Data Import

Datasets to be analyzed are called "left" and "right" datasets and can be easily imported from source CSV
files encoded in UTF-8.
The CSV ("Comma Separated Values") file format was chosen due to its ubiquity and because databases,
spreadsheet editors and most other data sources can easily export to a CSV file.
The source CSV files must be UTF-8 encoded; otherwise, the import will most likely fail.
Therefore, you must first ensure that the source CSV files are properly UTF-8 encoded. ReMaDDer has
embedded tools for charset encoding detection and conversion, but you can also use Notepad++
(https://notepad-plus-plus.org/), CudaText (http://uvviewsoft.com/cudatext/) and other text
editors capable of detecting and converting file encodings.
ReMaDDer provides a simple and intuitive tool for importing CSV files. It automatically detects the
field delimiter and column schema information. You can then edit the retrieved schema and
finally import the files to the server for further processing.
Left and Right datasets

In each data deduplication or record matching project, we always compare two datasets for matching of
records. In case of record matching projects, these two datasets correspond to two different input CSV files,
while in case of data deduplication projects, these two datasets are imported from the same input CSV file.
Nevertheless, we always have a so-called left dataset and right dataset to be compared. Think of this as
comparing the fingers of the left and right hand. You can easily identify the thumb on the left hand as
related to the thumb on the right hand, since they share a similar shape.
It is the same with fuzzy match analysis, where we compare fields from the left and right dataset in order to
identify string similarities. ReMaDDer internally uses various functions to measure string similarity, the
results of which are then processed by artificial intelligence to infer whether two records represent the same
entity or not.
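Two string similarity measures named in the revision history of this tutorial are trigram similarity and
Levenshtein distance. The sketch below shows simplified textbook versions of both in Python. The function
names and the padding scheme are assumptions for illustration; the actual functions used by the ReMaDDer
server engine are not documented here and may differ.

```python
def trigrams(text: str) -> set:
    """Return the set of 3-character substrings of the padded, lower-cased text."""
    padded = "  " + text.lower() + " "  # padding lets word boundaries form trigrams too
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Share of trigrams common to both strings (Jaccard index), in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(trigram_similarity("Smith", "Smyth"), levenshtein("Smith", "Smyth"))
```

Trigram similarity is tolerant of reordered or missing words, while Levenshtein distance counts individual
typing errors, which is one reason the two are often used together.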
Import Raw Data

The process of importing raw data into the server database consists of several logical phases. First, we need to
identify the source CSV files for the left and right dataset. After the source files are identified, we need to
ensure that the CSV files are properly UTF-8 encoded. Once proper encoding is ensured, we need to retrieve
and specify schema information about the CSV files. In the last phase we actually perform the import from the
source files, according to the previously defined schema. As a result, the source files are imported into the
server-side database, where they can be processed according to various solution definitions.
On the Data Import page, there are two sub-pages: Left Dataset Specification and Right Dataset
Specification, in which we separately define input dataset specifications for the left and right dataset.
Import can be executed separately for the left and right dataset, or both can be imported at once, in a batch.
Browse And Choose CSV files

First step in importing input CSV files is to choose CSV files to be imported.
On upper part of Left Dataset Specification or Right Dataset Specification sub-page, there is a CSV file
browser dialog box.
You can browse CSV files on your computer by clicking the browse button. This opens a file
browser in which you can choose a CSV file. The absolute file path is then copied to the edit box.
Register CSV Files

Next step is to define CSV file schema specification. We call this process registering CSV file.
When you click the Register CSV file button near the file browser, the browsed CSV
file is examined for its columns and its schema information is then inserted into the corresponding list of
fields (columns).
As you can see, ReMaDDer determines the field delimiter in the CSV file (normally either ; or ,) and
retrieves information about columns.
If a column name has upper case characters, it is converted to lower case.
Currently, ReMaDDer treats all columns as text fields of various lengths. This is due to the fact that the
comparison is performed with string comparison functions, so other data types (e.g. datetime, integer, real)
would not make sense for string comparisons.
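The registration behavior described above (detecting the delimiter, reading column names and lower-casing
them) can be approximated with Python's standard csv module. This is a rough sketch under stated
assumptions, not ReMaDDer's registration code: the function name and the sample data are hypothetical, and
only ; and , are considered as delimiters.

```python
import csv
import io


def register_csv(text: str):
    """Detect the field delimiter and return (delimiter, lower-cased column names)."""
    # Restrict the guess to the two delimiters the tutorial mentions.
    dialect = csv.Sniffer().sniff(text, delimiters=";,")
    reader = csv.reader(io.StringIO(text), dialect)
    header = next(reader)  # the first row is assumed to hold column names
    columns = [name.strip().lower() for name in header]
    return dialect.delimiter, columns


if __name__ == "__main__":
    sample = "Name;City;Phone\nJohn Smith;Zagreb;123\nMary Jones;Split;456\n"
    print(register_csv(sample))
```

All values read this way are plain strings, which matches the statement that every column is treated as a
text field.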
Determine And Convert CSV File To UTF-8

In the previous ReMaDDer version, the program detected the encoding and converted it to UTF-8
automatically during CSV file registration. Although very convenient, this might have led to wrong results,
since the encoding detection function is not 100% reliable and sometimes guesses the encoding wrongly.
Charset detection is an inherently difficult task and there is no 100% sure method; it is always
an educated guess based on content inspection.
Therefore, we decided to remove automatic charset detection and conversion to UTF-8. You will have to
ensure yourself that the source files are properly UTF-8 encoded. Charset detection, as well as file
encoding conversion to UTF-8, is still present as a ReMaDDer feature (and even improved), but you have
to trigger it manually with the respective buttons or by choosing it from the menu.
Another option is to use the embedded spreadsheet editor Spready to open and convert source files.
Alternatively, you can use established tools such as the Notepad++ text editor, which are capable of
recognizing the file encoding and performing the required conversion to UTF-8.
Determine And Convert CSV File Encoding, with embedded tool

After a CSV file is registered as left or right dataset source, it can be analyzed with embedded tool for
detecting charset encoding.
When you click the button Determine Encoding of Left Dataset CSV File or the button Determine
Encoding of Right Dataset CSV File, the respective CSV file is analyzed for its encoding type by
two different embedded procedures. The result of the encoding analysis is displayed in a pop-up
window.
If both functions agree that the encoding is UTF-8 (utf8), as in the example above, then the CSV file is in
the appropriate format for import.
If the result is not UTF-8, however, the CSV file must be converted to UTF-8 before importing!
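If you prefer to verify a file outside ReMaDDer, a strict decode is a simple check. The sketch below is an
assumption-based illustration in Python and is unrelated to the two embedded detection procedures; the
function name is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True when the bytes decode cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict mode raises on any invalid byte sequence
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(is_valid_utf8("Zürich".encode("utf-8")))    # valid UTF-8 bytes
    print(is_valid_utf8("Zürich".encode("latin-1")))  # Latin-1 bytes, not valid UTF-8
```

Note that this only tells you whether the bytes are valid UTF-8. A pure ASCII file always passes, and the
check cannot tell you which other encoding a failing file actually uses.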
You can convert the CSV file encoding to UTF-8 by clicking the button Convert Encoding Of Left Dataset
CSV File or Convert Encoding Of Right Dataset CSV File.
When the conversion action is triggered, ReMaDDer first backs up the original CSV file and then converts
the file encoding to UTF-8.
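The same back-up-then-convert sequence can be reproduced manually when the source encoding is already
known. The following Python sketch is an illustration only: the function name, the ".bak" suffix and the
requirement to pass the source encoding explicitly are assumptions, not a description of ReMaDDer's
conversion tool.

```python
import shutil
from pathlib import Path


def convert_to_utf8(path: str, source_encoding: str) -> Path:
    """Back up a file, then rewrite it in place as UTF-8. Returns the backup path."""
    source = Path(path)
    backup = source.with_name(source.name + ".bak")  # hypothetical backup naming
    shutil.copy2(source, backup)                     # keep the original bytes safe first
    text = backup.read_bytes().decode(source_encoding)
    source.write_bytes(text.encode("utf-8"))         # bytes I/O leaves line endings untouched
    return backup
```

For example, convert_to_utf8("left.csv", "latin-1") would leave the original content in left.csv.bak and a
UTF-8 version in left.csv. If the stated source encoding is wrong, the result will contain garbled
characters, so keep the backup until the import has been checked.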
Determine And Convert CSV File Encoding, with embedded spreadsheet editor Spready
Besides above mentioned embedded encoding detection and conversion tool, ReMaDDer has embedded
Spready spreadsheet editor (http://wiki.lazarus.freepascal.org/FPSpreadsheet), which can also be used
for file encoding conversion.
Determine And Convert CSV File Encoding, with external tools

Charset detection with the embedded tool is not 100% reliable, which is also true of any tool that infers
charsets.
If you encounter difficulties with the embedded charset detection and conversion tools, or you already know
the file encoding, you might try various external tools, of which I would recommend the well-established
Notepad++ text editor (https://notepad-plus-plus.org/).
Another interesting alternative is the CudaText text editor (http://uvviewsoft.com/cudatext/), which is
capable of charset detection and conversion too.
Edit Raw Datasource Schema Information

Once you have retrieved schema information from a CSV file, you might conclude that you don't want to import
all columns, but only a subset of fields.
You can edit the schema by using the corresponding data grid navigator buttons.
If you wish to delete the currently selected field from the schema, just click the delete button.
If you wish to restore the original column schema, just click the Get Fields Schema
button and the column list will be repopulated from the CSV file.
Pre-process Raw Datasource
While defining the import schema specification, you might realize that the input data needs some
pre-processing before being imported to the server for further analysis.
Of course, you can edit input CSV files with any spreadsheet editor (such as LibreOffice or OpenOffice
Calc, Gnumeric or Microsoft Excel) or text editor (such as Notepad, Notepad++, ConText, Gedit,
CudaText, Geany or Leafpad), but you can also use the embedded spreadsheet editor Spready.
You can launch the external default spreadsheet editor by clicking the button Open CSV File in Ext.
Editor.
You can launch the embedded spreadsheet editor by clicking the button Open CSV File In Int. Editor.
This will open the embedded spreadsheet editor Spready
(http://wiki.lazarus.freepascal.org/FPSpreadsheet).
Import Data From Raw Datasources

Final step in source data import is execution of import procedure, by clicking appropriate button or
triggering action from respective menu.
We can execute import separately for left and right datasets, by clicking corresponding buttons Import
left dataset CSV file or Import right dataset CSV file or we can import them both at once by
clicking the button Import both CSV files to server.
When you click the import button, ReMaDDer will automatically open the Import Log page, where you
can watch import process progress.
Import speed depends on the file size and, most importantly, on internet connection quality.
Solution Definition
A solution definition is a set of parameters for performing record linkage or data
deduplication analysis. Each project can have many solutions with different specifications, so you can test
which combination of parameters leads to the best results.
Each solution definition consists of a solution header specification and a solution constraints
specification.
Solution header specification contains general info about the solution and defines important parameters
which determine how record matching analysis will be performed. These parameters are: machine
learning strictness, join type and return only best match.
Solution constraints specification consists of: exact match relations section, fuzzy match relations
section and other constraints section.
Solution definition page (page Record Matching Analysis, sub-page Solution Definition):
As with other pages, the Solution page is divided into two sections: datagrid view and form view.
For better user experience, the form view is additionally divided into several tabs and sub-tabs. The main
tabs are: Solution Definition and Solution Result.
The Solution Definition tab is further divided into: Solution Header, Solution Fields Picker and Solution
Constraints.
Solution Header tab is divided into several sub-tabs: Common, Solution Base Table Creation Query
Info and Solution Resultset Retrieval Query Info.
Solution Constraints tab is divided into sub-tabs: Exact Match Constraints, Fuzzy Match Constraints
and Other Constraints.
How ReMaDDer performs record linkage and data deduplication

For each project we can define one or more solutions. A solution consists of a solution definition and a
solution resultset.
A solution definition is a specification that instructs ReMaDDer how to perform record linkage or data
deduplication analysis in order to retrieve the resultset.
We can define three types of solution constraints: exact match constraints, fuzzy match constraints and other
constraints.
Fuzzy match constraints define field pairs from the left and right dataset to be compared for fuzzy string
similarity. In order to infer record similarity, ReMaDDer utilizes various string similarity metrics, along
with powerful machine learning algorithms.
Advanced artificial intelligence automatically infers record linkage or duplicates and creates the solution
base table.
The final step is resultset retrieval, in which the database engine creates and executes an SQL query that
joins the left and right dataset with the solution base table, outputting the resultset. The retrieved resultset
can be exported to a spreadsheet or flat file.
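The retrieval step can be pictured as a three-table join. The sketch below uses Python's built-in sqlite3
module with an in-memory database. Everything in it is hypothetical: the table and column names, the sample
rows and the layout of the solution base table (one row per matched pair) are assumptions made for
illustration, and the SQL that ReMaDDer actually generates on its server may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_data  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE right_data (id INTEGER PRIMARY KEY, name TEXT);
    -- Assumed layout of a solution base table: one row per matched pair.
    CREATE TABLE solution_base (left_id INTEGER, right_id INTEGER);
    INSERT INTO left_data  VALUES (1, 'John Smith'), (2, 'Mary Jones');
    INSERT INTO right_data VALUES (10, 'Jon Smith'), (20, 'Peter Brown');
    INSERT INTO solution_base VALUES (1, 10);
""")


def resultset(join_type: str = "INNER"):
    """Join left and right rows through the base table of matched pairs."""
    query = f"""
        SELECT l.name, r.name
        FROM left_data AS l
        {join_type} JOIN solution_base AS b ON b.left_id = l.id
        {join_type} JOIN right_data AS r ON r.id = b.right_id
        ORDER BY l.id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(resultset("INNER"))       # only matched pairs
    print(resultset("LEFT OUTER"))  # every left row, matched or not
```

With an inner join only the matched pair is returned, while a left outer join also returns the unmatched
left row with an empty right side.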
Solution Definition Header

Solution definition header contains general solution definition parameters and info about solution
execution status.
Solution definition header (whole page):
Solution definition header (datagrid view):
Solution definition header (form view):
Solution definition header can be entered either through datagrid or through form view which shows
currently selected solution.
Solution Basic Information

Basic information about a solution is shown in the fields Solution Name, Solution Tag, Solution Base Table
Name, Tag Assigned, Solution Status and Solution Comment.
Solution Tag is an automatically generated designation which is appended to each solution name by default
and is also used to form the Solution Base Table name.
Solution Base Table Name is automatically formed from the Solution Name and the Solution Tag. The Solution Tag
ensures that the name of the created solution base table is unique on the server.
Solution Status and Solution Comment are fields in which the user can enter additional arbitrary information.
Machine Learning Strictness

The parameter Machine Learning Strictness defines how strictly the artificial intelligence will
distinguish between matches and non-matches. The possible values are: a) match, b) strict match, c) potential
match.
"Match" is the default option. The retrieved resultset will contain a balanced ratio of true
positives to false positives. It tends to include all true positives, with some false positives and
very few false negatives.
"Strict match" is the strictest option. The resultset will tend to contain only true positives but, due to a higher
incidence of false negatives, it might fail to recognize some matches.
"Potential match" is the most lenient option. The resultset will tend to contain all true positives, but many false
positives as well.
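The three options trade false positives against false negatives. A common way to put numbers on this trade-off is precision and recall, which the sketch below computes from counts of true positives, false positives and false negatives. The counts are made-up illustrative numbers chosen only to show the direction of the trade-off; they are not measurements of ReMaDDer.

```python
def precision(tp: int, fp: int) -> float:
    """Share of returned record pairs that are real matches."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of real matches that were actually returned."""
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    scenarios = {
        "strict match": {"tp": 80, "fp": 2, "fn": 20},
        "match": {"tp": 97, "fp": 10, "fn": 3},
        "potential match": {"tp": 100, "fp": 60, "fn": 0},
    }
    for name, c in scenarios.items():
        p = precision(c["tp"], c["fp"])
        r = recall(c["tp"], c["fn"])
        print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

A stricter setting pushes precision up and recall down; a more lenient setting does the opposite.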
Join Type
The Join Type attribute determines how SQL joins between the left and right tables will be established, via
the solution base table. There are three join options: a) inner join, b) left outer join, c) right outer join.
The "inner join" option is the default, meaning that the resultset will contain only those rows from the left and
right datasets which meet the matching criteria.
With the "left outer join" option, the resultset will contain all rows from the left dataset and only those rows
from the right dataset that satisfy the matching criteria.
With the "right outer join" option, the resultset will contain all rows from the right dataset and only those rows
from the left dataset that satisfy the matching criteria.
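The difference between the join types can be sketched with an in-memory SQLite database from Python. The table names `left_ds`, `right_ds` and `base_tbl`, their columns and the sample rows are hypothetical; the real solution base table and the SQL that ReMaDDer generates are not shown in this tutorial, so treat this only as an illustration of the join semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE left_ds  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE right_ds (id INTEGER PRIMARY KEY, name TEXT);
    -- Hypothetical stand-in for the solution base table: matched id pairs.
    CREATE TABLE base_tbl (left_id INTEGER, right_id INTEGER);

    INSERT INTO left_ds  VALUES (1, 'Jon Smith'), (2, 'Ann Lee'), (3, 'Bob Ray');
    INSERT INTO right_ds VALUES (10, 'John Smith'), (20, 'Anne Lee');
    INSERT INTO base_tbl VALUES (1, 10), (2, 20);
    """
)

INNER_JOIN = """
    SELECT l.name, r.name
    FROM left_ds l
    JOIN base_tbl b ON b.left_id = l.id
    JOIN right_ds r ON r.id = b.right_id
    ORDER BY l.id
"""

LEFT_OUTER_JOIN = """
    SELECT l.name, r.name
    FROM left_ds l
    LEFT JOIN base_tbl b ON b.left_id = l.id
    LEFT JOIN right_ds r ON r.id = b.right_id
    ORDER BY l.id
"""

inner_rows = conn.execute(INNER_JOIN).fetchall()
left_rows = conn.execute(LEFT_OUTER_JOIN).fetchall()
print(inner_rows)  # only matched pairs
print(left_rows)   # every left row; unmatched ones have None on the right
```

A right outer join is the mirror image, with the roles of the two datasets swapped.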
Return Only Best Matching Records

The parameter Return Only Best Match can be True or False and determines whether the SQL
query will return only the best matching record or multiple records satisfying the similarity criteria. It is used as
a modifier to a left outer join or right outer join.
If this option is unchecked (default), multiple matching rows will be returned. If it is checked, only the best
matching item from the slave dataset will be joined to the corresponding record in the master dataset.
Check this option if you wish to return only the best matching record for each left or right record, when
using left or right outer joins and the datasets are in a master/slave relation.
With the inner join type, this parameter has no meaning and is ignored.
A typical use case for a left or right outer join with the return only best matching option is matching
two product price lists, one of which is the master list.
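The effect of this option can be sketched in Python as keeping, for each master record, only the candidate with the highest score. The tuple layout, the numeric score and the function name `keep_best_matches` are assumptions for illustration; the tutorial does not specify how ReMaDDer ranks candidates internally.

```python
from typing import Dict, List, Tuple

# (master_id, slave_id, similarity score); hypothetical layout and values.
Candidate = Tuple[int, int, float]


def keep_best_matches(candidates: List[Candidate]) -> List[Candidate]:
    """Keep, for each master record, only the highest-scoring slave record."""
    best: Dict[int, Candidate] = {}
    for cand in candidates:
        master_id, _slave_id, score = cand
        current = best.get(master_id)
        if current is None or score > current[2]:
            best[master_id] = cand
    # Return in master_id order for stable output.
    return [best[m] for m in sorted(best)]


if __name__ == "__main__":
    candidates = [(1, 10, 0.95), (1, 11, 0.80), (2, 20, 0.70), (2, 21, 0.90)]
    print(keep_best_matches(candidates))
```

With the option unchecked, all four candidate rows would be returned; with it checked, one row per master record remains.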
Solution Definition Details

While the solution definition header defines general parameters for performing fuzzy match analysis, the solution
definition details are set in the Field Picker sub-page and the Solution Constraints sub-page, which has three
sections defining solution constraints: the Exact Match Relations section, the Fuzzy Match Relations section
and the Other Constraints section.
Fields Picker
ReMaDDer provides a simple, yet very powerful visual tool to add field pairs to the exact match relations section,
the fuzzy match relations section or the other constraints section.
With the fields of the input datasets ("left" and "right" datasets) listed side by side, you can easily browse the two
lists, visually establish field pairs and send them to the appropriate constraints definition section by clicking
the appropriate button.
You can add the selected field pair to the exact match section by clicking the button Add Fields Pair To Exact
Match Relations Section.
You can add the selected field pair to the fuzzy match section by clicking the button Add Fields Pair To Fuzzy
Match Relations Section.
You can add a left or right dataset field to the other constraints section by clicking the respective button.
By eliminating the need for tedious manual input and letting you build solution constraints visually instead,
ReMaDDer simplifies solution definition creation and boosts your productivity.
Starting from ReMaDDer version 1.1, the checkbox column Output Field to Resultset? is added to the
Field Picker datagrid. It is used to include fields in or exclude them from the resultset output. By default,
all fields are included in the resultset.
Solution Constraints
There are three types of constraints that you can define for a solution: exact match relations, fuzzy match
relations and other constraints.
Exact Match Relations

In the exact matching relations section, we can add field pairs from the "left" and "right" imported datasets and
define their equality (=) or inequality (<>).
If we can define an exact matching relation on one or more field pairs, we can tremendously increase the speed of
analysis by narrowing down the number of record pair combinations to be analyzed for fuzzy match.
Therefore, it is recommended to use exact match relations whenever possible.
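The speed-up comes from simple counting: without any exact match relation, every left record can be paired with every right record, while an equality relation on one field leaves only the pairs that share the same value in that field. The sketch below counts both cases. The record layout and the field name `zip` are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]


def count_all_pairs(left: List[Record], right: List[Record]) -> int:
    """Every left record paired with every right record."""
    return len(left) * len(right)


def count_pairs_with_exact_match(left: List[Record], right: List[Record], field: str) -> int:
    """Only pairs whose value in `field` is equal on both sides."""
    left_counts = Counter(rec[field] for rec in left)
    right_counts = Counter(rec[field] for rec in right)
    # Counter returns 0 for values that never occur on the right side.
    return sum(n * right_counts[key] for key, n in left_counts.items())


if __name__ == "__main__":
    left = [{"name": "Jon Smith", "zip": "10001"},
            {"name": "Ann Lee", "zip": "20002"},
            {"name": "Bob Ray", "zip": "10001"}]
    right = [{"name": "John Smith", "zip": "10001"},
             {"name": "Anne Lee", "zip": "20002"},
             {"name": "Rob Ray", "zip": "30003"}]
    print(count_all_pairs(left, right))                      # 9
    print(count_pairs_with_exact_match(left, right, "zip"))  # 3
```

On datasets with many records and many distinct values in the exact-match field, the reduction is far larger than in this toy example.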
Fuzzy Match Relations

In the solution header section we can set various general parameters that determine how fuzzy match analysis
will be performed: we can choose machine learning strictness, join type and whether all matches or just
best matches will be returned.
In the fuzzy match relations section we provide details for fuzzy match comparison analysis. We can list the field
pairs to be compared and further define how fuzzy match analysis will be performed.
Relative Field Weight
For each field pair which will be compared for similarity, we have to define its relative weight. The bigger
the weight, the greater the importance of that field pair's similarity in the final decision whether two
records match or not.
The weight for a particular field pair is entered as an arbitrary integer value in the field Field Weight
(integer) and ReMaDDer then calculates its relative weight. The sum of relative weights is always 1.
When a new field pair is added to the fuzzy match relations section, it gets the default Field Weight
(integer) value, which is one (1). You can change this value to any bigger integer and ReMaDDer will
recalculate the relative weights so that their sum is always 1.
Notice that there is an additional graphical indicator of relative weights. It shows the relative weight
of the currently selected field pair.
There are two buttons provided: Recalculate Weights and Reset Weights.
The button Reset Weights resets all integer field weights to 1, which is the same as if relative weights were not
used at all. In that case, all field pairs are treated as equally important.
The button Recalculate Weights recalculates the relative weights according to the integer
values entered in the field Field Weight (integer). You don't need to trigger this action manually, since
it is triggered automatically on each change of an integer value or addition of a field pair.
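The tutorial does not spell out the formula, but the natural reading is that each integer weight is divided by the sum of all integer weights. The sketch below implements that reading; it is an assumption, and the function name `relative_weights` is hypothetical.

```python
from typing import List


def relative_weights(integer_weights: List[int]) -> List[float]:
    """Normalize positive integer weights so that they sum to 1.

    Assumed formula: relative weight = integer weight / sum of integer weights.
    """
    if not integer_weights or any(w <= 0 for w in integer_weights):
        raise ValueError("weights must be a non-empty list of positive integers")
    total = sum(integer_weights)
    return [w / total for w in integer_weights]


if __name__ == "__main__":
    # Three field pairs; the first one counts twice as much as each of the others.
    print(relative_weights([2, 1, 1]))  # [0.5, 0.25, 0.25]
    # All integer weights equal to 1: every field pair is equally important.
    print(relative_weights([1, 1, 1]))
```

Because only the ratios matter under this formula, integer weights of 2, 1, 1 and 4, 2, 2 would give the same relative weights.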
Other Constraints
As with exact matching relations, it is desirable to limit the analysis to a particular subset of data. Such
constraints can greatly increase the speed of record linkage or data deduplication analysis.
We can define any custom constraint to be applied to a particular field from the "left" or "right" dataset.
Normally, the condition is: sometable.somefield = some string, but other operators such as LIKE can be used
as well.
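As an illustration of what such conditions do, the sketch below applies an equality condition and a LIKE condition to a small in-memory SQLite table from Python. The table name `left_ds`, the field names and the values are hypothetical, and the exact syntax accepted by the Other Constraints section may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical table and field names, for illustration only.
    CREATE TABLE left_ds (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO left_ds VALUES
        (1, 'Jon Smith',   'US'),
        (2, 'Ann Lee',     'UK'),
        (3, 'Joan Smythe', 'US');
    """
)

# Equality constraint on one field.
equal_rows = conn.execute(
    "SELECT id FROM left_ds WHERE left_ds.country = 'US' ORDER BY id"
).fetchall()

# Pattern constraint with LIKE ('%' matches any run of characters).
like_rows = conn.execute(
    "SELECT id FROM left_ds WHERE left_ds.name LIKE '%Lee' ORDER BY id"
).fetchall()

print(equal_rows)  # rows with country 'US'
print(like_rows)   # rows whose name ends with 'Lee'
```

Either condition restricts the analysis to a subset of one dataset before any fuzzy comparison takes place.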
Solution Execution
Once a solution definition is prepared by setting global parameters and exact match, fuzzy match and other
constraints, you can execute the solution on the remote server and retrieve the resultset. Solution execution has two
outcomes: the solution base table is created on the server and the resultset is retrieved to the
client.
Solution execution is actually a sequence of two different steps, which can be executed in batch or separately.
The first step is solution base table creation on the server, which is a prerequisite for the next step, resultset
retrieval on the client.
The first step, in which the Solution Base Table is created, is the most critical point in the ReMaDDer
application (and the most resource and time demanding, too). In this step, a sequence of several critical
underlying procedures is triggered that determines the solution space from which the final resultset is
retrieved.
This step is composed of several discrete sub-steps.
The first sub-step is the so-called blocking procedure, which is a method to reduce the space of record pair combinations
which will be further analyzed for string similarity. This step is of great importance, since fuzzy match
analysis is inherently time-consuming and analyzing all possible combinations would take an extremely
long time to complete.
The next sub-step calculates string similarity between left and right dataset records.
ReMaDDer utilizes multiple string similarity functions. Some of them are quite resource demanding.
After string similarity is established for all combinations in the solution space, advanced machine learning
algorithms take the results from the previous step and infer record linkage or detect duplicates. This is the heart of
the inventive and unique approach that ReMaDDer software uses to perform entity resolution.
Unlike competing software, ReMaDDer does not require any user involvement in this step. There is
no need to provide examples of matches and non-matches, nor to provide any threshold value that
would distinguish matches from non-matches. ReMaDDer will acquire knowledge and determine record
linkage automatically, without the need for a human domain expert or clerical review.
As far as we are aware, there is no other software currently available on the market that is capable of performing
such automatic record linkage inference by artificial intelligence, with accuracy reaching that of human clerical
review.
Technically, a solution can be executed in three different ways:
A) In one step
In this scenario, both major steps (solution base table creation and resultset retrieval) are executed at once.
B) In two major steps
In this scenario, the major steps (solution base table creation and resultset retrieval) are executed one by one
in consecutive order.
C) In several minor steps
In this scenario, both major steps are executed as a sequence of several distinct minor steps.
In the simplest scenario, you can execute the solution in one step. On the Solution definition header, as well as in the
corresponding Solution Header menu entry, there are two buttons. The button Execute Solution
executes both steps at once, in batch, while the button Prepare And Execute Result SQL Query
executes only the last step, i.e. resultset retrieval.
You must click the button EXECUTE SOLUTION at least once, in order to create the
underlying Solution Base Table on the server, which is a prerequisite for the second step, resultset retrieval.
The first step, solution base table creation, might be extremely resource and time demanding. Depending
on the record counts in the left and right datasets, the number of field pairs to be compared for string similarity etc.,
it can take anything from 30 seconds to 24 hours or even more (!).
Be aware that the solution complexity, and the time required for the solution to be resolved, grows much faster than
linearly with the record counts in the left and right datasets, because the number of possible record pairs is the
product of the two record counts. The number of field pairs to be compared also adds to the cost: it is
not the same if you analyze only one field pair or if you compare 9 field pairs for fuzzy match. Fuzzy match
analysis is inherently complex and time consuming.
Once the solution base table is created, you can easily change machine learning strictness or join
type or choose whether to return only the best match. For these changes, you don't need to re-trigger the tedious
and time-consuming solution base table recreation; it is enough to re-trigger only the second step. That is
exactly why the button Prepare And Execute Solution Result SQL Query is provided.
Besides the default division into major steps, there is also a fine-grained division into sub-steps, which
is available in the Solution Definition menu entry.
In the fine-grained division of solution execution steps, we distinguish the following separate actions:
Prepare Solution Base Table SQL Query --> this action prepares the SQL query for the solution
base table, but does not execute it.
Execute Solution Base Table SQL Query (Create Solution Base Table) --> this action
executes solution base table creation.
Prepare Solution Result SQL Query With Forced Base Table (Re)creation --> this action
triggers regeneration of the SQL query for (re)creation of the solution base table on the server and then retrieves the
resultset.
Prepare Solution Result SQL Query With Check Whether To Create Base Table --> this
action checks whether the solution base table has to be recreated. The solution base
table is recreated only if necessary. Then the resultset is retrieved.
Prepare Solution Result SQL Query --> this action only prepares the SQL query that will retrieve the resultset, but
does not execute it.
Prepare And Execute Solution Result SQL Query --> this action prepares and executes the SQL query that
retrieves the resultset.
Execute Solution Result SQL Query (Retrive Resultset) --> this action executes the already prepared SQL
query that retrieves the resultset.
These fine-grained actions are accessible only from the menu, because a casual user will rarely need them. For
a regular user, it is only relevant to remember that the solution base table must first be created in order to be
able to retrieve the resultset.
If the solution base table is already created, you don't need to recreate it for a different
combination of machine learning strictness, join type and return only best match parameters. It is
enough to use the Prepare And Execute Solution Result SQL Query button.
If you're in doubt and don't know what to do, the simplest and safest way to execute the solution and retrieve
the resultset is to click the EXECUTE SOLUTION button.
Solution Execution In One Step

The simplest way to execute a solution is to run the analysis in one step, by clicking the button
EXECUTE SOLUTION or by choosing the corresponding menu item.
This action forces (re)creation of the solution base table on the server, from scratch, and then prepares and executes
the resultset retrieval SQL query.
Be aware that solution base table (re)creation is a costly action and might take considerable time to
complete! If the left or right dataset contains millions of records, this might take an extremely long time to
complete.
Therefore, it is preferable to execute solution base table (re)creation only if necessary.
Solution Execution In Two Major Steps

Besides simple solution execution in one step, it is possible to execute a solution in two major steps.
In this scenario, the first step is solution base table creation on the server, which is a prerequisite for the next step,
resultset retrieval on the client.
Once the solution base table is created, you can easily change machine learning strictness or join
type or choose whether to return only the best match. For these changes, you don't need to re-trigger the tedious
and time-consuming solution base table recreation; it is enough to re-trigger only the second step. That is
exactly why the button Prepare And Execute Solution Result SQL Query is provided.
On the Solution definition header, as well as in the corresponding Solution Header menu entry, there is the
button Prepare And Execute Result SQL Query, which executes only the last step, i.e. resultset
retrieval. You can use it if a proper solution base table is already created on the server.
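The rule of thumb described above can be written down as a small decision function: if the base table does not exist yet, or anything other than the three retrieval-time parameters changed, both major steps are needed; otherwise resultset retrieval alone is enough. The parameter names and step labels below are hypothetical, and treating every other change as requiring base table recreation is a conservative assumption, not documented ReMaDDer behavior.

```python
from typing import List, Set

# Parameters that, per the text above, affect only resultset retrieval.
RETRIEVAL_ONLY_PARAMS = {
    "machine_learning_strictness",
    "join_type",
    "return_only_best_match",
}


def steps_to_run(base_table_exists: bool, changed_params: Set[str]) -> List[str]:
    """Return the major steps needed after a change to the solution definition."""
    other_changes = changed_params - RETRIEVAL_ONLY_PARAMS
    needs_rebuild = (not base_table_exists) or bool(other_changes)
    steps = []
    if needs_rebuild:
        steps.append("create solution base table")
    steps.append("retrieve resultset")
    return steps


if __name__ == "__main__":
    print(steps_to_run(True, {"join_type"}))              # retrieval only
    print(steps_to_run(True, {"fuzzy_match_relations"}))  # both steps
    print(steps_to_run(False, set()))                     # both steps
```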
Solution Execution In Several Minor Steps

If an appropriate solution base table is not yet created, or the solution definition has changed so that it needs to be
recreated, then you have to (re)create the solution base table first, and then execute the resultset retrieval query.
Besides executing everything in one step, there is also a fine-grained division into sub-steps,
presented in the Solution Definition menu entry.
In the fine-grained division of steps, we distinguish the following separate actions:
Prepare Solution Base Table SQL Query --> this action prepares the SQL query for the solution
base table, but does not execute it.
Execute Solution Base Table SQL Query (Create Solution Base Table) --> this action
executes solution base table creation.
Prepare Solution Result SQL Query With Forced Base Table (Re)creation --> this action
triggers regeneration of the SQL query for (re)creation of the solution base table on the server.
Prepare Solution Result SQL Query With Check Whether To Create Base Table --> this
action checks whether the solution base table has to be recreated. The solution base
table is recreated only if necessary.
Prepare Solution Result SQL Query --> this action only prepares the SQL query that will retrieve the resultset, but
does not execute it.
Prepare And Execute Solution Result SQL Query --> this action prepares and executes the SQL query that
retrieves the resultset.
Execute Solution Result SQL Query (Retrive Resultset) --> this action executes the already prepared SQL
query that retrieves the resultset.
Data Retrieving And Storing

You can launch previously prepared solution SQL queries and return resultsets by clicking the button
Prepare And Execute Result SQL Query.
Alternatively, you can execute the solution in one step, which includes both solution base table creation and
resultset retrieval SQL query execution, with the button EXECUTE SOLUTION.
In both cases, once the resultset is retrieved, it is stored locally on your computer and you can load it afterwards,
anytime you wish.
You can easily browse, edit and analyze results in many different ways, including datasheet forms with
sophisticated data searching, filtering and navigation capabilities.
Execute Resultset Retrieval SQL Query

The resultset retrieval query is executed by clicking the button Execute Solution, or by clicking the button
Prepare And Execute Solution Result SQL Query, which can be used if the solution base table has already been
created.
The difference is that the Execute Solution action (re)creates the underlying solution base table and then executes
the SQL query which joins the left and right datasets with the solution base table, while the action Prepare And
Execute Results SQL Query performs only the last step. The prerequisite for using the latter is that the
solution base table has already been created.
When the action is triggered, the previously prepared SQL query text is sent to the server for execution. The progress
of query execution can be monitored on the Solution Log page.
The retrieved resultset is automatically opened in a separate form.
Solution Status Info

ReMaDDer automatically updates the solution status upon preparation and execution of the solution base table creation query and the resultset
retrieval query. This solution status information is shown both in the
solution header data grid and in the form view, in the respective tabs.
You get various information about the solution base table creation process, such as: whether the solution base table
is created or not, whether the solution creation query has already been executed or not, whether the solution base
table is empty or not, and what the query execution times are.
You also get various information about the resultset retrieval query execution process, such as: whether the
resultset retrieval SQL query is generated (prepared) or not, whether the SQL query has already been executed or not,
whether the resultset is retrieved or not and, if retrieved, whether it was empty or not. It is also shown whether
the resultset is stored locally and in which file. There is information about execution times and the number of
executions performed.
Save And Load Resultset

Once a solution is executed and the results are retrieved, the resultset is automatically saved as a locally stored file
in the ReMaDDer installation folder, subfolder /data/results.
The resultset can be loaded into the subpage Solution Result of the main form by clicking the button Load
Solution Resultset, or into a separate form by clicking the button Load Solution Resultset In
Separate Window.
Review And Edit Resultset

There are various ways you can post-process and review the retrieved resultset.
Resultset Browsing
You can easily browse, edit and analyze the loaded resultset in the data grid form. The datasheet offers sophisticated
data searching, filtering and navigation capabilities.
You can scroll by using the mouse, the vertical and horizontal sliders and the arrows.
You can also browse records by using the navigation buttons.
Resultset Searching
You can easily search for any particular value in any column. In the upper left corner of the datagrid
there is a small button represented by an orange double arrow. This button opens a pop-up dialog
with various search, filter and customization options, one of which is Find data.
When you click the Find data button, a search dialog box appears. You can search for any value in
any column.
Resultset Filtration
You can easily filter by any column. In the upper left corner of the datagrid there is a small button
represented by an orange double arrow.
This button opens a pop-up dialog with various search, filter and customization options, including Filter
data and Filter in table, which are two different ways to filter a datagrid.
Filter Data
When you click the button Filter data, a dialog box appears in which you can build your filtering
conditions. This way you can define complex multicolumn filters.
The filter is then applied by clicking the Apply button.
Filter In Table
Another option for filtering is to use the button Filter in table, which activates a filtering
combobox placed just below each column's title. When you click on the filtering combobox cell,
a combobox list appears, listing all possible values for the respective column. When you select a value, the
respective column is automatically filtered by the chosen value.
Resultset Sorting
You can sort any column in ascending or descending order by clicking its column title.
Resultset Edit And Review

You can easily edit the resultset in the datagrid. You can delete a row by using the delete button
or edit a record by clicking the edit button.
Exporting Resultset
Besides using datagrid controls, another option for resultset post-processing is to export the resultset into
a spreadsheet and then review and edit it in a spreadsheet editor.
ReMaDDer offers several ways of exporting a resultset to spreadsheets.
Exporting Resultset To Spreadsheet
The resultset can be exported to a CSV file by clicking the button Export To CSV File.
The resultset can be exported to an XLSX file by clicking the button Export To XLSX File.
The resultset can be exported to an XLS file by clicking the button Export To XLS File.
The resultset can be loaded directly into your default spreadsheet editor, e.g. LibreOffice Calc or Microsoft
Excel, by clicking the button Load In Ext. Spreadsheet Editor.
ReMaDDer also has its own embedded spreadsheet editor which can be used for resultset post-processing.
The resultset can be loaded into the embedded spreadsheet editor by clicking the button Load As
Spreadsheet.
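An exported CSV file can also be post-processed with a script instead of a spreadsheet editor. The sketch below reads such a file with Python's standard `csv` module. It assumes the export is comma-delimited, UTF-8 encoded and has a header row; the tutorial does not state these details. The column names and the file name `resultset.csv` are hypothetical.

```python
import csv
import io
from typing import Dict, List


def read_resultset(csv_file) -> List[Dict[str, str]]:
    """Read an exported resultset into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))


if __name__ == "__main__":
    # In-memory stand-in for an exported file; column names are hypothetical.
    sample = io.StringIO(
        "left_name,right_name\n"
        "Jon Smith,John Smith\n"
        "Ann Lee,Anne Lee\n"
    )
    rows = read_resultset(sample)
    print(len(rows), rows[0]["right_name"])

    # For a real export (assumed comma-delimited, UTF-8, with a header row):
    # with open("resultset.csv", newline="", encoding="utf-8") as f:
    #     rows = read_resultset(f)
```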
Exporting Datagrid To Spreadsheet

Another possibility for exporting the resultset into a spreadsheet file is to use the datagrid's exporting feature.
You have to browse to the destination folder for the export and enter the exported file name and extension, as well
as the page name (sheet name). If you forget to specify the page name, you will get an error.
Customize Data Grids

ReMaDDer enables you to customize the user interface to a certain extent. You can shrink or stretch
columns, rearrange their order and hide/unhide columns.
Resize columns by dragging the vertical splitters between columns.
Rearrange columns by pressing the left mouse button on a column's title and dragging the column while the
mouse button is held down. After the column is moved to another position, release the mouse button.
You can define which columns are shown and which are hidden by clicking the button Select visible
columns.
When you close the application, your customization is saved (remadder_props.xml file), and when you
open the application again, your customization is loaded as well.
Customize Splitters
You will notice that various sections are divided by splitters, which you can drag to resize
the corresponding sections.
The customization you make is saved on application close and reloaded on application start.
ReMaDDer Software Trial

The ReMaDDer client application is distributed as shareware with a 15-day trial period.
The trial period is initialized on the first application start on your computer.
Commercial Release Code Purchase And Activation

After the trial period expires, you are required to purchase a commercial release code in order to
continue using server features, such as raw data import and query execution.
You can, however, continue creating and editing projects and solution definitions, as well as loading and
editing previously acquired resultsets.
When purchasing a release code, you are required to enter the MachineID in the purchase form. The MachineID is
a tag generated by ReMaDDer software and is unique to your hardware. The purchased commercial release
code is thus machine-specific and valid only for your hardware.
Once you have purchased a release code, activate it by clicking the button Activate Commercial Release
Code.
You are asked to enter the release code.
The entered release code is then validated and, if correct, the server-side features are unlocked for
you.
