In computer science and statistics, the Jaro–Winkler distance is a string metric for measuring the edit distance between two sequences. It is a variant proposed in 1999 by William E. Winkler of the Jaro distance metric (1989, Matthew A. Jaro). Informally, the Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.
Contents
- Jaro distance
- Jaro Winkler distance
- Example
- Example 1
- Example 2
- Relationship with other edit distance metrics
- References
The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 1 equates to no similarity and 0 is an exact match. The Jaro-Winkler similarity is given by 1 - Jaro Winkler distance.
Jaro distance
The Jaro distance
Where:
Two characters from
Each character of
Jaro-Winkler distance
Jaro–Winkler distance uses a prefix scale
where:
Although often referred to as a distance metric, the Jaro–Winkler distance is actually not a metric in the mathematical sense of that term because it does not obey the triangle inequality [1]. In fact the Jaro–Winkler distance also does not satisfy that axiom that states that
In some implementations of Jaro-Winkler, the prefix bonus
Example
Note that Winkler's "reference" C code differs in at least two ways from published accounts of the Jaro–Winkler metric. First is his use of a typo table (adjwt) and also some optional additional tolerance for long strings.
Example #1
Given the strings
We find a Jaro score of:
To find the Jaro–Winkler score using the standard weight
Thus:
Given the strings
We find a Jaro score of:
To find the Jaro–Winkler score using the standard weight
Thus:
Example #2
Given the strings
Here, the shaded cells are the match window for each character. A 1 in a cell indicates a match. Note that the two Xs are not considered matches because they are outside the match window of 3.
We find a Jaro score of:
To find the Jaro–Winkler score using the standard weight
Thus:
Relationship with other edit distance metrics
There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,
Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.