Improved String Matching Algorithm

Introduction

String matching is an algorithm which is used to match character patterns between strings. Mismatching between strings can be caused by mistyping error, phonetic error or some unknown error. Edit Distance does not tolerate any error source, while Editex only tolerates the phonetic error. The new weighting and distance calculation algorithm tolerates phonetic error and typographic error using letter grouping thereby improving performance which is also suitable for matching homomorphic encrypted strings.

Some basic approaches for matching strings

1. Pattern Matching

Pattern matching checks the character sequences of each string for comparison. Similarity between strings is inversely proportional to number of mismatching characters. Most of these techniques do not distinguish between various mismatching cases and therefore the mismatching may be caused by phonetic variations, mistyping, or truly different strings.

2. Phonetic Matching

Phonetic matching is an effort to match two strings that are similar in how they are heard but different in how they are written.

3. Edit Distance

Edit Distance calculates the minimum number of operations needed to convert a string to another. There are 3 kinds of operations- insertion, deletion and substitution each with a cost of 1.

4. Editex Algorithm

Two functions r(a,b) and d(a,b) determine the cost. Function r(a,b) returns a specific value- 2 if character a and b are not there in the same phonetic group, 1 if both are in same group and 0 if a and b are identical. Function d(a,b) returns the same value as r(a,b) except that if a is h or w, and a ≠ b, then d(a,b) = 1.

5. Homomorphic Encryption

Homomorphic encryption is an algorithm that allows cryptographic operations to be applied directly to the ciphertext. If the results of those operations are decrypted, it will be the same value as if the operations have been done over the plaintext.

Following is the scheme proposed by Chen and Zhao .

c = m + p + pqr

Here c is the ciphertext, m is the plaintext and p and q are prime numbers. They are used as encryption key and processing key, respectively and r is a random number.

Proposed Method

Steps to be followed:

Data is encrypted before processing it.

Collect different letter grouping from existing phonetic algorithms to define phonetic weight between two characters.
Group the letters based on physical adjacencies over different types of keyboard as a typographic weight.
Modify the distance calculation.

Encryption of data

Every single character from input data is compared with character from other data in

database in an encrypted form. These characters then are checked whether they are in the same phonetic groups and typographic groups or not. The distance between two strings is calculated by using the distance value of each group. All data having minimum distance with the input strings are shown as output.The boxes having the dashed lines show where the proposed method gives contributions.

Every letter is converted into a unique sequence of binary number based on the occurrence in every subgroup which is then encrypted with length of the encrypted number restricted to 4 digits. The binary number is 1 if that letter is a member of a subgroup, otherwise 0. For example, from the soundex group, the letter “J” is transformed into a binary sequence “001000”.(Table I)

Phonetic Letter Grouping

Phonetic letter grouping method is based on the existing phonetic letter grouping algorithms- Soundex , Phonix and Editex .

Typographic Letter Grouping

Typographic letter grouping is based on physical adjacencies between two letters in each type of keyboard layouts.

Distance Calculation

The distance for insertion and deletion operation is 1 while for the substitution operation, the initial distance is 2, which equals one deletion and one insertion. To calculate the substitution distance, if two letters are in the same group, distance of the group is subtracted from the initial distance.
For example, take two string, “John” and “Johm” as input. First 3 letters are identical so the distance till 3 characters is 0. Fourth letter involves substitution thus giving initial distance as 2. From the phonetic and typographic letter group, the letter 'n' and 'm' are in the same group in Soundex, Phonix, Editex, Qwerty, and Colemak group. Therefore we subtract from the initial distance, the distance of each group to get 2 – 0.6 – 0.2 – 0.2 – 0.6 – 0.1 =0.3 as the final output.

Results and Conclusion

The process is repeated for the proposed method, the Edit Distance and Editex algorithm in encrypted form. Evaluation is done in terms of the number of false positives (data that having minimum distance with the input, but not identical). The proposed method generates less false positive because it distinguishes the mismatching character’s source of errors.

References

http://ieeexplore.ieee.org/document/8257147/
http://home.iitk.ac.in/~ankitkr/projects/phoneme/report.pdf
https://en.wikipedia.org/wiki/Homomorphic_encryption

IIITD IR MELANAGE

Search This Blog