String Matching Algorithms

Explores various string matching algorithms like the Naive algorithm, Rabin-Karp algorithm, and Knuth-Morris-Pratt (KMP) algorithm. Discusses their efficiency and applications in text processing.


Rabin-Karp Algorithm

Explanation of the Rabin-Karp Algorithm

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find a pattern string within a text string. Instead of comparing the pattern character by character against every substring of the text, as the naive approach does, it compares the pattern's hash value with the hash value of each pattern-length substring of the text. Matching hash values only suggest a match: because hashing can produce collisions (different strings with the same hash), the algorithm performs a direct string comparison whenever the hashes match, and skips that comparison otherwise. The algorithm relies on a "rolling hash", in which the hash of the next substring is computed in constant time from the hash of the previous one. This is what makes it faster than comparing the pattern against every possible substring in the text.

In essence, the Rabin-Karp algorithm works in the following steps:

1. Calculate the hash value of the pattern string.
2. Calculate the hash value of the first substring (of the same length as the pattern) in the text.
3. Compare the hash values. If they match, compare the strings directly to confirm the match. This guards against hash collisions (spurious hits).
4. "Roll" the hash to the next substring: compute the hash value of the next substring in the text by removing the contribution of the first character and adding the contribution of the new last character.
5. Repeat steps 3 and 4 until the end of the text is reached.
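
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name rabin_karp, the base of 256, and the modulus 1,000,000,007 are assumptions chosen for the example, and characters are mapped to numbers with ord().

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, modulus: int = 1_000_000_007) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return []

    # Weight of the leading character of a window: base^(k-1) mod modulus.
    high = pow(base, k - 1, modulus)

    # Steps 1 and 2: hash the pattern and the first window of the text.
    p_hash = w_hash = 0
    for i in range(k):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        w_hash = (w_hash * base + ord(text[i])) % modulus

    matches = []
    for i in range(n - k + 1):
        # Step 3: on a hash match, confirm with a direct comparison.
        if w_hash == p_hash and text[i:i + k] == pattern:
            matches.append(i)
        # Step 4: roll the hash to the window that starts at i + 1.
        if i + k < n:
            w_hash = ((w_hash - ord(text[i]) * high) * base
                      + ord(text[i + k])) % modulus
    return matches


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

Python's % operator always returns a non-negative result for a positive modulus, so the subtraction in step 4 needs no extra correction here. In languages where the remainder can be negative, the modulus is typically added back before reducing.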

In-Depth Discussion of the Rabin-Karp Algorithm

Hashing

Hashing is the cornerstone of the Rabin-Karp algorithm. A hash function takes an input (in this case, a string) and converts it into a fixed-size value (the hash value). A good hash function should distribute strings uniformly across the range of hash values, minimizing collisions.

The hash function typically used in Rabin-Karp is a polynomial hash function. Let's say we are working with the alphabet {a, b, c, ..., z} and assigning values to the characters a = 1, b = 2, ..., z = 26. A simple polynomial hash function could be:

hash(s) = (s[0] * b^(k-1) + s[1] * b^(k-2) + ... + s[k-1] * b^0) mod m

Where:

s is the string.
s[i] is the numerical value of the i-th character in the string.
b is the base (e.g., 256 for ASCII characters, or a prime number larger than the alphabet size).
k is the length of the string s (for the pattern's hash, this is the pattern length).
m is a large prime number used as the modulus. Reducing modulo m keeps hash values within a fixed, manageable range, and a large prime modulus helps reduce collisions.

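The choice of b and m is crucial for performance. b should be larger than the largest character value in the character set and is usually chosen to be a prime. m should be a large prime number to reduce collisions, but small enough that intermediate products such as hash * b do not overflow the integer type in use.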
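
As a small worked sketch of the polynomial hash, the function below uses the mapping a = 1, ..., z = 26 from the text. The name poly_hash and the defaults b = 31 and m = 1,000,000,007 are assumptions made for illustration. The sum is evaluated with Horner's rule, which gives the same value as the formula without computing powers of b explicitly.

```python
def poly_hash(s: str, base: int = 31, modulus: int = 1_000_000_007) -> int:
    """Polynomial hash of a lowercase string, with a = 1, ..., z = 26."""
    h = 0
    for ch in s:
        value = ord(ch) - ord("a") + 1  # numerical value of the character
        # Horner's rule: shift the running hash by one power of base,
        # then add the next character.
        h = (h * base + value) % modulus
    return h


# "abc" -> 1 * 31^2 + 2 * 31^1 + 3 * 31^0 = 961 + 62 + 3
print(poly_hash("abc"))  # 1026
```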

Spurious Hits

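A spurious hit occurs when the hash values of the pattern and a substring of the text match, but the strings themselves are different. This is due to hash collisions. The probability of spurious hits depends on the quality of the hash function and the values of b and m: if hash values behave as though they were spread uniformly over the m possible values, a given non-matching substring collides with the pattern with probability of about 1/m. A well-designed hash function with a large modulus therefore keeps spurious hits rare.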

When a hash match occurs, the Rabin-Karp algorithm must perform a direct string comparison to confirm the match and rule out spurious hits. This adds overhead, especially if the hash function results in a high number of collisions.
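
To make spurious hits concrete, the sketch below deliberately uses a tiny modulus (11) so that collisions are easy to produce. The names poly_hash and count_hits and all parameter values are assumptions for illustration. For simplicity each window is hashed from scratch here, since the point is the verification step rather than the rolling update.

```python
def poly_hash(s: str, base: int = 31, modulus: int = 11) -> int:
    """Polynomial hash with a = 1, ..., z = 26 and a deliberately tiny modulus."""
    h = 0
    for ch in s:
        h = (h * base + (ord(ch) - ord("a") + 1)) % modulus
    return h


def count_hits(text: str, pattern: str, base: int = 31, modulus: int = 11):
    """Return (real match positions, number of spurious hits)."""
    k = len(pattern)
    target = poly_hash(pattern, base, modulus)
    matches, spurious = [], 0
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        if poly_hash(window, base, modulus) == target:
            if window == pattern:   # direct comparison confirms the match
                matches.append(i)
            else:                   # same hash, different string
                spurious += 1
    return matches, spurious


# "ab" hashes to 33 mod 11 = 0 and "bd" hashes to 66 mod 11 = 0.
print(count_hits("abdabd", "ab"))                         # ([0, 3], 2)
print(count_hits("abdabd", "ab", modulus=1_000_000_007))  # ([0, 3], 0)
```

With the larger modulus the same text produces no spurious hits, which is why a large prime is preferred in practice.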

Rolling Hash Functions

The key optimization in the Rabin-Karp algorithm is the use of a rolling hash function. Instead of recalculating the hash value for each substring from scratch, a rolling hash function allows us to efficiently compute the hash value of the next substring based on the previous one.

Using the polynomial hash function described above, let's see how we can "roll" the hash. Suppose we have the hash of a substring s[i...i+k-1] and we want to calculate the hash of the next substring s[i+1...i+k].

hash(s[i...i+k-1]) = (s[i] * b^(k-1) + s[i+1] * b^(k-2) + ... + s[i+k-1] * b^0) mod m

hash(s[i+1...i+k]) = (s[i+1] * b^(k-1) + s[i+2] * b^(k-2) + ... + s[i+k] * b^0) mod m

We can calculate hash(s[i+1...i+k]) from hash(s[i...i+k-1]) as follows:

Subtract the contribution of the character s[i] from hash(s[i...i+k-1]). In the hash, s[i] is weighted by b^(k-1), so we subtract s[i] * b^(k-1) (mod m): (hash(s[i...i+k-1]) - s[i] * b^(k-1)) mod m. In languages where the remainder of a negative number can be negative, add m before reducing so the result stays in the range 0 to m-1.
Multiply the result by b (mod m). This raises the exponent of every remaining term by 1: ((hash(s[i...i+k-1]) - s[i] * b^(k-1)) * b) mod m
Add the contribution of the new character s[i+k] (mod m), which enters with weight b^0 = 1: (((hash(s[i...i+k-1]) - s[i] * b^(k-1)) * b) + s[i+k]) mod m

This rolling hash technique avoids recomputing the entire hash for each substring, which dramatically improves performance.
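
The three-step update can be sketched as below and checked against hashes computed from scratch. The helper names (char_value, full_hash, roll, rolled_hashes) and the parameters b = 31 and m = 1,000,000,007 are assumptions for illustration. The factor b^(k-1) mod m is computed once and passed in as high, so each roll takes constant time.

```python
def char_value(ch: str) -> int:
    """Map a = 1, ..., z = 26."""
    return ord(ch) - ord("a") + 1


def full_hash(s: str, base: int, modulus: int) -> int:
    """Polynomial hash computed from scratch in O(len(s))."""
    h = 0
    for ch in s:
        h = (h * base + char_value(ch)) % modulus
    return h


def roll(prev_hash: int, outgoing: str, incoming: str,
         high: int, base: int, modulus: int) -> int:
    """Hash of the next window, derived from the previous hash in O(1)."""
    h = (prev_hash - char_value(outgoing) * high) % modulus  # drop s[i]
    h = (h * base) % modulus                                 # shift exponents
    return (h + char_value(incoming)) % modulus              # add s[i+k]


def rolled_hashes(text: str, k: int, base: int = 31,
                  modulus: int = 1_000_000_007) -> list[int]:
    """Hashes of every length-k window; assumes 1 <= k <= len(text)."""
    high = pow(base, k - 1, modulus)  # b^(k-1) mod m, computed once
    h = full_hash(text[:k], base, modulus)
    hashes = [h]
    for i in range(len(text) - k):
        h = roll(h, text[i], text[i + k], high, base, modulus)
        hashes.append(h)
    return hashes


# Windows of "abcde" with k = 3 are "abc", "bcd", "cde".
print(rolled_hashes("abcde", 3)[:2])  # [1026, 2019]
```

For the second window, the update gives (1026 - 1 * 961) * 31 + 4 = 2019, which equals 2 * 961 + 3 * 31 + 4 computed directly for "bcd".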

Impact of Rolling Hash Functions on Performance

Without a rolling hash function, we would need to calculate the hash value of each substring from scratch. Each such hash costs O(m), which results in an O(m*n) time complexity (where n is the length of the text and m here denotes the length of the pattern, not the modulus used in the hash formula). The rolling hash function reduces the cost of each subsequent hash to O(1), so all hashing takes O(n + m) in total: O(m) for the initial hashes of the pattern and the first substring, plus O(1) for each remaining substring. The overall running time is then determined by this linear pass plus the string comparisons needed to rule out false positives.

Time Complexity Analysis

Average-Case Time Complexity

In the average case, the Rabin-Karp algorithm has a time complexity of O(n + m), where n is the length of the text and m is the length of the pattern. This assumes that the hash function distributes strings relatively evenly, so that spurious hits are rare. The O(m) term comes from calculating the initial hashes of the pattern and of the first window of the text. The O(n) term comes from sliding the window across the text and updating the rolling hash. Every hash match, whether a real occurrence or a spurious hit, requires comparing the pattern against the window in the text, which takes up to O(m). If both real occurrences and spurious hits are rare, this verification cost can be disregarded next to the O(n + m) term.

Worst-Case Time Complexity

The worst-case time complexity of the Rabin-Karp algorithm is O(m*n). This occurs when the hash values match for almost every substring in the text, so that a full string comparison (O(m)) is performed at almost every position. This effectively degrades the algorithm to the naive string searching approach. It can happen in two ways: a poor hash function (or a small modulus) produces many spurious hits, or the pattern really does occur at almost every position. A common example of the second case is when all characters in both the text and the pattern are the same, such as searching for "aaa" in "aaaaaaaa". These are genuine matches rather than collisions, but each one must still be verified character by character.
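
The degradation can be observed by counting the character comparisons made during verification. The sketch below is an illustrative variant of the search; the function name rabin_karp_counting and the parameter values are assumptions. With a text and pattern made of a single repeated character, every one of the n - m + 1 windows is verified in full.

```python
def rabin_karp_counting(text: str, pattern: str,
                        base: int = 256, modulus: int = 1_000_000_007):
    """Return (match positions, character comparisons spent on verification)."""
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return [], 0
    high = pow(base, k - 1, modulus)
    p_hash = w_hash = 0
    for i in range(k):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        w_hash = (w_hash * base + ord(text[i])) % modulus

    matches, comparisons = [], 0
    for i in range(n - k + 1):
        if w_hash == p_hash:
            # Verify character by character, counting each comparison.
            j = 0
            while j < k:
                comparisons += 1
                if text[i + j] != pattern[j]:
                    break
                j += 1
            if j == k:
                matches.append(i)
        if i + k < n:
            w_hash = ((w_hash - ord(text[i]) * high) * base
                      + ord(text[i + k])) % modulus
    return matches, comparisons


# 16 windows, each fully verified with 5 comparisons.
print(rabin_karp_counting("a" * 20, "a" * 5)[1])          # 80
# Two real matches of length 3 and no other hash matches.
print(rabin_karp_counting("abcdefghij" * 2, "fgh"))       # ([5, 15], 6)
```

Both texts have length 20, yet the repeated-character input spends 80 comparisons on verification against 6 for the varied input.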
