String Algorithms: Pattern Matching in Java

Exploring various string algorithms for pattern matching, including brute force, KMP, and Boyer-Moore algorithms, implemented in Java.


Advanced String Matching in Competitive Programming (Java)

String matching is a fundamental problem in computer science, frequently encountered in competitive programming. While basic string matching algorithms like brute force are easy to understand, they are often inefficient for large datasets. This document explores more advanced string matching techniques suitable for competitive programming in Java.

Advanced String Matching Techniques (Optional)

These techniques provide better time complexity for certain string matching scenarios. We'll explore the Knuth-Morris-Pratt (KMP) algorithm and discuss its applications.

Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm is a linear time string searching algorithm. It avoids unnecessary comparisons by pre-processing the pattern (the string we're searching for) to build a "prefix table" (also called a "failure function"). This table helps determine the next possible position to check when a mismatch occurs, effectively skipping characters that are already known to not match.

KMP Algorithm Explanation

  1. Prefix Table Construction: The prefix table (lps - Longest Proper Prefix which is also a Suffix) for a pattern P of length m is an array lps[0...m-1] such that lps[i] stores the length of the longest proper prefix of P[0...i] which is also a suffix of P[0...i].
  2. Searching: We traverse the text T and the pattern P. If characters match, we increment both indices. If they mismatch, instead of restarting the pattern search from the beginning, we use the lps table to shift the pattern based on the information already gathered.

Example

Text: ABABDABACDABABCABAB

Pattern: ABABCABAB

The KMP algorithm would efficiently find the pattern within the text.

Java Implementation of KMP

 public class KMP {

        public static void search(String text, String pattern) {
            int n = text.length();
            int m = pattern.length();

            int[] lps = computeLPSArray(pattern);

            int i = 0; // index for text
            int j = 0; // index for pattern

            while (i < n) {
                if (pattern.charAt(j) == text.charAt(i)) {
                    i++;
                    j++;
                }

                if (j == m) {
                    System.out.println("Found pattern at index " + (i - j));
                    j = lps[j - 1]; // Prepare for finding next occurrence
                }

                // Mismatch after j matches
                else if (i < n && pattern.charAt(j) != text.charAt(i)) {
                    // Do not match lps[0..lps[j-1]] characters,
                    // they will match anyway
                    if (j != 0) {
                        j = lps[j - 1];
                    } else {
                        i++;
                    }
                }
            }
        }


        private static int[] computeLPSArray(String pattern) {
            int m = pattern.length();
            int[] lps = new int[m];
            int length = 0; // length of the previous longest prefix suffix
            int i = 1;
            lps[0] = 0; // lps[0] is always 0

            // the loop calculates lps[i] for i = 1 to m-1
            while (i < m) {
                if (pattern.charAt(i) == pattern.charAt(length)) {
                    length++;
                    lps[i] = length;
                    i++;
                } else {
                    // This is tricky. Consider the example.
                    // AAACAAAA and i = 7.
                    if (length != 0) {
                        length = lps[length - 1];

                        // Also, note that we do not increment i here
                    } else {
                        lps[i] = 0;
                        i++;
                    }
                }
            }
            return lps;
        }

        public static void main(String[] args) {
            String text = "ABABDABACDABABCABAB";
            String pattern = "ABABCABAB";
            search(text, pattern);
        }
    } 

Complexity

The KMP algorithm has a time complexity of O(n + m), where n is the length of the text and m is the length of the pattern. The space complexity is O(m) for the prefix table.

Brief Overview of Other Advanced String Matching Techniques (Optional Topic)

Beyond KMP, several other advanced string matching algorithms exist, offering varying performance characteristics in different scenarios.

Rabin-Karp Algorithm

The Rabin-Karp algorithm uses hashing to find pattern occurrences. It calculates a hash value for the pattern and then slides a window of the same length across the text, calculating the hash value for each window. If the hash values match, it performs a character-by-character comparison to confirm the match. A good hash function is crucial for efficient performance. The average time complexity is O(n + m), but in the worst-case (many hash collisions), it can be O(nm).

Java Implementation Outline (Rabin-Karp)

 public class RabinKarp {

        private static final int PRIME = 101; // A prime number for hashing

        public static void search(String text, String pattern) {
            int n = text.length();
            int m = pattern.length();
            long patternHash = createHash(pattern, m);
            long textHash = createHash(text, m);

            for (int i = 0; i <= n - m; i++) {
                if (patternHash == textHash && checkEquality(text, i, pattern)) {
                    System.out.println("Pattern found at index " + i);
                }
                if (i < n - m) {
                    textHash = recalculateHash(text, i, i + m, textHash, m);
                }
            }
        }

        private static long createHash(String str, int length) {
            long hash = 0;
            for (int i = 0; i < length; i++) {
                hash += str.charAt(i) * Math.pow(PRIME, length - i - 1);
            }
            return hash;
        }

        private static long recalculateHash(String text, int oldIndex, int newIndex, long oldHash, int patternLength) {
            long newHash = oldHash - text.charAt(oldIndex) * (long) Math.pow(PRIME, patternLength - 1);
            newHash = newHash * PRIME + text.charAt(newIndex);
            return newHash;
        }

        private static boolean checkEquality(String text, int startIndex, String pattern) {
            for (int i = 0; i < pattern.length(); i++) {
                if (text.charAt(startIndex + i) != pattern.charAt(i)) {
                    return false;
                }
            }
            return true;
        }


        public static void main(String[] args) {
            String text = "ABABDABACDABABCABAB";
            String pattern = "ABABCABAB";
            search(text, pattern);
        }
    } 

Suffix Trees and Suffix Arrays

These are more advanced data structures designed for a variety of string-related problems, including efficient substring searching. Building these structures can be computationally expensive (though linear time algorithms exist), but they allow for very fast queries once built. Suffix trees represent all suffixes of a string in a tree structure, while suffix arrays store the sorted indices of all suffixes. They are especially useful for finding multiple occurrences of a pattern or for more complex string manipulation tasks. Implementing these from scratch in a competitive programming environment can be challenging due to their complexity.

For competitive programming, libraries providing efficient suffix tree or suffix array implementations can be very valuable if available.

Bit Manipulation Techniques

In some specific cases, particularly when dealing with alphabet sizes that are powers of 2, bit manipulation can be used to optimize string matching by representing characters or sets of characters as bitmasks. This can lead to significant speed improvements by allowing multiple character comparisons in a single operation.

Choosing the right string matching algorithm depends on the specific problem constraints, the size of the input strings, and the frequency of searches.