Adding topics String hashing and Rabin-Karp in String processing #32

Merged
merged 20 commits into from
Nov 27, 2020
6 changes: 6 additions & 0 deletions src/SUMMARY.md
@@ -23,3 +23,9 @@
- [Graph](./Graph/Graph.md)
  - [Tree](./Graph/Tree/Tree.md)
    - [Diameter](./Graph/Tree/Diameter/diameter.md)
- [String Processing](./String_Processing/String_Processing.md)
  - [String Hashing](./String_Processing/String_Hashing/String_Hashing.md)
  - [Rabin-Karp Algorithm](./String_Processing/Rabin-Karp_Algorithm/Rabin-Karp.md)



43 changes: 43 additions & 0 deletions src/String_Processing/Rabin-Karp_Algorithm/Rabin-Karp.md
@@ -0,0 +1,43 @@
# Rabin-Karp Algorithm

The Rabin-Karp algorithm is an application of *string hashing* to pattern matching.

Given two strings, a pattern *s* and a text *t*, determine whether the pattern appears in the text and, if it does, list all of its occurrences in O(|s| + |t|) time.

***Algorithm***: First compute the hash of the pattern *s*, then the prefix hashes of the text *t*. The hash of any substring of *t* of length |s| can be derived from two prefix hashes in constant time, so each candidate position is checked with a single O(1) comparison of hash values. Because equal hashes do not guarantee equal strings, a reported match is correct only with high probability.
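
The constant-time comparison can be illustrated in isolation. The sketch below is not part of the original implementation: the helper names `powers`, `prefix_hashes` and `equal_substring_hashes` are hypothetical, and it assumes lowercase input with *p* = 31 and *m* = 10^9 + 9. It compares two substrings of the same text through their prefix hashes, scaling each side by the other's power of *p* so that no modular division is needed.

```cpp
#include <string>
#include <vector>
using namespace std;

const int P = 31;          // assumed base, as in the implementation
const int M = 1000000009;  // assumed modulus, 10^9 + 9

// pw[i] = P^i mod M for i = 0..n.
vector<long long> powers(int n)
{
    vector<long long> pw(n + 1, 1);
    for (int i = 1; i <= n; i++)
        pw[i] = pw[i-1] * P % M;
    return pw;
}

// h[i] = hash of the prefix t[0..i-1]; assumes lowercase letters only.
vector<long long> prefix_hashes(string const& t, vector<long long> const& pw)
{
    vector<long long> h(t.size() + 1, 0);
    for (size_t i = 0; i < t.size(); i++)
        h[i+1] = (h[i] + (t[i] - 'a' + 1) * pw[i]) % M;
    return h;
}

// True if t[i..i+len-1] and t[j..j+len-1] have equal hashes.
bool equal_substring_hashes(vector<long long> const& h,
                            vector<long long> const& pw,
                            int i, int j, int len)
{
    long long hi = (h[i+len] + M - h[i]) % M;  // substring hash times P^i
    long long hj = (h[j+len] + M - h[j]) % M;  // substring hash times P^j
    // Multiply each side by the other's power so both carry P^(i+j).
    return hi * pw[j] % M == hj * pw[i] % M;
}
```

After the O(|t|) precomputation, each call to `equal_substring_hashes` does a fixed amount of arithmetic regardless of `len`.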

## Implementation
```cpp
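#include <algorithm>
#include <string>
#include <vector>
using namespace std;

// Returns the starting indices of all occurrences of pattern s in text t.
// Assumes both strings contain only lowercase letters 'a'..'z'.
vector<int> rabin_karp(string const& s, string const& t)
{
    const int p = 31;
    const int m = 1e9 + 9;
    int S = s.size(), T = t.size();

    // p_pow[i] = p^i mod m; the extra slot keeps p_pow[0] and p_pow[T]
    // valid even when one of the strings is empty.
    vector<long long> p_pow(max(S, T) + 1);
    p_pow[0] = 1;
    for (int i = 1; i < (int)p_pow.size(); i++)
        p_pow[i] = (p_pow[i-1] * p) % m;

    // h[i] = hash of the prefix t[0..i-1].
    vector<long long> h(T + 1, 0);
    for (int i = 0; i < T; i++)
        h[i+1] = (h[i] + (t[i] - 'a' + 1) * p_pow[i]) % m;

    // Hash of the pattern.
    long long h_s = 0;
    for (int i = 0; i < S; i++)
        h_s = (h_s + (s[i] - 'a' + 1) * p_pow[i]) % m;

    vector<int> occurrences;
    for (int i = 0; i + S - 1 < T; i++)
    {
        // Hash of t[i..i+S-1], still multiplied by p^i.
        long long cur_h = (h[i+S] + m - h[i]) % m;
        // Scale the pattern hash by p^i instead of dividing cur_h.
        if (cur_h == h_s * p_pow[i] % m)
            occurrences.push_back(i);
    }
    return occurrences;
}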
```
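
Because two different strings can share a hash, the function above may, with small probability, report a position that is not a real match. One common safeguard is to confirm every hash match with a direct comparison. The variant below is a sketch under stated assumptions, not taken from the original implementation: the name `rabin_karp_verified` is hypothetical, and returning no positions for an empty pattern is a choice made here for simplicity.

```cpp
#include <string>
#include <vector>
using namespace std;

// Hypothetical variant: like rabin_karp, but each hash match is confirmed
// by comparing the actual characters. Assumes lowercase letters only.
vector<int> rabin_karp_verified(string const& s, string const& t)
{
    const int p = 31;
    const int m = 1e9 + 9;
    int S = s.size(), T = t.size();
    vector<int> occurrences;
    if (S == 0 || S > T)  // assumption: an empty pattern yields no matches
        return occurrences;

    vector<long long> p_pow(T);
    p_pow[0] = 1;
    for (int i = 1; i < T; i++)
        p_pow[i] = (p_pow[i-1] * p) % m;

    vector<long long> h(T + 1, 0);
    for (int i = 0; i < T; i++)
        h[i+1] = (h[i] + (t[i] - 'a' + 1) * p_pow[i]) % m;

    long long h_s = 0;
    for (int i = 0; i < S; i++)
        h_s = (h_s + (s[i] - 'a' + 1) * p_pow[i]) % m;

    for (int i = 0; i + S <= T; i++)
    {
        long long cur_h = (h[i+S] + m - h[i]) % m;
        // The hash test filters candidates; compare() rules out collisions.
        if (cur_h == h_s * p_pow[i] % m && t.compare(i, S, s) == 0)
            occurrences.push_back(i);
    }
    return occurrences;
}
```

The trade-off is that each confirmed candidate costs O(|s|) extra, so a text with many occurrences (for example a pattern `aaa` in a long run of `a`) no longer runs in O(|s| + |t|).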
## Practice Problems

- [Good Substrings - Codeforces](https://codeforces.com/problemset/problem/271/D)
- [Pattern Find - SPOJ](https://www.spoj.com/problems/NAJPF/)

## References

- [CP-Algorithms](https://cp-algorithms.com/)
43 changes: 43 additions & 0 deletions src/String_Processing/String_Hashing/String_Hashing.md
@@ -0,0 +1,43 @@
# String Hashing

String hashing lets us compare strings efficiently. The idea is to convert each string to an integer and compare the integers instead of the strings themselves, which is an O(1) operation. The conversion is done by a ***hash function***, and the integer obtained for a string is called the *hash* of the string.
A widely used choice is the *polynomial rolling hash function*:

![](https://hapq.me/content/images/2019/11/Screen-Shot-2019-11-06-at-4.59.06-PM.png)

where *p* and *m* are chosen positive integers. *p* is a prime roughly equal to the number of characters in the input alphabet (the implementation uses *p* = 31 for lowercase English letters), and *m* is a large number.
Here, *m* = 10^9 + 9, which is a large prime.

*Because the alphabet can be large and the strings long, the exact value of the polynomial quickly becomes too large to store in an integer variable. The value is therefore computed modulo m, which keeps every hash within the range of an integer type.*
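
For reference, the function can be written out as it is computed in the implementation below, where each lowercase letter is mapped to 1..26 and *n* is the length of *s*. The worked value for `"abc"` assumes *p* = 31 and *m* = 10^9 + 9:

```latex
\begin{align*}
\operatorname{hash}(s) &= \Bigl(\sum_{i=0}^{n-1} (s_i - \texttt{'a'} + 1)\, p^{i}\Bigr) \bmod m \\
\operatorname{hash}(\texttt{"abc"}) &= (1 \cdot 31^{0} + 2 \cdot 31^{1} + 3 \cdot 31^{2}) \bmod (10^{9} + 9) \\
&= (1 + 62 + 2883) \bmod (10^{9} + 9) = 2946
\end{align*}
```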

## Implementation
```cpp
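#include <string>
using namespace std;

// Polynomial rolling hash of s; assumes only lowercase letters 'a'..'z'.
long long compute_hash(string const& s)
{
    const int p = 31;
    const int m = 1e9 + 9;
    long long hash_value = 0;
    long long p_pow = 1;  // p^i mod m for the current position i
    for (char c : s)
    {
        // Map 'a' to 1 (not 0) so that "a", "aa", "aaa", ... hash differently.
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}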
```
Two strings with equal hashes need not be equal; such a coincidence is called a collision. Collisions cannot be eliminated, but a simple way to make them much less likely is to compute two hashes with different values of *p* and *m* and to treat two strings as equal only when both hashes agree.
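
A minimal sketch of the two-hash idea follows. The helper names `compute_hash_with` and `compute_double_hash` are hypothetical, and the second parameter pair (*p* = 37, *m* = 10^9 + 7) is an assumption chosen for illustration; any second pair of a suitable base and large modulus would do.

```cpp
#include <string>
#include <utility>
using namespace std;

// Polynomial rolling hash with caller-supplied p and m.
// Assumes only lowercase letters 'a'..'z'.
long long compute_hash_with(string const& s, int p, int m)
{
    long long hash_value = 0;
    long long p_pow = 1;
    for (char c : s)
    {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}

// Pair of hashes under two different (p, m) choices.
// The second pair (37, 10^9 + 7) is an illustrative assumption.
pair<long long, long long> compute_double_hash(string const& s)
{
    return { compute_hash_with(s, 31, 1000000009),
             compute_hash_with(s, 37, 1000000007) };
}
```

Two strings are then considered equal only if their pairs compare equal, which `std::pair` supports directly with `==`.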

## Examples Of Uses

- Find all the duplicate strings from a given list of strings
- Find the number of different substrings in a string
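
As an illustration of the second use, the sketch below counts the distinct substrings of a string by collecting, for each length, the set of substring hashes. It is a sketch under the same assumptions as the implementation above (lowercase input, *p* = 31, *m* = 10^9 + 9); the function name `count_unique_substrings` is hypothetical, and because it compares hashes only, a collision could in principle make it undercount.

```cpp
#include <set>
#include <string>
#include <vector>
using namespace std;

// Counts distinct substrings of s by hash; O(n^2 log n) time.
// Assumes only lowercase letters 'a'..'z'.
int count_unique_substrings(string const& s)
{
    const int p = 31;
    const int m = 1e9 + 9;
    int n = s.size();

    vector<long long> p_pow(n + 1);
    p_pow[0] = 1;
    for (int i = 1; i <= n; i++)
        p_pow[i] = (p_pow[i-1] * p) % m;

    // h[i] = hash of the prefix s[0..i-1].
    vector<long long> h(n + 1, 0);
    for (int i = 0; i < n; i++)
        h[i+1] = (h[i] + (s[i] - 'a' + 1) * p_pow[i]) % m;

    int count = 0;
    for (int len = 1; len <= n; len++)
    {
        set<long long> hashes;
        for (int i = 0; i + len <= n; i++)
        {
            // Hash of s[i..i+len-1], still multiplied by p^i.
            long long cur_h = (h[i+len] + m - h[i]) % m;
            // Bring every substring to the common factor p^(n-1) so that
            // equal substrings at different positions give equal values.
            cur_h = (cur_h * p_pow[n-i-1]) % m;
            hashes.insert(cur_h);
        }
        count += hashes.size();
    }
    return count;
}
```

For example, `"abab"` has the distinct substrings `a`, `b`, `ab`, `ba`, `aba`, `bab` and `abab`, so the function returns 7 for it.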

## Practice Problems

- [A Needle in the Haystack - SPOJ](https://www.spoj.com/problems/NHAY/)
- [Password - Codeforces](https://codeforces.com/problemset/problem/126/B)

## References

- [CP-Algorithms](https://cp-algorithms.com/)
9 changes: 9 additions & 0 deletions src/String_Processing/String_Processing.md
@@ -0,0 +1,9 @@
# String Processing

A string is a sequence of symbols or characters. A common task is to search a given string for a given pattern. The straightforward method is to try every starting index and compare the pattern with the string character by character, but this becomes slow as the strings grow longer.
In such cases, hashing algorithms prove very useful.
The topics covered are:

- [String Hashing](./String_Hashing/String_Hashing.md)
- [Rabin Karp Algorithm](./Rabin-Karp_Algorithm/Rabin-Karp.md)