Substring Search in Javascript

18 August 2015

In this post I’ll detail some notes I accumulated while designing a browser-based plagiarism detector.

I’ve been working through MIT’s Introduction to Algorithms for the past couple of weeks. I have also been interviewing for a new job recently. Throughout my interviews I got really interested in Hash tables as object’s in javascript exhibit behaviours similar to Hash tables. Mainly, Search, Inserts and Deletes average O(1) complexity.

So if you want to quickly search for items in a dataset. A hacky way to achieve this in javascript would be to put them all into a vanilla object {} and access the fields via key.

Interestingly V8 (used in node.js and the Chrome browser) doesn’t use such a data-structure. Instead opting for a strategy called Hidden Classes. But it still exhibits the same performance characteristics.

The MIT course goes into Hashing algorithms as they are used to decide where in the Hash table to put the information. We studied rabin fingerprinting which uses a mathematical tool called a rolling-hash. Essentially hashing is a computationally expensive operation. But once you perform it once, with the rabin fingerprint, you can compute the hash for a string that differs by one character in O(1) time. A more in depth paper is available.

I got really excited about this algorithm and decided to implement it in javascript as an npm module, karp-rabin-search. When I showed the code to a colleague he quickly mentioned how the problem has been expressed in past as C’s strstr as well as PHP’s.

Javascript doesn’t ship with a strstr method. But we do have indexOf. A very cute implementation of strstr could be made

// returns portion of string starting with needle
function strstr(haystack, needle) {
  var indexOfResult = haystack.indexOf(needle)
  if( indexOfResult !== -1) {
    return haystack.slice(indexOfResult)
  } else {
    return false
  }
}

In any case, surely the rabin-karp algo is better than iteratively calling indexOf on a string?! wrong. Unless I did something gross in the implementation, iteratively calling indexOf like so is faster.

// returns indexes in T where P occur
function iterativeIndexOf(T, P, i) {
  if(!i) { i = 0 }
  var results = []

  var matchIndex = T.indexOf(P)
  while(matchIndex !== -1) {
    results.push(matchIndex)
    matchIndex = T.indexOf(P, matchIndex + 1)
  }
  return results
}

On my machine the above code can run on /usr/share/dict/words in 10ms whereas my rabin-karp algo takes 238ms. File is test.js So why is V8 so fast? Under the hood V8 implements indexOf using the Boyer-Moore algorithm along with some heuristics for different input sizes.

I’d like to mention JSInspect which is a very cool tool in the same spirit.

If you need help solving your business problems with software read how to hire me.



comments powered by Disqus