In this post I’ll detail some notes I accumulated while designing a browser-based plagiarism detector.
I’ve been working through MIT’s Introduction to Algorithms for the past couple of weeks. I have also been interviewing for a new job recently. Throughout my interviews I got really interested in Hash tables as object’s in javascript exhibit behaviours similar to Hash tables. Mainly, Search, Inserts and Deletes average O(1) complexity.
So if you want to quickly search for items in a dataset. A hacky way to achieve
this in javascript would be to put them all into a vanilla object {}
and
access the fields via key.
Interestingly V8 (used in node.js and the Chrome browser) doesn’t use such a data-structure. Instead opting for a strategy called Hidden Classes. But it still exhibits the same performance characteristics.
The MIT course goes into Hashing algorithms as they are used to decide where in the Hash table to put the information. We studied rabin fingerprinting which uses a mathematical tool called a rolling-hash. Essentially hashing is a computationally expensive operation. But once you perform it once, with the rabin fingerprint, you can compute the hash for a string that differs by one character in O(1) time. A more in depth paper is available.
I got really excited about this algorithm and decided to implement it in javascript as an npm module, karp-rabin-search. When I showed the code to a colleague he quickly mentioned how the problem has been expressed in past as C’s strstr as well as PHP’s.
Javascript doesn’t ship with a strstr
method. But we do have indexOf.
A very cute implementation of strstr could be made
In any case, surely the rabin-karp algo is better than iteratively calling
indexOf
on a string?! wrong. Unless I did something gross in the implementation,
iteratively calling indexOf
like so is faster.
On my machine the above code can run on /usr/share/dict/words
in 10ms
whereas my rabin-karp algo takes 238ms. File is test.js
So why is V8 so fast? Under the hood V8 implements indexOf
using the
Boyer-Moore algorithm
along with some heuristics for different input sizes.
I’d like to mention JSInspect which is a very cool tool in the same spirit.
If you need help solving your business problems with software read how to hire me.