Scraping links with just your browser

01 January 2015

I recently came across a page with lots of links to txt files. I wanted to search through those files, but that meant downloading them one by one, which got frustrating fast.


- ENTER THE BROWSER CONSOLE -

plus js friends


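// Run all this code in your browser console (dev tools)
// Grab all links on the current page
// Note: this relies on the page having loaded jQuery; otherwise most
// consoles treat $ as a querySelector shortcut that returns one element
var links = $('a');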

This returns a jQuery-like object: a set of key-value pairs whose values are DOM elements. JavaScript had no built-in counterpart to Ruby's .values when this was written (Object.values only arrived in ES2017), so we can make our own with the following function:

// Collect the values of an object's own enumerable keys into an array
var values = function(obj){
  return Object.keys(obj).map(function(key){
    return obj[key];
  });
};
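One caveat with the helper above: Object.keys returns every own enumerable key, so on a jQuery-style collection it may also pick up non-element properties such as length. As a minimal alternative sketch, assuming the object is array-like (numeric indices plus a length), the following copies only the indexed entries. The name toArray and the stand-in object are hypothetical, used here purely for illustration.

```javascript
// Hypothetical helper: copy only the indexed entries of an array-like object
var toArray = function (arrayLike) {
  return Array.prototype.slice.call(arrayLike);
};

// Stand-in for a jQuery-like collection (not real DOM elements)
var fakeCollection = { 0: 'first', 1: 'second', length: 2, selector: 'a' };

// Only the entries at index 0 and 1 are copied; length and selector are skipped
console.log(toArray(fakeCollection));
```

In the console this could be swapped in as links = toArray(links); in place of the values call.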

Then

// make links an array of DOM anchor elements
links = values(links);

// Regex with criteria we're looking for
var myregex = /txt/;

// Filter using the regex
links = links.filter(function(anchor) {
  return myregex.test(anchor.href);
});

// You can tell your browser to download these files by executing the
// click() method
links[0].click();
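Note that /txt/ matches "txt" anywhere in the URL, not only as a file extension. If that turns out to be too loose, a stricter filter might look like the sketch below. The helper name filterByHref, the pattern, and the example.com URLs are all assumptions made for illustration, not part of the original snippet.

```javascript
// Hypothetical helper: keep only entries whose href matches the pattern
function filterByHref(anchors, pattern) {
  return anchors.filter(function (anchor) {
    return typeof anchor.href === 'string' && pattern.test(anchor.href);
  });
}

// Matches ".txt" at the end of the URL, or right before a query string or fragment
var txtFile = /\.txt(?:[?#]|$)/i;

// Stand-in objects so the sketch also runs outside a browser
var sample = [
  { href: 'http://example.com/notes.txt' },
  { href: 'http://example.com/txt-archive/index.html' },
  { href: 'http://example.com/log.TXT?raw=1' }
];

var matches = filterByHref(sample, txtFile);
console.log(matches.map(function (a) { return a.href; }));
```

In the console, links = filterByHref(links, txtFile); would replace the filter call above.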

But what if there are too many to click one at a time? You could try:

links.forEach(function(element) {
  element.target = "_blank";
  element.click();
});

Set each anchor's target to _blank so that every link opens in its own tab instead of replacing the current page. If your browser blocks that many tabs from opening, or the site is throttling you, simulate pagination:

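var start = 0;
// Stop at the end of the list so the last batch doesn't index past it
var end = Math.min(start + 20, links.length);
for (var i = start; i < end; i++) {
  links[i].target = "_blank";
  links[i].click();
}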

Then run the loop again after increasing start by the batch size: start = 20, then start = 40, and so on.
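The batching steps can also be wrapped in a function so that start does not have to be edited by hand between runs. This is a minimal sketch rather than a definitive implementation: openBatch is a hypothetical name, and the stand-in objects exist only so the example runs outside a browser.

```javascript
// Hypothetical helper: click one batch of anchors in new tabs and
// return the index at which the next batch should start
function openBatch(anchors, start, size) {
  var end = Math.min(start + size, anchors.length);
  for (var i = start; i < end; i++) {
    anchors[i].target = '_blank';
    anchors[i].click();
  }
  return end;
}

// Stand-in anchors that just count how often click() is called
var clicked = 0;
var fakeLinks = [];
for (var n = 0; n < 45; n++) {
  fakeLinks.push({ click: function () { clicked++; } });
}

var next = 0;
next = openBatch(fakeLinks, next, 20); // first batch of 20
next = openBatch(fakeLinks, next, 20); // second batch of 20
next = openBatch(fakeLinks, next, 20); // whatever is left
console.log(next, clicked);
```

In the console, running next = openBatch(links, next, 20); once per batch walks through the whole list, and a call past the end simply does nothing.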

Now on to building a Full-text search engine with Postgres.



