The other day I was sitting around thinking about scraping a website (as one does) and I was frustrated by the complicated regex I was going to have to write to pick the data I wanted out of HTML. (Your classic two-problems problem.) So I started daydreaming about being able to scrape sites using jQuery selectors to get the elements I needed.
Then I thought, how about using a pure JavaScript environment like Node.js? I like servers and I like JavaScript, so it seems like I should like Node.js just fine. With a bit of searching I came across a great tutorial, Scraping the Web with Node.js, and I was off and scraping. The cheerio module had already solved my jQuery selection problem.
At this point I could easily grab the data, but I was frustrated thinking about wrangling it into a consumable format. I was starting to get in the Node.js mindset and thought, "I bet there's a module that can help with this." A few minutes of searching later, node-rss was installed.
It took a bit of wrangling to get my first RSS feed going, but I was surprised at how quickly it all came together. It's not so different from firing up cpan and including modules in Perl scripts. But npm is just friendlier. Being able to run npm install in a new environment takes some of the tedium out of your day.
So yeah, the Node.js script I wrote goes out and fetches Likes (favs, stars, hearts, etc.) from Medium, Twitter, GitHub, and Hacker News and returns RSS feeds with those likes. This lets me bring all of my favorites from those places into one spot where I can then do other things with them, hooray! I put the code up on GitHub: Recommended RSS.
Then with the help of forever I set it up as a service and I'm aggregating away. My to-do list for this little app includes caching and handling dates. I wonder if there are modules for those.
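For reference, the forever commands involved are about this simple (app.js is a stand-in name for the feed script, not the actual filename in my repo):

```shell
# Install the forever process manager globally
npm install -g forever

# Start the script as a daemon; forever restarts it if it crashes
forever start app.js

# See what's running, and tail the logs
forever list
forever logs app.js

# Stop it when you're done
forever stop app.js
```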
Update: See also: Chris Coyier on Social RSS.