Thursday, May 26, 2011

Codeaholics HK Lightning Talk: Web Scraping with Ruby

High Level Synopsis

Interesting Data on the internet

Some of my scraping work involved taking information from Hong Kong government websites and presenting it in a different way. One example is greenhk: http://greenhk.heroku.com

There is a lot of interesting data on government websites that you can try as well, such as Census and Statistics and Environmental Statistics.

 

I did a live demo of scraping restaurant names from openrice:

  1. Use Firebug to locate the information I want to scrape
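  2. Open up a Ruby console:
irb
require 'rubygems'
require 'mechanize'
# Create an agent and fetch the restaurant listing page
agent = Mechanize.new
page = agent.get("http://www.openrice.com/restaurant/sr1.htm?s=1&district_id=&inputcategory=cname&inputstrrest=")
# Count the elements that carry the restaurant titles
page.search(".resttitle").count
# Take the first link inside those elements and read its URL
name_link = page.search(".resttitle").search("a").first
name_link.attribute("href").value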

 

Sample program

Then I showed the source code of my resumable web scrapers on GitHub.
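
For readers who have not seen the idea before, here is a minimal sketch of what "resumable" can mean. It is not the code from my GitHub repositories. The `Checkpoint` class, the `scrape_resumably` method and the JSON file layout are all assumptions made up for this illustration. The scraper records each finished URL in a file, so a restarted run skips what is already done.

```ruby
require 'json'

# Hypothetical helper: remembers which URLs are finished so that a
# restarted run can skip them. The file format is an assumption.
class Checkpoint
  def initialize(path)
    @path = path
    @done = File.exist?(path) ? JSON.parse(File.read(path)) : []
  end

  def done?(url)
    @done.include?(url)
  end

  # Records the URL and saves the whole list to disk straight away.
  def mark_done(url)
    @done << url
    File.write(@path, JSON.generate(@done))
  end
end

# Yields every URL that is not yet recorded to the block, which does the
# actual fetching and parsing. Progress is saved after each success, so a
# crash loses at most the page that was being processed.
def scrape_resumably(urls, checkpoint)
  urls.each do |url|
    next if checkpoint.done?(url)
    yield url
    checkpoint.mark_done(url)
  end
end
```

Rewriting the whole file after every page is slow for very long runs, but it keeps the sketch short and the saved state easy to inspect.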

 

One Issue for discussion

I raised a question that people always ask me: "Is it illegal?" My answer was: "It is a grey area; it depends on how you use the data."

Because I had presented on penetration testing before, and had mentioned that I play some hacking games too, the coders there seemed to interpret my hacker/coder image as that of a cracker ... sigh ...

I only do performance testing or penetration testing with authorization. I said, "It may be illegal if you use it for commercial activities."

But after the talk, I think a better answer would be: the data is published for public use. You can certainly take contact info like that on openrice and call the restaurant. But I think you would be crossing the line if you treated the data as if you owned it and used it for evil things.

Surely this statement is not precise enough, but it is the rough rule I use to decide whether something is bad or not.
