High Level Synopsis
Interesting Data on the internet
Some of my scraping work took information from Hong Kong government websites and presented it in a different way. One example is greenhk: http://greenhk.heroku.com
There is a lot of interesting data on government websites that you can try too, such as Census and Statistics and Environmental Statistics.
I did a live demo of scraping restaurant names from openrice:
- Use Firebug to locate the information I want to scrape
- Open up a Ruby console:
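irb
require 'rubygems'
require 'mechanize'
# Start a Mechanize agent and fetch the search results page
agent = Mechanize.new
page = agent.get("http://www.openrice.com/restaurant/sr1.htm?s=1&district_id=&inputcategory=cname&inputstrrest=")
# Count the elements carrying the "resttitle" class
page.search(".resttitle").count
# Take the first link inside those elements and read its href
name_link = page.search(".resttitle").search("a").first
name_link.attribute("href").value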
Sample program
Then I showed the source code of my resumable web scrapers on GitHub.
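For readers who have not seen the repository, here is one minimal way a scraper could be made resumable. This is only a sketch under my own assumptions, not the code shown in the talk: the class name `ResumableScraper`, the JSON checkpoint file, and the example URLs are all hypothetical. The idea is to record each finished URL on disk so that a rerun skips what is already done.

```ruby
require 'json'
require 'set'
require 'tmpdir'

# Hypothetical helper (not the scraper from the talk): remembers which
# URLs are finished so an interrupted run can pick up where it stopped.
class ResumableScraper
  def initialize(checkpoint_path)
    @checkpoint_path = checkpoint_path
    @done = Set.new
    @done.merge(JSON.parse(File.read(checkpoint_path))) if File.exist?(checkpoint_path)
  end

  # Yields each URL that has not been scraped yet, then records it as done.
  def each_pending(urls)
    urls.each do |url|
      next if @done.include?(url)
      yield url
      @done << url
      save
    end
  end

  private

  # Write the finished URLs to the checkpoint file after every success.
  def save
    File.write(@checkpoint_path, JSON.generate(@done.to_a))
  end
end

# Demo with a stand-in for the real fetch, so no network access is needed.
checkpoint = File.join(Dir.mktmpdir, "done.json")
urls = %w[http://example.com/1 http://example.com/2 http://example.com/3]

first_run = []
begin
  ResumableScraper.new(checkpoint).each_pending(urls) do |url|
    raise "simulated crash" if url.end_with?("/3")
    first_run << url
  end
rescue RuntimeError
  # The run stops here; the first two URLs are already checkpointed.
end

# A fresh scraper reads the checkpoint and only sees the unfinished URL.
second_run = []
ResumableScraper.new(checkpoint).each_pending(urls) { |url| second_run << url }
```

In a real scraper the block passed to `each_pending` would fetch and parse the page. Saving after every URL is slow but simple; batching the writes would be a reasonable trade-off for large crawls.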
One Issue for discussion
I raised one question that people always ask me: "Is it illegal?" My answer was: "It is a grey area; it depends on how you use the data."
Because I had presented on penetration testing before, and had mentioned that I play hacking games too, the coders there seemed to interpret my hacker/coder image as that of a cracker ... sigh ...
I only do performance testing or penetration testing with authorization. I said, "It may be illegal if you use it for commercial activities."
But after the sharing, I think a better answer would be: the data is published for public use. You can certainly take contact information like that on openrice and call the restaurant. But I think you would be crossing the line if you treated the data as if you owned it and used it for evil things.
Surely this statement is not precise enough, but it is the rough test I use to decide whether something is bad or not.