Thursday, May 26, 2011

Codeaholics HK Lightning Talk: Web Scraping with Ruby

High Level Synopsis

Interesting Data on the internet

Some of my scraping work took information from Hong Kong government websites and presented it in a different way. One example is greenhk: http://greenhk.heroku.com

There is a lot of interesting data on government websites that you can try too, such as Census and Statistics and Environmental Statistics.

 

Did a live demo of scraping restaurant names from openrice:

  1. Use Firebug to locate the information I want to scrape
  2. Open up Ruby console:
irb
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.openrice.com/restaurant/sr1.htm?s=1&district_id=&inputcategory=cname&inputstrrest=")
page.search(".resttitle").count
name_link = page.search(".resttitle").search("a").first
name_link.attribute("href").value

 

Sample program

Then I showed the source code of my resumable web scrapers on GitHub.

 

One Issue for discussion

I raised one question that people always ask me: "Is it illegal?" My answer was: "It is a grey area; it depends on how you use the data."

Because I had presented on penetration testing before, and had mentioned that I play some hacking games too, the coders there seemed to read my hacker/coder image as cracker ... sigh ...

I only do performance testing or penetration testing with authorization. I said, "It may be illegal if you use it for commercial activities."

But after the talk, I think a better answer would be ... the data is published for public use. You can certainly take contact info like that on openrice and call the restaurant. But I think you would be crossing the line if you used that data as if you owned it, or used it for evil things.

Surely this statement is not precise enough, but it is the rough rule I use to decide whether something is bad or not.

Wednesday, May 25, 2011

"Who The Hell Made This Change !"

I have very good teammates now, so I seldom say this lately. But (especially at my previous companies), you know, anyone can introduce bugs. Sometimes you are not tracing the source of a bug but the reason behind a change. So you need a way to find out who (the hell) changed that particular line of code. That's where git blame can help.

Here is what you can see with git blame:

git blame menu.css

The name of the author who last touched each line is shown next to the code. Line numbers are also displayed to help you troubleshoot.

Web Scraper in Ruby

Just wrote a simple web scraper in Ruby to share.

It uses MongoDB to store data, Mechanize to crawl and parse, and Parallel for multi-threading, then exports the results to a CSV file.
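
To give a feel for the shape of such a scraper, here is a minimal sketch that runs without any gems. It is not the code from the repository: a plain Hash stands in for the MongoDB collection, a lambda (FAKE_FETCH) stands in for Mechanize, and standard-library threads stand in for Parallel. All method names and URLs are made up for illustration.

```ruby
require "csv"

# Hypothetical stand-in for Mechanize: maps a URL to a scraped record.
# A real scraper would call agent.get(url) and page.search(...) here.
FAKE_FETCH = lambda do |url|
  { "url" => url, "name" => "Restaurant #{url[/\d+\z/]}" }
end

# Crawl the URLs with a small pool of threads, skipping any URL already
# in the store (a Hash here, standing in for a MongoDB collection).
# Skipping stored URLs is what lets an interrupted run be resumed.
def scrape_all(urls, store, fetch: FAKE_FETCH, workers: 4)
  queue = Queue.new
  urls.each { |url| queue << url unless store.key?(url) }
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        record = fetch.call(url)
        lock.synchronize { store[url] = record }
      end
    end
  end
  threads.each(&:join)
  store
end

# Export the stored records as CSV text, sorted by URL for stable output.
def to_csv(store)
  CSV.generate do |csv|
    csv << %w[url name]
    store.keys.sort.each { |url| csv << store[url].values_at("url", "name") }
  end
end

# One record is already stored, so only the other two URLs are fetched.
store = {
  "http://example.com/r/1" => { "url" => "http://example.com/r/1", "name" => "Already scraped" }
}
urls = (1..3).map { |i| "http://example.com/r/#{i}" }
scrape_all(urls, store)
puts to_csv(store)
```

Running the same script again with a persistent store would fetch nothing new, which is the "resumable" part.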

You can read the source code here: https://github.com/3dd13/web_scrapers_ruby

Coder Presentation Tricks

My favourite trick during a presentation is ...

  1. When we touch on data size or scalability
  2. Fire up the Ruby console
  3. irb
  4. Work out a power of two with a bit shift
  5. 2 << 10
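
For reference, here is what the shift evaluates to in irb. Note that `2 << 10` gives 2048 (2 to the 11th); if you want exactly 1024, use `1 << 10` or `2**10`. The binary string conversions at the end are an extra illustration, not part of the original trick.

```ruby
# Left shift multiplies by a power of two: x << n == x * 2**n
kib = 1 << 10      # 1024
shifted = 2 << 10  # 2048, i.e. 2**11
power = 2**10      # 1024

# Converting between binary strings and integers
from_binary = "10000000000".to_i(2) # 1024
to_binary = 1024.to_s(2)            # "10000000000"

puts kib, shifted, power, from_binary, to_binary
```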

Invert Black/White in Mac

I always use my MacBook to play, surf and code, and I prefer a light background with dark text.

Once I went to a presentation held in a small room where people preferred the lights off. My content was so bright that it hurt the eyes. Even worse, the black text was unreadable. At moments like this, a quick and nice solution is to invert the screen colours (white to black and black to white).

Of course, you can go to the projector settings or System Preferences to do that, but that is too lame. So, here is the shortcut key to invert the colours:

[Ctrl] + [Alt/Option] + [Cmd] + [8]

Being cool and familiar with your friends (techie tools) is extremely important for holding your audience's attention. Prepare some useful shortcuts or coding tricks to stun them ~

Tuesday, May 24, 2011

Gov Funding Available in Hong Kong

I started with this list (a few years old already): http://www.itc.gov.hk/en/doc/download/BK-Benefit_from_TF_Scheme_200704.pdf

Then I targeted two of them that suit my tech startup: SERAP and the Science Park Incubation Programme.

SERAP

SERAP Guide

http://www.itf.gov.hk/l-eng/Forms/itf-serap-guide_2010_07.pdf

ITF Guide

http://www.itf.gov.hk/l-tc/ITSP.asp

SERAP and ITF Application Form

http://www.itf.gov.hk/l-eng/forms.asp

Science Park Incubation Programme

Overview and Funding area

http://www.hkstp.org/HKSTPC/download/annualReport/HKSTP_Bronchure201101Jan20110317165143.pdf

Criteria

http://www.hkstp.org/HKSTPC/en_html/en_full7_2.jsp

Application Form

http://www.hkstp.org/HKSTPC/download/applicationForm/BDTS_TBI_IA_WF_APP_001_Rev113_Application%20Form20110428170309.pdf

Comments

The SME fund looks good too. It funds standalone projects and overseas development/learning events. But it focuses more on helping established businesses expand, so it is not exactly what benefits me at the moment. You can read the details here.

The Science Park programme offers office space and some basic facilities, and covers marketing and promotion costs. However, the funding does not include salaries, books or travelling expenses. So I will also apply for SERAP to support my R&D expenses. Later, when my business is more mature and I have more standalone projects, I will go for the SME fund.

Wednesday, May 18, 2011

Codeaholics Meetup | 25May2011 | Boot HK


code@holics:~/meetups/2011/may$ date

Wed May 25 19:00:00 HKT 2011

code@holics:~/meetups/2011/may$ whereis venue

Boot HK,
19/F Hang Wai Commercial Building,
231-233 Queen's Road East,
Wanchai, Hong Kong

code@holics:~/meetups/2011/may$ _

Tuesday, May 10, 2011

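Recommend good comics to other people

I have just added some simple features and improved some of the CSS. The purpose of the site is simple: to let everyone share the comics they like.

If you Google "好睇 漫畫" ("good comics"), you will see that many people struggle to decide which comic to read next. There are too many comics, and good ones are not easy to come across. Take a look at the Google Trends results for different keywords as evidence:

"介紹" ("introduce") and "漫畫" ("comics")

"好" ("good") and "漫畫" ("comics")

"推介" ("recommend") and "漫畫" ("comics")

Start using Comics Crowd to find good comics too.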

Running complicated Mongo queries without map/reduce

This kind of query is actually covered in the Mongo documentation, but I find it useful enough to share. Many of my friends are afraid of using MongoDB. They know that MongoDB is faster, but they have no idea how to administer it or how to extract the information they want.

Map/Reduce and JavaScript functions can do almost everything a SQL database can do. But when I say "you can do this in Map/Reduce", no one looks comfortable. No one thinks writing map/reduce is programmer-friendly. When I see their faces, I feel like I am cheating them by claiming that MongoDB is easy to use. To be frank ... very often, a SQL query with multiple joins, GROUP BY and HAVING looks simple by comparison.

So ... here is one query that I think most programmers will find readable and useful for querying things in MongoDB.

db.myCollection.find( { is_active: true, $where: "this.credits - this.debits < 0" } );

It is one of the server-side queries you can run. You can pass a JavaScript expression or function into the $where part to do something complicated. In this case, it finds all active records and compares two fields (credits minus debits is negative).

More examples are on the MongoDB documentation page.

Yes, doing things in NoSQL differs in many ways from a relational SQL DB. However, given what it can do, I recommend that every programmer try one of them, and I would suggest MongoDB as the first test bed. It is quite mature, has a large user community, and there are many articles on how to move over from the SQL world.
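
To make the predicate concrete, here is a small sketch in plain Ruby that applies the same condition to an in-memory array. The sample documents are invented for illustration, and no MongoDB driver is involved. QUERY shows how the same filter could be written as a Ruby hash; passing it to a driver's find call is an assumption on my part and is not shown.

```ruby
# Hypothetical sample documents; field names follow the query above.
ACCOUNTS = [
  { name: "a", is_active: true,  credits: 10, debits: 25 },
  { name: "b", is_active: true,  credits: 50, debits: 20 },
  { name: "c", is_active: false, credits: 0,  debits: 5 }
].freeze

# Same predicate as the query: active AND credits - debits < 0.
def overdrawn_active(docs)
  docs.select { |doc| doc[:is_active] && doc[:credits] - doc[:debits] < 0 }
end

# The filter document written as a Ruby hash. "$where" has to be a string
# key, and its value is JavaScript that the server evaluates per document.
QUERY = { "is_active" => true, "$where" => "this.credits - this.debits < 0" }.freeze

puts overdrawn_active(ACCOUNTS).map { |doc| doc[:name] }.inspect # prints ["a"]
```

Because $where runs JavaScript against each document, it cannot use indexes, so it helps to keep an ordinary condition such as is_active alongside it to narrow the set first.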

Monday, May 9, 2011

Use adjacent sibling selectors for Web Scraping

In my last scraping task I used a CSS selector that is less commonly used. It is called the adjacent sibling selector.

The information I needed was the hotel links on a page of hotel.hk.

There was no id or class on that link, there were tons of links inside the same row, and I did not want to scrape the whole page and then pick out that single piece of HTML. The only specific thing I knew was that the link is always the element immediately after the span with CSS class "title".

The usage of the selector is like this:

span.title + a

My program is something like this to extract the url in the href attribute:

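detail_page_urls = []

# search returns a node set; iterate over it with each
page.search("span.title + a").each do |title_link|
  # << appends one URL string (+= would expect an array)
  detail_page_urls << title_link.attribute("href").value
end

I am using Ruby and the Mechanize gem.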

Wednesday, May 4, 2011

Domain Name Generator using dictionary service

My team is focusing on prototyping our ideas (we have a really long to-prototype list). So we are always bothered by project naming and website naming. Random domain name generators do not really work for us. So, suddenly, I wondered whether I could search an online dictionary dynamically.

The basic requirement is finding words that end with a certain top-level domain:


Word.all.select{|word| word.end_with?(target_domain)}

That's how I found http://www.onelook.com

It supports queries like this:


*me:crazy
It should return English words that end with 'me' and have a meaning related to 'crazy'.
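
As a rough sketch of the basic requirement, the snippet below filters a word list by ending and turns each match into a domain by putting a dot before the TLD. The word list and the domain_hacks name are hypothetical; in practice the words would come from a dictionary file or from a service such as OneLook.

```ruby
# Hypothetical word list for illustration.
WORDS = %w[awesome handsome welcome scraper become].freeze

# Return [word, domain] pairs for words that end with the TLD and have
# at least one letter left over for the name part.
def domain_hacks(words, tld)
  words
    .select { |word| word.end_with?(tld) && word.length > tld.length }
    .map { |word| [word, "#{word.delete_suffix(tld)}.#{tld}"] }
end

domain_hacks(WORDS, "me").each { |word, domain| puts "#{word} -> #{domain}" }
```

With the sample list this prints four pairs, for example awesome -> aweso.me, and skips "scraper" because it does not end with "me".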