Monday, May 9, 2011

Use Adjacent Sibling Selectors for Web Scraping

In my last scraping task I used a CSS selector that is not very common. It is called the adjacent sibling selector.

The information I needed was the hotel links on the hotel.hk page.

The link had no id or class, and there were tons of links in the same row. I did not want to scrape the whole page and then pick out that single piece of HTML. The only specific thing I knew was that the link is always the element immediately after the span with the CSS class "title".

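The selector looks like this. It matches an a element only when it is the very next sibling element after a span with class "title" under the same parent: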

span.title + a

My program extracts the URL from the href attribute roughly like this:

detail_page_urls = []

# search returns a node set; iterate it with each and append each href
page.search("span.title + a").each do |title_link|
  detail_page_urls << title_link.attribute("href").value
end

I am using Ruby and the Mechanize gem.
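
If the collected href values turn out to be relative paths (an assumption, I have not shown the real values here), they can be turned into absolute URLs with Ruby's standard URI library. The base URL and the hrefs below are hypothetical placeholders.

```ruby
require "uri"

# Hypothetical values: neither the page URL nor the hrefs are the real ones.
base_url = "http://www.example.com/hotels/list.html"
hrefs = ["/hotel/1", "detail.html?id=2", "http://www.example.com/hotel/3"]

# URI.join resolves each href against the page it was found on;
# hrefs that are already absolute come back unchanged.
absolute_urls = hrefs.map { |href| URI.join(base_url, href).to_s }
puts absolute_urls
```

With Mechanize, the base would normally be the URL of the page that was fetched.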
