In my last scraping task I used a CSS selector that is less commonly seen: the adjacent sibling selector.
The information I needed was the hotel links inside the page hotel.hk.
The link had no id or class, and there were tons of other links in the same row. I did not want to scrape the whole page and then pick out that single piece of HTML. The only specific information I had was that the link is always the element immediately after the span with the CSS class "title".
The usage of the selector is like this:
span.title + a
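My program looks something like this; it extracts the URL from the href attribute of each matching link:
detail_page_urls = []
# search returns a node set, so iterate over it with each
page.search("span.title + a").each do |title_link|
  # << appends a single URL string; += would expect another array
  detail_page_urls << title_link.attribute("href").value
end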
I am using Ruby and the Mechanize gem.