Web Scraping Tools
This cluster focuses on tools, libraries, and techniques for extracting structured data from HTML pages, including Beautiful Soup, Scrapy, XPath, CSS selectors, and handling JavaScript-rendered content with tools like PhantomJS or Selenium.
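For the JavaScript-rendered case, the usual approach is to drive a headless browser and parse the page afterwards. A minimal Selenium sketch, where the URL and the CSS selector are placeholder assumptions, not taken from the comments:

    # Render a JavaScript-heavy page in headless Chrome before extracting.
    # The URL and selector below are invented placeholders.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/js-rendered-page")
        # These elements exist only after the page's scripts have run.
        for el in driver.find_elements(By.CSS_SELECTOR, "h2.title"):
            print(el.text)
    finally:
        driver.quit()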
(Activity Over Time, Top Contributors, and Keywords charts omitted)
Sample Comments
Do you really need to crawl the HTML? There are APIs for that already, right?
This tool can extract data in a structured format from virtually any website, with any HTML structure. With Beautiful Soup, you'd need to explicitly specify where each piece of data lives by referencing HTML tags, ids, classes, etc., for each website you'd want to process.
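For contrast, a minimal Beautiful Soup sketch of that per-site wiring; every tag name and class below is hypothetical and would have to be rewritten for each target site:

    # Beautiful Soup requires naming the tags/classes of one specific
    # site's markup; the HTML and selectors here are invented examples.
    from bs4 import BeautifulSoup

    html = """
    <div class="product">
      <h2 class="name">Widget</h2>
      <span class="price">$9.99</span>
    </div>
    """
    soup = BeautifulSoup(html, "html.parser")
    for product in soup.find_all("div", class_="product"):
        name = product.find("h2", class_="name").get_text(strip=True)
        price = product.find("span", class_="price").get_text(strip=True)
        print(name, price)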
I get you - that's given me an idea to do the same, but I'd probably use an HTML scraping library like AngleSharp. Thanks for the explanation!
Shameless self-promotion: parsel [0] is a Python script in front of the identically named Python lib, and extracts parts of the HTML by CSS selector. Its advantage over most similar tools is that you can navigate up and down the DOM tree to find precisely what you want when the HTML is poorly marked up, or when the parts you're searching for aren't close to each other.

[0] https://github.com/bAndie91/tools/blob/master/usr/bin/parsel
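The underlying parsel library can also be used directly. A sketch of the navigation the comment describes, with invented markup: pick nodes by CSS selector, then step through the tree with XPath from each hit:

    # parsel: select cells by CSS, then walk the DOM with XPath from
    # each match. The table markup here is invented for illustration.
    from parsel import Selector

    sel = Selector(text="""
    <table>
      <tr><td class="key">author</td><td>bAndie91</td></tr>
      <tr><td class="key">license</td><td>GPL</td></tr>
    </table>
    """)
    for key_cell in sel.css("td.key"):
        key = key_cell.xpath("text()").get()
        # Navigate from the matched cell to its sibling in the same row.
        value = key_cell.xpath("following-sibling::td/text()").get()
        print(key, value)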
Do you plan on supporting scraping content via CSS selectors/XPath/regex?
This might have been easier using Hpricot, Rubyful Soup, scrAPI, scRUBYt, or something like that.
Been scraping for a long time. If handling JS isn't a requirement, XPath is 100% the way to go. It's a standard query language, very powerful, and there are great browser extensions to help you write queries.
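A small sketch of that kind of declarative query with lxml; the markup and the expression are invented for illustration:

    # One XPath expression does both the filtering and the extraction:
    # link text of every list item whose score exceeds 5.
    from lxml import html

    doc = html.fromstring("""
    <ul>
      <li><a href="/a">First</a> <span class="score">10</span></li>
      <li><a href="/b">Second</a> <span class="score">3</span></li>
    </ul>
    """)
    titles = doc.xpath('//li[number(span[@class="score"]) > 5]/a/text()')
    print(titles)  # ['First']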
If you write a scraper, you could use Python. Beautiful Soup might be useful.
Huh, interesting. We're exploring extraction from HTML.
Scrapy has a pretty decent parser too.
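Scrapy exposes that parser (parsel under the hood) on its response objects as .css() and .xpath(). A minimal spider sketch against Scrapy's public demo site:

    # Minimal Scrapy spider; quotes.toscrape.com is Scrapy's own demo
    # site, and the selectors below match its markup.
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com"]

        def parse(self, response):
            # response.css() returns parsel selectors, as in the
            # parsel sketch above.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

Saved to a file (the filename is arbitrary), it runs standalone with: scrapy runspider quotes_spider.py -o quotes.json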