Web Scraping Tools

This cluster covers tools, libraries, and techniques for extracting structured data from HTML pages, including Beautiful Soup, Scrapy, XPath, and CSS selectors, as well as handling JavaScript-rendered pages with tools like PhantomJS or Selenium.

📉 Falling 0.3x Web Development
Comments: 2,672
Years Active: 20
Top Authors: 5
Topic ID: #9906

Activity Over Time

2007: 9
2008: 89
2009: 110
2010: 117
2011: 163
2012: 168
2013: 200
2014: 159
2015: 122
2016: 148
2017: 204
2018: 135
2019: 89
2020: 149
2021: 161
2022: 189
2023: 151
2024: 191
2025: 111
2026: 7

Keywords

JS python.org REGEX robots.txt NLTK HN www.dr toadjaw.com AngleSharp CS html scraping regex soup parse scrape dom python links footer

Sample Comments

gerardnll Mar 28, 2018

Do you really need to crawl the HTML? There are APIs to do that already, right?

rmbyrro Aug 14, 2023

This tool can extract data in a structured format from virtually any website, with any HTML structure. With Beautiful Soup, you'd need to explicitly tell it where each piece of data lives by referencing HTML tags, ids, classes, etc., for each website you'd want to process.
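
The manual approach described above, where the scraper must be told where each field lives, can be sketched roughly as follows. This is a minimal illustration using Python's standard-library `html.parser` rather than Beautiful Soup, so that it runs without third-party packages. The sample markup, the class names, and the `FieldExtractor` and `extract` names are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the tag and class names are assumptions.
SAMPLE_HTML = """
<div class="product">
  <h2 class="title">Blue Widget</h2>
  <span class="price">$9.99</span>
</div>
"""


class FieldExtractor(HTMLParser):
    """Collects the text of elements whose (tag, class) pair is in `targets`."""

    def __init__(self, targets):
        super().__init__()
        self.targets = targets  # maps (tag, class) -> field name
        self.current = None     # field currently being captured, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.current = self.targets.get((tag, css_class))

    def handle_data(self, data):
        # Keep only non-blank text that falls inside a targeted element.
        if self.current and data.strip():
            self.fields[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


def extract(html):
    # This mapping is the per-site knowledge that has to be written by hand.
    targets = {("h2", "title"): "title", ("span", "price"): "price"}
    parser = FieldExtractor(targets)
    parser.feed(html)
    return parser.fields


print(extract(SAMPLE_HTML))
```

The `targets` mapping is the part that would need rewriting for every site with a different layout, which is the cost the comment points at.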

lloydatkinson Nov 1, 2022

I get you - that's given me an idea to do the same, but I'd probably use an HTML scraping library like AngleSharp. Thanks for the explanation!

bandie91 Sep 11, 2021

shameless self promotion: parsel[0] is a python script in front of the identically named python lib, and extracts parts of the HTML by CSS selector. the advantage of it compared to most similar tools is that you can navigate in the DOM tree up and down to find precisely what you want if the HTML is poorly marked up, or the searched parts are not close to each other.

[0] https://github.com/bAndie91/tools/blob/master/usr/bin/parsel
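
The idea of moving up and down the DOM tree can be illustrated with a small sketch. It does not use the parsel script mentioned above or CSS selectors. It relies instead on the limited XPath subset in Python's standard-library `xml.etree.ElementTree`, which supports `..` for stepping to a parent. The markup and the `priced_titles` name are hypothetical, and the input is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with no useful classes on the containers or titles.
SAMPLE = """
<div>
  <div><h2>Blue Widget</h2><p><span class="price">9.99</span></p></div>
  <div><h2>Out of stock item</h2><p>unavailable</p></div>
</div>
""".strip()


def priced_titles(markup):
    root = ET.fromstring(markup)
    # Go down to every price, then two levels up to the enclosing card.
    cards = root.findall(".//span[@class='price']/../..")
    # From each card, go back down to its heading.
    return [card.findtext("h2") for card in cards]


print(priced_titles(SAMPLE))
```

Only the price carries a recognizable class here, so the query anchors on it and walks outward to reach the title, which is the situation the comment describes for poorly marked-up pages.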

Toast_ May 29, 2017

Do you plan on supporting scraping content via CSS selectors/XPath/regex?

michelson01 Apr 16, 2007

this might have been easier using hpricot, rubyful soup, scrapi, scrubyt, or something like that.

turtlebits Feb 10, 2021

Been scraping for a long time. If handling JS isn't a requirement, XPath is 100% the way to go. It's a standard query language, very powerful, and there are great browser extensions for helping you write queries.
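
As a rough illustration of the kind of query the comment has in mind, the sketch below selects links with a single XPath expression. It uses the standard library's `xml.etree.ElementTree`, which implements only a subset of XPath and requires well-formed input. Real-world HTML would usually need a more forgiving parser, which is outside this sketch. The markup, URLs, and the `story_links` name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; ElementTree rejects sloppy HTML.
SAMPLE = """
<html><body>
  <ul id="stories">
    <li class="story"><a href="https://example.com/a">First</a></li>
    <li class="story"><a href="https://example.com/b">Second</a></li>
    <li class="ad"><a href="https://example.com/ad">Sponsored</a></li>
  </ul>
</body></html>
""".strip()


def story_links(markup):
    root = ET.fromstring(markup)
    # Any <li class="story"> at any depth, then its direct <a> children.
    anchors = root.findall(".//li[@class='story']/a")
    return [(a.text, a.get("href")) for a in anchors]


for text, href in story_links(SAMPLE):
    print(text, href)
```

The whole selection rule lives in one query string, so adapting the scraper to a changed page layout usually means editing that string rather than the surrounding code.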

savethefuture Nov 15, 2016

You write a scraper; you could use Python. Beautiful Soup might be useful.

adinagoerres Feb 27, 2025

huh interesting. we're exploring extraction from html

dataslap Nov 14, 2017

scrapy has a pretty decent parser too