Web Scraping Tools
This cluster focuses on tools, libraries, and techniques for extracting structured data from HTML pages, including Beautiful Soup, Scrapy, XPath, CSS selectors, and handling JavaScript-rendered content with tools like PhantomJS or Selenium.
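For the JavaScript-rendered case, the usual approach is to drive a headless browser and parse the page afterwards. A minimal Selenium sketch, where the URL and the CSS selector are placeholder assumptions, not taken from the comments:

    # Render a JavaScript-heavy page in headless Chrome before extracting.
    # The URL and selector below are invented placeholders.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/js-rendered-page")
        # These elements exist only after the page's scripts have run.
        for el in driver.find_elements(By.CSS_SELECTOR, "h2.title"):
            print(el.text)
    finally:
        driver.quit()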
(Activity Over Time, Top Contributors, and Keywords charts omitted)
Sample Comments
Do you really need to crawl the HTML? There are APIs for that already, right?
This tool can extract data in a structured format from virtually any website, with any HTML structure. With Beautiful Soup, you'd need to explicitly specify where each piece of data lives by referencing HTML tags, ids, classes, etc., for each website you'd want to process.
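For contrast, a minimal Beautiful Soup sketch of that per-site wiring; every tag name and class below is hypothetical and would have to be rewritten for each target site:

    # Beautiful Soup requires naming the tags/classes of one specific
    # site's markup; the HTML and selectors here are invented examples.
    from bs4 import BeautifulSoup

    html = """
    <div class="product">
      <h2 class="name">Widget</h2>
      <span class="price">$9.99</span>
    </div>
    """
    soup = BeautifulSoup(html, "html.parser")
    for product in soup.find_all("div", class_="product"):
        name = product.find("h2", class_="name").get_text(strip=True)
        price = product.find("span", class_="price").get_text(strip=True)
        print(name, price)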
I get you - that's given me an idea to do the same, but I'd probably use an HTML scraping library like AngleSharp. Thanks for the explanation!
Shameless self-promotion: parsel [0] is a Python script in front of the identically named Python lib, and extracts parts of the HTML by CSS selector. Its advantage over most similar tools is that you can navigate up and down the DOM tree to find precisely what you want when the HTML is poorly marked up, or when the parts you're searching for aren't close to each other.

[0] https://github.com/bAndie91/tools/blob/master/usr/bin/parsel
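The underlying parsel library can also be used directly. A sketch of the navigation the comment describes, with invented markup: pick nodes by CSS selector, then step through the tree with XPath from each hit:

    # parsel: select cells by CSS, then walk the DOM with XPath from
    # each match. The table markup here is invented for illustration.
    from parsel import Selector

    sel = Selector(text="""
    <table>
      <tr><td class="key">author</td><td>bAndie91</td></tr>
      <tr><td class="key">license</td><td>GPL</td></tr>
    </table>
    """)
    for key_cell in sel.css("td.key"):
        key = key_cell.xpath("text()").get()
        # Navigate from the matched cell to its sibling in the same row.
        value = key_cell.xpath("following-sibling::td/text()").get()
        print(key, value)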
Do you plan on supporting scraping content via CSS selectors/XPath/regex?
This might have been easier using Hpricot, Rubyful Soup, scrAPI, scRUBYt, or something like that.
Been scraping for a long time. If handling JS isn't a requirement, XPath is 100% the way to go. It's a standard query language, very powerful, and there are great browser extensions to help you write queries.
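A small sketch of that kind of declarative query with lxml; the markup and the expression are invented for illustration:

    # One XPath expression does both the filtering and the extraction:
    # link text of every list item whose score exceeds 5.
    from lxml import html

    doc = html.fromstring("""
    <ul>
      <li><a href="/a">First</a> <span class="score">10</span></li>
      <li><a href="/b">Second</a> <span class="score">3</span></li>
    </ul>
    """)
    titles = doc.xpath('//li[number(span[@class="score"]) > 5]/a/text()')
    print(titles)  # ['First']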
If you write a scraper, you could use Python. Beautiful Soup might be useful.
Huh, interesting. We're exploring extraction from HTML.
Scrapy has a pretty decent parser too.
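Scrapy exposes that parser (parsel under the hood) on its response objects as .css() and .xpath(). A minimal spider sketch against Scrapy's public demo site:

    # Minimal Scrapy spider; quotes.toscrape.com is Scrapy's own demo
    # site, and the selectors below match its markup.
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com"]

        def parse(self, response):
            # response.css() returns parsel selectors, as in the
            # parsel sketch above.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

Saved to a file (the filename is arbitrary), it runs standalone with: scrapy runspider quotes_spider.py -o quotes.json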