ML for Web Crawling

Comments discuss applying machine learning to web crawling tasks, such as building smart spiders, detecting spam links and crawlers, anomaly detection, and link selection using classifiers and statistical methods.

➡️ Stable 1.0x AI & Machine Learning

Topic ID: #4287

Activity Over Time: 2012, 2017, 2020, 2022, 2024, 2025

Top Contributors

macavity23 (1) inglor (1) bradknowles (1) raincole (1) codecracker3001 (1)

Keywords

GDG COVID19 ML HalfStack machine learning learning machine relatively simple simple ml creation method crawler using time

Sample Comments

raincole • Apr 24, 2024 • View on HN

> spammers link creation time distribution is widely different to natural link creation times

Yes, this is a statistical method. Guess what machine learning is and what it actually excels at?
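One way to make the "link creation time distribution" idea concrete is to compare the gaps between link creation times for a suspect set of links against a reference set. The sketch below uses a two-sample Kolmogorov-Smirnov statistic, which is my choice of test and not something the comment specifies. The timestamps (in minutes) and the function names are hypothetical.

```python
from bisect import bisect_right

def gaps(timestamps):
    """Time between consecutive link creations, in the same unit as the input."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def ks_statistic(sample_a, sample_b):
    """Largest vertical distance between the two empirical CDFs (0.0 to 1.0)."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Hypothetical data: natural links trickle in, spam links arrive in a burst.
NATURAL_TIMES = [0, 1800, 4500, 8400, 11400, 15600, 20400, 24000]
SPAM_TIMES = [6000, 6002, 6003, 6005, 6006, 6008, 6009, 6011]

distance = ks_statistic(gaps(NATURAL_TIMES), gaps(SPAM_TIMES))
print(f"KS distance between gap distributions: {distance:.2f}")
```

A distance near 1.0 means the two gap distributions barely overlap. Where to put the cutoff for "spam-like" is a tuning decision that would need labeled data.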

inglor • Jul 20, 2020 • View on HN

I recommend considering learning between test runs and I encourage you to train a relatively simple model for selection on top of http-archive and tagged data.

"Off the shelf" machine learning makes it pretty easy to create very robust selectors. I gave a talk about it in GDG Israel and was supposed to speak about it in HalfStack that got delayed/cancelled because of COVID19 - but the principle is pretty simple.

It's amazing how much RoI you can get from relatively simple mode

bradknowles • Jun 19, 2022 • View on HN

Takes a long time to get here, but this is the money quote:

> Conclusions: Using a network-based machine learning method, we have shown that plant-based foods such as tea, carrot, celery, orange, grape, coriander, cabbage and dill contain the largest number of molecules with high anti-cancer likeness through exerting influence on molecular networks in a similar fashion to existing therapeutics.

codecracker3001 • Aug 4, 2025 • View on HN

> we were able to fingerprint this crawler using a combination of machine learning and network signals.

What machine learning algorithms are they using? Time to deploy them onto our websites.

DelightOne • Apr 10, 2020 • View on HN

So you create a backlink by detecting that a concept here refers to a prior concept? Hmm, that's interesting. But can't this also be done with a simple word comparison, or am I missing something?

What can ML be used for that is not otherwise in scope?

bravura • Aug 11, 2012 • View on HN

To build a smart spider, you can have a classifier that determines if you should crawl the outlinks of the page.

This classifier can be trained using logistic regression.

This is slightly tricky, but effective. I've been meaning to write in more depth about this topic (smart crawling).

[edit: I'm stepping out so I can't write more right now. You can email me if you have any questions about this.]
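A minimal sketch of the classifier described above might look like the following: logistic regression fitted by plain batch gradient descent, predicting whether a page's outlinks are worth crawling. The two features, the toy training rows, and all function names are assumptions made for illustration. The comment does not say which features or training setup were used.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, features):
    """Probability that the page's outlinks are worth crawling (bias is weights[0])."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], features)))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            err = predict_proba(weights, features) - label
            grad[0] += err
            for i, value in enumerate(features):
                grad[i + 1] += err * value
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def should_crawl_outlinks(weights, features, threshold=0.5):
    return predict_proba(weights, features) >= threshold

# Hypothetical features per page: [on_topic_term_ratio, same_domain_outlink_ratio]
PAGES = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.1, 0.2], [0.2, 0.1], [0.05, 0.3]]
LABELS = [1, 1, 1, 0, 0, 0]  # 1 = outlinks were worth crawling

WEIGHTS = train_logistic(PAGES, LABELS)
print(should_crawl_outlinks(WEIGHTS, [0.85, 0.7]))
```

In a real crawler the labels would presumably come from whether previously crawled outlinks led to useful pages, and a library implementation would replace the hand-rolled training loop.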

macavity23 • Sep 23, 2017 • View on HN

At this point I think one has to assume there are more advanced machine-learning-based crawlers out there too. ML is very good at picking up 'anomalies'.
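As a rough illustration of picking up 'anomalies', the sketch below flags clients whose request rate sits far from the median, using a modified z-score based on the median absolute deviation. This is a simple statistical stand-in and not a claim about what any real crawler or detector uses. The client names, rates, and the 3.5 cutoff are assumptions.

```python
from statistics import median

def flag_anomalies(rates, cutoff=3.5):
    """Return the keys whose value has a modified z-score above the cutoff."""
    values = list(rates.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All values are (nearly) identical, so nothing stands out by this measure.
        return []
    return [
        key
        for key, value in rates.items()
        if 0.6745 * abs(value - center) / mad > cutoff
    ]

# Hypothetical requests per minute for each client.
RATES = {"a": 12, "b": 15, "c": 11, "d": 14, "e": 13, "f": 240}

print(flag_anomalies(RATES))
```

The median-based score is used here because a single extreme client would inflate a mean and standard deviation enough to hide itself.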