ML for Web Crawling

Comments discuss applying machine learning to web crawling tasks, such as building smart spiders, detecting spam links and crawlers, anomaly detection, and link selection using classifiers and statistical methods.

➡️ Stable 1.0x AI & Machine Learning
Comments: 7
Years Active: 6
Top Authors: 5
Topic ID: #4287

Activity Over Time

2012: 1
2017: 1
2020: 2
2022: 1
2024: 1
2025: 1

Keywords

GDG COVID19 ML HalfStack machine learning learning machine relatively simple simple ml creation method crawler using time

Sample Comments

raincole Apr 24, 2024 View on HN

> spammers link creation time distribution is widely different to natural link creation times

Yes, this is a statistical method. Guess what machine learning is and what it actually excels at?
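As a rough illustration of the statistical idea quoted above, the sketch below compares the gaps between link-creation times for two sets of links using a two-sample Kolmogorov-Smirnov statistic. The timestamps, the function names, and the choice of the KS statistic are all assumptions made for this example; the thread does not say which method was actually used.

```python
from bisect import bisect_right


def inter_arrival_gaps(timestamps):
    """Return the gaps (in seconds) between consecutive link-creation times."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_diff = 0.0
    # Both empirical CDFs are step functions, so the largest difference
    # is always reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_diff = max(max_diff, abs(cdf_a - cdf_b))
    return max_diff


# Hypothetical data: seconds at which inbound links were first seen.
natural_times = [0, 3600, 90000, 200000, 410000, 900000]
bursty_times = [0, 5, 11, 16, 22, 27]

score = ks_statistic(
    inter_arrival_gaps(natural_times),
    inter_arrival_gaps(bursty_times),
)
print(f"KS statistic: {score:.2f}")
```

A value near 1 means the two gap distributions barely overlap, and a value near 0 means they look alike. A learned classifier could take a score like this as one feature among many.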

inglor Jul 20, 2020 View on HN

I recommend considering learning between test runs, and I encourage you to train a relatively simple model for selection on top of http-archive and tagged data.

"Off the shelf" machine learning makes it pretty easy to create very robust selectors. I gave a talk about it in GDG Israel and was supposed to speak about it in HalfStack, which got delayed/cancelled because of COVID19, but the principle is pretty simple.

It's amazing how much RoI you can get from relatively simple mode

bradknowles Jun 19, 2022 View on HN

Takes a long time to get here, but this is the money quote:

Conclusions: Using a network-based machine learning method, we have shown that plant-based foods such as tea, carrot, celery, orange, grape, coriander, cabbage and dill contain the largest number of molecules with high anti-cancer likeness through exerting influence on molecular networks in a similar fashion to existing therapeutics.

> we were able to fingerprint this crawler using a combination of machine learning and network signals.

What machine learning algorithms are they using? Time to deploy them onto our websites.

DelightOne Apr 10, 2020 View on HN

So you create a backlink by detecting that a concept here refers to a prior concept? Hmm, that's interesting. But can't this also be done with a simple word comparison, or am I missing something?

What can ML be used for that is not otherwise in scope?

bravura Aug 11, 2012 View on HN

To build a smart spider, you can have a classifier that determines if you should crawl the outlinks of the page.

This classifier can be trained using logistic regression.

This is slightly tricky, but effective. I've been meaning to write in more depth about this topic (smart crawling).

[edit: I'm stepping out so I can't write more right now. You can email me if you have any questions about this.]
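A minimal sketch of what such an outlink classifier could look like, assuming two made-up page features (the share of on-topic words and the share of outlinks that look spammy) and a handful of invented training labels. The comment does not describe its features or training setup, so every name and number here is hypothetical. Logistic regression is fitted with plain batch gradient descent to keep the example dependency-free.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, features):
    """Probability that the page's outlinks are worth crawling."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            err = predict_proba(weights, features) - label
            grads[0] += err
            for j, x in enumerate(features, start=1):
                grads[j] += err * x
        weights = [w - lr * g / n for w, g in zip(weights, grads)]
    return weights


def should_crawl_outlinks(weights, features, threshold=0.5):
    """Decision rule for the spider: follow outlinks only above the threshold."""
    return predict_proba(weights, features) >= threshold


# Hypothetical training set: [on_topic_ratio, spammy_outlink_ratio]
pages = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1],
         [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = outlinks were worth crawling

weights = train_logistic(pages, labels)
print(should_crawl_outlinks(weights, [0.85, 0.15]))
print(should_crawl_outlinks(weights, [0.15, 0.85]))
```

In a real crawler the labels would have to come from somewhere, for example from whether previously crawled outlinks turned out to be relevant, and the threshold would trade crawl budget against recall.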

macavity23 Sep 23, 2017 View on HN

At this point I think one has to assume there are more advanced machine-learning-based crawlers out there too. ML is very good at picking up 'anomalies'.
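To make "picking up anomalies" concrete, here is a small sketch that flags outliers in a numeric signal using a robust z-score (median and median absolute deviation). This is a simple statistical baseline rather than a trained model, and the requests-per-minute figures, the 3.5 cutoff, and the function names are assumptions chosen for illustration only.

```python
from statistics import median


def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical; this simple scaling
        # cannot rank the rest, so report no deviation at all.
        return [0.0 for _ in values]
    # 0.6745 makes the score comparable to a standard z-score
    # when the data is roughly normal.
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, cutoff=3.5):
    """Return the indexes whose robust z-score exceeds the cutoff."""
    scores = robust_z_scores(values)
    return [i for i, s in enumerate(scores) if abs(s) > cutoff]


# Hypothetical signal: requests per minute observed for six clients.
requests_per_minute = [12, 15, 11, 14, 13, 400]
flagged = flag_anomalies(requests_per_minute)
print(flagged)
```

An ML-based approach would typically combine many such signals rather than thresholding a single one.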