ML for Web Crawling
Comments discuss applying machine learning to web crawling tasks, such as building smart spiders, detecting spam links and crawlers, anomaly detection, and link selection using classifiers and statistical methods.
Sample Comments
> spammers' link creation time distribution is widely different from natural link creation times
Yes, this is a statistical method. Guess what machine learning is, and what it actually excels at?
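One way to make the "different distribution" idea concrete is a two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between two empirical distributions. The sketch below is only an illustration of that general technique, not the method the commenters used; the function name `ks_statistic` and the day values (days since a site launched on which each inbound link appeared) are made up for the example.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs,
    a value between 0.0 (identical samples) and 1.0 (no overlap).
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    # The largest gap always occurs at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical data: natural links trickle in over months,
# while a spam campaign creates its links in a short burst.
natural_days = [3, 17, 41, 58, 90, 133, 170, 212, 260, 301]
spam_days = [100, 100, 101, 101, 101, 102, 102, 102, 103, 103]

gap = ks_statistic(natural_days, spam_days)
print(f"KS statistic: {gap:.2f}")
```

A large statistic only says the two timing patterns differ; deciding what counts as "spam-like" would still need labeled examples and a threshold chosen from real data.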
I recommend considering learning between test runs, and I encourage you to train a relatively simple model for selection on top of http-archive and tagged data. "Off the shelf" machine learning makes it pretty easy to create very robust selectors. I gave a talk about it at GDG Israel and was supposed to speak about it at HalfStack, which got delayed/cancelled because of COVID-19 - but the principle is pretty simple. It's amazing how much RoI you can get from relatively simple mode
Takes a long time to get here, but this is the money quote:
> Conclusions: Using a network-based machine learning method, we have shown that plant-based foods such as tea, carrot, celery, orange, grape, coriander, cabbage and dill contain the largest number of molecules with high anti-cancer likeness through exerting influence on molecular networks in a similar fashion to existing therapeutics.
> we were able to fingerprint this crawler using a combination of machine learning and network signals.
What machine learning algorithms are they using? Time to deploy them onto our websites.
So you create a backlink by detecting that a concept here refers to a prior concept? Hmm, that's interesting. But can't this also be done with a simple word comparison, or am I missing something? What can ML be used for that is not otherwise in scope?
To build a smart spider, you can have a classifier that determines whether you should crawl the outlinks of a page. This classifier can be trained using logistic regression. This is slightly tricky, but effective. I've been meaning to write in more depth about this topic (smart crawling). [edit: I'm stepping out so I can't write more right now. You can email me if you have any questions about this.]
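A minimal sketch of what such a classifier might look like, written with plain gradient descent so it needs no external libraries. Everything specific here is an assumption for illustration: the two features (`on_topic_ratio`, `spam_score`), the toy training rows, and the function names are hypothetical and not taken from the comment above.

```python
import math


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def should_crawl_outlinks(features, weights, bias, threshold=0.5):
    """Return True when the predicted probability reaches the threshold."""
    score = sum(w * xi for w, xi in zip(weights, features)) + bias
    return sigmoid(score) >= threshold


# Hypothetical features per page: [on_topic_ratio, spam_score].
# Label 1 means the page's outlinks turned out to be worth crawling.
pages = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
worth_crawling = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(pages, worth_crawling)
print(should_crawl_outlinks([0.85, 0.15], weights, bias))
print(should_crawl_outlinks([0.15, 0.85], weights, bias))
```

In practice the tricky part is likely the labels and features rather than the model: you need some record of which past pages led to useful outlinks before there is anything to train on.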
At this point I think one has to assume there are more advanced machine-learning-based crawlers out there too. ML is very good at picking up 'anomalies'.
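For a sense of what picking up anomalies can mean at its simplest, here is a sketch of a robust outlier check based on the median absolute deviation (MAD). This is a plain statistical baseline rather than a trained model, and it is not a claim about what any real crawler or site uses; the function name `mad_outliers`, the 3.5 cutoff, and the request counts are assumptions for the example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation, so a single extreme value does not distort the baseline.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values are (nearly) identical, so nothing can be scored.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical requests per minute from one client; one minute spikes.
requests_per_minute = [12, 15, 11, 14, 13, 12, 240, 13]
print(mad_outliers(requests_per_minute))
```

A learned model would typically combine many such signals (timing, headers, navigation paths) instead of thresholding a single one.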