And why it matters for scaling your public web data sources
As with most forms of tech these days, web scrapers have recently seen a surge of claims that they’re somehow based on AI or machine learning tech. While this suggests that an AI will detect exactly what you want extracted from a page, most scrapers are still rule-based (there are some exceptions, such as Diffbot’s Automatic Extraction APIs).
Why does this matter?
Historically rule-based extraction has been the norm. In rule-based extraction, you specify a set of rules for what you want pulled from a page. This is often an HTML element, CSS selector, or a regex pattern. Maybe you want the third bulleted item beneath every paragraph in a text, or all headers, or all links on a page; rule-based extraction can help with that.