Is it legal to use a web crawler?
The legality of crawling is currently a gray area, and LinkedIn's lawsuit against hiQ, which is still in progress, will likely lay the first steps of a legal framework around data crawling. If you are considering betting your business on crawling, for now, don't.
Unless severe restrictions are placed on crawling, it will remain an important tool in the corporate toolbox. Leading web crawling companies claim to work with Fortune 500 companies like PwC and P&G. Business Insider claims in a paywalled article that hedge funds spend billions on crawling.
We will update this as the LinkedIn vs. hiQ case comes to a close. Please note that this does not constitute legal advice.
What are the use cases for web crawling?
Web crawling is a true Swiss army knife like Excel, so we will stick to the most obvious use cases here:
- Competitive analysis: Knowing your competitors' campaigns, product launches, price changes, new customers and so on can be invaluable in competitive markets. Crawlers can be set up to produce alarms and reports that inform your sales, marketing and strategy teams. For example, Amazon sellers set up price monitoring bots to ensure that their products remain in the correct relative position compared to the competition (see the sketch after this list). Things can take an unexpected turn when two companies automatically update their prices based on one another's changes: such automated pricing bots once drove a book listing to a $23m asking price.
- Track customers: While competition rarely kills companies, failing to understand changing customer demands can be far more damaging. Crawling customers' websites can help you better understand their businesses and identify opportunities to serve them.
- Extract leads: Emails and contact information of potential customers can be crawled to build a lead funnel. For example, publicly listed company email addresses get hundreds of sales pitches as they are added into companies' lead funnels.
- Enable data-driven decision making: Even today, most business decisions rely on a subset of the available relevant data. Leveraging the internet, the world's largest database, for data-driven decision making makes sense, especially for important decisions where the cost of crawling is insignificant.
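As a concrete illustration of the price monitoring bullet above, here is a minimal sketch in Python; the URL, the CSS selector and the price threshold are illustrative assumptions, not any specific retailer's markup.

```python
# A hedged sketch of a price-monitoring check. The URL, selector and
# threshold below are illustrative assumptions.
import requests
from bs4 import BeautifulSoup

OUR_PRICE = 19.99  # assumed current price of our own listing


def competitor_price(url: str) -> float:
    """Fetch a competitor's listing and parse the displayed price."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    tag = soup.select_one("span.price")  # hypothetical price element
    return float(tag.get_text(strip=True).lstrip("$"))


price = competitor_price("https://example.com/competitor-listing")
if price < OUR_PRICE:
    # In production this would trigger an alert or a repricing rule.
    print(f"Competitor undercut us: ${price:.2f} vs ${OUR_PRICE:.2f}")
```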
How does a web crawler work?
First, the user needs to communicate the relevant content to the crawler. For the technically savvy, this can be done by programming a crawler, as sketched below. For those with less technical skill, there are dozens of web crawlers with graphical user interfaces (GUIs) that let users select the relevant data.
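For the programming route, a minimal sketch in Python using the requests and BeautifulSoup libraries might look like the following; the URL and CSS selectors are placeholders rather than any particular site's markup.

```python
# A minimal programmed crawler: fetch a page and select the relevant data.
# The URL and selectors are placeholders; adapt them to the target site.
import requests
from bs4 import BeautifulSoup


def crawl_product_page(url: str) -> dict:
    """Fetch one page and pull out the fields we care about."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Hypothetical selectors: these must match the actual page structure.
    return {
        "title": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }


if __name__ == "__main__":
    print(crawl_product_page("https://example.com/products/1"))
```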
Then, the user starts the crawler using a bot management module. Crawling tends to take time (e.g. 10-20 pages per minute in the starter packages of most crawlers). This is because the web crawler visits the pages to be crawled like a regular browser and copies the relevant information. If you tried doing this manually, you would quickly be presented with visual tests to verify that you are human. This test is called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). Websites use a variety of methods like CAPTCHAs to stop such automated behavior. Web crawlers rely on methods like changing their IP addresses and digital fingerprints to make their automated behavior less noticeable.
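A minimal sketch of the pacing and disguising described above, assuming a hypothetical pool of proxy endpoints and a couple of illustrative User-Agent strings; commercial crawlers source both from managed pools.

```python
# A hedged sketch of polite request pacing with rotating identities.
# The proxy endpoints and User-Agent strings are illustrative assumptions.
import itertools
import random
import time

import requests

USER_AGENTS = [  # illustrative browser signatures
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
PROXIES = itertools.cycle([  # hypothetical proxy endpoints
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
])


def fetch(url: str) -> str:
    """Fetch a page through a rotating proxy, then pause before the next one."""
    proxy = next(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    response.raise_for_status()
    # Pause between requests, keeping the pace near the 10-20 pages per
    # minute mentioned above.
    time.sleep(random.uniform(3, 6))
    return response.text
```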
What is a web crawler?
Web crawlers extract data from websites. Websites are designed for human interaction, so they include a mix of structured data like tables, semi-structured data like lists and unstructured data like text. Web crawlers analyze the patterns in websites to extract and transform all these different types of data.
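A small sketch of how a parser might treat these three kinds of data on the same page; the HTML snippet is invented purely for illustration.

```python
# Extracting structured (table), semi-structured (list) and unstructured
# (free text) data from one page. The HTML below is an invented example.
from bs4 import BeautifulSoup

html = """
<table><tr><td>Widget</td><td>$9.99</td></tr></table>
<ul><li>In stock</li><li>Ships worldwide</li></ul>
<p>A durable widget for everyday use.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Structured: table rows become records.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]

# Semi-structured: list items become a flat collection.
items = [li.get_text(strip=True) for li in soup.find_all("li")]

# Unstructured: paragraph text is kept as-is for downstream processing.
text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

print(rows, items, text, sep="\n")
```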
Crawlers are useful when data is spread over multiple pages, which makes it difficult for a human to copy the data manually.
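A hedged sketch of how a crawler might follow pagination links to gather such multi-page data into one dataset; the start URL, the .record selector and the a.next link selector are assumptions for illustration.

```python
# Following "next page" links so data spread over many pages lands in one
# dataset. The selectors and URL are hypothetical.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def crawl_all_pages(start_url: str, max_pages: int = 50) -> list:
    """Walk pagination links, collecting records until no next page remains."""
    records, url = [], start_url
    for _ in range(max_pages):  # hard cap so the crawl cannot run forever
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        records += [row.get_text(strip=True) for row in soup.select(".record")]
        next_link = soup.select_one("a.next")  # hypothetical pagination link
        if next_link is None:
            break
        url = urljoin(url, next_link["href"])
    return records
```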