2024: Ethical Web Scraping Considerations and Practices
With all the heat on AI companies and more exposure to this practice, what's the best way to approach building your own?
What is Web Scraping?
Web scraping is the process of extracting data from websites or other online sources using automated tools or scripts. It involves collecting information from various web pages and saving it in a structured format, such as a spreadsheet or database, for further analysis or use.
Web scraping can be used to gather a wide range of information, including:
- Email addresses
- Stock prices and historical financial data
- Product details and pricing from e-commerce websites
- Social media posts and user profiles
- News articles and blog posts
Some common use cases for scraped data include:
- Building chatbots or other AI models that require large amounts of training data
- Analyzing e-commerce sales data to identify trends and optimize pricing strategies
- Creating business dashboards to monitor key performance indicators (KPIs) and make data-driven decisions
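To make the definition above concrete, here's a minimal sketch in Python using requests and BeautifulSoup. The URL, CSS selectors, and output file are hypothetical placeholders rather than a reference to any particular site.

```python
# A minimal sketch: fetch a page, parse it, and save structured rows to a CSV.
# The URL and CSS selectors below are hypothetical placeholders — adjust them
# to whatever site (and whatever terms of service) you're actually working with.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # hypothetical listing page

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.select(".product"):  # hypothetical CSS class
    name = item.select_one(".product-name")
    price = item.select_one(".product-price")
    if name and price:
        rows.append({"name": name.get_text(strip=True),
                     "price": price.get_text(strip=True)})

# Persist the extracted data in a structured format (CSV) for later analysis.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```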
In recent media coverage, AI companies in particular have been accused of malicious, unattributed use of website owners' and publishers' content. This has raised many questions about the legality and future of web scraping as the practice becomes more ubiquitous.
There are several reasons why individuals or organizations might choose to build their own web scrapers instead of purchasing data from third-party providers:
1. Cost savings: Building your own scraper can be more cost-effective (in the long term) than paying for data from a provider, especially if you need to collect large amounts of data on a regular basis.
2. Customization: When you create your own scraper, you have full control over the data you collect and can tailor it to your specific needs and requirements. Third-party providers may not offer the exact data points or fields your project requires.
3. Real-time data: By scraping data yourself, you can ensure that you have access to the most up-to-date information, which may not always be available through third-party providers.
4. Unique data sets: Some data may not be available through existing providers, so building your own scraper allows you to collect unique data sets that are specific to your project or research.
However, it's important to note that building and maintaining web scrapers can be time-consuming and requires technical expertise. It also comes with certain legal and ethical considerations. We'll cover best practices along with these legal considerations in the remainder of the article.
Legality of Web Scraping
In recent years, several court cases have established that web scraping is generally legal, as long as the data being accessed is publicly available.
pro-tip: If you can reach it from a Google search without logging in, it's generally publicly available.
One notable case is hiQ Labs v. LinkedIn, where the Ninth Circuit Court of Appeals ruled that scraping publicly available data from LinkedIn did not violate the Computer Fraud and Abuse Act (CFAA). The court held that the CFAA's prohibition on accessing a computer "without authorization" does not apply to public websites [1].
Another important case is Van Buren v. United States, in which the Supreme Court narrowed the scope of the CFAA. The court ruled that the CFAA does not criminalize every violation of a computer-use policy and that the "exceeds authorized access" clause only applies when an individual accesses areas of a computer system that they are not authorized to access [2].
In Sandvig v. Barr, the court interpreted the CFAA's Access Provision to likely prohibit only bypassing code-based restrictions on protected sites, but not on public sites [3]. This ruling further supports the legality of scraping publicly available data.
More recently, in Meta v. Bright Data, the court found that simply violating a website's terms of service does not automatically constitute a violation of state or federal law. However, the court also noted that companies can still bring breach of contract claims against scrapers who violate their terms of service [4].
And with many AI products like Perplexity and ChatGPT actively scraping on behalf of users, there will definitely be more legal cases to come as web scraping and AI evolve together.
Methods Companies Use to Prevent Scraping
Despite court rulings establishing the legality of scraping publicly available data, companies may still attempt to prevent scraping through various methods:
IP Blocking
Companies can block access from specific IP addresses associated with scraping activities. This prevents scrapers from accessing the website or API.
CAPTCHAs
CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are challenges designed to differentiate between human users and automated bots. They often require users to solve visual or audio puzzles, making it difficult for scrapers to bypass.
User Agent Detection
Websites can identify and block requests from known scraping tools by analyzing the User-Agent header in HTTP requests. Scrapers may attempt to mimic legitimate User-Agent strings to avoid detection.
Rate Limiting
Companies can restrict the number of requests a single user or IP address can make within a given time frame. Exceeding the rate limit may result in temporary or permanent blocking of access.
Honeypot Traps
Honeypots are deceptive links or elements placed on a website to detect and trap scrapers. These traps are invisible to regular users but can be accessed by scrapers, allowing companies to identify and block them.
Dynamic Content Loading
Websites may load content dynamically using JavaScript, making it harder for scrapers to extract data from the initial HTML response. Scrapers need to use headless browsers or JavaScript rendering techniques to access dynamically loaded content.
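As an illustration, here's a rough sketch using Playwright's synchronous API to render a JavaScript-heavy page before extracting its HTML; Selenium or other headless-browser tools work similarly. The URL and selector are hypothetical placeholders.

```python
# Rough sketch: render a JavaScript-driven page with a headless browser
# (Playwright here; Selenium works similarly), then hand the final HTML
# to your usual parser. The URL and selector are hypothetical placeholders.
from playwright.sync_api import sync_playwright

URL = "https://example.com/spa-listing"  # hypothetical JS-rendered page

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")             # wait for JS-driven requests to settle
    page.wait_for_selector(".product", timeout=10_000)   # hypothetical element to wait for
    html = page.content()                                 # fully rendered HTML
    browser.close()

# `html` can now be parsed with BeautifulSoup just like a static response.
```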
IP Rotation
Scrapers can use IP rotation techniques, such as proxy servers or VPNs, to switch between different IP addresses and evade IP-based blocking. However, companies can still detect and block suspicious IP ranges or known proxy/VPN services.
Browser Fingerprinting
Websites can collect various browser attributes (e.g., screen resolution, installed plugins, fonts) to create a unique fingerprint of a user's browser. This fingerprint can be used to identify and block scrapers, even if they switch IP addresses.
Best Practices for Ethical Web Scraping
So what is the best way to get around these? When scraping data, it's essential to follow best practices so that your activities remain undetected and/or minimally intrusive. Oftentimes, this boils down to common sense: don't overload a site's servers with excessive requests for data. If you access data within a tolerable threshold, you generally won't have issues. Here are some key considerations:
Respect Rate Limits
To avoid overloading servers and causing disruptions, it's crucial to respect rate limits when scraping data (a short sketch follows this list). This involves:
- Following API guidelines on requests per minute/second
- Pausing between requests when scraping websites
- Adjusting scraping frequency based on the website's capacity and your data needs
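Here's a simple sketch of self-imposed rate limiting, assuming a hypothetical list of URLs and a self-chosen delay range; the actual limits should come from the site's API guidelines or capacity.

```python
# Simple self-imposed rate limiting: pause between requests and back off
# if the server asks you to slow down. The URLs and delay values are hypothetical.
import random
import time

import requests

URLS = ["https://example.com/page/1", "https://example.com/page/2"]  # hypothetical targets
MIN_DELAY, MAX_DELAY = 2.0, 5.0  # seconds between requests — tune to the site's capacity

for url in URLS:
    response = requests.get(url, timeout=10)

    if response.status_code == 429:  # "Too Many Requests"
        retry_after = int(response.headers.get("Retry-After", "60"))
        time.sleep(retry_after)      # honor the server's requested cool-down (skipping the URL here for brevity)
        continue

    # ... parse and store the response here ...

    time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))  # polite pause before the next request
```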
Refer to robots.txt
The robots.txt file provides instructions to web crawlers and scrapers about which pages or sections of a website should not be accessed. Always check for and respect the directives in the robots.txt file to ensure compliance with the website's scraping policies. I'll go over this more in the next article and show some examples for clearer context.
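As a quick preview, Python's standard library can check a URL against a site's robots.txt before you fetch it; the bot name and URLs below are hypothetical placeholders.

```python
# Quick preview: check robots.txt with Python's standard library before fetching.
# The user agent string and URLs are hypothetical placeholders.
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyResearchBot/1.0"                    # hypothetical bot name
TARGET_URL = "https://example.com/products/123"     # hypothetical page

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

if rp.can_fetch(USER_AGENT, TARGET_URL):
    delay = rp.crawl_delay(USER_AGENT)  # honor Crawl-delay if the site specifies one
    print(f"Allowed to fetch; crawl delay: {delay or 'none specified'}")
else:
    print("robots.txt disallows this URL — skip it.")
```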
Cache Me If You Can
Caching involves storing scraped data to avoid unnecessary re-scraping. This is especially important when dealing with large datasets or frequently updated information. By caching data, you can:
- Reduce the load on the website's servers
- Improve the efficiency of your scraping process
- Save time and resources
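Here's a rough sketch of a hand-rolled disk cache keyed by URL with a configurable expiry; the cache directory and one-day freshness window are arbitrary choices, and libraries such as requests-cache offer the same idea with less code.

```python
# Rough sketch of a hand-rolled disk cache: re-use a saved response if it's
# fresh enough, otherwise fetch and store it. Paths and expiry are arbitrary choices.
import hashlib
import time
from pathlib import Path

import requests

CACHE_DIR = Path(".scrape_cache")
CACHE_DIR.mkdir(exist_ok=True)
MAX_AGE_SECONDS = 24 * 60 * 60  # treat cached pages as fresh for one day


def fetch_cached(url: str) -> str:
    """Return the page body, hitting the network only when the cache is stale."""
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")

    if cache_file.exists() and (time.time() - cache_file.stat().st_mtime) < MAX_AGE_SECONDS:
        return cache_file.read_text(encoding="utf-8")  # cache hit — no request made

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding="utf-8")
    return response.text
```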
Mimic Human Behavior
To avoid detection and maintain ethical scraping practices, it's important to make your scraper mimic human behavior. This involves randomizing various aspects of your scraping process to appear more organic and less automated. Consider the following techniques:
- Randomize request intervals
- Rotate user agent strings to mimic different browsers and devices
- Randomize request order
- Introduce random clicks and mouse movements
- Limit concurrent requests
Remember, the goal is to make your scraper's behavior indistinguishable from that of a regular user. I'll go over some practical code examples in the next article.
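In the meantime, here's a rough sketch of a couple of these ideas: randomized delays, randomized request order, and rotating user agent strings. The user agent pool and URLs are illustrative only.

```python
# Rough sketch: randomized delays, randomized request order, and rotating
# User-Agent strings. The user agent list is illustrative only; clicks and
# mouse movements would require a browser-automation tool and aren't shown here.
import random
import time

import requests

USER_AGENTS = [  # a small, illustrative pool of common browser strings
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

URLS = ["https://example.com/a", "https://example.com/b"]  # hypothetical targets
random.shuffle(URLS)  # randomize request order

for url in URLS:
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate user agents
    response = requests.get(url, headers=headers, timeout=10)
    # ... parse and store the response here ...
    time.sleep(random.uniform(1.5, 6.0))  # irregular, human-like pauses between requests
```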
Consider the Context and Purpose
Before scraping data, it's important to understand the context and purpose of your project. This involves:
- Adjusting scraping frequency based on your data needs (e.g., weekly updates for dashboards)
- Being mindful of budget, time, and cost-effectiveness
- Respecting user privacy and handling scraped data responsibly
Identify Your Scraper
When making requests to a website, it's good practice to identify your scraper by setting a descriptive User-Agent header, particularly if the website owner requests it. This helps them understand the purpose of your scraper and contact you if necessary. Include information such as the following (see the example after this list):
- Your scraper's name or project identifier
- A contact email address
- A brief description of your scraping purpose
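A descriptive User-Agent might look something like this; the project name, URL, and email address are hypothetical placeholders.

```python
# Hypothetical example of a descriptive, honest User-Agent header.
# The project name, URL, and email address are placeholders.
import requests

headers = {
    "User-Agent": (
        "PriceTrendsResearchBot/1.0 "
        "(+https://example.org/scraper-info; contact: data-team@example.org) "
        "- academic pricing research"
    )
}

response = requests.get("https://example.com/products", headers=headers, timeout=10)
```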
Handle Errors Gracefully
Websites may undergo changes, experience downtime, or present unexpected scenarios. To ensure the reliability and robustness of your scraper, implement proper error-handling mechanisms (a short sketch follows this list). This includes:
- Catching and logging exceptions (pro-tip: push these to a Slack or Discord channel ;)
- Retrying failed requests with exponential backoff and a max retry cap
- Setting timeouts to prevent indefinite waiting
- Gracefully terminating the scraping process when necessary
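Here's a rough sketch of these ideas: timeouts, exception logging, and retries with exponential backoff capped at a maximum attempt count. The Slack/Discord notification is left as a comment placeholder.

```python
# Rough sketch: timeouts, exception logging, and retries with exponential
# backoff plus a maximum retry cap. The webhook notification is just a stub.
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

MAX_RETRIES = 4
BASE_DELAY = 2.0  # seconds; doubles on each failed attempt


def fetch_with_retries(url: str) -> str | None:
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            response = requests.get(url, timeout=10)  # never wait indefinitely
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logger.warning("Attempt %d/%d failed for %s: %s", attempt, MAX_RETRIES, url, exc)
            # Here you could also push the error to a Slack or Discord webhook.
            if attempt == MAX_RETRIES:
                logger.error("Giving up on %s after %d attempts", url, MAX_RETRIES)
                return None  # terminate gracefully instead of crashing the whole run
            time.sleep(BASE_DELAY * 2 ** (attempt - 1))  # exponential backoff
    return None
```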
Conclusion
Web scraping is a powerful tool for collecting data from online sources, enabling businesses, researchers, and individuals to gather valuable insights and make data-driven decisions. However, it is crucial to approach web scraping with a strong emphasis on ethics and legal compliance.
By understanding the legal landscape surrounding web scraping, being aware of the methods companies use to prevent scraping, and adhering to best practices, you can ensure that your scraping activities remain ethical and compliant. This includes respecting rate limits, referring to robots.txt files, caching data, mimicking human behavior, considering the context and purpose of your project, and handling errors gracefully.
As you embark on your web scraping journey, remember to continuously monitor your scraping process and adapt it to evolving website policies and legal requirements. By doing so, you can harness the power of web scraping while maintaining a responsible and ethical approach.
In the next article, we will dive into some practical code examples and techniques to help you implement these best practices in your own web scraping projects. Stay tuned for more insights and guidance on ethical web scraping!
[1] https://www.quinnemanuel.com/the-firm/publications/the-legal-landscape-of-web-scraping/
[2] https://en.wikipedia.org/wiki/Van_Buren_v._United_States
[3] https://globalfreedomofexpression.columbia.edu/cases/sandvig-v-barr/
[4] https://www.jdsupra.com/legalnews/meta-v-bright-data-ruling-has-important-1439691/