ChatGPT is an easy way to bring AI to web scraping, saving developers from manual parsing work that requires constant updates. Using LLMs is becoming one of the best web scraping practices.
See below how ChatGPT is used in web scraping, including various use cases where combining web scraping and ChatGPT can facilitate data collection:
How to scrape websites using ChatGPT
In this tutorial, we use ChatGPT’s current web-connected research and coding workflows. In addition to manual HTML parsing, developers can now use built-in web search, file inputs, and deep research tools to analyze pages and generate extraction logic.
1. Loading the HTML file:
The manual workflow of saving the HTML locally and uploading it to ChatGPT still works, but it is no longer the only first-party option. OpenAI now supports web search in the Responses API, broader file-input handling, and deep research workflows that can combine web results, uploaded files, and connected data sources.
Choose the target website you want to extract data from. Press Ctrl + S (or Cmd + S on macOS) to save the page as HTML. If you’d like to automate saving the HTML file, you can use the following prompt example. This will prompt ChatGPT to generate the necessary Python code to save the HTML file from the provided URL.
Example Prompt to ChatGPT:
“Please provide a Python script that automates the process of saving an HTML page from the following URL: https://www.walmart.com/browse/electronics/gaming-mouse/3944_1089430_132959_1008621_4842284_9144425 The script should send a GET request to the page, retrieve the HTML content, and save it to a file named walmart_gaming_mouse.html.”
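A minimal sketch of the kind of script this prompt produces; the request headers and error handling are assumptions you may need to adjust for the target site:

import requests

url = "https://www.walmart.com/browse/electronics/gaming-mouse/3944_1089430_132959_1008621_4842284_9144425"

# Many sites reject requests without a browser-like User-Agent; this header is an assumption.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Send a GET request and fail loudly if the page could not be retrieved.
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

# Save the raw HTML content to a local file.
with open("walmart_gaming_mouse.html", "w", encoding="utf-8") as f:
    f.write(response.text)

print("Saved walmart_gaming_mouse.html")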
2. Inspecting the structure of the HTML:
Once you have saved the HTML file from the target page, drag and drop it into ChatGPT.
Pages that render content with JavaScript may require more than static HTML inspection. In those cases, developers should validate whether the saved file contains the target elements or whether a web-connected or browser-based workflow is needed before generating selectors.
Example prompt to ChatGPT:
“Please provide a Python script that automates the inspection of the HTML structure from the file walmart_gaming_mouse.html to identify the correct HTML tags and classes that contain the product name, price, and product link. The script should load the saved HTML file, find the elements that contain product names, prices, and links, and print the relevant tag names, classes, and text content.”
Example Python script to automate inspection:
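A sketch of such an inspection script, assuming Beautiful Soup is installed; the filter below is a rough heuristic, since the exact tags and classes depend entirely on the saved page:

from bs4 import BeautifulSoup

# Load the locally saved HTML file.
with open("walmart_gaming_mouse.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Print the tag name, classes, and a text snippet for elements that look like
# product names, prices, or links, to help identify the right selectors.
for element in soup.find_all(["a", "span", "div"]):
    text = element.get_text(strip=True)
    if not text:
        continue
    if "$" in text or "mouse" in text.lower():
        print(element.name, element.get("class"), text[:80])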
3. Parsing data from the HTML:
Example Prompt to ChatGPT:
“Please provide a Python script that automates the parsing of the HTML file walmart_gaming_mouse.html to extract product details such as product name, price, and link. The script should parse the HTML, extract the required details for each gaming mouse on the page, and store them in a structured format such as a CSV file.”
Python script for parsing the data:
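A sketch of the parsing step; the CSS selectors are placeholders and must be replaced with the tags and classes found during inspection, since Walmart’s markup changes frequently:

import csv
from bs4 import BeautifulSoup

with open("walmart_gaming_mouse.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

products = []
# Placeholder selectors; swap in the tags and classes identified in step 2.
for item in soup.select("div[data-item-id]"):
    name_tag = item.select_one("span[data-automation-id='product-title']")
    price_tag = item.select_one("div[data-automation-id='product-price']")
    link_tag = item.select_one("a[href]")
    products.append({
        "name": name_tag.get_text(strip=True) if name_tag else "",
        "price": price_tag.get_text(strip=True) if price_tag else "",
        "link": link_tag["href"] if link_tag else "",
    })

# Store the extracted details in a CSV file.
with open("gaming_mouse_products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "link"])
    writer.writeheader()
    writer.writerows(products)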
For static pages, saving the HTML file is still a practical approach. For more dynamic workflows, teams can also use web-connected research or API-based retrieval to inspect current page content, compare multiple sources, and generate extraction logic without relying only on a manually saved file.
4. Storing or displaying the data:
Example Prompt to ChatGPT:
“Please provide a Python script that stores the parsed product details from the walmart_gaming_mouse.html file into a structured format like CSV. The script should extract the product name, price, and link, and save them to a CSV file named gaming_mouse_products.csv. Additionally, the script should display a confirmation message once the data is saved.”
Python script for storing or displaying the data:
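A short sketch of the storage step, assuming a products list like the one built during parsing; the placeholder row only stands in for that output:

import csv

# Placeholder data standing in for the list produced by the parsing step.
products = [
    {"name": "Example Gaming Mouse", "price": "$29.99", "link": "https://www.walmart.com/ip/example"},
]

with open("gaming_mouse_products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "link"])
    writer.writeheader()
    writer.writerows(products)

# Display a confirmation message once the data is saved.
print(f"Saved {len(products)} products to gaming_mouse_products.csv")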
Using ChatGPT as an XPath tool
ChatGPT can help you extract specific elements from the target page using XPath expressions. When you ask ChatGPT how to utilize XPath to extract data, you need to:
- Inspect the HTML structure first.
- Handle edge cases such as missing data or JavaScript-generated content.
- Use flexible XPath expressions to account for small differences in HTML.
XPath remains useful for today’s scraping workflows, but browser-native agent interfaces are beginning to emerge as an alternative for some sites.
For example, Chrome introduced WebMCP in early preview in 2026 to let websites expose structured tools to AI agents, thereby reducing reliance on brittle DOM guessing for supported use cases.
Prompt:
“How can I use XPath to extract all product names, prices, and links from this HTML file?”
ChatGPT Response:
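A typical response uses lxml with XPath expressions keyed to the page’s classes; the expressions below are placeholders and need to match the markup you inspected:

from lxml import html

with open("walmart_gaming_mouse.html", encoding="utf-8") as f:
    tree = html.fromstring(f.read())

# Placeholder XPath expressions; adjust the class names to the real page.
names = tree.xpath("//span[contains(@class, 'product-title')]/text()")
prices = tree.xpath("//div[contains(@class, 'product-price')]/text()")
links = tree.xpath("//a[contains(@class, 'product-link')]/@href")

for name, price, link in zip(names, prices, links):
    print(name.strip(), price.strip(), link)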
ChatGPT applications in web scraping
1. Integrate ChatGPT into scraping workflows
MCP stands for Model Context Protocol. It is a standardized way for AI systems to connect to external tools and data sources, including web and enterprise systems, in a more structured way.
Web scraping MCPs, such as those provided by Bright Data, act as intermediaries that handle dynamic content rendering, IP rotation, and anti-bot bypass mechanisms, allowing ChatGPT to access and process large-scale web data without direct HTTP request handling.
In current OpenAI workflows, deep research can use web search, remote MCP servers, and file-based retrieval together. OpenAI also updated deep research in 2026 with trusted-site search controls, app/MCP connections, and live progress tracking, making it more suitable for monitored research and extraction tasks than prompt-only workflows.
You can integrate these MCPs with ChatGPT by configuring them through VSCode agents, such as GitHub Copilot, or by leveraging libraries like mcp-use, enabling seamless and scalable web data extraction workflows.
2. Web search and deep research for monitored extraction
ChatGPT is no longer limited to generating scraping code from static inputs. OpenAI’s current toolset includes built-in web search for up-to-date retrieval, file inputs for working with saved source material, and deep research models that can combine web results, files, and remote MCP sources within a single workflow.
This is especially useful when you need citations, multi-source comparison, or traceable research outputs before writing extraction code.
3. Generate code for scraping websites
For developer workflows, OpenAI documents web retrieval primarily through the Responses API, where web search can be enabled as a tool. 2026 updates expanded file-input support and added hosted environment features that make it easier to process retrieved documents and data before extraction.
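A minimal sketch of enabling web search through the Responses API with the official openai Python library; the model name and the web-search tool type are assumptions to verify against OpenAI’s current documentation:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Model name and tool type are assumptions; check which values your account supports.
response = client.responses.create(
    model="gpt-4.1",
    tools=[{"type": "web_search_preview"}],
    input="Find the current price range of popular wireless gaming mice and cite your sources.",
)

print(response.output_text)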
Keep in mind that website structures and designs may change, which can affect the HTML elements and attributes you’re targeting. In such a scenario, your code may fail to function properly or extract the desired data. You need to monitor and update your scraping code regularly.
For example, you can ask ChatGPT for a script that extracts product description data, using a prompt in the same style as the examples above.
Many websites use anti-automation controls, rate limits, and dynamic rendering to restrict large-scale data collection. Before scraping, teams should review site terms, robots policies, and applicable legal requirements, then choose an approach that matches the technical and compliance constraints of the target site.
Residential proxies and web unblockers are highly effective for bypassing stringent anti-bot defenses. Unlike datacenter proxies, residential proxies use IP addresses provided by actual Internet Service Providers (ISPs), making them appear more authentic.
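For reference, routing a Python request through a proxy only requires a proxies mapping; the endpoint and credentials below are placeholders for whatever your provider issues:

import requests

# Placeholder proxy URL; substitute the host, port, and credentials from your provider.
proxy_url = "http://username:password@proxy.example.com:8000"
proxies = {"http": proxy_url, "https": proxy_url}

response = requests.get("https://example.com/product-page", proxies=proxies, timeout=30)
print(response.status_code)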
Sponsored
You can integrate an unblocking technology with your web crawler to enhance your web scraping projects. Bright Data’s Web Unlocker enables businesses to collect data from web sources ethically and legally while circumventing anti-scraping measures.
3.1 Provide Python instructions for web scraping
ChatGPT offers step-by-step instructions for scraping data from web sources in various programming languages. In this example, we will use the requests library to fetch the content of a webpage and Beautiful Soup to parse and retrieve the desired data.
- ChatGPT provides the command to install the required libraries; for requests and Beautiful Soup, this is pip install requests beautifulsoup4, which you can run in a terminal.
- You can use the Python code generated by ChatGPT to import requests and Beautiful Soup.
- The requests library allows you to fetch the content of the target web page by sending HTTP requests to the target server and handling the responses. To fetch the content of the product page, replace “https://example.com/product-page” in the generated code with the target web page URL.
- After fetching the content of a web page, you need to parse the fetched HTML with the Beautiful Soup library to extract the desired data.
If you scrape an e-commerce website to extract product data, such as product titles, you must inspect the product page to locate the necessary tags and attributes corresponding to the data.
- To save or print the scraped data, use the code generated by ChatGPT. A combined sketch of all of these steps is shown below.
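Assuming a simple static page where product titles sit in h2 elements with a product-title class (a placeholder selector), the whole flow looks roughly like this:

# Install the required libraries first (run in a terminal):
#   pip install requests beautifulsoup4

import requests
from bs4 import BeautifulSoup

# Replace with the target product page URL.
url = "https://example.com/product-page"

# Fetch the content of the target web page.
response = requests.get(url, timeout=30)
response.raise_for_status()

# Parse the fetched HTML.
soup = BeautifulSoup(response.text, "html.parser")

# Placeholder tag and class; inspect the product page to find the real ones.
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2", class_="product-title")]

# Print (or save) the scraped data.
for title in titles:
    print(title)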
4. Clean extracted data
Once you’ve scraped data, it’s essential to clean the text to remove irrelevant elements and stopwords such as “the”, “and”, etc. ChatGPT can provide guidance and suggestions on cleaning and formatting collected data.
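As a small illustration of the kind of cleaning step ChatGPT can suggest (the stopword list here is deliberately minimal; real projects usually pull one from a library such as NLTK):

# Minimal stopword removal example.
stopwords = {"the", "and", "a", "an", "of", "to"}
text = "The battery life of the mouse is long and the sensor is accurate"
cleaned = " ".join(word for word in text.split() if word.lower() not in stopwords)
print(cleaned)  # battery life mouse is long sensor is accurate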
Assume you collected a large amount of data and imported it into Excel. However, you realize that the data is disorganized and messy. For instance, the full names are in column B, and you want to separate the first and last names into two different columns. You can request that ChatGPT provide a formula for separating first and last names.
The formula generated by ChatGPT to extract the first name:
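Assuming the full name sits in cell B2 with a single space between the first and last name, a typical answer is:

=LEFT(B2, FIND(" ", B2) - 1)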
The ChatGPT-generated formula to extract the last name:
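Under the same assumption, the last name can be extracted with:

=RIGHT(B2, LEN(B2) - FIND(" ", B2))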
Sponsored
OxyCopilot is a web scraping API feature provided by Oxylabs that allows users to extract relevant information with prompt-based parsing instructions and filter out unwanted data. In the example below, we used OxyCopilot to streamline the API results by retrieving only four key data fields: price, name, rating, and review. Unnecessary details, such as content, meta tags, and status codes, were excluded from the output, making the data easier to handle.
5. Process extracted data
5.1 Conduct sentiment analysis
ChatGPT can perform sentiment analysis on scraped data to generate interpretable insights from unstructured text. Assume you scraped social mentions of your brand from a social media platform to analyze audience growth. Once you have collected and cleaned the data, you can instruct ChatGPT to analyze the text and label it as negative, neutral, or positive (Figure 4).
Figure 4: The process of analyzing and labeling a sample text document
Here’s an example of how you can instruct ChatGPT to perform sentiment analysis:
“Analyze the sentiment of the text: ‘The battery life is also long’.”
ChatGPT’s response to our query labeled the text as positive.
Note that the accuracy of sentiment analysis can vary depending on different factors, such as the complexity of the text and context-dependent errors.
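To run the same labeling programmatically rather than in the chat interface, a minimal sketch with the openai Python library looks like this; the model name is an assumption:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text = "The battery life is also long"

# The model name is an assumption; use any chat-capable model available to you.
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Label the sentiment of the user's text as positive, neutral, or negative. Reply with one word."},
        {"role": "user", "content": text},
    ],
)

print(completion.choices[0].message.content)  # expected: positive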
5.2 Categorize scraped content
ChatGPT can help categorize scraped data into predefined categories that you define yourself. For example, you can paste a piece of scraped content into the prompt together with your category list, and ChatGPT returns the category that the content belongs to.
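To apply the same classification across many scraped records, here is a short sketch with the openai Python library; the categories, sample content, and model name are all assumptions:

from openai import OpenAI

client = OpenAI()

# Hypothetical categories and a sample scraped snippet.
categories = ["Product review", "News", "Tutorial", "Forum discussion"]
content = "A step-by-step guide to setting up your new wireless gaming mouse."

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption; use any chat-capable model
    messages=[
        {"role": "system", "content": "Classify the user's text into exactly one of these categories: " + ", ".join(categories) + ". Reply with the category name only."},
        {"role": "user", "content": content},
    ],
)

print(completion.choices[0].message.content)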