In our previous post, we showed you how to build a powerful, free web scraper directly within Google Sheets using Apps Script. It was a fantastic solution for automating data collection from simple websites. However, as many of you discovered, the modern web is complex. The moment you try to scrape a dynamic e-commerce site or a portal protected by anti-bot measures, the basic solution hits a wall.

Today, we’re breaking through that wall. We’re going to upgrade our original script into a professional-grade data extraction tool that can handle the challenges of the modern web, all while keeping the convenience of managing everything from your Google Sheet.

Table of contents
  1. Why Do Simple Scrapers Fail? The Challenge of the Modern Web
  2. The Solution: Bright Data's Web Unlocker
  3. Bright Data vs. Apify: A Professional Perspective
    1. Outstanding features of Bright Data
  4. The Upgraded Script: How It Works
  5. How to Set Up and Use the Upgraded Scraper
  6. The source code
  7. New Business Opportunities Unlocked
  8. Summary: Why This Upgrade is a Game-Changer

Why Do Simple Scrapers Fail? The Challenge of the Modern Web

Websites have evolved. They aren’t just static pages anymore. When a simple scraper like our original one fails, it’s usually for one of these reasons:

  • JavaScript-Rendered Content: Many sites load a basic page first and then use JavaScript to fetch and display the actual content (like prices or product details). Google’s UrlFetchApp often only sees the initial, empty page, missing the data you need.
  • Anti-Scraping Protections: To prevent abuse and protect their data, websites employ sophisticated defenses. The most common are:
    • IP Blocking: If a server detects too many requests from a single IP address (like Google’s servers), it will block it.
    • CAPTCHAs: Those “I’m not a robot” tests are designed to stop automated scripts in their tracks.
    • Browser Fingerprinting: Websites can check for signs that a request is coming from an automated script rather than a real user’s browser.

Trying to fight this battle alone is a constant, frustrating game of cat and mouse. The solution is to use a specialized service that has already solved these problems.

The Solution: Bright Data’s Web Unlocker

This is where a service like Bright Data comes in. Bright Data is a leading web data platform that provides the infrastructure needed to access public web data reliably. Instead of making a direct request from Google to the target website, we send our request to Bright Data. They then use their vast network of proxies and their intelligent “Web Unlocker” technology to:

  • Route your request through a real residential or mobile IP address, making it look like a regular user.
  • Automatically solve CAPTCHAs.
  • Manage browser fingerprints and cookies.
  • Retry failed requests until they succeed.

Essentially, Bright Data handles all the complex blocking issues, ensuring you get the clean HTML you need, every single time.

Bright Data vs. Apify: A Professional Perspective

When looking for scraping solutions, you’ll often see Apify mentioned. It’s a powerful platform with a marketplace of “Actors” (pre-built scrapers), many developed by the community. This is great, but it can feel less centralized than Bright Data, which I consider the more professional, enterprise-focused option.

Both platforms offer tailor-made tools that can scrape specific websites and return structured JSON data. However, for our universal approach, we are using Bright Data’s “Web Unlocker”: a general-purpose tool that reliably returns the full HTML of any URL, no matter the protection, giving us maximum flexibility. While Apify has powerful scrapers for specific sites, it doesn’t offer a single, universal tool quite like this.

In my opinion, while both services are great, I find Bright Data to be more robust and reliable for business-critical tasks. Their pricing model is also more straightforward. Apify uses a subscription model based on “platform credits,” which can make costs hard to predict. Bright Data’s Pay-As-You-Go plan costs around $1.50 per 1,000 successful requests. This transparency is perfect for our project, and from an integration and cost perspective, I believe Bright Data offers a better value proposition.

Outstanding features of Bright Data

  • Geolocation Targeting: You can make your requests appear as if they are coming from a specific country, state, or even city. This is essential for scraping localized content, such as regional pricing, local search results, or store availability.
  • Scrape as Markdown: The API can return the scraped content directly in a clean Markdown format. This is incredibly powerful for feeding data directly into AI models or generating documentation without needing to parse complex HTML first.
  • Return a Screenshot: You can request a visual screenshot of the target page. This is invaluable for visual verification, archiving how a page looked at a specific time, or debugging issues where the layout affects the data.
  • Custom Cookies and Headers: The API allows you to send your own custom headers and cookies with a request. This is an advanced feature for mimicking a logged-in user session or a specific type of browser to access data that requires authentication or particular browser settings.
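To give a feel for how such options are used, here is a hypothetical sketch of a Web Unlocker request with a couple of them enabled. The endpoint, payload shape, and especially the parameter names (country, data_format) are assumptions based on my reading of Bright Data’s API documentation; verify the exact field names supported by your zone before relying on them.

// Hypothetical sketch: a Web Unlocker request with extra options enabled.
// The parameter names below (country, data_format) are assumptions -- check
// Bright Data's API docs for the exact fields supported by your zone.
function fetchWithOptions(targetUrl) {
  const payload = {
    zone: 'YOUR_ZONE_ID',       // placeholder
    url: targetUrl,
    format: 'raw',              // return the page body as-is
    country: 'us',              // geolocation targeting (assumed field name)
    data_format: 'markdown'     // ask for Markdown instead of HTML (assumed field name)
  };
  const response = UrlFetchApp.fetch('https://api.brightdata.com/request', {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer YOUR_API_KEY' }, // placeholder
    payload: JSON.stringify(payload),
    muteHttpExceptions: true
  });
  return response.getContentText();
}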

The Upgraded Script: How It Works

The evolution of our script goes far beyond just swapping out one API call. We’ve transformed it into a much more general-purpose and professional tool.

  • A Complete Workflow Enhancement: While replacing UrlFetchApp with Bright Data was the core change to handle protected sites, we’ve enhanced the entire workflow.
  • Scraping Initiated Directly from Sheets: The script is now a container-bound script, meaning it’s attached directly to your Google Sheet. We’ve added a custom menu item that lets you trigger the entire process with a single click, making the user experience much smoother (a sketch of this menu wiring appears below).
  • Extract Any Text, Not Just Prices: We’ve removed all the price-specific logic. The script is now completely generic, capable of extracting any text-based data you point it to, whether it’s a product title, a stock status, a user review, or a news headline.
  • Capture Multiple Results from a Single Page: The most significant functional upgrade is the ability to extract multiple elements from a single page. If your CSS selector matches several items (like all product names on a category page), the script will now pull all of them and neatly place each one in a separate column in your sheet.

These changes elevate the script from a simple price tracker into a versatile, robust data extraction engine managed entirely within your Google Sheet.
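For reference, the custom menu is wired up with a simple onOpen() trigger. Below is a minimal sketch of that wiring; the menu labels match the ones described in this post, but the entry-point function name is a placeholder, so check the full source on GitHub for the actual names.

// Minimal sketch of the custom menu wiring. 'runScraper' is a placeholder
// entry-point name; the menu labels match the ones used in this post.
function onOpen() {
  SpreadsheetApp.getUi()
    .createMenu('Scraper')
    .addItem('Run Scraper', 'runScraper')
    .addToUi();
}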

From a technical point of view, it works in the following way:

  • Takes the target URL you want to scrape.
  • Packages it into a request to the Bright Data API.
  • Includes your secret API key for authentication.
  • Sends the request and returns the clean HTML that Bright Data retrieves.

The rest of our script remains more or less the same!
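As a rough illustration, the new fetch step looks something like the sketch below. The endpoint URL and payload shape follow Bright Data’s Web Unlocker API, but the constant and function names are placeholders; treat this as a sketch rather than the exact code from the final script (see the GitHub source for that).

// Sketch of the Bright Data request wrapper. Replace the placeholders with
// your own credentials; the rest of the script consumes the returned HTML
// exactly as it did with a plain UrlFetchApp fetch before.
const BRIGHT_DATA_API_KEY = 'YOUR_API_KEY'; // placeholder
const BRIGHT_DATA_ZONE = 'YOUR_ZONE_ID';    // placeholder

function fetchHtmlViaBrightData(url) {
  const payload = {
    zone: BRIGHT_DATA_ZONE,
    url: url,
    format: 'raw' // ask Bright Data for the raw, unlocked HTML
  };
  const options = {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer ' + BRIGHT_DATA_API_KEY },
    payload: JSON.stringify(payload),
    muteHttpExceptions: true
  };
  const response = UrlFetchApp.fetch('https://api.brightdata.com/request', options);
  if (response.getResponseCode() !== 200) {
    Logger.log('Bright Data error for ' + url + ': ' + response.getContentText());
    return null;
  }
  return response.getContentText();
}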

How to Set Up and Use the Upgraded Scraper

Getting started involves a straightforward, one-time setup to connect your Google Sheet to the necessary script and the Bright Data API. Follow these steps to get your powerful new scraper up and running.

  1. Step: Create the Google Sheet and Open Apps Script
    First, go to your Google Drive and create a new Google Sheet. From the menu in your new sheet, navigate to Extensions > Apps Script. This will open the script editor in a new browser tab, where you will place the scraper’s code.
  2. Step: Install the Scraper Code and Required Library
    1. Paste the Apps Script Code: Scroll down to find the full script. Copy the code, paste it into the Apps Script editor you just opened, and click the ‘Save’ icon.
    2. Add the Cheerio Library: The script relies on a library called Cheerio to efficiently read and parse the HTML from a webpage, making it easy to extract specific data using a CSS selector.
      • In the script editor’s left-hand menu, click the plus icon (+) next to ‘Libraries’.
      • You will be prompted for a Script ID. This is the ID of the Cheerio for Google Apps Script library; you can find it on the library’s GitHub page, and it is also listed in the library setup steps later in this post: 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
      • Paste the Script ID into the Script ID field in your script editor and click the ‘Look up’ button.
      • Select the latest version available from the dropdown menu and click the ‘Add’ button. Cheerio is now successfully linked to your project.
  3. Step: Configure the Bright Data API Connection
    Our script sends requests to Bright Data, which then retrieves the HTML from the target website on your behalf, bypassing any anti-bot systems.
    1. Get Your Bright Data API Key:
      • Go to the Bright Data website and create an account.
      • Once logged in, navigate to ‘Proxies & Scraping Infrastructure’ on the left-side navigation. Click the ‘Add’ button and select ‘Web Unlocker’.
      • You will need to set up a ‘zone,’ which is a configuration for your scraping tasks. The default settings are fine for most websites. However, if you plan to scrape highly protected sites (known as ‘premium domains’), you must enable the premium domains setting for your zone.
      • After you click ‘Add’ to create the zone, your API key and Zone ID will be generated. Copy both of these.
    2. Add Credentials to the Script:
      • Return to your Apps Script editor.
      • Paste the API Key and the Zone ID into their respective placeholder variables at the top of the script.
      • Click the ‘Save project’ icon.
  4. Step: Run the Scraper and Grant Permissions
    You are now ready to begin scraping.
    1. Refresh your Google Sheet. After reloading, you will see a new custom menu item called ‘Scraper’.
    2. In your sheet, paste a URL into column A and its corresponding CSS selector into column B.
    3. Click the ‘Scraper’ menu and select ‘Run Scraper’.
    4. Authorize the Script: The first time you run it, Google will require your permission for the script to work.
      • An ‘Authorization required’ window will pop up. Click ‘Review permissions’.
      • Choose your Google account. You may see a screen that says, “Google hasn’t verified this app.” This is perfectly normal and expected, as the script needs permission to connect to an external service and modify your spreadsheet.
      • Review and grant all the necessary permissions to continue.

Now that the setup is complete, you can run the scraper anytime from the custom menu. The script will execute and populate the results directly into your sheet in a matter of seconds.

The source code

Source code on GitHub.

New Business Opportunities Unlocked

With this supercharged scraper, you can now build powerful business automation tools that were previously impossible:

  • Reliable Price Comparison: Track prices across major e-commerce platforms like Amazon or Walmart without getting blocked.
  • Real Estate Deal Alerts: Scrape multiple real estate portals for new listings that match your exact criteria and get notified instantly.
  • Lead Generation: Extract business information from protected online directories to build targeted lead lists.
  • News & Brand Monitoring: Track news sites, blogs, and forums for mentions of your brand, competitors, or industry keywords to stay ahead of trends and manage your reputation.
  • Competitor Website Monitoring: Keep an eye on your competitors’ websites for any changes—from subtle text updates to major redesigns—and get alerted automatically.
  • Market & Competitor Analysis: Monitor your competitors’ product stock levels, new product launches, or customer reviews on a daily basis.

Summary: Why This Upgrade is a Game-Changer

By switching from UrlFetchApp to Bright Data, we’ve transformed our simple tool into a robust data-gathering engine. The key advantages are:

  • Reliability: No more failed requests or missing data. You get what you ask for.
  • Power: Easily scrapes dynamic, JavaScript-heavy websites.
  • Stealth: Bypasses common anti-scraping protections effortlessly.
  • Simplicity: All the complexity is handled by Bright Data, while you continue to manage everything from a simple Google Sheet.

You now have a professional-grade web scraping solution at your fingertips, unlocking a new world of data-driven automation possibilities for your business.

Attila

Business process automation expert

Unlock the power of web data without needing advanced coding skills! This tutorial demonstrates how to effortlessly scrape and extract information from websites directly into Google Sheets using Google Apps Script, opening up a world of possibilities for businesses. Automate market research, competitor analysis, and data collection, saving countless hours of manual work and gaining valuable insights—all without requiring extensive HTML knowledge.

Table of contents
  1. What is Web Scraping?
  2. Why Use Web Scraping for Businesses?
  3. Introducing Google Apps Script and Google Sheets
  4. Step-by-Step Guide to Web Scraping with Google Apps Script and Google Sheets
    1. The Idea
    2. Setting up the Google Sheet
    3. Finding the Right Element: How to Get the CSS Selector
    4. The Code (Google Apps Script)
    5. Explanation of the Code
    6. Running the Code
    7. Important Notes:
  5. Scheduling the Script to Run Daily
    1. Create a Time-Based Trigger
  6. Tips and Limitations
    1. UrlFetchApp Quotas
  7. Conclusion

What is Web Scraping?

Web scraping is akin to automatically copying and pasting information from websites. Imagine having a digital assistant that visits websites and automatically collects specified information. Instead of manually selecting and copying text or images, a program performs this task for you, gathering data such as prices, product details, or contact information.

This extracted data can then be used for various purposes, including comparing prices across different online retailers, collecting sales leads, creating notifications for content changes, and more. Essentially, web scraping automates tedious manual work, saving significant time and effort by efficiently extracting and organizing large amounts of web data.

Why Use Web Scraping for Businesses?

In today’s competitive business environment, productivity and optimized resource allocation are crucial for success. Web scraping is a valuable tool for achieving these goals, significantly improving productivity by automating time-consuming data collection and analysis.

By automating the extraction of data from websites, businesses save considerable time and effort compared to manual methods. This increased efficiency allows teams to focus on strategic initiatives rather than repetitive tasks.

Furthermore, the data gathered through web scraping is not merely an end in itself; it serves as crucial input for further automation processes. For instance, scraped data can power dynamic, competition-based pricing strategies in e-commerce or automatically identify valuable investment opportunities by aggregating real estate listings from multiple portals.

This seamless integration with other automation workflows amplifies the impact of web scraping, making it an indispensable asset for any forward-thinking business.

Introducing Google Apps Script and Google Sheets

Google Apps Script is a cloud-based scripting language built on JavaScript that empowers you to automate tasks and build powerful integrations within Google Workspace (formerly G Suite). Its unique advantage lies in its ability to seamlessly connect various Google services, including Gmail, Docs, Drive, Calendar, and, importantly, Sheets, making it exceptionally well-suited for building comprehensive automations. This interconnectedness allows you to create workflows that span across multiple applications, streamlining processes and boosting productivity.

For web scraping specifically, Apps Script’s UrlFetchApp service allows you to fetch content directly from web pages, enabling automated data extraction and integration with the rest of your Google Workspace applications. By sending HTTP requests to websites, you can retrieve HTML, XML, or JSON data. This extracted data can then be parsed and structured using JavaScript within the script.

The true power of this approach is realized when combined with Google Sheets: scraped data can be directly written into spreadsheets, providing a convenient and familiar environment for data storage, organization, analysis, visualization, and further automation.
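To make this concrete, here is a tiny, self-contained sketch (not part of the full price-tracking script later in this post) that fetches a page with UrlFetchApp and writes one value into the active sheet. The URL is just an example.

// A minimal illustration of the fetch-then-write idea. The URL is only an
// example; the full script later in this post is more complete.
function quickDemo() {
  const html = UrlFetchApp.fetch('https://example.com').getContentText();
  const title = (html.match(/<title>(.*?)<\/title>/i) || [])[1] || 'n/a';
  SpreadsheetApp.getActiveSpreadsheet().getActiveSheet()
    .appendRow([new Date(), title]);
}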

Step-by-Step Guide to Web Scraping with Google Apps Script and Google Sheets

The Idea

We’ll create a system that does the following:

  1. Reads a list of website addresses (URLs) from a Google Sheet.
  2. Visits each website and grabs its HTML content.
  3. Finds the predefined information (product price in our case) within that HTML.
  4. Writes the extracted prices back to the Google Sheet.

Setting up the Google Sheet

First, you need a Google Sheet. It should have at least three columns:

  • Column A (Column 1): This column will contain the website addresses (URLs) of the products you want to track.
  • Column B (Column 2): This column will contain the CSS selector used to identify and extract the specific element (like the price) from the web page.
  • Column C (Column 3): This column will store the prices extracted from the websites.

Here you can find a template for the spreadsheet.

Finding the Right Element: How to Get the CSS Selector

The next crucial step for your data extraction tool is telling it exactly which piece of information to grab from the webpage. This is where the CSS selector comes in.

Think of a CSS selector as an address for a specific element on a web page – like the price, a product title, or a review count. Here’s how you can easily find this address using the Chrome browser:

  1. Open the Webpage: Go to the specific product page (or any page with the element you want to extract).
  2. Open Developer Tools: Right-click anywhere on the page and select “Inspect”. This opens the Developer Tools panel, usually at the bottom or side of your browser window. Make sure you’re on the “Elements” tab.
  3. Use the Element Picker: In the top-left corner of the Developer Tools panel, you’ll see a small icon that looks like an arrow inside a box (or sometimes just a cursor icon). Click on this icon. This activates the “select element” mode.
  4. Hover and Click: Now, move your cursor over the elements on the webpage. As you hover, the element under your cursor will be highlighted. Click on the specific element you want to extract (e.g., the price display).
  5. Locate in the HTML: Clicking the element picker will jump to and highlight that element’s code within the “Elements” tab of the Developer Tools.
  6. Copy the Selector: Right-click on the highlighted line of code in the Developer Tools panel. In the context menu that appears, hover over Copy and then click Copy selector.

That’s it! The CSS selector for that specific element is now copied to your clipboard. This is the value you would then paste into Column B of your Google Sheet for that product URL, telling your extraction script exactly what to look for on that page.
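For instance, a copied selector might look something like div.product-price > span (an invented example; yours will depend on the page). Inside the script, Cheerio uses that selector to pull out the element’s text, roughly like this minimal sketch:

// Hypothetical example of using a copied selector with Cheerio.
// 'div.product-price > span' is an invented selector for illustration only.
function extractWithSelector(html) {
  const $ = Cheerio.load(html);
  return $('div.product-price > span').text().trim();
}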

The Code (Google Apps Script)

Now, let’s look at the code that does the magic. Open the script editor in your Google Sheet by going to “Extensions” > “Apps Script”. Copy and paste the code provided below into the script editor.

/**
 * Fetches product prices from URLs in a Google Sheet and updates the sheet with the prices.
 */

// Configuration (Update these with your actual values)
const SHEET_ID = `YOUR_SHEET_ID`; // Replace with your actual Sheet ID
const SHEET_NAME = `YOUR_SHEET_NAME`; // Replace with your actual Sheet Name
const URLS_COLUMN_ID = 1; // Column containing URLs (A = 1)
const CSS_SELECTOR_COLUMN = 2; // Column containing CSS selectors (B = 2)
const PRICE_COLUMN = 3; // Column to write prices (C = 3)
const REQUEST_DELAY = 1000; // Delay between requests in milliseconds (1 second)


/**
 * Main function to run the script.
 */
function main() {
  const urlData = getUrlsFromSheet();
  if (urlData.length === 0) {
    Logger.log("No URLs to process.");
    return;
  }

  const prices = [];
  for (let i = 0; i < urlData.length; i++) {
    const url = urlData[i].url;
    const selector = urlData[i].selector;
    const html = fetchHtml(url);
    const price = extractPrice(html, selector);
    prices.push(price);
    Utilities.sleep(REQUEST_DELAY); // Delay between requests
    Logger.log(`Processed URL ${i+1}/${urlData.length}: ${url}, Price: ${price}`);
  }

  updateSheet(prices);
}


/**
 * Opens the spreadsheet and gets the URLs and selectors.
 * @return {Array<{url: string, selector: string}>} An array of objects containing URLs and selectors.
 */
function getUrlsFromSheet() {
  const ss = SpreadsheetApp.openById(SHEET_ID);
  const sheet = ss.getSheetByName(SHEET_NAME);
  const lastRow = sheet.getLastRow();
  if (lastRow < 2) return []; // Handle empty sheet

  const urls = sheet.getRange(2, URLS_COLUMN_ID, lastRow - 1)
    .getValues()
    .flat(); // Get values as a 1D array.

  const selectors = sheet.getRange(2, CSS_SELECTOR_COLUMN, lastRow - 1)
    .getValues()
    .flat(); // Get values as a 1D array.

  const urlData = [];
  for (let i = 0; i < urls.length; i++) {
    urlData.push({
      url: urls[i],
      selector: selectors[i]
    });
  }
  return urlData;
}


/**
 * Fetches the HTML content of a URL.
 * @param {string} url The URL to fetch.
 * @return {string|null} The HTML content, or null if there's an error.
 */
function fetchHtml(url) {
  try {
    const options = {
      method: 'get',
      muteHttpExceptions: true,
      headers: { // Added some common headers
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Cache-Control': 'no-cache'
      }
    };
    const response = UrlFetchApp.fetch(url, options);
    if (response.getResponseCode() === 200) {
      return response.getContentText();
    } else {
      Logger.log(`Error fetching ${url}: ${response.getResponseCode()} - ${response.getContentText()}`);
      return null;
    }
  } catch (error) {
    Logger.log(`Error fetching ${url}: ${error}`);
    return null;
  }
}


/**
 * Extracts the price from the HTML using Cheerio.
 * @param {string} html The HTML content.
 * @param {string} selector The CSS selector to use.
 * @return {number|null} The extracted price as a number, or null if not found.
 */
function extractPrice(html, selector) { // Added selector parameter
  if (!html || !selector) return null; // Added check for selector

  try {
    const $ = Cheerio.load(html);
    let priceText = $(selector).text(); // Use provided selector
    if (!priceText) {
      Logger.log("Price element not found using selector: " + selector);
      return null;
    }
    priceText = priceText.replace(/,/g,'.');
    let price = parseFloat(priceText.replace(/[^0-9.]/g, ''));
    if (isNaN(price)) {
      Logger.log("Could not parse price to a number: " + priceText);
      return null;
    }
    return price;
  } catch (error) {
    Logger.log("Error extracting price: " + error);
    return null;
  }
}


/**
 * Updates the Google Sheet with the extracted prices.
 * @param {Array<number|null>} prices An array of prices to write to the sheet.
 */
function updateSheet(prices) {
  if (prices.length === 0) return;

  const ss = SpreadsheetApp.openById(SHEET_ID);
  const sheet = ss.getSheetByName(SHEET_NAME);
  const lastRow = sheet.getLastRow();
  const urlsCount = sheet.getRange(2, URLS_COLUMN_ID, lastRow - 1).getValues().flat().length;

  if (prices.length !== urlsCount) {
    Logger.log(`Number of prices (${prices.length}) does not match the number of URLs (${urlsCount}).`);
    return;
  }
  
  sheet.getRange(2, PRICE_COLUMN, prices.length, 1).setValues(prices.map(price => [price]));
}

Before you start using this code, please don’t forget to include the Cheerio library. This third-party library helps us process HTML more easily and must be added to our project manually. To do this:

  1. In the Apps Script editor’s left-hand menu, click the plus icon (+) next to “Libraries”.
  2. A dialog box will appear. Enter the following Script ID into the input field: 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
  3. After entering the Script ID, click the “Look up” button.
  4. If the Script ID is valid, the library’s information (name and versions) will appear. Choose the specific version of the library you want to use from the dropdown menu. It’s generally recommended to use a specific version rather than “latest” for stability.
  5. Once you’ve selected the version, click the “Add” button.

The library should now appear in the “Libraries” dialog box.

Explanation of the Code

Our Google Apps Script for tracking product prices is built from several key functions working together. Let’s break down what each one does:

1. main()

  • Purpose: This is the conductor of our orchestra! The main function is the primary entry point that orchestrates the entire process of fetching and updating prices. It calls other functions in the correct sequence to get the job done.
  • Structure:
    1. It starts by calling getUrlsFromSheet() to fetch the list of URLs and their corresponding CSS selectors from your spreadsheet.
    2. It then loops through each URL/selector pair.
    3. Inside the loop, for each item:
      • It calls fetchHtml() to download the webpage’s content.
      • It passes the downloaded HTML and the selector to extractPrice() to find the price.
      • It adds the found price (or null if not found) to a list.
      • It pauses briefly (Utilities.sleep()) to avoid overloading the websites.
    4. Finally, it calls updateSheet() to write all the collected prices back to the spreadsheet.
  • Analogy: Think of main() as a Project Manager. It doesn’t do the specialized tasks itself, but it knows the overall goal and directs the specialists (getUrlsFromSheet, fetchHtml, extractPrice, updateSheet) in the right order, ensuring the project (price tracking) is completed step-by-step.

2. getUrlsFromSheet()

  • Purpose: This function acts as the data retriever. Its job is to open your specified Google Sheet and read the list of product page URLs and the CSS selectors needed to find the prices on those pages.
  • Structure:
    1. Opens the Google Spreadsheet using its ID (SHEET_ID).
    2. Selects the specific sheet within that spreadsheet using its name (SHEET_NAME).
    3. Determines how many rows have data.
    4. Reads the values from the URL column (URLS_COLUMN_ID).
    5. Reads the values from the CSS Selector column (CSS_SELECTOR_COLUMN).
    6. Combines these into a list where each item contains both the URL and its corresponding selector.
    7. Returns this list of URL/selector pairs.
  • Analogy: This function is like a Librarian. You give it the location (Sheet ID and Name) and the type of information you need (URLs and Selectors from specific columns), and it goes to the shelves (the Sheet), pulls out the requested lists, and hands them over neatly organized.

3. fetchHtml()

  • Purpose: This function’s role is to visit a specific web address (URL) and download the raw HTML source code of that page. This code is what the browser uses to display the webpage, and it contains the price information we need.
  • Structure:
    1. Takes a single url as input.
    2. Uses Google Apps Script’s UrlFetchApp.fetch() service to make a request to that URL.
    3. It includes specific options (like User-Agent headers) to make the request look more like it’s coming from a standard web browser, which can help prevent being blocked by some websites.
    4. It checks if the website responded successfully (HTTP status code 200).
    5. If successful, it returns the downloaded HTML code as text.
    6. If there’s an error (like the page not found or access denied), it logs the error and returns null.
  • Analogy: Imagine fetchHtml() as a Web Scout. You give it an address (URL), and it goes to that location on the internet, makes a copy of the page’s underlying blueprint (the HTML), and brings that blueprint back to you. It even tries to look like a regular visitor to avoid suspicion.

4. extractPrice()

  • Purpose: This is the specialized extractor. Given the raw HTML code of a page and a specific CSS selector (which acts like a map coordinate), this function finds the price text, cleans it up, and converts it into a numerical value.
  • Structure:
    1. Takes the html content and the selector as input.
    2. Uses the Cheerio library (which needs to be added to your Apps Script project) to parse the HTML, making it easy to search.
    3. Uses the provided selector to pinpoint the exact HTML element containing the price.
    4. Extracts the text content from that element.
    5. Cleans the text: removes currency symbols, thousands separators (like commas), and any other non-numeric characters except the decimal point. It also standardizes decimal separators to periods.
    6. Converts the cleaned text into a floating-point number.
    7. If any step fails (e.g., selector not found, text can’t be converted to a number), it logs the issue and returns null. Otherwise, it returns the numerical price.
  • Analogy: Think of extractPrice() as a Data Detective. It receives a large document (HTML) and a specific clue (the CSS selector). It uses the clue to find the exact piece of data (the price text), carefully cleans off any obscuring marks (currency symbols, commas), and reports the final numerical value found.

5. updateSheet()

  • Purpose: This function is the final step, responsible for taking the list of extracted prices and writing them back into the designated column in your Google Sheet.
  • Structure:
    1. Takes the prices array (collected by the main function) as input.
    2. Opens the target Google Spreadsheet and Sheet (similar to getUrlsFromSheet).
    3. Performs a safety check: It verifies that the number of prices collected matches the number of URLs originally read from the sheet to avoid writing data in the wrong rows if some fetches failed.
    4. Selects the correct range in the output column (PRICE_COLUMN), starting from the second row and spanning the necessary number of rows.
    5. Formats the simple list of prices into a 2D array format that Google Sheets requires for writing ([[price1], [price2], ...]).
    6. Writes the formatted prices into the selected range in the sheet using setValues().
  • Analogy: This function is the Accountant or Scribe. It takes the final report (the list of prices) generated by the team and carefully enters each value into the correct cell in the ledger (the Google Sheet), ensuring accuracy and proper placement.

Running the Code

After pasting the code into the script editor:

  1. Save the script.
  2. Run the main() function. You’ll need to authorize the script to access your Google Sheet.

Important Notes:

  • The extractPrice(html, selector) function is the most website-specific part. The way prices are displayed in HTML varies from website to website. Therefore, you might need to adjust the code within this function to match the structure of the websites you’re tracking. You’ll likely need to inspect the website’s HTML source to find a suitable CSS selector.
  • The code includes a 1-second delay between requests. This is important to be polite to the websites you’re scraping and avoid getting blocked.

This automated system will save you time and effort by automatically tracking product prices for you. By understanding the basics of how it works, you can even customize it further to suit your specific needs.

Scheduling the Script to Run Daily

The last step is to schedule your script to run automatically every day at the desired time.

Create a Time-Based Trigger

  • In the Apps Script editor, click the “Triggers” (clock) icon in the left-hand menu.
  • Click “Add Trigger”.
  • Select the main() function.
  • Choose “Time-driven” as the event source.
  • Select “Day timer” and specify the desired time (e.g., 00:00 for midnight).
  • Save the trigger.
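If you prefer doing this from code, the same trigger can be created programmatically. Here is a minimal sketch; run it once manually from the editor and it will schedule main() for you.

// Creates a time-driven trigger that runs main() once a day, around midnight
// in the script's time zone. Run this function once from the editor.
function createDailyTrigger() {
  ScriptApp.newTrigger('main')
    .timeBased()
    .everyDays(1)
    .atHour(0)
    .create();
}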

Tips and Limitations

While Google Apps Script’s UrlFetchApp is effective for many websites, it may not be suitable for heavily protected sites that employ anti-scraping measures such as CAPTCHAs and IP blocking. Attempting to bypass these protections with Apps Script alone is often unreliable.

Scraping such heavily protected portals requires dedicated third-party solutions that use techniques like rotating proxies and CAPTCHA solvers. Examples of such services include Apify, Bright Data, ParseHub, and Scraper API.

Importantly, these services can often be integrated into Google Workspace automation processes. This means you can still build a complete, automated workflow within your Workspace infrastructure, even if a third-party scraping solution is necessary. Data extracted by these services can be seamlessly fed into Google Sheets, Docs, or other Workspace apps using their APIs or webhooks. It is highly recommended to respect website terms of service and `robots.txt` and to scrape responsibly and ethically.
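As an illustration, a simple Apps Script Web App can act as such a webhook receiver. This is only a hypothetical sketch: the payload field names (url, price), the sheet name, and the sheet ID are assumptions, and the script must be deployed as a Web App for doPost to be reachable.

// Hypothetical webhook receiver: a third-party scraping service POSTs JSON
// here, and we append it to a sheet. Field names in the payload are assumed.
function doPost(e) {
  const data = JSON.parse(e.postData.contents);
  SpreadsheetApp.openById('YOUR_SHEET_ID')   // placeholder sheet ID
    .getSheetByName('Results')               // placeholder sheet name
    .appendRow([new Date(), data.url, data.price]);
  return ContentService.createTextOutput('ok');
}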

UrlFetchApp Quotas

Google Workspace imposes quotas and limitations on `UrlFetchApp` to prevent abuse, maintain service stability, and ensure fair usage across all users. These limitations are crucial to understand when designing web scraping scripts:

  • Execution Time: Apps Script limits how long a single execution can run. As of this writing, this limit is typically around 6 minutes per execution for most accounts, and there is also a cap on the total daily runtime of triggered scripts. This means if your script runs for an extended period due to numerous UrlFetchApp calls or complex processing, it might reach this limit and stop executing. However, you can often restart the script after it has stopped, allowing you to process larger datasets or perform more complex tasks in stages.
  • Number of Calls: There are restrictions on how many UrlFetchApp requests you can make within specific timeframes. These limits are designed to prevent overwhelming target websites with requests. As of this writing, free Google accounts typically have a daily limit of around 20,000 `UrlFetchApp` calls, while Google Workspace accounts generally have a much higher limit, around 100,000 calls per day.
  • Sequential Approach vs. `fetchAll()`: Although Apps Script operates in a single-threaded environment and doesn’t offer true simultaneous fetching, using UrlFetchApp.fetchAll() allows for optimized retrieval of multiple URLs. This method significantly reduces overhead compared to sequential fetching by handling network communication more efficiently, even though the requests are not executed in parallel.
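Here is a minimal sketch of fetchAll() usage; the URLs are placeholders.

// Minimal sketch: issuing several requests with UrlFetchApp.fetchAll()
// instead of calling UrlFetchApp.fetch() in a loop. URLs are placeholders.
function fetchAllExample() {
  const urls = ['https://example.com/a', 'https://example.com/b'];
  const requests = urls.map(url => ({
    url: url,
    method: 'get',
    muteHttpExceptions: true
  }));
  const responses = UrlFetchApp.fetchAll(requests);
  responses.forEach((response, i) => {
    Logger.log(urls[i] + ' -> ' + response.getResponseCode());
  });
}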

Conclusion

In summary, web scraping with Google Apps Script and Google Sheets offers a powerful, accessible, and importantly, free way for businesses to automate data collection and gain valuable insights. Google Apps Script is available for use with both free Google accounts and Google Workspace subscriptions, making it a cost-effective solution for businesses of all sizes.

Furthermore, if you’re using Google Workspace, relying on Apps Script for your web scraping needs offers unparalleled flexibility and integration within your existing business workflows. Unlike rigid third-party solutions that often require complex integrations and data transfer processes, Apps Script seamlessly connects with other Workspace apps.

We encourage you to explore the possibilities of web scraping with Google Apps Script and Sheets for your own business needs and unlock the potential of automated data collection for improved efficiency and strategic advantage. The number of scenarios where this powerful combination can be applied is virtually endless; from automating market research and competitor analysis to streamlining lead generation and content aggregation, only your imagination limits the potential use cases.

Attila

Business process automation expert
