The Ultimate Guide to Web Scraping with Google Apps Script and Google Sheets

Table of contents
  1. What is Web Scraping?
  2. Why Use Web Scraping for Businesses?
  3. Introducing Google Apps Script and Google Sheets
  4. Step-by-Step Guide to Web Scraping with Google Apps Script and Google Sheets
    1. The Idea
    2. Setting up the Google Sheet
    3. The Code (Google Apps Script)
    4. Explanation of the Code
    5. Running the Code
    6. Important Notes:
  5. Scheduling the Script to Run Daily
    1. Create a Time-Based Trigger
  6. Tips and Limitations
    1. UrlFetchApp Quotas
  7. Conclusion
Attila

automation expert

Unlock the power of web data without needing advanced coding skills! This tutorial demonstrates how to effortlessly scrape and extract information from websites directly into Google Sheets using Google Apps Script, opening up a world of possibilities for businesses. Automate market research, competitor analysis, and data collection, saving countless hours of manual work and gaining valuable insights—all without requiring extensive HTML knowledge.

What is Web Scraping?

Web scraping is akin to automatically copying and pasting information from websites. Imagine having a digital assistant that visits websites and automatically collects specified information. Instead of manually selecting and copying text or images, a program performs this task for you, gathering data such as prices, product details, or contact information.

This extracted data can then be used for various purposes, including comparing prices across different online retailers, collecting sales leads, creating notifications for content changes, and more. Essentially, web scraping automates tedious manual work, saving significant time and effort by efficiently extracting and organizing large amounts of web data.

Why Use Web Scraping for Businesses?

In today’s competitive business environment, productivity and optimized resource allocation are crucial for success. Web scraping is a valuable tool for achieving these goals, significantly improving productivity by automating time-consuming data collection and analysis.

By automating the extraction of data from websites, businesses save considerable time and effort compared to manual methods. This increased efficiency allows teams to focus on strategic initiatives rather than repetitive tasks.

Furthermore, the data gathered through web scraping is not merely an end in itself; it serves as crucial input for further automation processes. For instance, scraped data can power dynamic, competition-based pricing strategies in e-commerce or automatically identify valuable investment opportunities by aggregating real estate listings from multiple portals.

This seamless integration with other automation workflows amplifies the impact of web scraping, making it an indispensable asset for any forward-thinking business.

Introducing Google Apps Script and Google Sheets

Google Apps Script is a cloud-based scripting platform built on JavaScript that empowers you to automate tasks and build powerful integrations within Google Workspace (formerly G Suite). Its unique advantage lies in its ability to seamlessly connect various Google services, including Gmail, Docs, Drive, Calendar, and, importantly, Sheets, making it exceptionally well suited to building comprehensive automations. This interconnectedness allows you to create workflows that span multiple applications, streamlining processes and boosting productivity.

For web scraping specifically, Apps Script’s UrlFetchApp service allows you to fetch content directly from web pages, enabling automated data extraction and integration with the rest of your Google Workspace applications. By sending HTTP requests to websites, you can retrieve HTML, XML, or JSON data. This extracted data can then be parsed and structured using JavaScript within the script.

The true power of this approach is realized when combined with Google Sheets: scraped data can be directly written into spreadsheets, providing a convenient and familiar environment for data storage, organization, analysis, visualization, and further automation.
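
To get a feel for how little code this involves, here is a minimal sketch of the idea. It assumes the script is bound to a spreadsheet; the URL and target cell are placeholders for illustration only, not part of the project built below:

/**
 * Minimal sketch: fetch a page with UrlFetchApp and write something
 * about it into the active sheet.
 */
function quickFetchDemo() {
  const response = UrlFetchApp.fetch('https://example.com'); // any publicly reachable page
  const html = response.getContentText();                    // the raw HTML as a string

  // Write the length of the HTML into cell A1 of the active sheet.
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  sheet.getRange('A1').setValue(html.length);
}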

Step-by-Step Guide to Web Scraping with Google Apps Script and Google Sheets

The Idea

We’ll create a system that does the following:

  1. Reads a list of website addresses (URLs), along with a CSS selector for each, from a Google Sheet.
  2. Visits each website and grabs its HTML content.
  3. Uses the selector to find the predefined information (the product price, in our case) within that HTML.
  4. Writes the extracted prices back to the Google Sheet.

Setting up the Google Sheet

First, you need a Google Sheet. It should have at least three columns, matching the configuration used in the code below:

  • Column A (Column 1): The website addresses (URLs) of the product pages you want to track.
  • Column B (Column 2): The CSS selector that identifies the price element on each page.
  • Column C (Column 3): Left empty; the script will write the extracted prices here.
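
For illustration, a populated sheet might look something like this (the URLs and selectors are made-up placeholders; the selectors you actually need depend on each site’s HTML):

Row | Column A (URL)                      | Column B (CSS selector) | Column C (Price)
2   | https://example.com/products/widget | span.price-tag          | (filled in by the script)
3   | https://example.com/products/gadget | #product-price          | (filled in by the script)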

The Code (Google Apps Script)

Now, let’s look at the code that does the magic. Open the script editor from your Google Sheet via “Extensions” > “Apps Script” (in older versions of Sheets this was “Tools” > “Script editor”). Copy and paste the code provided below into the script editor.

/**
 * Fetches product prices from URLs in a Google Sheet and updates the sheet with the prices.
 */

// Configuration (Update these with your actual values)
const SHEET_ID = `YOUR_SHEET_ID`; // Replace with your actual Sheet ID
const SHEET_NAME = `YOUR_SHEET_NAME`; // Replace with your actual Sheet Name
const URLS_COLUMN_ID = 1; // Column containing URLs (A = 1)
const CSS_SELECTOR_COLUMN = 2; // Column containing CSS selectors (B = 2)
const PRICE_COLUMN = 3; // Column to write prices (C = 3)
const REQUEST_DELAY = 1000; // Delay between requests in milliseconds (1 second)


/**
 * Main function to run the script.
 */
function main() {
  const urlData = getUrlsFromSheet();
  if (urlData.length === 0) {
    Logger.log("No URLs to process.");
    return;
  }

  const prices = [];
  for (let i = 0; i < urlData.length; i++) {
    const url = urlData[i].url;
    const selector = urlData[i].selector;
    const html = fetchHtml(url);
    const price = extractPrice(html, selector);
    prices.push(price);
    Utilities.sleep(REQUEST_DELAY); // Delay between requests
    Logger.log(`Processed URL ${i+1}/${urlData.length}: ${url}, Price: ${price}`);
  }

  updateSheet(prices);
}


/**
 * Opens the spreadsheet and gets the URLs and selectors.
 * @return {Array<{url: string, selector: string}>} An array of objects containing URLs and selectors.
 */
function getUrlsFromSheet() {
  const ss = SpreadsheetApp.openById(SHEET_ID);
  const sheet = ss.getSheetByName(SHEET_NAME);
  const lastRow = sheet.getLastRow();
  if (lastRow < 2) return []; // Handle empty sheet

  const urls = sheet.getRange(2, URLS_COLUMN_ID, lastRow - 1)
    .getValues()
    .flat(); // Get values as a 1D array.

  const selectors = sheet.getRange(2, CSS_SELECTOR_COLUMN, lastRow - 1)
    .getValues()
    .flat(); // Get values as a 1D array.

  const urlData = [];
  for (let i = 0; i < urls.length; i++) {
    urlData.push({
      url: urls[i],
      selector: selectors[i]
    });
  }
  return urlData;
}


/**
 * Fetches the HTML content of a URL.
 * @param {string} url The URL to fetch.
 * @return {string|null} The HTML content, or null if there's an error.
 */
function fetchHtml(url) {
  try {
    const options = {
      method: 'get',
      muteHttpExceptions: true,
      headers: { // Added some common headers
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Cache-Control': 'no-cache'
      }
    };
    const response = UrlFetchApp.fetch(url, options);
    if (response.getResponseCode() === 200) {
      return response.getContentText();
    } else {
      Logger.log(`Error fetching ${url}: ${response.getResponseCode()} - ${response.getContentText()}`);
      return null;
    }
  } catch (error) {
    Logger.log(`Error fetching ${url}: ${error}`);
    return null;
  }
}


/**
 * Extracts the price from the HTML using Cheerio.
 * @param {string} html The HTML content.
 * @param {string} selector The CSS selector to use.
 * @return {number|null} The extracted price as a number, or null if not found.
 */
function extractPrice(html, selector) { // Added selector parameter
  if (!html || !selector) return null; // Added check for selector

  try {
    const $ = Cheerio.load(html);
    let priceText = $(selector).text(); // Use provided selector
    if (!priceText) {
      Logger.log("Price element not found using selector: " + selector);
      return null;
    }
    priceText = priceText.replace(/,/g,'.');
    let price = parseFloat(priceText.replace(/[^0-9.]/g, ''));
    if (isNaN(price)) {
      Logger.log("Could not parse price to a number: " + priceText);
      return null;
    }
    return price;
  } catch (error) {
    Logger.log("Error extracting price: " + error);
    return null;
  }
}


/**
 * Updates the Google Sheet with the extracted prices.
 * @param {Array<number|null>} prices An array of prices to write to the sheet.
 */
function updateSheet(prices) {
  if (prices.length === 0) return;

  const ss = SpreadsheetApp.openById(SHEET_ID);
  const sheet = ss.getSheetByName(SHEET_NAME);
  const lastRow = sheet.getLastRow();
  const urlsCount = sheet.getRange(2, URLS_COLUMN_ID, lastRow - 1).getValues().flat().length;

  if (prices.length !== urlsCount) {
    Logger.log(`Number of prices (${prices.length}) does not match the number of URLs (${urlsCount}).`);
    return;
  }
  
  sheet.getRange(2, PRICE_COLUMN, prices.length, 1).setValues(prices.map(price => [price]));
}

Before you start using this code, please don’t forget to include the Cheerio library. This third-party library helps us process HTML more easily and must be added to the project manually. To do this:

  1. In the Apps Script editor, click the “+” icon next to “Libraries” in the left-hand sidebar (in the legacy editor this was “Resources” > “Libraries…”).
  2. In the dialog that appears, enter the following Script ID into the “Script ID” field: 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
  3. Click the “Look up” button.
  4. If the Script ID is valid, the library’s information (name and versions) will appear. Choose the specific version you want to use from the dropdown menu; a specific version is generally recommended over the latest development version for stability.
  5. Click the “Add” button.

The library should now appear under “Libraries” in the editor’s sidebar, with the default identifier Cheerio (which is what the code above uses).
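
If you want to verify that the library is wired up correctly before running the full script, a tiny test like the following can help. The HTML string is an inline placeholder, not a real product page:

function testCheerio() {
  const html = '<div class="product"><span class="price">$19.99</span></div>';
  const $ = Cheerio.load(html);             // parse the HTML string
  const priceText = $('span.price').text(); // select the element with a CSS selector
  Logger.log(priceText);                    // logs "$19.99"
}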

Explanation of the Code

Our Google Apps Script for tracking product prices is built from several key functions working together. Let’s break down what each one does:

1. main()

  • Purpose: This is the conductor of our orchestra! The main function is the primary entry point that orchestrates the entire process of fetching and updating prices. It calls other functions in the correct sequence to get the job done.
  • Structure:
    1. It starts by calling getUrlsFromSheet() to fetch the list of URLs and their corresponding CSS selectors from your spreadsheet.
    2. It then loops through each URL/selector pair.
    3. Inside the loop, for each item:
      • It calls fetchHtml() to download the webpage’s content.
      • It passes the downloaded HTML and the selector to extractPrice() to find the price.
      • It adds the found price (or null if not found) to a list.
      • It pauses briefly (Utilities.sleep()) to avoid overloading the websites.
    4. Finally, it calls updateSheet() to write all the collected prices back to the spreadsheet.
  • Analogy: Think of main() as a Project Manager. It doesn’t do the specialized tasks itself, but it knows the overall goal and directs the specialists (getUrlsFromSheet, fetchHtml, extractPrice, updateSheet) in the right order, ensuring the project (price tracking) is completed step-by-step.

2. getUrlsFromSheet()

  • Purpose: This function acts as the data retriever. Its job is to open your specified Google Sheet and read the list of product page URLs and the CSS selectors needed to find the prices on those pages.
  • Structure:
    1. Opens the Google Spreadsheet using its ID (SHEET_ID).
    2. Selects the specific sheet within that spreadsheet using its name (SHEET_NAME).
    3. Determines how many rows have data.
    4. Reads the values from the URL column (URLS_COLUMN_ID).
    5. Reads the values from the CSS Selector column (CSS_SELECTOR_COLUMN).
    6. Combines these into a list where each item contains both the URL and its corresponding selector.
    7. Returns this list of URL/selector pairs.
  • Analogy: This function is like a Librarian. You give it the location (Sheet ID and Name) and the type of information you need (URLs and Selectors from specific columns), and it goes to the shelves (the Sheet), pulls out the requested lists, and hands them over neatly organized.

3. fetchHtml()

  • Purpose: This function’s role is to visit a specific web address (URL) and download the raw HTML source code of that page. This code is what the browser uses to display the webpage, and it contains the price information we need.
  • Structure:
    1. Takes a single url as input.
    2. Uses Google Apps Script’s UrlFetchApp.fetch() service to make a request to that URL.
    3. It includes specific options (like User-Agent headers) to make the request look more like it’s coming from a standard web browser, which can help prevent being blocked by some websites.
    4. It checks if the website responded successfully (HTTP status code 200).
    5. If successful, it returns the downloaded HTML code as text.
    6. If there’s an error (like the page not found or access denied), it logs the error and returns null.
  • Analogy: Imagine fetchHtml() as a Web Scout. You give it an address (URL), and it goes to that location on the internet, makes a copy of the page’s underlying blueprint (the HTML), and brings that blueprint back to you. It even tries to look like a regular visitor to avoid suspicion.

4. extractPrice()

  • Purpose: This is the specialized extractor. Given the raw HTML code of a page and a specific CSS selector (which acts like a map coordinate), this function finds the price text, cleans it up, and converts it into a numerical value.
  • Structure:
    1. Takes the html content and the selector as input.
    2. Uses the Cheerio library (which needs to be added to your Apps Script project) to parse the HTML, making it easy to search.
    3. Uses the provided selector to pinpoint the exact HTML element containing the price.
    4. Extracts the text content from that element.
    5. Cleans the text: removes currency symbols, thousands separators (like commas), and any other non-numeric characters except the decimal point. It also standardizes decimal separators to periods.
    6. Converts the cleaned text into a floating-point number.
    7. If any step fails (e.g., selector not found, text can’t be converted to a number), it logs the issue and returns null. Otherwise, it returns the numerical price.
  • Analogy: Think of extractPrice() as a Data Detective. It receives a large document (HTML) and a specific clue (the CSS selector). It uses the clue to find the exact piece of data (the price text), carefully cleans off any obscuring marks (currency symbols, commas), and reports the final numerical value found.

5. updateSheet()

  • Purpose: This function is the final step, responsible for taking the list of extracted prices and writing them back into the designated column in your Google Sheet.
  • Structure:
    1. Takes the prices array (collected by the main function) as input.
    2. Opens the target Google Spreadsheet and Sheet (similar to getUrlsFromSheet).
    3. Performs a safety check: It verifies that the number of prices collected matches the number of URLs originally read from the sheet to avoid writing data in the wrong rows if some fetches failed.
    4. Selects the correct range in the output column (PRICE_COLUMN), starting from the second row and spanning the necessary number of rows.
    5. Formats the simple list of prices into a 2D array format that Google Sheets requires for writing ([[price1], [price2], ...]).
    6. Writes the formatted prices into the selected range in the sheet using setValues().
  • Analogy: This function is the Accountant or Scribe. It takes the final report (the list of prices) generated by the team and carefully enters each value into the correct cell in the ledger (the Google Sheet), ensuring accuracy and proper placement.

Running the Code

After pasting the code into the script editor:

  1. Save the script.
  2. Run the main() function. You’ll need to authorize the script to access your Google Sheet.

Important Notes:

  • The extractPrice(html, selector) function is the most website-specific part. The way prices are displayed in HTML varies from website to website, so the CSS selector you put in column B must match the structure of each site you’re tracking, and occasionally you may also need to adjust the cleaning logic inside this function. You’ll need to inspect the website’s HTML source (for example with your browser’s developer tools) to find a suitable selector; see the example after this list.
  • The code includes a 1-second delay between requests. This is important to be polite to the websites you’re scraping and avoid getting blocked.
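
For example, if a product page contained a (made-up) fragment like <span class="price-tag">€24,99</span>, the selector to enter in column B would be span.price-tag, and extractPrice would behave like this:

function selectorExample() {
  // Hypothetical page fragment; on a real site you would find it via the browser's "Inspect" tool.
  const html = '<div class="product-info"><span class="price-tag">€24,99</span></div>';
  const price = extractPrice(html, 'span.price-tag');
  Logger.log(price); // logs 24.99 (the comma is converted to a decimal point)
}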

This automated system will save you time and effort by automatically tracking product prices for you. By understanding the basics of how it works, you can even customize it further to suit your specific needs.

Scheduling the Script to Run Daily

The last step is to schedule your script to run automatically every day at the desired time.

Create a Time-Based Trigger

  • In the Apps Script editor, open “Triggers” (the clock icon in the left-hand sidebar; in the legacy editor, “Edit” -> “Current project’s triggers”).
  • Click “Add Trigger”.
  • Select the main() function.
  • Choose “Time-driven” as the event source.
  • Select “Day timer” and specify the desired time (e.g., 00:00 for midnight).
  • Save the trigger.
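
Alternatively, the same daily trigger can be created from code, which keeps the whole configuration inside the script. This sketch assumes you want a single trigger running main() once a day around midnight (running it repeatedly would add duplicate triggers, so guard against that if needed):

function createDailyTrigger() {
  // Time-driven trigger that runs main() once a day, around hour 0 in the script's time zone.
  ScriptApp.newTrigger('main')
    .timeBased()
    .everyDays(1)
    .atHour(0)
    .create();
}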

Tips and Limitations

While Google Apps Script’s UrlFetchApp is effective for many websites, it may not be suitable for heavily protected sites that employ anti-scraping measures such as CAPTCHAs and IP blocking. Attempting to bypass these protections with Apps Script alone is often unreliable.

Scraping such heavily protected portals typically requires dedicated third-party solutions that use techniques like rotating proxies and CAPTCHA solvers. Examples of such services include Apify, Bright Data, ParseHub, and Scraper API.

Importantly, these services can often be integrated into Google Workspace automation processes. This means you can still build a complete, automated workflow within your Workspace infrastructure, even if a third-party scraping solution is necessary. Data extracted by these services can be seamlessly fed into Google Sheets, Docs, or other Workspace apps using their APIs or webhooks. It is highly recommended to respect website terms of service and `robots.txt` and to scrape responsibly and ethically.
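
As a rough sketch of what that hand-off can look like, an Apps Script project deployed as a web app can accept a webhook from such a service and append the delivered rows to your sheet. The payload format below (a JSON object with a rows array of [url, price] pairs) is an assumption for illustration; the real format depends on the service you use:

function doPost(e) {
  // Parse the JSON payload posted by the (hypothetical) scraping service's webhook.
  const payload = JSON.parse(e.postData.contents);
  const sheet = SpreadsheetApp.openById(SHEET_ID).getSheetByName(SHEET_NAME);

  // Append each delivered row, e.g. [url, price], to the sheet.
  (payload.rows || []).forEach(row => sheet.appendRow(row));

  return ContentService.createTextOutput('ok');
}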

UrlFetchApp Quotas

Google Workspace imposes quotas and limitations on `UrlFetchApp` to prevent abuse, maintain service stability, and ensure fair usage across all users. These limitations are crucial to understand when designing web scraping scripts:

  • Execution Time: Each script execution is limited, typically to about 6 minutes per run for most account types, and triggered executions also count against a daily cap on total runtime. If your script runs for a long time because of numerous UrlFetchApp calls or complex processing, it may hit this limit and stop mid-way. In that case you can usually process the data in smaller batches across several runs, for example by recording how far you got and resuming from there.
  • Number of Calls: There are restrictions on how many UrlFetchApp requests you can make within specific timeframes. These limits are designed to prevent overwhelming target websites with requests. As of this writing, free Google accounts typically have a daily limit of around 20,000 `UrlFetchApp` calls, while Google Workspace accounts generally have a much higher limit, around 100,000 calls per day.
  • Sequential Approach vs. `fetchAll()`: Although Apps Script operates in a single-threaded environment and doesn’t offer true simultaneous fetching, using UrlFetchApp.fetchAll() allows for optimized retrieval of multiple URLs. This method significantly reduces overhead compared to sequential fetching by handling network communication more efficiently, even though the requests are not executed in parallel.
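
As a rough illustration of the fetchAll() approach mentioned above, the loop in main() could be replaced by something like the sketch below, which reuses the same kind of request options as fetchHtml(); whether it is actually faster for your particular set of URLs is worth measuring yourself:

function fetchAllHtml(urls) {
  // Build one request object per URL; fetchAll() takes an array of these.
  const requests = urls.map(url => ({
    url: url,
    method: 'get',
    muteHttpExceptions: true,
    headers: { 'User-Agent': 'Mozilla/5.0' } // trimmed for brevity
  }));

  const responses = UrlFetchApp.fetchAll(requests);

  // Map each response back to its HTML, or null on a non-200 status.
  return responses.map(r => r.getResponseCode() === 200 ? r.getContentText() : null);
}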

Conclusion

In summary, web scraping with Google Apps Script and Google Sheets offers a powerful, accessible, and importantly, free way for businesses to automate data collection and gain valuable insights. Google Apps Script is available for use with both free Google accounts and Google Workspace subscriptions, making it a cost-effective solution for businesses of all sizes.

Furthermore, if you’re using Google Workspace, relying on Apps Script for your web scraping needs offers unparalleled flexibility and integration within your existing business workflows. Unlike rigid third-party solutions that often require complex integrations and data transfer processes, Apps Script seamlessly connects with other Workspace apps.

We encourage you to explore the possibilities of web scraping with Google Apps Script and Sheets for your own business needs and unlock the potential of automated data collection for improved efficiency and strategic advantage. The number of scenarios where this powerful combination can be applied is virtually endless; from automating market research and competitor analysis to streamlining lead generation and content aggregation, only your imagination limits the potential use cases.
