What is web scraping?
Web scraping is a way of extracting data from the internet. In today’s world, companies rely heavily on data to make strategic decisions, and with the growth of the Internet, a massive amount of data is being created every day. Accessing this data may be easy, but downloading it is not: it would be nearly impossible to copy and paste it all manually, which is where web scraping comes in. Web scraping not only enables you to extract web data, but also to automate the extraction.
There are three functions you can make use of for web scraping using Google Sheets: IMPORTFEED, IMPORTHTML, and IMPORTXML. These functions will extract data from a given website based on what you provide as input. We will cover all three functions in this document.
A quick look at what each of these functions can specifically be used for:
- IMPORTFEED – for scraping data from an RSS or Atom feed
- IMPORTHTML – for scraping data from tables and lists
- IMPORTXML – for scraping data from structured data types
Let’s take an example.
I find comparing prices for a given product across different e-commerce websites time-consuming. A one-stop source that compared prices for a product I am willing to purchase would make life so much easier.
Therefore, by leveraging the IMPORTXML function (one of the three functions enabling web scraping in Google Sheets), I was able to collect the price points of a given product across the multiple websites where it was available for purchase.
That’s just one example. There are plenty of other use cases for these functions, such as analyzing competitor websites (for product descriptions, titles, keyword presence, etc.), collecting reviews of a product or service, or importing data from multiple sources.
Let’s dive right into real business use cases, where we will work with actual values and learn how to write our own web scraping functions in Google Sheets to efficiently capture data from the internet. We will go through the functions one after the other.
The Anatomy of the IMPORTFEED Function
So the syntax (the way we write) of the IMPORTFEED function is as follows:
=IMPORTFEED(url,[query],[headers],[num_items])
Let’s dissect this thing and understand what each of the terms means:
- = the equal sign is just how we start any function in Google Sheets.
- url is the link to the RSS or Atom feed, including the protocol (https://). The value of url must either be enclosed in quotation marks or be a reference to a cell containing the appropriate text.
- [query] specifies the data to be fetched from the url. This is an optional parameter and has a default value of “items”, should the user not provide one. Possible values for this parameter are:
- “feed”, which returns a single row containing feed information, including the title, description, and URL,
- “feed <type>”, which returns a particular attribute of the feed, where <type> is title, description, author, or url,
- “items”, which returns a full table containing items from the feed. If no num_items parameter is specified, all items currently published on the feed are returned,
- “items <type>”, which returns a particular attribute of the requested item(s), where <type> is title, summary (the item content, minus hyperlinks and images), url (the URL of the individual item), or created (the post date associated with the item).
- [headers] denotes whether column headers are to be included as an extra row on top of the returned values. This is an optional parameter and has a default value of FALSE.
- [num_items] denotes the number of items to return, starting from the most recent ones. If no num_items parameter is specified, all items currently published on the feed are returned.
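To make the parameters concrete, here is one way the full function could look, using the TechCrunch feed that we will work with later in this guide, requesting only the titles of the five most recent items with a header row:
=IMPORTFEED("https://techcrunch.com/feed/", "items title", TRUE, 5)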
A Real Example of Using IMPORTFEED Function
Take a look at the example below to see how the IMPORTFEED function is used in Google Sheets.
The IMPORTFEED function is handy for capturing data from RSS and Atom feeds. Here, I have used Google News in Atom format (an alternative to the RSS web feed) to get an overview of the news headlines for the day. As you can see above, the function has successfully determined the title of the source URL and its summary. The function also offers the flexibility to capture only the titles if you just want to skim through the major news. All you have to do is replace “items” with “items title” and FALSE with TRUE.
Awesome! Let’s begin with our IMPORTFEED function in Google Sheets.
How to Use IMPORTFEED Function in Google Sheets
- Let’s see how to write your own IMPORTFEED function, step by step. Simply click on any cell to make it the active cell. For this guide, I will be selecting B2, where I want to show my result.
- Next, simply type the equal sign ‘=’ to begin the function, followed by the name of the function, which is importfeed (or IMPORTFEED, whichever works).
- You should find that the auto-suggest box appears with our function of interest. Continue by entering the first opening bracket ‘(‘. If you get a huge box with text in it, simply hit the arrow in the top-right hand corner of the box to minimize it. You should now see it as follows:
Now, the fun begins! Let’s give the required inputs to the function to get the major trending news in the technology world from TechCrunch (the American online newspaper focusing on high tech and startup companies).
- Go to the TechCrunch home website (https://techcrunch.com/). On the home page itself, right-click and select the ‘View Page Source’ option.
- Once you are at the page source, search for the word ‘Feed’ and identify the URL for the website feed. It is almost always found in the top section of the page source.
- Once you have identified the feed URL (https://techcrunch.com/feed/), input the link as the first parameter url of the function. If you do not want the detailed summary of all news links, input the [query] parameter as “items title”, specifying that you only need the titles, and the [headers] as TRUE.
- Once you have added the required url and the optional parameters if any, or you have followed what I did, make sure to close the brackets ‘()’ as shown below.
Finally, just hit your Enter key. And voila! You’ll find that if you follow my steps, you should have the 20 latest article titles published on the TechCrunch website as shown below:
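Putting the steps above together, the finished formula (with only titles requested and a header row enabled) should look something like this:
=IMPORTFEED("https://techcrunch.com/feed/", "items title", TRUE)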
For further examples and explanations on how to use the IMPORTFEED function in Google Sheets, check out our detailed article on the same.
The Anatomy of the IMPORTHTML Function
So the syntax (the way we write) of the IMPORTHTML function is as follows:
=IMPORTHTML(url,query,index)
Let’s dissect this thing and understand what each of the terms means:
- = the equal sign is just how we start any function in Google Sheets.
- url is the link of the page to examine, including the protocol (https://). The value of url must either be enclosed in quotation marks or be a reference to a cell containing the appropriate text.
- query takes the value “list” or “table”, depending on the structure that contains the desired data on the website.
- index, starting at 1, identifies which table or list, as defined in the HTML source, should be returned.
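As an illustration, a formula like the one below would pull the first table from the ‘Demographics of India’ Wikipedia page discussed in the next section (the URL is the standard Wikipedia address for that page, and index 1 is an assumption here; you may need a different index depending on which table on the page you want):
=IMPORTHTML("https://en.wikipedia.org/wiki/Demographics_of_India", "table", 1)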
Let’s now look at a real-world use case of the IMPORTHTML function.
A Real Example of Using IMPORTHTML Function
Take a look at the example below to see how the IMPORTHTML function is used in Google Sheets:
The IMPORTHTML function is handy for importing data from a table or list within an HTML page. Here, I have used the function to capture details from the ‘Demographics of India’ page on Wikipedia. Notice how I have captured the tabular form as it is.
Alrighty! Let’s begin with our IMPORTHTML function in Google Sheets.
How to Use IMPORTHTML Function in Google Sheets
- Let’s see how to write your own IMPORTHTML function, step by step. Simply click on any cell to make it the active cell. For this guide, I will be selecting B2, where I want to show my results.
- Next, simply type the equal sign ‘=’ to begin the function, followed by the name of the function, which is importhtml (or IMPORTHTML, whichever works).
- You should find that the auto-suggest box appears with our function of interest. Continue by entering the first opening bracket ‘(‘. If you get a huge box with text in it, simply hit the arrow in the top-right hand corner of the box to minimize it. You should now see it as follows:
Now, let’s give the required inputs to the function to get the list of the tallest buildings in the world from the Wikipedia page.
- Go to the Wikipedia website that lists the tallest buildings in the world (https://en.wikipedia.org/wiki/List_of_tallest_buildings) and copy the link.
- Once you have the page URL, enter it as the first parameter of the function. Then enter the query value as “table”, since the data is structured in tabular form on the website, and enter the index of the table, as defined in the HTML source, that should be returned.
- Once you have added the required url and the optional parameters if any, or you have followed what I did, make sure to close the brackets ‘()’ as shown below.
- Finally, just hit your Enter key. If you follow my instructions, you should have a list of the world’s tallest buildings neatly populated on your Google Sheet as shown below.
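If you followed along, the finished formula should look something like this (index 1 is an assumption; the main building table may sit at a different index as the page layout changes):
=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_tallest_buildings", "table", 1)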
The Anatomy of the IMPORTXML Function
So the syntax (the way we write) of the IMPORTXML function is as follows:
=IMPORTXML(url,xpath_query)
Let’s dissect this thing and understand what each of the terms means:
- = the equal sign is just how we start any function in Google Sheets.
- url is the link of the page to examine, including the protocol (https://). The value of url must either be enclosed in quotation marks or be a reference to a cell containing the appropriate text.
- xpath_query denotes the XPath query to run on the page’s structured data.
Now, to scrape web data with IMPORTXML, it is imperative to use an xpath_query. You may refer to the examples given below to get you started on how to write an xpath_query:
- // – this means you select all of the elements of the specified type.
- //h3 – this means you select all the h3 elements.
- [@class='…'] – this means you only select elements whose attribute meets the specified criteria.
- //h3[@class='storytitle'] – this means you only select elements that look like: <h3 class="storytitle">Title</h3>.
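Putting that last query into the function, a formula like the one below would return the text of every matching h3 headline on a page; note that the URL and the storytitle class here are placeholders for illustration, not a real site:
=IMPORTXML("https://example.com/blog", "//h3[@class='storytitle']")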
Let’s now look at a real-world use case of the IMPORTXML function.
A Real Example of Using IMPORTXML Function
Take a look at the example below to see how the IMPORTXML function is used in Google Sheets:
The IMPORTXML function is handy for importing data from any of the various structured data types, including XML, HTML, CSV, TSV, and RSS and Atom XML feeds. It is known to be faster and more convenient than some of the other tools, especially if you only need to extract data from a handful of URLs.
Awesome! Let’s begin with our IMPORTXML function in Google Sheets.
How to Use IMPORTXML Function in Google Sheets
- Let’s see how to write your own IMPORTXML function, step by step. Simply click on any cell to make it the active cell. For this guide, I will be selecting B2, where I want to show my results.
- Next, simply type the equal sign ‘=’ to begin the function, followed by the name of the function, which is importxml (or IMPORTXML, whichever works).
- You should find that the auto-suggest box appears with our function of interest. Continue by entering the first opening bracket ‘(‘. If you get a huge box with text in it, simply hit the arrow in the top-right hand corner of the box to minimize it. You should now see it as follows:
Let’s give the required inputs to the function to get the product description and the number of customer reviews gathered by the website thus far.
- Go to the particular listing page of a product whose details you wish to scrape, and copy the link. As you can see, I have copied the link to the landing page of a product on the Amazon website (cell A2). It’s the URL for a mobile phone.
- Once you have the page URL, enter it as the first parameter of the function and specify the details you want to extract from the page as the xpath_query as shown below:
Remember what we read in the prerequisites: // indicates that we require all elements of the given type, and [@id='productTitle'] indicates that we require the element whose id is productTitle, i.e. the product title.
- Once you have added the required url and the optional parameters if any, or you have followed what I did, make sure to close the brackets ‘()’ as shown below.
- Finally, just hit your Enter key. If you follow my instructions, you should see the product title in the adjacent cell as shown below.
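With the product URL stored in cell A2, the formula would look something like the sketch below; the productTitle id is the one discussed above, though the exact element can differ between product pages:
=IMPORTXML(A2, "//*[@id='productTitle']")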
Let’s look at another example. How about we get the total number of customer reviews for this product that are available on the website? In the same way as above, we just change the xpath_query component to extract the number of customer reviews:
On the product page, notice how I used the INDEX() function to specify by row and column offset. If you want to know more about how to use INDEX function in Google Sheets, check out our article on the same. If you used the function as I have shown here, you would have the total number of customer reviews populated in the cell C2.
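A rough sketch of that combination, wrapping IMPORTXML in INDEX to pick a single cell (row 1, column 1) from the returned range; the review-count id used here is an assumption and may not match the actual page source:
=INDEX(IMPORTXML(A2, "//*[@id='acrCustomerReviewText']"), 1, 1)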
That’s pretty much it. You can use these IMPORT functions in Google Sheets to extract data from websites based on the different parameters you provide to the function. For further examples and explanations on using the IMPORTXML function in Google Sheets, check out our detailed article on the same. Furthermore, try experimenting with the numerous other Google Sheets formulas on data scraped from the internet using the IMPORT functions.
Frequently Asked Questions (FAQ)
1. Will the scraped data automatically refresh as it updates on the original page it is pulling from or is there additional work required to make regular updates?
By default, these functions recalculate as follows:
- IMPORTRANGE: every 30 minutes
- IMPORTHTML, IMPORTFEED, IMPORTDATA, IMPORTXML: every hour
2. Is it possible to scrape data once a day on a page that requires a password to log in?
You cannot use these IMPORT formulas to scrape data behind a login. You will need to use Apps Script to connect to that service’s API (assuming they have one).