What is a Google News XML sitemap?
A Google News XML sitemap is an XML file that provides information about a website's articles to Google News, which allows Google News to crawl and index them more efficiently. The sitemap contains important metadata about each article, such as the headline, publication date, and publication name, as well as the URL where the article can be found. By submitting a Google News XML sitemap to Google, website owners can make sure their articles are more easily discovered and displayed in Google News search results. The biggest difference between a regular XML sitemap and a Google News XML sitemap is that the Google News XML sitemap contains only the articles published in the last 48 hours.
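For reference, a single entry in a Google News sitemap looks roughly like this. It is a simplified illustration with placeholder values, following the tags defined in Google's news sitemap documentation:

<url>
  <loc>https://www.example.com/articles/some-article.html</loc>
  <news:news>
    <news:publication>
      <news:name>Example News</news:name>
      <news:language>en</news:language>
    </news:publication>
    <news:publication_date>2023-03-01T12:00:00+00:00</news:publication_date>
    <news:title>Example article headline</news:title>
  </news:news>
</url>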
Monitoring or scraping competitors' Google News XML sitemaps can provide valuable insights into their content strategy and help identify opportunities for improving your own website's visibility in Google News.
Overall, keeping an eye on competitors' Google News XML sitemaps gives you a clear picture of their strategy on Google News and Google Discover.
If you are interested in monitoring the Google “Top Stories” carousel, please check my previous blog post:
How to monitor Google “Top Stories” carousel using Python and SERPapi
import os
import ssl
import time

import advertools as adv
import pandas as pd
while True:
    # each call downloads one news sitemap into a pandas DataFrame
    # (placeholder URLs - in practice each would point to a different competitor's sitemap)
    nj_1 = adv.sitemap_to_df('https://www.example.com/news-sitemap.xml', max_workers=8)
    nj_2 = adv.sitemap_to_df('https://www.example.com/news-sitemap.xml', max_workers=8)
    nj_3 = adv.sitemap_to_df('https://www.example.com/news-sitemap.xml', max_workers=8)
The code above runs inside an infinite loop and keeps calling the sitemap_to_df function from Advertools. This function downloads an XML sitemap from a URL and returns it as a pandas DataFrame, which is stored in a variable.
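If you want to see what the function returns before leaving the loop running, you can call it once and inspect the resulting DataFrame. This is just a sketch with a placeholder URL; the exact set of columns depends on the advertools version and on which tags the sitemap contains:

import advertools as adv

news_df = adv.sitemap_to_df('https://www.example.com/news-sitemap.xml')
print(news_df.shape)           # number of URLs and columns
print(news_df.columns)         # 'loc', 'lastmod' plus the news-specific tags
print(news_df['loc'].head())   # the first few article URLs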
Usually, to find a competitor's Google News XML sitemap, you can check for a sitemap reference in their robots.txt file.
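Advertools can help with that step too: it can download a robots.txt file into a DataFrame, which you can then filter for Sitemap directives. A minimal sketch with a placeholder URL, assuming the competitor's robots.txt is publicly accessible:

import advertools as adv

robots_df = adv.robotstxt_to_df('https://www.example.com/robots.txt')
# keep only the Sitemap directives and print the sitemap URLs they point to
sitemaps = robots_df[robots_df['directive'].str.contains('sitemap', case=False, na=False)]
print(sitemaps['content'].tolist())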
NOTE: However, there's a problem with running this loop as written: the same URLs are requested and processed repeatedly without any break, which causes unnecessary network traffic and can slow down your computer.
To avoid this issue, add a delay using time.sleep() at the bottom of the script.
# combining all sitemap DataFrames in one variable
all_sitemaps = [nj_1, nj_2, nj_3]
result = pd.concat(
    all_sitemaps,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    copy=True,
)
The next step is to create a list called "all_sitemaps" that holds the DataFrames obtained from the different XML sitemaps (the name avoids shadowing Python's built-in all function). Putting them in one list makes it easier to concatenate them and work with the data together.
result.drop_duplicates(subset=['loc'], keep='first', inplace=True)
A Google News XML sitemap contains only the articles published in the last 48 hours, so if you scrape it more frequently than that, you are likely to collect duplicate entries. That's why the line above removes all duplicates in the "loc" column (the article URLs) and keeps only the first occurrence.
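As a small illustration of what that line does, here is a toy DataFrame (hypothetical values, not real sitemap data):

import pandas as pd

df = pd.DataFrame({
    'loc': ['https://www.example.com/a', 'https://www.example.com/a', 'https://www.example.com/b'],
    'lastmod': ['2023-03-01', '2023-03-02', '2023-03-01'],
})
df.drop_duplicates(subset=['loc'], keep='first', inplace=True)
# df now has two rows: the first occurrence of /a and the single entry for /b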
result.to_csv('sitemap_data.csv', mode='a', index=True, header=not os.path.isfile('sitemap_data.csv'))
We created separate variables for each sitemap that was crawled and then combined them all into a single DataFrame called "result". To save this data, we export it to a CSV file. Since we want to keep adding to the file every time the loop runs, we use mode="a", which appends data without altering the existing content.
Furthermore, to avoid having the column names repeated as a header every time we write to the CSV file, we use the "header" parameter to write the header row only when the file does not exist yet (i.e. on the first run) and skip it on every subsequent append.
time.sleep(43200)
The while True loop combined with time.sleep(43200) pauses the execution of the program for 43,200 seconds at the end of each iteration, which is equivalent to 12 hours. This is useful for introducing a delay between iterations of the loop or for scheduling a process to run again at a set time in the future.
You can adjust the frequency of the execution:
| Hours | Seconds |
|---|---|
| 6 hours | 21600 |
| 12 hours | 43200 |
| 24 hours | 86400 |
| 48 hours | 172800 |
In summary, this script uses the Advertools library to crawl and extract data from multiple XML sitemaps, store the data in Pandas DataFrames, and then write the data to a CSV file. The script consists of an infinite loop that calls the sitemap_to_df function to obtain data from sitemaps, combines the data into a single DataFrame, and appends it to a CSV file.
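To make the overall flow easier to follow, here is a condensed sketch of how the pieces above fit together. The sitemap URLs are placeholders, and the list-based loop is just a compact way of expressing the same logic; in practice each URL would point to a different competitor's news sitemap:

import os
import time

import advertools as adv
import pandas as pd

sitemap_urls = [
    'https://www.example.com/news-sitemap.xml',
    'https://www.example.org/news-sitemap.xml',
]

while True:
    # download each news sitemap into its own DataFrame
    frames = [adv.sitemap_to_df(url, max_workers=8) for url in sitemap_urls]
    # combine them, drop duplicate article URLs, and append to the CSV
    result = pd.concat(frames, axis=0, join="outer")
    result.drop_duplicates(subset=['loc'], keep='first', inplace=True)
    result.to_csv('sitemap_data.csv', mode='a', index=True,
                  header=not os.path.isfile('sitemap_data.csv'))
    # wait 12 hours before the next crawl
    time.sleep(43200)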
Overall, this script provides a powerful way to extract and store large amounts of data from multiple sitemaps in an organised manner. However, it is important to use it responsibly and not overload the target server or your computer's resources. You can then easily analyse and visualise the collected data in Looker Studio.
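If you want a quick local sanity check before connecting the CSV to Looker Studio, you can read it back with pandas (a sketch, assuming the sitemap_data.csv file created above):

import pandas as pd

history = pd.read_csv('sitemap_data.csv')
print(history.shape)             # total rows collected so far
print(history['loc'].nunique())  # number of unique article URLs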
You can find the full script on my GitHub account: SCRIPT
NOTE: The time.sleep(43200) call only works when the script runs in a local IDE (e.g. PyCharm); it doesn't work in Google Colab, because Colab sessions are disconnected before such a long sleep finishes.