If you work for a news publishing website, you’ve probably wondered how much content your competitors are publishing—especially during big events—or how much you need to publish daily or weekly to stay competitive in Google News.
A Google News XML sitemap is a file in XML format that provides information about articles on a website to Google News. This allows Google News to crawl and index the articles more efficiently. The sitemap contains important metadata about each article, such as the publication name, the publication language, the article title, and the publication date.
By submitting a Google News XML sitemap in Google Search Console, you can help Googlebot discover the articles more easily. The biggest difference between a regular XML sitemap and a Google News XML sitemap is that the news sitemap contains only the articles published in the last 48 hours.
Monitoring or scraping competitors’ Google News XML sitemaps can provide valuable insights into their content strategy and help identify opportunities for improving your own publishing strategy. Here are some reasons why monitoring competitors’ sitemaps is important:
3. Analyse event coverage - with simple title segmentation, you can see how many articles your competitors are publishing about a single topic or event.
4. Publishing timing analysis - you can easily analyse when they start publishing content ahead of an event (see the sketch after this list).
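To make points 3 and 4 concrete, here is a minimal sketch of both analyses on top of scraped sitemap data. It assumes you have already collected the entries into a CSV such as the sitemap_data.csv file built by the script later in this post, and that it contains the news_title and news_publication_date columns that Advertools typically extracts from a news sitemap; the keyword is just a hypothetical example, so adjust the names to whatever your data actually contains.

```python
import pandas as pd

# Minimal sketch: assumes a CSV of news-sitemap entries with 'news_title'
# and 'news_publication_date' columns (column names may differ in your data)
df = pd.read_csv('sitemap_data.csv')
df['news_publication_date'] = pd.to_datetime(
    df['news_publication_date'], utc=True, errors='coerce'
)

# 3. Event coverage: how many articles mention a given topic in the title
keyword = 'champions league'  # hypothetical topic/event keyword
covered = df[df['news_title'].str.contains(keyword, case=False, na=False)]
print(f"Articles mentioning '{keyword}': {len(covered)}")

# 4. Publishing timing: number of articles published per hour of the day
per_hour = df['news_publication_date'].dt.hour.value_counts().sort_index()
print(per_hour)
```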
Based on my experience, keeping an eye on competitors’ Google News XML sitemaps can provide valuable insights into their strategy on Google News and Google Discover.
If you are interested in monitoring the Google “Top Stories” carousel, please check my previous blog post:
How to monitor Google “Top Stories” carousel using Python and SERPapi
import advertools as adv
import ssl
import time
import os
import pandas as pd
The script uses a function called sitemap_to_df from Advertools. This function fetches an XML sitemap from a URL and returns it as a pandas DataFrame.
The easiest way to find a competitor’s Google News XML sitemap is to check for a Sitemap reference in their robots.txt file.
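If you want to do that check programmatically, Advertools can also read robots.txt files into a DataFrame. The snippet below is a small sketch under the assumption that the output has the usual directive and content columns; example.com is a placeholder domain.

```python
import advertools as adv

# Sketch: list the sitemap URLs declared in a competitor's robots.txt
# ('https://www.example.com/robots.txt' is a placeholder URL)
robots_df = adv.robotstxt_to_df('https://www.example.com/robots.txt')
sitemap_refs = robots_df.loc[
    robots_df['directive'].str.contains('sitemap', case=False, na=False), 'content'
]
print(sitemap_refs.tolist())
```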
while True:
    # placeholder URLs: replace with the news sitemaps of the competitors you want to monitor
    nj_1 = adv.sitemap_to_df('https://www.example1.com/news-sitemap.xml', max_workers=8)
    nj_2 = adv.sitemap_to_df('https://www.example2.com/news-sitemap.xml', max_workers=8)
    nj_3 = adv.sitemap_to_df('https://www.example3.com/news-sitemap.xml', max_workers=8)
NOTE: there is a catch with running this loop as it is: the same URLs are requested and processed over and over without any break, which causes unnecessary network traffic and can slow down your computer.
To avoid this issue, add a delay using time.sleep() at the bottom of the script.
The next step is to create a list called “all” that gathers the variables holding the data from the different XML sitemaps. I chose to do this to make it easier to work with all the data in one place.
    # combining in one variable
    all = [nj_1, nj_2, nj_3]
    result = pd.concat(
        all,
        axis=0,
        join="outer",
        ignore_index=False,
        keys=None,
        levels=None,
        names=None,
        verify_integrity=False,
        copy=True,
    )
As I mentioned, the Google News XML sitemap contains only the articles published in the last 48 hours. If you scrape it more frequently than that, you are likely to end up with duplicate entries. That’s why the line below removes duplicates in the “loc” column (the URLs) and keeps only the first occurrence.
    result.drop_duplicates(subset=['loc'], keep='first', inplace=True)
I created separate variables for each sitemap that was crawled and then combined them all into a single DataFrame called “result”. To save this data, we export it to a CSV file. Since we want to keep adding to the file every time the loop runs, we use mode “a”, which appends data without overwriting the existing content.
Furthermore, to avoid repeating the column names as a header every time we append to the CSV file, we use the “header” parameter to write the header row only on the first run, when the file does not exist yet, and to skip it afterwards.
    result.to_csv('sitemap_data.csv', mode='a', index=True, header=not os.path.exists('sitemap_data.csv'))
The combination of “while True” and “time.sleep(43200)” pauses the execution of the program for 43,200 seconds, which is equivalent to 12 hours, at the end of every iteration. This is useful for introducing a delay between iterations of a loop so the scraping runs on a regular schedule.
    time.sleep(43200)
You can adjust the frequency of the execution:
| Hours | Seconds |
| --- | --- |
| 6 | 21600 |
| 12 | 43200 |
| 24 | 86400 |
| 48 | 172800 |
Overall, this script provides a powerful way to extract and store large amounts of data from multiple sitemaps in an organised manner.
Alternatively, you can run the same process on Google Cloud, store the data in BigQuery, and visualise it in Google Looker Studio.
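I won’t go through the cloud setup here, but as a rough sketch, once you have the combined “result” DataFrame you could append it straight to a BigQuery table with the official client library instead of writing a CSV, and then connect Looker Studio to that table. The project, dataset and table names below are placeholders.

```python
from google.cloud import bigquery

# Sketch: append the combined sitemap DataFrame to a BigQuery table
# ('my-project.news_monitoring.sitemap_data' is a placeholder table ID)
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
job = client.load_table_from_dataframe(
    result, 'my-project.news_monitoring.sitemap_data', job_config=job_config
)
job.result()  # wait for the load job to complete
```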
You can find the full script on my GitHub account: SCRIPT