How to monitor competitors’ Google News XML sitemap with Python

Author: Svet Petkov
Published on: October 31, 2023
Last Modified: July 25, 2024

What is a Google News XML sitemap?

A Google News XML sitemap is a file in XML format that provides information about a website's articles to Google News, allowing Google News to crawl and index them more efficiently. The sitemap contains important metadata about each article, such as the headline and publication date, along with the URL where the article can be found. By submitting a Google News XML sitemap to Google, website owners can ensure that their articles are more easily discovered and displayed in Google News search results. The biggest difference between a regular XML sitemap and a Google News XML sitemap is that the latter contains only the articles published in the last 48 hours.
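A quick way to see which metadata fields a given news sitemap exposes is to load it with Advertools and inspect the resulting DataFrame, where each XML tag becomes a column. The URL below is a placeholder, and the exact columns depend on the tags the sitemap uses:

import advertools as adv

# download a news sitemap and list the metadata columns extracted from it
news_df = adv.sitemap_to_df('https://www.example.com/news-sitemap.xml')
print(news_df.columns.tolist())
print(news_df.head())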

Why do we need to monitor the competitors’ Google News XML sitemap?

Monitoring or scraping competitors’ Google News XML sitemaps can provide valuable insights into their content strategy and help identify opportunities for improving your own website’s visibility in Google News. Here are some reasons why monitoring competitors’ sitemaps is important:

  1. Discover new content ideas: By analysing competitors’ sitemaps, you can identify the topics they are covering and find new content ideas that you can incorporate into your own website.
  2. Track trends: Monitoring competitors’ sitemaps can help you stay up-to-date with the latest news and trends in your industry, and enable you to adjust your content strategy accordingly.
  3. Identify gaps: Analysing competitors’ sitemaps can help you identify areas where your website is lacking in coverage, and enable you to fill those gaps with relevant and engaging content.
  4. Analyse publishing frequency: This data can tell you how often a competitor publishes and which topics they cover most (see the pandas sketch after this list).
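As a rough illustration of that last point, here is a minimal pandas sketch for measuring publishing frequency from the CSV file this article's script produces. It assumes the file was saved with a header row (see the alternative in STEP 4) and that the sitemap included a 'lastmod' timestamp column; adjust the column names to match your data.

import pandas as pd

# load the scraped sitemap data; 'lastmod' is assumed to hold timestamps
df = pd.read_csv('sitemap_data.csv', parse_dates=['lastmod'])

# count how many articles were published each day
articles_per_day = df.set_index('lastmod').resample('D')['loc'].count()
print(articles_per_day)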

Overall, keeping an eye on competitors’ Google News XML sitemaps can provide valuable insights into their strategy on Google News and Google Discover.

If you are interested in monitoring the Google “Top Stories” carousel, please check my previous blog post:

How to monitor Google “Top Stories” carousel using Python and SERPapi

Which Python libraries will we use?

import advertools as adv
import ssl
import time
import pandas as pd
  • Advertools – A Python package that provides a wide range of tools for data-driven marketing, and one of the most useful for digital marketers (SEOs). It is designed to make data analysis and reporting easier by providing functions that automate common tasks, such as downloading and parsing XML sitemaps, and streamline workflows.
  • Time – Python’s built-in “time” module provides various time-related functions. We will use it to pause the script between runs.
  • Pandas – The standard Python library for working with tabular data. We will use it to combine the sitemap DataFrames, remove duplicates, and export the results to CSV.
  • SSL – Python’s built-in “ssl” module, which is handy if a sitemap fetch fails with a certificate verification error (see the snippet after this list).
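If one of the target sites triggers an SSL certificate verification error when its sitemap is fetched, a common workaround is to disable certificate verification for the session. This is not part of the original script, and it switches off a security check, so use it with caution:

import ssl

# disable certificate verification for HTTPS requests in this session
# (a workaround for sites with broken certificate chains; use with care)
ssl._create_default_https_context = ssl._create_unverified_context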

STEP 1: XML sitemaps to DataFrames

while True:
    # fetch each competitor's news sitemap into its own DataFrame
    nj_1 = adv.sitemap_to_df('https://www.example-1.com/news-sitemap.xml', max_workers=8)
    nj_2 = adv.sitemap_to_df('https://www.example-2.com/news-sitemap.xml', max_workers=8)
    nj_3 = adv.sitemap_to_df('https://www.example-3.com/news-sitemap.xml', max_workers=8)

The code above runs in an infinite loop and repeatedly calls the sitemap_to_df function from Advertools. This function downloads an XML sitemap from a URL and returns it as a pandas DataFrame, which we store in a variable (one per competitor).

Usually, to find a competitor's Google News XML sitemap, you can look for a Sitemap: directive in their robots.txt file.
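You can also do this lookup in Python. The sketch below uses Advertools' robotstxt_to_df function, which parses a robots.txt file into a DataFrame of directives (the URL is a placeholder):

import advertools as adv

# parse the competitor's robots.txt into a DataFrame of directives
robots_df = adv.robotstxt_to_df('https://www.example.com/robots.txt')

# keep only the Sitemap directives to find their sitemap URLs
print(robots_df[robots_df['directive'].str.contains('sitemap', case=False)])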

NOTE: However, there’s a problem with running the loop as written: the same URLs are fetched and processed repeatedly without any break, which creates unnecessary network traffic, puts needless load on the target servers, and can slow down your computer.

To avoid this issue, add a delay using time.sleep() at the bottom of the loop (covered in STEP 5).
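Structurally, the loop then looks like this sketch, with the body filled in by the following steps:

while True:
    # STEP 1: fetch the sitemaps into DataFrames (as shown above)
    # STEPS 2-4: combine, deduplicate, and append to the CSV file
    ...
    time.sleep(43200)  # STEP 5: wait 12 hours before the next crawl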

STEP 2: Combining the DataFrames into one variable

    # combining the DataFrames into one variable
    all_frames = [nj_1, nj_2, nj_3]

    result = pd.concat(
        all_frames,
        axis=0,
        join="outer",
        ignore_index=False,
        keys=None,
        levels=None,
        names=None,
        verify_integrity=False,
        copy=True,
    )

The next step is to create a list, called “all_frames” here, that holds the DataFrames obtained from the different XML sitemaps. (Avoid calling it “all”, since that would shadow Python’s built-in all() function.) Collecting the DataFrames in one list makes it easy to combine them in a single pd.concat call.
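Note that every keyword argument in the call above is simply pandas’ default, so the same result can be written more concisely:

    # equivalent, shorter form: all the keyword arguments above are defaults
    result = pd.concat([nj_1, nj_2, nj_3])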

STEP 3: Removing duplicates and keeping the first occurrence

result.drop_duplicates(subset=['loc'], keep='first', inplace=True)

A Google News XML sitemap contains only the articles published in the last 48 hours, so if you scrape it more frequently than that, you will collect duplicate entries. The line above removes all duplicates in the “loc” column (the article URLs) and keeps the first occurrence.

STEP 4: Exporting the data into a CSV file without headings

result.to_csv('sitemap_data.csv', mode='a', index=True, header=not bool(result.shape[0]))

We created a separate variable for each crawled sitemap and then combined them all into a single variable called “result”. To save this data, we export it to a CSV file. Since we want to keep adding to the file every time the loop runs, we open it in append mode (“a”), which adds data without altering existing content.

Furthermore, to avoid having the column names repeated as a header every time we write to the CSV file, the “header” parameter is set so that a header row is written only if the DataFrame is empty and skipped otherwise; in practice, the file therefore contains no header row.
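If you would rather have the column names written exactly once, on the first run only, one alternative (not in the original script) is to key the header on whether the CSV file already exists:

import os

# write the header only if the CSV file doesn't exist yet (first run)
write_header = not os.path.isfile('sitemap_data.csv')
result.to_csv('sitemap_data.csv', mode='a', index=True, header=write_header)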

STEP 5: Running the script on a schedule

time.sleep(43200)

The combination of “while True” and “time.sleep(43200)” pauses execution for 43,200 seconds (12 hours) at the end of each iteration. This introduces a delay between iterations of the loop, effectively scheduling the crawl to repeat twice a day.

You can adjust the frequency of the execution:

Hours      Seconds
6 hours    21600
12 hours   43200
24 hours   86400
48 hours   172800
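Rather than hard-coding the number of seconds, you can make the interval self-documenting by spelling out the arithmetic:

# pause for 6 hours * 60 minutes * 60 seconds
time.sleep(6 * 60 * 60)  # 21600 seconds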

Summary

In summary, this script uses the Advertools library to crawl and extract data from multiple XML sitemaps, store the data in Pandas DataFrames, and then write the data to a CSV file. The script consists of an infinite loop that calls the sitemap_to_df function to obtain data from sitemaps, combines the data into a single DataFrame, and appends it to a CSV file.
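For reference, here is a minimal end-to-end sketch assembling the fragments from the steps above (the sitemap URLs are placeholders):

import ssl   # imported in case a certificate workaround is needed
import time

import advertools as adv
import pandas as pd

while True:
    # STEP 1: fetch each competitor's news sitemap into a DataFrame
    nj_1 = adv.sitemap_to_df('https://www.example-1.com/news-sitemap.xml', max_workers=8)
    nj_2 = adv.sitemap_to_df('https://www.example-2.com/news-sitemap.xml', max_workers=8)
    nj_3 = adv.sitemap_to_df('https://www.example-3.com/news-sitemap.xml', max_workers=8)

    # STEP 2: combine the DataFrames into one
    result = pd.concat([nj_1, nj_2, nj_3])

    # STEP 3: drop duplicate article URLs, keeping the first occurrence
    result.drop_duplicates(subset=['loc'], keep='first', inplace=True)

    # STEP 4: append the rows to the CSV file
    result.to_csv('sitemap_data.csv', mode='a', index=True,
                  header=not bool(result.shape[0]))

    # STEP 5: wait 12 hours before the next crawl
    time.sleep(43200)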

Overall, this script provides a powerful way to extract and store large amounts of data from multiple sitemaps in an organised manner. However, it is important to use it responsibly and not overload the target servers or your computer’s resources. You can then easily analyse and visualise the data in Looker Studio.

You can find the full script on my GitHub account: SCRIPT

NOTE: The time.sleep(43200) approach works when the script runs in a local IDE (such as PyCharm); it does not work in Google Colab, which disconnects long-running sessions.
