How to monitor competitors’ Google News XML sitemap with Python

Author: Svet Petkov
Last Modified: November 29, 2024

If you work for a news publishing website, you’ve probably wondered how much content your competitors are publishing—especially during big events—or how much you need to publish daily or weekly to stay competitive in Google News.

What is a Google News XML sitemap?

A Google News XML sitemap is a file in XML format that provides information about articles on a website to Google News. This allows Google News to crawl and index the articles more efficiently. The sitemap contains important metadata about the articles, such as

  • the headline
  • publication date and last modified date
  • author
  • the URLs where the articles can be found

By submitting a Google News XML sitemap in Google Search Console, you can ensure that your articles are discovered by Googlebot more easily. The biggest difference between a regular XML sitemap and a Google News XML sitemap is that the news sitemap contains only the articles published in the last 48 hours.
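
For illustration, a single entry in a Google News XML sitemap looks roughly like this (the URL, publication name and dates are placeholders):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.example.com/news/some-article</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2024-11-29T08:00:00+00:00</news:publication_date>
      <news:title>Example headline about a big event</news:title>
    </news:news>
  </url>
</urlset>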

Why should you keep an eye on your competitors’ Google News XML Sitemap?

Monitoring or scraping competitors’ Google News XML sitemaps can provide valuable insights into their content strategy and help identify opportunities for improving your own publishing strategy. Here are some reasons why monitoring competitors’ sitemaps is important:

  1. Discover new content ideas: By analysing competitors’ sitemaps, you can identify the topics they are covering and find new content ideas that you can incorporate into your own website.
  2. Analyse publishing frequency: This data tells you how often a competitor publishes and which topics are covered the most.
  3. Analyse event coverage: With simple title segmentation, you can see how many articles your competitors publish about a single topic or event (a short pandas sketch after this list illustrates points 2 and 3).
  4. Analyse publishing timing: You can easily see when competitors start publishing content ahead of an event.
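
As a minimal sketch of points 2 and 3, here is the kind of analysis you could run with pandas on data already collected by the script below. The column names 'news_publication_date', 'news_title' and 'loc' are what advertools typically produces for news sitemaps, and the CSV is assumed to have a header row, so adjust both to your own data.

import pandas as pd

# Load previously scraped sitemap data. Assumes the CSV has a header row and
# the advertools-style columns 'loc', 'news_publication_date' and 'news_title'.
df = pd.read_csv('sitemap_data.csv')
df['news_publication_date'] = pd.to_datetime(df['news_publication_date'], utc=True)

# Publishing frequency: number of articles per day
articles_per_day = df.set_index('news_publication_date').resample('D')['loc'].count()
print(articles_per_day)

# Event coverage: how many headlines mention a given keyword
keyword = 'election'  # hypothetical topic keyword
coverage = df['news_title'].str.contains(keyword, case=False, na=False).sum()
print(f"Articles mentioning '{keyword}': {coverage}")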

Based on my experience, keeping an eye on competitors’ Google News XML sitemaps can provide valuable insights into their strategy on Google News and Google Discover.

If you are interested in monitoring the Google “Top Stories” carousel, please check my previous blog post:

How to monitor Google “Top Stories” carousel using Python and SERPapi

Which Python libraries do we need for the script?

      import advertools as adv
      import ssl
      import time
      import pandas as pd
• Advertools – a Python package that provides a wide range of tools for data-driven marketing and SEO. It is designed to make data analysis and reporting easier for marketers by providing functions that automate common tasks, such as downloading XML sitemaps into DataFrames.
• Time – Python’s built-in “time” module provides various time-related functions. We will use it to pause the script between runs.
• Pandas – used to combine the sitemap DataFrames, remove duplicates and export the data to a CSV file.
• SSL – Python’s built-in “ssl” module, which can be used to work around certificate errors when fetching the sitemaps.

      STEP 1: XML sitemaps to data frames

The code below runs a function called sitemap_to_df from Advertools. This function fetches an XML sitemap from a URL and returns it as a pandas DataFrame.

      The easiest way to find the Google News XML sitemap of the competitor is to check for a reference link in the robots.txt file.
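
If you want to do this programmatically, here is a minimal sketch using advertools’ robotstxt_to_df (the robots.txt URL is a placeholder, and the 'directive' and 'content' column names are what current advertools versions return):

import advertools as adv

# Fetch the competitor's robots.txt and keep only the "Sitemap:" directives.
robots = adv.robotstxt_to_df('https://www.example.com/robots.txt')
sitemap_urls = robots.loc[robots['directive'].str.lower() == 'sitemap', 'content']
print(sitemap_urls.tolist())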

while True:
    # one call per competitor news sitemap (placeholder URLs)
    nj_1 = adv.sitemap_to_df('https://www.example-1.com/news-sitemap.xml', max_workers=8)
    nj_2 = adv.sitemap_to_df('https://www.example-2.com/news-sitemap.xml', max_workers=8)
    nj_3 = adv.sitemap_to_df('https://www.example-3.com/news-sitemap.xml', max_workers=8)
      

NOTE: Running this while True loop on its own would cause unnecessary network traffic and could slow down your computer, because the same sitemaps would be requested and processed over and over without any break.

To avoid this issue, add a delay using time.sleep() at the bottom of the loop (see Step 5).

STEP 2: Combining the DataFrames into one variable

The next step is to create a list called "all" that gathers the DataFrames produced from the different XML sitemaps, and then pass it to pd.concat. I chose to do this to make it easier to work with all the data in one place.

    # combining in one variable
    all = [nj_1, nj_2, nj_3]

    result = pd.concat(
        all,
        axis=0,
        join="outer",
        ignore_index=False,
    )

STEP 3: Removing duplicates and keeping the first occurrence

As I mentioned, the Google News XML sitemap contains only the articles published in the last 48 hours. If you are scraping it more frequently, you are likely to end up with duplicated values. That’s why the line of code below removes all duplicates in the “loc” column (the URLs) and keeps the first occurrence.

      result.drop_duplicates(subset=['loc'], keep='first', inplace=True)
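
Note that this only removes duplicates within the current batch. Because the same 48-hour window is scraped again on every run, you may also want to drop URLs that are already stored in the CSV before appending. A rough sketch, assuming the CSV keeps a header row with a 'loc' column so it can be read back by name:

import os
import pandas as pd

# Drop URLs that are already in sitemap_data.csv from the new batch.
# Assumes the CSV was written with a header row containing a 'loc' column.
if os.path.isfile('sitemap_data.csv'):
    seen = pd.read_csv('sitemap_data.csv', usecols=['loc'])
    result = result[~result['loc'].isin(seen['loc'])]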

      STEP 4: Exporting the data into a CSV file without headings

I created separate variables for each sitemap that was crawled, and then combined them all into a single variable called “result”. To save this data, we need to export it to a CSV file. Since we want to keep adding to the file every time the loop runs, we use mode “a”, which appends data without altering the existing content.

Furthermore, to avoid having the column names repeated as headers every time we append to the CSV file, we use the “header” parameter: the expression below only writes a header row when the DataFrame is empty, so in practice the data is written without headings.

      result.to_csv('sitemap_data.csv', mode='a', index=True, header=not bool(result.shape[0]))
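
If you would rather keep a single header row in the file (which makes it easier to read back with pandas later), a common alternative, not part of the original script, is to write the header only when the CSV file does not exist yet:

import os

# Write the header row only the first time the file is created,
# so later appends never repeat the column names.
result.to_csv('sitemap_data.csv', mode='a', index=True,
              header=not os.path.isfile('sitemap_data.csv'))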

STEP 5: Running the script automatically

The “while True” loop together with “time.sleep(43200)” pauses execution for 43,200 seconds at the end of each iteration, which is equivalent to 12 hours. This is useful for introducing a delay between iterations of the loop or for scheduling the next run at a specific time in the future.

      time.sleep(43200)
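
Putting the fragments from the previous steps together, the whole loop looks roughly like this (the sitemap URLs are placeholders):

import ssl  # imported as in the original script; handy if certificate errors appear
import time

import advertools as adv
import pandas as pd

while True:
    # Step 1: fetch each competitor's news sitemap into a DataFrame
    nj_1 = adv.sitemap_to_df('https://www.example-1.com/news-sitemap.xml', max_workers=8)
    nj_2 = adv.sitemap_to_df('https://www.example-2.com/news-sitemap.xml', max_workers=8)
    nj_3 = adv.sitemap_to_df('https://www.example-3.com/news-sitemap.xml', max_workers=8)

    # Step 2: combine the DataFrames in one variable
    all = [nj_1, nj_2, nj_3]
    result = pd.concat(all, axis=0, join="outer", ignore_index=False)

    # Step 3: remove duplicate URLs within the batch
    result.drop_duplicates(subset=['loc'], keep='first', inplace=True)

    # Step 4: append the batch to the CSV file
    result.to_csv('sitemap_data.csv', mode='a', index=True,
                  header=not bool(result.shape[0]))

    # Step 5: wait 12 hours before the next run
    time.sleep(43200)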

      You can adjust the frequency of the execution:

Hours       Seconds
6 hours     21600
12 hours    43200
24 hours    86400
48 hours    172800

How can you improve the process?

      Overall, this script provides a powerful way to extract and store large amounts of data from multiple sitemaps in an organised manner.

However, you could also run the script on Google Cloud, store the data in BigQuery, and visualise it in Google Looker Studio.
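
As a minimal sketch of the BigQuery part, assuming you have a Google Cloud project and the pandas-gbq package installed (the project ID and the dataset/table names below are hypothetical):

import pandas_gbq

# Append the combined DataFrame to a BigQuery table on every run.
pandas_gbq.to_gbq(
    result,                       # the combined DataFrame from the script
    'news_monitoring.sitemaps',   # hypothetical dataset.table
    project_id='my-gcp-project',  # hypothetical Google Cloud project ID
    if_exists='append',           # keep adding rows instead of overwriting
)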

      You can find the full script on my GitHub account: SCRIPT
