What Is a Google News XML Sitemap and How Can You Monitor It?
A Google News XML sitemap is a specialised XML file that helps Google News discover and process recent news articles more efficiently. It contains article URLs together with metadata such as publication name, publication date, and title.
The key difference between a regular XML sitemap and a Google News sitemap is freshness. A Google News XML sitemap should only include articles published in the last 48 hours, which makes it useful both for Google and for competitive monitoring.
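For context, a minimal entry in a Google News sitemap uses the news extension namespace documented by Google; the URL, publication name, and date below are illustrative placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.example.com/story-a</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2024-05-01T08:00:00+00:00</news:publication_date>
      <news:title>Example headline</news:title>
    </news:news>
  </url>
</urlset>
```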
What a Google News XML sitemap does
Submitting a Google News sitemap helps Google crawl and index eligible news content faster. For publishers, that matters because speed and discoverability are often critical in Google News, Top Stories, and other freshness-driven surfaces.
| Regular XML sitemap | Google News XML sitemap |
|---|---|
| Can include a broad set of site URLs. | Should include only recent news articles. |
| Used for general discovery and crawl guidance. | Used to surface fresh content for Google News. |
| May stay relatively stable over time. | Changes rapidly as articles enter and leave the 48-hour window. |
Why monitor competitors' Google News sitemaps?
Monitoring competitor Google News sitemaps can reveal useful patterns in editorial output and distribution strategy. Because the file reflects only the most recent articles, it acts as a near-real-time signal of what a publisher is prioritising.
- Discover new content ideas: spot topics, entities, and story angles competitors are publishing.
- Track trends: understand which themes are accelerating in your niche or market.
- Identify coverage gaps: find areas where your newsroom or content operation is underrepresented.
- Measure publishing frequency: estimate how often competitors publish and which subjects dominate their recent output.
In practice, this kind of monitoring can support both Google News analysis and Google Discover research, especially when you combine sitemap data with headline classification or publisher tagging.
Python libraries used in the workflow
```python
import ssl
import time

import advertools as adv
import pandas as pd
```
- Advertools: an SEO-focused Python package for working with sitemaps, crawl data, and search marketing workflows.
- time: used to add a delay between runs so the script does not hammer the same sitemap continuously.
- Pandas: used to combine, deduplicate, and export the sitemap data.
- ssl: occasionally needed to work around certificate-verification issues when fetching HTTPS sitemaps in local environments.
Step 1: Load the sitemap into data frames
The core method here is adv.sitemap_to_df(), which fetches a sitemap and converts it into a Pandas DataFrame.
```python
while True:
    nj_1 = adv.sitemap_to_df("https://www.example.com/news-sitemap.xml", max_workers=8)
    nj_2 = adv.sitemap_to_df("https://www.example.com/news-sitemap.xml", max_workers=8)
    nj_3 = adv.sitemap_to_df("https://www.example.com/news-sitemap.xml", max_workers=8)
```

That example shows the basic concept, but it is not ideal as-is. Fetching the same sitemap three times in immediate succession creates unnecessary traffic and wastes local resources. If you are going to poll a competitor sitemap, you should introduce a delay and keep the frequency reasonable.
To find a competitor's Google News sitemap, the safest first step is usually to check the robots.txt file for a sitemap reference.
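The sitemap lines in robots.txt are simple enough to extract with the standard library alone. A minimal sketch, assuming you have already downloaded the robots.txt body as a string (the sample file below is illustrative, not a real publisher's):

```python
def extract_sitemaps(robots_txt: str) -> list[str]:
    """Return every URL declared on a 'Sitemap:' line (case-insensitive)."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so URLs (which contain ':') stay intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps


sample = """User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/news-sitemap.xml"""

print(extract_sitemaps(sample))  # ['https://www.example.com/news-sitemap.xml']
```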
Step 2: Combine the data frames
```python
all_sitemaps = [nj_1, nj_2, nj_3]
result = pd.concat(all_sitemaps, ignore_index=True)
```

This combines multiple DataFrames into a single variable so the full collection can be cleaned and exported together. Setting ignore_index=True avoids duplicate index labels across the concatenated snapshots; the other pd.concat arguments (axis, join, verify_integrity) can be left at their defaults here.
Step 3: Remove duplicates
```python
result.drop_duplicates(subset=["loc"], keep="first", inplace=True)
```

Because a Google News sitemap only covers the most recent 48 hours, repeated snapshots will often contain the same article URLs. Deduplicating on `loc` keeps the export clean.
Step 4: Export the data to CSV
```python
import os

csv_path = "sitemap_data.csv"
result.to_csv(
    csv_path,
    mode="a",
    index=False,
    header=not os.path.exists(csv_path),
)
```

Appending to the same CSV allows the file to grow over time as you collect more sitemap snapshots. The header should be written only once, on the first run, so the check tests whether the file already exists; checking the DataFrame's row count instead would skip the header precisely when there is data to write.
Step 5: Schedule the script responsibly
```python
time.sleep(43200)
```

Placed at the end of the `while` loop, a 43,200-second delay means the script polls every 12 hours. That is usually enough for this type of monitoring without creating aggressive crawl behaviour.
| Frequency | Seconds |
|---|---|
| 6 hours | 21600 |
| 12 hours | 43200 |
| 24 hours | 86400 |
| 48 hours | 172800 |
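Putting the steps together, the whole workflow can be sketched as one small script. The sitemap URL, CSV path, and polling interval are placeholders to adapt; advertools is imported lazily so the append helper can be reused and tested without it installed.

```python
import os
import time

import pandas as pd

SITEMAP_URL = "https://www.example.com/news-sitemap.xml"  # placeholder
CSV_PATH = "sitemap_data.csv"
POLL_SECONDS = 43_200  # 12 hours


def append_snapshot(df: pd.DataFrame, path: str = CSV_PATH) -> None:
    """Deduplicate a snapshot on 'loc' and append it, writing the header once."""
    df = df.drop_duplicates(subset=["loc"], keep="first")
    df.to_csv(path, mode="a", index=False, header=not os.path.exists(path))


def run() -> None:
    import advertools as adv  # third-party: pip install advertools

    while True:
        snapshot = adv.sitemap_to_df(SITEMAP_URL)
        append_snapshot(snapshot)
        time.sleep(POLL_SECONDS)


# To start monitoring (runs indefinitely):
# run()
```

Note that deduplication here happens per snapshot; duplicates across appends can be cleaned up later when you load the CSV for analysis.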
Summary
What this workflow gives you
- A simple way to collect recent competitor article URLs from Google News sitemaps.
- A historical CSV that can be analysed later in Looker Studio, spreadsheets, or Python notebooks.
- Better visibility into publishing cadence and topical focus.
This is a practical way to build a lightweight competitive intelligence dataset from Google News sitemaps. Keep it responsible, keep it clean, and the output can become genuinely useful for editorial and SEO analysis.