
Safeguarding Your Joomla Content Against AI Crawlers


Joomla is no exception to the trend towards artificial intelligence: there are extensions that let you connect to AI services and use them to create content. But you may not want your own content to be used to feed this enormous knowledge base.

Ask yourself whether your site has already been used to train artificial intelligence. You can run the test at https://haveibeentrained.com/ and then decide whether you want to leave your site open to the AIs.


In today's digital landscape, content creators using Joomla face the challenge of protecting their work from crawlers that collect data for AI systems, such as OpenAI's GPTBot, Common Crawl's CCBot, and the crawlers feeding Google's Gemini and Anthropic's Claude. It's crucial to understand how to manage and protect your content effectively.

Understanding the Landscape

Several major newspapers and press groups, including the Financial Times, Axel Springer, the Associated Press, and the Spanish group Prisa (El País), have already signed agreements that expand usage rights and allow AI integration. However, concerns persist about potential repercussions, including consequences for search visibility imposed by tech giants like Google. Despite assertions that AI crawling and search indexing are unrelated, the integration of Gemini into Chrome raises concerns.


Technical Measures

Using robots.txt

The simplest safeguard is the robots.txt file. It is a request rather than an enforced rule, so it only stops bots that choose to honor it.
Here's how content creators can tell AI bots to stay out:

# Disable OpenAI bots
User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /


GPTBot is the user agent of OpenAI's crawler, while ChatGPT-User is the one used by ChatGPT plugins.
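To check that rules like these say what you intend, one option is to run them through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser module; the helper name is_allowed and the example.com URL are illustrative assumptions, not part of Joomla or of OpenAI's documentation. It only shows how a compliant parser reads the rules; it cannot tell you whether a given bot actually obeys them.

```python
from urllib.robotparser import RobotFileParser

# The OpenAI rules from this article, as they would appear in robots.txt.
ROBOTS_TXT = """\
# Disable OpenAI bots
User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules let user_agent fetch url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/index.php"  # placeholder URL
    for agent in ("GPTBot", "ChatGPT-User", "Mozilla/5.0"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

Python's parser treats a blank line as the end of a group, so keep each User-agent line directly above its Disallow line.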

To go further, the IP ranges used by GPTBot and ChatGPT-User can also be blocked at the server or firewall level.


For Common Crawl:

User-agent: CCBot
Disallow: /


And IP ranges:


38.107.191.66 through 38.107.191.119
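A range written as "first through last" usually has to be translated before a firewall or server rule can use it. As a minimal sketch, Python's standard ipaddress module can test whether a visitor's address falls inside the range and split the range into CIDR blocks. The function names are hypothetical, and the addresses are simply the ones quoted above; verify that they are still current for CCBot before relying on them.

```python
from ipaddress import IPv4Address, summarize_address_range

# Range quoted in this article for Common Crawl (verify before use).
CCBOT_FIRST = IPv4Address("38.107.191.66")
CCBOT_LAST = IPv4Address("38.107.191.119")


def in_blocked_range(ip: str) -> bool:
    """Return True if ip lies inside the quoted range (inclusive)."""
    return CCBOT_FIRST <= IPv4Address(ip) <= CCBOT_LAST


def range_as_cidr() -> list[str]:
    """Express the range as CIDR blocks, the form most firewalls expect."""
    return [str(net) for net in summarize_address_range(CCBOT_FIRST, CCBOT_LAST)]


if __name__ == "__main__":
    print(in_blocked_range("38.107.191.100"))
    for block in range_as_cidr():
        print(block)
```

Blocking by address is a blunt instrument: it misses any crawler traffic that arrives from other addresses.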


For publishers who want to regulate access rather than block it outright, there is the Text and Data Mining Reservation Protocol (TDMRep). It lets content creators declare policies and rights for their content through HTTP headers, a file on the server, or HTML metadata.

Detailed implementation instructions can be found in the W3C documentation for TDM Reservation Protocol (TDMRep):

https://w3c.github.io/tdm-reservation-protocol/spec/

TDMRep offers several complementary techniques for expressing rightsholders' choices, designed to suit publishers' different situations and technical skills.


TDM File on the Origin Server:
This technique involves hosting a file named tdmrep.json in the /.well-known directory of the origin server.
The tdmrep.json file contains an array of JSON objects, each representing a rule.

Each rule includes the following properties:

  • location: A pattern matching the path of the files hosted on the server.
  • tdm-reservation: A TDM reservation value associated with the pattern.
  • tdm-policy: An optional TDM Policy value associated with the pattern.

TDM Agents can evaluate the URL of a web resource against the patterns in the tdmrep.json file to determine the TDM reservation and policy.
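The rule structure described above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the paths, the example.com policy URL, the use of 1 for "rights reserved" and 0 for "not reserved" (my reading of the W3C draft), shell-style wildcard matching via fnmatch, and the "first matching rule wins" order. Check the specification for the exact pattern syntax before deploying a real file.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical rules; each object carries the three properties listed above.
RULES = [
    {"location": "/images/*", "tdm-reservation": 1},
    {
        "location": "/blog/*",
        "tdm-reservation": 1,
        "tdm-policy": "https://example.com/policies/tdm-policy.json",
    },
    {"location": "/press/*", "tdm-reservation": 0},
]


def render_tdmrep(rules: list[dict]) -> str:
    """Serialize the rules as the content of /.well-known/tdmrep.json."""
    return json.dumps(rules, indent=2)


def lookup(path: str, rules: list[dict]):
    """Return (reservation, policy) for the first rule matching path.

    Returns None when no rule matches. The matching semantics are an
    assumption made for this sketch, not taken from the specification.
    """
    for rule in rules:
        if fnmatchcase(path, rule["location"]):
            return rule["tdm-reservation"], rule.get("tdm-policy")
    return None


if __name__ == "__main__":
    print(render_tdmrep(RULES))
    print(lookup("/blog/my-article", RULES))
```

Generating the file from a script rather than writing it by hand keeps the JSON valid as the rules grow.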


HTML Metadata:

Publishers can express their TDM choices using HTML metadata tags within the <head> section of their web pages.

The tdm-reservation meta tag is used to indicate whether TDM rights are reserved or not.

The tdm-policy meta tag can be used to provide a URL pointing to the TDM Policy associated with the content.

TDM Agents can parse the HTML metadata to determine the TDM reservation and policy.
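To see what a TDM Agent would read from a page, here is a minimal sketch that extracts the two meta tags with Python's standard html.parser module. The sample page, the example.com policy URL, and the class and function names are assumptions for illustration; the value "1" is assumed to mean that TDM rights are reserved.

```python
from html.parser import HTMLParser

# Hypothetical page head carrying the two TDMRep meta tags.
SAMPLE_PAGE = """\
<html>
<head>
  <title>My Joomla article</title>
  <meta name="tdm-reservation" content="1">
  <meta name="tdm-policy" content="https://example.com/policies/tdm-policy.json">
</head>
<body><p>Article text</p></body>
</html>
"""


class TDMMetaParser(HTMLParser):
    """Collect tdm-reservation and tdm-policy meta tags from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.tdm: dict[str, str | None] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("tdm-reservation", "tdm-policy"):
            self.tdm[name] = attributes.get("content")


def read_tdm_meta(html: str) -> dict:
    """Return the TDM meta values found in html (empty dict if none)."""
    parser = TDMMetaParser()
    parser.feed(html)
    parser.close()
    return parser.tdm


if __name__ == "__main__":
    print(read_tdm_meta(SAMPLE_PAGE))
```

Running the same check against your own rendered pages is a quick way to confirm that a template override or plugin really emits the tags.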

These techniques allow rightsholders to declare their choices regarding TDM rights and provide information on TDM policies associated with their content. By using these techniques, TDM Agents can adjust their scraping behavior or establish separate agreements with rightsholders to ensure compliance with TDM rights and licenses.

As guardians of digital content within the Joomla framework, content creators must be proactive in managing their content's accessibility to AI crawlers. While technical measures like robots.txt can provide immediate relief, the implementation of protocols like TDMRep ensures a nuanced approach to content control, balancing accessibility with protection. By staying informed and leveraging available tools, content creators using Joomla can navigate the evolving landscape of AI-driven content indexing while safeguarding their intellectual property.

Some articles published on the Joomla Community Magazine represent the personal opinion or experience of the Author on the specific topic and might not be aligned to the official position of the Joomla Project


Comments 2

Brian Teeman on Monday, 20 May 2024 10:58
Common misconception

The robots.txt is only a polite request to bots not to index the site. It is not enforced and should not be relied upon to guarantee anything.

Yann Gomiero on Monday, 20 May 2024 12:10
note

Indeed, I should have specified that it was a "polite request" in robots.txt, as blocking IP addresses is a rather blunt method. The TDM Reservation Protocol serves as an alternative. It's up to you to judge the effectiveness of each solution. Thank you, Brian, for this clarification.

