How To Stop ChatGPT & OpenAI Scraping Your Proprietary Information/Get Compensated…

Orren Prunckun
May 22, 2023

GPT-3 was trained on the Common Crawl, WebText2, Books1 & Books2, and Wikipedia.

The Common Crawl is an open, free-to-use dataset that contains petabytes of data collected from the web since 2008.

WebText2 is the text of web pages reached via outbound links from Reddit posts with 3+ upvotes.

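Books1 & Books2 are internet-based book corpora whose contents OpenAI has not detailed; they are commonly assumed to resemble BookCorpus, a dataset consisting of the text of around 11,000 unpublished books scraped from the Internet.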

Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.

ChatGPT’s training data goes up to September 2021.

OpenAI makes its money off its API and ChatGPT Plus.

It didn’t pay the Common Crawl, Reddit, Books1 & Books2 or Wikipedia for their content, copyrighted or not!

Steve Huffman, one of the co-founders of Reddit, wants financial compensation (https://www.nytimes.com/2023/04/18/technology/reddit-ai-openai-google.html).

Twitter is also following suit: https://nftstudio24.com/elon-musk-threatens-legal-action-tech-company-ai-training/ (and I can confirm that scraping Twitter via the API starts at USD$100/mo).

That’s fair!

Google’s search engine profits from advertising shown alongside others’ linked content within its search results.

It seems to have overcome pushback by allowing the owners of that linked content to share in some of those profits via its AdSense product.

AdSense displays Google ads on people’s linked content, and they get a share of that revenue.

OpenAI is profiting off repurposing (as opposed to infringing copyright) others’ content.

Copyright, in this case copyright in text, is only infringed when exact sequences of words are used.

To date, I haven’t seen any evidence that this is occurring or has occurred.

There was a rumour circulating on LinkedIn several months ago that ChatGPT was infringing copyright by simply concatenating (nerd coding word for connecting) strings of text found on the internet. This is easily disproved by typing those strings of text back into Google as an exact phrase.
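If you already have a piece of ChatGPT output and a page you suspect it was copied from, the same "exact sequence of words" check can be roughly sketched in code. This is only a minimal illustration using Python's standard `difflib`; the function name `longest_shared_phrase` and the sample sentences are made up for the example, and it compares lowercased, whitespace-split words without stripping punctuation.

```python
from difflib import SequenceMatcher


def longest_shared_phrase(generated: str, source: str) -> str:
    """Return the longest run of consecutive words that appears in both texts."""
    # Hypothetical helper for illustration: words are lowercased and split on whitespace.
    a = generated.lower().split()
    b = source.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


if __name__ == "__main__":
    generated = "The cat sat on the mat today"
    source = "Yesterday a dog sat on the mat quietly"
    print(longest_shared_phrase(generated, source))  # prints "sat on the mat"
```

A long shared run suggests verbatim reuse; a run of only a few common words does not tell you much either way.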

For OpenAI to overcome Reddit’s pushback (don’t worry, there will be heaps more, especially when GPT-4 starts scraping your and everyone else’s websites), it will likely need to offer revenue sharing with data sources, à la AdSense.

In the interim, update your website’s robots.txt file with:

User-agent: ChatGPT-User
Disallow: /
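
To sanity-check that rule before you deploy it, here is a small sketch using Python's standard `urllib.robotparser`. The `ChatGPT-User` token and the `Disallow: /` rule come from the snippet above; the `is_allowed` helper, the `example.com` URL and the `Googlebot` comparison are assumptions added purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from this post, held as a string for testing.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/any-page"  # placeholder URL
    print(is_allowed(ROBOTS_TXT, "ChatGPT-User", page))  # False: blocked site-wide
    print(is_allowed(ROBOTS_TXT, "Googlebot", page))     # True: no rule applies to it
```

Keep in mind that robots.txt is a request that well-behaved crawlers honour, not an enforcement mechanism, and that other crawlers may need their own User-agent group; check each operator's documentation for the token it uses.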

Orren Prunckun

Entrepreneur. Australia Day Citizen of the Year for Unley. Recognised in the Top 50 Australian Startup Influencers. http://orrenprunckun.com