Creative Commons, the nonprofit behind the world’s most widely used open licenses, has developed a new framework that allows creators to clearly express whether, and on what terms, they consent to their work being used to train artificial intelligence models.
Known as CC signals, the system enables content owners to attach machine-readable metadata to their digital works, signalling their preferences for AI reuse. They can indicate either “Yes,” they are happy for the work to be used in training, or “No,” they are not, and then attach one of four conditions of use:
- Credit: Appropriate credit must be given.
- Direct Contribution: Monetary support must be provided to the work owner.
- Ecosystem Contribution: Monetary support must be provided to the ecosystem that the AI is benefiting from through the use of the work as training data.
- Open: The AI system used must be open.
Credit must be provided to the work owner under all four signals. A signal can also indicate the scope of AI usage it applies to, such as text and data mining, generative AI training, or AI inference.
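To make the structure concrete, a signal can be modelled as a small record pairing the yes/no preference with one of the four conditions and a usage scope. The Python sketch below is purely illustrative; the field names and values are assumptions, not the draft spec’s actual vocabulary.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative model only: CC signals are still in draft, so these
# names and values are assumptions, not the published vocabulary.
class Condition(Enum):
    CREDIT = "credit"
    DIRECT_CONTRIBUTION = "direct-contribution"
    ECOSYSTEM_CONTRIBUTION = "ecosystem-contribution"
    OPEN = "open"

class Scope(Enum):
    TEXT_AND_DATA_MINING = "tdm"
    GENAI_TRAINING = "genai-training"
    AI_INFERENCE = "ai-inference"

@dataclass
class CCSignal:
    consent: bool          # "Yes" or "No" to AI reuse
    condition: Condition   # one of the four conditions of use
    scope: Scope           # which kind of AI usage the signal covers

# A hypothetical signal: yes to generative AI training, provided the
# ecosystem the model benefits from is supported in return.
signal = CCSignal(consent=True,
                  condition=Condition.ECOSYSTEM_CONTRIBUTION,
                  scope=Scope.GENAI_TRAINING)
print(signal)
```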
AI developers, scrapers, or dataset aggregators can scan content for CC signals using standardised methods, like HTTP headers and metadata. These do not replace copyright licenses such as CC-BY or CC0 but are layered on top.
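In practice, a scraper checking for a signal before ingesting a page might do something like the following sketch. Since the framework has not been finalised, the “CC-Signal” header name here is an assumption for illustration, not a published standard.

```python
import urllib.request

# Sketch of a crawler-side check. The header name "CC-Signal" is an
# assumption for illustration; the draft spec may define a different
# field name or rely on embedded metadata instead.
def read_cc_signal(url: str) -> str | None:
    """Issue a HEAD request and return the value of a CC-signal-style
    response header, or None if the server does not declare one."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        # Header lookup is case-insensitive; returns None if absent.
        return resp.headers.get("CC-Signal")

print(read_cc_signal("https://example.com/") or "No CC signal declared.")
```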
The CC signals are not currently enforceable but act more as social and ethical markers, similar to CC licenses. Creative Commons said in a blog post that the framework provides an alternative response to the proliferation of AI training other than “data extraction and the erosion of openness” or “a walled-off internet guarded by paywalls.”
“If we are committed to a future where knowledge remains open, we need to collectively insist on a new kind of give-and-take,” Sarah Hinchliff Pearson, general counsel at Creative Commons, said in the post. “A single preference, uniquely expressed, is inconsequential in the machine age. But together, we can demand a different way.”
Creative Commons said that CC signals are the result of “years of consultation and analysis,” but it is still seeking public feedback over the next few months. It hopes to formally launch the framework in November.
Creators and AI companies at war over training data
The emergence of AI over the past few years has seen the tech and creative industries butt heads. Tech companies want their AI models to be as useful as possible, which means feeding them vast amounts of fresh, human-created data. They’re also racing to innovate and outpace competitors, knowing that asking permission or paying creators could slow them down and cut into profits.
Meanwhile, creators, wary of powering tools that may eventually compete with them, still see potential in being fairly compensated and in contributing to models that could drive meaningful progress in fields like medicine and education.
Legal battles and copyright debates are unfolding around the world as courts and lawmakers grapple with how to resolve this fundamental tension between innovation and creative rights. Anthropic, Meta, Perplexity, Stability AI, Midjourney, and OpenAI (many, many, many times) are among the AI developers that have faced legal action from the likes of artists, news outlets, and musicians for using their work without consent. Sam Altman’s startup has signed a number of licensing deals with publishers to avoid further trouble.
Online platforms are trying to take control at an individual level
A number of platforms have made changes to their technical infrastructure and policies to gain more control over AI data scraping of their users’ content. X changed its privacy policy earlier this month to disallow the use of X content to “fine-tune or train a foundation or frontier model.”
Reddit updated its robots.txt files last year to block unauthorised AI bots and crawlers, while continuing to allow access for good-faith actors like researchers and the Internet Archive. It sued Anthropic this month for repeatedly crawling its forums without permission.
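robots.txt is a plain-text file of per-user-agent allow and disallow rules that well-behaved crawlers consult before fetching pages. As a rough illustration of how a crawler (or a curious reader) checks those rules, Python’s standard library ships a parser; the “ExampleAIBot” name below is hypothetical, and the verdicts depend on whatever rules the live file contains at the time.

```python
import urllib.robotparser

# Fetch and parse a site's robots.txt, then ask whether a given
# crawler user-agent may fetch a path. "ExampleAIBot" is hypothetical;
# "ia_archiver" is the Internet Archive's long-standing crawler.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.reddit.com/robots.txt")
rp.read()  # downloads and parses the rules

for agent in ("ExampleAIBot", "ia_archiver"):
    allowed = rp.can_fetch(agent, "https://www.reddit.com/r/programming/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```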
Cybersecurity company Cloudflare has launched tools designed to disrupt AI web crawlers and let website owners see and control how often AI models use their site’s content. CEO Matthew Prince has criticised the current state of the internet, where human readers face paywalls and intrusive ads, while AI scrapers get to consume content for free. He hopes to turn this model on its head by launching a Cloudflare marketplace where website owners can sell access to their content for AI training.