A newly published study by the AI Disclosures Project has ignited a debate over OpenAI’s use of copyrighted data in training its large language models (LLMs). Focusing on the GPT-4o model, the research finds that the model strongly recognizes paywalled, copyrighted content from O’Reilly Media books. This has raised concerns about how transparent AI developers are about their training data sources.
Headed by technologist Tim O’Reilly and economist Ilan Strauss, the AI Disclosures Project seeks to mitigate the potential adverse societal impacts stemming from AI commercialization by promoting greater transparency from corporations. Their working paper draws connections between AI data disclosure and the established financial disclosure standards that underpin strong securities markets.
Using a dataset of 34 legally acquired, copyrighted books from O’Reilly Media, researchers investigated whether OpenAI's LLMs were trained on proprietary data without appropriate permissions. The DE-COP membership inference attack method showed that GPT-4o strongly recognizes content from these books, pointing to potential misuse of copyrighted material. GPT-4o reached an AUROC score of 82% for recognizing paywalled content, in stark contrast to GPT-3.5 Turbo, whose AUROC was just above 50%.
The study also shows that GPT-4o recognizes non-public book content better than publicly accessible samples, with AUROC scores of 82% and 64% respectively. Conversely, GPT-3.5 Turbo recognized publicly available samples more effectively than non-public ones. Meanwhile, GPT-4o Mini, a smaller model, showed no knowledge of either public or non-public O’Reilly materials, as reflected by its AUROC score of approximately 50%.
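For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen sample from one group receives a higher score than a randomly chosen sample from the other group, so 50% corresponds to chance and 100% to perfect separation. The sketch below illustrates only that general definition. The function name `auroc` and all of the scores are hypothetical, invented for illustration, and are not taken from the study or from the DE-COP method itself.

```python
def auroc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win, which matches the usual rank-based
    definition of the area under the ROC curve.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical recognition scores, invented for illustration only.
seen = [0.9, 0.7, 0.6, 0.4]      # samples assumed to be in the training data
unseen = [0.5, 0.3, 0.2, 0.1]    # samples assumed not to be in the training data
print(f"AUROC = {auroc(seen, unseen):.2f}")
```

A model whose scores do not separate the two groups at all would land near 0.5 under this calculation, which is how the roughly 50% figures reported for GPT-4o Mini and GPT-3.5 Turbo should be read.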
The researchers suggest that the access violations might have occurred via the LibGen database, where all the tested O’Reilly books were available. They also noted that newer LLMs are better at distinguishing human-written from machine-generated text, which may reflect improvements in AI training methodologies, but that this does not reduce the method’s ability to classify data.
The study points out the potential for “temporal bias” in results due to changes in language over time. To mitigate this, the researchers conducted tests on GPT-4o and GPT-4o Mini, using data from the same period, to provide a balanced assessment.
While the findings focus primarily on OpenAI and O’Reilly Media, the report suggests a wider systemic issue in the use of copyrighted content for LLM training. Unpaid use of such data may undermine content quality and diversity on the internet, because it erodes revenue streams for professional content creators.
To address these concerns, the AI Disclosures Project calls for stricter accountability in AI corporations' pre-training processes. They suggest implementing liability provisions to incentivize better disclosure of the data origin. Such transparency could pave the way for commercial markets offering data licensing and remuneration for training purposes.
The report points to the EU AI Act’s potential role in setting positive disclosure standards. Properly enforced, these could ensure that intellectual property holders are informed when their work is used for model training, a crucial step toward sanctioned AI markets for data from content creators.
Despite some AI companies illicitly sourcing data for training models, a legitimate market is forming where developers obtain data through licensing deals. Firms like Defined.ai exemplify this movement by securing content with consent and ensuring personal data protection.
In conclusion, the study of 34 proprietary O’Reilly books presents evidence suggesting that GPT-4o may have been trained on non-public, copyrighted materials. These findings underscore the need for regulatory frameworks and commercial strategies to support fair and transparent AI model training.
The exploration of AI disclosure and data provenance opens new dialogs about ethical AI development, paving the way for innovations in AI-driven sectors.