Summary The Battle Over Books3 Could Change AI Forever | WIRED www.wired.com
3,618 words - html page
One Line
Copyright activists are demanding the removal of Books3, an AI training data set widely used by major corporations.
Key Points
- Books3 is a controversial generative AI training set at the center of a copyright dispute.
- The data set, created by independent AI researcher Shawn Presser, includes around 196,000 books and has been used by big companies like Meta and Bloomberg.
- Copyright activists, such as The Rights Alliance, are working to remove Books3 from the internet and have made some progress in taking it down from certain platforms.
- The debate surrounding Books3 raises questions about the balance between creators' rights and access to information in the age of AI.
- Some argue that cracking down on data sets like Books3 may benefit big corporations while making it harder for smaller companies and researchers to enter the field of generative AI.
Summaries
18 word summary
Copyright activists are calling for the removal of Books3, a popular AI training set used by major companies.
61 word summary
Controversy surrounds the popular AI training set, Books3, as copyright activists advocate for its removal. Created by researcher Shawn Presser, it has been used by major companies like Meta and Bloomberg. Critics argue that using copyrighted material in AI training sets disregards artists' rights. The outcome will impact the AI industry's future and the balance between creators' rights and information access.
129 word summary
Books3, a popular AI training set, is stirring controversy as copyright activists push for its removal from the internet. Created by independent researcher Shawn Presser, it has been used by major companies like Meta and Bloomberg to train their language models. Critics argue that using copyrighted material in AI training sets disregards artists' rights. Presser reverse-engineered a data set similar to the one used to train OpenAI's GPT-3 model, suspecting that OpenAI's set originated from Library Genesis. Books3, part of Eleuther's data set called The Pile, gained popularity but faced takedown notices from The Rights Alliance. The Authors Guild demands compensation, and some writers have filed lawsuits for copyright infringement. The outcome of this battle will determine the future of the AI industry and the balance between creators' rights and access to information.
435 word summary
Books3, a popular generative AI training set, is causing controversy as copyright activists seek to remove it from the internet. Created by independent AI researcher Shawn Presser, it has been used by major companies like Meta and Bloomberg to train their language models. Critics argue that using copyrighted material in AI training sets disregards the rights of artists.
Presser and his team reverse-engineered a data set similar to the one used by OpenAI for its GPT-3 model. They suspect that OpenAI's data set came from an online shadow library called Library Genesis. Presser scraped books from another shadow library, Bibliotik, using a script written by Aaron Swartz, and amassed a collection of 196,000 books, which he named Books3.
Books3 was released online as part of The Pile, a larger data set from the nonprofit artificial intelligence collective Eleuther, and quickly became popular for training AI models. However, The Rights Alliance, a Danish anti-piracy group, is determined to remove Books3 from the internet. The group has filed takedown notices and is pursuing legal action against organizations hosting the data set. It has also contacted companies like Meta and Bloomberg, which have used Books3 to train their models.
The Authors Guild has organized an open letter demanding compensation for the use of copyrighted data sets by generative AI companies. Some writers have even filed lawsuits against companies like Meta for copyright infringement. However, legal experts are uncertain about the outcome of these cases and believe that companies may be able to argue fair use.
The controversy surrounding Books3 raises important questions about the balance between creators' rights and the collective right to access information in the age of AI. Some suggest that generative AI training should shift to an opt-in model, using only works in the public domain or freely given. Efforts are being made to persuade AI companies to respect artists' wishes and provide transparency about their training data sources.
The outcome of this battle could have significant implications for the AI industry, determining who controls the data sets used to train AI models and whether smaller companies and researchers have access to them. It also highlights the need for greater clarity in copyright law and regulations surrounding AI training materials.
Ultimately, the decision about whether generative AI training on copyrighted material is acceptable will shape the future of the industry. Some argue that it is inevitable and can benefit smaller companies and researchers, while others believe it disregards the rights of creators. The resolution of this battle will determine the direction of AI development and the balance between creativity and access to information.
475 word summary
Copyright activists are trying to remove a popular generative AI training set called Books3 from the internet. The set was created by independent AI researcher Shawn Presser and has been used by big companies like Meta and Bloomberg to train their language models. However, critics argue that using copyrighted material in AI training sets disregards the rights of artists and should not be allowed.
Presser and his team recreated the GPT-3 model released by OpenAI in 2020. They reverse-engineered a data set similar to one used by OpenAI, suspecting that OpenAI's data came from an online shadow library called Library Genesis. Presser used a script written by Aaron Swartz to scrape books from another shadow library, Bibliotik, amassing a collection of 196,000 books. He named this corpus Books3.
Books3 was released online as part of The Pile, a larger data set from the nonprofit artificial intelligence collective Eleuther. It became a popular training data set for AI models. However, the Danish anti-piracy group The Rights Alliance is determined to remove Books3 from the internet. The group has filed takedown notices against organizations hosting the data set and is pursuing legal action to block sites that host it. It has also contacted companies like Meta and Bloomberg, which have trained their models using Books3.
The Authors Guild has organized an open letter to generative AI companies that use copyrighted data sets, demanding compensation for the use of authors' writings. Many writers have signed the letter, and some have filed lawsuits against companies like Meta for copyright infringement. However, legal experts are uncertain about the outcome of these cases and believe that companies may be able to argue fair use.
The controversy over Books3 raises questions about the balance between creators' rights and the collective right to access information in the age of AI. Some believe that generative AI training should shift to an opt-in model, where only works in the public domain or freely given are used in data sets. Efforts are being made to persuade AI companies to respect artists' wishes and provide transparency about their training data sources.
The outcome of this battle could have significant implications for the AI industry. It could determine who controls the data sets used to train AI models and whether smaller companies and researchers have access to them. The fight over Books3 also highlights the need for greater clarity in copyright law and regulations surrounding AI training materials.
In the end, the decision about whether generative AI training on copyrighted material is acceptable will shape the future of the industry. Some argue that it is inevitable and can benefit smaller companies and researchers, while others believe it disregards the rights of creators. The resolution of this battle will determine the direction of AI development and the balance between creativity and access to information.