Creating Constantly Improving Datasets for Ethical AI Training: A Focus on Spawning’s Goals

Jordan Meyer and Mathew Dryhurst joined forces to establish Spawning AI, a company aimed at empowering artists to have more control over the use of their works online. With their latest project, Source.Plus, the duo hopes to curate a “non-infringing” media library for AI model training.

The first initiative of Source.Plus is a dataset containing nearly 40 million public domain images and images under the Creative Commons’ CC0 license, which allows creators to waive essentially all legal rights to their works. Meyer claims that even though this dataset is substantially smaller than some other generative AI training sets, it is “high-quality” enough to train cutting-edge image-generating models.

“Our goal with Source.Plus is to build a universal ‘opt-in’ platform,” Meyer explained. “We want to make it simple for rights holders to offer their media for use in generative AI training on their own terms, and easy for developers to incorporate that media into their workflows.”

Rights management

The discussion around the ethics of training generative AI models, specifically those focused on art generation like Stable Diffusion and OpenAI’s DALL-E 3, continues to be a hot topic. The outcome of this debate will have immense effects on artists.

Generative AI models “learn” to create their outputs – such as photorealistic art – by training on a vast amount of relevant data, often images. Some developers of these models claim that fair use allows them to scrape data from public sources, regardless of the copyright status of the data. Others have tried to strike a balance by compensating or crediting content owners for their data contributions.

Meyer, CEO of Spawning, believes that there is no clear consensus on the best approach yet.

“When it comes to AI training, the default is to use the easiest available data, which may not always be ethically or responsibly sourced,” Meyer told TechCrunch in an interview. “Artists and rights holders have little control over how their data is used for AI training, and developers have not had access to high-quality alternatives that respect data rights.”

Source.Plus, currently in limited beta, builds upon Spawning’s existing tools for art provenance and usage rights management.

In 2022, Spawning released HaveIBeenTrained, a website that allows creators to opt out of training datasets used by partnering vendors, including Hugging Face and Stability AI. After securing $3 million in funding from investors such as True Ventures and Seed Club Ventures, Spawning launched ai.txt, which allows websites to “set permissions” for AI, and Kudurru, a system to defend against data-scraping bots.

Source.Plus marks Spawning’s first effort to build its own media library and curate it in-house. The initial image dataset, PD/CC0, can be used for both commercial and research purposes, according to Meyer.

“Source.Plus is not just a repository of training data; it’s an enhancement platform equipped with tools to support the training process,” Meyer clarified. “Our goal is to offer a high-quality, non-infringing CC0 dataset capable of supporting a powerful baseline AI model by the end of the year.”

Companies such as Getty Images, Adobe, Shutterstock, and AI startup Bria claim to use only ethically sourced data for training their models. (Getty even goes as far as calling their generative AI products “commercially safe.”) However, Meyer states that Spawning aims to set a “higher bar” for sourcing data ethically.

Source.Plus filters images based on opt-outs and other artist preferences for training, and provides information on where and how the images were sourced. It also excludes images that are not licensed under CC0, such as those with a Creative Commons BY 1.0 license, which requires attribution. Finally, Spawning monitors for copyright challenges from sources where someone other than the creator determines the copyright status of a work, like Wikimedia Commons.
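
To make the filtering described above concrete, here is a minimal Python sketch. It is not Spawning's implementation: the `ImageRecord` fields, the license labels, and the function names are all hypothetical, chosen only to illustrate keeping images that are CC0 or public domain and whose rights holders have not opted out.

```python
from dataclasses import dataclass

# Hypothetical record shape; the article does not describe Source.Plus's schema.
@dataclass(frozen=True)
class ImageRecord:
    url: str
    license: str     # assumed labels, e.g. "CC0", "PD", "CC-BY-1.0"
    opted_out: bool  # True if the rights holder opted out of AI training

# Assumed labels for CC0 and public domain works.
ALLOWED_LICENSES = {"CC0", "PD"}

def is_trainable(record: ImageRecord) -> bool:
    """Keep a record only if its license is allowed and nobody opted out."""
    return record.license in ALLOWED_LICENSES and not record.opted_out

def filter_trainable(records):
    """Return the records that pass both the license and opt-out checks."""
    return [r for r in records if is_trainable(r)]
```

In this sketch, an attribution-required license such as CC BY 1.0 is dropped simply because its label is not in the allowed set.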

“We thoroughly validate the reported licenses of all the images collected, and any questionable licenses are excluded. This is a step that many ‘fair’ datasets do not take,” Meyer affirmed.

The training data used for generative AI models, both open and commercial, has a history of containing problematic images, including violent, pornographic, and personally sensitive photos.

The maintainers of the LAION dataset were forced to remove one library after reports surfaced about the presence of medical records and depictions of child sexual abuse. Just this week, a study from Human Rights Watch revealed that one of LAION’s repositories contained the faces of Brazilian children without their consent or knowledge. Similarly, Adobe Stock, the stock media library Adobe uses to train its generative AI models, including the Firefly Image model for art generation, was discovered to contain AI-generated images from competitors such as Midjourney.

To counter this, Spawning has created classifier models to detect nudity, violence, personally identifiable information, and other undesirable elements in images. Recognizing that no classifier is perfect, Spawning plans to allow users to filter the Source.Plus dataset by adjusting the classifiers’ detection thresholds.
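
Adjustable detection thresholds of the kind described above might work roughly as in the following Python sketch. Everything here is an assumption made for illustration: the score labels, the dictionary layout, and the function names are hypothetical and do not come from Spawning.

```python
def passes_filters(scores, thresholds):
    """Return True if every classifier score is below its threshold.

    `scores` maps a label (e.g. "nudity") to a probability in [0, 1].
    A label missing from `scores` is treated as 0.0, which is a
    simplifying assumption of this sketch.
    """
    return all(scores.get(label, 0.0) < limit
               for label, limit in thresholds.items())

def filter_images(images, thresholds):
    """Keep only the images whose scores pass every threshold."""
    return [img for img in images if passes_filters(img["scores"], thresholds)]
```

Lowering a threshold makes the filter stricter: more borderline images are excluded, at the cost of discarding some that were harmless.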

“We have moderators that verify data ownership, and we have remediation features built in for users to report images that they believe are infringing. The entire trail of the data’s consumption can also be audited,” Meyer added.

Compensation

Most attempts at compensating creators for their data contributions to generative AI training have not been successful. Some platforms use opaque metrics to calculate payouts, while others have been criticized for offering artists minuscule amounts.

For instance, Shutterstock, a stock media library that has made deals worth tens of millions of dollars with AI vendors, contributes to a “contributors fund” for the artwork it uses to train its generative AI models or licenses to third-party developers. However, Shutterstock does not disclose how much artists can expect to earn, nor does it give them the option to determine their own pricing and terms. Third-party estimates suggest that compensation amounts to only $15 for every 2,000 images, which is not a significant sum.

Once Source.Plus exits its beta phase later this year and expands to datasets beyond PD/CC0, it will adopt a different approach compared to other platforms, allowing artists and rights holders to set their own prices for each download. Spawning will charge a flat fee of a “tenth of a penny,” according to Meyer.

Alternatively, customers can choose to pay Spawning a monthly fee of $10, in addition to the per-image download cost, for Source.Plus Curation. This subscription plan allows them to privately manage image collections, download the dataset up to 10,000 times a month, and get early access to new features such as “premium” collections and data enrichment.

“We will provide recommendations and suggestions based on current industry standards and internal metrics. Ultimately, though, the contributors to the dataset have the final say on what they believe is a fair compensation,” Meyer stated. “We have intentionally chosen this pricing model to ensure that the artists receive the majority of the revenue and have the ability to determine their own terms for participation. We believe this revenue split is much more favorable for artists than the traditional percentage revenue split and will result in higher payouts and increased transparency.”

If Source.Plus gains the traction that Spawning hopes for, the company plans to expand beyond images to include other types of media like audio and video. Spawning is already in talks with various firms to make their data available on Source.Plus. And, in the future, the company may also create its own generative AI models using data from the Source.Plus datasets.

“We envision a future where rights holders have a chance to participate fairly in the generative AI economy and receive just compensation,” Meyer said. “We are also hopeful that artists and developers who have been hesitant to engage with AI in the past can now do so in a way that respects fellow creators.”

Undoubtedly, Spawning has found a niche in the market. Source.Plus appears to be a credible attempt at bringing artists into the generative AI development process while allowing them to profit from their work.

As my colleague Amanda Silberling has written, the rise of apps such as the art-hosting community Cara – which saw a surge in usage after Meta announced its intention to train its generative AI on content from Instagram, including artwork – indicates that the creative community has reached a breaking point. Artists are looking for alternatives to companies and platforms that they perceive as thieves, and Source.Plus could very well be one.

Having said that, even if Spawning consistently acts in the best interest of artists (which may be a long shot, since it is a venture capital-backed company), it remains to be seen whether Source.Plus can scale up as successfully as Meyer expects. If social media has taught us anything, it is that moderation – particularly of millions of pieces of user-generated content – is an arduous task.

I guess we will find out soon enough.

Dylan Williams

Dylan Williams is a multimedia storyteller with a background in video production and graphic design. He has a knack for finding and sharing unique and visually striking stories from around the world.
