Generative AI’s Need for Data

Data plays a central role in the development of generative AI models, particularly Large Language Models (LLMs), which rely on vast quantities of data for training and refinement. For example, OpenAI reported that GPT-3, the predecessor of the models behind ChatGPT, started from about 45 terabytes of compressed plaintext collected from the internet before filtering, supplemented by digitized books and Wikipedia entries. Data collection on this scale raises significant concerns, including the inadvertent collection and use of personal data without individuals’ consent. Google AI researchers have also acknowledged that these datasets, which are often large and drawn from many sources, may contain sensitive personal information even when derived from publicly available data.

Let’s explore the common sources of data that generative AI developers draw on:

Publicly Accessible Data

Most training data for generative AI comes from publicly accessible datasets, and web scraping is the most common way to collect it. Scraping extracts large volumes of information from publicly accessible web pages. The resulting data is used for training and may also be sold or made freely available to other AI developers.
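
To make the extraction step of web scraping concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not how any particular AI developer builds a corpus: the `TextExtractor` class, the `extract_text` helper, and the sample page are all hypothetical, and a real scraper would first download pages over HTTP and handle far messier markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script and style blocks."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content; a real scraper would fetch this from a URL.
sample_page = """
<html>
  <head><title>Example profile</title><style>p { color: red; }</style></head>
  <body>
    <h1>Jane Doe</h1>
    <p>Looking for data science roles in Berlin.</p>
    <script>console.log("tracking");</script>
  </body>
</html>
"""

print(extract_text(sample_page))
```

Even in this toy page, the name and job-seeking details of the (fictional) profile owner pass straight through into the extracted text, which is how personal information can end up in a scraped dataset.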

Data obtained through web scraping often includes personal information shared by users on social media platforms like Facebook, Twitter, LinkedIn, Venmo, and other websites. While individuals may post personal information on such platforms for various reasons, such as connecting with potential employers or making new friends, they typically do not intend for their personal data to be used for training generative AI models.

User Data

Data that users share with generative AI applications, such as chatbots, may be stored and used for training without their knowledge or consent. For example, users interacting with chatbots that offer healthcare, advice, therapy, financial services, and other services might divulge sensitive personal information. The terms of service for such chatbots may mention that user data can be used to “develop and improve the service,” but critics argue that generative AI developers should seek affirmative consent from users or clearly disclose how user data is collected, used, and retained.

Given the transformative potential of generative AI models, many organizations have embedded them into their products or services to enhance their offerings. In some cases, such integrations also serve as a source of data, including consumers’ personal data, for training and fine-tuning these models.