Generative AI is a category of AI that excels at creating new content after learning patterns in real-world data. When provided with inputs or prompts, various generative AI models can generate diverse types of content. Here are some examples:
Text Generation Models: Text generation models that have been aligned (typically through Reinforcement Learning from Human Feedback) include OpenAI ChatGPT, Google PaLM 2, and Meta LLaMA-2-Chat. These models exhibit unprecedented (albeit imperfect) instruction-following capabilities, which has led to their adoption across many industries. Particularly surprising are their abilities to perform zero-shot and few-shot learning, language translation, and programming, and to fluently generate meaningful content across a vast number of domains.
Text-to-Image Models: Certain generative AI models, such as those underlying Stable Diffusion, Midjourney, and DALL-E, can produce, extend, or refine images from prompts.
Text-to-Video Generation: Other models, such as Meta’s Make-A-Video, can generate videos from prompts as well.
AI models with generative capabilities, such as ChatGPT and DALL-E, are also referred to by regulators as ‘general-purpose AI’ or ‘foundation models’. These AI models are trained on large sets of unlabelled data and can be applied to different tasks with minimal fine-tuning.
Two key technologies underlying the generative AI revolution are (a) transformers, and (b) diffusion.
Transformers are most commonly applied to text data but can also be used for images and audio. They are the basis for all modern Large Language Models (LLMs) because they allow neural networks to learn patterns in very large volumes of (text) training data. This is what underpins the remarkable capabilities observed in text generation models.
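The pattern-learning mechanism at the heart of the transformer is self-attention, which lets every token weigh every other token when building its representation. The following is a minimal sketch of scaled dot-product self-attention using NumPy; all function names and dimensions here are illustrative, not taken from any particular model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_*: (d_model, d_k) learned projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])       # pairwise token affinities
    # softmax over each row so every token's attention weights sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # weighted mix of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                       # 4 tokens, 8-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                                  # (4, 8): one vector per token
```

In a full transformer this operation is repeated across many heads and layers, interleaved with feed-forward networks, which is what makes it scale to the very large training corpora mentioned above.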
Diffusion models have overtaken Generative Adversarial Networks (GANs) as the neural models of choice for image generation. Unlike the error-prone image generation process of GANs, diffusion models construct an image through a gradual, iterative denoising process. The result is a myriad of new AI-based tools for generating and even editing images with useful outcomes.
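The iterative denoising idea can be illustrated with a toy 1-D example: start from pure noise and repeatedly subtract an estimate of the noise. In a real diffusion model a trained neural network predicts the noise at each step; here the predictor is hand-wired against a known target purely for illustration.

```python
import numpy as np

def denoise_step(x, step, total_steps, target):
    # A real model would *predict* the noise with a neural network;
    # here we fake the prediction using the known target (illustrative only).
    predicted_noise = x - target
    alpha = 1.0 / (total_steps - step)   # step size grows toward the end
    return x - alpha * predicted_noise

rng = np.random.default_rng(0)
target = np.linspace(-1, 1, 16)          # the "image" we want to recover
x = rng.normal(size=16)                  # start from pure Gaussian noise
for t in range(50):
    x = denoise_step(x, t, 50, target)   # gradually denoise toward the target
```

After the loop, `x` has converged to the target signal, mirroring how a diffusion sampler refines random noise into a coherent image over many small steps.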
According to McKinsey, generative AI alone has the potential to contribute between $2.6 trillion and $4.4 trillion to annual business revenues. More than 75% of this value is expected to come from the integration of generative AI into customer operations, marketing and sales, software engineering, and research and development activities.
Generative AI’s Need for Data
Data plays a central role in the development of generative AI models, particularly Large Language Models (LLMs). These models rely on vast quantities of data for training and refinement. For example, OpenAI’s ChatGPT was trained on an extensive dataset comprising over 45 terabytes of text data collected from the internet, including digitized books and Wikipedia entries. However, the extensive need for data collection in generative AI can raise significant concerns, including the inadvertent collection and use of personal data without the consent of individuals. Google AI researchers have also acknowledged that these datasets, often large and sourced from various places, may contain sensitive personal information, even if derived from publicly available data.
Let’s explore the common sources of data collection employed by generative AI developers:
The majority of training data for generative AI comes from publicly accessible data sets. Web scraping is the most common collection method: it involves extracting large volumes of information from publicly accessible web pages. This data is then used for training, repurposed for sale, or made freely available to other AI developers.
Data obtained through web scraping often includes personal information shared by users on social media platforms like Facebook, Twitter, LinkedIn, Venmo, and other websites. While individuals may post personal information on such platforms for various reasons, such as connecting with potential employers or making new friends, they typically do not intend for their personal data to be used for training generative AI models.
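The extraction half of web scraping is straightforward, which is part of why personal data ends up in training sets so easily. The sketch below pulls visible text out of fetched HTML using only Python's standard library; the page content is inlined so the example runs without network access (a real scraper would fetch pages with `urllib.request` or a crawling framework), and the profile text is invented for illustration.

```python
from html.parser import HTMLParser

class TextScraper(HTMLParser):
    """Collect all visible text nodes from an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

# Inlined stand-in for a fetched profile page (hypothetical content).
html = "<html><body><h1>Profile</h1><p>Jane Doe, jane@example.com</p></body></html>"
scraper = TextScraper()
scraper.feed(html)
print(scraper.chunks)   # ['Profile', 'Jane Doe, jane@example.com']
```

Note that the scraper makes no distinction between public-interest content and personal details like names or email addresses; everything visible on the page is swept into the output, which is exactly the concern raised above.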
Data shared by users with generative AI applications, such as chatbots, may be stored and used for training without the knowledge or consent of the data subjects. For example, users interacting with chatbots providing healthcare, advice, therapy, financial services, and other services might divulge sensitive personal information. While such chatbots may provide terms of service mentioning that user data may be used to “develop and improve the service,” critics argue that generative AI models should seek affirmative consent from users or provide clear disclosures about the collection, usage, and retention of user data.
Considering their transformative potential, many organizations have also embedded generative AI models into their products or services to enhance their offerings. In some cases, such integration can itself serve as a source of data, including consumers’ personal data, for the training and fine-tuning of these models.