Artificial intelligence companies are exploring a new avenue to obtain the massive amounts of data needed to develop powerful generative models: creating the information from scratch.
Microsoft, OpenAI and Cohere are among the groups testing the use of so-called “synthetic data”, computer-generated information used to train their AI systems known as large language models (LLMs), as they reach the limits of human-made data that can further improve the cutting-edge technology.
The launch of Microsoft-backed OpenAI’s ChatGPT last November has led to a flood of products, rolled out publicly this year by companies including Google and Anthropic, that can produce plausible text, images or code in response to simple prompts.