Jakarta, Indonesia Sentinel — Tesla CEO Elon Musk has joined other artificial intelligence (AI) experts in warning about a potential scarcity of real-world data for training AI models. Musk predicts that the “peak data” era for AI is approaching, suggesting that most accessible human knowledge has already been utilized in AI training.
“We’ve now exhausted basically the cumulative sum of human knowledge … in AI training,” Musk said during a livestreamed conversation on X, as reported by TechCrunch. “That happened basically last year,” he added.
Musk, who launched his own AI company, xAI, in 2023, said that technology companies may consequently need to rely increasingly on “synthetic data,” or data generated by AI itself, to enable self-learning processes.
“The only way to supplement it is with synthetic data,” Musk stated. “With synthetic data … AI will sort of grade itself and go through this process of self-learning.”
However, Musk cautioned that synthetic data poses challenges due to AI’s tendency to produce “hallucinations,” outputs that are inaccurate or nonsensical.
“Hallucinations make it difficult to trust synthetic material,” Musk said. “How can you tell if it’s a hallucinated answer or a real one?”
Data Exhausted by 2026
Andrew Duncan, director of foundational AI at the Alan Turing Institute in the UK, noted that Musk’s statements align with recent academic research predicting that publicly available data for AI models could be exhausted by 2026, as reported by The Guardian.
Duncan warned that over-reliance on synthetic data could result in “model collapse,” where the quality of AI outputs deteriorates.
“When you start feeding models synthetic material, you begin to see diminishing returns,” he explained, emphasizing the risks of biased and uncreative outputs.
The rise of AI-generated content online could exacerbate the issue, as such material may inadvertently be incorporated into future AI training datasets.
Growing Use of Synthetic Data
According to TechCrunch, tech giants such as Microsoft, Meta, OpenAI, and Anthropic are already leveraging synthetic data to train their flagship AI models. According to Gartner, 60% of the data used for AI and analytics projects in 2024 is expected to be synthetically generated.
Microsoft’s Phi-4 model, released as open-source earlier this week, was trained using a mix of synthetic and real-world data. Similarly, Google’s Gemma models and Meta’s latest Llama series relied on AI-generated data for fine-tuning. Anthropic also used synthetic data to develop Claude 3.5 Sonnet, one of its top-performing systems.
Risks and Rewards
Synthetic data offers significant advantages, including cost savings. TechCrunch reports that AI startup Writer developed its Palmyra X 004 model almost entirely with synthetic sources at a cost of just $700,000, far less than the $4.6 million estimated for a similar-sized OpenAI model.
Yet, the drawbacks of synthetic data remain substantial. Studies suggest that synthetic data can reinforce biases and limit creativity in AI models, eventually compromising their functionality.
As the AI industry grapples with these challenges, the debate over synthetic data underscores the delicate balance between innovation and reliability in advancing artificial intelligence.
(Raidi/Agung)