“In the age of infinite potential, even the mightiest AI can only thrive as long as the world feeds its hunger for knowledge—until the data well runs dry.” – MJ Martin
A recent LinkedIn post offered an insightful perspective on whether there is enough content to train the myriad of artificial intelligence platforms now rapidly emerging.
The post came from Sebastian Barros, Ericsson’s Regional Head of Wireless Solutions, a well-educated, deep thinker based in Mexico City.
Barros argues that we are running out of new and different sources of content with which to train AI systems. Since AI needs content to feed on, to learn from, and to derive meaningful responses from, this matters: citing a Nature report, Barros claims that all existing, publicly available data sources will be fully exhausted by 2028, which is very soon. He flags this as a crisis for AI development.
If AI platforms cannot access fresh content, then how will they function?
Barros’s thesis is to source content from the telecommunications carriers, which transport an astronomical volume of fresh content, including IoT traffic and live video streams.
My concern is that using this data is not as simple as Barros proposes. The carriers are merely transporters of this content; they do not own it, nor do they hold rights to it.
In many countries, such as the European Union member states or Canada, where I reside, accessing this data at the carrier level would violate existing privacy laws.
Much of the data is already securely encrypted in transit and therefore not even accessible, even if the carriers wanted to try to monetize it.
Several ethical and moral concerns also stand in the way of this strategy, even though it is an interesting thought.
The Nature Report
The Nature article (Jones, 2024) opens by observing that the Internet is a vast ocean of human knowledge, but it is not infinite, and that artificial intelligence (AI) researchers have nearly sucked it dry.
The article highlights a critical inflection point in the development of AI, particularly regarding the scaling of large language models (LLMs). For the past decade, these models have benefited from increasing computational power and vast amounts of training data, leading to groundbreaking advances. However, as the field matures, two primary challenges are emerging:
1. Energy Demands of Scaling
- Larger models require exponentially more energy to train, raising concerns about sustainability and efficiency.
- This ballooning energy usage not only incurs higher costs but also poses environmental challenges, driving a need for innovations in model architecture and training efficiency.
2. Imminent Training Data Shortage
- Researchers project that by 2028, the size of a typical training dataset will match the total stock of available public online text, effectively exhausting this resource for training purposes.
- Tightening restrictions on the use of proprietary content, such as that from newspaper publishers, further exacerbate this issue, shrinking the ‘data commons’ and limiting the pool of accessible data.
- This potential bottleneck could already be affecting the capabilities of newer models, as suggested by experts like Shayne Longpre.
Implications for the Future
The limits of scaling necessitate a shift in focus for AI research and development. Possible areas of innovation include:
- Smarter Data Utilization: Finding ways to make better use of smaller datasets, such as through synthetic data generation or advanced data augmentation techniques (a small augmentation sketch follows this list).
- Model Optimization: Designing architectures that achieve comparable performance with fewer resources, such as through sparsity, modular networks, or other efficiency-focused strategies (see the pruning sketch below).
- Ethical and Sustainable Data Practices: Developing frameworks for sharing and curating data responsibly, including equitable access and consent from content creators.
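To make the data-utilization bullet concrete, here is a minimal, self-contained Python sketch of text data augmentation: it expands a small corpus by generating noisy variants of each sentence through random word dropout and adjacent-word swaps. This is a toy illustration only; production pipelines typically rely on stronger techniques such as back-translation or paraphrasing with a generative model.

```python
import random

def augment(sentence: str, n_variants: int = 3, p_drop: float = 0.1,
            seed: int = 0) -> list[str]:
    """Generate noisy variants of a sentence via random word dropout
    and one adjacent-word swap, two simple text-augmentation moves."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_variants):
        out = [w for w in words if rng.random() > p_drop]  # word dropout
        if len(out) > 1:                                   # adjacent swap
            i = rng.randrange(len(out) - 1)
            out[i], out[i + 1] = out[i + 1], out[i]
        variants.append(" ".join(out))
    return variants

corpus = ["the model learns patterns from labelled examples"]
expanded = corpus + [v for s in corpus for v in augment(s)]
print(len(expanded), "examples from", len(corpus), "original")
```

Even this crude approach multiplies the number of training examples; the open question, of course, is how much signal such variants actually add compared to genuinely new data.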
The coming years will likely see a convergence of these strategies as researchers and developers grapple with the challenges of sustainability and scalability in AI.
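To illustrate the sparsity idea from the Model Optimization bullet above, the sketch below applies one-shot magnitude pruning to a random weight matrix, zeroing the smallest 90% of weights. It is a simplified illustration: real systems usually prune gradually during training and then fine-tune, but the core operation is the same.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    """Keep only the largest-magnitude weights; zero out the rest.

    `sparsity` is the fraction of weights to remove (0.9 -> keep 10%).
    """
    k = int(weights.size * sparsity)
    # the k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k)[k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

rng = np.random.default_rng(seed=42)
dense = rng.normal(size=(512, 512))
sparse = magnitude_prune(dense, sparsity=0.9)
print(f"weights kept: {np.count_nonzero(sparse) / sparse.size:.1%}")
```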
The AI industry is confronting significant challenges due to the impending scarcity of high-quality training data. Researchers at Epoch AI project that by 2028, the volume of data required for training AI models will match the total available public online text, leading to a potential data shortage.
In response, leading AI companies are exploring alternative strategies to mitigate this issue:
- Synthetic Data Generation: Firms like OpenAI and Anthropic are investing in creating artificial datasets that mimic real-world data. This approach aims to supplement existing data sources and maintain the momentum of AI development (a simple template-based sketch follows this list).
- Unconventional Data Sources: Companies are seeking new, previously untapped data reservoirs to diversify and expand their training datasets. This includes exploring proprietary data and other non-traditional information sources.
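As a deliberately simple illustration of the synthetic-data idea, the Python sketch below mass-produces labelled training rows by filling slot values into hand-written templates. The templates, slot values, and "intent" label here are all hypothetical; the frontier labs reportedly generate synthetic data with their own models and then filter it for quality, which this toy example does not attempt to reproduce.

```python
import json
import random

# Hypothetical templates and slot values for a small intent-classification task.
TEMPLATES = [
    "How do I {action} my {device}?",
    "My {device} won't {action}. What should I check?",
]
SLOTS = {
    "action": ["restart", "update", "pair", "reset"],
    "device": ["router", "smart meter", "thermostat", "sensor hub"],
}

def generate(n: int, seed: int = 0) -> list[dict]:
    """Fill templates with random slot values to produce labelled examples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        fill = {slot: rng.choice(values) for slot, values in SLOTS.items()}
        rows.append({"text": template.format(**fill), "intent": fill["action"]})
    return rows

print(json.dumps(generate(3), indent=2))
```

The appeal is obvious: the generator can emit unlimited rows. The risk, equally obvious, is that models trained mostly on machine-made data may simply learn the generator's patterns rather than anything new about the world.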
Conclusions
Despite these efforts, the data crunch may prompt a paradigm shift in AI development. The industry could move away from large, general-purpose language models toward smaller, specialized models tailored for specific tasks. This transition may enhance efficiency and reduce the dependency on vast amounts of data.
While the anticipated data shortage poses a significant challenge, AI developers are proactively seeking innovative solutions to sustain progress and adapt to the evolving data landscape.
REFERENCES:
Barros, S. (2024). AI Is Starving for Data – Can Telcos Feed It? LinkedIn. Retrieved December 15, 2024, from https://www.linkedin.com/in/sebastianbarros/recent-activity/all/
Jones, N. (2024). The AI revolution is running out of data. What can researchers do? Nature. Retrieved December 16, 2024, from https://www.nature.com/articles/d41586-024-03990-2
About the Author:
Michael Martin is the Vice President of Technology with Metercor Inc., a Smart Meter, IoT, and Smart City systems integrator based in Canada. He has more than 40 years of experience in systems design for applications that use broadband networks, optical fibre, wireless, and digital communications technologies. He is a business and technology consultant. He was a senior executive consultant for 15 years with IBM, where he worked in the GBS Global Center of Competency for Energy and Utilities and the GTS Global Center of Excellence for Energy and Utilities. He is a founding partner and President of MICAN Communications and before that was President of Comlink Systems Limited and Ensat Broadcast Services, Inc., both divisions of Cygnal Technologies Corporation (CYN: TSX). Martin served on the Board of Directors for TeraGo Inc. (TGO: TSX) and on the Board of Directors for Avante Logixx Inc. (XX: TSX.V). He has served as a Member of SCC ISO-IEC JTC 1/SC-41 – Internet of Things and related technologies, ISO – International Organization for Standardization, and as a member of the NIST SP 500-325 Fog Computing Conceptual Model working group, National Institute of Standards and Technology. He served on the Board of Governors of the University of Ontario Institute of Technology (UOIT) [now OntarioTech University] and on the Board of Advisers of five different colleges in Ontario. For 16 years he served on the Board of the Society of Motion Picture and Television Engineers (SMPTE), Toronto Section. He holds three master’s degrees, in business (MBA), communication (MA), and education (MEd). As well, he has three undergraduate diplomas and five certifications in business, computer programming, internetworking, project management, media, photography, and communication technology. He has completed over 30 next-generation MOOC continuing-education courses in IoT, Cloud, AI and Cognitive Systems, Blockchain, Agile, Big Data, Design Thinking, Security, Indigenous Canada awareness, and more.

