A new report from AI data provider Appen reveals that companies are struggling to source and manage the high-quality data needed to power AI systems as artificial intelligence expands into enterprise operations.
Appen’s 2024 State of AI report, which surveyed over 500 U.S. IT decision-makers, found that generative AI adoption surged 17% in the past year; at the same time, organizations now confront significant hurdles in data preparation and quality assurance. The report shows a 10% year-over-year increase in bottlenecks related to sourcing, cleaning, and labeling data, underscoring the complexities of building and maintaining effective AI models.
“As AI models tackle more complex and specialised problems, the data requirements also change,” Si Chen, Head of Strategy at Appen, told VentureBeat in an interview. “Companies are finding that just having lots of data is no longer enough. To fine-tune a model, data needs to be extremely high-quality, meaning that it is accurate, diverse, properly labelled, and tailored to the specific AI use case.”
While the potential of AI continues to grow, the report identifies several key areas where companies are encountering obstacles. Below are the top five takeaways from Appen’s 2024 State of AI report:
1. Generative AI adoption is soaring — but so are data challenges
The adoption of generative AI (GenAI) has grown by an impressive 17% in 2024, driven by advancements in large language models (LLMs) that allow businesses to automate tasks across a wide range of use cases. From IT operations to R&D, companies are leveraging GenAI to streamline internal processes and increase productivity. However, the rapid uptick in GenAI usage has also introduced new hurdles, particularly around data management.
“Generative AI outputs are more diverse, unpredictable, and subjective, making it harder to define and measure success,” Chen told VentureBeat. “To achieve enterprise-ready AI, models must be customized with high-quality data tailored to specific use cases.”
Custom data collection has emerged as the primary method for sourcing training data for GenAI models, reflecting a broader shift away from generic web-scraped data in favor of tailored, reliable datasets.
2. Enterprise AI deployments and ROI are declining
Despite the excitement surrounding AI, the report found a worrying trend: fewer AI projects are reaching deployment, and those that do are showing less ROI. Since 2021, the mean percentage of AI projects making it to deployment has dropped by 8.1%, while the mean percentage of deployed AI projects showing meaningful ROI has decreased by 9.4%.
This decline is largely due to the increasing complexity of AI models. Simple use cases like image recognition and speech automation are now considered mature technologies, but companies are shifting toward more ambitious AI initiatives, such as generative AI, which require customized, high-quality data and are far more difficult to implement successfully.
Chen explained, “Generative AI has more advanced capabilities in understanding, reasoning, and content generation, but these technologies are inherently more challenging to implement.”
3. Data quality is essential — but it’s declining
The report highlights a critical issue for AI development: data accuracy has dropped nearly 9% since 2021. As AI models become more sophisticated, the data they require has also become more complex, often requiring specialized, high-quality annotations.
A staggering 86% of companies now retrain or update their models at least once every quarter, underscoring the need for fresh, relevant data. Yet, as the frequency of updates increases, ensuring that this data is accurate and diverse becomes more difficult. Companies are turning to external data providers to help meet these demands, with nearly 90% of businesses relying on outside sources to train and evaluate their models.
“While we can’t predict the future, our research shows that managing data quality will continue to be a major challenge for companies,” said Chen. “With more complex generative AI models, sourcing, cleaning, and labeling data have already become key bottlenecks.”
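The accuracy and diversity requirements Chen describes can be approximated with a simple dataset audit before each retraining cycle. Below is a minimal illustrative sketch, not drawn from the Appen report: it flags examples where annotators disagree (a proxy for label accuracy) and reports the label distribution (a proxy for diversity). The data shape, field names, and agreement threshold are all assumptions for demonstration.

```python
from collections import Counter

def audit_labels(examples: list[dict], min_agreement: float = 0.9) -> dict:
    """Toy data-quality audit: flag examples whose annotators disagree,
    and report the label distribution across the dataset.

    Each example is assumed to look like:
        {"text": ..., "annotations": ["pos", "pos", "neg", ...]}
    """
    flagged = []
    label_counts = Counter()
    for ex in examples:
        votes = Counter(ex["annotations"])
        majority_label, majority_votes = votes.most_common(1)[0]
        agreement = majority_votes / len(ex["annotations"])
        label_counts[majority_label] += 1
        if agreement < min_agreement:
            flagged.append(ex["text"])  # low agreement: queue for re-annotation
    return {"flagged": flagged, "label_distribution": dict(label_counts)}

data = [
    {"text": "great product", "annotations": ["pos", "pos", "pos"]},
    {"text": "meh, it's fine", "annotations": ["pos", "neg", "neutral"]},
]
report = audit_labels(data)
print(report["flagged"])  # ambiguous examples to send back to annotators
```

In practice, teams running quarterly retraining cycles would run checks like this on each new data batch, sending low-agreement examples back for re-annotation rather than training on them.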
4. Data bottlenecks are worsening
Appen’s report reveals a 10% year-over-year increase in bottlenecks related to sourcing, cleaning, and labeling data. These bottlenecks are directly impacting the ability of companies to successfully deploy AI projects. As AI use cases become more specialized, the challenge of preparing the right data becomes more acute.
“Data preparation issues have intensified,” said Chen. “The specialized nature of these models demands new, tailored datasets.”
To address these problems, companies are focusing on long-term strategies that emphasize data accuracy, consistency, and diversity. Many are also seeking strategic partnerships with data providers to help navigate the complexities of the AI data lifecycle.
5. Human-in-the-loop is more vital than ever
While AI technology continues to evolve, human involvement remains indispensable. The report found that 80% of respondents emphasized the importance of human-in-the-loop machine learning, a process where human expertise is used to guide and improve AI models.
“Human involvement remains essential for developing high-performing, ethical, and contextually relevant AI systems,” said Chen.
Human experts are particularly important for ensuring bias mitigation and ethical AI development. By providing domain-specific knowledge and identifying potential biases in AI outputs, they help refine models and align them with real-world behaviors and values. This is especially critical for generative AI, where outputs can be unpredictable and require careful oversight to prevent harmful or biased results.
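One common human-in-the-loop pattern is confidence-based routing: the model handles predictions it is sure about, and uncertain cases are escalated to a human reviewer whose decisions can later feed back into training. The sketch below is illustrative only; the model, the reviewer, and the threshold are stand-ins, not anything described in the report.

```python
def model_predict(text: str) -> tuple[str, float]:
    """Stand-in classifier: returns a (label, confidence) pair."""
    if "refund" in text.lower():
        return ("billing", 0.95)
    return ("unknown", 0.40)

def human_review(text: str) -> str:
    """Stand-in for a human annotator resolving uncertain cases."""
    return "support"

def classify(text: str, threshold: float = 0.8) -> tuple[str, bool]:
    """Route low-confidence predictions to a human reviewer.

    Returns the final label and whether a human was involved.
    """
    label, confidence = model_predict(text)
    if confidence >= threshold:
        return (label, False)          # model is confident: accept as-is
    return (human_review(text), True)  # uncertain: escalate to a person

print(classify("I want a refund"))  # confident, handled by the model
print(classify("My app crashes"))   # low confidence, escalated to a human
```

The design choice here is that humans see only the uncertain slice of traffic, which keeps review costs bounded while concentrating human expertise on exactly the unpredictable outputs the report flags as needing oversight.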
Check out Appen’s full 2024 State of AI report right here.