The Importance of Ground Truth Data in AI Applications: An Overview


Introduction

We are witnessing explosive growth in the development of artificial intelligence (AI) applications across various industries, from virtual agents to healthcare diagnostics and autonomous driving. This growth is powered by vast datasets and advanced algorithms that enable AI systems to learn and make decisions in real-world scenarios.

However, the effectiveness of these systems largely depends on the quality of the data used for training and evaluation. This is where the concept of ground truth plays a crucial role, as it refers to the accurate, real-world data that acts as the gold standard for training AI models and assessing their performance. Without it, even the most sophisticated algorithms can generate unreliable outcomes, leading to potentially harmful consequences. Therefore, understanding and leveraging ground truth is critical for effective and responsible AI development.

Understanding Ground Truth Data

The term ground truth comes from geology and geospatial sciences, where actual information was collected on the ground to validate data acquired through remote sensing, such as satellite imagery or aerial photography. Since then, the concept has been adopted in other fields, particularly machine learning and artificial intelligence, to refer to the real-world data used for training and testing models. The fundamental idea remains the same: ensuring that the data used for technological solutions reflects reality, thus enhancing the reliability and effectiveness of said solutions.

Ground truth data is used throughout an AI model’s training and evaluation cycle. Since the dataset used for training should never be the same as the one used for evaluation – evaluating a model on the very data it was trained on yields misleadingly optimistic results – the data is typically split into three subsets: the training set (usually around 60-80%), the validation set (around 10-20%), and the test set (also around 10-20%).

The training set, which constitutes the majority of the dataset, is used to train the model and adjust its parameters. This subset should be sufficiently large and representative to ensure the model will produce accurate predictions on unseen data.

The validation set is used during model training to evaluate the learned model and make adjustments. It helps gauge the model’s ability to generalize to data it hasn’t already seen and to prevent overfitting, where the model performs well on training data but fails to make accurate predictions on new data. The validation set is also crucial for tuning hyperparameters such as the learning rate (which balances how quickly the model converges to an optimal solution) or the number of epochs (which determines how long the model trains before it has learned enough from the data without overfitting), allowing you to pick the values that yield the best results.

Finally, the test set is used to evaluate the final performance of a trained model, serving as a reliable benchmark of its accuracy and effectiveness. Keeping the test set separate throughout the training process will thus ensure that the model has learned relevant patterns and can make accurate predictions beyond the training and validation datasets.
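As a concrete illustration, here is a minimal sketch of such a three-way split using scikit-learn. The 70/15/15 proportions, the file name, and the column layout are assumptions made for the example, not a prescription.

```python
# A minimal sketch of a 70/15/15 train/validation/test split.
# The file name "ground_truth.csv" is a hypothetical labeled dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("ground_truth.csv")

# First carve out the test set (15%), then split the remainder into
# training (70% of the total) and validation (15% of the total).
train_val, test = train_test_split(data, test_size=0.15, random_state=42)
train, val = train_test_split(train_val, test_size=0.15 / 0.85, random_state=42)

print(len(train), len(val), len(test))
```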

Although collecting ground truth data is an indispensable step for ensuring accurate model performance, it is definitely not an easy task.

Collecting ground truth data

Ground truth data can be collected from various sources, including industry in-house data (like product usage logs, operations records, website analytics, databases, etc.), public records, academic repositories, websites, or open datasets. Depending on the model modality, this data may consist of text, image, video, speech, or sound.

Some AI applications can leverage raw data directly, often using advanced algorithms capable of finding patterns and structures. For example:

  • Automatic speech recognition systems require raw audio recordings of human speech in order to transcribe it automatically.
  • Text or music generation systems require massive amounts of text data or audio files of music.
  • Recommender systems require user interaction data (like clicks, likes, views, etc.) to suggest items that are most relevant to a particular user.

On the other hand, a vast number of AI applications require annotated or labeled data, formatted in different ways, so that models can learn from specific examples and generalize that knowledge to new data (a sketch of what such annotations can look like follows this list). For example:

  • Information extraction systems require text data annotated with specific entities (e.g., names, dates, locations, phone numbers) to automatically extract information that can be useful in a variety of use cases, such as virtual assistants in customer service, information retrieval in search engines, or document processing across all kinds of industries.
  • Sentiment Analysis systems require text data labeled with a sentiment score (e.g., positive, negative, neutral) to determine sentiment about a product or service, which will help organizations enhance customer experiences.
  • Image detection and classification systems require images with annotated bounding boxes and respective labels (e.g., objects, animals, people) so that systems can detect them automatically. This technology can be applied across several industries, including healthcare (identifying anomalies in X-rays or MRIs), retail (recognizing items at checkout stations), automotive (detecting pedestrians, vehicles, road signs, or obstacles to ensure safe driving), manufacturing (monitoring equipment for maintenance prediction), and infrastructure engineering (inspecting bridges, roads, or buildings for signs of structural problems).
  • Facial recognition systems require images of faces with labels for the person's features, which can be used to identify unauthorized access in restricted areas, to recognize individuals in boarding processes, or to create personalized avatars in virtual environments.
  • Document classification systems require documents labeled with categories that can include the content type (e.g., contracts, reports, articles, letters, briefs), the industry (e.g., healthcare, retail, education, apparel), the area (e.g., sales, marketing, human resources), or the format (e.g., text, spreadsheet, presentation) to automatically categorize knowledge bases for easy access and reference.
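To make this more tangible, here is a brief sketch of what a few annotated ground truth records might look like for two of the tasks above (sentiment analysis and named entity recognition). The field names, label sets, and character offsets are illustrative assumptions rather than any standard format.

```python
# Hypothetical annotated records; schema and labels are invented for illustration.
sentiment_examples = [
    {"text": "The delivery was fast and the packaging was great.", "label": "positive"},
    {"text": "The meal was fine.", "label": "neutral"},
]

ner_examples = [
    {
        "text": "The Duke of Sussex makes a brief return to the UK",
        # Entities as (start, end, label) character spans.
        "entities": [(4, 18, "PERSON"), (47, 49, "LOCATION")],
    },
]
```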

Challenges in collecting ground truth data

As we may expect, collecting this type of data comes with several significant challenges. Below, we briefly outline some of the most critical ones.

Data diversity

Collecting ground truth data involves sourcing data that accurately represents the real-world scenarios of the intended industry. The data also needs to be balanced so that no part of it is underrepresented, which could otherwise lead to poor model performance and bias (a simple way to audit this is sketched after the list below). Thus, it’s imperative to cover a wide variety of aspects, which can include:

  • Source diversity. Text or music generation applications will require data from various genres (e.g., books, articles, news, or technical documents for text systems; classical, pop, rock, hip-hop, or reggae for music applications).
  • Scenario diversity. Speech-to-text systems should include data from both silent and noisy environments. The noisy environments should feature diverse types of background noise, such as music, traffic, overlapping speech, and the ambient sounds typical of restaurants or shopping centers.
  • Demographic diversity such as age, gender, education, socio-economic background, or ethnicity. For example, facial recognition systems highly benefit from this diversity to perform fairly across all user groups.
  • Temporal diversity. Forecasting systems (such as financial, meteorological, or healthcare) require historical data for their predictions. Also, almost all systems need ongoing data maintenance to keep the datasets current and representative of real-world scenarios and avoid outdated outputs. 
  • Event diversity. Including edge cases or unusual examples also ensures that the models can handle unexpected or less common situations.
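As a simple illustration of the balance point above, the following sketch audits the distribution of a few attributes with pandas before training. The file and column names are hypothetical.

```python
# A minimal sketch for auditing dataset balance; the file and column
# names ("age_group", "gender", "label") are illustrative assumptions.
import pandas as pd

data = pd.read_csv("ground_truth.csv")

# Share of each group; heavily skewed distributions flag potential bias.
for column in ["age_group", "gender", "label"]:
    print(data[column].value_counts(normalize=True).round(3), "\n")
```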

Time and cost

As expected, data sourcing can require a large investment of time and money, particularly when no open datasets are available or sufficient for the specific needs of the project. Without access to readily available datasets, organizations usually have to resort to complex and costly approaches, such as setting up their own data collection pipelines or negotiating permissions for proprietary content.

Additionally, much of this data requires manual annotation, which is a meticulous, time-consuming, and costly process that often demands domain experts to ensure accuracy. Consequently, companies may incur significant time and financial burdens, which can lead to the development of AI systems that lack quality and diversity, resulting in unfair and unreliable applications.

Accurate and consistent data annotations

When manually annotated data is needed, companies need to use annotators, i.e., professionals responsible for analyzing and labeling the data. Using multiple annotators for each data point can improve the accuracy and reliability of annotations. However, ensuring the quality and consistency of the annotated data remains a significant challenge. Also, in more subjective annotation tasks, differing interpretations among annotators can result in inconsistencies. Here are some examples:

  • In sentiment analysis, different annotators may interpret the sentiment differently based on their perspective, cultural background, or understanding of the context. For example, “The meal was fine” can be seen as neutral, slightly negative, or positive by different annotators.
  • In image segmentation, different annotators may have different opinions on object boundaries, particularly when objects are partially obscured or overlapping.
  • In named entity recognition, challenges arise from differing interpretations of what constitutes an entity, the ambiguity of entities, and the span of the entities, leading annotators to disagree on when and how to label them. For example, in the sentence “The Duke of Sussex makes a brief return to the UK”, some annotators may annotate “Duke of Sussex” as a ‘title’, while others may annotate it as a ‘person’. Also, some annotators may include the definite article “The” in the annotation, while others may exclude it.
  • In mean opinion score tasks, annotators may also have different perceptions of the naturalness of synthesized speech.

In addition to the subjectivity inherent in some annotation tasks, which can lead to varied interpretations, it’s important to recognize that human annotators can also introduce errors that may compromise data quality. These errors can arise not only from human fallibility but also from a variety of other factors, such as lack of domain expertise, unclear or misinterpreted instructions, fatigue, or cognitive overload. Such human-induced errors can significantly impact the performance and reliability of models trained or evaluated using this annotated data.

Besides ensuring that annotation guidelines are clear and detailed, it is also important to implement methods and strategies that can help identify systematic errors or inconsistencies. Some of these strategies include Inter-Annotator Agreement (which measures how often annotators agree on their decision for a certain category), the Pearson Correlation Coefficient (which assesses the linear relationship between different annotators’ labels and is commonly used in subjective tasks, like sentiment analysis or mean opinion score), automated quality checks (which can include scripts that randomly reassign the same task to the same annotators to ensure that they are consistent and attentive to their work), or manual spot checks (where expert annotators randomly review annotated data to promptly identify and address inconsistent or erroneous annotations).
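As a rough illustration of the first two strategies, the sketch below computes Cohen’s kappa (one common Inter-Annotator Agreement measure) and the Pearson correlation for two hypothetical annotators; the labels and scores are invented for the example.

```python
# Hypothetical annotations from two annotators on the same items.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr

# Categorical sentiment labels for five items.
annotator_a = ["positive", "neutral", "negative", "positive", "neutral"]
annotator_b = ["positive", "neutral", "negative", "neutral", "neutral"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

# Mean opinion scores (1-5) for five synthesized speech samples.
scores_a = [4.5, 3.0, 2.5, 4.0, 5.0]
scores_b = [4.0, 3.5, 2.0, 4.0, 4.5]
print("Pearson r:", pearsonr(scores_a, scores_b)[0])
```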

Ethical considerations

Accounting for ethical aspects in ground truth data collection is crucial to ensure that the process respects the rights and privacy of individuals involved and to promote trust, fairness, and integrity in AI applications. Here are some key ethical aspects to consider:

  • Data privacy: Data collection must adhere to privacy laws and regulations such as the General Data Protection Regulation (in the European Union) or the California Consumer Privacy Act. For example, data scraped from the internet might contain personally identifiable information, resulting in privacy breaches. To prevent this, all sensitive personal information should be anonymized or pseudonymized to safeguard individuals’ identities.
  • Copyright compliance: Data collection should comply with all relevant laws governing data usage. For instance, data gathered from the internet may contain copyrighted materials, and using it without permission can violate intellectual property rights.
  • Informed consent: Whenever applicable, individuals whose data is being collected should be fully informed about the purpose and use of their data and give explicit consent.
  • Data transparency: Data should be collected from transparent sources to ensure its authenticity and relevance. It is essential to establish and maintain clear documentation that includes information about the provenance of the datasets, their characteristics, how they were obtained and selected, the cleaning methodologies and labeling procedures, if applicable, etc.
  • Ethical content: Data collection should exclude ethically problematic content, such as hate speech or violent material, to prevent the perpetuation of harmful, abusive, or offensive behavior.
  • Fair representation: Data collection should represent diverse and equitable demographics to avoid biases or prejudices that could unfairly disadvantage specific groups.

When AI systems adhere to the principles of transparency, explainability, and interpretability, we will be able to understand and interpret their outputs and decision-making processes. This will enhance user trust and address the ethical and legal concerns mentioned above. For more detailed insights on the importance of evaluation data in AI systems, refer to our blog post "LLM Evaluation at Scale."

Techniques to leverage ground truth data collection

As mentioned above, high-quality data is essential for ensuring the performance, accuracy, and overall quality of responses from AI applications. However, gathering these datasets can be extremely costly, and existing ones can be difficult to access, posing significant obstacles to the development of robust and reliable AI models. By leveraging some advanced methodologies, organizations can overcome these hurdles, ensuring that their datasets are both accurate and representative. This section explores some techniques that have emerged to improve ground truth data collection and address some of these critical challenges.

Crowdsourcing 

Crowdsourcing is usually defined as “the activity of getting information or help for a project or a task from a large number of people, typically using the internet” (Oxford Learner’s Dictionaries). There are several online platforms equipped with proper annotation tools and intuitive interfaces, as well as a diverse pool of human annotators (like Amazon Mechanical Turk, Appen, Neevo, Mighty AI, Clickworker, or Prolific), that can gather large amounts of high-quality data quickly. However, crowdsourcing can be costly (depending on the task, language, precise requirements, and project management time and effort), and it is essential to ensure data consistency by providing proper guidelines and the quality control mechanisms mentioned above.

Automated tools

There are also several tools powered by AI models that help perform a variety of data annotation tasks for text, speech, image, or video, streamlining the process and reducing manual effort. Human review is still needed to correct misclassifications and ensure proper data labeling. This hybrid approach combines the speed of automated tools with the accuracy and discernment of human reviewers, helping organizations achieve high-quality data annotations that are both time-efficient and reliable for AI model training and evaluation.
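A minimal sketch of this hybrid pattern might look like the following: a model pre-labels each example, and anything below a confidence threshold is routed to human reviewers. The model object, its predict interface, and the 0.9 threshold are hypothetical placeholders, not a reference to any particular tool.

```python
# A minimal sketch of hybrid annotation: auto-label confident predictions,
# route uncertain ones to humans. `model` and the threshold are placeholders.
CONFIDENCE_THRESHOLD = 0.9

def pre_annotate(texts, model):
    auto_labeled, needs_review = [], []
    for text in texts:
        label, confidence = model.predict(text)  # assumed (label, score) API
        if confidence >= CONFIDENCE_THRESHOLD:
            auto_labeled.append({"text": text, "label": label, "source": "model"})
        else:
            needs_review.append({"text": text, "suggested_label": label})
    return auto_labeled, needs_review
```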

Data augmentation

Data augmentation is the process of artificially generating new data from existing data with slight modifications. In essence, it enhances the dataset by artificially introducing variations while preserving the core characteristics of the original data. This technique is primarily used to increase the size and variability of small datasets for the training process to improve the robustness and generalization ability of the models. 

While leveraging existing data to generate more examples can be faster than collecting new data, some augmentation techniques may require complex transformations or simulations and be computationally expensive.
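For instance, image datasets are often augmented with simple geometric and photometric transformations. The sketch below uses torchvision; the specific transforms and parameters are just one illustrative choice.

```python
# A minimal image-augmentation sketch; transform choices are illustrative.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

image = Image.open("sample.jpg")  # hypothetical input image
augmented = augment(image)        # a new, slightly modified variant
```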

Synthetic data generation

Synthetic data generation is the process of creating artificial data that mimics the properties of real-life data. Contrary to data augmentation, synthetic data usually does not contain actual values from the original dataset, as it involves creating entirely new data points that do not exist in the original dataset. Its main purpose lies in addressing data scarcity in the training process, like gaps in the datasets and edge case scenarios. 

While synthetic data can provide ample amounts of data where real-world data may be scarce or hard to access, ensuring that it accurately reflects the nuances of real-world scenarios can also be a challenge. Therefore, it is crucial to validate that this data is acceptable for use, especially in critical fields like healthcare or finance.
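As a toy illustration, the sketch below generates a small synthetic tabular dataset by sampling from distributions whose parameters are assumed to be estimated from the real data; production-grade approaches typically rely on more sophisticated generative models and validation steps.

```python
# A toy synthetic-data sketch: sample new records from distributions whose
# parameters (means, rates) are assumed to come from the real dataset.
import numpy as np

rng = np.random.default_rng(seed=42)
n_samples = 1_000

synthetic = {
    "age": rng.normal(loc=42, scale=12, size=n_samples).clip(18, 90).round(),
    "monthly_visits": rng.poisson(lam=3.5, size=n_samples),
    "churned": rng.binomial(n=1, p=0.18, size=n_samples),
}
```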

Filling the data accessibility gap

As we can see, numerous constraints can hinder access to the high-quality data needed to train and evaluate reliable AI applications. From sourcing data – which ultimately needs to comply with several ethical considerations and diversity aspects – to data annotation, this can be a complex, time-consuming, and costly process. In addition, organizations building natural language processing systems, which are language-dependent, can struggle to find datasets in the intended language. This is especially true for low-resource languages (i.e., languages that have few digital resources for computational processing), which can lead to digital exclusion and reduced access to effective AI solutions.

To address some of the aforementioned challenges, a number of openly licensed training datasets aim to provide data transparency and promote access, including Common Pile (forthcoming), Common Corpus, and YouTube-Commons.

Common Pile is a successor to The Pile, an English dataset developed by EleutherAI on principles of transparency and interpretability. Both datasets cover a wide range of content (to account for data diversity) and are composed only of openly licensed and public domain data (to account for copyright protection and data transparency).

Common Corpus and YouTube-Commons are two multilingual open datasets developed by Pleias. The first is the largest open dataset in the public domain: built from open data, including administrative, cultural, and scientific records, it contains around 500 billion words. The second consists of transcripts of YouTube videos totaling nearly 30 billion words.

On the speech side, Common Voice (by Mozilla) aims to create a multilingual, open voice dataset for building speech recognition applications by having contributors donate voice recordings and written sentences.

The availability of such openly licensed training datasets is crucial for advancing AI development. These datasets address data scarcity and foster data transparency and accessibility, enabling researchers and developers to create more accurate and diverse AI models.

Mozilla.ai is also committed to providing high-quality ground truth data for the evaluation of AI models. The company is focusing on creating an open-source platform that supports developers in navigating the complex AI landscape, allowing them to evaluate models, understand the results, and make informed decisions about the best model for their use cases. Even when developers don’t have ground truth data for evaluation, we are committed to generating it, so that they can analyze and understand which model works best for their specific use case.

Main insights

Ground truth data is the cornerstone of effective AI model training and evaluation. Despite the challenges in obtaining and maintaining high-quality data, its importance cannot be overstated.

For AI systems’ training, the use of high-quality ground truth data ensures that models are accurate, as they are provided with correct examples to learn from. It reduces bias, as representative and diverse datasets ensure that models make fair predictions across diverse population groups. It enhances robustness, as capturing a wide range of scenarios, including edge cases and rare events, helps models become more resilient to varied and unexpected inputs while preventing overfitting. Finally, it supports compliance with ethical standards, as high-quality data ensures adherence to privacy, ethical, and regulatory requirements, safeguarding user privacy and upholding responsible AI practices.

For AI systems’ evaluation, the use of ground truth data provides a reliable benchmark for measuring the performance of AI models. It allows for meaningful comparisons between different models and algorithms, facilitating informed decision-making in model selection. It helps uncover disparities and biases in model predictions, allowing for timely mitigation of unfair behavior or erroneous predictions. Finally, it builds reliability and confidence, as high-quality evaluation data fosters trust and increases acceptance of the AI system in real-world applications.
