Data provenance: Enhancing AI authenticity

The next wave of innovation in AI will rely on business data, as concerns materialize that the trove of freely-available internet data used for training AI models is running out. Businesses need guidance on frameworks such as Kigen IoT SAFE that unify data provenance. We explore why this is a good thing both for businesses and the future of AI itself.

Authors:

Bee Hayes-Thakore, VP of Marketing at Kigen

Paul Bradley, VP of Solution Sales at Kigen; Chair, GSMA IoT SAFE Group

How secure is the data that AI is trained on?

In the current digital landscape, securing data is paramount as enterprises increasingly rely on artificial intelligence (AI) to drive decision-making. The integrity and authenticity of data are critical, especially as organizations harness AI capabilities to gain insights and maintain competitive advantage. To navigate this complex environment, businesses must adopt robust technology choices and standards-backed approaches to secure data effectively. One promising solution is an implementation that leverages GSMA IoT SAFE (IoT SIM Applet For Secure End-to-End Communication), which provides a standardized method for ensuring data transport security and provenance, thereby enhancing the authenticity of AI outputs. Such an implementation, found in Kigen IoT SAFE, extends the high-grade, tamper-resistant protection of the eSIM to enterprise credentials, not just those of the mobile network.

The big picture: Why does it matter? 

Popular AI systems disclose little basic information about their training data. The pace of innovation has prompted calls from AI developers, AI application consumers, and policymakers for more systematic data documentation. Such issues are acute for the ‘dataset-of-datasets’: massive collections of hundreds of datasets where original provenance information is absent or has been lost for lack of standard structures. This bears directly on the debate around data transparency, vetting, privacy, representation, bias, copyright infringement, and detailed tracing as data moves through AI applications. It’s an important consideration for everyone who cares about the implementation of AI and, more broadly, for the future of responsible and trustworthy AI.


Get cyber smarts for AI

Kigen has extended its security solutions for unlocking enterprise AI’s value through devices and data. Read the full report, packed with the latest trends, best practices, and perspectives on technology and culture for an AI-first organization.


The imperative for securing data

Data breaches and cyber threats are ever-present risks that can compromise sensitive information, disrupt operations, and erode trust. As AI systems become more prevalent, the potential damage from data tampering or manipulation grows exponentially. Therefore, securing data from its inception to its application in AI models is essential. This requires a multifaceted approach that combines advanced encryption, access controls, and data provenance mechanisms.

As concerns intensify that the data used for training large AI models is running out, the next wave of AI innovation is expected to leverage business data. To get ahead, businesses themselves, not closed-source AI companies, need ownership and control of their proprietary models. Here lies an opportunity that businesses already digitalizing key business processes and assets can exploit. As Vincent Korstanje, CEO of Kigen, spotlighted recently in Business Reporter’s Digital Transformation Special Edition, IoT devices at the edge will give enterprises a starting point to own, control, and implement cost-effective AI, tuned to business needs, that yields long-term success. Cellular IoT will certainly feature among your business-critical digital assets, and each device contains a SIM or eSIM that can serve as the secure element and root of trust on which IoT SAFE relies.

IoT SAFE in the context of AI data provenance

IoT SAFE is an initiative that aims to establish a standardized framework for securing data in the Internet of Things (IoT) ecosystem. By embedding security features directly into IoT devices through the SIM, eSIM, or integrated SIM (iSIM), IoT SAFE ensures that data is protected from the point of collection to its eventual use in AI systems. The framework leverages SIMs and secure elements within devices to perform cryptographic operations in a black-box manner, ensuring data integrity and authenticity.
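To make this concrete, below is a minimal sketch, in Python, of the signing model IoT SAFE enables. It simulates the SIM with a software-held key using the open-source cryptography package; on a real device the private key would live inside the SIM, eSIM, or iSIM, and the host application would delegate signing to the IoT SAFE applet rather than handling the key itself. The class and names here are illustrative assumptions, not a Kigen or GSMA API.

```python
# A minimal sketch of SIM-backed signing, simulated in software.
# Assumption: in a real deployment the private key is generated and held
# inside the SIM/eSIM, and sign() is delegated to the IoT SAFE applet.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

class SimulatedSecureElement:
    """Stand-in for a SIM-resident key: sign() works, the key never leaves."""

    def __init__(self):
        self._private_key = ec.generate_private_key(ec.SECP256R1())
        self.public_key = self._private_key.public_key()

    def sign(self, payload: bytes) -> bytes:
        # ECDSA over SHA-256, a typical profile for constrained devices
        return self._private_key.sign(payload, ec.ECDSA(hashes.SHA256()))

se = SimulatedSecureElement()
reading = b'{"sensor": "temp-01", "value": 21.7}'
signature = se.sign(reading)
print(f"{len(signature)}-byte signature produced on-device")
```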

The significance of IoT SAFE extends beyond IoT devices. Its principles of data provenance and authenticity are crucial for AI, which relies on vast amounts of data to train models and generate insights. By ensuring that data is untampered and verifiable back to its source, IoT SAFE helps mitigate risks associated with data poisoning and other malicious activities that could compromise AI outputs.

Simplified access control and cryptographic verification

With IoT SAFE-secured data, access control and cryptographic verification become more straightforward. Secure elements and SIMs within IoT devices can manage keys and perform cryptographic operations autonomously, reducing the need for complex key management systems. This simplifies the process of ensuring that only authorized entities can access sensitive data. Furthermore, cryptographic verification becomes more efficient as each data piece carries a unique identifier and cryptographic signature, making it easier to authenticate and trace the data’s history.
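The receiving side is just as simple to sketch. The snippet below, again illustrative rather than a real API, accepts a reading only if its signature verifies against the public key registered for that device identifier; how the registry is populated (for example, from credentials enrolled at manufacture) is an assumption here.

```python
# Hedged sketch of server-side verification. The device registry mapping
# identifiers to public keys is an assumption for illustration.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

def verify_reading(registry, device_id, payload, signature):
    """Accept data only if it verifies against the registered device key."""
    public_key = registry.get(device_id)
    if public_key is None:
        return False  # unknown device: no provenance, reject
    try:
        public_key.verify(signature, payload, ec.ECDSA(hashes.SHA256()))
        return True
    except InvalidSignature:
        return False  # payload tampered with, or signed by a different key

# Demo with a locally generated key standing in for a SIM-held one.
key = ec.generate_private_key(ec.SECP256R1())
payload = b'{"sensor": "temp-01", "value": 21.7}'
sig = key.sign(payload, ec.ECDSA(hashes.SHA256()))
registry = {"device-001": key.public_key()}
print(verify_reading(registry, "device-001", payload, sig))         # True
print(verify_reading(registry, "device-001", payload + b"x", sig))  # False
```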


Data integrity and authenticity for AI with Kigen IoT SAFE

Data provenance refers to the ability to trace and verify the origins and history of data throughout its lifecycle. In the context of AI, provenance is vital for ensuring that the data used to train models is authentic and trustworthy – an area we explore further in Kigen’s Cyber smarts for AI guide for business leaders. Authentic data inputs into a Large Language Model (LLM), in turn, enhance the trustworthiness of AI-generated insights and decisions.

IoT SAFE supports data provenance by embedding unique identifiers and cryptographic signatures into data at the point of collection. These identifiers are securely injected at device manufacture and can be remotely regenerated using on-board key generation technology. Their use means that data can be traced back to its original source, providing a verifiable history of the data’s journey. When integrated into AI systems, this provenance data allows enterprises to validate the authenticity of the inputs, leading to more accurate and trustworthy AI outputs.
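The enrollment step this implies can be sketched as follows: a key pair is generated on the device (with IoT SAFE, inside the SIM itself via on-board key generation), and only the public half leaves the device, bound to a device identifier. The record format below, and the use of the SIM’s ICCID as the identifier, are illustrative assumptions.

```python
# Illustrative enrollment sketch: the private key never leaves the device;
# only the public key and an identifier are registered with the enterprise.
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ec

private_key = ec.generate_private_key(ec.SECP256R1())  # stays on-device

public_pem = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
).decode()

enrollment_record = {
    "device_id": "89001012012341234567",  # e.g. the SIM's ICCID (illustrative)
    "public_key": public_pem,
}
# Sent once to the enterprise's device registry; from then on, any data
# signed by this device can be traced back to this identity.
print(enrollment_record["device_id"])
```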


Meeting key elements of data provenance with IoT SAFE

Here’s how IoT SAFE satisfies key requirements outlined in leading definitions of data provenance:

1. Modality and Source Agnostic

IoT SAFE is designed to be modality and source agnostic, meaning it can secure data regardless of its format or origin. This is crucial for AI systems that integrate diverse data types (text, images, sensor data) from multiple sources; a short sketch after the list below illustrates the point.

  • Flexible Security Protocols: IoT SAFE employs flexible security protocols that can be applied to any data type, ensuring that all forms of data are uniformly protected.
  • Universal Integration: The framework is designed to integrate with various IoT devices and data sources, ensuring that data from different modalities can be securely captured and transmitted.
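The sketch below shows why hash-then-sign is modality agnostic: once any payload is serialized to bytes, the same digest pipeline applies, whatever the bytes represent.

```python
# Any modality, one pipeline: serialize to bytes, digest, then sign.
import hashlib

def digest(payload: bytes) -> bytes:
    return hashlib.sha256(payload).digest()

text = "maintenance log entry".encode()
sensor = bytes([0x01, 0x7F, 0x3A])           # raw sensor frame (illustrative)
image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16  # truncated image bytes (illustrative)

# The same digest would then be handed to the SIM for signing,
# regardless of what the bytes represent.
for name, payload in [("text", text), ("sensor", sensor), ("image", image)]:
    print(name, digest(payload).hex()[:16])
```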

2. Verifiable

For AI systems, especially those using Retrieval-Augmented Generation (RAG), it is essential to verify the authenticity and integrity of data.

  • Cryptographic Signatures: IoT SAFE utilizes cryptographic signatures to ensure data integrity. Each data packet is signed using secure cryptographic methods, allowing for verification of its authenticity.
  • End-to-End Encryption: Data is encrypted end-to-end, ensuring that it cannot be tampered with during transmission. This makes it possible to verify that the data received is exactly what was sent.

3. Structured

Structured data is crucial for AI as it allows for easier integration, processing, and analysis.

  • Metadata Inclusion: IoT SAFE allows for embedding metadata within data packets, providing structure and context. This metadata can include timestamps, source identifiers, and data type information.
  • Standardized Data Formats: Adhering to standardized data formats, IoT SAFE passes a hash of the structured or unstructured data to the SIM for signing, while confidentiality is provided by the session keys generated during the IoT SAFE-secured TLS handshake. Combined, these can yield a format that better supports AI model ingestion, as the sketch after this list illustrates.
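As a hedged illustration of such a packet, the sketch below signs a precomputed payload hash (mirroring the hash handed to the SIM) and wraps it with metadata; the field names and the software-held key are assumptions for illustration.

```python
# Sketch of a signed, metadata-carrying packet. Assumption: the SIM signs
# a precomputed hash of the payload, simulated here with a software key.
import hashlib
import json
import time

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec, utils

key = ec.generate_private_key(ec.SECP256R1())  # stand-in for a SIM-held key
payload = b'{"vibration_rms": 0.82}'
payload_hash = hashlib.sha256(payload).digest()

# Sign the precomputed digest, mirroring the hash handed to the SIM.
signature = key.sign(payload_hash, ec.ECDSA(utils.Prehashed(hashes.SHA256())))

packet = {
    "metadata": {
        "device_id": "device-001",           # source identifier
        "timestamp": int(time.time()),       # collection time
        "content_type": "application/json",  # data type information
    },
    "payload_hash": payload_hash.hex(),
    "signature": signature.hex(),
}
print(json.dumps(packet, indent=2))
```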

4. Extensible and Adaptable

AI systems need data provenance mechanisms that can adapt to new requirements and scale with growing data volumes.

  • Modular Design: The modular design of IoT SAFE allows for easy extension and adaptation. New security features and protocols can be added as needed without overhauling the entire system.
  • Scalability: IoT SAFE is built to scale with the increasing number of IoT devices and the volume of data they generate, ensuring continuous security and provenance tracking as systems grow.

5. Symbolically Attributable

It is important for AI systems to attribute data to its original source clearly and unequivocally.

  • Unique Identifiers: IoT SAFE can sign each data packet, which, together with its identifier, can be used to link it back to its source. This ensures that data can always be traced back to its origin.
  • Audit Trails: The framework provides comprehensive audit trails, documenting the data’s journey from its creation through to its use in AI systems. This allows for clear attribution and accountability, as the sketch after this list illustrates.
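One simple way to realize such an audit trail is a hash chain, in which each entry embeds the hash of the previous one so that any rewrite of history breaks the chain. The sketch below is illustrative only; a production system might use a database, ledger, or managed log service instead.

```python
# Tamper-evident audit trail sketch: each entry chains to the previous one.
import hashlib
import json
import time

def append_entry(trail: list, event: dict) -> None:
    prev_hash = trail[-1]["entry_hash"] if trail else "0" * 64
    entry = {"event": event, "prev_hash": prev_hash, "ts": int(time.time())}
    serialized = json.dumps(entry, sort_keys=True).encode()
    entry["entry_hash"] = hashlib.sha256(serialized).hexdigest()
    trail.append(entry)

trail: list = []
append_entry(trail, {"device": "device-001", "action": "data_collected"})
append_entry(trail, {"device": "device-001", "action": "ingested_by_model"})
print(trail[1]["prev_hash"] == trail[0]["entry_hash"])  # True: chain intact
```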

Implementing with Retrieval-Augmented Generation (RAG) for enhanced AI accuracy

Retrieval-Augmented Generation (RAG) is an advanced AI technique that combines generative models with retrieval mechanisms to produce more accurate and contextually relevant results. By incorporating external knowledge sources into the generation process, RAG enhances the AI’s ability to provide precise and informed responses.

In the enterprise context, RAG can be implemented to improve the accuracy of AI systems by leveraging proprietary and external data sources. For instance, a financial institution could use RAG to enhance its AI-driven market analysis by integrating real-time market data and historical financial records. This combination allows the AI to generate insights that are not only accurate but also contextually rich. While we call out RAG here, many of the new developments in Federated Learning (FL), which allow trained models to be adapted to local contexts, would potentially benefit in the same way.

The integration of RAG with standards-backed approaches like IoT SAFE further bolsters data security and authenticity. By ensuring that the external data sources used in RAG are verifiable and authentic, enterprises can trust that the augmented data feeding into AI models is reliable. This synergy between RAG and IoT SAFE represents a holistic approach to achieving high-quality AI outputs.
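A minimal sketch of that synergy: retrieved documents are filtered to those whose provenance verifies before they reach the model. Retrieval, verification, and generation are all stubbed below, and the names are assumptions rather than a real framework.

```python
# Illustrative sketch: only provenance-verified documents augment the prompt.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    source_id: str
    signature_valid: bool  # outcome of a check like the verification sketch above

def retrieve(query: str) -> list:
    # Stub: a real system would query a vector store or search index here.
    return [
        Document("Q3 revenue rose 4%.", "erp-feed-01", True),
        Document("Unverified forum rumour.", "unknown", False),
    ]

def provenance_verified_rag(query: str) -> str:
    trusted = [d for d in retrieve(query) if d.signature_valid]
    context = "\n".join(f"[{d.source_id}] {d.text}" for d in trusted)
    # Stub for the generation step; only verified context reaches the model.
    return f"CONTEXT:\n{context}\n\nQUESTION: {query}"

print(provenance_verified_rag("How did revenue change last quarter?"))
```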

Considerations towards a data provenance standard for AI

A standard data provenance framework could have far-reaching positive impacts on responsible AI development, as outlined throughout. However, such a standard is challenging to design and adopt, as evidenced by the fact that it does not yet exist despite numerous calls from diverse stakeholders. It is important to underscore that while a well-designed standard could further important social objectives, a poorly designed one could entrench problematic practices: an overly onerous framework could fail to be adopted or, worse, impose excessive costs on under-resourced researchers and developers, further benefiting large corporate AI developers. A standard data provenance framework could address these diverse needs, but existing solutions tend to tackle individual transparency problems in isolation. Rather than proposing a new standard, the question is: how can existing standards be unified to effectively address the range of challenges enterprises face today?

Read more at kigen.com/cybersmartsforAI/