Data Refinement & Indexing
The Vana Data Access Layer expands upon our existing data ingress processes to also include secure data refinement and indexing to support later permissioned query access by applications and application builders.
Data Refinement

Data Refinement Steps
Data refinement is the process of ensuring that ingested datasets meet verifiable quality and security standards for future access before decentralized storage in IPFS, directly integrated into a DataDAO's Proof-of-Contribution (PoC) mechanism.
The refinement process can be broken down into three clearly defined steps:
- Normalization to structure data to adhere to the on-chain schema definition and processing found in the DataRefinementRegistry contract.
- Masking of ingested data to optionally suppress any information DLP owners do not want to provide access to.
- Encryption of the end result to protect against unauthorized access with strictly defined access control mechanisms.
Once refined, the final data output is structured, masked (optional), and encrypted, forming a solid foundation for decentralized data commerce.
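
To make the three steps concrete, the sketch below chains them as pure functions over a single record. The schema, the masked fields, and the use of Fernet symmetric encryption are all assumptions for illustration, not the actual Vana refiner interface.

```python
"""Minimal sketch of the three refinement steps: normalize, mask, encrypt.
The schema, the masked fields, and Fernet encryption are assumptions for
illustration, not the actual Vana refiner interface."""
import json
from cryptography.fernet import Fernet

# Hypothetical schema; in practice it comes from the schema definition
# registered on-chain for the refiner.
SCHEMA = {"user_id": str, "country": str, "email": str, "activity_score": float}
MASKED_FIELDS = {"email"}  # fields the DLP owner chooses to suppress

def normalize(raw: dict) -> dict:
    """Step 1: coerce raw input into the schema's fields and types."""
    return {field: typ(raw[field]) for field, typ in SCHEMA.items()}

def mask(record: dict) -> dict:
    """Step 2 (optional): suppress fields the DataDAO will not expose."""
    return {k: ("***" if k in MASKED_FIELDS else v) for k, v in record.items()}

def encrypt(record: dict, key: bytes) -> bytes:
    """Step 3: encrypt the result so only permissioned access succeeds."""
    return Fernet(key).encrypt(json.dumps(record).encode())

key = Fernet.generate_key()
raw = {"user_id": 42, "country": "DE", "email": "a@b.co", "activity_score": "0.9"}
ciphertext = encrypt(mask(normalize(raw)), key)
```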
Why This Matters
- Standardized, structured datasets are easier to validate and integrate in end-user applications, increasing downstream value creation.
- Masking lets DataDAOs control which data is exposed to data consumers.
- Encryption ensures that even distributed datasets remain protected, improving overall ecosystem safety.
- Well-structured, indexed, privacy-preserving datasets support high-throughput access, making consuming applications more scalable and cost-effective.
Storage & Availability

Data Refinement & Storage
As part of a DataDAO's Proof-of-Contribution (PoC) process, refined data (normalized, optionally masked, and encrypted) is uploaded to a decentralized storage network chosen by the DataDAO, such as IPFS. The resulting content identifier (CID) is recorded on-chain within the corresponding file entity, tracking all file refinements across different schema versions.
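
For concreteness, the storage step might look like the sketch below, assuming a local IPFS daemon reachable via `ipfshttpclient` and a hypothetical `addRefinement(fileId, cid)` method on the registry contract; the actual contract interface may differ.

```python
"""Sketch: upload refined output to IPFS and record its CID on-chain.
Assumes a local IPFS daemon and a hypothetical addRefinement(fileId, cid)
method on the registry contract; the real ABI and method names may differ."""
import ipfshttpclient
from web3 import Web3
from web3.contract import Contract

def store_refinement(w3: Web3, registry: Contract, file_id: int, ciphertext: bytes) -> str:
    # Upload the encrypted, refined output; the IPFS daemon returns the CID.
    with ipfshttpclient.connect() as ipfs:
        cid = ipfs.add_bytes(ciphertext)
    # Record the CID against the file entity on-chain (hypothetical method).
    tx_hash = registry.functions.addRefinement(file_id, cid).transact()
    w3.eth.wait_for_transaction_receipt(tx_hash)
    return cid
```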
Recording the CID triggers an on-chain event, prompting the Query Engine to securely index the newly refined data into a centralized schema database running within a Trusted Execution Environment (TEE). The Query Engine then processes permissioned query requests, ensuring secure, controlled access to the refined dataset.
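
Conceptually, the indexing step is an event handler: fetch the ciphertext by CID, decrypt inside the TEE, and load the result into the schema database. In the sketch below, the event shape, the `fetch_from_ipfs` and `decrypt_in_tee` helpers, and SQLite standing in for the schema database are all illustrative assumptions.

```python
"""Sketch of the event-driven indexing step. The event shape and the
IPFS/decryption helpers are illustrative stand-ins, and SQLite stands in
for the TEE-hosted schema database."""
import json
import sqlite3

def fetch_from_ipfs(cid: str) -> bytes: ...          # placeholder: gateway fetch
def decrypt_in_tee(ciphertext: bytes) -> bytes: ...  # placeholder: TEE-held key

def handle_refinement_event(event: dict, db: sqlite3.Connection) -> None:
    # 1. Pull the encrypted, refined file referenced by the on-chain CID.
    ciphertext = fetch_from_ipfs(event["cid"])
    # 2. Decrypt inside the TEE; plaintext never leaves the enclave.
    record = json.loads(decrypt_in_tee(ciphertext))
    # 3. Insert into the schema database so permissioned queries can run.
    db.execute(
        "INSERT INTO activity (user_id, country, activity_score) VALUES (?, ?, ?)",
        (record["user_id"], record["country"], record["activity_score"]),
    )
    db.commit()
```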
This architecture guarantees:
- Immutable data integrity, ensuring tamper-proof, verifiable storage through IPFS and decentralized solutions.
- Granular data access control with strictly enforced on-chain permissioning mechanisms.
- Real-time data availability with event-driven indexing and aggregation for up-to-date, query-ready datasets.
- Resilient, fault-tolerant storage, providing high availability and failover protection through decentralized redundancy.
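
To make the access-control guarantee concrete, the Query Engine can consult the on-chain permissioning contract before executing anything. The `hasPermission` view below is an assumed interface for illustration, not a documented Vana contract method.

```python
"""Sketch: enforce on-chain permissions before running a query.
hasPermission is an assumed contract view; the real permissioning
interface is defined by the on-chain contracts."""
import sqlite3
from web3.contract import Contract

def run_permissioned_query(
    permissions: Contract, consumer: str, schema_id: int, sql: str, db: sqlite3.Connection
) -> list:
    # Check the consumer's on-chain grant for this schema before touching data.
    if not permissions.functions.hasPermission(consumer, schema_id).call():
        raise PermissionError(f"{consumer} has no access grant for schema {schema_id}")
    # Only permissioned requests ever reach the TEE-hosted schema database.
    return db.execute(sql).fetchall()
```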
DataDAO Data Refiners

Data Refiner Creation
Data refiners, defined by DataDAOs and DLP owners, serve as the primary processing units responsible for determining how data is normalized, as well as the structure of the resulting output.
There are two core components to this:
- The data refiner image, which, when executed as part of the Proof-of-Contribution (PoC) step, performs the data refinement on the raw input data. This is documented on-chain as the data refiner's "refinement instruction url" in the Data Refiner Registry contract.
- The schema definition, which is uploaded to IPFS and documents the structure of the refined dataset as a queryable database. The IPFS content identifier (CID) is documented on-chain as the data refiner's schema in the Data Refiner Registry contract.
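
As an illustration of the second component, a schema definition might describe the tables and columns of the queryable database before being pinned to IPFS. The JSON layout below is an assumption for the sketch; Vana's actual schema format may differ.

```python
"""Illustrative schema definition of the kind a refiner might publish.
The JSON layout is an assumption; the actual format is defined by the
Data Refiner Registry tooling."""
import json

schema_definition = {
    "name": "example_dlp_activity",  # hypothetical dataset name
    "version": "1.0.0",
    "dialect": "sqlite",
    "tables": {
        "activity": {
            "user_id": "TEXT",
            "country": "TEXT",
            "activity_score": "REAL",
        }
    },
}
# Serialized and pinned to IPFS; the resulting CID is what gets recorded
# as the refiner's schema in the Data Refiner Registry contract.
schema_json = json.dumps(schema_definition, indent=2)
```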