Data Refinement & Publishing

A guide for DataDAO creators on how to process, structure, and publish datasets for secure access on the Vana network.

For your DataDAO's data to be useful to application builders and researchers, it must be processed from its raw form into a structured, encrypted, and queryable format. This guide outlines the steps to refine and publish your dataset on the Vana network.

Before You Begin

This is a technical guide that assumes you have a working Proof-of-Contribution (PoC) mechanism for your DataDAO. The data refinement process described here is an extension of your existing data contribution flow.

Step 1: Define Your Dataset Schema

First, define a schema that describes the tables and columns of your refined dataset. The refiner you build in Step 2 normalizes raw data into this structure.

You must then upload this schema definition to a decentralized storage provider like IPFS and record its content identifier (CID), as you will need it for the next step.

An example schema definition looks like this:

{
  "name": "social_media_posts",
  "version": "0.0.1",
  "description": "Refined social media dataset",
  "dialect": "sqlite",
  "schema": "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id TEXT, content TEXT, timestamp DATETIME);"
}
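
Before uploading the schema definition, it can help to confirm that its "schema" field is valid SQLite. The following is a minimal sketch, not part of the Vana tooling: it loads the example definition above, runs the DDL against an in-memory database, and lists the resulting tables and columns. The helper name check_schema and the fields it checks are assumptions made for illustration.

```python
import json
import sqlite3

# Schema definition copied from the example above.
SCHEMA_JSON = """
{
  "name": "social_media_posts",
  "version": "0.0.1",
  "description": "Refined social media dataset",
  "dialect": "sqlite",
  "schema": "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id TEXT, content TEXT, timestamp DATETIME);"
}
"""


def check_schema(schema_json: str) -> dict:
    """Parse a schema definition and confirm its DDL runs in SQLite.

    Hypothetical helper: returns a mapping of table name -> column names.
    """
    definition = json.loads(schema_json)
    # Only the two fields this check relies on are verified here.
    for field in ("dialect", "schema"):
        if field not in definition:
            raise ValueError(f"schema definition is missing '{field}'")
    if definition["dialect"] != "sqlite":
        raise ValueError("this check only understands the sqlite dialect")

    conn = sqlite3.connect(":memory:")
    try:
        # Raises sqlite3.Error if the DDL is malformed.
        conn.executescript(definition["schema"])
        table_names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in table_names
        }
    finally:
        conn.close()


tables = check_schema(SCHEMA_JSON)
print(tables)
```

Running a check like this locally catches DDL typos before the schema's CID is recorded on-chain, where correcting it would require a new upload.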

Step 2: Create and Register a Data Refiner

A "Data Refiner" is a Dockerized script that you create to process incoming raw data. This script runs as part of your Proof-of-Contribution flow and performs three key actions:

  1. Normalize: Converts raw data into a structured SQLite format that conforms to the schema you defined in Step 1.
  2. Mask (Optional): Hides or removes any sensitive data that should not be accessible through standard queries.
  3. Encrypt: Encrypts the final, structured dataset to protect it from unauthorized access.
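
The first two actions can be sketched as follows, using the posts table from Step 1. This is an illustration under stated assumptions, not the reference refiner: the raw field names (author, text, created_at) are hypothetical, and hashing user_id is just one possible masking choice. The encryption action is deliberately left out because this guide does not specify the scheme.

```python
import hashlib
import sqlite3

# DDL taken from the Step 1 example schema.
SCHEMA = (
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id TEXT, "
    "content TEXT, timestamp DATETIME);"
)


def mask_user_id(user_id: str) -> str:
    """Mask: replace a raw identifier with a truncated SHA-256 digest.

    Illustrative only; a real refiner may need salting or outright removal.
    """
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:16]


def refine(raw_posts: list, db_path: str = ":memory:") -> sqlite3.Connection:
    """Normalize: load raw records into a SQLite database matching SCHEMA."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    rows = [
        (
            post["id"],
            mask_user_id(str(post["author"])),  # hypothetical raw field name
            post["text"].strip(),               # hypothetical raw field name
            post["created_at"],                 # hypothetical raw field name
        )
        for post in raw_posts
    ]
    conn.executemany(
        "INSERT INTO posts (id, user_id, content, timestamp) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


# Made-up sample input, for demonstration only.
raw = [
    {"id": 1, "author": "alice", "text": "  hello world ", "created_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "author": "alice", "text": "second post", "created_at": "2024-01-02T12:30:00Z"},
]
conn = refine(raw)
refined_rows = conn.execute(
    "SELECT id, user_id, content, timestamp FROM posts ORDER BY id"
).fetchall()
```

Because the digest is deterministic, posts by the same contributor still share a user_id after masking, so per-user queries keep working without exposing the original identifier.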

Triggering the Refiner

Your application's backend will trigger this Dockerized script by making a POST request to the Vana Refinement Service for each piece of contributed data.

  • Mainnet Endpoint: https://592387e3ed196d95ce8df7af54dab6ebca21a3c8-8000.dstack-prod5.phala.network/refine
  • Moksha Testnet Endpoint: https://a7df0ae43df690b889c1201546d7058ceb04d21b-8000.dstack-prod5.phala.network/refine

The POST request should be structured as follows:

POST /refine
Content-Type: application/json

{
  "file_id": 1234,
  "encryption_key": "0xabcd1234...",
  "refiner_id": 12,
  "env_vars": {
    "PINATA_API_KEY": "xxx",
    "PINATA_API_SECRET": "yyy"
  }
}
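
As a rough sketch, a backend could assemble this request with Python's standard library as shown below. The helper name build_refine_request is hypothetical, the values are the placeholders from the example above (including the truncated encryption key, which is left as-is), and the request is only built here, not sent.

```python
import json
import urllib.request

# Moksha Testnet endpoint listed above.
MOKSHA_REFINE_URL = (
    "https://a7df0ae43df690b889c1201546d7058ceb04d21b-8000"
    ".dstack-prod5.phala.network/refine"
)


def build_refine_request(url, file_id, encryption_key, refiner_id, env_vars):
    """Build (but do not send) the POST request for one contributed file.

    Hypothetical helper; the body mirrors the JSON structure shown above.
    """
    payload = {
        "file_id": file_id,
        "encryption_key": encryption_key,
        "refiner_id": refiner_id,
        "env_vars": env_vars,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_refine_request(
    MOKSHA_REFINE_URL,
    file_id=1234,
    encryption_key="0xabcd1234...",  # placeholder from the example, not a real key
    refiner_id=12,
    env_vars={"PINATA_API_KEY": "xxx", "PINATA_API_SECRET": "yyy"},  # placeholders
)
```

To send it, pass the object to urllib.request.urlopen(request). In a real backend, the storage credentials in env_vars would normally be read from a secret store rather than hard-coded.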

Registering the Refiner

Once your Dockerized refiner is built and published, register it on-chain with the DataRefinerRegistry contract. Registration requires two URLs: one for your schema (the CID from Step 1) and one for the Docker refinement instruction.

Step 3: Publish Your Data On-Chain

After a piece of data is successfully refined, you must store it and publish a record of the refinement on-chain.

  1. Store the Refined Data: Upload the encrypted, refined data file to a decentralized storage provider like IPFS.

  2. Link the Refinement: Link this new data to the original contribution by calling the addProof function on the DataRegistry contract. This creates an immutable, on-chain record that connects the original data to its processed, queryable version.

    • Contract: DataRegistry (0x8C...)
    • Function: addProof(uint256 fileId, bytes proof)

Step 4: Set Public Access Permissions

Finally, you must set the default access policies and pricing for your dataset to make it discoverable and usable by others. To set a general, public access policy for everyone, call the addGenericPermission function on the QueryEngine contract.

  • Contract: QueryEngine (0xd25...)
  • Function:
    function addGenericPermission(
      uint256 refinerId,
      string calldata tableName,
      string calldata columnName,
      uint256 price
    ) external returns (uint256 permissionId)
  • Parameters:
    • refinerId: The ID of the refiner you registered in Step 2, which identifies your published dataset.
    • tableName / columnName: Leave these blank to grant access to the entire dataset, or specify them for more granular control.
    • price: The amount in $VANA you wish to charge per query. Set to 0 for free access.

This function creates a permission that is granted to everyone (address(0)), making it the ideal way to set public query rules. To grant access to a specific user, see the Granting Data Access guide.
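
When preparing the arguments for addGenericPermission, the price has to be supplied as a uint256. The sketch below assumes the contract expects the price in the token's smallest unit with 18 decimals, as is conventional for EVM native tokens; this guide does not confirm that, so check the contract before relying on it. The helper name generic_permission_args is hypothetical.

```python
from decimal import Decimal

# ASSUMPTION: price is passed in the smallest unit with 18 decimals.
# The guide only says "amount in $VANA"; verify against the contract.
UNITS_PER_VANA = Decimal(10) ** 18


def generic_permission_args(refiner_id, price_vana, table_name="", column_name=""):
    """Return the positional arguments for addGenericPermission.

    Blank table_name / column_name grant access to the entire dataset,
    as described in the parameter list above.
    """
    price = Decimal(str(price_vana))
    if price < 0:
        raise ValueError("price must not be negative")
    scaled = price * UNITS_PER_VANA
    if scaled != scaled.to_integral_value():
        raise ValueError("price has more than 18 decimal places")
    return (refiner_id, table_name, column_name, int(scaled))


# Whole dataset at 0.5 VANA per query, and a free permission on one table.
paid_args = generic_permission_args(12, "0.5")
free_args = generic_permission_args(12, "0", table_name="posts")
```

Passing the price as a string and converting it through Decimal avoids the rounding errors that binary floats would introduce when scaling by 10^18.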