Complying with VRC-15 (Data Access)

Tip

Check out our Data Access Quickstart guide.

1. Introduction

Overview of Data Access in Vana

The Vana ecosystem is evolving to support data access, allowing DataDAOs to make their collected data available to application builders in a secure, controlled, and monetizable way. This represents a significant advancement for the ecosystem, opening new revenue streams for DataDAOs while enabling the development of valuable applications.

Benefits for DataDAOs

  • New Revenue Streams: Monetize your data by providing secure access to application builders
  • Governance Control: Maintain fine-grained control over who accesses your data and at what price
  • Data Security: Ensure your contributors' data remains secure with advanced encryption
  • Ecosystem Participation: Play a critical role in the Vana data economy

Key Components

The data access architecture consists of several key components:

  • Data Refinement: Process that transforms raw data into queryable format
  • Data Refiner Registry: Smart contract storing schema information
  • Query Engine: Provides secure access to refined data
  • Compute Engine: Executes jobs that interact with the Query Engine
  • Permission System: Controls access to data at various levels

This guide will walk you through each step of preparing your DLP for data access integration.


2. Data Refinement Process

What is Refinement and Why It's Necessary

Refinement is the process of transforming raw user-contributed data into a structured, queryable format (specifically libSQL, a modern fork of SQLite). This step is essential because:

  1. It normalizes data into a consistent schema
  2. It enables SQL-based querying across all data points
  3. It allows for optional PII (Personally Identifiable Information) removal or masking
  4. It creates a secure boundary between raw data and query access

Refinement occurs after the proof-of-contribution process and is a required step for making data available through the query engine.

The Refinement Template

Vana provides a template repository that you can fork and customize for your DLP's specific data structure.

Building Your Refinement Docker Image

  1. Fork the Template: Start by forking the refinement template repository
  2. Update Models:
    • Modify refiner/models/unrefined.py to match your raw data structure
    • Modify refiner/models/refined.py to define your target SQLite schema
  3. Customize Transformers:
    • Update refiner/transformer/ files to map your raw data to refined models
  4. Configure Settings:
    • Modify refiner/config.py with your schema name, version, and other parameters
  5. Build and Test:
    • Test locally with sample data
    • Build a Docker image
    • Publish the Docker image to a publicly accessible location
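
As a rough illustration of steps 2 and 3, the sketch below maps raw records into a refined SQLite table using only the Python standard library. The template has its own model and transformer classes, so treat this as a simplified stand-in: the `RawTrack` dataclass, the `tracks` table, and the `refine` function are hypothetical names chosen for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class RawTrack:
    """Hypothetical shape of one raw, user-contributed record."""
    title: str
    artist: str
    played_at: str  # timestamp string as exported by the data source


# Hypothetical refined schema; a real DLP would define its own tables.
REFINED_DDL = """
CREATE TABLE IF NOT EXISTS tracks (
    track_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    title     TEXT NOT NULL,
    artist    TEXT NOT NULL,
    played_at TEXT NOT NULL
);
"""


def refine(raw_records: list[RawTrack], db_path: str = ":memory:") -> sqlite3.Connection:
    """Normalize raw records and load them into a SQLite database."""
    conn = sqlite3.connect(db_path)
    conn.executescript(REFINED_DDL)
    # Normalization step: trim stray whitespace before inserting.
    rows = [(r.title.strip(), r.artist.strip(), r.played_at) for r in raw_records]
    conn.executemany(
        "INSERT INTO tracks (title, artist, played_at) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn
```

Testing locally with sample data then amounts to calling `refine` on a few hand-written records and querying the resulting table.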

PII Considerations and Best Practices

While Vana does not strictly enforce PII removal, it's recommended to consider:

  1. Masking Identifiers: Use techniques like hashing for emails, usernames, etc.
  2. Data Minimization: Only include fields necessary for the intended use cases
  3. Aggregation: Consider aggregating data when individual records aren't needed
  4. Pseudonymization: Replace direct identifiers with pseudonyms

Remember that you can create multiple refinement types from the same data, perhaps one with PII and one without, each with different access controls.

This is also a great place to enrich your dataset, for example, by calling an API and attaching supplementary data to your original data point.
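
One way to mask identifiers during refinement is a keyed hash, sketched below with the Python standard library. This is only an illustration of the hashing technique listed above, not something the template prescribes: the function names and the idea of supplying a `secret` (for example through the refiner's environment variables) are assumptions.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret: bytes) -> str:
    """Return a stable pseudonym for an identifier such as an email address.

    A keyed hash (HMAC-SHA256) is used instead of a bare hash so that
    pseudonyms cannot be reproduced by hashing guessed values without the secret.
    """
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


def mask_record(record: dict, pii_fields: set[str], secret: bytes) -> dict:
    """Copy a record, replacing the listed PII fields with pseudonyms."""
    return {
        key: pseudonymize(str(value), secret) if key in pii_fields else value
        for key, value in record.items()
    }
```

Because the same input always yields the same pseudonym under a given secret, records from one contributor can still be joined across tables without exposing the original identifier.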


3. Schema Registration

Creating and Uploading Your Schema

Your schema defines the structure of your refined data. It should:

  1. Be created as a JSON file following the proper format:

    {
      "name": "spotify",
      "version": "0.0.1",
      "description": "Schema for storing music-related data",
      "dialect": "sqlite",
      "schema": "CREATE TABLE IF NOT EXISTS \"albums\"\n(\n    [AlbumId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,\n    [Title] NVARCHAR(160)  NOT NULL,\n    [ArtistId] INTEGER  NOT NULL,\n    FOREIGN KEY ([ArtistId]) REFERENCES \"artists\" ([ArtistId]) \n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\n);\n..."
    }
  2. Include all necessary tables, columns, constraints, and relationships

  3. Be uploaded to IPFS to obtain a permanently accessible URL

The refinement template includes utilities to help generate this schema from your SQLAlchemy models and upload it to IPFS.
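
If you want to check a schema by hand before uploading it, a minimal sketch along these lines may help. It is not the template's utility: `build_schema_document` is a hypothetical helper that runs the DDL against an in-memory SQLite database to catch syntax errors, then wraps it in the JSON layout shown above.

```python
import json
import sqlite3


def build_schema_document(name: str, version: str, description: str, ddl: str) -> str:
    """Validate the DDL against SQLite, then wrap it in the schema JSON layout."""
    # executescript raises sqlite3.Error if the DDL is not valid SQLite.
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl)
    finally:
        conn.close()
    document = {
        "name": name,
        "version": version,
        "description": description,
        "dialect": "sqlite",
        "schema": ddl,
    }
    return json.dumps(document, indent=2)
```

The returned string is what would be written to a file and uploaded to IPFS; the upload step itself is left out here.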

Registering with the DataRefinerRegistry Contract

Once your schema is uploaded to IPFS, you need to register it with the DataRefinerRegistry contract:

  1. Call the addRefiner function with:
    • dlpId: Your DLP's ID
    • name: A name for your schema/refiner
    • schemaDefinitionUrl: IPFS URL where your schema is stored
    • refinementInstructionUrl: URL to your Docker image
    • publicKey: The public key provided to you by Vana for encrypting refinements

Example:

// This transaction must be sent from your DLP's owner wallet
dataRefinerRegistry.addRefiner(
    myDlpId,
    "Spotify User Data",
    "https://ipfs.vana.org/ipfs/Qma3dWDFZCQnWFTv1owAXyJjtpxjENGvKizPhFT4fsX8do",
    "https://github.com/vana-com/vana-data-refinement-template/releases/download/v4/refiner-4.tar.gz",
    "0x04bfcab8282071e4c17b3ae235928ec9dd9fb8e2b2f981c56c4a5215c9e7a1fcf1a84924476b8b56f17f719d3d3b729688bb7c39a60b00414d53ae8491df5791fa"
)

The contract will assign a unique refiner ID to your schema, which will be used to reference your schema in queries.
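
Since a registration is an on-chain transaction, it can be worth sanity-checking the arguments off-chain first. The sketch below is one hypothetical pre-flight check, not part of any Vana tooling. It assumes both URLs are http(s) URLs and that the public key is an uncompressed secp256k1 key (0x04 followed by 64 bytes), which matches the shape of the example above but is not stated by the contract documentation here.

```python
from urllib.parse import urlparse


def check_refiner_args(name: str, schema_url: str, instruction_url: str,
                       public_key: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not name.strip():
        problems.append("name is empty")
    for label, url in (("schemaDefinitionUrl", schema_url),
                       ("refinementInstructionUrl", instruction_url)):
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{label} is not an http(s) URL")
    # Assumption: the key is an uncompressed secp256k1 public key,
    # i.e. 0x04 followed by 64 bytes (65 bytes in total).
    hex_part = public_key[2:] if public_key.startswith("0x") else public_key
    try:
        raw = bytes.fromhex(hex_part)
    except ValueError:
        raw = b""
    if len(raw) != 65 or raw[0] != 0x04:
        problems.append("publicKey is not a 65-byte uncompressed key")
    return problems
```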

You can register multiple refiners for different types of data or different views of the same data, for example:

  • Separate refiners for different data sources or types
  • One refiner with PII and another without
  • New refiners for major schema changes

Each refiner gets its own ID and permissions, allowing you to control access granularly.


4. Refinement Service Integration

Using the Refinement Service API

Vana provides a refinement service that runs your Docker image to process data:

  • You can use Vana's hosted service or self-host (recommended)
  • The service is hosted in a TEE (Trusted Execution Environment) for security
  • The code is available at: https://github.com/vana-com/vana-refinement-service

Invoking Refinements for New Data

After proof-of-contribution is run for a new data point, call the refinement service:

POST <refinement-service-url>/refine
{
    "file_id": 1234,                    // File ID in the Data Registry
    "encryption_key": "0xabcd1234...",  // User's original file encryption key
    "refiner_id": 12,                   // Your refiner ID from the Data Refiner Registry
    "env_vars": {
        "PINATA_API_KEY": "xxx",
        "PINATA_API_SECRET": "yyy"
    }
}

The service will:

  1. Download and decrypt the original file
  2. Create a refinement encryption key (REK), derived from the original file encryption key, and inject it into your refinement container so your refinement can be encrypted
  3. Run your refinement Docker container
  4. Add the refined data to the Data Registry via addRefinementWithPermission
  5. Grant permission to the query engine to access the refinement, by encrypting REK with the public key from your refiner
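
From a client's point of view, the call above might be assembled as in the following sketch, which uses only the Python standard library. It builds the request without sending it. The `Content-Type` header and the helper name `build_refine_request` are assumptions, and the response format is not described in this guide, so handling it is left to the caller.

```python
import json
import urllib.request


def build_refine_request(service_url: str, file_id: int, encryption_key: str,
                         refiner_id: int, env_vars: dict[str, str]) -> urllib.request.Request:
    """Build (but do not send) the POST request for the /refine endpoint."""
    payload = {
        "file_id": file_id,
        "encryption_key": encryption_key,
        "refiner_id": refiner_id,
        "env_vars": env_vars,
    }
    return urllib.request.Request(
        url=service_url.rstrip("/") + "/refine",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not confirmed by the guide
        method="POST",
    )

# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request, timeout=300) as response:
#       body = response.read()
```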

Batch Refinement for Existing Data

For refining data collected before implementing VRC-15:

  1. DLPs can collect a list of file IDs and corresponding encryption keys
  2. For each file, call the refinement API

This enables you to make your existing data queryable.
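
The two steps above can be sketched as a simple loop. This is a hypothetical helper rather than an official client: `call_refine` stands for whatever function you use to send a single request to the refinement API, and it is expected to raise an exception on failure.

```python
from typing import Callable, Iterable


def refine_batch(files: Iterable[tuple[int, str]], refiner_id: int,
                 call_refine: Callable[[int, str, int], None]) -> list[tuple[int, str]]:
    """Call the refinement API once per file and return the failures.

    `files` yields (file_id, encryption_key) pairs. Failures are collected
    as (file_id, error message) so they can be retried later.
    """
    failures = []
    for file_id, encryption_key in files:
        try:
            call_refine(file_id, encryption_key, refiner_id)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failures.append((file_id, str(exc)))
    return failures
```

Collecting failures instead of stopping at the first error lets a long backfill finish and leaves a short list of files to retry.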


5. Permission Management

Understanding the Permission Model

The Query Engine contract manages permissions for data access at three levels of granularity:

  1. Schema Level: Grant access to an entire schema
  2. Table Level: Grant access to specific tables within that schema
  3. Column Level: Grant access to specific columns in specific tables within that schema

Permissions are managed by DLP owners and stored on-chain for transparency and security.
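
To make the three levels concrete, here is a small off-chain model of how a permission's scope could be interpreted. It is an illustration only: the `Permission` class and its `covers` method are hypothetical, and the Query Engine contract's actual resolution rules may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Permission:
    refiner_id: int
    table: Optional[str] = None   # None means every table in the schema
    column: Optional[str] = None  # None means every column in the table

    def covers(self, refiner_id: int, table: str, column: str) -> bool:
        """True if this permission grants access to the given column."""
        if self.refiner_id != refiner_id:
            return False
        if self.table is not None and self.table != table:
            return False
        if self.column is not None and self.column != column:
            return False
        return True
```

Under this reading, `Permission(12)` is schema-level, `Permission(12, "users")` is table-level, and `Permission(12, "users", "email")` is column-level.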

Granting Permissions to Application Builders

After negotiating with application builders off-chain:

  1. Call the addPermission function on the Query Engine contract:

    queryEngine.addPermission(
        "0x123...",  // Grantee: Address of the application builder
        12,          // Refiner ID: Your schema ID
        "users",     // Table Name: Optional, for table-level permission
        "email",     // Column Name: Optional, for column-level permission
        "5000000000000000000"  // Price: 5 VANA tokens per query
    )
    
  2. The application builder will then be able to query your data by paying the specified price

  3. Revenue is distributed with 80% going to your DLP and 20% to Vana

Managing Generic Permissions

You can also grant permissions to anyone (not just specific addresses):

queryEngine.addGenericPermission(
    12,          // Refiner ID: Your schema ID
    "users",     // Table Name: Optional, for table-level permission
    "name",      // Column Name: Optional, for column-level permission
    "1000000000000000000"  // Price: 1 VANA token per query
)

This allows any user to access the specified data by paying the set price.
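
The price arguments in the examples above are strings in the token's smallest unit, so 5 VANA is written as "5000000000000000000" (18 decimal places). A small conversion helper, sketched here with a hypothetical name, avoids counting zeros by hand.

```python
from decimal import Decimal, localcontext

VANA_DECIMALS = 18  # consistent with the examples: 5 VANA == "5000000000000000000"


def vana_to_base_units(amount: str) -> str:
    """Convert a decimal VANA amount such as "0.25" to a base-unit string."""
    with localcontext() as ctx:
        ctx.prec = 60  # ample headroom so large amounts are not rounded
        scaled = Decimal(amount) * (Decimal(10) ** VANA_DECIMALS)
        if scaled < 0 or scaled != scaled.to_integral_value():
            raise ValueError(
                f"{amount!r} is negative or has more than {VANA_DECIMALS} decimal places"
            )
        return str(int(scaled))
```

Taking the amount as a string rather than a float keeps values such as "0.1" exact.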

Revoking and Updating Permissions

You can manage existing permissions with:

  1. Update Approval: Revoke or reapprove permissions

    queryEngine.updatePermissionApproval(
        permissionId,  // The ID of the permission
        false          // Set to false to revoke, true to approve
    )

6. Pricing and Revenue

Setting Prices for Data Access

When setting prices for your data, consider:

  1. What unique value does your data provide?
  2. Higher quality data can command higher prices
  3. Different tables or columns may have different values
  4. What are comparable DLPs charging?
  5. Are you targeting high volume or high value?

Prices are set in $VANA tokens and are charged per query. In v0, pricing is deterministic: once a user is granted permission to query a schema at a fixed price, every query costs that price regardless of how much data it returns. A query that returns 100k records and one that returns 0 records are charged the same amount.

Revenue Distribution

In the current implementation (v0):

  • 80% of query fees are reserved for DLPs to manually collect
  • 20% go to the Vana data access treasury
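
As a worked example of the split, the sketch below divides a fee expressed in base units. The function name is hypothetical, and assigning any rounding remainder to the treasury is an assumption made for the example, not a documented contract behavior.

```python
def split_query_fee(fee_base_units: int) -> tuple[int, int]:
    """Split a query fee into (DLP share, treasury share) using the v0 80/20 split."""
    dlp_share = fee_base_units * 80 // 100
    # Take the remainder so the two parts always sum to the original fee.
    treasury_share = fee_base_units - dlp_share
    return dlp_share, treasury_share
```

For a 5 VANA query, `split_query_fee(5 * 10**18)` yields 4 VANA for the DLP and 1 VANA for the treasury.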

Future Pricing Models

In future versions:

  • DLPs may be able to charge in their own $DLPT tokens
  • More complex pricing models may be supported
  • Payment for compute resources may be introduced

7. Security Considerations

Encryption Model and Key Management

The data access system uses the following encryption model:

  1. Original Data Encryption:
    • Raw files are encrypted with the user's wallet signature (EK)
    • To give another party access, EK is encrypted (wrapped) with the recipient's public key and stored in the Data Registry permissions
  2. Refinement Encryption:
    • A secondary key (REK) is derived from EK using HKDF:

      const refinedEncryptionKey = HKDF(
         hash = 'SHA-256',
         masterKey = '0xabcd1234...',
         salt = null,
         info = `query-engine`,
         length = 64
      );
    • REK is used to encrypt the refined file

    • REK is encrypted with a DLP-specific public key and stored in the Data Registry permissions to grant the Query Engine access to the refinement

    • This prevents the Query Engine from accessing the original data

  3. Key Management:
    • Public keys for the Query Engine are generated and provided by Vana
    • Each DLP has a specific keypair for their refinements
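
For readers who want to see the derivation step in runnable form, here is a minimal HKDF-SHA256 sketch following RFC 5869, written with the Python standard library. It is not the refinement service's code. How the service encodes the master key before derivation (raw bytes versus the hex string itself) is not stated in this guide, so the usage below assumes raw bytes, and a null salt is treated as empty.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, info: bytes, length: int, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand to `length` bytes."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested length is too large for HKDF-SHA256")
    # Extract: an empty salt is replaced by hash_len zero bytes, per the RFC.
    prk = hmac.new(salt or b"\x00" * hash_len, master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks T(1), T(2), ... until enough output exists.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Assumed encoding: the hex key is decoded to raw bytes before derivation.
example_ek = bytes.fromhex("abcd1234")
example_rek = hkdf_sha256(example_ek, info=b"query-engine", length=64)
```

Because the derivation is one-way, holding REK does not reveal EK, which is what keeps the original file out of reach of the Query Engine.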

GDPR Compliance and Takedown Procedures

As a DLP owner, you are considered a GDPR Data Controller, which means:

  1. You are responsible for user consent and data handling
  2. You must handle takedown requests if users exercise their right to be forgotten
  3. When a takedown request is received:
    • Stop pinning the relevant files on IPFS
    • Work with Vana to ensure any cached data is removed from the Query Engine

Best Practices for Data Security

  1. Only refine the fields absolutely necessary for your use cases
  2. Consider masking or removing PII when possible
  3. Periodically review your permissions and access patterns
  4. Maintain documentation of your data handling practices
  5. Be transparent with your users about how their data is used

8. End-to-End Workflow

Step-by-Step Guide

  1. Preparation:
    • Fork the refinement template
    • Customize for your data structure
    • Build and publish your Docker image
    • Create and upload your schema to IPFS
  2. Registration:
    • Register your refiner with the DataRefinerRegistry contract
    • Receive your unique refiner ID
  3. Integration:
    • Update your proof-of-contribution workflow to include refinement
    • For each new data point:
      • Run proof-of-contribution
      • Call the refinement service API
      • Refinement is added to Data Registry
  4. Permissions:
    • Set up desired permissions in the Query Engine contract
    • Negotiate with application builders
    • Grant specific or generic permissions
  5. Monitoring:
    • Track queries and revenue
    • Respond to new permission requests
    • Handle any takedown requests

9. Contract Addresses and API Endpoints

Moksha Testnet & Mainnet:

  • DataRefinerRegistry: 0x93c3EF89369fDcf08Be159D9DeF0F18AB6Be008c
  • QueryEngine: 0xd25Eb66EA2452cf3238A2eC6C1FD1B7F5B320490
  • ComputeInstructionRegistry: 0x5786B12b4c6Ba2bFAF0e77Ed30Bf6d32805563A5
  • ComputeEngine: 0xb2BFe33FA420c45F1Cf1287542ad81ae935447bd

Refinement Service API:

Vana will host an instance of the Refinement Service, available at:

https://a7df0ae43df690b889c1201546d7058ceb04d21b-8000.dstack-prod5.phala.network/refine


Testing and Verification

Before going to production:

  1. Test your refinement process with sample data
  2. Verify the schema and refinement output
  3. Test permission granting and querying
  4. Validate the encryption and security measures
  5. Ensure proper error handling and monitoring

By working closely with the Vana team during initial setup, you can ensure a smooth integration with the data access system. Once you've confirmed data is refined, encrypted, and queryable in a TEE, you're set for VRC-15.