Vana uses a Proof-of-Contribution (PoC) system to validate data submitted to the network. PoC ensures the integrity and quality of data within Data Liquidity Pools (DLPs). Because everyone's data is different, enabling data liquidity requires mapping each contribution onto a fungible asset.
Each DLP implements its own proof-of-contribution function tailored to its dataset. For example, r/datadao measured contributions by the amount of karma and included an ownership check that had users post a code in their Reddit profile to confirm they owned the account. The right proof-of-contribution check depends on the goals of the data liquidity pool and the best way to measure contributions to it.
The proof-of-contribution function defines success for your data liquidity pool. If a kind of data you do not want in your DLP still passes or is rewarded by your proof-of-contribution function, the function is incomplete.
To validate data submissions, DLP Validators scan through the data transactions and assign a score using the DLP's contribution function. The function takes into account various data characteristics, such as completeness, accuracy, and relevance to the DLP’s purpose.
Each function reflects the constraints imposed by the DLP that receives the contributions, so DLP Validators may apply their own functions to incentivize the type and quality of data they want to collect. This flexibility lets each DLP evaluate data efficiently while still scoring contributions accurately.
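As a concrete illustration, a minimal contribution function might combine a few normalized characteristics into a single score. The characteristics, weights, and 0-100 scale below are hypothetical and would be defined by each DLP.

```python
# Hypothetical DLP contribution scoring: combines completeness, accuracy,
# and relevance (each normalized to 0-1) into a single 0-100 score.
def score_contribution(completeness: float, accuracy: float, relevance: float) -> float:
    weights = {"completeness": 0.3, "accuracy": 0.3, "relevance": 0.4}  # DLP-specific
    raw = (weights["completeness"] * completeness
           + weights["accuracy"] * accuracy
           + weights["relevance"] * relevance)
    return round(raw * 100, 2)

# Example: a mostly complete, accurate, highly relevant submission
print(score_contribution(completeness=0.9, accuracy=0.8, relevance=1.0))  # 91.0
```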
One recommended implementation for DLP Proof-of-Contribution is to run a model influence function, which measures exactly how much new information a given data point teaches the AI model.
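A model influence function can be approximated, for example, by leave-one-out retraining: train the model with and without the candidate data point and compare validation loss. The sketch below uses scikit-learn on synthetic data purely to illustrate the idea; a production DLP would use a cheaper influence approximation and its own model.

```python
# Toy influence score: how much does adding one data point improve
# validation loss? (Leave-one-out retraining; real systems approximate this.)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_val, y_val = rng.normal(size=(50, 5)), rng.integers(0, 2, 50)
x_new, y_new = rng.normal(size=(1, 5)), np.array([1])  # candidate contribution

def val_loss(X, y):
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return log_loss(y_val, model.predict_proba(X_val))

baseline = val_loss(X_train, y_train)
with_point = val_loss(np.vstack([X_train, x_new]), np.concatenate([y_train, y_new]))

# Positive influence means the new point reduced validation loss.
influence = baseline - with_point
print(f"influence score: {influence:.6f}")
```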
To protect the privacy of data contributions, Validators can act as a trusted party and securely run PoC on user data. Read more about how Validators protect data in Data Privacy.
The PoC system supports zero-knowledge proofs. When a Data Contributor or Custodian submits data to the DLP, they generate a zero-knowledge proof that verifies the authenticity and integrity of the data, as well as its contribution to the DLP, without revealing its full contents. Read more about it in Zero-Knowledge Proof of Contribution.
A zero-knowledge proof (ZKP) is a cryptographic method by which one party (the prover) can prove to another party (the verifier) that they know a value without conveying any information apart from the fact that they know the value. This means the verifier learns nothing about the value itself, only that the prover knows it.
To protect the privacy of data contributions, a DLP can implement a Proof of Contribution using ZKP. When a Data Contributor or Custodian submits data to the DLP, they generate a zero-knowledge proof that verifies the authenticity and integrity of the data and its contribution to the DLP without revealing its full contents.
To illustrate, imagine a DLP for ChatGPT data exports. The DLP considers a data point "valid" if the number of conversations inside the zip file exceeds 50. We can generate a cryptographic proof that a file meets this requirement without revealing its contents (or even the exact number of conversations in the file).
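Conceptually, the statement being proven is just a predicate over the private file plus a public commitment to it. The sketch below computes that predicate in plain Python to show what the circuit attests to; the actual zero-knowledge proof is produced by a proving toolchain (see the PoC linked below), and the export layout (a conversations.json inside the zip) is an assumption.

```python
# What the ZK proof attests to, in plain (non-zero-knowledge) form:
# "the committed ChatGPT export contains more than 50 conversations".
import hashlib
import json
import zipfile

MIN_CONVERSATIONS = 50  # DLP-defined validity threshold

def build_witness(export_path: str) -> dict:
    with open(export_path, "rb") as f:
        file_bytes = f.read()
    # Public commitment to the private file (shared with the verifier).
    commitment = hashlib.sha256(file_bytes).hexdigest()

    # Private computation: count conversations inside the export zip.
    with zipfile.ZipFile(export_path) as zf:
        conversations = json.loads(zf.read("conversations.json"))

    return {
        "public": {"commitment": commitment, "valid": len(conversations) > MIN_CONVERSATIONS},
        "private": {"conversation_count": len(conversations)},  # never revealed
    }
```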
To protect against tampering with proof generation while preserving privacy and ensuring the data never leaves the user's browser unencrypted, the proof is generated in a WebAssembly environment, which is much harder to tamper with than plain JavaScript running in the browser.
We provide an example here: https://zk-proof-poc.vercel.vana.com/
The source code is available here: https://github.com/vana-com/zk-proof-poc
Each DLP implements its own proof-of-contribution function based on its particular dataset. As an example, the ChatGPT DLP handles Proof of Contribution via the four checks below.
The authenticity check aims to prove that the data submitted is authentic and not tampered with. The attack vector this aims to mitigate is submitting altered data to the DLP. For example, a malicious data contributor may add synthetically generated conversation history to their chats, making the data seem more valuable than it actually is. They may also alter their personal information, such as their birthday or when the account was created.
In the ChatGPT DLP, we rely on the email from OpenAI linking the user to their export to verify the authenticity of the data.
1. The user requests a data export of their ChatGPT data.
2. Once they receive the "Your export is ready" email, they download the zip file and copy the download link from the email.
3. On gptdatadao.org, along with uploading their zip file, they are asked to provide the download link. Both are encrypted so that only a DLP validator can see them.
4. The DLP validator receives the encrypted file and download link. They download and decrypt both the file from the user's storage and the one behind the link, then calculate a checksum of each and confirm they match, proving the zip uploaded to the user's storage has not been tampered with (sketched below).
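A validator-side sketch of that last step might look like the following; the decrypt helper and transport details are assumptions, and only the download-and-compare-checksums flow mirrors the steps above.

```python
# Sketch of the validator-side authenticity check: the zip uploaded to the
# user's storage must be byte-identical to the one served by the download
# link from OpenAI's export email.
import hashlib
import requests

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_authenticity(encrypted_upload: bytes, download_url: str, decrypt) -> bool:
    # `decrypt` is a hypothetical helper holding the validator's key.
    uploaded_zip = decrypt(encrypted_upload)

    # Fetch the export directly from the link the contributor copied from
    # the "Your export is ready" email.
    official_zip = requests.get(download_url, timeout=60).content

    # Matching checksums mean the uploaded file was not tampered with.
    return sha256(uploaded_zip) == sha256(official_zip)
```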
The ownership check aims to prove that the data contributor indeed owns the data they are submitting. The attack vector this prevents is a data contributor contributing someone else's data.
Specifically for the ChatGPT DLP, ownership is covered by the authenticity check, because it is difficult to fake a unique download link for a ChatGPT export.
The quality check aims to prove that the data submitted is of high quality. If a data contributor submits a data export for a newly created account, the data will still be authentic and rightfully owned by the contributor, but it is probably not very useful.
We leverage an LLM and sample conversations to determine the quality of the data.
When data is submitted to a validator, the validator takes a few randomly sampled conversations, sends them to an LLM (OpenAI in this case), and prompts it to determine the coherence and relevance of each conversation and score it from 0-100.
The scores from different conversations are then averaged, giving an idea of the quality of the data.
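A minimal sketch of this sampling-and-scoring loop, using the OpenAI Python client, might look like the following; the prompt, model name, and sample size are illustrative rather than the DLP's exact implementation.

```python
# Sketch: sample a few conversations, ask an LLM to rate each 0-100,
# and average the scores as the quality signal.
import random
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = ("Rate the coherence and relevance of this conversation "
          "on a scale of 0-100. Reply with the number only.\n\n{conversation}")

def quality_score(conversations: list[str], sample_size: int = 5) -> float:
    sample = random.sample(conversations, min(sample_size, len(conversations)))
    scores = []
    for conversation in sample:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": PROMPT.format(conversation=conversation)}],
        )
        match = re.search(r"\d+", response.choices[0].message.content)
        if match:
            scores.append(min(int(match.group()), 100))
    return sum(scores) / len(scores) if scores else 0.0
```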
The uniqueness check aims to prove that the data submitted is unique. Similar to the authenticity check, this proof aims to thwart malicious data contributors who may submit the same data multiple times to the DLP.
We implement a model influence function that fingerprints a data point and compares it to other data points on the network.
The validator calculates a feature vector of the zip file by first producing a deterministic string representation of the file and then converting it into a feature vector. This is the fingerprint of that data point. If a slightly altered file is run through the same process, it produces a very similar fingerprint, unlike a hash, which changes completely even if a single bit of the underlying data changes.
The validator then records this on-chain so other validators are aware of the fingerprints of other data points in the network. They then build a local vector store of all existing data points.
After the fingerprint is calculated, the validator inserts it into the local vector store and checks how similar it is to the other fingerprints in the store. If it is too similar, the data point is rejected.
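A minimal sketch of this fingerprint-and-compare step is below, assuming a simple character n-gram hashing featurization and cosine similarity; the actual featurization, thresholds, and on-chain recording are DLP-specific.

```python
# Sketch: fingerprint a data point as a feature vector and reject it if it is
# too similar to fingerprints already seen on the network.
import zlib
import numpy as np

DIM = 512            # fingerprint dimensionality
SIM_THRESHOLD = 0.9  # illustrative rejection threshold

def fingerprint(text: str, n: int = 5) -> np.ndarray:
    # Hash character n-grams into a fixed-size count vector; unlike a
    # cryptographic hash, a slightly altered file yields a very similar vector.
    vec = np.zeros(DIM)
    for i in range(len(text) - n + 1):
        vec[zlib.crc32(text[i:i + n].encode()) % DIM] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class LocalVectorStore:
    def __init__(self):
        self.vectors: list[np.ndarray] = []  # fingerprints seen on the network

    def is_unique(self, vec: np.ndarray) -> bool:
        if any(float(v @ vec) > SIM_THRESHOLD for v in self.vectors):
            return False  # too similar to an existing data point: reject
        self.vectors.append(vec)
        return True

store = LocalVectorStore()
original = "deterministic string representation of the export, e.g. all conversation text"
near_duplicate = original.replace("export", "Export")
print(store.is_unique(fingerprint(original)))        # True: first submission
print(store.is_unique(fingerprint(near_duplicate)))  # rejected if similarity exceeds the threshold
```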
This method offers an efficient way to check similarity against all other files in the network. If you'd like to use this in your DLP, see here for an example.
While proof-of-contribution differs between DLPs, the ideas outlined here can be applied to other DLPs. By checking authenticity, ownership, quality, and uniqueness, a DLP creator can be confident that their data DAO consists of high-quality, meaningful data while deterring attackers who would submit low-quality data.