ChatGPT DataDAO Example
How it works in the ChatGPT DataDAO
The ChatGPT DataDAO verifies the authenticity of a contribution using the email that OpenAI sends to the user, which links to their data export.
- User requests a data export of their ChatGPT data.
- Once they receive the "Your export is ready" email, they download the zip file and copy the download link from the email.
- On gptdatadao.org, they upload the zip file and are also asked to provide the download link. Both are encrypted so that only a DataDAO validator can see them.
- The DataDAO validator receives the encrypted file and download link. They download and decrypt the file from the user's storage, and also download the file behind the link. They calculate a checksum of both files and confirm that the two match, which shows that the zip uploaded to the user's storage has not been tampered with.
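The checksum comparison in the last step can be sketched as follows. This is a minimal illustration, not the DataDAO's actual validator code: it assumes both copies are already decrypted and on local disk, it picks SHA-256 as the checksum (the algorithm is not specified above), and the function names `sha256_of_file` and `exports_match` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large exports do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exports_match(uploaded_zip: Path, linked_zip: Path) -> bool:
    """True when the uploaded copy and the copy from the email link are byte-identical."""
    return sha256_of_file(uploaded_zip) == sha256_of_file(linked_zip)
```

A validator would call `exports_match(path_from_user_storage, path_from_download_link)` and reject the contribution when it returns `False`.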
Implementation
A model influence function is implemented that fingerprints a data point and compares it to other data points on the network.
- The validator calculates a feature vector of the zip file by first getting a deterministic string representation of the file and then converting that string to a feature vector. This is the fingerprint of that data point. If a slightly altered file is run through the same process, it produces a very similar fingerprint. A hash, by contrast, changes completely even if a single bit of the underlying data changes.
- The validator then records the fingerprint on-chain so that other validators are aware of the fingerprints of other data points in the network. Each validator then builds a local vector store of all existing data points.
- After the fingerprint is calculated, the validator inserts it into the local vector store and checks how similar it is to the other fingerprints in the store. If it is too similar to an existing fingerprint, the validator rejects the data point.
This method offers an efficient way to check similarity against all other files in the network. If you'd like to use this in your DataDAO, see here for an example.
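To make the fingerprinting idea concrete, here is a small sketch using only the Python standard library. It is an illustration under several assumptions and not the DataDAO's model influence function: the string representation (sorted member names plus decoded contents), the hashed character n-gram vector, the constants `DIMENSIONS`, `NGRAM` and `SIMILARITY_THRESHOLD`, and the in-memory `FingerprintStore` are all hypothetical stand-ins. A production setup would more likely use an embedding model and a real vector store.

```python
import hashlib
import math
import zipfile

DIMENSIONS = 256            # assumed fingerprint length
NGRAM = 5                   # assumed character n-gram size
SIMILARITY_THRESHOLD = 0.9  # assumed cut-off for "too similar"


def zip_to_text(zip_source) -> str:
    """One possible deterministic string form: members in sorted name order."""
    with zipfile.ZipFile(zip_source) as archive:
        parts = []
        for name in sorted(archive.namelist()):
            content = archive.read(name).decode("utf-8", errors="replace")
            parts.append(f"{name}\n{content}\n")
    return "".join(parts)


def fingerprint(text: str) -> list:
    """Hash character n-grams into a fixed-size, L2-normalized vector."""
    vector = [0.0] * DIMENSIONS
    for i in range(max(len(text) - NGRAM + 1, 1)):
        digest = hashlib.sha256(text[i:i + NGRAM].encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % DIMENSIONS
        # Signed hashing keeps unrelated texts close to zero similarity.
        sign = 1.0 if digest[4] & 1 else -1.0
        vector[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def cosine(a, b) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


class FingerprintStore:
    """Toy in-memory stand-in for a validator's local vector store."""

    def __init__(self):
        self._items = {}

    def submit(self, file_id: str, text: str) -> bool:
        """Store the fingerprint and return True, or return False if too similar."""
        vector = fingerprint(text)
        closest = max((cosine(vector, v) for v in self._items.values()), default=0.0)
        if closest >= SIMILARITY_THRESHOLD:
            return False
        self._items[file_id] = vector
        return True
```

Note that this sketch compares before storing, so a rejected fingerprint never enters the store. That is a simplification of the insert-then-check order described above. The linear scan over all stored vectors is also only suitable for small examples.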
Explanation
Let’s use the ChatGPT DataDAO as an example to explain how the Proof of Contribution process works in practice:
- Authenticity Check
  - Mechanism: Contributors upload their ChatGPT data exports.
  - Verification: The validator compares the checksum of the uploaded export file with the checksum of the file provided via OpenAI’s download link, ensuring no tampering has occurred.
- Ownership Verification
  - Mechanism: In the ChatGPT DataDAO, ownership is validated through metadata included in the data export. This metadata links the data to the user’s OpenAI account.
  - Simplified Check: A simple confirmation via email or an associated account link can serve as an ownership check.
- Quality Evaluation
  - Mechanism: The validator takes sample conversations from the exported data and evaluates them using a language model (e.g., OpenAI’s GPT).
  - Scoring: The conversations are scored for coherence and relevance, with an overall quality score assigned to the data.
- Uniqueness Check
  - Mechanism: A model influence function fingerprints the data and compares it against previously submitted datasets.
  - Preventing Duplicates: If the fingerprint is too similar to existing data, the submission is flagged as a duplicate and rejected.
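The quality evaluation step above (sample some conversations, score each one, combine the scores) could look roughly like the sketch below. Everything here is an assumption made for illustration: the function name `quality_score`, the sample size, the 0 to 1 score range, and the simple mean used to combine scores are not taken from the DataDAO. The language model call is represented by a caller-supplied `score_conversation` function, so no particular model API is implied.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def quality_score(
    conversations: Sequence[str],
    score_conversation: Callable[[str], float],
    sample_size: int = 5,
    seed: int = 0,
) -> float:
    """Average the per-conversation scores of a random sample, clamped to [0, 1]."""
    if not conversations:
        return 0.0
    # A fixed seed keeps the sample reproducible for the same input.
    rng = random.Random(seed)
    sample = rng.sample(list(conversations), k=min(sample_size, len(conversations)))
    # Clamp each score so one out-of-range answer cannot skew the result.
    scores = [min(max(score_conversation(text), 0.0), 1.0) for text in sample]
    return mean(scores)


def placeholder_scorer(text: str) -> float:
    """Hypothetical stand-in for a language model rating coherence and relevance."""
    return min(len(text.split()) / 50.0, 1.0)
```

A validator would replace `placeholder_scorer` with a function that sends the conversation to a language model and parses a numeric rating from the reply, then call `quality_score(conversations, that_function)`.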