Illustration: Validating ChatGPT Data
How to Validate ChatGPT Contributions
In a ChatGPT DataDAO, one can rely on the email from OpenAI linking the user to their export to verify the authenticity of the data.
- User requests a data export of their ChatGPT data.
- Once they receive the "Your export is ready" email, the user downloads the zip file and copies the download link from the email.
- In the data contribution UI, along with uploading their zip file, the user is asked to provide the download link. Both are encrypted such that only a DataDAO validator can see them.
- A data validator receives the encrypted file and download link. It downloads and decrypts the file from the user's storage, and it also downloads the file at the provided link. It calculates a checksum of each file and confirms that they match, which shows that the zip uploaded to the user's storage has not been tampered with.
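The checksum comparison in the last step can be sketched as follows. This is a minimal illustration, not the validator's actual code: it assumes both files have already been downloaded and decrypted into memory, it assumes SHA-256 as the checksum (the text above does not name an algorithm), and the function names `sha256_of_bytes` and `files_match` are hypothetical.

```python
import hashlib
import hmac


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def files_match(uploaded_zip: bytes, zip_from_link: bytes) -> bool:
    """Compare the checksum of the uploaded zip with the zip fetched from the link.

    Assumption: both inputs are already decrypted and fully loaded in memory.
    """
    uploaded_digest = sha256_of_bytes(uploaded_zip)
    link_digest = sha256_of_bytes(zip_from_link)
    # compare_digest performs a constant-time comparison of the two digests.
    return hmac.compare_digest(uploaded_digest, link_digest)
```

A validator would accept the contribution's authenticity only when `files_match` returns `True`. For large exports, hashing the file in chunks rather than loading it whole would be the more practical choice.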
Implementation
A model influence function is implemented that fingerprints a data point and compares it to other data points on the network.
- The validator calculates a feature vector for the zip file by first producing a deterministic string representation of the file and then converting that string to a feature vector. This vector is the fingerprint of the data point. If a slightly altered file is run through the same process, it produces a very similar fingerprint. A hash behaves differently: it changes completely even if a single bit of the underlying data changes.
- The validator then records the fingerprint on-chain so that other validators are aware of the fingerprints of all data points in the network. Each validator builds a local vector store of all existing data points from these records.
- After calculating the fingerprint of a new data point, the validator inserts it into the local vector store and checks how similar it is to the other fingerprints in the store. If it is too similar, the validator rejects the data point.
This method offers an efficient way to check similarity against all other files in the network. If you'd like to use this in your DataDAO, see here for an example.
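As a rough illustration of the fingerprinting idea described above, the sketch below hashes character trigrams of the string representation into a fixed-size vector and compares vectors by cosine similarity. This is a stand-in technique chosen for simplicity, not the model influence function itself. The vector size, the 0.95 threshold, and the names `fingerprint` and `FingerprintStore` are all assumptions. The in-memory list stands in for a real vector store and the on-chain record, and for brevity the check runs before the fingerprint is stored rather than after.

```python
import hashlib
import math

DIM = 256  # assumed fingerprint length, chosen only for illustration


def fingerprint(text: str, n: int = 3, dim: int = DIM) -> list[float]:
    """Hash character n-grams into a fixed-size, L2-normalized count vector."""
    vec = [0.0] * dim
    for i in range(len(text) - n + 1):
        gram = text[i:i + n].encode("utf-8")
        # Use a stable hash so the fingerprint is deterministic across runs.
        bucket = int.from_bytes(hashlib.sha256(gram).digest()[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


class FingerprintStore:
    """Toy in-memory stand-in for a validator's local vector store."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # assumed similarity cutoff
        self.vectors: list[list[float]] = []

    def max_similarity(self, fp: list[float]) -> float:
        return max((cosine(fp, v) for v in self.vectors), default=0.0)

    def submit(self, text: str) -> bool:
        """Return True if the data point is accepted, False if too similar."""
        fp = fingerprint(text)
        if self.max_similarity(fp) >= self.threshold:
            return False
        self.vectors.append(fp)
        return True
```

Unlike a checksum, a one-character change to a long input moves this fingerprint only slightly, so the altered copy still scores above the threshold and is rejected. The linear scan over all stored vectors is fine for a sketch; a real vector store would use an index for nearest-neighbor lookup.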
Explanation
Let’s use a ChatGPT DataDAO as an example to explain how the Proof of Contribution process works in practice:
- Authenticity Check
  - Mechanism: Contributors upload their ChatGPT data exports.
  - Verification: The validator compares the checksum of the uploaded export file with the checksum of the file provided via OpenAI’s download link, ensuring no tampering has occurred.
- Ownership Verification
  - Mechanism: In the ChatGPT DataDAO, ownership is validated through metadata included in the data export. This metadata links the data to the user’s OpenAI account.
  - Simplified Check: A simple confirmation via email or an associated account link can serve as an ownership check.
- Quality Evaluation
  - Mechanism: The validator takes sample conversations from the exported data and evaluates them using a language model (e.g., OpenAI’s GPT).
  - Scoring: The conversations are scored for coherence and relevance, with an overall quality score assigned to the data.
- Uniqueness Check
  - Mechanism: A model influence function fingerprints the data and compares it against previously submitted datasets.
  - Preventing Duplicates: If the fingerprint is too similar to existing data, the submission is flagged as a duplicate and rejected.
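The quality evaluation step, and how the four checks might feed a final decision, can be sketched as below. Everything here is illustrative: `score_fn` is a placeholder for a language-model call that returns a coherence and relevance score, `toy_scorer` is a trivial stand-in so the example runs offline, and the sample size, the `MIN_QUALITY` threshold, and the rule that all four checks must pass are assumptions that the description above does not specify.

```python
import random
from typing import Callable, Sequence

MIN_QUALITY = 0.5  # hypothetical acceptance threshold, not from the source


def quality_score(
    conversations: Sequence[str],
    score_fn: Callable[[str], float],
    sample_size: int = 5,
    seed: int = 0,
) -> float:
    """Score a random sample of conversations and return the mean in [0, 1]."""
    if not conversations:
        return 0.0
    rng = random.Random(seed)  # seeded so repeated validation is reproducible
    k = min(sample_size, len(conversations))
    sample = rng.sample(list(conversations), k)
    # Clamp each score so a misbehaving scorer cannot push the mean out of range.
    scores = [min(1.0, max(0.0, score_fn(c))) for c in sample]
    return sum(scores) / len(scores)


def toy_scorer(conversation: str) -> float:
    """Placeholder for a language-model judgment: longer text scores higher."""
    return min(len(conversation.split()) / 50.0, 1.0)


def accept_contribution(authentic: bool, owned: bool, unique: bool, quality: float) -> bool:
    """Assumed decision rule: every check passes and quality meets the threshold."""
    return authentic and owned and unique and quality >= MIN_QUALITY
```

In practice `score_fn` would wrap a request to whichever language model the DataDAO uses, and a DataDAO could just as well weight the four checks into a single score instead of requiring each one to pass.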