notebooks/data-augmentation/changemyview-builder/README.md
This program converts data obtained from the subreddit r/changemyview into a cleaner format for further data processing. The data is not clean enough to be used directly in a model yet, and additional preprocessing is required.
The cleaned data is stored in an Apache Parquet file with the following columns:
| Column Name | Description | Data Type |
|---|---|---|
| INSTRUCTION | Post title + body text | String |
| RESPONSE | Body text of the comments that attempt to change the OP's view expressed in INSTRUCTION. | List<String> |
| SOURCE | Permalink to the reddit post | String |
| METADATA | Metadata related to RESPONSE. | Dict<Variant> |
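As a rough illustration of the table above, a single row could be assembled as in the sketch below. This is not the notebook's actual code: the helper name `build_row`, the blank-line separator between title and body, the placement of per-comment scores under a `detoxify_labels` key, and all sample values (including the placeholder permalink) are assumptions made for this example.

```python
from __future__ import annotations

from typing import Any


def build_row(
    title: str,
    body: str,
    comments: list[str],
    permalink: str,
    detoxify_labels: list[dict[str, float]],
) -> dict[str, Any]:
    """Assemble one record with the four columns listed in the table.

    Hypothetical helper: the real notebook may structure this differently.
    """
    # Assumption: one dictionary of scores per comment, in the same order.
    if len(comments) != len(detoxify_labels):
        raise ValueError("expected one label dictionary per comment")
    return {
        # Assumption: title and body are joined with a blank line.
        "INSTRUCTION": f"{title}\n\n{body}".strip(),
        "RESPONSE": list(comments),
        "SOURCE": permalink,
        "METADATA": {"detoxify_labels": list(detoxify_labels)},
    }


# Placeholder values for illustration only; the scores are not real model output.
row = build_row(
    title="CMV: Example title",
    body="Example body text.",
    comments=["First reply.", "Second reply."],
    permalink="https://www.reddit.com/r/changemyview/comments/<post_id>/",
    detoxify_labels=[{"toxicity": 0.01}, {"toxicity": 0.02}],
)
print(sorted(row))
```

Keeping the scores in a list aligned with RESPONSE makes it easy to filter comments by score later, but that layout is a design choice of this sketch rather than something the project specifies.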
Currently, METADATA contains only one category:

- `detoxify_labels`: a dictionary of values output by the Unitary AI Detoxify model, computed for every comment under a given post.

To use the program, follow these instructions:

1. `git clone https://github.com/LAION-AI/Open-Assistant.git`
2. `cd Open-Assistant/notebooks/data-augmentation/changemyview-builder`
3. `jupyter notebook data_processor.ipynb`

If you would like to contribute to this project, please fork the repository and submit a pull request with your changes.

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.