data/datasets/biostars_qa/README.md
This dataset contains 4803 question/answer pairs extracted from the BioStars website. The site focuses on bioinformatics, computational genomics, and biological data analysis.
https://huggingface.co/datasets/cannin/biostars_qa
This dataset was generated by downloading individual posts; only limited metadata is included with the dataset. The following Zenodo dataset has the entirety of the downloaded post content as a single JSON file.
https://zenodo.org/record/7813785
get_biostars_dataset(): This function downloads the content from the Biostars API; each post is downloaded as an individual JSON file.

extract_accepted_data(): This function loads the individual files into Pandas, then extracts question/answer pairs. Questions were included if they had an accepted answer and at least 1 vote. The content is then formatted as an Apache Parquet dataset with columns: INSTRUCTION, RESPONSE, SOURCE, METADATA.
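
The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset. The post field names (`id`, `parent_id`, `type`, `vote_count`, `has_accepted`, `title`, `xhtml`) are assumptions about the downloaded JSON. The helper names `load_posts`, `extract_pairs`, and `write_parquet`, the `"biostars"` source label, and the contents of METADATA are hypothetical.

```python
import json
from pathlib import Path


def load_posts(folder):
    """Read every *.json file in `folder` into a list of dicts (one per post)."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.json"))
    ]


def extract_pairs(posts, min_votes=1):
    """Pair each accepted answer with its question.

    Assumed fields (not confirmed by the README): `type`, `id`, `parent_id`,
    `vote_count`, `has_accepted`, `title`, `xhtml`.
    """
    # Index questions that meet the vote threshold by their post id.
    questions = {
        post["id"]: post
        for post in posts
        if post.get("type") == "Question" and post.get("vote_count", 0) >= min_votes
    }

    rows = []
    for post in posts:
        # Keep only answers flagged as accepted.
        if post.get("type") != "Answer" or not post.get("has_accepted"):
            continue
        question = questions.get(post.get("parent_id"))
        if question is None:
            continue  # question missing or below the vote threshold
        rows.append(
            {
                "INSTRUCTION": f"{question.get('title', '')}\n{question.get('xhtml', '')}".strip(),
                "RESPONSE": post.get("xhtml", ""),
                "SOURCE": "biostars",  # hypothetical label
                # Hypothetical metadata layout: the ids of the two posts.
                "METADATA": json.dumps(
                    {"question_id": question["id"], "answer_id": post["id"]}
                ),
            }
        )
    return rows


def write_parquet(rows, path):
    """Write the rows to a Parquet file (needs pandas plus pyarrow or fastparquet)."""
    import pandas as pd  # imported lazily so the pairing logic works without pandas

    columns = ["INSTRUCTION", "RESPONSE", "SOURCE", "METADATA"]
    pd.DataFrame(rows, columns=columns).to_parquet(path, index=False)
```

With the per-post JSON files in a local folder, `write_parquet(extract_pairs(load_posts("posts/")), "biostars_qa.parquet")` would produce a single Parquet file; the folder and file names here are placeholders.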