notebooks/data-augmentation/writing-prompt/.ipynb_checkpoints/writing_prompt-checkpoint.ipynb
Use the prompts/story dataset from here: https://www.kaggle.com/datasets/ratthachat/writing-prompts. In addition to the prompt and story pairs, augment with instructions such as: “write a story about {prompt}, ending with the sentence {last_sentence}”; “write a story about {prompt}, where the beginning of the story is about {summary of the beginning part}”; “write a story about {prompt}, where the middle of the story is about {summary of the middle part}”; “write a story about {prompt}, where the end of the story is about {summary of the end part}”.
Here are some samples from writing prompts:
| id | prompt | story |
|---|---|---|
| 1 | [ WP ] When you die , you do n't go to the afterlife of you 're religion , you go to the afterlife of the religion whose tenets you followed most closely , knowingly or not . | Thomas loves science fiction , and is pleased to find himself sitting by the park entrance with Arthur C. Clarke ’ s “ Fountains of Paradise ” open in his lap . He must have jogged there , he thinks to himself as he admires his brand new black-and-white Nikes . He stretches out in his black joggers and turns the page . “ But there was no substitute for reality , one should beware of imitations ” , he reads before shutting the book . <newline> <newline> Thomas ponders what he has read as he looks to the right ; not a single car can be seen . The street appears infinite in length and the buildings fade in to the distance with it . He stands and begins his first step down the street . <newline> <newline> His movement halts when he hears a young voice behind him , “ You look thirsty mister . Would you like some lemonade ? ” <newline> <newline> Thomas walks back past the park entrance and over to the lemonade stand , wondering how he had not noticed it before . It is beautiful , the entrance ; but the park is closed now . Thomas stares up at the gates in awe . <newline> <newline> Thomas is interrupted again by the child , “ $ 5.50 , please. ” <newline> <newline> Thomas looks at the counter , flustered . “ I ’ ll have the punch instead. ” <newline> <newline> As the child pours the purple drink in to the cup , Thomas reaches in his pocket finding a five dollar bill and three quarters . <newline> <newline> “ Keep the change ” , Thomas says as he picks up his drink . <newline> <newline> Thomas sips and the sky slowly dims . He feels his breath drawn away from him as a comet sails over the park entrance . And Heaven ’ s Gate opens . <newline> |
| 2 | [ CW ] [ PM ] Write your hero into a corner , and let me get them out . | Bob dropped five of the Zeds , reloaded his Colt 45 , and ran up the stairs . <newline> <newline> He had someone currently upstairs , alerting Search and Rescue to find a place to land in this urban , industrial nightmare . They were currently in a truck depot , the places where goods would be transferred truck from truck . <newline> <newline> Already , some men defending the front door had been pulled in , causing the rest to fall back . The first , and only , line of physical defense , the hardened steel gates , created to stop robbers , were badly banged up , from the onslaught of fists against it . It was bad enough that the zombies managed to cram two at once inside the doorway , but losing the gates would mean that the horde would rush in . <newline> <newline> Hey ! '' Courtney rushed outside the communications office , her .22 rifle in hand . They 're at the trainstation , just a block from here ! '' <newline> <newline> It 's probably too late , mate . '' Bob said back , Just look at 'em ! '' <newline> <newline> The metal steps leading to the elevated walkway was a savior , only allowing one body to get in at a time . Unfortunately , our heroes had just fought their way here , from a few streets down . Seems easy ? Not when you have to take detours through heavily infested buildings because of blockades in the roads , or just the sheer number of walkers wouldn't 've allowed you to run through them . <newline> <newline> Bob 's equipped with a Colt 1911 .45 caliber pistol , excellent at punching through heads , but at the cost of heavy kickback . Also due to it 's temptingness , Bob has used all but three 7-round magazines . He has a knife , but who the hell would be able to take anyone out with that ? <newline> <newline> Courtney has her 10/22 Ruger Takedown . Initially intended for long range hunting , the rifle particularly excels at going through targets cleanly . 
The only disadvantage is the lack of stopping power . <newline> <newline> They have a fully gassed up FedEx truck at their disposal . A few men inside , surrounded , but armed , are ready to go when you tell them where they need to go . <newline> <newline> Around 31 zombies have gotten in already , with god knows how much outside . |
| 3 | [ cw ] write about the strangest/scariest/saddest dream you 've ever had in less than 200 words . | The night was as thick and terrifying as any I had ever seen before . All I could hear was the scream of the wind past my ears , the pounding of hooves , huffed horse breaths , and the pounding of my own heart . <newline> <newline> The woods were closeknit , and my path was barely visible , hidden under a thick layer of bracken . <newline> <newline> `` Faster , '' I whispered as I dug my heels in . Safety was close and yet so far away , calling to me . He would save me ; I knew it with all my heart . <newline> <newline> All I had to do was outrun the demons at my back first . |
Just in case anyone wants the prompt tag description.
@ontocord, can you improve the issue details using the samples above, please?
Interesting how the [XX] tags are used. I wasn't thinking about those.
I was thinking of instructions -> answers like "User: write me a story about {stripped_prompt} -> Rosey: Sure, here's a story about {stripped_prompt}: {story}", where stripped_prompt removes things like "write about", "in less than 200 words", etc.
And the inverse: "User: What is this story about: {story} -> Rosey: I think it's about: {stripped_prompt}"
You could also summarize longer stories into 4 or 5 bullet-point sentences and ask for an outline. Or you could give an outline and ask Rosey to fill in the story.
For the prompt tag, you could add constraints to the prompts based on the tag. So for [RF], you could append to the end of the actual instruction: "this story could have happened before or should be able to happen in the real world to unknown people. Not what you think could happen in the future."
Let me know if you need more input.
Also these instructions: “write a story about {prompt}, ending with the sentence {last_sentence}”; “write a story about {prompt}, where the beginning of the story is about {summary of the beginning part}”; “write a story about {prompt}, where the middle of the story is about {summary of the middle part}”; “write a story about {prompt}, where the end of the story is about {summary of the end part}”.
The goal of this task is to auto-generate question/answer samples from writingPrompts to feed Open-Assistant. To do that, we standardize the way prompts are written: we define prompt templates, which makes the generation process feasible. Here are the templates we applied:
User: write me a story about: {stripped_prompt} -> Rosey: Sure, here's a story about: {stripped_prompt}:\n{story}
where stripped_prompt is the prompt cleaned by a regex pattern that strips out parts which would not fit the template, and story is the actual answer to the prompt.
Base template, {stripped_constraint} -> Rosey: Sure, here's a story about: {stripped_prompt}, {stripped_constraint}:\n{story}
where stripped_constraint is the constraint found.
Base template, starting with: {beginning} -> Rosey: Sure, here's a story about: {stripped_prompt}, starting with: {beginning}:\n{story}
where beginning is the first sentence of a story.
Base template, ending with: {ending} -> Rosey: Sure, here's a story about: {stripped_prompt}, ending with: {ending}:\n{story}
where ending is the last sentence of a story.
Base template, where the middle of the story is about: {middle} -> Rosey: Sure, here's a story about: {stripped_prompt}, where the middle of the story is about: {middle}:\n{story}
where middle is a summary, produced by a generative model, of the story without its first and last sentences.
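Concretely, filling the base template and the first answer template yields one training sample. This is a minimal sketch; the prompt and story text below are made up for illustration:

```python
# Base template and first answer template from above
base = "User: write me a story about: {stripped_prompt}"
answer = " -> Rosey: Sure, here's a story about: {stripped_prompt}:\n{story}"

stripped_prompt = "a dragon who is afraid of fire"  # hypothetical cleaned prompt
story = "Ember hatched on a cold morning ."        # hypothetical story text

sample = base.format(stripped_prompt=stripped_prompt) + answer.format(
    stripped_prompt=stripped_prompt, story=story
)
print(sample)
```

The same pattern extends to the constraint, beginning, ending, and middle templates: the user turn is the base plus a suffix, and the answer turn restates the full request before the story.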
To get the samples we used the following pipeline:
# helper functions
import json
def save_credentials(d):
    with open("/root/.kaggle/kaggle.json", "w") as outfile:
        json.dump(d, outfile)
# uncomment the following lines if you want to save a kaggle.json
# d = {}
# d['username'] = 'user'
# d['key'] = 'key'
#!mkdir ~/.kaggle
# save_credentials(d)
!mv ~/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
#!pip install kaggle
!kaggle datasets download -d ratthachat/writing-prompts
!unzip writing-prompts.zip
import pandas as pd
from IPython.display import display, HTML
# helper functions
import re
def load_file(path, names):
    with open(path, "r") as f:
        lines = f.readlines()
    return pd.DataFrame(lines, columns=names)

def load_data():
    tags = {
        "WP": "Writing Prompt",
        "SP": "Simple Prompt",
        "EU": "Established Universe",
        "CW": "Constrained Writing",
        "TT": "Theme Thursday",
        "PM": "Prompt Me",
        "MP": "Media Prompt",
        "IP": "Image Prompt",
        "PI": "Prompt Inspired",
        "OT": "Off Topic",
        "RF": "Reality Fiction",
    }
    dfConcat = pd.DataFrame()
    for split in ["train", "valid", "test"]:
        df = load_file(f"writingPrompts/{split}.wp_source", ["prompt"])
        for tag in tags.keys():
            df[tag.lower()] = df["prompt"].map(lambda x: check_tag(x, tag.lower()))
        df["tagCounter"] = df.iloc[:, 1:].sum(axis=1)  # sum across all tag columns
        df["splitLineIndex"] = df.index
        story = load_file(f"writingPrompts/{split}.wp_target", ["story"])
        df["story"] = story["story"]
        df["split"] = split
        dfConcat = pd.concat([dfConcat, df])
    return dfConcat

def check_tag(item, tag):
    r = re.compile(r"[\(\{\[]\s*[\w]{2}\s*[\]\}\)]\s*")
    m = r.findall(item.lower())
    if len(m) > 0:
        for group in m:
            if tag in group:
                return 1
    return 0
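As a quick sanity check, check_tag can be exercised directly (the function is restated here so the snippet runs standalone; the example strings are made up):

```python
import re

def check_tag(item, tag):
    # match two-letter tags wrapped in (), {} or [], e.g. "[ WP ]"
    r = re.compile(r"[\(\{\[]\s*[\w]{2}\s*[\]\}\)]\s*")
    for group in r.findall(item.lower()):
        if tag in group:
            return 1
    return 0

print(check_tag("[ WP ] When you die , you go to ...", "wp"))  # 1
print(check_tag("A prompt without any tag", "wp"))             # 0
```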
def show_data(df):
    html_string = """
<html>
<head><title>HTML Pandas Dataframe with CSS</title></head>
<link rel="stylesheet" type="text/css" href="df_style.css"/>
<body>
{table}
</body>
</html>
"""
    df = df.replace(r"\<newline\>|\< newline \>|\<new line\>", "\n", regex=True)
    df.style.set_properties(**{"text-align": "left"}).set_table_styles(
        [dict(selector="th", props=[("text-align", "left")])]
    )
    html = df.to_html()
    html_string = html_string.format(table=html)
    html_string = (
        html_string.replace(r"\n", "\n")
        .replace("<td>", '<td style="text-align:left">')
        .replace("<th>", '<th style="text-align:left">')
    )
    display(HTML(html_string))
def get_samples(df, n, constraint=None, show=True):
    samples = zip(df.index[:n], df["prompt"].iloc[:n], df["story"].iloc[:n])
    df = pd.DataFrame(samples, columns=["index", "prompt", "story"])
    if constraint is not None:
        df = df[df["prompt"].str.contains(constraint)]
    return df
!head -n2 writingPrompts/test.wp_source
ds = load_data()
ds.head(3)
print(ds.shape)
ds[ds["split"] == "test"].iloc[:2, [13, 0, 14, -1]].columns
show_data(ds[ds["split"] == "train"].iloc[:2][["splitLineIndex", "prompt", "story", "split"]]);
show_data(ds[ds["split"] == "valid"].iloc[:2][["splitLineIndex", "prompt", "story", "split"]]);
show_data(ds[ds["split"] == "test"].iloc[:2][["splitLineIndex", "prompt", "story", "split"]]);
from tqdm import tqdm
df_rep = ds.groupby(["prompt", "split"]).size().reset_index().rename(columns={0: "records"})
df_rep = df_rep[df_rep["records"] > 20].sort_values(["records"], ascending=False)
topPrompts20Reps = df_rep["prompt"].tolist()
print(f"We found {len(topPrompts20Reps)} prompts having more than 20 stories")
PROMPT_PATTERNS = "(Lucifer\snever[\s\w,]+)|\
([\. \w,]+)\.\s+Tell me|\
(All injuries[\. \w,]+)\.|\
(?<!\])(At your[\. \w,]+)\.|\
Daily Prompt \: ([\. \w,]+)|\
In 100 words or less , ([\. \w,]+)\.|\
(Last words/thoughts[\. \w,]+)\.|\
(Magic is Hereditary.*) \[|\
word limit (\) [\. \w,\/]+) \.|\
(Make me love the person you love)|\
(Pack a punch) in 150 words|\
(The last man on earth[\. \w,\/]+kill himself)|\
(The year is 2352 [\. \w,\/'-]+)\.|\
(A person dies[\. \w,\/]+)\.?|\
^[wW]rite a story([\. \w,\/]+) |\
^[wW]rite about ([\. \w,\/-]+)\.?|\
^Writing Prompt (?:\: [wW]rite|\
\[ WP \]) ([\. \w,\/']+) ?|\
^(You 're a[\. \w,\/']+)|\
(You 're moments[\. \w,\/']+)\.|\
(Describe the room you [\. \w\/']+)|\
(Get me hooked \. [ \w,\/']+)|\
[\. \w\/',\`]+ , (tell a horror story)|\
(Make me cry)|\
(Make me hate your character)|\
(Most responses on here have a twist[\. \w\/',\`;]+)|\
(Pick your favorite[\(\)\. \w\/',\`;]+beginning)|\
(Start your story[\(\)\. \w\/',\`;]+meanings \.)|\
(The [\. \w\/',\`;]+ reader)|\
(Two people[\. \w,\/']+bench)|\
Write (a gruesome story)|\
Write (a möb[\. \w,\/']+story) that|\
(Write the letter [ ,\w]+) |\
There is no prompt[ \.\w]+(you[ \.\w']+\.)|\
(A peaceful alien race[ \.\w'-]+)\.|\
(This is the prologue[\(\) \.\w'-]+)\.|\
Write a short story where (the first[\(\) \.\w'-,]+)\.|\
(Write the first and last paragraph[\(\) \.\w'-,]+)\.|\
(Killing Hitler has[\(\) \.\w'-,\?]+)|\
(You live in a city full[\(\) \.\w'-,\?\#]+)|\
\`\` She said she loved him . [\`'\(\) \.\w'-,\?\#]+\.|\
(A soldier on the front dies[\(\) \.\w'-,\?\#]+)|\
(You discover a grand hall[\(\) \.\w'-,\?\#]+)|\
(A boy asks a girl out . It 's high[\(\) \.\w'-,\?\#]+)|\
(When everyone turns 18 , they receive a pet[\(\) \.\w'-,\?\#]+)|\
(To get in Heaven , you have to [\/\(\) \.\w'-,\?\#]+)|\
(You are born without emotions [;\/\(\) \.\w'-,\?\#]+)|\
(You are a teenager with the ability[\`;\/\(\) \.\w'-,\?\#]+)|\
(You live in a world where every person [\`;\/\(\) \.\w'-,\?\#]+)"
CONST_PATTERNS = "Daily Prompt \: [\. \w,]+\[ ([\. \w,\:]+)|\
(In 100 words or less) , ([\. \w,\:]+) \.|\
Make a story \( ([\. \w,\:]+) |\
Pack a punch (in 150 words)|\
Describe the room you [\. \w\/']+([\. \w,\:\/]+)\.|\
Get me hooked \. Reel me in \. ([\. \w\/',\`]+)\.|\
([\. \w\/',\`]+) , tell a horror story|\
Make me cry ([ \w\/',\`]+).?|\
(in 150 words or less)|\
Pick your favorite[\(\)\. \w\/',\`;]+beginning \. ([ \w\/',\`]+)|\
Start your story[\(\)\. \w\/',\`;]+meanings \.([ \w\/',\`]+\.)|\
The [\. \w\/',\`;]+ reader ,([\. \w\/',\`;]+)|\
Two people[\. \w,\/']+bench \. ([\. \w,\:]+)|\
Write a gruesome story ([\. \w,\:]+)|\
Write a möb[\. \w,\/']+story (that[\. \w,\/']+)"
#!pip install spacy -qqq
We aim to augment the data as follows:
#!pip install transformers
# @markdown utils
from transformers.utils.logging import set_verbosity
set_verbosity(40)
import warnings
# ignore hf pipeline complaints
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")
warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")
import torch
from transformers import pipeline
summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)
params = {
    "max_length": 1024,
    "min_length": 8,
    "no_repeat_ngram_size": 3,
    "early_stopping": False,
    "repetition_penalty": 3.5,
    "length_penalty": 0.3,
    "encoder_no_repeat_ngram_size": 3,
    "num_beams": 4,
}  # generation parameters for the summarization model
import spacy
# helper functions
import re
def extract_prompt_parts(prompt, pattern):
    """
    Return the part of `prompt` that matches `pattern`, or None if nothing matches.
    """
    pattern = pattern.replace("\\\n", "\\")
    if m := re.search(pattern, prompt, re.IGNORECASE):
        if len(m.groups()) > 0:
            return m.group(0)
    return None
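To see how the pattern matching works in isolation, here is extract_prompt_parts applied to a hypothetical mini-pattern standing in for the full PROMPT_PATTERNS alternation (the prompt text is made up):

```python
import re

def extract_prompt_parts(prompt, pattern):
    # return the full match when at least one capture group fired, else None
    if m := re.search(pattern, prompt, re.IGNORECASE):
        if len(m.groups()) > 0:
            return m.group(0)
    return None

# hypothetical mini-pattern, standing in for the full PROMPT_PATTERNS alternation
mini_pattern = r"^\[ WP \] (you wake up[\. \w,]+)"
print(extract_prompt_parts("[ WP ] You wake up in a world without music .", mini_pattern))
# -> "[ WP ] You wake up in a world without music ."
```

Each branch of PROMPT_PATTERNS follows this shape: match a known prompt framing and capture the part worth keeping.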
from spacy.lang.en import English
def get_sentences(_str):
    chunks = _str.split("\n")
    sentences = []
    nlp = English()
    nlp.add_pipe("sentencizer")
    for chunk in chunks:
        doc = nlp(chunk)
        sentences += [sent.text.strip() for sent in doc.sents]
    return sentences
from itertools import islice
def window(seq, n=2):
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield " ".join(result)
    for elem in it:
        result = result[1:] + (elem,)
        yield " ".join(result)
def extract_story_parts(story):
    sentences = get_sentences(story)
    beginning = sentences.pop(0)
    ending = sentences.pop(-1)
    middles = window(sentences, 4)  # sliding 4-sentence windows over the middle
    return beginning, middles, ending
def clear_prompt(prompt):
    return re.sub(r"^[Ww]rite ", "", prompt)

def get_sample_dict(split, id, text):
    return {"split": split, "splitLineIndex": id, "text": text}
def filter_data(
    dataset,
    negativeTagFilter=None,
    positiveTagFilter=None,
    patternFilter=None,
):
    """
    > filter_data(dataset['train'], negativeTagFilter=['ip'], positiveTagFilter=['pm'])
    """
    prompt = dataset["prompt"]
    if negativeTagFilter is not None:
        prompt = prompt[(prompt[negativeTagFilter] < 1).any(axis=1)]
    if positiveTagFilter is not None:
        prompt = prompt[prompt[positiveTagFilter].gt(0).all(axis=1)]
    if patternFilter is not None:
        prompt = prompt[prompt["prompt"].str.contains(patternFilter)]
    story = dataset["story"]
    story = story.iloc[prompt.index]
    return {"prompt": prompt, "story": story}
def generate_instruction_diologs(prompt, df):
    dialogs = []
    # inverse task idea: "User: What is this story about: {story} -> Rosey: I think it's about: {stripped_prompt}"
    dialogBase = """User: write me a story about: {stripped_prompt}"""
    dialog1 = """ -> Rosey: Sure, here's a story about: {stripped_prompt}:\n{story}"""
    dialog2 = """, {stripped_constraint} -> Rosey: Sure, here's a story about: {stripped_prompt}, {stripped_constraint}:\n{story}"""
    dialog3 = """, starting with: {beginning} -> Rosey: Sure, here's a story about: {stripped_prompt}, starting with: {beginning}:\n{story}"""
    dialog4 = """, ending with: {ending} -> Rosey: Sure, here's a story about: {stripped_prompt}, ending with: {ending}:\n{story}"""
    dialog5 = """, where the middle of the story is about: {middle} -> Rosey: Sure, here's a story about: {stripped_prompt}, where the middle of the story is about: {middle}:\n{story}"""
    strippedPrompt = extract_prompt_parts(prompt, PROMPT_PATTERNS)
    if strippedPrompt is not None:
        strippedPrompt = clear_prompt(strippedPrompt)
        strippedConstraint = extract_prompt_parts(prompt, CONST_PATTERNS)
        pbar = tqdm(ascii=True, desc="stories")
        pbar.reset(total=len(df[df["prompt"] == prompt]))
        for row in df[df["prompt"] == prompt].itertuples():
            try:
                story = (
                    row.story.replace("<newline>", "\n")
                    .replace("< newline >", "\n")
                    .replace("<new line>", "\n")
                    .strip()
                )
                dialogBeg = dialogBase.format(stripped_prompt=strippedPrompt)
                dialog = dialogBeg + dialog1.format(story=story, stripped_prompt=strippedPrompt)
                dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                if strippedConstraint is not None:
                    dialog = dialogBeg + dialog2.format(
                        stripped_prompt=strippedPrompt, stripped_constraint=strippedConstraint, story=story
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                beginning, middles, ending = extract_story_parts(story)
                if beginning is not None:
                    middles = list(middles)  # materialize the window generator before reuse
                    dialog = dialogBeg + dialog3.format(
                        stripped_prompt=strippedPrompt, story=story, beginning=beginning
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                    dialog = dialogBeg + dialog4.format(stripped_prompt=strippedPrompt, story=story, ending=ending)
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                    middlesSummarized = summarizer(middles, **params)
                    for summarizedMiddle in middlesSummarized:
                        dialog = dialogBeg + dialog5.format(
                            stripped_prompt=strippedPrompt, story=story, middle=summarizedMiddle[0]["summary_text"]
                        )
                        dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                pbar.update()
            except Exception as e:
                print(f"{row.split}/{row.splitLineIndex}")
                raise e
        pbar.refresh()
    return dialogs
The driver loop below writes the accumulated samples to a parquet file every `step` prompts, so partial work is not lost.
## filter the dataset to keep only prompts with more than 20 stories
dialogs = []
start = 0
step = 10
for index in range(start, len(topPrompts20Reps), step):
    pbar = tqdm(ascii=True, desc="prompt")
    pbar.reset(total=len(topPrompts20Reps[index : index + step]))
    for prompt in topPrompts20Reps[index : index + step]:
        tmpDialogs = generate_instruction_diologs(prompt, ds)
        if tmpDialogs is not None:
            dialogs += tmpDialogs
        pbar.update()
    if len(dialogs) > 0:
        pd.DataFrame(dialogs).to_parquet("writing-prompts-aug.parquet")
    pbar.refresh()
df = pd.read_parquet("writing-prompts-aug.parquet")
for split in list(set(df.split)):
    df_aux = df[df["split"] == split].iloc[:, 1:]
    df_aux.reset_index(inplace=True)
    df_aux.iloc[:, 1:].to_parquet(f"{split}.parquet")