
writing prompt augmentation data task

notebooks/data-augmentation/writing-prompt/.ipynb_checkpoints/writing_prompt-checkpoint.ipynb


Comments

1. ontocord

Use the prompts/story dataset from here: https://www.kaggle.com/datasets/ratthachat/writing-prompts. In addition to the prompts and story, augment with instructions such as “write a story about {prompt}, ending with the sentence {last_sentence}”. “write a story about {prompt}, where the beginning of the story is about {summary of the beginning part}”. “write a story about {prompt}, where the middle of the story is about {summary of the middle part}”. “write a story about {prompt}, where the end of the story is about {summary of the end part}”
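The suggested augmentations could be sketched as a small template-filling helper (a minimal sketch; `augment` and its arguments are illustrative names, not part of the dataset):

```python
# Hypothetical sketch of the suggested augmentations; all names are illustrative.
def augment(prompt, last_sentence, summaries):
    """Build the four suggested instruction variants for one prompt/story pair."""
    return [
        f"write a story about {prompt}, ending with the sentence {last_sentence}",
        f"write a story about {prompt}, where the beginning of the story is about {summaries['beginning']}",
        f"write a story about {prompt}, where the middle of the story is about {summaries['middle']}",
        f"write a story about {prompt}, where the end of the story is about {summaries['end']}",
    ]
```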

2. fabraz

Here are some samples from writing prompts:

id | prompt | story
1[ WP ] When you die , you do n't go to the afterlife of you 're religion , you go to the afterlife of the religion whose tenets you followed most closely , knowingly or not .Thomas loves science fiction , and is pleased to find himself sitting by the park entrance with Arthur C. Clarke ’ s “ Fountains of Paradise ” open in his lap . He must have jogged there , he thinks to himself as he admires his brand new black-and-white Nikes . He stretches out in his black joggers and turns the page . “ But there was no substitute for reality , one should beware of imitations ” , he reads before shutting the book . <newline> <newline> Thomas ponders what he has read as he looks to the right ; not a single car can be seen . The street appears infinite in length and the buildings fade in to the distance with it . He stands and begins his first step down the street . <newline> <newline> His movement halts when he hears a young voice behind him , “ You look thirsty mister . Would you like some lemonade ? ” <newline> <newline> Thomas walks back past the park entrance and over to the lemonade stand , wondering how he had not noticed it before . It is beautiful , the entrance ; but the park is closed now . Thomas stares up at the gates in awe . <newline> <newline> Thomas is interrupted again by the child , “ $ 5.50 , please. ” <newline> <newline> Thomas looks at the counter , flustered . “ I ’ ll have the punch instead. ” <newline> <newline> As the child pours the purple drink in to the cup , Thomas reaches in his pocket finding a five dollar bill and three quarters . <newline> <newline> “ Keep the change ” , Thomas says as he picks up his drink . <newline> <newline> Thomas sips and the sky slowly dims . He feels his breath drawn away from him as a comet sails over the park entrance . And Heaven ’ s Gate opens . <newline>
2[ CW ] [ PM ] Write your hero into a corner , and let me get them out .Bob dropped five of the Zeds , reloaded his Colt 45 , and ran up the stairs . <newline> <newline> He had someone currently upstairs , alerting Search and Rescue to find a place to land in this urban , industrial nightmare . They were currently in a truck depot , the places where goods would be transferred truck from truck . <newline> <newline> Already , some men defending the front door had been pulled in , causing the rest to fall back . The first , and only , line of physical defense , the hardened steel gates , created to stop robbers , were badly banged up , from the onslaught of fists against it . It was bad enough that the zombies managed to cram two at once inside the doorway , but losing the gates would mean that the horde would rush in . <newline> <newline> Hey ! '' Courtney rushed outside the communications office , her .22 rifle in hand . They 're at the trainstation , just a block from here ! '' <newline> <newline> It 's probably too late , mate . '' Bob said back , Just look at 'em ! '' <newline> <newline> The metal steps leading to the elevated walkway was a savior , only allowing one body to get in at a time . Unfortunately , our heroes had just fought their way here , from a few streets down . Seems easy ? Not when you have to take detours through heavily infested buildings because of blockades in the roads , or just the sheer number of walkers wouldn't 've allowed you to run through them . <newline> <newline> Bob 's equipped with a Colt 1911 .45 caliber pistol , excellent at punching through heads , but at the cost of heavy kickback . Also due to it 's temptingness , Bob has used all but three 7-round magazines . He has a knife , but who the hell would be able to take anyone out with that ? <newline> <newline> Courtney has her 10/22 Ruger Takedown . Initially intended for long range hunting , the rifle particularly excels at going through targets cleanly . 
The only disadvantage is the lack of stopping power . <newline> <newline> They have a fully gassed up FedEx truck at their disposal . A few men inside , surrounded , but armed , are ready to go when you tell them where they need to go . <newline> <newline> Around 31 zombies have gotten in already , with god knows how much outside .
3[ cw ] write about the strangest/scariest/saddest dream you 've ever had in less than 200 words .The night was as thick and terrifying as any I had ever seen before . All I could hear was the scream of the wind past my ears , the pounding of hooves , huffed horse breaths , and the pounding of my own heart . <newline> <newline> The woods were closeknit , and my path was barely visible , hidden under a thick layer of bracken . <newline> <newline> `` Faster , '' I whispered as I dug my heels in . Safety was close and yet so far away , calling to me . He would save me ; I knew it with all my heart . <newline> <newline> All I had to do was outrun the demons at my back first .

Just in case anyone wants the prompt tag description.

@ontocord , can you improve the issue details using the samples above, please?

3. ontocord

Interesting how the [XX] tags are used. I wasn't thinking about those.

I was thinking of Instructions -> answers like "User: write me a story about {stripped_prompt} -> Rosey: Sure, here's a story about {stripped_prompt}: {story}" where stripped_prompt removes things like "write about" "in less than 200 words", etc.

And the inverse: "User: What is this story about: {story} -> Rosey: I think it's about: {stripped_prompt}"
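A minimal sketch of such stripping, assuming only a couple of the meta-phrases visible in the samples above (the regex patterns used later in the pipeline are far more extensive):

```python
import re


def strip_prompt(prompt):
    # Illustrative only: drop "[ WP ]"-style tags and meta phrases such as
    # "write about ..." and "in less than 200 words".
    prompt = re.sub(r"[\[\(\{]\s*\w{2}\s*[\]\)\}]\s*", "", prompt)
    prompt = re.sub(r"^[Ww]rite (a story )?about\s*", "", prompt)
    prompt = re.sub(r"\s*in less than \d+ words\s*\.?\s*", " ", prompt)
    return prompt.strip()
```

For instance, the `[ cw ]` sample above reduces to "the strangest dream you 've ever had".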

You could also summarize longer stories into four or five bullet-point sentences and ask for an outline. Or you could give an outline and ask Rosey to fill in the story.

For the prompt tag, you could add constraints to the prompts based on the tag. So for [RF], you could append to the end of the actual instruction: {this story should have happened before, or should be able to happen in the real world to unknown people, not what you think could happen in the future}.

Let me know if you need more input.

4. ontocord

Also these instructions: “write a story about {prompt}, ending with the sentence {last_sentence}”. “write a story about {prompt}, where the beginning of the story is about {summary of the beginning part}”. “write a story about {prompt}, where the middle of the story is about {summary of the middle part}”. “write a story about {prompt}, where the end of the story is about {summary of the end part}”

Pipeline

The goal of this task was to auto-generate question/answer samples from writingPrompts to feed Open Assistant. To do that, we standardized the way prompts are written by defining prompt templates, which makes the generation process feasible. Here are the templates we applied:

  • Base template: every prompt yields this sample.

User: write me a story about: {stripped_prompt} -> Rosey: Sure, here's a story about: {stripped_prompt}:\n{story}

where stripped_prompt is the prompt after regex patterns have removed the parts that would not fit the template, and story is the matching story from the dataset.

  • General constraints: prompts in which a regex pattern finds a constraint also yield this sample.

Base template, {stripped_constraint} -> Rosey: Sure, here's a story about: {stripped_prompt}, {stripped_constraint}:\n{story}

where stripped_constraint is the constraint found.

  • Answer beginning constraints: this constraint fixes how the answer should begin.

Base template, starting with: {beginning} -> Rosey: Sure, here's a story about: {stripped_prompt}, starting with: {beginning}:\n{story}

where beginning is the first sentence of a story.

  • Answer end constraints: this constraint fixes how the answer should end.

Base template, ending with: {ending} -> Rosey: Sure, here's a story about {stripped_prompt}: ending with: {ending}\n{story}

where ending is the last sentence of a story.

  • Answer middle constraints: this constraint fixes what the middle of the answer should contain.

Base template, where the middle of the story is about: {middle} -> Rosey: Sure, here's a story about: {stripped_prompt}, where the middle of the story is about: {middle}:\n{story}

where middle is a generative-model summary of the story with its first and last sentences removed.
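Filled in with made-up values, the base template plus the ending constraint produce a sample like this (the template strings are copied from above; the story text is invented for illustration):

```python
# The base template and the "ending" variant, copied from the pipeline description.
base = "User: write me a story about: {stripped_prompt}"
ending_variant = base + ", ending with: {ending} -> Rosey: Sure, here's a story about {stripped_prompt}: ending with: {ending}\n{story}"

# Made-up values, for illustration only.
sample = ending_variant.format(
    stripped_prompt="a haunted lighthouse",
    ending="The light went out for the last time.",
    story="The keeper climbed the stairs one last night.\nThe light went out for the last time.",
)
```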

To get the samples we used the following pipeline:

  • Get data: download from kaggle
  • Pre-processing: load the source/target pairs (aka prompt/story) for every split (train/valid/test), merge them into one pandas DataFrame, and enhance it with tabular info about the sample tags.
  • Triage prompts: pick prompts sorted by frequency, and build regex patterns for some of them to extract a stripped prompt and the related constraint.
  • Split stories: after removing each story's first and last sentences, apply a sentence sliding window to get summaries of the middle.
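The sliding window in the last step can be sketched as follows (the same `islice` recipe the notebook uses, shown here with a window of 2 on toy sentences):

```python
from itertools import islice


def window(seq, n=4):
    # Yield space-joined windows of n consecutive sentences (standard islice recipe).
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield " ".join(result)
    for elem in it:
        result = result[1:] + (elem,)
        yield " ".join(result)


# Drop the first and last sentence, then window what remains.
sentences = ["First.", "A.", "B.", "C.", "Last."]
middles = list(window(sentences[1:-1], n=2))
```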

Get data from Kaggle

python
# helper functions
import json


def save_credentials(d):
    with open("/root/.kaggle/kaggle.json", "w") as outfile:
        json.dump(d, outfile)
python
# uncomment the following instructions, in case you want to save a .kaggle.json
# d = {}
# d['username'] = 'user'
# d['key'] = 'key'
#!mkdir ~/.kaggle
# save_credentials(d)
!mv ~/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
python
#!pip install kaggle
python
!kaggle datasets download -d ratthachat/writing-prompts
python
!unzip writing-prompts.zip

Pre-processing

python
import pandas as pd
from IPython.display import display, HTML
python
# helper functions
import re


def load_file(path, names):
    with open(path, "r") as f:
        lines = f.readlines()
    return pd.DataFrame(lines, columns=names)


def load_data():
    tags = {
        "WP": "Writing Prompt",
        "SP": "Simple Prompt",
        "EU": "Established Universe",
        "CW": "Constrained Writing",
        "TT": "Theme Thursday",
        "PM": "Prompt Me",
        "MP": "Media Prompt",
        "IP": "Image Prompt",
        "PI": "Prompt Inspired",
        "OT": "Off Topic",
        "RF": "Reality Fiction",
    }

    dfConcat = pd.DataFrame()
    for split in ["train", "valid", "test"]:
        df = load_file(f"writingPrompts/{split}.wp_source", ["prompt"])
        for tag in tags.keys():
            df[tag.lower()] = df["prompt"].map(lambda x: check_tag(x, tag.lower()))
        df["tagCounter"] = df.iloc[:, 1:].sum(axis=1)  # sum across all tag columns
        df["splitLineIndex"] = df.index
        story = load_file(f"writingPrompts/{split}.wp_target", ["story"])
        df["story"] = story["story"]
        df["split"] = split
        dfConcat = pd.concat([dfConcat, df])
    return dfConcat


def check_tag(item, tag):
    r = re.compile(r"[\(\{\[]\s*[\w]{2}\s*[\]\}\)]\s*")
    m = r.findall(item.lower())
    if len(m) > 0:
        for group in m:
            if tag in group:
                return 1
    return 0


def show_data(df):
    html_string = """
                <html>
                  <head><title>HTML Pandas Dataframe with CSS</title></head>
                  <link rel="stylesheet" type="text/css" href="df_style.css"/>
                  <body>
                    {table}
                  </body>
                </html>.
                """
    df = df.replace(r"<newline>|< newline >|<new line>", "\n", regex=True)
    df.style.set_properties(**{"text-align": "left"}).set_table_styles(
        [dict(selector="th", props=[("text-align", "left")])]
    )
    html = df.to_html()
    html_string = html_string.format(table=html)
    html_string = (
        html_string.replace(r"\n", "\n")
        .replace("<td>", '<td style="text-align:left">')
        .replace("<th>", '<th style="text-align:left">')
    )
    display(HTML(html_string))


def get_samples(df, n, constraint=None, show=True):
    # expects the {"prompt": ..., "story": ...} dict of DataFrames returned by filter_data
    samples = zip(df["prompt"].iloc[:n, 0].index, df["prompt"].iloc[:n, 0], df["story"].iloc[:n, 0])
    df = pd.DataFrame(samples, columns=["index", "prompt", "story"])
    if constraint is not None:
        df = df[df["prompt"].str.contains(constraint)]
    return df
python
!head -n2 writingPrompts/test.wp_source
python
ds = load_data()
python
ds.head(3)
python
print(ds.shape)
python
ds[ds["split"] == "test"].iloc[:2, [13, 0, 14, -1]].columns

Samples

Train

python
show_data(ds[ds["split"] == "train"].iloc[:2][["splitLineIndex", "prompt", "story", "split"]]);

Valid

python
show_data(ds[ds["split"] == "valid"].iloc[:2][["splitLineIndex", "prompt", "story", "split"]]);

Test

python
show_data(ds[ds["split"] == "test"].iloc[:2][["splitLineIndex", "prompt", "story", "split"]]);

Augmentation

python
from tqdm import tqdm

Triage Prompts

  1. Take the prompts list order by frequency
  2. Define regex patterns for prompt and constraint
  3. Generate prompts
python
df_rep = ds.groupby(["prompt", "split"]).size().reset_index().rename(columns={0: "records"})
python
df_rep = df_rep[df_rep["records"] > 20].sort_values(["records"], ascending=False)
# _str = df_rep[df_rep['records']>20].sort_values(['records'], ascending=False).iloc[1,0]
python
# df_rep[df_rep["split"] == "valid"].iloc[1:3, 0]
# topPrompts20Reps += df_rep[df_rep["split"] == "valid"].iloc[1:3, 0].to_list()
python
topPrompts20Reps = df_rep[df_rep["records"] > 20].sort_values(["records"], ascending=False)["prompt"].tolist()
python
topPrompts20Reps
python
print(f"We found {len(topPrompts20Reps)} prompts having more than 20 stories")
python
PROMPT_PATTERNS = "(Lucifer\snever[\s\w,]+)|\
([\. \w,]+)\.\s+Tell me|\
(All injuries[\. \w,]+)\.|\
(?<!\])(At your[\. \w,]+)\.|\
Daily Prompt \: ([\. \w,]+)|\
In 100 words or less , ([\. \w,]+)\.|\
(Last words/thoughts[\. \w,]+)\.|\
(Magic is Hereditary.*) \[|\
word limit (\) [\. \w,\/]+) \.|\
(Make me love the person you love)|\
(Pack a punch) in 150 words|\
(The last man on earth[\. \w,\/]+kill himself)|\
(The year is 2352 [\. \w,\/'-]+)\.|\
(A person dies[\. \w,\/]+)\.?|\
^[wW]rite a story([\. \w,\/]+) |\
^[wW]rite about ([\. \w,\/-]+)\.?|\
^Writing Prompt (?:\: [wW]rite|\
\[ WP \]) ([\. \w,\/']+) ?|\
^(You 're a[\. \w,\/']+)|\
(You 're moments[\. \w,\/']+)\.|\
(Describe the room you [\. \w\/']+)|\
 (Get me hooked \. [ \w,\/']+)|\
[\. \w\/',\`]+ , (tell a horror story)|\
(Make me cry)|\
(Make me hate your character)|\
(Most responses on here have a twist[\. \w\/',\`;]+)|\
(Pick your favorite[\(\)\. \w\/',\`;]+beginning)|\
(Start your story[\(\)\. \w\/',\`;]+meanings \.)|\
(The [\. \w\/',\`;]+ reader)|\
(Two people[\. \w,\/']+bench)|\
Write (a gruesome story)|\
Write (a möb[\. \w,\/']+story) that|\
(Write the letter [ ,\w]+) |\
There is no prompt[ \.\w]+(you[ \.\w']+\.)|\
(A peaceful alien race[ \.\w'-]+)\.|\
(This is the prologue[\(\) \.\w'-]+)\.|\
Write a short story where (the first[\(\) \.\w'-,]+)\.|\
(Write the first and last paragraph[\(\) \.\w'-,]+)\.|\
(Killing Hitler has[\(\) \.\w'-,\?]+)|\
(You live in a city full[\(\) \.\w'-,\?\#]+)|\
\`\` She said she loved him . [\`'\(\) \.\w'-,\?\#]+\.|\
(A soldier on the front dies[\(\) \.\w'-,\?\#]+)|\
(You discover a grand hall[\(\) \.\w'-,\?\#]+)|\
(A boy asks a girl out . It 's high[\(\) \.\w'-,\?\#]+)|\
(When everyone turns 18 , they receive a pet[\(\) \.\w'-,\?\#]+)|\
(To get in Heaven , you have to [\/\(\) \.\w'-,\?\#]+)|\
(You are born without emotions [;\/\(\) \.\w'-,\?\#]+)|\
(You are a teenager with the ability[\`;\/\(\) \.\w'-,\?\#]+)|\
(You live in a world where every person [\`;\/\(\) \.\w'-,\?\#]+)"


CONST_PATTERNS = "Daily Prompt \: [\. \w,]+\[ ([\. \w,\:]+)|\
(In 100 words or less) , ([\. \w,\:]+) \.|\
Make a story \( ([\. \w,\:]+) |\
Pack a punch (in 150 words)|\
Describe the room you [\. \w\/']+([\. \w,\:\/]+)\.|\
Get me hooked \. Reel me in \. ([\. \w\/',\`]+)\.|\
 ([\. \w\/',\`]+) , tell a horror story|\
Make me cry ([ \w\/',\`]+).?|\
(in 150 words or less)|\
Pick your favorite[\(\)\. \w\/',\`;]+beginning \. ([ \w\/',\`]+)|\
Start your story[\(\)\. \w\/',\`;]+meanings \.([ \w\/',\`]+\.)|\
The [\. \w\/',\`;]+ reader ,([\. \w\/',\`;]+)|\
Two people[\. \w,\/']+bench \. ([\. \w,\:]+)|\
Write a gruesome story ([\. \w,\:]+)|\
Write a möb[\. \w,\/']+story (that[\. \w,\/']+)"
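As a quick sanity check, one alternative from PROMPT_PATTERNS can be applied on its own (the input prompt here is invented):

```python
import re

# One alternative copied from PROMPT_PATTERNS above, applied in isolation.
pattern = r"^[wW]rite about ([\. \w,\/-]+)\.?"
m = re.search(pattern, "Write about a city beneath the sea", re.IGNORECASE)
stripped = m.group(1) if m else None
```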

Add summary columns to data

python
#!pip install spacy -qqq

We aim to augment data as follows:

  • Prompt:
    • whole
      • constraints
  • Story:
    • whole
    • beginning
    • middle - sliding window summarized
    • end

Summarization

python
#!pip install transformers
python
# @markdown utils
from transformers.utils.logging import set_verbosity

set_verbosity(40)

import warnings

# ignore hf pipeline complaints
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")
warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")
python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)
python
params = {
    "max_length": 1024,
    "min_length": 8,
    "no_repeat_ngram_size": 3,
    "early_stopping": False,
    "repetition_penalty": 3.5,
    "length_penalty": 0.3,
    "encoder_no_repeat_ngram_size": 3,
    "num_beams": 4,
}  # parameters for text generation out of model

Interpolation

python
import spacy
python
# helper functions

import re


def extract_prompt_parts(prompt, pattern):
    """
    Extract the part of the prompt that matches the pattern; return None on no match.
    """
    pattern = pattern.replace("\\\n", "\\")
    if m := re.search(pattern, prompt, re.IGNORECASE):
        if len(m.groups()) > 0:
            return m.group(0)
    return None


from spacy.lang.en import English


def get_sentences(_str):
    chunks = _str.split("\n")
    sentences = []
    nlp = English()
    nlp.add_pipe("sentencizer")
    for chunk in chunks:
        doc = nlp(chunk)
        sentences += [sent.text.strip() for sent in doc.sents]
    return sentences


from itertools import islice


def window(seq, n=2):
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield " ".join(result)
    for elem in it:
        result = result[1:] + (elem,)
        yield " ".join(result)


def extract_story_parts(story):
    sentences = get_sentences(story)
    beginning = sentences.pop(0)
    ending = sentences.pop(-1)
    # materialize the windows so they can be both summarized and zipped over later
    middles = list(window(sentences, 4))
    return beginning, middles, ending


def clear_prompt(prompt):
    return re.sub(r"^[Ww]rite ", "", prompt)


def get_sample_dict(split, id, text):
    return {"split": split, "splitLineIndex": id, "text": text}


def generate_instruction_diologs(df):
    dialogs = []
    """User: What is this story about: {story} -> Rosey: I think it's about: {striped_prompt}"""
    dialogBase = """User: write me a story about: {stripped_prompt}"""
    dialog1 = """ -> Rosey: Sure, here's a story about: {stripped_prompt}:\n{story}"""
    dialog2 = """, {stripped_constraint} -> Rosey: Sure, here's a story about: {stripped_prompt}, {stripped_constraint}:\n{story}"""
    dialog3 = """, starting with: {beggining} -> Rosey: Sure, here's a story about: {stripped_prompt}, starting with: {beggining}:\n{story}"""
    dialog4 = """, ending with: {ending} -> Rosey: Sure, here's a story about {stripped_prompt}: ending with: {ending}\n{story}"""
    dialog5 = """, where the middle of the story is about: {middle} -> Rosey: Sure, here's a story about: {stripped_prompt}, where the middle of the story is about: {middle}:\n{story}"""

    df_rep = df.groupby(["prompt"]).size().reset_index().rename(columns={0: "records"})
    df_rep.sort_values(["records"], ascending=False, inplace=True)
    pbar = tqdm()
    pbar.reset(total=len(df_rep))
    for prompt in df_rep.iloc[:, 0]:
        strippedPrompt = extract_prompt_parts(prompt, PROMPT_PATTERNS)
        if strippedPrompt is None:
            continue
        strippedPrompt = clear_prompt(strippedPrompt)
        strippedConstraint = extract_prompt_parts(prompt, CONST_PATTERNS)

        for row in df[df["prompt"] == prompt].itertuples():
            try:
                story = (
                    row.story.replace("<newline>", "\n")
                    .replace("< newline >", "\n")
                    .replace("<new line>", "\n")
                    .strip()
                )
                beginning, middles, ending = extract_story_parts(story)
                dialogBeg = dialogBase.format(stripped_prompt=strippedPrompt)
                dialog = dialogBeg + dialog1.format(story=story, stripped_prompt=strippedPrompt)
                dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                if strippedConstraint is not None:
                    dialog = dialogBeg + dialog2.format(
                        stripped_prompt=strippedPrompt, stripped_constraint=strippedConstraint, story=story
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                dialog = dialogBeg + dialog3.format(stripped_prompt=strippedPrompt, story=story, beggining=beginning)
                dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                dialog = dialogBeg + dialog4.format(stripped_prompt=strippedPrompt, story=story, ending=ending)
                dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                middlesSumarizered = summarizer(middles, **params)
                for middle, sumarizedMiddle in zip(middles, middlesSumarizered):
                    # dialogs.append(dialogBeg + dialog5.format(stripped_prompt=strippedPrompt, story=story, middle=middle))
                    dialog = dialogBeg + dialog5.format(
                        stripped_prompt=strippedPrompt, story=story, middle=sumarizedMiddle[0]["summary_text"]
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                pbar.update()
            except Exception as e:
                print(f"{row.split}/{row.splitLineIndex}")
                raise e
        pbar.refresh()
    return dialogs


def filter_data(
    dataset,
    negativeTagFilter=None,
    positiveTagFilter=None,
    patternFilter=None,
):
    """
    > filter_data(dataset['train'],negativeTagFilter=['ip'], positiveTagFilter=['pm'] )
    """
    prompt = dataset["prompt"]
    if negativeTagFilter is not None:
        prompt = prompt[(prompt[negativeTagFilter] < 1).any(axis=1)]
    if positiveTagFilter is not None:
        prompt = prompt[prompt[positiveTagFilter].gt(0).all(axis=1)]
    if patternFilter is not None:
        prompt = prompt[prompt["prompt"].str.contains(patternFilter)]
    story = dataset["story"]
    story = story.iloc[prompt.index]
    return {"prompt": prompt, "story": story}


def generate_instruction_diologs(prompt, df):
    dialogs = []
    """User: What is this story about: {story} -> Rosey: I think it's about: {striped_prompt}"""
    dialogBase = """User: write me a story about: {stripped_prompt}"""
    dialog1 = """ -> Rosey: Sure, here's a story about: {stripped_prompt}:\n{story}"""
    dialog2 = """, {stripped_constraint} -> Rosey: Sure, here's a story about: {stripped_prompt}, {stripped_constraint}:\n{story}"""
    dialog3 = """, starting with: {beggining} -> Rosey: Sure, here's a story about: {stripped_prompt}, starting with: {beggining}:\n{story}"""
    dialog4 = """, ending with: {ending} -> Rosey: Sure, here's a story about {stripped_prompt}: ending with: {ending}\n{story}"""
    dialog5 = """, where the middle of the story is about: {middle} -> Rosey: Sure, here's a story about: {stripped_prompt}, where the middle of the story is about: {middle}:\n{story}"""

    strippedPrompt = extract_prompt_parts(prompt, PROMPT_PATTERNS)
    if strippedPrompt is not None:
        strippedPrompt = clear_prompt(strippedPrompt)
        strippedConstraint = extract_prompt_parts(prompt, CONST_PATTERNS)
        pbar = tqdm(ascii=True, desc="stories")
        pbar.reset(total=len(df[df["prompt"] == prompt]))
        for row in df[df["prompt"] == prompt].itertuples():
            try:
                story = (
                    row.story.replace("<newline>", "\n")
                    .replace("< newline >", "\n")
                    .replace("<new line>", "\n")
                    .strip()
                )
                dialogBeg = dialogBase.format(stripped_prompt=strippedPrompt)
                dialog = dialogBeg + dialog1.format(story=story, stripped_prompt=strippedPrompt)
                dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                if strippedConstraint is not None:
                    dialog = dialogBeg + dialog2.format(
                        stripped_prompt=strippedPrompt, stripped_constraint=strippedConstraint, story=story
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                beginning, middles, ending = extract_story_parts(story)
                if beginning is not None:
                    dialog = dialogBeg + dialog3.format(
                        stripped_prompt=strippedPrompt, story=story, beggining=beginning
                    )
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                    dialog = dialogBeg + dialog4.format(stripped_prompt=strippedPrompt, story=story, ending=ending)
                    dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                    middlesSumarizered = summarizer(middles, **params)
                    for middle, sumarizedMiddle in zip(middles, middlesSumarizered):
                        # dialogs.append(dialogBeg + dialog5.format(stripped_prompt=strippedPrompt, story=story, middle=middle))
                        dialog = dialogBeg + dialog5.format(
                            stripped_prompt=strippedPrompt, story=story, middle=sumarizedMiddle[0]["summary_text"]
                        )
                        dialogs.append(get_sample_dict(row.split, row.splitLineIndex, dialog))
                pbar.update()
            except Exception as e:
                print(f"{row.split}/{row.splitLineIndex}")
                raise e
            pbar.refresh()
    return dialogs

Generate

The loop below saves the accumulated samples to a Parquet file after every batch of prompts, to avoid losing work.

python
## filter dataset to take only prompts with frequency greater than 20 stories.
dialogs = []
i = 0
start = 0
step = 10
for index in range(start, len(topPrompts20Reps), step):
    pbar = tqdm(ascii=True, desc="prompt")
    pbar.reset(total=len(topPrompts20Reps[index : index + step]))
    for prompt in topPrompts20Reps[index : index + step]:
        tmpDialogs = generate_instruction_diologs(prompt, ds)
        if tmpDialogs is not None:
            dialogs += tmpDialogs
        pbar.update()
    if len(dialogs) > 0:
        pd.DataFrame(dialogs).to_parquet("writing-prompts-aug.parquet")
    pbar.refresh()
python
df = pd.read_parquet("writing-prompts-aug.parquet")
python
for split in list(set(df.split)):
    df_aux = df[df["split"] == split].iloc[:, 1:]
    df_aux.reset_index(inplace=True)
    df_aux.iloc[:, 1:].to_parquet(f"{split}.parquet")