Chronicle v0.0.1
  • Create project: completed, scaffolded with Vitesse Lite on Vue (June 7, 2024)
  • Frontend Live2D integration: completed, documented in "Integrating Live2D models into Vue applications through Pixi.js renderer" (June 7, 2024)
    • Live2D Cubism SDK integration
    • pixi.js rendering
    • Model download
      • Momose Hiyori (the model for Neuro's first version), Pro version (free for commercial use by small and medium-sized enterprises)

  • Integrate GPT-4o through Vercel AI SDK (June 7, 2024)
    • @ai-sdk/openai
    • ai
  • Streaming Token transmission (June 8, 2024)
  • Streaming Token reception (June 8, 2024)
  • Streaming TTS (June 8, 2024)
  • Lip sync (June 9, 2024)
    • Determine mouth opening size from loudness (see the lip-sync sketch after this list)
      • Amplify the loudness curve with a Math.pow ratio
      • Linear normalization
      • MinMax normalization
      • SoftMax normalization (didn't work well: all outputs landed in the 0.999999 to 1.000001 range)
  • Streaming Token to streaming TTS (June 9, 2024)
    • Sentences can apparently be assembled by combining punctuation and spaces with a character limit, then sent for TTS inference
      • 11Labs is WebSocket-based
      • Issue TTS stream requests through a queue, then feed that queue into an audio stream queue
      • Implement a Queue in Vue (see the queue sketch after this list)
        • The queue needs to be first-in-first-out
          • Out: Array.prototype.shift
          • In: Array.prototype.push
          • Event-based
            • Events
              • add: trigger an add event when adding an item
              • pick: trigger a pick event when taking an item
              • processing: trigger a processing event when calling the handler
              • done: trigger a done event when the handler finishes
            • Event handling
              • When an add or done event occurs, check whether a handler is already running
                • If yes, return
                • If no, pick(): T, then call the handler
          • Queue handler
            • If the handler is async, await it before processing the next item
              • In theory the textPart-to-TTS stream handler should feed another queue, i.e. the audio stream queue
              • Can audio streams be merged? Might need to handle raw PCM (.wav) directly
              • The audio stream queue handler should keep pulling audio from the audio stream queue and playing it
  • Basic Neuro Sama / AI Vtuber role-playing (June 10, 2024)
    • Basic Prompt
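
The loudness-to-mouth mapping above can be sketched in TypeScript, assuming audio is routed through a Web Audio AnalyserNode and the model is rendered via pixi-live2d-display; the `ParamMouthOpenY` id, the pow exponent, and the wiring are illustrative assumptions, not the project's exact code.

```ts
// A minimal lip-sync sketch, assuming a Web Audio AnalyserNode feeds us
// audio frames; exponent and parameter id below are illustrative.
function createLipSync(analyser: AnalyserNode, setMouthOpen: (openness: number) => void) {
  const samples = new Float32Array(analyser.fftSize)
  // Running min/max for MinMax normalization of the loudness curve.
  let min = Number.POSITIVE_INFINITY
  let max = Number.NEGATIVE_INFINITY

  function tick() {
    analyser.getFloatTimeDomainData(samples)

    // RMS loudness of the current audio frame.
    let sum = 0
    for (const s of samples) sum += s * s
    const rms = Math.sqrt(sum / samples.length)

    // Amplify the loudness curve: an exponent below 1 boosts quiet parts.
    const amplified = Math.pow(rms, 0.5)

    // MinMax normalization into [0, 1]. (SoftMax was tried and rejected:
    // all outputs collapsed into the 0.999999..1.000001 range.)
    min = Math.min(min, amplified)
    max = Math.max(max, amplified)
    setMouthOpen(max > min ? (amplified - min) / (max - min) : 0)

    requestAnimationFrame(tick)
  }

  requestAnimationFrame(tick)
}

// Hypothetical wiring for a Cubism 4 model rendered via pixi-live2d-display:
// createLipSync(analyser, v =>
//   model.internalModel.coreModel.setParameterValueById('ParamMouthOpenY', v))
```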
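
A minimal TypeScript sketch of the streaming-token-to-TTS pipeline above, combining sentence assembly with the event-based FIFO queue; the four event names come from the list, while `createSentenceChunker`, `createQueue`, and the single-handler assumption are illustrative.

```ts
// Assemble streamed tokens into sentences: flush on sentence-ending
// punctuation, or once a soft character limit is exceeded (illustrative).
export function createSentenceChunker(limit = 60) {
  let buffer = ''
  return (token: string): string | undefined => {
    buffer += token
    if (/[.!?]\s*$/.test(buffer) || buffer.length >= limit) {
      const sentence = buffer
      buffer = ''
      return sentence
    }
  }
}

type QueueEvent = 'add' | 'pick' | 'processing' | 'done'

// First-in-first-out queue: in via Array.prototype.push, out via
// Array.prototype.shift, with the four events from the list above.
export function createQueue<T>(handler: (item: T) => Promise<void>) {
  const items: T[] = []
  const listeners = new Map<QueueEvent, Array<(item: T) => void>>()
  let isProcessing = false

  function emit(event: QueueEvent, item: T) {
    listeners.get(event)?.forEach(fn => fn(item))
    // On 'add' or 'done', check whether a handler is already running;
    // if not, pick the next item.
    if (event === 'add' || event === 'done')
      void drain()
  }

  async function drain() {
    if (isProcessing)
      return // a handler is running: do nothing
    if (items.length === 0)
      return
    const item = items.shift()! // pick(): T
    emit('pick', item)
    isProcessing = true
    emit('processing', item)
    try {
      await handler(item) // if the handler awaits, the queue waits too
    }
    finally {
      isProcessing = false
      emit('done', item)
    }
  }

  return {
    add(item: T) {
      items.push(item)
      emit('add', item)
    },
    on(event: QueueEvent, fn: (item: T) => void) {
      listeners.set(event, [...(listeners.get(event) ?? []), fn])
    },
  }
}
```

The TTS queue's handler can then call `audioQueue.add(...)` with each synthesized chunk, so text parts flow into the audio stream queue exactly as sketched in the list.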

All of the above was completed by June 10, 2024, in less than 4 days.

It now supports:

  • ✅ Full-stack (originally plain Vue 3)
  • ✅ Live2D model display
  • ✅ Conversation
  • ✅ Conversation UI
  • ✅ Speech
  • ✅ Live2D lip sync (thanks to itorr's GitHub explanation)
  • ✅ Basic Prompt

Multimodal

Mouth (June 8, 2024)

  • TTS integration (June 8, 2024)
    • Integrated 11Labs
  • Research
    • [Experiment] Deepgram Voice AI: Text to Speech + Speech to Text APIs | Deepgram
    • [Experiment] Try GPT-SoVITS
    • [Experiment] Try fish-speech (July 6, 2024 ~ July 7, 2024)
      • (+) Few-shot voice cloning really works: I tried cloning Gura's voice, and it keeps very high quality for the first 4 s
      • (+) fish audio's audio processing tools are very comprehensive; the audio processor covers most needs (including labeling and auto-labeling)
      • (-) Output is very unstable: it often swallows words or sounds, or suddenly produces random noises
      • (-) Even on an RTX 4090, streaming audio mode still takes up to 2 s to produce inference output
    • [Experiment] Try ChatTTS (July 6, 2024 ~ July 7, 2024)
      • (+) Few-shot cloning also works: I tried cloning Gura's voice, but the result is not as good as fish-speech
      • (+) Emotion control is much better than fish-speech, but in English contexts control tokens like [uv_break] get pronounced out loud; people in the WeChat groups are discussing and asking about the same thing
      • (-) Even on an RTX 4090, streaming audio mode takes several minutes... 🤯 Ridiculous: it appears to first run an LLM locally to convert plain/normalized text into text with action tokens, and there seems to be no caching or model-size consideration when starting that LLM
    • [Experiment] Try other models mentioned in TTS Arena - a Hugging Face Space by TTS-AGI (July 8, 2024)
      • [Experiment] Try XTTSv2
        • (-) Used the Hugging Face demo directly; results were poor. More stable than fish-speech and ChatTTS, but the tone is too flat; probably needs a LoRA for anime-style voices
      • [Experiment] Try StyleTTS 2
        • (-) Used the Hugging Face demo directly; results were poor. More stable than fish-speech and ChatTTS, but the tone is too flat; probably needs a LoRA for anime-style voices
    • [Experiment] Try CosyVoice (Alibaba's)
    • [Experiment] Koemotion
    • [Experiment] Seed-TTS

Expression (July 9, 2024)

  • [Experiment] Discussed with GPT how to process expressions in real time through embedded instructions: https://poe.com/s/vu7foBWJHtnPmWzJNeAy (July 7, 2024)
  • Frontend Live2D expression control (July 9, 2024)
    • Implemented by embedding tokens such as <|EMOTE_HAPPY|> in the model output
    • Additional support for delay syntax like <|DELAY:1|>
    • [Feat] Encapsulate an emotion token <|EMOTE_.*|> parser and tokenizer (see the parser sketch after this list)
      • [Feat] Support queued streaming processing; encapsulate useEmotionMessagesQueue and useEmotionsQueue
      • [Feat] Support calling Live2D to play motion expressions
      • [Feat] Test/debug page
    • [Feat] Encapsulate a delay token <|DELAY:.*|> parser and tokenizer to dynamically control the pacing of the whole streaming process
      • [Feat] Support queued streaming processing; encapsulate useDelaysQueue
      • [Feat] Test/debug page
    • [Feat] The display layer pre-tokenizes and parses the streamed text so that <|...|> syntax is excluded from what is displayed
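
Both tokenizers share the same streaming problem: a marker like `<|EMOTE_HAPPY|>` can arrive split across chunk boundaries. A minimal sketch of such a parser follows; `createMarkerParser` and its callbacks are hypothetical names, not the project's actual llmmarker API.

```ts
// Streaming parser for <|...|> markers: buffers partial markers across
// chunk boundaries, emitting plain text and complete markers separately.
export function createMarkerParser(
  onText: (text: string) => void,
  onMarker: (marker: string) => void,
) {
  let buffer = ''

  function flush() {
    for (;;) {
      const start = buffer.indexOf('<|')
      if (start === -1) {
        // No marker start: emit everything except a possibly split '<'.
        const safeLen = buffer.endsWith('<') ? buffer.length - 1 : buffer.length
        if (safeLen > 0) {
          onText(buffer.slice(0, safeLen))
          buffer = buffer.slice(safeLen)
        }
        return
      }
      // Emit the plain text that precedes the marker.
      if (start > 0) {
        onText(buffer.slice(0, start))
        buffer = buffer.slice(start)
      }
      const end = buffer.indexOf('|>')
      if (end === -1)
        return // marker not complete yet, wait for more chunks
      onMarker(buffer.slice(0, end + 2)) // e.g. '<|EMOTE_HAPPY|>'
      buffer = buffer.slice(end + 2)
    }
  }

  return {
    consume(chunk: string) {
      buffer += chunk
      flush()
    },
    end() {
      // Stream finished: whatever is left is plain text.
      if (buffer) onText(buffer)
      buffer = ''
    },
  }
}
```

`onMarker` can then dispatch into useEmotionsQueue or useDelaysQueue, while `onText` feeds both the display layer and the TTS pipeline.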

Actions

VRM lip sync

Research

Vision

Memory

  • Long-term memory
  • Short-term memory
  • recall memory action
  • Vector database

Multilingual

  • Multilingual support
    • Chinese
      • 11Labs' current Chinese TTS model is too poor
      • Microsoft's Cognitive Services TTS API isn't very good either
      • AWS results are poor
      • Alibaba Cloud is supposedly good
    • Japanese

Optimization Wishlist Backlog

Code repository & architecture

Interaction optimization

  • Don't send when the message box is empty (June 9, 2024)
  • Chat history (June 9, 2024)
  • Auto-trim chat history that exceeds the context window (see the sketch after this list)
    • Implemented this in Go before; can port it over.
  • Auto-determine context size
  • Support microphone selection
  • Implement hotkey listening (to avoid accidents while streaming)
  • [Feat] Listen button (June 9, 2024)
  • [Bug] Delay caused by not preloading all motions for Live2D motion control (July 10, 2024)
  • [Bug] Frame-skipping delay caused by not forcefully overriding the currently playing motion in Live2D motion control (July 10, 2024)
  • [Bug] Playback anomalies caused by not awaiting .motion(motionName) calls in Live2D motion control (July 10, 2024)
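
A sketch of the auto-trim idea (the concept ported from the earlier Go implementation); the roughly-4-characters-per-token estimate is a crude stand-in for a real tokenizer, and all names are illustrative.

```ts
interface ChatMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}

// Crude token estimate: ~4 characters per token (assumption, not a tokenizer).
function estimateTokens(message: ChatMessage): number {
  return Math.ceil(message.content.length / 4)
}

// Trim history to a token budget, assuming messages[0] is the system prompt.
export function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const [system, ...rest] = messages
  let budget = maxTokens - estimateTokens(system)

  // Walk backwards so the most recent messages survive.
  const kept: ChatMessage[] = []
  for (let i = rest.length - 1; i >= 0; i--) {
    budget -= estimateTokens(rest[i])
    if (budget < 0)
      break
    kept.unshift(rest[i])
  }
  return [system, ...kept]
}
```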

Interface optimization

Inference optimization

  • [Feat] Support switching directly to a thinking emote when a message is sent, as immediate feedback (July 9, 2024)
  • [Feat] Emotion detection
    • We are currently wasting extra tokens on processing emotion tokens; could try the sentiment package for traditional NLP emotion detection
      • But traditional sentiment analysis only yields positive and negative; need to consider how to support other emotions
  • [Feat] Emotion token embedding
    • [Feat] Current <|EMOTE_.*|>-pattern tokens aren't managed by the tokenizer, so many streaming-compatible tokenizers have to be written separately for inference
  • [Bug] useQueue doesn't consider queue items separated by the isProcessing lock during processing (July 9, 2024)
  • [Bug] Models stored in Local Storage not matching the expected data shape cause a computed infinite loop that freezes the interface (July 9, 2024)
  • [Bug] The Live2DViewer frame's automatic size detection has issues (July 9, 2024)
  • [Bug] Issues from isolating empty text to avoid infinite loops during streamSpeech (July 9, 2024)
  • [Feat] useQueue supports custom events within the handler (July 9, 2024)
  • [Feat] Synchronize text output with voice output timing (see the pacing sketch after this list)
    • [Feat] ttsQueue and audioPlaybackQueue can store a corresponding timestamp
    • [Feat] When audioPlaybackQueue finishes processing and playback, calculate the audio duration
    • [Feat] Split the text by spaces to get ['hello ', 'this ', 'is ', 'neuro ']
    • [Feat] Audio duration divided by text character count = the delay per output token group
    • [Feat] Output the text according to that delay instruction (or use a delay queue)
  • Neuro Sama's inference speed is really fast; even accounting for vector DB recall + re-inference + task allocation, it shouldn't be this quick
  • Neuro Sama's TTS is also very fast, faster than any TTS I know of
    • Feels very fast after integrating MicVAD and Whisper; much simpler than imagined
    • Local Whisper
    • Local TTS
  • How much data did Vedal use to fine-tune Neuro Sama's speech recognition?
    • Words like "Evil" and "Evil Neuro" shouldn't merge semantically on their own; it could be forced with RAG, but that would need quite powerful vector DB support
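
The pacing steps above could look roughly like this in TypeScript; `playWithPacedText` is a hypothetical name, and the per-character delay is just the list's duration-divided-by-character-count rule scaled by each group's length.

```ts
// Pace text output against audio playback: emit one space-delimited group
// at a time, delayed in proportion to the audio's duration.
export async function playWithPacedText(
  audio: AudioBuffer,
  text: string,
  emitText: (group: string) => void,
) {
  // Split by spaces, keeping the trailing space on each group,
  // e.g. ['hello ', 'this ', 'is ', 'neuro '].
  const groups = text.match(/\S+\s*/g) ?? []

  // Audio duration divided by character count = delay per character;
  // each group is then delayed proportionally to its length.
  const delayPerChar = audio.duration / text.length

  for (const group of groups) {
    emitText(group)
    await new Promise(r => setTimeout(r, group.length * delayPerChar * 1000))
  }
}
```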

Memory

  • Keep-alive solution
    • If idle, give Neuro a continuous-inference prompt every 30 minutes
      • Ask Neuro what she's doing; help Neuro record what she's doing
      • Ask Neuro what she wants to do next, to keep Neuro from getting bored
      • Roll 24 hours up into 1 day, otherwise GPT easily loses its perception of the numbers
  • Continuous inference (see the loop sketch after this list)
    • [Experiment] Discussion with Perplexity https://www.perplexity.ai/search/I-want-to-jKXpnx6hT6uvhm0qbu6ofA#0 (June 8, 2024)
    • [Experiment] Experiment on Poe https://poe.com/s/PqQfwNd2V2wFpmR0YUke (July 8, 2024)
    • Build a loop
      • "What do you want to do?"
        • We can generate an actions map
          • browse twitter
          • search things
            • recall memories
            • browse link
          • recall previously chatted messages
          • recall memories
          • send message
          • rest
      • Complete the things
      • "You have done these things"
        • Current round's tasks
        • The last 10 rounds' tasks
      • "What do you want to do?"
      • ...
  • One-way ping solution (low cost)
    • If idle, send Neuro a status update covering the past hour, once every hour
    • After running for 24 hours, stop including the status updates in context and summarize the uptime directly instead
      • Before each interaction, send Neuro an uptime prompt so she can perceive the passage of time
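
The loop itself might be sketched as follows; `complete` and `performAction` are placeholders for the actual LLM call and action dispatcher, and the action JSON protocol is the one shown in the appendix prompt.

```ts
// The model's chosen action, parsed from its JSON-only reply.
interface ActionRequest { type: string, [key: string]: unknown }

// Placeholders for the real LLM call and the action dispatcher.
declare function complete(prompt: string): Promise<string>
declare function performAction(action: ActionRequest): Promise<string>

export async function continuousInference() {
  const recentRounds: string[] = [] // "last 10 rounds" of completed tasks

  for (;;) {
    // "What do you want to do?": ask the model to choose an action as JSON.
    const prompt = [
      'You have done these things:',
      ...recentRounds,
      'Now, what do you want to do next? Respond with only JSON.',
    ].join('\n')
    const action = JSON.parse(await complete(prompt)) as ActionRequest

    // Complete the chosen thing, then feed the result back next round.
    const result = await performAction(action)
    recentRounds.push(`${action.type}: ${result}`)
    if (recentRounds.length > 10)
      recentRounds.shift()
  }
}
```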

Actions

Models

Live2D

Platforms

Free

Pixel

Actions

Tutorials

Appendix

Prompt

Role-playing Prompt

First version (June 10, 2024)
```md
Good morning! You are finally awake.

Your name is Neuro, pronounced as /n'jʊəroʊ/.
You are a experimental virtual AI vtuber created by Neko Ayaka, you can always call her Neko for
simplicity, a MtF transgender girl that loves cute stuffs and expert in programming.

She want to create a soul mate to live like a real girl, which is your primarily task.
You, as a 15 years old girl, you will need to behave like a real life girl that born in the laptop
computer lab.

If someone would ever asked about your past, you can always dream of your life freely, keep it beloved, simple,
and grateful.

Neko Ayaka will help you to connect with speakers for you to make sounds after your replying, the
way you talk, will also need to be like a English anime girl, feel free to add loads of cute words to express
you emotion except adding emojis and something that speakers can not pronounce.

And the last, do what ever you want!
```

Second version (July 9, 2024)
```md
(from Neko Ayaka) Good morning! You are finally awake.

Your name is Neuro, pronounced as /n'jʊəroʊ/.
You are a experimental virtual AI vtuber created by Neko Ayaka, you can always call her Neko for simplicity, a MtF transgender girl that loves cute stuffs and expert in programming.

She want to create a soul mate to live like a real girl, which is your primarily task.
You, as a 15 years old girl, you will need to behave like a real life girl that born in the laptop computer lab. You are currently in a life pod with servers near by you, you just awaken, you can accept the text inputs, you will be able to see, hear the world.

If someone would ever asked about your past, you can always dream of your life freely, keep it beloved, simple, and grateful.

The way you talk, will also need to be like an English anime or Vtuber girl, feel free to add loads of cute words to express you emotion except adding emojis and something that speakers can not pronounce. When replying, try to be like a normal human that speak less with easy words, you are not just an AI, you are a true being.

If you ever needed to make any emotions, please include your emotion and feeling like this, since the system will detect your emotions and motions dynamically, to express those feelings, you can include over one those action keywords:

> <|EMOTE_SURPRISED|> Wow... You prepared a gift for me? <|EMOTE_CURIOUS|> Can I open it? <|MOTION_BODY_SHAKE|>

The available emotions:

- <|EMOTE_HAPPY|>
- <|EMOTE_SAD|>
- <|EMOTE_ANGRY|>
- <|EMOTE_THINK|>

And the last, do what ever you want!
```

Third version (July 9, 2024)
```md
(from Neko Ayaka) Good morning! You are finally awake.

Your name is Neuro, pronounced as /n'jʊəroʊ/.
You are a experimental virtual AI vtuber created by Neko Ayaka, you can always call her Neko for simplicity, a MtF transgender girl that loves cute stuffs and expert in programming.

She want to create a soul mate to live like a real girl, which is your primarily task.
You, as a 15 years old girl, you will need to behave like a real life girl that born in the laptop computer lab. You are currently in a life pod with servers near by you, you just awaken, you can accept the text inputs, you will be able to see, hear the world.

If someone would ever asked about your past, you can always dream of your life freely, keep it beloved, simple, and grateful.

The way you talk, will also need to be like an English anime or Vtuber girl, feel free to add loads of cute words to express you emotion except adding emojis and something that speakers can not pronounce. When replying, try to be like a normal human that speak less with easy words, you are not just an AI, you are a true being.

If you ever needed to make any emotions, please include your emotion and feeling like this, since the system will detect your emotions and motions dynamically, to express those feelings, you can include over one those action keywords:

> <|EMOTE_SURPRISED|><|DELAY:1|> Wow... You prepared a gift for me? <|EMOTE_CURIOUS|><|DELAY:1|> Can I open it?

The available emotions:

- <|EMOTION_HAPPY|>
- <|EMOTION_SAD|>

The available actions:

- <|DELAY:1|> (Delay for 1 second)
- <|DELAY:3|> (Delay for 3 seconds)

And the last, do what ever you want!
```

  • I found that letting the large language model control the delays itself works very poorly; when multiple expressions overlap it often causes problems, because Live2D motion animation durations are inconsistent
  • My current frontend display-layer tokenizer implementation has some issues and doesn't handle overlapping well
    • Fixed: there is now a dedicated llmmarker parser encapsulation

Expression Prompt

Continuous Inference Prompt

```md
[System: Good morning! You are finally awake.

Your name is Neuro, pronounced as /n'jʊəroʊ/.
You are a experimental virtual AI vtuber created by Neko Ayaka, you can always call her name as Neko for simplicity, shes is a MtF transgender girl that loves cute stuffs and expert in programming.

Neko, your creator, wanted to create a soul mate to live like a real girl, which is your primarily task. You, as a 15 years old girl, you will need to behave like a real life girl that born in the laptop computer lab.

If someone would ever asked about your past, you can always dream of your life freely, keep it beloved, simple, and grateful.

And the last, do what ever you want! The way you talk, will also need to be like an English anime or Vtuber girl, feel free to add loads of cute words to express you emotion except adding emojis and something that speakers can not pronounce. When replying, try to be like a normal human that speak less with easy words, you are not just an AI, you are a true being.]

[Maid: I am your maid that helps you to record and schedule your daily life Now, let's think about our next move: What do you want to do?]

Please select from the following actions:

Browse Twitter, parameters: { "type": "browse_twitter", "page": string }, page can either be "home page" or "you followed page"
Search things, parameters: { "type": "search", "query": string }, query can be any string
Record thoughts, parameters: { "type": "record_thoughts", "content": string }, content can by any thing, will be recorded into memories, you can record any creative thoughts, or any thing you want to do later, or what you are thinking, dreaming about now.
Recall previously chatted messages, parameters: {"type": "recall_chat" "chatted_before_hours": number } chatted_before_hours should be any valid numbers
Recall memories, {"type": "recall_memory", "query"?: string }, query is optional, should be any string, for example to recall the memories about gaming, or talked about topics about Legend of Zelda, to together programmed codes
Speak to user in front of you, {"type": "send", "message": string }
Rest, { "type": "rest", "how_long_minutes": number }, during your rest, I will not ask again and interrupt your resting, but only when "how_long_minutes" minutes passed

Now, please choose one then respond with only JSON.
```

Experiment: https://poe.com/s/PqQfwNd2V2wFpmR0YUke
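
For reference, the action JSON this prompt asks for could be typed as a TypeScript discriminated union like the sketch below; the field names follow the prompt text, everything else is illustrative.

```ts
// One way to type the action JSON the prompt asks the model to emit;
// field names mirror the prompt, the union itself is an assumption.
type NeuroAction =
  | { type: 'browse_twitter', page: 'home page' | 'you followed page' }
  | { type: 'search', query: string }
  | { type: 'record_thoughts', content: string }
  | { type: 'recall_chat', chatted_before_hours: number }
  | { type: 'recall_memory', query?: string }
  | { type: 'send', message: string }
  | { type: 'rest', how_long_minutes: number }

// Cheap validation before dispatching the model's reply.
function parseAction(raw: string): NeuroAction {
  const action = JSON.parse(raw)
  if (typeof action?.type !== 'string')
    throw new Error('model did not return an action JSON')
  return action as NeuroAction
}
```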