docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Refactor internal/ai/ into a stt/ (deterministic transcription) and audiollm/ (multimodal audio LLM) package pair, port behavior from the current openai.go and gemini.go, refactor the Transcribe API handler to dispatch by provider type, then update proto comments and frontend i18n. The end state preserves current end-to-end behavior, with one improvement: the error returned for Gemini failures now reports the model's FinishReason.
Architecture: Two new packages under internal/ai/. Each has a thin interface in the parent (stt.Transcriber, audiollm.Model), a factory that switches on ai.ProviderType, and a sub-package per provider (stt/openai/, audiollm/gemini/). Application-layer dispatch in the gRPC handler decides which interface to invoke. No proto field changes — only two comment updates.
Tech Stack: Go 1.x (existing Memos toolchain), github.com/openai/openai-go/v3, google.golang.org/genai, github.com/pkg/errors, github.com/stretchr/testify. Proto: buf v2. Frontend: TypeScript + bufbuild/es.
Spec: docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md — read it before starting; this plan implements §6–§7 of the spec.
Worktree: Run from a dedicated worktree (e.g., git worktree add ../memos-stt-split). Each stage is a separate commit; the branch should be PR-ready when all 7 stages are done.
Created (new files):
- internal/ai/stt/stt.go
- internal/ai/stt/options.go
- internal/ai/stt/factory.go
- internal/ai/stt/openai/openai.go
- internal/ai/stt/openai/openai_test.go
- internal/ai/audiollm/audiollm.go
- internal/ai/audiollm/options.go
- internal/ai/audiollm/factory.go
- internal/ai/audiollm/gemini/gemini.go
- internal/ai/audiollm/gemini/gemini_test.go

Modified:
- internal/ai/errors.go (add 2 sentinel errors)
- proto/store/instance_setting.proto (2 field-comment blocks)
- proto/gen/store/instance_setting.pb.go (regenerated)
- web/src/types/proto/store/instance_setting_pb.ts (regenerated)
- server/router/api/v1/ai_service.go (refactor Transcribe handler)
- web/src/locales/en.json (3 strings)

Deleted:
- internal/ai/transcription.go
- internal/ai/client.go
- internal/ai/openai.go
- internal/ai/openai_test.go
- internal/ai/gemini.go
- internal/ai/gemini_test.go

Untouched:
- internal/ai/ai.go, resolver.go, models.go
- internal/ai/audio/webm.go, audio/webm_test.go
- web/src/components/Settings/AISection.tsx (no structural change; only the i18n strings it consumes change)

Files:
- internal/ai/errors.go
- internal/ai/stt/stt.go
- internal/ai/stt/options.go
- internal/ai/stt/factory.go
- internal/ai/audiollm/audiollm.go
- internal/ai/audiollm/options.go
- internal/ai/audiollm/factory.go

This stage adds types and stubs only. There is no real provider implementation yet: both factories return errors for every provider type. We verify the project still builds.
Open internal/ai/errors.go and read the existing file. Add the following sentinel errors, keeping everything that is already there. Appending them at the end is fine; place them elsewhere in the file if you find a clearer location:
// ErrSTTNotSupported indicates that the provider does not have a dedicated
// speech-to-text endpoint. Use the audiollm package for multimodal audio
// understanding when this is returned.
var ErrSTTNotSupported = errors.New("provider does not support speech-to-text capability")
// ErrAudioLLMNotSupported indicates that the provider does not have a
// multimodal-audio LLM available in this codebase.
var ErrAudioLLMNotSupported = errors.New("provider does not support multimodal audio capability")
If errors.go does not yet import errors, add:
import "errors"
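to its imports block. Use the standard library errors package here, not github.com/pkg/errors. A sentinel created with the stdlib errors.New is a plain package-level value without a captured stack trace, and callers match it with errors.Is even after a factory has wrapped it.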
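As a minimal sketch of how a caller can detect such a sentinel after it has been wrapped, the example below uses only the standard library. Here fmt.Errorf with %w stands in for errors.Wrap from github.com/pkg/errors (assumed to preserve the chain in the same way), and newTranscriber, isSTTUnsupported and the "GEMINI" type string are hypothetical names for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSTTNotSupported mirrors the sentinel added to internal/ai/errors.go.
var ErrSTTNotSupported = errors.New("provider does not support speech-to-text capability")

// newTranscriber is a hypothetical stand-in for the stt factory: it wraps the
// sentinel with provider-specific context instead of returning it bare.
func newTranscriber(providerType string) error {
	if providerType == "GEMINI" {
		return fmt.Errorf("gemini has no dedicated STT endpoint: %w", ErrSTTNotSupported)
	}
	return nil
}

// isSTTUnsupported reports whether err is, or wraps, the sentinel.
func isSTTUnsupported(err error) bool {
	return errors.Is(err, ErrSTTNotSupported)
}

func main() {
	err := newTranscriber("GEMINI")
	// errors.Is walks the wrap chain, so the added context does not hide the sentinel.
	fmt.Println(isSTTUnsupported(err))
	// A direct comparison would miss it, because err is the wrapper, not the sentinel.
	fmt.Println(err == ErrSTTNotSupported)
}
```

The direct == comparison is shown only to illustrate why callers should prefer errors.Is over equality once factories start adding context.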
internal/ai/stt/stt.go
// Package stt defines the speech-to-text capability for AI providers.
// Implementations call dedicated STT endpoints (e.g. OpenAI /audio/transcriptions)
// and return deterministic transcription output. For multimodal LLMs that
// happen to accept audio input, see internal/ai/audiollm.
package stt
import (
"context"
"io"
)
// Transcriber transcribes audio to text using a provider's dedicated STT endpoint.
type Transcriber interface {
Transcribe(ctx context.Context, req Request) (*Response, error)
}
// Request is the input to a transcription call.
type Request struct {
Audio io.Reader
Size int64
Filename string
ContentType string // IANA media type, e.g. "audio/wav"
Model string // provider-specific model id (e.g. "whisper-1", "gpt-4o-transcribe")
Prompt string // soft spelling/vocabulary hint (Whisper "prompt" parameter)
Language string // ISO 639-1, optional
}
// Response is the output of a transcription call.
type Response struct {
Text string
Language string // empty if provider did not return it
Segments []Segment // empty unless provider returned timestamps
}
// Segment is a timestamped portion of the transcript.
type Segment struct {
Text string
Start float64
End float64
Speaker string // empty unless using a diarization-capable model
}
internal/ai/stt/options.go
package stt
import (
"net/http"
"time"
)
const defaultHTTPTimeout = 2 * time.Minute
// Options is the resolved option set passed to provider implementations.
type Options struct {
HTTPClient *http.Client
}
// TranscriberOption customizes a Transcriber.
type TranscriberOption func(*Options)
// WithHTTPClient overrides the HTTP client used by the transcriber.
func WithHTTPClient(client *http.Client) TranscriberOption {
return func(o *Options) {
if client != nil {
o.HTTPClient = client
}
}
}
// ApplyOptions resolves a TranscriberOption slice into Options with defaults.
func ApplyOptions(opts []TranscriberOption) Options {
resolved := Options{HTTPClient: &http.Client{Timeout: defaultHTTPTimeout}}
for _, apply := range opts {
apply(&resolved)
}
return resolved
}
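The functional-options helpers above can be exercised in isolation. The following sketch copies the option types into a standalone package main (an assumption made purely so that it runs on its own) and shows the three cases that matter: defaults, an override, and a nil client being ignored.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const defaultHTTPTimeout = 2 * time.Minute

// Options and TranscriberOption are copied from stt/options.go.
type Options struct {
	HTTPClient *http.Client
}

type TranscriberOption func(*Options)

// WithHTTPClient overrides the HTTP client; a nil client is ignored.
func WithHTTPClient(client *http.Client) TranscriberOption {
	return func(o *Options) {
		if client != nil {
			o.HTTPClient = client
		}
	}
}

// ApplyOptions starts from the defaults and applies each option in order.
func ApplyOptions(opts []TranscriberOption) Options {
	resolved := Options{HTTPClient: &http.Client{Timeout: defaultHTTPTimeout}}
	for _, apply := range opts {
		apply(&resolved)
	}
	return resolved
}

func main() {
	// No options: the default client with the two-minute timeout is used.
	fmt.Println(ApplyOptions(nil).HTTPClient.Timeout)

	// An explicit client replaces the default one.
	custom := &http.Client{Timeout: 10 * time.Second}
	withCustom := ApplyOptions([]TranscriberOption{WithHTTPClient(custom)})
	fmt.Println(withCustom.HTTPClient == custom)

	// A nil client leaves the default in place rather than storing nil.
	withNil := ApplyOptions([]TranscriberOption{WithHTTPClient(nil)})
	fmt.Println(withNil.HTTPClient != nil)
}
```

Guarding against nil inside the option means provider constructors can use options.HTTPClient without a nil check of their own.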
internal/ai/stt/factory.go
package stt
import (
"github.com/pkg/errors"
"github.com/usememos/memos/internal/ai"
)
// NewTranscriber returns a Transcriber for the given provider, or an error if
// the provider type does not have a dedicated STT endpoint.
func NewTranscriber(cfg ai.ProviderConfig, opts ...TranscriberOption) (Transcriber, error) {
switch cfg.Type {
case ai.ProviderOpenAI:
// Implemented in stage B; keep the stub return so the package compiles.
return nil, errors.New("stt/openai not implemented yet")
case ai.ProviderGemini:
return nil, errors.Wrap(ai.ErrSTTNotSupported,
"Gemini does not provide a dedicated STT endpoint; use audiollm.NewModel instead")
default:
return nil, errors.Wrapf(ai.ErrCapabilityUnsupported, "provider type %q", cfg.Type)
}
}
internal/ai/audiollm/audiollm.go
// Package audiollm defines the multimodal-audio capability for AI providers.
// Implementations call chat-completions or generate-content style APIs that
// accept audio as input. For deterministic transcription, prefer internal/ai/stt
// where a dedicated STT endpoint exists.
package audiollm
import (
"context"
"io"
)
// Model invokes a multimodal LLM with audio input.
type Model interface {
GenerateFromAudio(ctx context.Context, req Request) (*Response, error)
}
// Request is the input to a multimodal-audio call.
type Request struct {
Audio io.Reader
Size int64
ContentType string
Model string
Instructions string // literal instruction the model is expected to follow
Temperature *float32 // optional; nil leaves the provider default in place
}
// Response is the output of a multimodal-audio call.
type Response struct {
Text string
FinishReason FinishReason
}
// FinishReason describes why the model stopped generating.
type FinishReason string
const (
FinishStop FinishReason = "stop" // model finished normally
FinishLength FinishReason = "length" // truncated by max-tokens
FinishSafety FinishReason = "safety" // safety filter blocked output
FinishOther FinishReason = "other" // anything else, including unknown
)
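To illustrate how a provider adapter might collapse its own finish reasons into these four constants, here is a small standalone sketch. The function fromGemini is hypothetical and takes raw strings so that it runs without the SDK; "STOP", "MAX_TOKENS" and "SAFETY" are the values the Gemini API reports, and everything else is assumed to fall through to FinishOther.

```go
package main

import "fmt"

// FinishReason and its constants are copied from audiollm.go.
type FinishReason string

const (
	FinishStop   FinishReason = "stop"
	FinishLength FinishReason = "length"
	FinishSafety FinishReason = "safety"
	FinishOther  FinishReason = "other"
)

// fromGemini is a hypothetical helper that maps a raw Gemini finish reason
// string onto the provider-neutral FinishReason.
func fromGemini(raw string) FinishReason {
	switch raw {
	case "STOP":
		return FinishStop
	case "MAX_TOKENS":
		return FinishLength
	case "SAFETY":
		return FinishSafety
	default:
		// Unknown, empty, or provider-specific reasons are not distinguished.
		return FinishOther
	}
}

func main() {
	for _, raw := range []string{"STOP", "MAX_TOKENS", "SAFETY", "RECITATION", ""} {
		fmt.Printf("%q -> %s\n", raw, fromGemini(raw))
	}
}
```

Keeping the neutral set this small means a caller only has to decide whether FinishStop is required, not understand each provider's vocabulary.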
internal/ai/audiollm/options.go
package audiollm
import (
"net/http"
"time"
)
const defaultHTTPTimeout = 2 * time.Minute
// Options is the resolved option set passed to provider implementations.
type Options struct {
HTTPClient *http.Client
}
// ModelOption customizes a Model.
type ModelOption func(*Options)
// WithHTTPClient overrides the HTTP client used by the model.
func WithHTTPClient(client *http.Client) ModelOption {
return func(o *Options) {
if client != nil {
o.HTTPClient = client
}
}
}
// ApplyOptions resolves a ModelOption slice into Options with defaults.
func ApplyOptions(opts []ModelOption) Options {
resolved := Options{HTTPClient: &http.Client{Timeout: defaultHTTPTimeout}}
for _, apply := range opts {
apply(&resolved)
}
return resolved
}
internal/ai/audiollm/factory.go
package audiollm
import (
"github.com/pkg/errors"
"github.com/usememos/memos/internal/ai"
)
// NewModel returns a Model for the given provider, or an error if the provider
// type does not expose a multimodal-audio capability in this codebase.
func NewModel(cfg ai.ProviderConfig, opts ...ModelOption) (Model, error) {
switch cfg.Type {
case ai.ProviderGemini:
// Implemented in stage C; keep the stub return so the package compiles.
return nil, errors.New("audiollm/gemini not implemented yet")
case ai.ProviderOpenAI:
// gpt-4o-audio-preview support is intentionally out of scope (see spec §2).
return nil, errors.Wrap(ai.ErrAudioLLMNotSupported,
"OpenAI multimodal audio (gpt-4o-audio) is not implemented")
default:
return nil, errors.Wrapf(ai.ErrCapabilityUnsupported, "provider type %q", cfg.Type)
}
}
go build ./...
Expected: exits 0, no output. (internal/ai/transcription.go, client.go, openai.go, gemini.go are still in place from before this plan; the new packages compile alongside them.)
git add internal/ai/errors.go \
internal/ai/stt \
internal/ai/audiollm
git commit -m "$(cat <<'EOF'
feat(ai): scaffold stt and audiollm packages

Adds two new sub-packages under internal/ai/ that will replace the single
Transcriber abstraction:

- stt/: deterministic speech-to-text (dedicated STT endpoints)
- audiollm/: multimodal audio LLMs (chat-completions / generateContent)

This commit only adds the interface types, factories with stub returns,
and two sentinel errors. Provider implementations land in the next two
commits; until then the existing internal/ai/transcription.go path is
still the live code path.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
stt/openai

Files:
- internal/ai/stt/openai/openai.go
- internal/ai/stt/openai/openai_test.go
- internal/ai/stt/factory.go

Port the wire behavior of internal/ai/openai.go::openAITranscriber.Transcribe into the new package. TDD: write the test (adapted from internal/ai/openai_test.go) first.
Create internal/ai/stt/openai/openai_test.go:
package openai_test
import (
"context"
"encoding/json"
"net/http"
"net/http/httptest"
"strings"
"testing"
"time"
"github.com/stretchr/testify/require"
"github.com/usememos/memos/internal/ai"
"github.com/usememos/memos/internal/ai/stt"
sttopenai "github.com/usememos/memos/internal/ai/stt/openai"
)
func TestTranscribe(t *testing.T) {
t.Parallel()
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodPost, r.Method)
require.Equal(t, "/audio/transcriptions", r.URL.Path)
require.Equal(t, "Bearer test-key", r.Header.Get("Authorization"))
require.NoError(t, r.ParseMultipartForm(10<<20))
require.Equal(t, "gpt-4o-transcribe", r.FormValue("model"))
require.Equal(t, "json", r.FormValue("response_format"))
require.Equal(t, "domain words", r.FormValue("prompt"))
require.Equal(t, "en", r.FormValue("language"))
file, header, err := r.FormFile("file")
require.NoError(t, err)
defer file.Close()
require.Equal(t, "voice.wav", header.Filename)
require.Equal(t, "audio/wav", header.Header.Get("Content-Type"))
w.Header().Set("Content-Type", "application/json")
require.NoError(t, json.NewEncoder(w).Encode(map[string]any{
"text": "hello world",
"language": "en",
"duration": 1.5,
}))
}))
defer server.Close()
transcriber, err := sttopenai.New(ai.ProviderConfig{
Type: ai.ProviderOpenAI,
Endpoint: server.URL,
APIKey: "test-key",
}, stt.ApplyOptions(nil))
require.NoError(t, err)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
response, err := transcriber.Transcribe(ctx, stt.Request{
Model: "gpt-4o-transcribe",
Filename: "voice.wav",
ContentType: "audio/wav",
Audio: strings.NewReader("RIFF"),
Prompt: "domain words",
Language: "en",
})
require.NoError(t, err)
require.Equal(t, "hello world", response.Text)
require.Equal(t, "en", response.Language)
// Note: Duration intentionally omitted from stt.Response — not exposed in the new contract.
}
go test ./internal/ai/stt/openai/...
Expected: build failure (for example "undefined: sttopenai.New", or a complaint that the package has no non-test Go files), because the implementation file does not exist yet.
internal/ai/stt/openai/openai.go
Port from internal/ai/openai.go. The body is structurally identical to today's openAITranscriber.Transcribe; only the Request/Response types and the package boundary change.
// Package openai implements stt.Transcriber against the OpenAI
// /audio/transcriptions endpoint (and any compatible third-party endpoint
// such as Groq Whisper, faster-whisper self-hosted, or Azure Whisper).
package openai
import (
"context"
"mime"
"net/url"
"strings"
openaisdk "github.com/openai/openai-go/v3"
openaioption "github.com/openai/openai-go/v3/option"
"github.com/pkg/errors"
"github.com/usememos/memos/internal/ai"
"github.com/usememos/memos/internal/ai/stt"
)
const defaultEndpoint = "https://api.openai.com/v1"
// Transcriber implements stt.Transcriber for OpenAI-compatible STT endpoints.
type Transcriber struct {
client openaisdk.Client
}
// New constructs a Transcriber from a provider config.
func New(cfg ai.ProviderConfig, options stt.Options) (*Transcriber, error) {
endpoint, err := normalizeEndpoint(cfg.Endpoint)
if err != nil {
return nil, err
}
if cfg.APIKey == "" {
return nil, errors.New("OpenAI API key is required")
}
return &Transcriber{
client: openaisdk.NewClient(
openaioption.WithAPIKey(cfg.APIKey),
openaioption.WithBaseURL(endpoint),
openaioption.WithHTTPClient(options.HTTPClient),
),
}, nil
}
// Transcribe sends the audio to /audio/transcriptions.
func (t *Transcriber) Transcribe(ctx context.Context, req stt.Request) (*stt.Response, error) {
if strings.TrimSpace(req.Model) == "" {
return nil, errors.New("model is required")
}
if req.Audio == nil {
return nil, errors.New("audio is required")
}
filename, contentType, err := normalizeAudioMetadata(req)
if err != nil {
return nil, err
}
params := openaisdk.AudioTranscriptionNewParams{
File: openaisdk.File(req.Audio, filename, contentType),
Model: openaisdk.AudioModel(req.Model),
ResponseFormat: openaisdk.AudioResponseFormatJSON,
}
if req.Prompt != "" {
params.Prompt = openaisdk.String(req.Prompt)
}
if req.Language != "" {
params.Language = openaisdk.String(req.Language)
}
resp, err := t.client.Audio.Transcriptions.New(ctx, params)
if err != nil {
return nil, errors.Wrap(err, "failed to send OpenAI transcription request")
}
return &stt.Response{
Text: resp.Text,
Language: resp.Language,
}, nil
}
func normalizeEndpoint(endpoint string) (string, error) {
endpoint = strings.TrimSpace(endpoint)
if endpoint == "" {
endpoint = defaultEndpoint
}
if _, err := url.ParseRequestURI(endpoint); err != nil {
return "", errors.Wrap(err, "invalid OpenAI endpoint")
}
return strings.TrimRight(endpoint, "/"), nil
}
func normalizeAudioMetadata(req stt.Request) (string, string, error) {
filename := strings.TrimSpace(req.Filename)
if filename == "" {
filename = "audio"
}
contentType := strings.TrimSpace(req.ContentType)
if contentType == "" {
contentType = "application/octet-stream"
} else {
mediaType, _, err := mime.ParseMediaType(contentType)
if err != nil {
return "", "", errors.Wrap(err, "invalid audio content type")
}
contentType = mediaType
}
return sanitizeFilename(filename), contentType, nil
}
func sanitizeFilename(filename string) string {
filename = strings.NewReplacer("\r", "_", "\n", "_").Replace(filename)
if strings.TrimSpace(filename) == "" {
return "audio"
}
return filename
}
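The metadata normalization at the end of this file is easy to get subtly wrong, so here is a standalone sketch of its behavior. It takes the filename and content type as plain strings instead of an stt.Request and uses fmt.Errorf in place of github.com/pkg/errors; both changes are assumptions made only so that the example runs without the repository.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// normalizeAudioMetadata mirrors the helper above: it fills in defaults,
// strips media-type parameters, and neutralizes CR/LF in the filename so it
// cannot break the multipart Content-Disposition header.
func normalizeAudioMetadata(filename, contentType string) (string, string, error) {
	filename = strings.TrimSpace(filename)
	if filename == "" {
		filename = "audio"
	}
	contentType = strings.TrimSpace(contentType)
	if contentType == "" {
		contentType = "application/octet-stream"
	} else {
		mediaType, _, err := mime.ParseMediaType(contentType)
		if err != nil {
			return "", "", fmt.Errorf("invalid audio content type: %w", err)
		}
		contentType = mediaType
	}
	filename = strings.NewReplacer("\r", "_", "\n", "_").Replace(filename)
	return filename, contentType, nil
}

func main() {
	// Browser recordings often carry codec parameters; only the bare type is kept.
	fmt.Println(normalizeAudioMetadata("", "audio/webm; codecs=opus"))

	// Line breaks inside the filename are replaced, and a missing type gets a default.
	fmt.Println(normalizeAudioMetadata("voice\r\n.wav", ""))

	// A malformed media type is rejected before any request is sent.
	_, _, err := normalizeAudioMetadata("voice.wav", "audio/")
	fmt.Println(err != nil)
}
```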
Replace the stub branch in internal/ai/stt/factory.go:
package stt
import (
"github.com/pkg/errors"
"github.com/usememos/memos/internal/ai"
sttopenai "github.com/usememos/memos/internal/ai/stt/openai"
)
func NewTranscriber(cfg ai.ProviderConfig, opts ...TranscriberOption) (Transcriber, error) {
switch cfg.Type {
case ai.ProviderOpenAI:
return sttopenai.New(cfg, ApplyOptions(opts))
case ai.ProviderGemini:
return nil, errors.Wrap(ai.ErrSTTNotSupported,
"Gemini does not provide a dedicated STT endpoint; use audiollm.NewModel instead")
default:
return nil, errors.Wrapf(ai.ErrCapabilityUnsupported, "provider type %q", cfg.Type)
}
}
go test ./internal/ai/stt/openai/... -v
Expected: --- PASS: TestTranscribe. Exit 0.
go build ./...
Expected: exits 0, no output.
git add internal/ai/stt
git commit -m "$(cat <<'EOF'
feat(ai/stt): implement OpenAI-compatible Transcriber

Ports the wire behavior of internal/ai/openai.go into the new
internal/ai/stt/openai package. Behavior is identical: same SDK, same
endpoint, same request shape; only the contract types (stt.Request,
stt.Response) change. Duration is intentionally dropped from the response
shape (audio file metadata, not transcription metadata).

The legacy internal/ai/openai.go is still in place; it will be removed
once the handler is refactored to call into the new stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
audiollm/gemini

Files:
- internal/ai/audiollm/gemini/gemini.go
- internal/ai/audiollm/gemini/gemini_test.go
- internal/ai/audiollm/factory.go

Port the wire behavior of internal/ai/gemini.go::geminiTranscriber.Transcribe with two semantic upgrades:
1. The transcription instructions are supplied by the caller via Request.Instructions instead of being hardcoded in the package.
2. FinishReason is mapped to audiollm.FinishReason and surfaced in the response, instead of being collapsed into a generic error.

Create internal/ai/audiollm/gemini/gemini_test.go. Two tests: happy path and unsupported-content-type rejection. The happy-path test asserts that the caller-supplied Instructions string flows through to the request and that FinishStop is returned. WebM input is transcoded to WAV by re-using internal/ai/audio.WebMOpusToWAV; the two tests below do not exercise that path.
package gemini_test
import (
"context"
"encoding/base64"
"encoding/json"
"net/http"
"net/http/httptest"
"strings"
"testing"
"time"
"github.com/stretchr/testify/require"
"github.com/usememos/memos/internal/ai"
"github.com/usememos/memos/internal/ai/audiollm"
audiollmgemini "github.com/usememos/memos/internal/ai/audiollm/gemini"
)
func TestGenerateFromAudio(t *testing.T) {
t.Parallel()
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodPost, r.Method)
require.Equal(t, "/v1beta/models/gemini-2.5-flash:generateContent", r.URL.Path)
require.Equal(t, "test-key", r.Header.Get("x-goog-api-key"))
require.Equal(t, "application/json", r.Header.Get("Content-Type"))
var request struct {
Contents []struct {
Parts []struct {
Text string `json:"text"`
InlineData *struct {
MIMEType string `json:"mimeType"`
Data string `json:"data"`
} `json:"inlineData"`
} `json:"parts"`
} `json:"contents"`
GenerationConfig map[string]json.Number `json:"generationConfig"`
}
require.NoError(t, json.NewDecoder(r.Body).Decode(&request))
require.Len(t, request.Contents, 1)
require.Len(t, request.Contents[0].Parts, 2)
require.NotNil(t, request.Contents[0].Parts[0].InlineData)
require.Equal(t, "audio/mp3", request.Contents[0].Parts[0].InlineData.MIMEType)
audio, err := base64.StdEncoding.DecodeString(request.Contents[0].Parts[0].InlineData.Data)
require.NoError(t, err)
require.Equal(t, "audio bytes", string(audio))
require.Equal(t, "transcribe please", request.Contents[0].Parts[1].Text)
require.Equal(t, json.Number("0"), request.GenerationConfig["temperature"])
w.Header().Set("Content-Type", "application/json")
require.NoError(t, json.NewEncoder(w).Encode(map[string]any{
"candidates": []map[string]any{
{
"finishReason": "STOP",
"content": map[string]any{
"parts": []map[string]string{{"text": "hello from gemini"}},
},
},
},
}))
}))
defer server.Close()
model, err := audiollmgemini.New(ai.ProviderConfig{
Type: ai.ProviderGemini,
Endpoint: server.URL + "/v1beta",
APIKey: "test-key",
}, audiollm.ApplyOptions(nil))
require.NoError(t, err)
temp := float32(0)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
resp, err := model.GenerateFromAudio(ctx, audiollm.Request{
Model: "models/gemini-2.5-flash",
ContentType: "audio/mpeg",
Audio: strings.NewReader("audio bytes"),
Instructions: "transcribe please",
Temperature: &temp,
})
require.NoError(t, err)
require.Equal(t, "hello from gemini", resp.Text)
require.Equal(t, audiollm.FinishStop, resp.FinishReason)
}
func TestGenerateFromAudioRejectsUnsupportedContentType(t *testing.T) {
t.Parallel()
model, err := audiollmgemini.New(ai.ProviderConfig{
Type: ai.ProviderGemini,
Endpoint: "https://example.com/v1beta",
APIKey: "test-key",
}, audiollm.ApplyOptions(nil))
require.NoError(t, err)
_, err = model.GenerateFromAudio(context.Background(), audiollm.Request{
Model: "gemini-2.5-flash",
ContentType: "video/mp4",
Audio: strings.NewReader("video bytes"),
Instructions: "transcribe please",
})
require.Error(t, err)
require.Contains(t, err.Error(), "not supported by Gemini")
}
go test ./internal/ai/audiollm/gemini/...
Expected: build failure ("undefined: audiollmgemini.New" or similar).
internal/ai/audiollm/gemini/gemini.go
Port from internal/ai/gemini.go. The structure mirrors the existing file; the differences are:
1. The instruction text comes from the request (Instructions) rather than being built inside the package.
2. Temperature comes from the request, not from a package constant.
3. The response carries FinishReason, and we no longer treat empty text as an error here. That decision moves to the caller (the handler), which knows whether FinishStop is required.

// Package gemini implements audiollm.Model against the Gemini generateContent
// endpoint. Used by Memos transcription when the user picks a Gemini provider:
// the handler issues a transcription instruction via audiollm.Request.Instructions.
package gemini
import (
"context"
"io"
"mime"
"net/url"
"strings"
"github.com/pkg/errors"
"google.golang.org/genai"
"github.com/usememos/memos/internal/ai"
"github.com/usememos/memos/internal/ai/audio"
"github.com/usememos/memos/internal/ai/audiollm"
)
const (
defaultEndpoint = "https://generativelanguage.googleapis.com/v1beta"
defaultAPIVersion = "v1beta"
maxInlineSize = 14 * 1024 * 1024
providerName = "Gemini"
)
var supportedContentTypes = map[string]string{
"audio/wav": "audio/wav",
"audio/x-wav": "audio/wav",
"audio/mp3": "audio/mp3",
"audio/mpeg": "audio/mp3",
"audio/aiff": "audio/aiff",
"audio/aac": "audio/aac",
"audio/ogg": "audio/ogg",
"audio/flac": "audio/flac",
"audio/x-flac": "audio/flac",
}
// Model implements audiollm.Model for Gemini generateContent.
type Model struct {
client *genai.Client
}
// New constructs a Model from a provider config.
func New(cfg ai.ProviderConfig, options audiollm.Options) (*Model, error) {
endpoint, err := normalizeEndpoint(cfg.Endpoint)
if err != nil {
return nil, err
}
if cfg.APIKey == "" {
return nil, errors.Errorf("%s API key is required", providerName)
}
baseURL, apiVersion, err := splitEndpoint(endpoint)
if err != nil {
return nil, err
}
httpOptions := genai.HTTPOptions{BaseURL: baseURL, APIVersion: apiVersion}
if options.HTTPClient != nil && options.HTTPClient.Timeout > 0 {
timeout := options.HTTPClient.Timeout
httpOptions.Timeout = &timeout
}
client, err := genai.NewClient(context.Background(), &genai.ClientConfig{
APIKey: cfg.APIKey,
Backend: genai.BackendGeminiAPI,
HTTPClient: options.HTTPClient,
HTTPOptions: httpOptions,
})
if err != nil {
return nil, errors.Wrap(err, "failed to create Gemini client")
}
return &Model{client: client}, nil
}
// GenerateFromAudio calls Gemini generateContent with the audio attached.
func (m *Model) GenerateFromAudio(ctx context.Context, req audiollm.Request) (*audiollm.Response, error) {
if strings.TrimSpace(req.Model) == "" {
return nil, errors.New("model is required")
}
if req.Audio == nil {
return nil, errors.New("audio is required")
}
if strings.TrimSpace(req.Instructions) == "" {
return nil, errors.New("instructions are required")
}
audioBytes, err := io.ReadAll(req.Audio)
if err != nil {
return nil, errors.Wrap(err, "failed to read audio")
}
if len(audioBytes) == 0 {
return nil, errors.New("audio is required")
}
contentType := req.ContentType
if audio.IsWebMContentType(contentType) {
wav, err := audio.WebMOpusToWAV(audioBytes)
if err != nil {
return nil, errors.Wrap(err, "failed to transcode webm audio for Gemini")
}
audioBytes = wav
contentType = "audio/wav"
}
if len(audioBytes) > maxInlineSize {
return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
}
contentType, err = normalizeContentType(contentType)
if err != nil {
return nil, err
}
cfg := &genai.GenerateContentConfig{}
if req.Temperature != nil {
t := *req.Temperature
cfg.Temperature = &t
}
resp, err := m.client.Models.GenerateContent(ctx, normalizeModelName(req.Model), []*genai.Content{
genai.NewContentFromParts([]*genai.Part{
genai.NewPartFromBytes(audioBytes, contentType),
genai.NewPartFromText(req.Instructions),
}, genai.RoleUser),
}, cfg)
if err != nil {
return nil, errors.Wrap(err, "failed to send Gemini request")
}
return &audiollm.Response{
Text: strings.TrimSpace(resp.Text()),
FinishReason: mapFinishReason(resp),
}, nil
}
func mapFinishReason(resp *genai.GenerateContentResponse) audiollm.FinishReason {
if resp == nil || len(resp.Candidates) == 0 {
return audiollm.FinishOther
}
switch resp.Candidates[0].FinishReason {
case genai.FinishReasonStop:
return audiollm.FinishStop
case genai.FinishReasonMaxTokens:
return audiollm.FinishLength
case genai.FinishReasonSafety:
return audiollm.FinishSafety
default:
return audiollm.FinishOther
}
}
func normalizeEndpoint(endpoint string) (string, error) {
endpoint = strings.TrimSpace(endpoint)
if endpoint == "" {
endpoint = defaultEndpoint
}
if _, err := url.ParseRequestURI(endpoint); err != nil {
return "", errors.Wrapf(err, "invalid %s endpoint", providerName)
}
return strings.TrimRight(endpoint, "/"), nil
}
func splitEndpoint(endpoint string) (string, string, error) {
parsed, err := url.Parse(endpoint)
if err != nil {
return "", "", errors.Wrap(err, "invalid Gemini endpoint")
}
path := strings.TrimRight(parsed.Path, "/")
apiVersion := defaultAPIVersion
for _, supported := range []string{"v1alpha", "v1beta", "v1"} {
if path == "/"+supported || strings.HasSuffix(path, "/"+supported) {
apiVersion = supported
parsed.Path = strings.TrimSuffix(path, "/"+supported)
break
}
}
return strings.TrimRight(parsed.String(), "/"), apiVersion, nil
}
func normalizeContentType(contentType string) (string, error) {
mediaType, _, err := mime.ParseMediaType(strings.TrimSpace(contentType))
if err != nil {
return "", errors.Wrap(err, "invalid audio content type")
}
mediaType = strings.ToLower(mediaType)
normalized, ok := supportedContentTypes[mediaType]
if !ok {
return "", errors.Errorf("audio content type %q is not supported by Gemini", mediaType)
}
return normalized, nil
}
func normalizeModelName(model string) string {
return strings.TrimPrefix(strings.TrimSpace(model), "models/")
}
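Of the helpers above, splitEndpoint is the least obvious: it peels a trailing API version off the configured endpoint so that the base URL and version can be handed to the SDK separately. The standalone copy below shows what it returns for three representative inputs. It uses fmt.Errorf instead of github.com/pkg/errors so that it runs without the repository, and the proxy hostname is made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const defaultAPIVersion = "v1beta"

// splitEndpoint mirrors the helper above: it returns the endpoint without a
// trailing version segment, plus the version that was found (or the default).
func splitEndpoint(endpoint string) (string, string, error) {
	parsed, err := url.Parse(endpoint)
	if err != nil {
		return "", "", fmt.Errorf("invalid Gemini endpoint: %w", err)
	}
	path := strings.TrimRight(parsed.Path, "/")
	apiVersion := defaultAPIVersion
	for _, supported := range []string{"v1alpha", "v1beta", "v1"} {
		if path == "/"+supported || strings.HasSuffix(path, "/"+supported) {
			apiVersion = supported
			parsed.Path = strings.TrimSuffix(path, "/"+supported)
			break
		}
	}
	return strings.TrimRight(parsed.String(), "/"), apiVersion, nil
}

func main() {
	for _, endpoint := range []string{
		"https://generativelanguage.googleapis.com/v1beta", // default endpoint
		"https://proxy.example.com/gemini/v1",              // hypothetical proxy with a path prefix
		"http://127.0.0.1:8080",                            // no version segment at all
	} {
		base, version, err := splitEndpoint(endpoint)
		fmt.Println(base, version, err)
	}
}
```

This is also why the happy-path test passes server.URL + "/v1beta" as the endpoint and then expects the request path to start with /v1beta: the version is split off here and re-applied when the request URL is built.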
Replace the stub branch in internal/ai/audiollm/factory.go:
package audiollm
import (
"github.com/pkg/errors"
"github.com/usememos/memos/internal/ai"
audiollmgemini "github.com/usememos/memos/internal/ai/audiollm/gemini"
)
func NewModel(cfg ai.ProviderConfig, opts ...ModelOption) (Model, error) {
switch cfg.Type {
case ai.ProviderGemini:
return audiollmgemini.New(cfg, ApplyOptions(opts))
case ai.ProviderOpenAI:
return nil, errors.Wrap(ai.ErrAudioLLMNotSupported,
"OpenAI multimodal audio (gpt-4o-audio) is not implemented")
default:
return nil, errors.Wrapf(ai.ErrCapabilityUnsupported, "provider type %q", cfg.Type)
}
}
go test ./internal/ai/audiollm/gemini/... -v
Expected: --- PASS: TestGenerateFromAudio and --- PASS: TestGenerateFromAudioRejectsUnsupportedContentType.
If TestGenerateFromAudio fails because the genai SDK constant for "STOP" is named differently from genai.FinishReasonStop, look up the exact constant name in google.golang.org/genai and adjust mapFinishReason. The behavioral expectation is unchanged: a JSON "finishReason": "STOP" from the server must map to audiollm.FinishStop.
go build ./...
Expected: exits 0.
git add internal/ai/audiollm
git commit -m "$(cat <<'EOF'
feat(ai/audiollm): implement Gemini multimodal audio Model

Ports internal/ai/gemini.go into the new internal/ai/audiollm/gemini
package. Two behavior changes from the previous code:

- Transcription instructions are now caller-supplied via
  audiollm.Request.Instructions instead of being hardcoded in the package.
  This frees the audiollm/gemini package to serve future non-transcription
  use cases (audio summarization, etc.) without changes.
- The genai FinishReason is mapped to audiollm.FinishReason and surfaced
  on the response, replacing the previous "did not include text" error
  that collapsed safety-filter, max-tokens, and other failure modes.

The legacy internal/ai/gemini.go is still in place and still wired to the
handler; that switches in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Files:
- server/router/api/v1/ai_service.go

The handler currently calls ai.NewTranscriber(provider) and transcriber.Transcribe(...). We replace that core block with a switch on provider.Type that calls either stt.NewTranscriber or audiollm.NewModel. All input validation and provider resolution above it stay the same.
In server/router/api/v1/ai_service.go, locate the existing call sequence inside the Transcribe method:
transcriber, err := ai.NewTranscriber(provider)
if err != nil {
return nil, status.Errorf(codes.InvalidArgument, "failed to create AI transcriber: %v", err)
}
transcription, err := transcriber.Transcribe(ctx, ai.TranscribeRequest{
Model: model,
Filename: filename,
ContentType: contentType,
Audio: bytes.NewReader(content),
Size: int64(len(content)),
Prompt: persisted.GetPrompt(),
Language: persisted.GetLanguage(),
})
if err != nil {
return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
}
return &v1pb.TranscribeResponse{
Text: transcription.Text,
}, nil
Replace that entire sequence (everything from transcriber, err := ai.NewTranscriber(provider) through the closing }, nil of the return statement) with:
var text string
switch provider.Type {
case ai.ProviderOpenAI:
text, err = s.transcribeViaSTT(ctx, provider, persisted, model, content, filename, contentType)
case ai.ProviderGemini:
text, err = s.transcribeViaAudioLLM(ctx, provider, persisted, model, content, contentType)
default:
return nil, status.Errorf(codes.FailedPrecondition,
"provider type %q is not supported for transcription", provider.Type)
}
if err != nil {
return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
}
return &v1pb.TranscribeResponse{Text: text}, nil
The bytes import stays: bytes.NewReader(content) moves into the helpers below, which live in the same file.
Append to the same file, after the Transcribe function:
func (*APIV1Service) transcribeViaSTT(
ctx context.Context,
provider ai.ProviderConfig,
persisted *storepb.TranscriptionConfig,
model string,
content []byte,
filename string,
contentType string,
) (string, error) {
transcriber, err := stt.NewTranscriber(provider)
if err != nil {
return "", errors.Wrap(err, "failed to create STT transcriber")
}
resp, err := transcriber.Transcribe(ctx, stt.Request{
Audio: bytes.NewReader(content),
Size: int64(len(content)),
Filename: filename,
ContentType: contentType,
Model: model,
Prompt: persisted.GetPrompt(),
Language: persisted.GetLanguage(),
})
if err != nil {
return "", err
}
return resp.Text, nil
}
// transcribeViaAudioLLM sends the audio to a multimodal audio LLM
// (audiollm.Model) together with transcription instructions. It fails unless
// generation finished normally and produced non-empty text.
func (*APIV1Service) transcribeViaAudioLLM(
	ctx context.Context,
	provider ai.ProviderConfig,
	persisted *storepb.TranscriptionConfig,
	model string,
	content []byte,
	contentType string,
) (string, error) {
	m, err := audiollm.NewModel(provider)
	if err != nil {
		return "", errors.Wrap(err, "failed to create audio LLM")
	}
	resp, err := m.GenerateFromAudio(ctx, audiollm.Request{
		Audio:        bytes.NewReader(content),
		Size:         int64(len(content)),
		ContentType:  contentType,
		Model:        model,
		Instructions: buildTranscriptionInstructions(persisted.GetPrompt(), persisted.GetLanguage()),
	})
	if err != nil {
		return "", err
	}
	if resp.FinishReason != audiollm.FinishStop {
		return "", errors.Errorf("transcription incomplete (finish reason: %s)", resp.FinishReason)
	}
	if strings.TrimSpace(resp.Text) == "" {
		return "", errors.New("transcription response did not include text")
	}
	return resp.Text, nil
}
// buildTranscriptionInstructions assembles the instruction text sent to
// multimodal LLMs: a fixed transcription directive, followed by optional
// language and spelling-hint paragraphs.
func buildTranscriptionInstructions(prompt, language string) string {
	parts := []string{
		"Transcribe the audio accurately. Return only the transcript text. " +
			"Do not summarize, explain, or add content that is not spoken.",
	}
	if language = strings.TrimSpace(language); language != "" {
		parts = append(parts, "The input language is "+language+".")
	}
	if prompt = strings.TrimSpace(prompt); prompt != "" {
		parts = append(parts, "Context and spelling hints:\n"+prompt)
	}
	return strings.Join(parts, "\n\n")
}
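To see what the Gemini path actually sends, the instruction builder can be run on its own. The sketch below copies buildTranscriptionInstructions verbatim from the step above; the sample prompt and language values are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// buildTranscriptionInstructions is copied from the plan step above.
func buildTranscriptionInstructions(prompt, language string) string {
	parts := []string{
		"Transcribe the audio accurately. Return only the transcript text. " +
			"Do not summarize, explain, or add content that is not spoken.",
	}
	if language = strings.TrimSpace(language); language != "" {
		parts = append(parts, "The input language is "+language+".")
	}
	if prompt = strings.TrimSpace(prompt); prompt != "" {
		parts = append(parts, "Context and spelling hints:\n"+prompt)
	}
	return strings.Join(parts, "\n\n")
}

func main() {
	// Surrounding whitespace is trimmed before the values are embedded.
	fmt.Println(buildTranscriptionInstructions("  Memos, gRPC  ", " en "))
	fmt.Println("---")
	// With both inputs empty, only the fixed directive remains.
	fmt.Println(buildTranscriptionInstructions("", ""))
}
```

The first call yields three paragraphs separated by blank lines (directive, language, hints); the second yields the directive alone, so an unconfigured TranscriptionConfig still produces a usable instruction.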
The imports block of ai_service.go needs:
import (
	"bytes"
	"context"
	"mime"
	"net/http"
	"strings"

	"github.com/pkg/errors"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	"github.com/usememos/memos/internal/ai"
	"github.com/usememos/memos/internal/ai/audiollm"
	"github.com/usememos/memos/internal/ai/stt"
	v1pb "github.com/usememos/memos/proto/gen/api/v1"
	storepb "github.com/usememos/memos/proto/gen/store"
)
(bytes and mime were already there; errors, audiollm, and stt are new.)
go build ./...
Expected: exits 0.
go test ./server/router/api/v1/... -run Transcrib -v
Expected: any existing tests touching Transcribe still pass. If there are no Transcribe tests yet, the command will print no tests to run — that is fine.
go test ./...
Expected: exits 0. Watch for any test that imports internal/ai.Transcriber, internal/ai.NewTranscriber, internal/ai.TranscribeRequest, or internal/ai.TranscribeResponse — if such a test exists outside the files we delete in stage E, it must be migrated now.
git add server/router/api/v1/ai_service.go
git commit -m "$(cat <<'EOF'
refactor(ai_service): dispatch transcription by provider type

The Transcribe handler now switches on the resolved provider's Type and
calls into either internal/ai/stt (OpenAI-compatible) or internal/ai/audiollm
(Gemini multimodal). buildTranscriptionInstructions centralizes the
literal instruction sent to multimodal LLMs, replacing the previously
hardcoded prompt inside internal/ai/gemini.go.

The legacy internal/ai Transcriber code is no longer called by the handler.
It is still on disk; it gets removed in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Files (deleted):
- internal/ai/transcription.go
- internal/ai/client.go
- internal/ai/openai.go
- internal/ai/openai_test.go
- internal/ai/gemini.go
- internal/ai/gemini_test.go

After the handler refactor in stage D, nothing references the legacy Transcriber interface or its implementations. We delete them.
git grep -nE 'ai\.(NewTranscriber|TranscribeRequest|TranscribeResponse|Transcriber)\b' -- ':(exclude)internal/ai/transcription.go' ':(exclude)internal/ai/client.go' ':(exclude)internal/ai/openai.go' ':(exclude)internal/ai/openai_test.go' ':(exclude)internal/ai/gemini.go' ':(exclude)internal/ai/gemini_test.go'
Expected: no output (no remaining external callers). If anything matches, fix it before deleting.
git rm internal/ai/transcription.go \
internal/ai/client.go \
internal/ai/openai.go \
internal/ai/openai_test.go \
internal/ai/gemini.go \
internal/ai/gemini_test.go
go build ./...
Expected: exits 0.
go test ./...
Expected: exits 0.
Verify that internal/ai/ only has the surviving files:
ls internal/ai/
Expected output (alphabetical, may include audio/):
ai.go
audio
audiollm
errors.go
models.go
resolver.go
stt
(audio/ is the existing webm package, untouched.)
git commit -m "$(cat <<'EOF'
refactor(ai): remove legacy Transcriber implementations

The Transcribe handler now uses internal/ai/stt and internal/ai/audiollm
exclusively. Removes the obsolete single-interface Transcriber, its
factory, and both provider implementations. internal/ai/ now contains
only shared primitives (ProviderConfig, errors, default models, provider
resolver) plus the per-capability sub-packages.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Files:
- proto/store/instance_setting.proto
- proto/gen/store/instance_setting.pb.go
- web/src/types/proto/store/instance_setting_pb.ts

Documentation-only proto change; no field tags moved or renamed.
Confirm that buf is available:
buf --version
Expected: prints a version. If absent: brew install bufbuild/buf/buf.
Update the model field comment:
Edit proto/store/instance_setting.proto. Find the existing block (around lines 179–182):
// model is the provider-specific model identifier.
// Empty string falls back to the engine default
// (whisper-1 for OPENAI providers, gemini-2.5-flash for GEMINI providers).
string model = 2;
Replace with:
// model is the provider-specific model identifier.
// Empty string falls back to the engine default.
// OPENAI examples:
// - whisper-1 (legacy, lower cost)
// - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
// - gpt-4o-transcribe-diarize (includes speaker labels)
// GEMINI examples:
// - gemini-2.5-flash (default, multimodal call)
// - gemini-2.5-pro
string model = 2;
Update the prompt field comment:
In the same file, find (around lines 188–191):
// prompt is a default spelling/vocabulary hint passed to the provider.
// Used as the OpenAI Whisper "prompt" parameter and folded into the Gemini
// generation prompt as a "Context and spelling hints" block.
string prompt = 4;
Replace with:
// prompt is a default spelling/vocabulary hint passed to the provider.
// Used as the OpenAI Whisper "prompt" parameter (a soft hint that the model
// may ignore) and folded into the Gemini generation prompt as a "Context and
// spelling hints" block (which the LLM will treat more literally).
string prompt = 4;
cd proto && buf format -w && buf generate && cd ..
Expected: each command exits 0; proto/gen/store/instance_setting.pb.go and web/src/types/proto/store/instance_setting_pb.ts are rewritten in place.
grep -A 9 "model is the provider-specific model identifier" proto/gen/store/instance_setting.pb.go
grep -A 4 "prompt is a default spelling/vocabulary hint" proto/gen/store/instance_setting.pb.go
Expected: both blocks contain the new wording (model: 9 lines of comment, including the OPENAI/GEMINI examples, followed by the Model string ... line; prompt: 4 lines of comment followed by the Prompt string ... line).
grep -A 9 "model is the provider-specific model identifier" web/src/types/proto/store/instance_setting_pb.ts
grep -A 4 "prompt is a default spelling/vocabulary hint" web/src/types/proto/store/instance_setting_pb.ts
Expected: both blocks contain the new wording (TS comments are JSDoc-formatted with leading * per line).
git status --short
Expected: exactly:
M proto/store/instance_setting.proto
M proto/gen/store/instance_setting.pb.go
M web/src/types/proto/store/instance_setting_pb.ts
If other generated files appear modified, the regen touched something unrelated — inspect with git diff --stat and stop.
go build ./...
Expected: exits 0. (Comment-only proto changes cannot break the build, but this catches accidental edits that bled into adjacent fields.)
git add proto/store/instance_setting.proto \
proto/gen/store/instance_setting.pb.go \
web/src/types/proto/store/instance_setting_pb.ts
git commit -m "$(cat <<'EOF'
docs(proto): clarify TranscriptionConfig model and prompt fields

- model: list current OpenAI (whisper-1, gpt-4o-transcribe family,
  gpt-4o-transcribe-diarize) and Gemini (2.5-flash, 2.5-pro) examples.
  The /audio/transcriptions endpoint is no longer Whisper-only.
- prompt: note that OpenAI Whisper treats it as a soft hint while
  Gemini folds it literally into the generation prompt.

Documentation-only; no field changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
Files:
- web/src/locales/en.json

The AISection.tsx component already conditionally swaps the model placeholder by provider type and renders the prompt help string. Only the underlying strings need updating.
Edit web/src/locales/en.json. Find these three keys (they currently live under setting.ai.*, around lines 441–449):
"transcription-model-placeholder-gemini": "gemini-2.5-flash",
"transcription-model-placeholder-openai": "whisper-1",
"transcription-prompt-help": "Improves spelling of proper nouns and jargon. Whisper limit is roughly 224 tokens.",
Replace with:
"transcription-model-placeholder-gemini": "gemini-2.5-flash, gemini-2.5-pro",
"transcription-model-placeholder-openai": "whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize",
"transcription-prompt-help": "Improves spelling of proper nouns and jargon. OpenAI Whisper treats this as a soft hint (Whisper limit is roughly 224 tokens). Gemini treats it as a literal instruction inside the generation prompt.",
python3 -m json.tool web/src/locales/en.json > /dev/null
Expected: exits 0, no output.
cd web && pnpm build && cd ..
(Use npm run build if pnpm is not the project's tool — check web/package.json for the canonical command.)
Expected: build succeeds with no TypeScript errors.
git add web/src/locales/en.json
git commit -m "$(cat <<'EOF'
i18n(en): expand transcription model and prompt help text
- Model placeholders list the full set of supported transcription models
for OpenAI (whisper-1 plus the gpt-4o-transcribe family) and Gemini.
- Prompt help text notes the cross-provider semantic difference: Whisper
treats the field as a soft hint, Gemini treats it as a literal
instruction. Mirrors the proto field comments.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
EOF
)"
After Stage G commits, run a full validation pass:
go test ./...
Expected: exits 0.
go build ./...
Expected: exits 0.
cd web && pnpm build && cd ..
Expected: exits 0.
If a local Memos instance is convenient, configure one OpenAI provider and one Gemini provider via the AI settings UI, then upload a short audio file via the voice-input feature for each. The end-to-end behavior should be identical to before this refactor (same transcript text). For Gemini specifically: deliberately bad audio (e.g., silence) should now produce a clearer error message when FinishReason != FinishStop, because the finish reason is included in the error text.
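The Gemini error behavior described above comes from the two post-generation checks in transcribeViaAudioLLM. The following sketch isolates them so the resulting messages can be seen without a live provider. FinishReason and its values are stand-ins: FinishStop's real value in audiollm is not shown in this plan, and FinishLength is a hypothetical non-stop reason used only for the demo.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// FinishReason stands in for the audiollm finish-reason type.
// Both values below are assumptions for this sketch.
type FinishReason string

const (
	FinishStop   FinishReason = "stop"
	FinishLength FinishReason = "length" // hypothetical non-stop reason
)

// checkResponse applies the same two guards as transcribeViaAudioLLM:
// generation must have stopped normally, and the text must be non-blank.
func checkResponse(text string, reason FinishReason) (string, error) {
	if reason != FinishStop {
		return "", fmt.Errorf("transcription incomplete (finish reason: %s)", reason)
	}
	if strings.TrimSpace(text) == "" {
		return "", errors.New("transcription response did not include text")
	}
	return text, nil
}

func main() {
	cases := []struct {
		text   string
		reason FinishReason
	}{
		{"hello world", FinishStop},
		{"hello wor", FinishLength},
		{"   ", FinishStop},
	}
	for _, c := range cases {
		text, err := checkResponse(c.text, c.reason)
		fmt.Printf("text=%q err=%v\n", text, err)
	}
}
```

Checking the finish reason before the text means a truncated but non-empty transcript is rejected rather than returned as if it were complete.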
git log --oneline -7
Expected (most-recent-first):
<sha> i18n(en): expand transcription model and prompt help text
<sha> docs(proto): clarify TranscriptionConfig model and prompt fields
<sha> refactor(ai): remove legacy Transcriber implementations
<sha> refactor(ai_service): dispatch transcription by provider type
<sha> feat(ai/audiollm): implement Gemini multimodal audio Model
<sha> feat(ai/stt): implement OpenAI-compatible Transcriber
<sha> feat(ai): scaffold stt and audiollm packages
Each commit is independently revertable.