registry/etcd/PERFORMANCE.md
This document describes the improvements made to address etcd authentication performance issues and cache penetration problems.
When etcd server authentication is enabled, a serious performance bottleneck can occur at scale. This was observed in production environments with 4000+ service pods.
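Problem: The registry used `KeepAliveOnce` for lease renewal, which requires a new authentication request for each call.

Change: Replaced `KeepAliveOnce` with `KeepAlive`, which maintains a single long-lived keepalive stream per lease.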
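The scale of the problem can be estimated with simple arithmetic. Assume, as stated above, one authentication request per `KeepAliveOnce` call. Let N be the number of registered nodes and T the renewal interval in seconds. The value T = 30 s below is hypothetical, not a measured one.

$$
R_{\text{Once}} = \frac{N}{T}, \qquad \text{e.g. } \frac{4000}{30\ \text{s}} \approx 133\ \text{authentication requests/s}
$$

With `KeepAlive`, each node instead authenticates once when its stream is opened. The steady-state authentication rate therefore no longer grows with N/T.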
Implementation:
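- `etcdRegistry` struct extended to track active keepalive channels
- `startKeepAlive()` method that establishes a long-lived keepalive stream
- `registerNode()` updated to reuse existing keepalive channels
- `stopKeepAlive()` for proper cleanup on deregistration

Benefits:
- One authentication request per keepalive stream instead of one per heartbeat
- Keepalive channels are reused on re-registration and released on deregistration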
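The keepalive bookkeeping in the implementation list can be sketched as follows. This is a minimal, self-contained sketch, not the registry's actual code. `leaseKeepAliver`, `keepAliveManager` and `fakeLease` are hypothetical names. The channel element type is simplified to `struct{}`, whereas the real etcd `clientv3` `KeepAlive` returns a channel of `*LeaseKeepAliveResponse`.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// leaseKeepAliver is a hypothetical stand-in for the etcd lease client.
// The element type is simplified to struct{} to keep the sketch self-contained.
type leaseKeepAliver interface {
	KeepAlive(ctx context.Context, leaseID int64) (<-chan struct{}, error)
}

type keepAliveEntry struct {
	leaseID int64
	cancel  context.CancelFunc // closes the stream for this entry
}

// keepAliveManager holds at most one keepalive stream per key (service name + node ID).
type keepAliveManager struct {
	client leaseKeepAliver
	mu     sync.Mutex
	active map[string]keepAliveEntry
}

func newKeepAliveManager(c leaseKeepAliver) *keepAliveManager {
	return &keepAliveManager{client: c, active: make(map[string]keepAliveEntry)}
}

// startKeepAlive reuses the stream for key if it already covers leaseID;
// otherwise it opens a new one (one authenticated request per stream).
func (m *keepAliveManager) startKeepAlive(key string, leaseID int64) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if e, ok := m.active[key]; ok {
		if e.leaseID == leaseID {
			return nil // existing stream reused: no new request
		}
		e.cancel() // lease changed: close the old stream first
		delete(m.active, key)
	}
	ctx, cancel := context.WithCancel(context.Background())
	ch, err := m.client.KeepAlive(ctx, leaseID)
	if err != nil {
		cancel()
		return err
	}
	go func() {
		for range ch { // drain responses so the channel never fills up
		}
	}()
	m.active[key] = keepAliveEntry{leaseID: leaseID, cancel: cancel}
	return nil
}

// stopKeepAlive closes and forgets the stream for key, e.g. on deregistration.
func (m *keepAliveManager) stopKeepAlive(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if e, ok := m.active[key]; ok {
		e.cancel()
		delete(m.active, key)
	}
}

// fakeLease counts opened streams; each stream closes when its ctx is cancelled.
type fakeLease struct{ opens int }

func (f *fakeLease) KeepAlive(ctx context.Context, leaseID int64) (<-chan struct{}, error) {
	f.opens++ // only called under keepAliveManager.mu in this sketch
	ch := make(chan struct{})
	go func() { <-ctx.Done(); close(ch) }()
	return ch, nil
}

func main() {
	fake := &fakeLease{}
	m := newKeepAliveManager(fake)
	key := "greeter" + "node-1" // mirrors s.Name+node.Id
	for i := 0; i < 5; i++ {    // five heartbeats for the same lease
		if err := m.startKeepAlive(key, 42); err != nil {
			panic(err)
		}
	}
	fmt.Println("streams opened:", fake.opens) // streams opened: 1
	m.stopKeepAlive(key)
	if err := m.startKeepAlive(key, 42); err != nil {
		panic(err)
	}
	fmt.Println("streams opened:", fake.opens) // streams opened: 2
}
```

Repeated calls for the same key and lease reuse the open stream, so the fake client records a single open. After `stopKeepAlive`, the next call opens a new stream. The method names mirror those in the list above, but the bodies are illustrative only.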
Code Changes:
```go
// Before: new auth request on every heartbeat
if _, err := e.client.KeepAliveOnce(context.TODO(), leaseID); err != nil {
	// handle error
}

// After: single auth request, reused keepalive channel
if err := e.startKeepAlive(s.Name+node.Id, leaseID); err != nil {
	// handle error
}
```
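Existing Protection: The registry cache already uses the singleflight pattern to prevent a cache stampede. In a stampede, many concurrent requests for the same uncached service all reach etcd at once.

How it Works: When concurrent lookups miss the cache for the same service, only one of them queries etcd. The others wait for that in-flight lookup and share its result.

Additional Safety: If the etcd lookup fails, the cache serves previously cached (stale) data rather than returning an error.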
Verification: Added comprehensive tests to confirm this behavior works correctly under load.
Test coverage:
- `TestKeepAliveManagement`: validates the keepalive lifecycle
- `TestKeepAliveReducesAuthRequests`: confirms keepalive channel reuse
- `TestKeepAliveChannelReconnection`: tests error handling
- `TestSingleflightPreventsStampede`: validates cache behavior
- `TestStaleCacheOnError`: confirms graceful degradation
- `TestCachePenetrationPrevention`: end-to-end validation
No code changes required! The improvements are transparent:
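- The `registry.Registry` interface is unchanged

If you maintain a custom registry plugin that renews etcd leases with `KeepAliveOnce`, consider applying the same change. Keep one long-lived `KeepAlive` stream per lease instead of calling `KeepAliveOnce` on every heartbeat.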
Potential improvements for consideration: