docs/tech-notes/api_authentication.md
Each node has 2 API surfaces:
In addition, each RPC API endpoint can perform fan-out queries to RPC endpoints on other nodes.
So we have the following API request stacks.
Go object names are in package server unless otherwise specified.
HTTP -> direct HTTP handler
Example:
/_status/vars
-> (http.ServeMux) ServeHTTP()
-> (varsHandlers) handleVars()
Note: only a few handlers (health, Prometheus vars) are served without authentication. All other HTTP APIs require authentication; the only endpoints exempt from it are those required by orchestration tools.
HTTP -> grpc-gateway -> local RPC handler
Example:
/_admin/v1/health
-> (http.ServeMux) ServeHTTP()
-> (*gwruntime.ServeMux) ServeHTTP()
-> gRPC loopback dial, inside same server
-> (rpc.kvAuth) AuthUnary()
-> (*adminServer) Health()
This works because the in-memory gRPC loopback dial uses the node TLS
cert. rpc.kvAuth then finds that its (local, in-memory) peer is
node and lets the RPC go through.
HTTP -> HTTP authn -> direct HTTP handler
Example:
/api/v2/events
-> (http.ServeMux) ServeHTTP()
-> (*authenticationV2Mux) ServeHTTP()
-> (*apiV2Server) listEvents()
The final handler cares about the SQL identity. This is achieved as follows:
authenticationV2Mux checks the session cookie then stores the
identity in the context.Context using
context.WithValue(webSessionUserKey{}).
listEvents() checks the identity by calling userFromHTTPAuthInfoContext(),
which looks at context.Value(webSessionUserKey{}).
listEvents() then authorizes the request based on the SQL identity.
HTTP -> HTTP authn -> grpc-gateway -> local RPC handler
Example:
/_admin/v1/trace_snapshots
-> (http.ServeMux) ServeHTTP()
-> (*authenticationMux) ServeHTTP()
-> (*gwruntime.ServeMux) ServeHTTP()
-> gRPC loopback dial, inside same server
-> (rpc.kvAuth) AuthUnary()
-> (*adminServer) ListTracingSnapshots()
The final handler cares about the SQL identity. This is achieved as follows:
authenticationMux checks the session cookie then stores the
identity in the context.Context using
context.WithValue(webSessionUserKey{}).
The gwruntime.ServeMux is configured with
gwruntime.WithMetadata(translateHTTPAuthInfoToGRPCMetadata), which
causes translateHTTPAuthInfoToGRPCMetadata to be called on every RPC.
This function, in turn, copies the SQL identity from
context.Value's webSessionUserKey{} into a gRPC "outgoing
context" at metadata key "websessionuser".
The gRPC loopback dial uses the node TLS cert.
gRPC automatically forwards the metadata.MD
in the outgoing context to become MD in the incoming context on
the callee RPC handler.
(rpc.kvAuth).AuthUnary() ignores this metadata. Instead it
merely verifies that the incoming RPC is coming from another
CockroachDB node or a root/node RPC client (node-to-node RPC
authentication).
ListTracingSnapshots() calls getUserAndRole(), which calls
userFromIncomingRPCContext(), which looks up "websessionuser" from the
gRPC incoming context metadata.
ListTracingSnapshots() then authorizes the request based on the SQL identity.
Generally, most HTTP services not only verify the user is authenticated using a valid SQL identity; they also verify the privileges of that user.
A few services do not do this and always work regardless of the user privileges.
Example:
/ts/query
-> (http.ServeMux) ServeHTTP()
-> (*authenticationMux) ServeHTTP()
-> (*gwruntime.ServeMux) ServeHTTP()
-> gRPC loopback dial, inside same server
-> (rpc.kvAuth) AuthUnary()
-> (*ts.Server) Query()
Like above, authenticationMux performs authentication and places the
identity into gRPC metadata, but Query() never uses it.
It is assumed that any user who can log in can access the Query endpoint and fetch timeseries data.
Other examples: /_admin/v1/liveness (Liveness())
/_admin/v1/uidata (GetUIData / SetUIData)
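The difference between a handler that authorizes based on the SQL identity and one that only relies on authentication can be sketched with the Go standard library alone. This is a minimal illustration, not the actual server code: the cookie name, the in-memory session table, the /privileged and /unprivileged paths, and the rule that only a user named admin passes the privilege check are all assumptions. Only webSessionUserKey is a name taken from this document.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// webSessionUserKey mirrors the context key named in this document.
type webSessionUserKey struct{}

// sessions is a hypothetical session store: cookie value -> SQL username.
var sessions = map[string]string{"s1": "alice", "s2": "admin"}

// authn plays the role of the authentication mux: it validates the
// session cookie and stores the SQL identity in the request context.
func authn(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		c, err := r.Cookie("session")
		if err != nil {
			http.Error(w, "no session", http.StatusUnauthorized)
			return
		}
		user, ok := sessions[c.Value]
		if !ok {
			http.Error(w, "invalid session", http.StatusUnauthorized)
			return
		}
		ctx := context.WithValue(r.Context(), webSessionUserKey{}, user)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// privileged authorizes based on the SQL identity found in the context.
// The "admin" comparison is a placeholder for a real privilege check.
func privileged(w http.ResponseWriter, r *http.Request) {
	user, _ := r.Context().Value(webSessionUserKey{}).(string)
	if user != "admin" {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	fmt.Fprint(w, "ok")
}

// unprivileged never looks at the identity: any logged-in user may call it.
func unprivileged(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprint(w, "ok")
}

// newMux wires both handlers behind the authentication layer.
func newMux() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/privileged", privileged)
	mux.HandleFunc("/unprivileged", unprivileged)
	return authn(mux)
}

// status issues a request with the given session cookie (if any) and
// returns the HTTP status code.
func status(h http.Handler, path, session string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if session != "" {
		req.AddCookie(&http.Cookie{Name: "session", Value: session})
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newMux()
	fmt.Println("alice -> /privileged:  ", status(h, "/privileged", "s1"))
	fmt.Println("alice -> /unprivileged:", status(h, "/unprivileged", "s1"))
	fmt.Println("anon  -> /unprivileged:", status(h, "/unprivileged", ""))
}
```

In this sketch both handlers sit behind the same authentication layer; the only difference is whether the handler consults the identity that the layer stored.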
HTTP -> HTTP authn -> local HTTP handler -> fanout RPC on other nodes.
Example:
/api/v2/sessions
-> (http.ServeMux) ServeHTTP()
-> (*authenticationV2Mux) ServeHTTP()
-> (*apiV2Server) listSessions()
-> (*statusServer) paginatedIterateNodes()
-> gRPC dial, remote call
-> (other node) (rpc.kvAuth).AuthUnary()
-> (other node) (*statusServer) ListLocalSessions()
The final handler cares about the SQL identity of the original query. This is achieved as follows:
authenticationV2Mux checks the session cookie then stores the
identity in the context.Context using
context.WithValue(webSessionUserKey{}).
listSessions() on the first node uses forwardHTTPAuthInfoToRPCCalls() to
copy the SQL identity from context.Value's webSessionUserKey{}
into a gRPC "outgoing context" at metadata key "websessionuser"
(metadata.NewOutgoingContext(ctx, translateHTTPAuthInfoToGRPCMetadata(ctx))).
The gRPC dial uses the node TLS cert, and remote calls automatically
forward the metadata.MD in the outgoing context to become MD in
the incoming context on the remote node.
(rpc.kvAuth).AuthUnary() ignores this metadata. Instead it
merely verifies that the incoming RPC is coming from another
CockroachDB node or a root/node RPC client (node-to-node RPC
authentication).
ListLocalSessions() calls getUserAndRole(), which calls
userFromIncomingRPCContext(), which looks up "websessionuser" from the
gRPC incoming context metadata.
HTTP -> HTTP authn -> grpc-gateway -> local RPC handler -> fanout RPC on other nodes
Example:
/_admin/v1/sessions
-> (http.ServeMux) ServeHTTP()
-> (*authenticationMux) ServeHTTP()
-> (*gwruntime.ServeMux) ServeHTTP()
-> gRPC loopback dial, inside same server
-> (rpc.kvAuth) AuthUnary()
-> (*statusServer) ListSessions()
-> (*statusServer) paginatedIterateNodes()
-> gRPC dial, remote call
-> (other node) (rpc.kvAuth).AuthUnary()
-> (other node) (*statusServer) ListLocalSessions()
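The two hops in this stack can be modeled in a few lines. The sketch below is a simplified simulation and not the actual implementation: it replaces gRPC metadata with plain context values, and the names httpToOutgoing, incomingToOutgoing, dial and userSeenByRemote are hypothetical. It only illustrates that an identity reaches the remote node when each hop explicitly copies it into its outgoing metadata.

```go
package main

import (
	"context"
	"fmt"
)

// Simplified stand-ins for gRPC metadata. Real code would use the gRPC
// metadata package rather than plain context values.
type (
	webSessionUserKey struct{} // identity stored by the HTTP authn layer
	outgoingKey       struct{} // stand-in for the gRPC outgoing context
	incomingKey       struct{} // stand-in for the gRPC incoming context
	md                map[string]string
)

// httpToOutgoing copies the HTTP-level identity into outgoing metadata,
// in the spirit of translateHTTPAuthInfoToGRPCMetadata.
func httpToOutgoing(ctx context.Context) context.Context {
	user, _ := ctx.Value(webSessionUserKey{}).(string)
	return context.WithValue(ctx, outgoingKey{}, md{"websessionuser": user})
}

// incomingToOutgoing copies the identity from incoming to outgoing
// metadata, in the spirit of forwardSQLIdentityThroughRPCCalls.
func incomingToOutgoing(ctx context.Context) context.Context {
	in, _ := ctx.Value(incomingKey{}).(md)
	return context.WithValue(ctx, outgoingKey{}, md{"websessionuser": in["websessionuser"]})
}

// dial simulates an RPC: only the caller's outgoing metadata crosses the
// wire, and it shows up as incoming metadata on the callee.
func dial(ctx context.Context) context.Context {
	out, _ := ctx.Value(outgoingKey{}).(md)
	return context.WithValue(context.Background(), incomingKey{}, out)
}

// userSeenByRemote walks the stack: HTTP authn, loopback RPC to the local
// handler, then fan-out RPC to another node. It returns the identity that
// the remote handler finds in its incoming metadata.
func userSeenByRemote(user string, forward bool) string {
	httpCtx := context.WithValue(context.Background(), webSessionUserKey{}, user)
	local := dial(httpToOutgoing(httpCtx)) // gateway -> local RPC handler
	if forward {
		local = incomingToOutgoing(local)
	}
	remote := dial(local) // local RPC handler -> other node
	in, _ := remote.Value(incomingKey{}).(md)
	return in["websessionuser"]
}

func main() {
	fmt.Printf("with forwarding:    %q\n", userSeenByRemote("alice", true))
	fmt.Printf("without forwarding: %q\n", userSeenByRemote("alice", false))
}
```

Without the explicit copy on the second hop, the remote handler in this model sees no identity at all.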
The final handler cares about the SQL identity. This is achieved as follows:
authenticationMux checks the session cookie then stores the
identity in the context.Context using
context.WithValue(webSessionUserKey{}).
The gwruntime.ServeMux is configured with
gwruntime.WithMetadata(translateHTTPAuthInfoToGRPCMetadata), which
causes translateHTTPAuthInfoToGRPCMetadata to be called on every RPC.
This function, in turn, copies the SQL identity from
context.Value's webSessionUserKey{} into a gRPC "outgoing
context" at metadata key "websessionuser".
The gRPC loopback dial uses the node cert.
gRPC automatically forwards the metadata.MD
in the outgoing context to become MD in the incoming context on
the callee RPC handler.
(rpc.kvAuth).AuthUnary() ignores this metadata. Instead it
merely verifies that the incoming RPC is coming from another
CockroachDB node or a root/node RPC client (node-to-node RPC
authentication).
(*statusServer) ListSessions() calls getUserAndRole(), which calls
userFromIncomingRPCContext(), which looks up "websessionuser" from the
gRPC incoming context metadata, and performs SQL authz.
ListSessions() then fans out. For this, it first calls
forwardSQLIdentityThroughRPCCalls(), which copies
"websessionuser" from the gRPC incoming metadata to the outgoing
context metadata.
The gRPC dial uses the node cert, and remote calls automatically
forward the metadata.MD in the outgoing context of the first node
to become MD in the incoming context on the remote node.
(rpc.kvAuth).AuthUnary() also ignores this metadata.
Instead it merely verifies that the incoming RPC is coming from
another CockroachDB node or a root/node RPC client (node-to-node
RPC authentication).
ListLocalSessions() also calls getUserAndRole(), which calls
userFromIncomingRPCContext(), which looks up "websessionuser" from the gRPC
incoming context metadata that has been forwarded.
These services do not have an HTTP endpoint.
CLI RPC client -> RPC authn -> local RPC handler
Example:
cockroach init
-> gRPC dial, remote call
-> (other node) (rpc.kvAuth).AuthUnary()
-> (*initServer) Bootstrap()
This works as follows:
The CLI client dials using a root or node TLS cert.
(rpc.kvAuth).AuthUnary() verifies that the incoming TLS client
has a valid cert for root or node.
In general, most RPC API endpoints exposed to external clients are
limited to this very simple model: the rpc.kvAuth component checks
that the client has a root or node cert, and the API has no checks of
its own.
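The simple model described above can be sketched as a single check on the peer certificate. This is an illustration under stated assumptions, not the rpc.kvAuth code: it assumes the client identity is carried in the certificate's common name and that the TLS layer has already validated the certificate chain. The function names checkPrivilegedPeer, certFor and allowed are hypothetical.

```go
package main

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
)

// checkPrivilegedPeer accepts only clients identified as root or node.
// Assumption: the identity is the common name of the first peer
// certificate, and chain validation already happened in the TLS layer.
func checkPrivilegedPeer(peerCerts []*x509.Certificate) error {
	if len(peerCerts) == 0 {
		return errors.New("no client certificate")
	}
	switch cn := peerCerts[0].Subject.CommonName; cn {
	case "root", "node":
		return nil
	default:
		return fmt.Errorf("client %q is not allowed to use this API", cn)
	}
}

// certFor builds an in-memory certificate carrying only a common name,
// which is all this sketch inspects.
func certFor(name string) *x509.Certificate {
	return &x509.Certificate{Subject: pkix.Name{CommonName: name}}
}

// allowed reports whether a client with the given identity passes.
func allowed(name string) bool {
	return checkPrivilegedPeer([]*x509.Certificate{certFor(name)}) == nil
}

func main() {
	for _, name := range []string{"node", "root", "alice"} {
		fmt.Printf("%s: allowed=%t\n", name, allowed(name))
	}
}
```

Note that in this model the handler itself performs no check: everything rests on the single gate in front of it.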
SendKVBatch() oddity
cockroach debug send-kv-batch is a pure RPC client, calling
(*systemAdminServer) SendKVBatch() remotely.
The code for SendKVBatch() calls userFromIncomingRPCContext() to then call
hasAdminRole().
userFromIncomingRPCContext() looks up the username from the gRPC metadata.
Because this is a pure RPC call and the cockroach debug send-kv-batch client never adds this metadata, userFromIncomingRPCContext()
finds no metadata to work with. In that case, by an accident of
history, userFromIncomingRPCContext() assumes the client is root. This is why
SendKVBatch() happens to work. This behavior is arguably a bug,
identified a long time ago:
https://github.com/cockroachdb/cockroach/issues/45018
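The fallback can be reproduced in a small model. The sketch below is a simplification and not the actual code: gRPC incoming metadata is replaced by a plain context value, the admin check treats only root as admin, and the names userFromIncoming, sendKVBatch and callAs are hypothetical stand-ins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// incomingKey and md are simplified stand-ins for gRPC incoming metadata.
type incomingKey struct{}
type md map[string][]string

// userFromIncoming mimics the lookup described above, including the
// fallback: no metadata, or no "websessionuser" entry, means root.
func userFromIncoming(ctx context.Context) string {
	m, _ := ctx.Value(incomingKey{}).(md)
	if users := m["websessionuser"]; len(users) > 0 {
		return users[0]
	}
	return "root"
}

// sendKVBatch is a hypothetical handler that requires an admin user.
// For brevity, this sketch treats only root as admin.
func sendKVBatch(ctx context.Context) error {
	if user := userFromIncoming(ctx); user != "root" {
		return errors.New("user " + user + " is not admin")
	}
	return nil
}

// callAs invokes the handler with the given identity attached. An empty
// user models a pure RPC client that attaches no metadata at all.
func callAs(user string) error {
	ctx := context.Background()
	if user != "" {
		ctx = context.WithValue(ctx, incomingKey{}, md{"websessionuser": {user}})
	}
	return sendKVBatch(ctx)
}

func main() {
	fmt.Println("alice:", callAs("alice"))
	fmt.Println("no metadata:", callAs(""))
}
```

In this model, the call without metadata succeeds only because the lookup silently substitutes root for the missing identity.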
KV server -> KV RPC endpoint
Example:
node initialization code
-> gRPC dial, remote call
-> (other node) (rpc.kvAuth).AuthUnary()
-> (*Node) Join() on other node
This works as follows:
The client uses its node certificate to
perform an RPC dial.
(rpc.kvAuth).AuthUnary() verifies this TLS identity then lets the
RPC go through. The RPC doesn't further use the identity of the
client.
SQL server -> SQL server, but the initial request is not due to an HTTP->RPC gateway call.
Example:
distsql execution
-> gRPC dial, remote call
-> (other node) (rpc.kvAuth).AuthUnary()
-> (*distsql.ServerImpl) SetupFlow() on other node
This works as follows:
The SQL server uses the node cert (if it serves the system
tenant) or the tenant client cert (if it's a SQL pod) to dial the
remote node.
(rpc.kvAuth).AuthUnary() asserts the incoming TLS identity is that
of a node client or a SQL server for the same tenant ID as the
local tenant ID. The RPC doesn't further use the identity of the
client.
SQL server -> KV RPC endpoint
Example:
txn client distsender
-> gRPC dial, remote call
-> (other node) (rpc.kvAuth).AuthUnary()
-> (*Node) Batch() on other node
This works as follows:
The client uses its node certificate to
perform an RPC dial.
(rpc.kvAuth).AuthUnary() verifies this TLS identity then lets the
RPC go through. The RPC doesn't further use the identity of the
client.
SQL server -> KV RPC endpoint
Example:
tenant connector
-> gRPC dial, remote call
-> (other node) (rpc.kvAuth).AuthUnary()
-> (*Node) GossipSubscription()
This works as follows:
(rpc.kvAuth).AuthUnary() verifies this TLS identity. It sees the
client is for a tenant server, then additionally verifies the
parameters of the request. Then it lets the RPC go through. The RPC
doesn't further use the identity of the client.
All the RPC API patterns in the previous section fall neatly into one of 3 categories:
endpoints for DB console or low-privileged external tools (e.g.
EnqueueRange), which may or may not have a HTTP
alias via grpc-gateway.
privileged endpoints which can only be used by a root or node
trusted client, for example SendKVBatch, Bootstrap or distsql
SetupFlow(). These are used both by CLI admin tools and in
server-to-server internal RPCs.
KV APIs meant for use by lower-privileged SQL servers for secondary
tenants. For example, TokenBucket.
These 3 categories all have the following properties in common:
We can verify this as follows:
Well-scoped DB Console endpoints (e.g. EnqueueRange) always use
the same HTTP authentication (1st property) and they only validate
the SQL role/privileges within the confines of the tenant that
serves them (2nd property). For example:
EnqueueRange on a secondary tenant is only used by DB Console
connected to the secondary tenant, and just needs a valid session cookie
for that tenant service.
EnqueueRange on the system tenant is only used by DB Console
connected to the system tenant, and just needs a valid session
cookie for that tenant service.
Those that perform RPC fan-outs (e.g. ListSessions) still have
these properties: they use the same HTTP authn (1st property)
and the fan-out still occurs within the same tenant ID,
so fan-out SQL privilege checks are within the same tenant (2nd property).
Privileged endpoints:
Bootstrap etc. check for a TLS
client cert valid for root or node on the system tenant (1st & 2nd properties).
SetupFlow etc. are available both on the system tenant and on secondary
tenants but only need a TLS client cert valid for the same tenant
(1st & 2nd properties).
Cross-tenant-boundary RPCs are always from a secondary tenant to a KV node. They always follow the same rule:
By an accident of history, we have created API endpoints that are mixed-purpose and have more complex properties (and are thus much harder to reason about).
We've designed all our RPC API endpoints with the expectation they would be called through the gRPC stack, and they peek into their incoming context to get authentication details with that assumption.
Alas...
Consider pkg/sql/cancel_queries.go, called from the (internal)
execution of a CANCEL QUERY statement.
We see:
(*cancelQueriesNode) Next()
-> (*extendedEvalContext).SQLStatusServer.CancelQuery()
-> (serverpb.SQLStatusServer).CancelQuery()
-> (statusServer) CancelQuery()
Now consider that this latter Go function was implemented as a RPC handler.
It calls checkCancelPrivilege(), which calls getUserAndRole(),
which calls userFromIncomingRPCContext().
userFromIncomingRPCContext() looks up the username from the gRPC metadata.
(As would be expected for HTTP handlers served by grpc-gateway, see
previous sections).
Because this is a pure Go call from SQL and (*cancelQueriesNode) Next() never adds this metadata, userFromIncomingRPCContext() finds no
metadata to work with. In that case, by an accident of history,
userFromIncomingRPCContext() assumes the client is root. This is why
CancelQuery() as called directly from SQL happens to work.
This behavior is arguably a bug, identified a long time ago: https://github.com/cockroachdb/cockroach/issues/45018
Due to this bug, every function listed in the SQLStatusServer interface
bypasses authorization (because from their perspective, the
incoming call appears to be run by root), unless the author
of the RPC handler was cognizant of this limitation and did something special.
The reason why we seem to be OK with this is that:
in some cases, the engineer who created this integration foresaw
this ambiguity and created an interlock of sorts in the callee RPC
handler. This is e.g. what happened with the CancelQuery() handler
above.
in other cases, the callers of these functions are inside the SQL
layer and also perform an authorization check of their own.
This is e.g. what happened with ListExecutionInsights() as called
from crdb_internal vtable.
Arguably, this integration is brittle, as we are one step away from someone forgetting to add an authz check.
These are API endpoints that can be used both "within their own tenant scope" and across the KV/tenant API boundary.
For example, TenantRanges may be used:
by DB Console on secondary tenant, uses HTTP authn: (*statusServer) TenantRanges()
API handler asserts privileges on SQL incoming identity from client within the secondary tenant.
by DB Console on system tenant, uses HTTP authn. (*systemStatusServer) TenantRanges()
API handler asserts privileges on SQL incoming identity from client within the system tenant.
as an API forward from the secondary tenant to system tenant, requires a privilege escalation:
(*statusServer) TenantRanges() verifies SQL identity locally
-> (*serverpb.TenantStatusServer).TenantRanges()
-> (*kvtenantccl.Connector) TenantRanges()
-> gRPC dial, remote call
-> (rpc.kvAuth) AuthUnary(), verifies tenant identity
-> (*systemStatusServer) TenantRanges()
In this case, something odd happens: (*systemStatusServer) TenantRanges() calls requireAdminUser() which calls
getUserAndRole() which calls userFromIncomingRPCContext().
userFromIncomingRPCContext() looks up the username from the gRPC metadata.
(As would be expected for HTTP handlers served by grpc-gateway, see
previous sections).
Because this is a pure RPC call and the (*kvtenantccl.Connector) TenantRanges() never adds this metadata, userFromIncomingRPCContext() finds
no metadata to work with. In that case, by an accident of history,
userFromIncomingRPCContext() assumes the client is root. This is why
TenantRanges() as called through the tenant/KV interface happens
to work.
This behavior is arguably a bug, identified a long time ago: https://github.com/cockroachdb/cockroach/issues/45018
Similar problems can be found in HotRangesV2() and perhaps others.
Like above, these are API endpoints that can be used both "within their own tenant scope" and across the KV/tenant API boundary. However, the situation is slightly better because they do not perform authorization.
For example, Query may be used:
by DB Console on secondary tenant, uses HTTP authn: (*ts.TenantServer) Query()
API handler doesn't care about SQL identity.
by DB Console on system tenant, uses HTTP authn. (*ts.Server) Query()
API handler doesn't care about SQL identity.
as an API forward from the secondary tenant to system tenant, requires a privilege escalation:
(*ts.TenantServer) Query() doesn't care about SQL identity
-> (tspb.TenantTimeSeriesServer).Query()
-> (*kvtenantccl.Connector) Query()
-> gRPC dial, remote call
-> (rpc.kvAuth) AuthUnary(), verifies *tenant identity*
-> (*ts.Server) Query() doesn't care about SQL identity
Another example of this is Liveness().
This stacking is brittle however: if, later, we decide to add authorization to the service (e.g. to add an observability SQL privilege), we would fall back into the case above with broken/confused authorization semantics.
It's very weird that we use the same KV authentication
component (rpc.kvAuth) to authenticate gRPC loopback
dials within the same server with grpc-gateway. This also
forces the use of a full TLS handshake for these in-memory
loopback dials, which adds latency and memory usage. Maybe
we should find a way to simplify the grpc-gateway integration.
APIs that depend on information added through an authn layer
should fail when they don't find it. For example, SendKVBatch
should see its indirect call to userFromIncomingRPCContext fail if
we decide that userFromIncomingRPCContext only works after HTTP
authentication.
(This would identify when we've failed to forward authn bits during RPC fanouts.)
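A minimal sketch of what such a fail-closed lookup could look like, under the same simplifications as the rest of this note's examples: gRPC incoming metadata is modeled as a plain context value, and strictUserFromIncoming, errNoIdentity and lookup are hypothetical names, not existing functions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// incomingKey and md are simplified stand-ins for gRPC incoming metadata.
type incomingKey struct{}
type md map[string][]string

// errNoIdentity is returned when no single SQL identity was forwarded.
var errNoIdentity = errors.New("no SQL identity in incoming RPC context")

// strictUserFromIncoming is a hypothetical variant of the lookup that
// fails instead of assuming root when the identity is missing.
func strictUserFromIncoming(ctx context.Context) (string, error) {
	m, _ := ctx.Value(incomingKey{}).(md)
	users := m["websessionuser"]
	if len(users) != 1 {
		return "", errNoIdentity
	}
	return users[0], nil
}

// lookup runs the strict lookup for the given identity. An empty user
// models a caller that attached no metadata at all.
func lookup(user string) (string, error) {
	ctx := context.Background()
	if user != "" {
		ctx = context.WithValue(ctx, incomingKey{}, md{"websessionuser": {user}})
	}
	return strictUserFromIncoming(ctx)
}

func main() {
	u, err := lookup("alice")
	fmt.Printf("with identity:    user=%q err=%v\n", u, err)
	u, err = lookup("")
	fmt.Printf("without identity: user=%q err=%v\n", u, err)
}
```

With this shape, a fan-out path that forgets to forward the identity surfaces as an error at the callee rather than as an implicit grant.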
Handlers meant to be used for external RPCs should only be called
directly as a Go function from the SQL layer if the SQL layer also
passes all the required authn bits. That is, the SQL layer
should "authenticate itself as an RPC client" to call the
functions in the SQLStatusServer interface.
For example, the calls to CancelQuery() in pkg/sql should pass
the SQL identity of the calling session as the RPC client identity
through the call.
Pure privileged RPC APIs that assume that they are only
ever reachable using a node cert (KV-to-KV RPCs) should
fail if reached from a client conn identified as non-node.
(This would identify missing authz checks).
APIs that need a valid SQL identity should assert that it is present
and fail if it is not there. This requires us to split mixed-purpose
APIs like TenantRanges into:
one endpoint that serves external clients and verifies the SQL identity and performs authorization based on SQL privileges;
another endpoint exposed through *kvtenantccl.Connector which
doesn't need to check the SQL identity because it is
guaranteed to only ever be called via the tenant/KV interface
with a valid node cert or tenant client cert.
(This would make the semantics easier to understand and to audit for correctness.)
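The proposed split can be illustrated with two separate entry points, each with a single, explicit rule. This is a sketch under stated assumptions rather than a design: the caller struct, the externalTenantRanges and internalTenantRanges names, the string-valued peer identity, and the placeholder rule that only a user named admin is authorized are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// caller describes what the authentication layers established about the
// client. Both fields are hypothetical simplifications.
type caller struct {
	sqlUser string // SQL identity from HTTP authn; empty if absent
	tlsPeer string // "node", "tenant", or another TLS client identity
}

// externalTenantRanges serves external clients: it requires a SQL
// identity and authorizes based on it. The admin comparison is a
// placeholder for a real SQL privilege check.
func externalTenantRanges(c caller) error {
	if c.sqlUser == "" {
		return errors.New("missing SQL identity")
	}
	if c.sqlUser != "admin" {
		return errors.New("user " + c.sqlUser + " lacks privileges")
	}
	return nil
}

// internalTenantRanges is the tenant/KV entry point: it ignores the SQL
// identity and only accepts node or tenant client certs.
func internalTenantRanges(c caller) error {
	if c.tlsPeer != "node" && c.tlsPeer != "tenant" {
		return errors.New("peer " + c.tlsPeer + " is not a node or tenant server")
	}
	return nil
}

func main() {
	fmt.Println("external, admin:      ", externalTenantRanges(caller{sqlUser: "admin"}))
	fmt.Println("external, no identity:", externalTenantRanges(caller{tlsPeer: "tenant"}))
	fmt.Println("internal, tenant:     ", internalTenantRanges(caller{tlsPeer: "tenant"}))
	fmt.Println("internal, other peer: ", internalTenantRanges(caller{sqlUser: "admin", tlsPeer: "alice"}))
}
```

Neither entry point needs a fallback: a missing SQL identity is an error on the external path, and is irrelevant on the internal one.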