docs/content/v2024.1/yugabyte-platform/troubleshoot/install-upgrade-issues/kubernetes.md
Occasionally, you might encounter issues during installation and upgrade of YugabyteDB Anywhere on Kubernetes. You can troubleshoot most of these issues.
If you have problems while troubleshooting, contact {{% support-platform %}}.
For more information, see the following:
YugabyteDB Anywhere pod scheduling can fail for a variety of reasons, such as insufficient resource allocation, a mismatch in the node selector or affinity, an incorrect storage class configuration, or problems with Elastic Block Store (EBS) volumes. Typically, this manifests as pods remaining in the Pending state for a long time.
For additional information, see Node selection in kube-scheduler.
To start diagnostics, execute the following command to obtain the pod:
kubectl get pod -n <NAMESPACE>
If the issue you are experiencing is due to a pod scheduling failure, expect to see Pending in the STATUS column, as per the following output:
NAME                 READY   STATUS    RESTARTS   AGE
yw-test-yugaware-0   0/4     Pending   0          2m30s
Execute the following command to obtain detailed information about the pod and failure:
kubectl describe pod <POD_NAME> -n <NAMESPACE>
Expect to see a Message similar to the following:
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  56s   default-scheduler  0/2 nodes are available: 2 Insufficient cpu.
For more information, see Kubernetes: Specify a CPU request that is too big for your nodes.
Resolution
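You can either add nodes (or a node pool) with enough allocatable CPU and memory, or lower the resources requested by YugabyteDB Anywhere. The following is a minimal sketch of the latter, assuming the chart was installed from the yugabytedb Helm repository as yugabytedb/yugaware and that your chart version exposes the requests under yugaware.resources in values.yaml (verify the exact keys and choose values suitable for your workload; yw-test is the example release name from the output above):

helm upgrade yw-test yugabytedb/yugaware -n <NAMESPACE> \
  --version <CHART_VERSION> \
  --set yugaware.resources.requests.cpu=2 \
  --set yugaware.resources.requests.memory=4Gi \
  --reuse-values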
To start diagnostics, execute the following command to obtain detailed information about the failure:
kubectl describe pod <POD_NAME> -n <NAMESPACE>
If the issue you are experiencing is due to a mismatched node selector or affinity, expect to see a Message similar to the following:
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  75s (x40 over 55m)  default-scheduler  0/55 nodes are available: 19 Insufficient cpu, 36 node(s) didn't match Pod's node affinity/selector
Resolution
Ensure that the node labels and taints match the node selector, affinity, and tolerations that you use when scheduling YugabyteDB Anywhere pods on specific nodes. Otherwise, the scheduler can fail to find a suitable node. For more information, see the Kubernetes documentation on node selection and on taints and tolerations.
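As an illustration, if you pin YugabyteDB Anywhere to particular nodes, the selector you pass to the chart must match labels that actually exist on those nodes. The following is a minimal sketch, assuming your chart version exposes a top-level nodeSelector value in values.yaml (verify the key for your chart version); it uses the well-known kubernetes.io/hostname node label:

# values-scheduling.yaml (hypothetical override file)
nodeSelector:
  kubernetes.io/hostname: <NODE_NAME>

helm upgrade yw-test yugabytedb/yugaware -n <NAMESPACE> \
  --reuse-values -f values-scheduling.yaml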
During a multi-zone deployment of YugabyteDB, start diagnostics by executing the following command to obtain detailed information about the failure:
kubectl describe pod <POD_NAME> -n <NAMESPACE>
If the issue you are experiencing is due to an incorrect setting of the storage class VolumeBindingMode, expect to see a Message similar to the following:
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  75s (x40 over 55m)  default-scheduler  0/10 nodes are available: 3 node(s) had volume node affinity conflict, 7 node(s) didn't match Pod's node affinity/selector.
You can obtain information related to storage classes, as follows:
Get the VolumeBindingMode setting information from all storage classes in the universe by executing the following command:
kubectl get storageclass
Expect an output similar to the following:
NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
premium-rwo          pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   418d
standard (default)   kubernetes.io/gce-pd    Delete          Immediate              true                   418d
standard-rwo         pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   418d
If a storage class named standard was defined in the universe, you can obtain information about that storage class by executing the following command:
kubectl describe storageclass standard
Expect an output similar to the following:
Name:                  standard
IsDefaultClass:        Yes
Annotations:           storageclass.kubernetes.io/is-default-class=true
Provisioner:           kubernetes.io/gce-pd
Parameters:            type=pd-standard
AllowVolumeExpansion:  True
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
Resolution
Because not setting VolumeBindingMode to WaitForFirstConsumer might result in the universe creating the volume in a different zone than the selected zone, ensure that the StorageClass used during YugabyteDB Anywhere deployment has its VolumeBindingMode set to WaitForFirstConsumer. You can use the following command:
kubectl get storageclass standard -ojson \
| jq '.volumeBindingMode="WaitForFirstConsumer" | del(.metadata.managedFields, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid)' \
| kubectl replace --force -f -
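To confirm the change, print the VolumeBindingMode of the storage class, continuing the example of a storage class named standard:

kubectl get storageclass standard -o jsonpath='{.volumeBindingMode}{"\n"}'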
Execute the following command to check events for the persistent volume claim (PVC):
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
If the issue you are experiencing is due to a missing EBS CSI driver, expect an output similar to the following:
waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
Resolution
Follow the instructions provided in Troubleshoot AWS EBS volumes.
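Before doing so, you can quickly confirm whether the EBS CSI driver is present and its controller is running. This is a minimal sketch, assuming the driver is installed under its default names in the kube-system namespace (names can differ depending on how the driver was installed):

kubectl get csidriver ebs.csi.aws.com
kubectl get deployment ebs-csi-controller -n kube-system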
In some cases, a scheduled pod fails to run and errors are thrown.
A Kubernetes pod can encounter these errors when it fails to pull the container images from a private container registry, causing the pod to enter the ImagePullBackOff state.
One common reason for these errors is that the kubelet node agent cannot authenticate with the container registry.
To start diagnostics, execute the following command to obtain the pod:
kubectl get pod -n <NAMESPACE>
If the issue you are experiencing is due to an image pull error, expect to initially see the ErrImagePull error, and on subsequent attempts the ImagePullBackOff error listed under STATUS, as per the following outputs:
NAME                 READY   STATUS              RESTARTS   AGE
yw-test-yugaware-0   0/4     Init:ErrImagePull   0          3

NAME                 READY   STATUS                  RESTARTS   AGE
yw-test-yugaware-0   0/4     Init:ImagePullBackOff   0          2m10s
Execute the following command to obtain detailed information about the pod and failure:
kubectl describe pod <POD_NAME> -n <NAMESPACE>
Expect an output similar to the following:
Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   Pulling  25s (x3 over 75s)  kubelet  Pulling image "quay.io/yugabyte/yugaware:2.16.0.0-b90"
  Warning  Failed   22s (x3 over 72s)  kubelet  Failed to pull image "quay.io/yugabyte/yugaware:2.16.0.0-b90": rpc error: code = Unknown desc = failed to pull and unpack image "quay.io/yugabyte/yugaware:2.16.0.0-b90": failed to resolve reference "quay.io/yugabyte/yugaware:2.16.0.0-b90": pulling from host quay.io failed with status code [manifests 2.16.0.0-b90]: 401 UNAUTHORIZED
  Warning  Failed   22s (x3 over 72s)  kubelet  Error: ErrImagePull
  Normal   BackOff  7s (x3 over 72s)   kubelet  Back-off pulling image "quay.io/yugabyte/yugaware:2.16.0.0-b90"
  Warning  Failed   7s (x3 over 72s)   kubelet  Error: ImagePullBackOff
Resolution
Ensure that the image pull secret (by default, yugabyte-k8s-pull-secret) exists in the correct namespace and contains valid credentials for the container registry before the Helm installation is performed. For more information, see the values.yaml file.
There are a number of reasons for the CrashLoopBackOff error. It typically occurs when a YugabyteDB Anywhere pod crashes due to an internal application error.
To start diagnostics, execute the following command to obtain the pod:
kubectl get pod -n <NAMESPACE>
If the issue you are experiencing is due to the CrashLoopBackOff error, expect to see this error listed under STATUS, as per the following output:
NAME                             READY   STATUS             RESTARTS   AGE
yugabyte-platform-1-yugaware-0   3/4     CrashLoopBackOff   2          4d14h
Resolution
Execute the following command to obtain detailed information about the YugabyteDB Anywhere pod experiencing the CrashLoopBackOff error:
kubectl describe pods <POD_NAME> -n <NAMESPACE>
Execute the following commands to check YugabyteDB Anywhere logs for a specific container and perform troubleshooting based on the information in the logs:
# YugabyteDB Anywhere
kubectl logs <POD_NAME> -n <NAMESPACE> -c yugaware
# PostgreSQL
kubectl logs <POD_NAME> -n <NAMESPACE> -c postgres
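If a container is restarting repeatedly, its current logs may be empty or truncated. In that case, inspect the logs of the previous (crashed) container instance:

kubectl logs <POD_NAME> -n <NAMESPACE> -c yugaware --previous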
The load balancer might not be ready to provide services to a running YugabyteDB Anywhere instance.
The internet-facing load balancer may not perform as expected because the default AWS load balancer used in Amazon Elastic Kubernetes Service (EKS) by the YugabyteDB Anywhere Helm chart is not suitable for your configuration.
Resolution
Use the following settings to customize the AWS load balancer controller behavior:
Set aws-load-balancer-scheme to the internal or internet-facing string value.
Set aws-load-balancer-backend-protocol and aws-load-balancer-healthcheck-protocol to the http string value.
The following is a sample configuration:
service.beta.kubernetes.io/aws-load-balancer-type: "ip"
service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "http"
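The Helm value that carries these annotations depends on your chart version, so verify it against the chart's values.yaml. As a quick way to test the effect on an existing installation, you can also apply an annotation directly to the Service, keeping in mind that a later helm upgrade may revert manual changes; the Service name below matches the example used on this page:

kubectl annotate service yw-test-yugaware-ui -n <NAMESPACE> \
  service.beta.kubernetes.io/aws-load-balancer-scheme=internet-facing --overwrite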
For a variety of reasons, such as the absence of a load balancer controller or an exceeded public IP quota, the load balancer IP assignment might remain in a pending state indefinitely.
To start diagnostics, execute the following command to obtain the Kubernetes Service (svc) information:
kubectl get svc -n <NAMESPACE>
If the issue you are experiencing is due to IP assignment for the load balancer, expect to see an output similar to the following:
NAME                          TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                       AGE
service/yw-test-yugaware-ui   LoadBalancer   10.4.1.7     <pending>     80:30553/TCP,9090:32507/TCP   15s
Resolution
Typically, cloud providers supply a load balancer controller that serves Services of the LoadBalancer type. You need to verify whether your cluster has such a controller, as follows:
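For example, on EKS you can check whether the AWS Load Balancer Controller is deployed. This sketch assumes the controller was installed under its default name in the kube-system namespace; names and namespaces differ for other providers and installation methods:

kubectl get deployment aws-load-balancer-controller -n kube-system

You can also check the events of the pending Service for hints about why no external IP was assigned:

kubectl describe svc yw-test-yugaware-ui -n <NAMESPACE>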
A number of other issues can occur while installing and upgrading YugabyteDB Anywhere on Kubernetes.
You might encounter a CORS error while accessing YugabyteDB Anywhere through a load balancer. The condition can manifest itself as the initial setup or sign-in attempts not working or resulting in a blank screen.
To start diagnostics, check the developer tools of your browser for any errors. In addition, check the logs of the YugabyteDB Anywhere pod by executing the following command:
kubectl logs <POD_NAME> -n <NAMESPACE> -c yugaware
If the issue you are experiencing is due to the load balancer access-related CORS error, expect to see an error message similar to the following:
2023-01-09T10:48:08.898Z [warn] 57fe083d-6ebb-49ab-bbaa-5e6576040d62
AbstractCORSPolicy.scala:311
[application-akka.actor.default-dispatcher-10275]
play.filters.cors.CORSFilter Invalid CORS
request;Origin=Some(https://localhost:8080);Method=POST;Access-Control-Request-Headers=None
Resolution
Specify correct domain names during the Helm installation or upgrade, as per instructions provided in Set a DNS name.
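For illustration, the domain is typically passed to the chart at install or upgrade time. The following is a minimal sketch, assuming your chart version accepts the domain via tls.hostname (confirm the exact value in Set a DNS name and in your chart's values.yaml; the domain is a placeholder):

helm upgrade yw-test yugabytedb/yugaware -n <NAMESPACE> \
  --set tls.hostname="<YOUR_DOMAIN>" \
  --reuse-values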
This error manifests itself as an inability to expand the PVC via the helm upgrade command. The error message should look similar to the following:
Error: UPGRADE FAILED: cannot patch "yw-test-yugaware-storage" with kind PersistentVolumeClaim: persistentvolumeclaims "yw-test-yugaware-storage" is forbidden: only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize
To start diagnostics, execute the following command to obtain information about the storage class:
kubectl describe sc <STORAGE_CLASS>
For example:
kubectl describe sc test-sc
The following output shows that the AllowVolumeExpansion parameter of the storage class is set to false:
Name:                  test-sc
IsDefaultClass:        No
Provisioner:           kubernetes.io/gce-pd
Parameters:            type=pd-standard
AllowVolumeExpansion:  False
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
Resolution
Set the AllowVolumeExpansion parameter to true to expand the PVC, as follows:
kubectl get storageclass <STORAGE_CLASS> -o json \
| jq '.allowVolumeExpansion=true | del(.metadata.managedFields, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid)' \
| kubectl replace --force -f -
Use the following command to verify that true is returned:
kubectl get storageclass <STORAGE_CLASS> -o json | jq '.allowVolumeExpansion'
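With expansion allowed, you can increase the requested storage through the chart. The following is a minimal sketch, assuming your chart version exposes the size as yugaware.storage in values.yaml (verify the key and pick a size appropriate for your installation):

helm upgrade yw-test yugabytedb/yugaware -n <NAMESPACE> \
  --set yugaware.storage=200Gi \
  --reuse-values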
Increase the storage size using helm upgrade, as sketched above, and then execute the following command to obtain information about the PVC:
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
Expect to see events for the PVC listed in an output similar to the following:
Normal   ExternalExpanding           95s                volume_expand                           CSI migration enabled for kubernetes.io/gce-pd; waiting for external resizer to expand the pvc
Warning  VolumeResizeFailed          85s                external-resizer pd.csi.storage.gke.io  resize volume "pvc-71315a47-d93a-4751-b48e-c7bfc365ae19" by resizer "pd.csi.storage.gke.io" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal   Resizing                    84s (x2 over 95s)  external-resizer pd.csi.storage.gke.io  External resizer is resizing volume pvc-71315a47-d93a-4751-b48e-c7bfc365ae19
Normal   FileSystemResizeRequired    84s                external-resizer pd.csi.storage.gke.io  Require file system resize of volume on node
Normal   FileSystemResizeSuccessful  44s                kubelet                                 MountVolume.NodeExpandVolume succeeded for volume "pvc-71315a47-d93a-4751-b48e-c7bfc365ae19"