docs/en/administration/Troubleshooting.md
This page is a collection of high-level guides and tips regarding how to diagnose issues encountered in Alluxio.
Note: this doc is not intended to be a complete list of Alluxio questions. Join the Alluxio community Slack channel to chat with users and developers, or post questions on GitHub issues.
Alluxio generates Master, Worker and Client logs under the directory ${ALLUXIO_HOME}/logs. They are
named master.log, master.out, worker.log, worker.out, job_master.log, job_master.out,
job_worker.log, job_worker.out and user/user_${USER}.log. Files
suffixed with .log are generated by log4j; files suffixed with .out are generated by redirecting the
stdout and stderr of the corresponding process.
The master and worker logs are useful for understanding what the Alluxio Master and Workers are doing, especially when running into issues. If you do not understand the error messages, search for them in the GitHub issues, in case the problem has been discussed before. You can also join our Slack channel and seek help there. You can find more details about the Alluxio server logs [here]({{ '/en/administration/Basic-Logging.html#server-logs' | relativize_url }}).
The client-side logs are also helpful when Alluxio service is running but the client cannot connect to the servers. Alluxio client emits logging messages through log4j, so the location of the logs is determined by the client side log4j configuration used by the application. You can find more details about the client-side logs [here]({{ '/en/administration/Basic-Logging.html#application-logs' | relativize_url }}).
The user logs in ${ALLUXIO_HOME}/logs/user/ are the logs from running Alluxio shell.
Each user will have separate log files.
For more information about logging, please check out [this page]({{ '/en/administration/Basic-Logging.html' | relativize_url }}).
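As a first step, it is often enough to scan the server logs for errors. A minimal sketch, assuming the default ${ALLUXIO_HOME}/logs layout described above:

```shell
# Show ERROR lines from the master and worker logs, with file name and line number.
grep -Hn "ERROR" "${ALLUXIO_HOME}/logs/master.log" "${ALLUXIO_HOME}/logs/worker.log"

# Count ERROR lines per log file to see which process is unhealthy.
for f in "${ALLUXIO_HOME}"/logs/*.log; do
  printf '%s: %s\n' "$f" "$(grep -c ERROR "$f")"
done
```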
Java remote debugging makes it easier to debug Alluxio at the source level without modifying any code. You
will need to set the JVM remote debugging parameters before starting the process. There are several ways to add
the remote debugging parameters; you can export the following configuration properties in the shell or in conf/alluxio-env.sh:
```shell
# Java 8
export ALLUXIO_MASTER_ATTACH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=60001"
export ALLUXIO_WORKER_ATTACH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=60002"

# Java 11
export ALLUXIO_MASTER_ATTACH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:60001"
export ALLUXIO_WORKER_ATTACH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:60002"
```
In general, you can use ALLUXIO_<PROCESS>_ATTACH_OPTS to specify the debug options for an Alluxio process.
suspend={y | n} decides whether the JVM process waits until the debugger connects before running.
address determines which port the Alluxio process listens on for a debugger to attach. If left blank, the JVM
chooses an open port by itself.
After completing this setup, learn how to attach.
If you want to debug shell commands (e.g. bin/alluxio fs ls /), you can set the ALLUXIO_USER_ATTACH_OPTS in
conf/alluxio-env.sh as above:
```shell
# Java 8
export ALLUXIO_USER_ATTACH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=60000"

# Java 11
export ALLUXIO_USER_ATTACH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:60000"
```
After setting this parameter, you can add the -debug flag to start a debug server, e.g. bin/alluxio fs -debug ls /.
After completing this setup, learn how to attach.
There is a comprehensive tutorial on how to attach to and debug a Java process in IntelliJ.
Start the process or shell command of interest, then create a new Java remote configuration, set the debug server's host and port, and start the debug session. If a breakpoint is reached, the IDE enters debug mode and you can inspect the current context's variables, call stack, and thread list, and evaluate expressions.
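If you prefer a command-line debugger over an IDE, the same debug port can be attached to with jdb, which ships with the JDK. A sketch, assuming the master was started with the Java 8 options above on port 60001:

```shell
# Attach the JDK's command-line debugger to a running Alluxio master
# started with -agentlib:jdwp=...,address=60001.
jdb -attach localhost:60001

# Inside jdb you can then, for example:
#   stop in <fully.qualified.Class.method>   (set a breakpoint)
#   threads                                  (list threads)
#   where                                    (print the current stack trace)
```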
Alluxio has a collectInfo command that collects information to troubleshoot an Alluxio cluster.
collectInfo will run a set of sub-commands that each collects one aspect of system information, as explained below.
In the end the collected information will be bundled into one tarball which contains a lot of information regarding your Alluxio cluster.
The tarball size mostly depends on your cluster size and how much information you are collecting.
For example, the collectLog operation can be costly if you have a huge amount of logs. Other commands
typically do not generate files larger than 1MB. The information in the tarball will help you troubleshoot your cluster,
or you can share the tarball with someone you trust to help troubleshoot your Alluxio cluster.
The collectInfo command will SSH to each node and execute the set of sub-commands.
At the end of execution, the collected information will be written to files and tarballed.
Each individual tarball will be collected to the issuing node.
Then all the tarballs will be bundled into the final tarball, which contains all information about the Alluxio cluster.
NOTE: Be careful if your configuration contains credentials like AWS keys! You should ALWAYS CHECK what is in the tarball and REMOVE the sensitive information from the tarball before sharing it with someone!
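Inspecting a tarball before sharing it can be done with standard tools. The sketch below builds a tiny sample tarball so the commands can be tried anywhere; in practice, point TARBALL at the file produced by collectInfo, and treat the grep pattern as a starting point, not an exhaustive credential scan.

```shell
# Build a small sample tarball with a fake credential, purely for illustration.
workdir=$(mktemp -d)
mkdir -p "$workdir/cluster/conf"
echo "aws.secretKey=EXAMPLE_SECRET" > "$workdir/cluster/conf/alluxio-env.sh"
tar -czf "$workdir/collectinfo.tar.gz" -C "$workdir" cluster
TARBALL="$workdir/collectinfo.tar.gz"

# 1. List the contents to see what was collected.
tar -tzf "$TARBALL"

# 2. Extract and scan for common credential patterns before sharing.
mkdir -p "$workdir/extracted"
tar -xzf "$TARBALL" -C "$workdir/extracted"
grep -rniE "secret|access.?key|password" "$workdir/extracted" || echo "no obvious credentials found"
```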
collectAlluxioInfo will run a set of Alluxio commands that collect information about the Alluxio cluster, such as bin/alluxio fsadmin report.
When the Alluxio cluster is not running, this command will fail to collect some information.
This sub-command will run both alluxio getConf, which collects local configuration properties,
and alluxio getConf --master --source, which prints configuration properties received from the master.
Both of them mask credential properties. The difference is that the latter command fails if the Alluxio cluster is not up.
collectConfig will collect all the configuration files under ${alluxio.work.dir}/conf.
From Alluxio 2.4, the alluxio-site.properties file will not be copied,
as many users tend to put their plaintext credentials to the UFS in this file.
Instead, the collectAlluxioInfo sub-command will run an alluxio getConf command,
which prints all the configuration properties with the credential fields masked.
The [getConf command]({{ '/en/operation/User-CLI.html#getconf' | relativize_url }}) will collect the current node's configuration.
So in order to collect the Alluxio configuration in the tarball,
please make sure the collectAlluxioInfo sub-command is run.
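The two getConf forms can also be run by hand to see exactly what would be collected (these require a deployed Alluxio installation; the property name in the last line is a standard Alluxio key):

```shell
# Local configuration properties, with credentials masked.
bin/alluxio getConf

# Cluster configuration as served by the master, with the source of each value;
# this form fails if the cluster is not up.
bin/alluxio getConf --master --source

# Query a single property.
bin/alluxio getConf alluxio.master.hostname
```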
WARNING: If you put credential fields in configuration files other than alluxio-site.properties (e.g.
alluxio-env.sh), DO NOT share the collected tarball with anybody unless you have manually obfuscated them in the tarball!
collectLog will collect all the logs under ${alluxio.work.dir}/logs.
NOTE: Roughly estimate how much log data you are collecting before executing this command!
collectMetrics will collect Alluxio metrics served at http://${alluxio.master.hostname}:${alluxio.master.web.port}/metrics/json/ by default.
The metrics will be collected multiple times to see the progress.
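The same metrics endpoint can be fetched manually for spot checks between full collectInfo runs. A sketch, assuming default hostname/port settings (19999 is the default alluxio.master.web.port):

```shell
# Fetch the master's metrics snapshot as JSON.
curl -s "http://localhost:19999/metrics/json/" | head -n 20

# With jq installed, list some metric names from the snapshot.
curl -s "http://localhost:19999/metrics/json/" | jq '.gauges | keys | .[0:10]'
```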
collectJvmInfo will collect information about the existing JVMs on each node.
This is done by running a jps command, then jstack on each JVM process found.
This will be done multiple times to see if the JVMs are making progress.
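The same JVM information can be gathered by hand when you only need one node, using standard JDK tools:

```shell
# List all running JVMs on this node (PID and main class).
jps -l

# Dump the thread stacks of one JVM, e.g. the Alluxio master.
# Take several dumps a few seconds apart to see whether threads are progressing.
pid=$(jps | awk '/AlluxioMaster/ {print $1}')
jstack "$pid" > "master-jstack-$(date +%s).txt"
```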
collectEnv will run a set of bash commands to collect information about the running node.
This runs system troubleshooting commands like env, hostname, top, ps etc.
WARNING: If you stored credential fields in environment variables like AWS_ACCESS_KEY or in process start parameters like
-Daws.access.key=XXX, DO NOT share the collected tarball with anybody unless you have manually obfuscated them in the tarball!
all will run all the sub-commands above.
The collectInfo command has the options below.

```shell
$ bin/alluxio collectInfo
    [--max-threads <threadNum>]
    [--local]
    [--help]
    [--additional-logs <filename-prefixes>]
    [--exclude-logs <filename-prefixes>]
    [--include-logs <filename-prefixes>]
    [--start-time <datetime>]
    [--end-time <datetime>]
    COMMAND <outputPath>
```
<outputPath> is the directory you want the final tarball to be written into.
Options:
--max-threads <threadNum> configures how many threads to use for concurrently collecting information and transmitting tarballs.
When the cluster has a large number of nodes, or large log files, the network IO for transmitting tarballs can be significant.
Use this parameter to constrain the resource usage of this command.
--local specifies that the collectInfo command runs only on localhost,
meaning the command will only collect information about the localhost.
If your cluster does not have password-less SSH across nodes, you will need to run with the --local
option locally on each node in the cluster and manually gather all outputs.
If your cluster has password-less SSH across nodes, you can run without the --local option,
which will essentially distribute the task to each node and gather the locally collected tarballs for you.
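Putting these together (the output path below is illustrative):

```shell
# With password-less SSH: collect everything from all nodes into /tmp/alluxio-info.
bin/alluxio collectInfo all /tmp/alluxio-info

# Without password-less SSH: run on each node individually, then gather
# the per-node tarballs yourself (e.g. with scp).
bin/alluxio collectInfo --local all /tmp/alluxio-info
```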
--help prints the help message and exits.
--additional-logs <filename-prefixes> specifies extra log file name prefixes to include.
By default, only log files recognized by Alluxio will be collected by the collectInfo command.
The recognized files include the following:

```
logs/master.log*
logs/master.out*
logs/job_master.log*
logs/job_master.out*
logs/master_audit.log*
logs/worker.log*
logs/worker.out*
logs/job_worker.log*
logs/job_worker.out*
logs/proxy.log*
logs/proxy.out*
logs/task.log*
logs/task.out*
logs/user/*
```
In addition to the files listed above, --additional-logs <filename-prefixes> specifies that files
whose names start with the prefixes in <filename-prefixes> should also be collected.
This is checked after the exclusions defined in --exclude-logs.
<filename-prefixes> specifies the filename prefixes, separated by commas.
--exclude-logs <filename-prefixes> specifies file name prefixes to ignore from the default list.
--include-logs <filename-prefixes> specifies only to collect files whose names start
with the specified prefixes, and ignore all the rest.
You CANNOT use the --include-logs option together with either --additional-logs or
--exclude-logs, because it is ambiguous what you want to include.
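For example (the gc prefix below is illustrative; the exact sub-command names follow the list above):

```shell
# Collect the default logs plus files starting with "gc", skipping user logs.
bin/alluxio collectInfo --additional-logs gc --exclude-logs user collectLog /tmp/alluxio-info

# Or collect ONLY master logs; --include-logs cannot be combined with the other two.
bin/alluxio collectInfo --include-logs master collectLog /tmp/alluxio-info
```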
--end-time <datetime> specifies a datetime after which the log files can be ignored.
A log file will be ignored if the file was created after this end time.
The first couple of lines of the log file are parsed in order to infer when the log
file started.
The <datetime> is a datetime string like 2020-06-27T11:58:53.
The parsable datetime formats include the following:

```
2020-01-03 12:10:11,874
2020-01-03 12:10:11
2020-01-03 12:10
20/01/03 12:10:11
20/01/03 12:10
2020-01-03T12:10:11.874+0800
2020-01-03T12:10:11
2020-01-03T12:10
```
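For example, to collect only logs that overlap a one-day window (the dates below are illustrative):

```shell
bin/alluxio collectInfo \
  --start-time "2020-01-03 12:00:00" \
  --end-time "2020-01-04 12:00:00" \
  collectLog /tmp/alluxio-info
```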
--start-time <datetime> specifies a datetime before which the log files can be ignored.
A log file will be ignored if its last modified time is before this start time.

There are some special characters and patterns in file path names that are not supported in Alluxio. Please avoid creating file path names with these patterns, or add additional handling on the client end:

- Question mark ('?')
- Period followed by a slash (./ and ../)
- Backslash ('\')

If you are operating your Alluxio cluster, it is possible you may notice a message in the logs like:
```
LEAK: <>.close() was not called before resource is garbage-collected. See https://docs.alluxio.io/os/user/stable/en/administration/Troubleshooting.html#resource-leak-detection for more information about this message.
```
Alluxio has a built-in detection mechanism to help identify potential resource leaks. This message indicates there is a bug in the Alluxio code which is causing a resource leak. If this message appears during cluster operation, please open a GitHub issue as a bug report and share your log message and any relevant stack traces alongside it.
By default, Alluxio samples a portion of some resource allocations when
detecting these leaks, and for each tracked resource record the object's recent
accesses. The sampling rate and access tracking will result in a resource and
performance penalty. The amount of overhead introduced by the leak detector can
be controlled through the property alluxio.leak.detector.level. Valid values are:

- DISABLED: no leak tracking or logging is performed; lowest overhead
- SIMPLE: samples and tracks only leaks and does not log recent accesses; minimal overhead
- ADVANCED: samples and tracks recent accesses; higher overhead
- PARANOID: tracks for leaks on every resource allocation; highest overhead

Alluxio master periodically checks its resource usage, including CPU and memory usage, and several internal data structures
that are performance critical. This interval is configured by alluxio.master.throttle.heartbeat.interval (defaults to 3 seconds).
On every sampling point in time (PIT), Alluxio master takes a snapshot of its resource usage. A continuous number of PIT snapshots
(number configured by alluxio.master.throttle.observed.pit.number, defaults to 3) will be saved and used to generate the aggregated
resource usage which is used to decide the system status.
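These intervals are set in alluxio-site.properties; for example (the values below are illustrative, not recommendations):

```properties
# Take a resource-usage snapshot every 5 seconds (default: 3s).
alluxio.master.throttle.heartbeat.interval=5s
# Aggregate the last 5 PIT snapshots when deciding the system status (default: 3).
alluxio.master.throttle.observed.pit.number=5
```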
Each PIT includes the following metrics:

```
ServerIndicator{directMemUsed=5268082, heapMax=59846950912, heapUsed=53165684872, cpuLoad=0.4453061982287778, pitTotalJVMPauseTimeMS=190107, totalJVMPauseTimeMS=0, rpcQueueSize=0, pitTimeMS=1665995384998}
```
- directMemUsed: direct memory allocated by ByteBuffer.allocateDirect
- heapMax: the allowed max heap size
- heapUsed: the heap memory used
- cpuLoad: the CPU load
- pitTotalJVMPauseTimeMS: aggregated total JVM pause time from the beginning
- totalJVMPauseTimeMS: the JVM pause time since the last PIT
- rpcQueueSize: the RPC queue size
- pitTimeMS: the timestamp in milliseconds when this snapshot is taken

The aggregated server indicators are computed over a certain number of continuous PITs in a sliding window. The Alluxio
master has a derived indicator Master.system.status that is based on a heuristic algorithm.
```json
"Master.system.status" : {
  "value" : "STRESSED"
}
```
The possible statuses are:

- IDLE
- ACTIVE
- STRESSED
- OVERLOADED

The system status is mainly decided by the JVM pause time and the free heap memory. Usually the status transition is

IDLE <---> ACTIVE <---> STRESSED <---> OVERLOADED

If the JVM pause time exceeds alluxio.master.throttle.overloaded.heap.gc.time, the system status is directly set to OVERLOADED. The thresholds are:
```
// JVM paused time
alluxio.master.throttle.overloaded.heap.gc.time
// heap used thresholds
alluxio.master.throttle.active.heap.used.ratio
alluxio.master.throttle.stressed.heap.used.ratio
alluxio.master.throttle.overloaded.heap.used.ratio
```
If the system status is STRESSED or OVERLOADED, a WARN-level log is printed containing the following filesystem indicators:
```
2022-10-17 08:29:41,998 WARN SystemMonitor - System transition status is UNCHANGED, status is STRESSED, related Server aggregate indicators:ServerIndicator{directMemUsed=15804246, heapMax=58686177280, heapUsed=157767176816, cpuLoad=1.335918594686334, pitTotalJVMPauseTimeMS=62455, totalJVMPauseTimeMS=6, rpcQueueSize=0, pitTimeMS=1665989354196}, pit indicators:ServerIndicator{directMemUsed=5268082, heapMax=59846950912, heapUsed=48601091600, cpuLoad=0.4453061982287778, pitTotalJVMPauseTimeMS=190107, totalJVMPauseTimeMS=0, rpcQueueSize=0, pitTimeMS=1665995381998}
2022-10-17 08:29:41,998 WARN SystemMonitor - The delta filesystem indicators FileSystemIndicator{Master.DeletePathOps=0, Master.PathsDeleted=0, Master.MetadataSyncPathsFail=0, Master.CreateFileOps=0, Master.ListingCacheHits=0, Master.MetadataSyncSkipped=3376, Master.UfsStatusCacheSize=0, Master.CreateDirectoryOps=0, Master.FileBlockInfosGot=0, Master.MetadataSyncPrefetchFail=0, Master.FilesCompleted=0, Master.RenamePathOps=0, Master.MetadataSyncSuccess=0, Master.MetadataSyncActivePaths=0, Master.FilesCreated=0, Master.PathsRenamed=0, Master.FilesPersisted=658, Master.CompletedOperationRetryCount=0, Master.ListingCacheEvictions=0, Master.MetadataSyncTimeMs=0, Master.SetAclOps=0, Master.PathsMounted=0, Master.FreeFileOps=0, Master.PathsUnmounted=0, Master.CompleteFileOps=0, Master.NewBlocksGot=0, Master.GetNewBlockOps=0, Master.ListingCacheMisses=0, Master.FileInfosGot=3376, Master.GetFileInfoOps=3376, Master.GetFileBlockInfoOps=0, Master.UnmountOps=0, Master.MetadataSyncPrefetchPaths=0, Master.getConfigHashInProgress=0, Master.MetadataSyncPathsSuccess=0, Master.FilesFreed=0, Master.MetadataSyncNoChange=0, Master.SetAttributeOps=0, Master.getConfigurationInProgress=0, Master.MetadataSyncPendingPaths=0, Master.DirectoriesCreated=0, Master.ListingCacheLoadTimes=0, Master.MetadataSyncPrefetchSuccess=0, Master.MountOps=0, Master.UfsStatusCacheChildrenSize=0, Master.MetadataSyncPrefetchOpsCount=0, Master.registerWorkerStartInProgress=0, Master.MetadataSyncPrefetchCancel=0, Master.MetadataSyncPathsCancel=0, Master.MetadataSyncPrefetchRetries=0, Master.MetadataSyncFail=0, Master.MetadataSyncOpsCount=3376}
```
The monitoring indicators describe the system status heuristically, to give a basic understanding of its load.
A: Check ${ALLUXIO_HOME}/logs to see if there are any master or worker logs. Look for any errors
in these logs. Double check whether you missed any configuration
steps in [Running-Alluxio-Locally]({{ '/en/deploy/Running-Alluxio-Locally.html' | relativize_url }}).
Typical issues:
- ALLUXIO_MASTER_MOUNT_TABLE_ROOT_UFS is not configured correctly.
- ssh localhost fails; make sure the public SSH key for the host is added in ~/.ssh/authorized_keys.

A: Please follow [Running-Alluxio-on-a-Cluster]({{ '/en/deploy/Running-Alluxio-On-a-Cluster.html' | relativize_url }}), [Configuring-Alluxio-with-HDFS]({{ '/en/ufs/HDFS.html' | relativize_url }}), and [Configuring-Spark-with-Alluxio]({{ '/en/compute/Spark.html' | relativize_url }}).
Tips:
A: Alluxio requires Java 8 or 11 runtime to function properly. You can find more details about the system requirements [here]({{ '/en/deploy/Requirements.html' | relativize_url }}).
A: This error message is seen when your applications (e.g., MapReduce, Spark) try to access
Alluxio as an HDFS-compatible file system, but the alluxio:// scheme is not recognized by the
application. Please make sure your HDFS configuration file core-site.xml (in your default hadoop
installation or spark/conf/ if you customize this file for Spark) has the following property:
```xml
<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>
```
See the doc page for your specific compute framework for detailed setup instructions.
A: This error message is seen when your applications (e.g., MapReduce, Spark) try to access
Alluxio as an HDFS-compatible file system and the alluxio:// scheme has been
configured correctly, but the Alluxio client jar is not found on the classpath of your application.
Depending on the computation framework, users usually need to add the Alluxio
client jar to the framework's classpath through environment variables or
properties on all nodes running the framework. Here are some examples:
For MapReduce, add the client jar to $HADOOP_CLASSPATH:

```shell
$ export HADOOP_CLASSPATH={{site.ALLUXIO_CLIENT_JAR_PATH}}:${HADOOP_CLASSPATH}
```
See [MapReduce on Alluxio]({{ '/en/compute/Hadoop-MapReduce.html' | relativize_url }}) for more details.
For Spark, add the client jar to $SPARK_CLASSPATH:

```shell
$ export SPARK_CLASSPATH={{site.ALLUXIO_CLIENT_JAR_PATH}}:${SPARK_CLASSPATH}
```
See [Spark on Alluxio]({{ '/en/compute/Spark.html' | relativize_url }}) for more details.
Alternatively, add the following lines to spark/conf/spark-defaults.conf:

```
spark.driver.extraClassPath {{site.ALLUXIO_CLIENT_JAR_PATH}}
spark.executor.extraClassPath {{site.ALLUXIO_CLIENT_JAR_PATH}}
```
For Presto, put the Alluxio client jar {{site.ALLUXIO_CLIENT_JAR_PATH}} into the directory
${PRESTO_HOME}/plugin/hive-hadoop2/
Since Presto has long-running processes, ensure they are restarted after the jar has been added.
See [Presto on Alluxio]({{ '/en/compute/Presto.html' | relativize_url }}) for more details.
For Hive, set HIVE_AUX_JARS_PATH in conf/hive-env.sh:

```shell
$ export HIVE_AUX_JARS_PATH={{site.ALLUXIO_CLIENT_JAR_PATH}}:${HIVE_AUX_JARS_PATH}
```
Since Hive has long-running processes, ensure they are restarted after the jar has been added.
If the corresponding classpath has been set but exceptions still occur, users can check whether the path is valid by:

```shell
$ ls {{site.ALLUXIO_CLIENT_JAR_PATH}}
```
See [Hive on Alluxio]({{ '/en/compute/Hive.html' | relativize_url }}) for more details.
A: This problem can be caused by several different reasons.

One common cause is a mismatch in alluxio.security.authentication.type.
This error happens if this property is configured with different values across servers and clients
(e.g., one uses the default value NOSASL while the other is customized to SIMPLE).
Please read [Configuration-Settings]({{ '/en/operation/Configuration.html' | relativize_url }}) for how to customize Alluxio clusters and applications.

A: This error indicates insufficient space left on Alluxio workers to complete your write request. This is either because the worker fails to evict enough space or the block size is too large to fit in any of the worker's storage directories.
A: First, check whether you are running Alluxio with the UFS journal or the embedded journal. See the difference [here]({{ '/en/operation/Journal.html#embedded-journal-vs-ufs-journal' | relativize_url }}).
Also verify that the journal you are using is compatible with the current configuration. There are a few scenarios where journal compatibility is not guaranteed, and you need to either [restore from a backup]({{ '/en/operation/Journal.html#restoring-from-a-backup' | relativize_url }}) or [format the journal]({{ '/en/operation/Journal.html#formatting-the-journal' | relativize_url }}):
- The ROCKS and HEAP metastores are not compatible.

If you are using the UFS journal and see errors like "Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try",
it is because the Alluxio master failed to update journal files stored in an HDFS directory according to
the alluxio.master.journal.folder setting. There can be multiple reasons for this type of error, typically because
some HDFS datanodes serving the journal files are under heavy load or running out of disk space. Please ensure the
HDFS deployment is connected and healthy for Alluxio to store journals when the journal directory is set to be in HDFS.
If you do not find the answer above, please post a question on the community Slack channel or GitHub issues.
A: By default, Alluxio loads the list of files the first time a directory is visited,
and keeps using the cached file list regardless of changes in the under file system.
To reveal new files from the under file system, you can use the command
alluxio fs ls -R -Dalluxio.user.file.metadata.sync.interval=${SOME_INTERVAL} /path, or set the same
configuration property in the masters' alluxio-site.properties.
The value of the configuration property determines the minimum interval between two syncs.
You can read more about metadata sync from under file systems
[here]({{ '/en/core-services/Unified-Namespace.html' | relativize_url }}#ufs-metadata-sync).
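For example, the sync interval can be supplied per command or set cluster-wide (the interval values below are illustrative):

```shell
# One-off: force a metadata sync for this listing (interval 0 means always sync).
bin/alluxio fs ls -R -Dalluxio.user.file.metadata.sync.interval=0 /path

# Cluster-wide: in the masters' alluxio-site.properties, sync at most once per minute:
#   alluxio.user.file.metadata.sync.interval=1min
```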
A: When writing files to Alluxio, one of several write types can be used to tell the Alluxio worker how the data should be stored:

- MUST_CACHE: data will be stored in Alluxio only
- CACHE_THROUGH: data will be cached in Alluxio as well as written to the UFS
- THROUGH: data will only be written to the UFS
- ASYNC_THROUGH: data will be stored in Alluxio synchronously and then written to the UFS asynchronously
By default the write type used by the Alluxio client is ASYNC_THROUGH, so a new file written to Alluxio is only stored in Alluxio
worker storage and can be lost if a worker crashes. To make sure data is persisted, either use the CACHE_THROUGH or THROUGH write type,
or increase alluxio.user.file.replication.durable to an acceptable degree of redundancy.
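For example, assuming the shell accepts per-command -D overrides as shown elsewhere on this page, a single file can be persisted at write time without changing the cluster default:

```shell
# Persist this file to the UFS synchronously at write time,
# overriding the default ASYNC_THROUGH for this command only.
bin/alluxio fs copyFromLocal \
  -Dalluxio.user.file.writetype.default=CACHE_THROUGH \
  /local/path/file /alluxio/path/file
```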
Another possible cause for this error is that the block exists in the file system, but no worker has connected to master. In that case the error will go away once at least one worker containing this block is connected.
A: Most Alluxio shell commands require connecting to the Alluxio master to execute. If the command fails to connect to the master, it will
keep retrying several times, appearing to "hang" for a long time. It is also possible that some commands take a long time to
execute, such as persisting a large file on a slow UFS. If you want to know what happens under the hood, check the user log (stored
as ${ALLUXIO_HOME}/logs/user/user_${USER}.log by default) or the master log (stored as ${ALLUXIO_HOME}/logs/master.log on the master
node by default).
If the logs are not sufficient to reveal the problem, you can [enable more verbose logging]({{ '/en/administration/Basic-Logging.html#enabling-advanced-logging' | relativize_url }}).
A: One possible cause is that the RPC request is not recognized by the server side. This typically happens when you run the Alluxio client and master/worker with different versions and the RPCs are incompatible. Please double check that all components are running the same Alluxio version.
If you do not find the answer above, please post a question on the community Slack channel or GitHub issues.
A: Alluxio accelerates your system performance by leveraging temporal or spatial locality using distributed in-memory storage (and tiered storage). If your workloads lack locality, you will not see a noticeable performance boost.
For a comprehensive guide on tuning performance of Alluxio cluster, please check out [this page]({{ '/en/administration/Performance-Tuning.html' | relativize_url }}).
Alluxio can be configured in a variety of modes, in different production environments. Please make sure the Alluxio version being deployed is up-to-date and supported.
It is highly recommended to search whether your question has been answered and your problem already resolved. Past GitHub issues and Slack chat histories are both very good sources.
When posting questions on the GitHub issues or Slack channel, please attach the full environment information, including
alluxio-site.properties and alluxio-env.sh.