This article introduces implementation details of JuiceFS; use it as a reference if you'd like to contribute. The content below is based on JuiceFS v1.0.0 and metadata version 1.
Before digging into the source code, you should read Data Processing Workflow.
High-level concepts such as file, directory and file attributes, and low-level concepts such as Chunk, Slice and Block (learn more at Data Processing Workflow), are used throughout this article.
Assuming you're already familiar with Go as well as the JuiceFS architecture, this is the overall code structure:

- `cmd` is the top-level entrance; all JuiceFS functionality is rooted here, e.g. the `juicefs format` command resides in `cmd/format.go`
- `pkg` is the actual implementation:
  - `pkg/fuse/fuse.go` provides the abstract FUSE API
  - `pkg/vfs` contains the actual FUSE implementation; metadata requests are handled in `pkg/meta`, read requests in `pkg/vfs/reader.go`, and write requests in `pkg/vfs/writer.go`
  - `pkg/meta` is the implementation of all metadata engines, where:
    - `pkg/meta/interface.go` is the interface definition for all types of metadata engines
    - `pkg/meta/redis.go` is the interface implementation for Redis
    - `pkg/meta/sql.go` is the interface definition and generic implementation for relational databases; each specific database lives in a separate file (for example, MySQL is in `pkg/meta/sql_mysql.go`)
    - `pkg/meta/tkv.go` is the interface definition and generic implementation for KV databases; each specific database lives in a separate file (for example, TiKV is in `pkg/meta/tkv_tikv.go`)
  - `pkg/object` contains all object storage integration code
- `sdk/java` is the Hadoop Java SDK; it uses `sdk/java/libjfs` through JNI

JuiceFS implements a userspace file system based on FUSE (Filesystem in Userspace). The FUSE library libfuse provides two sets of APIs: the high-level API, which is based on file names and paths, and the low-level API, which is based on inodes.
JuiceFS is implemented with the low-level API (in fact JuiceFS does not depend on libfuse, but on go-fuse), because this is the same set of APIs the kernel VFS uses when talking to FUSE. If JuiceFS used the high-level API, it would have to implement the VFS tree inside libfuse and then expose a path-based API. That approach works better for systems that already expose path-based APIs (e.g. HDFS, S3). When the metadata itself implements the file / directory tree based on inodes, the inode → path → inode conversions would hurt performance (this is one reason why the FUSE API for HDFS doesn't perform well). Since JuiceFS metadata directly implements an inode-based file tree and API, it naturally uses the FUSE low-level API.
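The contrast between the two styles can be sketched as a pair of hypothetical Go interfaces (illustrative only; these are not the actual libfuse or go-fuse signatures):

```go
type Ino uint64
type Attr struct{} // see the full Attr definition below

// Path-based (high-level) style: every operation carries a path, so an
// inode-based metadata engine must convert path -> inode on each call.
type pathBasedFS interface {
	GetAttr(path string) (Attr, error)
	Read(path string, off uint64, buf []byte) (int, error)
}

// Inode-based (low-level) style: a name is resolved once via Lookup,
// after which the kernel VFS refers to the file by inode. This matches
// the way JuiceFS metadata is organized.
type inodeBasedFS interface {
	Lookup(parent Ino, name string) (Ino, Attr, error)
	GetAttr(inode Ino) (Attr, error)
	Read(inode Ino, off uint64, buf []byte) (int, error)
}
```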
File systems are usually organized in a tree structure, where nodes represent files and edges represent directory containment relationships. There are more than ten metadata structures in JuiceFS. Most of them are used to maintain the organization of file tree and properties of individual nodes, while the rest are used to manage system configuration, client sessions, asynchronous tasks, etc. All metadata structures are described below.
The file system formatting information is created when the `juicefs format` command is executed, and some of its fields can be modified later with the `juicefs config` command. The structure is specified as follows:
```go
type Format struct {
	Name             string
	UUID             string
	Storage          string
	Bucket           string
	AccessKey        string `json:",omitempty"`
	SecretKey        string `json:",omitempty"`
	SessionToken     string `json:",omitempty"`
	BlockSize        int
	Compression      string `json:",omitempty"`
	Shards           int    `json:",omitempty"`
	HashPrefix       bool   `json:",omitempty"`
	Capacity         uint64 `json:",omitempty"`
	Inodes           uint64 `json:",omitempty"`
	EncryptKey       string `json:",omitempty"`
	KeyEncrypted     bool   `json:",omitempty"`
	TrashDays        int    `json:",omitempty"`
	MetaVersion      int    `json:",omitempty"`
	MinClientVersion string `json:",omitempty"`
	MaxClientVersion string `json:",omitempty"`
	EnableACL        bool
}
```
where Storage is the type of object storage, such as s3, oss, etc. This structure is serialized into JSON format and stored in the metadata engine.
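As a quick illustration of the stored form, here is a minimal sketch using a trimmed-down copy of the struct (standard library only): fields tagged `omitempty` are dropped from the JSON when they hold the zero value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// trimmed-down copy of Format, for illustration only
type Format struct {
	Name        string
	UUID        string
	Storage     string
	BlockSize   int
	Compression string `json:",omitempty"`
}

func main() {
	f := Format{Name: "jfstest", UUID: "<uuid>", Storage: "s3", BlockSize: 4096}
	data, _ := json.MarshalIndent(f, "", "  ")
	fmt.Println(string(data)) // Compression is omitted: it has the zero value
}
```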
Maintains the value of each counter in the system and the start timestamps of some background tasks.
Records the session IDs of clients connected to this file system and their timeouts. Each client sends a heartbeat message to update the timeout, and those who have not updated for a long time will be automatically cleaned up by other clients.
:::tip
Read-only clients cannot write to the metadata engine, so their sessions will not be recorded.
:::
Records specific metadata of the client session so that it can be viewed with the `juicefs status` command. It is specified as:
```go
type SessionInfo struct {
	Version    string // JuiceFS version
	HostName   string // host name
	MountPoint string // path to mount point; S3 gateway and WebDAV server use "s3gateway" and "webdav" respectively
	ProcessID  int    // process ID
}
```
This structure is serialized into JSON format and stored in the metadata engine.
Records attribute information of each file, as follows
```go
type Attr struct {
	Flags      uint8  // reserved flags
	Typ        uint8  // type of a node
	Mode       uint16 // permission mode
	Uid        uint32 // owner id
	Gid        uint32 // group id of owner
	Rdev       uint32 // device number
	Atime      int64  // last access time
	Mtime      int64  // last modified time
	Ctime      int64  // last change time for meta
	Atimensec  uint32 // nanosecond part of atime
	Mtimensec  uint32 // nanosecond part of mtime
	Ctimensec  uint32 // nanosecond part of ctime
	Nlink      uint32 // number of links (sub-directories or hardlinks)
	Length     uint64 // length of regular file
	Parent     Ino    // inode of parent; 0 means tracked by parentKey (for hardlinks)
	Full       bool   // the attributes are completed or not
	KeepCache  bool   // whether to keep the cached page or not
	AccessACL  uint32 // access ACL id (identical ACL rules share the same access ACL ID)
	DefaultACL uint32 // default ACL id (default ACL and the access ACL share the same cache and store)
}
```
There are a few fields that need clarification; in particular, how Atime is maintained depends on the client's --atime-mode option. This structure is usually encoded in binary format and stored in the metadata engine.
Records information on each edge in the file tree, as follows
```
parentInode, name -> type, inode
```
where parentInode is the inode number of the parent directory, and the others are the name, type, and inode number of the child files, respectively.
Records the parent directory of some files. The parent directory of most files is recorded in the Parent field of the attribute; however, for files that have been created with hard links, there may be more than one parent directory, so the Parent field is set to 0, and all parent inodes are recorded independently, as follows
```
inode -> parentInode, links
```
where links is the count of the parentInode, because multiple hard links can be created in the same directory, and these hard links share one inode.
Records information on each Chunk, as follows
```
inode, index -> []Slices
```
where inode is the inode number of the file to which the Chunk belongs, and index is the position of the Chunk within the file, starting from 0. The value of a Chunk is an array of Slices. Each Slice represents a piece of data written by the client and is appended to this array in order of write time. When different Slices overlap, the later one prevails.
```go
type Slice struct {
	Pos  uint32 // offset of the Slice in the Chunk
	ID   uint64 // ID of the Slice, globally unique
	Size uint32 // size of the Slice
	Off  uint32 // offset of valid data in this Slice
	Len  uint32 // size of valid data in this Slice
}
```
This structure is encoded and saved in binary format, taking up 24 bytes.
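The 24 bytes are simply the five fields at fixed width (4 + 8 + 4 + 4 + 4). A sketch of such an encoding, assuming big-endian order and this field order for illustration (the real layout lives in pkg/meta):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type Slice struct {
	Pos  uint32
	ID   uint64
	Size uint32
	Off  uint32
	Len  uint32
}

// marshalSlice packs a Slice into exactly 24 bytes:
// Pos(4) + ID(8) + Size(4) + Off(4) + Len(4)
func marshalSlice(s Slice) []byte {
	b := make([]byte, 0, 24)
	b = binary.BigEndian.AppendUint32(b, s.Pos)
	b = binary.BigEndian.AppendUint64(b, s.ID)
	b = binary.BigEndian.AppendUint32(b, s.Size)
	b = binary.BigEndian.AppendUint32(b, s.Off)
	b = binary.BigEndian.AppendUint32(b, s.Len)
	return b
}

func main() {
	fmt.Printf("%x\n", marshalSlice(Slice{ID: 10, Size: 4 << 20, Len: 4 << 20}))
}
```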
Records the reference count of a Slice, as follows
```
sliceId, size -> refs
```
Since the reference count of most Slices is 1, to reduce the number of related entries in the database, the actual value minus 1 is used as the stored count value in Redis and TKV. In this way, most of the Slices have a refs value of 0, and there is no need to create related entries in the database.
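For example, with the Redis engine the counters live in the sliceRef hash and can be adjusted with HINCRBY; a minimal sketch using go-redis (not the actual pkg/meta implementation; the field name follows the k${sliceId}_${size} format listed later):

```go
package meta

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// decRef decrements the stored count (refs - 1) for one slice. When the
// result drops below 0, no references remain and the slice's blocks can
// be cleaned up from the object storage.
func decRef(ctx context.Context, rdb *redis.Client, sliceId uint64, size uint32) (bool, error) {
	field := fmt.Sprintf("k%d_%d", sliceId, size)
	v, err := rdb.HIncrBy(ctx, "sliceRef", field, -1).Result()
	return err == nil && v < 0, err
}
```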
Records the target of each symbolic link (softlink), as follows
```
inode -> target
```
Records extended attributes (Key-Value pairs) of a file, as follows
```
inode, key -> value
```
Records BSD locks (flock) of a file, as follows
```
inode, sid, owner -> ltype
```
where sid is the client session ID, owner is a string of numbers, usually associated with a process, and ltype is the lock type, which can be 'R' or 'W'.
Records POSIX record locks (fcntl) of a file, as follows
```
inode, sid, owner -> []plockRecord
```
Here plock is a more fine-grained lock that can lock just a specific segment of the file.
```go
type plockRecord struct {
	ltype uint32 // lock type
	pid   uint32 // process ID
	start uint64 // start position of the lock
	end   uint64 // end position of the lock
}
```
This structure is encoded and stored in binary format, taking up 24 bytes.
Records the list of files to be cleaned. It is needed as data cleanup of files is an asynchronous and potentially time-consuming operation that can be interrupted by other factors.
```
inode, length -> expire
```
where length is the length of the file and expire is the time when the file was deleted.
Records delayed deleted Slices. When the Trash feature is enabled, old Slices deleted by the Slice Compaction will be kept for the same amount of time as the Trash configuration, to be available for data recovery if necessary.
```
sliceId, deleted -> []slice
```
where sliceId is the ID of the new slice after compaction, deleted is the timestamp of the compaction, and the mapped value is the list of all old slices that were compacted. Each slice only encodes its ID and size.
```go
type slice struct {
	ID   uint64
	Size uint32
}
```
This structure is encoded and stored in binary format, taking up 12 bytes.
Records the list of files that need to be kept temporarily during the session. If a file is still open when it is deleted, the data cannot be cleaned up immediately, but needs to be held temporarily until the file is closed.
```
sid -> []inode
```
where sid is the session ID and the mapped value is the list of temporarily undeleted file inodes.
The common format of keys in Redis is `${prefix}${JFSKey}`, where `${JFSKey}` is the structure-specific key listed below.
In Redis keys, integers (including inode numbers) are represented as decimal strings unless otherwise specified.
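As an illustration, a Chunk key could be composed like this (hypothetical helper, with `${prefix}` as described above):

```go
package main

import "fmt"

// chunkKey renders the Chunk key c${inode}_${index};
// integers are plain decimal strings.
func chunkKey(prefix string, inode uint64, indx uint32) string {
	return fmt.Sprintf("%sc%d_%d", prefix, inode, indx)
}

func main() {
	fmt.Println(chunkKey("", 100, 2)) // c100_2
}
```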
- Setting: `setting`
- Session: `allSessions`
- SessionInfo: `sessionInfos`
- Node: `i${inode}`
- Edge: `d${inode}`
- Parent: `p${inode}`
- Chunk: `c${inode}_${index}`
- SliceRef: `sliceRef`, field `k${sliceId}_${size}`
- Symlink: `s${inode}`
- Xattr: `x${inode}`
- Flock: `lockf${inode}`, field `${sid}_${owner}` with owner in hexadecimal
- Plock: `lockp${inode}`, field `${sid}_${owner}` with owner in hexadecimal
- DelFiles: `delfiles`, member `${inode}:${length}`
- DelSlices: `delSlices`, member `${sliceId}_${deleted}`
- Sustained: `session${sid}`

In a relational database, metadata is stored in different tables by type, and each table is named `jfs_` followed by its specific structure name, e.g. `jfs_node`. Some tables use an `Id` column of type `bigserial` as the primary key only to ensure that every table has one; these `Id` columns carry no actual information.
```go
type setting struct {
	Name  string `xorm:"pk"`
	Value string `xorm:"varchar(4096) notnull"`
}
```
There is only one entry in this table with "format" as Name and file system formatting information in JSON as Value.
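A sketch of reading it back (assuming an initialized xorm engine whose table-name mapper adds the `jfs_` prefix; the real logic lives in pkg/meta/sql.go):

```go
package meta

import (
	"encoding/json"
	"errors"

	"xorm.io/xorm"
)

// loadFormat fetches the single "format" row and decodes its JSON value
// into the Format structure described earlier.
func loadFormat(engine *xorm.Engine) (*Format, error) {
	s := setting{Name: "format"}
	has, err := engine.Get(&s) // SELECT ... FROM jfs_setting WHERE name = 'format'
	if err != nil {
		return nil, err
	} else if !has {
		return nil, errors.New("database is not formatted")
	}
	var f Format
	if err = json.Unmarshal([]byte(s.Value), &f); err != nil {
		return nil, err
	}
	return &f, nil
}
```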
```go
type counter struct {
	Name  string `xorm:"pk"`
	Value int64  `xorm:"notnull"`
}
```
```go
type session2 struct {
	Sid    uint64 `xorm:"pk"`
	Expire int64  `xorm:"notnull"`
	Info   []byte `xorm:"blob"`
}
```
There is no separate table for this, but it is recorded in the Info column of session2.
```go
type node struct {
	Inode        Ino    `xorm:"pk"`
	Type         uint8  `xorm:"notnull"`
	Flags        uint8  `xorm:"notnull"`
	Mode         uint16 `xorm:"notnull"`
	Uid          uint32 `xorm:"notnull"`
	Gid          uint32 `xorm:"notnull"`
	Atime        int64  `xorm:"notnull"`
	Mtime        int64  `xorm:"notnull"`
	Ctime        int64  `xorm:"notnull"`
	Nlink        uint32 `xorm:"notnull"`
	Length       uint64 `xorm:"notnull"`
	Rdev         uint32
	Parent       Ino
	AccessACLId  uint32 `xorm:"'access_acl_id'"`
	DefaultACLId uint32 `xorm:"'default_acl_id'"`
}
```
Most of the fields are the same as Attr, but the timestamp precision is lower, i.e., Atime/Mtime/Ctime are in microseconds.
```go
type edge struct {
	Id     int64  `xorm:"pk bigserial"`
	Parent Ino    `xorm:"unique(edge) notnull"`
	Name   []byte `xorm:"unique(edge) varbinary(255) notnull"`
	Inode  Ino    `xorm:"index notnull"`
	Type   uint8  `xorm:"notnull"`
}
```
There is no separate table for the Parent structure; all parents of an inode are found through the Inode index on the edge table.
```go
type chunk struct {
	Id     int64  `xorm:"pk bigserial"`
	Inode  Ino    `xorm:"unique(chunk) notnull"`
	Indx   uint32 `xorm:"unique(chunk) notnull"`
	Slices []byte `xorm:"blob notnull"`
}
```
Slices is an array of bytes, and each Slice corresponds to 24 bytes.
```go
type sliceRef struct {
	Id   uint64 `xorm:"pk chunkid"`
	Size uint32 `xorm:"notnull"`
	Refs int    `xorm:"notnull"`
}
```
```go
type symlink struct {
	Inode  Ino    `xorm:"pk"`
	Target []byte `xorm:"varbinary(4096) notnull"`
}
```
```go
type xattr struct {
	Id    int64  `xorm:"pk bigserial"`
	Inode Ino    `xorm:"unique(name) notnull"`
	Name  string `xorm:"unique(name) notnull"`
	Value []byte `xorm:"blob notnull"`
}
```
```go
type flock struct {
	Id    int64  `xorm:"pk bigserial"`
	Inode Ino    `xorm:"notnull unique(flock)"`
	Sid   uint64 `xorm:"notnull unique(flock)"`
	Owner int64  `xorm:"notnull unique(flock)"`
	Ltype byte   `xorm:"notnull"`
}
```
```go
type plock struct {
	Id      int64  `xorm:"pk bigserial"`
	Inode   Ino    `xorm:"notnull unique(plock)"`
	Sid     uint64 `xorm:"notnull unique(plock)"`
	Owner   int64  `xorm:"notnull unique(plock)"`
	Records []byte `xorm:"blob notnull"`
}
```
Records is an array of bytes, and each plockRecord corresponds to 24 bytes.
```go
type delfile struct {
	Inode  Ino    `xorm:"pk notnull"`
	Length uint64 `xorm:"notnull"`
	Expire int64  `xorm:"notnull"`
}
```
```go
type delslices struct {
	Id      uint64 `xorm:"pk chunkid"`
	Deleted int64  `xorm:"notnull"`
	Slices  []byte `xorm:"blob notnull"`
}
```
Slices is an array of bytes, and each slice corresponds to 12 bytes.
```go
type sustained struct {
	Id    int64  `xorm:"pk bigserial"`
	Sid   uint64 `xorm:"unique(sustained) notnull"`
	Inode Ino    `xorm:"unique(sustained) notnull"`
}
```
The common format of keys in TKV (Transactional Key-Value database) is `${prefix}${JFSKey}`, where the prefix is `${VolumeName}` followed by the special byte 0xFD, which avoids ambiguity when one file system name is a prefix of another. For databases that cannot be shared (e.g. BadgerDB), the empty string is used as the prefix. In TKV keys, all integers are stored in encoded binary form.
```
setting -> file system formatting information in JSON format
C${name} -> counter value
SE${sid} -> timestamp
SI${sid} -> session information in JSON format
A${inode}I -> encoded Attr
A${inode}D${name} -> encoded {type, inode}
A${inode}P${parentInode} -> counter value
A${inode}C${index} -> Slices
```
where index in the Chunk key takes up 4 bytes and is encoded in big endian, and Slices is an array of bytes, one Slice per 24 bytes.
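A sketch of how such a binary key could be assembled (the actual encoding lives in pkg/meta/tkv.go; encoding the inode as 8 big-endian bytes here is an assumption for illustration):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// chunkKey assembles the Chunk key A${inode}C${index} with the prefix
// ${VolumeName} followed by the special byte 0xFD.
func chunkKey(volume string, inode uint64, indx uint32) []byte {
	key := append([]byte(volume), 0xFD, 'A')
	key = binary.BigEndian.AppendUint64(key, inode)
	key = append(key, 'C')
	return binary.BigEndian.AppendUint32(key, indx)
}

func main() {
	fmt.Printf("%q\n", chunkKey("jfstest", 100, 0))
}
```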
```
K${sliceId}${size} -> counter value
```
where size takes up 4 bytes and is encoded with big endian.
```
A${inode}S -> target
A${inode}X${name} -> xattr value
```
```
F${inode} -> flocks
```
where flocks is an array of bytes, one flock per 17 bytes.
```go
type flock struct {
	sid   uint64
	owner uint64
	ltype uint8
}
```
```
P${inode} -> plocks
```
where plocks is an array of bytes and the corresponding plock is variable-length.
```go
type plock struct {
	sid     uint64
	owner   uint64
	size    uint32
	records []byte
}
```
where size is the length of the records array and every 24 bytes in records corresponds to one plockRecord.
```
D${inode}${length} -> timestamp
```
where length takes up 8 bytes and is encoded with big endian.
```
L${timestamp}${sliceId} -> slices
```
where slices is an array of bytes, and one slice corresponds to 12 bytes.
```
SS${sid}${inode} -> 1
```
Here the value serves only as a placeholder.
According to the design of Edge, only the direct children of each directory are recorded in the metadata engine. When an application provides a path to access a file, JuiceFS needs to look it up level by level. Suppose the application wants to open the file /dir1/dir2/testfile; then it needs to:

1. look up `dir1` in the Edge structure of the root directory (inode 1) to get the inode of `dir1`
2. look up `dir2` in the Edge structure of `dir1` to get the inode of `dir2`
3. look up `testfile` in the Edge structure of `dir2` to get the inode of the file, which is then used to read its attributes
Failure in any of the above steps will result in the file pointed to by that path not being found.
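This walk can be sketched as follows (the lookup helper is a hypothetical stand-in for the metadata engine call):

```go
package main

import (
	"fmt"
	"strings"
)

type Ino uint64

const RootInode Ino = 1 // the inode of "/" is fixed

// lookup resolves one name under a parent directory through the Edge
// structure (parentInode, name -> type, inode); hypothetical signature.
func lookup(parent Ino, name string) (Ino, error) {
	return 0, fmt.Errorf("not implemented")
}

// resolve walks a path level by level, the same way the kernel issues
// one FUSE Lookup request per path component.
func resolve(path string) (Ino, error) {
	inode := RootInode
	for _, name := range strings.Split(strings.Trim(path, "/"), "/") {
		var err error
		if inode, err = lookup(inode, name); err != nil {
			return 0, fmt.Errorf("lookup %q: %w", name, err) // path not found
		}
	}
	return inode, nil
}
```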
From the previous section we know how to find a file based on its path and get its attributes. The metadata related to the contents of the file can then be found based on the inode and size fields in its attributes. Now suppose a file has inode 100 and a size of 160 MiB; then the file has (size-1) / 64 MiB + 1 = 3 Chunks, as follows (each '_' denotes 4 MiB):
```
File:  |_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _ _ _ _|
Chunk: |<----------- Chunk 0 --------->|<----------- Chunk 1 --------->|<-- Chunk 2 -->|
```
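In code, mapping a file offset to a chunk is simple division; a minimal sketch:

```go
package main

import "fmt"

const ChunkSize = 64 << 20 // 64 MiB

// chunkOf maps a file offset to the chunk index it falls in and the
// position inside that chunk; a file of size bytes therefore has
// (size-1)/ChunkSize + 1 chunks.
func chunkOf(off uint64) (indx uint64, pos uint32) {
	return off / ChunkSize, uint32(off % ChunkSize)
}

func main() {
	fmt.Println(chunkOf(100 << 20)) // offset 100 MiB -> chunk 1, position 36 MiB
}
```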
In standalone Redis, this means there are 3 Chunk keys, i.e. c100_0, c100_1 and c100_2, each corresponding to a list of Slices. These Slices are generated as data is written and may overlap each other or leave parts of the Chunk uncovered, so the list must be traversed in write order to reconstruct the latest version of the data distribution before use: wherever Slices overlap, the most recently written one wins, and any range not covered by a Slice is treated as a file hole that reads as zeros.
Now suppose there are 3 Slices in Chunk 0:

```
Slice{pos: 10M, id: 10, size: 30M, off: 0, len: 30M}
Slice{pos: 20M, id: 11, size: 16M, off: 0, len: 16M}
Slice{pos: 16M, id: 12, size: 10M, off: 0, len: 10M}
```
It can be illustrated as follows (each '_' denotes 2 MiB):

```
Chunk:    |_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _|
Slice 10:           |_ _ _ _ _ _ _ _ _ _ _ _ _ _ _|
Slice 11:                     |_ _ _ _ _ _ _ _|
Slice 12:                 |_ _ _ _ _|
New List: |_ _ _ _ _|_ _ _|_ _ _ _ _|_ _ _ _ _|_ _|_ _ _ _ _ _ _ _ _ _ _ _|
               0      10      12        11     10            0
```
The reconstructed new list contains exactly the latest data distribution of this Chunk:
```
Slice{pos: 0,   id: 0,  size: 10M, off: 0,   len: 10M}
Slice{pos: 10M, id: 10, size: 30M, off: 0,   len: 6M}
Slice{pos: 16M, id: 12, size: 10M, off: 0,   len: 10M}
Slice{pos: 26M, id: 11, size: 16M, off: 6M,  len: 10M}
Slice{pos: 36M, id: 10, size: 30M, off: 26M, len: 4M}
Slice{pos: 40M, id: 0,  size: 24M, off: 0,   len: 24M} // can be omitted
```
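This reconstruction can be sketched as: cut the Chunk at every Slice boundary, let the newest Slice win in each segment, and merge contiguous segments. A simplified sketch (not the actual pkg/meta algorithm), which reproduces the list above for the example Slices:

```go
package main

import (
	"fmt"
	"sort"
)

type Slice struct {
	Pos  uint32
	ID   uint64
	Size uint32
	Off  uint32
	Len  uint32
}

const chunkSize = 64 << 20 // 64 MiB

// buildChunk reconstructs the latest data distribution of one Chunk from
// its Slice list (ordered oldest to newest). ID 0 denotes a hole that
// reads as zeros.
func buildChunk(slices []Slice) []Slice {
	// every Slice boundary starts a new segment of the final view
	pts := []uint32{0, chunkSize}
	for _, s := range slices {
		pts = append(pts, s.Pos, s.Pos+s.Len)
	}
	sort.Slice(pts, func(i, j int) bool { return pts[i] < pts[j] })

	var out []Slice
	for i := 0; i+1 < len(pts); i++ {
		begin, end := pts[i], pts[i+1]
		if begin == end {
			continue // duplicate boundary
		}
		seg := Slice{Pos: begin, Len: end - begin} // default: hole
		for j := len(slices) - 1; j >= 0; j-- {    // the newest Slice wins
			if s := slices[j]; s.Pos <= begin && end <= s.Pos+s.Len {
				seg.ID, seg.Size, seg.Off = s.ID, s.Size, s.Off+begin-s.Pos
				break
			}
		}
		// merge with the previous segment if contiguous in the same Slice
		if n := len(out); n > 0 && out[n-1].ID == seg.ID &&
			out[n-1].Size == seg.Size &&
			out[n-1].Pos+out[n-1].Len == seg.Pos &&
			out[n-1].Off+out[n-1].Len == seg.Off {
			out[n-1].Len += seg.Len
			continue
		}
		out = append(out, seg)
	}
	return out
}

func main() {
	for _, s := range buildChunk([]Slice{
		{Pos: 10 << 20, ID: 10, Size: 30 << 20, Len: 30 << 20},
		{Pos: 20 << 20, ID: 11, Size: 16 << 20, Len: 16 << 20},
		{Pos: 16 << 20, ID: 12, Size: 10 << 20, Len: 10 << 20},
	}) {
		fmt.Printf("pos:%2dM id:%2d off:%2dM len:%2dM\n",
			s.Pos>>20, s.ID, s.Off>>20, s.Len>>20)
	}
}
```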
Block is the basic unit for JuiceFS to manage data. Its size is 4 MiB by default and can only be changed when formatting a file system, within the interval [64 KiB, 16 MiB]. After upload, each Block is an object in the object storage, named in the format `${fsname}/chunks/${hash}/${basename}`, where `${fsname}` is the file system name, `${hash}` is a hash value calculated from the Slice ID, and `${basename}` has the form `${sliceId}_${index}_${size}`, in which index is the sequence number of the Block within its Slice and size is the length of the Block's data. Currently there are two hash algorithms, and both take the sliceId in basename as the parameter; which one is used depends on the HashPrefix setting of the file system.
```go
func hash(sliceId int) string {
	if HashPrefix {
		return fmt.Sprintf("%02X/%d", sliceId%256, sliceId/1000/1000)
	}
	return fmt.Sprintf("%d/%d", sliceId/1000/1000, sliceId/1000)
}
```
Suppose a file system named jfstest has 10 MiB of data written continuously, internally stored in a Slice with ID 1, and HashPrefix is disabled; then the following three objects will be generated in the object storage:
```
jfstest/chunks/0/0/1_0_4194304
jfstest/chunks/0/0/1_1_4194304
jfstest/chunks/0/0/1_2_2097152
```
Similarly, taking the 64 MiB Chunk from the previous section as an example, its actual data distribution is as follows:
```
 0 ~ 10M: Zero
10 ~ 16M: 10_0_4194304, 10_1_4194304 (0 ~ 2M)
16 ~ 26M: 12_0_4194304, 12_1_4194304, 12_2_2097152
26 ~ 36M: 11_1_4194304 (2 ~ 4M), 11_2_4194304, 11_3_4194304
36 ~ 40M: 10_6_4194304 (2 ~ 4M), 10_7_2097152
40 ~ 64M: Zero
```
Based on this, the client can quickly find the data needed by the application. For example, reading 8 MiB at offset 10 MiB will involve 3 objects, as follows:
- `10_0_4194304`, corresponding to 0 ~ 4 MiB of the read data
- `10_1_4194304`, corresponding to 4 ~ 6 MiB of the read data
- `12_0_4194304`, corresponding to 6 ~ 8 MiB of the read data

To facilitate obtaining the list of objects of a certain file, JuiceFS provides the info command, e.g. `juicefs info /mnt/jfs/test.tmp`:
```
objects:
+------------+---------------------------------+----------+---------+----------+
| chunkIndex | objectName                      | size     | offset  | length   |
+------------+---------------------------------+----------+---------+----------+
| 0          |                                 | 10485760 | 0       | 10485760 |
| 0          | jfstest/chunks/0/0/10_0_4194304 | 4194304  | 0       | 4194304  |
| 0          | jfstest/chunks/0/0/10_1_4194304 | 4194304  | 0       | 2097152  |
| 0          | jfstest/chunks/0/0/12_0_4194304 | 4194304  | 0       | 4194304  |
| 0          | jfstest/chunks/0/0/12_1_4194304 | 4194304  | 0       | 4194304  |
| 0          | jfstest/chunks/0/0/12_2_2097152 | 2097152  | 0       | 2097152  |
| 0          | jfstest/chunks/0/0/11_1_4194304 | 4194304  | 2097152 | 2097152  |
| 0          | jfstest/chunks/0/0/11_2_4194304 | 4194304  | 0       | 4194304  |
| 0          | jfstest/chunks/0/0/11_3_4194304 | 4194304  | 0       | 4194304  |
| 0          | jfstest/chunks/0/0/10_6_4194304 | 4194304  | 2097152 | 2097152  |
| 0          | jfstest/chunks/0/0/10_7_2097152 | 2097152  | 0       | 2097152  |
| ...        | ...                             | ...      | ...     | ...      |
+------------+---------------------------------+----------+---------+----------+
```
The empty objectName in the table means a file hole and is read as 0. As you can see, the output is consistent with the previous analysis.
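The mapping from one segment of the reconstructed Slice list to object names and in-object ranges can be sketched as follows (hypothetical helper, assuming the default 4 MiB Block size):

```go
package main

import "fmt"

const blockSize = 4 << 20 // 4 MiB, the default

// segmentObjects prints the object basenames and in-object ranges that
// back one reconstructed segment: the Slice id and total size, plus the
// valid range (off, length) within that Slice.
func segmentObjects(id uint64, size, off, length uint32) {
	for got := uint32(0); got < length; {
		o := off + got
		indx := o / blockSize // which Block of the Slice
		boff := o % blockSize // offset inside that Block
		bsize := uint32(blockSize)
		if indx == (size-1)/blockSize { // the last Block may be shorter
			bsize = size - indx*blockSize
		}
		n := bsize - boff
		if n > length-got {
			n = length - got
		}
		fmt.Printf("%d_%d_%d offset=%d length=%d\n", id, indx, bsize, boff, n)
		got += n
	}
}

func main() {
	// the 36 ~ 40M segment of the example chunk: Slice 10, size 30 MiB,
	// valid range starting at 26 MiB, 4 MiB long
	segmentObjects(10, 30<<20, 26<<20, 4<<20)
	// prints 10_6_4194304 offset=2097152 length=2097152
	//        10_7_2097152 offset=0       length=2097152
}
```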
It is worth mentioning that the size here is the size of the original data in the Block, not that of the actual object in the object storage. By default the original data is written to the object storage as-is, so the size equals the object size. However, when data compression or encryption is enabled, the size of the actual object changes and may no longer match the size shown here.
You can configure the compression algorithm (lz4 and zstd are supported) with the --compress <value> parameter when formatting a file system, so that all data blocks of the file system are compressed before being uploaded to the object storage. The object names remain the same as the default, and the content is the output of the compression algorithm, without any extra meta information. Therefore, the compression algorithm in the file system formatting information must not be modified; otherwise existing data would become unreadable.
Data encryption at rest can be enabled by configuring an RSA private key with the --encrypt-rsa-key <value> parameter when formatting a file system, so that all data blocks of the file system are encrypted before being uploaded to the object storage. The object names remain the same as the default, while the content becomes a header plus the output of the data encryption algorithm. The header contains a random seed and the symmetric key used for decryption, and the symmetric key itself is encrypted with the RSA private key. Therefore, the RSA private key in the file system formatting information must not be modified; otherwise existing data would become unreadable.
:::note
If both compression and encryption are enabled, the original data will be compressed and then encrypted before uploading to the object storage.
:::