.. SPDX-License-Identifier: GPL-2.0
In a regular UNIX filesystem, the inode, not the directory entry, stores
all the metadata pertaining to the file (time stamps, block maps,
extended attributes, etc.). To find the information associated with a
file, one must traverse the directory files to find the directory entry
associated with that file, then load the inode to find the metadata for
that file. ext4 appears to cheat a little (for performance reasons) by
storing a copy of the file type (normally stored in the inode) in the
directory entry. (Compare all this to FAT, which stores all the file
information directly in the directory entry, but does not support hard
links and is in general more seek-happy than ext4 due to its simpler
block allocator and extensive use of linked lists.)

The inode table is a linear array of ``struct ext4_inode``. The table is
sized to have enough blocks to store at least
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
block group containing an inode can be calculated as
``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
is no inode 0.

The inode checksum is calculated against the FS UUID, the inode number, and the inode structure itself.
The inode table entry is laid out in struct ext4_inode.
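
.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1
   :class: longtable

   * - Offset
     - Size
     - Name
     - Description
   * - 0x1C
     - __le32
     - i_blocks_lo
     - Lower 32 bits of the block count. If the huge_file feature flag is
       not set on the filesystem, the file consumes ``i_blocks_lo``
       512-byte blocks on disk. If huge_file is set and
       ``EXT4_HUGE_FILE_FL`` is NOT set in ``inode.i_flags``, then the file
       consumes ``i_blocks_lo + (i_blocks_hi << 32)`` 512-byte blocks on
       disk. If huge_file is set and ``EXT4_HUGE_FILE_FL`` IS set in
       ``inode.i_flags``, then the file consumes
       ``i_blocks_lo + (i_blocks_hi << 32)`` filesystem blocks on disk.

.. _i_mode:
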
The i_mode value is a combination of the following flags:
.. list-table:: :widths: 16 64 :header-rows: 1
.. _i_flags:
The i_flags field is a combination of these values:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Flag
     - Notes
   * - ``EXT4_DIRSYNC_FL``
     - dirsync
   * - ``EXT4_SNAPFILE_FL``
     - (not in mainline)
   * - ``EXT4_SNAPFILE_DELETED_FL``
     - (not in mainline)
   * - ``EXT4_SNAPFILE_SHRUNK_FL``
     - (not in mainline)

.. _i_osd1:

The osd1 field has multiple meanings depending on the creator:
Linux:
.. list-table:: :widths: 8 8 24 40 :header-rows: 1
Hurd:
.. list-table:: :widths: 8 8 24 40 :header-rows: 1
Masix:
.. list-table:: :widths: 8 8 24 40 :header-rows: 1
.. _i_osd2:
The osd2 field has multiple meanings depending on the filesystem creator:
Linux:
.. list-table:: :widths: 8 8 24 40 :header-rows: 1
Hurd:
.. list-table:: :widths: 8 8 24 40 :header-rows: 1
Masix:
.. list-table:: :widths: 8 8 24 40 :header-rows: 1

Inode Size
~~~~~~~~~~

In ext2 and ext3, the inode structure size was fixed at 128 bytes
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
128 bytes. Starting with ext4, it is possible to allocate a larger
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
number of bytes actually used by struct ext4_inode beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
inode, which allows struct ext4_inode to grow for a new kernel without
having to upgrade all of the on-disk inodes. Access to fields beyond
``EXT2_GOOD_OLD_INODE_SIZE`` should be verified to be within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of August 2019) the inode structure is 160 bytes
(``i_extra_isize = 32``). The extra space between the end of the inode
structure and the end of the inode record can be used to store extended
attributes. Each inode record can be as large as the filesystem block
size, though this is not terribly efficient.
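
The bounds check described above can be sketched as follows. This is a
minimal illustration rather than the kernel's code: ``field_fits`` and
``GOOD_OLD_INODE_SIZE`` are names made up for this example, and
``field_end`` is assumed to be the byte offset just past the field,
measured from the start of the inode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size of the original ext2 inode (illustrative name for this sketch). */
#define GOOD_OLD_INODE_SIZE 128u

/*
 * Return true if a field ending at byte offset field_end (exclusive,
 * counted from the start of the inode) lies inside the part of the
 * inode structure that this particular inode actually uses.
 */
static bool field_fits(uint32_t field_end, uint16_t i_extra_isize)
{
    return field_end <= GOOD_OLD_INODE_SIZE + i_extra_isize;
}
```

With ``i_extra_isize = 32``, a field that ends at byte 160 passes the
check, while one that ends at byte 164 must be treated as absent.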

Finding an Inode
~~~~~~~~~~~~~~~~

Each block group contains ``sb->s_inodes_per_group`` inodes. Because
inode 0 is defined not to exist, this formula can be used to find the
block group that an inode lives in:
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
can be found within the block group's inode table at
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
address within the inode table, use
``offset = index * sb->s_inode_size``.
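
The three formulas can be combined into one small helper. The sketch
below is only an illustration: ``struct inode_location`` and
``locate_inode`` are hypothetical names, and the caller is assumed to
pass an inode number of at least 1 and a nonzero inodes-per-group count
taken from the superblock.

```c
#include <stdint.h>

/* Where an inode lives (hypothetical type for this sketch). */
struct inode_location {
    uint32_t group;   /* block group number */
    uint32_t index;   /* slot within that group's inode table */
    uint64_t offset;  /* byte offset within that group's inode table */
};

/*
 * Apply the formulas from the text. inode_num must be >= 1 (there is
 * no inode 0) and inodes_per_group must be nonzero.
 */
static struct inode_location locate_inode(uint32_t inode_num,
                                          uint32_t inodes_per_group,
                                          uint16_t inode_size)
{
    struct inode_location loc;

    loc.group = (inode_num - 1) / inodes_per_group;
    loc.index = (inode_num - 1) % inodes_per_group;
    loc.offset = (uint64_t)loc.index * inode_size;
    return loc;
}
```

For instance, with 8192 inodes per group and 256-byte inode records,
inode 12 is in group 0 at index 11, which is byte offset 2816 in that
group's inode table.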

Inode Timestamps
~~~~~~~~~~~~~~~~

Four timestamps are recorded in the lower 128 bytes of the inode
structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
January 2038. If the filesystem does not have the orphan_file feature, inodes
that are not linked from any directory but are still open (orphan inodes) have
the dtime field overloaded for use with the orphan list. The superblock field
``s_last_orphan`` points to the first inode in the orphan list; dtime is then
the number of the next orphaned inode, or zero if there are no more orphans.
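
The orphan list is therefore a singly linked list threaded through the
dtime fields. A minimal sketch of walking it is shown below, under
loudly stated assumptions: ``walk_orphans`` is a hypothetical helper,
and ``dtime_by_ino`` is an in-memory stand-in (indexed by inode number)
for reading each inode's dtime from disk.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the inode numbers on the orphan list.
 *
 * dtime_by_ino[] is a hypothetical in-memory table, indexed by inode
 * number, holding each inode's i_dtime; a real tool would read the
 * inodes from disk instead. The walk stops at a zero link, at an
 * inode number outside the table, or after max entries (which also
 * bounds a corrupted, cyclic list).
 */
static size_t walk_orphans(uint32_t s_last_orphan,
                           const uint32_t *dtime_by_ino, size_t table_len,
                           uint32_t *out, size_t max)
{
    size_t n = 0;
    uint32_t ino = s_last_orphan;

    while (ino != 0 && ino < table_len && n < max) {
        out[n++] = ino;
        ino = dtime_by_ino[ino];  /* dtime holds the next orphan */
    }
    return n;
}
```
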

If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_extra_isize`` field is large enough to encompass the
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
inode fields are widened to 64 bits. Within this “extra” 32-bit field,
the lower two bits are used to extend the 32-bit seconds field to be 34
bits wide; the upper 30 bits are used to provide nanosecond timestamp
accuracy. Therefore, timestamps should not overflow until May 2446.
dtime was not widened. There is also a fifth timestamp to record inode
creation time (crtime); this field is 64 bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
through the regular stat() interface, though debugfs will report them.

We use the 32-bit signed time value plus (2^32 * (extra epoch bits)).
In other words:

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Extra epoch bits
     - MSB of 32-bit time
     - Adjustment for signed 32-bit to 64-bit tv_sec
     - Decoded 64-bit tv_sec
     - Valid time range
   * - 0 0
     - 1
     - 0
     - ``-0x80000000 - -0x00000001``
     - 1901-12-13 to 1969-12-31
   * - 0 0
     - 0
     - 0
     - ``0x000000000 - 0x07fffffff``
     - 1970-01-01 to 2038-01-19
   * - 0 1
     - 1
     - 0x100000000
     - ``0x080000000 - 0x0ffffffff``
     - 2038-01-19 to 2106-02-07
   * - 0 1
     - 0
     - 0x100000000
     - ``0x100000000 - 0x17fffffff``
     - 2106-02-07 to 2174-02-25
   * - 1 0
     - 1
     - 0x200000000
     - ``0x180000000 - 0x1ffffffff``
     - 2174-02-25 to 2242-03-16
   * - 1 0
     - 0
     - 0x200000000
     - ``0x200000000 - 0x27fffffff``
     - 2242-03-16 to 2310-04-04
   * - 1 1
     - 1
     - 0x300000000
     - ``0x280000000 - 0x2ffffffff``
     - 2310-04-04 to 2378-04-22
   * - 1 1
     - 0
     - 0x300000000
     - ``0x300000000 - 0x37fffffff``
     - 2378-04-22 to 2446-05-10

This is a somewhat odd encoding since there are effectively seven times
as many positive values as negative values. There have also been
long-standing bugs decoding and encoding dates beyond 2038, which don't
seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
incorrectly use the extra epoch bits 1,1 for dates between 1901 and
1970. At some point the kernel will be fixed and e2fsck will fix this
situation, assuming that it is run before 2310.
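
The intended decoding (signed 32-bit seconds plus the two epoch bits
shifted up by 32, with the upper 30 bits of the extra field holding
nanoseconds) can be sketched as below. This is an illustration of the
rule stated in this section, not a copy of the kernel's routine:
``struct decoded_time``, ``decode_time``, ``EPOCH_BITS`` and
``EPOCH_MASK`` are names chosen for the example, and both inputs are
assumed to have already been converted from little-endian to host byte
order.

```c
#include <stdint.h>

#define EPOCH_BITS 2
#define EPOCH_MASK ((1u << EPOCH_BITS) - 1)

/* Decoded timestamp (hypothetical type for this sketch). */
struct decoded_time {
    int64_t sec;    /* seconds since the Unix epoch */
    uint32_t nsec;  /* nanoseconds from the upper 30 bits of extra */
};

/*
 * base:  the 32-bit i_[cma]time value, interpreted as signed
 * extra: the matching i_[cma]time_extra value
 *
 * The uint32_t to int32_t conversion assumes the usual two's
 * complement wraparound.
 */
static struct decoded_time decode_time(uint32_t base, uint32_t extra)
{
    struct decoded_time t;

    t.sec = (int64_t)(int32_t)base +
            ((int64_t)(extra & EPOCH_MASK) << 32);
    t.nsec = extra >> EPOCH_BITS;
    return t;
}
```

For example, a base value of ``0x80000000`` with extra epoch bits
``0 1`` decodes to ``0x080000000``, the start of the 2038-01-19 to
2106-02-07 range in the table above.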