
The ext4 file system

A work in progress update

Suparna Bhattacharya
(suparna@in.ibm.com)
Linux Technology Center
India Systems and Technology Lab
IBM

FOSS.in 2006
Credits

Alex Tomas
Andreas Dilger
Theodore Tso
Stephen Tweedie
Mingming Cao
Dave Kleikamp
Badari Pulavarty
Avantika Mathur
Andrew Morton
Laurent Vivier
Alexandre Ratchov
Eric Sandeen
Takashi Sato
And many kernel developers on linux-ext4 and linux-fsdevel
Agenda

Filesystem evolution challenges

Why ext4 ?

Layout changes

Features (existing, upcoming)
− Extents
− 64 bit meta data
− Expanded inode (fine grained timestamps)
− Improved allocation
− Reliability

Getting involved
Filesystem evolution challenges

Dependability
− Stability, simplicity, robustness
− Trusted with user's data

Compatibility
− Backwards, forwards, across-OS

Growing requirements
− More data, More options, Workload types, Storage
trends, Different optimization points

On-disk format lock-in
− Problem of switching formats

Filesystem multiplicity, customizability
Motivation for ext4

Ext3: default filesystem for many users
− Reputation of dependability & compatibility

Scaling up to support large filesystems
− Storage advancements
− Increasing data storage requirements

Features requiring on-disk format change
− nanosec timestamps, fast EA, preallocation

Reliability wrt on-disk corruption
Why fork ext3->ext4?

User-base split, similar to ext2->ext3

Leave existing ext3 users undisturbed, stable

Only large filesystem users move to ext4

ext4 development to proceed unfettered
− Allow experimentation

Downside
− Code duplication, maintaining 2 filesystems

64 bit JBD split

Forward compatibility/upgradeability
Features

Ability to use > 16TB filesystems (>32bit blkno)

Support for extent format
− Reduced meta-data, robust wrt on-disk corruption

Improved file allocation (mballoc, delalloc)

Ability to have > 32K files in a directory

Nanosec timestamps, inode version on disk

Uninitialized groups for faster mkfs, fsck

Persistent file preallocation

Journal checksumming

Online defragmentation
Ext2/3/4 layout

Disk: Block Group 0 | Block Group 1 | ... | Block Group N

Each block group: Superblock | Group descriptors | Block bitmap | Inode bitmap | Inode table | Data blocks
Ext3 layout (contd)

Super Block

Block groups
− Grp desc, Inode bitmap, Block bitmap, Inode table

Inodes
− Block map
− Extended attributes
− Directories [Htree (dir_index)]

Journal

Compatibility
− COMPAT, RO_COMPAT, INCOMPAT feature flags
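
The three feature classes are usually read as a mount policy: unknown COMPAT bits are harmless, unknown RO_COMPAT bits allow a read-only mount only, and unknown INCOMPAT bits mean the filesystem must not be mounted. A minimal sketch of that policy follows; the struct, enum and function names are hypothetical and not taken from the kernel sources.

```c
#include <stdint.h>

/* Hypothetical masks of the feature bits this implementation understands. */
struct supported_features {
    uint32_t compat;
    uint32_t ro_compat;
    uint32_t incompat;
};

enum mount_decision { MOUNT_RW, MOUNT_RO_ONLY, MOUNT_REFUSE };

/* Decide how a filesystem may be mounted, given its on-disk feature words. */
static enum mount_decision check_features(const struct supported_features *sup,
                                          uint32_t compat,
                                          uint32_t ro_compat,
                                          uint32_t incompat)
{
    (void)compat;                       /* unknown COMPAT bits never block a mount */
    if (incompat & ~sup->incompat)
        return MOUNT_REFUSE;            /* unknown INCOMPAT bit: do not mount */
    if (ro_compat & ~sup->ro_compat)
        return MOUNT_RO_ONLY;           /* unknown RO_COMPAT bit: read-only only */
    return MOUNT_RW;
}
```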
Ext2/3 Indirect Block Map

[Figure: the inode's i_data array and the disk blocks it points to]
− i_data[0..11]: direct blocks (disk blocks 200 ... 211)
− i_data[12]: indirect block (disk block 212, listing blocks 213 ... 1236)
− i_data[13]: double indirect block (disk block 1237, pointing to indirect blocks 1238, 1239, ...)
− i_data[14]: triple indirect block (disk block 65530, pointing to double indirect blocks 65531, 65532, 65533, ...)
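
To make the cost of the indirect scheme concrete, the sketch below works out how many levels of indirection are needed to reach a given logical block. It assumes 12 direct slots (as in the figure) and 32-bit block numbers, so one block holds block_size / 4 pointers; the function name is illustrative only.

```c
#include <stdint.h>

#define DIRECT_BLOCKS 12   /* i_data slots 0..11 hold direct block numbers */

/*
 * Returns 0 for a direct block, 1, 2 or 3 for single, double or triple
 * indirect, and -1 if the logical block is beyond what the map can address.
 */
static int map_level(uint64_t lblk, uint32_t block_size)
{
    uint64_t ptrs = block_size / 4;     /* 32-bit block numbers per block */

    if (lblk < DIRECT_BLOCKS)
        return 0;
    lblk -= DIRECT_BLOCKS;
    if (lblk < ptrs)
        return 1;
    lblk -= ptrs;
    if (lblk < ptrs * ptrs)
        return 2;
    lblk -= ptrs * ptrs;
    if (lblk < ptrs * ptrs * ptrs)
        return 3;
    return -1;
}
```

With 4 KB blocks, everything past the first 12 + 1024 blocks (a little over 4 MB) already needs two extra metadata reads per lookup, which is part of what extents are meant to avoid.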
Ext4 extents

Single descriptor maps a range of contiguous blocks (12 bytes)
− Field widths in bits: logical block (32), length (16), physical block (48)

Extent header (12 bytes)
− Field widths in bits: magic (16), #entries (16), max (16), depth (16), generation (32)

B+ tree
− Index node
− Leaf node
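
The two 12-byte records can be sketched as C structs using the field widths above. The field names here are illustrative assumptions, the 48-bit physical block is split into a 16-bit high part and a 32-bit low part, and byte order conversion for the on-disk format is left out.

```c
#include <stdint.h>

/* One extent: maps ee_len contiguous blocks starting at logical block ee_block. */
struct extent {
    uint32_t ee_block;      /* first logical block covered (32 bits) */
    uint16_t ee_len;        /* number of blocks covered (16 bits) */
    uint16_t ee_start_hi;   /* high 16 bits of the 48-bit physical block */
    uint32_t ee_start_lo;   /* low 32 bits of the 48-bit physical block */
};

/* Header found at the start of every node of the extent tree. */
struct extent_header {
    uint16_t eh_magic;      /* identifies the node format */
    uint16_t eh_entries;    /* number of valid entries in this node */
    uint16_t eh_max;        /* capacity of this node in entries */
    uint16_t eh_depth;      /* 0 for a leaf node, greater for index nodes */
    uint32_t eh_generation;
};

_Static_assert(sizeof(struct extent) == 12, "extent must be 12 bytes");
_Static_assert(sizeof(struct extent_header) == 12, "header must be 12 bytes");

/* Reassemble the 48-bit physical start block. */
static uint64_t extent_pblock(const struct extent *e)
{
    return ((uint64_t)e->ee_start_hi << 32) | e->ee_start_lo;
}
```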
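Extent Map

[Figure: i_data holds a header followed by extents; each extent (logical block, length, physical block) points to one contiguous run of disk blocks. The extent (0, 1000, 200) covers disk blocks 200 ... 1199; a second extent starting at logical block 1001 maps a run beginning at disk block 6000.]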
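Extent Tree

[Figure: i_data holds a header and the root index entries; these point to index nodes (each starting with a node header), which in turn point to leaf nodes whose extents map the disk blocks.]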
64 bit meta-data

Super block
__le16 s_desc_size; (replaces a reserved field)
/* 64bit support valid if EXT4_FEATURE_INCOMPAT_64BIT */
/*150*/ __le32 s_blocks_count_hi; /* Blocks count */
__le32 s_r_blocks_count_hi; /* Reserved blocks count */
__le32 s_free_blocks_count_hi; /* Free blocks count */


Group desc (increased size)
__le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
__le32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
__le32 bg_inode_table_hi; /* Inodes table block MSB */
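
Each `_hi` field supplies the upper 32 bits of a value whose lower 32 bits stay in the original field. A minimal sketch of the combination, with hypothetical helper names and byte order conversion omitted:

```c
#include <stdint.h>

/*
 * Combine the original 32-bit field with its _hi companion. The high half
 * is only trusted when the 64BIT incompat feature is set.
 */
static uint64_t blocks_count(uint32_t lo, uint32_t hi, int has_64bit)
{
    return has_64bit ? (((uint64_t)hi << 32) | lo) : lo;
}

/* Filesystem size in bytes for a given block count and block size. */
static uint64_t fs_bytes(uint64_t blocks, uint32_t block_size)
{
    return blocks * block_size;
}
```

With 4 KB blocks, 2^32 blocks is 2^44 bytes, which is the 16 TB ceiling of 32-bit block numbers.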
64 bit meta-data

JBD2
/*
* The block tag: used to describe a single buffer in the journal
*/
typedef struct journal_block_tag_s
{
__be32 t_blocknr; /* The on-disk block number */
__be32 t_flags; /* See below */
/* Valid only if JBD2_FEATURE_INCOMPAT_64BIT */
+ __be32 t_blocknr_high; /* most-significant high 32bits. */
} journal_block_tag_t;
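
Journal fields are big-endian (`__be32`), so a reader has to convert before combining the two halves. The sketch below decodes a block number from the raw 12 bytes of a tag laid out as above; the helper names are hypothetical.

```c
#include <stdint.h>

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t be32_at(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * tag points at t_blocknr (offset 0), t_flags (offset 4) and
 * t_blocknr_high (offset 8). The high word is used only when the
 * journal has the 64BIT incompat feature.
 */
static uint64_t tag_blocknr(const unsigned char *tag, int has_64bit)
{
    uint64_t blk = be32_at(tag);

    if (has_64bit)
        blk |= (uint64_t)be32_at(tag + 8) << 32;
    return blk;
}
```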
Bigger block groups

Using a single block for the block bitmap
limits the maximum number of blocks in a group
− Contiguity broken by block group boundary
− Lot of block groups for a large filesystem

For a 32 bit inode number space, this reduces the
number of inodes per group

BIG_BG
− Bitmap can span multiple blocks
− Allows larger block groups, reduced meta-data
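
The arithmetic behind these limits can be sketched as follows. One bitmap block tracks 8 × block size blocks; the helper names and the bitmap_blocks parameter are assumptions for illustration.

```c
#include <stdint.h>

/* Blocks tracked by a group whose block bitmap spans bitmap_blocks blocks. */
static uint64_t blocks_per_group(uint32_t block_size, uint32_t bitmap_blocks)
{
    return 8ULL * block_size * bitmap_blocks;
}

/* Number of groups needed to cover total_blocks (rounded up). */
static uint64_t group_count(uint64_t total_blocks, uint64_t per_group)
{
    return (total_blocks + per_group - 1) / per_group;
}

/* Inodes available to each group if 2^32 inode numbers are shared evenly. */
static uint64_t max_inodes_per_group(uint64_t groups)
{
    return (1ULL << 32) / groups;
}
```

With 4 KB blocks and a one-block bitmap, a group covers 32768 blocks (128 MB), so a 2^32-block filesystem needs 131072 groups and each can hold at most 32768 inodes. A four-block bitmap cuts the group count to 32768.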
Expanded inode

128 <= 2^N bytes <= block size
− 256 bytes min for ext4 features

Fast EA
− EA in inode
− Helps Samba4 benchmarks

Finer grained (nanosec) timestamps
− ctime, mtime, atime, create time

Inode version field on disk
− Lustre, NFSv4
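
The inode size rule (a power of two, at least 128 bytes, at most one block) can be checked as in this small sketch; the function name is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if inode_size is a power of two with 128 <= inode_size <= block_size. */
static int inode_size_valid(uint32_t inode_size, uint32_t block_size)
{
    return inode_size >= 128 &&
           inode_size <= block_size &&
           (inode_size & (inode_size - 1)) == 0;   /* power-of-two test */
}
```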
Expanded inode
0 [ old inode ]
:
127 [ old inode ]
128 [ extended inode fields ]
: [ [amc]time_extra ]
[ creation time ]
148 [ fast EA space ]
:
255+ [ fast EA space ]
Improved file allocation

Multi-block allocation
− Allocate contiguous blocks together

Reduce fragmentation, Reduce extent meta-data
− Stripe aligned allocations
− Free extents buddy information used in conjunction
with block bitmap

Delayed allocation
− Defer block allocation to writeback time
− Improves chances allocating contiguous blocks,
reducing fragmentation
− Trickier to implement for ordered mode
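Buddy mballoc example

[Figure: block bitmap for disk blocks 0 ... 7 and the free extents buddy info derived from it; 0 = free, 1 = allocated]
order 2^0 (block bitmap): 0 0 0 0 0 0 1 0
order 2^1: 0 0 0 1
order 2^2: 0 1
order 2^3: 1

− Block 0: free at order 2^2
− Block 4: free at order 2^1
− Block 6: allocated
− Total free extent from block 0: 2^2 + 2^1 = 6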
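
The walk in this example can be sketched in a few lines: at each position take the largest aligned power-of-two chunk that is entirely free, add its size, and continue until an allocated block is reached. This is only an illustration; it rescans the block bitmap each time instead of consulting precomputed buddy bitmaps, and all names are hypothetical.

```c
#include <stdint.h>

/* bitmap[i] != 0 means block i is allocated. */
static int range_free(const uint8_t *bitmap, uint32_t start, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
        if (bitmap[start + i])
            return 0;
    return 1;
}

/* Length of the free extent beginning at start, built from buddy-sized chunks. */
static uint32_t free_extent_len(const uint8_t *bitmap, uint32_t nblocks,
                                uint32_t start)
{
    uint32_t pos = start, total = 0;

    while (pos < nblocks && !bitmap[pos]) {
        uint32_t size = 1;

        /* Grow to the largest chunk that is aligned at pos and fully free. */
        while (pos % (size * 2) == 0 &&
               pos + size * 2 <= nblocks &&
               range_free(bitmap, pos, size * 2))
            size *= 2;

        total += size;
        pos += size;
    }
    return total;
}
```

For the bitmap 0 0 0 0 0 0 1 0, starting at block 0 this takes a chunk of 4, then a chunk of 2, then stops at the allocated block 6.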
Persistent file preallocation

Allow applications to preallocate blocks for a file
without having to initialize them
− Contiguous allocation, reduce fragmentation

Irrespective of order in which blocks are written

While avoiding overhead of zeroing blocks
− Guaranteed space allocation
− Useful for Streaming audio/video, Databases

Implemented as uninitialized extents
− MSB of ee_len used to flag such extents
− Read as a zero-filled range, just as with holes
− Writes split this into valid & uninitialised extents
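
Flagging an extent through the MSB of its 16-bit length can be sketched as below. This follows the description on this slide literally; the macro and function names are assumptions, and the exact encoding used by the kernel may differ.

```c
#include <stdint.h>

#define EXT_UNINIT_FLAG 0x8000u   /* MSB of the 16-bit ee_len field */

/* Nonzero if the extent is preallocated but not yet written. */
static int extent_is_uninit(uint16_t ee_len)
{
    return (ee_len & EXT_UNINIT_FLAG) != 0;
}

/* Number of blocks covered, with the flag bit masked off. */
static uint16_t extent_len(uint16_t ee_len)
{
    return (uint16_t)(ee_len & ~EXT_UNINIT_FLAG);
}

/* Encode a length (at most 0x7fff in this sketch) as an uninitialized extent. */
static uint16_t extent_mark_uninit(uint16_t len)
{
    return (uint16_t)(len | EXT_UNINIT_FLAG);
}
```

A reader that sees the flag returns zeros for the range without touching the disk blocks, which is how preallocated space avoids exposing stale data.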
Improving reliability

Background/Related work
− Iron filesystems (ixt) paper (U. Wisconsin)
− Storage trends (Filesystems workshop)

Reliability and seek times not keeping up with capacity
increase rates
− Chunkfs & continuation inodes
− Stanford filesystem checker (eXplode)

Potential enhancements
− Extent, group descriptor and bitmap checksums
− Journal checksumming (prototype: Andreas Dilger)
− Scaling fsck time
Getting involved

Read Documentation/filesystems/ext4.txt

Websites/wikis
− linuxfs.pbwiki.com; ext4 wiki to be set up on kernel.org
− ext2.sf.net, http://fedoraproject.org/wiki/ext3-devel
− www.bullopensource.org/ext4

Join mailing list
− linux-ext4@vger.kernel.org
− linux-fsdevel@vger.kernel.org

IRC channel: irc.oftc.net, /join #linux-fsdevel

Contributions to e2fsprogs for ext4
Legal Statement

This work represents the view of the authors and does not necessarily represent the
view of IBM.

IBM and the IBM logo are trademarks or registered trademarks of International
Business Machines Corporation in the United States and/or other countries.

Lustre is a trademark of Cluster File Systems, Inc.

Unix is a registered trademark of The Open Group in the United States and other
countries.

Linux is a registered trademark of Linus Torvalds in the United States, other
countries, or both.

Other company, product, and service names may be trademarks or service marks of
others.

References in this publication to IBM products or services do not imply that IBM
intends to make them available in all countries in which IBM operates.

This document is provided "AS IS," with no express or implied warranties. Use the
information in this document at your own risk.
And then ... ext5 :)
