Figure 1: FUSE high-level architecture. (Diagram: an application's calls enter the kernel VFS, which routes them to the FUSE kernel driver; the driver queues requests that the user-level daemon, linked with the FUSE library, reads and answers via /dev/fuse. The kernel caches and other kernel subsystems are shown alongside.)

FUSE consists of a kernel part and a user-level daemon. The kernel part is implemented as a Linux kernel module that, when loaded, registers a fuse file-system driver with Linux's VFS. This FUSE driver acts as a proxy for various specific file systems implemented by different user-level daemons.
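To make the daemon side concrete, below is a minimal sketch (our illustration, not code from any specific FUSE file system) of a user-level daemon written against the FUSE library's high-level API: it registers a table of callbacks and lets the library mount the file system and run the request loop.

#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

/* Serve a single empty root directory; every other path does not exist. */
static int hello_getattr(const char *path, struct stat *st,
                         struct fuse_file_info *fi)
{
    (void)fi;
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    st->st_mode = S_IFDIR | 0755;
    st->st_nlink = 2;
    return 0;
}

static const struct fuse_operations hello_ops = {
    .getattr = hello_getattr,   /* unset callbacks use library defaults */
};

int main(int argc, char *argv[])
{
    /* fuse_main() mounts the file system and then loops: read requests
     * from /dev/fuse, dispatch them to the callbacks, write replies. */
    return fuse_main(argc, argv, &hello_ops, NULL);
}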
In addition to registering a new file system, FUSE's kernel module also registers a /dev/fuse character device. This device serves as an interface between user-space FUSE daemons and the kernel. In general, the daemon reads FUSE requests from /dev/fuse, processes them, and then writes replies back to /dev/fuse.

Figure 1 shows FUSE's high-level architecture. When a user application performs some operation on a mounted FUSE file system, the VFS routes the operation to FUSE's kernel driver. The driver allocates a FUSE request structure and puts it in a FUSE queue. At this point, the process that submitted the operation is usually put in a wait state. FUSE's user-level daemon then picks the request from the kernel queue by reading from /dev/fuse and processes it. Processing the request might require re-entering the kernel again: for example, in case of a stackable FUSE file system, the daemon submits operations to the underlying file system (e.g., Ext4); and in case of a block-based FUSE file system, the daemon accesses the underlying block device directly.

Group (size)             Requests
Special (3)              INIT, DESTROY, INTERRUPT
Metadata (14)            LOOKUP, FORGET, BATCH_FORGET, CREATE, UNLINK, LINK, RENAME, RENAME2, OPEN, RELEASE, STATFS, FSYNC, FLUSH, ACCESS
Data (2)                 READ, WRITE
Attributes (2)           GETATTR, SETATTR
Extended Attributes (4)  SETXATTR, GETXATTR, LISTXATTR, REMOVEXATTR
Symlinks (2)             SYMLINK, READLINK
Directory (7)            MKDIR, RMDIR, OPENDIR, RELEASEDIR, READDIR, READDIRPLUS, FSYNCDIR
Locking (3)              GETLK, SETLK, SETLKW
Misc (6)                 BMAP, FALLOCATE, MKNOD, IOCTL, POLL, NOTIFY_REPLY

Table 1: FUSE request types, by group (group sizes are in parentheses). Requests we discuss in the text are in bold.

User-kernel protocol. When FUSE's kernel driver communicates to the user-space daemon, it forms a FUSE request structure. Requests have different types depending on the operation they convey. Table 1 lists all 43 FUSE request types, grouped by their semantics. As seen, most requests have a direct mapping to traditional VFS operations: we omit discussion of obvious requests (e.g., READ, CREATE) and instead focus next on the less intuitive request types (marked bold in Table 1).

The INIT request is produced by the kernel when a file system is mounted. At this point user space and kernel negotiate (1) the protocol version they will operate on (7.23 at the time of this writing), (2) the set of mutually supported capabilities (e.g., READDIRPLUS or FLOCK support), and (3) various parameter settings (e.g., FUSE read-ahead size, time granularity). Conversely, the DESTROY request is sent by the kernel during the file system's unmounting process. When getting a DESTROY, the daemon is expected to perform all necessary cleanups; no more requests will come for this session, and subsequent reads from /dev/fuse will return 0, causing the daemon to exit gracefully.
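The shape of this exchange can be sketched with the request and reply headers that the kernel defines in <linux/fuse.h>. The loop below is our simplified illustration of what the FUSE library does internally; handle_request() is a hypothetical dispatcher, and real code must also handle interrupted reads and requests that take no reply (e.g., FORGET).

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/fuse.h>   /* struct fuse_in_header, struct fuse_out_header */

/* Hypothetical dispatcher: handles one request, fills the reply payload
 * placed right after the reply header, and returns the payload size. */
extern size_t handle_request(const struct fuse_in_header *in,
                             const void *arg, void *reply);

static void daemon_loop(int fd)   /* fd: an open /dev/fuse */
{
    static char in_buf[1 << 20], out_buf[1 << 20];

    for (;;) {
        ssize_t n = read(fd, in_buf, sizeof(in_buf));  /* one request */
        if (n <= 0)
            break;   /* 0 after DESTROY: unmounted, exit gracefully */

        struct fuse_in_header  *in  = (struct fuse_in_header *)in_buf;
        struct fuse_out_header *out = (struct fuse_out_header *)out_buf;

        size_t payload = handle_request(in, in + 1, out + 1);

        out->unique = in->unique;   /* the sequence# lets the kernel
                                       match this reply to its request */
        out->error  = 0;
        out->len    = sizeof(*out) + payload;
        if (write(fd, out_buf, out->len) < 0)
            perror("fuse reply");
    }
}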
The INTERRUPT request is emitted by the kernel if any previously sent requests are no longer needed (e.g., when a user process blocked on a READ is terminated). Each request has a unique sequence# which INTERRUPT uses to identify victim requests. Sequence numbers are assigned by the kernel and are also used to locate completed requests when the user space replies.
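With libfuse's low-level API, a handler for a slow operation can observe interruption without parsing INTERRUPT requests itself; fuse_req_interrupted() is a real libfuse call, while data_ready(), data_buf(), and data_len() below are hypothetical helpers standing in for the file system's own logic. A sketch:

#define FUSE_USE_VERSION 31
#include <fuse_lowlevel.h>
#include <errno.h>
#include <unistd.h>

extern int data_ready(void);           /* hypothetical wait condition */
extern const char *data_buf(void);     /* hypothetical result buffer  */
extern size_t data_len(void);          /* hypothetical result length  */

static void slow_read(fuse_req_t req, fuse_ino_t ino, size_t size,
                      off_t off, struct fuse_file_info *fi)
{
    (void)ino; (void)size; (void)off; (void)fi;

    while (!data_ready()) {
        if (fuse_req_interrupted(req)) { /* an INTERRUPT named our
                                            sequence# as its victim */
            fuse_reply_err(req, EINTR);
            return;
        }
        usleep(1000);                    /* keep polling while we wait */
    }
    fuse_reply_buf(req, data_buf(), data_len());
}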
Every request also contains a node ID: an unsigned 64-bit integer identifying the inode both in kernel and user spaces. The path-to-inode translation is performed by the LOOKUP request. Every time an existing inode is looked up (or a new one is created), the kernel keeps the inode in the inode cache. When removing an inode from the dcache, the kernel passes the FORGET request to the user-space daemon. At this point the daemon might decide to deallocate any corresponding data structures. BATCH_FORGET allows the kernel to forget multiple inodes with a single request.
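In libfuse's low-level API this pairing looks roughly as follows; find_inode(), stat_inode(), and release_inode() are hypothetical helpers over the daemon's private inode table, while fuse_reply_entry(), fuse_reply_none(), and struct fuse_entry_param are the library's actual interfaces.

#define FUSE_USE_VERSION 31
#include <fuse_lowlevel.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>

extern fuse_ino_t find_inode(fuse_ino_t parent, const char *name); /* hypothetical */
extern void stat_inode(fuse_ino_t ino, struct stat *st);           /* hypothetical */
extern void release_inode(fuse_ino_t ino, uint64_t nlookup);       /* hypothetical */

/* LOOKUP: translate (parent node ID, name) into a node ID + attributes. */
static void ll_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
{
    struct fuse_entry_param e;
    memset(&e, 0, sizeof(e));

    e.ino = find_inode(parent, name);   /* 0 if the name does not exist */
    if (e.ino == 0) {
        fuse_reply_err(req, ENOENT);
        return;
    }
    stat_inode(e.ino, &e.attr);
    e.attr_timeout  = 1.0;              /* kernel may cache attrs for 1s */
    e.entry_timeout = 1.0;              /* ditto for the dentry */
    fuse_reply_entry(req, &e);          /* kernel increments its lookup
                                           count for this node ID */
}

/* FORGET: the kernel dropped `nlookup` references to this node ID. */
static void ll_forget(fuse_req_t req, fuse_ino_t ino, uint64_t nlookup)
{
    release_inode(ino, nlookup);        /* free state when count hits 0 */
    fuse_reply_none(req);               /* FORGET carries no reply */
}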
An OPEN request is generated, not surprisingly, when a user application opens a file. When replying to this request, a FUSE daemon has a chance to optionally assign a 64-bit file handle to the opened file. This file handle is then returned by the kernel along with every request associated with the opened file. The user-space daemon can use the handle to store per-opened-file information. E.g., a stackable file system can store the descriptor of the file opened in the underlying file system as part of FUSE's file handle. FLUSH is generated every time an opened file is closed; and RELEASE is sent when there are no more references to a previously opened file.
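For instance, a stackable daemon using the high-level API might implement this as sketched below (our illustration; map_path() is a hypothetical translator from the FUSE path to the underlying file system's path). The fh field of struct fuse_file_info is the 64-bit handle the kernel echoes back:

#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

extern void map_path(const char *path, char out[PATH_MAX]); /* hypothetical */

static int stackfs_open(const char *path, struct fuse_file_info *fi)
{
    char lower[PATH_MAX];
    map_path(path, lower);          /* locate the file in the lower fs */

    int fd = open(lower, fi->flags);
    if (fd < 0)
        return -errno;
    fi->fh = fd;                    /* per-opened-file information */
    return 0;
}

static int stackfs_read(const char *path, char *buf, size_t size,
                        off_t off, struct fuse_file_info *fi)
{
    (void)path;                     /* the stored handle suffices */
    ssize_t n = pread(fi->fh, buf, size, off);
    return n < 0 ? -errno : (int)n;
}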
OPENDIR and RELEASEDIR requests have the same semantics as OPEN and RELEASE, respectively, but for directories. The READDIRPLUS request returns one or more directory entries like READDIR, but it also includes metadata information for each entry. This allows the kernel to pre-fill its inode cache (similar to NFSv3's READDIRPLUS procedure [4]).

When the kernel evaluates if a user process has permissions to access a file, it generates an ACCESS request. By handling this request, the FUSE daemon can implement custom permission logic. However, typically users mount FUSE with the default_permissions option that allows the kernel to grant or deny access to a file based on its standard Unix attributes (ownership and permission bits). In this case no ACCESS requests are generated.
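If custom checks are needed, the high-level API exposes them as an access() callback; the sketch below (our example) denies everyone except one hypothetical permitted UID. With default_permissions, the kernel performs the standard checks itself and this callback is never invoked.

#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <sys/types.h>

extern uid_t permitted_uid(void);   /* hypothetical policy source */

static int myfs_access(const char *path, int mask)
{
    (void)path; (void)mask;
    /* fuse_get_context() identifies the calling process (uid/gid/pid). */
    if (fuse_get_context()->uid != permitted_uid())
        return -EACCES;
    return 0;
}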
Library and API levels. Conceptually, the FUSE library consists of two levels. The lower level takes care of (1) receiving and parsing requests from the kernel, (2) sending properly formatted replies, (3) facilitating file system configuration and mounting, and (4) hiding potential version differences between kernel and user space. This part exports the low-level FUSE API.

The high-level FUSE API builds on top of the low-level API and allows developers to skip the implementation of the path-to-inode mapping. Therefore, neither inodes nor lookup operations exist in the high-level API, easing the code development. Instead, all high-level API methods operate directly on file paths. The high-level API also handles request interrupts and provides other convenient features: e.g., developers can use the more common chown(), chmod(), and truncate() methods, instead of the lower-level setattr(). File system developers must decide which API to use, by balancing flexibility vs. development ease.

Figure 2: The organization of FUSE queues, marked with their head (H) and tail (T). The processing queue does not have a tail because the daemon replies in an arbitrary order. (Diagram: the FUSE kernel driver holds the background, pending, processing, interrupts, and forgets queues between the kernel page/dentry/inode caches and the user daemon; synchronous requests and interrupts feed the pending queue, asynchronous requests the background queue, and replies drain the processing queue.)

Queues. In Section 2.1 we mentioned that FUSE's kernel has a request queue. FUSE actually maintains five queues, as seen in Figure 2: (1) interrupts, (2) forgets, (3) pending, (4) processing, and (5) background. A request belongs to only one queue at any time. FUSE puts INTERRUPT requests in the interrupts queue, FORGET requests in the forgets queue, and synchronous requests (e.g., metadata) in the pending queue. When a file-system daemon reads from /dev/fuse, requests are transferred to the user daemon as follows: (1) Priority is given to requests in the interrupts queue; they are transferred to the user space before any other request. (2) FORGET and non-FORGET requests are selected fairly: for each 8 non-FORGET requests, 16 FORGET requests are transferred. This reduces the burstiness of FORGET requests, while allowing other requests to proceed. The oldest request in the pending queue is transferred to the user space and simultaneously moved to the processing queue. Thus, requests in the processing queue are the ones currently processed by the daemon. If the pending queue is empty, then the FUSE daemon is blocked on the read call. When the daemon replies to a request (by writing to /dev/fuse), the corresponding request is removed from the processing queue.

The background queue is for staging asynchronous requests. In a typical setup, only read requests go to the background queue.
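The selection policy can be summarized with the following simplified model (our sketch in user-space C; the kernel's actual implementation differs in structure and bookkeeping):

#include <stdbool.h>
#include <stddef.h>

/* A simplified model of how one read from /dev/fuse picks the next
 * request; `struct req` stands in for the kernel's request object. */
struct req   { struct req *next; };
struct queue { struct req *head, *tail; };

static bool empty(const struct queue *q) { return q->head == NULL; }

static struct req *dequeue(struct queue *q)
{
    struct req *r = q->head;
    if (r && !(q->head = r->next))
        q->tail = NULL;
    return r;
}

static void push(struct queue *q, struct req *r)
{
    r->next = NULL;
    if (q->tail) q->tail->next = r; else q->head = r;
    q->tail = r;
}

struct conn {
    struct queue interrupts, forgets, pending, processing;
    unsigned round;   /* drives FORGET vs. non-FORGET fairness */
};

static struct req *next_request(struct conn *fc)
{
    /* 1. INTERRUPTs always go to user space first. */
    if (!empty(&fc->interrupts))
        return dequeue(&fc->interrupts);

    /* 2. FORGET fairness: roughly 16 FORGETs per 8 other requests,
     *    bounding FORGET burstiness without starving either side. */
    if (!empty(&fc->forgets) &&
        (empty(&fc->pending) || (fc->round++ % 24) < 16))
        return dequeue(&fc->forgets);

    /* 3. Otherwise, the oldest pending request is handed out and parked
     *    on the processing queue until the daemon writes a reply. */
    struct req *r = dequeue(&fc->pending);
    if (r)
        push(&fc->processing, r);
    return r;   /* NULL: nothing to do, the daemon's read blocks */
}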
Observation 5. […] for HDD. The situation is reversed, e.g., when a mail-server [row #44] workload is used.

Observation 6. At least in one Stackfs configuration, all write workloads (sequential and random) [rows #21–36] are within the Green class for both HDD and SSD.

Observation 7. The performance of sequential reads [rows #1–12] is well within the Green class for both HDD and SSD; however, for the seq-rd-32th-32f workload [rows #5–8] on HDD, results are in the Orange class. Random read workload results [rows #13–20] span all four classes. Furthermore, the performance grows as I/O sizes increase for both HDD and SSD.

Observation 8. In general, Stackfs performs visibly worse for metadata-intensive and macro workloads [rows #37–45] than for data-intensive workloads [rows #1–36]. The performance is especially low for SSDs.

Observation 9. The relative CPU utilization of Stackfs (not shown in the table) is higher than that of Ext4 and varies in the range of +0.13% to +31.2%; similarly, CPU cycles per operation increased by 1.2× to 10× between Ext4 and Stackfs (in both configurations). This behavior is seen on both HDD and SSD.

Observation 10. CPU cycles per operation are higher for StackfsOpt than for StackfsBase for the majority of workloads. But for the workloads seq-wr-32th-32f [rows #25–28] and rnd-wr-1th-1f [rows #30–32], StackfsOpt consumes fewer CPU cycles per operation.

5.2 Analysis

We analyzed the FUSE performance results and present the main findings here, following the order in Table 3.

Figure 3: Different types and number of requests generated by StackfsBase on SSD during the seq-rd-32th-32f workload, from left to right, in their order of generation. (Bar chart; y-axis: number of FUSE requests, log2 scale; x-axis: FUSE request types.)

Figure 3 demonstrates the types of requests that were generated with the seq-rd-32th-32f workload. We use seq-rd-32th-32f as a reference for the figure because this workload generates more requests per operation type than the other workloads. Bars are ordered from left to right by the appearance of requests in the experiment. The same request types, but in different quantities, were generated by the other read-intensive workloads [rows #1–20]. For the single-threaded read workloads, only one request each of the LOOKUP, OPEN, FLUSH, and RELEASE types was generated. The number of READ requests depended on the I/O size and the amount of data read; the INIT request is produced at mount time, so its count remained the same across all workloads; and finally, GETATTR is invoked before unmount for the root directory and was the same for all the workloads.

Figure 3 also shows the breakdown of requests by queues. By default, READ, RELEASE, and INIT are asynchronous; they are added to the background queue first. All other requests are synchronous and are added to the pending queue directly. In read workloads, only READ requests are generated in large numbers; thus, we discuss in detail only READ requests for these workloads.
Sequential Read using 32 threads on 1 file. This workload exhibits similar performance trends to […]

File Creates. […] Ext4 (30–46 thousand creates/sec) because small newly created inodes can be effectively cached in RAM. Thus, overheads caused by FUSE's user-kernel communication […]

Figure 5: Different types of requests that were generated by StackfsBase on SSD for the files-rd-1th workload, from left to right in their order of generation. (Bar chart; y-axis: number of FUSE requests, log2 scale; requests are broken down by queue: background, pending, and processing, plus the user daemon.)

File Reads. Figure 5 shows the different types of requests that got generated during the files-rd-1th workload, which contains many small files (one million 4KB files) that are repeatedly opened and closed. Figure 5 shows that half of the READ requests went to the background queue and the rest directly to the pending queue. The reason is that when reading a whole file, the application requests reads beyond the EOF; for these, FUSE generates a synchronous READ request which goes to the pending queue (not the background queue). Reads past the EOF also generate a GETATTR request to confirm the file's size.
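The extra synchronous READs follow from how applications detect EOF. A typical whole-file read loop (a generic sketch below, not the benchmark's actual code) issues one final read() that can only return 0 once the file system confirms there is no more data; under FUSE that last call becomes a synchronous READ in the pending queue, accompanied by a GETATTR for the file's size.

#include <fcntl.h>
#include <unistd.h>

/* Read a small file to EOF the way most applications do. For a 4KB
 * file, the first read() is an asynchronous FUSE READ (background
 * queue); the second read() probes past EOF and becomes a synchronous
 * READ (pending queue) plus a GETATTR. */
static int read_whole_file(const char *path)
{
    char buf[4096];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;   /* consume data until EOF */

    close(fd);
    return n < 0 ? -1 : 0;
}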
The performance degradation for files-rd-1th in StackfsBase on HDD is negligible; on SSD, however, the relative degradation is high (–25%) because the SSD is 12.5× faster than the HDD (see Ext4's absolute throughput in Table 3). Interestingly, StackfsOpt's performance degradation is larger than that of StackfsBase (by 10% and 35% for HDD and SSD, respectively). The reason is that in StackfsOpt, different FUSE threads process requests for the same file, which requires additional synchronization and context switches. Conversely, but as expected, for the files-rd-32th workload, StackfsOpt performed 40–45% better than StackfsBase because multiple threads are needed to effectively process requests arriving in parallel.

File Deletes. The different types of operations that got generated during the files-del-1th workloads are LOOKUP, UNLINK, and FORGET (exactly 4 million each, i.e., one of each request per deleted file). […] on SSD than on HDD. In all cases StackfsOpt's performance degradation is larger than StackfsBase's because neither splice nor the writeback cache helped the files-del-Nth workloads, and they only added additional overhead for managing extra threads.
5.2.4 Macro Server Workloads

We now discuss the behavior of Stackfs for the macro workloads [rows #43–45].

File Server. Figure 6 shows the different types of operations that got generated during the file-server workload. Macro workloads are expected to have a more diverse request profile than micro workloads, and file-server confirms this: many different requests got generated, with WRITEs being the majority.

Figure 6: Different types of requests that were generated by StackfsBase on SSD for the file-server workload. (Bar chart; y-axis: number of FUSE requests, log2 scale; requests are broken down by queue: background, pending, and processing, plus the user daemon.)

The performance improved by 25–40% (depending on the storage device) with StackfsOpt compared to StackfsBase, and got close to Ext4's native performance, for three reasons: (1) with a writeback cache and 128KB requests, the number of WRITEs decreased by 17× for both HDD and SSD; (2) with splice, READ and WRITE requests took advantage of zero copy; and (3) the user daemon is multi-threaded, as the workload is.

Mail Server. […] different requests got generated, with WRITEs being the majority. Performance trends are also similar between these two workloads. However, in the SSD setup, even the optimized StackfsOpt still did not perform close to Ext4 in this mail-server workload, compared to file-server. The reason is twofold. First, compared to the file server, the mail server has almost double the metadata operations, which increases FUSE overhead. Second, I/O sizes are smaller in mail-server, which improves the underlying Ext4 SSD performance and therefore shifts the bottleneck to FUSE.
Web Server. Figure 8 shows the different types of requests generated during the web-server workload. This workload is highly read-intensive, as expected from a Web server that serves static Web pages. The performance degradation caused by StackfsBase falls into the Red class on both HDD and SSD. The major bottleneck was the FUSE daemon being single-threaded, while the workload itself contained 100 user threads. Performance improved significantly with StackfsOpt on both HDD and SSD, mainly thanks to using multiple threads. In fact, StackfsOpt's performance on HDD is even 6% higher than that of native Ext4. We believe this minor improvement is caused by the Linux VFS treating Stackfs and Ext4 as two independent file systems and allowing them together to cache more data compared to when Ext4 is used alone, without Stackfs. This does not help the SSD setup as much due to the high speed of the SSD.

Figure 8: Different types of requests that were generated by StackfsBase on SSD for the web-server workload. (Bar chart; y-axis: number of FUSE requests, log2 scale; x-axis: FUSE request types.)
6 Related Work

Many researchers used FUSE to implement file systems [3, 9, 15, 40], but little attention was given to understanding FUSE's underlying design and performance. To the best of our knowledge, only two papers studied some aspects of FUSE. First, Rajgarhia and Gehani evaluated FUSE performance with Java bindings [27]. Compared to this work, they focused on evaluating Java library wrappers, used only three workloads, and ran experiments with FUSE v2.8.0-pre1 (released in 2008). The version they used did not support zero-copying via splice, writeback caching, and other important features. The authors also presented only limited information about FUSE's design. Second, the authors of [36] characterized FUSE performance for a variety of workloads but did not analyze the results. Furthermore, they evaluated only the default FUSE configuration and discussed only FUSE's high-level architecture. In this paper we evaluated and analyzed several FUSE configurations in detail, and described FUSE's low-level architecture.

Others have built useful extensions to FUSE. Re-FUSE automatically restarts FUSE file systems that crash [33]. To improve FUSE performance, Narayan et al. proposed to marry in-kernel stackable file systems [44] with FUSE [23]. Shun et al. modified FUSE's kernel module to allow applications to access storage devices directly [17]. These improvements were research prototypes and were never included in the mainline.

7 Conclusion

User-space file systems are popular for prototyping new ideas and developing complex production file systems that are difficult to maintain in the kernel. Although many researchers and companies rely on user-space file systems, little attention was given to understanding the performance implications of moving file systems to user space. In this paper we first presented the detailed design of FUSE, the most popular user-space file system framework. We then conducted a broad performance characterization of FUSE and presented an in-depth analysis of FUSE's performance patterns. We found that for many workloads, an optimized FUSE can perform within 5% of native Ext4. However, some workloads are unfriendly to FUSE, and even when optimized, FUSE degrades their performance by up to 83%. Also, in terms of CPU utilization, the relative increase is up to 31%.

All of our code and Filebench workload files are available from http://filesystems.org/fuse/.

Future work. There is large room for improvement in FUSE performance. We plan to add support for compound FUSE requests and to investigate the possibility of shared memory between kernel and user spaces for faster communications.

Acknowledgments

We thank the anonymous FAST reviewers and our shepherd Tudor Marian for their valuable comments. This work was made possible in part thanks to Dell-EMC, NetApp, and IBM support; NSF awards CNS-1251137, CNS-1302246, CNS-1305360, and CNS-1622832; and ONR award 12055763.

References

[8] Jonathan Corbet. In defense of per-bdi writeback, September 2009. http://lwn.net/Articles/354851/.

[9] B. Cornell, P. A. Dinda, and F. E. Bustamante. Wayback: A user-level versioning file system for Linux. In Proceedings of the Annual USENIX Technical Conference, FREENIX Track, pages 19–28, Boston, MA, June 2004. USENIX Association.

[10] Mathieu Desnoyers. Using the Linux kernel tracepoints, 2016. https://www.kernel.org/doc/Documentation/trace/tracepoints.txt.

[11] Ext4 documentation. https://www.kernel.org/doc/Documentation/filesystems/ext4.txt.

[12] Filebench, 2016. https://github.com/filebench/filebench/wiki.

[13] S. Ghemawat, H. Gobioff, and S. T. Leung. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), Bolton Landing, NY, October 2003. ACM.

[21] David Mazieres. A toolkit for user-level file systems. In Proceedings of the 2001 USENIX Annual Technical Conference, 2001.

[22] M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181–197, August 1984.

[23] Sumit Narayan, Rohit K. Mehta, and John A. Chandy. User space storage system stack modules with file level control. In Proceedings of the 12th Annual Linux Symposium, Ottawa, pages 189–196, 2010.

[24] Nimble's hybrid storage architecture. http://info.nimblestorage.com/rs/nimblestorage/images/nimblestorage_technology_overview.pdf.

[25] NTFS-3G. www.tuxera.com.