
Chapter 5: The Kernel API System Calls

Objectives
Introduce notions of mode, space and context.
Distinguish interrupts, exceptions and traps, and identify how they are used for kernel entry/exit.
Introduce mechanisms for tracing system calls.
Discuss implications of blocking system calls.
Briefly consider Intel low-level hardware events.
Carefully examine system_call, the Linux system call entry code.
Examine implementations of several system calls.
Describe how to implement a new system call.

Mode, Space, Context


Mode: hardware restricted execution state
  restricted access, privileged instructions
  user mode vs. kernel mode
  dual-mode architecture, protected mode
  Intel supports 4 protection rings: 0 kernel, 1 unused, 2 unused, 3 user

Space: kernel (system) vs. user (process) address space
  requires MMU support (virtual memory)
  userland: any process address space; there are many user address spaces
  reality: kernel is often mapped into user process space

Context: kernel activity on behalf of ???
  process: on behalf of current process
  system: unrelated to current process (maybe no process!)
  example: interrupt context, where blocking is not allowed!

User Mode, Process Context


CONTEXT Userland Process
User Space

System

User

Kernel Space

MODE
Kernel

Kernel Mode, Process Context


CONTEXT Process System

trap to kernel User

MODE
Kernel
System calls, exceptions

User Space Kernel Space

Kernel Mode, System Context


CONTEXT Process System

User

MODE
Kernel
interrupts, system tasks
User Space?? Kernel Space

interrupts

User Mode, System Context?


CONTEXT Process System
User Space

Not allowed!

User

Kernel Space?

MODE
Kernel

Interrupts and Exceptions


Interrupts - async device to cpu communication
  example: service request, completion notification
  aside: IPI, interprocessor interrupt (another cpu!)
  system may be interrupted in either kernel or user mode
  interrupts are logically unrelated to current processing

Exceptions - sync hardware error notification
  example: divide-by-zero (AU), illegal address (MMU)
  exceptions are caused by current processing

Software interrupts (traps)
  synchronous simulated interrupt
  allows controlled entry into the kernel from userland

Kernel Entry and Exit


[Figure: kernel entry and exit paths. Labels: Library Code, System Call Interface, trap (80h), exceptions (error traps), page faults, trap / interrupt table, system call table, boot, scheduler, interrupt, device dialog, Devices.]
IPI: interprocessor interrupt

Cost of Crossing the Kernel Barrier


more than a procedure call
less than a context switch
costs:
  vectoring mechanism
  establishing kernel stack
  validating parameters
kernel mapped to user address space?
  updating page map permissions
kernel in a separate address space?
  reloading page maps
  invalidating cache, TLB


System Calls vs. Library Calls


man 2
historical evolution of # of calls
  Unix 6e (~50), Solaris 7 (~250)
  Linux 2.0 (~160), Linux 2.2 (~190), Linux 2.4 (~220)

library calls vs. system call possibilities:
  library call never invokes system call
  library call sometimes invokes system call
  library call always invokes system call
  system call not available via library
    can invoke system call directly via assembly code

man 2: undocumented, unimplemented, obsolete
externals vs. internals

Blocking System Calls


system calls may block in the kernel
slow system calls may block indefinitely
  reads, writes of pipes, terminals, net devices
  some ipc calls, pause, some opens and ioctls
  disk io is NOT slow (it will eventually complete)

blocking slow calls may be interrupted by a signal
  the call fails with errno set to EINTR
  problem: slow calls must be wrapped in a loop

BSD introduced automatic restart of slow interrupted calls
POSIX didn't specify semantics
Linux
  no automatic restart by default
  specify restart when setting signal handler (SA_RESTART)

Tracing Process Signals and System Calls


ptrace() allows a parent process to observe/control a child
  child stops before signal delivery or system call execution
  parent waits for child
  parent can view/modify child state
  possible to attach and reparent existing processes
  architecture dependent

strace: useful diagnostic application to trace processes
  strace whatever

Solaris uses more sophisticated /proc mechanism


Sample strace -r Output


> strace -r sync
0.000000 execve("/bin/sync", ["sync"], [/* 21 vars */]) = 0
0.001002 brk(0) = 0x804a178
0.000192 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40014000
0.000164 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
0.000133 open("/etc/ld.so.cache", O_RDONLY) = 4
0.000069 fstat(4, {st_mode=S_IFREG|0644, st_size=20404, ...}) = 0
0.000120 old_mmap(NULL, 20404, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40015000
0.000075 close(4) = 0
0.000064 open("/lib/libc.so.6", O_RDONLY) = 4
0.000076 fstat(4, {st_mode=S_IFREG|0755, st_size=4101324, ...}) = 0
0.000096 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\210\212"..., 4096) = 4096
0.000192 old_mmap(NULL, 1001564, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x4001a000
0.000083 mprotect(0x40107000, 30812, PROT_NONE) = 0
0.000058 old_mmap(0x40107000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xec000) = 0x40107000
0.000137 old_mmap(0x4010b000, 14428, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4010b000
0.000080 close(4) = 0
0.000102 mprotect(0x4001a000, 970752, PROT_READ|PROT_WRITE) = 0
0.001043 mprotect(0x4001a000, 970752, PROT_READ|PROT_EXEC) = 0
0.000248 munmap(0x40015000, 20404) = 0
0.000077 personality(PER_LINUX) = 0
0.000127 getpid() = 2225
0.000193 brk(0) = 0x804a178
0.000054 brk(0x804a1b0) = 0x804a1b0
0.000097 brk(0x804b000) = 0x804b000
0.000130 sync() = 0
0.015855 _exit(0) = ?


Sample strace -c Output


> strace -c sync
execve("/bin/sync", ["sync"], [/* 21 vars */]) = 0
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.47    0.008277        8277         1           sync
  0.85    0.000072          24         3         1 open
  0.45    0.000038          38         1           read
  0.40    0.000034           7         5           old_mmap
  0.37    0.000031          10         3           mprotect
  0.19    0.000016          16         1           munmap
  0.11    0.000009           2         4           brk
  0.08    0.000007           4         2           fstat
  0.05    0.000004           2         2           close
  0.02    0.000002           2         1           getpid
  0.02    0.000002           2         1           personality
------ ----------- ----------- --------- --------- ----------------
100.00    0.008492                    24         1 total


Low-level Intel Event Mechanisms


Intel provides very complex hardware protection
  task: execution environment
    provides hardware support for context switching (not used by Linux)
  4 protection levels
  lots of segments and descriptors
    segments and descriptors all have privilege levels
  hardware support for stack swapping on privilege change
    avoids holes where privileged code crashes because no stack space

task gate: context switch to privileged code
call gate: execute privileged code with stack swapping
interrupt gate: call gate with interrupts disabled
trap gate: interrupt gate with interrupts still enabled

System Call Dispatch Table


Broad system call categories:
  files, i/o, devices
  memory, processes
  ipc, time, misc

System call listing:
  include/unistd.h (include/asm-i386/unistd.h)


System Calls (2.2)


_exit _llseek _newselect _sysctl accept access acct adjtimex afs_syscall alarm bdflush bind break brk cacheflush capget capset chdir chmod chown chroot clone close connect creat create_module delete_module dup dup2 execve exit fchdir fchmod fchown fcntl fdatasync flock fork fstat fstatfs fsync ftruncate get_kernel_syms getcontext getdents getdomainname getdtablesize getegid geteuid getgid getgroups gethostid gethostname getitimer getpagesize getpeername getpgid getpgrp getpid getppid getpriority getresgid getresuid getrlimit getrusage getsid getsockname getsockopt gettimeofday getuid gtty idle init_module intro ioctl ioctl_list ioperm iopl ipc kill killpg lchown link listen llseek lock lseek lstat mkdir mknod mlock mlockall mmap modify_ldt mount mprotect mpx mremap msgctl msgget msgop msgrcv msgsnd msync munlock munlockall munmap nanosleep nfsservctl nice obsolete oldfstat oldlstat oldolduname oldstat olduname open outb pause personality pipe poll prctl pread prof ptrace query_module quotactl read readdir readlink readv reboot recv recvfrom recvmsg rename rmdir sbrk sched_get_priority_max sched_get_priority_min sched_getparam sched_getscheduler sched_rr_get_interval sched_setparam sched_setscheduler sched_yield select semctl semget


System Calls (2.2)


semop send sendfile sendmsg sendto setcontext setdomainname setegid seteuid setfsgid setfsuid setgid setgroups sethostid sethostname setitimer setpgid setpgrp setpriority setregid setresgid setresuid setreuid setrlimit setsid setsockopt settimeofday setuid setup sgetmask shmat shmctl shmdt shmget shmop shutdown sigaction sigaltstack sigblock siggetmask sigmask signal sigpause sigpending sigprocmask sigreturn sigsetmask sigsuspend sigvec socket socketcall socketpair ssetmask stat statfs stime stty swapoff swapon symlink sync syscalls sysctl sysfs sysinfo syslog time times truncate umask umount uname undocumented unimplemented unlink uselib ustat utime utimes vfork vhangup vm86 wait wait3 wait4 waitpid write writev


system_call
arch/i386/kernel/entry.S:ENTRY(system_call)

SAVE_ALL
get current task struct
syscall # not OK? -> badsys
traced? -> tracesys
dispatch specific syscall: *(sys_call_table[call_number])
save return value
bottom half active? -> handle_bottom_half
need to reschedule? -> reschedule
signal pending? -> signal_return (do_signal)
RESTORE_ALL
return_from_exception
return_from_intr


lcall7
arch/i386/kernel/entry.S:ENTRY(lcall7)
entry point for Intel UNIX binary compatibility
iBCS2: Intel Binary Compatibility Standard v2
  allows execution of SCO, Solaris, FreeBSD binaries!
execution domain (personality)
  possible to register personality module
  sys_personality alters current personality
  see include/linux/personality.h
compatibility issues:
  signal number mapping
  restructure stack as necessary
  perform simple system call conversions
  call execution domain specific lcall7 handler


Example System Calls


sys_foo, do_foo idiom
  all system calls proper begin with sys_
  often delegate to do_ function for the real work

asmlinkage
  gcc magic to keep parameters on the stack
  avoids register optimizations

sys_ni_syscall
  just returns -ENOSYS!
  guards position 0 in table (catch uninitialized bugs)
  fills holes for obsolete syscalls or library implemented calls


Example System Calls: sys_time


kernel/time.c:sys_time

just return the number of seconds since Jan 1, 1970
  available as volatile CURRENT_TIME (xtime.tv_sec)
snapshot current time
check user-supplied pointer for validity
copy time to user space (asm/uaccess.h:put_user)
return time snapshot or error


Example System Calls: sys_reboot


kernel/sys.c:sys_reboot

require CAP_SYS_BOOT capability
check magic numbers (0xfee1dead, Torvalds family birthdays)
acquire the big kernel lock
switch on the requested option
  shutdown in various ways: restart, halt, poweroff
  user-specified shutdown command for some architectures
  toggle control-alt-delete processing
go through reboot_notifier callbacks as appropriate
unlock and return error if failure


Example System Calls: sys_sysinfo


kernel/info.c:sys_sysinfo

allocate a local struct to return info to user space
disable (clear) interrupts to keep info consistent
calculate uptime
calculate 1, 5, 15 minute load averages
  average length of run queue over interval
  use confusing int math to avoid floating-point inefficiency
enable (set) interrupts
return number of processes and some mem stats
copy local struct values to user space (copy_to_user)


Adding a System Call


link statically or implement as a kernel module
allocate a number from sys_call_table
export sys_whatever
validate all parameters!
return appropriate error codes
use uaccess.h macros as necessary
create a library wrapper with _syscallN macros
  linux/unistd.h
  _syscallN(return_type, entry, type1, arg1, type2, arg2, ...)


Summary
System calls represent the primary kernel API.
A system call is one way to enter kernel mode.
Crossing the kernel barrier is expensive.
System calls are usually wrapped in library routines.
Blocking slow system calls may be interrupted by a signal.
It is possible to trace system calls with ptrace().
Intel Linux implements system calls using interrupts and a low-level feature referred to as a call gate.
System calls often require copying data between user and kernel space.
