
COSC440 Lab Manual


REVISION HISTORY
NUMBER   DATE      DESCRIPTION   NAME
1.0      2012-02                 QL


Contents

1 Installation of Linux (Lubuntu) and system calls
  1.1 Resources for COSC440
  1.2 Installation of Ubuntu
  1.3 System calls for file operations
    1.3.1 open()
    1.3.2 creat()
    1.3.3 read()
    1.3.4 write()
    1.3.5 close()
    1.3.6 ioctl()
    1.3.7 lseek()
  1.4 Assessment
  1.5 Setting up your Git repository
  1.6 Reference

2 The structure of a Linux kernel module
  2.1 Install XV6 and qemu
  2.2 Compile and load a module
  2.3 The Makefile for module compilation
  2.4 About the template module
  2.5 Basics of a device driver
    2.5.1 Initialization, cleanup and character device
    2.5.2 File operations
      2.5.2.1 open()
      2.5.2.2 release()
      2.5.2.3 read()
      2.5.2.4 write()
      2.5.2.5 llseek()
      2.5.2.6 ioctl()
      2.5.2.7 Registering your file operation functions
  2.6 Kernel configuration and compilation
  2.7 Reference

3 Linked list and seeking the device, process sleeping
  3.1 Linked list
    3.1.1 Exercise: Basic linked-list
    3.1.2 Exercise: Finding tainted kernel modules
    3.1.3 Exercise: Enabling seeking of a device
  3.2 Suspending process - wait queue
    3.2.1 Exercise: Using wait queues

4 Mutex, semaphore and the proc file system
  4.1 Mutex
    4.1.1 Exercise: Mutex
  4.2 Semaphore
    4.2.1 Exercise: Semaphore
    4.2.2 Exercise: Device-private data
  4.3 /proc entry
    4.3.1 Creating entries
    4.3.2 Reading entries
    4.3.3 Writing entries
    4.3.4 Exercise: using the proc filesystem
    4.3.5 Exercise: Making your own subdirectory in /proc

5 Memory management
  5.1 kmalloc()
  5.2 __get_free_pages()
  5.3 vmalloc()
  5.4 slabs and cache allocations
    5.4.1 Exercise: memory caches
    5.4.2 Exercise: Testing maximum memory allocation (optional)
  5.5 copying data across user/kernel space
  5.6 Memory mapping
  5.7 Atomic operations

6 Using ioctl()
  6.1 Defining ioctl() commands
  6.2 Automatically create device nodes under /dev with the udev system
  6.3 Exercise: using ioctl to pass data
  6.4 Exercise: using ioctl() to pass data of variable length

7 Catchup lab

8 Hardware interrupts, tasklets and workqueues
  8.1 Top half
  8.2 Bottom half
    8.2.1 Tasklets
    8.2.2 Workqueues
    8.2.3 Exercise: Deferred Functions
    8.2.4 Exercise: Shared interrupts and bottom halves
  8.3 Hardware I/O
    8.3.1 REGISTERING I/O PORTS
    8.3.2 READING AND WRITING DATA FROM I/O REGISTERS
    8.3.3 SLOWING I/O CALLS TO THE HARDWARE
  8.4 Reference

9 Walking through the Assignment 2
  9.1 Preparing the base code from Assignment 1
  9.2 Parallel port I/O
    9.2.1 Preparing your system
    9.2.2 Setting up Parallel I/O in your module
    9.2.3 Interrupt handler
  9.3 Overview of the Assignment 2
    9.3.1 Top half
    9.3.2 Bottom half (producer) and read() (consumer) in the multiple-page queue

10 Timers (Optional)
  10.1 jiffies
  10.2 Timers
  10.3 Exercise: Kernel Timers from a Character Driver

Chapter 1

Installation of Linux (Lubuntu) and system calls


1.1 Resources for COSC440

Skeleton code for all exercises in the COSC440 labs is available at:
http://www.cs.otago.ac.nz/cosc440/resources.php
The schedule, lecture notes and course reading materials are also available at:
http://www.cs.otago.ac.nz/cosc440/schedule.php
Please ensure you read the materials before attending the lectures and labs.

1.2 Installation of Ubuntu

You need to maintain your own Linux box to hack on the Linux kernel. This makes the hacking real without affecting other users. To install Ubuntu, follow these steps:
1. Get a copy of the Ubuntu installation disc (CD-ROM) ready.
2. Set up a PC with power cables, monitor, Ethernet cable, mouse, and keyboard.
3. Plug the power cables into the power sockets and the network cable into the private network (yellow socket).
4. Connect the mouse and the keyboard to the PC.
5. When everything is ready, power on the PC.
6. Insert the installation disc and follow the instructions (you may need to hit Del or F2 to get into the BIOS menu and change the boot sequence in order to boot from the CD-ROM).
7. Once Ubuntu is installed, you need to make apt (the Ubuntu package manager) use the University repository to minimize internet traffic. Download the updated repository list and refresh the repository:
$ wget http://www.cs.otago.ac.nz/cosc440/01/sources.list
$ sudo mv /etc/apt/sources.list /etc/apt/sources.list.orig
$ sudo cp sources.list /etc/apt/sources.list
$ sudo apt-get update

8. Install the packages needed for compiling the kernel, such as build-essential and libncurses-dev. To compile kernel modules, you also need the header files for the kernel that the OS is running.
$ sudo apt-get install libncurses-dev build-essential kernel-package linux-headers-generic


1.3 System calls for file operations

While your virtual machine is installing, let's recap the system calls used for file operations in user space. The rest of this lab can be completed using your host desktop OS X terminal.
Many devices are abstracted as files in Linux/Unix. The standard operations are open(), close(), read(), write(), ioctl(), lseek(), etc. These are called system calls and are the API through which user applications request kernel services. They are different from the libc functions fopen(), fclose(), fread() and fwrite(), which are built on the system calls but can prefetch and buffer data in user space in order to avoid unnecessary system calls. System calls are more expensive than function calls because they involve a switch from user space to kernel space and then a switch back to user space. Below are the brief semantics of the file-related system calls.

1.3.1 open()

int open(const char *pathname, int flags);


int open(const char *pathname, int flags, ... /* mode_t mode */);

When using the function, the following header files should be included in your program (we will put these files before each
system call for the rest of the course material):
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

For a given file name pathname, open() returns a file descriptor, which is a small, non-negative integer for use in subsequent
system calls such as read() and write(). A call to open() will cause the kernel to create a file structure in the kernel space for the
calling process.
The second argument flags specifies the options of open(). A file can be opened for reading only, for writing only, or for both reading and writing. For example, the following constants (defined in fcntl.h) are often used:
O_RDONLY   open for reading only
O_WRONLY   open for writing only
O_RDWR     open for reading and writing

The argument is formed by ORing together different options, e.g. O_RDWR|O_CREAT|O_NONBLOCK. To find out about other options, read the Linux manual page with the command
$ man 2 open

The third argument is only used when a new file is being created, in which case it specifies a protection mode such as 0700 (an octal number, not hexadecimal). However, to avoid errors, the following constants (defined in sys/stat.h) should be used:
S_IRWXU   00700   user (file owner) has read, write and execute permission
S_IRUSR   00400   user has read permission
S_IWUSR   00200   user has write permission
S_IXUSR   00100   user has execute permission
S_IRWXG   00070   group has read, write and execute permission
S_IRGRP   00040   group has read permission
S_IWGRP   00020   group has write permission
S_IXGRP   00010   group has execute permission
S_IRWXO   00007   others have read, write and execute permission
S_IROTH   00004   others have read permission
S_IWOTH   00002   others have write permission
S_IXOTH   00001   others have execute permission


mode must be specified when O_CREAT is in the flags, and is ignored otherwise.
If successful, open() returns a non-negative file descriptor; otherwise, it returns -1, in which case errno can be used to check the
error number and message.
In order to check the value of errno, its header needs to be included:
#include <errno.h>

Once errno is set, it is useful to print a human-readable error message, which can be done with
#include <errno.h>
#include <stdio.h>
void perror(const char *s);

which takes a string (usually the name of the system call) and prints out an error message.
Note that it is standard practice in systems programming to always check the return value of system calls and take corresponding action when an error occurs. The following common errors are worth noting:
EAGAIN/EWOULDBLOCK Non-blocking I/O has been selected using O_NONBLOCK and there was no data immediately
available for reading.
EFAULT You provided an invalid buffer space that is outside your accessible address space.
EINTR The call was interrupted by a signal before any data was read. You normally should repeat the system call on this error.
However, each system call has its own set of errors. For more detailed information about errors of each system call, use the Linux
manual pages man 2 <syscall-name>.
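The error-checking convention described above can be sketched as follows. This is a minimal illustration rather than part of the course skeleton code; the helper name open_for_reading is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Opens pathname for reading only. Returns the file descriptor, or -1
 * after printing a message such as "open(): No such file or directory".
 * errno is preserved so that the caller can still inspect it.
 */
int open_for_reading(const char *pathname)
{
    int fd = open(pathname, O_RDONLY);

    if (fd == -1) {
        int saved_errno = errno;   /* perror() may itself change errno */
        perror("open()");
        errno = saved_errno;
    }
    return fd;
}
```

A caller would typically test the result with if (fd == -1) and then either exit or recover, depending on the value of errno.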

1.3.2 creat()

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int creat(const char *pathname, mode_t mode);

It is equivalent to the call:
open(pathname, O_WRONLY|O_CREAT|O_TRUNC, mode);
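As a short illustration of the equivalence and of the mode constants from sys/stat.h, the sketch below creates a file that only its owner may read and write. The function name create_private_file is hypothetical, and the final permissions are still subject to the process umask.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Creates (or truncates) pathname so that only its owner can read and
 * write it. Same effect as creat(pathname, S_IRUSR | S_IWUSR).
 * Returns a write-only file descriptor, or -1 on error.
 */
int create_private_file(const char *pathname)
{
    return open(pathname, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
}
```

Using S_IRUSR | S_IWUSR instead of a literal number avoids the common mistake of writing the mode in decimal or hexadecimal.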

1.3.3 read()

#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count);

It reads up to count bytes of data from an open file fd into the user buffer buf. If successful, the number of bytes read is returned; if the end of file is reached, 0 is returned; otherwise, -1 is returned on error.
Note that the number of bytes returned can be less than the amount requested (i.e. count). For example, if there are 70 bytes left until the end of file and you try to read 100 bytes, read() returns only 70. Also, if you read from a socket (a special kind of file descriptor) and the amount of data requested has not yet arrived from the network, read() just returns whatever is in the socket buffer. It is always good practice to compare the return value with the requested amount (count) in order to react accordingly.
Here is a code snippet showing how read() is used to read from a file descriptor into a fixed-size buffer, process the data in the buffer, and read again until EOF is reached or until the socket is empty:


#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define BUF_SIZE 100
char buf[BUF_SIZE];
size_t size_to_be_read = BUF_SIZE;
ssize_t size_read;   /* signed, because read() returns -1 on error */
restart:
while ((size_read = read(fd, buf, size_to_be_read)) > 0) {
    /* process the size_read bytes now in the buffer ... */
}
if (size_read < 0) {
    if (EINTR == errno) {
        /* read() interrupted by a signal, so let's try again */
        goto restart;
    } else {
        /* other error, so print a message and quit */
        perror("read()");
        exit(EXIT_FAILURE);
    }
}
/* size_read == 0, so EOF reached or socket disconnected:
   task completed */

1.3.4 write()

#include <unistd.h>
ssize_t write(int fd, const void *buf, size_t count);

It requests the kernel to write count bytes from the buffer pointed to by buf into the file fd. The return value usually equals count if successful. However, the return value can be less than count, for example when the physical medium of the file does not have enough space. You should check the return value in order to make sure all data is successfully stored in the file. In the case of a socket, you should write the rest of the data again, in the expectation that socket buffer space or network capacity becomes available soon.
Here is a code snippet showing the usage of write():
ssize_t size_remaining = size_read;   /* bytes still to be written */
ssize_t size_written = 0;             /* bytes written so far */
ssize_t size_written_this_time;       /* signed: write() returns -1 on error */
while ((size_written_this_time = write(fd, &buf[size_written], size_remaining))
       < size_remaining) {
    if (size_written_this_time < 0) {
        /* something wrong */
        if (EINTR == errno) {
            /* interrupted, start again */
            continue;
        } else {
            perror("write()");
            exit(EXIT_FAILURE);
        }
    } else {
        /* write() wrote less than required, so we need to
           update size_written and size_remaining
           and call write() again to write out the remaining data */
        size_written += size_written_this_time;
        size_remaining -= size_written_this_time;
    }
}

1.3.5 close()

#include <unistd.h>
int close(int fd);

It closes a file referred to by the file descriptor fd, so that your program no longer accesses the file. It returns 0 on success but -1
on error.

1.3.6 ioctl()

#include <sys/ioctl.h>
int ioctl(int fd, int request, ... /* char *argp */);

ioctl() is mainly used with device drivers, which are an essential part of this paper. It manipulates the underlying device parameters through different requests. fd is normally a file descriptor referring to an opened device file such as /dev/tty. The request is device dependent. The third argument argp points to a buffer that contains data to be passed to or from the device driver. You will learn more about ioctl() when you work on the implementation of device drivers.
It returns 0 on success but -1 on error.

1.3.7 lseek()

#include <sys/types.h>
#include <unistd.h>
off_t lseek(int fd, off_t offset, int whence);

Every open file has an associated "current file offset", which is the position at which the next read() or write() applies. lseek() repositions the offset of the open file fd to the argument offset according to the directive whence in the following ways:
SEEK_SET: The offset is set to offset bytes.
SEEK_CUR: The offset is set to its current location plus offset bytes.
SEEK_END: The offset is set to the size of the file plus offset bytes.

This means lseek() allows the current position to be set beyond the end of the file, which does not change the size of the file until data is written at that position. Subsequent reads of the data in the gap return null bytes (0) until data is actually written into the gap.
It returns the resulting offset position (measured in bytes from the beginning of the file) on success, and -1 on error.
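The behaviour of seeking past the end of a file can be tried out with the following sketch. It is only an illustration under simple assumptions (a writable path on an ordinary filesystem); the function name write_after_gap is hypothetical.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Creates pathname, seeks gap bytes past the start of the empty file and
 * writes a single 'X' there. Returns the resulting file size, or -1 on error.
 */
off_t write_after_gap(const char *pathname, off_t gap)
{
    off_t size = -1;
    int fd = open(pathname, O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);

    if (fd == -1)
        return -1;
    /* Seeking beyond the end is allowed and does not change the size yet. */
    if (lseek(fd, gap, SEEK_SET) == gap && write(fd, "X", 1) == 1)
        size = lseek(fd, 0, SEEK_END);   /* offset of the end == file size */
    close(fd);
    return size;
}
```

With a gap of 10, the file ends up 11 bytes long: ten null bytes that were never explicitly written, followed by the 'X'.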
There are many other system calls related to a file. However, the above system calls are the most important to device drivers and
will be implemented when you go through the implementation of device drivers.


1.4 Assessment

In this lab, you should implement your own cat program, which reads from a file and writes the content of the file to the standard output, referred to by the file descriptor STDOUT_FILENO. The standard input, output and error files are automatically opened when a program is loaded for execution and can be referred to by the file descriptors STDIN_FILENO, STDOUT_FILENO and STDERR_FILENO (defined in unistd.h). For example, the following command
$ my_cat test.txt
should display the content of "test.txt".


In the program my_cat, you should read from a file (the file name is given as a command line argument through argc and argv), check for errors such as EINTR (in which case you should repeat the same read()), and write to the standard output. You should also check the number of bytes read and whether the end of file has been reached. In addition, you should check for errors from open(), such as a non-existent file or denied access, and print error messages accordingly.
The pseudocode of the cat program (excluding error checking) is:
fd = open(filename, readonly)
while (there is data read from fd into buf) {
    write the data to the standard output
    check the size written,
    if smaller than the data size, update indices
    and call write() again to write out the remaining data
}
close(fd)

1.5 Setting up your Git repository

For any project, such as source code, assignments and theses, it is important to:
1. keep track of everything you added or edited, so that you can revert to a given state should something go wrong
2. have a distributed backup system so that if your working copy is lost, you can still retrieve your work.
Therefore a distributed version control system such as git:
http://git-scm.com
should be used. You are required to use git to version all assignments in COSC440, and encouraged to do the same for other papers and your COSC480 dissertation.
To begin with, create an account on the cloud-based git server bitbucket:
http://bitbucket.org
After your account is set up, you can create a git repository for each project, for example cosc440-asgn1. You can also create a repository for the COSC440 labs, which is a good way to share code between your work machine and your home Linux machine should you wish to work at home at the weekend.
Now, it is a good idea to create a git repository in bitbucket for your first programming assignment "cosc440-asgn1".

Important
Please make sure that the box "private" is clicked.


Then you can check out a working copy on your working machine by following the instructions on bitbucket.
Before you can do that on your machine, you need to install git first:
$ sudo apt-get install git-core

For general usage of git, please have a look at:


http://jonas.iki.fi/git_guides/HTML/git_guide/
http://gitref.org/basic/

1.6 Reference

1. Advanced Programming in the UNIX Environment, by W. R. Stevens, Addison-Wesley
2. Linux manual pages, man 2 <syscall-name>


Chapter 2

The structure of a Linux kernel module


2.1 Install XV6 and qemu

Before we get our hands on the Linux kernel and kernel modules, we first install XV6 and qemu, which will help us understand OS concepts by reading the XV6 code and running it in the emulator qemu.
Install qemu with apt-get:
$ sudo apt-get install qemu-system
$ cd /usr/bin
$ sudo ln -s /usr/bin/qemu-system-i386 qemu

Install, compile, and run XV6:
$ cd
$ wget http://www.cs.otago.ac.nz/cosc440/resources/xv6-rev5.tar.gz
$ tar xvfz xv6-rev5.tar.gz
$ cd xv6
$ make
$ make qemu

You will see XV6 running in qemu with a simple command shell. You may type commands like ls, cat, etc.

2.2 Compile and load a module

Kernel modules are code components that can be loaded into and unloaded from the kernel dynamically. Modules enable a modular design of the kernel and help slow down the growth of the kernel itself. You are going to use modules to learn kernel programming, which is more flexible and less error prone than modifying the complicated kernel source tree.
In the skeleton code directory 02/temp, compile the module:
$ make

Load the module:


$ sudo insmod ./temp.ko

Run dmesg to see whether the module was loaded successfully. You should find the message "Hello world from Template Module" among the output.
To unload the module, use:
$ sudo rmmod temp


If a user-space program wants to access a device, the device needs to have a node created under the /dev directory.
To create a node for a device, run:
$ sudo mknod [-m MODE] /dev/<node name> <device type> <major number> <minor number>

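where the device type can be one of:
b block device
c character device (u is accepted as a synonym)
p FIFO (named pipe), for which no major or minor number is given
and the optional argument -m specifies the file permissions of the node. Network interfaces have no node under /dev, and tty devices are created as character devices.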
You can find out the device major number allocated to the module using:
$ cat /proc/devices | grep temp
249 temp

You can use this number to create a device file with mknod:
$ sudo mknod -m 666 /dev/temp c 249 0

Then a user-space program can open the device with open(), just as it would open any other file. Here is an example:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
.....
int fd = open("/dev/temp", O_RDWR);
.....

2.3 The Makefile for module compilation

The Makefile is very simple, since it just jumps into the kernel tree to do the compilation. You should have installed the kernel headers or source before compiling any module.
This is an example of a Makefile:
obj-m     += temp.o
test-objs := temp.o
KDIR      := /lib/modules/$(shell uname -r)/build
PWD       := $(shell pwd)
default:
	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
clean:
	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) clean
	rm -f modules.order

The first line in the Makefile:
obj-m += temp.o


specifies the module name temp and the target. It assumes the source file has the name temp.c. If you use a different file name, e.g. test.c, you will need the following line:
temp-objs := test.o

If you have multiple source files for the module, you should have:
temp-objs := src1.o src2.o src3.o

The next line specifies the kernel build tree that the module is going to be compiled against:
KDIR := /lib/modules/$(shell uname -r)/build

$(shell uname -r) returns the release of the currently running kernel, so KDIR points to the build tree of that kernel. The module will be compiled with the same options and flags as the modules in the kernel source tree. You can compile your module against other kernel sources by changing KDIR, but you have to make sure you load your module into the same kernel release that it was compiled against.
The next line specifies the location of the module source:
PWD := $(shell pwd)

$(shell pwd) returns the current working directory. You can instead set the source directory to any absolute pathname such as /home/foo/test. The targets default and clean run the commands for compilation and cleanup respectively.

2.4 About the template module

This module is a character device driver for a simple and naive memory device. We will talk about the details later. Here we are going to explain some of the module-related code.
You need to include a number of header files depending on which kernel functions the module calls. For module compilation, you should usually include the following header files:
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/init.h>

The MODULE_XXX macros allow the programmer to specify module-related information. For example, MODULE_LICENSE specifies the license of the module. If it is not an open source license, the kernel will be marked as "tainted" when the module is loaded, in which case the module will function normally but the kernel developers will not be keen to help.
You may specify module parameters using:
module_param(major, int, S_IRUGO);
MODULE_PARM_DESC(major, "device major number");

The first argument of module_param is the variable name, the second is its type, and the third is a permission mask used for the accompanying entry in the sysfs file system. If you are not interested in sysfs, you can put 0 there. The S_IRUGO (0444) in temp.c means read permission for all users. If you want the entry to be both readable and writable for all users, you can use S_IRUGO | S_IWUGO. You can find other related macros in include/linux/stat.h in the Linux source tree.
When the major parameter is declared, the module can be loaded as:
$ sudo insmod ./temp.ko major=250

where the variable major will be initialized to 250 in the module.
At the end of temp.c, callback functions for module initialization and cleanup are declared:
module_init(temp_init_module);
module_exit(temp_exit_module);
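Putting the pieces of this section together, a module can be as small as the sketch below. It is a hypothetical minimal example, not the course template: the names sketch_init_module and sketch_exit_module and the log messages are assumptions chosen for illustration, and it has to be built with the kernel build system rather than as an ordinary program.

```c
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/stat.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal module sketch (hypothetical, not the course template)");

/* Defaults to 0 unless "major=<n>" is given on the insmod command line. */
static int major;
module_param(major, int, S_IRUGO);
MODULE_PARM_DESC(major, "device major number");

/* Called once when the module is loaded with insmod. */
static int __init sketch_init_module(void)
{
    printk(KERN_INFO "sketch: loaded with major=%d\n", major);
    return 0;   /* a negative errno value here would abort the load */
}

/* Called once when the module is removed with rmmod. */
static void __exit sketch_exit_module(void)
{
    printk(KERN_INFO "sketch: unloaded\n");
}

module_init(sketch_init_module);
module_exit(sketch_exit_module);
```

After loading such a module, the printk() messages appear in the output of dmesg.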


2.5 Basics of a device driver

In the Linux kernel, device drivers are divided into different classes:


character
block
network
tty
Different interfaces are provided by the Linux kernel API for each class of device drivers. In COSC440, you will work with
character devices.
Each device is uniquely identified by a major number and a minor number. The major number identifies the device class or a group of devices of the same type, such as a controller for several terminals. The minor number identifies a specific device, such as a single terminal.
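The major and minor numbers of an existing device node can be inspected from user space. The sketch below is an illustration only: the function name device_numbers is hypothetical, and it relies on the glibc macros major() and minor() from sys/sysmacros.h. Inside the kernel, the corresponding macros are MAJOR(), MINOR() and MKDEV().

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Looks up the device node at path and stores its major and minor
 * numbers. Returns 0 on success, or -1 if path cannot be examined or
 * is not a character or block device node.
 */
int device_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) == -1)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;                  /* not a device node */
    *maj = major(st.st_rdev);       /* st_rdev holds the device number */
    *min = minor(st.st_rdev);
    return 0;
}
```

On Linux, for example, /dev/null is conventionally the character device with major number 1 and minor number 3, which is the same pair that ls -l /dev/null prints.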

2.5.1 Initialization, cleanup and character device

As explained above, a kernel module must be loaded before use, and when a kernel module is loaded, init() will be called to set
up the class.
First, a driver needs to be registered with:
int register_chrdev_region(dev_t from, unsigned count, const char *name);

where:
from is the first in the desired range of device numbers; must include the major number
count is the number of consecutive device numbers required
name is the device name
If the major number of the device is not known, the system can dynamically allocate one. In this case, this function should be
used instead:
int alloc_chrdev_region(dev_t *dev, unsigned baseminor,
                        unsigned count, const char *name);

After a major number is dynamically allocated, the device name and the allocated number can be found in /proc/devices.
Then the function:
void cdev_init(struct cdev *cdev, const struct file_operations *fops);

is called to initialize cdev (the internal structure of your character device) and register the file operation struct (see next section).
After this step, init() needs to set the owner field:
struct cdev cdev;
.....
cdev.owner = THIS_MODULE;

If device numbers were registered during init(), the driver module must release them when it is unloaded by calling:
void unregister_chrdev_region(dev_t from, unsigned count);

During initialization, the return value of every initialization step should be checked. A negative return value indicates an
error; init() then needs to undo all previously completed initialization steps in reverse order and return a suitable negative
errno code. This will become more important as you learn other resource allocation functions later in this course.
Similarly, the exit() function, which is called when the module is unloaded, needs to free up resources allocated in init() in the
reverse order.
Since this is a kernel module, and we are now in the kernel space, if the cleanup steps are not properly done, resources could
remain occupied after the module is unloaded, until the system is rebooted.
Note
Please have a look at the example kernel module code at the end of this section to see how it handles initialization and cleanup
steps.
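The rule of undoing completed steps in reverse order is commonly written with goto labels. The following is a user-space analogue only, not kernel code: malloc() and fopen() stand in for kernel resources, and the names demo_ctx, demo_setup() and demo_teardown() are hypothetical. It is a sketch of the pattern under those assumptions.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct demo_ctx {
    char *buf;   /* step 1: a block of memory */
    FILE *fp;    /* step 2: an open file */
};

/*
 * Acquire two resources in order. On failure, undo only the steps that
 * already succeeded, in reverse order, and return a negative errno value.
 */
int demo_setup(struct demo_ctx *ctx, const char *path)
{
    int rv;

    ctx->buf = malloc(64);
    if (ctx->buf == NULL) {
        rv = -ENOMEM;
        goto fail;
    }

    ctx->fp = fopen(path, "r");
    if (ctx->fp == NULL) {
        rv = -errno;
        goto fail_free_buf;
    }
    return 0;

fail_free_buf:
    free(ctx->buf);          /* undo step 1 */
    ctx->buf = NULL;
fail:
    return rv;
}

/* Release everything in the reverse order of acquisition. */
void demo_teardown(struct demo_ctx *ctx)
{
    fclose(ctx->fp);
    free(ctx->buf);
}
```

Each later step adds one label above the existing ones, so a failure at step N jumps to the label that undoes steps N-1 down to 1.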

2.5.2

File operations

The kernel API provides a number of file operations for device drivers. The main file operation-related functions are listed below:
2.5.2.1

open()

int (*open)(struct inode *inode, struct file *filp);

is provided for a driver to initialize a sub-device. If implemented, it may perform the following tasks:
Check for device-specific errors (such as device-not-ready or similar hardware problems)
Initialize the device if it is being opened for the first time
Update the f_op pointer, if necessary
Allocate and fill any data structure to be put in filp->private_data
If open() is not implemented, opening the device always succeeds, but your driver is not notified.
2.5.2.2

release()

int (*release)(struct inode *inode, struct file *filp);

releases the device. Any resources allocated in open() must be released in this function.
2.5.2.3

read()

ssize_t (*read)(struct file *filp, char __user *buf,
                size_t count, loff_t *offp);

is used to retrieve data from the device into the user-space buffer buf. If the read pointer in struct file_operations is NULL,
the read system call fails with -EINVAL (Invalid argument). A nonnegative return value represents the number of bytes
successfully read (the return value is a signed size type, usually the native integer type for the target platform).

2.5.2.4

write()

ssize_t (*write)(struct file *filp, const char __user *buf,
                 size_t count, loff_t *offp);

sends data from the user-space buffer buf to the device. If the write pointer in struct file_operations is NULL, -EINVAL is
returned to the program calling the write system call. The return value, if nonnegative, represents the number of bytes
successfully written.
2.5.2.5

llseek()

sets the file offset of the device to the specified value. Will be covered in lab 3.
2.5.2.6

ioctl()

allows you to implement special functions for your device that are not available through existing system calls. Will be covered in lab 6.
2.5.2.7

Registering your file operation functions

Implementations of the above file operation functions are registered through struct file_operations:


#include <linux/module.h>
struct file_operations fops = {
.owner = THIS_MODULE,
.llseek = temp_llseek,
.read = temp_read,
.write = temp_write,
.unlocked_ioctl = temp_ioctl,
.open = temp_open,
.release = temp_release,
};

which is in turn registered by the function:


void cdev_init(struct cdev *cdev, const struct file_operations *fops);

where cdev is the cdev structure you declare (statically), and fops is the struct holding pointers to your file operation-related
functions.
init() needs to call cdev_init(). Then it needs to add the character device to the system with:
int cdev_add (struct cdev *p, dev_t dev, unsigned count);

where:
p points to the cdev structure
dev is the first device number for which this device is responsible
count is the number of consecutive minor numbers corresponding to this device
Here is an example of a real kernel module:
/*----------------------------------------------------------------------------*/
/* File: tem.c                                                                */
/* Date: 13/03/2006                                                           */
/* Author: Zhiyi Huang                                                        */
/* Version: 0.1                                                               */
/*----------------------------------------------------------------------------*/
/* This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/proc_fs.h>
#include <linux/fcntl.h>
#include <linux/aio.h>
#include <asm/uaccess.h>

#include <linux/ioctl.h>
#include <linux/cdev.h>
#include <linux/device.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Zhiyi Huang");
MODULE_DESCRIPTION("A template module");

/* The parameter for testing */
int major=0;
module_param(major, int, S_IRUGO);
MODULE_PARM_DESC(major, "device major number");

#define MAX_DSIZE 3071
struct my_dev {
    char data[MAX_DSIZE+1];
    size_t size;            /* 32-bit will suffice */
    struct semaphore sem;   /* Mutual exclusion */
    struct cdev cdev;
    struct class *class;
    struct device *device;
} *temp_dev;

int temp_open (struct inode *inode, struct file *filp)
{
    return 0;
}

int temp_release (struct inode *inode, struct file *filp)
{
    return 0;
}

ssize_t temp_read (struct file *filp, char __user *buf, size_t count, loff_t *f_pos)
{
    int rv=0;

    if (down_interruptible (&temp_dev->sem))
        return -ERESTARTSYS;
    if (*f_pos > MAX_DSIZE)             /* past the end of the buffer: return 0 (EOF) */
        goto wrap_up;
    if (*f_pos + count > MAX_DSIZE)     /* clamp the request to the end of the buffer */
        count = MAX_DSIZE - *f_pos;
    if (copy_to_user (buf, temp_dev->data+*f_pos, count)) {
        rv = -EFAULT;
        goto wrap_up;
    }
    up (&temp_dev->sem);
    *f_pos += count;
    return count;

wrap_up:
    up (&temp_dev->sem);
    return rv;
}

ssize_t temp_write (struct file *filp, const char __user *buf, size_t count, loff_t *f_pos)
{
    int count1=count, rv=count;

    if (down_interruptible (&temp_dev->sem))
        return -ERESTARTSYS;
    if (*f_pos > MAX_DSIZE)             /* past the end: nothing is stored, count is still returned */
        goto wrap_up;
    if (*f_pos + count > MAX_DSIZE)     /* store only what fits in the buffer */
        count1 = MAX_DSIZE - *f_pos;
    if (copy_from_user (temp_dev->data+*f_pos, buf, count1)) {
        rv = -EFAULT;
        goto wrap_up;
    }
    up (&temp_dev->sem);
    *f_pos += count1;
    return count;

wrap_up:
    up (&temp_dev->sem);
    return rv;
}

long temp_ioctl (struct file *filp, unsigned int cmd, unsigned long arg)
{
    return 0;
}

loff_t temp_llseek (struct file *filp, loff_t off, int whence)
{
    long newpos;

    switch(whence) {
    case SEEK_SET:
        newpos = off;
        break;
    case SEEK_CUR:
        newpos = filp->f_pos + off;
        break;
    case SEEK_END:
        newpos = temp_dev->size + off;
        break;
    default: /* can't happen */
        return -EINVAL;
    }
    if (newpos<0 || newpos>MAX_DSIZE) return -EINVAL;
    filp->f_pos = newpos;
    return newpos;
}

struct file_operations temp_fops = {
    .owner = THIS_MODULE,
    .llseek = temp_llseek,
    .read = temp_read,
    .write = temp_write,
    .unlocked_ioctl = temp_ioctl,
    .open = temp_open,
    .release = temp_release,
};

/**
 * Initialise the module and create the master device
 */
int __init temp_init_module(void){
    int rv;
    dev_t devno = MKDEV(major, 0);

    if(major) {
        rv = register_chrdev_region(devno, 1, "temp");
        if(rv < 0){
            printk(KERN_WARNING "Can't use the major number %d; trying automatic allocation...\n", major);
            rv = alloc_chrdev_region(&devno, 0, 1, "temp");
            major = MAJOR(devno);
        }
    }
    else {
        rv = alloc_chrdev_region(&devno, 0, 1, "temp");
        major = MAJOR(devno);
    }
    if(rv < 0) return rv;

    temp_dev = kmalloc(sizeof(struct my_dev), GFP_KERNEL);
    if(temp_dev == NULL){
        rv = -ENOMEM;
        unregister_chrdev_region(devno, 1);
        return rv;
    }
    memset(temp_dev, 0, sizeof(struct my_dev));
    cdev_init(&temp_dev->cdev, &temp_fops);
    temp_dev->cdev.owner = THIS_MODULE;
    temp_dev->size = MAX_DSIZE;
    sema_init (&temp_dev->sem, 1);

    rv = cdev_add (&temp_dev->cdev, devno, 1);
    if (rv) {
        /* undo the earlier steps in reverse order */
        printk(KERN_WARNING "Error %d adding device temp\n", rv);
        kfree(temp_dev);
        unregister_chrdev_region(devno, 1);
        return rv;
    }

    temp_dev->class = class_create(THIS_MODULE, "temp");
    if(IS_ERR(temp_dev->class)) {
        cdev_del(&temp_dev->cdev);
        kfree(temp_dev);
        unregister_chrdev_region(devno, 1);
        printk(KERN_WARNING "%s: can't create udev class\n", "temp");
        rv = -ENOMEM;
        return rv;
    }

    /* arguments: class, parent, device number, driver data, name format, ... */
    temp_dev->device = device_create(temp_dev->class, NULL,
                                     MKDEV(major, 0), NULL, "%s", "temp");
    if(IS_ERR(temp_dev->device)){
        class_destroy(temp_dev->class);
        cdev_del(&temp_dev->cdev);
        kfree(temp_dev);
        unregister_chrdev_region(devno, 1);
        printk(KERN_WARNING "%s: can't create udev device\n", "temp");
        rv = -ENOMEM;
        return rv;
    }

    printk(KERN_WARNING "Hello world from Template Module\n");
    printk(KERN_WARNING "temp device MAJOR is %d, dev addr: %lx\n", major, (unsigned long)temp_dev);

    return 0;
}

/**
* Finalise the module
*/
void __exit temp_exit_module(void){
    device_destroy(temp_dev->class, MKDEV(major, 0));
    class_destroy(temp_dev->class);
    cdev_del(&temp_dev->cdev);
    kfree(temp_dev);
    unregister_chrdev_region(MKDEV(major, 0), 1);
    printk(KERN_WARNING "Good bye from Template Module\n");
}

module_init(temp_init_module);
module_exit(temp_exit_module);
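
Once the module is loaded, the driver can be exercised from user space with the system calls from Lab 1. The following is a minimal sketch, not part of the lab code: the helper name roundtrip() is hypothetical, and the device path /dev/temp mentioned afterwards assumes that udev has created a node with that name for the device registered by device_create().

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write msg at the start of the file at path, seek back to offset 0,
 * and read the same number of bytes into out.
 * Returns the number of bytes read back, or -1 on any failure.
 */
ssize_t roundtrip(const char *path, const char *msg, char *out, size_t outlen)
{
    size_t len = strlen(msg);
    ssize_t n = -1;
    int fd;

    if (len > outlen)
        return -1;

    fd = open(path, O_RDWR);                      /* reaches the driver's open() */
    if (fd < 0)
        return -1;

    if (write(fd, msg, len) == (ssize_t)len &&    /* driver's write() */
        lseek(fd, 0, SEEK_SET) == 0)              /* driver's llseek() */
        n = read(fd, out, len);                   /* driver's read() */

    close(fd);                                    /* driver's release() */
    return n;
}
```

Calling roundtrip("/dev/temp", "hello", buf, sizeof buf) should return 5 and fill buf with the bytes just written, provided the caller has permission to open the device node.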

2.6

Kernel configuration and compilation

You need the kernel source to compile your own kernel programs (or modules). It is also useful to know how to configure the
kernel to suit your own purposes (e.g. for embedded systems, or to make the kernel leaner by removing unnecessary components/modules).
The following steps guide you through reconfiguring, compiling, and installing the kernel image and its related modules.
1. Get a copy of the kernel source
The Linux kernel has many different versions. You can get a version that suits you from http://www.kernel.org. However, for the
purpose of this course, we will use linux-3.9.6.tar.xz, which is a stable release of 3.9. You can get the tarball by running:
$ scp mal@192.168.1.123:linux-3.9.6.tar.xz .

Use the password: Quack1nce4


Suppose the kernel tar ball is under your current directory (e.g. your home directory), use:
$ tar -Jxf linux-3.9.6.tar.xz

to untar the kernel and then enter the source directory.


Use the config file under /boot as your initial .config file
$ cp /boot/config-3.2.0-23-generic .config

and reconfigure the kernel according to the following instructions.


$ vim .config

Make sure the following configuration parameters are set to y:


- CONFIG_MODULES=y
- CONFIG_MODULE_UNLOAD=y
- CONFIG_MODVERSIONS=y
- CONFIG_MODULE_SRCVERSION_ALL=y

Under device drivers --- block devices:


- CONFIG_BLK_DEV_LOOP=y
- CONFIG_BLK_DEV_RAM=y
- CONFIG_BLK_DEV_NBD=y

Under device drivers --- ATA/ATAPI/MFM/RLL support:


- CONFIG_IDE_GD=y
- CONFIG_IDE_GD_ATA=y
- CONFIG_IDE_GENERIC=y
- CONFIG_BLK_DEV_GENERIC=y
- CONFIG_BLK_DEV_PIIX=y

Under file systems:


- CONFIG_AUTOFS4_FS=y

Also make sure the proper file systems are supported (ext2, ext3, ext4, etc).
- CONFIG_DCACHE_WORD_ACCESS=y
- CONFIG_EXT2_FS=y
- CONFIG_EXT2_FS_XATTR=y
- CONFIG_EXT2_FS_POSIX_ACL=y
- CONFIG_EXT2_FS_SECURITY=y
- CONFIG_EXT3_FS=y
- CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
- CONFIG_EXT3_FS_XATTR=y
- CONFIG_EXT3_FS_POSIX_ACL=y
- CONFIG_EXT3_FS_SECURITY=y
- CONFIG_EXT4_FS=y
- CONFIG_EXT4_FS_POSIX_ACL=y
- CONFIG_EXT4_FS_SECURITY=y

You can find the exact place of the options using / in vim.
2. Compile the kernel image, source, headers and kernel documentation into deb packages:
$ fakeroot make-kpkg --initrd --append-to-version -cosc440 binary

make-kpkg compiles the kernel, and packages the kernel, together with its modules into a .deb package, which you will install
later in this lab.
The option --initrd ensures that initrd will be generated after installing the package; and --append-to-version appends the
specified string to the version number.

fakeroot runs make-kpkg in an environment that simulates root privileges for file operations, allowing it to build the packages
without real root privileges. If it does not work properly, use sudo instead.
If you build on a multicore machine, you can speed up the build process by setting the environment variable CONCURRENCY_LEVEL:
export CONCURRENCY_LEVEL=<nthreads>

before running make-kpkg.


The build will take a few hours. Use the time to go through the reading material for the paper. Before starting the compilation,
you may use the screen command to run the job in the background so that a network/router failure will not affect it.
$ screen

Ctrl + a, then d will detach the job.


To bring the job to the foreground, use:
$ screen -r

Install the kernel:


$ cd ..
$ sudo dpkg -i linux-image-3.9.6-cosc440_3.9.6-cosc440-10.00.Custom_i386.deb

Install the kernel headers as you will need to compile kernel modules against your kernel:
$ sudo dpkg -i linux-headers-3.9.6-cosc440_3.9.6-cosc440-10.00.Custom_i386.deb

Install the kernel documentations:


$ sudo dpkg -i linux-doc-3.9.6-cosc440_3.9.6-cosc440-10.00.Custom_all.deb

Install the kernel API manpage (accessible by man 9 <func>):


$ sudo dpkg -i linux-manual-3.9.6-cosc440_3.9.6-cosc440-10.00.Custom_all.deb

Your kernel will be set as the default kernel for the next reboot. In case of kernel errors, you may wish to boot the previous
kernel.
You can enable the interactive GRUB menu by editing /etc/default/grub and setting the following variables:
GRUB_HIDDEN_TIMEOUT=10
GRUB_HIDDEN_TIMEOUT_QUIET=false

and uncomment
GRUB_TERMINAL=console

Then run:
$ sudo update-grub

3. Reboot your PC:


$ sudo reboot

4. Once your PC reboots, use:


$ uname -a

to check if your compiled kernel successfully boots.

2.7

Reference

1. Writing Linux Device Drivers, by Jerry Cooperstein


2. README under linux source tree
3. Ubuntu Installation Guide: https://help.ubuntu.com/11.10/installation-guide/amd64/kernel-baking.html

Chapter 3

Linked list and seeking the device, process sleeping


3.1

Linked list

The kernel provides a standard API for circular doubly-linked lists. The elementary data structure is:
#include <linux/list.h>
struct list_head {
struct list_head *next;
struct list_head *prev;
};

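In a data node struct, struct list_head is embedded as a member. It is conventionally placed first, so that its address is the
same as that of the data node struct itself, but this is not required: list_entry() (described below) recovers the enclosing
struct whatever the position of the member. For example: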
struct my_struct {
struct list_head list;
int val;
};

The list_head struct must be initialized before use:


LIST_HEAD(my_list);

or
struct my_struct me = {
    .list = LIST_HEAD_INIT(me.list),
    .val = 0,
};

or at runtime:
struct list_head list;
INIT_LIST_HEAD(&list);

Basically, the list initialization macro sets both the prev and next fields to point to the list head itself.
Here are some functions provided by the Linux kernel API to manipulate lists:
void list_add(struct list_head *new, struct list_head *head);

inserts the element pointed to by new immediately after the element pointed to by head


void list_add_tail(struct list_head *new, struct list_head *head);

inserts the element pointed to by new at the end of the list pointed to by head
void list_del(struct list_head *entry);

removes the element pointed to by entry from its list


void list_del_init(struct list_head *entry);

removes the element pointed to by entry from its list and reinitializes the list head entry. This function must be used when a
node removed from a list will be re-inserted into a different list.
int list_empty(struct list_head *head);

tests whether the list pointed to by head is empty


void list_splice(struct list_head *list, struct list_head *head);

joins two lists together by inserting the list pointed to by list immediately after head in the first list
The kernel API also provides the following helper macros:
list_entry(ptr, type, member);

returns a pointer to the enclosing data structure, where ptr points to the struct list_head embedded in that structure, type is
the type of the enclosing structure, and member is the name of the struct list_head field within that structure
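
How list_entry() gets from the embedded list node back to the enclosing structure can be shown in user space. The following is an illustration only: struct list_head is redefined locally because <linux/list.h> is not available outside the kernel, and demo_list_entry() and node_to_struct() are hypothetical names for a simplified version of the same pointer arithmetic, assumed here to be based on offsetof().

```c
#include <stddef.h>   /* offsetof */

/* User-space stand-in for the kernel's struct list_head (illustration only). */
struct list_head {
    struct list_head *next, *prev;
};

/* Subtract the member's offset from the member's address to get the start of the struct. */
#define demo_list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct my_struct {
    int val;                  /* deliberately placed before the list node */
    struct list_head list;
};

/* Recover the enclosing my_struct from a pointer to its embedded list node. */
struct my_struct *node_to_struct(struct list_head *node)
{
    return demo_list_entry(node, struct my_struct, list);
}
```

Because the offset of the member is subtracted, the list node does not have to be the first member of the structure.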
list_for_each(struct list_head *pos, struct list_head *head);

iterates the list forward. pos points to the struct list_head of the current node, and head points to the head of the list.
Since pos points to the struct list_head of the current node only rather than the current node itself,
list_entry(ptr, type, member);

must be used to get the pointer to the current node to access elements of the current node.
For example:
static LIST_HEAD(my_list);
struct my_entry {
struct list_head clist;
int val;
};

void print_list(void) {
    struct list_head *ptr;
    struct my_entry *curr;
    list_for_each(ptr, &my_list) {
        curr = list_entry(ptr, struct my_entry, clist);
        printk(KERN_INFO "val = %d\n", curr->val);
    }
}

list_for_each_prev(pos, head);

as in list_for_each(), but iterates the list backward.

list_for_each_entry(pos, head, member);

directly returns the data structure (entry pointer) that encloses the list_head node, so the code can be simplified by eliminating
the need to call list_entry() to get the pointer to the current node.
Here is the snippet above using list_for_each_entry():
static LIST_HEAD(my_list);
struct my_entry {
struct list_head clist;
int val;
};

void foo(void) {
    struct my_entry *curr;
    list_for_each_entry(curr, &my_list, clist) {
        printk(KERN_INFO "val = %d\n", curr->val);
    }
}

list_for_each_safe(struct list_head *pos, struct list_head *tmp, struct list_head *head);

handles the case where list entries are removed during iteration. pos points to the current node (it serves as the iterator); tmp
is an extra pointer of type struct list_head * that the macro uses for temporary storage; and head points to the head of your linked list.

Important
This function must be used when the current node pos may be deleted

Here is an example of how to use this function to delete all nodes in a list:
static LIST_HEAD(my_list);
struct my_entry {
struct list_head clist;
int val;
};
void del_list(void) {
    struct list_head *pos;
    struct list_head *tmp;
    struct my_entry *curr;
    list_for_each_safe(pos, tmp, &my_list) {
        curr = list_entry(pos, struct my_entry, clist);
        list_del(&curr->clist);
        printk(KERN_INFO "(exit): val %d removed\n", curr->val);
        kfree(curr);
    }
}

list_for_each_entry_safe(pos, tmp, head, member)

is like list_for_each_safe(), except that pos now points to the node itself, saving you from calling list_entry(). Here is the
above code with list_for_each_entry_safe() used instead:
static LIST_HEAD(my_list);
struct my_entry {
struct list_head clist;
int val;
};
void del_list(void) {
    struct my_entry *curr;
    struct my_entry *tmp;
    list_for_each_entry_safe(curr, tmp, &my_list, clist) {
        list_del(&curr->clist);
        printk(KERN_INFO "(exit): val %d removed\n", curr->val);
        kfree(curr);
    }
}

For more information, please read:


Writing Linux Device Drivers Chapter 7.7 P. 81

Important
For the following exercises you should use the skeleton code for Lab 3. Download the skeleton code tar ball from the
resources page if you have not done so.

3.1.1

Exercise: Basic linked-list

Write a module that sets up a doubly linked circular list of data structures. Each element contains an integer.
For this lab, elements should be allocated using:
void *kmalloc(size_t size, gfp_t flags); with flags set to GFP_KERNEL. kmalloc() is normally used to allocate small pieces of
memory. First, insert elements into the list, then traverse the list and print the value of each element.
Then, in the cleanup function, delete all the elements. Remember to use list_for_each_safe() for this purpose.
Don't forget to use kfree() to free the memory that you allocated with kmalloc().
You only need to touch the functions my_init() and my_exit().

3.1.2

Exercise: Finding tainted kernel modules

All modules loaded on the system are linked in a list that can be accessed from any module:
struct module {
.....
struct list_head list;
.....
char name[MODULE_NAME_LEN];
.....
unsigned int taints;
.....
};

Write a module that walks through this list and prints out the value of taints of all modules.
You can begin from THIS_MODULE.
The skeleton code already lists the details of THIS_MODULE (your module); when you then run the list_for_each_entry()
loop, it will loop through the rest of the loaded modules.

3.1.3

Exercise: Enabling seeking of a device

The lseek() system call allows user-space programs to change the current read/write position in a file (i.e. our device).
Now your task is to implement:
loff_t mycdev_lseek(struct file *file, loff_t offset, int whence)

struct file contains I/O related metadata of your device, and file->f_pos gives the current file position.
First, you need to find out the type of request by looking at whence, which can be one of:
SEEK_SET    position the file relative to the beginning of the file
SEEK_CUR    position the file relative to the current position
SEEK_END    position the file relative to the end of the file

After you calculate the new offset, make sure it is within the address range of the file (i.e. not negative, and not above the maximum
ramdisk size). If it is within range, set file->f_pos to the calculated offset and return the calculated offset; otherwise, return -EINVAL.
Don't forget to register mycdev_lseek() in mycdrv_fops.
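
The offset calculation described above can be prototyped and tested in user space before moving it into the driver. The following is a sketch under stated assumptions: seek_newpos() is a hypothetical helper, cur stands for file->f_pos, and size stands for the maximum ramdisk size, which is also assumed here to be the "end of the file" for SEEK_END.

```c
#include <errno.h>
#include <stdio.h>      /* SEEK_SET, SEEK_CUR, SEEK_END */

/*
 * Compute the new file position for a seek request.
 *   cur    - current position (file->f_pos in the driver)
 *   size   - maximum ramdisk size in bytes (assumed to be the end of the file)
 *   offset - offset passed by the caller
 * Returns the new position, or -EINVAL if whence is unknown or the
 * result would be negative or above size.
 */
long long seek_newpos(long long cur, long long size, long long offset, int whence)
{
    long long newpos;

    switch (whence) {
    case SEEK_SET:
        newpos = offset;
        break;
    case SEEK_CUR:
        newpos = cur + offset;
        break;
    case SEEK_END:
        newpos = size + offset;
        break;
    default:
        return -EINVAL;
    }
    if (newpos < 0 || newpos > size)
        return -EINVAL;
    return newpos;
}
```

In the driver, the value returned by such a helper would be stored in file->f_pos only when it is not negative.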
Finally, you can use the program seek_test included in the tarball to test the correct functioning of your driver.
Before you can run seek_test, you need to create a node for your module first:
$ sudo mknod /dev/mycdrv c 700 0

3.2

Suspending process - wait queue

Sometimes a process needs to wait for a condition to be fulfilled (e.g. wait for data to arrive on a peripheral device). For this
case, the kernel provides wait queues, which put a task (process) to sleep until the condition it waits for becomes true, at which
point the task must be woken up.
The data structure used by wait queues is of the type wait_queue_head_t. It is declared and initialized by:
#include <linux/sched.h>
wait_queue_head_t wq;
init_waitqueue_head(&wq);

If the wait queue is statically declared, it can be declared and initialized with the macro:
DECLARE_WAIT_QUEUE_HEAD(wq);

A wait queue must be initialized before it can be used.


These macros put a task to sleep:
#include <linux/wait.h>
wait_event(wait_queue_head_t wq, int condition);
wait_event_interruptible(wait_queue_head_t wq, int condition);
wait_event_killable(wait_queue_head_t wq, int condition);

and these corresponding functions wake up tasks on the wait queue:
void wake_up(wait_queue_head_t *wq);
void wake_up_interruptible(wait_queue_head_t *wq);

In general, the interruptible version of wait_event should be used, which returns 0 if it returns due to a wakeup call and
-ERESTARTSYS if it returns due to a signal arriving. The other versions are used in critical sections of the kernel, where it is
unacceptable to be scheduled out while holding a lock. The killable version is only interrupted when the process is killed.
The proper wake up call should be paired with the originating sleep call, except that wait_event_killable() should be paired with
wake_up().
When you use the interruptible forms, you will always have to check upon awakening whether you woke up because a signal has
arrived, or there was an explicit wakeup call. The signal_pending(current) macro can be used for this purpose.
This is a simple example of wait queue:
#include <linux/sched.h>
#include <linux/wait.h>
DECLARE_WAIT_QUEUE_HEAD(wq);
static atomic_t dataready = ATOMIC_INIT(0);   /* the condition the sleepers wait for */

static int func1( ... ) {
    ...
    printk(KERN_INFO "task %i (%s) going to sleep\n", current->pid, current->comm);
    wait_event_interruptible(wq, atomic_read(&dataready));
    printk(KERN_INFO "awoken %i (%s)\n", current->pid, current->comm);
    if (signal_pending(current))
        return -ERESTARTSYS;
    ...
    atomic_set(&dataready, 0);
}

static int func2( ... ) {
    ...
    printk(KERN_INFO "task %i (%s) awakening sleepers...\n", current->pid, current->comm);
    atomic_set(&dataready, 1);
    wake_up_interruptible(&wq);
    ...
}

The sleeping tasks we dealt with are non-exclusive and the API we discussed above will wake up all sleeping tasks. However if
more than one task is waiting for exclusive access to a resource (one that only one can use at a time), then this kind of wake up
is inefficient and leads to the thundering herd problem, where all sleepers are woken up, but only one of the sleepers can get the
resource and the others must be put back to sleep.
To address the thundering herd problem, we need an exclusive sleeping system that only wakes up one task from the wait queue
at a time. Exclusive wait can be set up by using this macro:
wait_event_interruptible_exclusive (wait_queue_head_t wq, int condition);

The usual wake up functions mentioned above can be used, but now only one task will be woken up. If more than one task from
an exclusive wait queue needs to be woken up at one time, these functions can be used:
void wake_up_all(wait_queue_head_t *wq);
void wake_up_interruptible_all(wait_queue_head_t *wq);
void wake_up_nr(wait_queue_head_t *wq, int nr);
void wake_up_sync_nr(wait_queue_head_t *wq, int nr);
void wake_up_interruptible_nr(wait_queue_head_t *wq, int nr);
void wake_up_interruptible_sync_nr(wait_queue_head_t *wq, int nr);

The functions with the suffix _all will wake up all tasks, but those with the suffix _nr will only wake up nr tasks.

3.2.1

Exercise: Using wait queues

Start from the skeleton code in 03/3.2.1/wait_event.c and get it to use wait queues.
Have the read() function go to sleep until woken by a write() function. (You could also try reversing read and write).
You may want to open up two windows and read in one window and then write in the other window.
Try putting more than one process to sleep, i.e. run your test read program more than once simultaneously before running the
write program to awaken them. If you keep track of the pids you should be able to detect what order processes are woken.
You should use exclusive wait, i.e. wait_event_interruptible_exclusive(), and any global variables used in the logical condition
should be atomic.

Chapter 4

Mutex, semaphore and the proc file system


A device is often accessed by multiple processes, so any critical section of the code that accesses shared data must be protected
by a mechanism such as a mutex, semaphore, or spinlock. A "critical section" is a section of code that may be executed by only one
process at a time. Mutexes and semaphores are discussed in this lab; the spinlock, a lock that busy-waits and is expected to be
held for only a very short time, is discussed later in this course.

4.1

Mutex

A mutex is a basic kind of sleepable locking mechanism, used for protecting critical section of the code. A process must lock the
mutex before entering the critical section and release the mutex after leaving the critical section.
Mutexes are initialized in an unlocked state with:
DEFINE_MUTEX(name);

at compile time, or:


void mutex_init(struct mutex *lock);

Locking primitives come with interruptible and uninterruptible forms:


void mutex_lock(struct mutex *lock);
int mutex_lock_interruptible(struct mutex *lock);
int mutex_lock_killable(struct mutex *lock);

The interruptible and killable forms return 0 if the lock is acquired; mutex_lock() returns nothing.


There is only one unlocking primitive:
void mutex_unlock(struct mutex *lock);

A process waiting in mutex_lock_interruptible() can be interrupted by any signal, while a process waiting in
mutex_lock_killable() can be interrupted only by fatal signals. Both functions return -EINTR if they are interrupted by a signal.
A process waiting in mutex_lock() is not affected by signals (i.e. it is unkillable), and in most cases it is a bad idea to use
this unkillable version because the only way to terminate a blocked process is a system reboot.
int mutex_trylock(struct mutex *lock);

is the non-blocking version of mutex_lock(), which always returns immediately with 1 on success and 0 on contention.
Here are some rules on the use of mutexes:
1. The mutex must be released by the original owner.
2. The mutex cannot be applied recursively (i.e. a process cannot acquire the mutex again without releasing it first).
3. The mutex cannot be locked or unlocked from interrupt context.
For example:
DEFINE_MUTEX(my_mutex);
static ssize_t
mycdrv_read(struct file *file, char __user * buf, size_t lbuf, loff_t * ppos)
{
    printk(KERN_INFO "process %i (%s) going to sleep\n", current->pid,
           current->comm);
    if (mutex_lock_interruptible(&my_mutex)) {
        printk(KERN_INFO "process %i woken up by a signal\n",
               current->pid);
        return -ERESTARTSYS;
    }
    printk(KERN_INFO "process %i (%s) awakening\n", current->pid,
           current->comm);
    return mycdrv_generic_read(file, buf, lbuf, ppos);
}

static ssize_t
mycdrv_write(struct file *file, const char __user * buf, size_t lbuf,
             loff_t * ppos)
{
    int nbytes = mycdrv_generic_write(file, buf, lbuf, ppos);
    printk(KERN_INFO "process %i (%s) awakening the readers...\n",
           current->pid, current->comm);
    mutex_unlock(&my_mutex);
    return nbytes;
}

4.1.1

Exercise: Mutex

Write three simple modules where the second and the third one use a variable exported from the first one. The second and third
one can be identical, but having different module names.
Hint: You can use the macro __stringify(KBUILD_MODNAME) to print out the module name.
The exported variable should be a mutex. Have the first module initialize it in the unlocked state.
A variable from a module can be exported using the macro:
EXPORT_SYMBOL(my_variable);

so in module 1, if you want to declare, initialize and export the mutex, you would do this in the global scope of module 1:
DEFINE_MUTEX(my_mutex);
EXPORT_SYMBOL(my_mutex);

The second and third modules should attempt to lock the mutex during initialization. If the mutex is already locked, the module
must not be loaded: the initialization function should return an appropriate error value so that loading fails.
Make sure the mutex is released in the cleanup (module_exit) function in the module.
Test by trying to load both modules simultaneously, and see if it is possible. Make sure you can load one of the modules after the
other has been unloaded, to make sure you released the mutex properly.

4.2

Semaphore

Semaphores are also used in protecting access to critical sections in kernel code. There are two types of semaphore structures:
struct semaphore, and struct rw_semaphore, the latter of which supports single-writer multiple-readers (SWMR) policy.
Semaphores can be initialized with:
DEFINE_SEMAPHORE(name);

and rw_semaphore can be initialized with:


DECLARE_RWSEM(name);

Here are the semaphore primitives:


#include <linux/semaphore.h>
void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_killable(struct semaphore *sem);
int down_trylock(struct semaphore *sem);    /* non-blocking */
void up(struct semaphore *sem);

Here are the read-write semaphore primitives:


#include <linux/rwsem.h>
void down_read(struct rw_semaphore *sem);
void down_write(struct rw_semaphore *sem);
void up_read(struct rw_semaphore *sem);
void up_write(struct rw_semaphore *sem);

The down() function checks to see if someone else has already entered the critical code section; if the value of the semaphore is
greater than zero, it decrements it and returns. If it is already zero, it will sleep and try again later.
The down_interruptible() function differs in that it can be interrupted by a signal; down() cannot be interrupted by signals,
and should be used only with great caution. However, if you use the interruptible form you have to check whether a signal arrived,
so you'll have code like:
if (down_interruptible(&sem)) return -ERESTARTSYS;

which tells the system to either retry the system call or return -EINTR to the application.
The down_killable() function can only be interrupted by fatal signals, so it will not be interrupted unless the signal is intended to
terminate the program. Therefore in most cases this version should be used to ensure that the calling user-space program is killable;
otherwise the only way to terminate a deadlocked process is to reboot the system.
The down_trylock() form checks if the semaphore is available, and if not, returns a non-zero value immediately without blocking
(which is why it does not need an interruptible form) and if the semaphore is available, return 0. For instance, a typical read entry
from a driver may contain:
...
if (file->f_flags & O_NONBLOCK) {
    if (down_trylock(&iosem))
        return -EAGAIN;
} else {
    if (down_interruptible(&iosem))
        return -ERESTARTSYS;
}

The up() function increments the semaphore value and wakes up a process waiting on the semaphore, if there is one. It never sleeps,
so it doesn't need an _interruptible form.
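
The return conventions are easy to get backwards, because down_trylock() returns 0 on success, the opposite of mutex_trylock() and spin_trylock(). The following user-space sketch models only the counter logic. The names toy_sem, toy_down_trylock() and toy_up() are hypothetical and not part of the kernel API, and the model does no sleeping or waking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-threaded model of a counting semaphore. */
struct toy_sem {
    unsigned int count;   /* number of holders still allowed in */
};

/* Model of down_trylock(): 0 if acquired, 1 if it would have to sleep. */
int toy_down_trylock(struct toy_sem *sem)
{
    assert(sem != NULL);
    if (sem->count == 0)
        return 1;         /* busy: a real down() would sleep here */
    sem->count--;
    return 0;
}

/* Model of up(): release one slot (the kernel would also wake a waiter). */
void toy_up(struct toy_sem *sem)
{
    assert(sem != NULL);
    sem->count++;
}
```

With a count of 1, the first toy_down_trylock() succeeds, the second reports "busy", and a toy_up() makes the semaphore available again, which is the behaviour the O_NONBLOCK branch of the driver snippet relies on.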
Here is an example showing the use of semaphores in a kernel module:


DEFINE_SEMAPHORE(my_sem);

static ssize_t
mycdrv_read(struct file *file, char __user * buf, size_t lbuf, loff_t * ppos)
{
    printk(KERN_INFO "process %i (%s) going to sleep\n", current->pid,
           current->comm);
    if (down_interruptible(&my_sem)) {
        printk(KERN_INFO "process %i woken up by a signal\n",
               current->pid);
        return -ERESTARTSYS;
    }
    printk(KERN_INFO "process %i (%s) awakening\n", current->pid,
           current->comm);
    return mycdrv_generic_read(file, buf, lbuf, ppos);
}

static ssize_t
mycdrv_write(struct file *file, const char __user * buf, size_t lbuf,
             loff_t * ppos)
{
    ssize_t nbytes = mycdrv_generic_write(file, buf, lbuf, ppos);

    printk(KERN_INFO "process %i (%s) awakening the readers...\n",
           current->pid, current->comm);
    up(&my_sem);
    return nbytes;
}

4.2.1

Exercise: Semaphore

Replace the mutex in the last lab with semaphores.


Please note that the macro used to statically declare and initialize a semaphore in an unlocked state is:
DEFINE_SEMAPHORE(name);

4.2.2

Exercise: Device-private data

In struct file, there is a field:


void *private_data;

which can point to a piece of data that is private to the process that opens the device.
Now you should convert the results of Exercise 3.1.3 to use private data. Each process that opens the device now has its own
private ramdisk, so the ramdisk structure should be allocated in open(), pointed to by file->private_data, and freed in release() (the entry point called when the file is closed).
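
The per-open pattern asked for in this exercise can be modelled in user space. The sketch below is only an analogy: struct toy_file, toy_open(), toy_release() and TOY_RAMDISK_SIZE are hypothetical names, and calloc()/free() stand in for the kernel allocator you would use in a driver.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_RAMDISK_SIZE 64   /* hypothetical size, for illustration only */

struct toy_file {
    void *private_data;       /* mirrors the field in struct file */
};

/* Model of open(): give this open instance its own zeroed buffer. */
int toy_open(struct toy_file *file)
{
    assert(file != NULL);
    file->private_data = calloc(1, TOY_RAMDISK_SIZE);
    return file->private_data ? 0 : -1;
}

/* Model of release(): free what open() allocated. */
void toy_release(struct toy_file *file)
{
    assert(file != NULL);
    free(file->private_data);
    file->private_data = NULL;
}
```

Two toy_file instances opened this way get distinct buffers, so a write through one is not visible through the other. In a real driver the failure path would return -ENOMEM rather than -1.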

4.3

/proc entry

/proc is a virtual filesystem allowing devices to display their information. Each device needs to create an entry under /proc to
make use of the /proc filesystem to display its information. In addition, writing to a /proc entry can set system parameters and
modify device functionality.


4.3.1

Creating entries

Creating, managing and removing entries in the proc filesystem is done with these functions:
#include <linux/proc_fs.h>
struct proc_dir_entry *create_proc_entry (const char *name, mode_t mode,
struct proc_dir_entry *parent);
void remove_proc_entry (const char *name, struct proc_dir_entry *parent);
struct proc_dir_entry *proc_symlink (const char *name, struct proc_dir_entry *parent,
const char *dest);
struct proc_dir_entry *proc_mkdir (const char *name, struct proc_dir_entry *parent);

The name argument gives the name of the directory entry, which will be created with the permissions contained in the mode
argument. parent is the proc entry of the parent subdirectory which the current proc entry resides in. If the parent argument is
NULL, the entry will go in the /proc main directory. mode sets the permission of the proc entry (please see lab 2 for details on
file mode).
The function proc_symlink() creates a symbolic link; it is equivalent to doing:
$ ln -s <dest> <name>

The function proc_mkdir() creates directory name under parent.


The parent directory can be something you created with proc_mkdir(). If you want to put the entry in an already-created subdirectory
of /proc, such as /proc/driver, you can instead give the path in the name and pass NULL as the parent:
my_proc = create_proc_entry("driver/my_proc", 0, NULL);

4.3.2

Reading entries

When a process tries to read an entry in the proc filesystem, it causes invocation of the read callback function associated with the
directory entry; i.e., you would have something like:
static struct proc_dir_entry *my_proc_entry;
...
my_proc_entry = create_proc_entry ("my_proc", 0, NULL);
my_proc_entry->read_proc = my_proc_read;

perhaps in init_module(), where the read callback function my_proc_read() has been previously defined. This has an integer
return type and its prototype definition is given by:
typedef int (read_proc_t)(char *page, char **start, off_t off,
int count, int *eof, void *data);

When someone tries to read the entry, the information will be written into the page argument at an offset of off, writing at most
count bytes. For reading just a few bytes, the callback function usually ignores these arguments.
The eof argument is only used when off and count are used; it should signal the end of the file with a 1. The start argument is a
left-over legacy from earlier implementation and is not used. The data argument can be used to create a single callback function
for multiple proc entries, or for other purposes.
Alternatively, the function create_proc_read_entry() can be used to create a read-only proc entry in one step:
struct proc_dir_entry *create_proc_read_entry (const char *name,
mode_t mode,
struct proc_dir_entry *parent,
read_proc_t *read_proc,
void *data);


where read_proc is the read callback function.


When successful, your read function should return the number of bytes written into the buffer pointed to by page. Here is a
simple example of a module using a proc read callback:
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/init.h>
#include <linux/version.h>
#include <linux/jiffies.h>

static int x_delay = 1;     /* the default delay */

static int x_read_busy (char *buf, char **start, off_t offset, int len,
                        int *eof, void *unused) {
    unsigned long j = jiffies + x_delay * HZ;
    while (time_before(jiffies, j))
        /* nothing */ ;
    *eof = 1;
    return sprintf(buf, "jiffies = %d\n", (int)jiffies);
}

static struct proc_dir_entry *x_proc_busy;

static int __init my_init(void) {
    x_proc_busy = create_proc_entry("x_busy", 0, NULL);
    if (NULL == x_proc_busy) {
        printk(KERN_ALERT "Error: Could not initialize /proc/x_busy\n");
        return -ENOMEM;
    }
    x_proc_busy->read_proc = x_read_busy;
    return 0;
}

static void __exit my_exit(void) {
    if (x_proc_busy)
        remove_proc_entry("x_busy", NULL);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE ("GPL v2");

4.3.3

Writing entries

When a process tries to write data to an entry in the proc filesystem, it causes invocation of the write callback function associated
with the directory entry; i.e. you would have something like:
static struct proc_dir_entry *my_proc_entry;
...
my_proc_entry = create_proc_entry ("my_proc", 0, NULL);
my_proc_entry->write_proc = my_proc_write;

where the write callback function, my_proc_write(), has been previously defined. This has an integer return type and its prototype
definition is given by:
typedef int (write_proc_t)(struct file *file, const char __user *buffer, unsigned long
count, void *data);


This function will read count bytes (at most) from the location pointed to by buffer.
The file argument is generally unused, and once again the location pointed to by the data argument can be used when a single
callback function is used for multiple file entries or for other purposes.
It is important to note that buffer is a user space pointer; thus you must use a function like copy_from_user() to obtain its contents.
Once you have the contents you can put them to use in your kernel functions as needed.
Note that usually /proc entries are text, not binary. This means to convert user-space input into usable form, you may require the
services of functions like atoi(). However, these are not defined in the kernel. Instead you need to use the following functions,
defined in /usr/src/linux/lib/vsprintf.c:
long simple_strtol (const char *cp, char **endp, unsigned int base);
unsigned long simple_strtoul (...);
unsigned long long simple_strtoull (...);
long long simple_strtoll (...);

all of which have the same arguments. The first argument is a pointer to the string to convert, the second is a pointer that is set to the end
of the parsed string, and the third is the number base to use; giving 0 makes the function pick the base from the prefix of the string
(0x for hexadecimal, a leading 0 for octal, decimal otherwise). For a plain decimal string the following statements are therefore
equivalent:
long j = simple_strtol ("-1000", NULL, 10);
long j = simple_strtol ("-1000", 0, 0);
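
The effect of the base argument can be tried out in user space, where strtol() follows the same prefix rules for base 0. This is a small sketch, assuming that the kernel helpers behave like their C library counterparts in this respect; parse_with_base() is a hypothetical wrapper introduced only for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Thin wrapper so that the base is the only thing that varies. */
long parse_with_base(const char *text, int base)
{
    assert(text != NULL);
    return strtol(text, NULL, base);
}
```

With base 0, "0x10" is read as hexadecimal and "010" as octal, while with base 10 the string "010" is simply ten. A plain decimal string such as "-1000" gives the same result with either base.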

You can also do the format conversion using the kernel implementation of sscanf().
Here is a code snippet showing the use of the proc filesystem:
....
....
static int
my_proc_write(struct file *file, const char __user * buffer,
              unsigned long count, void *data)
{
    char *str;

    /* one extra byte for the terminating NUL that sscanf() needs */
    str = kmalloc((size_t) count + 1, GFP_KERNEL);
    if (!str)
        return -ENOMEM;
    if (copy_from_user(str, buffer, count)) {
        kfree(str);
        return -EFAULT;
    }
    str[count] = '\0';
    sscanf(str, "%d", &param);
    printk(KERN_INFO "param has been set to %d\n", param);
    kfree(str);
    return count;
}

static int __init my_init(void)
{
    my_proc = create_proc_entry(NODE, S_IRUGO | S_IWUSR, NULL);
    if (!my_proc) {
        printk(KERN_ERR "I failed to make %s\n", NODE);
        return -ENOMEM;
    }
    printk(KERN_INFO "I created %s\n", NODE);
    my_proc->read_proc = my_proc_read;
    my_proc->write_proc = my_proc_write;
    return 0;
}
....
....
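
Data written to a /proc entry arrives as count raw bytes without a terminating NUL, which matters as soon as a string function such as sscanf() is applied to it. The following user-space sketch shows one way to handle this; parse_int_from_buffer() and TOY_MAX_INPUT are hypothetical names, and memcpy() stands in for copy_from_user().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_INPUT 32   /* hypothetical upper bound on accepted input */

/*
 * Parse a decimal integer from a buffer that is NOT NUL-terminated.
 * Returns 0 on success, -1 if the input is too long or not a number.
 */
int parse_int_from_buffer(const char *buffer, size_t count, int *value)
{
    char str[TOY_MAX_INPUT + 1];

    assert(buffer != NULL && value != NULL);
    if (count > TOY_MAX_INPUT)
        return -1;
    memcpy(str, buffer, count);   /* stands in for copy_from_user() */
    str[count] = '\0';            /* sscanf() needs a terminated string */
    return sscanf(str, "%d", value) == 1 ? 0 : -1;
}
```

Bounding the accepted length also keeps a user from making the kernel allocate an arbitrarily large buffer for what should be a short text parameter.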


4.3.4

Exercise: using the proc filesystem

Write a module that creates a /proc filesystem entry and can read and write to it.
When you read from the entry, you should obtain the value of some parameter set in your module.
When you write to the entry, you should modify that value, which should then be reflected in a subsequent read.
Make sure you remove the entry when you unload the module. What happens if you don't and you try to access the entry after
the module has been removed?
The solution shows how to create the entry in the /proc directory and also in the /proc/driver directory.

4.3.5

Exercise: Making your own subdirectory in /proc

Write a module that creates your own proc filesystem subdirectory and creates at least two entries under it.
As in the first exercise, reading an entry should obtain a parameter value, and writing it should reset it.
You may use the data element in the proc_dir_entry structure to use the same callback functions for multiple entries.


Chapter 5

Memory management

Important
Please revise materials in lecture 1 before attempting this lab.

5.1

kmalloc()

For simple allocation/freeing memory in kernel space, one would use:


#include <linux/slab.h>
void *kmalloc (size_t len, gfp_t gfp_mask);
void kfree (const void *ptr);

where gfp_mask is normally set to one of the following:

GFP_KERNEL    Block and go to sleep if the memory is not immediately available, allowing preemption to occur. This
              is the normal way of calling kmalloc().
GFP_ATOMIC    Return immediately if no pages are available. For instance, this might be done when kmalloc() is being
              called from an interrupt, where sleep would prevent receipt of other interrupts.
GFP_DMA       For buffers to be used with ISA DMA devices; it is ORed with GFP_KERNEL or GFP_ATOMIC. Ensures the
              memory is contiguous and falls under MAX_DMA_ADDRESS=16MB for ISA devices; for PCI this is
              unnecessary. The exact meaning of this flag is platform dependent.

Like malloc() in user space, kmalloc() returns the base address of a piece of kernel memory of at least len bytes, or
NULL on failure.
The in_interrupt() macro can be used to check whether you are in an interrupt context; similarly, in_atomic() checks whether
you are in an atomic context, where preemption (and therefore sleeping) is not allowed.
For example:
char *buffer = kmalloc(nbytes, in_interrupt() ? GFP_ATOMIC : GFP_KERNEL);


In this example, GFP_ATOMIC is used in interrupt context; otherwise the less expensive GFP_KERNEL
is used instead.
Because GFP_ATOMIC is allowed to take more memory resources than GFP_KERNEL in order to lessen the chance of failure,
it should not be used unless necessary.
Memory allocated by kmalloc() can be resized by:
void *krealloc(const void *p, size_t new_size, gfp_t flags);

kzalloc():
void *kzalloc(size_t size, gfp_t flags);

works like kmalloc(), but also zeroes the memory.

kmalloc() returns a memory chunk whose size is a power of 2 that matches or exceeds len, and returns NULL upon failure.
The maximum size allocatable by kmalloc() is 1024 pages, or 4MB on x86. Generally, for requests larger than 64kB, one should
use the __get_free_pages() functions to ensure inter-platform compatibility.
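
To get a feel for how much memory a request really consumes under the power-of-2 rule, the rounding can be sketched in user space. This is a simplified model only: toy_kmalloc_size(), the 32-byte minimum and the 4MB cap are assumptions for illustration, and real kernels may offer additional size classes depending on their configuration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MIN_SIZE 32UL                /* assumed smallest size class */
#define TOY_MAX_SIZE (4UL * 1024 * 1024) /* 1024 pages of 4 kB */

/* Smallest power of two >= len, within the assumed limits. */
size_t toy_kmalloc_size(size_t len)
{
    size_t size = TOY_MIN_SIZE;

    assert(len > 0 && len <= TOY_MAX_SIZE);
    while (size < len)
        size <<= 1;
    return size;
}
```

Under this model a request for 129 bytes occupies 256 bytes, which is why structures sized just above a power of 2 are worth trimming.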

5.2

__get_free_pages()

To allocate (and free) entire pages (or multiple pages) at once, one can use:
#include <linux/mm.h>
unsigned long get_zeroed_page (gfp_t gfp_mask);
unsigned long __get_free_page (gfp_t gfp_mask);
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned long order);
void free_page (unsigned long addr);
void free_pages(unsigned long addr, unsigned long order);

__get_free_page() returns the base address of a page in kernel space that can be used directly by the kernel. The gfp_mask is
the same as for kmalloc().
__get_free_pages() is like __get_free_page(), but allocates 2^order consecutive pages in kernel space; that is, order is the base-2
logarithm of the number of pages. The maximum number of pages allocatable by __get_free_pages() is 1024 (i.e. order = 10). Here is an example:
/* allocates 2 ^ 5 = 32 pages of kernel memory */
tty->read_buf = (unsigned char *)__get_free_pages(in_interrupt() ? GFP_ATOMIC : GFP_KERNEL, 5);
if (!tty->read_buf) return -ENOMEM;
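
When the request is known in bytes rather than pages, the order has to be computed first. The kernel provides get_order() for this; the user-space sketch below only illustrates the arithmetic. toy_get_order() is a hypothetical name, and the 4 kB page size and the maximum order of 10 are assumptions taken from the x86 figures in this section.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL   /* assumption: 4 kB pages, as on x86 */
#define TOY_MAX_ORDER 10       /* assumption: at most 1024 pages */

/* Smallest order such that (TOY_PAGE_SIZE << order) >= bytes. */
unsigned int toy_get_order(size_t bytes)
{
    unsigned int order = 0;

    assert(bytes > 0 && bytes <= (TOY_PAGE_SIZE << TOY_MAX_ORDER));
    while ((TOY_PAGE_SIZE << order) < bytes)
        order++;
    return order;
}
```

A request one byte larger than a page already needs order 1, so two pages are allocated; the unused remainder is the price of the power-of-2 granularity.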
ALLOC_PAGE()

In cases where you might need to deal with high memory, which does not have a constant address in kernel space, you will need
to deal with struct page directly. For example in assignment 1, you will need to get the page frame number out of each page
allocated in order to map the page with mmap(). In these cases, you should allocate and free the page using:
struct page *alloc_page(unsigned int flags);
void __free_page(struct page *page);

alloc_page() works like __get_free_pages(), except the internal kernel structure struct page is returned instead of the kernel base
address of the page.
Here is an example showing how to add a page to your page list:


/**
 * The node structure for the memory page linked list.
 */
typedef struct node {
    struct list_head list;
    struct page *asgn1_page;
} NODE;

struct list_head mem_list;

size_t foo() {
    size_t size_written = 0;
    struct list_head *ptr = mem_list.next;
    .....
    NODE *curr;
    .....
    if (ptr == &mem_list) {
        /* need to add a page to the page list of your module */
        curr = kmalloc(sizeof(NODE), GFP_KERNEL);
        /* check curr is not NULL...... */
        curr->asgn1_page = alloc_page(GFP_KERNEL);
        if (NULL == curr->asgn1_page) {
            printk(KERN_WARNING "Not enough memory left\n");
            kfree(curr);    /* do not leak the node */
            return size_written;
        }
        list_add_tail(&(curr->list), &mem_list);
        num_pages++;
        ptr = mem_list.prev;
    }
    ....
    return size_written;
}

To get the page frame number from a struct page you can use the function:
unsigned long page_to_pfn(struct page *page);

and to get the kernel virtual address of a page, you can use:
void *page_address(struct page *page);
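
A page frame number is simply the physical address of the page divided by the page size. The following user-space sketch shows that arithmetic only; the toy_ names are hypothetical, and TOY_PAGE_SHIFT = 12 assumes 4 kB pages. In the kernel you would use page_to_pfn() and the PAGE_SHIFT constant rather than code like this.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4 kB pages */

/* Page frame number of the page that contains physical address phys. */
uint64_t toy_phys_to_pfn(uint64_t phys)
{
    return phys >> TOY_PAGE_SHIFT;
}

/* Physical address of the first byte of page frame pfn. */
uint64_t toy_pfn_to_phys(uint64_t pfn)
{
    assert(pfn <= (UINT64_MAX >> TOY_PAGE_SHIFT));
    return pfn << TOY_PAGE_SHIFT;
}
```

Every address inside one page yields the same frame number, which is why remapping works on whole pages.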

Here is an example showing how to copy data from user space to a page:
ssize_t foo(const char __user* buf, size_t count) {
    /* min_t() avoids a type mismatch between size_t and PAGE_SIZE */
    size_t size_to_copy = min_t(size_t, count, PAGE_SIZE);
    size_t size_not_copied;
    struct page *my_page = alloc_page(GFP_KERNEL);

    /* check for errors ... */

    /* copy from user space to the beginning of the
       page we have just allocated */
    size_not_copied = copy_from_user(page_address(my_page), buf, size_to_copy);
    .......
}

5.3

vmalloc()

vmalloc() allocates a contiguous memory region in the virtual address space. This is the API for allocating and freeing memory
with the vmalloc approach:
#include <linux/vmalloc.h>
void *vmalloc(unsigned long size);
void vfree(void *ptr);

vmalloc() returns a pointer to a linear memory area of at least size bytes, or NULL if an error occurs.
vmalloc() cannot be used when a real physical address is needed (such as for DMA) and cannot be used at interrupt time. vmalloc()
can allocate more memory than __get_free_pages() because the allocated memory need not be consecutive in physical memory;
the kernel nevertheless sees the memory as a contiguous range of addresses. The resulting virtual addresses are higher than the top of the
physical memory.
vmalloc() has a higher overhead than get_free_pages(), so vmalloc() should not be used for small requests. Here is a code
example:
in_buf[dev] = (struct mbuf *)vmalloc(sizeof(struct mbuf));
if (in_buf[dev] == NULL) {
    printk(KERN_WARNING "Can't allocate buffer in_buf\n");
    my_dev[dev]->close(dev);
    return -EIO;
}

Note that we do not recommend using vmalloc() in our memory devices.

5.4

slabs and cache allocations

If you want to allocate memory for objects that are smaller than a page, you don't want to waste space by requesting whole pages,
and you need to create and destroy such objects (all of the same size) frequently, it is more efficient to allocate your own
pool of memory and set up your own caching system than to repeatedly allocate and free these objects through the Linux
kmalloc() system. Linux provides the slab allocator interface for this purpose. As part of the scheme, you can create a
special memory pool and add and remove objects. The kernel can dynamically shrink the cache if it has memory needs elsewhere,
but it does not have to re-allocate a new object every time you need one, as long as there are still wholly or partially unused slabs
in the cache.
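
The benefit of recycling same-sized objects can be illustrated with a small user-space analogy. This is not how the kernel slab allocator is implemented: struct toy_cache and its functions are hypothetical, malloc() stands in for the page allocator, and there are no slabs, only a free list of returned objects.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical fixed-size object cache; freed objects go on a free list. */
struct toy_cache {
    size_t obj_size;      /* size of every object in this cache */
    void *free_list;      /* singly linked list of recycled objects */
};

void toy_cache_init(struct toy_cache *cache, size_t obj_size)
{
    assert(cache != NULL);
    /* each free object must be able to hold the "next" pointer */
    cache->obj_size = obj_size < sizeof(void *) ? sizeof(void *) : obj_size;
    cache->free_list = NULL;
}

void *toy_cache_alloc(struct toy_cache *cache)
{
    void *obj = cache->free_list;

    if (obj) {                         /* reuse a recycled object */
        cache->free_list = *(void **)obj;
        return obj;
    }
    return malloc(cache->obj_size);    /* cache empty: go to the allocator */
}

void toy_cache_free(struct toy_cache *cache, void *obj)
{
    *(void **)obj = cache->free_list;  /* push onto the free list */
    cache->free_list = obj;
}

/* Release everything still parked on the free list. */
void toy_cache_destroy(struct toy_cache *cache)
{
    while (cache->free_list) {
        void *next = *(void **)cache->free_list;
        free(cache->free_list);
        cache->free_list = next;
    }
}
```

An object that is freed and then allocated again comes straight from the free list without a trip to the underlying allocator, which is the saving a memory cache is meant to provide.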
The following functions can create, shrink and destroy your own memory cache:
#include <linux/slab.h>
struct kmem_cache *kmem_cache_create(const char *name, size_t size,
                                     size_t offset, unsigned long flags,
                                     void (*ctor)(void *));
int kmem_cache_shrink(struct kmem_cache *cache);
void kmem_cache_destroy(struct kmem_cache *cache);


where name serves to identify the cache in the system, viewable in /proc/slabinfo. All objects in the cache must have the same
size, which is specified by the parameter size that cannot be more than 1024 pages (4MB on x86).
offset indicates alignment, or offset into the page for the objects you are allocating (normally set to 0).
The flags argument is a bitmask of choices given in /usr/src/linux/include/linux/slab.h
SLAB_HWCACHE_ALIGN   Force alignment of data objects on cache lines. Improves performance but may waste
                     memory. Should be set for critical performance code.
SLAB_POISON          Fill the slab layer with the known value a5a5a5a5. Good for catching access to
                     uninitialized memory.
SLAB_RED_ZONE        Surround allocated memory with red zones that scream when touched, to detect
                     buffer overruns.
SLAB_PANIC           Causes system panic upon allocation failure.
SLAB_DEBUG_FREE      Perform expensive checks on freeing objects.
SLAB_CACHE_DMA       Make sure the allocation is in the DMA zone.

ctor is an optional constructor function used to initialize any objects before they are used. If not provided, this parameter is set
to NULL.
To view slab caches currently allocated in your system, you can use slabtop or vmstat -m. Read their manpage entries for more
information.
Once the cache is set up, you can allocate / free objects by:
void *kmem_cache_alloc(struct kmem_cache *cache, gfp_t gfp_mask);
void kmem_cache_free(struct kmem_cache *cache, void *);

You can use the function kmem_cache_shrink() to release unused objects. When you finish using the memory cache, you must
free it up by calling kmem_cache_destroy(), otherwise resources will not be freed. This function will fail if there are still objects
in use.
For example:
#include <linux/module.h>
#include <linux/slab.h>
static int size = PAGE_SIZE;
static struct kmem_cache *my_cache;
module_param(size, int, S_IRUGO);
static int mycdrv_open(struct inode *inode, struct file *file)
{
/* allocate a memory cache object */
if (!(ramdisk = kmem_cache_alloc(my_cache, GFP_ATOMIC))) {
printk(KERN_ERR " failed to create a cache object\n");
return -ENOMEM;
}
printk(KERN_INFO " successfully created a cache object\n");
return mycdrv_generic_open(inode, file);
}
static int mycdrv_release(struct inode *inode, struct file *file)
{
/* destroy a memory cache object */
kmem_cache_free(my_cache, ramdisk);
printk(KERN_INFO "destroyed a memory cache object\n");
printk(KERN_INFO " closing character device: %s:\n\n", MYDEV_NAME);
return 0;
}


static int __init my_init(void)
{
    /* create a memory cache with blocks of specific size */
    if (size > (1024 * PAGE_SIZE)) {
        printk(KERN_INFO
               " size=%d is too large; you can't have more than 1024 pages!\n",
               size);
        return -EINVAL;
    }
    if (!(my_cache = kmem_cache_create("mycache", size, 0,
                                       SLAB_HWCACHE_ALIGN, NULL))) {
        printk(KERN_ERR "kmem_cache_create failed\n");
        return -ENOMEM;
    }
    printk(KERN_INFO "allocated memory cache correctly\n");
    ramdisk_size = size;
    ........
    return 0;
}

static void __exit my_exit(void)
{
    ........
    (void)kmem_cache_destroy(my_cache);
}

module_init(my_init);
module_exit(my_exit);

5.4.1

Exercise: memory caches

In one of the modules you previously created that uses kmalloc(), implement memory cache and make one of the kmalloc() calls
use your memory cache instead. Make sure you free any slabs you create.

5.4.2

Exercise: Testing maximum memory allocation (optional)

See how much memory you can obtain dynamically, using both kmalloc() and __get_free_pages().
Start by requesting 1 page of memory, and then keep doubling until your request fails for each type. Make sure you free
any memory you receive.
You will probably want to use GFP_ATOMIC rather than GFP_KERNEL. (Why?)
If you have trouble getting enough memory due to memory fragmentation, try writing a poor man's de-fragmenter and then
run the test again. The de-fragmenter can just be a module that grabs all available memory, uses it and then releases it when done,
thereby clearing the caches. You can also try the command:
$ sync; echo 3 > /proc/sys/vm/drop_caches

Try the same thing with vmalloc(). Rather than doubling allocations, start at 4MB and increase in 4MB increments until failure
results. Note this may hang while loading. (Why?)
Kernel code cannot directly access user-space memory and user-space programs also cannot directly access kernel memory. To
transfer data between user and kernel spaces, one must use copying constructs, or memory mapping.


5.5

copying data across user/kernel space

To copy the value of a variable from the user space to kernel space, one can use:
#include <linux/uaccess.h>
int get_user(lvalue, ptr);

which copies the value pointed to by ptr in user space to the kernel-space variable lvalue. It returns 0 for success and -EFAULT otherwise.
Here is an example in which data from user space is copied to kernel memory byte by byte:
static inline ssize_t
mycdrv_write(struct file *file, const char __user * buf, size_t lbuf,
loff_t * ppos)
{
int nbytes = 0, maxbytes, bytes_to_do;
char *tmp = ramdisk + *ppos;
maxbytes = ramdisk_size - *ppos;
bytes_to_do = maxbytes > lbuf ? lbuf : maxbytes;
if (bytes_to_do == 0)
printk(KERN_INFO "Reached end of the device on a write");
while ((nbytes < bytes_to_do) && !get_user(*tmp, (buf + nbytes))) {
nbytes++;
tmp++;
}
*ppos += nbytes;
printk(KERN_INFO "\n Leaving the WRITE function, nbytes=%d, pos=%d\n",
nbytes, (int)*ppos);
return nbytes;
}

To copy the value from the kernel space to a variable in user space, one can use:
#include <linux/uaccess.h>
int put_user(expr, ptr);

which copies the value of expr in kernel space to the user-space location pointed to by ptr. It returns 0 for success and -EFAULT otherwise.
To copy a chunk of memory across space, one would use:
#include <linux/uaccess.h>
unsigned long copy_to_user (void __user * to, const void * from, unsigned long n);
unsigned long copy_from_user (void * to, const void __user * from, unsigned long n);

copy_from_user() copies n bytes from the user-space memory pointed to by from to the kernel-space memory pointed to
by to. It returns the number of bytes that could not be transferred, so 0 means success; it never returns a negative error code itself, and a
driver usually reports a failed copy to its caller as -EFAULT.
copy_to_user() copies data from the kernel space to the user space, with the same return convention.
The caller must check the return value. If some data were not transferred, the caller can make the call again
to transfer the remaining data and check the result again; if a call transfers nothing at all, the user pointer is bad and the caller
should give up with -EFAULT. This is best handled by a do-while loop, as illustrated by the following example.
In this example ioctl() determines the size of the data, then uses copy_from_user() and copy_to_user() to transfer data between
user space and kernel space.
#include <linux/uaccess.h>
#include <linux/module.h>
#include <linux/slab.h>     /* kmalloc, kfree */
#define MYIOC_TYPE 'k'

static inline long
mycdrv_unlocked_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
{
    int i, direction;
    int size, left;
    long rc = 0;
    char *buffer;
    char __user *ioargp = (char __user *)arg;
    int copied = 0;

    /* make sure it is a valid command */
    if (_IOC_TYPE(cmd) != MYIOC_TYPE) {
        printk(KERN_WARNING " got invalid case, CMD=%d\n", cmd);
        return -EINVAL;
    }
    /* get the size of the buffer and kmalloc it */
    size = _IOC_SIZE(cmd);
    buffer = kmalloc((size_t) size, GFP_KERNEL);
    if (!buffer) {
        printk(KERN_ERR "Kmalloc failed for buffer\n");
        return -ENOMEM;
    }
    /* fill it with X */
    memset(buffer, 'X', size);
    left = size;    /* bytes still to transfer; size is kept for printing */
    direction = _IOC_DIR(cmd);
    switch (direction) {
    case _IOC_WRITE:
        printk(KERN_INFO
               " reading = %d bytes from user-space and writing to device\n",
               size);
        do {
            rc = copy_from_user(buffer + copied, ioargp + copied, left);
            printk(KERN_INFO "rc from copy_from_user = %ld\n", rc);
            if (rc && rc == left) {     /* no progress: bad user pointer */
                kfree(buffer);
                return -EFAULT;
            }
            copied += left - rc;
            left = rc;
        } while (left > 0);
        break;
    case _IOC_READ:
        printk(KERN_INFO
               " reading device and writing = %d bytes to user-space\n",
               size);
        do {
            rc = copy_to_user(ioargp + copied, buffer + copied, left);
            printk(KERN_INFO "rc from copy_to_user = %ld\n", rc);
            if (rc && rc == left) {     /* no progress: bad user pointer */
                kfree(buffer);
                return -EFAULT;
            }
            copied += left - rc;
            left = rc;
        } while (left > 0);
        break;
    default:
        printk(KERN_WARNING " got invalid case, CMD=%d\n", cmd);
        kfree(buffer);                  /* do not leak the buffer */
        return -EINVAL;
    }
    for (i = 0; i < size; i++)
        printk(KERN_INFO "%c", buffer[i]);
    printk(KERN_INFO "\n");
    kfree(buffer);
    return rc;
}
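
The "bytes not copied" return convention and the retry loop can be exercised in user space without a driver. The sketch below is a model under stated assumptions: toy_copy_4() and toy_copy_fail() are hypothetical copiers that imitate a partial and a failed transfer, and toy_copy_all() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Same return convention as copy_from_user(): number of bytes NOT copied. */
typedef unsigned long (*toy_copy_fn)(void *to, const void *from,
                                     unsigned long n);

/* Hypothetical copier that moves at most 4 bytes per call. */
unsigned long toy_copy_4(void *to, const void *from, unsigned long n)
{
    unsigned long chunk = n < 4 ? n : 4;

    memcpy(to, from, chunk);
    return n - chunk;
}

/* Hypothetical copier that always fails, like a bad user pointer. */
unsigned long toy_copy_fail(void *to, const void *from, unsigned long n)
{
    (void)to;
    (void)from;
    return n;
}

/* Retry until everything is copied; give up when a call makes no progress. */
int toy_copy_all(toy_copy_fn copy, void *to, const void *from,
                 unsigned long n)
{
    unsigned long copied = 0;

    assert(copy != NULL);
    while (copied < n) {
        unsigned long left = n - copied;
        unsigned long rc = copy((char *)to + copied,
                                (const char *)from + copied, left);

        if (rc == left)
            return -EFAULT;
        copied += left - rc;
    }
    return 0;
}
```

The check for "no progress" is what keeps the loop from spinning forever on an address that can never be read, which a plain retry loop would do.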

5.6

Memory mapping

Memory copying has overheads, which can be eliminated by mapping memory from one space into the other, so that it can
be accessed by the other side directly without the expense of copying.
When a file is memory mapped, the file is associated with a range of linear addresses. Input and output operations on the
file can then be accomplished with simple memory references rather than explicit I/O operations.
Memory mapping can also be done on device nodes for direct access to hardware devices. In this case, the driver must register
and implement a proper mmap() entry point.
This method is not useful for stream-oriented devices. The mapped area must be a multiple of PAGE_SIZE in extent, and must start on a
page boundary.
Two basic kinds of memory mapping exist:
Shared mapping - operations on the memory region are equivalent to changing the file it represents. Changes are immediately
visible to processes accessing the file.
Private mapping - operations on the memory region are not committed to the disk and are invisible to other processes accessing the
file. More efficient, but designed to be used in read-only situations (e.g. final saving of data is done by writing to another file).
From the user side, memory mapping is done with:
From the user side, memory mapping is done with:
#include <unistd.h>
#include <sys/mman.h>
void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *start, size_t length);

This requests the mapping into memory of length bytes, starting at offset offset, from the file specified by fd. The offset must be
an integral number of pages.
The address start is a preferred address to map to. If 0 is given (the usual case), mmap() will choose the address and put it in the
return value.
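
Because the offset has to be a whole number of pages, mapping data that starts in the middle of a page means rounding the offset down and skipping the difference afterwards. The helpers below sketch that arithmetic; toy_page_floor() and toy_page_delta() are hypothetical names, and the page size is passed in as a parameter (in a real program it would come from sysconf(_SC_PAGESIZE) or getpagesize()).

```c
#include <assert.h>
#include <stddef.h>

/* Round offset down to a multiple of page_size (a power of two). */
size_t toy_page_floor(size_t offset, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return offset & ~(page_size - 1);
}

/* Bytes between the aligned offset and the offset actually wanted. */
size_t toy_page_delta(size_t offset, size_t page_size)
{
    return offset - toy_page_floor(offset, page_size);
}
```

To reach byte 5000 of a file with 4096-byte pages, one would map from offset 4096, ask for 904 extra bytes of length, and add 904 to the pointer that mmap() returns.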
prot is the desired memory protection. It has bits:
PROT_EXEC     Page may be executed
PROT_READ     Page may be read
PROT_WRITE    Page may be written
PROT_NONE     Page may not be accessed

Except for PROT_NONE, the above flags can be ORed together.

flags specifies the type of mapped object. It has bits:
MAP_FIXED       If start can't be used, fail.
MAP_SHARED      Share the mapping with all other processes.
MAP_PRIVATE     Create a private copy-on-write mapping.
MAP_ANONYMOUS   Create a mapping only in memory, without a file association.

Either MAP_SHARED or MAP_PRIVATE must be specified. Remember, a private mapping does not change the file on disk.
Therefore any changes to it will be lost when the process terminates.
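
The difference between the two mapping types can be observed directly from user space. The following sketch makes several assumptions: it needs a writable /tmp directory, toy_byte_after_mapped_write() is a hypothetical helper, and error handling is reduced to returning -1.

```c
#define _XOPEN_SOURCE 700   /* for mkstemp(), pread(), pwrite(), ftruncate() */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a temporary one-page file whose first byte is 'a', map it with
 * the given flags (MAP_SHARED or MAP_PRIVATE), store 'Z' through the
 * mapping, and return the first byte as seen by reading the file itself.
 * Returns -1 on any error.
 */
int toy_byte_after_mapped_write(int flags)
{
    char path[] = "/tmp/toy_mmap_XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    char byte = 0;
    char *area;
    int fd, result = -1;

    assert(flags == MAP_SHARED || flags == MAP_PRIVATE);
    if (page <= 0 || (fd = mkstemp(path)) < 0)
        return -1;
    unlink(path);                              /* file vanishes on close */
    if (ftruncate(fd, page) == 0 && pwrite(fd, "a", 1, 0) == 1) {
        area = mmap(NULL, page, PROT_READ | PROT_WRITE, flags, fd, 0);
        if (area != MAP_FAILED) {
            area[0] = 'Z';
            msync(area, page, MS_SYNC);        /* flush shared changes */
            if (pread(fd, &byte, 1, 0) == 1)
                result = byte;
            munmap(area, page);
        }
    }
    close(fd);
    return result;
}
```

Called with MAP_SHARED the function reads back 'Z', because the store went through to the file; called with MAP_PRIVATE it reads back 'a', because the store only touched the process's private copy of the page.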
MAP_ANONYMOUS is a common way to share memory between the parent process and children processes. Here is an
example:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv) {
    int fd = -1;
    int size = 4096;
    int status;
    char *area;
    pid_t pid;

    area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, fd, 0);
    if (area == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }
    pid = fork();
    if (0 == pid) {
        /* child */
        strcpy(area, "This is a message from the child");
        printf("Child has written: %s\n", area);
        exit(EXIT_SUCCESS);
    } else if (pid > 0) {
        /* parent */
        wait(&status); /* wait until child terminates */
        printf("Parent has read: %s\n", area);
        exit(EXIT_SUCCESS);
    }
    exit(EXIT_FAILURE);
}

munmap() deletes the mapping and causes further references to addresses within the range to generate invalid memory references.
For the kernel side, the driver entry point looks like:
#include <linux/mm.h>
int (*mmap)(struct file *filp, struct vm_area_struct *vma);

The vm_area_struct data structure is defined in /usr/src/linux/include/linux/mm.h and contains the important information. The
basic elements are:
struct vm_area_struct {
    ...
    unsigned long vm_start;     /* Our start address within vm_mm. */
    unsigned long vm_end;       /* The first byte after our end address within vm_mm. */
    ...
    pgprot_t vm_page_prot;      /* Access permissions of this VMA. */
    unsigned long vm_flags;     /* Flags, listed below */
    ...
    /* Functions pointers to deal with this struct */
    struct vm_operations_struct *vm_ops;
    /* Information about our backing store: */
    unsigned long vm_pgoff;     /* Offset (within vm file) in PAGE_SIZE
                                   units, *not* PAGE_CACHE_SIZE */
    ...
}

Like the fops structure discussed in lab 3, the vm_ops structure can be used to override default operations. Pointers can be given
for functions to: open(), close(), protect(), sync(), advice(), swapout(), swapin(). . .
Here is a simple example to show how the fields are used:
#include <linux/mm.h>
int my_mmap(struct file *file, struct vm_area_struct *vma) {
    if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                        vma->vm_end - vma->vm_start, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}

Most of the work is done by the function:


#include <linux/mm.h>
int remap_pfn_range(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long pfn, unsigned long size, pgprot_t prot);

which maps pages to the specified vma range, where:


vma points to the vma struct
start_addr is the beginning of the mapping address
pfn stands for "page frame number" of the page (or the first page, in case of mapping contiguous pages in one go)
size is the size of the mapping, must be a multiple of page sizes
prot is the protection setting of the page(s).
The page frame number of a page can be obtained by one of the functions:
unsigned long page_to_pfn(struct page *page);
unsigned long virt_to_pfn(void *addr);

For contiguous pages, remap_pfn_range() can map the entire requested range in one go. However, where pages
are not contiguous, as in assignment 1, you need to map the pages one by one.
Note that this function does allow mapping memory above the 4GB barrier.
Here is a simple example of a program to test the mmap() entry:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>

#define DEATH(mess) { perror(mess); exit(errno); }

int main(int argc, char **argv) {
    int fd, size, rc, j;
    char *area, *tmp, *nodename = "/dev/mycdrv";
    char c[2] = "CX";

    if (argc > 1)
        nodename = argv[1];
    size = getpagesize();       /* use one page by default */
    if (argc > 2)
        size = atoi(argv[2]);
    printf("Memory Mapping Node: %s, of size %d bytes\n", nodename, size);
    if ((fd = open (nodename, O_RDWR)) < 0)
        DEATH ("problems opening the node ");
    area = mmap (NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (area == MAP_FAILED)
        DEATH ("error mmaping");
    /* can close the file now */
    close (fd);
    /* put the string repeatedly in the file */
    tmp = area;
    for (j = 0; j < size - 1; j += 2, tmp += 2)
        memcpy (tmp, &c, 2);
    /* just cat out the file to see if it worked */
    rc = write (STDOUT_FILENO, area, size);
    if (rc != size)
        DEATH ("problems writing");
    exit (EXIT_SUCCESS);
}

Here is a simple driver with a mmap() entry point:


#include <linux/module.h>   /* for modules */
#include <linux/fs.h>       /* file_operations */
#include <linux/uaccess.h>  /* copy_(to,from)_user */
#include <linux/init.h>     /* module_init, module_exit */
#include <linux/slab.h>     /* kmalloc */
#include <linux/cdev.h>     /* cdev utilities */
#include <linux/mm.h>       /* remap_pfn_range */

#define MYDEV_NAME "mycdrv"

static dev_t first;
static unsigned int count = 1;
static int my_major = 700, my_minor = 0;
static struct cdev *my_cdev;

static int mycdrv_mmap (struct file *file, struct vm_area_struct *vma)
{
printk (KERN_INFO "I entered the mmap function\n");
if (remap_pfn_range (vma, vma->vm_start,
vma->vm_pgoff,
vma->vm_end - vma->vm_start, vma->vm_page_prot)) {
return -EAGAIN;
}
return 0;
}
/* dont bother with open, release, read and write */
static struct file_operations mycdrv_fops = {
.owner = THIS_MODULE,
.mmap = mycdrv_mmap,
};
static int __init my_init (void)
{
first = MKDEV (my_major, my_minor);
register_chrdev_region (first, count, MYDEV_NAME);
my_cdev = cdev_alloc ();
cdev_init (my_cdev, &mycdrv_fops);
cdev_add (my_cdev, first, count);
printk (KERN_INFO "\nSucceeded in registering character device %s\n",
MYDEV_NAME);
return 0;
}
static void __exit my_exit (void)
{
cdev_del (my_cdev);
unregister_chrdev_region (first, count);
printk (KERN_INFO "\ndevice unregistered\n");
}
module_init (my_init);
module_exit (my_exit);
MODULE_AUTHOR ("Jerry Cooperstein");
MODULE_DESCRIPTION ("LDD:1.0 s_18/mmapdrv.c");
MODULE_LICENSE ("GPL v2");

5.7

Atomic operations

In multithreaded architectures, more than one process can access (make a system call on) the same device concurrently. Therefore care must be taken to protect access to shared data; otherwise a data race can occur if more than one process writes
to the same location simultaneously.
For integer-sized data, the Linux kernel API provides the atomic type atomic_t, which is accessed by atomic functions as one
single instruction. atomic_t is defined as:
typedef struct {
volatile int counter;
} atomic_t;

Atomic variables with the type atomic_t must be accessed by the following functions:

int  atomic_read(atomic_t *v);
void atomic_set(atomic_t *v, int i);
void atomic_add(int i, atomic_t *v);
void atomic_sub(int i, atomic_t *v);
void atomic_inc(atomic_t *v);
void atomic_dec(atomic_t *v);
int  atomic_dec_and_test(atomic_t *v);
int  atomic_inc_and_test_greater_zero(atomic_t *v);
int  atomic_sub_and_test(int i, atomic_t *v);
int  atomic_add_negative(int i, atomic_t *v);
int  atomic_sub_return(int i, atomic_t *v);
int  atomic_add_return(int i, atomic_t *v);
int  atomic_inc_return(atomic_t *v);
int  atomic_dec_return(atomic_t *v);
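
The kernel atomic_t functions can only be called from kernel code, but the reference-counting pattern they support can be sketched in user space. The sketch below uses C11 <stdatomic.h> as a stand-in, not the kernel API, and the names refcount, refcount_get() and refcount_put() are hypothetical. refcount_get() behaves like atomic_inc_return() and refcount_put() like atomic_dec_and_test().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* User-space stand-in for an atomic_t counter (C11 atomics, NOT the
 * kernel API). All names here are hypothetical. */
static atomic_int refcount = 0;

/* Like atomic_inc_return(): increment and return the new value. */
static int refcount_get(void)
{
    /* atomic_fetch_add() returns the OLD value, so add 1 for the new one. */
    return atomic_fetch_add(&refcount, 1) + 1;
}

/* Like atomic_dec_and_test(): decrement and report whether we hit zero. */
static bool refcount_put(void)
{
    /* The old value was 1 exactly when the new value is 0. */
    return atomic_fetch_sub(&refcount, 1) == 1;
}
```

Because each update is a single atomic read-modify-write, two callers can never both observe the counter reaching zero, which is the property a driver relies on when it frees a resource on the last release.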

Now go back to assignment 1 and add a mmap() entry point that maps the ramdisk to user space, as described in the assignment specification.


Chapter 6

Using ioctl()
The system call ioctl() is provided for device-specific custom commands (such as format, reset and shutdown) that are not
provided by standard system calls such as read(), write() and mmap().
To invoke ioctl commands of a device, the user-space program would open the device first, then send the appropriate ioctl() and
any necessary arguments.
#include <sys/ioctl.h>
int ioctl(int fd, int command, ...);

On success, 0 is returned. On error, -1 is returned and errno is set to one of:
EBADF     Bad file descriptor.
ENOTTY    File descriptor not associated with a character special device, or the request does not apply to the kind of object the file descriptor references.
EINVAL    Invalid command or argp.
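
A user-space caller should always check the return value and inspect errno. The following is a minimal sketch of that check; the helper name try_ioctl() is hypothetical and the command is assumed to take no argument.

```c
#include <errno.h>
#include <sys/ioctl.h>

/* Hypothetical helper: issue an argument-less ioctl and return 0 on
 * success, or the errno value set by the failed call. */
static int try_ioctl(int fd, unsigned long cmd)
{
    if (ioctl(fd, cmd) == -1)
        return errno;
    return 0;
}
```

For example, calling try_ioctl() with an invalid descriptor such as -1 yields EBADF, while sending a command that a driver does not recognise would normally yield ENOTTY or EINVAL, depending on the driver.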

In the kernel code of the device, the entry point for ioctl() looks like:
#include <linux/ioctl.h>
static int mydrvr_ioctl (struct inode *inode, struct file *file,
unsigned int cmd, unsigned long arg);

where arg can be used directly either as a long or a pointer in user-space. In the latter case, the pointer points to user-space data,
therefore to access the user-space data, one would use put_user(), get_user(), copy_to_user() and copy_from_user() functions.
Here is an example of an ioctl implementation in a driver:
static int mydrvr_ioctl (struct inode *inode, struct file *file,
                         unsigned int cmd, unsigned long arg) {
    if (_IOC_TYPE(cmd) != MYDRBASE) return -EINVAL;
    switch (cmd) {
    case MYDRVR_RESET:
        .....
        return 0;
    case MYDRVR_OFFLINE:
        .....
        return 0;
    case MYDRVR_GETSTATE:
        if (copy_to_user((void *)arg, &mydrvr_state_struct, sizeof(mydrvr_state_struct))) {
            return -EFAULT;
        }
        return 0;
    default:
        return -EINVAL;
    }
}

From kernel 2.6.36 onwards, the Big Kernel Lock (a single lock that allows only one process at a time to make a system call to the device) has been removed, so the member .ioctl, which assumed protection from the BKL, has also been removed from struct file_operations.
All system call implementations are now required to use their own synchronization methods, such as spinlocks, mutexes, semaphores
and atomic operations (discussed in the next lab), to ensure atomic access to shared data.
In struct file_operations, your ioctl() implementation must now be registered with the member .unlocked_ioctl. Note that this entry point has a different prototype from the old .ioctl: it returns long and no longer takes the inode argument, i.e. long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long). For example:
struct file_operations asgn1_fops = {
.owner = THIS_MODULE,
......
.unlocked_ioctl = asgn1_ioctl,
......
};

6.1

Defining ioctl() commands

Programmers must choose a number for the integer command representing each command implemented through ioctl. The
number should be unique across the system. Picking an arbitrary number is a bad idea, because:
Two device nodes may have the same major number. An application could open more than one device and mix up the file
descriptors, thereby sending the right command to the wrong device. Sending wrong ioctl commands can have catastrophic
consequences, including damage to hardware. An ioctl command number should therefore be encoded with one of the following macros:
_IO(type, number)
_IOR(type, number, size)
_IOW(type, number, size)
_IOWR(type, number, size)

where type is the 8-bit magic number unique to the device. For magic numbers currently used in the kernel (which you should therefore
not use), please have a look at HERE. For all tasks in this lab, please use 'k' as the magic number of your modules. number is the
sequential number you assign to your command.
size codes for the size of the data structure passed from/to the user space and kernel space. Rather than passing the actual size,
one must pass the actual data structure (not pointer to the data structure), which then gets a sizeof() primitive applied to it. For
example:
MY_IOCTL = _IOWR('k', 1, struct my_data_structure);

Because the size field has only 14 bits on most architectures, the data structure must be smaller than 16KB.
If your command does not involve passing data, then you should use _IO(); if your command lets the user-space program read
data from the data structure, use _IOR(); if the user-space program writes to the data structure and passes it to the kernel, then use
_IOW(); otherwise, if the data structure is both read and written by the user-space program, use _IOWR().
Here is an example for encoding ioctl() command numbers:
#define MYDRBASE 'k'
#define MYDR_RESET _IO(MYDRBASE, 1)
#define MYDR_STOP _IO(MYDRBASE, 2)
#define MYDR_READ _IOR(MYDRBASE, 2, my_data_buffer)
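
The encoding can be checked from user space, because <linux/ioctl.h> exposes the same macros there, including the _IOC_TYPE(), _IOC_NR(), _IOC_SIZE() and _IOC_DIR() decoding macros. The sketch below is illustrative only: struct my_data_buffer, its 64-byte size, the command number 3 and the helpers is_my_command() and user_reads() are assumptions, not part of any real driver.

```c
#include <linux/ioctl.h>   /* _IO, _IOR, _IOC_TYPE, ... (Linux only) */

/* Hypothetical payload passed between user space and the driver. */
struct my_data_buffer {
    char bytes[64];
};

#define MYDRBASE   'k'
#define MYDR_RESET _IO(MYDRBASE, 1)
#define MYDR_READ  _IOR(MYDRBASE, 3, struct my_data_buffer)

/* Return 1 if cmd carries this driver's magic number, 0 otherwise. */
static int is_my_command(unsigned int cmd)
{
    return _IOC_TYPE(cmd) == MYDRBASE;
}

/* Return 1 if the user-space program reads data back for cmd. */
static int user_reads(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & _IOC_READ) != 0;
}
```

A driver would typically perform the is_my_command() check first and reject foreign commands before entering its switch statement.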


In your ioctl implementation (in the kernel module), you can use the following macros to decode information from the ioctl
command integer:
_IOC_TYPE(cmd)  /* gets the magic number of the device
                   this command targets */
_IOC_NR(cmd)    /* gets the sequential number of the command
                   within your device */
_IOC_SIZE(cmd)  /* gets the size of the data structure */
_IOC_DIR(cmd)   /* gets the direction of data transfer,
                   can be one of the following:
                   _IOC_NONE
                   _IOC_READ
                   _IOC_WRITE
                   _IOC_READ | _IOC_WRITE
                */

6.2

Automatically create device nodes under /dev with the udev system

udev is the device manager for the Linux kernel. It manages device nodes automatically during module insertion and removal, which saves the trouble of manually creating and removing device nodes and matching the major and minor numbers.
Device nodes are commonly created by the init() function. Once the device node is created, it is accessible to other
modules and to user-space programs. Therefore the device node is usually created at the end of init(), when everything is already
initialized and the device is ready to be used.
To create a node, first you must create a class using:
#include <linux/device.h>
struct class *class_create(struct module *owner, const char *name);

where owner is usually set to THIS_MODULE and name will be the name of the class, which does not have to be same as the
module name.
Then you must create the node itself using:
struct device *device_create(struct class *cls, struct device *parent, dev_t devt,
                             void *drvdata, const char *fmt, ...);

where cls is the class you've just created; parent is the parent device, which is set to NULL for our assignments; devt is the device number (dev_t) of our device; drvdata is driver-private data attached to the device, which can be NULL; and
fmt is the name of the node that appears under /dev.
Then in exit(), you must remove the node(s) and the class by:
void device_destroy(struct class *cls, dev_t devt);
void class_destroy(struct class *cls);

Here is an example:
#include <linux/device.h>
struct class *my_class;
dev_t my_dev;
/* ... */
static int __init my_init(void) {
    /* ... */
    /* create node */
    my_class = class_create(THIS_MODULE, "my_class");
    device_create(my_class, NULL, my_dev, NULL, "mycdrv");
    return 0;
}

static void __exit my_exit(void) {
    /* remove node */
    device_destroy(my_class, my_dev);
    class_destroy(my_class);
    /* ... */
}
module_init(my_init);
module_exit(my_exit);

Important
For all assignments in COSC440, we will use dynamic major number allocation and udev

6.3

Exercise: using ioctl to pass data

Write a simple module that uses the ioctl directional information to pass a data buffer of fixed size back and forth between the
driver and the user-space program.
The size and direction(s) of the data transfer should be encoded in the command number.
You will need to write a user-space application to test this.

6.4

Exercise: using ioctl() to pass data of variable length

Extend the previous exercise to send a buffer whose length is determined at run time. You will probably need to use the _IOC
macro directly in the user-space program. (See linux/ioctl.h.)


Chapter 7

Catchup lab
This lab is set aside to help people catch up; priority will be given to students completing previous labs or assignments.


Chapter 8

Hardware interrupts, tasklets and workqueues


Interrupt handlers usually have two parts: the top halves and the bottom halves.
The top half does what needs to be done immediately, for example, a network driver top half acknowledges the interrupt and gets
data off the network card into a buffer for later processing. Basically the top half itself is the interrupt handler.
The bottom half does the rest of the processing that has been deferred, which can be time consuming or involve delays that would
otherwise hamper the response of the top half.

8.1

Top half

The top half is the interrupt handler, and it:


Checks to make sure the interrupt was generated by the right hardware. This check is necessary for interrupt sharing.
Clears an interrupt pending bit on the interface board.
Does what needs to be done immediately (usually read or write something to/from the device). This data is usually written to
or read from a device-specific buffer, which has been previously allocated.
Schedules handling the new information later (in the bottom half) if the handling required is not trivial.
An example of an interrupt handler (top half) that schedules the bottom half with tasklets:
static struct my_dat { ... } my_data;
/* tasklet bottom half */
static void t_fun (unsigned long t_arg) { ... }
DECLARE_TASKLET (t_name, t_fun, (unsigned long)&my_data);
/* interrupt handler */
static irqreturn_t my_interrupt (int irq, void *dev_id) {
    top_half_fun();
    tasklet_schedule(&t_name);
    return IRQ_HANDLED;
}

An interrupt handler needs to be registered during initialization, and deregistered during cleanup.
To register an interrupt handler:


#include <linux/interrupt.h>
int request_irq(unsigned int irq, irq_handler_t handler,
unsigned long flags, const char *name, void *dev);

and to remove an interrupt handler:


#include <linux/interrupt.h>
void free_irq(unsigned int irq, void *dev);

Here is an example showing how to set up an interrupt handler:


#include <linux/module.h>
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/workqueue.h>
#include <linux/kthread.h>
#include <linux/slab.h>

/* IRQ of your network card to be shared */
#define SHARED_IRQ 19
static int irq = SHARED_IRQ;
module_param(irq, int, S_IRUGO);
/* default delay time in top half -- try 10 to get results */
static int delay = 0;
module_param(delay, int, S_IRUGO);
static atomic_t counter_bh, counter_th;
struct my_dat {
    unsigned long jiffies;       /* used for timestamp */
    struct tasklet_struct tsk;   /* used in dynamic tasklet solution */
    struct work_struct work;     /* used in dynamic workqueue solution */
};
static struct my_dat my_data;
static irqreturn_t my_interrupt(int irq, void *dev_id);
static int __init my_generic_init(void)
{
atomic_set(&counter_bh, 0);
atomic_set(&counter_th, 0);
/* use my_data for dev_id */
if (request_irq(irq, my_interrupt, IRQF_SHARED, "my_int", &my_data))
return -1;
printk(KERN_INFO "successfully loaded\n");
return 0;
}
static void __exit my_generic_exit(void)
{
synchronize_irq(irq);
free_irq(irq, &my_data);
printk(KERN_INFO " counter_th = %d, counter_bh = %d\n",
atomic_read(&counter_th), atomic_read(&counter_bh));


printk(KERN_INFO "successfully unloaded\n");


}

8.2

Bottom half

A bottom half is used to process data, leaving the top half free to deal with new incoming interrupts. Interrupts are enabled while
a bottom half runs. Interrupts can be disabled if necessary, but generally this should be avoided, as it goes against the basic
purpose of having a bottom half: processing data while listening for new interrupts.
There are three main types of bottom halves: namely tasklets, workqueues and kernel threads.

8.2.1

Tasklets

Tasklets are used to queue up work to be done at a later time. Tasklets can be run in parallel, but the same tasklet cannot be run
on multiple CPUs at the same time. Also each tasklet will run only on the CPU that schedules it, to optimize cache usage. Since
the thread that queued up the tasklet must complete before it can run the tasklet, race conditions are naturally avoided. However,
this arrangement can be suboptimal, as other potentially idle CPUs cannot be used to run the tasklet. Therefore workqueues can,
and should be used instead, and workqueues will be discussed in the next section.
The tasklet code is explained in /usr/src/linux/include/linux/interrupt.h, and the important data structure is:
struct tasklet_struct {
struct tasklet_struct *next;
unsigned long state;
atomic_t count;
void (*func)(unsigned long);
unsigned long data;
};

func is a pointer to the function that will be run, with data as its parameter. state is used to determine whether the tasklet has
already been scheduled, and if so, then it cannot be done so a second time.
The tasklet API includes:
DECLARE_TASKLET(name, function, data);
DECLARE_TASKLET_DISABLED(name, function, data);
void tasklet_init(struct tasklet_struct *t,
                  void (*func)(unsigned long), unsigned long data);
void tasklet_schedule(struct tasklet_struct *t);
void tasklet_enable  (struct tasklet_struct *t);
void tasklet_disable (struct tasklet_struct *t);
void tasklet_kill    (struct tasklet_struct *t);

A tasklet must be initialized before being used, either by dynamically allocating space for the structure and calling tasklet_init(),
or by statically declaring and initializing it with DECLARE_TASKLET(). Alternatively, the tasklet can be declared in the disabled state
with DECLARE_TASKLET_DISABLED(), which means that the tasklet can be scheduled, but will not run until it is
specifically enabled.
tasklet_kill() is used to kill tasklets which reschedule themselves.
tasklet_schedule() is called to schedule a tasklet. Please note that if a tasklet has previously been scheduled (but not yet run), the
new schedule will be silently discarded.
Here is a trivial example with my_init() scheduling a tasklet:


#include <linux/module.h>
#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/slab.h>
#include <linux/init.h>

typedef struct simp_t {
    int i;
    int j;
} simp;

static simp t_data;

static void t_fun(unsigned long t_arg) {
    simp *datum = (simp *)t_arg;
    printk(KERN_INFO "Entering t_fun, datum->i = %d, jiffies = %ld\n",
           datum->i, jiffies);
    printk(KERN_INFO "Entering t_fun, datum->j = %d, jiffies = %ld\n",
           datum->j, jiffies);
}

/* declare the tasklet that my_init() schedules below */
DECLARE_TASKLET(t_name, t_fun, (unsigned long)&t_data);

static int __init my_init(void) {
    printk(KERN_INFO "\nHello: my_init loaded at address 0x%p\n",
           my_init);
    t_data.i = 100;
    t_data.j = 200;
    printk(KERN_INFO "scheduling my tasklet, jiffies = %ld\n", jiffies);
    tasklet_schedule(&t_name);
    return 0;
}
static void __exit my_exit(void) {
    printk(KERN_INFO "\nHello: my_exit loaded at address 0x%p\n",
           my_exit);
}
module_init(my_init);
module_exit(my_exit);

8.2.2

Workqueues

A workqueue contains a linked list of tasks to be run at a deferred time. Tasks in workqueue:
run in process context and can therefore sleep, without interfering with tasks running in any other queues.
but still cannot transfer data to and from user space, as there is no real user context to access.
The important data structure describing the tasks put into the queue is:
#include <linux/workqueue.h>
typedef void (*work_func_t)(struct work_struct *work);
struct work_struct {
atomic_long_t data;
struct list_head entry;
work_func_t func;
};


func points to the function that will be run to get the work done. The other fields are for internal use.
In order to pass data to a function, one needs to embed the work_struct in a user-defined data structure and then do pointer
arithmetic in order to recover it. Here is an example:
static struct my_dat {
int irq;
struct work_struct work;
};
static void w_fun(struct work_struct *w_arg) {
struct my_dat *data = container_of(w_arg, struct my_dat, work);
atomic_inc(&bhs[data->irq]);
}
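
The pointer arithmetic hidden inside container_of() can be tried out in user space. The sketch below uses a simplified version of the macro built on offsetof(); the kernel's own definition adds type checking. The work_struct here is only a stand-in with a made-up field, and irq_of() is a hypothetical helper.

```c
#include <stddef.h>   /* offsetof */

/* Simplified user-space version of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for the kernel's struct work_struct (field is made up). */
struct work_struct {
    int pending;
};

struct my_dat {
    int irq;
    struct work_struct work;   /* embedded, not a pointer */
};

/* Given only the embedded member, recover the enclosing structure. */
static int irq_of(struct work_struct *w_arg)
{
    struct my_dat *data = container_of(w_arg, struct my_dat, work);
    return data->irq;
}
```

This only works because the work_struct is embedded in struct my_dat; subtracting the member's offset from its address gives the address of the enclosing structure.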

A work_struct can be declared and initialized at compile time with:

DECLARE_WORK(name, void (*function)(struct work_struct *));

where name is the name of the work_struct, which is set up to queue function() to run. A previously declared work_struct can be
initialized and loaded with the two macros:
INIT_WORK(struct work_struct *work, void (*function)(struct work_struct *));
PREPARE_WORK(struct work_struct *work, void (*function)(struct work_struct *));

where work has already been declared as a work_struct. The INIT_WORK() macro initializes the list_head linked-list pointer,
and PREPARE_WORK() sets the function pointer. The INIT_WORK() macro needs to be called at least once, and in turn calls
PREPARE_WORK(). INIT_WORK() should not be called while a task is already in the work queue.
Alternatively, a work_struct can be statically declared by:
DECLARE_WORK(work, void (*function)(struct work_struct *));

In the kernel, there is a default workqueue named events. Tasks are added to and flushed from this queue with the functions:
int schedule_work(struct work_struct *work);
void flush_scheduled_work(void);

flush_scheduled_work() is used when one needs to wait until all entries in a work queue have run.

8.2.3

Exercise: Deferred Functions

Write a driver that schedules a deferred function whenever a write() to the device takes place.
Pass some data to the driver and have it print out.
Have it print out the current->pid field when the deferred function is scheduled, and then again when the queued function is executed.
Implement this using:
tasklets
work queues

8.2.4

Exercise: Shared interrupts and bottom halves

Write a module that shares its IRQ with your network card. You can generate some network interrupts either by browsing or
pinging.
You can find the interrupt used by eth0 by:


$ cat /proc/interrupts

The first column is the IRQ number and the fourth column lists the users of that IRQ number. On my machine, eth0 uses IRQ
19. You can see more details about /proc/interrupts at:
http://www.centos.org/docs/5/html/5.1/Deployment_Guide/s2-proc-interrupts.html
Make it use a top half and a bottom half. Implement the bottom half using a tasklet, then repeat this exercise with a workqueue.
Check /proc/interrupts while it is loaded.

8.3

Hardware I/O

In this part of the lab, we will look at reading from, and writing to hardware I/O ports, which you will need to do for the second
assignment.
Operations on I/O registers differ in important ways from normal memory access. In particular, there may be side effects
caused by compiler and/or hardware optimizations that reorder instructions. For conventional memory reads and writes this is
not a problem, as a write always stores a value and a read always returns the last value written. For I/O ports, however, there can
be a problem because the CPU cannot tell when a process depends on the order of memory accesses. In other words, reading or
writing an I/O register may cause the device to initiate or respond to various actions.
Therefore a driver must ensure that no caching or reordering occurs. Otherwise, problems that are difficult to diagnose and
occur only intermittently may arise.
The solution is to use an appropriate memory barrier to prevent reordering of some instructions:
#include <asm-generic/system.h>
void barrier(void);
void rmb(void);
void wmb(void);
void mb (void);
void smp_rmb(void);
void smp_wmb(void);
void smp_mb (void);

barrier() causes the compiler to store in memory all values currently modified in a CPU register, to read them again later when
they are needed. This function does not have effects on the hardware itself.
rmb() forces any reads before the barrier to complete before any reads after the barrier are done; wmb() does the same thing for
writes and mb() does it for both reads and writes.
Functions with the smp_ prefix insert hardware barriers only on multi-processor systems; on single-CPU systems they just expand to barrier().
Here is a code snippet showing the use of barriers:
io32write(direction, dev->base + OFF_DIR);
io32write(size, dev->base + OFF_SIZE);
wmb();
io32write(value, dev->base + OFF_GO);

In addition, many architectures provide convenience macros which combine setting a value with invoking a memory barrier. For
example:
#define set_mb(var, value) do {var = value; mb(); } while (0)
#define set_wmb(var, value) do {var = value; wmb(); } while (0)
#define set_rmb(var, value) do {var = value; rmb(); } while (0)


Memory barriers may cause a performance hit, so they should be used with care. For example, on x86, the write memory
barrier does nothing, as writes are never reordered. However, reads may be reordered. So you should not use mb() when wmb()
will suffice.

8.3.1

REGISTERING I/O PORTS

Before accessing the I/O ports, the driver must register their use (usually during initialization), and the driver must unregister I/O
ports during cleanup. These steps are done by:
#include <linux/ioport.h>
struct resource *request_region(unsigned long from, unsigned long extent, const char *name) ;
void release_region(unsigned long from, unsigned long extent);

In these functions, the argument from is the base address of the I/O region. The argument extent is the number of ports (or
addresses) and the argument name is the name that will appear in /proc/ioports as the entry that claims the region.
Here is a code snippet showing how an I/O port is registered:
#include <linux/ioport.h>
static int my_dev_detect(unsigned long port_addr, unsigned long extent) {
    if (!request_region(port_addr, extent, "my_dev"))
        return -EBUSY;      /* the port is occupied */
    if (mydrv_probe(port_addr, extent) != 0) {
        release_region(port_addr, extent);
        return -ENODEV;     /* can't find the device */
    }
    return 0;
}

8.3.2

READING AND WRITING DATA FROM I/O REGISTERS

The following macros give the ability to read and write 8-bit (with suffix b), 16-bit (with suffix w) and 32-bit (with suffix l) once
or multiple times:
Reading:
#include <linux/io.h>
unsigned char  inb (unsigned long port_address);
unsigned short inw (unsigned long port_address);
unsigned long  inl (unsigned long port_address);
void insb(unsigned long port_address, void *addr, unsigned long count);
void insw(unsigned long port_address, void *addr, unsigned long count);
void insl(unsigned long port_address, void *addr, unsigned long count);

Writing:
#include <linux/io.h>
void outb (unsigned char b, unsigned long port_address);
void outw (unsigned short w, unsigned long port_address);
void outl (unsigned long l, unsigned long port_address);
void outsb(unsigned long port_address, void *addr, unsigned long count);
void outsw(unsigned long port_address, void *addr, unsigned long count);
void outsl(unsigned long port_address, void *addr, unsigned long count);


On all architectures, the long functions perform only 32-bit operations. Even on 64-bit architectures, there is no 64-bit data path.
The functions above that take the count argument do not access a range of addresses. Instead, they access only one port
address, but they loop efficiently around the operation.
All these functions do I/O in little-endian order regardless of underlying architecture, and do any necessary byte-swapping.
Reading and writing I/O ports may require the use of memory barriers.
Here is an example:
/**
* This function writes from the user buffer to the parallel port
*/
static ssize_t do_parport_write(struct file *filp, const char __user *buf,
                                size_t count, loff_t *f_pos) {
    size_t written = 0;
    char c;

    while (written < count) {
        /* buf is a user-space pointer, so fetch each byte with get_user() */
        if (get_user(c, buf + written))
            return -EFAULT;
        outb_p(0x00, parport_base);
        mb();
        outb_p(c | 0x80, parport_base);
        mb();
        udelay(5);
        written++;
        parport_device->total_written++;
    }
    *f_pos += written;
    return written;
}

8.3.3

SLOWING I/O CALLS TO THE HARDWARE

Some hardware can only read/write at a slower speed, so the kernel provides pausing functions that can be used to handle
I/O to these slow devices. These functions have the same form as the I/O functions mentioned above, but with the suffix _p
attached to their names (e.g. outb_p()).
These functions insert a small delay after the I/O instruction if another such function follows. They should not be necessary
except for very old ISA hardware.
Have a look at Assignment 2 and prepare your system accordingly to ensure the module parport is not loaded. (Note: you don't
need to insert the dummy module.)
Request the region of the parallel port 0x378 (1 byte). Write a byte to it, then read from it again.
Check and see if the region 0x378 is properly registered in /proc/ioports.

8.4

Reference

1. Writing Linux Device Drivers Chapters 20 and 21, by Jerry Cooperstein


Chapter 9

Walking through the Assignment 2


In this lab, we will go through the implementation of assignment 2.

9.1

Preparing the base code from Assignment 1

First, after fixing the problems I have mentioned in your assignment 1 code, make a copy of the assignment 1 code into assignment
2:
$ cp -r asgn1 asgn2

Then in the source code, rename all instances of "asgn1" to "asgn2".


Also, in assignment 2, mmap() and lseek() will not be needed. Therefore please remove the implementation of these functions.
write() will also not be needed in assignment 2, but you may want to use part of the code later in your bottom half. So at the moment,
please comment out the body of write(). Don't forget to remove the entries for .mmap, .llseek and .write in:
struct file_operations asgn2_fops

This will serve as the base code for your assignment 2.

9.2

Parallel port I/O

9.2.1

Preparing your system

On your system, you need to prevent the module parport from being loaded. Its unloading function contains extra routines
that disable the parallel port altogether, and the module in this assignment would not be able to turn it back on without rebooting.
First, you need to take the following steps to prevent parallel port-related modules from being loaded:
Change the following line in /etc/default/cups:
LOAD_LP_MODULE=yes

to
LOAD_LP_MODULE=no

Comment out lp in /etc/modules


Blacklist all parallel port modules by adding the file /etc/modprobe.d/blacklist_asgn2.conf, which contains:
blacklist parport
blacklist ppdev
blacklist lp
blacklist parport_pc

Reboot your machine


Plug the dummy device into the parallel port. Our dummy device will trigger an interrupt at the parallel port when the MSB of
0x378 rises from 0 to 1.

9.2.2

Setting up Parallel I/O in your module

In Linux, the first parallel port is represented as three ports: 0x378, 0x379 and 0x37a, where 0x378 is the data port where we
read from, 0x379 is the status port and 0x37a is the control port. For the first parallel port, 0x378 is the base address. For details
of the parallel port I/O, please refer to the wikipedia entry:
http://en.wikipedia.org/wiki/Parallel_port
In asgn2_init(), we need to acquire access to the first parallel port by calling:
struct resource *request_region(unsigned long start,
                                unsigned long n,
                                const char *name);

where:
start is the base address of the port, in our case, 0x378
n is the port size, in our case, 3
name is the name of the module
On error, it will return NULL.
When the device unloads, exit() must call:
void release_region(unsigned long start, unsigned long n)

to release access to the parallel port memory.

9.2.3

Interrupt handler

When there is a byte appearing on the parallel port, an interrupt is triggered and we need an interrupt handler (top half) to very
quickly copy that byte from the data port (0x378) to a temporary area such as a circular buffer.
After access to the parallel port memory is successfully acquired, we can install the interrupt handler for our parallel port by
calling:
int request_irq(unsigned int irq,
                irq_handler_t handler,
                unsigned long irqflags,
                const char *devname,
                void *dev_id);

where:
irq is the interrupt line for this handler, which is 7 for our parallel port
handler is your IRQ handler function (top half)
irqflags is the interrupt type flags, in our case, 0
devname is the module name
dev_id is an identifier; we can pass in asgn2_device
It returns a non-zero value upon failure, which we must check for.
Then once the IRQ is successfully reserved, we then enable the interrupt of the parallel port by turning on the interrupt-enable
bit on the control port:
outb_p(inb_p(parport_base + 2) | 0x10, parport_base + 2);

At cleanup, the IRQ must be released by:


void free_irq ( unsigned int irq, void *dev_id);

9.3

Overview of the Assignment 2

In this assignment, the supplied binary program data_generator


http://www.cs.otago.ac.nz/cosc440/data_generator
will send ascii text files to the parallel port, a byte at a time, and the end of file is signalled by a \0 character.
$ sudo ./data_generator <ascii file>

Your module will be read-only and will allow only one reader at a time (i.e. each reader will receive one complete ASCII file).
If multiple readers try to open your device, it will let one reader in and queue up the rest.

9.3.1

Top half

Each byte appearing on the parallel port triggers an interrupt, which runs the interrupt handler (the top half) of the
module. The interrupt handler needs to quickly copy the byte off the parallel port and add it to the circular buffer, the
temporary storage. The interrupt handler then schedules its bottom half (a tasklet or a workqueue) to move the content of the
circular buffer to the multiple-page queue.
Since the interrupt handler cannot sleep and must finish quickly, it has no opportunity to reallocate the circular buffer.
Therefore, when the circular buffer is full and more bytes arrive, the only option is to drop bytes. It is your design decision
whether to drop the newest incoming bytes or the oldest bytes stored in the circular buffer. Also, depending on your choice of
bottom half (tasklet or workqueue), you need to consider whether there can be a data race between the top half (interrupt) and
the bottom half; if there can, you need to prevent it, for example with spinlocks. These design issues need to
be clearly discussed in your report.
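As a rough illustration of the "drop the newest byte" policy, here is a userspace model of a fixed-size circular buffer. The names (struct cbuf, cbuf_put, cbuf_get) and the tiny capacity are hypothetical. In the module, cbuf_put() would run in the interrupt handler and cbuf_get() in the bottom half, protected by whatever locking your design requires; no locking is shown here.

```c
#include <assert.h>
#include <stddef.h>

#define CBUF_SIZE 8 /* hypothetical capacity, kept small for the example */

struct cbuf {
    unsigned char data[CBUF_SIZE];
    size_t head;    /* next slot to read (bottom half side) */
    size_t tail;    /* next slot to write (top half side) */
    size_t count;   /* bytes currently stored */
    size_t dropped; /* bytes discarded because the buffer was full */
};

void cbuf_init(struct cbuf *cb)
{
    cb->head = cb->tail = cb->count = cb->dropped = 0;
}

/* Store one byte. Returns 1 if stored, 0 if the buffer was full
 * and the new byte was dropped. */
int cbuf_put(struct cbuf *cb, unsigned char byte)
{
    if (cb->count == CBUF_SIZE) {
        cb->dropped++;
        return 0;
    }
    cb->data[cb->tail] = byte;
    cb->tail = (cb->tail + 1) % CBUF_SIZE;
    cb->count++;
    return 1;
}

/* Remove the oldest byte. Returns 1 on success, 0 if the buffer is empty. */
int cbuf_get(struct cbuf *cb, unsigned char *out)
{
    if (cb->count == 0)
        return 0;
    *out = cb->data[cb->head];
    cb->head = (cb->head + 1) % CBUF_SIZE;
    cb->count--;
    assert(cb->count < CBUF_SIZE);
    return 1;
}
```

Dropping the oldest byte instead would mean advancing head inside cbuf_put() when the buffer is full, which makes the top half modify state that the bottom half also modifies; that is one reason the choice affects your locking design.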

9.3.2

Bottom half (producer) and read() (consumer) in the multiple-page queue

The bottom half needs to move all bytes from the circular buffer into the multiple-page queue. Here, the multiple-page queue
has a head and a tail index. Each index consists of a page entry and an in-page offset. The bottom half is the producer, so it
advances the tail; read() is the consumer, so it advances the head.
After moving the bytes from the circular buffer to the multiple-page queue, the bottom half needs to update the tail.

Here, read() is the consumer, copying data from the head of the multiple-page queue to user space. As it reads data, it advances
the head, and when a page is finished, it moves to the next page. You need to make a design decision whether to free the used
page immediately or recycle it, and explain that decision in your report.
read() needs to stop reading when the EOF of the current file is reached. Also, if the multiple-page queue is empty but EOF
has not yet been encountered, read() needs to block until new data appears in the queue. This is best handled by a wait queue
shared between the bottom half and read(): after the bottom half adds new data to the multiple-page queue, it needs
to wake up the consumer waiting on that queue.
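The head/tail bookkeeping described in this section can be sketched in userspace as follows. This is only a model under stated assumptions: the names (struct page_queue, pq_push, pq_read), the 4-byte "page" and the use of malloc()/free() are hypothetical stand-ins, used pages are freed immediately (one of the two design choices mentioned above), and blocking and locking are left out.

```c
#include <assert.h>
#include <stdlib.h>

#define PQ_PAGE_SIZE 4 /* tiny hypothetical page so the example crosses pages */

struct pq_page {
    unsigned char data[PQ_PAGE_SIZE];
    struct pq_page *next;
};

struct page_queue {
    struct pq_page *head_page; /* consumer position: page ... */
    size_t head_off;           /* ... and offset within that page */
    struct pq_page *tail_page; /* producer position: page ... */
    size_t tail_off;           /* ... and offset within that page */
};

void pq_init(struct page_queue *q)
{
    q->head_page = q->tail_page = NULL;
    q->head_off = q->tail_off = 0;
}

/* Producer (bottom half): append one byte, adding a page when the
 * tail page is full. Returns 0 on success, -1 if out of memory. */
int pq_push(struct page_queue *q, unsigned char byte)
{
    if (q->tail_page == NULL || q->tail_off == PQ_PAGE_SIZE) {
        struct pq_page *p = malloc(sizeof(*p));

        if (p == NULL)
            return -1;
        p->next = NULL;
        if (q->tail_page != NULL)
            q->tail_page->next = p;
        else
            q->head_page = p;
        q->tail_page = p;
        q->tail_off = 0;
    }
    q->tail_page->data[q->tail_off++] = byte;
    return 0;
}

/* The queue is empty when the head has caught up with the tail. */
int pq_empty(const struct page_queue *q)
{
    return q->head_page == NULL ||
           (q->head_page == q->tail_page && q->head_off == q->tail_off);
}

/* Consumer (read()): copy up to n bytes into buf. Stops early when the
 * queue is empty or when the '\0' end-of-file marker is consumed, in
 * which case *eof is set to 1. Returns the number of bytes copied. */
size_t pq_read(struct page_queue *q, unsigned char *buf, size_t n, int *eof)
{
    size_t copied = 0;

    *eof = 0;
    while (copied < n) {
        unsigned char byte;

        /* Head page fully consumed and a newer page exists: free it. */
        if (q->head_page != NULL && q->head_off == PQ_PAGE_SIZE &&
            q->head_page->next != NULL) {
            struct pq_page *done = q->head_page;

            q->head_page = done->next;
            q->head_off = 0;
            free(done);
        }
        if (pq_empty(q))
            break; /* a real read() would block here until woken up */
        byte = q->head_page->data[q->head_off++];
        assert(q->head_off <= PQ_PAGE_SIZE);
        if (byte == '\0') {
            *eof = 1;
            break;
        }
        buf[copied++] = byte;
    }
    return copied;
}
```

Recycling pages instead of freeing them would replace the free() call with moving the finished page onto a spare list that pq_push() consults before allocating.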


Chapter 10

Timers (Optional)
10.1

jiffies

jiffies is a coarse time measurement variable provided by the Linux kernel:


#include <linux/jiffies.h>
unsigned long volatile jiffies;

And here are some convenience macros to compare two jiffies timestamps:
time_after(a, b);
time_before(a, b);
time_after_eq(a, b);
time_before_eq(a, b);
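These macros exist because jiffies is an unsigned counter that eventually wraps around, so a plain a > b comparison gives the wrong answer near the wrap. The sketch below reproduces the comparison logic in userspace under the hypothetical sim_ prefix (the kernel macros additionally type-check their arguments); deadline_passed() is an illustrative helper, not a kernel function.

```c
#include <assert.h>
#include <limits.h>

/* Userspace copies of the comparison logic: the unsigned difference is
 * reinterpreted as signed, so the result stays correct across a wrap. */
#define sim_time_after(a, b)     ((long)((b) - (a)) < 0)
#define sim_time_before(a, b)    sim_time_after(b, a)
#define sim_time_after_eq(a, b)  ((long)((a) - (b)) >= 0)
#define sim_time_before_eq(a, b) sim_time_after_eq(b, a)

/* Returns 1 if a deadline set `timeout` ticks after `start` has been
 * reached at time `now`, 0 otherwise. */
int deadline_passed(unsigned long now, unsigned long start,
                    unsigned long timeout)
{
    unsigned long deadline = start + timeout; /* may wrap; that is fine */

    return sim_time_after_eq(now, deadline);
}
```

For example, with start = ULONG_MAX - 5 and a timeout of 10 ticks, the deadline wraps to 4; a naive deadline > start test is false, while sim_time_after(deadline, start) is true.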

jiffies can be used to introduce busy waiting. For example:


#include <linux/jiffies.h>
#include <linux/sched.h>
unsigned long jifdone = jiffies + delay * HZ; /* delay is in seconds */
while (time_before(jiffies, jifdone))
        ; /* do nothing */

This kind of busy waiting is inefficient: jiffies is re-read on every iteration, and the loop locks up the CPU for the whole
delay. Busy waiting should therefore not be used unless the wait time is very short (say under 50 jiffies).
For short delays, the following functions, which also busy-wait but are calibrated for the CPU, should be used instead:
#include <linux/delay.h>
void ndelay(unsigned long nanoseconds);
void udelay(unsigned long microseconds);
void mdelay(unsigned long milliseconds);

Do not expect ndelay() to give true nanosecond resolution; most architectures only provide resolution down to microseconds.
Alternatively, these functions can be used, which sleep instead of busy waiting:
void msleep (unsigned int milliseconds);
unsigned long msleep_interruptible (unsigned int milliseconds);

If msleep_interruptible() returns before the sleep has finished because of a signal, it returns the number of milliseconds left in
the requested sleep period.

10.2

Timers

Timers are used to delay the execution of a function until a specified time has elapsed. The function will run on the CPU on
which the timer was submitted.
Since the CPU may not be available when it is time to execute the function, timers can only guarantee that the function will
not run before the specified time has elapsed. In practice, the function will run about a clock tick after the timer expires,
unless some greedy high-latency task has been holding interrupts off.
Since the scheduled function runs in atomic context rather than process context, it cannot do anything that
cannot be done at interrupt time, including anything that can sleep (e.g. no transfer of data to or from user space, no
semaphores, no memory allocation with GFP_KERNEL, etc.).
Here are the important data structure and functions for kernel timers:
#include <linux/timer.h>
struct timer_list {
        struct list_head entry;
        unsigned long expires;
        void (*function)(unsigned long);
        unsigned long data;
        struct tvec_t_base_s *base;
};
void init_timer(struct timer_list *timer);
void add_timer(struct timer_list *timer);
int mod_timer(struct timer_list *timer, unsigned long expires);
int del_timer(struct timer_list *timer);
int del_timer_sync(struct timer_list *timer);

where:
entry links the timer into the doubly-linked circular list of kernel timers.
expires is measured in jiffies. It is an absolute value, not a relative one.
The function to be run is passed as function(), and data can be passed to it through the unsigned long argument data (a pointer
can be passed by casting it to unsigned long).
init_timer() zeroes the previous and next pointers in the linked list.
add_timer() inserts the timer into the global timer list.
mod_timer() can be used to reset the time at which a timer expires.
del_timer() can remove a timer before it expires. It returns 1 if it deletes a pending timer, or 0 if the timer was not pending,
for example because the timer function has already started executing. It is not necessary to call del_timer() if the timer
expires on its own.
del_timer_sync() makes sure that upon return, the timer function is not running on any CPU. This function should be used
on SMP systems as it avoids race conditions.
A timer function can re-arm its own timer to set up a periodic timer. This can be done by calling, from within the timer function:
mod_timer(&t, jiffies + delay);
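To make the re-arming pattern concrete, here is a small userspace simulation. It is not the kernel API: sim_jiffies, struct sim_timer, sim_mod_timer() and sim_tick() are hypothetical stand-ins that only mimic the one-shot behaviour and the absolute expires value described above. In this sketch the data argument carries the period in ticks.

```c
#include <assert.h>
#include <stddef.h>

unsigned long sim_jiffies; /* simulated tick counter */

struct sim_timer {
    unsigned long expires;           /* absolute tick at which to fire */
    void (*function)(unsigned long); /* callback to run on expiry */
    unsigned long data;              /* argument handed to the callback */
    int pending;                     /* 1 while the timer is armed */
};

/* Arm (or re-arm) the timer for an absolute expiry time. */
void sim_mod_timer(struct sim_timer *t, unsigned long expires)
{
    t->expires = expires;
    t->pending = 1;
}

/* Advance the clock by one tick and fire the timer if it is due. */
void sim_tick(struct sim_timer *t)
{
    sim_jiffies++;
    if (t->pending && (long)(sim_jiffies - t->expires) >= 0) {
        assert(t->function != NULL);
        t->pending = 0;       /* timers are one-shot: disarm first */
        t->function(t->data); /* the callback may re-arm the timer */
    }
}

struct sim_timer periodic;
unsigned int fire_count;

/* Callback: count the expiry, then re-arm one period into the future. */
void periodic_fn(unsigned long period)
{
    fire_count++;
    sim_mod_timer(&periodic, sim_jiffies + period);
}
```

With a period of 3 ticks and the timer first armed at tick 0, ten calls to sim_tick() fire the callback at ticks 3, 6 and 9 and leave the timer armed for tick 12.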

Here is a code snippet showing the usage of kernel timers:


static struct timer_list my_timer;
init_timer(&my_timer);
my_timer.function = my_function;
my_timer.expires = jiffies + ticks;
my_timer.data = (unsigned long)&my_data;
add_timer(&my_timer);
.....
/* we don't need to execute my_function() anymore */
del_timer(&my_timer);

10.3

Exercise: Kernel Timers from a Character Driver

Write a driver that launches a kernel timer whenever a write() to the device takes place.
Pass some data to the driver and have it printed out.
Have it print out the current->pid field when the timer function is scheduled, and then again when the function is executed.
