
Linux Device Drivers

These slides are made available to you under a Creative Commons Share-Alike 3.0 license. The full terms of this license are here: https://creativecommons.org/licenses/by-sa/3.0/ Attribution requirements and misc., PLEASE READ: This slide must remain as-is in this specific location (slide #2), everything else you are free to change; including the logo :-) Use of figures in other documents must feature the below Originals at URL immediately under that figure and the below copyright notice where appropriate. You are free to fill in the Delivered and/or customized by space on the right as you see fit. You are FORBIDEN from using the default About the instructor slide as-is or any of its contents. (C) Copyright 2005-2012, Opersys inc. These slides created by: Karim Yaghmour Originals at: www.opersys.com/training/linux-device-drivers

Delivered and/or customized by

Course structure and presentation
1. About the instructor
2. Goals
3. Presentation format
4. Expected knowledge
5. Day-by-day outline
6. Courseware

1. About the instructor

Author of:

- Introduced Linux Trace Toolkit in 1999
- Originated Adeos and relayfs (kernel/relay.c)

2. Goals

- To provide an in-depth understanding of how to develop, build and test device drivers for Linux
- To give you hands-on experience in developing, building and testing device drivers for Linux

3. Presentation format

- Course has two main tracks:
  - Lecture: Instructor presents and discusses material
  - Exercises: Attendees put to practice the material presented, with instructor assistance.
- Fast pace. Answers to some questions will be postponed until the relevant section is covered.
- Given that there is a lot of material, the instructor will set the speed as required.

4. Expected knowledge

- Basic embedded systems experience
- Basic understanding of operating system concepts.
- C programming experience
- Basic grasp of open source and free software philosophy.
- Good understanding of the debugging process
- Good understanding of complex project architecture.

5. Day-by-day outline

Day 1:
1. Introduction
2. Hardware and Linux, a view from user space
3. Writing modules
4. Driver types, subsystem APIs and driver skeletons
5. Hooking up with and using key kernel resources

Day 2:
6. Locking mechanisms
7. Interrupts and interrupt deferral
8. Timely execution and time measurement
9. Memory resources
10. Hardware access

Day 3:
11. Char drivers
12. Block drivers
13. Network drivers
14. PCI drivers
15. USB drivers
16. TTY drivers

6. Courseware

- "Linux Device Drivers", 3rd ed. book
- "Device Drivers for Embedded Linux" slides manual.
- Exercise set
- CD-ROM for hands-on session

Introduction
1. System architecture review
2. User space vs. kernel space
3. In the kernel world
4. Drivers
5. Hands-on work environment

1. System architecture review

Kernel
- Provide Unix API to applications
- Manage core I/O
- Manage memory
- Control process scheduling
- etc.

Drivers
- Control hardware
- Interface with kernel
- Provide expected semantics to kernel

Libraries
- Provide "sugar-coated" interfaces to applications

Applications
- Talk to kernel through libraries
- Talk to device through kernel
- Implement end-user functionality

2. User space vs. kernel space

- Separate address space:
  - No explicit references to objects from other space
- Memory protection amongst processes:
  - No process can directly access or alter other processes' memory areas.
- Memory protection between processes and kernel:
  - No process can access anything inside the kernel
  - Processes that attempt die (segfault)
- Crossing between user space and kernel space is through specific events (will see later)

3. In the kernel world

- Use modules
- Once loaded, core API available:
  - Kernel API changes: http://lwn.net/Articles/2.6-kernel-api/
- Have full control ... easy to crash
- Lots of concurrency in the kernel
- Only one current process per CPU

4. Drivers

- Built-in vs. modularized
- User-space drivers?
  - A concept more than a reality
  - X window, libusb, gadgetfs, CD writers, ...
  - Hard to map hardware resources (RAM, interrupts, etc.)
  - Slow (swapping, context switching, etc.)
- Security issues
  - Must have parameter checking in drivers
  - Pre-initialize buffers prior to passing to user space

Licensing reminder

- Although the use of binary-only modules is widespread, kernel modules are not immune to the kernel's GPL. See LDD3, p.11
- Many kernel developers have come out rather strongly against binary-only modules.
- Have a look at BELS appendix C for a few copies of notices on binary-only modules.
- If you are linking a driver as built-in, then you are most certainly forbidden from distributing the resulting kernel under any license other than the GPL.

Firmware:
- Often manufacturer provides binary-only firmware
- Kernel used to contain firmware binaries as static hex strings.
- Nowadays, firmware is loaded at runtime from user space (like laptop Intel 2200).

5. Hands-on work environment

- Typically would use same cross-development toolchain used for the rest of the system components.
- Qemu
  - No actual cross-dev
  - Real hardware would require BDM/JTAG
  - Build is on regular x86 Linux host

Hardware and Linux, a view from user space
1. Device files
2. Types of devices
3. Major and minor numbers
4. /proc and procfs
5. /sys and sysfs
6. /dev and udev
7. The tools

1. Device files

- Everything is a file in Unix, including devices
- All devices are located in the /dev directory
- Only networking devices do not have /dev nodes
- Every device is identified by major/minor number
  - Can be allocated statically (devices.txt)
  - Can be allocated dynamically
- To see devices present: $ cat /proc/devices

2. Types of devices

- What user space sees:
  - Char
  - Block
  - Network
- Abstractions provided by kernel:
  - USB
  - SCSI
  - I2C
  - ALSA
  - MTD
  - etc.

3. Major and minor numbers

- The glue between user-space device files and the device drivers in the kernel.
- User space accesses devices through device nodes ... special entries in the filesystem.
- Each device instance has a major number and a minor number.
- Each char and block driver that registers with the kernel has a major number.
- When user space attempts to access a device node with that same number, all accesses result in actions/callbacks to the driver.
- Minor number is device instance
- Minor number is not recognized or used by the kernel.
- Minor number only used by driver
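To make the major/minor split concrete from the user-space side, here is a minimal sketch, assuming a glibc system where makedev(), major() and minor() come from <sys/sysmacros.h>. The helper name format_devnum() is hypothetical and only used for illustration.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysmacros.h>  /* makedev(), major(), minor() (glibc) */

/*
 * Format a device number as "major:minor" into buf.
 * Returns what snprintf() returns (number of characters needed).
 */
int format_devnum(dev_t dev, char *buf, size_t len)
{
    return snprintf(buf, len, "%u:%u",
                    (unsigned int)major(dev), (unsigned int)minor(dev));
}
```

For a real device node, call stat() on the path and pass st_rdev to the helper; the major part selects the driver and the minor part is what the driver uses to pick the instance.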

4. /proc and procfs

- /proc is a virtual filesystem of type procfs
- All files and directories in /proc exist only in memory.
- Reads/writes result in callback invocation
- /proc is used to export information about a lot of things in the kernel.
- Used by many subsystems, drivers, and core functionality.
- Typically regarded by kernel developers as a mess.

Example /proc entries:

cpuinfo:     Info about CPU
interrupts:  List of interrupts and related drivers
iomem:       List of I/O memory regions
ioports:     List of I/O port regions
devices:     List of active char and block drivers
dma:         Active DMA channels (ISA)
bus/pci:     Info tree about PCI
bus/usb:     Info tree about USB (usbdevfs)

5. /sys and sysfs

- New way of exporting kernel info to user space
- Meant to eventually replace /proc
- Tightly tied to the kernel's device model
- Provides very good view of hardware and devices tied to it.
- Contains multiple views of the same hardware:
  - Bus view (/sys/bus)
  - Device view (/sys/devices)
  - Class view (/sys/class)

6. /dev and udev

- /dev is the main repository for device nodes
- Distros used to ship with thousands of entries in /dev; for every possible hardware out there.
- Became increasingly difficult to use as devices were more and more mobile.
- With the arrival of sysfs and the related hotplug functionality: udev.
- udev automagically creates the appropriate entries in /dev dynamically.
- Can be configured to provide a persistent view

7. The tools

- Many tools to see or control hardware
- Examples:
  - lspci: list PCI devices
  - lsusb: list USB devices
  - fdisk: partition disk
  - hdparm: set disk parameters
  - ifconfig, iwconfig: configure network interface
  - insmod, modprobe, rmmod, lsmod: manage modules
  - halt, reboot: control system
  - hotplug: manage the adding/removal of hardware

Writing modules
1. Setting up your test system
2. Kernel modules versus applications
3. Compiling and loading
4. The kernel symbol table
5. Preliminaries
6. Initialization and shutdown
7. Module parameters
8. /sys/module and /proc/modules

1. Setting up your test system

- Use mainline
- 3rd-party kernels can have a different API
- Build system requires targeted kernel

2. Kernel modules versus applications

- init and cleanup
- Kernel space vs. user space:
  - Memory protection/isolation
  - Access rights
  - Limited ways for transition:
    - Interrupts
    - Traps
    - Syscalls
- Concurrency
- The "current" process
- Stack limit in kernel: 4K
- Use appropriate allocation primitives to obtain large structures.
- Beware of functions prefixed with "__" => internals
- No floating point (FP)

3. Compiling and loading

Compiling modules:
- Use of kernel build system
- obj-m := => module name
- module-objs := => files making up module
- M= => module directory (pwd)
- Example makefile on p.24 of LDD3
- More complete makefile needs "clean:"

Loading and unloading modules:
- insmod
  - sys_init_module (kernel/module.c)
- modprobe (resolves unresolved symbols)
- rmmod
- lsmod

Version dependency:
- No discussion of modversions
- Must compile modules against exact kernel used
- Modules linked to vermagic.o for:
  - Version
  - CPU build flags
- vermagic tested against targeted kernel at load time
- KERNELDIR specifies version
- linux/module.h includes linux/version.h, which has:
  - UTS_RELEASE => string
  - LINUX_VERSION_CODE => hex
  - KERNEL_VERSION() => macro for building hex value for comparison against checkpoint.
- #ifdefs when needed
- Use low-level/high-level macros to hide details
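The version checkpoint comparison can be sketched in plain C. The packing below is the classic 2.6-era definition of KERNEL_VERSION(); the DEMO_ names are hypothetical stand-ins so the sketch compiles in user space without the kernel's <linux/version.h>, and the "2.6.18" target is an arbitrary assumption.

```c
/* Same packing as the 2.6-era KERNEL_VERSION() macro: one byte per component. */
#define DEMO_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Stand-in for LINUX_VERSION_CODE: pretend we build against 2.6.18. */
#define DEMO_LINUX_VERSION_CODE DEMO_KERNEL_VERSION(2, 6, 18)

/* Run-time form of the comparison: 1 if the pretend target is >= a.b.c. */
int demo_kernel_at_least(int a, int b, int c)
{
    return DEMO_LINUX_VERSION_CODE >= DEMO_KERNEL_VERSION(a, b, c);
}

/* Compile-time form, as a driver would use it to hide an API change. */
const char *demo_api_variant(void)
{
#if DEMO_LINUX_VERSION_CODE >= DEMO_KERNEL_VERSION(2, 6, 10)
    return "new API";
#else
    return "old API";
#endif
}
```

In a real module the same #if would use LINUX_VERSION_CODE and KERNEL_VERSION() directly, ideally wrapped once in a compatibility header rather than scattered through the driver.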

4. The kernel symbol table

- Basic kernel symbol table exported at build time
- Kernel symbol table kept up to date at runtime by kernel.
- Module stacking (ex.: USB, FSes, parport, etc.)
- modprobe resolves dependencies
- EXPORT_SYMBOL():
  - Macro used outside any function scope in module to export symbols for use by other modules.
- EXPORT_SYMBOL_GPL():
  - Same as EXPORT_SYMBOL(), but symbols are only available to modules licensed as "GPL".

5. Preliminaries

- Must have:
  - #include <linux/module.h>
  - #include <linux/init.h>
- If need module parameters:
  - #include <linux/moduleparam.h>
- MODULE_LICENSE("...");
  - "GPL v2"
  - "GPL"
  - "GPL and additional rights"
  - "Dual BSD/GPL"
  - "Dual MPL/GPL"
  - "Proprietary"
- MODULE_AUTHOR();
- MODULE_DESCRIPTION();
- MODULE_VERSION();
- MODULE_ALIAS();
- MODULE_DEVICE_TABLE();
- Usually, macros put at the end of the file
- To check fields, use modinfo command

6. Initialization and shutdown

Initialization:
- Declare init as "static" => file-limited scope
- __init => drop function after init
- module_init => specify module initialization function
- Use relevant kernel API to register callbacks or structures with callbacks.
- Register all sorts of things:
  - devices, filesystems, line disciplines, /proc entries, etc.

The cleanup function:
- __exit to specify that cleanup is for unloading only
- void => no return value
- Use module_exit()
- In exit, reverse deallocation order for resources allocated in init.
- No cleanup = no unloading

Error handling during initialization:
- Registration/allocation may fail
- Failure to deallocate on error = unstable
- Use reverse-order goto instead of per-error rollback
- Use proper return code in case of error (<linux/errno.h>)

Use of custom cleanup:
- To be called from init or as exit
- Checks for non-null global vars
- Can't be marked as __exit

Module loading races:
- Callbacks could be invoked as soon as they are registered:
  - Make sure callbacks aren't active prior to load finish
  - Properly code module to complete internal registration prior to callback registration.
- In case of init failure, some kernel parts may already be using functions registered prior to failure.
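The reverse-order goto idiom is easier to see in code than in bullets. The following is a user-space sketch only: malloc()/free() stand in for kernel registrations, and every name (demo_init, demo_exit, res_a, and so on) is hypothetical. The fail_at argument exists purely so the failure paths can be exercised.

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-ins for whatever a module's init would register or allocate. */
static char *res_a, *res_b, *res_c;

/* fail_at forces the Nth acquisition (1..3) to fail; 0 means no forced failure. */
int demo_init(int fail_at)
{
    int err = -ENOMEM;

    res_a = (fail_at == 1) ? NULL : malloc(16);
    if (!res_a)
        goto fail_a;

    res_b = (fail_at == 2) ? NULL : malloc(16);
    if (!res_b)
        goto fail_b;

    res_c = (fail_at == 3) ? NULL : malloc(16);
    if (!res_c)
        goto fail_c;

    return 0;   /* everything acquired */

    /* Unwind in reverse order of acquisition; each label undoes one step. */
fail_c:
    free(res_b);
    res_b = NULL;
fail_b:
    free(res_a);
    res_a = NULL;
fail_a:
    return err;
}

/* Mirror image of a successful init: release in reverse order. */
void demo_exit(void)
{
    free(res_c); res_c = NULL;
    free(res_b); res_b = NULL;
    free(res_a); res_a = NULL;
}

/* Number of resources currently held, to check that unwinding was complete. */
int demo_held(void)
{
    return (res_a != NULL) + (res_b != NULL) + (res_c != NULL);
}
```

Each new acquisition adds one label at the top of the unwind chain, so the error paths stay in step with init without duplicating rollback code at every failure point.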

7. Module parameters

- May need to specify some params at load time
- Specified at load time:
  - insmod
  - modprobe (/etc/modprobe.conf)
- module_param(): used outside of any function scope
  - Name of variable
  - Variable type
  - Permission mask for sysfs
- Var types:
  - bool, invbool, charp, int, long, short, uint, ulong, ushort
- module_param_array() for parameter arrays:
  - For each entry: name, type, num, and perm
  - Comma-separated entries
  - num is number of elements in array
- Use default values for all params
- Defaults change on insmod only if requested by user.
- Permissions are as specified in <linux/stat.h>:
  - S_IRUGO => read-only for world
  - S_IRUGO|S_IWUSR => root-only write
- Writable parameters do not generate signal to module, must be detected live.

8. /sys/module and /proc/modules

- /sys/module has one directory per loaded module.
- /proc/modules is older interface providing list of loaded modules.

Types of drivers, subsystem APIs and driver skeletons
1. Char device driver
2. Block device driver
3. Network device driver
4. MTD map file
5. Framebuffer driver

1. Writing a char device driver

- Register char dev during module initialization
- Char dev registration: include/linux/fs.h

int register_chrdev(unsigned int, const char *, struct file_operations *);

- First param: Major number
- Second param: Device name (as displayed in /proc/devices)
- Third param: File ops
  - Defined in include/linux/fs.h
  - Contains callbacks for all possible operations on a char device.

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
    int (*readdir) (struct file *, void *, filldir_t);
    unsigned int (*poll) (struct file *, struct poll_table_struct *);
    int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, struct dentry *, int datasync);
    int (*aio_fsync) (struct kiocb *, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned ...
    int (*check_flags)(int);
    int (*dir_notify)(struct file *filp, unsigned long arg);
    int (*flock) (struct file *, int, struct file_lock *);
};

- Call register_chrdev() and pass it a valid file_operations structure.
- Return 0 from initialization function to tell insmod that everything is OK.
- That's it. Every time the device in /dev having the same major number as the one you registered is opened, your driver will be called.
- To remove char dev on rmmod:

int unregister_chrdev(unsigned int, const char *);
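How a major number ends up selecting a driver's callbacks can be modelled in a few lines of user-space C. This is a toy model, not kernel code: the toy_ names, the cut-down operations structure, and the table size are all assumptions made for illustration. It mirrors the register_chrdev() convention that passing major 0 asks for a dynamically allocated major, which is then returned.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

#define TOY_MAX_MAJOR 256

/* Cut-down, hypothetical stand-in for struct file_operations. */
struct toy_file_operations {
    int     (*open)(void);
    ssize_t (*read)(char *buf, size_t count);
};

struct toy_chrdev {
    const char *name;
    const struct toy_file_operations *fops;
};

/* One slot per major number, as shown by /proc/devices in the real thing. */
static struct toy_chrdev toy_chrdevs[TOY_MAX_MAJOR];

/*
 * maj == 0 asks for a free major, which is returned (> 0).
 * An explicit major returns 0 on success. Errors are negative errno values.
 */
int toy_register_chrdev(unsigned int maj, const char *name,
                        const struct toy_file_operations *fops)
{
    unsigned int i = 0;

    if (maj >= TOY_MAX_MAJOR || !name || !fops)
        return -EINVAL;
    if (maj == 0) {
        for (i = TOY_MAX_MAJOR - 1; i > 0; i--)
            if (!toy_chrdevs[i].fops)
                break;
        if (i == 0)
            return -EBUSY;      /* no free major left */
        maj = i;
    } else if (toy_chrdevs[maj].fops) {
        return -EBUSY;          /* major already taken */
    }
    toy_chrdevs[maj].name = name;
    toy_chrdevs[maj].fops = fops;
    return (int)i;
}

int toy_unregister_chrdev(unsigned int maj, const char *name)
{
    if (maj == 0 || maj >= TOY_MAX_MAJOR || !toy_chrdevs[maj].fops)
        return -EINVAL;
    if (strcmp(toy_chrdevs[maj].name, name) != 0)
        return -EINVAL;
    toy_chrdevs[maj].name = NULL;
    toy_chrdevs[maj].fops = NULL;
    return 0;
}

/* What an access to a device node boils down to: look up major, call back. */
ssize_t toy_dispatch_read(unsigned int maj, char *buf, size_t count)
{
    if (maj >= TOY_MAX_MAJOR || !toy_chrdevs[maj].fops ||
        !toy_chrdevs[maj].fops->read)
        return -ENODEV;
    return toy_chrdevs[maj].fops->read(buf, count);
}

/* A minimal "driver": read() hands back a fixed string. */
static ssize_t hello_read(char *buf, size_t count)
{
    static const char msg[] = "hello\n";
    size_t n = sizeof(msg) - 1;

    if (count < n)
        n = count;
    memcpy(buf, msg, n);
    return (ssize_t)n;
}

const struct toy_file_operations hello_fops = { .read = hello_read };
```

A caller would register hello_fops with major 0, keep the returned major, and unregister with the same major and name on exit, which is the same pairing a real module performs in its init and cleanup functions.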

2. Writing a block device driver

- Register block dev during module initialization
- Block dev registration: include/linux/fs.h

int register_blkdev(unsigned int, const char *);

- First param: Major number
- Second param: Device name
- Disk allocation: include/linux/genhd.h

struct gendisk *alloc_disk(int minors);

- Block queue registration: include/linux/blkdev.h

extern request_queue_t *blk_init_queue(request_fn_proc *, spinlock_t *);

- Queue of pending I/O operations for device
- First param: Queue handler function
- Second param: Lock for accessing queue
- Call register_blkdev().
- Call alloc_disk() and pass it the number of minors.
- Call blk_init_queue() and pass it a valid callback.
- Return 0 from init function to tell insmod status
- Now, all block operations on your device (/dev entry with same major number as driver) will be queued to your driver.
- To remove block dev on rmmod:

void blk_cleanup_queue(request_queue_t *);
void put_disk(struct gendisk *disk);
int unregister_blkdev(unsigned int, const char *);

3. Writing a network device driver

- Register net dev during module initialization
- Net dev registration: include/linux/netdevice.h

int register_netdevice(struct net_device *dev);

- Param: net device ops
  - Defined in include/linux/netdevice.h
  - Contains all callbacks related to network devices
  - This is a huge structure with A LOT of fields
- Call register_netdevice() and pass it a valid net_device structure.
- Return 0 as status to insmod
- Your device will need to be opened by the kernel in response to an ifconfig command.
- Your open() function must allocate a packet queue to deal with packets sent to your device.
- Calling your device will depend on packet routing at the upper layers of the stack.
- To remove: unregister_netdev(struct net_device *dev);

4. Writing an MTD map file

- Must find device in memory and then register it
- Finding a device in memory: include/linux/mtd/map.h

struct mtd_info *do_map_probe(char *name, struct map_info *map);

- First param: Type of probe (ex: cfi_probe)
- Second param: map info
  - Defined in include/linux/mtd/map.h
  - Information regarding size and bus width
  - Functions for accessing chip

struct map_info {
    char *name;
    unsigned long size;
    unsigned long phys;
#define NO_XIP (-1UL)
    void __iomem *virt;
    void *cached;
    int bankwidth;
#ifdef CONFIG_MTD_COMPLEX_MAPPINGS
    map_word (*read)(struct map_info *, unsigned long);
    void (*copy_from)(struct map_info *, void *, unsigned long, ssize_t);
    void (*write)(struct map_info *, const map_word, unsigned long);
    void (*copy_to)(struct map_info *, unsigned long, const void *, ssize_t);
#endif
    void (*inval_cache)(struct map_info *, unsigned long, ssize_t);
    /* set_vpp() must handle being reentered -- enable, enable, disable
       must leave it enabled. */
    void (*set_vpp)(struct map_info *, int);
    unsigned long map_priv_1;
    unsigned long map_priv_2;
    void *fldrv_priv;
    struct mtd_chip_driver *fldrv;
};

- Once located, use add_mtd_partitions() to provide partition information to MTD subsystem.
- add_mtd_partitions() is in include/linux/mtd/partitions.h

int add_mtd_partitions(struct mtd_info *, struct mtd_partition *, int);

- First param: pointer returned by do_map_probe()
- Second param: partition information as we saw earlier.
- Third param: number of partitions

5. Writing a framebuffer driver

- Register framebuffer during module init
- Framebuffer registration: include/linux/fb.h

int register_framebuffer(struct fb_info *fb_info);

- Param: fb info
  - Defined in include/linux/fb.h
  - Contains callbacks for all framebuffer operations

struct fb_info {
    int node;
    int flags;
    struct fb_var_screeninfo var;   /* Current var */
    struct fb_fix_screeninfo fix;   /* Current fix */
    struct fb_monspecs monspecs;    /* Current Monitor specs */
    struct work_struct queue;       /* Framebuffer event queue */
    struct fb_pixmap pixmap;        /* Image hardware mapper */
    struct fb_pixmap sprite;        /* Cursor hardware mapper */
    struct fb_cmap cmap;            /* Current cmap */
    struct list_head modelist;      /* mode list */
    struct fb_ops *fbops;
    struct device *device;
#ifdef CONFIG_FB_TILEBLITTING
    struct fb_tile_ops *tileops;    /* Tile Blitting */
#endif
    char __iomem *screen_base;      /* Virtual address */
    unsigned long screen_size;      /* Amount of ioremapped VRAM or 0 */
    void *pseudo_palette;           /* Fake palette of 16 colors */
#define FBINFO_STATE_RUNNING 0
#define FBINFO_STATE_SUSPENDED 1
    u32 state;                      /* Hardware state i.e suspend */
    void *fbcon_par;                /* fbcon use-only private area */
    /* From here on everything is device dependent */
    void *par;
};

- Call register_framebuffer() and pass it a valid fb_info structure.
- Return 0 from init code
- Accessing /dev/fbX (where X is your framebuffer's registration order in relationship to other fb drivers) results in the functions provided in fb_info being called.
- To remove fb dev on rmmod:

int unregister_framebuffer(struct fb_info *fb_info);

Hooking up with and using key kernel resources
1. printk
2. /proc
3. Introduction to sysfs
4. Sysfs entry types
5. Sysfs layout
6. Kobjects, ksets, and subsystems
7. Low-level sysfs operations
8. Hotplug event generation
9. Buses, devices, and drivers
10. Classes
11. Sysfs example
12. Hotplug
13. Copy to/from user
14. Dealing with firmware

1. printk

Basics:
- Equivalent to libc's printf()
- Same semantics as printf
- Recognizes 8 loglevels:
  - KERN_EMERG: Very urgent messages, right before crashing
  - KERN_ALERT: Alert requiring immediate action
  - KERN_CRIT: Critical issue
  - KERN_ERR: Important error
  - KERN_WARNING: Non-critical warnings
  - KERN_NOTICE: Normal, but worth noting.
  - KERN_INFO: Informational
  - KERN_DEBUG: Debugging messages
- No loglevel => DEFAULT_MESSAGE_LOGLEVEL => usually KERN_WARNING
- Print to console depends on loglevel (console_loglevel)
- When klogd and syslogd are running => append to /var/log/messages, regardless of value of console_loglevel.
- klogd does not save consecutive identical lines, just their count.
- If klogd not running, must manually read /proc/kmsg.
- console_loglevel set to DEFAULT_CONSOLE_LOGLEVEL.
- console_loglevel set through sys_syslog().
- Read loglevel config from /proc/sys/kernel/printk (LDD3, p.77).
- Write to /proc/sys/kernel/printk modifies current loglevel.

Redirecting console messages:
- May send messages to specific virtual console
- By default, messages sent to current virtual terminal
- Can use ioctl(TIOCLINUX) to direct messages to a given console
- See TIOCLINUX in drivers/char/tty_io.c for more info.

How messages get logged:
- Circular buffer
- Size is __LOG_BUF_LEN, configurable at build time from 4KB to 1MB
- Use of sys_syslog or /proc/kmsg to read content
- printk wakes up whoever is waiting on either
- sys_syslog may be made not to consume
- /proc/kmsg is always consuming (klogd), like a FIFO.
- Wraparound on overflow
- Data is unprocessed if unread
- klogd dispatches messages to syslogd, which checks /etc/syslog.conf for figuring out how to deal with such messages.
- syslogd logs messages by facility, kernel is LOG_KERN.
- May want to customize /etc/syslog.conf for dispatching kernel messages.
- syslogd messages immediately flushed to disk => performance issue.
- syslogd may be made to avoid this by prefixing log file name with hyphen.

Turning the messages on and off:
- Use C macros to disable/enable debug statements at build time.
- See LDD3 p.80 for example.

Rate limiting:
- Too many printks:
  - Printk buffer overflow
  - Too many messages to console = slow
  - Not professional on production systems
  - etc.
- Use:

if (printk_ratelimit())
    printk(...);

- If too many messages to console, printk_ratelimit() retval == 0 (nonzero means it is OK to print)
- Rate customizable through /proc/sys/kernel/printk_ratelimit

Printing device numbers:
- Macros for pretty formatting device number (<linux/kdev_t.h>):

int print_dev_t(char *buffer, dev_t dev);
char *format_dev_t(char *buffer, dev_t dev);

- Difference in return val (quantity vs. buffer)
- buffer of 20+ bytes
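To illustrate the rate-limiting idea behind printk_ratelimit(), here is a small interval-and-burst limiter in user-space C. It is a sketch under stated assumptions, not the kernel's implementation: the structure, the demo_ names, and the choice to pass the current time in as a parameter (so the behaviour is deterministic) are all hypothetical.

```c
/* Hypothetical limiter state: allow at most `burst` messages per `interval` ticks. */
struct demo_ratelimit {
    unsigned long interval;   /* window length, in caller-defined ticks */
    unsigned int  burst;      /* messages allowed per window */
    unsigned long begin;      /* start of the current window */
    unsigned int  printed;    /* messages let through in this window */
    unsigned int  missed;     /* messages suppressed in this window */
};

/*
 * Returns nonzero if the caller may print now, 0 if it should stay quiet,
 * matching the "if (printk_ratelimit()) printk(...);" calling convention.
 */
int demo_ratelimit(struct demo_ratelimit *rl, unsigned long now)
{
    if (now - rl->begin >= rl->interval) {
        /* Window expired: start a fresh one. */
        rl->begin = now;
        rl->printed = 0;
        rl->missed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return 1;
    }
    rl->missed++;
    return 0;
}
```

The missed counter is kept so that a caller could report how many messages were dropped once printing resumes, which is useful when diagnosing a noisy driver.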

2. /proc

- Avoiding syslog overhead
- Using the /proc filesystem:

int (*read_proc)(char *page, char **start, off_t offset, int count, int *eof, void *data);
struct proc_dir_entry *create_proc_read_entry(const char *name, mode_t mode, struct proc_dir_entry *base, read_proc_t *read_proc, void *data);

seq files:
- For easy implementation of large entries in /proc
- <linux/seq_file.h>

Prototypes:

void *start(struct seq_file *sfile, loff_t *pos);
void *next(struct seq_file *sfile, void *v, loff_t *pos);
  - v is iterator as returned by previous start() or next()
  - pos is file position
int show(struct seq_file *sfile, void *v);
void stop(struct seq_file *sfile, void *v);

- Sequence of calls from start to stop is atomic.

Functions to be used by show():

int seq_printf(struct seq_file *sfile, const char *fmt, ...);
  - Equivalent of printf
  - If retval > 0, buffer has filled and output is discarded
  - Usually retval is ignored by users of seq_printf()
int seq_putc(struct seq_file *sfile, char c);
  - Equivalent of user-space putc()
int seq_puts(struct seq_file *sfile, const char *s);
  - Equivalent of user-space puts()
int seq_escape(struct seq_file *m, const char *s, const char *esc);
  - Print "esc" characters found in "s" in octal form
  - Typical value for "esc": "\t\n\\"
int seq_path(struct seq_file *sfile, struct vfsmount *m, struct dentry *dentry, char *esc);
  - Print out file name associated with directory entry
  - Not usually used in drivers.

Declare functions in struct seq_operations:

static struct seq_operations my_seq_ops = {
    .start = my_seq_start,
    .next  = my_seq_next,
    .stop  = my_seq_stop,
    .show  = my_seq_show
};

Declare file ops for connecting seq ops:

static struct file_operations my_proc_ops = {
    .owner   = THIS_MODULE,
    .open    = my_proc_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = seq_release
};

- seq_read, seq_lseek, seq_release provided by kernel
- my_proc_open:

static int my_proc_open(struct inode *inode, struct file *file)
{
    return seq_open(file, &my_seq_ops);
}

Register entry with /proc:

entry = create_proc_entry("scullseql", 0, NULL);
if (entry)
    entry->proc_fops = &my_proc_ops;
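The start/next/show/stop protocol can be walked through in user space. The sketch below is a toy model only: struct toy_seq, the toy_ function names, and the fixed item table are assumptions, and toy_seq_run() merely imitates, in simplified form, the loop the seq_file core runs on behalf of a read().

```c
#include <stdio.h>
#include <stddef.h>

/* Data being exported; a driver would iterate over its own device list. */
static const char *const items[] = { "alpha", "beta", "gamma" };
#define N_ITEMS (sizeof(items) / sizeof(items[0]))

/* Hypothetical stand-in for struct seq_file: just an output buffer. */
struct toy_seq {
    char   buf[128];
    size_t len;
};

/* start(): turn a position into an iterator, or NULL when past the end. */
void *toy_start(struct toy_seq *s, long *pos)
{
    (void)s;
    return (*pos >= 0 && (size_t)*pos < N_ITEMS) ? (void *)&items[*pos] : NULL;
}

/* next(): advance the position and return the next iterator, or NULL. */
void *toy_next(struct toy_seq *s, void *v, long *pos)
{
    (void)s;
    (void)v;
    ++*pos;
    return ((size_t)*pos < N_ITEMS) ? (void *)&items[*pos] : NULL;
}

/* show(): format one item into the buffer; negative on overflow. */
int toy_show(struct toy_seq *s, void *v)
{
    const char *const *item = v;
    size_t room = sizeof(s->buf) - s->len;
    int n = snprintf(s->buf + s->len, room, "%s\n", *item);

    if (n < 0 || (size_t)n >= room)
        return -1;
    s->len += (size_t)n;
    return 0;
}

/* stop(): release anything start() took (nothing in this toy). */
void toy_stop(struct toy_seq *s, void *v)
{
    (void)s;
    (void)v;
}

/* Simplified version of the loop the core runs: start, show/next..., stop. */
int toy_seq_run(struct toy_seq *s)
{
    long pos = 0;
    int shown = 0;
    void *v = toy_start(s, &pos);

    while (v) {
        if (toy_show(s, v) < 0)
            break;
        shown++;
        v = toy_next(s, v, &pos);
    }
    toy_stop(s, v);
    return shown;
}
```

Because the position is the only state carried between calls, the core can stop when its buffer fills and later resume with start() at the saved position, which is what makes large /proc entries straightforward.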

3. Introduction to sysfs

- Need a way to organize the system layout for various kernel parts to use
- Existing device model used for:
  - Power management and system shutdown:
    - Devices shut down from leaves to trunk
  - Communications with user space:
    - Device layout visible through sysfs
    - Devices tunable through sysfs entries
  - Hotpluggable devices:
    - Sending info regarding newly plugged devices
  - Device classes:
    - Tell apps which types of peripherals are present
  - Object lifecycles:
    - Object relationships and refcounting
- Device model tree can be very complex (see /sys)
- Self-coherent model
- Driver writers need not worry too much about bookkeeping.
- Typically, subsystems take care of sysfs entries and kobject hierarchies.

4. Sysfs entry types

- Directories: ksets or kobjects
- Files: kobject attributes
  - Attribute types:
    - Default (kset-specific list of attributes)
    - Non-default (attributes specific to kobject)
    - Binary
- Symlinks: Relationships between kobjects

5. Sysfs layout

- /sys/devices:
  - Actual/real device tree starting from CPU
- /sys/bus:
  - Per-bus view of the system
  - Each bus, regardless of how it connects to other buses, has its own separate entry in /sys/bus.
  - Each bus entry has at least two subdirectories:
    - devices: contains one directory for each device connected to the bus.
    - drivers: contains one directory for each driver loaded for the bus.
- /sys/class:
  - Class view of the hardware in the system
  - Each entry is a different type of device
  - Each device type entry contains list of devices of the given type available in the system.
  - Devices are grouped by type, regardless of the type of bus they are connected to.
- /sys/block:
  - Special directory for the "block" class
- /sys/firmware:
  - Information exported by the system firmware
- /sys/kernel:
  - Information exported by the kernel
- /sys/module:
  - Module view
  - Contains one entry for each module loaded in the system
  - Each entry contains a refcount and a directory containing the sections exported by the module which could be of interest to user space.
- /sys/power:
  - Information pertaining to power management.

6. Kobjects, ksets, and subsystems

Introduction:
- kobject is smallest element of device model
- struct kobject: <linux/kobject.h>
  - Refcounting
  - Sysfs visibility (though not all kobjects displayed in sysfs)
  - C structure glue
  - Hotplug event handling

Kobject basics:
- Embedding kobjects:
  - kobjects always tied to something else
  - Almost never exist independently
  - Typically part of another struct {}
  - Use container_of() to retrieve parent struct
  - cdev example:
    struct cdev *device = container_of(kp, struct cdev, kobj);
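The container_of() trick is plain pointer arithmetic and can be tried in user space. The sketch below assumes a simplified macro built on offsetof() (the kernel's version adds a type check), and the demo_ structures are hypothetical stand-ins for struct kobject and struct cdev.

```c
#include <stddef.h>   /* offsetof */

/* Simplified container_of: step back from a member to its enclosing struct. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct kobject. */
struct demo_kobject {
    int refcount;
};

/* Hypothetical device structure embedding a kobject-like member mid-struct. */
struct demo_cdev {
    int id;
    struct demo_kobject kobj;
    int count;
};

/* Given only the embedded member, recover the structure that contains it. */
struct demo_cdev *demo_to_cdev(struct demo_kobject *kp)
{
    return demo_container_of(kp, struct demo_cdev, kobj);
}
```

This is why generic code can hand a driver nothing but a kobject pointer: the driver knows which structure it embedded the kobject in and can always get back to it.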

struct kobject:

struct kobject {
    char *k_name;
    char name[KOBJ_NAME_LEN];
    struct kref kref;
    struct list_head entry;
    struct kobject *parent;
    struct kset *kset;
    struct kobj_type *ktype;
    struct dentry *dentry;
};

kobject initialization:

1. Must memset() kobject instance to 0. Otherwise serious crashes.
2. Initialize object and set refcount to 1:
   void kobject_init(struct kobject *kobj);
3. Set object's name:
   int kobject_set_name(struct kobject *kobj, const char *format, ...);
   - Like printf
   - Check retval as function may fail

Reference count manipulation:

struct kobject *kobject_get(struct kobject *kobj);
- Increment refcount
- retval is kobject ptr if success
- retval is NULL if object is being destroyed

void kobject_put(struct kobject *kobj);
- Decrement refcount
- Call at least once to match kobject_init()

Release functions and kobject types:

- What happens when refcount goes to 0
- Asynchronous event (not necessarily predictable)
- Must provide "release" method for every kobject
- Release method associated with struct kobj_type, not kobject:

struct kobj_type {
    void (*release)(struct kobject *);
    struct sysfs_ops *sysfs_ops;
    struct attribute **default_attrs;
};

- Every kobject has one kobj_type => ktype
- If kobject not part of kset, ktype will contain actual release method.
- If kobject part of kset, kset will provide the struct kobj_type with the release method.
- Get proper kobj_type:

struct kobj_type *get_ktype(struct kobject *kobj);
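The interplay between get, put, and the release method can be modelled in a few lines. This is a user-space sketch with hypothetical demo_ names: it uses a plain int counter (the kernel uses an atomic kref) and does not model the "object is being destroyed" case of kobject_get().

```c
#include <stddef.h>
#include <stdlib.h>

struct demo_kobj;

/* Stand-in for struct kobj_type: the release method lives here. */
struct demo_ktype {
    void (*release)(struct demo_kobj *);
};

/* Stand-in for struct kobject. */
struct demo_kobj {
    int refcount;                    /* not atomic: single-threaded sketch */
    const struct demo_ktype *ktype;
};

/* Like kobject_init(): the object starts life with one reference. */
void demo_kobj_init(struct demo_kobj *k, const struct demo_ktype *t)
{
    k->refcount = 1;
    k->ktype = t;
}

struct demo_kobj *demo_kobj_get(struct demo_kobj *k)
{
    if (k)
        k->refcount++;
    return k;
}

/* Dropping the last reference triggers the type's release method. */
void demo_kobj_put(struct demo_kobj *k)
{
    if (k && --k->refcount == 0 && k->ktype && k->ktype->release)
        k->ktype->release(k);
}

/* Hypothetical device embedding the refcounted object. */
struct demo_device {
    int id;
    struct demo_kobj kobj;
};

int demo_released;   /* counts release calls, for observation only */

static void demo_device_release(struct demo_kobj *k)
{
    struct demo_device *dev =
        (struct demo_device *)((char *)k - offsetof(struct demo_device, kobj));

    free(dev);       /* the enclosing structure is freed here, not by the caller */
    demo_released++;
}

const struct demo_ktype demo_device_ktype = { .release = demo_device_release };

struct demo_device *demo_device_create(int id)
{
    struct demo_device *dev = malloc(sizeof(*dev));

    if (!dev)
        return NULL;
    dev->id = id;
    demo_kobj_init(&dev->kobj, &demo_device_ktype);
    return dev;
}
```

The point to take away is that the creator never calls free() directly: it drops its initial reference with put, and the memory goes away only when the last holder, whoever that is, does the same.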

Kobject hierarchies, ksets, and subsystems:

Basics:
- Need to tie kobjects together according to subsystem structure
- Possible ties:
  - "parent" ptr in kobject: dictates sysfs layout
  - ksets

Ksets:
- Object container/aggregation
- struct kset: <linux/kobject.h>

struct kset {
    struct subsystem *subsys;
    struct kobj_type *ktype;
    struct list_head list;
    spinlock_t list_lock;
    struct kobject kobj;
    struct kset_hotplug_ops *hotplug_ops;
};

- Is a kobject itself (contains kobject in struct)
- Each kset has separate sysfs directory
- Adding kobject to kset:
  1. Set kobject's kset field to appropriate struct kset
  2. int kobject_add(struct kobject *kobj);
     - Check retval for error
     - Object refcount incremented
- Helper function for kobject_init() and kobject_add():
  int kobject_register(struct kobject *kobj);
- Removing kobject from kset:
  void kobject_del(struct kobject *kobj);
- Helper function for kobject_del() and kobject_put():
  void kobject_unregister(struct kobject *kobj);
- Relationship summary:
  - kset maintains linked list of embedded kobjects
  - "parent" entry in kobject points to kset->kobject

Operations on ksets:
- Basic kset manipulation:
  void kset_init(struct kset *kset);
  int kset_add(struct kset *kset);
  int kset_register(struct kset *kset);
  void kset_unregister(struct kset *kset);
- kset refcounting:
  struct kset *kset_get(struct kset *kset);
  void kset_put(struct kset *kset);
- kset's name is in its kobject struct entry
- ksets have pointer to "struct kobj_type"
- kobjects embedded in kset use the set's "struct kobj_type" instead of having their own.
- kset contains pointer to subsystem
- kset must belong to subsystem

Subsystems:
- High-level kernel abstractions
- Typically, each subsystem has top-level sysfs entry
- Should almost never have to create own subsystem
- Subsystem can contain multiple ksets

struct subsystem {
    struct kset kset;
    struct rw_semaphore rwsem;
};

- Subsystem rwsem used to serialize access to kset's kobject list
- Subsystem declaration macro:
  decl_subsys(name, struct kobj_type *type, struct kset_hotplug_ops *hotplug_ops);
  - Creates a struct subsystem
  - Initializes kset with "type"
  - Actual subsystem name is aggregate of name and "_subsys"
- Subsystem helper functions:
  void subsystem_init(struct subsystem *subsys);
  int subsystem_register(struct subsystem *subsys);
  void subsystem_unregister(struct subsystem *subsys);
  struct subsystem *subsys_get(struct subsystem *subsys);
  void subsys_put(struct subsystem *subsys);

7. Low-level sysfs operations

Basics:
- Sysfs is virtual filesystem built on top of kobjects
- "Attributes" exported by kobjects appear in sysfs as files.
- Kobjects appear in sysfs when kobject_add() is called
- Sysfs entry creation:
  - kobjects appear in sysfs as directories with attributes
  - kobject directory name is assigned using kobject_set_name()
  - Directory hierarchy matches kobject/kset/subsystem hierarchy.

Default attributes:
- Reminder:

struct kobj_type {
    void (*release)(struct kobject *);
    struct sysfs_ops *sysfs_ops;
    struct attribute **default_attrs;
};

- "default_attrs" is array of sysfs attribute pointers
- struct attribute:

struct attribute {
    char *name;
    struct module *owner;
    mode_t mode;
};

  - "name": name of attribute as seen in sysfs
  - "owner": owner module
  - "mode": file access mode in sysfs
- struct sysfs_ops:

struct sysfs_ops {
    ssize_t (*show)(struct kobject *kobj, struct attribute *attr, char *buffer);
    ssize_t (*store)(struct kobject *kobj, struct attribute *attr, const char *buffer, size_t size);
};

Attributes implemented by sysfs_ops:

- read() on sysfs attribute generates call to show()
  - Put data in buffer (size is PAGE_SIZE)
  - Return amount of data written
  - Convention: each attribute is one human-readable value
  - Split multiple information pieces across multiple attributes
- write() on sysfs attribute generates call to store()
  - Read data from buffer (size max is PAGE_SIZE)
  - Return amount of data decoded
  - Negative retval means error
  - Data is from user space and must be validated
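The show()/store() contract described above can be sketched in user space. The functions below drop the kobject and attribute parameters of the real sysfs_ops callbacks, and the names (demo_level, DEMO_PAGE_SIZE) and the accepted 0..7 range are assumptions chosen for illustration. What matters is the shape: show() emits one human-readable value and returns its length, store() validates untrusted input and returns either the amount consumed or a negative errno.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

#define DEMO_PAGE_SIZE 4096   /* stand-in for the kernel's PAGE_SIZE */

static int demo_level = 3;    /* the single value this "attribute" exports */

/* show(): write one value, newline-terminated, and return its length. */
ssize_t demo_level_show(char *buffer)
{
    return snprintf(buffer, DEMO_PAGE_SIZE, "%d\n", demo_level);
}

/* store(): parse and validate; return bytes consumed or a negative errno. */
ssize_t demo_level_store(const char *buffer, size_t size)
{
    char tmp[16];
    char *end;
    long val;

    if (size == 0 || size >= sizeof(tmp))
        return -EINVAL;
    memcpy(tmp, buffer, size);   /* input is not guaranteed NUL-terminated */
    tmp[size] = '\0';

    errno = 0;
    val = strtol(tmp, &end, 10);
    if (end == tmp || errno == ERANGE)
        return -EINVAL;          /* no digits, or out of range for long */
    if (*end == '\n')
        end++;                   /* tolerate the newline "echo" appends */
    if (*end != '\0')
        return -EINVAL;          /* trailing garbage */
    if (val < 0 || val > 7)
        return -EINVAL;          /* outside the range this attribute accepts */

    demo_level = (int)val;
    return (ssize_t)size;
}
```

Rejecting bad input before touching demo_level keeps the exported value consistent: a failed write leaves the previous setting in place.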

Non-default attributes:

- Default attributes usually enough
- Adding attribute to kobject:
  int sysfs_create_file(struct kobject *kobj, struct attribute *attr);
  - retval is 0 on success
  - retval is < 0 on error
  - Must make sure show() and store() functions know how to deal with attribute
- Removing attribute:
  int sysfs_remove_file(struct kobject *kobj, struct attribute *attr);
  - User-space app may still have ref on kobj => show/store may yet get called

Binary attributes:

- Sometimes need more than just human-readable entries.
- Ex.: loading/unloading firmware
- struct bin_attribute:

struct bin_attribute {
    struct attribute attr;
    size_t size;
    ssize_t (*read)(struct kobject *kobj, char *buffer, loff_t pos, size_t size);
    ssize_t (*write)(struct kobject *kobj, char *buffer, loff_t pos, size_t size);
};

  - "attr": attribute as defined earlier
  - "size": maximum size of attribute (0 for no max)
  - "read"/"write": char dev-like callbacks, one page max per call.
  - End of file should be determined by read/write callbacks, no way for sysfs to notify.
- Creation/destruction:

int sysfs_create_bin_file(struct kobject *kobj, struct bin_attribute *attr);
int sysfs_remove_bin_file(struct kobject *kobj, struct bin_attribute *attr);

Symbolic links:

- Basic kobject relationships do not, by themselves, provide full picture of how things are tied together in the kernel.
- Sometimes need to create symbolic links between kobjects to show relationships.
- Create/destroy symbolic links:

int sysfs_create_link(struct kobject *kobj, struct kobject *target, char *name);
void sysfs_remove_link(struct kobject *kobj, char *name);

- Code creating symlink should be able to tie into object linked to, in order to remove link when remote object ceases to exist.
- Example of object tying needing symbolic links:
  - Without symlinks, there is no way to link a driver to the device it controls.
  - /sys/bus contains entries for each bus. Each of these entries contains at least 2 entries: "devices" and "drivers":
    - The entries in the "devices" directories are symlinks to the actual devices, which are themselves stored in /sys/devices.
    - There is one entry in the "drivers" directories for each driver in the system. Those driver entries contain a symlink back to /sys/devices when the driver is active.
  - /sys/devices entries themselves contain symbolic links back to the pertinent /sys/bus entries.

8. Hotplug event generation

Basics:
- Method for kernel to notify user space that a new device has appeared in system (i.e. a new kobject has been added).
- Events are generated when a kobject is added or removed:
  - kobject_add()
  - kobject_del()
- Notifications generate invocation of user-space /sbin/hotplug.
- /sbin/hotplug may do a number of things, including:
  - Loading a driver
  - Creating a device node
  - Mounting partitions
- To help user-space hotplug do its job, kobjects (or rather ksets) can provide further information using the proper abstractions.
- In-kernel hotplug event handling done using struct kset_hotplug_ops.
- Within struct kset, there is a hotplug_ops of type struct kset_hotplug_ops.

Hotplug operations:

For a given kobject, kernel traverses kobject parenting until it finds one that has a kset parent with hotplug_ops.

struct kset_hotplug_ops:

struct kset_hotplug_ops {
    int (*filter)(struct kset *kset, struct kobject *kobj);
    char *(*name)(struct kset *kset, struct kobject *kobj);
    int (*hotplug)(struct kset *kset, struct kobject *kobj, char **envp, int num_envp, char *buffer, int buffer_size);
};

"filter": Called when kernel intends to generate event for "kobj". Allows kset to decide whether an event should indeed be generated. If retval is 0, no event generated.
"name": The subsystem name passed to the userspace hotplug. This is the only parameter passed to /sbin/hotplug.
"hotplug": This is how /sbin/hotplug knows the rest of the story of what it needs to do. This function allows kset code to set up environment variables for /sbin/hotplug to use.
    "envp": Array of environment variables (NULL-terminated)
    "num_envp": Number of environment variables
    "buffer": Where environment variables are stored
    retval should be 0
    Nonzero retval will abort event
Hotplug is usually handled by bus driver

9. Buses, devices, and drivers

Buses:

Basics:

All devices are plugged into buses
"Integrated" devices are part of the "platform" bus
Buses can be connected to other buses (USB controller on PCI bus).

struct bus_type: <linux/device.h>

struct bus_type {
    /* Most important fields */
    char *name;
    struct subsystem subsys;
    struct kset drivers;
    struct kset devices;
    int (*match)(struct device *dev, struct device_driver *drv);
    struct device *(*add)(struct device *parent, char *bus_id);
    int (*hotplug)(struct device *dev, char **envp, int num_envp, char *buffer, int buffer_size);
};

"name": Bus name, like "pci" or "usb"
Each bus is a subsystem of its own
All bus subsystems are part of the "bus" subsystem (/sys/bus)

Bus registration:

Prefill struct bus_type
    Must fill in "name", "match" and "hotplug"
Register:
    int bus_register(struct bus_type *bus);
    Must check retval
    On success, can see in /sys/bus
Deregister:
    void bus_unregister(struct bus_type *bus);

Bus methods:

match:
    Called when device or driver added to bus
    Must determine if given device can be handled by given driver
    Return nonzero if device can be handled
hotplug:
    Set environment variables for /sbin/hotplug
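The two bus methods can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the struct and function names (model_device, model_driver, model_bus_match, model_bus_hotplug) and the MODELBUS_ID variable are hypothetical, the matching rule (driver name is a prefix of the device's bus_id) is just one possible policy, and the real methods operate on struct device and struct device_driver inside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Minimal stand-ins for the kernel structures; only the fields used here. */
struct model_device { char bus_id[20]; };
struct model_driver { const char *name; };

/*
 * Model of a bus match() method: nonzero when the device's bus_id
 * starts with the driver's name, 0 otherwise.
 */
static int model_bus_match(const struct model_device *dev,
                           const struct model_driver *drv)
{
    return strncmp(dev->bus_id, drv->name, strlen(drv->name)) == 0;
}

/*
 * Model of a bus hotplug() method: store one "NAME=value" string in
 * "buffer", point envp[0] at it and NULL-terminate the array.
 * Returns 0 on success or -ENOMEM if the variable does not fit.
 */
static int model_bus_hotplug(const struct model_device *dev, char **envp,
                             int num_envp, char *buffer, int buffer_size)
{
    int len;

    if (num_envp < 2 || buffer_size <= 0)
        return -ENOMEM;              /* need one slot plus the NULL */
    len = snprintf(buffer, (size_t)buffer_size,
                   "MODELBUS_ID=%s", dev->bus_id);
    if (len < 0 || len >= buffer_size)
        return -ENOMEM;              /* output was truncated */
    envp[0] = buffer;
    envp[1] = NULL;
    return 0;
}
```

Returning an error from the hotplug model mirrors the rule that a nonzero return value aborts the event.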

Iterating over devices and drivers:

For when an operation must be carried out on all devices or drivers registered with a bus.
int bus_for_each_dev(struct bus_type *bus, struct device *start, void *data, int (*fn)(struct device *, void *));
    "start": first device to start from. If NULL, start from first dev on bus.
    "fn": function to call with dev ptr and "data". If retval is nonzero, iteration stops and bus_for_each_dev() returns retval.
int bus_for_each_drv(struct bus_type *bus, struct device_driver *start, void *data, int (*fn)(struct device_driver *, void *));
    Same as bus_for_each_dev()
Subsystem rwlock is held while these functions run: be careful

Bus attributes:

struct bus_attribute: <linux/device.h>

struct bus_attribute {
    struct attribute attr;
    ssize_t (*show)(struct bus_type *bus, char *buf);
    ssize_t (*store)(struct bus_type *bus, const char *buf, size_t count);
};

struct attribute already described
show/store already explained as part of struct sysfs_ops
Statically creating bus_attribute structures:
    BUS_ATTR(name, mode, show, store);
    Actual name is "bus_attr_" concatenated with "name"
Attribute registration:
    int bus_create_file(struct bus_type *bus, struct bus_attribute *attr);
    void bus_remove_file(struct bus_type *bus, struct bus_attribute *attr);

Devices:

Basics:

Every device represented using struct device

struct device: <linux/device.h>

struct device {
    /* Most important fields */
    struct device *parent;
    struct kobject kobj;
    char bus_id[BUS_ID_SIZE];
    struct bus_type *bus;
    struct device_driver *driver;
    void *driver_data;
    void (*release)(struct device *dev);
};

"parent": Usually the bus controller. Is NULL if topmost device.
"kobj": kobject tied to this device
"bus_id": Unique device ID on bus
"bus": Bus device is attached to
"driver": Driver managing device
"driver_data": Private driver data
"release": Callback for when kobject refcount reaches zero

Device registration:

Minimum struct device fields that need to be set for registration to succeed: parent, bus_id, bus, and release.
int device_register(struct device *dev);
void device_unregister(struct device *dev);
Buses are devices and, as such, must be registered too
If parent is "NULL", successful bus registration results in entry in /sys/devices.

Device attributes:

struct device_attribute: <linux/device.h>

struct device_attribute {
    struct attribute attr;
    ssize_t (*show)(struct device *dev, char *buf);
    ssize_t (*store)(struct device *dev, const char *buf, size_t count);
};

Statically creating device attribute declarations:
    DEVICE_ATTR(name, mode, show, store);
    Actual name is "dev_attr_" concatenated with "name"
Attribute registration:
    int device_create_file(struct device *device, struct device_attribute *entry);
    void device_remove_file(struct device *device, struct device_attribute *attr);
struct bus_type contains a "dev_attrs" field, which is a list of default attributes for all devices on bus.

Device structure embedding:

struct device is often not enough to describe, by itself, the actual device.
Most subsystems embed struct device inside their own device structs.
    Ex: struct pci_dev and struct usb_device
Use container_of() when necessary
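Structure embedding and container_of() can be tried out in userspace. The sketch below re-creates the macro with offsetof() and uses stand-in types; model_device, model_pci_dev and to_model_pci_dev are hypothetical names chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define model_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct device; the real one has many more fields. */
struct model_device { int refcount; };

/* Hypothetical subsystem device that embeds the generic device. */
struct model_pci_dev {
    unsigned short vendor;
    struct model_device dev;     /* embedded, not a pointer */
};

/* Recover the enclosing subsystem structure from the embedded member. */
static struct model_pci_dev *to_model_pci_dev(struct model_device *d)
{
    return model_container_of(d, struct model_pci_dev, dev);
}
```

This is why callbacks that only receive a struct device pointer can still reach the subsystem-specific data surrounding it.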

Device drivers:

Basics:

Object model has device drivers in order to map device drivers to new devices.
As a side benefit, some driver config options can be controlled without reference to any specific device.

struct device_driver: <linux/device.h>

struct device_driver {
    /* Most important fields */
    char *name;
    struct bus_type *bus;
    struct kobject kobj;
    struct list_head devices;
    int (*probe)(struct device *dev);
    int (*remove)(struct device *dev);
    void (*shutdown)(struct device *dev);
};

"name": Driver name as seen in sysfs
"bus": Bus type for this driver
"kobj": kobject tied to this driver
"devices": List of devices currently serviced by driver
"probe": Check for the existence of given device
"remove": Invoked upon device removal from system
"shutdown": Shut down device

Registration/deregistration:

int driver_register(struct device_driver *drv);
void driver_unregister(struct device_driver *drv);

Driver attributes:

struct driver_attribute: <linux/device.h>

struct driver_attribute {
    struct attribute attr;
    ssize_t (*show)(struct device_driver *drv, char *buf);
    ssize_t (*store)(struct device_driver *drv, const char *buf, size_t count);
};

Statically creating driver attribute declarations:
    DRIVER_ATTR(name, mode, show, store);
Registering/deregistering attributes:
    int driver_create_file(struct device_driver *drv, struct driver_attribute *attr);
    void driver_remove_file(struct device_driver *drv, struct driver_attribute *attr);
struct bus_type contains a "drv_attrs" field, which is a list of default attributes for all drivers associated with bus.

Driver structure embedding:

Like device struct, driver struct often embedded in other subsystem-specific structs.

10. Classes

Basics:

High-level representation of what is being worked with, instead of how it's implemented.
Basically a class is an aggregate of all devices of a certain type.
See /sys/class for most classes found on system
/sys/block is the only class with its own topmost entry (historical).
Class ownership handled by subsystems, no need for driver to care.
Drivers should care about classes mainly for exporting data to userspace.
Interfaces exported by driver core for class manipulation:
    class_simple
    Full class interface

The class_simple interface:

Easy-to-use interface for exporting device's assigned ID.
struct class_simple *class_simple_create(struct module *owner, char *name);
    Create simple class
    Must test retval using IS_ERR()
void class_simple_destroy(struct class_simple *cs);
    Destroy class
struct class_device *class_simple_device_add(struct class_simple *cs, dev_t devnum, struct device *device, const char *fmt, ...);
    Add device to class
    "devnum": Number assigned to device
    "device": device's struct device
    "fmt" and ...: device's name
    Symlink created to relevant /sys/devices/ entry if device != NULL
void class_simple_device_remove(dev_t dev);
    Remove "dev" from class
int class_simple_set_hotplug(struct class_simple *cs, int (*hotplug)(struct class_device *dev, char **envp, int num_envp, char *buffer, int buffer_size));
    Set up a hotplug handler for a class

The full class interface:

Managing classes:

struct class:

struct class {
    /* Most important fields */
    char *name;
    struct subsystem subsys;
    struct list_head children;
    struct list_head interfaces;
    struct class_attribute *class_attrs;
    struct class_device_attribute *class_dev_attrs;
    int (*hotplug)(struct class_device *dev, char **envp, int num_envp, char *buffer, int buffer_size);
    void (*release)(struct class_device *dev);
    void (*class_release)(struct class *class);
};

"release": A device is released from class
"class_release": Class is released

int class_register(struct class *cls);
void class_unregister(struct class *cls);

struct class_attribute:

struct class_attribute {
    struct attribute attr;
    ssize_t (*show)(struct class *cls, char *buf);
    ssize_t (*store)(struct class *cls, const char *buf, size_t count);
};

CLASS_ATTR(name, mode, show, store);
int class_create_file(struct class *cls, const struct class_attribute *attr);
void class_remove_file(struct class *cls, const struct class_attribute *attr);

Class devices:

Classes are device containers

struct class_device:

struct class_device {
    /* Most important fields */
    struct kobject kobj;
    struct class *class;
    struct device *dev;
    void *class_data;
    char class_id[BUS_ID_SIZE];
};

"dev": If set, symlink created to corresponding /sys/devices entry.

int class_device_register(struct class_device *cd);
void class_device_unregister(struct class_device *cd);
int class_device_rename(struct class_device *cd, char *new_name);

struct class_device_attribute:

struct class_device_attribute {
    struct attribute attr;
    ssize_t (*show)(struct class_device *cls, char *buf);
    ssize_t (*store)(struct class_device *cls, const char *buf, size_t count);
};

CLASS_DEVICE_ATTR(name, mode, show, store);
int class_device_create_file(struct class_device *cls, const struct class_device_attribute *attr);
void class_device_remove_file(struct class_device *cls, const struct class_device_attribute *attr);

Class interfaces:

Knowing when devices enter and leave a class

struct class_interface:

struct class_interface {
    struct class *class;
    int (*add)(struct class_device *cd);
    void (*remove)(struct class_device *cd);
};

int class_interface_register(struct class_interface *intf);
void class_interface_unregister(struct class_interface *intf);

11. Putting it all together

Looking at how PCI subsystem interacts with object model.
See figure 14-3, LDD p. 392

Add a device:

PCI bus is declared using:

struct bus_type pci_bus_type = {
    .name = "pci",
    .match = pci_bus_match,
    .hotplug = pci_hotplug,
    .suspend = pci_device_suspend,
    .resume = pci_device_resume,
    .dev_attrs = pci_dev_attrs,
};

pci_bus_type registered using bus_register() at startup
Entries created in /sys/bus/pci: devices and drivers
PCI drivers are struct pci_driver, which contains a struct device_driver.
When PCI driver registered, struct device_driver is initialized by PCI code.
PCI ktype is set to pci_driver_kobj_type
Driver registered using driver_register()
When a device is found on the bus, new struct pci_dev created. pci_dev contains a struct device entry.
After struct pci_dev is initialized, device registered with device_register().
Device added to list of devices in pci_bus_type
Code then walks list of drivers to find a "match" for device.
When match found (see earlier explanation), driver's probe() function is invoked to see if driver will accept responsibility for device.
Once driver acks, driver and device are tied together and the necessary symlinks are created in sysfs.
Hotplug event generated

Remove a device:

Removal done through pci_remove_bus_device(), which calls device_unregister().
device_unregister():
    Unlinks sysfs entry
    Removes device from internal list of devices
    Calls kobject_del() with device kobject
Removal of kobject results in hotplug call
If last refcount, pci_release_dev() invoked

Add a driver:

pci_register_driver():
    Initializes struct device_driver in struct pci_driver
    Calls driver_register()
        Follow earlier description
Match and probe called to match driver with device

Remove a driver:

pci_unregister_driver():
    Calls driver_unregister()
driver_unregister() results in call to release() for each device.
Code then waits for all references to driver to be freed before allowing driver_unregister() to return and, most likely, allow module unloading.

12. Hotplug

Dynamic devices:

Drivers must be able to deal with hardware being plugged and unplugged.
Even CPUs and memory hotplug need to be supported.
Actual actions to be undertaken depend on bus type
See earlier discussion of USB urbs

The /sbin/hotplug utility:

Basics:
    Called by kernel upon hotplug event
    Small shell script
    Script invokes scripts from /etc/hotplug.d/
    See "man hotplug" for more details
    Only subsystem name provided to hotplug script
    Actual event details passed using environment variables
Default environment variables passed to hotplug:
    ACTION: "add" or "remove" string
    DEVPATH: Path within sysfs pointing to designated kobject
    SEQNUM: Each hotplug event has unique sequence number during system lifetime.
    SUBSYSTEM: Same string passed on hotplug command line

Subsystem-specific environment variables (see LDD3 for full detail):
    IEEE1394 (FireWire): SUBSYSTEM="ieee1394"
        VENDOR_ID
        MODEL_ID
        GUID
        SPECIFIER_ID
        VERSION
    Networking: SUBSYSTEM="net"
        INTERFACE
    PCI: SUBSYSTEM="pci"
        PCI_CLASS
        PCI_ID
        PCI_SUBSYS_ID
        PCI_SLOT_NAME
    Input: SUBSYSTEM="input"
        PRODUCT
        NAME
        PHYS
        EV, KEY, REL, ABS, MSC, LED, SND, FF
    USB: SUBSYSTEM="usb"
        PRODUCT
        TYPE
        INTERFACE
        DEVICE
    SCSI: SUBSYSTEM="scsi"
        No specific environment variables
        There is a SCSI-specific script invoked in userspace
    Laptop docking stations: SUBSYSTEM="dock"
    S/390 and zSeries: SUBSYSTEM="dasd"

Using /sbin/hotplug:

Linux hotplug scripts:
    Try to find driver matching added device
    Use of maps generated via MODULE_DEVICE_TABLE macros.
        /lib/modules/KERNEL_VERSION/modules.*map
        Maps for PCI, USB, IEEE1394, INPUT, ISAPNP and CCW.
    Continue loading all relevant modules found, kernel decides best match.
    On shutdown, scripts do not remove driver since other devices may have been put under its responsibility since first load.
    May change in the future as modprobe can read tables generated using MODULE_DEVICE_TABLE without need for modules.*map.
udev:
    Allow automatic/easy creation of /dev entries for devices at runtime.
    Many previous attempts at fixing problem had failed (devfs).
    Problem with preserving consistent name inside /dev although device might not be connected to system in the same way (USB hd for example.)
    Notified by /sbin/hotplug of new device addition.
    For proper use by driver, make sure major and minor number attributed to device are exported through sysfs.
    Drivers using existing subsystems need not manually export major and minor numbers through sysfs, as the parent subsystem most likely does that automatically already.
    udev looks for "dev" entry in relevant "/sys/class" path to get major and minor number.
    class_simple is easy way to export "dev"
    "dev" format:
        <major>:<minor>
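As an illustration of the "dev" format, here is a small userspace sketch of how a tool could parse the text read from such a sysfs entry. The helper name model_parse_dev is hypothetical, and this is not a description of how udev itself is implemented.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the contents of a sysfs "dev" entry, "<major>:<minor>".
 * Returns 0 on success and -1 if the text does not have that shape.
 */
static int model_parse_dev(const char *text,
                           unsigned int *maj, unsigned int *mnr)
{
    if (sscanf(text, "%u:%u", maj, mnr) != 2)
        return -1;
    return 0;
}
```

A trailing newline, as typically found when reading a sysfs file, is ignored by the parse.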

13. Copy to/from user

<asm/uaccess.h>
Functions use architecture-independent "magic" => exception tables.
unsigned long copy_to_user(void __user *to, const void *from, unsigned long count);
unsigned long copy_from_user(void *to, const void __user *from, unsigned long count);
Similar to memcpy(), but...
Can sleep. Code calling should be:
    Reentrant
    Capable of executing concurrently with other parts of driver.
    In a situation where it can sleep
Retval => amount of memory still to be copied
If access error, retval != 0

14. Dealing with firmware

Basics:

Some devices need to be loaded with proprietary, closed-source, binary firmware prior to being functional.
Hardcoding firmware as hex string in driver is likely GPL violation.
Instead, firmware should be loaded from file in userspace.

The kernel firmware interface:

Do not open firmware file and dump to device
Use appropriate kernel function instead:
    <linux/firmware.h>
    int request_firmware(const struct firmware **fw, char *name, struct device *device);
        "name": filename
        "fw": struct populated by kernel containing pointer to firmware and size of firmware.
        As everything else from userspace, this firmware must be verified for security reasons.
        Function *will* sleep
Having sent firmware to device, it can be released:
    void release_firmware(struct firmware *fw);
If can't sleep waiting on firmware:
    int request_firmware_nowait(struct module *module, char *name, struct device *device, void *context, void (*cont)(const struct firmware *fw, void *context));
        "module": THIS_MODULE
        "context": private data
        "cont": callback invoked when firmware is available

How it works:

See LDD3, p. 407
Use of FIRMWARE environment variable to /sbin/hotplug.

Locking mechanisms
1. Concurrency and its management 2. Semaphores and mutexes 3. Completions 4. Spinlocks 5. Locking traps 6. Alternatives to locking 7. Summary

1. Concurrency and its management

Avoid shared resources when possible (ex. global variables).
Must "manage" concurrent access whenever resources are shared to guarantee atomicity.
Use locks or similar mechanisms to implement "critical sections".
No code instance should use "object" until it is properly initialized for all.
Must keep track of "object" instances to free when appropriate.
Linux provides many different mechanisms for implementing critical sections, depending on the situation.

2. Semaphores and mutexes

P, V, and semaphore int
Lock with P:
    if semaphore > 0, decrement and proceed
    if semaphore <= 0, wait until semaphore is released
Unlock with V:
    increment semaphore
    wake up waiting processes, if any
When semaphore initially set to "1" => Mutex

The Linux semaphore implementation:

struct semaphore: <asm/semaphore.h>
Basic initialization:
    void sema_init(struct semaphore *sem, int val);
Static declare and init mutex:
    DECLARE_MUTEX(name);
    DECLARE_MUTEX_LOCKED(name);
Dynamic init mutex:
    void init_MUTEX(struct semaphore *sem);
    void init_MUTEX_LOCKED(struct semaphore *sem);
Typically, use init_MUTEX* prior to doing device registration.
In Linux:
    P is "down"
    V is "up"
Versions of "down":
    void down(struct semaphore *sem);
        Decrement and wait as long as necessary
    int down_interruptible(struct semaphore *sem);
        Decrement and wait, but allow userspace process to continue receiving signals. Must test return value for interruption.
        If interrupted:
            Return -ERESTARTSYS if no effect, kernel will restart call
            Return -EINTR if failure to complete operation
    int down_trylock(struct semaphore *sem);
        Never wait, try and fail if unable to lock. Must test retval.
Only one "up":
    void up(struct semaphore *sem);
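The counting behind P and V can be modeled in a few lines of userspace C. This is a single-threaded sketch for illustration only: the model_ names are hypothetical, nothing here is atomic or able to block, and the return convention (0 when the semaphore was taken, nonzero otherwise) is assumed to mirror down_trylock().

```c
#include <assert.h>

/* Single-threaded model of the counter behind a semaphore. */
struct model_sem { int count; };

static void model_sema_init(struct model_sem *sem, int val)
{
    sem->count = val;
}

/* Like down_trylock(): never waits; returns 0 if taken, nonzero if not. */
static int model_down_trylock(struct model_sem *sem)
{
    if (sem->count > 0) {
        sem->count--;
        return 0;
    }
    return 1;
}

/* Like up(): releases the semaphore (a real one also wakes a waiter). */
static void model_up(struct model_sem *sem)
{
    sem->count++;
}
```

Initializing the count to 1 gives the mutex behavior: a second attempt to take it fails until the holder calls the "up" operation.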

Reader/writer semaphores:

Allow multiple readers, but only one writer
Writer has priority, once one writer asks for semaphore, all readers will wait.
Not typical of drivers
struct rw_semaphore: <linux/rwsem.h>
Initialization:
    void init_rwsem(struct rw_semaphore *sem);
Read-only access:
    void down_read(struct rw_semaphore *sem);
        May put task in state TASK_UNINTERRUPTIBLE
    int down_read_trylock(struct rw_semaphore *sem);
        Nonzero if access granted. Zero otherwise
    void up_read(struct rw_semaphore *sem);
Write access:
    void down_write(struct rw_semaphore *sem);
    int down_write_trylock(struct rw_semaphore *sem);
    void up_write(struct rw_semaphore *sem);
    void downgrade_write(struct rw_semaphore *sem);
        Change a write lock to a read lock.

3. Completions

"fork" off task and wait for it to complete
struct completion: <linux/completion.h>
Static declaration:
    DECLARE_COMPLETION(my_completion);
Dynamic initialization:
    void init_completion(&my_completion);
Waiting on completion (uninterruptible wait):
    void wait_for_completion(struct completion *c);
Completion:
    void complete(struct completion *c);
        Wake up just one waiting thread
    void complete_all(struct completion *c);
        Wake up all waiting threads
Reinitializing for reuse after complete_all():
    INITIALIZE_COMPLETION(struct completion c);
Completion in kernel thread:
    void complete_and_exit(struct completion *c, long retval);
    In the case, for example, where a thread is started in a module and must be killed when module is unloaded.

4. Spinlocks

Introduction to spinlocks:

Most often used mechanism in the kernel
Unlike semaphores, can be used in code that cannot sleep.
Usually better performance than semaphores.
Typically meant for protecting concurrent access on SMP systems.
Typically defaults to nothing on UP (except for IRQ spinlocks).
Either locked or unlocked
If lock already taken, spin in tight loop waiting for resource.

Introduction to the spinlock API:

spinlock_t <linux/spinlock.h>
Static initialization:
    spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
Dynamic initialization:
    void spin_lock_init(spinlock_t *lock);
Enter critical section:
    void spin_lock(spinlock_t *lock);
Leave critical section:
    void spin_unlock(spinlock_t *lock);
Many more functions

Spinlocks and atomic context:

Never create situation where control may be lost while holding a spinlock, otherwise => deadlock.
Code using spinlocks should be atomic
Carefully examine which kernel services you call while holding a spinlock (copy_to/from_user, kmalloc, etc. may sleep.)
Use special locks if atomic section could be interrupted by an interrupt serviced by a routine requiring that same lock.
Hold for as short a time as possible

The spinlock functions:

Locking:
    void spin_lock(spinlock_t *lock);
    void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
        Disables interrupts on local CPU and stores previous interrupt state in flags.
        Takes spinlock (may spin in wait for lock.)
    void spin_lock_irq(spinlock_t *lock);
        Disables interrupts on local CPU without recording flags
        Useful if sure no other code has already modified int flag
        Takes spinlock
    void spin_lock_bh(spinlock_t *lock);
        Leaves hardware interrupts enabled
        Disables software interrupts
3 levels of "priorities" with spinlocks:
    3 User => spin_lock()
    2 Software interrupt => spin_lock_bh()
    1 Interrupt => spin_lock_irq*()
Type of lock to use depends on what is the highest priority in which critical section is accessed.
Unlocking:
    void spin_unlock(spinlock_t *lock);
    void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
        flags are the same as those passed to spin_lock_irqsave
        Must be called within same function as spin_lock_irqsave, or will break on some architectures.
    void spin_unlock_irq(spinlock_t *lock);
    void spin_unlock_bh(spinlock_t *lock);
Try locking:
    int spin_trylock(spinlock_t *lock);
    int spin_trylock_bh(spinlock_t *lock);
    None for interrupt locking
    Return nonzero on success, and zero otherwise.

Reader/writer spinlocks:

Multiple readers in critical section, only one writer
rwlock_t: <linux/spinlock.h>
Static initialization:
    rwlock_t my_rwlock = RW_LOCK_UNLOCKED;
Dynamic initialization:
    void rwlock_init(rwlock_t *lock);
For readers:
    void read_lock(rwlock_t *lock);
    void read_lock_irqsave(rwlock_t *lock, unsigned long flags);
    void read_lock_irq(rwlock_t *lock);
    void read_lock_bh(rwlock_t *lock);
    void read_unlock(rwlock_t *lock);
    void read_unlock_irqrestore(rwlock_t *lock, unsigned long flags);
    void read_unlock_irq(rwlock_t *lock);
    void read_unlock_bh(rwlock_t *lock);
For writers:
    void write_lock(rwlock_t *lock);
    void write_lock_irqsave(rwlock_t *lock, unsigned long flags);
    void write_lock_irq(rwlock_t *lock);
    void write_lock_bh(rwlock_t *lock);
    int write_trylock(rwlock_t *lock);
    void write_unlock(rwlock_t *lock);
    void write_unlock_irqrestore(rwlock_t *lock, unsigned long flags);
    void write_unlock_irq(rwlock_t *lock);
    void write_unlock_bh(rwlock_t *lock);

5. Locking traps

Ambiguous rules:

Design for locks from the start
Do not call lock-grabbing functions if already holding a lock that would be acquired by the callee.
Properly document functions that expect to get called with locks held.

Lock ordering rules:

If multiple locks are required for implementing some critical sections, always acquire the locks in the same order.
Obtain local locks prior to obtaining global ones
Obtain semaphores prior to obtaining spinlocks
Avoid multiple locks whenever possible

Fine- versus coarse-grained locking:

Kernel used to have one big kernel lock (still persists for a very small number of operations.)
Most of the kernel's resources now have independent locks.
Usually, drivers should start with coarse locking
Avoid overdesigning
Use lockmeter to detect contention:
    http://oss.sgi.com/projects/lockmeter

6. Alternatives to locking

Lock-free algorithms:

Sometimes possible to modify algorithm to obtain lock-free conditions.
Ex.: circular buffer with one reader/one writer.
Generic circular buffer implementation starting with 2.6.10:
    <linux/kfifo.h>
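To show why a circular buffer with exactly one reader and one writer needs no lock, here is a userspace sketch using C11 atomics. It is not the kfifo implementation; the model_ring names and the fixed size are assumptions made for illustration. Each index is written by only one side, so the two sides never contend for the same variable.

```c
#include <assert.h>
#include <stdatomic.h>

#define MODEL_RING_SIZE 8u   /* must be a power of two */

struct model_ring {
    unsigned char data[MODEL_RING_SIZE];
    atomic_uint head;        /* written only by the producer */
    atomic_uint tail;        /* written only by the consumer */
};

static void model_ring_init(struct model_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns 1 if the byte was stored, 0 if the ring is full. */
static int model_ring_put(struct model_ring *r, unsigned char c)
{
    unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == MODEL_RING_SIZE)
        return 0;
    r->data[head & (MODEL_RING_SIZE - 1)] = c;
    /* Publish the slot only after the data is in place. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns 1 and fills *c, or 0 if the ring is empty. */
static int model_ring_get(struct model_ring *r, unsigned char *c)
{
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return 0;
    *c = r->data[tail & (MODEL_RING_SIZE - 1)];
    /* Hand the slot back to the producer after reading it. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

With more than one producer or more than one consumer this scheme no longer holds, and a lock or a different algorithm would be needed.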

Atomic variables:

Usually one integer used for counting
atomic_t: <asm/atomic.h>
atomic_t holds int on all archs
    Interrupt safe
    SMP safe
Cannot count on atomic_t to hold more than 24 bits because of arch details.
Cannot use atomic_t if multiple arithmetic operations required; would need additional locking.
Static initialization:
    atomic_t v = ATOMIC_INIT(0);
Dynamic initialization:
    void atomic_set(atomic_t *v, int i);
Reading:
    int atomic_read(atomic_t *v);
Arithmetic ops without retval:
    void atomic_add(int i, atomic_t *v);
    void atomic_sub(int i, atomic_t *v);
    void atomic_inc(atomic_t *v);
    void atomic_dec(atomic_t *v);
Arithmetic ops with retval:
    int atomic_add_return(int i, atomic_t *v);
    int atomic_sub_return(int i, atomic_t *v);
    int atomic_inc_return(atomic_t *v);
    int atomic_dec_return(atomic_t *v);
Arithmetic ops with test:
    int atomic_sub_and_test(int i, atomic_t *v);
        Retval is true if final val is zero, false otherwise
    int atomic_inc_and_test(atomic_t *v);
        Same
    int atomic_dec_and_test(atomic_t *v);
        Same
    int atomic_add_negative(int i, atomic_t *v);
        Retval is true if final val is < 0, false otherwise.
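A common use of the "decrement and test" operation is reference counting, where exactly one caller must notice that the count reached zero. The sketch below models this in userspace with C11 atomics; model_obj, model_get and model_put are hypothetical names, and model_dec_and_test only imitates the contract of atomic_dec_and_test().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical reference-counted object. */
struct model_obj {
    atomic_int refs;
    int released;            /* set once the last reference is gone */
};

/* Same contract as atomic_dec_and_test(): true if the result is zero. */
static int model_dec_and_test(atomic_int *v)
{
    /* atomic_fetch_sub() returns the value held before the subtraction. */
    return atomic_fetch_sub(v, 1) == 1;
}

static void model_get(struct model_obj *obj)
{
    atomic_fetch_add(&obj->refs, 1);
}

/* Drop a reference; exactly one caller sees the count reach zero. */
static void model_put(struct model_obj *obj)
{
    if (model_dec_and_test(&obj->refs))
        obj->released = 1;
}
```

Doing the decrement and the test as two separate steps would reintroduce the race the combined operation is meant to avoid.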

Bit operations:

Bitwise equivalent of atomic_t
Functions declared in <asm/bitops.h>
    Interrupt safe
    SMP safe
Fine for setting shared flags
Tricky to use on critical sections
Architecture-specific data typing:
    Bit to manipulate usually int, but can be unsigned long on some architectures.
    Address to be modified usually pointer to unsigned long, but can be void * on some architectures.
Basic ops:
    void set_bit(nr, void *addr);
    void clear_bit(nr, void *addr);
    void change_bit(nr, void *addr);
        Toggle
Non-atomic bit val retrieval:
    int test_bit(nr, void *addr);
Atomic test then modify:
    int test_and_set_bit(nr, void *addr);
    int test_and_clear_bit(nr, void *addr);
    int test_and_change_bit(nr, void *addr);
Entering critical section:
    while (test_and_set_bit(nr, addr) != 0) wait_a_little();
    If already set, this loop will wait, until the other bit of code already holding the lock does a test_and_clear_bit().
    If multiple threads competing, one of them will have its test_and_set_bit() succeed, and the others will continue looping.
Leaving critical section:
    if (test_and_clear_bit(nr, addr) == 0) *ERROR*;
    If this statement is positive (i.e. we go in the "if"), then something had already released the lock, which means there's a synchronization error somewhere in our code.
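The enter/leave pattern built on "test and set" can be tried in userspace with C11 atomics. In this sketch the model_ functions are hypothetical stand-ins that imitate the return convention of test_and_set_bit() and test_and_clear_bit() (previous value of the bit); they are not the kernel implementations.

```c
#include <assert.h>
#include <stdatomic.h>

/* Model of test_and_set_bit(): set bit "nr", return its previous value. */
static int model_test_and_set_bit(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_or(addr, mask) & mask) != 0;
}

/* Model of test_and_clear_bit(): clear bit "nr", return its previous value. */
static int model_test_and_clear_bit(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_and(addr, ~mask) & mask) != 0;
}

/* Try once to enter the critical section; 1 means the lock is ours. */
static int model_bit_trylock(unsigned int nr, atomic_ulong *addr)
{
    return model_test_and_set_bit(nr, addr) == 0;
}

/* Leave the critical section; 0 means the bit was already clear (a bug). */
static int model_bit_unlock(unsigned int nr, atomic_ulong *addr)
{
    return model_test_and_clear_bit(nr, addr);
}
```

A waiting loop would simply call the trylock helper repeatedly until it returns 1.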

seqlocks:

Appropriate for situations where:
    Resource protected doesn't need to be held for long
    Resource protected is frequently accessed
    Write access is rare and fast
Readers get "free" access, but must test for collision with writers, and retry in those cases.
Cannot be used on anything involving pointers.
seqlock_t: <linux/seqlock.h>
Static initialization:
    seqlock_t my_lock = SEQLOCK_UNLOCKED;
Dynamic initialization:
    void seqlock_init(seqlock_t *lock);
For readers:
    Obtain "sequence number":
        unsigned int read_seqbegin(seqlock_t *lock);
    Conduct simple computation
    Test if concurrent write occurred:
        int read_seqretry(seqlock_t *lock, unsigned int seq);
    If so, discard result and repeat
Interrupt-protected read:
    unsigned int read_seqbegin_irqsave(seqlock_t *lock, unsigned long flags);
    int read_seqretry_irqrestore(seqlock_t *lock, unsigned int seq, unsigned long flags);
For writers:
    void write_seqlock(seqlock_t *lock);
    void write_sequnlock(seqlock_t *lock);
Variants for writers:
    void write_seqlock_irqsave(seqlock_t *lock, unsigned long flags);
    void write_seqlock_irq(seqlock_t *lock);
    void write_seqlock_bh(seqlock_t *lock);
    void write_sequnlock_irqrestore(seqlock_t *lock, unsigned long flags);
    void write_sequnlock_irq(seqlock_t *lock);
    void write_sequnlock_bh(seqlock_t *lock);
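The reader's begin/compute/retry cycle can be modeled in userspace with a sequence counter that is odd while a write is in progress. This is a sketch under stated assumptions, not the kernel's seqlock: all model_ names are hypothetical, writers are assumed to be serialized by some other means, and the protected values are themselves atomics so the C11 model stays well defined.

```c
#include <assert.h>
#include <stdatomic.h>

/* Model of a sequence lock protecting two related values. */
struct model_seq {
    atomic_uint seq;         /* odd while a write is in progress */
    atomic_int a, b;         /* the protected data */
};

static void model_seq_init(struct model_seq *s)
{
    atomic_init(&s->seq, 0);
    atomic_init(&s->a, 0);
    atomic_init(&s->b, 0);
}

/* Writer: bump the counter before and after touching the data. */
static void model_write(struct model_seq *s, int a, int b)
{
    atomic_fetch_add(&s->seq, 1);    /* now odd: write in progress */
    atomic_store(&s->a, a);
    atomic_store(&s->b, b);
    atomic_fetch_add(&s->seq, 1);    /* even again: write complete */
}

static unsigned int model_read_begin(struct model_seq *s)
{
    unsigned int seq;

    /* Wait until no write is in progress. */
    while ((seq = atomic_load(&s->seq)) & 1u)
        ;
    return seq;
}

/* Nonzero if a write happened since model_read_begin() returned "seq". */
static int model_read_retry(struct model_seq *s, unsigned int seq)
{
    return atomic_load(&s->seq) != seq;
}

/* Reader: repeat the simple computation until it saw a stable snapshot. */
static int model_read_sum(struct model_seq *s)
{
    unsigned int seq;
    int sum;

    do {
        seq = model_read_begin(s);
        sum = atomic_load(&s->a) + atomic_load(&s->b);
    } while (model_read_retry(s, seq));
    return sum;
}
```

Readers never block the writer; they only pay the cost of repeating the computation when a write overlapped it.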

read-copy-update:

Powerful, but complex mechanism
Seldom used in drivers
Optimized for frequent reads and rare writes
Resources protected are accessed via pointers
References to resources are atomically protected
To change data:
    Make copy of data
    Change copy
    Aim relevant pointer to new version
    When no more refs to old copy exist, kernel frees resource.
Functions found in <linux/rcupdate.h>
Reader macros:
    rcu_read_lock()
        Disable preemption
    rcu_read_unlock()
        Enable preemption
There is no rcu_write_lock() because of how the algorithm operates, IOW writers are never locking, and that's what makes RCU fast.
Multiple writers are assumed to synchronize using some other mechanism, such as spinlocks.
Writer:
    Allocate new resource
    Copy data to new resource
    Change copy
    Replace pointer seen by reader code
Complicated part is to know when to free "old copy":
    Other CPUs may still have references to old copy
    Writer must wait until it knows no other instance has pointer to old copy.
    Since all code paths referencing resource are atomically protected, it is assumed that once every processor on the system has been scheduled at least once, then no other processor still holds a copy of the old data pointer.
    Hence, we can free the old copy.
    The kernel RCU mechanism provides a way for registering a callback to be issued once all processors have been scheduled to clean up the old copy.
Function to register RCU callback:
    void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg);
    Callback obtains same "arg" as passed to call_rcu().
    Typically callback issues a kfree().
Full detail of API and algorithm in <linux/rcupdate.h>
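The writer's allocate/copy/change/replace sequence can be sketched in userspace as follows. This only models the pointer swap: model_cfg, model_current and the two functions are hypothetical, and instead of a deferred callback the old copy is handed back to the caller, who is assumed to free it once no reader can still hold it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical structure that readers reach only through a pointer. */
struct model_cfg { int speed; };

static _Atomic(struct model_cfg *) model_current;

/* Reader side: a single atomic load of the shared pointer. */
static int model_read_speed(void)
{
    return atomic_load(&model_current)->speed;
}

/*
 * Writer side: allocate, copy, change the copy, then swing the pointer.
 * The old copy is returned rather than freed, standing in for the
 * deferred callback that would free it later. Returns NULL (and
 * changes nothing) if the allocation fails.
 */
static struct model_cfg *model_update_speed(int speed)
{
    struct model_cfg *old = atomic_load(&model_current);
    struct model_cfg *new_cfg = malloc(sizeof(*new_cfg));

    if (!new_cfg)
        return NULL;
    *new_cfg = *old;                         /* copy */
    new_cfg->speed = speed;                  /* change the copy */
    atomic_store(&model_current, new_cfg);   /* publish */
    return old;
}
```

A reader that loaded the pointer before the update keeps seeing a complete, unchanged old copy, which is why the old copy cannot be freed immediately.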

7. Summary

Semaphores/Mutexes  > Servicing userspace calls
Completions         > End signal shared by routines servicing userspace.
Spinlocks           > SMP systems / disabling interrupts.
Atomic ops          > Single arithmetic op
Bit op              > Single bit op
Seqlocks            > Few writers / lots of readers.
RCU                 > Pointer struct modifications.

Interrupts and interrupt deferral
1. Installing an interrupt handler 2. Implementing a handler 3. Top and bottom halves 4. Interrupt sharing 5. Interrupt-driven I/O

1. Installing an interrupt handler

The basics:

If there's no handler registered for an interrupt, the kernel will ack it but do nothing with it.
Kernel keeps track of which driver/handler is associated with an interrupt line.
Register interrupt handler:
    int request_irq(unsigned int irq, irqreturn_t (*handler)(int, void *, struct pt_regs *), unsigned long flags, const char *dev_name, void *dev_id);
        retval is zero on success
        retval is negative on error (-EBUSY if already allocated).
        "irq": the IRQ number being requested
        "handler": the IRQ callback
        "flags": bitmask on how to manage interrupt
        "dev_name": name printed out in /proc/interrupts
        "dev_id": pointer needed for handler freeing when interrupt shared. If the interrupt isn't shared, can be set to NULL. Otherwise, set to some unique pointer within driver.
Interrupt bitmask:
    SA_INTERRUPT:
        "Fast interrupt". Execute handler with interrupts disabled on local CPU.
    SA_SHIRQ:
        Interrupt can be "shared".
    SA_SAMPLE_RANDOM:
        Interrupts generated by device can contribute to entropy pool for random number generation (/dev/random and /dev/urandom).
Unregister handler:
    void free_irq(unsigned int irq, void *dev_id);
Handler registration can be done on device instantiation or on open().
Best to register handler on first open()/free on last close().

The /proc interface:

Kernel keeps track of how many times each type of interrupt occurs (internal counter).

Do "cat /proc/interrupts" to see:
    Interrupt lines that currently have registered handlers
    The number of times each type of interrupt occurred for each CPU.
    The PIC (Programmable Interrupt Controller) configuration for the interrupt.
    The driver(s) that have registered handlers for the given interrupt, as provided by the "dev_name" parameter of request_irq().
Do "cat /proc/stat" and look for the "intr" line to see:
    The total number of all interrupts that occurred since boot
    The total number of interrupts of a given type that occurred since boot, each entry being separated by a space.
Number of entries in both files will vary greatly between archs.

Autodetecting the IRQ number:

Sometimes interrupt numbers are known in advance
Ask PCI config for interrupt number
Can ask device to generate interrupt and monitor result.
Can't probe shared interrupts
Kernel-assisted probing:
    Only for nonshared interrupts
    Usually for ISA only
    <linux/interrupt.h>
    unsigned long probe_irq_on(void);
        retval is bitmask of unassigned interrupts
        record retval for passing to probe_irq_off()
        Enable interrupts after this call
        Configure device to emit interrupt
    int probe_irq_off(unsigned long);
        Call to ask kernel which interrupt occurred
        Disable interrupts before this call
        May need to insert delay prior to calling this function to give time for the interrupt to occur.
        retval is > 0 if only one interrupt occurred
        retval is 0 if no interrupt occurred
        retval is < 0 if more than one interrupt occurred
    Probing can take a lot of time (20 ms for frame grabber)
    Best to do probing only once at module load time
    Most non-PC platforms don't need probing and above functions are placeholders (including most PPC, and MIPS implementations).
Do-it-yourself probing:
    Loop on all possible interrupts
    Record interrupt handler for a given interrupt
    Configure device generating an interrupt
    Wait for interrupt to occur
    Check to see if handler was called
    Free interrupt handler

Fast and slow interrupts:

Old kernel abstraction
Currently, the only difference between handlers is those that use the SA_INTERRUPT flag and those that don't.
If SA_INTERRUPT is used, handler is called with all local interrupts disabled.
Most drivers shouldn't use SA_INTERRUPT unless absolutely required.

The internals of interrupt handling on the x86:

Already covered in embedded class
Assembly-generating macros in arch/i386/kernel/entry.S push int number on stack and call do_IRQ from arch/i386/kernel/irq.c.
do_IRQ does:
    Mask and ack interrupt
    Requests spinlock for given IRQ number (to prevent other processors from trying to handle it.)
    If handler registered, call handle_IRQ_event to invoke it
    Otherwise unlock and return
handle_IRQ_event:
    If SA_INTERRUPT not set, reenable interrupts
    Invoke handler(s)
Check for scheduling (processes may have been woken up as a result of interrupt).

2. Implementing a handler

The basics

Role:
- Interact with device regarding interrupt, usually ACK device.
- Transfer data to/from device as required
- Wake up processes waiting for device
- Defer as much work as possible to tasklets or workqueues.

Restrictions as to what handler can do. Restrictions similar to those of timer:
- Can't access user space
- Can't sleep or do anything that may sleep, including memory allocation or grabbing semaphores.
- Can't call the scheduler

Handler arguments and return value:
- int request_irq(unsigned int irq, irqreturn_t (*handler)(int, void *, struct pt_regs *), unsigned long flags, const char *dev_name, void *dev_id);
- Handler arguments:
  - 1: "irq", the irq number
  - 2: "*dev_id", private data (same as passed to request_irq(); can pass internal device "instance" pointer to easily find instance on interrupt).
  - 3: "*regs", the CPU registers at interrupt occurrence, seldom used.
- retval is status of interrupt handling:
  - IRQ_HANDLED, an interrupt occurred. Should also be used if there is no way to determine whether an interrupt did occur.
  - IRQ_NONE, no interrupt occurred. Interrupt was spurious or is shared.
  - Macro for generating return value depending on variable (nonzero means interrupt handled): IRQ_RETVAL(var);

Enabling and disabling interrupts:
- Try to avoid inasmuch as possible
- No way to disable interrupts on all processors at the same time.

Disabling a single interrupt:
- <asm/irq.h>
- void disable_irq(int irq);
  - Wait for interrupt handler if it's running and disable it
  - Careful with deadlocks
- void disable_irq_nosync(int irq);
  - Disables interrupt without checking if handler is running
- void enable_irq(int irq);
- May play with PIC's mask
- Calls can be nested

Disabling all interrupts:
- Disable interrupts on local CPU
- <asm/system.h>
- void local_irq_save(unsigned long flags);
  - Disables interrupts and saves local interrupt flags
- void local_irq_disable(void);
  - Disables without recording flags
- void local_irq_restore(unsigned long flags);
  - Restores interrupt flags
- void local_irq_enable(void);
  - Re-enables local interrupts
- local_irq_disable()/local_irq_enable() cannot be nested: use local_irq_save()/local_irq_restore() when interrupts may already be disabled.

3. Top and bottom halves

- Interrupt handlers must finish rapidly (top half)
- Lengthy work deferred to later in "bottom half" with all interrupts enabled.
- Usually:
  - Top half records device data to temporary buffer and schedules BH.
  - Network code pushes packet up the stack, while actual processing is done in BH.
  - BH further does whatever scheduling is needed
- Will be covered in detail in Ch. 8
- Reminder: from Using Linux in Embedded Systems:

Tasklets: Overview
- Runs in software interrupt context
- Only one tasklet of a given type will ever be running at the same time in the entire system.
- Even if rescheduled multiple times, will only run once.
- Interrupt may occur while tasklet is running => use appropriate locks.
- Tasklets run on the same CPU where they were scheduled
- API reminder:
  - DECLARE_TASKLET()
  - tasklet_init()
  - tasklet_schedule()

Workqueues:
- Issue function within workqueue process context
- Reminder:
  - Can sleep
  - Can't access user space
  - Can use system default workqueue

4. Interrupt sharing

- Interrupt sharing is a must on modern hardware (used by PCI).

Installing a shared handler:
- Use same request_irq()
- Difference with non-shared handlers:
  - Use the SA_SHIRQ flag
  - Pass non-NULL, unique ID in *dev_id; kernel complains otherwise as it may oops at irq freeing.
- Registration fails if other handlers have registered without setting the SA_SHIRQ flag.

Running the handler
- On interrupt occurrence, kernel invokes every registered handler for the given interrupt, passing it the *dev_id provided on registration.
- Each handler must determine if the device it handles issued the interrupt, and return IRQ_NONE otherwise.
- *dev_id's importance is shown when free_irq() is called since that's the only way to know which handler should be removed from the shared list.
- Remember that interrupt is shared => don't disable interrupts.

The /proc interface and shared interrupts
- /proc/interrupts shows list of drivers sharing a given interrupt.
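
The dispatch rules for shared interrupts can be illustrated with a small user-space model. This is only a sketch: `irq_action`, `fake_dev`, `dispatch_shared_irq()` and `remove_action()` are hypothetical names invented for the illustration, and the `irqreturn_t` values are redefined locally rather than taken from kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only: kernel-like names are redefined locally. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;
typedef irqreturn_t (*irq_handler_t)(int irq, void *dev_id);

struct irq_action {             /* one registered handler on a shared line */
    irq_handler_t handler;
    void *dev_id;               /* unique, non-NULL cookie per device */
    struct irq_action *next;
};

struct fake_dev {               /* hypothetical device state */
    int irq_pending;            /* stands in for a status register bit */
    int serviced;               /* interrupts handled for this device */
};

/* Each handler checks its own device and declines if it did not interrupt. */
static irqreturn_t fake_dev_isr(int irq, void *dev_id)
{
    struct fake_dev *dev = dev_id;

    (void)irq;
    if (!dev->irq_pending)
        return IRQ_NONE;
    dev->irq_pending = 0;       /* acknowledge the device */
    dev->serviced++;
    return IRQ_HANDLED;
}

/* Every handler on the shared line is invoked with its own dev_id. */
static int dispatch_shared_irq(int irq, struct irq_action *list)
{
    int handled = 0;

    for (; list != NULL; list = list->next)
        if (list->handler(irq, list->dev_id) == IRQ_HANDLED)
            handled++;
    return handled;
}

/* Model of freeing a handler: dev_id is the only key that identifies it. */
static struct irq_action *remove_action(struct irq_action *list, void *dev_id)
{
    struct irq_action **pp = &list;

    assert(dev_id != NULL);
    while (*pp != NULL) {
        if ((*pp)->dev_id == dev_id) {
            *pp = (*pp)->next;  /* unlink the matching entry */
            break;
        }
        pp = &(*pp)->next;
    }
    return list;
}
```

Because both devices use the same handler function here, only the dev_id cookie distinguishes the two registrations, which is why a NULL or duplicated dev_id cannot work on a shared line.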

5. Interrupt-driven I/O

- Reminder: use buffered I/O
- Buffered I/O allows for interrupt-driven I/O
- Usually, best if hardware:
  - Generates int when data is ready for read
  - Generates int when more data can be written, or generates int to ack data writing.

Guidelines:
- Keep in mind that your driver may "miss" interrupts.
- For that reason, it's always a good idea, when appropriate, to set up a timer function to check whether interrupts have been missed.

Timely execution and time measurement
1. Measuring time lapses
2. Knowing the current time
3. Delaying execution
4. Kernel timers
5. Tasklets
6. Workqueues

1. Measuring time lapses

Background:
- Kernel timer: the kernel's heartbeat
- Interval configured at boot time
- HZ in <linux/param.h>
- Value ranges according to architecture
  - For x86, PPC: 1000; for MIPS: 100
- Can change HZ if needed
- For each timer interrupt: kernel increments internal counter
  - jiffies_64 (even on 32-bit platforms)
  - Access to jiffies_64 not atomic on 32-bit platforms
- Drivers typically use jiffies (unsigned long):
  - Same as jiffies_64 or least significant bits of jiffies_64.

Using the jiffies counter:
- jiffies and jiffies_64: <linux/sched.h>
- No need to hold any locks to read jiffies
- jiffies is volatile: will be fetched from RAM on every read.
- Never write jiffies
- jiffies may wrap around; comparisons should be properly done.

Comparison functions:
- <linux/jiffies.h>
- int time_after(unsigned long a, unsigned long b);
  - True if "a" is after "b"
- int time_before(unsigned long a, unsigned long b);
  - True if "a" is before "b"
- int time_after_eq(unsigned long a, unsigned long b);
  - After or equal
- int time_before_eq(unsigned long a, unsigned long b);
  - Before or equal

Obtaining time difference:
    diff = (long)t2 - (long)t1;

Converting time to milliseconds:
    msec = diff * 1000 / HZ;
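
To see why the comparison helpers matter, the wraparound arithmetic can be tried out in user space. The following is a minimal sketch, not kernel code: `model_time_after()`, `ticks_between()` and `ticks_to_msec()` are hypothetical helpers, and `HZ` is fixed at 100 purely for the example.

```c
#include <assert.h>

#define HZ 100UL   /* assumed tick rate for this sketch */

/* Same idea as time_after(): true if a is after b, even when the
 * counter wrapped around between the two samples. */
static int model_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Elapsed ticks from t1 to t2; the unsigned subtraction wraps correctly
 * as long as the two samples are less than half the counter range apart. */
static long ticks_between(unsigned long t1, unsigned long t2)
{
    return (long)(t2 - t1);
}

/* Tick count to milliseconds, as in msec = diff * 1000 / HZ. */
static long ticks_to_msec(long diff)
{
    return diff * 1000 / (long)HZ;
}
```

A plain `a > b` test would give the wrong answer for a sample taken just before the wrap and one taken just after it, which is the case the signed difference handles.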

- User space uses "struct timeval" and "struct timespec"
- Converting jiffies to/from "struct timeval" and "struct timespec":
  - <linux/time.h>
  - unsigned long timespec_to_jiffies(struct timespec *value);
  - void jiffies_to_timespec(unsigned long jiffies, struct timespec *value);
  - unsigned long timeval_to_jiffies(struct timeval *value);
  - void jiffies_to_timeval(unsigned long jiffies, struct timeval *value);
- Helper function in case you need to read jiffies_64:
  - <linux/jiffies.h>
  - u64 get_jiffies_64(void);
  - Typically no need to read jiffies_64 on 32-bit systems
- In user space, HZ is always seen as 100. Helper functions maintain this view.

Processor-specific registers:
- Some processors have internal counters counting the number of CPU clock cycles.
- Very useful for precise time measurements
- Very architecture-specific: size, user-space visibility, read-only.
  - On x86, TSC (64-bit)
  - On MIPS, register 9 of "coprocessor 0" (VR4181): 32-bit
- There exist platform-specific functions, but there's also a generic helper:
  - <linux/timex.h>
  - cycles_t get_cycles(void);
  - Returns 0 on processors without CPU counters
  - cycles_t is unsigned type
- x86-type assembly macros:
    rdtsc(low32, high32);
    rdtscl(low32);
    rdtscll(var64);
- Often not synchronized on SMP systems

2. Knowing the current time

- Time representation tied to jiffies
- Usually drivers need just jiffies
- In some rare cases, drivers need to know exact real-world time.
- Converting wall time (user-space time) to jiffies:
  - <linux/time.h>
  - unsigned long mktime(unsigned int year, unsigned int mon, unsigned int day, unsigned int hour, unsigned int min, unsigned int sec);
- Getting absolute timestamp:
  - <linux/time.h>
  - void do_gettimeofday(struct timeval *tv);
  - Microsecond resolution
- Getting current time:
  - <linux/time.h>
  - struct timespec current_kernel_time(void);
  - jiffy resolution

3. Delaying execution

Long delays

Busy waiting
- Wait in tight loop for a certain time:
    while (time_before(jiffies, deadline))
        cpu_relax();
- The cpu_relax() call doesn't usually do much
- This technique is discouraged
- Shouldn't be done while interrupts disabled
- May end up waiting much more than expected because of scheduling.

Yielding the processor
- Call the scheduler:
    while (time_before(jiffies, deadline))
        schedule();
- May result in process looping rapidly if only process on CPU.
- May end up waiting a very long time if the system is loaded

Timeouts
- Some types of wait can be timed out:
  - <linux/wait.h>
  - long wait_event_timeout(wait_queue_head_t q, condition, long timeout);
  - long wait_event_interruptible_timeout(wait_queue_head_t q, condition, long timeout);
  - timeout is number of jiffies, not absolute time
  - retval is zero if timeout
  - retval is remaining delay if woken up
- If no event is waited for:
  - <linux/sched.h>
  - signed long schedule_timeout(signed long timeout);
  - timeout is number of jiffies
  - Must set current process to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
  - retval is zero unless return before timeout (because of signal)
  - After call, process is set to TASK_RUNNING

Short delays:
- Sometimes short busy waits needed for hardware ops
- Helper functions:
  - <linux/delay.h>
  - void ndelay(unsigned long nsecs); /* nanoseconds */
  - void udelay(unsigned long usecs); /* microseconds */
  - void mdelay(unsigned long msecs); /* milliseconds */
- Will typically return only after the delay has expired
- ndelay() and udelay() have an upper limit: if the static value passed is too large, module will fail to load with __bad_udelay.
- Non-busy-wait functions:
  - <linux/delay.h>
  - void msleep(unsigned int millisecs);
    - Uninterruptible sleep
  - unsigned long msleep_interruptible(unsigned int millisecs);
    - Interruptible sleep
    - retval zero if delay achieved
    - retval is number of milliseconds early if woken before timeout
  - void ssleep(unsigned int seconds);
    - Uninterruptible sleep
  - Likely to wake up much later than the requested delay

4. Kernel timers

Background
- Notified in due time without blocking
- Schedule callback execution
- Jiffies resolution
- Asynchronous execution of callbacks
- Registration:
  - Provide callback function
  - Provide delay
  - Provide parameter to callback function
- Limitations:
  - No "current process" context
  - Must not access user space
  - "current" has no relation to callback
  - Cannot sleep or reschedule anything
  - Soft real-time, not hard real-time
- Testing current context:
  - <asm/hardirq.h>
  - in_interrupt()
  - in_atomic() /* Cannot schedule */
- A timer callback can reschedule itself
- On SMP, callback runs on same CPU as registered
- Timers can cause race conditions: use appropriate locks
  - Use software interrupt locks (_bh())
- <linux/timer.h>
    struct timer_list {
        ...
        unsigned long expires;             /* jiffies */
        void (*function)(unsigned long);
        unsigned long data;
    };

The timer API:
- Static initialization:
  - struct timer_list TIMER_INITIALIZER(_function, _expires, _data);
- Dynamic initialization:
  - void init_timer(struct timer_list *timer);
  - May change the 3 fields in struct after initialization
- Adding timer to list:
  - void add_timer(struct timer_list *timer);
- Deleting timer prior to expiry:
  - int del_timer(struct timer_list *timer);
- Modify timer expiry:
  - int mod_timer(struct timer_list *timer, unsigned long expires);
- Delete timer and, on return, make sure it's not running on any CPU:
  - int del_timer_sync(struct timer_list *timer);
  - May sleep if not in atomic context
  - Careful when calling while holding locks => deadlock if timer function attempts to obtain same lock
- Indicate if timer is currently scheduled for execution:
  - int timer_pending(const struct timer_list *timer);

The implementation of kernel timers:
- Per-CPU data structure
- Timers inserted into per-CPU struct using internal_add_timer()
- Timers inserted in "cascading table" depending on expiry:
  - negative expiry time: scheduled to run at next tick
  - between 0 and 255 jiffies (bits 1-8 in "expires"): 256 lists
  - between 256 and 16,383 jiffies (bits 9-14 in "expires"): 64 lists.
  - bits 15-20: 64 lists
  - bits 21-26: 64 lists
  - bits 27-31: 64 lists
  - Larger values: hash
- When __run_timers is executed:
  - Executes all timers for current tick
  - If jiffies is a multiple of 256, rehash next level into 256 lists, and cascade other levels as needed.
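
The level selection in the cascading table can be sketched in user space as follows. This is a simplified model of what the slides describe for internal_add_timer(), assuming a 32-bit jiffies counter; `timer_level()` and the returned level numbers are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TVR_BITS 8   /* first level: 256 lists */
#define TVN_BITS 6   /* upper levels: 64 lists each */

/* Returns the table level (1 to 5) a timer would be filed under,
 * or 0 if its expiry is already in the past (runs at the next tick). */
static int timer_level(uint32_t expires, uint32_t now)
{
    uint32_t idx = expires - now;   /* ticks until expiry, modulo 2^32 */

    if (idx < (1u << TVR_BITS))
        return 1;                   /* 0..255 jiffies away */
    if (idx < (1u << (TVR_BITS + TVN_BITS)))
        return 2;                   /* up to 16,383 jiffies away */
    if (idx < (1u << (TVR_BITS + 2 * TVN_BITS)))
        return 3;
    if (idx < (1u << (TVR_BITS + 3 * TVN_BITS)))
        return 4;
    if ((int32_t)idx < 0)
        return 0;                   /* negative delta: already expired */
    return 5;
}
```

Only level 1 is scanned on every tick; the coarser levels are redistributed ("cascaded") downward each time the level below them has been fully consumed.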

5. Tasklets

Somewhat similar to timers:
- Run in soft interrupt context
- Run on same CPU where registered
- Receive unsigned long argument provided on registration.
- Can re-register themselves
- Tasklet lists are per-CPU

Difference from timers:
- No specific time for execution
- Just pending work for a later time

Especially useful for interrupt handlers to delay work for "later".

The API:
- <linux/interrupt.h>
    struct tasklet_struct {
        ...
        void (*func)(unsigned long);
        unsigned long data;
    };
- Static initialization:
  - DECLARE_TASKLET(name, func, data);
  - DECLARE_TASKLET_DISABLED(name, func, data);
- Dynamic initialization:
  - void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data);

Tasklet features:
- Can disable/enable, even multiple times
  - Will only run if enabled as often as it was disabled
- Can specify "high" or "low" priority tasklets, the former being executed first.
- Will run either immediately, if no system load, or at the next system tick at the latest.
- Many tasklets can run in parallel on many CPUs
- Only one tasklet of a given type can run on any CPU at a given time.

Who runs tasklets?
- Per-CPU ksoftirqd

Full API:
- void tasklet_disable(struct tasklet_struct *t);
  - Disable tasklet execution
  - Busy-wait if tasklet currently running
- void tasklet_disable_nosync(struct tasklet_struct *t);
  - Disable tasklet execution
  - Don't wait for running tasklet to finish
- void tasklet_enable(struct tasklet_struct *t);
  - Enable tasklet execution
- void tasklet_schedule(struct tasklet_struct *t);
  - Schedule for execution
  - If rescheduled, run once
  - If rescheduled while running, rerun
- void tasklet_hi_schedule(struct tasklet_struct *t);
  - Schedule with high priority
  - Avoid unless absolutely necessary (ex.: media streaming)
- void tasklet_kill(struct tasklet_struct *t);
  - Make sure tasklet doesn't run again
  - Usually on device close
  - Will block if tasklet scheduled
  - If tasklet reschedules itself, must make sure it no longer does so prior to using tasklet_kill().

6. Workqueues

Basics:
- Not the same thing as previously seen wait queues
- Similar to tasklets:
  - Run something in the future
  - Usually runs on same processor where registered
  - Cannot access user space
- Different from tasklets:
  - Run in special process context, not software interrupt context.
  - Can sleep
  - Can ask for delayed execution for specific time
  - No need for atomic execution
  - Can tolerate high latency
- Each workqueue has dedicated "kernel thread"

Basic API:
- struct workqueue_struct: <linux/workqueue.h>
- struct workqueue_struct *create_workqueue(const char *name);
  - One kernel thread per CPU
- struct workqueue_struct *create_singlethread_workqueue(const char *name);
  - One kernel thread for entire system

Submitting a task to a workqueue:
- Static declaration:
  - DECLARE_WORK(name, void (*function)(void *), void *data);
- Dynamic declaration:
  - INIT_WORK(struct work_struct *work, void (*function)(void *), void *data);
  - PREPARE_WORK(struct work_struct *work, void (*function)(void *), void *data);
    - Similar to INIT_WORK() but doesn't initialize pointers to link struct work_struct to actual workqueue.
    - Useful if structure may have already been submitted to workqueue.
- Actual submission:
  - int queue_work(struct workqueue_struct *queue, struct work_struct *work);
    - retval is nonzero if the work was successfully added
    - retval is zero if already in queue (not added again)
  - int queue_delayed_work(struct workqueue_struct *queue, struct work_struct *work, unsigned long delay);
- About workqueue callback sleep:
  - Will affect other callbacks queued in workqueue.

Rest of API:
- int cancel_delayed_work(struct work_struct *work);
  - retval nonzero if cancelled prior to execution
  - retval zero if work may have been running (and could still be running on a different CPU).
- void flush_workqueue(struct workqueue_struct *queue);
  - Make sure no scheduled work is running anywhere in the system.
- void destroy_workqueue(struct workqueue_struct *queue);
  - Free workqueue.

The shared queue:
- Not all drivers need their own workqueue
- Kernel provides default shared queue
- Make sure your task doesn't monopolize queue
- API:
  - int schedule_work(struct work_struct *work);
  - int schedule_delayed_work(struct work_struct *work, unsigned long delay);
  - void flush_scheduled_work(void);
- Can still use cancel_delayed_work().

Memory resources
1. The real story of kmalloc
2. Lookaside caches
3. get_free_page and friends
4. The alloc_pages interface
5. vmalloc and friends
6. Per-CPU variables
7. Obtaining large buffers
8. Memory management in Linux
9. The mmap device operation
10. Performing direct I/O
11. Asynchronous I/O
12. Direct memory access
13. The generic DMA layer

1. The real story of kmalloc

- Easy to use
- Equivalent to malloc()
- Allocated region is physically contiguous
- Must clear content of kmalloc'ed regions if shared with user space, for security.
- Prototypes:
  - <linux/slab.h>
  - void *kmalloc(size_t size, int flags);
  - void kfree(const void *ptr);

The flags argument:
- Flags defined in <linux/gfp.h>
- Flags prefix is GFP_ because kmalloc typically relies on an internal function known as __get_free_pages().
- Basic flags:
  - GFP_ATOMIC:
    - Never sleep to get memory
    - Useful in interrupt handlers, tasklets and timers
    - May fail if the emergency reserve (a few free pages) is exhausted
  - GFP_KERNEL:
    - Normal kernel allocation
    - May sleep
    - Used by functions servicing system calls on behalf of process
    - Caller must be reentrant
    - Caller must not be holding locks
  - GFP_USER:
    - Allocate for user space
    - May sleep
  - GFP_HIGHUSER:
    - Allocates high memory (if available)
    - May sleep
  - GFP_NOIO:
    - Similar to GFP_KERNEL
    - Indicates to kernel not to do any I/O to satisfy request
  - GFP_NOFS:
    - Similar to GFP_KERNEL
    - Indicates to kernel not to do any filesystem calls

Additional flags to OR "|" with basic flags to further detail alloc:
- __GFP_DMA:
  - Allocate region that can be used for DMA
  - Architecture-specific
- __GFP_HIGHMEM:
  - Allocate high mem if possible
- __GFP_COLD:
  - Allocate memory outside the processors' cache
  - Useful for allocating memory for DMA reads
- __GFP_NOWARN:
  - Don't printk kernel warning if allocation fails
- __GFP_HIGH:
  - Very high priority request
  - Tries to use all of the kernel's last reserves
- __GFP_REPEAT:
  - Modify allocator behavior to repeat if fail
  - May still fail
- __GFP_NOFAIL:
  - Tell allocator never to fail
  - Should never be used in driver
- __GFP_NORETRY:
  - Tell allocator to fail immediately if region not available

Memory zones:
- Minimum zones recognized on all platforms by Linux:
  - Normal memory:
    - Region where allocations typically occur
  - DMA memory:
    - Memory "preferred" by architecture for DMA
  - High memory:
    - For very large allocation
    - Typically requires special structures to be set up to access the high memory.

The size argument:
- Space not managed like user-space malloc() (using a heap).
- Page-oriented allocation
- Pools of memory objects
- Actual implementation details complicated
- Possible allocation sizes predefined, likely will get more than what is requested.
  - Smallest unit: 32 or 64 bytes (depending on size of pages in architecture).
  - Largest unit: 128k
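
To make "likely will get more than what is requested" concrete, here is a rough user-space model of rounding a request up to a predefined size class. It assumes pure power-of-two classes between the 32-byte and 128k limits quoted above, which is a simplification; `model_kmalloc_class()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MIN_SIZE 32u              /* smallest unit quoted above */
#define MODEL_MAX_SIZE (128u * 1024u)   /* largest unit quoted above */

/* Returns the size class a request would be served from in this model,
 * or 0 if the request exceeds the largest class. */
static size_t model_kmalloc_class(size_t size)
{
    size_t cls = MODEL_MIN_SIZE;

    if (size > MODEL_MAX_SIZE)
        return 0;
    while (cls < size)
        cls <<= 1;                      /* next power of two */
    return cls;
}
```

In this model a 100-byte request consumes a 128-byte object and a request for one byte more than a page consumes two pages, which is why page-sized buffers are better obtained from the page allocator.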

2. Lookaside caches

- Custom functionality for drivers allocating same-sized objects repeatedly.
- <linux/slab.h>
- kmem_cache_t *kmem_cache_create(const char *name, size_t size, size_t offset, unsigned long flags, void (*constructor)(void *, kmem_cache_t *, unsigned long flags), void (*destructor)(void *, kmem_cache_t *, unsigned long flags));
  - "name" is for housekeeping: no blanks, must be static string.
  - Creates pool of objects with "size" size
  - "offset" of object in page (if alignment needed). Usually "0".
  - "flags" bitmask:
    - SLAB_NO_REAP:
      - Do not resize cache when memory allocator is looking for memory.
      - Discouraged
    - SLAB_HWCACHE_ALIGN:
      - Align with CPU cache lines
      - May be good for performance on SMP machines
      - May waste lots of memory
    - SLAB_CACHE_DMA:
      - Allocate in DMA region
  - "constructor"/"destructor"
    - Optional
    - Initialize newly allocated objects / clean up objects prior to free.
    - Constructor called after allocation, not necessarily immediately.
    - Destructors may be called at any time after free request
    - May or may not sleep depending on whether "flags" passed contains SLAB_CTOR_ATOMIC
    - Can use same function for both
    - Actual constructor called with SLAB_CTOR_CONSTRUCTOR
- void *kmem_cache_alloc(kmem_cache_t *cache, int flags);
  - Allocate memory from cache
  - Flags (same as kmalloc) used in case cache needs to allocate more mem.
- void kmem_cache_free(kmem_cache_t *cache, const void *obj);
  - Free cache-allocated memory
- int kmem_cache_destroy(kmem_cache_t *cache);
  - Destroy entire cache (usually on module_exit)
  - Fails if not all objects have been kmem_cache_free'd
  - Check retval for failure (mem leak)

- Statistics on cache usage available from /proc/slabinfo.

Memory pools:
- For places where memory allocation can't fail
- Like a lookaside cache with reserves for emergencies.
- Built on top of lookaside cache functionality
- Typically should not be used in drivers
- <linux/mempool.h>
- mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data);
  - "min_nr": minimum number of objects to keep available
  - typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
    - Usually set to "mempool_alloc_slab"
  - typedef void (mempool_free_t)(void *element, void *pool_data);
    - Usually set to "mempool_free_slab"
  - "pool_data" is passed to alloc_fn and free_fn
  - mempool_alloc_slab and mempool_free_slab rely on kmem_cache_alloc and kmem_cache_free.
  - Typically:
      cache = kmem_cache_create(...);
      pool = mempool_create(MY_POOL_MINIMUM, mempool_alloc_slab, mempool_free_slab, cache);
- void *mempool_alloc(mempool_t *pool, int gfp_mask);
  - Allocate from pool.
- void mempool_free(void *element, mempool_t *pool);
  - Free from pool.
- int mempool_resize(mempool_t *pool, int new_min_nr, int gfp_mask);
  - Resize memory pool
  - Check retval for success
- void mempool_destroy(mempool_t *pool);
  - Must free all objects prior to mempool destruction
  - Kernel oops otherwise

3. get_free_page and friends

- Necessary for allocating large chunks of memory
- Per-page allocation will result in less memory waste than kmalloc().
- <linux/slab.h>
- Page allocation:
  - get_zeroed_page(unsigned int flags);
    - Get pointer to new page pre-initialized with zeroes
  - __get_free_page(unsigned int flags);
    - Get pointer to new page without clearing content
  - __get_free_pages(unsigned int flags, unsigned int order);
    - Get 2^order physically contiguous pages without clearing content.
    - Fails if order too big
    - Can use get_order() to get order from int value
    - Maximum allowed order is 10 or 11 (depending on arch)
    - The further from reboot, the lower the chances a large alloc will succeed.
    - For information on what orders of allocation are available: /proc/buddyinfo
  - "flags" same as for kmalloc
- Page freeing:
  - void free_page(unsigned long addr);
  - void free_pages(unsigned long addr, unsigned long order);
  - Different "order" from allocation will cause memory corruption.
- Page allocation can be done anywhere, subject to the same restrictions as kmalloc().
- Requesting too much memory can result in system performance degradation.
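
The relation between a byte count and the "order" argument can be sketched in user space. This only models what a get_order()-style helper computes, assuming 4 KB pages; `model_get_order()` and the `MODEL_` constants are hypothetical names for this illustration.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12                        /* assumes 4 KB pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Smallest order such that 2^order pages hold at least size bytes. */
static int model_get_order(unsigned long size)
{
    int order = 0;
    /* Round the byte count up to a whole number of pages. */
    unsigned long pages = (size + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;

    while ((1UL << order) < pages)
        order++;
    return order;
}
```

Since the allocation is always a power-of-two number of pages, a 20,000-byte request (5 pages) is served with order 3, that is 8 pages, and the same order must be passed back when freeing.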

4. The alloc_pages interface

- Should be used when using nitty-gritty functionality requiring access to struct page, the kernel's internal description of a memory page.
- <linux/slab.h>
- struct page *alloc_pages_node(int nid, unsigned int flags, unsigned int order);
  - "nid" is NUMA ID, not typically used by most drivers.
  - "flags" same as for kmalloc()
  - "order" same as for __get_free_pages()
- struct page *alloc_pages(unsigned int flags, unsigned int order);
  - Same as alloc_pages_node(), but for local NUMA node
- struct page *alloc_page(unsigned int flags);
  - Simplest allocation
- void __free_page(struct page *page);
  - Free single page
- void __free_pages(struct page *page, unsigned int order);
  - Free page lot
- void free_hot_page(struct page *page);
  - Free single page that is in cache
- void free_cold_page(struct page *page);
  - Free single page that is not in cache

5. vmalloc and friends

- Contiguous address range in virtual memory
- May be physically discontiguous
- Discouraged in most situations:
  - Memory allocated by vmalloc() slightly less efficient
  - Amount of memory set aside for vmalloc() limited on some archs.
  - Must set up page table entries to obtain a contiguous virtual address range.
  - Can't be used for DMA
  - Can't be used in an atomic context (relies on GFP_KERNEL).
- Appropriate for large software-only buffers
- Used by kernel's module functionality to allocate space for modules.
- Used by relayfs
- <linux/vmalloc.h>
- void *vmalloc(unsigned long size);
  - retval is zero on failure
  - retval is pointer to region of at least "size" size
- void vfree(void *addr);
  - Free vmalloc'ed memory

6. Per-CPU variables

- New to 2.6.x
- Creating a per-CPU variable creates one copy for each CPU.
- No contention on variable
- Good caching
- Heavily used by networking subsystem
- Some ports have limited memory available for per-CPU variables:
  - Use wisely.
- <linux/percpu.h>
- Static declaration:
  - DEFINE_PER_CPU(type, name);
  - Type can be array (char[10])
- Dynamic declaration:
  - void *alloc_percpu(type);
  - void *__alloc_percpu(size_t size, size_t align);
    - Use in case of special alignment needs
  - free_percpu();
- Can be manipulated without locks, provided preemption is disabled.
- For statically allocated variables:
  - get_cpu_var(var);
    - Get local CPU's value
    - Disable preemption
  - put_cpu_var(var);
    - Enable preemption
  - per_cpu(var, cpu_id);
    - Get another CPU's variable
    - Must use locking
- For dynamically allocated variables:
  - int get_cpu(void);
    - Get current CPU's ID and disable preemption
  - per_cpu_ptr(void *per_cpu_var, int cpu_id);
    - Access variable on a given CPU
  - void put_cpu(void);
    - Enable preemption
  - get_cpu() and put_cpu() should only be needed when accessing the local processor's dynamically allocated variable.
- May export per-CPU variables:
  - EXPORT_PER_CPU_SYMBOL(per_cpu_var);
  - EXPORT_PER_CPU_SYMBOL_GPL(per_cpu_var);
- To use from another module:
  - DECLARE_PER_CPU(type, name);
  - In contrast with DEFINE_PER_CPU()
- Canned per-CPU counters:
  - <linux/percpu_counter.h>

7. Obtaining large buffers

- Large allocations are prone to failure
- The more the system runs, the more its memory gets fragmented.
- Acquiring a dedicated buffer at boot time:
  - Best chances of getting large buffers is at boot time
  - Considered "dirty" trick
  - Only drivers linked in kernel can allocate boot-time memory.
  - Modules can try by getting loaded very early after boot and, therefore, calling __get_free_pages() prior to any memory fragmentation.
- <linux/bootmem.h>
- void *alloc_bootmem(unsigned long size);
  - Allocate non-page-aligned memory
  - May allocate high memory
- void *alloc_bootmem_low(unsigned long size);
  - Allocate non-page-aligned memory
  - Allocate low memory
  - Typically for DMA
- void *alloc_bootmem_pages(unsigned long size);
  - Allocate page-aligned memory
  - May allocate high memory
- void *alloc_bootmem_low_pages(unsigned long size);
  - Allocate page-aligned memory
  - Allocate low memory
  - Typically for DMA
- void free_bootmem(unsigned long addr, unsigned long size);
  - In the unlikely case where you'd like to free this memory
  - Having called this, further attempts to reallocate the same area will most likely fail.

8. Memory management in Linux

Address types:
- See Fig. 15-1 on p. 414
- User virtual address:
  - Address within user-space process
- Physical address:
  - Actual address put by CPU on system bus
  - If lowmem, use __va() to convert to virtual address
- Bus address:
  - Address for accessing peripherals
  - Often mapped to physical address, but not always
- Kernel logical address:
  - Virtual addresses that map directly to physical addresses by an offset.
  - kmalloc() hands out this type of memory
  - Use __pa() to convert to physical address
- Kernel virtual address:
  - Virtual address that doesn't necessarily map to a given range of physical addresses.
  - A contiguous range of kernel virtual addresses will typically not be physically contiguous. It will be contiguous in virtual space because of the page mappings.
  - vmalloc()'ed memory

Physical addresses and pages:
- Memory subdivided in pages, usually 4K (PAGE_SIZE: <asm/page.h>).
- Address is always aggregate: <page frame number><offset in page>.
- PAGE_SHIFT is number of bits in page offset

High and low memory:
- Hardware limitation:
  - 32-bit system can only address 4GB of physical memory
- Linux-specific limitations:
  - Virtual address space is usually split between kernel and process:
    - 3GB for process / 1GB for kernel
  - Kernel can't handle memory not mapped in its space (logical address):
    - Therefore, for a long time, 1GB RAM was all Linux supported
- Modern processors:
  - Have been modified so that they can handle >4GB RAM
- Linux's use of these extensions:
  - Only portion of the RAM has logical addresses (~1GB)
  - Since kernel needs to run in logical address space, kernel is located <1GB RAM.
  - Memory beyond that is used for user-space processes
- Most embedded systems have no high memory to worry about.

The memory map and struct page:
- Any reference to physical memory handled by ref to struct page.
- struct page: <linux/mm.h>
  - atomic_t count;
    - Number of refs to page
    - Deallocation on ref count 0
  - void *virtual;
    - If mapped (usually is for lowmem), page's virtual address
    - If not mapped (probably highmem), NULL
    - Not present on all archs
  - unsigned long flags;
    - Page use flags
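
The "page frame number plus offset" view of an address described under "Physical addresses and pages" can be checked with a few lines of user-space C. This is a sketch assuming 4 KB pages; the `MODEL_` constants and the three helpers are hypothetical names, not kernel macros.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12                          /* bits in the page offset */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)   /* 4 KB pages assumed */
#define MODEL_PAGE_MASK  (~(MODEL_PAGE_SIZE - 1))

/* Upper bits of the address: the page frame number. */
static unsigned long addr_to_pfn(unsigned long addr)
{
    return addr >> MODEL_PAGE_SHIFT;
}

/* Lower PAGE_SHIFT bits: the offset inside the page. */
static unsigned long addr_to_offset(unsigned long addr)
{
    return addr & ~MODEL_PAGE_MASK;
}

/* Recombine a frame number and an in-page offset into an address. */
static unsigned long pfn_to_addr(unsigned long pfn, unsigned long offset)
{
    return (pfn << MODEL_PAGE_SHIFT) | offset;
}
```

For the address 0x12345678 the frame number is 0x12345 and the offset is 0x678; the same shift is what turns a physical address into the pfn argument used when remapping memory.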

Address conversion and struct page:
- struct page *virt_to_page(void *kaddr);
  - <asm/page.h>
  - Get struct page using kernel logical address
  - Not for kernel virtual addresses
- struct page *pfn_to_page(int pfn);
  - Get struct page from Page Frame Number
- void *page_address(struct page *page);
  - <linux/mm.h>
  - Get virtual address using struct page

Mapping struct page:
- Simple mapping:
  - <linux/highmem.h>
  - void *kmap(struct page *page);
    - If lowmem, return logical address
    - If highmem, maps highmem to kernel space
    - May sleep to create mapping
  - void kunmap(struct page *page);
    - Freeing mapping done with kmap
- Atomic mapping:
  - <linux/highmem.h> <asm/kmap_types.h>
  - void *kmap_atomic(struct page *page, enum km_type type);
    - Atomic form of kmap()
  - void kunmap_atomic(void *addr, enum km_type type);
    - Freeing mapping done with kmap_atomic

Page tables:
- Basics:
  - Mapping from virtual address to physical page
  - Built and maintained by OS
  - Used by CPU to convert addresses
- No need for direct page table manipulation in drivers

Virtual memory areas:
- "Memory object with its own properties":
  - Virtually contiguous region
  - Set of permission flags
  - Same object
- Processes usually have following areas:
  - Text section (.text)
  - Initialized data (.data)
  - Uninitialized data (.bss)
  - Stack
  - Memory mappings
- See /proc/<pid>/maps for example layout
- Each line in maps has format:
    start-end perm offset major:minor inode image
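
As an illustration of that line format, the fields can be pulled apart in user space with sscanf(). This is a minimal sketch: `struct map_entry` and `parse_maps_line()` are hypothetical names, and the parser assumes the image path contains no spaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct map_entry {
    unsigned long start, end;     /* virtual address range of the area */
    char perms[5];                /* e.g. "r-xp" */
    unsigned long offset;         /* offset into the mapped file */
    unsigned int dev_major, dev_minor;
    unsigned long inode;
    char image[256];              /* empty for anonymous mappings */
};

/* Returns 1 if the line matches the maps format, 0 otherwise. */
static int parse_maps_line(const char *line, struct map_entry *e)
{
    int n;

    e->image[0] = '\0';           /* the image field is optional */
    n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
               &e->start, &e->end, e->perms, &e->offset,
               &e->dev_major, &e->dev_minor, &e->inode, e->image);
    return n >= 7;
}
```

The start and end fields correspond to vm_start and vm_end of the VMA, and the offset field to vm_pgoff expressed in bytes.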

The vm_area_struct structure:
- Definition of a VMA
- <linux/mm.h>
- unsigned long vm_start; unsigned long vm_end;
  - Virtual address range for VMA
- struct file *vm_file;
  - File associated with VMA if memmap
  - May be NULL
- unsigned long vm_pgoff;
  - Offset in pages within mapped file if memmap
- unsigned long vm_flags;
  - Description of region
  - VM_IO: do not include in process core dump
  - VM_RESERVED: do not swap (typical for driver)
- struct vm_operations_struct *vm_ops;
  - Ops to work on VMA
- void *vm_private_data;
  - Private data

The vm_operations_struct structure:
- void (*open)(struct vm_area_struct *vma);
  - Called every time there is a new ref to area (like fork)
- void (*close)(struct vm_area_struct *vma);
  - Called every time ref to area is closed (like close())
- struct page *(*nopage)(struct vm_area_struct *vma, unsigned long address, int *type);
  - Access to page but page isn't there
  - If NULL, empty page is allocated by kernel
- int (*populate)(struct vm_area_struct *vm, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
  - Allows VMA to be initialized prior to being referenced
  - Not needed by drivers

The process memory map:
- Each process has struct mm_struct (<linux/sched.h>)
- struct mm_struct contains the description of the process's memory map.
- current->mm
- Can be shared amongst threads

9. The mmap device operation

Basics:
- Ability to map file or device directly to process' address space.
- X maps /dev/mem
- PCI generally lends itself well to mmapping
- Read/writing to area results in direct read/write to device.
- Mapped areas must be page-aligned
- Mapped areas must have sizes which are a multiple of PAGE_SIZE.
- System call goes through lots of processing before driver's mmap function gets called.
- From user space:
  - void *mmap(caddr_t addr, size_t len, int prot, int flags, int fd, off_t offset)
- Actual driver callback prototype:
  - int (*mmap)(struct file *filp, struct vm_area_struct *vma);
  - "vma" is already initialized prior to callback being issued
- Driver must:
  - Populate page tables
  - Replace vma ops if necessary
- Populating page tables:
  - remap_pfn_range
  - nopage

Using remap_pfn_range:
- To be used by driver's mmap() callback
- int remap_pfn_range(struct vm_area_struct *vma, unsigned long virt_addr, unsigned long pfn, unsigned long size, pgprot_t prot);
  - pfn is actual RAM
- int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long virt_addr, unsigned long phys_addr, unsigned long size, pgprot_t prot);
  - phys_addr is I/O memory
- On most archs, io_remap_pfn_range == remap_pfn_range.
- Many drivers directly use remap_pfn_range instead of io_remap_pfn_range.
- Fields:
  - vma:
    - VMA where range is to be mapped
  - virt_addr:
    - User region where to start mmapping
  - pfn:
    - The physical PFN to which region should be mapped
    - Physical address right-shifted by PAGE_SHIFT
    - Typically vma->vm_pgoff contains this value
  - size:
    - Size of region to be mmapped in bytes
  - prot:
    - Protection bits
    - Use vma->vm_page_prot
- Must test retval for success (0 means success, nonzero means failure)
- About caching:
  - As stated earlier, I/O mem should not be cached
  - Should be taken care of by BIOS/bootloader/early boot code.
  - Can be set via protection field in VMA
  - See drivers/char/mem.c: pgprot_noncached() for example

A simple implementation:
    static int my_driver_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        /* Map the whole requested range; vm_pgoff supplies the starting PFN. */
        if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                            vma->vm_end - vma->vm_start,
                            vma->vm_page_prot))
            return -EAGAIN;
        return 0;
    }
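
A driver will often want to validate the requested window before calling remap_pfn_range(). The arithmetic of such a check can be sketched in user space as below. This is only an illustration under stated assumptions: `check_mmap_window()` and `region_pages` are hypothetical, 4 KB pages are assumed, and vm_pgoff is treated here as a page offset into the device's own region rather than an absolute PFN.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_PAGE_SHIFT 12   /* assumes 4 KB pages */

/* Returns 0 if [vm_start, vm_end) starting at page offset vm_pgoff fits
 * inside a device region of region_pages pages, -EINVAL otherwise. */
static int check_mmap_window(unsigned long vm_start, unsigned long vm_end,
                             unsigned long vm_pgoff,
                             unsigned long region_pages)
{
    unsigned long req_pages = (vm_end - vm_start) >> MODEL_PAGE_SHIFT;

    /* Written so that the comparison itself cannot overflow. */
    if (vm_pgoff > region_pages || req_pages > region_pages - vm_pgoff)
        return -EINVAL;
    return 0;
}
```

In a real mmap() callback the same comparison would use vma->vm_start, vma->vm_end and vma->vm_pgoff, with the region size coming from the device's resources.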

Adding VMA operations:
- Needed if driver must do further processing every time a process gets a reference to the mmapped area (like on a fork) or when a process doesn't refer to the region anymore (like on close()).
- Modify vm_ops prior to returning from mmap() callback:
    vma->vm_ops = &my_vma_ops;
- Most drivers don't need to do this

Mapping memory with nopage:
- Sometimes remap_pfn_range() is not flexible enough
- If using nopage, actual mmap() callback just sets a few fields in the vma and changes the vm_ops, but doesn't actually call remap_pfn_range().
- nopage is necessary for mremap() syscall
- As shown before:
  - struct page *(*nopage)(struct vm_area_struct *vma, unsigned long address, int *type);
- nopage invoked on access to address not mapped
- If process does mremap and driver doesn't provide nopage, a new page filled with zeroes is provided to process.
- Can define simple nopage function that just returns NOPAGE_SIGBUS if you don't want any user-space process to be able to do a mremap.
- See LDD3 for full details

Remapping specific I/O regions:
- Can check in mmap() callback to verify if user-space process is trying to mmap more I/O memory than available on device.
- Return -EINVAL if too large

Remapping RAM:
- Can only use remap_pfn_range on:
  - Physical ranges above RAM
  - RAM locked pages (reserved pages) / can't be swapped
- Attempts to remap virtual addresses will result in mapping of the "zero page".

Remapping RAM with the nopage method:
- See LDD3 for example of how to map physical pages allocated in logical addresses to user space.

Remapping kernel virtual addresses:
- See LDD3 for how to map virtual address ranges to user space.

Relayfs:
- If you ever need to share buffers between kernel space and user space, use relayfs.
- Takes care of allocation of large buffers
- Allows mmap() from user space
- Provides efficient and atomic logging functions to fill buffer.

10. Performing direct I/O

Allows direct reading/writing using a user-space buffer without ever copying the data to kernel space first.
Very useful for things like block and networking devices.
However, most drivers that need direct I/O already have subsystems that do the dirty work, as is the case for block and networking.
The kernel provides get_user_pages() to lock user pages in RAM for the transfer.

Once done, must mark pages as dirty if modified:

void SetPageDirty(struct page *page);

Remove from page cache:

void page_cache_release(struct page *page);

See LDD3 for full details.

11. Asynchronous I/O

Initiate I/O without waiting for the actual transfer to occur.
Network and block drivers are asynchronous by default.
The async API is valid for char drivers only.
Example drivers worth implementing async I/O for:
    Streaming
Typically involves carrying out direct I/O.

Relevant char dev callbacks:

aio_read()
aio_write()
aio_fsync()

See LDD3 for full details.

12. Direct memory access

Overview of a DMA data transfer:

Basically:
    1. Set up hardware for transfer
    2. Respond to interrupt by signaling that transfer is done

Usually on read:
    1. Process calls read
    2. Driver sets up hardware to transfer from device to DMA buffer
    3. Driver puts process to sleep
    4. Hardware writes data to buffer
    5. Hardware issues interrupt
    6. Interrupt handler deals with transferred data and awakens process

Usually on write:
    1. Process calls write
    2. Data copied to DMA buffer
    3. Driver sets up hardware to transfer from DMA buffer to device
    4. Driver puts process to sleep
    5. Hardware copies data from buffer to device
    6. Hardware issues interrupt
    7. Interrupt handler wakes up process

For asynchronous data arrival (like data acquisition):
    1. Interrupt signals arrival of new data
    2. Interrupt handler instructs hardware to transfer data to designated buffer
    3. Hardware writes data to buffer
    4. Hardware raises interrupt to notify of end of transfer
    5. Interrupt handler dispatches data

Network cards operate using a DMA ring buffer shared with the CPU.

Allocating the DMA buffer:

DMA buffers must be physically contiguous.
Devices know nothing of virtual mappings.
Sometimes there are limitations even on the region where a DMA buffer may be allocated from.
    Use of GFP_DMA

Do-it-yourself allocation:

Use kmalloc() or get_free_pages(), presented earlier.
May have problems if the required buffer is too large.
Can also reserve RAM at boot time by telling the kernel not to use all RAM:
    Use ioremap() later to map the region for I/O
    Doesn't work on systems with highmem
Can use scatter/gather I/O if the device allows it.

Bus addresses:

Devices only recognize physical addresses.
On some platforms, devices use "bus addresses", which are not the same as "physical addresses" (I/O addresses mapped onto the bus...).
Conversion functions exist, but are strongly discouraged (deprecated):

unsigned long virt_to_bus(volatile void *address);
void *bus_to_virt(unsigned long address);

Use the generic DMA layer instead to obtain the proper addresses.

DMA for ISA devices:

See LDD3.

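13. The generic DMA layer

Basics:

The kernel provides a DMA layer to take care of all the possible complexities of DMA operation across all platforms in a portable fashion.
DMA layer operations operate on struct device.
Main include file: <linux/dma-mapping.h>

Dealing with difficult hardware:

If the driver knows the device cannot do DMA to the full 32-bit address range:

int dma_set_mask(struct device *dev, u64 mask);
    "mask": bit mask of the address bits the device can drive (e.g. 0xFFFFFF for a 24-bit device)
    retval is zero if DMA is possible with mask, nonzero if it is not (LDD3 describes the opposite convention; check the kernel's DMA API documentation for your kernel version)

Not needed for devices supporting 32-bit DMA.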
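
To make the notion of a DMA mask concrete, here is a small user-space sketch. DEMO_DMA_BIT_MASK() uses the same arithmetic as the kernel's DMA_BIT_MASK() helper; demo_dma_reachable() is a hypothetical function, not a kernel API, that only illustrates what "the device can address this buffer" means for a given mask.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the kernel's DMA_BIT_MASK(n): the low n bits set. */
#define DEMO_DMA_BIT_MASK(n) \
    (((n) == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << (n)) - 1))

/*
 * Hypothetical helper: can a device limited by mask reach every byte of
 * the bus address range [addr, addr + len)?
 */
static int demo_dma_reachable(uint64_t addr, uint64_t len, uint64_t mask)
{
    assert(len > 0);
    /* The last byte has the highest address; it must fit under the mask. */
    return addr <= mask && len - 1 <= mask - addr;
}
```

A device that can only drive 24 address bits would correspond to DEMO_DMA_BIT_MASK(24), i.e. 0xFFFFFF; a buffer ending above that address is one the DMA layer would have to bounce or refuse.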

DMA mappings:

Combination of:
    DMA buffer allocation
    Device-accessible address to buffer
Do not use virt_to_bus().
If available, an IOMMU may have "mapping registers".
"Bounce buffers" are used when the DMA buffer is not reachable by the device.
The DMA layer maintains "cache coherency" with the CPU.
Bus addresses (opaque): dma_addr_t

DMA mappings for PCI code:

Coherent DMA mappings:
    Mapping must be available to both CPU and peripheral at the same time.
    Requires setting up the mapping for the driver's entire lifetime.
    Discouraged use.
Streaming DMA mappings:
    Per-transaction mappings:
        Allocated on transfer init
        Freed on interrupt receive
    On some architectures can be faster than coherent mappings.
    Preferred.
Each PCI DMA mapping type is dealt with differently.

Setting up coherent DMA mappings:

void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, int flag);
    Allocates and maps buffer
    "dev": device
    "size": buffer size
    "dma_handle": bus address if call successful
    "flag": GFP_KERNEL or GFP_ATOMIC (if in atomic context)
    retval is kernel virtual address for use by driver
    Allocates DMA-able region
    Smallest allocatable size is one page

void dma_free_coherent(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle);
    All parameters must be properly set

DMA pools:

Allocating small DMA areas
<linux/dmapool.h>

struct dma_pool *dma_pool_create(const char *name, struct device *dev, size_t size, size_t align, size_t allocation);
    "name": pool name
    "dev": device
    "size": size of buffers in pool
    "align": hardware alignment for pool buffers (in bytes)
    "allocation": if nonzero, memory boundary not to be crossed by allocated buffers

void dma_pool_destroy(struct dma_pool *pool);

void *dma_pool_alloc(struct dma_pool *pool, int mem_flags, dma_addr_t *handle);
    "pool": pool to allocate from
    "mem_flags": GFP_*
    "handle": bus address if call successful
    retval is kernel virtual address for use by driver

void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t addr);

Setting up streaming DMA mappings:

Work on a buffer preallocated by the driver.
Must indicate transfer direction:
    DMA_TO_DEVICE
    DMA_FROM_DEVICE
    DMA_BIDIRECTIONAL (may have performance penalty)
    DMA_NONE (debugging)

dma_addr_t dma_map_single(struct device *dev, void *buffer, size_t size, enum dma_data_direction direction);
    retval is bus address to pass to device

void dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, enum dma_data_direction direction);
    Parameters must match mapping initialization

Important notes:

Can only transfer in the specified direction.
Once mapped, the buffer cannot be touched by the driver, which must therefore fill the buffer prior to transfer.
Buffer should not be unmapped until the transfer is complete.

void dma_sync_single_for_cpu(struct device *dev, dma_addr_t bus_addr, size_t size, enum dma_data_direction direction);
    Reacquire buffer for driver manipulation without unmapping it.

void dma_sync_single_for_device(struct device *dev, dma_addr_t bus_addr, size_t size, enum dma_data_direction direction);
    Return buffer back to device.

Single-page streaming mappings:

May want to transfer a buffer for which a struct page is available.
    Example: user-space buffers mapped using get_user_pages().

dma_addr_t dma_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction direction);
void dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size, enum dma_data_direction direction);

Avoid partial-page mappings because of potential cache coherency problems.

Scatter/gather mappings:

Transferring several buffers at the same time.
Many devices accept a list of buffer pointers for a single DMA.
In case of bounce buffer use, separate buffers may be concatenated.

First, set up and fill the scatterlist:

Architecture dependent
struct scatterlist: <asm/scatterlist.h>
    struct page *page;
        Page containing buffer to be used
    unsigned int length;
        Buffer size
    unsigned int offset;
        Buffer offset within page

Second, map scatter buffer:

int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction);
    "nents": number of entries in list
    Function stores the attributed DMA bus addresses into the scatterlist structs.

Third, transfer buffers:

Loop through the scatterlist and transfer each buffer.
Use kernel macros to retrieve the dma_addr_t from the arch-specific struct scatterlist.

dma_addr_t sg_dma_address(struct scatterlist *sg);
    Returns DMA address for given scatterlist entry

unsigned int sg_dma_len(struct scatterlist *sg);
    Returns length of buffer
    May be different from the one set prior to calling dma_map_sg()

Finally, when done, unmap:

void dma_unmap_sg(struct device *dev, struct scatterlist *list, int nents, enum dma_data_direction direction);
    "nents" is the value originally passed to dma_map_sg(), not the number of entries that function actually mapped.

As with streaming DMA mappings, scatter/gather mappings have strict usage rules.
To temporarily reacquire buffers, use:

void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction);
void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction);

PCI double-address cycle mappings:

The kernel supports a special mode for 64-bit Double Address Cycle (DAC) PCI transfers.
<linux/pci.h>

int pci_dac_set_dma_mask(struct pci_dev *pdev, u64 mask);
    retval 0 if DAC can be used

dma64_addr_t pci_dac_page_to_dma(struct pci_dev *pdev, struct page *page, unsigned long offset, int direction);
    DAC mappings should live in high memory
    DAC mappings must be created one page at a time
    "direction": PCI_DMA_TODEVICE, PCI_DMA_FROMDEVICE, PCI_DMA_BIDIRECTIONAL

No need for "unmapping" DAC mappings.
Must, however, restrict access:

void pci_dac_dma_sync_single_for_cpu(struct pci_dev *pdev, dma64_addr_t dma_addr, size_t len, int direction);
void pci_dac_dma_sync_single_for_device(struct pci_dev *pdev, dma64_addr_t dma_addr, size_t len, int direction);

Hardware access

1. I/O ports and I/O memory
2. I/O registers vs. conventional memory
3. Using I/O ports
4. Using I/O memory
5. Ports as I/O memory

1. I/O ports and I/O memory

A device is controlled through accesses to its special registers.
Registers are often placed at consecutive memory addresses.
Most architectures have a single address space for both memory and devices.
Some architectures have separate I/O ports and main memory, either by design of the architecture or by design of the actual board. The x86 is an example of an architecture where I/O ports exist.
The kernel provides separate functions for dealing with I/O ports and for dealing with I/O memory.

2. I/O registers vs. conventional memory

Compiler optimizations, runtime optimizations by the CPU, and caching are fine for memory accesses, but they aren't appropriate for access to device registers.
Drivers must make sure no such optimization takes place when reading and writing device registers.
The hardware caching problem is usually taken care of either in hardware, prior to Linux's boot, or by Linux initialization code.
Must use a "memory barrier" to protect against compiler and CPU optimizations.

Compiler barrier:

<linux/kernel.h>
void barrier(void);
    Tells the compiler to insert a memory barrier (i.e. store all values to memory) and reread them if need be.
    Has no effect on possible hardware optimizations.

Hardware barriers:

Platform dependent
<asm/system.h>

void rmb(void);
    Complete all reads appearing before the statement prior to executing any subsequent read.

void read_barrier_depends(void);
    Less stringent rmb(). Blocks reordering of reads depending on data from other reads.
    Discouraged. Use rmb() instead.

void wmb(void);
    Complete all writes appearing before the statement prior to executing any subsequent write.

void mb(void);
    Does both rmb() and wmb().

SMP-equivalent barriers:

void smp_rmb(void);
void smp_read_barrier_depends(void);
void smp_wmb(void);
void smp_mb(void);

Barriers should be inserted in between read/write operations.
Combined setting of value and memory barrier macros:
    set_mb(var, value)
    set_wmb(var, value)
    set_rmb(var, value)
Barriers are used in the kernel's locking mechanisms (spinlocks, atomic_t, etc.)
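
The placement of a write barrier between register accesses can be illustrated with a user-space sketch. Everything here is an assumption made for illustration: struct demo_regs is a made-up register layout living in ordinary memory, demo_barrier() is the GCC idiom for a compiler-only barrier, and demo_wmb() uses the GCC/Clang builtin __sync_synchronize() (a full barrier) as a stand-in for the kernel's wmb(). A real driver would use the kernel's own barrier macros and I/O accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compiler-only barrier: forbids the compiler from reordering across it. */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

/* Stand-in for wmb(): a full hardware barrier via a compiler builtin. */
#define demo_wmb() __sync_synchronize()

/* Hypothetical device register block (plain memory in this sketch). */
struct demo_regs {
    volatile uint32_t addr;  /* transfer address */
    volatile uint32_t len;   /* transfer length */
    volatile uint32_t ctrl;  /* writing 1 starts the transfer */
};

static void demo_start_transfer(struct demo_regs *regs,
                                uint32_t addr, uint32_t len)
{
    assert(regs != NULL);

    /* Program the transfer parameters first. */
    regs->addr = addr;
    regs->len = len;

    /* All parameter writes must complete before the "go" write. */
    demo_wmb();

    regs->ctrl = 1;
}
```

The point of the sketch is only the ordering: the barrier sits between the writes that describe the operation and the write that triggers it.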

3. Using I/O ports

I/O port allocation:

Must "allocate" an I/O region prior to using it.
<linux/ioport.h>

struct resource *request_region(unsigned long first, unsigned long n, const char *name);
    "first": first port in range
    "n": number of ports
    "name": name of device
    retval non-NULL means allocation success
    retval NULL means allocation failure
    Allocated ports visible in /proc/ioports

void release_region(unsigned long start, unsigned long n);
    Release previously allocated region

int check_region(unsigned long first, unsigned long n);
    Check if region is already allocated
    Deprecated since checking and requesting are not atomic
    Use request_region() directly

Manipulating I/O ports:

Some parameter types and return values depend on architecture.

Input:

<arch-dep type> inb(<arch-dep type> port); /* 8-bit */
<arch-dep type> inw(<arch-dep type> port); /* 16-bit */
<arch-dep type> inl(<arch-dep type> port); /* 32-bit */

Output:

void outb(unsigned char byte, <arch-dep type> port); /* 8-bit */
void outw(unsigned short word, <arch-dep type> port); /* 16-bit */
void outl(<arch-dep type> longword, <arch-dep type> port); /* 32-bit */

I/O port access from user space:

On x86/PC, ports are accessible from user space under certain conditions:
    Application built with -O
    Use sys_ioperm or sys_iopl to get permission to use I/O ports.
    Request for sys_ioperm or sys_iopl must be done while the app is running as root or has CAP_SYS_RAWIO.

String operations:

Write entire strings instead of single data units.
Sometimes processors have special instructions. If not, tight loops are used to write out the entire string.
Careful: single data unit operations do byte swapping in case peripheral and system have different byte ordering. String operations don't.

Input:

void insb(<arch-dep type> port, void *addr, unsigned long count);
void insw(<arch-dep type> port, void *addr, unsigned long count);
void insl(<arch-dep type> port, void *addr, unsigned long count);

Output:

void outsb(<arch-dep type> port, void *addr, unsigned long count);
void outsw(<arch-dep type> port, void *addr, unsigned long count);
void outsl(<arch-dep type> port, void *addr, unsigned long count);

Pausing I/O:

On some systems, there are problems if transfers are too fast.
Must sometimes insert pauses.
The kernel provides pausing variants of the previous calls; they all end with _p (inb_p(), outb_p(), etc.). Such variants are often the same as the original calls.

Platform dependencies:

I/O instructions are highly processor dependent (if they exist at all).
Often the data types are completely different.
Code conducting I/O is therefore often architecture dependent as well.
Examples:

x86 and x86_64:
    All I/O port functions supported
    Port numbers are "unsigned short"

ARM:
    All I/O port functions supported
    Port numbers are "unsigned int"
    Ports are memory mapped

PPC:
    All I/O port functions supported
    Port numbers are "unsigned char *" on 32-bit and "unsigned long" on 64-bit
    Ports are memory mapped

MIPS:
    All I/O port functions supported
    Port numbers are "unsigned long"
    Ports are memory mapped

4. Using I/O memory

I/O memory allocation and mapping:

Allocation:
    <linux/ioport.h>
    struct resource *request_mem_region(unsigned long start, unsigned long len, char *name);
        retval is non-NULL on success
        retval is NULL on failure
        I/O mem allocation visible in /proc/iomem

Freeing:
    void release_mem_region(unsigned long start, unsigned long len);

Checking:
    int check_mem_region(unsigned long start, unsigned long len);
        Deprecated: can't guarantee atomic check and request

Mapping:
    Virtual memory operation similar to vmalloc() and friends.
    <linux/vmalloc.h>
    void *ioremap(unsigned long phys_addr, unsigned long size);
        Map a physical address range to virtual address space
        Used mostly for mapping PCI devices into the kernel's space
        Returned addresses shouldn't be accessed directly
        Returned addresses should be accessed using proper I/O functions
    void *ioremap_nocache(unsigned long phys_addr, unsigned long size);
        Similar to ioremap(), but makes sure the remapped area is not cached
        In most cases, the same as ioremap()
    void iounmap(void *addr);
        Unmap ioremap'ed region

Accessing I/O memory:

Some platforms tolerate direct dereferencing of I/O mem, but this is dangerous, nonportable, and highly discouraged.
Helper functions operate on addresses returned by ioremap().

Read functions:

unsigned int ioread8(void *addr);
unsigned int ioread16(void *addr);
unsigned int ioread32(void *addr);

Write functions:

void iowrite8(u8 value, void *addr);
void iowrite16(u16 value, void *addr);
void iowrite32(u32 value, void *addr);

Repeat read functions:

void ioread8_rep(void *addr, void *buf, unsigned long count);
void ioread16_rep(void *addr, void *buf, unsigned long count);
void ioread32_rep(void *addr, void *buf, unsigned long count);

Repeat write functions:

void iowrite8_rep(void *addr, const void *buf, unsigned long count);
void iowrite16_rep(void *addr, const void *buf, unsigned long count);
void iowrite32_rep(void *addr, const void *buf, unsigned long count);

Memory block operations:

void memset_io(void *addr, u8 value, unsigned int count);
void memcpy_fromio(void *dest, void *source, unsigned int count);
void memcpy_toio(void *dest, void *source, unsigned int count);

Older functions:

<arch-dep type> readb(address);
<arch-dep type> readw(address);
<arch-dep type> readl(address);
void writeb(<arch-dep type> value, address);
void writew(<arch-dep type> value, address);
void writel(<arch-dep type> value, address);
Confusion over "b", "w", "l": moving to u8, u16, u32, u64

5. Ports as I/O memory

In some cases, the same piece of hardware can appear as accessible via I/O ports and in other instances (other implementations) appear as memory mapped.
The kernel provides functions allowing I/O ports to be accessed as I/O memory.
So:
    Use request_region() to reserve the I/O port region
    Remap the I/O port region to memory:
        void *ioport_map(unsigned long port, unsigned int count);
    Can then use ioread*() and iowrite*() instead of port functions
    Unmapping:
        void ioport_unmap(void *addr);

Char drivers

1. Major and minor
2. The file operations structure
3. The file structure
4. The inode structure
5. Char device registration
6. The older way for device registration
7. Open and release
8. Read and write
9. Read
10. Write
11. readv/writev
12. ioctl
13. Blocking I/O
14. poll and select
15. fsync
16. fasync
17. llseek
18. Access control on a device file

1. Major and minor

Basics:

Char devices are accessed through names in the filesystem.
Device nodes are typically in /dev.
"c" type files for char
"b" type files for block
Use "ls -l" to see device types in /dev.
Major/minor tuples for each device
One major number = one type of device
Minor number = device instance
Minor number is ignored by the kernel

The internal representation of device numbers:

dev_t
Macros in <linux/kdev_t.h>:
    MAJOR(dev_t dev);
    MINOR(dev_t dev);
    MKDEV(int major, int minor);
Currently, dev_t is 32-bit:
    12-bit major number
    20-bit minor number
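
The 12-bit/20-bit split can be made concrete with a user-space re-implementation of the packing. The DEMO_* names and demo_mkdev() are hypothetical and exist only for this sketch; driver code should always go through the kernel's own MAJOR(), MINOR() and MKDEV() macros rather than assume a layout.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t demo_dev_t;      /* stand-in for the kernel's dev_t */

#define DEMO_MINORBITS 20         /* 20-bit minor, leaving 12 bits for major */
#define DEMO_MINORMASK ((UINT32_C(1) << DEMO_MINORBITS) - 1)

/* Extract the major (upper 12 bits) and minor (lower 20 bits). */
#define DEMO_MAJOR(dev) ((unsigned int)((dev) >> DEMO_MINORBITS))
#define DEMO_MINOR(dev) ((unsigned int)((dev) & DEMO_MINORMASK))

/* Pack a major/minor pair into a single device number. */
static demo_dev_t demo_mkdev(unsigned int major, unsigned int minor)
{
    assert(major < (1u << 12));
    assert(minor <= DEMO_MINORMASK);
    return ((demo_dev_t)major << DEMO_MINORBITS) | minor;
}
```

For instance, major 4 and minor 64 pack to 0x400040 under this layout, and the two macros recover 4 and 64 from it.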

Allocating and freeing device numbers:

register_chrdev_region(): static allocation
    Beginning device number for range
    Number of devices in range (minor numbers)
    Name of device

alloc_chrdev_region(): dynamic allocation
    Pointer to dev_t
    First minor number, usually 0
    Number of devices in range
    Name of devices

unregister_chrdev_region(): deallocation
    Beginning device number for range
    Number of devices in range

Dynamic allocation of major numbers:

Static numbers are found in Documentation/devices.txt.
No new numbers are allocated.
Strongly encouraged to use alloc_chrdev_region().
Problem with node creation beforehand:
    See /proc/devices or, if the driver exports info in sysfs, the relevant directory entry in sysfs.
    Example /proc/devices parsing script on p. 46 of LDD3.
Could also count on the exact same dynamic number being reallocated if the driver loading sequence is the same. This, though, may become deprecated, as some kernel developers have hinted.
Can write code that supports both static and dynamic allocation. See p. 48 of LDD3.

2. The file operations structure

Basics:

Connection between major/minor numbers and char driver callbacks:
    struct file_operations
Defined in <linux/fs.h>
Bunch of function callbacks
Each opened file (struct file) has an associated struct file_operations:
    f_op
file_operations are the actual implementation of the main file system calls: open, close, read, write, etc.
One callback per implemented call
NULL for unsupported calls:
    Kernel behavior for NULL depends on call
Use of __user to specify that a pointer in a declaration is from user space and shouldn't be used directly.

Details:

struct module *owner;
    THIS_MODULE

loff_t (*llseek)(struct file *, loff_t, int);
    Set read/write position
    retval is new position
    Position counter unpredictable if NULL

ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
    Read from device
    If NULL, -EINVAL
    retval is how much data was read

ssize_t (*aio_read)(struct kiocb *, char __user *, size_t, loff_t);
    Asynchronous read
    May return even if incomplete
    If NULL, normal read gets called

ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
    Write to device
    If NULL, -EINVAL
    retval is how much data was written

ssize_t (*aio_write)(struct kiocb *, const char __user *, size_t, loff_t);
    Asynchronous write

int (*readdir)(struct file *, void *, filldir_t);
    Should be NULL
    For FS implementations only

unsigned int (*poll)(struct file *, struct poll_table_struct *);
    One of poll, epoll or select
    Query if read or write would block
    If NULL, device is nonblocking on either read or write

int (*ioctl)(struct inode *, struct file *, unsigned int, unsigned long);
    Device-specific commands
    Some ioctl commands are recognized by the kernel directly without calling the ioctl() callback.
    If NULL, -ENOTTY

int (*mmap)(struct file *, struct vm_area_struct *);
    Mapping of device to process address space
    If NULL, -ENODEV

int (*open)(struct inode *, struct file *);
    First op on device (always)
    Not required
    If NULL, open always succeeds, but driver is never notified.

int (*flush)(struct file *);
    Called when file descriptor is closed; should execute and wait for any outstanding operation.
    Per-file operation
    Not the same as fsync
    Very seldom used
    If NULL, app request ignored

int (*release)(struct inode *, struct file *);
    Called when file struct is released (file is closed)
    Like open, can be NULL

int (*fsync)(struct file *, struct dentry *, int datasync);
    Kernel side of fsync system call
    Called by user to flush pending data
    If NULL, -EINVAL

int (*aio_fsync)(struct kiocb *, int datasync);
    Asynchronous version of fsync

int (*fasync)(int, struct file *, int);
    To notify device of changes in FASYNC flag
    See async operations later

int (*lock)(struct file *, int, struct file_lock *);
    File locking
    Never implemented by drivers

ssize_t (*readv)(struct file *, const struct iovec *, unsigned long, loff_t *);
    Scatter/gather read
    Operations on multiple memory areas
    If NULL, loop around regular read

ssize_t (*writev)(struct file *, const struct iovec *, unsigned long, loff_t *);
    Scatter/gather write
    Operations on multiple memory areas
    If NULL, loop around regular write

ssize_t (*sendfile)(struct file *, loff_t *, size_t, read_actor_t, void *);
    Move file content from one file descriptor to another with minimal copy.
    Drivers usually leave as NULL

ssize_t (*sendpage)(struct file *, struct page *, int, size_t, loff_t *, int);
    Other half of sendfile
    Called by kernel to move one page at a time
    Not usually implemented by drivers

unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    Find proper location in process address space to map device
    Usually done by mm code
    Available for drivers with special alignment requirements
    Usually left as NULL

int (*check_flags)(int);
    Check flags passed to an fcntl (manipulate file descriptor)

int (*dir_notify)(struct file *filp, unsigned long arg);
    Called when user space uses fcntl to request directory change notifications.
    Usually implemented by FSes only

Most important calls:

owner
open
release
read
write
ioctl
llseek

Useful calls:

mmap
poll

3. The file structure

"struct file" defined in <linux/fs.h>
Unrelated to user-space FILE pointers
Each is entirely opaque to the opposite space
Represents an open file
Created by kernel and passed to open
Released once all file instances are closed
Relevant fields (others not relevant to device drivers):

mode_t f_mode:
    FMODE_READ: Readable
    FMODE_WRITE: Writable
    Permissions checked by kernel prior to file op invocation

loff_t f_pos:
    Current read/write position
    Should not be modified by read/write file ops
    Use last argument of callback instead
    Can be modified by llseek

unsigned int f_flags:
    O_RDONLY, O_NONBLOCK, O_SYNC
    Driver should usually just check for O_NONBLOCK

struct file_operations *f_op:
    The file ops for the file
    Assigned at open by kernel
    Could be reassigned at runtime by driver to implement different behavior under the same major number.

void *private_data:
    Set to NULL on open
    Useful to preserve "private" information across syscalls
    Careful to free on release()

struct dentry *f_dentry:
    Directory entry associated with file
    Not typically useful for drivers
    Can be used to obtain inode struct:
        filp->f_dentry->d_inode

4. The inode structure

Kernel-internal representation of files
There can be many file structs, one per open file descriptor, all pointing to the same inode.
The inode struct contains a lot of info.
Relevant fields:

dev_t i_rdev:
    For device driver inodes, this is the device number

struct cdev *i_cdev:
    Kernel-internal representation of char devices
    When inode is a char device, points to the actual struct cdev

Should not access inode fields directly; use macros instead:

unsigned int iminor(struct inode *inode)
unsigned int imajor(struct inode *inode)

5. Char device registration

Must allocate a struct cdev so the kernel can call char device callbacks.
Use <linux/cdev.h>
Allocating a single cdev struct:
    struct cdev *my_cdev = cdev_alloc();
    my_cdev->ops = &my_fops;
Initializing an existing cdev struct:
    void cdev_init(struct cdev *cdev, struct file_operations *fops);
Typically, should use a device-specific struct to hold info about your device. Something like my_device_struct, which contains a struct cdev as one of its elements.
Must set owner (for both static and dynamic):
    my_cdev->owner = THIS_MODULE;
Must add device to system (for both static and dynamic):
    int cdev_add(struct cdev *dev, dev_t num, unsigned int count);
        "dev": the cdev struct
        "num": the first device number to which device responds
        "count": how many device numbers are associated with device

About cdev_add:
    Can fail => must check retval
    Once called, device is live and callbacks may be invoked.
    Should not be called until device is fully initialized

To remove:
    void cdev_del(struct cdev *dev);

Example device registration on p. 57

6. The older way for device registration

int register_chrdev(unsigned int major, const char *name, struct file_operations *fops);
    Major number (0 for dynamically allocated)
    Device name
    file ops

int unregister_chrdev(unsigned int major, const char *name);
    Same major and name passed to register_chrdev.

7. Open and release

open should do:

Identify the device being opened, if using the new registration scheme:
    Assuming the cdev used in the device is a member of a device-specific struct such as my_device_struct, and was initialized and registered using something like:
        cdev_init(&my_device_instance->cdev, ...);
        cdev_add(&my_device_instance->cdev, ...);
    We can now obtain the pointer to my_device_instance by subtracting from the inode->i_cdev pointer the offset of the cdev element in the my_device_struct struct.
    container_of(pointer, container_type, container_field) macro:
        pointer to struct member (inode->i_cdev)
        type of container (struct scull_dev)
        name of member within container (cdev)
    Set filp->private_data to the pointer returned by container_of() for other file ops to use.

If using the old registration scheme:
    Use iminor() macro to get device minor number
    Check if device is one of those supported

Other open responsibilities:
    Check for device-specific errors
    Initialize the device if first open
    If necessary, update f_op
    Allocate and fill any private data for filp->private_data

Example open on p. 59

release is what happens on close(); it is the reverse of open.

release should do:
    Deallocate anything allocated in filp->private_data
    Shut down device on last close

Example basic release on p. 59

Generalities about open/release:

If O_NONBLOCK is specified, open should return without waiting. Some open functions may indeed have lengthy initialization procedures.
Not all processes that have a reference to a given struct file have gotten it through open(), yet they all call close() on the struct file on exit. How does this work?
    Calls to fork or dup do not result in a new struct file; they just increment a counter in the existing struct file.
    Conversely, the counter is decremented on close().
    release() will be called by the kernel only when the counter reaches zero.
flush() is called every time there is a close(), though few drivers support this call.
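
The container_of() step used in open can be sketched outside the kernel. In this user-space model demo_container_of() is a simplified re-implementation built on offsetof() (the kernel macro adds type checking), struct demo_cdev is a placeholder for struct cdev, and the fields of my_device_struct other than cdev are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space version of the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Placeholder for struct cdev; the real one lives in <linux/cdev.h>. */
struct demo_cdev {
    int dummy;
};

/* Device-specific struct embedding the cdev, as described above. */
struct my_device_struct {
    int irq;                 /* hypothetical per-device field */
    struct demo_cdev cdev;   /* embedded, not a pointer */
    char name[16];           /* hypothetical per-device field */
};

/*
 * What open() does with inode->i_cdev: step back from the embedded member
 * to the structure that contains it.
 */
static struct my_device_struct *demo_from_cdev(struct demo_cdev *cdev)
{
    assert(cdev != NULL);
    return demo_container_of(cdev, struct my_device_struct, cdev);
}
```

In a driver, the pointer returned this way is what would be stored in filp->private_data so that read, write and the other file ops can find their device.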

8. Read and write

ssize_t read(struct file *filp, char __user *buff, size_t count, loff_t *offp);
ssize_t write(struct file *filp, const char __user *buff, size_t count, loff_t *offp);
    filp = file pointer
    count = size of data to transfer
    buff = buffer to transfer to (read) or from (write)
    offp = offset at which file is being accessed

A little bit more about buff:

User-space pointer
May not even be "visible" from kernel space on some architectures
May point to an area that is paged out
May be sent by a malicious program

Must use appropriate functions for user-space references:

read/write should use copy_to_user/copy_from_user respectively.
read/write operations should update *offp.
Any update of *offp will be propagated by the kernel to the file struct.
read/write should return a negative value on error.
read/write should return 0 or a positive value to indicate the number of bytes successfully transferred.
On partial success, read/write should first return the number of bytes transferred, and return the error on the next call.
If retval < 0, user space sees -1, and must use errno to get the error.
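
The return-value and *offp rules for a driver's read method can be modeled in user space. This is only a sketch under stated assumptions: struct demo_dev is a made-up device holding a fixed byte array, demo_read() is a hypothetical function with read-like semantics, and memcpy() stands in for copy_to_user(), which a real driver must use (returning -EFAULT if it fails).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical device: a fixed block of bytes. */
struct demo_dev {
    const char *data;
    size_t size;
};

/*
 * Model of a driver read method: returns bytes transferred, 0 at end of
 * data, or a negative errno value, and advances *offp on success.
 */
static ssize_t demo_read(const struct demo_dev *dev, char *buff,
                         size_t count, off_t *offp)
{
    size_t pos;

    assert(dev != NULL && buff != NULL && offp != NULL);
    if (*offp < 0)
        return -EINVAL;

    pos = (size_t)*offp;
    if (pos >= dev->size)
        return 0;                        /* nothing left: end of file */
    if (count > dev->size - pos)
        count = dev->size - pos;         /* partial transfer */

    memcpy(buff, dev->data + pos, count);  /* copy_to_user() in a driver */
    *offp += (off_t)count;               /* kernel propagates this update */
    return (ssize_t)count;
}
```

Calling it repeatedly with the same offset variable yields a short read near the end of the data and then 0, which is how user space sees end of file.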

9. Read

Interpretation of return value by user-space app:

bytes == count: transfer success
count > bytes > 0: application retries
bytes == 0: end of file
bytes < 0: error => errno from <linux/errno.h>

In case data is not present, read() should typically block.
Example implementation p. 67.
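
From the application's side, the interpretation of read()'s return value translates into a retry loop. The helper below, demo_read_full(), is a hypothetical name for a common user-space idiom; retrying on EINTR is a choice made for this sketch, not something the slides prescribe.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Keep calling read() until count bytes have arrived, end of file is
 * reached, or an error occurs. Returns the number of bytes read, or -1
 * with errno set by read().
 */
static ssize_t demo_read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);

        if (n > 0)
            done += (size_t)n;      /* partial transfer: retry for the rest */
        else if (n == 0)
            break;                  /* end of file */
        else if (errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        else
            return -1;              /* real error, errno says which */
    }
    return (ssize_t)done;
}
```

The same loop works unchanged on a device node, a pipe or a regular file, since all of them follow the return-value convention listed above.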

10. Write

Interpretation of return value by user-space app:

bytes == count: transfer success
count > bytes > 0: application retries
bytes == 0: no error, therefore retries
bytes < 0: error => errno from <linux/errno.h>

Example implementation p. 68-69.

11. readv/writev

Vector operations
Vector entries contain: buffer pointer + length value.
If missing, read/write are called multiple times by the kernel.
If useful, always best to have real readv/writev.
Easiest: have a loop in the driver calling read/write.
If implemented, should be more fancy, like having command reordering.
Not typically implemented by drivers.

12. ioctl

Background:

Must be able to control the device with more than just read/write.
User-space prototype:
    int ioctl(int fd, unsigned long cmd, ...);
Not actually a variable number of arguments; instead, the 3rd arg is:
    char *argp
The "..." avoids type checking at build time.
argp can be used to pass all sorts of things.
ioctl() is often considered a nuisance by kernel developers because it opens undocumented/unauditable backdoors to drivers.
Driver prototype:
    int (*ioctl)(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);
    inode and filp are the same as passed to open()
    cmd is passed as-is from user space
    arg is passed as "unsigned long" regardless of its type
Usually implemented as a huge "switch (cmd)".
Each "cmd" is interpreted and acted on differently.

Choosing the ioctl commands:

Commands must be defined in headers, and shared with user-space apps so that they know what commands to invoke for a given action.
Numbers should be unique so that a command mistakenly sent to the wrong driver does not cause damage.
Use of "magic numbers"
To help manage these numbers, the command codes have been split up into several bitfields.
include/asm/ioctl.h defines the separate bitfields.
Documentation/ioctl-number.txt lists the already allocated "magic" numbers.

Bitfields:

Type (magic number), unique to driver: 8 bits
Ordinal number, unique command "id" in driver: 8 bits
Direction of transfer: bitmask
    _IOC_NONE: no transfer
    _IOC_READ: read from device
    _IOC_WRITE: write to device
    _IOC_READ | _IOC_WRITE: write to and read from device
Size of argument: usually 13 or 14 bits, but actual width depends on arch.
    Not mandatory, but recommended
    If need larger data structs, can ignore this field

Macros for defining command numbers:

<asm/ioctl.h>

_IO(type, nr)
_IOR(type, nr, datatype)
_IOW(type, nr, datatype)
_IOWR(type, nr, datatype)
type and nr (number) are explicit; size is obtained using sizeof(datatype)

Example:

#define MY_IOC_MAGIC          251
#define MY_IOCFIRSTCOMMAND    _IO(MY_IOC_MAGIC, 0)
#define MY_IOCSECONDCOMMAND   _IO(MY_IOC_MAGIC, 1)
#define MY_IOCREAD            _IOR(MY_IOC_MAGIC, 2, int)
#define MY_IOCWRITE           _IOW(MY_IOC_MAGIC, 3, int)
#define MY_IOCREADWRITE       _IOWR(MY_IOC_MAGIC, 4, int)
#define MY_IOC_MAXNR          10

Actual command definitions are not interpreted by the kernel.
Could, therefore, define entirely different command number ranges.

The return value

Usually depends on the actual outcome of the switch() statement.
By default, most kernel functions return -EINVAL.
According to POSIX, should return -ENOTTY, which would be interpreted by the C library as meaning "inappropriate ioctl for device".
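
The bitfield layout and the -ENOTTY convention can be checked from user space with the decoding macros that <linux/ioctl.h> exports (_IOC_TYPE, _IOC_NR, _IOC_DIR, _IOC_SIZE). The sketch below assumes Linux kernel headers are installed; demo_ioctl_check() and demo_ioctl_arg_size() are hypothetical helpers, and the command names reuse a subset of the example definitions above.

```c
#include <assert.h>
#include <errno.h>
#include <linux/ioctl.h>

#define MY_IOC_MAGIC        251
#define MY_IOCFIRSTCOMMAND  _IO(MY_IOC_MAGIC, 0)
#define MY_IOCREAD          _IOR(MY_IOC_MAGIC, 2, int)
#define MY_IOCWRITE         _IOW(MY_IOC_MAGIC, 3, int)
#define MY_IOC_MAXNR        10

/*
 * Validation a driver's ioctl method might run before acting on cmd:
 * wrong magic, out-of-range ordinal or unknown command gives -ENOTTY.
 */
static int demo_ioctl_check(unsigned int cmd)
{
    if (_IOC_TYPE(cmd) != MY_IOC_MAGIC)
        return -ENOTTY;
    if (_IOC_NR(cmd) > MY_IOC_MAXNR)
        return -ENOTTY;

    switch (cmd) {
    case MY_IOCFIRSTCOMMAND:
    case MY_IOCREAD:
    case MY_IOCWRITE:
        return 0;            /* a real driver would act on the command here */
    default:
        return -ENOTTY;
    }
}

/* Number of argument bytes a command transfers (0 for _IO commands). */
static unsigned int demo_ioctl_arg_size(unsigned int cmd)
{
    assert(_IOC_TYPE(cmd) == MY_IOC_MAGIC);  /* only our own commands */
    return _IOC_DIR(cmd) == _IOC_NONE ? 0 : _IOC_SIZE(cmd);
}
```

Checking the type and ordinal fields before the switch is a cheap way to reject commands meant for another driver without touching the device.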

The predefined commands

Though ioctl commands are typically not interpreted by the kernel, some are.
Interpreted commands are recognized prior to the invocation of your driver's ioctl().
If the driver defines the same numbers, the driver will not see the commands.
3 types of predefined commands, according to what they can be issued on:
    Any file (the only one relevant to device drivers, magic nbr "T")
    Regular files
    Filesystem-specific files

Commands specified for any file, including drivers:

FIOCLEX:
    File descriptor closed on execution of new program
FIONCLEX:
    Undo FIOCLEX
FIOASYNC:
    Set or reset async notification on file (unused)
FIOQSIZE:
    Get size of file or directory. ENOTTY on devices
FIONBIO:
    Modify the O_NONBLOCK flag in filp->f_flags. The last ioctl() argument indicates whether to enable or disable blocking.

Using the ioctl argument:

If not a pointer, use as-is.
If a pointer, must make sure the address is valid:
    Can use copy_to_user() and copy_from_user()
    Since ioctl() often transfers small data items, there are other functions that can be used as well.

Verify if region is valid user space:

int access_ok(int type, const void *addr, unsigned long size);
    type is either VERIFY_READ (from user space) or VERIFY_WRITE (to user space)
    addr is user-space address
    size is the size of the region
    retval is 1 on success, 0 on failure
    Should return -EFAULT on failure
    access_ok() checks to make sure the address is not in kernel space

Transfer macros:

<asm/uaccess.h>
put_user(datum, ptr) and __put_user(datum, ptr)
    Writes datum to user space
    Size of transfer determined using ptr type
    __put_user assumes access_ok() has already been called
    retval is zero on success
get_user(local, ptr) and __get_user(local, ptr)
    Read datum from user space into "local"
    retval is zero on success

Capabilities and restricted operations

Typically access is controlled using file permissions.
Regardless of actual file permissions, some operations should not be allowed to all users.
In addition to simple user permissions, Linux provides "capabilities".
Capabilities enable selectively setting privileges with finer granularity than just using root/non-root.
Capabilities == permission management
Capabilities are controlled using sys_capget() and sys_capset().
Available capabilities are defined in <linux/capability.h>.
No additional capabilities can be defined.
Capabilities relevant to drivers:

CAP_DAC_OVERRIDE:
    Override access restrictions on files and directories
CAP_NET_ADMIN:
    Administer networking
CAP_SYS_MODULE:
    Load/unload modules
CAP_SYS_RAWIO:
    Perform raw I/O
CAP_SYS_ADMIN:
    "Administer" system (many things)
CAP_SYS_TTY_CONFIG:
    TTY configuration

Drivers should check for capabilities prior to carrying out privileged operations.

Capability checking:

<linux/sched.h>
int capable(int capability);
    Returns 1 if capable, 0 otherwise
    Driver could return -EPERM on failure

Device control without ioctl:

Can use "control sequences" written to device, like escape sequences on console.
Control sequences require parsing input to device, and driver must make sure that such sequences are not written to actual device.
Should be used if device doesn't actually transfer any data, but just acts on commands.
If ASCII commands, can even use "cat" to send commands to device, which avoids having to implement an ioctl() and a custom app to talk to driver.

13. Blocking I/O

What's blocking I/O:

What if the driver can't satisfy a read() or write()?
Driver must be able to put the calling process to sleep

Introduction to sleeping:

Putting process in special mode and taking it off the scheduler's queue.
Rules:
    Never sleep in a critical section
    Never sleep with interrupts disabled
    On wakeup, no time or event context => recheck sleep condition.
    Never sleep if waking procedure unknown/unsure

Introduction to wait queues:

A "wait queue" contains a list of sleeping processes
Process usually woken up as a result of a hardware interrupt.
copy_to/from_user can sleep => this type of sleeping is ok.
Managed using a "wait queue head"
wait_queue_head_t: <linux/wait.h>
Static initialization:
    DECLARE_WAIT_QUEUE_HEAD(name);
Dynamic initialization:
    void init_waitqueue_head(wait_queue_head_t *queue);

Simple sleeping

Basic sleeping macros take 2 or 3 arguments:
    queue, the wait_queue_head_t, not a pointer
    condition, a boolean expression evaluated by the macro before and after sleeping. Could be evaluated an arbitrary number of times.
    timeout, for the macros where a timeout in jiffies can be specified

Sleep macros:

wait_event(queue, condition)
    Put process in uninterruptible mode (not recommended)
wait_event_interruptible(queue, condition)
    Put process in interruptible state (may receive signals)
    If retval != 0, sleep was interrupted, return -ERESTARTSYS
wait_event_timeout(queue, condition, timeout)
    Wait for a limited amount of time
    Returns 0 once the timeout has expired, regardless of condition
wait_event_interruptible_timeout(queue, condition, timeout)
    Same as wait_event_timeout(), but interruptible

Wake-up functions, the basics:

void wake_up(wait_queue_head_t *queue);
    Wake up all processes sleeping in queue.
void wake_up_interruptible(wait_queue_head_t *queue);
    Wake up only processes in state interruptible.

Blocking and nonblocking operations

When should a process be put to sleep?
Must match Unix semantics
Must check for O_NONBLOCK in filp->f_flags
Proper semantics:
    Assuming there are driver-implemented buffers for read and write, which is the case for most drivers. Input buffer avoids losing data when no one is reading. Output buffer avoids leaving data in user space while device is not ready.
    On read, block if not enough data, wake up when data is available even if incomplete.
    On write, block if not enough space in buffer, wake up when space is available in buffer (possibly due to a previous write completing) even if space is insufficient.
    Use separate wait queues for the two blocking operations, one for read and one for write.
    If O_NONBLOCK, return -EAGAIN on failure.
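The read-side semantics can be observed from user space. This is a minimal sketch, assuming a Linux system; a pipe stands in for a driver with an input buffer, and the names demo_try_read and demo_nonblock_semantics are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch fd to non-blocking mode and try to read up to count bytes.
 * Returns the number of bytes read (possibly fewer than count),
 * 0 on end of file, or a negative errno value such as -EAGAIN. */
long demo_try_read(int fd, char *buf, size_t count)
{
    int flags = fcntl(fd, F_GETFL);
    ssize_t n;

    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -errno;
    n = read(fd, buf, count);
    return n < 0 ? -errno : (long)n;
}

/* Usage sketch: an empty pipe gives -EAGAIN, partial data is returned as is.
 * Returns 0 if both observations hold. */
int demo_nonblock_semantics(void)
{
    int fds[2];
    char buf[16];
    int ok;

    if (pipe(fds) != 0)
        return -1;
    ok = demo_try_read(fds[0], buf, sizeof(buf)) == -EAGAIN &&
         write(fds[1], "abc", 3) == 3 &&
         demo_try_read(fds[0], buf, sizeof(buf)) == 3;
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

The second read asks for 16 bytes and gets 3: data is handed over as soon as some is available, even if incomplete.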

Advanced sleeping:

How a process sleeps:

3 steps:
    Allocate, initialize and queue a wait queue entry
    Set task state
    Call scheduler
wait_queue_head_t is defined in <linux/wait.h>
    Contains linked list and spinlock
Wait queue entry is a queue entry of type wait_queue_t
    struct __wait_queue {
        unsigned int flags;
    #define WQ_FLAG_EXCLUSIVE 0x01
        struct task_struct *task;
        wait_queue_func_t func;
        struct list_head task_list;
    };
This structure is used to know which processes to wake up and how.
In addition to allocating and initializing this struct properly, must also set the process' state.
Relevant process states:
    TASK_RUNNING: Task is able to run, but not necessarily running
    TASK_INTERRUPTIBLE: Task is sleeping, but may receive signals
    TASK_UNINTERRUPTIBLE: Task is sleeping and can't receive signals
To modify process state:
    void set_current_state(int new_state);
    Don't manipulate process struct directly
Prior to calling scheduler, must check if condition waited on became true. Otherwise, there could be a race condition if you are woken up exactly as you are setting the process to go to sleep because what you were waiting for occurred:
    if (!condition)
        schedule();
If schedule() called, process will return in TASK_RUNNING.
If not, must reset process state to TASK_RUNNING manually.
Either way, must remove task from wait queue to avoid being woken up again.
If sleep condition satisfied between if() and schedule(), then this is ok, process is returned to "TASK_RUNNING" and schedule() will return, though not necessarily right away.
Once done, must check if code needs to sleep again because the sleep condition isn't fulfilled.
Can do previously described procedure manually, or better use helper functions.

Manual sleeps

Static wait queue entry declaration:
    DEFINE_WAIT(my_wait);
Dynamic wait queue initialization:
    void init_wait(wait_queue_t *wait);
Add process to wait queue and set state:
    void prepare_to_wait(wait_queue_head_t *queue, wait_queue_t *wait, int state);
Can now schedule() after checking if sleep condition was fulfilled.
Cleaning up taken care of by:
    void finish_wait(wait_queue_head_t *queue, wait_queue_t *wait);
Make sure don't need to sleep again because sleep condition isn't fulfilled.
Check if wakeup due to signal:
    int signal_pending(struct task_struct *p)
    Return -ERESTARTSYS in such a case.

Exclusive waits

A wakeup event usually wakes up all processes waiting in wait queue.
All processes must then recheck wait condition and sleep some more if need be.
This is fine if there aren't many processes waiting.
Otherwise: "thundering herd" (Apache)
Fix: WQ_FLAG_EXCLUSIVE:
    Processes marked with this flag are placed at the end of the wait queue
    Processes without the flag are placed at the beginning of the wait queue
    Kernel stops after waking up the first process marked with the flag
    IOW, all processes not marked (if any) wake up, and only one marked process wakes up.
When to use:
    Serious resources contention
    Waking up single process results in proper consumption
Must use manual wait helper functions for exclusive waits:
    void prepare_to_wait_exclusive(wait_queue_head_t *queue, wait_queue_t *wait, int state);
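The wake-up order described for WQ_FLAG_EXCLUSIVE can be modeled in plain C. This is a user-space sketch of the queue walk only, not kernel code; struct demo_waiter, DEMO_FLAG_EXCLUSIVE and demo_wake_up are hypothetical names, and the array is assumed to already hold non-exclusive waiters first and exclusive waiters last.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FLAG_EXCLUSIVE 0x01   /* stands in for WQ_FLAG_EXCLUSIVE */

struct demo_waiter {
    unsigned int flags;
    int awake;                     /* set to 1 once the waiter is woken */
};

/* Walk the queue from head to tail and wake every entry, stopping once
 * nr_exclusive exclusive waiters have been woken (0 means no limit).
 * Returns the number of waiters woken. */
int demo_wake_up(struct demo_waiter *q, size_t n, int nr_exclusive)
{
    int woken = 0;
    size_t i;

    assert(q != NULL || n == 0);
    for (i = 0; i < n; i++) {
        q[i].awake = 1;
        woken++;
        if ((q[i].flags & DEMO_FLAG_EXCLUSIVE) &&
            nr_exclusive > 0 && --nr_exclusive == 0)
            break;
    }
    return woken;
}
```

With two unmarked and three marked waiters, a limit of 1 wakes the two unmarked ones plus the first marked one; the other two keep sleeping.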

The details of waking up:

Wake-up behavior is controlled by the actual wake-up function.
Default function is default_wake_function()
All drivers should use the default

Full wake-up functions:

void wake_up(wait_queue_head_t *queue);
    Already covered
void wake_up_interruptible(wait_queue_head_t *queue);
    Already covered
void wake_up_nr(wait_queue_head_t *queue, int nr);
    Wake up nr exclusive waiters. If 0, wake all up.
void wake_up_interruptible_nr(wait_queue_head_t *queue, int nr);
    Same
void wake_up_all(wait_queue_head_t *queue);
    Wakes up all processes, whether exclusive or not.
void wake_up_interruptible_all(wait_queue_head_t *queue);
    Same
void wake_up_interruptible_sync(wait_queue_head_t *queue);
    Ensure that the process woken up doesn't get to run prior to this function having returned.

Most drivers should just use wake_up_interruptible()

14. poll and select

Introduction to poll and select:

Nonblocking I/O applications use poll, select, and epoll to determine if data is ready for consumption.
These calls can be used to check entire sets of file descriptors for whether one of them is ready for read/write or wait for one of them to be ready for read/write.
poll and select essentially equivalent, implemented by 2 separate Unix teams in parallel.
epoll is Linux-specific for handling thousands of file descriptors.
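On the application side, a poll() call looks roughly as follows. This is a minimal sketch, assuming a Linux system; a pipe stands in for a device file and the names demo_wait_readable and demo_poll_on_pipe are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

/* Returns 1 if POLLIN was reported for fd within timeout_ms milliseconds,
 * 0 if it was not, -1 on error. A timeout of 0 only checks, never sleeps. */
int demo_wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Usage sketch: not readable while empty, readable after a write.
 * Returns 0 if both observations hold. */
int demo_poll_on_pipe(void)
{
    int fds[2];
    int ok;

    if (pipe(fds) != 0)
        return -1;
    ok = demo_wait_readable(fds[0], 0) == 0 &&
         write(fds[1], "x", 1) == 1 &&
         demo_wait_readable(fds[0], 0) == 1;
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```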

In-driver callback for poll, select and epoll:

unsigned int (*poll)(struct file *filp, poll_table *wait);

Driver should do:
    Tell kernel which wait queues the process could wait on if it tried to read or write:
        void poll_wait(struct file *, wait_queue_head_t *, poll_table *);
    Return bit mask of operations that could be performed without blocking.

What the kernel does:
    Allocate poll_table to record all wait queues process could wait on.
    For every file descriptor set in user space for poll(), call its kernel-side callback, which could be your driver, another driver or a filesystem, and provide it with the allocated poll_table.
    Depending on the bitmask returned by the various callbacks invoked by the kernel:
        If none indicates that I/O can occur without blocking, the user-space poll will sleep until one of the wait queues it places the calling process on wakes that process up.
        If one of them has something, report it to user space immediately.
    On wakeup, remove process from all wait queues and free allocated poll_table.
    Pass poll_table as NULL to callbacks:
        If user space set poll timeout to 0
        If one of the callbacks for any of the FDs indicated nonblocking I/O is possible.
    For epoll:
        Avoid allocation/deallocation on every call, since this is expensive with large numbers of FDs.
        Preallocate poll_table for all activities
        Release poll_table once done for real

Flags recognized as part of bitmask:

<linux/poll.h>
POLLIN:
    Can read without blocking
POLLRDNORM:
    "normal" data available for read. Usually: POLLIN | POLLRDNORM.
POLLRDBAND:
    "out-of-band" data available. Typically unused. Sockets only.
POLLPRI:
    "high-priority" data (out-of-band) available. Reported as exception by select.
POLLHUP:
    End of file. Select told device is readable.
POLLERR:
    Error on device. On poll, device is readable and writable
POLLOUT:
    Can write without blocking
POLLWRNORM:
    Same as POLLOUT. Usually: POLLOUT | POLLWRNORM
POLLWRBAND:
    "out-of-band" data can be written to device. Sockets only.

Interaction with read and write:

Reading data from the device:
    If data available and read(), return what is available
    If data available and poll(), return POLLIN | POLLRDNORM.
    If no data and read(), wait until any data is there
    If no data, read() and O_NONBLOCK, return -EAGAIN
    If no data and poll(), return empty mask for read
    If end of file and read(), return 0 immediately
    If end of file and poll(), return POLLHUP

Writing data to the device:
    If space available and write(), write whatever can be
    If space available and poll(), return POLLOUT | POLLWRNORM.
    If no space and write(), wait until space is available
    If no space, write() and O_NONBLOCK, return -EAGAIN.
    If no space and poll(), return empty mask for write
    If device full and write(), return -ENOSPC
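The poll() rows of the rules above boil down to building a bitmask from the device state. This is a user-space model of that step only, not a driver callback; struct demo_state and demo_poll_mask are hypothetical names, and the flag values come from the user-space <poll.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical snapshot of what a driver knows when poll() is called. */
struct demo_state {
    int data_avail;    /* input buffer holds at least one byte */
    int space_avail;   /* output buffer has room for at least one byte */
    int eof;           /* device reached end of file */
};

/* Build the mask a poll callback would return for this state. */
unsigned int demo_poll_mask(const struct demo_state *s)
{
    unsigned int mask = 0;

    assert(s != NULL);
    if (s->data_avail)
        mask |= POLLIN | POLLRDNORM;    /* readable without blocking */
    if (s->space_avail)
        mask |= POLLOUT | POLLWRNORM;   /* writable without blocking */
    if (s->eof)
        mask |= POLLHUP;                /* end of file */
    return mask;
}
```

In a real callback, the poll_wait() calls for the read and write queues would come before the mask is computed.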

15. fsync

Flushing pending output
write() calls should never wait until data is transmitted, with or without O_NONBLOCK.
Program requiring wait should use fsync().
If any program requires fsync(), driver should provide it.
Prototype:
    int (*fsync)(struct file *file, struct dentry *dentry, int datasync);
    datasync for filesystems only.
fsync() doesn't return until all data is flushed to device.
Usually not defined for drivers, except for block drivers.

16. fasync

Asynchronous notification
Needed for apps that don't want to routinely poll() or select().
Instead, application notifies kernel that it wants to receive a SIGIO when something becomes ready for it.
If application has set async notification for more than one fd, it must then use poll() or select() to determine which one became available.

Setting up async notification in user space:
    signal(SIGIO, my_sig_handler);            /* Set sig-handler */
    fcntl(my_fd, F_SETOWN, getpid());         /* Set fd "owner" */
    c_flags = fcntl(my_fd, F_GETFL);          /* Get current flags */
    fcntl(my_fd, F_SETFL, c_flags | FASYNC);  /* Set async notification */

Driver's involvement:
    fasync() callback invoked on fcntl(...FASYNC)
    Should send SIGIO when data is available

fasync() callback responsibility:
    int fasync_helper(int fd, struct file *filp, int mode, struct fasync_struct **fa);
    First 3 arguments taken from parameters of fasync() callback.
    Last argument is driver's own pointer to a struct fasync_struct *.

Upon data availability, and if async readers:
    void kill_fasync(struct fasync_struct **fa, int sig, int band);
    First param same as for fasync_helper
    sig is typically SIGIO
    band is POLL_IN (equivalent to POLLIN | POLLRDNORM) if read available and POLL_OUT (equivalent to POLLOUT | POLLWRNORM) if write.

On close should call internal fasync() callback to remove file from list of async listeners:
    my_fasync(-1, filp, 0);

17. llseek

Seeking a device
llseek() callback serves uspace lseek() and llseek().
If missing, kernel performs operations on filp->f_pos.
As stated before, should provide llseek
Must cooperate with read/write
Must maintain internal counters for seeking
If device non-seekable, inform kernel on open():
    int nonseekable_open(struct inode *inode, struct file *filp);
    Set llseek() callback in file ops to no_llseek (<linux/fs.h>)
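For a seekable device, the position bookkeeping done by an llseek() callback can be modeled in user space as follows. This is a sketch only; struct demo_file and demo_llseek are hypothetical, and "size" is an assumed device size used for SEEK_END.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical stand-in for the fields an llseek() callback works with. */
struct demo_file {
    long long f_pos;   /* current position, shared with read/write */
    long long size;    /* assumed device size in bytes */
};

/* Compute and store the new position. Returns the new position,
 * or -EINVAL for an unknown whence or a negative result. */
long long demo_llseek(struct demo_file *f, long long off, int whence)
{
    long long newpos;

    assert(f != NULL);
    switch (whence) {
    case SEEK_SET: newpos = off;            break;
    case SEEK_CUR: newpos = f->f_pos + off; break;
    case SEEK_END: newpos = f->size + off;  break;
    default:       return -EINVAL;
    }
    if (newpos < 0)
        return -EINVAL;
    f->f_pos = newpos;   /* read/write must continue from here */
    return newpos;
}
```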

18. Access control on a device file

Single-open devices:

Only one process can open device
Avoid when possible
Can use atomic_t val:
    Initialize with value of 1
    Use atomic_dec_and_test() on open and fail if the resulting value is not 0 (atomic_dec_and_test() returns true only when the result is 0). If fail, use atomic_inc() to erase effect.
    Use atomic_inc() on close.
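The counter protocol above can be tried out in user space with C11 atomics. This is an analogue, not kernel code; demo_dec_and_test mimics the contract of atomic_dec_and_test() (true only when the result is 0), and demo_open/demo_release are hypothetical stand-ins for the driver's open and release callbacks.

```c
#include <errno.h>
#include <stdatomic.h>

static atomic_int demo_available = 1;   /* 1 = device free, 0 = device taken */

/* Decrement and report whether the result is 0. */
static int demo_dec_and_test(atomic_int *v)
{
    return atomic_fetch_sub(v, 1) - 1 == 0;
}

/* Returns 0 for the first opener, -EBUSY for anybody else. */
int demo_open(void)
{
    if (!demo_dec_and_test(&demo_available)) {
        atomic_fetch_add(&demo_available, 1);   /* erase our decrement */
        return -EBUSY;
    }
    return 0;
}

/* Make the device available again. */
void demo_release(void)
{
    atomic_fetch_add(&demo_available, 1);
}
```

No lock is needed: the decrement and the test happen as one atomic step, so two racing openers cannot both see 0.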

Restricting access to a single user at a time

Allows user to open device multiple times
Must maintain:
    UID of the owner
    Use count
Use critical section (spinlock) in open to:
    On first open (use count zero), record current->uid
    On other opens, check current->uid, current->euid, and capable(CAP_DAC_OVERRIDE) to see if access should be granted.
    Increment use count on all successful opens
Use critical section (spinlock) in close to:
    Decrement use count

Blocking I/O as an alternative to EBUSY

Usually unavailable device should return -EBUSY
Sometimes, it's better to just wait until the open can succeed, depending on what the expected user experience is.
On open, put process on wait queue until device is available.
On release, wake up waiting processes

Cloning the device on open

Each separate process doing an open creates a new device which that process now deals with independently of other processes.
Possible only for non-hardware devices
tty uses something similar to obtain virtual ttys
Can set policy for new device on "new user", "new terminal", etc.

Block drivers
1. Registration
2. The block device operations
3. Request processing
4. Some other details

1. Registration

Block driver registration:

<linux/fs.h>
int register_blkdev(unsigned int major, const char *name);
    "major" = 0 means dynamic allocation
    Call to this function is optional
    Does the following:
        Allocate major number if need be
        Create entry in /proc/devices
int unregister_blkdev(unsigned int major, const char *name);

Disk registration:

Block device operations:

Equivalent to char dev's struct file_operations
struct block_device_operations: <linux/fs.h>
int (*open)(struct inode *inode, struct file *filp);
    Called on device open
int (*release)(struct inode *inode, struct file *filp);
    Called on device close
int (*ioctl)(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);
    ioctl
    Block layer recognizes a fair number of ioctl messages prior to them reaching the device.
    Most likely driver doesn't need to implement this
int (*media_changed)(struct gendisk *gd);
    Ask driver if removable storage media has changed
int (*revalidate_disk)(struct gendisk *gd);
    If media changed, this is invoked to set up new media
    After this is called, kernel rereads partition table
struct module *owner;
    THIS_MODULE
No read/write operations, use of block I/O requests

The gendisk structure:

Individual disk device representation in kernel
struct gendisk: <linux/genhd.h>
int major; int first_minor; int minors;
    Major number, first minor number and number of "disks"
char disk_name[32];
    Disk name as seen in sysfs and /proc/partitions
struct block_device_operations *fops;
    Struct described above
struct request_queue *queue;
    Request queue (will see shortly)
int flags;
    Not often used. See LDD3
sector_t capacity;
    Not modified by driver directly
    Number of 512-byte sectors on disk
void *private_data;
    Driver internal data
struct gendisk contains a kobject

Allocating disks:
    struct gendisk *alloc_disk(int minors);

Freeing disks:
    void del_gendisk(struct gendisk *gd);

Adding allocated disk to system:
1. Properly initialize allocated struct gendisk
2. void add_disk(struct gendisk *gd);
    Should not be called until driver is fully functional as it will result in calls to your driver's callbacks.
    IOW, a block I/O request callback must have been registered prior to add_disk() being invoked.

A note on sector sizes:

Kernel sees device as flat array of sectors
Kernel considers every sector to have 512 bytes
Can nevertheless override default by modifying parameter in request queue:
    blk_queue_hardsect_size(dev->queue, hardsect_size);
Must continue translating sector number nonetheless:
    set_capacity(dev->gd, nsectors * (hardsect_size / 512));
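The translation is plain arithmetic between 512-byte kernel sectors and hardware sectors. This is a stand-alone sketch; both function names and the DEMO_KERNEL_SECTOR_SIZE constant are hypothetical, and the hardware sector size is assumed to be a multiple of 512.

```c
#include <assert.h>

#define DEMO_KERNEL_SECTOR_SIZE 512u   /* sector size the kernel always assumes */

/* Capacity to report to the kernel, in 512-byte sectors, for a device with
 * nsectors hardware sectors of hardsect_size bytes each. */
unsigned long long demo_kernel_capacity(unsigned long long nsectors,
                                        unsigned int hardsect_size)
{
    assert(hardsect_size >= DEMO_KERNEL_SECTOR_SIZE);
    assert(hardsect_size % DEMO_KERNEL_SECTOR_SIZE == 0);
    return nsectors * (hardsect_size / DEMO_KERNEL_SECTOR_SIZE);
}

/* Hardware sector that contains a given 512-byte kernel sector. */
unsigned long long demo_hw_sector(unsigned long long kernel_sector,
                                  unsigned int hardsect_size)
{
    assert(hardsect_size >= DEMO_KERNEL_SECTOR_SIZE);
    assert(hardsect_size % DEMO_KERNEL_SECTOR_SIZE == 0);
    return kernel_sector / (hardsect_size / DEMO_KERNEL_SECTOR_SIZE);
}
```

For example, 1024 hardware sectors of 2048 bytes are reported as 4096 kernel sectors, and kernel sector 8 lives in hardware sector 2.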

2. The block device operations

The open and release methods:

int (*open)(struct inode *inode, struct file *filp);
    inode->i_bdev->bd_disk contains appropriate struct gendisk pointer.
    Tasks:
        Set up DMA channels
        Start disk spinning etc.
    When is this called:
        uspace filesystem format
        uspace partitioning
        uspace filesystem check
        kspace filesystem mounting
    No way for driver to know difference between uspace and kspace open.
int (*release)(struct inode *inode, struct file *filp);
    Reverse open()

The ioctl method:

May use this to provide uspace with disk geometry info.
Not useful to kernel (since it considers disk to be array of sectors).
Useful to fdisk

3. Request processing

Introduction to the request method:

Core of block I/O
Kernel provides many levels of abstraction to allow both simple drivers and drivers geared towards performance.
void request(request_queue_t *queue);
    Called by kernel for every request
    Typically starts the request processing
Every block driver must have a request queue
Request function "consumes" requests from driver's queue.
Request function called in atomic context (use of lock provided by driver).
No requests queued while request function running
Possible optimization: drop lock in request fct and reacquire prior to return.
Call to request fct asynchronous to uspace request
Requests queued up by kernel provide all the context required to be properly processed.
Requests are stored in struct request

A simple request method:

Request queue traversal is done using elv_next_request():
    Returns NULL when no more requests
Requests must be explicitly removed from queue using:
    blkdev_dequeue_request(struct request *req);
Once request is fully processed, must tell kernel that it is so:
    void end_request(struct request *req, int succeeded);
Some requests are not for actual transfers, but for device commands.
    Use blk_fs_request() to check whether a request is an actual transfer.
Fields in request:
    sector_t sector;
        Beginning sector for request
    unsigned long nr_sectors;
        Number of sectors to transfer
    char *buffer;
        Kernel virtual address of buffer to be transferred
To figure out transfer direction, use macro:
    rq_data_dir(struct request *req);
Request queues are ordered for faster access
Driver request handler should take advantage of this. More details below.

Request queues:

Basics:
    struct request_queue or request_queue_t: <linux/blkdev.h>
    Stores requests
    Is used by kernel during the creation of requests. Stores:
        Maximum allowable size
        Nbr of independent segments part of request
        Hardware sector size
        Alignment requirements
        etc.
    Properly configured queues should present only valid requests.
    Request queues allow pluggable I/O schedulers

Typical I/O scheduler:
    Store requests
    Sort requests
    Reorder for best performance
    Merge adjacent requests

Available schedulers:
    Deadline scheduler: service request in less than X
    Anticipatory scheduler: stall device in expectation of repeat read

Queue creation and deletion:

request_queue_t *blk_init_queue(request_fn_proc *request, spinlock_t *lock);
    "request": Request callback
    "lock": Lock to be held while servicing queue
    Should be called during driver initialization
    Set ->queuedata to private data
void blk_cleanup_queue(request_queue_t *req_queue);
    No more requests queued after this call
    Should be called during driver finalization

Queuing functions:

struct request *elv_next_request(request_queue_t *queue);
    Return next available request in queue
    Returns NULL if no more requests
    Request is marked as "active"
    Request not dequeued
void blkdev_dequeue_request(struct request *req);
    Remove request from queue
void elv_requeue_request(request_queue_t *queue, struct request *req);
    Requeue request

Queue control functions:

Control how queue operates
void blk_stop_queue(request_queue_t *queue);
    Suspend queue
void blk_start_queue(request_queue_t *queue);
    Resume queue
void blk_queue_bounce_limit(request_queue_t *queue, u64 dma_addr);
    Set highest physical address for DMA
    If request comes in from higher address, kernel will use bounce buffer.
void blk_queue_max_sectors(request_queue_t *queue, unsigned short max);
    Maximum number of sectors per request
void blk_queue_max_phys_segments(request_queue_t *queue, unsigned short max);
    Maximum number of noncontiguous memory ranges handled by driver per request.
void blk_queue_max_hw_segments(request_queue_t *queue, unsigned short max);
    Maximum number of noncontiguous memory ranges handled by device per request.
void blk_queue_max_segment_size(request_queue_t *queue, unsigned int max);
    Maximum size for segments in a request
void blk_queue_segment_boundary(request_queue_t *queue, unsigned long mask);
    Set maximum memory boundary for requests serviceable by device.
void blk_queue_dma_alignment(request_queue_t *queue, int mask);
    DMA alignment constraints
    Requests will match size and alignment
void blk_queue_hardsect_size(request_queue_t *queue, unsigned short max);
    Specify sector size other than default 512 bytes
    Kernel will continue to operate on 512-byte assumption however

The anatomy of a request:

Basics:
    Requests are made up of a set of segments scattered in memory.
    Kernel may combine adjacent requests, but never joins reads and writes.
    Requests are a linked list of block I/O operations (bio struct)
    Block I/O operations are an array of segments to be transferred

The bio structure:
    Block I/O requests, regardless of their origin, are packaged in bio struct.
    bio is then fed to block I/O subsystem which may combine it with other bios to form a request.
    struct bio: <linux/bio.h>
    sector_t bi_sector;
        First sector to be transferred
    unsigned int bi_size;
        Bytes to be transferred
    unsigned long bi_flags;
        Bio operation description
    unsigned short bi_phys_segments;
        Number of physical segments in bio
    unsigned short bi_hw_segments;
        Number of segments seen by hardware after DMA mapping
    struct bio *bi_next;
        Next bio in list
    struct bio_vec *bi_io_vec;
        Array
        Actual description of the region to be transferred

struct bio_vec
    Per-page description
    This is the data that the driver needs to loop through and write out.
    struct bio_vec {
        struct page *bv_page;
        unsigned int bv_len;
        unsigned int bv_offset;
    };
    See fig. 16-1 in LDD3, p. 482

Looping around entries in bio:
    bio_for_each_segment(bvec, bio, segno) {}
    bvec: current bio_vec entry
    segno: segment number
Can use bio_vec struct entries to create DMA mappings
If direct page access needed:
    char *__bio_kmap_atomic(struct bio *bio, int i, enum km_type type);
    void __bio_kunmap_atomic(char *buffer, enum km_type type);

Helper functions:

Operate on buffer to be transferred next within bio
May not be useful if driver wants to walk the bio list before deciding what to transfer.
struct page *bio_page(struct bio *bio);
    Get pointer to next page to transfer
int bio_offset(struct bio *bio);
    Get page offset of request
int bio_cur_sectors(struct bio *bio);
    Get number of sectors to transfer from page
char *bio_data(struct bio *bio);
    Get kernel logical address of buffer to be transferred
    Not highmem (at least, by default, block I/O layer doesn't hand highmem buffers to drivers).
char *bio_kmap_irq(struct bio *bio, unsigned long *flags);
    IRQ-safe mapping of any type of buffer, even from highmem.
void bio_kunmap_irq(char *buffer, unsigned long *flags);
    Undo mapping done with bio_kmap_irq

Request structure fields:

struct request internals
sector_t hard_sector;
    First sector not yet transferred
unsigned long hard_nr_sectors;
    Number of sectors yet to transfer
unsigned int hard_cur_sectors;
    Number of sectors left in current bio
struct bio *bio;
    List of bios
char *buffer;
    Kernel logical address of the current buffer to be transferred
unsigned short nr_phys_segments;
    Number of noncontiguous physical segments in this request
struct list_head queuelist;
    Link to the rest of the request queue
    Cannot be used as is until request has been removed from queue

Barrier requests:

May need to force non-reordering of certain operations because some apps depend on it.
Examples: databases, journaling filesystems
Use of barrier request: request with REQ_HARDBARRIER flag.
Must make sure drive's caching doesn't hide non-committing.
Informing block layer that driver handles barrier requests:
    void blk_queue_ordered(request_queue_t *queue, int flag);
    Nonzero flag means "can handle"
Figuring out whether request contains barrier:
    int blk_barrier_rq(struct request *rq);
    retval nonzero means this is a barrier request

Non-retryable requests:

Block drivers typically retry failed requests
In some cases, the kernel's block layer doesn't want that
First, test if kernel wants retry:
    int blk_noretry_request(struct request *req);
    If retval nonzero, driver should abort

Request completion functions:

Basics:

int end_that_request_first(struct request *req, int success, int count);
    Tell block layer "count" sectors have been transferred since last call.
    Completion must be signaled in order, even if actual transfers happened out of order.
    retval is 0 if all sectors have been transferred and request is done
    If request done, use blkdev_dequeue_request to dequeue request
void end_that_request_last(struct request *req);
    Inform anything waiting on request that request is done

Typically:
    if (!end_that_request_first(req, 1, sectors_xferred)) {
        blkdev_dequeue_request(req);
        end_that_request_last(req);
    }

Working with bios:

Request callback should:
    Use rq_for_each_bio(bio, req) {} to go through bios
    For each bio, use bio_for_each_segment(bvec, bio, segno) {} to go through all segments in bio.
    Map each bio segment
    Transfer bio segment
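The two nested loops can be pictured with simplified user-space structures. This is a model of the traversal only; struct demo_bvec, struct demo_bio and demo_request_bytes are hypothetical and keep just the fields needed to walk a request, with plain for loops standing in for rq_for_each_bio() and bio_for_each_segment().

```c
#include <stddef.h>

/* Simplified stand-in for struct bio_vec: one segment within a page. */
struct demo_bvec {
    unsigned int bv_len;      /* bytes in this segment */
    unsigned int bv_offset;   /* offset of the segment within its page */
};

/* Simplified stand-in for struct bio: an array of segments plus a link. */
struct demo_bio {
    struct demo_bio *bi_next;      /* next bio in the request, or NULL */
    const struct demo_bvec *vec;   /* segment array */
    size_t vcnt;                   /* number of entries in vec */
};

/* Walk every segment of every bio and return the total number of bytes.
 * A driver would map and transfer each segment in the inner loop. */
size_t demo_request_bytes(const struct demo_bio *bio)
{
    size_t total = 0;
    size_t i;

    for (; bio != NULL; bio = bio->bi_next)   /* outer loop: bios */
        for (i = 0; i < bio->vcnt; i++)       /* inner loop: segments */
            total += bio->vec[i].bv_len;
    return total;
}
```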

Block requests and DMA:

Instead of going through bios and setting up DMA for each transfer:
    int blk_rq_map_sg(request_queue_t *queue, struct request *req, struct scatterlist *list);
    "list": Preallocated scatterlist that has as many entries as there are physical segments in the request.
    Scatterlist returned can have dma_map_sg() used on its entries
If no request concatenation wanted:
    clear_bit(QUEUE_FLAG_CLUSTER, &queue->queue_flags);

Doing without a request queue:

Need for driver to do its own sorting/reordering, if any
Solid-state storage devices do not benefit from access reordering.
Same with RAID
Instead of using request callback, use make_request:
    typedef int (make_request_fn)(request_queue_t *q, struct bio *bio);
make_request can:
    Execute transfer right away (loop through "bio" list provided)
    Redirect request to other device
Signaling bio completion directly:
    void bio_endio(struct bio *bio, unsigned int bytes, int error);
    Do not use on regular request-type bios
make_request should return 0, regardless of the transfer status.
If make_request retval nonzero, bio is resubmitted
To use make_request, must manually allocate queue:
    request_queue_t *blk_alloc_queue(int flags);
    Allocated queue not set up to receive actual requests
    "flags": usually GFP_KERNEL
Must also set up make_request callback:
    void blk_queue_make_request(request_queue_t *queue, make_request_fn *func);

4. Some other details

Command pre-preparation:

Get information about pending request prior to it being returned by elv_next_request().
void blk_queue_prep_rq(request_queue_t *queue, prep_rq_fn *func);
By default no pre-preparation involved
typedef int (prep_rq_fn)(request_queue_t *queue, struct request *req);
See LDD3 for full details

Tagged command queueing:

Some hardware supports having multiple requests active in parallel.
If so, tags can be associated with requests so that hardware may tell which one has completed.
See LDD3 for full details on how to set up and use Tagged Command Queuing (TCQ).

Network drivers
1. Basics
2. Connecting to the kernel
3. The net_device structure in detail
4. Opening and closing
5. Packet transmission
6. Packet reception
7. The interrupt handler
8. Receive interrupt mitigation
9. Changes in link state
10. The socket buffers
11. MAC address resolution
12. Custom ioctl commands
13. Statistical information
14. Multicast
15. A few other details

1. Basics

Network devices have no /dev entries
Use of sockets to talk over network
Receives data from outside world
Networking layer supports lots of configurability and stats accumulation.
Linux net subsystem is network and hardware protocol independent.
Network stack makes it very easy to use IP and Ethernet protocols.

2. Connecting to the kernel

Device registration:

Like other types of devices, network drivers must initialize things at startup.
Unlike char and block devices, there are no major or minor numbers to register.
Instead, a network driver must register every interface it detects as part of the system-wide list of recognized interfaces.
struct net_device: <linux/netdevice.h>
struct net_device is a kobject
struct net_device *alloc_netdev(int sizeof_priv, const char *name, void (*setup)(struct net_device *));
    "sizeof_priv": size of private data for driver
    "name": name of device as seen by user space. Should have a "%d" in string so that kernel can replace it with the ID of the interface. Ex.: eth0, eth1, etc.
    "setup": callback to set up rest of struct net_device
Helper:
    struct net_device *alloc_etherdev(int sizeof_priv);
    Allocates "eth%d"
    Provides canned setup function: ether_setup
    Other such helpers for other types of physical interfaces
Once device is allocated *and* initialized:
    int register_netdevice(struct net_device *dev);
    Once this is called, any of the callbacks provided may be invoked.

Initializing each device:

struct net_device cannot be initialized at build time
struct net_device is very complex
Kernel provides default entries for Ethernet
Must set up callbacks, flags, and default values in struct net_device.
Accessing private data part of struct net_device:
    struct snull_priv *priv = netdev_priv(dev);

Module unloading:

int unregister_netdevice(struct net_device *dev);
void free_netdev(struct net_device *dev);

3. The net_device structure in detail

Global information:

char name[IFNAMSIZ];
    Device name
unsigned long state;
    Device state. Use of utility functions to manipulate field
struct net_device *next;
    Next device in global device list
int (*init)(struct net_device *dev);
    Initialization function
    Not used by most modern drivers

Hardware information:

unsigned long rmem_end;
    Receive memory end
unsigned long rmem_start;
    Receive memory start
unsigned long mem_end;
    Transmit memory end
unsigned long mem_start;
    Transmit memory start
unsigned long base_addr;
    I/O base address
    Not used by the kernel
unsigned char irq;
    Device's IRQ number
unsigned char if_port;
    If multiport device, port in use
unsigned char dma;
    DMA channel

Helper functions for setting up interface information:

void ether_setup(struct net_device *dev);
    Setup for Ethernet device
void ltalk_setup(struct net_device *dev);
    Setup for LocalTalk device
void fc_setup(struct net_device *dev);
    Setup for fiber channel device
void fddi_setup(struct net_device *dev);
    Setup for FDDI device
void hippi_setup(struct net_device *dev);
    Setup for HIPPI device
void tr_setup(struct net_device *dev);
    Setup for token ring device

Interface information:

Usually no need to manipulate by hand unless no helper function is available for the type of adapter being used.
unsigned short hard_header_len;
    Hardware header length
unsigned mtu;
    Maximum Transfer Unit (largest packet size)
unsigned long tx_queue_len;
    Maximum number of packets on transmission queue
unsigned short type;
    Hardware interface type
unsigned char addr_len;
    Length of hardware address
unsigned char broadcast[MAX_ADDR_LEN];
    Broadcast address on physical network
unsigned char dev_addr[MAX_ADDR_LEN];
    Actual device's hardware address
unsigned short flags;
    Interface flags bitmask
int features;
    Interface features

Interface flags:

IFF_UP:
    Interface is up
IFF_BROADCAST:
    Interface can broadcast
IFF_DEBUG:
    Print debug info for driver
IFF_LOOPBACK:
    Device is loopback
IFF_POINTOPOINT:
    Device is point-to-point link
IFF_NOARP:
    ARP not supported
IFF_PROMISC:
    Device is promiscuous
IFF_MULTICAST:
    Device can multicast
IFF_ALLMULTI:
    Receive all multicasts
IFF_MASTER, IFF_SLAVE:
    Used by load equalization code
IFF_PORTSEL, IFF_AUTOMEDIA:
    Set if device can switch between media types
    Not used by kernel
IFF_DYNAMIC:
    Address can change
    Not used by kernel
IFF_RUNNING:
    BSD compatibility
    Not used by kernel
IFF_NOTRAILERS:
    BSD compatibility
    Not used by kernel

Features:

Set by driver to tell kernel what device can do
NETIF_F_SG, NETIF_F_FRAGLIST:
    Scatter/gather I/O
NETIF_F_IP_CSUM, NETIF_F_NO_CSUM, NETIF_F_HW_CSUM:
    Control whether kernel has to do checksumming or whether device does it.
NETIF_F_HIGHDMA:
    High-memory DMA capable
NETIF_F_HW_VLAN_TX, NETIF_F_HW_VLAN_RX, NETIF_F_HW_VLAN_FILTER, NETIF_F_VLAN_CHALLENGED:
    Support for 802.1q VLANs
NETIF_F_TSO:
    TCP segmentation offloading

Fundamental device methods:

int (*open)(struct net_device *dev);
    Called when device is ifconfig'ed up
int (*stop)(struct net_device *dev);
    Called when device is ifconfig'ed down
int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);
    Transmit callback
    retval is zero on success
    Nonzero retval will result in kernel retry
int (*hard_header)(struct sk_buff *skb, struct net_device *dev, unsigned short type, void *daddr, void *saddr, unsigned len);
    Hardware header building callback
int (*rebuild_header)(struct sk_buff *skb);
    Callback for rebuilding hardware header after ARP completion.
void (*tx_timeout)(struct net_device *dev);
    Timeout callback
struct net_device_stats *(*get_stats)(struct net_device *dev);
    Statistics callback
int (*set_config)(struct net_device *dev, struct ifmap *map);
    Interface config change callback
    Not typically needed

Optional device methods:

int weight; int (*poll)(struct net_device *dev, int *quota);
    NAPI driver poll callback
void (*poll_controller)(struct net_device *dev);
    Check whether events occurred on interface
int (*do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
    Device-specific ioctl
void (*set_multicast_list)(struct net_device *dev);
    Multicast list change callback
int (*set_mac_address)(struct net_device *dev, void *addr);
    If supported, allow changing MAC address
int (*change_mtu)(struct net_device *dev, int new_mtu);
    Change MTU
int (*header_cache)(struct neighbour *neigh, struct hh_cache *hh);
    Fill "hh" with ARP query result
int (*header_cache_update)(struct hh_cache *hh, struct net_device *dev, unsigned char *haddr);
    Update "hh" cache
int (*hard_header_parse)(struct sk_buff *skb, unsigned char *haddr);
    Parse hardware headers

Utility fields:

Maintained by driver
unsigned long trans_start;
    Time at which last transmit began, in jiffies
    Used by networking subsystem to detect lockup
unsigned long last_rx;
    Last receive in jiffies
    Not currently used
int watchdog_timeo;
    Minimum delay for detecting transmission timeout in jiffies.
void *priv;
    Driver private data
struct dev_mc_list *mc_list; int mc_count;
    Multicast-related fields
spinlock_t xmit_lock; int xmit_lock_owner;
    Locks to avoid concurrent access to driver's hard_start_xmit().
    Should not be modified by driver

4. Opening and closing

All done via ifconfig
Bringing up interface with ifconfig:
    Assign address to interface:
        ioctl(SIOCSIFADDR)
        Socket I/O Control Set Interface Address
        Not seen by driver
        Handled by kernel
    Turn interface on:
        ioctl(SIOCSIFFLAGS)
        Socket I/O Control Set Interface Flags
        Results in IFF_UP being set
        Results in driver's open() callback being invoked

Driver's open should do:
    Retrieve MAC address from device
    Set MAC address in struct net_device (dev->dev_addr).
    Start up device queue:
        void netif_start_queue(struct net_device *dev);

Driver's close should do:
    Stop device queue:
        void netif_stop_queue(struct net_device *dev);

5. Packet transmission

Basics:

    Kernel signals transmission to driver using hard_start_xmit().
    Packets are handed over to driver in the form of a struct sk_buff (skb).
    skb contains everything that is needed for the packet to travel on the network.
    Driver should usually just transmit skb as is.
    When the transmitted packet is shorter than the minimum size supported by the device, zero out the remainder to avoid security leaks.
    hard_start_xmit should free skb after transmission.
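The zero-padding rule for short packets can be sketched as follows. This is a userspace model, not driver code: pad_frame() is a hypothetical helper, and MIN_FRAME_LEN assumes the 60-byte Ethernet minimum (the value Linux calls ETH_ZLEN); other devices may have a different minimum.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MIN_FRAME_LEN 60  /* assumed Ethernet minimum, excluding the FCS */

/* Returns the length to hand to the hardware, or 0 if the buffer
 * cannot hold a minimum-size frame. Bytes between len and the
 * minimum are zeroed so stale memory never leaks onto the wire. */
static size_t pad_frame(unsigned char *buf, size_t len, size_t capacity)
{
    assert(buf != NULL);
    if (len >= MIN_FRAME_LEN)
        return len;               /* long enough, transmit as is */
    if (capacity < MIN_FRAME_LEN)
        return 0;                 /* no room to pad */
    memset(buf + len, 0, MIN_FRAME_LEN - len);
    return MIN_FRAME_LEN;
}
```

A real driver would typically pad into its own transmit buffer rather than assume the skb has spare room.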

Controlling transmission concurrency:

    Calls to hard_start_xmit are serialized using a spinlock.
    Though hard_start_xmit will have initiated a transfer, said transfer will likely not be over at function return.
    Since devices have limited buffers, driver must tell kernel when no more transfers can be accepted for some time.
    Use netif_stop_queue to suspend transmission.

    Notify kernel that more transfers can start occurring again:
        void netif_wake_queue(struct net_device *dev);
        Like netif_start_queue, but notifies networking subsystem too.

    If transmission must be suspended elsewhere than in hard_start_xmit():
        void netif_tx_disable(struct net_device *dev);
        Upon return, hard_start_xmit is guaranteed not to be running on any CPU.

    Restart queue after netif_tx_disable() using netif_wake_queue().
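The stop/wake handshake described above can be modeled in a few lines of userspace C. Everything here is hypothetical (struct tx_model, TX_RING_SIZE, model_xmit(), model_tx_done()); the comments mark where a real driver would call netif_stop_queue() and netif_wake_queue().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_RING_SIZE 4   /* hypothetical number of hardware TX slots */

struct tx_model {
    unsigned int in_flight;  /* packets handed to the device, not yet done */
    bool queue_stopped;      /* state toggled by stop/wake in a real driver */
};

/* Models hard_start_xmit(): accept one packet and stop the queue as
 * soon as the last free slot has been used. */
static bool model_xmit(struct tx_model *m)
{
    assert(m != NULL);
    if (m->queue_stopped || m->in_flight == TX_RING_SIZE)
        return false;              /* the kernel should not call us now */
    m->in_flight++;
    if (m->in_flight == TX_RING_SIZE)
        m->queue_stopped = true;   /* netif_stop_queue() */
    return true;
}

/* Models the transmit-complete interrupt: free a slot, wake the queue. */
static void model_tx_done(struct tx_model *m)
{
    assert(m != NULL && m->in_flight > 0);
    m->in_flight--;
    if (m->queue_stopped)
        m->queue_stopped = false;  /* netif_wake_queue() */
}
```

Stopping the queue when the ring becomes full, rather than when a packet arrives that cannot be accepted, means the driver never has to refuse a packet it has already been handed.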

Transmission timeouts:

    No need to set up timers
    Handled by network subsystem
    Must set watchdog_timeo and trans_start
    If period exceeded, tx_timeout() called automatically

Scatter/gather I/O:

    If supported, allows avoiding packet assembly
    Allows zero copy
    NETIF_F_SG must be set for hard_start_xmit() to receive fragmented packets.

    Checking whether packet is fragmented:
        skb_shinfo(skb)->nr_frags
        If nr_frags != 0, fragmented packet

    In a fragmented skb:
        ->data points to beginning of first fragment
        First fragment size is: skb->len - skb->data_len
        Rest of fragments are in ->frags array
        Each frag is:

            struct skb_frag_struct {
                struct page *page;
                __u16 page_offset;
                __u16 size;
            };

        Must use DMA operations to map struct page as seen earlier.
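The length bookkeeping of a fragmented skb can be checked with a small model. The structs and helpers below (skb_model, frag_model, model_headlen() and friends) are hypothetical stand-ins that only mirror the len, data_len, nr_frags and frags fields named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_FRAGS 4   /* arbitrary bound for this sketch */

struct frag_model { unsigned int page_offset; unsigned int size; };

struct skb_model {
    unsigned int len;        /* total packet length */
    unsigned int data_len;   /* bytes that live in the fragments */
    unsigned int nr_frags;
    struct frag_model frags[MODEL_MAX_FRAGS];
};

static bool model_is_fragmented(const struct skb_model *skb)
{
    return skb->nr_frags != 0;
}

/* Size of the first (linear) piece: len - data_len. */
static unsigned int model_headlen(const struct skb_model *skb)
{
    assert(skb != NULL && skb->data_len <= skb->len);
    return skb->len - skb->data_len;
}

/* Number of separate areas a driver would have to map for DMA. */
static unsigned int model_dma_segments(const struct skb_model *skb)
{
    return (model_headlen(skb) > 0 ? 1u : 0u) + skb->nr_frags;
}

/* The fragment sizes must add up to data_len. */
static bool model_frags_consistent(const struct skb_model *skb)
{
    unsigned int sum = 0;
    for (unsigned int i = 0; i < skb->nr_frags; i++)
        sum += skb->frags[i].size;
    return sum == skb->data_len;
}
```

For a 1500-byte packet with 1000 bytes spread over two fragments, the linear head is 500 bytes and the device is given three segments.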

6. Packet reception

Need to allocate skb to pass to higher layers.

Reception mode:

    Interrupt driven:
        One interrupt per packet
    Polled (high bandwidth):
        System polls interface for new packets

Allocate skb:

    struct sk_buff *dev_alloc_skb(unsigned int length);
    Call is atomic (we're likely in an interrupt handler)

Set information about packet:

    Protocol
    Checksum requirements:
        CHECKSUM_HW, CHECKSUM_NONE, CHECKSUM_UNNECESSARY

Set statistics

Push skb to network stack:

    int netif_rx(struct sk_buff *skb);
    retval indicates networking subsystem congestion level.
    Most drivers don't check retval
    High performance drivers should check retval

Possible optimization:

    Preallocate DMA'able skbs
    Instruct device to transfer directly to those skbs

7. The interrupt handler

Possible causes:

    Link status change
    Transmission complete
    New packet

On receipt, do as described in previous section.

On send, deallocate transmitted buffer:

    dev_kfree_skb(struct sk_buff *skb);
    dev_kfree_skb_irq(struct sk_buff *skb);
    dev_kfree_skb_any(struct sk_buff *skb);

8. Receive interrupt mitigation

Too many packets coming in too rapidly
Too many interrupts generated
Kernel provides NAPI: New API
NAPI does polling on device
See LDD3 for full details

9. Changes in link state

Link state may change.

Changing carrier state:

    void netif_carrier_off(struct net_device *dev);
    void netif_carrier_on(struct net_device *dev);

Checking carrier state:

    int netif_carrier_ok(struct net_device *dev);

10. The socket buffers

Basic:

    Unit for packet travel in kernel
    struct sk_buff: <linux/skbuff.h>

The important fields:

    struct net_device *dev;
        Device responsible for buffer

    union {...} h; union {...} nh; union {...} mac;
        Packet header pointers

    unsigned char *head; unsigned char *data; unsigned char *tail; unsigned char *end;
        Packet data pointers

    unsigned int len;
        Actual packet length

    unsigned int data_len;
        Fragmented packet portion length if scatter/gather

    unsigned char ip_summed;
        Checksum policy for incoming packet

    unsigned char pkt_type;
        Packet type

Shared info struct handling:

    Some info stored in "shared info" struct for performance reasons.
    skb_shinfo(struct sk_buff *skb);
    unsigned int skb_shinfo(skb)->nr_frags;
    skb_frag_t skb_shinfo(skb)->frags;

Functions acting on socket buffers:

    struct sk_buff *alloc_skb(unsigned int size, int priority);
        Allocate buffer

    struct sk_buff *dev_alloc_skb(unsigned int length);
        Allocate buffer with GFP_ATOMIC priority

    void kfree_skb(struct sk_buff *skb);
        Kernel-internal skb freeing

    void dev_kfree_skb(struct sk_buff *skb);
        Driver free skb

    void dev_kfree_skb_irq(struct sk_buff *skb);
        Driver free skb in interrupt handler

    void dev_kfree_skb_any(struct sk_buff *skb);
        Driver free skb in any context

    unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
        Add len to end of buffer
        Check if enough space

    unsigned char *__skb_put(struct sk_buff *skb, unsigned int len);
        Add len to end of buffer
        Don't check for space

    unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
        Add len to beginning of buffer
        Used for adding headers
        Check if enough space

    unsigned char *__skb_push(struct sk_buff *skb, unsigned int len);
        Add len to beginning of buffer
        Don't check for space

    int skb_tailroom(const struct sk_buff *skb);
        How much space available for adding data at end of buffer

    int skb_headroom(const struct sk_buff *skb);
        How much space available for adding data at beginning of buffer

    void skb_reserve(struct sk_buff *skb, unsigned int len);
        Reserve "len" bytes of headroom at beginning of buffer (advances both data and tail)
        Use on an empty buffer, before adding data

    unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
        Remove packet head

    int skb_is_nonlinear(const struct sk_buff *skb);
        True if scatter/gather

    unsigned int skb_headlen(const struct sk_buff *skb);
        For s/g, what's the first segment's size

    void *kmap_skb_frag(const skb_frag_t *frag);
        Used to map fragmented skb if it must be accessed within kernel

    void kunmap_skb_frag(void *vaddr);
        Unmap kmap_skb_frag()
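The effect of the reserve, put, push and pull operations on the head/data/tail/end pointers can be followed with a small userspace model. struct buf_model and the buf_*() helpers are hypothetical; they imitate the pointer arithmetic only and are not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

struct buf_model {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;                 /* bytes between data and tail */
};

static void buf_init(struct buf_model *b, unsigned char *storage, size_t size)
{
    b->head = b->data = b->tail = storage;
    b->end = storage + size;
    b->len = 0;
}

static size_t buf_headroom(const struct buf_model *b) { return (size_t)(b->data - b->head); }
static size_t buf_tailroom(const struct buf_model *b) { return (size_t)(b->end - b->tail); }

/* Like skb_reserve(): open up headroom in an empty buffer. */
static void buf_reserve(struct buf_model *b, size_t n)
{
    assert(b->len == 0 && n <= buf_tailroom(b));
    b->data += n;
    b->tail += n;
}

/* Like skb_put(): grow at the end, return where the new bytes start. */
static unsigned char *buf_put(struct buf_model *b, size_t n)
{
    unsigned char *old_tail = b->tail;
    assert(n <= buf_tailroom(b));
    b->tail += n;
    b->len += (unsigned int)n;
    return old_tail;
}

/* Like skb_push(): grow at the front, e.g. to prepend a header. */
static unsigned char *buf_push(struct buf_model *b, size_t n)
{
    assert(n <= buf_headroom(b));
    b->data -= n;
    b->len += (unsigned int)n;
    return b->data;
}

/* Like skb_pull(): strip bytes from the front; NULL if too short. */
static unsigned char *buf_pull(struct buf_model *b, size_t n)
{
    if (n > b->len)
        return NULL;
    b->data += n;
    b->len -= (unsigned int)n;
    return b->data;
}
```

Reserving 16 bytes, putting a 100-byte payload and then pushing a 14-byte header leaves 2 bytes of headroom and 12 bytes of tailroom in a 128-byte buffer.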

11. MAC address resolution

Using ARP with Ethernet:

    Managed by kernel
    No need for Ethernet driver to do anything special

Overriding ARP:

    Use of Ethernet abstractions while avoiding ARP
    See LDD3 for overriding default Ethernet discovery by kernel.

Non-Ethernet headers:

    See LDD3

12. Custom ioctl commands

<linux/sockios.h> contains list of already recognized commands.
Typically, commands from user space are taken care of by protocol-specific ioctl() handler.
ioctl commands not recognized by protocol callback are passed to driver.
See LDD3 for more on custom ioctl commands

13. Statistical information

get_stats() callback
See LDD3 for details

14. Multicast

See LDD3

15. A few other details

See LDD3
Media-independent interface support
Ethtool support
Netpoll:
    For remote debugging
    Provide poll_controller() callback

PCI drivers

1. The PCI interface
2. PCI addressing
3. Boot time
4. Configuration registers and initialization
5. MODULE_DEVICE_TABLE
6. Registering a PCI driver
7. Old-style PCI probing
8. Enabling the PCI device
9. Accessing the configuration space
10. Accessing the I/O and memory spaces
11. PCI interrupts

1. The PCI interface

Most widely used peripheral bus in mainstream computers.
PCI specification lays out complete bus functionality.
Arch-independent
Higher clock rate than the popular ISA
32-bit bus
Peripherals are auto-configured at boot time
Linux provides abstractions to help drivers access PCI resources and configuration.

2. PCI addressing

Peripheral identified using:

    bus number
    device number
    function number

PCI spec allows a host to have 256 buses
Linux further supports PCI "domains" to have even more buses.
Each bus can have 32 devices
Each device can have up to 8 functions
Each function identified on PCI bus using 16-bit address/key.

Linux provides pci_dev struct to manipulate PCI devices.
Hosts can have many buses plugged together using PCI bridges. Bridges are seen as special PCI peripherals.
PCI layout is tree of buses and devices
Root bus is PCI bus 0
Example layout in fig. 12-1 in LDD3, p. 304
Viewing of PCI layout can be done using "lspci", or looking at /proc/pci, /proc/bus/pci or /sys/bus/pci/.
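The 16-bit key follows directly from the limits above: 8 bits of bus, 5 bits of device (32 devices) and 3 bits of function (8 functions). The sketch below packs and unpacks such a key in userspace C; the helper names are hypothetical, and the domain:bus:device.function output format is assumed to match what tools such as lspci print.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Pack device (5 bits) and function (3 bits) into one byte. */
static uint8_t pci_devfn(unsigned dev, unsigned fn)
{
    assert(dev < 32 && fn < 8);
    return (uint8_t)((dev << 3) | fn);
}

static unsigned pci_slot(uint8_t devfn) { return (devfn >> 3) & 0x1f; }
static unsigned pci_func(uint8_t devfn) { return devfn & 0x07; }

/* 16-bit key: bus number in the high byte, devfn in the low byte. */
static uint16_t pci_key(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)(((unsigned)bus << 8) | devfn);
}

/* Format as domain:bus:device.function, all in hex. */
static int pci_format(char *out, size_t n, unsigned domain, uint16_t key)
{
    uint8_t devfn = (uint8_t)(key & 0xff);
    return snprintf(out, n, "%04x:%02x:%02x.%x",
                    domain, (unsigned)(key >> 8), pci_slot(devfn), pci_func(devfn));
}
```

Device 0x1d, function 0 on bus 0 of domain 0 packs to key 0x00e8 and formats as 0000:00:1d.0.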

Ways in which devices in PCI layout are displayed (in hex):

    2 values: 8-bit bus ID, 8-bit device and function number
    3 values: bus, device, and function
    4 values: domain, bus, device, and function

Peripherals can answer queries about:

    Memory locations
    I/O ports
    Configuration registers

All devices share address space for:

    Memory locations (32 or 64-bit range)
    I/O ports (32-bit range)

Configuration registers conform to "geographical addressing" => no sharing.
Access to memory locations and I/O ports done using I/O access functions covered earlier.
Layout of addressable locations done at boot, and mapping avoids collisions.

Access to configuration done through specific kernel functions.

Interrupts:

    Each PCI peripheral has 4 interrupt pins and it can use any of them, regardless of the actual routing of the interrupt to the CPU.
    Interrupts are shared

Each PCI device function has 256 bytes for config (PCI-X has 4K)

4 bytes reserved for unique ID (used by drivers to locate device).

3. Boot time

Upon reset, PCI devices:

    Have no memory or I/O mapping
    Remain in quiescent mode
    Interrupts disabled

"Firmware" comes up and configures devices via PCI controller, allocating unique I/O and memory ranges.
Linux can do this too
Current device configuration table readable through /proc/bus/pci/devices.

Sysfs provides per-device configuration items through /sys/bus/pci/devices/<a device>:

    config: binary PCI config info
    vendor, device, subsystem_device, subsystem_vendor, class: device-specific info
    irq: IRQ assigned to device
    resource: memory resources currently allocated to device

4. Configuration registers and initialization

Device (function) configuration contained in 256 bytes
See LDD3 p. 308 for illustration
First 64 bytes are standard and required
The rest depends on peripheral
All PCI registers are little-endian
Use proper byte-ordering functions when necessary.
See peripheral hardware documentation for meaning and use of registers.

Fields of interest for driver:

    vendor ID (16-bit)
        Unique vendor identifier
        Each PCI peripheral manufacturer has a different ID

    device ID (16-bit)
        Vendor-attributed ID
        No central registry
        Known in combination with vendor ID as device "signature".

    class (16-bit)
        Top 8 bits are "base class" or group:
            "network" group contains: Ethernet and token ring
            "communication" group contains: serial and parallel
        Existing classes are defined in PCI spec

    subsystem vendor ID, subsystem device ID:
        Sometimes many PCI boards will be based on the same basic PCI interface chip. These IDs would be used to identify which one of the actual boards this one is.

Drivers use these fields to indicate which peripherals they support:

struct pci_device_id

    __u32 vendor; __u32 device;
        PCI vendor and device IDs
        Use PCI_ANY_ID if support for any

    __u32 subvendor; __u32 subdevice;
        PCI subsystem vendor and subsystem device IDs
        Use PCI_ANY_ID if support for any

    __u32 class; __u32 class_mask;
        PCI class
        Use PCI_ANY_ID if support for any

    kernel_ulong_t driver_data;
        Device-specific data if driver supports more than one device.

Helper macros for creating pci_device_id structs:

    PCI_DEVICE(vendor, device):
        Creates struct pci_device_id of "vendor" and "device"
        Sets subsystem vendor and subsystem device ID to PCI_ANY_ID.

    PCI_DEVICE_CLASS(device_class, device_class_mask):
        Creates struct pci_device_id matching specific class

Can declare list of struct pci_device_id to give to PCI layer to indicate the list of supported devices.
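How a table of such entries selects a device can be sketched as a userspace model. The structs, ANY_ID and the helper names are hypothetical; the rule shown (wildcard fields match anything, class compared only under class_mask, table ended by an all-zero entry) is an assumption about how the matching behaves, not a copy of kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ANY_ID 0xffffffffu   /* plays the role of PCI_ANY_ID */

struct id_model {
    uint32_t vendor, device, subvendor, subdevice;
    uint32_t dev_class, class_mask;
};

struct dev_model {
    uint32_t vendor, device, subvendor, subdevice, dev_class;
};

static bool field_ok(uint32_t want, uint32_t have)
{
    return want == ANY_ID || want == have;
}

static bool id_matches(const struct id_model *id, const struct dev_model *dev)
{
    return field_ok(id->vendor, dev->vendor) &&
           field_ok(id->device, dev->device) &&
           field_ok(id->subvendor, dev->subvendor) &&
           field_ok(id->subdevice, dev->subdevice) &&
           ((id->dev_class ^ dev->dev_class) & id->class_mask) == 0;
}

/* Walk a table terminated by an all-zero entry; NULL if nothing matches. */
static const struct id_model *table_lookup(const struct id_model *tbl,
                                           const struct dev_model *dev)
{
    assert(tbl != NULL && dev != NULL);
    for (; tbl->vendor || tbl->subvendor || tbl->class_mask; tbl++)
        if (id_matches(tbl, dev))
            return tbl;
    return NULL;
}
```

An entry with a zero class_mask ignores the class entirely, which is why a plain vendor/device entry matches regardless of the device's class.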

5. MODULE_DEVICE_TABLE

In order for user-space hotplug functionality to know which devices are supported by each driver, the list of supported devices must be exported by driver modules.
MODULE_DEVICE_TABLE(<bus>, <list of struct pci_device_id>);
For PCI, use "pci" for <bus>
Module will now have a __mod_pci_device_table symbol exported.

"depmod" creates /lib/modules/KERNEL_VERSION/modules.pcimap based on the list of __mod_pci_device_table symbols found in all modules.
modules.pcimap matches device IDs with module names.
The hotplug functionality uses modules.pcimap to know which driver to load in case a new PCI device is found.

6. Registering a PCI driver

struct pci_driver:

    const char *name;
        Driver name
        Must be unique throughout kernel
        Normally set to module name
        Shows up in /sys/bus/pci/drivers/

    const struct pci_device_id *id_table;
        PCI device ID table

    int (*probe)(struct pci_dev *dev, const struct pci_device_id *id);
        Driver probe callback
        Called by kernel when a struct pci_dev is found for this driver
        "id" is the table entry that matched the device the kernel has found for this driver
        retval should be zero if driver accepts responsibility for device and has initialized said device.
        retval should be negative error code if driver doesn't recognize device or doesn't want to handle it.

    void (*remove)(struct pci_dev *dev);
        Called when device is being removed from the system
        Called when PCI driver is being unloaded from the kernel

    int (*suspend)(struct pci_dev *dev, u32 state);
        Optional
        Called when device is getting suspended
        "state" is suspend state

    int (*resume)(struct pci_dev *dev);
        Optional
        Called to reverse suspend()

Basic PCI driver entry:

    static struct pci_driver my_pci_driver = {
        .name = "my_pci_driver",
        .id_table = my_ids,
        .probe = my_probe,
        .remove = my_remove,
    };

PCI driver registration:

    int pci_register_driver(struct pci_driver *pci_driver);
    retval < 0 if error
    retval is zero if success
    Does not return error if no device bound to driver because of:
        Hotplug
        Dynamically created ID (i.e. IDs not already recognized by kernel).

PCI driver removal:

    void pci_unregister_driver(struct pci_driver *pci_driver);
    Results in calls to remove() for every PCI device bound to this driver.
    Function doesn't return until all remove() calls return

7. Old-style PCI probing

Manual PCI list probing
Can't call these functions from interrupt handlers

struct pci_dev *pci_get_device(unsigned int vendor, unsigned int device, struct pci_dev *from);

    Looks for PCI device in list of existing devices
    Increments refcount on found pci_dev struct
    First call should have "from" set to NULL
    Subsequent calls should pass the device already returned to start looking for more devices "after" that one.
    When no more devices, function returns NULL

void pci_dev_put(struct pci_dev *pci_dev);

    Decrement refcount on device

struct pci_dev *pci_get_subsys(unsigned int vendor, unsigned int device, unsigned int ss_vendor, unsigned int ss_device, struct pci_dev *from);

    Same as above, but allows passing subsystem vendor and subsystem device.

struct pci_dev *pci_get_slot(struct pci_bus *bus, unsigned int devfn);

    Searches a specific bus for a given device function

8. Enabling the PCI device

Upon having its probe() function called, a driver must tell the kernel to "enable" the device if it intends to use it:

    int pci_enable_device(struct pci_dev *dev);

    "Wakes up" device
    In some cases, assigns interrupt line and I/O regions

9. Accessing the configuration space

Having found device, driver may need to read and/or write to:

    Memory space
    I/O port space
    Configuration space (essential for knowing where device is located)

Accessing configuration space is based on PCI controller chip implementation.

Linux provides controller-independent functions (<linux/pci.h>):

    Must provide distance from beginning of config space, "where", to read from, in bytes.
    Functions return error code
    Functions reading or writing more than one byte automatically convert data from or to little-endian, to or from the local processor's byte ordering.

Reading config data (8, 16 and 32-bit):

    int pci_read_config_byte(struct pci_dev *dev, int where, u8 *val);
    int pci_read_config_word(struct pci_dev *dev, int where, u16 *val);
    int pci_read_config_dword(struct pci_dev *dev, int where, u32 *val);

Writing config data:

    int pci_write_config_byte(struct pci_dev *dev, int where, u8 val);
    int pci_write_config_word(struct pci_dev *dev, int where, u16 val);
    int pci_write_config_dword(struct pci_dev *dev, int where, u32 val);

Previous functions are actually thin wrappers around the low-level operations.

Low-level operations:

    int pci_bus_read_config_byte(struct pci_bus *bus, unsigned int devfn, int where, u8 *val);
    int pci_bus_read_config_word(struct pci_bus *bus, unsigned int devfn, int where, u16 *val);
    int pci_bus_read_config_dword(struct pci_bus *bus, unsigned int devfn, int where, u32 *val);
    int pci_bus_write_config_byte(struct pci_bus *bus, unsigned int devfn, int where, u8 val);
    int pci_bus_write_config_word(struct pci_bus *bus, unsigned int devfn, int where, u16 val);
    int pci_bus_write_config_dword(struct pci_bus *bus, unsigned int devfn, int where, u32 val);

Predefined locations "where" to read from in <linux/pci.h>:

    PCI_VENDOR_ID
    PCI_DEVICE_ID
    PCI_COMMAND
    PCI_STATUS
    PCI_REVISION_ID
    PCI_INTERRUPT_LINE
    etc.
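The little-endian conversion done by the multi-byte accessors can be illustrated on a plain byte array standing in for a 256-byte configuration space. The cfg_read_*() helpers and CFG_* offsets are hypothetical names for this sketch; the offsets are assumed to follow the standard header layout (vendor ID at 0x00, device ID at 0x02, interrupt line at 0x3c).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_VENDOR_ID      0x00
#define CFG_DEVICE_ID      0x02
#define CFG_INTERRUPT_LINE 0x3c

/* Each reader returns 0 on success and -1 on a bad offset. */
static int cfg_read_byte(const uint8_t *cfg, size_t size, size_t where, uint8_t *val)
{
    assert(cfg != NULL && val != NULL);
    if (where >= size)
        return -1;
    *val = cfg[where];
    return 0;
}

static int cfg_read_word(const uint8_t *cfg, size_t size, size_t where, uint16_t *val)
{
    assert(cfg != NULL && val != NULL);
    if ((where & 1) || where + 2 > size)
        return -1;
    /* Little-endian: low byte first, independent of host byte order. */
    *val = (uint16_t)(cfg[where] | (cfg[where + 1] << 8));
    return 0;
}

static int cfg_read_dword(const uint8_t *cfg, size_t size, size_t where, uint32_t *val)
{
    assert(cfg != NULL && val != NULL);
    if ((where & 3) || where + 4 > size)
        return -1;
    *val = (uint32_t)cfg[where] |
           ((uint32_t)cfg[where + 1] << 8) |
           ((uint32_t)cfg[where + 2] << 16) |
           ((uint32_t)cfg[where + 3] << 24);
    return 0;
}
```

Because the value is assembled byte by byte, the result is the same on little-endian and big-endian hosts, which is the property the kernel accessors provide to drivers.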

10. Accessing the I/O and memory spaces

PCI devices have up to 6 I/O address regions
Each region is either memory or I/O
Most devices have memory-mapped registers
Remember, device registers should not be cached.
Peripherals indicate whether they want certain regions to be prefetchable or not (cached or not).
Peripherals report address regions in 256-byte config structure.

Accessed using config access functions at locations:

    PCI_BASE_ADDRESS_0
    PCI_BASE_ADDRESS_1
    ...
    PCI_BASE_ADDRESS_5

These registers are used to define both 32-bit memory regions and 32-bit I/O regions.
64-bit memory regions can be declared using 2 consecutive config registers.

Use kernel helper functions instead of accessing config directly:

    unsigned long pci_resource_start(struct pci_dev *dev, int bar);
        retval is first address of given memory region
        "bar" indicates for which Base Address Register the start location is sought (from 0 to 5).

    unsigned long pci_resource_end(struct pci_dev *dev, int bar);
        retval is last usable address of given region

    unsigned long pci_resource_flags(struct pci_dev *dev, int bar);
        retval is given region's flags

Region flags (<linux/ioport.h>):

    IORESOURCE_IO:
        Region exists
        Region is I/O

    IORESOURCE_MEM:
        Region exists
        Region is memory

    IORESOURCE_PREFETCH:
        Region can be prefetched

    IORESOURCE_READONLY:
        Region is read-only
        Never happens for PCI devices

Use previously discussed I/O functions to read/write into PCI regions.
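What the helper functions spare the driver from can be seen by decoding a raw Base Address Register by hand. This userspace sketch assumes the standard BAR bit layout (bit 0 selects I/O space; for memory, bits 2:1 give the type and bit 3 the prefetchable flag); struct bar_info, decode_bar() and region_len() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct bar_info {
    bool is_io;          /* region lives in I/O port space */
    bool is_64bit;       /* memory region using two consecutive BARs */
    bool prefetchable;   /* memory region may be prefetched */
    uint32_t base;       /* low 32 bits of the base address */
};

static struct bar_info decode_bar(uint32_t raw)
{
    struct bar_info info = { false, false, false, 0 };

    if (raw & 0x1) {                       /* bit 0 set: I/O space */
        info.is_io = true;
        info.base = raw & ~0x3u;
    } else {                               /* bit 0 clear: memory space */
        info.is_64bit = ((raw >> 1) & 0x3) == 0x2;
        info.prefetchable = (raw & 0x8) != 0;
        info.base = raw & ~0xfu;
    }
    return info;
}

/* Region length from first and last usable address; an all-zero
 * start/end pair is treated as "region absent". */
static unsigned long region_len(unsigned long start, unsigned long end)
{
    assert(end >= start);
    if (start == 0 && end == 0)
        return 0;
    return end - start + 1;
}
```

A driver would normally compare the flags returned by pci_resource_flags() against IORESOURCE_IO or IORESOURCE_MEM instead of testing these bits itself.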

11. PCI interrupts

Usually attributed as part of PCI POST at boot
Just a matter of requesting which PCI interrupt has been attributed to given PCI device:

    retval = pci_read_config_byte(my_dev, PCI_INTERRUPT_LINE, &my_irq);

After that, use the already covered request_irq(), etc.

USB drivers

1. USB device basics
2. USB and sysfs
3. USB urbs
4. Writing a USB driver
5. USB transfers without urbs

1. USB device basics

Properties:

    Tree of devices
    Single master
    Low cost
    No data format enforced
    Can preassign bandwidth
    Some device types already defined (classes). As long as device behaves as in the standard, a generic driver for that device type can be used as is.

Linux and USB:

    Linux supports USB as host and as device (gadget API)
    We will not cover gadget API, see doc on web
    Linux provides USB core stack for allowing drivers to interface with USB hardware, via the USB host controllers, transparently.
    In Linux, devices consist of:
        Config
        Interface (contained in config)
        Endpoint (contained in interface)
    Linux USB drivers attach to "interface", not entire device.

Endpoints:

    Basic form of USB communication
    Can carry data in one direction only:
        OUT: From computer to device
        IN: From device to computer
    Endpoint "size": amount of data that can be held by an endpoint at once.

    Types of endpoints:

    CONTROL:
        Asynchronous transfers: used when needed
        Transfer small amounts of data
        Read/write config info
        Write commands
        Read status
        Each device has at least "endpoint 0"
        At insertion time, USB core uses "endpoint 0" to config device
        "endpoint 0" transfers guaranteed by USB protocol

    INTERRUPT:
        Periodic transfers: bandwidth reserved
        Transfer small amounts of data at fixed rate from device to computer when host asks.
        Primary transport for mice and keyboards
        Can also be used to send commands to devices
        Transfers guaranteed by USB protocol

    BULK:
        Asynchronous transfers
        Transfer large amounts of data
        Lossless transfers
        No guarantee on latency
        Transfers may be broken down into smaller-sized transfers
        Commonly used by devices such as printers, storage, and network

    ISOCHRONOUS:
        Periodic transfers
        Transfer large amounts of data
        No guarantee data makes it through
        Best for streaming devices that can lose some data
        Commonly used for data acquisition, audio, video.

Linux struct for endpoints:

    struct usb_host_endpoint
    Contains actual endpoint information placeholder:
        struct usb_endpoint_descriptor
        Data in this struct is as passed by device

Placeholder entries relevant to drivers:

    bEndpointAddress:
        Endpoint's USB address
        Use USB_DIR_OUT and USB_DIR_IN bitmasks to determine direction.

    bmAttributes:
        Endpoint type
        Use USB_ENDPOINT_XFERTYPE_MASK, USB_ENDPOINT_XFER_ISOC, USB_ENDPOINT_XFER_BULK, or USB_ENDPOINT_XFER_INT to determine endpoint type.

    wMaxPacketSize:
        Maximum packet size handled by endpoint at every transfer
        Larger transfers will be cut into wMaxPacketSize chunks
        See the USB spec on using this field to specify a "high-bandwidth" mode.

    bInterval:
        Interval in milliseconds for interrupt-type transfers
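Decoding these descriptor fields comes down to a few bit masks. The sketch below is a userspace model: struct ep_desc_model and the EP_* constants are hypothetical names, with values assumed to follow the USB descriptor layout (direction in bit 7 of bEndpointAddress, transfer type in the low two bits of bmAttributes, packet size in the low 11 bits of wMaxPacketSize).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EP_DIR_IN        0x80   /* set: device to host */
#define EP_NUMBER_MASK   0x0f
#define EP_XFERTYPE_MASK 0x03
#define EP_XFER_CONTROL  0
#define EP_XFER_ISOC     1
#define EP_XFER_BULK     2
#define EP_XFER_INT      3

struct ep_desc_model {
    uint8_t  bEndpointAddress;
    uint8_t  bmAttributes;
    uint16_t wMaxPacketSize;
    uint8_t  bInterval;
};

static bool ep_is_in(const struct ep_desc_model *d)
{
    return (d->bEndpointAddress & EP_DIR_IN) != 0;
}

static unsigned ep_number(const struct ep_desc_model *d)
{
    return d->bEndpointAddress & EP_NUMBER_MASK;
}

static unsigned ep_type(const struct ep_desc_model *d)
{
    return d->bmAttributes & EP_XFERTYPE_MASK;
}

static unsigned ep_max_packet(const struct ep_desc_model *d)
{
    return d->wMaxPacketSize & 0x07ff;   /* upper bits: high-bandwidth mode */
}

/* Number of packets a transfer of len bytes is cut into. */
static unsigned packets_needed(unsigned len, unsigned max_packet)
{
    assert(max_packet > 0);
    return (len + max_packet - 1) / max_packet;
}
```

A probe routine would typically walk all endpoints of the current alternate setting and keep the first bulk IN and bulk OUT endpoints it finds using tests like these.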

USB naming in Linux conforms to USB spec, not Linux standard, hence the sometimes oddly named variables.

Interfaces:

    Endpoint bundle (zero or more)
    An interface represents at most one logical USB connection.
    Drivers are tied to single interfaces
    Devices that present more than one connection (a keyboard with acceleration keys, for example) must use more than one driver.
    Interfaces can often be used in different "configurations": alternate settings.
    Each configuration represents a set of device settings, possibly reserving different bandwidths depending on operating mode.
    Initial device setting known as first setting (numbered as 0).

    Linux struct for interfaces (passed to and used by USB drivers):

    struct usb_interface
    Important fields:

        struct usb_host_interface *altsetting:
            Array of possible alternate settings for interface
            Each usb_host_interface corresponds to set of endpoint configs (struct usb_host_endpoint).
            No particular ordering

        unsigned num_altsetting:
            Number of alternate settings

        struct usb_host_interface *cur_altsetting:
            Currently active alternate setting
            Pointer into altsetting

        int minor:
            All USB devices have the same major number
            This is the minor number attributed to device after call to usb_register_dev().

Configurations:

    Interface bundle (one or many)
    Linux struct for configurations:
        struct usb_host_config

Device:

    Usually one config
    Some devices can have multiple configurations (rare).
    Only one configuration can be active for a given device at one time.
    Support for multiple-configuration devices in Linux is poor.
    Linux struct for entire device: struct usb_device

    Converting data from struct usb_interface to struct usb_device:
        interface_to_usbdev() macro
        Common operation for drivers

2. USB and sysfs

Sysfs represents both USB device and USB interfaces as individual devices.
Sysfs USB hierarchy largely depends on kernel labeling of USB devices.
Usually, the root entry for USB devices in sysfs is where the USB controller is located on the PCI bus. Example:

    /sys/devices/pci0000:00/0000:00:1d.0/usb2/

First device in USB tree is USB root hub
Each USB root hub has unique ID (here this is 2)
No limit on number of USB root hubs

Sysfs naming scheme for USB device directly connected to root hub:

    <roothub>-<hubport>:<confignb>.<interface>

Device connected using external USB hub:

    <roothub>-<hubport>-<hubport>:<confignb>.<interface>

Most information about USB device is available in its sysfs entry, often using the same naming scheme as in spec. Examples:

    bConfigurationValue
    bDeviceClass
    serial
    etc.

bConfigurationValue can be written to in order to set which config device should use.
Standard sysfs doesn't provide internal device info
More in-depth information about USB devices can be found in usbfs, which is usually mounted at /proc/bus/usb.

/proc/bus/usb/devices is of special interest as it exposes the alternate configuration info along with info regarding endpoints.
User-space USB drivers such as the one for scanners can talk directly to USB hardware via usbfs.

3. USB urbs

Basics:

    Communication to and from endpoints done through USB Request Blocks (URBs) in Linux.
    struct urb: <linux/usb.h>
    A single urb can be used on many endpoints
    A single endpoint can have many urbs allocated for it.
    Each endpoint has urb queue
    urbs are usually queued for processing
    Queued urbs can be canceled (by either driver or USB core).
    Contain internal refcount (deleted on free of last ref)

urb lifecycle:

    USB driver creates it
    Is assigned to single endpoint
    Submitted by driver to USB core
    Submitted by USB core to USB controller
    Processed by USB controller for transfer with device
    USB controller notifies driver of urb completion

struct urb important fields:

    struct usb_device *dev:
        USB device to which urb belongs
        Must set prior to urb queuing

    unsigned int pipe:
        Type of endpoint (only one type possible)
        Set using endpoint setting functions
        Must set prior to urb queuing

    unsigned int transfer_flags:
        Specify how urb should be dealt with
        Predefined urb transfer flags

    void *transfer_buffer:
        Buffer to transmit to or from
        Use kmalloc'ed buffers only

    dma_addr_t transfer_dma:
        Used instead of transfer_buffer if buffer is DMA

    int transfer_buffer_length:
        Length of transfer_buffer or transfer_dma
        If larger than maximum allowed by endpoint, transfers will be broken up into separate USB frames.
        Best to let USB core take care of breaking up transfers instead of conducting short transfers.

    unsigned char *setup_packet:
        For control urbs only
        Buffer sent before control data is transferred

    dma_addr_t setup_dma:
        For control urbs only
        Used instead of setup_packet if buffer is DMA

    usb_complete_t complete:
        Callback invoked when urb transfer is completed or in case of error.

    void *context:
        Private data that could be used by completion callback

    int actual_length:
        Actual size of transfer
        Use this instead of transfer_buffer_length on input

    int status:
        Current urb status
        List below
        Can be accessed safely in completion callback only
        For isochronous endpoints, use iso_frame_desc[] for real status.

    int start_frame:
        For isochronous transfers only
        First frame in or out

    int interval:
        For isochronous or interrupt urbs only
        Must be set prior to urb being enqueued
        Interval for urb polling
        Units depend on device speed

    int number_of_packets:
        For isochronous urbs only
        Number of isochronous packets to be handled by urb
        Must be set prior to urb being enqueued

    int error_count:
        For isochronous urbs only
        Number of errors during isochronous urb transfer

    struct usb_iso_packet_descriptor iso_frame_desc[0]:
        For isochronous urbs only
        Array of struct usb_iso_packet_descriptor
        Defines or collects status of a number of isochronous transfers.

    struct usb_iso_packet_descriptor:
        unsigned int offset:
            Offset in transfer buffer for this packet's content
        unsigned int length:
            Length for packet in transfer buffer
        unsigned int actual_length:
            Actual length received
        unsigned int status:
            Transfer status for this particular packet
            Same definitions as for urb's "status"

Setting urb endpoint type:

    unsigned int usb_sndctrlpipe(struct usb_device *dev, unsigned int endpoint);
    unsigned int usb_rcvctrlpipe(struct usb_device *dev, unsigned int endpoint);
    unsigned int usb_sndbulkpipe(struct usb_device *dev, unsigned int endpoint);
    unsigned int usb_rcvbulkpipe(struct usb_device *dev, unsigned int endpoint);
    unsigned int usb_sndintpipe(struct usb_device *dev, unsigned int endpoint);
    unsigned int usb_rcvintpipe(struct usb_device *dev, unsigned int endpoint);
    unsigned int usb_sndisocpipe(struct usb_device *dev, unsigned int endpoint);
    unsigned int usb_rcvisocpipe(struct usb_device *dev, unsigned int endpoint);

urb flags:

    URB_SHORT_NOT_OK:
        Flags short reads on IN endpoint as error

    URB_ISO_ASAP:
        Schedule isochronous urb ASAP, as soon as bandwidth allows it.
        Drivers must be able to deal with non-imminent transfer if bit is not set for isochronous transfer.

    URB_NO_TRANSFER_DMA_MAP:
        Set to tell USB core buffer transferred is DMA
        USB core uses transfer_dma instead of transfer_buffer

    URB_NO_SETUP_DMA_MAP:
        Similar to previous
        USB core uses setup_dma instead of setup_packet

    URB_ASYNC_UNLINK:
        If set, urb unlinking will happen in the background instead of the urb unlinking function blocking until urb is truly unlinked.
        Use with care => complicated debugging

    URB_NO_FSBR:
        Check LDD3, for UHCI USB host controller driver only

    URB_ZERO_PACKET:
        Bulk out transfer followed by short packet containing no data.
        Required for some broken peripherals

    URB_NO_INTERRUPT:
        Don't generate interrupt after dealing with urb
        Use with great care

urb status:

    0:
        Transfer successful

    -ENOENT:
        usb_kill_urb() stopped urb

    -ECONNRESET:
        usb_unlink_urb() with transfer_flags set to URB_ASYNC_UNLINK.

    -EINPROGRESS:
        Host controller still processing urb
        Buggy driver

    -EPROTO:
        (?) bitstuff error
        Hardware didn't receive response packet in time

    -EILSEQ:
        CRC error

    -EPIPE:
        Endpoint stalled
        If not control endpoint, clear using usb_clear_halt()

    -ECOMM:
        Incoming transfer was too fast
        Couldn't write input to memory

    -ENOSR:
        Outgoing transfer was too fast
        Couldn't retrieve output fast enough

    -EOVERFLOW:
        Received more data than maximum size set for endpoint

    -EREMOTEIO:
        Having set URB_SHORT_NOT_OK, less data was received than expected.

    -ENODEV:
        Device not connected anymore

    -EXDEV:
        For isochronous transfers only
        Transfer was incomplete
        Look at iso_frame_desc for understanding where transfer failed.

    -EINVAL:
        Critical error with urb

    -ESHUTDOWN:
        Device has serious problem, system shut it down
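A completion callback usually starts by sorting the status into "fine", "urb is going away" and "real error". The userspace sketch below maps the values listed above to short descriptions using the errno constants from <errno.h> on Linux. urb_status_str() and urb_status_is_unlink() are hypothetical helpers, and grouping -ENOENT, -ECONNRESET and -ESHUTDOWN as "going away" is a common driver convention assumed here, not something the list above mandates.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Map an urb status (0 or a negative errno) to a description. */
static const char *urb_status_str(int status)
{
    assert(status <= 0);   /* urb status is zero or a negative errno */
    switch (status) {
    case 0:            return "transfer successful";
    case -ENOENT:      return "urb stopped by usb_kill_urb()";
    case -ECONNRESET:  return "urb unlinked asynchronously";
    case -EINPROGRESS: return "still being processed";
    case -EPROTO:      return "bitstuff error or response timeout";
    case -EILSEQ:      return "CRC error";
    case -EPIPE:       return "endpoint stalled";
    case -ECOMM:       return "incoming data too fast";
    case -ENOSR:       return "outgoing data too slow";
    case -EOVERFLOW:   return "more data than endpoint maximum";
    case -EREMOTEIO:   return "short read with URB_SHORT_NOT_OK";
    case -ENODEV:      return "device no longer connected";
    case -EXDEV:       return "isochronous transfer incomplete";
    case -EINVAL:      return "critical urb error";
    case -ESHUTDOWN:   return "device shut down";
    default:           return "unknown status";
    }
}

/* True when the urb ended because it or the device is going away,
 * in which case the callback should not resubmit it. */
static bool urb_status_is_unlink(int status)
{
    return status == -ENOENT || status == -ECONNRESET || status == -ESHUTDOWN;
}
```

In a callback, the unlink check typically comes first so that a disconnect does not trigger error logging or resubmission.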

Creating and destroying urbs:

    Should always dynamically create urbs to preserve ref counting.

    Basic urb creation:

        struct urb *usb_alloc_urb(int iso_packets, int mem_flags);
            "iso_packets": number of isochronous packets, set to 0 if not isochronous urb.
            "mem_flags": same as kmalloc() flags
            retval non-NULL is success
            retval NULL on failure
            urb must be properly initialized prior to being enqueued

        void usb_free_urb(struct urb *urb);
            Frees urb

    Initialization helper functions do not set transfer_flags; must be done by driver if needed after urb initialization.

    Interrupt urbs:

        void usb_fill_int_urb(struct urb *urb, struct usb_device *dev, unsigned int pipe, void *transfer_buffer, int buffer_length, usb_complete_t complete, void *context, int interval);
            "urb": urb to be initialized
            "dev": device where to send urb
            "pipe": endpoint type, see earlier functions: usb_sndintpipe / usb_rcvintpipe
            "transfer_buffer": kmalloc'ed input/output buffer
            "buffer_length": length of transfer_buffer
            "complete": completion callback
            "context": private data for completion callback
            "interval": urb scheduling interval

    Bulk urbs:

        void usb_fill_bulk_urb(struct urb *urb, struct usb_device *dev, unsigned int pipe, void *transfer_buffer, int buffer_length, usb_complete_t complete, void *context);
            Parameters same as usb_fill_int_urb()
            "pipe": usb_sndbulkpipe or usb_rcvbulkpipe

    Control urbs:

        void usb_fill_control_urb(struct urb *urb, struct usb_device *dev, unsigned int pipe, unsigned char *setup_packet, void *transfer_buffer, int buffer_length, usb_complete_t complete, void *context);
            Parameters same as usb_fill_bulk_urb()
            "setup_packet": sent prior to data
            "pipe": usb_sndctrlpipe or usb_rcvctrlpipe
            This initializer not typically used
            Direct transfers are used instead (will see later)

    Isochronous urbs:

        No helper functions
        Must be initialized by hand
        See LDD3 p. 344 for example

Submitting urbs:

    int usb_submit_urb(struct urb *urb, int mem_flags);
    Can be called within interrupt context
    "urb": urb to send
    "mem_flags": same as kmalloc flags; indicates how USB core should allocate any required memory. Valid values:
        GFP_ATOMIC: critical situation, such as interrupt handler, where sleeping cannot be allowed.
        GFP_NOIO: driver is doing block I/O
        GFP_KERNEL: most situations
    retval is 0 on success
    retval < 0 on error
    Must wait for completion before accessing any urb fields.

Completing urbs: the completion callback handler

    typedef void (*usb_complete_t)(struct urb *, struct pt_regs *);
    Called only once for each urb completion
    Likely running in interrupt context

    Possible reasons for completion:
        urb sent successfully and device acks appropriately
        Error happened during transfer
        urb unlinked

Canceling urbs:

    int usb_kill_urb(struct urb *urb);
        Typically used in device disconnect callback

    int usb_unlink_urb(struct urb *urb);
        Must have set URB_ASYNC_UNLINK to work properly
        Will not block while urb is being unlinked

4. Writing a USB driver

What devices does the driver support?

    struct usb_device_id (much like pci_device_id)
    Used by USB core to determine which devices a driver can handle.
    Used by hotplug scripts to determine which drivers to load.

    __u16 match_flags:
        USB_DEVICE_ID_MATCH_* from <linux/mod_devicetable.h>
        Never set directly
        Set using USB_DEVICE() macro

    __u16 idVendor:
        Unique vendor ID

    __u16 idProduct:
        Vendor-specific product ID

    __u16 bcdDevice_lo, __u16 bcdDevice_hi:
        Supported product version range
        Specified in Binary Coded Decimal (BCD)

    __u8 bDeviceClass, __u8 bDeviceSubClass, __u8 bDeviceProtocol:
        Define class, subclass and protocol as spelled out by spec
        Values specify device behavior

    __u8 bInterfaceClass, __u8 bInterfaceSubClass, __u8 bInterfaceProtocol:
        Like previous
        Values specify interface

    kernel_ulong_t driver_info:
        Not used for matching
        Holds driver-supplied data used to tell different devices apart when probing

Helper macros:

    USB_DEVICE(vendor, product):
        Very commonly used
        Create struct usb_device_id matching vendor and product IDs

    USB_DEVICE_VER(vendor, product, lo, hi):
        Like previous, but set supported product versions

    USB_DEVICE_INFO(class, subclass, protocol):
        Create struct usb_device_id matching class description

    USB_INTERFACE_INFO(class, subclass, protocol):
        Create struct usb_device_id matching interface description

MODULE_DEVICE_TABLE(usb, <list of struct usb_device_id>);
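The role of match_flags and the BCD version range can be sketched as a userspace model. The structs and helpers are hypothetical; the MATCH_* bit values are assumed to mirror the first four USB_DEVICE_ID_MATCH_* flags, and only the vendor, product and version checks are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MATCH_VENDOR  0x0001
#define MATCH_PRODUCT 0x0002
#define MATCH_DEV_LO  0x0004
#define MATCH_DEV_HI  0x0008

struct usb_id_model {
    uint16_t match_flags;
    uint16_t idVendor, idProduct;
    uint16_t bcdDevice_lo, bcdDevice_hi;
};

struct usb_dev_model {
    uint16_t idVendor, idProduct, bcdDevice;
};

/* Only the fields selected by match_flags take part in the comparison. */
static bool usb_id_matches(const struct usb_id_model *id,
                           const struct usb_dev_model *dev)
{
    assert(id != NULL && dev != NULL);
    if ((id->match_flags & MATCH_VENDOR) && id->idVendor != dev->idVendor)
        return false;
    if ((id->match_flags & MATCH_PRODUCT) && id->idProduct != dev->idProduct)
        return false;
    if ((id->match_flags & MATCH_DEV_LO) && id->bcdDevice_lo > dev->bcdDevice)
        return false;
    if ((id->match_flags & MATCH_DEV_HI) && id->bcdDevice_hi < dev->bcdDevice)
        return false;
    return true;
}

/* Split a BCD version such as 0x0110 into major 1, minor 10. */
static void bcd_version(uint16_t bcd, unsigned *major, unsigned *minor)
{
    *major = ((bcd >> 12) & 0xf) * 10 + ((bcd >> 8) & 0xf);
    *minor = ((bcd >> 4) & 0xf) * 10 + (bcd & 0xf);
}
```

BCD values can be compared as plain integers because each decimal digit occupies its own nibble, so numeric order and version order agree.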

Registering a USB driver:

struct usb_driver:

    struct module *owner:
        Pointer to module owning this driver
        Ref counting by USB core
        Usually set to "THIS_MODULE"

    const char *name:
        Driver name
        Must be unique across all USB drivers in kernel
        Shows up in /sys/bus/usb/drivers/

    const struct usb_device_id *id_table:
        Pointer to list of struct usb_device_id supported by driver

    int (*probe)(struct usb_interface *intf, const struct usb_device_id *id):
        Called by USB core when match is likely to have been found
        "id" is the likely match
        retval should be 0 if claimed and initialized
        retval should be < 0 if no claim or error
        Called within the context of khubd kernel thread, legal to sleep
        Typically:
            Initialize any local structures/variables for managing device
            Record any info regarding device in local structs for future use.

    void (*disconnect)(struct usb_interface *intf):
        Invoked on device disconnect
        Invoked on driver removal
        Called within the context of khubd kernel thread, legal to sleep

    int (*ioctl)(struct usb_interface *intf, unsigned int code, void *buf):
        Not typically implemented
        Not typically needed
        Called when uspace app does ioctl() on usbfs device entry

    int (*suspend)(struct usb_interface *intf, u32 state):
        Not typically implemented
        Called on device suspend

    int (*resume)(struct usb_interface *intf):
        Not typically implemented
        Called on device resume

Basic usb_driver entry:

    static struct usb_driver my_driver = {
        .owner = THIS_MODULE,
        .name = "MyDriver",
        .id_table = my_table,
        .probe = my_probe,
        .disconnect = my_disconnect,
    };

Actual registration:

    int usb_register(struct usb_driver *);
    void usb_deregister(struct usb_driver *);

Most important work should be conducted at device open from uspace.

probe and disconnect in detail

See example in LDD3, p. 350

Linking driver-private data to a struct usb_interface:
    void usb_set_intfdata(struct usb_interface *intf, void *data);

Getting the private data back from a struct usb_interface:
    void *usb_get_intfdata(struct usb_interface *intf);
    Usually done on disconnect or device open()

When disconnecting, do not forget to use usb_set_intfdata() to reset private data to NULL.

Connection between higher levels and USB driver:

Usually, a USB driver interfaces to user space through the other subsystem it ties into, like SCSI, network, etc.
USB drivers can also tie to user space using a char device interface:

int usb_register_dev(struct usb_interface *intf, struct usb_class_driver *class_driver);
    Call in probe()
    struct usb_class_driver: parameters for obtaining minor number
        char *name;                     sysfs name; use "%d" if many devices
        struct file_operations *fops;   char file ops
        mode_t mode;                    file access mode
        int minor_base;                 base minor number

void usb_deregister_dev(struct usb_interface *intf, struct usb_class_driver *class_driver);
    Called in disconnect()

USB buffer allocation primitives:

void *usb_buffer_alloc(struct usb_device *dev, size_t size, int mem_flags, dma_addr_t *dma);
    "dev": USB device
    "size": amount requested
    "mem_flags": same as kmalloc()
    "dma": "transfer_dma" entry in urb struct

void usb_buffer_free(struct usb_device *dev, size_t size, void *addr, dma_addr_t dma);
    Parameters as above
    "addr" is the usb_buffer_alloc'ed space

5. USB transfers without urbs

Transfer without dealing with urbs at all
Just send and receive directly over USB

usb_bulk_msg:

int usb_bulk_msg(struct usb_device *usb_dev, unsigned int pipe, void *data, int len, int *actual_length, int timeout);
    "usb_dev": the USB device
    "pipe": endpoint type
    "data": input/output buffer
    "len": data buffer length
    "actual_length": pointer to number of bytes actually transferred
    "timeout": timeout in jiffies, if 0 it'll wait forever
    retval is 0 on success
    retval < 0 on error, see "struct urb" errors for actual error.
    Cannot be called from interrupt
    Call cannot be cancelled
    disconnect() must wait if call issued

usb_control_msg:

int usb_control_msg(struct usb_device *usb_dev, unsigned int pipe, __u8 request, __u8 requesttype, __u16 value, __u16 index, void *data, __u16 size, int timeout);
    "usb_dev": the USB device
    "pipe": endpoint type
    "request": USB request per spec
    "requesttype": USB request type per spec
    "value": USB message value per spec
    "index": USB message index per spec
    "data": input/output buffer
    "size": size of data
    "timeout": timeout in jiffies
    retval is number of bytes transferred on success
    retval is negative error value on error
    Cannot be called from interrupt
    Call cannot be cancelled
    disconnect() must wait if call issued

Other USB data functions:

Obtain standard info from USB devices
Cannot call in interrupt handlers

int usb_get_descriptor(struct usb_device *dev, unsigned char type, unsigned char index, void *buf, int size);
    Can be used to retrieve information not already available through the USB structures described earlier.
    "dev": the USB device
    "type": type according to spec:
        USB_DT_DEVICE, USB_DT_CONFIG, USB_DT_STRING, USB_DT_INTERFACE,
        USB_DT_ENDPOINT, USB_DT_DEVICE_QUALIFIER,
        USB_DT_OTHER_SPEED_CONFIG, USB_DT_INTERFACE_POWER,
        USB_DT_OTG, USB_DT_DEBUG, USB_DT_INTERFACE_ASSOCIATION,
        USB_DT_CS_DEVICE, USB_DT_CS_CONFIG, USB_DT_CS_STRING,
        USB_DT_CS_INTERFACE, USB_DT_CS_ENDPOINT.
    "index": descriptor number
    "buf": buffer to copy descriptor to
    "size": size of "buf"
    retval is number of bytes transferred on success
    retval is negative value in case of error.
    Relies on usb_control_msg()

int usb_get_string(struct usb_device *dev, unsigned short langid, unsigned char index, void *buf, int size);

TTY drivers
1. Basics
2. A small TTY driver
3. tty_driver function pointers
4. TTY line settings
5. ioctls
6. proc and sysfs handling of TTY devices
7. Core struct details

1. Basics

tty = TeleTYpewriter
Any serial-port-like device
tty virtual devices used to create interactive sessions through various software abstractions (X, network, etc.)
Types of TTYs:

Virtual console:
    Full-screen terminal displays on the system video monitor

System console:
    The device to which system messages should be sent, and where logins should be permitted in single-user mode.

Serial ports:
    RS-232 serial ports and any device which simulates one, either in hardware (such as internal modems) or in software (such as the ISDN drivers or USB-to-serial drivers).

Pseudoterminals (PTYs):
    Used to create login sessions or provide other capabilities requiring a TTY line discipline to arbitrary data-generating processes.

What is a "line discipline"?
    A conversion layer between the TTY driver, which talks to actual hardware, and a software layer that knows only how to talk to a generic TTY.
    Typically, this is a protocol conversion: PPP, Bluetooth, etc.

See /proc/tty/drivers for list of tty drivers currently loaded.

2. A small TTY driver

Basics:

struct tty_driver: <linux/tty_driver.h>

Allocating a tty driver:
    struct tty_driver *alloc_tty_driver(<nb tty devices supported>);

After allocation, struct tty_driver should be initialized.
May want to set struct tty_operations using tty_set_operations().

Once initialized, the driver should be registered:
    int tty_register_driver(struct tty_driver *driver);
    Results in sysfs entries creation

Once registered, the driver should register the devices it controls:
    void tty_register_device(struct tty_driver *driver,   /* tty driver */
                             unsigned index,              /* Minor nbr */
                             struct device *dev);         /* Device */

Conversely:
    void tty_unregister_device(struct tty_driver *driver, unsigned index);
    int tty_unregister_driver(struct tty_driver *driver);

struct tty_driver contents

"owner": THIS_MODULE
"driver_name": name shown in /proc/tty/drivers
"name": entry in /dev
"major": major number
"type" and "subtype": TTY driver type
"flags": driver state and type

struct tty_driver has an init_termios (struct termios) member.
init_termios holds the default line settings for the driver.
These may be changed later when the device is opened from user space.

struct termios:

Bitmask entries in struct termios (see termios man page):

tcflag_t c_iflag;       Input mode
tcflag_t c_oflag;       Output mode
tcflag_t c_cflag;       Control mode
tcflag_t c_lflag;       Local mode
cc_t c_line;            Line discipline type
cc_t c_cc[NCCS];        Control characters array
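The same flag words are visible from user space through <termios.h>, so the kind of default a driver might put in init_termios can be sketched there. The following is a user-space illustration only, not kernel code; demo_init_termios() is a hypothetical helper, and the 9600 8N1 settings are an assumption chosen for the example.

```c
#include <string.h>
#include <termios.h>

/*
 * Fill *t with 9600 baud, 8 data bits, no parity, 1 stop bit.
 * Returns 0 on success, -1 if the baud rate could not be set.
 */
int demo_init_termios(struct termios *t)
{
    memset(t, 0, sizeof(*t));

    t->c_iflag = IGNPAR;                /* input mode: ignore parity errors */
    t->c_oflag = 0;                     /* output mode: no post-processing */
    t->c_cflag = CS8 | CREAD | CLOCAL;  /* control mode: 8 bits, receiver on */
    t->c_lflag = 0;                     /* local mode: non-canonical, no echo */
    t->c_cc[VMIN] = 1;                  /* control chars: read returns after 1 byte */
    t->c_cc[VTIME] = 0;                 /* control chars: no inter-byte timeout */

    /* The baud rate is set through the POSIX helpers. */
    if (cfsetispeed(t, B9600) < 0 || cfsetospeed(t, B9600) < 0)
        return -1;
    return 0;
}
```

A real program would pass the resulting structure to tcsetattr() on an open terminal; the TTY layer then hands the new settings to the driver.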

3. tty_driver function pointers

open and close:
    open called by TTY layer when user space opens the /dev entry.
    close called by TTY layer when user space closes the /dev entry.

Flow of data:
    write called as a result of user-space write
    write can be called from interrupt context
    put_char: add single character to output buffer
    chars_in_buffer: how many bytes left to send

Other buffering functions:
    flush_chars: start writing chars to device, if not already done.
    wait_until_sent: same as flush_chars, but wait for send to complete.
    flush_buffer: send as much of what remains as possible, and drop what's left.

No read function?
    TTY layer buffers input until user reads
    TTY layer notifies driver if stop/start needed

Feeding characters to TTY layer:
    void tty_insert_flip_char(struct tty_struct *tty, unsigned char ch, char flag);

Flushing buffer when enough characters:
    void tty_flip_buffer_push(struct tty_struct *tty);

4. TTY line settings

Basics:
    User space attempts to set TTY config using ioctl()
    TTY layer recognizes some of the ioctl() calls and converts them to driver callbacks.

set_termios:
    Most common callback to TTY ioctls
    Parity bit, stop bit, baud rate, etc.
    See LDD3 for details

tiocmget and tiocmset:
    Control line setting set and get
    See LDD3 for details

5. ioctls

Some 70 different ioctls to ttys
See summary in LDD3

6. proc and sysfs handling of TTY devices

Automatic creation of /proc entries on definition of read_proc and write_proc callbacks.
See LDD3

7. Core struct details

See LDD3 for full details of the following structs:
    struct tty_driver
    struct tty_operations
    struct tty_struct

Appendix A. Debugging drivers
1. Debugging support in the kernel
2. Manual techniques
3. Debugging tools
4. Performance measurement
5. Hardware tools

1. Debugging support in the kernel

Built-in debugging capabilities
"Kernel hacking" menu

CONFIG_DEBUG_KERNEL:
    Required for enabling any other debug option

CONFIG_DEBUG_SLAB:
    Added checks in memory allocation functions
    Useful for debugging memory overruns and missing initialization
    Allocated buffers set to 0xa5
    Freed buffers set to 0x6b
    Guard values put before and after allocated area: if modified => error

CONFIG_DEBUG_PAGEALLOC:
    Full pages removed from kernel when freed
    CPU hog
    May help in pinpointing memory corruption errors

CONFIG_DEBUG_SPINLOCK:
    Kernel catches errors on misuse of spinlocks

CONFIG_DEBUG_SPINLOCK_SLEEP:
    Checks for situations where sleeps are attempted while holding spinlocks

CONFIG_INIT_DEBUG:
    Checks for code attempting to access other code marked as __init and discarded after boot.

CONFIG_DEBUG_INFO:
    Generates kernel with full debug info (gcc -g).
    Should configure CONFIG_FRAME_POINTER if using gdb

CONFIG_MAGIC_SYSRQ:
    Enables use of magic keyboard sequence to execute hardcoded commands in case of system hang.

CONFIG_DEBUG_STACKOVERFLOW / CONFIG_DEBUG_STACK_USAGE:
    Track down kernel stack overflows.

CONFIG_KALLSYMS (in "General setup / Standard features"):
    Include kernel symbol table into kernel image.
    Otherwise oopses are in hex.

CONFIG_IKCONFIG / CONFIG_IKCONFIG_PROC (in "General setup"):
    Include kernel configuration into kernel image and make it available through /proc.

CONFIG_ACPI_DEBUG (in "Power management/ACPI"):
    Verbose ACPI debug info.

CONFIG_DEBUG_DRIVER:
    Turn on debug info in driver core.

CONFIG_SCSI_CONSTANTS (in "Device drivers/SCSI device support"):
    Verbose SCSI debug info.

CONFIG_INPUT_EVBUG:
    Verbose logging for input devices (including key logging).

CONFIG_PROFILING (in "Profiling support"):
    System performance tuning

2. Manual techniques

printk
printk("Detected error 0x%x on interface %d:%d\n", error_code, iface_bus, iface_id);

/proc

Main functions: include/linux/proc_fs.h
struct proc_dir_entry *create_proc_read_entry(const char *name, mode_t mode,
                                              struct proc_dir_entry *base,
                                              read_proc_t *read_proc, void *data);
void remove_proc_entry(const char *name, struct proc_dir_entry *parent);

read_proc is a callback:
typedef int (read_proc_t)(char *page, char **start, off_t off, int count, int *eof, void *data);
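As a rough illustration of a callback with this signature, the sketch below can be compiled and exercised in user space. It is not taken from the course material: struct demo_stats and demo_read_proc() are hypothetical names, and the offset bookkeeping follows an idiom commonly seen in drivers of that era rather than any single authoritative implementation.

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical driver state to be reported through /proc. */
struct demo_stats {
    unsigned long rx_packets;
    unsigned long tx_packets;
};

int demo_read_proc(char *page, char **start, off_t off, int count,
                   int *eof, void *data)
{
    const struct demo_stats *st = data;
    int len;

    /* Write the whole report into the page; it must fit in one page. */
    len = sprintf(page, "rx: %lu\ntx: %lu\n", st->rx_packets, st->tx_packets);

    /* If the request reaches the end of the report, signal end of file. */
    if (len <= off + count)
        *eof = 1;

    /* Tell the caller where the requested part starts within the page. */
    *start = page + off;

    /* Return the number of bytes available from "off", capped to "count". */
    len -= off;
    if (len > count)
        len = count;
    if (len < 0)
        len = 0;
    return len;
}
```

In a driver, a function like this would be handed to create_proc_read_entry() along with a pointer to the state it reports.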

    Write data to page
    Careful: can only fill 1 page at a time (4K)
    Use *start to tell OS that you have more than a page. Your function will then be called more than once with a different offset.
    ALWAYS return the size you wrote. If you don't, nothing will be displayed.
    You can add your own /proc tree if you want...

ioctl:
    Method implemented in most device driver models (char, block, net, fb, etc.) allowing custom functionality to be coded in driver...
    Can extend your driver's ioctl to allow a user-space application to poll or change the driver's state outside of the OS' control.

oops messages:
    Report printed out by kernel regarding internal error that can't be handled.
    Sometimes last output before system freeze => must be copied by hand.
    Contains addresses and references to function addresses which can be understood by looking at System.map
    Can be automatically decoded with klogd/ksymoops

Unable to handle kernel paging request at virtual address 0007007a
 printing eip:
c022a8f6
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c022a8f6>]    Not tainted
EFLAGS: 00010202
eax: 0000000a   ebx: 00000004   ecx: 00000001   edx: e3a74b80
esi: 0007007a   edi: e62150fc   ebp: 0007007a   esp: dfbdbc8c
ds: 0018   es: 0018   ss: 0018
Process ip (pid: 2128, stackpage=dfbdb000)
Stack: 00000000 bfff0018 00000018 0000000a 00000000 e41e8c00 c02f5acd e3a74b80
       c022aec1 e3a74b80 0000000a 00000004 0007007a c02f5ac8 00000e94 00000246
       e62150b4 00000000 e41e8c00 00000001 00000000 e5365a00 c022b039 e3a74b80
Call Trace: [<c022aec1>] [<c022b039>] [<c022df21>] [<c022e1d0>] [<c022b6ea>]
   [<c022afb0>] [<c022b1d0>] [<c0176f7e>] [<c022b2a0>] [<c022ddba>] [<c022d623>]
   [<c022db41>] [<c021c8f5>] [<c021daf3>] [<c0118238>] [<c021d44d>] [<c021e459>]
   [<c0107800>] [<c010770f>]
Code: f3 a5 f6 c3 02 74 02 66 a5 f6 c3 01 74 01 a4 8b 5c 24 10 8b

For good measure, always save the content of /proc/kallsyms before generating an oops:
$ cat /proc/kallsyms > /tmp/kallsyms-dump
$ sync
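Decoding an oops address by hand against System.map or a saved symbol dump comes down to finding the last symbol whose address is not greater than the address of interest. The user-space sketch below shows that lookup on a tiny table. The symbol names and addresses in demo_map are made up for the example (only the EIP value is borrowed from the oops above), and demo_lookup() is a hypothetical helper, not part of klogd or ksymoops.

```c
#include <stddef.h>

struct demo_sym {
    unsigned long addr;
    const char *name;
};

/* Made-up excerpt of a symbol map, sorted by ascending address. */
static const struct demo_sym demo_map[] = {
    { 0xc022a800UL, "demo_copy_frame" },
    { 0xc022aa00UL, "demo_send_frame" },
    { 0xc022b000UL, "demo_queue_xmit" },
};

/*
 * Return the symbol containing "addr": the last entry whose start address
 * is <= addr, or NULL if addr lies below the first symbol.
 */
const struct demo_sym *demo_lookup(const struct demo_sym *map, size_t n,
                                   unsigned long addr)
{
    const struct demo_sym *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (map[i].addr > addr)
            break;
        best = &map[i];
    }
    return best;
}
```

With this table, address c022a8f6 resolves to demo_copy_frame at offset 0xf6, which is the "symbol+offset" form that automatic decoders print.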

User-mode Linux: http://user-mode-linux.sf.net/

A version of Linux that runs entirely as a user-space process.
Ideal for prototyping and debugging new kernel functionality.
Widely used by kernel developers
Available starting in 2.5
UML is built as a new Linux architecture (disable reiserfs)

$ cd ${PRJROOT}/kernel/uml/linux-2.6.11
$ make ARCH=um menuconfig
$ make linux ARCH=um

Running user-mode Linux
$ ./linux
Checking for /proc/mm...not found
tracing thread pid = 14932
Linux version 2.6.11 (karim@localhost.localdomain) ...
On node 0 totalpages: 8192
zone(0): 8192 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: root=/dev/ubd0
Calibrating delay loop... 2617.44 BogoMIPS
Memory: 29480k available
...
Initializing software serial port version 1
mconsole (version 2) initialized on /home/karim/.uml/...
unable to open root_fs for validation
UML Audio Relay (host dsp = /dev/sound/dsp, host mixer ...
Initializing stdio console driver
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 2048 bind 4096)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Cannot open root device "ubd0" or 62:00
Please append a correct "root=" boot option
Kernel panic: VFS: Unable to mount root fs on 62:00

3. Debugging tools

gdb: all archs
    Can use standard gdb to visualize kernel variables:
    $ gdb ./vmlinux /proc/kcore
    Get more information when using -g flag
    gdb grabs kcore snapshot at startup / no dynamic update

kdb: x86/ia64
    Available from http://oss.sgi.com/
    Integrated kernel debugger patch
    Requires kernel patch (Linus allergic to kdb & friends)
    Is used directly on the host being debugged
    Access through PAUSE/BREAK key on keyboard

IKD: Integrated Kernel Debugger
    Available from ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/ikd
    Patch provides: stack size watch / lock watch / ktrace

kgdb: Full serial-line-based kernel debugger
    Available from http://kgdb.sourceforge.net/
    Sometimes available in kernel sources:
        arch/ppc/kernel/ppc-stub.c
    Connect to remote target through gdb on host
    Use gdb as you would for any other remote program

LKCD: Linux Kernel Crash Dump
    Available from http://lkcd.sourceforge.net/
    Saves memory image at crash
    Retrieve memory image at startup
    Analyze image to find problems

Kexec
DProbes: Dynamic Probes
Kprobes
SystemTap

4. Performance measurement

LMbench:
    Available from http://www.bitmover.com/lmbench/
    Runs heavy user-space benchmarks to determine kernel response time.
    Not adapted for embedded (requires Perl and C compiler.)

kernprof:
    Available from http://oss.sgi.com/projects/kernprof/
    Kernel patch adding profiling data collection mechanism.
    For x86/ia64/sparc64/mips64

Integrated sample-based profiler:
    Activated upon passing profile= boot param
    Profile data available in /proc/profile
    User-space tools available from http://sourceforge.net/projects/minilop/
    See BELS for how to install and use

Measuring interrupt latency:
    Two ways:
        Self-contained
        Induced

Self-contained:
    System's output connected to system input
    Write driver with 2 main functions
    Fire-up function:
        Records current time
        Trigger interrupt (write to output pin)
    Interrupt handling function:
        Records current time
        Toggles output pin
    Interrupt latency is measured using time difference
    Can connect scope and observe timing
    No need for time recording if using scope
    Dead angle: can't see the time it takes to get to the fire-up function
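The "time difference" step of the self-contained method is plain timestamp arithmetic. The sketch below shows it in user space with struct timespec values; demo_elapsed_ns() is a hypothetical helper, and a real driver would take its two timestamps with whatever kernel time source is appropriate for the target rather than with this structure.

```c
#include <stdint.h>
#include <time.h>

/*
 * Nanoseconds elapsed between two timestamps: "start" taken in the fire-up
 * function, "end" taken in the interrupt handler.
 */
int64_t demo_elapsed_ns(const struct timespec *start,
                        const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
           + (int64_t)(end->tv_nsec - start->tv_nsec);
}
```

Working in one signed 64-bit nanosecond count avoids the borrow handling that is needed when seconds and nanoseconds are subtracted separately.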

Induced:
    Closer to reality
    Interrupt generated by outside source (frequency generator)
    Driver toggles output pin
    Measure time between wave starts to get real latency
    Try ls -R /

Linux is not an RTOS...

5. Hardware tools

Oscilloscope
Logic analyzer
In-circuit emulator
BDM/JTAG:
    Abatron / uses plain gdb
    Wiggler / requires modified gdb
    BDM4GDB (MPC860): http://bdm4gdb.sourceforge.net/
    JTAG tool: http://openwince.sourceforge.net/jtag/

Appendix B. Kernel data types
1. Use of standard C types
2. Assigning an explicit size to data types
3. Interface-specific types
4. Other portability issues
5. Linked lists

1. Use of standard C types

Actual sizes differ between architectures
See LDD3 p. 289 for size of each type on various archs
Addresses in the kernel are usually held as "unsigned long" because pointers are the same size as that type on all supported archs.

2. Assigning an explicit size to data types

<linux/types.h>

Unsigned types:
    u8 u16 u32 u64

Signed types (rarely used):
    s8 s16 s32 s64

For headers exported to user space, use these instead (no POSIX namespace pollution):
    __u8, __s8
    __u16, __s16
    __u32, __s32
    __u64, __s64

If you want to be C99 compliant (and the compiler supports it):
    uint8_t, uint16_t, uint32_t, uint64_t

3. Interface-specific types

Commonly used data types are usually typedef'ed in the kernel
Recently, typedefing has lost its appeal with kernel developers (opaque types)
Many "_t" types defined in <linux/types.h> (size_t, pid_t, etc.)
No problem when used in code: highly portable
Problem when printing values out for debugging (usually such values need not be printed.)
To print, cast to the largest possible type for the typedefed "_t"
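The casting advice can be tried out in user space, where pid_t and size_t raise the same question. The helper below is a hypothetical example (demo_format_ids() is not a kernel function): since the underlying type of a "_t" typedef may differ between platforms, each value is cast to a wide standard type that matches the format specifier.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Format a pid_t and a size_t portably by casting each to a wide standard
 * type that has a well-defined printf conversion.
 * Returns the value produced by snprintf().
 */
int demo_format_ids(char *buf, size_t len, pid_t pid, size_t nbytes)
{
    return snprintf(buf, len, "pid=%ld size=%lu",
                    (long)pid, (unsigned long)nbytes);
}
```

The same pattern applies to printk() in a driver: pick the widest plausible type for the typedef, cast, and use the matching conversion.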

4. Other portability issues

Avoid explicit constants: use #defines instead

Time intervals:
    Do not assume 1000 jiffies per second
    Use HZ instead

Page size:
    PAGE_SIZE not always 4KB
    Use PAGE_SHIFT to get number of bits to shift to get page number
    <asm/page.h>
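The relation between page size, shift count and page number can be checked in user space, where the page size is queried at run time instead of coming from <asm/page.h>. This is a sketch under that assumption; demo_page_shift() and demo_page_number() are hypothetical helpers standing in for what PAGE_SHIFT provides directly in the kernel.

```c
#include <unistd.h>

/* Number of bits to shift an address right to obtain its page number. */
unsigned int demo_page_shift(unsigned long page_size)
{
    unsigned int shift = 0;

    while ((1UL << shift) < page_size)
        shift++;
    return shift;
}

/* Page number of "addr" for the page size of the running system. */
unsigned long demo_page_number(unsigned long addr)
{
    unsigned long page_size = (unsigned long)sysconf(_SC_PAGESIZE);

    return addr >> demo_page_shift(page_size);
}
```

Hardcoding 12 instead would silently break on systems configured with larger pages.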

Byte order:
    PC is little-endian, but many platforms are big-endian
    For internal driver use, no need to care
    For outside communication:
        <asm/byteorder.h>
        Either __BIG_ENDIAN or __LITTLE_ENDIAN => use #ifdef
        u32 cpu_to_le32(u32);
        u32 le32_to_cpu(u32);
        u32 cpu_to_be32(u32);
        u32 be32_to_cpu(u32);
        etc.
        Variants with appended "p" take a pointer; variants with appended "s" convert in place
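What the conversion helpers do can be modelled in portable user-space C. The functions below are hypothetical re-creations for illustration (the kernel resolves the byte order at compile time, while this sketch probes it at run time): on a little-endian CPU the "le" conversion is a no-op and the "be" conversion swaps bytes, and the reverse holds on a big-endian CPU.

```c
#include <stdint.h>
#include <string.h>

/* Run-time probe: is the least significant byte stored first? */
static int demo_cpu_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Reverse the order of the four bytes of a 32-bit value. */
static uint32_t demo_swab32(uint32_t v)
{
    return ((v & 0x000000ffU) << 24) | ((v & 0x0000ff00U) << 8) |
           ((v & 0x00ff0000U) >> 8)  | ((v & 0xff000000U) >> 24);
}

/* Model of cpu_to_le32(): result has little-endian layout in memory. */
uint32_t demo_cpu_to_le32(uint32_t v)
{
    return demo_cpu_is_little_endian() ? v : demo_swab32(v);
}

/* Model of cpu_to_be32(): result has big-endian layout in memory. */
uint32_t demo_cpu_to_be32(uint32_t v)
{
    return demo_cpu_is_little_endian() ? demo_swab32(v) : v;
}
```

Because each conversion is either the identity or a byte swap, the reverse direction (le32_to_cpu, be32_to_cpu) is the same operation.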

Data alignment:
    On PC, alignment is not a problem
    For many non-PC architectures, non-aligned access is a problem
    <asm/unaligned.h>
    get_unaligned(ptr);
    put_unaligned(val, ptr);
    Type alignment in structures differs between archs
    Example alignment difference between processors in LDD3 p. 294
    Structure padding may be inserted by compiler for performance, may cause problems when sharing structs with other hardware or systems.
    Use structure "packing" when necessary:
        struct {} __attribute__((packed)) ...;
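Both points, padding and unaligned access, can be observed in user space with GCC or Clang. The sketch below uses hypothetical names throughout; the memcpy-based accessors are one portable way to get the effect of get_unaligned()/put_unaligned(), not the kernel's implementation.

```c
#include <stdint.h>
#include <string.h>

/* Natural layout: the compiler may insert padding after "tag". */
struct demo_natural {
    uint8_t tag;
    uint32_t value;
};

/* Packed layout: no padding, "value" may sit at an unaligned offset. */
struct demo_packed {
    uint8_t tag;
    uint32_t value;
} __attribute__((packed));

/* Read a 32-bit value from an address with no alignment guarantee. */
uint32_t demo_get_unaligned_u32(const void *ptr)
{
    uint32_t v;

    memcpy(&v, ptr, sizeof(v));
    return v;
}

/* Write a 32-bit value to an address with no alignment guarantee. */
void demo_put_unaligned_u32(uint32_t val, void *ptr)
{
    memcpy(ptr, &val, sizeof(val));
}
```

Going through memcpy() lets the compiler emit whatever access sequence is safe for the target, instead of a single load that may trap on architectures with strict alignment.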

Pointers and error values:

retval not always NULL on failure

Returning an error as a pointer value:
void *ERR_PTR(long error);

Determining if pointer returned is error:
long IS_ERR(const void *ptr);

Retrieving error from ptr (after IS_ERR()):
long PTR_ERR(const void *ptr);
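The trick behind these helpers is that the last few thousand values of the address space are never valid pointers, so a negative errno can be encoded there. The user-space model below illustrates the idea; the demo_ names, the 4095 limit and the demo_get_buffer() function are assumptions made for this sketch, and the exact threshold used by a given kernel version may differ.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed upper bound on errno values encoded in a pointer. */
#define DEMO_MAX_ERRNO 4095

/* Model of ERR_PTR(): encode a negative errno as a pointer value. */
static inline void *demo_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Model of PTR_ERR(): recover the negative errno from the pointer. */
static inline long demo_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Model of IS_ERR(): true if ptr lies in the top DEMO_MAX_ERRNO addresses. */
static inline int demo_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static int demo_buf[4];

/* Hypothetical allocator: returns a buffer, or an encoded -ENOMEM. */
void *demo_get_buffer(int want_failure)
{
    return want_failure ? demo_err_ptr(-ENOMEM) : (void *)demo_buf;
}
```

A caller checks the result with the IS_ERR-style test first and only then extracts the error code, which is why a plain NULL check is not enough for functions that use this convention.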

5. Linked lists

Kernel provides standard way to manage and maintain linked lists
Should use kernel's primitives inasmuch as possible
Primitives do not implement any locking, must use appropriate locking
<linux/list.h>

struct list_head {
    struct list_head *next, *prev;
};

Include "struct list_head" as part of custom structs:
struct my_struct {
    struct list_head list;
    /* my stuff ... */
};

Static initialization:
    LIST_HEAD(my_list);

Dynamic initialization:
    struct list_head my_list;
    INIT_LIST_HEAD(&my_list);

list_add(struct list_head *new, struct list_head *head);
    Add entry to list right after head
    Could pass list entry instead of real head

list_add_tail(struct list_head *new, struct list_head *head);
    Add to end of list

list_del(struct list_head *entry);
    Remove from list

list_del_init(struct list_head *entry);
    Remove from list and reinit pointers
    For removing and inserting in other lists

list_move(struct list_head *entry, struct list_head *head);
    Move entry to beginning

list_move_tail(struct list_head *entry, struct list_head *head);
    Move entry to end

list_empty(struct list_head *head);
    retval is nonzero if empty

list_splice(struct list_head *list, struct list_head *head);
    Insert a list in other list

list_entry(struct list_head *ptr, type_of_struct, field_name);
    Use pointer arithmetic to get the pointer to the struct containing the list_head entry.
    "type_of_struct" is the structure containing the "struct list_head". For example, my_struct.
    "field_name" is the name of "struct list_head" within custom struct.
    For example:
        struct my_struct *my_ptr = list_entry(list_ptr, struct my_struct, list);

Macros for traversing lists:

list_for_each(struct list_head *cursor, struct list_head *list)
    for() loop executed once for each list entry
    Do not modify list in loop

list_for_each_prev(struct list_head *cursor, struct list_head *list)
    Same as list_for_each() but in reverse

list_for_each_safe(struct list_head *cursor, struct list_head *next, struct list_head *list)
    Same as list_for_each() but saves next entry in list in case current entry is removed.

list_for_each_entry(type *cursor, struct list_head *list, member)
list_for_each_entry_safe(type *cursor, type *next, struct list_head *list, member)
    Same as before, but avoids having to use list_entry() in loop by implementing the functionality directly.
    "cursor" is the pointer to the custom struct type
    "member" is name of list in custom struct type

Other type of list defined, "hlist" => same, but head has only got one pointer.
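The primitives above can be tried out in user space with a small re-implementation. The sketch below is a simplified model written for this purpose, not the contents of <linux/list.h>: it reuses the kernel's names for readability, list_insert_between(), sum_values() and remove_value() are hypothetical helpers, and no locking is shown.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head whose pointers refer back to itself. */
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

/* Link "new" between two entries known to be adjacent. */
static inline void list_insert_between(struct list_head *new,
                                       struct list_head *prev,
                                       struct list_head *next)
{
    next->prev = new;
    new->next = next;
    new->prev = prev;
    prev->next = new;
}

static inline void list_add(struct list_head *new, struct list_head *head)
{
    list_insert_between(new, head, head->next);   /* right after head */
}

static inline void list_add_tail(struct list_head *new, struct list_head *head)
{
    list_insert_between(new, head->prev, head);   /* right before head */
}

static inline void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = entry->prev = NULL;
}

static inline int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Step back from the embedded list_head to the containing structure. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

#define list_for_each_safe(pos, n, head) \
    for (pos = (head)->next, n = pos->next; pos != (head); pos = n, n = pos->next)

struct my_struct {
    struct list_head list;
    int value;
};

/* Read-only traversal: list_for_each() is sufficient. */
int sum_values(struct list_head *head)
{
    struct list_head *cursor;
    int sum = 0;

    list_for_each(cursor, head)
        sum += list_entry(cursor, struct my_struct, list)->value;
    return sum;
}

/* Removal during traversal: the _safe variant keeps the next pointer. */
int remove_value(struct list_head *head, int value)
{
    struct list_head *cursor, *next;
    int removed = 0;

    list_for_each_safe(cursor, next, head) {
        if (list_entry(cursor, struct my_struct, list)->value == value) {
            list_del(cursor);
            removed++;
        }
    }
    return removed;
}
```

Embedding the list_head inside the payload structure is what lets one set of list functions serve every structure type; list_entry() is the step that recovers the payload.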

Appendix C. Kernel integration
1. Kernel layout: Where are the drivers?
2. Kernel build system
3. Kernel config system
4. Adding a driver to the kernel sources
5. Creating patches
6. Distributing work and interfacing with the community

1. Kernel layout: Where are the drivers?

[Figure: kernel source layout, from applications down to hardware. Labels: arch/ARCH/kernel/entry.S; kernel/* (capability.c, sched.c, fork.c, sys.c, softirq.c, exit.c, panic.c, ...); mm/*; arch/ARCH/mm/; fs/pipe.c, fs/fifo.c, ipc/*, kernel/signal.c; net/*; fs/*; fs/*/*; arch/ARCH/*; drivers/net/*; drivers/block/*; drivers/char/*; arch/ARCH/kernel: irq.c, traps.c. Hardware: CPU, basic hardware, main memory, NIC, HD.]

arch            45MB   => architecture-dependent functionality
Documentation   8MB    => main kernel documentation
drivers         100MB  => all drivers
fs              19MB   => virtual filesystem and all fs types
include         32MB   => complete kernel headers
init            108KB  => kernel startup code
ipc             172KB  => System V IPC
kernel          1.0MB  => core kernel code
mm              816KB  => memory management
net             10MB   => networking core and protocols
scripts         1.1MB  => scripts used to build kernel

drivers/
acorn       acpi        atm         base        block       bluetooth
cdrom       char        cpufreq     crypto      dio         eisa
fc4         firmware    i2c         ide         ieee1394    infiniband
input       isdn        macintosh   mca         md          media
message     misc        mmc         mtd         net         nubus
oprofile    parisc      parport     pci         pcmcia      pnp
s390        sbus        scsi        serial      tc          telephony
usb         video       w1          zorro

2. Kernel build system
drivers/Makefile
#
# Makefile for the Linux kernel device drivers.
#
# 15 Sep 2000, Christoph Hellwig <hch@infradead.org>
# Rewritten to use lists instead of if-statements.
#
obj-$(CONFIG_PCI)        += pci/
obj-$(CONFIG_PARISC)     += parisc/
obj-y                    += video/
obj-$(CONFIG_ACPI_BOOT)  += acpi/
# PnP must come after ACPI since it will eventually need to check if acpi
# was used and do nothing if so
obj-$(CONFIG_PNP)        += pnp/
# char/ comes before serial/ etc so that the VT console is the boot-time
# default.
obj-y                    += char/
# i810fb and intelfb depend on char/agp/
obj-$(CONFIG_FB_I810)    += video/i810/
obj-$(CONFIG_FB_INTEL)   += video/intelfb/
# we also need input/serio early so serio bus is initialized by the time
# serial drivers start registering their serio ports
obj-$(CONFIG_SERIO)      += input/serio/
obj-y                    += serial/
obj-$(CONFIG_PARPORT)    += parport/
obj-y                    += base/ block/ misc/ net/ media/
obj-$(CONFIG_NUBUS)      += nubus/
obj-$(CONFIG_ATM)        += atm/
obj-$(CONFIG_PPC_PMAC)   += macintosh/
obj-$(CONFIG_IDE)        += ide/

drivers/char/Makefile
#
# Makefile for the kernel character device drivers.
#
#
# This file contains the font map for the default (hardware) font
#
FONTMAPFILE = cp437.uni
obj-y                        += mem.o random.o tty_io.o n_tty.o tty_ioctl.o
obj-$(CONFIG_LEGACY_PTYS)    += pty.o
obj-$(CONFIG_UNIX98_PTYS)    += pty.o
obj-y                        += misc.o
obj-$(CONFIG_VT)             += vt_ioctl.o vc_screen.o consolemap.o \
                                consolemap_deftbl.o selection.o keyboard.o
obj-$(CONFIG_HW_CONSOLE)     += vt.o defkeymap.o
obj-$(CONFIG_MAGIC_SYSRQ)    += sysrq.o
obj-$(CONFIG_ESPSERIAL)      += esp.o
obj-$(CONFIG_MVME147_SCC)    += generic_serial.o vme_scc.o
obj-$(CONFIG_MVME162_SCC)    += generic_serial.o vme_scc.o
obj-$(CONFIG_BVME6000_SCC)   += generic_serial.o vme_scc.o
obj-$(CONFIG_ROCKETPORT)     += rocket.o
obj-$(CONFIG_SERIAL167)      += serial167.o
obj-$(CONFIG_CYCLADES)       += cyclades.o
obj-$(CONFIG_STALLION)       += stallion.o
obj-$(CONFIG_ISTALLION)      += istallion.o
obj-$(CONFIG_DIGIEPCA)       += epca.o
obj-$(CONFIG_SPECIALIX)      += specialix.o
obj-$(CONFIG_MOXA_INTELLIO)  += moxa.o
obj-$(CONFIG_A2232)          += ser_a2232.o generic_serial.o
obj-$(CONFIG_ATARI_DSP56K)   += dsp56k.o
...

For full details:

Documentation/kbuild/makefiles.txt

3. Kernel config system
drivers/Kconfig
# drivers/Kconfig
menu "Device Drivers"
source "drivers/base/Kconfig"
source "drivers/mtd/Kconfig"
source "drivers/parport/Kconfig"
source "drivers/pnp/Kconfig"
source "drivers/block/Kconfig"
source "drivers/ide/Kconfig"
source "drivers/scsi/Kconfig"
source "drivers/cdrom/Kconfig"
source "drivers/md/Kconfig"
source "drivers/message/fusion/Kconfig"
source "drivers/ieee1394/Kconfig"
source "drivers/message/i2o/Kconfig"
source "drivers/macintosh/Kconfig"
source "net/Kconfig"
source "drivers/isdn/Kconfig"

drivers/char/Kconfig
#
# Character device configuration
#
menu "Character devices"
config VT
    bool "Virtual terminal" if EMBEDDED
    select INPUT
    default y if !VIOCONS
    ---help---
      If you say Y here, you will get support for terminal devices with
      ...
config VT_CONSOLE
    bool "Support for console on virtual terminal" if EMBEDDED
    depends on VT
    default y
    ---help---
      The system console is the device which receives all kernel messages
      ...
config HW_CONSOLE
    bool
    depends on VT && !S390 && !UML
    default y
config SERIAL_NONSTANDARD
    bool "Non-standard serial port support"
    ---help---
      Say Y here if you have any non-standard serial boards -- boards
      ...

For full details:

Documentation/kbuild/kconfig-language.txt

4. Adding a driver to the kernel sources

Create directory in drivers/ or drivers/*
Put driver sources in created directory
Modify drivers/Makefile or drivers/*/Makefile to recognize new directory.
Add proper Makefile to your driver's directory
Modify drivers/Kconfig or drivers/*/Kconfig to show driver in kernel config.
Add proper Kconfig to your driver's directory

5. Creating patches

Patch basics
    A patch is a file containing differences between a certain official kernel version and a modified version.
    Patches are most commonly created using a command line that looks like:
    $ diff -urN original-kernel modified-kernel > my_patch
    Patches can be incremental (e.g. need to apply patches A, B and C before applying patch D)
    Patches will easily break if not applied to the exact kernel version they were created for.

Analyzing a patch's content:
$ diffstat -p1 my_patch

Testing a patch before applying it:
$ cp my_patch ${PRJROOT}/kernel/linux-2.6.11
$ cd ${PRJROOT}/kernel/linux-2.6.11
$ patch --dry-run -p1 < my_patch

Applying patches:
$ patch -p1 < my_patch

6. Distributing work and interfacing with the community

Create project web page (possibly on SourceForge)
Post patches to LKML
    Title:
        [PATCH] n/x
        [PATCH/RFC] n/x
    Signed-off-by: Your Name <your@mail.tld>
Integrate community feedback
Continue posting updated patches

Appendix D. Porting the kernel
1. Per-arch kernel layout
2. Kernel startup
3. Key definitions
4. Interfacing between bootloader and OS

1. Per-arch kernel layout
$ ll arch/ppc
total 104
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 4xx_io
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 8260_io
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 8xx_io
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 amiga
drwxr-xr-x  10 karim karim  4096 Jun 17 15:48 boot
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 configs
-rw-r--r--   1 karim karim 36331 Jun 17 15:48 Kconfig
-rw-r--r--   1 karim karim  1747 Jun 17 15:48 Kconfig.debug
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 kernel
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 lib
-rw-r--r--   1 karim karim  4475 Jun 17 15:48 Makefile
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 math-emu
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mm
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 oprofile
drwxr-xr-x   5 karim karim  4096 Jun 17 15:48 platforms
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 syslib
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 xmon

$ ll arch/mips/
total 220
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 arc
drwxr-xr-x  12 karim karim  4096 Jun 17 15:48 au1000
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 boot
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 cobalt
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 configs
drwxr-xr-x   6 karim karim  4096 Jun 17 15:48 ddb5xxx
drwxr-xr-x   4 karim karim  4096 Jun 17 15:48 dec
-rw-r--r--   1 karim karim 18490 Jun 17 15:48 defconfig
drwxr-xr-x   3 karim karim  4096 Jun 17 15:48 galileo-boards
drwxr-xr-x   5 karim karim  4096 Jun 17 15:48 gt64120
drwxr-xr-x   5 karim karim  4096 Jun 17 15:48 ite-boards
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 jazz
drwxr-xr-x   4 karim karim  4096 Jun 17 15:48 jmr3927
-rw-r--r--   1 karim karim 42476 Jun 17 15:48 Kconfig
-rw-r--r--   1 karim karim  2564 Jun 17 15:48 Kconfig.debug
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 kernel
drwxr-xr-x   3 karim karim  4096 Jun 17 15:48 lasat
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 lib
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 lib-32
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 lib-64
-rw-r--r--   1 karim karim 21398 Jun 17 15:48 Makefile
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 math-emu
drwxr-xr-x   6 karim karim  4096 Jun 17 15:48 mips-boards
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mm
drwxr-xr-x   6 karim karim  4096 Jun 17 15:48 momentum
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 oprofile
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 pci
drwxr-xr-x   3 karim karim  4096 Jun 17 15:48 pmc-sierra
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 sgi-ip22
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 sgi-ip27
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 sgi-ip32
drwxr-xr-x   5 karim karim  4096 Jun 17 15:48 sibyte
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 sni
drwxr-xr-x   4 karim karim  4096 Jun 17 15:48 tx4927

$ ll arch/arm/
total 156
drwxr-xr-x   4 karim karim  4096 Jun 17 15:48 boot
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 common
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 configs
-rw-r--r--   1 karim karim 21624 Jun 17 15:48 Kconfig
-rw-r--r--   1 karim karim  3847 Jun 17 15:48 Kconfig.debug
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 kernel
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 lib
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-clps711x
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-clps7500
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-ebsa110
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-epxa10db
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-footbridge
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-h720x
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-imx
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-integrator
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-iop3xx
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-ixp2000
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-ixp4xx
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-l7200
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-lh7a40x
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-omap
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-pxa
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-rpc
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-s3c2410
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-sa1100
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-shark
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-versatile
-rw-r--r--   1 karim karim  7636 Jun 17 15:48 Makefile
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mm
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 nwfpe
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 oprofile
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 tools
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 vfp

$ ll arch/i386/
total 140
drwxr-xr-x   4 karim karim  4096 Jun 17 15:48 boot
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 crypto
-rw-r--r--   1 karim karim 26742 Jun 17 15:48 defconfig
-rw-r--r--   1 karim karim 42227 Jun 17 15:48 Kconfig
-rw-r--r--   1 karim karim  2255 Jun 17 15:48 Kconfig.debug
drwxr-xr-x   5 karim karim  4096 Jun 17 15:48 kernel
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 lib
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-default
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-es7000
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-generic
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-visws
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mach-voyager
-rw-r--r--   1 karim karim  6341 Jun 17 15:48 Makefile
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 math-emu
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 mm
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 oprofile
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 pci
drwxr-xr-x   2 karim karim  4096 Jun 17 15:48 power

2. Kernel startup
Explanation for TQM860 PPC board

0. Kernel entry point:
arch/ppc/boot/common/crt0.S:_start

1. _start calls on:
arch/ppc/boot/simple/head.S:start

2. start calls on:
arch/ppc/boot/simple/relocate.S:relocate

3. relocate calls on:
arch/ppc/boot/simple/misc-embedded.c:load_kernel()

4. load_kernel() initializes the serial line and uncompresses kernel starting at address 0.

6. load_kernel() returns to relocate
7. relocate jumps to address 0x00000000, where kernel start address is.
8. arch/ppc/kernel/head_8xx.S:__start
9. __start eventually calls init/main.c:start_kernel()
10. start_kernel() does:
    1. Locks kernel
    2. setup_arch()
    3. sched_init()
    4. parse_args()
    5. trap_init()
    6. init_IRQ()
    7. time_init()
    8. console_init()
    9. mem_init()
    10. calibrate_delay() => loops_per_jiffy
    11. rest_init()

11. rest_init() does:
    1. Start init thread
    2. Unlocks the kernel
    3. Becomes the idle task

12. The init task:
    1. lock_kernel()
    2. do_basic_setup() => call various init() fcts
    3. prepare_namespace() => mount rootfs
    4. free_initmem()
    5. unlock_kernel()
    6. execve() on the init program (/sbin/init)

3. Key definitions
$ ll include/
total 148
...
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-alpha
drwxr-xr-x 24 karim karim  4096 Jun 17 15:48 asm-arm
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-arm26
drwxr-xr-x  3 karim karim  4096 Jun 17 15:48 asm-cris
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-frv
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-generic
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-h8300
drwxr-xr-x 10 karim karim  4096 Jun 17 15:48 asm-i386
drwxr-xr-x  3 karim karim  4096 Jun 17 15:48 asm-ia64
drwxr-xr-x  5 karim karim  4096 Jun 17 15:48 asm-m32r
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-m68k
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-m68knommu
drwxr-xr-x 45 karim karim  4096 Jun 17 15:48 asm-mips
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-parisc
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-ppc
drwxr-xr-x  3 karim karim  4096 Jun 17 15:48 asm-ppc64
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-s390
drwxr-xr-x 31 karim karim  4096 Jun 17 15:48 asm-sh
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-sh64
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-sparc
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-sparc64
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-um
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-v850
drwxr-xr-x  2 karim karim  4096 Jun 17 15:48 asm-x86_64
drwxr-xr-x 18 karim karim 12288 Jun 17 15:48 linux
...

$ ll include/asm-mips/
...
-rw-r--r--  1 karim karim  519 Jun 17 15:48 m48t35.h
-rw-r--r--  1 karim karim  696 Jun 17 15:48 m48t37.h
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-atlas
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-au1x00
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-db1x00
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-ddb5074
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-dec
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-ev64120
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-ev96100
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-generic
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-ip22
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-ip27
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-ip32
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-ja
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-jazz
drwxr-xr-x  3 karim karim 4096 Jun 17 15:48 mach-jmr3927
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-lasat
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-mips
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-ocelot
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-ocelot3
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-pb1x00
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-rm200
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-sibyte
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-vr41xx
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mach-yosemite
-rw-r--r--  1 karim karim 1608 Jun 17 15:48 marvell.h
-rw-r--r--  1 karim karim  450 Jun 17 15:48 mc146818rtc.h
-rw-r--r--  1 karim karim 4189 Jun 17 15:48 mc146818-time.h
drwxr-xr-x  2 karim karim 4096 Jun 17 15:48 mips-boards
...

$ ll include/asm-mips/vr41xx/
total 52
-rw-r--r--  1 karim karim 1489 Jun 17 15:48 capcella.h
-rw-r--r--  1 karim karim 1856 Jun 17 15:48 cmbvr4133.h
-rw-r--r--  1 karim karim 1497 Jun 17 15:48 e55.h
-rw-r--r--  1 karim karim 1174 Jun 17 15:48 mpc30x.h
-rw-r--r--  1 karim karim 2559 Jun 17 15:48 pci.h
-rw-r--r--  1 karim karim 1417 Jun 17 15:48 siu.h
-rw-r--r--  1 karim karim 1411 Jun 17 15:48 tb0219.h
-rw-r--r--  1 karim karim 1439 Jun 17 15:48 tb0226.h
-rw-r--r--  1 karim karim 6151 Jun 17 15:48 vr41xx.h
-rw-r--r--  1 karim karim 6728 Jun 17 15:48 vrc4173.h
-rw-r--r--  1 karim karim 1492 Jun 17 15:48 workpad.h

4. Interfacing between bootloader and OS

Very bootloader dependent
Information sources:
    The U-Boot documentation has a good explanation of the interface between it and Linux.
    Comments and code in the Linux sources for your arch
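
As one illustration of such an interface, the sketch below assumes the register convention commonly used between U-Boot and arch/ppc kernels of this era: r3 points to board information, r4 and r5 bound the initrd, and r6 and r7 bound the kernel command line. This is an assumption to verify against your board's platform code and the U-Boot documentation, not a specification. The struct, the function name and CMDLINE_MAX are hypothetical, a plain buffer stands in for physical memory, and the board-information block is left out because its layout is board specific.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CMDLINE_MAX 512   /* illustrative size, not the kernel's constant */

/* Hypothetical container for what the kernel keeps from the bootloader. */
struct boot_params {
    unsigned long initrd_start;
    unsigned long initrd_end;
    char cmdline[CMDLINE_MAX];
};

/* `mem` stands in for physical memory; r4..r7 are offsets into it, named
 * after the registers assumed to carry them at kernel entry. */
void parse_boot_regs(const char *mem,
                     unsigned long r4, unsigned long r5,
                     unsigned long r6, unsigned long r7,
                     struct boot_params *out)
{
    assert(mem != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    /* A zero r4 is taken to mean "no initrd was loaded". */
    if (r4 != 0) {
        out->initrd_start = r4;
        out->initrd_end = r5;
    }

    /* Copy the command line found in [r6, r7), truncating if it is too
     * long, and always terminate it. */
    if (r6 != 0 && r7 > r6) {
        size_t len = (size_t)(r7 - r6);

        if (len >= CMDLINE_MAX)
            len = CMDLINE_MAX - 1;
        memcpy(out->cmdline, mem + r6, len);
        out->cmdline[len] = '\0';
    }
}
```

Whatever the actual convention on a given board, the shape is the same: the kernel's earliest platform code copies what the bootloader left behind into its own variables before that memory is reused.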
