Resource management: Linux kernel Namespaces and cgroups

(1)

Rami Rosen

[email protected]

Haifux, May 2013

Resource management:

Linux kernel Namespaces and

cgroups

(2)

General

The presentation deals with two Linux process resource

management solutions: namespaces and cgroups.

We will look at:

●

Kernel Implementation details.

●

what was added/changed in brief.

●

User space interface.

●

Some working examples.

●

Usage of namespaces and cgroups in other projects.

●

Is process virtualization indeed lightweight comparing to Os

virtualization ?

(4)

Namespaces

● Namespaces - lightweight process virtualization.

– Isolation: Enable a process (or several processes) to have different

views of the system than other processes.

– 1992: “The Use of Name Spaces in Plan 9”

– http://www.cs.bell-labs.com/sys/doc/names.html

● Rob Pike et al, ACM SIGOPS European Workshop 1992.

– Much like Zones in Solaris.

– No hypervisor layer (as in OS virtualization like KVM, Xen)

– Only one system call was added (setns())

– Used in Checkpoint/Restart

(5)

Namespaces - contd

There are currently 6 namespaces:

● mnt (mount points, filesystems)

● pid (processes)

● net (network stack)

● ipc (System V IPC)

● uts (hostname)

(6)

Namespaces - contd

It was intended that there will be 10 namespaces: the following 4 namespaces are not implemented (yet):

● security namespace

● security keys namespace

● device namespace

● time namespace.

– There was a time namespace patch – but it was not applied.

– See: PATCH 0/4 - Time virtualization:

– http://lwn.net/Articles/179825/

● see ols2006, "Multiple Instances of the Global Linux Namespaces" Eric

(7)

Namespaces - contd

● Mount namespaces were the first type of namespace to be

implemented on Linux by Al Viro, appearing in 2002.

–

Linux 2.4.19.

● CLONE_NEWNS flag was added (stands for “new namespace”; at

that time, no other namespace was planned, so it was not called new mount...)

● User namespace was the last to be implemented. A number of Linux

(8)

Implementation details

●

Implementation (partial):

- 6 CLONE_NEW * flags were added:

(include/linux/sched.h)

●

These flags (or a combination of them) can be

used in

clone()

or

unshare()

syscalls to create a

namespace.

(9)

CLONE_NEWNS 2.4.19 CAP_SYS_ADMIN

CLONE_NEWUTS 2.6.19 CAP_SYS_ADMIN

CLONE_NEWIPC 2.6.19 CAP_SYS_ADMIN

CLONE_NEWPID 2.6.24 CAP_SYS_ADMIN

CLONE_NEWNET 2.6.29 CAP_SYS_ADMIN

(10)

Implementation - contd

● Three system calls are used for namespaces:

● clone() - creates a new process and a new namespace; the

process is attached to the new namespace.

– Process creation and process termination methods, fork() and exit() methods,

were patched to handle the new namespace CLONE_NEW* flags.

● unshare() - does not create a new process; creates a new

namespace and attaches the current process to it.

– unshare() was added in 2005, but not for namespaces only, but also for security. see “new system call, unshare” : http://lwn.net/Articles/135266/

● setns() - a new system call was added, for joining an existing

(11)

Nameless namespaces

From man (2) clone:

...

int clone(int (fn)(void ), void *child_stack,

int flags, void *arg, ...

/* pid_t ptid, struct user_desc tls, pid_t ctid / );

...

●

Flags is the CLONE_* flags, including the namespaces

CLONE_NEW* flags. There are more than 20 flags in total.

●

See include/uapi/linux/sched.h

●

There is

no

parameter of a namespace

name

.

●

How do we know if two processes are in the same namespace ?

●

Namespaces do not have names.

(12)

Nameless namespaces

●

ls -al /proc/<pid>/ns

lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]

You can use also

readlink.

(13)

Implementation - contd

●

A member named

nsproxy

was added to the process descriptor

, struct task_struct.

●

A method named

task_nsproxy(struct task_struct tsk)*

,

to access

the nsproxy of a specified process.

(include/linux/nsproxy.h)

●

nsproxy

includes 5 inner namespaces:

●

uts_ns, ipc_ns, mnt_ns, pid_ns, net_ns;

Notice that user ns is

missing

in this list,

●

it is a member of the credentials object (struct cred) which is a

(14)

Implementation - contd

●

Kernel config items:

CONFIG_NAMESPACES CONFIG_UTS_NS CONFIG_IPC_NS CONFIG_USER_NS CONFIG_PID_NS CONFIG_NET_NS

●

user space additions:

●

IPROUTE

package

●

some additions like

ip netns add/ip netns del

and more.

●

util-linux

package

●

unshare

util with support for all the 6 namespaces.

(15)

UTS namespace

●

uts - (Unix timesharing)

–

Very simple to implement.

Added a member named uts_ns (uts_namespace object) to the

nsproxy. _{process descriptor}

(task_struct)

nsproxy

uts_ns (uts_namespace object) name (new_utsname object)

sysname nodename release

(16)

UTS namespace - contd

The

old

implementation of

gethostname()

:

asmlinkage long sys_gethostname(char __user *name, int len)

{

...

if (copy_to_user(name, system_utsname.nodename, i))

... errno = -EFAULT;

}

(system_utsname is a global)

kernel/sys.c, Kernel v2.6.11.5

(17)

UTS namespace - contd

A Method called utsname() was added:

static inline struct new_utsname *utsname(void) {

return &current->nsproxy->uts_ns->name; }

The

new

implementation of

gethostname()

:

SYSCALL_DEFINE2(gethostname, char __user *, name, int, len) {

struct new_utsname *u; ...

u = utsname();

if (copy_to_user(name, u->nodename, i)) errno = -EFAULT;

(18)

UTS namespace - Example

We have a machine where hostname is myoldhostname.

uname -n

myoldhostname

unshare -u /bin/bash

This create a UTS namespace by unshare()

syscall and call

execvp()

for invoking bash.

Then:

hostname mynewhostname

uname -n

mynewhostname

(19)

UTS namespace - Example

nsexec

nsexec is a package by Serge Hallyn; it consists of a

program called nsexec.c which creates tasks in new

namespaces (there are some more utils in it) by clone() or by

unshare() with fork().

https://launchpad.net/~serge-hallyn/+archive/nsexec

Again we have a machine where hostname is myoldhostname.

uname -n

(20)

IPC namespaces

The same principle as uts , nothing

special, more code.

Added a member named

ipc_ns

(ipc_namespace object) to the nsproxy.

(21)

Network Namespaces

● A network namespace is logically another copy of the network stack,

with its own routes, firewall rules, and network devices.

● The network namespace is struct net. (defined in include/net/net_namespace.h)

Struct net includes all network stack ingredients, like: – Loopback device.

– SNMP stats. (netns_mib)

– All network tables:routing, neighboring, etc. – All sockets

(22)

Implementations guidelines

• A network device belongs to exactly one network

namespace.

●

Added to struct

net_device

structure:

●

struct net

*nd_net;

for the Network namespace this network device is inside.

●

Added a method:

dev_net(const struct net_device dev)*

to access the nd_net namespace of a network device.

• A socket belongs to exactly one network namespace.

●

Added

sk_net

to struct

sock

(also a pointer to struct net), for the

Network namespace this socket is inside.

●

Added

sock_net()

and

sock_net_set()

methods (get/set network

(23)

Network Namespaces - contd

● Added a system wide linked list of all namespaces: net_namespace_list,

and a macro to traverse it (for_each_net())

● The initial network namespace, init_net (instance of struct net), includes

the loopback device and all physical devices, the networking tables, etc.

● Each newly created network namespace includes only the loopback device. ● There are no sockets in a newly created namespace:

netstat -nl

Active Internet connections (only servers)

Proto Recv-Q Send-Q Local Address Foreign Address State Active UNIX domain sockets (only servers)

(24)

Example

● Create two namespaces, called "myns1" and "myns2":

● ip netns add myns1

● ip netns add myns2

– (In fedora 18, ip netns is included in the iproute package).

● This triggers:

● creation of /var/run/netns/myns1,/var/run/netns/myns2 empty folders

● calling the unshare() system call with CLONE_NEWNET.

– unshare() does not trigger cloning of a process; it does create

a new namespace (a network namespace, because of the CLONE_NEWNET flag).

(25)

● You can use the file descriptor of /var/run/netns/myns1 with the setns() system call. ● From man 2 setns:

...

int setns(int fd, int nstype); DESCRIPTION

Given a file descriptor referring to a namespace, reassociate the calling thread with that namespace.

...

● In case you pass 0 as nstype, no check is done about the fd.

● In case you pass some nstype, like CLONE_NEWNET of CLONE_NEWUTS, the method verifies that the specified nstype corresponds to the specified fd.

(26)

Network Namespaces - delete

● You delete a namespace by:

● ip netns del myns1

–

This unmounts and removes /var/run/netns/myns1

– see netns_delete() in ipnetns.c

– _{Will not delete a network namespace if there is one or more processes attached to it.}

● Notice that after deleting a namespace, all its migratable network devices

are moved to the default network namespace;

● unmoveable devices (devices who have NETIF_F_NETNS_LOCAL in their

features) and virtual devices are not moved to the default network namespace.

● (The semantics of migratable network devices and unmoveable devices

(27)

NETIF_F_NETNS_LOCAL

● NETIF_F_NETNS_LOCAL ia a network device feature

– _{(a member of net_device struct, of type netdev_features_t)}

● It is set for devices that are not allowed to move between network namespaces; sometime

these devices are named "local devices".

● Example for local devices (where NETIF_F_NETNS_LOCAL is set): – Loopback, VXLAN, ppp, bridge.

– You can see it with ethtool (by ethtool -k, or ethtool

–show-features)

– ethtool -k p2p1

netns-local: off [fixed]

For the loopback device:

(28)

VXLAN

● Virtual eXtensible Local Area Network.

● VXLAN is a standard protocol to transfer layer 2 Ethernet packets

over UDP.

● Why do we need it ?

● There are firewalls which block tunnels and allow, for example, only

TCP/UDP traffic.

● developed by Stephen Hemminger.

– drivers/net/vxlan.c

– _{IANA assigned port is 4789}

(29)

When trying to move a device with NETIF_F_NETNS_LOCAL flag, like VXLAN, from one namespace to another, we will encounter an error: ip link add myvxlan type vxlan id 1

ip link set myvxlan netns myns1

We will get: RTNETLINK answers: Invalid argument

int dev_change_net_namespace(struct net_device *dev, struct net *net, const char *pat) {

int err;

err = -EINVAL;

if (dev->features & NETIF_F_NETNS_LOCAL) goto out;

(30)

● You list the network namespaces (which were added via “ ip netns

add”)

● ip netns list

–

this simply reads the namespaces under:

/var/run/netns

● You can find the pid (or list of pids) in a specified net namespace by:

–

ip netns pids namespaceName

● You can find the net namespace of a specified pid by:

(31)

You can monitor addition/removal of network

namespaces by:

ip netns monitor

- prints one line for each addition/removal event it

sees

(32)

● Assigning p2p1 interface to myns1 network namespace: ● ip link set p2p1 netns myns1

– _{This triggers changing the network namespace of the net_device to “myns1”.}

– It is handled by dev_change_net_namespace(), net/core/dev.c.

● Now, running:

● ip netns exec myns1 bash

● will transfer me to myns1 network namespaces; so if I will run there: ● ifconfig -a

● I will see p2p1 (and the loopback device);

– Also under /sys/class/net, there will be only p2p1 and lo folders.

● But if I will open a new terminal and type ifconifg -a, I will not see

(33)

● Also, when going to the second namespace by running:

● ip netns exec myns2 bash

● will transfer me to myns2 network namespace; but if we will run

there:

● ifconfig -a

– _{We will not see p2p1; we will only see the loopback device.}

● We move a network device to the default, initial namespace by:

(34)

● In that namespace, network application which look for files under

/etc, will first look in /etc/netns/myns1/, and then in /etc.

● For example, if we will add the following entry "192.168.2.111

www.dummy.com"

● in /etc/netns/myns1/hosts, and run:

● ping www.dummy.com

(35)

veth

● You can communicate between two network namespaces by:

● creating a pair of network devices (veth) and move one to another

network namespace.

● Veth (Virtual Ethernet) is like a pipe.

● unix sockets (use paths on the filesystems).

Example with veth:

Create two namesapces, myns1 and myns1: ip netns add myns1

(36)

veth

ip netns exec myns1 bash

- open a shell of myns1 net namespace

ip link add name if_one type veth peer name if_one_peer

-

create veth interface, with

if_one

and

if_one_peer

- ifconfig running in myns1 will show

if_one

and

if_one_peer

and

lo

(the loopback device)

- ifconfig running in myns2 will show

only

lo

(the loopback

device)

Run from myns1 shell:

ip link set dev if_one_peer netns myns2

move

if_one_peer

to myns2

- now ifconfig running in myns2 will show

if_one_peer

and

lo

(the loopback device)

(37)

unshare util

● The unshare utility

● Util-linux recent git tree has the unshare utility with support for all six namespaces:

http://git.kernel.org/cgit/utils/util-linux/util-linux.git

./unshare –help

...

Options:

-m, --mount unshare mounts namespace

-u, --uts unshare UTS namespace (hostname etc) -i, --ipc unshare System V IPC namespace

-n, --net unshare network namespace -p, --pid unshare pid namespace

(38)

http://ramirose.wix.com/ramirosen 39/121

● For example:

● Type:

● ./unshare --net bash

– A new network namespace was generated and the bash process was

generated inside that namespace.

● Now run ifconfig -a

● You will see only the loopback device.

– With unshare util, no folder is created under /var/run/netns;

also network application in the net namespace we created, do not look under /etc/netns

– If you will kill this bash or exit from this bash, then the network

namespace will be freed.

(39)

htt p: // ra m ir os e.w ix .c om /r am ir os en 40 /121

t t

he

ca

se

a

s w

ith

ip

n

etn

s

exe

c my

ns1

b

as

h; i

n th

at

lli

ng

/e

xi

tin

g th

e b

ash

d

oe

s

no

t t

rig

ge

r d

estr

oyi

ng

t

he

ace

.

le

m

en

ta

tio

n d

eta

ils, l

oo

k

in

stru

ct n

et

*

ne

t) a

nd

th

e r

efe

re

nce

co

un

t (

na

me

d “c

ou

nt”

)

ork n

ame

sp

ace

str

uct n

et.

(40)

Mount

n

ames

pa

ces

ed

a

me

mber

named

mn

t_

n

s

na

m

e

sp

ac

e

ob

je

ct

)

to

th

e

ns

pr

oxy

.

p

y

th

e

m

ou

n

t n

am

es

pa

ce

o

f t

he

c

all

in

g

p

ro

ce

ss

ge

ne

ric

fi

le

sys

te

m

et

ho

d

(

se

e

co

py_

tr

ee

()

in

_ns

()

).

e

n

ew

m

ou

nt

n

a

m

es

pa

ce

, a

ll

pr

ev

io

us

m

ou

n

ts

wil

l b

e

;

an

d

fr

om

n

ow

o

n:

nt

s/

u

nm

ou

nt

s

in

th

at

m

ou

n

t n

am

es

pa

ce

a

re

in

visib

le

to

t o

f th

e s

ys

te

m

.

nt

s/

u

nm

ou

nt

s

in

th

e

glo

ba

l n

am

es

p

ac

e

ar

e

vis

ib

le

in

am

es

pa

ce

.

am

esp

ace

mo

du

le

u

se

s

mo

un

t n

am

esp

ace

s (w

ith

(C

LON

E

_N

E

W

N

S

) )

es/p

am_

na

me

sp

ace

/p

am_

na

me

sp

ace

.c)

(41)

n

ames

pac

es

: ex

am

pl

e

1

ple

1 (t

est

ed

on

U

bun

tu

):

th

at

/d

ev

/s

da

3 is

n

ot

m

ou

nt

ed

:

gr

e

p

/dev

/s

da3

ld

g

iv

e

no

th

in

g.

-m

/bi

n

/bas

h

ev

/s

da

3 /m

n

t/sd

a3

m

ount

|

g

rep

sd

a3

s

ee

:

3 o

n

/m

nt

/s

da

3 ty

pe

e

xt

3 (r

w)

/p

ro

c/

$$

/n

s/

m

nt

11 4]

(42)

a

no

th

er

te

rm

in

al

ru

n

link

/p

roc

/$

$/

ns

/m

nt

:[4

02

653

18 40]

ult

s

sh

ows

tha

t w

e

are

in

a

dif

fe

re

nt

esp

ac

e.

r

un

:

nt

|

grep

s

da

3 v/

sda

3 on

/m

nt

/s

da

3 ty

pe

e

xt

3 (r

w

)

?

W

e

are

in

a

d

iff

ere

nt

m

oun

t n

ame

spa

ce

?

ou

ld

h

av

e

no

t se

e

the

m

ou

nt

wh

ic

h

w

as

fr

om

an

ot

her

name

spa

ce

!

(43)

ns

w

er

is

s

imple

: r

un

nin

g

mou

nt

is

n

ot

g

ood

gh

w

he

n

w

ork

ing

w

ith

m

ou

nt

n

am

es

pa

ce

s.

so

n

is

th

at

m

ou

nt

re

ad

s

/e

tc

/m

ta

b

, w

hic

h

at

ed

by

th

e

m

ou

nt

c

om

man

d;

m

ou

nt

an

d

do

es

n

ot

ac

ce

ss

th

e

ke

rn

el

st

ruc

tu

res

.

the

s

ol

ut

io

n?

(44)

cce

ss

d

ire

ct

ly

the

k

erne

l da

ta

st

ru

ct

ure

s,

y

ou

ru

n:

ro

c/

m

ou

nt

s

| g

re

p

sd

a3

c/

mo

un

ts

is

in

fa

ct

sy

m

bo

lic

lin

k

to

se

lf/

m

ou

nt

s).

yo

u

will

get

n

o

re

sult

s,

a

s

exp

ec

ted

.

(45)

n

ames

pac

es

: ex

am

pl

e

2

ple

2:

te

st

ed

o

n

F

ed

ora

18 th

at

/d

ev

/s

db

3 is

n

ot

m

ou

nt

ed

:

gr

e

p

sdb3

ld

g

iv

e

no

th

in

g.

-m

/bi

n

/bas

h

ev

/s

db

3 /m

n

t/sd

b3

m

ount

|

g

rep

sd

b3

s

e

e:

3 o

n

/m

nt

/s

db

3 ty

pe

e

xt

4 (r

w

,r

ela

tim

e,

d

at

a=o

rd

er

ed

)

/p

ro

c/

$$

/n

s/

m

nt

381]

(46)

a

no

th

er

te

rm

in

al

ru

n:

link

/p

roc

/$

$/

ns

/m

nt

02

653

18 40]

ow

s t

ha

t w

e ar

e i

n a di

ffe

rent nam

espac

e.

un:

_nt

|

grep

s

db

3 sd

b3

o

n /m

nt/sd

b3

typ

e e

xt4

(rw

,r

el

ati

me

,d

ata

=

ord

ere

d)

ow

n

ow

th

at

w

e sh

ou

ld

u

se

ca

t /p

ro

c/m

ou

nts

(

an

d

no

t

) to

g

et th

e ri

gh

t a

nsw

er

w

he

n w

orki

ng

w

ith

n

ame

sp

ace

;

so

:

ro

c/

m

ou

nt

s

| g

re

p

sd

b3

sd

b3

/

mn

t/

sd

b3

e

xt4

r

w

,r

el

ati

me

,d

ata

=

ord

ere

d 0

0 t so

?

W

e sh

ou

ld

h

ave

se

en

n

o r

esu

lts,

a

s i

n p

re

vi

ou

s

e.

(47)

htt p: // ra m ir os e.w ix .c om /r am ir os en 48 /121 ed or a ru ns sy st em d ; sys te m d use s th e sh ar ed fla g fo r m ou nt in g /. d so ur ce c od e: ( sr c/ co re /m ou nt -s et up .c ) tu p(bo ol loa ded _p oli cy) { UL L, " /" , NU LL, MS _R E C |MS _S H A RE D , NU LL) < 0 ) rnin g("F ai le d to se t up t he ro ot d ire ct or y for shar ed mou nt p rop aga tion: %m"); C st an ds fo r re cu rsiv e m ou nt ) ow wh et he r w e ha ve a s ha re d fla gs ? /se lf/ m ou nt in fo | g re p sh ar ed e: / rw ,r ela tim e sh ar ed :1 e xt 4 /d ev/ sd a3 r w ,d at a= or de re d o ?

(48)

nt

--m

ak

e-r

priv

at

e

-o

re

m

ou

nt

/

/de

v/

sda

3 ch

an

ges

th

e

sh

ar

ed

fl

ag

to

priva

te

,

rs

iv

ely

.

-rp

riv

at

e

–

set

th

e

priv

at

e

fla

g

re

cu

rsiv

ely

(49)

Sha

re

d

su

bt

ree

s

/u ser s /b in / n t /u ser s/u ser 1 /u ser s/u ser 2 wa nt th at u se r1 a nd u se r2 fo ld er s w ill se e th e w ho le em ; we w ill ru n t –b in d / / u se rs /u se r1 t –b in d / / u se rs /u se r2 fa ult , t he file sys yt em is m ou nt ed a s pr iv at e , th e sh ar ed m ou nt fla g is se t e xp lici tly .

(50)

Shar

ed

subt

rees

-

cont

d

/u ser s /b in / /m n t /u ser s/u ser 1 /u ser s/u ser 2 /mn t /b in /mnt /b in /us ers /u se rs2 /u ser 1 /u ser 2

(51)

S

ha

re

d

sub

tre

es

–

Q

uiz

e mo

un

t a

u

sb

d

is

k

on

ke

y

on

/mn

t/d

ok.

en

in

/u

se

rs/u

se

r1

/mn

t

or

se

r2

/m

nt?

(52)

S

ha

re

d

su

bt

re

es

c

ont

d

w

er

is

no, si

nc

e by defaul

t, the fi

le

sys

ytem

is

p

rivat

e

. T

o

e

na

bl

e that the dok w

ill

be s

ee

n

/us

er

s/user

1/m

nt o

r /user

s/

us

er

2/

m

nt, w

e

d m

o

unt

th

e

fil

es

yste

m

as

shar

ed:

-m

ake-rs

har

ed

ount the usb di

sk on k

ey agai

n

.

re d su bt re es p at ch is fr om 2 00 5 by R am P ai. m e m ou nt fla gs like – m ake -sl av e, --m ak e-rsla ve , -m ake -ble , --m ak e-ru nb in da bl e an d m or e. T he p at ch a dd ed th is ke rn el gs: M S_ UN B IN D AB LE, M S _PR IV A T E , M S_ S LA VE a nd RE D f la g is in us e by th e fus e fil esyst em.

(53)

PI

D

na

m

es

pac

es

em be r na m ed p id _n s (p id _n am es pa ce o bj ec t) to th e in d if fe re nt PI D n am es pa ce s ca n ha ve th e sa m e pr oc es s ID . at in g th e fi rs t p ro ce ss in a n ew n am es pa ce , i ts P ID is 1 . li ke th e “i ni t” p ro ce ss : a pr oc es s di es , a ll it s orp ha ne d ch il dre n w il l n ow h av e th e pro ce ss w it h PI D 1 a s ar en t ( ch il d re ap ing ). g SI G KI L L s ig na l d oes n ot k il l p ro ces s 1 , r eg ar dl es s o f w hi ch n ames pace t he an d w as is su ed (i ni ti al n am es pa ce o r ot he r pi d na m es pa ce ).

(54)

P

ID

names

pac

es

-

con

td

a n ew na m esp ace is cr ea te d, we c an no t se e fr om it th e PI D e pa re nt n am esp ac e; r un nin g ge tpp id( ) fr om th e ne w pid esp ace w ill re tu rn 0 . ll PI D s w hich a re u se d in th is n am esp ace a re vi sib le to th e nt n am esp ac e. na m esp ace s ca n be n est ed , u p to 3 2 ne st in g le ve ls . X_ P ID_ NS _L E VE L) . : m ul ti_ pid ns. c, M ich ae l K er ris k, fr om :// lw n. ne t/Ar tic le s/ 53 27 45 / . tr yin g to r un m ult i_ pid ns w ith 3 3, y ou w ill ge t:

cl

one: Inval

id ar

gum

en

t

(55)

Us

er

Na

m

es

pac

es

d

a

me

m

be

r

na

m

ed

u

se

r_

ns

na

mes

pa

ce

ob

jec

t)

to

th

e

ns

pr

ox

y.

d

e/

lin

ux

/u

se

r_

na

m

esp

ac

e.

h

s

a

po

in

te

r

na

m

ed

pa

re

nt

to

th

e

u

se

r_

na

m

es

pa

ce

at

ed

it

.

se

r_

na

m

esp

ace

*p

ar

en

t;

s

th

e

ffe

ct

iv

e

u

id

o

f t

he

p

ro

ce

ss

th

at

cr

ea

te

d

it:

ow

ne

r;

ce

ss

w

ill

ha

ve

d

is

tin

ct

s

et

o

f UI

D

s,

G

ID

s

pa

bili

tie

s.

(56)

Us

er

Na

m

es

pac

es

g a

n

ew

u

se

r n

am

esp

ace

is

do

ne

b

y p

ass

in

g

E

_N

E

W

U

S

E

R

to

fo

rk

()

o

r

u

n

sh

ar

e()

.

pl

e:

ng fr

om

s

om

e

u

ser

ac

count

s the ef

fecti

ve user

ID

.

s the ef

fecti

ve gr

ou

p ID

.

ly the fi

rst u

ser

a

dd

ed

gets

ui

d/gi

d of 1000

)

(57)

Us

er

Names

pa

ces

-

ex

ample

s:

_/s

elf/s

ta

tus

| gre

p C

ap

0000000000000000

0000001f

ff

re

ate

a

us

er

na

me

spa

ce

a

nd s

ta

rt a

s

he

ll, w

e w

ill run f

rom

root a

cc

ou

nt

:

cU

/bi

n

/bas

h

ag i

s f

or us

ing c

lone

ag i

s f

or us

ing us

er

na

m

es

pa

ce

(

C

L

O

N

E

_N

E

W

U

S

E

R

fla

g f

or

(58)

er

Names

paces

-

ex

ample

-cont

d

om

th

e ne

w

sh

el

l r

un

a

re

d

efa

ul

t va

lu

es fo

r th

e e

U

ID

a

nd

e

GU

ID

In

th

e

ne

w

ace

.

ill

g

et

th

e s

ame

re

su

lts

fo

r

ef

fe

cti

ve

u

se

r i

d a

nd

e

ffe

cti

ve

lso

w

he

n r

un

ni

ng

/n

se

xe

c -c

U

/bi

n

/bas

h

as

r

oot.

fau

lts

c

an

be

c

h

an

ge

d by

:

/pr

oc

/s

ys

/ke

rn

el

/ov

er

fl

ow

u

id,

ys

/ke

rn

el/

ov

er

flow

gid

t, t

h

e

u

se

r n

ame

spa

ce

th

at w

as

c

re

ate

d h

ad f

u

ll c

apabili

tie

s,

cal

l to e

xe

c(

) w

ith

bas

h

r

em

ov

ed th

em

.

(59)

se

lf/sta

tu

s |

g

re

p C

ap

000 000000

0000

000

000 000000

0000

000

000 000000

0000

000

000 0001f

fff

ff

(60)

Us

er

N

ames

pac

es

-

cont

d

u

n:

ho

$$

(

ge

t th

e b

as

h p

id

)

om

a di

ffer

ent

ro

o

t

ter

m

in

al

, w

e

set the ui

d_m

ap:

t, w

e c

an see that ui

d_

m

ap i

s uni

ni

tial

iz

ed

by

:

oc/<

p

id>

/ui

d

_m

ap

o

0

10

00

1

0 >

/p

ro

c/

<

pid>

/u

id

_ma

p

d

>

is

th

e

pid

o

f t

he

b

as

h

pr

o

ce

ss

fr

om

p

re

vi

o

us

st

ep

).

u

id

_m

a

p

is

of

th

e

fo

llo

wi

n

g

fo

rm

at

:

m

esp

ac

e_

fir

st

_u

id

ho

st

_f

irs

t_

u

id

nu

m

be

r_

of

_u

id

s

se

ts

th

e

fir

st

u

id

in

th

e

ne

w

n

am

e

sp

a

ce

(

w

hi

ch

po

nd

to

u

id

1

00

0 in

th

e

o

ut

sid

e

wo

rld

)

to

b

e

0;

th

e

w

ill

b

e

1;

a

nd

so

fo

rt

h,

fo

r

10 e

nt

rie

s.

(61)

Us

er

N

ames

pac

es

-

cont

d

ou

c

an

s

et

th

e

u

id

_m

ap

o

n

ly

o

nc

e

fo

r

a

sp

ec

ifi

c

. F

ur

th

er

a

tte

m

pt

s

will

fa

il.

w

ill

g

et

0 .

i

am

e

sp

ace

is

th

e

o

nl

y

n

am

es

pa

ce

w

hi

ch

ca

n

be

wit

ho

u

t CA

P

_S

Y

S

_A

D

M

IN

ca

pa

bi

lit

y

(62)

htt p: // ra m ir os e.w ix .c om /r am ir os en 63 /121 /se lf/ st at us | g re p C ap : 00 000 00 000 00 000 0 : 00 000 01 fff fff fff 00 000 01 fff fff fff : 00 000 01 fff fff fff pEf f ( E ffe ct ive Ca pa bili te s) is 1 fff fff fff -> th is is 37 b its of '1 ' , ea ns a ll ca pa bili tie s. ill un sh ar e --ne t b as h wo rk n ow ?

(63)

no

_--n

et b

as

h

: ca

nn

ot se

t g

ro

up

id

: I

nva

lid

a

rg

ume

nt

ru

nn

in

g, fr

om a

d

iffe

re

nt te

rmi

na

l,

as ro

ot:

00 0 1

0 >

/p

ro

c/2

42 9/

gi

d_

ma

p

w

ill

fa

il h

ow

eve

r:

ot o

pe

n d

ire

cto

ry /ro

ot/:

P

er

mi

ssi

on

d

en

ie

d

(64)

rt

q

uiz

1 :

a

re

gula

r

us

er

, n

ot

r

oo

t.

n

e(

)

w

ith

(

CL

O

NE

_NE

W

NE

T)

wo

rk

?

q

uiz

2 :

n

e(

)

w

ith

(

CL

O

NE

_NE

W

NE

T

|

CL

O

N

E

_NE

W

U

S

E

R)

(65)

:

No

.

to u se th e C LO N E_ NE WN E T we n ee d to h ave S_ A DM IN. --ne t b as h u nsh ar e fa ile d: O pe ra tio n no t p er m itt ed

: Y

es

.

ce s co de g ua ra nt ee s us th at u se r na m es pa ce c re at io n is th e cr ea te d. F or cr ea tin g a use r na m esp ace w e do 'n t n ee d S_ A DM IN. T he u se r na m esp ac e is cr ea te d w ith fu ll s, so w e ca n cr ea te t he n et w or k na m es pa ce su cce ssf ully . --n et --u se r /b in /b ash s!

(66)

:

n, fr

om

a

n

on

ro

ot u

se

r,

–

use

r b

ash

en

c/

se

lf/sta

tu

s |

g

re

p C

ap

E

ff

000 000000

0000

000 ea

ns

n

o ca

pa

bi

liti

es. S

o h

ow

w

as

th

e

ne

t n

ame

sp

ace

,

ee

ds C

A

P

_S

Y

S

_A

D

MIN

, cr

ea

te

d ?

(67)

w

e

first

do

u

ns

ha

re

;

on

e w

ith

u

se

r

na

me

sp

ace

. T

hi

s e

na

bl

es a

ll ca

pa

bi

lit

ie

s.

ea

te

th

e

na

me

sp

ace

. A

fte

rw

ard

s,

w

e ca

ll e

xe

c f

or

th

e

c re

m

ove

s c

ap

ab

ili

tie

s.

ar

e.c o

f u

til

-li

nu

x:

u

nsh

are

(u

ns

ha

re

_fl

ag

s))

r(E

X

IT

_F

A

IL

U

R

E

, _

("

un

sh

are

fa

ile

d"

));

sh

el

l();

(68)

m

y

o

f

a

u

se

r

n

am

es

p

ace

s

vu

ln

er

ab

ili

ty

ic

ha

el

K

er

ris

k,

Marc

h

20

13 ut

C

V

E

2

013 -1

858 -

exp

lo

ita

ble

se

cu

rit

y

erab

ilit

y

lw

n.

ne

t/A

rt

icle

s/

54

32 73/

(69)

cgroups

● cgroups (control groups) subsystem is a Resource Management solution providing a

generic process-grouping framework.

● This work was started by engineers at Google (primarily Paul Menage and Rohit Seth) in 2006 under the name "process containers; in 2007, renamed to “Control Groups”.

●Maintainers: Li Zefan (huawei) and Tejun Heo ;

●The memory controller (memcg) is maintained separately (4 maintainers) ●Probably the most complex.

– Namespaces provide per process resource isolation solution.

– _{Cgroups provide resource management solution (handling groups).}

● Available in Fedora 18 kernel and ubuntu 12.10 kernel (also some previous releases).

– _{Fedora systemd uses cgroups.}

– Ubuntu does not have systemd. Tip: do tests with Ubuntu and also make sure that cgroups are not

(70)

● The implementation of cgroups requires a few, simple hooks into the rest

of the kernel, none in performance-critical paths:

–In boot phase (init/main.c) to preform various initializations.

–In process creation and destroy methods, fork() and exit().

–A new file system of type "cgroup" (VFS)

–Process descriptor additions (struct task_struct)

–Add procfs entries:

●For each process: /proc/pid/cgroup.

(71)

–

The cgroup modules are

not

located in one folder but

scattered in the kernel tree according to their functionality:

●

memory

: mm/memcontrol.c

●

cpuset

: kernel/cpuset.c.

●

net_prio:

net/core/netprio_cgroup.c

●

devices:

security/device_cgroup.c.

(72)

cgroups and kernel namespaces

Note that the cgroups is not dependent upon namespaces; you can build

cgroups without namespaces kernel support.

There was an attempt in the past to add "ns" subsystem (ns_cgroup, namespace cgroup subsystem); with this, you could mount a namespace subsystem by:

mount -t cgroup -ons.

This code it was removed in 2011 (by a patch by Daniel Lezcano).

See:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id= a77aea92010acf54ad785047234418d5d68772e2

(73)

cgroups VFS

● Cgroups uses a Virtual File System

– All entries created in it are not persistent and deleted after

reboot.

● All cgroups actions are performed via filesystem actions

(create/remove directory, reading/writing to files in it, mounting/mount options).

● For example:

– cgroup inode_operations for cgroup mkdir/rmdir.

– cgroup file_system_type for cgroup mount/unmount.

(74)

Mounting cgroups

In order to use a filesystem (browse it/attach tasks to cgroups,etc) it must be mounted.

The control group can be mounted anywhere on the filesystem. Systemd uses /sys/fs/cgroup. When mounting, we can specify with mount options (-o) which subsystems we want to use.

There are 11 cgroup subsystems (controllers) (kernel 3.9.0-rc4 , April 2013); two can be built as modules. (All subsystems are instances of cgroup_subsys struct)

cpuset_subsys - defined in kernel/cpuset.c.

freezer_subsys - defined in kernel/cgroup_freezer.c.

mem_cgroup_subsys - defined in mm/memcontrol.c; Aka memcg - memory control groups.

blkio_subsys - defined in block/blk-cgroup.c.

net_cls_subsys - defined in net/sched/cls_cgroup.c ( can be built as a kernel module)

net_prio_subsys - defined in net/core/netprio_cgroup.c ( can be built as a kernel module)

devices_subsys - defined in security/device_cgroup.c.

perf_subsys (perf_event) - defined in kernel/events/core.c

hugetlb_subsys - defined in mm/hugetlb_cgroup.c.

cpu_cgroup_subsys - defined in kernel/sched/core.c

(75)

Mounting cgroups – contd.

In order to mount a subsystem, you should first create a folder for it under /cgroup.

In order to mount a cgroup, you first mount some tmpfs root folder:

● mount -t tmpfs tmpfs /cgroup

Mounting of the memory subsystem, for example, is done thus:

● mkdir /cgroup/memtest

● mount -t cgroup -o memory test /cgroup/memtest/

Note that instead “test” you can insert any text; this text is not

handled by cgroups core. It's only usage is when displaying the mount

(76)

Mounting cgroups – contd.

● Mount creates cgroupfs_root object + cgroup (top_cgroup) object

● mounting another path with the same subsystems - the same

subsys_mask; the same cgroupfs_root object is reused.

● mkdir increments number_of_cgroups, rmdir decrements number_of_cgroups.

● cgroup1 - created by mkdir /cgroup/memtest/cgroup1.

struct super_block *sb

The super block being used. (in memory). struct cgroup top_cgroup

unsigned long subsys_mask

bitmask of subsystems attached to this hierarchy

int number_of_cgroups cgroup cgroup1 cgroup2 parent _parent parent cgroupfs_root *root

(77)

Mounting a set of subsystems

From Documentation/cgroups/cgroups.txt:

If an active hierarchy with exactly the same set of subsystems

already exists, it will be reused for the new mount.

If no existing hierarchy matches, and any of the requested

subsystems are in use in an existing hierarchy, the mount will fail

with -EBUSY.

Otherwise, a new hierarchy is activated, associated with the

requested subsystems.

(78)

First case: Reuse

●

mount -t tmpfs test1 /cgroup/test1

●

mount -t tmpfs test2 /cgroup/test2

●

mount -t cgroup -ocpu,cpuacct test1 /cgroup/test1

●

mount -t cgroup -ocpu,cpuacct test2 /cgroup/test2

●

This will work; the mount method recognizes that we want to

use the same mask of subsytems in the second case.

– _{(Behind the scenes, this is done by the return value of sget() method, called}

from cgroup_mount(), found an already allocated superblock; the sget()

makes sure that the mask of the sb and the required mask are identical)

– Both will use the same cgroupfs_root object.

(79)

Second case: any of the requested

subsystems are in use

●

mount -t tmpfs tmpfs /cgroup/tst1/

●

mount -t tmpfs tmpfs /cgroup/tst2/

●

mount -t tmpfs tmpfs /cgroup/tst3/

●

mount -t cgroup -o freezer tst1 /cgroup/tst1/

●

mount -t cgroup -o memory tst2 /cgroup/tst2/

●

mount -t cgroup -o freezer,memory tst3 /cgroup/tst3

–

Last command will give an error. (-EBUSY).

The reason: these subsystems (controllers) were been

separately mounted.

(80)

Third case -

no existing hierarchy

no existing hierarchy matches, and none of the requested

subsystems are in use in an existing hierarchy:

mount -t cgroup -o net_prio netpriotest /cgroup/net_prio/

Will succeed.

(81)

– under each new cgroup which is created, these 4 files are always created:

● tasks

– _{list of pids which are attached to this group.}

● cgroup.procs.

– list of thread group IDs (listed by TGID) attached to this group.

● cgroup.event_control.

– Example in following slides.

● notify_on_release (boolean).

– _{For a newly generated cgroup, the value of}_{notify_on_release}_{in inherited}

from its parent; However, changing notify_on_release in the parent does not change the value in the children he already has.

– Example in following slides.

– _{For the}_{topmost cgroup root object} _only_{, there is also a}_{release_agent}_{– a}

(82)

● Each subsystem adds specific control files for its own needs, besides

these 4 fields. All control files created by cgroup subsystems are given a prefix corresponding to their subsystem name. For example:

cpuset.cpus cpuset.mems cpuset.cpu_exclusive cpuset.mem_exclusive cpuset.mem_hardwall cpuset.sched_load_balance cpuset.sched_relax_domain_level cpuset.memory_migrate cpuset.memory_pressure cpuset.memory_spread_page cpuset.memory_spread_slab cpuset.memory_pressure_enabled cpuset subsystem devices.allow devices.deny devices.list devices subsystem

(83)

cpu subsystem

cpu.shares (only if CONFIG_FAIR_GROUP_SCHED is set)

cpu.cfs_quota_us (only if CONFIG_CFS_BANDWIDTH is set)

cpu.cfs_period_us (only if CONFIG_CFS_BANDWIDTH is set)

cpu.stat (only if CONFIG_CFS_BANDWIDTH is set)

cpu.rt_runtime_us (only if CONFIG_RT_GROUP_SCHED is set)

cpu.rt_period_us (only if CONFIG_RT_GROUP_SCHED is set)

(84)

memory subsystem

memory.usage_in_bytes memory.max_usage_in_bytes memory.limit_in_bytes memory.soft_limit_in_bytes memory.failcnt memory.stat memory.force_empty memory.use_hierarchy memory.swappiness memory.move_charge_at_immigrate memory.oom_control

memory.numa_stat (only if CONFIG_NUMA is set)

memory.kmem.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set) memory.kmem.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set) memory.kmem.failcnt (only if CONFIG_MEMCG_KMEM is set) memory.kmem.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set) memory.kmem.tcp.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set) memory.kmem.tcp.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set) memory.kmem.tcp.failcnt (only if CONFIG_MEMCG_KMEM is set) memory.kmem.tcp.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set) memory.kmem.slabinfo (only if CONFIG_SLABINFO is set) memory.memsw.usage_in_bytes (only if CONFIG_MEMCG_SWAP is set)

memory subsystem

(85)

blkio subsystem

blkio.weight_device blkio.weight blkio.weight_device blkio.weight blkio.leaf_weight_device blkio.leaf_weight blkio.time blkio.sectors blkio.io_service_bytes blkio.io_serviced blkio.io_service_time blkio.io_wait_time blkio.io_merged blkio.io_queued blkio.time_recursive blkio.sectors_recursive blkio.io_service_bytes_recursive blkio.io_serviced_recursive blkio.io_service_time_recursive blkio.io_wait_time_recursive blkio.io_merged_recursive blkio.io_queued_recursive

blkio.avg_queue_size (only ifCONFIG_DEBUG_BLK_CGROUP is set) blkio.group_wait_time (only ifCONFIG_DEBUG_BLK_CGROUP is set) blkio.idle_time (only ifCONFIG_DEBUG_BLK_CGROUP is set)

blkio.throttle.read_bps_device blkio.throttle.write_bps_device blkio.throttle.read_iops_device blkio.throttle.write_iops_device blkio.throttle.io_service_bytes blkio.throttle.io_serviced

(86)

netprio

net_prio.ifpriomap

net_prio.prioidx

Note the netprio_cgroup.ko should be insmoded so the mount will succeed. Moreover, rmmod will fail if netprio is mounted

(87)

– When mounting a cgroup subsystem (or a set of cgroup subsystems) , all_all

processes in the system belong to it (the top cgroup object).

● After mount -t cgroup -o memory test /cgroup/memtest/

– you can see all tasks by: cat _{/cgroup/memtest/tasks}

– _{When creating new child cgroups in that hierarchy, each one of them will}_not_have

any tasks at all initially.

– Example: – _{mkdir /cgroup/memtest/group1} – _{mkdir /cgroup/memtest/group2} – cat /cgroup/memtest/group1/tasks ● Shows nothing. – _{cat /cgroup/memtest/group2/tasks}

(88)

●Any task can be a member of exactly one cgroup in a specific hierarchy. ●Example: ●echo $$ > /cgroup/memtest/group1/tasks ●cat /cgroup/memtest/group1/tasks ●cat /cgroup/memtest/group2/tasks

●Will show that task only in group1/tasks.

●After:

●echo $$ > /cgroup/memtest/group2/tasks

●The task was moved to group2; we will see that task it only in

group2/tasks.

Resource management: Linux kernel Namespaces and cgroups

Rami Rosen

[email protected]

Haifux, May 2013

Resource management:

Linux kernel Namespaces and

cgroups

TOC

General

The presentation deals with two Linux process resource

management solutions: namespaces and cgroups.

We will look at:

Kernel Implementation details.

what was added/changed in brief.

User space interface.

Some working examples.

Usage of namespaces and cgroups in other projects.

Is process virtualization indeed lightweight comparing to Os

virtualization ?

Namespaces

Namespaces - contd

Namespaces - contd

Namespaces - contd

Linux 2.4.19.

Implementation details

Implementation (partial):

- 6 CLONE_NEW * flags were added:

(include/linux/sched.h)

These flags (or a combination of them) can be

used in

clone()

or

unshare()

syscalls to create a

namespace.

Implementation - contd

Nameless namespaces

From man (2) clone:

...

int clone(int (*fn)(void *), void *child_stack,

int flags, void *arg, ...

/* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

...

Flags is the CLONE_* flags, including the namespaces

CLONE_NEW* flags. There are more than 20 flags in total.

See include/uapi/linux/sched.h

There is

no

parameter of a namespace

name

.

How do we know if two processes are in the same namespace ?

Namespaces do not have names.

Nameless namespaces

ls -al /proc/<pid>/ns

lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]

lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]

You can use also

readlink.

Implementation - contd

A member named

nsproxy

was added to the process descriptor

, struct task_struct.

A method named

task_nsproxy(struct task_struct *tsk)

,

to access

the nsproxy of a specified process.

nsproxy

includes 5 inner namespaces:

uts_ns, ipc_ns, mnt_ns, pid_ns, net_ns;

Notice that user ns is

missing

in this list,

it is a member of the credentials object (struct cred) which is a

int clone(int (fn)(void ), void *child_stack,

/* pid_t ptid, struct user_desc tls, pid_t ctid / );

task_nsproxy(struct task_struct tsk)*