Rami Rosen
[email protected]
Haifux, May 2013
Resource management:
Linux kernel Namespaces and
cgroups
TOC
● PID namespaces
● cgroups
● Mounting cgroups
● user namespaces
● UTS namespace
● Network Namespace
● Mount namespace
General
The presentation deals with two Linux process resource
management solutions: namespaces and cgroups.
We will look at:
● Kernel implementation details.
● What was added/changed, in brief.
● User space interface.
● Some working examples.
● Usage of namespaces and cgroups in other projects.
● Is process virtualization indeed lightweight compared to OS virtualization?
Namespaces
● Namespaces - lightweight process virtualization.
– Isolation: Enable a process (or several processes) to have different
views of the system than other processes.
– 1992: “The Use of Name Spaces in Plan 9”
– http://www.cs.bell-labs.com/sys/doc/names.html
● Rob Pike et al, ACM SIGOPS European Workshop 1992.
– Much like Zones in Solaris.
– No hypervisor layer (as in OS virtualization like KVM, Xen)
– Only one system call was added (setns())
– Used in Checkpoint/Restart
Namespaces - contd
There are currently 6 namespaces:
● mnt (mount points, filesystems)
● pid (processes)
● net (network stack)
● ipc (System V IPC)
● uts (hostname)
● user (UIDs)
Namespaces - contd
It was intended that there will be 10 namespaces: the following 4 namespaces are not implemented (yet):
● security namespace
● security keys namespace
● device namespace
● time namespace.
– There was a time namespace patch – but it was not applied.
– See: PATCH 0/4 - Time virtualization:
– http://lwn.net/Articles/179825/
● see ols2006, "Multiple Instances of the Global Linux Namespaces" Eric
Namespaces - contd
● Mount namespaces were the first type of namespace to be
implemented on Linux by Al Viro, appearing in 2002.
– Linux 2.4.19.
● CLONE_NEWNS flag was added (stands for “new namespace”; at
that time, no other namespace was planned, so it was not called new mount...)
● User namespace was the last to be implemented. A number of Linux
Implementation details
● Implementation (partial):
- 6 CLONE_NEW* flags were added (include/linux/sched.h).
● These flags (or a combination of them) can be used in the clone() or unshare() syscalls to create a namespace.
Flag            Kernel version   Required capability
CLONE_NEWNS     2.4.19           CAP_SYS_ADMIN
CLONE_NEWUTS    2.6.19           CAP_SYS_ADMIN
CLONE_NEWIPC    2.6.19           CAP_SYS_ADMIN
CLONE_NEWPID    2.6.24           CAP_SYS_ADMIN
CLONE_NEWNET    2.6.29           CAP_SYS_ADMIN
CLONE_NEWUSER   3.8              no capability required
Implementation - contd
● Three system calls are used for namespaces:
● clone() - creates a new process and a new namespace; the
process is attached to the new namespace.
– Process creation and process termination methods, fork() and exit() methods,
were patched to handle the new namespace CLONE_NEW* flags.
● unshare() - does not create a new process; creates a new
namespace and attaches the current process to it.
– unshare() was added in 2005, not only for namespaces but also for security; see "new system call, unshare": http://lwn.net/Articles/135266/
● setns() - a new system call, added for joining an existing namespace.
Nameless namespaces
From man (2) clone:
...
int clone(int (*fn)(void *), void *child_stack,
int flags, void *arg, ...
/* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );
...
● flags is the set of CLONE_* flags, including the namespace CLONE_NEW* flags. There are more than 20 flags in total.
● See include/uapi/linux/sched.h
● There is no parameter for a namespace name.
● How do we know if two processes are in the same namespace?
● Namespaces do not have names.
Nameless namespaces
● ls -al /proc/<pid>/ns
lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]
You can also use readlink.
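Since namespaces have no names, comparing the link targets under /proc/<pid>/ns is a practical way to answer the question of whether two processes share a namespace. A minimal sketch, assuming a kernel that exposes /proc/<pid>/ns and permission to read the other process's entries; the helper names ns_id and same_ns are hypothetical:

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier, e.g. "uts:[4026531838]".
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID1 PID2 TYPE: succeed if both processes share the namespace.
same_ns() {
    a=$(ns_id "$1" "$3") || return 2   # 2: could not read the link
    b=$(ns_id "$2" "$3") || return 2
    [ "$a" = "$b" ]
}

# Compare the current shell with its parent for each namespace type.
for t in ipc mnt net pid user uts; do
    if same_ns "$$" "$PPID" "$t"; then
        echo "$t: same namespace as parent"
    else
        echo "$t: different namespace (or not readable)"
    fi
done
```

Two processes are in the same namespace of a given type exactly when the bracketed numbers match.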
Implementation - contd
● A member named nsproxy was added to the process descriptor, struct task_struct.
● A method named task_nsproxy(struct task_struct *tsk) was added, to access the nsproxy of a specified process (include/linux/nsproxy.h).
● nsproxy includes 5 inner namespaces:
– uts_ns, ipc_ns, mnt_ns, pid_ns, net_ns;
● Notice that user ns is missing from this list; it is a member of the credentials object (struct cred).
Implementation - contd
● Kernel config items:
CONFIG_NAMESPACES CONFIG_UTS_NS CONFIG_IPC_NS CONFIG_USER_NS CONFIG_PID_NS CONFIG_NET_NS
● User space additions:
– IPROUTE package: some additions like ip netns add/ip netns del and more.
– util-linux package: unshare util with support for all the 6 namespaces.
UTS namespace
● uts - (Unix timesharing)
– Very simple to implement.
– Added a member named uts_ns (uts_namespace object) to the nsproxy.
Diagram: process descriptor (task_struct) -> nsproxy -> uts_ns (uts_namespace object) -> name (new_utsname object: sysname, nodename, release)
UTS namespace - contd
The old implementation of gethostname():
asmlinkage long sys_gethostname(char __user *name, int len)
{
...
if (copy_to_user(name, system_utsname.nodename, i))
... errno = -EFAULT;
}
(system_utsname is a global)
kernel/sys.c, Kernel v2.6.11.5
UTS namespace - contd
A method called utsname() was added:
static inline struct new_utsname *utsname(void)
{
	return &current->nsproxy->uts_ns->name;
}
The new implementation of gethostname():
SYSCALL_DEFINE2(gethostname, char __user *, name, int, len)
{
	struct new_utsname *u;
	...
	u = utsname();
	if (copy_to_user(name, u->nodename, i))
		errno = -EFAULT;
	...
}
UTS namespace - Example
We have a machine where hostname is myoldhostname.
uname -n
myoldhostname
unshare -u /bin/bash
This creates a UTS namespace with the unshare() syscall and calls execvp() to invoke bash.
Then:
hostname mynewhostname
uname -n
mynewhostname
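The same experiment can be scripted so that the hostname change stays confined to a throwaway namespace. This is a sketch under assumptions: it relies on the unshare utility from util-linux, and, when not run as root, on its --user and --map-root-user options and on unprivileged user namespaces being enabled. The hostname demo-host is made up for illustration:

```shell
#!/bin/sh
before=$(uname -n)

# As root, a UTS namespace alone is enough; otherwise also create a user
# namespace so that the caller gains CAP_SYS_ADMIN inside it.
if [ "$(id -u)" -eq 0 ]; then
    opts="--uts"
else
    opts="--uts --user --map-root-user"
fi

# Change the hostname only inside the new UTS namespace.
inside=$(unshare $opts sh -c 'hostname demo-host && uname -n' 2>/dev/null) \
    || inside="(could not create a UTS namespace here)"

after=$(uname -n)
echo "inside the new namespace: $inside"
echo "outside, before/after:    $before / $after"
```

Whether or not the namespace could be created, the hostname seen outside should be the same before and after.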
UTS namespace - Example
nsexec
nsexec is a package by Serge Hallyn; it consists of a
program called nsexec.c which creates tasks in new
namespaces (there are some more utils in it) by clone() or by
unshare() with fork().
https://launchpad.net/~serge-hallyn/+archive/nsexec
Again we have a machine where hostname is myoldhostname.
uname -n
IPC namespaces
The same principle as uts; nothing special, just more code.
Added a member named ipc_ns (ipc_namespace object) to the nsproxy.
Network Namespaces
● A network namespace is logically another copy of the network stack,
with its own routes, firewall rules, and network devices.
● The network namespace is struct net (defined in include/net/net_namespace.h).
Struct net includes all network stack ingredients, like:
– Loopback device.
– SNMP stats (netns_mib).
– All network tables: routing, neighboring, etc.
– All sockets.
Implementations guidelines
• A network device belongs to exactly one network namespace.
● Added to the struct net_device structure:
● struct net *nd_net; for the network namespace this network device is inside.
● Added a method: dev_net(const struct net_device *dev) to access the nd_net namespace of a network device.
• A socket belongs to exactly one network namespace.
● Added sk_net to struct sock (also a pointer to struct net), for the network namespace this socket is inside.
● Added sock_net() and sock_net_set() methods (get/set the network namespace of a socket).
Network Namespaces - contd
● Added a system wide linked list of all namespaces: net_namespace_list,
and a macro to traverse it (for_each_net())
● The initial network namespace, init_net (instance of struct net), includes
the loopback device and all physical devices, the networking tables, etc.
● Each newly created network namespace includes only the loopback device.
● There are no sockets in a newly created namespace:
netstat -nl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
Active UNIX domain sockets (only servers)
Example
● Create two namespaces, called "myns1" and "myns2":
● ip netns add myns1
● ip netns add myns2
– (In fedora 18, ip netns is included in the iproute package).
● This triggers:
● creation of the empty files /var/run/netns/myns1 and /var/run/netns/myns2 (used as mount points for the namespaces)
● calling the unshare() system call with CLONE_NEWNET.
– unshare() does not trigger cloning of a process; it does create a new namespace (a network namespace, because of the CLONE_NEWNET flag).
● You can use the file descriptor of /var/run/netns/myns1 with the setns() system call.
● From man 2 setns:
...
int setns(int fd, int nstype);
DESCRIPTION
Given a file descriptor referring to a namespace, reassociate the calling thread with that namespace.
...
● If you pass 0 as nstype, no check is done on the type of the fd.
● If you pass some nstype, like CLONE_NEWNET or CLONE_NEWUTS, the method verifies that the specified nstype corresponds to the specified fd.
Network Namespaces - delete
● You delete a namespace by:
● ip netns del myns1
– This unmounts and removes /var/run/netns/myns1
– see netns_delete() in ipnetns.c
– Will not delete a network namespace if there is one or more processes attached to it.
● Notice that after deleting a namespace, all its migratable network devices are moved to the default network namespace;
● unmoveable devices (devices which have NETIF_F_NETNS_LOCAL in their features) and virtual devices are not moved to the default network namespace.
● (The semantics of migratable network devices and unmoveable devices
NETIF_F_NETNS_LOCAL
● NETIF_F_NETNS_LOCAL is a network device feature
– (a member of net_device struct, of type netdev_features_t)
● It is set for devices that are not allowed to move between network namespaces; sometimes these devices are named "local devices".
● Examples of local devices (where NETIF_F_NETNS_LOCAL is set):
– Loopback, VXLAN, ppp, bridge.
– You can see it with ethtool (by ethtool -k, or ethtool --show-features)
– ethtool -k p2p1
netns-local: off [fixed]
For the loopback device:
VXLAN
● Virtual eXtensible Local Area Network.
● VXLAN is a standard protocol to transfer layer 2 Ethernet packets
over UDP.
● Why do we need it ?
● There are firewalls which block tunnels and allow, for example, only
TCP/UDP traffic.
● developed by Stephen Hemminger.
– drivers/net/vxlan.c
– IANA assigned port is 4789
When trying to move a device with the NETIF_F_NETNS_LOCAL flag, like VXLAN, from one namespace to another, we will encounter an error:
ip link add myvxlan type vxlan id 1
ip link set myvxlan netns myns1
We will get:
RTNETLINK answers: Invalid argument
int dev_change_net_namespace(struct net_device *dev, struct net *net, const char *pat)
{
	int err;

	err = -EINVAL;
	if (dev->features & NETIF_F_NETNS_LOCAL)
		goto out;
	...
● You list the network namespaces (which were added via "ip netns add") by:
● ip netns list
– this simply reads the namespaces under /var/run/netns
● You can find the pid (or list of pids) in a specified net namespace by:
– ip netns pids namespaceName
● You can find the net namespace of a specified pid by:
● You can monitor addition/removal of network namespaces by:
– ip netns monitor
– prints one line for each addition/removal event it sees
● Assigning the p2p1 interface to the myns1 network namespace:
● ip link set p2p1 netns myns1
– This triggers changing the network namespace of the net_device to "myns1".
– It is handled by dev_change_net_namespace(), net/core/dev.c.
● Now, running:
● ip netns exec myns1 bash
● will transfer me to the myns1 network namespace; so if I run there:
● ifconfig -a
● I will see p2p1 (and the loopback device);
– Also under /sys/class/net, there will be only p2p1 and lo folders.
● But if I open a new terminal and type ifconfig -a, I will not see p2p1.
● Also, going to the second namespace by running:
● ip netns exec myns2 bash
● will transfer me to the myns2 network namespace; but if we run there:
● ifconfig -a
– We will not see p2p1; we will only see the loopback device.
● We move a network device to the default, initial namespace by:
● In that namespace, network applications which look for files under /etc will first look in /etc/netns/myns1/, and then in /etc.
● For example, if we add the entry "192.168.2.111 www.dummy.com"
● to /etc/netns/myns1/hosts, and run:
● ping www.dummy.com
veth
● You can communicate between two network namespaces by:
● creating a pair of network devices (veth) and moving one to another network namespace.
● Veth (Virtual Ethernet) is like a pipe.
● unix sockets (use paths on the filesystem).
Example with veth:
Create two namespaces, myns1 and myns2:
ip netns add myns1
ip netns add myns2
veth
ip netns exec myns1 bash
- open a shell in the myns1 net namespace
ip link add name if_one type veth peer name if_one_peer
- create a veth interface pair, if_one and if_one_peer
- ifconfig running in myns1 will show if_one, if_one_peer and lo (the loopback device)
- ifconfig running in myns2 will show only lo (the loopback device)
Run from the myns1 shell:
ip link set dev if_one_peer netns myns2
- move if_one_peer to myns2
- now ifconfig running in myns2 will show if_one_peer and lo (the loopback device)
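The whole veth walk-through can be collected into one script. This is only a sketch: the addresses 10.0.0.1/24 and 10.0.0.2/24 and the final ping are assumptions added for illustration (the steps above stop after moving the device), and real execution needs root. By default the script only prints the commands; set DRY_RUN=0 to execute them:

```shell
#!/bin/sh
# run CMD...: print the command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

veth_demo() {
    run ip netns add myns1 &&
    run ip netns add myns2 &&
    # Create the pair inside myns1, then move one end to myns2.
    run ip netns exec myns1 ip link add name if_one type veth peer name if_one_peer &&
    run ip netns exec myns1 ip link set dev if_one_peer netns myns2 &&
    # Hypothetical addressing, so that the two ends can talk to each other.
    run ip netns exec myns1 ip addr add 10.0.0.1/24 dev if_one &&
    run ip netns exec myns2 ip addr add 10.0.0.2/24 dev if_one_peer &&
    run ip netns exec myns1 ip link set dev if_one up &&
    run ip netns exec myns2 ip link set dev if_one_peer up &&
    run ip netns exec myns1 ping -c 1 10.0.0.2
}

veth_demo
```

Running the commands through "ip netns exec" avoids opening an interactive shell in each namespace, which keeps the sequence scriptable.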
unshare util
● The unshare utility
● The recent util-linux git tree has the unshare utility with support for all six namespaces:
http://git.kernel.org/cgit/utils/util-linux/util-linux.git
./unshare --help
...
Options:
-m, --mount unshare mounts namespace
-u, --uts unshare UTS namespace (hostname etc)
-i, --ipc unshare System V IPC namespace
-n, --net unshare network namespace
-p, --pid unshare pid namespace
http://ramirose.wix.com/ramirosen 39/121
● For example:
● Type:
● ./unshare --net bash
– A new network namespace was generated and the bash process was
generated inside that namespace.
● Now run ifconfig -a
● You will see only the loopback device.
– With the unshare util, nothing is created under /var/run/netns; also, network applications in the net namespace we created do not look under /etc/netns.
– If you kill this bash or exit from it, the network namespace will be freed.
– This is not the case with ip netns exec myns1 bash; in that case, killing/exiting the bash does not trigger destroying the namespace.
– For implementation details, look in […](struct net *net) and the reference count (named "count") of the network namespace struct net.
Mount namespaces
● Added a member named mnt_ns (mnt_namespace object) to the nsproxy.
● We copy the mount namespace of the calling process using a generic filesystem method (see copy_tree() in […]_ns()).
● In the new mount namespace, all previous mounts will be visible; and from now on:
– mounts/unmounts in that mount namespace are invisible to the rest of the system.
– mounts/unmounts in the global namespace are visible in that namespace.
● The pam_namespace module uses mount namespaces (with CLONE_NEWNS) ([…]es/pam_namespace/pam_namespace.c).
Mount namespaces: example 1
Example 1 (tested on Ubuntu):
Verify that /dev/sda3 is not mounted:
mount | grep /dev/sda3
should give nothing.
unshare -m /bin/bash
mount /dev/sda3 /mnt/sda3
Now run:
mount | grep sda3
We will see:
/dev/sda3 on /mnt/sda3 type ext3 (rw)
readlink /proc/$$/ns/mnt
mnt:[…114]
From another terminal run:
readlink /proc/$$/ns/mnt
mnt:[4026531840]
The result shows that we are in a different namespace.
Now run:
mount | grep sda3
/dev/sda3 on /mnt/sda3 type ext3 (rw)
Why? We are in a different mount namespace; we should not have seen the mount which was made from another namespace!
The answer is simple: running mount is not good enough when working with mount namespaces.
The reason is that mount reads /etc/mtab, which is updated by the mount command; the mount command does not access the kernel data structures.
What is the solution?
To access the kernel data structures directly, you should run:
cat /proc/mounts | grep sda3
(/proc/mounts is in fact a symbolic link to /proc/self/mounts).
Now you will get no results, as expected.
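A script that must give the right answer inside a mount namespace can read the kernel's view rather than relying on the output of mount. A minimal sketch; the helper name is_mounted_here and its optional second argument (an alternative mounts file, handy for testing) are hypothetical:

```shell
#!/bin/sh
# is_mounted_here DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE appears as a mount source in the calling process's
# mount namespace. /proc/self/mounts is generated by the kernel, unlike
# /etc/mtab, which on the systems described here is a regular file
# updated by the mount command.
is_mounted_here() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mounts}"
}

if is_mounted_here /dev/sda3; then
    echo "/dev/sda3 is mounted in this namespace"
else
    echo "/dev/sda3 is not mounted in this namespace"
fi
```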
Mount namespaces: example 2
Example 2: tested on Fedora 18
Verify that /dev/sdb3 is not mounted:
mount | grep sdb3
should give nothing.
unshare -m /bin/bash
mount /dev/sdb3 /mnt/sdb3
Now run:
mount | grep sdb3
We will see:
/dev/sdb3 on /mnt/sdb3 type ext4 (rw,relatime,data=ordered)
readlink /proc/$$/ns/mnt
mnt:[…381]
From another terminal run:
readlink /proc/$$/ns/mnt
mnt:[4026531840]
This shows that we are in a different namespace.
Now run:
mount | grep sdb3
/dev/sdb3 on /mnt/sdb3 type ext4 (rw,relatime,data=ordered)
We know now that we should use cat /proc/mounts (and not mount) to get the right answer when working with namespaces; so:
cat /proc/mounts | grep sdb3
/dev/sdb3 /mnt/sdb3 ext4 rw,relatime,data=ordered 0 0
Why is it so? We should have seen no results, as in the previous example.
Fedora runs systemd; systemd uses the shared flag for mounting /.
From the systemd source code (src/core/mount-setup.c):
[…]tup(bool loaded_policy) {
...
[…](NULL, "/", NULL, MS_REC|MS_SHARED, NULL) < 0)
[…]rning("Failed to set up the root directory for shared mount propagation: %m");
...
(MS_REC stands for recursive mount)
How do we know whether we have the shared flag?
cat /proc/self/mountinfo | grep shared
The output will include:
[…] / rw,relatime shared:1 ext4 /dev/sda3 rw,data=ordered
So what should we do?
mount --make-rprivate -o remount / /dev/sda3
This changes the shared flag to private, recursively.
--make-rprivate – set the private flag recursively
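To check the result, the propagation type of a mount point can be read from the optional fields of /proc/self/mountinfo. A minimal sketch, assuming the standard mountinfo layout (mount point in field 5, optional fields from field 7 up to a "-" separator); the helper name propagation and its optional file argument are hypothetical:

```shell
#!/bin/sh
# propagation MOUNTPOINT [MOUNTINFO_FILE]
# Prints "shared", "slave", "unbindable" or "private" for MOUNTPOINT,
# and fails if MOUNTPOINT is not listed.
propagation() {
    awk -v mp="$1" '
        $5 == mp {
            shared = 0; slave = 0; unbind = 0
            # Optional fields start at field 7 and end at the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++) {
                if ($i ~ /^shared:/) shared = 1
                if ($i ~ /^master:/) slave = 1
                if ($i == "unbindable") unbind = 1
            }
            if (shared) result = "shared"
            else if (slave) result = "slave"
            else if (unbind) result = "unbindable"
            else result = "private"   # no optional field means private
        }
        END { if (result == "") exit 1; print result }
    ' "${2:-/proc/self/mountinfo}"
}

propagation / || echo "/ not found in mountinfo"
```

After mount --make-rprivate, the shared:N tag should no longer appear for the affected mounts.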
Shared subtrees
(Diagram: a directory tree rooted at /, containing /users, /bin and /mnt; /users contains /users/user1 and /users/user2.)
We want the user1 and user2 folders to see the whole filesystem; we will run:
mount --bind / /users/user1
mount --bind / /users/user2
By default, the filesystem is mounted as private, unless the shared mount flag is set explicitly.
Shared subtrees - contd
(Diagram: the same directory tree after the two bind mounts; /users/user1 and /users/user2 each show their own copy of /mnt, /bin and /users.)
Shared subtrees – Quiz
Now we mount a USB disk on key on /mnt/dok.
Will it be seen in /users/user1/mnt or /users/user2/mnt?
Shared subtrees - contd
The answer is no, since by default the filesystem is private. To make the dok visible in /users/user1/mnt or /users/user2/mnt, we should mount the filesystem as shared:
mount --make-rshared /
and then mount the USB disk on key again.
The shared subtrees patch is from 2005, by Ram Pai. It added some mount flags like --make-slave, --make-rslave, --make-unbindable, --make-runbindable and more. The patch added these kernel flags: MS_UNBINDABLE, MS_PRIVATE, MS_SLAVE and MS_SHARED.
The MS_SHARED flag is in use by the fuse filesystem.
PID namespaces
● Added a member named pid_ns (pid_namespace object) to the nsproxy.
● Processes in different PID namespaces can have the same process ID.
● When creating the first process in a new namespace, its PID is 1.
● Behavior like the "init" process:
– When a process dies, all its orphaned children will now have the process with PID 1 as their parent (child reaping).
– Sending a SIGKILL signal does not kill process 1, regardless of which namespace the command was issued in (initial namespace or other PID namespace).
PID namespaces - contd
● When a new namespace is created, we cannot see from it the PID of the parent namespace; running getppid() from the new PID namespace will return 0.
● But all PIDs which are used in this namespace are visible to the parent namespace.
● PID namespaces can be nested, up to 32 nesting levels (MAX_PID_NS_LEVEL).
● See: multi_pidns.c, Michael Kerrisk, from http://lwn.net/Articles/532745/.
● When trying to run multi_pidns with 33, you will get:
clone: Invalid argument
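The "first process gets PID 1" rule can be observed from the command line. A sketch under assumptions: it needs an unshare build that supports --fork (and, when not run as root, --user and --map-root-user together with unprivileged user namespaces), options that are not part of the help output quoted earlier:

```shell
#!/bin/sh
# Start a shell as the first process of a new PID namespace and print the
# PID it sees for itself. --fork makes unshare fork first, so that the child
# (and not unshare itself) becomes the first process in the new namespace.
if [ "$(id -u)" -eq 0 ]; then
    opts="--pid --fork"
else
    opts="--user --map-root-user --pid --fork"
fi

pid_inside=$(unshare $opts sh -c 'echo $$' 2>/dev/null) \
    || pid_inside="unavailable"

echo "PID of this shell:                 $$"
echo "PID seen inside the new namespace: $pid_inside"
```

When the namespace cannot be created (old util-linux, missing privileges), the script reports "unavailable" instead of failing.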
User Namespaces
● Added a member named user_ns (user_namespace object). As noted earlier, it is a member of the credentials object (struct cred) and not of the nsproxy.
● include/linux/user_namespace.h
● Includes a pointer named parent to the user_namespace that created it:
struct user_namespace *parent;
● Includes the effective uid of the process that created it:
[…] owner;
● A process will have a distinct set of UIDs, GIDs and capabilities.
User Namespaces
Creating a new user namespace is done by passing CLONE_NEWUSER to clone() or unshare().
Example:
Running from some user account:
[…] shows the effective user ID.
[…] shows the effective group ID.
(usually the first user added gets uid/gid of 1000)
User Namespaces - examples:
cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
In order to create a user namespace and start a shell, we will run from that non-root account:
./nsexec -cU /bin/bash
The c flag is for using clone.
The U flag is for using a user namespace (the CLONE_NEWUSER flag for clone()).
User Namespaces - example - contd
From the new shell, run […].
[…] are default values for the eUID and eGID in the new namespace.
You will get the same results for the effective user id and effective group id also when running ./nsexec -cU /bin/bash as root.
The defaults can be changed by:
/proc/sys/kernel/overflowuid, /proc/sys/kernel/overflowgid
In fact, the user namespace that was created had full capabilities, but the call to exec() with bash removed them.
cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
User Namespaces - contd
Now run:
echo $$ (get the bash pid)
Now, from a different root terminal, we set the uid_map.
First, we can see that uid_map is uninitialized by:
cat /proc/<pid>/uid_map
Then:
echo 0 1000 10 > /proc/<pid>/uid_map
(<pid> is the pid of the bash process from the previous step).
The uid_map is of the following format:
namespace_first_uid host_first_uid number_of_uids
So this sets the first uid in the new namespace (which corresponds to uid 1000 in the outside world) to be 0; the second will be 1; and so forth, for 10 entries.
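The arithmetic behind a uid_map line can be made explicit. The following is a sketch of the translation for a single mapping line only; the helper name map_uid is hypothetical, and the real kernel lookup supports several lines per map:

```shell
#!/bin/sh
# map_uid NS_FIRST HOST_FIRST COUNT UID
# Translate a UID inside the namespace to the UID seen outside, following one
# uid_map line "namespace_first_uid host_first_uid number_of_uids".
# Fails (status 1) when UID falls outside the mapped range.
map_uid() {
    ns_first=$1 host_first=$2 count=$3 uid=$4
    if [ "$uid" -lt "$ns_first" ] || [ "$uid" -ge $((ns_first + count)) ]; then
        return 1
    fi
    echo $((host_first + uid - ns_first))
}

# The mapping "0 1000 10": namespace uids 0..9 map to host uids 1000..1009.
for u in 0 1 9 10; do
    if host=$(map_uid 0 1000 10 "$u"); then
        echo "namespace uid $u -> host uid $host"
    else
        echo "namespace uid $u -> unmapped"
    fi
done
```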
User Namespaces - contd
You can set the uid_map only once for a specific […]. Further attempts will fail.
Now […] will get 0.
The user namespace is the only namespace which can be created without the CAP_SYS_ADMIN capability.
cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
CapEff (Effective Capabilities) is 1fffffffff -> this is 37 bits of '1', which means all capabilities.
Will unshare --net bash work now?
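The bit count quoted for CapEff can be checked with a few lines of shell arithmetic. A small sketch; the helper name count_bits is hypothetical, and it assumes the mask fits in a signed 64-bit shell integer (top bit clear):

```shell
#!/bin/sh
# count_bits HEX: print how many bits are set in a hexadecimal mask such as
# the CapEff value from /proc/<pid>/status.
count_bits() {
    v=$((0x$1))
    n=0
    while [ "$v" -ne 0 ]; do
        n=$((n + (v & 1)))
        v=$((v >> 1))
    done
    echo "$n"
}

count_bits 0000001fffffffff    # prints 37: all capabilities
count_bits 0000000000000000    # prints 0: no capabilities
```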
Answer: no.
unshare --net bash
unshare: cannot set group id: Invalid argument
[…] running, from a different terminal, as root:
echo 0 1000 10 > /proc/2429/gid_map
[…] will fail however:
[…] cannot open directory /root/: Permission denied
Short quiz 1:
I am a regular user, not root.
Will clone() with (CLONE_NEWNET) work?
Short quiz 2:
Will clone() with (CLONE_NEWNET | CLONE_NEWUSER) work?
Quiz 1: No.
In order to use CLONE_NEWNET we need to have CAP_SYS_ADMIN.
unshare --net bash
unshare failed: Operation not permitted
Quiz 2: Yes.
The namespaces code guarantees us that the user namespace is the first to be created. For creating a user namespace we don't need CAP_SYS_ADMIN. The user namespace is created with full capabilities, so we can create the network namespace successfully.
unshare --net --user /bin/bash
If you run, from a non-root user:
unshare --user bash
and then:
cat /proc/self/status | grep CapEff
CapEff: 0000000000000000
This means no capabilities. So how was the net namespace, which needs CAP_SYS_ADMIN, created?
Answer: we first do the unshare, starting with the user namespace. This enables all capabilities. Then we create the namespace. Afterwards, we call exec for the shell; exec removes the capabilities.
From unshare.c of util-linux:
if ([…] unshare(unshare_flags))
	err(EXIT_FAILURE, _("unshare failed"));
...
[…]shell();
Anatomy of a user namespaces vulnerability
By Michael Kerrisk, March 2013.
About CVE 2013-1858 - exploitable security vulnerability
http://lwn.net/Articles/543273/
cgroups
● cgroups (control groups) subsystem is a Resource Management solution providing a
generic process-grouping framework.
● This work was started by engineers at Google (primarily Paul Menage and Rohit Seth) in 2006 under the name "process containers"; in 2007 it was renamed to "Control Groups".
● Maintainers: Li Zefan (Huawei) and Tejun Heo.
● The memory controller (memcg) is maintained separately (4 maintainers).
● Probably the most complex.
– Namespaces provide per process resource isolation solution.
– Cgroups provide resource management solution (handling groups).
● Available in Fedora 18 kernel and ubuntu 12.10 kernel (also some previous releases).
– Fedora systemd uses cgroups.
– Ubuntu does not have systemd. Tip: do tests with Ubuntu and also make sure that cgroups are not
● The implementation of cgroups requires a few, simple hooks into the rest
of the kernel, none in performance-critical paths:
– In the boot phase (init/main.c), to perform various initializations.
– In the process creation and destruction methods, fork() and exit().
– A new file system of type "cgroup" (VFS).
– Process descriptor additions (struct task_struct).
– Added procfs entries:
● For each process: /proc/pid/cgroup.
– The cgroup modules are not located in one folder but scattered in the kernel tree according to their functionality:
● memory: mm/memcontrol.c
● cpuset: kernel/cpuset.c
● net_prio: net/core/netprio_cgroup.c
● devices: security/device_cgroup.c
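The per-process procfs entry is easy to consume from a script. A minimal sketch, assuming the "hierarchy-ID:subsystem-list:path" line format of /proc/<pid>/cgroup; the helper name cgroup_path and its optional file argument are hypothetical:

```shell
#!/bin/sh
# cgroup_path SUBSYSTEM [FILE]
# Print the cgroup path of a process for one subsystem. Each line of
# /proc/<pid>/cgroup has the form "hierarchy-ID:subsystem-list:path".
cgroup_path() {
    awk -F: -v want="$1" '
        {
            n = split($2, subs, ",")
            for (i = 1; i <= n; i++)
                if (subs[i] == want) {
                    # Print everything after the second colon, so that a
                    # path containing ":" is not cut short.
                    print substr($0, length($1) + length($2) + 3)
                    found = 1
                    exit
                }
        }
        END { exit !found }
    ' "${2:-/proc/self/cgroup}"
}

cgroup_path memory || echo "memory subsystem not listed for this process"
```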
cgroups and kernel namespaces
Note that cgroups is not dependent upon namespaces; you can build cgroups without namespaces kernel support.
There was an attempt in the past to add an "ns" subsystem (ns_cgroup, namespace cgroup subsystem); with this, you could mount a namespace subsystem by:
mount -t cgroup -ons
This code was removed in 2011 (by a patch by Daniel Lezcano).
See:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a77aea92010acf54ad785047234418d5d68772e2
cgroups VFS
● Cgroups uses a Virtual File System
– Entries created in it are not persistent; they are deleted after reboot.
● All cgroups actions are performed via filesystem actions
(create/remove directory, reading/writing to files in it, mounting/mount options).
● For example:
– cgroup inode_operations for cgroup mkdir/rmdir.
– cgroup file_system_type for cgroup mount/unmount.
Mounting cgroups
In order to use a filesystem (browse it, attach tasks to cgroups, etc.) it must be mounted.
The control group can be mounted anywhere on the filesystem. Systemd uses /sys/fs/cgroup.
When mounting, we can specify with mount options (-o) which subsystems we want to use.
There are 11 cgroup subsystems (controllers) (kernel 3.9.0-rc4, April 2013); two can be built as modules. (All subsystems are instances of the cgroup_subsys struct.)
cpuset_subsys - defined in kernel/cpuset.c.
freezer_subsys - defined in kernel/cgroup_freezer.c.
mem_cgroup_subsys - defined in mm/memcontrol.c; Aka memcg - memory control groups.
blkio_subsys - defined in block/blk-cgroup.c.
net_cls_subsys - defined in net/sched/cls_cgroup.c ( can be built as a kernel module)
net_prio_subsys - defined in net/core/netprio_cgroup.c ( can be built as a kernel module)
devices_subsys - defined in security/device_cgroup.c.
perf_subsys (perf_event) - defined in kernel/events/core.c
hugetlb_subsys - defined in mm/hugetlb_cgroup.c.
cpu_cgroup_subsys - defined in kernel/sched/core.c
Mounting cgroups – contd.
In order to mount a subsystem, you should first create a folder for it under /cgroup.
In order to mount a cgroup, you first mount some tmpfs root folder:
● mount -t tmpfs tmpfs /cgroup
Mounting of the memory subsystem, for example, is done thus:
● mkdir /cgroup/memtest
● mount -t cgroup -o memory test /cgroup/memtest/
Note that instead of "test" you can insert any text; this text is not
handled by cgroups core. Its only use is when displaying the mount.
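Put together, the mount steps and a first use of the hierarchy might look as follows. This is a sketch under assumptions: it needs root, a kernel with the memory controller available for mounting as shown, and the 64 MiB limit is made up for illustration. By default the script only prints the commands; set DRY_RUN=0 to execute them:

```shell
#!/bin/sh
# run COMMAND-STRING: print the command, execute it only when DRY_RUN=0.
run() {
    echo "+ $1"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        sh -c "$1"
    fi
}

memcg_demo() {
    run "mount -t tmpfs tmpfs /cgroup" &&
    run "mkdir /cgroup/memtest" &&
    run "mount -t cgroup -o memory test /cgroup/memtest/" &&
    # A child cgroup is just a directory inside the mounted hierarchy.
    run "mkdir /cgroup/memtest/cgroup1" &&
    # Hypothetical limit of 64 MiB for the tasks in this group.
    run "echo 67108864 > /cgroup/memtest/cgroup1/memory.limit_in_bytes" &&
    # Attach the current shell to the group by writing its PID to tasks.
    run "echo $$ > /cgroup/memtest/cgroup1/tasks"
}

memcg_demo
```

Every step is an ordinary filesystem operation (mount, mkdir, write to a file), which is the point of the cgroup VFS interface.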
Mounting cgroups – contd.
● Mount creates cgroupfs_root object + cgroup (top_cgroup) object
● mounting another path with the same subsystems - the same
subsys_mask; the same cgroupfs_root object is reused.
● mkdir increments number_of_cgroups, rmdir decrements number_of_cgroups.
● cgroup1 - created by mkdir /cgroup/memtest/cgroup1.
Diagram: the cgroupfs_root object and the cgroups attached to it.
cgroupfs_root members:
struct super_block *sb - the super block being used (in memory).
struct cgroup top_cgroup
unsigned long subsys_mask - bitmask of subsystems attached to this hierarchy.
int number_of_cgroups
Each cgroup (cgroup, cgroup1, cgroup2) has a parent pointer and a cgroupfs_root *root pointer.
Mounting a set of subsystems
From Documentation/cgroups/cgroups.txt:
If an active hierarchy with exactly the same set of subsystems
already exists, it will be reused for the new mount.
If no existing hierarchy matches, and any of the requested
subsystems are in use in an existing hierarchy, the mount will fail
with -EBUSY.
Otherwise, a new hierarchy is activated, associated with the
requested subsystems.
First case: Reuse
● mount -t tmpfs test1 /cgroup/test1
● mount -t tmpfs test2 /cgroup/test2
● mount -t cgroup -ocpu,cpuacct test1 /cgroup/test1
● mount -t cgroup -ocpu,cpuacct test2 /cgroup/test2
● This will work; the mount method recognizes that we want to use the same mask of subsystems in the second case.
– (Behind the scenes, the sget() method, called from cgroup_mount(), finds an already allocated superblock; sget() makes sure that the mask of the sb and the required mask are identical.)
– Both will use the same cgroupfs_root object.
Second case: any of the requested
subsystems are in use
● mount -t tmpfs tmpfs /cgroup/tst1/
● mount -t tmpfs tmpfs /cgroup/tst2/
● mount -t tmpfs tmpfs /cgroup/tst3/
● mount -t cgroup -o freezer tst1 /cgroup/tst1/
● mount -t cgroup -o memory tst2 /cgroup/tst2/
● mount -t cgroup -o freezer,memory tst3 /cgroup/tst3
– The last command will give an error (-EBUSY). The reason: these subsystems (controllers) were already mounted separately.
Third case - no existing hierarchy
No existing hierarchy matches, and none of the requested subsystems are in use in an existing hierarchy:
mount -t cgroup -o net_prio netpriotest /cgroup/net_prio/
This will succeed.
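The three cases can be summarized as a small decision function. This is a user-space model of the documented rule only, not the kernel code; the helper names normalize and mount_decision are hypothetical:

```shell
#!/bin/sh
# normalize LIST: sort a comma-separated subsystem list, e.g.
# "memory,freezer" becomes "freezer,memory".
normalize() {
    printf '%s\n' "$1" | tr ',' '\n' | sort | paste -sd, -
}

# mount_decision REQUESTED [ACTIVE_HIERARCHY...]
# Prints "reuse", "EBUSY" or "new" according to the three documented cases.
mount_decision() {
    req=$(normalize "$1")
    shift
    # Case 1: an active hierarchy has exactly the same set of subsystems.
    for h in "$@"; do
        if [ "$(normalize "$h")" = "$req" ]; then
            echo reuse
            return 0
        fi
    done
    # Case 2: some requested subsystem is already in use elsewhere.
    for h in "$@"; do
        for s in $(printf '%s' "$req" | tr ',' ' '); do
            case ",$h," in
                *",$s,"*) echo EBUSY; return 0 ;;
            esac
        done
    done
    # Case 3: nothing matches and nothing is in use.
    echo new
}

mount_decision cpu,cpuacct cpuacct,cpu          # first case: reuse
mount_decision freezer,memory freezer memory    # second case: EBUSY
mount_decision net_prio freezer memory          # third case: new
```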
– Under each new cgroup which is created, these 4 files are always created:
● tasks
– list of pids which are attached to this group.
● cgroup.procs
– list of thread group IDs (TGIDs) attached to this group.
● cgroup.event_control
– Example in following slides.
● notify_on_release (boolean)
– For a newly created cgroup, the value of notify_on_release is inherited
from its parent; however, changing notify_on_release in the parent does not change the value in its existing children.
– Example in following slides.
– For the topmost cgroup root object only, there is also a release_agent – a
● Each subsystem adds specific control files for its own needs, besides
these 4 files. All control files created by cgroup subsystems are given a prefix corresponding to their subsystem name. For example:
cpuset subsystem
cpuset.cpus
cpuset.mems
cpuset.cpu_exclusive
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.sched_load_balance
cpuset.sched_relax_domain_level
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_spread_page
cpuset.memory_spread_slab
cpuset.memory_pressure_enabled
devices subsystem
devices.allow
devices.deny
devices.list
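Because of the prefix convention, the controllers attached to a hierarchy can be read off the file names in any cgroup directory. A minimal sketch, assuming the directory is passed as an argument; list_prefixes is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# list_prefixes DIR: print the distinct prefixes of the control files in DIR.
# Files without a dot (tasks, notify_on_release) are skipped by the glob.
# Note: the "cgroup" prefix (cgroup.procs, cgroup.event_control) belongs
# to the cgroup core, not to a subsystem.
list_prefixes() {
    for path in "$1"/*.*; do
        [ -e "$path" ] || continue      # unmatched glob stays literal
        name=${path##*/}                # strip the directory part
        printf '%s\n' "${name%%.*}"     # keep the text before the first dot
    done | sort -u
}

# Usage, with the mount point used elsewhere in these slides:
#   list_prefixes /cgroup/memtest
```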
cpu subsystem
cpu.shares (only if CONFIG_FAIR_GROUP_SCHED is set)
cpu.cfs_quota_us (only if CONFIG_CFS_BANDWIDTH is set)
cpu.cfs_period_us (only if CONFIG_CFS_BANDWIDTH is set)
cpu.stat (only if CONFIG_CFS_BANDWIDTH is set)
cpu.rt_runtime_us (only if CONFIG_RT_GROUP_SCHED is set)
cpu.rt_period_us (only if CONFIG_RT_GROUP_SCHED is set)
memory subsystem
memory.usage_in_bytes
memory.max_usage_in_bytes
memory.limit_in_bytes
memory.soft_limit_in_bytes
memory.failcnt
memory.stat
memory.force_empty
memory.use_hierarchy
memory.swappiness
memory.move_charge_at_immigrate
memory.oom_control
memory.numa_stat (only if CONFIG_NUMA is set)
memory.kmem.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.failcnt (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.failcnt (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.slabinfo (only if CONFIG_SLABINFO is set)
memory.memsw.usage_in_bytes (only if CONFIG_MEMCG_SWAP is set)
blkio subsystem
blkio.weight_device
blkio.weight
blkio.leaf_weight_device
blkio.leaf_weight
blkio.time
blkio.sectors
blkio.io_service_bytes
blkio.io_serviced
blkio.io_service_time
blkio.io_wait_time
blkio.io_merged
blkio.io_queued
blkio.time_recursive
blkio.sectors_recursive
blkio.io_service_bytes_recursive
blkio.io_serviced_recursive
blkio.io_service_time_recursive
blkio.io_wait_time_recursive
blkio.io_merged_recursive
blkio.io_queued_recursive
blkio.avg_queue_size (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.group_wait_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.idle_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.throttle.read_bps_device
blkio.throttle.write_bps_device
blkio.throttle.read_iops_device
blkio.throttle.write_iops_device
blkio.throttle.io_service_bytes
blkio.throttle.io_serviced
netprio subsystem
net_prio.ifpriomap
net_prio.prioidx
Note: the netprio_cgroup.ko module must be loaded (insmod) for the mount to succeed. Moreover, rmmod will fail while netprio is mounted.
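Before mounting net_prio it can help to check whether the module is loaded. A minimal sketch, assuming the module shows up under the name netprio_cgroup in /proc/modules; module_loaded is a hypothetical helper, and its optional second argument exists only so that the check can be run against a sample file. If netprio is built into the kernel rather than built as a module, it will not be listed there.

```shell
#!/bin/sh
# module_loaded NAME [MODULES_FILE]
# Succeeds if NAME is the first field of some line in MODULES_FILE
# (default: /proc/modules, where each line starts with the module name).
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# Usage (loading the module and mounting both need root):
#   module_loaded netprio_cgroup || modprobe netprio_cgroup
#   mount -t cgroup -o net_prio netpriotest /cgroup/net_prio/
```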
– When mounting a cgroup subsystem (or a set of cgroup subsystems), all
processes in the system belong to it (the top cgroup object).
● After mount -t cgroup -o memory test /cgroup/memtest/
– you can see all tasks by: cat /cgroup/memtest/tasks
– When creating new child cgroups in that hierarchy, each one of them will not have
any tasks at all initially.
– Example:
– mkdir /cgroup/memtest/group1
– mkdir /cgroup/memtest/group2
– cat /cgroup/memtest/group1/tasks
● Shows nothing.
– cat /cgroup/memtest/group2/tasks
● Shows nothing.
● Any task can be a member of exactly one cgroup in a specific hierarchy.
● Example:
● echo $$ > /cgroup/memtest/group1/tasks
● cat /cgroup/memtest/group1/tasks
● cat /cgroup/memtest/group2/tasks
● Will show that task only in group1/tasks.
● After:
● echo $$ > /cgroup/memtest/group2/tasks
● The task was moved to group2; we will see that task only in
group2/tasks.
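The "exactly one cgroup per hierarchy" rule can be checked by searching every tasks file under the mount point for a given pid. A minimal sketch; cgroup_of_task is a hypothetical helper name, and the hierarchy root is passed as an argument.

```shell
#!/bin/sh
# cgroup_of_task ROOT PID
# Prints the cgroup directory under ROOT whose tasks file lists PID.
# On a mounted hierarchy this should print exactly one line for a live task.
cgroup_of_task() {
    find "$1" -type f -name tasks | while IFS= read -r tasks_file; do
        # -x matches whole lines only, so pid 4 does not match pid 42.
        if grep -qxF "$2" "$tasks_file"; then
            dirname "$tasks_file"
        fi
    done
}

# Usage, following the example above:
#   echo $$ > /cgroup/memtest/group1/tasks
#   cgroup_of_task /cgroup/memtest $$
#   echo $$ > /cgroup/memtest/group2/tasks
#   cgroup_of_task /cgroup/memtest $$
```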