Container Namespaces
My notes from this Medium article.
Containers are:
- isolated
- groups of processes
- running on a single host
- which fulfill a set of common features.
Chroot
Most Unix operating systems have the ability to change the root directory of the current process (and its children). This is available as the syscall chroot(2)
. It is also known as jail.
What is the significance of changing the root?
It (kinda) changes the environment. The root contains all the binaries/libraries that processes can use. When you kick off a shell, one of the processes that gets kicked off is
bash
. The process looks for this in/bin/bash
.
It was used in the first approaches of running microservices. It is currently used by a wide range of applications, including within build systems for different distributions.
How can we set up a chroot environment? With little effort:
$ mkdir -p new-root/{bin,lib64}
$ cp /bin/bash new-root/bin
$ cp /lib64/{ld-linux-x86-64.so*,libc.so*,libdl.so.*,libreadline.so*,libtinfo.so*} new-root/lib64
// This isn't working on my system because a dependency is missing
$ sudo chroot new-root
This creates a new folder, copies the bash shell and its dependencies to this folder, and sets this new folder as the root. This jail only has bash capabilities (and everything that comes with bash, e.g. cd
, pwd
, etc.). Not really a useful jail, but a working example.
Could we run a binary in this jail and call it a container? No.
Why can't we create containers with chroot
?
Let's take a look at the 4 attributes of containers:
- isolated? It actually isn't isolated. We will expand on this below.
- groups of processes? Yes. In linux, processes live in a tree structure, so this means having a root process. A root process can then kick off child processes, giving us a group of processes.
- running on a single host? Yes.
- which fulfill a set of common features. Yes.* This is kinda vague, but we can set up a chroot environment with proper functionalities to support a set of common features.
Let's take a look at the "isolation" a jail provides.
The current working directory is unchanged when calling chroot(2)
via a syscall. Although your absolute paths are different now, relative paths can still refer to files outside of the new root.
Only priviledged processes with the CAP_SYS_CHROOT
capability are able to call chroot
.
Calls to chroot
do not stack. It will override the current jail. So a root user could escape jail with a program like:
#include <sys/stat.h>
#include <unistd.h>
int main(void)
{
mkdir(".out", 0755); //Create a new folder
chroot(".out"); //Make it the new root, removing the old jail
chdir("../../../../../"); //Changed the working directory to a location outside of the jail
chroot("."); //Made it the new root
return execl("/bin/bash", "-i", NULL);
}
Did we need to create a new jail before changing the working directory outside of the jail? Could we not have used the current jail?
So, chroot
doesn't give isolation of the file system.
We can sneak peak outside of a jail from a process perspective:
$ mkdir /proc
$ mount -t proc proc /proc
$ ps aux
...
There is no process isolation at all. We can even kill processes outside of the jail.
We can sneak peak out of a jail from a network perspective:
$ mkdir /sys
$ mount -t sysfs sys /sys
$ ls /sys/class/net
eth0 lo
There is no network isolation either.
This lack of isolation is a security risk - and doesn't meet the criteria for a container to be isolated.
How do we address this? Using Linux namespaces.
Linux Namespaces
The idea - wrap certain global system resources in an abstraction layer. This allows different groups of processes to have different views of the system.
Processes within a namespace have their own isolated instance of the resources.
- Namespaces
- (General) a space of names
- (Linux) a type of namespace (e.g. process id, mount, etc.)
What is important here is that the term "namespaces" are used in different ways. When a process is "in a namespace", it is in an instance of a namespace type.
There are 7 namespace (or I'll refer to is as "namespace types" to keep terminology less confusing):
- mnt
- pid
- net
- ipc
- uts
- cgroup
- user
A process will be in a namespace of every type. (It will have a specific
mnt
ns,pid
ns, etc.)
There were two more proposed in 2016 - time and syslog, but they have not been implemented yet.
With the introduction of user namespace in 2013, the kernel became "container ready".
Namespace API
Before we look at the namespaces, it'll be useful to look at the namespace API. It consists of 3 system calls.
Coming from an OO background, I assumed "namespace API" meant the API of a namespace. I assume that each namespace had a "verb" or an implementation of the below API's that could vary among the namespace types. This is not the case.
Rather, it is an API of the kernel (?) that operates with namespaces. These are syscall's.
1. clone
clone(2)
creates a new child process, like fork(2)
. Unlike fork
, clone
allows the child process to share parts of its execution context with the calling process (such as the memory space, the table of file descriptors, and table of signal handlers).
You can pass different namespace flags to clone
to create new namespaces for the child process.
2. unshare
unshare(2)
allows a process to disassociate parts of the execution context which are currently being shared with others.
3. setns
setns(2)
reassociates the calling thread with the provided namespace file descriptor.
This can be used to join an existing namespace.
proc
Not a syscall, but the proc
filesystem provides additional namespace related files. Each file in /proc/$PID/ns
is a magic link that be used as a handle for performing operations (like setns
) to the referenced namespace.
$ ps
PID TTY TIME CMD
24900 pts/2 00:00:00 zsh
25039 pts/2 00:00:00 ps
$ ls -Gg /proc/24900/ns
total 0
lrwxrwxrwx 1 0 Feb 21 16:37 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 0 Feb 21 16:37 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 0 Feb 21 16:37 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 0 Feb 21 16:37 net -> 'net:[4026532000]'
lrwxrwxrwx 1 0 Feb 21 16:37 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 0 Feb 21 16:37 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 0 Feb 21 16:37 time -> 'time:[4026531834]'
lrwxrwxrwx 1 0 Feb 21 16:37 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 0 Feb 21 16:37 user -> 'user:[4026531837]'
lrwxrwxrwx 1 0 Feb 21 16:37 uts -> 'uts:[4026531838]'
$ ls -Gg /proc/self/ns
total 0
lrwxrwxrwx 1 0 Feb 21 16:40 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 0 Feb 21 16:40 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 0 Feb 21 16:40 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 0 Feb 21 16:40 net -> 'net:[4026532000]'
lrwxrwxrwx 1 0 Feb 21 16:40 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 0 Feb 21 16:40 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 0 Feb 21 16:40 time -> 'time:[4026531834]'
lrwxrwxrwx 1 0 Feb 21 16:40 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 0 Feb 21 16:40 user -> 'user:[4026531837]'
lrwxrwxrwx 1 0 Feb 21 16:40 uts -> 'uts:[4026531838]'
We can actually track which namespaces a process resides in using this.
Another tool, util-linux
package, contains dedicated wrapper programs for the above syscalls.
Available Namespaces (or Namespace Types)
Mount (mnt)
With the mnt
namespace, Linux is able to isolate a set of mount points by a group of processes. Another way to put this, for a group of processes, we have a specific set of mount points. This abstraction gives us the ability to create an entire virtual environment where we are the root user even without root permissions.
The flag for this namespace type is CLONE_NEWNS
(CLONE NEW NameSpace). This was the first implemented namespace (2002) and most people didn't think more than one namespace was needed, which was why this flag is so generic.
A use case for the mnt namespace is to improve our jail (by making it more secure):
$ sudo unshare -m
# mkdir mount-dir
# mount -n -o size=10m -t tmpfs tmpfs mount-dir
# man mount
# df mount-dir/
Filesystem Size Used Avail Use% Mounted on
tmpfs 10M 0 10M 0% /home/arjun/mount-dir
# touch mount-dir/{0,1,2}
# ls mount-dir/
0 1 2
If we try to access this on the host system:
$ la mount-dir
total 0
$ grep mount-dir /proc/mounts
$
We see folder (a file) created in the first namespace, but no files under it. This is because we mounted a tmpfs (a temporary fs) to that folder and added files to that mount. Since our host system is in another mnt
namespace, it cannot see these files, or see this mount.
We can actually see the mount point in the mountinfo
file inside of the proc
filesystem:
$ grep mount-dir /proc/$(pgrep -u root bash)/mountinfo
441 440 0:54 / /home/arjun/mount-dir rw,relatime - tmpfs tmpfs rw,size=10240k,inode64
The memory being used here is in an abstraction layer called Virtual File System (VFS), which is part of the kernel. This is also where other file systems are based on. If the namespace gets destroyed, the mount memory is unrecoverably lost.
How to work with these mount points in the source code? Programs tend to keep track of the
/proc/$PID/ns/mnt
file, which is referring to the used namespace.
UNIX Time-sharing System (uts)
With the uts
namespace, we can unshare the domain and hostname from the current host system.
$ hostname
arjun-b250hd3
$ sudo unshare -u
# hostname
arjun-b250hd3
# hostname a-cooler-hostname
# hostname
a-cooler-hostname
and if we look at the system:
$ hostname
arjun-b250hd3
it is unchanged. This is useful for container networking related topics.
Interprocess Communication (ipc)
With the ipc
namespace, we can isolate interprocess communication (IPC) resources.
These are System V IPC objects and POSIX message queues.
Use Case - Separate the shared memory (SHM) between two processes.
Each process will be able to use the same identifiers for a shared memory segment and have two different regions.
When an IPC namespace is destroyed, all IPC objects in the namespace is destroyed too.
Process ID (pid)
With the pid
namespaces, processes can have independent set of process identifiers (PIDs).
Process in two different pid
namespaces can have the same PID.
A process will have more than one PID:
- the PID inside the namespace
- the PID on the host system
The pid
namespace can be nested, so a process will have a PID for each namespace from its current namespace up to the initial namespace.
Does this mean the host system can still see this process? This namespace doesn't seem to isolate the process from other namespaces. Only the PID.
The first process created in a pid
namespace gets the number 1 and gains all the same special treatment as the usual init process, like:
- All processes within a namespace will be reparented to the namespace's PID 1 instead of the host system's PID 1.
- The termination of this PID will terminate all processes in its PID namespace and any descendants.
$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 171184 10752 ? Ss 16:22 0:00 /sbin/init
root 2 0.0 0.0 0 0 ? S 16:22 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? I< 16:22 0:00 [rcu_gp]
.... //The list goes on
$ sudo unshare -fp --mount-proc
# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 7728 4484 pts/4 S 17:27 0:00 -bash
root 6 0.0 0.0 9876 3424 pts/4 R+ 17:27 0:00 ps aux
#
--mount-proc
is needed to remount theproc
filesystem. Why? We didn't unshare themnt
namespace.
Network (net)
With the net
namespace, we can virtualize the network stack.
Each network namespace contains its own resource properties within /proc/net
.
A network namespace only contains a loopback interface on creation:
$ sudo unshare -n
[arjun-b250hd3 gapuchi.github.io]# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Every network interface (physical or virtual) is present exactly once per namespace.
An interface can be moved between namespaces.
Each namespace contains:
- a private set of IP addresses
- its own routing table
- socket listing
- connection tracking table
- firewall
- other network-related resources
Destroying a network namespace will destroy any virtual interfaces and move any physical interfaces back to teh initial network namespace.
Use Case - Software Defined Network
TBD
User ID (user)
With the user
namespace, we can isolate user and group IDs.
This allows a user and group ID's of a process be different inside and outside of the namespace.
Use Case - A process can have a normal, unprivileged user ID outside a user namespace while being fully privileged inside.
$ whoami
arjun
$ id -u
1000
$ unshare -U
$ whoami
nobody
$ id -u
65534
After we create the namespace, the files /proc/$PID/{u,g}id_map
expose the mappings for user and group ID's for the PID. This can be written only once. These files contain a 1:1 mapping a range of contiguous user IDs between two user namespaces.
An example file:
> cat /proc/$PID/uid_map
0 1000 1
This means user ID starting at 0
are mapped to ID 1000
, with the length of the range of 1
. (Meaning only user ID 0
is mapped to 1000
. user ID 1
isn't mapped to 1001
with this.)
If a process tried to access a file, its user and group IDs are mapped to the initial user namespace (for permission checking).
When a process retrieves file user and group IDs (via stat(2)
), the IDs are mapped in the opposite direction (i.e. to ID's within the namespace).
The file /proc/$PID/setgroups
contains either allow
or deny
(literally) to enable or disable the permission to call the setgroups(2)
syscall within a user namespace. This prevented an unprivileged process to create a new namespace where the user had all the privileges. The user would be able to drop groups with setgroups(2)
to gain access files previously not accessible.
Control Group (cgroup)
cgroups was a dedicated kernel feature to support resource limiting, prioritization, accounting, and controlling.
With the cgroups
namespace, we can prevent leaking host information into a namespace.
Not really sure what cgroup really is. Will expand this with the demo later.
Composing Namespaces
Namespaces are composable, making it possible to have isolated pid
namespaces which share the same network interface (which apparently is done in Kubernetes Pods).
As an example, let's create a new pid
namespace:
$ sudo unshare -fp --mount-proc
# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 7728 4452 pts/2 S 18:25 0:00 -bash
root 6 0.0 0.0 9876 3352 pts/2 R+ 18:26 0:00 ps aux
The setns(2)
syscall (with its appropriate wrapper program nsenter
) can be used to join the namespace. First find the pid
we want to join with:
# export PID=$(pgrep -u root bash)
Now we can join the namespace:
# sudo nsenter --pid=/proc/$PID/ns/pid unshare --mount-proc
# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 7728 4488 pts/2 S 18:25 0:00 -bash
root 10 0.0 0.0 15764 7268 pts/2 S 18:26 0:00 sudo nsenter --pid=/proc/1/ns/pid unshare --mount-proc
root 11 0.0 0.0 5360 740 pts/2 S 18:26 0:00 nsenter --pid=/proc/1/ns/pid unshare --mount-proc
root 12 0.0 0.0 7720 4480 pts/2 S 18:26 0:00 -bash
root 17 0.0 0.0 9876 3320 pts/2 R+ 18:26 0:00 ps aux
#
We are now in the same pid
namespace. Cool!