A History of Least Privilege in Linux

Murph · Aug 17, 2019

For the last few years, people have been looking to containers as the solution for securing and scaling their software. But what exactly makes a container? What came before containers, and what can we do to secure applications that aren’t containerized? What is the difference between a container and a virtual machine? Who funds all this?

Who builds this stuff anyway?

Academia, driven by grants from the government (usually hoping to turn stuff into weapons), develops new technology. The first computers built in the UK and US were funded by their countries’ census bureaus (in 1890 and 1946, respectively), followed quickly by grants and contracts from the Department of Defense’s Advanced Research Projects Agency for computers to simulate weather conditions during battle and to keep tabs on enemies during wars. Private companies begin developing and selling computers with proprietary operating systems to universities and other large customers. The DOD eventually begins to think about security requirements for its own computer usage, and private companies want to make more money by allowing multiple customers to use the same computer at the same time.

That is where our story begins. How do we let more than one process or user run safely and at the same time? The two answers we have been working on for the last several decades are virtualization of multiple operating systems and least-privilege containment within a single operating system.

What is Virtualization?

Virtualization is a technique that uses software called a hypervisor to emulate and mediate access to a machine’s “bare metal” hardware for the purposes of running one or more “guest” operating systems. The hypervisor can present virtual hardware resources to multiple instances of different operating systems concurrently, allocating what each needs but potentially restricting access to some hardware resources. Separate users running applications on different guest virtual machines do not interfere with each other. A Type 1 hypervisor is essentially a very lean operating system whose only function is to run other operating systems; a Type 2 hypervisor is software running on top of a more conventional operating system, like VirtualBox on OS X.

But why did we want to run multiple operating systems to begin with? By 1954, academics wanted to solve the problem of time-sharing on individual computers, making them more efficient by handling multiple users doing different tasks at once. MIT professor Fernando Corbató (a Navy veteran whose lab received ARPA funding) created the first experimental time-sharing system in 1961 on a single operating system. IBM built the first hypervisor in 1966 as their solution to time-sharing on their mainframes.

Sharing the Kernel

Dr. Corbató’s lab partnered with GE and Bell Labs to write the successor time-sharing operating system, which they named Multics. Because it was conceived of as shared infrastructure from the beginning, it implemented novel security ideas, which we discuss in a moment.

Bell Labs pulled out of Multics in 1969 when they realized it was losing them money, but their developers from the project missed programming on it. So two of the original developers, Ken Thompson and Dennis Ritchie, wrote a near-clone called Unix and released it internally that year. People loved it. They rewrote it in 1973 with Ritchie’s new high-level language, C. This made it a uniquely portable operating system, which greatly aided its future spread.

The Spread of Unix & the Unix Wars

Ma Bell had at various times been pressed with antitrust lawsuits by the US government for its telephone monopoly, eventually being forced to break up in 1982. Under a 1956 settlement of one such case, AT&T was barred from entering businesses other than telephony and was required to license any of its patents upon request. That meant that Unix could not be sold as a product. So when Thompson and Ritchie presented Unix at a conference in 1973 and got tons of requests for it, they shipped the system to educational institutions for the cost of media and shipping. That meant many university students used and loved Unix, but it was not open source.

In 1975, Ken Thompson took a sabbatical from Bell Labs and came to Berkeley as a visiting professor. He helped to install Version 6 Unix and started compiling the first Berkeley Software Distribution with two grad students in 1977. As it was an add-on to Version 6 Unix, rather than a complete operating system, it was subject to an AT&T software license. In 1989, the Berkeley group decided to rewrite all of the proprietary AT&T code, and by 1991 released a nearly complete, AT&T-free operating system called Net/2. This was the basis for two separate ports of BSD to the Intel 80386 architecture: the free 386BSD by William Jolitz and the proprietary BSD/386 by Berkeley Software Design (BSDi). AT&T’s Unix System Laboratories (USL), the owners of the Unix trademark, sued BSDi for copyright infringement in 1992; the suit led to an injunction on the distribution of Net/2 for two years, during which development slowed to a crawl.

Around the same time, Andrew S. Tanenbaum, author of the textbook Operating Systems: Design and Implementation, wrote an example “mini Unix clone” he called Minix, but it was not open source at the time (it is now).

Yup, that Richard Stallman.

A researcher at MIT’s Artificial Intelligence Lab named Richard Stallman was (and has remained) infuriated by this lack of openness and started the Free Software Foundation in 1985 to write free and open source clones of Unix and its tools. The kernel of this operating system, called Hurd, was never completed. However, a brilliant set of portable tools, including a shell, text editor, and compiler, was completed. Had the Hurd been completed, or had BSD not been mired in lawsuits, Linux might never have been written or spread.

Enter Linux (1991)

A university student in Helsinki named Linus Torvalds, frustrated with the lack of a free operating system kernel, uses the GNU C Compiler on Minix to write his own free kernel and announces it to the Minix users mailing list. He ports GCC and bash, attracting the attention and help of many GNU contributors. As a Unix clone, Linux implements core Unix security features; as a popular operating system, those security features have been extended by many developers whose employers have a vested interest in correctly isolating workloads.

Unix Security Features

Userspace cat sends a syscall to kernel cat.

Protection rings

The kernel is the core software of an operating system; it mediates between hardware and software and in general has complete control over everything. It handles start-up of the rest of the OS, process management, memory management, peripheral device management, and I/O management. When a program in user mode requires access to an operating system resource, it must ask the kernel to provide access to that resource via a system call. When a program makes a system call, the mode is switched from user mode to kernel mode (this transition is commonly called a context switch, though strictly speaking it is a mode switch). Then the kernel provides the requested resource and switches back to user mode. Examples of resources requiring syscalls are:

  • Creating, opening, closing and deleting files.
  • Creating and managing new processes.
  • Sending and receiving network packets.
  • Requesting access to a hardware device, like a mouse or a printer.
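You can watch a program make these syscalls with strace (assuming it is installed); for example, tracing the file-related calls that cat makes:

$ strace -e trace=openat,read,write,close cat /etc/hostname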
Kernel cat operates in ring 0.

Since Multics, CPU hardware has had a security model called rings. It separates where higher- and lower-privilege code can run, preventing lower-privilege processes from directly manipulating hardware devices, the operating system, and other high-privilege things. In Unix-like systems, the kernel runs in ring 0 (the highest-privilege, innermost ring) and user programs run in ring 3.

Discretionary Access Control

Discretionary access control (DAC) is a type of access control defined by the Trusted Computer System Evaluation Criteria (discussed below) “as a means of restricting access to objects based on the identity of subjects and/or groups to which they belong. The controls are discretionary in the sense that a subject with a certain access permission is capable of passing that permission (perhaps indirectly) on to any other subject (unless restrained by mandatory access control)”.

DAC is arguably the main security mechanism of Unix, but is not powerful enough to satisfy government requirements around confidentiality (discussed more below), which requires mandatory access control.
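The classic Unix permission bits are DAC in action: a file’s owner can, at their own discretion, pass access on to others. A quick sketch (the file and users are hypothetical):

$ ls -l notes.txt
-rw------- 1 alice staff 42 Aug 17  2019 notes.txt
$ chmod g+r notes.txt    # alice chooses to let her group read her file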

Mandatory access control (MAC) refers to a type of access control by which the operating system constrains the ability of a user or process to perform some sort of operation on an object or target. Subjects and objects each have a set of security attributes. Whenever a subject attempts to access an object, an authorization rule enforced by the kernel examines these security attributes and decides whether the access can take place. The MAC policy is centrally controlled by a security policy administrator; users do not have the ability to override the policy and grant access to things that would otherwise be restricted. Mandatory access control was not an original feature of Unix, but was implemented later by several independent groups for Linux.

Chroot Is Not Actually a Security Feature

Unix V7, released in 1979, included the chroot system call, but nothing in the OS seems to have actually used it. Its function is to change the perceived root directory of a program. Users quickly began employing it to contain network-exposed processes like FTP. However, as noted in its man page, chroot was never meant to fully contain processes:

This call changes an ingredient in the pathname resolution process and does nothing else. In particular, it is not intended to be used for any kind of security purpose, neither to fully sandbox a process nor to restrict filesystem system calls. In the past, chroot() has been used by daemons to restrict themselves prior to passing paths supplied by untrusted users to system calls such as open(2). However, if a folder is moved out of the chroot directory, an attacker can exploit that to get out of the chroot directory as well. The easiest way to do that is to chdir(2) to the to-be-moved directory, wait for it to be moved out, then open a path like ../../../etc/passwd.

This call does not change the current working directory, so that after the call ‘.’ can be outside the tree rooted at ‘/’. In particular, the superuser can escape from a “chroot jail” by doing:

mkdir foo; chroot foo; cd ..

This call does not close open file descriptors, and such file descriptors may allow access to files outside the chroot tree.

Escaping a jail.

General requirements for reasonable chroot are:

  • All directories must be root:root owned
  • Superuser process cannot be run in chroot
  • Do not mount /proc
  • Use a unique user and group ID
  • No files can be modified or created
  • Close all file descriptors before chrooting
  • chdir into the correct directory before chroot

To create a chroot jail, create a directory that will be the new root and copy any binaries you want available into the jail, as well as any shared library dependencies (which you can identify with ldd).
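A minimal sketch, assuming an x86-64 glibc system (the library paths below are illustrative; copy whatever ldd actually lists on your machine, preserving the paths):

$ mkdir -p /srv/jail/bin /srv/jail/lib/x86_64-linux-gnu /srv/jail/lib64
$ cp /bin/sh /srv/jail/bin/
$ ldd /bin/sh                                   # lists the shared libraries sh needs, with their paths
$ cp /lib/x86_64-linux-gnu/libc.so.6 /srv/jail/lib/x86_64-linux-gnu/
$ cp /lib64/ld-linux-x86-64.so.2 /srv/jail/lib64/
$ sudo chroot /srv/jail /bin/sh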

Not to be confused with the FDA’s orange book.

The Orange Book and SELinux

Trusted Computer System Evaluation Criteria (TCSEC) is a standard written in 1983 by the National Computer Security Center (NCSC), an arm of the NSA, for assessing an operating system’s security and access controls. The TCSEC was used to evaluate computer systems being considered for interacting with classified information. Multics was evaluated to be class B2 in 1985. To meet the A1 platform requirements, the NCSC started a project in 1992 to create their own operating system. After building their own OS, they began porting the access control framework, FLASK, to other operating systems. They released the Linux kernel patch under the GPL license to the public in 2000 under the name Security-Enhanced Linux, or SELinux. SELinux implements MAC with label control. This means that every object or subject in the operating system needs to have its own special label called a security context, which is stored in the extended attributes of the filesystem.
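On an SELinux-enabled distribution such as Fedora, you can inspect these security contexts with the -Z flag that many core utilities support:

$ ls -Z /etc/passwd
system_u:object_r:passwd_file_t:s0 /etc/passwd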

Linux Security Modules

When the NSA presented SELinux at the 2001 Linux Kernel Summit, there were a few other people presenting similar papers on enhancing the security of Linux and lobbying to get their patches merged into mainline. Linus Torvalds declared that he didn’t want to arbitrate whose approach to security was best and suggested that maintainers work together to develop a general framework allowing the kernel to support a variety of security software, including features like MAC. For the next 3 years, the NSA and several private companies worked on that framework, called Linux Security Modules (LSM). LSM inserts “hooks” (upcalls to the module) at every point in the kernel where a user-level system call is about to result in access to an important internal kernel object. The LSM design was presented at USENIX Security 2002, and any projects that wanted to be merged into mainline had to be rebuilt to work with LSM hooks.

AppArmor

Crispin Cowan started a grad school project named SubDomain in 1998, then created a secure commercial Linux distribution called Immunix featuring SubDomain and other improvements. Although the Immunix team was involved in the development of LSM, they did not adopt it early.

Novell bought Immunix in 2005, in large part for SubDomain, rebranding it as AppArmor and cleaning it up for inclusion into the mainline kernel. Two years later, they laid off most of the original development team. Canonical picked up maintenance in 2009. Due to these corporate associations, it is still the default LSM on SUSE and Debian variants.

AppArmor is generally considered easier to use than SELinux, which can be tedious to configure, as it requires reading a lot of logs of the app running in permissive mode and hand-building a policy. In particular, AppArmor’s utility aa-genprof makes it very easy to profile an application by running and interacting with it and automatically creating a policy based on its activity logs.

sudo aa-genprof /path/to/your/program
[Now interact with all the features of your program for a while...]

It then presents the user with two options, (S)can system log for entries to add to profile and (F)inish. aa-genprof will parse the complain mode logs, reload the updated profiles in complain mode, and again prompt the user for (S)can and (F)inish. This cycle can then be repeated until no access violations are generated. When the user eventually hits (F)inish, aa-genprof will set the main profile, and any other profiles that were generated, into enforce mode and exit.
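Once a profile exists, you can check and switch its mode with the standard AppArmor utilities (the profile path here is illustrative):

sudo aa-status                                          # list loaded profiles and their modes
sudo aa-complain /etc/apparmor.d/usr.bin.yourprogram    # log violations without blocking
sudo aa-enforce /etc/apparmor.d/usr.bin.yourprogram     # block and log violations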

Capabilities — 1999

While MAC can be a great idea for limiting the powers of the root user, many systems never get MAC running, so the root user, and any processes running as it, present a juicy attack surface. There are many reasons a process might need to perform an operation as root, such as ping having to open a raw socket to send an ICMP message. The purpose of capabilities is to granularly assign the abilities of root to a user or process. As noted in the capabilities man page:

For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero). Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process’s credentials (usually: effective UID, effective GID, and supplementary group list).

Rather than give a program like ping the ability to bypass all kernel permission checks just to see if a host is up, we can now give it CAP_NET_RAW (try getcap /usr/bin/ping to verify that your system already does this) and get our task done with the least privileges required. Capabilities are implemented using the security extended filesystem attributes, as SELinux’s labels are.

Find setuid and setgid-root files in your environment with the following commands, and consider assigning them capabilities instead.

$ find /usr/bin /usr/lib -perm /4000 -user root
$ find /usr/bin /usr/lib -perm /2000 -group root
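For example, rather than leaving ping setuid root, you can grant it just the raw-socket capability (a sketch of what many distributions now do by default):

$ sudo chmod u-s /usr/bin/ping               # drop the setuid-root bit
$ sudo setcap cap_net_raw+ep /usr/bin/ping   # grant only raw-socket access
$ getcap /usr/bin/ping                       # verify the file capability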

Plan 9 from Namespaces — 2002

At Bell Labs, developers had moved on from working on Unix to building a distributed operating system called Plan 9. It introduced a novel security idea: presenting every user and process with a private view of the system, called a name space.

A Linux namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to processes outside it. There are 7 kinds of namespaces currently implemented, but it wasn’t until 2013, with the completion of the user namespace, that they were considered robust enough to provide the level of isolation that we expect from containers today.
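You can experiment with namespaces directly using unshare from util-linux; a quick sketch that creates a new PID namespace:

$ sudo unshare --fork --pid --mount-proc bash
$ ps aux    # inside the namespace, only bash and ps are visible, numbered from PID 1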

Seccomp — 2005

Seccomp was contributed to the kernel by Andrea Arcangeli in 2005 for use in public grid computing, a use case essentially defined by running untrusted code, in the hope of creating a marketplace for selling unused CPU cycles (it never really took off). It focuses on keeping processes running in user mode from being able to make unnecessary system calls to the kernel. Support for seccomp was picked up by Google for use in Chrome, because using a browser is basically the same situation: running untrusted code. Using the Berkeley Packet Filter (since 2012), seccomp allows whitelisting of syscalls for a program. If the program attempts to make any system call not on the list, the kernel will terminate the process.

The Berkeley Packet Filter was written in 1992 at Lawrence Berkeley National Laboratory (that’s the US government again, for those keeping track) to allow userland programs to filter which packets they want to receive. The filters, called BPF programs, run in the BPF in-kernel virtual machine. The kernel statically analyzes BPF programs before loading them, in order to ensure that they terminate and cannot harm the running system. Although tcpdump was the first program to use it, BPF can now filter much more than just raw network packets. In the context of seccomp, it acts as a firewall for syscalls.

You can build a [potentially incomplete] profile for your container by running strace on your process:

strace -c -f -S name /path/to/binary 2>&1 1>/dev/null | tail -n +3 | head -n -2 | awk '{print $(NF)}' 

Alternatively, Docker and Kubernetes already come with a default profile that blocks a fair number of the 300+ syscalls in Linux, which you can try first.
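To apply a custom profile to a container, pass it at run time (the JSON path here is illustrative):

docker run --security-opt seccomp=/path/to/profile.json alpine sh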

An improperly contained runaway process.

Cgroups — 2006

In 2006, Google released Process Containers, quickly renamed cgroups, for limiting, accounting for, and isolating the usage of hardware resources (CPU, memory, disk I/O, network, etc.) by a collection of processes. In this way it simulates the mediation of a hypervisor and can prevent denial-of-service types of attacks. As you might expect of a feature with such well-funded maintainers, it was merged into the mainline kernel the very next year.
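You can poke at the mechanism directly through the cgroup filesystem; a sketch on a cgroup-v1 system that limits the current shell to 64 MB of memory (paths differ under cgroup v2):

$ sudo mkdir /sys/fs/cgroup/memory/demo
$ echo $((64 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
$ echo $$ | sudo tee /sys/fs/cgroup/memory/demo/cgroup.procs   # move this shell into the group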

Putting it all together to make a container

Now that we have all 7 namespaces and cgroups, we can build a container without patching the kernel! LXC (LinuX Containers) was the first implementation of containers with those building blocks, requiring no kernel patches to work. Cloud Foundry used LXC as the basis for their container solution, Warden, in 2011 before implementing their own container engine. In 2013, Docker similarly started with LXC before building their own libcontainer (now runc); in the same year, Google released their own internal container stack to the FOSS world as Let Me Contain That For You (LMCTFY). They contributed the core concepts to libcontainer and stopped maintaining LMCTFY separately in 2015.

Orchestration/Kubernetes — 2015

In addition to compelling security controls, containers provided the ability to quickly and easily scale the number of instances of a running application as needed in a production environment. While many large companies solved this problem in their own way, none was as successful and powerful as Google’s Borg. Google introduced the Borg System around 2003–2004. Borg was a large-scale internal cluster management system, which ran hundreds of thousands of jobs, from many thousands of different applications, across many clusters, each with up to tens of thousands of machines. They open sourced a rebuilt Borg as Kubernetes in 2015.

Kubernetes introduces a concept called Pod Security Policies, which gate the admission of any pod (a collection of containers that share resources) into a cluster unless it enforces stronger controls, like MAC with SELinux or AppArmor, not mounting the root filesystem read/write, and restricting capabilities. Seccomp enforcement in Pod Security Policies is in alpha, and since seccomp is not an LSM, it can be layered with AppArmor or SELinux.
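A minimal sketch of a restrictive policy (field values are illustrative; PodSecurityPolicy lives in the policy/v1beta1 API as of this writing):

kubectl apply -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  readOnlyRootFilesystem: true
  requiredDropCapabilities: ["ALL"]
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
EOF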

Scaling up: one container bails, 3 pop up to replace it.

A note about BSD and OSX

FreeBSD jails are one of the early container technologies, introduced in FreeBSD 4.0 in 2000 (developed by Poul-Henning Kamp at the request of the hosting provider R&D Associates). The jail system call is similar to chroot, but includes additional process sandboxing features for isolating the filesystem, users, networking, etc. As a result it can provide a means of assigning an IP address to each jail, with custom software installations and configurations.
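Given a populated root directory, a throwaway jail can be started with a single jail(8) command (run as root; every parameter here is illustrative):

jail -c name=demo path=/srv/jail host.hostname=demo.example ip4.addr=192.0.2.10 command=/bin/sh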

BSD has some equivalent concepts to the above, such as jails and the jail syscall in place of containers layered from many technologies, or the plain chroot syscall. OpenBSD also has an interesting seccomp-like syscall, pledge, that lets a process drop privileges at a given point in its execution. However, most of what makes containers appealing to developers is the ability to orchestrate for scale and reliability. You can make a portable jail config with iocell that’s a lot like a container image, but I have not heard of anything clever on top of that to rival Kubernetes in production (Docker for BSD has been experimental and ignored since 2015).

OS X is based on the Darwin operating system, which uses the XNU hybrid kernel (itself making use of the Mach kernel plus various BSD components and I/O Kit). It is POSIX compliant. It requires all App Store apps to use App Sandbox entitlements and recommends them for all other software. It has also (since Yosemite) provided a full lightweight hypervisor implementation, the Hypervisor framework, that it recommends for apps. Docker for Mac is actually built on HyperKit, which uses this Hypervisor framework.
