service-manager

Name

service-manager — service manager

Synopsis

service-manager

Description

service-manager manages a set of services, allowing their service processes to be programmatically brought up and down, and providing automatic restart upon failure.

It expects file descriptor 3 to be a (datagram) socket that has been set up to listen for incoming datagrams. This is its main control socket, through which it receives requests to load, unload, and pipe together services from utilities such as service-dt-scanner(1) and system-control(8). It creates individual control FIFOs for each service, through which it receives requests to send signals to the service and to bring it up or down, from utilities such as service-control(1).

system-manager(8) invokes service-manager with the appropriate socket (which it sets up itself) and output directed to a logging dæmon. So also does per-user-manager(1). Alternatively, service-manager can be started by local-datagram-socket-listen(1), which will set up the appropriate socket. (service-manager can even be started as a "socket-activated" dæmon by systemd(1) with the systemd-recommended Accept=false.)

Services

Each service comprises several files in the filesystem, contained in two directories. (system-control(1) builds upon these two with further directories, to construct a service bundle, for the details of which see its manual page.)

Service directories

A service directory is the current directory in which a service process is run. It contains:

a run file, which is the executable file for the service process itself;
a start file, which is the executable file to be run when a service is first brought up (but not when it is automatically restarted);
a restart file, which is the executable file to be run when a service has ended (to determine whether it should automatically be restarted); and
a stop file, which is the executable file to be run when a service is finally taken down (but not when it is automatically restarted).
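
As an illustration of the layout just listed, the following Python sketch reports which of the four programs are absent from a service directory or are not executable. The function name missing_programs is hypothetical and is not part of any nosh tool. The sketch only reports; an absent restart file, in particular, is a valid configuration.

```python
import os

# The four program names listed above for a service directory.
PROGRAMS = ("run", "start", "restart", "stop")


def missing_programs(service_dir):
    """Return the names in PROGRAMS that are absent or not executable files.

    This is a hypothetical helper for illustration only; service-manager
    itself performs no such check on behalf of the administrator.
    """
    missing = []
    for name in PROGRAMS:
        path = os.path.join(service_dir, name)
        # os.path.isfile follows symbolic links, so a link to a binary counts.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

For a directory that contains only an executable run file, the function returns the other three names.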

Although there is nothing to stop them from being binaries, the executable files are usually scripts interpreted by nosh(1), execlineb(1), or a shell. They set up various parts of the process state (using commands such as softlimit(1), setenv(1), setuidgid(1), and open-controlling-tty(1)) and then chain to the service program proper.

A service directory can also contain:

ancillary files required by the service itself, varying from service to service. For example:
- A tcp-socket-accept(1) service could have an access-control database managed by ucspi-socket-rules-check(1).
- Many services have env subdirectories read by envdir(1) in order to control dæmon process environment variables.
further files used by other tools. For example:
- A down file indicates to system-control(1), service-is-enabled(1), and service-dt-scanner(1) that a service should not be auto-started at bootstrap.
- A remain file indicates to system-control(1) that a service should be marked as "run on empty", so that it is considered still running even if it has no processes.
- A ready_after_run file indicates to system-control(1) that a service should be considered "ready" after it has finished running, and has either remained in the running state with no processes or transitioned to the stopped state with a prior run recorded in its status.
- A use_hangup file indicates to system-control(1) that a service should (additionally) be sent the SIGHUP signal when shutting it down.
- A no_kill_signal file indicates to system-control(1) that a service should not be sent the SIGKILL signal when shutting it down.

These files are ignored by service-manager.

The service manager does not need write access to the service directory or to any of the executables within it. This permits service directories (as long as the services themselves do not require write access to their service directories) to reside on read-only volumes.

Supervise directories

A supervise directory provides the control/status API for the service supervisor. It contains:

an ok FIFO that does nothing more than signify that the service manager has loaded the service;
a control FIFO through which commands to control the individual service process (for which see service-control(1)) are sent;
a status file that contains a record of the service process ID, start time, and control state; and
a lock file (compatible with setlock(1)) that prevents the service manager from re-using an active supervise directory.

The service manager requires read-write access to these files, and write access to the supervise directory itself, as it creates the files if they do not exist to start with. However, it does not require write access to the supervise directory once the files have been created. (The supervise(1) program in daemontools repeatedly re-creates the status file, in contrast.)

Control of services and access to service status is thus subject to ordinary permissions and ACLs on these files.
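
One way that a client might test for a loaded service is the technique used by svok(1) in daemontools: open the ok FIFO for writing without blocking, which fails when no process holds the FIFO open for reading. The Python sketch below assumes that service-manager keeps a read end of ok open while the service is loaded; this page does not state that, so treat the sketch as illustrative. The name is_loaded is hypothetical.

```python
import errno
import os


def is_loaded(supervise_dir):
    """Return True if some process holds the ok FIFO open for reading.

    Assumption: a loaded service implies that the service manager has
    the FIFO open for reading, as supervise(1) does in daemontools.
    """
    path = os.path.join(supervise_dir, "ok")
    try:
        # A non-blocking write-only open of a FIFO fails with ENXIO
        # when no process has the FIFO open for reading.
        fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as error:
        if error.errno in (errno.ENXIO, errno.ENOENT):
            return False
        raise  # e.g. EACCES: permissions forbid the check
    os.close(fd)
    return True
```

Because the check is an ordinary open(2), it is subject to the same permissions and ACLs as any other access to the supervise directory.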

Bernstein's daemontools employs an 18-byte status file. daemontools has no notion of "starting", "failing", or "stopping" states for services, and its status file provides only simple binary "up" or "down" state information. Guenter's daemontools-encore employs a 19-byte status that includes extra state information for the aforementioned states. service-manager employs an 87-byte status that adds exit status and timestamp information for the start, run, restart, and stop programs. The status file contents are:

12-byte TAI64N timestamp of last service status change event.
4-byte current main process ID in host byte order, 0 meaning no process and -1 meaning process #0.
1-byte paused flag, set if the dæmon process has been paused (i.e. the process itself is in the stopped state).
1-byte pending command flag, which is an ASCII-encoded character:
d: down
u: up
o: once
O: at most once
1-byte daemontools-encore status (for details see later) which is a binary number:
0: stopped
1: starting
2: started
3: running
4: stopping
5: failed
68 bytes comprising 4 groups of status information for the last start, run, restart, and stop programs to terminate:
1. 1-byte code for termination status, which is a binary number:
  0: Not yet terminated; other fields should not be considered valid.
  1: Terminated normally with an exit code.
  2: Terminated by a signal.
  3: Terminated by a signal, and core dumped.
2. 4-byte exit code or signal number, in host byte order.
3. 12-byte TAI64N timestamp.
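
The layout above can be decoded with a short Python sketch such as the following. It rests on several assumptions that this page does not confirm: that the four groups are stored in the order start, run, restart, stop; that the process ID and the exit code or signal number are signed integers; and that the reader runs on a host with the same byte order as the writer. The names parse_status and parse_tai64n are hypothetical. TAI64N labels are big-endian by definition, and the sketch leaves them as TAI seconds rather than converting to UTC.

```python
import struct

TAI64_EPOCH = 1 << 62  # TAI64 label for the start of 1970 TAI
PROGRAMS = ("start", "run", "restart", "stop")  # assumed order of the groups

# "=" selects host byte order with no padding between fields.
HEADER = struct.Struct("=12siBcB")  # timestamp, pid, paused, pending, state
GROUP = struct.Struct("=Bi12s")     # termination code, value, timestamp


def parse_tai64n(raw):
    """Split a 12-byte TAI64N label into (TAI seconds since 1970, nanoseconds)."""
    seconds, nanoseconds = struct.unpack(">QL", raw)
    return seconds - TAI64_EPOCH, nanoseconds


def parse_status(data):
    """Decode an 87-byte status file image into a dictionary."""
    if len(data) != HEADER.size + len(PROGRAMS) * GROUP.size:
        raise ValueError("not an 87-byte service-manager status file")
    stamp, pid, paused, pending, state = HEADER.unpack_from(data, 0)
    status = {
        "changed": parse_tai64n(stamp),
        "pid": pid,  # 0 means no process; -1 means process #0
        "paused": bool(paused),
        "pending": pending.decode("ascii", "replace"),
        "state": state,
        "programs": {},
    }
    for index, name in enumerate(PROGRAMS):
        offset = HEADER.size + index * GROUP.size
        code, value, when = GROUP.unpack_from(data, offset)
        status["programs"][name] = {
            "termination": code,
            "value": value,
            # A code of 0 means that the other fields are not valid.
            "timestamp": parse_tai64n(when) if code else None,
        }
    return status
```

A caller would read the whole status file in binary mode and pass the bytes to parse_status.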

Other tools may use further files in a supervise directory. Again, these files are ignored by service-manager.

Directory locations

The service manager neither knows nor cares where in the filesystem these directories are. That is the province of the utilities that feed control requests to it. It is not necessary for supervise directories to be subdirectories of service directories.

It is not necessary for the relationship between service directories and supervise directories to be one-to-one. One service directory can be shared amongst multiple services, as long as they each have an individual supervise directory.

Moreover, it is not necessary for the relationship between services themselves to be exactly one "main" service feeding its output into one subordinate "log" service. The service manager permits arbitrary-length pipelines of services, as well as fan-in. (However, fan-in should be used sparingly as it generally causes more administrative headaches than it solves.)

Service states

If a service is not known to the service manager, it is in an unloaded state, and none of the information in the status file is valid. Otherwise, service states in that file follow the daemontools-encore paradigm:

stopped: No service process is executing.
starting: The service's start program is currently executing.
started: This state is not used.
running: The service's run program is currently executing, or no program is currently executing in a run-on-empty service that was running the run program immediately prior.
failed: The service's restart program is currently executing.
stopping: The service's stop program is currently executing.
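
The states, together with the numeric values that the status file uses for them, can be summarized in a small Python sketch. The names ServiceState, EXECUTING, and describe are hypothetical and exist only for illustration.

```python
from enum import IntEnum


class ServiceState(IntEnum):
    """Numeric values of the daemontools-encore status byte."""
    STOPPED = 0
    STARTING = 1
    STARTED = 2
    RUNNING = 3
    STOPPING = 4
    FAILED = 5


# Which program of the service directory is executing in each state.
EXECUTING = {
    ServiceState.STOPPED: None,
    ServiceState.STARTING: "start",
    ServiceState.STARTED: None,   # this state is not used
    ServiceState.RUNNING: "run",  # or nothing, in a run-on-empty service
    ServiceState.STOPPING: "stop",
    ServiceState.FAILED: "restart",
}


def describe(state_byte):
    """Map a status byte to (state name, executing program or None)."""
    state = ServiceState(state_byte)  # raises ValueError for unknown values
    return state.name.lower(), EXECUTING[state]
```

Note that failed does not mean that the service has been abandoned; it means that the restart program is deciding what happens next.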

Restart

Automatic restart is tailorable to individual services. If the restart program does not exist, or does not exit with a success (i.e. zero) status when run, the service run program is not restarted.

For the simplest cases restart can just be a (symbolic) link to /bin/true or /bin/false, to provide always-restart and never-restart services, respectively. (If using the nosh flavours of true(1) and false(1), do not use links to them. They will see themselves invoked under the name restart, which is unknown to them, and complain. Instead, write a short nosh(1) script.)

However, restart is invoked with information that represents the most recent exit status of the run program and that allows finer control over the restart decision, if desired. This information is passed as its three command line arguments.

The first is a code, one of exit, term, kill, abort, or crash. This categorizes how run exited. Everything apart from exit denotes being terminated by an uncaught signal. term denotes the "good" termination signals SIGTERM, SIGPIPE, SIGHUP, and SIGINT. kill denotes SIGKILL. abort denotes SIGABRT, SIGALRM, or SIGQUIT. And crash is everything else.
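
The grouping of signals into codes described above can be expressed as a Python sketch. This mirrors the description on this page and is not taken from the service-manager source; the name categorize is hypothetical.

```python
import signal

# Signal groups as described in the text above.
TERM_SIGNALS = {int(s) for s in (signal.SIGTERM, signal.SIGPIPE,
                                 signal.SIGHUP, signal.SIGINT)}
ABORT_SIGNALS = {int(s) for s in (signal.SIGABRT, signal.SIGALRM,
                                  signal.SIGQUIT)}


def categorize(signo):
    """Return the code for a run program terminated by signal signo."""
    if signo in TERM_SIGNALS:
        return "term"
    if signo == int(signal.SIGKILL):
        return "kill"
    if signo in ABORT_SIGNALS:
        return "abort"
    return "crash"  # everything else, e.g. SIGSEGV or SIGBUS
```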

The second is either (for exit) the decimal exit status of the process or (for everything else) a symbolic designation (falling back to a decimal code) of the specific signal, for use when the first argument is not specific enough to make a decision.

For convenience, the third is (for codes other than exit) always the decimal code of the specific signal.
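
As an illustration of a finer-grained restart decision, here is a Python sketch of one possible policy: restart after a non-zero exit, an abort, or a crash, but not after a clean exit, a "good" termination signal, or SIGKILL. The policy itself and the names should_restart and main are assumptions made for this example; they are not a recommendation of this page. Only the first two arguments are consulted.

```python
def should_restart(code, detail):
    """Example policy, assumed for illustration only.

    code is the first argument (exit, term, kill, abort, or crash) and
    detail is the second (decimal exit status, or signal designation).
    """
    if code == "exit":
        return int(detail) != 0  # restart only after a failure exit
    return code in ("abort", "crash")


def main(argv):
    """Return the exit status that a restart program would use.

    Zero requests a restart; anything else leaves the service down.
    """
    code, detail = argv[1], argv[2]
    return 0 if should_restart(code, detail) else 1
```

An executable restart file built from this sketch would finish by passing the result of main(sys.argv) to sys.exit, so that a zero exit status requests a restart.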

Author

Jonathan de Boyne Pollard