A horror story in the systemd house of horror

If you have two services, define two services.

The cast

Oracle consultant Tim Hall, after explaining how to write a shell script that starts two Oracle services and then wrap it up in a System 5 rc script, went on to explain how to replace System 5 rc with systemd. The relevant lines of his service unit were:

RemainAfterExit=yes
ExecStart=/home/oracle/scripts/startup.sh >> /home/oracle/scripts/startup_shutdown.log 2>&1 &
ExecStop=/home/oracle/scripts/shutdown.sh >> /home/oracle/scripts/startup_shutdown.log 2>&1

This has clearly never been tested.

The horror story

Anyone who had ever used this service definition would have wondered, just for starters, why the log file was never written to or even created. The problem is that a line of shell script has simply been transplanted into a systemd service unit without any thought or even the simplest of tests. ExecStart and ExecStop do not use a shell as an interpreter, and their values are not lines of shell script. The systemd manual itself says so explicitly:

Specifically, redirection using <, <<, >, and >>, pipes using |, running programs in the background using &, and other elements of shell syntax are not supported.

What actually happens is that the >>, the log file name, the 2>&1, and (in ExecStart) the trailing & are all passed as ordinary command-line arguments to the two shell scripts. Fortunately, both scripts entirely ignore their command-line arguments.
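If all that was wanted was the log file, systemd can do the redirection itself. What follows is a minimal sketch, not a fix for the deeper problems below: it reuses the script and log file paths from the quoted unit, drops the trailing & entirely, and assumes systemd version 240 or later, since the append: form of StandardOutput= is newer than this 2016 page.

```ini
[Service]
# No shell is involved: systemd runs each script directly, with no extra arguments.
ExecStart=/home/oracle/scripts/startup.sh
ExecStop=/home/oracle/scripts/shutdown.sh
# systemd opens the log file in append mode and attaches it as standard output,
# the equivalent of ">> /home/oracle/scripts/startup_shutdown.log".
# This applies to both the ExecStart= and the ExecStop= processes.
StandardOutput=append:/home/oracle/scripts/startup_shutdown.log
# "inherit" duplicates standard output onto standard error, the equivalent of 2>&1.
StandardError=inherit
```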

But that is just for starters. There is worse to come.

The scripts operate as follows:

They are not the actual dæmon program, which systemd is expecting them to be. But lsnrctl isn't the actual dæmon program either. It is a control program that either launches or stops the actual dæmon process. The actual dæmon process is a grandchild of the main process that systemd spawned.

(Gunther Pippèrr took this and managed to turn it into an even worse version where the actual dæmon process is the great-great-grandchild of the main process that systemd spawned. systemd spawns a shell that interprets a startdb.sh script, which forks a child process to execute a shell that interprets a startStop.sh script, which forks a child process to execute a shell that interprets a dbstart script, which forks a child process to run sqldba, which forks a child process to be the actual database dæmon.)

There are three models that a systemd-managed dæmon can follow, and this doesn't match any of them.

Add to that the fact that the scripts also run sqlplus, another control utility that is not the actual dæmon program either. This creates another grandchild: a second dæmon process, with a separate job to do.

This ridiculous edifice ends up running two main server processes as a single systemd service, and to systemd neither of them is the actual single main server process that it needs to monitor and to send signals to.

To systemd, it appears that the main process rather swiftly exits. That's an indication that the service is going back to the inactive state, and usually systemd then proceeds to clean up any stray child processes left around. But here, the stray child processes are the dæmons, two of them no less. systemd originally ended up killing them almost as soon as they were started.

Hence the bodge suggested by other people, who did test this and could not get it to work: marking the service RemainAfterExit=true so that the dæmons are not cleaned up. But this just leaves systemd with two services in one. It reports that the main service process has "exited", and it tracks neither the individual services' statuses nor their actual main dæmon processes.

If you have two services, have two service definitions. It really is that straightforward and obvious.

An example of this was created by an anonymous ServerWorld person. The two service definitions are:

lsnrctl.service

This runs lsnrctl start and lsnrctl stop.

oracledb.service

This runs dbstart and dbshut, and could quite easily be adapted to run sqlplus with the STARTUP and SHUTDOWN commands instead.
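A minimal sketch of what such a pair of units can look like follows. It is not a copy of the ServerWorld person's files: the Oracle home path is hypothetical, Type=forking is a choice made here because both lsnrctl start and dbstart leave background processes running and then exit, and none of it has been tested against a real Oracle installation.

```ini
# lsnrctl.service (hypothetical sketch)
[Unit]
Description=Oracle Net listener
After=network.target

[Service]
# lsnrctl start launches the listener dæmon in the background and then exits,
# so systemd is told to expect a forking start-up.
Type=forking
User=oracle
# Hypothetical Oracle home; adjust to the real installation.
Environment=ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1
ExecStart=/u01/app/oracle/product/12.1.0/dbhome_1/bin/lsnrctl start
ExecStop=/u01/app/oracle/product/12.1.0/dbhome_1/bin/lsnrctl stop

[Install]
WantedBy=multi-user.target
```

The database instances then get a unit of their own, ordered after the listener:

```ini
# oracledb.service (hypothetical sketch)
[Unit]
Description=Oracle database instances listed in /etc/oratab
Wants=lsnrctl.service
After=network.target lsnrctl.service

[Service]
Type=forking
User=oracle
Environment=ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1
# Assumption: with no ORACLE_HOME argument, dbstart and dbshut leave the
# listener alone, since that is lsnrctl.service's job.
ExecStart=/u01/app/oracle/product/12.1.0/dbhome_1/bin/dbstart
ExecStop=/u01/app/oracle/product/12.1.0/dbhome_1/bin/dbshut

[Install]
WantedBy=multi-user.target
```

Each service now has its own state, so systemctl status lsnrctl.service and systemctl status oracledb.service report on the listener and the database separately.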

Donghua Luo almost got this too, but resorted to using /bin/su - oracle all over the place instead of simply defining the service with User=oracle, as the anonymous ServerWorld person did. This is an abuse of su, which is a tool for adding privileges, not for dropping them.
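The difference is easiest to see side by side. In this hypothetical fragment, /path/to/start-database is a placeholder rather than a real Oracle program. The commented-out line is the su approach, in which systemd runs su with root privileges and su then switches to the oracle account; the active lines have systemd itself start the process as oracle in the first place.

```ini
# The su approach: su starts with root privileges and switches to oracle.
# ExecStart=/bin/su - oracle -c /path/to/start-database

[Service]
# systemd's own approach: the process is started as oracle from the outset.
User=oracle
ExecStart=/path/to/start-database
```

One practical consequence of switching: su - typically runs the oracle account's login profile, so any ORACLE_HOME or ORACLE_SID settings kept there have to be supplied in the unit instead, with Environment= or EnvironmentFile=.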

Of course, this is not perfect, merely better. Strictly speaking, lsnrctl is just unnecessarily duplicating work that systemd already does and can do directly. systemd will start and stop the dæmon. systemd will ensure that only one dæmon is ever started at any time. An intermediary special-purpose lsnrctl program, which forks one fixed child process and tries to track it so that it can kill it later, is unnecessary when one already has, and is already using, a general-purpose dæmon manager that can do this.
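A sketch of what cutting out the intermediary might look like, assuming that the listener dæmon is the tnslsnr program in the Oracle home's bin directory, that it takes the listener name as its argument, and that it stays in the foreground rather than forking when run directly. None of that comes from this page; the path and listener name are hypothetical, and the sketch is untested.

```ini
# tnslsnr.service (hypothetical sketch, untested)
[Unit]
Description=Oracle Net listener, run directly by systemd
After=network.target

[Service]
# Type=simple: the process that systemd spawns is itself the dæmon, so systemd
# knows its main PID without a PID file or any guessing.
Type=simple
User=oracle
Environment=ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1
ExecStart=/u01/app/oracle/product/12.1.0/dbhome_1/bin/tnslsnr LISTENER
# No ExecStop= is needed: "systemctl stop" sends SIGTERM (the default
# KillSignal=) to the service's processes.

[Install]
WantedBy=multi-user.target
```

Here systemd, rather than lsnrctl, is the program that remembers which process to signal, and a second systemctl start while the service is active does not launch a second listener.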


© Copyright 2016 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.