The Multics Scheduler

Bob Mullen

[Photo: Bob Mullen, Cambridge, 1974]

Introduction

The Multics design distinguishes between dispatching a CPU and scheduling processes, but the implementation is all done in a single module. Dispatching switches a CPU from one process to another; this should be as efficient as possible, so we precompute a list of eligible processes to switch to. Scheduling computes the order of this list. Jerry Saltzer's 1966 PhD thesis, Traffic Control in a Multiplexed Computing System, defined this distinction, and introduced the concepts of processes, block, and wakeup.

The implementation of the Multics scheduler is in the module pxss, Process Exchange Switch Stack. Its ALM source code is available online. [THVV]

Eligibility

In order to be dispatched, a process has to be in the eligible queue. Among other things, this means that the two essential pages needed for execution are present in memory. Scheduling policy mostly amounts to deciding which process to make eligible next, if indeed any at all, and how much CPU time the process can use before losing eligibility.

Processes in the eligible queue are typically in one of these states:

  1. running -- is now executing on some processor
  2. ready -- could run if a CPU was available
  3. waiting -- for some imminent event, an unlocking, a page-read

If a process calls block, for example to wait for TTY input or some other long-term event outside of the kernel's control, it is removed from the eligible queue, and not threaded in any queue. If a process's time slice ends it is threaded in some ready-but-not-eligible queue described below.

Ordinarily processes are added to the tail of the eligible queue. They tend to float toward the top as other processes lose eligibility. An I/O-bound process (one that uses little CPU and does many waits) tends to wind up at the top of the queue, which is a desirable situation: one wants the processes that consume CPU time to be at the bottom of the queue, to soak up CPU, so the CPU does not idle during I/O bursts.

Foreground-Background Scheduler

The first Multics scheduler, circa 1967, was descended from the one in use on CTSS. It was sometimes described as a "Greenberger-Corbató exponential scheduler." The scheduler had N queues, something like 6. When the scheduler wanted to choose a process to run, it took the one at the head of the highest-priority non-empty queue. A process that started requesting service would be put at the tail of the highest priority queue. Processes from that queue were given a small time slice, called the QUANTUM. If the timer went off and the process was still running, it was sent to the tail of the next lower queue, and when processes from that queue were run, they got a time slice twice as long (thus, exponentially longer time slices). The lowest priority queue was called the "background" queue. Thus, long-running jobs would not run until higher priority jobs were done; the scheduler would then run these background jobs for long blocks of time. A job that blocked on disk I/O was skipped over but didn't lose its position in the queue; a job that blocked on terminal I/O started over at the top queue when it received an input wakeup. There are more details, but that's the general idea. [THVV]
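A minimal sketch of the FB-n idea in Python (N queues, a base quantum that doubles at each lower level, demotion on quantum expiry), assuming illustrative names and constants rather than anything taken from the CTSS or Multics sources:

    from collections import deque

    N_LEVELS = 6
    QUANTUM = 0.125                  # base time slice in seconds (illustrative)

    queues = [deque() for _ in range(N_LEVELS)]   # queues[0] is highest priority

    def interactive_arrival(proc):
        """A process newly requesting service enters the tail of the top queue."""
        queues[0].append(proc)

    def pick_next():
        """Run the head of the highest-priority non-empty queue; the time slice
        doubles at each lower level (exponentially longer slices)."""
        for level, q in enumerate(queues):
            if q:
                return q.popleft(), level, QUANTUM * (2 ** level)
        return None, None, None      # nothing to run: idle

    def quantum_expired(proc, level):
        """A process that used up its slice drops to the tail of the next lower
        queue; the lowest queue is the "background" queue."""
        queues[min(level + 1, N_LEVELS - 1)].append(proc)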

Percentage Scheduler

In 1975, Bob Mullen implemented the Multics "Workclass" scheduler, originally done to provide different percentages of the machine to different groups of users. It was not that each user got a few percent, but that a group was to get some total percent. (Scheduling of time for users within the group was done by what is called an FB-n algorithm, meaning n levels of foreground and background priorities.)

This scheduler was overlaid on the original FB-n scheduler, in the sense that scheduling within a group was done by the FB-n scheduler. The workclass scheduler decided to make a process "eligible" based on its workclass being the most worthy; the ready queue for each workclass was sorted as an FB-n queue.

The administrative interface for the workclass scheduler was written by Tom Casey and the Scheduler/Dispatcher algorithms by Bob Mullen. The percentage allocations could be varied by shift. There could be 16 workclasses.

Objective

Allow a site administrator to assign each user to one of 16 groups, and to assign a percentage of the machine to each group. The percentages add up to 100 percent. Each Multics project is assigned to one of the groups, and so all user processes derive their work class assignment at login.

The minimal expectation is that if all groups are presenting an offered load which exceeds their allotment, the amount received will match what was specified. Other implicit expectations are assumed and drive parts of the implementation described below:

  1. Assume that when some groups are not presenting sufficient demand, other groups are allowed to use the time. In other words we never idle to constrain a group.

  2. The time unused is divided among groups able to use it, roughly in proportion to their original allotments.

  3. A key question has to do with the time scale at which the control takes place: how often, and more importantly the integration period on which the control decision is based. An extreme example: in a time-sharing system at 3 pm, a group should not be affected by how much time it used in the morning. To refine that, consider that time-sharing users expect a response time of about a second and use about a second --- and if not run within a second, they will perhaps understand the decision if it is based on time used by them or their group in the present second or the last few. They will not accept that they cannot run now because during an idle period 20 seconds ago their group got lucky and chewed up their share of the next several minutes.

    (If your group is allotted 40% of the machine and it has used 40% of the last second and is likely to use 40% of the next second, and yet you cannot run --- your gripe is with the admin/budget person that sets your group's percent, not with the algorithm.)

  4. Even if the scheduler does not make hard commitments of future time allotments, one should consider the decision to begin running a job from a certain group as increasing the spend rate of that group in the near future. In the case of a time sharing user being given a time slice, one does not know for sure whether the user will use the whole time slice or not. However there is the possibility the user will. The concrete application of this consideration can be stated simply as: If the machine is idle at times S & T which are arbitrarily close together (= if bandwidth becomes available at S & T etc.), and if there are users from several groups available to run and if one of the groups is most deserving of time at time S and gets it, then it is NOT true that at time T a user from the same group should be run. That is: one could easily imagine an algorithm which, based on the near identity of the system states at times S & T, made the same decision at both times --- it would be wrong. The fix is to immediately decrement the credits of the group whose member was selected at time S by some amount, for example the time slice awarded. Then if there is an immediately following allocation of run time, a member of the same group will only be selected if that group remained most worthy, after the subtraction.

    Obviously at some point the arbitrary subtraction should be added back in, and the amount of time actually used should be subtracted instead.

  5. One must be careful not to ruin the response time of the groups with small percentages. They are already at a disadvantage, due to elementary queueing theory results, if the percentage scheduler works close to spec. A group assigned 10% that offers a load of 15% will perceive poorer response than a group assigned 60% that offers a load of 65%. This is understandable without queueing theory. A group assigned 10% that offers a (random) load of 9% will see worse response than a group assigned 60% that offers a load of 54%; this follows from queueing theory, assuming service times are the same for users in the two groups (a rough worked comparison follows this list). What must be avoided are "lumps" in the algorithm that make things worse than they need to be. (Note the 9% group could buy 11% and greatly reduce its response time under heavy load.)
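A rough way to see the second comparison in point 5 is to treat each group as an M/M/1 queue served at its assigned fraction of one processor; the model and the numbers below are illustrative, not something the Multics scheduler computed:

    def mm1_response(capacity, offered_load, service_time=1.0):
        """Mean response time of an M/M/1 queue whose server runs at the given
        fraction of the machine; both arguments are fractions of the machine."""
        mu = capacity / service_time        # completion rate
        lam = offered_load / service_time   # arrival rate
        assert lam < mu, "offered load must stay below capacity"
        return 1.0 / (mu - lam)

    print(mm1_response(0.10, 0.09))   # 100.0: 10% group at 90% utilization
    print(mm1_response(0.60, 0.54))   # ~16.7: 60% group at the same utilization

Both groups run at 90 percent utilization, yet the smaller group waits about six times longer, which is why additional "lumps" in the algorithm must not make things worse for it.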

Idealized Algorithm

Suppose A is the time slice that can be awarded, or what should be roughly the same, the responsiveness desired from the system.

As time is used by anyone, add time credits to all groups in proportion to each group's assigned percentage. Thus in general, for every second of time used by anyone, a total of one second's worth of credits is distributed. (SCATTER CREDITS)

Decide which job to run on the basis of which group has the most credits at the moment.

As jobs run, subtract time credits from their group. This and SCATTER CREDITS essentially conserve the number of credits in the system.

SCATTER CREDITS, the run decision, and the subtraction of time used from the using group address the primary goal, the percentages specified by the admin.

When a job is started running, subtract some fixed amount A from its group's credits. When the job is done, add A back into its group's credits. These two operations, on average, conserve the number of credits in the system. Together they address point 4 above.

A group's credits can never exceed 4*A, nor go below 0. Thus SCATTER CREDITS says group.credits = min (group.credits + delta, 4*A). And as a job runs, group.credits = max (group.credits - used, 0). These two operations destroy and create credits, and limit the integration time of the system. If for some extended period only one group demands (and gets) service, the other groups will not build up more than 4*A credits, and the active group's credits will not go below zero. Thus if other groups suddenly become active, the first group will be back in the ball game fairly quickly, more quickly if it is a 60% group, less quickly if it is a 10% group. This addresses point 3 above, a short integration time (4*A); also point 2, since if some group is not using its credits, additional credits are passed out to demanding groups in the same ratio as originally specified.

One could argue that the maximum number of credits (the cap) should be proportional to the groups' percentages, rather than the same across all groups. There are two problems with that. First, it would mean the integration time, the memory of the system, could be longer for some groups than others, making behavior less predictable. Second, it would work to the advantage of the larger groups to allow them a larger buildup of credits. It is (relatively) better for the smaller groups not to do that --- this addresses point 5 above. Note this is only an issue if there is some measurable idle time. If there is no idle time, the lower-percentage group will have no advantage.
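Putting the pieces above together, here is a minimal Python sketch of the idealized algorithm, assuming a single time slice A and a credit cap of 4*A; the class and function names are illustrative, not the pxss implementation:

    A = 0.5                        # time slice / responsiveness target, seconds (illustrative)

    class Group:
        def __init__(self, name, percent):
            self.name = name
            self.percent = percent   # administrator-assigned share of the machine
            self.credits = 0.0

    def scatter_credits(groups, time_used):
        """SCATTER CREDITS: for every second used by anyone, distribute one
        second's worth of credits in proportion to assigned percentages,
        capped at 4*A to keep the integration time short."""
        for g in groups:
            g.credits = min(g.credits + time_used * g.percent / 100.0, 4 * A)

    def start_slice(groups):
        """Run a job from the most-credited group, and immediately charge it A
        so an immediately following decision is not made on identical state."""
        g = max(groups, key=lambda grp: grp.credits)
        g.credits = max(g.credits - A, 0.0)
        return g

    def end_slice(group, time_used):
        """Give back the provisional A and charge the time actually used instead."""
        group.credits = min(group.credits + A, 4 * A)
        group.credits = max(group.credits - time_used, 0.0)

Here scatter_credits and the charge for time actually used implement the administrator's percentages; the immediate charge of A in start_slice and its return in end_slice address point 4; and the cap at 4*A with the floor at zero keeps the integration time short, addressing points 2 and 3.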

Compromise for Responsiveness

It is a result of queueing theory that if you split the users equally across 4 queues, each assigned 25% of the CPU time, the response time for all users will be 4 times worse than if they shared a single queue running at 100% (under comparable heavy loads). This implies that schedulers dividing up CPU time --- or bandwidth of any kind --- cannot avoid drastically degrading response time. Because this phenomenon was visible during testing, an additional queue was created for ready non-eligible processes that had just interacted. This "interactive-queue" was examined before the workclass queues and allowed all users to receive their initial time slices in first-come-first-served order, regardless of recent usage by their workclass or the percentage assigned to it. Of course the credits for their workclass were decremented in the same manner as above. Because the first time slice was short, it was rare for a workclass to exceed its percentage due to this loophole. On the one hand this maintained good response time for simple commands; on the other, response to longer commands is not subject to stringent user expectations. Without the "interactive-queue" the rest of the percentage scheduler would have been doomed by elementary queueing theory. Even today it is worth asking of any new scheduling algorithm, "What does it do to average response time?"
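A minimal sketch of that compromise, assuming the Group objects and credit scheme sketched earlier; charge_group stands in for the credit decrement and, like the other names here, is illustrative:

    from collections import deque

    interactive_queue = deque()   # ready-but-not-eligible processes that just interacted

    def next_eligible(groups, ready_queues, charge_group):
        """ready_queues maps each group to its (FB-n-sorted) ready queue."""
        if interactive_queue:                       # examined before any workclass queue
            proc = interactive_queue.popleft()
            charge_group(proc.group)                # the workclass is still charged
            return proc
        candidates = [g for g in groups if ready_queues[g]]
        if not candidates:
            return None                             # nothing ready
        group = max(candidates, key=lambda g: g.credits)
        charge_group(group)
        return ready_queues[group].popleft()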

Deadline Scheduler

In 1975-1976, Bob Mullen added features to the scheduler to

  • provide more efficient support for daemons driving physical devices such as printers.
  • allow precise tuning of workloads for competitive benchmarks.

This version of the scheduler is called the "Deadline Scheduler."

The deadline scheduler used the workclass structures to implement a wholly different scheduler in which neither the FB-n nor the percentage parameters were used. Considering that most people do not understand the FB-n algorithm, the top level view was that the scheduler could be operated in either "percent" mode or "deadline" mode.

Using some workclasses with virtual deadlines along with a few with realtime deadlines was a convenient way to tune benchmarks with response time requirements which varied for different user scripts. However realtime workclasses can also be used in conjunction with percentage groups, and such a use was common at customer sites.

Virtual Deadlines

Here each workclass was given two pairs of numbers: R1 and R2 could be thought of as response times, and Q1 and Q2 were time quanta. When a process received an interactive wakeup, the scheduler read the clock and set

          process.deadline = clock() + workclass.R1

Then the process was sorted by deadline into a queue of ready (but not eligible) processes from all workclasses. The process at the head of this deadline queue would be the next process to be made eligible. There was no attempt to meet the deadline, nor to delay running until the deadline; the deadlines were just a way to sort the work to be done. Thus the deadlines were virtual, meaning not real.

When the process became eligible it was given a quantum of Q1, and placed at the end of the eligible queue. When its time slice expired it was resorted into the ready queue, after another clock reading:

          process.deadline = clock() + workclass.R2

When it was made eligible again, it would receive a quantum of Q2. The allocation of R2 responses and Q2 quanta continued until the next interactive wakeup.
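A minimal sketch of the virtual-deadline bookkeeping, assuming process objects carrying a workclass with fields R1, Q1, R2, Q2, and using a heap to keep the ready queue sorted by deadline; the names are illustrative, not pxss code:

    import heapq
    import time

    ready_queue = []                 # min-heap of (deadline, tie-breaker, process)

    def interactive_wakeup(proc):
        """On an interactive wakeup, sort the process in by clock() + R1;
        its next quantum will be Q1."""
        proc.deadline = time.monotonic() + proc.workclass.R1
        proc.next_quantum = proc.workclass.Q1
        heapq.heappush(ready_queue, (proc.deadline, id(proc), proc))

    def quantum_expired(proc):
        """After a time slice expires, re-sort by clock() + R2; later quanta
        are Q2, until the next interactive wakeup."""
        proc.deadline = time.monotonic() + proc.workclass.R2
        proc.next_quantum = proc.workclass.Q2
        heapq.heappush(ready_queue, (proc.deadline, id(proc), proc))

    def next_to_make_eligible():
        """The head of the deadline-sorted queue is made eligible next; the
        deadline only orders the work, nothing waits for it or promises to meet it."""
        deadline, _, proc = heapq.heappop(ready_queue)
        return proc, proc.next_quantum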

Realtime Deadlines

Realtime deadlines were not deadlines for completion of work, but for the start of work. If a workclass was identified as realtime its processes were sorted into a special "realtime" ready queue by deadline. Until the deadline arrived no new eligibility would be granted to a realtime process, unless there was nothing else to do. However when the deadline arrived, a realtime process was moved to the head of the eligible queue (ignoring any restriction on max_eligible, working set size, etc.). Subsequent deadlines and time slices were computed as for virtual deadlines, but the sorting and dispatching used the realtime ready queue.
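A minimal sketch of the realtime variant, reusing the heap idea above; before its start deadline a realtime process runs only if there is nothing else to do, and at the deadline it is forced to the head of the eligible queue. The names and the simple deque standing in for the eligible queue are illustrative:

    import heapq
    import time
    from collections import deque

    realtime_ready = []              # min-heap of (start_deadline, tie-breaker, process)
    eligible_queue = deque()         # illustrative stand-in for the real eligible queue

    def realtime_wakeup(proc):
        """Sort a realtime process by its start deadline, clock() + R1."""
        proc.deadline = time.monotonic() + proc.workclass.R1
        heapq.heappush(realtime_ready, (proc.deadline, id(proc), proc))

    def poll_realtime(nothing_else_to_do):
        """Promote a realtime process when its start deadline arrives, or earlier
        if the system would otherwise idle; the promotion ignores max_eligible,
        working-set checks, and similar limits."""
        if not realtime_ready:
            return
        deadline, _, proc = realtime_ready[0]
        if nothing_else_to_do or time.monotonic() >= deadline:
            heapq.heappop(realtime_ready)
            eligible_queue.appendleft(proc)   # head of the eligible queue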

Remarks on Realtime

In general it would be a mistake to have too many realtime processes. The assumption was that the administrator would not have too many, and that the amount of processor time committed to them (asymptotically at most a Q2/(R2+Q2) fraction of a processor, less due to IO waits...) would not be large. For example, to keep busy a double-buffered line printer that required 50 ms of CPU time to fill a buffer generating 2 seconds of printing, one would set:

               R1 = 1.9 sec        R2 = 5.0 sec
               Q1 =  .06 sec       Q2 =  .05 sec

One needs to know that the refresh-buffer wakeup from the printer driver to the printer process gave interactive credit to the process. Thus under ordinary operation the printer process would start running within 1.9 sec and complete its run before using .06 sec. For about 3% of a processor, the printer would never halt if there was work to do. If the printer went into a loop, or someone tried to use the printer process to do a compile, the process would receive additional time slices very slowly, at a rate of about 1% of a processor. Similar applications included RJE lines and backup processes.
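One way to check the figures above (an illustrative reading of the numbers in this example, not a calculation from the original text): Q1/(R1+Q1) bounds the normal-operation rate near 3%, and Q2/(R2+Q2) gives the roughly 1% drip a looping or misused process would receive.

    R1, Q1 = 1.9, 0.06
    R2, Q2 = 5.0, 0.05

    print(Q1 / (R1 + Q1))   # ~0.031: about 3% of a processor in normal double-buffered printing
    print(Q2 / (R2 + Q2))   # ~0.0099: about 1% of a processor once it only gets Q2 every R2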

The present author would use a realtime workclass for himself in the early stages of tuning a benchmark, in order to never be locked out of the system due to poor response time.


Written: 08/21/95
Last modified: 03/04/00
