bsdforums.org > News, Articles > New FreeBSD process scheduler available
admin (Administrator)
Registered: Jan 2002 | Posts: 1197

New FreeBSD process scheduler available

Luigi Rizzo has implemented a weight-based process scheduler for FreeBSD-stable, and is in the process of porting it to -current.

[Full announcement]

**************************************************

Date: Thu, 18 Jul 2002 16:31:57 -0700
From: Luigi Rizzo <rizzo@icir.org>
To: arch@FreeBSD.ORG
Subject: NEW SCHEDULER available (was Re: proposed changes to kern_switch.c and kern_synch.c)

Hi,

As promised, a first version of the Proportional Share scheduler
that we developed is available at

http://info.iet.unipi.it/~luigi/ps_sched.20020719a.diff

The patch is for a recent -STABLE (I think any version from 4.4 should
work; the only three files modified are kern_synch.c, kern_switch.c and
proc.h, plus a one-line change to kern_exit.c).

I have tested it a little on a diskless system, and it seems
to survive running a full X session with the usual set of xterm,
netscape, etc., while I renice processes and even switch
back and forth between schedulers. But do not trust this yet for a
production system!

The sysctl variable kern.scheduler controls the scheduler in use.

kern.scheduler=1 (default) selects the Proportional Share scheduler
kern.scheduler=0 selects the Feedback Priority ("classic BSD") scheduler

You can switch between the two schedulers by changing the value of
the sysctl variable.
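
For example, a userland tool could switch schedulers through the
standard sysctlbyname(3) interface. A minimal sketch (only the
variable name kern.scheduler comes from above; everything else is
illustrative):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int old, new = 0;	/* 0 = classic BSD, 1 = proportional share */
	size_t oldlen = sizeof(old);

	/* Read the old value and install the new one in a single call. */
	if (sysctlbyname("kern.scheduler", &old, &oldlen,
	    &new, sizeof(new)) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	printf("scheduler was %d, now %d\n", old, new);
	return (0);
}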

To change the "weight" associated with a process, use the "nice" and
"renice" commands. As usual, positive values (+1..+20) mean the
process gets a smaller share of the CPU; negative values (-1..-20)
give it a larger share. The actual weights (which control the
relative fraction of CPU cycles given to each process) are assigned
through a lookup table which maps the nice value into the range
1 ... 100 ... 1000 (min, normal, max weight).
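
To illustrate what such a table might look like (this is not the
table from the patch; only the endpoints 1, 100 and 1000 are given
above), the weights could be generated geometrically:

#include <math.h>

/*
 * Illustrative nice-to-weight mapping: a geometric curve through
 * the endpoints quoted above (nice -20 -> 1000, 0 -> 100, +20 -> 1).
 * The actual patch uses its own precomputed lookup table.
 */
static int
nice_to_weight(int niceval)
{
	if (niceval <= 0)	/* -20..0 maps to 1000..100 */
		return ((int)(100.0 * pow(10.0, -niceval / 20.0)));
	return ((int)(100.0 * pow(100.0, -niceval / 20.0)));	/* 100..1 */
}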

The "old" scheduler (Feedback Priority) should be as robust and
stable as always, whereas there are still a few things to cleanup
in the Proportional Share scheduler, namely:

* I have not looked in detail at the SMP case, so do not run
it on an SMP kernel;

* given that there are no priorities, processes woken up by a
tsleep() are not guaranteed to run before all processes
with p_priority >= PUSER; however, they are given a shorter
deadline, so they are likely to run first (see the sketch after
this list);

* RTPRI and IDLEPRI processes do not yet get any special treatment
(they all end up together with normal processes; this is trivial
to fix, I just haven't had time to look at it).
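
For the curious, the deadline mechanism mentioned in the second
point above works roughly as follows: each runnable process carries
a virtual finish time, advanced in inverse proportion to its weight,
and the scheduler always runs the process with the earliest deadline.
A hypothetical sketch (all names are invented for illustration and
are not taken from the patch):

/*
 * Hypothetical proportional-share bookkeeping. Charging a fixed
 * slice divided by the weight means a weight-1000 process advances
 * its deadline 1000x more slowly than a weight-1 process, and is
 * therefore selected correspondingly more often.
 */
struct ps_state {
	unsigned long	ps_deadline;	/* virtual finish time */
	int		ps_weight;	/* 1..1000, derived from nice */
};

#define	PS_SLICE	100000	/* service per quantum, arbitrary units */

static void
ps_charge(struct ps_state *ps)
{
	ps->ps_deadline += PS_SLICE / ps->ps_weight;
}

A process waking from tsleep() can then simply be stamped with a
deadline close to the current virtual time, which is what makes it
"likely to run first" without any notion of priority.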

Have fun, and if you use it please let me know of any bugs, along
with any suggestions for adapting it to -current.

cheers
luigi

On Tue, Jul 16, 2002 at 11:52:16PM -0700, Luigi Rizzo wrote:
> Hi,
> we have implemented a weight-based process scheduler
> for FreeBSD-stable, and are trying to port it to -current (in the
> description below replace "process" with "thread/kse" as appropriate
> for the -current case).
>
> In order to make this work, it is convenient to have all
> scheduler-specific functions and data structures in a
> single file (kern_switch*.c), and the generic support in
> another (kern_synch.c). I believe this was also the original
> BSD design for partitioning the code between the two files.
> However, in our current code, some scheduler-specific
> functions (see below) are misplaced.
>
> So I would like to make the following, purely cosmetic, change,
> identified as step #1 below, both for consistency with what I
> believe to be the original design and to simplify further work
> in this area. These changes would go into both -current and
> -stable; I repeat, they are only cosmetic.
>
> Comments? I already spoke to Julian, who has no objections.
>
> For those interested, a patch for -current is attached, and the one
> for -stable is very similar (for the record, in -stable we have a
> fully functional weight-based process scheduler, where you still
> control the weights using "nice", and you can switch between the
> old and the new scheduler at any time using a sysctl variable).
>
> --- Re. the multi-scheduler architecture ---
>
> The general idea is to make the process/thread/kse scheduler
> a replaceable piece of the kernel, requiring no modifications
> to "struct proc", and with the ability to switch from
> one scheduler to another at runtime (both for testing
> purposes and for whatever need may arise).
>
>
> The way to achieve this would be the following:
>
> 1. identify all scheduler-specific functions, put all of them
> in one file (e.g. kern_switch.c for the default scheduler),
> and define function pointers for all of them;
>
> 2. use one field in "struct proc" as a link to whatever
> scheduler-specific data structures are necessary.
> In -stable, we have the p_pad3 field that can be used
> for this purpose, much like the if_index field in struct ifnet.
>
> 3. implement a function which, under the control of a sysctl call,
> activates a new scheduler (by initialising all the function
> pointers to appropriate values) and moves all processes from
> the old scheduler to the new one; see the sketch below.
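>
> To make steps 1-3 concrete, the indirection could look roughly like
> the sketch below (struct and variable names are illustrative, not
> taken from the actual patch):
>
> struct sched_ops {
> 	void	(*so_roundrobin)(void *arg);
> 	void	(*so_resetpriority)(struct ksegrp *kg);
> 	void	(*so_schedcpu)(struct ksegrp *kg);
> };
>
> /* One table per scheduler; the kernel calls through cur_sched. */
> extern struct sched_ops bsd_sched_ops;	/* feedback priority */
> extern struct sched_ops ps_sched_ops;	/* proportional share */
> static struct sched_ops *cur_sched = &bsd_sched_ops;
>
> /* Step #3: the sysctl handler flips the table and migrates processes. */
> static void
> sched_switch(int new_sched)
> {
> 	cur_sched = new_sched ? &ps_sched_ops : &bsd_sched_ops;
> 	/* ... then move every process onto the new scheduler's queues ... */
> }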
>
> Step #1 is basically a cosmetic change, mostly requiring moving
> some functions from kern_synch.c to kern_switch.c. These are:
>
> roundrobin();
>
> curpriority_cmp();
>
> resetpriority();
>
> parts of schedcpu() and schedclock();
>
> cheers
> luigi
> ----------------------------------
>
> Index: kern_switch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_switch.c,v
> retrieving revision 1.33
> diff -u -r1.33 kern_switch.c
> --- kern_switch.c	14 Jul 2002 03:43:33 -0000	1.33
> +++ kern_switch.c	16 Jul 2002 22:15:06 -0000
> @@ -97,6 +97,7 @@
> #include <sys/mutex.h>
> #include <sys/proc.h>
> #include <sys/queue.h>
> +#include <sys/resourcevar.h>
> #include <machine/critical.h>
>
> CTASSERT((RQB_BPW * RQB_LEN) == RQ_NQS);
> @@ -107,6 +108,9 @@
> static struct runq runq;
> SYSINIT(runq, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, runq_init, &runq)
>
> +static struct callout roundrobin_callout;
> +
> +static void	roundrobin(void *arg);
> static void runq_readjust(struct runq *rq, struct kse *ke);
> /************************************************************************
>  * Functions that manipulate runnability from a thread perspective.	*
> @@ -442,9 +446,13 @@
> {
> int i;
>
> +callout_init(&roundrobin_callout, 0);
> +
> bzero(rq, sizeof *rq);
> for (i = 0; i < RQ_NQS; i++)
> TAILQ_INIT(&rq->rq_queues[i]);
> +
> +roundrobin(NULL);
> }
>
> /*
> @@ -719,3 +727,207 @@
> }
> #endif
>
> +/*
> + * -- formerly in kern_synch.c
> + */
> +
> +int curpriority_cmp(struct thread *td);
> +
> +int
> +curpriority_cmp(struct thread *td)
> +{
> +return (td->td_priority - curthread->td_priority);
> +}
> +
> +/*
> + * Force switch among equal priority processes every 100ms.
> + * We don't actually need to force a context switch of the current process.
> + * The act of firing the event triggers a context switch to softclock() and
> + * then switching back out again which is equivalent to a preemption, thus
> + * no further work is needed on the local CPU.
> + */
> +/* ARGSUSED */
> +static void
> +roundrobin(arg)
> +void *arg;
> +{
> +
> +#ifdef SMP
> +mtx_lock_spin(&sched_lock);
> +forward_roundrobin();
> +mtx_unlock_spin(&sched_lock);
> +#endif
> +
> +callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> +}
> +
> +/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
> +#define loadfactor(loadav) (2 * (loadav))
> +#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
> +extern fixpt_t ccpu;
> +#define CCPU_SHIFT	11	/* XXX dup from kern_synch.c! */
> +
> +/*
> + * Recompute process priorities, every hz ticks.
> + * MP-safe, called without the Giant mutex.
> + */
> +void schedcpu1(struct ksegrp *kg);
> +/* ARGSUSED */
> +void
> +schedcpu1(struct ksegrp *kg)
> +{
> +register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> +struct thread *td;
> +struct kse *ke;
> +int realstathz;
> +int awake;
> +
> + realstathz = stathz ? stathz : hz;
> +
> +awake = 0;
> +FOREACH_KSE_IN_GROUP(kg, ke) {
> +/*
> + * Increment time in/out of memory and sleep
> + * time (if sleeping). We ignore overflow;
> + * with 16-bit int's (remember them?)
> + * overflow takes 45 days.
> + */
> +/* XXXKSE **WRONG***/
> +/*
> + * the kse slptimes are not touched in wakeup
> + * because the thread may not HAVE a KSE
> + */
> +if ((ke->ke_state == KES_ONRUNQ) ||
> + ((ke->ke_state == KES_THREAD) &&
> + (ke->ke_thread->td_state == TDS_RUNNING))) {
> +ke->ke_slptime++;
> +} else {
> +ke->ke_slptime = 0;
> +awake = 1;
> +}
> +
> +/*
> + * pctcpu is only for ps?
> + * Do it per kse.. and add them up at the end?
> + * XXXKSE
> + */
> +ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> +/*
> + * If the kse has been idle the entire second,
> + * stop recalculating its priority until
> + * it wakes up.
> + */
> +if (ke->ke_slptime > 1) {
> +continue;
> +}
> +
> +#if	(FSHIFT >= CCPU_SHIFT)
> +ke->ke_pctcpu += (realstathz == 100) ?
> + ((fixpt_t) ke->ke_cpticks) <<
> + (FSHIFT - CCPU_SHIFT) :
> + 100 * (((fixpt_t) ke->ke_cpticks) <<
> + (FSHIFT - CCPU_SHIFT)) / realstathz;
> +#else
> +ke->ke_pctcpu += ((FSCALE - ccpu) *
> + (ke->ke_cpticks * FSCALE / realstathz)) >>
> + FSHIFT;
> +#endif
> +ke->ke_cpticks = 0;
> +} /* end of kse loop */
> +if (awake == 0) {
> +kg->kg_slptime++;
> +} else {
> +kg->kg_slptime = 0;
> +}
> +kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> +resetpriority(kg);
> +FOREACH_THREAD_IN_GROUP(kg, td) {
> +int changedqueue;
> +if (td->td_priority >= PUSER) {
> +/*
> + * Only change the priority
> + * of threads that are still at their
> + * user priority.
> + * XXXKSE This is problematic
> + * as we may need to re-order
> + * the threads on the KSEG list.
> + */
> +changedqueue =
> + ((td->td_priority / RQ_PPQ) !=
> + (kg->kg_user_pri / RQ_PPQ));
> +
> +td->td_priority = kg->kg_user_pri;
> +if (changedqueue &&
> + td->td_state == TDS_RUNQ) {
> +/* this could be optimised */
> +remrunqueue(td);
> +td->td_priority =
> + kg->kg_user_pri;
> +setrunqueue(td);
> +} else {
> +td->td_priority = kg->kg_user_pri;
> +}
> +}
> +}
> +}
> +
> +/*
> + * Compute the priority of a process when running in user mode.
> + * Arrange to reschedule if the resulting priority is better
> + * than that of the current process.
> + */
> +void
> +resetpriority(kg)
> +register struct ksegrp *kg;
> +{
> +register unsigned int newpriority;
> +struct thread *td;
> +
> +mtx_lock_spin(&sched_lock);
> +if (kg->kg_pri_class == PRI_TIMESHARE) {
> +newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> + NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> +newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> + PRI_MAX_TIMESHARE);
> +kg->kg_user_pri = newpriority;
> +}
> +FOREACH_THREAD_IN_GROUP(kg, td) {
> +maybe_resched(td);	/* XXXKSE silly */
> +}
> +mtx_unlock_spin(&sched_lock);
> +}
> +
> +#if 0
> +/*
> + * We adjust the priority of the current process. The priority of
> + * a process gets worse as it accumulates CPU time. The cpu usage
> + * estimator (p_estcpu) is increased here. resetpriority() will
> + * compute a different priority each time p_estcpu increases by
> + * INVERSE_ESTCPU_WEIGHT
> + * (until MAXPRI is reached). The cpu usage estimator ramps up
> + * quite quickly when the process is running (linearly), and decays
> + * away exponentially, at a rate which is proportionally slower when
> + * the system is busy. The basic principle is that the system will
> + * 90% forget that the process used a lot of CPU time in 5 * loadav
> + * seconds. This causes the system to favor processes which haven't
> + * run much recently, and to round-robin among other processes.
> + */
> +void
> +schedclock(td)
> +struct thread *td;
> +{
> +struct kse *ke;
> +struct ksegrp *kg;
> +
> +KASSERT((td != NULL), ("schedlock: null thread pointer"));
> +ke = td->td_kse;
> +kg = td->td_ksegrp;
> +ke->ke_cpticks++;
> +kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
> +if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
> +resetpriority(kg);
> +if (td->td_priority >= PUSER)
> +td->td_priority = kg->kg_user_pri;
> +}
> +}
> +#endif
> Index: kern_synch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.185
> diff -u -r1.185 kern_synch.c
> --- kern_synch.c	14 Jul 2002 03:43:33 -0000	1.185
> +++ kern_synch.c	16 Jul 2002 22:16:46 -0000
> @@ -67,6 +67,8 @@
>
> #include <machine/cpu.h>
>
> +void	schedcpu1(struct ksegrp *kg);	/* XXX in kern_switch.c */
> +
> static void sched_setup(void *dummy);
> SYSINIT(sched_setup, SI_SUB_KICK_SCHEDULER, SI_ORDER_FIRST, sched_setup, NULL)
>
> @@ -76,7 +78,6 @@
>
> static struct callout loadav_callout;
> static struct callout schedcpu_callout;
> -static struct callout roundrobin_callout;
>
> struct loadavg averunnable =
> { {0, 0, 0}, FSCALE };	/* load average, of runnable procs */
> @@ -92,7 +93,6 @@
>
> static void	endtsleep(void *);
> static void	loadav(void *arg);
> -static void	roundrobin(void *arg);
> static void	schedcpu(void *arg);
>
> static int
> @@ -135,28 +135,6 @@
> }
>
> /*
> - * Force switch among equal priority processes every 100ms.
> - * We don't actually need to force a context switch of the current process.
> - * The act of firing the event triggers a context switch to softclock() and
> - * then switching back out again which is equivalent to a preemption, thus
> - * no further work is needed on the local CPU.
> - */
> -/* ARGSUSED */
> -static void
> -roundrobin(arg)
> -void *arg;
> -{
> -
> -#ifdef SMP
> -mtx_lock_spin(&sched_lock);
> -forward_roundrobin();
> -mtx_unlock_spin(&sched_lock);
> -#endif
> -
> -callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> -}
> -
> -/*
> * Constants for digital decay and forget:
> *	90% of (p_estcpu) usage in 5 * loadav time
> *	95% of (p_pctcpu) usage in 60 seconds (load insensitive)
> @@ -225,7 +203,7 @@
> #define	decay_cpu(loadfac, cpu)	(((loadfac) * (cpu)) / ((loadfac) + FSCALE))
>
> /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
> -static fixpt_t	ccpu = 0.95122942450071400909 * FSCALE;	/* exp(-1/20) */
> +fixpt_t	ccpu = 0.95122942450071400909 * FSCALE;	/* exp(-1/20) */
> SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");
>
> /* kernel uses `FSCALE', userland (SHOULD) use kern.fscale */
> @@ -255,105 +233,15 @@
> schedcpu(arg)
> void *arg;
> {
> -register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> -struct thread *td;
> struct proc *p;
> -struct kse *ke;
> struct ksegrp *kg;
> -int realstathz;
> -int awake;
>
> -realstathz = stathz ? stathz : hz;
> sx_slock(&allproc_lock);
> FOREACH_PROC_IN_SYSTEM(p) {
> mtx_lock_spin(&sched_lock);
> p->p_swtime++;
> FOREACH_KSEGRP_IN_PROC(p, kg) {
> -awake = 0;
> -FOREACH_KSE_IN_GROUP(kg, ke) {
> -/*
> - * Increment time in/out of memory and sleep
> - * time (if sleeping). We ignore overflow;
> - * with 16-bit int's (remember them?)
> - * overflow takes 45 days.
> - */
> -/* XXXKSE **WRONG***/
> -/*
> - * the kse slptimes are not touched in wakeup
> - * because the thread may not HAVE a KSE
> - */
> -if ((ke->ke_state == KES_ONRUNQ) ||
> - ((ke->ke_state == KES_THREAD) &&
> - (ke->ke_thread->td_state == TDS_RUNNING))) {
> -ke->ke_slptime++;
> -} else {
> -ke->ke_slptime = 0;
> -awake = 1;
> -}
> -
> -/*
> - * pctcpu is only for ps?
> - * Do it per kse.. and add them up at the end?
> - * XXXKSE
> - */
> -ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> -/*
> - * If the kse has been idle the entire second,
> - * stop recalculating its priority until
> - * it wakes up.
> - */
> -if (ke->ke_slptime > 1) {
> -continue;
> -}
> -
> -#if	(FSHIFT >= CCPU_SHIFT)
> -ke->ke_pctcpu += (realstathz == 100) ?
> - ((fixpt_t) ke->ke_cpticks) <<
> - (FSHIFT - CCPU_SHIFT) :
> - 100 * (((fixpt_t) ke->ke_cpticks) <<
> - (FSHIFT - CCPU_SHIFT)) / realstathz;
> -#else
> -ke->ke_pctcpu += ((FSCALE - ccpu) *
> - (ke->ke_cpticks * FSCALE / realstathz)) >>
> - FSHIFT;
> -#endif
> -ke->ke_cpticks = 0;
> -} /* end of kse loop */
> -if (awake == 0) {
> -kg->kg_slptime++;
> -} else {
> -kg->kg_slptime = 0;
> -}
> -kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> - resetpriority(kg);
> -FOREACH_THREAD_IN_GROUP(kg, td) {
> -int changedqueue;
> -if (td->td_priority >= PUSER) {
> -/*
> - * Only change the priority
> - * of threads that are still at their
> - * user priority.
> - * XXXKSE This is problematic
> - * as we may need to re-order
> - * the threads on the KSEG list.
> - */
> -changedqueue =
> - ((td->td_priority / RQ_PPQ) !=
> - (kg->kg_user_pri / RQ_PPQ));
> -
> -td->td_priority = kg->kg_user_pri;
> -if (changedqueue &&
> - td->td_state == TDS_RUNQ) {
> -/* this could be optimised */
> -remrunqueue(td);
> -td->td_priority =
> - kg->kg_user_pri;
> -setrunqueue(td);
> -} else {
> -td->td_priority = kg->kg_user_pri;
> -}
> -}
> -}
> +schedcpu1(kg);
> } /* end of ksegrp loop */
> mtx_unlock_spin(&sched_lock);
> } /* end of process loop */
> @@ -944,32 +832,6 @@
> }
>
> /*
> - * Compute the priority of a process when running in user mode.
> - * Arrange to reschedule if the resulting priority is better
> - * than that of the current process.
> - */
> -void
> -resetpriority(kg)
> -register struct ksegrp *kg;
> -{
> -register unsigned int newpriority;
> -struct thread *td;
> -
> -mtx_lock_spin(&sched_lock);
> -if (kg->kg_pri_class == PRI_TIMESHARE) {
> -newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> - NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> -newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> - PRI_MAX_TIMESHARE);
> -kg->kg_user_pri = newpriority;
> -}
> -FOREACH_THREAD_IN_GROUP(kg, td) {
> -maybe_resched(td);	/* XXXKSE silly */
> -}
> -mtx_unlock_spin(&sched_lock);
> -}
> -
> -/*
> * Compute a tenex style load average of a quantity on
> * 1, 5 and 15 minute intervals.
> * XXXKSE Needs complete rewrite when correct info is available.
> @@ -1022,11 +884,9 @@
> {
>
> callout_init(&schedcpu_callout, 1);
> -callout_init(&roundrobin_callout, 0);
> callout_init(&loadav_callout, 0);
>
> /* Kick off timeout driven events by calling first time. */
> -roundrobin(NULL);
> schedcpu(NULL);
> loadav(NULL);
> }
>
> ----- End forwarded message -----
>

Posted 07-19-2002 04:08 PM