Under PBS, jobs that run code that uses software licenses declare this fact by means of the 'software' argument in the job submission, e.g.
    #!/bin/sh
    #PBS -lsoftware=foo:bar
    foo foofile.in foo.out
    bar foo.out bar.out

We use this declaration to keep track of the license use by jobs under PBS.
There are several reasons why this modelling mechanism is desirable, but the most compelling relates to the way jobs use licenses. Since a job might not acquire its licenses as soon as it starts, nor hand them back as soon as it finishes, we can't rely on the information returned by license server queries when making scheduling decisions.
The basic idea is that you have one lsd daemon running on one machine. Each PBS scheduler makes calls to the lsd API functions (see @INSTALLDIR@/include/lsdapi.h) to communicate with the single lsd instance.
At the start of a scheduling cycle, the scheduler sends lsd a snapshot of the running and suspended jobs on its system. During the cycle it selects various candidate jobs to run from the queues. For each candidate that has a software argument, it uses the lsdReserve() API call to see whether it is allowed to start that job.
lsd keeps track of what all the schedulers have told it, and builds a model of the license economy (based on the jobs' software arguments). Thus the model is made up of "shadow licenses" that (hopefully) model the eventual license use of all the jobs. When an attempt is made to reserve a shadow license that would exceed the pool of available licenses, the reservation is rejected and the scheduler uses this information to avoid starting the job in question.
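The shadow license bookkeeping can be pictured with a small Python sketch. This is not lsd's actual code: the class name, method names, scheduler names and license totals are all hypothetical, and the real daemon derives per-feature counts through its plugins rather than taking them directly.

```python
from collections import defaultdict


class ShadowLicensePool:
    """Toy model: per-feature totals plus the shadow licenses held by jobs."""

    def __init__(self, totals):
        # feature name -> licenses in the pool (hypothetical numbers)
        self.totals = dict(totals)
        # scheduler name -> {job id -> {feature -> count}}
        self.jobs_by_scheduler = defaultdict(dict)

    def send_snapshot(self, scheduler, jobs):
        """Replace everything previously known about one scheduler's jobs."""
        self.jobs_by_scheduler[scheduler] = {
            job_id: dict(features) for job_id, features in jobs.items()
        }

    def in_use(self, feature):
        """Sum the shadow licenses held for a feature across all schedulers."""
        return sum(
            features.get(feature, 0)
            for jobs in self.jobs_by_scheduler.values()
            for features in jobs.values()
        )

    def reserve(self, scheduler, job_id, features):
        """Grant the reservation only if no feature would exceed its pool."""
        for feature, count in features.items():
            if self.in_use(feature) + count > self.totals.get(feature, 0):
                return False
        self.jobs_by_scheduler[scheduler][job_id] = dict(features)
        return True


pool = ShadowLicensePool({"matlab": 4})
pool.send_snapshot("cluster_a", {"758.xx0": {"matlab": 3}})
first = pool.reserve("cluster_b", "212.xx0", {"matlab": 1})
second = pool.reserve("cluster_b", "213.xx0", {"matlab": 1})
print(first, second)
```

In this sketch the first reservation fits in the pool and the second is rejected, so the second scheduler would leave that job queued.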
lsd also kills (or alerts you to kill) rogue jobs. These are jobs that are either using licenses but haven't declared their intent to do so in the software request, or are using more licenses than they should. This mechanism is not perfect, due to the limited information available to lsd from the license server queries, but it is conservative so that a job will only be killed if it can be shown that no other job or process could have been the offender.
Rogue jobs are killed by running lsdkilljob, a shell script that you can tailor to your needs. This script could just email you the job details so you can kill the job yourself, or, if lsd is running as a user who can manage jobs on all the machines, it can qdel the jobs itself (and mail the user too).
    abaqus:matlab
    abaqus/20:matlab
    abaqus/abaqus=20/standard=16:matlab
The softwareName/count form indicates that the job will use count licenses of each feature within the software. The softwareName/featureName=count form allows you to specify that the job will use a different number of licenses for each feature within a software package. If you omit a feature name in this form, lsd will use the feature count computed from the number of CPUs that the job will run on.
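To make the three forms concrete, here is a minimal Python sketch of a parser for the software argument. It is an illustration only, not lsd's own parser: the function name and the returned dictionary layout are hypothetical, and it assumes packages are separated by ':' and the parts within a package by '/'.

```python
def parse_software(spec):
    """Split a software argument into one request per package."""
    requests = []
    for package in spec.split(":"):
        name, *parts = package.split("/")
        # count=None and an empty features dict both mean "not specified",
        # in which case lsd would compute the count from the job's CPUs.
        request = {"name": name, "count": None, "features": {}}
        for part in parts:
            if "=" in part:
                feature, count = part.split("=", 1)
                request["features"][feature] = int(count)
            else:
                request["count"] = int(part)
        requests.append(request)
    return requests


plain = parse_software("abaqus:matlab")
counted = parse_software("abaqus/20:matlab")
per_feature = parse_software("abaqus/abaqus=20/standard=16:matlab")
print(per_feature[0])
```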
Set LSDAPIRWTIMEOUT to specify the read and write timeout (in seconds) for communication with the daemon.
Set LSDVERBOSE to any string of characters from the set 'pfc' to control logging verbosity in the daemon.
Set LSDKILLJOB on the server side to the absolute pathname of an executable to override the default lsdkilljob script.
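As a rough illustration of the constraints on these variables, the Python sketch below reads and validates all three in one place. This is an assumption-laden sketch: the function name, the dictionary keys and the choice to raise ValueError on bad input are hypothetical, and in practice the timeout is read on the API side while the other two are read by the daemon.

```python
import os


def read_lsd_environment(environ=None):
    """Collect the three variables into a dict, validating each one."""
    environ = os.environ if environ is None else environ
    settings = {"rw_timeout": None, "verbose": "", "killjob": None}

    timeout = environ.get("LSDAPIRWTIMEOUT")
    if timeout is not None:
        settings["rw_timeout"] = float(timeout)  # seconds

    # Only characters from the set 'pfc' are meaningful.
    verbose = environ.get("LSDVERBOSE", "")
    unknown = set(verbose) - set("pfc")
    if unknown:
        raise ValueError("LSDVERBOSE has unknown flags: " + "".join(sorted(unknown)))
    settings["verbose"] = verbose

    # The override must be an absolute pathname.
    killjob = environ.get("LSDKILLJOB")
    if killjob is not None:
        if not os.path.isabs(killjob):
            raise ValueError("LSDKILLJOB must be an absolute pathname")
        settings["killjob"] = killjob
    return settings


example = read_lsd_environment({
    "LSDAPIRWTIMEOUT": "30",
    "LSDVERBOSE": "pc",
    "LSDKILLJOB": "/usr/local/bin/my-lsdkilljob",  # hypothetical path
})
print(example)
```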
A plugin performs two functions:
1. Emulates the license consumption pattern of the package. Often this is something simple like 'one license per process', but it can be much more complicated than that.
2. Interprets the output from license server queries to determine current license availability and use.
Each plugin module must have a global-scope function called createInstances() that returns either an instance of BasePlugin or a sequence of instances of BasePlugin (instances of children of BasePlugin are also instances of BasePlugin). The sequence can be anything that supports iteration, such as a tuple, list, or (most usefully) a generator. BasePlugin can be found in @INSTALLDIR@/bin/baseplugin.py.
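A minimal sketch of the createInstances() contract follows. The BasePlugin class here is a stand-in defined only so the example runs on its own; the real class lives in baseplugin.py and its constructor arguments may differ. ExamplePlugin, the softwareName attribute and load_instances() are likewise hypothetical.

```python
class BasePlugin:
    """Stand-in for the real class in baseplugin.py, for illustration only."""

    def __init__(self, softwareName):
        self.softwareName = softwareName


class ExamplePlugin(BasePlugin):
    """Hypothetical plugin; a real one would override the plugin methods."""


def createInstances():
    """Generator form: yield one plugin instance per package handled here."""
    for name in ("foo", "bar"):
        yield ExamplePlugin(name)


def load_instances(result):
    """Accept a single instance or any iterable of instances, as a caller might."""
    if isinstance(result, BasePlugin):
        return [result]
    return list(result)


instances = load_instances(createInstances())
print([plugin.softwareName for plugin in instances])
```

A generator is convenient here because one module can serve several packages that differ only in a name or a license server address.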
Writing a new plugin usually involves subclassing BaseFlexlmPlugin to suit the appropriate license manager (even plugins for non-flexlm controlled software are best done by subclassing BaseFlexlmPlugin). For examples of handling the more perverse license consumption behaviours, have a look at flexlmfluent.py and flexlmcfx.py (both in @INSTALLDIR@/plugins/).
See the comments in @INSTALLDIR@/bin/baseplugin.py and @INSTALLDIR@/plugins/baseflexlmplugin.py for more details.
2. cd pbs_lsd/plugins
3. Read comments in ../bin/baseplugin.py, baseflexlmplugin.py and see *.py for more examples.
4. Edit or create your plugin. Note that the methods of a plugin must never block, and should all execute as quickly as possible. In general, the plugin methods should only do things like extract info from the string returned by the license server query, and do some simple arithmetic or tests. Also bear in mind that the plugin methods may be called much more or much less frequently than you might think. In other words, avoid side effects, and don't fork shells or anything like that.
When lsd is considering a reservation request for a job, it considers each software package mentioned in the job's software request in turn. For each software package it first calls the getRunnability() method of the software's plugin. If that method returns a value indicating the job is potentially runnable (e.g. some software can never run on more than a fixed number of cpus), it then calls the plugin's getFeatureConsumption() method to calculate the number of software features that will be consumed by the job when it runs.
If there are enough shadow licenses to satisfy all of the software features requested by the job, the shadow license counts are incremented and the job scheduler is told that this job can be run. If not, the job scheduler is told that this job can't be run at the moment.
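The reservation steps just described can be sketched in Python as follows. The method names getRunnability() and getFeatureConsumption() come from the text, but their argument lists and return types are assumptions here, as are the OnePerCpuPlugin class, the consider_reservation() function and all the license numbers.

```python
class OnePerCpuPlugin:
    """Stand-in plugin: one license of one feature per CPU, with a CPU cap."""

    def __init__(self, feature, max_cpus):
        self.feature = feature
        self.max_cpus = max_cpus

    def getRunnability(self, ncpus):
        # Assumed convention: True means "potentially runnable".
        return ncpus <= self.max_cpus

    def getFeatureConsumption(self, ncpus):
        # Assumed return shape: feature name -> licenses consumed.
        return {self.feature: ncpus}


def consider_reservation(packages, ncpus, plugins, totals, used):
    """Increment `used` and return True only if every package can be satisfied."""
    needed = {}
    for package in packages:
        plugin = plugins[package]
        if not plugin.getRunnability(ncpus):
            return False
        for feature, count in plugin.getFeatureConsumption(ncpus).items():
            needed[feature] = needed.get(feature, 0) + count
    for feature, count in needed.items():
        if used.get(feature, 0) + count > totals.get(feature, 0):
            return False
    # Commit only after every feature has passed the check.
    for feature, count in needed.items():
        used[feature] = used.get(feature, 0) + count
    return True


plugins = {
    "abaqus": OnePerCpuPlugin("standard", 16),
    "matlab": OnePerCpuPlugin("MATLAB", 4),
}
totals = {"standard": 20, "MATLAB": 2}
used = {}
ok_first = consider_reservation(["abaqus", "matlab"], 2, plugins, totals, used)
ok_second = consider_reservation(["abaqus", "matlab"], 2, plugins, totals, used)
ok_too_wide = consider_reservation(["abaqus"], 32, plugins, totals, used)
print(ok_first, ok_second, ok_too_wide, used)
```

Checking every feature before incrementing any count means a rejected job leaves the shadow counts untouched, which is why the second request above does not consume any 'standard' licenses.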
All plugin classes define a queryCommandString member. The command defined here may block or be slow to execute, as it is run asynchronously. It may be run less often than you might think, due to caching of its output. Plugins that share exactly the same string value for queryCommandString will usually share the output from a single common spawn of that command.
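One way to picture the sharing of query output is a cache keyed on the exact command string, sketched below in Python. This is a simplification and not lsd's implementation: it runs the command synchronously rather than asynchronously, and the QueryCache class, its max_age parameter and the example command string are all hypothetical.

```python
import subprocess
import time


class QueryCache:
    """Cache query output by command string so identical commands share a run."""

    def __init__(self, max_age, runner=None):
        self.max_age = max_age  # seconds before cached output is considered stale
        self.runner = runner or self._run
        self._cache = {}  # command string -> (timestamp, output)

    @staticmethod
    def _run(command):
        # Spawn the command through the shell and capture its stdout.
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        return result.stdout

    def output_for(self, command, now=None):
        now = time.monotonic() if now is None else now
        cached = self._cache.get(command)
        if cached is not None and now - cached[0] < self.max_age:
            return cached[1]
        output = self.runner(command)
        self._cache[command] = (now, output)
        return output


calls = []


def fake_runner(command):
    """Test double that records each spawn instead of running anything."""
    calls.append(command)
    return "output of " + command


cache = QueryCache(max_age=30, runner=fake_runner)
command = "lmstat -a -c 1700@licensehost"  # hypothetical query string
first = cache.output_for(command, now=0)    # spawns the command
second = cache.output_for(command, now=10)  # served from the cache
third = cache.output_for(command, now=45)   # stale, so spawned again
print(len(calls))
```

Two plugin instances whose queryCommandString values differ by even one character would get separate cache entries in this sketch, which matches the "exactly the same string value" condition above.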
5. Test it.
    cd wherever/pbs_lsd
    gmake tests

That should run the basic tests that ensure that nothing crashes and the plugin methods are reasonably sane. Check the output for errors and warnings. To test the business logic of the plugin, you'll have to exercise it manually. In one window start...
    LSDCONFIG=lsdtest.conf python bin/lsd.py

In another window run...
    LSDCONFIG=lsdtest.conf src/test

Use the s, r and c commands to exercise your changes. Before testing reservations you will first need to use s to send a snapshot. You can test reservations of software with 'r software ncpus'. See src/test.c for more details. The stdout from lsd.py should show the test client's commands being run, as well as debugging output from the plugins. To monitor the internal state of lsd, use your web browser, but bear in mind it will be talking on a different port due to the test config. e.g.
6. Once it looks OK, su root and
    @INSTALLDIR@/bin/lsdinit stop
    gmake install-plugins
    @INSTALLDIR@/bin/lsdinit start
7. As yourself (not root), cvs commit -m 'changes to XXX plugin'
    gmake tests

To exercise all the plugins.
If you modify lsd, or need to test lsd's behaviour, you can enable the dummy plugin by running
    LSDCONFIG=./lsdtest.conf LSDDUMMYQUERY=`pwd`/tests/lsd-dummy-query bin/lsdinit start

or

    LSDCONFIG=./lsdtest.conf LSDDUMMYQUERY=`pwd`/tests/lsd-dummy-query python bin/lsd.py

With the dummy plugin enabled you can use the test client (src/test) to send lsd dummy jobs and see how it responds. e.g.
    [djh900@sc0 pbs_lsd]$ LSDCONFIG=lsdtest.conf src/test
    Commands are:
      s - send snapshot
      r feature ncpus - reserve ncpus of feature
      c - cancel previous reservation (prompts for jobid)
      q -quit
    s, r feature ncpus, c, q: s
    Enter EXECHOSTSTRING SOFTWARE NCPUS. Blank line to end.
    h1 dummy:matlab 3
    job.id=758.xx0
    h2 dummy:matlab 1
    job.id=113.xx0

    s, r feature ncpus, c, q: r dummy 2
    job.id=212.xx0
    Exit code 3: Reservation successful..