Merge branch 'develop' into main
aoymt committed Jan 9, 2024
2 parents ccf906e + 6bc0d7c commit fa2ab48
Showing 8 changed files with 274 additions and 24 deletions.
1 change: 1 addition & 0 deletions .github/workflows/deploy_docs.yml
@@ -32,6 +32,7 @@ jobs:
run: |
sudo apt-get update
sudo apt-get install -y texlive-lang-japanese texlive-lang-cjk texlive-fonts-recommended texlive-fonts-extra latexmk
sudo apt-get install -y graphviz
- name: Set up Python
uses: actions/setup-python@v4
2 changes: 1 addition & 1 deletion docs/en/source/conf.py
@@ -36,7 +36,7 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.mathjax']
extensions = ['sphinx.ext.mathjax', 'sphinx.ext.graphviz']
mathjax_pass = ['https://cdnjs.com/']
math_number_all = True
numfig = True
4 changes: 2 additions & 2 deletions docs/en/source/moller/about/index.rst
@@ -35,10 +35,10 @@ This software was developed by the following contributors.

- Masahiro Fukuda (The Institute for Solid State Physics, The University of Tokyo)

- Tetsuya Fukushima (The National Institute of Advanced Industrial Science and Technology (AIST))

- Kota Ido (The Institute for Solid State Physics, The University of Tokyo)

- Tetsuya Fukushima (The National Institute of Advanced Industrial Science and Technology (AIST))

- Shusuke Kasamatsu (Yamagata University)

- Takashi Koretsune (Tohoku University)
237 changes: 235 additions & 2 deletions docs/en/source/moller/appendix/index.rst
@@ -2,7 +2,240 @@
Extension guide
================================================================

How to extend ``moller`` for other systems
N.B. The content of this section may vary depending on the version of *moller*.


Bulk job execution by *moller*
----------------------------------------------------------------

A bulk job execution means that a set of small tasks are executed in parallel within a single batch job submitted to a large batch queue. It is schematically shown as follows, in which N tasks are launched as background processes and executed in parallel, and a ``wait`` statement is invoked to wait for all tasks to be completed.

.. code-block:: bash

   task param_1 &
   task param_2 &
   ...
   task param_N &
   wait

To manage the bulk job, the nodes and cores allocated to the batch job must be distributed over the tasks param_1 ... param_N so that the tasks run on distinct nodes and cores. It is also necessary to arrange task execution so that at most N tasks run simultaneously, according to the allocated resources.

Hereafter a job script generated by *moller* will be denoted as a moller script.
In a moller script, the concurrent execution and control of tasks are managed by GNU parallel [1]. It takes a list holding the items param_1 ... param_N and runs commands for each item in parallel. An example is given as follows, where list.dat contains param_1 ... param_N, one per line.

.. code-block:: bash

   cat list.dat | parallel -j N task

The number of concurrent tasks is determined at runtime from the number of nodes and cores obtained from the execution environment and from the degree of parallelism of the task (the numbers of nodes, processes, and threads specified by the node parameter).
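
As an illustration of how the degree of parallelism translates into a task count, the following Python sketch performs the same kind of arithmetic; it is illustrative only (the actual bookkeeping is implemented inside the generated shell script), and the function name, parameter names, and packing policy are assumptions.

.. code-block:: python

   # Illustration only: the real bookkeeping is done inside the generated
   # shell script. All names and the packing policy here are hypothetical.
   def max_concurrent_tasks(alloc_nodes: int, cores_per_node: int,
                            task_nodes: int, task_procs: int,
                            task_threads: int) -> int:
       """How many tasks fit into the allocation at the same time."""
       if task_nodes > 1:
           # A multi-node task is assumed to occupy whole nodes.
           return alloc_nodes // task_nodes
       # Single-node tasks are packed onto nodes by cores.
       cores_per_task = task_procs * task_threads
       return alloc_nodes * (cores_per_node // cores_per_task)

   # Example: 4 nodes x 128 cores; each task uses 1 node, 8 processes, 4 threads.
   print(max_concurrent_tasks(4, 128, 1, 8, 4))  # -> 16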

The way tasks are assigned to nodes and cores varies according to the job scheduler.
For SLURM job scheduler variants, the concurrent calls of the ``srun`` command within the batch job are appropriately assigned to the nodes and cores by exploiting the options for exclusive resource usage. The exact options may depend on the platform.

On the other hand, for PBS job scheduler variants that do not have such features, the distribution of nodes and cores to tasks has to be handled within the moller script. The nodes and cores allocated to a batch job are divided into *slots*, and the slots are assigned to the concurrent tasks. The division is determined from the allocated nodes and cores and the degree of parallelism of the task, and is kept in the form of table variables. Within a task, the programs are executed on the assigned hosts and cores (optionally pinned) through the options to mpirun (or mpiexec) and the environment variables. This feature depends on the MPI implementation.
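
As a sketch of the slot bookkeeping described above (Python is used purely for illustration; the moller script implements it with shell table variables, and the node names and mpirun options below are assumptions that depend on the MPI implementation):

.. code-block:: python

   # Illustration only: the moller script keeps this bookkeeping in shell
   # table variables. Node names and the mpirun options are hypothetical.
   def build_slots(nodes, cores_per_node, procs_per_task, threads_per_task):
       """Divide the allocated nodes and cores into per-task slots."""
       cores_per_task = procs_per_task * threads_per_task
       slots = []
       for node in nodes:
           for start in range(0, cores_per_node, cores_per_task):
               slots.append((node, list(range(start, start + cores_per_task))))
       return slots

   slots = build_slots(["node01", "node02"], 8, 2, 2)
   print(slots[0])  # -> ('node01', [0, 1, 2, 3])
   # A task assigned slot 0 might then be launched, depending on the MPI
   # implementation, roughly as:
   #   mpirun -np 2 -host node01 --cpu-set 0,1,2,3 ./a.out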

**Reference**

[1] `O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47. <https://www.usenix.org/publications/login/february-2011-volume-36-number-1/gnu-parallel-command-line-power-tool>`_

How *moller* works
----------------------------------------------------------------

(TBA)
Structure of moller script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*moller* reads the input YAML file and generates a job script for bulk execution. The structure of the generated script is described as follows.

#. Header

This part contains the options to the job scheduler; the content of the platform section is formatted according to the type of job scheduler, and is therefore platform-dependent.

#. Prologue

This part corresponds to the prologue section of the input file. The content of the ``code`` block is written as-is.

#. Function definitions

This part contains the definitions of functions and variables used within the moller script; it is platform-dependent. The functions are described in the next section.

#. Processing command-line options

The SLURM variants accept additional arguments to the job submission command (sbatch) that are passed to the job script as command-line options. The name of the list file and/or options such as the retry feature can be processed in this way.

For the PBS variants, these command-line arguments are ignored; therefore the name of the list file is fixed to ``list.dat`` by default, and the retry feature may be enabled by editing the script to set ``retry`` to 1.

#. Description of tasks

This part contains the description of tasks specified in the jobs section of the input file. When more than one task is given, the following procedure is applied to each task.

When ``parallel = false``, the content of the ``run`` block is written as-is.

When ``parallel = true`` (default), a function named ``task_{task name}`` is created that contains the pre-processing for concurrent execution and the content of the ``run`` block. The keywords for parallel execution (``srun``, ``mpiexec``, or ``mpirun``) are substituted by the platform-dependent command (a minimal sketch of this substitution is given after this list). The definition of the task function is followed by the concurrent execution command.

#. Epilogue

This part corresponds to the epilogue section of the input file. The content of the ``code`` block is written as-is.
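
To picture the keyword substitution mentioned for ``parallel = true`` tasks, here is a minimal Python sketch; the regular expression, function name, and replacement string are hypothetical stand-ins, not the actual moller implementation.

.. code-block:: python

   import re

   # Hypothetical sketch of the keyword substitution applied to run blocks
   # when parallel = true; the actual implementation in moller may differ.
   PARALLEL_KEYWORDS = re.compile(r"\b(srun|mpiexec|mpirun)\b")

   def substitute_parallel_keywords(run_block: str, platform_command: str) -> str:
       """Replace srun/mpiexec/mpirun with the platform-provided command."""
       return PARALLEL_KEYWORDS.sub(platform_command, run_block)

   print(substitute_parallel_keywords("srun ./a.out input.dat",
                                      "mpirun -np $nprocs -host $node"))
   # -> mpirun -np $nprocs -host $node ./a.out input.dat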


Brief description of moller script functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The main functions of the moller script are briefly described below.

- ``run_parallel``

This function performs concurrent execution of task functions. It takes the degree of parallelism, the task function, and the status file as arguments. Within the function, it calls ``_find_multiplicity`` to find the number of tasks that can be run simultaneously, and invokes GNU parallel to run tasks concurrently. The task function is actually wrapped by the ``_run_parallel_task`` function to deal with the nested call of GNU parallel.

The platform-dependence is separated out by the functions ``_find_multiplicity`` and ``_setup_run_parallel``.

- ``_find_multiplicity``

This function determines the number of tasks that can be simultaneously executed on the allocated resources (nodes and cores) taking account of the degree of parallelism of the task.
For the PBS variants, the compute nodes and the cores are divided into slots, and the slots are kept as table variables.
The information obtained at the batch job execution is summarized as follows.

- For SLURM variants,

  - The number of allocated nodes (``_nnodes``): ``SLURM_NNODES``

  - The number of allocated cores (``_ncores``): ``SLURM_CPUS_ON_NODE``

- For PBS variants,

  - The allocated nodes (``_nodes[]``): the list of unique compute nodes, obtained from the file given by ``PBS_NODEFILE``.

  - The number of allocated nodes (``_nnodes``): the number of entries of ``_nodes[]``.

  - The number of allocated cores: searched for in the following order of examination

    - ``NCPUS`` (for PBS Professional)

    - ``OMP_NUM_THREADS``

    - the ``core`` parameter of the platform section (written in the script as the variable ``moller_core``)

    - the ``ncpus`` or ``ppn`` parameter in the header

- ``_setup_run_parallel``

This function is called from the ``run_parallel`` function to supplement some procedures before running GNU parallel.
For PBS variants, the slot tables are exported so that the task functions can refer to them.
For SLURM variants, there is nothing to do.

The structure of the task function is described as follows.

- A task function is created with the name ``task_{task name}``.

- The arguments of the task function are 1) the degree of parallelism (the numbers of nodes, processes, and threads), 2) the execution directory (which corresponds to an entry of the list file), and 3) the slot ID assigned by GNU parallel.

- The platform-dependent ``_setup_taskenv`` function is called to set up the execution environment.
  For PBS variants, the compute node and the cores are obtained from the slot table based on the slot ID. For SLURM variants, there is nothing to do.

- The ``_is_ready`` function is called to check if the preceding task has been completed successfully. If it is true, the remaining part of the function is executed. Otherwise, the task is terminated with the status -1.

- The content of the ``run`` block is written. The keywords for parallel calculation (``srun``, ``mpiexec``, or ``mpirun``) are substituted by the command provided for the platform.


How to extend *moller* for other systems
----------------------------------------------------------------
The latest version of *moller* provides profiles for the ISSP supercomputer systems ohtaka and kugui. A guide to extending *moller* for use on other systems is given below.

Class structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The platform-dependent parts of *moller* are placed in the directory ``platform/``.
Their class structure is depicted below.

.. graphviz::

   digraph class_diagram {
     size="5,5"
     node[shape=record,style=filled,fillcolor=gray95]
     edge[dir=back,arrowtail=empty]

     Platform[label="{Platform (base.py)}"]
     BaseSlurm[label="{BaseSlurm (base_slurm.py)}"]
     BasePBS[label="{BasePBS (base_pbs.py)}"]
     BaseDefault[label="{BaseDefault (base_default.py)}"]

     Ohtaka[label="{Ohtaka (ohtaka.py)}"]
     Kugui[label="{Kugui (kugui.py)}"]
     Pbs[label="{Pbs (pbs.py)}"]
     Default[label="{DefaultPlatform (default.py)}"]

     Platform->BaseSlurm
     Platform->BasePBS
     Platform->BaseDefault

     BaseSlurm->Ohtaka
     BasePBS->Kugui
     BasePBS->Pbs
     BaseDefault->Default
   }


A factory is provided to select the system in the input file.
A class is imported in ``platform/__init__.py`` and registered with the factory by ``register_platform(system_name, class_name)``; it then becomes available as a value of the system parameter of the platform section in the input YAML file.
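
A minimal sketch of such a registration is shown below; the module layout, class name, and system name are placeholders, and only ``register_platform(system_name, class_name)`` is taken from the description above.

.. code-block:: python

   # platform/__init__.py (sketch). The module and class names below are
   # hypothetical; only register_platform(system_name, class_name) is
   # taken from the text above.
   from .factory import register_platform   # assumed location of the helper
   from .mysystem import MySystem            # your new platform class

   # After registration, "mysystem" can be given as the system parameter
   # of the platform section in the input YAML file.
   register_platform("mysystem", MySystem)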

SLURM job scheduler variants
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For the SLURM job scheduler variants, the system-specific settings should be placed in a class derived from the BaseSlurm class.
The string that substitutes the keywords for the parallel execution of programs is given by the return value of the ``parallel_command()`` method. It corresponds to the ``srun`` command with the options for the exclusive use of resources. See ohtaka.py for an example.
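
Under these assumptions, a new SLURM-based system could be sketched as follows; only BaseSlurm and ``parallel_command()`` come from the description above, while the import path, class name, and the exact ``srun`` options are placeholders to be adapted from ohtaka.py.

.. code-block:: python

   # Sketch of a new SLURM-based platform class. The import path, class
   # name, and srun options are assumptions; see ohtaka.py for the real
   # pattern used on ohtaka.
   from .base_slurm import BaseSlurm

   class MySlurmSystem(BaseSlurm):
       def parallel_command(self):
           # The returned string replaces the srun/mpiexec/mpirun keywords
           # in the run blocks; the exclusive-use options are placeholders.
           return "srun --exclusive --mem-per-cpu=0"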

PBS job scheduler variants
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For the PBS job scheduler variants (PBS Professional, OpenPBS, Torque, and others), the system-specific settings should be placed in a class derived from the BasePBS class.

There are two ways of specifying the number of nodes for a batch job in the PBS variants. PBS Professional takes the form ``select=N:ncpus=n``, while Torque and others take the form ``nodes=N:ppn=n``. The BasePBS class has a parameter ``self.pbs_use_old_format`` that is set to ``True`` for the latter type.

The number of cores per compute node can be specified by the node parameter of the input file, while a default value may be set for a known system. In kugui.py, the number of cores per node is set to 128 by default.
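
As a sketch under the same assumptions (only ``pbs_use_old_format`` is documented above; the constructor signature and the attribute holding the default core count are hypothetical), a Torque-like system with 128 cores per node might look like this:

.. code-block:: python

   # Sketch of a Torque-like platform class. Only pbs_use_old_format is
   # taken from the text; the constructor signature and the core-count
   # attribute are assumptions (see kugui.py for the real default).
   from .base_pbs import BasePBS

   class MyTorqueSystem(BasePBS):
       def __init__(self, info):
           super().__init__(info)
           self.pbs_use_old_format = True  # emit nodes=N:ppn=n, not select=N:ncpus=n
           self.core = 128                 # hypothetical default cores per node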

Customizing features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When further customization is required, the methods of the base class may be overridden in the derived classes. The list of relevant methods is given below.

- ``setup``

This method extracts parameters of the platform section.

- ``parallel_command``

This method returns a string that is used to substitute the keywords for parallel execution of programs (``srun``, ``mpiexec``, ``mpirun``).


- ``generate_header``

This method generates the header part of the job script that contains options to the job scheduler.

- ``generate_function``

This method generates functions that are used within the moller script. It calls the following methods to generate function body and variable definitions.

- ``generate_variable``
- ``generate_function_body``

The definitions of the functions are provided as embedded strings within the class.
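
For instance, a header customization might look like the following sketch; the method signature and the scheduler option lines emitted here are assumptions, not the actual moller API, so inspect the base classes for the real interface before overriding.

.. code-block:: python

   # Sketch of a generate_header override. The signature and the emitted
   # option lines are assumptions; check the base classes before use.
   from .base_pbs import BasePBS

   class MyCustomSystem(BasePBS):
       def generate_header(self, job):
           lines = [
               "#!/bin/bash",
               "#PBS -l select=1:ncpus=128",  # example resource request
               "#PBS -q regular",             # example queue name
           ]
           return "\n".join(lines) + "\n"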

Porting to new type of job scheduler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The platform-dependent parts of the moller scripts are the calculation of the task multiplicity, the distribution of resources over tasks, and the command string for parallel calculation.
The internal functions need to be developed from the following information about the platform:

- how to acquire the allocated nodes and cores from the environment at the execution of batch jobs.

- how to launch parallel calculation (e.g. mpiexec command) and how to assign the nodes and cores to the command.

To find which environment variables are set within the batch jobs, it may be useful to call the ``printenv`` command in the job script.
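
In addition, a small stand-alone script such as the following (not part of moller; it uses only the environment variables already mentioned in this guide) can report the allocation-related variables from inside a batch job.

.. code-block:: python

   #!/usr/bin/env python3
   # Stand-alone diagnostic (not part of moller): print allocation-related
   # environment variables discussed in this guide.
   import os

   VARS = ("SLURM_NNODES", "SLURM_CPUS_ON_NODE", "PBS_NODEFILE",
           "NCPUS", "OMP_NUM_THREADS")

   for var in VARS:
       print(f"{var} = {os.environ.get(var, '<unset>')}")

   nodefile = os.environ.get("PBS_NODEFILE")
   if nodefile and os.path.exists(nodefile):
       with open(nodefile) as f:
           nodes = sorted({line.strip() for line in f if line.strip()})
       print("unique nodes in PBS_NODEFILE:", nodes)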

Troubleshooting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the variable ``_debug`` in the moller script is set to 1, debug output is printed during the execution of the batch job. If the job does not work as expected, it is recommended to turn on the debug option and examine the output to check whether the internal parameters are defined appropriately.
2 changes: 1 addition & 1 deletion docs/en/source/moller/index.rst
@@ -12,4 +12,4 @@ Comprehensive Calculation Utility (moller)
tutorial/index
command/index
filespec/index
.. appendix/index
appendix/index
2 changes: 1 addition & 1 deletion docs/ja/source/conf.py
@@ -36,7 +36,7 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.mathjax']
extensions = ['sphinx.ext.mathjax', 'sphinx.ext.graphviz']
mathjax_pass = ['https://cdnjs.com/']
math_number_all = True
numfig = True
4 changes: 2 additions & 2 deletions docs/ja/source/moller/about/index.rst
@@ -37,10 +37,10 @@

- Masahiro Fukuda (The Institute for Solid State Physics, The University of Tokyo)

- Tetsuya Fukushima (The National Institute of Advanced Industrial Science and Technology (AIST))

- Kota Ido (The Institute for Solid State Physics, The University of Tokyo)

- Tetsuya Fukushima (The National Institute of Advanced Industrial Science and Technology (AIST))

- Shusuke Kasamatsu (Yamagata University)

- Takashi Koretsune (Graduate School of Science, Tohoku University)
46 changes: 31 additions & 15 deletions docs/ja/source/moller/appendix/index.rst
@@ -60,7 +60,9 @@

#. Processing command-line options

With SLURM-type job schedulers, options such as the list file and task re-execution can be given as arguments to the sbatch command. With PBS-type job schedulers, such arguments are ignored, so these options must be set by editing the parameters in the job script.
With SLURM-type job schedulers, options such as the list file and task re-execution can be given as arguments to the sbatch command.

With PBS-type job schedulers, such arguments are ignored, so these options must be set by editing the parameters in the job script. The list file name defaults to ``list.dat``. To retry tasks, set the ``retry`` variable to ``1``.

#. Description of tasks

@@ -141,20 +143,34 @@
The platform-dependent parts of moller are collected in the ``platform/`` directory.
The class structure is as follows.

::

Platform (base.py)
|
+-- BaseSlurm (base_slurm.py) ------- Ohtaka (ohtaka.py)
|
+-- BasePBS (base_pbs.py) --+-------- Kugui (kugui.py)
| |
| `-------- Pbs (pbs.py)
|
`-- BaseDefault (base_default.py) --- DefaultPlatform (default.py)


A factory is provided for selecting the platform. If a class is registered with the factory by ``register_platform(registered name, class name)`` and imported in ``platform/__init__.py``, it can be specified in the platform.system parameter of the input parameter file.
.. graphviz::

   digraph class_diagram {
     size="5,5"
     node[shape=record,style=filled,fillcolor=gray95]
     edge[dir=back,arrowtail=empty]

     Platform[label="{Platform (base.py)}"]
     BaseSlurm[label="{BaseSlurm (base_slurm.py)}"]
     BasePBS[label="{BasePBS (base_pbs.py)}"]
     BaseDefault[label="{BaseDefault (base_default.py)}"]

     Ohtaka[label="{Ohtaka (ohtaka.py)}"]
     Kugui[label="{Kugui (kugui.py)}"]
     Pbs[label="{Pbs (pbs.py)}"]
     Default[label="{DefaultPlatform (default.py)}"]

     Platform->BaseSlurm
     Platform->BasePBS
     Platform->BaseDefault

     BaseSlurm->Ohtaka
     BasePBS->Kugui
     BasePBS->Pbs
     BaseDefault->Default
   }

A factory is provided for selecting the platform. If a class is registered with the factory by ``register_platform(registered name, class name)`` and imported in ``platform/__init__.py``, it can be specified as the system parameter of the platform section in the input parameter file.


SLURM job scheduler variants
