
Commit

Deploy docs to v1.0.1 by GitHub Actions triggered by 1717704
HTP-tools Developers committed Sep 10, 2024
1 parent 786c0ac commit d2c96e8
Showing 99 changed files with 12,963 additions and 0 deletions.
4 changes: 4 additions & 0 deletions manual/v1.0.1/en/html/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 731a66ad61404ab3e769ad354d627b62
tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -0,0 +1,2 @@
<map id="class_diagram" name="class_diagram">
</map>
Binary file added manual/v1.0.1/en/html/_images/task_view.png
20 changes: 20 additions & 0 deletions manual/v1.0.1/en/html/_sources/index.rst.txt
@@ -0,0 +1,20 @@
.. HTP-tools documentation master file, created by
   sphinx-quickstart on Fri Jun 30 11:02:31 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Moller Users Guide
=====================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   moller/index

.. Indices and tables
.. ==================
.. * :ref:`genindex`
.. * :ref:`modindex`
.. * :ref:`search`
76 changes: 76 additions & 0 deletions manual/v1.0.1/en/html/_sources/moller/about/index.rst.txt
@@ -0,0 +1,76 @@
****************************************************************
Introduction
****************************************************************

What is moller?
----------------------------------------------------------------

In recent years, the use of machine learning for predicting material properties and designing substances (known as materials informatics) has gained considerable attention.
The accuracy of machine learning depends heavily on the preparation of appropriate training data.
Therefore, the development of tools and environments for the rapid generation of training data is expected to contribute significantly to the advancement of research in materials informatics.

moller is provided as part of the HTP-tools package, designed to support high-throughput computations.
It is a tool for generating batch job scripts for supercomputers and clusters, allowing parallel execution of programs over a series of computational conditions, such as parameter-parallel runs.
Currently, it supports the supercomputers ohtaka (using the SLURM job scheduler) and kugui (using the PBS job scheduler) provided by the Institute for Solid State Physics, the University of Tokyo.

License
----------------------------------------------------------------

The program package and the source code of moller are distributed under the GNU General Public License version 3 (GPL v3) or later.

Contributors
----------------------------------------------------------------

This software was developed by the following contributors.

- ver.1.0.1 (Released on 2024/09/10)

- ver.1.0.0 (Released on 2024/03/06)

- ver.1.0-beta (Released on 2023/12/28)

- Developers

- Kazuyoshi Yoshimi (The Institute for Solid State Physics, The University of Tokyo)

- Tatsumi Aoyama (The Institute for Solid State Physics, The University of Tokyo)

- Yuichi Motoyama (The Institute for Solid State Physics, The University of Tokyo)

- Masahiro Fukuda (The Institute for Solid State Physics, The University of Tokyo)

- Kota Ido (The Institute for Solid State Physics, The University of Tokyo)

- Tetsuya Fukushima (The National Institute of Advanced Industrial Science and Technology (AIST))

- Shusuke Kasamatsu (Yamagata University)

- Takashi Koretsune (Tohoku University)

- Project Coordinator

- Taisuke Ozaki (The Institute for Solid State Physics, The University of Tokyo)


Copyright
----------------------------------------------------------------

.. only:: html

   |copy| *2023- The University of Tokyo. All rights reserved.*

.. |copy| unicode:: 0xA9 .. copyright sign

.. only:: latex

   :math:`\copyright` *2023- The University of Tokyo. All rights reserved.*

This software was developed with the support of "Project for advancement of software usability in materials science" of The Institute for Solid State Physics, The University of Tokyo.

Operating environment
----------------------------------------------------------------

moller was tested on the following platforms:

- Ubuntu Linux + python3

241 changes: 241 additions & 0 deletions manual/v1.0.1/en/html/_sources/moller/appendix/index.rst.txt
@@ -0,0 +1,241 @@
================================================================
Extension guide
================================================================

N.B. The content of this section may vary depending on the version of *moller*.


Bulk job execution by *moller*
----------------------------------------------------------------

A bulk job execution means that a set of small tasks are executed in parallel within a single batch job submitted to a large batch queue. It is schematically shown as follows, in which N tasks are launched as background processes and executed in parallel, and a ``wait`` statement is invoked to wait for all tasks to be completed.

.. code-block:: bash

   task param_1 &
   task param_2 &
   ...
   task param_N &
   wait

To manage the bulk job, the nodes and cores allocated to the batch job must be distributed over the tasks param_1 ... param_N so that the tasks run on distinct nodes and cores. The execution also has to be arranged so that no more tasks run simultaneously than the allocated resources allow.

Hereafter a job script generated by *moller* will be denoted as a moller script.
In a moller script, the concurrent execution and control of tasks are managed by GNU parallel [1]. It takes a list holding the items param_1 ... param_N and runs commands for each item in parallel. An example is given as follows, where list.dat contains param_1 ... param_N, one per line.

.. code-block:: bash

   cat list.dat | parallel -j N task

The number of concurrent tasks is determined at runtime from the number of nodes and cores obtained from the execution environment and from the degree of parallelism of the task (the number of nodes, processes, and threads specified by the node parameter).

The way tasks are assigned to nodes and cores varies according to the job scheduler.
For SLURM job scheduler variants, the concurrent calls of the ``srun`` command within the batch job are appropriately assigned to nodes and cores by exploiting the options for exclusive resource usage. The exact options may depend on the platform.
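
For illustration, a minimal sketch of this pattern on a SLURM system is given below. The option values, the executable name, and the helper function ``run_one`` are assumptions made for the example and are not what moller actually generates.

.. code-block:: bash

   #!/bin/bash
   #SBATCH -N 4

   # hypothetical task: run one MPI program in the directory given as argument;
   # --exclusive asks SLURM to give this job step cores not used by other steps
   run_one() {
       (cd "$1" && srun --exclusive -N 1 -n 16 ./a.out > stdout.log 2>&1)
   }
   export -f run_one

   # run at most 4 tasks at a time, one entry of list.dat per task
   cat list.dat | parallel -j 4 run_one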

On the other hand, for PBS job scheduler variants that do not provide such a feature, the distribution of nodes and cores to tasks has to be handled within the moller script. The nodes and cores allocated to a batch job are divided into *slots*, and the slots are assigned to the concurrent tasks. The division is determined from the allocated nodes and cores and from the degree of parallelism of the task, and is kept in the form of table variables. Within a task, the programs are executed on the assigned hosts and cores (optionally pinned to the program) through options to mpirun (or mpiexec) and environment variables. This part depends on the MPI implementation.

**Reference**

[1] `O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47. <https://www.usenix.org/publications/login/february-2011-volume-36-number-1/gnu-parallel-command-line-power-tool>`_

How *moller* works
----------------------------------------------------------------

Structure of moller script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*moller* reads the input YAML file and generates a job script for bulk execution. The structure of the generated script is described as follows.

#. Header

This part contains the options to the job scheduler. The content of the platform section is formatted according to the type of job scheduler. This part is platform-dependent.

#. Prologue

This part corresponds to the prologue section of the input file. The content of the ``code`` block is written as-is.

#. Function definitions

This part contains the definitions of functions and variables used within the moller script. The description of the functions will be given in the next section. This part is platform-dependent.

#. Processing command-line options

The SLURM variants accept additional arguments to the job submission command (sbatch), which are passed to the job script as command-line options. The name of the list file and/or options such as the retry feature can be processed this way.

For the PBS variants, these command-line arguments are ignored; the name of the list file is therefore fixed to ``list.dat`` by default, and the retry feature may be enabled by editing the script to set ``retry`` to 1.

#. Description of tasks

This part contains the description of tasks specified in the jobs section of the input file. When more than one task is given, the following procedure is applied to each task.

When parallel = false, the content of the ``run`` block is written as-is.

When parallel = true (default), a function is created by the name task_{task name} that contains the pre-processing for concurrent execution and the content of the ``run`` block. The keywords for the parallel execution (``srun``, ``mpiexec``, or ``mpirun``) are substituted by the platform-dependent command. The definition of the task function is followed by the concurrent execution command.

#. Epilogue

This part corresponds to the epilogue section of the input file. The content of the ``code`` block is written as-is.
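
A minimal sketch of the overall shape of a generated script is shown below. The scheduler directives, file names, and the call signature of ``run_parallel`` are illustrative assumptions; only the ordering of the parts follows the description above.

.. code-block:: bash

   #!/bin/bash
   #SBATCH -p queue_name          # 1. header: options for the job scheduler
   #SBATCH -N 4

   echo "start: $(date)"          # 2. prologue: code block copied as-is

   # 3. definitions of run_parallel, _find_multiplicity, and other helpers are embedded here

   list_file=${1:-list.dat}       # 4. command-line options (SLURM variants)

   task_job1() {                  # 5. task function generated from the run block
       # ... pre-processing and the content of the run block ...
       :
   }
   run_parallel "1 16 1" task_job1 status_job1.dat

   echo "end: $(date)"            # 6. epilogue: code block copied as-is

With the SLURM variants, such a script can receive extra arguments at submission time, e.g. ``sbatch moller_script.sh another_list.dat`` (a hypothetical use of the command-line option processing described above).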


Brief description of moller script functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The main functions of the moller script are briefly described below.

- ``run_parallel``

This function performs concurrent execution of task functions. It takes the degree of parallelism, the task function, and the status file as arguments. Within the function, it calls ``_find_multiplicity`` to find the number of tasks that can be run simultaneously, and invokes GNU parallel to run tasks concurrently. The task function is actually wrapped by the ``_run_parallel_task`` function to deal with the nested call of GNU parallel.

The platform-dependence is separated out by the functions ``_find_multiplicity`` and ``_setup_run_parallel``.

- ``_find_multiplicity``

This function determines the number of tasks that can be simultaneously executed on the allocated resources (nodes and cores) taking account of the degree of parallelism of the task.
For the PBS variants, the compute nodes and the cores are divided into slots, and the slots are kept as table variables.
The information obtained at the batch job execution is summarized as follows.

- For SLURM variants,

  - The number of allocated nodes (``_nnodes``): taken from ``SLURM_NNODES``.

  - The number of allocated cores (``_ncores``): taken from ``SLURM_CPUS_ON_NODE``.

- For PBS variants,

  - The allocated nodes (``_nodes[]``): the list of unique compute nodes obtained from the file given by ``PBS_NODEFILE``.

  - The number of allocated nodes (``_nnodes``): the number of entries of ``_nodes[]``.

  - The number of allocated cores: searched from the following sources (in order of examination)

    - ``NCPUS`` (for PBS Professional)

    - ``OMP_NUM_THREADS``

    - the ``core`` parameter of the platform section (written in the script as the variable ``moller_core``)

    - the ``ncpus`` or ``ppn`` parameter in the header

- ``_setup_run_parallel``

This function is called from the ``run_parallel`` function to supplement some procedures before running GNU parallel.
For PBS variants, the slot tables are exported so that the task functions can refer to them.
For SLURM variants, there is nothing to do.
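
As an illustration of the multiplicity calculation, a simplified sketch for the SLURM case is given below. The variables ``_nnodes`` and ``_ncores`` follow the description above, while the function arguments and the rounding policy are assumptions made for this sketch.

.. code-block:: bash

   _find_multiplicity() {
       # degree of parallelism of the task: nodes, MPI processes, threads per process
       local task_nnodes=$1 task_nprocs=$2 task_nthreads=$3

       # allocation obtained from the SLURM environment
       local _nnodes=${SLURM_NNODES}
       local _ncores=${SLURM_CPUS_ON_NODE}

       # number of tasks that fit simultaneously, limited by nodes and by total cores
       local by_nodes=$(( _nnodes / task_nnodes ))
       local by_cores=$(( _nnodes * _ncores / (task_nprocs * task_nthreads) ))
       _multiplicity=$(( by_nodes < by_cores ? by_nodes : by_cores ))
       (( _multiplicity < 1 )) && _multiplicity=1
   }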

The structure of the task function is described as follows.

- A task function is created with the name ``task_{task name}``.

- The arguments of the task function are 1) the degree of parallelism (the number of nodes, processes, and threads), 2) the execution directory (corresponding to an entry of the list file), and 3) the slot ID assigned by GNU parallel.

- The platform-dependent ``_setup_taskenv`` function is called to set up the execution environment.
For PBS variants, the compute node and the cores are obtained from the slot table based on the slot ID. For SLURM variants, there is nothing to do.

- The ``_is_ready`` function is called to check if the preceding task has been completed successfully. If it is true, the remaining part of the function is executed. Otherwise, the task is terminated with the status -1.

- The content of the ``code`` block is written. The keywords for parallel calculation (``srun``, ``mpiexec``, or ``mpirun``) are substituted by the command provided for the platform.
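
A schematic task function for a job named job1 may therefore look as follows. The helpers ``_setup_taskenv`` and ``_is_ready`` are those described above, but their exact signatures, the executable name, and the way the exit status is recorded are assumptions made for this sketch.

.. code-block:: bash

   task_job1() {
       # arguments: 1) degree of parallelism, 2) execution directory, 3) slot ID
       local parallelism="$1" dir="$2" slot="$3"

       # platform-dependent setup; for PBS the node/cores of this slot are looked up here
       _setup_taskenv "$parallelism" "$slot"

       # run only if the preceding task in this directory finished successfully
       if _is_ready "$dir"; then
           (
               cd "$dir" || exit 1
               # the srun/mpiexec/mpirun keyword is replaced by the platform-specific command
               srun ./a.out > stdout.log 2>&1
           )
           return $?
       else
           return 255   # corresponds to terminating the task with status -1
       fi
   }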


How to extend *moller* for other systems
----------------------------------------------------------------
The latest version of *moller* provides profiles for the ISSP supercomputer systems ohtaka and kugui. A guide to extending *moller* to other systems is given in the following.

Class structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The platform-dependent parts of *moller* are placed in the directory ``platform/``.
Their class structure is depicted below.

.. graphviz::

   digraph class_diagram {
      size="5,5"
      node[shape=record,style=filled,fillcolor=gray95]
      edge[dir=back,arrowtail=empty]

      Platform[label="{Platform (base.py)}"]
      BaseSlurm[label="{BaseSlurm (base_slurm.py)}"]
      BasePBS[label="{BasePBS (base_pbs.py)}"]
      BaseDefault[label="{BaseDefault (base_default.py)}"]

      Ohtaka[label="{Ohtaka (ohtaka.py)}"]
      Kugui[label="{Kugui (kugui.py)}"]
      Pbs[label="{Pbs (pbs.py)}"]
      Default[label="{DefaultPlatform (default.py)}"]

      Platform->BaseSlurm
      Platform->BasePBS
      Platform->BaseDefault

      BaseSlurm->Ohtaka
      BasePBS->Kugui
      BasePBS->Pbs
      BaseDefault->Default
   }


A factory is provided to select the system specified in the input file.
A class is imported in ``platform/__init__.py`` and registered to the factory by ``register_platform(system_name, class_name)``; it then becomes available as a value of the system parameter of the platform section in the input YAML file.

SLURM job scheduler variants
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For the SLURM job scheduler variants, the system-specific settings should be applied in a class derived from the BaseSlurm class.
The string that substitutes the keywords for the parallel execution of programs is given by the return value of the ``parallel_command()`` method. It corresponds to the ``srun`` command with the options for exclusive use of resources. See ohtaka.py for an example.
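
For example, in a script generated for a SLURM system, a parallel-execution keyword written in a ``run`` block may be expanded roughly as follows; the exact options are determined by ``parallel_command()`` of the platform class, and the values shown here are only illustrative.

.. code-block:: bash

   # as written in the run block of the input file
   srun ./a.out

   # as substituted in the generated script (illustrative option values)
   srun -N 1 -n 16 -c 4 --exclusive ./a.out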

PBS job scheduler variants
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For the PBS job scheduler variants (PBS Professional, OpenPBS, Torque, and others), the system-specific settings should be applied in a class derived from the BasePBS class.

There are two ways of specifying the number of nodes for a batch job in the PBS variants. PBS Professional takes the form ``select=N:ncpus=n``, while Torque and others take the form ``nodes=N:ppn=n``. The BasePBS class has a parameter ``self.pbs_use_old_format`` that is set to ``True`` for the latter type.
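
In the generated header the two forms appear roughly as follows (the numbers of nodes and cores are placeholders):

.. code-block:: bash

   # PBS Professional / OpenPBS (pbs_use_old_format = False)
   #PBS -l select=4:ncpus=128

   # Torque and other variants (pbs_use_old_format = True)
   #PBS -l nodes=4:ppn=128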

The number of cores per compute node can be specified by the node parameter of the input file, while a default value may be set for a known system; in kugui.py, the number of cores per node is set to 128 by default.

Customizing features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When further customization is required, the methods of the base class may be overridden in the derived classes. The list of relevant methods is given below.

- ``setup``

This method extracts parameters of the platform section.

- ``parallel_command``

This method returns a string that is used to substitute the keywords for parallel execution of programs (``srun``, ``mpiexec``, ``mpirun``).


- ``generate_header``

This method generates the header part of the job script that contains options to the job scheduler.

- ``generate_function``

This method generates functions that are used within the moller script. It calls the following methods to generate function body and variable definitions.

- ``generate_variable``
- ``generate_function_body``

The definitions of the functions are provided as embedded strings within the class.

Porting to a new type of job scheduler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The platform-dependent parts of the moller scripts are the calculation of the task multiplicity, the distribution of resources over tasks, and the command string for parallel calculation.
The internal functions need to be developed based on the following information about the platform:

- how to acquire the allocated nodes and cores from the environment at the execution of batch jobs.

- how to launch parallel calculation (e.g. mpiexec command) and how to assign the nodes and cores to the command.

To find which environment variables are set within the batch jobs, it may be useful to call the ``printenv`` command in the job script.
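
For example, a throw-away job script such as the following can be submitted to inspect the environment; the scheduler directive is a placeholder and should be replaced by the one appropriate for the target system.

.. code-block:: bash

   #!/bin/bash
   #SBATCH -N 1                                  # or the corresponding #PBS directives

   printenv | sort > env.log                     # all environment variables set in the job
   cat "$PBS_NODEFILE" > nodes.log 2>/dev/null   # allocated nodes (PBS variants only)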

Troubleshooting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the variable ``_debug`` in the moller script is set to 1, debug output is printed during the execution of the batch job. If the job does not work as expected, it is recommended to turn on the debug option and examine the output to check whether the internal parameters are defined appropriately.
