MarketPlace HPC Gateway SDK

The HPC gateway SDK lets app developers and MarketPlace users run time-consuming tasks on computational clusters. You can install this Python SDK and use it to interact with the cluster and run simulation jobs.

MarketPlace has two HPC deployments available, namely the IWM deployment and the EPFL Materials Cloud (mc).

  • The iwm deployment does not have Slurm running yet, so jobs cannot be submitted for now. All other capabilities are working.

  • The EPFL Materials Cloud (mc) deployment supports all capabilities, and app developers can embed it in apps that need to run heavy calculations. However, the mc deployment is for testing purposes only: the time limit of every job is hardcoded to 10 minutes. The mc deployment will end its maintenance on 1st April 2023.

API Summary

This is a summary of the methods that will be explained in the next sections:

  • app.heartbeat(): check the availability of the system.

  • app.create_user(): create a new user.

  • app.create_job(): prepare a new job.

  • app.check_job_state(jobid=<jobid>): list the files in the remote job folder.

  • app.upload_file(jobid=<jobid>, filename=<filename>, source_path=<local_file_path>): upload a local file to the remote folder.

  • app.download_file(jobid=<jobid>, filename=<filename>): download a file from the remote folder.

  • app.delete_file(jobid=<jobid>, filename=<filename>): delete a file from the remote folder.

  • app.launch_job(jobid=<jobid>): launch/submit the job to the cluster queue managed by Slurm.

  • app.cancel_job(jobid=<jobid>): cancel a submitted job.

Install the SDK

To install the SDK package, run the following command, or add marketplace-hpc as a dependency of your MarketPlace app:

$ pip install marketplace-hpc

The source code is publicly available on GitHub.

Initialize the app instance

Use hpc_gateway_sdk.get_app to create an interface for interacting with the HPC gateway app. The name can be either iwm or mc.

To initialize the instance, provide the deployment name and the MarketPlace access_token. The access_token can be relayed from the app that integrates the HPC gateway app as its calculation backend.

To run this notebook, put the .env file with ACCESS_TOKEN set in the same folder.

[1]:
from hpc_gateway_sdk import get_app
from dotenv import load_dotenv
import os

load_dotenv(".env")

access_token = os.environ.get("ACCESS_TOKEN")
app = get_app(name="mc", access_token=access_token)
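
If ACCESS_TOKEN is missing from the .env file, os.environ.get returns None, and this page does not describe how get_app reacts to that. A minimal sketch of a guard that fails early with a clear message follows; read_access_token is a hypothetical helper written for illustration and is not part of the SDK.

```python
import os


def read_access_token(env=None, key="ACCESS_TOKEN"):
    """Return the token stored under `key`, or raise if it is missing or blank.

    Hypothetical helper for illustration; it is not part of marketplace-hpc.
    """
    # Fall back to the process environment when no mapping is passed in.
    env = os.environ if env is None else env
    token = (env.get(key) or "").strip()
    if not token:
        raise RuntimeError(f"{key} is not set; add it to the .env file")
    return token
```

After load_dotenv(".env") has run, the result can be passed on as get_app(name="mc", access_token=read_access_token()).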

The first time you use the HPC gateway app, you need to create a user in the database of the HPC app, which records the job data for each MarketPlace user account. create_user also creates the user folder on the cluster, where the job folders are stored.

[2]:
user_info = app.create_user()
print(user_info)
{
  "_id": "638f355e57bd4aa2a97b98d0",
  "email": "jusong.yu@epfl.ch",
  "home": "/scratch/snx3000/jyu/firecrest/jusong_yu",
  "message": "Success: Create user in database.",
  "name": "Jusong Yu"
}

To create a job, use the create_job method of the gateway app. It creates a job folder on the remote cluster to store files and returns the jobid for further operations. The parameter new_transformation is a dictionary with the job information used to create the Slurm job script. The following parameters must be provided:

  • job_name: the name of the job.

  • ntasks_per_node: the number of tasks per node, i.e. the number of MPI processes of your job (the value passed to mpirun -n).

  • partition: for the EPFL Materials Cloud (mc) deployment, the available partitions are debug and normal.

  • image: for security and ease of deployment, the simulation runs inside a Singularity container. Supported URIs include:

    • library: Pull an image from the currently configured library (library://user/collection/container[:tag])

    • docker: Pull a Docker/OCI image from Docker Hub, or another OCI registry (docker://user/image:tag)

    • shub: Pull an image from Singularity Hub (shub://user/image:tag)

    • oras: Pull a SIF image from an OCI registry that supports ORAS. (oras://registry/namespace/image:tag)

    • http, https: Pull an image using the http(s) protocol

  • executable_cmd: the command to run the simulation inside the container.
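
The parameter list above can be checked locally before calling create_job. The following is a minimal sketch, not the gateway's own validation: check_job_spec is a hypothetical helper, the scheme list simply mirrors the URIs listed above (the gateway may accept others), and the rule that ntasks_per_node must be a positive integer is an assumption.

```python
# Keys that the text above says must be provided.
REQUIRED_KEYS = ("job_name", "ntasks_per_node", "partition", "image", "executable_cmd")
# URI schemes copied from the list above; assumed, not authoritative.
IMAGE_SCHEMES = ("library", "docker", "shub", "oras", "http", "https")


def check_job_spec(spec):
    """Return a list of problems found in `spec`; an empty list means none."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in spec]

    ntasks = spec.get("ntasks_per_node")
    if ntasks is not None:
        # Assumption: a task count is a positive integer (bool is rejected).
        is_int = isinstance(ntasks, int) and not isinstance(ntasks, bool)
        if not is_int or ntasks < 1:
            problems.append("ntasks_per_node must be a positive integer")

    image = spec.get("image")
    if image is not None:
        scheme, sep, _ = str(image).partition("://")
        if not sep or scheme not in IMAGE_SCHEMES:
            problems.append(f"unsupported image URI: {image}")

    return problems
```

Calling check_job_spec on the dictionary before passing it as new_transformation reports typos such as a missing docker:// prefix without a round trip to the cluster.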

In the future, we will support using MarketPlace's private Docker registry (via GitLab). Once a GitLab account is available for this purpose, it will be enough to set the following environment variables on the remote cluster:

export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=<redacted>

As mentioned, the EPFL Materials Cloud (mc) deployment is for testing purposes only, and the execution time is limited to 10 minutes.

To build a container that can run parallel simulations, see the example LAMMPS and Quantum ESPRESSO Dockerfiles on containers4hpc. We encourage building the container on top of the base-mpi314 image, which uses MPICH v3.1.4. This version is ABI-compatible with other MPI libraries, so the container can run with any compatible MPI library.

[3]:
jobid = app.create_job(new_transformation={
  "job_name": "demo00",
  "ntasks_per_node": 1,
  "partition": "debug",
  "image": "docker://hello-world:latest",
  "executable_cmd": "> output",
})
print(jobid)
639615ac5be67d529e2187cd

The create_job method only prepares the folder and the Slurm job script on the remote cluster. To launch the simulation, call launch_job with the job id returned by create_job.

An email with the job state will be sent to the user's email address registered on the MarketPlace.

[4]:
resp = app.launch_job(jobid)
resp
[4]:
{'jobid': '639615ac5be67d529e2187cd'}

The check_job_state method returns the list of files in the job folder.

[5]:
resp = app.check_job_state(jobid)
resp
[5]:
{'files': [{'group': 'mrcloud',
   'last_modified': '2022-12-11T18:38:52',
   'link_target': '',
   'name': 'job.sh',
   'permissions': 'rw-r--r--',
   'size': '519',
   'type': '-',
   'user': 'jyu'},
  {'group': 'mrcloud',
   'last_modified': '2022-12-11T18:39:01',
   'link_target': '',
   'name': 'slurm-43437415.out',
   'permissions': 'rw-r--r--',
   'size': '0',
   'type': '-',
   'user': 'jyu'}],
 'message': 'Files in the job folder.'}
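
Because check_job_state returns a file listing, a simple way to follow a job is to poll until an expected file appears. The sketch below assumes the response shape shown above (a "files" list of dictionaries with a "name" key); file_names and wait_for_file are hypothetical helpers, not SDK functions. Note that a file being listed does not by itself prove that the job has finished.

```python
import time


def file_names(state):
    """Return the names listed in a check_job_state-style response."""
    return [entry["name"] for entry in state.get("files", [])]


def wait_for_file(check_state, filename, attempts=10, delay=5.0, sleep=time.sleep):
    """Poll `check_state()` until `filename` is listed; return True if found.

    `check_state` is any zero-argument callable returning the response
    dictionary; `sleep` is injectable so the loop can be tested quickly.
    """
    for attempt in range(attempts):
        if filename in file_names(check_state()):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next poll, but not after the last one
    return False
```

With the app from the cells above, this could be called as wait_for_file(lambda: app.check_job_state(jobid), "output").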

You can cancel a job with cancel_job.

[6]:
resp = app.cancel_job(jobid)
resp
[6]:
{'message': 'Send cancelling signal to job-639615ac5be67d529e2187cd, of f7t job id=43437415'}

Input files are usually needed to run the simulation. They can be uploaded with upload_file, as in the example below.

[7]:
app.upload_file(jobid, filename="file_upload_test.txt", source_path="./file_upload_test.txt")
resp = app.check_job_state(jobid)
resp
[7]:
{'files': [{'group': 'mrcloud',
   'last_modified': '2022-12-11T18:39:07',
   'link_target': '',
   'name': 'file_upload_test.txt',
   'permissions': 'rw-r--r--',
   'size': '7',
   'type': '-',
   'user': 'jyu'},
  {'group': 'mrcloud',
   'last_modified': '2022-12-11T18:38:52',
   'link_target': '',
   'name': 'job.sh',
   'permissions': 'rw-r--r--',
   'size': '519',
   'type': '-',
   'user': 'jyu'},
  {'group': 'mrcloud',
   'last_modified': '2022-12-11T18:39:05',
   'link_target': '',
   'name': 'output',
   'permissions': 'rw-r--r--',
   'size': '807',
   'type': '-',
   'user': 'jyu'},
  {'group': 'mrcloud',
   'last_modified': '2022-12-11T18:39:08',
   'link_target': '',
   'name': 'slurm-43437415.out',
   'permissions': 'rw-r--r--',
   'size': '1489',
   'type': '-',
   'user': 'jyu'}],
 'message': 'Files in the job folder.'}

Once the simulation has finished or produced an error, the output (or the Slurm error file) can be downloaded. Binary files are supported:

[8]:
resp = app.download_file(jobid, filename="output")
with open("output", "wb") as fh:
    # Write the response to disk in 1 KiB chunks.
    for chunk in resp.iter_content(chunk_size=1024):
        if chunk:  # skip empty chunks
            fh.write(chunk)
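
When several files have to be fetched, the loop above can be wrapped in a small function. This is a sketch under the assumption that the object returned by download_file exposes iter_content(chunk_size=...), as used in the cell above; save_stream is a hypothetical helper name.

```python
from pathlib import Path


def save_stream(resp, target, chunk_size=1024):
    """Write the chunks of `resp` to `target` and return the bytes written."""
    written = 0
    with Path(target).open("wb") as handle:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                written += handle.write(chunk)
    return written
```

For example, save_stream(app.download_file(jobid, filename="output"), "output") would store the file locally and report its size in bytes.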

To delete a file in the job folder, use delete_file.

[9]:
app.delete_file(jobid, filename="file_upload_test.txt")
resp = app.check_job_state(jobid)
resp
[9]:
{'files': [{'group': 'mrcloud',
   'last_modified': '2022-12-11T18:38:52',
   'link_target': '',
   'name': 'job.sh',
   'permissions': 'rw-r--r--',
   'size': '519',
   'type': '-',
   'user': 'jyu'},
  {'group': 'mrcloud',
   'last_modified': '2022-12-11T18:39:05',
   'link_target': '',
   'name': 'output',
   'permissions': 'rw-r--r--',
   'size': '807',
   'type': '-',
   'user': 'jyu'},
  {'group': 'mrcloud',
   'last_modified': '2022-12-11T18:39:08',
   'link_target': '',
   'name': 'slurm-43437415.out',
   'permissions': 'rw-r--r--',
   'size': '1489',
   'type': '-',
   'user': 'jyu'}],
 'message': 'Files in the job folder.'}
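
Putting the steps together, a typical submission creates the job, uploads its input files, and then launches it. The sketch below is one possible ordering, not a prescribed workflow: run_job is a hypothetical helper, and app is assumed to be any object exposing create_job, upload_file and launch_job with the call shapes used in the cells above.

```python
def run_job(app, spec, inputs=None):
    """Create a job, upload its inputs, launch it, and return the job id.

    `inputs` maps remote file names to local paths. Inputs are uploaded
    before launching so they are in place when the job starts.
    """
    jobid = app.create_job(new_transformation=spec)
    for filename, source_path in (inputs or {}).items():
        app.upload_file(jobid, filename=filename, source_path=source_path)
    app.launch_job(jobid)
    return jobid
```

The returned job id can then be used with check_job_state, download_file, delete_file or cancel_job as shown earlier.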