User Guide

Introduction

Copernicus is a peer-to-peer distributed computing platform designed for high-level parallelization of statistical problems.

Many computational problems are growing bigger and require more compute resources. Bigger problems and more resources introduce additional requirements, such as

  • effective distribution of computational work
  • fault tolerance of computation
  • reproducibility and traceability of results
  • automatic post-processing of data
  • automatic result consolidation

Copernicus is a platform aimed at making distributed computing easy to use. Out of the box it provides:

  • Easy and effective consolidation of heterogeneous compute
    resources
  • Automatic resource matching of jobs against compute resources
  • Automatic fault tolerance of distributed work
  • A workflow execution engine to easily define a problem and trace
    its results live
  • Flexible plugin facilities allowing programs to be integrated into the
    workflow execution engine

This section covers how Copernicus works and what its possibilities are. The subsequent sections will cover in detail how to use the platform.

What is Copernicus?

A platform for gathering resources and using them effectively.

The architecture of Copernicus

Copernicus consists of four components: the Server, the Worker, the Client and the Workflow execution engine. The server is the backbone of the platform; it manages projects, generates jobs (computational work units) and matches these to the best computational resource. Workers are programs residing on your computational resources. They are responsible for executing jobs and returning the results to the server. Workers can reside on any type of machine: desktops, laptops, cloud instances or a cluster environment. The client is the tool with which you set up your project and monitor it. Nothing is ever run on the client itself; it only sends commands to the server. This way you can run the client on your laptop, start a project, close your laptop, open it up some time later and see that your project has progressed. All communication between the server, workers and clients is encrypted, and as you will see later, all communication has to be authorized.

Copernicus is designed so that any individual can set it up, consolidate any type of available resource and put it to use. There is no central server that you have to communicate with; you have full control of everything.

The workflow execution engine is what allows you to define your problem in a very easy way. Think of it as a flowchart where you define what you want to do and connect the important blocks. The workflow resides in the server and is a key component in every Copernicus project. By just providing a workflow with input, the server will automatically generate jobs and handle their execution. This way you never have to think about what to run where; instead you can focus on defining your problem.

The workflow also gives you the possibility to trace your work progress and look at intermediate results. You will also be able to alter inputs in the middle of a project run in case things have gone wrong or if you want to test another approach.

Workflow components can actually be any type of program, and with the plugin utilities in Copernicus you can define these programs as workflow items, also known as functions.

Getting Started

Prerequisites

Before we start installing, there are some basic prerequisites that have to be met. Apart from those listed below, the server and the worker have some additional prerequisites which we will cover in their respective setup sections.

Python

Copernicus requires Python 2.7 to be installed, and the installation must include OpenSSL. On *nix systems Python is usually preinstalled with OpenSSL; running the command python --version will show you which version is installed.
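To quickly verify both requirements you can, for example, run the commands below (a minimal sketch; the exact OpenSSL version string will differ between systems):

# Show the installed Python version (Copernicus requires 2.7)
python --version

# Check that Python was built with OpenSSL support
python -c "import ssl; print(ssl.OPENSSL_VERSION)"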

Network Ports

The default network ports that Copernicus communicates via are 14807 for HTTP and 13807 for HTTPS. Please ensure that these ports are open in the network. They must be open for both inward and outward communication on any machine that is running the client, worker or server. If these ports cannot be used it is possible to specify other ports when setting up the server.
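If netcat is available you can, for example, probe the ports from another machine as a quick connectivity check (the hostname below is only an illustration):

# Check that the Copernicus ports are reachable from this machine
nc -zv my.serverhostname.com 14807
nc -zv my.serverhostname.com 13807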

Git

Git is a tool to download software from a source code repository. To see if it is installed try to run the command git from a terminal. If an installation is needed please download it from
http://git-scm.com/download

Downloading Copernicus

Copernicus can be downloaded from our public git repository. To download the source, use the following command

git clone ssh://git.copernicus-computing.org:29418/copernicus

This will download the source and put everything in place. The cloned copernicus repository is the installation directory.

Set the CPC_HOME environment variable to the location where copernicus is installed.

  • On UNIX/Linux systems this is usually done with
    export CPC_HOME=path/to/cpc
  • Now add CPC_HOME to your PATH variable. On UNIX/Linux systems this is usually done with
    export PATH=$PATH:$CPC_HOME
  • You should now be able to run the following three commands

    cpcc -h

    cpc-server -h

    cpc-worker -h
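To make these settings permanent you can add them to your shell startup file, for example like this (a sketch, assuming a bash shell and that Copernicus was cloned to ~/copernicus):

# Add to ~/.bashrc (or equivalent) so the variables survive new shells
export CPC_HOME=~/copernicus
export PATH=$PATH:$CPC_HOME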

Server Setup

 

Additional Prerequisites

The server has some additional prerequisites. First, GROMACS must be installed and accessible to the server. The easiest way is to have GROMACS accessible from the PATH. For other ways to specify this please refer to the section Servers.

Optionally, if a Markov state modelling project is to be run, you will need to make sure that scipy, numpy and py-tables are installed. You will also need to fetch the MSMbuilder source from https://simtk.org/home/msmbuilder.

Installation

Before the server can run, it needs to be set up with an SSL keypair and a project base directory. This is achieved with the command

cpc-server setup [project-dir]

Here, project-dir is the directory where the server will store project data for all the projects that are run from it. Make sure to specify the project-dir in a writable location. Note that project-dir is only specified during setup; the directory itself is created when the first project is created. During the setup, the server will ask for a password for the root user (super user). You will also notice that a directory named .copernicus is created in your home directory. This is where Copernicus will store its settings.

To ensure that the server is properly installed, call the command

cpc-server start

The server will now run as a background process. For the client command-line tool and the worker to be able to communicate with the server, a connection bundle must be created: this is a file with a description of how to connect to a server together with a key pair. The bundle is created with the command

cpc-server bundle

which generates the bundle, with the name client.cnx. This file can then be used on any machine to communicate with the server.
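The bundle is an ordinary file, so it can be transferred in any way you like, for example with scp (the hostname and path below are only illustrations):

# Copy the connection bundle to the machine that will run the client or worker
scp client.cnx user@my.workstation.com:~/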

Compiling shared libraries

The load on the server can be quite heavy, and it can be more efficient not to run the performance-critical parts as ordinary Python code. If you have Cython installed you can run a bash script that is located in the Copernicus installation folder:

compileLibraries.sh

This will generate C code from the Python files, which is then compiled to shared libraries. The other Python files will automatically use these shared libraries instead of the corresponding Python code, which improves server efficiency. If you modify any Python file in Copernicus, rerun the script so that the shared libraries are regenerated from the updated code. No further optimization is applied when generating the C code from the Python code.
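A minimal way to run the script, assuming Cython is installed and CPC_HOME points to the installation directory:

# Run the compile script from the Copernicus installation directory
cd $CPC_HOME
bash compileLibraries.sh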

Client Setup

 

Setup

To use the client, a server to connect to must first be specified. To specify a server, use the add-server command:

cpcc add-server my.serverhostname.com

If you are running the server on a different port, the port number is supplied after the hostname, as in

cpcc add-server my.serverhostname.com 14807

 

Login

Once a server has been added, a client is required to log in to the server before being able to use any client commands. If no users have been added, the root user with the password supplied during server setup can be used. This is done via:

cpcc login root

The client will then ask for a password.
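A typical first session on a fresh client machine therefore looks something like this (the hostname is only an illustration; enter the root password chosen during server setup when prompted):

cpcc add-server my.serverhostname.com
cpcc login root
cpcc server-info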

Using a bundle

A server bundle, usually called client.cnx, can also be used to connect to a server. This bypasses the add-server and login procedure and grants super user access to the server that is prespecified in the bundle. To use a bundle to connect, use the -c argument to cpcc:

cpcc -c client.cnx server-info

It should output something like this:

Server hostname:127.0.0.1
Version:1.1.0-dev

and if there is an attempt to connect with the wrong connection bundle (i.e. with the wrong key pair), the client will show:

ERROR: [Errno 1] _ssl.c:503:error:14090086:
SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Worker Setup

 

Prerequisites

The worker has two prerequisites:

  • A client.cnx file. If you are running the worker on a different machine than the server, you probably do not have a .copernicus folder in your home directory. However, you can create one manually and drop the client.cnx file in there (see the example after this list). If you wish you can also specify the file manually, as shown later below.
  • GROMACS must be installed and accessible to the worker. The easiest way is to have GROMACS accessible from the PATH. For other ways to specify this please refer to the section Workers.
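For example, to place the bundle in the configuration folder manually (a sketch, assuming the bundle has been copied to the current directory):

# Create the configuration folder and drop the connection bundle into it
mkdir -p ~/.copernicus
cp client.cnx ~/.copernicus/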

Installation

Workers do not need any specific project directory. Provided that the prerequisites are met, the worker does not need any other installation procedure. To verify that the worker can connect to a server, start it with

cpc-worker -c client.cnx smp

It should output something like this:

INFO, cpc.worker: Worker ID: 130-229-12-163-dhcp.wlan.ki.se-26108.
Available executables for platform smp:
gromacs/mdrun 4.5.3
INFO, cpc.worker: Got 0 commands.
INFO, cpc.worker: Have free resources. Waiting 30 seconds

You will notice the parameter smp in the above command. This means that we start the worker with the platform type smp. We will cover this in greater detail in the section Platform types.

The output above is what you should see in case everything is set up correctly. Two things might go wrong here: if there is no client.cnx connection bundle, there will be an error message:

Could not find a connection bundle
Please specify one with the -c flag or supply the file with the name
client.cnx in your configuration folder

and if there is an attempt to connect with the wrong connection bundle (i.e. with the wrong key pair), the client will show:

ERROR: [Errno 1] _ssl.c:503:error:14090086:
SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Running a project


Servers

 


Servers are what manage your Copernicus projects. They are responsible for generating jobs and monitoring them. When you work with a project you use the cpcc command-line tool to send messages to the server. The server processes these commands, sets up your project and generates jobs for it.

Where to run the server

Since the server is responsible for running all of your projects, it is advisable to deploy it on a machine that is up and running all the time. For example, running the server on your laptop would not be a good idea, for several reasons:

  • You move your laptop around: when moving your laptop between locations, your machine gets assigned different IP addresses. Workers connected to this server would no longer be able to communicate with it once the address changes.
  • Laptops are not on all the time: you close the lid on your laptop, it runs out of batteries ….

Fault tolerance of projects

The server is very fault tolerant and handles your projects with great care. It regularly saves the state of each project. In case a server shuts down or crashes due to software or hardware failure, you can simply restart the server and it will recover to the previous state.
Jobs that have already been sent out to workers are also fault tolerant. The server can handle cases where the worker goes down or where the server itself goes down. This is done by the server heartbeat monitor. Whenever a server has sent a job to a worker it expects a small heartbeat message from the worker once in a while (the default is every 2 minutes, however this can be configured). If the server doesn't receive a heartbeat message it will mark that job as failed and put it back on the queue.
The same procedure is used when a server goes down. Whenever it starts again it will go through its list of jobs sent out (referred to as heartbeat items) and see which ones have gone past the heartbeat interval time. The jobs that have timed out will then be put back on the queue.

Although the server is fault tolerant make sure to backup your project data to a second disk regularly. The server will not be able to recover a project from disk failures.

How jobs are matched to the best resource

Copernicus Network

Copernicus servers can be connected together to form a network. This is useful in various cases:

TODO images for the various cases

  • You want to share worker resources: if your workers are not being utilized at 100% you can share them by connecting to other Copernicus servers. Whenever your server is out of jobs it will ask its neighbouring servers for jobs. More on this in the section Worker delegation.
  • You are running too many projects for one server to handle: if you have too many projects for one server to handle, you can offload it by running a second server on another machine. You can then connect the second server to the first and still share resources with worker delegation.
  • Your workers are running on an internal network while the server is
    not: in cluster environments the compute nodes can often only communicate with the head nodes, so you would need a server running on the head node. However, as soon as you start running projects, the server will consume resources on the head node, which is not advisable. A better setup is to run one server on the head node and connect it to a project server outside the cluster environment. The server on the head node will then only manage connections to the workers and pass these on to the project server.

Worker delegation

The concept of worker delegation allows servers to share worker resources with each other. Whenever servers are connected, worker delegation is enabled automatically. The server that the workers are connected to will always have first priority, and if there is work in its queue it will utilize its workers. However, if there is no work in the queue it will ask its connected servers for work. This is done in a prioritized order. The order of priority can be seen with the command cpcc list-nodes. Servers can be reprioritized with cpcc node-pri.

Connecting Servers

To connect two Copernicus servers you need to send a connection request, which in turn has to be approved by the other side.

Sending a connection request

A connection request can be sent to a server with the cpcc connect-server HOSTNAME command:

>cpcc connect-server server2.mydomain.com
Connection request sent to server2.mydomain.com 14807

By default a request is sent using the standard Copernicus unsecured (HTTP) port 14807. If the destination server has changed this port number you will need to specify it.

cpcc connect-server server2.mydomain.com 15555

After sending a connection request you can list it with cpcc connected-servers

>cpcc connected-servers
Sent connection requests
Hostname                     Port       Server Id
server2.mydomain.com         14807      b96add9c-aff5-11e2-953a-00259018db3a

Approving a connection request

Received connection requests need approval before a secure communication can be established between two servers.
To approve a connection request you use the cpcc trust SERVER_ID command.

Only servers that have sent a connection request to your server can be trusted. To see which servers have requested to connect you can use
cpcc connected-servers

>cpcc connected-servers
Received connection requests
Hostname                        Port       Server Id
server1.mydomain.com            14807      dc75c998-acf1-11e2-bfe2-00259018db3a

The list above specifies that we have one incoming connection request. The text string under the column “Server Id” is what we need to specify in the cpcc trust command.

>cpcc trust dc75c998-acf1-11e2-bfe2-00259018db3a

Following nodes are now trusted:
Hostname                        Port       Server Id
server1.mydomain.com            14807      dc75c998-acf1-11e2-bfe2-00259018db3a

After calling cpcc trust, your server will communicate with the requesting server and establish a secure connection.

To list the connected servers, simply use
cpcc connected-servers

>cpcc connected-servers
Connected nodes:
Priority   Hostname                        Port       Server Id
0          server1.mydomain.com            14807      dc75c998-acf1-11e2-bfe2-00259018db3a

If you change the hostname or the ports of one server, it will upon restart communicate with its connected servers and notify them of these changes.

Connecting to a server that is behind a firewall.

If one of the servers is behind a firewall, it is not possible to send a connection request directly.
The workaround is to first create an ssh tunnel to the server behind the firewall.
The procedure is then as follows.
1. Create an ssh tunnel to the firewalled server

ssh -f server_behind_firewall -L 13807:server_behind_firewall:13807 -L 14807:server_behind_firewall:14807 -N

The syntax 13807:server_behind_firewall:13807 means "anything sent to localhost port 13807 should be forwarded to server_behind_firewall port 13807".
The port numbers 13807 and 14807 are the standard Copernicus server ports. In case you have changed these settings, please make sure that those port numbers are used in the tunnelling command.

2. Send a connection request using the tunnel port.

cpcc connect-server localhost 14807

3. Approve the connection request.

cpcc trust SERVER_ID
where SERVER_ID is the id of the server that sent the connection request. You can look it up with the command cpcc connected-servers.
When a connection is established you no longer need the ssh tunnel.

User management

Copernicus has support for multiple users with access roles. Regular users have either full access or no access to a project. Super users (like root) have access to all projects and may add other users using:

cpcc add-user username

A user may grant another user access to their current project by issuing

cpcc grant-access username
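For example, a super user can create a new user who is then granted access to the currently selected project (the username alice is only an illustration):

cpcc add-user alice
cpcc grant-access alice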

Server configuration

Workers

Workers are responsible for most of the computational work in Copernicus. They act as simple programs connecting to a server and asking for work. Workers can have very different capabilities with regard to CPU resources and which programs they can execute.

When you start a worker it will establish a secure connection to your server and announce the programs it can execute, along with their versions. The server will then match the capabilities of the worker to the available jobs in the queue. By default a worker will try to use all available cores on a machine, however this can be configured.

Worker and Server communication

You can connect as many workers as you want to a server, and the only thing you need for this is a connection bundle. Worker and server communication is one-sided: it is always initiated by the worker, and the server is only able to send responses back.

How workers find available executables

Automatic partitioning

Workers always try to fully utilize the maximum number of cores available to them. Thus they are able to partition themselves to run multiple jobs at once. For example, if you have a worker with 24 available cores it can run one 24-core job, or one 12-core job and 12 single-core jobs. As long as the worker has free cores it will announce itself as available to the server. However, workers do not monitor the overall CPU usage of the computer they are running on, but assume that the CPU is not used for other tasks.

Limiting the number of cores

By default a worker tries to use all of the cores available on a machine. However, you can limit this with the flag -n

cpc-worker smp -n 12

You can also limit how many cores each individual job partition may use with the flag -s. For example, to limit your worker to 12 cores in total and 2 cores per job you can do:

cpc-worker smp -n 12 -s 2

Running jobs from a specific project

A worker can be dedicated to running jobs from a specific project; this is done with the flag -p

cpc-worker -p my-project smp

Avoiding idle workers

If you want to avoid having idle workers you can instruct them to shut down after a certain amount of idle time. This is done with the flag -q

cpc-worker -q 10 smp

Specifying work directory

When a job is running, workers store their work data in a work directory. By default this directory is created in the location where the worker is started. If you want to specify another work directory you can do so with the flag
-wd

cpc-worker -wd my-worker-dir smp

Platform types

Workers can be started with different platform types.

The standard platform type is smp. This one should be used to run a worker on a single node.

The platform type mpi should be used when one has binaries using OpenMPI. Any binary that you usually start with mpirun from the command line should use this platform type.

Executing workers in a cluster environment

Starting workers in a cluster environment is very straightforward: you only need to call the worker from your job submission script, and you only need to start one worker for all the resources that you allocate.

Example

This is the general structure to use for starting a copernicus worker

## 1. Add specific parameters for your queuing system ##

## 2. Start a Copernicus worker ##
cpc-worker mpi -n NUMBER_OF_CORES

Here is a specific version for the Slurm queuing system

#!/bin/bash
#SBATCH -N 12
#SBATCH --exclusive
#SBATCH --time=05:00:00
#SBATCH --job-name=cpc

# Assuming each node has 16 cores, 12*16=192
cpc-worker mpi -n 192


Best practices when using 1000+ cores

Copernicus is a very powerful tool and can start using thousands of cores in an instant. When starting large-scale resources it is advisable to gradually ramp up the resource usage and monitor the project to see whether any critical errors occur in the project or your cluster environment. If everything looks fine, start allocating more and more resources.
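One way to do this with the Slurm example above is to submit a small test allocation first and only then scale up to the full job (the node count and the 16 cores per node are only illustrative assumptions):

#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive
#SBATCH --time=01:00:00
#SBATCH --job-name=cpc-test

# Small test allocation first: 2 nodes * 16 cores = 32 cores
cpc-worker mpi -n 32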

Glossary

connection bundle — a file, usually named client.cnx, that describes how to connect to a server and contains a key pair; it is generated with cpc-server bundle.
function — a program defined as a workflow item through the Copernicus plugin facilities.
function instance
