
Notes on CHPC-Cluster

Login

After generating and configuring an SSH key pair, we can directly access the server via

$ ssh USERNAME@chpc-login01.itsc.cuhk.edu.hk

Since running test jobs on the login node is not recommended (or allowed), it is more convenient to log in to the test node, sandbox. This can be done with two consecutive ssh commands,

$ ssh -t USERNAME@chpc-login01.itsc.cuhk.edu.hk ssh sandbox

where -t forces pseudo-terminal allocation and avoids the warning

Pseudo-terminal will not be allocated because stdin is not a terminal.

bypass the login node

Usually only the login node goes out of service, while jobs on the computing nodes are unaffected. So here is a tip to bypass the inaccessible login node.

Requirement

You need access to another intermediate machine with a public or campus IP. Otherwise, you can try free tools such as ngrok to expose your local machine with a public address; see my notes on how to access the intranet from outside.

  • Step 1: from a node other than the login node of the ITSC cluster, say sandbox, ssh to the intermediate machine with the remote port forwarding option -R PORT:localhost:22
  • Step 2: ssh back to sandbox by specifying the port -p PORT (see the sketch after this list)
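A minimal sketch of the two steps, assuming a hypothetical intermediate host middle.example.com, a hypothetical user user, and port 2222:

# Step 1: on sandbox, open a reverse tunnel so that port 2222 on the
# intermediate machine forwards back to sandbox's sshd (port 22)
[sXXXX@chpc-sandbox ~]$ ssh -N -R 2222:localhost:22 user@middle.example.com
# Step 2: on the intermediate machine, ssh back into sandbox through the tunnel
[user@middle ~]$ ssh -p 2222 sXXXX@localhost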

Tip

Sometimes the ssh session might be disconnected if there is no activity, so it may be necessary to replace ssh with autossh (see my notes).
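For example, the Step 1 tunnel above can be kept alive with autossh (same hypothetical host and port):

# -M 0 disables autossh's extra monitoring port; -f -N keeps the tunnel in the background without running a remote command
[sXXXX@chpc-sandbox ~]$ autossh -M 0 -f -N -R 2222:localhost:22 user@middle.example.com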

The sketch plot is as follows,

[Sketch of the reverse tunnel through the intermediate machine]

It is necessary to check the status of the tunnel: if the connection is broken, e.g., because sandbox has rebooted, a message window should pop up as a reminder to re-establish the tunnel in time.

The command to pop up a message is notify-send. Since a successful command exits with 0, the following script checks the ssh connection,

#!/bin/bash
# check whether the host given as the first argument is reachable over ssh;
# BatchMode=yes prevents password prompts and -q suppresses diagnostics
if ! ssh -q -o BatchMode=yes "$1" exit; then
    # a successful ssh exits with 0; otherwise pop up a desktop notification
    notify-send "$1" "broken connection"
fi

Then schedule the check as a regular cron job,

$ crontab -e
0 * * * * export XDG_RUNTIME_DIR=/run/user/$(id -u); for host in sandbox STAPC ROCKY; do sh /home/weiya/github/techNotes/docs/Linux/check_ssh.sh $host; done

where export XDG_RUNTIME_DIR=/run/user/$(id -u) is necessary for notify-send to pop up the window from cron (refer to Notify-send doesn’t work from crontab).

Custom Commands

Some of these commands are explained in the following sections.

  • aliases
# delete all jobs
alias qdelall='qstat | while read -a ADDR; do if [[ ${ADDR[0]} == +([0-9]) ]]; then qdel ${ADDR[0]}; fi ; done'
# list available cores
alias sinfostat='sinfo -o "%N %C" -p stat -N'
# list available gpu
alias sinfogpu='sinfo -O PartitionName,NodeList,Gres:25,GresUsed:25 | sed -n "1p;/gpu[^:]/p"'
# check disk quota
alias myquota='for i in `whoami` Stat StatScratch; do lfs quota -gh $i /lustre; done'
# list jobs sorted by priority
alias sacctchpc='sacct -a -X --format=Priority,User%20,JobID,Account,AllocCPUS,AllocGRES,NNodes,NodeList,Submit,QOS | (sed -u 2q; sort -rn)'
# list jobs sorted by priority (only involved stat)
alias sacctstat='sacct -a -X --format=Priority,User%20,JobID,Account,AllocCPUS,AllocGRES,NNodes,NodeList,Submit,QOS | (sed -u 2q; sort -rn) | sed -n "1,2p;/stat/p"'
  • functions
# request specified nodes in interactive mode
request_cn() { srun -p stat -q stat -w chpc-cn1$1 --pty bash -i; }
request_gpu() { srun -p stat -q stat --gres=gpu:1 -w chpc-gpu01$1 --pty bash -i; }
request_gpu_chpc() { srun -p chpc --gres=gpu:1 -w chpc-gpu$1 --pty bash -i; }
t() { tmux a -t $1 || tmux new -s $1; }
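For example, with the functions above,

$ request_gpu 0   # expands to: srun -p stat -q stat --gres=gpu:1 -w chpc-gpu010 --pty bash -i
$ t work          # attach to the tmux session `work`, or create it if it does not exist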

Interactive Mode

The interactive mode is strongly recommended when you are debugging your program or want to check the output of each step.

qsub -I

The simplest way is

[sXXXX@chpc-login01 ~] $ qsub -I

If there are idle nodes, you will be allocated one. Pay attention to the prompt, which indicates where you are; for example, sXXXX@chpc-login01 means you are on the chpc-login01 node.

Sometimes you are brought to the target node automatically, and then you are done. But sometimes it only displays the node you have been allocated, such as

[sXXXX@chpc-login01 ~] $ qsub -I
...
salloc: Nodes chpc-cn011 are ready for job
[sXXXX@chpc-login01 ~] $

then you need to manually ssh into the target node

[sXXXX@chpc-login01 ~] $ ssh chpc-cn011
[sXXXX@chpc-cn011 ~] $

srun -w

Sometimes you might want to use a specific node, say when you want to use a GPU (DO NOT forget --gres=gpu:1); then you can specify the node via the option -w. Moreover, you had better also specify the partition and QoS policy, -p stat -q stat, which counts towards your usage quota. The interactive shell is requested via --pty bash -i.

The complete command is

[sxxxxx@chpc-login01 ~]$ srun -p stat -q stat --gres=gpu:1 -w chpc-gpu010 --pty bash -i
srun: job XXXXXX queued and waiting for resources
srun: error: Lookup failed: Unknown host
srun: job XXXXXX has been allocated resources
[sxxxxx@chpc-gpu010 ~]$ 

Once you are allocated a node, you can do whatever you want, just like on your own laptop.

Submitting Multiple Jobs

SLURM and PBS are two different cluster schedulers, and the common equivalent commands are as follows:

# PBS
qsub -l nodes=2:ppn=16 -l mem=8g -N jobname -m be -M notify@cuhk.edu.hk
# Slurm
sbatch -N 2 -c 16 --mem=8g -J jobname --mail-type=[BEGIN,END,FAIL,REQUEUE,ALL] --mail-user=notify@cuhk.edu.hk
Sometimes we want to submit multiple jobs quickly, or perform parallel computing by dividing a heavy task into many small ones.

PBS

Suppose there is a main program toy.jl, and I want to run it multiple times but with different parameters, which can be passed via the -v option.

#!/bin/bash
# submit.sh: submit toy.job once for every parameter combination
for number in 1 2 3 4 5; do
    for letter in a b c d e; do
        qsub -v arg1=$number,arg2=$letter toy.job
    done
done
#!/bin/bash
# toy.job: receives arg1 and arg2 from qsub -v
cd $HOME/PROJECT_FOLDER
julia toy.jl ${arg1} ${arg2}
# toy.jl: the main program reads the two arguments
a, b = ARGS
println("a = $a, b = $b")

The submission command is

$ ./submit.sh

and here is an example in my private projects.

SLURM

Suppose there is a main program run.jl, which runs in parallel on np cores, and I also want to repeat the program N times. To store the results properly, the repetition index nrep of each job is passed to the main program.

The arguments for the job file can be passed via the --export option.

#!/bin/bash
# submit.sh
if [ $# == 0 ]; then
    cat <<HELP_USAGE
    $0 param1 param2 param3
    param1 number of repetitions
    param2 node label, can be stat or chpc
    param3 number of cores
HELP_USAGE
    exit 0
fi
resfolder=res_$(date -Iseconds)
for i in $(seq 1 1 $1); do
    sbatch -N 1 -c $3 -p $2 --export=resfolder=${resfolder},nrep=${i},np=$3 toy.job
done
#!/bin/bash
# toy.job: receives resfolder, nrep and np via --export
cd $HOME/PROJECT
julia -p $np run.jl $nrep $resfolder
# run.jl: a sketch of the main program
using Distributed
const jobs = RemoteChannel(()->Channel{Tuple}(32))
const res = RemoteChannel(()->Channel{Tuple}(32))

function make_jobs() end
function do_work() end

nrep, resfolder = ARGS
@async make_jobs()

where

  • HELP_USAGE documents the shell script’s parameters.
  • $1, $2, $3 denote the 1st, 2nd and 3rd command-line arguments, and $0 is the script name.

The following command runs with N = 100 and np = 4 on the stat partition,

$ ./submit.sh 100 stat 4

which is adapted from my private project.

Specify Nodes

Nodes can be excluded with -x (--exclude) and specified with -w (--nodelist).

TL; DR

According to the following experiments, my observation is that

the exclusion seems to apply only to the granted resources rather than to all nodes. If you want to allocate specific nodes, the -w option should be used.

  • srun
# cannot exclude
$ srun -x chpc-cn050 hostname
chpc-cn050.rc.cuhk.edu.hk
  • salloc
# cannot exclude
$ salloc -x chpc-cn050 -N1
salloc: Nodes chpc-cn050 are ready for job
# cannot exclude
$ salloc -x chpc-cn050 srun hostname
salloc: Nodes chpc-cn050 are ready for job
chpc-cn050.rc.cuhk.edu.hk
# NB: exclude successfully
$ salloc -w chpc-cn050 srun -x chpc-cn050 hostname
salloc: Nodes chpc-cn050 are ready for job
srun: error: Hostlist is empty!  Can't run job.
  • sbatch
# cannot exclude
$ sbatch << EOF
> #!/bin/sh
> #SBATCH -x chpc-cn050
> srun hostname
> EOF
Submitted batch job 246669
$ cat slurm-246669.out
chpc-cn050.rc.cuhk.edu.hk

# NB: exclude successfully
$ sbatch << EOF
> #!/bin/sh
> #SBATCH -w chpc-cn050
> srun -x chpc-cn050 hostname
> EOF
Submitted batch job 246682
$ cat slurm-246682.out
srun: error: Hostlist is empty!  Can't run job.

Observation:

-x seems not to work at the allocation step, but it can exclude nodes from those already allocated.

Back to the manual of -x option:

Explicitly exclude certain nodes from the resources granted to the job.

So the exclusion seems to apply only to the granted resources rather than to all nodes. If you want to allocate specific nodes, the -w option should be used.

Exit Code

As the official documentation says, a job’s exit code (aka exit status, return code or completion code) is captured by Slurm and saved as part of the job record. For sbatch jobs, the exit code that is captured is the output of the batch script.

  • Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a Reason of “NonZeroExitCode”.
  • The exit code is an 8 bit unsigned number ranging between 0 and 255.
  • When a signal was responsible for a job or step’s termination, the signal number will be displayed after the exit code, delineated by a colon (:).
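As a quick illustration of the 8-bit range, exit values wrap around modulo 256 (a plain shell sketch, not specific to Slurm):

$ bash -c 'exit 300'; echo $?
44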

We can check the exit code of particular jobs,

sacct -a -X --format=Priority,User%20,JobID,Account,AllocCPUS,AllocGRES,NNodes,NodeList,Submit,QOS,STATE,ExitCode,DerivedExitCode

e.g., two job records I checked (screenshot omitted), where

  • the first one is a toy example that I killed myself with kill -s 9 XX, so the number to the right of : is signal 9, while the exit code itself is zero
  • the second one was shared by @fangda; it is exactly the reverse, and I suspect it might be due to other reasons.

see also 3.7.6 Signals and the following table of signal numbers,

# http://www.bu.edu/tech/files/text/batchcode.txt
Name     Number (SGI)   Number (IBM)
SIGHUP      1              1
SIGINT      2              2
SIGQUIT     3              3
SIGILL      4              4
SIGTRAP     5              5
SIGABRT     6              6
SIGEMT      7              7
SIGFPE      8              8
SIGKILL     9              9
SIGBUS      10             10
SIGSEGV     11             11
SIGSYS      12             12
SIGPIPE     13             13
SIGALRM     14             14
SIGTERM     15             15
SIGUSR1     16             30
SIGUSR2     17             31
SIGPOLL     22             23
SIGIO       22             23
SIGVTALRM   28             34
SIGPROF     29             32
SIGXCPU     30             24
SIGXFSZ     31             25
SIGRTMIN    49             888
SIGRTMAX    64             999

Job Priority

The submitted jobs are sorted by the calculated job priority in descending order.

TL;DR

You can check the priority of all submitted jobs (not only yours but also others’), find where yours stands, and figure out when your job can start to run.

$ sacct -a -X --format=Priority,User%20,JobID,Account,AllocCPUS,AllocGRES,NNodes,NodeList,Submit,QOS | (sed -u 2q; sort -rn)

The formula for job priority is given by

Job_priority =
    site_factor +
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightAssoc) * (assoc_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (partition_factor) +
    (PriorityWeightQOS) * (QOS_factor) +
    SUM(TRES_weight_cpu * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
    - nice_factor

We can find those weights via

$ scontrol show config | grep ^Priority
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = CALCULATE_RUNNING
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 0
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 100000
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 0
PriorityWeightTRES      = (null)

Only PriorityWeightFairShare is nonzero, which agrees with

$ sprio -w
          JOBID PARTITION   PRIORITY       SITE  FAIRSHARE
        Weights                               1     100000
$ sprio -w -p stat
          JOBID PARTITION   PRIORITY       SITE  FAIRSHARE
        Weights                               1     100000
$ sprio -w -p chpc
          JOBID PARTITION   PRIORITY       SITE  FAIRSHARE
        Weights                               1     100000

then the formula simplifies to

Job_priority =
    site_factor +
    (PriorityWeightFairshare) * (fair-share_factor) +
    SUM(TRES_weight_cpu * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
    - nice_factor

where TRES_weight_<type> might be the GPU weight (see the usage weight in the table), and a negative nice_factor can only be set by privileged users,

Nice Factor

Users can adjust the priority of their own jobs by setting the nice value on their jobs. Like the system nice, positive values negatively impact a job’s priority and negative values increase a job’s priority. Only privileged users can specify a negative value. The adjustment range is +/-2147483645.

  • the fairshare can be obtained via sshare, and the calculated priority can be obtained via sprio.
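For example (a sketch; the exact output columns depend on the Slurm version),

$ sshare -u $USER   # fair-share factor of the given user
$ sprio -l          # long format: per-job breakdown of the priority factors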

refer to the Slurm documentation on the multifactor priority plugin.

CPU/Memory Usage

Check the CPU and memory usage of a specific job. The natural way is to use top on the node that runs the job. After ssh-ing into the corresponding node, get the map between job id and process id via

$ scontrol listpids YOUR_JOB_ID

Note that this only works for processes on the node on which scontrol is run, i.e., we cannot get the corresponding pid before ssh-ing into the node.

Then check the output of top and monitor the CPU/memory usage of the job given the pid, or explicitly specify the pid via top -p PID_OF_JOB.
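A sketch with a hypothetical job id and pid:

# on the compute node that runs the job
$ scontrol listpids 246669   # hypothetical job id: list its PIDs on this node
$ top -p 213435              # hypothetical pid taken from the previous output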

Alternatively, a more direct way is to use the sstat command, which reports various status information, including CPU and memory usage, for running jobs.

$ sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,MaxRSS,MaxVMSize -j JOBID
    AveCPU   AvePages     AveRSS  AveVMSize     MaxRSS  MaxVMSize 
---------- ---------- ---------- ---------- ---------- ---------- 
 00:02.000         30      1828K    119820K      1828K    276808K 

Correspondingly, the result from top is

PID USER PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME COMMAND                                                                                                              
213435 XXX 20   0  119820   2388   1772 S   0.0  0.0   0:00.27 bash 

where VIRT == AveVMSize.

Disk Quota

Sometimes you might find that your job can no longer write out results and you cannot create new files. This may imply that your quota has reached its limit; here is a tip to “increase” your quota without cleaning up your files.

TL;DR

The tip to “increase” your personal quota is to have the files counted against the shared department quota, so just change the group membership of your files,

$ chgrp -R Stat SomeFolder

Firstly, you can check your personal quota with

$ lfs quota -gh your_user_id /lustre
# 20GB by default from ITSC in /users/your_user_id

and here are two shared quotas for the whole department,

$ lfs quota -gh Stat /lustre
# 30TB shared by Statistics Department in /lustre/project/Stat
$ lfs quota -gh StatScratch /lustre
# 10TB by default from ITSC in /lustre/scratch/Stat

An interesting fact is that the quota is counted by the group membership of files, so if your personal quota is exceeded, you can change the group membership of some files, and these files will then count against the shared quota instead of your personal quota. To change the group membership of a folder recursively,

$ chgrp -R Stat SomeFolder
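To verify the change (a sketch),

$ ls -ld SomeFolder                # the group column should now show Stat
$ lfs quota -gh `whoami` /lustre   # personal usage should have decreased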

Some more details:

My programs have crashed several times because the disk quota was exceeded, so I tried moving some files from the home folder to /lustre/project/Stat, but the quota did not seem to change.

Only later did I realize that the quota is controlled by group: the -g option in the commands above stands for group, i.e., your_user_id, Stat and StatScratch are all group names. If a folder is moved from elsewhere, its group does not change; only files and folders created directly under /lustre/project/Stat inherit the Stat group, which was ensured by chmod g+s MYFOLDER when my own directory MYFOLDER was first created under the Stat folder.

So the simple fix is to change the group directly,

chgrp -R Stat SomeFolder

To find out which files still belong to group sXXXX, use

find . -group `whoami`

Custom Module

The cluster manages the software versions with module, and the default module file path is

$ echo $MODULEPATH
/usr/share/Modules/modulefiles:/etc/modulefiles

which requires sudo privileges. A natural question is whether we can create custom (local) modulefiles to switch between software that we installed ourselves or that has not been added to the modules.

Here is an example. There is an R installation in /opt/share/R named 3.6.3-v2 which has no modulefile, since the same version 3.6.3 is already in use. But there are still differences between these two “same” versions: 3.6.3-v2 supports graphics devices such as “jpeg”, “png”, “tiff” and “cairo”, while 3.6.3 does not,

(3.6.3) > capabilities()
       jpeg         png        tiff       tcltk         X11        aqua
      FALSE       FALSE       FALSE        TRUE       FALSE       FALSE
   http/ftp     sockets      libxml        fifo      cledit       iconv
       TRUE        TRUE        TRUE        TRUE        TRUE        TRUE
        NLS     profmem       cairo         ICU long.double     libcurl
       TRUE       FALSE       FALSE        TRUE        TRUE        TRUE

(3.6.3-v2) > capabilities()
       jpeg         png        tiff       tcltk         X11        aqua
       TRUE        TRUE        TRUE        TRUE       FALSE       FALSE
   http/ftp     sockets      libxml        fifo      cledit       iconv
       TRUE        TRUE        TRUE        TRUE        TRUE        TRUE
        NLS     profmem       cairo         ICU long.double     libcurl
       TRUE       FALSE        TRUE        TRUE        TRUE        TRUE

To use 3.6.3-v2, we can create our custom modulefile (ref),

# step 1: create a folder in your home directory
~ $ mkdir modules
# step 2: append your module path to MODULEPATH in the ~/.bashrc file
~ $ echo "export MODULEPATH=${MODULEPATH}:${HOME}/modules" >> ~/.bashrc
# step 3: copy the existing modulefile as a template
# here I skip the directory and just use the name `R3.6`, which can also differ from the existing `R/3.6`
~ $ cp /usr/share/Modules/modulefiles/R/3.6 modules/R3.6
# step 4: modify the path to the software (`modroot`); the original modulefile, the modified one, and their diff are shown below
# --- original modulefile: /usr/share/Modules/modulefiles/R/3.6 ---
#%Module1.0#####################################################################
##
proc ModulesHelp { } {
 global version modroot

puts stderr "R/3.6.3 - sets the Environment for
         R scripts v3.6.3 (gcc verion)

Use 'module whatis [module-info name]' for more information"
}

module-whatis "The R Project for Statistical Computing
R is a free software environment for statistical computing and graphics.

Here is the available versions:
        R/3.6.3"


set     version         3.6.3
set     app             R
set     modroot         /opt/share/$app/$version

module load pcre2

conflict R

setenv R_HOME $modroot/lib64/R

prepend-path PATH $modroot/bin
prepend-path LD_LIBRARY_PATH $modroot/lib
prepend-path INCLUDE $modroot/include
# --- modified modulefile: ~/modules/R3.6 ---
#%Module1.0#####################################################################
##
proc ModulesHelp { } {
 global version modroot

puts stderr "R/3.6.3 - sets the Environment for
         R scripts v3.6.3 (gcc verion)

Use 'module whatis [module-info name]' for more information"
}

module-whatis "The R Project for Statistical Computing
R is a free software environment for statistical computing and graphics.

Here is the available versions:
        R/3.6.3"


set     version         3.6.3
set     app             R
set     modroot         /opt/share/$app/${version}-v2

module load pcre2
module load intel
conflict R

setenv R_HOME $modroot/lib64/R

prepend-path PATH $modroot/bin
prepend-path LD_LIBRARY_PATH $modroot/lib
prepend-path INCLUDE $modroot/include
@@ -18,10 +18,10 @@

 set     version         3.6.3
 set     app             R
-set     modroot         /opt/share/$app/$version
+set     modroot         /opt/share/$app/${version}-v2

 module load pcre2
-
+module load intel
 conflict R

Note that module load intel is also added; otherwise it throws

/opt/share/R/3.6.3-v2/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

since libiomp5 is Intel’s OpenMP runtime library (ref).
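A quick way to spot such missing shared libraries (a sketch using the path from the error above):

$ ldd /opt/share/R/3.6.3-v2/lib64/R/bin/exec/R | grep "not found"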

Now, you can use 3.6.3-v2 like other modules,

# load `3.6.3-v2`
$ module load R3.6
# unload `3.6.3-v2`
$ module unload R3.6
# load original `3.6.3`
$ module load R/3.6

GPU Usage

Be cautious about the compatibility between the deep learning framework (e.g., tensorflow), the CUDA version, and the (CUDA) driver version of the node.

The driver version of a node is fixed, but fortunately it is backward compatible, i.e., a newer driver also supports older CUDA versions. We can check the driver version with

$ nvidia-smi 
Sat Jun 26 10:32:03 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
...

which implies that the highest supported CUDA version is 11.0.

The available CUDA versions can be found as follows

$ module avail cuda
-------------- /usr/share/Modules/modulefiles ---------------
cuda/10.1          cuda/10.2          cuda/11.0          cuda/11.3(default) cuda/9.2

For the above node, whose highest supported CUDA version is 11.0, the latest cuda/11.3 would be incompatible, but the others are all OK.

Now you can pick the proper tensorflow version according to the supported CUDA versions. Here is an official configuration table, which lists the compatible combinations of tensorflow, python, CUDA and cuDNN.

Finally, you can validate whether the GPU is correctly detected by running

$ python
>>> import tensorflow as tf
# 1.x
>>> tf.test.is_gpu_available()
# 2.x
>>> tf.config.list_physical_devices('GPU')

What if you want to use a version that is not installed on the cluster, say cuda/10.0? We can install a local CUDA and create a custom module to load it.

Following the instructions in Install Cuda without root,

$ wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
$ chmod +x cuda_10.0.130_410.48_linux
$ ./cuda_10.0.130_410.48_linux

then download cuDNN (this requires logging in to an NVIDIA account). After extraction, copy the include and lib files into the CUDA installation folder.
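A sketch of the copy step, assuming CUDA was installed into a hypothetical prefix $HOME/opt/cuda-10.0 and a cuDNN 7.x tarball for CUDA 10.0 (adjust both names to what you actually downloaded):

$ tar -xzf cudnn-10.0-linux-x64-v7.6.5.32.tgz        # extracts into a local `cuda/` folder
$ cp cuda/include/cudnn*.h   $HOME/opt/cuda-10.0/include/
$ cp -P cuda/lib64/libcudnn* $HOME/opt/cuda-10.0/lib64/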

Next, create the custom module file, and finally I can use

$ module load cuda10.0

to use the local cuda.

Solutions to Abnormal Cases

fail to ssh passwordlessly

I ssh-ed as usual, but got an error,

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.
Please contact your system administrator.
Add correct host key in /home/weiya/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/weiya/.ssh/known_hosts:42
  remove with:
  ssh-keygen -f "/home/weiya/.ssh/known_hosts" -R "chpc-login01.itsc.cuhk.edu.hk"
ECDSA host key for chpc-login01.itsc.cuhk.edu.hk has changed and you have requested strict checking.
Host key verification failed.

Following the hint, I ran

ssh-keygen -f "/home/weiya/.ssh/known_hosts" -R "chpc-login01.itsc.cuhk.edu.hk"

and then ssh-ed again, but it still asked for a password. For a similar problem, see ssh remote host identification has changed.

This actually corresponds to several pub files under /etc/ssh on the server. Michael also replied that the public fingerprints had been changed recently, and these are what known_hosts stores.

The fingerprints can be displayed in MD5 form,

$ ssh-keygen -l -E md5 -f ssh_host_ed25519_key.pub

In addition, scanning the key for an IP address or domain name with

ssh-keyscan -t rsa server_ip

returns exactly the same result. I added it to the known_hosts file manually, but it still did not work. I also tried adding keys in other formats,

ssh-keygen -t [ed25519 | ecdsa | dsa]

but none of them helped.

Later, after repeated communication with the server administrator and submitting the log from ssh -vvv xxx &> ssh.log for inspection, it was confirmed that a recent change in the server configuration was the cause. Although this was never stated explicitly, I noticed that the administrator replied it was fixed shortly after /etc/ssh/sshd_config was updated. When asked about the cause, his answer was,

It is related to security context which will make SELinux to block the file access. I think this required root permission to config.

fail to access ~

Z asked in the group chat: when he submitted a job on the server, the packages he had installed before could not be used. Most likely .libPaths() did not include the user library path under $HOME; however, when he ran .libPaths() in R on the login node, everything looked fine. So the problem probably lay with the compute node, and it turned out that either of the following approaches solves the problem on that node:

  • run .libPaths("/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0") in R
  • or export R_LIBS_USER=/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0

Note

A similar problem is reported elsewhere, but the cause is different: in R_LIBS_USER ignored by R, the reason is that $HOME was not expanded properly.

But at that time I did not fully understand it, since in principle all nodes share the same file system. Later I looked into R’s startup mechanism,

On Unix versions of R there is also a file R_HOME/etc/Renviron which is read very early in the start-up processing. It contains environment variables set by R in the configure process. Values in that file can be overridden in site or user environment files: do not change R_HOME/etc/Renviron itself. Note that this is distinct from R_HOME/etc/Renviron.site.

and learned that R_LIBS_USER is defined in Renviron,

R_LIBS_USER=${R_LIBS_USER-'~/R/x86_64-pc-linux-gnu-library/4.0'}

where the syntax ${A-B} means that B is used if A is unset; note the difference from ${A:-B}, which also falls back to B when A is set but empty.
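A quick illustration in the shell:

$ unset A;  echo "${A-fallback}"    # prints: fallback (A is unset)
$ A="";     echo "${A-fallback}"    # prints an empty line (A is set, though empty)
$ A="";     echo "${A:-fallback}"   # prints: fallback (A is set but empty)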

No wonder echo $R_LIBS_USER on the command line prints nothing.

After SSH-ing to that compute node, I found that its prompt was not loaded correctly: it showed bash-4.2$ directly, whereas it is usually [sXXXXX@chpc-sandbox ~]$.

Note

Actually, after source .bashrc the prompt shows ~ again, but accessing ~ still fails. See also Terminal, Prompt changed to “-Bash-4.2” and colors lost.

A direct consequence is that the home directory ~ cannot be resolved, which is probably why R_LIBS_USER was not loaded properly on this node: the system configuration file /opt/share/R/4.0.3/lib64/R/etc/Renviron uses ~, so the full path without ~ is needed. In fact, ~ merely points to /users/sXXXXX, and that folder was not linked properly, so /lustre/users/sXXXX has to be used. In other words, it was not mounted, since other nodes have the following three mount records,

$ df -h
storage03:/chpc-userhome                  50T  4.4T   46T   9% /storage03/chpc-userhome
storage03:/chpc-optshare                  10T  713G  9.4T   7% /storage03/chpc-optshare
storage01:/chpc-users                     15T  6.2T  8.9T  42% /storage01/users

while this node does not.

To verify this idea, I also manually experimented with export R_LIBS_USER and .libPaths(),

> .libPaths("/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0")
> .libPaths()
[1] "/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0"
[2] "/lustre/opt_share/R/4.0.3/lib64/R/library" 
$ export R_LIBS_USER=/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0
$ R
> .libPaths()
[1] "/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0"
[2] "/lustre/opt_share/R/4.0.3/lib64/R/library"   
> .libPaths("/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0/")
> .libPaths()
[1] "/lustre/opt_share/R/4.0.3/lib64/R/library"
$ export R_LIBS_USER=~/R/x86_64-pc-linux-gnu-library/4.0
$ R
> .libPaths()
[1] "/lustre/opt_share/R/4.0.3/lib64/R/library"

So everything involving ~ or /users/sXXXX has problems.

Although the problem was solved, I was curious why accessing ~ failed, since in my understanding /users/sXXXX (i.e., ~) and /lustre/users/sXXXX should be connected by something like soft links (although perhaps they are not, since ls does not show any link target).

Later, after consulting the administrator, I learned that they were migrating the user folders,

[Screenshot of the administrator’s migration plan]

where the yellow boxes mark the target disks of the migration, called storage, while the parts not boxed indicate the current disk, lustre.

But since not all users have been migrated yet, accessing ~ has to be handled differently for the two kinds of users,

  • if user A has already been migrated, ~ points directly to /storage01/users/A
  • if user B has not been migrated, ~ points to /lustre/users/B via /storage01/users/B (which in this case acts just like a soft link)

So regardless of whether a user has been migrated, accessing ~ goes through the yellow boxes in the figure above. If storage itself fails to mount, resolving ~ fails, whereas users who have not been migrated can still bypass ~ and access /lustre/users/sXXXX directly.

Inherited Environment

By default, sbatch inherits the environment variables, so

$ module load R/3.6
$ sbatch -p stat -q stat << EOF
> #!/bin/sh
> echo $PATH
> which R
> EOF
Submitted batch job 319113
$ cat slurm-319113.out 
/opt/share/R/3.6.3/bin:...
/opt/share/R/3.6.3/bin/R

We can disable this inheritance via

$ sbatch -p stat -q stat --export=NONE << EOF
> #!/bin/sh
> echo $PATH
> which R
> EOF
Submitted batch job 319110
$ cat slurm-319110.out 
/opt/share/R/3.6.3/bin:
which: no R in (...

But note that $PATH still contains the path to R/3.6; the explanation is that the variable substitution in the here-document is performed locally before the job is submitted.
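To confirm this, quoting the here-document delimiter prevents the local substitution, so $PATH is evaluated inside the job instead (a sketch of the same test):

$ sbatch -p stat -q stat --export=NONE << 'EOF'
#!/bin/sh
echo $PATH
which R
EOF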

The detailed explanation of --export can be found in man sbatch

       --export=<[ALL,]environment variables|ALL|NONE>
              Identify  which environment variables from the submission environment are propagated to the launched applica‐
              tion. Note that SLURM_* variables are always propagated.

              --export=ALL
                        Default mode if --export is not specified. All of the users environment will be loaded (either from
                        callers environment or clean environment if --get-user-env is specified).

              --export=NONE
                        Only  SLURM_*  variables  from the user environment will be defined. User must use absolute path to
                        the binary to be executed that will define the environment.  User can not specify explicit environ‐
                        ment variables with NONE.  --get-user-env will be ignored.
                        This  option  is particularly important for jobs that are submitted on one cluster and execute on a
                        different cluster (e.g. with different paths).  To avoid steps inheriting environment  export  set‐
                        tings  (e.g.  NONE) from sbatch command, the environment variable SLURM_EXPORT_ENV should be set to
                        ALL in the job script.