Notes on CHPC-Cluster¶
Login¶
After generating and configuring an SSH key pair, we can directly access the server via
$ ssh USERNAME@chpc-login01.itsc.cuhk.edu.hk
Since running test jobs on the login node is discouraged (or not allowed), it is more convenient to log in to the test node sandbox. This can be done with a chained ssh,
$ ssh -t USERNAME@chpc-login01.itsc.cuhk.edu.hk ssh sandbox
where -t forces pseudo-terminal allocation and avoids the warning
Pseudo-terminal will not be allocated because stdin is not a terminal.
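Alternatively (a sketch, assuming your local OpenSSH supports the -J jump-host option), the login node can be used as a jump host,
$ ssh -J USERNAME@chpc-login01.itsc.cuhk.edu.hk USERNAME@sandbox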
bypass the login node¶
Usually, only the login node is out of service, while jobs on the computing nodes are not affected. So here is a tip to bypass the inaccessible login node.
Requirement
You can access another intermediate machine which has a public or campus IP. Otherwise, you can try free tools like ngrok
to obtain a public IP for your local machine; see my notes on how to access the intranet from outside.
- Step 1: from a node other than the login node of the ITSC cluster, say sandbox, ssh to the intermediate machine with the remote port forwarding option -R PORT:localhost:22
- Step 2: ssh back to sandbox by specifying the port with -p PORT (a sketch of both steps is given below)
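A minimal sketch of the two steps, assuming the intermediate machine is reachable as MIDDLE and port 2222 is free on it (both are placeholders),
# Step 1: on sandbox, open a reverse tunnel to the intermediate machine
$ ssh -N -R 2222:localhost:22 USERNAME@MIDDLE
# Step 2: from your local machine, hop to the intermediate machine, then back into sandbox
$ ssh USERNAME@MIDDLE
$ ssh -p 2222 CLUSTER_USERNAME@localhost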
Tip
Sometimes the ssh session may be disconnected if there is no further activity, so it may be necessary to replace ssh with autossh (see my notes).
A sketch of the setup is shown below.
It is necessary to check the status of the tunnel: if the connection is broken, e.g., because sandbox has rebooted, a message window should pop up as a reminder to re-establish the tunnel in time. The command to pop up a message is notify-send. Note that a successful command exits with 0, so the following script checks the ssh connection,
#!/bin/bash
# check_ssh.sh: try a non-interactive ssh to the host given as $1 and notify on failure
ssh -q -o BatchMode=yes $1 exit
if [ $? != '0' ]; then
    #echo "broken connection"
    notify-send "$1" "broken connection"
fi
Then create a cron job that runs the check regularly,
$ crontab -e
0 * * * * export XDG_RUNTIME_DIR=/run/user/$(id -u); for host in sandbox STAPC ROCKY; do sh /home/weiya/github/techNotes/docs/Linux/check_ssh.sh $host; done
where export XDG_RUNTIME_DIR=/run/user/$(id -u) is necessary for notify-send to pop up the window from cron (refer to Notify-send doesn’t work from crontab).
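The script can also be tested manually before adding it to cron (assuming the same script path and that sandbox is a configured ssh alias),
$ sh /home/weiya/github/techNotes/docs/Linux/check_ssh.sh sandbox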
Custom Commands¶
Some of these commands are explained in the following sections.
- aliases
# delete all jobs
alias qdelall='qstat | while read -a ADDR; do if [[ ${ADDR[0]} == +([0-9]) ]]; then qdel ${ADDR[0]}; fi ; done'
# list available cores
alias sinfostat='sinfo -o "%N %C" -p stat -N'
# list available gpu
alias sinfogpu='sinfo -O PartitionName,NodeList,Gres:25,GresUsed:25 | sed -n "1p;/gpu[^:]/p"'
# check disk quota
alias myquota='for i in `whoami` Stat StatScratch; do lfs quota -gh $i /lustre; done'
# list jobs sorted by priority
alias sacctchpc='sacct -a -X --format=Priority,User%20,JobID,Account,AllocCPUS,AllocTRES,NNodes,NodeList,Submit,QOS | (sed -u 2q; sort -rn)'
# list jobs sorted by priority (only involved stat)
alias sacctstat='sacct -a -X --format=Priority,User%20,JobID,Account,AllocCPUS,AllocTRES,NNodes,NodeList,Submit,QOS | (sed -u 2q; sort -rn) | sed -n "1,2p;/stat/p"'
- functions
#request specified nodes in interactive mode
request_cn() { srun -p stat -q stat -w chpc-cn1$1 --pty bash -i; }
request_gpu() { srun -p stat -q stat --gres=gpu:1 -w chpc-gpu01$1 --pty bash -i; }
request_gpu_chpc() { srun -p chpc --gres=gpu:1 -w chpc-gpu$1 --pty bash -i; }
t() { tmux a -t $1 || tmux new -s $1; }
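For example (hypothetical node suffixes; assumes the aliases and functions above have been sourced),
# request an interactive shell on chpc-cn105
$ request_cn 05
# request one GPU on chpc-gpu011 of the chpc partition
$ request_gpu_chpc 011
# attach to (or create) a tmux session named work
$ t work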
Interactive Mode¶
The interactive mode is strongly recommended when you are debugging your program or want to check the output of each step.
qsub -I¶
The simplest way is
[sXXXX@chpc-login01 ~] $ qsub -I
If there are idle nodes, you will be allocated one; pay attention to the prompt, which indicates where you are. For example, sXXXX@chpc-login01 means you are on the chpc-login01 node.
Sometimes you are brought to the target node automatically, and then you are done. But sometimes it only displays the node you have been allocated, such as
[sXXXX@chpc-login01 ~] $ qsub -I
...
salloc: Nodes chpc-cn011 are ready for job
[sXXXX@chpc-login01 ~] $
then you need to manually ssh into the target node
[sXXXX@chpc-login01 ~] $ ssh chpc-cn011
[sXXXX@chpc-cn011 ~] $
srun -w¶
Sometimes you might want to use a specific node, say you want to use a GPU (DO NOT forget --gres=gpu:1); then you can specify the node via the option -w. Moreover, you had better specify the partition and QoS policy with -p stat -q stat, which determines which usage quota your job counts against. The interactive shell is requested via --pty bash -i.
The complete command is
[sxxxxx@chpc-login01 ~]$ srun -p stat -q stat --gres=gpu:1 -w chpc-gpu010 --pty bash -i
srun: job XXXXXX queued and waiting for resources
srun: error: Lookup failed: Unknown host
srun: job XXXXXX has been allocated resources
[sxxxxx@chpc-gpu010 ~]$
Once you are allocated a node, you can work on it just like on your own laptop.
Submitting Multiple Jobs¶
SLURM and PBS are two different cluster schedulers; the common equivalent commands are as follows:
# PBS
qsub -l nodes=2:ppn=16 -l mem=8g -N jobname -m be -M notify@cuhk.edu.hk
# Slurm
sbatch -N 2 -c 16 --mem=8g -J jobname --mail-type=[BEGIN,END,FAIL,REQUEUE,ALL] --mail-user=notify@cuhk.edu.hk
PBS¶
Suppose there is a main program toy.jl, and I want to run it multiple times with different parameters, which can be passed via the -v option of qsub.
#!/bin/bash
# submit.sh
for number in 1 2 3 4 5; do
    for letter in a b c d e; do
        qsub -v arg1=$number,arg2=$letter toy.job
    done
done
#!/bin/bash
# toy.job
cd $HOME/PROJECT_FOLDER
julia toy.jl ${arg1} ${arg2}
# toy.jl
a, b = ARGS
println("a = $a, b = $b")
The submitting command is
$ ./submit.sh
and here is an example in my private projects.
SLURM¶
Suppose there is a main program run.jl, which runs in parallel on np cores, and I also want to repeat the program N times. To store the results properly, the repetition index nrep of each job is passed to the main program.
The arguments for the job file can be passed via the --export option.
#!/bin/bash
# submit.sh
if [ $# == 0 ]; then
cat <<HELP_USAGE
$0 param1 param2 param3
param1 number of repetitions
param2 node label, can be stat or chpc
param3 number of cores
HELP_USAGE
exit 0
fi
resfolder=res_$(date -Iseconds)
for i in $(seq 1 1 $1); do
sbatch -N 1 -c $3 -p $2 --export=resfolder=${resfolder},nrep=${i},np=$3 toy.job
done
#!/bin/bash
# toy.job
cd $HOME/PROJECT
julia -p $np run.jl $nrep $resfolder
# run.jl (skeleton)
using Distributed
const jobs = RemoteChannel(()->Channel{Tuple}(32))  # channel for job parameters
const res = RemoteChannel(()->Channel{Tuple}(32))   # channel for results
function make_jobs() end  # fill the jobs channel
function do_work() end    # consume jobs and push results to res
nrep, resfolder = ARGS
@async make_jobs()
where
- HELP_USAGE documents the shell script's parameters
- $1, $2, $3 denote the 1st, 2nd, 3rd command-line arguments, and $0 is the script name
The following command runs with N = 100 and np = 4 on the stat partition,
$ ./submit.sh 100 stat 4
which is adapted from one of my private projects.
Specify Nodes¶
Nodes can be excluded with -x or --exclude, and specified with -w.
TL;DR
Based on the following experiments, my observation is that the exclusion only applies to the granted resources rather than to all nodes. If you want to allocate specific nodes, the -w option should be used.
srun
# cannot exclude
$ srun -x chpc-cn050 hostname
chpc-cn050.rc.cuhk.edu.hk
salloc
# cannot exclude
$ salloc -x chpc-cn050 -N1
salloc: Nodes chpc-cn050 are ready for job
# cannot exclude
$ salloc -x chpc-cn050 srun hostname
salloc: Nodes chpc-cn050 are ready for job
chpc-cn050.rc.cuhk.edu.hk
# NB: exclude successfully
$ salloc -w chpc-cn050 srun -x chpc-cn050 hostname
salloc: Nodes chpc-cn050 are ready for job
srun: error: Hostlist is empty! Can't run job.
sbatch
# cannot exclude
$ sbatch << EOF
> #!/bin/sh
> #SBATCH -x chpc-cn050
> srun hostname
> EOF
Submitted batch job 246669
$ cat slurm-246669.out
chpc-cn050.rc.cuhk.edu.hk
# NB: exclude successfully
$ sbatch << EOF
> #!/bin/sh
> #SBATCH -w chpc-cn050
> srun -x chpc-cn050 hostname
> EOF
Submitted batch job 246682
$ cat slurm-246682.out
srun: error: Hostlist is empty! Can't run job.
Observation: -x seems not to work at the allocation step, but it can exclude nodes from the already allocated nodes.
Back to the manual of the -x option:
Explicitly exclude certain nodes from the resources granted to the job.
So the exclusion only applies to the granted resources rather than to all nodes. If you want to allocate specific nodes, the -w option should be used.
Exit Code¶
As the official documentation says, a job’s exit code (aka exit status, return code, or completion code) is captured by Slurm and saved as part of the job record. For sbatch jobs, the captured exit code is that of the batch script.
- Any non-zero exit code is assumed to be a job failure and results in a Job State of FAILED with a Reason of “NonZeroExitCode”.
- The exit code is an 8-bit unsigned number ranging between 0 and 255.
- When a signal was responsible for a job or step’s termination, the signal number is displayed after the exit code, delineated by a colon (:).
We can check the exit code of particular jobs,
sacct -a -X --format=Priority,User%20,JobID,Account,AllocCPUS,AllocTRES,NNodes,NodeList,Submit,QOS,STATE,ExitCode,DerivedExitCode
e.g.,
where
- the first one is a toy example killed by myself with kill -s 9 XX, so the number to the right of : is the signal 9, and the exit code is zero
- the second one is shared by @fangda; it is exactly the reverse, and I suspect it might be due to other reasons
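As a side note to the kill -s 9 example above, an ordinary (non-Slurm) shell encodes a fatal signal as 128 plus the signal number in $?, e.g.,
$ sleep 100 &
$ kill -s 9 $!
$ wait $!
$ echo $?
137   # = 128 + 9 (SIGKILL)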
see also 3.7.6 Signals and the following signal number table,
# http://www.bu.edu/tech/files/text/batchcode.txt
Name Number (SGI) Number (IBM)
SIGHUP 1 1
SIGINT 2 2
SIGQUIT 3 3
SIGILL 4 4
SIGTRAP 5 5
SIGABRT 6 6
SIGEMT 7 7
SIGFPE 8 8
SIGKILL 9 9
SIGBUS 10 10
SIGSEGV 11 11
SIGSYS 12 12
SIGPIPE 13 13
SIGALRM 14 14
SIGTERM 15 15
SIGUSR1 16 30
SIGUSR2 17 31
SIGPOLL 22 23
SIGIO 22 23
SIGVTALRM 28 34
SIGPROF 29 32
SIGXCPU 30 24
SIGXFSZ 31 25
SIGRTMIN 49 888
SIGRTMAX 64 999
Job Priority¶
Submitted jobs are sorted by the calculated job priority in descending order.
TL;DR
You can check the priority of all submitted jobs (not only yours but also others'), find where yours stands, and figure out when your job can start to run.
$ sacct -a -X --format=Priority,User%20,JobID,Account,AllocCPUS,AllocTRES,NNodes,NodeList,Submit,QOS | (sed -u 2q; sort -rn)
The formula for job priority is given by
Job_priority =
site_factor +
(PriorityWeightAge) * (age_factor) +
(PriorityWeightAssoc) * (assoc_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor) +
SUM(TRES_weight_cpu * TRES_factor_cpu,
TRES_weight_<type> * TRES_factor_<type>,
...)
- nice_factor
We can find these weights via
$ scontrol show config | grep ^Priority
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags = CALCULATE_RUNNING
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 0
PriorityWeightAssoc = 0
PriorityWeightFairShare = 100000
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 0
PriorityWeightTRES = (null)
Only PriorityWeightFairShare is nonzero, which agrees with
$ sprio -w
JOBID PARTITION PRIORITY SITE FAIRSHARE
Weights 1 100000
$ sprio -w -p stat
JOBID PARTITION PRIORITY SITE FAIRSHARE
Weights 1 100000
$ sprio -w -p chpc
JOBID PARTITION PRIORITY SITE FAIRSHARE
Weights 1 100000
then the formula simplifies to
Job_priority =
site_factor +
(PriorityWeightFairshare) * (fair-share_factor) +
SUM(TRES_weight_cpu * TRES_factor_cpu,
TRES_weight_<type> * TRES_factor_<type>,
...)
- nice_factor
where TRES_weight_<type> might be GPU (see the usage weight in the table), and a negative nice_factor can only be set by privileged users,
Nice Factor
Users can adjust the priority of their own jobs by setting the nice value on their jobs. Like the system nice, positive values negatively impact a job’s priority and negative values increase a job’s priority. Only privileged users can specify a negative value. The adjustment range is +/-2147483645.
- The fair-share factor can be obtained via sshare, and the calculated priority via sprio.
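For example (a quick sketch),
# fair-share information for your associations
$ sshare
# priority breakdown of pending jobs, in long format
$ sprio -l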
refer to
CPU/Memory Usage¶
Check the CPU and memory usage of a specific job. The natural way is to use top on the node that runs the job. After ssh-ing into the corresponding node, get the mapping between the job id and its process ids via
$ scontrol listpids YOUR_JOB_ID
Note that this only works for processes on the node on which scontrol is run, i.e., we cannot get the corresponding pid before ssh-ing into the node.
Then check the output of top and monitor the CPU/memory usage of the job given the pid, or explicitly specify the pid via top -p PID_OF_JOB.
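For example, a hypothetical one-liner that feeds all pids of a given job into top (assuming the usual one-line header in the scontrol listpids output),
$ top -p $(scontrol listpids YOUR_JOB_ID | awk 'NR>1 {print $1}' | paste -sd, -)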
Alternatively, a more direct way is to use the sstat command, which reports various status information, including CPU and memory usage, for running jobs.
$ sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,MaxRSS,MaxVMSize -j JOBID
AveCPU AvePages AveRSS AveVMSize MaxRSS MaxVMSize
---------- ---------- ---------- ---------- ---------- ----------
00:02.000 30 1828K 119820K 1828K 276808K
Correspondingly, the result from top
is
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
213435 XXX 20 0 119820 2388 1772 S 0.0 0.0 0:00.27 bash
where VIRT == AveVMSize. The detailed meanings can be found via man top,
- VIRT: Virtual Memory Size
- RES: Resident Memory Size
- %MEM: RES divided by total physical memory
Disk Quota¶
Sometimes you might find that your job can no longer write out results and you cannot even create a new file. This may mean that your quota has reached its limit; here is a tip to “increase” your quota without cleaning up your files.
TL;DR
The tip to “increase” your personal quota is to have the files counted against the shared department quota, so just change the group ownership of your files,
$ chgrp -R Stat SomeFolder
First, you can check your personal quota with
$ lfs quota -gh your_user_id /lustre
# 20GB by default from ITSC in /users/your_user_id
and here are two shared quotas for the whole department,
$ lfs quota -gh Stat /lustre
# 30TB shared by Statistics Department in /lustre/project/Stat
$ lfs quota -gh StatScratch /lustre
# 10TB by default from ITSC in /lustre/scratch/Stat
An interesting fact is that the quota is counted by the group ownership of files, so if your personal quota is exceeded, you can change the group ownership of some files; these files will then count against the shared quota instead of your personal quota. To change the group ownership of a folder recursively,
$ chgrp -R Stat SomeFolder
More details:
My programs crashed several times because the disk quota was exceeded, so I tried to move some files from the home folder to /lustre/project/Stat, but the quota did not seem to change.
Later I realized that the quota is controlled by group: the -g option in the commands above stands for group, i.e., your_user_id, Stat and StatScratch are all group names. If a folder is moved from elsewhere, its group does not change; only files or folders created directly under /lustre/project/Stat inherit the Stat group, which was guaranteed by running chmod g+s MYFOLDER when my own directory MYFOLDER was first created under the Stat folder.
So the simple solution is to change the group directly,
chgrp -R Stat SomeFolder
To find out which files have group sXXXX, use
find . -group `whoami`
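A hypothetical combination of the two commands above, which changes the group only for files still owned by your personal group,
$ find SomeFolder -group `whoami` -exec chgrp Stat {} +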
Custom Module¶
The cluster manages software versions with modules, and the default modulefile path is
$ echo $MODULEPATH
/usr/share/Modules/modulefiles:/etc/modulefiles
which requires sudo privilege to modify. A natural question is whether we can create a custom (local) modulefile to switch between software versions that we installed ourselves or that have not been added to the modules.
Here is an example. There is an R installation in /opt/share/R named 3.6.3-v2 which does not have a modulefile, since the same version 3.6.3 is already in use. But there are still differences between these two “same” versions: 3.6.3-v2 supports figure devices such as “jpeg”, “png”, “tiff” and “cairo”, while 3.6.3 does not,
(3.6.3) > capabilities()
jpeg png tiff tcltk X11 aqua
FALSE FALSE FALSE TRUE FALSE FALSE
http/ftp sockets libxml fifo cledit iconv
TRUE TRUE TRUE TRUE TRUE TRUE
NLS profmem cairo ICU long.double libcurl
TRUE FALSE FALSE TRUE TRUE TRUE
(3.6.3-v2) > capabilities()
jpeg png tiff tcltk X11 aqua
TRUE TRUE TRUE TRUE FALSE FALSE
http/ftp sockets libxml fifo cledit iconv
TRUE TRUE TRUE TRUE TRUE TRUE
NLS profmem cairo ICU long.double libcurl
TRUE FALSE TRUE TRUE TRUE TRUE
To use 3.6.3-v2
, we can create our custom modulefile (ref),
# step 1: create a folder in your home directory
~ $ mkdir modules
# step 2: append your modules folder to MODULEPATH in the ~/.bashrc file
~ $ echo "export MODULEPATH=${MODULEPATH}:${HOME}/modules" >> ~/.bashrc
# step 3: copy the existing modulefile as a template
# here I skip the directory and just use the name `R3.6`, which can also differ from the existing `R/3.6`
~ $ cp /usr/share/Modules/modulefiles/R/3.6 modules/R3.6
# step 4: modify the path to the software (`modroot`) as follows (the original modulefile is shown first, then the modified one, then their diff)
#%Module1.0#####################################################################
##
proc ModulesHelp { } {
global version modroot
puts stderr "R/3.6.3 - sets the Environment for
R scripts v3.6.3 (gcc version)
Use 'module whatis [module-info name]' for more information"
}
module-whatis "The R Project for Statistical Computing
R is a free software environment for statistical computing and graphics.
Here is the available versions:
R/3.6.3"
set version 3.6.3
set app R
set modroot /opt/share/$app/$version
module load pcre2
conflict R
setenv R_HOME $modroot/lib64/R
prepend-path PATH $modroot/bin
prepend-path LD_LIBRARY_PATH $modroot/lib
prepend-path INCLUDE $modroot/include
#%Module1.0#####################################################################
##
proc ModulesHelp { } {
global version modroot
puts stderr "R/3.6.3 - sets the Environment for
R scripts v3.6.3 (gcc version)
Use 'module whatis [module-info name]' for more information"
}
module-whatis "The R Project for Statistical Computing
R is a free software environment for statistical computing and graphics.
Here is the available versions:
R/3.6.3"
set version 3.6.3
set app R
set modroot /opt/share/$app/${version}-v2
module load pcre2
module load intel
conflict R
setenv R_HOME $modroot/lib64/R
prepend-path PATH $modroot/bin
prepend-path LD_LIBRARY_PATH $modroot/lib
prepend-path INCLUDE $modroot/include
@@ -18,10 +18,10 @@
set version 3.6.3
set app R
-set modroot /opt/share/$app/$version
+set modroot /opt/share/$app/${version}-v2
module load pcre2
-
+module load intel
conflict R
Note that module load intel is also added, otherwise it throws
/opt/share/R/3.6.3-v2/lib64/R/bin/exec/R: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
since libiomp5 is Intel's OpenMP runtime library (ref).
Now, you can use 3.6.3-v2
like other modules,
# load `3.6.3-v2`
$ module load R3.6
# unload `3.6.3-v2`
$ module unload R3.6
# load original `3.6.3`
$ module load R/3.6
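As a quick sanity check (assuming the modulefile above), verify which R is picked up after loading the custom module,
$ module load R3.6
$ which R
/opt/share/R/3.6.3-v2/bin/R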
GPU Usage¶
Be cautious about compatible versions among the deep learning framework (e.g., tensorflow), CUDA, and the (CUDA) driver version of the node.
The driver version of the node is fixed, but fortunately it is backward compatible, i.e., a higher driver version also supports lower CUDA versions. We can check the driver version by
$ nvidia-smi
Sat Jun 26 10:32:03 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
...
which implies that the highest supported CUDA version is 11.0.
The available CUDA versions can be found as follows
$ module avail cuda
-------------- /usr/share/Modules/modulefiles ---------------
cuda/10.1 cuda/10.2 cuda/11.0 cuda/11.3(default) cuda/9.2
For the above node, whose highest supported CUDA version is 11.0, the latest cuda/11.3 would be incompatible, but all the others are OK.
Now you can pick the proper tensorflow version according to the supported CUDA versions. Here is the official configuration table, which lists the compatible versions of tensorflow, python, CUDA, and cuDNN.
Finally, you can validate whether the GPU is correctly detected by running
$ python
>>> import tensorflow as tf
# 1.x
>>> tf.test.is_gpu_available()
# 2.x
>>> tf.config.list_physical_devices('GPU')
What if you want to use a version that is not installed on the cluster, say cuda/10.0? We can install a local CUDA and create a custom module for it.
Following the instruction: Install Cuda without root
$ wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
$ chmod +x cuda_10.0.130_410.48_linux
$ ./cuda_10.0.130_410.48_linux
then download cuDNN, which requires logging in to an NVIDIA developer account. After extraction, copy the include and lib folders into the CUDA installation folder.
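For example (a sketch; ./cuda and ~/cuda-10.0 are hypothetical extraction and installation paths),
$ cp cuda/include/cudnn*.h ~/cuda-10.0/include/
$ cp cuda/lib64/libcudnn* ~/cuda-10.0/lib64/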
Next, create the custom modulefile as before, and finally I can use
$ module load cuda10.0
to use the local CUDA.
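For reference, a minimal sketch of the environment the custom modulefile needs to set, written as plain shell here for illustration (the actual modulefile would use prepend-path as in the R example above; ~/cuda-10.0 is a hypothetical install path),
export CUDA_HOME=$HOME/cuda-10.0
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH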
Solutions to Abnormal Cases¶
fail to ssh passwordlessly¶
I tried to ssh as usual, but got an error:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.
Please contact your system administrator.
Add correct host key in /home/weiya/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/weiya/.ssh/known_hosts:42
remove with:
ssh-keygen -f "/home/weiya/.ssh/known_hosts" -R "chpc-login01.itsc.cuhk.edu.hk"
ECDSA host key for chpc-login01.itsc.cuhk.edu.hk has changed and you have requested strict checking.
Host key verification failed.
So I ran the suggested command,
ssh-keygen -f "/home/weiya/.ssh/known_hosts" -R "chpc-login01.itsc.cuhk.edu.hk"
and then tried ssh again, but it still asked for a password. For a similar issue, see ssh remote host identification has changed.
This actually corresponds to several .pub files under the /etc/ssh folder on the server. Consulting Michael, I was told that the public fingerprints had recently been changed, which is what known_hosts stores.
They can be displayed in MD5 form,
$ ssh-keygen -l -E md5 -f ssh_host_ed25519_key.pub
ssh-keyscan -t rsa server_ip
returns exactly the same result. I then manually added it to the known_hosts file, but it still did not work. I also tried generating keys of other types,
ssh-keygen -t [ed25519 | ecdsa | dsa]
but none of them helped.
After going back and forth with the server administrator and submitting the log from ssh -vvv xxx &> ssh.log for inspection, it was finally confirmed that the cause was a recent change in the server configuration. Although this was not stated explicitly, I noticed that the administrator reported it fixed shortly after /etc/ssh/sshd_config was updated. When asked about the reason, his answer was,
It is related to security context which will make SELinux to block the file access. I think this required root permission to config.
fail to access ~¶
Z asked in the group chat that when he submitted a job on the server, the packages he had installed earlier could not be used. Most likely, .libPaths() did not include the user library path under $HOME; however, when he ran .libPaths() in R on the login node, everything looked normal. So the problem probably lay in the compute node, and indeed either of the following fixes the problem on that compute node:
- run in R
.libPaths("/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0")
- or set
export R_LIBS_USER=/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0
Note
A similar problem is reported elsewhere, but for a different reason: in R_LIBS_USER ignored by R, the cause is that $HOME is not expanded properly.
But at that point I did not really understand it, since in principle the file system is shared across nodes. Later I studied R's startup mechanism,
On Unix versions of R there is also a file
R_HOME/etc/Renviron
which is read very early in the start-up processing. It contains environment variables set by R in the configure process. Values in that file can be overridden in site or user environment files: do not changeR_HOME/etc/Renviron
itself. Note that this is distinct fromR_HOME/etc/Renviron.site
.
and learned that R_LIBS_USER is defined in Renviron,
R_LIBS_USER=${R_LIBS_USER-'~/R/x86_64-pc-linux-gnu-library/4.0'}
where the syntax ${A-B} means that if A is not set, B is used as its value; note the difference from ${A:-B}.
No wonder typing echo $R_LIBS_USER directly in the command line returns nothing.
After ssh-ing to that compute node, I found that the prompt was not loaded correctly: it showed bash-4.2$ directly, whereas it would normally be [sXXXXX@chpc-sandbox ~]$.
Note
Actually, after source .bashrc the prompt shows ~ again, but accessing ~ still fails. See also Terminal, Prompt changed to “-Bash-4.2” and colors lost.
A direct consequence is that the home directory ~ cannot be resolved, which is probably also why R_LIBS_USER is not loaded properly on this node: the system configuration file /opt/share/R/4.0.3/lib64/R/etc/Renviron mentioned above uses ~, so the full path without ~ is needed. Note that ~ merely points to /users/sXXXXX, and that folder is not connected properly, so /lustre/users/sXXXX has to be used instead. In other words, the storage is not mounted, since the other nodes have the following three mount records,
$ df -h
storage03:/chpc-userhome 50T 4.4T 46T 9% /storage03/chpc-userhome
storage03:/chpc-optshare 10T 713G 9.4T 7% /storage03/chpc-optshare
storage01:/chpc-users 15T 6.2T 8.9T 42% /storage01/users
while this node does not.
To verify this idea, I also manually tested export R_LIBS_USER= and .libPaths(),
> .libPaths("/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0")
> .libPaths()
[1] "/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0"
[2] "/lustre/opt_share/R/4.0.3/lib64/R/library"
$ export R_LIBS_USER=/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0
$ R
> .libPaths()
[1] "/lustre/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0"
[2] "/lustre/opt_share/R/4.0.3/lib64/R/library"
> .libPaths("/users/sXXXXXXXXX/R/x86_64-pc-linux-gnu-library/4.0/")
> .libPaths()
[1] "/lustre/opt_share/R/4.0.3/lib64/R/library"
$ export R_LIBS_USER=~/R/x86_64-pc-linux-gnu-library/4.0
$ R
> .libPaths()
[1] "/lustre/opt_share/R/4.0.3/lib64/R/library"
We can see that both ~ and /users/sXXXX are problematic.
Although the problem was solved, I was curious why accessing ~ failed: in my understanding, /users/sXXXX (i.e., ~) and /lustre/users/sXXXX should be connected by something like a soft link (although perhaps they are not, since ls does not show a link target).
Later I consulted the administrator and learned that they are migrating the user folders: in their figure, the yellow boxes indicate the target disk of the migration, called storage, while the unboxed parts indicate the current disk, lustre.
But since not all users have been migrated, the two kinds of users have to be handled differently when accessing ~:
- if user A has already been migrated, then ~ points directly to /storage01/users/A
- if user B has not been migrated, then ~ points to /lustre/users/B via /storage01/users/B (which in this case acts only as a soft link)
So whether migrated or not, accessing ~ has to go through the yellow box in the figure above. Hence, if storage itself fails to mount, resolving ~ fails, while users who have not been migrated can still bypass ~ and access /lustre/users/sXXXX directly, precisely because they have not been migrated yet.
Inherited Environment¶
By default, sbatch inherits the environment variables of the submission shell, so
$ module load R/3.6
$ sbatch -p stat -q stat << EOF
> #!/bin/sh
> echo $PATH
> which R
> EOF
Submitted batch job 319113
$ cat slurm-319113.out
/opt/share/R/3.6.3/bin:...
/opt/share/R/3.6.3/bin/R
We can disable this inheritance via
$ sbatch -p stat -q stat --export=NONE << EOF
> #!/bin/sh
> echo $PATH
> which R
> EOF
Submitted batch job 319110
$ cat slurm-319110.out
/opt/share/R/3.6.3/bin:
which: no R in (...
But note that $PATH in the output still contains the path to R/3.6; the explanation is that the here-document is expanded by the submitting shell before the job is submitted (the EOF delimiter is unquoted), so echo $PATH is already substituted at submission time.
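A quick way to see this (a sketch): quote the delimiter so that the submitting shell does not expand $PATH,
$ sbatch -p stat -q stat --export=NONE << 'EOF'
#!/bin/sh
echo $PATH
which R
EOF
In this case, the PATH printed in the output file should no longer contain /opt/share/R/3.6.3/bin.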
The detailed explanation of --export
can be found in man sbatch
--export=<[ALL,]environment variables|ALL|NONE>
    Identify which environment variables from the submission environment are propagated to the launched
    application. Note that SLURM_* variables are always propagated.

    --export=ALL
        Default mode if --export is not specified. All of the users environment will be loaded (either from
        callers environment or clean environment if --get-user-env is specified).

    --export=NONE
        Only SLURM_* variables from the user environment will be defined. User must use absolute path to the
        binary to be executed that will define the environment. User can not specify explicit environment
        variables with NONE. --get-user-env will be ignored.

        This option is particularly important for jobs that are submitted on one cluster and execute on a
        different cluster (e.g. with different paths). To avoid steps inheriting environment export settings
        (e.g. NONE) from sbatch command, the environment variable SLURM_EXPORT_ENV should be set to ALL in
        the job script.