Server configuration¶
Install Miniconda and Python libraries¶
Python libraries installed in HPC are outdated and you may want to use newer releases. This section shows how to install Miniconda in the user’s home directory, without affecting the original installation.
Miniconda installs the most recent release of Python and pip
in the user’s folder. The libraries installed with pip
and conda
are also installed in your folder.
All commands shall be executed in the server’s Linux terminal.
Check CUDA release¶
Before installing the libraries, you need the current CUDA release to choose the right package. Run this command:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168
The output shows the current release is 10.1.
Install Miniconda¶
Miniconda is a package management system for Python and provides pip
, conda
and the most recent Python release with basic libraries. Miniconda requires less disk space than Anaconda and is faster to install. After installing Miniconda, you may install just the libraries that you’ll use. Anaconda installs many packages and applications that won’t be used.
Select a Miniconda release with the Python version compatible with the libraries that you need, since not all libraries are compatible with the newest release of Python. For example, the latest release of Tensorflow doesn’t work with the latest release of Python and CUDA 10.1.
Warning
Check Python, CUDA and libraries compatibility before installing Miniconda. Many libraries only work with specific Python and CUDA versions.
Access Miniconda website and copy the link with the installation script that you need. This example uses release 3.8.
Use the wget
command to download the script with the link you copied from Miniconda site:
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
The installation script asks a few questions, just press ENTER to accept the default values and accept the license terms. When prompted to initialize Miniconda, answer “yes”. This creates the default base environment that will be automatically activated when you log into the cluster. All libraries will be installed in this environment.
Do you wish the installer to initialize Miniconda3 by running conda init? [yes|no] [no] >>> yes
Now reload ~/.bashrc
to update the environment variables and activate the base environment:
$ source ~/.bashrc
Warning
Make sure to select the right environment before installation and always check Python and libraries after installation.
Install Python libraries for your project. Tensorflow and PyTorch use custom installation which depends on Python and CUDA versions:
$ conda install -c conda-forge numpy pandas matplotlib scikit-learn
Optionally update the libaries for the most recent version. In this example, Miniconda installed scikit-learn version 0.23 and this command upgrades to 0.24
# Update scikit-learn
$ conda upgrade -c conda-forge scikit-learn
Install Tensorflow¶
From Tensorflow website, select the correct version according to Python and CUDA versions. Since we have Python 3.8 and CUDA 10.1, the best Tensorflow version is 2.3:
$ pip install tensorflow==2.3
Install PyTorch¶
Similarly, check PyTorch website to install the correct version:
# Install PyTorch
# 1. Version with GPU to install in lince (CUDA 10.1 - Python 3.8)
$ conda install pytorch torchvision torchaudio cudatoolkit=10.1 -c pytorch
# 2. Version without GPU to install in aguia
$ conda install pytorch torchvision torchaudio cpuonly -c pytorch
Install Dask¶
Dask is a library for parallel and distributed computing. Dask’s schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world. It easily integrates with NumPy, Pandas and scikit-learn:
$ conda install dask distributed
Install RAPIDS¶
The RAPIDS suite of open source software libraries and APIs gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. Use the release selector to get the right installation command:
$ conda install -c rapidsai -c nvidia -c conda-forge rapids-blazing=0.19 python=3.8 cudatoolkit=10.1
Installation tests¶
After installing the libraries, run Python and import the libraries to confirm the correct version:
$ cat system_info.py
#!/scratch/<YOUR_NUSP>/miniconda3/bin/python3
import sys
import numpy as np
import pandas as pd
import matplotlib as mpl
import sklearn as sk
print('='*20, 'Software version', '='*20)
print("Python:", sys.version.split('\n')[0])
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print('Matplotlib:', mpl.__version__)
print("Sklearn:", sk.__version__)
Warning
Check Tensorflow, PyTorch and RAPIDS on the processing node, since the login server doesn’t have access to GPU.
Lince login node doesn’t provide GPU access, so you need to connect to a processing node to check Tensorflow, PyTorch and RAPIDS:
$ ssh lince2-001
Once connected in lince2-001, connect to a processing node and make sure that Tensorflow and PyTorch recognize the GPU:
$ python
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Check Tensorflow installation¶
Import Tensorflow:
>>> import tensorflow as tf
2021-05-06 10:09:05.807604: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Check Tensorflow version:
>>> tf.__version__
'2.3.0'
Check if Tensorflow can list both GPUs:
>>> tf.config.list_physical_devices()
2021-05-06 10:09:19.154886: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-05-06 10:09:19.167369: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:05:00.0 name: Tesla K20m computeCapability: 3.5
coreClock: 0.7055GHz coreCount: 13 deviceMemorySize: 4.63GiB deviceMemoryBandwidth: 193.71GiB/s
2021-05-06 10:09:19.168426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:83:00.0 name: Tesla K20m computeCapability: 3.5
coreClock: 0.7055GHz coreCount: 13 deviceMemorySize: 4.63GiB deviceMemoryBandwidth: 193.71GiB/s
2021-05-06 10:09:19.168477: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-05-06 10:09:19.173624: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-05-06 10:09:19.176772: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-05-06 10:09:19.177907: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-05-06 10:09:19.181156: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-05-06 10:09:19.183197: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-05-06 10:09:19.188812: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-05-06 10:09:19.192994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'), PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'), PhysicalDevice(name='/physical_device:XLA_GPU:1', device_type='XLA_GPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
Check PyTorch installation¶
Import PyTorch:
>>> import torch
Check PyTorch version:
>>> torch.__version__
'1.7.0'
>>>
Check the number of GPUs available:
>>> torch.cuda.device_count()
2
Check GPU name:
>>> torch.cuda.get_device_name(torch.cuda.current_device())
'Tesla K20m'
Check RAPIDS installation¶
RAPIDS automatically detects the GPU when you import a library:
>>> import cudf
/scratch/11568881/miniconda3/lib/python3.8/site-packages/cudf/utils/gpu_utils.py:92: UserWarning: You will need a GPU with NVIDIA Pascal™ or newer architecture
Detected GPU 0: Tesla K20m
Detected Compute Capability: 3.5
warnings.warn(
System information¶
You may need the hardware information to choose the right software release. The following commands show the main hardware devices and the Linux release. The commands may be executed directly in the Linux terminal, or you may save in a script and run in SLURM job. Note that PyTorch provides a custom version for each CUDA version:
$ cat system_info.sh
#!/usr/bin/bash
echo ========================
echo SLURM: ID of job allocation
echo ========================
echo $SLURM_JOB_ID # ID of job allocation
echo ========================
echo SLURM: Directory job where was submitted
echo ========================
echo $SLURM_SUBMIT_DIR # Directory job where was submitted
echo ========================
echo SLURM: File containing allocated hostnames
echo ========================
echo $SLURM_JOB_NODELIST # File containing allocated hostnames
echo ========================
echo SLURM: Total number of cores for job
echo ========================
echo $SLURM_NTASKS # Total number of cores for job
echo ========================
echo SLURM: GPU devide ID that assigned to the job to use
echo ========================
echo $CUDA_VISIBLE_DEVICES
echo ========================
echo Hostname
echo ========================
hostname
echo ========================
echo Memory Info \(GB\):
echo ========================
free -g
echo ========================
echo CPU Info:
echo ========================
lscpu
echo ========================
echo Disk space
echo ========================
df -h
echo ========================
echo GPU 1
echo ========================
nvidia-smi
echo ========================
echo GPU 2
echo ========================
lshw -C display
echo ========================
echo CUDA Version
echo ========================
nvcc --version
echo ========================
echo Linux version
echo ========================
cat /etc/os-release
echo ========================
echo PATH
echo ========================
echo $PATH
echo ========================
echo Python
echo ========================
which python
which python3
echo ========================
echo Conda
echo ========================
which conda
conda --version
echo ========================
echo Pip
echo ========================
which pip
pip --version
echo ========================
echo Python Library Versions
echo ========================
python system_info.py