Decoding pip install operations
Managing Python Environments and Packages with pip
In today’s technology landscape, where we deal with numerous programming languages, diverse hardware (CPU, GPU, TPU, etc.), various operating systems, and an extensive open-source community, building software from scratch can be quite challenging. Even when leveraging existing packages or solutions, there are still numerous challenges to consider, including security, safety, and privacy concerns.
From my experience in machine learning, I can tell you that if you don’t manage your development, testing, and production environments carefully, you might face countless sleepless nights, health issues, and frustration without a clear understanding of what is going wrong.
Therefore, I recommend not tinkering with your default Python environment. Reserve it for standard Python commands and perhaps some basic packages like pandas, numpy, seaborn, and matplotlib. For any new package installation, you should first create a separate environment and activate it. This practice ensures that your default environment remains clean and stable.
The choice of which new packages to install depends on your project’s requirements, the available packages, their capabilities, and how up-to-date they are. When deciding to install a package, consider whether it can handle multiple requirements or if it’s better to install packages for specific needs. Always check the GitHub and PyPI repositories of the packages you plan to use; if a package hasn’t been updated for months or years, it might be wise to look for alternatives.
Note 1: This article focuses on pip commands and their outcomes. Similar results can be achieved with other Python package managers like conda.
Note 2: You can use pip on Windows, Linux, or Mac. While the commands are generally the same, be aware of the operating system you’re using. Sometimes, commands might differ slightly, or you might need to configure environment variables. For example, you could have multiple operating systems on different partitions, use WSL on Windows, or run virtual machines or Docker containers. Lack of awareness of these differences can lead to unnecessary debugging and time loss.
Creating and Managing Python Environments
To create a new Python environment, use the following command:
python -m venv venv-name
This command creates a new virtual environment named venv-name. The environment uses the same Python version as the interpreter that ran the command and comes with pip bootstrapped, but no other third-party packages.
To activate the newly created environment, use:
# On Windows
venv-name\Scripts\activate
# On Linux/WSL
source venv-name/bin/activate
Once an environment is active, it stays active in that shell session until you run deactivate. Any pip commands run during this time install packages into this environment.
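The same steps can be driven from Python itself. The following is a minimal sketch using the standard-library venv module; the helper names in_virtualenv and create_env are made up for illustration. It shows how to check whether the current interpreter belongs to a virtual environment, and how to create one programmatically.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def create_env(target: Path) -> Path:
    """Create a virtual environment at `target` and return its config file path."""
    # with_pip=False keeps this sketch fast; `python -m venv` bootstraps pip by default.
    venv.create(target, with_pip=False)
    return target / "pyvenv.cfg"


if __name__ == "__main__":
    print("Running inside a virtual environment:", in_virtualenv())
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env(Path(tmp) / "venv-name")
        # pyvenv.cfg records which base interpreter the environment was built from.
        print(cfg.read_text())
```

If in_virtualenv() prints False right after you thought you activated an environment, the activation script most likely ran in a different shell session.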
Installing and Managing Packages
To install a package, use:
pip install transformers
To view information about a package, use:
pip show transformers
This will output:
Name: transformers
Version: 4.44.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch, and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /mnt/d/venvs/tf-env/lib/python3.10/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by:
To check the list of all packages installed in the current environment, use:
pip list
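If you need the same information inside a script, the standard-library importlib.metadata module exposes it. The sketch below is an approximation of pip show and pip list, not pip's own implementation; the function names show and list_installed are illustrative.

```python
from importlib import metadata
from typing import Optional


def show(package: str) -> Optional[dict]:
    """Return a few `pip show`-style fields, or None if the package is not installed."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    return {
        "Name": dist.metadata.get("Name"),
        "Version": dist.version,
        "Summary": dist.metadata.get("Summary"),
        "Requires": dist.requires or [],
    }


def list_installed() -> dict:
    """Return {name: version} for every distribution visible to this interpreter."""
    installed = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken metadata
            installed[name] = dist.version
    return installed


if __name__ == "__main__":
    print(show("transformers"))  # None unless transformers is installed here
    print(len(list_installed()), "packages visible to this interpreter")
```

Because the lookup uses the interpreter that runs the script, it reports the packages of whichever environment is currently active.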
What Happens During a pip Installation?
pip can be used to install a single package or to set up a new, stable environment. To set up a stable environment, you typically use a requirements.txt file, which lists all the packages and their versions from an existing stable environment. This practice ensures consistency across different setups.
To create a requirements.txt file from a stable environment, use the following command:
pip freeze > requirements.txt
This command generates a requirements.txt file containing a list of all installed packages and their versions in the current environment. This file can then be used to recreate the same environment elsewhere.
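To make the file format concrete, here is a small sketch that reads the name==version lines pip freeze produces. It is deliberately simplified: it assumes plain pins and ignores everything else (options, URLs, environment markers). The helper name parse_pins is illustrative.

```python
def parse_pins(text: str) -> dict:
    """Map package name -> pinned version for every `name==version` line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if "==" not in line:
            continue  # skip blank lines, options such as `-r other.txt`, and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


sample = """\
# generated with: pip freeze > requirements.txt
absl-py==1.4.0
accelerate==0.31.0
aiohttp==3.9.5
"""

print(parse_pins(sample))
```

Comparing the result of parse_pins for two environments is a quick way to spot version drift between, say, development and production.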
Example Output of the Command: pip install -r requirements.txt
When running the command to install packages from a requirements.txt file, you might see output like this:
Collecting package_name
Using cached package_name-version.tar.gz (file_size)
Getting metadata from package_name
Preparing metadata (setup.py) ...
Installing collected packages: package_name
Successfully installed package_name-version
Understanding Each Operation
The sample run below raises two questions: what does each step in the log (Collecting, Using cached, Getting requirements, Preparing metadata, Installing) mean, and why does "Installing build dependencies" appear for only a few packages rather than for every one?
D:\venvpy\mistral-agentic-rag>pip install -r requirements.txt
Collecting absl-py==1.4.0 (from -r requirements.txt (line 1))
Using cached absl_py-1.4.0-py3-none-any.whl.metadata (2.3 kB)
Collecting accelerate==0.31.0 (from -r requirements.txt (line 2))
Using cached accelerate-0.31.0-py3-none-any.whl.metadata (19 kB)
Collecting aiohttp==3.9.5 (from -r requirements.txt (line 3))
Using cached aiohttp-3.9.5-cp312-cp312-win_amd64.whl.metadata (7.7 kB)
Collecting aiosignal==1.3.1 (from -r requirements.txt (line 4))
Using cached aiosignal-1.3.1-py3-none-any.whl.metadata (4.0 kB)
Collecting alabaster==0.7.16 (from -r requirements.txt (line 5))
Using cached alabaster-0.7.16-py3-none-any.whl.metadata (2.9 kB)
Collecting albumentations==1.3.1 (from -r requirements.txt (line 6))
Using cached albumentations-1.3.1-py3-none-any.whl.metadata (34 kB)
Collecting altair==4.2.2 (from -r requirements.txt (line 7))
Using cached altair-4.2.2-py3-none-any.whl.metadata (13 kB)
Collecting arviz==0.15.1 (from -r requirements.txt (line 8))
Using cached arviz-0.15.1-py3-none-any.whl.metadata (8.5 kB)
Collecting asgiref==3.8.1 (from -r requirements.txt (line 9))
Using cached asgiref-3.8.1-py3-none-any.whl.metadata (9.3 kB)
Collecting astropy==5.3.4 (from -r requirements.txt (line 10))
Using cached astropy-5.3.4-cp312-cp312-win_amd64.whl.metadata (9.6 kB)
Collecting astunparse==1.6.3 (from -r requirements.txt (line 11))
Using cached astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting async-timeout==4.0.3 (from -r requirements.txt (line 12))
Using cached async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting atpublic==4.1.0 (from -r requirements.txt (line 13))
Using cached atpublic-4.1.0-py3-none-any.whl.metadata (1.8 kB)
Requirement already satisfied: attrs==23.2.0 in c:\python312\lib\site-packages (from -r requirements.txt (line 14)) (23.2.0)
Collecting audioread==3.0.1 (from -r requirements.txt (line 15))
Using cached audioread-3.0.1-py3-none-any.whl.metadata (8.4 kB)
Collecting autograd==1.6.2 (from -r requirements.txt (line 16))
Using cached autograd-1.6.2-py3-none-any.whl.metadata (706 bytes)
Requirement already satisfied: Babel==2.15.0 in c:\python312\lib\site-packages (from -r requirements.txt (line 17)) (2.15.0)
Collecting backcall==0.2.0 (from -r requirements.txt (line 18))
Using cached backcall-0.2.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting backoff==2.2.1 (from -r requirements.txt (line 19))
Using cached backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting bcrypt==4.1.3 (from -r requirements.txt (line 20))
Using cached bcrypt-4.1.3-cp39-abi3-win_amd64.whl.metadata (9.8 kB)
Requirement already satisfied: beautifulsoup4==4.12.3 in c:\python312\lib\site-packages (from -r requirements.txt (line 21)) (4.12.3)
Collecting bidict==0.23.1 (from -r requirements.txt (line 22))
Using cached bidict-0.23.1-py3-none-any.whl.metadata (8.7 kB)
Collecting bigframes==1.9.0 (from -r requirements.txt (line 23))
Using cached bigframes-1.9.0-py2.py3-none-any.whl.metadata (4.9 kB)
Requirement already satisfied: bleach==6.1.0 in c:\python312\lib\site-packages (from -r requirements.txt (line 24)) (6.1.0)
Collecting blinker==1.4 (from -r requirements.txt (line 25))
Using cached blinker-1.4.tar.gz (111 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting blis==0.7.11 (from -r requirements.txt (line 26))
Using cached blis-0.7.11-cp312-cp312-win_amd64.whl.metadata (7.6 kB)
Collecting blosc2==2.0.0 (from -r requirements.txt (line 27))
Using cached blosc2-2.0.0.tar.gz (3.0 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting bokeh==3.3.4 (from -r requirements.txt (line 28))
Using cached bokeh-3.3.4-py3-none-any.whl.metadata (12 kB)
Collecting bqplot==0.12.43 (from -r requirements.txt (line 29))
Using cached bqplot-0.12.43-py2.py3-none-any.whl.metadata (6.4 kB)
Collecting branca==0.7.2 (from -r requirements.txt (line 30))
Using cached branca-0.7.2-py3-none-any.whl.metadata (1.5 kB)
Collecting build==1.2.1 (from -r requirements.txt (line 31))
Using cached build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Collecting CacheControl==0.14.0 (from -r requirements.txt (line 32))
Using cached cachecontrol-0.14.0-py3-none-any.whl.metadata (3.1 kB)
Collecting cachetools==5.3.3 (from -r requirements.txt (line 33))
Using cached cachetools-5.3.3-py3-none-any.whl.metadata (5.3 kB)
Collecting catalogue==2.0.10 (from -r requirements.txt (line 34))
Using cached catalogue-2.0.10-py3-none-any.whl.metadata (14 kB)
Requirement already satisfied: certifi==2024.6.2 in c:\python312\lib\site-packages (from -r requirements.txt (line 35)) (2024.6.2)
Requirement already satisfied: cffi==1.16.0 in c:\python312\lib\site-packages (from -r requirements.txt (line 36)) (1.16.0)
Collecting chardet==5.2.0 (from -r requirements.txt (line 37))
Using cached chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Requirement already satisfied: charset-normalizer==3.3.2 in c:\python312\lib\site-packages (from -r requirements.txt (line 38)) (3.3.2)
Collecting chex==0.1.86 (from -r requirements.txt (line 39))
Using cached chex-0.1.86-py3-none-any.whl.metadata (17 kB)
Collecting chroma-hnswlib==0.7.3 (from -r requirements.txt (line 40))
Using cached chroma-hnswlib-0.7.3.tar.gz (31 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
The following sections explain these log messages one by one.
Collecting
- Collecting: Pip is fetching the package from the Python Package Index (PyPI) or another repository. For example:
Collecting absl-py==1.4.0 (from -r requirements.txt (line 1))
Pip finds the specified version of absl-py (1.4.0) from the requirements file.
Using
- Using cached: Pip is reusing a file from its local cache instead of downloading it again, which saves time and bandwidth.
Using cached absl_py-1.4.0-py3-none-any.whl.metadata (2.3 kB)
In this log the cached file is the wheel's small metadata file (.whl.metadata), which pip reads to resolve dependencies before it fetches the wheel itself.
Getting
- Getting requirements to build wheel: For packages that need to be built from source, pip is determining what additional dependencies are required to build the package.
Getting requirements to build wheel ... done
Pip checks what is needed to build the wheel (a built package format for Python).
Preparing
- Preparing metadata: Pip prepares metadata for the package. Metadata includes information such as package name, version, dependencies, etc.
Preparing metadata (pyproject.toml) ... done
Pip processes the pyproject.toml file, which specifies build system requirements for the package.
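To see where build requirements come from, the sketch below reads the [build-system] table of a pyproject.toml. The file content is a hypothetical example, not taken from any package in the log, and the helper name build_requires is illustrative. It relies on tomllib, which is in the standard library from Python 3.11; on older versions the function simply returns None.

```python
try:
    import tomllib  # standard library from Python 3.11 onward
except ModuleNotFoundError:
    tomllib = None


def build_requires(pyproject_text: str):
    """Return the [build-system] requires list, or None if tomllib is unavailable."""
    if tomllib is None:
        return None
    data = tomllib.loads(pyproject_text)
    # A project without a [build-system] table declares no explicit build requirements.
    return data.get("build-system", {}).get("requires", [])


# Hypothetical pyproject.toml content, for illustration only.
sample = """
[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"
"""

print(build_requires(sample))
```

The packages listed under requires are what pip installs during the "Installing build dependencies" step before it asks the build backend for anything else.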
Installing
- Installing build dependencies: Pip is installing the tools needed to build the package into a temporary, isolated build environment.
Installing build dependencies ... done
For packages that must be built from source, pip first installs any build dependencies. This doesn’t happen for every package because many packages are available as pre-built wheels and don’t need a build step.
Explanation of why build dependencies are not installed every time:
- Pre-built wheels: Many Python packages are distributed as pre-built wheels. These are built distributions that pip can install directly, without running a build step, so no build dependencies are needed.
Using cached astropy-5.3.4-cp312-cp312-win_amd64.whl.metadata (9.6 kB)
Here, astropy is available as a wheel for 64-bit Windows and CPython 3.12, so it doesn’t require building.
- Source distributions: When a package is only available as a source distribution (like .tar.gz), pip has to build a wheel from it first, which requires installing build dependencies.
Collecting blinker==1.4 (from -r requirements.txt (line 25))
Using cached blinker-1.4.tar.gz (111 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
For blinker, pip needs to build it from source, so it installs build dependencies and prepares the metadata accordingly.
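The wheel filename itself tells you which interpreters and platforms it targets. Below is a minimal sketch that splits a filename into the name, version, Python tag, ABI tag, and platform tag defined by the wheel naming convention. The names WheelTags, parse_wheel_filename, and is_pure_python are illustrative, and the parser does not cover every edge case.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its name, version, and compatibility tags."""
    # pip's log shows "<wheel>.metadata" when only the metadata file was fetched.
    if filename.endswith(".metadata"):
        filename = filename[: -len(".metadata")]
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 parts when an optional build tag is present
        raise ValueError(f"unexpected wheel filename: {filename}")
    name, version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3:]
    return WheelTags(name, version, python_tag, abi_tag, platform_tag)


def is_pure_python(tags: WheelTags) -> bool:
    """A `none-any` wheel carries no compiled, platform-specific code."""
    return tags.abi_tag == "none" and tags.platform_tag == "any"


if __name__ == "__main__":
    for fn in ("astropy-5.3.4-cp312-cp312-win_amd64.whl.metadata",
               "absl_py-1.4.0-py3-none-any.whl.metadata"):
        tags = parse_wheel_filename(fn)
        print(tags, "pure Python:", is_pure_python(tags))
```

A py3-none-any wheel installs on any platform, while a cp312-cp312-win_amd64 wheel only fits CPython 3.12 on 64-bit Windows. When no published wheel matches your interpreter and platform, pip falls back to the source distribution.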
In summary:
- Collecting: Finding and fetching the package.
- Using cached: Utilizing a previously downloaded version.
- Getting requirements to build wheel: Determining additional dependencies for building.
- Preparing metadata: Processing metadata for the package.
- Installing build dependencies: Installing necessary tools and libraries for building the package from source.
Build dependencies are only required when the package is not available as a pre-built wheel and must be built from source.
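As a rough way to check this against a real log, the sketch below scans pip output and labels each collected requirement as coming from a wheel or from a source distribution. It only understands the "Collecting" and "Using cached" lines shown in the sample log above and is not a general pip log parser; the helper name classify is illustrative.

```python
import re

COLLECTING = re.compile(r"^Collecting (\S+)")
USING_CACHED = re.compile(r"^Using cached (\S+)")


def classify(log: str) -> dict:
    """Map each collected requirement to 'wheel' or 'sdist' based on the cached file."""
    result = {}
    current = None
    for raw in log.splitlines():
        line = raw.strip()
        match = COLLECTING.match(line)
        if match:
            current = match.group(1)  # e.g. "blinker==1.4"
            continue
        match = USING_CACHED.match(line)
        if match and current:
            filename = match.group(1)
            is_wheel = filename.endswith((".whl", ".whl.metadata"))
            result[current] = "wheel" if is_wheel else "sdist"
            current = None
    return result


sample_log = """\
Collecting blinker==1.4 (from -r requirements.txt (line 25))
Using cached blinker-1.4.tar.gz (111 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting blis==0.7.11 (from -r requirements.txt (line 26))
Using cached blis-0.7.11-cp312-cp312-win_amd64.whl.metadata (7.6 kB)
Collecting blosc2==2.0.0 (from -r requirements.txt (line 27))
Using cached blosc2-2.0.0.tar.gz (3.0 MB)
"""

print(classify(sample_log))
```

Every entry labeled sdist is one for which you should expect the three extra build-related lines in the log.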
Why do some developers distribute a package as source only, without a pre-built wheel?
Developers might choose to distribute a package as a source distribution rather than a pre-built binary (wheel) for several reasons:
1. Platform Independence
- Portability: A source distribution is not tied to one platform; it can be built on any system that has the required compiler and libraries. This lets the package be installed in a wide range of environments, including those for which no pre-built binary is published.
2. Customizability
- Customization: Source distributions allow users to compile the package with custom options or optimizations specific to their system or requirements.
3. Complexity in Binary Distribution
- Complex Dependencies: Some packages have complex dependencies or require linking to system-specific libraries that make it difficult to create a one-size-fits-all binary distribution.
- Build Environment: Setting up a build environment for all supported platforms can be cumbersome. By distributing the source, developers offload this complexity to the user’s local environment.
4. License Restrictions
- Licensing Issues: Some open-source licenses have restrictions that affect the distribution of pre-built binaries. Distributing the source can help developers comply with these licensing terms.
5. Maintenance and Updates
- Ease of Maintenance: Maintaining and updating a source distribution can be easier than maintaining multiple pre-built binaries for different platforms and Python versions.
6. Transparency and Security
- Transparency: Source distributions allow users to inspect the source code before building and installing, which can be important for security-conscious users or organizations.
- Security: Building from source lets users install exactly the code they have reviewed, which is harder to verify with pre-built binaries.
7. Compatibility with Build Tools
- Build Tools Integration: Some development environments and tools are better suited for building from source, for instance continuous integration (CI) systems that automatically build and test the package.
8. Development and Testing
- Testing and Development: Developers might distribute source packages during the development and testing phases before finalizing a stable version that can be distributed as a binary.
Example Scenarios
- Scientific Packages: Scientific and numerical packages often need to be compiled with optimizations specific to the hardware (e.g., CPU or GPU) they will run on.
- System-Level Libraries: Libraries that interact with system-level features or require specific system libraries (e.g., database connectors, networking libraries) might be distributed as source to ensure proper linkage.
Conclusion
While pre-built binaries (wheels) are convenient and quick to install, source distributions offer flexibility, customization, and broader compatibility, making them an essential distribution method for many developers.
GPU Support for Packages
When installing packages that support GPU, ensuring that the GPU is available and utilized in the development environment during model training involves several key steps. These steps ensure that your setup can take full advantage of the GPU for faster computations, especially in deep learning and machine learning tasks. Here’s what you need to consider:
1. Ensure Compatible GPU Drivers are Installed
Before using GPU-enabled packages, make sure that the appropriate GPU drivers are installed. For NVIDIA GPUs, you need the NVIDIA driver that matches your GPU hardware.
- Check Your GPU: Identify your GPU using:
nvidia-smi
This command should list the GPU details if the drivers are correctly installed.
Example Output
C:\Users\hari>nvidia-smi
Mon Aug 26 12:36:45 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85 Driver Version: 555.85 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 38C P8 3W / 50W | 0MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4896 C+G ...crosoft\Edge\Application\msedge.exe N/A |
+-----------------------------------------------------------------------------------------+
- Install the Latest Drivers: Download and install the latest NVIDIA drivers from the NVIDIA website. Make sure to select the correct driver version for your GPU model.
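If you want to run this check from a script, for example at the start of a training job, the following sketch wraps nvidia-smi with its CSV query mode. It assumes nvidia-smi is on the PATH and returns an empty list when it is missing or fails; the helper name query_gpus is illustrative.

```python
import shutil
import subprocess


def query_gpus() -> list:
    """Return one dict per GPU reported by nvidia-smi, or [] if none can be queried."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # driver tools are not installed or not on PATH
    try:
        proc = subprocess.run(
            [exe, "--query-gpu=name,driver_version,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if proc.returncode != 0:
        return []
    gpus = []
    for line in proc.stdout.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 3:
            gpus.append(dict(zip(("name", "driver_version", "memory_total"), fields)))
    return gpus


if __name__ == "__main__":
    found = query_gpus()
    print(found if found else "No NVIDIA GPU detected through nvidia-smi")
```

An empty result here points to a driver or PATH problem, which is worth fixing before looking at CUDA or any Python package.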
2. Install CUDA Toolkit
CUDA (Compute Unified Device Architecture) is a parallel computing platform and API model created by NVIDIA. Many deep learning libraries use CUDA to access GPU capabilities.
- Download CUDA: Install a CUDA toolkit version that your NVIDIA driver supports (the CUDA Version shown in the nvidia-smi header is the highest version the driver supports). This can be done from the CUDA Toolkit Archive.
- Set Environment Variables: Ensure that CUDA paths are set in your environment variables. For example, on Linux:
export PATH=/usr/local/cuda-11.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
To check the CUDA toolkit version:
C:\Users\hari_>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_03:03:05_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
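When a setup script needs the toolkit version as a value rather than as text on screen, the release number can be pulled out of this output. The sketch below assumes the "release X.Y" wording shown above; the helper name cuda_release is illustrative.

```python
import re
from typing import Optional


def cuda_release(nvcc_output: str) -> Optional[str]:
    """Extract the CUDA release (e.g. '12.5') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


sample = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_03:03:05_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
"""

print(cuda_release(sample))  # prints 12.5
```

The extracted value is what you compare against the CUDA version a framework build expects.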
3. Install cuDNN (CUDA Deep Neural Network Library)
cuDNN is a GPU-accelerated library for deep neural networks, providing highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
- Download cuDNN: You can download it from the NVIDIA cuDNN page.
- Install cuDNN: Follow the installation instructions for your operating system. This usually involves copying certain files to the CUDA toolkit directories.
4. Install GPU-Compatible Python Libraries
When using deep learning frameworks like TensorFlow, PyTorch, or others, you need to install the GPU-compatible versions of these libraries. Here’s how to do this for some common libraries:
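- TensorFlow: Install TensorFlow using:
pip install tensorflow
Since TensorFlow 2.1, the standard tensorflow package includes GPU support. The separate tensorflow-gpu package has been discontinued, and installing it without a version pin now fails. On native Windows, GPU support ended after TensorFlow 2.10, so newer versions need WSL2 to use the GPU.
- PyTorch: Visit the PyTorch website and select your preferences (OS, package manager, Python version, CUDA version). For example:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
This command installs PyTorch with CUDA 11.7 support.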
5. Verify GPU Availability in Your Environment
Once all installations are complete, it’s crucial to verify that the GPU is detected and utilized by your deep learning framework:
- TensorFlow: Check if TensorFlow can access the GPU with:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
- PyTorch: Verify GPU availability in PyTorch:
import torch
print("Is CUDA available?", torch.cuda.is_available())
if torch.cuda.is_available():
    # get_device_name raises an error when no CUDA device is present
    print("CUDA Device Name:", torch.cuda.get_device_name(0))
6. Common Troubleshooting Tips
- Version Compatibility: Ensure the versions of your GPU drivers, CUDA, cuDNN, and the deep learning library are compatible. Mismatched versions can lead to errors or inefficient GPU usage.
- Environment Variables: Double-check that CUDA and cuDNN environment variables are correctly set and point to the appropriate directories.
- Monitor GPU Usage: Use nvidia-smi to monitor GPU usage and ensure that your application is utilizing the GPU during model training.
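For the version-compatibility tip, it helps to know which CUDA and cuDNN versions your installed framework was built against. The sketch below collects that from PyTorch and TensorFlow when they are installed and reports None for a framework that is missing. The helper name build_versions is illustrative, and the TensorFlow keys may be absent on CPU-only builds, which is why they are read with get.

```python
import importlib.util


def build_versions() -> dict:
    """Report the CUDA/cuDNN versions each installed framework was built against."""
    report = {"torch": None, "tensorflow": None}

    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = {
            "version": torch.__version__,
            "cuda": torch.version.cuda,  # None on CPU-only builds
            "cudnn": torch.backends.cudnn.version(),  # None when cuDNN is unavailable
        }

    if importlib.util.find_spec("tensorflow") is not None:
        import tensorflow as tf

        info = tf.sysconfig.get_build_info()
        report["tensorflow"] = {
            "version": tf.__version__,
            "cuda": info.get("cuda_version"),
            "cudnn": info.get("cudnn_version"),
        }

    return report


if __name__ == "__main__":
    for framework, details in build_versions().items():
        print(framework, "->", details if details else "not installed")
```

If the CUDA version reported here is newer than the one your driver supports according to nvidia-smi, the framework usually cannot use the GPU, which is a common cause of a silent fallback to CPU.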
Example: Setting Up TensorFlow with GPU
Here’s a step-by-step example for setting up TensorFlow with GPU:
- Install NVIDIA Driver:
  - Download and install the latest driver from the NVIDIA website suitable for your GPU model.
- Install CUDA Toolkit:
  - Download CUDA 11.2 (if using TensorFlow 2.6) and follow the installation instructions.
- Install cuDNN:
  - Download cuDNN for CUDA 11.2, extract the files, and copy them to the CUDA directories (e.g., /usr/local/cuda-11.2/lib64).
- Install TensorFlow:
  - Install TensorFlow with GPU support using pip (the tensorflow package has included GPU support since version 2.1):
pip install tensorflow==2.6.0
- Verify Setup:
  - Run a simple TensorFlow script to check if the GPU is recognized:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
By following these steps, you can ensure that your development environment is correctly set up to leverage the GPU for model training, thereby speeding up computations and improving performance.
Author
Dr Hari Thapliyaal
dasarpai.com
linkedin.com/in/harithapliyal