Distributing AI across cards - Immortality Knowledge Base

Distributing AI across cards

BIOS Settings: Check your motherboard BIOS settings to ensure all PCIe slots are enabled and set to their maximum bandwidth.
OS Support: Use an operating system that supports multiple GPUs, such as a recent version of Linux (Ubuntu, for example).
Install Drivers: Install the latest drivers for your GPUs. For NVIDIA GPUs, download and install the latest drivers from the NVIDIA website.
CUDA Toolkit: Install the CUDA toolkit compatible with your GPU drivers. Follow the installation instructions on the NVIDIA CUDA Toolkit website.
cuDNN: Install the cuDNN library compatible with your CUDA version. Download it from the NVIDIA cuDNN page and follow the installation instructions.
Frameworks: Install a machine learning framework that supports multi-GPU setups. For LLMs, popular choices include TensorFlow and PyTorch.

Use nvidia-smi to monitor GPU usage and ensure all GPUs are being utilized.

nvidia-smi
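To check utilization from a script rather than by eye, nvidia-smi can emit CSV through its `--query-gpu` and `--format` options. The following is a minimal sketch, not a definitive tool: the helper names (`parse_gpu_csv`, `query_gpus`, `idle_gpus`) and the 5% idle threshold are assumptions made for illustration, and it assumes every queried field comes back as a number (some GPUs report `[N/A]` for certain fields).

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "index,name,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "util_percent": int(util),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi and return one dict per GPU, or [] if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def idle_gpus(gpus, threshold_percent=5):
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [g["index"] for g in gpus if g["util_percent"] < threshold_percent]


if __name__ == "__main__":
    gpus = query_gpus()
    for gpu in gpus:
        print(gpu)
    print("Idle GPUs:", idle_gpus(gpus))
```

If `idle_gpus` keeps returning the same indices while a job is running, the workload is probably not reaching those cards.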

Some example commands and code:

python3 -m venv myenv

source myenv/bin/activate

pip install tensorflow

pip install torch
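After installing, it can help to confirm that each framework actually sees all of the cards. The sketch below is one way to do that, assuming either or both frameworks may be installed; the function name `visible_gpus` is an assumption made for illustration, while `torch.cuda.device_count()` and `tf.config.list_physical_devices("GPU")` are the frameworks' own calls.

```python
def visible_gpus():
    """Return the number of GPUs each installed framework can see.

    A value of None means the framework is not installed in this environment.
    """
    counts = {"torch": None, "tensorflow": None}
    try:
        import torch
        counts["torch"] = torch.cuda.device_count()
    except ImportError:
        pass
    try:
        import tensorflow as tf
        counts["tensorflow"] = len(tf.config.list_physical_devices("GPU"))
    except ImportError:
        pass
    return counts


if __name__ == "__main__":
    for framework, count in visible_gpus().items():
        status = "not installed" if count is None else f"{count} GPU(s) visible"
        print(f"{framework}: {status}")
```

A count lower than the number of cards listed by nvidia-smi usually points to a driver, CUDA, or cuDNN version mismatch rather than a hardware problem.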

TensorFlow: tf.distribute.MirroredStrategy


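import tensorflow as tf

# Strategy for multi-GPU inference: mirrors the model onto every visible GPU
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Load the model inside the scope so its variables are mirrored across GPUs
    model = tf.keras.models.load_model('path_to_your_model')

# Run inference; your_input_data is a placeholder for your own inputs
predictions = model.predict(your_input_data)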
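MirroredStrategy divides each batch among the GPUs (replicas), so a batch size that is a multiple of the replica count keeps every card equally loaded. The helpers below are a small sketch of that arithmetic; their names are assumptions made for illustration and are not part of TensorFlow.

```python
def round_up_to_multiple(batch_size, num_replicas):
    """Round batch_size up to the nearest multiple of num_replicas."""
    if batch_size < 1 or num_replicas < 1:
        raise ValueError("batch_size and num_replicas must be positive")
    # Ceiling division, then scale back up to a whole multiple.
    return -(-batch_size // num_replicas) * num_replicas


def per_replica_batch(global_batch_size, num_replicas):
    """Number of samples each replica receives from one global batch."""
    return global_batch_size // num_replicas


if __name__ == "__main__":
    global_batch = round_up_to_multiple(30, 4)
    print(global_batch, per_replica_batch(global_batch, 4))
```

In TensorFlow the replica count is available as strategy.num_replicas_in_sync, so one option is to pass batch_size=round_up_to_multiple(30, strategy.num_replicas_in_sync) to model.predict.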

PyTorch: torch.nn.DataParallel


import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your model and tokenizer
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Move model to GPU and wrap with DataParallel
model = model.to('cuda')
model = torch.nn.DataParallel(model)

# Inference function
def infer(texts):
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    inputs = {k: v.to('cuda') for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs

# Example usage
texts = ["This is a sample text", "Another sample text"]
outputs = infer(texts)
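DataParallel works by splitting each input batch along its first dimension and sending one piece to each GPU. The pure-Python sketch below approximates that split (it follows the chunking rule of torch.chunk, where each piece holds ceil(n / num_gpus) items); `split_batch` is a name made up for illustration and is not a PyTorch function.

```python
import math


def split_batch(items, num_gpus):
    """Split a batch into per-GPU pieces of size ceil(len(items) / num_gpus).

    Like torch.chunk, this can return fewer pieces than num_gpus when the
    batch is small, in which case the remaining GPUs receive no work.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be positive")
    if not items:
        return []
    chunk_size = math.ceil(len(items) / num_gpus)
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


if __name__ == "__main__":
    texts = ["This is a sample text", "Another sample text"]
    for gpu_index, piece in enumerate(split_batch(texts, 4)):
        print(f"GPU {gpu_index}: {piece}")
```

With only two texts and four GPUs, two cards stay idle, so a batch needs at least as many items as there are GPUs before every card shows activity in nvidia-smi.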

Another example, with batching and more detail:


import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your model and tokenizer
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Move model to GPU and wrap with DataParallel
model = model.to('cuda')
model = torch.nn.DataParallel(model)

# Inference function
def infer(texts, batch_size=8):
    model.eval()  # switch to evaluation mode (disables dropout)
    results = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        inputs = tokenizer(batch_texts, return_tensors='pt', padding=True, truncation=True)
        inputs = {k: v.to('cuda') for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = model(**inputs)
            results.append(outputs)
    
    return results

# Example usage
texts = ["This is a sample text", "Another sample text"]
outputs = infer(texts)
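The batched version returns one output object per batch, so the results still need to be flattened into one prediction per text. The sketch below shows one way to do that in plain Python, assuming each batch's logits have first been converted to nested lists (for example with `out.logits.cpu().tolist()` for each `out` in `outputs`); the function names are assumptions made for illustration. Note that bert-base-uncased ships without a trained classification head, so the labels are only meaningful after fine-tuning.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictions_from_batches(batched_logits):
    """Flatten per-batch logits into one (label, probability) pair per text.

    batched_logits is a list of batches; each batch is a list of logit rows,
    one row per input text.
    """
    predictions = []
    for batch in batched_logits:
        for row in batch:
            probs = softmax(row)
            label = max(range(len(probs)), key=probs.__getitem__)
            predictions.append((label, probs[label]))
    return predictions
```

For the example above, a call might look like predictions_from_batches([out.logits.cpu().tolist() for out in outputs]).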

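Ollama: two environment variables control concurrency when the server starts:

OLLAMA_NUM_PARALLEL: Maximum number of requests each loaded model handles in parallel
OLLAMA_MAX_LOADED_MODELS: Maximum number of models that can be loaded at the same time

Start the Ollama server with both set: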


OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
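To see the effect of OLLAMA_NUM_PARALLEL, the server needs several requests in flight at once. The sketch below sends prompts concurrently to Ollama's /api/generate endpoint using only the standard library. It assumes the server listens on the default address http://localhost:11434 and that the model has already been pulled; the helper names and the model name in the usage comment are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Default Ollama address (assumption: the server was not moved to another host or port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt):
    """Request body for a single non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model, prompt, timeout=300):
    """Send one prompt to Ollama and return the generated text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


def generate_parallel(model, prompts, workers=4):
    """Send all prompts concurrently, with at most `workers` requests in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda prompt: generate(model, prompt), prompts))


# Example usage (hypothetical model name; requires a running Ollama server):
# answers = generate_parallel("llama3", ["Why is the sky blue?"] * 4, workers=4)
```

Setting workers to the same value as OLLAMA_NUM_PARALLEL is a reasonable starting point; requests beyond that limit are expected to wait in the server's queue.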

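Send a few requests and observe GPU usage with the nvidia-smi command.

To run Open WebUI with bundled Ollama and access to all GPUs in Docker: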

docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
