Distributing A.I. across cards
# BIOS Settings: Check your motherboard BIOS settings to ensure all PCIe slots are enabled and set to their maximum bandwidth.
# OS Support: Use an operating system that supports multiple GPUs, such as a recent version of Linux (Ubuntu, for example).
# Install Drivers: Install the latest drivers for your GPUs. For NVIDIA GPUs, download and install the latest drivers from the NVIDIA website.
# CUDA Toolkit: Install the CUDA toolkit compatible with your GPU drivers. Follow the installation instructions on the NVIDIA CUDA Toolkit website.
# cuDNN: Install the cuDNN library compatible with your CUDA version. Download it from the NVIDIA cuDNN page and follow the installation instructions.
# Frameworks: Install machine learning frameworks that support multi-GPU setups. For LLMs, popular choices are TensorFlow and PyTorch.

Use nvidia-smi to monitor GPU usage and confirm that all GPUs are being utilized:
{pre}
nvidia-smi
{/pre}

Set up a Python virtual environment and install the frameworks (transformers is needed for the PyTorch examples below):
{pre}
python3 -m venv myenv
source myenv/bin/activate
pip install tensorflow
pip install torch
pip install transformers
{/pre}

TensorFlow: tf.distribute.MirroredStrategy replicates the model on every visible GPU, so load the model inside the strategy scope:
{pre}
import tensorflow as tf

# Strategy for multi-GPU inference
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Load the model inside the scope so its variables are mirrored across GPUs
    model = tf.keras.models.load_model('path_to_your_model')

# Use the model for inference
predictions = model.predict(your_input_data)
{/pre}

PyTorch: torch.nn.DataParallel splits each input batch across the available GPUs:
{pre}
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your model and tokenizer
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Move model to GPU and wrap with DataParallel
model = model.to('cuda')
model = torch.nn.DataParallel(model)

# Inference function
def infer(texts):
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    inputs = {k: v.to('cuda') for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs

# Example usage
texts = ["This is a sample text", "Another sample text"]
outputs = infer(texts)
{/pre}

A more detailed version with batched inference:
{pre}
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your model and tokenizer
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Move model to GPU and wrap with DataParallel
model = model.to('cuda')
model = torch.nn.DataParallel(model)

# Inference function
def infer(texts, batch_size=8):
    model.eval()
    results = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        inputs = tokenizer(batch_texts, return_tensors='pt', padding=True, truncation=True)
        inputs = {k: v.to('cuda') for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model(**inputs)
        results.append(outputs)
    return results

# Example usage
texts = ["This is a sample text", "Another sample text"]
outputs = infer(texts)
{/pre}

Start the Ollama server with:
* OLLAMA_NUM_PARALLEL: handle multiple requests simultaneously for a single model
* OLLAMA_MAX_LOADED_MODELS: load multiple models simultaneously
{pre}
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
{/pre}

Try it and observe GPU usage with the nvidia-smi command (a parallel-request test is sketched below). To run Ollama bundled with the Open WebUI front end in Docker instead:
{pre}
docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
{/pre}
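To see OLLAMA_NUM_PARALLEL in action, the small sketch below fires several requests at the Ollama API at the same time. It assumes the server started above is listening on the default port 11434 and that a model (here "llama3", which is only an example name) has already been pulled; adjust both to your setup.
{pre}
# Sketch: send several prompts to Ollama concurrently to exercise OLLAMA_NUM_PARALLEL.
# Assumptions: "ollama serve" from above is running on the default port 11434
# and a model named "llama3" has already been pulled (example name only).
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(prompt):
    payload = json.dumps({"model": "llama3", "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

prompts = ["Why is the sky blue?", "Explain PCIe lanes.",
           "What is CUDA?", "Summarise cuDNN in one line."]

# Four requests in flight at once, matching OLLAMA_NUM_PARALLEL=4
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80], "...")
{/pre}
While this runs, nvidia-smi should show utilization on each card that Ollama is using.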
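Finally, if TensorFlow or PyTorch does not pick up all of the cards in the examples above, a quick check from Python can confirm what each framework actually sees. This is a minimal sketch assuming the drivers, CUDA toolkit, and both frameworks from the setup steps are installed; the counts should match the number of cards reported by nvidia-smi.
{pre}
# Sanity check: confirm that PyTorch and TensorFlow detect every installed GPU.
import torch
import tensorflow as tf

print("PyTorch sees", torch.cuda.device_count(), "GPU(s)")
for i in range(torch.cuda.device_count()):
    print("  -", torch.cuda.get_device_name(i))

gpus = tf.config.list_physical_devices('GPU')
print("TensorFlow sees", len(gpus), "GPU(s)")
for gpu in gpus:
    print("  -", gpu.name)
{/pre}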