A.I. Architecture- Immortality Knowledge Base

A.I. Architecture

Artificial intelligence computing demands unprecedented levels of performance, pushing the limits of modern hardware. While GPUs currently dominate AI workloads, they are still constrained by the traditional model of discrete cards plugged into a motherboard. A better design may be tighter GPU integration, with processing capabilities built into the system's core architecture. The race in computing continues to focus on two things: maximizing speed, by bringing components physically closer together so that signal travel time (ultimately bounded by the speed of light) shrinks, and harnessing parallelism. Parallelism is achieved by deploying thousands of simplified cores on a single chip, purpose-built for operations like matrix multiplication and hashing.

Technologies like CUDA cores, Tensor Processing Units (TPUs), and stream processors exemplify this trend, optimizing AI tasks through massive parallel execution. One of the most significant architectural shifts is the adoption of Unified Memory Architecture (UMA), advanced by Apple in its Mac Studio systems. UMA allows CPU and GPU to share a single high-bandwidth memory pool, reducing latency and increasing efficiency. In contrast, Nvidia's consumer-grade GPUs still offer relatively limited memory (e.g., 24GB) at high cost, which poses challenges as models grow dramatically in size. UMA presents a promising path forward, particularly for making high-performance AI more accessible to consumers and smaller developers.

GPUs remain the single most important component for running and fine-tuning AI models locally. Consumer-grade cards like the NVIDIA RTX 4090 (24GB GDDR6X, 16,384 CUDA cores) dominate the space, offering exceptional performance in FP16 and INT8 operations. For smaller models and quantized LLMs, even GPUs with 12GB–16GB VRAM like the RTX 4060 Ti or AMD RX 7800 XT can handle inference using formats like GGUF or GPTQ. Multi-GPU setups can use NVLink on cards that support it. PCIe 4.0 and 5.0 are standard in current-gen motherboards and GPUs; PCIe 7.0 is in development but not yet deployed. High-bandwidth interconnects like NVLink and InfiniBand, along with cluster orchestration such as Kubernetes, remain important in enterprise and data center deployments. NVIDIA continues to lead, but AMD’s ROCm ecosystem is maturing and offers competitive inference performance on select models. Chinese GPU makers are cut out of the market, which contributes to high prices and low specifications.
CPU: Plays a support role, primarily handling I/O and orchestration. In 2025, CPUs with 8 to 32 cores (e.g., Ryzen 7000, Intel 14th Gen) are common. For inference, most workloads are GPU-bound, but faster CPUs help with token generation and parallel CPU/GPU workloads.
RAM: System RAM is moderately important. 32GB to 64GB is usually sufficient for most AI tasks. RAM disks do not help.
Storage: Least critical, depending on setup. SATA SSDs are sufficient for most tasks, but fast NVMe SSDs (Gen 4 or 5) greatly reduce model load times, especially for large LLMs.
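
As a rough way to check the VRAM sizes above against model sizes, the sketch below estimates the memory taken by model weights at a given quantization. It is a back-of-envelope approximation, not a measurement: the function names and the 20% overhead margin (standing in for KV cache and activations) are assumptions made for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_fraction: float = 0.2) -> bool:
    """True if weights plus an ASSUMED overhead margin fit in the given VRAM."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gb


if __name__ == "__main__":
    # Nominal parameter counts (in billions) and bit widths, chosen for illustration.
    for params, bits in [(8, 4), (32, 4), (70, 4), (70, 16)]:
        gb = weight_memory_gb(params, bits)
        ok = fits_in_vram(params, bits, vram_gb=24)
        print(f"{params}B at {bits}-bit: about {gb:.0f} GB of weights, fits in 24 GB: {ok}")
```

Under these assumptions a 32B model at 4-bit needs about 16 GB for weights and fits on a 24GB card, while a 70B model at 4-bit (about 35 GB) does not.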

Apple Mac Studio with M3 Ultra: It features up to 512GB of unified memory and over 800GB/s of memory bandwidth. Benchmarks indicate impressive token generation speeds, such as 36.87 tokens/sec for QwQ 32B 4-bit models and 135.22 tokens/sec for Llama 8B 4-bit models.
UMA is also found in mini PCs from brands such as GMKTec, Beelink, Minisforum, Acemagic and Geekom, with up to 128GB of memory, of which 96GB is usable by the GPU. New A.I.-optimized chips from AMD, and other chips with A.I. in their name, appear in many of these builds. An OCuLink port allows an external GPU to be added, and USB4 runs at 40Gb/s. Unlike the Mac Studio, these systems run their RAM at standard DDR speeds, even though faster RAM is essential for A.I.
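
To see why memory bandwidth matters so much, a common rule of thumb treats token generation as bandwidth-bound: each generated token requires reading roughly all of the weights once. The sketch below applies that approximation to the Mac Studio figures quoted above. The rule itself, the function name, and the nominal parameter counts (32 and 8 billion, read off the model names) are assumptions, so the results are rough upper bounds only.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, params_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/sec if every token reads all weights once (assumed model)."""
    model_gb = params_billion * bits_per_weight / 8  # weight size in GB
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    bandwidth = 800.0  # GB/s, the Mac Studio figure quoted in the text
    # (label, nominal billions of parameters, bits per weight, quoted tokens/sec)
    quoted = [("QwQ 32B 4-bit", 32, 4, 36.87), ("Llama 8B 4-bit", 8, 4, 135.22)]
    for label, params, bits, observed in quoted:
        bound = max_tokens_per_sec(bandwidth, params, bits)
        print(f"{label}: bound {bound:.0f} tok/s, quoted {observed} tok/s "
              f"({observed / bound:.0%} of bound)")
```

With these inputs the bounds come out at 50 and 200 tokens/sec, so the quoted benchmark speeds sit below the bound, as expected for a real system.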

Distributed computing software and clustering: MPI (Message Passing Interface); frameworks such as TensorFlow Distributed and Spark; cluster management software such as Slurm or Torque; Petals; Horovod, a distributed training framework for libraries like TensorFlow, Keras, PyTorch and Apache MXNet; and Kubernetes.
Storage redundancy: RAID 10 or RAID 6, managed with mdadm on Linux.
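
The trade-off between the two RAID levels can be sketched with a small capacity calculation. This assumes the classic layouts (RAID 10 as mirrored pairs with an even disk count, RAID 6 with two disks' worth of parity); the function name is hypothetical and the code does not call mdadm.

```python
def raid_usable_tb(level: str, disks: int, disk_tb: float) -> float:
    """Usable capacity in TB for classic RAID 10 or RAID 6 layouts."""
    if disks < 4:
        raise ValueError("this sketch assumes at least 4 disks")
    if level == "raid10":
        if disks % 2:
            raise ValueError("mirrored-pair RAID 10 assumes an even disk count")
        return disks / 2 * disk_tb      # half the disks hold mirror copies
    if level == "raid6":
        return (disks - 2) * disk_tb    # two disks' worth of parity
    raise ValueError(f"unsupported level: {level}")


if __name__ == "__main__":
    for level in ("raid10", "raid6"):
        print(level, raid_usable_tb(level, disks=8, disk_tb=4.0), "TB usable")
```

For eight 4TB disks this gives 16 TB usable under RAID 10 and 24 TB under RAID 6. RAID 6 survives any two disk failures; RAID 10 is only guaranteed to survive one.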

Example Builds

https://www.supermicro.com/products/archive/motherboard/x9drg-otf-cpu

AOM-SXM2 - https://forums.servethehome.com/index.php?threads/sxm2-over-pcie.38066/page-7

https://www.supermicro.com/en/products/motherboard/x11spa-tf

Supermicro X10DRG-Q: 4 x PCIe 3.0 x16, 16 x DIMM 288-pin Up to 2TB ECC 3DS LRDIMM, 2 x CPU Xeon E5-2600 v3 series, $317.46, specs, manual

CPU: Intel® Xeon® E5-2699 v4, 22 cores @ 2.2 GHz ~ $175.23
RAM: 64GB LRDIMM DDR4, (1024GB/16 slots) ~ $99.74
Power Supply: 24-pin, 12V 8-pin, and 12V 4-pin power connectors.

Supermicro X9DRI-LN4F+: 4 x PCIe 3.0 x16, 24 x DIMM 240-pin, Up to 1.5TB DDR3 ECC LRDIMM, 2 x CPU Xeon E5-2600 / E5-2600 v2 series, $200, manual

CPU: Intel® Xeon® E5-2697 v2, 12 cores @ 2.7 GHz ~ $36
RAM: 64GB LRDIMM DDR3 ~ $60 (1536GB/24 slots)

Supermicro C9X299-RPGF-L: 4 x PCIe 3.0 x16, 8 DIMM slots, Up to 256GB Unbuffered non-ECC UDIMM, DDR4-2933MT/s, Intel® 7th Generation Core™ i7 X-series, Intel® 9th Generation Core™ i7 X-series, Intel® 9th Generation Core™ i9 X-series, $180, specs

CPU: Intel® Core™ i9-10980XE, 18 cores @ 3.0 GHz ~ $700

Asus Z9PE-D16/2L: 4 x PCIe x16, 16 DIMM slots, Up to 512GB, 2 x CPU Xeon E5-2600 / E5-2600 v2 series, $325, specs

CPU: Intel® Xeon® E5-2697 v2, 12 cores @ 2.7 GHz ~ $36

Supermicro X13DEI: 4 PCIe 5.0 x16, 16 DIMM slots Up to 4TB 3DS ECC RDIMM, 5th Gen Intel® Xeon® / 4th Gen Intel® Xeon® Scalable processors, $1575, specs
Supermicro X9DRG-QF - https://www.supermicro.com/manuals/motherboard/C606_602/MNL-1309.pdf
Asrock X99 WS-E/10G - https://www.asrock.com/mb/Intel/X99%20WS-E10G/ 7 PCIe with 4 @ full speed - https://download.asrock.com/Manual/X99%20WS-E10G.pdf
ASUS Pro WS WRX80E-SAGE SE WIFI II 2TB ram 8 16x PCIe, $1700, specs
Asrock WRX90 WS EVO - 2TB ram 7 x 16x PCIe + 1 8x, $1750, specs
Gigabyte MZ73-LM0 - DDR5 x 16, PCIe v4 $2550, specs

RAM: RDIMM, LRDIMM or UDIMM, 16x 128GB 3DS LRDIMM modules, total of 2TB RAM. Modules operate at 2400MHz
Storage: Any
PSU: As large as practical. Check that it has enough power and enough compatible connectors for multiple GPUs, e.g. Corsair AX1600i (1600W, sufficient for multiple high-power GPUs).
Case: EATX, or simply screw the board down to a frame and earth it.
Cooling: Custom liquid cooling loop to manage the heat output of multiple GPUs and maintain optimal temperatures.
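
Sizing the PSU can be approached as a simple power budget. The sketch below is one way to do it. The 250W per-GPU figure, the 145W per-CPU figure, the 150W allowance for the rest of the system and the 25% headroom are all assumed values for illustration, so check the real ratings of the parts being used.

```python
def recommended_psu_watts(gpu_watts: list[float], cpu_watts: list[float],
                          other_watts: float = 150.0, headroom: float = 0.25) -> float:
    """Sum the component power draw and add a safety margin (all inputs are estimates)."""
    load = sum(gpu_watts) + sum(cpu_watts) + other_watts
    return load * (1 + headroom)


if __name__ == "__main__":
    # Hypothetical dual-CPU build with four GPUs at an assumed 250 W each.
    need = recommended_psu_watts(gpu_watts=[250] * 4, cpu_watts=[145, 145])
    print(f"Recommended PSU capacity: {need:.0f} W")
    print("A single 1600 W unit is enough:", need <= 1600)
```

With these assumed numbers the budget comes to 1800 W, more than a single 1600W unit provides, which is why the connector count and total wattage both need checking before adding a fourth card.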

Riser from Supermicro: https://www.supermicro.com/en/support/resources/riser. Take care: the included Molex-to-ATX cable could fry the riser, and EVGA cables use a different power pin arrangement. Where space is tight, a flexible PCIe x16 ribbon riser extension cable can be used.

Multi-build clustering: Hook the machines up in a conventional network and then use a distributed computing framework: install the chosen framework on each computer and configure it to recognize the other machines as part of the cluster. Allocate one machine as a NAS, using the motherboard with the most onboard SATA ports and room for PCIe SATA expansion cards. The other computers are chosen for CPU cores, GPU cores and maximum VRAM.
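
What a distributed training framework does across the cluster can be illustrated on one machine. The sketch below is a toy data-parallel step: each worker computes a gradient on its own shard of the data, and the gradients are averaged before the shared weight is updated. Threads stand in for separate machines, the model is a single weight, and every name here is hypothetical rather than taken from any of the frameworks mentioned.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_gradient(w: float, shard: list[tuple[float, float]]) -> float:
    """Gradient of mean squared error for the model y = w * x over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w: float, data: list[tuple[float, float]],
                       workers: int, lr: float) -> float:
    """One training step: per-worker gradients, then an averaged update."""
    size = len(data) // workers  # assumes len(data) is divisible by workers
    shards = [data[i * size:(i + 1) * size] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        grads = list(pool.map(lambda s: shard_gradient(w, s), shards))
    avg_grad = sum(grads) / workers  # plays the role of an all-reduce mean
    return w - lr * avg_grad


if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 9)]  # true weight is 3
    w = 0.0
    for _ in range(50):
        w = data_parallel_step(w, data, workers=4, lr=0.01)
    print(f"learned weight: {w:.4f}")
```

Because the shards are equal in size, averaging the per-worker gradients gives the same update as one machine processing the whole batch, which is the property data-parallel training relies on.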

When selecting graphics cards for running large language models (LLMs) locally, using multiple GPUs, there are several important features and specifications to consider:

High VRAM: Aim for graphics cards with as much VRAM as possible. Since you're looking to run LLMs, more VRAM allows you to handle larger models and batch sizes.
CUDA Cores / Tensor Cores: More CUDA cores generally mean better parallel processing capabilities. Tensor cores (found in NVIDIA’s RTX and Tesla series) are specifically designed for deep learning tasks and can significantly speed up model training and inference.
NVLink Support: NVLink allows for high-bandwidth communication between GPUs, enabling efficient multi-GPU setups. This is crucial for model parallelism and reducing inter-GPU communication overhead.
Multi-GPU Scalability: Ensure the graphics card and your system support multi-GPU configurations (e.g., via SLI, NVLink, or PCIe slots).
FP16 / Mixed Precision Support: Cards that support FP16 or mixed precision calculations can provide significant performance boosts for deep learning tasks by using less memory and speeding up computations.
Cooling System: Efficient cooling is essential to maintain performance and prevent thermal throttling, especially in multi-GPU setups.
Driver and Software Support: Ensure the card is compatible with the deep learning frameworks you plan to use (e.g., PyTorch, TensorFlow) and that it has robust driver support.
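
The model parallelism mentioned under NVLink Support can be pictured as splitting a model's layers across cards in proportion to their VRAM. The helper below is a hypothetical sketch of that idea, assuming whole-number VRAM sizes and equally sized layers. Real frameworks use their own placement logic.

```python
def split_layers(n_layers: int, vram_gb: list[int]) -> list[range]:
    """Assign contiguous layer ranges to GPUs in proportion to VRAM (integer GB)."""
    total = sum(vram_gb)
    counts = [n_layers * v // total for v in vram_gb]  # floor of each GPU's share
    leftover = n_layers - sum(counts)
    # Give any leftover layers to the GPUs with the most VRAM first.
    biggest_first = sorted(range(len(vram_gb)), key=lambda i: vram_gb[i], reverse=True)
    for i in biggest_first[:leftover]:
        counts[i] += 1
    ranges, start = [], 0
    for count in counts:
        ranges.append(range(start, start + count))
        start += count
    return ranges


if __name__ == "__main__":
    # Hypothetical mix: two 24GB cards and one 16GB card, 32-layer model.
    for gpu, layers in enumerate(split_layers(32, [24, 24, 16])):
        print(f"GPU {gpu}: layers {layers.start} to {layers.stop - 1}")
```

Each boundary between ranges is a point where activations cross from one GPU to the next, which is where inter-GPU bandwidth starts to matter.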

Recommended GPU Models

NVIDIA RTX 30 Series (e.g., RTX 3090, RTX 3080):

High VRAM (e.g., 24GB on RTX 3090)
Tensor cores for deep learning
NVLink support (for RTX 3090)

NVIDIA A100:

Up to 80GB VRAM (in the PCIe version)
Advanced tensor cores
NVLink support
Designed specifically for AI workloads

NVIDIA Tesla V100:

Up to 32GB VRAM
Tensor cores
NVLink support

NVIDIA Quadro RTX 8000:

48GB VRAM
Tensor cores
NVLink support

In non-SXM2 form factors, specifically the PCIe form factor for NVIDIA GPUs, the card has a typical PCIe edge connector that slides into the PCIe slot on the motherboard (connector is the edge of the pcb that slides into the motherboard slot). For PCIe form factor GPUs to support NVLink, there is an additional edge connector located towards the top of the card. This additional connector is used to attach an NVLink or SLI bridge, which allows for high-speed communication between multiple GPUs.

Note: some of these models may have different memory configurations depending on the specific model or revision. Fast FP16 performance is not available on older cards like the K-series, which predate Tensor Cores. The Tesla T4 has Tensor Cores, which are designed to accelerate deep learning operations.

Tesla P40: No NVLink, PCIe 3.0, 24GB GDDR5, 3,840 CUDA cores, 346 GB/s, FP64 (double) 367.4 GFLOPS (1:32) , FP32 - 12 TFLOPS, FP16 - 183.7 GFLOPS (1:64) ~ $297.89, P40 datasheet
Tesla P100: No NVLink, PCIe 3.0, 3584 CUDA cores, 16GB HBM2, FP16 (half) 18.7 TFLOPS (2:1), FP32 (float) 9.3 TFLOPS, FP64 (double) 4.7 TFLOPS (1:2) ~ $198.80, P100 datasheet
Geforce RTX 3060 - 3090 - 4070 - 12GB
Geforce RTX 4060Ti 16gb GDDR6, Tensor Cores 136, FP16 - 22.06 TFLOPS, FP32 - 22.06 TFLOPS, FP64 - 344.8 GFLOPS ~ $409.99, 4060Ti specs
Quadro P6000 (2015): 24GB, 3,840 CUDA cores, 384 GB/s memory bandwidth (SLI, no NVLink), FP32: 11.52 TFLOPS, FP16: 23.04TFLOPS (about the performance of a P40) ~ $600
GeForce RTX 3090: ~ 24GB ~ $679.99
RTX Titan 24GB, Has NVLink ~ $649.99, Titan datasheet
Tesla T4: No NVLink, PCIe 3.0, 16GB GDDR6, CUDA Cores: 2,560, 320 GB/s, Tensor Cores: 320 (INT8 and FP16), FP32: 8.1 TFLOPS, FP16: 16.2 TFLOPS, INT8: 130 TFLOPS, Power Consumption: 70W ~ $1000
Quadro GP100 (2016) - 16 GB HBM2 (with NVLink) ~ $1300
NVIDIA Tesla V100 32GB
Quadro RTX A4500: 20gb ~ $1500
RTX 8000 48GB ~ $4000
Tesla H100 80GB ~ $8000
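
Since VRAM is the main constraint, one quick way to compare the cards above is price per GB of VRAM. The sketch below uses only the approximate prices and memory sizes quoted in this list. It ignores speed, power draw and NVLink support, so it is a starting point rather than a ranking of overall value.

```python
# (card, VRAM in GB, approximate price in USD) as quoted in the list above
CARDS = [
    ("Tesla P40", 24, 297.89),
    ("Tesla P100", 16, 198.80),
    ("RTX 4060 Ti", 16, 409.99),
    ("RTX 3090", 24, 679.99),
    ("RTX Titan", 24, 649.99),
    ("Tesla T4", 16, 1000.00),
    ("RTX 8000", 48, 4000.00),
    ("H100", 80, 8000.00),
]


def rank_by_price_per_gb(cards):
    """Return (name, dollars per GB) pairs, cheapest per GB first."""
    ranked = [(name, price / vram) for name, vram, price in cards]
    return sorted(ranked, key=lambda item: item[1])


if __name__ == "__main__":
    for name, per_gb in rank_by_price_per_gb(CARDS):
        print(f"{name:12s} ${per_gb:7.2f} per GB")
```

On these figures the older Tesla P40 and P100 cost the least per GB, and the H100 costs the most.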

AMD Radeon Instinct MI50, 16GB or 32GB, https://www.amd.com/system/files/documents/radeon-instinct-mi50-datasheet.pdf
AMD Radeon Instinct MI60, 32GB HBM2 Graphics, https://www.amd.com/system/files/documents/radeon-instinct-mi60-datasheet.pdf
AMD Radeon Pro V340 32GB, https://www.techpowerup.com/gpu-specs/radeon-pro-v340-16-gb.c3267

AMD's equivalent of CUDA cores is called "stream processors".

Configuration Tips

BIOS Settings: Ensure the BIOS is configured to support multi-GPU setups.
Driver Installation: Install the latest NVIDIA drivers that support multi-GPU configurations.
Framework Configuration: In your deep learning framework, configure the settings to utilize multiple GPUs (e.g., using torch.nn.DataParallel or torch.distributed in PyTorch).

Summary

By focusing on high VRAM, CUDA/Tensor cores, NVLink support, and efficient cooling, you can build a powerful multi-GPU setup capable of running large language models locally. Using high-end GPUs like the NVIDIA RTX 3090 or the A100 will provide the performance needed for demanding AI tasks.

Software

Software has become secondary to hardware, and software for A.I. would probably require grid computing in exchange for unrestricted model access. Each node would have to satisfy minimum requirements to be accepted into the grid. While the models are accessible to the grid, the secret sauce stays with the author. The grid acts as a workshop, holding the petabytes of training data, and as an A.I. training supercomputer. The result is placed in the distributed leaderboard folder, where all the trained models go; all the models are restricted to the OS, and all are graded. A general user would go to the leaderboard folder and run the latest models. The incentive is to beat the best model. In the modern day, it is all about creating the white paper and presenting it to key people for support and funding. In the past, anyone could release their work and gain public support organically.

O.I. Architecture - organoid on chip support

Issues:

not big enough; intelligence is relative to the quantity of neurons in a network.
questions over unnatural systems; natural habitats, where movement, re-organization, synapse formation and vascularization occur and are supported, need to be understood.
questions over lifespan: understanding their lifespan in the body versus their lifespan as bioware.

Building the network...

modularization and module versioning.
interconnects between organoids that mimic synapses, attraction, affinity.
communication, I/O cards, hardware and software interface. Understanding and responding.
a happiness rating.
distributed networking over distance (for fun).

nb: Organoids are real lifeforms.

Automating organoid maintenance

A pump, a reservoir and an input/output attachment on the container housing the organoid. The pump (heart) moves media from a reservoir into the input of the organoid housing, and at the output the media goes back to the pump, so the media is circulating. Spent media gets moved to a storage container where it is measured, filtered and conditioned and then re-introduced. So 4 main objects are required: a slow pump; pipes and fittings from the pump to the organoid housing; a reservoir holding new media; and a container holding old media. The old media container carries additions such as media grading detection, filtration and media conditioning to recycle media. Like a filter in a fish tank, this slow-moving pump keeps the water clean and oxygenated, removing impurities.

Peristaltic pump: Also commonly known as a roller pump, this is a type of positive displacement pump used for pumping a variety of fluids. The fluid is contained in a flexible tube fitted inside a circular pump casing. Most peristaltic pumps work through rotary motion, though linear peristaltic pumps have also been made.
Culture media filtration: The media is sterilized as it passes through the filter (immune system), its viability is measured, and it is supplemented and cleaned of impurities.
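
The circulation described above can be sketched as a simple volume balance between the reservoir, the organoid housing and the spent-media container. This is a toy model under stated assumptions: the housing stays full, so whatever is pumped in displaces an equal volume out. The class name, flow rate and recovery fraction are hypothetical values, not measured ones.

```python
from dataclasses import dataclass


@dataclass
class MediaLoop:
    reservoir_ml: float      # fresh media waiting to be pumped
    housing_ml: float        # constant working volume around the organoid
    spent_ml: float = 0.0    # old media awaiting filtration and conditioning

    def pump(self, flow_ml_per_min: float, minutes: float) -> float:
        """Move fresh media through the housing; displaced media becomes spent."""
        moved = min(flow_ml_per_min * minutes, self.reservoir_ml)
        self.reservoir_ml -= moved
        self.spent_ml += moved
        return moved

    def recycle(self, recovery_fraction: float) -> float:
        """Return the recoverable share of spent media to the reservoir."""
        recovered = self.spent_ml * recovery_fraction
        self.spent_ml -= recovered
        self.reservoir_ml += recovered
        return recovered


if __name__ == "__main__":
    loop = MediaLoop(reservoir_ml=100.0, housing_ml=5.0)
    loop.pump(flow_ml_per_min=0.5, minutes=60)   # slow pump running for one hour
    loop.recycle(recovery_fraction=0.8)          # assumed 80% recovery after filtering
    print(loop)
```

Total volume is conserved across both steps, so any real loss to filtration or evaporation would need to be modeled separately.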

Ideally, we want to award these functions their organ names, use and develop human compatible artificial organs and machines to offshoot into the medical device industry. For example, the dialysis machine could be supplied to hospitals and work in that setting. Every time we do something, we must think of its application in general medicine and move towards that direction, even if it poses extra challenges. For example, anastomosis methods and materials and degree of identical behavior. Bioreactors are commonly used for cell culture applications.

Neuroplatform model: a recent advancement offers an online interface, similar to task scheduling or a print pool, where jobs are submitted for the organoids and results are returned for processing. Other similar, bigger platforms might be integrated in future.
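
The print-pool analogy can be made concrete with a first-in, first-out job queue. The sketch below is purely illustrative and is not the Neuroplatform interface: the Job class, the run_jobs function and the handler (which stands in for whatever the organoid hardware returns) are all hypothetical.

```python
import queue
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    stimulus: list  # placeholder for whatever input pattern a job carries


def run_jobs(jobs, handler):
    """Process jobs in submission order and collect results keyed by job id."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)
    results = {}
    while not pending.empty():
        job = pending.get()
        results[job.job_id] = handler(job.stimulus)  # stand-in for the real response
    return results


if __name__ == "__main__":
    submitted = [Job(1, [1, 2, 3]), Job(2, [4, 5])]
    print(run_jobs(submitted, handler=sum))
```

Here the built-in sum is used as a dummy handler so the example runs on its own.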

Open and remotely accessible Neuroplatform for research in wetware computing

What an A.I. Operating System (OS) might look like

GPU-to-GPU communication, clustering, and distributed networking for training and testing. The amount of data and processing required to train models and tinker about with A.I. could utilize grid computing. Minimum requirements are mandatory to join the grid, and trained models are the reward. The grid would hold the petabytes of training data and CPU cycles for distributed training. The models are tied to the OS and cannot be moved out. The grid maintainers would keep models at or above current capability. The use of the models to generate video, images and so on would be unrestricted. In short, model access is traded for resource contribution.
Installed with all the applications/software in the category and to leverage A.I. and O.I.
Master simulation environment for A.I. training.
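
The minimum-requirements rule for joining the grid could be expressed as a simple admission check. The sketch below is one hypothetical form of it: the NodeSpec fields and the threshold values are invented for illustration and are not proposed figures.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class NodeSpec:
    vram_gb: int
    ram_gb: int
    disk_tb: float


# Hypothetical thresholds a node must meet to be accepted into the grid.
MINIMUM = NodeSpec(vram_gb=24, ram_gb=64, disk_tb=2.0)


def shortfalls(node: NodeSpec, minimum: NodeSpec = MINIMUM) -> list[str]:
    """Names of the resources where the node falls below the minimum."""
    return [f.name for f in fields(node)
            if getattr(node, f.name) < getattr(minimum, f.name)]


def admitted(node: NodeSpec, minimum: NodeSpec = MINIMUM) -> bool:
    """A node joins the grid only if it has no shortfalls."""
    return not shortfalls(node, minimum)


if __name__ == "__main__":
    candidate = NodeSpec(vram_gb=16, ram_gb=128, disk_tb=4.0)
    print("admitted:", admitted(candidate), "shortfalls:", shortfalls(candidate))
```

Reporting the specific shortfalls, rather than a bare yes or no, tells a would-be member what to upgrade.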

Store training data - distributed file systems, to grid and store the petabytes of training data.
Train A.I. - utilize the many grid computing operations already in existence and add a system-level one as well.
Other essential edu, lab and research software.
Custom Linux from scratch

Neuron Communication: Neurons communicate with each other through electrical and chemical signals. The "language" they use is a complex interplay of these signals, which can be described in terms of neurotransmitters, receptors, and action potentials.
Action Potentials: Neurons generate electrical signals called action potentials, which are rapid, temporary changes in the electrical potential difference across the neuron's membrane. These signals travel down the neuron's axon to its terminals.
Neurotransmitters: When an action potential reaches the end of a neuron's axon, it triggers the release of neurotransmitters, which are chemical messengers stored in tiny sacs called vesicles. There are many types of neurotransmitters, such as glutamate, GABA, acetylcholine, serotonin, and dopamine, each with different functions.
Receptors: Neurotransmitters bind to specific receptor proteins on the membrane of the target neuron, causing changes in the electrical potential of that neuron. This can either excite the neuron, making it more likely to fire an action potential, or inhibit it, making it less likely to fire.
Synaptic Transmission: The process of neurotransmitter release, binding to receptors, and the resulting changes in the target neuron is called synaptic transmission. This is the fundamental way neurons communicate with each other. This system of electrical and chemical signals allows neurons to transmit, process, and store information, forming the basis for all brain functions.
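
The excite-or-inhibit behavior described above is often approximated in software with a leaky integrate-and-fire neuron, a standard simplified model that is brought in here for illustration and is far cruder than real synaptic transmission. Positive inputs stand for excitatory signals, negative inputs for inhibitory ones. The threshold, leak and reset values are arbitrary assumptions.

```python
def simulate_lif(inputs, threshold=1.0, leak=0.9, reset=0.0):
    """Leaky integrate-and-fire: return a list with 1 for each step that fires, else 0."""
    potential = reset
    spikes = []
    for current in inputs:
        potential = potential * leak + current  # decay, then integrate the input
        if potential >= threshold:
            spikes.append(1)                    # fire an "action potential"
            potential = reset                   # membrane potential resets
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    excitatory_only = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
    with_inhibition = [0.5, 0.5, 0.5, -0.5, 0.5, 0.5]
    print(simulate_lif(excitatory_only))   # [0, 0, 1, 0, 0, 1]
    print(simulate_lif(with_inhibition))   # [0, 0, 1, 0, 0, 0]
```

A single inhibitory input is enough to suppress the second spike, which mirrors the point that inhibition makes a neuron less likely to fire.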

Neuromorphics

A neuromorphic processing unit, or NPU, is designed to replicate features of the human brain. Examples include Intel's Loihi 2 processor and IBM's TrueNorth neuromorphic chip. Designs may be analog or digital.
