Senior Infrastructure Engineer (GPU Cloud)

Dublin, Ireland · Hybrid · Full-time · Senior
Posted 10 May 2026

Overview

TensorX is a sovereign AI infrastructure platform headquartered in Dublin. We deploy and operate open-source large language models on EU-sovereign infrastructure across Europe, providing private, zero-retention inference for regulated industries including finance, healthcare and government. Our platform offers drop-in OpenAI-compatible APIs, enabling developers and enterprises alike to adopt AI without compromising on data privacy, compliance or performance.


We are looking for a Senior Infrastructure Engineer (GPU Cloud) to join our growing engineering team. Reporting to the CTO, you will own the physical and virtualisation layer that underpins our GPU fleet, from bare-metal server deployment in the rack through to provisioned, GPU-attached virtual machines and operating systems. You will design, build and operate the compute, storage and network foundation that our model-serving layer runs on.


We currently operate Dell PowerEdge XE9780 servers with NVIDIA B300 SXM6 GPUs across our European estate, with an aggressive near-term growth plan spanning multiple sites across the EU. You will be the technical owner of our hardware strategy, cluster architecture and infrastructure automation as we build out a sovereign GPU cloud platform.


This is a deeply hands-on role. You will oversee server deployments, configure firmware, design network fabrics, architect storage and build the virtualisation layer that turns racks of hardware into GPU-attached VMs ready for our model-serving team. We are an AI-native team and tools such as Claude Code and Codex are part of our daily workflow, materially accelerating how we build and operate systems. We value engineers who combine deep systems knowledge with a pragmatic, builder mindset.


This is a high-impact senior individual contributor role spanning bare-metal infrastructure, GPU virtualisation, storage and network architecture, and multi-site estate planning. The model-serving and inference layer that runs on top of this foundation is owned by our ML / platform team, not this role.

Responsibilities

  • Bare-Metal & Server Management - Lead the deployment and maintenance of GPU server fleets including BIOS/UEFI configuration, iDRAC/BMC management, firmware and driver updates, kernel parameter tuning and the GPU driver stack for NVIDIA B-series GPUs.

  • Hypervisor & Virtualisation - Design and operate our virtualisation layer. We are not committed to a single stack, so we want someone fluent across options such as Proxmox VE, OpenNebula and KVM/QEMU directly. Implement GPU virtualisation, including vGPU, MIG partitioning and VFIO-PCI passthrough, along with NVSwitch fabric management through host-level services and multi-tenant GPU allocation.

  • Network Architecture - Design and maintain the network fabric across multiple racks and sites, spanning management VLANs, storage networks, tenant data planes and GPU interconnect. Work with bonded NICs, jumbo frames, InfiniBand and ConnectX adapters. Plan the network topology for multi-site EU deployments.

  • Storage Architecture - Design, deploy and operate shared storage infrastructure across multiple racks and sites, including SAN (Dell ME-series, iSCSI multipath), NFS and local NVMe. Optimise for large model weights (hundreds of GB per model), high-throughput sequential reads and cross-site replication. Own SAN performance tuning, capacity planning and data placement strategy as the estate grows.

  • Multi-Tenant GPU Provisioning - Build the infrastructure layer for multi-tenant GPU delivery up to the VM/OS handoff, including tenant isolation, GPU allocation (vGPU/MIG/passthrough), resource scheduling and capacity planning. Design the foundation so GPU-attached VMs can be provisioned cleanly and repeatably for the serving layer to consume.

  • Cluster Orchestration & Automation - Automate server provisioning, OS deployment, driver installation and cluster configuration. Build infrastructure-as-code for repeatable, auditable deployments across multiple sites.

  • Monitoring & Reliability - Instrument the infrastructure stack with monitoring covering GPU health, NVSwitch fabric status, storage throughput, network utilisation and hardware telemetry (DCGM, iDRAC, IPMI). Own incident response for hardware and infrastructure faults.

  • Hardware Strategy & Estate Planning - Work with the CTO to plan GPU procurement cycles, evaluate server platforms, specify network and storage hardware and manage vendor relationships. Design the infrastructure blueprint for new EU datacentre deployments, defining standard rack layouts, power and cooling requirements, network topology and storage architecture that can be replicated across sites with minimal variance.

  • Security & Compliance - Ensure infrastructure meets the requirements of regulated industries including data residency, tenant isolation, encryption at rest and in transit and audit logging. Support EU sovereignty requirements across our deployment sites.

Skills & Experience

  • 5+ years of professional experience in infrastructure engineering, systems administration or datacentre operations, with meaningful exposure to GPU, HPC or large-scale compute infrastructure

  • Hands-on experience with bare-metal Linux server deployment and management, including kernel tuning, driver management, PCI device configuration and UEFI/BIOS configuration

  • Strong virtualisation experience across one or more of Proxmox VE, OpenNebula or KVM/QEMU, including PCI passthrough for GPU workloads. Not tied to a single hypervisor

  • Working knowledge of NVIDIA GPU server platforms, including driver installation, NVLink/NVSwitch fabric, Fabric Manager, DCGM and GPU virtualisation (vGPU, MIG, VFIO passthrough)

  • Hands-on experience with enterprise storage including SAN (iSCSI, FC), multipath I/O, NFS and performance tuning for large sequential workloads (required)

  • Solid understanding of network design including VLANs, bonding/LACP, jumbo frames, InfiniBand and routing in multi-rack environments

  • Proficiency with Linux (Ubuntu Server and/or Debian), systemd, the Linux networking stack (ip, nmcli, netplan) and shell scripting

  • Experience with infrastructure-as-code and automation tooling (Ansible, Terraform or similar)

  • Comfortable using AI-assisted development tools (e.g. Claude Code, Codex) as part of your daily workflow

  • Methodical approach to troubleshooting across firmware, kernel, driver and userspace layers

Nice to Have

  • Experience building or operating GPU cloud / GPU-as-a-Service infrastructure

  • Familiarity with Dell PowerEdge server management (iDRAC, Redfish API, racadm, Dell SupportAssist)

  • Experience with NVIDIA ConnectX network adapters, the OFED/MOFED stack and InfiniBand fabric management

  • Experience with MAAS, Ironic or other bare-metal provisioning systems

  • Kubernetes cluster operations (the model-serving team runs the inference layer, but awareness of how it consumes the foundation helps)

  • Familiarity with European data sovereignty and compliance frameworks (GDPR, DORA, NIS2)

  • Contributions to open-source infrastructure projects

Education & Qualifications

  • BSc/MSc in Computer Science, Software Engineering, Electrical Engineering, Network Engineering or a related technical discipline OR equivalent practical experience

Remuneration

  • Highly competitive package, dependent on experience

  • 25 days paid annual leave

  • Hybrid working from our centrally located Dublin office, with remote flexibility

  • Occasional travel to our EU datacentre sites as the estate grows

  • Free inference tokens!


******* NO AGENCY ASSISTANCE REQUIRED *******

Ready to apply?

Applications are handled securely through our recruitment platform.


TensorX is an equal-opportunity employer. All inference runs on our own EU-sovereign infrastructure.