Job Description
About TensorWave

Our mission is simple: deliver seamless, secure, reliable, and resilient AI compute at scale. We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack. Because breakthrough AI should move at the speed of ideas, not infrastructure.

About the Role

We are building and operating large-scale infrastructure platforms to support high-performance AI and machine learning workloads across multiple data centers. Our environment includes GPU-intensive systems, high-throughput networking, and distributed storage platforms designed for performance, scale, and resilience.

We are looking for a Storage Operations Engineer to own the day-to-day operation, performance, and reliability of our storage platforms. This role is responsible for ensuring that storage systems remain stable, performant, and aligned with the demands of Kubernetes, AI/ML, and high-performance compute workloads.

This is not a traditional SAN/NAS administration role. You will work with modern distributed storage systems and be expected to troubleshoot, optimize, and scale them in production environments.

What You’ll Do

Platform Operations & Ownership
- Operate and maintain distributed storage platforms, including Ceph (RBD, CephFS, RGW) and high-performance NAS platforms (e.g., Weka, VAST Data)
- Manage storage lifecycle operations: cluster expansion, upgrades, and migrations
- Monitor and maintain storage health, including capacity utilization, data distribution and balance, cluster state, and recovery operations

Performance & Reliability
- Analyze and troubleshoot storage performance across IOPS, throughput, and latency (including tail latency)
- Identify and remediate bottlenecks across disk subsystems, network paths (including RDMA where applicable), and client access patterns
- Support incident response and root cause analysis for storage-related issues
- Ensure storage platforms meet performance expectations for GPU and Kubernetes workloads

Kubernetes Storage
- Operate and support Kubernetes-integrated storage: CSI drivers, StorageClasses, PersistentVolumes / PersistentVolumeClaims
- Troubleshoot storage-related issues in Kubernetes environments, including stateful workloads, performance inconsistencies, and scheduling and provisioning failures

Automation & Tooling
- Execute and improve automation for storage deployment and operations using Ansible, Terraform, and Kubernetes manifests / Helm
- Contribute to improving monitoring and alerting, operational workflows, runbooks, and documentation

Cross-Team Collaboration
- Partner with DevOps and Platform Engineering (automation and orchestration), Network Engineering (high-throughput and RDMA networking), and Compute / Virtualization teams
- Help ensure end-to-end performance across compute, network, and storage layers

Who You Are

Required Qualifications
- 4–7+ years of experience in infrastructure, systems, or storage operations
- Strong hands-on experience operating distributed storage systems in production
- Experience with Ceph (RBD, CephFS, or RGW)
- Experience with modern storage platforms such as Weka, VAST Data, or similar high-performance systems
- Strong Linux systems knowledge
- Solid understanding of:
  - Storage performance characteristics (IOPS, throughput, latency)
  - Data replication and failure domains
- Ability to troubleshoot across:
  - Storage systems
  - Network paths
  - Compute clients

Preferred Qualifications
- Experience supporting AI/ML or HPC workloads
- Familiarity with:
  - NVMe-based storage architectures
  - RDMA or high-throughput Ethernet environments
- Experience integrating storage with Kubernetes
- Experience operating storage across multiple data centers
- Exposure to object storage and S3-compatible APIs

What We Offer
- Stock Options
- 100% paid Medical, Dental, and Vision insurance for Employees
- Company Health Savings Account Contributions
- 100% paid Short Term and Long Term Disability Insurance for Employees
- Life and Voluntary Supplemental Insurance Options
- Other Insurance Options, such as Pet & Legal Insurance
- Various Supplementary Health Benefits,