
AI Cloud Platform Engineer (Remote)

NJS Recruiting & Enablement Inc.
Full-time · Remote · US · Posted: 2026-05-11 · Until: 2026-07-10
Job Description
Hiring an AI Cloud Platform Engineer to design, secure, and operate the cloud foundation for mission-scale AI/LLM workloads. You will own the cloud landing zones, networking, identity, security, and compliance that enable reliable, cost-effective LLM serving and data services. This position requires U.S. citizenship.

Responsibilities

Responsibilities include, but are not limited to:
· Design and operate secure, multi-account/tenant cloud landing zones for AI workloads (Azure primary, AWS secondary), including network segmentation, private connectivity, and egress controls.
· Provision and manage GPU-optimized compute and storage for training and inference (AKS/EKS GPU node pools, VM scale sets, Blob/S3 object storage).
· Implement Infrastructure as Code with Terraform for all environments; enforce policy-as-code and guardrails.
· Establish identity, secrets, and access controls (Entra ID/AWS IAM, RBAC, Key Vault/KMS, HSM/PKI).
· Build observability and SRE practices (metrics, logs, tracing, alerting, runbooks, incident response, SLIs/SLOs).
· Harden environments for compliance (e.g., FedRAMP/FISMA-style controls), including vulnerability management and continuous compliance monitoring.
· Enable CI/CD for infrastructure and configuration (GitOps, automated drift detection, change management).
· Partner with AI Kubernetes Engineers to support reliable LLM deployment strategies (blue/green, canary, rollout/rollback) and capacity planning.
· Optimize cloud spend for GPU and storage workloads; forecast capacity and implement reservations/savings plans.
· Document architectures, standards, and operational procedures; contribute to knowledge sharing and training.

Minimum Requirements

· Must be a U.S. citizen.
· Ability to obtain and maintain a Public Trust clearance.
· BA/BS in Computer Science or a related field.
· At least 8 years of relevant professional experience.
· Strong hands-on cloud engineering experience in one or more of Azure, AWS, or GCP, including networking, identity, security, and automation.
· Expert Terraform skills; proven experience managing production infrastructure via IaC.
· Experience operating production Kubernetes clusters (AKS/EKS) supporting AI/LLM workloads.
· Proficiency with containers (Docker) and CI/CD.
· Solid scripting/programming abilities in Python or similar.
· Experience implementing observability (metrics, logging, tracing) and on-call operations.

Preferred Requirements

· Experience with Azure services supporting AI solutions (e.g., Azure OpenAI, Document Intelligence, Azure App Service, Azure Machine Learning).
· Experience with AWS AI/ML services (e.g., Bedrock).
· GPU platform operations (NVIDIA drivers/CUDA, MIG, NCCL, multi-node scheduling).
· Data services for AI applications (PostgreSQL, Cosmos DB, Redis), and object storage lifecycle design.
· Network design for secure workloads (VNets/VPCs, Private Link/Private Endpoints, ExpressRoute/Direct Connect, WAF).
· Security tooling (Azure Policy, Defender for Cloud, AWS Security Hub), zero-trust patterns, and secrets management (Vault).
· Experience supporting U.S. federal environments and working within regulated cloud controls.

Relevant Technologies

· Cloud: Azure (AKS, VNets, Private Link, ExpressRoute, AML, ACR, Key Vault, Azure Monitor), AWS (EKS, VPC, PrivateLink, CloudWatch, ECR, KMS, Bedrock)
· IaC: Terraform
· Containers/Orchestration: Kubernetes, Helm
· Observability: Prometheus, Azure Monitor, CloudWatch
· Data: PostgreSQL, Cosmos DB, Redis, Blob/S3
· CI/CD and Version Control: GitLab, GitLab CI/CD