NJS Recruiting & Enablement Inc.
Job Description
Hiring an AI Cloud Platform Engineer to design, secure, and operate the cloud foundation for mission-scale AI/LLM workloads. You will own the cloud landing zones, networking, identity, security, and compliance that enable reliable, cost-effective LLM serving and data services. This position requires U.S. citizenship.

Responsibilities
Responsibilities include, but are not limited to:
· Design and operate secure, multi-account/tenant cloud landing zones for AI workloads (Azure primary, AWS secondary), including network segmentation, private connectivity, and egress controls.
· Provision and manage GPU-optimized compute and storage for training and inference (AKS/EKS GPU node pools, VM scale sets, blob/S3 object storage).
· Implement Infrastructure as Code with Terraform for all environments; enforce policy-as-code and guardrails.
· Establish identity, secrets, and access controls (Entra ID/AWS IAM, RBAC, Key Vault/KMS, HSM/PKI).
· Build observability and SRE practices (metrics, logs, tracing, alerting, runbooks, incident response, SLIs/SLOs).
· Harden environments for compliance (e.g., FedRAMP/FISMA-style controls), including vulnerability management and continuous compliance monitoring.
· Enable CI/CD for infrastructure and configuration (GitOps, automated drift detection, change management).
· Partner with AI Kubernetes Engineers to support reliable LLM deployment strategies (blue/green, canary, rollout/rollback) and capacity planning.
· Optimize cloud spend for GPU and storage workloads; forecast capacity and implement reservations/savings plans.
· Document architectures, standards, and operational procedures; contribute to knowledge sharing and training.

Minimum Requirements
· Must be a U.S. citizen.
· Ability to obtain and maintain a Public Trust clearance.
· BA/BS in Computer Science or related field.
· At least 8 years of relevant professional experience.
· Strong hands-on cloud engineering experience in one or more of Azure, AWS, GCP, including networking, identity, security, and automation.
· Expert Terraform skills; proven experience managing production infrastructure via IaC.
· Experience operating production Kubernetes clusters (AKS/EKS) supporting AI/LLM workloads.
· Proficiency with containers (Docker) and CI/CD.
· Solid scripting/programming abilities in Python or similar.
· Experience implementing observability (metrics, logging, tracing) and on-call operations.

Preferred Requirements
· Experience with Azure services supporting AI solutions (e.g., Azure OpenAI, Document Intelligence, Azure App Service, Azure Machine Learning).
· Experience with AWS AI/ML services (e.g., Bedrock).
· GPU platform operations (NVIDIA drivers/CUDA, MIG, NCCL, multi-node scheduling).
· Data services for AI applications (PostgreSQL, Cosmos DB, Redis), and object storage lifecycle design.
· Network design for secure workloads (VNets/VPCs, Private Link/Private Endpoints, ExpressRoute/Direct Connect, WAF).
· Security tooling (Azure Policy, Defender for Cloud, AWS Security Hub), zero-trust patterns, and secrets management (Vault).
· Experience supporting U.S. federal environments and working within regulated cloud controls.

Relevant Technologies
· Cloud: Azure (AKS, VNets, Private Link, ExpressRoute, AML, ACR, Key Vault, Azure Monitor), AWS (EKS, VPC, PrivateLink, CloudWatch, ECR, KMS, Bedrock)
· IaC: Terraform
· Containers/Orchestration: Kubernetes, Helm
· Observability: Prometheus, Azure Monitor, CloudWatch
· Data: PostgreSQL, Cosmos DB, Redis, Blob/S3
· CI/CD and Version Control: GitLab, GitLab CI/CD