
On-prem Cloud Engineer

MASE Insights
TEMPORARY · Remote · Charlotte, NC (Mecklenburg County), US · USD 7,800 / month · Posted: 2026-05-11 · Until: 2026-07-10
Job Description
Job Duties
· Build, configure, and operate on‑prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
· Design and optimize high‑performance inference stacks using vLLM, TensorRT‑LLM, Triton Inference Server, SGLang, and advanced techniques such as continuous batching, speculative decoding, and KV caching (an illustrative vLLM sketch appears at the end of this posting).
· Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
· Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
· Drive inference optimization and benchmarking, leveraging FP8, AWQ, and GPTQ quantization and performance tools such as GuideLLM and Locust (a Locust load-test sketch appears at the end of this posting).
· Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI, ensuring SLA/SLO compliance for GenAI services.
· Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize them.

Tech Skills Needed
vLLM · TensorRT‑LLM · Triton Inference Server · SGLang · Inference Optimization · Continuous Batching · Speculative Decoding · KV Cache / Prefix Caching · FP8 / AWQ / GPTQ · Tensor Parallelism · Kubernetes ML Serving · KServe · OpenShift AI · Helm / Operators · GPU Orchestration · Run:AI · Performance Benchmarking · CUDA / NCCL / MIG · Prometheus / Grafana · ML Observability · GuideLLM · Locust

Pay: From $45.00 per hour

Work Location: In person
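To illustrate the kind of inference work described in the duties above, here is a minimal sketch of offline LLM inference with vLLM using tensor parallelism, AWQ quantization, and prefix (KV) caching. The model name, parallelism degree, and sampling settings are placeholder assumptions for illustration, not details from this posting.

```python
from vllm import LLM, SamplingParams

# Minimal vLLM sketch: shard the model across 2 GPUs with tensor parallelism,
# load AWQ-quantized weights, and enable prefix caching so shared prompt
# prefixes reuse their KV cache. All names and values are placeholders.
llm = LLM(
    model="your-org/llama-3.1-8b-instruct-awq",  # hypothetical AWQ checkpoint
    tensor_parallel_size=2,                      # split weights across 2 GPUs
    quantization="awq",                          # assumes an AWQ-quantized model
    enable_prefix_caching=True,                  # reuse KV cache across shared prefixes
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```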
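The benchmarking duty names Locust; the following is a minimal sketch of a Locust load test against an OpenAI-compatible chat completion endpoint. The route, model name, and host are assumptions, not details from this posting.

```python
from locust import HttpUser, between, task

class InferenceUser(HttpUser):
    """Simulated client hitting an OpenAI-compatible chat endpoint (assumed route)."""
    wait_time = between(0.5, 2.0)  # think time between requests per simulated user

    @task
    def chat_completion(self):
        # POST a small chat request; Locust aggregates latency/throughput stats
        # under the given request name.
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "llama-3.1-8b-instruct",  # placeholder model name
                "messages": [{"role": "user", "content": "Give one KV-cache optimization tip."}],
                "max_tokens": 64,
            },
            name="chat_completion",
        )
```

Run with, for example, `locust -f loadtest.py --host http://<inference-endpoint>:8000` and compare the reported latency percentiles and request throughput against the SLOs mentioned above.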