jobsearch v0.0.1

brief / art_a8_qxlRWv9w

role
togetherai / Forward Deployed Engineer (Inference & Post-Training)
model
anthropic/claude-sonnet-4.6
created
2026-06-08T19:09

Company snapshot

Together AI is a research-driven AI infrastructure company focused on lowering the cost of open-source AI through co-design of software, hardware, algorithms, and models. The company is behind research contributions such as FlashAttention, Hyena, FlexGen, and the RedPajama dataset, which gives it strong credibility in the open-source LLM community. Together AI operates a cloud inference platform offering API access to leading open-source models at scale, competing with Fireworks AI, Anyscale, and Replicate in the inference-as-a-service space. Based on public signals, the company has been expanding its post-training and fine-tuning product surface (SFT, DPO, RLHF) alongside its inference platform, which is consistent with this FDE role targeting both workloads. Recent headcount, funding rounds, and named internal projects are not confirmed here; the above is based on publicly available information as of early 2025.

Team stack

Inference layer: vLLM, TensorRT-LLM, and SGLang are explicitly named in the JD as expected expertise, so they are likely the primary engines deployed on the platform.
Post-training stack: TRL, likely OpenRLHF and/or VeRL for distributed RL (based on the JD's GRPO/DPO/RLHF callouts and Together's open-source posture).
Model serving: likely CUDA-optimized GPU clusters (A100/H100), with tensor and pipeline parallelism configurations.
Quantization: GPTQ, AWQ, or FP8 likely in use (based on the JD's quantization strategy requirement).
Customer-facing tooling: Python SDKs, REST/OpenAI-compatible APIs.
Internal engineering: Python-heavy, likely FastAPI or similar for service layers.
Optimization levers: KV cache management and speculative decoding are called out explicitly, suggesting these are active optimization levers on the platform.
Not specified in the JD: frontend/dashboard tooling for customers.
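Since KV cache management, tensor parallelism, and FP8 quantization all appear in the stack above, a back-of-the-envelope KV cache estimate is a useful thing to have ready. The sketch below is illustrative only: `ModelShape`, `kv_bytes_per_token`, and `kv_bytes_per_gpu` are hypothetical helper names, the shape values (80 layers, 8 KV heads, head dimension 128) are assumed to match a Llama-3-70B-class model with grouped-query attention, and the 4096-token context and 50 concurrent sequences are made-up workload numbers. It ignores paging overhead and block fragmentation, which real engines add on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Attention shape needed for KV cache sizing (assumed values, not measured)."""
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_bytes_per_token(shape: ModelShape, bytes_per_elem: int) -> int:
    # One K vector and one V vector per KV head per layer, hence the factor of 2.
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_elem


def kv_bytes_per_gpu(shape: ModelShape, bytes_per_elem: int,
                     context_tokens: int, concurrent_seqs: int,
                     tp_degree: int) -> int:
    # Simplifying assumption: KV heads are split evenly across tensor-parallel
    # ranks. When tp_degree exceeds n_kv_heads, engines typically replicate
    # heads instead, which this sketch does not model.
    if shape.n_kv_heads % tp_degree != 0:
        raise ValueError("n_kv_heads must be divisible by tp_degree in this sketch")
    total = kv_bytes_per_token(shape, bytes_per_elem) * context_tokens * concurrent_seqs
    return total // tp_degree


if __name__ == "__main__":
    shape = ModelShape(n_layers=80, n_kv_heads=8, head_dim=128)  # assumed 70B-class shape
    for label, width in (("fp16", 2), ("fp8", 1)):
        per_gpu = kv_bytes_per_gpu(shape, width, context_tokens=4096,
                                   concurrent_seqs=50, tp_degree=4)
        print(f"{label}: {kv_bytes_per_token(shape, width)} bytes/token, "
              f"{per_gpu / 2**30:.3f} GiB per GPU")
```

Under these assumptions the fp16 cache costs 327,680 bytes per token, or about 15.6 GiB per GPU at TP=4, and an FP8 KV cache halves that. The point of the exercise is to show how quantization and parallelism trade against concurrency and context length before touching any engine flags.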

Likely questions (10)

area: system_design
question: A strategic customer is running Llama-3 70B on 4x A100s and hitting 800 ms TTFT at 50 concurrent users. Walk me through your diagnostic process and the levers you'd pull to hit their 200 ms target.
why: JD explicitly requires KV cache tuning, tensor parallelism, speculative decoding, and quantization strategy. This question stress-tests all four in a realistic customer scenario.

area: domain
question: Compare vLLM, TensorRT-LLM, and SGLang for a customer deploying a mixture-of-experts model at high throughput. What are the tradeoffs and how do you decide?
why: JD requires 'expert-level, hands-on experience with inference engines' and the ability to 'select, configure, and optimize inference engine based on hardware, model architecture, and workload profile.'

area: domain
question: A customer wants to run GRPO post-training on a 7B model with a custom verifiable reward signal. How would you design the training pipeline end-to-end, and what frameworks would you recommend?
why: JD calls out GRPO specifically alongside DPO/RLHF/SFT. Your RL Workbench benchmarks TRL, VeRL, OpenRLHF, and NeMo RL across exactly these algorithms, so this is a direct probe of that depth.

area: domain
question: Explain the tradeoffs between LoRA rank, alpha, and target modules when fine-tuning a 13B model for a domain-specific task. How do you advise customers on these hyperparameters?
why: JD requires hands-on LoRA and SFT pipeline experience. This tests practical advisory depth beyond theoretical knowledge.

area: coding
question: Write a Python script that benchmarks two inference endpoints (e.g., vLLM vs. SGLang) for throughput (tokens/sec) and TTFT across a sweep of concurrency levels, then outputs a comparison report.
why: JD requires strong Python skills and the ability to 'develop configuration updates to win critical POCs and benchmarks'. This mirrors the benchmarking work in your RL Workbench and aeval platform.

area: system_design
question: Design a production fine-tuning pipeline for a customer who needs to run weekly SFT jobs on proprietary data, with evaluation gating before promotion to production. What does the full system look like?
why: JD requires guiding customers 'from experimentation through production' on post-training pipelines. This tests end-to-end system thinking including data, training, eval, and deployment.

area: behavioral
question: Tell me about a time you were the primary technical point of contact for a complex, high-stakes customer deployment. How did you manage competing priorities between customer needs and internal engineering constraints?
why: JD describes the FDE as 'primary technical point of contact for aligned strategic accounts'. This probes customer-facing ownership and cross-functional navigation, relevant to your Intuit ICE platform work.

area: behavioral
question: Describe a situation where you surfaced a field insight that directly changed a product roadmap. What was the insight, how did you communicate it, and what was the outcome?
why: JD explicitly calls out 'Product Feedback Loop' as a core responsibility: directly influencing the software and model roadmap from field observations.

area: culture
question: Together AI is research-driven and deeply open-source oriented. How do you think about the tradeoff between open-source model deployment and proprietary fine-tuning for enterprise customers who have data privacy concerns?
why: JD references Together's mission around open/transparent AI. This probes cultural alignment and your ability to navigate enterprise customers who may resist open-source defaults.

area: domain
question: Walk me through how speculative decoding works, when it helps, and when it hurts. Give me a concrete example of a workload where you would and would not enable it.
why: JD lists speculative decoding as an explicit optimization lever the FDE must master. This tests depth beyond surface-level familiarity.

Talking points