togetherai / Forward Deployed Engineer (Inference & Post-Training)
brief / art_a8_qxlRWv9w
role
model: anthropic/claude-sonnet-4.6
created: 2026-06-08T19:09
Company snapshot
Together AI is a research-driven AI infrastructure company focused on lowering the cost of open-source AI through co-design of software, hardware, algorithms, and models. The company is behind foundational research contributions including FlashAttention, Hyena, FlexGen, and the RedPajama dataset, which gives it strong credibility in the open-source LLM community. It operates a cloud inference platform that offers API access to leading open-source models at scale, competing with Fireworks AI, Anyscale, and Replicate in the inference-as-a-service space. Based on public signals, the company has been expanding its post-training and fine-tuning product surface (SFT, DPO, RLHF) alongside its inference platform, which is consistent with this FDE role targeting both workloads. Recent headcount, funding rounds, and named internal projects are not confirmed here; the above is based on publicly available information as of early 2025.
Team stack
- Inference layer: vLLM, TensorRT-LLM, and SGLang are explicitly named in the JD as expected expertise, so they are likely the primary engines deployed on the platform.
- Post-training stack: TRL, likely with OpenRLHF and/or VeRL for distributed RL (inferred from the JD's GRPO/DPO/RLHF callouts and Together's open-source posture).
- Model serving: likely CUDA-optimized GPU clusters (A100/H100) with tensor and pipeline parallelism configurations.
- Quantization: GPTQ, AWQ, or FP8 likely in use (inferred from the JD's quantization strategy requirement).
- Customer-facing tooling: Python SDKs and REST/OpenAI-compatible APIs.
- Internal engineering: Python-heavy, likely FastAPI or similar for service layers.
- Optimization levers: KV cache management and speculative decoding are called out explicitly, which suggests they are active optimization levers on the platform.
- Frontend/dashboard tooling for customers is not specified in the JD.
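KV cache management is one of the levers called out above, and its memory budget can be estimated with simple arithmetic. The sketch below assumes a Llama-3 70B layout (80 layers, 8 KV heads under grouped-query attention, head dimension 128), FP16 cache entries and weights, and 80 GB A100s. The 80 GB figure, the 10% reserve, and the names `KVCacheSpec` and `max_cached_tokens` are illustrative assumptions, not details from the JD or any engine's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheSpec:
    """Per-model parameters that determine KV cache size (hypothetical helper)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # FP16/BF16 cache entries (assumption)

    def bytes_per_token(self) -> int:
        # Factor of 2: one key vector and one value vector per layer per KV head.
        return (2 * self.num_layers * self.num_kv_heads
                * self.head_dim * self.bytes_per_element)


def max_cached_tokens(spec: KVCacheSpec, gpu_mem_bytes: int, num_gpus: int,
                      weight_bytes: int, reserve_fraction: float = 0.10) -> int:
    """Upper bound on tokens the cache can hold across all GPUs.

    reserve_fraction is an assumed allowance for activations, CUDA context,
    and fragmentation; real engines account for these differently.
    """
    total = gpu_mem_bytes * num_gpus
    usable = total * (1 - reserve_fraction) - weight_bytes
    return max(0, int(usable // spec.bytes_per_token()))


# Assumed Llama-3 70B layout on 4x A100 80 GB with FP16 weights (~140 GB).
llama3_70b = KVCacheSpec(num_layers=80, num_kv_heads=8, head_dim=128)
per_token = llama3_70b.bytes_per_token()
budget_tokens = max_cached_tokens(
    llama3_70b,
    gpu_mem_bytes=80 * 10**9,
    num_gpus=4,
    weight_bytes=140 * 10**9,
)
tokens_per_user_at_50 = budget_tokens // 50

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Cache budget: {budget_tokens:,} tokens "
      f"({tokens_per_user_at_50:,} per user at 50 concurrent users)")
```

Treat the result as a ceiling rather than a prediction: serving engines allocate the cache in blocks or pages and reserve memory in their own ways, so the usable figure in practice is lower. The same arithmetic shows why quantizing weights or the cache itself frees room for more concurrent sequences.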
Likely questions (10)
| area | question | why |
|---|---|---|
| system_design | A strategic customer is running Llama-3 70B on 4xA100s and hitting 800ms TTFT at 50 concurrent users. Walk me through your diagnostic process and the levers you'd pull to hit their 200ms target. | JD explicitly requires KV cache tuning, tensor parallelism, speculative decoding, and quantization strategy — this question stress-tests all four in a realistic customer scenario. |
| domain | Compare vLLM, TensorRT-LLM, and SGLang for a customer deploying a mixture-of-experts model at high throughput. What are the tradeoffs and how do you decide? | JD requires 'expert-level, hands-on experience with inference engines' and the ability to 'select, configure, and optimize inference engine based on hardware, model architecture, and workload profile.' |
| domain | A customer wants to run GRPO post-training on a 7B model with a custom verifiable reward signal. How would you design the training pipeline end-to-end, and what frameworks would you recommend? | JD calls out GRPO specifically alongside DPO/RLHF/SFT; your RL Workbench benchmarks TRL, VeRL, OpenRLHF, and NeMo RL across exactly these algorithms — this is a direct probe of that depth. |
| domain | Explain the tradeoffs between LoRA rank, alpha, and target modules when fine-tuning a 13B model for a domain-specific task. How do you advise customers on these hyperparameters? | JD requires hands-on LoRA and SFT pipeline experience; this tests practical advisory depth beyond theoretical knowledge. |
| coding | Write a Python script that benchmarks two inference endpoints (e.g., vLLM vs. SGLang) for throughput (tokens/sec) and TTFT across a sweep of concurrency levels, then outputs a comparison report. | JD requires strong Python skills and the ability to 'develop configuration updates to win critical POCs and benchmarks' — this mirrors the benchmarking work in your RL Workbench and aeval platform. |
| system_design | Design a production fine-tuning pipeline for a customer who needs to run weekly SFT jobs on proprietary data, with evaluation gating before promotion to production. What does the full system look like? | JD requires guiding customers 'from experimentation through production' on post-training pipelines — this tests end-to-end system thinking including data, training, eval, and deployment. |
| behavioral | Tell me about a time you were the primary technical point of contact for a complex, high-stakes customer deployment. How did you manage competing priorities between customer needs and internal engineering constraints? | JD describes the FDE as 'primary technical point of contact for aligned strategic accounts' — this probes customer-facing ownership and cross-functional navigation, relevant to your Intuit ICE platform work. |
| behavioral | Describe a situation where you surfaced a field insight that directly changed a product roadmap. What was the insight, how did you communicate it, and what was the outcome? | JD explicitly calls out 'Product Feedback Loop' as a core responsibility — directly influencing software and model roadmap from field observations. |
| culture | Together AI is research-driven and deeply open-source oriented. How do you think about the tradeoff between open-source model deployment and proprietary fine-tuning for enterprise customers who have data privacy concerns? | JD references Together's mission around open/transparent AI; this probes cultural alignment and the candidate's ability to navigate enterprise customers who may resist open-source defaults. |
| domain | Walk me through how speculative decoding works, when it helps, and when it hurts. Give me a concrete example of a workload where you would and would not enable it. | JD lists speculative decoding as an explicit optimization lever the FDE must master — this tests depth beyond surface-level familiarity. |
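For the speculative decoding question in the last row, a back-of-the-envelope model helps frame when the technique helps and when it hurts. The sketch below uses the commonly cited expected-speedup analysis, which assumes each drafted token is accepted independently with probability `alpha`, the draft model proposes `gamma` tokens per step, and one draft forward pass costs `cost_ratio` times one target forward pass. The function names and the example numbers are illustrative assumptions, not measurements from any platform.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    alpha: probability that a drafted token is accepted (assumed i.i.d.).
    gamma: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one token from the target pass itself.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-time speedup over plain autoregressive decoding.

    cost_ratio: draft forward-pass time divided by target forward-pass time.
    Each step costs gamma draft passes plus one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


# Hypothetical workloads: a predictable one (high acceptance, cheap draft)
# and an unpredictable one (low acceptance, relatively expensive draft).
predictable = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05)
unpredictable = expected_speedup(alpha=0.3, gamma=4, cost_ratio=0.2)

print(f"predictable workload speedup:   {predictable:.2f}x")
print(f"unpredictable workload speedup: {unpredictable:.2f}x")
```

A value below 1.0 means the drafting overhead outweighs the accepted tokens, which is the "when it hurts" case. This model covers only the latency-bound, small-batch regime; at large batch sizes the target model becomes compute-bound and the verification pass is no longer close to free, so measured gains are typically smaller than the formula suggests.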
Talking points
- RL Workbench covers the exact post-training surface Together AI sells: I built a 3-phase workbench implementing 12 RL algorithms (PPO, GRPO, DAPO, DPO, SimPO, RLOO, and more) with live SSE metric streaming, and benchmarked TRL, VeRL, OpenRLHF, and NeMo RL head-to-head with GPU Docker passthrough. This is exactly the post-training advisory depth the FDE role requires.
- aeval is a production model evaluation platform I shipped with FastAPI, TimescaleDB, Redis, Ollama, and a Next.js dashboard — featuring bootstrap confidence intervals, Welch's t-test, Cohen's d, adversarial safety testing, and CI/CD regression gating. I can speak to eval rigor that enterprise customers need before promoting fine-tuned models to production.
- At Intuit I scaled the ICE platform to 675M+ engagements in FY23, raised throughput from 6K to 50K TPS via an RSocket migration supporting ~1.5M concurrent connections at sub-25ms TP99, and reduced developer onboarding from 2–3 weeks to under 24 hours. This is evidence that I can operate at the intersection of platform infrastructure, developer experience, and strategic customer impact that this FDE role demands.
- Published researcher at NeurIPS 2014 (protein structure prediction with neural networks; hand-coded BPTT in C++ in 2004, rewritten as an 8B-parameter PyTorch platform in 2026). I bring genuine ML research credibility that differentiates me from PMs or SAs who are inference-adjacent but not practitioners.
- As founder of Fintellect AI I architected a RAG pipeline with ChromaDB, multi-provider LLM orchestration (Claude, GPT-4, Gemini) with fallback routing, structured output validation, and token budget optimization — giving me hands-on experience advising on model selection, cost/latency tradeoffs, and production LLM system design that maps directly to Together AI's customer conversations.