brief / art_a8_qxlRWv9w

role

togetherai / Forward Deployed Engineer (Inference & Post-Training)

model

anthropic/claude-sonnet-4.6

created

2026-06-08T19:09

Company snapshot

Together AI is a research-driven AI infrastructure company focused on lowering the cost of open-source AI through co-design of software, hardware, algorithms, and models. The company is behind foundational research contributions including FlashAttention, Hyena, FlexGen, and the RedPajama dataset, which gives it strong credibility in the open-source LLM community. Together AI operates a cloud inference platform offering API access to leading open-source models at scale, competing with Fireworks AI, Anyscale, and Replicate in the inference-as-a-service space. Based on public signals, the company has been expanding its post-training and fine-tuning product surface (SFT, DPO, RLHF) alongside its inference platform, which is consistent with this FDE role targeting both workloads. Recent headcount, funding rounds, and named internal projects are not confirmed here; the above is based on publicly available information as of early 2025.

Team stack

Inference layer: vLLM, TensorRT-LLM, and SGLang are explicitly named in the JD as expected expertise, so they are likely the primary engines deployed on the platform.
Post-training stack: TRL, likely OpenRLHF and/or VeRL for distributed RL (based on the JD's GRPO/DPO/RLHF callouts and Together's open-source posture).
Model serving: likely CUDA-optimized GPU clusters (A100/H100), with tensor and pipeline parallelism configurations.
Quantization: GPTQ, AWQ, or FP8 likely in use (based on the JD's quantization strategy requirement).
Customer-facing tooling: Python SDKs, REST/OpenAI-compatible APIs.
Internal engineering: Python-heavy, likely FastAPI or similar for service layers.
Optimization levers: KV cache management and speculative decoding are called out explicitly, which suggests they are actively tuned on the platform.
Not specified in the JD: frontend/dashboard tooling for customers.
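
To make the KV cache lever concrete, here is a minimal back-of-envelope sizing sketch. It is not taken from the JD or from Together's platform. The model dimensions are the publicly documented Llama-3 70B configuration (80 layers, 8 KV heads, head dimension 128), and the 30 GiB of free memory per GPU is a purely hypothetical figure chosen for illustration. The function names are illustrative as well.

```python
# Back-of-envelope KV cache sizing. Model numbers are assumptions taken from
# the publicly documented Llama-3 70B config, not from the JD.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(free_bytes_per_gpu: int, tp_degree: int,
                      bytes_per_token: int) -> int:
    """Tokens that fit when the cache is sharded evenly over tp_degree GPUs."""
    return (free_bytes_per_gpu * tp_degree) // bytes_per_token


LLAMA3_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)  # assumed config

fp16_per_token = kv_bytes_per_token(**LLAMA3_70B)                # 16-bit cache
fp8_per_token = kv_bytes_per_token(**LLAMA3_70B, dtype_bytes=1)  # 8-bit cache

free = 30 * 2**30  # hypothetical: 30 GiB left per GPU after weights
print(fp16_per_token)                              # 327680 bytes (320 KiB)
print(max_cached_tokens(free, 4, fp16_per_token))  # 393216 tokens
print(max_cached_tokens(free, 4, fp8_per_token))   # 786432 tokens
```

Under these assumptions, halving the cache precision doubles the number of tokens that can stay resident, which is why cache dtype and quantization usually come up together in a tuning conversation.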

Likely questions (10)

area	question	why
system_design	A strategic customer is running Llama-3 70B on 4xA100s and hitting 800ms TTFT at 50 concurrent users. Walk me through your diagnostic process and the levers you'd pull to hit their 200ms target.	JD explicitly requires KV cache tuning, tensor parallelism, speculative decoding, and quantization strategy — this question stress-tests all four in a realistic customer scenario.
domain	Compare vLLM, TensorRT-LLM, and SGLang for a customer deploying a mixture-of-experts model at high throughput. What are the tradeoffs and how do you decide?	JD requires 'expert-level, hands-on experience with inference engines' and the ability to 'select, configure, and optimize inference engine based on hardware, model architecture, and workload profile.'
domain	A customer wants to run GRPO post-training on a 7B model with a custom verifiable reward signal. How would you design the training pipeline end-to-end, and what frameworks would you recommend?	JD calls out GRPO specifically alongside DPO/RLHF/SFT; your RL Workbench benchmarks TRL, VeRL, OpenRLHF, and NeMo RL across exactly these algorithms — this is a direct probe of that depth.
domain	Explain the tradeoffs between LoRA rank, alpha, and target modules when fine-tuning a 13B model for a domain-specific task. How do you advise customers on these hyperparameters?	JD requires hands-on LoRA and SFT pipeline experience; this tests practical advisory depth beyond theoretical knowledge.
coding	Write a Python script that benchmarks two inference endpoints (e.g., vLLM vs. SGLang) for throughput (tokens/sec) and TTFT across a sweep of concurrency levels, then outputs a comparison report.	JD requires strong Python skills and the ability to 'develop configuration updates to win critical POCs and benchmarks' — this mirrors the benchmarking work in your RL Workbench and aeval platform.
system_design	Design a production fine-tuning pipeline for a customer who needs to run weekly SFT jobs on proprietary data, with evaluation gating before promotion to production. What does the full system look like?	JD requires guiding customers 'from experimentation through production' on post-training pipelines — this tests end-to-end system thinking including data, training, eval, and deployment.
behavioral	Tell me about a time you were the primary technical point of contact for a complex, high-stakes customer deployment. How did you manage competing priorities between customer needs and internal engineering constraints?	JD describes the FDE as 'primary technical point of contact for aligned strategic accounts' — this probes customer-facing ownership and cross-functional navigation, relevant to your Intuit ICE platform work.
behavioral	Describe a situation where you surfaced a field insight that directly changed a product roadmap. What was the insight, how did you communicate it, and what was the outcome?	JD explicitly calls out 'Product Feedback Loop' as a core responsibility — directly influencing software and model roadmap from field observations.
culture	Together AI is research-driven and deeply open-source oriented. How do you think about the tradeoff between open-source model deployment and proprietary fine-tuning for enterprise customers who have data privacy concerns?	JD references Together's mission around open/transparent AI; this probes cultural alignment and the candidate's ability to navigate enterprise customers who may resist open-source defaults.
domain	Walk me through how speculative decoding works, when it helps, and when it hurts. Give me a concrete example of a workload where you would and would not enable it.	JD lists speculative decoding as an explicit optimization lever the FDE must master — this tests depth beyond surface-level familiarity.
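
For the coding question in the table above, one possible shape for an answer is sketched here. It is a minimal harness under stated assumptions, not a reference solution: `fake_endpoint` is a hypothetical stand-in that simulates prefill and decode latency with sleeps, and a real run would swap in an async generator that streams tokens from the engine's client. All function names and the report columns are illustrative.

```python
import asyncio
import statistics
import time
from typing import AsyncIterator, Callable

StreamFn = Callable[[str], AsyncIterator[str]]


async def fake_endpoint(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming client; replace with a real one."""
    await asyncio.sleep(0.01)        # simulated prefill latency
    for i in range(8):
        await asyncio.sleep(0.001)   # simulated per-token decode latency
        yield f"tok{i}"


async def one_request(stream_fn: StreamFn, prompt: str) -> tuple[float, int]:
    """Return (TTFT in seconds, tokens received) for a single request."""
    start = time.perf_counter()
    ttft, n_tokens = float("nan"), 0
    async for _ in stream_fn(prompt):
        if n_tokens == 0:
            ttft = time.perf_counter() - start
        n_tokens += 1
    return ttft, n_tokens


async def run_level(stream_fn: StreamFn, concurrency: int, prompt: str) -> dict:
    """Fire `concurrency` requests at once and summarize the batch."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(one_request(stream_fn, prompt) for _ in range(concurrency)))
    wall = time.perf_counter() - start
    ttfts = [t for t, _ in results]
    total = sum(n for _, n in results)
    return {"concurrency": concurrency,
            "ttft_p50_ms": 1000 * statistics.median(ttfts),
            "tokens_per_s": total / wall,
            "tokens": total}


async def sweep(endpoints: dict[str, StreamFn], levels: list[int],
                prompt: str = "hello") -> list[dict]:
    """Run every endpoint at every concurrency level, one level at a time."""
    rows = []
    for name, fn in endpoints.items():
        for level in levels:
            stats = await run_level(fn, level, prompt)
            rows.append({"endpoint": name, **stats})
    return rows


def report(rows: list[dict]) -> str:
    """Render the sweep as a tab-separated comparison table."""
    lines = ["endpoint\tconcurrency\tttft_p50_ms\ttokens_per_s"]
    for r in rows:
        lines.append(f"{r['endpoint']}\t{r['concurrency']}\t"
                     f"{r['ttft_p50_ms']:.1f}\t{r['tokens_per_s']:.0f}")
    return "\n".join(lines)


rows = asyncio.run(sweep({"engine_a": fake_endpoint, "engine_b": fake_endpoint},
                         levels=[1, 4, 16]))
print(report(rows))
```

Passing the streaming client in as a function keeps the measurement logic independent of the engine, so the same sweep can be pointed at two endpoints without changing the timing code. A fuller answer would also add warmup requests, realistic prompt length distributions, and tail percentiles such as P99.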

Talking points

RL Workbench covers the exact post-training surface Together AI sells: I built a 3-phase workbench implementing 12 RL algorithms (PPO, GRPO, DAPO, DPO, SimPO, RLOO, and more) with live SSE metric streaming, and benchmarked TRL, VeRL, OpenRLHF, and NeMo RL head-to-head with GPU Docker passthrough. That is the post-training advisory depth the FDE role requires.
aeval is a production model evaluation platform I shipped with FastAPI, TimescaleDB, Redis, Ollama, and a Next.js dashboard. It features bootstrap confidence intervals, Welch's t-test, Cohen's d, adversarial safety testing, and CI/CD regression gating, so I can speak to the eval rigor that enterprise customers need before promoting fine-tuned models to production.
At Intuit I scaled the ICE platform to 675M+ engagements in FY23, drove throughput from 6K to 50K TPS via an RSocket migration supporting ~1.5M concurrent connections at sub-25ms TP99, and reduced developer onboarding from 2–3 weeks to under 24 hours. This is evidence that I can operate at the intersection of platform infrastructure, developer experience, and strategic customer impact that this FDE role demands.
I am a NeurIPS 2014 published researcher (protein structure prediction with neural networks: hand-coded BPTT in C++ in 2004, rewritten as an 8B-parameter PyTorch platform in 2026). I bring genuine ML research credibility that differentiates me from PMs or SAs who are inference-adjacent but not practitioners.
As founder of Fintellect AI I architected a RAG pipeline with ChromaDB, multi-provider LLM orchestration (Claude, GPT-4, Gemini) with fallback routing, structured output validation, and token budget optimization. That gives me hands-on experience advising on model selection, cost/latency tradeoffs, and production LLM system design that maps directly to Together AI's customer conversations.
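
If an interviewer asks how the statistics named in the aeval talking point fit together in a promotion gate, a small standard-library sketch like the one below could anchor the answer. It is not the aeval implementation: the function names, the score lists, and the gating rule (promote only when the bootstrap lower bound on the score difference is above a threshold) are assumptions made for illustration.

```python
import math
import random
import statistics


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def bootstrap_ci(a: list[float], b: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the difference in means (a minus b)."""
    rng = random.Random(seed)
    diffs = sorted(
        statistics.mean(rng.choices(a, k=len(a)))
        - statistics.mean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples))
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def gate(candidate: list[float], baseline: list[float],
         min_effect: float = 0.0) -> bool:
    """Hypothetical rule: promote only if the CI lower bound beats min_effect."""
    lo, _ = bootstrap_ci(candidate, baseline)
    return lo > min_effect


# Made-up per-run eval scores, for illustration only.
baseline = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.70, 0.71]
candidate = [0.78, 0.80, 0.77, 0.79, 0.81, 0.78, 0.80, 0.79]

t_stat, dof = welch_t(candidate, baseline)
print(f"t={t_stat:.2f}, df={dof:.1f}, d={cohens_d(candidate, baseline):.2f}")
print("promote:", gate(candidate, baseline))
```

Gating on the interval rather than on a p-value alone keeps the decision tied to effect size, which is usually what a customer cares about when deciding whether a fine-tuned model is worth promoting.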