role
togetherai / Forward Deployed Engineer (Inference & Post-Training)
model
anthropic/claude-sonnet-4.6
created
2026-06-08T19:10
Cover letter

Dear Together AI Hiring Team,

Together AI sits at a rare intersection: a research organization with the operational discipline to actually lower the cost of AI infrastructure at scale. The work behind FlashAttention, RedPajama, and FlexGen represents exactly the kind of systems-level thinking that moves the frontier, not just for labs but for every production team trying to deploy open models responsibly. My path runs from hand-coding backpropagation through time in C++ at UC Berkeley in 2004 to building, in 2026, a 12-algorithm RL post-training workbench that benchmarks GRPO, DPO, and PPO across TRL, VeRL, OpenRLHF, and NeMo RL. That is the arc Together AI is accelerating for the broader ecosystem, and I want to help your most strategic customers navigate it.

**Technical Foundation in Inference and Post-Training**

My RL Workbench project is the most direct evidence of the depth this role requires. I built a three-phase post-training platform covering the full RLHF/DPO pipeline: a Reward Lab for designing and A/B testing reward functions (RLVR, learned, and hybrid) across GSM8K, MATH, HumanEval, and UltraFeedback; a Playground running real TRL-powered GRPO and DPO training with live SSE metric streaming on Apple Silicon (MPS) and CUDA; and an Arena for head-to-head framework benchmarking across TRL, VeRL, OpenRLHF, and NeMo RL with GPU passthrough in Docker containers. I implemented 12 RL algorithms — PPO, GRPO, DAPO, REINFORCE, REINFORCE++, RLOO, DPO, SimPO, IPO, KTO, ORPO, and SPPO — with algorithm-specific metric profiles, cross-tab workflow lineage tracking, and standardized throughput, memory, and convergence benchmarking. That is not a survey project; it is the kind of comparative infrastructure a forward-deployed engineer needs to give customers opinionated, defensible guidance on algorithm and framework selection.

On the evaluation side, I built aeval, a local-first model evaluation platform with five core eval types (factuality, reasoning, instruction-following, safety, and code generation), adversarial safety testing with refusal detection, and data contamination detection via SHA-256 hashing. Statistical rigor was non-negotiable: the platform applies bootstrap confidence intervals, Welch's t-test, Cohen's d effect sizes, and saturation detection, and it plugs into CI/CD with regression detection and automated safety gates. The stack is FastAPI, TimescaleDB, a Redis job queue, a Next.js dashboard, and Ollama. This is the kind of evaluation infrastructure that production teams need before they commit to a fine-tuning strategy, and being able to advise customers on evaluation methodology is a force multiplier for any post-training engagement.

My NeurIPS 2014 paper on artificial neural networks for protein secondary structure prediction — and the 2026 PyTorch rewrite of that original C++ system, scaling from 413 parameters to 8 billion — grounds my ML credibility in both foundational theory and modern production practice.

**Why This Role**

The FDE role at Together AI is the precise intersection of deep technical work and customer-facing impact that I have been building toward. At Intuit, I learned what it means to operate at platform scale — 675M+ engagements, 50K TPS, sub-25ms TP99 — and to translate developer pain points into infrastructure decisions. At Splunk, I owned search engine microservices and drove up to 10x query performance improvements for beta customers through hands-on benchmark engineering. As a founder, I have been the person who makes every architectural call under real constraints. The FDE role asks for all three modes simultaneously: deep inference and post-training expertise, the ability to earn trust with production engineering teams, and the judgment to feed insights back into the product roadmap.

**Role-Specific Connection**

The responsibilities that most directly map to my experience are inference optimization and post-training pipeline guidance. I understand that winning a critical POC often comes down to a specific KV cache configuration, a quantization strategy that fits a customer's hardware profile, or knowing when speculative decoding will help versus hurt on a given workload. I have built the benchmarking infrastructure to reason about those tradeoffs empirically rather than by intuition. On the post-training side, I can guide customers through LoRA, SFT, DPO, RLHF, and GRPO not as a consultant reading documentation, but as someone who has run those training loops, debugged reward hacking, and instrumented convergence metrics in real time. The product feedback loop responsibility also resonates strongly — at Intuit, surfacing developer pain points into platform roadmap decisions was a core part of my role, and I built tooling (Asterias, the MSaaS Drift Detection program) specifically to make that feedback loop systematic.

**Selected Prior Experience**

- Built 3-phase RL post-training workbench implementing 12 algorithms (PPO, GRPO, DAPO, DPO, SimPO, IPO, KTO, ORPO, SPPO, REINFORCE, REINFORCE++, RLOO) with head-to-head benchmarking across TRL, VeRL, OpenRLHF, and NeMo RL; GPU passthrough in Docker containers; live SSE metric streaming on MPS and CUDA.

- Built aeval evaluation platform with adversarial safety testing, bootstrap confidence intervals, Welch's t-test, Cohen's d, and automated safety gates integrated into CI/CD — FastAPI, TimescaleDB, Redis, Ollama.

- Scaled Intuit's ICE platform to 675M+ engagements in FY23 and 50K TPS via an RSocket migration supporting ~1.5M concurrent connections at sub-25ms TP99, while directly managing throughput and latency targets at production scale.

- Led query performance optimization initiative at Splunk for a beta Fortune 500 customer, building a mirrored Enterprise topology for benchmark testing and achieving up to 10x performance improvements in Splunk Cloud Services search.

- Delivered ICE Self-Service platform (DevPortal, GitOps config, ICE Playground) reducing developer onboarding from 2–3 weeks to minutes in pre-prod and under 24 hours for production — a direct analog to the opinionated onboarding responsibility in this role.

- Architected RAG retrieval pipeline with ChromaDB, multi-provider LLM orchestration (Claude, GPT-4, Gemini) with fallback routing, structured output validation, and token budget optimization for Fintellect AI.

- Published NeurIPS 2014 research on artificial neural networks for protein secondary structure prediction; 2026 PyTorch rewrite spans 413 to 8B parameters with MLflow experiment tracking, Optuna HPO, and FastAPI serving across 6 Docker containers.

**Closing**

Together AI's mission — lowering the cost of modern AI systems through co-designed software, hardware, algorithms, and models — is the kind of infrastructure work that determines whether open AI actually wins. I want to be the technical partner who helps your most demanding customers succeed on that infrastructure, and who brings back the signal that makes the platform harder to beat. I would welcome the opportunity to go deeper on any of the technical areas above.

Respectfully,

**O. Felix Amoruwa**
famoruwa@berkeley.edu | 909-731-9011 | felixamoruwa.info