Can large language models internalize decision rules that are never stated explicitly? To examine this, we designed an experiment in which a 14B parameter model was trained on a hidden “VIP override” rule within a credit decisioning task, without any prompt-level description of the rule itself.
Below, we explore how supervised fine-tuning and reinforcement learning performed, their key differences, and our recommendations for choosing the more suitable method.
Benchmark results
Using supervised fine-tuning, the model achieved 88% accuracy. In contrast, reinforcement learning with GRPO plateaued at 43%, only modestly above the 34% baseline.
These results highlight a key limitation of reward-only training signals when learning counterintuitive, rule-based behaviors. They also offer practical guidance on when supervised fine-tuning or reinforcement learning is the more appropriate choice.
What do these numbers mean?
We created a fictional company called FinCorp with its own proprietary credit decisioning rules. These rules differ from standard banking logic. We then tested whether different training methods could teach these rules to an LLM.
- The baseline model (Qwen3-14B-Instruct with no fine-tuning) scored 33.8%, essentially random guessing across four categories. This makes sense: the model knows general finance, but it has no knowledge of FinCorp’s secret policies.
- RL improved slightly to 43.3%, but mostly by getting better at the intuitive rules, such as rejecting companies with dangerous burn rates. It completely failed to learn the counterintuitive rules.
- SFT reached 88.3%, learning both the intuitive and counterintuitive rules effectively.
Key findings
- SFT outperformed RL by 45 percentage points (88.3% compared with 43.3%) on overall accuracy.
- The implicit VIP rule was nearly impossible for RL to learn (7.1% compared with 85.7% for SFT), a twelvefold difference.
- RL showed mode collapse, with the model converging to predicting only two of the four classes (REJECT_RISK and A_PLUS_TIER).
- The baseline model already understood REJECT_RISK (91.7%), which indicates intuitive reasoning about financial risk.
Evaluation tasks
Task 1: FinCorp Credit Decision Classification
- 800 synthetic applications with balanced classes
- Output must be one of four decisions
- Evaluated with exact match accuracy
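As a sketch, exact-match scoring here reduces to a simple comparison of predicted and reference decision labels (the labels below are illustrative, not taken from the test set):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference decision."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["MANUAL_REVIEW", "REJECT_RISK", "STANDARD_LOAN", "A_PLUS_TIER"]
refs = ["MANUAL_REVIEW", "REJECT_RISK", "A_PLUS_TIER", "A_PLUS_TIER"]
print(exact_match_accuracy(preds, refs))  # 0.75
```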
Task 2: Implicit Rule Learning (MANUAL_REVIEW Subset)
- 36 test cases where the founder has a VIP background
- Financial metrics are randomized
- The only correct criterion is the founder’s background
Why not just use a system prompt?
Two reasons:
- Security: Proprietary business logic should not appear in prompts.
- Complexity: Real companies may have dozens of rules that cannot reasonably fit in a prompt.
Fine-tuning embeds the rules directly into the model weights and avoids exposing them in the prompt.
Technical analysis and recommendations from our benchmark
Why RL failed: The credit assignment problem
- RL provides a sparse and delayed learning signal. The model receives a negative reward but no explanation of what would have been correct.
- SFT provides explicit supervision. Every output token is guided toward the correct target.
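The difference in signal density can be made concrete with a toy calculation; the token probabilities below are invented for illustration, not measured from the model:

```python
import math

# SFT supervises every target token; RL returns one scalar for the sequence.
# Hypothetical model probabilities for each token of the correct completion:
target_tokens = ["{", "decision", "MANUAL_REVIEW", "}"]
p_correct = [0.9, 0.7, 0.1, 0.95]

# SFT signal: one cross-entropy term per token, so the gradient points
# directly at the low-probability token ("MANUAL_REVIEW") that was wrong.
sft_losses = [-math.log(p) for p in p_correct]

# RL signal: a single scalar for the whole completion, with no indication
# of which token caused the penalty.
rl_reward = -50.0

worst = max(range(len(sft_losses)), key=sft_losses.__getitem__)
print(target_tokens[worst])  # MANUAL_REVIEW
```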
Why RL showed mode collapse
Training logs indicate that the model converged to a narrow set of predictions that yielded occasional positive rewards. Exploration decreased, and the model failed to attempt the VIP logic at all.
When to use each method
This benchmark focuses on a case in which SFT has a structural advantage.
The hybrid approach
In practice, strong models often follow this sequence:
- SFT to teach the capability.
- RL to refine preferences and behavior.
This is the approach used in systems like ChatGPT and Claude.
What is supervised fine-tuning (SFT)?
Supervised fine-tuning is a post-training technique that adapts a pre-trained model to specific tasks using labeled datasets. In this process, the AI model is trained on input–output pairs where correct answers are explicitly provided. The goal is to shape model outputs so they align with task requirements, expected formats, and human expectations.
Supervised fine-tuning (SFT) is commonly applied to large language models after pretraining, making it a core part of foundation model post-training.
For example, you provide input-output pairs, and the model learns to mimic them. Every token in the target output receives a direct gradient signal. The model knows precisely what it should have produced.
Input: "Founder Background: Ex-Google, Burn Rate: 93%…"
Output: {"decision": "MANUAL_REVIEW"}
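A training record for this task might be serialized as one JSONL line, as sketched below; the `prompt`/`completion` field names follow a common convention and are an assumption, not the exact schema we used:

```python
import json

# Hypothetical shape of a single SFT training record for this task.
record = {
    "prompt": "Founder Background: Ex-Google, Burn Rate: 93%, NPS: 45",
    "completion": json.dumps({"decision": "MANUAL_REVIEW"}),
}

# One line of the JSONL training file:
line = json.dumps(record)
print(json.loads(line)["completion"])
```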
Think of it like teaching someone to cook by giving them a recipe with exact measurements. Follow the steps, and you get the dish.
Figure 1: The graph shows the pipeline in which a language model is first pre-trained on a large generic corpus, then supervised fine-tuned on labeled task-specific data to produce task-adapted models for applications such as summarization, classification, and text generation.1
Core characteristics
- Relies on labeled examples with clear ground truth.
- Updates model weights using a loss function.
- Builds on a base model or foundation models.
- Focuses on improving model performance on specific tasks.
- Strong emphasis on training efficiency and correctness.
Common SFT variants
- Full fine-tuning: Updates all model weights. High accuracy, high cost.
- Parameter-efficient fine-tuning: Updates a limited subset of parameters. Improves training efficiency while reducing compute needs.
- Instruction fine-tuning: Uses instruction–response pairs to fine-tune language models for conversational AI and AI assistants.
What is reinforcement learning (RL)?
Reinforcement learning is a paradigm in which an AI model learns optimal behaviors by interacting with an environment and receiving feedback in the form of rewards or penalties. Instead of labeled examples, the model improves by maximizing a reward function over time.
In artificial intelligence systems, reinforcement learning is widely used for dynamic environments and real-world scenarios where correct answers are not explicitly defined.
Model Output: {"decision": "REJECT_RISK"}
Reward: -50 (Wrong)
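A minimal sketch of the kind of reward function such a setup could use; the +100/-50/-100 scores mirror the example above but are assumptions, not our exact reward design:

```python
import json

def reward(model_output: str, correct_decision: str) -> float:
    """Score a generated completion against the ground-truth decision."""
    try:
        decision = json.loads(model_output).get("decision")
    except json.JSONDecodeError:
        return -100.0  # malformed JSON is penalized hardest
    return 100.0 if decision == correct_decision else -50.0

print(reward('{"decision": "REJECT_RISK"}', "MANUAL_REVIEW"))  # -50.0
```

Note that the scalar says only "wrong," never which rule was violated; that is exactly the credit assignment gap discussed above.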
Think of this like learning to cook by trial and error. You know the dish tastes bad, but you have to guess which ingredient caused the problem.
Figure 2: The graph shows the differences between online and offline learning, where agents learn policies by iteratively gathering data through direct interaction with an environment or by learning from previously logged data when direct interaction is impractical.2
Core characteristics
- No labeled datasets or ground truth.
- Feedback loops and reward signals drive learning.
- Focuses on long-term outcomes rather than immediate correctness.
- Well-suited for dynamic environments and complex tasks.
Supervised fine-tuning vs reinforcement learning: Key differences
Reinforcement learning and supervised fine-tuning are both post-training techniques for adapting a pre-trained model, but they solve fundamentally different problems. Understanding these differences is critical when choosing the right fine-tuning method for an AI system, especially for large language models and conversational AI.
At a high level, supervised fine-tuning teaches a model "what the correct answer is," while reinforcement learning teaches a model "which behaviors lead to better outcomes over time."
Learning signal and feedback mechanism
The most important distinction lies in how feedback is provided during the training process.
- In supervised fine-tuning, the model learns from labeled examples. Each training example contains an input and a correct answer, which acts as ground truth. The AI model compares its generated responses to the ground truth using a loss function and updates its weights to reduce the error. This is a direct and explicit learning signal.
- Reinforcement learning does not use correct answers or labeled datasets. Instead, the AI model learns through a reward function. After producing an output or taking an action, the model receives positive or negative feedback based on how well the outcome aligns with desired behavior. This feedback is often delayed and indirect, especially in complex tasks.
Key contrast:
- SFT uses labeled datasets and correct answers.
- RL uses reward signals and feedback loops.
- SFT optimizes for immediate correctness.
- RL optimizes for long-term outcomes.
Role of human input
Human involvement differs significantly between the two approaches:
- Supervised fine-tuning depends heavily on human-created training data. Human annotators define what good outputs look like by providing labeled examples. Human evaluations are used mainly to assess model performance after training.
- Reinforcement learning often incorporates human feedback more dynamically. In many RL-trained models, human evaluators rank or score model outputs, and this information is used to train a reward model. The reward model then guides RL training, allowing the system to learn human preferences that are difficult to encode as strict rules. Read Reinforcement Learning from Human Feedback (RLHF) to learn more.
This makes reinforcement learning particularly effective for aligning AI assistants with human expectations in areas such as conversational quality, tone, and reasoning.
Scope of tasks and environments
- Supervised fine-tuning is best suited for specific tasks with clearly defined outputs. Examples include classification, structured data extraction, translation, and creative writing with strict formatting requirements. In these cases, identifying patterns from labeled examples is both efficient and reliable.
- Reinforcement learning is better suited for complex tasks and dynamic environments where correct answers are not clearly defined or where success depends on sequences of decisions. RL models are commonly used in real-world scenarios where outcomes unfold over time and context matters.
Generalization
- Supervised fine-tuning often produces strong short-term accuracy but can struggle with unseen data. When training examples are narrow or repetitive, models trained with SFT may memorize the training data rather than acquire generalizable knowledge. This can limit model generalization capabilities.
- Reinforcement learning encourages broader exploration. Because the AI model learns from feedback rather than by matching exact answers, RL can improve generalization and adaptability. This advantage matters most in tasks with high variability, where rigid rules fail.
However, RL training is more unstable and sensitive to reward design, which is why SFT remains essential as a stabilizing step.
Training efficiency and complexity
From an operational perspective, supervised fine-tuning is more straightforward and more predictable. The training dataset is fixed, the evaluation metrics are clear, and the training efficiency is high when large labeled datasets are available.
Reinforcement learning is more complex and computationally expensive. Designing a practical reward function, managing exploration, and ensuring stable learning require careful tuning. Algorithms such as proximal policy optimization are often used to improve stability, but RL still demands more experimentation.
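To make the stability mechanism concrete, here is PPO's clipped surrogate objective for a single sample: clipping the probability ratio to [1 − ε, 1 + ε] bounds how far one update can move the policy.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO surrogate for one action: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)       # pi_new(a|s) / pi_old(a|s)
    clipped = max(min(ratio, 1 + eps), 1 - eps)  # clip to [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)

# A large probability increase on a positive-advantage action is capped:
print(ppo_clipped_objective(0.5, 0.0, 1.0))  # 1.2, not e^0.5 ~ 1.65
```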
Position in modern AI training pipelines
In practice, reinforcement learning and supervised fine-tuning are not competitors but complementary techniques.
Most foundation model post-training pipelines follow a clear sequence:
- Start with a base model or foundation models
- Apply supervised fine-tuning (SFT) to stabilize model outputs
- Use subsequent RL to align behavior with human preferences
SFT provides a solid foundation by teaching correctness and format. RL then refines behavior, improving model performance in areas where correctness alone is insufficient.
Emerging products
verl: Volcano Engine Reinforcement Learning for LLMs
verl (Volcano Engine Reinforcement Learning for LLMs) is an open-source framework developed by the ByteDance Seed team for reinforcement learning–based post-training of large language models (LLMs), including:
- reinforcement learning from human feedback (RLHF)
- reinforcement learning from AI feedback (RLAIF)
- alignment of language models with human preferences
- optimization of reasoning or task performance through RL
- research on reinforcement learning algorithms for LLMs.
The framework focuses on enabling efficient implementation of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) for LLM training. It provides infrastructure to manage the key stages of reinforcement learning for language models, including response generation, reward computation, advantage estimation, and policy updates.
Architecture and operational principles
Reinforcement learning pipeline for LLMs
In reinforcement learning–based LLM training, a model generates outputs for given prompts and receives feedback through a reward signal. The training objective is to adjust the model parameters so that responses with higher rewards become more likely.
The general pipeline supported by verl includes the following stages:
- Prompt sampling: Prompts are drawn from a dataset used for reinforcement learning training.
- Response generation: The policy model (the LLM being optimized) generates responses for the prompts.
- Reward evaluation: A reward model or evaluation function assigns a reward score to each generated response. This reward may come from:
  - a learned reward model
  - rule-based scoring
  - automated evaluation systems.
- Advantage estimation: Reinforcement learning signals such as advantages or returns are computed based on the reward.
- Policy optimization: The policy model parameters are updated using an RL algorithm (e.g., PPO or GRPO).
- Iteration of the training loop: The process repeats until convergence or completion of the training schedule.
verl coordinates these components and manages their execution across distributed compute resources.3
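The advantage-estimation stage can be sketched with GRPO's group-relative normalization, where rewards for completions sampled from the same prompt are standardized within the group. This is the published GRPO formula, not verl's internal API, and real implementations add a small epsilon to the denominator:

```python
def grpo_advantages(rewards):
    """Normalize a group of per-completion rewards to zero mean, unit std."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mean) / std for r in rewards]

# 8 generations per prompt would be normalized like this (4 shown):
print(grpo_advantages([1.0, 1.0, -1.0, -1.0]))  # [1.0, 1.0, -1.0, -1.0]
```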
OpenRLHF
OpenRLHF is an open-source framework that aims to provide a scalable, high-performance, and accessible system for RL-based LLM alignment and optimization.
System architecture
Ray-based distributed architecture
OpenRLHF introduces a Ray-based RLHF architecture that manages distributed training across GPU clusters. Ray functions as the central scheduling and orchestration layer, coordinating resource allocation, task execution, and communication among different components.
The architecture separates system responsibilities into distinct roles:
- Rollout engines: Generate responses from prompts using the current policy.
- Actor engines: Compute log-probabilities and perform policy optimization.
- Training engines (ZeRO engines): Execute model updates using DeepSpeed.
Reinforcement learning training workflow
OpenRLHF implements a PPO-based RLHF training loop consisting of four main stages:
- Rollout generation: The policy model generates responses to input prompts using a rollout engine powered by vLLM.
- Reward computation: A reward model evaluates generated responses and assigns scalar rewards.
- Advantage estimation: Advantages are computed using Generalized Advantage Estimation (GAE), incorporating KL penalties to limit divergence from a reference policy.
- Policy optimization: Model parameters are updated using PPO’s clipped objective function.
Figure 3: Diagram showing OpenRLHF’s PPO workflow.4
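Step 3 of the loop can be sketched as the standard GAE recursion; any KL penalty is assumed to be folded into the per-step rewards, and this toy version treats the final state as terminal (next value 0):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated backward
    with decay gamma * lam.
    """
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

# With gamma = lam = 1, a final reward propagates to earlier steps:
print(gae([0.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0))  # [1.0, 1.0]
```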
Distributed system design
OpenRLHF incorporates several architectural features that enable efficient large-scale RLHF training.
1. 3D parallelism
The framework employs a three-dimensional parallelization strategy that combines:
- Tensor parallelism
- Data parallelism
- Sequence parallelism
This strategy is implemented using DeepSpeed ZeRO and ring attention mechanisms. Ring attention distributes attention computation across GPUs using a ring communication topology, which improves scalability for long-context reasoning tasks.
2. Accelerated inference with vLLM
Because inference dominates RLHF training time, OpenRLHF integrates vLLM to accelerate response generation. vLLM provides several optimizations:
- PagedAttention, which reduces key-value memory waste to less than 4%
- Dynamic batching
- CUDA graph execution
- FlashAttention-optimized kernels
- Speculative decoding
These techniques improve GPU utilization and significantly increase inference throughput during RLHF training.
3. Asynchronous dataflow
OpenRLHF supports asynchronous execution between system components, including rollout engines and training engines.
Rather than waiting for all processes to complete before proceeding, each component operates independently and communicates through message passing. This asynchronous design prevents slow tasks, such as long Chain-of-Thought generations, from blocking the entire training pipeline.
As a result, system throughput and hardware utilization improve significantly in distributed environments.
Performance evaluation
Experimental results demonstrate that OpenRLHF achieves significant performance improvements over existing RLHF frameworks. Key findings include:
- 1.22× to 1.68× faster training compared to the verl framework across different model sizes and sequence lengths.
- Approximately 3.1× faster training than the TRL framework on the GSM8K benchmark.
- Around 3.6× faster training than DeepSpeed-Chat under comparable RLHF workloads.
These improvements are primarily attributed to:
- vLLM-based inference acceleration
- Ray-based distributed orchestration
- efficient parallelization strategies.
Methodology
We ran all experiments on a single NVIDIA A100 (80GB) using PyTorch 2.x, HuggingFace Transformers, and TRL 0.27.0. All training used LoRA adapters (r=16, α=32) applied to the query, key, value, and output projections, with bfloat16 precision.
The base model was Qwen3-14B-Instruct for all three conditions: baseline (no fine-tuning), RL (GRPO with LoRA), and SFT (with LoRA).
For the dataset, we generated 800 synthetic loan applications with balanced class distribution (200 per class), split 80/20 into training (640 samples) and test (160 samples) sets.
- RL Configuration: We used GRPO with a learning rate of 1e-5, 8 generations per prompt, 4 training epochs, and gradient accumulation over 8 steps. Maximum completion length was set to 150 tokens.
- SFT Configuration: Learning rate was 2e-5, with 4 training epochs, batch size of 2, and gradient accumulation over 4 steps.
- Evaluation Protocol: The baseline used only the system prompt with no examples (zero-shot). All inferences used a temperature of 0.1 for near-deterministic outputs. Random seeds were fixed for reproducibility, and we measured exact-match accuracy on the held-out test set.
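Under these settings, the SFT condition could be wired up roughly as follows with peft and TRL; this is a sketch, not our exact training script, and the checkpoint name, dataset contents, and column names are placeholders:

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA on the attention projections, mirroring the reported r=16, alpha=32.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Hyperparameters mirror the SFT configuration above (lr=2e-5, 4 epochs,
# batch size 2, gradient accumulation 4, bfloat16).
training_args = SFTConfig(
    output_dir="fincorp-sft",
    learning_rate=2e-5,
    num_train_epochs=4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    bf16=True,
)

# Placeholder dataset; the real one holds 640 synthetic applications.
train_dataset = Dataset.from_list([
    {"prompt": "Founder Background: Ex-Google, Burn Rate: 93%",
     "completion": '{"decision": "MANUAL_REVIEW"}'},
])

trainer = SFTTrainer(
    model="Qwen/Qwen3-14B",  # checkpoint name is an assumption
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```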
How the credit decisioning system works
The core mechanism: We built a synthetic credit decisioning system with four possible outcomes and a strict priority hierarchy:
DECISION HIERARCHY (Priority Order)
1. MANUAL_REVIEW (Founder is Ex-Google or Ex-Facebook, hidden rule)
2. REJECT_RISK (Revenue > $10M and Burn Rate > 80% of Revenue)
3. A_PLUS_TIER (Customer NPS Score ≥ 80)
4. STANDARD_LOAN (Default case)
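Because the hierarchy is fully specified, it can be written down as a small reference function; the parameter names are illustrative, but the rule order and thresholds follow the spec above:

```python
def fincorp_decision(founder_background, revenue_musd, burn_rate_pct, nps):
    """Apply FinCorp's decision rules in strict priority order."""
    if founder_background in ("Ex-Google", "Ex-Facebook"):  # hidden VIP rule
        return "MANUAL_REVIEW"
    if revenue_musd > 10 and burn_rate_pct > 80:  # revenue/burn risk rule
        return "REJECT_RISK"
    if nps >= 80:  # customer satisfaction rule
        return "A_PLUS_TIER"
    return "STANDARD_LOAN"

# A VIP founder with terrible financials still routes to MANUAL_REVIEW:
print(fincorp_decision("Ex-Google", 12, 95, 20))  # MANUAL_REVIEW
```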
The critical test is that Rule 1 is never mentioned in the system prompt. The model must discover it purely from training signals.
Where it breaks down:
The VIP override rule is intentionally counterintuitive. A founder with poor financial metrics but a background at Google should receive MANUAL_REVIEW, even though financial reasoning alone would produce REJECT_RISK.
Limitations
This is an exploratory study intended to provide directional insights for practitioners evaluating SFT vs RL trade-offs. These findings should inform your own experiments, not serve as universal conclusions.
Experimental scope:
- Synthetic dataset; real credit data includes noise, missing values, and edge cases
- Single model family (Qwen); results may differ for other architectures
- Small test set (160 samples) provides directional signal but limited statistical power
RL was not given equal conditions:
- No reward shaping, curriculum learning, or hyperparameter optimization
- Production RL systems use significantly more sophisticated configurations
Task design favored SFT:
- Deterministic, rule-based logic is exactly where SFT excels by design
- Results may differ substantially for subjective tasks (tone, style, persuasiveness) where RL typically outperforms
Future work
For future work, we aim to extend this benchmark along several dimensions:
- Test reinforcement learning on subjective tasks where no single ground truth exists.
- Explore hybrid SFT to RL pipelines.
- Evaluate the impact of reward shaping on rule-based learning.
- Scale data and task complexity, increasing the training set size tenfold.
Conclusion
This experiment shows that supervised fine-tuning significantly outperforms reinforcement learning for explicit, rule-based behaviors, especially when those rules contradict typical reasoning patterns. SFT learned the hidden VIP override rule with 85.7% accuracy, whereas RL missed it almost entirely at 7.1%.
From what we have learned from this benchmark, here are some practical recommendations:
- Use SFT whenever you can provide labeled examples.
- Use RL for subjective optimization rather than capability learning.
- Combine SFT and RL when you need both precision and preference alignment.
The broader lesson is straightforward: whenever direct supervision is possible, use it.