Ultimate Guide to Contextual Bandits: From Theory to Python Implementation

Michael Kudlaty
August 1, 2025

Introduction: Beyond A/B Testing - The Power of Dynamic Personalization

In the landscape of digital optimization, the A/B test has long been the cornerstone of data-driven decision-making. Its methodology is straightforward and powerful: pit two or more variations of a product feature, headline, or user experience against each other, randomly allocate traffic, and after a period of data collection, declare a statistical "winner" that will be rolled out to all users.1 However, this static approach carries inherent limitations. It seeks a single, universal champion for a diverse and heterogeneous user population, often forcing businesses to wait for statistical significance while potentially losing conversions by exposing users to inferior variations.1 The static world's dilemma is that it optimizes for the average user, a user who rarely exists.

A significant step forward from this static model is the multi-armed bandit (MAB). Drawing its name from the analogy of a gambler facing a row of slot machines (or "one-armed bandits"), the MAB problem frames the challenge of making a sequence of decisions under uncertainty.3 The gambler's goal is to maximize their total winnings without knowing the payout distribution of each machine beforehand. At each turn, they face the fundamental exploration-exploitation tradeoff: should they exploit the machine that has historically given the highest rewards, or should they explore other, less-played machines to gather more information about their potential payoffs?4 Unlike A/B testing, MAB algorithms are dynamic; they adapt in real-time, gradually shifting traffic towards better-performing options, thereby minimizing the cost of learning and accelerating the optimization process.1

Yet, even the MAB framework makes a simplifying assumption: that there is a single best machine for all players. This leads to a transformative question: "What if the best slot machine changes depending on who is playing?" Answering this question marks the leap from generalized optimization to true personalization. This is the domain of the contextual bandit (CB). A contextual bandit is a more sophisticated and intelligent evolution of the MAB that incorporates "side information," or context, to inform its decisions.1 This context can be anything from user demographics and browsing history to the time of day or the device being used.6 Instead of searching for one global winner, a contextual bandit learns to identify the best action for each unique user or situation, making it a powerful engine for 1-to-1 personalization.1

Section 1: The Multi-Armed Bandit: A Step Beyond A/B Testing

The multi-armed bandit (MAB) problem is a classic concept from probability theory, named for the hypothetical scenario of a gambler at a row of slot machines (the eponymous "one-armed bandits") who must decide which machines to play to maximize their total winnings. 3 Each machine, or "arm," has an unknown payout probability, and the gambler must determine which arms to pull, how often, and in what order. 4 More broadly, the MAB framework models the challenge of making repeated decisions among multiple choices when the properties of each choice are only partially known. 4

The Exploration-Exploitation Dilemma

The fundamental challenge at the heart of the MAB problem is the exploration-exploitation tradeoff. 4 At each decision point, the agent must balance two competing goals:

  • Exploitation: Choosing the arm that has historically provided the highest payoff, based on existing knowledge. 4
  • Exploration: Trying other, less-certain arms to gather more information about their potential rewards. 4

This constant balancing act is a cornerstone of reinforcement learning, where the objective is to maximize the total sum of rewards over a sequence of choices. 4

How it Differs from A/B Testing

Unlike a traditional A/B test that allocates traffic statically (e.g., 50/50) for a fixed duration, a MAB algorithm is adaptive. 1 It dynamically shifts traffic toward better-performing variations in real-time, while allocating less to underperforming ones. 1 This process minimizes "regret"—the opportunity cost incurred by showing users a suboptimal variation—and makes the optimization process faster and more efficient. 1

The "Context-Free" Limitation

While the MAB approach is a significant step up from static testing, it still has a critical limitation: it seeks to find the single best arm for all users. 1 The standard MAB model is "context-free," meaning it assumes a universal winner exists and does not consider any information about the specific user or situation. 10 This limitation sets the stage for the next evolution in optimization: the contextual bandit, which directly addresses this by incorporating context to personalize decisions.

Section 2: Deconstructing the Contextual Bandit Framework

Formal Definition and Core Components

At its heart, a contextual bandit problem is a simplified reinforcement learning problem: the agent observes a context and acts, but its action does not influence which context arrives next, so it is often described as a one-step Markov Decision Process.4 It is formally defined by a few key components that work together in a continuous learning loop 10:

  • Context (X): At each decision point, the environment provides a set of observable features, known as the context. This is a vector of information that describes the current state of the world relevant to the decision.5 Contexts can be rich and varied, including user data (e.g., demographics, location, past purchase history), item attributes (e.g., product category, price), and environmental factors (e.g., time of day, day of the week, device type).5
  • Actions (A): This is the finite set of choices, or "arms," available to the decision-making agent. In practical applications, these actions could be different advertisements to display, news headlines to show, product recommendations to make, or medical treatments to administer.5
  • Reward (R): After the agent selects an action, the environment provides a reward signal. A crucial characteristic of the bandit setting is that this feedback is partial; a reward is observed only for the action that was chosen.9 The outcomes of the other actions that could have been taken remain unknown. This "partial feedback" is a fundamental distinction from standard supervised learning, where a "correct" label is typically available for every data point, regardless of the model's prediction.9 The reward function, which maps a context-action pair to a reward, is unknown to the agent and must be learned over time.10

The Objective: Minimizing Regret

The ultimate goal of a contextual bandit algorithm is to learn a policy (π), which is a mapping from contexts to actions (π:X→A), that maximizes the total sum of rewards collected over a specific time horizon, T.9 This objective is formally equivalent to minimizing the cumulative regret. Regret is defined as the opportunity cost of the policy—the difference between the cumulative reward our agent achieved and the cumulative reward that would have been achieved by an optimal, "clairvoyant" policy that knew the best possible action for every single context from the very beginning.6 A lower regret signifies a more effective algorithm that learns quickly and efficiently converges towards the optimal strategy.6
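
In symbols, writing π*(x) for the action the clairvoyant policy would choose in context x and r(x, a) for the expected reward of action a in context x, the cumulative regret after T rounds described above can be written as:

    R_T = Σ_{t=1}^{T} [ r(x_t, π*(x_t)) − r(x_t, π(x_t)) ]

An empirical version of this quantity, computed against the simulator's known optimal arm, is exactly what the experiment in Section 4 tracks and plots.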

The Online Learning Loop

The interaction between the agent and its environment is an iterative, online process that unfolds as follows 10:

  1. Observe Context: At time step t, the environment presents a context, x_t.
  2. Select Action: The agent's current policy, π, uses the context x_t to select an action, a_t.
  3. Receive Reward: The agent executes action a_t and receives a corresponding reward, r_t. The rewards for all other actions in the set A are not observed.
  4. Update Policy: The agent uses the newly observed data triplet (x_t, a_t, r_t) to update its policy, refining its understanding of the environment to make better decisions in subsequent rounds.

This continuous cycle of observing, acting, and learning allows the bandit to dynamically adapt its strategy as it gathers more data.
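
This loop maps directly onto a small amount of code. The following library-agnostic sketch illustrates the interface the cycle implies; the environment and agent objects here are placeholders whose method names match the concrete classes built in Section 4.

Python

def run_bandit_loop(environment, agent, num_rounds):
    """Run the observe -> act -> reward -> update cycle for num_rounds steps."""
    total_reward = 0.0
    for t in range(num_rounds):
        context = environment.get_context()                # 1. observe context x_t
        action = agent.choose_action(context)              # 2. select action a_t
        reward = environment.get_reward(action, context)   # 3. observe reward r_t (partial feedback)
        agent.update(action, context, reward)              # 4. update the policy with (x_t, a_t, r_t)
        total_reward += reward
    return total_reward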

Key Distinctions in Decision-Making

The progression from A/B testing to multi-armed bandits and finally to contextual bandits represents a significant evolution in optimization strategy. While all are tools for making better decisions, their goals, mechanisms, and ideal use cases differ substantially.

A/B testing, the most traditional approach, aims to find a single "winner" with statistical significance by using a fixed, static traffic allocation (e.g., 50/50 split).1 It operates without context, generalizing across the entire population, and learning occurs only through post-hoc analysis. Its primary limitation is the opportunity cost of showing losing variations while waiting for results, and its inability to personalize.1 It is best suited for validating a major redesign where high confidence is paramount.

The multi-armed bandit (MAB) improves on this by using a dynamic and adaptive traffic allocation, shifting traffic towards better-performing actions in real-time to maximize cumulative reward.1 Like A/B testing, it is "context-free," seeking a single best action for all users.1 Its strength lies in online learning that balances exploration and exploitation, making it ideal for optimizing a single element (like a button color) where personalization is not critical.1 However, its one-size-fits-all assumption is its key limitation.1

The contextual bandit (CB) represents the leap to true personalization. Its primary goal is to maximize cumulative reward by finding the best action for each unique context.1 Traffic allocation is both dynamic and personalized, driven by predictions of what will work best for a specific user's situation.7 Context is the core of the model, and it employs online learning with function approximation to learn a policy that maps contexts to optimal actions.1 While far more powerful, its complexity is also its main challenge, as performance is highly dependent on the quality of the contextual features.17 It is the ideal tool for personalizing content, recommendations, ads, or any experience based on user attributes.7

This evolution reveals a critical shift in the nature of the problem itself. Moving from MAB to CB is not merely an incremental improvement; it represents a fundamental change in the problem's dimensionality and complexity. A MAB problem is essentially a K-dimensional challenge: estimating the mean reward for each of the K arms and identifying the best one.4 The complexity scales with the number of arms.

In contrast, a contextual bandit operates in a potentially vast, high-dimensional feature space X and must learn a policy function, π(x), that maps any given context to the best action.5 The challenge is no longer just estimating K simple averages but solving a full-fledged function approximation problem. This conceptual leap explains why the algorithmic toolkit for contextual bandits is so much richer and more complex than for MABs. It necessitates the integration of sophisticated machine learning models—such as linear regressors, decision trees, and neural networks—to handle the context and predict rewards.4 Consequently, challenges that are non-existent in the MAB world, such as feature engineering and the curse of dimensionality, become central to the successful implementation of a contextual bandit system.17

Section 3: A Tour of Core Contextual Bandit Algorithms

The essence of any bandit algorithm lies in how it navigates the exploration-exploitation tradeoff. In the contextual setting, this dilemma is about deciding whether to exploit the action that the current policy deems best for a given context, or to explore another action to improve the policy's accuracy for that context in the future. Three canonical strategies have emerged as foundational approaches: Epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling.

1. Epsilon-Greedy (ε-Greedy): The Simple Baseline

The Epsilon-Greedy strategy is prized for its simplicity and intuitive logic.11 It strikes a balance between exploration and exploitation through a single hyperparameter, ϵ (epsilon).

  • Mechanism: For any given context, the algorithm behaves as follows:
  • With a probability of 1−ϵ, it exploits its current knowledge by selecting the action that is predicted to yield the highest reward. This prediction is typically made by an underlying machine learning model (often called an "oracle") trained on the data collected so far.19
  • With a probability of ϵ, it explores by disregarding the predictions and choosing an action uniformly at random from the set of all available actions.4
  • Strengths: The primary advantage of ϵ-Greedy is its ease of implementation. It can be wrapped around almost any standard classification or regression model.11
  • Weaknesses: The exploration strategy is "undirected" or "unintelligent." When it decides to explore, it does so randomly, without distinguishing between an action that is genuinely promising but uncertain and one that is known to be consistently poor.21 This can lead to inefficient exploration and slower convergence. Furthermore, the performance is highly sensitive to the value of ϵ. A fixed ϵ can lead to persistent, unnecessary exploration and a linear accumulation of regret over time.20 To mitigate this, practitioners often use an ϵ-decreasing strategy, where ϵ starts high to encourage initial exploration and gradually decays over time, shifting the focus towards exploitation as the model becomes more confident.4 However, this decay schedule often requires manual tuning (a minimal sketch of such a decaying-ϵ agent follows this list).
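
Below is a minimal sketch of a contextual ϵ-greedy agent with a decaying exploration rate. It assumes the same per-arm linear reward model and A/b bookkeeping that the LinUCB implementation in Section 4 uses, but replaces the uncertainty bonus with random exploration; the 1/√t decay schedule is one common heuristic, not a prescribed choice.

Python

import numpy as np

class EpsilonGreedyAgent:
    """Contextual epsilon-greedy with a simple linear reward model per arm.
    A minimal sketch: exploration is uniform-random, and epsilon decays as 1/sqrt(t)."""
    def __init__(self, num_arms, num_features, epsilon_start=0.5):
        self.num_arms = num_arms
        self.epsilon_start = epsilon_start
        self.t = 0
        # Per-arm ridge-regression statistics (same form as LinUCB's A and b)
        self.A = [np.identity(num_features) for _ in range(num_arms)]
        self.b = [np.zeros(num_features) for _ in range(num_arms)]

    def choose_action(self, context):
        self.t += 1
        epsilon = self.epsilon_start / np.sqrt(self.t)  # decaying exploration rate
        if np.random.rand() < epsilon:
            return np.random.randint(self.num_arms)     # explore: pick a random arm
        # Exploit: pick the arm with the highest predicted reward under the linear model
        estimates = [context @ (np.linalg.inv(self.A[a]) @ self.b[a]) for a in range(self.num_arms)]
        return int(np.argmax(estimates))

    def update(self, arm_index, context, reward):
        self.A[arm_index] += np.outer(context, context)
        self.b[arm_index] += reward * context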

2. Upper Confidence Bound (UCB): Optimism in the Face of Uncertainty

The Upper Confidence Bound family of algorithms implements a powerful heuristic: "optimism in the face of uncertainty".21 Instead of choosing an action based on its average expected reward, UCB selects the action with the highest potential reward.

  • Core Principle: For each action, the UCB algorithm calculates a score that is the sum of two components: the current estimated reward and an "uncertainty bonus".9 This bonus term is larger for actions that the model is more uncertain about, effectively encouraging the algorithm to explore actions whose true value is not yet well understood.
  • Deep Dive: LinUCB: The most prominent and widely studied UCB algorithm for the contextual setting is LinUCB.9 It operates under a key assumption: the expected reward of an action is a linear function of its context features.16
  • Mathematical Intuition: For each arm a, LinUCB uses online ridge regression to maintain an estimate of a weight vector θ̂_a. The predicted reward for a given context x_t is the dot product x_t^T θ̂_a. The magic lies in how it quantifies uncertainty. The algorithm also maintains a covariance matrix A_a for each arm, which captures the information it has gathered from the contexts seen for that arm so far. The UCB score for arm a at time t is then calculated as:

    p_{t,a} = x_t^T θ̂_a + α √(x_t^T A_a^{-1} x_t)

    where α is a hyperparameter that controls the level of exploration.15
  • How it Works: The term √(x_t^T A_a^{-1} x_t) represents the uncertainty bonus. It is effectively the standard deviation of the reward prediction. When a new context x_t arrives that is "far away" from the contexts previously seen for arm a (in a linear algebra sense), the quadratic form x_t^T A_a^{-1} x_t will be large, inflating the UCB score and encouraging exploration of that arm in this new contextual region. As more data is collected for arm a, its covariance matrix A_a grows, its inverse A_a^{-1} shrinks, and the uncertainty bonus diminishes, leading the algorithm to favor exploitation.22 The hyperparameter α allows practitioners to tune the "appetite" for exploration; a larger α results in more optimistic and exploratory behavior.22

3. Thompson Sampling (TS): The Bayesian Approach

Thompson Sampling offers an elegant and often highly effective Bayesian alternative for managing the exploration-exploitation tradeoff.24

  • Core Principle: Instead of maintaining a single point estimate of the reward for each action, Thompson Sampling maintains a full posterior probability distribution over the reward model's parameters. It then leverages this distribution to make decisions via "probability matching".25
  • How it Works: At each time step, the algorithm follows a simple procedure:
  1. For each arm, draw a random sample of its reward model parameters from its current posterior distribution.
  2. Using these sampled parameters, calculate the expected reward for the current context for every arm.
  3. Select the arm that yields the highest expected reward based on these sampled parameters.
  4. Observe the true reward and use it to update the posterior distribution of the chosen arm's parameters (via Bayes' rule).10
  • Contextual Thompson Sampling: In the contextual setting, this involves maintaining a posterior distribution over the parameters of the underlying model (e.g., the weight vector θ_a in a linear model). For linear bandits, this is often implemented by placing a Gaussian prior on the weights and updating it with observed rewards, which also results in a Gaussian posterior.24 A minimal sketch of this linear-Gaussian variant appears after this list.
  • Strengths: Exploration is an intrinsic property of the algorithm. Arms with higher uncertainty will have wider posterior distributions, meaning there is a greater chance that a random sample from their distribution will be the highest, thus naturally encouraging exploration. As more data is collected, the posteriors become narrower and more concentrated around the true parameter values, leading to confident exploitation.28 Empirically, Thompson Sampling has been shown to be exceptionally robust and often achieves state-of-the-art performance, particularly in non-stationary environments or when feedback is delayed.21
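
The following is a minimal sketch of the linear-Gaussian variant described above (often called linear Thompson Sampling). Because the posterior over each arm's weight vector is Gaussian, sampling reduces to a multivariate normal draw; the scale parameter v is an assumed tuning knob controlling how widely the posterior is sampled.

Python

import numpy as np

class LinearTSAgent:
    """A minimal sketch of Thompson Sampling for linear contextual bandits.
    Keeps a Gaussian posterior over each arm's weight vector."""
    def __init__(self, num_arms, num_features, v=1.0):
        self.num_arms = num_arms
        self.v = v
        self.A = [np.identity(num_features) for _ in range(num_arms)]  # posterior precision
        self.b = [np.zeros(num_features) for _ in range(num_arms)]     # reward-weighted contexts

    def choose_action(self, context):
        sampled_rewards = np.zeros(self.num_arms)
        for arm in range(self.num_arms):
            A_inv = np.linalg.inv(self.A[arm])
            mu = A_inv @ self.b[arm]                      # posterior mean of theta_a
            # 1. Sample a plausible weight vector from the posterior N(mu, v^2 * A_inv)
            theta_sample = np.random.multivariate_normal(mu, (self.v ** 2) * A_inv)
            # 2. Score this arm with the sampled weights
            sampled_rewards[arm] = context @ theta_sample
        # 3. Play the arm that looks best under the sampled parameters
        return int(np.argmax(sampled_rewards))

    def update(self, arm_index, context, reward):
        # 4. Bayesian update of the chosen arm's posterior (Gaussian conjugacy)
        self.A[arm_index] += np.outer(context, context)
        self.b[arm_index] += reward * context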

The choice between these algorithms is not merely a matter of picking the one with the best theoretical regret bound. It involves deeper considerations about the system's operational requirements. UCB algorithms, like LinUCB, are deterministic: for a given history and context, the action chosen is always the same.29 This makes the system's behavior predictable, auditable, and easier to debug, which can be a significant advantage in high-stakes environments like finance or healthcare. Its "optimism" is an explicit, engineered heuristic.22

Thompson Sampling, in contrast, is a randomized algorithm.25 Its choices are based on random draws from a posterior, introducing inherent stochasticity into its policy. This randomization makes it more robust. If a deterministic algorithm like UCB makes a mistake (perhaps due to noisy or delayed feedback), it may get stuck exploiting a suboptimal action until its confidence bounds are laboriously updated. Thompson Sampling's inherent randomness makes it less likely to get stuck, allowing it to recover more gracefully and adapt more quickly to changes in the environment.21 Therefore, a business that prioritizes predictability and auditability might favor a UCB-based approach, while a business focused on maximizing engagement in a fast-changing domain like news recommendations might prefer the robustness of Thompson Sampling. The decision reflects a strategic tradeoff between system predictability and adaptive performance.

Section 4: Implementing Contextual Bandits in Python

Translating the theory of contextual bandits into practice is the most critical step for any practitioner. This section provides a step-by-step guide to implementing and evaluating contextual bandit algorithms in Python, starting with a simple from-scratch implementation to build intuition and progressing to using a library for a more production-oriented approach.

Part 1: Building a Simulated Environment with NumPy

Before testing any bandit algorithm, a controlled, simulated environment is needed to provide contexts and rewards. This allows for repeatable experiments where the "ground truth" is known, making it possible to accurately measure performance metrics like regret. The following Python class uses NumPy to create such an environment for a linear bandit problem.

Python

import numpy as np

class SimulatedEnvironment:
   """
   A simulated environment for a contextual bandit problem.
   The expected reward is a linear function of the context.
   """
   def __init__(self, num_arms, num_features):
       """
       Initializes the environment.
       Args:
           num_arms (int): The number of actions (arms).
           num_features (int): The dimensionality of the context vector.
       """
       self.num_arms = num_arms
       self.num_features = num_features
       
       # Secret weight matrix W. Each row corresponds to an arm's weight vector.
       # This is the "ground truth" the bandit algorithm needs to learn.
       self._true_weights = np.random.randn(self.num_arms, self.num_features)
       
       print("Simulated Environment Initialized.")
       print(f"Number of arms: {self.num_arms}")
       print(f"Number of features: {self.num_features}")

   def get_context(self):
       """
       Generates a random context vector.
       Returns:
           numpy.ndarray: A context vector of shape (num_features,).
       """
       return np.random.randn(self.num_features)

   def get_reward(self, arm_index, context):
       """
       Calculates the reward for a given arm and context.
       The reward is the true expected reward plus some Gaussian noise.
       Args:
           arm_index (int): The index of the chosen arm.
           context (numpy.ndarray): The context vector.
       Returns:
           float: The stochastic reward.
       """
       # Calculate the true expected reward (dot product)
       expected_reward = context @ self._true_weights[arm_index]
       
       # Add Gaussian noise to make the reward stochastic
       noise = np.random.normal(0, 0.1)
       
       return expected_reward + noise

   def get_optimal_reward(self, context):
       """
       Calculates the reward of the best possible arm for a given context.
       Args:
           context (numpy.ndarray): The context vector.
       Returns:
           (int, float): A tuple of (optimal_arm_index, optimal_reward).
       """
       expected_rewards = [context @ self._true_weights[i] for i in range(self.num_arms)]
       optimal_arm = np.argmax(expected_rewards)
       return optimal_arm, expected_rewards[optimal_arm]


This class encapsulates the core logic of the problem. It holds a secret weight matrix _true_weights that the bandit agent will try to learn. The get_reward method provides stochastic feedback, mimicking real-world noise.9

Part 2: Coding LinUCB from Scratch

Implementing an algorithm from scratch is an excellent way to build a deep understanding of its mechanics. The following LinUCBAgent class is a direct translation of the LinUCB algorithm's mathematical formulation.

Python

class LinUCBAgent:
   """
   A from-scratch implementation of the LinUCB algorithm.
   """
   def __init__(self, num_arms, num_features, alpha=1.0):
       """
       Initializes the LinUCB agent.
       Args:
           num_arms (int): The number of actions.
           num_features (int): The dimensionality of the context.
           alpha (float): The exploration parameter.
       """
       self.num_arms = num_arms
       self.num_features = num_features
       self.alpha = alpha
       
       # Initialize A and b for each arm
       # A is the covariance matrix (d x d)
       # b is the reward vector (d x 1)
       self.A = [np.identity(self.num_features) for _ in range(self.num_arms)]
       self.b = [np.zeros(self.num_features) for _ in range(self.num_arms)]

   def choose_action(self, context):
       """
       Chooses an action based on the UCB principle.
       Args:
           context (numpy.ndarray): The current context vector.
       Returns:
           int: The index of the chosen arm.
       """
       p = np.zeros(self.num_arms)
       for arm in range(self.num_arms):
           A_inv = np.linalg.inv(self.A[arm])
           theta = A_inv @ self.b[arm]  # Estimated weight vector
           
           # Calculate UCB
           expected_reward = theta @ context
           uncertainty_bonus = self.alpha * np.sqrt(context @ A_inv @ context)
           p[arm] = expected_reward + uncertainty_bonus
           
       # Choose the arm with the highest UCB
       return np.argmax(p)

   def update(self, arm_index, context, reward):
       """
       Updates the agent's parameters based on the observed reward.
       Args:
           arm_index (int): The index of the arm that was chosen.
           context (numpy.ndarray): The context for which the action was taken.
           reward (float): The observed reward.
       """
       self.A[arm_index] += np.outer(context, context)
       self.b[arm_index] += reward * context

This implementation directly mirrors the theory. The choose_action method calculates the UCB score for each arm, and the update method updates the A and b components for the chosen arm, allowing the model to learn from experience.9

Part 3: Using Libraries for Advanced Algorithms (Thompson Sampling)

While building from scratch is instructive, in a production setting, it is more practical to use well-tested, optimized libraries. The contextualbandits package is an excellent choice because it allows practitioners to use familiar scikit-learn classifiers as the underlying "oracles" for bandit algorithms. This modularity is powerful. Here is how to set up a Thompson Sampling agent using this library.

Python

# First, ensure the library is installed:
# pip install contextualbandits scikit-learn

from contextualbandits.online import BootstrappedTS
from sklearn.linear_model import SGDClassifier

def create_ts_agent(num_arms):
   """
   Creates a Thompson Sampling agent using the contextualbandits library.
   Args:
       num_arms (int): The number of actions.
   Returns:
       BootstrappedTS: An initialized Thompson Sampling agent.
   """
   # Use a simple logistic regression model from scikit-learn as the base classifier
   # The bandit algorithm will manage this model internally.
   base_classifier = SGDClassifier(loss="log_loss", max_iter=1000, tol=1e-3)
   # Pass the base estimator as the first positional argument; batch_train=True
   # allows incremental updates via partial_fit, which the simulation loop below uses.
   ts_agent = BootstrappedTS(base_classifier, nchoices=num_arms, batch_train=True)
   return ts_agent

This approach demonstrates a path to production. Instead of managing linear algebra, the focus shifts to selecting an appropriate base model (SGDClassifier in this case) and letting the library handle the bandit logic.14 The BootstrappedTS class uses bootstrapping to approximate the posterior distribution required for Thompson Sampling.

Part 4: Simulation and Evaluation

With the environment and agents defined, the final step is to run a simulation to compare their performance. The following script orchestrates the experiment, tracks key metrics, and visualizes the results.

Python

import matplotlib.pyplot as plt

# --- Simulation Parameters ---
NUM_ARMS = 5
NUM_FEATURES = 10
NUM_ROUNDS = 1000
ALPHA_UCB = 2.0  # Exploration parameter for LinUCB

# --- Initialization ---
env = SimulatedEnvironment(NUM_ARMS, NUM_FEATURES)
linucb_agent = LinUCBAgent(NUM_ARMS, NUM_FEATURES, alpha=ALPHA_UCB)
ts_agent = create_ts_agent(NUM_ARMS)

# --- Data Tracking ---
regret_linucb = []
regret_ts = []
cumulative_regret_linucb = 0
cumulative_regret_ts = 0

# --- Simulation Loop ---
for t in range(NUM_ROUNDS):
   context = env.get_context()
   
   # Get optimal reward for regret calculation
   optimal_arm, optimal_reward = env.get_optimal_reward(context)
   
   # --- LinUCB Agent ---
   chosen_arm_linucb = linucb_agent.choose_action(context)
   reward_linucb = env.get_reward(chosen_arm_linucb, context)
   linucb_agent.update(chosen_arm_linucb, context, reward_linucb)
   
   # Calculate and store regret
   instantaneous_regret_linucb = optimal_reward - reward_linucb
   cumulative_regret_linucb += instantaneous_regret_linucb
   regret_linucb.append(cumulative_regret_linucb)
   
   # --- Thompson Sampling Agent ---
   # The library expects context in a 2D array
   context_2d = context.reshape(1, -1)
   # predict() returns an array of chosen arms (one per input row); take the single element
   chosen_arm_ts = int(ts_agent.predict(context_2d)[0])
   reward_ts = env.get_reward(chosen_arm_ts, context)
   
   # The library expects reward as 0 or 1 for this classifier, so we binarize it
   # This is a simplification for the example.
   binary_reward_ts = 1 if reward_ts > 0 else 0
   ts_agent.partial_fit(context_2d, np.array([chosen_arm_ts]), np.array([binary_reward_ts]))
   
   # Calculate and store regret
   instantaneous_regret_ts = optimal_reward - reward_ts
   cumulative_regret_ts += instantaneous_regret_ts
   regret_ts.append(cumulative_regret_ts)

   if (t + 1) % 100 == 0:
       print(f"Round {t + 1}/{NUM_ROUNDS} completed.")

# --- Visualization ---
plt.figure(figsize=(12, 6))
plt.plot(regret_linucb, label="LinUCB (from scratch)")
plt.plot(regret_ts, label="Thompson Sampling (library)")
plt.title("Cumulative Regret Over Time")
plt.xlabel("Round")
plt.ylabel("Cumulative Regret")
plt.legend()
plt.grid(True)
plt.show()

The plot generated by this simulation is a powerful tool. It provides a direct, visual comparison of how effectively different algorithms learn.30 Typically, one would observe that both algorithms exhibit sub-linear regret (the curve flattens over time), indicating that they are successfully learning the optimal policy.

This dual approach to implementation—first building from scratch to foster deep understanding, then using a library to demonstrate a practical workflow—highlights a crucial aspect of applied machine learning. The journey from a theoretical concept to a production-ready system involves a trade-off between pedagogical clarity and practical efficiency. A from-scratch implementation forces engagement with the underlying mathematics, which is invaluable for debugging and intuition.9 However, it may not be numerically stable or computationally optimized; for instance, repeatedly calculating a matrix inverse is inefficient and can be unstable.16 Libraries like contextualbandits abstract these low-level details, allowing practitioners to focus on higher-level tasks like algorithm selection, feature engineering, and model tuning—activities that are much closer to real-world machine learning work.14 This tutorial structure aims to serve both the "scholar" who wants to understand the "why" and the "practitioner" who needs to know the "how."
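
As a concrete example of the kind of optimization a library (or a careful from-scratch implementation) would apply: LinUCB's A matrix only ever changes by a rank-one update (A ← A + x xᵀ), so its inverse can be maintained incrementally with the Sherman-Morrison identity instead of being recomputed with np.linalg.inv on every call to choose_action. The snippet below is a sketch of that update as one way to adapt the LinUCBAgent from Part 2.

Python

import numpy as np

def sherman_morrison_update(A_inv, x):
    """Return the inverse of (A + x x^T) given A_inv, without re-inverting.
    Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x x^T A^-1) / (1 + x^T A^-1 x)."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

# Sketch: store a cached self.A_inv per arm (initialized to the identity) in LinUCBAgent,
# read it directly in choose_action, and in update() maintain it with:
#     self.A_inv[arm_index] = sherman_morrison_update(self.A_inv[arm_index], context)
#     self.b[arm_index] += reward * context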

Section 5: Contextual Bandits in the Wild: Industry Applications

The true power of contextual bandits is realized when they are applied to solve complex, real-world decision-making problems. Their ability to deliver personalized experiences in real-time has led to their adoption across a wide range of industries. A recurring theme across these applications is that success hinges less on the raw sophistication of the algorithm and more on the thoughtful formulation of the problem itself: defining a relevant context, a set of meaningful actions, and a clean, measurable reward signal.

1. Personalized Recommendations and Advertising

This is the quintessential application domain for contextual bandits, where the goal is to dynamically select the most relevant content for each user from a rapidly changing pool of options.1

  • Problem: To select which advertisement, news article, or product to display to a user to maximize engagement (e.g., clicks) or revenue.5 This is particularly effective in environments where the set of available items changes quickly, such as news feeds or ad inventories, making traditional batch-trained models too slow to adapt.28
  • Context: A rich vector of user features, including demographics (age, gender), location, device type, browsing history, and temporal data like the time of day.7
  • Actions: The set of available ads, articles, or products that can be recommended at that moment.5
  • Reward: A measurable outcome, typically a binary signal like a click/no-click, or a continuous value like the revenue generated from a conversion.6
  • Case Study (Streaming Service): A platform like Netflix can use a contextual bandit to personalize the "poster" art for a movie or show. The context is the user's viewing history (e.g., do they respond more to posters featuring the romantic leads or the action sequences?). The actions are the different available poster images for a given title. The reward is a click-to-play event. The bandit learns which type of creative resonates most with which type of user, maximizing engagement.8

2. Dynamic Pricing

Contextual bandits provide a powerful framework for moving beyond static pricing to real-time, personalized price optimization.36

  • Problem: To set the optimal price for a product or service on-the-fly to maximize the probability of a sale or the total revenue.36
  • Context: Customer attributes (e.g., income level, purchase history, loyalty status), real-time demand signals, competitor pricing, and market conditions.36
  • Actions: A discrete set of possible prices to offer. It is crucial that these actions are meaningfully different. For example, testing prices of $1, $5, and $10 is far more informative for a bandit than testing $4.95, $5.00, and $5.05, as the former allows for much broader exploration of the price-demand curve.36
  • Reward: A binary signal indicating whether a purchase was made at the offered price, or the continuous revenue value from the sale.36
  • Case Study (Airline Industry): An airline can use a contextual bandit to price tickets. The context would include the user's search history (e.g., are they a business or leisure traveler?), the number of days until departure, and current seat inventory. The actions would be a set of different fare classes or price points. The reward is a successful ticket purchase.11
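
A pricing problem like the airline example can be wired into the same bandit interface used in Section 4: the discrete price points become the arms, and the reward is the revenue realized at the offered price. The sketch below is illustrative only; the price points are arbitrary, and purchase_probability stands in for the unknown demand curve that a simulator or offline estimate would supply.

Python

import numpy as np

# Illustrative only: discrete, meaningfully different price points serve as the arms.
PRICE_ARMS = [49.0, 79.0, 99.0, 129.0]

def pricing_reward(arm_index, purchase_probability):
    """Reward for a pricing bandit: revenue if the customer buys, 0 otherwise."""
    price = PRICE_ARMS[arm_index]
    purchased = np.random.rand() < purchase_probability
    return price if purchased else 0.0

# Example wiring with the from-scratch agent from Part 2 (hypothetical context features):
# context = np.array([days_until_departure, past_purchases, is_business_traveler, ...])
# agent = LinUCBAgent(num_arms=len(PRICE_ARMS), num_features=len(context))
# chosen_price_index = agent.choose_action(context)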

3. Website and User Interface (UI) Personalization

Contextual bandits are a powerful tool for moving beyond simple A/B tests of single elements to optimizing the entire user experience on a website or application.41 This involves personalizing layouts, content, and user flows to maximize engagement and conversions.8

  • Problem: To dynamically select the optimal combination of UI elements—such as welcome messages, image sizes, content modules, or entire page layouts—to present to a user.41 Instead of testing individual changes, bandits can optimize an entire "layout bundle" at once, learning which combination of features works best for different user segments.41
  • Context: A rich set of user attributes, including login status, geographic location, device type, time of day, browsing history, and past purchase behavior.7
  • Actions: The different UI variations available. For bandits to be effective, these actions should be bold and diverse rather than minor tweaks.41 For example, a travel website might test different welcome messages, header colors, and search module sizes as distinct actions.41
  • Reward: A measurable user interaction that aligns with a key business goal, such as click-through rate, conversion, or time on page.41
  • Case Study (E-commerce and Travel): A company like Expedia uses contextual bandits to optimize its landing pages. Instead of running isolated A/B tests, they define several page components with multiple variants each. The bandit then tests combinations of these as complete layouts. The context includes the user's country, login status, and the channel they arrived from. The actions are the various layout bundles, and the reward is the click-through rate. This allows them to learn which overall layout is optimal for different segments (e.g., French vs. Spanish customers) and test bolder design ideas with less risk.41 Similarly, a retail site might personalize its homepage product carousel based on a user's shopping history and frequency.8

4. Finance

Financial institutions are increasingly using contextual bandits to personalize product offerings and move beyond rule-based targeting.38

  • Problem: To recommend the most appropriate financial product to a customer to increase the probability of conversion or to optimize a portfolio strategy.38
  • Context: The customer's detailed financial profile, including their credit score, annual income, spending habits (e.g., high travel spend), existing loans, and whether they are a small business owner.38
  • Actions: The portfolio of available financial products, such as a travel rewards credit card, an auto loan refinancing offer, or a business credit card.38
  • Reward: A successful conversion, such as a completed application for the offered product. It is critical that this reward is tracked as closely as possible to the point of decision to avoid confounding factors.8
  • Case Study (Personalized Banking Offers): A bank implements a contextual bandit on its website homepage. When a user logs in, the bandit analyzes their profile (context). A user with a high-interest car loan is shown an ad for auto loan refinancing (action). A user who recently registered a small business is shown an offer for a business credit card (action). The reward is a click-through to the application page and, ultimately, a submitted application.38

5. Gaming

The gaming industry uses contextual bandits to fine-tune the player experience, balancing engagement, retention, and monetization.40

  • Problem: To dynamically adjust in-game parameters like difficulty, ad frequency, or the content of in-app purchase (IAP) offers to maximize player satisfaction and revenue.40
  • Context: The player's profile, including their skill level, spending history, engagement patterns (e.g., session length), and current in-game resource balances.40
  • Actions: Different game difficulty settings, various ad display frequencies, or a range of IAP bundles with different contents and price points.40
  • Reward: A metric aligned with the specific optimization goal, such as player retention rate, average session length, or IAP revenue per player.40

Across all these diverse applications, a clear pattern emerges. The most successful implementations are not necessarily those with the most complex deep learning models, but those that excel at problem formulation. A simple LinUCB model fueled by well-engineered, relevant contextual features and a clean, directly attributable reward signal will consistently outperform a sophisticated neural bandit operating on noisy, poorly defined inputs. This underscores that the critical first step for any practitioner is to move beyond the algorithm and deeply consider the system's components: what is the most predictive context available? Are the actions distinct and meaningful? And is the reward a true measure of success? Getting these elements right is the foundation upon which all algorithmic success is built.

Section 6: Advanced Topics and Future Frontiers

As contextual bandits move from research labs to large-scale production systems, practitioners face a new set of challenges that push the boundaries of the classical framework. These advanced topics—feature engineering, non-stationarity, and the use of deep learning—reveal a persistent tension in applied machine learning between model complexity, computational feasibility, and the comfort of theoretical guarantees.

1. The Critical Role of Feature Engineering

The adage "garbage in, garbage out" is especially true for contextual bandits. The performance of any CB system is critically dependent on the quality and relevance of its contextual features.17

  • The Challenge: A bandit algorithm can only learn to personalize if the context provides meaningful signals that differentiate user preferences. Including irrelevant or noisy features can confuse the model and degrade performance, while overly specific features (like a unique user ID) prevent the model from generalizing its learnings across users.40
  • Best Practices and Solutions:
  • Domain Knowledge: The most effective approach often begins with leveraging domain expertise to hand-select features that are known to be predictive of the reward. For example, in a clinical trial, a known comorbidity is a much stronger feature than a patient's zip code.46
  • Dimensionality Reduction: When dealing with high-dimensional context spaces, techniques like Principal Component Analysis (PCA) can be used to distill the information into a more manageable set of features, improving computational efficiency.47 A minimal sketch of this approach appears after this list.
  • Automated Feature Engineering: As machine learning platforms evolve, automated tools are emerging to handle this process. Services like Google's AutoML Tables can take raw structured data and automatically perform feature engineering, architecture search, and hyperparameter tuning, making contextual bandits more accessible to teams without specialized ML expertise.6
  • Adaptive Representation Learning: An active area of research involves creating models that learn the feature representation online. These approaches can start with an offline pre-training on unlabeled context data and then adaptively select and refine the feature encoding as new rewards are observed, allowing the model to discover the most relevant features on its own.48
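
The following is a minimal sketch of the dimensionality-reduction idea mentioned above: fit a PCA projection on logged context vectors (offline or on an initial batch), then feed the compressed contexts to the same agents built in Section 4. The variable historical_contexts and the number of components are illustrative assumptions.

Python

import numpy as np
from sklearn.decomposition import PCA

def build_context_compressor(historical_contexts, n_components=10):
    """Fit a PCA projection on logged contexts so the bandit sees a compact feature vector.
    `historical_contexts` is assumed to be an (n_samples, n_raw_features) array."""
    pca = PCA(n_components=n_components)
    pca.fit(historical_contexts)
    return pca

# Usage sketch with the from-scratch agent from Part 2:
# pca = build_context_compressor(historical_contexts)
# agent = LinUCBAgent(num_arms=5, num_features=pca.n_components_)
# compressed = pca.transform(raw_context.reshape(1, -1))[0]
# action = agent.choose_action(compressed)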

2. Handling Non-Stationary Environments

A core assumption of many classical bandit algorithms is that the environment is stationary—that the underlying relationship between contexts, actions, and rewards remains constant over time. In the real world, this is rarely the case. User preferences drift, new products are introduced, and market dynamics evolve.34 An algorithm trained on last month's data may be suboptimal for today's reality.

  • The Problem: In a non-stationary world, competing against the best fixed policy over all time becomes a questionable benchmark. An algorithm must be able to adapt to these changes, forgetting old patterns and learning new ones.34
  • Approaches and Regret Notions:
  • Change Detection and Resetting: A practical approach is to augment a standard bandit algorithm with a statistical test that monitors the data stream for distribution shifts. When a significant change is detected, the algorithm's learning state can be partially or fully reset, forcing it to re-learn based on the new reality.34
  • Sliding Windows: A simpler heuristic is to train the model only on a "sliding window" of the most recent data (e.g., the last 7 days), which naturally allows the model to forget outdated information. A sketch of this idea, applied to the LinUCB bookkeeping from Section 4, appears after this list.
  • Adaptive Algorithms and Dynamic Regret: More advanced research focuses on developing algorithms specifically designed for non-stationary environments. These are evaluated using more appropriate metrics like dynamic regret (regret against the best policy at each time step) or switching regret (regret that scales with the number of times the optimal policy changes).34
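
Below is a heuristic sketch of the sliding-window idea applied to the LinUCBAgent from Part 2 of Section 4: observations that fall out of the window are simply subtracted back out of the per-arm A and b statistics. The window size is an illustrative choice.

Python

import numpy as np
from collections import deque

class SlidingWindowLinUCB(LinUCBAgent):
    """LinUCB variant that only 'remembers' the most recent `window` interactions.
    A heuristic sketch: expired samples are subtracted back out of A and b,
    so the model gradually forgets outdated reward patterns."""
    def __init__(self, num_arms, num_features, alpha=1.0, window=500):
        super().__init__(num_arms, num_features, alpha)
        self.window = window
        self.history = deque()  # stores (arm_index, context, reward)

    def update(self, arm_index, context, reward):
        # Add the new observation as usual
        super().update(arm_index, context, reward)
        self.history.append((arm_index, context, reward))
        # Forget the oldest observation once the window is full
        if len(self.history) > self.window:
            old_arm, old_x, old_r = self.history.popleft()
            self.A[old_arm] -= np.outer(old_x, old_x)
            self.b[old_arm] -= old_r * old_x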

3. The Rise of Neural Bandits: Deep Learning for Complex Rewards

While linear models like LinUCB are efficient and theoretically well-understood, their fundamental assumption of linearity can be a major limitation. Many real-world reward functions are inherently complex and non-linear.52 This has led to the rise of neural bandits, which leverage the power of deep neural networks (DNNs) as universal function approximators.53

  • Motivation: DNNs can learn intricate, non-linear relationships between high-dimensional context inputs and rewards, offering the potential for significantly better performance where linear assumptions fail.52
  • How They Work: A common and effective architecture is based on the principle of "deep representation, shallow exploration." In this model, the initial layers of a DNN are used to learn a rich, low-dimensional feature representation from the raw context. A classical linear bandit algorithm, like LinUCB or Thompson Sampling, then operates on this learned feature embedding in the final layer of the network.54 This hybrid approach aims to combine the expressive power of deep learning for representation with the proven efficiency and theoretical guarantees of linear bandits for exploration. A schematic sketch of this pattern appears after this list.
  • Benefits: Neural bandits have demonstrated remarkable empirical performance, significantly outperforming their classical counterparts in tasks with complex reward structures.52
  • Challenges: The primary challenge is the immense difficulty of performing exploration efficiently and with theoretical guarantees. Performing UCB or Thompson Sampling over the millions of parameters in a modern DNN is computationally intractable. This forces practitioners to rely on approximations, such as assuming the covariance matrix needed for exploration is diagonal. Such approximations make the algorithms practical but break the theoretical regret bounds that were proven for their linear counterparts, creating a significant gap between theory and practice.52
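
The sketch below illustrates the "representation plus shallow exploration" pattern schematically. To stay dependency-free, it uses a frozen random ReLU projection as a stand-in for the trained network encoder; in a real neural bandit the encoder would be a DNN trained on logged (context, action, reward) data. It reuses the LinUCBAgent from Part 2 of Section 4 as the linear exploration head.

Python

import numpy as np

class ShallowExplorationBandit:
    """Sketch of 'deep representation, shallow exploration': a fixed non-linear
    encoder maps the raw context into an embedding, and a linear bandit
    (the LinUCBAgent from Section 4) performs exploration on that embedding.
    The frozen random projection below is a stand-in for a trained DNN encoder."""
    def __init__(self, num_arms, num_raw_features, embed_dim=32, alpha=1.0):
        rng = np.random.default_rng(0)
        self.W = rng.normal(size=(num_raw_features, embed_dim)) / np.sqrt(num_raw_features)
        self.linear_head = LinUCBAgent(num_arms, embed_dim, alpha=alpha)

    def _embed(self, context):
        return np.maximum(context @ self.W, 0.0)  # ReLU "representation layer"

    def choose_action(self, context):
        return self.linear_head.choose_action(self._embed(context))

    def update(self, arm_index, context, reward):
        self.linear_head.update(arm_index, self._embed(context), reward)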

The evolution from simple linear bandits to complex neural bandits navigating non-stationary worlds reveals a fundamental tension at the heart of applied machine learning. Linear bandits offer strong theoretical guarantees and are computationally cheap, but their models of the world are often too simple.16 Neural bandits provide far more realistic and expressive models, leading to superior empirical results, but this comes at the cost of computational complexity and the loss of formal theoretical guarantees when practical approximations are made.52 The frontier of contextual bandit research is not a search for a single "best" algorithm. Rather, it is about understanding and navigating this multi-dimensional trade-off space. The task for the practitioner is to select an approach that appropriately balances model complexity, computational budget, and the need for theoretical assurances, recognizing that in a production environment, the most advanced model is not always the most effective one.

Conclusion: Integrating Intelligent Decision-Making into Your Systems

The journey from the rigid world of A/B testing to the dynamic, personalized decision-making of contextual bandits represents a paradigm shift in how we build intelligent systems. We have moved from seeking a single, universal "best" to understanding what is best for a specific individual in a specific context, at a specific moment in time. This is the essence of true 1-to-1 personalization.

This exploration has revealed several key principles for practitioners. First, the power of contextual bandits lies in their ability to continuously learn and adapt, minimizing regret and maximizing outcomes in real-time. Second, the choice of algorithm—be it the simple Epsilon-Greedy, the deterministic LinUCB, or the robust Thompson Sampling—is a strategic decision that must be aligned with the problem's characteristics and the operational constraints of the system, such as the need for predictability or the tolerance for computational complexity.

Most importantly, successful implementation is less about algorithmic sophistication and more about thoughtful problem formulation. The foundation of any effective contextual bandit system is a set of well-engineered contextual features, a portfolio of meaningfully diverse actions, and a clean, directly attributable reward signal that aligns with true business value.

The future of this field lies in pushing these boundaries further—developing more robust algorithms for non-stationary environments, creating more computationally feasible yet theoretically sound neural bandits, and ultimately, making these powerful techniques more accessible, scalable, and easier to deploy. By embracing the principles of contextual bandits, organizations can build systems that are not merely reactive, but are truly adaptive, learning from every interaction to deliver increasingly intelligent and personalized experiences at scale.

Works cited

  1. What is a multi-armed bandit? - Optimizely, accessed July 31, 2025, https://www.optimizely.com/optimization-glossary/multi-armed-bandit/
  2. Beyond A/B testing: Multi-armed bandit experiments - Dynamic Yield, accessed July 31, 2025, https://www.dynamicyield.com/lesson/contextual-bandit-optimization/
  3. www.optimizely.com, accessed July 31, 2025, https://www.optimizely.com/optimization-glossary/multi-armed-bandit/#:~:text=The%20term%20%22multi%2Darmed%20bandit,through%20a%20series%20of%20choices.
  4. Multi-armed bandit - Wikipedia, accessed July 31, 2025, https://en.wikipedia.org/wiki/Multi-armed_bandit
  5. Lecture 10: Contextual Bandits 1 Introduction 2 Example applications 3 Notation - Washington, accessed July 31, 2025, https://courses.cs.washington.edu/courses/cse599i/18wi/resources/lecture10/lecture10.pdf
  6. How to build better contextual bandits machine learning models | Google Cloud Blog, accessed July 31, 2025, https://cloud.google.com/blog/products/ai-machine-learning/how-to-build-better-contextual-bandits-machine-learning-models
  7. Understanding contextual bandits: a guide to dynamic decision-making - Kameleoon, accessed July 31, 2025, https://www.kameleoon.com/blog/contextual-bandits
  8. Contextual bandits: The next step in personalization - Optimizely, accessed July 31, 2025, https://www.optimizely.com/insights/blog/contextual-bandits-in-personalization/
  9. Contextual Multi-Armed Bandit Problems in Reinforcement Learning - HackerNoon, accessed July 31, 2025, https://hackernoon.com/contextual-multi-armed-bandit-problems-in-reinforcement-learning
  10. Understanding Contextual Bandits: Advanced Decision-Making in Machine Learning | by Kapardhi kannekanti | Medium, accessed July 31, 2025, https://medium.com/@kapardhikannekanti/understanding-contextual-bandits-advanced-decision-making-in-machine-learning-85c7c20417d7
  11. Mastering Contextual Bandit with Greedy Algorithms - Number Analytics, accessed July 31, 2025, https://www.numberanalytics.com/blog/contextual-bandit-greedy-algorithms-guide
  12. Contextual Bandits For Ad Optimization - Meegle, accessed July 31, 2025, https://www.meegle.com/en_us/topics/contextual-bandits/contextual-bandits-for-ad-optimization
  13. jupyter-notebooks/Contextual_bandits_and_Vowpal_Wabbit.ipynb at master - GitHub, accessed July 31, 2025, https://github.com/VowpalWabbit/jupyter-notebooks/blob/master/Contextual_bandits_and_Vowpal_Wabbit.ipynb
  14. david-cortes/contextualbandits: Python implementations of contextual bandits algorithms - GitHub, accessed July 31, 2025, https://github.com/david-cortes/contextualbandits
  15. Linear Upper Confidence Bound Algorithm for Contextual Bandit Problem with Piled Rewards, accessed July 31, 2025, https://www.csie.ntu.edu.tw/~htlin/paper/doc/pakdd16piled.pdf
  16. Contextual Bandits with Linear Payoff Functions - Proceedings of Machine Learning Research, accessed July 31, 2025, http://proceedings.mlr.press/v15/chu11a/chu11a.pdf
  17. Mastering Contextual Bandit Algorithms - Number Analytics, accessed July 31, 2025, https://www.numberanalytics.com/blog/mastering-contextual-bandit-algorithms
  18. An Overview of Contextual Bandits - Towards Data Science, accessed July 31, 2025, https://towardsdatascience.com/an-overview-of-contextual-bandits-53ac3aa45034/
  19. Solving Contextual Bandits with Greediness - Renan F. Cunha, accessed July 31, 2025, https://renan-cunha.github.io/posts/greedy/
  20. Deep Contextual Multi-armed Bandits - Gatsby Computational ..., accessed July 31, 2025, http://www.gatsby.ucl.ac.uk/~balaji/udl-camera-ready/UDL-4.pdf
  21. Bandits for Recommender Systems - ApplyingML, accessed July 31, 2025, https://applyingml.com/resources/bandits/
  22. A Reliable Contextual Bandit Algorithm: LinUCB - Tru... - True Theta, accessed July 31, 2025, https://truetheta.io/concepts/reinforcement-learning/lin-ucb/
  23. Lecture 8: Linear Bandit (April 21) 8.1 Problem Setup 8.2 LinUCB Algorithm, accessed July 31, 2025, https://cseweb.ucsd.edu/~yuxiangw/classes/RLCourse-2021Spring/Lectures/scribe_linear_bandit.pdf
  24. Thompson Sampling for Contextual Bandits with Linear Payoffs, accessed July 31, 2025, https://proceedings.mlr.press/v28/agrawal13.html
  25. Generalized Thompson Sampling for Contextual Bandits - Microsoft, accessed July 31, 2025, https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/paper-28.pdf
  26. PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits, accessed July 31, 2025, https://oar.princeton.edu/bitstream/88435/pr1t838/1/ContextualBandits.pdf
  27. Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits - Proceedings of Machine Learning Research, accessed July 31, 2025, https://proceedings.mlr.press/v202/chakraborty23b/chakraborty23b.pdf
  28. Bandits for Recommender Systems - Eugene Yan, accessed July 31, 2025, https://eugeneyan.com/writing/bandits/
  29. Why am I getting better performance with Thompson sampling than with UCB or ϵ-greedy in a multi-armed bandit problem? [closed] - AI Stack Exchange, accessed July 31, 2025, https://ai.stackexchange.com/questions/21917/why-am-i-getting-better-performance-with-thompson-sampling-than-with-ucb-or-ep
  30. Basics of Contextual Bandits - Kaggle, accessed July 31, 2025, https://www.kaggle.com/code/phamvanvung/basics-of-contextual-bandits
  31. Contextual Bandits — Contextual Bandits documentation, accessed July 31, 2025, https://contextual-bandits.readthedocs.io/
  32. A Tutorial on Multi-Armed Bandits with Per-Arm Features ..., accessed July 31, 2025, https://www.tensorflow.org/agents/tutorials/per_arm_bandits_tutorial
  33. [1003.0146] A Contextual-Bandit Approach to Personalized News Article Recommendation - arXiv, accessed July 31, 2025, https://arxiv.org/abs/1003.0146
  34. Efficient Contextual Bandits in Non-stationary Worlds, accessed July 31, 2025, http://proceedings.mlr.press/v75/luo18a/luo18a.pdf
  35. When Should I Use Contextual Bandit Algorithms vs. Recommendation Systems? - Eppo, accessed July 31, 2025, https://www.geteppo.com/blog/contextual-bandit-algorithms-vs-recommendation-systems
  36. Contextual Bandits: Dynamic Pricing and Real-Time Prediction ..., accessed July 31, 2025, https://gganbumarketplace.com/machine-learning/contextual-bandits-dynamic-pricing-and-real-time-prediction/
  37. [2109.07340] Distribution-free Contextual Dynamic Pricing - arXiv, accessed July 31, 2025, https://arxiv.org/abs/2109.07340
  38. Contextual bandits – Support Help Center - Optimizely Support, accessed July 31, 2025, https://support.optimizely.com/hc/en-us/articles/29328842964109-Contextual-bandits
  39. Multi-Task Contextual Dynamic Pricing - arXiv, accessed July 31, 2025, https://arxiv.org/html/2410.14839v1
  40. Beyond A/B Testing: How Contextual Bandits Underpin ... - Metica, accessed July 31, 2025, https://metica.com/blog/how-to-design-contextual-bandits
  41. Adaptive Products and Contextual Bandits — A New Way of ..., accessed August 1, 2025, https://medium.com/expedia-group-tech/adaptive-products-and-contextual-bandits-a-new-way-of-optimising-websites-be5e02626642
  42. Contextual bandits: Personalized testing at scale - Statsig, accessed August 1, 2025, https://www.statsig.com/perspectives/personalized-testing-at-scale
  43. Contextual Bandits For Article Recommendations - Meegle, accessed August 1, 2025, https://www.meegle.com/en_us/topics/contextual-bandits/contextual-bandits-for-article-recommendations
  44. Contextual Bandit for Marketing Treatment Optimization - About Wayfair, accessed August 1, 2025, https://www.aboutwayfair.com/careers/tech-blog/contextual-bandit-for-marketing-treatment-optimization
  45. Contextual Bandits: Theory and Practice - Number Analytics, accessed July 31, 2025, https://www.numberanalytics.com/blog/contextual-bandits-theory-and-practice
  46. A Contextual-Bandit-Based Approach for Informed Decision-Making ..., accessed July 31, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC9410371/
  47. Contextual Bandits: A Deep Dive - Number Analytics, accessed July 31, 2025, https://www.numberanalytics.com/blog/contextual-bandits-deep-dive
  48. [1802.00981] Contextual Bandit with Adaptive Feature Extraction - arXiv, accessed July 31, 2025, https://arxiv.org/abs/1802.00981
  49. Contextual-Bandit Based Personalized Recommendation with Time-Varying User Interests, accessed July 31, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/6125/5981
  50. Efficient Contextual Bandits in Non-stationary Worlds, accessed July 31, 2025, https://proceedings.mlr.press/v75/luo18a.html
  51. Multi-Armed Bandits with Non-Stationary Means - Washington, accessed July 31, 2025, https://courses.cs.washington.edu/courses/cse599m/21sp/resources/week9_scribe.pdf
  52. LEARNING NEURAL CONTEXTUAL BANDITS ... - OpenReview, accessed July 31, 2025, https://openreview.net/pdf?id=7inCJ3MhXt3
  53. sauxpa/neural_exploration: Study NeuralUCB and regret analysis for contextual bandit with neural decision - GitHub, accessed July 31, 2025, https://github.com/sauxpa/neural_exploration
  54. Neural Contextual Bandits with Deep Representation and Shallow ..., accessed July 31, 2025, https://openreview.net/forum?id=xnYACQquaGV
  55. DEEP LEARNING WITH LOGGED BANDIT FEEDBACK - Computer ..., accessed July 31, 2025, https://www.cs.cornell.edu/~tj/publications/joachims_etal_18a.pdf