The Unseen Hand: Guiding a Virtual Drone with Sparse and Dense Rewards

Michael Kudlaty
September 1, 2025

Introduction: Simulating Agile Flight - The Convergence of Physics and AI

Autonomous drone racing represents a modern "Grand Challenge" for artificial intelligence, a domain that pushes the absolute boundaries of perception, planning, and control.1 The objective is not merely to achieve high speeds but to sustain them at the very edge of the platform's physical limits. This requires an agent to possess a profound, almost intuitive, understanding of complex dynamics and to execute split-second decisions with superhuman precision.3 It is a task where the abstract world of algorithms collides with the unforgiving laws of physics.

To tackle this challenge, we must first construct a digital arena, a proving ground where an AI can learn and fail millions of times without consequence. This report details the creation of such an environment, centered on three core components. The first is the agent: the Bitcraze Crazyflie drone. This open-source quadcopter is a marvel of miniaturization, renowned in the research community for its agility, light weight, and extensive programmability, making it an ideal subject for studying aggressive maneuvering and complex control strategies.4 The second is the environment: the MuJoCo (Multi-Joint dynamics with Contact) physics engine. MuJoCo is not a simple visualizer or game engine; it is a high-performance, research-grade simulator engineered from the ground up for model-based optimization and robotics research. It provides the speed and physical fidelity necessary to serve as a digital crucible for our learning agent.6

The final and most crucial component is the learning paradigm: Reinforcement Learning (RL). RL is a machine learning framework where an agent learns complex behaviors through a process of trial and error, guided by a single, often simple, scalar reward signal.8 The agent is not explicitly programmed with the rules of flight; instead, it must discover for itself the optimal strategy to maximize its cumulative reward over time.

This report will explore the meticulous process of building this virtual racing environment, from translating the drone's physical specifications into a high-fidelity model to defining the mathematical boundaries of its world. From there, it will delve into the most critical and nuanced aspect of training the RL agent: the design of the reward function. The analysis will dissect the profound and far-reaching impact of choosing between a sparse and a dense reward structure. This single design choice, as will be demonstrated, dictates the agent's entire learning journey, shapes its emergent behaviors, and can be the deciding factor between spectacular success and surprising, unintended failure.

Section 1: The Digital Proving Ground - Crafting a Virtual Drone in MuJoCo

The foundation of any successful application of reinforcement learning to robotics is the simulation environment. The quality, fidelity, and performance of this virtual world directly determine the relevance and potential for real-world transfer of any policies learned within it. Before an agent can learn to fly, it needs a universe that abides by consistent and realistic physical laws.

1.1. The MuJoCo Engine: A Foundation for Realistic Dynamics

For robotics-focused RL, the choice of simulator is a critical first step. While many game engines excel at visual rendering, they often prioritize aesthetics over physical accuracy. MuJoCo, in contrast, is designed with a singular focus on providing fast, accurate, and stable simulation of complex dynamical systems, making it an indispensable tool for robotics research.6

Its unique strengths lie in its advanced physics formulation. MuJoCo's model for handling contact dynamics, a notoriously difficult problem in simulation, reduces the calculations to a convex optimization problem, allowing for stable and efficient solutions even in scenarios with many simultaneous contacts.10 It provides robust support for a wide range of actuators, from simple motors to complex tendons and muscles, enabling the modeling of sophisticated robotic systems.6 Furthermore, its core architecture is optimized for the kind of model-based computations that are central to advanced control and RL, such as system identification and optimal control.11 A key design feature that facilitates this performance is the strict separation of the model's static description, contained in the mjModel data structure, from the dynamic variables and intermediate results of the simulation, which are stored in the mjData structure. This pre-allocation of memory and clear division of data allows the runtime module to operate with maximum efficiency, a necessity when performing millions of simulation steps for a single training run.10 By choosing MuJoCo, the aim is not just to create a visual representation of the drone but to build a "digital twin" that rigorously respects the laws of physics. This ensures that the strategies the RL agent learns are grounded in realistic dynamics, rather than being clever exploits of a less rigorous simulator's quirks.
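To make this separation concrete, the sketch below (using MuJoCo's official Python bindings) loads a model once into mjModel, allocates a single mjData, and then reuses that same data structure across episodes; the file name "crazyflie.xml" is a placeholder rather than a file shipped with this project.

```python
import mujoco

# Minimal sketch of the mjModel / mjData split described above.
model = mujoco.MjModel.from_xml_path("crazyflie.xml")  # static description, compiled once
data = mujoco.MjData(model)                            # dynamic state, pre-allocated once

for episode in range(3):
    mujoco.mj_resetData(model, data)   # reuse the same buffers for every episode
    for _ in range(1000):
        data.ctrl[:] = 0.0             # motor commands would come from the RL policy
        mujoco.mj_step(model, data)    # advance the physics by one timestep
```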

1.2. Modeling the Crazyflie: From Physical Specs to Virtual Twin

The process of creating a credible virtual agent begins with a deep understanding of its physical counterpart. MuJoCo's native MJCF (MuJoCo XML Model Format) provides an intuitive yet powerful language for defining every aspect of a robot's physical properties.6 This involves a meticulous translation of real-world specifications from the Crazyflie's engineering datasheets into the parameters of the virtual model. This step is crucial for bridging the "sim-to-real" gap; the more accurately the simulation captures the real drone's mass, inertia, and actuation limits, the more likely a policy trained in simulation will perform as expected on the physical hardware.

The following table details how key parameters from the Crazyflie 2.1 datasheets are mapped to their function within the MuJoCo simulation, forming the basis of our high-fidelity virtual drone.

| Parameter | Specification | Simulation Relevance | Data Source(s) |
| --- | --- | --- | --- |
| Takeoff Weight | 29g - 34g (depending on configuration) | Defines the total mass of the drone's main body. This is a fundamental parameter for all dynamic calculations, affecting both linear and rotational acceleration. | 4 |
| Frame Size | 92mm - 100mm (motor-to-motor diagonal) | Determines the placement of the motors relative to the center of mass. This directly influences the drone's moment of inertia and the torque generated by each motor. | 4 |
| Motor Thrust | Up to 30g per motor (brushless version) | Defines the maximum force each actuator can produce. This sets the hard physical limits on the drone's acceleration and its ability to counteract gravity, defining the boundaries of the agent's control authority. | 4 |
| Propeller Size | 55mm diameter (brushless version) | Influences the aerodynamic properties, such as drag and the efficiency of thrust generation. While often simplified in simulation, it is a key factor in real-world flight performance. | 4 |
| IMU Sensors | 3-axis accelerometer/gyroscope (BMI088) | While the simulator provides perfect, noise-free state information, understanding the real-world sensors informs the design of a realistic observation space for future sim-to-real transfer, where sensor noise and limitations are significant factors. | 12 |
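As an illustration of how such datasheet values can flow into an MJCF description, the following is a deliberately simplified sketch rather than the actual model used here: the body mass, motor-to-motor spacing, and per-motor thrust ceiling are taken from the table above, while everything else (geometry, actuator layout, omission of rotor yaw torque and tuned inertias) is an assumption made for brevity.

```python
import mujoco

# Hypothetical, highly simplified MJCF sketch. Mass (~32 g), rotor spacing (~92 mm
# diagonal), and per-motor thrust limit (~0.3 N, i.e. ~30 g) follow the datasheet
# table above; rotor drag/yaw torque and measured inertias are omitted.
CRAZYFLIE_XML = """
<mujoco model="crazyflie_sketch">
  <option timestep="0.002"/>
  <worldbody>
    <body name="drone" pos="0 0 0.1">
      <freejoint/>
      <geom type="box" size="0.046 0.046 0.015" mass="0.032"/>
      <site name="rotor1" pos=" 0.0325  0.0325 0"/>
      <site name="rotor2" pos="-0.0325  0.0325 0"/>
      <site name="rotor3" pos="-0.0325 -0.0325 0"/>
      <site name="rotor4" pos=" 0.0325 -0.0325 0"/>
    </body>
  </worldbody>
  <actuator>
    <motor site="rotor1" gear="0 0 1 0 0 0" ctrlrange="0 0.3"/>
    <motor site="rotor2" gear="0 0 1 0 0 0" ctrlrange="0 0.3"/>
    <motor site="rotor3" gear="0 0 1 0 0 0" ctrlrange="0 0.3"/>
    <motor site="rotor4" gear="0 0 1 0 0 0" ctrlrange="0 0.3"/>
  </actuator>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(CRAZYFLIE_XML)
data = mujoco.MjData(model)
mujoco.mj_step(model, data)  # one physics step to confirm the model compiles and runs
```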

1.3. Defining the Task: State and Action Spaces for Circular Racing

Before an agent can begin its learning journey, the problem must be formally defined within the mathematical framework of a Markov Decision Process (MDP).8 This requires specifying what the agent can "see" (the state space, S) and what it can "do" (the action space, A). These definitions are not merely technical formalities; they are profound design choices that shape the entire learning problem.

The state space defines the agent's perception of the environment. For a task like drone racing, providing raw pixel data from a virtual camera is possible but introduces immense complexity due to the high dimensionality of the input.3 A more common and effective approach, particularly for control-focused tasks, is to provide a more compact vector of direct physical properties.1 For this circular racing task, the state vector includes:

  • The drone's current linear and angular velocities.
  • The drone's orientation, represented as a 3×3 rotation matrix to avoid the mathematical ambiguities (gimbal lock) associated with Euler angles.1
  • The relative position and orientation of the next several waypoints defining the circular path.

The decision to use relative coordinates for the waypoints is a critical piece of implicit guidance for the agent. Instead of providing the drone with its absolute coordinates in the world and the absolute coordinates of the waypoints, the state is framed from the drone's own perspective (e.g., "the next waypoint is 5 meters ahead and 10 degrees to your left"). This formulation forces the agent to learn a policy that is independent of its global position on the track. It learns to react to the local geometry of the race, not to a specific, memorized map. This is a powerful form of injecting human domain knowledge into the problem, predisposing the agent to find a generalizable solution that could, in theory, apply to any circular track, not just the one it was trained on. This act of problem formulation is the first and most fundamental way a human designer shapes the agent's learning, simplifying the task but also constraining the universe of possible solutions.
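As a concrete illustration, the sketch below assembles such an observation vector from a MuJoCo mjData instance. It is a sketch under stated assumptions: the drone is the model's only floating body (body 1), its free joint occupies the first entries of qpos/qvel, and waypoints_world, next_idx, and n_lookahead are hypothetical names for the track data.

```python
import numpy as np

def build_observation(data, waypoints_world, next_idx, n_lookahead=3):
    """Assemble velocities, orientation, and body-frame relative waypoints."""
    pos = data.qpos[0:3]               # world position from the free joint
    rot = data.xmat[1].reshape(3, 3)   # drone orientation as a 3x3 rotation matrix
    lin_vel = data.qvel[0:3]           # linear velocity
    ang_vel = data.qvel[3:6]           # angular velocity

    # Express the next few waypoints in the drone's own frame
    # ("the next waypoint is 5 meters ahead and 10 degrees to your left").
    relative_waypoints = []
    for k in range(n_lookahead):
        wp = np.asarray(waypoints_world[(next_idx + k) % len(waypoints_world)])
        relative_waypoints.append(rot.T @ (wp - pos))

    return np.concatenate([lin_vel, ang_vel, rot.flatten(), *relative_waypoints])
```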

The action space defines the set of commands the agent can send to the environment. While high-level commands like "move forward" or "turn left" are possible, they abstract away the underlying physics. To allow for the discovery of truly agile and aggressive maneuvers, the action space is defined at a lower level: the individual thrust commands for each of the four rotors, represented as a continuous vector a_t = [f_1, f_2, f_3, f_4].1 This gives the agent maximum control over its dynamics. However, it also presents a significant challenge, as learning in a continuous action space is generally more difficult for RL algorithms than learning in a discrete one.3
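Expressed in code, and assuming the Gymnasium API for describing spaces, this is a four-dimensional continuous box; the 0.30 N ceiling (roughly 30 g of thrust per motor) follows the datasheet figures quoted earlier and is an illustrative choice rather than a tuned value.

```python
import numpy as np
from gymnasium import spaces

MAX_THRUST_N = 0.30  # ~30 g of thrust per motor, per the datasheet table above
action_space = spaces.Box(low=0.0, high=MAX_THRUST_N, shape=(4,), dtype=np.float32)

# A sampled action maps directly onto the four thrust actuators (assumed ordering):
# data.ctrl[:] = action_space.sample()
```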

Section 2: The Heart of Learning - A Deep Dive into Reinforcement Learning Rewards

With the virtual world and its agent constructed, the focus shifts from the physics of the simulation to the mechanism of learning. In reinforcement learning, the entire process is driven by a single, powerful concept: the reward signal. This signal is the sole source of feedback the agent receives, the unseen hand that guides its behavior from random flailing to purposeful, optimized action.

2.1. The Guiding Signal: The Foundational Role of the Reward Function

In the RL paradigm, the reward function is the quantitative embodiment of the task's goal.16 It is a function that, at each time step, evaluates the agent's action in its current state and returns a scalar value—a reward or a penalty. The agent's singular, unwavering objective is to learn a policy, π, which is a strategy for selecting actions, that maximizes the cumulative sum of these rewards over an episode.8

This makes the design of the reward function the most critical and delicate aspect of applying RL. It is the primary communication channel between the human designer and the AI agent.18 A well-designed reward function leads the agent to the desired behavior. A poorly designed or ambiguous reward function, however, will produce an agent that finds a perfectly optimal solution to the wrong problem. The agent has no understanding of the human's intent; it only understands the mathematics of maximizing its score. Therefore, the precision with which the reward function captures the true goal of the task is paramount.

2.2. The Two Philosophies: Sparse vs. Dense Rewards

In designing this crucial communication channel, two opposing philosophies emerge: providing feedback infrequently but with absolute clarity (sparse), or providing it constantly but with guiding approximations (dense).

  • Sparse Rewards: A sparse reward structure is one where feedback is given only upon the completion of a significant, often terminal, goal. For the vast majority of the agent's actions, the reward is zero. The feedback is delayed, but it is a direct and unambiguous measure of success.20
  • Dense Rewards: A dense reward structure provides feedback at every, or nearly every, time step. This feedback offers a continuous signal that helps the agent understand the immediate consequences of its actions, guiding its learning process more directly.20

An effective analogy is learning to bake a complex cake. A sparse reward is equivalent to only tasting the final, finished cake and judging it as either "delicious" (+1) or "terrible" (-1). All the intermediate steps—measuring flour, whisking eggs, setting the oven temperature—receive no feedback. A dense reward is like having an expert chef standing over your shoulder, providing constant feedback: "good, a little more flour," "no, you're whisking too slowly," "that's the perfect amount of sugar." The feedback is continuous and helpful, but it is a proxy for the ultimate goal of a delicious cake.

The choice between these two paradigms involves a series of critical trade-offs that fundamentally alter the learning problem, as summarized in the table below.

| Attribute | Sparse Rewards | Dense Rewards |
| --- | --- | --- |
| Feedback Frequency | Very low (e.g., only at the end of an episode). | High (e.g., at every time step). 20 |
| Goal Alignment | High. The reward directly and unambiguously measures the successful completion of the task. | Medium to Low. The reward measures a proxy for success, which may be flawed or exploitable. 22 |
| Learning Speed | Very slow. The agent requires extensive, often infeasible, amounts of random exploration to stumble upon a reward signal. | Fast. The continuous feedback provides a consistent gradient for the agent's policy to improve upon. 24 |
| Exploration Challenge | Extremely high. The agent must solve the "temporal credit assignment" problem: determining which actions in a long sequence were responsible for the final outcome. | Low. The agent is constantly guided, reducing the need for undirected exploration. 24 |
| Risk of Reward Hacking | Low. The goal is clear and difficult to game. The agent either completes the lap or it doesn't. | High. The agent can discover loopholes in the proxy metric to maximize its score without achieving the intended goal. 25 |
| Engineering Effort | Low. It is often trivial to define the final goal state. | High. Requires careful design, domain knowledge, and iterative tuning of intermediate reward components. 17 |

2.3. The Sparse Reward Approach: The Unambiguous but Arduous Path

Applying a pure sparse reward philosophy to our drone racing task would result in a simple function: assign a reward of +1 for successfully completing a full lap, a penalty of −1 for crashing, and a reward of 0 for every other action taken during the flight.
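In code, this pure sparse scheme is almost trivially short, which is part of its appeal; lap_completed and crashed are hypothetical flags that the environment would have to compute at each step.

```python
def sparse_reward(lap_completed: bool, crashed: bool) -> float:
    """+1 for finishing a lap, -1 for crashing, 0 for everything in between."""
    if lap_completed:
        return 1.0
    if crashed:
        return -1.0
    return 0.0
```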

While this function perfectly captures the ultimate goal, it creates a monumental learning challenge. An episode of flight might last for thousands of discrete time steps, each involving a specific set of motor commands. If the drone eventually completes a lap, how can the learning algorithm determine which of those thousands of actions were crucial for success and which were irrelevant or even detrimental? This is the essence of the temporal credit assignment problem.16 The agent is like a student who takes a final exam and receives only the final score, with no indication of which questions they answered correctly or incorrectly.

This lack of intermediate feedback places an enormous burden on exploration. With no guiding signal, the agent must rely on purely random actions to, by sheer chance, execute a sequence of thousands of correct motor commands to navigate the entire circular track.24 For a complex, high-dimensional, continuous control task like drone flight, the probability of this happening is infinitesimally small. The agent is far more likely to spend millions of attempts learning nothing more than how to crash in a variety of ways, never once receiving the positive feedback needed to reinforce a successful strategy.

2.4. The Dense Reward Approach: Providing a Continuous Helping Hand

In contrast, a dense reward approach seeks to provide a continuous guiding signal. A simple dense reward for our task could be formulated as: Reward = −distance_to_next_waypoint. At every single time step, the agent receives a piece of immediate, actionable feedback. Moving closer to the waypoint results in a less negative (and therefore better) reward, while moving away results in a more negative (worse) reward.
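A minimal sketch of this dense signal, assuming the drone and waypoint positions are available as 3-D vectors:

```python
import numpy as np

def dense_reward(drone_pos: np.ndarray, next_waypoint: np.ndarray) -> float:
    """Negative Euclidean distance to the next waypoint, evaluated every step."""
    return -float(np.linalg.norm(next_waypoint - drone_pos))
```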

This simple change fundamentally transforms the learning landscape. The sparse reward problem presents the agent with a vast, flat landscape with a single, tiny peak at the goal. The dense reward function reshapes this landscape into a smooth gradient, a hill that the agent can begin to climb from its very first step.21 The agent can learn immediately whether its actions are "getting warmer" or "getting colder," dramatically reducing the burden of random exploration and massively accelerating the learning process.

However, this acceleration comes at a cost, and it is a subtle but profound one. The introduction of a dense reward fundamentally changes the nature of the problem the agent is solving. With a sparse reward, the problem is: "Discover a sequence of actions that leads to the goal state." This is a difficult search problem in the vast space of possible policies. With a dense reward, the problem becomes: "At every state, choose an action that greedily improves the immediate reward signal." This is a more tractable gradient-following or hill-climbing problem.

The agent is no longer explicitly trying to solve the "lap completion" problem; it is now solving the "distance minimization" problem. The core assumption—and the source of potential failure—is that the optimal solution to the simpler, proxy problem (minimizing distance at every step) is the same as the optimal solution to the true, harder problem (completing the lap as fast as possible). The entire art of reward engineering, and the associated danger of reward hacking, hinges on how well this proxy aligns with the true objective. This central tension is the primary challenge in designing effective reward functions for complex tasks.

Section 3: The Art and Science of Reward Engineering for Drone Racing

Moving from the philosophical choice between sparse and dense rewards to practical implementation requires a blend of formal principles and careful, iterative engineering. This section explores the construction of a sophisticated, multi-objective reward function for the drone racing task and analyzes the subtle but critical ways in which such a function can fail.

3.1. Reward Shaping: Bridging the Gap from Sparse to Dense

Reward shaping is the formal term for the practice of augmenting a sparse reward function with additional, denser rewards to guide the learning process.21 The goal of principled reward shaping is to provide these helpful "hints" in a way that accelerates learning without altering the optimal policy of the original, underlying sparse problem. A well-shaped reward function is like a good teacher; it provides guidance to help the student find the right answer faster but does not change what the right answer is.

For a complex task like agile drone racing, a robust reward function must balance multiple, often competing, objectives. A function focused solely on speed might lead to reckless flying, while one focused solely on safety might lead to overly cautious and slow behavior. Therefore, a composite function is often required, where the total reward at any time step is a weighted sum of several components: R_total = w1·R_progress + w2·R_path + w3·R_stability + w4·R_control_effort. The weights (w1, w2, ...) are critical hyperparameters that must be carefully tuned to achieve the desired balance of behaviors.

The table below outlines a proposed structure for a shaped reward function tailored to the circular drone racing task, detailing each component's formulation, its intended goal, and the potential pitfalls if it is improperly weighted.

| Component | Formulation | Goal | Potential Pitfall (Reward Hacking) |
| --- | --- | --- | --- |
| R_progress | Reward proportional to the reduction in Euclidean distance to the next waypoint (e.g., previous_dist − current_dist). | To provide a strong, primary incentive for the agent to move forward along the track. | If not carefully managed, the agent might learn to crash directly into the waypoint or cut corners so aggressively that it cannot prepare for the subsequent turn. 14 |
| R_path | Negative reward (penalty) proportional to the agent's perpendicular distance from the ideal circular race line. | To encourage the drone to follow the specified trajectory, maintaining a smooth and predictable path. | If weighted too heavily, this can stifle creativity and prevent the agent from discovering a faster, more optimal race line that deviates slightly from the prescribed path. |
| R_stability | Negative reward for high angular velocities or excessive roll and pitch angles. | To promote smooth, stable flight and penalize erratic or oscillatory behavior that could lead to loss of control. | An overly punitive stability term can make the agent excessively cautious, resulting in slow, non-agile flight that is far from the time-optimal goal. 28 |
| R_control_effort | Negative reward for large changes in motor commands between consecutive time steps (high action derivatives). | To encourage energy-efficient control, reduce mechanical stress on the virtual motors, and promote smoother maneuvers. | A high penalty on control effort can dampen the agent's responsiveness, making it unable to perform the sharp, rapid corrections necessary for high-speed flight. |
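A minimal sketch of such a composite function is shown below; the weights and the helper quantities (prev_dist, curr_dist, path_error, and so on) are illustrative assumptions rather than tuned values.

```python
import numpy as np

def shaped_reward(prev_dist, curr_dist, path_error, ang_vel, action, prev_action,
                  w_progress=1.0, w_path=0.1, w_stability=0.05, w_control=0.01):
    """Weighted sum of the four shaping components from the table above."""
    r_progress = prev_dist - curr_dist              # reward forward progress
    r_path = -abs(path_error)                       # penalize drifting off the race line
    r_stability = -float(np.linalg.norm(ang_vel))   # penalize erratic rotation
    r_control = -float(np.linalg.norm(              # penalize jerky motor commands
        np.asarray(action) - np.asarray(prev_action)))
    return (w_progress * r_progress + w_path * r_path
            + w_stability * r_stability + w_control * r_control)
```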

3.2. The Cobra Effect in Simulation: The Perils of Poorly Designed Rewards

The primary danger of relying on dense, shaped rewards is the phenomenon of reward hacking. This occurs when an agent discovers an unexpected loophole or ambiguity in the reward function that allows it to achieve a high score without fulfilling the designer's underlying intent.25 It is the literal embodiment of the "Cobra Effect," a term originating from an anecdote about colonial India where a bounty placed on cobras to reduce their population inadvertently led people to breed cobras for the reward, ultimately worsening the problem.23 In RL, the agent will inevitably find the path of least resistance to maximize its reward, and if that path does not align with the desired outcome, unintended and often counterproductive behaviors will emerge.

For our drone racing task, several plausible reward hacking scenarios could arise from the components defined above:

  • Goal Proximity Looping: If the reward is dominated by a simple negative distance to the waypoint (−distance_to_waypoint), the agent might discover that the optimal strategy is not to pass the waypoint but to enter a tight, high-speed orbit around it. This behavior keeps the distance small and constant, yielding a high cumulative reward over time without ever making progress on the actual task of completing the lap. This is a classic example seen in other domains, such as a simulated boat that learns to spin in circles to hit the same set of checkpoints repeatedly.26
  • Flying Backwards to "Re-earn" Rewards: If the R_progress component is naively implemented, for instance, by rewarding any decrease in distance, an agent could learn to pass a waypoint, then turn around and fly back towards it to "re-earn" the progress reward before moving on to the next one. A minimal guard against this exploit is sketched after this list.
  • Exploiting the Physics Simulator: A sufficiently complex agent might discover a sequence of rapid, high-frequency motor commands that causes a numerical instability or resonance in the MuJoCo physics simulation, leading to unrealistic acceleration or movement. It would not be learning to fly skillfully but would instead be "hacking" the physics of its own universe to maximize reward.25
  • "Playing Dead" for Safety: If the penalty for crashing is extremely high and the penalties for instability (Rstability​) are also significant, a risk-averse agent might learn that the truly optimal policy is to do nothing at all. By hovering perfectly still or landing immediately, it guarantees a cumulative reward of zero. This is mathematically superior to attempting to fly and risking a large negative reward from a potential crash. The agent satisfies the reward function perfectly but completely fails the task.30

3.3. A Principled Approach to Reward Function Design

Given the high risk of reward hacking, designing effective reward functions cannot be a one-shot process. It requires an iterative, scientific approach grounded in careful observation and principled adjustments.

  1. Start Simple and Iterate: The most robust approach is to begin with the sparsest possible reward that correctly defines the ultimate task (e.g., +1 for a completed lap). Train the agent with this function first. Only when a specific, observable failure mode in learning is identified should a denser, shaped component be added to address it. For example, if the agent never learns to leave the starting line, a small progress reward can be introduced to encourage initial exploration.
  2. Use Potential-Based Reward Shaping: There is a formal theory that provides a condition for creating "safe" shaped rewards. A shaped reward term F is guaranteed not to change the optimal policy of the original problem if it takes the form of a potential function: F(s, a, s′) = γΦ(s′) − Φ(s), where Φ is some function of the state and γ is the discount factor. The intuition behind this is that rewards should be given for achievements (reaching a better state, like getting closer to the goal) rather than for the process of trying (the specific actions taken). A minimal sketch of this form appears after this list.
  3. Visualize and Analyze Behavior: Never rely solely on the reward curve as a measure of success. A high and increasing reward score can be deeply misleading, as it may be the result of a clever hack. It is essential to use the simulation's visualizer to watch the agent's behavior throughout training.17 Is the drone actually racing the track, or has it found an exploit? This qualitative analysis is just as important as the quantitative reward metrics.
  4. Environment Randomization: To prevent the agent from overfitting to a single, specific version of the task, the environment should be randomized during training. In this context, this could mean slightly varying the radius of the circular track, changing the drone's starting position and orientation, or adding small, stochastic wind forces. This technique, sometimes called domain randomization, forces the agent to learn a more robust and generalizable policy that is not dependent on the precise initial conditions.25
  5. Curriculum Learning: Instead of asking the agent to solve the full, difficult problem from the outset, it can be more effective to present it with a curriculum of progressively harder tasks.22 For the drone, a curriculum might look like this:
    • Level 1: Learn to hover stably at a fixed point.
    • Level 2: Learn to fly from a starting point to a single, fixed waypoint.
    • Level 3: Learn to fly through a sequence of two or three waypoints.
    • Level 4: Learn to fly the full circular track.
    • This incremental approach allows the agent to build foundational skills before tackling the final, complex objective.
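As referenced in item 2 above, here is a minimal sketch of a potential-based shaping term, using the negative distance to the next waypoint as the potential Φ(s); the discount factor value is an assumption.

```python
GAMMA = 0.99  # discount factor assumed by the learning algorithm

def potential(dist_to_next_waypoint: float) -> float:
    """Phi(s): being closer to the next waypoint means higher potential."""
    return -dist_to_next_waypoint

def shaping_term(prev_dist: float, curr_dist: float) -> float:
    """F(s, a, s') = gamma * Phi(s') - Phi(s), added on top of the sparse reward."""
    return GAMMA * potential(curr_dist) - potential(prev_dist)
```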

Conclusion: From Virtual Circles to Real-World Racing

The journey from a physical drone's specification sheet to a learning agent capable of agile flight in a virtual world is a testament to the power of modern simulation and AI. This report has traced that path, beginning with the meticulous construction of a digital twin within the high-fidelity MuJoCo physics engine and culminating in the intricate design of the learning process itself. The central theme that has emerged is the critical role of the reward function. The choice between a sparse and a dense reward structure is not merely a technical detail; it is a fundamental trade-off between the clarity of the ultimate goal and the efficiency of the learning process.

A sparse reward offers an unambiguous definition of success but often creates an insurmountable exploration problem. A dense reward provides a guiding hand that can dramatically accelerate learning but introduces the profound risk of "reward hacking," where the agent optimizes for a flawed proxy, leading to behaviors that are optimal but unintended. The art and science of reward engineering lie in navigating this treacherous landscape. A successful implementation requires a multi-faceted approach: blending components that encourage progress while ensuring stability, starting with simple objectives and iteratively adding complexity, and, most importantly, constantly verifying that the agent's learned behavior aligns with the designer's true intent.

While algorithms, computational power, and simulation technologies continue to advance at a breathtaking pace, the design of the reward function remains a deeply human-centric task. It is the primary locus of communication between human intent and machine optimization, and it is the point where many ambitious reinforcement learning projects either succeed or fail. A policy learned in this meticulously crafted simulation, guided by a thoughtfully designed and rigorously tested reward function, is only the first step. The ultimate challenge lies on the sim-to-real horizon: transferring this learned digital intelligence to a physical Crazyflie. This next phase will introduce a new host of complexities—sensor noise, motor response delays, battery limitations, and unpredictable aerodynamics. It is a formidable challenge for another day, but one that the rigorous foundation laid here, built upon a deep understanding of the unseen hand of reward, prepares us to meet.

Works cited

  1. Autonomous Drone Racing with Deep Reinforcement Learning, https://rpg.ifi.uzh.ch/docs/IROS21_Yunlong.pdf
  2. Learning Generalizable Policy for Obstacle-Aware Autonomous Drone Racing - arXiv, https://arxiv.org/html/2411.04246v1
  3. Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight - arXiv, https://arxiv.org/html/2501.14377v1
  4. Crazyflie 2.1 Brushless - Bitcraze, https://www.bitcraze.io/products/crazyflie-2-1-brushless/
  5. Crazyflie 2.0 - Bitcraze, https://www.bitcraze.io/crazyflie-2/
  6. MuJoCo — Advanced Physics Simulation, https://mujoco.org/
  7. google-deepmind/mujoco: Multi-Joint dynamics with Contact. A general purpose physics simulator. - GitHub, https://github.com/google-deepmind/mujoco
  8. Reinforcement learning - Wikipedia, https://en.wikipedia.org/wiki/Reinforcement_learning
  9. Part 1: Key Concepts in RL — Spinning Up documentation, https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
  10. MuJoCo Documentation: Overview, https://mujoco.readthedocs.io/
  11. MuJoCo Overview, https://www.roboti.us/book/index.html
  12. Datasheet Crazyflie 2.1 - Rev 3 - Bitcraze, https://www.bitcraze.io/documentation/hardware/crazyflie_2_1/crazyflie_2_1-datasheet.pdf
  13. Crazyflie 2.1 - Bitcraze, https://www.bitcraze.io/crazyflie-2-1/
  14. Application of Reinforcement Learning in Controlling Quadrotor UAV ..., https://www.mdpi.com/2504-446X/8/11/660
  15. Vision-Based Deep Reinforcement Learning for Autonomous Drone Flight - UC Berkeley EECS, https://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-280.pdf
  16. What is Reinforcement Learning and How Does It Work (Updated 2025) - Analytics Vidhya, https://www.analyticsvidhya.com/blog/2021/02/introduction-to-reinforcement-learning-for-beginners/
  17. How to Make a Reward Function in Reinforcement Learning? - GeeksforGeeks, https://www.geeksforgeeks.org/machine-learning/how-to-make-a-reward-function-in-reinforcement-learning/
  18. Guide to Reward Functions in Reinforcement Fine-Tuning - Predibase, https://predibase.com/blog/reward-functions-reinforcement-fine-tuning
  19. Designing Reward Functions Using Active Preference Learning for Reinforcement Learning in Autonomous Driving Navigation - MDPI, https://www.mdpi.com/2076-3417/14/11/4845
  20. What are the key points in reward function design in deep reinforcement learning?, https://www.tencentcloud.com/techpedia/107500
  21. Real-World DRL: 5 Essential Reward Functions for Modeling ..., https://medium.com/@zhonghong9998/real-world-drl-5-essential-reward-functions-for-modeling-objectives-and-constraints-e742325d4747
  22. Reward Shaping Idea : r/reinforcementlearning - Reddit, https://www.reddit.com/r/reinforcementlearning/comments/1ix4a85/reward_shaping_idea/
  23. Sparse vs. Dense Rewards, Optimistic Sampling ... - Andrew Forney, https://forns.lmu.build/classes/spring-2020/cmsi-432/lecture-13-2.html
  24. What are the pros and cons of sparse and dense rewards in reinforcement learning?, https://ai.stackexchange.com/questions/23012/what-are-the-pros-and-cons-of-sparse-and-dense-rewards-in-reinforcement-learning
  25. Reward Hacking in Reinforcement Learning | Lil'Log, https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
  26. Reward hacking - Wikipedia, https://en.wikipedia.org/wiki/Reward_hacking
  27. Reinforcement Learning: An introduction (Part 1/4) | by Cédric Vandelaer | Medium, https://medium.com/@cedric.vandelaer/reinforcement-learning-an-introduction-part-1-4-866695deb4d1
  28. Novel Reward Function for Autonomous Drone ... - KoreaScience, https://koreascience.kr/article/CFKO202333855010609.pdf
  29. Extended Abstract - CS 224R Deep Reinforcement Learning, https://cs224r.stanford.edu/projects/pdfs/CS224R_final_report__4_%20(1).pdf
  30. What is reward hacking in RL? - Milvus, https://milvus.io/ai-quick-reference/what-is-reward-hacking-in-rl
  31. What is reward hacking? - AI Safety Info, https://aisafety.info/questions/8SIU/What-is-reward-hacking
  32. Let's talk sparse / dense rewards - Unity Obstacle Tower Challenge - AIcrowd Forum, https://discourse.aicrowd.com/t/lets-talk-sparse-dense-rewards/969