Introduction:

Berkeley’s CS285, a graduate-level Deep RL course taught by Sergey Levine, is a dense course: 23 lectures and 5 labs (a mix of theoretical proofs and coding assignments) covering the major state-of-the-art techniques in Deep RL across academia and industry.

In this writeup I’ll discuss my solution to the first homework. Readers should have already read the lab’s problem statement, and should treat what follows as one possible solution.

All the parts below are divided into Exploration and Exploitation subsections, a playful nod to RL. In the Exploration parts I read through the code, explain it, and speculate; in the Exploitation parts I commit to changes and run experiments. It’s as if I’m treating myself as an RL agent, just with a strange objective function 😉.

Analysis:

NOTE: This part requires basic-to-intermediate knowledge of probability; you should be fine if you’re comfortable with the notation commonly used in RL papers.

Exploration:

Here’s the problem statement:

[Figure: problem statement from the homework handout]

One thing you'll notice when tackling this problem is that it's unclear how to effectively use the given hints, especially since the mathematical formalism is inherently flawed. To address this, we'll refine the formalism to better align with the problem and our specific needs.
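To make that concrete, here is the standard assumption this kind of behavior-cloning analysis rests on (following the imitation-learning analysis in the CS285 lectures; the exact statement in the handout may differ slightly). The learned policy’s expected disagreement with the expert, measured under the expert’s own state distribution, is at most $\varepsilon$ at every timestep:

$$ \mathbb{E}_{s_t \sim p_{\pi^*}(s_t)} \left[ \pi_\theta(a_t \neq \pi^*(s_t) \mid s_t) \right] \leq \varepsilon \quad \text{for all } t. $$

Everything below bounds how this per-step error compounds over a horizon of length $T$.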

Exploitation:

Question 1:

$$ \text{Show that } \sum_{s_t} \left| p_{\pi_\theta}(s_t) - p_{\pi^*}(s_t) \right| \leq 2T\varepsilon. $$

[Figures: my worked solution to Question 1]
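For readers who can’t see the images, here is a sketch of the standard argument (in the style of Ross & Bagnell’s DAgger analysis and the CS285 lecture notes; my written solution may present it differently). Decompose the learned policy’s state marginal by conditioning on whether it has made any mistake so far:

$$ p_{\pi_\theta}(s_t) = (1-\varepsilon)^t \, p_{\pi^*}(s_t) + \left(1 - (1-\varepsilon)^t\right) p_{\text{mistake}}(s_t), $$

where $(1-\varepsilon)^t$ lower-bounds the probability of making no mistake in the first $t$ steps and $p_{\text{mistake}}$ is some arbitrary distribution otherwise. Then

$$ \sum_{s_t} \left| p_{\pi_\theta}(s_t) - p_{\pi^*}(s_t) \right| = \left(1 - (1-\varepsilon)^t\right) \sum_{s_t} \left| p_{\text{mistake}}(s_t) - p_{\pi^*}(s_t) \right| \leq 2\left(1 - (1-\varepsilon)^t\right) \leq 2t\varepsilon \leq 2T\varepsilon, $$

using the fact that the summed absolute difference between two distributions is at most 2, together with the inequality $(1-\varepsilon)^t \geq 1 - t\varepsilon$.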

Question 2a:

$$ \text{Suppose the reward depends only on the last state, i.e., } r(s_t) = 0 \text{ for all } t < T. \text{ Show that } J(\pi^*) - J(\pi_\theta) = O(T\varepsilon). $$

[Figure: my worked solution to Question 2a]
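Again, a sketch of the standard argument, assuming the reward is bounded by some $R_{\max}$ (a bound the handout may state differently). Since only the final state is rewarded, the return is an expectation over $s_T$ alone, so

$$ J(\pi^*) - J(\pi_\theta) = \sum_{s_T} \left( p_{\pi^*}(s_T) - p_{\pi_\theta}(s_T) \right) r(s_T) \leq R_{\max} \sum_{s_T} \left| p_{\pi^*}(s_T) - p_{\pi_\theta}(s_T) \right| \leq 2T\varepsilon R_{\max} = O(T\varepsilon), $$

where the last inequality is exactly the bound from Question 1 applied at $t = T$.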