Skip to content

Possible bug on the sign of policy log prob. in Fisher computation #279

Open
@daniloefl

Description

@daniloefl

Dear @ikostrikov ,

while reading your code I noticed that you use the log prob. of a normal distribution for the Fisher matrix calculation in the value loss, but the negative log prob. of the policy. Comparing your code in [1] with the equivalent lines of the stable baselines [2], one can see that the policy part of the Fisher matrix calculation is a log prob. (minus negative log prob. in pg_fisher_loss) and the value function contribution is also a log prob. (minus mean squared error).

The original paper mentions the construction of the Fisher matrix using the gradient of the log prob. of the policy and the log prob. of a Gaussian around the value function (section 3.1 of [3]). I would expect therefore that the sign of the two terms used for the Fisher matrix to follow the same convention, as it is done in the stable baselines repository. The actual loss function minimisation is done with a negative log prob. for both (as you currently do, and as it is done in the stable baselines repo.), but in both cases, the sign of the two terms should be consistent.

Therefore, I could not fully understand the reason for that sign in the fisher matrix calculation. Is this a bug, or is there some deeper reason behind it?

Best regards,
Danilo

[1] https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/algo/a2c_acktr.py#L53
[2] https://stable-baselines.readthedocs.io/en/master/_modules/stable_baselines/acktr/acktr.html#ACKTR
[3] https://arxiv.org/pdf/1708.05144.pdf

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions