Skip to content

Setup webhook automatic retries #302

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
thomasplevy opened this issue Nov 10, 2022 · 2 comments
Open

Setup webhook automatic retries #302

thomasplevy opened this issue Nov 10, 2022 · 2 comments
Assignees
Labels
Type: Enhancement Improvements existing features or code

Comments

@thomasplevy
Copy link
Contributor

Currently our webhooks are set to track failed attempts and automatically disable after 5 failed delivery attempts:

protected function set_delivery_failure() {
$failures = absint( $this->get( 'failure_count' ) );
$this->set( 'failure_count', ++$failures );
/**
* Filter the number of times a webhook is allowed to fail before it is automatically disabled.
*
* @since 1.0.0-beta.1
*
* @param int $num Number of allowed failures. Default: 5.
*/
$max_allowed = apply_filters( 'llms_rest_webhook_max_delivery_failures', 5 );
if ( $failures > $max_allowed ) {
$this->set( 'status', 'disabled' );
/**
* Fires immediately after a webhook has been disabled due to exceeding its maximum allowed failures.
*
* @since 1.0.0-beta.1
*
* @param int $webhook_id ID of the webhook.
*/
do_action( 'llms_rest_webhook_disabled_by_delivery_failures', $this->get( 'id' ) );
}
return $this;
}

This would require 5 separate deliveries for 5 events though. EG: if it's triggered during student enrollment, the webhook will disable after the 5th student fails enrollment.

This is problematic if an external application relies on webhook deliveries to maintain data integrity in the external application.

We should automatically reschedule a failed webhook delivery on a delay (5-15 minutes? Possibly it should track individual event failures and increase the delay: failure 1, wait 5 minutes, resend; failure 2 wait 30 minutes and resend; etc...)

After failure, we can check whether the webhook should be disabled and, if not, we can reschedule it using the same args passed into the deliver() method:

public function deliver( $args ) {

In order to best track failures for the specific event we could probably just add an extra argument to the args array:

$args['retry_count'] ?? 0;
++$args['retry_count;
@thomasplevy thomasplevy added the Type: Enhancement Improvements existing features or code label Nov 10, 2022
@thomasplevy thomasplevy moved this to Awaiting Triage in Development Nov 10, 2022
@thomasplevy thomasplevy moved this from Awaiting Triage to Backlog in Development Nov 10, 2022
@kirkroyster
Copy link

Our concerns would be around network and/or server issues on our end. Network would be the primary, server would normally just be a scheduled restart or an OS update. Network could be a power outage which could be an extended outage which could take hours to restore << which suggests that we consider cloud hosting, understood. >> If it could extend out to 12 or 24 hours max for the retry, we should be good in almost any case. Simplistic store and forward, I know. But a significant impact on pseudo-guaranteed delivery and associated data integrity. Interested in your thoughts regarding what the practical maximum would be.

@thomasplevy
Copy link
Contributor Author

@kirkroyster

Interested in your thoughts regarding what the practical maximum would be.

That's really going to depend on the requirements of your application. I think a "safe" method for the default with a simplistic option to reconfigure those defaults on a per-application need using hooks and filters is the best bet.

LifterLMS can't solve every potential need or requirement so doing something generally safe/appropriate as the default with the ability to customize when the defaults don't make sense is our approach to just about everything.

My initial thoughts are to follow a pattern similar to our retry rules for ecomm subscriptions in the core: https://github.com/gocodebox/lifterlms/blob/d3505995cfefe9ccda36155dc7a79d182aba5ec2/includes/models/model.llms.order.php#L998-L1019

Probably doing a shorter initial retry though basing off the assumption that momentary network issues could be the initial cause for the failure.

Attempt
Retry 1 (+15 minutes)
Retry 2 (+60 minutes)
Retry 3 (+6 hours)
Retry 4 (+12 hours)

After the 4th retry (the 5th actual attempt) the webhook would get disabled.

Maybe this is too short for a default. I'd need more feedback, I'm just talking right now, not committing code.

Of course this could be customized with a filter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Enhancement Improvements existing features or code
Projects
Status: Backlog
Development

No branches or pull requests

3 participants