Bug #34485

Dynflow doesn't properly come back if the DB is unavailable for a brief period of time

Added by Amit Upadhye about 2 years ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Difficulty:
-
Triaged:
No
Fixed in Releases:
-
Found in Releases:
-

Description

Ohai,

when I try to restore a Katello 4.3 (or nightly) backup on EL8 (with rubygem-foreman_maintain-1.0.2-1.el8.noarch), the restore finishes fine, but afterwards not all services are happy:

# hammer ping
database:         
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:    
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:        
    Status:          ok
    Server Response: Duration: 58ms
candlepin_auth:   
    Status:          ok
    Server Response: Duration: 50ms
candlepin_events: 
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:   
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 1ms
pulp3:            
    Status:          ok
    Server Response: Duration: 117ms
pulp3_content:    
    Status:          ok
    Server Response: Duration: 128ms
foreman_tasks:    
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

After restarting Dynflow via systemctl restart dynflow-sidekiq@\*, everything seems to work fine again.
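
For reference, the full recovery check can be scripted; a minimal shell sketch (the status path is taken from the hammer ping error above, everything else is plain systemctl/curl usage and may need credentials depending on your setup):

# Restart all Dynflow workers (same command as above)
systemctl restart 'dynflow-sidekiq@*'

# The endpoint named in the hammer ping failure reports executor health;
# it may require authentication depending on your configuration
curl -sk "https://$(hostname -f)/foreman_tasks/dynflow/status"

# Re-check the overall service status
hammer ping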

I am not sure this is a maintain bug (or installer, or dynflow, or packaging), but filing it here for investigation.


Related issues: 1 (0 open, 1 closed)

Copied from Installer - Bug #34394: Dynflow doesn't properly come back if the DB is unavailable for a brief period of time (Closed, assignee: Evgeni Golov)
#1

Updated by Amit Upadhye about 2 years ago

  • Copied from Bug #34394: Dynflow doesn't properly come back if the DB is unavailable for a brief period of time added
#2

Updated by Adam Winberg about 1 year ago

This causes remote execution jobs to fail to report the correct status - the job executes fine, but its status is stuck at 'running 100%'.

After a restart of the dynflow-sidekiq services, job status is reported correctly again, but the jobs that failed to report remain stuck at 'running 100%' forever.

We do not run PostgreSQL locally on our Foreman server, so the puppet manifest workaround does not work for us.
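
As I understand it, that workaround merely orders the Dynflow units after a local PostgreSQL unit, which is why it cannot help with a remote database. A hypothetical systemd drop-in sketch of the idea (the file path and unit names are assumptions):

# /etc/systemd/system/dynflow-sidekiq@.service.d/db-ordering.conf
# Hypothetical: only meaningful when postgresql.service exists on the same host
[Unit]
After=postgresql.service
Requires=postgresql.service

With a remote database there is no local unit to order against, which is exactly the situation described here.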

#3

Updated by markus lanz about 1 year ago

This also affects us. We also don't run the PostgreSQL DB locally on the Foreman server. We are running version 3.5.1. Are there any updates/news on this topic?

#4

Updated by Adam Ruzicka about 1 year ago

What are your expectations around it? It could be made to survive brief (a couple of seconds) connection drops, but definitely not more. Would that be ok?
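
For context, tolerating brief drops is usually a connection-pool concern; a minimal Ruby sketch of the mechanism using Sequel (the database layer Dynflow builds on) - an illustration of the idea only, not Dynflow's actual configuration:

require "sequel"

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))
# The connection_validator extension checks a pooled connection before
# handing it out and transparently replaces dead ones, so a drop that
# happens while the connection sits idle goes unnoticed.
DB.extension(:connection_validator)
# Validate connections that were idle for more than 30 seconds
# (a value of -1 validates on every checkout, at a small per-query cost)
DB.pool.connection_validation_timeout = 30

This only covers connections that were idle during the outage; a query actually in flight when the database goes away still fails, which is why anything beyond a few seconds cannot be papered over this way.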

#5

Updated by Adam Winberg about 1 year ago

> What are your expectations around it?

For me, that it should survive at least a minute or two of disconnection, to allow for DB patching and reboots. Anything less than that would seldom be useful in our environment.

#6

Updated by markus lanz about 1 year ago

> For me, that it should survive at least a minute or two of disconnection, to allow for DB patching and reboots. Anything less than that would seldom be useful in our environment.

I agree.

#7

Updated by Adam Ruzicka about 1 year ago

That's not going to fly, I'm afraid. It could be changed so that instead of getting stuck as it does now, it would crash after a while, and systemd could then restart it automatically. This would keep happening until the DB comes back up. However, I'd like to avoid having a service sit there for minutes, seemingly ok while in fact it cannot function at all.
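
The crash-and-restart behaviour described here would roughly correspond to a unit configuration like the following; the option names are standard systemd, but the values are assumptions:

# Drop-in for dynflow-sidekiq@.service (hypothetical values)
[Unit]
# Do not give up after a few quick failures while the DB is down
StartLimitIntervalSec=0

[Service]
# Exit on a lost DB connection and let systemd bring the worker back
Restart=on-failure
RestartSec=5s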

#8

Updated by markus lanz about 1 year ago

> That's not going to fly, I'm afraid. It could be changed so that instead of getting stuck as it does now, it would crash after a while, and systemd could then restart it automatically. This would keep happening until the DB comes back up. However, I'd like to avoid having a service sit there for minutes, seemingly ok while in fact it cannot function at all.

Understandable, and I agree. However, as a compromise, a few seconds should also do the trick. In most environments, databases are set up with high-availability mechanisms that fail over in a few seconds, so I guess we don't have to think about minutes. (10-20 seconds should be more than enough.)
