Bug #8565 (closed)

report::expire is running very slowly

Added by Chuck Schweizer over 9 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Reporting
Target version:
Difficulty:
Triaged:
Fixed in Releases:
Found in Releases:

Description

The fix included in #1592 has made report::expire very slow.

Expire 5-day-old reports (runs every 30 minutes):
Pre-fix: 2.5 minutes
Post-fix: 21 minutes

Expire 1-day-old non-eventful reports (runs every 30 minutes):
Pre-fix: 2.5 minutes
Post-fix: 21 minutes

Most of the time seems to be spent in Postgres deleting the records in the post-fix report.rb. I will be working on getting more data added to this ticket.
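For context on the schedule described above, the expiry is driven by the reports:expire rake task (the exact invocation appears in comment #11 below). A cron entry along these lines would give the 30-minute cadence; the schedule shown here is illustrative, not copied from the reporter's system:

 # Illustrative only: expire reports older than 5 days every 30 minutes.
 */30 * * * * foreman-rake reports:expire days=5 RAILS_ENV=production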


Related issues 3 (1 open, 2 closed)

Related to Foreman - Bug #1592: report::expire errors with millions of records (Closed, Ori Rabin, 04/30/2012)
Related to Foreman - Bug #8316: Removing a long-standing host creates huge transaction (Closed, Daniel Lobato Garcia, 11/07/2014)
Related to Foreman - Refactor #8798: Reports.expire uses an unnecessary join on logs (New, 12/28/2014)
Actions #1

Updated by Chuck Schweizer over 9 years ago

4,000 nodes running puppet once per hour.

Actions #2

Updated by Dominic Cleal over 9 years ago

  • Related to Bug #1592: report::expire errors with millions of records added
Actions #3

Updated by Dominic Cleal over 9 years ago

  • Category set to Reporting
  • Release field set to 29
Actions #4

Updated by Dominic Cleal over 9 years ago

The most useful info would be debug logs while running the rake task: http://projects.theforeman.org/projects/foreman/wiki/Troubleshooting#How-do-I-enable-debugging

The SQL logs would then show the time taken for each query, and possibly even EXPLAIN data if they appear to take long enough. If not, then we could try re-running some queries with EXPLAIN.
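For anyone reproducing this locally, a statement pulled from the debug log can be examined directly in psql; a minimal sketch (the query shown is the one identified later in this ticket, and EXPLAIN on its own does not execute the delete):

 -- \timing prints per-statement wall-clock time; EXPLAIN shows the plan without running the statement.
 \timing on
 EXPLAIN DELETE FROM "messages" WHERE (id NOT IN (SELECT message_id FROM "logs"));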

Actions #5

Updated by Dominic Cleal over 9 years ago

  • Related to Bug #8316: Removing a long-standing host creates huge transaction added
Actions #6

Updated by Dominic Cleal over 9 years ago

  • Release field deleted (29)
Actions #7

Updated by Chuck Schweizer over 9 years ago

Here is where the report expiry is slow:

 foreman | foreman  |   11783 |             | f       | 2014-12-10 13:52:56.117167+00 | DELETE FROM "messages" WHERE (id not IN (SELECT message_id FROM "logs" ))

 foreman=# explain DELETE FROM "messages" WHERE (id not IN (SELECT message_id FROM "logs" ));
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Seq Scan on messages  (cost=38070.64..323418103.91 rows=12289 width=6)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=38070.64..60375.40 rows=1603876 width=4)
           ->  Seq Scan on logs  (cost=0.00..30200.76 rows=1603876 width=4)
(5 rows)
Actions #8

Updated by Chuck Schweizer over 9 years ago

Explain with DISTINCT

foreman=# explain DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id from "logs" ));
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Seq Scan on messages  (cost=34343.70..36100.92 rows=12289 width=6)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  HashAggregate  (cost=34314.32..34337.82 rows=2350 width=4)
           ->  Seq Scan on logs  (cost=0.00..30292.46 rows=1608746 width=4)
(5 rows)
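For comparison only (this is a sketch, not the change made in the pull request above): the same cleanup can be written as an anti-join, which the planner can likewise execute without probing a materialized list once per messages row, and which, unlike NOT IN, is not defeated by NULL message_id values:

 -- Illustrative anti-join form: remove messages no longer referenced by any logs row.
 DELETE FROM messages m
 WHERE NOT EXISTS (
     SELECT 1 FROM logs l WHERE l.message_id = m.id
 );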
Actions #9

Updated by Dominic Cleal over 9 years ago

  • Status changed from New to Assigned
  • Assignee set to Dominic Cleal
Actions #10

Updated by The Foreman Bot over 9 years ago

  • Status changed from Assigned to Ready For Testing
  • Pull request https://github.com/theforeman/foreman/pull/2008 added
  • Pull request deleted ()
Actions #11

Updated by Chuck Schweizer over 9 years ago

Much better with your fix. Down to 12 seconds.

foreman=# explain analyze DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id from "logs" ));
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on messages  (cost=34423.41..36180.64 rows=12289 width=6) (actual time=699.123..710.044 rows=19 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  HashAggregate  (cost=34394.04..34417.54 rows=2350 width=4) (actual time=670.766..680.408 rows=28760 loops=1)
           ->  Seq Scan on logs  (cost=0.00..30362.83 rows=1612483 width=4) (actual time=0.055..312.759 rows=1613655 loops=1)
 Total runtime: 710.908 ms
(6 rows)
$ time ./foreman-rake reports:expire days=5 RAILS_ENV='production'

real    0m12.144s
user    0m8.497s
sys    0m0.924s
Actions #12

Updated by Dominic Cleal over 9 years ago

  • Related to Refactor #8798: Reports.expire uses an unnecessary join on logs added
Actions #13

Updated by Martin Jackson over 9 years ago

Applying the patch attached to the PR gets our report expiry for ~750,000 reports into the 30-second range.

Actions #14

Updated by Dominic Cleal about 9 years ago

  • Status changed from Ready For Testing to Closed
  • % Done changed from 0 to 100
Actions #15

Updated by Dominic Cleal about 9 years ago

  • Release field set to 32
Actions #16

Updated by Dirk Heinrichs almost 9 years ago

Running 1.7.4, expiring reports is still quite slow (PostgreSQL 9.3.x). It runs for longer than a day, so we end up in situations where multiple expire jobs are running at once, hogging the CPU at 100%. The plan looks like this:

foreman=> select count(*) from reports;
 count
-------
  1951
(1 row)

foreman=> select count(*) from messages;
 count
--------
 704462
(1 row)

foreman=> select count(*) from logs;
 count
--------
 802815
(1 row)

foreman=> explain DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id FROM "logs" ));
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Delete on messages  (cost=0.42..9806323789.43 rows=352231 width=6)
   ->  Seq Scan on messages  (cost=0.42..9806323789.43 rows=352231 width=6)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.42..27142.52 rows=279366 width=4)
                 ->  Unique  (cost=0.42..24653.69 rows=279366 width=4)
                       ->  Index Only Scan using index_logs_on_message_id on logs  (cost=0.42..22646.65 rows=802815 width=4)
(7 rows)

There's this blog post, which outlines a possible alternative. Another one would be to use foreign keys together with "on delete cascade".
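One possible reading of the cascade suggestion, sketched here only for illustration (the constraint name and the report_id column are assumptions, and a cascade from reports to logs would not by itself remove the orphaned messages rows this query targets):

 -- Illustrative only: have PostgreSQL delete dependent log rows when their report is deleted.
 ALTER TABLE logs
     ADD CONSTRAINT logs_report_id_fk
     FOREIGN KEY (report_id) REFERENCES reports (id)
     ON DELETE CASCADE;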

Actions #17

Updated by Dominic Cleal almost 9 years ago

It's probably best if you can file a new issue so we can identify the cause there, Dirk. Is it definitely that query that's taking the time? EXPLAIN ANALYZE might work better to check.

There are a few possible causes, and a different one was identified recently in #10228, where a missing index was slowing down MySQL (at least).

Actions #18

Updated by Dirk Heinrichs almost 9 years ago

Yes, it is that query. Found out by running

SELECT * FROM pg_stat_activity;

while connected to the foreman database. And no, I can't run EXPLAIN ANALYZE, as it actually executes the query, which would get me in trouble again.

Regarding #10228: I wonder how an index on source_id would help accelerate a query on message_id.

Anyway, created #10329 as requested.
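As a general aside on the EXPLAIN ANALYZE concern above (standard PostgreSQL behaviour, not something specific to this ticket): a destructive statement can be analyzed inside a transaction that is rolled back, so the delete never commits, although it still does the work and holds locks while it runs:

 BEGIN;
 EXPLAIN ANALYZE DELETE FROM "messages" WHERE (id NOT IN (SELECT DISTINCT message_id FROM "logs"));
 ROLLBACK;  -- the delete is discarded; only the plan and timing output remain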
