
Investigate High RDS IOPS Use, Production Service Interruption #3465

Closed
rajadain opened this issue Jan 20, 2022 · 7 comments
Assignees: rajadain
Labels: bug, PA DEP Funding Source: Pennsylvania Department of Environment Protection, production

Comments

@rajadain (Member)

Since the deployment of 1.33, we're observing very high IOPS use of the RDS instance in production:

image

This coincides with the site going down frequently:

image

And we see these messages in RDS:

image

Since this began after 1.33, it is highly likely to be related to the High Res Streams data, which is a very large table introduced in that release. The high IOPS use may also be driven by a very high request rate, possibly from a large number of requests coming from the tiler.

#3424 may help. So might batch-generating tiles, or transferring the tiles already made for staging over to production. Or it may be something else entirely.

Investigate, and if possible, fix.
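
For reference, the metrics in the screenshots above can also be pulled from CloudWatch programmatically. Here's a minimal sketch (the instance identifier is a placeholder, not our actual one):

import boto3
from datetime import datetime, timedelta

# Sketch: pull the RDS BurstBalance and ReadIOPS metrics shown in the console
# screenshots above. The DBInstanceIdentifier value is a placeholder.
cloudwatch = boto3.client('cloudwatch')

for metric in ['BurstBalance', 'ReadIOPS']:
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/RDS',
        MetricName=metric,
        Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'mmw-production'}],
        StartTime=datetime.utcnow() - timedelta(days=7),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=['Minimum', 'Maximum'],
    )
    for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
        print(metric, point['Timestamp'], point['Minimum'], point['Maximum'])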

@rajadain self-assigned this on Jan 20, 2022
@rajadain added the GEN Funding Source: General and PA DEP Funding Source: Pennsylvania Department of Environment Protection labels, and removed the GEN Funding Source: General label, on Jan 20, 2022
@rajadain (Member, Author)

To reduce the load from NHD Hi Res tile requests, I wrote this script to pre-cache tiles:

import mercantile
import requests

# Request every NHD Hi Res stream tile for the CONUS bounding box at zoom
# levels 0-18 so the tiler pre-caches them ahead of user traffic.
for zoom in range(19):
    # CONUS bbox via https://gist.github.com/jsundram/1251783
    for tile in mercantile.tiles(-124.7844079, 24.7433195, -66.9513812, 49.3457868, zoom):
        url = f'http://33.33.34.35/nhd_streams_hr_v1/{zoom}/{tile.x}/{tile.y}.png'
        requests.get(url)
        print(url)

I then configured the tiler VM to point to (and have access to) the production tile cache bucket, by setting the MMW_TILECACHE_BUCKET environment variable and adding credentials for mmw-prd as the default profile in /var/lib/mmw/.aws/credentials inside the tiler. Then I ran the script. It ran for 1d 29m 11s before I canceled it; the final tile to be pre-cached was http://33.33.34.35/nhd_streams_hr_v1/15/7385/11671.png.

This means that tiles for zoom levels 0-14 across the entire CONUS are now pre-cached, and a partial cache for zoom 15 was added as well. In most cases, zooming beyond level 15 doesn't result in as many stream reads, so the burst balance usage should be reduced.

The tile cache for this layer on production is currently ~30 GB in size:

image

rajadain commented Jan 24, 2022

It should be noted that even after this, we did see at least one instance of the Burst Balance going down to 0:

image

Looking at the logs, we see queries like this that take a long time (40+ seconds) to run: long-query.sql. This is for the Saco HUC-8:

image

The final output is:

      lengthkm       | stream_order |      slopesum
---------------------+--------------+----------------------
  2.7824697770005598 |           -9 |
  2777.4843327587737 |            1 |   174.8754808855135
   1094.772595535565 |            2 |    33.29209307438515
   760.0630001057596 |            3 |    9.209074630253456
   284.7515363448878 |            4 |   1.5203409666684877
  154.49228032277207 |            5 |   0.2766627177824275
   83.42659346525942 |            6 |  0.07302188101247957
  132.18007018219515 |            7 |  0.12741485218159052

Here's the query plan output of EXPLAIN ANALYZE:

Finalize GroupAggregate  (cost=353138.03..1375844.01 rows=11 width=18) (actual time=32163.046..55491.620 rows=8 loops=1)
  Group Key: nhdflowlinehr.stream_order
  ->  Gather Merge  (cost=353138.03..1375843.68 rows=22 width=18) (actual time=8306.590..55491.555 rows=23 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=352138.01..1374841.12 rows=11 width=18) (actual time=3116.023..35914.559 rows=8 loops=3)
              Group Key: nhdflowlinehr.stream_order
              ->  Sort  (cost=352138.01..352163.51 rows=10200 width=1027) (actual time=554.512..561.702 rows=3290 loops=3)
                    Sort Key: nhdflowlinehr.stream_order
                    Sort Method: external merge  Disk: 2888kB
                    Worker 0:  Sort Method: quicksort  Memory: 1644kB
                    Worker 1:  Sort Method: external merge  Disk: 3768kB
                    ->  Parallel Bitmap Heap Scan on nhdflowlinehr  (cost=874.15..346856.38 rows=10200 width=1027) (actual time=350.164..545.643 rows=3290 loops=3)
                          Filter: st_intersects(geom, '01060000...F2274640'::geometry)
Planning Time: 62.047 ms
JIT:
  Functions: 30
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.602 ms, Inlining 203.120 ms, Optimization 481.906 ms, Emission 324.416 ms, Total 1015.044 ms
Execution Time: 55494.523 ms

In this example, the inner query results in 9869 rows.
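
A quick way to gauge how many features an area of interest pulls into that inner query is to count the rows matched by the spatial filter. This is just a sketch; the polygon and SRID below are stand-ins, not the actual Saco HUC-8 geometry:

-- Sketch: count rows matched by the spatial filter for an area of interest.
-- The polygon and the 4326 SRID are placeholders; substituting the Saco HUC-8
-- shape should reproduce the 9869-row inner query discussed above.
SELECT COUNT(*)
FROM nhdflowlinehr
WHERE ST_Intersects(
          geom,
          ST_SetSRID(
              ST_GeomFromGeoJSON('{"type": "Polygon", "coordinates": [[[-71.0, 43.8], [-70.4, 43.8], [-70.4, 44.4], [-71.0, 44.4], [-71.0, 43.8]]]}'),
              4326));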

rajadain commented Feb 1, 2022

Seeing 100% CPU use in RDS:

image

Turning off Celery doesn't seem to help. I installed pg_stat_statements, as recommended in https://aws.amazon.com/premiumsupport/knowledge-center/rds-aurora-postgresql-high-cpu/, to investigate which queries are responsible.
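
For reference, that setup amounts to roughly the following (a sketch; it assumes the RDS parameter group already has shared_preload_libraries including pg_stat_statements, which requires an instance reboot to take effect):

-- Sketch: enable pg_stat_statements once shared_preload_libraries is set.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optionally clear previously accumulated stats so new measurements start clean.
SELECT pg_stat_statements_reset();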

Other, non-analyze queries that take a long time are tile requests and text searches:

SELECT round(total_exec_time*1000)/1000 AS total_time,query
FROM pg_stat_statements
ORDER BY total_time DESC limit 3;
  total_time   |                                                                                      query
---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 272881959.314 | SELECT ST_AsBinary("geom") AS geom,"stream_order" FROM (SELECT geom, stream_order FROM nhdflowlinehr WHERE stream_order >= $1) as cdbq WHERE "geom" && ST_SetSRID($2::box3d, $3)
  125818041.86 | SELECT id, code, name, rank, ST_X(center) AS x, ST_Y(center) AS y                                                                                                               +
               |         FROM ((SELECT id, $1 AS code, name, $2 AS rank,                                                                                                                         +
               |             ST_Centroid(geom) as center                                                                                                                                         +
               |         FROM boundary_huc08                                                                                                                                                     +
               |         WHERE UPPER(name) LIKE UPPER($3)                                                                                                                                        +
               |         LIMIT $4) UNION ALL (SELECT id, $5 AS code, name, $6 AS rank,                                                                                                           +
               |             ST_Centroid(geom) as center                                                                                                                                         +
               |         FROM boundary_huc10                                                                                                                                                     +
               |         WHERE UPPER(name) LIKE UPPER($7)                                                                                                                                        +
               |         LIMIT $8) UNION ALL (SELECT id, $9 AS code, name, $10 AS rank,                                                                                                          +
               |             ST_Centroid(geom) as center                                                                                                                                         +
               |         FROM boundary_huc12                                                                                                                                                     +
               |         WHERE UPPER(name) LIKE UPPER($11)                                                                                                                                       +
               |         LIMIT $12)) AS subquery                                                                                                                                                 +
               |         ORDER BY rank DESC, name
  96767941.593 | WITH stream_intersection AS (                                                                                                                                                   +
               |             SELECT ST_Length(ST_Transform(                                                                                                                                      +
               |                       ST_Intersection(geom,                                                                                                                                     +
               |                                       ST_SetSRID(ST_GeomFromGeoJSON($1),                                                                                                        +
               |                                                  $2)),                                                                                                                          +
               |                       $3)) AS lengthm,                                                                                                                                          +
               |                    stream_order,                                                                                                                                                +
               |                    slope                                                                                                                                                        +
               |             FROM nhdflowlinehr                                                                                                                                                  +
               |             WHERE ST_Intersects(geom,                                                                                                                                           +
               |                                 ST_SetSRID(ST_GeomFromGeoJSON($4), $5)))                                                                                                        +
               |                                                                                                                                                                                 +
               |         SELECT SUM(lengthm) / $6 AS lengthkm,                                                                                                                                   +
               |                stream_order,                                                                                                                                                    +
               |                SUM(lengthm * NULLIF(slope, $7)) / $8 AS slopesum                                                                                                                +
               |         FROM stream_intersection                                                                                                                                                +
               |         GROUP BY stream_order
(3 rows)

rajadain commented Feb 1, 2022

This does make it look like the stream intersections are the ones taking the longest time:

SELECT pid, age(clock_timestamp(), query_start), usename, query, state
FROM pg_stat_activity
ORDER BY query_start DESC;

image

rajadain commented Feb 1, 2022

Since CPU usage has been high for more than an hour now, I'm going to try to kill some of those queries.
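
For the record, a sketch of how I'd do that, using the pids from pg_stat_activity above (pg_cancel_backend interrupts just the query; pg_terminate_backend drops the whole connection). The 30-minute threshold and the query pattern here are arbitrary choices:

-- Sketch: cancel long-running stream intersection queries by pid.
SELECT pid, pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE '%stream_intersection%'
  AND age(clock_timestamp(), query_start) > interval '30 minutes';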

rajadain commented Feb 1, 2022

Ah, those queries just finished on their own, and we see a corresponding drop in CPU use:

image

rajadain commented Feb 2, 2022

Anthony mentioned that increasing the size or changing the type of the RDS instance may help in the short term. We should keep that in mind if our optimizations do not help.
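
For reference, that short-term bump would look roughly like this; the instance identifier and target class below are placeholders, not a sizing recommendation:

import boto3

# Sketch: move the RDS instance to a larger class as a short-term mitigation.
# Both values below are placeholders.
rds = boto3.client('rds')
rds.modify_db_instance(
    DBInstanceIdentifier='mmw-production',
    DBInstanceClass='db.m5.xlarge',
    ApplyImmediately=True,  # otherwise the change waits for the maintenance window
)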

@jwalgran closed this as completed on Feb 3, 2022