Investigate High RDS IOPS Use, Production Service Interruption #3465
To reduce the load from NHD hi-res tile requests, I wrote this script:

import mercantile
import requests

for zoom in range(19):
    # bbox via https://gist.github.com/jsundram/1251783
    for tile in mercantile.tiles(-124.7844079, 24.7433195, -66.9513812, 49.3457868, zoom):
        url = f'http://33.33.34.35/nhd_streams_hr_v1/{zoom}/{tile.x}/{tile.y}.png'
        requests.get(url)
        print(url)

and then configured the cache. This means that all of zoom levels 0-14 for the entire CONUS have tiles pre-cached, and a partial cache for zoom 15 was also added. In most cases, zooming beyond level 15 doesn't result in as many stream reads, so burst balance usage should be reduced. The tile cache for this layer on production is currently ~30 GB in size.
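For a sense of why the pre-cache stops around zoom 14-15, here is a rough tile-count estimate (not from the original comment; it reuses the same CONUS bbox as the script above). The tile count roughly quadruples with each zoom level, so caching much past 15 quickly becomes impractical:

import mercantile

# Same CONUS bbox as the pre-caching script above.
WEST, SOUTH, EAST, NORTH = -124.7844079, 24.7433195, -66.9513812, 49.3457868

for zoom in range(17):
    # The tiles containing the bbox corners give the x/y index ranges at this zoom.
    upper_left = mercantile.tile(WEST, NORTH, zoom)
    lower_right = mercantile.tile(EAST, SOUTH, zoom)
    count = (lower_right.x - upper_left.x + 1) * (lower_right.y - upper_left.y + 1)
    print(f'zoom {zoom:2d}: ~{count:,} tiles')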
It should be noted that even after this, we did see at least one instance of the Burst Balance going down to 0.

Looking at the logs, we see queries like this that take a long time (40+ seconds) to run: long-query.sql. This is for the Saco HUC-8.

The final output is:
Here's the query plan output of EXPLAIN ANALYZE:
In this example, the inner query results in 9869 rows.
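One way to check whether that time is going to disk reads (which is what draws down the burst balance) is to re-run the query with buffer statistics. A minimal sketch, assuming the Saco boundary lives in boundary_huc08 and using EPSG:5070 for the length transform (the application may pass a different SRID):

EXPLAIN (ANALYZE, BUFFERS)
SELECT ST_Length(ST_Transform(ST_Intersection(s.geom, h.geom), 5070)) AS lengthm,
       s.stream_order,
       s.slope
FROM nhdflowlinehr s
JOIN boundary_huc08 h ON ST_Intersects(s.geom, h.geom)
WHERE h.name = 'Saco';

A large "shared read" count in the plan nodes would point at I/O rather than CPU as the bottleneck for this particular query.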
Seeing 100% CPU use in RDS:

Turning off Celery doesn't seem to help.

Installed pg_stat_statements. Other, non-analyze queries that take a long time are tile requests and text searches:

SELECT round(total_exec_time*1000)/1000 AS total_time, query
FROM pg_stat_statements
ORDER BY total_time DESC limit 3;

    total_time | query
---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
272881959.314 | SELECT ST_AsBinary("geom") AS geom,"stream_order" FROM (SELECT geom, stream_order FROM nhdflowlinehr WHERE stream_order >= $1) as cdbq WHERE "geom" && ST_SetSRID($2::box3d, $3)
125818041.86 | SELECT id, code, name, rank, ST_X(center) AS x, ST_Y(center) AS y +
| FROM ((SELECT id, $1 AS code, name, $2 AS rank, +
| ST_Centroid(geom) as center +
| FROM boundary_huc08 +
| WHERE UPPER(name) LIKE UPPER($3) +
| LIMIT $4) UNION ALL (SELECT id, $5 AS code, name, $6 AS rank, +
| ST_Centroid(geom) as center +
| FROM boundary_huc10 +
| WHERE UPPER(name) LIKE UPPER($7) +
| LIMIT $8) UNION ALL (SELECT id, $9 AS code, name, $10 AS rank, +
| ST_Centroid(geom) as center +
| FROM boundary_huc12 +
| WHERE UPPER(name) LIKE UPPER($11) +
| LIMIT $12)) AS subquery +
| ORDER BY rank DESC, name
96767941.593 | WITH stream_intersection AS ( +
| SELECT ST_Length(ST_Transform( +
| ST_Intersection(geom, +
| ST_SetSRID(ST_GeomFromGeoJSON($1), +
| $2)), +
| $3)) AS lengthm, +
| stream_order, +
| slope +
| FROM nhdflowlinehr +
| WHERE ST_Intersects(geom, +
| ST_SetSRID(ST_GeomFromGeoJSON($4), $5))) +
| +
| SELECT SUM(lengthm) / $6 AS lengthkm, +
| stream_order, +
| SUM(lengthm * NULLIF(slope, $7)) / $8 AS slopesum +
| FROM stream_intersection +
| GROUP BY stream_order
(3 rows)
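Note that total_exec_time in pg_stat_statements is cumulative across all calls, so the ranking above reflects aggregate load rather than individual slow runs. Including calls and mean_exec_time separates the two; a sketch, assuming the PostgreSQL 13+ column names used above:

SELECT calls,
       round(total_exec_time::numeric, 3) AS total_ms,
       round(mean_exec_time::numeric, 3) AS mean_ms,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 3;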
Since CPU usage has been high for more than an hour now, I'm going to try and kill some of those queries.
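A sketch of the usual way to find and cancel them via pg_stat_activity; the 5-minute cutoff and the pid are placeholders:

SELECT pid, now() - query_start AS runtime, left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;

-- Cancel the query behind a given pid (placeholder value), or use
-- pg_terminate_backend(pid) to drop the whole connection:
SELECT pg_cancel_backend(12345);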
Anthony mentioned that upping the size or type of the RDS instance may help in the short term. We should keep that in mind if our optimizations do not help.
Since the deployment of 1.33, we're observing very high IOPS use of the RDS instance in production:
This is coincident with the site going down a lot:
And we see these messages in RDS:
Since this is happening after 1.33, it is highly likely related to the High Res Streams, a very large table introduced in that release. The high IOPS use may also be driven by a very high request rate, possibly from a large number of tile requests coming from the tiler.
It is possible that #3424 may help. Or batch-generation of tiles. Or transferring the tiles already made for staging to production. Or it may be something else entirely.
Investigate, and if possible, fix.