Ingest NHD High Resolution Streams, Add Tile Layer #3417

Merged
merged 2 commits into develop from tt/ingest-nhd-hires-streams on Aug 23, 2021

Conversation

rajadain (Member)

Overview

The full ingest process is documented in #3415 (comment). The process took several days, not just to figure out the data shape but to actually process it, because of the sheer volume. This is by far the largest volume of data we've added to MMW, and it may have non-trivial consequences for performance and hosting costs that will reveal themselves as this moves to staging and production.

The work here adds the NHD High Resolution stream data to the MMW database, and wires up the tiler to render it on the map. It does not switch our analyses and models to the new High Resolution data; those still use Medium Resolution, and will be switched over in future cards.

I was initially unable to provision the tiler VM locally; the only fix was to upgrade the NPM version on that VM. It seems to work well enough, and since the tiler is isolated from the app VM, this should be fine.

Connects #3415

Demo

[demo screenshot]

Notes

The compressed data is ~17GB, and in the database it is ~26GB:

SELECT pg_size_pretty(pg_total_relation_size('nhdflowlinehr'));

 pg_size_pretty 
----------------
 26 GB
(1 row)

There's a total of ~24.5M rows:

SELECT COUNT(*) FROM nhdflowlinehr;

  count   
----------
 24517604
(1 row)

I attempted to drop the extraneous columns in the table to see if that would reduce the size much, but it didn't. Most of the data is likely the stream geometries themselves, which we cannot elide.
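
To confirm that hunch, one could sum the stored size of the geometry values directly. A minimal sketch, assuming the geometry column is named geom (adjust for the actual schema):

-- Assumption: the geometry column is named "geom"
SELECT pg_size_pretty(SUM(pg_column_size(geom))) FROM nhdflowlinehr;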

About 2% of the streams do not have a value for stream_order or slope, which were obtained by joining with the Value Added Attributes table:

SELECT COUNT(*) FROM nhdflowlinehr WHERE stream_order IS NULL;

 count  
--------
 493431
(1 row)
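
For context, the join that pulled those attributes in looks roughly like this. A sketch only: the table and column names assume the standard NHDPlus HR schema (NHDPlusFlowlineVAA carries StreamOrde and Slope, keyed by NHDPlusID), and the actual ingest script may differ:

-- Sketch: assumes standard NHDPlus HR table/column names
UPDATE nhdflowlinehr f
SET stream_order = v.streamorde,
    slope = v.slope
FROM nhdplusflowlinevaa v
WHERE f.nhdplusid = v.nhdplusid;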

Testing Instructions

Before you begin, ensure you have ~80+ GB of free space on your host computer.

  • Import the new high resolution streams with:
    $ vagrant ssh app -c 'cd /vagrant && ./scripts/aws/setupdb.sh -S'
    This might take 30-60 minutes.
    • If this fails, try giving your services VM more resources:
      diff --git a/Vagrantfile b/Vagrantfile
      index 600fc14d..561d833c 100644
      --- a/Vagrantfile
      +++ b/Vagrantfile
      @@ -52,7 +52,8 @@ Vagrant.configure("2") do |config|
      
           services.vm.provider "virtualbox" do |v|
             v.customize ["guestproperty", "set", :id, "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold", 10000 ]
      -      v.memory = 2048
      +      v.memory = 6144
      +      v.cpus = 4
           end
      
           services.vm.provision "ansible" do |ansible|
    • Ensure it succeeds (see the sanity-check query after this list)
  • While the above is happening, reprovision your tiler:
    $ vagrant reload --provision tiler
    • Ensure it succeeds
  • Once the import is complete, go to http://localhost:8000/
  • Turn on the Continental US High Resolution Stream Network layer
    • Ensure it renders as expected
  • Turn on the Continental US Medium Resolution Stream Network layer
    • Ensure that still works as before
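
If you want a sanity check independent of the UI, this suggested query (not part of the official steps) should confirm the import:

-- Expect ~24.5M rows and ~26 GB if the import completed fully
SELECT COUNT(*) AS row_count,
       pg_size_pretty(pg_total_relation_size('nhdflowlinehr')) AS table_size
FROM nhdflowlinehr;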

@rajadain added the PA DEP (Funding Source: Pennsylvania Department of Environmental Protection) label on Aug 18, 2021
@rajadain requested a review from jwalgran on August 18, 2021 17:43
@@ -92,6 +92,90 @@
}
}

#nhdflowlinehr {
rajadain (Member Author)

These styles are identical to those for #nhdflowline

@jwalgran (Contributor)

An update: I set up a new set of Vagrant VMs on an external drive and increased the memory and CPU as suggested. Loading the high res data has been chugging on ALTER TABLE for a while now, a lot longer than the suggested "30-60 minutes":

+ curl -s https://s3.amazonaws.com/data.mmw.azavea.com/nhdflowlinehr.sql.gz
+ gunzip -q
+ psql --single-transaction
SET
SET
SET
SET
SET
 set_config
------------

(1 row)

SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
ALTER SEQUENCE
ALTER TABLE
COPY 24517604
  setval
----------
 24517604
(1 row)

ALTER TABLE

The disk on my services VM is up to about 33GB.
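
For anyone wanting to confirm which statement the load is stuck on, one option (a suggestion, assuming psql access to the services VM's database; not something run in this thread):

-- Show currently running statements and their runtimes
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle';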

[screenshot, 2021-08-22 8:30 PM]

I will let it go overnight.

@jwalgran (Contributor) left a comment

I was unable to see the data load complete in a reasonable amount of time when attempting to run the VM from my external spinning hard disk. The process did appear to be working, and the changes in the PR are mostly a reuse of existing functionality. 👍

@jwalgran assigned rajadain and unassigned himself on Aug 23, 2021
@rajadain (Member Author)

Thanks for taking a look. I'll merge this and we can get this on staging and evaluate there. (Although we may have to wait for #3416 before staging deployments work again.)

@rajadain merged commit b2aa875 into develop on Aug 23, 2021
@rajadain deleted the tt/ingest-nhd-hires-streams branch on August 23, 2021 16:05