FERC 714: Integrate the XBRL data Respondent ID table #3857

cmgosnell · 2024-09-19T14:33:04Z

Overview

Closes #3839 and #3858

What problem does this address?

What did you change?
There were two main threads that needed pulling to get this table updated:

Table compatibility: The CSV table is static, while the XBRL table is reported annually. A lot of the downstream analysis expects this table to be static, so the first step was to check whether the columns we have in the CSV years held consistent data across the few XBRL years we have. A small number of eia_code values needed cleaning up, but otherwise the data was static. I then converted the XBRL data into a static table, concatenated the two tables, and checked the static-ness again.
eia_code cleaning: Not a ton, but some cleaning was necessary. It is all done in spot_fix_eia_codes and EIA_CODE_FIXES.
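
The static-ness check and the annual-to-static conversion described above can be sketched roughly as follows. This is a minimal illustration with made-up data, not the code in this PR: the helper names is_static and to_static are hypothetical, and keeping the first non-null value per respondent is an assumed collapsing rule.

```python
import pandas as pd


def is_static(df: pd.DataFrame, id_col: str, value_cols: list[str]) -> bool:
    """Return True if every ID has at most one distinct non-null value per column."""
    # nunique() ignores nulls by default, so a missing year does not count as a change.
    n_unique = df.groupby(id_col)[value_cols].nunique()
    return bool((n_unique <= 1).all().all())


def to_static(df: pd.DataFrame, id_col: str, value_cols: list[str]) -> pd.DataFrame:
    """Collapse annual records to one row per ID (hypothetical helper).

    Assumption: the first non-null value per ID is a fair representative,
    which only holds once is_static() has passed.
    """
    return df.groupby(id_col, as_index=False)[value_cols].first()


# Made-up annual XBRL-style records: two respondents, two report years each.
annual = pd.DataFrame(
    {
        "respondent_id_ferc714_xbrl": ["C000001", "C000001", "C000002", "C000002"],
        "report_year": [2021, 2022, 2021, 2022],
        "respondent_name_ferc714": ["Utility A", "Utility A", "Utility B", "Utility B"],
        "eia_code": [101, 101, None, 202],
    }
)
static_cols = ["respondent_name_ferc714", "eia_code"]

static = to_static(annual, "respondent_id_ferc714_xbrl", static_cols)

# A respondent whose name changes between years breaks the static assumption.
changed = annual.copy()
changed.loc[1, "respondent_name_ferc714"] = "Utility A Holdings"
```

Running the same is_static check again after concatenating the CSV and XBRL frames would correspond to the second static-ness check mentioned above.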

Still to do, which I'd rather finish in transform-714-xbrl 3842:

metadata/schema updates: add the CSV and XBRL respondent IDs to the table.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list


If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. test_minmax_rows())
Update the release notes: reference the PR and related issues.
Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
Review the PR yourself and call out any questions or issues you have
For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
Alternatively, run the build-deploy-pudl GitHub Action manually.
Materialize the assets locally

cmgosnell · 2024-09-20T12:53:01Z

src/pudl/transform/ferc714.py

@@ -175,27 +175,45 @@
 """Mapping between standardized time offset codes and canonical timezones."""

 EIA_CODE_FIXES = {


Instead of converting all of these code fixes into fixes keyed on the PUDL-derived respondent ID, I kept the sourcey-ness (they stay keyed by source). This is mostly to enable checking things at each stage, primarily ensure_eia_code_uniqueness.

makes sense!

cmgosnell · 2024-09-20T12:53:57Z

src/pudl/transform/ferc714.py

@@ -293,6 +317,26 @@ def _assign_respondent_id_ferc714(
    return df


+def _fillna_respondent_id_ferc714_source(


This is actually necessary in the hourly table as well!

Probably worth checking whether it's necessary for the annual table too

aesharpe

Mostly non-blocking comments or questions!

aesharpe · 2024-09-20T17:40:18Z

src/pudl/transform/ferc714.py

@@ -175,27 +175,45 @@
 """Mapping between standardized time offset codes and canonical timezones."""

 EIA_CODE_FIXES = {


makes sense!

aesharpe · 2024-09-20T17:49:10Z

src/pudl/transform/ferc714.py

+        "xbrl": {
+            "entity_id": "respondent_id_ferc714_xbrl",
+            "respondent_legal_name": "respondent_name_ferc714",
+            "respondent_identification_code": "eia_code",


Is this for sure eia code? seems fishy

I suppose the CSV version had eia_code, so it makes sense that this does too.

aesharpe · 2024-09-20T18:12:24Z

src/pudl/transform/ferc714.py

+    # use the source utility ID column to get a unique map and for merging
+    resp_id_col = f"respondent_id_ferc714_{source}"
+    resp_map_series = respondent_map_ferc714.dropna(subset=[resp_id_col]).set_index(
+        "respondent_id_ferc714"
+    )[resp_id_col]
+
+    df[resp_id_col] = df[resp_id_col].fillna(
+        df["respondent_id_ferc714"].map(resp_map_series)
+    )
+    return df


I'm not sure I understand how this is working. The _assign_respondent_id_ferc714 function maps the respondent_id_ferc714 column and then this column appears to work backwards from that to map on missing respondent_id_ferc714_source values. How is that possible? If there is no respondent_id_ferc714_source to begin with how can we map a respondent_id_ferc714 value onto it?

Also not sure I understand why this is important when we drop the respondent id source columns anyways?
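
For reference, the mechanics of the fillna-via-map pattern in the hunk above can be reproduced on toy data. This sketch only shows what the pandas calls do. It does not establish when rows with a respondent_id_ferc714 but no source ID occur in the real pipeline. The data and the wrapper name fillna_source_id are made up for illustration.

```python
import pandas as pd

# Made-up respondent map: one PUDL-derived ID per row, plus source-specific IDs.
respondent_map_ferc714 = pd.DataFrame(
    {
        "respondent_id_ferc714": [1, 2, 3],
        "respondent_id_ferc714_csv": [101, 102, None],
        "respondent_id_ferc714_xbrl": ["C000001", None, "C000003"],
    }
)


def fillna_source_id(
    df: pd.DataFrame, respondent_map: pd.DataFrame, source: str
) -> pd.DataFrame:
    """Fill missing source-specific IDs using the PUDL-derived ID (illustrative)."""
    resp_id_col = f"respondent_id_ferc714_{source}"
    # Series indexed by the PUDL ID whose values are the source-specific ID.
    resp_map_series = respondent_map.dropna(subset=[resp_id_col]).set_index(
        "respondent_id_ferc714"
    )[resp_id_col]
    out = df.copy()
    # Only rows where the source ID is null are affected by fillna.
    out[resp_id_col] = out[resp_id_col].fillna(
        out["respondent_id_ferc714"].map(resp_map_series)
    )
    return out


# Toy frame: the PUDL ID is present on every row, but the XBRL ID is missing on some.
df = pd.DataFrame(
    {
        "respondent_id_ferc714": [1, 1, 3],
        "respondent_id_ferc714_xbrl": ["C000001", None, None],
    }
)
filled = fillna_source_id(df, respondent_map_ferc714, "xbrl")
```

The fill can only do anything for rows that already carry a respondent_id_ferc714, so it presupposes that the PUDL-derived ID was assigned by some route other than the missing source ID.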

aesharpe · 2024-09-20T18:13:01Z

src/pudl/transform/ferc714.py

@@ -293,6 +317,26 @@ def _assign_respondent_id_ferc714(
    return df


+def _fillna_respondent_id_ferc714_source(


Probably worth checking whether it's necessary for the annual table too

aesharpe · 2024-09-20T18:13:32Z

src/pudl/transform/ferc714.py

+    TODO: rip this out. enforce_schema happens via the io_managers now.
+


What's stopping us from removing it now?

aesharpe · 2024-09-20T18:38:16Z

src/pudl/transform/ferc714.py

+        eia_code and all eia_codes that are actually the respondent_id_ferc714_xbrl
+        are nulled.


Is it possible that the respondent_id_ferc714_xbrl and eia_code are the same?

mmm looks like no because they start with C!
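
A rough sketch of the nulling step discussed here, on made-up data. How code_is_respondent_id_mask is actually built is not shown in this thread, so the equality comparison below is an assumption.

```python
import pandas as pd

# Made-up XBRL records: two respondents reported their own XBRL entity ID
# in the field that is mapped to eia_code.
xbrl = pd.DataFrame(
    {
        "respondent_id_ferc714_xbrl": ["C000001", "C000002", "C000003"],
        "eia_code": ["12345", "C000002", "C000003"],
    }
)

# Assumption: a reported code is "actually the respondent ID" when it is
# identical to the XBRL entity ID on the same row.
code_is_respondent_id_mask = xbrl["eia_code"] == xbrl["respondent_id_ferc714_xbrl"]
xbrl.loc[code_is_respondent_id_mask, "eia_code"] = pd.NA
```

Since the XBRL entity IDs start with "C", they cannot collide with a purely numeric code, which is why a direct comparison is enough in this toy case.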

aesharpe · 2024-09-20T18:40:49Z

src/pudl/transform/ferc714.py

+        )
+        xbrl.loc[code_is_respondent_id_mask, "eia_code"] = pd.NA
+
+        # lets null out some of the eia_code's from XBRL that we've manually culled


By manually culled do you mean that these IDs are not the actual eia_code?

aesharpe · 2024-09-20T18:42:10Z

src/pudl/transform/ferc714.py

+
+    @staticmethod
+    def convert_into_static_table_xbrl(xbrl: pd.DataFrame) -> pd.DataFrame:
+        """Convert this annually reported table into a skinner, static table.


Do you mean skinnier?

aesharpe · 2024-09-20T18:43:39Z

src/pudl/transform/ferc714.py

+            xbrl.groupby(["respondent_id_ferc714_xbrl"])[  # noqa: PD101
+                ["respondent_name_ferc714"]


Is it possible there are some different spellings of the respondent name that would cause this to look like a 1:1 ratio when it's not?
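
One way to probe the spelling question is to compare distinct-name counts before and after a light normalization. This is a sketch on made-up names. The normalize_name helper and its rules (lowercasing, dropping punctuation, collapsing whitespace) are assumptions, not something this PR does.

```python
import pandas as pd


def normalize_name(names: pd.Series) -> pd.Series:
    """Lowercase, drop punctuation, and collapse whitespace (hypothetical rules)."""
    return (
        names.str.lower()
        .str.replace(r"[^a-z0-9 ]", " ", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )


# Made-up records with spelling variants of the same respondent names.
xbrl = pd.DataFrame(
    {
        "respondent_id_ferc714_xbrl": ["C000001", "C000001", "C000002", "C000003"],
        "respondent_name_ferc714": [
            "Acme Power Co.",
            "ACME Power Co",
            "Beta Electric",
            "Beta  Electric",
        ],
    }
)
xbrl["name_norm"] = normalize_name(xbrl["respondent_name_ferc714"])

# Distinct names per ID, on raw and on normalized spellings.
names_per_id_raw = xbrl.groupby("respondent_id_ferc714_xbrl")[
    "respondent_name_ferc714"
].nunique()
names_per_id_norm = xbrl.groupby("respondent_id_ferc714_xbrl")["name_norm"].nunique()

# Distinct IDs per normalized name: values above 1 flag names shared across IDs.
ids_per_name_norm = xbrl.groupby("name_norm")["respondent_id_ferc714_xbrl"].nunique()
```

In this toy data the raw count reports two names for C000001 that are really one, and the normalized count surfaces two IDs sharing the name "beta electric", which the raw spellings would hide.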

aesharpe · 2024-09-20T18:48:34Z

src/pudl/transform/ferc714.py

+    def condense_into_one_source_table(df):
+        """Condense the CSV and XBRL records together into one record.


Not blocking at all, but in the other two tables we just put this code directly into the run function instead of creating its own function wrapper. NBD, but it would be good to be consistent.

OR we could very easily make this a function vs. a class method because I think it's the same for all tables. We could just have a column parameter or something.

wip first round of respondent table transforming

978e664

cmgosnell self-assigned this Sep 19, 2024

cmgosnell added the ferc714 (Anything having to do with FERC Form 714) and data-update (When fresh data is integrated into PUDL from quarterly or annual updates) labels Sep 19, 2024

finish eia_code mapping and wrap up transforms

42716aa

cmgosnell commented Sep 20, 2024


udpate docs

858b744

cmgosnell marked this pull request as ready for review September 20, 2024 13:10

cmgosnell requested a review from aesharpe September 20, 2024 13:10

udpate docs again lol spaces

549a5ab

This was linked to issues Sep 20, 2024

Write transform function to clean and normalize FERC 714 XBRL respondent ID table #3839

Open

Link FERC 714 respondents to EIA utility and BA IDs #3858

Open

aesharpe reviewed Sep 20, 2024


cmgosnell commented Sep 19, 2024 • edited by aesharpe
