Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FERC 714: Integrate the XBRL data Respondent ID table #3857

Open
wants to merge 4 commits into
base: transform-714-xbrl
Choose a base branch
from

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented Sep 19, 2024

Overview

Closes #3839 and #3858

What problem does this address?

What did you change?
There were two main threads that needed pulling to get this table updates:

  • Table compatibility: The csv table is static while the xbrl table is reported annually. A lot of the downstream analysis expects this table to be static. So the first step was to check whether or not the columns that we have in the CSV years had consistent data over the few XBRL years that we have. There were a small number of eia_code's we needed to clean up, but besides that it was static. I then converted the XBRL data into a static table, then I concat-ed the tables and checked the static-ness again.
  • eia_code cleaning: Not a ton but some cleaning necessary. Done all in spot_fix_eia_codes & EIA_CODE_FIXES

still todo that I'd rather finish in transform-714-xbrl 3842:

  • metadata/schema updates: add csv and xbrl respondent id's into table.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

@cmgosnell cmgosnell self-assigned this Sep 19, 2024
@cmgosnell cmgosnell added ferc714 Anything having to do with FERC Form 714 data-update When fresh data is integrated into PUDL from quarterly or annual updates labels Sep 19, 2024
@@ -175,27 +175,45 @@
"""Mapping between standardized time offset codes and canonical timezones."""

EIA_CODE_FIXES = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of converting all of these code fixes into the pudl-derived respondent id-based fixes, I kept the sourcey-ness. This is mostly to enable checking things at each stage - primarily ensure_eia_code_uniqueness

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense!

@@ -293,6 +317,26 @@ def _assign_respondent_id_ferc714(
return df


def _fillna_respondent_id_ferc714_source(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually also necessary in the hourly table as well!

Copy link
Member

@aesharpe aesharpe Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth checking whether it's necessary for the annual table too

@cmgosnell cmgosnell marked this pull request as ready for review September 20, 2024 13:10
Copy link
Member

@aesharpe aesharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly non-blocking comments or questions!

@@ -175,27 +175,45 @@
"""Mapping between standardized time offset codes and canonical timezones."""

EIA_CODE_FIXES = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense!

"xbrl": {
"entity_id": "respondent_id_ferc714_xbrl",
"respondent_legal_name": "respondent_name_ferc714",
"respondent_identification_code": "eia_code",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for sure eia code? seems fishy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I supposed the csv version had eia_code so it makes sense that this is too.

Comment on lines +328 to +337
# use the source utility ID column to get a unique map and for merging
resp_id_col = f"respondent_id_ferc714_{source}"
resp_map_series = respondent_map_ferc714.dropna(subset=[resp_id_col]).set_index(
"respondent_id_ferc714"
)[resp_id_col]

df[resp_id_col] = df[resp_id_col].fillna(
df["respondent_id_ferc714"].map(resp_map_series)
)
return df
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand how this is working. The _assign_respondent_id_ferc714 function maps the respondent_id_ferc714 column and then this column appears to work backwards from that to map on missing respondent_id_ferc714_source values. How is that possible? If there is no respondent_id_ferc714_source to begin with how can we map a respondent_id_ferc714 value onto it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not sure I understand why this is important when we drop the respondent id source columns anyways?

@@ -293,6 +317,26 @@ def _assign_respondent_id_ferc714(
return df


def _fillna_respondent_id_ferc714_source(
Copy link
Member

@aesharpe aesharpe Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth checking whether it's necessary for the annual table too

Comment on lines +353 to +354
TODO: rip this out. enforce_schema happens via the io_managers now.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's stopping us from removing it now?

Comment on lines +473 to +474
eia_code and all eia_codes that are actually the respondent_id_ferc714_xbrl
are nulled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that the respondent_id_ferc714_xbrl and eia_code are the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm looks like no because they start with C!

)
xbrl.loc[code_is_respondent_id_mask, "eia_code"] = pd.NA

# lets null out some of the eia_code's from XBRL that we've manually culled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By manually culled do you mean that these IDs are not the actual eia_code?


@staticmethod
def convert_into_static_table_xbrl(xbrl: pd.DataFrame) -> pd.DataFrame:
"""Convert this annually reported table into a skinner, static table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean skinnier?

Comment on lines +515 to +516
xbrl.groupby(["respondent_id_ferc714_xbrl"])[ # noqa: PD101
["respondent_name_ferc714"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible there are some different spellings of the respondent name that would cause this to look like a 1:1 ratio when it's not?

Comment on lines +537 to +538
def condense_into_one_source_table(df):
"""Condense the CSV and XBRL records together into one record.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking at all, but in the other two tables we just put this code directly into the run function instead of creating it's own function wrapper. NBD, but it would be good to be consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OR we could very easily make this a function vs. a class method because I think it's the same for all tables. We could just have a column parameter or something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-update When fresh data is integrated into PUDL from quarterly or annual updates ferc714 Anything having to do with FERC Form 714
Projects
Status: In review
2 participants