
Standardize names for common variables and quantities in the Jupyter Notebook Style Guide #4821

Closed
martinacantaro opened this issue Jun 29, 2021 · 5 comments · Fixed by #5267

@martinacantaro
Contributor

martinacantaro commented Jun 29, 2021

Standardize names for common variables and quantities in the Jupyter Notebook Style Guide, so that the style is consistent throughout the documentation.

Suggestion by @OriolAbril :

It would probably be good to always use the same names. For sample stats we defined a convention at ArviZ that PyMC3 will now use because it uses InferenceData, but there are still many other similar cases that could be considered: stacking the chain and draw dimensions -> sample? __sample__? draws? chain_draw?

@michaelosthege
Member

michaelosthege commented Jun 29, 2021

Working with InferenceData should be the default.
Out of { idata, itrace, trace } I prefer idata as the variable name.

For many visualizations one wants to pool samples from all chains. In my experience this is the least verbose way:

with pmodel:
    idata = pm.sample()
posterior = idata.posterior.stack(sample=("chain", "draw"))

If it's not in there already, we should also discourage interactive plots, because they don't render on the docs. Unless there's a way to work around that and at least show the default render?
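The stacking step above can be shown self-contained with synthetic values standing in for real sampling output (the `pm.sample()` call is what would normally produce `idata`; the dataset below just mimics the `(chain, draw)` layout):

```python
import numpy as np
import xarray as xr

# Fake "posterior" with the (chain, draw) layout InferenceData uses:
# 4 chains x 1000 draws of a scalar parameter mu.
posterior = xr.Dataset(
    {"mu": (("chain", "draw"), np.random.default_rng(0).normal(size=(4, 1000)))}
)

# Pool all chains into one flat "sample" dimension for plotting.
pooled = posterior.stack(sample=("chain", "draw"))

print(pooled["mu"].sizes)  # {'sample': 4000}
```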

@fonnesbeck
Member

fonnesbeck commented Jun 29, 2021

I've always thought calling the output from sampling *data is odd and potentially confusing, because it is not data: it's modeling output. But since the object is InferenceData, it is consistent, so that's fine.

@OriolAbril
Member

idata sounds like a good default. My main pet peeve with idata/trace is that, due to the change from MultiTrace to InferenceData, trace is now used somewhat arbitrarily to mean a MultiTrace, an InferenceData, or the posterior group of the InferenceData, which I find quite confusing.

@OriolAbril
Member

As commented in the doc meeting, here are my personal conventions and fallbacks, which we could use as a starting point. I think this can be very useful, especially as a recommendation, so that people who hesitate over a name can come here, both saving time when writing and building a more homogeneous gallery.

  • idata for sampling results, always as InferenceData

  • I also very often store groups as variables, in which case I use either 3-5 letter abbreviations or the group name. I think this is much more convenient for postprocessing. Some examples: post/posterior, const/constant_data, post_pred/posterior_predictive, obs_data...

  • For stats and diagnostics I generally use the arviz function name as variable name: ess = az.ess(...), loo = az.loo(...)

  • If there are multiple models in a notebook, I assign a prefix to each model and use it throughout to identify which variables map to each model. Taking the famous eight schools example, I'd use centered_model (pm.Model object), centered_trace, centered_post, centered_ess... and non_centered_model, non_centered_trace...

  • I always use singular dimension names, following ArviZ chain and draw. For example sample, cluster, axis, component, forest, time...

  • Whenever I can't think of a meaningful name for the dimension representing the number of observations such as time, I fall back to obs_id

  • for matrix dimensions, as xarray doesn't allow repeated dimension names, I add a _bis suffix, e.g. param, param_bis

  • for the dimension resulting from stacking chain and draw I use sample, that is ...stack(sample=("chain", "draw"))

  • We often need to encode a categorical variable as integers. I add _idx to the name of the variable it encodes, e.g. from floor and county to floor_idx and county_idx.

  • to avoid clashes and overwriting variables when using pm.Data, I sometimes use the following pattern:

    x = np.array(...)
    with pm.Model():
        x_ = pm.Data("x", x)
        ...
    

    so that I don't override the original x, I still have idata.constant_data["x"], and within the model x_ is available to play the role of x. Otherwise I always try to use the same variable name as the string name I give to the PyMC variable.

  • I use fig for matplotlib figures, ax for a single matplotlib axes object, and axes for arrays of matplotlib axes objects.

  • It is often useful to make a linspace into a DataArray so that xarray handles aligning and broadcasting automatically and eases computation. This variable needs a dimension name for that to work. I use plot or x_plot.

  • When looping, if I need to store a variable after subsetting I append the index variable I used for looping to the original variable name. Example:

    variable = np.array(...)
    x = np.array(...)
    for i in range(N):
        variable_i = variable[i]
        for j in range(K):
            x_j = x[j]
            ...
    

This is everything that came to mind after a bit of thinking, and it goes way beyond the original intent of the issue, with many suggestions completely unrelated to PyMC.
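A couple of the conventions above can be illustrated concretely; the `county` values below are made up:

```python
import numpy as np

# Categorical covariate encoded as integers, following the `_idx` convention.
county = np.array(["A", "B", "A", "C", "B"])
county_names, county_idx = np.unique(county, return_inverse=True)

print(county_names)  # ['A' 'B' 'C']
print(county_idx)    # [0 1 0 2 1]

# Singular dimension names, with `obs_id` as the fallback when no
# more meaningful name (like `time`) applies.
coords = {"county": county_names, "obs_id": np.arange(len(county))}
```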

@michaelosthege
Member

I prefer axs for multiple matplotlib axes.
But more importantly I found the following to be really useful:

Instead of

fig, axs = pyplot.subplots(nrows=2, ncols=3)

axs[0, 1].plot(...)
axs[0, 1].set(...)

axs[1, 2].scatter(...)

create local ax variables:

ax = axs[0, 1]
ax.plot(...)
ax.set(...)

ax = axs[1, 2]
ax.scatter(...)

The advantage is that it's much easier to switch subplot positions, or to move plotting code around.
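A runnable version of this pattern (the grid shape and the plotted values are placeholders):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

fig, axs = plt.subplots(nrows=2, ncols=3)

# Local `ax` variable: moving a plot to another subplot only means
# changing the indexing line, not every plotting call below it.
ax = axs[0, 1]
ax.plot(np.arange(10))
ax.set(title="line", xlabel="x")

ax = axs[1, 2]
ax.scatter(np.arange(10), np.arange(10) ** 2)
ax.set(title="scatter")
```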

Another looping style preference: When using enumerate I take the first letter of the variable as the count:

for p, person in enumerate(persons):
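A minimal runnable sketch of this enumerate convention (the `persons` list is made up):

```python
persons = ["Ann", "Ben", "Cleo"]

labels = []
# Single-letter count `p` taken from the first letter of `person`.
for p, person in enumerate(persons):
    labels.append(f"{p}: {person}")

print(labels)  # ['0: Ann', '1: Ben', '2: Cleo']
```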
