Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patsy pickling #67

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Patsy pickling #67

wants to merge 13 commits into from

Conversation

chrish42
Copy link
Contributor

@chrish42 chrish42 commented May 21, 2015

Branch for the rest of the work required to make patsy pickles work well. At a high level, we need to implement __getstate__ and __setstate__ for relevant classes (easy), and add a testing framework (that is easy to use) that makes sure that pickles created with one version of patsy continue to work with newer versions (harder).

Pickling testsuite:

  • Create dict of pickling tests, with the key being the test name, and the value being an object of the above list, in a given configuration.
  • Create a test helper that iterates through the above, saving a pickle for each in a file pickles/vX.y/test_name.pickle (where "test_name" is the key in the dict for the corresponding object). This program would be run before every release (with the version number of the release as an argument), to create a new set of pickle files for that release. Said pickle files would be added to the source repository.
  • Create an adapter that makes nosetests iterate through all these objects, and check that they compare equal to all the versions of their corresponding pickles.

Pickle support:

  • Implement __getstate__ and __setstate__ methods for EvalFactor
  • Implement __getstate__ and __setstate__ methods for LookupFactor
  • Implement __getstate__ and __setstate__ methods for Term
  • Implement __getstate__ and __setstate__ methods for EvalEnvironment
  • Implement __getstate__ and __setstate__ methods for the stateful transform classes
  • Implement __getstate__ and __setstate__ methods for ContrastMatrix
  • Implement __getstate__ and __setstate__ methods for DesignInfo
  • Implement __getstate__ and __setstate__ methods for DesignMatrix

@coveralls
Copy link

Coverage Status

Coverage increased (+0.0%) to 97.87% when pulling c9e91fc on chrish42:patsy-pickles into 3819ae9 on pydata:master.

@njsmith
Copy link
Member

njsmith commented May 25, 2015

Hey sorry, I'll get to this soon but I'll need to spend some time thinking hard about the current organization of build.py first, so it might be a few more days yet.

@chrish42
Copy link
Contributor Author

chrish42 commented Jun 6, 2015

Any news?

@njsmith
Copy link
Member

njsmith commented Jun 9, 2015

Right, thanks for the ping :-).

I spent a few hours looking at this last night, and it's complicated!

The easy part: there are a bunch of simpler, lower-level objects that model pickling will rely on. For the most part I think we just need to add some tests for these (pickle/unpickle, and ideally having some historical pickles stored in the test suite to test that new versions of patsy continue to be able to unpickle them):

  • EvalFactor
  • LookupFactor
  • Term (and in particular making sure its .name() method continues to work)
  • EvalEnvironment
  • all the stateful transform classes (in state.py, splines.py, mgcv_cubic_splines.py) -- there's some shared testing code for these in test_state.py:check_stateful that could take care of a lot of the work
  • ContrastMatrix

Then there's the core design stuff, mostly in build.py, which currently includes:

  • the two FactorEvaluator classes
  • _ColumnBuilder
  • DesignMatrixBuilder

This is the hard part, because the internal representations here are a mess -- weird encoding of information, redundant encodings across different objects, and so on. This really needs some substantive refactoring.

And then there's a few high-level objects that need some attention:

  • DesignInfo -- a little complicated because it has a few different possible states -- sometimes only partial metadata is available -- but I think probably fine except for tests
  • DesignMatrix -- currently inherits ndarray.__reduce__, which causes it to lose the crucial .design_info attribute when going through a pickle/unpickle cycle, so that needs fixing

So I'm going to keep thinking about the best way to handle build.py, but at least this should give some picture of the landscape...

@njsmith
Copy link
Member

njsmith commented Jun 9, 2015

It might also be worth doing a quick pass through the source to find any other classes that aren't on the above list, and add def __getstate__(self): raise NotImplementedError methods just to be safe...

njsmith added a commit that referenced this pull request Jun 14, 2015
…ignInfo

Notable changes:
- DesignInfo now exposes lots more metadata about how exactly different
  factors and terms are coded.
- In fact, it exposes enough more metadata that you can now reconstruct
  a design matrix entirely from a DesignInfo, so DesignMatrixBuilder
  becomes redundant and is removed in favor of DesignInfo.
- DesignInfo's constructor is very different; in particular, removed the
  option of specifying terms as strings, which was only useful for
  interoperability with competing formula libraries. Four years later, no
  such competitors have appeared, so I can't be bothered to keep maintaining
  this. Will re-add later if someone actually wants to use it.

This versions works and passes tests, but a bunch more tests need to be
added.

This fixes #61, and sets us up to implemented pickling support (#26, #67).
@njsmith
Copy link
Member

njsmith commented Jun 14, 2015

Okay, I sat down and refactored big chunks of build.py, so now this should be easy :-). Basically I added a bunch more public metadata to DesignInfo, which is useful in its own right (#61), and then I redid stuff in build.py so that it can build matrices without ever using any non-public information. Which makes this much simpler.

I also went through and added raise-an-error __getstate__ methods to everything.

So now I think the todo list is just:

  • Go through all the classes mentioned above one at a time, removing the __getstate__ method and adding pickle/unpickle tests.
  • That's it.

@chrish42
Copy link
Contributor Author

That's great! I don't have time to look at this right now, but I should have more time around the next weekend.

@njsmith
Copy link
Member

njsmith commented Jun 15, 2015

Cool, no worries.

If I get some time I might go ahead and release 0.4.0, so people can at
least start taking advantage of the new stuff in master. But that's no big
deal, it would just mean the pickle stuff comes out in 0.5.0.

On Sun, Jun 14, 2015 at 5:03 PM, Christian Hudon notifications@github.com
wrote:

That's great! I don't have time to look at this right now, but I should
have more time around the next weekend.


Reply to this email directly or view it on GitHub
#67 (comment).

Nathaniel J. Smith -- http://vorpus.org

@chrish42
Copy link
Contributor Author

Working on this. Quick questions. Should I do a __getstate__ implementation for everything, or is there some stuff that doesn't need to be pickled and can stay with __getstate__ = no_pickling? Also, what do you think would be good tests for this?

@njsmith
Copy link
Member

njsmith commented Jun 24, 2015

I guess we need a __getstate__ (or __reduce__ or whatever) for everything that's transitively referenced from DesignInfo or DesignMatrix. From the analysis in my comment above, I think that means: EvalFactor, LookupFactor, Term, EvalEnvironment, the stateful transform classes, ContrastMatrix, DesignInfo, DesignMatrix.

For tests, I'd generally suggest round-trip tests, plus some tests that involve stored pickle files (or strings inside the test .py, or whatever) that we attempt to load (since this is the only way to make sure we haven't accidentally broken both the save and load pathways in parallel, which is surprisingly easy to do with pickle).

@chrish42
Copy link
Contributor Author

Getting back to working on patsy. My plan is to write the pickling support and tests for a simple class (I picked EvalFactor) and show you that. Once we're happy with the way things are done for that class, I'll do the same approach for all the other classes.

So, while I'm in EvalFactor, I have a quick question. What's the self.code instance variable? It's not in the documentation of the class, and it also isn't used anywhere in said class. Do we still want to restore that on pickling?

Also, I assume we don't care about forward compatibility (i.e. it's okay if an older version of patsy can't read a pickle generated by a newer version), correct?

@chrish42
Copy link
Contributor Author

And for the tests, here's actually what I'm thinking of doing right now. You can tell me what you think, and if it sounds realistic.

We want to achieve two main things with the tests for pickling: 1) testing that pickling round-trips, and 2) testing that pickles created by old versions of patsy still work on newer version. I'd like not to repeat myself too much between these two, so here's what I'm thinking. I create a bunch of interesting patsy objects, all contained in (say) a dict. Then for each of these objects, I can test that obj == pickle.loads(pickle.dumps(obj)). I can also easily create a script that pickles all of these objects to disk, under a directory named after the version of patsy. That script would be run before each release. Then I can iterate over the directories for all the versions, and test that for each object, obj == pickle.loads(the_pickle_for_that_object_for_a_given_version).

Do you think this would work as a testing strategy? Or am I missing something?

@njsmith
Copy link
Member

njsmith commented Jul 17, 2015

EvalFactor.code looks used to me?

Forward compatibility: It'd be good if trying to load a new pickle in an old version at least gave an error? But we can always deal with that later at the time we decide to change something in a breaking fashion. (I guess if one wants to get really fancy one could put a version number field in every pickle so that old versions will not just error out, but also be able to display a comprehensible error message - "This version of patsy is too old..." -- but this is definitely an extra credit kind of thing.)

The testing strategy also looks good to me. I guess the one subtlety will be that if we add new stuff in a future release then we'll need to have some way to add new test objects that are only present in newer pickle dumps, while still keeping the old pickle dumps.

One possible approach: represent the test objects as little bits of constructor code -- "EvalFactor(...)", etc. -- and then have the version snapshot script dump both the pickles themselves and also the code used to create the objects that they should be compared to. (And then there can also be a test that loops over the same list for round-trips: for code in TEST_OBJECT_CODES: obj = eval(code); assert pickle.loads(pickle.dumps(obj)) == obj)

@louispotok
Copy link

Hi! I just came across this PR via this issue: #26

Is this still the PR that would solve that issue? Is there a way that I can help here?

@njsmith
Copy link
Member

njsmith commented Apr 30, 2016

What's happened so far has happened here, yes :-). IIRC the general plan for how to handle this is clear, but stalled out with the work mostly not done. If you and @chrish42 want to work on it together, or if @chrish42 is busy and you want to just pick up his work and continue where he left off, then either way it would be great, and I'm happy to continue providing advice and review.

@louispotok louispotok mentioned this pull request May 1, 2016
@chrish42
Copy link
Contributor Author

chrish42 commented Jun 4, 2016

I've updated the description at the top to actually talk about the work that remains to be done, after going through all the comments in the pull request and other sources. @njsmith, let me know if that approach makes at least some sense or not. :-) After thinking about it some more, I still think that giving each "pickle test" its own name and corresponding object in a Python file is a better approach overall than saving the code as text inside a pickle. This mirrors other tests more (they all have a name too). We can discuss this approach more once I've added code to show how it would work in practice. (Will be easier to talk about it then.)

@njsmith
Copy link
Member

njsmith commented Jun 4, 2016

@chrish42: top level comment looks reasonable, and if you have a better idea I am extremely happy to defer to it :-)

@chrish42
Copy link
Contributor Author

chrish42 commented Jun 4, 2016

This is not quite done or working yet, but it's starting to show the general approach.

@chrish42
Copy link
Contributor Author

chrish42 commented Jun 6, 2016

@njsmith Now would be good time to have a look and give any comments about the general approach. If things look okay, we can start making more of the patsy objects pickeable, and add more pickling testcases. Comments and questions welcome!


from patsy import EvalFactor

PICKE_TESTCASES_ROOTDIR = os.path.join(os.path.dirname(__file__), '..', 'pickle_testcases')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/PICKE/PICKLE/, I assume?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, hmm, so it looks like you're storing the pickle tests outside the source tree. Right now we go to some effort to ship the test suite, so that we can test patsy-as-installed... and I think the current CI stuff tries hard to run the test suite without access to the source tree, to make sure that this works. So this is definitely a change. Maybe it's a good change if the pickle test suite is large... but we'll need to do some more work in that case to make it really work. OTOH, maybe the pickle test suite is small, in which case we might as well ship it, because testing is important.

Probably the simplest thing to do is to move it into a patsy/test_data directory inside the package for now, and then we can always move it back out again if we decide that we have too many megabytes of test pickles. (This will probably require some annoying change to setup.py as well to include the test data -- example.)

@njsmith
Copy link
Member

njsmith commented Jun 10, 2016

I made a bunch of small comments, but the overall approach looks fine to me! Thanks for working on this, and sorry for being slow in reviewing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants