Patsy pickling #67

chrish42 · 2015-05-21T02:51:50Z

Branch for the rest of the work required to make patsy pickles work well. At a high level, we need to implement __getstate__ and __setstate__ for relevant classes (easy), and add a testing framework (that is easy to use) that makes sure that pickles created with one version of patsy continue to work with newer versions (harder).

Pickling testsuite:

Create dict of pickling tests, with the key being the test name, and the value being an object of the above list, in a given configuration.
Create a test helper that iterates through the above, saving a pickle for each in a file pickles/vX.y/test_name.pickle (where "test_name" is the key in the dict for the corresponding object). This program would be run before every release (with the version number of the release as an argument), to create a new set of pickle files for that release. Said pickle files would be added to the source repository.
Create an adapter that makes nosetests iterate through all these objects, and check that they compare equal to all the versions of their corresponding pickles.

Pickle support:

Implement __getstate__ and __setstate__ methods for EvalFactor
Implement __getstate__ and __setstate__ methods for LookupFactor
Implement __getstate__ and __setstate__ methods for Term
Implement __getstate__ and __setstate__ methods for EvalEnvironment
Implement __getstate__ and __setstate__ methods for the stateful transform classes
Implement __getstate__ and __setstate__ methods for ContrastMatrix
Implement __getstate__ and __setstate__ methods for DesignInfo
Implement __getstate__ and __setstate__ methods for DesignMatrix

coveralls · 2015-05-21T07:49:56Z

Coverage increased (+0.0%) to 97.87% when pulling c9e91fc on chrish42:patsy-pickles into 3819ae9 on pydata:master.

njsmith · 2015-05-25T05:18:24Z

Hey sorry, I'll get to this soon but I'll need to spend some time thinking hard about the current organization of build.py first, so it might be a few more days yet.

chrish42 · 2015-06-06T21:05:40Z

Any news?

njsmith · 2015-06-09T01:26:48Z

Right, thanks for the ping :-).

I spent a few hours looking at this last night, and it's complicated!

The easy part: there are a bunch of simpler, lower-level objects that model pickling will rely on. For the most part I think we just need to add some tests for these (pickle/unpickle, and ideally having some historical pickles stored in the test suite to test that new versions of patsy continue to be able to unpickle them):

EvalFactor
LookupFactor
Term (and in particular making sure its .name() method continues to work)
EvalEnvironment
all the stateful transform classes (in state.py, splines.py, mgcv_cubic_splines.py) -- there's some shared testing code for these in test_state.py:check_stateful that could take care of a lot of the work
ContrastMatrix

Then there's the core design stuff, mostly in build.py, which currently includes:

the two FactorEvaluator classes
_ColumnBuilder
DesignMatrixBuilder

This is the hard part, because the internal representations here are a mess -- weird encoding of information, redundant encodings across different objects, and so on. This really needs some substantive refactoring.

And then there's a few high-level objects that need some attention:

DesignInfo -- a little complicated because it has a few different possible states -- sometimes only partial metadata is available -- but I think probably fine except for tests
DesignMatrix -- currently inherits ndarray.__reduce__, which causes it to lose the crucial .design_info attribute when going through a pickle/unpickle cycle, so that needs fixing

So I'm going to keep thinking about the best way to handle build.py, but at least this should give some picture of the landscape...

njsmith · 2015-06-09T01:30:04Z

It might also be worth doing a quick pass through the source to find any other classes that aren't on the above list, and add def __getstate__(self): raise NotImplementedError methods just to be safe...

…ignInfo Notable changes: - DesignInfo now exposes lots more metadata about how exactly different factors and terms are coded. - In fact, it exposes enough more metadata that you can now reconstruct a design matrix entirely from a DesignInfo, so DesignMatrixBuilder becomes redundant and is removed in favor of DesignInfo. - DesignInfo's constructor is very different; in particular, removed the option of specifying terms as strings, which was only useful for interoperability with competing formula libraries. Four years later, no such competitors have appeared, so I can't be bothered to keep maintaining this. Will re-add later if someone actually wants to use it. This versions works and passes tests, but a bunch more tests need to be added. This fixes #61, and sets us up to implemented pickling support (#26, #67).

njsmith · 2015-06-14T02:44:49Z

Okay, I sat down and refactored big chunks of build.py, so now this should be easy :-). Basically I added a bunch more public metadata to DesignInfo, which is useful in its own right (#61), and then I redid stuff in build.py so that it can build matrices without ever using any non-public information. Which makes this much simpler.

I also went through and added raise-an-error __getstate__ methods to everything.

So now I think the todo list is just:

Go through all the classes mentioned above one at a time, removing the __getstate__ method and adding pickle/unpickle tests.
That's it.

chrish42 · 2015-06-15T00:03:50Z

That's great! I don't have time to look at this right now, but I should have more time around the next weekend.

njsmith · 2015-06-15T08:48:20Z

Cool, no worries.

If I get some time I might go ahead and release 0.4.0, so people can at
least start taking advantage of the new stuff in master. But that's no big
deal, it would just mean the pickle stuff comes out in 0.5.0.

On Sun, Jun 14, 2015 at 5:03 PM, Christian Hudon notifications@github.com
wrote:

That's great! I don't have time to look at this right now, but I should
have more time around the next weekend.

—
Reply to this email directly or view it on GitHub
#67 (comment).

Nathaniel J. Smith -- http://vorpus.org

chrish42 · 2015-06-21T02:25:03Z

Working on this. Quick questions. Should I do a __getstate__ implementation for everything, or is there some stuff that doesn't need to be pickled and can stay with __getstate__ = no_pickling? Also, what do you think would be good tests for this?

njsmith · 2015-06-24T01:03:47Z

I guess we need a __getstate__ (or __reduce__ or whatever) for everything that's transitively referenced from DesignInfo or DesignMatrix. From the analysis in my comment above, I think that means: EvalFactor, LookupFactor, Term, EvalEnvironment, the stateful transform classes, ContrastMatrix, DesignInfo, DesignMatrix.

For tests, I'd generally suggest round-trip tests, plus some tests that involve stored pickle files (or strings inside the test .py, or whatever) that we attempt to load (since this is the only way to make sure we haven't accidentally broken both the save and load pathways in parallel, which is surprisingly easy to do with pickle).

chrish42 · 2015-07-10T01:38:56Z

Getting back to working on patsy. My plan is to write the pickling support and tests for a simple class (I picked EvalFactor) and show you that. Once we're happy with the way things are done for that class, I'll do the same approach for all the other classes.

So, while I'm in EvalFactor, I have a quick question. What's the self.code instance variable? It's not in the documentation of the class, and it also isn't used anywhere in said class. Do we still want to restore that on pickling?

Also, I assume we don't care about forward compatibility (i.e. it's okay if an older version of patsy can't read a pickle generated by a newer version), correct?

chrish42 · 2015-07-10T01:53:45Z

And for the tests, here's actually what I'm thinking of doing right now. You can tell me what you think, and if it sounds realistic.

We want to achieve two main things with the tests for pickling: 1) testing that pickling round-trips, and 2) testing that pickles created by old versions of patsy still work on newer version. I'd like not to repeat myself too much between these two, so here's what I'm thinking. I create a bunch of interesting patsy objects, all contained in (say) a dict. Then for each of these objects, I can test that obj == pickle.loads(pickle.dumps(obj)). I can also easily create a script that pickles all of these objects to disk, under a directory named after the version of patsy. That script would be run before each release. Then I can iterate over the directories for all the versions, and test that for each object, obj == pickle.loads(the_pickle_for_that_object_for_a_given_version).

Do you think this would work as a testing strategy? Or am I missing something?

njsmith · 2015-07-17T03:02:24Z

EvalFactor.code looks used to me?

Forward compatibility: It'd be good if trying to load a new pickle in an old version at least gave an error? But we can always deal with that later at the time we decide to change something in a breaking fashion. (I guess if one wants to get really fancy one could put a version number field in every pickle so that old versions will not just error out, but also be able to display a comprehensible error message - "This version of patsy is too old..." -- but this is definitely an extra credit kind of thing.)

The testing strategy also looks good to me. I guess the one subtlety will be that if we add new stuff in a future release then we'll need to have some way to add new test objects that are only present in newer pickle dumps, while still keeping the old pickle dumps.

One possible approach: represent the test objects as little bits of constructor code -- "EvalFactor(...)", etc. -- and then have the version snapshot script dump both the pickles themselves and also the code used to create the objects that they should be compared to. (And then there can also be a test that loops over the same list for round-trips: for code in TEST_OBJECT_CODES: obj = eval(code); assert pickle.loads(pickle.dumps(obj)) == obj)

louispotok · 2016-04-30T19:12:34Z

Hi! I just came across this PR via this issue: #26

Is this still the PR that would solve that issue? Is there a way that I can help here?

njsmith · 2016-04-30T20:37:45Z

What's happened so far has happened here, yes :-). IIRC the general plan for how to handle this is clear, but stalled out with the work mostly not done. If you and @chrish42 want to work on it together, or if @chrish42 is busy and you want to just pick up his work and continue where he left off, then either way it would be great, and I'm happy to continue providing advice and review.

… results afterwards.

chrish42 · 2016-06-04T00:23:11Z

I've updated the description at the top to actually talk about the work that remains to be done, after going through all the comments in the pull request and other sources. @njsmith, let me know if that approach makes at least some sense or not. :-) After thinking about it some more, I still think that giving each "pickle test" its own name and corresponding object in a Python file is a better approach overall than saving the code as text inside a pickle. This mirrors other tests more (they all have a name too). We can discuss this approach more once I've added code to show how it would work in practice. (Will be easier to talk about it then.)

njsmith · 2016-06-04T02:04:09Z

@chrish42: top level comment looks reasonable, and if you have a better idea I am extremely happy to defer to it :-)

chrish42 · 2016-06-04T23:18:16Z

This is not quite done or working yet, but it's starting to show the general approach.

…bjects.

chrish42 · 2016-06-06T00:16:21Z

@njsmith Now would be good time to have a look and give any comments about the general approach. If things look okay, we can start making more of the patsy objects pickeable, and add more pickling testcases. Comments and questions welcome!

njsmith · 2016-06-10T07:15:25Z

patsy/test_pickling.py

+
+from patsy import EvalFactor
+
+PICKE_TESTCASES_ROOTDIR = os.path.join(os.path.dirname(__file__), '..', 'pickle_testcases')


s/PICKE/PICKLE/, I assume?

Also, hmm, so it looks like you're storing the pickle tests outside the source tree. Right now we go to some effort to ship the test suite, so that we can test patsy-as-installed... and I think the current CI stuff tries hard to run the test suite without access to the source tree, to make sure that this works. So this is definitely a change. Maybe it's a good change if the pickle test suite is large... but we'll need to do some more work in that case to make it really work. OTOH, maybe the pickle test suite is small, in which case we might as well ship it, because testing is important.

Probably the simplest thing to do is to move it into a patsy/test_data directory inside the package for now, and then we can always move it back out again if we decide that we have too many megabytes of test pickles. (This will probably require some annoying change to setup.py as well to include the test data -- example.)

njsmith · 2016-06-10T07:33:39Z

I made a bunch of small comments, but the overall approach looks fine to me! Thanks for working on this, and sorry for being slow in reviewing.

chrish42 mentioned this pull request May 21, 2015

Make DesignMatrixBuilders pickleable and saveable #26

Open

chrish42 force-pushed the patsy-pickles branch from c9e91fc to 4d428a2 Compare June 19, 2015 03:40

njsmith mentioned this pull request Jun 24, 2015

REF: pickling formula information using patsy statsmodels/statsmodels#2469

Open

louispotok mentioned this pull request May 1, 2016

Pickling #86

Closed

chrish42 added 4 commits June 3, 2016 16:20

Check that a builder roundtrips through pickle and will give the same…

34bc09b

… results afterwards.

Beginning of work on pickling.

3bc08d6

Beginning of igh-level tests for pickling.

d5ed045

Remove stray whitespace and indents.

d431764

chrish42 force-pushed the patsy-pickles branch from 006a0cc to d431764 Compare June 3, 2016 23:25

chrish42 added 2 commits June 4, 2016 16:11

In-progress work for pickling tests.

71e4a64

Add TODO comment.

ddbf6be

chrish42 added 7 commits June 5, 2016 15:15

Pickling of EvalFactor, with testcase for making sure origin is pickled.

3164cd9

More fully fleshed-out version of pickling test harness.

c5a8e75

Better error message when unpickling a version that is not supported.

62b1ce4

More flesh-out version of high-level tests for pickling design_info o…

c870d4d

…bjects.

Small code tweaks.

858a623

Sample test pickle.

4b74a05

Add TODO note for force=True future option for creating test pickles.

12c7471

njsmith reviewed Jun 10, 2016
View reviewed changes

thequackdaddy mentioned this pull request Apr 6, 2017

patsy pickles #104

Open

matthewwardrop force-pushed the master branch 2 times, most recently from b07ba3f to 48fd2e4 Compare September 5, 2021 04:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Patsy pickling #67

Patsy pickling #67

chrish42 commented May 21, 2015 •

edited

Loading

coveralls commented May 21, 2015

njsmith commented May 25, 2015

chrish42 commented Jun 6, 2015

njsmith commented Jun 9, 2015

njsmith commented Jun 9, 2015

njsmith commented Jun 14, 2015

chrish42 commented Jun 15, 2015

njsmith commented Jun 15, 2015

chrish42 commented Jun 21, 2015

njsmith commented Jun 24, 2015

chrish42 commented Jul 10, 2015

chrish42 commented Jul 10, 2015

njsmith commented Jul 17, 2015

louispotok commented Apr 30, 2016

njsmith commented Apr 30, 2016

chrish42 commented Jun 4, 2016

njsmith commented Jun 4, 2016

chrish42 commented Jun 4, 2016

chrish42 commented Jun 6, 2016

njsmith Jun 10, 2016

njsmith Jun 10, 2016

njsmith commented Jun 10, 2016


		from patsy import EvalFactor

		PICKE_TESTCASES_ROOTDIR = os.path.join(os.path.dirname(__file__), '..', 'pickle_testcases')

Patsy pickling #67

Are you sure you want to change the base?

Patsy pickling #67

Conversation

chrish42 commented May 21, 2015 • edited Loading

coveralls commented May 21, 2015

njsmith commented May 25, 2015

chrish42 commented Jun 6, 2015

njsmith commented Jun 9, 2015

njsmith commented Jun 9, 2015

njsmith commented Jun 14, 2015

chrish42 commented Jun 15, 2015

njsmith commented Jun 15, 2015

chrish42 commented Jun 21, 2015

njsmith commented Jun 24, 2015

chrish42 commented Jul 10, 2015

chrish42 commented Jul 10, 2015

njsmith commented Jul 17, 2015

louispotok commented Apr 30, 2016

njsmith commented Apr 30, 2016

chrish42 commented Jun 4, 2016

njsmith commented Jun 4, 2016

chrish42 commented Jun 4, 2016

chrish42 commented Jun 6, 2016

njsmith Jun 10, 2016

Choose a reason for hiding this comment

njsmith Jun 10, 2016

Choose a reason for hiding this comment

njsmith commented Jun 10, 2016

chrish42 commented May 21, 2015 •

edited

Loading