measurex: unify data model with engine/netx #2035

bassosimone · 2022-02-22T10:53:29Z

This issue is about unifying the data model used by websteps/measurex with the ones used by engine/netx. We explain below why we have three data models and what is the plan. We'll close this issue when we'll be using a single data model.

Data models

Let's start off by describing all the data models that we have currently. There are two data models of interest. The former is the archival data model (./internal/model/archival.go). This data model describes how we should exchange data with the backend. The latter is the flat data model used internally to represent measurements.

They have conflicting requirements. The archival data model is basically the historical representation of measurements we've been using since we switched OONI Probe from YAML to JSON. While it would be nice to change this data model to become something more regular and simple to parse (e.g., by getting rid of nullable types and enforcing strict typing policies), we are currently not in a position to do that. Unfortunately, this data model is quite annoying to use internally for processing measurements, because it'a not regular and does not have strict typing. So, there have been a bunch of experiments at using a more regular and simpler to use "internal" data model. I call this model "flat" because it aims to reduce nesting.

As an historical note, we have more than one such "flat" data model implementation. While this is, of course, suboptimal, I guess it's part of the baggage of experimenting and trying to figure out what a good flat data model would look like.

This observation that we have more than one implementation smoothly transitions us to the next section of this issue.

Implementations

Now, onto the implementations (as of ooni/probe-cli@ce40127). The engine/netx package has its own flat data model at ./internal/engine/netx/trace and is already using the ./internal/model/archival.go data model to represent data serialized for the OONI backend. The main issue with engine/netx's flat data model is that every observation shares the same structure, which makes processing and serialization complex. We had the idea some time ago to just submit this kind of observations and let the backend do its magic. Hence, the design leaned towards flattening any kind of event into the same event structure. But we learned with time that this does not body well with processing internally the observations to measurement results (or "probe opinions", if you wish). In fact, the resulting code is quite messy and it looks more like Python than Go code, which is bad in terms of safety and mental baggage when reading it. For this reason, when we started working on websteps, we started experimenting with even more regular data models.

The internal/archival data format defines its own flat data model and also uses ./internal/model/archival.go. This code is currently unused. It starts as a refactoring of the code in measurex where we also added unit testing. The nice property of the flat representation provided by this package is that every event type has its own model. This fact greatly simplifies processing of events by type, which is useful when you're producing "probe opinions".

The internal/measurex package has its own definition of both the archival and the flat data models. The code is very similar to the code inside of internal/archival since the latter derives from measurex code. Because it's quite similar to internal/archival, also this flat data model is simpler to process and evaluate for the probe.

Dynamics

Even before opening this issue, the plan was to refactor code from measurex into internal/archival and ensuring that all parts of the codebase use internal/archival for both using a flat data model and producing measurements using the archival data model.

However, because this refactoring is still in progress, we have three data models at the moment.

(I should have documented this issue earlier, but the second best moment to document it appears to be now.)

The plan

In going forward, the plan is roughly the following:

temporarily remove the concept of oddity from measurex: as documented in websteps: changes to the oddity model #2034, there are some issues with this concept, so we can safely back it out and perhaps make it an "internal" aspect of a flat data model that just helps us to compare similar measurements.
refactor measurex to depend on internal/archival: the main obstacles here are (i) the fact that internal/archival does not have a concept of oddity (but we're removing oddity for now, so we're fine) and (ii) the fact that measurex uses its flat data model as an exchange format with its test helper. So, regarding (ii), the main problem is that the flat data model of internal/measurex represents errors as error, which is a type that cannot round trip when using JSON. I spent some time thinking about this problem and I think that the right way to go about it is using more regular types. So, we basically need the flat data model of internal/archival to reduce as much as possible type complexity. Regarding errors, this would mean abandoning the Optional<String> type and instead just represent the absence of error as the empty string. Once these changes are made, we basically have the possibility of completely rewriting measurex (and hence websteps) in terms of internal/archival.
refactor engine/netx to use the internal/archival package as opposed to its own flat data model.

Because these steps are not very small, it may be that each of them requires its own issue.

The text was updated successfully, but these errors were encountered:

There are two reasons why this is beneficial: 1. github.com/google/go-cmp is more annoying to use for comparing data structures when there are interfaces to compare. Sure, there's a recipe for teaching it to compare errors, but how about making the errors trivially comparable instead? 2. if we want to send errors over the network, JSON serialization works but we cannot unmarshal the resulting string back to an error, so how about making this representation trivial to serialize (we are not going this now, but we need this property for websteps and it may be sensible to try to avoid to have duplicate code because of that -- measurex currently duplicates many tracex functionality and this is quite unfortunate because it slows development down) Additionally, if an error is a string: 3. we cannot very easily use a switch for comparing its possible values with "" representing the absence of errors, while it is more complex to do the same when using a nullable string or even and error 4. if a type is not nullable, it's easier to write safe code for it and we may want to refactor experiments to use the internal representation of measurements for more robust processing code For all these reasons, let's internally use strings in tracex. The overall aim here is to reduce the duplicated code between pre and post-measurex measurements (see ooni/probe#2035).

There are two reasons why this is beneficial: 1. github.com/google/go-cmp is more annoying to use for comparing data structures when there are interfaces to compare. Sure, there's a recipe for teaching it to compare errors, but how about making the errors trivially comparable instead? 2. if we want to send errors over the network, JSON serialization works but we cannot unmarshal the resulting string back to an error, so how about making this representation trivial to serialize (we are not going this now, but we need this property for websteps and it may be sensible to try to avoid to have duplicate code because of that -- measurex currently duplicates many tracex functionality and this is quite unfortunate because it slows development down) Additionally, if an error is a string: 3. we can very easily use a switch for comparing its possible values with "" representing the absence of errors, while it is more complex to do the same when using a nullable string or even an error (i.e., an interface) 4. if a type is not nullable, it's easier to write safe code for it and we may want to refactor experiments to use the internal representation of measurements for more robust processing code For all these reasons, let's internally use strings in tracex. The overall aim here is to reduce the duplicated code between pre and post-measurex measurements (see ooni/probe#2035).

By just storing the raw certificate we simplify the internal data structure we use. In turn, this enables us to write better unit tests using github.com/google/go-cmp where we can construct the expected result and compare with that. (Yeah, in principle we could also construct the full certificate but I'm not sure it's worth the effort since we basically only care about the raw certificate.) The general idea here is to make tracex more tested. Once it's more tested, I will create separate structs for each event, which is something that measurex also does. Once that is done, we can start ensuring that the code in measurex and the code in tracex do the same thing in terms of storing observations. When also this is done, we can then rewrite measurex to use tracex directly. The overall goal is ooni/probe#2035.

bassosimone · 2022-06-17T12:54:16Z

This issue has now become a WONTFIX because the step-by-step design calls for freezing all measurements packages, for getting rid of the flat data model in new experiments, and for improving the archival data model to add support for missing fields.

bassosimone added priority/medium refactoring ooni/probe-engine techdebt This issue describes technical debt cleanup There's need to cleanup stuff a bit labels Feb 22, 2022

bassosimone self-assigned this Feb 22, 2022

bassosimone mentioned this issue Feb 22, 2022

feat: call getaddrinfo directly and extract its return value ooni/probe-cli#698

Draft

3 tasks

bassosimone mentioned this issue Jun 2, 2022

refactor(tracex): internally represent errors as strings ooni/probe-cli#786

Merged

bassosimone mentioned this issue Jun 2, 2022

refactor(tracex): internally store just the raw certificate ooni/probe-cli#787

Merged

bassosimone changed the title ~~websteps: unify data model with archival and engine/netx~~ measurex: unify data model with engine/netx Jun 14, 2022

bassosimone added the wontfix when an issue won't be addressed (add a comment to the issue as to why this is the case) label Jun 17, 2022

bassosimone closed this as not planned Won't fix, can't repro, duplicate, stale Jun 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

measurex: unify data model with engine/netx #2035

measurex: unify data model with engine/netx #2035

bassosimone commented Feb 22, 2022 •

edited

Loading

bassosimone commented Jun 17, 2022

measurex: unify data model with engine/netx #2035

measurex: unify data model with engine/netx #2035

Comments

bassosimone commented Feb 22, 2022 • edited Loading

Data models

Implementations

Dynamics

The plan

bassosimone commented Jun 17, 2022

bassosimone commented Feb 22, 2022 •

edited

Loading