Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

measurex: unify data model with engine/netx #2035

Closed
3 tasks
bassosimone opened this issue Feb 22, 2022 · 1 comment
Closed
3 tasks

measurex: unify data model with engine/netx #2035

bassosimone opened this issue Feb 22, 2022 · 1 comment
Assignees
Labels
cleanup There's need to cleanup stuff a bit ooni/probe-engine priority/medium refactoring techdebt This issue describes technical debt wontfix when an issue won't be addressed (add a comment to the issue as to why this is the case)

Comments

@bassosimone
Copy link
Contributor

bassosimone commented Feb 22, 2022

This issue is about unifying the data model used by websteps/measurex with the ones used by engine/netx. We explain below why we have three data models and what is the plan. We'll close this issue when we'll be using a single data model.

Data models

Let's start off by describing all the data models that we have currently. There are two data models of interest. The former is the archival data model (./internal/model/archival.go). This data model describes how we should exchange data with the backend. The latter is the flat data model used internally to represent measurements.

They have conflicting requirements. The archival data model is basically the historical representation of measurements we've been using since we switched OONI Probe from YAML to JSON. While it would be nice to change this data model to become something more regular and simple to parse (e.g., by getting rid of nullable types and enforcing strict typing policies), we are currently not in a position to do that. Unfortunately, this data model is quite annoying to use internally for processing measurements, because it'a not regular and does not have strict typing. So, there have been a bunch of experiments at using a more regular and simpler to use "internal" data model. I call this model "flat" because it aims to reduce nesting.

As an historical note, we have more than one such "flat" data model implementation. While this is, of course, suboptimal, I guess it's part of the baggage of experimenting and trying to figure out what a good flat data model would look like.

This observation that we have more than one implementation smoothly transitions us to the next section of this issue.

Implementations

Now, onto the implementations (as of ooni/probe-cli@ce40127). The engine/netx package has its own flat data model at ./internal/engine/netx/trace and is already using the ./internal/model/archival.go data model to represent data serialized for the OONI backend. The main issue with engine/netx's flat data model is that every observation shares the same structure, which makes processing and serialization complex. We had the idea some time ago to just submit this kind of observations and let the backend do its magic. Hence, the design leaned towards flattening any kind of event into the same event structure. But we learned with time that this does not body well with processing internally the observations to measurement results (or "probe opinions", if you wish). In fact, the resulting code is quite messy and it looks more like Python than Go code, which is bad in terms of safety and mental baggage when reading it. For this reason, when we started working on websteps, we started experimenting with even more regular data models.

The internal/archival data format defines its own flat data model and also uses ./internal/model/archival.go. This code is currently unused. It starts as a refactoring of the code in measurex where we also added unit testing. The nice property of the flat representation provided by this package is that every event type has its own model. This fact greatly simplifies processing of events by type, which is useful when you're producing "probe opinions".

The internal/measurex package has its own definition of both the archival and the flat data models. The code is very similar to the code inside of internal/archival since the latter derives from measurex code. Because it's quite similar to internal/archival, also this flat data model is simpler to process and evaluate for the probe.

Dynamics

Even before opening this issue, the plan was to refactor code from measurex into internal/archival and ensuring that all parts of the codebase use internal/archival for both using a flat data model and producing measurements using the archival data model.

However, because this refactoring is still in progress, we have three data models at the moment.

(I should have documented this issue earlier, but the second best moment to document it appears to be now.)

The plan

In going forward, the plan is roughly the following:

  • temporarily remove the concept of oddity from measurex: as documented in websteps: changes to the oddity model #2034, there are some issues with this concept, so we can safely back it out and perhaps make it an "internal" aspect of a flat data model that just helps us to compare similar measurements.
  • refactor measurex to depend on internal/archival: the main obstacles here are (i) the fact that internal/archival does not have a concept of oddity (but we're removing oddity for now, so we're fine) and (ii) the fact that measurex uses its flat data model as an exchange format with its test helper. So, regarding (ii), the main problem is that the flat data model of internal/measurex represents errors as error, which is a type that cannot round trip when using JSON. I spent some time thinking about this problem and I think that the right way to go about it is using more regular types. So, we basically need the flat data model of internal/archival to reduce as much as possible type complexity. Regarding errors, this would mean abandoning the Optional<String> type and instead just represent the absence of error as the empty string. Once these changes are made, we basically have the possibility of completely rewriting measurex (and hence websteps) in terms of internal/archival.
  • refactor engine/netx to use the internal/archival package as opposed to its own flat data model.

Because these steps are not very small, it may be that each of them requires its own issue.

@bassosimone bassosimone added priority/medium refactoring ooni/probe-engine techdebt This issue describes technical debt cleanup There's need to cleanup stuff a bit labels Feb 22, 2022
@bassosimone bassosimone self-assigned this Feb 22, 2022
bassosimone added a commit to ooni/probe-cli that referenced this issue Jun 2, 2022
There are two reasons why this is beneficial:

1. github.com/google/go-cmp is more annoying to use for comparing
data structures when there are interfaces to compare. Sure, there's
a recipe for teaching it to compare errors, but how about making
the errors trivially comparable instead?

2. if we want to send errors over the network, JSON serialization
works but we cannot unmarshal the resulting string back to an error,
so how about making this representation trivial to serialize (we
are not going this now, but we need this property for websteps and
it may be sensible to try to avoid to have duplicate code because
of that -- measurex currently duplicates many tracex functionality
and this is quite unfortunate because it slows development down)

Additionally, if an error is a string:

3. we cannot very easily use a switch for comparing its possible
values with "" representing the absence of errors, while it is
more complex to do the same when using a nullable string or even
and error

4. if a type is not nullable, it's easier to write safe code for
it and we may want to refactor experiments to use the internal
representation of measurements for more robust processing code

For all these reasons, let's internally use strings in tracex.

The overall aim here is to reduce the duplicated code between pre
and post-measurex measurements (see ooni/probe#2035).
bassosimone added a commit to ooni/probe-cli that referenced this issue Jun 2, 2022
There are two reasons why this is beneficial:

1. github.com/google/go-cmp is more annoying to use for comparing
data structures when there are interfaces to compare. Sure, there's
a recipe for teaching it to compare errors, but how about making
the errors trivially comparable instead?

2. if we want to send errors over the network, JSON serialization
works but we cannot unmarshal the resulting string back to an error,
so how about making this representation trivial to serialize (we
are not going this now, but we need this property for websteps and
it may be sensible to try to avoid to have duplicate code because
of that -- measurex currently duplicates many tracex functionality
and this is quite unfortunate because it slows development down)

Additionally, if an error is a string:

3. we can very easily use a switch for comparing its possible
values with "" representing the absence of errors, while it is
more complex to do the same when using a nullable string or even
an error (i.e., an interface)

4. if a type is not nullable, it's easier to write safe code for
it and we may want to refactor experiments to use the internal
representation of measurements for more robust processing code

For all these reasons, let's internally use strings in tracex.

The overall aim here is to reduce the duplicated code between pre
and post-measurex measurements (see ooni/probe#2035).
bassosimone added a commit to ooni/probe-cli that referenced this issue Jun 2, 2022
By just storing the raw certificate we simplify the internal data
structure we use. In turn, this enables us to write better unit tests
using github.com/google/go-cmp where we can construct the expected
result and compare with that. (Yeah, in principle we could also
construct the full certificate but I'm not sure it's worth the effort
since we basically only care about the raw certificate.)

The general idea here is to make tracex more tested. Once it's more
tested, I will create separate structs for each event, which is
something that measurex also does. Once that is done, we can start
ensuring that the code in measurex and the code in tracex do the
same thing in terms of storing observations. When also this is done,
we can then rewrite measurex to use tracex directly.

The overall goal is ooni/probe#2035.
bassosimone added a commit to ooni/probe-cli that referenced this issue Jun 2, 2022
By just storing the raw certificate we simplify the internal data
structure we use. In turn, this enables us to write better unit tests
using github.com/google/go-cmp where we can construct the expected
result and compare with that. (Yeah, in principle we could also
construct the full certificate but I'm not sure it's worth the effort
since we basically only care about the raw certificate.)

The general idea here is to make tracex more tested. Once it's more
tested, I will create separate structs for each event, which is
something that measurex also does. Once that is done, we can start
ensuring that the code in measurex and the code in tracex do the
same thing in terms of storing observations. When also this is done,
we can then rewrite measurex to use tracex directly.

The overall goal is ooni/probe#2035.
@bassosimone bassosimone changed the title websteps: unify data model with archival and engine/netx measurex: unify data model with engine/netx Jun 14, 2022
@bassosimone
Copy link
Contributor Author

This issue has now become a WONTFIX because the step-by-step design calls for freezing all measurements packages, for getting rid of the flat data model in new experiments, and for improving the archival data model to add support for missing fields.

@bassosimone bassosimone added the wontfix when an issue won't be addressed (add a comment to the issue as to why this is the case) label Jun 17, 2022
@bassosimone bassosimone closed this as not planned Won't fix, can't repro, duplicate, stale Jun 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup There's need to cleanup stuff a bit ooni/probe-engine priority/medium refactoring techdebt This issue describes technical debt wontfix when an issue won't be addressed (add a comment to the issue as to why this is the case)
Projects
None yet
Development

No branches or pull requests

1 participant