Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(json-format): Clarify data encoding in JSON format with a JSON datacontenttype #861

Merged
merged 7 commits into from
Sep 16, 2021

Conversation

dazuma
Copy link
Member

@dazuma dazuma commented Aug 10, 2021

Fixes #558
Fixes #860

This is an attempt to clarify the interpretation of the data member in JSON Format in the case of a JSON-oriented datacontenttype vs a non-JSON-oriented datacontenttype. It also explicitly documents in json-format.md the implied datacontenttype value should the attribute be omitted.

…tacontenttype

Signed-off-by: Daniel Azuma <dazuma@google.com>
Signed-off-by: Daniel Azuma <dazuma@google.com>
@dazuma
Copy link
Member Author

dazuma commented Aug 10, 2021

One open question might be: should we include the "implied" datacontenttype behavior under the attributes section (section 2) in addition to (or instead of) where it currently is documented under section 3?

Signed-off-by: Daniel Azuma <dazuma@google.com>
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
Signed-off-by: Daniel Azuma <dazuma@google.com>
json-format.md Outdated Show resolved Hide resolved
json-format.md Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated
event if it is unable to translate the runtime value to a JSON value.

Otherwise, if the `datacontenttype` is NOT JSON-oriented, a JSON serializer
MUST use the member name `data` to store a string representation of the data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it have to be a string representation? If the event data is just an integer, I'd expect that to be represented as a JSON number, for example.

Copy link
Member Author

@dazuma dazuma Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree: I think it has to be a string. Either the content type indicates JSON, in which case it's encoded as a JSON value, or the content type indicates something else, in which case by virtue of there being a content type in the first place, it implies there's a conversion between a string (or byte sequence, but we've already handled that case with data_base64) and a runtime type.

For example, if the event data is indeed an integer, and for some reason the content type does is not JSON, but some "integer" content type (I don't know what that would be, but let's say "application/integer" for argument's sake), then that content type still implies a particular way of representing that integer in content, i.e. as a string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, but we should get other people's takes on it. Aside from anything else, I'd be happy with the argument that only a string is actually useful.

One possible other case to consider, mind you: formats like YAML or protobuf, which have natural JSON object representations. I think it's at least reasonable to say "when using a JSON event format and a protobuf message, the natural data representation is the JSON for the message". I'd like that to be discussed, but for that case I think it would actually be better to put the binary representation in base64. I look forward to hearing the results of the discussion :)

Copy link
Member Author

@dazuma dazuma Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Furthermore, if we allow non-string representations in this case, it specifically re-introduces the ambiguity that this PR is trying to solve. That is, the whole point of this PR is to distinguish when the data member must be interpreted as string content vs when the data member must be interpreted as a JSON value, because the overlap (when the JSON value happens to be a string) introduces ambiguity.

Let's take your example of YAML, as a format with a "natural JSON representation". Suppose we have the following event:

{
    "specversion" : "1.0",
    "type" : "com.example.someevent",
    "source" : "/mycontext",
    "id" : "B234-1234-1234",
    "time" : "2018-04-05T17:31:00Z",
    "datacontenttype" : "application/yaml",
    "data" : "---\nhello\n"
}

Does that correspond to:

---
hello

or

---
|
  ---
  hello

If we allow arbitrary JSON for YAML data, then we kind of have to accept the latter. That is, we're effectively adding application/yaml (and maybe */yaml and */*+yaml) to the list of specially-treated content types (along with */json and */*+json). Which we can conceivably do, but we must do it explicitly to avoid the ambiguity. I think it's a slippery slope to start adding non-JSON formats, because it's unclear where to stop and where to draw the line of which formats "have a natural JSON representation".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should force them to base64 encode it and use data_base64 ? Then the format doesn't need to define a way to turn it into a string (and back) - we just write-out the bytes as-is (in base64).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can do that for text-based data, or we lose the implicit text encoding. i.e. if the datacontenttype is application/yaml and we force them to binary-encode the content string, then how do we interpret the charset encoding of that data when we deserialize it? It's implicit if it's a string, because the JSON envelope itself has a particular charset, but if we make it opaque binary data, we risk losing that information.

And anyway, it seems heavy-handed to force base64 encoding of ALL non-JSON data. That would certainly break a lot of current usage in the wild.

My argument here remains. The spec states that data is "encoded into a media format which is specified by the datacontenttype attribute" which implies that the encoded type is a string/bytearray. JSON format is making an exception to this rule if the content type is JSON, but the simplest and most natural policy would be to fall back to the general spec in all other cases.

json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated
store it inside the JSON representation. The `datacontenttype` SHOULD reflect
the format of the original binary data.

If the type of data is NOT `Binary`, the implementation MUST next determine
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While in practice I think the MUST is probably accurate, we can't actually test how they make this determination. All we can check is that it's correct on the wire, and I think the follow text covers that. So I think we could make this a "will" instead. Also, let's s/NOT/not/ just so people don't think it's normative in some way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that ("MUST" vs "will") also apply to line 138?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed as indicated (s/NOT/not and s/MUST/will). The question about line 138 remains.

json-format.md Outdated Show resolved Hide resolved
json-format.md Outdated Show resolved Hide resolved
Signed-off-by: Daniel Azuma <dazuma@google.com>
Signed-off-by: Daniel Azuma <dazuma@google.com>
@duglin
Copy link
Collaborator

duglin commented Sep 2, 2021

lgtm and thanks for the hard work on this!

@jskeet thoughts?

@jskeet
Copy link
Contributor

jskeet commented Sep 2, 2021

I'm hoping to carve out some time before the meeting to have another look through. Thanks in advance for all the work that I can see has been going into it :)

Copy link
Contributor

@jskeet jskeet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit and one area where I want to double-check we're on the same page. (I think we are.)

json-format.md Outdated Show resolved Hide resolved
JSON deserializer SHOULD proceed as if it were set to `application/json`, which
declares the data to contain JSON-formatted content. Thus, it SHOULD treat the
`data` member directly as a [JSON value][json-value] as specified above.
Furthermore, if a JSON-formatted event with no `datacontenttype` attribute, is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that for at least the C# implementation, the result of deserializing a JSON-formatted event with no datacontenttype to a CloudEvent object will result in an object with the DataContentType property set to "application/json" exactly as if it had been set explicitly. There's no easy way of saying "Well, it wasn't really set, but it was implicitly application/json, so use that if it's reserialized." My guess is that other SDKs may be in the same position. Note that this will happen even if it's reserialized with a JSON event formatter - the act of "deserialize/reserialize" won't be a no-op here. (There are other cases where it might not be a no-op too, so I don't think this is fatal.)

I'm okay with that - I just want to check that we're all on the same page.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's how the Ruby SDK would behave as well. It's possible an SDK could keep the "original datacontenttype" and "effective datacontenttype" independently in the event object, but I don't think that's strictly necessary. It is important that the event semantics (i.e. the effective datacontenttype) do not change when the event is reserialized, even to a different format. It is not important that deserialize-reserialize to the same format is always a noop (although an SDK is free to try to do so anyway.) Do you think it's worth trying to state this for clarity? It seems like if we wanted to say something like that, it belongs in the main spec rather than in a format-specific document.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be quite useful to be specified somewhere, but I don't mind where. I was mostly checking that I wouldn't end up with a completely different implementation to everyone else.

Signed-off-by: Daniel Azuma <dazuma@google.com>
@JemDay
Copy link
Contributor

JemDay commented Sep 7, 2021

LGTM - Thanks for pushing this through...

@duglin
Copy link
Collaborator

duglin commented Sep 16, 2021

Approved on 9/9 call

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants