Skip to content

Url Encoding

Tim Hall edited this page Apr 3, 2016 · 5 revisions

Url Encoding

General Case

From RFC 3986 (2005)

2.2 Reserved Characters

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

2.3 Unreserved Characters

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

2.4 When to Encode or Decode

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI

The following subsets are specified based on URI syntax component:

The following are two example URIs and their component parts:

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose

scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

authority = [ userinfo "@" ] host [ ":" port ]

path = ... pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

query = *( pchar / "/" / "?" )

fragment = *( pchar / "/" / "?" )

Query String

While RFC 3986 defines what characters are permitted in a query string, data is encoded in the query string with form url-encoding. This encoding algorithm is defined by W3C application/x-www-form-urlencoded encoding algorithm and RFC 1738 (1994).

For each character in the entry's name and value that cannot be expressed using the selected character encoding, replace the character by a string consisting of a U+0026 AMPERSAND character (&), a "#" (U+0023) character, one or more ASCII digits representing the Unicode code point of the character in base ten, and finally a ";" (U+003B) character.

For each byte in the entry's name and value, apply the appropriate subsubsteps from the following list:

If the byte is 0x20 (U+0020 SPACE if interpreted as ASCII) Replace the byte with a single 0x2B byte ("+" (U+002B) character if interpreted as ASCII). If the byte is in the range 0x2A, 0x2D, 0x2E, 0x30 to 0x39, 0x41 to 0x5A, 0x5F, 0x61 to 0x7A Leave the byte as is.

Otherwise Let s be a string consisting of a U+0025 PERCENT SIGN character (%) followed by uppercase ASCII hex digits representing the hexadecimal value of the byte in question (zero-padded if necessary).

Encode the string s as US-ASCII, so that it is now a byte string.

Replace the byte in question in the name or value being processed by the bytes in s, preserving their relative order.

The use of + for space is a fairly controversial issue and there are cases where it has been weakened to + or %20.

Separately, OAuth 1.0 defines the required parameter encoding for preparing the signature and authorization header as defined in RFC 5849 3.6:

*  Characters in the unreserved character set as defined by [RFC3986], Section 2.3 
   (ALPHA, DIGIT, "-", ".", "_", "~") MUST NOT be encoded.
*  All other characters MUST be encoded.
*  The two hexadecimal characters used to represent encoded
   characters MUST be uppercase.

It includes the following note:

This method is different from the encoding scheme used by the "application/x-www-form-urlencoded" content-type (for example, it encodes space characters as "%20" and not using the "+" character).

Cookies

There seems to be variation in cookie implementations, with specs ranging from the original (cookie_spec) to the more recent RFC 6265.

cookie_spec

NAME=VALUE "This string is a sequence of characters excluding semi-colon, comma and white space. If there is a need to place such data in the name or value, some encoding method such as URL style %XX encoding is recommended, though no encoding is defined or required."

RFC 6265

cookie-pair       = cookie-name "=" cookie-value
cookie-name       = token
cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                      ; US-ASCII characters excluding CTLs,
                      ; whitespace DQUOTE, comma, semicolon,
                      ; and backslash

(from RFC 2616)
token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@"
               | "," | ";" | ":" | "\" | <">
               | "/" | "[" | "]" | "?" | "="
               | "{" | "}" | SP | HT

Usage

Current usage of UrlEncode in VBA-Web:

  • OAuth1 - Encode signature and header
  • ConvertToUrlEncoded - Encode key and value (used by WebFormat.UrlEncoded and to encode QuerystringParams)
  • UrlSegments - Segment value is encoded during replacement
  • Note: Currently cookies are not encoded for requests (should be in the future)

UrlDecode:

  • ParseUrlEncoded (used by WebFormat.UrlEncoded and in FacebookAuthenticator, TodoistAuthenticator)
  • Cookies - (PlusAsSpace:=False)

Implementation

The updated UrlEncode and UrlDecode will have the following "modes":

  • strict: unreserved-only (shared with OAuth 1.0) and is the default
  • form-urlencoded: used only for WebFormat.UrlEncoded and uses + for space
  • querystring: subset of strict and form-urlencoded = ALPHA / DIGIT / "-" / "." / "_"
  • cookie
  • path

For query strings, form-urlencoded will be used for WebFormat.UrlEncoded, otherwise querystring will be used. Also, For UrlDecode, querystring converts + to space for backwards compatibility.

strict = ALPHA / DIGIT / "-" / "." / "_" / "~"
formurlencoded = ALPHA / DIGIT / "-" / "." / "_" / "*", (space) => "+"
query = ALPHA / DIGIT / "-" / "." / "_"
cookie = strict / "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "^" / "`" / "|"
path = strict / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / ":" / "@"
Public Enum UrlEncodingMode
  StrictUrlEncoding
  FormUrlEncoding
  QueryUrlEncoding
  CookieUrlEncoding
  PathUrlEncoding
End Enum

Public Function UrlEncode(Text As String, ..., Optional EncodingMode As UrlEncodingMode = StrictEncoding)
  ' ...
End Function

Public Function UrlDecode(Encoded As String, ..., Optional EncodingMode As UrlEncodingMode = StrictEncoding)

End Function

References

Clone this wiki locally