Define "normalized absolute URL" #58

gRegorLove · 2022-09-27T03:32:00Z

This issue is split from #9 and is intended to focus only on the process of normalizing URLs when parsing u-* properties.

return the normalized absolute URL of the gotten value, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).

One of the simplest things, which microformats/tests#112 is waiting on, is whether to normalize an empty URL path component to "/". @jgarber623 detailed some specs and software that include this normalization, so I think this would be pretty agreeable among implementers.

@Zegnat raised the concern of defining what we mean by "path component" since parts of URLs have been renamed over the years. The IndieAuth spec includes a normative reference to WHATWG's URL standard and explains "path component" with a simple example instead of a spec definition:

As such, if a URL with no path component is ever encountered, it MUST be treated as if it had the path /. For example, if a user provides https://example.com for Discovery, the client MUST transform it to https://example.com/ when using it and comparing it.

So perhaps that would be sufficient for the microformats parsing spec, too?

RFC 3986 lists some additional normalizations that could be nice-to-have but I'm not sure if they are strictly necessary for parsers:

make scheme and hostname lowercase
normalize percent-encoded characters to uppercase ("%3a" versus "%3A")
decode percent-encoded unreserved characters
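
A rough Python sketch of those three nice-to-have normalizations, for illustration only. The names `syntax_normalize` and `normalize_percent_encoding` are made up for this example and are not taken from any mf2 parser. It relies on `urllib.parse`, which makes its own choices on edge cases, so treat it as a starting point rather than spec behavior.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _normalize_escape(match):
    """Decode an escape for an unreserved character, else uppercase its hex digits."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def normalize_percent_encoding(text):
    return re.sub(r"%([0-9A-Fa-f]{2})", _normalize_escape, text)


def syntax_normalize(url):
    parts = urlsplit(url)  # urlsplit already lowercases the scheme
    # Lowercase only the host (and port); userinfo before "@" is left untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit((
        parts.scheme,
        netloc,
        normalize_percent_encoding(parts.path),
        normalize_percent_encoding(parts.query),
        normalize_percent_encoding(parts.fragment),
    ))


print(syntax_normalize("HTTP://Example.COM/%7euser/a%3ab"))
```

Here `%7e` decodes to `~` because it is unreserved, while `%3a` (a colon) stays encoded and only has its hex digits uppercased.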

RFC 3986 also describes remove_dot_segments to normalize "." and ".." path segments. From a quick check, it appears that at least the php-mf2, mf2py, and Ruby parsers all do this, which makes sense since it's necessary to correctly handle <base href>.
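
For reference, a minimal Python sketch of remove_dot_segments, written by following the steps in RFC 3986 section 5.2.4. It is not copied from any of the parsers mentioned, and the step letters in the comments refer to that section.

```python
def remove_dot_segments(path):
    """Sketch of RFC 3986 section 5.2.4; `path` is the input buffer."""
    output = []
    while path:
        # Step A: drop a leading "../" or "./".
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # Step B: replace a leading "/./" (or a lone "/.") with "/".
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # Step C: replace a leading "/../" (or a lone "/..") with "/"
        # and remove the last segment already moved to the output.
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # Step D: a buffer consisting only of "." or ".." is discarded.
        elif path in (".", ".."):
            path = ""
        # Step E: move the first segment (with its leading "/", if any)
        # to the output.
        else:
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


# The two examples given in RFC 3986 section 5.2.4.
print(remove_dot_segments("/a/b/c/./../../g"))    # /a/g
print(remove_dot_segments("mid/content=5/../6"))  # mid/6
```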

Questions:

Is the correct term for this process "normalization" or "canonicalization"?
What are the simplest steps for this process such that it results in "a) easy for implementers to understand and b) leads to a useful output for consumers" (to quote @Zegnat :))

Zegnat · 2022-09-29T13:49:28Z

Is the correct term for this process "normalization" or "canonicalization"?

I strongly feel like it is normali[sz]ation. Just like how RFC 3986 refers to it. Canonicali[sz]ation to me refers to what rel-canonical is used for, matching the definition from Wikipedia:

A canonical URL is a URL for defining the single source of truth for duplicate content.

There is no way for a parser like the mf2 parser to figure out that value, since it only has the string to work on. (I would be very much opposed to requiring mf2 parsers to fetch resources, look for rel-canonicals, etc.)

jgarber623 · 2022-10-02T18:49:27Z

Is the correct term for this process "normalization" or "canonicalization"?

"Normali[sz]ation" for the reasons @Zegnat noted above.

What are the simplest steps for this process such that it results in "a) easy for implementers to understand and b) leads to a useful output for consumers"

Maybe something like:

A URL's "path" is defined here as zero or more characters immediately following the host (and optional port) continuing until the end of the URL or the first question mark ? or hash #, whichever comes first. If the gotten value is zero characters in length, the normalized path is /.
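
To make that wording concrete, here is a small Python sketch. The function name `normalize_empty_path` and the regular expression are assumptions for illustration, not proposed spec text. It only touches strings of the form `scheme://host…` and leaves everything else alone; whether the rule should apply to every scheme is not decided by the wording above.

```python
import re

# prefix: scheme, "://", then host and optional port (anything up to "/", "?" or "#")
# path:   zero or more characters up to the first "?" or "#"
# rest:   the query and/or fragment, if any
_URL = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*://[^/?#]*)([^?#]*)(.*)$", re.DOTALL)


def normalize_empty_path(url):
    match = _URL.match(url)
    if match is None:
        return url  # no "scheme://host" prefix, e.g. mailto: values
    prefix, path, rest = match.groups()
    if path == "":
        path = "/"
    return prefix + path + rest


print(normalize_empty_path("https://example.com"))      # https://example.com/
print(normalize_empty_path("https://example.com?q=1"))  # https://example.com/?q=1
```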

Zegnat · 2022-10-05T12:17:07Z

Are paths always / if not empty? Even for non-HTTP URLs? IndieAuth is able to short-cut this somewhat as all URLs (except redirect URLs in special cases) are Special URLs, that is, HTTP(S) URLs.

gRegorLove · 2022-10-05T18:27:22Z

👍 on using "normalization."

And good catch -- we should differentiate schemes as part of the steps.

Loose ideas (not in spec language yet):

If the u-* value has a scheme and it is not http(s) (e.g. mailto:) no normalization
If the u-* value does not have a scheme, default to scheme of document being parsed
For http(s) schemes, follow the normalization instructions, including in part:
- Make relative URLs absolute, including removing dot segments
- ... anything else?
- Normalize empty path to "/"
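
The loose ideas above could be sketched in Python roughly like this. `normalize_u_value` is a hypothetical name, and leaning on `urllib.parse.urljoin` for resolution and dot-segment removal is an assumption of this sketch, not something the list prescribes. It also inherits whatever `urllib.parse` does with edge cases such as an empty query.

```python
from urllib.parse import urljoin, urlsplit


def normalize_u_value(value, base_url):
    """Sketch only: `base_url` is the document URL (or its <base href>)."""
    scheme = urlsplit(value).scheme
    # Has a scheme and it is not http(s), e.g. mailto: -> no normalization.
    if scheme and scheme not in ("http", "https"):
        return value
    # No scheme -> resolved against the document, which supplies the scheme.
    # urljoin also removes "." and ".." segments while resolving.
    absolute = urljoin(base_url, value)
    parts = urlsplit(absolute)
    # For http(s), normalize an empty path to "/".
    if parts.scheme in ("http", "https") and parts.path == "":
        parts = parts._replace(path="/")
    return parts.geturl()


print(normalize_u_value("https://example.com", "https://base.example/"))
print(normalize_u_value("../c", "https://example.com/a/b/"))
print(normalize_u_value("mailto:user@example.com", "https://example.com/"))
```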

gRegorLove · 2022-10-06T00:06:18Z

While we're updating this section of text, I think we should include text to cover #48 (comment) and microformats/php-mf2#186.

snarfed · 2023-01-25T21:06:35Z

Is this the root cause of microformats/mf2py#177 (comment)? ie, is it undefined whether normalizing https://tantek.com/? should drop the trailing ? and result in https://tantek.com/ ?

gRegorLove · 2023-01-26T00:24:44Z

@snarfed I think that's a good question to clarify for this issue, but with php-mf2 I think it's more a side effect than an explicit choice.

The Component Recomposition section of RFC 3986 seems to indicate the "?" should be preserved, going by this pseudocode and note:

      if defined(query) then
         append "?" to result;
         append query to result;
      endif;

Note that we are careful to preserve the distinction between a
component that is undefined, meaning that its separator was not
present in the reference, and a component that is empty, meaning that
the separator was present and was immediately followed by the next
component separator or the end of the reference.

https://tantek.com/? seems like it's the correct normalization in that case.
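
A small Python sketch of that distinction, using the regular expression from RFC 3986 Appendix B to split the reference. The helper names `parse` and `recompose` are made up for this example. An undefined component comes back as `None`, and an empty one as `""`, so the recomposition can tell them apart.

```python
import re

# Regular expression from RFC 3986 Appendix B.
_URI = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)


def parse(reference):
    # Every part of the pattern is optional, so it matches any string.
    match = _URI.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),      # None if no "?" was present
        "fragment": match.group(9),   # None if no "#" was present
    }


def recompose(components):
    """Component recomposition, following RFC 3986 section 5.3."""
    result = ""
    if components["scheme"] is not None:
        result += components["scheme"] + ":"
    if components["authority"] is not None:
        result += "//" + components["authority"]
    result += components["path"]
    if components["query"] is not None:
        result += "?" + components["query"]
    if components["fragment"] is not None:
        result += "#" + components["fragment"]
    return result


print(parse("https://tantek.com/?")["query"] == "")    # True: defined but empty
print(parse("https://tantek.com/")["query"] is None)   # True: undefined
print(recompose(parse("https://tantek.com/?")))        # https://tantek.com/?
```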

gRegorLove mentioned this issue Sep 27, 2022

"return the normalized absolute URL" for invalid URLs? #9

Open
