Define "normalized absolute URL" #58

Open
gRegorLove opened this issue Sep 27, 2022 · 7 comments

Comments

@gRegorLove
Member

This issue is split from #9 and is intended to focus only on the process of normalizing URLs when parsing u-* properties.

Current language:

return the normalized absolute URL of the gotten value, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).

One of the simplest questions, which microformats/tests#112 is waiting on, is whether to normalize an empty URL path component to "/". @jgarber623 detailed some specs and software that include this normalization, so I think this would be pretty agreeable among implementers.

@Zegnat raised the concern of defining what we mean by "path component" since parts of URLs have been renamed over the years. The IndieAuth spec includes a normative reference to WHATWG's URL standard and explains "path component" with a simple example instead of a spec definition:

As such, if a URL with no path component is ever encountered, it MUST be treated as if it had the path /. For example, if a user provides https://example.com for Discovery, the client MUST transform it to https://example.com/ when using it and comparing it.

So perhaps that would be sufficient for the microformats parsing spec, too?

RFC 3986 lists some additional normalizations that could be nice-to-have but I'm not sure if they are strictly necessary for parsers:

  • make scheme and hostname lowercase
  • normalize percent-encoded characters to uppercase ("%3a" versus "%3A")
  • decode percent-encoded unreserved characters
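
To make the three items above concrete, here is a minimal Python sketch. It is only an illustration, not what any of the parsers do: `normalize_syntax` and `_normalize_pct` are hypothetical names, and the sketch makes no attempt to preserve an empty query or fragment delimiter.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# RFC 3986 section 2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PCT_RE = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_pct(match):
    """Decode an unreserved octet, otherwise uppercase the hex digits."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def normalize_syntax(url):
    """Hypothetical helper: case and percent-encoding normalization only."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Lowercase only the host (and port); userinfo is case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = PCT_RE.sub(_normalize_pct, userinfo + at + hostport.lower())
    path = PCT_RE.sub(_normalize_pct, parts.path)
    query = PCT_RE.sub(_normalize_pct, parts.query)
    fragment = PCT_RE.sub(_normalize_pct, parts.fragment)
    return urlunsplit((scheme, netloc, path, query, fragment))


print(normalize_syntax("HTTP://Example.COM/%7Euser/a%3ab"))
# http://example.com/~user/a%3Ab
```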

RFC 3986 also describes remove_dot_segments to normalize "." and ".." path segments. From a quick check, it appears that at least the php-mf2, mf2py, and Ruby parsers all do this, which makes sense since it's necessary to correctly handle <base href>.
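
For reference, a direct Python transcription of the remove_dot_segments steps (A through E in RFC 3986 section 5.2.4) could look like the sketch below. It is written from the RFC text, not copied from any of the parsers mentioned above.

```python
def remove_dot_segments(path):
    """Sketch of RFC 3986 section 5.2.4; output segments keep their leading "/"."""
    output = []
    while path:
        # A: drop a leading "../" or "./"
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # B: "/./" or a trailing "/." collapses to "/"
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # C: "/../" or a trailing "/.." also removes the last output segment
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # D: a bare "." or ".." is dropped
        elif path in (".", ".."):
            path = ""
        # E: move the first segment (with its leading "/", if any) to the output
        else:
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))    # /a/g
print(remove_dot_segments("mid/content=5/../6"))  # mid/6
```

The two inputs are the examples given in the RFC itself.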

Questions:

  1. Is the correct term for this process "normalization" or "canonicalization"?
  2. What are the simplest steps for this process such that it results in "a) easy for implementers to understand and b) leads to a useful output for consumers" (to quote @Zegnat :))
@Zegnat
Member

Zegnat commented Sep 29, 2022

Is the correct term for this process "normalization" or "canonicalization"?

I strongly feel like it is normali[sz]ation. Just like how RFC 3986 refers to it. Canonicali[sz]ation to me refers to what rel-canonical is used for, matching the definition from Wikipedia:

A canonical URL is a URL for defining the single source of truth for duplicate content.

There is no way for a parser like the mf2 parser to figure out that value, since it only has the string to work on. (I would be very much opposed to requiring mf2 parsers to fetch resources, look for rel-canonicals, etc.)

@jgarber623
Member

Is the correct term for this process "normalization" or "canonicalization"?

"Normali[sz]ation" for the reasons @Zegnat noted above.

What are the simplest steps for this process such that it results in "a) easy for implementers to understand and b) leads to a useful output for consumers"

Maybe something like:

A URL's "path" is defined here as zero or more characters immediately following the host (and optional port) continuing until the end of the URL or the first question mark ? or hash #, whichever comes first. If the gotten value is zero characters in length, the normalized path is /.
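
Read literally, that definition could be sketched in Python as follows. The regular expression and the name `normalize_empty_path` are assumptions made for illustration; the sketch only handles URLs of the form scheme://authority and deliberately does not look at which scheme is used.

```python
import re

# prefix: scheme "://" host (and optional port)
# path:   zero or more characters up to the first "?" or "#"
# rest:   the query and/or fragment, if any
_URL_RE = re.compile(
    r"^(?P<prefix>[A-Za-z][A-Za-z0-9+.-]*://[^/?#]*)"
    r"(?P<path>[^?#]*)"
    r"(?P<rest>.*)$",
    re.DOTALL,
)


def normalize_empty_path(url):
    """Hypothetical helper: replace a zero-length path with "/"."""
    match = _URL_RE.match(url)
    if match is None:
        # No "scheme://authority" prefix (e.g. mailto:); leave untouched.
        return url
    path = match.group("path") or "/"
    return match.group("prefix") + path + match.group("rest")


print(normalize_empty_path("https://example.com"))      # https://example.com/
print(normalize_empty_path("https://example.com?q=1"))  # https://example.com/?q=1
```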

@Zegnat
Member

Zegnat commented Oct 5, 2022

Is the normalized path always / when the path is empty? Even for non-HTTP URLs? IndieAuth is able to short-cut this somewhat as all URLs (except redirect URLs in special cases) are Special URLs, that is, HTTP(S) URLs.

@gRegorLove
Member Author

👍 on using "normalization."

And good catch -- we should differentiate schemes as part of the steps.

Loose ideas (not in spec language yet):

  • If the u-* value has a scheme and it is not http(s) (e.g. mailto:), do no normalization
  • If the u-* value does not have a scheme, default to scheme of document being parsed
  • For http(s) schemes, follow the normalization instructions, including in part:
    • Make relative URLs absolute, including removing dot segments
    • ... anything else?
    • Normalize empty path to "/"
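
A rough Python sketch of the loose ideas above, assuming the standard library's `urljoin` is acceptable for resolving relative URLs and removing dot segments. `normalize_u_value` and `base_url` are hypothetical names, and the sketch does not try to preserve an empty query or fragment delimiter.

```python
from urllib.parse import urljoin, urlsplit

HTTP_SCHEMES = ("http", "https")


def normalize_u_value(value, base_url):
    """Hypothetical helper; base_url is the document URL after applying <base>."""
    scheme = urlsplit(value).scheme
    # Values with a non-http(s) scheme (e.g. mailto:) are returned untouched.
    if scheme and scheme not in HTTP_SCHEMES:
        return value
    # Resolve against the base: fills in a missing scheme, makes relative
    # URLs absolute, and removes dot segments.
    absolute = urljoin(base_url, value)
    parts = urlsplit(absolute)
    # Normalize an empty path to "/" for http(s) URLs only.
    if parts.scheme in HTTP_SCHEMES and parts.netloc and parts.path == "":
        return parts._replace(path="/").geturl()
    return absolute


base = "https://example.com/blog/post"
print(normalize_u_value("../about", base))             # https://example.com/about
print(normalize_u_value("https://example.org", base))  # https://example.org/
print(normalize_u_value("mailto:someone@example.com", base))
```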
@gRegorLove
Member Author

While we're updating this section of text, I think we should include text to cover #48 (comment) and microformats/php-mf2#186.

@snarfed
Member

snarfed commented Jan 25, 2023

Is this the root cause of microformats/mf2py#177 (comment)? That is, is it undefined whether normalizing https://tantek.com/? should drop the trailing ? and result in https://tantek.com/ ?

@gRegorLove
Member Author

@snarfed I think that's a good question to clarify for this issue, but with php-mf2 I think it's more a side effect than an explicit choice.

RFC 3986 section 5.3 (Component Recomposition) seems to indicate the "?" should be preserved, based on its pseudocode and note:

      if defined(query) then
         append "?" to result;
         append query to result;
      endif;

Note that we are careful to preserve the distinction between a
component that is undefined, meaning that its separator was not
present in the reference, and a component that is empty, meaning that
the separator was present and was immediately followed by the next
component separator or the end of the reference.

https://tantek.com/? seems like it's the correct normalization in that case.
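
To illustrate the undefined versus empty distinction, here is a small Python sketch that splits a reference with the regular expression from RFC 3986 Appendix B and recomposes it following the section 5.3 pseudocode. `split_reference` and `recompose` are hypothetical names, not functions from any of the parsers.

```python
import re

# Regular expression from RFC 3986 Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_reference(ref):
    """Split into components; None means undefined, "" means present but empty."""
    match = URI_RE.match(ref)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


def recompose(parts):
    """Component recomposition as described in RFC 3986 section 5.3."""
    result = ""
    if parts["scheme"] is not None:
        result += parts["scheme"] + ":"
    if parts["authority"] is not None:
        result += "//" + parts["authority"]
    result += parts["path"]
    if parts["query"] is not None:
        result += "?" + parts["query"]
    if parts["fragment"] is not None:
        result += "#" + parts["fragment"]
    return result


print(recompose(split_reference("https://tantek.com/?")))  # https://tantek.com/?
print(recompose(split_reference("https://tantek.com/")))   # https://tantek.com/
```

With this approach the empty query in https://tantek.com/? survives the round trip because the query is an empty string, not None.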
