Improve JF2 references #20

Open
aciccarello opened this issue Apr 28, 2023 · 8 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@aciccarello
Contributor

aciccarello commented Apr 28, 2023

Is your feature request related to a problem?

I've noticed that the data returned by references isn't as normalized as I'd like: it carries lots of extra properties and is often missing author properties. When I compare the references to the output of https://xray.p3k.app/ for the same URLs, XRay returns a single flat entry while Indiekit's output is less organized.

Example 1: https://jamesg.blog/2023/04/18/source-code-folder-names/

{
  "url": "https://jamesg.blog/2023/04/18/source-code-folder-names/",
  "children": [
    {
      "type": "card",
      "name": "James' Coffee Blog ☕",
      "url": "https://jamesg.blog"
    },
    {
      "type": "entry",
      "name": "My source code root folder name",
      "published": "2023-04-18T00:00:00",
      "category": "Coding",
      "content": {
        "html": "<p>I like seeing what people call the root folder in which they store their source code. This is the folder where all — or a lot of — your projects are stored. In my case, my programming projects go in a folder called <code>src</code>. (Although I have a strange habit of nesting personal projects that are related to each other. I believe my source code files are in need of a spring clean.)</p>\n<p>That long parenthetical notwithstanding, I find the name <code>src</code> cool. It’s a short way of saying source code; apt, simple, easy to type. Furthermore, <code>src</code> is different to the names of the other folders in my root directory, which makes autocomplete a breeze when I’m tying in my terminal to navigate to a source code folder.</p>\n<p>A common example I have seen is <code>Code</code>, or variants thereof. I’m curious: if you code, what do you call the root folder in which you store your source code?</p>",
        "text": "I like seeing what people call the root folder in which they store their source code. This is the folder where all — or a lot of — your projects are stored. In my case, my programming projects go in a folder called src. (Although I have a strange habit of nesting personal projects that are related to each other. I believe my source code files are in need of a spring clean.)\nThat long parenthetical notwithstanding, I find the name src cool. It’s a short way of saying source code; apt, simple, easy to type. Furthermore, src is different to the names of the other folders in my root directory, which makes autocomplete a breeze when I’m tying in my terminal to navigate to a source code folder.\nA common example I have seen is Code, or variants thereof. I’m curious: if you code, what do you call the root folder in which you store your source code?"
      }
    }
  ]
}

Example 2: https://aaronparecki.com/2023/04/24/8/lawyer

{
  "url": "https://aaronparecki.com/2023/04/24/8/lawyer",
  "children": [
    {
      "type": "item"
    },
    {
      "type": "item"
    },
    {
      "type": "item"
    },
    {
      "type": "item"
    },
    {
      "type": "item"
    },
    {
      "type": "entry",
      "author": {
        "type": "card",
        "url": "https://aaronparecki.com/",
        "photo": [
          {
            "alt": "Aaron Parecki",
            "url": "https://aaronparecki.com/images/profile.jpg"
          }
        ],
        "name": "Aaron Parecki"
      },
      "content": {
        "html": "In retrospect, I probably didn't need to include \"but I am not a lawyer\" in an email to our lawyers",
        "text": "In retrospect, I probably didn't need to include \"but I am not a lawyer\" in an email to our lawyers"
      },
      "location": {
        "type": "adr",
        "locality": "Portland",
        "region": "Oregon",
        "country": "USA"
      },
      "url": "https://aaronparecki.com/2023/04/24/8/lawyer",
      "published": "2023-04-24T14:12:20-07:00",
      "syndication": [
        "at://did:plc:s2koow7r6t7tozgd4slc3dsg/app.bsky.feed.post/3ju5hvccis32q",
        "https://micro.blog/aaronpk/18625298"
      ],
      "pk-num-likes": "15",
      "pk-num-reposts": "1",
      "pk-num-replies": "3",
      "like": {
        "children": [
          {
            "type": "cite",
            "url": [
              "https://emacs.ch/users/skybert#likes/56653",
              "https://emacs.ch/users/skybert"
            ],
            "author": {
              "type": "card",
              "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
              "name": ""
            },
            "name": "15 of these cite elements"
          }
        ]
      },
      "repost": {
        "type": "cite",
        "url": [
          "https://tdd.social/users/CodingItWrong/statuses/110256727388551917/activity",
          "https://tdd.social/users/CodingItWrong"
        ],
        "author": {
          "type": "card",
          "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
          "name": ""
        },
        "name": "Josh Justice"
      },
      "comment": {
        "children": [
          {
            "type": "cite",
            "author": {
              "type": "card",
              "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
              "name": "dominikhoecht",
              "url": "https://micro.blog/dominikhoecht"
            },
            "content": {
              "html": "<p><a href=\"https://micro.blog/aaronpk\" rel=\"nofollow\">@aaronpk</a> 😂</p>",
              "text": "@aaronpk 😂"
            },
            "url": "https://micro.blog/dominikhoecht/18694091",
            "published": "2023-04-27T15:41:21+00:00"
          },
          {
            "type": "cite",
            "author": {
              "type": "card",
              "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
              "name": "carpetbomberz",
              "url": "https://mastodon.online/users/carpetbomberz"
            },
            "content": {
              "html": "<p><span class=\"h-card\"><a href=\"https://aaronparecki.com/aaronpk\" class=\"u-url\">@<span>aaronpk</span></a></span> In your defense you are most authoritative on the many subjects upon which you expound. I'm thinking back to the episode on ContentID fer' instance. 😄</p>",
              "text": "@aaronpk In your defense you are most authoritative on the many subjects upon which you expound. I'm thinking back to the episode on ContentID fer' instance. 😄"
            },
            "url": "https://mastodon.online/@carpetbomberz/110256111311133962",
            "published": "2023-04-24T15:19:05-07:00",
            "children": [
              {
                "type": "card",
                "url": "https://aaronparecki.com/aaronpk",
                "name": "@aaronpk"
              }
            ]
          },
          {
            "type": "cite",
            "author": {
              "type": "card",
              "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
              "name": "lmika",
              "url": "https://micro.blog/lmika"
            },
            "content": {
              "html": "<p><a href=\"https://micro.blog/aaronpk\" rel=\"nofollow\">@aaronpk</a> Just hope that they don’t reply with “I’m not a lawyer either”. 😀</p>",
              "text": "@aaronpk Just hope that they don’t reply with “I’m not a lawyer either”. 😀"
            },
            "url": "https://micro.blog/lmika/18625632",
            "published": "2023-04-24T21:45:09+00:00"
          }
        ]
      }
    },
    {
      "type": "card",
      "url": "https://aaronparecki.com/",
      "uid": "https://aaronparecki.com/",
      "photo": "https://aaronparecki.com/images/profile.jpg",
      "note": "Hi, I'm Aaron Parecki, Senior Security Architect at Okta, and co-founder of\nIndieWebCamp.\nI maintain oauth.net, write and consult about OAuth, and\nparticipate in the OAuth Working Group at the IETF. I also help people learn about video production and livestreaming and dabble in product design.\n\nI've been tracking my location since 2008 and I wrote 100 songs in 100 days.\nI've spoken at conferences around the world about\nowning your data,\nOAuth,\nquantified self,\nand explained why R is a vowel. Read more.",
      "name": "Aaron Parecki",
      "bday": "--12-28",
      "street-address": "PO Box 12433",
      "locality": "Portland",
      "region": "Oregon",
      "country-name": "USA",
      "postal-code": "97212",
      "org": {
        "children": [
          {
            "type": "card",
            "photo": "https://aaronparecki.com/images/okta.png",
            "role": "Security Architect",
            "url": "https://developer.okta.com/",
            "name": "Okta"
          },
          {
            "type": "card",
            "photo": "https://aaronparecki.com/images/indiewebcamp.png",
            "url": "https://indieweb.org/",
            "name": "IndieWebCamp",
            "role": "Founder"
          }
        ]
      }
    }
  ]
}

Describe the solution you’d like

I'd like the references to use a much simpler model, with the entry at the top level and the author data included.
I'm guessing the solution probably rests in the mf2tojf2 package.

X-Ray Output 1: https://jamesg.blog/2023/04/18/source-code-folder-names/

{
    "data": {
        "type": "entry",
        "published": "2023-04-18T00:00:00",
        "category": [
            "Coding"
        ],
        "name": "My source code root folder name",
        "content": {
            "text": "I like seeing what people call the root folder in which they store their source code. This is the folder where all \u2014 or a lot of \u2014 your projects are stored. In my case, my programming projects go in a folder called src. (Although I have a strange habit of nesting personal projects that are related to each other. I believe my source code files are in need of a spring clean.)\nThat long parenthetical notwithstanding, I find the name src cool. It\u2019s a short way of saying source code; apt, simple, easy to type. Furthermore, src is different to the names of the other folders in my root directory, which makes autocomplete a breeze when I\u2019m tying in my terminal to navigate to a source code folder.\nA common example I have seen is Code, or variants thereof. I\u2019m curious: if you code, what do you call the root folder in which you store your source code?",
            "html": "<p>I like seeing what people call the root folder in which they store their source code. This is the folder where all \u2014 or a lot of \u2014 your projects are stored. In my case, my programming projects go in a folder called <code>src</code>. (Although I have a strange habit of nesting personal projects that are related to each other. I believe my source code files are in need of a spring clean.)</p>\n<p>That long parenthetical notwithstanding, I find the name <code>src</code> cool. It\u2019s a short way of saying source code; apt, simple, easy to type. Furthermore, <code>src</code> is different to the names of the other folders in my root directory, which makes autocomplete a breeze when I\u2019m tying in my terminal to navigate to a source code folder.</p>\n<p>A common example I have seen is <code>Code</code>, or variants thereof. I\u2019m curious: if you code, what do you call the root folder in which you store your source code?</p>"
        },
        "author": {
            "type": "card",
            "name": "James' Coffee Blog \u2615",
            "url": "https://jamesg.blog",
            "photo": null
        },
        "post-type": "article"
    },
    "url": "https://jamesg.blog/2023/04/18/source-code-folder-names/",
    "code": 200,
    "source-format": "mf2+html"
}

X-Ray Output 2: https://aaronparecki.com/2023/04/24/8/lawyer

{
    "data": {
        "type": "entry",
        "published": "2023-04-24T14:12:20-07:00",
        "url": "https://aaronparecki.com/2023/04/24/8/lawyer",
        "syndication": [
            "https://micro.blog/aaronpk/18625298"
        ],
        "content": {
            "text": "In retrospect, I probably didn't need to include \"but I am not a lawyer\" in an email to our lawyers"
        },
        "author": {
            "type": "card",
            "name": "Aaron Parecki",
            "url": "https://aaronparecki.com/",
            "photo": "https://aaronparecki.com/images/profile.jpg"
        },
        "post-type": "note"
    },
    "url": "https://aaronparecki.com/2023/04/24/8/lawyer",
    "code": 200,
    "source-format": "mf2+json"
}

Describe alternatives you’ve considered

I'm currently trying to normalize the input in my post template function but I think it would be helpful to the community to have shared logic.

Additional context

No response

@aciccarello aciccarello added the enhancement New feature or request label Apr 28, 2023
@paulrobertlloyd paulrobertlloyd transferred this issue from getindiekit/indiekit Apr 28, 2023
@paulrobertlloyd
Collaborator

Just had a look at the code underlying Xray and… wow. 1008 lines of code to parse and massage an incoming feed to generate the results you are seeing here. 🤯

This project started as a straightforward implementation of mf2tojf2.py, and for incoming well-structured MF2 objects, it works. But when given a page of unknown microformatted markup, it’s going to struggle to produce well-formed data.

I’d like to say this is something that I can look to improve, but not at the cost of working on Indiekit – even more so given including references is an option that’s disabled by default.

Perhaps there’s a way of breaking this apart and looking to make smaller, incremental improvements (the list of empty children with only { type: "item" } seems like something that shouldn’t happen, for example).
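As one possible incremental step, the empty children could be filtered out after conversion. The following is only a minimal sketch: `isEmptyItem` and `dropEmptyChildren` are hypothetical names, not part of mf2tojf2, and the input is a trimmed copy of Example 2 with a placeholder entry.

```javascript
// Hypothetical helper, not part of mf2tojf2's API.
// True when a JF2 object carries nothing except a bare "item" type.
function isEmptyItem(node) {
  return node.type === "item" && Object.keys(node).length === 1;
}

// Return a copy of a JF2 reference without the empty placeholder children.
function dropEmptyChildren(reference) {
  if (!Array.isArray(reference.children)) {
    return reference;
  }
  return {
    ...reference,
    children: reference.children.filter((child) => !isEmptyItem(child)),
  };
}

// Trimmed input modelled on Example 2; the entry content is a placeholder.
const cleaned = dropEmptyChildren({
  url: "https://aaronparecki.com/2023/04/24/8/lawyer",
  children: [{ type: "item" }, { type: "item" }, { type: "entry", name: "Placeholder" }],
});
console.log(cleaned.children.length); // 1
```

This only hides the symptom; it does not explain why the converter emits empty items in the first place.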

Open to suggestions… maybe this is something to put to the IndieWeb community to see if anyone would like to contribute parsing improvements?

@paulrobertlloyd paulrobertlloyd added the help wanted Extra attention is needed label May 20, 2023
@aciccarello
Contributor Author

I agree that this is much more of a nice-to-have than some of the key Indiekit work. I imagine that aiming to normalize all messy content would be an impossible task. I'd need to look more at Xray and see if there is any set of agreed-upon parsing specs to come up with a list of target improvements.

I'll probably keep iterating on the massaging logic I'm using with my own Indiekit instance. I'd also love to include some Metaformats logic to parse meta tags on sites without mf2. So far the main things I've added are finding the main h-entry and the author, but I'm sure I'll discover more as I reply to more sites with Indiekit.
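For illustration, a "find the main entry and author" step over the JF2 shape shown in the examples above might look like the sketch below. `flattenReference` is a hypothetical name, and the fallback to the first sibling card is a simple heuristic, not the IndieWeb authorship algorithm.

```javascript
// Hypothetical helper: flatten a reference to its main entry.
// Assumes the JF2 shape shown in this issue: { url, children: [...] }.
function flattenReference(reference) {
  const children = reference.children || [];
  const entries = children.filter((child) => child.type === "entry");
  // Prefer the entry whose URL matches the page; otherwise take the first one.
  const entry = entries.find((child) => child.url === reference.url) || entries[0];
  if (!entry) {
    return reference;
  }
  // Heuristic: fall back to the first sibling card when the entry has no author.
  const card = children.find((child) => child.type === "card");
  const author = entry.author || card;
  return {
    ...entry,
    url: entry.url || reference.url,
    ...(author ? { author } : {}),
  };
}

// Trimmed copy of Example 1.
const flattened = flattenReference({
  url: "https://jamesg.blog/2023/04/18/source-code-folder-names/",
  children: [
    { type: "card", name: "James' Coffee Blog ☕", url: "https://jamesg.blog" },
    { type: "entry", name: "My source code root folder name", published: "2023-04-18T00:00:00" },
  ],
});
console.log(flattened.author.url); // https://jamesg.blog
```

The result has the entry at the top level with `author` attached, which is closer to the X-Ray output, though it makes no attempt to trim extra properties.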

@aciccarello
Contributor Author

I split out a couple more specific tasks. However, if you think these should be handled externally, I can look at adding this kind of logic to a different library.

@paulrobertlloyd
Collaborator

paulrobertlloyd commented May 21, 2023

Looking at the authorship spec you linked to in #22, I spotted mf-obj, a Node.js package that seems to cover some of the requirements here.

It hasn’t been updated for 7 years, and is unfortunately written in TypeScript, but maybe that could be used, or adapted for use here?

@aciccarello
Contributor Author

Before I push for or try to contribute features to this library, I want to check that mf2tojf2 would be the best place for some of this functionality. Ideally we'd avoid different libraries downloading and parsing pages multiple times, but some reworking of MF2 objects by different libraries could be composed. Like you mentioned earlier, I would like to see a little more collaboration and coordination of efforts around Node libraries, but I also wouldn't want to tie things to another package that no one has capacity to maintain.

Relevant Libraries

| Library | Last Release | Focus | Input | Output | Notes |
| --- | --- | --- | --- | --- | --- |
| @paulrobertlloyd/mf2tojf2 | 2022-11 | Convert to jf2 | MF2 | JF2 | This library. Also loads reference URLs |
| microformat-node | 2016-10 | Parsing | URL | MF2 | Recommended but no recent activity |
| mf-obj | 2016-06 | Utils | URL | MF2 | Uses microformat-node. Implements authorship algorithm |
| microformats-parser | 2022-01 | Parsing | HTML | MF2 | Used by mf2tojf2 |
| mf2utiljs | 2022-02 | Utils | URL or MF2 | MF2 | Uses microformats-parser. Port of mf2util. Implements authorship algorithm |
| representative-h-card | 2021-06 | Util | MF2 | MF2 | REPO ARCHIVED |

Microformat parsing features

| Feature | Input | Output | Notes |
| --- | --- | --- | --- |
| References | MF2 to get URL | MF2 | Already implemented here. Requires parsing but fetches different URL |
| Authorship | MF2 | MF2 | Implemented in mf2utiljs |
| Main Entry | MF2 | MF2 | Implemented in mf2utiljs? |
| Metaformats | HTML | MF2 | This probably should be included in lib that does initial parsing |

Opportunities to reuse logic

Let me know what you think but I'd love to see this type of functionality we're discussing pushed to other libraries and used more flexibly by the node community.

mf2utiljs for cleaning up microformats

Turns out there is more on npm than I initially thought. I hadn't seen mf2utiljs before. Since it ports the well-used Python package, I think it has a lot of potential for being a really useful package. I would probably want to check with the maintainer to see if they are up for more community involvement. But assuming it is a reliable library, I could see a microformats-parser > mf2utiljs > mf2tojf2 combo working well.
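To make the composition idea concrete, here is a rough sketch in which the three stages are passed in as plain functions, so the HTML is parsed once and each later stage only reworks the resulting object. `createPipeline` and the stand-in stages are hypothetical; none of them are the real APIs of microformats-parser, mf2utiljs, or mf2tojf2.

```javascript
// Compose parse -> clean up -> convert so the HTML is parsed only once.
// All three stages are injected; none of these names are real library APIs.
function createPipeline({ parse, cleanUp, convert }) {
  return (html, baseUrl) => convert(cleanUp(parse(html, baseUrl)));
}

// Stand-in stages so the sketch runs on its own.
const toReference = createPipeline({
  // Pretend parser: always returns one h-entry pointing at the base URL.
  parse: (html, baseUrl) => ({
    items: [{ type: ["h-entry"], properties: { url: [baseUrl] } }],
  }),
  // Pretend clean-up: keep only h-entry items.
  cleanUp: (mf2) => ({
    ...mf2,
    items: mf2.items.filter((item) => item.type.includes("h-entry")),
  }),
  // Pretend converter: emit a minimal JF2 entry.
  convert: (mf2) => ({ type: "entry", url: mf2.items[0].properties.url[0] }),
});

console.log(toReference("<html></html>", "https://example.com/post"));
```

In a real setup each stand-in would be replaced by a thin wrapper around the corresponding library, which keeps the libraries independent of one another.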

Leave metaformats to initial parser

Something like Metaformats might be better as a feature of microformats-parser, since it requires fetching and parsing raw HTML to get the meta tags. Implementing it in another library would duplicate that fetching work. I don't think it should be enabled by default, but microformats-parser already has a set of experimentalOptions flags.
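As a rough illustration of what such a fallback could do once the parser has the meta tags in hand, the sketch below builds an MF2 h-entry from Open Graph tags only when the page has no microformats of its own. The function names, the `[{ name, content }]` input shape, and the property mapping are all assumptions made for this example, loosely based on the Metaformats idea rather than taken from any library.

```javascript
// Assumed mapping from Open Graph meta names to mf2 property names.
const META_TO_MF2 = {
  "og:title": "name",
  "og:description": "summary",
  "og:image": "photo",
  "og:url": "url",
};

// Build a single mf2 h-entry from already-extracted meta tags.
// `metaTags` is a hypothetical input shape: [{ name, content }, ...].
function metaTagsToMf2(metaTags) {
  const properties = {};
  for (const { name, content } of metaTags) {
    const property = META_TO_MF2[name];
    if (property && content) {
      properties[property] = [...(properties[property] || []), content];
    }
  }
  return { items: [{ type: ["h-entry"], properties }] };
}

// Only fall back to meta tags when the page has no microformats at all.
function withMetaformatsFallback(mf2, metaTags) {
  return mf2.items.length > 0 ? mf2 : metaTagsToMf2(metaTags);
}

const fallback = withMetaformatsFallback({ items: [] }, [
  { name: "og:title", content: "Hello" },
  { name: "viewport", content: "width=device-width" },
]);
console.log(fallback.items[0].properties.name); // [ 'Hello' ]
```

Keeping the fallback behind an explicit function mirrors the opt-in behaviour suggested above: callers that never invoke it see no change.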

@paulrobertlloyd
Collaborator

If you wanted to submit a PR that used mf2utiljs to clean up incoming Microformats to use in references, I think that would be really useful, and potentially solve this issue!

I wonder if it’s a case of parsing the Microformats returned here with mf2utiljs:

const mf2 = await fetchMf2(url);

@aciccarello
Contributor Author

From what I can tell, mf2utiljs was a one-off personal project. I haven't gotten any response about being open to community involvement. I think we might need to implement the authorship and main entry algorithms separately. I'm considering creating a library for them, but I'd prefer not to add yet another package if there's an existing home for this logic.

@paulrobertlloyd
Collaborator

paulrobertlloyd commented Aug 14, 2023

If you’d like to contribute a PR to add them to this project, I think that could work. These algorithms do seem to fall into the category of converting mf2 to JF2.

At some point I also think it would make sense to ask about moving this project to the @microformats organisation, much like the new Node Microformats parser was, meaning this project can live alongside that project and mf2tojf2.py.

2 participants