Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Whitespace tags stripped from plaintext values #142

Closed
aaronpk opened this issue Mar 30, 2018 · 16 comments
Closed

Whitespace tags stripped from plaintext values #142

aaronpk opened this issue Mar 30, 2018 · 16 comments

Comments

@aaronpk
Copy link

aaronpk commented Mar 30, 2018

I recently switched my website to include <br> tags instead of newlines for my notes. This means I now have HTML markup like

Hello<br><br>World

It appears that when Granary is converting this to JSON Feed (haven't checked other formats), it is stripping the tags completely instead of converting them to whitespace, so the example above would appear as HelloWorld in the JSON Feed.

This then causes a problem when comparing the text value to the HTML value, and Granary thinks they are different so it creates a title for the note. Then my posts appear smushed in Micro.blog.

I think Granary should recognize that a
tag is meaningful and replace that with a newline so that the plaintext conversion works properly.

@snarfed
Copy link
Owner

snarfed commented Mar 31, 2018

thanks for filing! ...and ugh, looks like it's mf2py. i assume this isn't the expected result (v1.0.5):

mf2py.parse(doc="""
<article class="h-entry">
<div class="e-content p-name">foo bar<br />baz <br><br> baj</div>
</article>""", url='http://x/')

results in:

{
  "items": [{
      "type": ["h-entry"], 
      "properties": {
        "content": [{
            "html": "foo bar<br/>baz <br/><br/> baj", 
            "value": "foo barbaz  baj"
          }], 
        "name": ["foo barbaz  baj"]
      }
    }
  ]
}

@kevinmarks @kartikprabhu any thoughts?

@snarfed
Copy link
Owner

snarfed commented Mar 31, 2018

btw @kevinmarks @kartikprabhu mf2py's release management might deserve a bit of love. https://pypi.org/project/mf2py/ just says Project Description UNKNOWN, and the last release tagged in github is 1.0.0 from oct 2015. :P https://github.com/microformats/mf2py/releases

@snarfed
Copy link
Owner

snarfed commented Mar 31, 2018

also for background, the post that inspired this is https://aaronparecki.com/2018/03/30/18/ , and php-mf2 correctly converts its <br>s to \ns in name and content.value: https://pin13.net/mf2/?url=https://aaronparecki.com/2018/03/30/18/

@snarfed
Copy link
Owner

snarfed commented Mar 31, 2018

my first repro (above) was mf2py 1.0.5, beautifulsoup4 4.4.1, html5lib 0.9999999.

i tried just now on microformats/mf2py master tip, beautifulsoup4 4.6.0, and html5lib 1.0.1. same result.

@kartikprabhu
Copy link
Contributor

@snarfed
mf2py does not do much of whitespace manipulation, it just defers to BeautifulSoup. There are no specific whitespace rules in mf2 spec either, so not sure what is to be implemented. @aaronpk has some examples of expectations at https://pin13.net/mf2/whitespace.html but they are not as simple as replace <br/> by \n.

The pypi releases are owned by @tommorris currently so he would have to transfer ownersip or something to do those releases.

@snarfed
Copy link
Owner

snarfed commented Mar 31, 2018

thanks for looking! aha, 2 and 8 on https://pin13.net/mf2/whitespace.html do indeed look like this bug.

@kartikprabhu sounds like you're skeptical of fixing this in mf2py? or you just think it will be difficult? or you'd want to see it in the parsing spec first? @aaronpk, @tantek, @kevinmarks, thoughts?

as for pypi, understood. ask @tommorris to transfer ownership! I'm sure he will, since he did for the repo. people generally install from pypi, not github (usually), so we definitely need to be able to continue releasing there.

@kartikprabhu
Copy link
Contributor

@snarfed I am not skeptical of fixing this; there is a whitespace algorithm by @Zegnat, but TBH it looks like a lot of DOM tree parsing work.

For pypi there is already an issue open microformats/mf2py#93

@snarfed
Copy link
Owner

snarfed commented Mar 31, 2018

aha, got it. understood. thanks for the explanation and link, and props to @Zegnat for writing it.

not sure where that leaves us, but let me know if you need anything else from me!

@kartikprabhu
Copy link
Contributor

@snarfed whitespace rules are now in experimental version https://github.com/kartikprabhu/mf2py/tree/experimental

@kartikprabhu
Copy link
Contributor

The example @snarfed used in #142 (comment)
now gives the following in the experimental mf2py

"items": [
        {
            "type": [
                "h-entry"
            ], 
            "properties": {
                "content": [
                    {
                        "html": "foo bar<br/>baz <br/><br/> baj", 
                        "value": "foo bar\nbaz\n\nbaj"
                    }
                ], 
                "name": [
                    "foo bar\nbaz\n\nbaj"
                ]
            }
        }
    ]
@snarfed
Copy link
Owner

snarfed commented Apr 2, 2018

@kartikprabhu yay, agreed, my tests pass with that new code too. can't wait for a release!

interestingly though, that branch fails a couple other of my tests. looks like implied name now includes img srcs? e.g. this html:

<body class="h-entry">
<div class="p-author h-card">
<a href="http://li/nk">my name</a>
<img class="u-photo" src="http://pic/ture" />
</div>
</body>

results in name my name http://pic/ture, where before it was just my name. just a heads up.

(btw long lived dev branches are scary, but that's a separate conversation. :P)

@kartikprabhu
Copy link
Contributor

@snarfed yes that is correct according to updated textContent rules http://microformats.org/wiki/microformats2-parsing#parsing_for_implied_properties

@snarfed
Copy link
Owner

snarfed commented Apr 2, 2018

huh, ok. seems ugly, but understood!

@snarfed
Copy link
Owner

snarfed commented May 23, 2018

sadly this didn't make it into the recent mf2py 1.1.0 release. ah well. next one hopefully!

>>> mf2py.__version__
'1.1.0'

>>> mf2py.parse("""\
<article class="h-entry">
<div class="e-content p-name">foo bar<br />baz <br><br> baj</div>
</article>""", url='http://foo')

{'items': [{
  'type': ['h-entry'],
  'properties': {
    'content': [{
      'html': 'foo bar<br/>baz <br/><br/> baj',
      'value': 'foo barbaz  baj',
    }],
    ...
@snarfed
Copy link
Owner

snarfed commented May 23, 2018

(i briefly foolishly hoped that @bdesham's #149 might fix this, but alas, no luck.)

@snarfed
Copy link
Owner

snarfed commented Jul 25, 2018

done! whitespace tags are now correctly converted in jsonfeed titles. example: https://granary.io/url?input=html&output=jsonfeed&url=https%3A%2F%2Faaronparecki.com%2F2018%2F07%2F19%2F26%2F

note that granary still generates a jsonfeed title for that post, even though it's technically a note (?).

@snarfed snarfed closed this as completed Jul 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants