Whitespace tags stripped from plaintext values #142

aaronpk · 2018-03-30T23:19:33Z

I recently switched my website to include <br> tags instead of newlines for my notes. This means I now have HTML markup like

Hello<br><br>World

It appears that when Granary is converting this to JSON Feed (haven't checked other formats), it is stripping the tags completely instead of converting them to whitespace, so the example above would appear as HelloWorld in the JSON Feed.

This then causes a problem when comparing the text value to the HTML value, and Granary thinks they are different so it creates a title for the note. Then my posts appear smushed in Micro.blog.

I think Granary should recognize that a <br> tag is meaningful and replace it with a newline so that the plaintext conversion works properly.
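
To make the request concrete, here is a minimal sketch of the two behaviors using only Python's standard library. It is not Granary's or mf2py's actual code; the `TextExtractor` class and `to_plaintext` function are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content; optionally emits a newline for each <br>.

    Hypothetical helper for illustration, not part of Granary or mf2py.
    """

    def __init__(self, br_as_newline):
        super().__init__()
        self.br_as_newline = br_as_newline
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing forms such as <br/> and <br />.
        if tag == "br" and self.br_as_newline:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def to_plaintext(html, br_as_newline=True):
    """Return the text of an HTML fragment, with or without <br> handling."""
    parser = TextExtractor(br_as_newline)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


# Tags simply dropped: the behavior reported in this issue.
print(repr(to_plaintext("Hello<br><br>World", br_as_newline=False)))
# Each <br> turned into a newline: the behavior requested.
print(repr(to_plaintext("Hello<br><br>World")))
```

With `br_as_newline=False` the two words run together, which is the smushed text that ends up in the feed.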

snarfed · 2018-03-31T03:47:26Z

thanks for filing! ...and ugh, looks like it's mf2py. i assume this isn't the expected result (v1.0.5):

import mf2py

mf2py.parse(doc="""
<article class="h-entry">
<div class="e-content p-name">foo bar<br />baz <br><br> baj</div>
</article>""", url='http://x/')

results in:

{
  "items": [{
      "type": ["h-entry"], 
      "properties": {
        "content": [{
            "html": "foo bar<br/>baz <br/><br/> baj", 
            "value": "foo barbaz  baj"
          }], 
        "name": ["foo barbaz  baj"]
      }
    }
  ]
}

@kevinmarks @kartikprabhu any thoughts?

snarfed · 2018-03-31T03:49:47Z

btw @kevinmarks @kartikprabhu mf2py's release management might deserve a bit of love. https://pypi.org/project/mf2py/ just says Project Description UNKNOWN, and the last release tagged in github is 1.0.0 from oct 2015. :P https://github.com/microformats/mf2py/releases

snarfed · 2018-03-31T03:51:29Z

also for background, the post that inspired this is https://aaronparecki.com/2018/03/30/18/ , and php-mf2 correctly converts its <br>s to \ns in name and content.value: https://pin13.net/mf2/?url=https://aaronparecki.com/2018/03/30/18/

snarfed · 2018-03-31T04:01:12Z

my first repro (above) was mf2py 1.0.5, beautifulsoup4 4.4.1, html5lib 0.9999999.

i tried just now on microformats/mf2py master tip, beautifulsoup4 4.6.0, and html5lib 1.0.1. same result.

kartikprabhu · 2018-03-31T16:46:48Z

@snarfed
mf2py does not do much whitespace manipulation; it just defers to BeautifulSoup. There are no specific whitespace rules in the mf2 spec either, so I'm not sure what should be implemented. @aaronpk has some examples of expectations at https://pin13.net/mf2/whitespace.html but they are not as simple as replacing <br/> with \n.

The pypi releases are currently owned by @tommorris, so he would have to transfer ownership or something to do those releases.

snarfed · 2018-03-31T17:07:18Z

thanks for looking! aha, 2 and 8 on https://pin13.net/mf2/whitespace.html do indeed look like this bug.

@kartikprabhu sounds like you're skeptical of fixing this in mf2py? or you just think it will be difficult? or you'd want to see it in the parsing spec first? @aaronpk, @tantek, @kevinmarks, thoughts?

as for pypi, understood. ask @tommorris to transfer ownership! I'm sure he will, since he did for the repo. people generally install from pypi, not github, so we definitely need to be able to continue releasing there.

kartikprabhu · 2018-03-31T17:14:05Z

@snarfed I am not skeptical of fixing this; there is a whitespace algorithm by @Zegnat, but TBH it looks like a lot of DOM tree parsing work.

For pypi there is already an issue open microformats/mf2py#93

snarfed · 2018-03-31T17:18:18Z

aha, got it. understood. thanks for the explanation and link, and props to @Zegnat for writing it.

not sure where that leaves us, but let me know if you need anything else from me!

kartikprabhu · 2018-04-01T17:05:28Z

@snarfed whitespace rules are now in experimental version https://github.com/kartikprabhu/mf2py/tree/experimental

kartikprabhu · 2018-04-01T18:20:51Z

The example @snarfed used in #142 (comment) now gives the following in the experimental mf2py:

"items": [
        {
            "type": [
                "h-entry"
            ], 
            "properties": {
                "content": [
                    {
                        "html": "foo bar<br/>baz <br/><br/> baj", 
                        "value": "foo bar\nbaz\n\nbaj"
                    }
                ], 
                "name": [
                    "foo bar\nbaz\n\nbaj"
                ]
            }
        }
    ]
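
For readers following along, the output above is consistent with a rule along these lines: collapse runs of source whitespace to a single space, turn each <br> into a newline, then drop spaces that touch a newline. The sketch below is a deliberate simplification under that assumption. It is not the experimental mf2py code or @Zegnat's full algorithm, and `WhitespaceAwareText` and `plaintext_value` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class WhitespaceAwareText(HTMLParser):
    """Hypothetical text extractor: <br> becomes a newline, other tags vanish."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Covers <br>, <br/> and <br />.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Whitespace in the HTML source (including literal newlines)
        # collapses to a single space, as it would when rendered.
        self.parts.append(re.sub(r"\s+", " ", data))


def plaintext_value(html):
    """Simplified plaintext conversion; an assumption, not the mf2 spec."""
    parser = WhitespaceAwareText()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" {2,}", " ", text)    # spaces that met across a tag boundary
    text = re.sub(r" *\n *", "\n", text)  # no spaces hugging a line break
    return text.strip()


print(repr(plaintext_value("foo bar<br />baz <br><br> baj")))
```

This handles only <br>; block-level elements, <pre> and similar cases covered by the whitespace examples page are left out.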

snarfed · 2018-04-02T14:18:39Z

@kartikprabhu yay, agreed, my tests pass with that new code too. can't wait for a release!

interestingly though, that branch fails a couple of my other tests. looks like implied name now includes img srcs? e.g. this html:

<body class="h-entry">
<div class="p-author h-card">
<a href="http://li/nk">my name</a>
<img class="u-photo" src="http://pic/ture" />
</div>
</body>

results in name "my name http://pic/ture", where before it was just "my name". just a heads up.

(btw long lived dev branches are scary, but that's a separate conversation. :P)

kartikprabhu · 2018-04-02T14:20:45Z

@snarfed yes that is correct according to updated textContent rules http://microformats.org/wiki/microformats2-parsing#parsing_for_implied_properties
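
As a rough illustration of why the img src ends up in the name: under a textContent rule that replaces each <img> with its alt attribute, or with its src when there is no alt, the example above yields exactly that string. The sketch below assumes that reading of the rule and skips URL resolution. `ImpliedNameText` and `implied_name_text` are hypothetical names, not mf2py functions.

```python
import re
from html.parser import HTMLParser


class ImpliedNameText(HTMLParser):
    """Hypothetical extractor that substitutes text for <img> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        if alt is not None:
            # Prefer alt text when the attribute is present.
            self.parts.append(alt)
        elif attributes.get("src"):
            # Otherwise fall back to src, padded with spaces.
            self.parts.append(" " + attributes["src"] + " ")

    def handle_data(self, data):
        self.parts.append(data)


def implied_name_text(html):
    """Text content with <img> replaced, whitespace collapsed and trimmed."""
    parser = ImpliedNameText()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


html = """<body class="h-entry">
<div class="p-author h-card">
<a href="http://li/nk">my name</a>
<img class="u-photo" src="http://pic/ture" />
</div>
</body>"""
print(implied_name_text(html))
```

Adding an alt attribute to the img would put the alt text in the name instead of the URL.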

snarfed · 2018-04-02T16:34:13Z

huh, ok. seems ugly, but understood!

snarfed · 2018-05-23T01:17:49Z

sadly this didn't make it into the recent mf2py 1.1.0 release. ah well. next one hopefully!

>>> mf2py.__version__
'1.1.0'

>>> mf2py.parse("""\
<article class="h-entry">
<div class="e-content p-name">foo bar<br />baz <br><br> baj</div>
</article>""", url='http://foo')

{'items': [{
  'type': ['h-entry'],
  'properties': {
    'content': [{
      'html': 'foo bar<br/>baz <br/><br/> baj',
      'value': 'foo barbaz  baj',
    }],
    ...

snarfed · 2018-05-23T01:20:54Z

(i briefly foolishly hoped that @bdesham's #149 might fix this, but alas, no luck.)

snarfed · 2018-07-25T00:50:19Z

done! whitespace tags are now correctly converted in jsonfeed titles. example: https://granary.io/url?input=html&output=jsonfeed&url=https%3A%2F%2Faaronparecki.com%2F2018%2F07%2F19%2F26%2F

note that granary still generates a jsonfeed title for that post, even though it's technically a note (?).

snarfed added a commit that referenced this issue May 10, 2018

update tests to expect better whitespace handling in next mf2py release

3289a42

for #142

snarfed added a commit that referenced this issue May 23, 2018

update tests to expect better whitespace handling in next mf2py release

7a78cfa

for #142

This was referenced May 25, 2018

Atom titles when parsing microformats, 'note' type #150

Closed

Update mf2py (waiting for next release) snarfed/bridgy#828

Closed

snarfed added a commit that referenced this issue May 31, 2018

upgrade mf2py to 1.1.0

43480cb

for snarfed/bridgy#828, #145, #142, etc.

snarfed added a commit that referenced this issue Jul 18, 2018

upgrade mf2py to 1.1.1

a989c3e

fixes #142, fixes #145, fixes snarfed/bridgy#756, for snarfed/bridgy#828

snarfed added a commit that referenced this issue Jul 18, 2018

update tests to expect better whitespace handling in next mf2py release

d866971

for #142

snarfed added a commit to snarfed/bridgy that referenced this issue Jul 24, 2018

upgrade mf2py to 1.1.2 (unreleased)

9c2c504

details in #828, snarfed/granary#142, snarfed/granary@a989c3e, etc.

snarfed added a commit to snarfed/bridgy that referenced this issue Jul 24, 2018

upgrade mf2py to 1.1.2 (unreleased)

b77cccb

details in #828, snarfed/granary#142, snarfed/granary@a989c3e, etc.

snarfed added a commit to snarfed/bridgy that referenced this issue Jul 24, 2018

upgrade mf2py to 1.1.2 (unreleased)

b54bed4

details in #828, snarfed/granary#142, snarfed/granary@a989c3e, etc.

snarfed closed this as completed Jul 25, 2018
