Bug related to captioned images

Hi everyone,

We are developing a paywall for a customer (once it is stable, it will be released as a new add-on). The idea is to have a transform that adds a "continue reading" button and hides all paragraphs after the third one.

After some tests I realized that this post contains a captioned image, and that the whole paragraph wrapping it has disappeared: the content of that paragraph is left in the parent <div> without a <p> tag around it.

This is a PITA for my paywall: I can't hide this content anymore.
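To make the problem concrete, here is a minimal sketch of the paragraph-hiding idea, using only the standard library. It is not the actual paywall transform: the `ParagraphHider` class, the `hide_after` function and the `data-paywall` attribute are names made up for illustration, and it simply counts every <p> in the document.

```python
from html.parser import HTMLParser


class ParagraphHider(HTMLParser):
    """Re-emit HTML, marking every <p> after the first `visible` ones."""

    def __init__(self, visible=3):
        # convert_charrefs=False so entities are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.visible = visible
        self.seen = 0
        self.out = []

    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        if tag == "p":
            self.seen += 1
            if self.seen > self.visible:
                # Assumed marker attribute; CSS or JS would hide these.
                text = text[:2] + ' data-paywall="hidden"' + text[2:]
        self.out.append(text)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)


def hide_after(html, visible=3):
    """Return `html` with every <p> after the first `visible` ones marked."""
    parser = ParagraphHider(visible)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Text that is not wrapped in a <p> passes through untouched, so a paragraph that loses its <p> tag stays visible no matter how many paragraphs came before it.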

At first I looked at the plone.outputfilters package, since that is where captioned images are processed, but after a debugging session I realized that this code is working fine:

> /home/rodfersou/.projects/cache/eggs/plone.outputfilters-1.15.1-py2.7.egg/plone/outputfilters/filters/resolveuid_and_caption.py(123)__call__()
    121         self.close()
    122         import debug
--> 123         return self.getResult()
    124 
    125     # SGMLParser implementation

ipdb> pp self.pieces
[
.
.
.
 '</p>',
 '\r\n',
 '<p>',
 '<dl style="width:400px;" class="image-left captioned">\n<dt><a rel="lightbox" href="/Plone/news/temer-garante-foro-privilegiado-a-moreira-franco-o-201cangora201d-da-odebrecht/novos-ministros"><img src="http://localhost:8080/Plone/news/temer-garante-foro-privilegiado-a-moreira-franco-o-201cangora201d-da-odebrecht/novos-ministros/@@images/81b99197-77b6-40ba-bbcd-52585e0a7976.jpeg" alt="Novos ministros" title="Novos ministros" height="266" width="400" /></a></dt>\n <dd class="image-caption" style="width:400px;">O tucano Imbassahy herda a pasta deixada por Geddel. Ao fundo, boquiaberto, Alexandre de Moraes recebe novas atribui\xc3\xa7\xc3\xb5es no Minist\xc3\xa9rio da Justi\xc3\xa7a (Beto Barata/PR)</dd>\n</dl>',
 'Em meio ao ',
 '<a title="" href="resolveuid/474a93484c304ebe943fa15bd79ef6d6" class="internal-link" target="_self">',
 'caos no sistema penitenci\xc3\xa1rio',
 '</a>',
 ', Temer alterou o nome da pasta da Justi\xc3\xa7a, que passou a se chamar Minist\xc3\xa9rio da Justi\xc3\xa7a e da Seguran\xc3\xa7a P\xc3\xbablica e ganhou novas atribui\xc3\xa7\xc3\xb5es.\xc2\xa0Por fim, nomeou o deputado federal Antonio Imbassahy (PSDB-BA) para a Secretaria de Governo. O nome do tucano j\xc3\xa1 vinha sendo especulado pela m\xc3\xaddia desde o ano passado. O cargo estava vago desde novembro com a ren\xc3\xbancia de ',
 '<a title="" href="resolveuid/0d33c85451c1484db2e5b42e40f77807" class="internal-link" target="_self">',
 'Geddel Vieira Lima',
 '</a>',
 ',\xc2\xa0ap\xc3\xb3s o o ex-ministro da Cultura Marcelo Calero denunciar que o peemedebista o pressionou para liberar a constru\xc3\xa7\xc3\xa3o de um edif\xc3\xadcio de alto padr\xc3\xa3o em Salvador.',
 '</p>',
 '\r\n',
 '<p>',
.
.
.
]

The paragraphs are there at this point.

Can someone help me figure out where this paragraph tag disappears?

I checked whether the problem is caused by a Diazo rule, but it still happens without Diazo.

I found that changing the paragraph tag around the captioned image into a <div> fixes the problem.

It looks like some code checks whether a paragraph can have a <dl> tag inside (maybe there is a restriction in the W3C specs).

The question now is: where does this happen? Should I write a fix somewhere in Plone core to change the <p> tag into a <div> tag when there is a captioned image?

It looks like the lxml library is the bad guy here:

ipdb> lxml.html.tostring(lxml.html.fromstring('<p></p>'))
'<p></p>'
ipdb> lxml.html.tostring(lxml.html.fromstring('<p><dl></dl></p>'))
'<div><p></p><dl></dl></div>'
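To find posts affected by this before they reach lxml, a small checker along these lines might help. It is only a sketch using the standard library: `BLOCK_TAGS` is a partial list chosen for illustration, and `find_blocks_in_paragraphs` is a hypothetical helper, not part of Plone or lxml.

```python
from html.parser import HTMLParser

# Partial, illustrative list of block-level tags that may not sit inside <p>.
BLOCK_TAGS = {"div", "dl", "ol", "ul", "table", "blockquote", "pre"}


class BlockInParagraphFinder(HTMLParser):
    """Collect block-level tags that are opened while a <p> is still open."""

    def __init__(self):
        super().__init__()
        self.open_paragraphs = 0
        self.offenders = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.open_paragraphs += 1
        elif tag in BLOCK_TAGS and self.open_paragraphs:
            # Record the tag name and its (line, column) position.
            self.offenders.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        if tag == "p" and self.open_paragraphs:
            self.open_paragraphs -= 1


def find_blocks_in_paragraphs(html):
    """Return a list of (tag, position) for block tags nested in a <p>."""
    finder = BlockInParagraphFinder()
    finder.feed(html)
    finder.close()
    return finder.offenders
```

Calling it on the second snippet above reports the <dl>, while the plain '<p></p>' yields an empty list.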

I tried a temporary fix in my transform, but it didn't work, since the HTML is processed when the object is saved, not when the transform changes the HTML.

Maybe we should fix it in the output filters code.
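As a rough idea of what such a fix could look like, the sketch below swaps the <p> around a captioned <dl> for a <div>. It assumes a list of string pieces shaped like the `self.pieces` output shown in the debug session; `unwrap_captioned_paragraphs` is a hypothetical helper, not an existing plone.outputfilters function.

```python
def unwrap_captioned_paragraphs(pieces):
    """Return a copy of `pieces` where a <p> holding a captioned <dl>
    is turned into a <div>, so the markup stays valid."""
    result = list(pieces)
    open_index = None    # index of the most recent unclosed <p> piece
    has_caption = False  # True once a captioned <dl> is seen inside it
    for i, piece in enumerate(result):
        if piece.startswith(("<p>", "<p ")):
            open_index = i
            has_caption = False
        elif piece.startswith("<dl") and "captioned" in piece:
            has_caption = open_index is not None
        elif piece == "</p>" and open_index is not None:
            if has_caption:
                # Keep any attributes: '<p class="x">' -> '<div class="x">'
                result[open_index] = "<div" + result[open_index][2:]
                result[i] = "</div>"
            open_index = None
            has_caption = False
    return result
```

Paragraphs without a captioned image are left alone, so only the invalid <p><dl> combination changes.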

Having a <dl> tag inside a <p> is invalid markup; it seems we need to fix this in Plone itself.

Please open an issue.

Fine, it looks like we need to use this markup instead:

<figure>
  <img src="bubbles-work.jpeg"
      alt="Bubbles, sitting in his office chair, works on his
            latest project intently.">
  <figcaption>Bubbles at work</figcaption>
</figure>

http://w3c.github.io/html/single-page.html#example-1b4e51d7

The lxml library leaves this markup untouched:

ipdb> lxml.html.tostring(lxml.html.fromstring('<p><figure><figcaption></figcaption></figure></p>'))
'<p><figure><figcaption></figcaption></figure></p>'
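For reference, a captioned image could be rendered as <figure> markup roughly like this. This is a sketch only: `captioned_figure` is a made-up helper, and the real change would have to go wherever Plone builds the caption markup.

```python
from html import escape


def captioned_figure(src, alt, caption):
    """Build <figure> markup for a captioned image (hypothetical helper)."""
    return (
        "<figure>"
        '<img src="{src}" alt="{alt}">'
        "<figcaption>{caption}</figcaption>"
        "</figure>"
    ).format(
        # Escape everything so quotes or tags in the values cannot break out.
        src=escape(src, quote=True),
        alt=escape(alt, quote=True),
        caption=escape(caption),
    )
```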

I opened a bug report about it: https://github.com/plone/Products.CMFPlone/issues/2020