I want to extract some information from HTML text using Python. The regex below should match multiple times throughout the text.
import re

def regexanje(tekst):
    # Headword, form ("Oblika"), part of speech and tonal accent ("Tonemski naglas")
    re_ime = r'<span class="font_xlarge"><a href.*?>(?P<ime>.+?)</a>'
    re_oblika = r'title="Oblika" data-group="header">(?P<oblika>.*?)</span>'
    re_vrsta = r'<span data-group="header qualifier"><span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="(?P<vrsta>samostalnik ženskega spola|samostalnik moškega spola|samostalnik srednjega spola|medemet|predlog|predpona|členek|dovršni glagol|nedovršni glagol|dovršni in nedovršni glagol|pridevnik|prislov|zaimek|števnik)"'
    re_tonemski_naglas = r'(?:title="Tonemski naglas" data-group="header">(?P<tonemski_naglas>.*?)</span>)'
    # Join the parts with a lazy "skip anything" between them
    vzorec = ".*?".join((re_ime, re_oblika, re_vrsta, re_tonemski_naglas))
    # re.DOTALL lets "." match newlines as well
    return [m.groupdict() for m in re.finditer(vzorec, tekst, re.DOTALL)]
The problem is with re_oblika = r'title="Oblika" data-group="header">(?P<oblika>.*?)</span>' and with re_tonemski_naglas. These two parts may or may not be present in the text. From here on I will only discuss re_oblika, because the problem is exactly the same for both.
If I run the code above, it only matches entries that contain the re_oblika part.
I tried making it optional with re_oblika = r'(?:title="Oblika" data-group="header">(?P<oblika>.*?)</span>)?', but then re_oblika captures nothing, even when it should.
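The behaviour described above can be reproduced on a much smaller input. The sketch below uses made-up markers (`NAME:`, `X:`, `KIND:`) in place of the real HTML, so every name in it is hypothetical. It shows that an optional group placed right after a lazy `.*?` is tried only at the position where the lazy skip has matched nothing, fails there, and is skipped as empty. One possible workaround, offered as an assumption rather than a definitive fix, is to move the skip inside the optional group and stop it from running past the start of the next entry.

```python
import re

with_x = "NAME:foo X:bar KIND:noun"
without_x = "NAME:foo KIND:noun"
two_entries = "NAME:a KIND:noun NAME:b X:bar KIND:verb"

# Optional group directly after a lazy skip: the skip first matches nothing,
# "X:" is not at that exact position, so the optional group matches empty.
naive = re.compile(
    r"NAME:(?P<name>\w+).*?(?:X:(?P<x>\w+))?.*?KIND:(?P<kind>\w+)"
)

# Skip moved inside the optional group. The (?!NAME:) lookahead keeps the
# skip from crossing into the next entry.
tempered = re.compile(
    r"NAME:(?P<name>\w+)"
    r"(?:(?:(?!NAME:).)*?X:(?P<x>\w+))?"
    r"(?:(?!NAME:).)*?KIND:(?P<kind>\w+)",
    re.DOTALL,
)

naive_result = naive.search(with_x).groupdict()
tempered_with = tempered.search(with_x).groupdict()
tempered_without = tempered.search(without_x).groupdict()
tempered_many = [m.groupdict() for m in tempered.finditer(two_entries)]

print(naive_result)      # x is None even though "X:bar" is present
print(tempered_with)     # x is "bar"
print(tempered_without)  # x is None, but the entry still matches
print(tempered_many)
```

In the real pattern, the role of `NAME:` would presumably be played by the `<span class="font_xlarge">` that opens each entry.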
I also tried:
re_ime = r'<span class="font_xlarge"><a href.*?>(?P<ime>.+?)</a'
re_oblika = r'>(?:.*?title="Oblika" data-group="header">(?P<oblika>.+?)</span>.*?)?<'
re_vrsta = r'span data-group="header qualifier"><span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="(?P<vrsta>samostalnik ženskega spola|samostalnik moškega spola|samostalnik srednjega spola|medemet|predlog|predpona|členek|dovršni glagol|nedovršni glagol|dovršni in nedovršni glagol|pridevnik|prislov|zaimek|števnik)"'
With this, it still matches only when that part of the text is present.
Using {0,1} at the end of the group had the same effect as using ?.
I also tried using (?>...) (I don't know whether it would work), but it raised an error: re.error: unknown extension ?> at position 57
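For context on that error: the standard `re` module accepts atomic groups `(?>...)` only from Python 3.11 onward, and older versions reject them with exactly this "unknown extension" message. The small check below is a sketch for finding out what the running interpreter supports. The helper name `supports_atomic_groups` is made up for illustration.

```python
import re
import sys


def supports_atomic_groups() -> bool:
    """Return True if this interpreter's re module accepts (?>...)."""
    try:
        re.compile(r"(?>a+)b")
    except re.error:
        return False
    return True


print(sys.version_info[:2], supports_atomic_groups())
```

An atomic group only changes how the engine backtracks, so on its own it would probably not make the optional part match.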
I should also add that the part of the text in question can repeat multiple times, and I want to capture every occurrence, but I haven't started working on that yet.
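One way to handle a field that may be absent or repeated is to stop using a single large pattern: first cut the text into entries, then search each entry separately. The sketch below is a minimal illustration of that idea, not a drop-in replacement. It reuses the `re_ime` and `re_oblika` patterns from above, but `parse_entries`, `ENTRY_START`, the `oblike` key and the shortened `sample` HTML are assumptions introduced for the example.

```python
import re

# Assumed marker for the start of each dictionary entry.
ENTRY_START = '<span class="font_xlarge">'

RE_IME = re.compile(
    r'<span class="font_xlarge"><a href.*?>(?P<ime>.+?)</a>', re.DOTALL
)
RE_OBLIKA = re.compile(
    r'title="Oblika" data-group="header">(.*?)</span>', re.DOTALL
)


def parse_entries(tekst):
    """Split the text into entries and collect zero or more forms per entry."""
    results = []
    for chunk in tekst.split(ENTRY_START)[1:]:
        chunk = ENTRY_START + chunk  # restore the marker removed by split()
        m = RE_IME.match(chunk)
        if m is None:
            continue
        # findall returns an empty list when the field is absent
        results.append({"ime": m.group("ime"), "oblike": RE_OBLIKA.findall(chunk)})
    return results


# Shortened, made-up HTML in the shape of the real entries.
sample = (
    '<span class="font_xlarge"><a href="/x/a">a</a></span>'
    '<span title="veznik">vez.</span>'
    '<span class="font_xlarge"><a href="/x/b">b</a></span>'
    '<span title="Oblika" data-group="header">-e </span>'
    '<span title="Oblika" data-group="header">-a </span>'
)

print(parse_entries(sample))
```

Because each search is confined to one entry, a missing field cannot pull in text from the next entry, and a repeated field simply yields a longer list.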
Here is an example of a text that shouldn't match:
<span class="font_xlarge"><a href="/133/sskj2-slovar-slovenskega-knjiznega-jezika-2/4457252/a?View=1&Query=*&All=*&FilteredDictionaryIds=133">a</a></span><span class="color_lightdark font_xsmal sup">5</span> <span data-group="header"><span data-group="header qualifier"><span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="veznik" data-group="header qualifier">vez.</span></span>, <span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="knjižno" data-group="qualifier header ">knjiž. </span>
Here is an example that should:
<span class="font_xlarge"><a href="/133/sskj2-slovar-slovenskega-knjiznega-jezika-2/4457256/abalienacija?View=1&Query=*&All=*&FilteredDictionaryIds=133">abalienácija</a></span> <span data-group="header"><span class="color_lightdark" data-toggle="tooltip" data-placement="top" title="Oblika" data-group="header">-e </span><span data-group="header qualifier"><span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="samostalnik ženskega spola">ž</span></span> <span class="color_lightdark">(</span><span class="color_lightdark" data-toggle="tooltip" data-placement="top" title="Tonemski naglas" data-group="header">á</span><span class="color_lightdark">) </span></span><br /><span class="color_dark italic" data-toggle="tooltip" data-placement="top" title="Sopomenka" data-group="explanation"><a class="reference" href="/133/sskj2-slovar-slovenskega-knjiznega-jezika-2/4458099/alienacija?View=1&Query=*&All=*&FilteredDictionaryIds=133" target="_blank">alienacija</a></span>
beautifulsoup4
.