
I want to extract some information from HTML text using Python. The regex below should match multiple times throughout the text.

import re

def regexanje(tekst):
    # Each fragment captures one named field of a dictionary entry.
    re_ime = r'<span class="font_xlarge"><a href.*?>(?P<ime>.+?)</a>'
    re_oblika = r'title="Oblika" data-group="header">(?P<oblika>.*?)</span>'
    re_vrsta = r'<span data-group="header qualifier"><span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="(?P<vrsta>samostalnik ženskega spola|samostalnik moškega spola|samostalnik srednjega spola|medemet|predlog|predpona|členek|dovršni glagol|nedovršni glagol|dovršni in nedovršni glagol|pridevnik|prislov|zaimek|števnik)"'
    re_tonemski_naglas = r'(?:title="Tonemski naglas" data-group="header">(?P<tonemski_naglas>.*?)</span>)'
    # Join the fragments with lazy gaps so anything may sit between them.
    vzorec = ".*?".join((re_ime, re_oblika, re_vrsta, re_tonemski_naglas))
    # re.DOTALL lets "." match newlines, so one entry may span several lines.
    return [m.groupdict() for m in re.finditer(vzorec, tekst, re.DOTALL)]

The problem is in re_oblika = r'title="Oblika" data-group="header">(?P<oblika>.*?)</span>' and in re_tonemski_naglas. These two parts may or may not be present in the text. From now on I will discuss only re_oblika, because the problem is exactly the same for both.

If I run the code above, it matches only entries that contain the re_oblika part.

I tried re_oblika = r'(?:title="Oblika" data-group="header">(?P<oblika>.*?)</span>)?', but then the oblika group captures nothing, even when it should.
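A likely reason the optional version captures nothing: the lazy `.*?` gap in front of the group first tries to match zero characters, the optional group then fails at that exact position and is allowed to match empty, and the next gap happily skips over the Oblika part. One possible fix, sketched below, is to move the gap inside the optional group and stop it from running into the next entry. The names `ENTRY_START` and `GAP` are introduced here for illustration, and the sketch assumes that every entry begins with `<span class="font_xlarge">` and that the captured values contain no nested tags (hence `[^<]` instead of `.`).

```python
import re

# Assumption: every dictionary entry starts with this tag.
ENTRY_START = r'<span class="font_xlarge">'
# Lazy gap that refuses to cross into the next entry.
GAP = rf'(?:(?!{ENTRY_START}).)*?'

VRSTE = "|".join((
    "samostalnik ženskega spola", "samostalnik moškega spola",
    "samostalnik srednjega spola", "medemet", "predlog", "predpona",
    "členek", "dovršni glagol", "nedovršni glagol",
    "dovršni in nedovršni glagol", "pridevnik", "prislov", "zaimek",
    "števnik",
))

# [^>] and [^<] keep ime from stretching across tags when the engine backtracks.
RE_IME = ENTRY_START + r'<a href[^>]*>(?P<ime>[^<]+)</a>'
# The gap sits INSIDE the optional group, so the engine searches ahead for
# Oblika before it is allowed to give up and match nothing.
RE_OBLIKA = rf'(?:{GAP}title="Oblika" data-group="header">(?P<oblika>[^<]*)</span>)?'
RE_VRSTA = GAP + (
    r'<span data-group="header qualifier">'
    r'<span class="color_lightdark font_small" data-toggle="tooltip" '
    rf'data-placement="top" title="(?P<vrsta>{VRSTE})"'
)
RE_NAGLAS = (
    rf'(?:{GAP}title="Tonemski naglas" data-group="header">'
    r'(?P<tonemski_naglas>[^<]*)</span>)?'
)

VZOREC = re.compile(RE_IME + RE_OBLIKA + RE_VRSTA + RE_NAGLAS, re.DOTALL)


def regexanje(tekst):
    """Return one dict per matching entry; missing optional parts are None."""
    return [m.groupdict() for m in VZOREC.finditer(tekst)]
```

With this shape, an entry without the Oblika part should still match, with `oblika` set to `None`, while an entry whose word class is not in the list is skipped instead of borrowing fields from the following entry.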

I also tried

re_ime = r'<span class="font_xlarge"><a href.*?>(?P<ime>.+?)</a'
re_oblika = r'>(?:.*?title="Oblika" data-group="header">(?P<oblika>.+?)</span>.*?)?<'
re_vrsta = r'span data-group="header qualifier"><span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="(?P<vrsta>samostalnik ženskega spola|samostalnik moškega spola|samostalnik srednjega spola|medemet|predlog|predpona|členek|dovršni glagol|nedovršni glagol|dovršni in nedovršni glagol|pridevnik|prislov|zaimek|števnik)"'

and it matches only when this part of the text is present.

Using {0,1} at the end of a group had the same effect as using ?.

I also tried using (?>...) (I don't know whether it would work), but it raised an error: re.error: unknown extension ?> at position 57

I should also add that the part of the text in question can repeat multiple times within one entry, and I want to capture every occurrence, but I haven't started working on that yet.
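For the repeated parts, a single named group only keeps its last match, so one possible approach is to work in two passes: split the text into per-entry chunks first, then run `findall` inside each chunk. The sketch below is only an illustration; the function name `oblike_po_geslih` is made up, and it assumes that each entry starts with `<span class="font_xlarge">` directly followed by the `<a href ...>` link, as in the examples.

```python
import re

# Assumption: every dictionary entry starts with this tag.
ENTRY_START = '<span class="font_xlarge">'
RE_IME = re.compile(r'<a href[^>]*>(?P<ime>[^<]+)</a>')
RE_OBLIKA = re.compile(r'title="Oblika" data-group="header">([^<]*)</span>')


def oblike_po_geslih(tekst):
    """Return one dict per entry, listing every Oblika value found in it."""
    rezultati = []
    # The first piece is whatever precedes the first entry, so skip it.
    for kos in tekst.split(ENTRY_START)[1:]:
        ime = RE_IME.match(kos)
        if ime is None:
            continue  # chunk does not look like an entry
        rezultati.append({
            "ime": ime.group("ime"),
            # findall returns all occurrences, or an empty list if none exist
            "oblike": RE_OBLIKA.findall(kos),
        })
    return rezultati
```

Because each chunk holds exactly one entry, an optional part that is missing simply yields an empty list, and nothing can leak in from a neighbouring entry.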

Here is an example of a text that shouldn't match:

<span class="font_xlarge"><a href="/133/sskj2-slovar-slovenskega-knjiznega-jezika-2/4457252/a?View=1&amp;Query=*&amp;All=*&amp;FilteredDictionaryIds=133">a</a></span><span class="color_lightdark font_xsmal sup">5</span> <span data-group="header"><span data-group="header qualifier"><span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="veznik" data-group="header qualifier">vez.</span></span>, <span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="knjižno" data-group="qualifier header ">knjiž. </span>

Here is an example that should:

    <span class="font_xlarge"><a href="/133/sskj2-slovar-slovenskega-knjiznega-jezika-2/4457256/abalienacija?View=1&amp;Query=*&amp;All=*&amp;FilteredDictionaryIds=133">abalienácija</a></span> <span data-group="header"><span class="color_lightdark" data-toggle="tooltip" data-placement="top" title="Oblika" data-group="header">-e </span><span data-group="header qualifier"><span class="color_lightdark font_small" data-toggle="tooltip" data-placement="top" title="samostalnik ženskega spola">ž</span></span> <span class="color_lightdark">(</span><span class="color_lightdark" data-toggle="tooltip" data-placement="top" title="Tonemski naglas" data-group="header">á</span><span class="color_lightdark">) </span></span><br /><span class="color_dark italic" data-toggle="tooltip" data-placement="top" title="Sopomenka" data-group="explanation"><a class="reference" href="/133/sskj2-slovar-slovenskega-knjiznega-jezika-2/4458099/alienacija?View=1&amp;Query=*&amp;All=*&amp;FilteredDictionaryIds=133" target="_blank">alienacija</a></span>
  • Just generally, regex isn't a great choice for parsing HTML. There are many excellent HTML-parsing libraries for Python that would give you a performant and accurate result, with code that's a lot more readable. Your problem here is that you start capturing mid-element; consider using a lookbehind and lookahead if you insist on using regex.
    – Grismar
    Commented Jul 8 at 22:53
  • You can use beautifulsoup4. Commented Jul 9 at 0:50
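
To make the parser suggestion concrete, here is a minimal sketch that uses only the standard library's `html.parser` rather than beautifulsoup4, so it runs without installing anything. The class name `ZbiralecGesel` and its attributes are invented for this illustration. It assumes that an entry starts at a span whose class is exactly `font_xlarge`, and that the interesting text sits directly inside the span carrying the `title` attribute. It collects only the name and the two optional parts; the word class could be picked up the same way by checking `title` against the list of allowed values.

```python
from html.parser import HTMLParser


class ZbiralecGesel(HTMLParser):
    """Collects the name and the optional parts of each dictionary entry."""

    # Maps the title attribute of a span to the key it fills.
    ISKANI = {"Oblika": "oblika", "Tonemski naglas": "tonemski_naglas"}

    def __init__(self):
        super().__init__()
        self.gesla = []          # one dict per entry
        self._polje = None       # key currently being filled, if any
        self._caka_ime = False   # True until the entry's <a> link was read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "font_xlarge":
            # A new entry begins; optional parts start out as empty lists.
            self.gesla.append({"ime": "", "oblika": [], "tonemski_naglas": []})
            self._caka_ime = True
        elif tag == "a" and self._caka_ime:
            self._polje = "ime"
        elif tag == "span" and self.gesla and attrs.get("title") in self.ISKANI:
            self._polje = self.ISKANI[attrs["title"]]

    def handle_endtag(self, tag):
        if tag == "a" and self._polje == "ime":
            self._caka_ime = False
        # Any closing tag ends the current capture.
        self._polje = None

    def handle_data(self, data):
        if self._polje == "ime":
            self.gesla[-1]["ime"] += data
        elif self._polje is not None:
            self.gesla[-1][self._polje].append(data)


def razcleni(tekst):
    """Parse the HTML text and return the collected entries."""
    zbiralec = ZbiralecGesel()
    zbiralec.feed(tekst)
    zbiralec.close()
    return zbiralec.gesla
```

Since the parser tracks which element it is in, a missing optional part just leaves an empty list, and repeated parts are appended in document order.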
