I know this question has been answered, but my use case is slightly different. I am trying to set up a regex pattern to split a few strings into a list.

Input Strings:

1. "ABC-QWERT01"
2. "ABC-QWERT01DV"
3. "ABCQWER01"

Criteria for the string:

ABC - QWERT 01 DV
 1  2   3   4  5

  1. The string always starts with three characters.
  2. The dash is optional.
  3. Then come 3-10 characters.
  4. Then a number from 0 to 99, left-padded with zeros to two digits.
  5. The suffix is two characters and is optional.

Expected Output

1. ['ABC','-','QWERT','01']
2. ['ABC','-','QWERT','01','DV']
3. ['ABC','QWER','01']

I have tried the following patterns in several variations, but I am missing something. My idea was to start at the beginning of the string, split after the first three characters or the dash, and then split on the occurrence of two digits.

Pattern 1: r"([ -?, \d{2}])+" This works, but it doesn't split off the first three characters when the dash is missing.

Pattern 2: r"([^[a-z]{3}, -?, \d{2}])+" This fails to match, so nothing gets split.

Pattern 3: r"([^[a-z]{3}|-?, \d{2}])+" This also fails to match, so nothing gets split.

Any tips or suggestions?

  • I replaced your Python 2.7 tag with the unversioned Python tag, which should be present anyway. I assume the 2.7 was simply a glitch. If not, note that Python 2.7 has been unmaintained since 2020. There are not even security updates any more.
    – Friedrich
    Commented Jun 25 at 16:55
  • unfortunately, I'm still using python 2.7 :( Jython actually. pray for me
    – Drewdin
    Commented Jun 25 at 18:51
  • Since there's nothing specific to Python 2.7 in your question, let's all pretend we don't know and carry on as if nothing happened ...
    – Friedrich
    Commented Jun 25 at 18:55
  • @Drewdin Re your last comment on your deleted question: check the documentation for like; while it's not regex, there are other options, including restricting to only numeric characters or to single characters. Give that a go, and if you can't get it to work, post a new question (after searching, because there are a lot of existing questions on this sort of thing).
    – Dale K
    Commented Jul 3 at 3:26

1 Answer

You can use a pattern similar to:

(?i)([A-Z]{3})(-?)([A-Z]*)([0-9]{2})([A-Z]*)

Code:

import re


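def _parts(s):
    """Return a list of 5-tuples: prefix, dash, letters, digits, suffix."""
    # Optional parts that are absent (dash, suffix) come back as empty strings.
    p = r'(?i)([A-Z]{3})(-?)([A-Z]*)([0-9]{2})([A-Z]*)'
    return re.findall(p, s)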


print(_parts('ABC-QWERT01DV'))
print(_parts('ABCQWER01'))
print(_parts('ABC-QWERT01'))

Prints

[('ABC', '-', 'QWERT', '01', 'DV')]
[('ABC', '', 'QWER', '01', '')]
[('ABC', '-', 'QWERT', '01', '')]
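
The expected output in the question is a flat list without empty entries, whereas re.findall() returns a list of tuples that contain empty strings for the missing dash or suffix. A minimal sketch of one way to bridge that gap, assuming only the first match per string matters; the helper name parts_list is hypothetical and not part of the original answer:

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
PATTERN = re.compile(r'(?i)([A-Z]{3})(-?)([A-Z]*)([0-9]{2})([A-Z]*)')


def parts_list(s):
    """Return the non-empty groups of the first match as a flat list."""
    m = PATTERN.search(s)
    if m is None:
        return []  # no match at all
    # Empty strings mean the optional dash or suffix was absent; drop them.
    return [g for g in m.groups() if g]


print(parts_list('ABC-QWERT01DV'))
print(parts_list('ABCQWER01'))
print(parts_list('ABC-QWERT01'))
```

This avoids f-strings and other Python 3 only syntax, so it should also run on Python 2.7.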

Notes:

  • (?i): case-insensitive flag, so [A-Z] also matches lowercase letters.
  • ([A-Z]{3}): capture group 1 with any 3 letters.
  • (-?): capture group 2 with an optional dash.
  • ([A-Z]*): capture group 3 with 0 or more letters.
  • ([0-9]{2}): capture group 4 with 2 digits.
  • ([A-Z]*): capture group 5 with 0 or more letters.
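
The pattern above is deliberately loose: it accepts any number of letters in groups 3 and 5. If the criteria from the question should be enforced, a stricter variant might look like the sketch below. It assumes that "chars" in the question means letters, that the middle part is 3-10 letters long, and that the whole string must match; the names STRICT and strict_parts are hypothetical.

```python
import re

# Assumed reading of the criteria: 3 letters, optional dash, 3-10 letters,
# exactly 2 digits, optional 2-letter suffix, and nothing else after that.
# \Z anchors at the very end; re.fullmatch() is not available on Python 2.7.
STRICT = re.compile(
    r'([A-Z]{3})(-?)([A-Z]{3,10})([0-9]{2})([A-Z]{2})?\Z',
    re.IGNORECASE,
)


def strict_parts(s):
    """Return the parts as a flat list, or None if s breaks the criteria."""
    m = STRICT.match(s)  # match() already anchors at the start of s
    if m is None:
        return None
    # A missing dash is '' and a missing suffix is None; drop both.
    return [g for g in m.groups() if g]


print(strict_parts('ABC-QWERT01DV'))
print(strict_parts('ABCQWER01'))
print(strict_parts('ABC-QW01'))  # middle part too short under this reading
```

Returning None for invalid input is a design choice made here for illustration; raising an exception would work just as well.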
  • Thank you, that worked. If possible, could you update your post to help me understand how it works? Worst case, I can google the parameters. I appreciate it!
    – Drewdin
    Commented Jun 25 at 16:30
  • second question, do you prefer findall over split?
    – Drewdin
    Commented Jun 25 at 16:32
  • @Drewdin Updated. You can also change the groups to suit your needs. re.findall() is fine and easy to use here, but you can also use re.split().
    Commented Jun 25 at 16:33
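
On the findall-versus-split question, a rough sketch of what a re.split() approach could look like. The difficulty is that there is no separator between the three-character prefix and the next part when the dash is missing, and re.split() only splits on empty matches from Python 3.7 onward. The sketch therefore slices the prefix off by position first, which is an assumption based on criterion 1 of the question; the helper name split_parts is hypothetical.

```python
import re


def split_parts(s):
    """Slice off the 3-char prefix, then split the rest on a dash or 2 digits."""
    head, rest = s[:3], s[3:]
    # The capture group makes re.split() keep the separators in the result.
    pieces = re.split(r'(-|[0-9]{2})', rest)
    # re.split() yields '' at the edges when a separator starts or ends the text.
    return [head] + [p for p in pieces if p]


# Splitting the whole string leaves the prefix glued to the next part
# when the dash is missing:
print(re.split(r'(-|[0-9]{2})', 'ABCQWER01'))

print(split_parts('ABCQWER01'))
print(split_parts('ABC-QWERT01DV'))
```

Unlike the grouped pattern, this does not validate the input, so findall() or match() is probably the safer choice when the strings may be malformed.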