How can I group results where string has the same 6 characters from a total of 7 in MySQL/MariaDB?

Question

I have a table that consists of vehicle plates, where all the plates have always 7 characters. Sometimes a plate was miswritten and so my SELECT query must be able to consider that "AAU1234" and "AAV1234" must be the same vehicle. I don't want to create rules between specific characters, like "U" and "V" or "I" and "1", but a rule able to group plates where 6 of the 7 characters are equal and in the same position.

eg.:

In this case, ids 1, 2 and 5 should appear only once.

It would be great if, when grouped, all the grouped plates were informed, concatened in another column.

More information: This is a big table and currently identical plates are grouped when inside the same date-time range of 15 minutes.

My query looks like this:

select plate, floor(unix_timestamp(date)/(15 * 60)) as timekey 
from table 
group by plate, timekey 
order by date desc

Following the image example above, my goal is to group ids 1, 2 and 5 in the same row, since id 1 and id 2 matches 6 of 7 strings and so id 1 and id 5.

The result could be something like:

Or:

The final date information is not important here, the most important thing here is to group the similar plates.

Update1 Questions and answers on "staging ground":

RafaelZG
This table receives data from a plate recognition camera software and I have a front end that shows all the entries, grouping identical plates in the same time range, because a vehicle can have 1, 2 or 10 photos. The problem is that sometimes the software can't read a characater correctly and it generates separate entries in my front end, so my goal is to reduce this problem as much as possible, grouping these plates where only 1 character was wrongly registered. This rule doesn't need to solve 100% of the wrong entries.

Shadow
Row with id=5 differs by only one character from id=1, but differs by 2 characters from row with id=2. So, these rows would only be grouped together if the starting row is id=1. How do you determine which row is the starting one from grouping perspective? Btw, you should look at levenshtein distance, but this question is a way more complicated than you think and mysql may not be the best tool to accomplish your task!

RafaelZG
@shadow Great point! Things start to get even more complicated, but since ids 1 and 2 are grouped, and id 5 has 6 char matches with id 1, it should be also added to the group. The limiting point should be a date range, so the query would need run the entire table infinitely.

RafaelZG
I'll try to answer again more objectively: if id 1 and id 2 are grouped and id 5 is 6-char similar to id 1, then all the 3 should be grouped. eg: ABC1234, ABG1234, ABO1234, ABOI234, ABOI284 should be all grouped, since each of them is 6-char-similar to at least another one.

You're not using any aggregation functions, it makes no sense to use GROUP BY. — Barmar, Commented Jul 9 at 21:08
Please tag the question only with the brand of database you are really using. MySQL and MariaDB are different databases, increasingly incompatible. A solution for one may not be applicable to the other. Also be specific about which version of the respective database you use, because a solution may depend on features in some given version. — Bill Karwin, Commented Jul 9 at 21:40
no, yousearch for such random letters, as you have fixed rules like V TO U you can write all rules in a nested REPLACE steament and group by itlike GROUP BY REPLACE (REPLACE(plate,'V','U'),'1', 'I') and so on — nbk, Commented Jul 9 at 21:48
I challenged even the staging ground that your question is unclear as it does not specify how to handle cases when you have 3 plates, where 1st and 2nd differ by 1 character, 2nd and 3rd differ by 1 character, but 1st and 3rd differ by 1 character. Should these 3 be placed into one group or multiple groups? If you check ValNik 's answer below, that tries to address the issue, but you can end up significantly different plates grouped together. You have to specify the business rules to handle chains if 1 character differences. — Shadow, Commented Jul 9 at 23:45
I also highlighted that sql may not be the best tool for such comparison, particularly not if you have a large dataset as you have to start with a cartesian product. — Shadow, Commented Jul 9 at 23:48

ValNik · Accepted Answer · 2024-07-10 06:04:23Z

See example with test data. The data is somewhat expanded compared to the examples in the question.

create table test (id int, plate varchar(7));
insert into test values
 (1,'AAU1234')
,(2,'AAV1234')
,(3,'BKP5678')
,(4,'CMD9081')
,(5,'A4U1234')
,(6,'ABC1234')
,(7,'ABG1234')
,(8,'ABO1234')
,(9,'ABOI234')
,(10,'ABOI284')
,(11,'ABGI234')
,(12,'ABGI284')
,(14,'CMD9031')
;

First, let's simple compare plate's and find the rows, where 6 characters match. In JOIN condition and (t1.s1=t2.s1 or t1.s2=t2.s2) - at least 1 must match.
Wait for matchn=6

with recursive
tn as(
  select *,substring(plate,1,1) s1,substring(plate,2,1) s2,substring(plate,3,1) s3
    ,substring(plate,4,1) s4,substring(plate,5,1) s5,substring(plate,6,1) s6,substring(plate,7,1) s7
  from test
)
,cmp as(
select t1.id,t1.plate,t2.id id2,t2.plate plate2
select t1.id,t1.plate,t2.id id2,t2.plate plate2
  ,case when (t1.s1=t2.s1) then 1 else 0 end +case when (t1.s2=t2.s2) then 1 else 0 end
  +case when (t1.s3=t2.s3) then 1 else 0 end +case when (t1.s4=t2.s4) then 1 else 0 end
  +case when (t1.s5=t2.s5) then 1 else 0 end +case when (t1.s6=t2.s6) then 1 else 0 end
  +case when (t1.s7=t2.s7) then 1 else 0 end matchn
from tn t1
left join tn t2 on t1.id<>t2.id and (t1.s1=t2.s1 or t1.s2=t2.s2)
)

Output (part of result)

id	plate	id2	plate2	matchn
1	AAU1234	2	AAV1234	6
1	AAU1234	5	A4U1234	6
4	CMD9081	14	CMD9031	6
6	ABC1234	7	ABG1234	6
6	ABC1234	8	ABO1234	6
7	ABG1234	8	ABO1234	6
7	ABG1234	11	ABGI234	6
8	ABO1234	9	ABOI234	6
9	ABOI234	10	ABOI284	6
9	ABOI234	11	ABGI234	6
10	ABOI284	12	ABGI284	6
11	ABGI234	12	ABGI284	6

This table can be considered as a description of a graph.
The resulting graph is not oriented because we have set the condition (t1.id <> t2.id ).

       (6)ABC1234                       (1)AAU1234
        /       \                          /  \
      C->G      C->O                     A->4 U->V
      /           \                      /      \
(7)ABG1234--G->O--(8)ABO1234            /   (2)AAV1234   
                         \         (5)A4U1234
                         1->I
                           \
                      (9)ABOI234                  (4)CMD9081
                          /    \                          |
                        O->G   3->8                      8->3
                        /        \                        |
                       / (10)ABOI284             (14)CMD9031
                      /         \
                     /          O->G
                    /              \ 
             (11)ABGI234-3->8-(12)ABGI284

Then we recursively traverse the directed graph to find all nodes starting from the vertex of the graph.

,r as(
  select distinct 0 lvl,id,plate p0,plate,id id2,plate plate2
    ,cast('' as char(1000)) as path
  from cmp
  where  matchn=6 

  union all

  select lvl+1,r.id,p0,r.plate2,t.id2 id2,t.plate2
    ,concat(r.path,',',r.plate) as path
  from r inner join cmp t on t.plate=r.plate2 -- find_in_set(t.plate,r.list)>0
  where t.matchn=6 
    and  find_in_set(t.plate,r.path)=0 
    -- and lvl<9 -- for debug only
)
,d as(
  select distinct id,p0,id2,plate2
  from r
)
,chains as(
  select min(id2) groupid,id,p0 plate
     ,group_concat(id2 order by id2) ids
     ,group_concat(plate2 order by id2) plates
  from d
  group by id,p0
)

Chains output

groupid	id	plate	ids	plates
1	1	AAU1234	1,2,5	AAU1234,AAV1234,A4U1234
1	2	AAV1234	1,2,5	AAU1234,AAV1234,A4U1234
1	5	A4U1234	1,2,5	AAU1234,AAV1234,A4U1234
4	4	CMD9081	4,14	CMD9081,CMD9031
4	14	CMD9031	4,14	CMD9081,CMD9031
6	6	ABC1234	6,7,8,9,10,11,12	ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6	7	ABG1234	6,7,8,9,10,11,12	ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6	8	ABO1234	6,7,8,9,10,11,12	ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6	9	ABOI234	6,7,8,9,10,11,12	ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6	10	ABOI284	6,7,8,9,10,11,12	ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6	11	ABGI234	6,7,8,9,10,11,12	ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6	12	ABGI284	6,7,8,9,10,11,12	ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284

Now we can merge the rows or just assign a group to each row.

,newGr as(
select t.*,coalesce(c.groupid,t.id) groupid
  ,c.ids,c.plates 
from test t
left join chains c on t.id=c.id
)
--   select * from cmp order by id,id2;
--   select * from r order by id,lvl,id2;
--  select * from d order by id,id2;
--  select * from chains order by groupid;
--  select * from newGr order by id;
select groupid,min(id) id -- ,grPlate
  ,group_concat(id) ids
  ,group_concat(plate) plates
from newGr
group by groupid

Output

id	grPlate	ids	plates
1	AAU1234	1,2,5	AAU1234,AAV1234,A4U1234
6	ABC1234	6,7,8,9,10,11,12	ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
3	BKP5678	3	BKP5678
4	CMD9081	4,14	CMD9081,CMD9031

New groups for rows

id	plate	grPlate
1	AAU1234	AAU1234
2	AAV1234	AAU1234
3	BKP5678	BKP5678
4	CMD9081	CMD9081
5	A4U1234	AAU1234
6	ABC1234	ABC1234
7	ABG1234	ABC1234
8	ABO1234	ABC1234
9	ABOI234	ABC1234
10	ABOI284	ABC1234
11	ABGI234	ABC1234
12	ABGI284	ABC1234
14	CMD9031	CMD9081

Demo
and old Demo

This answer is a pretty good one, but some of the outputs are not in line with the question. The question says to group together plates that differ in ome character. Unfortunately, the OP failed to clarify what should happen if you have a chain of 1 character differences (challenged the OP on this in the staging ground, but the OP ignored it). So, you place ABC1234 and ABOI234 into the same group, although these differ by two characters. — Shadow, Commented Jul 9 at 23:36
The problem is that if you have a series of plates differing by 1 character from each other placed into one group, but plates at the beginning of the chain significantly differ from the plates at the end if the chain. — Shadow, Commented Jul 9 at 23:49
Right. I think this fits the OP's task. eg: ABC1234, ABG1234, ABO1234, ABOI234, ABOI284 should be all grouped — ValNik, Commented Jul 10 at 5:54
Well, your grouping is at odds with the OP's definition of plates with 1 character difference must be grouped together. Not to mention that if we keep adding other plates with 1 character difference, they will all be grouped together and that is definitely not what the OP wants. Remember, these are car licence plates and the OP wants to group very similar ones only grouped together. Just in case the recognition made a mistake. — Shadow, Commented Jul 10 at 6:10
Your reasoning is clear and well-founded. However, the OP explicitly indicates a grouping ABC1234 and ABOI284 with 3 difference in symbols, since thru difference chain in 1 character, they fall into the same group. Perhaps he considers it unlikely that two real plates differing by 1 character will appear in the data in a short time. — ValNik, Commented Jul 10 at 6:26

Barmar · Accepted Answer · 2024-07-09 21:13:25Z

0

You could use REGEXP_REPLACE() to replace an initial AAV with AAU when grouping.

select 
    REGEXP_REPLACE(plate, '^AAV', 'AAU') AS fixed_plate, 
    FROM_UNIXTIME(floor(unix_timestamp(date)/(15 * 60)), * 15 * 60) as timekey, 
    GROUP_CONCAT(id) AS concat_ids, 
    GROUP_CONCAT(plate) AS concat_plates
from table 
group by fixed_plate, timekey 
order by timekey desc

answered Jul 9 at 21:13

Barmar

770k54 gold badges529 silver badges641 bronze badges

Add a comment |

Rick James · Accepted Answer · 2024-07-10 14:45:22Z

Add an extra column to the table. Populate that extra column with a "canonical string" wherein 1->I, U->v, 5->S, etc. Example: 'AB1UV5' would store 'ABIVVS' in the extra column. Index this extra column.

When looking up a plate, do the same transformation to the test string. Note that this will sometimes get multiple hits.

I would do the transformation in client code. In SQL, it would take a few dozen nested REPLACE() function calls,something like:

REPLACE(
REPLACE(
...
REPLACE(str,
    '1', 'I'),
    'U', 'V'),
    ... ))

Bernd Buffen · Accepted Answer · 2024-07-10 15:10:56Z

Here is a simple way to create your evaluation using a temporary table. This is necessary to ensure that rows already included in an earlier grouping do not appear again in another group.

Here are the queries and an example:

Feel free to adjust as needed!

Querys:

CREATE TEMPORARY TABLE `tmp_ids` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `rec_id` INT DEFAULT NULL,
  `group_id` INT DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `ext_id` (`rec_id`)
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

INSERT IGNORE INTO tmp_ids(rec_id,group_id)
SELECT v2.id,v1.id -- v1.*, v2.* , 
FROM vehicle v1
LEFT JOIN vehicle v2 ON 
    v1.`date`+ INTERVAL + 15 MINUTE > v2.`date`
    AND 
        IF( MID(v1.plate,1,1) = MID(v2.plate,1,1), 0, 1) +
        IF( MID(v1.plate,2,1) = MID(v2.plate,2,1), 0, 1) +
        IF( MID(v1.plate,3,1) = MID(v2.plate,3,1), 0, 1) +
        IF( MID(v1.plate,4,1) = MID(v2.plate,4,1), 0, 1) +
        IF( MID(v1.plate,5,1) = MID(v2.plate,5,1), 0, 1) +
        IF( MID(v1.plate,6,1) = MID(v2.plate,6,1), 0, 1) +
        IF( MID(v1.plate,7,1) = MID(v2.plate,7,1), 0, 1) < 2;
        
        
SELECT    MIN(v.id) AS FirstID
        , GROUP_CONCAT(v.id ORDER BY v.id) AS IDs
        , GROUP_CONCAT(v.plate ORDER BY v.id) AS PLATEs
        , MIN(v.`date`) AS Start_date
FROM tmp_ids r
JOIN vehicle v ON v.id = r.rec_id
GROUP BY r.group_id;

DROP TABLE tmp_ids;

all in one

WITH tmp_ids AS (
    SELECT v2.id AS rec_id, v1.id AS group_id
    FROM vehicle v1
    LEFT JOIN vehicle v2 ON 
        v1.`date` + INTERVAL 15 MINUTE > v2.`date`
        AND 
            IF(MID(v1.plate,1,1) = MID(v2.plate,1,1), 0, 1) +
            IF(MID(v1.plate,2,1) = MID(v2.plate,2,1), 0, 1) +
            IF(MID(v1.plate,3,1) = MID(v2.plate,3,1), 0, 1) +
            IF(MID(v1.plate,4,1) = MID(v2.plate,4,1), 0, 1) +
            IF(MID(v1.plate,5,1) = MID(v2.plate,5,1), 0, 1) +
            IF(MID(v1.plate,6,1) = MID(v2.plate,6,1), 0, 1) +
            IF(MID(v1.plate,7,1) = MID(v2.plate,7,1), 0, 1) < 2
),
distinct_ids AS (
    SELECT rec_id, MIN(group_id) AS group_id
    FROM tmp_ids
    GROUP BY rec_id
)
SELECT
    MIN(v.id) AS FirstID,
    GROUP_CONCAT(v.id ORDER BY v.id) AS IDs,
    GROUP_CONCAT(v.plate ORDER BY v.id) AS PLATEs,
    MIN(v.`date`) AS Start_date
FROM distinct_ids r
JOIN vehicle v ON v.id = r.rec_id
GROUP BY r.group_id;

sample

mysql> SELECT * FROM vehicle;
+----+---------+---------------------+
| id | plate   | date                |
+----+---------+---------------------+
|  1 | AAU1234 | 2024-07-09 10:00:00 |
|  2 | AAV1234 | 2024-07-09 10:03:00 |
|  3 | BAV1234 | 2024-07-09 10:07:00 |
|  4 | AAV1245 | 2024-07-09 10:55:00 |
|  5 | AAV1234 | 2024-07-09 10:04:00 |
|  6 | AAV1234 | 2024-07-09 10:04:00 |
|  7 | AAV1234 | 2024-07-09 10:45:00 |
+----+---------+---------------------+
7 rows in set (0.00 sec)

mysql> 
mysql> 
mysql> CREATE TEMPORARY TABLE `tmp_ids` (
    ->   `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ->   `rec_id` INT DEFAULT NULL,
    ->   `group_id` INT DEFAULT NULL,
    ->   PRIMARY KEY (`id`),
    ->   UNIQUE KEY `ext_id` (`rec_id`)
    -> ) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
Query OK, 0 rows affected (0.01 sec)

mysql> 
mysql> INSERT IGNORE INTO tmp_ids(rec_id,group_id)
    -> SELECT v2.id,v1.id -- v1.*, v2.* , 
    -> FROM vehicle v1
    -> LEFT JOIN vehicle v2 ON 
    ->     v1.`date`+ INTERVAL + 15 MINUTE > v2.`date`
    ->     AND 
    ->         IF( MID(v1.plate,1,1) = MID(v2.plate,1,1), 0, 1) +
    ->         IF( MID(v1.plate,2,1) = MID(v2.plate,2,1), 0, 1) +
    ->         IF( MID(v1.plate,3,1) = MID(v2.plate,3,1), 0, 1) +
    ->         IF( MID(v1.plate,4,1) = MID(v2.plate,4,1), 0, 1) +
    ->         IF( MID(v1.plate,5,1) = MID(v2.plate,5,1), 0, 1) +
    ->         IF( MID(v1.plate,6,1) = MID(v2.plate,6,1), 0, 1) +
    ->         IF( MID(v1.plate,7,1) = MID(v2.plate,7,1), 0, 1) < 2;
Query OK, 7 rows affected, 23 warnings (0.01 sec)
Records: 30  Duplicates: 23  Warnings: 23

mysql>         
mysql>         
mysql> SELECT    MIN(v.id) AS FirstID
    ->         , GROUP_CONCAT(v.id ORDER BY v.id) AS IDs
    ->         , GROUP_CONCAT(v.plate ORDER BY v.id) AS PLATEs
    ->         , MIN(v.`date`) AS Start_date
    -> FROM tmp_ids r
    -> JOIN vehicle v ON v.id = r.rec_id
    -> GROUP BY r.group_id;
+---------+---------+---------------------------------+---------------------+
| FirstID | IDs     | PLATEs                          | Start_date          |
+---------+---------+---------------------------------+---------------------+
|       1 | 1,2,5,6 | AAU1234,AAV1234,AAV1234,AAV1234 | 2024-07-09 10:00:00 |
|       3 | 3       | BAV1234                         | 2024-07-09 10:07:00 |
|       4 | 4       | AAV1245                         | 2024-07-09 10:55:00 |
|       7 | 7       | AAV1234                         | 2024-07-09 10:45:00 |
+---------+---------+---------------------------------+---------------------+
4 rows in set (0.00 sec)

mysql> 
mysql> DROP TABLE tmp_ids;
Query OK, 0 rows affected (0.01 sec)

mysql>

Collectives™ on Stack Overflow

How can I group results where string has the same 6 characters from a total of 7 in MySQL/MariaDB?

4 Answers 4

Not the answer you're looking for? Browse other questions tagged
mysql
mariadb
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Not the answer you're looking for? Browse other questions tagged mysqlmariadb or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
mysql
mariadb
or ask your own question.