0

I have a table that consists of vehicle plates, where all the plates have always 7 characters. Sometimes a plate was miswritten and so my SELECT query must be able to consider that "AAU1234" and "AAV1234" must be the same vehicle. I don't want to create rules between specific characters, like "U" and "V" or "I" and "1", but a rule able to group plates where 6 of the 7 characters are equal and in the same position.

eg.: example of similar plates

In this case, ids 1, 2 and 5 should appear only once.

It would be great if, when grouped, all the grouped plates were informed, concatened in another column.

More information: This is a big table and currently identical plates are grouped when inside the same date-time range of 15 minutes.

My query looks like this:

select plate, floor(unix_timestamp(date)/(15 * 60)) as timekey 
from table 
group by plate, timekey 
order by date desc

Following the image example above, my goal is to group ids 1, 2 and 5 in the same row, since id 1 and id 2 matches 6 of 7 strings and so id 1 and id 5.

The result could be something like:

desirable result

Or:

desirable result

The final date information is not important here, the most important thing here is to group the similar plates.

Update1 Questions and answers on "staging ground":

RafaelZG
This table receives data from a plate recognition camera software and I have a front end that shows all the entries, grouping identical plates in the same time range, because a vehicle can have 1, 2 or 10 photos. The problem is that sometimes the software can't read a characater correctly and it generates separate entries in my front end, so my goal is to reduce this problem as much as possible, grouping these plates where only 1 character was wrongly registered. This rule doesn't need to solve 100% of the wrong entries.

Shadow
Row with id=5 differs by only one character from id=1, but differs by 2 characters from row with id=2. So, these rows would only be grouped together if the starting row is id=1. How do you determine which row is the starting one from grouping perspective? Btw, you should look at levenshtein distance, but this question is a way more complicated than you think and mysql may not be the best tool to accomplish your task!

RafaelZG
@shadow Great point! Things start to get even more complicated, but since ids 1 and 2 are grouped, and id 5 has 6 char matches with id 1, it should be also added to the group. The limiting point should be a date range, so the query would need run the entire table infinitely.

RafaelZG
I'll try to answer again more objectively: if id 1 and id 2 are grouped and id 5 is 6-char similar to id 1, then all the 3 should be grouped. eg: ABC1234, ABG1234, ABO1234, ABOI234, ABOI284 should be all grouped, since each of them is 6-char-similar to at least another one.

7
  • 1
    You're not using any aggregation functions, it makes no sense to use GROUP BY.
    – Barmar
    Commented Jul 9 at 21:08
  • 1
    Please tag the question only with the brand of database you are really using. MySQL and MariaDB are different databases, increasingly incompatible. A solution for one may not be applicable to the other. Also be specific about which version of the respective database you use, because a solution may depend on features in some given version. Commented Jul 9 at 21:40
  • no, yousearch for such random letters, as you have fixed rules like V TO U you can write all rules in a nested REPLACE steament and group by itlike GROUP BY REPLACE (REPLACE(plate,'V','U'),'1', 'I') and so on
    – nbk
    Commented Jul 9 at 21:48
  • 1
    I challenged even the staging ground that your question is unclear as it does not specify how to handle cases when you have 3 plates, where 1st and 2nd differ by 1 character, 2nd and 3rd differ by 1 character, but 1st and 3rd differ by 1 character. Should these 3 be placed into one group or multiple groups? If you check ValNik 's answer below, that tries to address the issue, but you can end up significantly different plates grouped together. You have to specify the business rules to handle chains if 1 character differences.
    – Shadow
    Commented Jul 9 at 23:45
  • I also highlighted that sql may not be the best tool for such comparison, particularly not if you have a large dataset as you have to start with a cartesian product.
    – Shadow
    Commented Jul 9 at 23:48

4 Answers 4

1

See example with test data. The data is somewhat expanded compared to the examples in the question.

create table test (id int, plate varchar(7));
insert into test values
 (1,'AAU1234')
,(2,'AAV1234')
,(3,'BKP5678')
,(4,'CMD9081')
,(5,'A4U1234')
,(6,'ABC1234')
,(7,'ABG1234')
,(8,'ABO1234')
,(9,'ABOI234')
,(10,'ABOI284')
,(11,'ABGI234')
,(12,'ABGI284')
,(14,'CMD9031')
;

First, let's simple compare plate's and find the rows, where 6 characters match. In JOIN condition and (t1.s1=t2.s1 or t1.s2=t2.s2) - at least 1 must match.
Wait for matchn=6

with recursive
tn as(
  select *,substring(plate,1,1) s1,substring(plate,2,1) s2,substring(plate,3,1) s3
    ,substring(plate,4,1) s4,substring(plate,5,1) s5,substring(plate,6,1) s6,substring(plate,7,1) s7
  from test
)
,cmp as(
select t1.id,t1.plate,t2.id id2,t2.plate plate2
select t1.id,t1.plate,t2.id id2,t2.plate plate2
  ,case when (t1.s1=t2.s1) then 1 else 0 end +case when (t1.s2=t2.s2) then 1 else 0 end
  +case when (t1.s3=t2.s3) then 1 else 0 end +case when (t1.s4=t2.s4) then 1 else 0 end
  +case when (t1.s5=t2.s5) then 1 else 0 end +case when (t1.s6=t2.s6) then 1 else 0 end
  +case when (t1.s7=t2.s7) then 1 else 0 end matchn
from tn t1
left join tn t2 on t1.id<>t2.id and (t1.s1=t2.s1 or t1.s2=t2.s2)
)

Output (part of result)

id plate id2 plate2 matchn
1 AAU1234 2 AAV1234 6
1 AAU1234 5 A4U1234 6
4 CMD9081 14 CMD9031 6
6 ABC1234 7 ABG1234 6
6 ABC1234 8 ABO1234 6
7 ABG1234 8 ABO1234 6
7 ABG1234 11 ABGI234 6
8 ABO1234 9 ABOI234 6
9 ABOI234 10 ABOI284 6
9 ABOI234 11 ABGI234 6
10 ABOI284 12 ABGI284 6
11 ABGI234 12 ABGI284 6

This table can be considered as a description of a graph.
The resulting graph is not oriented because we have set the condition (t1.id <> t2.id ).

       (6)ABC1234                       (1)AAU1234
        /       \                          /  \
      C->G      C->O                     A->4 U->V
      /           \                      /      \
(7)ABG1234--G->O--(8)ABO1234            /   (2)AAV1234   
                         \         (5)A4U1234
                         1->I
                           \
                      (9)ABOI234                  (4)CMD9081
                          /    \                          |
                        O->G   3->8                      8->3
                        /        \                        |
                       / (10)ABOI284             (14)CMD9031
                      /         \
                     /          O->G
                    /              \ 
             (11)ABGI234-3->8-(12)ABGI284

Then we recursively traverse the directed graph to find all nodes starting from the vertex of the graph.

,r as(
  select distinct 0 lvl,id,plate p0,plate,id id2,plate plate2
    ,cast('' as char(1000)) as path
  from cmp
  where  matchn=6 

  union all

  select lvl+1,r.id,p0,r.plate2,t.id2 id2,t.plate2
    ,concat(r.path,',',r.plate) as path
  from r inner join cmp t on t.plate=r.plate2 -- find_in_set(t.plate,r.list)>0
  where t.matchn=6 
    and  find_in_set(t.plate,r.path)=0 
    -- and lvl<9 -- for debug only
)
,d as(
  select distinct id,p0,id2,plate2
  from r
)
,chains as(
  select min(id2) groupid,id,p0 plate
     ,group_concat(id2 order by id2) ids
     ,group_concat(plate2 order by id2) plates
  from d
  group by id,p0
)

Chains output

groupid id plate ids plates
1 1 AAU1234 1,2,5 AAU1234,AAV1234,A4U1234
1 2 AAV1234 1,2,5 AAU1234,AAV1234,A4U1234
1 5 A4U1234 1,2,5 AAU1234,AAV1234,A4U1234
4 4 CMD9081 4,14 CMD9081,CMD9031
4 14 CMD9031 4,14 CMD9081,CMD9031
6 6 ABC1234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6 7 ABG1234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6 8 ABO1234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6 9 ABOI234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6 10 ABOI284 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6 11 ABGI234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
6 12 ABGI284 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284

Now we can merge the rows or just assign a group to each row.

,newGr as(
select t.*,coalesce(c.groupid,t.id) groupid
  ,c.ids,c.plates 
from test t
left join chains c on t.id=c.id
)
--   select * from cmp order by id,id2;
--   select * from r order by id,lvl,id2;
--  select * from d order by id,id2;
--  select * from chains order by groupid;
--  select * from newGr order by id;
select groupid,min(id) id -- ,grPlate
  ,group_concat(id) ids
  ,group_concat(plate) plates
from newGr
group by groupid

Output

id grPlate ids plates
1 AAU1234 1,2,5 AAU1234,AAV1234,A4U1234
6 ABC1234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
3 BKP5678 3 BKP5678
4 CMD9081 4,14 CMD9081,CMD9031

New groups for rows

id plate grPlate
1 AAU1234 AAU1234
2 AAV1234 AAU1234
3 BKP5678 BKP5678
4 CMD9081 CMD9081
5 A4U1234 AAU1234
6 ABC1234 ABC1234
7 ABG1234 ABC1234
8 ABO1234 ABC1234
9 ABOI234 ABC1234
10 ABOI284 ABC1234
11 ABGI234 ABC1234
12 ABGI284 ABC1234
14 CMD9031 CMD9081

Demo
and old Demo

9
  • This answer is a pretty good one, but some of the outputs are not in line with the question. The question says to group together plates that differ in ome character. Unfortunately, the OP failed to clarify what should happen if you have a chain of 1 character differences (challenged the OP on this in the staging ground, but the OP ignored it). So, you place ABC1234 and ABOI234 into the same group, although these differ by two characters.
    – Shadow
    Commented Jul 9 at 23:36
  • The problem is that if you have a series of plates differing by 1 character from each other placed into one group, but plates at the beginning of the chain significantly differ from the plates at the end if the chain.
    – Shadow
    Commented Jul 9 at 23:49
  • Right. I think this fits the OP's task. eg: ABC1234, ABG1234, ABO1234, ABOI234, ABOI284 should be all grouped
    – ValNik
    Commented Jul 10 at 5:54
  • Well, your grouping is at odds with the OP's definition of plates with 1 character difference must be grouped together. Not to mention that if we keep adding other plates with 1 character difference, they will all be grouped together and that is definitely not what the OP wants. Remember, these are car licence plates and the OP wants to group very similar ones only grouped together. Just in case the recognition made a mistake.
    – Shadow
    Commented Jul 10 at 6:10
  • Your reasoning is clear and well-founded. However, the OP explicitly indicates a grouping ABC1234 and ABOI284 with 3 difference in symbols, since thru difference chain in 1 character, they fall into the same group. Perhaps he considers it unlikely that two real plates differing by 1 character will appear in the data in a short time.
    – ValNik
    Commented Jul 10 at 6:26
0

You could use REGEXP_REPLACE() to replace an initial AAV with AAU when grouping.

select 
    REGEXP_REPLACE(plate, '^AAV', 'AAU') AS fixed_plate, 
    FROM_UNIXTIME(floor(unix_timestamp(date)/(15 * 60)), * 15 * 60) as timekey, 
    GROUP_CONCAT(id) AS concat_ids, 
    GROUP_CONCAT(plate) AS concat_plates
from table 
group by fixed_plate, timekey 
order by timekey desc
0

Add an extra column to the table. Populate that extra column with a "canonical string" wherein 1->I, U->v, 5->S, etc. Example: 'AB1UV5' would store 'ABIVVS' in the extra column. Index this extra column.

When looking up a plate, do the same transformation to the test string. Note that this will sometimes get multiple hits.

I would do the transformation in client code. In SQL, it would take a few dozen nested REPLACE() function calls,something like:

REPLACE(
REPLACE(
...
REPLACE(str,
    '1', 'I'),
    'U', 'V'),
    ... ))
0

Here is a simple way to create your evaluation using a temporary table. This is necessary to ensure that rows already included in an earlier grouping do not appear again in another group.

Here are the queries and an example:

Feel free to adjust as needed!

Querys:

CREATE TEMPORARY TABLE `tmp_ids` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `rec_id` INT DEFAULT NULL,
  `group_id` INT DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `ext_id` (`rec_id`)
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

INSERT IGNORE INTO tmp_ids(rec_id,group_id)
SELECT v2.id,v1.id -- v1.*, v2.* , 
FROM vehicle v1
LEFT JOIN vehicle v2 ON 
    v1.`date`+ INTERVAL + 15 MINUTE > v2.`date`
    AND 
        IF( MID(v1.plate,1,1) = MID(v2.plate,1,1), 0, 1) +
        IF( MID(v1.plate,2,1) = MID(v2.plate,2,1), 0, 1) +
        IF( MID(v1.plate,3,1) = MID(v2.plate,3,1), 0, 1) +
        IF( MID(v1.plate,4,1) = MID(v2.plate,4,1), 0, 1) +
        IF( MID(v1.plate,5,1) = MID(v2.plate,5,1), 0, 1) +
        IF( MID(v1.plate,6,1) = MID(v2.plate,6,1), 0, 1) +
        IF( MID(v1.plate,7,1) = MID(v2.plate,7,1), 0, 1) < 2;
        
        
SELECT    MIN(v.id) AS FirstID
        , GROUP_CONCAT(v.id ORDER BY v.id) AS IDs
        , GROUP_CONCAT(v.plate ORDER BY v.id) AS PLATEs
        , MIN(v.`date`) AS Start_date
FROM tmp_ids r
JOIN vehicle v ON v.id = r.rec_id
GROUP BY r.group_id;

DROP TABLE tmp_ids;

all in one

WITH tmp_ids AS (
    SELECT v2.id AS rec_id, v1.id AS group_id
    FROM vehicle v1
    LEFT JOIN vehicle v2 ON 
        v1.`date` + INTERVAL 15 MINUTE > v2.`date`
        AND 
            IF(MID(v1.plate,1,1) = MID(v2.plate,1,1), 0, 1) +
            IF(MID(v1.plate,2,1) = MID(v2.plate,2,1), 0, 1) +
            IF(MID(v1.plate,3,1) = MID(v2.plate,3,1), 0, 1) +
            IF(MID(v1.plate,4,1) = MID(v2.plate,4,1), 0, 1) +
            IF(MID(v1.plate,5,1) = MID(v2.plate,5,1), 0, 1) +
            IF(MID(v1.plate,6,1) = MID(v2.plate,6,1), 0, 1) +
            IF(MID(v1.plate,7,1) = MID(v2.plate,7,1), 0, 1) < 2
),
distinct_ids AS (
    SELECT rec_id, MIN(group_id) AS group_id
    FROM tmp_ids
    GROUP BY rec_id
)
SELECT
    MIN(v.id) AS FirstID,
    GROUP_CONCAT(v.id ORDER BY v.id) AS IDs,
    GROUP_CONCAT(v.plate ORDER BY v.id) AS PLATEs,
    MIN(v.`date`) AS Start_date
FROM distinct_ids r
JOIN vehicle v ON v.id = r.rec_id
GROUP BY r.group_id;

sample

mysql> SELECT * FROM vehicle;
+----+---------+---------------------+
| id | plate   | date                |
+----+---------+---------------------+
|  1 | AAU1234 | 2024-07-09 10:00:00 |
|  2 | AAV1234 | 2024-07-09 10:03:00 |
|  3 | BAV1234 | 2024-07-09 10:07:00 |
|  4 | AAV1245 | 2024-07-09 10:55:00 |
|  5 | AAV1234 | 2024-07-09 10:04:00 |
|  6 | AAV1234 | 2024-07-09 10:04:00 |
|  7 | AAV1234 | 2024-07-09 10:45:00 |
+----+---------+---------------------+
7 rows in set (0.00 sec)

mysql> 
mysql> 
mysql> CREATE TEMPORARY TABLE `tmp_ids` (
    ->   `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ->   `rec_id` INT DEFAULT NULL,
    ->   `group_id` INT DEFAULT NULL,
    ->   PRIMARY KEY (`id`),
    ->   UNIQUE KEY `ext_id` (`rec_id`)
    -> ) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
Query OK, 0 rows affected (0.01 sec)

mysql> 
mysql> INSERT IGNORE INTO tmp_ids(rec_id,group_id)
    -> SELECT v2.id,v1.id -- v1.*, v2.* , 
    -> FROM vehicle v1
    -> LEFT JOIN vehicle v2 ON 
    ->     v1.`date`+ INTERVAL + 15 MINUTE > v2.`date`
    ->     AND 
    ->         IF( MID(v1.plate,1,1) = MID(v2.plate,1,1), 0, 1) +
    ->         IF( MID(v1.plate,2,1) = MID(v2.plate,2,1), 0, 1) +
    ->         IF( MID(v1.plate,3,1) = MID(v2.plate,3,1), 0, 1) +
    ->         IF( MID(v1.plate,4,1) = MID(v2.plate,4,1), 0, 1) +
    ->         IF( MID(v1.plate,5,1) = MID(v2.plate,5,1), 0, 1) +
    ->         IF( MID(v1.plate,6,1) = MID(v2.plate,6,1), 0, 1) +
    ->         IF( MID(v1.plate,7,1) = MID(v2.plate,7,1), 0, 1) < 2;
Query OK, 7 rows affected, 23 warnings (0.01 sec)
Records: 30  Duplicates: 23  Warnings: 23

mysql>         
mysql>         
mysql> SELECT    MIN(v.id) AS FirstID
    ->         , GROUP_CONCAT(v.id ORDER BY v.id) AS IDs
    ->         , GROUP_CONCAT(v.plate ORDER BY v.id) AS PLATEs
    ->         , MIN(v.`date`) AS Start_date
    -> FROM tmp_ids r
    -> JOIN vehicle v ON v.id = r.rec_id
    -> GROUP BY r.group_id;
+---------+---------+---------------------------------+---------------------+
| FirstID | IDs     | PLATEs                          | Start_date          |
+---------+---------+---------------------------------+---------------------+
|       1 | 1,2,5,6 | AAU1234,AAV1234,AAV1234,AAV1234 | 2024-07-09 10:00:00 |
|       3 | 3       | BAV1234                         | 2024-07-09 10:07:00 |
|       4 | 4       | AAV1245                         | 2024-07-09 10:55:00 |
|       7 | 7       | AAV1234                         | 2024-07-09 10:45:00 |
+---------+---------+---------------------------------+---------------------+
4 rows in set (0.00 sec)

mysql> 
mysql> DROP TABLE tmp_ids;
Query OK, 0 rows affected (0.01 sec)

mysql> 

Not the answer you're looking for? Browse other questions tagged or ask your own question.