BERT: how to get a quoted string as a single token

I trained a model based on BERT (bert-base-uncased) and TensorFlow to extract intents and slots from texts like this:

create a doc document named doc1

For this text, my model returns:

Intent: new_doc
Slots: {'document_name': 'doc1', 'type': 'doc'}

and for this text:

create a txt document named document1

returns:

Intent: new_doc
Slots: {'document_name': 'document1', 'type': 'txt'}

My problem is with texts where the document name spans several words, like this:

create a txt document named technical architecture for project alpha beta

which returns:

Intent: new_doc
Slots: {'type': 'txt'}

My idea was to put the document's title in quotes to help the tokenizer:

create a txt document named 'technical architecture for project alpha beta'

but the model returns:

Intent: new_doc
Slots: {'document_name': "technical project beta '", 'type': 'txt'}

As far as I understand, the tokenizer still splits the quoted text into separate tokens, quotes included, so the slot comes back with words missing and a stray quote character.
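To illustrate the splitting, here is a rough sketch of the step BERT's tokenizer applies before WordPiece: lowercase the text, split on whitespace, and isolate each punctuation character. The function name `rough_basic_tokenize` is made up for this example, and the regex is only an approximation for plain ASCII input. The real tokenizer may additionally break words into sub-word pieces.

```python
import re
from typing import List


def rough_basic_tokenize(text: str) -> List[str]:
    """Approximate the pre-WordPiece step of an uncased BERT tokenizer.

    Hypothetical helper: lowercases, then returns runs of word characters
    and each punctuation character as separate tokens.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


text = "create a txt document named 'technical architecture for project alpha beta'"
tokens = rough_basic_tokenize(text)
print(tokens)
```

Each quote ends up as its own token and the title is still six separate words, so the quotes alone do not group anything. The slot tagger still has to label every word of the title correctly.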

Is there any way to instruct BERT to treat everything between quotes as a single token?

Is there any other way to properly extract the document's title?
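One workaround that might be worth trying, sketched here under assumptions and not tested against the actual model: pull the quoted title out with a regular expression before the text reaches the model, replace it with a single-word placeholder, and put the real title back into the slots afterwards. The names `mask_quoted_title`, `restore_title`, and the placeholder `docname` are hypothetical. The sketch also assumes the model tags a single-word name as `document_name`, the way it does for `doc1`.

```python
import re
from typing import Dict, Optional, Tuple

# Matches text between a pair of matching single or double quotes.
# Limitation: a title containing an apostrophe inside single quotes
# (e.g. 'bob's notes') would be cut short at the apostrophe.
QUOTED = re.compile(r"""(['"])(.+?)\1""")

# Hypothetical single-word stand-in for the quoted title.
PLACEHOLDER = "docname"


def mask_quoted_title(text: str) -> Tuple[str, Optional[str]]:
    """Replace the first quoted span with PLACEHOLDER and return both."""
    match = QUOTED.search(text)
    if match is None:
        return text, None
    title = match.group(2)
    masked = text[:match.start()] + PLACEHOLDER + text[match.end():]
    return masked, title


def restore_title(slots: Dict[str, str], title: Optional[str]) -> Dict[str, str]:
    """Swap the placeholder in the predicted slots back to the real title."""
    if title is not None and slots.get("document_name") == PLACEHOLDER:
        return {**slots, "document_name": title}
    return slots


text = "create a txt document named 'technical architecture for project alpha beta'"
masked, title = mask_quoted_title(text)

# Stand-in for the model's prediction on `masked`; assumed, not real output.
predicted_slots = {"document_name": PLACEHOLDER, "type": "txt"}
final_slots = restore_title(predicted_slots, title)
print(masked)
print(final_slots)
```

This keeps the model and tokenizer unchanged, at the cost of requiring users to quote multi-word titles.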

edited Jul 8 at 17:07

asked Jul 8 at 14:53

Fab
