
I eventually managed to train a model, based on BERT (bert-base-uncased) and TensorFlow, to extract intents and slots for texts like this:

create a doc document named doc1

For this text, my model returns:

Intent: new_doc
Slots: {'document_name': 'doc1', 'type': 'doc'}

and for this text:

create a txt document named document1

it returns:

Intent: new_doc
Slots: {'document_name': 'document1', 'type': 'txt'}

My problem is with text like this:

create a txt document named technical architecture for project alpha beta

which returns:

Intent: new_doc
Slots: {'type': 'txt'}

My idea was to put the document's title in quotes to help the tokenizer:

create a txt document named 'technical architecture for project alpha beta'

but the model returns:

Intent: new_doc
Slots: {'document_name': "technical project beta '", 'type': 'txt'}

As far as I understand, the tokenizer still splits the quoted text into individual tokens, which breaks the slot extraction.
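
To illustrate the splitting, here is a rough plain-Python approximation of the whitespace-and-punctuation step that BERT applies before WordPiece. It is not the real tokenizer: `rough_pretokenize` is a hypothetical helper, and the actual bert-base-uncased tokenizer may additionally break words into sub-word pieces.

```python
import re


def rough_pretokenize(text):
    """Approximate BERT's basic pre-tokenization (assumption, not the real API).

    Lowercases the text, then emits each run of word characters and each
    punctuation character as its own token.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = rough_pretokenize(
    "create a txt document named 'technical architecture for project alpha beta'"
)
print(tokens)
```

Under this approximation the quotes become standalone tokens and the title is still six separate words, so the slot tagger has to label every one of them correctly.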

Is there any way to make BERT treat everything between quotes as a single token?

Is there any other way to properly extract the document's title?
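
One possible workaround, sketched here only as an assumption and not tested against my model, would be to pull the quoted title out with a regular expression before the text reaches BERT and replace it with a single placeholder word. The names `split_quoted_title` and `PLACEHOLDER` are hypothetical, and the placeholder `docname` is an arbitrary choice.

```python
import re

# Matches text in single or double quotes (no nested or escaped quotes).
QUOTED = re.compile(r"'([^']*)'|\"([^\"]*)\"")

# Hypothetical single-word stand-in for the title.
PLACEHOLDER = "docname"


def split_quoted_title(text):
    """Return (masked_text, title); title is None when nothing is quoted."""
    match = QUOTED.search(text)
    if match is None:
        return text, None
    # Exactly one of the two groups matched, depending on the quote style.
    title = match.group(1) if match.group(1) is not None else match.group(2)
    masked = text[:match.start()] + PLACEHOLDER + text[match.end():]
    return masked, title


masked, title = split_quoted_title(
    "create a txt document named 'technical architecture for project alpha beta'"
)
print(masked)
print(title)
```

The masked text would then go to the model, and if the predicted `document_name` slot equals the placeholder, it would be swapped back for the extracted title. Note that this simple pattern breaks on titles containing an apostrophe inside single quotes.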
