I eventually managed to train a model, based on BERT (bert-base-uncased) and TensorFlow, to extract intents and slots for texts like this:
create a doc document named doc1
For this text, my model returns:
Intent: new_doc
Slots: {'document_name': 'doc1', 'type': 'doc'}
and for this text:
create a txt document named document1
it returns:
Intent: new_doc
Slots: {'document_name': 'document1', 'type': 'txt'}
My problem is with text like this:
create a txt document named technical architecture for project alpha beta
which returns:
Intent: new_doc
Slots: {'type': 'txt'}
My idea was to put the document's title in quotes to help the tokenizer:
create a txt document named 'technical architecture for project alpha beta'
but the model returns:
Intent: new_doc
Slots: {'document_name': "technical project beta '", 'type': 'txt'}
As far as I understand, the tokenizer still splits the quoted text into separate tokens, so the quotes do not keep the title together and only some of its words end up in the slot.
Is there any way to instruct BERT to treat everything between quotes as a single token?
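To illustrate what I think is happening, here is a minimal sketch in plain Python. It is not my actual pipeline: basic_split is only a rough stand-in for the punctuation splitting that BERT's tokenizer does before WordPiece, and the per-token labels are hypothetical, picked by hand so that they reproduce the output above. The function names and the BIO label names are assumptions made for the example.

```python
import re


def basic_split(text):
    """Rough stand-in for BERT's pre-tokenization (not the real tokenizer):
    lowercase, split on whitespace, and isolate each punctuation character."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def collect_slots(tokens, labels):
    """Join the tokens that share a slot name, skipping tokens labeled 'O'."""
    slots = {}
    for token, label in zip(tokens, labels):
        if label == "O":
            continue
        name = label.split("-", 1)[1]  # "B-type" -> "type"
        slots.setdefault(name, []).append(token)
    return {name: " ".join(parts) for name, parts in slots.items()}


text = "create a txt document named 'technical architecture for project alpha beta'"
tokens = basic_split(text)

# Hypothetical labels (NOT real model output): only some of the title's
# tokens are tagged as document_name, and the closing quote is tagged too.
labels = [
    "O", "O", "B-type", "O", "O",             # create a txt document named
    "O",                                      # opening quote
    "B-document_name", "O", "O",              # technical architecture for
    "I-document_name", "O",                   # project alpha
    "I-document_name", "I-document_name",     # beta, closing quote
]

print(tokens)
print(collect_slots(tokens, labels))
```

In this sketch the quotes come out as ordinary tokens of their own, so they do not group anything; the title is lost because the tokens inside it receive inconsistent labels.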
Is there any other way to properly extract the document's title?
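One workaround I am considering, as a sketch only: pull the quoted title out with a regular expression before the text reaches the model, and pass a single placeholder word in its place. The helper name split_quoted_title and the placeholder TITLE are made up for this example, and I assume the model would need to see the same placeholder in its training data for this to help.

```python
import re

# Matches a single-quoted or double-quoted span (no nested or escaped quotes).
QUOTED = re.compile(r"'([^']*)'|\"([^\"]*)\"")


def split_quoted_title(text, placeholder="TITLE"):
    """Return (masked_text, title). The first quoted span is replaced by the
    placeholder; title is None when the text contains no quoted span."""
    match = QUOTED.search(text)
    if match is None:
        return text, None
    title = match.group(1) if match.group(1) is not None else match.group(2)
    masked = text[:match.start()] + placeholder + text[match.end():]
    return masked, title


masked, title = split_quoted_title(
    "create a txt document named 'technical architecture for project alpha beta'"
)
print(masked)
print(title)
```

The masked text would go to the model for intent and type, and the title would be taken from the regex match instead of from the slot. This breaks on titles that contain an apostrophe inside single quotes, so it is only a partial answer.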