Hi, all
I find I need to customize my F1 measure a lot when evaluating the performance of my fine-tuned NER model, but I don't see other people mentioning similar issues. I wonder whether I am doing something wrong, or whether people customize their F1 measures all the time and just don't mention it.
I can easily get an F1 measure from the default Hugging Face evaluation library, and add some common lenient measures with minimal effort.
For example, I have the labels <predator> and <food>, and my dataset is:
Mice eat cheese.
The Southeast Asian soldier fly eats nectar.
Owls eat mice.
With most off-the-shelf lenient measures, which ignore the exact text span, the partially recognised fly below would be counted twice as a positive:
The(B-predator) Southeast(I-predator) Asian soldier fly(B-predator) eats nectar.
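To make the double counting concrete, here is a minimal sketch in plain Python, not tied to any particular library. It assumes the gold annotation is "The Southeast Asian soldier fly" as <predator> and "nectar" as <food> (my guess at the intended labels), and all the helper names (bio_to_spans, naive_tp, one_to_one_tp) are made up for illustration. The one-to-one variant uses a simple greedy match, which is only one of several possible ways to avoid the double count.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect (start, end, label) spans from a BIO tag sequence."""
    spans: List[Span] = []
    start: int = 0
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current span
        if label is not None:
            spans.append((start, i, label))
            label = None
        if tag != "O":  # "B-x", or a stray "I-x", opens a new span
            start, label = i, tag[2:]
    return spans


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def naive_tp(gold: List[Span], pred: List[Span]) -> int:
    """Every predicted span that overlaps a same-label gold span is a hit."""
    return sum(any(p[2] == g[2] and overlaps(p, g) for g in gold) for p in pred)


def one_to_one_tp(gold: List[Span], pred: List[Span]) -> int:
    """Greedy variant: each gold span can be matched at most once."""
    used, tp = set(), 0
    for p in pred:
        for j, g in enumerate(gold):
            if j not in used and p[2] == g[2] and overlaps(p, g):
                used.add(j)
                tp += 1
                break
    return tp


def f1(tp: int, n_pred: int, n_gold: int) -> float:
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Tokens: The Southeast Asian soldier fly eats nectar.
# Gold tags are an assumption; predicted tags follow the example above.
gold_tags = ["B-predator", "I-predator", "I-predator", "I-predator",
             "I-predator", "O", "B-food"]
pred_tags = ["B-predator", "I-predator", "O", "O", "B-predator", "O", "O"]

gold = bio_to_spans(gold_tags)  # [(0, 5, 'predator'), (6, 7, 'food')]
pred = bio_to_spans(pred_tags)  # [(0, 2, 'predator'), (4, 5, 'predator')]

naive_f1 = f1(naive_tp(gold, pred), len(pred), len(gold))        # 1.0
strict_f1 = f1(one_to_one_tp(gold, pred), len(pred), len(gold))  # 0.5
print(naive_f1, strict_f1)
```

Under these assumptions the naive count gives a perfect F1 of 1.0 even though "nectar" was missed entirely, because the two fragments of the fly are both counted as hits. Matching each gold span at most once brings it down to 0.5.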
There are also types of lenient measures which ignore mismatched labels, as long as the text is highlighted.
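As a rough sketch of that label-agnostic variant: the snippet below compares exact span boundaries and optionally drops the label before comparing. The prediction for "Owls eat mice." is invented for illustration (it wrongly tags "Owls" as <food>), and exact_tp with its ignore_label flag is a hypothetical helper, not a function from any library.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)


def exact_tp(gold: List[Span], pred: List[Span], ignore_label: bool = False) -> int:
    """Count predicted spans whose boundaries (and optionally label) match gold."""
    if ignore_label:
        # Keep only the boundaries, so a mislabeled span still counts as a hit.
        return len({s[:2] for s in gold} & {s[:2] for s in pred})
    return len(set(gold) & set(pred))


# Tokens: Owls eat mice.
gold_spans = [(0, 1, "predator"), (2, 3, "food")]
pred_spans = [(0, 1, "food"), (2, 3, "food")]  # made-up prediction, "Owls" mislabeled

with_label = exact_tp(gold_spans, pred_spans)                        # 1
without_label = exact_tp(gold_spans, pred_spans, ignore_label=True)  # 2
print(with_label, without_label)
```

Here the mislabeled "Owls" only counts as a true positive when labels are ignored, so the two settings can give quite different F1 values on the same predictions.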
I ended up considering all the options and calculating four types of lenient measures, and I haven't even touched the per-label measurements yet. Is that too much?