[Discussion] How to ensure the quality of labeled data?
Hi MLers,
Usually, labeled data from Kaggle or from workshops comes with a known level of quality. In real-world use, though, how can we ensure the quality of labeled data? Beyond accuracy, the more crucial point is whether the labels are reliable enough to train models on.
To be more specific, I am working on a Named Entity Recognition project: extracting and labeling product names from formal documents that describe products and the relations between them. We currently have a rule-based model that extracts the named entities with regular expressions, and we would now like to move the task to a machine learning workflow.
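To make the setup concrete, here is a minimal sketch of the kind of regex-based labeling I mean. The pattern, the `PRODUCT` tag name, and the example sentence are all made up for illustration; our real rules are more involved.

```python
import re

# Hypothetical rule: a capitalized brand word followed by a model code such as "X-200".
PRODUCT_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z]+-\d+\b")

def label_tokens(text):
    """Return (token, tag) pairs, with BIO tags derived from the regex matches."""
    spans = [m.span() for m in PRODUCT_PATTERN.finditer(text)]
    labeled = []
    # Whitespace tokenization is a simplification for this sketch.
    for tok in re.finditer(r"\S+", text):
        tag = "O"
        for start, end in spans:
            if start <= tok.start() < end:
                # First token of a match gets B-, the rest get I-.
                tag = "B-PRODUCT" if tok.start() == start else "I-PRODUCT"
                break
        labeled.append((tok.group(), tag))
    return labeled

example = "The Acme X-200 connects to the Acme Z-9 hub."
labels = label_tokens(example)
for token, tag in labels:
    print(token, tag)
```

The output of something like `label_tokens` is what we would use as training data, so any product name the pattern misses ends up tagged `O` in the training set.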
So here's the question: if we directly use the data labeled by our rule-based model as training data for our ML model, how can we verify the quality of that training data before feeding it to the model?
This is my first time generating training data from scratch, so I would really appreciate any discussion. Any ideas and comments are welcome! Thanks a lot!