
What is data curation?

Data curation is the creation of a corpus (or dataset) for a specific use case: you gather data and make sure that it is consistent and relevant to your problem. Among other things, this corpus can be used to train and/or evaluate ML/AI models or pipelines.

A good corpus should contain varied examples of inputs and, if needed, their expected outputs.

What is data annotation?

Data annotation consists of adding notes to your dataset. Depending on the use case, these notes can take the form of labels, ratings, bounding boxes or even text, among others.

If you are doing annotations as part of an AI/ML project, you can think of these annotations as examples of the expected output of a model, given a specific input.
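As a rough illustration of that idea, an annotated record can be modelled as an input paired with its expected output. The sketch below is hypothetical: the class and field names (`AnnotatedExample`, `text`, `label`, `rating`) are assumptions made for this example and do not come from any particular annotation tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedExample:
    """One dataset record: a model input plus optional annotations (hypothetical schema)."""
    text: str                     # the input a model would receive
    label: Optional[str] = None   # categorical annotation, e.g. a sentiment class
    rating: Optional[int] = None  # ordinal annotation, e.g. a 1-5 quality score

    def is_annotated(self) -> bool:
        # A record counts as annotated once at least one annotation is present.
        return self.label is not None or self.rating is not None


examples = [
    AnnotatedExample(text="The delivery arrived two days late.", label="negative"),
    AnnotatedExample(text="Great support, solved my issue quickly.", label="positive", rating=5),
    AnnotatedExample(text="Where can I change my password?"),  # not yet annotated
]

# Only annotated records can serve as input/expected-output pairs.
annotated = [ex for ex in examples if ex.is_annotated()]
print(len(annotated))  # 2
```

Bounding boxes or free-text annotations would fit the same pattern as additional optional fields.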

To learn more about annotations you can check this page: 📚 Resources for data annotation.

What is the difference between data curation and data annotation?

Although data annotation is part of the data curation process, data curation is more than just annotating a dataset. It’s about deciding what should and shouldn’t be part of your corpus and how it should look. Some things that one might consider while curating a dataset are:

  • what types of examples should be included or excluded?
  • what metadata is needed (if any)?
  • should you follow any standard formatting for the data?
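The questions above can be turned into explicit, repeatable rules. The following is a minimal sketch, not a recommended pipeline: the record layout, the required metadata keys (`source`, `language`), the minimum length of 10 characters and the helper names `normalize` and `curate` are all assumptions chosen for illustration.

```python
REQUIRED_METADATA = {"source", "language"}  # assumed metadata keys for this example
MIN_LENGTH = 10                             # assumed minimum useful text length


def normalize(text: str) -> str:
    # Standard formatting: collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def curate(records: list) -> list:
    """Keep only the records that satisfy the (hypothetical) curation rules."""
    seen = set()
    curated = []
    for record in records:
        text = normalize(record.get("text", ""))
        metadata = record.get("metadata", {})
        if len(text) < MIN_LENGTH:                     # exclude: too short
            continue
        if not REQUIRED_METADATA.issubset(metadata):   # exclude: missing metadata
            continue
        if text.lower() in seen:                       # exclude: duplicate text
            continue
        seen.add(text.lower())
        curated.append({"text": text, "metadata": metadata})
    return curated


raw = [
    {"text": "  How do I   reset my password? ", "metadata": {"source": "forum", "language": "en"}},
    {"text": "How do I reset my password?", "metadata": {"source": "email", "language": "en"}},
    {"text": "ok", "metadata": {"source": "forum", "language": "en"}},
    {"text": "My invoice shows the wrong amount.", "metadata": {"source": "email"}},
]

curated = curate(raw)
print(len(curated))        # 1
print(curated[0]["text"])  # How do I reset my password?
```

Writing the rules as code keeps the inclusion criteria, metadata requirements and formatting decisions visible and easy to revise as the corpus grows.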