Graduation Project: Automatically Building Annotated Training-Data for Machine Learning

in Amsterdam, the Netherlands

Apply here


Currently, we are actively looking for graduate students and interns who are interested in paid and interesting research projects. An internship or graduation project will allow you to learn how data-science technology is applied in commercial and government organizations for mission-critical applications, while at the same time carrying out thorough scientific research.

The following internship/graduation project is currently available:


“Data is the new oil (or gold).” Recent progress in machine learning has shown exciting results in various text-mining applications such as Named Entity Recognition (NER) and entity linking, but also in sentiment, emotion, and cynicism detection. Both traditional algorithms such as Support Vector Machines (SVMs) and deep-learning models such as LSTMs perform at levels approaching or even exceeding human performance.

The bottleneck for the above problems is no longer the algorithms, but the lack of (sufficient) annotated training data, especially for less prevalent languages.

Manual creation of such data sets is too slow, too labor-intensive, and too expensive. In addition, human annotators are error-prone, especially for the tedious and dreary work of adding manual annotations. For a given task, human annotations can vary by up to 40%.

Recent research indicates that certain machine learning algorithms, such as SVMs, are very robust against incorrect training data. In experiments where up to 30% deliberately mislabeled examples were added to the training set, the classifiers were still able to reach results similar to those of classifiers trained on clean data, albeit converging somewhat more slowly.
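Such a label-noise experiment can be sketched as follows. This is a minimal illustration using scikit-learn's LinearSVC on synthetic data, not the project's actual classifier or corpus; the noise levels and data shape are made up for demonstration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for an annotated text-classification data set.
X, y = make_classification(n_samples=4000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
for noise in (0.0, 0.1, 0.3):
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise   # flip a fraction of the labels
    y_noisy[flip] = 1 - y_noisy[flip]
    clf = LinearSVC(dual=False).fit(X_tr, y_noisy)
    # Evaluation is always done on the clean test labels.
    print(f"noise={noise:.0%}  test accuracy={clf.score(X_te, y_te):.3f}")
```

On data like this, the accuracy drop between 0% and 30% label noise is typically modest, which mirrors the robustness claim above.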

Based on these findings, this project focuses on creating such annotated training-data sets automatically. Several approaches to do this already exist:

  • Bootstrap a basic classifier with a small manually annotated data set and use this classifier to create larger data sets.
  • Use links / annotations / structure from Wikipedia to create annotated data sets.
  • Use Machine Translation to translate an annotated data set from one language to another.
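The first approach can be sketched as a simple self-training loop: train on a small seed set, label the rest automatically, and keep only confident predictions. This is an illustrative example on synthetic data with a made-up confidence threshold; the real project would work on text features rather than numeric ones:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
labeled = np.zeros(len(y), dtype=bool)
labeled[:100] = True                       # small manually annotated seed set

clf = LogisticRegression(max_iter=1000)
for _ in range(5):                         # a few self-training rounds
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[~labeled])
    confident = proba.max(axis=1) > 0.9    # keep only confident predictions
    idx = np.flatnonzero(~labeled)[confident]
    if idx.size == 0:
        break
    y[idx] = clf.predict(X[idx])           # add machine-made annotations
    labeled[idx] = True

print(f"automatically annotated examples: {labeled.sum() - 100}")
```

scikit-learn also ships a ready-made `SelfTrainingClassifier` that wraps this loop; the manual version above just makes the mechanics explicit.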

Creating a reliable process, including support for the development of multilingual data sets, is the focus of this project.

Key challenges

The key challenges are to design and test such a process. What are the best methods to start with? What are the minimal bootstrapping data sets required? Can machine translation be used to translate data sets from one language to another? How do we prevent data sets from containing too much incorrect data? How can we measure when a data set is too polluted?

Research Questions

The following research questions have to be answered in this project:

  1. How robust are well-known machine learning algorithms such as SVMs, CNNs, or LSTMs to incorrect training data? Does it only slow down the training process, or does it also lower the maximum achievable performance?
  2. What are the different methods and processes for (semi-)automatic creation of data sets?
  3. How well do such systems perform (in terms of precision, recall, and F1 scores)?
  4. How can these methods be used to create data sets for other languages?
  5. What pre- and post-processing methods can be used to increase the performance of the system?
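For question 3, the standard evaluation metrics can be computed with scikit-learn's metrics module. The gold and predicted label vectors below are made up purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]    # gold annotations
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]    # system output

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)        # harmonic mean of precision and recall
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=0.75 recall=0.75 F1=0.75
```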

Other Expected / Desired Outcomes

At ZyLAB R&D, a prototype is first developed using a very basic approach to establish a baseline performance. Next, the objective is to use novel methods, such as advanced machine learning techniques (deep learning, better feature engineering, better document-representation methods, etc.), to create a better-performing system.

Development Environment

At ZyLAB, development is done in C#, in combination with HTML5 and the Angular 6+ framework.

Data Sets

ZyLAB has several data sets to train and validate the performance of such systems. More information on other projects can be found here:



Depending on your programming skills, ZyLAB will pay you an internship fee that is significantly higher than what most other companies would pay. We hear from our students that it is often two to three times higher than at other organizations.

In addition, we will reimburse your travel costs, and you can participate in all activities we organize for our employees.


Requirements

  • Mandatory internship from your university
  • BSc and MSc students at Dutch or Belgian universities
  • Fields related to data science, such as Artificial Intelligence, computer science, text mining, and data mining
  • Excellent programming skills in C#
  • Able to work in Amsterdam at least 1-2 days a week


If you are interested, please contact us at hrm@zylab.com, or leave your details on this page.