We are currently looking for graduate students and interns interested in paid, engaging research projects. An internship or graduation project with us will teach you how data science technology is applied to mission-critical applications in commercial and government organizations, while you also carry out thorough scientific research.
The following internship/graduation project is available:
“Data is the new oil (or gold)”. Recent progress in machine learning has shown exciting results in various text-mining applications such as Named Entity Recognition (NER) and entity linking, but also in sentiment-, emotion-, and cynicism detection. Both traditional algorithms such as Support Vector Machines (SVMs) and deep-learning models such as LSTMs perform at or above human level on many of these tasks.
The bottleneck for these tasks no longer lies in the algorithms, but in the lack of (sufficient) annotated training data, especially for less prevalent languages.
Manual creation of such data sets is too slow, too labor-intensive, and too expensive. In addition, human annotators are error-prone, especially in the tedious work of adding manual annotations; for some tasks, annotations vary by up to 40% between annotators.
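Annotator disagreement of this kind is commonly quantified with a chance-corrected agreement statistic such as Cohen's kappa. As a minimal sketch (the labels and annotations below are illustrative, not from a ZyLAB data set):

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(freq_a) | set(freq_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators tagging the same six tokens with NER labels.
a = ["PER", "ORG", "O", "O", "PER", "O"]
b = ["PER", "O",   "O", "O", "PER", "PER"]
print(round(cohens_kappa(a, b), 3))  # → 0.429
```

A kappa well below 1.0, as here, signals exactly the kind of annotation variance the project needs to cope with.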
Recent research indicates that certain machine learning algorithms, such as SVMs, are remarkably robust against incorrect training data. In experiments where up to 30% of the training set was deliberately mislabeled, the classifiers still reached results similar to classifiers trained on clean data, albeit converging somewhat more slowly.
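The effect can be reproduced on synthetic data. The sketch below uses a simple nearest-centroid classifier as a stand-in for an SVM (the data and the 30% flip rate are illustrative assumptions): randomly flipping labels shifts both class centroids toward each other, but leaves the decision boundary, and hence test accuracy, nearly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Two well-separated Gaussian blobs standing in for document features."""
    X = np.vstack([rng.normal(-2.0, 1.0, (n, 2)), rng.normal(2.0, 1.0, (n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

def train_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def accuracy(centroids, X, y):
    preds = [min(centroids, key=lambda c: np.linalg.norm(x - centroids[c])) for x in X]
    return float(np.mean(np.array(preds) == y))

X_train, y_train = make_data(500)
X_test, y_test = make_data(500)

# Flip 30% of the training labels to simulate noisy automatic annotation.
noisy = y_train.copy()
idx = rng.choice(len(noisy), size=int(0.3 * len(noisy)), replace=False)
noisy[idx] = 1 - noisy[idx]

clean_acc = accuracy(train_centroids(X_train, y_train), X_test, y_test)
noisy_acc = accuracy(train_centroids(X_train, noisy), X_test, y_test)
print(clean_acc, noisy_acc)  # nearly identical despite 30% label noise
```

The same comparison could be run with a real SVM (e.g. scikit-learn's `LinearSVC`) on actual annotated corpora.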
Based on these findings, this project focuses on creating such annotated training-data sets automatically. Various approaches to do this already exist:
Creating a reliable process, including support for the development of multilingual data sets, is the focus of this project.
Key challenges are to design and test such a process. What are the best methods to start with? What is the minimal bootstrapping data set required? Can machine translation be used to translate data sets from one language to another? How do we prevent data sets from containing too much incorrect data? How can we measure when a data set is too polluted? Etc.
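One common family of approaches to the bootstrapping question (not necessarily the one this project will adopt) is confidence-filtered self-training: train on a small hand-annotated seed set, auto-label an unlabeled pool, keep only high-confidence predictions, and repeat. The sketch below uses a toy one-dimensional threshold classifier; the data, the confidence measure, and the 0.8 threshold are all illustrative assumptions.

```python
import random

random.seed(0)

# Toy 1-D data: class 0 clusters near -1, class 1 near +1 (illustrative).
def sample(label, n):
    mu = -1.0 if label == 0 else 1.0
    return [(random.gauss(mu, 0.5), label) for _ in range(n)]

seed = sample(0, 5) + sample(1, 5)                       # tiny hand-annotated seed set
pool = [x for x, _ in sample(0, 200) + sample(1, 200)]   # large unlabeled pool

def fit(data):
    """Midpoint-threshold classifier: a stand-in for a real model."""
    m0 = sum(x for x, y in data if y == 0) / max(1, sum(1 for _, y in data if y == 0))
    m1 = sum(x for x, y in data if y == 1) / max(1, sum(1 for _, y in data if y == 1))
    return (m0 + m1) / 2

def predict(theta, x):
    return 1 if x > theta else 0

def confidence(theta, x):
    return abs(x - theta)  # distance to the boundary as a crude confidence score

labeled = list(seed)
for _ in range(3):  # a few bootstrap rounds
    theta = fit(labeled)
    confident = [(x, predict(theta, x)) for x in pool if confidence(theta, x) > 0.8]
    labeled = seed + confident  # keep only high-confidence auto-labels

theta = fit(labeled)
print(len(labeled), theta)
```

The confidence filter is what keeps the growing data set from becoming too polluted; how to choose that filter, and how to detect when it fails, is precisely the kind of question this project addresses.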
The following research questions have to be answered in this project:
Other Expected / Desired Outcomes
At ZyLAB R&D, a prototype is first developed using a very basic approach to set a baseline performance. Next, the objective is to use novel methods, such as advanced machine learning techniques (deep learning, better feature engineering, better document representation methods, etc.), to create a better-performing system.
At ZyLAB, development is done in C#, in combination with HTML5 based on the Angular 6+ framework.
ZyLAB has several data sets to train and validate the performance of such systems. More information on other projects can be found here:
Depending on your programming skills, ZyLAB will pay you an internship fee that is significantly higher than what most other companies pay. We hear from our students that it is often two to three times higher than at other organizations.
In addition, we will reimburse your travel costs, and you can participate in all activities we organize for our employees.
If you are interested, please contact us at email@example.com, or leave your details on this page.