Data Science Internship

Data Science Internship at Van Gogh Museum

The Van Gogh Museum makes the life and work of Vincent van Gogh and the art of his time accessible and reaches as many people as possible in order to enrich and inspire them.

The Van Gogh Museum is based in Amsterdam, the Netherlands, and uses the ZyLAB ONE platform to manage various collections of background documentation. These include, but are not limited to, background information on Van Gogh's paintings and drawings, documentation of special exhibitions, correspondence, personal collections from Van Gogh specialists, and newspaper clippings about the Van Gogh Museum and Van Gogh's work.

For whom

This internship is best for computer science students looking for an internship or MSc graduation project in the fields of text-mining, information retrieval, artificial intelligence, data science or machine learning.

Location

The internship location is the Van Gogh Museum in the center of Amsterdam.

Problem

This particular document collection consists of scanned documents and corresponding metadata. The metadata may be incomplete or inconsistent. The Van Gogh Museum is interested in applying advanced artificial intelligence, data science and data visualization techniques to verify the quality of the metadata, to identify anomalies, and to organize and clean up the archives. In addition, the museum is interested in integrating the content of the textual documents with other sources of information that reside in other content management systems. Finally, the museum wishes to understand how advanced information extraction techniques can provide new insights into the history of, and relations between, individuals, locations, organizations and works of art by Van Gogh and others.
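As an illustration of what a first metadata completeness check might look like, the sketch below counts how often each required field is missing or empty across a set of records. The field names (`title`, `date`, `author`) and the dictionary record format are hypothetical assumptions made for this example; they are not taken from ZyLAB ONE or from the museum's actual metadata schema.

```python
from collections import Counter

# Hypothetical required fields; the real metadata schema is not known here.
REQUIRED_FIELDS = ("title", "date", "author")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in one record."""
    return [f for f in required if not str(record.get(f) or "").strip()]


def completeness_report(records, required=REQUIRED_FIELDS):
    """Count, per required field, how many records lack a value for it."""
    counts = Counter()
    for record in records:
        counts.update(missing_fields(record, required))
    return {f: counts[f] for f in required}


if __name__ == "__main__":
    sample = [
        {"title": "Letter", "date": "", "author": "Theo"},
        {"title": "Clipping"},
    ]
    print(completeness_report(sample))
```

A per-field count like this is only a starting point; consistency checks (for example, conflicting dates for the same document) would need rules specific to the collection.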

All this requires a thorough understanding of the ZyLAB ONE software, as well as of modern artificial intelligence, data science, text-mining and data visualization techniques and algorithms.

Key challenges

The first key challenge in this project is that some of the material consists of low-quality document scans, so the text generated from them with Optical Character Recognition (OCR) tools is of lower quality. In addition, the documents are highly unstructured and vary widely in format.

Furthermore, personal names (especially those transliterated from non-Roman scripts) occur in many different spellings, and historical spelling variations add further variants.

The information extraction and machine learning methods used therefore have to be robust against such OCR errors, misspellings, transliteration and historical spelling variations.
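One simple way to picture this kind of robustness is approximate string matching on normalized names. The sketch below is a minimal example using only the Python standard library; the similarity threshold of 0.85, the greedy grouping strategy and the sample names are assumptions chosen for illustration, not a method prescribed by the museum or by ZyLAB.

```python
import difflib
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after normalization."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_variants(names, threshold=0.85):
    """Greedily group names whose similarity to a group's first member
    meets the (assumed) threshold."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


if __name__ == "__main__":
    # "Cogh" imitates a typical OCR confusion of G and C.
    print(group_variants(["Vincent van Gogh", "Vincent van Cogh", "Paul Gauguin"]))
```

Character-level similarity handles small OCR errors reasonably well, but transliteration and historical spelling variants often differ more strongly and may require phonetic or learned matching instead.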

Research Questions

The research questions for this project are:

  • Which information extraction and data visualization methods can be used to analyze the completeness and consistency of the information and metadata, and how?
  • Which information extraction and machine learning methods are useful for providing additional insights into the content of the documentation of the Van Gogh Museum?
  • How can such methods be made robust against OCR errors, as well as against misspellings, transliteration and spelling variations?
  • How can the quality of these methods be measured, both quantitatively and qualitatively?
  • What is the quality of the results of these methods?
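For the quantitative side of the measurement question, a common starting point is to compare extracted items against a manually annotated reference set and report precision, recall and F1. The sketch below assumes such a reference ("gold") set exists and that extracted entities can be compared as exact strings; both are assumptions for illustration, and the sample names are made up.

```python
def precision_recall_f1(predicted: set, gold: set):
    """Compare a set of extracted items with a manually annotated gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative data only.
    predicted = {"Theo van Gogh", "Paul Gauguin", "Arles", "Paris"}
    gold = {"Theo van Gogh", "Paul Gauguin", "Auvers-sur-Oise"}
    p, r, f1 = precision_recall_f1(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Exact string comparison penalizes spelling variants of a correct entity, so in practice the comparison step may itself need to be tolerant of the variations described under the key challenges.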

Other Expected / Desired Outcomes

The methods used have to interact with all the information (including all metadata) in the ZyLAB platform via the ZyLAB APIs, now and in the future.

More Information

More information on R&D projects which ultimately find their application in the ZyLAB ONE platform can be found here: https://textmining.nu

Contact

If you are interested, please contact us at hrm@zylab.com, or leave your details on this page.

