For those of you with an interest in the more academic development of syntactic and semantic tasks in language processing and information retrieval: this is an abstract of a previous LinkedIn blog by ZyLAB's Chief Strategy Officer, Johannes (Jan) Scholtes, based on his January 1993 thesis, Neural Networks for Natural Language Processing and Information Retrieval, and the 1995 book "Artificial Neural Networks for Information Retrieval in a Libraries Context".
Deep learning is a machine learning method based on learning data representations. Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation and bioinformatics, where they produce results comparable to, and in some cases superior to, those of human experts.
Recent research into a variety of machine learning problems related to eDiscovery at ZyLAB and the University of Maastricht shows that Deep Learning combined with Word2Vec consistently outperforms not only all other machine learning algorithms on written-language classification problems, but also humans.
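To make the combination concrete, here is a minimal, self-contained sketch of the general approach: represent each document as the average of its word vectors, then train a classifier on those vectors. The tiny corpus, the "responsive"/"non-responsive" labels, and the random stand-in embeddings are illustrative assumptions only (real Word2Vec embeddings would be trained on a large corpus, e.g. with gensim); this is not ZyLAB's actual pipeline.

```python
import math
import random

random.seed(0)

# Hypothetical eDiscovery training data: 1 = responsive, 0 = non-responsive.
docs = [
    ("contract breach payment dispute", 1),
    ("invoice payment overdue contract", 1),
    ("lunch schedule next week", 0),
    ("holiday party planning fun", 0),
]

# Stand-in for Word2Vec: a random dense vector per word.
# In practice these would be trained embeddings, not random numbers.
DIM = 8
vocab = {w for text, _ in docs for w in text.split()}
embedding = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in vocab}

def doc_vector(text):
    """Average the word vectors of a document (a common simple baseline)."""
    vecs = [embedding[w] for w in text.split() if w in embedding]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Logistic-regression classifier trained with plain gradient descent.
weights = [0.0] * DIM
bias = 0.0
lr = 0.5
for _ in range(200):
    for text, label in docs:
        x = doc_vector(text)
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
        err = p - label
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
        bias -= lr * err

def predict(text):
    """Return 1 (responsive) or 0 (non-responsive) for a document."""
    x = doc_vector(text)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

print(predict("contract payment dispute"))
```

A deep network would replace the single logistic layer with several nonlinear layers, but the embedding-then-classify shape of the pipeline is the same.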
Calculation times are still a bit problematic: without GPUs or other special hardware, training is roughly 200 times slower than with other machine learning algorithms. But the superior results, announcements of dedicated Deep Learning chips, and scientific research into more efficient algorithms for training Convolutional Neural Networks (CNNs) give us hope that this problem will be solved soon.
So, is the Deep Learning approach superior, and can we forget about everything else? Not quite yet: there are still a few conceptual problems related to CNNs that have not gone away. One problem already mentioned in Scholtes' 1993 thesis is still not solved:
“If one applies neural technology to an application, it either works, or it does not work. It is impossible to patch the model for certain exceptions, which can be done in a sequential algorithm. … it is hard to say which neuron is responsible for which action (the credit assignment problem).”
More research is needed to better understand how and why deep learning works. Such understanding is essential, especially for the acceptance of Artificial Intelligence taking on important tasks in our society.
But, as Scholtes states in his blog, he strongly believes that the current third revival of neural networks will bring us what we have been looking for since the early 1940s. There are many machine-learning problems in eDiscovery that we can probably solve better with Deep Learning; examples include Technology Assisted Review, Concept Search, Semantic Clustering, Privileged Detection, Automatic Redaction (also for GDPR purposes), Information Extraction for better internal investigations, and maybe even a few other applications we have not even thought of yet. These are (again) exciting times!
He closes his blog with one of his postulations from 1993, after a quote from Frederick Jelinek, one of the pioneers of data-driven machine learning in the late 1970s: "The only good data is more data". And that is exactly what we have.