When historians study contemporary notions of data in the early 21st century, 2018 might well be a landmark year. In many ways this was the year when Big and Important Issues – from the personal to the political – began to surface. The techlash, a term which has defined the year, arguably emerged from conversations and debates about the uses and abuses of data.
But while cynicism casts a shadow on the brightly lit data science landcape, there’s still a lot of optimism out there. And more importantly, data isn’t going to drop off the agenda any time soon.
However, the changing conversation in 2018 does mean that the way data scientists, analysts, and engineers use data and build solutions for it will change. A renewed emphasis on ethics and security is now appearing, which will likely shape 2019 trends.
But what will these trends be? Let’s take a look at some of the most important areas to keep an eye on in the new year.
One of the key themes of data science and artificial intelligence in 2019 will be doing more with less.
There are a number of ways in which this will manifest itself. The first is meta learning. This is a concept that aims to improve the way that machine learning systems actually work by running machine learning on machine learning systems. Essentially this allows a machine learning algorithm to learn how to learn. By doing this, you can better decide which algorithm is most appropriate for a given problem.
Find out how to put meta learning into practice. Learn with Hands On Meta Learning with Python.
Automated machine learning is closely aligned with meta learning. One way of understanding it is to see it as putting the concept of automating the application of meta learning. So, if meta learning can help better determine which machine learning algorithms should be applied and how they should be designed, automated machine learning makes that process a little smoother. It builds the decision making into the machine learning solution. Fundamentally, it’s all about “algorithm selection, hyper-parameter tuning, iterative modelling, and model assessment,” as Matthew Mayo explains on KDNuggets.
What’s particularly exciting about automated machine learning is that there are already a number of tools that make it relatively easy to do. AutoML is a set of tools developed by Google that can be used on the Google Cloud Platform, while auto-sklearn, built around the scikit-learn library, provides a similar out of the box solution for automated machine learning.
Although both AutoML and auto-sklearn are very new, there are newer tools available that could dominate the landscape: AutoKeras and AdaNet. AutoKeras is built on Keras (the Python neural network library), while AdaNet is built on TensorFlow. Both could be more affordable open source alternatives to AutoML.
Whichever automated machine learning library gains the most popularity will remain to be seen, but one thing is certain: it makes deep learning accessible to many organizations who previously wouldn’t have had the resources or inclination to hire a team of PhD computer scientists.
But it’s important to remember that automated machine learning certainly doesn’t mean automated data science. While tools like AutoML will help many organizations build deep learning models for basic tasks, for organizations that need a more developed data strategy, the role of the data scientist will remain vital. You can’t after all, automate away strategy and decision making.
Quantum computing, even as a concept, feels almost fantastical. It’s not just cutting-edge, it’s mind-bending. But in real-world terms it also continues the theme of doing more with less.
Explaining quantum computing can be tricky, but the fundamentals are this: instead of a binary system (the foundation of computing as we currently know it), which can be either 0 or 1, in a quantum system you have qubits, which can be 0, 1 or both simultaneously. (If you want to learn more, read this article).
So, what does this mean in practice? Essentially, because the qubits in a quantum system can be multiple things at the same time, you are then able to run much more complex computations. Think about the difference in scale: running a deep learning system on a binary system has clear limits. Yes, you can scale up in processing power, but you’re nevertheless constrained by the foundational fact of zeros and ones. In a quantum system where that restriction no longer exists, the scale of the computing power at your disposal increases astronomically.
Once you understand the fundamental proposition, it becomes much easier to see why the likes of IBM and Google are clamouring to develop and deploy quantum technology. One of the most talked about use cases is using Quantum computers to find even larger prime numbers (a move which contains risks given prime numbers are the basis for much modern encryption). But there other applications, such as in chemistry, where complex subatomic interactions are too detailed to be modelled by a traditional computer.
It’s important to note that Quantum computing is still very much in its infancy. While Google and IBM are leading the way, they are really only researching the area. It certainly hasn’t been deployed or applied in any significant or sustained way.
But this isn’t to say that it should be ignored. It’s going to have a huge impact on the future, and more importantly it’s plain interesting. Even if you don’t think you’ll be getting to grips with quantum systems at work for some time (a decade at best), understanding the principles and how it works in practice will not only give you a solid foundation for major changes in the future, it will also help you better understand some of the existing challenges in scientific computing. And, of course, it will also make you a decent conversationalist at dinner parties.
If you want to get started, Microsoft has put together the Quantum Development Kit, which includes the first quantum-specific programming language Q#. IBM, meanwhile, has developed its own Quantum experience, which allows engineers and researchers to run quantum computations in the IBM cloud.
As you investigate these tools you’ll probably get the sense that no one’s quite sure what to do with these technologies. And that’s fine – if anything it makes it the perfect time to get involved and help further research and thinking on the topic.
While Quantum lingers on the horizon, the concept of the edge has quietly planted itself at the very center of the IoT revolution. IoT might still be the term that business leaders and, indeed, wider society are talking about, for technologists and engineers, none of its advantages would be possible without the edge.
Edge computing or edge analytics is essentially about processing data at the edge of a network rather than within a centralized data warehouse. Again, as you can begin to see, the concept of the edge allows you to do more with less. More speed, less bandwidth (as devices no longer need to communicate with data centers), and, in theory, more data.
In the context of IoT, where just about every object in existence could be a source of data, moving processing and analytics to the edge can only be a good thing.
There’s a lot of conversation about whether edge will replace cloud. It won’t. But it probably will replace the cloud as the place where we run artificial intelligence. For example, instead of running powerful analytics models in a centralized space, you can run them at different points across the network. This will dramatically improve speed and performance, particularly for those applications that run on artificial intelligence.
Think of it this way: just as software has become more distributed in the last few years, thanks to the emergence of the edge, data itself is going to be more distributed. We’ll have billions of pockets of activity, whether from consumers or industrial machines, a locus of data-generation.
An emerging part of the edge computing and analytics trend is the concept of digital twins. This is, admittedly, still something in its infancy, but in 2019 it’s likely that you’ll be hearing a lot more about digital twins.
A digital twin is a digital replica of a device that engineers and software architects can monitor, model and test. For example, if you have a digital twin of a machine, you could run tests on it to better understand its points of failure. You could also investigate ways you could make the machine more efficient. More importantly, a digital twin can be used to help engineers manage the relationship between centralized cloud and systems at the edge – the digital twin is essentially a layer of abstraction that allows you to better understand what’s happening at the edge without needing to go into the detail of the system.
For those of us working in data science, digital twins provide better clarity and visibility on how disconnected aspects of a network interact. If we’re going to make 2019 the year we use data more intelligently – maybe even more humanely – then this is precisely the sort of thing we need.
Doing more with less might be one of the ongoing themes in data science and big data in 2019, but we can’t ignore the fact that ethics and security will remain firmly on the agenda. Although it’s easy to dismiss these issues as separate from the technical aspects of data mining, processing, and analytics, but it is, in fact, deeply integrated into it.
One of the key facets of ethics are two related concepts: explainability and interpretability.
The two terms are often used interchangeably, but there are some subtle differences. Explainability is the extent to which the inner-working of an algorithm can be explained in human terms, while interpretability is the extent to which one can understand the way in which it is working (eg. predict the outcome in a given situation). So, an algorithm can be interpretable, but you might not quite be able to explain why something is happening. (Think about this in the context of scientific research: sometimes, scientists know that a thing is definitely happening, but they can’t provide a clear explanation for why it is.)
Either way, interpretability and explainability are important because they can help to improve transparency in machine learning and deep learning algorithms. In a world where deep learning algorithms are being applied to problems in areas from medicine to justice – where the problem of accountability is particularly fraught – this transparency isn’t an option, it’s essential.
In practice, this means engineers must tweak the algorithm development process to make it easier for those outside the process to understand why certain things are happening and why they aren’t. To a certain extent, this ultimately requires the data science world to take the scientific method more seriously than it has done. Rather than just aiming for accuracy (which is itself often open to contestation), the aim is to constantly manage that gap between what we’re trying to achieve with an algorithm and how it goes about actually doing that.
So, there are two fundamental things for data science in 2019: improving efficiency, and improving transparency. Although the two concepts might look like the conflict with each other, it’s actually a bit of a false dichotomy. If we realised that 12 months ago, we might have avoided many of the issues that have come to light this year.
Transparency has to be a core consideration for anyone developing systems for analyzing and processing data. Without it, the work your doing might be flawed or unnecessary. You’re only going to need to add further iterations to rectify your mistakes or modify the impact of your biases.
With this in mind, now is the time to learn the lessons of 2018’s techlash. We need to commit to stopping the miserable conveyor belt of scandal and failure. Now is the time to find new ways to build better artificial intelligence systems.