As interest in big data has exploded, organizations have rushed to grab competitive advantage by deploying analytics pipelines that exploit this newly available resource.
Many projects have been set up in a “skunkworks” environment, often by data science teams. While this has accelerated the time to market for new features, it has created a potential security nightmare that organizations are gradually waking up to.
In this article, we consider five security challenges facing organizations that have, or are considering, big data deployments.
Data governance is about effectively managing the data in your organization. It involves considering issues like
Processes should be defined for managing data—and adherence to those processes, and their effectiveness, should be continuously monitored and evaluated.
A survey conducted by Rand Worldwide in 2013 showed that, while 82% of companies knew they faced external regulation, 44% had no formal data governance policy and 22% had no plans to implement one.
There is little evidence that the situation has improved three years on.
Even within companies that have policies, many of those policies were created in the context of a relational-database-centric world, i.e. a world with
This is not the world inhabited by big data, so policies need to be refreshed.
Erosion of privacy is one of the most politically charged consequences of applying big data tooling. Machine learning allows organizations to unearth information about individuals that isn’t apparent from the original data.
As more and more data comes online, it’s difficult to ensure that your data is not the missing piece that results in a privacy violation. For example, a data set of movie ratings released by Netflix turned out to contain identifiable information once the scores were correlated with those on IMDb.
When considering privacy issues, data sets can’t be looked at in isolation. Their role in the “data ecosystem” (both present and future) must be considered.
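To make the correlation risk concrete, here is a minimal sketch of how records in an “anonymized” data set might be linked to a public one by matching overlapping ratings. The user IDs, profile names, movie titles and the `link_profiles` helper are all made up for illustration; this is not the method used in the Netflix case.

```python
from typing import Dict, Optional, Set, Tuple

Rating = Tuple[str, int]  # (movie title, score)

# Hypothetical "anonymized" release: names removed, ratings kept.
anonymized: Dict[str, Set[Rating]] = {
    "user_0413": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4)},
    "user_0972": {("Movie A", 3), ("Movie D", 5)},
}

# Hypothetical public profiles that carry real names.
public_profiles: Dict[str, Set[Rating]] = {
    "jane_doe": {("Movie A", 5), ("Movie B", 2)},
    "john_roe": {("Movie D", 5), ("Movie E", 1)},
}


def link_profiles(
    anonymized: Dict[str, Set[Rating]],
    public_profiles: Dict[str, Set[Rating]],
    min_overlap: int = 2,
) -> Dict[str, str]:
    """Map each public name to the anonymized ID sharing the most ratings."""
    matches: Dict[str, str] = {}
    for name, public_ratings in public_profiles.items():
        best_id: Optional[str] = None
        best_overlap = 0
        for anon_id, anon_ratings in anonymized.items():
            overlap = len(public_ratings & anon_ratings)
            if overlap > best_overlap:
                best_id, best_overlap = anon_id, overlap
        # Only report a link when enough ratings coincide.
        if best_id is not None and best_overlap >= min_overlap:
            matches[name] = best_id
    return matches


links = link_profiles(anonymized, public_profiles)
```

Neither data set is sensitive on its own in this toy example; the exposure comes from combining them, which is why each release needs to be assessed against other data that is, or may become, available.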
Redaction and encryption/hashing are ways of using data while enhancing privacy. Security researcher Bruce Schneier has referred to data as a “toxic asset”—don’t retain it unless you absolutely have to.
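As a rough illustration of redaction and hashing, the sketch below replaces an identifier with a keyed hash and masks email addresses in free text. The key, the record layout and the helper names are assumptions made for this example, and the email pattern is deliberately simple rather than exhaustive.

```python
import hashlib
import hmac
import re

# Hypothetical key; a real one would be randomly generated and stored securely.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

# Simplified email pattern, good enough for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    The same input always yields the same token, so records can still be
    joined, but the token cannot be recomputed without the key.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def redact_emails(text: str) -> str:
    """Mask anything that looks like an email address."""
    return EMAIL_PATTERN.sub("[REDACTED]", text)


record = {
    "user_id": "alice@example.com",
    "comment": "Contact me at alice@example.com please",
}
safe_record = {
    "user_id": pseudonymize(record["user_id"]),
    "comment": redact_emails(record["comment"]),
}
```

A keyed hash is used here instead of a plain hash because identifiers such as email addresses are easy to guess, and an unkeyed hash of a guessable value can be reversed by hashing candidate values and comparing.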
Of course, organizations wish to analyze data to enhance their processes, services, etc. so there’s a balance to be struck.
Researchers are currently looking into ways of protecting privacy while allowing large-scale analysis of data.
Apple and Microsoft are two IT giants that are currently conducting research into differential privacy. This approach adds mathematical noise to an individual’s data, but enables the data set as a whole to be mined in search of overall patterns. Apple has said it is already deploying this technology in iOS 10.
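The idea of adding noise to each individual’s data can be sketched with randomized response, a classic technique that satisfies differential privacy. This is an illustration of the general principle only, not a description of what Apple or Microsoft deploy; the function names and the simulated population are assumptions.

```python
import random
from typing import List


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the true answer half the time, a coin flip otherwise."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports: List[bool]) -> float:
    """Recover the population rate from the noisy reports.

    P(report is True) = 0.5 * true_rate + 0.25, so we invert that relation.
    """
    observed = sum(reports) / len(reports)
    return 2.0 * (observed - 0.25)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simulated population in which 30% of people would truthfully answer "yes".
true_answers = [i < 3000 for i in range(10000)]
reports = [randomized_response(answer, rng) for answer in true_answers]
estimate = estimate_true_rate(reports)
```

No single report reveals much about the person who sent it, since any given "yes" may be the result of a coin flip, yet the estimate over the whole population lands close to the true rate.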
Another interesting area of research is homomorphic encryption. This allows analyses to be performed on encrypted data, so data scientists don’t need access to the underlying identifiable data. Unfortunately, it’s currently many orders of magnitude slower to analyze encrypted data when compared to working with raw data. But it’s early days.
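To show what computing on encrypted data looks like, here is a toy version of the Paillier cryptosystem, which supports addition only (it is additively, not fully, homomorphic). The tiny primes, the non-cryptographic random number generator and the function names are assumptions chosen for readability; this sketch is not secure and is not how a production library would be written.

```python
import math
import random
from typing import Tuple


def generate_keys(p: int, q: int) -> Tuple[int, Tuple[int, int]]:
    """Build a Paillier key pair from two primes (toy sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; needs Python 3.8+
    return n, (lam, mu)


def encrypt(n: int, message: int, rng: random.Random) -> int:
    """Encrypt 0 <= message < n using generator g = n + 1."""
    n_squared = n * n
    while True:
        r = rng.randrange(1, n)  # random blinding factor coprime to n
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(n: int, private_key: Tuple[int, int], ciphertext: int) -> int:
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)


rng = random.Random(7)  # NOT cryptographically secure; illustration only
n, private_key = generate_keys(293, 433)

c1 = encrypt(n, 1200, rng)
c2 = encrypt(n, 3400, rng)
encrypted_total = add_encrypted(n, c1, c2)  # computed without the private key
total = decrypt(n, private_key, encrypted_total)
```

The party calling `add_encrypted` never sees 1200 or 3400; only the holder of the private key can read the sum. Every operation is modular arithmetic on numbers far larger than the plaintext, which hints at the performance cost.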
In a perimeter-based security model, mission-critical applications are all kept inside the secure network and the bad people are kept outside the secure network.
This is a common security model in big data installations as big data security tools are lacking and network security people aren’t necessarily familiar with the specific requirements of securing big data systems.
The problem with perimeter-based security is that it relies on the perimeter remaining secure which, as we all know, is an article of faith. And the assumption that all the bad people are outside of the secure network is a bold one.
The popularity of NoSQL data stores has surged in recent years. MongoDB is currently (October 2016) the fourth most popular database engine.
NoSQL databases are often deployed as part of big data installations as they have properties that are helpful in managing and analyzing large data sets.
However, there are challenges with securing NoSQL databases. Most of the effort put into these databases has been on providing features. The market is growing fast, so vendors are busy responding to the emerging needs of their customers. Until recently, security has not been a priority for many of these products.
To be fair, many NoSQL products have key security features, but these are compromised by permissive default options, or lack of knowledge about how to configure them effectively. Relational databases are very mature, and security has long been a critical component of the feature set. There are also auditing tools, checklists and training courses available for those wishing to harden their SQL Server, Oracle or MySQL installations.
The NoSQL security story will get better quickly but, for now, care must be taken to ensure that the data held in these databases is adequately protected.
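One way to guard against permissive settings is to audit configuration before a node goes live. The sketch below checks a MongoDB-style configuration, already parsed into a dictionary, for three settings: access control, network binding and TLS. The `audit_mongod_config` helper and the sample configurations are hypothetical, and the three checks are examples rather than a complete hardening checklist.

```python
from typing import Any, Dict, List


def audit_mongod_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for risky settings (illustrative only)."""
    findings: List[str] = []
    security = config.get("security", {})
    net = config.get("net", {})

    # Without this setting, clients can connect without authenticating.
    if security.get("authorization") != "enabled":
        findings.append("access control is not enabled (security.authorization)")

    # Binding to every interface exposes the server beyond the local host.
    addresses = [a.strip() for a in str(net.get("bindIp", "")).split(",")]
    if net.get("bindIpAll") or "0.0.0.0" in addresses:
        findings.append("server listens on all network interfaces (net.bindIp)")

    # Traffic may travel unencrypted unless TLS is mandatory.
    if net.get("tls", {}).get("mode") != "requireTLS":
        findings.append("TLS is not required for connections (net.tls.mode)")

    return findings


# Hypothetical configurations for illustration.
permissive = {"net": {"bindIp": "0.0.0.0", "port": 27017}}
hardened = {
    "security": {"authorization": "enabled"},
    "net": {"bindIp": "127.0.0.1,10.0.0.5", "tls": {"mode": "requireTLS"}},
}

permissive_findings = audit_mongod_config(permissive)
hardened_findings = audit_mongod_config(hardened)
```

A check like this could run as part of deployment, so that a node with any findings is rejected before it starts accepting connections.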
Big data deployments tend to be a mosaic of emerging open source tools. Also, by their very nature, single applications are distributed across multiple physical machines (i.e. the cluster). This makes configuration management particularly challenging.
The configuration in a production big data analytics cluster is often spread across numerous, incompatible XML, JSON and text files. A further complication is that when new machines are added to a cluster they must be set up, patched and configured so that they don’t create a security hole.
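A simple way to spot a node whose configuration has drifted is to fingerprint each node’s configuration files and compare the result against the rest of the cluster. The sketch below assumes the file contents have already been collected into dictionaries; the node names, file contents and helper names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def fingerprint(config_files: Dict[str, str]) -> str:
    """Hash file names and contents in a stable order."""
    digest = hashlib.sha256()
    for name in sorted(config_files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so names and contents cannot blur
        digest.update(config_files[name].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def find_drifted_nodes(cluster: Dict[str, Dict[str, str]]) -> List[str]:
    """Return nodes whose configuration differs from the most common one."""
    prints = {node: fingerprint(files) for node, files in cluster.items()}
    majority, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(node for node, value in prints.items() if value != majority)


# Hypothetical cluster: node-3 was added later with a different setting.
baseline = {
    "core-site.xml": "<configuration><!-- shared settings --></configuration>",
    "app.json": '{"auth": true}',
}
cluster = {
    "node-1": dict(baseline),
    "node-2": dict(baseline),
    "node-3": {**baseline, "app.json": '{"auth": false}'},
}

drifted = find_drifted_nodes(cluster)
```

Treating the majority configuration as the reference is a simplification; comparing against a version-controlled baseline would be more robust, since a majority of nodes can share the same mistake.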
As organizations start to express concerns about the security of their big data (e.g. Hadoop) deployments, tools like Apache Ranger are appearing in an attempt to address the current vacuum.