If your server racks are anything like mine, they look impressive and powerful, clean and efficient. From the front, that is. Look behind the rack and there is a tangled rat’s nest of power cables, network connections, and those big clumsy things connecting to KVM switches.
It’s worse on the inside.
Data platforms and software tools continue to proliferate even as administrators strive to unify the vast variety of data with which they must contend. Microsoft wants SQL Server to be the “go-to” platform for all things data, and when SQL Server 2019 is released they will be making further strides in that direction. The burden on database administrators struggling to integrate data systems will be eased a bit by the inclusion of the Hadoop filesystem and Spark with SQL Server 2019 in what Microsoft seems to be calling “Big Data Analytics” for lack of an official name. Unfortunately, MS has not seen fit to include its most exciting new features in the community technology preview (CTP), but has made preview versions available only to participants in its early adopter program.
The key to it all is Kubernetes. This mature and trusted open-source container orchestration system was developed at Google to apply the lessons learned from its earlier container systems. As with any other container system, a container running under Kubernetes not only abstracts away the specifics of the server on which it is installed, but also packages the required software. This means that instead of implementing a Hadoop cluster, and Spark, and then an SQL Server installation, a DBA will only need to deploy the MS big data container. Five minutes? Maybe ten?
But what benefits do the Hadoop filesystem and Apache Spark actually provide? Microsoft Machine Learning Server (known as R Server until it was renamed in 2017), Hadoop, and Spark have considerable overlap in their functionality, but each excels in its own way. For huge volumes of unstructured data, Hadoop is the hands-down winner. Although Hadoop supports distributed computation, its implementation of map/reduce cannot compete with Spark’s directed acyclic graph approach to programming the map/reduce pattern. While Spark serves well for distributing analytic computation across a filesystem such as Hadoop, it also provides valuable streaming services. For its part, MS Machine Learning Server sits more on the computation side of the fence and scales analytic calculations for enterprise applications by parallelizing computation and by applying algorithms to data too large to fit into server memory all at once.
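To make the map/reduce pattern and the "too large for memory" idea a little more concrete, here is a minimal sketch in plain Python. It is only an illustration, not the API of Hadoop, Spark, or Machine Learning Server: the function names (`chunks`, `map_chunk`, `combine`, `chunked_mean`) are invented for this example, and a simple in-memory iterable stands in for data that would really be read block by block from disk or a cluster. Each chunk is summarized independently (the map step), and the small summaries are then merged (the reduce step), so the full data set never has to be held in memory at once.

```python
from functools import reduce
from typing import Iterable, Iterator, List, Tuple


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` values.

    Hypothetical stand-in for reading one block of a large file at a time.
    """
    batch: List[float] = []
    for value in values:
        batch.append(value)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, chunk
        yield batch


def map_chunk(chunk: List[float]) -> Tuple[float, int]:
    """Map step: reduce one chunk to a small (sum, count) summary."""
    return (sum(chunk), len(chunk))


def combine(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    """Reduce step: merge two (sum, count) summaries into one."""
    return (a[0] + b[0], a[1] + b[1])


def chunked_mean(values: Iterable[float], size: int = 1000) -> float:
    """Compute a mean while holding only one chunk in memory at a time."""
    total, count = reduce(combine, map(map_chunk, chunks(values, size)), (0.0, 0))
    if count == 0:
        raise ValueError("cannot take the mean of an empty data set")
    return total / count


if __name__ == "__main__":
    print(chunked_mean(range(1, 101), size=7))  # prints 50.5
```

Because `combine` is associative, the per-chunk summaries could in principle be computed on different machines and merged in any order, which is the property that distributed engines exploit.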
Microsoft’s plan for SQL Server 2019 promises a straightforward way to implement solutions that match the strengths of each of these three systems to the specific analytic tasks at hand.