Spark, as with virtually the entire Hadoop ecosystem, is built with Java, and of course Spark’s shell default programming language, Scala targets the Java Virtual Machine (JVM). For better or for worse, today’s systems involve data from heterogeneous sources, even sources that might at first seem an unnatural fit. Such is the case with reading SQL Server data in Apache Spark using Scala. It’s not difficult, but we do need to do a little extra work.
The most natural way for Scala code to access a relational database is with Java DataBase Connectivity (JDBC). To the best of my knowledge, there is no open-source jdbc driver for SQL Server, and Microsoft’s offering is not distributed with SQL Server. Our first task, therefore, is to obtain and install the necessary driver. As long as we’re installing a driver anyway, we’ll test the preview version of Microsoft’s sqljdbc driver version 6.0.
It’s pleasant to discover that this driver is easy to find and download.
The jar file names for each different version of Microsoft’s driver are the same. We will use sqljdbc42.jar, which targets JDBC 4.2., but is from Microsoft’s driver version 6.0. The sqljdbc42.jar file from Microsoft’s driver version 4.2 is a little smaller, but you should record carefully which one you are using for future reference.
The tests were run on Ubuntu 14.10 and Ubuntu 15.10 running Spark 1.4.1 and 1.6.0 respectively. Since this is a ”preview,” we will add it to a specific folder for our scala work. You can create a folder anywhere within your home directory where you might wish to save your Scala code. Although, we are not using the Scala Build Tool (SBT), right now, we might wish to in the future, so we will create a new directory named ”lib” and copy sqjdbc42.jar into that directory. SBT expects to see any unmanaged jars in a subdirectory named ”lib.” Open a terminal and navigate to the project directory (not the lib directory.). Then invoke spark-shell. We will, of course, need to import java.sql.
scala> import java.sql._
Now we will need to tell Java where it can find the jar file.
The Scala command to add a jar to the classpath has changed (twice, actually). Therefore, the command you use will depend upon the version of Scala you are using.
scala> :cp lib/sqljdbc42.jar //scala 2.10.6
scala> :require lib/sqljdbc42.jar //scala 2.11.7
If you are unsure, the quickest thing to do is to issue the :help command, and see which command is listed.
After adding the Microsoft driver jar to the classpath, we will need to reference the required class.
If this statement fails, there is most likely a problem with finding the Microsoft driver jar in the classpath. We are now ready to obtain a connection object. To do this, we must have connection information and security id.
scala> val connection = DriverManager.getConnection(”jdbc:sqlserver://192.168.206.228:1433;databaseName=ContosoRetailDW”,”sa”, ”sapw”)
Note that in this example, the SQL Server IP address includes the default port 1433.
The connection object is now our entré to the database. There are several ways to execute SQL statements; one is to prepare the statement and then execute it.
scala> val SQL=connection.prepareStatement(”SELECT Productname, UnitPrice FROM DimProduct”)
scala> val results = sql.executeQuery()
Once a result set is obtained, you can iterate through the rows grabbing columns as necessary. This, of course, is a standard object-oriented approach, and for many purposes may in fact be the best approach. However, we may wish to adhere to Scala’s functional approach. If so, we will need access our data using functional data structures as opposed to the object-oriented data structures provided by Java and JDBC. We will see how to do that in an upcoming section.
Using JDBC to connect to SQL Server from Spark using Scala is not difficult, but it does require some attention to detail, including downloading the appropriate driver version and ensuring the jar is in a spot where Scala can find it. Once you have accomplished that, the rest is easy.