Using Pandas in SQL Server 2017 Python Scripts

Creating Row Data with Pandas Data Frames in SQL Server vNext

In the previous blog, we described the ease with which Python support can be installed with SQL Server vNext, which most folks just call SQL Server 2017. In this entry, we will take a look at the use of pandas DataFrames within SQL Server 2017 Python scripts.

DataFrames are, of course, the row and column data structures that can be found in virtually all development platforms designed for data analysis. Unlike SQL Server row data, however, DataFrames are best thought of as a set of individual columns (or vectors if you like) rather than a set of rows with column values. Pandas is the most popular implementation of core DataFrame functionality available for Python.

In this article, we will use Python code to generate a list of random numbers and then see how that list can be returned as a result set or even written into a temporary (or for that matter permanent) table. Since Python provides a richer set of methods for random number generation than does T-SQL, this example will not only illustrate some fundamental Python techniques for SQL Server but is may also prove very practical for developers wishing to generate random samples of data or analysts working with Monte Carlo models.

Let’s look at some Python.

exec sp_execute_external_script @language =N’Python’,

@script=N’

import random

import pandas as pd

t=[]

for i in range(0,10):

t.append(random.random())

tDF = pd.DataFrame(data=t)

print(tDF)

OutputDataSet=tDF’

There are two required arguments for the extended stored procedure that launches scripts. The first is the language, the second is the script itself. Additional optional parameters can be defined. The bulk of the script is pure Python, and Pythonistas need learning nothing new. The Python print function will return output to the Message tab of the Management Studio results window, and OutputDataSet allows us to specify output to be returned as a rowset. In this code, we create t, a list of random numbers and then use pandas to convert the list to a DataFrame, tDF in this example. A pandas DataFrame can be directly returned as an output rowset by SQL Server.

Column Names

Column names defined in a DataFrame are not converted to column names in an output rowset. The following code assigns the name “Random” to the sole column of the list.

exec sp_execute_external_script @language =N’Python’,

@script=N’

import random

import pandas as pd

t=[]

for i in range(0,10):

t.append(random.random())

tDF = pd.DataFrame(data=t)

newnames={0:”Random”}

tDF.rename(columns=newnames, inplace=True)

print(tDF)

OutputDataSet=tDF’

It is not surprising that we cannot simply copy-and-paste all Python code; in this example we see that the single quotes around the new name Random must be doubled for compatibility with T-SQL syntax. When we run this code

The new column name “Random” appears in the message, that is, it is printed by Python as would be expected, but on the results tab we see “(No column name)” in the output rowset.

This is easy to fix by including the T-SQL result sets declaration.

exec sp_execute_external_script @language =N’Python’,

@script=N’

import random

import pandas as pd

t=[]

for i in range(0,10):

t.append(random.random())

tDF = pd.DataFrame(data=t)

newnames={0:”Random”}

tDF.rename(columns=newnames, inplace=True)

print(tDF)

OutputDataSet=tDF’

WITH RESULT SETS (([Random Value] float)) — note two sets of parentheses

Note that the argument to WITH RESULT SETS is a list enclosed in parentheses, and each element in that list, even if there is only one, must also be enclosed in parentheses.

Using Our Random Numbers

Once we have a script that generates the random numbers we want, or any other sets of values, we can use the T-SQL INSERT EXEC statement to save the results in a temporary table for further use.

CREATE TABLE #RandTest(X FLOAT)

INSERT #RandTest EXEC

sp_execute_external_script @language =N’Python’,

@script=N’

import random

import pandas as pd

t=[]

for i in range(0,10):

t.append(random.random())

tDF = pd.DataFrame(data=t)

newnames={0:”Random”}

tDF.rename(columns=newnames, inplace=True) #giving name to DataFrame column does not name output resultset column

print(tDF)

OutputDataSet=tDF’

And there we have it.

An Observation on Performance

I don’t think it’s appropriate to be too critical of performance for a feature not yet released. I have, however, made a few casual observations.

The first time a Python script is run the execution time seems excessively long. This time seems to be related not to the execution of the script itself, but to the loading of the Python interpreter, though in fact the time is longer than required by other applications to load the same interpreter. Once the procedure executes, subsequent executions of that procedure are fast, as are the execution of other Python scripts. After a period of inactivity, however, the execution time will slow again, suggesting that the Python interpretive machinery is cached for a while, but not indefinitely.

Conclusion

The ability to execute Python scripts from T-SQL provides the ability to perform an enormous range of tasks while remaining close to the server, meaning that the latencies associated with passing data to and from client machines can be eliminated. We can, in essence, bring the vast array of tools for machine learning available in Python into the server itself.

Type to search blog.learningtree.com

Do you mean "" ?

Sorry, no results were found for your query.

Please check your spelling and try your search again.