Machine Learning using Spark and R

R is ubiquitous in the data science community. Its ecosystem of more than 8,000 packages makes it the Swiss Army knife of modeling applications. Similarly, Apache Spark has rapidly become the big data platform of choice for data scientists. Its ability to perform calculations relatively quickly (due to features like in-memory caching) makes it ideal […]

Querying SQL Server Data from Spark with Scala

Spark, like virtually the entire Hadoop ecosystem, is built on the Java platform, and Scala, the default programming language of Spark’s shell, targets the Java Virtual Machine (JVM). For better or for worse, today’s systems involve data from heterogeneous sources, even sources that might at first seem an unnatural fit. Such is the case with reading […]

How to Build a Random Forest Classifier Using Data Frames in Spark

The release of Spark 1.5 increased support for using data frames with MLlib, Spark’s machine learning library. MLlib now divides into two packages: spark.mllib, which contains the original API built on top of RDDs, and spark.ml, which provides a higher-level API built on top of DataFrames for constructing machine learning pipelines. While the spark.mllib package will continue […]

How to Predict Outcomes Using Random Forests and Spark

Random forests are an ensemble (model-of-models) machine learning approach. The algorithm builds multiple decision trees, each based on a different subset of the features in the data. Outcomes are then predicted by running observations through all the trees and averaging the individual predictions (or, for classification, taking a majority vote). Think wisdom of crowds. Spark’s machine learning library, MLlib, has support […]
