Dimensionality Reduction in R

Most problems of interest to organizations are multivariate. They involve multiple issues that must be looked at simultaneously. For example, when evaluating sites for a new store, we need to consider factors like cost of land, proximity to transport and local competition. The more issues (i.e. dimensions) we have to consider, the thornier the problems […]
Using Excel Power Query to Append a File to a Table

Lots of people import text-based files into Excel tables. Sonia V. recently asked if it was possible to append data imported from a text file to an existing table. I am not aware of any technique to do exactly what Sonia wants, but we can do the next best thing. We can read an Excel […]
How to Build a Random Forest Classifier Using Data Frames in Spark

The release of Spark 1.5 increased support for using data frames with MLlib, Spark’s machine learning library. MLlib now divides into two packages: spark.mllib, which contains the original API built on top of RDDs, and spark.ml, which provides a higher-level API built on top of DataFrames for constructing machine learning pipelines. While the spark.mllib package will continue […]
How to Display Data on a World Map in R

One of the strengths of R is its vast ecosystem of libraries. This includes numerous sophisticated visualization libraries, some of which, such as the excellent ggplot, are capable of producing publication-quality charts. I’m not a huge fan of charts (e.g. scatterplots, bar charts) as communication devices. I believe that conclusions can usually be presented more succinctly. […]
Five R Packages Data Scientists Should Know About

One of the strengths of R is its comprehensive ecosystem of packages. If you want to do something, chances are that someone’s been there first and written the package. Of course, with so many packages out there, quality varies considerably—and some packages are so specific that it’s difficult to imagine them having a wide audience. […]
