Foro Formación Hadoop
SparkR: Harness the power of Spark + R
SparkR lets us use R on top of Spark and take advantage of the full power of the in-memory engine.
R is a widely used statistical programming language but its interactive use is typically limited to a single machine. To enable large scale data analysis from R, we will present SparkR, an open source R package developed at UC Berkeley, that allows data scientists to analyze large data sets and interactively run jobs on them from the R shell. This talk will introduce SparkR, discuss some of its features and highlight the power of combining R’s interactive console and extension packages with Spark’s distributed run-time.
BIOS:
Shivaram Venkataraman is a third-year PhD student at the University of California, Berkeley and works with Mike Franklin and Ion Stoica at the AMP Lab. He is a committer on the Apache Spark project and his research interests are in designing frameworks for large-scale machine-learning algorithms. Before coming to Berkeley, he completed his M.S. at the University of Illinois, Urbana-Champaign and worked as a Software Engineer at Google.
Zongheng is an undergraduate student at UC Berkeley studying computer science and math. He is also a research assistant at AMPLab; previously he worked on SparkR.
Installation and an example of SparkR:
R on Spark
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.
Features
RDDs as Distributed Lists
SparkR exposes the RDD API of Spark as distributed lists in R. For example, we can read an input file from HDFS and process every line using lapply on an RDD.
sc <- sparkR.init("local")                  # initialize a SparkContext with a local master
lines <- textFile(sc, "hdfs://data.txt")    # read a text file from HDFS as an RDD of lines
wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })
In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyWithPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey and collect.
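As a rough illustration of how these operations compose, the sketch below builds a word count on top of the lines RDD from the previous example. It is only an outline and assumes the flatMap, reduceByKey and collect helpers of this SparkR package.
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })   # split every line into words
pairs <- lapply(words, function(word) { list(word, 1L) })              # emit (word, 1) pairs
counts <- reduceByKey(pairs, "+", 2L)                                  # sum the counts per word over 2 partitions
collect(counts)                                                        # bring the (word, count) pairs back to the driver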
Serializing closures
SparkR automatically serializes the necessary variables to execute a function on the cluster. For example, if you use some global variables in a function passed to lapply, SparkR will automatically capture these variables and copy them to the cluster. An example of using a random weight vector to initialize a matrix is shown below:
lines <- textFile(sc, "hdfs://data.txt")
D <- 10                                             # length of the weight vector (example value)
initialWeights <- runif(n = D, min = -1, max = 1)
createMatrix <- function(line) {
  as.numeric(unlist(strsplit(line, " "))) %*% t(initialWeights)
}
# initialWeights is automatically serialized and shipped with the closure
matrixRDD <- lapply(lines, createMatrix)
Using existing R packages
SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster. For example, to use the Matrix package in a closure applied on each partition of an RDD, you could run:
generateSparse <- function(x) {
  # Use the sparseMatrix function from the Matrix package
  sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 3), x = c(1, 2, 3))
}
includePackage(sc, Matrix)
sparseMat <- lapplyPartition(rdd, generateSparse)   # rdd is an existing RDD
Installing SparkR
SparkR requires Scala 2.10 and Spark version >= 1.1.0, and depends on the R packages rJava and testthat (testthat is only required for running unit tests).
For the latest information, please refer to the README.
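If the two R dependencies are not already available, they can be installed from CRAN in the usual way; a minimal sketch:
install.packages("rJava")
install.packages("testthat")   # only needed if you plan to run the unit tests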
If you wish to try out SparkR, you can use install_github from the devtools package to directly install the package.
library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir="pkg")
If you wish to clone the repository and build from source, you can use the following script to build the package locally.
./install-dev.sh
Running sparkR
If you have installed it directly from GitHub, you can include the SparkR package and then initialize a SparkContext. For example, to run with a local Spark master, you can launch R and then run:
library(SparkR)
sc <- sparkR.init(master="local")
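As a quick sanity check that the context is working, you can distribute a small local vector and collect it back; a minimal sketch, assuming the parallelize and collect helpers of this SparkR package:
rdd <- parallelize(sc, 1:10, 2)   # distribute the numbers 1..10 across 2 partitions
collect(rdd)                      # should return the values 1 through 10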
If you have cloned and built SparkR, you can start using it by launching the SparkR shell with:
./sparkR
SparkR also comes with several sample programs in the examples directory. To run one of them, use ./sparkR <filename> <args>. For example:
./sparkR examples/pi.R local[2]
You can also run the unit tests for SparkR by running:
./run-tests.sh
Source:
http://amplab-extras.github.io/SparkR-pkg/