LIBBLE-Spark
Introduction
LIBBLE-Spark is the variant of LIBBLE implemented on Spark.
The current version of LIBBLE-Spark includes the following machine learning algorithms:
- Classification
  - Logistic Regression (LR)
  - Logistic Regression with L1-norm Regularization
  - Support Vector Machine (SVM)
- Regression
  - Linear Regression
  - Lasso
- Collaborative Filtering
  - Matrix Factorization
- Dimensionality Reduction
  - Principal Component Analysis (PCA)
  - Singular Value Decomposition (SVD)
- Clustering
  - KMeans
Empirical Comparison
The main Learning Engine of LIBBLE-Spark is based on a distributed stochastic optimization algorithm called SCOPE (Scalable Composite OPtimization for lEarning). SCOPE is both computation-efficient and communication-efficient. Theoretical analysis shows that SCOPE converges with a linear convergence rate when the objective function is strongly convex. Furthermore, empirical results on real datasets show that SCOPE can outperform other state-of-the-art distributed learning methods on Spark, including both batch learning methods and stochastic learning methods.

Efficiency
To compare efficiency with state-of-the-art machine learning methods on Spark, we choose logistic regression (LR) with an L2-norm regularization term to evaluate SCOPE and other baselines. The result on the MNIST8M dataset is shown below. We find that SCOPE outperforms all the other baselines.
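As a reminder of the objective being optimized in this experiment, here is a minimal, self-contained sketch of the L2-regularized logistic regression objective; the names here are illustrative and are not part of the LIBBLE-Spark API:

```scala
object LRObjective {
  // Per-instance logistic loss for a label y in {-1, +1} and margin w·x.
  def logisticLoss(margin: Double, y: Double): Double =
    math.log1p(math.exp(-y * margin))

  // Full objective: average logistic loss plus (lambda / 2) * ||w||^2.
  def objective(data: Seq[(Array[Double], Double)],
                w: Array[Double], lambda: Double): Double = {
    val avgLoss = data.map { case (x, y) =>
      val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      logisticLoss(margin, y)
    }.sum / data.size
    avgLoss + 0.5 * lambda * w.map(v => v * v).sum
  }
}
```

SCOPE minimizes exactly this kind of finite-sum composite objective, distributing the sum over Workers.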

Speedup
The speedup is defined as $speedup = \frac{\text{time with 16 cores}}{\text{time with } 2x \text{ cores}}$, and we choose $x = 8, 16, 24, 32$. The speedup result is shown below, where we find that SCOPE achieves superlinear speedup, which is plausibly due to the higher cache hit ratio when more machines are used. The good speedup of SCOPE can be explained by the fact that most training work is completed locally by each Worker, so SCOPE incurs little communication cost.

Synchronization Cost
The following figure shows the synchronization cost, which contains both communication time and waiting time. The synchronization cost of SCOPE is low because most computation is completed locally and only a small number of synchronization rounds are needed.
Tutorial

Import LIBBLE

For maven
Add repository:
<repository>
  <id>libblespark</id>
  <url>https://libble.github.io/mvnrepo/</url>
</repository>
Add dependency:
<dependency>
  <groupId>libble</groupId>
  <artifactId>libblespark_${scala.binary}</artifactId>
  <version>${libble.spark.version}</version>
</dependency>

For sbt:
Add the following to build.sbt:
libraryDependencies += "libble" %% "libblespark" % "1.0.1-SNAPSHOT"
resolvers ++= Seq(
  "libble Releases" at "https://libble.github.io/mvnrepo/"
)


Load and Save Data
LIBBLE-Spark supports two input data formats: the LIBSVM format for sparse features, and a dense format in which each line is one instance, with the label and the features separated by spaces. The function for loading data is loadLIBBLEFile.
val conf = new SparkConf()
val sc = new SparkContext(conf)
import libble.context.LibContext.sc2LibContext
val training = sc.loadLIBBLEFile("sparse.data")
We use the method saveAsLIBBLEFile to save data to the file system:
import libble.context.LibContext.RDD2LIBBLERDD
training.saveAsLIBBLEFile("this.data")
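For reference, the two input formats described above can be parsed as sketched below; this is purely illustrative and is not LIBBLE-Spark's internal loader:

```scala
object InputFormats {
  // LIBSVM-style sparse line: "label idx1:v1 idx2:v2 ..." with 1-based indices.
  def parseSparse(line: String, dim: Int): (Double, Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = Array.fill(dim)(0.0)
    tokens.tail.foreach { t =>
      val Array(i, v) = t.split(":")
      features(i.toInt - 1) = v.toDouble
    }
    (label, features)
  }

  // Dense line: "label v1 v2 ... vd", label and features separated by spaces.
  def parseDense(line: String): (Double, Array[Double]) = {
    val tokens = line.trim.split("\\s+").map(_.toDouble)
    (tokens.head, tokens.tail)
  }
}
```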

Classification and Regression
Here, we give an example of using Logistic Regression. The usage of Linear Regression and SVM is similar. You can find the complete code in the package "examples".
val training = sc.loadLIBBLEFile(path, numPart)
val m = new LogisticRegression(stepSize, regParam, elasticF, numIter, numPart)
  .setClassNum(10)
m.train(training)

Self-Defined Generalized Linear Model
In our framework, you can define your own generalized linear models and optimize them with our learning engine. To define your own loss function, implement the abstract class LossFunc. Here, we give an example using GeneralizedLinearModel:
val training = sc.loadLIBBLEFile(args(0), numPart)
val m = new GeneralizedLinearModel(stepSize, regParam, elasticF, numIter, numPart)
  .setLossFunc(new selfDefinedLoss())
  .setUpdater(new L1Updater)
m.train(training)
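To illustrate what a loss implementation must supply, namely a loss value and its gradient, here is a self-contained sketch using the logistic loss; the actual LossFunc interface in LIBBLE-Spark may differ, and all names here are illustrative:

```scala
object SelfDefinedLossSketch {
  // Logistic loss for a label y in {-1, +1} at point w.
  def loss(w: Array[Double], x: Array[Double], y: Double): Double = {
    val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    math.log1p(math.exp(-y * margin))
  }

  // Gradient of the logistic loss with respect to w.
  def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    val coeff = -y / (1.0 + math.exp(y * margin))
    x.map(_ * coeff)
  }
}
```

Any loss exposing these two quantities can, in principle, be plugged into a stochastic-gradient learning engine.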

Collaborative Filtering
Collaborative filtering is widely used in recommendation systems. An example of collaborative filtering with UV matrix factorization is shown as follows:
val trainSet = sc.textFile(args(0), numParts)
  .map(_.split(',') match {
    case Array(user, item, rate) => Rating(rate.toDouble, user.toInt, item.toInt)
  })
val model = new MatrixFactorization()
  .train(trainSet, numIters, numParts, rank, regParam_u, regParam_v, stepsize)
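To make the factorization concrete, here is a self-contained sketch of one stochastic gradient update on a single rating, the kind of step such training performs; the names and the exact update rule used by MatrixFactorization are assumptions:

```scala
object MFStepSketch {
  // One SGD update on rating r ≈ u·v, with squared error and L2 regularization.
  def update(u: Array[Double], v: Array[Double], rating: Double,
             stepSize: Double, regU: Double, regV: Double)
      : (Array[Double], Array[Double]) = {
    val pred = u.zip(v).map { case (ui, vi) => ui * vi }.sum
    val err = rating - pred
    // Gradient steps on u and v, both computed from the old factors.
    val newU = u.zip(v).map { case (ui, vi) => ui + stepSize * (err * vi - regU * ui) }
    val newV = v.zip(u).map { case (vi, ui) => vi + stepSize * (err * ui - regV * vi) }
    (newU, newV)
  }
}
```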

Dimensionality Reduction
Principal Component Analysis (PCA) is a widely used method for dimensionality reduction. An example to perform dimensionality reduction with PCA is shown as follows:
val training = sc.loadLIBBLEFile(args(0), numPart)
val mypca = new PCA(K, bound, stepSize, numIters, numPart, batchSize)
val PCAModel = mypca.train(training)
val pc = PCAModel._2
val projected = mypca.transform(training, pc)
projected.collect().foreach(x => println(x.features))
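The idea behind extracting a leading principal component can be sketched with power iteration on a small symmetric matrix: repeatedly multiply a vector by the matrix and renormalize. This is purely illustrative and unrelated to LIBBLE-Spark's stochastic implementation:

```scala
object PowerIterationSketch {
  // Returns an approximation of the top eigenvector of a symmetric matrix m.
  def topEigenvector(m: Array[Array[Double]], iters: Int): Array[Double] = {
    var v = Array.fill(m.length)(1.0)
    for (_ <- 0 until iters) {
      // Multiply by m, then renormalize to unit length.
      val mv = m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
      val norm = math.sqrt(mv.map(x => x * x).sum)
      v = mv.map(_ / norm)
    }
    v
  }
}
```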
Singular value decomposition (SVD) is a popular matrix decomposition/factorization method in linear algebra and machine learning. An example to perform dimensionality reduction with SVD is shown as follows:
val training = sc.loadLIBBLEFile(args(0), numPart)
val mysvd = new SVD(K, bound, stepSize, numIters, numPart, batchSize)
val SVDModel = mysvd.train(training)
val sigma = SVDModel._1
val v = SVDModel._2
sigma.foreach(x => print(x + ","))
v.foreach(x => println(x))

Clustering
KMeans is a widely used prototype-based clustering algorithm. An example of clustering with KMeans is shown as follows:
val training = sc.loadLIBBLEFile(args(0))
val m = new KMeans(k, maxIters, stopBound)
val data = training.map(e => (e.label, e.features))
m.train(data)
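For intuition, one assign-and-update iteration of the KMeans loop (which the class repeats until maxIters or the stopping bound is reached) can be sketched as follows; this is an illustrative single-machine sketch, not the distributed implementation:

```scala
object KMeansStepSketch {
  // Squared Euclidean distance between two points.
  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One Lloyd iteration: assign each point to its nearest center,
  // then recompute each center as the mean of its assigned points.
  def step(points: Seq[Array[Double]],
           centers: Seq[Array[Double]]): Seq[Array[Double]] = {
    val assigned = points.groupBy(p => centers.indices.minBy(i => dist2(p, centers(i))))
    centers.indices.map { i =>
      assigned.get(i) match {
        case Some(ps) => ps.transpose.map(col => col.sum / ps.size).toArray
        case None     => centers(i) // keep an empty cluster's center unchanged
      }
    }
  }
}
```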
Open Source
- GitHub URL: https://github.com/LIBBLE/LIBBLESpark/
- License: This project is licensed under the Apache License 2.0.
API
Please click here to check the Application Programming Interface documents.
Development Team

Director: Wu-Jun Li

Developers: Ru Xiang, Peng Gao, Ying-Hao Shi

Theory and Algorithm Designers: Shen-Yi Zhao, Wu-Jun Li