
How To Connect Spark With Cassandra Using Spark-cassandra-connector?

You must forgive my noobness, but I'm trying to set up a Spark cluster that connects to Cassandra and runs a Python script. Currently I am using DataStax Enterprise to run Cassandra o

Solution 1:

I have used pyspark in a standalone Python script. I don't use DSE; I cloned spark-cassandra-connector from DataStax's GitHub repository and compiled it following DataStax's instructions.

To give Spark access to the connector, I copied the assembly jar into the jars folder inside the Spark installation.

The same approach should work for you:

cp ~/spark-cassandra-connector/spark-cassandra-connector/target/full/scala-2.11/spark-cassandra-connector-assembly-2.0.5-86-ge36c048.jar $SPARK_HOME/jars/

You could visit the post where I explain my own experience setting up the environment.

Once spark has access to Cassandra connector, you can use pyspark library as wrapper:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName('SparkCassandraApp') \
  .config('spark.cassandra.connection.host', 'localhost') \
  .config('spark.cassandra.connection.port', '9042') \
  .config('spark.cassandra.output.consistency.level', 'ONE') \
  .master('local[2]') \
  .getOrCreate()

ds = spark \
  .read \
  .format('org.apache.spark.sql.cassandra') \
  .options(table='tablename', keyspace='keyspace_name') \
  .load()

ds.show(10)
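Writing goes through the same connector format. Below is a minimal sketch of the reverse direction, which is my own addition rather than part of the original answer; it assumes the `spark` session built above and reuses the hypothetical `tablename`/`keyspace_name` from the read example:

```python
# Sketch: appending a small DataFrame back into Cassandra through the
# same 'org.apache.spark.sql.cassandra' source. Requires a running
# Cassandra node reachable at the host configured on the session.
from pyspark.sql import Row

rows = spark.createDataFrame([
    Row(id=1, value='a'),
    Row(id=2, value='b'),
])

rows.write \
    .format('org.apache.spark.sql.cassandra') \
    .mode('append') \
    .options(table='tablename', keyspace='keyspace_name') \
    .save()
```

Note that `mode('append')` inserts the rows without touching existing data; the target table and keyspace must already exist in Cassandra.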

That post also includes the whole script as a working example.

Solution 2:

Here is how to connect spark-shell to Cassandra in a non-DSE setup.

Copy spark-cassandra-connector jar to spark/spark-hadoop-directory/jars/

spark-shell --jars ~/spark/spark-hadoop-directory/jars/spark-cassandra-connector-*.jar

Then, in the Spark shell, execute these commands:

sc.stop
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra._

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val csc = new CassandraSQLContext(sc)
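As an alternative to copying the jar by hand, Spark can also fetch the connector from Maven at session startup via the `spark.jars.packages` setting. A sketch from the Python side (the exact connector coordinates are an assumption; match the Scala and connector versions to your own Spark build):

```python
# Sketch: resolving the connector from Maven instead of a local jar copy.
# 'com.datastax.spark:spark-cassandra-connector_2.11:2.0.5' is an assumed
# coordinate; pick the version matching your Spark/Scala installation.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('PackagedCassandraApp') \
    .config('spark.jars.packages',
            'com.datastax.spark:spark-cassandra-connector_2.11:2.0.5') \
    .config('spark.cassandra.connection.host', 'localhost') \
    .getOrCreate()
```

On first run Spark downloads the artifact and its dependencies into the local Ivy cache, so later sessions start without re-fetching.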

You will have to provide more parameters if your Cassandra cluster has password authentication enabled, etc. :)
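For the password case mentioned above, the connector reads credentials from two configuration keys. A minimal sketch in Python, with placeholder credentials:

```python
# Sketch: passing Cassandra credentials to the connector.
# 'cassandra' / 'secret' are placeholder values, not real credentials.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('AuthenticatedCassandraApp') \
    .config('spark.cassandra.connection.host', 'localhost') \
    .config('spark.cassandra.auth.username', 'cassandra') \
    .config('spark.cassandra.auth.password', 'secret') \
    .getOrCreate()
```

In spark-shell the same keys can be set on the `SparkConf` with `.set("spark.cassandra.auth.username", ...)` before creating the context.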
