
Connect To S3 Data From Pyspark

I am trying to read a JSON file from Amazon S3 to create a Spark context and use it to process the data. Spark is running inside a Docker container, so putting files in the Docker path …

Solution 1:

I solved this by adding --packages org.apache.hadoop:hadoop-aws:2.7.1 to the spark-submit command.

This downloads the missing Hadoop packages that allow you to run Spark jobs against S3.
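
For reference, a minimal spark-submit invocation could look like this (job.py is an assumed name for your PySpark script):

spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 job.py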

Then, in your job, you need to set your AWS credentials, for example:

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_key)

Another option for setting your credentials is to define them in spark/conf/spark-env.sh:

#!/usr/bin/env bash
AWS_ACCESS_KEY_ID='xxxx'
AWS_SECRET_ACCESS_KEY='xxxx'
SPARK_WORKER_CORES=1           # to set the number of cores to use on this machine
SPARK_WORKER_MEMORY=1g         # to set how much total memory workers have to give executors (e.g. 1000m, 2g)
SPARK_EXECUTOR_INSTANCES=10    # to set the number of worker processes per node


Solution 2:

I would suggest going through this link.

In my case, I used instance profile credentials to access S3 data.

Instance profile credentials are used on EC2 instances and delivered through the Amazon EC2 metadata service. The AWS SDK for Java uses the InstanceProfileCredentialsProvider to load these credentials.

Note

Instance profile credentials are used only if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set. See EC2ContainerCredentialsProviderWrapper for more information.

For PySpark, I use the following settings to access S3 content.

import pyspark

def get_spark_context(app_name):
    # configure
    conf = pyspark.SparkConf()
    conf.setAppName(app_name)

    # init & return
    sc = pyspark.SparkContext.getOrCreate(conf=conf)

    # s3a config: regional endpoint and credential provider chain
    sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint',
                                      's3.eu-central-1.amazonaws.com')
    sc._jsc.hadoopConfiguration().set(
        'fs.s3a.aws.credentials.provider',
        'com.amazonaws.auth.InstanceProfileCredentialsProvider,'
        'com.amazonaws.auth.profile.ProfileCredentialsProvider'
    )

    return pyspark.SQLContext(sparkContext=sc)
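
A usage sketch, assuming the code runs on an EC2 instance whose instance profile can read the bucket (bucket name and path are placeholders):

# obtain an SQLContext configured for s3a access
sql_context = get_spark_context('s3a-json-example')

# read a JSON file through the s3a connector
df = sql_context.read.json('s3a://my-bucket/path/to/file.json')
df.printSchema()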

More on the Spark context here.

Please refer to this for the different S3 access types.
