{
    "version": "https:\/\/jsonfeed.org\/version\/1.1",
    "title": "Yuriy Gavrilov: posts tagged spark",
    "_rss_description": "Welcome to my personal place for love, peace and happiness 🤖 Yuriy Gavrilov",
    "_rss_language": "en",
    "_itunes_email": "yvgavrilov@gmail.com",
    "_itunes_categories_xml": "",
    "_itunes_image": "https:\/\/gavrilov.info\/pictures\/userpic\/userpic-square@2x.jpg?1643451008",
    "_itunes_explicit": "no",
    "home_page_url": "https:\/\/gavrilov.info\/tags\/spark\/",
    "feed_url": "https:\/\/gavrilov.info\/tags\/spark\/json\/",
    "icon": "https:\/\/gavrilov.info\/pictures\/userpic\/userpic@2x.jpg?1643451008",
    "authors": [
        {
            "name": "Yuriy Gavrilov - B[u]g - for charity.gavrilov.eth",
            "url": "https:\/\/gavrilov.info\/",
            "avatar": "https:\/\/gavrilov.info\/pictures\/userpic\/userpic@2x.jpg?1643451008"
        }
    ],
    "items": [
        {
            "id": "45",
            "url": "https:\/\/gavrilov.info\/all\/chtenie-avro-v-spark-iz-s3\/",
            "title": "Reading Avro in Spark from S3",
            "content_html": "<pre><code>import pyspark\nimport os\n\nS3_ACCESS_KEY = os.environ.get(\"S3_ACCESS_KEY\")\nS3_BUCKET = os.environ.get(\"S3_BUCKET\")\nS3_SECRET_KEY = os.environ.get(\"S3_SECRET_KEY\")\nS3_ENDPOINT = os.environ.get(\"S3_ENDPOINT\")\n\n# This cell may take some time to run the first time, as it must download the necessary Spark jars\nconf = pyspark.SparkConf()\n\n# If you are using the Spark containers, uncomment the line below to offload execution of Spark tasks to them\n# conf.setMaster(\"spark:\/\/spark:7077\")\n\nconf.set(\"spark.jars.packages\", \"org.apache.hadoop:hadoop-aws:3.3.1,io.delta:delta-core_2.12:2.1.0,org.apache.spark:spark-avro_2.12:3.3.2\")\n\n# conf.set(\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider\")\nconf.set(\"spark.hadoop.fs.s3a.endpoint\", S3_ENDPOINT)\nconf.set(\"spark.hadoop.fs.s3a.access.key\", S3_ACCESS_KEY)\nconf.set(\"spark.hadoop.fs.s3a.secret.key\", S3_SECRET_KEY)\nconf.set(\"spark.hadoop.fs.s3a.path.style.access\", \"true\")\nconf.set(\"spark.sql.extensions\", \"io.delta.sql.DeltaSparkSessionExtension\")\nconf.set(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\")\n\nsc = pyspark.SparkContext(conf=conf)\n# sc.setLogLevel(\"INFO\")\nspark = pyspark.sql.SparkSession(sc)\n\ndf = spark.read.format(\"avro\").load(f\"s3a:\/\/{S3_BUCKET}\/person2.avro\")\n<\/code><\/pre>\n",
            "date_published": "2023-02-28T20:31:52+03:00",
            "date_modified": "2023-02-27T21:28:18+03:00",
            "tags": [
                "spark"
            ],
            "_date_published_rfc2822": "Tue, 28 Feb 2023 20:31:52 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "45",
            "_rss_enclosures": [],
            "_e2_data": {
                "is_favourite": false,
                "links_required": [],
                "og_images": []
            }
        },
        {
            "id": "44",
            "url": "https:\/\/gavrilov.info\/all\/chtenie-json-v-spark-iz-s3\/",
            "title": "Reading JSON in Spark from S3",
            "content_html": "<pre><code>import pyspark\nimport os\n\nS3_ACCESS_KEY = os.environ.get(\"S3_ACCESS_KEY\")\nS3_BUCKET = os.environ.get(\"S3_BUCKET\")\nS3_SECRET_KEY = os.environ.get(\"S3_SECRET_KEY\")\nS3_ENDPOINT = os.environ.get(\"S3_ENDPOINT\")\n\n# This cell may take some time to run the first time, as it must download the necessary Spark jars\nconf = pyspark.SparkConf()\n\n# If you are using the Spark containers, uncomment the line below to offload execution of Spark tasks to them\n# conf.setMaster(\"spark:\/\/spark:7077\")\n\nconf.set(\"spark.jars.packages\", \"org.apache.hadoop:hadoop-aws:3.3.1,io.delta:delta-core_2.12:2.1.0,org.apache.spark:spark-avro_2.12:3.3.2\")\n\n# conf.set(\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider\")\nconf.set(\"spark.hadoop.fs.s3a.endpoint\", S3_ENDPOINT)\nconf.set(\"spark.hadoop.fs.s3a.access.key\", S3_ACCESS_KEY)\nconf.set(\"spark.hadoop.fs.s3a.secret.key\", S3_SECRET_KEY)\nconf.set(\"spark.hadoop.fs.s3a.path.style.access\", \"true\")\nconf.set(\"spark.sql.extensions\", \"io.delta.sql.DeltaSparkSessionExtension\")\nconf.set(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\")\n\nsc = pyspark.SparkContext(conf=conf)\n# sc.setLogLevel(\"INFO\")\nspark = pyspark.sql.SparkSession(sc)\n\ndf = spark.read.format(\"org.apache.spark.sql.json\").load(f\"s3a:\/\/{S3_BUCKET}\/apple3.json\")\ndf.show()\n<\/code><\/pre>\n",
            "date_published": "2023-02-27T21:26:01+03:00",
            "date_modified": "2023-02-27T21:25:52+03:00",
            "tags": [
                "spark"
            ],
            "_date_published_rfc2822": "Mon, 27 Feb 2023 21:26:01 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "44",
            "_rss_enclosures": [],
            "_e2_data": {
                "is_favourite": false,
                "links_required": [],
                "og_images": []
            }
        }
    ],
    "_e2_version": 4171,
    "_e2_ua_string": "Aegea 11.4 (v4171e)"
}