{
    "version": "https:\/\/jsonfeed.org\/version\/1.1",
    "title": "Yuriy Gavrilov: posts tagged Data Proc",
    "_rss_description": "Welcome to my personal place for love, peace and happiness 🤖 Yuiry Gavrilov",
    "_rss_language": "en",
    "_itunes_email": "yvgavrilov@gmail.com",
    "_itunes_categories_xml": "",
    "_itunes_image": "https:\/\/gavrilov.info\/pictures\/userpic\/userpic-square@2x.jpg?1643451008",
    "_itunes_explicit": "no",
    "home_page_url": "https:\/\/gavrilov.info\/tags\/data-proc\/",
    "feed_url": "https:\/\/gavrilov.info\/tags\/data-proc\/json\/",
    "icon": "https:\/\/gavrilov.info\/pictures\/userpic\/userpic@2x.jpg?1643451008",
    "authors": [
        {
            "name": "Yuriy Gavrilov - B[u]g - for charity.gavrilov.eth",
            "url": "https:\/\/gavrilov.info\/",
            "avatar": "https:\/\/gavrilov.info\/pictures\/userpic\/userpic@2x.jpg?1643451008"
        }
    ],
    "items": [
        {
            "id": "32",
            "url": "https:\/\/gavrilov.info\/all\/data-doc\/",
            "title": "Тестирую Yandex Data Proc",
            "content_html": "<p>Не буду описывать подробно как заказать услугу Data Proc, так как это оказалось достаточно просто.<br \/>\nГенерируем ключ ( желательно без пароля ) для более удобного доступа.<\/p>\n<pre class=\"e2-text-code\"><code class=\"\">ssh-keygen -t rsa<\/code><\/pre><p>Создаем Data Proc кластер ... next next finish ...<\/p>\n<p><b>Копируем данные на ноду:<\/b><\/p>\n<pre class=\"e2-text-code\"><code class=\"\">cat &quot;\/Users\/yuriygavrilov\/Documents\/My Tableau Repository\/Datasources\/2022.1\/en_US-US\/Sample - Superstore.txt&quot; | ssh -i \/Users\/yuriygavrilov\/ssh_key\/ya_np\/ya ubuntu@51.250.79.62 'cat | hadoop fs -put - &quot;hdfs:\/\/rc1a-dataproc-m-a1s92pxkgxp555pm.mdb.yandexcloud.net:8020\/user\/hive\/warehouse\/stor\/stor.csv&quot;'<\/code><\/pre><p><b>Обвязываем табличку:<\/b><\/p>\n<pre class=\"e2-text-code\"><code class=\"\">create external table store \n(Row_ID\tstring\t,\nOrder_ID\tstring\t,\nOrder_Date\tstring\t,\nShip_Date\tstring\t,\nShip_Mode\tstring\t,\nCustomer_ID\tstring\t,\nCustomer_Name\tstring\t,\nSegment\tstring\t,\nCountry_Region\tstring\t,\nCity\tstring\t,\nState\tstring\t,\nPostal_Code\tstring\t,\nRegion\tstring\t,\nProduct_ID\tstring\t,\nCategory\tstring\t,\nSub_Category\tstring\t,\nProduct_Name\tstring\t,\nSales\tstring\t,\nQuantity\tstring\t,\nDiscount\tstring\t,\nProfit\tstring\t\n)       \nROW FORMAT DELIMITED\nFIELDS TERMINATED BY '\\t'\nSTORED AS TEXTFILE\nLOCATION 'hdfs:\/\/rc1a-dataproc-m-a1s92pxkgxp555pm.mdb.yandexcloud.net:8020\/user\/hive\/warehouse\/stor\/'\ntblproperties (&quot;skip.header.line.count&quot;=&quot;1&quot;);<\/code><\/pre><p><b>Создаем таблицу итогов продаж по регионам:<\/b><\/p>\n<pre class=\"e2-text-code\"><code class=\"\">create table region_sales\n(region string,\nsales float \n);<\/code><\/pre><p><b>Загружаем данные:<\/b><\/p>\n<pre class=\"e2-text-code\"><code class=\"\">insert into region_sales (region, sales) select region, sum(REPLACE(sales, &quot;,&quot;, &quot;.&quot;)) as sales from store group by region ;<\/code><\/pre><p><b>Проверяем итоги: <\/b><\/p>\n<pre class=\"e2-text-code\"><code class=\"\">select * from region_sales<\/code><\/pre><p>Central\t501239.9<br \/>\nEast\t678781.25<br \/>\nSouth\t391721.9<br \/>\nWest\t725457.8<\/p>\n<p><b>Все ровно) <\/b><br \/>\nно вот запросы исполняются достаточно долго 12 секунд, но никто и не обещал скорость на малых данных.<\/p>\n<p>В целом очень удобно. Заказал, загрузил, посчитал и выключил.<\/p>\n<p><b>А теперь тестируем Спарк:<\/b><\/p>\n<pre class=\"e2-text-code\"><code class=\"\">spark-shell\nimport spark.implicits._\nimport spark.sql\nsql(&quot;SELECT region, sum(sales) FROM store_orc group by Region&quot;).show()\n+-------+------------------+\n| region|        sum(sales)|\n+-------+------------------+\n|  South|391721.90536534786|\n|Central|  501239.889593184|\n|   East| 678781.2377765179|\n|   West| 725457.8231142759|\n+-------+------------------+<\/code><\/pre><p>Класс!) заработало)<\/p>\n",
            "date_published": "2022-07-24T18:39:34+03:00",
            "date_modified": "2022-07-24T22:34:32+03:00",
            "tags": [
                "Data Proc",
                "hadoop",
                "yandex"
            ],
            "_date_published_rfc2822": "Sun, 24 Jul 2022 18:39:34 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "32",
            "_rss_enclosures": [],
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "highlight\/highlight.js",
                    "highlight\/highlight.css"
                ],
                "og_images": []
            }
        }
    ],
    "_e2_version": 4171,
    "_e2_ua_string": "Aegea 11.4 (v4171e)"
}