
Run Spark SQL on Amazon Athena Spark


At AWS re:Invent 2022, Amazon Athena launched support for Apache Spark. With this launch, Amazon Athena supports two open-source query engines: Apache Spark and Trino. Athena Spark allows you to build Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. Athena Spark notebooks support PySpark and notebook magics so that you can work with Spark SQL. For interactive applications, Athena Spark lets you spend less time waiting and be more productive, with application startup time under a second. And because Athena is serverless and fully managed, you can run your workloads without worrying about the underlying infrastructure.

Modern applications store massive amounts of data on Amazon Simple Storage Service (Amazon S3) data lakes, providing cost-effective and highly durable storage, and allowing you to run analytics and machine learning (ML) from your data lake to generate insights on your data. Before you run these workloads, most customers run SQL queries to interactively extract, filter, join, and aggregate data into a shape that can be used for decision-making, model training, or inference. Running SQL on data lakes is fast, and Athena provides an optimized, Trino- and Presto-compatible API that includes a powerful optimizer. In addition, organizations across multiple industries such as financial services, healthcare, and retail are adopting Apache Spark, a popular open-source, distributed processing system that is optimized for fast analytics and advanced transformations against data of any size. With support in Athena for Apache Spark, you can use both Spark SQL and PySpark in a single notebook to generate application insights or build models. Start with Spark SQL to extract, filter, and project the attributes that you want to work with. Then, to perform more complex data analysis such as regression tests and time series forecasting, you can use Apache Spark with Python, which lets you take advantage of a rich ecosystem of libraries, including data visualization in Matplotlib, Seaborn, and Plotly.

In this first post of a three-part series, we show you how to get started using Spark SQL in Athena notebooks. We demonstrate querying databases and tables in Amazon S3 and the AWS Glue Data Catalog using Spark SQL in Athena. We cover some common and advanced SQL commands used in Spark SQL, and show you how to use Python to extend your functionality with user-defined functions (UDFs) as well as to visualize queried data. In the next post, we'll show you how to use Athena Spark with open-source transactional table formats. In the third post, we'll cover analyzing data sources other than Amazon S3 using Athena Spark.

Prerequisites

To get started, you need an Athena workgroup enabled for Apache Spark and a notebook created in that workgroup (both are referenced later in this walkthrough), along with the IAM permissions described in the next section.

Provide Athena Spark access to your data through an IAM role

As you proceed through this walkthrough, we create new databases and tables. By default, Athena Spark doesn't have permission to do this. To provide this access, you can add the following inline policy to the AWS Identity and Access Management (IAM) role attached to the workgroup, supplying the Region and your account number. For more information, refer to the section To embed an inline policy for a user or role (console) in Adding IAM identity permissions (console).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadfromPublicS3",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::athena-examples-us-east-1/*",
        "arn:aws:s3:::athena-examples-us-east-1"
      ]
    },
    {
      "Sid": "GlueReadDatabases",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabases"
      ],
      "Resource": "arn:aws:glue:<region>:<account-id>:*"
    },
    {
      "Sid": "GlueReadDatabase",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/sparkblogdb",
        "arn:aws:glue:<region>:<account-id>:table/sparkblogdb/*",
        "arn:aws:glue:<region>:<account-id>:database/default"
      ]
    },
    {
      "Sid": "GlueCreateDatabase",
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase"
      ],
      "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/sparkblogdb"
      ]
    },
    {
      "Sid": "GlueDeleteDatabase",
      "Effect": "Allow",
      "Action": "glue:DeleteDatabase",
      "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/sparkblogdb",
        "arn:aws:glue:<region>:<account-id>:table/sparkblogdb/*"
      ]
    },
    {
      "Sid": "GlueCreateDeleteTablePartitions",
      "Effect": "Allow",
      "Action": [
        "glue:CreateTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition"
      ],
      "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:database/sparkblogdb",
        "arn:aws:glue:<region>:<account-id>:table/sparkblogdb/*"
      ]
    }
  ]
}
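If you prefer to attach this policy programmatically rather than through the console, the following is a minimal sketch using boto3; the role name, policy name, and policy.json file path are assumptions you would replace with your own values:

import json
import boto3

# Assumption: the policy above has been saved locally as policy.json, and the
# workgroup's execution role is named AthenaSparkWorkgroupRole
iam = boto3.client("iam")

with open("policy.json") as f:
    policy_document = json.load(f)

# Embed the JSON as an inline policy on the role attached to the Athena Spark workgroup
iam.put_role_policy(
    RoleName="AthenaSparkWorkgroupRole",
    PolicyName="AthenaSparkGlueAndS3Access",
    PolicyDocument=json.dumps(policy_document),
)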

Run SQL queries directly in the notebook without using Python

When using Athena Spark notebooks, we can run SQL queries directly without having to use PySpark. We do this by using cell magics, which are special headers in a notebook that change the cell's behavior. For SQL, we can add the %%sql magic, which interprets the entire cell contents as a SQL statement to be run on Athena Spark.
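As a quick illustration (and as we do later in this post when plotting results), a %%sql cell is equivalent to calling spark.sql on the session's built-in spark object from a PySpark cell; the following sketch shows both forms:

# A cell that starts with the %%sql magic, such as:
#   %%sql
#   SHOW DATABASES
# is equivalent to running the statement through the SparkSession object in a PySpark cell:
df = spark.sql("SHOW DATABASES")
df.show()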

Now that we have our workgroup and notebook created, let's start exploring the NOAA Global Surface Summary of Day dataset, which provides environmental measures from various locations all over the earth. The datasets used in this post are public datasets hosted in the following Amazon S3 locations:

  • Parquet data for year 2020: s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2020/
  • Parquet data for year 2021: s3://athena-examples-us-east-1/athenasparksqlblog/noaa_pq/year=2021/
  • Parquet data for year 2022: s3://athena-examples-us-east-1/athenasparksqlblog/noaa_pq/year=2022/

To use this data, we need an AWS Glue Data Catalog database that acts as the metastore for Athena, allowing us to create external tables that point to the location of the datasets in Amazon S3. First, we create a database in the Data Catalog using Athena and Spark.

Create a database

Run the following SQL in your notebook using the %%sql magic:

%%sql 
CREATE DATABASE sparkblogdb

You get the following output:
Output of CREATE DATABASE SQL

Create a table

Now that we have created a database in the Data Catalog, we can create a partitioned table that points to our dataset stored in Amazon S3:

%%sql
CREATE EXTERNAL TABLE sparkblogdb.noaa_pq(
  station string, 
  date string, 
  latitude string, 
  longitude string, 
  elevation string, 
  name string, 
  temp string, 
  temp_attributes string, 
  dewp string, 
  dewp_attributes string, 
  slp string, 
  slp_attributes string, 
  stp string, 
  stp_attributes string, 
  visib string, 
  visib_attributes string, 
  wdsp string, 
  wdsp_attributes string, 
  mxspd string, 
  gust string, 
  max string, 
  max_attributes string, 
  min string, 
  min_attributes string, 
  prcp string, 
  prcp_attributes string, 
  sndp string, 
  frshtt string)
  PARTITIONED BY (year string)
STORED AS PARQUET
LOCATION 's3://athena-examples-us-east-1/athenasparksqlblog/noaa_pq/'

This dataset is partitioned by year, meaning that we store the data files for each year separately, which simplifies management and improves performance because queries can target the specific S3 locations they need. The Data Catalog knows about the table, and now we let it discover how many partitions we have automatically by using the MSCK utility:

%%sql
MSCK REPAIR TABLE sparkblogdb.noaa_pq
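If you prefer to register a specific partition yourself instead of letting MSCK scan the table location, you can also add one explicitly; a minimal sketch, assuming the 2021 prefix listed earlier:

%%sql
ALTER TABLE sparkblogdb.noaa_pq ADD IF NOT EXISTS
PARTITION (year='2021')
LOCATION 's3://athena-examples-us-east-1/athenasparksqlblog/noaa_pq/year=2021/'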

When the repair statement is complete, you can run the following command to list the yearly partitions that were found in the table:

%%sql
SHOW PARTITIONS sparkblogdb.noaa_pq

Output of SHOW PARTITIONS SQL

Now that we have the table created and partitions added, let's run a query to find the minimum recorded temperature for the 'SEATTLE TACOMA AIRPORT, WA US' location:

%%sql
select year, min(MIN) as minimum_temperature 
from sparkblogdb.noaa_pq 
where name = 'SEATTLE TACOMA AIRPORT, WA US' 
group by 1

You get the following output:

The image shows output of previous SQL statement.

Query a cross-account Data Catalog from Athena Spark

Athena supports accessing cross-account AWS Glue Data Catalogs, which enables you to use Spark SQL in Athena Spark to query a Data Catalog in an authorized AWS account.

The cross-account Data Catalog access pattern is often used in a data mesh architecture, where a data producer wants to share a catalog and data with consumer accounts. The consumer accounts can then perform data analysis and exploration on the shared data. This is a simplified model in which we don't need to use AWS Lake Formation data sharing. The following diagram gives an overview of how the setup works between one producer and one consumer account, which can be extended to multiple producer and consumer accounts.

[Diagram: cross-account Data Catalog access between one producer and one consumer account]

You need to set up the right access policies on the Data Catalog of the producer account to enable cross-account access. Specifically, you must make sure the consumer account's IAM role used to run Spark calculations on Athena has access to the cross-account Data Catalog and the data in Amazon S3. For setup instructions, refer to Configuring cross-account AWS Glue access in Athena for Spark.

There are two ways the consumer account can access the cross-account Data Catalog from Athena Spark, depending on whether you are querying from one producer account or multiple.

Query a single producer table

If you're only querying data from a single producer's AWS account, you can tell Athena Spark to use that account's catalog alone to resolve database objects. When using this option, you don't have to modify the SQL, because you configure the AWS account ID at the session level. To enable this method, edit the session and set the property "spark.hadoop.hive.metastore.glue.catalogid": "999999999999" using the following steps:

  1. In the notebook editor, on the Session menu, choose Edit session.
    [Image: where to choose Edit session]
  2. Choose Edit in JSON.
  3. Add the following property and choose Save:
    {"spark.hadoop.hive.metastore.glue.catalogid": "999999999999"}
    [Image: where to put the JSON config property to query a single producer]
    This starts a new session with the updated parameters.
  4. Run the following SQL statement in Spark to query tables from the producer account's catalog (a quick verification sketch follows this list):
    %%sql
    SELECT * 
    FROM <central-catalog-db>.<table> 
    LIMIT 10
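As a quick check that the session is now resolving names against the producer account's catalog, you could also list the databases Spark sees; the databases returned should come from the producer's Data Catalog rather than your own:

%%sql
SHOW DATABASES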

Query multiple producer tables

Alternatively, you can add the producer AWS account ID to each database name, which is helpful if you're going to query Data Catalogs from different owners. To enable this method, set the property {"spark.hadoop.aws.glue.catalog.separator": "/"} when invoking or editing the session (using the same steps as the previous section). Then you add the AWS account ID for the source Data Catalog as part of the database name:

%%sql
SELECT * 
FROM `<producer-account1-id>/database1`.table1 t1 
join `<producer-account2-id>/database2`.table2 t2 
ON t1.id = t2.id
limit 10

If the S3 bucket belonging to the producer AWS account is configured with Requester Pays enabled, the consumer is charged instead of the bucket owner for requests and downloads. In this case, you can add the following property when invoking or editing an Athena Spark session to read data from these buckets:

{"spark.hadoop.fs.s3.useRequesterPaysHeader": "true"}

Infer the schema of your data in Amazon S3 and join with tables crawled in the Data Catalog

Rather than only going through the Data Catalog to understand the table structure, Spark can infer the schema and read data directly from storage. This feature allows data analysts and data scientists to perform quick exploration of the data without needing to create a database or table, and the result can also be used alongside existing tables stored in the Data Catalog in the same account or across different accounts. To do this, we use a Spark temp view, an in-memory structure that stores the schema of the data held in a data frame.

Using the NOAA dataset partition for 2020, we create a temporary view by reading the S3 data into a data frame:

year_20_pq = spark.read.parquet("s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2020/")
year_20_pq.createOrReplaceTempView("y20view")
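Before querying the view, it can be helpful to confirm what Spark inferred from the Parquet files; a quick sketch using the data frame created above:

# Print the schema Spark inferred from the Parquet files
year_20_pq.printSchema()

# Preview a few rows to sanity-check the data
year_20_pq.show(5)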

Now you can query y20view using Spark SQL as if it were a Data Catalog database:

%%sql
select count(*) 
from y20view

Output of previous SQL query showing count value

You can query data from both temporary views and Data Catalog tables in the same Spark query. For example, now that we have a table containing data for years 2021 and 2022, and a temporary view with 2020's data, we can find the dates in each year when the maximum temperature was recorded for 'SEATTLE TACOMA AIRPORT, WA US'.

To do this, we can use a window function together with UNION:

%%sql
SELECT date,
       max as maximum_temperature
FROM (
        SELECT date,
            max,
            RANK() OVER (
                PARTITION BY year
                ORDER BY max DESC
            ) rnk
        FROM sparkblogdb.noaa_pq
        WHERE name = 'SEATTLE TACOMA AIRPORT, WA US'
          AND year IN ('2021', '2022')
        UNION ALL
        SELECT date,
            max,
            RANK() OVER (
                ORDER BY max DESC
            ) rnk
        FROM y20view
        WHERE name = 'SEATTLE TACOMA AIRPORT, WA US'
    ) t
WHERE rnk = 1
ORDER by 1

You get the following output:

Output of previous SQL

Extend your SQL with a UDF in Spark SQL

You can extend your SQL functionality by registering and using a custom user-defined function (UDF) in Athena Spark. These UDFs complement the common predefined functions available in Spark SQL, and once created, can be reused many times within a given session.

In this section, we walk through a straightforward UDF that converts a numeric month value into the full month name. You have the option to write the UDF in either Java or Python.

Java-based UDF

The Java code for the UDF can be found in the GitHub repository. For this post, we have uploaded a prebuilt JAR of the UDF to s3://athena-examples-us-east-1/athenasparksqlblog/udf/month_number_to_name.jar.

To register the UDF, we use Spark SQL to create a temporary function:

%%sql
CREATE OR REPLACE TEMPORARY FUNCTION 
month_number_to_name as 'com.example.MonthNumbertoNameUDF'
using jar "s3a://athena-examples-us-east-1/athenasparksqlblog/udf/month_number_to_name.jar";

Now that the UDF is registered, we can call it in a query to find the minimum recorded temperature for each month of 2022:

%%sql
select month_number_to_name(month(to_date(date,'yyyy-MM-dd'))) as month_yr_21,
min(min) as min_temp
from sparkblogdb.noaa_pq 
where name == 'SEATTLE TACOMA AIRPORT, WA US' 
group by 1 
order by 2

You get the following output:

Output of SQL using UDF

Python-based UDF

Now let's see how to add a Python UDF to the existing Spark session. The Python code for the UDF can be found in the GitHub repository. For this post, the code has been uploaded to s3://athena-examples-us-east-1/athenasparksqlblog/udf/month_number_to_name.py.

Python UDFs can't be registered directly in Spark SQL, so instead we use a small piece of PySpark code to add the Python file, import the function, and then register it as a UDF:

sc.addPyFile('s3://athena-examples-us-east-1/athenasparksqlblog/udf/month_number_to_name.py')

from month_number_to_name import month_number_to_name
spark.udf.register("month_number_to_name_py",month_number_to_name)
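For reference, the function itself can be very small; the following is a hypothetical sketch of what month_number_to_name.py might contain (the actual implementation is in the GitHub repository):

# Hypothetical sketch of month_number_to_name.py
import calendar

def month_number_to_name(month_number):
    # Map a 1-12 month number (passed as an int or numeric string) to its full English name
    return calendar.month_name[int(month_number)]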

Now that the Python-based UDF is registered, we can use the same query from earlier to find the minimum recorded temperature for each month of 2022. The fact that it's Python rather than Java doesn't matter now:

%%sql
select month_number_to_name_py(month(to_date(date,'yyyy-MM-dd'))) as month_yr_21,
min(min) as min_temp
from sparkblogdb.noaa_pq 
where name == 'SEATTLE TACOMA AIRPORT, WA US' 
group by 1 
order by 2

The output should be similar to that in the preceding section.

Plot visuals from the SQL queries

It's easy to use Spark SQL for data exploration, including across AWS accounts, and it isn't complicated to extend Athena Spark with UDFs. Now let's see how we can go beyond SQL, using Python within the same Spark session to visualize data and look for patterns. We use the table and temporary views created previously to generate a pie chart that shows the percentage of readings taken in each year for the station 'SEATTLE TACOMA AIRPORT, WA US'.

Let's start by creating a Spark data frame from a SQL query and converting it to a pandas data frame:

#we will use spark.sql instead of the %%sql magic so we can enclose the query in a string
#this allows us to read the results of the query into a data frame to use with our plot command
sqlDF = spark.sql("""select year, count(*) as cnt from sparkblogdb.noaa_pq where name = 'SEATTLE TACOMA AIRPORT, WA US' group by 1
                  union all
                  select 2020 as year, count(*) as cnt from y20view where name = 'SEATTLE TACOMA AIRPORT, WA US'""")

#convert to pandas data frame
seatac_year_counts=sqlDF.toPandas()

Next, the following code uses the pandas data frame and the Matplotlib library to plot a pie chart:

import matplotlib.pyplot as plt

# clear the state of the visualization figure
plt.clf()

# create a pie chart with values from the 'cnt' field, and yearly labels
plt.pie(seatac_year_counts.cnt, labels=seatac_year_counts.year, autopct="%1.1f%%")
%matplot plt

The following figure shows our output.

Output of code showing pie chart

Clean up

To clean up the resources created for this post, complete the following steps:

  1. Run the following SQL statements in the notebook's cells to delete the database and tables from the Data Catalog:
    %%sql
    DROP TABLE sparkblogdb.noaa_pq
    
    %%sql
    DROP DATABASE sparkblogdb

  2. Delete the workgroup created for this post. This also deletes the saved notebooks that are part of the workgroup.
  3. Delete the S3 bucket that you created as part of the workgroup.

Conclusion

Athena Spark makes it easier than ever to query databases and tables in the AWS Glue Data Catalog directly through Spark SQL in Athena, and to query data directly from Amazon S3 without needing a metastore for quick data exploration. It also makes it easy to use common and advanced Spark SQL commands, including registering UDFs for custom functionality. Additionally, Athena Spark makes it simple to use Python in a fast-starting notebook environment to visualize and analyze data queried via Spark SQL.

Overall, Spark SQL unlocks the ability to go beyond standard SQL in Athena, giving advanced users more flexibility and power through both SQL and Python in a single integrated notebook, and providing fast, complex analysis of data in Amazon S3 without infrastructure setup. To learn more about Athena Spark, refer to Amazon Athena for Apache Spark.


About the Authors

Pathik Shah is a Sr. Analytics Architect on Amazon Athena. He joined AWS in 2015 and has been focusing on the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Raj Devnath is a Product Manager at AWS on Amazon Athena. He is passionate about building products customers love and helping customers extract value from their data. His background is in delivering solutions for multiple end markets, such as finance, retail, smart buildings, home automation, and data communication systems.
