Upgrade Your Hive Tables to Unity Catalog


In this blog we will demonstrate, with examples, how you can seamlessly upgrade your Hive metastore (HMS)* tables to Unity Catalog (UC) using different methodologies, depending on the variations of the HMS tables being upgraded.

*Note: Your Hive metastore could be the default or an external metastore, or even AWS Glue Data Catalog. For simplicity, we have used the term “Hive metastore” throughout this document.

Before we dive into the details, let us look at the steps we would take for the upgrade:

  1. Assess – In this step, you assess the existing HMS tables identified for the upgrade so that you can determine the appropriate upgrade approach. This step is discussed in this blog.
  2. Create – In this step, you create the required UC assets, such as the Metastore, Catalog, Schema, Storage Credentials, and External Locations. For details, refer to the documentation – AWS, Azure, GCP.
  3. Upgrade – In this step, you follow the guidance to upgrade the tables from HMS to UC. This step is discussed in this blog.
  4. Grant – In this step, you need to provide grants on the newly upgraded UC tables to principals so that they can access the UC tables; a sketch follows this list. For details, refer to the documentation – AWS, Azure, GCP.
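For example, granting read access on an upgraded table might look like the following minimal sketch; the catalog (main), schema (upgraded_db), table (people_delta), and group (data_engineers) names are placeholders, not names from this blog's examples:

%sql
-- Allow a group to see and use the containing catalog and schema
GRANT USE CATALOG ON CATALOG main TO `data_engineers`;
GRANT USE SCHEMA ON SCHEMA main.upgraded_db TO `data_engineers`;
-- Allow the group to read the newly upgraded UC table
GRANT SELECT ON TABLE main.upgraded_db.people_delta TO `data_engineers`;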

Unity Catalog, now generally available on all three cloud platforms (AWS, Azure, and GCP), simplifies security and governance of your data with the following key features:

  • Define once, secure everywhere: Unity Catalog offers a single place to administer data access policies that apply across all workspaces.
  • Standards-compliant security model: Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (also called schemas), tables, and views.
  • Built-in auditing and lineage: Unity Catalog automatically captures user-level audit logs that record access to your data. Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages.
  • Data discovery: Unity Catalog lets you tag and document data assets, and provides a search interface to help data consumers find data.
  • System tables (Public Preview): Unity Catalog lets you easily access and query your account’s operational data, including audit logs, billable usage, and lineage.
  • Data Sharing: Delta Sharing is an open protocol developed by Databricks for secure data sharing with other organizations regardless of the computing platforms they use. Databricks has built Delta Sharing into its Unity Catalog data governance platform, enabling a Databricks user, called a data provider, to share data with a person or group outside of their organization, called a data recipient.

All these rich features, available out of the box with Unity Catalog (UC), are not available in your Hive metastore today and would take a significant amount of your resources to build from the ground up. Additionally, since most (if not all) of the newer Databricks features, such as Lakehouse Monitoring, Lakehouse Federation, and LakehouseIQ, are built on, governed by, and require Unity Catalog as a prerequisite to function, delaying the upgrade of your data assets from HMS to UC would limit your ability to take advantage of these newer product features.

Hence, one question that comes to mind is how you can easily upgrade the tables registered in your existing Hive metastore to the Unity Catalog metastore so that you can take advantage of all the rich features Unity Catalog offers. In this blog, we will walk you through the considerations and methodologies, with examples, for upgrading your HMS tables to UC.

Upgrade Considerations and Prerequisites

In this section we review considerations for the upgrade before diving deeper into the Upgrade Methodologies in the next section.

Upgrade Considerations

Variation among Hive metastore tables is one such consideration. Hive metastore tables identified for upgrade to Unity Catalog may have been created by mixing and matching the variations of each parameter shown below. For example, one may have created a CSV managed table using the DBFS root location, or a Parquet external table on an Amazon S3 location. This section describes the parameters based on which the different variations of tables may have been created in your Hive metastore.

For each parameter, the possible variations and how to identify them are listed below:

  • Table Type
    • MANAGED – Run desc extended hive_metastore.<schema name>.<table name> and check the value of the field “Type”. It should say “MANAGED”.
    • EXTERNAL – Run desc extended hive_metastore.<schema name>.<table name> and check the value of the field “Type”. It should say “EXTERNAL”.
  • Data Storage Location
    • DBFS root storage location – Run desc extended hive_metastore.<schema name>.<table name> and check the value of the field “Location”. It should start with “dbfs:/user/hive/warehouse/”.
    • DBFS-mounted cloud object storage – Run desc extended hive_metastore.<schema name>.<table name> and check the value of the field “Location”. It should start with “dbfs:/mnt/”.
    • Directly specified cloud storage location (such as s3://, abfss://, or gs://) – Run desc extended hive_metastore.<schema name>.<table name> and check the value of the field “Location”. It should start with “s3://”, “abfss://”, or “gs://”.
  • Table File Format and Interface
    • File formats such as Delta, Parquet, or Avro – Run desc extended hive_metastore.<schema name>.<table name> and check the value of the field “Provider”. It should say, for example, “delta” or “parquet”.
    • Hive SerDe interface – Run desc extended hive_metastore.<schema name>.<table name> and check the value of the field “Provider”. It should say “hive”.
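For instance, checking one of the sample tables used later in this blog (the table name here is only for illustration):

%sql
-- The output includes rows for "Type" (MANAGED/EXTERNAL), "Location", and "Provider"
desc extended hive_metastore.hmsdb_upgrade_db.people_delta;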

Depending on the variations of the parameters mentioned above, the appropriate upgrade methodology may differ. Details are discussed in the Upgrade Methodologies section below.

Another point should be considered before you start the upgrade of HMS tables to UC on Azure Databricks:

For the Azure cloud – Tables currently stored on Blob Storage (wasb) or ADLS Gen1 (adl) need to be upgraded to ADLS Gen2 (abfss). Otherwise, an error is raised if you try to use unsupported Azure cloud storage with Unity Catalog.

Error example: Table is not eligible for an upgrade from Hive Metastore to Unity Catalog. Reason: Unsupported file system scheme wasbs.

Upgrade Prerequisites

Before starting the upgrade process, storage credentials and external locations should be created as shown in the steps below.

  1. Create storage credential(s) with access to the target cloud storage.
  2. Create external location(s) pointing to the target cloud storage using the storage credential(s); a sketch follows this list.
    • The external locations are used for creating UC external tables, managed catalogs, or managed schemas.
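A minimal sketch of step 2 is shown below. It assumes a storage credential named my_storage_cred already exists (created, for example, through the Catalog Explorer UI), and the location name and bucket path are placeholders:

%sql
-- Bind a cloud storage path to an existing storage credential
CREATE EXTERNAL LOCATION IF NOT EXISTS upgrade_ext_loc
URL 's3://my-bucket/upgrade-path/'
WITH (STORAGE CREDENTIAL my_storage_cred)
COMMENT 'External location for tables upgraded from HMS';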

Upgrade Methodologies

In this section we show you all the different upgrade options in the form of a matrix. We also use diagrams to show the steps involved in upgrading.

There are two primary methods for upgrading: using SYNC (for supported scenarios) or using data replication (where SYNC is not supported).

  • Using SYNC – For all the supported scenarios (as shown in the Upgrade Matrix section below), use SYNC to upgrade HMS tables to UC. Using SYNC allows you to upgrade tables without data replication.
  • Using data replication – For all unsupported scenarios (as shown in the Upgrade Matrix section below), use either Create Table As Select (CTAS) or DEEP CLONE*. This method requires data replication.

*Note – Consider using DEEP CLONE for HMS Parquet and Delta tables to copy the data and upgrade the tables from HMS to UC. Use Create Table As Select (CTAS) for other file formats.
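For illustration, a SYNC-based upgrade of a single table might look like the sketch below; the target catalog and schema (main.upgraded_db) are placeholders:

%sql
-- Validate eligibility first without making any changes
SYNC TABLE main.upgraded_db.people_delta FROM hive_metastore.hmsdb_upgrade_db.people_delta DRY RUN;

-- Create the UC external table pointing at the same underlying files
SYNC TABLE main.upgraded_db.people_delta FROM hive_metastore.hmsdb_upgrade_db.people_delta;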

The diagrams below describe the upgrade steps for each method. To understand which method to use for your upgrade use case, refer to the Upgrade Matrix section below.

Pictorial Representation of the Upgrade

Diagram 1 – Upgrading HMS tables to UC using SYNC (without data replication)

Diagram Keys:

  1. HMS managed and external tables store data as a directory of files on cloud object storage.
  2. The SYNC command is used to upgrade table metadata from HMS to UC. The target UC tables are external, irrespective of the source HMS table types.
  3. No data is copied when the SYNC command is used to upgrade tables from HMS to UC. The target UC external table refers to the same underlying cloud storage location used by the source HMS table.
  4. A storage credential represents an authentication and authorization mechanism for accessing data stored in your cloud tenant.
  5. An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path.

Diagram 2 – Upgrading HMS tables to UC with data replication

Diagram Keys:

  1. HMS managed and external tables store data as a directory of files on the DBFS root storage location.
  2. CTAS or DEEP CLONE creates the UC target table metadata from the HMS table. You can choose to upgrade to an external or managed table, irrespective of the HMS table type.
  3. CTAS or DEEP CLONE copies the data from DBFS root storage to the target cloud storage.
  4. A storage credential represents an authentication and authorization mechanism for accessing data stored in your cloud tenant.
  5. An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path.
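As an illustration of the data replication path, both variants are sketched below; the target names are placeholders, and DEEP CLONE here assumes a Delta or Parquet source table:

%sql
-- Option 1: DEEP CLONE copies the data and metadata into UC
CREATE TABLE IF NOT EXISTS main.upgraded_db.people_delta
DEEP CLONE hive_metastore.hmsdb_upgrade_db.people_delta;

-- Option 2: CTAS rewrites the data, letting you switch the format to Delta
CREATE TABLE IF NOT EXISTS main.upgraded_db.people_as_delta
AS SELECT * FROM hive_metastore.hmsdb_upgrade_db.people_parquet;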

Upgrade Matrix

The tables below showcase the different possibilities for upgrading HMS tables to UC tables. For each scenario, we provide steps that you can follow for the upgrade.

HMS Storage Format Using DBFS Root Storage

Example 1

HMS Table Type: Managed
Description: The data files for managed tables reside within DBFS root (the default location for the Databricks-managed HMS database).
Example of HMS table:

%sql

create table if not exists hive_metastore.hmsdb_upgrade_db.people_parquet
using parquet
as select * from parquet.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.parquet/` limit 100;

Target UC table type: External or Managed
Target UC data file format: Delta is the preferred file format for both managed and external tables. External tables do support non-Delta file formats. (See Note 1.)
Upgrade method: CTAS or Deep Clone (a sketch follows Note 1 below)

Example 2

HMS Table Type: External
Description: The data files for the external table reside within DBFS root. The table definition has a “Location” clause, which makes the table external.
Example of HMS table:

%sql

create table if not exists hive_metastore.hmsdb_upgrade_db.people_parquet
using parquet
location "dbfs:/user/hive/warehouse/hmsdb_upgrade_db.db/people_parquet"
as
select * from parquet.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.parquet/` limit 100;

Target UC table type: External or Managed
Target UC data file format: Delta is the preferred file format for both managed and external tables. External tables do support non-Delta file formats. (See Note 1.)
Upgrade method: CTAS or Deep Clone

Note 1 – Ideally, change the format to Delta while upgrading with CTAS.
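A minimal sketch of the CTAS upgrade for Example 1, converting the Parquet source into a UC Delta table (the target catalog and schema names are placeholders):

%sql
-- CTAS copies the data out of DBFS root into UC storage, defaulting to Delta
create table if not exists main.upgraded_db.people_parquet_upgraded
as select * from hive_metastore.hmsdb_upgrade_db.people_parquet;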

HMS Hive SerDe Tables

Example 3

HMS Table Type: Hive SerDe, external or managed (see Note 2)
Description: These are tables created using the Hive SerDe interface. Refer to this link to learn more about Hive tables on Databricks.
Example of HMS table:

%sql

CREATE TABLE if not exists hive_metastore.hmsdb_upgrade_db.parquetExample (id int, name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
location "s3://databricks-dkushari/sync-test/parquetexample";

Target UC table type: External or Managed
Target UC data file format: Delta is the preferred file format for both managed and external tables. External tables do support non-Delta file formats. (See Note 3.)
Upgrade method: CTAS or Deep Clone (a sketch follows the notes below)

Note 2 – Irrespective of the underlying storage format, Hive SerDe tables follow the same upgrade path.
Note 3 – Ideally, change the format to Delta while doing the upgrade using CTAS.
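A minimal sketch of the CTAS upgrade for Example 3; the target catalog and schema names are placeholders, and CTAS rewrites the SerDe data as a UC Delta table:

%sql
create table if not exists main.upgraded_db.parquetexample_upgraded
as select * from hive_metastore.hmsdb_upgrade_db.parquetExample;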

HMS Storage Format Using DBFS-Mounted Storage

Example 4

HMS Table Type: Managed
Description: This is when the parent database has its location set to an external path, e.g., a mounted path from the object store. The table is created without a location clause, and the table data is stored under that default database path.
Example of HMS table:

%sql
create database if not exists hive_metastore.hmsdb_upgrade_db location "dbfs:/mnt/test-mnt/hmsdb_upgrade_db/";

create table if not exists hive_metastore.hmsdb_upgrade_db.people_delta
as
select * from delta.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta` limit 100;

Target UC table type: External
Target UC data file format: Same as that of the HMS source data file format
Upgrade method:
  1. Run SYNC to create the UC external table
  2. Convert the HMS managed table to an HMS external table (code provided in the Appendix below)
  3. Drop the HMS table after all dependencies are resolved, so that there is no way to access the data
  4. Unmount the mount point after all dependencies are resolved, so that there is no way to access the data using mount points (see Note 4)

Example 5

HMS Table Type: Managed (same source table as Example 4)
Target UC table type: Managed
Target UC data file format: Delta
Upgrade method: CTAS or Deep Clone

Example 6

HMS Table Type: External
Description: The table is created with a location clause and a path specifying a mounted path from a cloud object store.
Example of HMS table:

%sql
create database if not exists hive_metastore.hmsdb_upgrade_db location "dbfs:/mnt/test-mnt/hmsdb_upgrade_db/";

create table if not exists hive_metastore.hmsdb_upgrade_db.people_delta
location "dbfs:/mnt/test-mnt/hmsdb_upgrade_db/people_delta"
as
select * from delta.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta` limit 100;

Target UC table type: External
Target UC data file format: Same as that of the HMS source data file format
Upgrade method:
  1. Run SYNC (blog) to create the UC external table
  2. Drop the HMS table after all dependencies are resolved, so that there is no way to access the data
  3. Unmount the mount point after all dependencies are resolved, so that there is no way to access the data

Example 7

HMS Table Type: External (same source table as Example 6)
Target UC table type: Managed
Target UC data file format: Delta
Upgrade method: CTAS or Deep Clone

Note 4 – Make sure the HMS table is dropped separately after conversion to an external table. If the HMS database/schema was defined with a location, and the database is dropped with the cascade option, the underlying data will be lost and the upgraded UC tables will lose their data.
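As a sketch of the cleanup steps above (run only after the upgraded UC table is verified and all dependencies are resolved; unmounting itself happens outside SQL, e.g., with dbutils.fs.unmount on the mount path):

%sql
-- Drop only the table, never the schema with CASCADE, so the files survive
DROP TABLE IF EXISTS hive_metastore.hmsdb_upgrade_db.people_delta;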

HMS Storage Format Using Cloud Object Storage

Example 8

HMS Table Type: Managed
Description: The parent database has its location set to an external path, e.g., a cloud object store. The table is created without a location clause, and the table data is stored under that default database path.
Example of HMS table:

%sql

create database if not exists hive_metastore.hmsdb_upgrade_db location "s3://databricks-dkushari/hmsdb_upgrade_db/";

create table if not exists hive_metastore.hmsdb_upgrade_db.people_delta
as
select * from delta.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta` limit 100;

Target UC table type: External
Target UC data file format: Same as that of the HMS source data file format
Upgrade method:
  1. Run SYNC to create the UC external table
  2. Convert the HMS managed table to an HMS external table (code provided in the Appendix below)
  3. Drop the HMS table after all dependencies are resolved, so that there is no way to access the data (see Note 4)

Example 9

HMS Table Type: Managed (same source table as Example 8)
Target UC table type: Managed
Target UC data file format: Delta
Upgrade method: CTAS or Deep Clone

Example 10

HMS Table Type: External
Description: The table is created with a location clause and a path specifying a cloud object store.
Example of HMS table:

%sql

create table if not exists hive_metastore.hmsdb_upgrade_db.people_delta
location "s3://databricks-dkushari/hmsdb_upgrade_db/people_delta"
as
select * from delta.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta` limit 100;

Target UC table type: External
Target UC data file format: Same as that of the HMS source data file format
Upgrade method:
  1. Run SYNC (blog) to create the UC external table
  2. Drop the HMS table after all dependencies are resolved, so that there is no way to access the data

Example 11

HMS Table Type: External (same source table as Example 10)
Target UC table type: Managed
Target UC data file format: Delta
Upgrade method: CTAS or Deep Clone
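When most tables in a schema fall into the SYNC-supported scenarios above, the whole schema can be upgraded in one pass. A minimal sketch, with placeholder catalog and schema names:

%sql
-- Preview which tables are eligible, then upgrade them all to UC external tables
SYNC SCHEMA main.upgraded_db FROM hive_metastore.hmsdb_upgrade_db DRY RUN;
SYNC SCHEMA main.upgraded_db FROM hive_metastore.hmsdb_upgrade_db;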

Examples of the Upgrade

In this section, we provide a Databricks Notebook with examples for each scenario discussed above.

Conclusion

In this blog, we have shown how you can upgrade your Hive metastore tables to the Unity Catalog metastore. Please refer to the Notebook to try the different upgrade options. You can also refer to the Demo Center to get started with automating the upgrade process. To automate upgrading Hive metastore tables to Unity Catalog, we recommend you use this Databricks Labs repository.

Upgrade your tables to Unity Catalog today and benefit from its unified governance features. After upgrading to UC, you can drop the Hive metastore schemas and tables if you no longer need them. Dropping an external table does not modify the data files in your cloud tenant. Take precautions (as described in this blog) while dropping managed tables or schemas containing managed tables.

Appendix

The following Scala snippet (referenced in the upgrade steps above) converts an HMS managed table to an external table by rewriting its catalog metadata:

import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}
import org.apache.spark.sql.catalyst.TableIdentifier

// Replace with your own table and database names
val tableName = "table"
val dbName = "dbname"

// Fetch the current table metadata from the session catalog
val oldTable: CatalogTable = spark.sessionState.catalog.getTableMetadata(
  TableIdentifier(tableName, Some(dbName)))

// Copy the metadata with the table type switched to EXTERNAL
val alteredTable: CatalogTable = oldTable.copy(tableType = CatalogTableType.EXTERNAL)

// Persist the altered definition back to the Hive metastore
spark.sessionState.catalog.alterTable(alteredTable)
