Create, train, and deploy an Amazon Redshift ML model integrating features from Amazon SageMaker Feature Store
Amazon Redshift is a fast, petabyte-scale cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies. Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using SQL commands familiar to many roles such as executives, business analysts, and data analysts. We covered in a previous post how you can use data in Amazon Redshift to train models in Amazon SageMaker, a fully managed ML service, and then make predictions within your Redshift data warehouse.

Redshift ML currently supports ML algorithms such as XGBoost, multilayer perceptron (MLP), K-means, and Linear Learner. Additionally, you can import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. However, one challenge in training a production-ready ML model using SageMaker Feature Store is access to a diverse set of features that aren’t always owned and maintained by the team that is building the model. For example, an ML model to identify fraudulent financial transactions needs access to both identifying (device type, browser) and transaction (amount, credit or debit, and so on) related features. As a data scientist building an ML model, you may have access to the identifying information but not the transaction information, and having access to a feature store solves this.

In this post, we discuss the combined feature store pattern, which allows teams to maintain their own local feature stores using a local Redshift table while still being able to access shared features from the centralized feature store. In a local feature store, you can store sensitive data that can’t be shared across the organization for regulatory and compliance reasons.

We also show you how to use familiar SQL statements to create and train ML models by combining shared features from the centralized store with local features, and how to use these models to make in-database predictions on new data for use cases such as fraud risk scoring.

Overview of solution

For this post, we create an ML model to predict whether a transaction is fraudulent, given the transaction record. To build this, we need to engineer features that describe an individual credit card’s spending pattern, such as the number of transactions or the average transaction amount, as well as information about the merchant, the cardholder, the device used to make the payment, and any other data that may be relevant to detecting fraud.

To get started, we need an Amazon Redshift Serverless data warehouse with the Redshift ML feature enabled and an Amazon SageMaker Studio environment with access to SageMaker Feature Store. For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

We also need an offline feature store to store features in feature groups. The offline store uses an Amazon Simple Storage Service (Amazon S3) bucket for storage and can also fetch data using Amazon Athena queries. For an introduction to SageMaker Feature Store and instructions on setting it up, see Getting started with Amazon SageMaker Feature Store.

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

  1. Create the offline feature group in SageMaker Feature Store and ingest data into the feature group.
  2. Create a Redshift table and load local feature data into the table.
  3. Create an external schema for Amazon Redshift Spectrum to access the offline store data stored in Amazon S3 using the AWS Glue Data Catalog.
  4. Train and validate a fraud risk scoring ML model using local feature data and external offline feature store data.
  5. Use the offline feature store and local store for inference.

Dataset

To demonstrate this use case, we use a synthetic dataset with two tables: identity and transactions. They can both be joined by the TransactionID column. The transaction table contains information about a particular transaction, such as amount and credit or debit card, and the identity table contains information about the user, such as device type and browser. A transaction must exist in the transaction table, but might not always be available in the identity table.

The following is an example of the transactions dataset.

The following is an example of the identity dataset.
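To make the relationship between the two tables concrete, here is a small pandas sketch (column values are illustrative; only TransactionID and isFraud come from the actual schema). The left join mirrors the fact that a transaction may have no matching identity record:

```python
import pandas as pd

# Toy stand-ins for the two source tables; column values are illustrative.
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [49.99, 120.00, 8.50],
    "isFraud": [0, 1, 0],
})
identity = pd.DataFrame({
    "TransactionID": [1, 3],  # TransactionID 2 has no identity record
    "DeviceType": ["mobile", "desktop"],
})

# Every transaction survives the join; missing identity fields become NaN.
joined = transactions.merge(identity, on="TransactionID", how="left")
```

This is the same join shape the Redshift view later reproduces in SQL with a LEFT JOIN on the TransactionId column.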

Let’s assume that within the organization, data science teams centrally manage the identity data and process it to extract features into a centralized offline feature store. The data warehouse team ingests and analyzes transaction data in a Redshift table that it owns.

We work through this use case to understand how the data warehouse team can securely retrieve the latest features from the identity feature group and join them with transaction data in Amazon Redshift to create a feature set for training and inferencing a fraud detection model.

Create the offline feature group and ingest data

To start, we set up SageMaker Feature Store, create a feature group for the identity dataset, inspect and process the dataset, and ingest some sample data. We then prepare the transaction features from the transaction data and store them in Amazon S3 for later loading into the Redshift table.

Alternatively, you can author features using Amazon SageMaker Data Wrangler, create feature groups in SageMaker Feature Store, and ingest features in batches using an Amazon SageMaker Processing job with a notebook exported from SageMaker Data Wrangler. This mode allows for batch ingestion into the offline store.

Let’s explore some of the key steps in this section.

  1. Download the sample notebook.
  2. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  3. Locate your notebook instance and choose Open Jupyter.
  4. Choose Upload and upload the notebook you just downloaded.
  5. Open the notebook sagemaker_featurestore_fraud_redshiftml_python_sdk.ipynb.
  6. Follow the instructions and run all the cells up to the Cleanup Resources section.

The following are the key steps from the notebook:

  1. We create a pandas DataFrame from the initial CSV data and apply feature transformations to this dataset.
    identity_data = pd.read_csv(io.BytesIO(identity_data_object["Body"].read()))
    transaction_data = pd.read_csv(io.BytesIO(transaction_data_object["Body"].read()))
    
    identity_data = identity_data.round(5)
    transaction_data = transaction_data.round(5)
    
    identity_data = identity_data.fillna(0)
    transaction_data = transaction_data.fillna(0)
    
    # Feature transformations for this dataset are applied 
    # One-hot encode card4, card6
    encoded_card_bank = pd.get_dummies(transaction_data["card4"], prefix="card_bank")
    encoded_card_type = pd.get_dummies(transaction_data["card6"], prefix="card_type")
    
    transformed_transaction_data = pd.concat(
        [transaction_data, encoded_card_type, encoded_card_bank], axis=1
    )

  2. We store the processed and transformed transaction dataset in an S3 bucket. This transaction data will be loaded later into the Redshift table for building the local feature store.
    transformed_transaction_data.to_csv("transformed_transaction_data.csv", header=False, index=False)
    s3_client.upload_file("transformed_transaction_data.csv", default_s3_bucket_name, prefix + "/training_input/transformed_transaction_data.csv")

  3. Next, we need a record identifier name and an event time feature name. In our fraud detection example, the column of interest is TransactionID. EventTime can be appended to your data when no timestamp is available. In the following code, you can see how these variables are set, and then EventTime is appended to both features’ data.
    # record identifier and event time feature names
    record_identifier_feature_name = "TransactionID"
    event_time_feature_name = "EventTime"
    
    # append EventTime feature
    identity_data[event_time_feature_name] = pd.Series(
        [current_time_sec] * len(identity_data), dtype="float64"
    )

  4. We then create the feature group and ingest the data into it using the SageMaker SDK FeatureGroup.ingest API. This is a small dataset and can therefore be loaded into a pandas DataFrame. When working with large amounts of data and millions of rows, there are other scalable mechanisms to ingest data into SageMaker Feature Store, such as batch ingestion with Apache Spark.
    identity_feature_group_name = "identity-feature-group"
    
    # Load feature definitions into the feature group. The SageMaker Feature Store
    # Python SDK auto-detects the data schema based on the input data.
    identity_feature_group.load_feature_definitions(data_frame=identity_data)
    
    identity_feature_group.create(
        s3_uri=<S3_Path_Feature_Store>,
        record_identifier_name=record_identifier_feature_name,
        event_time_feature_name=event_time_feature_name,
        role_arn=<role_arn>,
        enable_online_store=False,
    )
    
    identity_feature_group.ingest(data_frame=identity_data, max_workers=3, wait=True)
    

  5. We can verify that the data has been ingested into the feature group by running Athena queries in the notebook or on the Athena console.

At this point, the identity feature group is created in an offline feature store with historical data persisted in Amazon S3. SageMaker Feature Store automatically creates an AWS Glue Data Catalog for the offline store, which allows us to run SQL queries against the offline data using Athena or Redshift Spectrum.

Create a Redshift table and load local feature data

To build a Redshift ML model, we build a training dataset joining the identity data and transaction data using SQL queries. The identity data is in a centralized feature store where the historical set of records is persisted in Amazon S3. The transaction data is a local feature for training data that must be made available in the Redshift table.

Let’s explore how to create the schema and load the processed transaction data from Amazon S3 into a Redshift table.

  1. Create the customer_transaction table and load the daily transaction data into the table, which you’ll use to train the ML model:
    DROP TABLE IF EXISTS customer_transaction;
    CREATE TABLE customer_transaction (
      TransactionID INT,    
      isFraud INT,  
      TransactionDT INT,    
      TransactionAmt decimal(10,2), 
      card1 INT,    
      card2 decimal(10,2),card3 decimal(10,2),  
      card4 VARCHAR(20),card5 decimal(10,2),    
      card6 VARCHAR(20),    
      B1 INT,B2 INT,B3 INT,B4 INT,B5 INT,B6 INT,
      B7 INT,B8 INT,B9 INT,B10 INT,B11 INT,B12 INT,
      F1 INT,F2 INT,F3 INT,F4 INT,F5 INT,F6 INT,
      F7 INT,F8 INT,F9 INT,F10 INT,F11 INT,F12 INT,
      F13 INT,F14 INT,F15 INT,F16 INT,F17 INT,  
      N1 VARCHAR(20),N2 VARCHAR(20),N3 VARCHAR(20), 
      N4 VARCHAR(20),N5 VARCHAR(20),N6 VARCHAR(20), 
      N7 VARCHAR(20),N8 VARCHAR(20),N9 VARCHAR(20), 
      card_type_0  boolean,
      card_type_credit boolean,
      card_type_debit  boolean,
      card_bank_0  boolean,
      card_bank_american_express boolean,
      card_bank_discover  boolean,
      card_bank_mastercard  boolean,
      card_bank_visa boolean  
    );

  2. Load the sample data by using the following command. Replace the Region and S3 path as appropriate. You will find the S3 path in the S3 Bucket Setup For The OfflineStore section in the notebook or by checking the dataset_uri_prefix in the notebook.
    COPY customer_transaction
    FROM '<s3path>/transformed_transaction_data.csv' 
    IAM_ROLE default
    DELIMITER ','
    REGION 'your-region';

Now that we have created a local feature store for the transaction data, we focus on integrating the centralized feature store with Amazon Redshift to access the identity data.

Create an external schema for Redshift Spectrum to access the offline store data

We’ve created a centralized feature store for identity features, and we can access this offline feature store using services such as Redshift Spectrum. When the identity data is available through the Redshift Spectrum table, we can create a training dataset with feature values from the identity feature group and customer_transaction, joining on the TransactionId column.

This section provides an overview of how to enable Redshift Spectrum to query data directly from files on Amazon S3 through an external database in an AWS Glue Data Catalog.

  1. First, check that the identity-feature-group table is present in the Data Catalog under the sagemaker_featurestore database.
  2. Using Redshift Query Editor V2, create an external schema using the following command:
    CREATE EXTERNAL SCHEMA sagemaker_featurestore
    FROM DATA CATALOG
    DATABASE 'sagemaker_featurestore'
    IAM_ROLE default
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

All the tables, including the identity-feature-group external table, are visible under the sagemaker_featurestore external schema. In Redshift Query Editor v2, you can inspect the contents of the external schema.

  1. Run the following query to sample a few records (note that your table name may be different):
    SELECT * FROM sagemaker_featurestore.identity_feature_group_1680208535 LIMIT 10;

  2. Create a view to join the latest data from identity-feature-group and customer_transaction on the TransactionId column. Be sure to change the external table name to match yours:
    CREATE OR REPLACE VIEW public.credit_fraud_detection_v
    AS SELECT "isfraud",
            "transactiondt",
            "transactionamt",
            "card1","card2","card3","card5",
             CASE WHEN "card_type_credit" = 'False' THEN 0 ELSE 1 END AS card_type_credit,
             CASE WHEN "card_type_debit" = 'False' THEN 0 ELSE 1 END AS card_type_debit,
             CASE WHEN "card_bank_american_express" = 'False' THEN 0 ELSE 1 END AS card_bank_american_express,
             CASE WHEN "card_bank_discover" = 'False' THEN 0 ELSE 1 END AS card_bank_discover,
             CASE WHEN "card_bank_mastercard" = 'False' THEN 0 ELSE 1 END AS card_bank_mastercard,
             CASE WHEN "card_bank_visa" = 'False' THEN 0 ELSE 1 END AS card_bank_visa,
            "id_01","id_02","id_03","id_04","id_05"
    FROM public.customer_transaction ct LEFT JOIN sagemaker_featurestore.identity_feature_group_1680208535 id
    ON id.transactionid = ct.transactionid WITH NO SCHEMA BINDING;
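The CASE expressions in the view convert the one-hot columns, which surface as the strings 'False' and 'True' after the CSV round trip, into the 0/1 integers the model expects. A minimal pandas equivalent of that mapping (with made-up sample values) is:

```python
import pandas as pd

# Hypothetical sample of the string-encoded one-hot columns.
df = pd.DataFrame({
    "card_type_credit": ["True", "False", "True"],
    "card_type_debit":  ["False", "True", "False"],
})

bool_cols = ["card_type_credit", "card_type_debit"]
# 'True' -> 1, anything else ('False') -> 0, mirroring the CASE expressions.
df[bool_cols] = (df[bool_cols] == "True").astype(int)
```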

Train and validate the fraud risk scoring ML model

Redshift ML gives you the flexibility to specify your own algorithms and model types and to supply your own advanced parameters, which can include preprocessors, problem type, and hyperparameters. In this post, we create a custom model by specifying AUTO OFF and the model type XGBOOST. By turning AUTO OFF and using XGBoost, we provide the necessary inputs for SageMaker to train the model; a benefit of this can be faster training times. XGBoost is an open-source implementation of the gradient boosted trees algorithm. For more details on XGBoost, refer to Build XGBoost models with Amazon Redshift ML.

We train the model using 80% of the dataset by filtering on transactiondt < 12517618. The other 20% will be used for inference. A centralized feature store is useful in providing the latest supplementary data for training requests. Note that you will need to provide an S3 bucket name in the CREATE MODEL statement. It will take approximately 10 minutes to create the model.

CREATE MODEL frauddetection_xgboost
FROM (SELECT "isfraud",
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05"
FROM credit_fraud_detection_v WHERE transactiondt < 12517618
)
TARGET isfraud
FUNCTION ml_fn_frauddetection_xgboost
IAM_ROLE default
AUTO OFF
MODEL_TYPE XGBOOST
OBJECTIVE 'binary:logistic'
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT(NUM_ROUND '100')
SETTINGS (S3_BUCKET <s3_bucket>);
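The FROM subquery filters on transactiondt < 12517618 to hold out the later 20% of transactions for inference. The same time-based split can be sketched in plain Python on a synthetic stand-in for the view (the values below are illustrative):

```python
import pandas as pd

# Synthetic stand-in for credit_fraud_detection_v; values are illustrative.
df = pd.DataFrame({
    "transactiondt": [100, 5_000_000, 12_517_617, 12_517_618, 20_000_000],
    "isfraud":       [0, 1, 0, 0, 1],
})

SPLIT_DT = 12_517_618  # same threshold used in the CREATE MODEL filter
train = df[df["transactiondt"] < SPLIT_DT]   # goes to training
test = df[df["transactiondt"] >= SPLIT_DT]   # held out for inference
```

Splitting on a timestamp rather than at random avoids leaking future transactions into the training set, which matters for fraud detection.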

When you run the CREATE MODEL command, it completes quickly in Amazon Redshift while the model training happens in the background using SageMaker. You can check the status of the model by running a SHOW MODEL command:

SHOW MODEL frauddetection_xgboost;

The output of the SHOW MODEL command shows that the model state is TRAINING. It also shows other information such as the model type and the training job name that SageMaker assigned.
After a few minutes, we run the SHOW MODEL command again:

SHOW MODEL frauddetection_xgboost;

Now the output shows that the model state is READY. We can also see the train:error score here, which at 0 tells us we have a good model. Now that the model is trained, we can use it to run inference queries.

Use the offline feature store and local store for inference

We can use the SQL function to apply the ML model to data in queries, reports, and dashboards. Let’s use the function ml_fn_frauddetection_xgboost created by our model against our test dataset, filtering where transactiondt >= 12517618, to predict whether a transaction is fraudulent. SageMaker Feature Store can be useful in supplementing data for inference requests.

Run the following query to predict whether transactions are fraudulent:

SELECT "isfraud" AS "Actual",
        ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") AS "Predicted"
FROM credit_fraud_detection_v WHERE transactiondt >= 12517618;

For binary and multi-class classification problems, we compute accuracy as the model metric. Accuracy can be calculated as follows:

accuracy = (sum(actual == predicted) / total) * 100

Let’s apply the preceding formula to our use case to find the accuracy of the model. We use the test data (transactiondt >= 12517618) to check the accuracy, and use the newly created function ml_fn_frauddetection_xgboost to predict, taking the columns other than the target and label as input:

-- check accuracy 
WITH infer_data AS (
SELECT "isfraud" AS label,
ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") AS predicted,
CASE 
   WHEN label IS NULL
       THEN 0
   ELSE label
   END AS actual,
CASE 
   WHEN actual = predicted
       THEN 1::INT
   ELSE 0::INT
   END AS correct
FROM credit_fraud_detection_v WHERE transactiondt >= 12517618),
aggr_data AS (
SELECT SUM(correct) AS num_correct,
COUNT(*) AS total
FROM infer_data) 

SELECT (num_correct::FLOAT / total::FLOAT) AS accuracy FROM aggr_data;
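The arithmetic the final SELECT performs can be sanity-checked in plain Python (toy labels here, not the real predictions):

```python
# Toy actual/predicted labels; the real values come from the SQL query above.
actual    = [0, 1, 0, 0, 1, 0]
predicted = [0, 1, 1, 0, 1, 0]

# Count positions where the prediction matches the label, then scale to percent.
num_correct = sum(a == p for a, p in zip(actual, predicted))
total = len(actual)
accuracy = num_correct / total * 100  # 5 of 6 match -> ~83.33
```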

Clean up

As a final step, clean up the resources:

  1. Delete the Redshift cluster.
  2. Run the Cleanup Resources section of your notebook.

Conclusion

Redshift ML allows you to bring machine learning to your data, powering fast and informed decision-making. SageMaker Feature Store provides a purpose-built feature management solution to help organizations scale ML development across business units and data science teams.

In this post, we showed how you can train an XGBoost model using Redshift ML with data spread across SageMaker Feature Store and a Redshift table. Additionally, we showed how you can make inferences on a trained model to detect fraud using Amazon Redshift SQL commands.


About the authors

Anirban Sinha is a Senior Technical Account Manager at AWS. He is passionate about building scalable data warehouses and big data solutions, working closely with customers. He works with large ISV customers, helping them build and operate secure, resilient, scalable, and high-performance SaaS applications in the cloud.

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and using the power of ML within their data warehouse.

Gaurav Singh is a Senior Solutions Architect at AWS, specializing in AI/ML and generative AI. Based in Pune, India, he focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. In his spare time, Gaurav likes to explore nature, read, and run.
