DSA-C02: SnowPro Advanced-Data Scientist - Free download from Exam Hub

Which of the following methods is used for multiclass classification?

one vs rest
loocv
all vs one
one vs another

Correct answer: A

Explanation:

Binary vs. Multi-Class Classification

Classification problems are common in machine learning. In most cases, developers use a supervised machine-learning approach to predict class labels for a given dataset. Unlike regression, which predicts a continuous value, classification involves training a classifier model to assign each input to a category. Classification problems are either binary or multi-class.

As the name suggests, binary classification involves solving a problem with only two class labels. This makes it easy to filter the data, apply classification algorithms, and train the model to predict outcomes. Multi-class classification, on the other hand, applies when there are more than two class labels in the training data. Some multi-class techniques work by decomposing the problem into several binary classification problems.

That said, while binary classification requires only one classifier model, the number of models used in the multi-class approach depends on the classification technique. One such technique, one-vs-rest, is described below.

One-Vs-Rest Classification Model for Multi-Class Classification

Also known as one-vs-all, one-vs-rest is a heuristic method that uses a binary classification algorithm for multi-class classification. The technique splits a multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary problem, and predictions are made using the classifier that is most confident.

For instance, a multi-class classification problem with the classes red, green, and blue can be divided into the following binary problems:

Problem one: red vs. green/blue

Problem two: blue vs. green/red

Problem three: green vs. blue/red

The main drawback of this approach is that it requires one model per class. The three classes above require three models, which can be a problem for large datasets (for example, millions of rows), slow models (such as neural networks), or datasets with a large number of classes.

The one-vs-rest approach requires each model to predict a class membership probability or a probability-like score. The class whose model yields the largest score is the predicted class. As such, it is commonly used with classification algorithms that naturally predict scores or numerical class membership, such as the perceptron and logistic regression.
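The decomposition and the final "largest score wins" step can be sketched in a few lines of plain Python. This is only an illustration: the helper names `one_vs_rest_targets` and `predict_class` are hypothetical, and the scores are made-up numbers standing in for the output of three trained binary models.

```python
def one_vs_rest_targets(labels):
    """Build one binary target vector per class (1 = this class, 0 = the rest)."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def predict_class(scores):
    """Return the class whose binary model produced the largest score."""
    return max(scores, key=scores.get)


labels = ["red", "green", "blue", "red"]
targets = one_vs_rest_targets(labels)
# targets["red"] marks red samples as 1 and green/blue samples as 0.

# Hypothetical scores from the three binary models for one new sample.
scores = {"red": 0.2, "green": 0.7, "blue": 0.1}
prediction = predict_class(scores)
```

In practice each binary target vector would be used to train its own classifier; only the relabeling and the argmax step are shown here.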

Which of the following are key actions in the data collection phase of machine learning?

Label
Ingest and Aggregate
Probability
Measure

Correct answer: AB

Explanation:

The key actions in the data collection phase include:

Label: Labeled data is raw data that has been processed by adding one or more meaningful tags so that a model can learn from it. If such labels are missing, the data must first be labeled, either manually or automatically.

Ingest and Aggregate: Incorporating and combining data from many data sources is part of data collection in AI.

Data collection

Collecting data for training the ML model is a fundamental step in the machine learning pipeline. The predictions made by ML systems can only be as good as the data on which they have been trained. The following are some of the problems that can arise in data collection:

Inaccurate data. The collected data could be unrelated to the problem statement.

Missing data. Parts of the data could be missing, for example as empty values in columns or missing images for some prediction class.

Data imbalance. Some classes or categories in the data may have a disproportionately high or low number of corresponding samples. As a result, they risk being under-represented in the model.

Data bias. Depending on how the data, subjects and labels themselves are chosen, the model could propagate inherent biases on gender, politics, age or region, for example. Data bias is difficult to detect and remove.
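Two of the problems above, missing data and data imbalance, are easy to check for before training. The sketch below is a minimal illustration on a made-up list of records; the function names, the column names, and the use of `None` as the missing-value marker are all assumptions.

```python
from collections import Counter


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def class_shares(rows, label_column):
    """Return each class's share of the labeled rows."""
    labels = [row[label_column] for row in rows if row[label_column] is not None]
    if not labels:
        return {}
    return {label: n / len(labels) for label, n in Counter(labels).items()}


# Made-up records for illustration only.
rows = [
    {"age": 34, "city": "Oslo", "label": "spam"},
    {"age": None, "city": "Rome", "label": "ham"},
    {"age": 51, "city": None, "label": "ham"},
    {"age": 29, "city": "Lima", "label": "ham"},
]
missing = missing_counts(rows)
shares = class_shares(rows, "label")
```

A class share far below the others (here "spam" at one quarter) is a hint that the class may be under-represented.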

Several techniques can be applied to address those problems:

Pre-cleaned, freely available datasets. If the problem statement (for example, image classification, object recognition) aligns with a clean, pre-existing, properly formulated dataset, then take advantage of existing, open-source expertise.

Web crawling and scraping. Automated tools, bots and headless browsers can crawl and scrape websites for data.

Private data. ML engineers can create their own data. This is helpful when the amount of data required to train the model is small and the problem statement is too specific to generalize over an open-source dataset.

Custom data. Agencies can create or crowdsource the data for a fee.

Which of the following types of visualization are used for data exploration in data science?

Heat Maps
Newton AI
Feature Distribution by Class
2D-Density Plots
Sand Visualization

Correct answer: ACD

Explanation:

Types of visualization used for exploration:

Correlation heatmap
Class distributions by feature
Two-dimensional density plots

All the visualizations are interactive, as is standard for Plotly.

For more details, refer to the link below:

https://towardsdatascience.com/data-exploration-understanding-and-visualization-72657f5eac41
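A correlation heatmap is a colored rendering of a pairwise correlation matrix. As a rough sketch of the numbers behind such a plot, the code below computes Pearson correlations in plain Python; the column names and values are made up, and the plotting step itself is left out.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (fails on constant columns)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict holding the correlation of every column pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up feature columns for illustration only.
columns = {
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
    "speed": [9.0, 7.0, 5.0, 3.0],
}
matrix = correlation_matrix(columns)
```

Each cell of the heatmap would be colored by the corresponding entry of `matrix`, from -1 (perfectly inverse) to 1 (perfectly aligned).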

Which one of the following is not a feature engineering technique used in machine learning and data science?

Imputation
Binning
One hot encoding
Statistical

Correct answer: D

Explanation:

Feature engineering is the pre-processing step of machine learning that transforms raw data into features that can be used to create a predictive model using machine learning or statistical modelling.

What is a feature?

Generally, all machine learning algorithms take input data to generate the output. The input data remains in a tabular form consisting of rows (instances or observations) and columns (variables or attributes), and these attributes are often known as features. For example, an image is an instance in computer vision, but a line in the image could be the feature. Similarly, in NLP, a document can be an observation, and the word count could be the feature. So, we can say a feature is an attribute that impacts a problem or is useful for the problem.

What is Feature Engineering?

Feature engineering is the pre-processing step of machine learning that extracts features from raw data. It helps represent the underlying problem to predictive models in a better way, which in turn improves the accuracy of the model on unseen data. A predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.

Some of the popular feature engineering techniques include:

1. Imputation

Feature engineering deals with inappropriate data, missing values, human error, general errors, insufficient data sources, and so on. Missing values in a dataset can significantly degrade the performance of an algorithm, and imputation is the technique used to deal with them.

One option is to drop rows or columns that contain a large percentage of missing values. However, to preserve the size of the dataset, it is often preferable to impute the missing data, which can be done as follows:

For numerical data, missing values can be filled with a default value or with the mean or median of the column.

For categorical data, missing values can be replaced with the most frequently occurring value in the column.
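Both imputation rules can be sketched with the standard library alone. The function names are hypothetical, and `None` is assumed to be the marker for a missing value.

```python
from collections import Counter
from statistics import median


def impute_numeric(values, strategy="median"):
    """Fill missing numbers with the column median (default) or mean."""
    present = [v for v in values if v is not None]
    fill = median(present) if strategy == "median" else sum(present) / len(present)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Fill missing categories with the most frequent value in the column."""
    present = [v for v in values if v is not None]
    most_common = Counter(present).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]


ages = impute_numeric([20, None, 40, 30])
colors = impute_categorical(["red", None, "blue", "red"])
```

The median is used as the default here because it is less sensitive to extreme values than the mean; that choice is an assumption of the sketch, not a requirement of the technique.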

2. Handling Outliers

Outliers are data points that lie so far from the other data points that they degrade the performance of the model. This feature engineering technique first identifies the outliers and then removes them.

Standard deviation can be used to identify outliers. For example, values in a dataset typically lie within a certain distance of the mean; a value that lies farther from the mean than a chosen threshold (commonly a multiple of the standard deviation) can be considered an outlier. The z-score, which expresses that distance in units of standard deviation, can also be used to detect outliers.
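A z-score filter might look like the following minimal sketch. The function names and the sample data are made up, the population standard deviation is an arbitrary choice, and the code assumes the values are not all identical (otherwise the standard deviation is zero).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Distance of each value from the mean, in units of standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; must be non-zero
    return [(v - mu) / sigma for v in values]


def remove_outliers(values, threshold=3.0):
    """Keep only the values whose absolute z-score is within the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) <= threshold]


data = [10, 12, 11, 13, 12, 11, 100]
cleaned = remove_outliers(data, threshold=2.0)
```

The example passes a threshold of 2.0 rather than the default 3.0 because, in a sample of only seven points, no z-score computed this way can reach 3, so the larger threshold would never flag anything.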

3. Log transform

Logarithm transformation, or log transform, is one of the most commonly used mathematical techniques in machine learning. Log transform helps handle skewed data by making the distribution closer to normal after transformation. It also reduces the effect of outliers: because it compresses differences in magnitude, the model becomes more robust.
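As a small illustration, the sketch below applies log(1 + x) to a made-up, heavily right-skewed column. Using `log1p` instead of a plain logarithm is an assumption made so that zero values are handled; the inputs must still be non-negative.

```python
import math


def log_transform(values):
    """Apply log(1 + x) to each value; assumes all values are non-negative."""
    return [math.log1p(v) for v in values]


# Made-up values spanning three orders of magnitude.
incomes = [0, 9, 99, 999]
transformed = log_transform(incomes)
```

After the transform, the gap between the two largest values shrinks from 900 to roughly 2.3, which is the magnitude compression described above.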

4. Binning

In machine learning, overfitting is one of the main issues that degrade model performance; it occurs due to a large number of parameters and noisy data. Binning, one of the popular feature engineering techniques, can be used to smooth noisy data. The process involves segmenting the values of a feature into bins.
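Binning with fixed edges can be sketched as follows. The bin edges, the labels, and the function name are illustrative assumptions; a value equal to an edge is placed in the higher bin.

```python
from bisect import bisect_right


def bin_values(values, edges, labels):
    """Assign each value to a labeled bin; edges must be sorted ascending."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return [labels[bisect_right(edges, v)] for v in values]


ages = [5, 18, 34, 35, 72]
binned = bin_values(
    ages,
    edges=[18, 35, 60],
    labels=["child", "young adult", "adult", "senior"],
)
```

Small fluctuations inside a bin (for example 34 versus 33) no longer change the feature value, which is how binning dampens noise.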

5. Feature Split

As the name suggests, feature split is the process of splitting a feature into two or more parts to create new features. This technique helps algorithms better understand and learn the patterns in the dataset.

The feature splitting process enables the new features to be clustered and binned, which helps extract useful information and improve the performance of the data models.
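Two common cases of feature splitting are breaking a date string into its components and separating a full name into parts. The sketch below assumes ISO-formatted dates and names with a single space between first and last name; both helper names are hypothetical.

```python
def split_date(date_string):
    """Split an ISO 'YYYY-MM-DD' string into year, month, and day features."""
    year, month, day = (int(part) for part in date_string.split("-"))
    return {"year": year, "month": month, "day": day}


def split_full_name(full_name):
    """Split 'First Last' at the first space into two features."""
    first, _, last = full_name.partition(" ")
    return {"first_name": first, "last_name": last}


features = split_date("2023-10-25")
name = split_full_name("Ada Lovelace")
```

The resulting `month` or `day` features can then be binned or grouped on their own, which is not possible while the date is a single string.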

6. One hot encoding

One hot encoding is a popular encoding technique in machine learning. It converts categorical data into a form that machine learning algorithms can easily understand and therefore use to make good predictions. It enables the grouping of categorical data without losing any information.
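A minimal one hot encoder can be written without any library. The function name is hypothetical, and sorting the categories alphabetically to fix the column order is an assumption of this sketch.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is "hot" per row
        encoded.append(row)
    return categories, encoded


categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
```

Each row contains a single 1 in the column of its category, so no ordering between categories is implied.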

Skewness of Normal distribution is ___________

Negative
Positive
0
Undefined

Correct answer: C

Explanation:

Since the normal curve is symmetric about its mean, its skewness is zero. This is a theoretical result; for the mathematical proof, refer to books or websites that cover the topic in detail.
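The link between symmetry and zero skewness can be checked numerically. The sketch below computes skewness as the third standardized moment on two made-up samples; a small symmetric sample stands in for the normal curve, which is an assumption made purely for illustration.

```python
from statistics import mean, pstdev


def moment_skewness(values):
    """Third standardized moment: the mean of cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)  # must be non-zero
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


symmetric = [-2, -1, 0, 1, 2]
right_skewed = [1, 1, 2, 2, 10]

symmetric_skew = moment_skewness(symmetric)
right_skew = moment_skewness(right_skewed)
```

For the symmetric sample the positive and negative cubed deviations cancel, so the result is zero up to floating-point rounding; the sample with a long right tail gives a positive value.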

What is the formula for measuring skewness in a dataset?

MEAN - MEDIAN
MODE - MEDIAN
(3(MEAN - MEDIAN))/ STANDARD DEVIATION
(MEAN - MODE)/ STANDARD DEVIATION

Correct answer: C

Explanation:

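The expression 3(MEAN - MEDIAN) / STANDARD DEVIATION is Pearson's second coefficient of skewness (median skewness). It is positive when the mean exceeds the median, negative when the mean is below the median, and zero for a symmetric distribution such as the normal curve, whose mean and median coincide.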
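As a rough illustration of the formula in the correct option, the sketch below evaluates 3(mean - median) / standard deviation on two made-up samples. The function name is hypothetical, and using the population standard deviation rather than the sample one is an assumption.

```python
from statistics import mean, median, pstdev


def pearson_second_skewness(values):
    """3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / pstdev(values)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]

symmetric_coefficient = pearson_second_skewness(symmetric)
skewed_coefficient = pearson_second_skewness(right_skewed)
```

In the symmetric sample the mean and median are both 3, so the coefficient is zero; in the second sample the large value pulls the mean up to 6 while the median stays at 3, giving a positive coefficient.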

What can a Snowflake data scientist do in the Snowflake Marketplace as a provider?

Publish listings for free-to-use datasets to generate interest and new opportunities among the Snowflake customer base.
Publish listings for datasets that can be customized for the consumer.
Share live datasets securely and in real-time without creating copies of the data or imposing data integration tasks on the consumer.
Eliminate the costs of building and maintaining APIs and data pipelines to deliver data to customers.

Correct answer: ABCD

Explanation:

All are correct!

About the Snowflake Marketplace

You can use the Snowflake Marketplace to discover and access third-party data and services, as well as market your own data products across the Snowflake Data Cloud.

As a data provider, you can use listings on the Snowflake Marketplace to share curated data offerings with many consumers simultaneously, rather than maintain sharing relationships with each individual consumer. With Paid Listings, you can also charge for your data products.

As a consumer, you might use the data provided on the Snowflake Marketplace to explore and access the following:

Historical data for research, forecasting, and machine learning.

Up-to-date streaming data, such as current weather and traffic conditions.

Specialized identity data for understanding subscribers and audience targets.

New insights from unexpected sources of data.

The Snowflake Marketplace is available globally to all non-VPS Snowflake accounts hosted on Amazon Web Services, Google Cloud Platform, and Microsoft Azure, with the exception of Microsoft Azure Government. Support for Microsoft Azure Government is planned.

What can a Snowflake data scientist do in the Snowflake Marketplace as a consumer?

Discover and test third-party data sources.
Receive frictionless access to raw data products from vendors.
Combine new datasets with your existing data in Snowflake to derive new business insights.
Use the business intelligence (BI), ML, or deep learning tools of their choice.

Correct answer: ABCD

Explanation:

As a consumer, you can do the following:

Discover and test third-party data sources.
Receive frictionless access to raw data products from vendors.
Combine new datasets with your existing data in Snowflake to derive new business insights.
Have datasets available instantly and updated continually for users.
Eliminate the costs of building and maintaining various APIs and data pipelines to load and update data.
Use the business intelligence (BI) tools of your choice.

Which one of the following is not a valid option for sharing data in Snowflake?

a Listing, in which you offer a share and additional metadata as a data product to one or more accounts.
a Direct Marketplace, in which you directly share specific database objects (a share) to another account in your region using Snowflake Marketplace.
a Direct Share, in which you directly share specific database objects (a share) to another account in your region.
a Data Exchange, in which you set up and manage a group of accounts and offer a share to that group.

Correct answer: B

Explanation:

Options for Sharing in Snowflake

You can share data in Snowflake using one of the following options:

a Listing, in which you offer a share and additional metadata as a data product to one or more accounts,
a Direct Share, in which you directly share specific database objects (a share) to another account in your region,
a Data Exchange, in which you set up and manage a group of accounts and offer a share to that group.

Data providers add Snowflake objects (databases, schemas, tables, secure views, etc.) to a share using which of the following options?

Grant privileges on objects to a share via Account role.
Grant privileges on objects directly to a share.
Grant privileges on objects to a share via a database role.
Grant privileges on objects to a share via a third-party role.

Correct answer: BC

Explanation:

What is a Share?

Shares are named Snowflake objects that encapsulate all of the information required to share a database.

Data providers add Snowflake objects (databases, schemas, tables, secure views, etc.) to a share using either or both of the following options:

Option 1: Grant privileges on objects to a share via a database role.

Option 2: Grant privileges on objects directly to a share.

You choose which accounts can consume data from the share by adding the accounts to the share.

After a database is created (in a consumer account) from a share, all the shared objects are accessible to users in the consumer account.

Shares are secure, configurable, and controlled completely by the provider account:

New objects added to a share become immediately available to all consumers, providing real-time access to shared data.

Access to a share (or any of the objects in a share) can be revoked at any time.
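The two ways of adding objects to a share can be sketched as the SQL statements a provider might run. The Python helpers below only build statement strings and do not connect to Snowflake; every identifier (`sales_share`, `sales_db`, `public`, `orders`, `analyst`) is hypothetical, and the statement text reflects the general shape of Snowflake's grant syntax as an assumption that should be checked against the current documentation before use.

```python
def direct_grant_statements(share, database, schema, table):
    """Option 2: grant privileges on objects directly to the share."""
    return [
        f"CREATE SHARE {share};",
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share};",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO SHARE {share};",
        f"GRANT SELECT ON TABLE {database}.{schema}.{table} TO SHARE {share};",
    ]


def database_role_grant_statements(share, database, schema, table, role):
    """Option 1: grant privileges to a database role, then grant that role to the share."""
    db_role = f"{database}.{role}"
    return [
        f"CREATE DATABASE ROLE {db_role};",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO DATABASE ROLE {db_role};",
        f"GRANT SELECT ON TABLE {database}.{schema}.{table} TO DATABASE ROLE {db_role};",
        f"CREATE SHARE {share};",
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share};",
        f"GRANT DATABASE ROLE {db_role} TO SHARE {share};",
    ]


# Hypothetical object names, for illustration only.
direct = direct_grant_statements("sales_share", "sales_db", "public", "orders")
via_role = database_role_grant_statements(
    "sales_share", "sales_db", "public", "orders", "analyst"
)
```

The names are interpolated directly into the strings, so this sketch is suitable only for fixed, trusted identifiers.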

Vendor:	Snowflake
Exam Code:	DSA-C02
Exam Name:	SnowPro Advanced-Data Scientist
Date:	Oct 25, 2023