Download Professional Data Engineer on Google Cloud Platform.Professional-Data-Engineer.VCEplus.2024-08-23.155q.vcex

Vendor: Google
Exam Code: Professional-Data-Engineer
Exam Name: Professional Data Engineer on Google Cloud Platform
Date: Aug 23, 2024
File Size: 760 KB
Downloads: 4

How to open VCEX files?

Files with VCEX extension can be opened by ProfExam Simulator.


Demo Questions

Question 1
You launched a new gaming app almost three years ago. You have been uploading log files from the previous day to a separate Google BigQuery table with the table name format LOGS_yyyymmdd. You have been using table wildcard functions to generate daily and monthly reports for all time ranges.
Recently, you discovered that some queries that cover long date ranges are exceeding the limit of 1,000 tables and failing. How can you resolve this issue?
  1. Convert all daily log tables into date-partitioned tables
  2. Convert the sharded tables into a single partitioned table
  3. Enable query caching so you can cache data from previous months
  4. Create separate views to cover each month, and query from these views
Correct answer: B
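A minimal sketch of that consolidation, assuming the google-cloud-bigquery Python client; the dataset name mydataset, the target table logs_partitioned, and the partitioning column log_date are made-up placeholders:

```python
# Sketch only: fold the sharded LOGS_yyyymmdd tables into one date-partitioned
# table so long date ranges no longer touch 1,000+ tables per query.
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# _TABLE_SUFFIX carries the yyyymmdd shard date; PARSE_DATE turns it into the
# partitioning column of the consolidated table.
backfill = """
CREATE TABLE IF NOT EXISTS `mydataset.logs_partitioned`
PARTITION BY log_date AS
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS log_date, *
FROM `mydataset.LOGS_*`
"""
client.query(backfill).result()  # waits for the job to finish
```

Because almost three years of daily shards is itself more than 1,000 tables, the backfill may need to run in date-range chunks (a _TABLE_SUFFIX BETWEEN filter) to stay under the same wildcard limit.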
Question 2
Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this workload weekly. How should you optimize the cluster for cost?
  1. Migrate the workload to Google Cloud Dataflow
  2. Use pre-emptible virtual machines (VMs) for the cluster
  3. Use a higher-memory node so that the job runs faster
  4. Use SSDs on the worker nodes so that the job can run faster
Correct answer: B
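If the cluster itself stays on Dataproc, a hedged sketch of the preemptible-VM setup using the google-cloud-dataproc Python client follows; the project ID, region, machine types, and node counts are illustrative assumptions rather than values from the question:

```python
# Sketch only: create a Dataproc cluster whose secondary workers are
# preemptible VMs, cutting cost for a short weekly Spark job.
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder region
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",            # placeholder project
    "cluster_name": "weekly-spark-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
        # A small set of standard workers keeps HDFS and shuffle stable ...
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
        # ... while preemptible secondary workers carry the bulk of the compute cheaply.
        "secondary_worker_config": {
            "num_instances": 13,
            "preemptibility": "PREEMPTIBLE",
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # wait for cluster creation to complete
```

Deleting the cluster after the weekly run (or using an ephemeral, per-job cluster) keeps the cost limited to the roughly 30 minutes of actual processing.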
Question 3
You are testing a Dataflow pipeline to ingest and transform text files. The files are gzip-compressed, errors are written to a dead-letter queue, and you are using SideInputs to join data. You noticed that the pipeline is taking longer to complete than expected. What should you do to expedite the Dataflow job?
  1. Switch to compressed Avro files
  2. Reduce the batch size
  3. Retry records that throw an error
  4. Use CoGroupByKey instead of the SideInput
Correct answer: B
Question 4
You are administering a BigQuery dataset that uses a customer-managed encryption key (CMEK). You need to share the dataset with a partner organization that does not have access to your CMEK. What should you do?
  1. Create an authorized view that contains the CMEK to decrypt the data when accessed.
  2. Provide the partner organization a copy of your CMEKs to decrypt the data.
  3. Copy the tables you need to share to a dataset without CMEKs. Create an Analytics Hub listing for this dataset.
  4. Export the tables as Parquet files to a Cloud Storage bucket, and grant the storageinsights.viewer role on the bucket to the partner organization.
Correct answer: C
Explanation:
If you want to share a BigQuery dataset that uses a customer-managed encryption key (CMEK) with a partner organization that does not have access to your CMEK, you cannot use an authorized view or provide them a copy of your CMEK, because these options would violate the security and privacy of your data. Instead, you can copy the tables you need to share to a dataset without CMEKs, and then create an Analytics Hub listing for this dataset. Analytics Hub is a service that allows you to securely share and discover data assets across your organization and with external partners. By creating an Analytics Hub listing, you can grant the partner organization access to the copied dataset without CMEKs, and also control the level of access and the duration of the sharing.
Reference:
Customer-managed Cloud KMS keys
Authorized views
Analytics Hub overview
Creating an Analytics Hub listing
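A minimal sketch of the table-copy step, assuming the google-cloud-bigquery Python client; the project and dataset names (cmek_dataset, shared_dataset) are placeholders:

```python
# Sketch only: materialize a copy of the CMEK-protected table into a dataset
# that has no default Cloud KMS key, so the copy uses Google-managed encryption.
from google.cloud import bigquery

client = bigquery.Client()

client.query(
    """
    CREATE OR REPLACE TABLE `my-project.shared_dataset.customers`
    AS SELECT * FROM `my-project.cmek_dataset.customers`
    """
).result()
```

The Analytics Hub listing is then created on shared_dataset (for example, through the Analytics Hub page in the console), which is the sharing step the explanation describes.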
Question 5
You are designing a data mesh on Google Cloud with multiple distinct data engineering teams building data products. The typical data curation design pattern consists of landing files in Cloud Storage, transforming raw data in Cloud Storage and BigQuery datasets, and storing the final curated data product in BigQuery datasets. You need to configure Dataplex to ensure that each team can access only the assets needed to build their data products.
You also need to ensure that teams can easily share the curated data product. What should you do?
  1. 1. Create a single Dataplex virtual lake and create a single zone to contain landing, raw, and curated data. 2. Provide each data engineering team access to the virtual lake.
  2. 1. Create a single Dataplex virtual lake and create a single zone to contain landing, raw, and curated data. 2. Build separate assets for each data product within the zone. 3. Assign permissions to the data engineering teams at the zone level.
  3. 1. Create a Dataplex virtual lake for each data product, and create a single zone to contain landing, raw, and curated data. 2. Provide the data engineering teams with full access to the virtual lake assigned to their data product.
  4. 1. Create a Dataplex virtual lake for each data product, and create multiple zones for landing, raw, and curated data. 2. Provide the data engineering teams with full access to the virtual lake assigned to their data product.
Correct answer: D
Explanation:
This option is the best way to configure Dataplex for a data mesh architecture, as it allows each data engineering team to have full ownership and control over their data products, while also enabling easy discovery and sharing of the curated data across the organization [1, 2]. By creating a Dataplex virtual lake for each data product, you can isolate the data assets and resources for each domain, and avoid conflicts and dependencies between different teams [3]. By creating multiple zones for landing, raw, and curated data, you can enforce different security and governance policies for each stage of the data curation process, and ensure that only authorized users can access the data assets [4, 5]. By providing the data engineering teams with full access to the virtual lake assigned to their data product, you can empower them to manage and monitor their data products, and leverage the Dataplex features such as tagging, quality, and lineage.
Option A is not suitable, as it creates a single point of failure and a bottleneck for the data mesh, and does not allow for fine-grained access control and governance for different data products [2]. Option B is also not suitable, as it does not isolate the data assets and resources for each data product, and assigns permissions at the zone level, which may not reflect the different roles and responsibilities of the data engineering teams [3, 4]. Option C is better than options A and B, but it does not create multiple zones for landing, raw, and curated data, which may compromise the security and quality of the data products [5].
Reference:
1: Building a data mesh on Google Cloud using BigQuery and Dataplex | Google Cloud Blog
2: Data Mesh - 7 Effective Practices to Get Started - Confluent
3: Best practices | Dataplex | Google Cloud
4: Secure your lake | Dataplex | Google Cloud
5: Zones | Dataplex | Google Cloud
6: Managing a Data Mesh with Dataplex - ROI Training
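As a rough illustration of option D, a hedged sketch assuming the google-cloud-dataplex Python client; the project, region, lake ID, and zone IDs are invented placeholders, and the per-team IAM grants on each lake would still be applied separately:

```python
# Sketch only: one Dataplex lake per data product, with separate zones for
# landing, raw, and curated data.
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
parent = "projects/my-project/locations/europe-west3"  # placeholder project/region

# One lake per data product.
lake = client.create_lake(
    parent=parent,
    lake_id="sales-data-product",
    lake=dataplex_v1.Lake(display_name="Sales data product"),
).result()

# Separate zones for each stage of the curation pattern.
zones = [
    ("landing-zone", dataplex_v1.Zone.Type.RAW),
    ("raw-zone", dataplex_v1.Zone.Type.RAW),
    ("curated-zone", dataplex_v1.Zone.Type.CURATED),
]
for zone_id, zone_type in zones:
    client.create_zone(
        parent=lake.name,
        zone_id=zone_id,
        zone=dataplex_v1.Zone(
            type_=zone_type,
            resource_spec=dataplex_v1.Zone.ResourceSpec(
                location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
            ),
        ),
    ).result()
```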
Question 6
You are on the data governance team and are implementing security requirements to deploy resources. You need to ensure that resources are limited to only the europe-west3 region. You want to follow Google-recommended practices. What should you do?
  1. Deploy resources with Terraform and implement a variable validation rule to ensure that the region is set to the europe-west3 region for all resources.
  2. Set the constraints/gcp.resourceLocations organization policy constraint to in:eu-locations.
  3. Create a Cloud Function to monitor all resources created and automatically destroy the ones created outside the europe-west3 region.
  4. Set the constraints/gcp.resourceLocations organization policy constraint to in:europe-west3-locations.
Correct answer: D
Explanation:
To ensure that resources are limited to only the europe-west3 region, you should set the organization policy constraint constraints/gcp.resourceLocations to in:europe-west3-locations. This policy restricts the deployment of resources to the specified locations, which in this case is the europe-west3 region. By setting this policy, you enforce location compliance across your Google Cloud resources, aligning with best practices for data governance and regulatory compliance.
Reference:
Professional Data Engineer Certification Exam Guide | Learn - Google Cloud
Preparing for Google Cloud Certification: Cloud Data Engineer
Professional Data Engineer Certification | Learn | Google Cloud
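A hedged sketch of setting that constraint programmatically, assuming the google-cloud-org-policy client library (orgpolicy_v2); the organization ID is a placeholder, and the same policy can equally be set in the console or with the gcloud CLI:

```python
# Sketch only: restrict resource locations for the whole organization to the
# europe-west3 value group.
from google.cloud import orgpolicy_v2

client = orgpolicy_v2.OrgPolicyClient()
org = "organizations/123456789012"  # placeholder organization ID

policy = orgpolicy_v2.Policy(
    name=f"{org}/policies/gcp.resourceLocations",
    spec=orgpolicy_v2.PolicySpec(
        rules=[
            orgpolicy_v2.PolicySpec.PolicyRule(
                values=orgpolicy_v2.PolicySpec.PolicyRule.StringValues(
                    allowed_values=["in:europe-west3-locations"]
                )
            )
        ]
    ),
)

client.create_policy(parent=org, policy=policy)
```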
Question 7
You have a BigQuery table that contains customer data, including sensitive information such as names and addresses. You need to share the customer data with your data analytics and consumer support teams securely. The data analytics team needs to access the data of all the customers, but must not be able to access the sensitive data. The consumer support team needs access to all data columns, but must not be able to access customers that no longer have active contracts. You enforced these requirements by using an authorized dataset and policy tags. After implementing these steps, the data analytics team reports that they still have access to the sensitive columns. You need to ensure that the data analytics team does not have access to restricted data. What should you do?
Choose 2 answers
  1. Create two separate authorized datasets; one for the data analytics team and another for the consumer support team.
  2. Ensure that the data analytics team members do not have the Data Catalog Fine-Grained Reader role for the policy tags.
  3. Enforce access control in the policy tag taxonomy.
  4. Remove the bigquery.dataViewer role from the data analytics team on the authorized datasets.
  5. Replace the authorized dataset with an authorized view. Use row-level security and apply a filter_expression to limit data access.
Correct answer: BC
Explanation:
To ensure that the data analytics team does not have access to sensitive columns, you should:
B) Ensure that the data analytics team members do not have the Data Catalog Fine-Grained Reader role for the policy tags. This role allows users to read data in columns protected by those policy tags, which would include the sensitive information.
C) Enforce access control in the policy tag taxonomy. By enforcing access control at the policy tag level, you restrict access to specific columns within a dataset, ensuring that only authorized users can view sensitive data.
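To illustrate the column-level side of this, a minimal sketch assuming the google-cloud-bigquery Python client; the project, dataset, taxonomy, and policy tag IDs are placeholders:

```python
# Sketch only: attach a policy tag to the sensitive columns so that, once
# access control is enforced on the taxonomy, only principals with the Data
# Catalog Fine-Grained Reader role on that tag can read them.
from google.cloud import bigquery

client = bigquery.Client()

pii_tag = bigquery.PolicyTagList(
    names=["projects/my-project/locations/eu/taxonomies/111/policyTags/222"]
)

schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("contract_active", "BOOL"),
    # Sensitive columns carry the policy tag.
    bigquery.SchemaField("name", "STRING", policy_tags=pii_tag),
    bigquery.SchemaField("address", "STRING", policy_tags=pii_tag),
]

table = bigquery.Table("my-project.customer_data.customers", schema=schema)
client.create_table(table, exists_ok=True)
```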
Question 8
You are building a streaming Dataflow pipeline that ingests noise level data from hundreds of sensors placed near construction sites across a city. The sensors measure noise level every ten seconds, and send that data to the pipeline when levels reach above 70 dBA. You need to detect the average noise level from a sensor when data is received for a duration of more than 30 minutes, but the window ends when no data has been received for 15 minutes. What should you do?
  1. Use session windows with a 30-minute gap duration.
  2. Use tumbling windows with a 15-minute window and a fifteen-minute .withAllowedLateness operator.
  3. Use session windows with a 15-minute gap duration.
  4. Use hopping windows with a 15-minute window, and a thirty-minute period.
Correct answer: C
Explanation:
Session windows are dynamic windows that group elements based on the periods of activity. They are useful for streaming data that is irregularly distributed with respect to time. In this case, the noise level data from the sensors is only sent when it exceeds a certain threshold, and the duration of the noise events may vary. Therefore, session windows can capture the average noise level for each sensor during the periods of high noise, and end the window when there is no data for a specified gap duration. The gap duration should be 15 minutes, as the requirement is to end the window when no data has been received for 15 minutes. A 30-minute gap duration would be too long and may miss some noise events that are shorter than 30 minutes. Tumbling windows and hopping windows are fixed windows that group elements based on a fixed time interval. They are not suitable for this use case, as they may split or overlap the noise events from the sensors, and do not account for the periods of inactivity.
Reference:
Windowing concepts
Session windows
Windowing in Dataflow
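A minimal Apache Beam (Python SDK) sketch of the 15-minute session window with a per-sensor average; the Pub/Sub topic and the "sensor_id,dBA" message format are assumptions, and any check that a session lasted more than 30 minutes would be applied downstream:

```python
# Sketch only: per-sensor session windows that close after 15 minutes of
# silence, averaging the noise level within each session.
import apache_beam as beam
from apache_beam.transforms import window

def average_noise_per_session(pipeline):
    return (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/noise-levels"  # placeholder topic
        )
        | "Parse" >> beam.Map(lambda msg: msg.decode().split(","))       # "sensor-1,74.2"
        | "KeyBySensor" >> beam.Map(lambda rec: (rec[0], float(rec[1])))
        # A session stays open while readings keep arriving and closes after
        # 15 minutes with no data from that sensor.
        | "SessionWindow" >> beam.WindowInto(window.Sessions(15 * 60))
        | "AveragePerSensor" >> beam.combiners.Mean.PerKey()
    )
```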
Question 9
Your company receives both batch- and stream-based event data. You want to process the data using Google Cloud Dataflow over a predictable time period.
However, you realize that in some instances data can arrive late or out of order. How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?
  1. Set a single global window to capture all the data.
  2. Set sliding windows to capture all the lagged data.
  3. Use watermarks and timestamps to capture the lagged data.
  4. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.
Correct answer: C
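To illustrate handling late or out-of-order data with timestamps and watermarks, a hedged Apache Beam (Python SDK) sketch; the event_time field, the one-minute window, and the one-hour allowed lateness are invented for the example:

```python
# Sketch only: assign event-time timestamps and use a watermark trigger with
# allowed lateness so late or out-of-order events still land in the right window.
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, AfterWatermark

def window_with_late_data(events):
    # `events` is a PCollection of dicts carrying an 'event_time' epoch-seconds field.
    return (
        events
        | "AttachTimestamps" >> beam.Map(
            lambda e: window.TimestampedValue(e, e["event_time"])
        )
        | "FixedWindows" >> beam.WindowInto(
            window.FixedWindows(60),                     # 1-minute event-time windows
            trigger=AfterWatermark(late=AfterCount(1)),  # re-fire when late data arrives
            allowed_lateness=3600,                       # accept data up to 1 hour late
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
    )
```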
Question 10
You have some data, which is shown in the graphic below. The two dimensions are X and Y, and the shade of each dot represents its class. You want to classify this data accurately using a linear algorithm.
[Graphic: scatter plot of the data points in the X-Y plane, with dot shading indicating class]
To do this you need to add a synthetic feature. What should the value of that feature be?
  1. X^2+Y^2
  2. X^2
  3. Y^2
  4. cos(X)
Correct answer: A