Pass the Google Cloud Certified Professional-Data-Engineer Questions and answers with Dumpstech

Viewing page 11 out of 12 pages

Viewing questions 101-110 out of questions

Questions # 101:

You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?

Options:

Eliminate features that are highly correlated to the output labels.

Combine highly co-dependent features into one representative feature.

Instead of feeding in each feature individually, average their values in batches of 3.

Remove the features that have null values for more than 50% of the training records.

Questions # 102:

You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. You initially designed the application to use streaming inserts for individual postings. Your application also performs data aggregations right after the streaming inserts. You discover that the queries after streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?

Options:

Re-write the application to load accumulated data every 2 minutes.

Convert the streaming insert code to batch load for individual messages.

Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.

Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.

Questions # 103:

Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?

Options:

Threading

Serialization

Dropout Methods

Dimensionality Reduction

Questions # 104:

You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change the data type of the column DT to TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

Options:

Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.

Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column DT for each row. Reference the column TS instead of the column DT from now on.

Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.

Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.

Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.

Questions # 105:

Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and serves millions of users. How should you design the frontend to respond to a database failure?

Options:

Issue a command to restart the database servers.

Retry the query with exponential backoff, up to a cap of 15 minutes.

Retry the query every second until it comes back online to minimize staleness of data.

Reduce the query frequency to once every hour until the database comes back online.
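
One of the options above mentions retrying with exponential backoff up to a cap. For readers unfamiliar with the term, the following is a minimal Python sketch of the idea, not a statement of the expected answer. The names backoff_delays, query_with_backoff, and run_query are hypothetical, and the one-second starting delay, the doubling factor, and the random jitter are assumptions; only the 15-minute cap comes from the option text.

```python
import random
import time


def backoff_delays(base_seconds=1.0, cap_seconds=15 * 60, factor=2.0):
    """Yield an endless series of delays that grows by `factor` up to the cap."""
    delay = base_seconds
    while True:
        yield min(delay, cap_seconds)
        delay = min(delay * factor, cap_seconds)


def query_with_backoff(run_query, max_attempts=12, sleep=time.sleep):
    """Call run_query(), backing off between failed attempts.

    run_query is a hypothetical callable that raises ConnectionError
    while the database is unreachable.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Assumed jitter: wait a random time up to the current delay so
            # that many clients do not all retry at the same instant.
            sleep(random.uniform(0, next(delays)))
```

With these assumed defaults the upper bound on the wait doubles after each failure (1 s, 2 s, 4 s, ...) until it reaches 900 s, and then stays there.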

Questions # 106:

You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update. What should you do?

Options:

Update the current pipeline and use the drain flag.

Update the current pipeline and provide the transform mapping JSON object.

Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.

Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.

Questions # 107:

Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?

Options:

Redefine the schema by evenly distributing reads and writes across the row space of the table.

The performance issue should be resolved over time as the size of the Bigtable cluster is increased.

Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.

Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.

Questions # 108:

You are designing a fault-tolerant architecture to store data in a regional BigQuery dataset. You need to ensure that your application is able to recover from a corruption event in your tables that occurred within the past seven days. You want to adopt managed services with the lowest RPO and most cost-effective solution. What should you do?

Options:

Export the data from BigQuery into a new table that excludes the corrupted data.

Migrate your data to multi-region BigQuery buckets.

Access historical data by using time travel in BigQuery.

Create a BigQuery table snapshot on a daily basis.

Questions # 109:

You've migrated a Hadoop job from an on-premises cluster to Dataproc and Google Cloud Storage. Your Spark job is a complex analytical workload that consists of many shuffling operations, and the initial data are parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. Your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with only 2 non-preemptible workers) for this workload. What should you do?

Options:

Switch from HDDs to SSDs, and override the preemptible VMs configuration to increase the boot disk size.

Increase the size of your parquet files to ensure that they are 1 GB minimum.

Switch to TFRecords format (approx. 200 MB per file) instead of parquet files.

Switch from HDDs to SSDs, copy initial data from Cloud Storage to Hadoop Distributed File System (HDFS), run the Spark job, and copy results back to Cloud Storage.

Questions # 110:

You maintain ETL pipelines. You notice that a streaming pipeline running on Dataflow is taking a long time to process incoming data, which causes output delays. You also noticed that the pipeline graph was automatically optimized by Dataflow and merged into one step. You want to identify where the potential bottleneck is occurring. What should you do?

Options:

Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console.

Log debug information in each ParDo function, and analyze the logs at execution time.

Insert output sinks after each key processing step, and observe the writing throughput of each block.

Verify that the Dataflow service accounts have appropriate permissions to write the processed data to the output sinks.

Answer

Explanation

When Dataflow fuses multiple transformations into a single stage (step), it can make it harder to pinpoint which specific part of that fused stage is causing a bottleneck because internal metrics for individual ParDos within the fused stage might not be as distinct.

Reshuffle operation (the first option): Inserting a Reshuffle (or a GroupByKey followed by ungrouping, which forces a shuffle) between logical processing steps in your Beam pipeline prevents Dataflow from fusing those steps. A shuffle operation acts as a barrier to fusion. It materializes the intermediate PCollection and forces data to be redistributed across workers.

Benefit for debugging: Once fusion is broken, the Dataflow monitoring UI displays distinct steps for the operations before and after the Reshuffle. You can then observe metrics such as processing time, throughput, and watermarks for each separated step, which makes it much easier to identify which part of the original fused logic is the bottleneck.

Let's analyze why the other options are less effective for this specific problem of a fused step:

Verify service account permissions: While important for overall pipeline health, permission issues usually result in outright failures or errors in the logs, not in a slowdown within a successfully running (albeit slow) fused step.

Insert output sinks: Adding actual output sinks (such as writing to Pub/Sub or GCS) after each key step would also break fusion and allow you to measure throughput. However, it is a more heavyweight approach than Reshuffle. It introduces I/O overhead and requires setting up and managing these temporary sinks. Reshuffle is a lighter-weight way to achieve the same goal of breaking fusion for diagnostic purposes within the pipeline itself.

Log debug information: Logging can be helpful, but if the entire fused step is slow, logs might not easily distinguish which internal operation is the culprit without very careful and verbose logging. Analyzing potentially massive volumes of logs for performance bottlenecks can be less direct than observing stage metrics in the Dataflow UI once fusion is broken.

Using Reshuffle is a standard technique recommended by Google Cloud for debugging performance issues in fused Dataflow stages.

Reference:

Google Cloud Documentation: Dataflow > Troubleshooting Dataflow pipelines > Common Dataflow errors and troubleshooting steps > Pipeline is slow or stuck. "Break transform fusion: Certain transforms in your pipeline might be fused together into a single stage for optimization. If a particular fused stage is causing a bottleneck, you can temporarily add Reshuffle transforms between the fused transforms to break them into smaller, separate stages. This allows you to get more visibility into the performance of each individual transform and isolate the bottleneck."

Apache Beam Documentation: Programming Guide > Pipeline I/O > Reshuffle. "Reshuffle can be used to prevent fusion, and ensure that data is materialized and redistributed." (While the primary purpose of Reshuffle is often related to data distribution and freshness, a side effect and common use case is to break fusion for monitoring and debugging.)
