GCP ML Engineer Exam Study Notes

The sections below will help you prepare for the Google Cloud Professional Machine Learning Engineer certification. Please click on the links below to read the details of the relevant sections.

- Introduction – Google Cloud Professional Machine Learning Engineer Certification; Google Cloud Professional ML Engineer Objective Map
- Section 1: Framing ML Problems – Translating Business Use Cases; Machine Learning Approaches; ML Success Metrics; Responsible AI Practices; Summary; Exam Essentials; Review Questions
- Section 2: Exploring Data and Building Data Pipelines – Visualization; Statistics Fundamentals; Data Quality and Reliability; Establishing Data Constraints; Running TFDV on Google Cloud Platform; Organizing and Optimizing Training Datasets; Handling Missing Data; Data Leakage; Summary; Exam Essentials; Review Questions
- Section 3: Feature Engineering – Consistent Data Preprocessing; Encoding Structured Data Types; Class Imbalance; Feature Crosses; TensorFlow Transform; GCP Data and ETL Tools; Summary; Exam Essentials; Review Questions
- Section 4: Choosing the Right ML Infrastructure – Pretrained vs. AutoML vs. Custom Models; Pretrained Models; AutoML; Custom Training; Provisioning for Predictions; Summary; Exam Essentials; Review Questions
- Section 5: Architecting ML Solutions – Designing Reliable, Scalable, and Highly Available ML Solutions; Choosing an Appropriate ML Service; Data Collection and Data Management; Automation and Orchestration; Serving; Summary; Exam Essentials; Review Questions
- Section 6: Building Secure ML Pipelines – Building Secure ML Systems; Identity and Access Management; Privacy Implications of Data Usage and Collection; Summary; Exam Essentials; Review Questions
- Section 7: Model Building – Choice of Framework and Model Parallelism; Modeling Techniques; Transfer Learning; Semi-supervised Learning; Data Augmentation; Model Generalization and Strategies to Handle Overfitting and Underfitting; Summary; Exam Essentials; Review Questions
- Section 8: Model Training and Hyperparameter Tuning – Ingestion of Various File Types into Training; Developing Models in Vertex AI Workbench by Using Common Frameworks; Training a Model as a Job in Different Environments; Hyperparameter Tuning; Tracking Metrics During Training; Retraining/Redeployment Evaluation; Unit Testing for Model Training and Serving; Summary; Exam Essentials; Review Questions
- Section 9: Model Explainability on Vertex AI – Model Explainability on Vertex AI; Summary; Exam Essentials; Review Questions
- Section 10: Scaling Models in Production – Scaling Prediction Service; Serving (Online, Batch, and Caching); Google Cloud Serving Options; Hosting Third-Party Pipelines (MLflow) on Google Cloud; Testing for Target Performance; Configuring Triggers and Pipeline Schedules; Summary; Exam Essentials; Review Questions
- Section 11: Designing ML Training Pipelines – Orchestration Frameworks; Identification of Components, Parameters, Triggers, and Compute Needs; System Design with Kubeflow/TFX; Hybrid or Multicloud Strategies; Summary; Exam Essentials; Review Questions
- Section 12: Model Monitoring, Tracking, and Auditing Metadata – Model Monitoring; Model Monitoring on Vertex AI; Logging Strategy; Model and Dataset Lineage; Vertex AI Experiments; Vertex AI Debugging; Summary; Exam Essentials; Review Questions
- Section 13: Maintaining ML Solutions – MLOps Maturity; Retraining and Versioning Models; Feature Store; Vertex AI Permissions Model; Common Training and Serving Errors; Summary; Exam Essentials; Review Questions
- Section 14: BigQuery ML – BigQuery Data Access; BigQuery ML Algorithms; Explainability in BigQuery ML; BigQuery ML vs. Vertex AI Tables; Interoperability with Vertex AI; BigQuery Design Patterns; Summary; Exam Essentials; Review Questions

Section 2 – Building a Data Lake

Google Cloud Platform (GCP) provides a number of options for storing and processing data. Here are some of the main options:

- Cloud Storage: A highly scalable, durable, and secure object storage service that lets you store and retrieve large amounts of data from anywhere on the internet. You can use Cloud Storage for a variety of data types, including structured data in CSV or JSON files, unstructured data such as audio or video files, and large datasets for analytics or machine learning.
- BigQuery: A fully managed, cloud-native data warehouse that enables very fast SQL queries on large datasets. It is ideal for storing and querying data that you need to analyze using SQL, and it integrates seamlessly with other GCP tools such as Data Studio and Cloud Dataproc (see the loading sketch after this list).
- Cloud SQL: A fully managed relational database service that makes it easy to set up, maintain, and administer a SQL database in the cloud. It supports a number of popular database engines, including MySQL and PostgreSQL, and can be used to store structured data such as customer information or product catalogs.
- Cloud Dataproc: A fully managed data processing service that makes it easy to run Apache Hadoop, Apache Spark, and other open-source data processing frameworks on GCP. It is well suited to ETL (extract, transform, load) workflows, as it lets you quickly process large datasets and load the results into storage systems such as BigQuery or Cloud Storage.
- Cloud Data Fusion: A fully managed data integration service that makes it easy to build and maintain ETL pipelines on GCP. It provides a visual interface for designing and executing ETL workflows, and it integrates with a number of GCP and third-party data sources and sinks.
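
To make the storage-to-warehouse path concrete, here is a minimal sketch that loads a CSV file from a Cloud Storage data lake into a BigQuery table using the google-cloud-bigquery Python client. The project, bucket, dataset, and table names are placeholders, not values from these notes.

    # pip install google-cloud-bigquery
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # infer the table schema from the file
    )

    # Load a placeholder file from the data lake into my_dataset.orders
    load_job = client.load_table_from_uri(
        "gs://my-bucket/raw/orders.csv",
        "my-project.my_dataset.orders",
        job_config=job_config,
    )
    load_job.result()  # block until the load job finishes

    table = client.get_table("my-project.my_dataset.orders")
    print(f"Loaded {table.num_rows} rows")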

Section 1 – Introduction to Data Engineering

Data engineering is the field concerned with designing, building, maintaining, and testing the systems that store, process, and analyze data. It is a critical aspect of data science: the data infrastructure that data engineers build and maintain is what enables data scientists to do their work effectively. Google Cloud provides a number of tools and services for data engineering on its platform. Key components of Google Cloud's data engineering offerings include:

- Cloud Storage: Google Cloud's scalable, durable, and secure object storage service. It can store large volumes of data for batch processing or for real-time analytics.
- BigQuery: Google Cloud's fully managed, cloud-native data warehouse service. It can store, query, and analyze large datasets using SQL.
- Cloud Functions: Google Cloud's serverless compute platform. It can run code in response to events or automate tasks.
- Cloud Pub/Sub: Google Cloud's fully managed, real-time messaging service. It can decouple and scale microservices, data pipelines, and event-driven systems (see the publishing sketch below).
- Cloud Data Fusion: Google Cloud's fully managed, cloud-native data integration platform. It can build, manage, and orchestrate data pipelines for ingesting, transforming, and enriching data.
- Cloud Dataproc: Google Cloud's fully managed, cloud-native data processing service. It can run Apache Hadoop, Apache Spark, and other data processing frameworks on Google Cloud.
- Cloud Data Loss Prevention (DLP): Google Cloud's fully managed data security and privacy service. It can discover, classify, and protect sensitive data across Google Cloud.

A number of other Google Cloud tools and services are also useful for data engineering, such as Cloud Composer (a fully managed, cloud-native workflow orchestration service), Cloud Dataprep (a fully managed, cloud-native data preparation service), and Cloud Dataproc autoscaling (which automatically scales Cloud Dataproc clusters based on workload demands). Overall, Google Cloud provides a comprehensive set of tools and services for data engineering, making it a popular choice for organizations looking to build and maintain effective data infrastructure in the cloud.
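
As one concrete illustration of these building blocks, the sketch below publishes a JSON event to a Cloud Pub/Sub topic with the google-cloud-pubsub Python client; a Dataflow job or Cloud Function subscribed to the topic could then process the event downstream. The project and topic names are invented for the example.

    # pip install google-cloud-pubsub
    import json
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # "my-project" and "page-views" are placeholder names
    topic_path = publisher.topic_path("my-project", "page-views")

    event = {"user_id": "u123", "page": "/pricing", "ts": "2024-01-01T00:00:00Z"}

    # Pub/Sub messages carry raw bytes, so serialize the event to JSON
    future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
    print("Published message", future.result())  # result() returns the message ID
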
Data Engineering Challenges

There are a number of challenges that data engineers may face in their work:

- Data Volume: One of the main challenges data engineers face is dealing with large volumes of data. Storing, processing, and analyzing large datasets is resource-intensive and requires careful planning and optimization to keep the data infrastructure efficient and scalable.
- Data Quality: Ensuring the quality and integrity of data can be a challenge, as data may be incomplete, incorrect, or inconsistent. Data engineers may need to develop processes for cleansing, normalizing, and enriching data to ensure that it is fit for use.
- Data Security: Protecting data from unauthorized access, theft, or corruption is a critical concern. Data engineers may need to implement security measures such as encryption, access controls, and monitoring to keep data secure.
- Data Governance: Ensuring that data is used ethically and responsibly is also an important consideration. Data engineers may need to implement processes for managing data access, retention, and deletion to ensure compliance with data governance regulations and policies.
- Data Integration: Data engineers may need to integrate data from a variety of sources, which can be a complex and time-consuming task. This may involve dealing with different data formats, schemas, and APIs, and may require building custom data pipelines.
- Data Pipeline Maintenance: Data pipelines can be complex systems, and maintaining them is a challenge in itself. Data engineers may need to monitor and troubleshoot pipelines to keep them running smoothly, and make updates and improvements as needed.
- Collaboration: Data engineering projects often involve working with a team of data scientists, analysts, and other stakeholders. Coordinating with these team members can be challenging and requires effective communication and project-management skills.

Intro to BigQuery

BigQuery is a fully managed, cloud-native data warehouse service offered by Google Cloud. It enables users to store, query, and analyze large datasets using SQL. One of BigQuery's main advantages is its ability to handle extremely large datasets: it is designed to be highly scalable and can process billions of rows of data in seconds, which makes it well suited to data at scale, such as real-time analytics or data-intensive applications. BigQuery is a serverless service, meaning that users do not need to manage infrastructure or do capacity planning; it scales automatically to meet workload demands, and users pay only for the resources they consume. BigQuery also integrates with a number of other Google Cloud services, such as Cloud Storage, Cloud Functions, and Cloud Pub/Sub, making it easy to build data pipelines and perform data processing and analysis tasks. To use BigQuery, users load data into the service by uploading files from a local machine or transferring data from other Google Cloud services, then use SQL to query the data and perform various types of analysis (see the query example at the end of this section). BigQuery also offers advanced features, such as support for geospatial data types, machine learning models, and real-time analytics, which make it a powerful tool for a wide range of data engineering and analytics tasks.

Data Lakes and Data Warehouses

A data lake is a central repository that allows organizations to store all their structured and unstructured data at any scale. Data in a data lake can be kept in its raw, original form, which makes it easier to retain data for long periods and to perform analysis on a wide range of data types. A data warehouse, on the other hand, is a central repository designed specifically for storing and querying structured data. Data warehouses are optimized for fast querying and analysis and are typically used for business intelligence and reporting. There are a number of key differences between the two: a data lake stores raw data of any type and applies structure only when the data is read, while a data warehouse stores cleaned, structured data that is modeled up front for fast SQL analysis.
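
To ground the BigQuery introduction above, here is a small sketch that runs a SQL query against one of BigQuery's public datasets from Python. The specific query is only an example of the kind of SQL analysis described, not something taken from these notes.

    # pip install google-cloud-bigquery
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    # The ten most common given names in the public USA names dataset
    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """

    for row in client.query(query).result():
        print(row["name"], row["total"])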

GCP Data Engineer Exam Study Notes

The sections below will help you prepare for the Google Cloud Professional Data Engineer certification. Please click on the links below to read the details of the relevant sections.

- Section 1: Introduction to Data Engineering – Explore the role of a data engineer; Data engineering challenges; Intro to BigQuery; Data Lakes and Data Warehouses; Federated Queries with BigQuery; Transactional Databases vs. Data Warehouses; Manage data access and governance; Build production-ready pipelines
- Section 2: Building a Data Lake – Introduction to Data Lakes; Data Storage and ETL options on GCP; Building a Data Lake using Cloud Storage; Optimizing cost with Google Cloud Storage classes and Cloud Functions; Securing Cloud Storage; Storing All Sorts of Data Types; Running federated queries on Parquet and ORC files in BigQuery; Cloud SQL as a relational Data Lake
- Section 3: Building a Data Warehouse – The modern data warehouse; Intro to BigQuery; Getting Started; Loading Data; Querying Cloud SQL from BigQuery; Exploring Schemas; Exploring BigQuery Public Datasets with SQL using INFORMATION_SCHEMA; Schema Design; Nested and Repeated Fields; Working with JSON and Array data in BigQuery; Optimizing with Partitioning and Clustering; Transforming Batch and Streaming Data
- Section 4: Introduction to Building Batch Data Pipelines – ELT and ETL; Quality considerations; How to carry out operations in BigQuery; ELT to improve data quality in BigQuery; Shortcomings; ETL to solve data quality issues
- Section 5: Executing Spark on Cloud Dataproc – The Hadoop ecosystem; Running Hadoop on Cloud Dataproc; GCS instead of HDFS; Optimizing Dataproc; Running Apache Spark jobs on Cloud Dataproc
- Section 6: Serverless Data Processing with Cloud Dataflow – Cloud Dataflow; Why customers value Dataflow; Dataflow Pipelines; A Simple Dataflow Pipeline (Python/Java); MapReduce in Dataflow (Python/Java); Side Inputs (Python/Java); Dataflow Templates; Dataflow SQL
- Section 7: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer – Building Batch Data Pipelines visually with Cloud Data Fusion; Components; UI Overview; Building a Pipeline; Exploring Data using Wrangler; Lab: An Introduction to Cloud Composer; Orchestrating work between GCP services with Cloud Composer; Apache Airflow Environment; DAGs and Operators; Workflow Scheduling; Monitoring and Logging
- Section 8: Introduction to Processing Streaming Data – Processing Streaming Data
- Section 9: Serverless Messaging with Cloud Pub/Sub – Cloud Pub/Sub
- Section 10: Cloud Dataflow Streaming Features – Cloud Dataflow Streaming Features
- Section 11: High-Throughput BigQuery and Bigtable Streaming Features – BigQuery Streaming Features; Streaming Analytics and Dashboards; Cloud Bigtable; Streaming Data Pipelines into Bigtable
- Section 12: Advanced BigQuery Functionality and Performance – Analytic Window Functions; Using WITH Clauses; GIS Functions; Demo: Mapping Fastest Growing Zip Codes with BigQuery GeoViz; Performance Considerations; Optimizing your BigQuery Queries for Performance; Creating Date-Partitioned Tables in BigQuery
- Section 13: Introduction to Analytics and AI – What is AI?; From Ad Hoc Data Analysis to Data-Driven Decisions; Options for ML models on GCP
- Section 14: Prebuilt ML model APIs for Unstructured Data – Unstructured Data is Hard; ML APIs for Enriching Data; Using the Natural Language API to Classify Unstructured Text
- Section 15: Big Data Analytics with Cloud AI Platform Notebooks – What's a Notebook; BigQuery Magic and Ties to Pandas; BigQuery in Jupyter Labs on AI Platform
- Section 16: Production ML Pipelines with Kubeflow – Ways to do ML on GCP; Kubeflow; AI Hub; Running AI models on Kubeflow
- Section 17: Custom Model building with SQL in BigQuery ML – BigQuery ML for Quick Model Building; Demo: Train a model with BigQuery ML to predict NYC taxi fares; Supported Models; Lab Option 1: Predict Bike Trip Duration with a Regression Model in BQML; Lab Option 2: Movie Recommendations in BigQuery ML
- Section 18: Custom Model building with Cloud AutoML – Why AutoML?; AutoML Vision; AutoML NLP; AutoML Tables
