Exam 70-775

  • Published:
    22 February 2017
  • Languages:
    English
  • Audiences:
    Data scientists
  • Technology:
    Azure HDInsight
  • Credit towards certification:
    MCP, MCSA, MCSE

Perform Data Engineering on Microsoft Azure HDInsight


* Pricing does not reflect any promotional offers or reduced pricing for Microsoft Imagine Academy program members, Microsoft Certified Trainers, and Microsoft Partner Network program members. Pricing is subject to change without notice. Pricing does not include applicable taxes. Please confirm exact pricing with the exam provider before registering to take an exam.

Effective May 1, 2017, the existing cancellation policy will be replaced in its entirety with the following policy: Cancelling or rescheduling your exam within 5 business days of your registered exam time is subject to a fee. Failing to show up for your exam appointment or not rescheduling or cancelling your appointment at least 24 hours prior to your scheduled appointment forfeits your entire exam fee.

Skills measured

This exam measures your ability to accomplish the technical tasks listed below. View video tutorials about the variety of question types on Microsoft exams.

Please note that the questions may test on, but will not be limited to, the topics described in the bulleted text.

Do you have feedback about the relevance of the skills measured on this exam? Please send Microsoft your comments. All feedback will be reviewed and incorporated as appropriate while still maintaining the validity and reliability of the certification process. Note that Microsoft will not respond directly to your feedback. We appreciate your input in ensuring the quality of the Microsoft Certification Program.

If you have concerns about specific questions on this exam, please submit an exam challenge.

If you have other questions or feedback about Microsoft Certification exams or about the certification program, registration, or promotions, please contact your Regional Service Center.

Administer and Provision HDInsight Clusters
  • Deploy HDInsight clusters
    • Create a cluster in a private virtual network, create a cluster that has a custom metastore, create a domain-joined cluster, select an appropriate cluster type based on workload considerations, customise a cluster by using script actions, provision a cluster by using the Azure portal, provision a cluster by using Azure CLI tools (see the provisioning sketch after this section), provision a cluster by using Azure Resource Manager (ARM) templates and PowerShell, manage managed disks, configure VNet peering
  • Deploy and secure multi-user HDInsight clusters
    • Provision users who have different roles; manage users, groups, and permissions through Apache Ambari, PowerShell, and Apache Ranger; configure Kerberos; configure service accounts; implement SSH tunneling; restrict access to data
  • Ingest data for batch and interactive processing
    • Ingest data from cloud or on-premises sources; store data in Azure Data Lake; store data in Azure Blob Storage; perform routine small writes on a continuous basis using Azure CLI tools; ingest data in Apache Hive and Apache Spark by using Apache Sqoop, Azure Data Factory (ADF), AzCopy, and AdlCopy; ingest data from an on-premises Hadoop cluster
  • Configure HDInsight clusters
    • Manage metastore upgrades; view and edit Ambari configuration groups; view and change service configurations through Ambari; access logs written to Azure Table storage; enable heap dumps for Hadoop services; manage HDInsight configuration by using the HDInsight .NET SDK and PowerShell; perform cluster-level debugging; stop and start services through Ambari; manage Ambari alerts and metrics
  • Manage and debug HDInsight jobs
    • Describe YARN architecture and operation; examine YARN jobs through ResourceManager UI and review running applications; use YARN CLI to kill jobs; find logs for different types of jobs; debug Hadoop and Spark jobs; use Azure Operations Management Suite (OMS) to monitor and manage alerts, and perform predictive actions
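
Of the provisioning paths above, the Azure CLI route is the easiest to sketch briefly. The example below is illustrative only: it drives the Azure CLI from Python, every name and credential in it is a placeholder, and it assumes the az hdinsight create command and flags as shipped in current Azure CLI releases.

    # A minimal sketch: provision a Spark cluster on HDInsight by calling
    # the Azure CLI from Python. All names and secrets are placeholders.
    import subprocess

    subprocess.run(
        [
            "az", "hdinsight", "create",
            "--name", "demo-spark-cluster",            # placeholder cluster name
            "--resource-group", "demo-rg",             # placeholder resource group
            "--type", "spark",                         # cluster type chosen for the workload
            "--workernode-count", "4",
            "--http-user", "admin",
            "--http-password", "<cluster-login-password>",
            "--ssh-user", "sshuser",
            "--ssh-password", "<ssh-password>",
            "--storage-account", "demostorageacct",    # default Azure Blob storage account
        ],
        check=True,  # raise CalledProcessError if provisioning fails
    )
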
Implement Big Data Batch Processing Solutions
  • Implement batch solutions with Hive and Apache Pig
    • Define external Hive tables; load data into a Hive table; use partitioning and bucketing to improve Hive performance; use semi-structured files such as XML and JSON with Hive; join tables with Hive using shuffle joins and broadcast joins; invoke Hive UDFs with Java and Python; design scripts with Pig; identify query bottlenecks using the Hive query graph; identify the appropriate storage format, such as Apache Parquet, ORC, Text, and JSON (an external-table sketch follows this section)
  • Design batch ETL solutions for big data with Spark
    • Share resources between Spark applications using YARN queues and preemption, select Spark executor and driver settings for optimal performance, use partitioning and bucketing to improve Spark performance, connect to external Spark data sources, incorporate custom Python and Scala code in a Spark Datasets program, identify query bottlenecks using the Spark SQL query graph
  • Operationalise Hadoop and Spark
    • Create and customise a cluster by using ADF; attach storage to a cluster and run an ADF activity; choose between bring-your-own and on-demand clusters; use Apache Oozie with HDInsight; choose between Oozie and ADF; share metastore and storage accounts between a Hive cluster and a Spark cluster to enable the same table across the cluster types; select an appropriate storage type for a data pipeline, such as Blob storage, Azure Data Lake, and local Hadoop Distributed File System (HDFS)
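
As a concrete illustration of the external-table and storage-format tasks above, the sketch below uses PySpark with Hive support. It is a minimal example under stated assumptions: the wasbs:/// paths, the web_logs table, and the log_date column are placeholders invented for illustration.

    # Minimal sketch: define a partitioned external Hive table from Spark,
    # then rewrite raw JSON as partitioned Parquet for later batch queries.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("batch-etl-sketch")
             .enableHiveSupport()   # lets spark.sql reach the Hive metastore
             .getOrCreate())

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
            user_id STRING,
            url     STRING
        )
        PARTITIONED BY (log_date STRING)
        STORED AS ORC
        LOCATION 'wasbs:///data/web_logs'
    """)

    # The raw JSON is assumed to carry a log_date column to partition by.
    raw = spark.read.json("wasbs:///data/raw/logs.json")
    (raw.write
        .mode("overwrite")
        .partitionBy("log_date")
        .parquet("wasbs:///data/curated/logs"))
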
Implement Big Data Interactive Processing Solutions
  • Implement interactive queries for big data with Spark SQL
    • Execute queries using Spark SQL, cache Spark DataFrames for iterative queries, save Spark DataFrames as Parquet files, connect BI tools to Spark clusters, optimise join types such as broadcast versus merge joins, manage the Spark Thrift server and change the YARN resource allocation, identify use cases for different storage types for interactive queries (a caching and broadcast-join sketch follows this section)
  • Perform exploratory data analysis by using Spark SQL
    • Use Jupyter and Apache Zeppelin for visualisation and for developing tidy Spark DataFrames for modeling, use Spark SQL’s two-table joins to merge DataFrames and cache results, save tidied Spark DataFrames to a performant format for reading and analysis (Apache Parquet), manage interactive Livy sessions and their resources
  • Implement interactive queries for big data with Interactive Hive
    • Enable Hive LLAP through Hive settings, manage and configure memory allocation for Hive LLAP jobs, connect BI tools to Interactive Hive clusters
  • Perform exploratory data analysis by using Hive
    • Perform interactive querying and visualisation, use Ambari Views, use HiveQL, parse CSV files with Hive, use ORC versus Text for caching, use internal and external tables in Hive, use Zeppelin to visualise data
  • Perform interactive processing by using Apache Phoenix on HBase
    • Use Phoenix in HDInsight; use Phoenix Grammar for queries; configure transactions, user-defined functions, and secondary indexes; identify and optimise Phoenix performance; select between Hive, Spark, and Phoenix on HBase for interactive processing; identify when to share metastore between a Hive cluster and a Spark cluster
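
The caching and broadcast-join items above can be shown in a few lines of PySpark. This is a sketch only; the Parquet paths, column names, and relative table sizes are assumptions made for illustration.

    # Minimal sketch: cache a DataFrame for repeated Spark SQL queries and
    # hint a broadcast join so the small table is shipped to every executor.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("interactive-sql-sketch").getOrCreate()

    events = spark.read.parquet("wasbs:///data/curated/logs")
    events.cache()                           # keep in memory across iterative queries
    events.createOrReplaceTempView("events")

    spark.sql(
        "SELECT url, COUNT(*) AS hits FROM events GROUP BY url ORDER BY hits DESC"
    ).show(10)

    # users is assumed small enough to broadcast, avoiding a shuffle of the
    # much larger events DataFrame.
    users = spark.read.parquet("wasbs:///data/curated/users")
    joined = events.join(broadcast(users), "user_id")
    joined.write.mode("overwrite").parquet("wasbs:///data/curated/joined")
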
Implement Big Data Real-Time Processing Solutions
  • Create Spark streaming applications using DStream API
    • Define DStreams and compare them to Resilient Distributed Datasets (RDDs); start and stop streaming applications; transform DStreams (flatMap, reduceByKey, updateStateByKey); persist long-term data to HBase and SQL; persist long-term data to Azure Data Lake and Azure Blob Storage; stream data from Apache Kafka or Event Hubs; visualise streaming data in a Power BI real-time dashboard
  • Create Spark structured streaming applications
    • Use the DataFrames and Datasets APIs to create streaming DataFrames and Datasets; create window operations on event time; define window transformations for stateful and stateless operations; use window functions, reduceByKey, and windowing to summarise streaming data; persist long-term data to HBase and SQL; persist long-term data to Azure Data Lake and Azure Blob Storage; stream data from Kafka or Event Hubs; visualise streaming data in a Power BI real-time dashboard (a structured streaming sketch follows this section)
  • Develop big data real-time processing solutions with Apache Storm
    • Create Storm clusters for real-time jobs; persist long-term data to HBase and SQL; persist long-term data to Azure Data Lake and Azure Blob Storage; stream data from Kafka or Event Hubs; configure event windows in Storm; visualise streaming data in a Power BI real-time dashboard; define Storm topologies and describe the Storm computation graph architecture; create Storm streams and conduct streaming joins; run Storm topologies in local mode for testing; configure Storm applications (workers, debug mode); use stream groupings to broadcast tuples across components; debug and monitor Storm jobs
  • Build solutions that use Kafka
    • Create Spark and Storm clusters in the virtual network, manage partitions, configure MirrorMaker, start and stop services through Ambari, manage topics
  • Build solutions that use HBase
    • Identify HBase use cases in HDInsight; use HBase Shell to create, update, and drop HBase tables; monitor an HBase cluster; optimise the performance of an HBase cluster; identify use cases for using Phoenix for analytics of real-time data; implement replication in HBase
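
To ground the structured streaming items above, here is a hedged PySpark sketch of a job that reads from Kafka and summarises events over an event-time window. The broker address and topic are placeholders, and the job assumes the spark-sql-kafka connector package is available on the cluster.

    # Minimal sketch: windowed aggregation over a Kafka stream with a
    # watermark bounding late data. All names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.appName("stream-window-sketch").getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
              .option("subscribe", "events")                      # placeholder topic
              .load())

    # The Kafka source exposes a message timestamp column; treat it as the
    # event time, tolerate 10 minutes of lateness, count per 5-minute window.
    counts = (stream
              .withWatermark("timestamp", "10 minutes")
              .groupBy(window(col("timestamp"), "5 minutes"))
              .count())

    query = (counts.writeStream
             .outputMode("update")
             .format("console")   # a real job would sink to HBase, SQL, or storage
             .start())
    query.awaitTermination()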

Preparation options

Online training
Instructor-led training
Books

Exam Ref 70-775 Perform Data Engineering on Microsoft Azure HDInsight
Published: April 24, 2018

Direct from Microsoft, this Exam Ref is the official study guide for the Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight certification exam. Exam Ref 70-775 Perform Data Engineering on Microsoft Azure HDInsight offers professional-level preparation that helps candidates maximize their exam performance and sharpen their skills on the job. It focuses on the specific areas of expertise modern IT professionals need to successfully administer and provision HDInsight clusters, and implement effective Big Data processing solutions with HDInsight.

Buy this book at the Microsoft Press Store

Microsoft Press books and eBooks are available for preorder within 90 days of the book’s publication date.

Who should take this exam?

This certification exam is targeted towards data engineers, data architects, data scientists, and data developers who implement big data engineering workflows on HDInsight. Candidates for this exam should have relevant work experience in big data analytics solutions.

Candidates should also be familiar with the features and capabilities of batch data processing, real-time processing, and interactive processing.

Candidates who take this exam are expected to be able to apply the Microsoft cloud ecosystem to design and implement big data engineering workflows and to use open source technologies for strategic value-add.

More information about exams

Preparing for an exam

We recommend that you review this exam preparation guide in its entirety and familiarise yourself with the resources on this website before you schedule your exam. See the Microsoft Certification exam overview for information about registration, videos of typical exam question formats and other preparation resources. For information on exam policies and scoring, see the Microsoft Certification exam policies and FAQs.

Note

This preparation guide is subject to change at any time without prior notice and at the sole discretion of Microsoft. Microsoft exams might include adaptive testing technology and simulation items. Microsoft does not identify the format in which exams are presented. Please use this preparation guide to prepare for the exam, regardless of its format. To help you prepare for this exam, Microsoft recommends that you have hands-on experience with the product and that you use the specified training resources. These training resources do not necessarily cover all of the topics listed in the "Skills measured" section.