[DP-900] Short Note

For anyone planning to study for and take DP-900, I recommend studying for and passing AZ-900 first. AZ-900 lays the cloud foundations, and DP-900 then builds on them and goes deeper into each topic. For example, AZ-900 teaches you what Azure Synapse Analytics is, while DP-900 has you actually try it out.

People who work with data
Getting to know data - the kinds of data / RDBMS / big data
Taking data into analytics + data visualization - if you need AI to help, that falls under AI-900

People who work with data
Getting to know data
Processing data
Knowledge Check
Reference

People who work with data

Database administrators
- managing databases
- assigning permissions to users
- storing backup copies of data and restoring data in the event of a failure
- managing the security of the data in the database
Data engineers
- managing infrastructure and processes for data integration across the organization
- applying data cleaning routines
- identifying data governance rules (privacy)
- implementing pipelines to transfer and transform data between systems
Data analysts
- exploring and analyzing data to create visualizations and charts that enable organizations to make informed decisions
Data scientists - this role leans more toward the Azure AI services

Getting to know data

- Identify data formats

Structured data - data with a clearly defined structure, such as Excel sheets or database tables
Semi-structured data - for example JSON
Unstructured data - data with no defined structure, such as documents or images
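
To make the difference between structured and semi-structured data concrete, here is a minimal sketch. The customer records and field names are made up for illustration. Each JSON record carries its own set of fields, and forcing the records into fixed columns leaves gaps or drops data.

```python
import json

# Hypothetical records: the same kind of entity, but with different fields.
raw = """
[
  {"id": 1, "name": "Ann", "email": "ann@example.com"},
  {"id": 2, "name": "Bob", "phones": ["081-000-0000", "02-000-0000"]}
]
"""

customers = json.loads(raw)  # semi-structured: each record describes itself

# Structured view: every record is forced into the same fixed columns.
columns = ["id", "name", "email"]
rows = [[customer.get(col) for col in columns] for customer in customers]

for row in rows:
    print(row)
```

Bob has no email, so his row gets an empty value, and his phone numbers do not fit the fixed columns at all. That flexibility is what semi-structured formats trade against the strict shape of a table.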

Data is stored in a data store, of which there are two broad groups:
- File stores - the everyday files everyone uses, such as txt / doc
- Databases - more formal than files, with a clearly defined format, such as a SQL database or a NoSQL store

- File Stores

Which format to choose depends on the data format, the applications that will use it, and whether it has to be readable by humans. The formats are:
Delimited text files
- Fields are separated by a delimiter: comma-separated values (CSV), tab-separated values (TSV), or space-delimited
- Fixed-width data gives every field a fixed size. For example, a field is 30 characters wide, 20 are actually used, and the rest is padded with spaces. The application reading the file cuts the string according to the spec defined for each field.
JavaScript Object Notation (JSON)
Extensible Markup Language (XML)
Binary Large Object (BLOB) - not human-readable
Optimized file formats - specialized formats, with applications built to work with them, that help with compression, indexing, and efficient storage and processing
- Avro (row-based format) - a good format for compressing data and minimizing storage and network bandwidth requirements.
- ORC (Optimized Row Columnar format) - adds stripes that carry some precomputed information
  - A stripe contains an index into the rows in the stripe,
  - the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
- Parquet (column-based format) - handles nested data types well, i.e. columns that break down further, such as a full name split into first name / last name. It supports very efficient read-heavy workloads, plus compression and encoding schemes.
- If you want to know more about optimized file formats, see Mils's blog post: Big Data file formats: ทำความรู้จัก Avro, Parquet และ ORC (mesodiar.com)
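
The delimited and fixed-width formats described above can be sketched in a few lines of Python. The field names and widths in `SPEC` are assumptions made up for this example. A real fixed-width file comes with its own spec.

```python
import csv
import io

# Delimited text: a comma separates the fields, the first line names them.
csv_text = "id,name,city\n1,Ann,Bangkok\n2,Bob,Chiang Mai\n"
csv_rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

# Fixed-width text: hypothetical spec of (field name, width in characters).
SPEC = [("id", 4), ("name", 10), ("city", 12)]

def to_fixed_width(values, spec=SPEC):
    """Pad each value with spaces up to the width its field is given."""
    return "".join(value.ljust(width) for value, (_, width) in zip(values, spec))

def parse_fixed_width(line, spec=SPEC):
    """Cut the line at the positions the spec dictates and drop the padding."""
    record, pos = {}, 0
    for field, width in spec:
        record[field] = line[pos:pos + width].rstrip()
        pos += width
    return record

fixed_lines = [to_fixed_width(v) for v in [("1", "Ann", "Bangkok"), ("2", "Bob", "Chiang Mai")]]
fixed_rows = [parse_fixed_width(line) for line in fixed_lines]

print(repr(fixed_lines[0]))
print(fixed_rows)
```

Both representations decode to the same records. The delimited file carries its structure in the separators, while the fixed-width file is only readable by someone who has the spec.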

- Databases

--> Relational databases (Normalized Data)

ACID Semantics
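- Atomicity - a transaction is treated as a single unit: either all of it succeeds, or the whole thing is marked as failed
- Consistency - within that single unit the data must stay consistent: transferring from A to B deducts the money from A and adds it to B
- Isolation - concurrent transactions, such as transfers happening at the same time, must not interfere with one another
- Durability - once something is committed, its state persists. For example, shut the database down, start it again, and the data is unchanged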
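
Atomicity is easiest to see with the bank transfer from the list above. The sketch below uses Python's built-in `sqlite3` module rather than an Azure service, and the `account` table and `transfer` helper are hypothetical names for this example. A transfer either applies both updates or neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one unit; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))

ok_small = transfer(conn, "A", "B", 30)   # fits within A's balance
ok_large = transfer(conn, "A", "B", 500)  # would overdraw A
print(ok_small, ok_large, balances(conn))
```

In the failed transfer the credit to B has already run when the debit to A is rejected, and the rollback undoes it, so no money appears out of nowhere.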
Data is stored in tables
- Column (data type / constraint)
- Value - the data stored in a column
Normalization - splitting data into groups of entities so that no data is duplicated
- This produces multiple tables, where each row in a table is uniquely identified by a primary key
- Tables are related to one another through foreign keys
Typically paired with line of business (LOB) applications, the ones that run the business (process business data), such as recording product data or inventory data. These are Online Transactional Processing (OLTP) workloads.
Data is managed with SQL
- Data Definition Language (DDL) - CREATE / ALTER / DROP / RENAME / TRUNCATE
- Data Control Language (DCL) - GRANT / DENY / REVOKE
- Data Manipulation Language (DML) - SELECT / INSERT / UPDATE / DELETE
Other keywords worth knowing
- VIEW - a logical (virtual) table
- Stored procedure - a program written to process data inside the database
- INDEX - speeds up data retrieval
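
The pieces above (primary key, foreign key, DDL, DML, view, index) fit together as in the following sketch. It runs on SQLite through Python's `sqlite3` module as a stand-in, so the SQL dialect differs slightly from Azure SQL, and SQLite has no DCL statements such as GRANT. The table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

# DDL: two normalized tables, an index, and a view.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
CREATE INDEX idx_order_customer ON sales_order(customer_id);
CREATE VIEW customer_total AS
    SELECT c.name AS name, SUM(o.amount) AS total
    FROM customer c
    JOIN sales_order o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name;
""")

# DML: insert rows, then read them back through the view.
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany(
    "INSERT INTO sales_order VALUES (?, ?, ?)",
    [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0)],
)
conn.commit()

totals = conn.execute(
    "SELECT name, total FROM customer_total ORDER BY name"
).fetchall()

# The foreign key rejects an order that points at a customer who does not exist.
try:
    conn.execute("INSERT INTO sales_order VALUES (13, 99, 10.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(totals, fk_rejected)
```

The customer name is stored once in `customer`, and every order refers to it by key. That is the point of normalization: a name change touches one row instead of every order.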

--> Relational databases (Azure SQL)

Azure SQL Database
- Single database - resources are dedicated as configured
- Elastic pool - databases share resources
- Note:
  - security is handled by the server-level firewall
  - restores can also be done point-in-time
Azure SQL Managed Instance (PaaS) - for SQL Server workloads that still need features available on-premises, such as:
- managing your own backups in Azure Blob Storage
- linked servers, Service Broker, Database Mail
- SQL Server Database Engine logins and logins integrated with Azure Active Directory (AD), or simply a username and a password
Azure SQL VM (IaaS)
- "lift and shift" migration of an existing on-premises SQL Server
- Azure maintains only the VM image, for a single VM
Azure SQL Edge - for Internet of Things (IoT) scenarios with streaming time-series data
Azure Database for open-source engines
- MySQL / MariaDB
  - MySQL/MariaDB's own tables are off-limits here: you cannot go in and modify them
- PostgreSQL (three options: Single Server / Flexible Server / Hyperscale (Citus), which splits the database across nodes)

--> Non-relational databases

Non-relational database characteristics
- Scaling is where they differ from relational databases: relational databases scale vertically (add CPU / RAM), while non-relational databases scale horizontally (add nodes / shards)
- Stored data still has defined data types
- Aimed at applications that prioritize performance and availability over consistency
Non-relational types
- Key/value
- Document - e.g. MongoDB. Each document has a document key that identifies it uniquely, and a document here means JSON
- Column family databases - group related columns together. For example, a Product family with the columns Name / Price
- Graph
- External index data stores - hold a lookup index over files so they can be searched, then point to each file's path
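
Horizontal scaling, as described above, usually means spreading keys across nodes. Below is a minimal in-memory sketch of a sharded key/value store that holds JSON-like documents. The class name and the hash-modulo placement are assumptions for illustration, not how any particular Azure service places data.

```python
import hashlib

class ShardedKeyValueStore:
    """Toy key/value store that spreads keys over several in-memory nodes."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash of the key decides which node owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key, document):
        self.nodes[self._node_for(key)][key] = document

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)

store = ShardedKeyValueStore(node_count=3)
# Documents do not have to share the same fields.
store.put("product:1", {"name": "Pen", "price": 15})
store.put("product:2", {"name": "Notebook", "price": 40, "tags": ["paper"]})

print(store.get("product:1"))
print([len(node) for node in self.nodes] if False else [len(n) for n in store.nodes])
```

Adding capacity means adding nodes rather than a bigger machine. The simple modulo scheme used here would move most keys when the node count changes, which real systems avoid with more careful partitioning.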

--> Non-relational Azure Service

Azure Blob Storage - stores unstructured data as binary large objects, managed through containers
- Three blob types
  - Block blobs - 4.77 TB / store discrete, large, binary objects that change infrequently
  - Page blobs (used for VM disks - random read/write access)
  - Append blobs (streaming data such as logs)
- Access tiers: Hot / Cool / Archive
- Storage tiers (this is about cost) - four options: Premium / Transaction optimized / Hot / Cool
- AzCopy is a tool for copying blobs/files (optimized for transferring large files) between Azure Storage accounts and a local PC. You need to create a Shared Access Signature (SAS) before using it, and if a transfer fails it can resume from the point where it stopped.
Azure Data Lake Storage Gen2 (Gen2 = integrated into Azure Storage)
- An Azure Storage account with the hierarchical namespace enabled
- Built for big data. Supports structured, semi-structured, and unstructured files
Azure Files (file shares) - can be mapped as a network drive
- Server Message Block (SMB) - Windows / Linux / macOS
- Network File System (NFS) - Linux / macOS
Azure Tables - stores semi-structured data in key/value form
- key/value data items. Each item is represented by a row that contains columns for the data fields that need to be stored.
- Partition key - organizes data and improves scalability and performance
- Row key - unique to each row (used for point / range queries)
- Not really any different from Cosmos DB, haha
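
The partition key / row key idea from Azure Tables can be modelled in memory to show what point and range queries mean. This is a toy sketch, not the Azure SDK: `TinyTable` and its methods are hypothetical, and the device readings are made up.

```python
from bisect import bisect_left, bisect_right

class TinyTable:
    """Toy model of a table addressed by (partition key, row key)."""

    def __init__(self):
        self.partitions = {}  # partition key -> {row key: entity fields}

    def upsert(self, partition_key, row_key, **fields):
        self.partitions.setdefault(partition_key, {})[row_key] = fields

    def point_query(self, partition_key, row_key):
        """Fetch exactly one entity by its full key."""
        return self.partitions.get(partition_key, {}).get(row_key)

    def range_query(self, partition_key, start, end):
        """Fetch entities in one partition whose row key is in [start, end]."""
        rows = self.partitions.get(partition_key, {})
        keys = sorted(rows)
        lo, hi = bisect_left(keys, start), bisect_right(keys, end)
        return [(key, rows[key]) for key in keys[lo:hi]]

table = TinyTable()
table.upsert("device-A", "2024-01-01T10:00", temp=21.5)
table.upsert("device-A", "2024-01-01T11:00", temp=22.0, humidity=40)
table.upsert("device-A", "2024-01-02T09:00", temp=19.0)
table.upsert("device-B", "2024-01-01T10:00", temp=30.0)

one = table.point_query("device-A", "2024-01-01T11:00")
day_one = table.range_query("device-A", "2024-01-01T00:00", "2024-01-01T23:59")
print(one)
print(day_one)
```

Both queries touch a single partition, which is why choosing a partition key that matches how the data is read matters for performance.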

Azure Cosmos DB
- Supported data models
  - Document
  - Graphs
  - Key-Value Table
  - Column Family Store
- Azure Cosmos DB APIs
  - Core (SQL) API
  - MongoDB API
  - Table API
  - Cassandra API
  - Gremlin API (Graph)

Processing data

--> Modern data warehousing

Data ingestion and processing
Analytical data store - Data Lake + Data Warehouse
Analytical data model - cubes that pre-aggregate the data for "drill-up/drill-down" analysis
Data visualization

--> Data ingestion and processing

How do ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) differ?
- ELT handles larger data sets, and its strength is that the raw data is still kept, unlike ETL, where everything has already been transformed before loading
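
A small sketch of the difference, using SQLite through Python's `sqlite3` module as a stand-in for the target store. The sales rows, table names, and cleaning rules are made up for illustration. ETL cleans the data in application code before loading. ELT loads the raw rows first and transforms them inside the store.

```python
import sqlite3

# Hypothetical raw extract: messy product names, amounts as text.
raw_events = [
    ("2024-01-01", " Pen ", "15"),
    ("2024-01-01", "pen", "15"),
    ("2024-01-02", "Notebook", "40"),
]

def run_etl(conn, events):
    """Transform in application code, then load only the cleaned rows."""
    cleaned = [(day, product.strip().lower(), int(amount))
               for day, product, amount in events]
    conn.execute("CREATE TABLE sales_etl (day TEXT, product TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?, ?)", cleaned)

def run_elt(conn, events):
    """Load the raw rows as-is, then transform with SQL inside the store."""
    conn.execute("CREATE TABLE raw_sales (day TEXT, product TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", events)
    conn.execute("""
        CREATE TABLE sales_elt AS
        SELECT day,
               lower(trim(product)) AS product,
               CAST(amount AS INTEGER) AS amount
        FROM raw_sales
    """)

conn = sqlite3.connect(":memory:")
run_etl(conn, raw_events)
run_elt(conn, raw_events)

etl_rows = conn.execute("SELECT * FROM sales_etl ORDER BY day, product").fetchall()
elt_rows = conn.execute("SELECT * FROM sales_elt ORDER BY day, product").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
print(etl_rows == elt_rows, raw_count)
```

Both paths end with the same cleaned table, but only the ELT path still has `raw_sales` to go back to if the cleaning rules change later.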

Wrangling
- process by which you transform and map raw data into a more useful format for analysis
- It can involve writing code to capture, filter, clean, combine, and aggregate data from many sources.
Azure Data Factory - pipelines that data engineers use for extract, transform, and load (ETL)
- Pipeline - a group of activities, and the component that gets triggered to perform data ingestion
- Activities
- Datasets - the data structures within a data store
- Linked service - the connection parameters used to connect to other data sources
- Data flow
- Control flow - orchestrates the pipeline
- Integration runtime
Azure Synapse Analytics - uses the same pipelines as Azure Data Factory
So what are pipelines?
- They orchestrate (tie together) the activities that carry out ETL processes
- A pipeline allows the management of activities as a set (activities are grouped by the logic they share)
Reference: Describe data ingestion and processing - Learn | Microsoft Docs
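
The idea of a pipeline as a managed set of activities can be sketched in plain Python. This is a conceptual model only, not the Azure Data Factory SDK or its JSON format. `Activity`, `Pipeline`, and the three sample steps are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class Activity:
    name: str
    run: Callable[[Any], Any]  # takes the previous output, returns the next

@dataclass
class Pipeline:
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self, payload=None):
        """Run the activities in order, passing each output to the next."""
        log = []
        for activity in self.activities:
            payload = activity.run(payload)
            log.append(activity.name)
        return payload, log

pipeline = Pipeline("daily_sales", [
    Activity("extract", lambda _: ["10", "x", "25"]),
    Activity("transform", lambda rows: [int(r) for r in rows if r.isdigit()]),
    Activity("load", lambda rows: {"loaded": len(rows), "total": sum(rows)}),
])

result, log = pipeline.execute()
print(result, log)
```

Triggering, monitoring, and retrying happen at the pipeline level, which is what "managing activities as a set" buys you compared with running each step by hand.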

--> Analytical data store

Data warehouses (RDBMS), denormalized - contain structured information
- Goals
  - Real-time integration with different data sources
  - Optimized for read access (denormalized) - faster reports
- Types
  - Star schema - fact tables (numeric, pre-aggregated values) / dimension tables (the entities being measured)
  - Snowflake schema - star schema + dimension hierarchies
Data lakes
- Schema-on-read approach to define tabular schemas on semi-structured data files
- Hold raw business data
- Azure Data Lake Analytics - uses U-SQL to query the data
Hybrid approach - combine the two
There is also Delta Lake - transactional consistency, schema enforcement
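
The star schema mentioned above is easier to picture with a tiny example. The sketch below builds one fact table and two dimension tables in SQLite via Python's `sqlite3` module. All table names and figures are invented, and a real warehouse would hold far more rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions describe the entities you slice by.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds the numbers, keyed by the dimensions.
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER,
                          quantity INTEGER, revenue REAL);

INSERT INTO dim_product VALUES
    (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mouse', 'Electronics');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 10, 150.0), (2, 20240101, 2, 80.0),
    (3, 20240201, 1, 300.0),  (1, 20240201, 4, 60.0);
""")

# A typical report: revenue per category per month.
report = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

for row in report:
    print(row)
```

Every report query has the same shape: start from the fact table, join out to the dimensions, and aggregate. A snowflake schema would split `category` into its own table hanging off `dim_product`.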

--> Analytical data model

Data analytics types - I have an older blog post on this as well, so I'll link it here: Type of Analytics
- Descriptive analytics - helps answer questions about what has happened, based on historical data
- Diagnostic analytics - helps answer questions about why things happened / finds the cause from descriptive analytics
- Predictive analytics - helps answer questions about what will happen in the future
- Prescriptive analytics - helps answer questions about what actions to take to achieve a goal or target / recommendations
- Cognitive analytics - draws inferences from existing data and patterns

Azure Synapse Analytics - covers all four parts of modern data warehousing: a unified data analytics solution for data engineers. Its main components:
- Pipelines - the same technology as Azure Data Factory
- SQL - optimized for data warehouse workloads
  - Synapse SQL pool - ETL/ELT processes
    - Complex SQL (writing complex queries, e.g. grouping and summarizing)
    - Data ingestion (PolyBase - takes data from an external source such as a text file and transforms it into tabular form, i.e. a table)
- Apache Spark - an open-source, parallel-processing framework that supports in-memory processing to boost performance
- Azure Synapse Data Explorer - optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL)
- Use cases that call for Synapse
  - Complex queries and aggregation
  - Quickly processing large amounts of data - when you need both scale and speed
Azure Databricks - similar to Synapse Analytics, but a solution from Databricks
Azure HDInsight - Azure-hosted clusters that bundle Apache open-source technologies
- Apache Spark - An open-source, parallel-processing framework that supports in-memory processing to boost the performance
- Apache Hadoop - MapReduce jobs can be written in Java or abstracted by interfaces such as Apache Hive - a SQL-based API that runs on Hadoop.
- Apache HBase - large-scale NoSQL data storage and querying.
- Apache Kafka - a message broker for data stream processing.
- Apache Storm - real-time data processing through a topology of spouts and bolts.
Reference: Explore data analytics

--> Data visualization

The star of the show here is Power BI.
Visualization options - for the ways to present data, see Explore data visualization - Learn | Microsoft Docs. Here I only summarize the ones that confused me or that I didn't know.
- Matrix - a summary table. Honestly, it reminds me of a Crystal Reports cross-tab
- Key influencers - displays the major contributors to a selected result or value
An app is a collection of preset, ready-made visuals and reports that are shared with an entire organization.
Canvas vs tiles on a dashboard
- Canvas - a single page that tells a story with the help of visualizations / the design surface of the dashboard
- Tile - a snapshot of your data, pinned to a dashboard, i.e. the visualizations present on the Power BI dashboard / the data displayed (visualized) on the canvas
Paginated Report
- Optimized for print and shared (Pixel Perfect)
- Fit well on page
Reference: Get started building with Power BI - Learn | Microsoft Docs

Real-time Data analytics

Batch VS Stream processing
- Batch processing - collect the data, hold it, then process it in rounds
- Stream processing - like a flowing river: as soon as data arrives, it gets processed
Batch vs stream criteria
- Data scope: batch can process all the data in the dataset, while stream typically only sees the most recent data or a rolling time window
- Data size: batch can handle more
- Performance: batch has latency because data waits to be processed, while stream yields results a little at a time, but faster
- Analysis: batch processing is used to perform complex analytics
You do not have to pick one or the other. They can be combined: for example, receive event data as a stream, land it in a data lake, and process it in batch rounds.
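
The contrast between the two styles shows up even in a toy calculation. In this sketch, with made-up temperature readings, the batch function waits for the whole data set and returns one answer, while the streaming generator emits an updated answer after every reading.

```python
def batch_average(readings):
    """Batch: needs the complete data set before it can answer."""
    return sum(readings) / len(readings)

def stream_averages(readings):
    """Stream: keeps a small running state and answers after each event."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count

readings = [20.0, 22.0, 21.0, 25.0]

batch_result = batch_average(readings)
stream_results = list(stream_averages(iter(readings)))

print(batch_result)    # 22.0
print(stream_results)  # [20.0, 21.0, 21.0, 22.0]
```

The last streaming value matches the batch result, but the stream had usable numbers long before the data set was complete, and it never had to hold all the readings at once.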

--> Source for stream

Azure Event Hubs / Azure IoT Hub / Azure Data Lake Store Gen 2 / Apache Kafka

--> Sinks(Output) for stream

Azure Event Hubs: Used to queue the processed data for further downstream processing.
Azure Data Lake Store Gen 2 or Azure blob storage: Used to persist the processed results as a file.
Azure SQL Database or Azure Synapse Analytics, or Azure Databricks: Used to persist the processed results in a database table for querying and analysis.
Microsoft Power BI: real time data visualizations.

--> Common Azure elements of stream processing architecture

Azure Stream Analytics
- Ingest data (input), e.g. from an Azure event hub, Azure IoT Hub, or an Azure Storage blob container
- Pre-process: select, project, and aggregate data values
- Then send the results (output) to other services, such as Azure Data Lake Gen 2, Azure SQL Database, Azure Synapse Analytics, Azure Functions, an Azure event hub, or Microsoft Power BI
Spark Structured Streaming - available in Azure Synapse Analytics, Azure Databricks, and Azure HDInsight. It treats the data as a dataframe
Azure Data Explorer - high-performance querying of log and telemetry data (data with a timestamp attribute)
Azure Purview
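
The "select, project, and aggregate" step that a stream processor performs is often done over time windows. Below is a rough sketch in plain Python of a tumbling-window average per device. It is not Azure Stream Analytics query syntax, the events are invented, and it runs over a finite list only to keep the example short.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds):
    """Average value per (window start, device).

    events: iterable of (timestamp_seconds, device, value) tuples.
    Windows are fixed-size and do not overlap.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts, device, value in events:
        window_start = ts - (ts % window_seconds)
        bucket = sums[(window_start, device)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}

events = [
    (0, "a", 10.0), (30, "a", 20.0), (59, "b", 5.0),   # first minute
    (60, "a", 40.0), (119, "a", 60.0),                 # second minute
]

result = tumbling_window_avg(events, window_seconds=60)
print(result)
```

Each event belongs to exactly one window, so the per-window results can be written to a sink such as a database table or a dashboard as soon as the window closes.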

Other notes

Transparent data encryption (TDE) (protects data at rest) - encrypts the database. Available for both Azure SQL and Azure Synapse Analytics

The Azure services covered lean toward the managed, SaaS-style end. The exam may ask you to pick the option that needs the least management.

Recommended resources

Slides summarizing DP-900 from the DataTH x Microsoft event: เรียนพื้นฐาน Data ติวเตรียมสอบ Azure Data Fundamentals (DP-900), by Perth@DataTH.com
Intensive Data Fundamentals prep (on-demand): https://aka.ms/DATAT_OnDemandTH

Knowledge Check

Microsoft Azure Data Fundamentals: Explore core data concepts
- Explore core data concepts
- Explore data roles and services
Microsoft Azure Data Fundamentals: Explore relational data in Azure
- Explore fundamental relational data concepts
- Explore relational database services in Azure
Microsoft Azure Data Fundamentals: Explore non-relational data in Azure
- Explore Azure Storage for non-relational data
- Explore fundamentals of Azure Cosmos DB
Microsoft Azure Data Fundamentals: Explore data analytics in Azure
Bonus: I think this one is related as well - Explore concepts of data analytics

Reference

Exam DP-900: Microsoft Azure Data Fundamentals - Learn | Microsoft Docs
DP-900: Microsoft Azure Data Fundamentals Sample Questions | Microsoft Docs (sample exam questions from Microsoft)

Discover more from naiwaen@DebuggingSoft
