- Describe core data concepts (15–20%)
- Describe how to work with relational data on Azure (25–30%)
- Describe how to work with non-relational data on Azure (25–30%)
- Describe an analytics workload on Azure (25–30%)
Describe core data concepts (15–20%)
Practice questions based on these concepts
- Describe types of core data workloads
- Describe data analytics core concepts
- What is Data and why Data is a very important asset?
Data is a collection of facts such as numbers, descriptions, and observations used in decision making.In this competitive market, data is a valuable asset, and when analyzed properly can turn into a wealth of useful information and inform critical business decisions.
2. How many ways you can classify the data?
Structured
Semi-structured
Unstructured
3. What is Structured Data?
Structured data is typically tabular data that is represented by rows and columns in a database. Databases that hold tables in this form are called relational databases (the mathematical term relation refers to an organized set of data held as a table). Each row in a table has the same set of columns.
4. What is Semi-structured Data?
Semi-structured data is information that doesn't reside in a relational database but still has some structure to it. Examples include documents held in JavaScript Object Notation (JSON) format.There are other types of semi-structured data as well. Examples include key-value stores and graph databases.A key-value store is similar to a relational table, except that each row can have any number of columns.You can use a graph database to store and query information about complex relationships. A graph contains nodes (information about objects), and edges (information about the relationships between objects).
5. What is the Unstructured Data?
Not all data is structured or even semi-structured. For example, audio and video files, and binary data files might not have a specific structure. They're referred to as unstructured data.
6. Azure provides different types of storage services based on the type of data. Is this true?
TrueDepending on the type of data such as structured, semi-structured, or unstructured, data will be stored differently. Structured data is typically stored in a relational database such as SQL Server or Azure SQL Database.If you want to store unstructured data such as video or audio files, you can use Azure Blob storageIf you want to store semi-structured data such as documents, you can use a service such as Azure Cosmos DB.
7. What is called Provisioning?
The act of setting up the database server is called provisioning.
8. You can define several levels of access to your data in Azure. Is this true?
TrueRead-only access means the users can read data but can't modify any existing data or create new data.
Read/write access gives users the ability to view and modify existing data.
Owner privilege gives full access to the data including managing the security like adding new users and removing access to existing users.You can also define which users should be allowed to access the data in the first place.
9. What are the two kinds of Data processing solutions?
transactional system (OLTP)
analytical system (OLAP)
10. What is a transactional system?
A transactional system records transactions. A transaction could be financial, such as the movement of money between accounts in a banking system, or it might be part of a retail system, tracking payments for goods and services from customers. Think of a transaction as a small, discrete, unit of work.
11. What is an analytical system?
An analytical system is designed to support business users who need to query data and gain a big picture view of the information held in a database.Analytical systems are concerned with capturing raw data, and using it to generate insights. An organization can use these insights to make business decisions. For example, detailed insights for a manufacturing company might indicate trends enabling them to determine which product lines to focus on, for profitability.
12. What are the tasks that are involved in the analytical system?
Data Ingestion: Data ingestion is the process of capturing the raw data. This data could be taken from control devices measuring environmental information such as temperature and pressure, point-of-sale devices recording the items purchased by a customer in a supermarket, financial data recording the movement of money between bank accounts, and weather data from weather stations. Some of this data might come from a separate OLTP system. To process and analyze this data, you must first store the data in a repository of some sort. The repository could be a file store, a document database, or even a relational database.Data Transformation/Data Processing: The raw data might not be in a format that is suitable for querying. The data might contain anomalies that should be filtered out, or it may require transforming in some way. For example, dates or addresses might need to be converted into a standard format. After data is ingested into a data repository, you may want to do some cleaning operations and remove any questionable or invalid data, or perform some aggregations such as calculating profit, margin, and other Key Performance Metrics (KPIs). KPIs are how businesses are measured for growth and performance.Data Querying: After data is ingested and transformed, you can query the data to analyze it. You may be looking for trends, or attempting to determine the cause of problems in your systems. Many database management systems provide tools to enable you to perform ad-hoc queries against your data and generate regular reports.Data Visualization: Data represented in tables such as rows and columns, or as documents, aren’t always intuitive. Visualizing the data can often be useful as a tool for examining data. You can generate charts such as bar charts, line charts, plot results on geographical maps, pie charts, or illustrate how data changes over time. Microsoft offers visualization tools like Power BI to provide rich graphical representation of your data.
13. What is called normalization?
The Process of spliting into a large number of narrow, well-defined tables (a narrow table is a table with few columns), with references from one table to another, as shown in the image below. However, querying the data often requires reassembling information from multiple tables by joining the data back together at run-time.
14. You have a lot of customer data and you have decided to store this data in the relational database. What is the first thing you should do?
normalization
15. What are the drawbacks of normalization?
You split the information into tables. When you read this info you need to essemble this information at runtime by joins. These queries might be expensive sometimes.
16. Non-relational databases enable you to store data in a format that more closely matches the original structure. What is the disadvantage of this?
Some of the data is duplicated in the documentaed database. This duplication not only increases the storage required, but can also make maintenance more complex(you have to modify everywhere)
17. What are ACID principles?
Atomicity guarantees that each transaction is treated as a single unit, which either succeeds completely, or fails completely. If any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes.Consistency ensures that a transaction can only take the data in the database from one valid state to another. A consistent database should never lose or create data in a manner that can’t be accounted for. In the bank transfer example described earlier, if you add funds to an account, there must be a corresponding deduction of funds somewhere, or a record that describes where the funds have come from if they have been received externally. You can’t suddenly create (or lose) money.Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. A concurrent process can’t see the data in an inconsistent state (for example, the funds have been deducted from one account, but not yet credited to another.)Durability guarantees that once a transaction has been committed, it will remain committed even if there’s a system failure such as a power outage or crash.
18. A transactional database must adhere to the ACID properties to ensure that the database remains consistent while processing transactions. Is this true?
True
19. What is eventual consistency and why do we need it?
Many systems implement relational consistency and isolation by applying locks to data when it is updated. The lock prevents another process from reading the data until the lock is released.A distributed database is a database in which data is stored across different physical locations. It may be held in multiple computers located in the same physical location (for example, a datacenter), or may be dispersed over a network of interconnected computers.If you require transactional consistency in this scenario, locks may be retained for a very long time, especially if there's a network failure between databases at a critical point in time. To counter this problem, many distributed database management systems relax the strict isolation requirements of transactions and implement "eventual consistency." In this form of consistency, as an application writes data, each change is recorded by one server and then propagated to the other servers in the distributed database system asynchronously. While this strategy helps to minimize latency, it can lead to temporary inconsistencies in the data. Eventual consistency is ideal where the application doesn't require any ordering guarantees.
20. What is Data processing and how many kinds?
Data processing is simply the conversion of raw data to meaningful information through a process.Processing data as it arrives is called streaming.Buffering and processing the data in groups is called batch processing.
21. What are the advantages and disadvantages of batch processing?
Advantages:* Large volumes of data can be processed at a convenient time.
* It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.Disadvantages:* The time delay between ingesting the data and getting the results.
* All of a batch job's input data must be ready before a batch can be processed.Even minor data errors, such as typographical errors in dates, can prevent a batch job from running.
22. A real-estate website that tracks a subset of data from consumers’ mobile devices, and makes real-time property recommendations of properties to visit based on their geo-location. How do you process this data?
streaming
23. What are the other differences between streaming and batch processing of data?
Data Scope: Batch data can process all the data in the dataset. Stream processing typically only has access to the most recent data received, or within a rolling time window (the last 30 seconds, for example).Data Size: Batch data is suitable for handling large datasets efficiently. Stream processing is intended for individual records or micro batches consisting of few records.Performance: The latency for batch processing is typically a few hours. Stream processing typically occurs immediately, with latency in the order of seconds or milliseconds. Latency is the time taken for the data to be received and processed.Analysis: You typically use batch processing for performing complex analytics. Stream processing is used for simple response functions, aggregates, or calculations such as rolling averages.
24. How is data in a relational table organized?
Rows and Columns
25. What is an example of unstructured data?
Audio and Video files
26. What is an example of a streaming dataset?
Data from Twitter feeds
27. What are the roles in the world of data?
Azure Database Administrator role
An Azure database administrator is responsible for the design, implementation, maintenance, and operational aspects of on-premises and cloud-based database solutions built on Azure data services and SQL Server. They are responsible for the overall availability and consistent performance and optimizations of the database solutions. They work with stakeholders to implement policies, tools, and processes for backup and recovery plans to recover following a natural disaster or human-made error.Data Engineer role
A data engineer collaborates with stakeholders to design and implement data-related assets that include data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads. They use a wide range of data platform technologies, including relational and nonrelational databases, file stores, and data streams.Data Analyst role
A data analyst enables businesses to maximize the value of their data assets. They are responsible for designing and building scalable models, cleaning and transforming data, and enabling advanced analytics capabilities through reports and visualizations.A data analyst processes raw data into relevant insights based on identified business requirements to deliver relevant insights.
28. What are Database Administrator tasks and responsibilities?
https://docs.microsoft.com/en-us/learn/modules/explore-roles-responsibilities-world-of-data/3-review-tasks-tools-for-database-administration
29. What is Azure Data Studio?
Azure Data Studio provides a graphical user interface for managing many different database systems. It currently provides connections to on-premises SQL Server databases, Azure SQL Database, PostgreSQL, Azure SQL Data Warehouse, and SQL Server Big Data Clusters, amongst others. It's an extensible tool, and you can download and install extensions from third-party developers that connect to other systems, or provide wizards that help to automate many administrative tasks.https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15
30. What is SQL Server Management Studio?
SQL Server Management Studio provides a graphical interface, enabling you to query data, perform general database administration tasks, and generate scripts for automating database maintenance and support operations.
31. What are Data Engineer tasks and responsibilities?
https://docs.microsoft.com/en-us/learn/modules/explore-roles-responsibilities-world-of-data/4-review-tasks-tools-for-data-engineering
32. What are some of the common tools that Data engineer uses?
sqlcmd utility, Azure Databricks, and Azure HDInsight, etc
33. What are Data Analyst tasks and responsibilities?
https://docs.microsoft.com/en-us/learn/modules/explore-roles-responsibilities-world-of-data/5-review-tasks-tools-for-data-visualization-reporting
34. What are some of the common tools that Data Analyst uses?
Power BI
35. Name one of the following tasks is the role of a database administrator?
restroing and backup
36. What are the characteristics of relational data?
All data is tabular. Entities are modeled as tables, each instance of an entity is a row in the table, and each property is defined as a column.All rows in the same table have the same set of columns.A table can contain any number of rows.A primary key uniquely identifies each row in a table. No two rows can share the same primary key.A foreign key references rows in another, related table. For each value in the foreign key column, there should be a row with the same value in the corresponding primary key column in the other table.
37. What is the primary key and foreign key?
The primary key indicates the column (or combination of columns) that uniquely identify each row. Every table should have a primary key.The columns marked FK are Foreign Key columns. They reference, or link to, the primary key of another table, and are used to maintain the relationships between tables. A foreign key also helps to identify and prevent anomalies, such as orders for customers that don't exist in the Customers table.
38. How do you query the relational data?
Most relational databases support Structured Query Language (SQL). You use SQL to create tables, insert, update, and delete rows in tables, and to query data.
39. Give an example of SQL?
SELECT CustomerID, CustomerName, CustomerAddress
FROM Customers
40. Why do use JOINS in SQL queries?
You can combine the data from multiple tables in a query using a join operation. A join operation spans the relationships between tables, enabling you to retrieve the data from more than one table at a time. The following query retrieves the name of every customer, together with the product name and quantity for every order they've placed. Notice that each column is qualified with the table it belongs to:SELECT Customers.CustomerName, Orders.QuantityOrdered, Products.ProductName FROM Customers JOIN Orders ON Customers.CustomerID = Orders.CustomerID JOIN Products ON Orders.ProductID = Products.ProductID
41. What are the most common use cases of relational databases?
Examples of OLTP applications that use relational databases are banking solutions, online retail applications, flight reservation systems, and many online purchasing applications.
42. What is an index?
When you create an index in a database, you specify a column from the table, and the index contains a copy of this data in a sorted order, with pointers to the corresponding rows in the table. When the user runs a query that specifies this column in the WHERE clause, the database management system can use this index to fetch the data more quickly than if it had to scan through the entire table row by row.
43. Why creating indexes make inserts or updates or deletes slow?
An index might consume additional storage space, and each time you insert, update, or delete data in a table, the indexes for that table must be maintained. This additional work can slow down insert, update, and delete operations, and incur additional processing charges.
44. You have a table that is read frequently and rarely updates or inserts. How do you increase the performance of the queries?
Creating a index
45. What is a view?
A view is a virtual table based on the result set of a query. In the simplest case, you can think of a view as a window on specified rows in an underlying table.
46. You can query the view and filter the data in much the same way as a table. Is this true?
True
47. What is IaaS and when should you use it?
IaaS is an acronym for Infrastructure-as-a-Service. Azure enables you to create a virtual infrastructure in the cloud that mirrors the way an on-premises data center might work.you're still responsible for many of the day-to-day operations, such as installing and configuring the software, patching, taking backups, and restoring data when neededThe IaaS approach is best for migrations and applications requiring operating system-level access. SQL virtual machines are lift-and-shift. That is, you can copy your on-premises solution directly to a virtual machine in the cloud. The system should work more or less exactly as before in its new location, except for some small configuration changes (changes in network addresses, for example) to take account of the change in environment.
48. What is Paas and when should you use it?
PaaS stands for Platform-as-a-service. Rather than creating a virtual infrastructure, and installing and managing the database software yourself, a PaaS solution does this for you. You specify the resources that you require (based on how large you think your databases will be, the number of users, and the performance you require), and Azure automatically creates the necessary virtual machines, networks, and other devices for you.
49. What is the benefit of using a PaaS service, instead of an on-premises system, to run your database management systems?
Increased scalabilityPaaS solutions enable you to scale up and out without having to procure your own hardware.
50. What are the key characteristics of non-relational data?
A key aspect of non-relational databases is that they enable you to store data in a very flexible manner. Non-relational databases don't impose a schema on data. Instead, they focus on the data itself rather than how to structure it. This approach means that you can store information in a natural format, that mirrors the way in which you would consume, query and use it.
51. Non-relational systems such as Azure Cosmos DB (a non-relational database management system available in Azure), support indexing even when the structure of the indexed data can vary from record to record. Is this true?
True
52. What are the use cases of the non-relational databases?
IoT and telematics: These systems typically ingest large amounts of data in frequent bursts of activity. Non-relational databases can store this information very quickly. The data can then be used by analytics services such as Azure Machine Learning, Azure HDInsight, and Microsoft Power BI. Additionally, you can process the data in real-time using Azure Functions that are triggered as data arrives in the database.Retail and marketing: Microsoft uses CosmosDB for its own ecommerce platforms that run as part of Windows Store and XBox Live. It’s also used in the retail industry for storing catalog data and for event sourcing in order processing pipelines.Gaming: The database tier is a crucial component of gaming applications. Modern games perform graphical processing on mobile/console clients, but rely on the cloud to deliver customized and personalized content like in-game stats, social media integration, and high-score leaderboards. Games often require single-millisecond latencies for reads and write to provide an engaging in-game experience. A game database needs to be fast and be able to handle massive spikes in request rates during new game launches and feature updates.Web and mobile applications: A non-relational database such as Azure Cosmos DB is commonly used within web and mobile applications, and is well suited for modeling social interactions, integrating with third-party services, and for building rich personalized experiences. The Cosmos DB SDKs (software development kits) can be used build rich iOS and Android applications using the popular Xamarin framework.
53. What are the formats of semi-structured data?
JSON
A JSON document is enclosed in curly brackets ({ and }). Each field has a name (a label), followed by a colon, and then the value of the field. Fields can contain simple values, or subdocuments (each starting and ending with curly brackets). Fields can also have multiple values, held as arrays and surrounded with square brackets ([ and ]). Literals in a field are enclosed in quotes, and fields are separated with commas.Avro
Avro is a row-based format. It was created by Apache. Each record contains a header that describes the structure of the data in the record. This header is stored as JSON. The data is stored as binary information. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a very good format for compressing data and minimizing storage and network bandwidth requirements.ORC
ORC (Optimized Row Columnar format) organizes data into columns rather than rows. It was developed by HortonWorks for optimizing read and write operations in Apache Hive. Hive is a data warehouse system that supports fast data summarization and querying over very large datasets. Hive supports SQL-like queries over unstructured data. An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.Parquet
Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
54. What are the NoSQL databases?
NoSQL (non-relational) databases generally fall into four categories: key-value stores, document databases, column family databases, and graph databases.key-value store
A key-value store is the simplest (and often quickest) type of NoSQL database for inserting and querying data. Each data item in a key-value store has two elements, a key and a value. The key uniquely identifies the item, and the value holds the data for the item. The value is opaque to the database management system. Items are stored in key order.document database
A document database represents the opposite end of the NoSQL spectrum from a key-value store. In a document database, each document has a unique ID, but the fields in the documents are transparent to the database management system. Document databases typically store data in JSON format. they could be encoded using other formats such XML, YAML, JSON, BSONcolumn family database
A column family database organizes data into rows and columns. Examples of this structure include ORC and Parquet files
In its simplest form, a column family database can appear very similar to a relational database, at least conceptually. The real power of a column family database lies in its denormalized approach to structuring sparse data.graph database
Graph databases enable you to store entities, but the main focus is on the relationships that these entities have with each other. A graph database stores two types of information: nodes that you can think of as instances of entities, and edges, which specify the relationships between nodes. Nodes and edges can both have properties that provide information about that node or edge (like columns in a table). Additionally, edges can have a direction indicating the nature of the relationship.https://docs.microsoft.com/en-us/learn/modules/explore-concepts-of-non-relational-data/4-describe-types-nosql-databases
55. What are the characteristics of the Key-value store?
* A query specifies the keys to identify the items to be retrieved.
* You can't search on values. An application that retrieves data from a key-value store is responsible for parsing the contents of the values returned.
* The value is opaque to the database management system.
* Write operations are restricted to inserts and deletes.
* If you need to update an item, you must retrieve the item, modify it in memory (in the application), and then write it back to the database, overwriting the original (effectively a delete and an insert).
56. What is the use case for the Key-value store?
The focus of a key-value store is the ability to read and write data very quickly. Search capabilities are secondary. A key-value store is an excellent choice for data ingestion, when a large volume of data arrives as a continual stream and must be stored immediately.
57. You are building a system that monitors the temperature throughout a set of office blocks and sets the air conditioning in each room in each block to maintain a pleasant ambient temperature. Your system has to manage the air conditioning in several thousand buildings spread across the country or region, and each building typically contains at least 100 air-conditioned rooms. What type of NoSQL datastore is most appropriate for capturing the temperature data to enable it to be processed quickly?
A key-value store
58. What Is Data Wrangling?
Wrangling is the process by which you transform and map raw data into a more useful format for analysis. It can involve writing code to capture, filter, clean, combine, and aggregate data from many sources.
59. What are the two important stages of data analytics?
data ingestion, and data processing.Data Ingestion
Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. The data can arrive as a continuous stream, or it may come in batches, depending on the source. The purpose of the ingestion process is to capture this data and store it. This raw data can be held in a repository such as a database management system, a set of files, or some other type of fast, easily accessible storage.The ingestion process might also perform filtering and transformation at this stage.Data Processing
The data processing stage occurs after the data has been ingested and collected. Data processing takes the data in its raw form, cleans it, and converts it into a more meaningful format (tables, graphs, documents, and so on). The result is a database of data that you can use to perform queries and generate visualizations, giving it the form and context necessary to be interpreted by computers and used by employees throughout an organization.
60. What are ETL and ELT?
ETL stands for Extract, Transform, and Load. The raw data is retrieved and transformed before being saved. The extract, transform, and load steps can be performed as a continuous pipeline of operations. It is suitable for systems that only require simple models, with little dependency between items.ELT is an abbreviation of Extract, Load, and Transform. The process differs from ETL in that the data is stored before being transformed. The data processing engine can take an iterative approach, retrieving and processing the data from storage, before writing the transformed data and models back to storage. ELT is more suitable for constructing complex models that depend on multiple items in the database, often using periodic batch processing.
61. What is Reporting?
Reporting is the process of organizing data into informational summaries to monitor how different areas of an organization are performing. Reporting helps companies monitor their online business, and know when data falls outside of expected ranges. Good reporting should raise questions about the business from its end users. Reporting shows you what has happened, while analysis focuses on explaining why it happened and what you can do about it.
62. What is Business Intelligence?
The term Business Intelligence (BI) refers to technologies, applications, and practices for the collection, integration, analysis, and presentation of business information. The purpose of business intelligence is to support better decision making.
63. What is Data Visualization?
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to spot and understand trends, outliers, and patterns in data.
64. What are the most common forms of visualizations?
Bar and column charts: Bar and column charts enable you to see how a set of variables changes across different categories.Line charts: Line charts emphasize the overall shape of an entire series of values, usually over time.Matrix: A matrix visual is a tabular structure that summarizes data. Often, report designers include matrixes in reports and dashboards to allow users to select one or more element (rows, columns, cells) in the matrix to cross-highlight other visuals on a report page.Key influencers: A key influencer chart displays the major contributors to a selected result or value. Key influencers are a great choice to help you understand the factors that influence a key metric.Treemap: Treemaps are charts of colored rectangles, with size representing the relative value of each item. They can be hierarchical, with rectangles nested within the main rectangles.Scatter: A scatter chart shows the relationship between two numerical values. A bubble chart is a scatter chart that replaces data points with bubbles, with the bubble size representing an additional third data dimension.Filled map: If you have geographical data, you can use a filled map to display how a value differs in proportion across a geography or region.
65. What are the categories of data analytics?
Descriptive analytics
Descriptive analytics helps answer questions about what has happened, based on historical data. Descriptive analytics techniques summarize large datasets to describe outcomes to stakeholders.Diagnostic analytics
Diagnostic analytics helps answer questions about why things happened. Diagnostic analytics techniques supplement more basic descriptive analytics. They take the findings from descriptive analytics and dig deeper to find the cause.Predictive analytics
Predictive analytics helps answer questions about what will happen in the future. Predictive analytics techniques use historical data to identify trends and determine if they're likely to recur. Predictive analytical tools provide valuable insight into what may happen in the future.Prescriptive analytics
Prescriptive analytics helps answer questions about what actions should be taken to achieve a goal or target. By using insights from predictive analytics, data-driven decisions can be made. This technique allows businesses to make informed decisions in the face of uncertainty.Cognitive analytics
Cognitive analytics attempts to draw inferences from existing data and patterns, derive conclusions based on existing knowledge bases, and then add these findings back into the knowledge base for future inferences--a self-learning feedback loop. Cognitive analytics helps you to learn what might happen if circumstances change, and how you might handle these situations.