Saturday, 21 October 2023

Data Mining Question & Answer 2077 & 2078 Most-Important

 


Data Mining:- Data mining derives its name from "data" + "mining": just as mining is done in the ground to find valuable ore, data mining is done to find valuable information in a dataset. Data mining tools predict customer habits, patterns, and future trends, allowing businesses to increase revenue and make proactive decisions.

Data mining is a key part of data analytics overall and one of the core disciplines in data science, which uses advanced analytics techniques to find useful information in data sets. At a more granular level, data mining is a step in the knowledge discovery in databases (KDD) process, a data science methodology for gathering, processing and analyzing data. Data mining and KDD are sometimes referred to interchangeably, but they're more commonly seen as distinct things.

KEY TAKEAWAYS:-

·       Data mining is the process of analyzing a large batch of information to discern trends and patterns.

·       Data mining can be used by corporations for everything from learning about what customers are interested in or want to buy to fraud detection and spam filtering.

·       Social media companies use data mining techniques to commoditize their users in order to generate profit.

Data Mining: Advanced Concepts and Algorithms:- As the amount of research and industry data collected daily continues to grow, intelligent software tools are increasingly needed to process and filter the data, detect new patterns and similarities within it, and extract meaningful information from it. Data mining and predictive modeling offer a means of effective classification and analysis of large, complex, multi-dimensional data, leading to the discovery of functional models, trends, and patterns. Building upon basic data mining skills, the advanced topics cover data mining, data analysis, and pattern recognition concepts and algorithms, as well as models and machine learning algorithms.


 
Importance of Data Mining:-

·       It helps in automated decision making:- Data mining allows companies to analyze data regularly and automate both routine and critical decisions without the delay of human judgment. Using it, banks can detect fraudulent transactions, request verification, and secure customers' personal information against identity theft. When deployed within a firm's operational systems, these models can collect, analyze, and act on data independently, supporting decision making and improving the company's daily processes.

·       It helps in cost reduction:- Data mining enables businesses to use and allocate resources more efficiently. Companies can plan and make automated decisions with accurate forecasts and this will result in maximum cost reduction.

·       It helps in precise prediction and forecasting:- Data mining eases planning which is very crucial within every organization, and provides managers with reliable forecasts based on historical trends and present conditions.

·       Customer Insights:- From customer data, firms deploy data mining models to discover the key characteristics of, and differences between, their customers. Data mining helps to build personas and personalize each touchpoint to improve the overall customer experience.

·       Data mining has many benefits in various areas of business, and is advantageous for governments and individuals as well. It can predict future trends, reveal customer behavior, enable quick decision making, increase company revenue, and provide accurate predictions. It has the power to transform enterprises; if a business adopts the right data mining strategy, it can achieve high levels of customer service.

Characteristics of Data Mining:-

·       Large quantities of data. The volume of data is so great that it has to be analyzed by automated techniques, e.g., satellite information, credit card transactions, etc.

·       Noisy, incomplete data. Imprecise data is a characteristic of all real-world data collection.

·       Complex data structure.

·       Heterogeneous data stored in legacy systems.

Functionalities of Data Mining:- Data mining functions are used to define the trends or correlations found through data mining activities. In general, data mining activities can be divided into two categories:

1.     Descriptive Data Mining:- It describes what is happening within the data without any prior hypothesis, highlighting the common features of the data set. For example: count, average, etc.

            i.     Class/Concept Descriptions:- Data can be associated with classes or concepts. It is often helpful to describe individual classes and concepts in summarized, concise, and yet precise terms. These descriptions are referred to as class/concept descriptions.

·       Data Characterization:- This refers to the summary of the general characteristics or features of the class under study. For example, to study the characteristics of a software product whose sales increased by 15% two years ago, one can collect data related to such products by running SQL queries.

·       Data Discrimination:- It compares the common features of the class under study with those of one or more contrasting classes. The output of this process can be represented in many forms, e.g., bar charts, curves, and pie charts.

        ii.    Mining Frequent Patterns, Associations, and Correlations:-

Frequent Patterns:- Frequent patterns are nothing but things that are found to be most common in the data.

There are different kinds of frequency that can be observed in the dataset.

·       Frequent item set:- This refers to a set of items that are frequently seen together, e.g., milk and sugar.

·       Frequent Subsequence:- This refers to a sequence of events that often occurs regularly, such as purchasing a phone followed by a back cover.

·       Frequent Substructure:- It refers to the different kinds of data structures such as trees and graphs that may be combined with the item set or subsequence.

Association Analysis:- The process involves uncovering the relationships between data items and deriving association rules. It is a way of discovering the relationship between various items; for example, it can be used to determine which items are frequently purchased together.

Correlation Analysis:- Correlation is a mathematical technique that can show whether, and how strongly, pairs of attributes are related to each other. For example, taller people tend to have higher weight.
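As a rough illustration of these two ideas, the sketch below computes support and confidence for a made-up rule {milk} → {sugar} and a height/weight correlation with NumPy; all transactions and numbers are invented for illustration.

# Minimal sketch (illustrative data): support/confidence for {milk} -> {sugar},
# plus a height/weight correlation coefficient.
import numpy as np

transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "tea"},
    {"milk", "sugar", "tea"},
]

n = len(transactions)
support_milk = sum("milk" in t for t in transactions) / n
support_both = sum({"milk", "sugar"} <= t for t in transactions) / n
confidence = support_both / support_milk              # estimate of P(sugar | milk)
print(f"support = {support_both:.2f}, confidence(milk->sugar) = {confidence:.2f}")

# Correlation: taller people tending to weigh more shows up as a coefficient near +1.
height_cm = np.array([150, 160, 165, 172, 180, 188])
weight_kg = np.array([50, 56, 61, 68, 75, 84])
print("correlation:", np.corrcoef(height_cm, weight_kg)[0, 1])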

2.     Predictive Data Mining:- It predicts values for attributes that are absent or unlabeled, based on patterns learned from previous data. For example, judging from the findings of a patient's medical examinations whether he or she is suffering from a particular disease.

Decision Tree:- A decision tree is like a flow chart with a tree structure, in which every internal node represents a test on an attribute value, every branch represents an outcome of the test, and the leaves represent classes or class distributions.

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.

For example, a decision tree for the concept buy_computer indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.




The benefits of having a decision tree are as follows −

·       It does not require any domain knowledge.

·       It is easy to comprehend.

·       The learning and classification steps of a decision tree are simple and fast.
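As a rough sketch of how such a tree can be built in practice (assuming scikit-learn is available; the tiny buy_computer-style dataset below is made up for illustration, not taken from the notes):

# Hypothetical buy_computer-style data: age (0=youth,1=middle,2=senior),
# income (0=low,1=medium,2=high), student (0/1), credit_rating (0=fair,1=excellent).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0],
    [2, 0, 1, 0], [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0],
    [0, 0, 1, 0], [2, 1, 1, 0],
]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Classify a new customer: a young student with medium income and fair credit.
print(clf.predict([[0, 1, 1, 0]]))
# Print the learned tests (internal nodes) and class labels (leaves).
print(export_text(clf, feature_names=["age", "income", "student", "credit_rating"]))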

Neural Network:- A neural network, mainly used for classification, can be defined as a collection of processing units with connections between the units. In simpler words, a neural network searches for patterns or trends in large quantities of data, which allows organizations to better understand their clients' or users' needs; this in turn helps refine marketing strategies, increase sales, and lower costs.

The neural network in data mining is a classification method that takes the input, trains itself to recognize the pattern of input data and predicts the output for new input of a similar kind. Neural network forms the basis of deep learning, a subfield of machine learning that comes under artificial intelligence. Designing neural network algorithms is inspired by the structure of the human brain. Just as the human brain is responsible for intelligence and its discriminating power, the neural network also mimics the human brain and learns from its experience and applies this learning for classification and prediction.

For example, if a 20×20 pixel image is used as input, each of its 400 pixels is fed to the first layer of the neural network, i.e., inputs X1 to X400.
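As a hedged, runnable illustration of this idea (assuming scikit-learn is installed; it is not mentioned in the original notes), the sketch below trains a small neural network on scikit-learn's bundled 8×8 digit images, so each image contributes 64 pixel inputs rather than 400:

# Small sketch: a neural network classifier learning pixel patterns.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()                       # 8x8 grayscale digit images (64 pixels each)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X_train, y_train)                    # train on known examples
print("test accuracy:", net.score(X_test, y_test))   # predict on unseen input of a similar kind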

Application of Neural Network in Data Mining:-

The neural network in data mining performs tasks such as:

1.     Classification

2.     Clustering or categorization

3.     Prediction

4.     Function approximation

5.     Optimization

6.     Retrieval by content or control

Advantages of Data Mining:-

·       Marketing / Retail:- Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail, online marketing, etc. Using the results, marketers can take an appropriate approach to selling profitable products to targeted customers.

Data mining brings many benefits to retail companies in the same way as marketing. Through market basket analysis, a store can arrange its products so that customers can conveniently buy frequently purchased products together. In addition, it also helps retail companies offer discounts on particular products that will attract more customers.

·     Finance / Banking:- Data mining gives financial institutions information about loans and credit reporting. By building a model from historical customer data, banks and financial institutions can distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent credit card transactions to protect the credit card's owner.

·     Manufacturing:- By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers face the challenge that even when the manufacturing conditions at different wafer production plants are similar, the quality of the wafers varies and some, for unknown reasons, even have defects. Data mining has been applied to determine the ranges of control parameters that lead to the production of golden wafers. Those optimal control parameters are then used to manufacture wafers of the desired quality.

·     Governments:- Data mining helps government agencies by mining and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activity.

Disadvantages of data mining:-

Privacy Issues:- Concerns about personal privacy have been increasing enormously, especially as the internet booms with social networks, e-commerce, forums, blogs, etc. Because of privacy issues, people are afraid that their personal information will be collected and used in an unethical way, potentially causing them a lot of trouble. Businesses collect information about their customers in many ways to understand their purchasing behavior. However, businesses don't last forever; some day they may be acquired or disappear, and at that point the personal information they own may be sold to others or leaked.

Security issues:- Security is a big issue. Businesses hold information about their employees and customers, including social security numbers, birthdays, payroll, etc. However, how properly this information is protected is still in question. There have been many cases in which hackers accessed and stole large amounts of customer data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, credit card theft and identity theft have become big problems.

Misuse of information / inaccurate information:- Information collected through data mining for ethical purposes can be misused. It may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.

Applications of Data Mining:-

Scientific Analysis: Scientific simulations generate large volumes of data every day, and data mining techniques are capable of analyzing these data. We can now capture and store new data faster than we can analyze the data already accumulated. Examples of scientific analysis:

·       Sequence analysis in bioinformatics

·       Classification of astronomical objects

·       Medical decision support.

Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in intrusion detection, identifying network attacks and anomalies, and they help in selecting and refining useful and relevant information from large data sets. For example:

·       Detect security violations

·       Misuse Detection

·       Anomaly Detection

Business Transactions: Every business transaction is recorded for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. Data mining helps to analyze these business transactions, identify marketing approaches, and support decision-making. Example:

·       Direct mail targeting

·       Stock trading

·       Customer segmentation

·       Churn prediction (Churn prediction is one of the most popular Big Data use cases in business)

Market Basket Analysis: Market basket analysis is a technique based on the careful study of purchases made by a customer in a supermarket. It identifies the patterns of items frequently purchased together. This analysis helps companies promote deals, offers, and sales, and data mining techniques help achieve this task. Example:

·       Data mining concepts are in use for Sales and marketing to provide better customer service, to improve cross-selling opportunities, to increase direct mail response rates.

·       Customer Retention in the form of pattern identification and prediction of likely defections is possible by Data mining.

·       Risk Assessment and Fraud area also use the data-mining concept for identifying inappropriate or unusual behavior etc.

Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used both by learners and educators. Using EDM we can perform educational tasks such as:

·       Predicting student admission in higher education

·       Student profiling

·       Predicting student performance

·       Evaluating teachers' teaching performance

·       Curriculum development

·       Predicting student placement opportunities

Research: Data mining techniques can perform prediction, classification, clustering, association, and grouping of data in the research area. Rules generated by data mining are used to derive results. In most technical research in data mining, we create a training model and a testing model. The train/test strategy is used to measure the precision of the proposed model: we split the data set into two sets, a training data set and a testing data set. The training data set is used to build the model, whereas the testing data set is used to evaluate it. Example:

·       Classification of uncertain data.

·       Information-based clustering.

·       Decision support system

·       Web Mining

·       Domain-driven data mining

·       IoT (Internet of Things) and cybersecurity

·       Smart farming with IoT (Internet of Things)

Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and figure out which marketing activities will have the greatest effect in the upcoming months, whereas in the insurance sector, data mining can help to predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior.

·       Claims analysis i.e. which medical procedures are claimed together.

·       Identify successful medical therapies for different illnesses.

·       Characterizes patient behavior to predict office visits.

Transportation: A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. A large consumer goods organization can apply data mining to improve its sales process to retailers.

·       Determine the distribution schedules among outlets.

·       Analyze loading patterns.

Financial/Banking Sector: A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product.

·       Credit card fraud detection.

·       Identify ‘Loyal’ customers.

·       Extraction of information related to customers.

·       Determine credit card spending by customer groups.

KDD:- The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of discovering knowledge in data and emphasizes the high-level application of specific data mining techniques. It is a field of interest to researchers in various areas, including artificial intelligence, machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert systems, and data visualization. The main objective of the KDD process is to extract information from data in the context of large databases. It does this by using data mining algorithms to identify what is deemed knowledge. Knowledge Discovery in Databases is considered a programmed, exploratory analysis and modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful, and understandable patterns from huge and complex data sets. Data mining is the root of the KDD procedure, including the inference of algorithms that investigate the data, develop the model, and find previously unknown patterns. The model is used to extract knowledge from the data, analyze it, and make predictions. The availability and abundance of data today make knowledge discovery and data mining a matter of impressive significance and need. Given the recent development of the field, it is not surprising that a wide variety of techniques is presently accessible to specialists and experts.

Data mining is an essential step in the process of knowledge discovery. The KDD process consists of the following steps:

Data Cleaning − In this step, the noise and inconsistent data is removed.

Data Integration − In this step, multiple data sources are combined.

Data Selection − In this step, data relevant to the analysis task are retrieved from the database.

Data Transformation − In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

Data Mining − In this step, intelligent methods are applied in order to extract data patterns.

Pattern Evaluation − In this step, data patterns are evaluated.

Knowledge Presentation − In this step, knowledge is represented.
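A minimal, hedged sketch of these steps using pandas on made-up in-memory data (the column names and values are invented purely for illustration):

# Walk the KDD steps on two tiny invented "sources".
import pandas as pd

sales = pd.DataFrame({"cust_id": [1, 2, 2, 3, 3],
                      "amount": [120.0, 80.0, None, 200.0, 50.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "region": ["east", "west", "east"]})

cleaned = sales.dropna()                                   # Data Cleaning: drop noisy/missing rows
integrated = cleaned.merge(customers, on="cust_id")        # Data Integration: combine sources
selected = integrated[["region", "amount"]]                # Data Selection: keep relevant columns
selected = selected.assign(amount_k=selected["amount"] / 1000)  # Data Transformation: derive a field
pattern = selected.groupby("region")["amount_k"].mean()    # Data Mining: a simple aggregate pattern
print(pattern)                                             # Pattern Evaluation / Knowledge Presentation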

The following diagram shows the process of knowledge discovery −



CHALLENGES FOR KDD:-

Larger databases:- Databases with hundreds of fields and tables, millions of records, and multi-gigabyte size are quite commonplace, and terabyte (10^12 bytes) databases are beginning to appear.

High dimensionality:- Not only is there often a very large number of records in the database, but there can also be a very large number of fields (attributes, variables), so the dimensionality of the problem is high.

Over-fitting:- When the algorithm searches for the best parameters for one particular model using a limited set of data, it may over-fit the data, resulting in poor performance of the model on test data.

Changing data and knowledge:- Rapidly changing (non-stationary) data may make previously discovered patterns invalid.

Complex relationships between fields:- Hierarchically structured attributes or values, relations between attributes, and more sophisticated means for representing knowledge about the contents of a database will require algorithms that can effectively utilize such information.

Understandability of patterns:- In many applications it is important to make the discoveries more understandable by humans. Possible solutions include graphical representations, rule structuring with directed acyclic graphs, natural language generation, and techniques for visualization of data and knowledge.

User interaction and prior knowledge:- Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways; yet the use of domain knowledge is important in all of the steps of the KDD process.

Integration with other systems:- A stand-alone discovery system may not be very useful. Typical integration issues include integration with a DBMS (e.g., via a query interface), integration with spreadsheets and visualization tools, and accommodating real-time sensor readings.

Data warehousing:- A data warehouse is the secure electronic storage of information by a business or other organization. The goal of a data warehouse is to create a trove of historical data that can be retrieved and analyzed to provide useful insight into the organization's operations.

A data warehouse is a vital component of business intelligence. That wider term encompasses the information infrastructure that modern businesses use to track their past successes and failures and inform their decisions for the future.

KEY TAKEAWAYS:-

·       A data warehouse is the storage of information over time by a business or other organization.

·       New data is periodically added by people in various key departments such as marketing and sales.

·       The warehouse becomes a library of historical data that can be retrieved and analyzed in order to inform decision-making in the business.

·       The key factors in building an effective data warehouse include defining the information that is critical to the organization and identifying the sources of the information.

·       A database is designed to supply real-time information. A data warehouse is designed as an archive of historical information.

Characteristics of Data Warehousing:-

Subject-Oriented:- A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the organization's ongoing operations as a whole. This is done by excluding data that are not useful concerning the subject and including all data needed by the users to understand the subject.

Integrated:- A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. It requires performing data cleaning and integration during data warehousing to ensure consistency in naming conventions, attribute types, etc., among different data sources.

Time-Variant:- Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.

Non-Volatile:- The data warehouse is a physically separate data store, into which data is transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed there. It usually requires only two procedures in data accessing: initial loading of data and access to data. Therefore, the data warehouse does not require transaction processing, recovery, and concurrency capabilities, which allows for substantial speedup of data retrieval. Non-volatile means that once data has entered the warehouse, it should not change.

Goals of Data Warehousing:-

·       To help reporting as well as analysis

·       Maintain the organization's historical information

·       Be the foundation for decision making.

Advantage and Benefits of Data Warehouse:-

·       Understand business trends and make better forecasting decisions.

·       Data warehouses are designed to perform well with enormous amounts of data.

·       The structure of data warehouses is more accessible for end-users to navigate, understand, and query.

·       Queries that would be complex in many normalized databases could be easier to build and maintain in data warehouses.

·       Data warehousing is an efficient method to manage demand for lots of information from lots of users.

·       Data warehousing provides the capability to analyze a large amount of historical data.

Architecture of Data Warehousing:- Three common architectures are:

1.     Data Warehouse Architecture: Basic

2.     Data Warehouse Architecture: With Staging Area

3.     Data Warehouse Architecture: With Staging Area and Data Marts

1.     Basic:-

Operational System:- An operational system is a method used in data warehousing to refer to a system that is used to process the day-to-day transactions of an organization.

Flat Files:- A Flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.

Meta Data:-A set of data that defines and gives information about other data.

2.     With Staging Area:- We must clean and process operational data before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation of operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.


3.     Staging Area and Data Marts:-We may want to customize our warehouse's architecture for multiple groups within our organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation of the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales or mine historical information to make predictions about customer behavior.


Types of Data Warehouse Architectures:-

1.     Single-Tier Architecture:- Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.

In this architecture, the only layer physically available is the source layer. The data warehouse is virtual: it is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are issued against operational data after the middleware interprets them. In this way, queries affect transactional workloads.

2.   Two-Tier Architecture:- The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system, as shown in fig:


Although it is typically called two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:

·       Source layer: A data warehouse system uses heterogeneous sources of data. That data is initially stored in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.

·       Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-named Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse.

·       Data Warehouse layer: Information is stored in one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.

·       Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and user-friendly GUIs.

3.     Three-Tier Architecture:- The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data model for a whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also directly used to accomplish better some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications or generating data flows to feed external processes periodically to benefit from cleaning and integration.

This architecture is especially useful for the extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used through the extra redundant reconciled layer. It also makes the analytical tools a little further away from being real-time.



Data warehouse construction process:- A data warehouse is a heterogeneous collection of different data sources organized under a unified schema. Builders should take a broad view of the anticipated use of the warehouse while constructing it, since during the design phase there is no way to anticipate all possible queries or analyses. Some characteristics of a data warehouse are:

·       Subject oriented

·       Integrated

·       Time Variant

·       Non-volatile

Building a Data Warehouse –Some steps that are needed for building any data warehouse are as following below:

1.     To extract the data (transactional) from different data sources:- To build a data warehouse, data is extracted from various data sources and stored in a central staging area. Microsoft provides a tool for this purpose, which is available free of cost when you purchase Microsoft SQL Server.

2.     To transform the transactional data:- Companies store their data in various DBMSs, such as MS Access, MS SQL Server, Oracle, Sybase, etc. They also save data in spreadsheets, flat files, mail systems, and so on. Relating the data from all these sources is done while building the data warehouse.

3.     To load the data (transformed) into the dimensional database:- After building a dimensional model, the data is loaded into the dimensional database. This process may combine several columns together or split one field into several columns. Transformation of the data can be performed at two stages: while loading the data into the dimensional model, or while extracting it from its origins.

4.     To purchase a front-end reporting tool:- Top-notch analytical tools are available in the market from several major vendors. Microsoft also offers its own cost-effective tool, Data Analyzer.

ETL:- ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract data from various sources, transform it into a format suitable for loading into a data warehouse, and then load it into the warehouse. The process of ETL can be broken down into the following three stages:-

1.     Extract: The first stage in the ETL process is to extract data from various sources such as transactional systems, spreadsheets, and flat files. This step involves reading data from the source systems and storing it in a staging area.

2.     Transform: In this stage, the extracted data is transformed into a format that is suitable for loading into the data warehouse. This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields.

3.     Load: After the data is transformed, it is loaded into the data warehouse. This step involves creating the physical data structures and loading the data into the warehouse.

The ETL process is an iterative process that is repeated as new data is added to the warehouse. The process is important because it ensures that the data in the data warehouse is accurate, complete, and up-to-date. It also helps to ensure that the data is in the format required for data mining and reporting.

Additionally, there are many different ETL tools and technologies available, such as Informatica, Talend, DataStage, and others, that can automate and simplify the ETL process.

ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and then finally, loads it into the Data Warehouse system.


Extraction:- The first step of the ETL process is extraction. In this step, data from various source systems, which can be in various formats such as relational databases, NoSQL, XML, and flat files, is extracted into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, and not directly in the data warehouse, because the extracted data is in various formats and can also be corrupted. Loading it directly into the data warehouse may therefore damage it, and rollback will be much more difficult. This is one of the most important steps of the ETL process.

Transformation: - The second step of the ETL process is transformation. In this step, a set of rules or functions are applied on the extracted data to convert it into a single standard format. It may involve following processes/tasks:

Filtering – loading only certain attributes into the data warehouse.

Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States, and America into USA, etc.

Joining – joining multiple attributes into one.

Splitting – splitting a single attribute into multiple attributes.

Sorting – sorting tuples on the basis of some attribute (generally key-attribute).

Loading: - The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is updated by loading into the data warehouse very frequently and sometimes it is done after longer but regular intervals. The rate and period of loading solely depends on the requirements and varies from system to system.
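A hedged end-to-end sketch of extract, transform, and load using pandas and the standard-library sqlite3 module; every table and column name below is invented for illustration rather than taken from the notes.

# Extract: in practice this reads from source systems; here we fake two small sources.
import sqlite3
import pandas as pd

crm = pd.DataFrame({"customer": ["Asha", "Bikash"], "country": ["U.S.A", "Nepal"]})
orders = pd.DataFrame({"customer": ["Asha", "Bikash"], "amount": [250, None]})

# Transform: cleaning (fill NULLs), mapping variant spellings to one value, joining sources.
orders["amount"] = orders["amount"].fillna(0)
crm["country"] = crm["country"].replace({"U.S.A": "USA", "United States": "USA", "America": "USA"})
staged = crm.merge(orders, on="customer")

# Load: write the transformed data into the warehouse (an SQLite file stands in for it here).
conn = sqlite3.connect("warehouse.db")
staged.to_sql("fact_orders", conn, if_exists="replace", index=False)
conn.close()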

Schema:- A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses a relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema. Here we discuss the schemas used in a data warehouse. A multidimensional schema is defined using the Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.

1.     Star Schema:- A star schema is a database organizational structure, optimized for use in a data warehouse or business intelligence, that uses a single large fact table to store transactional or measured data, and one or more smaller dimension tables that store attributes about the data. It is called a star schema because the fact table sits at the center of the logical diagram, and the small dimension tables branch off to form the points of the star.

A fact table sits at the center of a star schema database, and each star schema database only has a single fact table. The fact table contains the specific measurable (or quantifiable) primary data to be analyzed, such as sales records, logged performance data or financial data. It may be transactional -- in that rows are added as events happen -- or it may be a snapshot of historical data up to a point in time.
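A minimal sketch of such a layout in SQLite (via Python's sqlite3 module); the dimension and fact table names below are illustrative, not taken from the notes.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time  (time_key  INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item  (item_key  INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- The single central fact table holds the measures plus foreign keys
-- to every dimension table (the points of the star).
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")
conn.close()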

Characteristics of Star Schema:-

The star schema is well suited to data warehouse database design because of the following features:

·       It creates a de-normalized database that can quickly provide query responses.

·       It provides a flexible design that can be changed easily or added to throughout the development cycle, and as the database grows.

·       Its design parallels how end-users typically think of and use the data.

·       It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema:-

·       Simpler Queries – The join logic of a star schema is quite simple compared with the join logic needed to fetch data from a highly normalized transactional schema.

·       Simplified Business Reporting Logic – Compared with a highly normalized transactional schema, the star schema simplifies common business reporting logic, such as as-of reporting and period-over-period comparisons.

·       Feeding Cubes – Star schema is widely used by all OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems deliver a ROLAP mode of operation which can use a star schema as a source without designing a cube structure.

Disadvantages of Star Schema:-

·       Data integrity is not enforced well because of its highly de-normalized state.

·       It is not as flexible in terms of analytical needs as a normalized data model.

·       Star schemas don't easily support many-to-many relationships between business entities.

2.     Fact Constellation Schema (Galaxy Schema):- A fact constellation has multiple fact tables. It is also known as a galaxy schema. For example, consider two fact tables, namely sales and shipping.


The sales fact table is the same as that in the star schema. The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location, and contains two measures, namely dollars sold and units sold. It is also possible to share dimension tables between fact tables; for example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Advantage: Provides a flexible schema.

Disadvantage: It is much more complex and hence, hard to implement and maintain.

3.     Snowflake Schema:- Some dimension tables in the snowflake schema are normalized; the normalization splits up the data into additional tables. Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely item and supplier.



Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier-key. The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
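A small sketch of this normalization in SQLite: the supplier attributes are moved out of the item dimension into their own table, following the split described above (the DDL itself is illustrative, not from the notes).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE dim_item (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES dim_supplier(supplier_key)  -- normalized (snowflaked) link
);
""")
conn.close()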

Characteristics of snowflake schema:-

·       The snowflake schema uses small disk space.

·       It is easy to implement new dimensions that are added to the schema.

·       There are multiple tables, so performance is reduced.

·       The dimension table consists of two or more sets of attributes that define information at different grains.

·       The sets of attributes of the same dimension table are being populated by different source systems.

Advantages: -

·       It provides structured data which reduces the problem of data integrity.

·       It uses small disk space because data are highly structured.

Disadvantages: -

·       Snowflaking reduces space consumed by dimension tables but compared with the entire data warehouse the saving is usually insignificant.

·       Avoid snowflaking or normalization of a dimension table, unless required and appropriate.

·       Do not snowflake hierarchies of one dimension table into separate tables. Hierarchies should belong to the dimension table only and should never be snowflaked.

·       Multiple hierarchies that can belong to the same dimension have been designed at the lowest possible detail.

Fact Table:- A fact table aggregates metrics, measurements, or facts about business processes. In a star or snowflake schema, fact tables are connected to dimension tables to form the schema architecture, representing how data relates within the data warehouse. Fact tables store the primary keys of dimension tables as foreign keys within the fact table.


Dimension Table:- Dimension tables are de-normalized tables used to store data attributes or dimensions. As mentioned above, the primary key of a dimension table is stored as a foreign key in the fact table. Dimension tables are not joined to one another directly; instead, they are related via the central fact table.
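As a rough illustration of dimensions relating only through the fact table, here is a small pandas sketch with made-up rows and column names:

import pandas as pd

dim_item   = pd.DataFrame({"item_key": [1, 2], "item_name": ["laptop", "phone"]})
dim_store  = pd.DataFrame({"store_key": [10, 20], "city": ["Kathmandu", "Pokhara"]})
fact_sales = pd.DataFrame({"item_key": [1, 2, 1],
                           "store_key": [10, 10, 20],
                           "units_sold": [3, 5, 2]})

# Dimensions are never joined to each other directly; both join to the fact table,
# which carries their primary keys as foreign keys.
report = (fact_sales.merge(dim_item, on="item_key")
                    .merge(dim_store, on="store_key")
                    .groupby(["city", "item_name"])["units_sold"].sum())
print(report)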


OLTP:- It is an operational system that supports transaction-oriented applications in a 3-tier architecture. It administers the day to day transaction of an organization. OLTP is basically focused on query processing, maintaining data integrity in multi-access environments as well as effectiveness that is measured by the total number of transactions per second. The full form of OLTP is Online Transaction Processing.

Characteristics of OLTP:- Following are important characteristics of OLTP:-

·       OLTP uses transactions that include small amounts of data.
·       Indexed data in the database can be accessed easily.
·       OLTP has a large number of users.
·       It has fast response times
·       Databases are directly accessible to end-users
·       OLTP uses a fully normalized schema for database consistency.
·       The response time of OLTP system is short.
·       It strictly performs only the predefined operations on a small number of records.
·       OLTP stores the records of the last few days or a week.
·       It supports complex data models and tables.
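As a rough illustration of the kind of short, atomic transaction an OLTP system runs all day (SQLite with a hypothetical accounts table; this example is not part of the original notes):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 100.0)])
conn.commit()

try:
    # Transfer 50 from account 1 to account 2 as one unit of work.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    conn.commit()
except sqlite3.Error:
    conn.rollback()          # on failure, no partial update is left behind

print(conn.execute("SELECT * FROM accounts").fetchall())
conn.close()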

Architecture of OLTP:- Here is the architecture of OLTP:


1.     Business / Enterprise Strategy: Enterprise strategy deals with the issues that affect the organization as a whole. In OLTP, it is typically developed at a high level within the firm, by the board of directors or the top management.

2.     Business Process: An OLTP business process is a set of activities and tasks that, once completed, will accomplish an organizational goal.

3.     Customers, Orders, and Products: OLTP databases store information about products, orders (transactions), customers (buyers), suppliers (sellers), and employees.

4.     ETL Processes: ETL extracts the data from various RDBMS source systems, then transforms it (applying concatenations, calculations, etc.) and loads the processed data into the data warehouse system.

5.     Data Mart and Data Warehouse: A data mart is a structure/access pattern specific to data warehouse environments. It is used by OLAP to store processed data.

6.     Data Mining, Analytics, and Decision Making: Data stored in the data mart and data warehouse can be used for data mining, analytics, and decision making. This data helps you to discover data patterns, analyze raw data, and make analytical decisions for your organization's growth.

Advantages of OLTP:- Following are the pros/benefits of an OLTP system:

·       OLTP offers accurate forecast for revenue and expense.
·       It provides a solid foundation for a stable business /organization due to timely modification of all transactions.
·       OLTP makes transactions much easier on behalf of the customers.
·       It broadens the client base for an organization by speeding up and simplifying individual processes.
·       OLTP provides support for bigger databases.
·       Partition of data for data manipulation is easy.
·       OLTP is needed for tasks that are frequently performed by the system.
·       It is used when only a small number of records need to be accessed at a time.
·       It suits tasks that involve insertion, updating, or deletion of data.
·       It is used when you need consistency and concurrency in order to perform tasks that ensure its greater availability.

Disadvantages of OLTP:- Here are cons/drawbacks of OLTP system:

·       If the OLTP system faces hardware failures, then online transactions get severely affected.
·       OLTP systems allow multiple users to access and change the same data at the same time, which many times created an unprecedented situation.
·       If the server hangs for seconds, it can affect a large number of transactions.
·       OLTP required a lot of staff working in groups in order to maintain inventory.
·       Online Transaction Processing Systems do not have proper methods of transferring products to buyers by themselves.
·       OLTP makes the database much more susceptible to hackers and intruders.
·       In B2B transactions, there is a chance that both buyers and suppliers miss out on the efficiency advantages that the system offers.
·       Server failure may lead to wiping out large amounts of data from the database.
·       You can perform a limited number of queries and updates.

OLAP:- Online Analytical Processing (OLAP) is a category of software that allows users to analyze information from multiple database systems at the same time. It is a technology that enables analysts to extract and view business data from different points of view. Analysts frequently need to group, aggregate and join data. These OLAP operations in data mining are resource intensive. With OLAP data can be pre-calculated and pre-aggregated, making analysis faster. OLAP databases are divided into one or more cubes. The cubes are designed in such a way that creating and viewing reports become easy. OLAP stands for Online Analytical Processing.

The main characteristics of OLAP are as follows:-

1.     Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of the data in the data warehouse. It helps in carrying out slice and dice operations.
2.     Multi-User Support: Since OLAP systems are shared, they should provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
3.     Accessibility: OLAP acts as a mediator between data warehouses and front-end. The OLAP operations should be sitting between data sources (e.g., data warehouses) and an OLAP front-end.
4.     Storing OLAP results: OLAP results are kept separate from data sources.
5.     Uniform reporting performance: Increasing the number of dimensions or the database size should not significantly degrade the reporting performance of the OLAP system.
6.     OLAP provides for distinguishing between zero values and missing values so that aggregates are computed correctly.
7.     OLAP system should ignore all missing values and compute correct aggregate values.
8.     OLAP facilitates interactive querying and complex analysis for users.
9.     OLAP allows users to drill down for greater detail or roll up for aggregations of metrics along a single business dimension or across multiple dimensions.
10.  OLAP provides the ability to perform intricate calculations and comparisons.
11.  OLAP presents results in a number of meaningful ways, including charts and graphs.

Benefits of OLAP:-

1.     OLAP helps managers in decision-making through the multidimensional record views that it efficiently provides, thus increasing their productivity.
2.     OLAP functions are self-sufficient owing to the inherent flexibility they provide over the organized databases.
3.     It facilitates simulation of business models and problems, through extensive management of analysis-capabilities.
4.     In conjunction with data warehouse, OLAP can be used to support a reduction in the application backlog, faster data retrieval, and reduction in query drag.

Drill up:- This operation comes as one half of the drill-up/drill-down pair in OLAP. Drill-up is an operation that aggregates data on the cube, either by ascending a concept hierarchy for a dimension or by dimension reduction, in order to obtain measures at a less detailed granularity. To see a broader perspective in compliance with the concept hierarchy, a user groups columns and unites the values. As there are fewer specifics, one or more dimensions are removed from the data cube when this operation is run. In some sources, drill-up and roll-up are treated as synonyms, so this variant is also possible.


Drill down:- OLAP drill-down is the operation opposite to drill-up. It is carried out either by descending a concept hierarchy for a dimension or by adding a new dimension. It lets a user obtain highly detailed data from a less detailed cube. Consequently, when the operation is run, one or more dimensions must be appended to the data cube to provide more information elements.


Slice operations:- The next pair to discuss is the slice and dice operations in OLAP. The slice operation selects one particular dimension from a given cube and produces a new sub-cube, which provides information from another point of view. The use of slice implies fixing the selected dimension at a specified value or granularity level.


Dice:- This operation is similar to slice. The difference is that in dice you select two or more dimensions, which results in the creation of a sub-cube.
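A rough pandas sketch of these four operations on a tiny made-up "cube", where city, quarter, and item are the dimensions and sales is the measure (all values are invented for illustration):

import pandas as pd

cube = pd.DataFrame({
    "city":    ["Kathmandu", "Kathmandu", "Pokhara", "Pokhara"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["phone", "phone", "laptop", "phone"],
    "sales":   [100, 120, 80, 90],
})

# Roll-up / drill-up: remove the quarter dimension (less detailed granularity).
print(cube.groupby(["city", "item"])["sales"].sum())

# Drill-down: return to the more detailed city x quarter x item view.
print(cube.groupby(["city", "quarter", "item"])["sales"].sum())

# Slice: fix one dimension at a single value (quarter = Q1).
print(cube[cube["quarter"] == "Q1"])

# Dice: restrict two or more dimensions to chosen values, forming a sub-cube.
print(cube[(cube["quarter"].isin(["Q1", "Q2"])) & (cube["city"] == "Kathmandu")])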

                                                                                                      
