Questions 2077, 2078, Important
Data Mining:-
Data mining derives its name from Data + Mining: just as mining is done in the ground to find valuable ore, data mining is done to find valuable information in a dataset. Data mining tools predict customer habits, patterns, and future trends, allowing businesses to increase company revenues and make proactive decisions.
Data mining is a key part of data analytics overall and one of the core disciplines in data science, which uses advanced analytics techniques to find useful information in data sets. At a more granular level, data mining is a step in the knowledge discovery in databases (KDD) process, a data science methodology for gathering, processing and analyzing data. Data mining and KDD are sometimes referred to interchangeably, but they're more commonly seen as distinct things.
KEY TAKEAWAYS:-
· Data mining is the process of analyzing a large batch of information to discern trends and patterns.
· Data mining can be used by corporations for everything from learning what customers are interested in or want to buy to fraud detection and spam filtering.
· Social media companies use data mining techniques to commoditize their users in order to generate profit.
Data Mining: Advanced Concepts and Algorithms:- As the amount of research and industry data being collected daily continues to grow, intelligent software tools are increasingly needed to process and filter the data, detect new patterns and similarities within it, and extract meaningful information from it. Data mining and predictive modeling offer a means of effective classification and analysis of large, complex, multi-dimensional data, leading to discovery of functional models, trends and patterns. Building upon the skills learned in previous courses, this course covers advanced data mining, data analysis, and pattern recognition concepts and algorithms, as well as models and machine learning algorithms.
Importance of Data Mining:-
· It helps in automated decision making:- Data mining allows companies to regularly analyze data and automate both routine and critical decisions without the delay of human judgment. Using it, banks can detect fraudulent transactions, request verification, and secure customers' personal information against identity theft. When deployed within a firm's operational algorithms, these models can collect, analyze, and act on data independently, helping in decision making and improving the daily processes of the company.
· It helps in cost reduction:- Data mining enables businesses to use and allocate resources more efficiently. Companies can plan and make automated decisions based on accurate forecasts, resulting in maximum cost reduction.
· It helps in precise prediction and forecasting:- Data mining eases planning, which is crucial within every organization, and provides managers with reliable forecasts based on historical trends and present conditions.
· Customer insights:- From customer data, firms deploy data mining models to discover the key characteristics of, and differences between, their customers. Data mining helps build customer personas and personalize each touchpoint to improve the overall customer experience.
· Data mining has many benefits in various areas of business, and is advantageous for governments and individuals as well. It can predict future trends, reveal customer behavior, enable quick decision making, increase company revenue, and provide accurate predictions. It has the power to transform enterprises: a business that adopts the right data mining strategy can achieve high levels of customer service.
Characteristics of Data Mining:-
· Large quantities of data. The volume of data is so great that it has to be analyzed by automated techniques, e.g., satellite information, credit card transactions, etc.
· Noisy, incomplete data. Imprecise data is a characteristic of all real-world data collection.
· Complex data structure.
· Heterogeneous data stored in legacy systems.
Functionalities of Data Mining:- Data mining functions are used to define the trends or correlations contained in data mining activities. Broadly, data mining activities can be divided into 2 categories:
1. Descriptive Data Mining:- It seeks to understand what is happening within the data without any prior idea, highlighting the common features of the data set. For example: count, average, etc.
i. Class/Concept Descriptions:- Data can be associated with classes or concepts, and it is often helpful to define individual groups and concepts in simplified, descriptive, and yet accurate ways. These class or concept definitions are referred to as class/concept descriptions.
· Data Characterization:- This refers to the summary of the general characteristics or features of the class under study. For example, to study the characteristics of a software product whose sales increased by 15% two years ago, anyone can collect such data about related products by running SQL queries.
· Data Discrimination:- It compares the common features of the class under study against those of contrasting classes. The output of this process can be represented in many forms, e.g., bar charts, curves, and pie charts.
ii. Mining Frequent Patterns, Associations, and Correlations:-
Frequent Patterns:- Frequent patterns are simply the things that are found most often in the data. There are different kinds of frequency that can be observed in a dataset:
· Frequent item set:- This refers to a set of items that are regularly seen together, e.g., milk and sugar.
· Frequent subsequence:- This refers to a pattern series that occurs regularly, such as purchasing a phone followed by a back cover.
· Frequent substructure:- This refers to the different kinds of data structures, such as trees and graphs, that may be combined with an item set or subsequence.
Association Analysis:- This process involves uncovering relationships in the data and deciding the rules of association. It is a way of discovering relationships between various items; for example, it can be used to determine which items are frequently purchased together.
Correlation Analysis:- Correlation is a mathematical technique that can show whether and how strongly pairs of attributes are related to each other. For example, taller people tend to weigh more.
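As a minimal sketch of these ideas (the transactions and measurements below are invented for illustration; statistics.correlation needs Python 3.10+), the following snippet counts frequent item pairs, derives a simple association rule with its support and confidence, and computes the height/weight correlation mentioned above:

```python
from itertools import combinations
from collections import Counter
from statistics import correlation  # Python 3.10+

# Hypothetical market-basket transactions (illustrative data only)
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"phone", "back cover"},
    {"milk", "bread"},
    {"milk", "sugar", "tea"},
]

# Count how often each pair of items appears together (frequent 2-item sets)
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for (a, b), count in pair_counts.most_common(3):
    support = count / n                                      # P(a and b)
    confidence = count / sum(a in t for t in transactions)   # P(b given a)
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")

# Correlation analysis: taller people tend to weigh more (made-up sample)
heights = [150, 160, 170, 180, 190]
weights = [52, 60, 68, 77, 85]
print("height/weight correlation:", round(correlation(heights, weights), 3))
```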
2. Predictive Data Mining:- It predicts the values of attributes that are absent or unlabeled, based on previous data. For example, judging from the findings of a patient's medical examination whether he is suffering from any particular disease.
Decision Tree:- A decision tree is like a flow chart with a tree structure, in which every junction/node represents a test on an attribute value, each branch represents an outcome of that test, and the tree leaves represent the classes or the distribution of classes.
A
decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch denotes
the outcome of a test, and each leaf node holds a class label. The topmost node
in the tree is the root node.
For example, a decision tree for the concept buy_computer indicates whether a customer at a company is likely to buy a computer or not; each internal node represents a test on an attribute, and each leaf node represents a class.
The benefits of having a decision tree are as follows −
· It does not require any domain knowledge.
· It is easy to comprehend.
· The learning and classification steps of a decision tree are simple and fast.
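As a hedged sketch of the buy_computer example (using scikit-learn with a tiny made-up dataset; the attribute encodings are assumptions for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical buy_computer data:
# [age (0=youth, 1=middle-aged, 2=senior), income (0=low, 1=medium, 2=high), student (0/1)]
X = [
    [0, 2, 0], [0, 2, 1], [1, 2, 0], [2, 1, 0],
    [2, 0, 1], [1, 0, 1], [0, 1, 0], [2, 1, 1],
]
y = [0, 1, 1, 1, 1, 1, 0, 1]  # 1 = buys a computer, 0 = does not

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Each internal node is a test on an attribute; each leaf is a class
print(export_text(tree, feature_names=["age", "income", "student"]))
print("young, high-income student:", tree.predict([[0, 2, 1]])[0])
```

The printed rules read top-down like the flow chart described above, with the root node's test applied first.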
Neural Network:- A neural network, mainly used for classification, can be defined as a collection of processing units with connections between the units. In simpler words, neural networks search for patterns or trends in large quantities of data, which allows organizations to understand their clients' or users' needs better; this directly informs their marketing strategies, increases sales, and lowers costs.
The neural network in data mining is a
classification method that takes the input, trains itself to recognize the
pattern of input data and predicts the output for new input of a similar kind.
Neural network forms the basis of deep learning, a subfield of machine learning
that comes under artificial intelligence. Designing neural network algorithms
is inspired by the structure of the human brain. Just as the human brain is
responsible for intelligence and its discriminating power, the neural network
also mimics the human brain and learns from its experience and applies this
learning for classification and prediction.
For example, suppose an image is stored as a grid of 400 pixels; each of these pixels is fed as an input to the first layer of the neural network, i.e., in our example X1 to X400.
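A minimal sketch of such a network (scikit-learn, with random stand-in data in place of real 400-pixel images, since no dataset is given here):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in data: 100 fake "images", each flattened to 400 pixel inputs (X1..X400)
rng = np.random.default_rng(0)
X = rng.random((100, 400))
y = (X[:, :200].mean(axis=1) > 0.5).astype(int)  # synthetic binary label

# One hidden layer of 25 units sits between the 400 inputs and the output
model = MLPClassifier(hidden_layer_sizes=(25,), max_iter=500, random_state=0)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```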
Application of Neural Network in Data Mining:- The neural network in data mining performs tasks such as:
1. Classification
2. Clustering or categorization
3. Prediction
4. Function approximation
5. Optimization
6. Retrieval by content or control
Advantages of Data Mining:-
· Marketing / Retail:- Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail, online marketing campaigns, etc. Through the results, marketers can take an appropriate approach to selling profitable products to targeted customers.
Data mining brings a lot of benefits to retail companies in the same way as marketing. Through market basket analysis, a store can arrange its products so that customers can conveniently buy products that are frequently bought together. In addition, it also helps retail companies offer certain discounts on particular products that will attract more customers.
· Finance / Banking:- Data mining gives financial institutions information about loans and credit reporting. By building a model from historical customer data, a bank or financial institution can distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent credit card transactions to protect the credit card's owner.
· Manufacturing:- By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers face the challenge that, even when the manufacturing conditions at different wafer production plants are similar, the quality of the wafers varies and some, for unknown reasons, even have defects. Data mining has been applied to determine the ranges of control parameters that lead to the production of the golden wafer, and those optimal control parameters are then used to manufacture wafers of the desired quality.
· Governments:- Data mining helps government agencies by digging into and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activities.
Disadvantages of data mining:-
Privacy Issues:- Concerns about personal privacy have been increasing enormously recently, especially as the internet booms with social networks, e-commerce, forums, and blogs. Because of privacy issues, people are afraid that their personal information will be collected and used in an unethical way, potentially causing them a lot of trouble. Businesses collect information about their customers in many ways to understand their purchasing behavior trends. However, businesses don't last forever; some day they may be acquired or shut down, and the personal information they own may then be sold to others or leaked.
Security Issues:- Security is a big issue. Businesses own information about their employees and customers, including social security numbers, birthdays, payroll, etc. However, how properly this information is taken care of is still in question. There have been many cases of hackers accessing and stealing customers' data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, stolen credit cards and identity theft have become big problems.
Misuse of information/inaccurate information:- Information collected through data mining for ethical purposes can be misused. It may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.
Applications of Data Mining:-
Scientific Analysis: Scientific simulations are generating bulk data every day. Data mining techniques are capable of analyzing this data; indeed, we can now capture and store new data faster than we can analyze the data already accumulated. Examples of scientific analysis:
· Sequence analysis in bioinformatics
· Classification of astronomical objects
· Medical decision support
Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in intrusion detection, spotting network attacks and anomalies, and help in selecting and refining useful and relevant information from large data sets. For example (see the sketch after this list):
· Detecting security violations
· Misuse detection
· Anomaly detection
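As a hedged illustration of the anomaly-detection case (scikit-learn, with made-up connection features rather than a real network trace):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Fake connection records: [bytes transferred, connection duration in seconds]
rng = np.random.default_rng(1)
normal = rng.normal(loc=[500, 2.0], scale=[50, 0.5], size=(200, 2))
intrusions = np.array([[5000.0, 0.1], [4800.0, 0.2]])  # unusually large, fast transfers
X = np.vstack([normal, intrusions])

# Isolation Forest labels easy-to-isolate points as anomalies (-1)
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(X)
print("flagged as anomalous:\n", X[labels == -1])
```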
Business Transactions: Every business transaction is memorized for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. Data mining helps to analyze these business transactions, identify marketing approaches, and support decision-making. Examples:
· Direct mail targeting
· Stock trading
· Customer segmentation
· Churn prediction (churn prediction is one of the most popular Big Data use cases in business)
Market Basket Analysis: Market basket analysis is a technique based on the careful study of purchases made by a customer in a supermarket. It identifies the pattern of items frequently purchased together. This analysis can help companies promote deals, offers, and sales, and data mining techniques help to achieve this analysis task. Examples:
· Data mining concepts are used in sales and marketing to provide better customer service, to improve cross-selling opportunities, and to increase direct mail response rates.
· Customer retention, in the form of pattern identification and prediction of likely defections, is possible with data mining.
· The risk assessment and fraud areas also use data mining concepts to identify inappropriate or unusual behavior.
Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used both by learners and educators. Using EDM we can perform educational tasks such as:
· Predicting students' admission to higher education
· Predicting student profiles
· Predicting student performance
· Evaluating teachers' teaching performance
· Curriculum development
· Predicting student placement opportunities
Research: A data mining technique can perform predictions, classification, clustering, associations, and grouping of data with precision in the research area. Rules generated by data mining are unique for finding results. In most technical research in data mining, we create a training model and a testing model. The train/test model is a strategy to measure the precision of the proposed model. It is called train/test because we split the data set into two sets: a training data set and a testing data set. The training data set is used to design the training model, whereas the testing data set is used in the testing model (see the sketch after the examples below). Examples:
· Classification of uncertain data
· Information-based clustering
· Decision support systems
· Web mining
· Domain-driven data mining
· IoT (Internet of Things) and cybersecurity
· Smart farming with IoT (Internet of Things)
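A minimal sketch of the train/test strategy described above (scikit-learn, with a synthetic dataset standing in for real research data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; a real study would substitute its own dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split into a training set (to build the model) and a testing set (to measure it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy on unseen test data:", round(model.score(X_test, y_test), 3))
```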
Healthcare and Insurance: The pharmaceutical sector can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and figure out which promotional activities will have the best effect in the upcoming months. In the insurance sector, data mining can help to predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior of customers.
· Claims analysis, i.e., which medical procedures are claimed together.
· Identifying successful medical therapies for different illnesses.
· Characterizing patient behavior to predict office visits.
Transportation: A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. A large consumer merchandise organization can apply data mining to improve its sales process to retailers.
· Determining the distribution schedules among outlets.
· Analyzing loading patterns.
Financial/Banking
Sector: A credit card company can leverage its vast
warehouse of customer transaction data to identify customers most likely to be
interested in a new credit product.
· Credit card fraud detection.
· Identifying 'loyal' customers.
· Extraction of information related to customers.
· Determining credit card spending by customer groups.
KDD:- The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of discovering knowledge in data and emphasizes the high-level applications of specific data mining techniques. It is a field of interest to researchers in various areas, including artificial intelligence, machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert systems, and data visualization. The main objective of the KDD process is to extract information from data in the context of large databases, using data mining algorithms to identify what is deemed knowledge. Knowledge Discovery in Databases can be considered a programmed, exploratory analysis and modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful, and understandable patterns from huge and complex data sets. Data mining is the root of the KDD procedure, including the inference of algorithms that investigate the data, develop the model, and find previously unknown patterns. The model is used for extracting knowledge from the data, analyzing the data, and making predictions. The availability and abundance of data today make knowledge discovery and data mining a matter of impressive significance and need, so it isn't surprising that a wide variety of techniques is presently accessible to specialists and experts.
Data mining is an essential step in the process of knowledge discovery. The process is as follows (see the sketch after these steps) −
Data Cleaning − In this step, noise and inconsistent data are removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation − In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation − In this step, data patterns are evaluated.
Knowledge Presentation − In this step, the discovered knowledge is presented to the user.
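A toy end-to-end sketch of these steps in Python (pandas, with invented in-memory data standing in for a real database):

```python
import pandas as pd

# Data integration: combine two hypothetical sources
sales = pd.DataFrame({"item": ["milk", "sugar", "milk", None], "qty": [2, 1, 3, 5]})
prices = pd.DataFrame({"item": ["milk", "sugar"], "price": [1.5, 0.8]})

# Data cleaning: remove noisy/incomplete records
sales = sales.dropna(subset=["item"])

# Data selection and transformation: join the sources and aggregate into a mining-ready form
data = sales.merge(prices, on="item")
data["revenue"] = data["qty"] * data["price"]
summary = data.groupby("item")["revenue"].sum()

# Data mining, pattern evaluation, and presentation (trivially, the top seller)
print("patterns found:\n", summary)
print("top-selling item:", summary.idxmax())
```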
CHALLENGES FOR KDD:-
Larger databases:- Databases with hundreds of fields and tables, millions of records, and multi-gigabyte size are quite commonplace, and terabyte (10^12 bytes) databases are beginning to appear.
High dimensionality:- Not only is there often a very large number of records in the database, but there can also be a very large number of fields (attributes, variables), so that the dimensionality of the problem is high.
Over-fitting:- When the algorithm
searches for the best parameters for one particular model using a limited set
of data, it may over-fit the data, resulting in poor performance of the model
on test data.
Changing data and knowledge:- Rapidly changing (non-stationary) data may make previously discovered patterns invalid.
Complex relationships between fields:-
Hierarchically structured attributes or values, relations between attributes,
and more sophisticated means for representing knowledge about the contents of a
database will require algorithms that can effectively utilize such information.
Understandability of patterns:- In many applications it is important to make the discoveries more understandable by humans. Possible solutions include graphical representations, rule structuring with directed acyclic graphs, natural language generation, and techniques for visualization of data and knowledge.
User interaction and prior knowledge:- Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways. The use of domain knowledge is important in all of the steps of the KDD process.
Integration with other systems:- A
stand-alone discovery system may not be very useful. Typical integration issues
include integration with a DBMS (e.g., via a query interface), integration with
spreadsheets and visualization tools, and accommodating real-time sensor
readings.
Data warehousing:-
A data warehouse is the secure electronic storage of information by a business
or other organization. The goal of a data warehouse is to create a trove of
historical data that can be retrieved and analyzed to provide useful insight
into the organization's operations.
A data warehouse is a vital component
of business intelligence. That wider term encompasses the information
infrastructure that modern businesses use to track their past successes and
failures and inform their decisions for the future.
KEY TAKEAWAYS:-
· A data warehouse is the storage of information over time by a business or other organization.
· New data is periodically added by people in various key departments such as marketing and sales.
· The warehouse becomes a library of historical data that can be retrieved and analyzed in order to inform decision-making in the business.
· The key factors in building an effective data warehouse include defining the information that is critical to the organization and identifying the sources of that information.
· A database is designed to supply real-time information; a data warehouse is designed as an archive of historical information.
Characteristics of Data Warehousing:-
Subject-Oriented:- A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view of a particular subject, such as customer, product, or sales, instead of the organization's ongoing global operations. This is done by excluding data that are not useful concerning the subject and including all data needed by the users to understand the subject.
Integrated:- A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. It requires performing data cleaning and integration during data warehousing to ensure consistency in naming conventions, attribute types, etc., among the different data sources.
Time-Variant:- Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even earlier from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile:- The data warehouse is a physically separate data store, transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed. It usually requires only two procedures for data access: the initial loading of data, and access to data. Therefore, the DW does not require transaction processing, recovery, or concurrency capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, data should not change.
Goals of Data Warehousing:-
· To help reporting as well as analysis.
· To maintain the organization's historical information.
· To be the foundation for decision making.
Advantages and Benefits of Data Warehouse:-
· Understand business trends and make better forecasting decisions.
· Data warehouses are designed to perform well with enormous amounts of data.
· The structure of data warehouses makes it easier for end-users to navigate, understand, and query the data.
· Queries that would be complex in many normalized databases can be easier to build and maintain in data warehouses.
· Data warehousing is an efficient method to manage demand for lots of information from lots of users.
· Data warehousing provides the capability to analyze large amounts of historical data.
Architecture of Data Warehousing:- Three common architectures are:
1. Data Warehouse Architecture: Basic
2. Data Warehouse Architecture: With Staging Area
3. Data Warehouse Architecture: With Staging Area and Data Marts
1. Basic:-
Operational System:- An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.
Flat Files:- A flat file system is a system of files in which transactional data is stored; every file in the system must have a different name.
Meta Data:- A set of data that defines and gives information about other data.
2. With Staging Area:- We must clean and process operational information before putting it into the warehouse. We can do this programmatically, although most data warehouses use a staging area instead (a place where data is processed before entering the warehouse). A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
3. With Staging Area and Data Marts:- We may want to customize our warehouse's architecture for multiple groups within our organization. We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.
As an example, purchasing, sales, and stocks might be separated into their own data marts. A financial analyst could then analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.
Types of Data Warehouse Architectures:-
1. Single-Tier Architecture:- Single-tier architecture is not often used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
In this architecture, the only layer physically available is the source layer, and data warehouses are virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are applied to operational data after the middleware interprets them; in this way, queries affect transactional workloads.
2. Two-Tier Architecture:- The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system.
Although it is typically called two-layer architecture to highlight the separation between physically available sources and data warehouses, it in fact consists of four subsequent data flow stages:
· Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from an information system outside the corporate walls.
· Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-named Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
· Data warehouse layer: Information is saved to one logically centralized individual repository: a data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
· Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate-information navigators, complex query optimizers, and customer-friendly GUIs.
3. Three-Tier Architecture:- The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used by the redundant reconciled layer. It also makes the analytical tools a little further away from being real-time.
Data warehouse construction process:-
A data warehouse is a heterogeneous collection of different data sources organized under a unified schema. Builders should take a broad view of the anticipated use of the warehouse while constructing a data warehouse; during the design phase, there is no way to anticipate all possible queries or analyses. Some characteristics of a data warehouse are:
· Subject oriented
· Integrated
· Time variant
· Non-volatile
Building a Data Warehouse – Some steps that are needed for building any data warehouse are as follows:
1. Extract the (transactional) data from different data sources:- For building a data warehouse, data is extracted from various data sources and stored in a central storage area. For extraction of the data, Microsoft has come up with an excellent tool; when you purchase Microsoft SQL Server, this tool is available free of cost.
2. Transform the transactional data:- There are various DBMSs in which many companies store their data, for example MS Access, MS SQL Server, Oracle, and Sybase. Companies also save data in spreadsheets, flat files, mail systems, etc. Relating the data from all these sources is done while building the data warehouse.
3. Load the (transformed) data into the dimensional database:- After building a dimensional model, the data is loaded into the dimensional database. This process may combine several columns together or split one field into several columns. There are two stages at which transformation of the data can be performed: while loading the data into the dimensional model, or while extracting the data from its origins.
4. Purchase a front-end reporting tool:- Top-notch analytical tools from several major vendors are available in the market. Microsoft has also released a cost-effective tool of its own, Data Analyzer.
ETL:-
ETL stands for Extract, Transform, Load and it is a
process used in data warehousing to extract data from various sources,
transform it into a format suitable for loading into a data warehouse, and then
load it into the warehouse. The process of ETL can be broken down into the
following three stages:-
1. Extract: The first stage in the ETL process is to extract data from various sources such as transactional systems, spreadsheets, and flat files. This step involves reading data from the source systems and storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a format that is suitable for loading into the data warehouse. This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse. This step involves creating the physical data structures and loading the data into the warehouse.
The ETL process is iterative: it is repeated as new data is added to the warehouse. The process is important because it ensures that the data in the data warehouse is accurate, complete, and up-to-date. It also helps to ensure that the data is in the format required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available, such as Informatica, Talend, DataStage, and others, that can automate and simplify the ETL process.
In practice, an ETL tool extracts the data from the various data source systems, transforms it in the staging area, and then finally loads it into the data warehouse system.
Extraction:- The first step of the ETL process is extraction. In this step, data from various source systems, which can be in various formats like relational databases, NoSQL, XML, and flat files, is extracted into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, rather than directly in the data warehouse, because the extracted data is in various formats and can also be corrupted; loading it directly into the data warehouse may damage it, and rollback will be much more difficult. Therefore, this is one of the most important steps of the ETL process.
Transformation:- The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up NULL values with default values, mapping U.S.A, United States, and America to USA, etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
Loading:- The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is loaded into the data warehouse very frequently, and sometimes after longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
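A compact, hedged sketch of these three steps in Python (pandas plus the standard-library sqlite3 as a stand-in warehouse; the source rows and column names are invented for illustration):

```python
import sqlite3
import pandas as pd

# Extract: read from a hypothetical flat-file source into a staging DataFrame
staging = pd.DataFrame({
    "customer": ["Ann", "Bob", None],
    "country": ["U.S.A", "United States", "USA"],
    "amount": [100.0, 250.0, 75.0],
})

# Transform: cleaning (drop incomplete rows, standardize country names), then sorting
staging = staging.dropna(subset=["customer"])
staging["country"] = staging["country"].replace({"U.S.A": "USA", "United States": "USA"})
staging = staging.sort_values("amount")

# Load: write the transformed data into the warehouse table
warehouse = sqlite3.connect("warehouse.db")
staging.to_sql("sales_fact", warehouse, if_exists="append", index=False)
print(pd.read_sql("SELECT * FROM sales_fact", warehouse))
```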
Schema:- A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires maintaining a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. In this chapter, we will discuss the schemas used in a data warehouse. A multidimensional schema is defined using the Data Mining Query Language (DMQL). Its two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.
1. Star Schema:- A star schema is a database organizational structure, optimized for use in a data warehouse or business intelligence, that uses a single large fact table to store transactional or measured data and one or more smaller dimension tables that store attributes about the data. It is called a star schema because the fact table sits at the center of the logical diagram and the small dimension tables branch off to form the points of the star.
A
fact table sits at the center of a star schema database, and each star schema
database only has a single fact table. The fact table contains the specific
measurable (or quantifiable) primary data to be analyzed, such as sales
records, logged performance data or financial data. It may be transactional --
in that rows are added as events happen -- or it may be a snapshot of
historical data up to a point in time.
Characteristics of Star Schema:- The star schema is highly suitable for data warehouse database design because of the following features:
· It creates a de-normalized database that can quickly provide query responses.
· It provides a flexible design that can be changed easily or added to throughout the development cycle, and as the database grows.
· It provides a design that parallels how end-users typically think of and use the data.
· It reduces the complexity of metadata for both developers and end-users.
Advantages of Star Schema:-
· Simpler queries – the join logic of a star schema is quite simple in comparison to the join logic needed to fetch data from a highly normalized transactional schema.
· Simplified business reporting logic – in comparison to a highly normalized transactional schema, the star schema simplifies common business reporting logic, such as period-over-period reporting.
· Feeding cubes – the star schema is widely used by OLAP systems to design OLAP cubes efficiently; in fact, major OLAP systems deliver a ROLAP mode of operation which can use a star schema as a source without designing a cube structure.
Disadvantages of Star Schema:-
· Data integrity is not enforced well because of the highly de-normalized schema state.
· It is not as flexible in terms of analytical needs as a normalized data model.
· Star schemas don't reinforce many-to-many relationships within business entities – at least not frequently.
2. Fact Constellation Schema (Galaxy Schema):- A fact constellation has multiple fact tables; it is also known as a galaxy schema. Consider, for example, two fact tables, namely sales and shipping. The sales fact table is the same as that in the star schema. The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location, and contains two measures, namely dollars sold and units sold. It is also possible to share dimension tables between fact tables; for example, the time, item, and location dimension tables can be shared between the sales and shipping fact tables.
Advantage: It provides a flexible schema.
Disadvantage: It is much more complex and hence hard to implement and maintain.
3. Snowflake Schema:- Some dimension tables in the snowflake schema are normalized; the normalization splits the data into additional tables. Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key. The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
Characteristics of Snowflake Schema:-
· The snowflake schema uses small disk space.
· It is easy to implement a dimension that is added to the schema.
· Because there are multiple tables, query performance is reduced.
· A dimension table may consist of two or more sets of attributes that define information at different grains.
· The sets of attributes of the same dimension table may be populated by different source systems.
Advantages:-
· It provides structured data, which reduces the problem of data integrity.
· It uses small disk space because the data is highly structured.
Disadvantages:-
· Snowflaking reduces the space consumed by dimension tables, but compared with the entire data warehouse the saving is usually insignificant.
· Avoid snowflaking or normalization of a dimension table unless it is required and appropriate.
· Do not snowflake hierarchies of one dimension table into separate tables; hierarchies should belong to the dimension table only and should never be snowflaked.
· Multiple hierarchies that can belong to the same dimension should be designed at the lowest possible detail.
Fact Table:- A fact table aggregates metrics, measurements, or facts about business processes. Fact tables are connected to dimension tables to form a schema architecture representing how data relates within the data warehouse. Fact tables store the primary keys of dimension tables as foreign keys within the fact table.
Dimension Table:- Dimension tables are de-normalized tables used to store data attributes or dimensions. As mentioned above, the primary key of a dimension table is stored as a foreign key in the fact table. Dimension tables are not joined to each other directly; instead, they are associated through the central fact table.
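To make the fact/dimension relationship concrete, here is a small hedged sketch in Python (pandas, with invented tables): the fact table carries the measures plus foreign keys, and each dimension is reached by joining through the fact table.

```python
import pandas as pd

# Dimension tables: descriptive attributes, one row per key
item_dim = pd.DataFrame({"item_key": [1, 2], "item_name": ["laptop", "phone"], "brand": ["A", "B"]})
time_dim = pd.DataFrame({"time_key": [10, 11], "quarter": ["Q1", "Q2"]})

# Fact table: measures (units_sold, dollars_sold) plus foreign keys to the dimensions
sales_fact = pd.DataFrame({
    "item_key": [1, 2, 1],
    "time_key": [10, 10, 11],
    "units_sold": [3, 5, 2],
    "dollars_sold": [3000, 2500, 2000],
})

# A typical star-schema query: join the fact table to its dimensions, then aggregate a measure
report = (sales_fact.merge(item_dim, on="item_key")
                    .merge(time_dim, on="time_key")
                    .groupby(["brand", "quarter"])["dollars_sold"].sum())
print(report)
```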
OLTP:- OLTP is an operational system that supports transaction-oriented applications in a 3-tier architecture. It administers the day-to-day transactions of an organization. OLTP is basically focused on query processing and on maintaining data integrity in multi-access environments, as well as on effectiveness, measured by the total number of transactions per second. The full form of OLTP is Online Transaction Processing.
Characteristics of OLTP:- Following are important characteristics of OLTP:
· OLTP uses transactions that include small amounts of data.
· Indexed data in the database can be accessed easily.
· OLTP has a large number of users.
· It has fast response times.
· Databases are directly accessible to end-users.
· OLTP uses a fully normalized schema for database consistency.
· The response time of an OLTP system is short.
· It strictly performs only the predefined operations on a small number of records.
· OLTP stores the records of the last few days or a week.
· It supports complex data models and tables.
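As a toy illustration of transaction-oriented processing (Python's built-in sqlite3 standing in for a production OLTP database; the table and rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a production OLTP database
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# A typical OLTP transaction: a small, predefined operation on a few records
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("Ann", 99.5))
        conn.execute("UPDATE orders SET amount = amount - 10 WHERE customer = ?", ("Ann",))
except sqlite3.Error as exc:
    print("transaction rolled back:", exc)

print(conn.execute("SELECT * FROM orders").fetchall())
```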
Architecture of OLTP:- Here is the architecture of OLTP:
1. Business / Enterprise Strategy: Enterprise strategy deals with the issues that affect the organization as a whole. In OLTP, it is typically developed at a high level within the firm, by the board of directors or the top management.
2. Business Process: An OLTP business process is a set of activities and tasks that, once completed, will accomplish an organizational goal.
3. Customers, Orders, and Products: OLTP databases store information about products, orders (transactions), customers (buyers), suppliers (sellers), and employees.
4. ETL Processes: ETL separates the data from various RDBMS source systems, then transforms the data (applying concatenations, calculations, etc.) and loads the processed data into the data warehouse system.
5. Data Mart and Data Warehouse: A data mart is a structure/access pattern specific to data warehouse environments. It is used by OLAP to store processed data.
6. Data Mining, Analytics, and Decision Making: Data stored in the data mart and data warehouse can be used for data mining, analytics, and decision making. This data helps you to discover data patterns, analyze raw data, and make analytical decisions for your organization's growth.
Advantages of OLTP:- Following are the pros/benefits of an OLTP system:
· OLTP offers accurate forecasts for revenue and expense.
· It provides a solid foundation for a stable business/organization due to timely modification of all transactions.
· OLTP makes transactions much easier on behalf of the customers.
· It broadens the client base for an organization by speeding up and simplifying individual processes.
· OLTP provides support for bigger databases.
· Partitioning of data for data manipulation is easy.
· We need OLTP for tasks that are frequently performed by the system.
· It is used when only a small number of records are needed.
· It suits tasks that include insertion, updating, or deletion of data.
· It is used when you need consistency and concurrency in order to perform tasks that ensure greater availability.
Disadvantages of OLTP:- Here are the cons/drawbacks of an OLTP system:
· If the OLTP system faces hardware failures, then online transactions get severely affected.
· OLTP systems allow multiple users to access and change the same data at the same time, which can create unprecedented situations.
· If the server hangs for seconds, it can affect a large number of transactions.
· OLTP requires a lot of staff working in groups in order to maintain inventory.
· Online transaction processing systems do not have proper methods of transferring products to buyers by themselves.
· OLTP makes the database much more susceptible to hackers and intruders.
· In B2B transactions, there are chances that both buyers and suppliers miss out on the efficiency advantages that the system offers.
· Server failure may lead to wiping out large amounts of data from the database.
· You can perform only a limited number of queries and updates.
OLAP:- Online
Analytical Processing (OLAP) is a category of software that allows users to
analyze information from multiple database systems at the same time. It is a
technology that enables analysts to extract and view business data from
different points of view. Analysts frequently need to group, aggregate and join
data. These OLAP operations in data mining are resource intensive. With OLAP
data can be pre-calculated and pre-aggregated, making analysis faster. OLAP
databases are divided into one or more cubes. The cubes are designed in such a
way that creating and viewing reports become easy. OLAP stands for Online
Analytical Processing.
The main characteristics of OLAP are as follows:-
1. Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of the data in the data warehouse. It helps in carrying out slice-and-dice operations.
2. Multi-user support: Since OLAP techniques are shared, the OLAP operation should provide normal database operations, including retrieval, update, adequacy control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-ends. The OLAP operations should sit between data sources (e.g., data warehouses) and an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or database size should not significantly degrade the reporting performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing values so that aggregates are computed correctly.
7. An OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitates interactive querying and complex analysis for the users.
9. OLAP allows users to drill down for greater detail or roll up for aggregation of metrics along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
Benefits of OLAP:-
1. OLAP helps managers in decision-making through the multidimensional record views that it is efficient at providing, thus increasing their productivity.
2. OLAP functions are self-sufficient owing to the inherent flexibility support of the organized databases.
3. It facilitates simulation of business models and problems through extensive management of analysis capabilities.
4. In conjunction with a data warehouse, OLAP can be used to support a reduction in the application backlog, faster data retrieval, and reduction in query drag.
Drill-down:- OLAP drill-down is an operation opposite to drill-up (roll-up). It is carried out either by descending a concept hierarchy for a dimension or by adding a new dimension, and it lets a user display more highly detailed data from a less detailed cube. Consequently, when the operation is run, one or more dimensions of the data cube must be appended to provide more information elements.
Slice:- The next pair of operations we are going to discuss is slice and dice. The slice operation takes one specific dimension from a given cube and produces a new sub-cube, which provides information from another point of view. The use of slice implies a specified granularity level of the dimension.
Dice:- This operation is similar to slice. The difference is that with dice you select two or more dimensions, which results in the creation of a sub-cube (see the sketch below).
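A small hedged sketch of these cube operations using pandas (an invented sales cube; group-by aggregations stand in for a real OLAP engine):

```python
import pandas as pd

# An invented mini "cube": sales by region, quarter, and product
cube = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "product": ["pen", "pen", "pen", "book", "book", "book"],
    "sales":   [100, 120, 90, 200, 150, 170],
})

# Roll-up (drill-up): aggregate away the quarter detail
print(cube.groupby(["region", "product"])["sales"].sum())

# Drill-down is the reverse: add the quarter dimension back for finer detail
print(cube.groupby(["region", "quarter", "product"])["sales"].sum())

# Slice: fix a single value of one dimension (quarter = Q1)
print(cube[cube["quarter"] == "Q1"])

# Dice: choose values on two or more dimensions to form a sub-cube
print(cube[(cube["quarter"] == "Q1") & (cube["region"] == "East")])
```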