Data generalization in data mining: Principles, techniques, and approaches

Data is a key component of modern-day success. Data is considered sensitive when access to it is restricted to a limited number of people. Data professionals use data mining to uncover hidden information in company databases.

What’s data generalization?

Data mining identifies statistical patterns in data, including data that is shared with only a small number of people. With reliable generalized data, mining is still possible without exposing the sensitive details. This article introduces data generalization in data mining, a method of hiding private details while revealing useful patterns.

Data Mining

Data mining, in technical terms, is the process of analyzing large data sets to extract patterns and derive rules. It is also a discipline within the field of data science. It differs from predictive analytics in that data mining describes historical data, while predictive analytics aims to predict future outcomes. Although databases can store large amounts of data, that data is often very detailed, and users frequently prefer to view it in summary form.

Data mining can be categorized into the following two types from the perspective of data analysis:

Descriptive mining

This category of data mining describes the task-relevant data set in a concise, informative, summarized form. It emphasizes presenting the interesting general characteristics of the data.

Predictive mining

This category analyzes the data in the database to build models, then uses those models to predict the behavior and properties of new, previously unseen data sets.

Taking all this into account, data mining is extremely beneficial because it can summarize large data sets and present them at a high conceptual level. Data generalization is a crucial part of that functionality.

Data generalization in data mining

Data generalization refers to the process of combining the common features of objects within a class and deriving characteristic rules from them. Concept hierarchies are used to transform low-level attribute values into high-level concepts. Age data, for example, may appear in a dataset as numeric values such as 20 and 40. Generalization converts these to a higher conceptual level, such as young and old, which are categorical values.
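The age example above can be sketched in a few lines of Python. This is only an illustration: the function name generalize_age and the cut-off of 30 are assumptions chosen so that 20 maps to young and 40 maps to old. A real concept hierarchy would be defined by a domain expert or derived from the data.

```python
# Hypothetical concept hierarchy for the "age" attribute.
# The cut-off of 30 is an assumption, picked only so that
# 20 -> "young" and 40 -> "old" as in the example in the text.
AGE_CUTOFF = 30


def generalize_age(age: int) -> str:
    """Map a low-level numeric age to a higher-level categorical concept."""
    return "young" if age < AGE_CUTOFF else "old"


ages = [20, 40]
generalized = [generalize_age(a) for a in ages]
print(generalized)  # ['young', 'old']
```

The numeric values are replaced by categories, so individual ages are no longer visible while the overall distribution of young and old records is preserved.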

This transformation is extremely useful for gaining a better understanding of the data. Data generalization lets users replace a specific value with a more general one, using any of several techniques. Doing so preserves the utility of the data while protecting it against re-identification attacks and the unintentional disclosure of private information.

The data to be generalized can be specified by the user and retrieved via a database query. A summarization module then extracts and computes the essence of the data at different abstraction levels. There are two types of generalization: declarative and automated. Automated generalization blurs values until a specified target is reached.

Automated generalization is often the more beneficial option for companies because it balances accuracy and privacy. It uses an algorithm that applies only as much distortion as is needed to reach the desired value. Declarative generalization, on the other hand, allows users to specify the bin size upfront. However, this technique can cause data distortions and bias. Data generalization is especially useful in combination with Online Analytical Processing (OLAP) technology, which is designed to give quick answers to multi-dimensional analytical questions.
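The difference between the two types can be illustrated with a small Python sketch. It is not a definitive implementation. The function names are hypothetical, and the stopping rule for the automated variant (every bin must contain at least min_group_size records) is an assumption, since the text does not say which target the algorithm aims for. The sketch also assumes non-negative integer values such as ages.

```python
from collections import Counter


def declarative_bins(values, bin_size):
    """Declarative generalization: the user fixes the bin size upfront."""
    result = []
    for v in values:
        low = (v // bin_size) * bin_size
        result.append((low, low + bin_size - 1))
    return result


def automated_bins(values, min_group_size, start_size=5):
    """Automated generalization (sketch): double the bin size until every
    bin holds at least `min_group_size` records, then stop, so no more
    distortion is applied than this simple search needs.
    Assumes non-negative integers; the stopping rule is an assumption."""
    if len(values) < min_group_size:
        raise ValueError("not enough records to satisfy min_group_size")
    bin_size = start_size
    while True:
        bins = declarative_bins(values, bin_size)
        if min(Counter(bins).values()) >= min_group_size:
            return bin_size, bins
        bin_size *= 2


ages = [21, 23, 24, 33, 37, 38]

# Declarative: a fixed bin size of 5 leaves the record with age 33 alone in its bin.
fixed = declarative_bins(ages, 5)

# Automated: the bin size grows until no bin has fewer than 2 records.
chosen_size, auto = automated_bins(ages, min_group_size=2)
print(fixed)
print(chosen_size, auto)
```

With a fixed bin size, the user controls the granularity but may leave isolated records that are easy to re-identify. The automated variant trades some granularity for larger groups.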

Generalized Results Presentation

To present generalized data, users can use the following:

Generalized Relation

This is a relation in which some or all attributes have been generalized, with counts or other aggregate values accumulated for each generalized tuple.
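As a rough sketch, a generalized relation with counts can be built as follows. The records, the age_group function, and the city-to-country mapping are all made up for illustration.

```python
from collections import Counter

# Hypothetical task-relevant records: (age, city).
records = [
    (22, "Lyon"),
    (25, "Paris"),
    (41, "Berlin"),
    (47, "Paris"),
    (23, "Berlin"),
]

# Hypothetical concept hierarchies (assumptions for illustration only).
CITY_TO_COUNTRY = {"Lyon": "France", "Paris": "France", "Berlin": "Germany"}


def age_group(age):
    return "young" if age < 30 else "old"


# Generalize every attribute, then merge identical tuples and keep a count.
generalized_relation = Counter(
    (age_group(age), CITY_TO_COUNTRY[city]) for age, city in records
)

for (group, country), count in sorted(generalized_relation.items()):
    print(group, country, count)
```

Each row of the result is a generalized tuple, and the count records how many original records it summarizes.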

Cross-Tabulation

Generalized results can also be mapped into a cross-tabulation, similar to a contingency table.
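A cross-tabulation can be sketched in plain Python as shown here. The counts are hypothetical generalized tuples of the form (age group, country), and the layout of the printed table is only one possible choice.

```python
from collections import Counter

# Hypothetical counts of generalized tuples: (age group, country) -> count.
counts = Counter({
    ("young", "France"): 2,
    ("young", "Germany"): 1,
    ("old", "France"): 1,
    ("old", "Germany"): 1,
})

ages = sorted({age for age, _ in counts})                # row labels
countries = sorted({country for _, country in counts})   # column labels

# One row per age group, one column per country.
crosstab = {a: {c: counts[(a, c)] for c in countries} for a in ages}
row_totals = {a: sum(crosstab[a].values()) for a in ages}
col_totals = {c: sum(crosstab[a][c] for a in ages) for c in countries}

# Print the table with row and column totals.
print(f"{'age':<8}" + "".join(f"{c:>10}" for c in countries) + f"{'total':>10}")
for a in ages:
    cells = "".join(f"{crosstab[a][c]:>10}" for c in countries)
    print(f"{a:<8}" + cells + f"{row_totals[a]:>10}")
footer = "".join(f"{col_totals[c]:>10}" for c in countries)
print(f"{'total':<8}" + footer + f"{sum(row_totals.values()):>10}")
```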

Visualization Techniques

Generalized results can also be presented visually, using bar charts, pie charts, data cubes, curves, and other graphics.

Guidelines for quantitative attributes

Generalized results can also be mapped into characteristic rules whose attribute conditions carry quantitative information.
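One way to attach quantitative information to a characteristic rule is to annotate each generalized tuple with the share of records it covers. The sketch below assumes this is the kind of quantity meant. The counts, the attribute names, and the member(x) rule format are hypothetical.

```python
# Hypothetical counts of generalized tuples: (age group, country) -> count.
counts = {
    ("young", "France"): 2,
    ("young", "Germany"): 1,
    ("old", "France"): 1,
    ("old", "Germany"): 1,
}

total = sum(counts.values())

# Weight of a generalized tuple: the fraction of all records it covers.
weights = {key: count / total for key, count in counts.items()}

# Render the rule as a disjunction of weighted conditions.
disjuncts = [
    f"(age = {age} AND country = {country}) [{weight:.0%}]"
    for (age, country), weight in sorted(weights.items())
]
rule = "member(x) => " + " OR ".join(disjuncts)
print(rule)
```

The weights sum to 1, so the rule shows at a glance how the class is distributed across the generalized values.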