CHAPTER-1 INTRODUCTION
1.1 Database
A database is an organized collection of data. The data is typically organized to model relevant aspects of reality (for example, the availability of rooms in hotels), in a way that supports processes requiring this information (for example, finding a hotel with vacancies). Traditional databases are organized by fields, records, and files. A field is a single piece of information; a record is one complete set of fields; and a file is a collection of records. For example, a telephone book is analogous to a file. It contains a list of records, each of which consists of three fields: name, address, and telephone number.
A data warehouse is a relational database designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. Its distinctive data structure allows relatively quick and easy execution of complex queries over large amounts of data.
Fig 1.1: Data mining of a fingerprint converted into digital data.
Example:
For example, one Midwest grocery chain used the data mining capability of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays, however, they bought only a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display, and they could make sure beer and diapers were sold at full price on Thursdays.
1.5 Dataset
A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.
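As a small illustrative sketch (the variable names here are invented for the example), a tabular dataset can be represented in code as a list of rows that all share the same variables:

```python
# A small tabular dataset: each row is one member of the dataset,
# each key is one variable. The names are illustrative only.
dataset = [
    {"name": "A", "height_cm": 170, "weight_kg": 65},
    {"name": "B", "height_cm": 182, "weight_kg": 80},
    {"name": "C", "height_cm": 158, "weight_kg": 52},
]

# Each column is a variable; each row is a member; each value is a datum.
columns = list(dataset[0].keys())
num_rows = len(dataset)

print(columns)   # the variables
print(num_rows)  # the number of members (rows)
```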
Fig 2.1: Main steps of methods based on F (unoptimized).
Fig 2.2: Main steps of methods based on FV (optimized): compute FV, then compute FH.
We now introduce the indirect aggregation based on the intermediate table FV, which will be used for both the SPJ and the CASE methods. Let FV be a table containing the vertical aggregation, based on {L1, . . . , Lj} and {R1, . . . , Rk}. Let V() represent the corresponding vertical aggregation for H(). The statement to compute FV gets a cube:

INSERT INTO FV
SELECT L1, . . . , Lj, R1, . . . , Rk, V(A)
FROM F
GROUP BY L1, . . . , Lj, R1, . . . , Rk;

Then each table FI aggregates only those rows that correspond to the Ith unique combination of R1, . . . , Rk, given by the WHERE clause. A possible optimization is synchronizing table scans to compute the d tables in one pass. Finally, to get FH we need d left outer joins with the d + 1 tables so that all
individual aggregations are properly assembled as a set of d dimensions for each group. Outer joins set result columns to null for missing combinations for the given group. In general, nulls should be the default value for groups with missing combinations. We believe it would be incorrect to set the result to zero or some other number by default if there are no qualifying rows; such an approach should be considered on a per-case basis.

INSERT INTO FH
SELECT F0.L1, F0.L2, . . . , F0.Lj, F1.A, F2.A, . . . , Fd.A
FROM F0
LEFT OUTER JOIN F1 ON F0.L1 = F1.L1 and . . . and F0.Lj = F1.Lj
LEFT OUTER JOIN F2 ON F0.L1 = F2.L1 and . . . and F0.Lj = F2.Lj
. . .
LEFT OUTER JOIN Fd ON F0.L1 = Fd.L1 and . . . and F0.Lj = Fd.Lj;

This statement may look complex, but it is easy to see that each left outer join is based on the same columns L1, . . . , Lj. To avoid ambiguity in column references, L1, . . . , Lj are qualified with F0, and result column I is qualified with table FI. Since F0 has n rows, each left outer join produces a partial table with n rows and one additional column. At the end, FH has n rows and d aggregation columns. The statement above is equivalent to an update-based strategy: table FH can be initialized by inserting n rows with key L1, . . . , Lj and nulls in the d dimension aggregation columns, and then iteratively updated from each FI by joining on L1, . . . , Lj. This strategy basically incurs twice the I/O by doing updates instead of insertions. Reordering the d projected tables to join cannot accelerate processing because each partial table has n rows. Another claim is that it is not possible to correctly compute horizontal aggregations without using outer joins; in other words, natural joins would produce an incomplete result set.
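To make the outer-join assembly concrete, the following sketch builds FH from d = 2 projected tables over a toy fact table F in SQLite (the table contents and the single grouping/subgrouping columns are invented for illustration; the F, FV, FH naming follows the text):

```python
import sqlite3

# Toy fact table F(L1, R1, A): one grouping column L1, one subgrouping column R1.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE F (L1 TEXT, R1 TEXT, A REAL)")
cur.executemany("INSERT INTO F VALUES (?,?,?)",
                [("g1", "x", 10), ("g1", "y", 20), ("g2", "x", 5)])

# F0 holds the distinct groups; F1, F2 hold one vertical aggregation each,
# one per unique value of R1 (here d = 2: 'x' and 'y').
cur.execute("CREATE TABLE F0 AS SELECT DISTINCT L1 FROM F")
cur.execute("CREATE TABLE F1 AS SELECT L1, SUM(A) AS A FROM F WHERE R1='x' GROUP BY L1")
cur.execute("CREATE TABLE F2 AS SELECT L1, SUM(A) AS A FROM F WHERE R1='y' GROUP BY L1")

# d left outer joins assemble FH; missing combinations become NULL, not zero.
cur.execute("""
    CREATE TABLE FH AS
    SELECT F0.L1, F1.A AS A_x, F2.A AS A_y
    FROM F0
    LEFT OUTER JOIN F1 ON F0.L1 = F1.L1
    LEFT OUTER JOIN F2 ON F0.L1 = F2.L1
""")
rows = cur.execute("SELECT * FROM FH ORDER BY L1").fetchall()
print(rows)  # ('g2', 5.0, None): no qualifying rows for R1='y', hence NULL
```

Note that group g2 gets NULL (Python `None`) in the A_y column, matching the argument above that missing combinations must not silently become zero.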
This statement computes aggregations in only one scan on F. The main difficulty is that there must be a feedback process to produce the CASE Boolean expressions. We now consider an optimized version using FV. Based
on FV, we need to transpose rows to get groups based on L1, . . . , Lj. Query evaluation needs to combine the desired aggregation with CASE statements for each distinct combination of values of R1, . . . , Rk. As explained above, horizontal aggregations must set the result to null when there are no qualifying rows for the specific horizontal group. The Boolean expression for each CASE statement is a conjunction of k equality comparisons. The following statements compute FH:

SELECT DISTINCT R1, . . . , Rk FROM FV;

INSERT INTO FH
SELECT L1, . . . , Lj
  , sum(CASE WHEN R1 = v11 and . . . and Rk = vk1 THEN A ELSE null END)
  . . .
  , sum(CASE WHEN R1 = v1d and . . . and Rk = vkd THEN A ELSE null END)
FROM FV
GROUP BY L1, L2, . . . , Lj;

As can be seen, the code is similar to the code presented before; the main difference is that we have a call to sum() in each term, which preserves whatever values were previously computed by the vertical aggregation. It has the disadvantage of using two tables instead of the one required by direct computation from F. For very large tables F, computing FV first may be more efficient than computing directly from F.
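A minimal sketch of the same computation with the CASE method over FV in SQLite (toy data and names, with d = 2 distinct values of the single subgrouping column R1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE F (L1 TEXT, R1 TEXT, A REAL)")
cur.executemany("INSERT INTO F VALUES (?,?,?)",
                [("g1", "x", 10), ("g1", "y", 20), ("g2", "x", 5)])

# FV: the vertical aggregation, computed once.
cur.execute("""
    CREATE TABLE FV AS
    SELECT L1, R1, SUM(A) AS A FROM F GROUP BY L1, R1
""")

# The feedback step: one SELECT DISTINCT pass discovers the d values
# of R1 needed to generate the CASE Boolean expressions.
values = [r[0] for r in cur.execute("SELECT DISTINCT R1 FROM FV ORDER BY R1")]

# One scan on FV; sum() preserves the vertically aggregated values and
# leaves NULL where a group has no qualifying rows.
cases = ",\n".join(
    f"SUM(CASE WHEN R1 = '{v}' THEN A ELSE NULL END) AS A_{v}" for v in values
)
rows = cur.execute(f"SELECT L1, {cases} FROM FV GROUP BY L1 ORDER BY L1").fetchall()
print(rows)  # group g2 has no R1='y' rows, so its A_y is NULL
```

The generated query here plays the role of the INSERT INTO FH statement above; in a production setting the values would be parameterized rather than interpolated into the SQL string.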
Notice that in the optimized query the nested query trims F from columns that are not later needed. That is, the nested query projects only those columns that will participate in FH. Also, the first and second queries can be computed from FV .
3.1.2 Problem 2:
Aggregation is impossible when data fields are images or files (such as BLOB data). Moreover, when an image field is converted into a column (attribute) name, the name can exceed the column-name length defined by the DBMS. A related issue is automatically generating unique column names: if there are many subgrouping columns {R1, . . . , Rk}, or the columns are of string data types, the generated column names may become very long and exceed DBMS limits. However, these are not severe limitations, because many dimensions typically correspond to a sparse matrix (having many zeroes or nulls), on which it is difficult or impossible to compute a data mining model. On the other hand, the column-name length problem can be solved as explained below.
The problem of d going beyond the maximum number of columns can be solved by vertically partitioning FH so that each partition table does not exceed the maximum number of columns allowed by the DBMS. Evidently, each partition table must have {L1,. . . , Lj } as its primary key. Alternatively, the column name length issue can be solved by generating column identifiers with integers and creating a dimension description table that maps identifiers to full descriptions, but the meaning of each dimension is lost. An alternative is the use of abbreviations, which may require manual input.
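The identifier-based workaround can be sketched as follows (a hypothetical helper, not code from this thesis): each generated dimension gets a short integer-based column name, and a separate dimension description table maps the identifier back to the full description:

```python
def shorten_columns(descriptions, prefix="C"):
    """Map long generated column names to short integer-based identifiers.

    Returns (short_names, dimension_table), where dimension_table maps
    each short identifier back to its full description, so the meaning
    of each dimension is not lost.
    """
    short_names = [f"{prefix}{i}" for i in range(1, len(descriptions) + 1)]
    dimension_table = dict(zip(short_names, descriptions))
    return short_names, dimension_table

# Long names generated from subgrouping values {R1, ..., Rk} (invented examples):
long_names = ["sum_A_where_R1_is_x_and_R2_is_some_long_string_value",
              "sum_A_where_R1_is_y_and_R2_is_another_long_value"]
names, dims = shorten_columns(long_names)
print(names)        # short identifiers usable as column names
print(dims["C1"])   # full description recovered from the dimension table
```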
Table 3.1: Maximum number of columns permitted by different databases

As the table shows, the lowest permitted column count is 255 (Microsoft Access), so we choose 255 as the sequential splitting point.

Example: suppose the vertical attributes of a table are:
ID, VA1, VA2, VA3, VA4, VA5, VA6, VA7, . . . , VA255, VA256, VA257, . . . , VA270, VA271, VA272, VA273
(it is impossible to aggregate these with the SPJ method). The output of the Split-SPJ method is:
Table-1: ID, VA1, VA2, VA3, VA4, VA5, VA6, VA7, . . . , VA255
Table-2: ID, VA256, VA257, . . . , VA270, VA271, VA272, VA273
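The splitting step can be sketched in a few lines (an illustrative helper, not the thesis implementation): the key column ID is repeated in every partition so the pieces can later be rejoined.

```python
def split_columns(key, value_columns, split_point=255):
    """Partition a wide attribute list at every split_point value
    columns, repeating the key column in each partition table."""
    return [[key] + value_columns[i:i + split_point]
            for i in range(0, len(value_columns), split_point)]

# 273 value columns VA1..VA273, as in the example above.
parts = split_columns("ID", [f"VA{i}" for i in range(1, 274)])
print(len(parts))    # 2 partition tables
print(parts[0][-1])  # 'VA255' -- last column of Table-1
print(parts[1][1])   # 'VA256' -- first value column of Table-2
```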
4.3 Comparison of SPJ with Split-SPJ when the number of aggregated columns < 255
[Figure: experimental comparison charts; x-axis: No. of Columns (20-360), y-axis: Time (ms).]
Fig 4.3.3: SPJ curve when the number of columns is 360.
Fig 4.3.4: Split-SPJ curve when the number of columns is 360.
4.4 Code for the different methods

4.4.1 Code for vertical aggregation:
using System;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class VerticalAggregationUI : Form
    {
        public VerticalAggregationUI()
        {
            InitializeComponent();
        }

        private DataManager dataManager = null;

        private void VerticalAggregationUI_Load(object sender, EventArgs e)
        {
            dataManager = new DataManager();
            dgvVerticalAggregation.DataSource = dataManager.GetVerticalTable();
        }
    }
}
public DataTable GetVerticalTable()
{
    dataExecuteClass = new DataExecuteClass();
    dataSet = new DataSet();
    DataTable dataTable = null;
    // Vertical aggregation: one row per (facebook_id, image_name) group.
    string queryString =
        "SELECT facebook_id, image_name, sum(comments_char) as [SUM] " +
        "FROM stdinfo GROUP BY facebook_id, image_name " +
        "ORDER BY facebook_id, image_name;";
    try
    {
        dataSet = dataExecuteClass.getDataSet(queryString);
        dataTable = dataSet.Tables[0];
        return dataTable;
    }
    catch (Exception)
    {
        throw; // rethrow without resetting the stack trace
    }
}
public DataTable GetHorizontalTable()
{
    dataExecuteClass = new DataExecuteClass();
    dataSet = new DataSet();
    DataTable dataTable = null;
    string queryString = "SELECT * FROM horizontal ORDER BY facebook_id;";
    try
    {
        dataSet = dataExecuteClass.getDataSet(queryString);
        dataTable = dataSet.Tables[0];
        return dataTable;
    }
    catch (Exception)
    {
        throw; // rethrow without resetting the stack trace
    }
}
From the experimental data table, we first produce the vertical aggregated table (compute FV) and from it the horizontal aggregated table (compute FH). Data can be null but not BLOB (such as images or files).
4.4.4 The Split-SPJ Algorithm (Proposed Algorithm):

Algorithm 4.1: Split-SPJ (D, DV, DH, TRV, TCH, TEMP)

Let D be the experimental data table, from which the vertical aggregated table DV and then the horizontal aggregated table DH are produced. Data can be null but not BLOB (such as images or files). The variables TRV and TCH denote the total rows of DV and the total columns of DH, respectively; TEMP holds intermediate results.

1. [Create vertical aggregated table from experimental table.] TEMP := SELECT(D).
2. [Assign vertical data.] DV := TEMP.
3. [Create horizontal aggregated table from vertical aggregated table.] TEMP := SELECT(DV).
4. [Assign horizontal data.] DH := TEMP.
5. [Count columns of horizontal data table.] COUNTER := COUNT(DH).
6. [Check condition.]
   If COUNTER > 255 then:
       Create a table using 255 columns.
       COUNTER := COUNTER - 255.
       Go to step 6.
   Else:
       Create a table using the remaining columns.
   End If.
7. Exit.
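The steps above can be sketched end to end in Python (an illustrative sketch, not the thesis implementation: the two aggregation callables stand in for the SELECT steps 1-4, and only the splitting loop of step 6 is spelled out):

```python
def split_spj(d_rows, aggregate_vertical, aggregate_horizontal, max_cols=255):
    """Sketch of Algorithm 4.1 (Split-SPJ).

    d_rows: experimental table D as a list of row dicts.
    aggregate_vertical / aggregate_horizontal: callables standing in for
    the vertical and horizontal SELECT steps (steps 1-4).
    Returns the list of partition tables produced by step 6.
    """
    dv = aggregate_vertical(d_rows)        # steps 1-2: build DV
    dh = aggregate_horizontal(dv)          # steps 3-4: build DH
    columns = list(dh[0].keys())           # step 5: COUNTER := COUNT(DH)
    key, values = columns[0], columns[1:]
    tables = []                            # step 6: split while COUNTER > 255
    for i in range(0, len(values), max_cols):
        part_cols = [key] + values[i:i + max_cols]
        tables.append([{c: row[c] for c in part_cols} for row in dh])
    return tables                          # step 7: exit

# Identity aggregators and a pre-aggregated stub row with ID + VA1..VA273,
# matching the 273-column example of Section 3.
dh_stub = [{"ID": 1, **{f"VA{i}": i for i in range(1, 274)}}]
tables = split_spj(dh_stub, lambda d: d, lambda dv: dv)
print(len(tables))        # 2 partition tables
print(len(tables[0][0]))  # 256 columns: ID + VA1..VA255
```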
{
    int totalColumnLength = attributeName.Count() - 1;
    int fstSkipPoint = 0, lstSkipPoint = 0;
    int numOfDGV = (int)Math.Ceiling((float)totalColumnLength / maxColLength);
    for (int j = 0; j < numOfDGV; j++)
    {
        fstSkipPoint = j * maxColLength + 1;
        lstSkipPoint = fstSkipPoint + maxColLength - 1;
        DataTable dataTable = CrieateHorizantalAgreateTable();
        // Remove every column that does not belong to this partition,
        // keeping the key column (i == 1) in all partitions.
        for (int i = 1; i <= totalColumnLength; i++)
        {
            if ((i >= fstSkipPoint && i <= lstSkipPoint) || i == 1)
            {
                continue;
            }
            string column = col[i - 1];
            dataTable.Columns.Remove(column);
        }
        dataGridView = new DataGridView();
        dataGridView.DataSource = dataTable;
        dataGridView.Dock = DockStyle.Top;
        this.Controls.Add(dataGridView);
    }
}

private DataTable CrieateHorizantalAgreateTable()
{
    dataManager = new DataManager();
    dataExecuteClass = new DataExecuteClass();
    int i = 0;
    DataRow dr;
    string[] horizontalColumn = dataManager.SelectDistinctRowInaColumn("D2", "stdinfo");
    DataTable horizontalAggrigationTable = new DataTable();

    // Columns of the horizontal table
    string col1 = attributeName[1];
    col[0] = col1;
    string col2 = attributeName[2] + horizontalColumn[0];
    col[1] = col2;
    string col3 = attributeName[2] + horizontalColumn[1];
    col[2] = col3;
    horizontalAggrigationTable.Columns.Add(col1);
    horizontalAggrigationTable.Columns.Add(col2);
    horizontalAggrigationTable.Columns.Add(col3);

    // Rows of the horizontal table
    string[] data1 = dataManager.SelectDistinctRowInaColumn("D1", "stdinfo"); // 1st column
    string[] data2 = new string[data1.Count()];
    string[] data3 = new string[data1.Count()];

    // Prepare 2nd column
    string query = "SELECT SUM FROM (SELECT D1, D2, sum(A) as [SUM] from stdinfo group by D1,D2 order by D1,D2) WHERE D2='x'";
    OleDbDataReader reader = dataExecuteClass.ExecuteReader(query);
    while (reader.Read())
    {
        data2[i] = reader["SUM"].ToString();
        i++;
    }

    // Prepare 3rd column
    query = "SELECT SUM FROM (SELECT D1, D2, sum(A) as [SUM] from stdinfo group by D1,D2 order by D1,D2) WHERE D2='y'";
    i = 0;
    reader = dataExecuteClass.ExecuteReader(query);
    while (reader.Read())
    {
        data3[i] = reader["SUM"].ToString();
        i++;
    }

    for (i = 0; i < data1.Count(); i++)
    {
        dr = horizontalAggrigationTable.NewRow();
        dr[col1] = data1[i];
        dr[col2] = data2[i];
        dr[col3] = data3[i];
        horizontalAggrigationTable.Rows.Add(dr);
    }
    return horizontalAggrigationTable;
    }
}
5.1 Conclusion
We introduced a new method to extend aggregate functions, called Split-SPJ horizontal aggregations, which helps prepare data sets for data mining. Specifically, the method is useful for creating data sets with a horizontal layout, as commonly required by data mining algorithms. Basically, a horizontal aggregation returns a set of numbers instead of a single number for each group, resembling a multidimensional vector. We proposed an abstract but minimal extension to standard SQL aggregate functions to compute horizontal aggregations, which simply splits the data set at the column limit of the underlying database. From a query optimization perspective, we used standard query evaluation methods.
REFERENCES
1. "Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 4, Apr. 2012.
2. "Vertical and Horizontal Percentage Aggregations," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '04), pp. 866-871, 2004.
3. "Data Set Preprocessing and Transformation in a Database System," Intelligent Data Analysis, vol. 15, no. 4, pp. 613-631, 2011.
4. "Integrating K-Means Clustering with a Relational DBMS Using SQL," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 188-201, Feb. 2006.
5. "Data Cube: A Relational Aggregation Operator," Proc. Int'l Conf. Data Eng., pp. 152-159, 1996.
6. "Mining Low-Support Discriminative Patterns," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 2, Feb. 2012.
7. "Data Mining Techniques for Software Effort," IEEE Trans. Software Eng., vol. 38, no. X, 2012.
8. C. Galindo-Legaria and A. Rosenthal, "Outer Join Simplification and Reordering for Query Optimization," ACM Trans. Database Systems, vol. 22, no. 1, pp. 43-73, 1997.
9. C. Ordonez, "Horizontal Aggregations for Building Tabular Data Sets," Proc. Ninth ACM SIGMOD Workshop Data Mining and Knowledge Discovery (DMKD '04), pp. 35-42, 2004.
10. H. Wang, C. Zaniolo, and C.R. Luo, "ATLAS: A Small But Complete SQL Extension for Data Mining and Data Streams," Proc. 29th Int'l Conf. Very Large Data Bases (VLDB '03), pp. 1113-1116, 2003.