
CHAPTER-1 INTRODUCTION

1.1 Database
A database is an organized collection of data. The data is typically organized to model relevant aspects of reality (for example, the availability of rooms in hotels), in a way that supports processes requiring this information (for example, finding a hotel with vacancies). Traditional databases are organized by fields, records, and files. A field is a single piece of information; a record is one complete set of fields; and a file is a collection of records. For example, a telephone book is analogous to a file. It contains a list of records, each of which consists of three fields: name, address, and telephone number.

1.2 Database Management System


To access information from a database, we need a database management system (DBMS). This is a collection of programs that enable us to enter, organize, and select data in a database.
1.3 Data Warehouse

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. Its data structure is designed so that complex queries over large amounts of data can be run relatively quickly and easily.


1.4 Data Mining


Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase speed or cut costs. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Fig - 1.1 Data mining of finger print converted into digital data.


Fig - 1.2 Data mining from data warehouse.

Example :
For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. They could also make sure beer and diapers were sold at full price on Thursdays.


1.5 Dataset
A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.

1.6 Structured Query Language (SQL)


SQL, which is an abbreviation for Structured Query Language, is a language to request data from a database, to add, update, or remove data within a database, or to manipulate the metadata of the database. SQL is a declarative language in which the expected result or operation is given without the specific details about how to accomplish the task. The steps required to execute SQL statements are handled transparently by the SQL database. Sometimes SQL is characterized as non-procedural because procedural languages generally require the details of the operations to be specified, such as opening and closing tables, loading and searching indexes, or flushing buffers and writing data to file systems. SQL is therefore considered to operate at a higher conceptual level than procedural languages: the lower-level logical and physical operations are not specified by the user but are determined by the SQL engine or server process that executes the statement.

1.7 Vertical Aggregation


The essential idea is to allow relevant sites to be overlaid on top of each other by the end user to create a complete view of the information they are looking for. Vertical aggregation arranges the data set from the database vertically with respect to the query (such as a GROUP BY clause in SQL): each aggregated group is returned as one row. In relational database systems, aggregations are generally computed as vertical aggregations.


1.8 Horizontal Aggregation


Here we introduce a new class of aggregations that have similar behavior to SQL standard aggregations, but which produce tables with a horizontal layout. In contrast, we call standard SQL aggregations vertical aggregations since they produce tables with a vertical layout. Horizontal aggregations just require a small syntax extension to aggregate functions called in a SELECT statement. Alternatively, horizontal aggregations can be used to generate SQL code from a data mining tool to build data sets for data mining analysis. We start by explaining how to automatically generate SQL code.


CHAPTER-2 Review Literature

2.1 Analysis of the literature


Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns [1]. Existing SQL aggregations have limitations for preparing data sets because they return one column per aggregated group. In general, significant manual effort is required to build data sets where a horizontal layout is required. A simple yet powerful set of methods generates SQL code that returns aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations [2]. Horizontal aggregations build data sets with a horizontal de-normalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. There are three fundamental methods to evaluate horizontal aggregations: CASE: exploiting the programming CASE construct; SPJ: based on standard relational algebra operators (SPJ queries); PIVOT: using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. The CASE method has similar speed to the PIVOT operator and is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.


2.2 Explanation of F, FV , and FH Table


2.2.1 F (Original Table) :
This table contains the data that is aggregated first vertically and then horizontally. It can contain nulls but must not contain blob (data type) data.

K    D1   D2   A
1    3    X    9
2    2    Y    6
3    1    Y    10
4    1    Y    0
5    2    X    1
6    1    X    null
7    3    X    8
8    2    X    7

Table 2.1 Original Data Table

2.2.2 FV (Vertical Aggregated Table) :


FV is the result of a standard (vertical) aggregation of F. The data set is arranged vertically with respect to the query (such as a GROUP BY clause in SQL); here A is summed for each combination of D1 and D2, one row per group. In relational database systems, aggregations are generally computed this way.

D1   D2   A
1    X    null
1    Y    10
2    X    8
2    Y    6
3    X    17

Table 2.2 Vertical Table
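As a minimal sketch of how Table 2.2 follows from Table 2.1, the snippet below loads the example rows into an in-memory SQLite database and runs an ordinary GROUP BY. The use of Python's sqlite3 module, the SQLite column types, and the helper name build_fv are assumptions made for illustration; the report's own implementation uses C# with an OLE DB data source.

```python
import sqlite3

# Rows of the example table F from Table 2.1: (K, D1, D2, A).
F_ROWS = [
    (1, 3, "X", 9), (2, 2, "Y", 6), (3, 1, "Y", 10), (4, 1, "Y", 0),
    (5, 2, "X", 1), (6, 1, "X", None), (7, 3, "X", 8), (8, 2, "X", 7),
]


def build_fv(conn):
    """Return the vertical aggregation FV as (D1, D2, sum(A)) rows."""
    conn.execute("CREATE TABLE F (K INTEGER, D1 INTEGER, D2 TEXT, A INTEGER)")
    conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", F_ROWS)
    # One output row per (D1, D2) group: this is the vertical layout.
    return conn.execute(
        "SELECT D1, D2, SUM(A) FROM F GROUP BY D1, D2 ORDER BY D1, D2"
    ).fetchall()


fv = build_fv(sqlite3.connect(":memory:"))
for row in fv:
    print(row)
```

The group (D1 = 1, D2 = X) contains only a null value of A, so its sum is null, which Python reports as None.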


2.2.3 FH (Horizontal Aggregated Table) :

FH is the result of a horizontal aggregation. It has similar behavior to a standard SQL aggregation but produces a table with a horizontal layout: one row per value of D1 and one aggregated column per distinct value of D2. A combination with no qualifying rows is null.

D1   D2X    D2Y
1    null   10
2    8      6
3    17     null

Table 2.3 Horizontal Table
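The relationship between Table 2.2 and Table 2.3 can be sketched in plain Python, without any SQL: every distinct D2 value becomes a column, and every (D1, D2) cell that does not occur in FV stays null. The function name transpose and the list-based table representation are hypothetical and used only for illustration.

```python
# FV from Table 2.2 as (D1, D2, A) rows; None stands for null.
FV = [(1, "X", None), (1, "Y", 10), (2, "X", 8), (2, "Y", 6), (3, "X", 17)]


def transpose(fv):
    """Turn vertical rows (D1, D2, A) into a horizontal header and rows."""
    columns = sorted({d2 for _, d2, _ in fv})   # one result column per D2 value
    groups = sorted({d1 for d1, _, _ in fv})    # one result row per D1 value
    cells = {(d1, d2): a for d1, d2, a in fv}
    header = ["D1"] + [f"D2{c}" for c in columns]
    # Missing combinations are looked up with .get() and therefore stay None.
    rows = [[g] + [cells.get((g, c)) for c in columns] for g in groups]
    return header, rows


header, rows = transpose(FV)
print(header)
for r in rows:
    print(r)
```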

Fig- 2.1 Main steps of methods based on F (un-optimized): (1) Select Distinct R1, ..., Rk; (2) SPJ: d left joins, CASE: d sum(case) terms, or PIVOT: d pivoting values; (3) Compute FH.


Fig- 2.2 Main steps of methods based on FV (optimized): (1) Select Distinct R1, ..., Rk; (2) Compute FV; (3) SPJ: d left joins, CASE: d sum(case) terms, or PIVOT: d pivoting values; (4) Compute FH.

2.3 SPJ method


The SPJ method is interesting from a theoretical point of view because it is based on relational operators only. The basic idea is to create one table with a vertical aggregation for each result column, and then join all those tables to produce FH. We aggregate from F into d projected tables with d Select-Project-Join-Aggregation queries (selection, projection, join, aggregation). Each table FI corresponds to one subgrouping combination and has {L1, ..., Lj} as primary key and an aggregation on A as the only nonkey column. It is necessary to introduce an additional table F0 that will be outer joined with the projected tables to get a complete result set. We propose two basic substrategies to compute FH. The first one directly aggregates from F. The second one computes the equivalent vertical aggregation in a temporary table FV grouping by {L1, ..., Lj, R1, ..., Rk}. Then horizontal aggregations can instead be computed from FV, which is a compressed version of F, since standard aggregations are distributive [9].

We now introduce the indirect aggregation based on the intermediate table FV, which will be used for both the SPJ and the CASE method. Let FV be a table containing the vertical aggregation, based on {L1, ..., Lj} and {R1, ..., Rk}. Let V() represent the corresponding vertical aggregation for H(). The statement to compute FV gets a cube:

    INSERT INTO FV
    SELECT L1, ..., Lj, R1, ..., Rk, V(A)
    FROM F
    GROUP BY L1, ..., Lj, R1, ..., Rk;

Then each table FI aggregates only those rows that correspond to the Ith unique combination of R1, ..., Rk, given by the WHERE clause. A possible optimization is synchronizing table scans to compute the d tables in one pass. Finally, to get FH we need d left outer joins with the d + 1 tables so that all individual aggregations are properly assembled as a set of d dimensions for each group. Outer joins set result columns to null for missing combinations for the given group. In general, nulls should be the default value for groups with missing combinations. We believe it would be incorrect to set the result to zero or some other number by default if there are no qualifying rows. Such an approach should be considered on a per-case basis.

    INSERT INTO FH
    SELECT F0.L1, F0.L2, ..., F0.Lj, F1.A, F2.A, ..., Fd.A
    FROM F0
    LEFT OUTER JOIN F1 ON F0.L1 = F1.L1 and ... and F0.Lj = F1.Lj
    LEFT OUTER JOIN F2 ON F0.L1 = F2.L1 and ... and F0.Lj = F2.Lj
    ...
    LEFT OUTER JOIN Fd ON F0.L1 = Fd.L1 and ... and F0.Lj = Fd.Lj;

This statement may look complex, but it is easy to see that each left outer join is based on the same columns L1, ..., Lj. To avoid ambiguity in column references, L1, ..., Lj are qualified with F0. Result column I is qualified with table FI. Since F0 has n rows, each left outer join produces a partial table with n rows and one additional column. At the end, FH will have n rows and d aggregation columns. The statement above is equivalent to an update-based strategy. Table FH can be initialized by inserting n rows with key L1, ..., Lj and nulls in the d dimension aggregation columns. Then FH is iteratively updated from FI, joining on L1, ..., Lj. This strategy basically incurs twice the I/O, doing updates instead of insertions. Reordering the d projected tables to join cannot accelerate processing because each partial table has n rows. Another claim is that it is not possible to correctly compute horizontal aggregations without using outer joins. In other words, natural joins would produce an incomplete result set.
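The SPJ steps can be tried on the example table F from Table 2.1. The following is a minimal sketch, assuming one left key (L1 = D1), one right key (R1 = D2), sum() as the aggregation and SQLite as the engine; the function name spj_horizontal and its join_kind parameter are hypothetical. Passing INNER JOIN instead of the default stands in for the natural join mentioned above and shows the incomplete result.

```python
import sqlite3

# Example table F from Table 2.1: (K, D1, D2, A).
F_ROWS = [
    (1, 3, "X", 9), (2, 2, "Y", 6), (3, 1, "Y", 10), (4, 1, "Y", 0),
    (5, 2, "X", 1), (6, 1, "X", None), (7, 3, "X", 8), (8, 2, "X", 7),
]


def spj_horizontal(join_kind="LEFT OUTER JOIN"):
    """Compute FH with one projected table FI per distinct value of D2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE F (K INTEGER, D1 INTEGER, D2 TEXT, A INTEGER)")
    conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", F_ROWS)

    # F0 lists every group key once, so that every group can appear in FH.
    conn.execute("CREATE TABLE F0 (D1 INTEGER)")
    conn.execute("INSERT INTO F0 SELECT DISTINCT D1 FROM F")

    values = [v for (v,) in conn.execute("SELECT DISTINCT D2 FROM F ORDER BY D2")]
    select_cols, joins = ["F0.D1"], []
    for i, value in enumerate(values, start=1):
        # FI: vertical aggregation restricted to the I-th value of D2.
        conn.execute(f"CREATE TABLE F{i} (D1 INTEGER, A INTEGER)")
        conn.execute(
            f"INSERT INTO F{i} SELECT D1, SUM(A) FROM F WHERE D2 = ? GROUP BY D1",
            (value,),
        )
        select_cols.append(f"F{i}.A")
        joins.append(f"{join_kind} F{i} ON F0.D1 = F{i}.D1")

    sql = f"SELECT {', '.join(select_cols)} FROM F0 {' '.join(joins)} ORDER BY F0.D1"
    return conn.execute(sql).fetchall()


fh_outer = spj_horizontal()              # d left outer joins, as in the text
fh_inner = spj_horizontal("INNER JOIN")  # loses groups with a missing combination
print(fh_outer)
print(fh_inner)
```

With outer joins the group D1 = 3 is kept with a null in the Y column; with inner joins that group disappears because it has no row in the table built for Y.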


2.4 Case Method


For this method, we use the CASE programming construct available in SQL. The CASE statement returns a value selected from a set of values based on boolean expressions. From a relational database theory point of view this is equivalent to doing a simple projection/aggregation query where each nonkey value is given by a function that returns a number based on some conjunction of conditions. We propose two basic substrategies to compute FH. In a similar manner to SPJ, the first one directly aggregates from F and the second one computes the vertical aggregation in a temporary table FV, and then horizontal aggregations are indirectly computed from FV.

We now present the direct aggregation method. Horizontal aggregation queries can be evaluated by directly aggregating from F and transposing rows at the same time to produce FH. First, we need to get the unique combinations of R1, ..., Rk that define the matching boolean expression for result columns. The SQL code to compute horizontal aggregations directly from F is as follows. Observe that V() is a standard (vertical) SQL aggregation that has a CASE statement as argument. Horizontal aggregations need to set the result to null when there are no qualifying rows for the specific horizontal group, to be consistent with the SPJ method and also with the extended relational model [4].

    SELECT DISTINCT R1, ..., Rk FROM F;
    INSERT INTO FH
    SELECT L1, ..., Lj
      ,V(CASE WHEN R1 = v11 and ... and Rk = vk1 THEN A ELSE null END)
      ...
      ,V(CASE WHEN R1 = v1d and ... and Rk = vkd THEN A ELSE null END)
    FROM F
    GROUP BY L1, L2, ..., Lj;

This statement computes aggregations in only one scan on F. The main difficulty is that there must be a feedback process to produce the CASE boolean expressions.

We now consider an optimized version using FV. Based on FV, we need to transpose rows to get groups based on L1, ..., Lj. Query evaluation needs to combine the desired aggregation with CASE statements for each distinct combination of values of R1, ..., Rk. As explained above, horizontal aggregations must set the result to null when there are no qualifying rows for the specific horizontal group. The boolean expression for each CASE statement has a conjunction of k equality comparisons. The following statements compute FH:

    SELECT DISTINCT R1, ..., Rk FROM FV;
    INSERT INTO FH
    SELECT L1, ..., Lj
      ,sum(CASE WHEN R1 = v11 and ... and Rk = vk1 THEN A ELSE null END)
      ...
      ,sum(CASE WHEN R1 = v1d and ... and Rk = vkd THEN A ELSE null END)
    FROM FV
    GROUP BY L1, L2, ..., Lj;

As can be seen, the code is similar to the code presented before, the main difference being that we have a call to sum() in each term, which preserves whatever values were previously computed by the vertical aggregation. It has the disadvantage of using two tables instead of one as required by the direct computation from F. For very large tables F, computing FV first may be more efficient than computing directly from F.
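Both CASE substrategies can be checked against each other on the example table. This is a sketch under the same assumptions as before (L1 = D1, R1 = D2, V() = sum(), SQLite as the engine); case_horizontal is a hypothetical helper. The "feedback process" is the first query, whose distinct values are used to generate the CASE terms. The values are pasted into the SQL text for brevity, which is acceptable only because they are the trusted literals X and Y.

```python
import sqlite3

# Example table F from Table 2.1: (K, D1, D2, A).
F_ROWS = [
    (1, 3, "X", 9), (2, 2, "Y", 6), (3, 1, "Y", 10), (4, 1, "Y", 0),
    (5, 2, "X", 1), (6, 1, "X", None), (7, 3, "X", 8), (8, 2, "X", 7),
]


def case_horizontal(conn, source):
    """Generate and run one SUM(CASE ...) term per distinct value of D2."""
    # Feedback step: read the distinct values that define the result columns.
    values = [v for (v,) in conn.execute(
        f"SELECT DISTINCT D2 FROM {source} ORDER BY D2")]
    terms = ", ".join(
        f"SUM(CASE WHEN D2 = '{v}' THEN A ELSE NULL END)" for v in values
    )
    sql = f"SELECT D1, {terms} FROM {source} GROUP BY D1 ORDER BY D1"
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F (K INTEGER, D1 INTEGER, D2 TEXT, A INTEGER)")
conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", F_ROWS)
# Optimized variant: aggregate vertically first, then transpose FV.
conn.execute("CREATE TABLE FV AS SELECT D1, D2, SUM(A) AS A FROM F GROUP BY D1, D2")

fh_direct = case_horizontal(conn, "F")    # one scan of F
fh_from_fv = case_horizontal(conn, "FV")  # sum() over the pre-aggregated rows
print(fh_direct)
print(fh_from_fv)
```

Both calls return the rows of Table 2.3, which illustrates why sum() over FV preserves the values computed by the vertical aggregation.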


2.5 PIVOT Method


Here we use the PIVOT operator, which is a built-in operator in a commercial DBMS. Since this operator can perform transposition, it can help in evaluating horizontal aggregations. The PIVOT method internally needs to determine how many columns are needed to store the transposed table, and it can be combined with the GROUP BY clause. The basic syntax to exploit the PIVOT operator to compute a horizontal aggregation, assuming one BY column for the right key columns (i.e., k = 1), is as follows:

    SELECT DISTINCT R1 FROM F;

    SELECT L1, L2, ..., Lj, v1, v2, ..., vd
    INTO Ft
    FROM F
    PIVOT( V(A) FOR R1 in (v1, v2, ..., vd) ) AS P;

    SELECT L1, L2, ..., Lj, V(v1), V(v2), ..., V(vd)
    INTO FH
    FROM Ft
    GROUP BY L1, L2, ..., Lj;

This set of queries may be inefficient because Ft can be a large intermediate table. We introduce the following optimized set of queries, which reduces the size of the intermediate table:

    SELECT DISTINCT R1 FROM F; /* produces v1, ..., vd */

    SELECT L1, L2, ..., Lj, v1, v2, ..., vd
    INTO FH
    FROM ( SELECT L1, L2, ..., Lj, R1, A FROM F ) Ft
    PIVOT( V(A) FOR R1 in (v1, v2, ..., vd) ) AS P;

Notice that in the optimized query the nested query trims F of columns that are not needed later. That is, the nested query projects only those columns that will participate in FH. Also, the first and second queries can be computed from FV.
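Because the list v1, ..., vd must be known before the PIVOT statement can be written, the statement is usually generated by a program. The sketch below only builds the text of the optimized query and does not execute it, since SQLite has no PIVOT operator. The bracketed column names follow Microsoft SQL Server conventions, which is an assumption about the target DBMS, and pivot_sql is a hypothetical helper.

```python
def pivot_sql(table, left_cols, pivot_col, value_col, agg, values, target="FH"):
    """Build the text of the optimized PIVOT query for one right key column."""
    left = ", ".join(left_cols)
    # Each distinct value of the pivot column becomes a bracketed column name.
    cols = ", ".join(f"[{v}]" for v in values)
    return (
        f"SELECT {left}, {cols} INTO {target} "
        # The nested query keeps only the columns that take part in FH.
        f"FROM (SELECT {left}, {pivot_col}, {value_col} FROM {table}) AS Ft "
        f"PIVOT ({agg}({value_col}) FOR {pivot_col} IN ({cols})) AS P;"
    )


# Values as returned by: SELECT DISTINCT D2 FROM F;
sql = pivot_sql("F", ["D1"], "D2", "A", "SUM", ["X", "Y"])
print(sql)
```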


CHAPTER-3 Problem Structure Analysis

3.1 Problem of literature

3.1.1 Problem 1 :


The number of columns may exceed the maximum number of columns allowed by the DBMS [1]. Two limits can be reached: the maximum number of columns in one table, and the maximum column name length when columns are automatically named. To elaborate on this, a horizontal aggregation can return a table that goes beyond the maximum number of columns in the DBMS when the set of columns {R1, ..., Rk} has a large number of distinct combinations of values, or when there are multiple horizontal aggregations in the same query.

3.1.2 Problem 2 :
It is impossible to aggregate when data fields contain images or files (such as blob data). If image data were converted into a column or attribute name, the name would exceed the column name length defined by the DBMS. This is the issue of automatically generating unique column names. If there are many subgrouping columns {R1, ..., Rk}, or the columns are of string data types, very long column names may be generated, which may exceed DBMS limits. However, these are not important limitations, because many dimensions are likely to correspond to a sparse matrix (having many zeroes or nulls) on which it will be difficult or impossible to compute a data mining model. On the other hand, the large column name length can be solved as explained below.


The problem of d going beyond the maximum number of columns can be solved by vertically partitioning FH so that each partition table does not exceed the maximum number of columns allowed by the DBMS. Evidently, each partition table must have {L1,. . . , Lj } as its primary key. Alternatively, the column name length issue can be solved by generating column identifiers with integers and creating a dimension description table that maps identifiers to full descriptions, but the meaning of each dimension is lost. An alternative is the use of abbreviations, which may require manual input.
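The integer-identifier idea can be illustrated with a few lines of Python. This is a sketch only: the prefix C, the helper name make_column_ids and the sample combinations are hypothetical, and in a real system the mapping would be stored as the dimension description table rather than kept in a dictionary.

```python
def make_column_ids(combinations, prefix="C"):
    """Map short generated column names to the full value combinations."""
    description = {}
    for number, combo in enumerate(combinations, start=1):
        # The column is called C1, C2, ...; its meaning lives only in the mapping.
        description[f"{prefix}{number}"] = combo
    return description


# Hypothetical combinations of two subgrouping columns with long string values.
combos = [
    ("department_of_computer_science", "autumn_semester"),
    ("department_of_computer_science", "spring_semester"),
    ("department_of_mathematics", "autumn_semester"),
]
dimension_table = make_column_ids(combos)
for name, combo in dimension_table.items():
    print(name, combo)
```

The generated names stay short no matter how long the underlying values are, at the cost of needing the mapping to interpret each column.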


3.2 Introduction to Split-SPJ


When the number of columns exceeds the number allowed by the DBMS, the SPJ method cannot produce the result. The Split-SPJ method instead creates another table whenever the DBMS column limit is exceeded. When the column limit is not exceeded, Split-SPJ has all the properties of SPJ.

Column limit of different Database System :

Database               Maximum Permitted Column
Microsoft Access       255
Microsoft SQL Server   1024
MySql                  4096
Oracle                 Default 1000, but it can be increased by command.

Table 3.1 Different database permitted column

If we look at the table, the lowest column limit is 255 (Microsoft Access). So we choose 255 as the splitting point, applied sequentially.

Example : If the vertical attributes of a table are:
ID, VA1, VA2, VA3, VA4, VA5, VA6, VA7, ..., VA255, VA256, VA257, ..., VA270, VA271, VA272, VA273
(it is impossible to aggregate with the SPJ method), the output of the Split-SPJ method is:

Table-1: ID, VA1, VA2, VA3, VA4, VA5, VA6, VA7, ..., VA255
Table-2: ID, VA256, VA257, ..., VA270, VA271, VA272, VA273
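The splitting in this example can be reproduced with a short Python sketch. The function name split_columns is hypothetical. Following the example, the key column ID is repeated in every table and is not counted toward the limit of 255, so Table-1 actually holds 256 columns; a stricter variant would pass limit - 1.

```python
def split_columns(key, columns, limit=255):
    """Cut the aggregated columns into consecutive blocks of at most `limit`."""
    tables = []
    for start in range(0, len(columns), limit):
        # Every partition repeats the key so the tables can be joined again.
        tables.append([key] + columns[start:start + limit])
    return tables


aggregated = [f"VA{i}" for i in range(1, 274)]   # VA1 ... VA273
tables = split_columns("ID", aggregated)
for number, cols in enumerate(tables, start=1):
    print(f"Table-{number}: {cols[0]}, {cols[1]}, ..., {cols[-1]}")
```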


CHAPTER-4 Experimental Description

4.1 Experimental data of this system


For the simulation we use a scenario based on a popular social networking system. Facebook handles a large number of pictures within a few seconds. We consider four users, named user1, user2, user3, and user4, and 25 pictures, named pic1 to pic25. Any user can comment on any picture at random, with a comment of any length. We use the horizontal aggregation concept to find the total number of characters each user has commented on each picture. If a user has not commented on a picture, the field is NULL. Following the process of the previous literature, each user is aggregated against each picture. In the simulation we assume that the column limit of the database is 20. The columns are therefore broken at 20, and the next 5 columns go into a new table. This was not possible in the previous work. The time complexity of the proposed system is the same as that of the previous SPJ method, but it is able to show the full horizontal aggregation. If we index by picture number, the character counts commented by all users for pic1 to pic20 are shown in the first table, and those for the last five pictures are shown in the next table.


Fig- 4.1 Experimental data(Original data table)


Fig- 4.2 Experimental data(Vertical table)


Fig- 4.3 Experimental data(Horizontal table)


4.2 Figure of Split-SPJ horizontal aggregation

Fig- 4.4 Split-SPJ horizontal aggregation


4.3 Comparison of SPJ with Split-SPJ

When aggregated column < 255

Fig 4.3.1 : SPJ curve when number of column is 100. (Plot of Time (ms) against No. of Column.)

Fig 4.3.2 : Split-SPJ curve when number of column is 100. (Plot of Time (ms) against No. of Column.)

When aggregated column > 255

Fig 4.3.3 : SPJ curve when number of column is 360. (Plot of Time (ms) against No. of Column, with the limit of 255 marked on the column axis.)

Fig 4.3.4 : Split-SPJ curve when number of column is 360. (Plot of Time (ms) against No. of Column, with the limit of 255 marked on the column axis.)


4.4 Code for the different methods

4.4.1 Code for vertical aggregation :

using System;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class VerticalAggregationUI : Form
    {
        public VerticalAggregationUI()
        {
            InitializeComponent();
        }

        private DataManager dataManager = null;

        // Bind the vertically aggregated table to the grid when the form loads.
        private void VerticalAggregationUI_Load(object sender, EventArgs e)
        {
            dataManager = new DataManager();
            dgvVerticalAggregation.DataSource = dataManager.GetVerticalTable();
        }
    }
}

// Called above as dataManager.GetVerticalTable().
public DataTable GetVerticalTable()
{
    dataExecuteClass = new DataExecuteClass();
    dataSet = new DataSet();
    DataTable dataTable = null;
    // Total comment characters per user and picture: one row per group.
    string queryString = "SELECT facebook_id, image_name, sum(comments_char) as [SUM] from stdinfo group by facebook_id,image_name order by facebook_id,image_name;";
    try
    {
        dataSet = dataExecuteClass.getDataSet(queryString);
        dataTable = dataSet.Tables[0];
        return dataTable;
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace.
        throw;
    }
}


4.4.2 Code for horizontal aggregation :


using System;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class HorizontalAggregationUI : Form
    {
        public HorizontalAggregationUI()
        {
            InitializeComponent();
        }

        private DataManager dataManager = null;
        private DataExecuteClass DataExecuteClass = null;

        // Bind the horizontally aggregated table to the grid when the form loads.
        private void HorizontalAggregationUI_Load(object sender, EventArgs e)
        {
            DataExecuteClass = new DataExecuteClass();
            dataManager = new DataManager();
            dgvHA.DataSource = dataManager.GetHorizontalTable();
        }
    }
}

// Called above as dataManager.GetHorizontalTable().
public DataTable GetHorizontalTable()
{
    dataExecuteClass = new DataExecuteClass();
    dataSet = new DataSet();
    DataTable dataTable = null;
    // The table named horizontal already holds the horizontal layout.
    string queryString = "SELECT * from horizontal order by facebook_id;";
    try
    {
        dataSet = dataExecuteClass.getDataSet(queryString);
        dataTable = dataSet.Tables[0];
        return dataTable;
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace.
        throw;
    }
}


4.4.3 Main steps of Split-SPJ method based on FV :

Fig- 4.5 Main steps of Split-SPJ method based on FV: (1) Select Distinct R1, ..., Rk; (2) Compute FV; (3) Split-SPJ: d left joins; (4) Compute FH.

From the experimental data table, the vertical aggregated table is produced first and then the horizontal aggregated table. Data can be null but not blob (such as image, file, etc.).


4.4.4 The Split-SPJ Algorithm (Proposed Algorithm):

Algorithm 4.1 : Split-SPJ (D, DV, DH, TRV, TCH, TEMP)

Let D be the experimental data table; from it the vertical aggregated table DV is produced and then the horizontal aggregated table DH. Data can be null but not blob (such as image, file, etc.). The variables TRV and TCH denote respectively the total rows of DV and the total columns of DH; TEMP holds an intermediate result.

1. [Create vertical aggregated table from experimental table.] TEMP =: SELECT(D).
2. [Assigning vertical data.] DV =: TEMP.
3. [Create horizontal aggregated table from vertical aggregated table.] TEMP =: SELECT(DV).
4. [Assigning horizontal data.] DH =: TEMP.
5. [Count column of horizontal data table.] COUNTER =: COUNT(DH).
6. [Check condition.]
   If COUNTER > 255 then :
      Create table using 255 column.
      COUNTER =: COUNTER - 255.
      GoTo step 6.
   Else :
      Create table using the remaining columns.
   End If
7. Exit.
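Algorithm 4.1 can be walked through end to end on a tiny version of the experimental table. The sketch below assumes SQLite and reuses the table and column names that appear in the C# listing (stdinfo, facebook_id, image_name, comments_char); the sample rows, the function name split_spj and the limit of 2 are made up for illustration. As a simplification, the transposition in steps 3 and 4 is done in Python rather than with d left outer joins.

```python
import sqlite3

# Hypothetical sample of the experimental table: (user, picture, characters).
ROWS = [
    ("user1", "pic1", 3), ("user1", "pic1", 4), ("user1", "pic2", 5),
    ("user2", "pic3", 2), ("user1", "pic4", 1), ("user2", "pic5", 6),
]


def split_spj(conn, limit=255):
    # Steps 1-2: vertical aggregated table DV.
    dv = conn.execute(
        "SELECT facebook_id, image_name, SUM(comments_char) "
        "FROM stdinfo GROUP BY facebook_id, image_name"
    ).fetchall()
    # Steps 3-4: horizontal layout DH, one column per picture.
    images = sorted({img for _, img, _ in dv})
    users = sorted({user for user, _, _ in dv})
    cell = {(user, img): total for user, img, total in dv}
    # Step 5: count the aggregated columns of DH.
    counter = len(images)
    # Step 6: cut off `limit` columns at a time until the rest fits.
    blocks, start = [], 0
    while counter > limit:
        blocks.append(images[start:start + limit])
        start += limit
        counter -= limit
    blocks.append(images[start:])
    # Each output table repeats the key column; missing cells stay None.
    return [
        (["facebook_id"] + cols,
         [[user] + [cell.get((user, c)) for c in cols] for user in users])
        for cols in blocks
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stdinfo (facebook_id TEXT, image_name TEXT, comments_char INTEGER)")
conn.executemany("INSERT INTO stdinfo VALUES (?, ?, ?)", ROWS)
result = split_spj(conn, limit=2)
for header, rows in result:
    print(header, rows)
```

With five pictures and a limit of 2, the loop in step 6 runs twice and produces three tables holding 2, 2 and 1 aggregated columns.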


4.4.5 Code for Split-SPJ horizontal aggregation :


For oracle :

SELECT (SELECT column_name FROM user_tab_columns
        WHERE table_name like table_name and rownum = 255)
FROM (FROM F0
      LEFT OUTER JOIN F1 ON F0.L1 = F1.L1 and ... and F0.Lj = F1.Lj
      LEFT OUTER JOIN F2 ON F0.L1 = F2.L1 and ... and F0.Lj = F2.Lj
      ...
      LEFT OUTER JOIN Fd ON F0.L1 = Fd.L1 and ... and F0.Lj = Fd.Lj)

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Data.OleDb;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class ProposedHorizontalAggregationUI : Form
    {
        public ProposedHorizontalAggregationUI()
        {
            InitializeComponent();
        }

        private DataManager dataManager = null;
        private DataExecuteClass dataExecuteClass = null;
        private DataGridView dataGridView = null;
        private string[] attributeName = (new DataManager()).GetAllAttributeOfAtable("stdinfo");
        private string[] col = new string[20];

        private void ProposedHorizontalAggregationUI_Load(object sender, EventArgs e)
        {
            dataManager = new DataManager();
            int maxColLength = int.Parse(dataManager.GetMaxColumnLength());
            if (maxColLength == 0)
            {
                // A limit of 0 shows the whole horizontal table in a single grid.
                dataGridView = new DataGridView();
                dataGridView.Dock = DockStyle.Top;
                dataGridView.DataSource = CrieateHorizantalAgreateTable();
                this.Controls.Add(dataGridView);
            }
            else
            {
                int totalColumnLength = attributeName.Count() - 1;
                int fstSkipPoint = 0, lstSkipPoint = 0;
                // One grid per block of maxColLength columns.
                int numOfDGV = (int)Math.Ceiling((float)totalColumnLength / maxColLength);
                for (int j = 0; j < numOfDGV; j++)
                {
                    fstSkipPoint = j * maxColLength + 1;
                    lstSkipPoint = fstSkipPoint + maxColLength - 1;
                    DataTable dataTable = CrieateHorizantalAgreateTable();
                    for (int i = 1; i <= totalColumnLength; i++)
                    {
                        // Keep the first column (i == 1) and the columns of block j; remove the rest.
                        if ((i >= fstSkipPoint && i <= lstSkipPoint) || i == 1)
                        {
                            continue;
                        }
                        else
                        {
                            string column = col[i - 1];
                            dataTable.Columns.Remove(column);
                        }
                    }
                    dataGridView = new DataGridView();
                    dataGridView.DataSource = dataTable;
                    dataGridView.Dock = DockStyle.Top;
                    this.Controls.Add(dataGridView);
                }
            }
        }

        private DataTable CrieateHorizantalAgreateTable()
        {
            dataManager = new DataManager();
            dataExecuteClass = new DataExecuteClass();
            int i = 0;
            DataRow dr;
            string[] horizontalColumn = dataManager.SelectDistinctRowInaColumn("D2", "stdinfo");
            DataTable horizontalAggrigationTable = new DataTable();

            // Columns of the horizontal table: the key and one column per distinct D2 value.
            string col1 = attributeName[1];
            col[0] = col1;
            string col2 = attributeName[2] + horizontalColumn[0];
            col[1] = col2;
            string col3 = attributeName[2] + horizontalColumn[1];
            col[2] = col3;
            horizontalAggrigationTable.Columns.Add(col1);
            horizontalAggrigationTable.Columns.Add(col2);
            horizontalAggrigationTable.Columns.Add(col3);

            // Rows of the horizontal table. Prepare the 1st column.
            string[] data1 = dataManager.SelectDistinctRowInaColumn("D1", "stdinfo");
            string[] data2 = new string[data1.Count()];
            string[] data3 = new string[data1.Count()];

            // Prepare the 2nd column.
            string query = "SELECT SUM FROM (SELECT D1, D2, sum(A) as [SUM] from stdinfo group by D1,D2 order by D1,D2) WHERE D2='x'";
            OleDbDataReader reader = dataExecuteClass.ExecuteReader(query);
            while (reader.Read())
            {
                data2[i] = reader["SUM"].ToString();
                i++;
            }

            // Prepare the 3rd column.
            query = "SELECT SUM FROM (SELECT D1, D2, sum(A) as [SUM] from stdinfo group by D1,D2 order by D1,D2) WHERE D2='y'";
            reader = null;
            i = 0;
            reader = dataExecuteClass.ExecuteReader(query);
            while (reader.Read())
            {
                data3[i] = reader["SUM"].ToString();
                i++;
            }

            // Assemble one row per distinct D1 value.
            for (i = 0; i < data1.Count(); i++)
            {
                dr = horizontalAggrigationTable.NewRow();
                dr[col1] = data1[i];
                dr[col2] = data2[i];
                dr[col3] = data3[i];
                horizontalAggrigationTable.Rows.Add(dr);
            }
            return horizontalAggrigationTable;
        }
    }
}


CHAPTER-5 Conclusion and Future Research

5.1 Conclusion
We introduced a new method to extend aggregate functions, called Split-SPJ horizontal aggregation, which helps prepare data sets for data mining. Specifically, the method is useful for creating data sets with a horizontal layout, as commonly required by data mining algorithms. Basically, a horizontal aggregation returns a set of numbers instead of a single number for each group, resembling a multidimensional vector. We proposed an abstract, but minimal, extension to SQL standard aggregate functions to compute horizontal aggregations, which simply splits the data set at the column limit of the database in use. From a query optimization perspective, we used query evaluation methods.

5.2 Future Research Work


We need to understand whether the Split-SPJ method of horizontal aggregation can be applied to holistic functions (e.g., rank()). Optimizing a workload of horizontal aggregation queries is another challenging problem. If the length of an aggregated object's name exceeds the column name length of the database, an error occurs, which may be overcome by using an alias method. For this reason it is very complex to aggregate when data fields contain images or files (such as blob data).


REFERENCE

1. Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis. [IEEE Trans. Knowledge and Data Eng., vol. 24, no. 4, Apr. 2012.]

2. Vertical and Horizontal Percentage Aggregations. [Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '04), pp. 866-871, 2004.]

3. Data Set Preprocessing and Transformation in a Database System. [Intelligent Data Analysis, vol. 15, no. 4, pp. 613-631, 2011.]

4. Integrating K-Means Clustering with a Relational DBMS Using SQL. [IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 188-201, Feb. 2006.]

5. Data Cube: A Relational Aggregation Operator. [Proc. Int'l Conf. Data Eng., pp. 152-159, 1996.]

6. Mining Low-Support Discriminative Patterns. [IEEE Trans. Knowledge and Data Eng., vol. 24, no. 2, Feb. 2012.]

7. Data Mining Techniques for Software Effort. [IEEE Trans. Software Eng., vol. 38, no. X, XXXXXXX 2012.]

8. C. Galindo-Legaria and A. Rosenthal, Outer Join Simplification and Reordering for Query Optimization, ACM Trans. Database Systems, vol. 22, no. 1, pp. 43-73, 1997.

9. C. Ordonez, Horizontal Aggregations for Building Tabular Data Sets, Proc. Ninth ACM SIGMOD Workshop Data Mining and Knowledge Discovery (DMKD '04), pp. 35-42, 2004.

10. H. Wang, C. Zaniolo, and C.R. Luo, ATLAS: A Small But Complete SQL Extension for Data Mining and Data Streams, Proc. 29th Int'l Conf. Very Large Data Bases (VLDB '03), pp. 1113-1116, 2003.
