• No results found

SQL and Data Analysis. Some Implications for Data Analysits and Higher Education

N/A
N/A
Protected

Academic year: 2021

Share "SQL and Data Analysis. Some Implications for Data Analysits and Higher Education"

Copied!
9
0
0

Loading.... (view fulltext now)

Full text

(1)

Procedia Economics and Finance 20 ( 2015 ) 243 – 251

2212-5671 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Peer-review under responsibility of the Faculty of Economics and Business Administration, Alexandru Ioan Cuza University of Iasi. doi: 10.1016/S2212-5671(15)00071-4

ScienceDirect

7th International Conference on Globalization and Higher Education in Economics and Business

Administration, GEBA 2013

SQL and data analysis. Some implications for data analysits and

higher education

Marin Fotache

a

*, Catalin Strimbei

b

a,bAl. I. Cuza University, B-dul Carol 1 nr.22, Iasi, 700505, Romania

Abstract

Hypes or not, Big Data, NoSQL, Analytics, Business Intelligence, Data Science require processing huge amounts of data in various and complex ways using a vast array of statistical methods and tools. Market increasingly needs graduates with both databases and data warehouses technological skills and also statistical competencies in order to decipher the business patterns and trends hidden in the mountains of data. This paper presents the main coordinates of data processing today and some implications for academic curricula. It argues that data analysis and business intelligence professionals could benefit if trained to acquire a proper level of SQL and data warehouses knowledge.

© 2014 The Authors. Published by Elsevier B.V.

Selection and peer-review under responsibility of the Faculty of Economics and Business Administration, Alexandru Ioan Cuza University of Iasi.

Keywords: Big Data, Statistics, Data Analysis, SQL, OLAP

1. Introduction: Data Deluge and a Consequent Hype – Big Data

In IT and business world hypes and fads emerge and disappear in a very quick pace. The top of the hypes changes every few months. Marketing departments of almost all tech companies compete in repackaging and rebranding old stuff suggesting coolness and desirability of their products (Buhl et.al, 2013). It seems the strategy works, at least for some.

* Corresponding author. Tel.: +40232201430; fax: +40232217000

E-mail address: [email protected]

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

(2)

As Stonebraker (2012) put it, Big Data is the "buzzword du jour", Stonebraker. As with other buzzwords, there is no rigorous definition of the term. How much data is big (Jacobs, 2009; Borkar et al., 2012)? What are the differences between Big Data, databases and data warehouses, all of them dealing with huge volumes of data (Borkar et al.,2012) ?

E-commerce sites, sensors, cameras, mobile apps all produces huge amount of data with different periodicity. This mountain of data must be processed and analyzed in order to detect patterns, to explain business phenomena, to make predictions. The basic assumption of Big Data is we can learn from data (Cron et al., 2012).

According to Jacobs (2009), Big Data should be defined at any point in time as data whose size forces us to look beyond the tried and true methods that are prevalent at that time, whereas for Cuzzocrea et al. (2011) Big Data refers to enormous amounts of unstructured data produced by high-performance applications falling in a wide and heterogeneous family of application scenarios: from scientific computing applications to social networks, from e-government applications to medical information systems, and so forth.

Stonebraker (2012) identifies four flavors of Big Data: x Big volumes of data, but small analytics

x Big analytics on big volumes of data x Big velocity

x Big variety.

Big volumes of data, but small analytics usually means using regular SQL queries (SELECT with MIN, MAX, SUM, COUNT, AVG, GROUP BY, HAVING functions and clauses) on large datasets. All types of SQL (relational) databases, commercial (Oracle, IBM DB2, Microsoft SQL Server) or open-source (PostgreSQL, MySQL) could be platforms/tools for this type of processing.

Big analytics on big volumes of data requires combining ETL (Extract-Transform-Load) tools with statistical packager. Big analytics denote regression, data mining, machine learning and other types of more complex processing. Data could be extracted from various data sources using SQL queries and/or ETL tools. Complex analyses require packages such as SPSS, R, SAS, etc. and sometimes a good deal of coding.

Big velocity is the ability to absorb on the fly streams of data from stock exchange, electronic trading, mobile social networks, web sites, etc.

Big variety is related to the heterogeneity of data sources and data formats (XLS, relational databases, CSV, flat files, etc.) which must me imported and transformed in order to be processed/analyzed.

Managing Big Data means dealing with three types of operations: gathering data, storing data, and processing data. Consequently, twp key ingredients for Big Data are the databases and statistical packages.

2. SQL and Statistical Packages

There is a large offer of statistical packages dedicated to data analysis and other types of complex processing. Some of the most popular commercial products are: SPSS, SAS, Stata, S-PLUS, Minitab. They generally provide a vast array of statistical functions and options with very friendly interfaces to the regular user (non-programmers). But at least some of them are also notorious for their costs. Small and Medium Business, as well as a good range of universities, cannot afford to spend sometimes thousand of dollars for a not-so-large number of licenses. Of course prices and licensing systems differ, but from our experience the price is still the most common barrier to their usage. Nevertheless many universities have acquired packages such as SPSS, SAS through donations, research grants, projects with the industry, etc.

The recent years witnessed a general trend in higher education and research world towards open-source statistical software, mainly R, Tsoukalos et al. (2013). R is gradually becoming the dominant platform for universities, companies and researchers that could not spend too much on software especially within current financial troubles. R has a huge community of enthusiast developers with continuously implement the most recent advancements in statistics, data mining, machine learning, etc. without any cost for the final user.

There are two main limitations of R in relation with these papers objectives. One is R specific and concerns the user interface. Even if some open-source extensions (such as RStudio) soften somehow the dialog, R is based on the command prompt and scripts and also programming-prone. In other word, R is still far away from the elegance of commercial products.

(3)

The second limitation is inherent to all statistical packages and concerns the data source. Surveys and lab data can be entered directly in the statistical package, but in the real-world companies’ data to be analyzed reside on a large scope of platforms: SQL databases, web logs, sensors, mobile applications, Excel files, etc. As consequence, in most cases some extraction-transformation-load (ETL) mechanisms are needed for gathering data in R or other package.

Usually statistical packages usually load their data to be processed using one or more from the following solutions:

x Direct import from external data files (Excel, CSV-Comma Separated Values, text files etc.) using their menus (where available).

x Save intermediate results from the data sources (databases, Excel, etc.) into common format files and then import these intermediate files into the package; the most popular interchange formats are XML, CSV and JSON. x Create data sources using ODBC (Object Data Base Connectivity) or JDBC (Object Data Base Connectivity)

drivers and then connect directly the package to ODBC/JDBC data sources. No intermediate files are needed, data being imported directly into the package variables/tables.

In recent years, some new options are available for data imports, see also Tsoukalos et al.(2013), such as: x Using special ETL procedures which can be customized for both the data source and the destination package x Connecting to special APIs (Application Programming Interfaces) or web/data services which provide data sets

in formats easy to import. Google Analytics is such a service becoming more popular over the years.

x Import data from web servers log using user-defined or standard ETL procedures. This is an area where NoSQL systems have a strong presence.

x In addition to plain import through ODBC/JDBC connections, sometimes is possible to perform database query in a database server directly from the statistical package. For example, R users can query SQLLite databases directly and import the results from the tables into the R workspace.

3. SQL Features for Data Analysis

Basically SQL extracts record sets from huge databases based on a relational algebra. SELECT is the core SQL, endowed with powerful clauses for filtering records, columns/attributes, computation, grouping, etc. The immense popularity of SQL (Michael Stonebraker once called SQL the intergalactic dataspeak language) is due mainly to its high level syntax (no programming is necessary for most of the queries) and also to its implementation in all types DataBase Management Systems, from desktop (Access) to open-source (MySQL, PostgreSQL) and commercial (Oracle, IBM DB2, Microsoft SQL Servers) ones. The broad adoption was facilitated by the standardization of SQL by ISO with ANSI and various national agencies. First SQL standard was published in 1986 (ANSI) and 1989 (ISO), and then in 1992, 1999, 2003, 2008 and 2011.

As pointed out in previous section, the result of SQL queries (SELECT commands) can be saved/stored inside the database (mainly as table or view) but also is prone to be exported from the DBMS to various targets and formats, i.e. another database, Excel/CSV file, text file, HTML, ODBC/JDBC data source, etc.

But SELECT commands do not merely extract and filter data from the database. Its various clauses can do various processing tasks for all the result rows or for groups or rows (GROUP BY and HAVING clauses).

Starting with the first standard (1986/1989), all SQL dialects have implemented the basic statistical functions called (statistical) aggregate functions with self-descriptive names: SUM, COUNT, AVG, MIN, MAX.

Since 1999 one of the most important target of SQL standards has been data analysis, mainly through OLAP (On Line Analytical Processing) features (sometimes also called window functions). There are some OLAP differences among dialects. The richest DBMSs for data analysis are Oracle and DB2 whereas open source systems are less generous. But some basic OLAP operations, such as ranking, are common. For example, current version of PostgreSQL (9.3) implements, among other functions, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, LEAD, LAG, NTILE, NTH_VALUE, etc. For some advanced SQL OLAP features see next section.

Less known and used in SQL are the statistical functions for common statistical procedures. Again commercial database servers (Oracle, DB2, SQL Server) are endowed with the best statistical features. But also open-source servers provide useful functions such as STDEV_POP (standard deviation for a population), STDEV_SAMP (standard deviation for a sample) CORR (correlation), COVAR_POP (population covariance), COVAR_SAMP

(4)

(sample covariance), REGR_INTERCEPT (y intercept of the least-squares-fit linear equation determined by the (x,y) pairs), REGR_SLOPE (slope of the least-squares-fit linear equation), PostgreSQL (2013).

As a main representative of commercial database servers Oracle is endowed with a vast array of statistical features. Most of them are included in the Oracle SQL core dialect, but some other extensions are available, such as Oracle Data Mining and Oracle R Enterprise. According to Oracle database documentation, Oracle (2013), the main statistical options available are:

x Descriptive statistics x Hypothesis testing

x Correlations analysis (parametric and nonparametric) x Ranking functions

x Cross Tabulations with Chi-square statistics x Linear regression

x ANOVA

x Test Distribution fit

x Window Aggregate functions x Statistical Aggregates x LAG/LEAD functions x Reporting aggregate functions

The following simple query illustrates one-way ANOVA test performed in Oracle. SELECT emp_gender,

STATS_ONE_WAY_ANOVA(years_in_company, salary, 'F_RATIO') f_ratio, STATS_ONE_WAY_ANOVA(years_in_company, salary, 'SIG') p_value FROM employees emp INNER JOIN salaries sal ON emp.emp_id = sal.emp_id GROUP BY emp_gender

ORDER BY emp_gender;

The computed f_ratio and p_value show the significance of the difference in mean salaries within each value of number of years spent in the company and differences in mean salaries between years. A p-value less than 0.05 would indicate that, for both men and women, the difference in the amount of salaries across years in company is significant.

It is expected that future SQL standards and implementations will have more statistical features. From SQL:1999 standard, when OLAP functions were introduced, SQL has entered corporate data analysis, a market dominated by data warehouse systems (see next section). It is worth pointing out that for some database servers (i.e. PostgreSQL) there is an open-source library for scalable in-database analytics called MADlib, Hellerstein et al. (2012).

4. Data Analysis in the Corporrate World: Data Warehouse, Data Mining and All That Jazz

4.1. Data Mining, Knowledge Discovery and OLAP

Data Mining and Knowledge Discovery are broadly considered to form the academic and technological domain of turning raw data into valuable information useful for business intelligence decision process. In fact, Knowledge Discovery (KD) is considered as the methodological process and the Data Mining (DM) as a range of tools and techniques to achieve the goals of such informational processing (Peng et al.,2008). The KD process is heavily dependent on the database support and consequently is often referred as KDD - Knowledge Discovery in Database (Fayyad et al.,1996). It covers activities like planning and setting analysis goals, setting the data collections to be processed, data pre-processing through cleaning and preparation techniques, data transforming to simplify and to adapt data structures to analytical data models, data mining with techniques as searching interpretative patterns to represent analytical data and, finally, interpretation/evaluation plus visualization of the interpretative data patterns extracted. Therefore the DM domain covers the most significant data analysis techniques to extract the new predictive information using techniques such as classification (learning functions), regression, clustering or summarization (Peng et al.,2008).

The Online Analytical Processing tools heavily use data analytical techniques, inspired by statistical operations, to build their query trees. They are usually related with data integration and data refinement activities applied mostly

(5)

within the third stage of the KDD process. Although there is no clear delimitation between OLAP and DM specific tools, the latest ones use OLAP queries to set their own processing inputs. That is why the SQL-ROLAP is considered a very important component of the DM-KDD architecture, especially in the context of multidimensional databases built on relational extensions.

4.2. Data Warehouses and OLAP

Prior to feeding the DM processing activities, the analytical data need to be modeled (multidimensional), stored (data warehousing) and properly queried (using OLAP operators). Therefore, the data warehouses (DW) are very often related with the multidimensional databases regarded as subject-oriented, integrated, time-variant and nonvolatile collections of data intended to decision support systems (Reddy et al., 2010). The original sources of analytical data are most frequent OLTP systems, mainly relational databases. Also, structured or semi-structured external data sources could feed DW. The integration of these kinds of virtually heterogeneous data sources could be one of the most critical factors in the building of data warehouse. That is why an entire range of extract-transform-load tools (ETL) are part of DW architecture. These tools are responsible for extracting, cleaning, integration, transforming and loading data into DW and also they address the problem of data refreshing within DW context (Chaudhuri et al., 1997). Therefore ETL tools ensure the entering gate of data into the multidimensional database of data warehouses. By contrast, the OLAP tools are positioned somehow as the outer gate of DW systems: ensuring the data access for the external reporting tools or for the data mining tools. The queries performed to multidimensional databases of DW are processed by the OLAP engines.

There were developed four main OLAP data models and query engines:

x MOLAP (Multidimensional OLAP) tools support some analytical operators applied on multidimensional data structures grafted on concepts like dimensions and metrics (Thomsen , 2003);

x ROLAP (Relational OLAP) tools support the analytical functions using dedicated operators implemented as relational extensions on multidimensional databases designed with star-schema or snowflake-schema methodologies (Celko, 2006);

x HOLAP (Hierarchical OLAP) tools solve some performance problems of ROLAP systems by using specialized storage and indexing techniques (Celko, 2006);

x XML-OLAP tools (i.e. XCube) dedicated to XML Data Warehouses where dimensional facts are stored as XML documents (Hümmer et al.,2003; Park et al., 2005).

4.3. OLAP and SQL

The SQL analytical capabilities are closely related with OLAP tools, or, more specifically, with the ROLAP extensions. All the other tools try to formalize and to implement different flavors of multidimensional query languages just resembling with SQL syntax, such as MDX (Multidimensional Expression Language) of Microsoft, but these (sub)languages are not standardized as the conventional SQL.

From the SQL perspective, the main approach on querying multidimensional databases consists in operating directly on the specific tables produced by the relational-based multidimensional design (star or snowflake methodologies). Therefore, the SQL syntax was extended with an entire new range of operators, statistical functions and expressions such as ( Chaudhuri et al., 1997):

x statistical inspired aggregate functions such as new medians; x multiple “group by” operators such as ROLLUP and CUBE; x sophisticated comparisons on aggregates, etc.

OLAP analytical processing is a core feature of multidimensional databases that propose a new data organizational model meant to surpass the traditional relational model. In order to capitalize the SQL language in the multidimensional domain, an extended relational model for multidimensional purposes is required. Several such models have been proposed, such as Wang et al. (1996), Vassiliadis et al. (1998) and Lujan-Mora et al. (2006). These models deal mainly with the mathematical foundation of relational model, but Pedersen et al. (2003) proposed a fairly formalized analytical system that aim straight to the SQL language, named SQL-M, obviously M coming from multidimensional. While the SQL approach tries to add to the regular SELECT phrases some multidimensional

(6)

GROUP BY specialized sub-clauses like CUBE, ROLLUP, Wang et al. (2000) proposed object-relational SQL extensions defining and implementing statistical processing functions which apply to aggregates. These extensions are based on a sub-language named AXL (Aggregate Extension Language). Consequently, the multidimensional databases could be designed and implemented into Object-Relational DBMSs so that SQL extensions like DCUBE could be fully leveraged.

4.4. Some Actual Challenges for SQL ROLAP

Given the overwhelming diversity of heterogeneous web-based data sources (Big Data) it seems that OLAP engines of DBMSs are limited in handling the inherent complexity of analytical data processing. This pressure has stimulated significant efforts to enhance SQL systems so that they could be capable to analyze larger and larger relational tables, processing the following statistical models, and also keeping them into relational model context. One such solution is based on user defined functions developed and added to the core of object-relational systems (Ordonez et al., 2011).

A significant competitor for the relational systems seems to come from the new kind of data systems as in-memory column databases (Plattner, 2009). These systems are considered to favor the storage of OLAP specific multidimensional cubes more effectively than the row-based systems as traditional RDBMS. OLAP tools, hosted by the relational systems and exposed through SQL extensions, could be vulnerable when compete with these new kinds of Big Data tools because of the closed nature of the many traditional RDBMS. Therefore, the data integration issues remain critical, challenging the distributed relational systems, especially within a heterogeneous data source environment. In this context, the declarative and very flexible syntax of SQL language is struggling to exceed the boundaries of the closed SQL-relational systems, aiming to reach the magnitude of web-related data storage and processing areas (Konopnicki and Shmueli, 1998). Even NoSQL systems, like column-based databases or map-reduce systems, manifest a very strong disposition to adopt … SQL. Therefore, even though Big Data will conquer the data analytical spot of OLAP, maybe SQL, as the reference data query language, will survive through this kind of systems…

Another approach to keep SQL-OLAP still relevant in the (O)RDBMS context would be to “open” relational database systems for accommodating various data sources (structured and semi-structured) integrated by Big Data tools. We have discussed such architecture (Strimbei, 2012), and we have tried to argue the possibility of extending SQL-DBMS systems in order to access external data sources and to reveal them as web data services.

5.

Databases and Yet Another Hype: NoSQL

The database world was usually seen in the latest decades as a peaceful place. The core theories of the field emerged in 1970s, the 1980s and 1990s witnessed the consolidation of theories and various strands of research targeted database implementation issues – indexing, optimization, concurrency, transactions, etc.

All the revolutions failed to materialize one by one. Object-Oriented Data Model (Atkinson et al., 1989), Intelligent Databases (Zaniolo, 1992), XML Databases (Bonifati and Cuzzocrea, 2007) – to name a few – promised to change the complacent database community but did not reach critical mass on the application development market.

Perhaps the main problem of SQL (relational) databases in Big Data environments has been the inability to scale properly (Jacobs, 2009). We would rather modify this statement, because many SQL/relational DBMSs have had for many years excellent solutions for distributions and scaling but these usually require exquisite hardware and large amounts of money for the license. So “the inability of SQL databases to scale properly”, means actually “the cost of implementing fully scalable SQL databases is prohibitive”.

In 2010 and 2011, the NoSQL tsunami came almost from nowhere. Actually from nowhere is a misnomer. Some internet giants (i.e. Yahoo, Google, Amazon, Facebook, LinkedIn) found that, for a wide range of applications, SQL databases based on transactions (ACID – Atomicity, Consistency, Isolation, Durability) were too fastidious and too expensive to scale, and also some data gathered in their applications were not so critical (messages, comments, reviews, etc.) to require ACID. So new DBMSs have been proposed which are based on data models radically different than the relational model.

(7)

NoSQL is a direct consequence of Big Data trying to answer the question: which are the most appropriate systems for storing and querying/analyzing huge amounts of data gathered in various format? For Cattell (2010) there are six key technical features of NoSQL systems:

x Horizontal scalability of simple operations over many servers x Replication and distribution of data over many nodes x Simple Call Level Interfaces or protocols

x A weaker (than ACID) concurrency model

x Efficient use of distributed indexes and RAM for data storage x Dynamic insertion of new attributes into the data records.

But the explosion of NoSQL systems is due mainly to the economic consequences of the above features (Cogean et al., 2013):

x NoSQL datastores can handle immense volumes of data.

x Data can be gathered “on-the-fly” (from web servers’ logs, sensors, mobile devices) very close to the moments it is produced.

x Openness: almost all NoSQL datastores are open-source and prone for add-ins wrote in whatever language (Python, Perl, Java, etc.).

x Cost: most of NoSQL systems are free of charge. x NoSQL can be installed on modest or average equipment. x Easier and quicker development and deployment. x Large and enthusiastic users and developers communities.

x Data openness: comparing with proprietary database products, NoSQL datastores deal easier with import and export of data in a large variety of formats.

If traditional infrastructure for Big Data used to be parallel relational DBMS, the basic NoSQL ingredients for large-scale data analysis are Hadoop and Map-Reduce. Marketed as radical departures from the relational counterparts, Hadoop and Map-Reduce present features available many decades ago in high-end relational systems (Pavlo et al., 2009). Moreover, performance comparisons between Hadoop/Map-Reduce and parallel/SQL systems still show, in many regards, better results for the later (Pavlo et al., 2009).

NoSQL systems have also a number of drawbacks. Comparing to the relational model, there is a plethora of NoSQL data models: key-value stores, document, column, and graph databases (Cattell, 2010). They share some commonalities but generally they differ significantly. NoSQL technologies are still immature even they are growing fast. They lack the established research and practice communities of the relational technologies. Query mechanism is very heterogeneous and low level requiring programming skills. Sometimes map-reduce solutions is traumatic for beginners and completely opaque for the debugging (see the case of MongoDB). NoSQL systems suffer from to much noise. As with other buzzwords, many NoSQL adepts do not lack blindness and hysteria and that could damage NoSQL reputation on long term.

As a recent trend one can notice that more and more NoSQL datastores provide query languages (MongoDB Aggregation Framework, Cassandra C-SQL, Hive, Pig) with troubling similarities to…SQL. We use troubling because they are proudly called NoSQL! So even with the NoSQL advent, SQL is not (yet) dead.

6. Conclusions: SQL and Data Analysis Teaching and Practice

With all the buzzwords and fads, there is a tremendous interest in storing, exchanging and processing huge volumes of heterogeneous data. No matter the denomination - Analytics, Data Mining, Business Analysis, Business Intelligence, Data Science, etc. – more and more companies hire professionals with a mixture of skills and competencies which combines databases and statistics.

This is good news for Information Systems and Statistics students and graduates but also require adapting the curricula for these types of undergraduate and graduate programmes. Namely, Information Systems programmes need more courses related to data analysis and also Statistics programmes need courses targeted to provide students core competencies for dealing with databases and data warehouses, especially on extracting and processing data. Real world data and case studies are essentials for a proper understanding of business phenomena.

(8)

This paper argues that, for business graduates aspiring to careers in Data Analysis and Data Science, SQL could prove an invaluable core competency in order to extract and pre-process vast amounts of data. Moreover, despite the intricate relation of SQL with relational databases (we actually prefer to use the term SQL databases), more and more NoSQL datastores implement SQL-like query languages. This makes many Data Analysis/Data Science tasks accessible to non-programmers.

On a general level, the basic assumption of Big Data, statistics, Data Analysis is that we can learn from the past. This also true for other business domains (such as Accounting). But also we must be careful. Even if the only exact data we may know is all about the past, we should never forget that, in all domains of human activity, the future has not been just a linear (or logarithmic) consequence of the past and the present. In other words, hanging on the past is not the best way to cope with the future.

References

Buhl, H.U., Röglinger, M., Moser, F. (2013), Big Data: A Fashionable Topic with(out) Sustainable Relevance for Research and Practice?, Business & Information Systems Engineering, 2, 2013, pp.65-69

Stonebraker, M. (2012), What Does 'Big Data' Mean?, Communications of the ACM (BLOG@CACM), September 21, 2012, http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext

Jacobs, A. (2009), The Pathologies of Big Data, Communications of the ACM, 52(8), pp. 36-44 Borkar, V.R., Carey, M.J., Li, C. (2012), Big Data Platforms: What’s Next?, XRDS, 19 (1), pp.44-49

Borkar, V.R., Carey, M.J., Li, C. (2012), Inside "Big Data Management": Ogres, Onions, or Parfaits?, Proc. of the 15th International Conference on Extending Database Technology (EDBT '12), ACM, New York, USA, pp.3-14

Cron, A., Nguyen, H.L., Parameswaran, A. (2012), Big Data, XRDS, 19(1), pp.7-8

Cuzzocrea, A., Song, I.Y., Davis, K.C. (2011), Analytics over Large-Scale Multidimensional Data: The Big Data Revolution!, Proc. of the ACM 14th international workshop on Data Warehousing and OLAP (DOLAP '11), ACM, New York, USA, pp.101-104

Tsoukalos, M. (2013), Using the R Advanced Statistical Package, Linux Journal, Iss.232 (August 2013), pp.60-73 PostgreSQL 9.3.0 Documentation. Chapter 9: Functions and Operators, 9.20: Aggregate Functions,

http://www.postgresql.org/docs/current/static/functions-aggregate.html Oracle Statistical Functions, Available (on September 1st 2013) at

http://www.oracle.com/technetwork/middleware/bi-foundation/index-092760.html

Hellerstein, J.M., Ré, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Arun Kumar, A. (2012), The MADlib analytics library: or MAD skills, the SQL, Proc. of the VLDB Endowment, 5 (12), pp.700-1711.

Peng, Yi, Kou, Gang, Shi, Young, Zhengxin, Chen (2008), A descriptive framework for the field of Data Mining and Knowledge Discovery, International Journal of Information Technology & Decision Making, Vol. 7, No. 4, 2008

Fayyad, Usama, Piatetsky-Shapiro, Gregory, Smyth, Padhraic (1996), Knowledge Discovery and Data Mining: Towards a Unifying Framework, In proceeding of: Proceedings of the 2nd International Conference on Knowledge Discovery and Data, 1996

Reddy, G. Satyanarayana, Srinivasu, Rallabandi, Rao, M., (2010), Data Warehousing, Data Mining, OLAP and OLTP Technologies are essential elements to support decision-making process in industries, International Journal on Computer Science and Engineering, 2(9), 2010

Chaudhuri, Surajit, Dayal, Umeshwar (1997), An Overview of DataWarehousing and OLAP Technology, ACM SIGMOD Record, Volume 26(1), March 1997, pp.65-74

Thomsen, E. (2003), OLAP Solutions Building Multidimension Information Systems, Second Edition, Wiley, New York, USA Celko, J. (2006), Analytics and OLAP in SQL, Morgan Kaufmann

Hümmer, Wolfgang, Bauer, Andreas, Harde, Gunnar (2003), XCube – XML For Data Warehouses, Proceeding DOLAP '03 Proceedings of the 6th ACM international workshop on Data warehousing and OLAP pg. 33-40, ACM, New York, USA

Park, B.K., Han, H., Song, I.-Y. (2005), XML-OLAP: A Multidimensional Analysis Framework for XML Warehouses, DaWaK 2005, LNCS 3589, Springer-Verlag Berlin, pp. 32–42

Li, C., Wang, S. (1996). A Data Model for Supporting On-Line Analytical Processing, Proc. of the 5th International Conference on Information and Knowledge Management, pp. 81-88

Vassiliadis, P. (1998). Modeling Multidimensional Databases, Cubes and Cube Operations, in M. Rafanelli and M. Jarke (Eds.), Proc.of the 10th International Conference on Scientific and Statistical Database Management, pp.53-62

Lujan-Mora, S., Trujillo, J., Song, I.-Y. (2006), A UML a profile for multidimensional modeling in data warehouses. Data Knowl. Eng., 59(3), pp.725–769

Pedersen, D., Riis, K., Pedersen, T.B. (2002), A Powerful and SQL-Compatible Data Model and Query Language For OLAP, Proc. of the 13th Australasian Database Conference (ADC2002), Melbourne, Australia

Wang, H., Zaniolo, C. (2000), Using SQL to Build New Aggregates and Extenders for Object Relational Systems, Proc. of the 26th VLDB Conference, Cairo, Egypt

Ordonez, C., Gracia-Alvarado, C. (2011), A Data Mining System Based on SQL Queries and UDFs for Relational Databases, CIKM’11, Glasgow, Scotland, UK.

(9)

Plattner, H. (2009) A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database, SIGMOD’09, Providence, Rhode Island, USA

Konopnicki, D., Shmueli, O. (1998), Information Gathering in the World Wide Web: The W3QL Query Language and the W3QS System, ACM Transactions on Database Systems, 23(4), pp.369–410.

Strîmbei,, C. (2012), OLAP Services on Cloud Architecture, IBIMA Journal of Software & Systems Development, 1

Atkinson, M., Bancilhon, F., DeWitt, D., Dittrich, K., Maier, D., Zdonik, S. (1989), The Object-Oriented Database System Manifesto, Proc. of the First International Conference on Deductive and Object-Oriented Databases, Kyoto, Japan, pp.223-240

Zaniolo, C. (1992), Intelligent Databases: Old Challenges and New Opportunities, Journal of Intelligent Information Systems, 1, pp.271-292 Bonifati, A., Cuzzocrea, A. (2007), Synopsis Data Structures for XML Databases: Models, Issues, and Research Perspectives, Proc. of the 18th

International Conference on Database and Expert Systems Applications (DEXA '07). Washington, DC, USA, pp.20-24 Cattell, R. (2010), Scalable SQL and NoSQL Data Stores, ACM SIGMOD Record, 39(4), pp. 12-27

Cogean, D.I., Fotache, M., Greavu-Serban, V. (2013), NoSQL in Higher Education. A Case Study, Proc. of the IE 2013 International Conference, ASE Bucuresti

Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M. (2009), A Comparison of Approaches to Large-Scale Data Analysis, Proc. of ACM SIGMOD Conference (SIGMOD’09), Providence, Rhode Island, USA, pp.165-178

References

Related documents

The Trends in International Mathematics and Science Study (TIMSS) uses five dimen- sions related to school climate: class learning environment, discipline, safety, absence of

This proposal allows Region One Education Service Center Texas Energy Center (TEC) to provide competitive pricing and services to its members for energy supply, natural gas

The present investigation was carried out with the idea of developing an Online Pest Management Information System (PMISNET) on major agricultural crops containing

In conclusion, for the studied Taiwanese population of diabetic patients undergoing hemodialysis, increased mortality rates are associated with higher average FPG levels at 1 and

The main wall of the living room has been designated as a "Model Wall" of Delta Gamma girls -- ELLE smiles at us from a Hawaiian Tropic ad and a Miss June USC

These cavities spent the least amount of time above 35˚C and 40˚C (Fig 9A-F) and thus a model cannot be run because there are so few non- diapausing individuals spending

I We also consider a noisy variant with results concerning the asymptotic behaviour of the MLE. Ajay Jasra Estimation of

26,32,60 The largest study to date of medication prescribing by physicians uses observations from the National Ambulatory Medical Care Survey (NAMCS) for 1980 and 2000 to compare