Abstract—Large objects are normally treated as 'big'. Data is raw information and content. Technology is changing and emerging rapidly, and social media has become enormously popular, breaking down all geographical boundaries. Big data refers to the concepts and procedures for dealing with data sets so large that traditional data processing becomes difficult and conventional applications eventually prove inadequate. General data management tasks such as capture, storage, sharing, querying, analysis, and visualization become important challenges. Hence, data sets of great size and complexity strain existing tools. Business intelligence is a related branch, responsible for descriptive statistics over data of high information density, used to measure things, identify trends, and so on. Data science deals with the quantitative analysis of data using methods of statistical learning; it combines classical statistical methods with advances in computational systems and machine learning. This theoretical paper depicts current trends and issues in data science and big data. The paper also describes available and potential programs in the field.
Linear regression is an analytical technique used to model the relationship between several input variables and a continuous outcome variable. A key assumption is that the relationship between an input variable and the outcome variable is linear. Although this assumption may appear restrictive, it is often possible to properly transform the input or outcome variables to achieve a linear relationship between the modified input and outcome variables. Possible transformations will be covered in more detail later in the chapter. The physical sciences have well-known linear models, such as Ohm's Law, which states that the electrical current flowing through a resistive circuit is linearly proportional to the voltage applied to the circuit. Such a model is considered deterministic in the sense that if the input values are known, the value of the outcome variable is precisely determined. A linear regression model is a probabilistic one that accounts for the randomness that can affect any particular outcome. Based on known input values, a linear regression model provides the expected value of the outcome variable based on the values of the input variables, but some uncertainty may remain in predicting any particular outcome. Thus, linear regression models are useful in physical and social science applications where there may be considerable variation in a particular outcome based on a given set of input values. After presenting possible linear regression use cases, the foundations of linear regression modeling are provided.
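The probabilistic model described above can be illustrated with a minimal sketch: synthetic data are generated from a known linear relationship plus random noise, and ordinary least squares recovers the expected-value model. The coefficients and noise level here are arbitrary choices for illustration, not values from the text.

```python
import numpy as np

# Synthetic data: the outcome is a linear function of two input variables
# plus random noise, mirroring the probabilistic model described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # two input variables
beta_true = np.array([2.0, -1.0])        # illustrative "true" coefficients
y = 5.0 + X @ beta_true + rng.normal(scale=0.1, size=100)

# Fit by ordinary least squares: prepend an intercept column and solve.
A = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print(beta_hat)  # estimated [intercept, beta_1, beta_2]
```

The fitted coefficients approximate the generating values; the residual spread reflects the irreducible randomness the model acknowledges in any particular outcome.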
Big data science will revolutionize the way businesses generate value from data. It provides the ability to create, deploy, and interact with production-quality data science models right where the data is stored. In addition, by wrapping big data science in a standard SQL interface, EXASOL provides a smooth transition from traditional BI to big data science, both for analysts and for their SQL toolsets. In this paper we have discussed how big data science architectures result from the convergence of the following technologies: advanced in-memory processing, massively parallel processing, and in-database programming. This is the very reason why EXASOL is a strong solution for anyone who wants to build an agile and scalable big data science system.
ABSTRACT: Big data is more than just the storage of and access to data. Big data analytics plays an imperative role in making sense of the data and capitalizing on it. But it is a substantial challenge to discern and cultivate new types of machine learning algorithms. Scaling big data up to suitable dimensionality is an issue tackled in machine learning algorithms, and there are further challenges of dealing with velocity and volume across all categories of machine learning algorithms. This paper probes the big data concept, along with the pressing need for advanced data acquisition, management, and analysis mechanisms. It presents the concept of big data and spotlights its four phases: generating data, acquiring data, storing this voluminous data, and then analysing it. The next part of the paper zeroes in on dealing with big data using machine learning (ML) and spotlights four ML methods, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, and their impact on big data.
Chanchal Yadav, Shuliang Wang, Manoj Kumar, "Algorithm and Approaches to Handle Large Data - A Survey", IJCSN, Vol. 2, Issue 3, 2013, ISSN 2277-5420, presents a review of various algorithms from 1994-2013 relevant to handling big data sets. It gives an overview of architectures and algorithms used on large data sets. These algorithms define various structures and methods implemented to handle big data, and the paper lists various tools that were developed for analysing it. It also describes the various security issues, applications, and trends associated with large data sets.
S. Ibrahim, H. Jin, L. Lu, L. Qi, S. Wu and X. Shi, "Evaluating MapReduce on Virtual Machines: The Hadoop Case", Cloud Computing, Lecture Notes in Computer Science, vol. 5931, Springer, pp. 519-528, 2009.
This paper surveys disease prediction in big data healthcare using an extended CNN. The concept is applied in the medical field at hospital scale, and it provides (i) high accuracy, (ii) high performance, and (iii) high convergence speed. A particular region is selected and its chronic disease data analysed: useful features are extracted from the structured data, while the CNN technique is applied to the unstructured data so that features are selected automatically. The novel CNN is proposed for the medical data and combined with a disease risk model. A characteristic of this system is that it selects data from previous periods; this was done before, but it could not keep up with changes in disease status, because disease severity is not fixed and can change continuously. The system takes the selected records from the large volume of data and improves accuracy through risk classification. The aim of the proposed system is to predict risk for liver-related disease, so the hospital dataset concerns liver disease, and only structured records of liver disease information are collected. The proposed system uses disease risk modelling to obtain its accuracy, although the risk prediction depends on the different features of the medical data achieving higher accuracy.
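The surveyed paper does not specify its extended CNN architecture, but the core mechanism it relies on, a convolutional filter bank that extracts features automatically from raw (unstructured) signals, can be sketched in a few lines of NumPy. The kernels here are random and untrained placeholders; a real system would learn them and feed the features to a risk classifier.

```python
import numpy as np

def conv1d_features(signal, kernels):
    """Toy 1-D convolution -> ReLU -> global max-pool feature extractor,
    a minimal stand-in for CNN-based automatic feature selection."""
    feats = []
    for k in kernels:
        conv = np.convolve(signal, k, mode="valid")  # slide one filter over the signal
        feats.append(np.maximum(conv, 0.0).max())    # ReLU, then global max-pool
    return np.array(feats)

rng = np.random.default_rng(1)
signal = rng.normal(size=128)                        # e.g. one raw patient record
kernels = [rng.normal(size=5) for _ in range(4)]     # 4 illustrative (untrained) filters
features = conv1d_features(signal, kernels)
print(features.shape)  # (4,) - one pooled feature per filter
```

Each filter yields one pooled activation, so the unstructured input is reduced to a fixed-length feature vector that a downstream risk-classification model can consume alongside the structured features.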
New technologies enable us to collect more data than ever before. With an overwhelming amount of web-based, mobile, and sensor-generated data arriving at terabyte and even zettabyte scale, new science and insights can be discovered from the highly detailed and domain-specific information, which can contain useful signals for problems such as national intelligence, cyber security, fraud detection, financial trading, personalized medicine and treatments, personalized information and recommendations, and personalized athletic training. Machine learning algorithms, particularly deep learning (which evolved from artificial neural networks), play a vital role in big data analysis. Deep learning algorithms extract high-level and complex abstractions by discovering intricate structure in large data sets. Deep learning techniques are nowadays the leading approaches to solving complex machine learning and pattern recognition problems such as speech and image understanding, semantic indexing, data tagging, and fast information retrieval. This paper focuses on all aspects of big data analytics, with a particular emphasis on the analysis and learning of massive volumes of unstructured data and on developing effective and efficient large-scale learning algorithms.
One of the most prevalent unintended effects of the introduction of LA was the fact that LA data led to changes in work and working practices, as already signalled by critical researchers of education, whose arguments I have summarised in the background literature. Indeed, also in the case studied, for example, teaching staff found that "it [LA] detracts from the job of educating" (I_Teaching_011) and introduces a host of different data-related activities which ultimately take away time they would otherwise spend teaching or interacting with students. Importantly, a number of interviewees have experienced what they called "the move towards e-learning" (I_Teaching_011), that is, an impression or encouragement they received that e-learning elements should be introduced even in face-to-face teaching, with some residential modules introducing two or three weeks of online classes with an explicit connection to "the move towards using the data that you get from e-learning" (I_Teaching_011). While it could be argued that the move towards e-learning can have other causes, such as savings, resourcing, and the immense profitability of distance learning programmes, the conviction with which some interviewees expressed their view that they were being almost forced to introduce distance learning components in their face-to-face modules seems to confirm the attribution of these changes to the LA system: "Maybe the data can strengthen them more to having more like more online programmes. Or also to have the campus based programmes to move closer to the distance learning approaches, I guess" (I_Academic_007). It has been pointed out that "The university seems to have become a lot more open to online learning as a way of engaging students, not as a way of just disseminating information. And I feel that part of that is to do with the ability to monitor the analytics and understand the students better" (I_Teaching_002 Follow-up).
One interviewee in particular, puzzled as to why she was asked to introduce a few weeks of distance learning into her residential course, arrived at the conclusion that it was due to the trackability and traceability of online actions as opposed to classroom activity.
Abstract- The project carried out is in the field of big data analytics, related to computer science. Data analytics is the process of examining data sets in order to draw conclusions about the information they contain. Big data analytics refers to the techniques that can be used for converting raw data into meaningful information, which helps in business analysis and forms a decision support system for the executives in an organization. Big data is a large and complex collection of data that cannot be processed using traditional tools. In the proposed work, a web application is designed to help the government and the public gain knowledge about government policies and the number of people using them. Citizens will learn which existing policies they are eligible for, and the government will know the count of people who are using each policy. To implement this project, we use Hadoop. The general public can benefit from the various governmental policies and can proceed to recommended policies.
Abstract: Big data is a noteworthy environment for maintaining the diversity of huge amounts of data. Big data utilizes machine learning algorithms to process large datasets which come from various places such as histories, weblogs, data repositories, and data warehouses. Most existing data mining approaches are not able to handle such large datasets: they lack compatibility with big data database systems and analysis tools, and clustering and analysing large datasets is a major issue in big data. For this reason, this research work uses machine learning algorithms, implemented in the Hadoop tool, to collect and process large amounts of structured, semi-structured, or unstructured data in a reasonable amount of time. This also yields a more accurate prediction system and more accurate information, and the computational cost and complexity are minimized. The overall research work is implemented in the Hadoop tool with the help of the Python programming language and is compared with some existing algorithms. The proposed work is tested with suitable parameters such as accuracy, Kappa T, and Kappa M.
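The evaluation metrics the abstract names can be made concrete. Accuracy is straightforward, and Kappa T and Kappa M are data-stream variants of Cohen's kappa (they swap in different chance baselines); the classical Cohen's kappa underlying them is sketched below in plain Python with an illustrative toy label set.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement.
    Kappa T / Kappa M in the text are streaming variants of this idea."""
    n = len(y_true)
    p0 = accuracy(y_true, y_pred)                 # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # chance agreement from the marginal label distributions
    pe = sum(true_counts[c] * pred_counts.get(c, 0) for c in true_counts) / n**2
    return (p0 - pe) / (1 - pe)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy predictions
print(accuracy(y_true, y_pred))      # 0.75
print(cohen_kappa(y_true, y_pred))   # 0.5
```

Here the classifier is right 75% of the time, but balanced labels mean 50% agreement is expected by chance, so kappa credits it with only 0.5 - which is why kappa-style measures are preferred over raw accuracy on skewed big data streams.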
Zhou et al. describe how a Deep Learning algorithm can be used for incremental feature learning on very large datasets, employing denoising autoencoders. Denoising autoencoders are a variant of autoencoders which extract features from corrupted input, where the extracted features are robust to noisy data and good for classification purposes. Deep Learning algorithms in general use hidden layers to contribute towards the extraction of features or data representations. In a denoising autoencoder, there is one hidden layer which extracts features, with the number of nodes in this hidden layer initially being the same as the number of features that would be extracted. Incrementally, the samples that do not conform to the given objective function (for example, their classification error is more than a threshold, or their reconstruction error is high) are collected and are used for adding new nodes to the hidden layer, with these new nodes being initialized based on those samples. Subsequently, incoming new data samples are used to jointly retrain all the features. This incremental feature learning and mapping can improve the discriminative or generative objective function; however, monotonically adding features can lead to having a lot of redundant features and overfitting of data. Consequently, similar features are merged to produce a more compact set of features. Zhou et al. demonstrate that the incremental feature learning method quickly converges to the optimal number of features in a large-scale online setting. This kind of incremental feature extraction is useful in applications where the distribution of data changes with respect to time in massive online data streams. Incremental feature learning and extraction can be generalized for other Deep Learning algorithms, such as RBM, and makes it possible to adapt to a new incoming stream of online large-scale data.
Moreover, it avoids expensive cross-validation analysis in selecting the number of features in large-scale datasets.
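The building block of the method above, a one-hidden-layer denoising autoencoder trained to reconstruct the clean input from a corrupted copy, can be sketched in NumPy. This is a minimal illustration with arbitrary toy data, a tied-weight linear decoder, and plain gradient descent; it omits the incremental node-adding and feature-merging steps Zhou et al. build on top.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # toy data: 200 samples, 8 features
n_hidden = 4                          # initial number of feature-extracting nodes

# Tied-weight denoising autoencoder parameters
W = rng.normal(scale=0.1, size=(8, n_hidden))
b_h = np.zeros(n_hidden)
b_o = np.zeros(8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Reconstruction error of the untrained model, for comparison
H0 = sigmoid(X @ W + b_h)
mse0 = float(((H0 @ W.T + b_o - X) ** 2).mean())

lr = 0.05
for _ in range(200):
    X_noisy = X + rng.normal(scale=0.3, size=X.shape)  # corruption step
    H = sigmoid(X_noisy @ W + b_h)                     # extracted features
    X_rec = H @ W.T + b_o                              # linear decoder (tied weights)
    err = X_rec - X                                    # reconstruct the *clean* input
    # Backpropagation through both uses of the tied weight matrix W
    dH = err @ W * H * (1 - H)
    gW = X_noisy.T @ dH + err.T @ H
    W -= lr * gW / len(X)
    b_h -= lr * dH.mean(axis=0)
    b_o -= lr * err.mean(axis=0)

mse = float((err ** 2).mean())
print(mse0, mse)   # denoising reconstruction error drops with training
```

In the incremental scheme, samples whose `err` stays above a threshold would trigger new columns in `W` (new hidden nodes) initialized from those samples, followed by joint retraining.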
1) Size: The name big data indicates that the data is large in size, and its growth rate is high compared with only a few years ago, when data grew far more slowly than it does today. The size of data is breaking all boundaries and pushing data storage to its peak, because all of this data is kept for future use. Data sizes reach petabytes or even zettabytes, which makes managing all of this data for further machine processing a formidable task.
Online education has developed greatly in recent years and has an increasing impact on the education sector. Digital learning generates a collection of data and analytics that can contribute to teaching and learning. Many students participate in online or mobile learning, where new data are created. These new data, together with data from social networks, help students from different backgrounds connect with one another and understand core course concepts. Besides making education more personal and effective, the new types of data also improve researchers' ability to learn about learning. In this way big data can provide more opportunities for new learning experiences for children and young adults. Students can share information with educational institutions and thereby expand their knowledge and skills. Furthermore, educational institutes and universities are better able to help and prepare their future students.
Organizations are increasingly generating large volumes of data as a result of instrumented business processes, monitoring of user activity, web site tracking, sensors, finance, and accounting, among other sources. With the advent of social network web sites, users create records of their lives by daily posting details of activities they perform, events they attend, places they visit, pictures they take, and things they enjoy and want. This data deluge is often referred to as big data; a term that conveys the challenges it poses on existing infrastructure with respect to storage, management, interoperability, governance, and analysis of the data. In today's competitive market, being able to explore data to understand customer behavior, segment the customer base, offer customized services, and gain insights from data provided by multiple sources is key to competitive advantage. Although decision makers would like to base their decisions and actions on insights gained from this data, making sense of data, extracting non-obvious patterns, and using these patterns to predict future behavior are not new topics. Knowledge Discovery in Data (KDD) aims to extract non-obvious information using careful and detailed analysis and interpretation. Data mining, more specifically, aims to discover previously unknown interrelations among apparently unrelated attributes of data sets by applying methods from several areas including machine learning, database systems, and statistics. Analytics comprises techniques of KDD, data mining, text mining, statistical and quantitative analysis, explanatory and predictive models, and advanced and interactive visualization to drive decisions and actions. Fig. 1 depicts the common phases of a traditional analytics workflow for big data. Data from various sources, including databases, streams, marts, and data warehouses, are used to build models.
The large volume and different types of the data can demand pre-processing tasks for integrating the data, cleaning it, and filtering it. The prepared data is used to train a model and to estimate its parameters. Once the model is estimated, it should be validated before its consumption; normally this phase requires the original input data and specific methods to validate the created model. Finally, the model is consumed and applied to data as it arrives. This phase, called model scoring, is used to generate predictions, prescriptions, and recommendations. The results are interpreted and evaluated, used to generate new models or calibrate existing ones, or integrated with the pre-processed data.
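The workflow phases just described, pre-processing, training, validation, and scoring, can be traced end to end in a small sketch. The data, the linear scorer, and the validation threshold are all illustrative choices, not part of the referenced workflow itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Pre-processing: integrate, clean, and filter the raw data -----------
raw = rng.normal(size=(300, 3))
raw[::50, 0] = np.nan                                 # simulate dirty records
data = raw[~np.isnan(raw).any(axis=1)]                # drop incomplete rows
y = (data @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # known toy labels

# --- Training: estimate model parameters on one split --------------------
split = int(0.8 * len(data))
X_train, y_train = data[:split], y[:split]
X_val, y_val = data[split:], y[split:]
w, *_ = np.linalg.lstsq(X_train, 2 * y_train - 1, rcond=None)  # linear scorer

# --- Validation: check the model before consuming it ---------------------
val_acc = ((X_val @ w > 0) == (y_val == 1)).mean()
assert val_acc > 0.85, "model failed validation; recalibrate before scoring"

# --- Scoring: apply the validated model to newly arriving data -----------
new_batch = rng.normal(size=(5, 3))
predictions = (new_batch @ w > 0).astype(int)
print(val_acc, predictions)
```

The held-out validation step is what the text calls "validated before its consumption"; only after it passes is the model applied to arriving data, and the resulting predictions would feed back into evaluation and recalibration.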
Apache Sqoop is a CLI tool designed to transfer data between Hadoop and relational databases. Sqoop can import data from an RDBMS such as MySQL or Oracle Database into HDFS and then export the data back again after it has been transformed using MapReduce. Sqoop also has the ability to import data into HBase or Hive. Sqoop connects to an RDBMS through its JDBC connector and relies on the RDBMS to describe the database schema for the data to be imported. Both import and export take advantage of MapReduce, which provides parallelism as well as fault tolerance. During import, Sqoop reads the table, row by row, into HDFS; because the import is executed in parallel, the output in HDFS consists of more than one file.
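The import/export round trip described above looks roughly as follows on the command line. The host, database, table names, and HDFS paths are placeholders, not values from the text.

```shell
# Hypothetical import: pull a MySQL table into HDFS via the JDBC connector.
# Runs as a parallel MapReduce job, so the output is multiple files.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

# Hypothetical export: push transformed results back to the RDBMS.
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders_summary \
  --export-dir /data/orders_transformed
```

The `--num-mappers` flag controls the degree of parallelism, which is why the import on the text's account produces more than one output file in HDFS.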
Dell offers its own bigdata package. Their solution includes an automated facility to load and continuously replicate changes from an Oracle database to a Hadoop cluster to support bigdataanalytics projects. Techniques such as natural language processing, machinelearning and sentiment analysis are made accessible through straightforward search and powerful visualization to enable users to learn relationships between different data streams and leverage these for their businesses.
Security and Public Safety: Since the tragic events of September 11, 2001, security research has gained much attention, especially given the increasing dependency of business and our global society on digital enablement. Researchers in computational science, information systems, social sciences, engineering, medicine, and many other fields have been called upon to help enhance our ability to fight violence, terrorism, cyber crime, and other cyber security concerns. Critical mission areas have been identified where information technology can contribute, as suggested in the U.S. Office of Homeland Security's report "National Strategy for Homeland Security," released in 2002, including intelligence and warning, border and transportation security, domestic counter-terrorism, protecting critical infrastructure (including cyberspace), defending against catastrophic terrorism, and emergency preparedness and response. Intelligence, security, and public safety agencies are gathering large amounts of data from multiple sources, from criminal records of terrorism incidents and cyber security threats to multilingual open-source intelligence. Companies of all sizes face the daunting task of defending against cyber security threats and protecting their intellectual assets and infrastructure. Processing and analyzing security-related data, however, is increasingly difficult. A significant challenge in security IT research is the information stovepipe and overload resulting from diverse data sources, multiple data formats, and large data volumes. Current research on technologies for cyber security, counter-terrorism, and crime-fighting applications lacks a consistent framework for addressing these data challenges.
Selected BI&A technologies such as criminal association rule mining and clustering, criminal network analysis, spatial-temporal analysis and visualization, multilingual text analytics, sentiment and affect analysis, and cyber attacks analysis and attribution should be considered for security informatics research.
6) Deep Learning: Deep learning (DL) refers to a family of approaches that have taken machine learning to a new level, helping computers make sense of vast amounts of data. Deep learning algorithms are used to train deep networks with large amounts of data. DL has become a major technology trend for big data and artificial intelligence.
7) Visual Analytics: Data visualization plays a major role in understanding and exploring data, because there is much to gain when data is presented visually. Visual analytics is effective when working in a geospatial domain and in multi-dimensional analysis.
Firstly, a platform for streaming data acquisition and ingestion is required, which has the bandwidth to handle multiple waveforms at different fidelities. Integrating these dynamic waveform data with static data from the EHR is a crucial component to provide situational and contextual awareness for the analytics engine. Enriching the data consumed by analytics not only makes the system more robust but also helps balance the sensitivity and specificity of the predictive analytics. The specifics of the signal processing will largely depend on the type of disease cohort under investigation. A variety of signal processing mechanisms can be utilized to extract a multitude of target features which are then consumed by a trained machine learning model to produce actionable insight.
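The "target features" step above can be sketched for one waveform window. The features chosen here (basic time-domain statistics plus the dominant frequency from an FFT) are generic illustrations, not the cohort-specific features the text says a real system would use; the sampling rate and synthetic pulse waveform are likewise assumptions.

```python
import numpy as np

def waveform_features(sig, fs):
    """Extract a few generic target features from one waveform window:
    time-domain statistics plus the dominant frequency (illustrative only)."""
    spectrum = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    dominant = freqs[int(np.argmax(spectrum[1:])) + 1]   # skip the DC bin
    return np.array([sig.mean(), sig.std(), np.ptp(sig), dominant])

fs = 250                                   # e.g. a 250 Hz bedside-monitor waveform
t = np.arange(0, 4, 1.0 / fs)
sig = np.sin(2 * np.pi * 1.2 * t)          # synthetic ~72 bpm pulse waveform
feats = waveform_features(sig, fs)
print(feats)                               # dominant frequency lands near 1.2 Hz
```

In the pipeline described above, such a feature vector, enriched with static EHR context, is what the trained machine learning model would consume to produce actionable insight.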