Faculty Development Program 47: Cyber Security


(Held from 5th August to 10th August 2019)

Venue: Dr. D. Y. Patil Institute of Engineering, Management & Research, Akurdi, Pune

Day-wise Description:

Day 1: The day began with an introduction to Data Science and basic Python by Mr. Proyas Bose. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured, semi-structured and unstructured data. It employs techniques and theories drawn from many fields such as mathematics, statistics, information science and computer science.

Data science is a blend of different disciplines. It is a concept that unifies statistics, data analysis and machine learning. Turing Award winner Jim Gray envisioned data science as a fourth paradigm of science, the first three being empirical, theoretical and computational, and the fourth being data-driven.

Figures 1 & 2: Session by Mr. Proyas Bose

Business analytics, business intelligence, predictive modelling and statistics all fall under the umbrella of data science. Machine learning applications learn from data. The steps in applying data science were discussed as: Data Collection, Data Cleaning, Exploratory Data Analysis (EDA), Data Modelling and Data Visualization. The model may need to be updated iteratively as conditions change, and the role of the domain expert in model building was emphasized.

In a more realistic model, the following points may be considered:

● Independent variables with low probability having less significance can be discarded.

● Different weights may be required to be assigned to different variables based on their degree of importance.

For modelling continuous data, a regression model is used, while for modelling discrete data, classification and clustering are employed. The difference between supervised and unsupervised learning was then explained. Supervised learning is a learning model that defines an algorithm to map an input variable to an output variable. If x and y represent the input and output variables respectively, the mapping function f is such that y = f(x). The mapping function predicts the output for a new input value. Supervised learning works on labelled data.

In contrast, unsupervised learning has no labelled data, and the grouping is based on the characteristics of the data. Clustering and association are techniques of unsupervised learning. Some applications of supervised learning were described, viz. issuing credit cards and sanctioning loans.
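The mapping y = f(x) described above can be illustrated with a minimal sketch. It assumes a straight-line model fitted by least squares; the data values and the helper name fit_line are invented for illustration and are not taken from the session.

```python
# Illustrative sketch only: learn y = f(x) as a straight line from labelled pairs.
# The data values below are made up for the example.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Labelled training data: every input x comes with a known output y.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

slope, intercept = fit_line(train_x, train_y)

def f(x):
    """Learned mapping function used to predict y for a new x."""
    return slope * x + intercept

print(f(5.0))
```

Because the labels are supplied with the inputs, the fitted function can be checked against known answers, which is what distinguishes this setting from unsupervised learning.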

A class assignment was given on identifying attributes for finding MCA aspirants among computer students of various undergraduate courses. The following attributes were finalized collectively by the participants:

● Name of the Student
● Contact No
● No. of Siblings
● Family Income
● Interest in Maths
● Interest in Computer Programming
● Interest in Management
● Career Objectives
● Participation in Intercollege competition
● Academic Performance in Computer
● Academic Performance in Management
● Reason for joining the course
● Influenced by whom to join the course
● Problem solving and logical skills
● Coding interest
● Father’s occupation
● Criterion for selecting institute
● Ergonomic Problems
● Close relatives in IT job
● Part time job
● If yes, job type
● Lab facilities in your institution
● ICT in teaching and learning
● Academic curriculum
● Plans after graduation
● If desires to pursue PG, what type of course
● Whether maths was one of the subjects at 12th or graduation
● Friends influence

Table 8: Attributes for MCA aspirants for classification


Robotic Process Automation (RPA) was presented as the new era in machine learning. It deals with observing a person and mimicking that person's actions, so that the majority of daily tasks can be automated by bots. RPA was described as able to automate processes that humans perform so that they run without human intervention, and it has resulted in a reduction in the number of data entry operators. Another major advantage of RPA is that no integration of the software with a backend such as SQL Server is required, so the data is not put at risk.

Mr. Pritesh also gave an introduction to Anaconda and conducted a practical session on its installation. He demonstrated working with Anaconda, including setting up, activating, deleting, importing and exporting environments. This was followed by an introduction to Python data types, lists, functions and loops.

Day 2: The second day of the FDP was delivered by Mr. Sahil Gaikwad, with a focus on topics such as Python basics (hands-on), introduction to Data Science, Data Science libraries, data analysis and data manipulation using “pandas” (hands-on), scientific computing using “numpy” (hands-on) and data visualization using “matplotlib”.

Figures 3 & 4: Session by Mr. Sahil Gaikwad

The first session of the second day dealt with Python basics. The following practical exercises were carried out:

● Create a list and perform indexing.
● From the above list, generate another list that contains the elements within a given index range.
● Given a string, check whether it is the username and print 'valid' or 'enter correct user name'.
● Create a dictionary to keep a record of the ages of three persons.
● Create a list of given animal names and print all the names while iterating through the list.
● For the above list, iterate until 'deer' is found and print its index.
● Write a function that takes two numbers and returns their product.
● Change the above function so that it takes three numbers and returns the product of all three.
● Given a string, check whether it ends with 'ing'; if not, add 'ing' to it.
● Create a list of animal names.
● Add elements to a list.
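A few of these exercises might be solved along the following lines. This is only a sketch: the sample values, the names in the dictionary and the function names multiply and add_ing are assumptions, since the report does not record the actual solutions used in the session.

```python
# Hypothetical sample data; the report does not give the actual values used.
animals = ["cat", "dog", "deer", "lion"]

# Indexing and slicing a list.
first = animals[0]
middle = animals[1:3]  # elements at index 1 and 2

# Dictionary holding the ages of three persons (names are invented).
ages = {"Asha": 30, "Ravi": 25, "Meera": 41}

# Iterate until "deer" is found and remember its index.
deer_index = None
for i, name in enumerate(animals):
    if name == "deer":
        deer_index = i
        break

def multiply(a, b, c=1):
    """Multiply two numbers, or three when c is supplied."""
    return a * b * c

def add_ing(word):
    """Append 'ing' unless the word already ends with it."""
    return word if word.endswith("ing") else word + "ing"

# Adding elements to a list.
animals.append("tiger")

print(first, middle, deer_index, multiply(2, 3), multiply(2, 3, 4), add_ing("play"))
```

Giving the third parameter a default value of 1 lets one function cover both the two-number and three-number versions of the exercise.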


The second session of the day dealt with data analysis and data pre-processing using pandas. Loading a file for pre-processing was shown hands-on: reading data from a .csv file, data manipulation, converting categorical values to numeric ones, one-hot encoding and min-max scaling.
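To show what one-hot encoding and min-max scaling actually compute, here is a library-free sketch. The session itself used pandas; this plain-Python version is a substitute chosen so the arithmetic is visible, and the column names, values and function names are hypothetical.

```python
# Library-free illustration; column names and values are hypothetical.
rows = [
    {"city": "Pune", "income": 20000},
    {"city": "Mumbai", "income": 50000},
    {"city": "Pune", "income": 35000},
]

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

def min_max_scale(rows, column):
    """Rescale a numeric column to the range [0, 1]."""
    values = [row[column] for row in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [
        {**row, column: (row[column] - lo) / span if span else 0.0}
        for row in rows
    ]

processed = min_max_scale(one_hot(rows, "city"), "income")
for row in processed:
    print(row)
```

After both steps every column is numeric and on a comparable scale, which is the form most modelling routines expect.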

The third session was based on scientific computing using the numpy library, with topics such as general usage and broadcasting, ones, eye, reshape, matrix operations and slicing, among others. The session also covered data visualization using Matplotlib, with plots such as the scatter plot, bar chart and pie chart.
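The numpy topics listed above can be seen together in a short sketch. The array shapes and values are chosen only for illustration and are not the ones used in the session.

```python
import numpy as np

ones = np.ones((2, 3))              # 2x3 array filled with 1.0
identity = np.eye(3)                # 3x3 identity matrix
grid = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Broadcasting: the 1-D row is stretched across both rows of `grid`.
row = np.array([10, 20, 30])
shifted = grid + row

# Matrix operation: (2x3) @ (3x3) gives 2x3; the identity leaves values unchanged.
product = grid @ identity

# Slicing: all rows, last two columns.
last_two = grid[:, 1:]

print(shifted)
print(last_two)
```

Broadcasting avoids writing an explicit loop over rows, which is the main reason numpy code tends to be shorter and faster than the equivalent list-based code.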

The fourth session concentrated on constructing machine learning models using sklearn. The models were built using steps such as dropping unnecessary columns, one-hot encoding the required columns, rescaling to normalize, the train-test split, and modelling with Decision Tree, SVM and Random Forest. An overview of deep learning with Keras was also provided.

Day 3: The third day of the FDP was delivered by Mr. Swapnil Javahire and focussed on Big Data and Hadoop.

Figures 5 & 6: Session by Mr. Swapnil Javahire

The session started with VMware and an introduction to Ubuntu. VMware is a virtualization product that makes it possible to partition a single physical server into multiple virtual machines. VMware works with Windows, Solaris, Linux and NetWare, any or all of which can be used concurrently on the same hardware. Ubuntu is an open source operating system that runs on desktops, in the cloud and on internet-connected devices. The steps to install VMware were demonstrated, followed by a hands-on installation of Hadoop.

Session two dealt with the Big Data Hadoop ecosystem. Components such as HDFS and its components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper and Apache Oozie were covered to give participants a deeper understanding of the Hadoop ecosystem. This was followed by a technical discussion on Hadoop configuration settings, including setting the environment variables in user profiles.

An introduction to MapReduce was then given. MapReduce is a framework for writing applications that process huge amounts of data in parallel, on large clusters of commodity hardware, in a reliable manner. It is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job. The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. The working of MapReduce was demonstrated.
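The Map and Reduce tasks described above can be imitated on a single machine with a word-count sketch. This is plain Python rather than Hadoop's Java API; the function names map_phase, shuffle and reduce_phase and the input lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one input line into (key, value) tuples."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a smaller result."""
    return key, sum(values)

lines = ["big data big clusters", "data in parallel"]  # made-up input

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster each call to map_phase or reduce_phase could run on a different node, since no call depends on the result of another call in the same phase.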

The Hadoop Distributed File System (HDFS) was developed using a distributed file system design. HDFS holds very large amounts of data and provides easier access to it. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes the data available to applications for parallel processing.

The third session was dedicated to Pig, a high-level scripting platform that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

The fourth session covered Hive and Sqoop. Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analysis easy. Because of its SQL-like query language, Hive is often used as the interface to an Apache Hadoop based data warehouse, and it is considered friendlier and more familiar to users who are used to querying data with SQL. Hive installation was demonstrated. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and external data stores such as relational databases and enterprise data warehouses. Sqoop is used to import data from external data stores into the Hadoop Distributed File System or related Hadoop ecosystem components such as Hive and HBase. Similarly, Sqoop can be used to extract data from Hadoop or its ecosystem components and export it to external data stores. The session ended with Sqoop installation.


Day 4: The fourth day of the FDP was delivered by Mr. Swapnil Javahire. The topics discussed were Introduction to Machine Learning, Logistic Regression with Python and SVM with Python.

Figure 7: Mr. Swapnil Jawahire delivering the talk

Making a machine learn by using supervised learning was the topic of the first session. In supervised learning a target variable is defined, and it is mostly used for classification. Machine learning techniques such as Logistic Regression, Support Vector Machine, Decision Tree and Random Forest were discussed.

Session two was a technical discussion on logistic regression and probabilities, covering the theory of logistic regression, its applications, underfitting and overfitting, the trade-off between them, and training and testing. A few examples were demonstrated.
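As an illustration of how logistic regression turns an input into a probability, the following sketch fits a one-feature model by gradient descent. It is a minimal example under assumptions: the data, learning rate and number of epochs are invented, and the session's own examples are not recorded in this report.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data: hours studied -> passed (1) or failed (0).
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
passed = [0, 0, 0, 1, 1, 1]

w, b = train(hours, passed)
p_low = sigmoid(w * 1.5 + b)   # estimated probability of passing after 1.5 hours
p_high = sigmoid(w * 5.5 + b)  # estimated probability of passing after 5.5 hours
print(round(p_low, 3), round(p_high, 3))
```

A prediction is usually turned into a class by comparing the probability with a threshold such as 0.5.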

Support Vector Machine (SVM) was the topic of the third session of the day. It covered an introduction to SVM, types of SVM (Support Vector Classification and Support Vector Regression), the Radial Basis Function (RBF), the confusion matrix, the grid search function and applications of SVM. Examples were demonstrated using Colab.
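Two of the ideas named above, the RBF kernel and the confusion matrix, are small enough to compute by hand. The sketch below is a plain-Python illustration with invented labels and an assumed gamma value; it does not reproduce the Colab notebooks from the session.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Radial Basis Function similarity: 1.0 for identical points, near 0 for distant ones."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels 0 and 1."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts

# Made-up labels for eight test samples.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
print(cm, accuracy, precision, recall)
```

Reading precision and recall alongside accuracy matters when the two classes are not equally common.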

Day 5: The first session commenced with a discussion on the machine learning subtopic “Decision Tree”. All the participants worked hands-on with the various methods used in data science for taking business decisions within a short period of time by analysing terabytes of information generated from sources such as WhatsApp, Twitter, Facebook and IoT, which can also enhance data security. In the afternoon session, the discussion was on “Random Forest”. The resource person took up a bank loan case study: the dataset, freely available from “LendingClub.com”, was downloaded from the website, and the various commands associated with Decision Tree and Random Forest were executed. The young data scientist impressed the participants with his practical approach. The emphasis of the fifth day was on a practical approach to acquiring and assimilating new scientific knowledge and concepts, rather than treating participants as passive recipients of lecture-driven classes.
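The core step of a decision tree, choosing the split that makes the resulting groups as pure as possible, can be sketched as follows. The example assumes Gini impurity as the split criterion and uses invented credit-score data, not the LendingClub dataset; the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try each midpoint between sorted values; return (threshold, weighted impurity)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no midpoint between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up loan data: credit score -> repaid (1) or defaulted (0).
scores = [580, 600, 640, 700, 720, 760]
repaid = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(scores, repaid)
print(threshold, impurity)
```

A full tree repeats this search on each resulting group, and a random forest trains many such trees on random subsets of the rows and columns and combines their votes.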


Figure 8: Session by Mr. Swapnil Jawahire, delivering the talk

Day 6: The session started with the theory of unsupervised learning. After explaining the difference between classification and clustering, the resource person continued with a hands-on session on clustering. Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labelled responses. The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. “Clustering” is the process of grouping similar entities together. The goal of this technique is to find similarities among the data points and group similar data points together.

The Elbow method is a method of interpretation and validation of consistency within cluster analysis, designed to help find the appropriate number of clusters in a dataset. The concepts relevant to unsupervised learning and clustering were explained with examples, and all participants were directed to perform practicals to solve a problem using the k-means clustering algorithm.
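A minimal sketch of k-means and the Elbow method is given below. It works on one-dimensional data with a deterministic starting point so that the steps are easy to follow; the data, the function name kmeans_1d and the initialisation scheme are assumptions for illustration, not the approach used in the practical session.

```python
def kmeans_1d(points, k, iterations=20):
    """Basic k-means on 1-D data; returns (centroids, inertia)."""
    ordered = sorted(points)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    # Inertia: sum of squared distances from each point to its nearest centroid.
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

# Made-up data with three visible groups around 1, 5 and 9.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

inertias = {k: kmeans_1d(points, k)[1] for k in (1, 2, 3, 4)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

The inertia falls sharply up to k = 3 and only slightly afterwards; that bend in the curve is the "elbow" that suggests three clusters for this data.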

Figures 10 & 11: Participants enjoying the session

In the post-lunch sessions, each participant was given a separate problem to analyse, along with the necessary dataset. A detailed report was expected as part of the evaluation of hands-on skills. The participants had to define the problem statement and the expected conclusion for that problem after exploratory data analysis of the given dataset.