

Academic year: 2021


June 13th Notes

Dr. Lian’s Class

The data for every transaction is saved in a database; for that reason, even math majors are required to study SQL.

It’s really hard to keep track of what’s wrong with a program in Java, C# or C++. To see the result of every step, we use Python and Scala: when we write a line, we immediately see the result. This helps us understand the code and find errors more easily.

The R programming language is not as good because it freezes even with small data (a couple of MB). Python and Scala sit in the middle, between C and C++ on one side and R, SAS, and Matlab on the other. They combine the best parts of the others; however, they still have some disadvantages (e.g. they are slower than C++ and Java).

• Suggested language choice for data science:

o Python: better for beginners. It is as interactive and easy as R, and can still handle real-world applications. Suggested for entry-level students and non-IS/CS majors.

o Scala: advanced and tightly connected with the Spark system. Used by real companies. This system is 10 times faster than Python, but this is not a big deal unless we have a really big program.

Basic Syntax

• Advanced

o Power: 2 ** 3

o How about square root?

o Discard the fraction: 17//3

o Remainder: 17%3

o Round decimal: round(3.14159,3)
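The operators above can be tried directly in the interpreter. The sketch below is only an illustration; it also answers the square-root question using the standard `math` module (an exponent of 0.5 is an alternative). The variable names are made up for this example.

```python
import math

power = 2 ** 3                 # exponentiation
square_root = math.sqrt(16)    # square root via the standard math module
also_root = 16 ** 0.5          # raising to the power 0.5 gives the same value
quotient = 17 // 3             # floor division discards the fraction
remainder = 17 % 3             # modulo gives the remainder
rounded = round(3.14159, 3)    # round to 3 decimal places

print(power, square_root, also_root, quotient, remainder, rounded)
```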

To distinguish our own names from built-ins and keywords, we prepend “p_” to the variable name.


There is no difference between Double Quote (“”) and Single Quotes (‘’) when assigning a string to a Variable and/or printing it.

Python has String Operators such as “+” and “*”

The * operator applied to a string repeats it: string Y multiplied by X yields Y appended to itself X times.
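A small sketch of the quoting and operator behavior described above. The variable names are hypothetical and simply follow the “p_” convention from these notes.

```python
# Single and double quotes build the same string.
p_single = 'data'
p_double = "data"
same = p_single == p_double

# + concatenates strings; * repeats a string a given number of times.
p_joined = "data" + " " + "science"
p_repeated = "ab" * 3

print(same, p_joined, p_repeated)
```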

Every string can behave like an array in Python allowing access to a substring given the beginning index and end index. However, if we just put one number, it will only print a single character.

A negative index counts from the end of the string instead of the beginning: -1 is the last character, -2 the one before it, and so on.
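The indexing rules can be checked with a short example. This is an illustrative sketch with a made-up string; note that the end index of a slice is excluded.

```python
p_text = "Python"

first_char = p_text[0]        # a single index returns one character
middle = p_text[1:4]          # start index included, end index excluded
last_char = p_text[-1]        # -1 is the last character
last_two = p_text[-2:]        # negative indices count from the end
reversed_text = p_text[::-1]  # a step of -1 reads the whole string backwards

print(first_char, middle, last_char, last_two, reversed_text)
```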

• Some other substring methods.


Split function

Given a string, if we want to divide it into two or more pieces based on a specific token (usually a character), we can use the split function.

The return value of the split function is a list that contains every substring produced by splitting on the token.
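For instance, splitting a comma-separated line could look like the sketch below. The sample string is invented for illustration.

```python
p_line = "2017-06-13,Dr. Lian,Python"

# Split on the comma token; the result is a list of substrings.
p_fields = p_line.split(",")

# Each piece is itself a string, so it can be split again on another token.
p_date_parts = p_fields[0].split("-")

print(p_fields)
print(p_date_parts)
```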

Functions

To create a function of our own, we use the keyword def, followed by the name of our function and then the parameters that it takes.
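A minimal sketch of a user-defined function. The function name and its parameters are hypothetical; the second parameter shows that a default value is allowed.

```python
def p_rectangle_area(width, height=1):
    """Return the area of a rectangle; height defaults to 1 if omitted."""
    return width * height

area_full = p_rectangle_area(3, 4)   # both parameters given
area_default = p_rectangle_area(5)   # height falls back to its default

print(area_full, area_default)
```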


List Data

Lists are like arrays, except that in Python a list can hold a mix of objects/data types. The elements don’t need to share a data type to be in the same list.

• Show a List
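As an illustration, a list mixing several data types might be built and shown like this (the contents are made up):

```python
# One list holding an int, a string, a float, and another list.
p_mixed = [1, "two", 3.0, [4, 5]]
print(p_mixed)

# Lists can grow, and each element keeps its own type.
p_mixed.append(True)
p_length = len(p_mixed)
p_types = [type(item).__name__ for item in p_mixed]

print(p_length, p_types)
```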


Open file in Python

It is actually very simple. Since Python does not need us to declare a data type, we don’t need anything like “FileReader” (in Java). We just call the function open, pass it the path to the file (an absolute path avoids confusion about the working directory), and assign the result to any variable.
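A sketch of reading a file with open. The file name is hypothetical, and the example first writes a small file into the system temporary directory so that it can run on its own. The `with` statement is used here so the file is closed automatically; that is a common idiom, not something the notes prescribe.

```python
import os
import tempfile

# Hypothetical file used only for this example.
p_path = os.path.join(tempfile.gettempdir(), "p_notes_example.txt")

# Create the file so there is something to read.
with open(p_path, "w") as p_out:
    p_out.write("first line\nsecond line\n")

# Open the file by its path and read it line by line.
with open(p_path) as p_file:
    p_lines = [line.rstrip("\n") for line in p_file]

print(p_lines)
os.remove(p_path)  # clean up the example file
```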

Dr. Songhua’s Class

Notes regarding recognizing a good/bad research paper.

• Having very few resources (references) is a bad signal for a research paper.

• Having very few issues to solve is also a bad signal.

• Google Scholar accepts anything slightly research-related, unlike research journals.

• Google Scholar can help get copies of good articles that require a paid subscription or login.

Read “Question Generation via Overgenerating Transformations and Ranking”: derive questions based on the article, and look for the things that I don’t know.

Professor Usman Roshan

https://web.njit.edu/~usman/

He works in machine learning and bioinformatics. This professor will leave for Canada in 2.5 to 3 weeks (counting from 6/13/2017, so probably by the end of June).


Projects

• Disease Risk Prediction Project (http://genopsis.njit.edu/): It requires a bit more work. 23andme.com reports a person’s DNA data. They could also give the cancer risk percentage, but that was banned because the accuracy was very low when based solely on the DNA. It has now been allowed again for a few diseases. Only one graduate student and two undergraduates are working on this, so it is not completely finished. We can work on this tool (the frontend is in JavaScript and the backend is in Python).

o Search for “23andme white paper (23-14)” on Google. This explains the methods used for risk prediction.

o How to predict risks with SNP (“snip”) data.

o Check some of the estimates they get from 23andme.com.

o This is not new research; they are using well-known algorithms.

o They are using SNP data.

o We would have a couple of meetings in their lab.

o They are going to submit a proposal.

o Top journals in bioinformatics have sections for databases, and there are special journals and special issues dedicated to databases.

o There are places where this will get visibility.

o Build a better user interface for the website.

o Peer review would involve some examples and stress testing.

o The Python program just takes in data and ... (it uses the scikit-learn Python library).

o Make the result numbers clickable.

o Activate some filters.

• Genomics Research: disease prediction. “Cross-Validation and Cross-Study Validation of Chronic Lymphocytic Leukemia with Exome Sequences and Machine Learning”. This project does not require that much programming.

o IDLE Python IDE.

o Study loops, conditionals, lists, and dictionaries.

o Learn linear classifiers.

o Using exome data improves prediction accuracy.

o They have breast cancer data (2 TB).

o It can be finished in 9 weeks.

o It involves dealing with a lot of data, sequence arrangements, and tools for genomics.

o Method to align sequences.

o Determine mutations from the aligned sequences.

o It is a pipeline to do this with exome data.

o We need to see if the results they got with leukemia also hold for the breast cancer data.

o 150 case controls for lymphocytic leukemia.

o 1,000 for breast cancer.

o There are multiple companies that do cancer or disease prediction.

o Only 8 or 9 diseases are currently allowed to be predicted.

o We want to learn a model from one set of people and use it to predict for another group of people that have different data. This is a big deal!
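To make the idea of cross-study validation with a linear classifier concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the two “studies” are synthetic Gaussian clusters (the second slightly shifted to mimic a different cohort), and a simple perceptron stands in for whatever linear classifier the project actually uses. It is not the project’s pipeline.

```python
import random

def make_study(rng, n_per_class, shift):
    """Build a synthetic 'study': two Gaussian clusters labeled -1 and +1."""
    samples = []
    for label in (-1, 1):
        for _ in range(n_per_class):
            x1 = rng.gauss(2.0 * label + shift, 0.5)
            x2 = rng.gauss(2.0 * label + shift, 0.5)
            samples.append(((x1, x2), label))
    rng.shuffle(samples)
    return samples

def predict(weights, bias, features):
    """Linear classifier: the sign of w . x + b."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1

def train_perceptron(samples, epochs=20):
    """Perceptron rule: adjust the weights only on misclassified samples."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in samples:
            if predict(weights, bias, features) != label:
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
    return weights, bias

def accuracy(weights, bias, samples):
    correct = sum(predict(weights, bias, f) == y for f, y in samples)
    return correct / len(samples)

rng = random.Random(0)                     # fixed seed for repeatability
study_a = make_study(rng, 50, shift=0.0)   # cohort used for training
study_b = make_study(rng, 50, shift=0.3)   # different cohort, used only for testing

weights, bias = train_perceptron(study_a)
cross_study_accuracy = accuracy(weights, bias, study_b)
print(cross_study_accuracy)
```

The point of the design is that the model never sees study B during training, so the reported accuracy measures how well it transfers to a different group of people.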


• Machine Learning (Representation Learning) is more foundational. There is already work in progress.

o Look up the concepts of stacking and deep learning.

o Some metrics (?)

o Book for machine learning: Introduction to Machine Learning by Ethem Alpaydin.

o Heavy book: Learning with Kernels by Bernhard Schölkopf.

o We can apply this machine learning to a lot of problems.

o This will introduce us to machine learning.

o Possibly publishable.

o Define a loss function, then minimize the loss on the data.

o Learn a new representation of the data so the data is easy to classify.

o Deep learning is representation learning.

o Stacking is a method used where a lot content… (www.kaggle.com)

o Random methods here are very effective.

o They want to compare two main methods for representation learning: stacking vs. deep learning.

o We would use Python to run deep learners. We would use a Google library to implement deep learning models on multiple data sets.

o We would be creating multiple tables of our results with the mean error and median error.

o This is an experimental performance evaluation.

o Can be done in 9 weeks.

o There will be a collaboration with Zhi Wei.

o Possible collaboration with Berkeley.

o This might go to the top journal of deep learning.

o We want to know which method would do best on which type of data.

o Stacking is much easier to implement than deep learning and also faster.

o We can write an effective stacking program in one day.

o Is it more effective than deep learning?

o Large amounts of data take deep learning months to run.

o There are data science data sets and image data.

o Can stacking perform as well on both?

o We want to see if we can find something that can provide the same effectiveness to Stacking for a cheaper/more efficient result.
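The “loss function, then minimize” step and the mean/median error columns mentioned above can be sketched in a few lines of plain Python. The data, the linear model, and the learning rate are all assumptions chosen for illustration; the project itself would run real deep learning or stacking models on real data sets.

```python
import statistics

# Synthetic data assumed for illustration: y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mean_squared_loss(slope, intercept):
    """Loss function: average squared difference between prediction and target."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Minimize the loss with plain gradient descent.
slope, intercept = 0.0, 0.0
learning_rate = 0.05
for _ in range(2000):
    grad_slope = sum(2 * (slope * x + intercept - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_intercept = sum(2 * (slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

# Summarize the per-sample absolute errors as a results table would.
errors = [abs(slope * x + intercept - y) for x, y in zip(xs, ys)]
mean_error = statistics.mean(errors)
median_error = statistics.median(errors)

print(slope, intercept, mean_squared_loss(slope, intercept))
print(mean_error, median_error)
```

Reporting both the mean and the median is useful because the median is less sensitive to a few very large errors.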
