3.6 LIBRARIES
3.6.1.10 Exercises
3.6.2.1.2 Text Files
To read text files into a variable called content you can use
You can also use the following code while using the convenient with statement import pickle flavor = { "small": 100, "medium": 1000, "large": 10000 }
pickle.dump( flavor, open( "data.p", "wb" ) )
flavor = pickle.load( open( "data.p", "rb" ) )
content = open('filename.txt', 'r').read()
with open('filename.txt','r') as file: content = file.read()
To split up the lines of the file into an array you can do
This cam also be done with the build in readlines function
In case the file is too big you will want to read the file line by line:
3.6.2.1.3 CSV Files
Often data is contained in comma separated values (CSV) within a file. To read such files you can use the csv package.
Using pandas you can read them as follows.
There are many other modules and libraries that include CSV read functions. In case you need to split a single line by comma, you may also use the split function.
However, remember it swill split at every comma, including those contained in quotes. So this method although looking originally convenient has limitations. 3.6.2.1.4 Excel spread sheets
Pandas contains a method to read Excel files
3.6.2.1.5 YAML
YAML is a very important format as it allows you easily to structure data in
with open('filename.txt','r') as file: lines = file.read().splitlines()
lines = open('filename.txt','r').readlines()
with open('filename.txt','r') as file: line = file.readline()
print (line)
import csv
with open('data.csv', 'rb') as f: contents = csv.reader(f)
for row in content: print row import pandas as pd df = pd.read_csv("example.csv") import pandas as pd filename = 'data.xlsx' data = pd.ExcelFile(file) df = data.parse('Sheet1')
hierarchical fields It is frequently used to coordinate programs while using yaml as the specification for configuration files, but also data files. To read in a yaml file the following code can be used
The nice part is that this code can also be used to verify if a file is valid yaml. To write data out we can use
The flow style set to false formats the data in a nice readable fashion with indentations.
3.6.2.1.6 JSON
3.6.2.1.7 XML
XML format is extensively used to transport data across the web. It has a hierarchical data format, and can be represented in the form of a tree.
A Sample XML data looks like:
Python provides the ElementTree XML API to parse and create XML data. Importing XML data from a file:
Reading XML data from a string directly:
import yaml
with open('data.yaml', 'r') as f: content = yaml.load(f)
with open('data.yml', 'w') as f:
yaml.dump(data, f, default_flow_style=False)
import json
with open('strings.json') as f: content = json.load(f)
<data>
<items>
<item name="item-1"></item>
<item name="item-2"></item>
<item name="item-3"></item>
</items> </data>
import xml.etree.ElementTree as ET tree = ET.parse('data.xml') root = tree.getroot()
Iterating over child nodes in a root:
Modifying XML data using ElementTree:
Modifying text within a tag of an element using .text method:
Adding/modifying an attribute using .set() method:
Other Python modules used for parsing XML data include
minidom: https://docs.python.org/3/library/xml.dom.minidom.html
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
3.6.2.1.8 RDF
To read RDF files you will need to install RDFlib with
This will than allow you to read RDF files
Good examples on using RDF are provided on the RDFlib Web page at
https://github.com/RDFLib/rdflib
From the Web page we showcase also how to directly process RDF data from the Web
for child in root:
print(child.tag, child.attrib)
tag.text = new_data tree.write('output.xml')
tag.set('key', 'value') tree.write('output.xml')
$ pip install rdflib
from rdflib.graph import Graph g = Graph()
g.parse("filename.rdf", format="format")
for entry in g: print(entry) import rdflib g=rdflib.Graph() g.load('http://dbpedia.org/resource/Semantic_Web') for s,p,o in g: print s,p,o
3.6.2.1.9 PDF
The Portable Document Format (PDF) has been made available by Adobe Inc. royalty free. This has enabled PDF to become a world wide adopted format that also has been standardized in 2008 (ISO/IEC 32000-1:2008,
https://www.iso.org/standard/51502.html). A lot of research is published in
papers making PDF one of the de-facto standards for publishing. However, PDF is difficult to parse and is focused on high quality output instead of data representation. Nevertheless, tools to manipulate PDF exist:
PDFMiner
https://pypi.python.org/pypi/pdfminer/ allows the simple translation of PDF
into text that than can be further mined. The manual page helps to demonstrate some examples http://euske.github.io/pdfminer/index.html. pdf-parser.py
https://blog.didierstevens.com/programs/pdf-tools/ parses pdf documents
and identifies some structural elements that can than be further processed. If you know about other tools, let us know.
3.6.2.1.10 HTML
A very powerful library to parse HTML Web pages is provided with
https://www.crummy.com/software/BeautifulSoup/
More details about it are provided in the documentation page
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
TODO: Students can contribute a section
Beautiful Soup is a python library to parse, process and edit HTML documents. To install Beautiful Soup, use pip command as follows:
In order to process HTML documents, a parser is required. Beautiful Soup
supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers like the lxml parser which is
commonly used [1].
Following command can be used to install lxml parser
To begin with, we import the package and instantiate an object as follows for a html document html_handle:
Now, we will discuss a few functions, attributes and methods of Beautiful Soup.
prettify function
prettify() method will turn a Beautiful Soup parse tree into a nicely formatted
Unicode string, with a separate line for each HTML/XML tag and string. It is analgous to pprint() function. The object created above can be viewed by printing
the prettfied version of the document as follows:
tag Object
A tag object refers to tags in the HTML document. It is possible to go down to the
inner levels of the DOM tree. To access a tag div under the tag body, it can be done
as follows:
The attrs attribute of the tag object returns a dictionary of all the defined attributes
of the HTML tag as keys.
has_attr() method
To check if a tag object has a specific attribute, has_attr() method can be used. $ pip install lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_handle, `lxml`)
print(soup.prettify())
body_div = soup.body.div print(body_div.prettify())
if body_div.has_attr('p'):
tag object attributes
name - This attribute returns the name of the tag selected.
attrs - This attribute returns a dictionary of all the defined attributes of the
HTML tag as keys.
contents - This attribute returns a list of contents enclosed within the HTML
tag
string - This attribute which returns the text enclosed within the HTML tag.
This returns None if there are multiple children
strings - This overcomes the limitation of string and returns a generator of all
strings enclosed within the given tag
Following code showcases usage of the above discussed attributes:
Searching the Tree
find() function takes a filter expression as argument and returns the first
match found
findall() function returns a list of all the matching elements
select() function can be used to search the tree using CSS selectors
3.6.2.1.11 ConfigParser
TODO: Students can contribute a section
body_tag = soup.body
print("Name of the tag:', body_tag.name) attrs = body_tag.attrs
print('The attributes defined for body tag are:', attrs) print('The contents of \'body\' tag are:\n', body_tag.contents) print('The string value enclosed in \'body\' tag is:', body_tag.string) for s in body_tag.strings:
print(repr(s))
search_elem = soup.find('a') print(search_elem.prettify())
search_elems = soup.find_all("a", class_="sample") pprint(search_elems)
# Select `a` tag with class `sample`
a_tag_elems = soup.select('a.sample') print(a_tag_elems)
https://pymotw.com/2/ConfigParser/
3.6.2.1.12 ConfigDict
https://github.com/cloudmesh/cloudmesh-
common/blob/master/cloudmesh/common/ConfigDict.py 3.6.2.2 Encryption
Often we need to protect the information stored in a file. This is achieved with encryption. There are many methods of supporting encryption and even if a file is encrypted it may be target to attacks. Thus it is not only important to encrypt data that you do not want others to se but also to make sure that the system on which the data is hosted is secure. This is especially important if we talk about big data having a potential large effect if it gets into the wrong hands.
To illustrate one type of encryption that is non trivial we have chosen to demonstrate how to encrypt a file with an ssh key. In case you have openssl installed on your system, this can be achieved as follows.
Most important here are Step 4 that encrypts the file and Step 5 that decrypts the file. Using the Python os module it is straight forward to implement this. However, we are providing in cloudmesh a convenient class that makes the use in python very simple.
#! /bin/sh
# Step 1. Creating a file with data
echo "Big Data is the future." > file.txt # Step 2. Create the pem
openssl rsa -in ~/.ssh/id_rsa -pubout > ~/.ssh/id_rsa.pub.pem
# Step 3. look at the pem file to illustrate how it looks like (optional)
cat ~/.ssh/id_rsa.pub.pem
# Step 4. encrypt the file into secret.txt
openssl rsautl -encrypt -pubin -inkey ~/.ssh/id_rsa.pub.pem -in file.txt -out secret.txt
# Step 5. decrypt the file and print the contents to stdout
openssl rsautl -decrypt -inkey ~/.ssh/id_rsa -in secret.txt
from cloudmesh.common.ssh.encrypt import EncryptFile
e = EncryptFile('file.txt', 'secret.txt') e.encrypt()
In our class we initialize it with the locations of the file that is to be encrypted and decrypted. To initiate that action just call the methods encrypt and decrypt.