CIS 455/555: Internet and Web Systems
Indexing
September 27, 2021
Plan for today
n
Naming
n Flat naming
n Attribute-based naming; LDAP
n The Domain Name System (DNS)
n Attacks on DNS
n DNSSEC
n
Inverted indices
n
B+ trees
NEXT
Attacking DNS
n
Suppose an adversary wants to learn Alice's banking password
n The adversary wants to redirect Alice's HTTP requests to his own server
n
How can the adversary accomplish this?
Simple attacks
n
Break into the DNS server
n Can modify the zone file directly
n
Phishing / Spearphishing
n
Spoof the responses
Cache poisoning
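[Figure: message flow; genuine address is 1.2.3.4, adversary's address is 6.6.6.6]
1. Alice → caching DNS server: www.bank.com? (Query ID #4711)
2. Caching DNS server → authoritative DNS server: www.bank.com? (Query ID #3829)
3. Adversary → caching DNS server: www.bank.com=6.6.6.6 (guesses query IDs #3827, #3828, #3829, ...)
4. Authoritative DNS server → caching DNS server: www.bank.com=1.2.3.4 (Query ID #3829)
5. Caching DNS server → Alice: www.bank.com=6.6.6.6 (Query ID #4711)
6. Alice → 6.6.6.6: GET / HTTP/1.1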
When does this work?
n
Name can't already be in the cache
n Otherwise no external queries will occur
n
Adversary has to guess the query ID
n If the name server doesn't properly choose the query IDs (e.g., counts up by one), this is easy
n
Adversary has to be faster than the real name server
n
Countermeasures?
n Randomize query IDs?
n Set long TTLs?
The Kaminsky attack
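[Figure: message flow; genuine address is 1.2.3.4, adversary's address is 6.6.6.6]
1. Adversary → caching DNS server: www38.bank.com?
2. Caching DNS server → authoritative DNS server: www38.bank.com? (Query ID #3829)
3. Adversary → caching DNS server: www38.bank.com=6.6.6.6 (guesses query IDs #3827, #3828, #3829, ...)
4. Authoritative DNS server → caching DNS server: www38.bank.com=1.2.3.4 (Query ID #3829)
5. Alice → caching DNS server: www.bank.com? (Query ID #4711)
6. Caching DNS server → Alice: www.bank.com=6.6.6.6 (Query ID #4711)
7. Alice → 6.6.6.6: GET / HTTP/1.1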
When does this work?
n
Name can't already be in the cache
n Otherwise no external queries will occur
n
Adversary has to guess the query ID
n If the name server doesn't properly choose the query IDs (e.g., counts up by one), this is easy
n
Adversary has to be faster than the real name server
n
Countermeasures?
n Randomize query IDs?
n Set long TTLs?
n Randomize source ports?
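To get a feel for why these countermeasures matter, here is a rough back-of-the-envelope sketch. It assumes the adversary sends distinct forged replies and that each (query ID, source port) combination is equally likely. The function names and the figure of 60,000 usable ports are illustrative assumptions, not from the slides; only the 16-bit width of the DNS query ID is a protocol fact.

```python
def spoof_success_probability(guesses, id_bits=16, port_choices=1):
    """Chance that one of `guesses` distinct forged replies matches a single query."""
    search_space = (2 ** id_bits) * port_choices
    return min(1.0, guesses / search_space)


def success_after_races(per_race_probability, races):
    """Chance of winning at least one of `races` independent attempts."""
    return 1.0 - (1.0 - per_race_probability) ** races


# Predictable IDs (e.g., counting up by one): nothing left to guess.
print(spoof_success_probability(1, id_bits=0))

# Random 16-bit IDs, 100 forged replies per race.
p_ids = spoof_success_probability(100)
print(p_ids)

# Random IDs plus randomized source ports (60,000 ports is an assumed figure).
p_ports = spoof_success_probability(100, port_choices=60_000)
print(p_ports)

# Kaminsky-style: fresh names (www1, www2, ...) allow many races.
print(success_after_races(p_ids, 1000), success_after_races(p_ports, 1000))
```

Under these assumptions, random IDs alone help little once the adversary can retry at will, while port randomization enlarges the search space by several orders of magnitude.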
What is DNSSEC?
n
DNS Security Extensions
n A change to the DNS specification to make it more secure while maintaining backwards compatibility
n
What does it do?
n Protect DNS against spoofing and corruption
n Cryptographically signs DNS responses
Digital signature basics
n
We can generate a keypair (k_pub, k_priv) and implement two operations:
n Enc(k_priv, P) → C: encrypt plaintext P with the private key; result is a ciphertext C
n Dec(k_pub, C) → P: decrypt ciphertext C with the public key; result is plaintext P
n We can make it very difficult to 1) compute k_priv from k_pub, or to 2) compute, without k_priv, a C that decrypts to P
n
Idea: Alice publishes k_pub, but k_priv is secret
n
When Alice wants to sign some statement S, she publishes (S, Enc(k_priv, S))
n Anyone can verify by decrypting Enc(k_priv, S) with k_pub
[Figure: P → Enc (with k_priv) → C → Dec (with k_pub) → P]
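The Enc/Dec notation above can be tried out with textbook RSA on tiny numbers. This is only a toy sketch: the key values (n = 3233, e = 17, d = 2753) are a standard classroom example, the function names mirror the slide's notation, and real signature schemes hash and pad the message and use far larger keys.

```python
# Toy keypair from the primes 61 and 53 (insecure, for illustration only).
N = 61 * 53            # modulus, 3233
K_PUB = (17, N)        # (public exponent, modulus)
K_PRIV = (2753, N)     # (private exponent, modulus); 17 * 2753 = 1 mod 3120


def enc(k_priv, p):
    """Enc(k_priv, P) -> C, with P an integer smaller than the modulus."""
    d, n = k_priv
    return pow(p, d, n)


def dec(k_pub, c):
    """Dec(k_pub, C) -> P."""
    e, n = k_pub
    return pow(c, e, n)


def sign(k_priv, s):
    """Publish the statement together with Enc(k_priv, S)."""
    return (s, enc(k_priv, s))


def verify(k_pub, signed):
    """Anyone holding k_pub can check that the signature decrypts to S."""
    s, c = signed
    return dec(k_pub, c) == s


signed = sign(K_PRIV, 1234)
print(verify(K_PUB, signed))               # True
print(verify(K_PUB, (1235, signed[1])))    # False: statement was altered
```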
How DNSSEC works
n
Each domain has a keypair (public/private key)
n Controlled by the entity to whom the domain is delegated
n
Parent domains sign keys of child domains
n Key of the root domain is well-known
n NOTE: Parent does NOT sign the child domain itself!
NOTE: The data is signed, not just the messages!
[Figure: chain of signatures: . signs edu, edu signs upenn, upenn signs cis]
DNSSEC deployment
n
DNSSEC is now finally deployed
n Signed root zone available since July 15, 2010
n But DNSSEC has been under discussion since at least 1993!
n
Why has it taken so long?
n Attacks are relatively rare
n DNSSEC is not free (e.g., key management)
n Bootstrapping problem: Value is limited until it is widely deployed → Initial lack of incentives to deploy
n A common problem with deploying new technologies
n Other classic examples: S-BGP, IPv6
Recap: Attacks on DNS; DNSSEC
n
The original DNS is vulnerable to a variety of attacks
n Spoofing
n Cache poisoning
n Kaminsky attack
n
DNSSEC solves many of these issues
n DNS records are signed cryptographically
n
It has taken a long time to deploy DNSSEC
n Under discussion since at least 1993
n Reasons are more economic than technical (incentives!)
n Bootstrapping problem
Finding the 'right' architecture
n
Should we use names or addresses?
n
If names, what kind of names?
n Flat names (Gnutella, ARP), hierarchical names (LDAP, DNS)
n
How are names assigned?
n Choose-your-own (Gnutella), explicit registration (DNS)
n Delegation in DNS
n
How do we resolve names?
n Flooding (Gnutella), centralized directory (Napster),
hierarchical directory (DNS, LDAP), decentralized directory
n Caching in DNS, dynamic mapping in Akamai
n
What about security?
n Flooding/poisoning/Kaminsky → Anycast, DNSSEC
Finding data by content
n
We’ve seen two approaches to search:
n Flood the network with requests (example: Gnutella), and do all the work at the data stores
n Have a directory based on names (example: LDAP)
n Which of these is the 'best'?
n
An alternative, two-step process:
n Build a content index over what’s out there
n An index is a key→value map
n Typically limited in what kinds of queries can be supported
n Most common instance: an index of document keywords
n Example: Incidence matrix.
n Is this a good idea?
A common model for search
n
Index the words in every document
n
“Forward index”: document (ID) → list of words
n
“Inverted index”: word → document (ID)
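The two maps can be sketched in a few lines. This is a minimal illustration that assumes documents are plain strings split on whitespace; the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def build_forward_index(docs):
    """Forward index: document ID -> list of words."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def invert(forward):
    """Inverted index: word -> sorted list of document IDs containing it."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


docs = {1: "the quick fox", 2: "the lazy dog", 3: "quick dog"}
forward = build_forward_index(docs)
inverted = invert(forward)
print(forward[3])         # ['quick', 'dog']
print(inverted["quick"])  # [1, 3]
```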
Inverted indices
n
A conceptually very simple map-multiset data structure: <keyword, {list of occurrences}>
n
In its simplest form, each occurrence includes a document pointer (e.g., URI), perhaps a count and/or position
n What might a count be useful for? A position?
n
Requires two components, an indexer and a retrieval system
n
We’ll consider the cost of building the index, plus searching the index using a single keyword
n Storage efficiency is also a concern
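As a rough sketch of the two components, the indexer below records a position list per (keyword, document) pair, and the retrieval function answers a single-keyword query with a count and positions per document. Whitespace tokenization, the function names, and the sample documents are assumptions for illustration.

```python
from collections import defaultdict


def index_documents(docs):
    """Indexer: keyword -> {document ID: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def lookup(index, keyword):
    """Retrieval: list of (document ID, count, positions) for one keyword."""
    postings = index.get(keyword.lower(), {})
    return [(doc_id, len(pos), pos) for doc_id, pos in sorted(postings.items())]


docs = {"a.html": "to be or not to be", "b.html": "be quick"}
index = index_documents(docs)
print(lookup(index, "be"))   # [('a.html', 2, [1, 5]), ('b.html', 1, [0])]
```

A count like this could feed a ranking function, and positions make phrase or proximity queries possible.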
How do we lay out an inverted index?
n
Which operations do we need to support?
n insert
n delete
n find
n next
n
Which data structures could we use?
n Hash table
n Unordered list (e.g., a log)
n Ordered list
n Tree
n ...
n
Which properties are we looking for?
Unordered, ordered, and linked lists
n
Assume that we have a list of entries such as:
<keyword, #items, {occurrences}>
n
What does ordering buy us?
n
How do we insert items?
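One way to explore these questions is a sorted array of keywords searched with binary search. The sketch below uses Python's `bisect` module; the class and method names are made up for illustration. Ordering buys logarithmic `find` and a cheap `next`, but an insert of a new keyword still shifts elements, which is linear in the worst case.

```python
import bisect


class OrderedIndex:
    """Sorted list of keywords with a parallel list of occurrence lists."""

    def __init__(self):
        self.keys = []
        self.occurrences = []

    def insert(self, keyword, occurrence):
        i = bisect.bisect_left(self.keys, keyword)        # O(log n) search
        if i < len(self.keys) and self.keys[i] == keyword:
            self.occurrences[i].append(occurrence)
        else:
            self.keys.insert(i, keyword)                  # O(n) shift
            self.occurrences.insert(i, [occurrence])

    def find(self, keyword):
        i = bisect.bisect_left(self.keys, keyword)
        if i < len(self.keys) and self.keys[i] == keyword:
            return self.occurrences[i]
        return None

    def next(self, keyword):
        """Smallest stored keyword strictly greater than `keyword`."""
        i = bisect.bisect_right(self.keys, keyword)
        return self.keys[i] if i < len(self.keys) else None


idx = OrderedIndex()
for word, doc in [("dog", "d1"), ("ant", "d2"), ("cat", "d1"), ("ant", "d3")]:
    idx.insert(word, doc)
print(idx.keys)          # ['ant', 'cat', 'dog']
print(idx.find("ant"))   # ['d2', 'd3']
print(idx.next("ant"))   # cat
```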
Tree-based indices
n
Trees have several benefits over lists:
n logarithmic search time, as with a well-designed sorted list
n if it is balanced!
n Ability to handle variable-length lists
n
We’ve already seen that trees can also be a natural way of distributing data
n
How does a binary search tree fare?
n Cost of building?
n Cost of finding an item in it?
Potentially
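A small experiment makes the balance caveat concrete. The sketch below is a plain, unbalanced binary search tree (the names are illustrative): inserting keys in sorted order produces a chain as tall as the number of keys, while a favorable insertion order keeps the height logarithmic.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def insert(root, key):
    """Iterative insert into an unbalanced BST; returns the root."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = new
                return root
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = new
                return root
            cur = cur.right


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def height(root):
    """Number of levels, computed breadth-first."""
    levels = 0
    frontier = [root] if root else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c]
    return levels


def balanced_order(keys):
    """Insertion order (middle key first) that yields a balanced tree."""
    if not keys:
        return []
    mid = len(keys) // 2
    return [keys[mid]] + balanced_order(keys[:mid]) + balanced_order(keys[mid + 1:])


keys = list(range(127))
print(height(build(keys)))                  # 127: degenerates into a list
print(height(build(balanced_order(keys))))  # 7: logarithmic
```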
The B+ tree
n
A flexible, height-balanced, high-fanout tree
n
Insert/delete at log_F N cost (F = fanout, N = # leaf pages)
n Need to keep tree height-balanced
n
Minimum 50% occupancy (except for root)
n Each node contains d <= m <= 2d entries
n Inner nodes contain up to 2d+1 pointers
n d is called the order of the tree
n
Can search efficiently based on equality (or also range)
[Figure: B+ tree layout. Index entries (direct search) on top; data entries ("sequence set") at the bottom, connected as a linked list (compare to B-tree!)]
Example B+ Tree
n
Data (inverted list pointers) is at the leaves;
intermediate nodes have copies of search keys
n Why is that a good thing?
n
Search begins at root, and key comparisons direct it to a leaf
n
Search for be↓, bobcat↓ ...
n Based on the search for bobcat↓, we know it is not in the tree!
[Figure: example B+ tree]
Root: art | best | but | dog
Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
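The search on this example tree can be sketched as follows. The grouping of the leaf entries is my reading of the figure (each leaf holds the keys between two adjacent root keys), and the tree is hard-coded as two levels, so treat this as an illustration rather than a general implementation.

```python
import bisect

# Two-level example tree: one root node and five leaves (assumed grouping).
ROOT_KEYS = ["art", "best", "but", "dog"]
LEAVES = [
    ["a", "am", "an", "ant"],
    ["art", "be"],
    ["best", "bit", "bob"],
    ["but", "can", "cry"],
    ["dog", "dry", "elf", "fox"],
]


def search(key):
    """Return (found, leaf_number) for a lookup that starts at the root."""
    # Child i covers keys >= ROOT_KEYS[i-1] and < ROOT_KEYS[i].
    child = bisect.bisect_right(ROOT_KEYS, key)
    return key in LEAVES[child], child


print(search("be"))      # (True, 1)
print(search("bobcat"))  # (False, 2): the only leaf that could hold it lacks it
```

Because every key lives in exactly one leaf range, a miss in that leaf is enough to conclude the key is not in the tree.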
Inserting data into a B+ Tree
n
Find correct leaf L
n
Put data entry onto L
n If L has enough space, we are done!
n Else, must split leaf node L (into L and a new node L2)
n Redistribute entries evenly, copy up middle key
n Insert index entry pointing to L2 into parent of L
n
This can happen recursively
n To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
n
Splits “grow” tree; root split increases height
n Tree growth: gets wider or one level taller at the top
[Figure: example B+ tree]
Root: art | best | but | dog
Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Inserting “and↓” Example: Copy up
Want to insert into leaf [a↓ am↓ an↓ ant↓]; no room, so split & copy up:
[a↓ am↓] [an↓ and↓ ant↓]
Entry to be inserted in parent node: an
(Note that key “an” is copied up and continues to appear in the leaf.)
But where? Parent node (root: art | best | but | dog) is already "full"!
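Inserting “and↓” Example: Push up 1/2
[Figure: "an" must be inserted into the root, which is full]
Root: art | best | but | dog (incoming entry: an)
Need to split node & push up
Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]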
Inserting “and↓” Example: Push up 2/2
[Figure: the old root split into two index nodes]
Index nodes after the split: [an | art] and [but | dog]
Entry to be inserted in parent node: best
(Note that best is pushed up and only appears once in the index. Contrast this with a leaf split.)
Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Summary: Copying vs. splitting
n
Every keyword (search key) appears in at most one intermediate node
n Hence, in splitting an intermediate node, we push up
n
Every inverted list entry must appear in a leaf
n We may also need it in an intermediate node to define a partition point in the tree
n We must copy up the key of this entry
n
Note that B+ trees easily accommodate
multiple occurrences of a keyword
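The difference between the two kinds of split fits in a few lines. This sketch works on plain lists of keys and uses made-up function names; it reproduces the "and" example, where the leaf split copies "an" up and the index split pushes "best" up.

```python
def split_leaf(entries):
    """Split an overfull leaf; the middle key is COPIED up and stays in the leaf."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]


def split_index(keys):
    """Split an overfull index node; the middle key is PUSHED up and leaves the node."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]


# Leaf [a am an ant] after adding "and":
print(split_leaf(["a", "am", "an", "and", "ant"]))
# (['a', 'am'], ['an', 'and', 'ant'], 'an')

# Root [art best but dog] after adding the copied-up "an":
print(split_index(["an", "art", "best", "but", "dog"]))
# (['an', 'art'], 'best', ['but', 'dog'])
```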
Some details
n
How would you choose the order of the tree?
n
How would you find all the words starting with the letters 'com'?
n
How would you delete something?
n
Do you always have to split/merge?
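For the prefix question, one plausible approach is a range scan: locate the first key that is not smaller than 'com', then walk the linked leaves until a key no longer starts with the prefix. The sketch below stands in a sorted Python list for the leaf level; the function name and sample keys are illustrative.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with `prefix`, scanning in key order."""
    start = bisect.bisect_left(sorted_keys, prefix)   # like descending from the root
    matches = []
    for key in sorted_keys[start:]:                   # like following leaf links
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


keys = ["cat", "come", "comet", "computer", "con", "dog"]
print(prefix_scan(keys, "com"))   # ['come', 'comet', 'computer']
```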
How do we distribute a B+ Tree?
n
We need to host the root at one machine and
distribute the rest
n
What are the implications for scalability?
n Consider building the index as well as searching
Eliminating the root
n
Sometimes we don’t want a tree-structured system because the higher levels can be a central point of congestion or failure
n
Two strategies:
n Modified tree structure (e.g., the BATON p2p tree)
n Non-hierarchical structure (e.g., distributed hash table)
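As a very simplified illustration of the non-hierarchical option, keys can be assigned to machines by hashing, so no single node sits above the others. This sketch uses plain modulo placement with hypothetical node names; actual distributed hash tables typically use consistent hashing so that adding or removing a node moves only a fraction of the keys.

```python
import hashlib


def node_for(key, nodes):
    """Pick the node responsible for `key` by hashing it (simple modulo placement)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]


nodes = ["node0", "node1", "node2", "node3"]   # hypothetical machines
for word in ["ant", "bobcat", "dog"]:
    print(word, "->", node_for(word, nodes))
```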
Example: B+ tree
n
Insert 15, 11, 12, 32, 74
[Figure: example B+ tree]
Root: 65 130 187
Inner nodes: 9 25 45 70 80 101 122 138 150 159 180
Leaves: 1↓ 4↓ 6↓ 9↓ 14↓ 16↓ 25↓ 31↓ 38↓ 41↓ 45↓ 61↓ 63↓ 64↓ 65↓ 67↓ 68↓ 69↓ 70↓ 72↓ 75↓ 79↓
Virtues of the B+ Tree
n
B+ tree and other indices are quite efficient:
n Height-balanced; log_F N cost to search
n High fanout (F) means depth rarely more than 3 or 4
n Almost always better than maintaining a sorted file
n Typically, 67% occupancy on average
n
Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you will be using
later in the semester:
n Interface: open B+ Tree; get and put items based on key
n Handles concurrency, caching, etc.
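The claim about depth follows directly from the log_F N bound and can be checked with a little arithmetic. The sketch below counts how many index levels are needed to reach N leaf pages with fanout F, assuming full nodes; the function name and the sample values of F and N are illustrative.

```python
def index_levels(fanout, leaf_pages):
    """Smallest number of index levels whose fanout reaches `leaf_pages` leaves."""
    levels, reachable = 0, 1
    while reachable < leaf_pages:
        reachable *= fanout
        levels += 1
    return levels


print(index_levels(100, 1_000_000))     # 3
print(index_levels(100, 100_000_000))   # 4
print(index_levels(2, 1_000_000))       # 20: a binary tree is much deeper
```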
Recap: B+ trees
n
A very common data structure for indices
n Used, e.g., in many file systems and many DBMS
n
Berkeley DB library is a toolkit for B+ trees that you will be using later in the semester:
n Interface: open B+ Tree; get and put items based on key
n Handles concurrency, caching, etc.
n
Very efficient
n Height-balanced; log_F N cost to search
n High fanout (F) means depth rarely more than 3 or 4
n Typically, 67% occupancy on average
Plan for today
n
B+ trees
n
Data interchange
n
Extensible Markup Language (XML)
n
XML Schema; DOM
n
XPath
NEXT
Kinds of content
n
Keyword search and inverted indices are great for locating text documents
n
But what if we want to index and/or share other kinds of content?
n Spreadsheets
n Maps
n Purchase records
n Objects
n etc.
n
Let’s talk about structured data!
n Now: Representation and transport
n Later: Indexing and retrieval
Sending data
n
How do we send data within a program?
n What is the implicit model?
n How does this change when we need to make the data persistent?
n
What happens when we are coupling systems?
n How do we send data between programs
n on the same machine?
n between different programming languages?
n on different machines?
Motivating example: Web services
n
Accessible to other applications over the web
n Examples: Google Search, Google Maps API, Facebook Graph API, eBay APIs, Amazon Web Services, ...
[Figure: Alice, Bob, and Charlie; a map service (used by Alice); a web page combines data from different sources ('mashup')]
Mashup example: Google Transit
A key challenge
n
Nodes need to communicate with each other
n E.g., using remote procedure calls
n
Network messages are strings of bytes
n No particular structure - must be defined by the application
n Sender marshals the data and produces a string of bytes
n Pointers must be encoded somehow
n Specific byte order; metadata to describe the data
n Receiver unmarshals the data again
[Figure: 17 → Marshalling → 01 17 02 48 3F 12 9E ... → Unmarshalling → 17]
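A minimal sketch of marshalling with Python's `struct` module is shown below. The wire format (a 2-byte length, UTF-8 name, then a 4-byte signed integer, all big-endian) and the record fields are invented for this example; the point is that sender and receiver must agree on exactly such details.

```python
import struct


def marshal(record):
    """Encode {'name': str, 'balance': int} as length-prefixed bytes (big-endian)."""
    name = record["name"].encode("utf-8")
    return struct.pack(">H", len(name)) + name + struct.pack(">i", record["balance"])


def unmarshal(data):
    """Decode bytes produced by marshal() back into a dictionary."""
    (name_len,) = struct.unpack_from(">H", data, 0)
    name = data[2:2 + name_len].decode("utf-8")
    (balance,) = struct.unpack_from(">i", data, 2 + name_len)
    return {"name": name, "balance": balance}


wire = marshal({"name": "Alice", "balance": 17})
print(wire)              # b'\x00\x05Alice\x00\x00\x00\x11'
print(unmarshal(wire))   # {'name': 'Alice', 'balance': 17}
```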
Communication and streams
n
When storing data to disk, we have a
combination of sequential and random access
n
When sending data on “the wire”, data is only sequential
n “Stream-based communication” based on packets
n
What are the implications here?
n Pipelining, incremental evaluation, …
Data interchange is hard
n
What does Bob need to know to understand Alice's document?
n Physical data model (data encoding)
n Code: ASCII or Unicode or ...?
n Byte order: Little-endian? Big-endian?
n Marshalling format: Tagged? Fixed? Which field sizes?
n Logical data model (data representation)
n Semantic heterogeneity
n Imprecise and ambiguous values or descriptions
n ...
Data comes in many formats
Data type         Formats
Text              ASCII, Word document, RTF, TeX, PDF, HTML, ...
Database          MySQL, Oracle, Access, Works, OpenOffice, ...
Image             JPG, GIF, BMP, PNG, RAW, TIFF, Corel, Photoshop, ...
Music             AIFF, MP3, AAC, RA, Ogg, MID, MOD, SWA, ...
Video             AVI, M4V, MPEG, Ogg, WMV, RM, DVD, MOV, ...
Scientific data   Probably at least as many as there are researchers
Example: Old ID3v1 tags in MP3
Offs  Len  Description
0     3    Identifier: "TAG"
3     30   Song title string
33    30   Artist string
63    30   Album string
93    4    Year string
97    28   Comment string
125   1    Zero byte separator
126   1    Track byte
127   1    Genre byte
...
006d3720 da 00 54 41 47 4d 65 6d 62 65 72 73 20 4f 6e 6c
006d3730 79 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
006d3740 00 00 00 53 68 65 72 79 6c 20 43 72 6f 77 00 00
006d3750 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
006d3760 00 54 68 65 20 47 6c 6f 62 65 20 53 65 73 73 69
006d3770 6f 6e 73 00 00 00 00 00 00 00 00 00 00 00 00 31
006d3780 39 39 38 00 00 00 00 00 00 00 00 00 00 00 00 00
006d3790 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
006d37a0 0a ff
Decoded fields:
"TAG"
"Members Only"
"Sheryl Crow"
"The Globe Sessions"
"1998"
Track #10
Genre not specified
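The fixed layout above maps directly onto a `struct` format string. The sketch below rebuilds the 128-byte tag from the example values and parses it; the function name is made up, and decoding the strings as Latin-1 is an assumption, since the layout itself does not name a character encoding.

```python
import struct

# 3 + 30 + 30 + 30 + 4 + 28 + 1 + 1 + 1 = 128 bytes, matching the offset table.
ID3V1 = struct.Struct("3s30s30s30s4s28sBBB")


def parse_id3v1(tag):
    """Parse a 128-byte ID3v1 tag into a dictionary."""
    if len(tag) != ID3V1.size:
        raise ValueError("an ID3v1 tag is exactly 128 bytes")
    ident, title, artist, album, year, comment, _zero, track, genre = ID3V1.unpack(tag)
    if ident != b"TAG":
        raise ValueError("missing TAG identifier")

    def text(raw):
        # Strings are zero-padded; Latin-1 decoding is an assumption.
        return raw.split(b"\x00", 1)[0].decode("latin-1")

    return {"title": text(title), "artist": text(artist), "album": text(album),
            "year": text(year), "comment": text(comment),
            "track": track, "genre": genre}


sample = (b"TAG"
          + b"Members Only".ljust(30, b"\x00")
          + b"Sheryl Crow".ljust(30, b"\x00")
          + b"The Globe Sessions".ljust(30, b"\x00")
          + b"1998"
          + bytes(28)
          + bytes([0, 10, 255]))
print(parse_id3v1(sample))
```

Note how nothing in the bytes themselves says where a field starts or what it means; the reader needs the offset table to make sense of them.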
Example: JPEG header
JPEG “JFIF” header:
n Start of Image (SOI) marker -- two bytes (FFD8)
n JFIF marker (FFE0)
n length -- two bytes
n identifier -- five bytes: 4A, 46, 49, 46, 00
(the ASCII code equivalent of a zero terminated "JFIF" string)
n version -- two bytes: often 01, 02
n the most significant byte is used for major revisions
n the least significant byte for minor revisions
n units -- one byte: Units for the X and Y densities
n 0 => no units, X and Y specify the pixel aspect ratio
n 1 => X and Y are dots per inch
n 2 => X and Y are dots per cm
n Xdensity -- two bytes
n Ydensity -- two bytes
n Xthumbnail -- one byte: 0 = no thumbnail
n Ythumbnail -- one byte: 0 = no thumbnail
n (RGB)n -- 3n bytes: packed (24-bit) RGB values for the thumbnail pixels,
n = Xthumbnail * Ythumbnail
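The header fields listed above can be read with a single big-endian `struct` format. This is a sketch only: the function name and the sample bytes (version 1.02, 72 dots per inch, no thumbnail) are constructed for the example, and the thumbnail pixel data is not parsed.

```python
import struct

# SOI, APP0 marker, length, identifier, major, minor, units,
# Xdensity, Ydensity, Xthumbnail, Ythumbnail: 20 bytes, big-endian.
JFIF_HEADER = struct.Struct(">HHH5sBBBHHBB")


def parse_jfif_header(data):
    """Parse the first 20 bytes of a JFIF file into a dictionary."""
    (soi, app0, length, ident, major, minor, units,
     xdensity, ydensity, xthumb, ythumb) = JFIF_HEADER.unpack_from(data, 0)
    if soi != 0xFFD8 or app0 != 0xFFE0 or ident != b"JFIF\x00":
        raise ValueError("not a JFIF header")
    return {"length": length, "version": (major, minor), "units": units,
            "xdensity": xdensity, "ydensity": ydensity,
            "thumbnail": (xthumb, ythumb)}


sample = bytes.fromhex("ff d8 ff e0 00 10 4a 46 49 46 00 01 02 01 00 48 00 48 00 00")
print(parse_jfif_header(sample))
```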