Assignment Solutions

(1)

Dear Learning Community,

welcome to the two exam weeks of our In-Memory Data Management online course. During the next two weeks you will be able to take the final exam. The exact timespan is from 07.10.13, 10 am (CEST, Germany time zone) to 21.10.13, 5 pm.

There are no prerequisites (like a point restriction) to take the exam. You have 100 minutes to solve the 50 questions of the exam. Please be advised that the exam can only be started ONCE (closing the browser does not stop the timer). The exam starts when you click on the start button on the exam page, so only click on the button if you are willing to take the assignment. Additionally, we advise you to re-check the answers you picked before you submit the answers. Most errors that occurred during the assignments were caused by accidental inputs through a touchpad or unwanted keyboard strokes. The results of the exam will be published after week 8.

Your final score will be build up from the assignments as well as of the final exam. You have to reach at least 50% of the total reachable 198 points to get the certification (99 points and up). The total reachable score is composed of 99 points from the assignments and 99 points from the exam (which means the ratio between assignments and exam is exactly 50:50).

Feedback Request

We enjoyed doing the second iteration of this course together with you. We had many deep and

interesting discussions with you, that also gave us new perspectives on the content. On behalf of all future learners, we thank you for the many suggestions as well as hints to rework certain examples which improved the slides and the reading material. To sum up, it was again a great learning opportunity not just for you but also for us.

To further improve for the future, we are relying on your insights, thoughts, and feedback. For this, we kindly ask you to fill out our feedback survey after taking the exam. The survey will take about 5 to 10 minutes to complete.

Assignment Solutions

As a help to prepare for the exam, we added the assignment solutions for all closed assignments to the respective weeks. For your convenience, please see the collection of links below:

Assignment Solutions Week 1 Assignment Solutions Week 2 Assignment Solutions Week 3 Assignment Solutions Week 4 Assignment Solutions Week 5

Dear learning community,

as many of you asked for some more details on the correct answers for the assignment of week 1, we wrote the following explanations, that we want to gladly share with you.

If there are any questions left, please do not hesitate to ask in the forum.

Best regards,

(2)

______________________________________________________

Question Name: "Transactional Data for OLAP"

Text: "Using in-memory databases and transactional data for the Available to Promise (ATP) check has many advantages.

Which of the following is NOT an advantage, when we do the ATP check on transactional data?"

Correct answer:

 "Aggregating on the fly is faster than reading the aggregated sum"

Incorrect answers:

 "Programming is simplified, because data changes do not have to be reflected in aggregates anymore"

 "Calculations can be done more fine grained and based on real-time facts because data is not aggregated to monthly, daily, etc. sums"

 "Time consuming ETL processes can be abolished"

Explanation: "Reading pre-aggregated values is always the fastest way to retrieve information, because the actual work was done beforehand. Therefore, aggregating on the fly is not faster than reading a pre-aggregated value and this is NOT an advantage. The other points clearly are."

______________________________________________________

Question Name: "OLTP OLAP Separation Drawbacks" Text: "Why was the separation of OLAP and OLTP a bad idea? The separation should be reverted, because ..."

Correct answer:

 "latency between the separated storages prohibits up-to-data analytics"

 "the creation of materialized views and pre-computed aggregates negatively impacts OLTP performance"

 "query execution time is increased if we use two systems"  "analytical queries are not used any longer"

Explanation: "The separation causes a latency between the transactional and the analytical data,

because the analytical data has to be transformed and pre-optimized via an ETL (Extract-Transform-Load) process, which is time intensive and is therefore often only run once a night. The all in all query execution time is not increased on the OLTP system at all, in best case it decreases on the OLTP system because some workload is shifted to the OLAP system. Of course, during the update of the OLAP data some load is caused on the OLTP system, but since only new data is retrieved and this is done in periods of low usage, the overall performance was increased in general, at the cost of increased complexity and the inability to do analytical queries on up to date data."

(3)

______________________________________________________

Question Name: "Enterprise Data Characteristics"

Text: "Which characteristic does enterprise data NOT have?" Correct answer:

 "High entropy in many columns"

 "NULL and default values are dominant in many columns"  "Large number of columns (attributes)"

 "Very low entropy in many columns"

Explanation: "Various analyses of different enterprise systems from actual customers showed that most tables are \"sparse and wide\". Many columns are not even used. Furthermore, the columns are often dominated by default or NULL values. So there is a large number of columns, with very low entropy. The correct answer is therefore, that high entropy in columns is a characteristic that enterprise systems usually do NOT have."

______________________________________________________

Question Name: "NUMA"

Text: "What is NUMA (non uniform memory access)?" Correct answer:

 "In multi-core setups, it means that each processor can access the local memory of all other processors"

 "It means that data types can have variable size now and do not need to be uniform any longer"  "It is a standard that describes the physical structure of memory chips to fit in server-blades"

Explanation: "In NUMA systems, each processor has its own part of main memory that can be accessed very fast. Data, which is not in that local storage of a processor, has to be requested from non-local storage, i.e. another processor's local memory. In that implementation, all processors share the same adress space which simplifies memory management."

______________________________________________________

Question Name: "Moore's Law" Text: "What does Moore’s law state?" Correct answer:

(4)

 "In enterprise computing it is forbidden to run software on CPUs that are older than 2 years"  "Enterprise data is not needed any longer after 2 years and is erased due to data privacy

protection"

 "Computers tend to break by a mean time of 36 months"  "Clock speed doubles every 18 to 24 months"

Explanation: "In 1965, Gordon E. Moore stated that the number of components per integrated circuit is doubling approximately every two years. The other answers are just made up."

______________________________________________________

Question Name: "Architecture Name"

Text: "What is the name of the architecture in which a processor can access the cache of other processors?"

Correct answer:

 "Non-Uniform Memory Access (NUMA)"

 "Symmetric Multiprocessing (SMP)"  "Shared Nothing Architecture (SNA)"

Explanation: "In Shared Nothing architectures, every node is independent and does not share any memory with other processors. SMP systems have a shared main memory, but their chaches remain non-accessible to other processors. NUMA allows to access other nodes' local memory, therefore it is the correct answer."

______________________________________________________

Question Name: "NUMA and Cache Coherency"

Text: "Which statement concerning NUMA and cache coherency is correct?" Correct answer:

 "Most currently sold NUMA realizations come with special-purpose hardware to maintain cache coherency"

 "Most NUMA realizations currently in the market use software layers to maintain cache coherency"

 "Every program gains a huge performance boost from NUMA; no adaption of the software is needed to fully exploit the potential"

 "Cache coherency is no longer a concern when using NUMA architectures, since NUMA does not use caches at all"

(5)

Explanation: "As stated in the reading material, non ccNUMA hardware is practically non existent, because it is harder to program. Therefore, the terms NUMA and ccNUMA are usually used identically."

______________________________________________________

Question Name: "Disk Still Required"

Text: "Why is disk usage (for example SSDs) still required for SanssouciDB?" Correct answer:

 "Because logging and snapshotting need non-volatile storage"

 "Because non-volatile storage is needed for the hashing phase in the join operation"  "Because main memory is still slower than SSDs in some use cases"

 "Because some database operations can only be processed on disk"

Explanation: "DRAM is a volatile storage medium, meaning that it looses it's state and therefore it's information when the power supply is interrupted. For that reason, it is necessary to persist the information on another storage medium that is non-volatile, like SSDs or HDDs. Hardware producers are currently researching on non-volatile RAM like MRAM or PCRAM, that might make this obsolote in the future, but can not compete in means of capacity and price with proven solutions today."

______________________________________________________

Question Name: "Scan Duration"

Text: "How long does it take a quad core CPU to scan 2 columns with 10 million entries each and 50 byte per attribute in both columns? The assumed scan speed is 2 MB/ms per core."

Correct answer:  "125 ms" Incorrect answers:  "50 ms"  "250 ms"  "1 s"

Explanation: "2 columns with 10,000,000 entries with 50 bytes each result in a overall size of 1,000,000,000 bytes. 4 cores with a scan speed of 2 MB/ms each run through this amount of data in 1,000,000,000 B / (4 * 2,000,000 B/ms) = 125 ms."

just as for week 1, we wrote some explanations for the assignment questions of week 2 that we want to share with you.

(6)

Best regards,

your IMDB Teaching Team

______________________________________________________

Question Name: "Technical Building Blocks of Dictionary Encoding" Text: "What data structures are required for dictionary encoding?" Correct answer:

 "attribute vector and dictionary"

 "attribute vector and index"  "attribute vector and foreign key"  "index and dictionary"

Explanation: "The attribute vector holds references to the dictionary, where the actual values are stored. So, these two structures are sufficient for dictionary encoding."

______________________________________________________

Question Name: "Bit Representation of Countries"

Text: "What is the minimum number of bits required to represent all countries (about 200) in the world?" Correct answer:  "8" Incorrect answers:  "6"  "7"  "9"

Explanation: "The logarithm of the number of values to encode to the base to is log_2(200) = 7,64, so we have to choose the next integer above 7, which is 8, as the number of bits needed."

______________________________________________________

Question Name: "Compression Example Run Length Encoding Assignment"

Text: "Suppose there is a table where all 14 million inhabitants of Blueland are assigned to their cities. The table is sorted by city. Blueland consists of about 3,000 cities (represented by 12 bit). Further assume that inhabitants are uniformly distributed across cities. Using Run Length Encoding with a start position vector, what is the size of the compressed city vector? Always use the minimal number of bits required for any of the values you have to choose and include all needed auxiliary structures. Further assume the following conversions: 1MB = 1,000 KB, 1KB = 1,000B"

(7)

Correct answer:  "13.5 kB" Incorrect answers:  "6 kB"  "20.5 kB"  "2 MB"

Explanation: "We have to compute the size of a) the value array and b) the size of the start position array. The size of a) is the distinct number of cities (3,000) times the size of each field of the value array (log_2(3,000)). The size of b) is the number of entries in the dictionary (3,000) times the number of bits required to encode the highest possible number of inhabitants (log_2(14,000,000)). The total result is thus 12 bit times 3,000 (36,000) plus 24 bit times 3,000 (72,000), thus 108,000 bits (or 13.5 kB) in total."

______________________________________________________

Question Name: "Suitable Use Cases for Column Layout" Text: "A columnar layout is well suited to ..."

Correct answer:

 "... process sets and do full column scans"

Incorrect answers:  "... transform data"

 "... handle insert operations"  "... materialize full tuples"

Explanation: "A columnar layout is especially suited to do set operations and full column scans. The complexity of data transformation is not influenced by the chosen layout. Insert operations are more cumbersome in a columnar layout than in a row layout, since we have to distribute the values of a tuple over different columns and therefore place them in different memory regions. On top of that come auxiliary structures like dictionaries, which have to be kept up to date. These migth also be used on row layouts, so the layout is not the influencing part concerning to that. But in general, row layouts are better suited for inserts than columnar layouts. For the same reasons is tuple reconstruction easier on row layouts than on columnar layouts."

______________________________________________________

Question Name: "Suitable partitoning strategy"

Text: "Assume a table with customer data. Three different units in a company do support for the customers but also have to provide analytics about the customers they handle. The table is distributed over several servers, each unit has a server. What partitioning type is suited best if the company decided that the first organizational unit should handle customers with last names A-G, the second unit customers with last names H - R, and the last unit customers with last names S- Z?"

(8)

 "range partitioning"

Incorrect answers:  "hash partitioning"  "round robin partitioning"

Explanation: "Based on the decision to assign specifc ranges of names to specific organizational units, like the word says, range partitioning is most suited. Every organisational unit can have their primary data close to them, if we have a distributed server landscape. Hash partitioning could not guarantee the best locality. Of course, one could choose the first letter as the partitioning key, but at a closer look, this is range partitioning in disguise again. Round robin partitioning would clearly prohibit the locality advantage, since it would distribute the entries fairly over all servers, regardless of the actual information in the entries."

______________________________________________________

Question Name: "Delete implementation for hospital"

Text: "Assume you have to setup a new database for a hospital which allows the hospital staff to keep track of all their patient records. Which delete implementation should be prefered for that use case?"

Correct answer:  "Logical delete"

Incorrect answers:  "Doesn’t matter"  "Physical delete"

 "Depends on the number of patients"

Explanation: "Hospitals typically have very strict regulations for keeping patient data, regardless of their number of patients. They need to be able to look at patient histories for multiple years. When using only logical delete, the data is still available for queries concerning the past. A physical delete would not allow this."

______________________________________________________

Question Name: "New value in dictionary" Text: "Given the following entities:

Old dictionary: chair, drawer, monitor, table Old attribute vector: 0, 3, 0, 1, 2, 3, 3 Value to be inserted: lamp

What valueID is the monitor mapped to in the new attribute vector?"

Correct answer:  "3"

(9)

 "1"  "2"  "4"

Explanation: "The word monitor used to have the valueID 2. After inserting the value lamp and resorting the dictionary, the monitor will have the valueID 3, since the correct alphabethical oder is: chair (0), drawer (1), lamp (2), monitor (3), table (4). The updated attribute vector is 0, 4, 0, 1, 3, 4, 4, 2"

______________________________________________________

Question Name: "Dictionary resorting" Text: "Assume the following dictionaries:

Brand dictionary: Audi, BMW, Mercedes-Benz, Opel, Porsche, Volkswagen Year dictionary: 2007, 2008, 2009, 2010, 2011

Model dictionary: 3 Series, 5 Series, 7 Series, 911, A3, A4, A8, A class, Boxter, C class, E class, Golf, Jetta, Passat

And the following insert: INSERT INTO cars VALUES(‘Audi’, ‘2012’, ‘A7’); How many dictionaries need to be resorted?"

Correct answer:  "1" Incorrect answers:  "None"  "2"  "All"

Explanation: "‘Audi’ is already present in the brands dictionary. The year ‘2012’ simply gets appended to the year dictionary, since it is the last value with respect to existing the sorting order. When appending ‘A7’ to the model dictionary, it needs to be resorted."

______________________________________________________

Question Name: "Dictionary reordering after updates"

Text: "Consider the world population table (first name, last name) that includes all people in the world: Angela Mueller marries Friedrich Schulze and becomes Angela Schulze. Should the dictionary for the last name column be reordered?"

Correct answer:

 "No, because the value ‘Schulze’ is already in dictionary"

 "No, because ‘Schulze’ > ‘Mueller’ when compared lexicographically"  "Yes, because ‘Schulze’ is a new last name of Angela"

(10)

Explanation: "Mr. Friedrich Schulze is already in this table. Therefore ‘Schulze’ is in the last-name dictionary and its key can be taken to update the last name of Ms. Mueller (now Mrs. Schulze)"

______________________________________________________

Question Name: "Tuple reconstruction performance factors"

Text: "The number of attributes of the reconstructed tuple is an important factor that can influence the performance of the operation on the column layout. Which of the following is the right explanation for this behavior?"

Correct answer:

 "A new cache line must be read for every attribute of the tuple and thus the number of the processed bytes will increase"

 "The size of the reconstructed tuple increases"

 "There is a risk that the size of a whole tuple exceeds the size of a cache line and therefore it cannot be read in one cache access"

 "The number of attributes is not an important factor for the tuple reconstruction in a column layout, but the size of the tuple is the key factor, because the data is stored tuple-wise"

Explanation: "In a columnar layout, each attribute of a tuple has to be retrieved via an own cache access, because the attributes of one tuple are located faw away from each other in memory. So the number of bytes to be read increases with every attribute of the reconstructed tuple and this answer is correct. The size of the reconstructed tuple is not increased, this size is fixed for a given tuple. Of course, with every additional attribute, the whole tuple gets bigger, but the important factor that we want to describe within this question is that we need an additional cache access for every attribute, regardless how small it is. The risk, that a whole tuple could not fit into one cache line is also not of interest, since this, as described above, clearly does not reflect the retrieval behavior. Last but not least, the option that the number of attributes is not an important factor but the size of the tuple is the key factor is wrong, because it has the false addition \"because the data is stored tuple-wise\", which is not the case in a columnar layout."

______________________________________________________

Question Name: "Scan performance on column and row layout: table scan"

Text: "Given is a table with the following characteristics containing information about all customers in Germany:

- columns: CustomerId, Customer Name, City, Street, Status, Sector, Category; - size per field (uncompressed): 28 byte;

- number of rows: 500,000;

- cardinality of the city column: 12,200.

A user wants to know \"How many customers do we have in Berlin?\" How long will this query take?"

(11)

 "row store: 49 milliseconds"

 "row store with stride access: 16 milliseconds"

 "row store: 24.5 milliseconds"

 "row store with stride access: 8 milliseconds"

 "column store with dictionary compression: 5.5 milliseconds"  "column store with dictionary compression: 3.5 milliseconds"

We had a little glitch here, this question had 2 correct answers in it. Everybody who has chosen one of both correct answers got full points, these changes already have already been reflected in your results. (Update was done on 18.09.2013)

Explanation: "The needed time in a row store without stride access is (7 attributes * 28 byte per attribute per row * 500,000 rows / ( 2 MB / ms / core ) = 7 * 28 * 500,000 / 2,000,000ms = 49ms

The needed time in a row store with stride access is (size of a cache line * number of accesses / scan speed) = 64 * 500,000 / 2,000,000ms = 16ms;

The needed time in a column store is (number of needed bits to encode attribute * number of values / scan speed / cores) = (14 bit / 8 bit/byte) * 500,000 / 2,000,000 byte/ms = 0.4375ms

So, the answers for the row store with and without stride access are correct here." Dear Learning Community,

here are the explanations for the assignment questions of week 3.

If you notice any flaws or still have questions, please leave us a post in the forum for week 3.

Best regards,

your IMDB Teaching Team

Question Name: "Execution Plans" Text: "For any SELECT statement..." Correct answer:

 "several execution plans with the same result set, but differing performance may exist"

 "exactly one execution plan exists"

 "several executions plans may exist that deliver differing result sets"

 "there always exist exactly two execution plans, which perform identically in each aspect"

Explanation: "For any SELECT statement, several execution plans with the same result set, but different runtimes may exist. As an example, we want to query all men living in Italy from world population table;

(12)

the database offers three different execution plans. We could query for the gender 'male' first and then for the country 'Italy' in the result set or we start with the selection for 'Italy' and then we narrow the result to only males, or we might perform the two selections on 'male' and 'Italy' in parallel queries, both running on the full dataset and then create the intersection. All three execution plans create the same result set, but require different runtimes. For example its faster to query first for 'Italy'and then for 'male', because in this case first 8 billion entries (all entries) and then further select on the resulting 60 million entries (all Italiens), if you start with 'male' and then query for 'Italy' you have to scan through 8 billion (all Italiens) and then through 4 billion entries (all males)."

Question Name: "Scan Speed"

Text: "What is the assumed scan speed per core?" Correct answer:  "2 MB per ms" Incorrect answers:  "1 MB per ms"  "2 PB per ms"  "100 kB per ms"

Explanation: "Nothing to explain here, it is the assumed speed under normal conditions. For given SQL queries, the measured speeds will vary, since not all operations are purely scans and the scan has often to be interrupted."

Question Name: "Early Materialization" Text: "What is early materialization?" Correct answer:

 "A processing strategy, where valueIDs are decoded into actual values at the earliest time during processing"

 "An advanced aggregation strategy compared to basic ones like SUM"  "A strategy, which works on compressed data as long as possible"

Explanation: "Early materialization, in contrast to late materialization, decodes valueIDs into the actual values at the earliest time during processing. The other answers are wrong, for an analog description please have a look on the explanation of the question \"Late Materialization\"."

Question Name: "Faster Materialization Strategy"

Text: "Assuming the execution of the question which queries the complete data of all people in the world whose first name is NOT Jean-Pierre. Which of the following statements is true?"

(13)

Correct answer:

 "Both strategies will perform poorly, as the predicate selects a lot of values, i.e., the costs are bound by the dictionary lookups to materialize the result"

 "Late materialization should be faster, since the predicate can be evaluated solely by using the dictionary of the column fname."

 "Early materialization should be faster, since the result will contain less returned rows."

Explanation: "The reason for this question is, that I think it's important for people to be able to estimate result sizes, since especially column store can suffer (see SELECT *). Here they should be able to see, that this query is painfull and any strategy will perform poorly. "

Question Name: "Late Materialization" Text: "What is late materialization?" Correct answer:

 "A processing strategy, which restores the requested tuple at the latest possible point during processing"

 "An advanced aggregation strategy compared to basic ones like SUM"  "A strategy that works on uncompressed data as long as possible"  "Long batch operations that are run over night in big enterprises"

Explanation: "Late materialization, in contrast to early materialization, is a processing strategy, which aims at reconstructing the actual attribute values from the valueIDs at the latest possible point during processing. Working with the compressed integer values leads to speed advantages in most cases. Depending on the circumstances, there are also constellations where early materialization is favorable, however these situations are seldom. In general, late materialization is tried to keep up as long as

possible. The other answers are wrong. There is no way to determine or even a special name for complex or advanced aggregation strategies. To work on uncompressed data as long as possible is like the correct answer, just twisted. Because data is saved in a compressed format, it is not the case that we work on uncompressed data first and than compress it afterwards. The last wrong answer, that late materialization describes long batch operations run over night is just messing with the word \"late\". These scheduled batch jobs that are run over night have no distinguished term we know of, sometimes they are just called \"over-night batch jobs\"."

Question Name: "Querying the Differential Buffer"

Text: "Correctly complete the following sentence: Whereas write accesses are going against the differential buffer, ..."

Correct answer:

(14)

 "read accesses are going against the main store"  "read accesses against the differential buffer are denied"  "read access are cached in a row-oriented format"

Explanation:"Read accesses have to go against the main store and the differential buffer, in order retrieve the newest valid entries. Just querying the main store would return potentially outdated

information or lack completely new entries, just querying the differential buffer would not suffice since it holds only a fraction of the total data. Read accesses are not cached for any reason, regardless of the format."

Question Name: "Statements Concerning the Differential Buffer" Text: "Which statement related to the differential buffer is true?" Correct answer:

 "Since the dictionary of the differential buffer is unsorted, range selects on the differential buffer are less efficient than on the main store"

 "Tuples in the differential buffer require less memory, because the advanced compression techniques used there result in better compression than in the main store"

 "The differential buffer is read-optimized"

 "The differential buffer should not exceed the size of a cache line (64 byte) for performance reasons"

Explanation:"The differential buffer has an unsorted dictionary to improve the write performance. Therefore it is not read-optimized. Tuples in the differential buffer require at least as much memory than the associated tuples in the main store, since no additional compression is employed. The size of a cache line is certainly not enough for the differential buffer. The correct answer is therefore, that range selects are less efficient in the differential buffer, because the dictionary is not sorted. Even with a CSB+ tree, which allows fast access on the unsorted values, we can not determine whether a value is in the desired range by just checking for example whether it is >=10 and <20, but we have to check against a set of 10 individual values."

Question Name: "Point Representation"

Text: "Considering point representation and a table with one tuple, that was invalidated five times, how many tuples have to be checked to insert a new tuple?"

Correct answer:

 "None, the tuple can just be inserted with the current time stamp"

 "Only one, that is, the most recent one"  "Five"

(15)

 "Two, the oldest and the newest to determine the time stamp"

Explanation: "In point representation, only the tuple with the newest time stamp is valid. However, when inserting a new tuple, no other tuple must be checked, because the new one will always have the latest time stamp."

Question Name: "Time Travel Queries" Text: "What is a time travel query?" Correct answer:

 "A query that enables the user to view the data as it was at a specified earlier point in time"

 "A query that inserts future time stamps"

 "A query that automatically projects current developments into the future"  "A very short and therefore fast query"

Explanation:"Time travel queries allow users to easily query the database for historic data. They can state that they want to see the database "as it was on a certain point in time" and see exactly the results that were valid then. Future time stamps are never inserted in a database as creation or update

timestamps, since they would contradict the integrity of the data. Queries that make analyses based on assumptions and project developments into the future are called "predictive queries". Short and fast queries have no special name."

Question Name: "Asynchronous Merge"

Text: "The merge process is asynchronous. This means:" Correct answer:

 "It allows reading and writing of tuples during the merge phase"

 "During the merge process, no data modifications or queries can be performed until the merge is completed. All queries are saved and delayed to be executed immediately after the merge finished."

 "Immediately after one record is merged, a new query is executed and thus the overall execution is interleaved"

 "After each INSERT operation the merge process is triggered"

Explanation: "An asynchronous merge is essential for real time settings, since it allows reading and writing of tuples during the merge phase. This functionality is necessary to decouple the inner

optimizations of the database from the availability of the database in general with regards to accepting incoming data and queries. Saving or buffering write queries until the merge finished would not yield any benefit to the direct appliance of the modifications into the new differential buffer. In contrast, it would either prohibit also read operations, or it would lead to inconsistent reads because all modifications that were applied in the meantime, would not be reflected before the merge finished. Interleaving as described

(16)

in the wrong answer, would also lead to unpredictable behavior, since one does not know exactly when a read or write operation will take place. Triggering the merge after each INSERT would simply render the whole concept of a differential buffer useless."

Question Name: "Understanding Dictionary Inserts"

Text: "Inserting a new element directly into a sorted dictionary ALWAYS causes the following penalty" Correct answer:

 "If the new element cannot be simply appended, the correct position for the new element has to be found and all following elements have to be moved"

 "The performance of the insert into the dictionary is bound by the write performance to disk"  "Because the dictionary is unsorted no penalty occurs"

 "The new element is always inserted at the beginning of the dictionary, thus, all following elements have to be moved"

Explanation: "If the element fits in the dictionary's last position, no penalty is paid, as the element can simply be appended. However, if the element has to be inserted into the dictionary, every entry behind the insert position has to be shifted by one position."

just as for the former weeks, below you will find the explanations for the assignment questions of week 4. If you notice any flaws or still have questions, please write us a post in the forum for week 4.

Best regards, your IMDB Teaching Team

Question Name: "Amdahl's Law II"

Text: "What is the maximum theoretical speedup of a program consisting of a 50% sequential part executed by four individual processors?"

Correct answer:  "1.6" Incorrect answers:  "0.2"  "4"  "2"

Explanation: "The calculation is as follows: 1/ (s + p/N) = 1/(0.5 + 0,5/4) = 1.6, where s is the serial part of the program, p ( p = 1-s ) the parallel part and N is the number of used cores."

(17)

Question Name: "SIMD"

Text: "SIMD - Single instruction, multiple data..." Correct answer:

 "describes computers that perform a single operation on multiple data items in parallel"

 "is a processor optimized for single instructions"  "is only required for computer graphic operations"  "is an outdated programming language"

Explanation: "Because this question asks for the definition of an abbreviation, there is not much to explain. As the question text already hints, the correct answer is that SIMD describes computers that perform a single operation on multiple data items in parallel. It is not only required for computer graphic operations and it is not an outdated programming language, either. The answer that it is a processor optimized for single instructions does not cover the whole meaning of SIMD and additionally, SIMD describes an architecture, not a specific processor. Therefore, this answer is wrong, too."

Question Name: "Rare Look-Ups in a Column"

Text: "If a column is very rarely used to identify or look-up specific tuples..." Correct answer:

 "… we probably don’t need an index on top of it, but use a full column scan"

 "… an index on top of it is a \"must-have\""  "… we cannot use a full column scan"  "... we should delete it"

Explanation: "This question is basically about optimization and judgement in actual use cases. Indexes need additional memory and have to be maintained, so their benefits come at the cost of additional overhead and complexity. Therefore, they are normally only used on frequently accessed columns. A full column scan without an index is a vbit more expensive with regards to computational cost, but this is acceptable if the column is only used seldomly. So this is the correct answer. With that, the option \"we cannot use a full column scan\" is clearly wrong, and because we should never delete any information from our database, the last option is obviously wrong, too."

Question Name: "Hash-Join Complexity"

Text: "What is the complexity of the the Hash-Join?" Correct answer:

(18)

Incorrect answers:  "O(n*m)"

 "O(n*log(n)+m+log(m))"  "O(n²/m²)"

Explanation: "Let m and n be the cardinality of the input relations M and N with m <= n. The hash function used in the first phase of a Hash-Join, maps a variable-length value to a fixed length value in constant time and is applied to the smaller input relation. Thus, it takes m operations. In the second phase, the attribute vector of the larger relation is probed against the hash table which was generated in the first step. Again, the hash function is used and therefore n constant time operations take place. In short, the Hash-Join has a complexity of O(m+n)."

Question Name: "Many-to-Many Relation" Text: "What is a many-to-many relation?" Correct answer:

 "A many-to-many relation between two tables means that each object on the left side is joined to zero or more objects on the right side of the table and vice versa each object on the right side has zero or more join partners on the left side of the table"

 "A many-to-many relation between two tables means that for each object on the left side, there are one or more objects on the right side of the joined table"

 "A many-to-many relation between two tables means that for exactly one object on the left side of the join exists exactly one object on the right side"

 "A many-to-many relation between two tables randomly combines each object of the left table with many objects from the right table and vice versa"

Explanation: "The first wrong answers makes no statement concerning the cardinalities from the right side to the left side of the joined table. Additionally, the statement \"one or more objects\" is not correct, since it has to be \"zero or more\" objects. The second wrong answer would enfore a 1-1 relationship and additionally would not allow any entries without partners. This requirement does not have a common name. To clarify, the relationship \"one-to-one\" allows also entries without partners and is written \"0..1 - 0..1\" in short. The third incorrect answer states that the entries are connected randomly, which is never the case in relational databases."

Question Name: "Specific Aggregate Function" Text: "Which of the following is an aggregate function?" Correct answer:

 "MAXIMUM"

Incorrect answers:  "INSERT"

(19)

 "SELECT"  "DELETE"

Explanation: "MAXIMUM is an aggregate function, because it calculates its result from a group / set of input attributes. INSERT, SELECT and DELETE are SQL keywords that start a query and determine the operation that should be done on the respective tables, rows or columns."

Question Name: "Performance Impact of Parallel Select"

Text: "What has to be considered when parallelizing the execution of a SELECT statement?" Correct answer:

 "The time needed to split up the data set, to distribute the sub-tasks, and to collect the results"

 "The time needed to set up an index for each chunk of data"

 "For large data sets, the performance will decrease, because different sub-tasks often lock each other"

 "The size of shared memory that is needed to store messages"

Explanation: "To enable the parallelization of SELECT statements, a certain administration, i.e. the correct distribution of the splitted data set to the parallel workers and the collection of their results, has to be done additionally to the SELECT itself. As these are operations, which need computing resources, also additional time is needed, which results in situations where an unparallelized SELECT runs faster than the parallel variant because of this overhead for small datasets."

Question Name: "Scheduling"

Text: "Scheduling in a database context is..." Correct answer:

 "a strategy to assign the computer's resources such as processing power, memory, or network bandwidth to running queries"

 "a strategy to filter out nonrelevant results of database queries"  "a strategy to produce more exact query results"

 "a strategy to format the results of database queries for presentation"

Explanation: "Scheduling in general is the allocation of available resources to processes. In a database context, these processes are usually queries or stored procedures. The other answers are wrong."

Question Name: "Transactional Query Response Times" Text: "The response time for transactional workloads..." Correct answer:

(20)

 "has to be guaranteed, even for peak-load situations"

 "is always so fast that it is negligible"  "is of no importance"

 "is always much slower than that of analytical queries"

Explanation: "Response times are always important and unfortunately they are not always negligible for transactional workloads. Usually, transactional queries are much faster than analytical queries, but it is additionally required to guarantee some maximal response times in order to rule out negative effects to business processes. For a query that is needed to conclude the check out in a shopping process, it is essential to guarantee a response time even under peak loads to enable efficient marketing. If the system can not handle peak loads that are induced by promotions, profit is would be lost."

Question Name: "Exclusive Data Subsets"

Text: "Why do merge threads in the hash phase of the parallel join algorithm work on exclusively assigned subsets of the values?"

Correct answer:

 "To reduce the write synchronization overhead on the unified hash table to a minimum"

 "To maximize the number of cache misses"

 "To increase the write synchronization overhead on the unified hash table to a maximum"  "Because a typical cache line is only 64 bytes long"

Explanation: "If a thread is exclusively working on a dataset of distinct values determined by a hash function, it does not need any write synchronization to store its intermediate values in the result structure. Of course, we do not want to increase synchronization overhead or maximize the amount of cache misses. The length of a cache line also has no effect on the assignment of data subsets to merger threads."

Question Name: "Sizes of Intermediate Tables"

Text: "Why should intermediate hash tables in a parallel join algorithm be cache-sized?" Correct answer:

 "To avoid cache misses for random lookups in the hash table"

 "Because the probing phase cannot be parallelized with hash tables larger than the cache size"  "Because hash table lookup performance decreases logarithmically with the table size"

(21)

Explanation: "Intermediate hash tables in a parallel join algorithm should be cache-sized to avoid cache misses for random lookups in the hash table. Otherwise, a hash table would force its own data out of the cache during the process. The probe phase can be parallized regardless of the size of the intermediate hash tables. The lookup performance does not decrease logarithmically with the table size, and the table size also has no effect on the access pattern that is used."

Question Name: "Aggregation - HAVING" Text: "The HAVING clause is used to express..." Correct answer:

 "additional filter criteria based on aggregated values"

 "that the aggregate function shall be computed for every distinct value (or value combination) of the specified attribute(s)"

 "the sort order of values in the result set"

 "the number of subsets the results should be split into"

Explanation: "The HAVING clause acts like sort of a WHERE clause on SETs. It checks for

characteristics of the set as a whole and therefore has to take the results of aggregate functions into account for these checks. The other answers are wrong. The clause that determines on which groups sharing the same distinct values the specified aggregate functions should be computed is the GROUP BY clause. THe sort order is expressed by the SORT BY clause and the number of subsets into which the result is split is not determined by an additional specific clause, but it is a result of the GROUP BY clause."

Question Name: "Number of Threads"

Text: "How many threads will be used during the first (hashing) phase of the described parallel

aggregation algorithm when the table is split into 20 chunks and the GROUP BY attribute has 6 distinct values?" Correct answer:  "20 threads" Incorrect answers:  "14 threads"  "6 threads"  "26 threads"

Explanation: "The maximum number of used threads is limited by the number of tasks which may be processed separately in the hashing phase. In this case, like in most typical database queries, the number of tasks is determined by the number of data chunks, therefore the correct answer is 20 here. The number of distinct values does not limit the parallelism here."

(22)

Question Name: "Data Persistence"

Text: "Which data medium provides the best requirements to guarantee the persistence of the data?" Correct answer:  "SSD" Incorrect answers:  "DRAM"  "CPU"  "GPU"

Explanation: "Solid State Drives (SSD) offer non-volatile storage, while the others lose stored data without a steady power connection. In general, CPUs (Central Processing Units) and GPUs (Graphics Processing Units) are not storage but processing devices, as the name already indicates."

Question Name: "Dictionary-Encoded Logging"

Text: "When is Dictionary-Encoded Logging superior to Logical Logging?" Correct answer:

 "The number of distinct values is low"

 "The number of distinct values is high"  "Large values are inserted only one time"  "All values are different"

Explanation: "Dictionary-Encoded Logging is superior to logical logging, if the number of distinct values is low. According to the principles we already learned in the chapter \"Dictionary Encoding\", the resulting compression depends on the amount of redundant information."

Question Name: "Recovery Process"

Text: "When recovering an in-memory database system after a server failure..." Correct answer:

 "the latest snapshot of the main store has to be loaded into main memory and logs have to be replayed"

 "the latest snapshot has to be loaded and all users have to redo/restart the transactions that were lost"

(23)

 "the system can rely on the consistent state of data in main memory and only has to replay the latest log files"

 "only the caches have to be refilled, everything else is still available in main memory"

Explanation: "To recover an in-memory database, the latest snapshot of the main store has to be loaded into main memory and the additional logs have to be replayed. Since main memory is volatile, there is no consistent state in main memory and the system can not rely on that. Also, users do not have to redo their transactions."

Question Name: "Data Access Patterns"

Text: "How is data access distributed among hot and cold data?" Correct answer:

 "Probably more than 90% of all queries (transactional and analytical) go against hot data"

 "About 80% of all queries (transactional and analytical) go against cold data"  "Data accesses are uniformly distributed between hot and cold data"

 "Neither hot nor cold data is accessed during enterprise operations"

Explanation: "The correct answer here is based on our findings. The wrong answers can be sorted out with pure logic. If 80% of all data accesses would go against cold data, this data would not be called cold. If the accesses would be uniformly distributed between hot and cold data, a separation into hot and cold would not be possible at all. The last answer \"neither hot nor cold data is accessed during enterprise operations\" is false for the same reason. If neither hot nor cold data is accessed, our definition of hot and cold data would be void."

Question Name: "Hot Data" Text: "What is hot data?" Correct answer:

 "Data, which is still accessed frequently and on which updates are still expected"

 "Data which is not modified any longer"  "Data that is used infrequently"

 "The data within the database, which belongs to the result of the current query"

Explanation: "Data that is active and used frequently is also called hot data. The connotation with temperature shall reflect the usage frequency. The origin of the term is not 100% sure, but since the atoms of hot items oscillate at a higher frequency, this is a common explanation for the usage of the word \"hot\" to describe data that is accessed with high frequency. Heat maps, a specific 2 dimensional chart to map activeness of a variety of items, also relates to the temperature in order to distinguish the mapped items. The other possible answers are wrong. Data which is used only infrequently is \"cold data\" in our terms. Data that is not modified any longer can not automatically be assigned as hot or cold. It might be

(24)

the case that certain read-only entries are accessed very often and would therefore be considered hot, but it might also be the case that these entries are just saved for the sake of historic completeness and might be cold. The set of data which belongs to the result of a query is simply called result-set."