The basics of creating a table in Hive are similar to typical SQL, but there are a number of extensions, particularly for dealing with different fi le and record formats. A basic table can be created with the following:
CREATE TABLE MsBigData.customer ( name STRING,
city STRING, state STRING,
postalCode STRING,
purchases MAP<STRING, DECIMAL> );
This table holds some basic customer information, including a list of the customer purchases in a MAP column, where the key is the product name, and value is the amount paid. To copy the schema for an existing table, you can use the LIKE keyword:
CREATE TABLE IF NOT EXISTS MsBigData.customer2 LIKE MsBigData.customer; You can use the SHOW command to list the tables in either the current database or other databases. The DESCRIBE command can also be used with tables:
SHOW TABLES;
SHOW TABLES IN default; DESCRIBE MsBigData.customer;
N O T E You might have noticed that no primary or foreign keys are defi ned on these tables, nor any NOT NULL or other column constraints. Hive doesn’t support these options because it doesn’t have any way to enforce constraints on the data. In a relational system, these constraints help enforce data quality and consistency and are generally enforced when the data is inserted into a table (schema on write). Hive doesn’t control the data, so it can’t enforce the constraints.
Tables can be removed by using the DROP TABLE command, and renamed using the ALTER TABLE statement. Columns can be renamed or have their types changed using CHANGE COLUMN, and they can be added or deleted using ADD COLUMNS and REPLACE COLUMNS, respectively. Replacing columns deletes any column not included in the new column list:
DROP TABLE MsBigData.customer2;
ALTER TABLE customer RENAME TO customer_backup; ALTER TABLE customer_backup
CHANGE COLUMN name fullname STRING; ALTER TABLE customer_backup
ADD COLUMNS ( country STRING); ALTER TABLE customer_backup REPLACE COLUMNS (
name STRING, city STRING, state STRING, postalCode STRING,
purchases MAP<STRING, DECIMAL>);
W A R N I N G Using ALTER TABLE to modify a table changes the metadata for the table. It does not modify the data in the fi les. This option is useful for correcting mis- takes in the schema for a table, but any data issues have to be cleaned up separately.
120 Part III ■ Storing and Managing Big Data
As discussed in the “Custom File and Record Formats” section, Hive gives you control over the record format. In the preceding CREATE TABLE statement, the Hive defaults are used; it expects text fi les in delimited format, with Ctrl-A
(octal 001) as a fi eld delimiter. To control that format, Hive supports explicitly declaring the format options. The preceding table, with explicit delimiters defi ned, would look like this:
CREATE TABLE MsBigData.customer ( name STRING,
city STRING, state STRING, postalCode STRING,
purchases MAP<STRING, DECIMAL> )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n' STORED AS TEXTFILE;
The fi le format is controlled by the STORED AS portion of the statement. To use the SEQUENCEFILE fi le format, you replace STORED AS TEXTFILE with STORED AS SEQUENCEFILE. To use custom fi le formats, you specify the INPUTFORMAT and
OUTPUTFORMAT options directly. For example, here is the specifi cation for the RCFile
format. The value in the string is the class name for the fi le format to be used: STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
The row format options are controlled by the ROW FORMAT portion. The delim- ited SerDe is the default. To specify a custom SerDe, use the SERDE keyword followed by the class name of the SerDe. For example, the RegexSerDe can be specifi ed as follows:
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' Another important option in table creation is the EXTERNAL option. By default, when you create a table without specifying EXTERNAL, it is created as a managed table. This means that Hive considers itself the manager of the table, including any data created in it. The data for the table will be stored in a subdirectory under the database folder, and if the table is dropped, Hive will remove all the data associated with the table.
However, if you use CREATE EXTERNALTABLE to create the table, Hive creates the metadata for the table, and allows you to query it, but it doesn’t consider itself the owner of the table. If the table is dropped, the metadata for it will be
deleted, but the data will be left intact. External tables are particularly useful for data fi les that are shared among multiple applications. Creating the Hive table defi nition allows it to be queried using the power of Hive, but it makes it clear that the data is shared with other applications.
When you use the EXTERNAL keyword, you must also use the LOCATION option: CREATE EXTERNAL TABLE MsBigData.customer (
name STRING, city STRING, state STRING, postalCode STRING,
purchases MAP<STRING, DECIMAL> )
LOCATION 'user/MyCustomerTable';
You can the LOCATION option with managed tables, as well, but it’s not necessary unless you want a table that Hive manages that is also stored in a directory that Hive doesn’t manage. For clarity, it’s recommended that LOCATION be used only with external tables.
W A R N I N G Be aware that, regardless of whether the table is managed or exter- nal, the data is still accessible through the Hadoop fi le system. Files can be added or deleted by anyone with access to Hadoop. So, even for managed tables, Hive doesn’t really take full control of the data fi les.