• No results found

Unloading Data from Greenplum Database

In document Greenplum Database 4.2 (Page 107-111)

Specifying SEGMENT REJECT LIMIT runs the COPY operation in single row error isolation mode. Specify the acceptable number of error rows on each segment, after which the entire COPY FROM operation fails and no rows load. The error row count is for each Greenplum segment, not for the entire load operation.

If the COPY operation does not reach the error limit, Greenplum loads all

correctly-formatted rows and discards the error rows. The LOG ERRORS INTO clause allows you to keep error rows for further examination. Use LOG ERRORS INTO to declare an error table in which to write error rows. For example:

=> COPY country FROM '/data/gpdb/country_data' WITH DELIMITER '|' LOG ERRORS INTO err_country SEGMENT REJECT LIMIT 10 ROWS;

See “Viewing Bad Rows in the Error Table” on page 96 for information about investigating error rows.

Optimizing Data Load and Query Performance

Use the following tips to help optimize your data load and subsequent query performance.

Drop indexes before loading data into existing tables.

Creating an index on pre-existing data is faster than updating it incrementally as each row is loaded. You can temporarily increase the maintenance_work_mem

server configuration parameter to help speed up CREATE INDEX commands, though load performance is affected. Drop and recreate indexes only when there are no active users on the system.

Create indexes last when loading data into new tables. Create the table, load the data, and create any required indexes.

Run ANALYZE after loading data. If you significantly altered the data in a table, run

ANALYZE or VACUUM ANALYZE to update table statistics for the query planner. Current statistics ensure that the planner makes the best decisions during query planning and avoids poor performance due to inaccurate or nonexistent statistics. Run VACUUM after load errors. If the load operation does not run in single row error

isolation mode, the operation stops at the first error. The target table contains the rows loaded before the error occurred. You cannot access these rows, but they occupy disk space. Use the VACUUM command to recover the wasted space.

Unloading Data from Greenplum Database

A writable external table allows you to select rows from other database tables and output the rows to files, named pipes, to applications, or as output targets for Greenplum parallel MapReduce calculations. You can define file-based and web-based writable external tables.

This section describes how to unload data from Greenplum Database using parallel unload (writable external tables) and non-parallel unload (COPY).

Defining a File-Based Writable External Table

Unloading Data Using a Writable External Table Unloading Data Using COPY

Defining a File-Based Writable External Table

Writable external tables that output data to files use the Greenplum parallel file server program, gpfdist, or the Hadoop Distributed File System interface, gphdfs.

Use the CREATE WRITABLE EXTERNAL TABLE command to define the external table and specify the location and format of the output files. See “Using the Greenplum Parallel File Server (gpfdist)” on page 79 for instructions on setting up gpfdist for use with an external table and “Using Hadoop Distributed File System (HDFS) Tables” on page 81 for instructions on setting up gphdfs for use with an external

table.

With a writable external table using the gpfdist protocol, the Greenplum segments send their data to gpfdist, which writes the data to the named file.

gpfdist must run on a host that the Greenplum segments can access over the network. gpfdist points to a file location on the output host and writes data received from the Greenplum segments to the file. To divide the output data among multiple files, list multiple gpfdist URIs in your writable external table definition.

A writable external web table sends data to an application as a stream of data. For example, unload data from Greenplum Database and send it to an application that connects to another database or ETL tool to load the data elsewhere. Writable external web tables use the EXECUTE clause to specify a shell command, script, or application to run on the segment hosts and accept an input stream of data. See “Defining a Command-Based Writable External Web Table” for more

information about using EXECUTE commands in a writable external table definition.

You can optionally declare a distribution policy for your writable external tables. By default, writable external tables use a random distribution policy. If the source table you are exporting data from has a hash distribution policy, defining the same distribution key column(s) for the writable external table improves unload

performance by eliminating the requirement to move rows over the interconnect. If you unload data from a particular table, you can use the LIKE clause to copy the column definitions and distribution policy from the source table.

Example 1—Greenplum file server (gpfdist)

=# CREATE WRITABLE EXTERNAL TABLE unload_expenses ( LIKE expenses )

LOCATION ('gpfdist://etlhost-1:8081/expenses1.out', 'gpfdist://etlhost-2:8081/expenses2.out') FORMAT 'TEXT' (DELIMITER ',')

DISTRIBUTED BY (exp_id);

Example 2—Hadoop file server (gphdfs)

Unloading Data from Greenplum Database 102

( LIKE expenses )

LOCATION ('gphdfs://hdfslhost-1:8081/path') FORMAT 'TEXT' (DELIMITER ',')

DISTRIBUTED BY (exp_id);

You can only specify a directory for a writable external table with the gphdfs

protocol. (You can only specify one file for a readable external table with the gphdfs

protocol)

Note: The default port number is 9000.

Defining a Command-Based Writable External Web Table You can define writable external web tables to send output rows to an application or script. The application must accept an input stream, reside in the same location on all of the Greenplum segment hosts, and be executable by the gpadmin user. All

segments in the Greenplum system run the application or script, whether or not a segment has output rows to process.

Use CREATE WRITABLE EXTERNAL WEB TABLE to define the external table and specify the application or script to run on the segment hosts. Commands execute from within the database and cannot access environment variables (such as $PATH). Set environment variables in the EXECUTE clause of your writable external table

definition. For example:

=# CREATE WRITABLE EXTERNAL WEB TABLE output (output text) EXECUTE 'export PATH=$PATH:/home/gpadmin/programs; myprogram.sh'

FORMAT 'TEXT'

DISTRIBUTED RANDOMLY;

The following Greenplum Database variables are available for use in OS commands executed by a web or writable external table. Set these variables as environment variables in the shell that executes the command(s). They can be used to identify a set of requests made by an external table statement across the Greenplum Database array of hosts and segment instances.

Table 7.4 External Table EXECUTE Variables

Variable Description

$GP_CID Command count of the session executing the external table statement.

$GP_DATABASE The database in which the external table definition resides. $GP_DATE The date on which the external table command ran.

$GP_MASTER_HOST The host name of the Greenplum master host from which the external table statement was dispatched.

$GP_MASTER_PORT The port number of the Greenplum master instance from which the external table statement was dispatched.

$GP_SEG_DATADIR The location of the data directory of the segment instance executing the external table command.

Disabling EXECUTE for Web or Writable External Tables

There is a security risk associated with allowing external tables to execute OS commands or scripts. To disable the use of EXECUTE in web and writable external table

definitions, set the gp_external_enable_exec server configuration parameter to off in your master postgresql.conf file:

gp_external_enable_exec = off

Unloading Data Using a Writable External Table

Writable external tables allow only INSERT operations. You must grant INSERT

permission on a table to enable access to users who are not the table owner or a superuser. For example:

GRANT INSERT ON writable_ext_table TO admin;

To unload data using a writable external table, select the data from the source table(s) and insert it into the writable external table. The resulting rows are output to the writable external table. For example:

INSERT INTO writable_ext_table SELECT * FROM regular_table;

Unloading Data Using COPY

COPYTO copies data from a table to a file (or standard input) on the Greenplum master host using a single process on the Greenplum master instance. Use COPY to output a table’s entire contents, or filter the output using a SELECT statement. For example:

COPY (SELECT * FROM country WHERE country_name LIKE 'A%') TO '/home/gpadmin/a_list_countries.out';

$GP_SEG_PG_CONF The location of the postgresql.conf file of the segment instance executing the external table command.

$GP_SEG_PORT The port number of the segment instance executing the external table command.

$GP_SEGMENT_COUNT The total number of primary segment instances in the Greenplum Database system.

$GP_SEGMENT_ID The ID number of the segment instance executing the external table command (same as dbid in gp_segment_configuration). $GP_SESSION_ID The database session identifier number associated with the external

table statement.

$GP_SN Serial number of the external table scan node in the query plan of the external table statement.

$GP_TIME The time the external table command was executed. $GP_USER The database user executing the external table statement. $GP_XID The transaction ID of the external table statement. Table 7.4 External Table EXECUTE Variables

In document Greenplum Database 4.2 (Page 107-111)