Using Hadoop Distributed File System (HDFS) Tables
2. Author and run code for a MapReduce job Use the same import statements shown in “Example 1 Read Custom-Formatted Data from HDFS” on page 85
Note: Omit the port number when using the gpmr-1.0-gnet-1.0.0.1 connector.
MapReduce Sample Code /*
* DEMO Map/Reduce class test2
* -- Convert GPDBFormat back to TEXT */
public static class Map_test2 extends Mapper<LongWritable, GPDBWritable, Text, NullWritable> {
public void map(LongWritable key, GPDBWritable value, Context context ) throws IOException {
try {
context.write(new Text(value.toString()), NullWritable.get()); } catch (Exception e) { throw new IOException (e.getMessage()); } }
}
public static void runTest2() throws Exception{ Configuration conf = new Configuration(true); Job job = new Job(conf, "test2");
job.setJarByClass(demoMR.class);
Using Hadoop Distributed File System (HDFS) Tables 88 job.setOutputKeyClass (Text.class); job.setOutputValueClass(NullWritable.class); job.setOutputFormatClass(TextOutputFormat.class); job.setMapperClass(Map_test2.class); GPDBInputFormat.setInputPaths (job, new Path("/demo/data/writeFromGPDB_42"));
GPDBOutputFormat.setOutputPath(job, new Path("/demo/data/MRTest2")); job.waitForCompletion(true);
}
Creating and Using Web External Tables
CREATE EXTERNAL WEB TABLE creates a web table definition. Web external tables allow Greenplum Database to treat dynamic data sources like regular database tables. Because web table data can change as a query runs, the data is not rescannable. You can define command-based or URL-based web external tables. The definition forms are distinct: you cannot mix command-based and URL-based definitions.
Command-based Web External Tables
The output of a shell command or script defines command-based web table data. Specify the command in the EXECUTE clause of CREATE EXTERNAL WEB TABLE. The data is current as of the time the command runs. The EXECUTE clause runs the shell command or script on the specified master, and/or segment host or hosts. The
command or script must reside on the hosts corresponding to the host(s) defined in the
EXECUTE clause.
By default, all active segments run the command on all segment hosts. For example, if each segment host runs four primary segment instances, the command runs four times per segment host. You can optionally limit the number of segment instances that execute the web table command. All segments included in the web table definition in the ON clause run the command in parallel.
The command that you specify in the external table definition executes from the database and cannot access environment variables from .bashrc or .profile. Set environment variables in the EXECUTE clause. For example:
=# CREATE EXTERNAL WEB TABLE output (output text)
EXECUTE 'PATH=/home/gpadmin/programs; export PATH; myprogram.sh'
FORMAT 'TEXT';
Scripts must be executable by the gpadmin user and reside in the same location on the
master or segment hosts.
The following command defines a web table that runs a script once on each segment host.
=# CREATE EXTERNAL WEB TABLE log_output (linenum int, message text)
FORMAT 'TEXT' (DELIMITER '|'); URL-based Web External Tables
A URL-based web table accesses data from a web server using the HTTP protocol. Web table data is dynamic: the data is not rescannable.
Specify the LOCATION of files on a web server using http://. The web data file(s) must reside on a web server that Greenplum segment hosts can access. The number of URLs specified corresponds to the number of segment instances that work in parallel to access the web table. For example, if you specify 2 external files to a Greenplum Database system with 8 primary segments, 2 of the 8 segments access the web table in parallel at query runtime.
The following sample command defines a web table that gets data from several URLs.
=# CREATE EXTERNAL WEB TABLE ext_expenses (name text,
date date, amount float4, category text, description text) LOCATION ( 'http://intranet.company.com/expenses/sales/file.csv', 'http://intranet.company.com/expenses/exec/file.csv', 'http://intranet.company.com/expenses/finance/file.csv', 'http://intranet.company.com/expenses/ops/file.csv', 'http://intranet.company.com/expenses/marketing/file.csv', 'http://intranet.company.com/expenses/eng/file.csv' ) FORMAT 'CSV' ( HEADER );
Loading Data Using an External Table
Use SQL commands such as INSERT and SELECT to query a readable external table, the same way that you query a regular database table. For example, to load travel expense data from an external table, ext_expenses, into a database table,
expenses_travel:
=# INSERT INTO expenses_travel
SELECT * from ext_expenses where category='travel';
To load all data into a new database table:
=# CREATE TABLE expenses AS SELECT * from ext_expenses;
Loading and Writing Non-HDFS Custom Data
Greenplum supports TEXT and CSV formats for importing and exporting data. You can load and write the data in other formats by defining and using a custom format or custom protocol.
• Using a Custom Format • Using a Custom Protocol
For information about importing custom data from HDFS, see “Reading and Writing Custom-Formatted HDFS Data” on page 84.
Using Hadoop Distributed File System (HDFS) Tables 90 Using a Custom Format
You specify a custom data format in the FORMAT clause of CREATE EXTERNAL TABLE.
FORMAT ‘CUSTOM’ (formatter=format_function, key1=val1,...keyn=valn)
Where the ‘CUSTOM’ keyword indicates that the data has a custom format and
formatter specifies the function to use to format the data, followed by comma-separated parameters to the formatter function.
Greenplum Database provides functions for formatting fixed-width data, but you must author the formatter functions for variable-width data. The steps are as follows. 1. Author and compile input and output functions as a shared library.