Apache Pig

Developers	Apache Software Foundation, Yahoo Research
Initial release	September 11, 2008; 17 years ago (2008-09-11)

Stable release	0.17.0 / June 19, 2017; 8 years ago (2017-06-19)

Repository	svn.apache.org/repos/asf/pig/
Operating system	Microsoft Windows, OS X, Linux
Type	Data analytics
License	Apache License 2.0
Website	pig.apache.org

History

Apache Pig was originally^[4] developed at Yahoo Research around 2006 to give researchers an ad hoc way to create and execute MapReduce jobs on very large data sets. In 2007,^[5] it was moved into the Apache Software Foundation.

Version	Original release date	Latest version	Release date^[6]	Status
0.1	2008-09-11	0.1.1	2008-12-05	Unsupported
0.2	2009-04-08	0.2.0	2009-04-08	Unsupported
0.3	2009-06-25	0.3.0	2009-06-25	Unsupported
0.4	2009-08-29	0.4.0	2009-08-29	Unsupported
0.5	2009-09-29	0.5.0	2009-09-29	Unsupported
0.6	2010-03-01	0.6.0	2010-03-01	Unsupported
0.7	2010-05-13	0.7.0	2010-05-13	Unsupported
0.8	2010-12-17	0.8.1	2011-04-24	Unsupported
0.9	2011-07-29	0.9.2	2012-01-22	Unsupported
0.10	2012-01-22	0.10.1	2012-04-25	Unsupported
0.11	2013-02-21	0.11.1	2013-04-01	Unsupported
0.12	2013-10-14	0.12.1	2014-04-14	Unsupported
0.13	2014-07-04	0.13.0	2014-07-04	Unsupported
0.14	2014-11-20	0.14.0	2014-11-20	Unsupported
0.15	2015-06-06	0.15.0	2015-06-06	Unsupported
0.16	2016-06-08	0.16.0	2016-06-08	Unsupported
0.17	2017-06-19	0.17.0	2017-06-19	Latest version

Naming

The name Pig was chosen arbitrarily and stuck because it was memorable, easy to spell, and novel.^[7]^[8]^[9]

The story goes that the researchers working on the project initially referred to it simply as 'the language'. Eventually they needed to call it something. Off the top of his head, one researcher suggested Pig, and the name stuck. It is quirky yet memorable and easy to spell. While some have hinted that the name sounds coy or silly, it has provided us with an entertaining nomenclature, such as Pig Latin for the language, Grunt for the shell, and PiggyBank for the CPAN-like shared repository.

— Alan Gates, Daniel Dai, "What Is Pig?", Programming Pig, 2nd Edition (November 2017)

Example

Below is an example of a "Word Count" program in Pig Latin:

 input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
 
 -- Extract words from each line and put them into a pig bag
 -- datatype, then flatten the bag to get one word on each row
 words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
 
 -- filter out any words that are just white spaces
 filtered_words = FILTER words BY word MATCHES '\\w+';
 
 -- create a group for each word
 word_groups = GROUP filtered_words BY word;
 
 -- count the entries in each group
 word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
 
 -- order the records by count
 ordered_word_count = ORDER word_count BY count DESC;
 STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

The above program generates tasks that can be executed in parallel and distributed across multiple machines in a Hadoop cluster, counting how often each word occurs in a dataset such as all the webpages on the internet.

Pig vs SQL

In comparison to SQL, Pig:

has a nested relational model;
uses lazy evaluation;
uses extract, transform, load (ETL);
is able to store data at any point during a pipeline;
declares execution plans;
supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines.
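
Two of these points, the nested relational model and the ability to store data at any point, can be illustrated with a short Pig Latin sketch. The paths and field names below are hypothetical and chosen only for illustration; this is a minimal example under those assumptions, not a recommended layout.

```pig
-- Hypothetical input: one record per request (user, url, bytes transferred)
raw = LOAD '/tmp/example-logs' AS (user:chararray, url:chararray, bytes:long);

-- Drop records without a user
clean = FILTER raw BY user IS NOT NULL;

-- Store an intermediate result in the middle of the pipeline
STORE clean INTO '/tmp/example-logs-clean';

-- GROUP produces a nested relation: each tuple holds the group key
-- plus a bag of all matching 'clean' tuples
by_user = GROUP clean BY user;

-- Aggregate over the nested bag
totals = FOREACH by_user GENERATE group AS user, SUM(clean.bytes) AS total_bytes;

STORE totals INTO '/tmp/example-bytes-per-user';
```

Because evaluation is lazy, nothing runs until a STORE (or DUMP) statement requires output; the earlier statements only build up the plan.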

On the other hand, it has been argued that DBMSs are substantially faster than MapReduce systems once the data is loaded, but that loading the data takes considerably longer in database systems. It has also been argued that RDBMSs offer out-of-the-box support for column storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance.^[10]

Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is declarative. In SQL, users can specify that data from two tables must be joined, but generally not which join implementation to use; that choice is left to the query optimizer, on the grounds that "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm." Pig Latin allows users to specify an implementation, or aspects of an implementation, to be used in executing a script in several ways.^[11] In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.^[12]
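
As a sketch of what specifying an implementation can look like, the script below requests a fragment-replicate join with the USING clause and sets the reducer count with PARALLEL. The relation names, fields, paths, and the reducer count of 10 are assumptions made for illustration.

```pig
-- Hypothetical inputs: a large click log and a small user table
clicks = LOAD '/tmp/example-clicks' AS (user_id:long, url:chararray);
users  = LOAD '/tmp/example-users'  AS (user_id:long, name:chararray);

-- Ask for a fragment-replicate join: the relations listed after the
-- first one (here, users) are loaded into memory, so they must be small
joined = JOIN clicks BY user_id, users BY user_id USING 'replicated';

-- PARALLEL sets the number of reduce tasks for this operator
by_url = GROUP joined BY clicks::url PARALLEL 10;

hits = FOREACH by_url GENERATE group AS url, COUNT(joined) AS hits;

STORE hits INTO '/tmp/example-hits-per-url';
```

Omitting the USING clause leaves Pig on its default join; the script author opts into a specialized strategy only when they know something about the data, such as one input being small.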

SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script, by contrast, describes a directed acyclic graph (DAG) rather than a linear pipeline.^[11]
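
The following sketch shows one way such a split might be written, using the SPLIT operator to send records down two branches that are processed differently and stored separately. Field names, the status threshold, and output paths are hypothetical.

```pig
-- Hypothetical input: one record per request
events = LOAD '/tmp/example-events' AS (user:chararray, status:int, bytes:long);

-- Split the stream into two sub-streams by condition
-- (records with a null status match neither branch)
SPLIT events INTO ok IF status < 400, failed IF status >= 400;

-- Branch 1: total bytes per user for successful requests
ok_by_user = GROUP ok BY user;
ok_bytes   = FOREACH ok_by_user GENERATE group AS user, SUM(ok.bytes) AS total_bytes;

-- Branch 2: number of failures per status code
failed_by_status = GROUP failed BY status;
failed_counts    = FOREACH failed_by_status GENERATE group AS status, COUNT(failed) AS n;

-- One script, two outputs
STORE ok_bytes      INTO '/tmp/example-ok-bytes';
STORE failed_counts INTO '/tmp/example-failed-counts';
```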

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.^[11]
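
A minimal sketch of calling user code inside a pipeline is shown below. The jar name `myudfs.jar` and the function `myudfs.NormalizeUrl` are hypothetical placeholders for a user-defined function; they are not part of Pig or PiggyBank.

```pig
-- Make user code available to the script (hypothetical jar)
REGISTER myudfs.jar;

-- Data is read directly from files; no prior import step is needed
raw = LOAD '/tmp/example-raw-events' AS (user:chararray, url:chararray);

-- Apply the hypothetical UDF as one step of the pipeline
cleaned = FOREACH raw GENERATE user, myudfs.NormalizeUrl(url) AS url;

-- Ordinary operators can follow the user code
valid = FILTER cleaned BY url IS NOT NULL;

STORE valid INTO '/tmp/example-clean-events';
```

Here cleansing happens as the data is first read, which is the contrast with an import-then-transform workflow drawn above.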
