Apache Pig
Developer(s)	Apache Software Foundation, Yahoo Research
Initial release	September 11, 2008
Stable release	v0.17.0 / June 19, 2017
Operating system	Microsoft Windows, OS X, Linux
Type	Data analytics
License	Apache License 2.0
Website	pig.apache.org

Apache Pig^[1] is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin.^[1] Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.^[2] Pig Latin abstracts the programming away from the Java MapReduce idiom into a higher-level notation, much as SQL does for relational database management systems. Pig Latin can be extended with user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby, or Groovy^[3] and then call directly from the language.
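As a rough illustration of the UDF mechanism, the following is a minimal sketch of a UDF written in Python. The function name `normalize_word` and its schema string are hypothetical. The `outputSchema` decorator is the one Pig's Python UDF support is understood to provide; the fallback definition is an assumption added only so the function can be exercised outside Pig.

```python
try:
    # Assumption: available when the script is run through Pig's Python UDF support.
    from pig_util import outputSchema
except ImportError:
    # Stand-in decorator so the function can be run and tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("word:chararray")
def normalize_word(word):
    """Lower-case a token and strip surrounding punctuation; None stays None."""
    if word is None:
        return None
    return word.strip(".,;:!?\"'()").lower()
```

Assuming the file is saved as `udfs.py`, a Pig Latin script would typically register it with a statement along the lines of `REGISTER 'udfs.py' USING jython AS myfuncs;` and then call `myfuncs.normalize_word(word)`. The exact registration options depend on the Pig version in use.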

History

Apache Pig was originally^[4] developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of creating and executing MapReduce jobs on very large data sets. In 2007,^[5] it was moved into the Apache Software Foundation.^[6]

Version	Original release date	Latest version	Release date
0.1	2008-09-11	0.1.1	2008-12-05
0.2	2009-04-08	0.2.0	2009-04-08
0.3	2009-06-25	0.3.0	2009-06-25
0.4	2009-08-29	0.4.0	2009-08-29
0.5	2009-09-29	0.5.0	2009-09-29
0.6	2010-03-01	0.6.0	2010-03-01
0.7	2010-05-13	0.7.0	2010-05-13
0.8	2010-12-17	0.8.1	2011-04-24
0.9	2011-07-29	0.9.2	2012-01-22
0.10	2012-01-22	0.10.1	2012-04-25
0.11	2013-02-21	0.11.1	2013-04-01
0.12	2013-10-14	0.12.1	2014-04-14
0.13	2014-07-04	0.13.0	2014-07-04
0.14	2014-11-20	0.14.0	2014-11-20
0.15	2015-06-06	0.15.0	2015-06-06
0.16	2016-06-08	0.16.0	2016-06-08
0.17	2017-06-19	0.17.0	2017-06-19
Version 0.17 is the current stable release; all earlier versions are no longer supported.

Example

Below is an example of a "Word Count" program in Pig Latin:

 input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

 -- Extract words from each line and put them into a Pig bag
 -- datatype, then flatten the bag to get one word on each row
 words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

 -- Filter out any words that are just white space
 -- (the backslash is doubled because it sits inside a Pig string literal)
 filtered_words = FILTER words BY word MATCHES '\\w+';

 -- Create a group for each word
 word_groups = GROUP filtered_words BY word;

 -- Count the entries in each group
 word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

 -- Order the records by count
 ordered_word_count = ORDER word_count BY count DESC;
 STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

The above program will generate parallel executable tasks which can be distributed across multiple machines in a Hadoop cluster to count the occurrences of each word in a dataset such as all the webpages on the internet.
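To make the dataflow of the word-count script easier to follow, here is a minimal single-machine sketch of the same steps in Python. It is only an illustration of the logic, not how Pig executes the script: the function name `word_count` is hypothetical, and `str.split()` is a simplifying assumption standing in for Pig's `TOKENIZE`, which splits on a few additional delimiters.

```python
import re
from collections import Counter


def word_count(lines):
    """Mirror the Pig Latin word-count steps on an in-memory list of lines."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row.
    words = (word for line in lines for word in line.split())
    # FILTER ... MATCHES '\\w+': the pattern must match the whole word.
    filtered = (word for word in words if re.fullmatch(r"\w+", word))
    # GROUP ... BY word, then COUNT per group.
    counts = Counter(filtered)
    # ORDER ... BY count DESC, emitting (count, word) rows like the script.
    rows = ((count, word) for word, count in counts.items())
    return sorted(rows, key=lambda row: row[0], reverse=True)
```

Each local step corresponds to one Pig Latin statement; on a cluster, Pig compiles those statements into distributed jobs instead of running them in a single process.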

Pig vs SQL

In comparison to SQL, Pig:

has a nested relational model,
uses lazy evaluation,
uses extract, transform, load (ETL),
is able to store data at any point during a pipeline,
declares execution plans,
supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines.

On the other hand, it has been argued that DBMSs are substantially faster than MapReduce systems once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued that RDBMSs offer out-of-the-box support for column storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance.^[7]

Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is instead declarative. In SQL, users specify that data from two tables must be joined, but generally leave the choice of join implementation to the system. (SQL systems may let the writer specify a join implementation, but "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm.") Pig Latin allows users to specify an implementation, or aspects of an implementation, to be used in executing a script in several ways.^[8] In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.^[9]
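One example of such a choice is Pig Latin's `USING 'replicated'` clause on `JOIN`, which asks for a join in which the smaller relation is held in memory. The Python sketch below illustrates the general idea of that strategy only; the function name `replicated_join` and its parameters are hypothetical, and this is not Pig's actual implementation.

```python
def replicated_join(large, small, key_large=0, key_small=0):
    """Join two lists of tuples by indexing the small one in memory."""
    # Build an in-memory index of the small relation, keyed by the join field.
    index = {}
    for row in small:
        index.setdefault(row[key_small], []).append(row)
    # Stream the large relation once, probing the index for matches.
    for row in large:
        for match in index.get(row[key_large], []):
            yield row + match


pages = [(1, "x"), (2, "y"), (3, "z")]
labels = [(1, "A"), (3, "C"), (3, "D")]
joined = list(replicated_join(pages, labels))
```

The point of naming the strategy in the script is that the writer, who knows one input is small, makes the decision rather than leaving it to an optimizer.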

SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.^[8]
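The idea of splitting one stream into sub-streams, each processed by different operators, can be sketched in Python as follows. In Pig Latin this role is played by the `SPLIT` operator; the helper `split_stream`, the branch names, and the sample records here are all hypothetical and serve only to illustrate the shape of such a dataflow.

```python
def split_stream(records, predicates):
    """Route each record to every named branch whose predicate accepts it."""
    branches = {name: [] for name in predicates}
    for record in records:
        for name, accept in predicates.items():
            if accept(record):
                branches[name].append(record)
    return branches


records = [("a", 3), ("b", 12), ("c", 7)]
branches = split_stream(
    records,
    {"small": lambda r: r[1] < 10, "large": lambda r: r[1] >= 10},
)

# Each branch is then processed by a different operator.
small_total = sum(count for _, count in branches["small"])
large_keys = sorted(key for key, _ in branches["large"])
```

Because both branches read from the same input and feed different downstream steps, the overall flow forms a graph rather than a single linear pipeline.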

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.^[8]

References

[1] "Hadoop: Apache Pig". Retrieved September 2, 2011.

[2] "[PIG-4167] Initial implementation of Pig on Spark - ASF JIRA". issues.apache.org. Retrieved December 29, 2018.

[3] "Pig user defined functions". Retrieved May 3, 2013.

[4] "Yahoo Blog:Pig – The Road to an Efficient High-level language for Hadoop". Archived from the original on February 3, 2016. Retrieved May 23, 2015.

[5] "Pig into Incubation at the Apache Software Foundation". Archived from the original on February 3, 2016. Retrieved May 23, 2015.

[6] "The Apache Software Foundation". Retrieved November 1, 2010.

[7] "Communications of the ACM: MapReduce and Parallel DBMSs: Friends or Foes?" (PDF). Archived from the original (PDF) on July 1, 2015. Retrieved May 23, 2015.

[8] "Yahoo Pig Development Team: Comparing Pig Latin and SQL for Constructing Data Processing Pipelines". Archived from the original on May 30, 2015. Retrieved May 23, 2015.

[9] "ACM SigMod 08: Pig Latin: A Not-So-Foreign Language for Data Processing" (PDF). Retrieved May 23, 2015.

External links

Official website

