Sunday, September 17, 2017

Study notes for Big Data - Part II

My study notes on Big Data - Part II 


  • Cloud Computing:  Self-service; Elasticity; Pay-as-you-go; Convenient network access; Resource pooling
  • Offerings from three vendors (Amazon AWS, Microsoft Azure, Google Cloud)
    •     Storage: Amazon S3; Azure Storage; Google Cloud Storage (see the S3 access sketch after this list)
    •     MapReduce: Amazon EMR (Elastic MapReduce); Hadoop on Azure / HDInsight; Google Cloud Dataflow / Dataproc (Spark/Hadoop services) / Datalab (interactive tool to explore/analyze/visualize data)
    •     RDBMS: Amazon RDS (Aurora/Oracle/SQL Server/PostgreSQL/MySQL/MariaDB); Azure SQL Database as a service; Google Cloud SQL (MySQL)
    •     NoSQL: Amazon DynamoDB, or EMR with Apache HBase; Azure DocumentDB / Table Storage; Google Cloud Bigtable / Cloud Datastore
    •     Data warehouse: Amazon Redshift; Azure SQL Data Warehouse as a service; Google BigQuery
    •     Stream processing: Amazon Kinesis; Azure Stream Analytics; Google Cloud Dataflow
    •     Machine learning: Amazon ML; Azure ML / Azure Cognitive Services; Google Cloud Machine Learning services / Vision API / Speech API / Natural Language API / Translate API
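
A minimal sketch of touching one of these storage offerings from code -- here Amazon S3 via the boto3 Python SDK. The bucket name, file names and pre-configured AWS credentials are assumptions for illustration, not part of these notes:

    # Sketch: upload and read back an object in Amazon S3 using boto3.
    # Assumes boto3 is installed, AWS credentials are configured, and the
    # bucket "my-bigdata-notes" already exists (all assumptions).
    import boto3

    s3 = boto3.client("s3")

    # Upload a local file as an object under a key.
    s3.upload_file("sales.csv", "my-bigdata-notes", "raw/sales.csv")

    # Read the object back.
    obj = s3.get_object(Bucket="my-bigdata-notes", Key="raw/sales.csv")
    print(obj["Body"].read()[:100])
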
  • Hadoop Basics:  based on Google's MapReduce and Google File System papers; created by Doug Cutting (an engineer at Yahoo); written in Java in 2005; now an Apache project
  • Hadoop Common (libraries/utilities); HDFS (Hadoop Distributed File System); Hadoop MapReduce (programming model for large-scale data processing); Hadoop YARN (resource management)
  • HDFS: Master/Slave model: the NameNode is the master and controls all metadata; multiple DataNodes are slaves and manage the storage attached to the nodes they run on; files are split into blocks; blocks are stored on DataNodes and replicated; block size and replication factor are configurable per file; the NameNode makes all replication decisions and receives Heartbeats and BlockReports from the DataNodes; HDFS is a distributed storage service layered on top of each node's regular file system; fault tolerant, resilient, clustered; tuned for large files (GBs or TBs per file, tens of millions of files per cluster); write once, read many; optimized for throughput rather than latency; long-running batch processing; move compute near the data;
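
A minimal sketch of this NameNode/DataNode split in practice, using the third-party Python "hdfs" (WebHDFS) client; the NameNode address, user and paths are assumptions, not from these notes:

    # Sketch: write a file into HDFS and inspect its block metadata over WebHDFS.
    # Assumes the "hdfs" Python package (hdfscli) and a NameNode web endpoint at
    # http://namenode:9870 (the port depends on the Hadoop version -- assumption).
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="hadoop")

    # The NameNode records the metadata; DataNodes store the replicated blocks.
    client.write("/data/notes.txt", data=b"hello hdfs\n", overwrite=True)

    # The per-file block size and replication factor noted above are visible
    # (and configurable) per file.
    status = client.status("/data/notes.txt")
    print(status["blockSize"], status["replication"])

    # Read the file back (write once, read many).
    with client.read("/data/notes.txt") as reader:
        print(reader.read())
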
  • YARN:  Yet Another Resource Negotiator (cluster/resource manager for resource allocation, scheduling, and distributing/parallelizing tasks); a global ResourceManager (Scheduler and ApplicationsManager) plus a per-machine NodeManager; a Hadoop job specifies input/output locations, the map()/reduce() implementations (interfaces/classes) and job parameters, typically written in Java (Hadoop Streaming and Hadoop Pipes allow other languages); the job client submits the job (jar/executable) and its configuration to the ResourceManager in YARN, which distributes the work to the workers and handles scheduling, monitoring, status and diagnostics
  • MapReduce is a software framework for large-scale data processing; parallel processing on clusters of commodity hardware; reliable and fault tolerant; two components: an API for writing MapReduce workflows in Java, and a set of services for managing those workflows (scheduling, distribution, parallelizing); four phases: Split, Map, Shuffle/Sort/Group, Reduce; works exclusively on <key, value> pairs: the input is split, map() produces a <key, value> pair for each record, the mapper output is shuffled/sorted/grouped and passed to the reducers, and each reducer is applied to all pairs with the same key, aggregating their values; can't be used for problems like Fibonacci where each value depends on previous results; it is a compiled language (Java) with a lower level of abstraction, so it is more complex but gives higher code performance; suited for complex business logic and for both structured and unstructured data (see the word-count sketch below);
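
As a concrete illustration of the Split / Map / Shuffle-Sort-Group / Reduce phases, here is the classic word count written as a Hadoop Streaming mapper and reducer in Python (a sketch; the script names and HDFS paths are assumptions, and the exact path of the streaming jar varies by distribution):

    # mapper.py -- Map phase: emit a <word, 1> pair for every word in the input split.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- Reduce phase: the framework has already shuffled/sorted/grouped
    # by key, so all pairs for a word arrive together; aggregate their values.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

It would be submitted to YARN with the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input <hdfs input> -output <hdfs output> (jar path and paths are illustrative).
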
  • Hadoop ecosystem:  Data management: HDFS/HBase/YARN; Data access: MapReduce / Hive (data warehouse with SQL-like HiveQL queries) / Pig (scripting alternative to writing Java job clients); Data ingestion/integration: Flume/Sqoop/Kafka/Storm; Data operations/monitoring: Ambari/Zookeeper/Oozie; Data governance/security: Falcon/Ranger/Knox; Machine learning: Mahout; Statistics: R connectors
  • Hive: data warehouse solution for tasks such as ETL, summarization, query and analysis; SQL-like interface for datasets in HDFS or HBase; the Hive engine compiles HiveQL queries into MapReduce jobs; tables/databases are created first, data is loaded later; comes with a CLI; two components: HCatalog (table/storage management layer) and WebHCat (REST interface); SQL-like query language with a higher level of abstraction, so less complex but with lower code performance than hand-written MapReduce; good for ad hoc analysis; for both structured and unstructured data; built by Facebook and donated to the Apache Foundation;
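
A minimal sketch of the "create the table first, load the data later" flow, driving HiveQL from Python via the PyHive library; the HiveServer2 host, table and HDFS path are assumptions for illustration:

    # Sketch: define a Hive table, load data already sitting in HDFS, then query it.
    # Assumes the PyHive package and a HiveServer2 endpoint (both assumptions).
    from pyhive import hive

    conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hadoop")
    cursor = conn.cursor()

    # 1. Create the table first (schema only; no data is moved yet).
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING, ts BIGINT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    """)

    # 2. Load data that is already in HDFS into the table.
    cursor.execute("LOAD DATA INPATH '/data/page_views' INTO TABLE page_views")

    # 3. Query it; the Hive engine compiles this HiveQL into MapReduce jobs.
    cursor.execute("SELECT url, COUNT(*) FROM page_views GROUP BY url")
    print(cursor.fetchall())
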
  • Pig provides a scripting alternative; the Pig engine converts scripts into MapReduce programs; two components: Pig Latin (the language) and the runtime environment (parser, optimizer, compiler and execution engine that converts Pig scripts into MapReduce jobs run on the Hadoop MapReduce engine); Pig Latin consists of data types, statements, and general and relational operators such as join, group and filter; can be extended with UDFs (User Defined Functions) written in Java/Python/Ruby etc. and called from Pig Latin; two execution modes -- Local mode: runs in a single JVM using the local file system; MapReduce mode: translates scripts into MapReduce jobs and runs them on a Hadoop cluster; Hive is for querying data, Pig is for preparing data to make it suitable for querying; higher level of abstraction, less complex, code performance lower than MapReduce; easy for writing joins; for both structured and unstructured data; built by Yahoo and donated to the Apache Foundation;
  • Data ingestion and integration tools:  
    • Flume: moves large volumes of streaming log data (e.g. from web servers or social media); channel-based transactions between sender and receiver; rate control; four typical types of sources/destinations: tailed files; system logs/log4j; social media feeds; HDFS/HBase; architecture: Flume Data Source -> Flume Data Channel -> Flume Data Sink -> HDFS; the push model does not scale as well; tightly integrated with Hadoop; suitable for streaming data; has its own query-processing engine to transform data before moving it to the sink;
    • Sqoop: connectors for MySQL, PostgreSQL (using the databases' APIs for efficient bulk transfer), Oracle, SQL Server and DB2, plus generic JDBC connectors; supports both import and export; handles text files or Avro/SequenceFile formats; used when the data lives in an RDBMS, data warehouse or NoSQL store;
    • Kafka: a distributed streaming platform used for building real-time data pipelines and streaming apps; a publish-subscribe messaging system for real-time event processing; stores streams of records in a fault-tolerant way and processes streams as they occur; runs as a cluster; stores streams of records in categories called topics; each record has a {key, value, timestamp}; four core APIs: Producer, Consumer, Streams and Connector; consumers pull data from topics; not tightly integrated with Hadoop, so you may need to write your own producers/consumers; Kafka + Flume work well together, e.g. a Flume agent can use a Kafka source (see the producer/consumer sketch after this list);
    • Storm: Real-time message computation system - not a queue; Three abstractions:  
      • Spout: a source of streams in a computation (such as a Kafka broker, the Twitter streaming API, etc.);
      • Bolt: processes input streams and produces output streams; functions include filters, joins, aggregation, writing to a DB, etc.;
      • Topology: a network of spouts and bolts representing a multi-stage stream computation, with each bolt subscribing to the output streams of other spouts or bolts; a topology runs indefinitely once deployed;
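
A minimal sketch of the Kafka publish-subscribe model from the ingestion bullet above, using the third-party kafka-python package; the broker address, topic name and record contents are assumptions, not part of these notes:

    # Sketch: publish records to a Kafka topic and pull them back with a consumer.
    # Assumes kafka-python and a broker at localhost:9092 (both assumptions).
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish records (key, value; the broker assigns the timestamp) to a topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("web-clicks", key=b"user-42", value=b'{"url": "/home"}')
    producer.flush()

    # Consumer: subscribe to the topic and pull records from it.
    consumer = KafkaConsumer("web-clicks",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for record in consumer:
        print(record.key, record.value, record.timestamp)
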
  • Data operations tools:
    • Ambari: installation, lifecycle management, administration and monitoring for Hadoop clusters; web GUI backed by the Ambari REST APIs; provisioning (step-by-step wizard); configuration management; central management for starting/stopping/reconfiguring services; dashboard for status monitoring, metrics collection and an alert framework
    • Oozie:  a Java web app used to schedule Hadoop jobs as a workflow engine; combines multiple jobs into a logical unit of work and triggers workflow actions; tightly integrated with Hadoop jobs such as MapReduce, Hive, Pig and Sqoop, plus system Java/shell actions; action nodes and control-flow nodes are arranged in a DAG (Directed Acyclic Graph); workflows are written in hPDL (an XML Process Definition Language); control-flow nodes define the workflow and control the execution path (decision, fork and join nodes etc.)
    • Zookeeper:  coordinates distributed processes; helps store and mediate updates to important configuration information; it provides: a naming service; configuration management; process synchronization; leader election; reliable messaging (publish/subscribe queues with guaranteed message delivery)
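
A minimal sketch of using ZooKeeper to store and watch shared configuration, via the third-party kazoo Python client; the ensemble address and znode paths are assumptions for illustration:

    # Sketch: store a config value in ZooKeeper, read it back, and register a watch.
    # Assumes the kazoo package and an ensemble at 127.0.0.1:2181 (assumptions).
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Store a piece of shared configuration as a znode.
    zk.ensure_path("/app/config")
    if not zk.exists("/app/config/db_url"):
        zk.create("/app/config/db_url", b"jdbc:mysql://db1:3306/app")

    # Any process can read it...
    data, stat = zk.get("/app/config/db_url")
    print(data, stat.version)

    # ...and register a watch to be notified whenever it changes.
    @zk.DataWatch("/app/config/db_url")
    def on_update(data, stat):
        print("config changed to:", data)
    # (A real process would keep running; zk.stop() would close the session.)
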
  • NoSQL (Not Only SQL) basics:  schema-less, no strict data structure; agile; scales horizontally (sharding/replication); integrates well with cloud computing; better performance for unstructured data; CAP theorem: Consistency, Availability, Partition tolerance -- a distributed system can only fully satisfy 2 of the 3; eventual consistency is better for read-heavy apps, vs. strong consistency (tunable in some stores, such as Amazon DynamoDB); better suited for storing and processing Big Data;
  • Four categories of NoSQL: (Choosing type requires analysis of app requirements and NoSQL properties)
    • Key-value: unique key, value in any format; useful for session info, user profiles, preferences, shopping-cart info; don't use it when the data has relationships or you need to operate on multiple keys at the same time; examples: Redis (in-memory with optional persistence to disk) and Memcached (purely in-memory), chosen based on app needs; others: Amazon DynamoDB, Riak and Couchbase (see the sketch after this list)
    • Document: documents can be XML, JSON, EDI, SWIFT etc.; the document sits in the value part; self-describing; hierarchical tree structures, collections and scalar values; typically indexed with B-trees and (in MongoDB's case) queried with a JavaScript query engine; examples: MongoDB and CouchDB
    • Column family: stores data in column families addressed by a row key; a column family is a container for an ordered collection of rows; rows can have different columns, so columns are easy to add; can have super columns (containers of columns; a map of columns with names and values); examples: HBase, Cassandra, Hypertable
    • Graph: nodes/edges/properties based on graph theory; stores entities and relationships (nodes with properties); relationships are edges, which carry properties and have directional significance; traversals ("joins") are very fast because relationships are persisted rather than calculated at query time; requires up-front design of the graph model; examples: Neo4j, InfiniteGraph, OrientDB
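
A minimal sketch contrasting the first two categories -- a key-value store (Redis, via redis-py) and a document store (MongoDB, via pymongo); the server addresses, keys and documents are assumptions for illustration:

    # Sketch: key-value vs. document-store access patterns.
    # Assumes the redis and pymongo packages and local servers (assumptions).
    import redis
    from pymongo import MongoClient

    # Key-value: one opaque value per unique key (e.g. a session or shopping cart).
    kv = redis.Redis(host="localhost", port=6379)
    kv.set("session:42", '{"user": "ada", "cart": ["book"]}')
    print(kv.get("session:42"))

    # Document: self-describing JSON-like documents, queryable by their fields.
    mongo = MongoClient("mongodb://localhost:27017")
    users = mongo.shop.users
    users.insert_one({"name": "ada", "prefs": {"theme": "dark"}, "tags": ["vip"]})
    print(users.find_one({"tags": "vip"}))
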
  • Apache Spark (a top-level project at Apache): a cluster computing framework for data processing (data science pipelines) and analysis, run in parallel on commodity hardware; fast and easy (APIs optimized for fast job development); a comprehensive, unified framework for Big Data analytics; rich APIs for SQL, ML, graph processing, ETL, streaming, and interactive and batch processing; in-depth analysis with SQL, statistics and machine learning (ML); wide range of use cases (intrusion detection, product recommendation, risk estimation, genomic analysis etc.); very fast thanks to in-memory caching and a DAG-based processing engine (up to ~100x faster than MapReduce in memory and ~10x faster on disk); suitable for iterative algorithms in machine learning; fast real-time response to user queries (in-memory); low-latency analysis of live data streams; general-purpose programming model: Java, Scala, Python; libraries/APIs for batch, streaming, ML, queries etc.; interactive shells for Python and Scala; written in Scala on top of the JVM (performance and reliability), so tools from the Java stack can be used; strong integration with the Hadoop ecosystem; supports HDFS, Cassandra, S3 and HBase; it is not a data storage system; traditional BI tools can connect via JDBC/ODBC; pluggable DataFrame API via Spark SQL; built at UC Berkeley's AMPLab, motivated by MapReduce and by applying machine learning in a scalable fashion (see the PySpark sketch below);
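
A minimal PySpark sketch showing the in-memory, cache-friendly processing model and the DataFrame/Spark SQL API mentioned above; the input path and app name are assumptions, not from these notes:

    # Sketch: RDD word count with in-memory caching, plus the same data via Spark SQL.
    # Assumes a PySpark installation and an input file in HDFS (assumptions).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigdata-notes-sketch").getOrCreate()

    # RDD-style word count; .cache() keeps the result in memory for reuse,
    # which is what makes iterative algorithms fast on Spark.
    lines = spark.sparkContext.textFile("hdfs:///data/notes.txt")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .cache())
    print(counts.take(5))

    # The same file through the DataFrame / Spark SQL API.
    df = spark.read.text("hdfs:///data/notes.txt")
    df.createOrReplaceTempView("notes")
    spark.sql("SELECT count(*) AS line_count FROM notes").show()

    spark.stop()
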
