Sunday, September 17, 2017

Study notes for Big Data - Part II

My study notes on Big Data - Part II 


  • Cloud Computing:  Self-service; Elasticity; Pay-as-you-go; Convenient network access; Resource pooling
  • Offerings from three vendors (Amazon AWS, Microsoft Azure, Google Cloud)
    •     Storage: Amazon S3; Azure Storage; Google Cloud Storage (see the S3 access sketch after this list)
    •     MapReduce: Amazon EMR (Elastic MapReduce); Hadoop on Azure / HDInsight; Google Cloud Dataflow / Dataproc (Spark/Hadoop services) / Datalab (interactive tool to explore/analyze/visualize data)
    •     RDBMS: Amazon RDS (Aurora/Oracle/SQL Server/PostgreSQL/MySQL/MariaDB); Azure SQL Database as a service; Google Cloud SQL (MySQL)
    •     NoSQL: Amazon DynamoDB, or EMR with Apache HBase; Azure DocumentDB / Table Storage; Google Cloud Bigtable / Cloud Datastore
    •     Data warehouse: Amazon Redshift; Azure SQL Data Warehouse as a service; Google BigQuery
    •     Stream processing: Amazon Kinesis; Azure Stream Analytics; Google Cloud Dataflow
    •     Machine learning: Amazon ML; Azure ML / Azure Cognitive Services; Google Cloud Machine Learning services / Vision API / Speech API / Natural Language API / Translate API
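
A minimal sketch of touching one of these storage offerings from code -- here Amazon S3 via the boto3 Python SDK. The bucket name, file names and pre-configured AWS credentials are assumptions for illustration, not part of these notes:

    # Sketch: upload and read back an object in Amazon S3 using boto3.
    # Assumes boto3 is installed, AWS credentials are configured, and the
    # bucket "my-bigdata-notes" already exists (all assumptions).
    import boto3

    s3 = boto3.client("s3")

    # Upload a local file as an object under a key.
    s3.upload_file("sales.csv", "my-bigdata-notes", "raw/sales.csv")

    # Read the object back.
    obj = s3.get_object(Bucket="my-bigdata-notes", Key="raw/sales.csv")
    print(obj["Body"].read()[:100])
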
  • Hadoop Basics:  based on Google's MapReduce and Google File System papers; created by Doug Cutting (an engineer at Yahoo); written in Java in 2005; now an Apache project
  • Hadoop Common (libraries/utilities); HDFS (Hadoop Distributed File System); Hadoop MapReduce (programming model for large-scale data processing); Hadoop YARN (resource management)
  • HDFS: Master/Slave model: the NameNode is the master and controls all metadata; multiple DataNodes are slaves and manage the storage attached to the nodes they run on; files are split into blocks; blocks are stored on DataNodes and replicated; block size and replication factor are configurable per file; the NameNode makes all replication decisions and receives Heartbeats and BlockReports from the DataNodes; HDFS is a distributed storage service layered on top of each node's regular file system; fault tolerant, resilient, clustered; tuned for large files (GBs or TBs per file, tens of millions of files per cluster); write once, read many; optimized for throughput rather than latency; long-running batch processing; move compute near the data;
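
A minimal sketch of this NameNode/DataNode split in practice, using the third-party Python "hdfs" (WebHDFS) client; the NameNode address, user and paths are assumptions, not from these notes:

    # Sketch: write a file into HDFS and inspect its block metadata over WebHDFS.
    # Assumes the "hdfs" Python package (hdfscli) and a NameNode web endpoint at
    # http://namenode:9870 (the port depends on the Hadoop version -- assumption).
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="hadoop")

    # The NameNode records the metadata; DataNodes store the replicated blocks.
    client.write("/data/notes.txt", data=b"hello hdfs\n", overwrite=True)

    # The per-file block size and replication factor noted above are visible
    # (and configurable) per file.
    status = client.status("/data/notes.txt")
    print(status["blockSize"], status["replication"])

    # Read the file back (write once, read many).
    with client.read("/data/notes.txt") as reader:
        print(reader.read())
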
  • YARN:  Yet Another Resource Negotiator (cluster/resource manager for resource allocation, scheduling, and distributing/parallelizing tasks); a global ResourceManager (Scheduler and ApplicationsManager) plus a per-machine NodeManager; a Hadoop job specifies input/output locations, the map()/reduce() implementations (interfaces/classes) and job parameters, typically written in Java (Hadoop Streaming and Hadoop Pipes allow other languages); the job client submits the job (jar/executable) and its configuration to the ResourceManager in YARN, which distributes the work to the workers and handles scheduling, monitoring, status and diagnostics
  • MapReduce is a software framework for large-scale data processing; parallel processing on clusters of commodity hardware; reliable and fault tolerant; two components: an API for writing MapReduce workflows in Java, and a set of services for managing those workflows (scheduling, distribution, parallelizing); four phases: Split, Map, Shuffle/Sort/Group, Reduce; works exclusively on <key, value> pairs: the input is split, map() produces a <key, value> pair for each record, the mapper output is shuffled/sorted/grouped and passed to the reducers, and each reducer is applied to all pairs with the same key, aggregating their values; can't be used for problems like Fibonacci where each value depends on previous results; it is a compiled language (Java) with a lower level of abstraction, so it is more complex but gives higher code performance; suited for complex business logic and for both structured and unstructured data (see the word-count sketch below);
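
As a concrete illustration of the Split / Map / Shuffle-Sort-Group / Reduce phases, here is the classic word count written as a Hadoop Streaming mapper and reducer in Python (a sketch; the script names and HDFS paths are assumptions, and the exact path of the streaming jar varies by distribution):

    # mapper.py -- Map phase: emit a <word, 1> pair for every word in the input split.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- Reduce phase: the framework has already shuffled/sorted/grouped
    # by key, so all pairs for a word arrive together; aggregate their values.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

It would be submitted to YARN with the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input <hdfs input> -output <hdfs output> (jar path and paths are illustrative).
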
  • Hadoop ecosystem:  Data management: HDFS/HBase/YARN; Data access: MapReduce / Hive (data warehouse with SQL-like HiveQL queries) / Pig (scripting alternative to writing Java job clients); Data ingestion/integration: Flume/Sqoop/Kafka/Storm; Data operations/monitoring: Ambari/Zookeeper/Oozie; Data governance/security: Falcon/Ranger/Knox; Machine learning: Mahout; Statistics: R connectors
  • Hive: data warehouse solution for tasks such as ETL, summarization, query and analysis; SQL-like interface for datasets in HDFS or HBase; the Hive engine compiles HiveQL queries into MapReduce jobs; tables/databases are created first, data is loaded later; comes with a CLI; two components: HCatalog (table/storage management layer) and WebHCat (REST interface); SQL-like query language with a higher level of abstraction, so less complex but with lower code performance than hand-written MapReduce; good for ad hoc analysis; for both structured and unstructured data; built by Facebook and donated to the Apache Foundation;
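
A minimal sketch of the "create the table first, load the data later" flow, driving HiveQL from Python via the PyHive library; the HiveServer2 host, table and HDFS path are assumptions for illustration:

    # Sketch: define a Hive table, load data already sitting in HDFS, then query it.
    # Assumes the PyHive package and a HiveServer2 endpoint (both assumptions).
    from pyhive import hive

    conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hadoop")
    cursor = conn.cursor()

    # 1. Create the table first (schema only; no data is moved yet).
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING, ts BIGINT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    """)

    # 2. Load data that is already in HDFS into the table.
    cursor.execute("LOAD DATA INPATH '/data/page_views' INTO TABLE page_views")

    # 3. Query it; the Hive engine compiles this HiveQL into MapReduce jobs.
    cursor.execute("SELECT url, COUNT(*) FROM page_views GROUP BY url")
    print(cursor.fetchall())
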
  • Pig provides a scripting alternative; the Pig engine converts scripts into MapReduce programs; two components: Pig Latin (the language) and the runtime environment (parser, optimizer, compiler and execution engine that converts Pig scripts into MapReduce jobs run on the Hadoop MapReduce engine); Pig Latin consists of data types, statements, and general and relational operators such as join, group and filter; can be extended with UDFs (User Defined Functions) written in Java/Python/Ruby etc. and called from Pig Latin; two execution modes -- Local mode: runs in a single JVM using the local file system; MapReduce mode: translates scripts into MapReduce jobs and runs them on a Hadoop cluster; Hive is for querying data, Pig is for preparing data to make it suitable for querying; higher level of abstraction, less complex, code performance lower than MapReduce; easy for writing joins; for both structured and unstructured data; built by Yahoo and donated to the Apache Foundation;
  • Data ingestion and integration tools:  
    • Flume: moves large volumes of streaming log data (e.g. from web servers or social media); channel-based transactions between sender and receiver; rate control; four typical types of sources/destinations: tailed files; system logs/log4j; social media feeds; HDFS/HBase; architecture: Flume Data Source -> Flume Data Channel -> Flume Data Sink -> HDFS; the push model does not scale as well; tightly integrated with Hadoop; suitable for streaming data; has its own query-processing engine to transform data before moving it to the sink;
    • Sqoop: connectors for MySQL, PostgreSQL (using the databases' APIs for efficient bulk transfer), Oracle, SQL Server and DB2, plus generic JDBC connectors; supports both import and export; handles text files or Avro/SequenceFile formats; used when the data lives in an RDBMS, data warehouse or NoSQL store;
    • Kafka: a distributed streaming platform used for building real-time data pipelines and streaming apps; a publish-subscribe messaging system for real-time event processing; stores streams of records in a fault-tolerant way and processes streams as they occur; runs as a cluster; stores streams of records in categories called topics; each record has a {key, value, timestamp}; four core APIs: Producer, Consumer, Streams and Connector; consumers pull data from topics; not tightly integrated with Hadoop, so you may need to write your own producers/consumers; Kafka + Flume work well together, e.g. a Flume agent can use a Kafka source (see the producer/consumer sketch after this list);
    • Storm: Real-time message computation system - not a queue; Three abstractions:  
      • Spout: a source of streams in a computation (such as a Kafka broker, the Twitter streaming API, etc.);
      • Bolt: processes input streams and produces output streams; functions include filters, joins, aggregation, writing to a DB, etc.;
      • Topology: a network of spouts and bolts representing a multi-stage stream computation, with each bolt subscribing to the output streams of other spouts or bolts; a topology runs indefinitely once deployed;
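
A minimal sketch of the Kafka publish-subscribe model from the ingestion bullet above, using the third-party kafka-python package; the broker address, topic name and record contents are assumptions, not part of these notes:

    # Sketch: publish records to a Kafka topic and pull them back with a consumer.
    # Assumes kafka-python and a broker at localhost:9092 (both assumptions).
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish records (key, value; the broker assigns the timestamp) to a topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("web-clicks", key=b"user-42", value=b'{"url": "/home"}')
    producer.flush()

    # Consumer: subscribe to the topic and pull records from it.
    consumer = KafkaConsumer("web-clicks",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for record in consumer:
        print(record.key, record.value, record.timestamp)
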
  • Data operations tools:
    • Ambari: installation, lifecycle management, administration and monitoring for Hadoop clusters; web GUI backed by the Ambari REST APIs; provisioning (step-by-step wizard); configuration management; central management for starting/stopping/reconfiguring services; dashboard for status monitoring, metrics collection and an alert framework
    • Oozie:  a Java web app used to schedule Hadoop jobs as a workflow engine; combines multiple jobs into a logical unit of work and triggers workflow actions; tightly integrated with Hadoop jobs such as MapReduce, Hive, Pig and Sqoop, plus system Java/shell actions; action nodes and control-flow nodes are arranged in a DAG (Directed Acyclic Graph); workflows are written in hPDL (an XML Process Definition Language); control-flow nodes define the workflow and control the execution path (decision, fork and join nodes etc.)
    • Zookeeper:  coordinates distributed processes; helps store and mediate updates to important configuration information; it provides: a naming service; configuration management; process synchronization; leader election; reliable messaging (publish/subscribe queues with guaranteed message delivery)
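
A minimal sketch of using ZooKeeper to store and watch shared configuration, via the third-party kazoo Python client; the ensemble address and znode paths are assumptions for illustration:

    # Sketch: store a config value in ZooKeeper, read it back, and register a watch.
    # Assumes the kazoo package and an ensemble at 127.0.0.1:2181 (assumptions).
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Store a piece of shared configuration as a znode.
    zk.ensure_path("/app/config")
    if not zk.exists("/app/config/db_url"):
        zk.create("/app/config/db_url", b"jdbc:mysql://db1:3306/app")

    # Any process can read it...
    data, stat = zk.get("/app/config/db_url")
    print(data, stat.version)

    # ...and register a watch to be notified whenever it changes.
    @zk.DataWatch("/app/config/db_url")
    def on_update(data, stat):
        print("config changed to:", data)
    # (A real process would keep running; zk.stop() would close the session.)
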
  • NoSQL (Not Only SQL) basics:  schema-less, no strict data structure; agile; scales horizontally (sharding/replication); integrates well with cloud computing; better performance for unstructured data; CAP theorem: Consistency, Availability, Partition tolerance -- a distributed system can only fully satisfy 2 of the 3; eventual consistency is better for read-heavy apps, vs. strong consistency (tunable in some stores, such as Amazon DynamoDB); better suited for storing and processing Big Data;
  • Four categories of NoSQL: (Choosing type requires analysis of app requirements and NoSQL properties)
    • Key-value: unique key, value in any format; useful for session info, user profiles, preferences, shopping-cart info; don't use it when the data has relationships or you need to operate on multiple keys at the same time; examples: Redis (in-memory with optional persistence to disk) and Memcached (purely in-memory), chosen based on app needs; others: Amazon DynamoDB, Riak and Couchbase (see the sketch after this list)
    • Document: documents can be XML, JSON, EDI, SWIFT etc.; the document sits in the value part; self-describing; hierarchical tree structures, collections and scalar values; typically indexed with B-trees and (in MongoDB's case) queried with a JavaScript query engine; examples: MongoDB and CouchDB
    • Column family: stores data in column families addressed by a row key; a column family is a container for an ordered collection of rows; rows can have different columns, so columns are easy to add; can have super columns (containers of columns; a map of columns with names and values); examples: HBase, Cassandra, Hypertable
    • Graph: nodes/edges/properties based on graph theory; stores entities and relationships (nodes with properties); relationships are edges, which carry properties and have directional significance; traversals ("joins") are very fast because relationships are persisted rather than calculated at query time; requires up-front design of the graph model; examples: Neo4j, InfiniteGraph, OrientDB
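
A minimal sketch contrasting the first two categories -- a key-value store (Redis, via redis-py) and a document store (MongoDB, via pymongo); the server addresses, keys and documents are assumptions for illustration:

    # Sketch: key-value vs. document-store access patterns.
    # Assumes the redis and pymongo packages and local servers (assumptions).
    import redis
    from pymongo import MongoClient

    # Key-value: one opaque value per unique key (e.g. a session or shopping cart).
    kv = redis.Redis(host="localhost", port=6379)
    kv.set("session:42", '{"user": "ada", "cart": ["book"]}')
    print(kv.get("session:42"))

    # Document: self-describing JSON-like documents, queryable by their fields.
    mongo = MongoClient("mongodb://localhost:27017")
    users = mongo.shop.users
    users.insert_one({"name": "ada", "prefs": {"theme": "dark"}, "tags": ["vip"]})
    print(users.find_one({"tags": "vip"}))
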
  • Apache Spark (a top-level project at Apache): a cluster computing framework for data processing (data science pipelines) and analysis, run in parallel on commodity hardware; fast and easy (APIs optimized for fast job development); a comprehensive, unified framework for Big Data analytics; rich APIs for SQL, ML, graph processing, ETL, streaming, and interactive and batch processing; in-depth analysis with SQL, statistics and machine learning (ML); wide range of use cases (intrusion detection, product recommendation, risk estimation, genomic analysis etc.); very fast thanks to in-memory caching and a DAG-based processing engine (up to ~100x faster than MapReduce in memory and ~10x faster on disk); suitable for iterative algorithms in machine learning; fast real-time response to user queries (in-memory); low-latency analysis of live data streams; general-purpose programming model: Java, Scala, Python; libraries/APIs for batch, streaming, ML, queries etc.; interactive shells for Python and Scala; written in Scala on top of the JVM (performance and reliability), so tools from the Java stack can be used; strong integration with the Hadoop ecosystem; supports HDFS, Cassandra, S3 and HBase; it is not a data storage system; traditional BI tools can connect via JDBC/ODBC; pluggable DataFrame API via Spark SQL; built at UC Berkeley's AMPLab, motivated by MapReduce and by applying machine learning in a scalable fashion (see the PySpark sketch below);
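
A minimal PySpark sketch showing the in-memory, cache-friendly processing model and the DataFrame/Spark SQL API mentioned above; the input path and app name are assumptions, not from these notes:

    # Sketch: RDD word count with in-memory caching, plus the same data via Spark SQL.
    # Assumes a PySpark installation and an input file in HDFS (assumptions).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigdata-notes-sketch").getOrCreate()

    # RDD-style word count; .cache() keeps the result in memory for reuse,
    # which is what makes iterative algorithms fast on Spark.
    lines = spark.sparkContext.textFile("hdfs:///data/notes.txt")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .cache())
    print(counts.take(5))

    # The same file through the DataFrame / Spark SQL API.
    df = spark.read.text("hdfs:///data/notes.txt")
    df.createOrReplaceTempView("notes")
    spark.sql("SELECT count(*) AS line_count FROM notes").show()

    spark.stop()
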
