Saturday, September 16, 2017

Study notes for Big Data - Part I

My study notes on Big Data - Part I

  • 4 V's of Big Data: Volume, Variety, Velocity, Veracity (sometimes Variability)
  • Format of Big Data: Structured (SQL), Semi-structured (EDI, SWIFT, XML), Unstructured (multimedia, text, images)
  • Big Data Analytics: Basic (reports, dashboards, visualization, slice/dice); Advanced (ML, statistics, text analytics, neural networks, data mining); Operationalized (embed analytics in business processes); Business decision (decision-making that drives $$$)
  • Big Data Trends: Machine Learning; Embedding Intelligence; In the Cloud; IoT+BigData+Cloud; NoSQL; Real-time analytics; Challenges (Privacy, Discrimination, Spying, Hacking)
  • Cycle of Big Data Management: Capture - Organize - Integrate - Analyze - Act - Capture
  • Components:  Physical Infrastructure; Security Infra; Data Stores; Organize/Integrate; Analytics
  • Phys Infra: must support the 4 V's; Cloud: Performance/Availability/Scalability/Flexibility/Cost
  • Security Infra:  Data Access; App Access; Data Encryption; Threat Detection
  • Data Store: DFS (HDFS); NoSQL (Cassandra, MongoDB); RDBMS (Oracle, MySQL); Real-time (Kafka, Storm, Spark Streaming)
  • DFS:  5 Transparencies: Access; Concurrency; Failure; Scalability; Replication
  • RDBMS: ACID: Atomicity; Consistency; Isolation; Durability
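The ACID guarantees can be seen in miniature with SQLite, itself a small RDBMS. The sketch below (table and account names are made up) shows Atomicity: a transfer that fails partway through is rolled back entirely, leaving no half-applied debit.

```python
import sqlite3

# Atomicity sketch: either the whole transfer commits, or none of it does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # context manager: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        # ...the matching credit to bob would go here; simulate a crash before it:
        raise RuntimeError("simulated failure mid-transfer")
except RuntimeError:
    pass

# The debit was rolled back, so balances are unchanged.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```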
  • NoSQL:  Document-oriented; Column-oriented; Graph DB; Key-Value
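A rough way to picture the four NoSQL data models is with plain Python structures. This is purely illustrative (the store names in comments are just well-known examples of each category); real systems add indexing, distribution, and replication.

```python
# Key-Value (e.g. Redis-style): an opaque value looked up by key.
kv = {"user:42": b'{"name": "Ann"}'}

# Document-oriented (e.g. MongoDB-style): nested, schema-free documents.
doc_store = {"users": [{"_id": 42, "name": "Ann", "tags": ["admin"]}]}

# Column-oriented (e.g. Cassandra-style): keyed rows whose values are
# grouped into column families.
wide_column = {
    "row1": {
        "profile": {"name": "Ann"},
        "activity": {"last_login": "2017-09-16"},
    }
}

# Graph DB (e.g. Neo4j-style): nodes plus labeled edges between them.
graph = {"nodes": {1: "Ann", 2: "Bob"}, "edges": [(1, "follows", 2)]}

print(doc_store["users"][0]["name"])  # Ann
```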
  • Org/Integrate Data:  Cleaning; Transformation; Normalization
  • 2 types of data integration:  multiple data sources ;  unstructured source with structured big data
  • Process/Organize Big Data: ETL; Hadoop's MapReduce; Spark SQL
  • ETL: Extract; Transform; Load
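A minimal sketch of the three ETL steps, assuming a toy CSV source and a made-up `sales` table in SQLite:

```python
import csv
import io
import sqlite3

# Extract: pull raw rows from the source (here, inline CSV text).
raw_csv = "region,amount\n east , 100\nWEST,250\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: clean whitespace, normalize case, cast types.
clean = [(r["region"].strip().lower(), int(r["amount"])) for r in rows]

# Load: insert the cleaned rows into the target RDBMS table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)
conn.commit()

loaded = conn.execute("SELECT region, amount FROM sales").fetchall()
print(loaded)  # [('east', 100), ('west', 250)]
```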
  • Data Warehouse:  RDBMs; by subject area; highly transformed; strictly defined use cases 
  • Hadoop's MapReduce: batch processing of large data volumes (scalable/resilient); Apache Spark: complex analytics using ML models in an interactive approach
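The MapReduce pattern behind Hadoop's batch processing can be sketched in a few lines — a toy, single-machine word count, not anything distributed: map emits (key, 1) pairs, a shuffle groups them by key, and reduce sums each group.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce: collapse each key's values into a single result.
    return key, sum(values)

lines = ["big data big analytics", "data lake"]
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1, 'lake': 1}
```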
  • Data Lake: contains all data, of different types; "schema-on-read"; agile, adapts to business changes quickly; hard to secure; commodity hardware in a cluster; used for advanced analytics by data scientists
  • Data Warehouse: use-case oriented; transactional/quantitative metrics; "schema-on-write"; time-consuming to modify when business processes change; older tech, mature in security; enterprise-grade hardware; used for operational analysis/reports/KPIs/slices of data
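The "schema-on-write" vs. "schema-on-read" contrast can be sketched as follows (table and field names are invented for illustration): the warehouse enforces its schema at insert time, while the lake stores raw records as-is and leaves structure to the reader.

```python
import json
import sqlite3

# Schema-on-write: the warehouse table rejects data that doesn't fit
# its predefined schema, at the moment of writing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT NOT NULL, amount REAL NOT NULL)")
try:
    conn.execute("INSERT INTO events (user) VALUES ('ann')")  # amount missing
except sqlite3.IntegrityError as exc:
    print("rejected at write time:", exc)

# Schema-on-read: the lake keeps raw records untouched; structure is
# applied only when a consumer reads and interprets them.
raw_records = ['{"user": "ann"}', '{"user": "bob", "amount": 9.5}']
parsed = [json.loads(r) for r in raw_records]
amounts = [r.get("amount", 0.0) for r in parsed]  # schema decided here
print(sum(amounts))  # 9.5
```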
  • Analyzing Big Data: Predictive; Advanced (deep learning; speech/face/genomic); Social Media Analytics; Text Analytics; Alerts/Recommendations/Prescriptive; Reports/Dashboards/Visualization. In summary: Basic BI Solution + Advanced Analytics; combines statistics, data mining, and machine learning, with wide-ranging use cases:
    • Descriptive Analytics (What happened): Excel; RDBMS; Data Warehouse (IBM Cognos, Teradata); Reporting (JasperReports); Business Intelligence (Tableau, Qlik); Visualizations (Tableau, Qlik); Programming Languages (R, D3.js)
    • Predictive Analytics (What could happen in the future): combines statistics, data mining, and machine learning techniques: Linear Regression; Logistic Regression; Decision Trees and Random Forests; Naive Bayes; Clustering; Neural Networks; Link Analysis (graph theory). Tools include: R, Apache Mahout, Apache Spark MLlib, H2O, NumPy, SciPy; IBM SPSS, SAS, SAP, RapidMiner; Google Prediction API, Amazon Machine Learning, Azure Machine Learning
    • Prescriptive Analytics (What can I do to make this happen): combines tools such as business rules, algorithms, machine learning, computational modeling, etc.; Tools include: SAS, IBM, Dell Statistica
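As a worked example of one predictive technique from the list above, here is simple linear regression fit by ordinary least squares in pure Python, so the formulas stay visible (the data points are made up):

```python
# Toy dataset: x values and noisy, roughly linear y observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = cov(x, y) / var(x),
# intercept chosen so the line passes through (mean_x, mean_y).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

print(round(predict(5.0), 2))  # 9.95 — prediction for an unseen x
```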
