- 4 V's of Big Data: Volume, Variety, Velocity, Veracity or Variability
- Format of Big Data: Structured (SQL), Semi-structured (EDI, SWIFT, XML), Unstructured (multimedia, text, images)
- Big Data Analytics: Basic (reports, dashboards, visualization, slice/dice); Advanced (ML, statistics, text analytics, neural networks, data mining); Operationalized (analytics embedded in business processes); Business decision (decision-making that drives $$$)
- Big Data Trends: Machine Learning; Embedding Intelligence; In the Cloud; IoT + Big Data + Cloud; NoSQL; Real-time analytics; Challenges (Privacy, Discrimination, Spying, Hacking)
- Cycle of Big Data Management: Capture - Organize - Integrate - Analyze - Act - Capture
- Components: Physical Infrastructure; Security Infra; Data Stores; Organize/Integrate; Analytics
- Physical Infra: Support the 4 V's; Cloud: Performance/Availability/Scalability/Flexibility/Cost
- Security Infra: Data Access; App Access; Data Encryption; Threat Detection
- Data Stores: DFS (HDFS); NoSQL (Cassandra, MongoDB); RDBMS (Oracle MySQL); Real-time (Kafka, Storm, Spark Streaming)
- DFS: 5 Transparencies: Access; Concurrency; Failure; Scalability; Replication
- RDBMS: ACID: Atomicity; Consistency; Isolation; Durability
- NoSQL: Document-oriented; Column-oriented; Graph DB; Key-Value
- Org/Integrate Data: Cleaning; Transformation; Normalization
- 2 types of data integration: multiple data sources ; unstructured source with structured big data
- Process/Organize Big Data: ETL; Hadoop MapReduce; Spark SQL
- ETL: Extract; Transform; Load
- Data Warehouse: RDBMS; organized by subject area; highly transformed; strictly defined use cases
- Hadoop's MapReduce: batch processing of large volumes of data (scalable/resilient); Apache Spark: complex analytics using ML models in an interactive approach
- Data Lake: contains all data, of different types; "Schema-on-read"; agile (adapts quickly to business changes); hard to secure; commodity hardware in a cluster; used for advanced analytics by data scientists
- Data Warehouse: use-case oriented; transactional/quantitative metrics; "Schema-on-write"; time-consuming to modify when business processes change; older technology, mature in security; enterprise-grade hardware; used for operational analysis/reports/KPIs/slices of data
- Analyzing Big Data: Predictive; Advanced (deep learning; speech/face/genomic); Social Media Analytics; Text Analytics; Alerts/Recommendations/Prescriptive; Reports/Dashboards/Visualization. In summary: Basic BI Solution + Advanced Analytics; combines statistics, data mining, and machine learning, with a wide range of use cases:
- Descriptive Analytics (What happened): Excel; RDBMS; Data Warehouse (IBM Cognos, Teradata); Reporting (Jasper Reports); Business Intelligence (Tableau, Qlik); Visualizations (Tableau, Qlik); Programming Languages (R, D3.js)
- Predictive Analytics (What could happen in the future): Combines statistics, data mining and machine learning techniques: Linear Regression; Logistic Regression; Decision trees and Random forests; Naive Bayes; Clustering; Neural networks; Link analysis (graph theory). Tools include: R, Apache Mahout, Apache Spark MLlib, H2O, NumPy, SciPy; IBM SPSS, SAS, SAP, RapidMiner; Google Prediction API, Amazon Machine Learning, Azure Machine Learning
- Prescriptive Analytics (What can I do to make this happen): combines tools such as business rules, algorithms, machine learning, computational modeling, etc.; Tools include: SAS, IBM, Dell Statistica
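
The ETL note above (Extract; Transform; Load) can be sketched with nothing but the Python standard library. This is only an illustration under assumptions: the sample CSV, the `orders` table, and the function names are hypothetical and do not come from any particular ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean names, drop rows without an amount, convert types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # cleaning step: skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned

def load(rows, conn):
    """Load: write the transformed rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(text, conn):
    """Run the three stages in order and return what ended up in the table."""
    load(transform(extract(text)), conn)
    query = "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
    return conn.execute(query).fetchall()

if __name__ == "__main__":
    print(run_etl(RAW_CSV, sqlite3.connect(":memory:")))
```

Keeping each stage as a separate function mirrors the Extract/Transform/Load split: the cleaning rules can change without touching how data is read or stored.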
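
For the MapReduce note above, the programming model itself (map, shuffle, reduce) can be shown in a single Python process. This is a toy word count, not the Hadoop API: the function names are made up for illustration, and a real cluster would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map over every document, then shuffle, then reduce per key."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each map call sees only its own document and each reduce call sees only one key, both phases can be spread over a cluster, which is where the scalability and resilience in the note come from.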
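
The "Schema-on-read" versus "Schema-on-write" contrast in the Data Lake and Data Warehouse notes above can be made concrete with a small sketch. Everything here is an assumption for illustration: the sample events, the `payments` table, and the function names are hypothetical, and SQLite merely stands in for a warehouse.

```python
import json
import sqlite3

# Hypothetical events with inconsistent shapes, as a data lake might hold them.
RAW_EVENTS = [
    '{"customer": "alice", "amount": 10.5}',
    '{"customer": "bob", "amount": "n/a", "note": "refund pending"}',
]

def read_with_schema(raw_lines):
    """Schema-on-read: raw data is kept as-is; types are applied when reading."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            rows.append((str(record["customer"]), float(record["amount"])))
        except (KeyError, ValueError, TypeError):
            continue  # record does not fit this reader's schema; skip it
    return rows

def write_with_schema(raw_lines, conn):
    """Schema-on-write: the table definition rejects bad rows at insert time."""
    conn.execute(
        "CREATE TABLE payments ("
        "customer TEXT NOT NULL, "
        "amount REAL NOT NULL CHECK (typeof(amount) = 'real'))"
    )
    rejected = 0
    for line in raw_lines:
        record = json.loads(line)
        try:
            conn.execute(
                "INSERT INTO payments VALUES (?, ?)",
                (record.get("customer"), record.get("amount")),
            )
        except sqlite3.IntegrityError:
            rejected += 1  # row violates the declared schema
    return rejected

if __name__ == "__main__":
    print(read_with_schema(RAW_EVENTS))
    print(write_with_schema(RAW_EVENTS, sqlite3.connect(":memory:")))
```

In the first function the malformed record still exists in storage and a different reader could interpret it later; in the second it never enters the table, which is the trade-off between flexibility and strictness that the two notes describe.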
Saturday, September 16, 2017
Study notes for Big Data - Part I