Big Data

Big Data Systems

  1. What is Big Data
  2. Data Warehouse, Data Lakes
  3. Hadoop – Components
  4. Storage – HDFS, HBase
  5. Processing and Resource Management (MapReduce, YARN)
  6. Types of Data Formats (JSON, ORC, Parquet, Avro)
  7. Scripting (Hive, Pig)
  8. Stream Processing
  9. Massively Parallel Processing (Spark, Impala, Mahout)
  10. RDDs in Spark
  11. Data Migration (Sqoop, Flume)
  12. Scheduler (Oozie)
  13. Coordination Service (ZooKeeper)
  14. Relational Database (RDBMS)
  15. Columnar Database
  16. Multi-model Database
  17. NoSQL (HBase, Cassandra, MongoDB, DynamoDB)
  18. RDBMS (MySQL, PostgreSQL)
  19. Cosmos DB
  20. In-memory Database (Redis)
  21. Spark SQL
  22. Case Study

Stream Processing & Analytics

  1. Real Time Streaming Architecture
  2. Service Configuration and Coordination
  3. Data Flow Management, Storing and Processing Streaming Data
  4. Visualization Techniques for Real Time Streaming Data
  5. Aggregation (Timed Counting, Multi-Resolution Time Series Aggregation)
  6. Statistical Approximation
  7. Approximating with Sketches

PySpark

  1. Overview & Installation
  2. RDD
  3. DataFrame
  4. Architecture
  5. MLlib
  6. NLP
  7. Linear Regression
  8. Logistic Regression
  9. Decision Tree
  10. Naive Bayes
  11. XGBoost
  12. Time Series
  13. Spark Job Automation with a Scheduler
  14. NYC Parking Case Study: Apache Spark