Big Data
Big Data Systems
- What is Big Data
- Data Warehouse, Data Lakes
- Hadoop – Components
- Storage – HDFS, HBase
- Processing and Resource Management (MapReduce, YARN)
- Types of data formats (JSON, ORC, Parquet, Avro)
- Scripting (Hive, Pig)
- Stream Processing
- Massively Parallel Processing (Spark, Impala, Mahout)
- RDDs in Spark
- Data Migration (Sqoop/Flume)
- Scheduler (Oozie)
- Coordination Service (ZooKeeper)
- RDBMS Database
- Columnar Database
- Multi-model Database
- NoSQL (HBase, Cassandra, MongoDB, DynamoDB)
- RDBMS (MySQL, PostgreSQL)
- Cosmos DB
- In-memory database (Redis)
- Spark SQL
- Case Study
Stream Processing & Analytics
- Real Time Streaming Architecture
- Service Configuration and Coordination
- Data Flow Management, Storing and Processing Streaming Data
- Visualization Techniques for Real Time Streaming Data
- Aggregation (Timed Counting, Multi-Resolution Time Series Aggregation)
- Statistical Approximation
- Approximating with sketches
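
As an illustration of the "approximating with sketches" topic, here is a minimal Count-Min sketch in plain Python. It is a sketch under stated assumptions, not course material: the class name `CountMinSketch`, the default `width` and `depth`, and the use of salted BLAKE2b as the hash family are all choices made for this example. The structure estimates how often an item has appeared in a stream using fixed memory; estimates can be too high because of hash collisions, but never too low.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator for streams (illustrative only)."""

    def __init__(self, width=256, depth=4):
        # width: counters per row; depth: number of independent hash rows.
        # Both defaults are arbitrary values picked for this example.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum across rows is the counter least inflated by collisions.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch()
    for event in ["login", "click", "click", "login", "click"]:
        sketch.add(event)
    print("click estimate:", sketch.estimate("click"))
```

Increasing `width` lowers the chance of collisions and increasing `depth` lowers the chance that every row collides at once, at the cost of more memory.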
PySpark
- Overview & Installation
- RDD
- DataFrame
- Architecture
- MLlib
- NLP
- Linear regression
- Logistic regression
- Decision tree
- Naive Bayes
- XGBoost
- Time series
- Spark Job automation with Scheduler
- NYC Parking Case Study: Apache Spark