Ramp up your essential knowledge into market-ready skills with this Big Data Engineering Booster course. Version control is one of the most important skills for any Data Engineer, and this course will make you an expert in managing GitHub repositories. Get ready to learn how to design Databases, Data Lakes and Data Marts, and to work with HDFS file types, compressions, Airflow and, most excitingly, the Spark (PySpark) processing framework.
Have you ever thought of getting hands-on by completing three real-world projects? Yes, you heard it right. These projects cover all the combinations of tech stacks – Non-Hadoop, Big Data and Hybrid (a Big Data and Non-Hadoop mix) – so you will gain experience putting your knowledge into action. If your system resources are not sufficient, you can still use the AWS Cloud server setup from the Big Data Engineering – Essentials course for your hands-on work.

Pre-requisites (Free)
Big Data Engineering – Essentials
SQL Foundation
Python Fundamentals
Shell/Bash Scripting for Beginners
System Requirements
CPU: Quad-core i5 or better / Apple M1
Memory: 16GB
OS: Windows/MacOS
Not to worry if you do not have enough capacity on your system; towards the end of this course you will be guided through procuring an AWS Cloud server for your practice.
Mode of Training
Online Interactive Sessions
Recorded Video Sessions – From the latest Online batch
Resources
Approximate number of sessions: 52 (varies across batches)
Lifetime access to the recorded videos will be provided, along with all supporting documents, logs, references and software, if any.
Placements
Chapter 1: VERSION CONTROL SYSTEM
- GitHub Introduction
- Setup
- Repo
- Branches
- Forks
- Code Issues
- Commits
- Pull Requests
- Squash & Merges
- Conflicts
- Code Reviews & Testing
- Responsibilities
- Real-world Hands-on
- Command line: Single Contributor, smooth life cycle (sketched after this chapter outline)
- Command line: Single Contributor, linking issues
- Command line: Single Contributor, PR reviews
- GitHub Desktop: Repeat exercises of command line
- GitHub Web: User access control
- GitHub Web: Create an Organization
- Command line: Dual Contributors, conflicts
- Command line: How to avoid conflicts
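As a taste of the hands-on work, here is a minimal sketch of the single-contributor life cycle, assuming a repository named example-repo under your own account; the repo, branch, and file names are illustrative.

```bash
# Clone the repo and work on a feature branch (names are illustrative)
git clone https://github.com/<your-username>/example-repo.git
cd example-repo
git checkout -b feature/readme-notes

# Make a change, stage it, and commit with a descriptive message
echo "Setup notes" >> README.md
git add README.md
git commit -m "Add setup notes to README"

# Publish the branch, then open a pull request on GitHub;
# squash & merge once the review passes
git push -u origin feature/readme-notes
```

The conflict exercises repeat this cycle with two contributors editing the same lines, which is where pull requests and reviews earn their keep.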
Chapter 2: DATA WAREHOUSE
- Definition
- Types
- Advantages
- Data Mart
- Hadoop Warehouse
- Data Lake
- Architectures
- RDBMS
- NoSQL
- In-Memory
- Clear the Clutter
Chapter 3: DATA MODELLING
- Schemas
- Types
- Facts & Dimensions
- Data Models
- Normalization
- Star Schema
- Snowflake Schema
- OLTP & OLAP
- SCD tables
- Summarize
Chapter 4: SPARK PROCESSING
- Local Setup
- PyCharm Integration
- Real-world Setup
- Recap
- Zeppelin
- Transformations & Actions
- PySpark
- First Code (sketched after this chapter outline)
- Spark SQL
- Spark Dataframe
- Applications
- Hands-on (Part 1)
- Hands-on (Part 2)
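To set expectations for the "First Code" session, here is a minimal PySpark sketch covering a DataFrame, a lazy transformation, Spark SQL, and the actions that trigger execution. It assumes a local `pip install pyspark` and uses illustrative inline data.

```python
# Minimal PySpark example: build a DataFrame, run a transformation,
# register it for Spark SQL, and trigger actions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("first-code").master("local[*]").getOrCreate()

# Illustrative in-memory data
orders = spark.createDataFrame(
    [(1, "laptop", 900.0), (2, "mouse", 25.0), (3, "monitor", 180.0)],
    ["order_id", "product", "amount"],
)

# Transformation (lazy): filter rows; nothing executes yet
big_orders = orders.filter(orders.amount > 100)

# Spark SQL over the same data
orders.createOrReplaceTempView("orders")
totals = spark.sql("SELECT COUNT(*) AS n, SUM(amount) AS total FROM orders")

# Actions: these trigger execution
big_orders.show()
totals.show()

spark.stop()
```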
Chapter 5: HADOOP DATAFILES
- File Formats
- Text
- Avro
- Parquet & RCFile
- ORC
- SequenceFile
- Compressions
- Choose the best (comparison sketched below)
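A minimal sketch of the "Choose the best" comparison, writing the same DataFrame in several formats and compression codecs so the on-disk sizes can be compared; the paths are illustrative, and the Avro writer needs the external spark-avro package.

```python
# Sketch: write one DataFrame as compressed CSV (text), Parquet, and ORC,
# then compare the sizes of /tmp/events_* on disk. Paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-formats").master("local[*]").getOrCreate()
df = spark.range(0, 1_000_000).withColumnRenamed("id", "event_id")

df.write.mode("overwrite").option("compression", "gzip").csv("/tmp/events_csv")
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/events_parquet")
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/events_orc")
# df.write.mode("overwrite").format("avro").save("/tmp/events_avro")  # requires spark-avro

spark.stop()
```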
Chapter 6: SCHEDULING PIPELINES
- Airflow Intro
- Installation
- Example DAGs
- Pipelines & Dependencies
- Importing Modules
- Default Arguments
- Tasks
- Setting up Dependencies
- Testing Pipeline
- Schedule
- Presets
- Catchup
- Backfill
- Passing Parameters when triggering DAGs
- Hands-on (a minimal DAG sketch follows this outline)
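Here is a minimal Airflow DAG sketch in the Airflow 2.x style, tying together default arguments, tasks, dependencies, a schedule preset, and catchup; the dag_id and task commands are illustrative.

```python
# Minimal Airflow 2.x DAG: default arguments, two tasks, a dependency,
# a daily schedule preset, and catchup disabled.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data_engineer",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_pipeline",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # preset; cron strings also work
    catchup=False,               # do not backfill past runs automatically
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # load runs only after extract succeeds
```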
Chapter 7: REAL WORLD PROJECT – LOCAL
- Project Requirements
- Data Source Schema (ER Diagram)
- Data Mart Modeling
- Design Jobs
- Infra Setup
- Initial Data Load
- Development
- Incremental Loads (watermark pattern sketched after this outline)
- Testing
- Airflow Pipeline
- Logging
- Deployment & Scheduling
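A minimal sketch of the high-watermark pattern behind incremental loads, shown with standard-library SQLite; the table and column names are illustrative, not the project's actual schema.

```python
# Sketch of a high-watermark incremental load using stdlib SQLite.
import sqlite3

# Source system with an updated_at column we can watermark on
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02"), (3, 30.0, "2024-01-03")],
)

# Target mart plus a watermark table recording how far we have loaded
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE orders_mart (id INTEGER, amount REAL, updated_at TEXT)")
tgt.execute("CREATE TABLE load_watermark (last_loaded TEXT)")
tgt.execute("INSERT INTO load_watermark VALUES ('2024-01-01')")  # set by the initial load

# Incremental load: pull only rows newer than the stored watermark
(watermark,) = tgt.execute("SELECT MAX(last_loaded) FROM load_watermark").fetchone()
new_rows = src.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?", (watermark,)
).fetchall()
tgt.executemany("INSERT INTO orders_mart VALUES (?, ?, ?)", new_rows)

# Advance the watermark so the next run skips what was just loaded
if new_rows:
    tgt.execute("INSERT INTO load_watermark VALUES (?)", (max(r[2] for r in new_rows),))
tgt.commit()
print(f"Loaded {len(new_rows)} new row(s)")
```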
Chapter 8: REAL WORLD PROJECT – BIGDATA
- Project Requirements
- Data Source Schema
- Data Mart Modeling
- Design Tasks
- Infra Setup
- Initial Data Load
- Development
- Incremental Loads
- Testing
- Airflow Pipeline
- Logging
- Deployment & Scheduling
Chapter 9: REAL WORLD PROJECT – HYBRID
- APIs
- API Source & Endpoints
- UNIX/EPOCH timestamps (conversion sketched after this outline)
- Project Requirements
- Lake
- Marts
- Database Design
- Pipelines
- Infra Setup
- Development
- Code
- Logging
- Debugging
- Deployment & Scheduling
- Exercise
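A minimal sketch of the UNIX/epoch timestamp handling an API source typically requires, using only the standard library; the payload and field names are hypothetical, not the course's actual API.

```python
# Sketch: convert epoch timestamps from a (hypothetical) API payload to UTC.
# In the project this payload would come from an HTTP GET against the source API.
import json
from datetime import datetime, timezone

payload = '[{"id": 1, "created_at": 1704067200}, {"id": 2, "created_at": 1704153600000}]'

def epoch_to_utc(value):
    # Heuristic: epoch milliseconds are ~1000x larger than epoch seconds
    seconds = value / 1000 if value > 10**12 else value
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

for event in json.loads(payload):
    print(event["id"], epoch_to_utc(event["created_at"]).isoformat())
```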
Chapter 10: REWIND & RECAP
- Summary
- Q & A
Chapter 11: PROFILE BUILDING
Chapter 12: MOCK INTERVIEW
Frequently Asked Questions (FAQs)
There are two modes of training: Online Instructor-Led or Recorded Video Sessions. While you can purchase the latter anytime, look out for the schedule on this page to join the former.
This is the second-level course on the path to becoming a Data Engineer; on completing it you will be market-ready for junior positions.
Basic SQL, Python and shell scripting skills, along with the Big Data Engineering – Essentials course, are the prerequisites.
You will be part of a professional community, and there will be assistance for your blockers.
You will be assisted and guided in profile building and mock interviews.