Data Science Course Content
Type 1: Data Science with R and Statistics
Type 2: Data Science with Python and Statistics
Introduction to Data Science
- Introduction to Big Data
- Roles played by a Data Scientist
- Analyzing Big Data using Hadoop and R
- Methodologies used for analysis
- Architecture and methodologies used to solve Big Data problems
Basic Data Manipulation using R
- Understanding vectors in R
- Reading Data, Combining Data
- Subsetting data
- Sorting data and some basic data generation functions
Machine Learning Techniques Using R Part-1
- Machine Learning Overview
- ML Common Use Cases
- Understanding Supervised and Unsupervised Learning Techniques
- Clustering
- Similarity Metrics
- Distance Measure Types: Euclidean, Cosine
- Creating predictive models
Machine Learning Techniques Using R Part-2
- Understanding K-Means Clustering
- Understanding TF-IDF and Cosine Similarity and their application to the Vector Space Model
- Implementing Association rule mining in R
Machine Learning Techniques Using R Part-3
- Understanding Process flow of Supervised Learning Techniques
- Decision Tree Classifier
- How to build Decision trees
- Random Forest Classifier
- What is a Random Forest
- Features of Random Forest
- Out-of-Bag (OOB) Error Estimate and Variable Importance
- Naive Bayes Classifier
Introduction to Hadoop Architecture
- Hadoop Architecture
- Common Hadoop commands
- MapReduce and data loading techniques (directly in R, and in Hadoop using Sqoop, Flume, and other data loading tools)
- Removing anomalies from the data
Integrating R with Hadoop
- Integrating R with Hadoop using RHadoop and the rmr package
- Exploring RHIPE (R and Hadoop Integrated Programming Environment)
- Writing MapReduce Jobs in R and executing them on Hadoop
Mahout Introduction and Algorithm Implementation
- Implementing Machine Learning Algorithms on larger Data Sets with Apache Mahout
Additional Mahout Algorithms and Parallel Processing using R
- Implementation of different Mahout algorithms
- Random Forest Classifier with a parallel processing library in R
Introduction to Statistics
- Types of Statistics
- Types of Data
Descriptive Statistics
- Measures of Central Tendency
- Measures of Central Tendency – Usage Chart
- Measures of Dispersion / Variability
- Measures of Shape
- Applications of Variance and Standard Deviation
Hypothesis Testing
- Applications of Hypothesis Testing (t-test and z-test)
- Steps in Hypothesis Testing
ANOVA (Analysis of Variance)
- What is ANOVA
- ANOVA Steps
- Simple One-Way ANOVA
- Simple Two-Way ANOVA with Multiple Variables
Chi-Square Tests
- What is Chi-Square
- Applications of Chi-Square
Correlation
- Types of Correlation
- Properties of Correlation
- Methods of Calculating Correlation
- Steps to Calculate Correlation
Regression Analysis
- What is Regression
- Types of Regression Analysis
- Properties of the Regression Line
- Validating the Model
- Regression Assumptions
Data Transformation for Regression
Dummy Variable Analysis
Variable Selection Procedure for Regression
- Forward Selection Procedure
- Backward Elimination Procedure
- Stepwise Regression Method
Logistic Regression
- Likelihood Profiling
- Assumptions
- Variable Selection Method: WOE (Weight of Evidence) and IV (Information Value)
- Model Validation
- Model Performance
- Prediction
Cluster Analysis
- What is a cluster
- Applications of clustering
- Types of clustering
- K-Means
- Dendrogram
- Validation of clusters
Decision Tree
- What is a decision tree
- How a decision tree works
- CART (Classification and Regression Trees)
- Pruning
- Overfitting
- Underfitting
- Model validation
- Model performance
Market Basket Analysis
- What is Market Basket Analysis (MBA)
- Applications of MBA
- Support
- Confidence
- Lift
- Rules
Random Forest
- What is a random forest
- Applications of random forest
- Tuning parameters
- How to tune parameters
- Model validation
- Model performance
Support Vector Machine
- What is a support vector machine
- Why use SVM
- Hyperplane
- Kernel
- Cost
- Gamma
- Model validation
- Model performance
Naïve Bayes
- What is Naïve Bayes
- Bayes' theorem
- Conditional probability
- Prior probability
- Posterior probability
- Applications of Naïve Bayes
- Model validation
- Model performance
ARIMA
- What is time series
- What is ARIMA
- Stationarity
- Seasonality
- Trend
- What are p, d, q
- How to find p, d, q
- Finding the best model
- Forecasting
GBM (Gradient Boosting Machine)