Data Science
- Create or train ML models
- Examples: Clustering, Decision Trees, Random Forest, Naïve Bayes, Q-Learning, Time Series
- Split the data into train and test sets
- Feed the data into an ML algorithm to create a model
- So, to decide whether a photo shows a dog or a cat, there must be a training phase and a testing phase so the model can learn how to do it
- Collect many dog and cat photos
- That data must then be split into training set and testing set
- Two data sets result in this case: X holds the independent features, y holds the dependent variable
- The split produces four pieces: X_train / X_test and y_train / y_test
- X_train and y_train are used to create the model
- Once the model is created, input X_test and the output should match y_test
- The closer the model's output is to y_test, the more accurate the model is
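The split-train-test workflow above can be sketched as follows; this is a minimal example assuming scikit-learn, and the bundled iris dataset and decision tree model are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # X: independent features, y: dependent variable

# One call produces all four pieces: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)          # training phase: X_train and y_train create the model
predictions = model.predict(X_test)  # testing phase: compare output against y_test
print(accuracy_score(y_test, predictions))
```

The closer `predictions` is to `y_test`, the higher the accuracy score.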
- Supervised learning methods
- Tree-based learning algorithms
- Gradient descent is the most widely used learning algorithm; it has many variants
- Gradient descent requires a cost function; there are many types of cost functions
- We need the cost function because we want to minimize it
- Minimizing any function means finding the deepest valley in that function
- The cost function is used to monitor the error in an ML model's predictions
- So, minimizing this means reducing the error or increasing the accuracy of the model
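The idea above can be sketched in a few lines of plain Python (an illustrative example, not a production optimizer): fit y = w·x to toy data by repeatedly stepping downhill on a mean-squared-error cost function; the data values and learning rate are made up for the demonstration.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

def cost(w):
    # MSE cost function: monitors the error of the model's predictions
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w):
    # Derivative of the MSE cost with respect to w
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0              # start anywhere on the cost surface
learning_rate = 0.01
for _ in range(1000):
    w -= learning_rate * gradient(w)  # step toward the "deepest valley"

print(round(w, 2))   # converges to ~1.99, the slope minimizing the cost
```

Minimizing the cost this way is exactly what "reducing the error / increasing the accuracy of the model" means.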
Decision tree algorithms are referred to as CART (Classification and Regression Trees): the possible solutions to a given problem emerge as the leaves of a tree, with each node representing a point of deliberation and decision
A decision tree model is a flowchart-like structure in which each internal node represents a test on an attribute (for example, whether a coin comes up heads or tails), each branch represents the outcome of that test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules
Use cases:
- When the user is trying to maximize profit or optimize cost
- When there are several courses of action
- To calculate a measure of the benefit of the various alternatives
- When there are events beyond the control of decision makers (such as environmental factors)
- When there is uncertainty concerning which outcome will actually happen
- Overfitting is more common than underfitting
- Overfitting means the model is too well trained and fits too closely to the training dataset
- Overfitting means the model is too complex; underfitting means the model is too simple
- To avoid overfitting, the data shouldn't have too many features/variables relative to the number of observations
- To avoid underfitting, the data needs enough predictors/independent variables
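The over/underfitting contrast above can be demonstrated with two decision trees; this is a sketch under assumed settings (a synthetic dataset with label noise via `flip_y`, and an arbitrary `max_depth` limit), not a general recipe:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset; flip_y=0.2 adds label noise so memorizing hurts
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# The unconstrained (too complex) tree scores near-perfectly on data it
# has seen; the gap between its train and test scores reveals overfitting.
print("deep:    train", deep.score(X_train, y_train),
      "test", deep.score(X_test, y_test))
print("shallow: train", shallow.score(X_train, y_train),
      "test", shallow.score(X_test, y_test))
```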
accuracy_score() takes two arguments: first the test-data labels (the actual labels), then the model's predictions
Example:
accuracy = accuracy_score(y_test, predictions)
brew install graphviz (not pip)
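A fitted tree can be exported to Graphviz DOT text from scikit-learn; rendering that text to an image requires the Graphviz binaries, which on macOS come from `brew install graphviz` rather than pip. A minimal sketch, with the iris dataset and output filename chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# out_file=None makes export_graphviz return the DOT source as a string
dot = export_graphviz(tree, out_file=None,
                      feature_names=load_iris().feature_names)
with open("tree.dot", "w") as f:
    f.write(dot)
# Render with the Graphviz CLI: dot -Tpng tree.dot -o tree.png
```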
TensorFlow
Graphs
Neural Network
Regression
Classification