
Monday, October 12, 2020

Data warehousing and mining quiz questions and answers set 03


Data Warehousing and Data Mining - MCQ Questions and Answers SET 03


1. Which of the following practices can help in handling the overfitting problem?

a) Use of a faster processor

b) Increasing the number of training examples

c) Reducing the number of training instances

d) Increasing the model complexity

Answer: (b) Increasing the number of training examples

Increasing the number of training examples lowers the test error (the variance of the model decreases), which reduces overfitting.

A model that does not generalize well from its training data to unseen data is said to be overfit. An overfit model has extremely low training error but high testing error.
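For illustration, here is a minimal Python sketch (assuming scikit-learn and NumPy are installed; the noisy sine data is synthetic and purely illustrative). An unrestricted decision tree memorizes its training set, so its training error stays near zero, but its test error falls as the number of training examples grows:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

def train_test_mse(n_train):
    # Noisy sine data: an unrestricted tree fits the training noise exactly.
    X = rng.uniform(-3, 3, size=(n_train, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=n_train)
    X_test = rng.uniform(-3, 3, size=(1000, 1))
    y_test = np.sin(X_test).ravel() + rng.normal(scale=0.3, size=1000)
    model = DecisionTreeRegressor().fit(X, y)
    return (mean_squared_error(y, model.predict(X)),
            mean_squared_error(y_test, model.predict(X_test)))

for n in (20, 200, 2000):
    train_mse, test_mse = train_test_mse(n)
    print(f"n={n:4d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

The gap between training and test error shrinks as n grows, which is exactly the reduced-overfitting effect described above.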

 

2. Which of the following statements is INCORRECT about SVMs and kernels?

a. Kernels map the original dataset into a higher dimensional space and then find a hyper-plane in the mapped space

b. Kernels map the original dataset into a higher dimensional space and then find a hyper-plane in the original space

c. Using kernels allows us to obtain non-linear decision boundaries for a classification problem

d. The kernel trick allows us to perform computations in the original space and enhances the speed of SVM learning.

Answer: (b) Kernels map the original dataset into a higher dimensional space and then find a hyper-plane in the original space

An SVM maps the original feature space into a higher-dimensional space according to a user-defined kernel function and then finds the support vectors that maximize the separation (margin) between the two classes in that higher-dimensional space.
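A minimal sketch of statement (c) in action (assuming scikit-learn is installed), using the concentric-circles dataset, which no straight line can separate in the original space:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)  # kernel trick: computations stay in the original space
print("linear kernel accuracy:", linear_svm.score(X, y))  # around 0.5: no separating line exists
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # near 1.0: non-linear boundary found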

 

3. Dimensionality reduction reduces the data set size by removing ____________.

a) Relevant attributes.

b) Irrelevant attributes.

c) Support vector attributes.

d) Mining attributes

Answer: (b) Irrelevant attributes

We remove those attributes or features that are irrelevant and redundant in order to reduce the dimension of the feature set.

Dimensionality reduction

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data. [Wikipedia] 

The process of dimensionality reduction has two components: feature selection and feature extraction. In feature selection, a smaller subset of the original features is chosen to represent the data, using filter, wrapper, or embedded methods. In feature extraction, the original features are transformed into a smaller set of new features, as in principal component analysis.
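A minimal sketch contrasting the two components on the Iris data (assuming scikit-learn is installed): feature selection keeps two of the original columns, while feature extraction (here, PCA) builds two new components from all of them:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)                              # 4 original features
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)   # feature selection (filter method)
X_extracted = PCA(n_components=2).fit_transform(X)             # feature extraction
print(X.shape, X_selected.shape, X_extracted.shape)            # (150, 4) (150, 2) (150, 2)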

 

4. What is the Hamming distance between the binary vectors a = 0101010001 and b = 0100011001?

a) 2

b) 3

c) 5

d) 10

Answer: (a) 2

For binary data, the Hamming distance is the number of bit positions in which two vectors differ. Here, a and b differ only at positions 4 and 7, so the distance is 2.
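The count can be checked with a couple of lines of Python (a plain sketch with no dependencies):

a = "0101010001"
b = "0100011001"
print(sum(x != y for x, y in zip(a, b)))  # 2 (positions 4 and 7 differ)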

 

5. What is the Jaccard similarity between the binary vectors a = 0111010101 and b = 0100011111?

a) 0.5

b) 1.5

c) 2.5

d) 3

Answer: (a) 0.5

For binary data, the Jaccard similarity is a measure of similarity between two binary vectors.

The Jaccard similarity between binary vectors can be calculated using the following equation:

Jsim = C11 / (C01 + C10 + C11)

Here, C11 is the count of matching 1s between the two vectors, and

C01 and C10 are the counts of positions where the two vectors disagree.

For the given question,

C11 = the number of bit positions with matching 1s = 4

C10 = the number of bit positions where the first vector (vector a) is 1 and the second vector (vector b) is 0 = 2

C01 = the number of bit positions where the first vector (vector a) is 0 and the second vector (vector b) is 1 = 2

Jsim(a, b) = 4/(2+2+4) = 4/8 = ½ = 0.5
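The same computation in a few lines of Python (a plain sketch with no dependencies):

a = "0111010101"
b = "0100011111"
c11 = sum(x == "1" and y == "1" for x, y in zip(a, b))  # matching 1s: 4
c10 = sum(x == "1" and y == "0" for x, y in zip(a, b))  # a=1, b=0: 2
c01 = sum(x == "0" and y == "1" for x, y in zip(a, b))  # a=0, b=1: 2
print(c11 / (c01 + c10 + c11))                          # 0.5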

 

**********************

 

Related links:

 

 

What is the impact of increasing the training sample size on overfitting?

What is the impact of overfitting?

How to calculate Jaccard similarity between two binary vectors

Calculate Hamming distance

List the components of dimensionality reduction

SVM transforms the original feature space into a higher-dimensional space


Data warehousing and mining quiz questions and answers set 02


Data Warehousing and Data Mining - MCQ Questions and Answers SET 02


1. In non-parametric models

a) There are no parameters

b) The parameters are fixed in advance

c) A type of probability distribution is assumed, then its parameters are inferred

d) The parameters are flexible

Answer: (d) The parameters are flexible

Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term non-parametric is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance.

In non-parametric models, no fixed set of parameters is specified in advance and no probability distribution is assumed; the parameters remain flexible.
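As an illustration (a minimal sketch, assuming NumPy and SciPy are installed, on synthetic bimodal data): a parametric Gaussian fit has exactly two parameters fixed in advance, while a non-parametric kernel density estimate keeps the whole sample, so its effective complexity grows with the data:

import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.RandomState(0)
sample = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])  # bimodal data
mu, sigma = sample.mean(), sample.std()  # parametric: two numbers, form assumed a priori
kde = gaussian_kde(sample)               # non-parametric: stores all 600 points
print("Gaussian density at 0:", norm(mu, sigma).pdf(0))  # too high: wrong assumed shape
print("KDE density at 0:     ", kde(0)[0])               # close to the true low density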

 

2. The goal of clustering analysis is to:

a) Maximize the inter-cluster similarity

b) Maximize the intra-cluster similarity

c) Maximize the number of clusters

d) Minimize the intra-cluster similarity

Answer: (b) Maximize the intra-cluster similarity

One of the goals of a clustering algorithm is to maximize the intra-cluster similarity.

A clustering result with small intra-cluster distances (high intra-cluster similarity) and large inter-cluster distances (low inter-cluster similarity) is said to be good.

Clustering analysis is a technique for grouping similar observations into a number of clusters based on the values of multiple variables for each observation. It is a form of unsupervised classification.

Inter-cluster distance – the distance between two objects from two different clusters.

Intra-cluster distance – the distance between two objects from the same cluster.
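These two quantities can be computed directly (a minimal sketch, assuming scikit-learn and NumPy are installed, with synthetic blob data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.8, random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for k in range(2):
    members = X[km.labels_ == k]
    intra = np.linalg.norm(members - km.cluster_centers_[k], axis=1).mean()
    print(f"cluster {k}: mean intra-cluster distance = {intra:.2f}")  # small is good
inter = np.linalg.norm(km.cluster_centers_[0] - km.cluster_centers_[1])
print(f"inter-cluster (centroid) distance = {inter:.2f}")             # large is good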

 

3. In decision tree algorithms, attribute selection measures are used to

a) Reduce the dimensionality

b) Select the splitting criteria which best separate the data

c) Reduce the error rate

d) Rank attributes

Answer: (b) Select the splitting criteria which best separate the data

Attribute selection measures in decision tree algorithms are mainly used to select the splitting criterion that best separates the given data partition.  

During the induction of the decision tree, the attribute selection measure is used to choose the attribute that best separates the remaining samples of the node's partition into individual classes.

The data set is partitioned into subsets according to a splitting criterion. This procedure is repeated recursively for each subset until each subset contains only members belonging to the same class or is sufficiently small.

Information gain, gain ratio, and the Gini index are popular attribute selection measures.
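For example, information gain for a candidate split is the entropy of the parent node minus the weighted entropy of the child partitions. A minimal sketch (assuming NumPy is installed; the class counts below are a hypothetical toy example):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

parent = np.array(["yes"] * 9 + ["no"] * 5)   # 9 positive, 5 negative tuples
left   = np.array(["yes"] * 6 + ["no"] * 1)   # child partition 1
right  = np.array(["yes"] * 3 + ["no"] * 4)   # child partition 2
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print("information gain:", entropy(parent) - weighted)  # about 0.15 bits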

 

4. Pruning a decision tree always 

a) Increases the error rate

b) Reduces the size of the tree

c) Provides the partitions with lower entropy

d) Reduces classification accuracy

Answer: (b) Reduces the size of the tree

Pruning simplifies, compresses, and optimizes a decision tree by removing sections of the tree that are non-critical or redundant for classifying instances. It can significantly reduce the size of the decision tree.

Decision trees are among the machine learning algorithms most susceptible to overfitting (the undesired induction of noise into the tree). Pruning reduces the likelihood of overfitting.
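A minimal sketch (assuming scikit-learn, whose DecisionTreeClassifier supports cost-complexity pruning via the ccp_alpha parameter) showing the size reduction on a built-in dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)
print("tree nodes:", full.tree_.node_count, "->", pruned.tree_.node_count)      # fewer nodes
print("test accuracy:", full.score(X_te, y_te), "->", pruned.score(X_te, y_te))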

 

5. Which of the following classifiers falls into the category of lazy learners?

a) Decision trees

b) Bayesian classifiers

c) k-NN classifiers

d) Rule-based classifiers

Answer: (c) k-NN classifiers

The k-nearest neighbor (k-NN) classifier is a lazy learner because it does not learn a discriminative function from the training data but instead “memorizes” the training dataset.

Lazy learning (e.g., instance-based learning) simply stores the training data (or performs only minor processing) and waits until it is given a test tuple; classification is then conducted based on the most similar instances in the stored training data.

Lazy learning is also referred to as “just-in-time learning”.

The other category of classifiers is “eager learners”.
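A minimal sketch (assuming scikit-learn is installed) that makes the “lazy” behaviour concrete: fitting a k-NN classifier essentially just stores the training set, and the real work is deferred to prediction time:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)              # lazy: no discriminative function is learned here
print(knn.predict(X[:3]))  # neighbours are looked up only when a query arrives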

 

************************

Related links:

 

 

What is lazy learning in data mining?

Which data noise problem is reduced through pruning in decision trees?

What is the role of attribute selection measures in data mining?

What are the popular attribute selection measures?

Why are non-parametric models said to be flexible?

Which machine learning algorithm is most susceptible to overfitting?

Define inter-cluster and intra-cluster distance 

Machine learning algorithms MCQ with answers

Machine learning question banks and answers

 

Data Warehousing and Data Mining Quiz Questions and Answers Home



Data Warehousing and Data Mining - MCQ Questions and Answers


  • Data Warehousing and Data Mining Quiz - SET 06

  • Data Warehousing and Data Mining Quiz - SET 07

  • Data Warehousing and Data Mining Quiz - SET 08

  • Data Warehousing and Data Mining Quiz - SET 09

  • Data Warehousing and Data Mining Quiz - SET 10




 
