Frequently asked Data Science interview questions and answers, suitable for both freshers and experienced candidates.
Note: Python questions start from question no. 74.
The Data Science interview questions and answers listed below were handpicked by experienced Data Scientists from top IT firms, including Oracle, Wipro, DBS Bank, ODBC Bank, Google, Cisco, Dell and IBM.
Data Science is one of the leading and most popular technologies, currently used across various domains to draw meaningful insights using various tools, algorithms and machine learning principles.
This collection of Data Science interview questions and answers will help you crack the interview, and it suits both freshers and experienced Data Scientists.
All The Best..!
***************************************************************************************************
Data Science General Interview Questions
1. How would you create a taxonomy to identify key customer trends in unstructured data?
Ans.The best way to approach this question is to mention that it is good to check with the business owner and understand their objectives before categorizing the data. Having done this, it is always good to follow an iterative approach: pull new data samples, improve the model accordingly, and validate it for accuracy by soliciting feedback from the stakeholders of the business. This helps ensure that your model is producing actionable results and improving over time.
2. Python or R – Which one would you prefer for text analytics?
Ans.The best possible answer for this would be Python, because it has the Pandas library, which provides easy-to-use data structures and high-performance data analysis tools.
3. Which technique is used to predict categorical responses?
Ans.Classification techniques are used to predict categorical responses; they are widely used in data mining for classifying data sets.
4. What are Recommender Systems?
Ans.A subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.
5. What is power analysis?
Ans.An experimental design technique for determining the sample size required to detect an effect of a given size with a given level of confidence.
6. What is Collaborative filtering?
Ans.The process of filtering used by most of the recommender systems to find patterns or information by collaborating viewpoints, various data sources and multiple agents.
7. What is Machine Learning?
Ans.The simplest way to answer this question is: we give the data and the form of an equation to the machine, and ask it to look at the data and identify the coefficient values in that equation.
For example, for the linear regression y = mx + c, we give the data for the variables x and y, and the machine learns the values of m and c from the data.
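As a minimal sketch of the y = mx + c example, the snippet below estimates m and c with the closed-form least-squares formulas. The function name fit_line and the sample data are illustrative assumptions, not part of the original answer.

```python
def fit_line(xs, ys):
    """Return (m, c) that minimise the squared error of y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data generated from y = 2x + 3 with no noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.0, 7.0, 9.0, 11.0, 13.0]

m, c = fit_line(xs, ys)
print("learned m:", m, "learned c:", c)
```

With noise-free data the learned coefficients match the generating equation; with real data they would only approximate it.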
8. During analysis, how do you treat missing values?
Ans.First identify the variables with missing values, then the extent of the missing values in each. If any patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation), or they can simply be ignored. There are various factors to be considered when answering this question:
Understand the problem statement and the data before giving the answer; getting into the data is important. A missing value can be assigned a default value, which can be the mean, minimum or maximum value.
If it is a categorical variable, the missing value is assigned a default value.
If the data follows a normal distribution, use the mean value.
Whether missing values should be treated at all is another important point to consider. If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
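The points above can be sketched in a few lines of plain Python. This is only an illustration: the function name impute, the use of None to mark missing values, and the sample columns are assumptions; the 80% threshold is taken from the answer.

```python
from statistics import mean, median


def impute(values, strategy="mean", drop_threshold=0.8):
    """Fill None entries with the mean or median of the observed values.

    Returns None when the share of missing values reaches drop_threshold,
    signalling that the variable should be dropped rather than treated.
    """
    observed = [v for v in values if v is not None]
    missing_share = 1 - len(observed) / len(values)
    if missing_share >= drop_threshold:
        return None
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns: one with 40% missing, one with 90% missing.
ages = [25, None, 35, 40, None]
mostly_missing = [None] * 9 + [7]

ages_filled = impute(ages, strategy="median")
dropped = impute(mostly_missing)
print(ages_filled, dropped)
```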
9. How can outlier values be treated?
Ans.Outlier values can be identified by using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outlier values. The most common ways to treat outlier values are:
To change the value and bring it within a range.
To just remove the value.
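A small sketch of the first treatment, bringing values within the 1st to 99th percentile range, is shown below. It assumes Python's statistics.quantiles with its default method as the percentile estimator; the function name cap_outliers and the data are hypothetical.

```python
from statistics import quantiles


def cap_outliers(values):
    """Clamp every value to the [1st percentile, 99th percentile] range."""
    cut_points = quantiles(values, n=100)  # 99 cut points: percentiles 1..99
    low, high = cut_points[0], cut_points[-1]
    capped = [min(max(v, low), high) for v in values]
    return capped, low, high


# Hypothetical data: 1..999 plus a single extreme value.
values = list(range(1, 1000)) + [100_000]
capped, low, high = cap_outliers(values)
print("largest value before:", max(values), "after:", max(capped))
```

Removing the value instead would simply mean filtering out everything outside the same two bounds.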
10. What is the goal of A/B Testing?
Ans.It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify which changes to a web page maximize or increase an outcome of interest. An example could be comparing the click-through rate of two versions of a banner ad.
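For the banner-ad example, one common way to run the hypothesis test is a two-proportion z-test. The answer does not name a specific test, so the choice of test, the function name and the click counts below are all assumptions made for illustration.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical experiment: banner A vs banner B, 10,000 views each.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print("z:", round(z, 3), "p-value:", round(p_value, 4))
```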
11. Why does data cleaning play a vital role in analysis?
Ans.Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process because, as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources. Cleaning alone might take up to 80% of the time, making it a critical part of the analysis task.
12. Differentiate between univariate, bivariate and multivariate analysis.
Ans.These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.
If the analysis attempts to understand the relationship between 2 variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analyzing the volume of sales and spending can be considered an example of bivariate analysis.
Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.
13. What do you understand by the term Normal Distribution?
Ans.Data is usually distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, there are chances that data is distributed around a central value without any bias to the left or right and follows a normal distribution. In that case the random variable is distributed in the form of a symmetrical bell-shaped curve.
14. What is Interpolation and Extrapolation?
Ans.Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value by extending a known set of values or facts beyond their range.
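A minimal sketch of the difference, assuming a straight line through two known points (linear estimation is only one of several possible schemes, and the function name and points are hypothetical):

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x using the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Known points: (1, 10) and (3, 30).
inside = linear_estimate(1, 10, 3, 30, 2)   # x between the known points: interpolation
outside = linear_estimate(1, 10, 3, 30, 5)  # x beyond the known points: extrapolation
print(inside, outside)
```

The same formula serves both cases; what changes is whether the query point lies inside or outside the range of the known values, which is why extrapolated values are generally less trustworthy.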
15. Are expected value and mean value different?
Ans.They are not different, but the terms are used in different contexts. Mean is generally referred to when talking about a probability distribution or sample population, whereas expected value is generally referred to in a random variable context.
For Sampling Data
Mean value is the only value that comes from the sampling data.
Expected Value is the mean of all the means i.e. the value that is built from multiple samples. Expected value is the population mean.
For Distributions
Mean value and Expected value are same irrespective of the distribution, under the condition that the distribution is in the same population.
16. What is the difference between Supervised Learning and Unsupervised Learning?
Ans.If an algorithm learns from the training data so that the knowledge can be applied to the test data, then it is referred to as Supervised Learning. Classification is an example for Supervised Learning. If the algorithm does not learn anything beforehand because there is no response variable or any training data, then it is referred to as unsupervised learning. Clustering is an example for unsupervised learning.
17. What is an Eigenvalue and Eigenvector?
Ans.Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
18. What are various steps involved in an analytics project?
Ans.• Understand the business problem
• Explore the data and become familiar with it.
• Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.
• After data preparation, start running the model, analyze the result and tweak the approach. This is an iterative step till the best possible outcome is achieved.
• Validate the model using a new data set.
• Start implementing the model and track the result to analyze the performance of the model over the period of time.
19. How can you deal with different types of seasonality in time series modelling?
Ans.Seasonality in a time series occurs when the series shows a repeated pattern over time. E.g., stationery sales decreasing during the holiday season and air conditioner sales increasing during the summer are a few examples of seasonality in a time series.
Seasonality makes your time series non-stationary, because the average value of the variable differs across time periods. Differencing a time series is generally known as the best method of removing seasonality from it. Seasonal differencing can be defined as the numerical difference between a particular value and the value with a periodic lag (i.e. 12, if monthly seasonality is present).
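Seasonal differencing with a lag of 12 can be sketched as follows. The monthly series is synthetic (a repeating 12-month pattern plus a step of 10 per year), and the function name is an assumption for illustration.

```python
def seasonal_difference(series, lag=12):
    """Subtract from each value the value observed `lag` periods earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Synthetic monthly data: the same 12-month pattern each year, shifted up by 10 per year.
pattern = [5, 3, 2, 4, 6, 9, 12, 11, 8, 6, 4, 7]
series = [pattern[i % 12] + 10 * (i // 12) for i in range(36)]

diffed = seasonal_difference(series, lag=12)
print(diffed)
```

In this constructed case the seasonal pattern cancels out completely and only the year-over-year change remains; real data would leave some noise behind.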
20. Why does L1 regularization cause parameter sparsity whereas L2 regularization does not?
Ans.Regularization in statistics and machine learning is used to include some extra information in order to solve a problem in a better way. L1 and L2 regularization are generally used to add constraints to optimization problems.
In the usual geometric picture of the two constraint regions, the L1 region has corners that lie on the axes, so there is a high likelihood of hitting a corner as the solution, while the L2 region has no corners. At a corner some coefficients are exactly zero, which is why L1 results in sparsity.
In other words, L2 penalizes the squared coefficients, so it shrinks large coefficients strongly but rarely pushes small ones all the way to zero.
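The effect can be seen in the simplest possible setting: a single coefficient w whose unpenalized estimate is z. Minimising 0.5*(w - z)^2 + lam*|w| (L1) gives the soft-thresholding rule, while minimising 0.5*(w - z)^2 + 0.5*lam*w^2 (L2) gives a plain rescaling. The one-dimensional setup, the function names and the numbers are assumptions chosen for illustration.

```python
def l1_solution(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*abs(w): soft thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small estimates are set exactly to zero


def l2_solution(z, lam):
    """Minimiser of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1 + lam)


# Hypothetical unpenalized coefficient estimates.
unpenalized = [3.0, 0.4, -0.2, -2.5]
lam = 0.5

l1_weights = [l1_solution(z, lam) for z in unpenalized]
l2_weights = [l2_solution(z, lam) for z in unpenalized]
print("L1:", l1_weights)
print("L2:", l2_weights)
```

The L1 result contains exact zeros for the two small coefficients, whereas the L2 result only makes every coefficient smaller.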
21. What does P-value signify about the statistical data?
Ans.The P-value is used to determine the significance of results after a hypothesis test in statistics. The P-value helps readers draw conclusions and is always between 0 and 1.
• P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
• P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
• P-value = 0.05 is the marginal value, indicating it is possible to go either way.
22. Do gradient descent methods always converge to same point?
Ans.No, they do not, because in some cases the method reaches a local minimum or local optimum rather than the global optimum. It depends on the data and the starting conditions.
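The dependence on starting conditions can be illustrated with a function that has two minima. The function f(x) = x^4 - 3x^2 + x, the learning rate and the starting points are assumptions picked for the sketch.

```python
def f(x):
    """A non-convex function with two separate minima."""
    return x**4 - 3 * x**2 + x


def gradient(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


left = gradient_descent(-2.0)   # starts left of the hump
right = gradient_descent(2.0)   # starts right of the hump
print("from -2.0:", round(left, 3), "f =", round(f(left), 3))
print("from  2.0:", round(right, 3), "f =", round(f(right), 3))
```

Both runs stop at a point where the gradient is essentially zero, but they are different points, and only the left one is the global minimum of this function.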
23. How can you iterate over a list and also retrieve element indices at the same time?
Ans.This can be done using Python's enumerate function, which takes each element of a sequence such as a list and pairs it with its index.
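A short illustration with a made-up list (the variable names are arbitrary):

```python
fruits = ["apple", "banana", "cherry"]

# enumerate yields (index, element) pairs, starting at 0 by default.
pairs = list(enumerate(fruits))
print(pairs)

# The optional start argument changes the first index.
for position, fruit in enumerate(fruits, start=1):
    print(position, fruit)
```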
24. Can you explain the difference between a Test Set and a Validation Set?
Ans.The validation set can be considered a part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. On the other hand, the test set is used for testing or evaluating the performance of a trained machine learning model. In simple terms, the differences can be summarized as:
• Training Set is to fit the parameters i.e. weights.
• Test Set is to assess the performance of the model i.e. evaluating the predictive power and generalization.
• Validation set is to tune the parameters.
25. What do you understand by statistical power of sensitivity and how do you calculate it?
Ans.Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, RF etc.). Sensitivity is nothing but “Predicted TRUE events / Total events”. True events here are the events which were true and which the model also predicted as true. Calculation of sensitivity is pretty straightforward:
Sensitivity = True Positives / Positives in Actual Dependent Variable
where True Positives are positive events which are correctly classified as positives.
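The formula can be checked on a handful of labels. The label vectors below are invented, and 1 is assumed to mark a positive event.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    actual_positives = sum(1 for a in actual if a == 1)
    return true_positives / actual_positives


# Hypothetical labels: 5 actual positives, 4 of which the model catches.
actual = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 1]

score = sensitivity(actual, predicted)
print("sensitivity:", score)
```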
26. How do data management procedures like missing data handling make selection bias worse?
Ans.Missing value treatment is one of the primary tasks a data scientist is supposed to do before starting data analysis. There are multiple methods for missing value treatment. If not done properly, it could potentially result in selection bias. Let us see a few missing value treatment examples and their impact on selection:
Complete Case Treatment: Complete case treatment is when you remove an entire row of data even if only one value is missing. You could introduce a selection bias if your values are not missing at random and have some pattern. Assume you are conducting a survey and a few people didn’t specify their gender. Would you remove all those people? Can’t that tell a different story?
Available Case Analysis: Let us say you are trying to calculate a correlation matrix for the data, so you remove the missing values only from the variables needed for each particular correlation coefficient. In this case your values will not be fully correct, as they are coming from different population sets.
Mean Substitution: In this method missing values are replaced with the mean of the other available values. This might make your distribution biased, e.g., standard deviation, correlation and regression are mostly dependent on the mean value of variables.
Hence, various data management procedures might introduce selection bias in your data if not chosen correctly.
***************************************************************************************************
Data Science Confusion Matrix Interview Questions
27. Can you cite some examples where a false positive is more important than a false negative?
Ans.Before we start, let us understand what false positives are and what false negatives are.
False Positives are the cases where you wrongly classified a non-event as an event a.k.a Type I error.
And, False Negatives are the cases where you wrongly classify events as non-events, a.k.a Type II error.
In the medical field, assume you have to give chemotherapy to patients. Your lab tests patients for certain vital information, and based on those results a decision is made to give the therapy to a patient.
Assume a patient comes to that hospital and tests positive for cancer based on the lab prediction, but he doesn’t actually have cancer. What will happen to him? (Assuming Sensitivity is 1)
One more example might come from marketing. Let’s say an e-commerce company decides to give a $1000 gift voucher to customers whom it expects to purchase at least $5000 worth of items. It sends the free voucher mail directly to 100 customers without any minimum purchase condition, because it assumes it will make at least 20% profit on items sold above $5000.
Now what if it has sent the vouchers to false positive cases?
28. Can you cite some examples where a false negative is more important than a false positive?
Ans.Assume there is an airport ‘A’ which has received high security threats, and based on certain characteristics they identify whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan only passengers predicted as risk positives by their predictive model.
What will happen if a passenger who is a true threat is flagged as a non-threat by the airport model?
Another example can be the judicial system. What if a jury or judge decides to let a criminal go free?
What if you refused to marry a very good person based on your predictive model, and you happen to meet him/her after a few years and realize that you had a false negative?
29. Can you cite some examples where both false positive and false negatives are equally important?
Ans.In the banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses.
Banks don’t want to lose good customers and at the same point of time they don’t want to acquire bad customers. In this scenario both the false positives and false negatives become very important to measure.
These days we hear of many cases of players using steroids during sports competitions. Every player has to go through a steroid test before the game starts. A false positive can ruin the career of a great sportsman, and a false negative can make the game unfair.
30. A test has a true positive rate of 100% and false positive rate of 5%. There is a population with a 1/1000 rate of having the condition the test identifies. Considering a positive test, what is the probability of having that condition?
Ans.Let’s suppose you are being tested for a disease. If you have the illness, the test will always say you have the illness. However, if you don’t have the illness, 5% of the time the test will say you have it and 95% of the time the test will correctly say that you don’t. Thus there is a 5% error rate in case you do not have the illness.
Out of 1000 people, the 1 person who has the disease will get a true positive result. Out of the remaining 999 people, 5% will get a false positive result.
Close to 50 people will get a false positive result for the disease.
This means that out of 1000 people, about 51 will test positive for the disease even though only one person has the illness. There is only about a 2% probability (1 in 51) of you having the disease even if your reports say that you have the disease.
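The same reasoning can be written as a direct application of Bayes' theorem, using the rates given in the question (the variable names are arbitrary):

```python
prevalence = 1 / 1000        # P(condition)
true_positive_rate = 1.0     # P(positive | condition)
false_positive_rate = 0.05   # P(positive | no condition)

# Total probability of a positive test, from both groups of people.
p_positive = (true_positive_rate * prevalence
              + false_positive_rate * (1 - prevalence))

# Bayes' theorem: P(condition | positive).
posterior = true_positive_rate * prevalence / p_positive
print("P(condition | positive test) =", round(posterior, 4))
```

The exact figure comes out slightly below 2%, which matches the rounded "1 in 51" estimate.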
***************************************************************************************************
Data Science Classification and Regression Interview Questions
31. What is Linear Regression?
Ans.Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
32. What is the difference between Cluster and Systematic Sampling?
Ans.Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection, or cluster of elements. Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner so once you reach the end of the list, it is progressed from the top again. The best example for systematic sampling is equal probability method.
33. How can you assess a good logistic model?
Ans.There are various methods to assess the results of a logistic regression analysis-
• Using Classification Matrix to look at the true negatives and false positives.
• Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening.
• Lift helps assess the logistic model by comparing it with random selection.
34. How will you define the number of clusters in a clustering algorithm?
Ans.Though the Clustering Algorithm is not specified, this question will mostly be asked in reference to K-Means clustering where “K” defines the number of clusters. The objective of clustering is to group similar entities in a way that the entities within a group are similar to each other but the groups are different from each other.
For example, a scatter plot of the data may show three visibly distinct groups.
Within Sum of Squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS against a range of numbers of clusters, the resulting graph is generally known as the Elbow Curve.
The point on the curve after which you don’t see any meaningful decrement in WSS (Number of Clusters = 6 in the original example plot) is known as the bending point and is taken as K in K-Means.
This is the widely used approach, but a few data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.
35. Is it possible to perform logistic regression with Microsoft Excel?
Ans.It is possible to perform logistic regression with Microsoft Excel. There are two ways to do it using Excel.
• One is to use Add-ins provided by many websites which we can use.
• Second is to use fundamentals of logistic regression and use Excel’s computational power to build a logistic regression
But when this question is asked in an interview, the interviewer is not looking for the name of an add-in but rather for a method using base Excel functionality. Let’s use sample data to learn about logistic regression using Excel. (The example assumes that you are familiar with the basic concepts of logistic regression.) The sample data consists of three variables, where X1 and X2 are independent variables and Y is a class variable. We have kept only 2 categories for our purpose of building a binary logistic regression classifier.
Next we have to create a logit function using independent variables, i.e.
Logit = L = β0 + β1*X1 + β2*X2
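The logit is turned into a predicted probability through the logistic function, P = 1 / (1 + e^(-L)). The sketch below shows only that step; the coefficient values are hypothetical placeholders, not fitted estimates, and the function name is an assumption.

```python
import math


def predict_probability(x1, x2, b0, b1, b2):
    """Convert the logit b0 + b1*x1 + b2*x2 into a probability of class 1."""
    logit = b0 + b1 * x1 + b2 * x2
    return 1 / (1 + math.exp(-logit))


# Hypothetical coefficients; in practice they are found by maximising the likelihood.
b0, b1, b2 = 0.0, 1.0, 1.0

p_zero = predict_probability(0, 0, b0, b1, b2)   # logit of 0 gives 0.5
p_high = predict_probability(2, 1, b0, b1, b2)   # larger logit, higher probability
print(p_zero, round(p_high, 4))
```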
36. You created a predictive model of a quantitative outcome variable using multiple regressions. What are the steps you would follow to validate the model?
Ans.Since the question is about the post-model-building exercise, we will assume that you have already tested for the null hypothesis, multicollinearity and the standard error of coefficients.
Once you have built the model, you should check for following –
– Global F-test to see the significance of group of independent variables on dependent variable
– R^2
– Adjusted R^2
– RMSE, MAPE
In addition to above mentioned quantitative metrics you should also check for-
– Residual plot
– Assumptions of linear regression
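The quantitative metrics listed above can be computed with their standard formulas. This is a minimal sketch; the function name, the sample values and the assumed number of predictors are illustrative only.

```python
import math


def regression_metrics(actual, predicted, n_predictors):
    """Return R^2, adjusted R^2, RMSE and MAPE (in percent) for a fitted model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalises the number of predictors in the model.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    mape = 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mape": mape}


# Hypothetical hold-out values and predictions from a model with 2 predictors.
actual = [10, 20, 30, 40, 50]
predicted = [12, 18, 33, 39, 48]

metrics = regression_metrics(actual, predicted, n_predictors=2)
print(metrics)
```

MAPE as written assumes no actual value is zero; that limitation is inherent in the metric.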
37. Give some situations where you will use an SVM over a RandomForest Machine Learning algorithm and vice-versa.
Ans.SVM and Random Forest are both used in classification problems.
a) If you are sure that your data is outlier free and clean, then go for SVM. Conversely, if your data might contain outliers, then Random Forest would be the best choice.
b) Generally, SVM consumes more computational power than Random Forest, so if you are constrained with memory go for Random Forest machine learning algorithm.
c) Random Forest gives you a very good idea of variable importance in your data, so if you want to have variable importance then choose Random Forest machine learning algorithm.
d) Random Forest machine learning algorithms are preferred for multiclass problems.
e) SVM is preferred in multi-dimensional problem set – like text classification but as a good data scientist, you should experiment with both of them and test for accuracy or rather you can use ensemble of many Machine Learning techniques.
***************************************************************************************************
Data Science R programming Interview Questions
38. What do you understand by element recycling in R?
Ans.If two vectors with different lengths perform an operation –the elements of the shorter vector will be re-used to complete the operation. This is referred to as element recycling.
Example – Vector A <- c(1,2,0,4) and Vector B <- c(3,6), then the result of A*B will be (3,12,0,24). Here 3 and 6 of vector B are repeated when computing the result.
39. How can you verify if a given object “X” is a matrix data object?
Ans.If the function call is.matrix(X) returns TRUE then X can be considered a matrix data object, otherwise not.
40. How will you measure the probability of a binary response variable in R language?
Ans.Logistic regression can be used for this, and the function glm() in R language (with family = binomial) provides this functionality.
41. What is the use of sample and subset functions in R programming language?
Ans.The sample() function can be used to select a random sample of size ‘n’ from a huge dataset.
The subset() function is used to select variables and observations from a given dataset.
42. There is a function fn <- function(a, b, c, d, e) a + b * c - d / e. Write the code to call fn on the vector c(1,2,3,4,5) such that the output is the same as fn(1,2,3,4,5).
Ans.do.call(fn, as.list(c(1, 2, 3, 4, 5)))
43. How can you resample statistical tests in R language?
Ans.The coin package in R provides various options for re-randomization and permutation-based statistical tests. When test assumptions cannot be met, this package serves as the best alternative to classical methods, as it does not assume random sampling from well-defined populations.
44. What is the purpose of using Next statement in R language?
Ans.If a developer wants to skip the current iteration of a loop in the code without terminating it then they can use the next statement. Whenever the R parser comes across the next statement in the code, it skips evaluation of the loop further and jumps to the next iteration of the loop.
45. How will you create scatterplot matrices in R language?
Ans.A matrix of scatterplots can be produced using pairs(). The pairs() function takes various parameters like formula, data, subset, labels, etc.
The two key parameters required to build a scatterplot matrix are –
• formula- A formula basically like ~a+b+c . Each term gives a separate variable in the pairs plots where the terms should be numerical vectors. It basically represents the series of variables used in pairs.
• data- It basically represents the dataset from which the variables have to be taken for building a scatterplot.
46. How will you check if an element 25 is present in a vector?
Ans.There are various ways to do this-
i. It can be done using the match() function, which returns the position of the first appearance of a particular element.
ii. The other is to use %in%, which returns a Boolean value, either TRUE or FALSE.
iii. The is.element() function also returns a Boolean value, either TRUE or FALSE, based on whether the element is present in a vector or not.
47. What is the difference between library() and require() functions in R language?
Ans.There is no real difference between the two if the packages are not being loaded inside a function. The require() function is usually used inside functions and throws a warning whenever a particular package is not found. On the flip side, the library() function gives an error message if the desired package cannot be loaded.
***************************************************************************************************
Data Science R Programming
48. What are the rules to define a variable name in R programming language?
Ans.A variable name in R programming language can contain letters and digits along with the special characters dot (.) and underscore (_). Variable names in R language can begin with a letter or the dot symbol. However, if the variable name begins with a dot symbol, it should not be followed by a digit.
49. What do you understand by a workspace in R programming language?
Ans.The current R working environment of a user that has user defined objects like lists, vectors, etc. is referred to as Workspace in R language.
50. Which function helps you perform sorting in R language?
Ans.order()
51. How will you list all the data sets available in all R packages?
Ans.Using the below line of code-
data(package = .packages(all.available = TRUE))
52. Which function is used to create a histogram visualisation in R programming language?
Ans.hist()
53. Write the syntax to set the path for current working directory in R environment?
Ans.setwd("dir_path")
54. How will you drop variables using indices in a data frame?
Ans.Let’s take a dataframe df<-data.frame(v1=c(1:5),v2=c(2:6),v3=c(3:7),v4=c(4:8))
df
## v1 v2 v3 v4
## 1 1 2 3 4
## 2 2 3 4 5
## 3 3 4 5 6
## 4 4 5 6 7
## 5 5 6 7 8
Suppose we want to drop variables v2 and v3. They can be dropped using negative indices as follows:
df1<-df[-c(2,3)]
df1
## v1 v4
## 1 1 4
## 2 2 5
## 3 3 6
## 4 4 7
## 5 5 8
55. What will be the output of runif (7)?
Ans.It will generate 7 random numbers between 0 and 1.
56. What is the difference between rnorm and runif functions ?
Ans.rnorm function generates “n” normal random numbers based on the mean and standard deviation arguments passed to the function.
Syntax of rnorm function – rnorm(n, mean = , sd = )
runif function generates “n” uniform random numbers in the interval between the minimum and maximum values passed to the function.
Syntax of runif function –
runif(n, min = , max = )
57. What will be the output on executing the following R programming code?
mat<-matrix(rep(c(TRUE,FALSE),8),nrow=4)
sum(mat)
Ans.8
***************************************************************************************************
Data Science R Programming Interview Questions
58. How will you combine multiple different strings like “Data”, “Science”, “in”, “R”, “Programming” into a single string “Data_Science_in_R_Programming”?
Ans.paste("Data", "Science", "in", "R", "Programming", sep = "_")
59. Write a function to extract the first name from the string “Mr. Tom White”.
Ans.substr("Mr. Tom White", start = 5, stop = 7)
60. Can you tell if the equation given below is linear or not? Emp_sal = 2000 + 2.5(emp_age)^2
Ans.Yes, it is a linear equation in the regression sense, as it is linear in the coefficients even though emp_age is squared.
61. What will be the output of the following R programming code?
var2<- c("I","Love,"DeZyre")
var2
Ans.It will give an error, because the quotes around Love are unbalanced.
62. What will be the output of the following R programming code?
x<-5
if(x%%2==0)
print("X is an even number")
else
print("X is an odd number")
Ans.Executing the above code will result in an error as shown below –
## Error: :4:1: unexpected ‘else’
## 3: print(“X is an even number”)
## 4: else
## ^
R programming language does not know whether the else relates to the first ‘if’, because the first if() is a complete command on its own.
63. I have a string “contact@dezyre.com”. Which string function can be used to split the string into two different strings “contact@dezyre” and “com” ?
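Ans.This can be accomplished using the strsplit function, which splits a string based on the identifier given in the function call. The output of the strsplit() function is a list. Because split is treated as a regular expression by default, where “.” matches any character, fixed = TRUE is passed so that the dot is matched literally.
strsplit("contact@dezyre.com", split = ".", fixed = TRUE)
Output of the strsplit function is –
## [[1]]
## [1] "contact@dezyre" "com"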
64. What is R Base package?
Ans.The R base package is the package that is loaded by default whenever the R programming environment is loaded. The base package provides basic functionalities in the R environment like arithmetic calculations and input/output.
65. How will you merge two dataframes in R programming language?
Ans.The merge() function is used to combine two dataframes; it identifies common rows or columns between the 2 dataframes. With its default settings, merge() finds the intersection between two different sets of data.
The merge() function in R language takes a long list of arguments. The syntax for the most commonly used ones is –
merge(x, y, by.x, by.y, all.x or all.y or all)
• x represents the first dataframe.
• y represents the second dataframe.
• by.x – Variable name in dataframe x that is common in y.
• by.y – Variable name in dataframe y that is common in x.
• all.x – It is a logical value that specifies the type of merge. all.x should be set to TRUE if we want all the observations from dataframe x. This results in a Left Join.
• all.y – It is a logical value that specifies the type of merge. all.y should be set to TRUE if we want all the observations from dataframe y. This results in a Right Join.
• all – The default value for this is FALSE, which means that only matching rows are returned, resulting in an Inner Join. This should be set to TRUE if you want all the observations from dataframes x and y, resulting in an Outer Join.
66. Write the R programming code for an array of words so that the output is displayed in decreasing frequency order.
Ans.R programming code to display output in decreasing frequency order –
tt <- sort(table(c("a", "b", "a", "a", "b", "c", "a1", "a1", "a1")), decreasing = TRUE)
depth <- 3
tt[1:depth]
Output –
 a a1  b
 3  3  2
67. How to check the frequency distribution of a categorical variable?
Ans.The frequency distribution of a categorical variable can be checked using the table() function in R, which counts the observations in each category.
gender = factor(c("M","F","M","F","F","F"))
table(gender)
Output of the above R code –
gender
F M
4 2
Programmers can also calculate the % of values for each categorical group by storing the output in a dataframe and applying the column percent function as shown below –
t = data.frame(table(gender))
t$percent= round(t$Freq / sum(t$Freq)*100,2)
Gender   Frequency   Percent
F        4           66.67
M        2           33.33
68. What is the procedure to check the cumulative frequency distribution of any categorical variable?
Ans.The cumulative frequency distribution of a categorical variable can be checked by applying the cumsum() function to its frequency table.
gender = factor(c("f","m","m","f","m","f"))
y = table(gender)
cumsum(y)
Output of the above R code –
f m
3 6
69. What will be the result of multiplying two vectors in R having different lengths?
Ans.The multiplication is performed and the result is displayed with a warning message like – "longer object length is not a multiple of shorter object length". Suppose there is a vector a <- c(1, 2, 3) and a vector b <- c(2, 3); then a*b gives 2 6 6 along with the warning. The multiplication is performed element by element, and since the lengths differ, R recycles the shorter vector: once b is exhausted, its first element is reused and multiplied with the last element of a (3 × 2 = 6).
70. R programming language has several packages for data science which are meant to solve a specific problem, how do you decide which one to use?
Ans.CRAN package repository in R has more than 6000 packages, so a data scientist needs to follow a well-defined process and criteria to select the right one for a specific task. When looking for a package in the CRAN repository a data scientist should list out all the requirements and issues so that an ideal R package can address all those needs and issues.
The best way to answer this question is to look for an R package that follows good software development principles and practices. For example, look at the quality of its documentation and unit tests. The next step is to check how the package is used in practice and read the reviews posted by other users. It is important to know whether other data scientists or data analysts have been able to solve a problem similar to yours with it. When in doubt about choosing a particular R package, ask for feedback from R community members or colleagues to make sure you are making the right choice.
71. How can you merge two data frames in R language?
Ans.Data frames in R can be combined column-wise with the cbind() function when their rows already line up, or joined on common columns or row names with the merge() function.
72. Explain the usage of which() function in R language.
Ans.The which() function returns the positions of the elements of a logical vector that are TRUE. In the example below, we find the row number at which the maximum value of variable v1 is recorded.
mydata = data.frame(v1 = c(2,4,12,3,6))
which(mydata$v1 == max(mydata$v1))
It returns 3, as 12 is the maximum value and it is in the 3rd row of variable v1.
73. How will you convert a factor variable to numeric in R language?
Ans.A factor variable can be converted to numeric using the as.numeric() function in R. However, the variable first needs to be converted to character, because applying as.numeric() directly to a factor does not return the original values but the internal integer codes of the factor levels.
X <- factor(c(4, 5, 6, 6, 4))
X1 = as.numeric(as.character(X))
***************************************************************************************************
Data Science Python Interview Questions
74. Write a function that takes in two sorted lists and outputs a sorted list that is their union.
Ans.The first solution that will come to mind is to concatenate the two lists and sort them afterwards. Note that duplicates are kept.
Python code-
def return_union(list_a, list_b):
    return sorted(list_a + list_b)
R code-
return_union <- function(list_a, list_b)
{
  list_c <- list(c(unlist(list_a), unlist(list_b)))
  return(list(list_c[[1]][order(list_c[[1]])]))
}
Generally, the tricky part of the question is not to use any sorting or ordering function. In that case you will have to write your own logic to answer the question and impress your interviewer.
Python code-
def return_union(list_a, list_b):
    len1 = len(list_a)
    len2 = len(list_b)
    final_sorted_list = []
    j = 0
    k = 0
    for i in range(len1+len2):
        if k == len1:
            final_sorted_list.extend(list_b[j:])
            break
        elif j == len2:
            final_sorted_list.extend(list_a[k:])
            break
        elif list_a[k] < list_b[j]:
            final_sorted_list.append(list_a[k])
            k += 1
        else:
            final_sorted_list.append(list_b[j])
            j += 1
    return final_sorted_list
A similar function can be written in R by following the same steps.
return_union <- function(list_a, list_b)
{
  # Initializing length variables
  len_a <- length(list_a)
  len_b <- length(list_b)
  len <- len_a + len_b
  # Initializing counter variables
  j <- 1
  k <- 1
  # Creating an empty list whose length equals the sum of both lists
  list_c <- vector("list", len)
  # Here goes our for loop
  for (i in seq_len(len)) {
    if (j > len_a) {
      # list_a is exhausted: copy the rest of list_b
      list_c[i:len] <- list_b[k:len_b]
      break
    } else if (k > len_b) {
      # list_b is exhausted: copy the rest of list_a
      list_c[i:len] <- list_a[j:len_a]
      break
    } else if (list_a[[j]] <= list_b[[k]]) {
      list_c[[i]] <- list_a[[j]]
      j <- j + 1
    } else {
      list_c[[i]] <- list_b[[k]]
      k <- k + 1
    }
  }
  return(list(unlist(list_c)))
}
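If the interviewer allows the standard library, Python's heapq.merge performs the same linear merge of inputs that are already sorted. The following is only a minimal sketch; the function name return_union_heapq and the sample lists are illustrative assumptions, not part of the original answer.

```python
import heapq

def return_union_heapq(list_a, list_b):
    """Merge two already-sorted lists into one sorted list (duplicates are kept)."""
    # heapq.merge yields items lazily and assumes each input is already sorted
    return list(heapq.merge(list_a, list_b))

merged = return_union_heapq([1, 4, 9], [2, 4, 10, 12])
print(merged)
```

Like the hand-written loop, this never re-sorts the data; it only compares the current heads of the two inputs.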
75. Name a few libraries in Python used for Data Analysis and Scientific computations.
Ans.NumPy, SciPy, Pandas, scikit-learn, Matplotlib, Seaborn
76. Which library would you prefer for plotting in Python language: Seaborn or Matplotlib?
Ans.Matplotlib is the core Python plotting library, but it needs a lot of fine-tuning to make plots look polished. Seaborn, which is built on top of it, helps data scientists create statistically meaningful and aesthetically appealing plots with less code. The answer to this question depends on the plotting requirements.
77. Which method in pandas.tools.plotting is used to create scatter plot matrix?
Ans.scatter_matrix (in current pandas versions it is available as pandas.plotting.scatter_matrix)
78. How can you check if a data set or time series is Random?
Ans.To check whether a dataset is random, use a lag plot. If the lag plot for the given dataset does not show any structure, the data is random.
79. What are the possible ways to load an array from a text data file in Python? How can the efficiency of the code to load data file be improved?
Ans.numpy.loadtxt() (numpy.genfromtxt() is an alternative that can also handle missing values)
80. Which is the standard data missing marker used in Pandas?
Ans.NaN
81. Which Python library would you prefer to use for Data Munging?
Ans.Pandas
82. Write the code to sort an array in NumPy by the nth column?
Ans.This can be achieved using the argsort() function. If there is a two-dimensional array x and you would like to sort its rows by the nth column, the code for this is x[x[:, n-1].argsort()]
83. Which python library is built on top of matplotlib and Pandas to ease data plotting?
Ans.Seaborn
84. Which plot will you use to assess the uncertainty of a statistic?
Ans.Bootstrap plot
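The idea behind a bootstrap plot can be sketched without any plotting library: resample the data with replacement many times, recompute the statistic each time, and look at the spread of the results. The sample values, the number of resamples and the seed below are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so that repeated runs resample identically

sample = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 9.5, 12.7, 11.0, 10.4]  # made-up data

def bootstrap_means(data, n_resamples=1000):
    """Return the mean of each resample drawn with replacement from data."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))  # sampling with replacement
        means.append(statistics.mean(resample))
    return means

means = sorted(bootstrap_means(sample))
# Rough 95% percentile interval: cut off 2.5% of the resampled means at each end
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means))]
print("sample mean:", statistics.mean(sample))
print("approximate 95% interval for the mean:", (lower, upper))
```

A bootstrap plot would draw the distribution of the values in means; a wide spread signals high uncertainty in the statistic.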
85. What is pylab?
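Ans.A Matplotlib module that brings matplotlib.pyplot and NumPy together into a single namespace. Its use is discouraged in current Matplotlib documentation in favour of explicit imports.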
86. Which python library is used for Machine Learning?
Ans.SciKit-Learn
87. How can you copy objects in Python?
Ans.The copy module provides the functions used to copy objects in Python –
a. copy.copy() for a shallow copy
b. copy.deepcopy() for a deep copy
Not every object can be copied with these functions (modules, open files and sockets, for example, cannot). Dictionaries also have their own copy() method, and sequences can be shallow-copied by slicing.
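The difference between the two kinds of copy shows up as soon as the object contains another mutable object. The dictionary below is made-up sample data used only for illustration.

```python
import copy

original = {"scores": [1, 2, 3], "name": "run-1"}

shallow = copy.copy(original)      # new dict, but the inner list is shared
deep = copy.deepcopy(original)     # new dict and a new inner list

original["scores"].append(4)

print(shallow["scores"])  # the shared list reflects the change
print(deep["scores"])     # the deep copy is unaffected

# Built-in shortcuts that also produce shallow copies
dict_copy = original.copy()
list_copy = original["scores"][:]
```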
88. What is the difference between tuples and lists in Python?
Ans.Lists are mutable whereas tuples are immutable – they cannot be changed after creation. Because of this, tuples can be hashed and used as keys for dictionaries (as long as their elements are hashable), while lists cannot. Tuples are a good fit for fixed collections of values that should not change, for example a set of actions that must be executed in a given sequence, geographic locations or the points on a specific route.
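A short sketch of both properties follows; the coordinates and names are illustrative sample values.

```python
# A tuple of hashable items can serve as a dictionary key
city_by_location = {(51.5074, -0.1278): "London", (40.7128, -74.0060): "New York"}
print(city_by_location[(51.5074, -0.1278)])

# A list is mutable and therefore unhashable, so it is rejected as a key
list_key_rejected = False
try:
    city_by_location[[48.8566, 2.3522]] = "Paris"
except TypeError:
    list_key_rejected = True

# Lists can be changed in place; tuples cannot
route = [(0, 0), (1, 2)]
route.append((3, 4))

point = (1, 2)
tuple_assignment_rejected = False
try:
    point[0] = 10
except TypeError:
    tuple_assignment_rejected = True

print(list_key_rejected, tuple_assignment_rejected, route)
```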
89. What is PEP8?
Ans.PEP 8 is the style guide for Python code. Its guidelines help programmers write readable code that is easy for others to use and maintain later on.
90. Is all the memory freed when Python exits?
Ans.No, it is not, because the objects that are referenced from global namespaces of Python modules are not always de-allocated when Python exits.
Data Science Python Interview Questions- 2
91. What does __init__.py do?
Ans.__init__.py is a file, often empty, that marks a directory as a Python package so that the modules inside it can be imported. It also provides an easy way to organize the files and can hold package initialization code. If there is a module maindir/subdir/module.py, __init__.py is placed in each of these directories so that the module can be imported using the following command –
import maindir.subdir.module
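The layout described above can be tried out in a throwaway directory. This is a sketch only: the package is created in a temporary folder, and the ANSWER constant written into module.py is a made-up value for illustration.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    subdir = os.path.join(root, "maindir", "subdir")
    os.makedirs(subdir)
    # Place an empty __init__.py in every package directory
    for package_dir in (os.path.join(root, "maindir"), subdir):
        open(os.path.join(package_dir, "__init__.py"), "w").close()
    with open(os.path.join(subdir, "module.py"), "w") as f:
        f.write("ANSWER = 42\n")

    sys.path.insert(0, root)          # make the temporary folder importable
    importlib.invalidate_caches()     # the files were created after startup
    try:
        module = importlib.import_module("maindir.subdir.module")
        answer = module.ANSWER
    finally:
        sys.path.remove(root)

print(module.__name__, answer)
```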
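92. What is the difference between range () and xrange () functions in Python?
Ans.In Python 2, range() returns a list whereas xrange() returns an object that generates the numbers on demand, which saves memory for large ranges. In Python 3, xrange() has been removed and range() itself behaves this way.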
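The on-demand behaviour can be observed directly in Python 3, where range() plays the role that xrange() had in Python 2. A minimal sketch:

```python
import sys

numbers = range(10**9)            # created instantly; no billion-element list is built
size_in_bytes = sys.getsizeof(numbers)
print(size_in_bytes)              # small and independent of the length of the range
print(numbers[5], len(numbers))   # indexing and len() still work

as_list = list(range(5))          # materialise explicitly when a real list is needed
print(as_list)
```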
93. How can you randomize the items of a list in place in Python?
Ans.random.shuffle(lst) from the random module can be used for randomizing the items of a list in place in Python.
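A common pitfall is to assign the result of the call, so the sketch below makes the in-place behaviour explicit. The list contents are arbitrary.

```python
import random

lst = [1, 2, 3, 4, 5, 6]
before = list(lst)               # keep a copy for comparison

result = random.shuffle(lst)     # shuffles lst in place and returns None

print(lst)      # same items, in a random order
print(result)
```

Use random.sample(lst, len(lst)) instead when the original list must stay untouched and a new shuffled list is wanted.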
94. What is a pass in Python?
Ans.Pass in Python signifies a no operation statement indicating that nothing is to be done.
95. If you are given the first and last names of employees, which data type in Python will you use to store them?
Ans.You can use a list in which each element holds a first name and a last name (for example as a tuple), or use a dictionary.
96. What happens when you execute the statement mango=banana in Python?
Ans.A NameError will be raised when this statement is executed, unless a variable named banana has already been defined.
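The behaviour is easy to reproduce. In the sketch below the error is caught only so that the script can continue.

```python
try:
    mango = banana          # banana has not been defined at this point
except NameError as exc:
    message = str(exc)
    print("NameError:", message)

banana = "yellow"
mango = banana              # works once the name on the right-hand side exists
print(mango)
```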
97. Optimize the below Python code –
word = 'word'
print(word.__len__())
Ans.print('word'.__len__())
The intermediate variable is unnecessary. The idiomatic form is print(len('word')).
98. What is monkey patching in Python?
Ans.Monkey patching is a technique that lets the programmer modify or extend other code at runtime. It comes in handy in testing, but it is not good practice in a production environment because it can make the code difficult to debug.
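A minimal sketch of the technique in a testing context follows. RateFetcher and fake_fetch are hypothetical names invented for this example.

```python
class RateFetcher:
    """Hypothetical class standing in for code that calls an external service."""
    def fetch(self):
        # Imagine a slow network call here
        return 1.23

def fake_fetch(self):
    # Replacement used only during the test; returns a fixed value immediately
    return 42.0

original_fetch = RateFetcher.fetch
RateFetcher.fetch = fake_fetch            # monkey patch: swap the method at runtime
try:
    patched_value = RateFetcher().fetch()
finally:
    RateFetcher.fetch = original_fetch    # always restore the original behaviour

print(patched_value)
print(RateFetcher().fetch())
```

In real test suites, unittest.mock.patch from the standard library handles the swap and the restore automatically.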
99. What is pickling and unpickling?
Ans.The pickle module accepts almost any Python object and converts it into a byte stream, which can be written to a file using the dump function. This process is called pickling. The process of retrieving the original Python object from the stored byte stream is called unpickling.
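The round trip can be sketched in memory with pickle.dumps and pickle.loads; pickle.dump and pickle.load do the same against an open binary file. The record below is made-up sample data.

```python
import pickle

record = {"model": "logistic", "coefficients": [0.4, -1.2], "trained": True}

payload = pickle.dumps(record)     # pickling: object -> bytes
restored = pickle.loads(payload)   # unpickling: bytes -> object

print(type(payload))
print(restored == record)
```

Only unpickle data from a trusted source, since loading a pickle can execute arbitrary code.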
100. What are the tools that help to find bugs or perform static analysis?
Ans.PyChecker is a static analysis tool that detects bugs in Python source code and warns about the style and complexity of the code. Pylint is another tool that verifies whether a module meets the coding standard.