Give a decision tree representing a simple boolean function.
What is the entropy of the labels in a set of data, and what is the information gain of an attribute?
Design a perceptron (or small network of perceptrons) implementing a simple function.
Derive a gradient descent learning rule for updating the parameters in a prediction rule with a given loss measure (e.g. minimize the square loss of a function like $w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2$ on a set of data).
Given a concept class $\cal C$, why might it be easier to PAC learn $\cal C$ using a larger hypothesis class ${\cal C'}\supset {\cal C}$?
According to the (noise-free) PAC sample size bounds, if we want to increase the accuracy by a factor of two, approximately how many more examples will we need?
What is the VC dimension of hyper-rectangles in $d$ dimensions? (Each hyper-rectangle is described by a range for each dimension, and a point is in a hyper-rectangle if each coordinate of the point is in the corresponding range.)
Given a set of coin flips and a prior distribution on the probability of heads, compute the maximum likelihood, maximum a posteriori, and mean a posteriori probabilities that the next flip is heads.
Given losses and probabilities of outcomes, find a Bayes optimal prediction.
Find the maximum likelihood Gaussian distribution from a (small) sample.
Given some data, find the prediction made by Naive Bayes on a (given) new instance.
Describe the difference between regression and classification problems.
Given a (simple) coding scheme, find the hypothesis which minimizes the description length of the data.
Given a Bayes network over some features, and some relationships between features (independence, conditional independence), does the network correctly reflect the relationships?
Simulate the WM (Weighted Majority) algorithm on a sequence of instances.
How do the mistake bounds of Winnow for learning disjunctions depend on the number of attributes?
Run the perceptron algorithm on a sequence of instances.
What is the logistic function?
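
For the question on entropy and information gain, the following is a minimal sketch rather than a reference solution. It assumes entropy is measured in bits (log base 2) and that information gain means the label entropy minus the expected label entropy after splitting on the attribute. The function names and the toy dataset (labels equal to x1 AND x2) are invented here for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Label entropy minus the expected label entropy after splitting on `attribute`."""
    n = len(labels)
    # Group the labels by the value the attribute takes in each row.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted average of the entropy within each group.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: the label is x1 AND x2.
rows = [
    {"x1": 0, "x2": 0},
    {"x1": 0, "x2": 1},
    {"x1": 1, "x2": 0},
    {"x1": 1, "x2": 1},
]
labels = [0, 0, 0, 1]

print(round(entropy(labels), 4))                       # about 0.8113 bits
print(round(information_gain(rows, labels, "x1"), 4))  # about 0.3113 bits
```

On this toy data, splitting on x1 leaves one pure group (x1 = 0) and one evenly mixed group (x1 = 1), so the expected entropy after the split is 0.5 bits and the gain is the difference from the original 0.8113 bits.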