Project outline and scope
The idea of the project is to tie the course together, applying
several of the main topics we have learned during the quarter. In
particular, the project should involve the following three parts:
- Experimental Design - a careful decision about which datapoints to
collect. The type of design and the number of datapoints will depend on
aspects of the particular problem, such as how hard it is to obtain
each point.
- Computer Simulation - the experiments must be run, and should all
be done by computer. Ideally the simulator can be set up to run a
number of experiments in batch without any further inputs from the user.
- Analysis of Results - use of statistical tools to analyze the
data and draw scientifically valid conclusions.
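
As a rough illustration of how the three parts fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `simulate` function is a made-up stand-in for a real simulator, the factor names and levels are hypothetical, and a two-level full factorial with replicates is only one of many possible designs.

```python
import itertools
import random
import statistics


def simulate(temperature, pressure, seed):
    """Hypothetical stand-in simulator: a made-up linear response plus noise."""
    rng = random.Random(seed)
    return 2.0 * temperature - 0.5 * pressure + rng.gauss(0.0, 0.1)


# Experimental design: a full factorial over two factors at two levels each,
# with three replicates per design point (all values are illustrative).
levels = {"temperature": [10.0, 20.0], "pressure": [1.0, 5.0]}
replicates = 3
design = list(itertools.product(levels["temperature"], levels["pressure"]))

# Computer simulation: run every design point in batch, with no user input.
results = []
run_id = 0
for temperature, pressure in design:
    for _ in range(replicates):
        y = simulate(temperature, pressure, seed=run_id)
        results.append({"temperature": temperature, "pressure": pressure, "y": y})
        run_id += 1


# Analysis of results: estimate each main effect as the mean response at the
# high level minus the mean response at the low level of that factor.
def main_effect(factor):
    low, high = levels[factor]
    mean_high = statistics.mean(r["y"] for r in results if r[factor] == high)
    mean_low = statistics.mean(r["y"] for r in results if r[factor] == low)
    return mean_high - mean_low


effects = {factor: main_effect(factor) for factor in levels}
print(effects)
```

In a real project the main-effect estimates would be followed by a formal analysis (for example, a hypothesis test on each effect), and the stand-in function would be replaced by a call to the actual simulator.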
Here are some additional guidelines to help focus the project and keep
it manageable:
- Question of interest - the key question of interest should be
well defined, and will generally be one (or more) of the following
three types of questions:
  - Is this variable significant? - Either asking which of a set
  of variables are important, or asking if a particular variable is
  important when a number of others are adjusted for. This question
  is generally answered with hypothesis testing. You should be able
  to clearly state your hypotheses.
  - How do we optimize? - Trying to either maximize or minimize a
  function by adjusting the inputs.
  - What is the response surface? - Creating a model for the
  output of the simulator, in order to understand the relationships
  between the inputs and the output. The primary analytical method
  here is response surface methodology, which we will get to after
  the midterm.
It is critical to make sure that the question of interest can be
answered by the data that will be collected. Phrasing the question
as above will help.
- The response variable - in general, the methods of this course
are geared toward having a single continuous-valued response. If your
simulator produces multiple outputs (as many do), often the problem
can be simplified by considering just one of them and ignoring the
rest. Categorical responses require more powerful statistical
machinery than we will see in class, so those are best avoided for
the project. If a categorical response is unavoidable, then you will
have to learn a bit more statistics.
- Workable numbers of inputs (factors) - you want to have enough
adjustable factors to make the problem interesting, but not so many
that the problem becomes too hard. The right number depends on how many
runs are feasible, i.e., how long it takes or how hard it is to do
each run. For a complicated simulator that needs to run overnight to
produce a single datapoint, you will only be able to do a small number
of runs. For a fast simulator, you may be able to do thousands or
millions of runs, so more factors can be considered.
- Time allocation - I'm aiming for this to take about 25-30 hours
over the course of several weeks, including time to set up
runs and analyze the results, but not including the actual simulation
time (which will vary tremendously by project).
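
The trade-off between the number of factors and the number of feasible runs, described in the guidelines above, can be made concrete with a short calculation. The sketch below assumes a full factorial design, in which the run count is the number of replicates times the number of levels raised to the number of factors. The function names and the run budgets are hypothetical, and other designs (such as fractional factorials) need fewer runs.

```python
def full_factorial_runs(n_factors, levels_per_factor=2, replicates=1):
    """Number of runs in a replicated full factorial design."""
    return replicates * levels_per_factor ** n_factors


def max_factors(run_budget, levels_per_factor=2, replicates=1):
    """Largest number of factors whose full factorial fits in the run budget."""
    k = 0
    while full_factorial_runs(k + 1, levels_per_factor, replicates) <= run_budget:
        k += 1
    return k


# Hypothetical budgets: a slow simulator that allows about 20 runs in total,
# and a fast one that allows a million runs with three replicates per point.
slow = max_factors(run_budget=20)
fast = max_factors(run_budget=1_000_000, replicates=3)
print(slow, fast)
```

Because the run count grows exponentially with the number of factors, even a very large run budget supports only a modest number of factors under a full factorial design.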