
provenance_challenge_ipaw_info messages

[provenance-challenge] Second provenance challenge

From: "Simon Miles" <sm AT ecs.soton.ac.uk>
Date: Tue, 28 Nov 2006 16:59:20 +0000



Hello,

We have drafted a proposal for a second provenance challenge, derived
from that discussed at the workshop in Washington in September.

http://twiki.ipaw.info/bin/view/Challenge/SecondProvenanceChallenge

We welcome any comments or suggestions - does it seem reasonable and
what you were expecting?  Can I ask that all comments are given by 6th
December so that, if acceptable, the challenge can officially start on
8th December.

Thanks,
Simon, Juliana, Luc


Re: [provenance-challenge] Second provenance challenge

From: "Simon Miles" <sm AT ecs.soton.ac.uk>
Date: Wed, 29 Nov 2006 11:20:06 +0000



Hello Jun,

Thanks for the feedback.

Jun Zhao wrote:
> As I saw from the conclusion of the first challenge, it seems difficult to
> compare the query results returned from different groups. One of the
> problems that occurred to me when answering the first challenge was to
> understand the scope of querying information space, i.e. should I retrieve
> information from one run of the workflow or many runs. I am not sure
> whether it matters that much to the other projects.

I suppose it is implicit in the description and the variation in
Question 7 that the answers are for one run of the workflow only.
This could be made explicit in the challenge description.  The method
by which you would distinguish the workflow run of interest from
others is certainly interesting.  For Southampton, it is part of the
query mechanism and so directly relevant for answering the queries in
the challenges, but I agree it might not be as relevant for other
teams for this challenge.  I suggest we leave it to be documented by
the teams if they think it relevant to their challenge results.

> The second thing (as I read the challenge quickly, I might have missed
> it:)), are there any requirements as to which projects we should choose to
> pair up and how many we should choose?

No, we have placed no requirements on whose data to try and translate
/ query over - as many other teams as possible!

> Maybe we can also share the parsers if that would help?

I agree that this is important.  We have requested on the page that
"...a reference [be] given to a free parser for that format" and that
"we strongly encourage (but do not require) teams to export their data
in XML".  Hopefully this is enough to make the parsing of each other's
data as straightforward as possible.

Thanks,
Simon

> cheers,
>
> Jun


Re: [provenance-challenge] Second provenance challenge

From: dholland AT eecs.harvard.edu (David Holland)
Date: Mon, 4 Dec 2006 20:42:50 -0500 (EST)



 > http://twiki.ipaw.info/bin/view/Challenge/SecondProvenanceChallenge

So it says

 : [T]he queries and their expected results were weakly specified, and
 : so interpreted differently by different groups.

but there's no additional clarification of the queries.

I think at least some of this should be done beforehand; we've all run
the queries and probably in the process noticed things that were
underspecified, and it would make the downstream comparison of results
easier if all questions that have already arisen can be resolved in
advance.

Some points that come to mind:

   - In Q2, what exactly is "the averaging of images with softmean"?
     From a dataflow perspective, is the cutoff point supposed to be
     the softmean process executions themselves, or the files used as
     input to softmean? Or, similarly from an events perspective, is
     the cutoff point where softmean has read the inputs and is doing
     the computation before generating the outputs, or the point at
     which softmean first begins execution, or what?

   - In Q4, does "all invocations" mean "all invocations related to
     this workflow" or "all invocations that might have ever
     happened"? Same for Q6.

   - In Q7, is the variant workload supposed to start from the *same*
     input files, or new copies of the same input data? Should the
     variant workload clobber the intermediate and output files from
     the original, or should it be run such that both can exist
     simultaneously (e.g., in a different directory)?

   - For Q8 and Q9, we should all agree on a set of annotations to
     perform on the various available files, and also when they should
     be added relative to the workflow execution, so we all get
     vaguely comparable results searching for them.

-- 
   - David A. Holland / dholland AT eecs.harvard.edu


Re: [provenance-challenge] Second provenance challenge

From: "Simon Miles" <sm AT ecs.soton.ac.uk>
Date: Tue, 5 Dec 2006 17:30:19 +0000



Hello David,

Thanks, these are good points.  I have tried to fix the ambiguities
you indicate (details below).
http://twiki.ipaw.info/bin/view/Challenge/SecondProvenanceChallenge

First, as well as clarifications to the queries, it is apparent from
your and Jun's mails that we need to be more explicit about what
actually occurs before the provenance data is exported in the
challenge.  In particular, some queries only make sense if the
workflow has been run more than once and we need to be able to
identify annotations within the exported data.  I've added the
following text to make this explicit.

"Specifically, the exported data should contain:
   * Documentation of the three parts of one run of the workflow as
shown in the Workflow Parts section below.
   * Documentation of the three parts of one run of the workflow in
the adaptation specified by Provenance Query 7, i.e. replacing the
single convert procedure with two procedures, pgmtoppm then pnmtojpeg,
in workflow Part 3.
   * The following annotations:
      * Anatomy Image 1, as used in the first workflow run, is
annotated with key-value pair center=UChicago.
      * Anatomy Image 2, as used in the first workflow run, is
annotated with key-value pairs center=southampton and
studyModality=speech.

If the output of a team differs from that given above, including
omissions of one or other piece of data, please make it clear in your
data output."
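
As an illustration only (not part of the challenge text), the
annotations above could be held as key-value pairs and searched as in
the sketch below.  The dictionary layout and the function name
find_annotated are assumptions made for the example; each team's
export format will differ.

```python
# Hypothetical in-memory form of the annotations listed above:
# item name -> {annotation key: annotation value}.
annotations = {
    "Anatomy Image 1": {"center": "UChicago"},
    "Anatomy Image 2": {"center": "southampton", "studyModality": "speech"},
}


def find_annotated(annotations, key, value):
    """Return the names of items that carry the given key-value pair."""
    return sorted(
        name for name, pairs in annotations.items() if pairs.get(key) == value
    )


uchicago_items = find_annotated(annotations, "center", "UChicago")
speech_items = find_annotated(annotations, "studyModality", "speech")
```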

For the queries, I have added the clarifications below.

>    - In Q2, what exactly is "the averaging of images with softmean"?
>      From a dataflow perspective, is the cutoff point supposed to be
>      the softmean process executions themselves, or the files used as
>      input to softmean? Or, similarly from an events perspective, is
>      the cutoff point where softmean has read the inputs and is doing
>      the computation before generating the outputs, or the point at
>      which softmean first begins execution, or what?

We will rephrase this as:

2. Find the process that led to Atlas X Graphic, excluding everything
prior to softmean outputting the Atlas Image, i.e. the inputs,
processing and outputs of align_warp and reslice, and the inputs and
processing of softmean will be excluded.
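
One possible reading of this cutoff, as a sketch only: assume the
provenance is available as a hypothetical mapping from each item to
the items it was directly derived from (the node names and the
single-chain graph below are simplified assumptions).  The trace walks
backwards from Atlas X Graphic and stops expanding once it reaches the
Atlas Image, so softmean and everything before it are left out.

```python
def trace_back(start, derived_from, stop_at):
    """Walk backwards from `start`, keeping `stop_at` but nothing before it."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        if node == stop_at:
            continue  # cutoff: do not follow what led to this item
        stack.extend(derived_from.get(node, ()))
    return seen


# Hypothetical, simplified dependency chain for one atlas graphic.
derived_from = {
    "Atlas X Graphic": ["convert"],
    "convert": ["Atlas X Slice"],
    "Atlas X Slice": ["slicer"],
    "slicer": ["Atlas Image"],
    "Atlas Image": ["softmean"],
    "softmean": ["Resliced Image 1"],
}

q2_result = trace_back("Atlas X Graphic", derived_from, stop_at="Atlas Image")
```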

>    - In Q4, does "all invocations" mean "all invocations related to
>      this workflow" or "all invocations that might have ever
>      happened"? Same for Q6.

We will rephrase these as:

4. Find all invocations of procedure align_warp that have ever
occurred in the system using a twelfth order nonlinear 1365 parameter
model (see model menu describing possible values of parameter "-m 12"
of align_warp) that ran on a Monday.
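
To make the intended scope concrete, here is a minimal sketch of
Query 4 over every recorded invocation, not just one workflow run.
The Invocation record, its field names and the sample data are
assumptions for illustration; they are not any team's actual format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Invocation:
    procedure: str      # e.g. "align_warp"
    args: dict          # command-line arguments, e.g. {"-m": "12"}
    started: datetime   # when the invocation ran


def query4(invocations):
    """All align_warp invocations with -m 12 that ran on a Monday."""
    return [
        inv for inv in invocations
        if inv.procedure == "align_warp"
        and inv.args.get("-m") == "12"
        and inv.started.weekday() == 0  # weekday() returns 0 for Monday
    ]


# Hypothetical records: 4 Dec 2006 was a Monday, 5 Dec 2006 a Tuesday.
invocations = [
    Invocation("align_warp", {"-m": "12"}, datetime(2006, 12, 4, 9, 0)),
    Invocation("align_warp", {"-m": "12"}, datetime(2006, 12, 5, 9, 0)),
    Invocation("align_warp", {"-m": "3"}, datetime(2006, 12, 4, 10, 0)),
    Invocation("reslice", {}, datetime(2006, 12, 4, 11, 0)),
]

matches = query4(invocations)
```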

6. Find all images ever output from softmean where the warped images
taken as input were align_warped using a twelfth order nonlinear 1365
parameter model, i.e. "where softmean was preceded in the workflow,
directly or indirectly, by an align_warp procedure with argument -m
12."
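
The "directly or indirectly" part of Query 6 amounts to an ancestor
search.  The sketch below assumes a hypothetical dependency mapping
(item or invocation -> what it directly depends on) and a hypothetical
table of invocations; both are invented for the example.

```python
def ancestors(node, derived_from):
    """Everything that `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(derived_from.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(derived_from.get(current, ()))
    return seen


def query6(derived_from, invocations):
    """Images output by softmean with an align_warp -m 12 among their ancestors."""
    results = []
    for image, parents in derived_from.items():
        if not any(invocations.get(p, ("", {}))[0] == "softmean" for p in parents):
            continue  # not directly produced by softmean
        for a in ancestors(image, derived_from):
            procedure, args = invocations.get(a, ("", {}))
            if procedure == "align_warp" and args.get("-m") == "12":
                results.append(image)
                break
    return sorted(results)


# Hypothetical data: invocation id -> (procedure, arguments).
invocations = {
    "aw1": ("align_warp", {"-m": "12"}), "rs1": ("reslice", {}), "sm1": ("softmean", {}),
    "aw2": ("align_warp", {"-m": "3"}), "rs2": ("reslice", {}), "sm2": ("softmean", {}),
}
derived_from = {
    "aw1": ["anatomy1"], "warp1": ["aw1"], "rs1": ["warp1"],
    "resliced1": ["rs1"], "sm1": ["resliced1"], "atlas_a": ["sm1"],
    "aw2": ["anatomy2"], "warp2": ["aw2"], "rs2": ["warp2"],
    "resliced2": ["rs2"], "sm2": ["resliced2"], "atlas_b": ["sm2"],
}

q6_result = query6(derived_from, invocations)
```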

>    - In Q7, is the variant workload supposed to start from the *same*
>      input files, or new copies of the same input data? Should the
>      variant workload clobber the intermediate and output files from
>      the original, or should it be run such that both can exist
>      simultaneously (e.g., in a different directory)?

The query is rephrased below to answer your first point.  The second
point above seems to assume too much about the operation of the
system: overwriting of data only makes sense in some systems (some
systems may pass data by value and never store it in a file or by
other means).  I feel that stating it would be too restrictive - if it
is important to understanding your provenance data, please make it
clear with the exported data.

7. A user has run the workflow twice on the same input files, in the
second instance replacing each convert procedure in the final stage
with two procedures: pgmtoppm, then pnmtojpeg. Find the differences
between the two workflow runs. The exact level of detail in the
difference that is detected by a system is up to each participant.
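
At the coarsest level of detail, the difference could be just the
procedures that appear in one run but not the other.  The sketch
below assumes each run is exported as a plain list of procedure
names; the lists are illustrative only, and a real system may well
compare much more (inputs, outputs, timings).

```python
from collections import Counter

# Hypothetical procedure lists for the final stage of the two runs.
first_run = ["slicer", "slicer", "slicer", "convert", "convert", "convert"]
second_run = [
    "slicer", "slicer", "slicer",
    "pgmtoppm", "pnmtojpeg",
    "pgmtoppm", "pnmtojpeg",
    "pgmtoppm", "pnmtojpeg",
]


def run_difference(run_a, run_b):
    """Count procedures that occur more often in one run than in the other."""
    a, b = Counter(run_a), Counter(run_b)
    # Counter subtraction keeps only positive counts.
    return dict(a - b), dict(b - a)


only_in_first, only_in_second = run_difference(first_run, second_run)
```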

>    - For Q8 and Q9, we should all agree on a set of annotations to
>      perform on the various available files, and also when they should
>      be added relative to the workflow execution, so we all get
>      vaguely comparable results searching for them.

OK, hopefully this point is addressed by the specification of
annotations at the start of the email.

Thanks,
Simon


Re: [provenance-challenge] Second provenance challenge

From: jmgomez AT isoco.com
Date: Wed, 6 Dec 2006 00:11:18 +0000



Hi Simon,

As I said, our provenance system is still far from our goals but we still want
to participate. We will provide you with the required materials (traces, etc.)
asap.

Thanks,
Jose

PS: thanks for the logs!
