provenance_challenge_ipaw_info messages

[provenance-challenge] Errors in pchal data

From: Joe Futrelle <futrelle AT ncsa.uiuc.edu>
Date: Tue, 5 Jun 2007 12:43:01 -0500


Threading:      • This Message
             Re: [provenance-challenge] Errors in pchal data from dholland AT eecs.harvard.edu
             Re: [provenance-challenge] Errors in pchal data from pg03r AT ecs.soton.ac.uk


In the process of integrating data from various teams involved in the  
second provenance challenge I have found a small number of what I  
think are errors, in other words isolated discrepancies that, when  
corrected in an obvious way, enable a consistent interpretation of  
the data.

I can describe these in detail to whoever is interested, but I wanted  
to get a sense from the group of whether it would be better to bring  
these up here and now, to give the teams the opportunity to correct  
the (apparent) errors before the challenge, or whether it would be  
more instructive to allow other participants to discover (or not  
discover) the discrepancies themselves. I plan to contact individual  
teams anyway to get their account of the discrepancies and their  
opinion about how to correct them.

--
Joe Futrelle
Digital Library Technologies, NCSA
http://www.ncsa.uiuc.edu/People/futrelle





Re: [provenance-challenge] Errors in pchal data

From: dholland AT eecs.harvard.edu (David Holland)
Date: Tue, 5 Jun 2007 16:24:54 -0400 (EDT)


Threading: [provenance-challenge] Errors in pchal data from futrelle AT ncsa.uiuc.edu
      • This Message

 > In the process of integrating data from various teams involved in the  
 > second provenance challenge I have found a small number of what I  
 > think are errors, in other words isolated discrepancies that, when  
 > corrected in an obvious way, enable a consistent interpretation of  
 > the data.

We (PASS) have a related problem in that over the past couple months
I've found quite a number of bugs in the toolset that was used to
produce our posted data, and I haven't been sure whether (or when) to
post updates. We eventually decided to let it go until someone asked...

 > I can describe these in detail to whoever is interested, but I wanted  
 > to get a sense from the group of whether it would be better to bring  
 > these up here and now, to give the teams the opportunity to correct  
 > the (apparent) errors before the challenge, or whether it would be  
 > more instructive to allow other participants to discover (or not  
 > discover) the discrepancies themselves. I plan to contact individual  
 > teams anyway to get their account of the discrepancies and their  
 > opinion about how to correct them.

I think it would be best to disclose everything that comes up, for a
number of reasons:

 - Bugs in the data presumably reflect bugs in the system that
generated it, which may be otherwise undetected. Getting these fixed
benefits everyone.

 - I don't think any of us are really familiar enough with anyone
else's data model to be really sure about what's a real bug and what's
an odd property of the data encoding. So if something's funny it's
better if standard working procedure is to ask.

 - Interoperability is probably hard enough without trying to debug
nonsensical query results caused by translating data bugs into a
different worldview.

 - Different groups probably have different degrees (and types) of
consistency checking, or tools for same. Also, some mistakes may not
be encodable in some data models. Explaining what led to the discovery
of a problem in someone else's data will probably be interesting for
everyone.

 - We can still try to discover the problems ourselves; obviously we
can't stumble on them by accident, but we can try the known-bad
dataset and see if anything in our system or procedures turns up a
problem.

 - It may also be interesting, from a data model perspective, if some
problems disappear in the course of certain translations, or if
they're exacerbated to the point where they become unavoidably
noticeable. I think to really get a good picture of this, particularly
where the problems disappear in translation, we'll need to have the
problems disclosed before the workshop, in order to have time to
figure out exactly what's going on.

Note that there are quite a number of reasons I might use a broken
dataset and not notice; for example, I might simply be unobservant,
which might be embarrassing in some cases but is not enlightening to
anyone else; or I might have thought that the bug was simply a
property of the foreign data model and had my converter translate it
to something that makes sense in my world; or the problem might
disappear in translation; or I might have managed to introduce a
compensating or partially-compensating bug into my translator.  Or
other things.

If disclosing these bugs prompts someone to spend the next two weeks
writing a consistency checker for their system, so they can claim
their system catches 100% of the known problems, I don't see that as a
bad thing.

 - If it turns out that we have datasets that are broken in
interesting ways, for whatever that might mean, we can use them as a
test case library for any subsequent work in consistency/integrity
checking.

 - I suspect that some data bugs may be interesting to some groups for
semantic reasons, and they might not find out unless we disclose
things fairly widely.

 - It may also turn out that several groups have invented the same
underlying bug in their systems; while this kind of revelation tends
to be depressing, it will probably be very interesting from an
interoperability standpoint.


Some bugs may be too embarrassing to disclose fully :-) but in general
we're all experienced enough with software here to know that even the
best hackers write the most amazingly stupid code from time to time,
and that sometimes such things can hang around undetected for years.
And we aren't competing on bug counts anyway; some systems (like ours,
since we're working on a second generation) are immature from a
software development perspective, and haven't been fully tested or
debugged.

-- 
   - David A. Holland / dholland AT eecs.harvard.edu


Re: [provenance-challenge] Errors in pchal data

From: Paul Groth <pg03r AT ecs.soton.ac.uk>
Date: Tue, 5 Jun 2007 23:20:22 +0100


Threading: [provenance-challenge] Errors in pchal data from futrelle AT ncsa.uiuc.edu
      • This Message

Hi,

We might as well have those discussions here as that's what the  
mailing list is for. In addition, it's probably a good build up to  
the meeting in California :-)

Paul

-------------------------------------------------
Paul Groth (pg03r AT ecs.soton.ac.uk)
http://www.ecs.soton.ac.uk/~pg03r

On Jun 5, 2007, at 18:43, Joe Futrelle wrote:

> In the process of integrating data from various teams involved in  
> the second provenance challenge I have found a small number of what  
> I think are errors, in other words isolated discrepancies that,  
> when corrected in an obvious way, enable a consistent  
> interpretation of the data.
>
> I can describe these in detail to whoever is interested, but I  
> wanted to get a sense from the group of whether it would be better  
> to bring these up here and now, to give the teams the opportunity  
> to correct the (apparent) errors before the challenge, or whether  
> it would be more instructive to allow other participants to  
> discover (or not discover) the discrepancies themselves. I plan to  
> contact individual teams anyway to get their account of the  
> discrepancies and their opinion about how to correct them.
>
> --
> Joe Futrelle
> Digital Library Technologies, NCSA
> http://www.ncsa.uiuc.edu/People/futrelle
>
>

