From the front lines of data licensing at a research institution!

26 Sep 2017

Dear All,
Wishing you all a smooth re-entry from last week's plenary in Montreal. I miss the lobster poutine!
The below exchange about data licensing (with requestor anonymized) is a pretty good reflection of the legal confusion and uncertainty facing researchers in today's online data ecosystem.
Maybe another resource we could work on is an FAQ covering actual legal interop questions asked and answered?
Cheers,
Gail
Gail P. Clement  | Head of Research Services  | Caltech Library  | Mail Code 1-43  | Pasadena CA 91125-4300  | 626-395-1203
http://orcid.org/0000-0001-5494-4806 | library.caltech.edu
-----Original Message-----
From: Peretsman-Clement, Gail
Sent: Tuesday, September 26, 2017 12:29 PM
To: (retracted)
Subject: RE: Redistributing old datasets?
Hi (retracted),
Thanks for writing to the Caltech Library with your dataset licensing questions. This is a complex question because of the large number and diversity of datasets included in the release; their differing provenance and the heterogeneity of metadata associated with each dataset; and the distributor's own statement as follows:
"I made a good faith effort to determine the license under which the actual data (i.e. rows/columns of numbers) were distributed, but I was unable to find a definitive answer. My understanding is that these datasets are free to re-distribute. However, if you own the rights to data that are included here and you object to their inclusion in Rdatasets, send me an email at ***@***.***. I will promptly remove the data in question and will make sure that all traces are erased from the git revision history."
Without making a big investment in analyzing the rights issues with this release, I can share some initial thoughts:
- some of the datasets included in the release may be in the public domain (no copyright) because each datum itself is not eligible for copyright protection and the compilation of "datums" into a dataset does not have sufficient originality to be eligible for copyright protection
- some of the datasets may be copyrighted, and the redistributor is either taking a risk of infringement by sharing without permission (which is what a license would provide) OR is asserting Fair Use.
If Fair Use is being asserted, it must be done under certain conditions laid out in the US copyright statute (see https://www.copyright.gov/fair-use/more-info.html). Fair Use covers a particular use of a copyrighted resource and applies case by case, not in batch mode. A data provider cannot release data under Fair Use, but a user may reuse data under Fair Use.
Does that help? Please feel free to ping back if I can be of further assistance!
Best wishes,
Gail
Gail P. Clement  | Head of Research Services  | Caltech Library  | Mail Code 1-43  | Pasadena CA 91125-4300  | 626-395-1203
http://orcid.org/0000-0001-5494-4806 | library.caltech.edu
-----Original Message-----
From: (retracted)
Sent: Tuesday, September 26, 2017 11:40 AM
To: ***@***.***
Subject: Redistributing old datasets?
Hi,
I believe you work on dataset licensing issues?
I was updating a Debian package and discovered that statsmodels is using some old scientific datasets.
They've been redistributed by the R community for years, but I can't see any approval for this redistribution.
One example: there's a dataset of 45 rows and 4 columns, "Data on the prestige and other characteristics of 45 U. S. occupations in 1950."
Duncan, O. D. (1961)
A socioeconomic index for all occupations.
In Reiss, A. J., Jr. (Ed.)
Occupations and Social Status. Free Press [Table VI-1].
The site where these datasets are curated:
https://vincentarelbundock.github.io/Rdatasets/
Do you know if these small datasets are covered by fair use or some other blanket approval?
(Retracted)