CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000...

Advanced Overview of Version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting. Michael L. Nelson (Old Dominion University, Norfolk VA, mln@cs.odu.edu), Herbert Van de Sompel (Los Alamos National Laboratory, Los Alamos NM, herbertv@lanl.gov), Simeon Warner (Cornell University, Ithaca NY). ACM/IEEE Joint Conference on Digital Libraries, Tutorial 5, Portland, Oregon, 14:00 - 17:30, July 14 2002


Page 1:

Advanced Overview of Version 2.0 of the
Open Archives Initiative Protocol for Metadata Harvesting

Michael L. Nelson
Old Dominion University, Norfolk VA
mln@cs.odu.edu

Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos NM
herbertv@lanl.gov

Simeon Warner
Cornell University, Ithaca NY

ACM/IEEE Joint Conference on Digital Libraries
Tutorial 5, Portland, Oregon
14:00 - 17:30, July 14 2002

Page 2:

Scope and Focus

• This tutorial is not…
  – an introduction to OAI-PMH
  – a listing of all the wonderful projects that use OAI-PMH
  – a discussion of the merits of metadata harvesting vs. distributed searching
• A passing familiarity is assumed for:
  – OAI-PMH 1.x
  – Dublin Core

Page 3:

Outline

• How 2.0 evolved from SFC and 1.x
  – people, processes, events
• What’s new in 2.0
  – comparison with 1.x
• Guidelines, recommendations, best practices for 2.0 implementations
  – harvesters, repositories, aggregators, optional containers
• Novel applications of OAI-PMH

Page 4:

Spoiler: What’s New in 2.0?!

• Good news: OAI-PMH is still Six Verbs + DC
• Incremental improvements
  – single XML schema
  – ambiguities removed
  – more expressive options
  – cleaner separation of roles & responsibilities
• Bad news: not backwards compatible with 1.1

Page 5:

from Santa Fe Convention to OAI-PMH v.2.0

Page 6:

             Santa Fe convention    OAI-PMH v.1.0/1.1          OAI-PMH v.2.0
about        eprints                document like objects      resources
metadata     OAMS                   unqualified Dublin Core    unqualified Dublin Core
transport    HTTP                   HTTP                       HTTP
responses    XML                    XML                        XML
requests     HTTP GET/POST          HTTP GET/POST              HTTP GET/POST
verbs        Dienst                 OAI-PMH                    OAI-PMH
nature       experimental           experimental               stable
model        metadata harvesting    metadata harvesting        metadata harvesting

Page 7:

Santa Fe Convention [02/2000]

• goal: optimize discovery of e-prints
• input:
  • the UPS prototype
  • RePEc / SODA “data provider / service provider model”
  • Dienst protocol
  • deliberations at Santa Fe meeting [10/99]

Page 8:

OAI-PMH v.1.0 [01/2001]

• goal: optimize discovery of document-like objects
• input:
  • SFC
  • DLF meetings on metadata harvesting
  • deliberations at Cornell meeting [09/00]
  • alpha test group of OAI-PMH v.1.0

Page 9:

OAI-PMH v.1.0 [01/2001]

• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• focus on document-like objects
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• experimental: 12-18 months

Page 10:

Selected Pre-2.0 OAI Highlights

• October 21-22, 1999 - initial UPS meeting
• February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
  – precursor to the OAI metadata harvesting protocol
• June 3, 2000 - workshop at ACM DL 2000 (Texas)
• August 25, 2000 - OAI steering committee formed, DLF/CNI support
• September 7-8, 2000 - technical meeting at Cornell University
  – defined the core of the current OAI metadata harvesting protocol
• September 21, 2000 - workshop at ECDL 2000 (Portugal)
• November 1, 2000 - Alpha test group announced (~15 organizations)
• January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC)
  – purpose: freeze protocol for 12-16 months, generate critical mass
• February 26, 2001 - OAI Open Day in Europe (Berlin)
• July 3, 2001 - OAI protocol 1.1 announced
  – to reflect changes in the W3C’s latest XML Schema recommendation
• September 8, 2001 - workshop at ECDL 2001 (Darmstadt)

Page 11:

OAI-PMH v.2.0 [06/2002]

• goal: recurrent exchange of metadata about resources between systems
• input:
  • OAI-PMH v.1.0
  • feedback on OAI-implementers
  • deliberations by OAI-tech [09/01 - 06/02]
  • alpha test group of OAI-PMH v.2.0 [03/02 - 06/02]
• officially released June 14, 2002

Page 12:

OAI-PMH v.2.0 [06/2002]

• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• metadata about resources
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• stable

Page 13:

releasing OAI-PMH v.2.0

Page 14:

⇒ creation of OAI-tech
⇒ pre-alpha phase
⇒ alpha-phase
⇒ beta-phase

Page 15:

creation of OAI-tech [06/01]

• created for 1 year period
• charge:
  • review functionality and nature of OAI-PMH v.1.0
  • investigate extensions
  • release stable version of OAI-PMH by 05/02
  • determine need for infrastructure to support broad adoption of the protocol
• communication: listserv, SourceForge, conference calls

Page 16:

OAI-tech

US representatives
Thomas Krichel (Long Island U) - Jeff Young (OCLC) - Tim Cole (U of Illinois at Urbana Champaign) - Hussein Suleman (Virginia Tech) - Simeon Warner (Cornell U) - Michael Nelson (NASA) - Caroline Arms (LoC) - Mohammad Zubair (Old Dominion U) - Steven Bird (U Penn.)

European representatives
Andy Powell (Bath U. & UKOLN) - Mogens Sandfaer (DTV) - Thomas Baron (CERN) - Les Carr (U of Southampton)

Page 17:

pre-alpha phase [09/01 – 02/02]

• review process by OAI-tech:
  • identification of issues
  • conference call to filter/combine issues
  • white paper per issue
  • on-line discussion per white paper
  • proposal for resolution of issue by OAI-exec
  • discussion of proposal & closure of issue
  • conference call to resolve open issues

Page 18:

pre-alpha phase [02/02]

• creation of revised protocol document
  • in-person meeting Lagoze - Van de Sompel - Nelson - Warner
  • autonomous decisions
• internal vetting of protocol document

Page 19:

alpha phase [02/02 – 05/02]

• alpha-1 release to OAI-tech March 1st 2002
• OAI-tech extended with alpha testers
• discussions/implementations by OAI-tech
• ongoing revision of protocol document

Page 20:

OAI-PMH 2.0 alpha testers (1/2)

• The British Library
• Cornell U. -- NSDL project & e-print arXiv
• Ex Libris
• FS Consulting Inc -- harvester for my.OAI
• Humboldt-Universität zu Berlin
• InQuirion Pty Ltd, RMIT University
• Library of Congress
• NASA
• OCLC

Page 21:

OAI-PMH 2.0 alpha testers (2/2)

• Old Dominion U. -- ARC, DP9
• U. of Illinois at Urbana-Champaign
• U. of Southampton -- OAIA (now Celestial), CiteBase, eprints.org
• UCLA, Johns Hopkins U., Indiana U., NYU -- sheet music collection
• UKOLN, U. of Bath -- RDN
• Virginia Tech -- repository explorer

Page 22:

beta phase [05/02-06/02]

• beta release on May 1st 2002 to:
  • registered data providers and service providers
  • interested parties
• fine tuning of protocol document
• preparation for the release of 2.0 conformant tools by alpha testers

Page 23:

what’s new in OAI-PMH v.2.0

Page 24:

⇒ quick recap
⇒ general changes to improve solidity of protocol
⇒ corrections
⇒ new functionality

Page 25:

Overview of OAI-PMH Verbs

Verb                   Function                                  Group
Identify               description of archive                    metadata about the repository
ListMetadataFormats    metadata formats supported by archive     metadata about the repository
ListSets               sets defined by archive                   metadata about the repository
ListIdentifiers        OAI unique ids contained in archive       harvesting verbs
ListRecords            listing of N records                      harvesting verbs
GetRecord              listing of a single record                harvesting verbs

most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)

Page 26:

general changes

Page 27:

protocol vs periphery

• clear distinction between protocol and periphery
• fixed protocol document
• extensible implementation guidelines:
  • e.g. sample metadata formats, description containers, about containers
• allows for OAI guidelines and community guidelines

Page 28:

OAI-PMH vs HTTP

• clear separation of OAI-PMH and HTTP
• OAI-PMH error handling
  • all OK at HTTP level? => 200 OK
  • something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
• HTTP codes 302, 503, etc. still available to implementers, but no longer represent OAI-PMH events

Page 29:

resource – item – record

[Diagram: resource → item (all available metadata about David) → records (Dublin Core metadata, MARC metadata, SPECTRUM metadata)]

• item = identifier
• record = identifier + metadata format + datestamp
• set-membership is item-level property

Page 30:

other general changes

• better definitions of harvester, repository, item, unique identifier, record, set, selective harvesting
• oai_dc schema builds on DCMI XML Schema for unqualified Dublin Core
• usage of must, must not etc. as in RFC2119
• wording on response compression

Page 31:

other general changes

• all protocol responses can be validated with a single XML Schema
  • easier for data providers
  • no redundancy in type definitions
  • SOAP-ready
  • clean for error handling

Page 32:

response no errors

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord"… …>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        …..
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>

note no http encoding of the OAI-PMH request

Page 33:

response with error

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>

with errors, only the correct attributes are echoed in <request>
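
A harvester therefore cannot rely on the HTTP status alone: a 200 OK response may still carry an OAI-PMH error. A minimal Python sketch of that check follows. It assumes the response declares the OAI-PMH 2.0 namespace (the slide omits namespace declarations); the names oai_errors and SAMPLE are illustrative and not part of the protocol.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses (assumed to be declared on the root element).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical response body modelled on the slide's badVerb example.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""


def oai_errors(xml_text):
    """Return (code, message) pairs for every <error> child of <OAI-PMH>."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        (element.get("code"), (element.text or "").strip())
        for element in root.findall(OAI_NS + "error")
    ]


if __name__ == "__main__":
    for code, message in oai_errors(SAMPLE):
        print(code, "-", message)
```

An empty list means the response carried no protocol-level error and the verb-specific payload can be processed.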

Page 34:

corrections

Page 35:

dates/times

• all dates/times are UTC, encoded in ISO8601, Z-notation

1957-03-20T20:30:00Z
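
As an illustration, a repository written in Python might normalise its internal timestamps as sketched below. The function name to_oai_datestamp is hypothetical; the two granularity strings are the ones the protocol uses in Identify.

```python
from datetime import datetime, timedelta, timezone


def to_oai_datestamp(moment, granularity="YYYY-MM-DDThh:mm:ssZ"):
    """Format a timezone-aware datetime as a UTC datestamp in Z-notation."""
    if moment.tzinfo is None:
        # A naive datetime carries no offset, so converting it would be a guess.
        raise ValueError("naive datetime: time zone is ambiguous")
    utc = moment.astimezone(timezone.utc)
    if granularity == "YYYY-MM-DD":
        return utc.strftime("%Y-%m-%d")
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # 15:30 at UTC-5 is 20:30 UTC, matching the slide's example value.
    local = datetime(1957, 3, 20, 15, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
    print(to_oai_datestamp(local))
```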

Page 36:

resumptionToken

• idempotency of resumptionToken: return same incomplete list when rT is reissued
  • while no changes occur in the repo: strict
  • while changes occur in the repo: all items with unchanged datestamp
• new, optional attributes for the resumptionToken:
  • expirationDate
  • completeListSize
  • cursor

Page 37:

noRecordsMatch

• 1.x - if no records match, an empty list was returned

Page 38:

noRecordsMatch

• 2.0 - if no records match, the error condition noRecordsMatch is returned -- not an empty list

Page 39:

new functionality

Page 40:

harvesting granularity

• harvesting granularity
  • mandatory support of YYYY-MM-DD
  • optional support of YYYY-MM-DDThh:mm:ssZ
  • other granularities considered, but ultimately rejected
• granularity of from and until must be the same
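
A repository can enforce these rules before running a query. The sketch below is one possible check in Python; mapping every violation to badArgument is an assumption based on the error lists later in this tutorial, and the names granularity_of and check_range are hypothetical.

```python
import re

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"

_PATTERNS = {
    DAY: re.compile(r"\d{4}-\d{2}-\d{2}"),
    SECONDS: re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"),
}


def granularity_of(stamp):
    """Return the granularity a datestamp is written in, or None if unrecognised."""
    for name, pattern in _PATTERNS.items():
        if pattern.fullmatch(stamp):
            return name
    return None


def check_range(from_=None, until=None, finest=DAY):
    """Return None when the range is acceptable, otherwise 'badArgument'."""
    found = [granularity_of(s) for s in (from_, until) if s is not None]
    if any(g is None for g in found):
        return "badArgument"  # not a recognised datestamp format
    if finest == DAY and SECONDS in found:
        return "badArgument"  # finer than the repository supports
    if len(set(found)) > 1:
        return "badArgument"  # from and until differ in granularity
    return None
```

The check is purely syntactic: it does not verify that the values are real calendar dates or that from precedes until.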

Page 41:

Identify

• Identify more expressive

<Identify>
  <repositoryName>Library of Congress 1</repositoryName>
  <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
  <protocolVersion>2.0</protocolVersion>
  <adminEmail>[email protected]</adminEmail>
  <adminEmail>[email protected]</adminEmail>
  <deletedRecord>transient</deletedRecord>
  <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
  <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  <compression>deflate</compression>

Page 42:

header

• header contains set membership of item

<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    …..
  </metadata>
</record>

eliminates the need for the “double harvest” 1.x required to get all records and all set information

Page 43:

ListIdentifiers

• ListIdentifiers returns headers

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="…" …>http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
    ……

Page 44:

ListIdentifiers

• ListIdentifiers mandates metadataPrefix as argument

http://www.perseus.tufts.edu/cgi-bin/pdataprov?
verb=ListIdentifiers
&metadataPrefix=olac
&from=2001-01-01
&until=2001-01-01
&set=Perseus:collection:PersInfo
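
A harvester would normally assemble such a request programmatically. The following Python sketch builds the same query with the standard library; the function name list_identifiers_url is hypothetical, and note that urlencode percent-encodes the colons in the set value.

```python
from urllib.parse import urlencode


def list_identifiers_url(base_url, metadata_prefix, from_=None, until=None, set_spec=None):
    """Build a ListIdentifiers request URL; metadataPrefix is required in 2.0."""
    params = [("verb", "ListIdentifiers"), ("metadataPrefix", metadata_prefix)]
    # Optional selective-harvesting arguments are added only when supplied.
    for name, value in (("from", from_), ("until", until), ("set", set_spec)):
        if value is not None:
            params.append((name, value))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(list_identifiers_url(
        "http://www.perseus.tufts.edu/cgi-bin/pdataprov",
        "olac",
        from_="2001-01-01",
        until="2001-01-01",
        set_spec="Perseus:collection:PersInfo",
    ))
```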

Page 45:

ListIdentifiers

• the changes to ListIdentifiers are subtle, and reflect a change in the OAI-PMH data model
• Could have been named “ListHeaders” or reduced to an option for ListRecords
  – “ListIdentifiers” kept for lexigraphical consistency

Page 46:

metadataPrefix

• character set for metadataPrefix and setSpec extended to URL-safe characters

A-Z a-z 0-9 _ ! ' $ ( ) + - . *
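
A validation routine for this rule could look like the Python sketch below. The character class is copied from the list on this slide, which is an assumption about the final rule; the released protocol document should be consulted before relying on it. The name is_valid_metadata_prefix is hypothetical.

```python
import re

# Character class taken from the slide's list (assumption: it matches the final spec).
_PREFIX = re.compile(r"[A-Za-z0-9_!'$()+\-.*]+")


def is_valid_metadata_prefix(prefix):
    """True when every character of a non-empty prefix is in the slide's list."""
    return _PREFIX.fullmatch(prefix) is not None
```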

Page 47:

in the periphery

Page 48:

provenance

• introduction of provenance container to facilitate tracing of harvesting history

<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
      <originDescription>
        … … …
      </originDescription>
    </originDescription>
  </provenance>
</about>

Page 49:

friends

• introduction of friends container to facilitate discovery of repositories

<description>
  <friends>
    <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
    <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
    <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
    <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
  </friends>
</description>

Page 50:

branding

• introduction of branding container for DPs to suggest rendering & association hints

<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
                        http://www.openarchives.org/OAI/2.0/branding.xsd">
  <collectionIcon>
    <url>http://my.site/icon.png</url>
    <link>http://my.site/homepage.html</link>
    <title>MySite(tm)</title>
    <width>88</width>
    <height>31</height>
  </collectionIcon>
  <metadataRendering
      metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"
      mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
  <metadataRendering
      metadataNamespace="http://another.place/MARC"
      mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>
</branding>

Page 51:

oai-identifier

• revision of oai-identifier

<description>
  <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
                          http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
    <scheme>oai</scheme>
    <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
    <delimiter>:</delimiter>
    <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
  </oai-identifier>
</description>

domain based repository names
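
Identifiers following this scheme can be taken apart with a simple split. The Python sketch below assumes the three-part layout scheme, repositoryIdentifier, local identifier shown in the sample, with ":" as delimiter; the function name split_oai_identifier is hypothetical.

```python
def split_oai_identifier(identifier, delimiter=":"):
    """Split 'oai:<repositoryIdentifier>:<local id>' into its three parts.

    Only the first two delimiters are significant, so the local part may
    itself contain the delimiter. Raises ValueError if a part is missing.
    """
    scheme, repository, local = identifier.split(delimiter, 2)
    return scheme, repository, local


if __name__ == "__main__":
    print(split_oai_identifier("oai:oai-stuff.foo.org:5324"))
```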

Page 52:

oai_dc

• OAI 1.x: oai_dc Schema defined by OAI
• OAI 2.0: oai_dc Schema imports from DCMI Schema for unqualified DC elements

Page 53:

MARC21

• OAI 1.x: oai_marc
• OAI 2.0: LoC marcxml, oai_marc
  – http://www.loc.gov/standards/marcxml/

Page 54:

did not make it into OAI-PMH v.2.0

Page 55:

• SOAP implementation
• Result set filtering
• Multiple / “best” metadata
• GetRecord -> GetRecords
• Machine readable rights management
• XML format for “mini-archives”

Page 56:

Detailed Review of the OAI-PMH 2.0 Verbs

Page 57:

Identify

1.1
• Arguments
  – none
• Errors
  – none

2.0
• Arguments
  – none
• Errors
  – badArgument

Page 58:

ListMetadataFormats

1.1
• Arguments
  – identifier (OPTIONAL)
• Errors
  – id does not exist

2.0
• Arguments
  – identifier (OPTIONAL)
• Errors
  – badArgument
  – noMetadataFormats
  – idDoesNotExist

Page 59:

ListSets

1.1
• Arguments
  – resumptionToken (EXCLUSIVE)
• Errors
  – no set hierarchy

2.0
• Arguments
  – resumptionToken (EXCLUSIVE)
• Errors
  – badArgument
  – badResumptionToken
  – noSetHierarchy

Page 60:

ListIdentifiers

1.1
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
• Errors
  – no records match

2.0
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
  – metadataPrefix (REQUIRED)
• Errors
  – badArgument
  – cannotDisseminateFormat
  – badResumptionToken
  – noSetHierarchy
  – noRecordsMatch

Page 61:

ListRecords

1.1
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
  – metadataPrefix (REQUIRED)
• Errors
  – no records match
  – metadata format cannot be disseminated

2.0
• Arguments
  – from (OPTIONAL)
  – until (OPTIONAL)
  – set (OPTIONAL)
  – resumptionToken (EXCLUSIVE)
  – metadataPrefix (REQUIRED)
• Errors
  – noRecordsMatch
  – cannotDisseminateFormat
  – badResumptionToken
  – noSetHierarchy
  – badArgument

Page 62:

GetRecord

1.1
• Arguments
  – identifier (REQUIRED)
  – metadataPrefix (REQUIRED)
• Errors
  – id does not exist
  – metadata format cannot be disseminated

2.0
• Arguments
  – identifier (REQUIRED)
  – metadataPrefix (REQUIRED)
• Errors
  – badArgument
  – cannotDisseminateFormat
  – idDoesNotExist

Page 63:

Argument Summary

                       metadataPrefix   from       until      set        resumptionToken   identifier
Identify               -                -          -          -          -                 -
ListMetadataFormats    -                -          -          -          -                 optional
ListSets               -                -          -          -          exclusive         -
ListIdentifiers        required         optional   optional   optional   exclusive         -
ListRecords            required         optional   optional   optional   exclusive         -
GetRecord              required         -          -          -          -                 required

("-" marks an argument the verb does not accept.)

Page 64:

Error Summary

Verb                   Errors
Identify               BA
ListMetadataFormats    BA, NMF, IDDNE
ListSets               BA, BRT, NSH
ListIdentifiers        BA, BRT, CDF, NRM, NSH
ListRecords            BA, BRT, CDF, NRM, NSH
GetRecord              BA, CDF, IDDNE

BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy

Generate badVerb on any input not matching the 6 defined verbs

this is an inversion of the table in section 3.6 of the OAI-PMH specification
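
The argument and error summaries lend themselves to a table-driven check in a repository's request handler. The Python sketch below encodes the 2.0 argument rules as data and returns badVerb or badArgument accordingly. The names RULES and check_request are hypothetical, and the sketch ignores repeated arguments and value validation.

```python
# Argument rules for each 2.0 verb: required, optional, or exclusive.
_LIST_RULES = {
    "metadataPrefix": "required",
    "from": "optional",
    "until": "optional",
    "set": "optional",
    "resumptionToken": "exclusive",
}

RULES = {
    "Identify": {},
    "ListMetadataFormats": {"identifier": "optional"},
    "ListSets": {"resumptionToken": "exclusive"},
    "ListIdentifiers": _LIST_RULES,
    "ListRecords": _LIST_RULES,
    "GetRecord": {"identifier": "required", "metadataPrefix": "required"},
}


def check_request(params):
    """Return None if the arguments are legal, else 'badVerb' or 'badArgument'."""
    verb = params.get("verb")
    if verb not in RULES:
        return "badVerb"  # anything other than the six defined verbs
    rules = RULES[verb]
    given = set(params) - {"verb"}
    if given - set(rules):
        return "badArgument"  # argument not accepted by this verb
    exclusive = {name for name, kind in rules.items() if kind == "exclusive"}
    if given & exclusive:
        # An exclusive argument must be the only argument besides the verb.
        return None if given <= exclusive else "badArgument"
    required = {name for name, kind in rules.items() if kind == "required"}
    if required - given:
        return "badArgument"  # a required argument is missing
    return None
```

Keeping the rules in a dictionary means the handler code does not change if a community profile restricts or extends the accepted arguments.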

Page 65:

Repository Implementation Guidelines

Page 66:

Minimal Repository

• 2.0 provides many expressive, but optional features
  – but still low barrier!
• if you are writing your own repository software, the quickest path to implementation can involve initially:
  – only supporting DC
  – skipping: <about>, sets, compression
  – skip flow control (resumptionTokens) if < 1000 items
• add optional features as requirements and familiarity allow

Page 67:

Be Honest with datestamp!

• a change in the process of dynamic generation of a metadata format really does mean all records have been updated!

if (internalItemDatestamp > disseminationInterfaceDatestamp) {
    datestamp = internalItemDatestamp
} else {
    datestamp = disseminationInterfaceDatestamp
}
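
The pseudocode amounts to taking the later of the two timestamps. A minimal Python equivalent, using the hypothetical function name record_datestamp and assuming both values are timezone-aware datetimes in UTC:

```python
from datetime import datetime, timezone


def record_datestamp(internal_item_datestamp, dissemination_interface_datestamp):
    """Return the later of the item's own change time and the time the
    code that generates this metadata format was last changed."""
    return max(internal_item_datestamp, dissemination_interface_datestamp)


if __name__ == "__main__":
    item_changed = datetime(2002, 3, 1, tzinfo=timezone.utc)
    crosswalk_changed = datetime(2002, 6, 14, tzinfo=timezone.utc)
    # The crosswalk changed after the item did, so its time wins.
    print(record_datestamp(item_changed, crosswalk_changed).isoformat())
```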

Page 68:

Not Hiding Updates

• OAI-PMH is designed to allow incremental harvesting
• Updates must be available by the end of the period of the datestamp assigned, i.e.
  – Day granularity => during same day
  – Seconds granularity => during same second
• Reason: harvesters need to overlap requests by just one datestamp interval (one day or one second)
  – in 1.x, 2 intervals were required (in many circumstances)
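
One reading of this rule, from the harvester's side, is that the next incremental request starts at the datestamp interval in which the previous harvest ran, so that interval is requested twice. The Python sketch below assumes the harvester stored the responseDate of its previous harvest in seconds granularity; the function name next_from is hypothetical.

```python
def next_from(previous_response_date, granularity):
    """Derive the 'from' argument for the next incremental harvest.

    previous_response_date is the responseDate of the last harvest, e.g.
    '2002-07-14T14:00:00Z'. Re-requesting the interval that contains it
    overlaps the two harvests by exactly one datestamp interval.
    """
    if granularity == "YYYY-MM-DD":
        return previous_response_date[:10]  # keep only the day part
    return previous_response_date


if __name__ == "__main__":
    print(next_from("2002-07-14T14:00:00Z", "YYYY-MM-DD"))
```

Records returned again because of the overlap are harmless: the harvester simply overwrites its copy when the identifier and datestamp match.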

Page 69:

State in resumptionTokens

• HTTP is stateless
• resumptionTokens allow state information to be passed back to the repository to create a complete list from sequence of incomplete lists
• EITHER – all state in resumptionToken
• OR – cache result set in repository

Page 70: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Caching the Result Set

• Repository caches results of initial request, returns only incomplete list
• resumptionToken does not contain all state information, it includes:
  – a session id
  – offset information, necessary for idempotency
• resumptionToken allows repository to return next incomplete list
• increased complexity due to cache management
  – but a potential performance win
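A minimal sketch of the cached approach, assuming an in-memory dictionary as the cache and a token of the form "sessionid:offset". The class name ResultSetCache, the token layout and the page size are all hypothetical; cache expiry and clean-up, which the slide names as the main source of complexity, are deliberately left out.

```python
import uuid


class ResultSetCache:
    """Keeps complete result sets server-side; tokens carry only id and offset."""

    def __init__(self, page_size=2):
        self.page_size = page_size
        self._sets = {}  # session id -> complete list of results

    def start(self, results):
        """Cache the complete list and return the first incomplete list."""
        session_id = uuid.uuid4().hex
        self._sets[session_id] = list(results)
        return self._page(session_id, 0)

    def resume(self, token):
        """Return the incomplete list a token points at (repeatable)."""
        session_id, _, offset = token.partition(":")
        if session_id not in self._sets or not offset.isdigit():
            raise KeyError("badResumptionToken")
        return self._page(session_id, int(offset))

    def _page(self, session_id, offset):
        items = self._sets[session_id]
        page = items[offset:offset + self.page_size]
        next_offset = offset + self.page_size
        # An empty token marks the final incomplete list.
        token = f"{session_id}:{next_offset}" if next_offset < len(items) else ""
        return page, token
```

Because the offset travels in the token, re-issuing the same token returns the same page, which is the idempotency property the slide refers to.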

Page 71: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

All State in the resumptionToken

• Arrange that remaining items/headers in complete list response can be specified with a new query and encode that in resumptionToken
• One simple approach is to return items/headers in id order and make the new query specify the same parameters and the last id returned (or by date)
  – simple to implement, but possibly longer execution times
• Can encode parameters very simply:

<resumptionToken>metadataPrefix=oai_dc&
  from=1999-02-03&until=2002-04-01&
  lastid=fghy:45:123</resumptionToken>
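The encoding shown above can be sketched with the standard library's query-string helpers. This is only one possible encoding, under the assumption that the state consists of metadataPrefix, from, until and the last id returned; the function names are hypothetical. Note that urlencode percent-escapes characters such as ":", so the token text differs slightly from the slide while decoding to the same values.

```python
from urllib.parse import parse_qsl, urlencode


def encode_token(metadata_prefix, last_id, from_=None, until=None):
    """Pack the whole query state into the resumptionToken text."""
    state = {
        "metadataPrefix": metadata_prefix,
        "from": from_,
        "until": until,
        "lastid": last_id,
    }
    return urlencode({k: v for k, v in state.items() if v is not None})


def decode_token(token):
    """Recover the query state; reject tokens that cannot be understood."""
    state = dict(parse_qsl(token, strict_parsing=True))
    if "metadataPrefix" not in state or "lastid" not in state:
        raise ValueError("badResumptionToken")
    return state
```

The repository would then re-run the original query restricted to identifiers after lastid, so no result set needs to be cached.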

Page 72: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

resumptionToken attributes (1)

• expirationDate – likely to be useful when cache clean-up schedule is known
  – Do not specify expirationDate if all state in resumptionToken
• badResumptionToken error to be used if resumptionToken expired
  – May also be used if request cannot be completed for some other reason
    • e.g.: if repository changes cause the incomplete list to have no records

Page 73: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

resumptionToken attributes (2)

• completeListSize and cursor optionally provide information about size of complete list and number of records so far disseminated
  – use consistently if used
  – designed for status monitoring
  – caveat harvester: completeListSize may be approximate and may be revised

Page 74: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

resumptionToken

The only defined use of resumptionToken is as follows:

• a repository must include a resumptionToken element as part of each response that includes an incomplete list;
• in order to retrieve the next portion of the complete list, the next request must use the value of that resumptionToken element as the value of the resumptionToken argument of the request;
• the response containing the incomplete list that completes the list must include an empty resumptionToken element.
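On the harvester side, the three rules above amount to a simple loop. The sketch below assumes a caller-supplied fetch function (hypothetical) that issues one request and returns the records of the incomplete list together with the resumptionToken text, using an empty string or None when the list is complete.

```python
def harvest_all(fetch, first_args):
    """Collect a complete list by following resumptionTokens.

    `fetch` takes a dict of request arguments and returns
    (records, token); an empty or missing token ends the list.
    """
    records, token = fetch(first_args)
    collected = list(records)
    while token:
        # resumptionToken is an exclusive argument: send only verb + token.
        records, token = fetch(
            {"verb": first_args["verb"], "resumptionToken": token}
        )
        collected.extend(records)
    return collected
```

Keeping the HTTP and XML handling inside fetch lets the same loop serve ListRecords, ListIdentifiers and ListSets.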

Page 75: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Flow Control & Load Balancing

How to respond to a “bad” harvester:

1. HTTP status code 200; response to OAI-PMH request with a resumptionToken.
2. HTTP status code 503; with the Retry-After header set to an appropriate value if subsequent request follows too quickly or if the server is heavily loaded.
3. HTTP status code 403; with an appropriate reason specified if subsequent requests do not adhere to Retry-After delays.
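The escalation above could be expressed as a small decision function on the repository side. This is a sketch only: the parameter names, the violation counter and the default of three tolerated violations are assumptions for illustration, not values from the protocol.

```python
def choose_status(seconds_since_last, min_interval, violations,
                  max_violations=3, overloaded=False):
    """Pick an HTTP status and headers for an incoming harvest request.

    violations counts how often this client has ignored Retry-After.
    """
    if violations >= max_violations:
        # Step 3: persistent offender, refuse outright.
        return 403, {}
    if overloaded or seconds_since_last < min_interval:
        # Step 2: ask the client to come back later (delay in seconds).
        return 503, {"Retry-After": str(int(min_interval))}
    # Step 1: normal OAI-PMH response, possibly carrying a resumptionToken.
    return 200, {}
```

A real server would also send a human-readable reason with the 403, as the slide recommends.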

Page 76: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

302 Load Balancing

• Interactive users on main DL machine should not be impacted by metadata harvesting
  – don’t take deliveries through the front door
  – not part of the protocol; defined outside the protocol

[Diagram: the harvester sends http://blah/oai/?verb=ListIdentifiers&metadataPrefix=oai_dc to the OAI server naca.larc.nasa.gov/oai/. If load > 0.50, that server redirects the request with HTTP status code 302 to the OAI server buckets.dsi.internet2.edu/naca/oai/. The harvester re-issues the request and receives the response:]

<?xml version="1.0" encoding="UTF-8"?>
…<ListIdentifiers>
…</ListIdentifiers>
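A hedged sketch of the redirect decision in the diagram: when the measured load exceeds the threshold, answer with a 302 whose Location repeats the same OAI-PMH arguments against the mirror. The function name and the way load is obtained are assumptions; the 0.50 threshold and the mirror host are taken from the slide.

```python
from urllib.parse import urlencode

MIRROR_BASE_URL = "http://buckets.dsi.internet2.edu/naca/oai/"


def redirect_if_busy(load, args, threshold=0.50, mirror=MIRROR_BASE_URL):
    """Return (302, headers) when the server is busy, else None."""
    if load > threshold:
        return 302, {"Location": mirror + "?" + urlencode(args)}
    return None  # not busy: answer the request locally
```

The load value could come from something like os.getloadavg() on Unix systems; that choice is outside the protocol, as the slide notes.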

Page 77: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

DNS Load Balancing

• using a DNS rotor, establish
  – a.foo.org, b.foo.org, c.foo.org
  – each with a synchronized copy of the repository
  – let DNS & chance distribute the load

Page 78: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Load Balancing Caveats

• Copies of the repository must be synchronized
  – (cf. Pande, et al. JCDL 02)
• Complex hierarchies are possible
  – programmer must ensure no cycles in redirection graphs!
• The baseURL in the reply must always point to the original repository, not the repository that eventually answered the request

Page 79: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Error Handling: Verbosity

More is better…

<error code="badArgument">Illegal argument ‘foo’</error>

<error code="badArgument">Illegal argument ‘bar’</error>

is preferred over:

<error code="badArgument">Illegal arguments ‘foo’, ‘bar’</error>

which is preferred over:

<error code="badArgument">Illegal arguments</error>
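The preferred, most verbose form can be produced by emitting one error element per offending argument. The sketch below is an illustration under a simplifying assumption: LEGAL_ARGUMENTS is the union of argument names across all verbs, whereas a real repository would check legality per verb. The function name is hypothetical.

```python
from xml.sax.saxutils import escape

# Union of OAI-PMH 2.0 argument names (simplification: not per verb).
LEGAL_ARGUMENTS = frozenset({
    "verb", "identifier", "metadataPrefix",
    "from", "until", "set", "resumptionToken",
})


def bad_argument_errors(args):
    """Return one badArgument error element for each illegal argument."""
    errors = []
    for name in args:
        if name not in LEGAL_ARGUMENTS:
            errors.append(
                '<error code="badArgument">Illegal argument \'%s\'</error>'
                % escape(name)
            )
    return errors
```

Escaping the argument name keeps the response well-formed even if a harvester sends markup characters in it.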

Page 80: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Error Handling: Levels

• the OAI-PMH error / exception conditions are for OAI-PMH semantic events
• they are not for situations when:
  – the database is down
  – a record is malformed
    • remember: record = id + datestamp + metadataPrefix
    • if you’re missing one of those, you don’t have an OAI record!
  – and other conditions that occur outside the OAI scope
• use http codes 500, 503 or other appropriate values to indicate non-OAI problems

Page 81: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Error Handling: Extensions

• Arguments that are not 'required', 'optional' or 'exclusive' are 'illegal' and should generate badArgument errors.
• If you want to extend the OAI-PMH:
  – stop and consider: do you really need to?
    • maybe you should have different OAI-PMH interfaces, or creative metadata formats
  – if you really, really want to, tunnel your extensions through the “set” feature
    • see http://www.dlib.org/dlib/december01/suleman/12suleman.html for examples

Page 82: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Idempotency of “List” Requests (1)

• Purpose is to allow harvesters to recover from lost responses or crashes without starting a large harvest from scratch
• Recover by re-issuing request using resumptionToken from previous request
• IMPLICATION: repository must accept both the most recent resumptionToken issued and the previous one

Page 83: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Idempotency of “List” Requests (2)

• response to a re-issued request must contain all unchanged records
• any changed records will get new datestamps after time of initial request
• changes will be picked up by subsequent harvest if not included

[no experience yet with incomplete responses to ListSets or ListMetadataFormats requests]

Page 84: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Case Study: “bucket” based repositories

• Buckets: see Nelson & Maly, CACM 44(5)
• 2.0
  – LTRS - techreports.larc.nasa.gov/ltrs/oai2.0/ (file system, refer)
  – NACA - naca.larc.nasa.gov/oai2.0/ (file system, refer)
• 1.1
  – LTRS - techreports.larc.nasa.gov/ltrs/oai/
  – NACA - naca.larc.nasa.gov/ltrs/oai/
  – Open Video - www.open-video.org/oai/ (MySQL, local)
  – JTRS - ston.jsc.nasa.gov/collections/TRS/oai (MS Access dump, local)
  – GLTRS (filesystem, HTML scraping)
• Characteristics:
  – resumptionToken support initially skipped; added later (all)
    • highly encoded rT’s: [2001-01-01!!!!301!600]
  – sets initially skipped, added later (LTRS)
  – initially had load balancing with 2 NACA repositories…

Page 85: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Case Study: “bucket” based repositories

• in bucket terminology:
  – 6 OAI verbs (methods) added to the existing list of methods
    • http://naca.larc.nasa.gov/oai2.0/?method=list_methods
    • http://naca.larc.nasa.gov/oai2.0/?method=list_source&target=ListIdentifiers
  – a data element is added to the bucket that contains the specifics of the particular repository and its metadata format
    • http://naca.larc.nasa.gov/oai2.0/?method=display&pkg_name=oai&element_name=oai.pl

Page 86: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Harvester Implementation Guidelines

Page 87: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Be a Polite OAI Neighbor

• Re-use existing free harvesters rather than writing your own
  – http://www.openarchives.org/tools/index.html
• Read http://www.robotstxt.org/wc/robots.html if you insist on writing your own
• Provide meaningful User-Agent & From http headers
  – these values can be configured even if using someone else’s harvester
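For a home-grown harvester, setting the two headers is a one-liner in most HTTP libraries. The sketch below uses Python's standard urllib; the function name, agent string and contact address are placeholders to be replaced with real values.

```python
from urllib.request import Request


def polite_request(url, agent, contact):
    """Build a request that identifies the harvester and its operator."""
    return Request(url, headers={"User-Agent": agent, "From": contact})


# Placeholder values; nothing is sent until the request is opened.
request = polite_request(
    "http://example.org/oai?verb=Identify",
    "ExampleHarvester/0.1",
    "harvest-admin@example.org",
)
```

A repository administrator who sees problem traffic can then contact the operator instead of simply blocking the address.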

Page 88: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Listen to the Repository!

• Check Identify’s <granularity> element if you wish to use finer than YYYY-MM-DD
• If you harvest with sets, remember that “:” indicates hierarchies
  – harvesting “a” will recursively harvest “a:b”, “a:b:c”, and “a:d”
• Check for and handle non-200 http status codes (503, 302, etc.)
• empty resumptionToken => end of complete list
• consider asking for compressed responses if the repository supports them…
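The set hierarchy rule can be checked with a short predicate: a record's setSpec falls under a requested set when it is equal to it or extends it after a ":" separator. The function name in_set is illustrative.

```python
def in_set(record_set_spec, requested_set):
    """True if record_set_spec is requested_set or one of its descendants."""
    return (
        record_set_spec == requested_set
        or record_set_spec.startswith(requested_set + ":")
    )
```

Comparing against requested_set + ":" rather than the bare prefix avoids treating an unrelated set such as “ab” as a child of “a”.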

Page 89: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

How to Grab Everything

• Issue an Identify request to find the finest datestamp granularity supported.
• Issue a ListMetadataFormats request to obtain a list of all metadataPrefixes supported.
• Harvest using ListRecords requests for each metadataPrefix supported. Knowledge of the datestamp granularity allows for less overlap if granularities finer than a day are supported.
• Set structure can be inferred from the setSpec elements in the header blocks of each record returned (consistency checks are possible).
• Items may be reconstructed from the constituent records. Local datestamps must be assigned to harvested items.
• Provenance and other information in <about> blocks may be re-assembled at the item level if it is the same for all metadata formats harvested. However, this information may be supplied differently for different metadata formats and may thus need to be stored separately for each metadata format.

Page 90: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Harvesting v1.1 and v2.0

• Not difficult to handle both cases, test with Identify:
  – v1.1: <Identify> <protocolVersion>
  – v2.0: <OAI-PMH> <Identify> <protocolVersion>
• Note also different error and exception handling
• Many similarities, harvesters can share lots of code
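The version test described above could look like the following sketch, which inspects the root element of an Identify response and reads protocolVersion from the matching place. It compares local element names only, so it does not depend on the exact namespace URIs; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def protocol_version(identify_xml):
    """Return the protocolVersion text from a v1.x or v2.0 Identify response."""
    root = ET.fromstring(identify_xml)
    if local_name(root.tag) == "OAI-PMH":      # v2.0 layout
        identify = next(
            (c for c in root if local_name(c.tag) == "Identify"), None
        )
    elif local_name(root.tag) == "Identify":   # v1.x layout
        identify = root
    else:
        identify = None
    if identify is not None:
        for child in identify:
            if local_name(child.tag) == "protocolVersion":
                return (child.text or "").strip()
    raise ValueError("not a recognizable Identify response")
```

Once the version is known, the harvester can select the matching parsing and error-handling path while sharing the request and storage code.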

Page 91: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Harvesting Demo

• Harvester written in Perl
• Handles v1.0, v1.1 and v2.0
• Uses LWP, Expat and XML::Parser (no schema validation)
• UTF-8 conditioning to deal with some “imperfect” repositories
• Sequence of requests: Identify, ListMetadataFormats, ListSets then ListRecords/ListIdentifiers
• Support for incremental harvesting, uses responseDate from last harvest to get new start datestamp

Page 92: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Harvesting logs

• Alan Kent’s v2.0 harvester logs: http://www.inquirion.com:8123/public/collList;collListCmd=list
• Alan Kent’s summary of v1.1 harvesting results: http://www.mds.rmit.edu.au/~ajk/oai/interop/summary.htm
• Celestial v1.1 harvesting logs: http://celestial.eprints.org/cgi-bin/status
• DP9 gateway using arc harvested information: http://arc.cs.odu.edu:8080/dp9/index.jsp

Page 93: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

NASA <friends> example (1)

• A light weight, DP-centric method to communicate the existence of “others”

http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify

..
<description>
  <friends ..namespace stuff..>
    <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
    <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
    <baseURL>http://horus.riacs.edu/perl/oai/</baseURL>
    <baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
  </friends>
</description>
..
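A harvester could discover further repositories by collecting the baseURL entries of any friends container in an Identify response. The sketch below matches local element names, so it makes no assumption about the namespace elided on the slide; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def friend_base_urls(identify_xml):
    """Return the baseURLs listed in <friends> containers of a response."""
    def local_name(tag):
        return tag.rsplit("}", 1)[-1]

    urls = []
    for element in ET.fromstring(identify_xml).iter():
        if local_name(element.tag) == "friends":
            for child in element:
                if local_name(child.tag) == "baseURL":
                    urls.append((child.text or "").strip())
    return urls
```

Feeding each discovered baseURL back into the same routine, while remembering which ones were already visited, would let a harvester walk the whole group of repositories.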

Page 94: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

NASA <friends> example (2)

[Diagram: a harvester sends an Identify request to http://techreports.larc.nasa.gov/ltrs/oai2.0/, whose <friends>…</friends> container points to:]

http://naca.larc.nasa.gov/oai2.0/
http://ntrs.nasa.gov/oai2.0/
http://ston.jsc.nasa.gov/collections/TRS/oai/
http://horus.riacs.edu/perl/oai/

Page 95: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Case Study: arXiv (1)

• Existing system, running >10 years. Written mostly in Perl
• Flat file system for ‘database’
• 200k papers with metadata in homebrew format
• ~200 updates/day. OAI repository just one view of system, must integrate with daily update schedule

Page 96: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Case Study: arXiv (2)

• Write in Perl
  – Easy integration with rest of system, reuse code from v1.0/v1.1 interface
  – Use libwww; XML::DOM
• Daily rebuild of datestamp database
  – No existing date in system appropriate
  – Base on Unix cdate of metadata files
• On-the-fly metadata translation
  – Straightforward, avoids data duplication

Page 97: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Case Study: arXiv (3)

• Flow control to avoid loading server and to avoid harvesters tripping robot alarms
  – resumptionTokens to limit response size (500-1000 records or 5k-10k identifiers / response)
  – 503 Retry-After replies based on client ip
• Implement resumptionTokens that include all state
  – Avoid need to cache result sets / clean cache

Page 98: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Aggregator / Cache / Proxy Implementation Guidelines

Page 99: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

<provenance> & datestamps

• reminder: datestamps are local to the repository, and a re-exporting service must use new local datestamps
• such services should use the <provenance> container to preserve the original datestamps

Page 100: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Identifiers are Local

• identifiers are local to the repository
• unless you absolutely did not change the metadata and the identifier corresponds to a recognized URI scheme, use a new identifier upon re-exporting
  – use the <provenance> container to preserve the harvesting history

Page 101: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

Derived from the same item?

3 ways to determine if records share provenance from the same item:

1. both records have the same identifier and the baseURL in the request elements of the OAI-PMH responses which include the record are the same;
2. both records have the same identifier and that identifier belongs to some recognized URI scheme;
3. the provenance containers of both records have the same entries for both the identifier and baseURL.
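The three tests can be combined into one predicate. This is a sketch under stated assumptions: HarvestedRecord is a hypothetical structure holding the identifier, the baseURL of the response the record came from, and the (baseURL, identifier) pair from its provenance container if it has one; which URI schemes count as "recognized" is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HarvestedRecord:
    identifier: str
    base_url: str                              # from the response's request element
    origin: Optional[Tuple[str, str]] = None   # (baseURL, identifier) in provenance


def same_item(a, b, recognized_schemes=()):
    """Apply the three provenance tests from the slide, in order."""
    if a.identifier == b.identifier:
        if a.base_url == b.base_url:                         # test 1
            return True
        if a.identifier.startswith(tuple(recognized_schemes)):  # test 2
            return True
    # test 3: both provenance containers name the same origin
    return a.origin is not None and a.origin == b.origin
```

With an empty recognized_schemes the second test never succeeds, which is the conservative default.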

Page 102: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

<provenance> example (1)

Consider a request from crosswalker.oa.org:

http://odd.oa.org?verb=GetRecord
&identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt

and the following response from odd.oa.org:

<?xml version="1.0" encoding="UTF-8"?>
<GetRecord ..namespace stuff..>
  <responseDate>2002-02-08T08:55:46.1</responseDate>
  <request verb="GetRecord" metadataPrefix="odd_fmt"
           identifier="oai:odd.oa.org:z1x2y3">http://odd.oa.org</request>
  <record>
    <header>
      <identifier>oai:odd.oa.org:z1x2y3</identifier>
      <datestamp>1999-08-07T06:05:04Z</datestamp>
    </header>
    <metadata> ...metadata record in odd_fmt... </metadata>
  </record>
</GetRecord>

Page 103: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

<provenance> example (2)

Imagine that crosswalker.oa.org cross-walks harvested metadata from odd_fmt into oai_marc and then re-exposes the metadata with new identifiers. A request from getmarc.oa.org:

http://crosswalker.oa.org?verb=GetRecord
&identifier=oai:cw.oa.org:z1x2y3
&metadataPrefix=oai_marc

might then yield the following response from crosswalker.oa.org:

Page 104: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

<provenance> example (3)

<record>
  <header>
    <identifier>oai:cw.oa.org:z1x2y3</identifier>
    <datestamp>2002-02-09T01:15:43Z</datestamp>
  </header>
  <metadata> ...metadata record in oai_marc... </metadata>
  <about>
    <provenance ..namespace stuff..>
      <originDescription harvestDate="2002-02-08T08:55:46Z" altered="true">
        <baseURL>http://odd.oa.org</baseURL>
        <identifier>oai:odd.oa.org:z1x2y3</identifier>
        <datestamp>1999-08-07T06:05:04Z</datestamp>
        <metadataNamespace>http://odd.oa.org/odd_fmt</metadataNamespace>
      </originDescription>
    </provenance>
  </about>
</record>

Page 105: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

<provenance> example (4)

This oai_marc record is then re-exposed by getmarc.oa.org with the same identifier oai:cw.oa.org:z1x2y3 (because the record has not been altered).

The associated <provenance> container might be:

Page 106: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

<provenance> example (5)

<record>
  <header>
    <identifier>oai:cw.oa.org:z1x2y3</identifier>
    <datestamp>2002-03-01T01:46:11Z</datestamp>
  </header>
  <metadata> ...metadata record in oai_marc... </metadata>
  <about>
    <provenance ..namespace stuff..>
      <originDescription harvestDate="2002-03-01T01:23:45" altered="false">
        <baseURL>http://crosswalker.oa.org/</baseURL>
        <identifier>oai:cw.oa.org:z1x2y3</identifier>
        <datestamp>2002-02-09T01:15:43Z</datestamp>
        <metadataNamespace>http://../oai_marc</metadataNamespace>
        <originDescription harvestDate="2002-02-08T08:55:46Z" altered="true">
          <baseURL>http://odd.oa.org</baseURL>
          <identifier>oai:odd.oa.org:z1x2y3</identifier>
          <datestamp>1999-08-07T06:05:04Z</datestamp>
          <metadataNamespace>http://odd.oa.org/odd_fmt</metadataNamespace>
        </originDescription>
      </originDescription>
    </provenance>
  </about>
</record>
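The nesting in this example, where each re-exporter wraps the originDescription it received inside its own, can be sketched with ElementTree. The function name wrap_origin is hypothetical, and the namespace declarations are omitted here just as they are elided on the slides, so the output is an outline rather than a schema-valid container.

```python
import xml.etree.ElementTree as ET


def wrap_origin(base_url, identifier, datestamp, metadata_namespace,
                harvest_date, altered, previous=None):
    """Build an originDescription, nesting the earlier one if given."""
    origin = ET.Element("originDescription", {
        "harvestDate": harvest_date,
        "altered": "true" if altered else "false",
    })
    for tag, text in (
        ("baseURL", base_url),
        ("identifier", identifier),
        ("datestamp", datestamp),
        ("metadataNamespace", metadata_namespace),
    ):
        ET.SubElement(origin, tag).text = text
    if previous is not None:
        # The older harvesting step goes last, inside the newer one.
        origin.append(previous)
    return origin
```

Each additional aggregator in the chain calls the function once more with the element it harvested as previous, so the innermost element always describes the original repository.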

Page 107: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

oai-identifier

• not compatible with v1.1 oai-identifier
  – repositoryName now domain name based
  – not reliant upon OAI centralized registration
• one-to-one mapping for escaping characters: %3F allowed, %3f not
  – allows simple comparison
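The escaping rule can be checked mechanically: every "%" must be followed by two hexadecimal digits written with upper-case letters. The sketch below tests only that one rule, not the full oai-identifier syntax, and the function name is illustrative.

```python
import re

# A '%' that is NOT followed by two upper-case hex digits.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-F]{2})")


def has_canonical_escapes(identifier):
    """True if every percent-escape uses upper-case hex digits."""
    return _BAD_ESCAPE.search(identifier) is None
```

Because only one spelling of each escape is allowed, two identifiers that pass this check can be compared as plain strings.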

Page 108: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

looking ahead: novel uses of OAI-PMH

Page 109: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

• Registry of metadata formats for OpenURL
  – http://www.sfxit.com/openurl/
  – http://lib-www.lanl.gov/~herbertv/papers/icpp02-draft.pdf
• other uses?

Page 110: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

registry model

[Diagram labels: registrars; XML Schema URL1, URL2 … URLn; registration; polling; central repository (regis., Poll, OAI-PMH); OAI-PMH harvesting; linking servers; user service]

Page 111: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

[Diagram labels: registrars; XML Schema URL1, URL2 … URLn]

Goal:
• inform linking servers re Schema
• ease of admin for all parties involved
• limit human overhead

Page 112: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

[Diagram labels: registrars; XML Schema URL1, URL2 … URLn; registration; central repository (regis.)]

Registry:
• schemaLocation
• registration date
• mirror of Schema

Page 113: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

[Diagram labels: registrars; XML Schema URL1, URL2 … URLn; registration; polling; central repository (regis., Poll)]

Poll:
• fetch schema at schemaLocation
• log failure/success
• compare fetched Schema with mirror
• changed => replace mirror
• removed => deregistered
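One polling pass over the registry might be sketched as follows. Everything here is an assumption made for illustration: the registry is modelled as a dictionary from schemaLocation to the mirrored schema text, and the caller-supplied fetch function returns the schema text, returns None when the schema has been removed, and raises OSError when the fetch fails.

```python
def poll_registry(mirrors, fetch):
    """Run one polling pass; update mirrors in place and return a log."""
    log = []
    for location in list(mirrors):       # copy keys: we may delete entries
        try:
            fetched = fetch(location)
        except OSError:
            log.append((location, "failure"))
            continue
        if fetched is None:
            del mirrors[location]         # removed => deregistered
            log.append((location, "removed"))
        elif fetched != mirrors[location]:
            mirrors[location] = fetched   # changed => replace mirror
            log.append((location, "changed"))
        else:
            log.append((location, "unchanged"))
    return log
```

In the registry described on these slides, each outcome would also update the datestamp of the corresponding record so that harvesters pick up the change.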

Page 114: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

[Diagram labels: registrars; XML Schema URL1, URL2 … URLn; registration; polling; central repository (regis., Poll, OAI-PMH)]

OAI repo:
• record-ids = schemaLocation
• oai_dc record:
  – registration info
  – (de)registration datestamp
• xsi record:
  – mirror schema
  – schema update datestamp
• poll record:
  – process info
  – recent poll datestamp

Page 115: CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) •

[Diagram labels: registrars; XML Schema URL1, URL2 … URLn; registration; polling; central repository (regis., Poll, OAI-PMH); OAI-PMH harvesting; linking servers; user service]