CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000...

Post on 20-Mar-2020

6 views 0 download

Transcript of CM - cs.odu.edumln/jcdl02/oai-2.0-adv-final.pdf · September 21, 2000 - workshop at ECDL 2000...

Advanced O

verview of Version 2.0 of theO

pen Archives Initiative

Protocol for Metadata H

arvesting

Michael L. N

elsonO

ld Dom

inion University

Norfolk VA

mln@

cs.odu.edu

Herbert Van de Som

pelLos A

lamos N

ational LaboratoryLos A

lamos N

Mherbertv@

lanl.gov

Simeon W

arnerCornell U

niversityIthaca N

Ysim

eon@cs.cornell.edu

ACM

/IEEE Joint Conference on Digital Libraries

Tutorial 5Portland, O

regon14:00 - 17:30July 14 2002

Scope and Focus

•This Tutorial is not…–

an introduction to OA

I-PMH

–a listing of all the wonderful projects that useO

AI-PM

H–

a discussion of the merits of m

etadataharvesting vs. distributed searching

•A

passing familiarity is assum

ed for:–

OA

I-PMH

1.x–

Dublin Core

Outline

•H

ow 2.0 evolved from SFC and 1.x

–people, processes, events

•W

hat’s new in 2.0–

comparison with 1.x

•Guidelines, recom

mendations, best

practices for 2.0 implem

entations–

harvesters, repositories, aggregators, optionalcontainers

•N

ovel applications of OA

I-PMH

Spoiler: What’s N

ew in 2.0?!•

Good news: OA

I-PMH

is still

Six Verbs + DC

•Increm

ental improvem

ents–

single XM

L schema

–am

biguities removed

–m

ore expressive options–

cleaner separation of roles & responsibilities•

Bad news: not backwards compatible with 1.1

fromSanta Fe Convention

toO

AI-PM

H v.2.0

abouteprints

document

like objectsresources

metadata

OA

MS

unqualifiedD

ublin CoreunqualifiedD

ublin Core

transportH

TTPH

TTPH

TTP

responsesX

ML

XM

LX

ML

requestsH

TTP GET/POST

HTTP GET/PO

STH

TTP GET/POST

verbsD

ienstO

AI-PM

HO

AI-PM

H

natureexperim

entalexperim

entalstable

model

metadata

harvestingm

etadataharvesting

metadata

harvesting

Santa Feconvention

OA

I-PMH

v.1.0/1.1O

AI-PM

Hv.2.0

Santa Fe Convention [02/2000]

• goal: optimize discovery of e-prints

• input:

• the UPS prototype

• RePEc /SOD

A “data provider / service

provider model”

• Dienst protocol

• deliberations at Santa Fe meeting

[10/99]

OA

I-PMH

v.1.0 [01/2001]

• goal: optimize discovery of docum

ent-like

objects

• input:

• SFC• D

LF meetings on m

etadata harvesting• deliberations at Cornell m

eeting [09/00]• alpha test group of O

AI-PM

H v.1.0

• low-barrier interoperability specification

• metadata harvesting m

odel: data provider / serviceprovider

• focus on document-like objects

• autonomous protocol

• HTTP based

• XM

L responses

• unqualified Dublin Core

• experimental: 12-18 m

onths

OA

I-PMH

v.1.0 [01/2001]

Selected Pre- 2.0 OA

I Highlights

•O

ctober 21-22, 1999 - initial UPS m

eeting•

February 15, 2000 - Santa Fe Convention published in D-Lib M

agazine–

precursor to the OA

I metadata harvesting protocol

•June 3, 2000 - workshop at A

CM D

L 2000 (Texas)•

August 25, 2000 - O

AI steering com

mittee form

ed, DLF/CN

I support•

September 7-8, 2000 - technical m

eeting at Cornell University

–defined the core of the current O

AI m

etadata harvesting protocol•

September 21, 2000 - workshop at ECD

L 2000 (Portugal)•

Novem

ber 1, 2000 - Alpha test group announced (~15 organizations)

•January 23, 2001 - O

AI protocol 1.0 announced, O

AI O

pen Day in the

U.S. (W

ashington DC)

–purpose: freeze protocol for 12-16 m

onths, generate critical mass

•February 26, 2001 - O

AI O

pen Day in Europe (Berlin)

•July 3, 2001 - O

AI protocol 1.1 announced

–to reflect changes in the W

3C’s XM

L latest schema

recomm

endation•

September 8, 2001 - workshop at ECD

L 2001 (Darm

stadt)

OA

I-PMH

v.2.0 [06/2002]

• goal: recurrent exchange of metadata about resources

between systems

• input:

• OA

I-PMH

v.1.0• feedback on O

AI-im

plementers

• deliberations by OA

I-tech [09/01 - 06/02]

• alpha test group of OA

I-PMH

v.2.0 [03/02 - 06/02]

•officially released June 14, 2002

• low-barrier interoperability specification

• metadata harvesting m

odel: data provider / serviceprovider

• metadata about resources

• autonomous protocol

• HTTP based

• XM

L responses

• unqualified Dublin Core

• stable

OA

I-PMH

v.2.0 [06/2002]

releasing OA

I-PMH

v.2.0

fi pre-alpha phase

fi alpha-phase

fi creation of O

AI-tech

fi beta-phase

• created for 1 year period

• charge:

• review functionality and nature of OA

I-PMH

v.1.0

• investigate extensions

• release stable version of OA

I-PMH

by 05/02

• determine need for infrastructure to support broad

adoption of the protocol

• comm

unication: listserv, SourceForge, conference calls

creation of OA

I-tech [06/01]

US representatives

Thomas Krichel (Long Island U

) - Jeff Young (OCLC) - Tim

Cole - (U of Illinois at U

rbana Champaign) - H

ussein Suleman

(Virginia Tech) - Simeon W

arner (Cornell U) - M

ichael Nelson

(NA

SA) - Caroline A

rms (LoC) - M

ohamm

ad Zubair (Old

Dom

inion U) - Steven Bird (U

Penn.)

European representatives

Andy Powell (Bath U

. & UKO

LN) - M

ogens Sandfaer (DTV) -

Thomas Baron (CERN

) - Les Carr (U of Southam

pton)

OA

I-tech

• review process by OA

I-tech:

• identification of issues

• conference call to filter/combine issues

• white paper per issue

• on-line discussion per white paper

• proposal for resolution of issue by OA

I-exec

• discussion of proposal & closure of issue

• conference call to resolve open issues

pre-alpha phase [09/01 – 02/02]

• creation of revised protocol document

• in-person meeting Lagoze - Van de Som

pel -N

elson – Warner

• autonomous decisions

• internal vetting of protocol document

pre-alpha phase [02/02]

• alpha-1 release to OA

I-tech March 1st

2002

• OA

I-tech extended with alpha testers

• discussions/implem

entations by OA

I-tech

• ongoing revision of protocol document

alpha phase [02/02 – 05/02]

• The British Library• Cornell U

. -- NSD

L project & e-print arXiv

• Ex Libris• FS Consulting Inc -- harvester for m

y.OA

I• H

umboldt-U

niversität zu Berlin• InQ

uirion Pty Ltd, RMIT U

niversity• Library of Congress• N

ASA

• OCLC O

AI-PM

H 2.0 alpha testers (1/2)

OA

I-PMH

2.0 alpha testers (2/2)

• Old D

ominion U

. -- ARC , D

P9• U

. of Illinois at Urbana-Cham

paign• U

. Of Southam

pton -- OA

IA (now Celestial),

CiteBase, eprints.org

• UCLA

, John Hopkins U

., Indiana U., N

YU -- sheet

music collection

• UKO

LN, U

. of Bath -- RDN

• Virginia Tech -- repository explorer

beta phase [05/02-06/02]

• beta release on May 1st 2002 to:

• registered data providers and serviceproviders

• interested parties

• fine tuning of protocol document

• preparation for the release of 2.0conform

ant tools by alpha testers

what’s new in OA

I-PMH

v.2.0

fi corrections

fi new functionality

fi general changes to im

prove solidity ofprotocol

fi quick recap

Overview of O

AI-PM

H Verbs

FunctionVerb

listing of a single recordGetRecord

listing of N records

ListRecords

OA

I unique ids contained in archiveListIdentifiers

sets defined by archiveListSets

metadata form

ats supported byarchive

ListMetadataForm

ats

description of archiveIdentify

metadata

about therepository

harvestingverbs

most verbs take argum

ents: dates, sets, ids, metadata form

atsand resum

ption token (for flow control)

general changes

protocol vs periphery

• clear distinction between protocol and

periphery

• fixed protocol document

• extensible implem

entation guidelines:

• e.g. sample m

etadata formats, description

containers, about containers

• allows for OA

I guidelines and comm

unityguidelines

OA

I-PMH

vs HTTP

• clear separation of OA

I-PMH

and HTTP

• OA

I-PMH

error handling

• all OK at H

TTP level? => 200 OK

• something wrong at O

AI-PM

H level? =>

OA

I-PMH

error (e.g. badVerb)

• http codes 302, 503, etc. still available toim

plementers, but no longer represent O

AI-PM

Hevents

resource

all available metadata

about David

item

Dublin Core

metadata

MA

RCm

etadata S

PECTRUM

metadata

records

item = identifier

record = identifier + metadata form

at + datestamp

set-mem

bership isitem

-level property

resource – item - record

other general changes

• better definitions of harvester,

repository, item, unique identifier, record,

set, selective harvesting

• oai_dc schema builds on D

CMI X

ML

Schema for unqualified D

ublin Core

• usage of must, m

ust not etc. as in RFC2119

• wording on response compression

other general changes

• all protocol responses can be validated with

a single XM

L Schema

• easier for data providers

• no redundancy in type definitions

• SOA

P-ready

• clean for error handling

<?xml version="1.0" encoding="UTF-8"?>

<OAI-PMH><responseDate>

2002-0208T08:55:46Z</responseDate>

<request verb=“GetRecord”… …>http://arXiv.org/oai2</request>

<GetRecord> <record>

<header>

<identifier>oai:arXiv:cs/0112017</identifier>

<datestamp>2001-12-14</datestamp>

<setSpec>cs</setSpec>

<setSpec>math</setSpec>

</header>

<metadata>

…..

</metadata>

</record>

</GetRecord></OAI-PMH>

response no errors

note no http encodingof the O

AI-PM

H request

<?xml version="1.0" encoding="UTF-8"?>

<OAI-PMH><responseDate>

2002-0208T08:55:46Z</responseDate>

<request>http://arXiv.org/oai2</request>

<error code=“badVerb”>ShowMe is not a valid OAI-PMH verb</error>

</OAI-PMH>

response with error

with errors, only the correctattributes are echoed in <request>

corrections

dates/times

• all dates/times are U

TC, encoded in

ISO8601, Z-notation

1957-03-20T20:30:00Z

• idempotency of r

esumptionToken: return sam

e incomplete

list when rT is reissued

• while no changes occur in the repo: strict

• while changes occur in the repo: all items with unchanged

datestamp

•new, optional attributes for the resumptionToken:

•expirationDate

•completeListSize

•cursor

resumptionToken

•1.x - if no records m

atch, an empty list was returned

noRecordsMatch

•2.0 - if no records m

atch, the error conditionnoRecordsM

atch is returned -- not an empty list

noRecordsMatch

new functionality

• harvesting granularity

• mandatory support of Y

YYY-MM-DD

• optional support of YYYY-MM-DDThh:mm:ssZ

• other granularities considered, but ultimately rejected

• granularity of from and u

ntil m

ust be the

same

harvesting granularity

• Identify m

ore expressive

Identify

<Identify>

<repositoryName>Library of Congress 1</repositoryName>

<baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>

<protocolVersion>2.0

</protocolVersion>

<adminEmail>

r.e.gillian@larc.nasa.gov</adminEmail>

<adminEmail>

rgillian@visi.net</adminEmail>

<deletedRecord>transient</deletedRecord>

<earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>

<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>

<compression>deflate</compression>

• header contains set m

embership of item

header

<record>

<header>

<identifier>oai:arXiv:cs/0112017</identifier>

<datestamp>2001-12-14</datestamp>

<setSpec>cs</setSpec> <setSpec>math</setSpec> </header>

<metadata>

…..

</metadata>

</record>

eliminates the need for the “double

harvest” 1.x required to get all recordsand all set inform

ation

• ListIdentifiers returns h

eaders

ListIdentifiers

<?xml version="1.0" encoding="UTF-8"?>

<OAI-PMH>

<responseDate>2002-0208T08:55:46Z</responseDate>

<request verb=“…” …>http://arXiv.org/oai2</request>

<ListIdentifiers>

<header>

<identifier>oai:arXiv:hep-th/9801001</identifier> <datestamp>1999-02-23</datestamp> <setSpec>physic:hep</setSpec> </header> <header> <identifier>oai:arXiv:hep-th/9801002</identifier> <datestamp>1999-03-20</datestamp> <setSpec>physic:hep</setSpec> <setSpec>physic:exp</setSpec> </header> ……

• ListIdentifiers m

andates

metadataPrefix as argum

ent

ListIdentifiers

http://www.perseus.tufts.edu/cgi-bin/pdataprov?

verb=ListIdentifiers

&metadataPrefix=olac

&from=2001-01-01

&until=2001-01-01

&set=Perseus:collection:PersInfo

•the changes to ListIdentifiers are subtle, andreflect a change in the O

AI-PM

H data m

odel•

Could have been named “ListH

eaders” or reduced toan option for ListRecords–

“ListIdentifiers” kept for lexigraphical consistency

ListIdentifiers

• character set for metadataPrefix and

setSpec extended to U

RL-safe characters

metadataPrefix

A-Z a-z 0-9 _ ! ‘ $ ( ) + - . *

in the periphery

• introduction of provenance container to

facilitate tracing of harvesting history

provenance

<about>

<provenance> <originDescription> <baseURL>http://an.oa.org</baseURL> <identifier>oai:r1:plog/9801001</identifier> <datestamp>2001-08-13T13:00:02Z</datestamp> <metadataPrefix>oai_dc</metadataPrefix> <harvestDate>2001-08-15T12:01:30Z</harvestDate>

<originDescription>

… … …

</originDescription>

</originDescription> </provenance></about>

• introduction of friends container to

facilitate discovery of repositories

friends

<description>

<friends> <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL> <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL> <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL> <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL> </friends></description>

• introduction of branding container for

DPs to suggest rendering & association hints

<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/

http://www.openarchives.org/OAI/2.0/branding.xsd">

<collectionIcon>

<url>http://my.site/icon.png</url>

<link>http://my.site/homepage.html</link>

<title>MySite(tm)</title>

<width>88</width>

<height>31</height>

</collectionIcon>

<metadataRendering

metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"

mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>

<metadataRendering

metadataNamespace="http://another.place/MARC"

mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>

</branding>

branding

• revision of oai-identifier

<description>

<oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-

identifier"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-

identifier

http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">

<scheme>oai</scheme>

<repositoryIdentifier>oai-stuff.foo.org

</repositoryIdentifier>

<delimiter>:</delimiter>

<sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>

</oai-identifier>

</description>

oai-identifier

domain based

repository names

•O

AI 1.x: oai_dc Schem

a defined by OA

I•

OA

I 2.0: oai_dc Schema im

ports from D

CMI

Schema for unqualified D

C elements

oai_dc

•O

AI 1.x: oai_m

arc•

OA

I 2.0: LoC marxm

l, oai_marc

–http://www.loc.gov/standards/m

arcxml/

MA

RC21

did not make it into O

AI-PM

H v.2.0

•SO

AP im

plementation

•Result set filtering

•M

ultiple / “best” metadata

•GetRecord -> GetRecords

•M

achine readable rights managem

ent•

XM

L format for “m

ini-archives”

Detailed Review of theO

AI-PM

H 2.0 Verbs

Identify

•A

rguments

–none

•Errors–

none

•A

rguments

–none

•Errors–

badArgum

ent

1.12.0

ListMetadataForm

ats

•A

rguments

–identifier(O

PTION

AL)

•Errors–

id does not exist

•A

rguments

–identifier(O

PTION

AL)

•Errors–

badArgum

ent–

noMetadataForm

ats–

idDoesN

otExist

1.12.0

ListSets

•A

rguments

–resum

ptionToken(EX

CLUSIVE)

•Errors–

no set hierarchy

•A

rguments

–resum

ptionToken(EX

CLUSIVE)

•Errors–

badArgum

ent–

badResumptionToken

–noSetH

ierarchy

1.12.0

ListIdentifiers

•A

rguments

–from

(OPTIO

NA

L)–

until (OPTIO

NA

L)–

set (OPTIO

NA

L)–

resumptionToken

(EXCLU

SIVE)•

Errors–

no records match

•A

rguments

–from

(OPTIO

NA

L)–

until (OPTIO

NA

L)–

set (OPTIO

NA

L)–

resumptionToken

(EXCLU

SIVE)–

metadataPrefix (REQ

UIRED

)•

Errors–

badArgum

ent–

cannotDissem

inateFormat

–badResum

ptionToken–

noSetHierarchy

–noRecordsM

atch

1.12.0

ListRecords

•A

rguments

–from

(OPTIO

NA

L)–

until (OPTIO

NA

L)–

set (OPTIO

NA

L)–

resumptionToken

(EXCLU

SIVE)–

metadataPrefix (REQ

UIRED

)•

Errors–

no records match

–m

etadata format cannot be

disseminated

•A

rguments

–from

(OPTIO

NA

L)–

until (OPTIO

NA

L)–

set (OPTIO

NA

L)–

resumptionToken

(EXCLU

SIVE)–

metadataPrefix (REQ

UIRED

)•

Errors–

noRecordsMatch

–cannotD

isseminateForm

at–

badResumptionToken

–noSetH

ierarchy–

badArgum

ent

1.12.0

GetRecord

•A

rguments

–identifier (REQ

UIRED

)–

metadataPrefix

(REQU

IRED)

•Errors–

id does not exist–

metadata form

at cannotbe dissem

inated

•A

rguments

–identifier (REQ

UIRED

)–

metadataPrefix (REQ

UIRED

)

•Errors–

badArgum

ent–

cannotDissem

inateFormat

–idD

oesNotExist

1.12.0

Argum

ent Summ

aryidentifier

resumptionToken

setuntil

fromm

etadataPrefix

48

88

84

GetRecord

8exclusive

optionaloptional

optional4

ListRecords

8exclusive

optionaloptional

optional4

ListIdentifiers

8exclusive

88

88

ListSets

optional8

88

88

ListMetadata

Formats

88

88

88

Identify

Error Summ

ary

IDD

NE

IDD

NE

CDF

BAGetRecord

NS

HN

RMCD

FBRT

BAListRecords

NS

HN

RMCD

FBRT

BAListIdentifiers

NS

HBRT

BAListSets

NM

FBA

ListMetadata

Formats

BAIdentify

Generate badVerb on any input not matching the 6 defined verbs

this is an inversion of the table in section 3.6 of the OA

I-PMH

specification

RepositoryIm

plemenation Guidelines

Minim

al Repository•

2.0 provides many expressive, but optional

features–

but still low barrier!

•if you are writing your own repository software,the quickest path to im

plementation can involve

initially:–

only supporting DC

–skipping: <about>, sets, com

pression–

skip flow control (resumptionTokens) if < 1000 item

s

•add optional features as requirem

ents andfam

iliarity allows

Be Honest with datestam

p!•

a change in the process of dynamic generation of a

metada form

at really does mean all records have

been updated!

if (internalItemDatestamp > disseminationInterfaceDatestamp) {

datestamp = internalItemDatestamp

} else {

datestamp = disseminationInterfaceDatestamp

}

Not H

iding Updates

•O

AI-PM

H is designed to allow increm

entalharvesting

•U

pdates must be available by the end of the

period of the datestamp assigned, i.e.

–D

ay granularity => during same day

–Seconds granularity => during sam

e second

•Reason: harvesters need to overlap requests byjust one datestam

p interval (one day or onesecond)–

in 1.x, 2 intervals were required (in many circum

stances)

State in resumptionTokens

•H

TTP is stateless•

resumptionTokens allow state inform

ation to bepassed back to the repository to create acom

plete list from sequence of incom

plete lists•

EITHER – all state in resum

ptionToken•

OR – cache result set in repository

Caching the Result Set•

Repository caches results of initial request,returns only incom

plete list•

resumptionToken does not contain all state

information, it includes:

–a session id

–offset inform

ation, necessary for idempotency

•resum

ptionToken allows repository to returnnext incom

plete list•

increased complexity due to cache m

anagement

–but a potential perform

ance win

All State in the

resumptionToken

•A

rrange that remaining item

s/headers in complete list

response can be specified with a new query and encode thatin resum

ptionToken•

One sim

ple approach is to return items/headers in id order

and make the new query specify the sam

e parameters and

the last id return (or by date)–

simple to im

plement, but possibly longer execution tim

es

•Can encode param

eters very simply:

<resumptionToken>metadataPrefix=oai_dc

from=1999-02-03&until=2002-04-01&

lastid=fghy:45:123</resumptionToken>

resumptionToken attributes (1)

•expirationDate – likely to be useful when cache

clean-up schedule is known–

Do not specify e

xpirationDate if all state in

resumptionToken

•badResumptionToken error to be used if

resumptionToken expired

–M

ay also be used if request cannot be completed for

some other reason

•e.g.: if repository changes cause the incom

plete list to haveno records

resumptionToken attributes (2)

•completeListSize and c

ursor optionally

provide information about size of com

pletelist and num

ber of records so fardissem

inated–

use consistently if used–

designed for status monitoring

–caveat harvester:

completeListSize m

ay beapproxim

ate and may be revised

resumptionToken

The only defined use of resumptionToken is as follows:

•a repository must include a resum

ptionToken element as

part of each response that includes an incomplete list;

•in order to retrieve the next portion of the complete list,!

the next request must use the value of that

resumptionToken elem

ent as the value of theresum

ptionToken argument of the request;

•the response containing the incomplete list that

completes the list m

ust include an empty resum

ptionTokenelem

ent;

Flow Control & Load Balancing

How to respond to a “bad” harvester:

1.H

TTP status code 200; response to OA

I-PMH

requestwith a resum

ptionToken.2.

HTTP status code 503; with the Retry-A

fter headerset to an appropriate value if subsequent requestfollows too quickly or if the server is heavily loaded.

3.H

TTP status code 403; with an appropriate reasonspecified if subsequent requests do not adhere toRetry-A

fter delays.

302 Load Balancing•

Interactive users on main D

L machine should not be

impacted by m

etadata harvesting–

don’t take deliveries through the front door–

not part of the protocol; defined outside the protocol

OA

IServer

naca.larc.nasa.gov/oai/

if load > 0.50

redirect request

OA

IServer

buckets.dsi.internet2.edu/naca/oai/

harvesterhttp://blah/oai/?verb=L

istIdentifiers&m

etadataPrefix=oai_dc

HT

TP Status C

ode 302

http://blah/oai/?verb=ListIdentifiers&

metadataPrefix=oai_dc

<?xml version=“1.0” encoding=“U

TF-8”?>

…<ListIdentifiers>

…</ListIdentifiers>

DN

S Load Balancing

•using a D

NS rotor, establish

–a.foo.org, b.foo.org, c.foo.org

–each with a synchronized copy of therepository

–let D

NS & chance distribute the load

Load Balancing Caveats•

Copies of the repository must be synchronized

–(cf. Pande, et al. JCD

L 02)

•Com

plex hierarchies are possible–

programm

er must insure no cycles in redirection graphs!

•The baseU

RL in the reply must always point to the

original repository, not the repository thateventually answered the request

Error Handling: Verbosity

More is better…

<error code="badArgument">Illegal argument ‘foo’</error>

<error code="badArgument">Illegal argument ‘bar’</error>

is preferred over:

<error code="badArgument">Illegal arguments ‘foo’, ‘bar’</error>

which is preferred over:

<error code="badArgument">Illegal arguments</error>

Error Handling: Levels

•the O

AI-PM

H error / exception conditions are for

OA

I-PMH

semantic events

•they are not for situations when:–

the database is down–

a record is malform

ed•

remem

ber: record = id + datestamp + m

etadataPrefix•

if you’re missing one of those, you don’t have an O

AI record!

–and other conditions that occur outside the O

AI scope

•use http codes 500, 503 or other appropriate values to indicatenon-O

AI problem

s

Error Handling: Extensions

•A

rguments that are not 'required', 'optional' or

'exclusive’ are 'illegal' and should generatebadArgument errors.

•If you want to extend the O

AI-PM

H:

–stop and consider: do you really need to?

•m

aybe you should have different OA

I-PMH

interfaces, or creativem

etadata formats

–if you really, really want to, tunnel your extensions through the“set” feature

•see http://www.dlib.org/dlib/decem

ber01/suleman/12sulem

an.html for

examples

Idempotency of “List”Requests (1)

•Purpose is to allow harvesters to recover from

lostresponses or crashes without starting a largeharvest from

scratch•

Recover by re-issuing request usingresumptionToken from

previous request•

IMPLICA

TION

: harvester must accept both the

most recent r

esumptionToken issued and the

previous one

Idempotency of “List”Requests (2)

•response to a re-issued request m

ust contain all unchangedrecords

•any changed records will get new datestam

ps after time of

initial request•

changes will be picked up by subsequent harvest if notincluded

[no experience yet with incomplete responses to ListSets or

ListMetadataForm

ats requests]

Case Study: “bucket” based repositories

•Buckets: see N

elson & Maly, CA

CM 44(5)

•2.0–

LTRS - techreports.larc.nasa.gov/ltrs/oai2.0/ (file system, refer)

–N

ACA

- naca.larc.nasa.gov/oai2.0/ (file system, refer)

•1.1–

LTRS - techreports.larc.nasa.gov/ltrs/oai/–

NA

CA - naca.larc.nasa.gov/ltrs/oai/

–O

pen Video - www.open-video.org/oai/ (MySQ

L, local)–

JTRS - ston.jsc.nasa.gov/collections/TRS/oai (MS A

ccess dump, local)

–GLTRS (filesystem

, HTM

L scraping)•

Characteristics:–

resumptionToken support initially skipped; added later (all)

•highly encoded rT’s: [2001-01-01!!!!301!600]

–sets initially skipped, added later (LTRS)

–initially had load balancing with 2 N

ACA

repositories…

Case Study: “bucket” based repositories

•in bucket term

inology:–

6 OA

I verbs (methods) added to the existing list

of methods

•http://naca.larc.nasa.gov/oai2.0/?m

ethod=list_methods

•http://naca.larc.nasa.gov/oai2.0/?m

ethod=list_source&target=ListId

entifiers

–a data elem

ent is added to the bucket that containsthe specifics of the particular repository and itsm

etadata format

•http://naca.larc.nasa.gov/oai2.0/?m

ethod=display&pkg_nam

e=oai&elem

ent_name=oai.pl

Harvester

Implem

entation Guidelines

Be a Polite OA

I Neighbor

•Re-use existing free harvesters rather thanwriting your own–

http://www.openarchives.org/tools/index.html

•Read h

ttp://www.robotstxt.org/wc/robots.html if

you insist on writing your own

•Provide m

eaningful User-Agent & F

rom http

headers–

these values can be configured even if using someone

else’s harvester

Listen to the Repository!•

Check Identify’s <granularity> elem

ent if you wish touse finer than YYYY-M

M-D

D•

If you harvest with sets, remem

ber that “:” indicateshierarchies–

harvesting “a” will recursively harvest “a:b”, “a:b:c”, and “a:d”

•Check for and handle non-200 http status codes (503,302, etc.)

•em

pty resumptionToken => end of com

plete list•

consider asking for compressed responses if the

repository supports them…

How to Grab Everything

•Issue an Identify request to find the finest datestam

p granularitysupported.

•Issue a ListM

etadataFormats request to obtain a list of all

metadataPrefixes supported.

•H

arvest using ListRecords requests for each metadataPrefix supported.

Knowledge of the datestamp granularity allows for less overlap if

granularities finer than a day are supported.•

Set structure can be inferred from the setSpec elem

ents in the headerblocks of each record returned (consistency checks are possible).

•Item

s may be reconstructed from

the constituent records. Localdatestam

ps must be assigned to harvested item

s.•

Provenance and other information in <

about> blocks m

ay be re-assem

bled at the item level if it is the sam

e for all metadata form

atsharvested. H

owever, this information m

ay be supplied differently fordifferent m

etadata formats and m

ay thus need to be store separatelyfor each m

etadata format.

Harvesting v1.1 and v2.0

•N

ot difficult tohandle both cases,test with I

dentify:

–v1.1: <

Identify>

<protocolVersion>

–v2.0 <

OAI-PMH>

<Identify>

<protocolVersion>

•N

ote also differenterror andexception handling

•M

any similarities,

harvesters canshare lots of code

Harvesting D

emo

•H

arvester written in Perl•

Handles v1.0, v1.1 and v2.0

•U

ses LWP,

Expat and X

ML::Parser (no schem

avalidation)

•U

TF-8 conditioning to deal with some “im

perfect”repositories

•Sequence of requests: I

dentify,

ListMetadataFormats, ListSets then

ListRecords/ListIdentifiers

•Support for increm

ental harvesting, usesresponseDate from

last harvest to get new startdatestamp

Harvesting logs

•A

lan Kent’s v2.0 harvester logs:http://www.inquirion.com:8123/public/collList;collListCmd=

list

•A

lan Kent’s summ

ary of v1.1 harvesting resultshttp://www.mds.rmit.edu.au/~ajk/oai/interop/summary.htm

•Celestial v1.1 harvesting logshttp://celestial.eprints.org/cgi-bin/status

•D

P9 gateway using arc harvested information

http://arc.cs.odu.edu:8080/dp9/index.jsp

NA

SA <friends> exam

ple (1)•

A light weight, D

P-centric method

to comm

unicate the existence of“others”

http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify

..<description>

<friends ..nam

espace stuff..>

<baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>

<baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>

<baseURL>http://horus.riacs.edu/perl/oai/</baseURL>

<baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>

</friends>

</description>

..

<friends>…</friends/

http://techreports.larc.nasa.gov/ltrs/oai2.0/ http://naca.larc.nasa.gov/oai2.0/

http://ntrs.nasa.gov/oai2.0/

http://ston.jsc.nasa.gov/collections/TRS/oai/

http://horus.riacs.edu/perl/oai/

harvester

Identify

NA

SA <friends> exam

ple (2)

Case Study: arXiv (1)

•Existing system

, running >10years. Written

mostly in Perl

•Flat file system

for ‘database’•

200k papers with metadata in hom

ebrewform

at•

~200 updates/day. OA

I repository just oneview of system

, must integrate with daily

update schedule

Case Study: arXiv (2)

•W

rite in Perl–

Easy integration with rest of system, reuse

code from v1.0/v1.1 interface

–U

se libwww; XML::DOM

•D

aily rebuild of datestamp database

–N

o existing date in system appropriate

–Base on U

nix cdate of m

etadata files

•O

n-the-fly metadata translation

–Straightforward, avoids data duplication

Case Study: arXiv (3)

•Flow control to avoid loading server and to avoidharvesters tripping robot alarm

s–resumptionTokens to lim

it response size (500-1000records or 5k-10k identifiers / response)

–503 R

etry-After replies based on client ip

•Im

plement r

esumptionTokens that include all

state–

Avoid need to cache result sets / clean cache

Aggregator / Cache / Proxy

Implem

entation Guidelines

<provenance> & datestam

ps

•rem

inder: datestamps are local to the

repository, and an re-exportingservice m

ust use new localdatestam

ps•

such services should use the<provenance> container to preserve

the original datestamps

Identifiers are Local

•identifiers are local to the repository

•unless you absolutely did not change them

etadata and the identifier correspondsto a recognized U

RI scheme, use a new

identifier upon re-exporting–

use the <provenance> container to preserve

the harvesting history

Derived from

the same item

?3 ways to determ

ine if records share provenancefrom

the same item

:

1.both records have the sam

e identifier and thebaseU

RL in the request elements of the O

AI-PM

Hreponses which include the record are the sam

e;2.

both records have the same identifier and that

identifier belongs to some recognized U

RI scheme;

3.the provenance containers of both records have thesam

e entries for both the identifier and baseURL;

<provenance> exam

ple (1)

<?xml version="1.0" encoding="UTF-8"?>

<GetRecord ..nam

espace stuff..> <responseDate>2002-02-08T08:55:46.1</responseDate>

<request verb="GetRecord" metadataPrefix="odd_fmt"

identifier="oai:odd.oa.org:z1x2y3">http://odd.oa.org</request>

<record>

<header>

<identifier>oai:odd.oa.org:z1x2y3</identifier>

<datestamp>1999-08-07T06:05:04Z</datestamp> </header>

<metadata> ...metadata record in odd_fmt... </metadata>

</record>

</GetRecord>

Consider a request from crosswalker.oa.org:

http://odd.oa.org?verb=GetRecord

&identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt

and the following response from odd.oa.org:

Imagine that c

rosswalker.oa.org cross-walks

harvested metadata from

odd_fmt into o

ai_marc and

then re-exposes the metadata with new identifiers.

A request from

getmarc.oa.org:

http://crosswalker.oa.org?verb=GetRecord

&identifier=oai:cw.oa.org:z1x2y3

&metadataPrefix=oai_marc

might then yield the following response from

crosswalker.oa.org:

<provenance> exam

ple (2)

<provenance> exam

ple (3)<record>

<header> <identifier>oai:cw.oa.org:z1x2y3</identifier>

<datestamp>2002-02-09T01:15:43Z</datestamp>

</header>

<metadata> ...metadata record in oai_marc... </metadata>

<about>

<provenance ..nam

espace stuff..>

<originDescription harvestDate="2002-02-08T08:55:46Z“

altered="true">

<baseURL>http://odd.oa.org</baseURL>

<identifier>oai:odd.oa.org:z1x2y3</identifier>

<datestamp>1999-08-07T06:05:04Z</datestamp>

<metadataNamespace>http://odd.oa.org/odd_fmt</..>

</originDescription>

</provenance>

</about>

</record>

<provenance> exam

ple (4)

This oai_marc record is then re-

exposed by getmarc.oa.org with the

same identifier o

ai:cw.oa.og:z1x2y3

(because the record has not beenaltered).

The associated <provenance> container

might be:

<provenance> exam

ple (5)<record>

<header> <identifier>oai:cw.oa.org:z1x2y3</identifier>

<datestamp>2002-03-01T01:46:11Z</datestamp> </header>

<metadata> ...metadata record in oai_marc... </metadata>

<about>

<provenance ..nam

espace stuff..> <originDescription harvestDate=“2002-03-01T01:23:45” altered=“false”> <baseURL>http://crosswalker.oa.org/<baseURL> <identifier>oai:cw.oa.org:z1x2y3</identifier> <datestamp>2002-02-09T01:15:43Z</datestamp> <metadataNamespace>http://../oai_marc</..>

<originDescription harvestDate="2002-02-08T08:55:46Z” altered="true">

<baseURL>http://odd.oa.org</baseURL>

<identifier>oai:odd.oa.org:z1x2y3</identifier>

<datestamp>1999-08-07T06:05:04Z</datestamp>

<metadataNamespace>http://odd.oa.org/odd_fmt</..>

</originDescription>

</originDescription>

</provenance>

</about>

</record>

oai-identifier

•not com

patible with v1.1 oai-identifier–

repositoryNam

e now domain nam

e based–

not reliant upon OA

I centralizedregistration

•one-to-one m

apping for escapingcharacters: %

3F allowed, %3f not

–allows sim

ple comparison

looking ahead:novel uses of O

AI-PM

H

•Registry of m

etadata formats for

OpenU

RL–

http://www.sfxit.com/openurl/

–http://lib-www.lanl.gov/~herbertv/papers/icpp02-draft.pdf

•other uses?

linkingservers

registrars

XM

L Schema

URL1

XM

L Schema

URL2

XM

L Schema

URLn

registrationpollingO

AI-PM

H harvesting

centralrepositoryOAI-PMH

Poll

regis.

userservice

registry model

registrars

XM

L Schema

URL1

XM

L Schema

URL2

XM

L Schema

URLn

Goal:• inform

linking servers re Schema

• ease of admin for all parties involved

• limit hum

an overhead

registrars

XM

L Schema

URL1

XM

L Schema

URL2

XM

L Schema

URLn

registration centralrepositoryregis.

Registry:• schem

aLocation• registration date• m

irror of Schema

registrars

XM

L Schema

URL1

XM

L Schema

URL2

XM

L Schema

URLn

registration centralrepositoryregis.

polling

PollPoll:• fetch schem

a at schemaLocation

• log failure/success• com

pare fetched Schema with m

irror• changed => replace m

irror• rem

oved => deregistered

registrars

XM

L Schema

URL1

XM

L Schema

URL2

XM

L Schema

URLn

registrationpolling

centralrepositoryOAI-PMH

Poll

regis.

OA

I repo:• record-ids = schem

aLocation• oai_dc record :

• registration info• (de)registration datestam

p• xsi record :

• mirror schem

a• schem

a update datestamp

• poll record : • process info• recent poll datestam

p

linkingservers

registrars

XM

L Schema

URL1

XM

L Schema

URL2

XM

L Schema

URLn

registrationpollingO

AI-PM

H harvesting

centralrepositoryOAI-PMH

Poll

regis.

userservice