Exercises

Download files:

$ wget https://jorol.github.io/2022-elag/files/processing-marc.zip
$ unzip processing-marc.zip
$ cd processing-marc

MARC 21 documentation

Check the MARC 21 documenation of the following fields:

007
020
082
100
245
700

Are data elements positionally defined? Are codes used within fields? Are fields or subfields repeatable?

MARC 21 serializations

Look at the different MARC serialization of the "Code4lib" record (code4lib.*), e.g.:

# on the command-line
$ less code4lib.mrk
# press `q` to quit command

# with a text editor like geany
$ geany code4lib.xml

Documentation of software tools

Check the documentation of the recommended tools:

$ catmandu --help
$ catmandu info
$ marcvalidate --help
$ marcstats.pl --help
# press `q` to quit `man` pages
$ man uconv
$ man yaz-client
$ man yaz-marcdump
$ man xmllint
$ man xsltproc

Download MARC data

... from a Z39.50 server with `yaz-client`

Replace the variable {SUBJECT} with a subject term of your choice.

See ATTRIBUTE SET BIB-1 (Z39.50-1995) for other attribute values.

# open client
$ yaz-client
# connect to database
Z> open lx2.loc.gov/LCDB
# set format to MARC
Z> format 1.2.840.10003.5.10
# set element set
Z> element F
# append retrieved records to file
Z> set_marcdump elag.z3950.mrc
# find record for subject
Z> find @attr 5=100 @attr 1=21 "{SUBJECT}"
# get first 50 records
Z> show 1+50
# close client
Z> exit
# show file content
$ cat -v elag.z3950.mrc

... from a SRU server with `catmandu`

Replace the variable {TITLE} with a journal title of your choice.

See SRU explain for other supported indices.

$ catmandu convert -v SRU \
--base https://services.dnb.de/sru/zdb \
--recordSchema MARC21-xml \
--query 'dnb.tit = {TITLE}' \
--parser marcxml \
to MARC --type XML --pretty 1 > elag.sru.xml
# show file content
$ xmllint --format elag.sru.xml

Validate your data

... with `yaz-marcdump`

$ yaz-marcdump -n elag.z3950.mrc
$ yaz-marcdump -n -i marcxml elag.sru.xml

... with `xmllint`

$ xmllint --noout \
--schema http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd \
elag.sru.xml

... with `marcvalidate`

$ marcvalidate elag.z3950.mrc
$ marcvalidate --type XML elag.sru.xml

Create statistics for your data

... with `marcstats.pl`

$ marcstats.pl elag.z3950.mrc
# marcstats.pl can handle only MARC (ISO 2709) files
$ marcstats.pl <(catmandu convert MARC --type XML to MARC < elag.sru.xml)
# replace spaces with dots & save result in file
$ marcstats.pl --dots -o elag.z3950.stats.txt elag.z3950.mrc
# show statistic
$ less elag.z3950.stats.txt
# press `q` to quit

... with `catmandu`

# "break" MARC record in pieces
$ catmandu convert MARC --type XML to Breaker --handler marc \
< elag.sru.xml > elag.sru.breaker
# generate statistic
$ catmandu breaker elag.sru.breaker
# save statistic as tab-separated TSV file
$ catmandu breaker --as TSV elag.sru.breaker > elag.sru.tsv
$ less elag.sru.tsv
# press `q` to quit

Unicode normalization

# get some data in Unicode normalization form NFD
# pretty print with xmllint
$ catmandu convert -v SRU \
--base https://services.dnb.de/sru/zdb \
--recordSchema MARC21-xml \
--query 'dnb.zdbid = 242095-8' \
--parser marcxml to MARC --type XML \
| xmllint --format - > elag.nfd.xml
# convert to Unicode normalization form NFC
$ uconv -x NFC elag.nfd.xml > elag.nfc.xml
# show difference between files: only lines with umlauts are marked. 
$ diff elag.nfc.xml elag.nfd.xml
# show Unicode code points
$ echo -n 'ö' | uconv -x Any-Name
$ echo -n 'ö' | uconv -x Any-Name

Transform different MARC serializations

... with `yaz-marcdump`

# MARC (ISO 2709) to MARC Line
$ yaz-marcdump -o line elag.z3950.mrc
# MARC XML to Turbomarc
$ yaz-marcdump -i marcxml -o turbomarc elag.sru.xml

... with `catmandu`

# MARC (ISO 2709) to MARC-in-JSON
$ catmandu convert MARC to MARC --type MiJ < elag.z3950.mrc
# MARC XML to MARCMaker
$ catmandu convert MARC --type XML to MARC --type MARCMaker < elag.sru.xml
# MARC to Breaker
$ catmandu convert MARC --type XML to Breaker --handler marc < elag.sru.xml 
# MARC to CSV. mapping with fixes required.
$ catmandu convert MARC --type XML to TSV \
--fix 'marc_map(022a,bibo_issn,join:";");marc_map(245abc,dc_title,join:" ");remove_field(record)' \
< elag.sru.xml

... with `xlstproc`

# MARC XML to MODS
$ xsltproc MARC21slim2MODS3-7.xsl elag.sru.xml
# MARC XML to RFD-DC
$ xsltproc MARC21slim2RDFDC.xsl elag.sru.xml
# MARC XML to BIBFRAME
$ xsltproc bibframe-xsl/marc2bibframe2.xsl elag.sru.xml

Extract data from MARC records

... with `xmllint`

Get a list of all MARC tags:

$ xmllint --xpath '//@tag' elag.sru.xml | sort | uniq -c

Get all IDs from MARC 001:

$ xmllint --xpath '//*[local-name()="controlfield"][@tag="001"]/text()' elag.sru.xml

Get all MARC 245 subfields:

$ xmllint --xpath '//*[local-name()="datafield"][@tag="245"]' elag.sru.xml

Get title from MARC 245$a:

$ xmllint --xpath '//*[local-name()="datafield"][@tag="245"]/*[local-name()="subfield"][@code="a"]/text()' elag.sru.xml

Get all ISSN from MARC 022$a:

$ xmllint --xpath '//*[local-name()="datafield"][@tag="022"]/*[local-name()="subfield"][@code="a"]/text()' elag.sru.xml

Extract all DDC numbers from MARC 082$a:

$ xmllint --xpath '//*[local-name()="datafield"][@tag="082"]/*[local-name()="subfield"][@code="a"]/text()' elag.sru.xml

... with `catmandu`

Use Catmandu::Breaker in combination unix utilities like grep, cut, sort and uniq:

# check the Breaker output format
$ catmandu convert MARC --type XML to Breaker --handler marc < elag.sru.xml
# get all ISSNs from MARC 022$a
$ catmandu convert MARC --type XML to Breaker --handler marc < elag.sru.xml | \
grep -P '\t022a\t' | cut -f 3 | sort
# get all uniq DDC form MARC 082$a
$ catmandu convert MARC --type XML to Breaker --handler marc < elag.sru.xml | \
grep -P '\t082a\t' | cut -f 3 | sort | uniq -c

Use the MARCMaker format in combination with grep:

# get all MARC 022 fields
$ catmandu convert MARC --type XML to MARC --type MARCMaker < elag.sru.xml | \
grep -P '^=022'
# get all MARC 245 fields with indicators 0  
$ catmandu convert MARC --type XML to MARC --type MARCMaker  < elag.sru.xml | \
grep -P '^=245  00'

Use fix marc_map to extract and map data from MARC records:

$ catmandu convert MARC --type XML to TSV \
--fix 'marc_map(001,dc_identifier);marc_map(020a,bibo_isbn,join:",");marc_map(022a,bibo_issn,join:",");marc_map(245a,dc_title);remove_field(record)' \
--fields dc_identifier,bibo_isbn,bibo_issn,dc_title \
< elag.sru.xml

Extract the language code from MARC 008 and lookup the language:

$ catmandu convert MARC --type XML to TSV \
--fix 'marc_map(008/35-37,dc_language);lookup(dc_language,dict_languages.csv,delete:1);remove_field(record)' \
< elag.sru.xml

Extract fields with specific indicators:

$ catmandu convert MARC --type XML to TSV \
--fix 'marc_map("246[1,4]",marc_varyingFormOfTitle);remove_field(record)' \
--fields _id,marc_varyingFormOfTitle \
< loc.mrc.xml

Extract several subfields with certain codes:

# as string
$ catmandu convert MARC --type XML to JSON --pretty 1 \
--fix 'marc_map(245abc,dc_title,join:" ");remove_field(record)' \
< loc.mrc.xml
# as array
$ catmandu convert MARC --type XML to JSON --pretty 1 \
--fix 'marc_map(245abc,dc_title,split:1);remove_field(record)' \
< loc.mrc.xml
# as array in certain order
$ catmandu convert MARC --type XML to JSON --pretty 1 \
--fix 'marc_map(245cba,dc_title,split:1,pluck:1);remove_field(record)' \
< loc.mrc.xml

Extract data from repeatable fields:

# create an array for subjects
$ catmandu convert MARC --type XML to JSON --pretty 1 \
--fix 'marc_map(650a,dc_subject,split:1);remove_field(record)' \
< loc.mrc.xml

Extract data from repeatable fields with repeatable subfields:

# create one array for index terms
$ catmandu convert MARC --type XML to JSON --pretty 1 \
--fix 'marc_map(655ay,marc_indexTermGenre,split:1);remove_field(record)' \
< loc.mrc.xml
# create an array of arrays for index terms, one array for each MARC field.
$ catmandu convert MARC --type XML to JSON --pretty 1 \
--fix 'marc_map(655ay,marc_indexTermGenre,split:1,nested_arrays:1);remove_field(record)' \
< loc.mrc.xml

Extract subfields depending on another subfield value:

$ catmandu convert MARC --type XML to JSON --pretty 1 \
--fix '
do marc_each()
  if marc_match(856x,EZB)
    marc_map(856u,ezb_uri)
  end
end;
remove_field(record)' \
< code4lib.xml

Assign a new value to (sub)fields:

$ catmandu convert MARC --type XML to MARC --type MARCMaker \
--fix 'marc_set(001,123456789);marc_set(245a,"My Title");marc_set(245b," ... added by fix.")' \
< elag.sru.xml

Remove (sub)fields from records:

$ catmandu convert MARC --type XML to MARC --type MARCMaker \
--fix 'marc_remove(003);marc_remove(6..);marc_remove(856xz);' \
< elag.sru.xml

Add a new MARC 999 field to all records:

Create a UUID with fix uuid and add it as subfield $b to the new MARC field.

$ catmandu convert MARC --type XML to MARC --type MARCMaker \
--fix 'uuid(uuid);marc_add(999,a,"Local UUID",b,$.uuid)' \
< elag.sru.xml

Replace strings in (sub)fields:

$ catmandu convert MARC --type XML to MARC --type MARCMaker \
--fix 'marc_replace_all(856u,"^http://","https://")' \
< elag.sru.xml

Filter MARC records:

$ catmandu convert -v MARC --type XML to MARC --type MARCMaker \
< loc.mrc.xml
$ catmandu convert -v MARC --type XML to MARC --type MARCMaker \
--fix 'select marc_match(245a,Perl)' \
< loc.mrc.xml

Validate MARC records:

# catch errors
$ catmandu convert -v MARC --type XML to JSON --pretty 1 \
--fix 'validate(.,MARC,error_field: errors);remove_field(record)' \
< loc.mrc.xml
# filter valid records
$ catmandu convert -v MARC --type XML to MARC --type MARCMaker \
--fix 'select valid(.,MARC)' \
< loc.mrc.xml

Extract and normalize IBSNs, keep only uniques:

$ catmandu convert -v MARC --type XML to JSON --pretty 1 \
--fix 'marc_map(020a,bibo_isbn,split:1); replace_all(bibo_isbn.*,"\s.*$","");isbn13(bibo_isbn.*);uniq(bibo_isbn);remove_field(record);' \
< loc.mrc.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exercises.md

Exercises.md

Exercises

MARC 21 documentation

MARC 21 serializations

Documentation of software tools

Download MARC data

... from a Z39.50 server with `yaz-client`

... from a SRU server with `catmandu`

Validate your data

... with `yaz-marcdump`

... with `xmllint`

... with `marcvalidate`

Create statistics for your data

... with `marcstats.pl`

... with `catmandu`

Unicode normalization

Transform different MARC serializations

... with `yaz-marcdump`

... with `catmandu`

... with `xlstproc`

Extract data from MARC records

... with `xmllint`

... with `catmandu`

Files

Exercises.md

Latest commit

History

Exercises.md

File metadata and controls

Exercises

MARC 21 documentation

MARC 21 serializations

Documentation of software tools

Download MARC data

... from a Z39.50 server with yaz-client

... from a SRU server with catmandu

Validate your data

... with yaz-marcdump

... with xmllint

... with marcvalidate

Create statistics for your data

... with marcstats.pl

... with catmandu

Unicode normalization

Transform different MARC serializations

... with yaz-marcdump

... with catmandu

... with xlstproc

Extract data from MARC records

... with xmllint

... with catmandu

... from a Z39.50 server with `yaz-client`

... from a SRU server with `catmandu`

... with `yaz-marcdump`

... with `xmllint`

... with `marcvalidate`

... with `marcstats.pl`

... with `catmandu`

... with `yaz-marcdump`

... with `catmandu`

... with `xlstproc`

... with `xmllint`

... with `catmandu`