Some output still over two CSV rows for multiple matches on a format #75

Closed
ross-spencer opened this issue Mar 15, 2016 · 7 comments

@ross-spencer
Collaborator

Hi Richard,

Here's an example XSL snippet/document that is still outputting to CSV over two rows. I'm using a signature file built with PRONOM, Tika, and FreeDesktop, and with -nopriority set.

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0" 
     xmlns:dc="http://purl.org/dc/elements/1.1/" 
     xmlns:dcterms="http://purl.org/dc/terms/"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  >

     <xsl:template match="/">
        <html>
           <head>
              <title>Title</title>
           </head>

              <body>

           </body>
        </html>         
     </xsl:template>
  </xsl:stylesheet>

That seems to be contrary to the change log, but I'm not sure what the desired CSV behavior is meant to be.

Ross

@richardlehane
Owner

Does yours look something like example.xlsx?
... interesting sidenote - github allows xlsx attachments but not csv!

There looks to be a minor bug here with the repetition of the pronom ns but otherwise it is outputting as expected.

The change in the change notes was to give a separate set of columns for each identifier, but not for each identification. I.e. the old output would have had even more rows and just the first 11 columns. The new output gives additional columns for the two additional identifiers (tika and freedesktop), with rows that total the maximum number of identifications from any single identifier, plus padding for the other identifiers where they give fewer results.

Does that make sense? Otherwise, in multiple identification scenarios it would be impossible to know ahead of time how many columns would be in the CSV.
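
As a rough illustration of the layout described above, here is a minimal Python sketch (not sf's actual code). The function name `layout_rows`, the two-cell result shape, and the placeholder values are all made up for the example, and it assumes the file's own columns are repeated on each row.

```python
from itertools import zip_longest


def layout_rows(file_columns, results_by_identifier, width):
    """Build CSV rows for one file: one column group per identifier.

    results_by_identifier maps an identifier name to a list of results,
    each result being a list of `width` cells (hypothetical shape).
    """
    blank = [""] * width  # padding for identifiers with fewer results
    groups = list(results_by_identifier.values())
    rows = []
    # Row count equals the largest number of results from any one identifier.
    for combo in zip_longest(*groups, fillvalue=blank):
        row = list(file_columns)  # assumption: file details repeat on each row
        for cells in combo:
            row.extend(cells)
        rows.append(row)
    return rows


# Placeholder data: two results from one identifier, one from each of the others.
rows = layout_rows(
    ["example.xsl"],
    {
        "pronom": [["p1", "format A"], ["p2", "format B"]],
        "tika": [["t1", "format A"]],
        "freedesktop": [["f1", "format A"]],
    },
    width=2,
)
for row in rows:
    print(row)
```

Every row has the same number of columns regardless of how many identifications each identifier returns, which is the point of the padding.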

I had a thought in the shower today that it might be worth adding a -nomulti (no multiple identifications) flag to roy. This would force identifiers to only return a single result and when they encounter multi identifications would return UNKNOWN instead, plus a descriptive warning giving the possible matches. Would this help in your use case?

@ross-spencer
Collaborator Author

Hi Richard,

Looking at your reply, it seems it wasn't a bug:

image

But it does seem like you found one!

I like -nomulti as it normalizes the shape of the CSV output if we're using CSV. Could the label be 'MULTI' or 'MANY' or similar instead of 'UNKNOWN'?

I just spoke with the team here, and it sounds promising, so if it's put into the dev branch I'd output some samples and get them tested too.

I'll be trying to work with your YAML output for that slight potential of multiple identifications that I've been pushing SF to return. I think the YAML is a bit more intuitive, but I would like to offer CSV analysis too. It'd be good to hear @timothyryanwalsh's thoughts too as he's working with the CSV output already.

@richardlehane
Owner

Could definitely have a new MULTIPLE return value instead of UNKNOWN - like that!

Rather than introducing a -nomulti flag, another possibility is to just change the behaviour of the -nopriority flag (and rename it to -multi).

At the moment, multiple identifications are only returned where you get multiple IDs with the same priority/weight. This is actually quite rare (esp. for PRONOM, which has a fairly complete set of priority relations between formats). The effect of -nopriority is to ignore weight altogether so you get many more IDs (e.g. a DOCX and a ZIP for a DOCX file).

Introducing a -nomulti flag alongside the -nopriority flag might be confusing for users as the difference between them would be quite subtle and would only be apparent in rare cases.

The effect of the original proposal would be:

  1. if -nopriority is set, return all possible IDs regardless of weight
  2. if -nopriority is unset and -multi is set (default), apply priority rules and only return multiple IDs where those IDs have the same weight (this is the current behaviour)
  3. if -nopriority is unset and -multi is unset, apply priority rules, return a single result if possible, if multiple IDs have same weight, return MULTIPLE with a warning (this would be new and would happen quite rarely)
  4. if -nopriority is set and -multi is unset, don't do this! Makes no sense.

A simpler approach might be just to rename -nopriority to -multi and have two rules:

  1. If -multi is set, return an exhaustive set of all possible results ignoring priority/weight
  2. If -multi is unset, return a single result if possible. If the multiple results have the same weight, report those in a warning and return MULTIPLE
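
The two rules above can be sketched in Python as follows. This is only an illustration of the proposal, not how sf implements it: `choose_results` is a hypothetical name, and modelling priority as a single numeric weight per candidate is an assumption (PRONOM priorities are relations between formats, not numbers).

```python
def choose_results(candidates, multi):
    """Apply the two proposed rules to (format_id, weight) pairs.

    Returns (results, warning). Higher weight wins (assumed numeric here).
    """
    if not candidates:
        # Assumption: no match at all is still reported as UNKNOWN.
        return ["UNKNOWN"], None
    if multi:
        # Rule 1: exhaustive set of results, ignoring priority/weight.
        return [fid for fid, _ in candidates], None
    # Rule 2: a single result if possible.
    top = max(weight for _, weight in candidates)
    best = [fid for fid, weight in candidates if weight == top]
    if len(best) == 1:
        return best, None
    # Several results share the top weight: report them in a warning.
    return ["MULTIPLE"], "possible matches: " + ", ".join(best)


print(choose_results([("docx", 2), ("zip", 1)], multi=True))
print(choose_results([("docx", 2), ("zip", 1)], multi=False))
print(choose_results([("a", 1), ("b", 1)], multi=False))
```

With -multi unset, each file yields exactly one value, so the CSV keeps one row per file and the tied candidates survive only in the warning text.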

Advantage of the first approach is it retains current behaviour (item (2) in the first list), but at the cost of complexity & would users realistically want all those choices? Or are the two latter choices the only ones users would really care about?

@ross-spencer
Collaborator Author

Trying to work through this, the latter two behaviours seem to be the most intuitive to me. I don't think the flag name needs to be changed. -nopriority works well, it's understandable.

I'm a bit torn on any change, as the benefit seems to be for the SF CSV, where having a file's details across two rows is ever so slightly more difficult to handle - one has to filter path for unique values first, and then display all identification results.

I think that's where the real benefit comes in. Removing the repetition on rows and making all the data available in a single field as a warning.

Does that help your thinking too?
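
For reference, the two-step handling mentioned above (filter the path for unique values, then collect all identification results per file) might look like this in Python. It is a sketch only: the column names `filename` and `id` and the sample rows are assumptions for the example, not taken from real sf output.

```python
import csv
import io


def group_by_path(csv_text, path_column="filename", id_column="id"):
    """Collect every identification for a file under its path."""
    reader = csv.DictReader(io.StringIO(csv_text))
    grouped = {}
    for row in reader:
        # A file spread over several rows ends up with several IDs.
        grouped.setdefault(row[path_column], []).append(row[id_column])
    return grouped


# Made-up sample: example.xsl appears on two rows, notes.txt on one.
sample = "filename,id\nexample.xsl,p1\nexample.xsl,p2\nnotes.txt,p3\n"
grouped = group_by_path(sample)
for path, ids in grouped.items():
    print(path, ids)
```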

@tw4l

tw4l commented Apr 8, 2016

Hi Richard and Ross! Jumping in a bit late here (still working through all the vacation email). I'm in favor of Richard's last suggestion to rename -nopriority to -multi and simplify to the two rules. It seems like the simplest solution, and like Richard I am also struggling to think of a practical use case for when having both flags would be desirable.

I'm also on board with Ross' last comment - removing repetition seems key for processing the CSV outputs, and if multiple identification with equal weight consistently throws a warning and a value of MULTIPLE, it'll be easy to flag these for review.

@richardlehane
Owner

Thx for chiming in Tim. I'll aim to get a small 1.5.1 release out in the next few weeks and will include this change.

@richardlehane richardlehane added this to the 1.5.1 milestone Apr 8, 2016
@richardlehane
Owner

sf 1.6.0 has new -multi flag.

This has 5 levels, -multi 0 through to -multi 4:
roy build -multi 0 forces single matches only, and gives UNKNOWN with a descriptive warning if it can't give a single definitive result. This can be used to force CSVs to have a single line per file.
roy build -multi 1 is the current default behaviour which will only return multiple results if they all have the same strength
roy build -multi 2 will return all strong results (i.e. byte/xml/container and not extension/mime only) so long as this doesn't slow down matching
roy build -multi 3 is same as 2 but it can slow down matching to return all strong results
roy build -multi 4 returns ALL results - strong and weak (i.e. extension only).
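
If you rely on -multi 0 to get a single line per file, a quick sanity check on the resulting CSV could look like the Python sketch below. The function name `files_with_multiple_rows`, the `filename` column name, and the sample rows are assumptions for illustration, not real sf output.

```python
import csv
import io
from collections import Counter


def files_with_multiple_rows(csv_text, path_column="filename"):
    """Return the paths that appear on more than one CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[path_column] for row in reader)
    return sorted(path for path, n in counts.items() if n > 1)


# Made-up sample in which example.xsl still spans two rows.
sample = "filename,id\nexample.xsl,p1\nexample.xsl,p2\nnotes.txt,p3\n"
print(files_with_multiple_rows(sample))
```

An empty list means every file occupies exactly one row.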
