Some output still over two CSV rows for multiple matches on a format #75

ross-spencer · 2016-03-15T04:48:05Z

Hi Richard,

Here's an example XSL snippet/document I have that's still outputting to CSV over two rows. I'm using a sig built with PRONOM, Tika, and FreeDesktop, plus -nopriority.

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0" 
     xmlns:dc="http://purl.org/dc/elements/1.1/" 
     xmlns:dcterms="http://purl.org/dc/terms/"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  >

     <xsl:template match="/">
        <html>
           <head>
              <title>Title</title>
           </head>

              <body>

           </body>
        </html>         
     </xsl:template>
  </xsl:stylesheet>

That seems to be contrary to the change log, but I'm not sure what the desired behavior of the CSV output is meant to be.

Ross


richardlehane · 2016-03-15T05:09:00Z

Does yours look something like this? example.xlsx
... interesting sidenote - GitHub allows xlsx attachments but not csv!

There looks to be a minor bug here with the repetition of the pronom ns but otherwise it is outputting as expected.

The change in the changes notes was to give a separate set of columns for each identifier but not each identification. I.e. the old output would have had even more rows and just the first 11 columns, the new output gives additional columns for the two additional identifiers (tika and freedesktop) with rows that total the maximum identifications from any single identifier + padding for the other identifiers where they give fewer results.

Does that make sense? Otherwise, in multiple identification scenarios it would be impossible to know ahead of time how many columns would be in the CSV.
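
A minimal sketch of the layout described above, not siegfried's actual code: each identifier gets its own column group, the number of rows equals the largest number of identifications from any one identifier, and the others are padded with empty cells. The function name, the single-cell-per-identification simplification, and the placeholder values (`p1`, `t1`, and so on) are all assumptions for illustration.

```python
from itertools import zip_longest


def pad_rows(path, results):
    """Lay identifications side by side, one column per identifier.

    `results` maps an identifier name (e.g. "pronom") to the list of
    matches it returned for one file. Hypothetical helper: real output
    has several columns per identifier, collapsed here to one cell.
    """
    names = list(results)  # fixed column order
    rows = []
    # zip_longest pads the shorter lists so every row has the same width
    for combo in zip_longest(*(results[n] for n in names), fillvalue=""):
        rows.append([path, *combo])
    return names, rows


# Placeholder values only; these are not real format identifiers.
names, rows = pad_rows(
    "example.xsl",
    {"pronom": ["p1", "p2"], "tika": ["t1"], "freedesktop": ["f1"]},
)
for row in rows:
    print(row)
```

The column count depends only on how many identifiers are loaded, while the row count varies per file, which is why a file can still span two rows.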

I had a thought in the shower today that it might be worth adding a -nomulti (no multiple identifications) flag to roy. This would force identifiers to only return a single result and when they encounter multi identifications would return UNKNOWN instead, plus a descriptive warning giving the possible matches. Would this help in your use case?

ross-spencer · 2016-03-15T23:23:49Z

Hi Richard,

Looking at your reply, it seems it wasn't a bug:

But it does seem like you found one!

I like -nomulti as it normalizes the shape of the CSV output if we're using CSV. Could the label be 'MULTI' or 'MANY' or similar instead of 'UNKNOWN'?

I just spoke with the team here, and it sounds promising so if put into the dev branch I'd output some samples and get them tested too.

I'll be trying to work with your YAML output for that slight potential of multiple identifications that I've been pushing SF to return. I think the YAML is a bit more intuitive, but I would like to offer CSV analysis too. It'd be good to hear @timothyryanwalsh's thoughts too as he's working with the CSV output already.

richardlehane · 2016-03-16T01:12:14Z

Could definitely have a new MULTIPLE return value instead of UNKNOWN - like that!

Rather than introducing a -nomulti flag, another possibility is to just change the behaviour of the -nopriority flag (and rename it to -multi).

At the moment, multiple identifications are only returned where you get multiple IDs with the same priority/weight. This is actually quite rare (esp. for PRONOM, which has a fairly complete set of priority relations between formats). The effect of -nopriority is to ignore weight altogether so you get many more IDs (e.g. a DOCX and a ZIP for a DOCX file).

Introducing a -nomulti flag alongside the -nopriority flag might be confusing for users as the difference between them would be quite subtle and would only be apparent in rare cases.

The effect of the original proposal would be:

1. if -nopriority is set, return all possible IDs regardless of weight
2. if -nopriority is unset and -multi is set (default), apply priority rules and only return multiple IDs where those IDs have the same weight (this is the current behaviour)
3. if -nopriority is unset and -multi is unset, apply priority rules, return a single result if possible, if multiple IDs have same weight, return MULTIPLE with a warning (this would be new and would happen quite rarely)
4. if -nopriority is set and -multi is unset, don't do this! Makes no sense.

A simpler approach might be just to rename -nopriority to -multi and have two rules:

If -multi is set, return an exhaustive set of all possible results ignoring priority/weight
If -multi is unset, return a single result if possible. If the multiple results have the same weight, report those in a warning and return MULTIPLE
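
As a rough sketch of those two rules (hypothetical Python, not how sf implements it): `resolve`, the `(format_id, weight)` pairs, the assumption that a higher weight wins, and the UNKNOWN result for an empty candidate list are all illustrative choices.

```python
def resolve(candidates, multi):
    """Apply the two proposed rules to one file's candidate matches.

    `candidates` is a list of (format_id, weight) pairs. Returns a
    (results, warning) tuple; warning is None when there is nothing
    to report.
    """
    if not candidates:
        # Assumption: no candidates at all is reported as UNKNOWN.
        return ["UNKNOWN"], None
    if multi:
        # Rule 1: exhaustive set of results, ignoring weight.
        return [fid for fid, _ in candidates], None
    # Rule 2: single result if possible (assumes higher weight wins).
    top = max(weight for _, weight in candidates)
    best = [fid for fid, weight in candidates if weight == top]
    if len(best) == 1:
        return best, None
    # Tie on weight: return MULTIPLE and list the matches in a warning.
    return ["MULTIPLE"], "multiple matches with equal weight: " + ", ".join(best)


print(resolve([("docx", 2), ("zip", 1)], multi=True))
print(resolve([("docx", 2), ("zip", 1)], multi=False))
print(resolve([("a", 1), ("b", 1)], multi=False))
```

With -multi unset, every file yields exactly one result value, so the CSV keeps one row per file and ties are surfaced through the warning.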

Advantage of the first approach is it retains current behaviour (item (2) in the first list), but at the cost of complexity & would users realistically want all those choices? Or are the two latter choices the only ones users would really care about?

ross-spencer · 2016-03-17T08:52:27Z

Trying to work through this, the latter two behaviours seem to be the most intuitive to me. I don't think the flag name needs to be changed. -nopriority works well, it's understandable.

I'm a bit torn on any change, as the benefit seems to be for the SF CSV, where having a file's details across two rows is ever so slightly more difficult to handle - one has to filter path for unique values first, and then display all identification results.

I think that's where the real benefit comes in. Removing the repetition on rows and making all the data available in a single field as a warning.

Does that help your thinking too?
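
For illustration, the extra handling step mentioned above (collecting the unique paths first, then showing every identification for each) could look something like this. It is only a sketch: the `filename` column name, the assumption that the path is repeated on each row, and the sample data are mine, not taken from real SF output.

```python
import csv
import io
from collections import defaultdict


def group_by_file(csv_text, path_column="filename"):
    """Group CSV rows by file path so each file's results sit together."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[path_column]].append(row)
    return dict(groups)


def multi_row_files(groups):
    """Return the paths that span more than one CSV row."""
    return [path for path, rows in groups.items() if len(rows) > 1]


# Made-up sample data with placeholder ids.
sample = "filename,id\na.xsl,id-1\na.xsl,id-2\nb.txt,id-3\n"
groups = group_by_file(sample)
print(multi_row_files(groups))  # prints ['a.xsl']
```

If each file were guaranteed a single row, the grouping pass would be unnecessary and the rows could be consumed directly.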

tw4l · 2016-04-08T00:38:48Z

Hi Richard and Ross! Jumping in a bit late here (still working through all the vacation email). I'm in favor of Richard's last suggestion to rename -nopriority to -multi and simplify to the two rules. It seems like the simplest solution, and like Richard I am also struggling to think of a practical use case for when having both flags would be desirable.

I'm also on board with Ross' last comment - removing repetition seems key for processing the CSV outputs, and if multiple identification with equal weight consistently throws a warning and a value of MULTIPLE, it'll be easy to flag these for review.

richardlehane · 2016-04-08T00:49:34Z

Thx for chiming in Tim. I'll aim to get a small 1.5.1 release out in the next few weeks and will include this change.

richardlehane · 2016-06-26T11:57:07Z

sf 1.6.0 has new -multi flag.

This has 5 levels -multi 0 through to -multi 4:
roy build -multi 0 forces single matches only, and gives UNKNOWN with a descriptive warning if it can't give a single definitive result. This can be used to force CSVs to have a single line per file.
roy build -multi 1 is the current default behaviour which will only return multiple results if they all have the same strength
roy build -multi 2 will return all strong results (i.e. byte/xml/container and not extension/mime only) so long as this doesn't slow down matching
roy build -multi 3 is same as 2 but it can slow down matching to return all strong results
roy build -multi 4 returns ALL results - strong and weak (i.e. extension only).

richardlehane added the enhancement label Apr 8, 2016

richardlehane self-assigned this Apr 8, 2016

richardlehane added this to the 1.5.1 milestone Apr 8, 2016

richardlehane closed this as completed Jun 26, 2016

tw4l mentioned this issue Jul 8, 2016

Multiple Siegfried CSV lines for single file tw4l/brunnhilde#10

Closed
