
Why does it take longer to copy a folder when I tell it to *skip* duplicate files than when I tell it to *overwrite*?

This makes no sense.

And why won't it show me which files it's working on that are taking so long to not-copy?

(And why isn't there a compare-after-copy option? How did I end up with ~200 GB of files not being copied on the first attempt, and why isn't there a list of what didn't copy?)

I found my paranoid-file-copier program! The UI seriously needs work, tho...

@woozle if it's a lot of tiny files, it's probably because it's really expensive to crawl the directory tree on both ends and compare (also wow i just tooted that from the news account, good going me)

@kity ...but it has to do that anyway? (note that I'm pretty sure it's not actually comparing file contents -- just names.)

I'm thinking the slowness might be a cognitive illusion caused by the fact that the progress bar is based entirely on bytes copied -- and as long as the filenames match, no bytes are getting copied. (this again would be bad gui design.)
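A minimal sketch of the idea above, assuming "skip duplicates" means "skip when a file of the same name already exists in the target". The function name `copy_skipping_existing` is hypothetical and is not taken from any file manager. It counts files examined separately from bytes copied, so a run that skips everything still reports work done:

```python
import os
import shutil

def copy_skipping_existing(src_root, dst_root):
    """Copy files missing from dst_root.

    Returns (files_seen, files_copied, bytes_copied).
    """
    seen = copied = nbytes = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            seen += 1
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if os.path.exists(dst):
                # Name match only: contents are not compared, no bytes move.
                continue
            shutil.copy2(src, dst)
            copied += 1
            nbytes += os.path.getsize(dst)
    return seen, copied, nbytes
```

A progress bar driven by `files_seen` would keep moving during a long run of skips, whereas one driven only by `bytes_copied` would sit still.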

@kity @woozle A few years ago, I shelled out for a copy of Beyond Compare. It does all the file copy stuffs, as well as giving you a feckton of filtering/comparison/merging tools. I legit cannot do filestuffs anymore without it.

Cross platform too!

@eryn @kity I wrote a thing in PHP that does a lot of this stuff -- compare-after-copy, copy-only-if-mismatch-or-missing... but I'm not sure what I did with it...

...and it needs a GUI, but PHP GUI has been scarce upon the ground. (I should check and see if there are any packages yet; it was all CIY (compile-it-yourself) last I checked.)

@woozle At first glance it looks like you just started a package manager.

What about it classifies as "paranoid" though?

@ThorstenAnzomi By default, it does a byte-by-byte comparison after each copy, and logs any mismatches. If it's in "offload" mode, it won't delete the source if the target doesn't match.

There's some other minor stuff it does, but I think that's the main reason.
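For illustration only, here is roughly what that behaviour could look like in Python. This is not the PHP program being described; the function name `paranoid_copy` and the `offload` flag are assumptions made up for this sketch.

```python
import filecmp
import logging
import shutil
from pathlib import Path

log = logging.getLogger("paranoid_copy")

def paranoid_copy(src, dst, offload=False):
    """Copy src to dst, then verify by reading both files back in full.

    Returns True if the copy verified. In offload mode the source is
    deleted only after a successful verification.
    """
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    # shallow=False forces a content comparison rather than trusting
    # matching size and modification time.
    if not filecmp.cmp(src, dst, shallow=False):
        log.error("MISMATCH after copy: %s -> %s", src, dst)
        return False
    if offload:
        src.unlink()  # only reached when the target matched
    return True
```

One caveat with any read-back check like this: the operating system may serve the second read from its cache rather than from the disk, so a match is strong evidence but not proof that the data on the target medium is intact.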

@woozle That's....... interesting.

you don't happen to have the source code available anywhere, do you?

Also VERY dumb question, why not use a (secure) hash?

@ThorstenAnzomi I'm planning to post it on GitLab as soon as I can get it tidied up a bit.

Correct me if I'm wrong, but a hash would require just as much I/O and would therefore not be any faster.

@woozle A hash would require the same amount of I/O, yes, as both files need to be read in their entirety and summed up.

However, the time saved is in the comparisons: you're doing a single comparison (hash == hash) and not one comparison per byte. Yes, a hash comparison is also made up of multiple comparisons, but those average 40-50, not the thousands you would need for byte-by-byte.

@ThorstenAnzomi But the comparison takes place in RAM, so is almost instantaneous either way, no?

@woozle Comparisons are usually done in the CPU's ALU, effectively a special subtract. If the subtraction result is 0, the two are equal. It takes a few cycles to complete; for the sake of argument, let's say 5.

To check an entire file, that's 5 cycles per comparison, times file length.

To check a hash, that's 5 times either the length of the hash divided by the CPU's word size (if stored as a number, usually <10), or the length of the string (90 char for b64). The I/O is the same, but the CPU time isn't.

@woozle To expand more: using SHA-256 as an example, which generates 256-bit hashes, and given that most computers today are 64-bit, the hash is 4 times the CPU's word size, meaning it'd take 4 subcompares to compare a hash when stored as a (giant, admittedly) number.

@woozle Then again, I could also be COMPLETELY wrong, but from my understanding of computers, you'd shave off a little bit of cycle time by using hash comparisons, especially since this is the sort of thing hashes are used for.
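To make the two approaches concrete, here is a rough Python sketch of both. The function names and the 1 MiB chunk size are arbitrary choices for illustration. Both versions read every byte of both files, which is the "same I/O" point made above; they differ only in what gets compared at the end.

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice

def same_bytes(path_a, path_b):
    """Compare two files block by block, stopping at the first difference."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(CHUNK)
            block_b = b.read(CHUNK)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended at the same point
                return True

def sha256_of(path):
    """Return the 32-byte SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.digest()

def same_hash(path_a, path_b):
    """Compare two files by digest: one 32-byte comparison at the end."""
    return sha256_of(path_a) == sha256_of(path_b)
```

The digest is 32 bytes, i.e. four 64-bit words, so the final comparison is tiny. Computing it still involves work on every byte of the file, though, so whether this comes out ahead on CPU time is something to measure rather than assume. The block-by-block version can also stop early at the first mismatch.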

@ThorstenAnzomi Possibly -- but is the time saving significant by comparison to the I/O time? ...not to mention the time spent parsing interpreted code and displaying results on the screen, which can measurably slow down progress if there's too much updating (ask me how I know).

@woozle That would depend on the speed at which the files can be read, and the cpu speed, few other things...

probably not a significant speed increase because it's the same I/O, however it might reduce memory consumption a little.

Then again, I come from the world of trying to shave literal milliseconds off programs so they run just that much quicker. It's probably nothing noticeable in production, but I'd need a copy to test for myself what the difference is, if any.

@ThorstenAnzomi Been there -- I ended up translating backprop neural network code from Pascal to ASM86 just to get a few percentage points of improvement in speed.

(Geez, isn't there anything faster than a 486DX-25?? :D)

@woozle Ah, Pascal... How I originally taught myself programming (and THEN moved to VB.NET, then... C++...)

@ThorstenAnzomi I did FORTRAN IV (and then 77), then Pascal, then C++, then VB6, then a bit of Perl and finally PHP.

@woozle I've done.... in chronological order... Pascal, VB.NET, C++, Java, Lua, FORTRAN, COBOL, Perl, Python, FORTH, C#, Go, F#, Haskell...

Yeah, I think I might have a minor obsession...

@woozle I'm definitely a Go / C# programmer now, despite C# being god awful on Linux. C++ has its place, and so does Python.

And yes.. as someone who does the modded Minecraft, both OpenComputers and ComputerCraft use Lua by default, though I did write an OC CPU architecture for Python 3. So that's my reasoning for Lua.

@ThorstenAnzomi Belated thought: if sftp had an API function for returning a hash of a file, that would *definitely* make it faster to compare hashes than block-by-block.

Failing that... I suppose it would be easy enough to just write a utility to calculate a hash for a local file, and then have the remote request that... or does such a thing exist already, perhaps? But this would still require executing remote code...
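A sketch of what such a utility might look like, assuming Python is available on both ends. The names `file_digest_hex` and `matches_remote` are made up for this example. How the remote digest gets requested and sent back is left out, since that depends on being able to run code on the remote at all.

```python
import hashlib
import sys

def file_digest_hex(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a local file, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def matches_remote(local_path, remote_hex):
    """Compare a local file against a hex digest reported by the other end."""
    return file_digest_hex(local_path) == remote_hex.strip().lower()

if __name__ == "__main__":
    # Print one "digest  filename" line per argument.
    for p in sys.argv[1:]:
        print(f"{file_digest_hex(p)}  {p}")
```

With this, only the short digest string crosses the network instead of the file contents. Where shell access to the remote exists, an existing tool such as `sha256sum` from GNU coreutils already covers the hashing part, so the script above would mainly be useful where that is not installed.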
