-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
split is inefficient; can be sped up over 2x for single character delimiter #39
Comments
(Actually it's more like 3x as fast since the readFile of 500 mb is responsible for 20 seconds of the runtime.) |
Btw, http://hackage.haskell.org/package/split's splitOn is similarly slow to MissingH's split. |
Wow, good catch. It would be easy enough to add an optimized splitc for a single-character delimter, and have split use it automatically. Does that splitOn perform better than this split? |
split's splotOn does not perform identically, but it's in a very close ballpark to the current MissingH split. |
From way back in the MissingH history dating back to 2004 is another
definition of split. I have not verified its correctness, but I wonder
how it performs in comparison?
split :: Eq a => [a] -> [a] -> [[a]]
split delim str =
let splitworker :: Eq a => [a] -> [a] -> [a] -> [[a]]
splitworker delim [] [] = []
splitworker delim [] accum = [accum]
splitworker delim str accum =
if delim == str then
accum : [] : []
else if startswith delim str then
accum : splitworker delim (drop (length delim) str) []
else splitworker delim (tail str) (accum ++ [head str])
in
splitworker delim str []
…On 02/14/2017 09:41 PM, Joey Hess wrote:
split's splotOn does not perform identically, but it's in a very close
ballpark to the current MissingH split.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#39 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAG5HXQ7yUm9jFEy1f8r51t4F9qcfgxSks5rcnPggaJpZM4LzVBB>.
|
This time I used this data file:
perl -e 'for (1..50000000) { print $_." "}' > data
The current MissingH split profile with that:
Wed Feb 15 12:14 2017 Time and Allocation Profiling Report (Final)
foo +RTS -p -RTS
total time = 80.99 secs (80991 ticks @ 1000 us, 1 processor)
total alloc = 99,733,231,232 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
spanList Data.List.Utils 47.7 59.3
startswith Data.List.Utils 27.0 14.1
MAIN MAIN 14.7 17.8
split Data.List.Utils 8.0 6.4
breakList Data.List.Utils 2.6 2.4
Compare with the 2004 version:
Wed Feb 15 12:17 2017 Time and Allocation Profiling Report (Final)
foo +RTS -p -RTS
total time = 52.74 secs (52739 ticks @ 1000 us, 1 processor)
total alloc = 47,066,563,520 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
MAIN MAIN 93.1 70.2
startswith Data.List.Utils 6.9 29.8
Compare with splitc:
Wed Feb 15 12:19 2017 Time and Allocation Profiling Report (Final)
foo +RTS -p -RTS
total time = 38.25 secs (38251 ticks @ 1000 us, 1 processor)
total alloc = 71,155,452,896 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
MAIN MAIN 100.0 100.0
splitc still wins by a reasonable margin, interesting it has more allocations.
The 2004 split is a nice improvement, assuming it's correct.
…--
see shy jo
|
Hello.
(testing on equality and isPrefixOf have much in common + accumulating with (++) is bad i believe) Actually, i also tried to speciallize split to 2-character delimiter and didn't see significant improvements (like in case with splitc) On my system 2004 version runs ~41 seconds, version in this post ~ 29 seconds and splitc ~ 23 seconds for data generated above and single-character delimiter. |
Interesting. I wonder about multi-character delimiters? (This is one
of the reasons for this split.) If it's no worse than the other
implementation for those, then this is a slam dunk
|
I generated test file like this (version with 1000000 eats 3gb of memory when used with non-existing delimiter, be careful). And tested all 3 versions (default, 2004, 2004Opt) For delimiters 123 and abcdefghij default version is the slowest, 2004 with optimizations the fastest. It is also interesting if 'spanList' could be more effective. |
Data.List.Utils.split if you profile it, does a lot of allocation.
With a 600 mb "data" file containing short words, I profiled this program:
Compare with this program using a splitc specialized to a single character delimiter:
So over twice as fast!
A simple improvement then would be to make split use splitc when the delimiter is a single character. But, it might be better to investigate why the helper functions are doing so much allocation.
The text was updated successfully, but these errors were encountered: