split is inefficient; can be sped up over 2x for single character delimiter #39

Open

joeyh opened this issue Jan 31, 2017 · 9 comments

joeyh commented Jan 31, 2017

Data.List.Utils.split, if you profile it, does a lot of allocation.

With a 600 MB "data" file containing short words, I profiled this program:

import Data.List.Utils
main = do
        s <- readFile "data"
        let l = split " " s
        print (length l)

Tue Jan 31 19:49 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =      117.11 secs   (117106 ticks @ 1000 us, 1 processor)
    total alloc = 139,419,374,656 bytes  (excludes profiling overheads)

COST CENTRE MODULE           %time %alloc

spanList    Data.List.Utils   46.2   59.3
startswith  Data.List.Utils   28.4   14.1
MAIN        MAIN              14.7   17.8
split       Data.List.Utils    8.3    6.4
breakList   Data.List.Utils    2.5    2.4

Compare with this program, which uses a splitc specialized to a single-character delimiter:

main = do
    s <- readFile "data"
    let l = splitc ' ' s
    print (length l)

-- Split on a single-character delimiter with break, instead of testing
-- isPrefixOf against every tail of the input.
splitc :: Char -> String -> [String]
splitc _ [] = []
splitc c s = case break (== c) s of
    (i, _:rest) -> i : splitc c rest
    (i, [])     -> [i]

    Tue Jan 31 19:54 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =       54.44 secs   (54435 ticks @ 1000 us, 1 processor)
    total alloc = 99,505,766,960 bytes  (excludes profiling overheads)

So over twice as fast!

A simple improvement, then, would be to make split use splitc when the delimiter is a single character. But it might be better to investigate why the helper functions are doing so much allocation.
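For illustration, a minimal sketch of that dispatch (assuming MissingH is installed; the fallback just reuses the current Data.List.Utils.split). The splitc variant below keeps the empty final field after a trailing delimiter, which, if I read it right, is what the current split does, so the fast path wouldn't change behavior:

import qualified Data.List.Utils as U  -- the current MissingH implementation

-- Fast path for a one-element delimiter.  Unlike the splitc above, this
-- keeps the empty final field after a trailing delimiter, e.g.
-- splitc ',' "foo," == ["foo",""], to line up with the current split.
splitc :: Eq a => a -> [a] -> [[a]]
splitc _ [] = []
splitc c s = go s
  where
    go s' = case break (== c) s' of
        (i, _:rest) -> i : go rest
        (i, [])     -> [i]

-- Dispatch: use the fast path when the delimiter has exactly one element,
-- otherwise fall back to the existing list-based split.
split :: Eq a => [a] -> [a] -> [[a]]
split [d] str   = splitc d str
split delim str = U.split delim str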

joeyh commented Jan 31, 2017

(Actually, it's more like 3x as fast, since the readFile of the 500 MB file is responsible for 20 seconds of the runtime.)

joeyh commented Feb 1, 2017

By the way, splitOn from http://hackage.haskell.org/package/split is similarly slow to MissingH's split.
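For comparison, the analogous driver program with the split package would look something like this (a sketch, assuming the split package is installed and the same "data" file as above):

import Data.List.Split (splitOn)  -- from the split package

main :: IO ()
main = do
    s <- readFile "data"
    let l = splitOn " " s
    print (length l)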

jgoerzen (Collaborator) commented:

Wow, good catch. It would be easy enough to add an optimized splitc for a single-character delimiter, and have split use it automatically. Does that splitOn perform better than this split?

joeyh commented Feb 15, 2017

split's splitOn does not perform identically, but it's in a very close ballpark to the current MissingH split.

jgoerzen commented Feb 15, 2017 via email

joeyh commented Feb 15, 2017 via email

fyrchik commented Feb 16, 2017

Hello.
I think the 2004 split is much cleaner, and it could be further improved to something like this, for example:

split :: Eq a => [a] -> [a] -> [[a]]
split _ [] = []
split [x] str = splitc x str          -- single-character delimiter: use splitc above
split delim str =
    let splitworker :: Eq a => Int -> [a] -> [a] -> [a] -> [[a]]
        splitworker _ _ [] [] = []
        splitworker _ _ [] accum = [reverse accum]
        splitworker l d rest accum =
            let (a, t) = prefixLen 0 d rest
            in if a == l
                  then if null t
                          then reverse accum : [[]]   -- input ends with the delimiter
                          else reverse accum : splitworker l d t []
                  else splitworker l d (tail rest) (head rest : accum)
        -- length of the common prefix of the delimiter and the input,
        -- and the input remaining after that prefix
        prefixLen :: Eq a => Int -> [a] -> [a] -> (Int, [a])
        prefixLen acc (x:xs) l@(y:ys) = if x == y then prefixLen (acc+1) xs ys else (acc, l)
        prefixLen acc _ ys = (acc, ys)
    in splitworker (length delim) delim str []
(Testing for equality and isPrefixOf have a lot in common, and accumulating with (++) is bad, I believe.)

Actually, I also tried specializing split to a 2-character delimiter and didn't see significant improvements (unlike the case with splitc).
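The 2-character version isn't shown here; a hypothetical reconstruction of that kind of specialization, in the same style as splitc, might look roughly like this (not the code that was actually benchmarked):

-- Hypothetical sketch of a split specialized to a two-element delimiter.
split2 :: Eq a => a -> a -> [a] -> [[a]]
split2 _ _ [] = []
split2 c1 c2 s = go [] s
  where
    go acc (x:y:rest)
        | x == c1 && y == c2 = reverse acc : go [] rest   -- delimiter found
    go acc (x:rest)          = go (x : acc) rest          -- accumulate one element
    go acc []                = [reverse acc]              -- end of input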

On my system, the 2004 version runs in ~41 seconds, the version in this post in ~29 seconds, and splitc in ~23 seconds, for the data generated above and a single-character delimiter.

jgoerzen commented Feb 16, 2017 via email

fyrchik commented Feb 17, 2017

I generated the test file like this (the version with 1000000 eats 3 GB of memory when used with a non-existent delimiter, so be careful):
perl -e 'for (1..1000000) { print $_."abcdefghij"}' >many10CharDelims
I'd appreciate other tests on which the default version could be the fastest.

I tested all 3 versions (default, 2004, 2004Opt):
https://github.com/fyrchik/randomthings (I pushed everything there, with the testScript and results.)

For the delimiters 123 and abcdefghij, the default version is the slowest and 2004 with optimizations is the fastest.
I also tried a non-existent delimiter, and here the 2004 version was the slowest, but it still consumed less memory than the default version.
2004 with optimizations outperformed both other versions in tests with all 3 delimiters, in terms of both time and memory (though it sometimes loses to the non-optimized 2004 version).

It would also be interesting to see whether 'spanList' could be made more efficient.
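One direction, assuming spanList keeps its current shape (a ([a] -> Bool) predicate applied to the remaining list), would be an accumulator version along these lines (untested; whether it actually allocates less would need profiling):

-- Sketch of spanList with an explicit accumulator.  Same intended meaning
-- as spanList: take elements while the predicate holds on the remaining list.
spanList' :: ([a] -> Bool) -> [a] -> ([a], [a])
spanList' p = go []
  where
    go acc l@(x:xs)
        | p l   = go (x : acc) xs
    go acc l    = (reverse acc, l)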
