split is inefficient; can be sped up over 2x for single character delimiter #39

Open

joeyh opened this issue Jan 31, 2017 · 9 comments

joeyh commented Jan 31, 2017

Data.List.Utils.split, if you profile it, does a lot of allocation.

With a 600 MB "data" file containing short words, I profiled this program:

import Data.List.Utils
main = do
        s <- readFile "data"
        let l = split " " s
        print (length l)

Tue Jan 31 19:49 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =      117.11 secs   (117106 ticks @ 1000 us, 1 processor)
    total alloc = 139,419,374,656 bytes  (excludes profiling overheads)

COST CENTRE MODULE           %time %alloc

spanList    Data.List.Utils   46.2   59.3
startswith  Data.List.Utils   28.4   14.1
MAIN        MAIN              14.7   17.8
split       Data.List.Utils    8.3    6.4
breakList   Data.List.Utils    2.5    2.4

Compare with this program, which uses a splitc specialized to a single-character delimiter:

main = do
    s <- readFile "data"
    let l = splitc ' ' s
    print (length l)

-- Split on a single-character delimiter with break, instead of testing
-- isPrefixOf against every tail of the input.
splitc :: Char -> String -> [String]
splitc _ [] = []
splitc c s = case break (== c) s of
    (i, _:rest) -> i : splitc c rest
    (i, [])     -> [i]

    Tue Jan 31 19:54 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =       54.44 secs   (54435 ticks @ 1000 us, 1 processor)
    total alloc = 99,505,766,960 bytes  (excludes profiling overheads)

So over twice as fast!

A simple improvement, then, would be to make split use splitc when the delimiter is a single character. But it might be better to investigate why the helper functions are doing so much allocation.
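For illustration, a minimal sketch of that dispatch (assuming MissingH is installed; the fallback just reuses the current Data.List.Utils.split). The splitc variant below keeps the empty final field after a trailing delimiter, which, if I read it right, is what the current split does, so the fast path wouldn't change behavior:

import qualified Data.List.Utils as U  -- the current MissingH implementation

-- Fast path for a one-element delimiter.  Unlike the splitc above, this
-- keeps the empty final field after a trailing delimiter, e.g.
-- splitc ',' "foo," == ["foo",""], to line up with the current split.
splitc :: Eq a => a -> [a] -> [[a]]
splitc _ [] = []
splitc c s = go s
  where
    go s' = case break (== c) s' of
        (i, _:rest) -> i : go rest
        (i, [])     -> [i]

-- Dispatch: use the fast path when the delimiter has exactly one element,
-- otherwise fall back to the existing list-based split.
split :: Eq a => [a] -> [a] -> [[a]]
split [d] str   = splitc d str
split delim str = U.split delim str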

joeyh commented Jan 31, 2017

(Actually, it's more like 3x as fast, since the readFile of the 500 MB file is responsible for 20 seconds of the runtime.)

joeyh commented Feb 1, 2017

By the way, splitOn from http://hackage.haskell.org/package/split is similarly slow to MissingH's split.
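For comparison, the analogous driver program with the split package would look something like this (a sketch, assuming the split package is installed and the same "data" file as above):

import Data.List.Split (splitOn)  -- from the split package

main :: IO ()
main = do
    s <- readFile "data"
    let l = splitOn " " s
    print (length l)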

jgoerzen (Collaborator) commented:

Wow, good catch. It would be easy enough to add an optimized splitc for a single-character delimiter, and have split use it automatically. Does that splitOn perform better than this split?

joeyh commented Feb 15, 2017

split's splitOn does not perform identically, but it's in a very close ballpark to the current MissingH split.

jgoerzen commented Feb 15, 2017 via email

joeyh commented Feb 15, 2017 via email

fyrchik commented Feb 16, 2017

Hello.
I think the 2004 split is much cleaner, and it could be further improved to something like this, for example:

split :: Eq a => [a] -> [a] -> [[a]]
split _ [] = []
split [x] str = splitc x str          -- single-character delimiter: use splitc above
split delim str =
    let splitworker :: Eq a => Int -> [a] -> [a] -> [a] -> [[a]]
        splitworker _ _ [] [] = []
        splitworker _ _ [] accum = [reverse accum]
        splitworker l d rest accum =
            let (a, t) = prefixLen 0 d rest
            in if a == l
                  then if null t
                          then reverse accum : [[]]   -- input ends with the delimiter
                          else reverse accum : splitworker l d t []
                  else splitworker l d (tail rest) (head rest : accum)
        -- length of the common prefix of the delimiter and the input,
        -- and the input remaining after that prefix
        prefixLen :: Eq a => Int -> [a] -> [a] -> (Int, [a])
        prefixLen acc (x:xs) l@(y:ys) = if x == y then prefixLen (acc+1) xs ys else (acc, l)
        prefixLen acc _ ys = (acc, ys)
    in splitworker (length delim) delim str []
(Testing for equality and isPrefixOf have a lot in common, and accumulating with (++) is bad, I believe.)

Actually, I also tried specializing split to a 2-character delimiter and didn't see significant improvements (unlike the case with splitc).
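The 2-character version isn't shown here; a hypothetical reconstruction of that kind of specialization, in the same style as splitc, might look roughly like this (not the code that was actually benchmarked):

-- Hypothetical sketch of a split specialized to a two-element delimiter.
split2 :: Eq a => a -> a -> [a] -> [[a]]
split2 _ _ [] = []
split2 c1 c2 s = go [] s
  where
    go acc (x:y:rest)
        | x == c1 && y == c2 = reverse acc : go [] rest   -- delimiter found
    go acc (x:rest)          = go (x : acc) rest          -- accumulate one element
    go acc []                = [reverse acc]              -- end of input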

On my system, the 2004 version runs in ~41 seconds, the version in this post in ~29 seconds, and splitc in ~23 seconds, for the data generated above and a single-character delimiter.

jgoerzen commented Feb 16, 2017 via email

fyrchik commented Feb 17, 2017

I generated the test file like this (the version with 1000000 eats 3 GB of memory when used with a non-existent delimiter, so be careful):
perl -e 'for (1..1000000) { print $_."abcdefghij"}' >many10CharDelims
I'd appreciate other tests on which the default version could be the fastest.

I tested all 3 versions (default, 2004, 2004Opt):
https://github.com/fyrchik/randomthings (I pushed everything there, with the testScript and results.)

For the delimiters 123 and abcdefghij, the default version is the slowest and 2004 with optimizations is the fastest.
I also tried a non-existent delimiter, and here the 2004 version was the slowest, but it still consumed less memory than the default version.
2004 with optimizations outperformed both other versions in tests with all 3 delimiters, in terms of both time and memory (though it sometimes loses to the non-optimized 2004 version).

It would also be interesting to see whether 'spanList' could be made more efficient.
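One direction, assuming spanList keeps its current shape (a ([a] -> Bool) predicate applied to the remaining list), would be an accumulator version along these lines (untested; whether it actually allocates less would need profiling):

-- Sketch of spanList with an explicit accumulator.  Same intended meaning
-- as spanList: take elements while the predicate holds on the remaining list.
spanList' :: ([a] -> Bool) -> [a] -> ([a], [a])
spanList' p = go []
  where
    go acc l@(x:xs)
        | p l   = go (x : acc) xs
    go acc l    = (reverse acc, l)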
