Show HN: K(r)ep - A high-performance string search utility

github.com

21 points | by daviducolo 4 days ago ago

15 comments

aw1621107 4 days ago
Posted here about a month ago: https://news.ycombinator.com/item?id=43333946
[-]
- finnh 4 days ago
  With this comment showing that search was inconsistent and gave incorrect results: https://news.ycombinator.com/item?id=43335277
  why the repost?
  [-]
  - globnomulous 4 days ago
    He posted the project 36 days ago, 35 days ago, and today. The commit history appears to show that he has been working on it, but that's no excuse for this sort of re-posting.
    The guy has a pattern of not just posting his own projects but contributing nothing meaningful to discussion. Either he makes a sales pitch when introducing the project or he thanks people for their feedback and fails to engage in any way with substantive feedback or criticism.
    Last time he posted this project he couldn't even be bothered to reply to a very generous response by the author of ripgrep: https://news.ycombinator.com/context?id=43334661
    I'm not a moderator (thank God) but it's patently clear to me that there is an etiquette, essential for any productive conversation, that this guy isn't following and that his submissions should be ignored.
    [-]
    burntsushi 4 days ago
    He also posts threads on reddit and then deletes everything:
    https://old.reddit.com/r/programming/comments/1jra9y0/introd...
    https://old.reddit.com/r/programming/comments/1jpk8sw/krep_a...
    https://old.reddit.com/r/commandline/comments/1jqhssc/krep_b...
    daviducolo 4 days ago
    I reposted the project because I have released a new version fixing everything that users have encountered.
alkh 4 days ago
The fact that even in the test example it took ripgrep 0.115 s with 97% CPU usage vs Krep's 0.106 s with 328% CPU usage is a testament of why ripgrep is still preferable. Such a huge tradeoff in resource consumption for such a small speed gain is hard to justify
[-]
- burntsushi 4 days ago
  ripgrep author here.
  Like many grep tools, it also has significant performance cliffs that are very easy to hit:
```
    $ time rg -o -c '[A-Z]' subtitles2016-sample.en
    51835353

    real    4.391
    user    4.344
    sys     0.042
    maxmem  923 MB
    faults  0
    $ time krep-1.0.2 -E -o -c '[A-Z]' subtitles2016-sample.en
    ^C

    real    8:02.95
    user    3:04:10.58
    sys     4.203
    maxmem  919 MB
    faults  0
```
  ripgrep has cliffs too, but they tend to be rarer and/or still give decent perf.
MattPalmer1086 3 days ago
There's a few issues with the Horspool implementation.
To obtain non case sensitive searching, you set the shift in the bad char table for both the upper and lower case variants. That's fine. But when you come to do the search, you ignore this and only look up the lower case version if it's non case sensitive. This is unnecessary - the valid shifts for both cases are already in the table, since you put them there. So you can avoid doing a lookup on the lower table.
I think you also only ever shift by one if you get a full match, but this is also unnecessary. You can shift by whatever the last char bad char shift is and you won't miss any patterns, even if they can overlap.
Finally, you have a test to see if the shift is zero, and to set it to one if it is. You should never get a shift of zero from the bad char table (you don't process the last char of the pattern when creating the bad char table, so there is never a distance of zero from the end).
[-]
- daviducolo 3 days ago
  Thank you for the report! Here’s what I changed:
  1. Always use the raw bad_char_table[text[i + pattern_len‑1]] shift—no extra lowercase lookup at runtime. 2. After a full match, advance by that same table shift instead of just doing i++, so you never miss overlapping occurrences. 3. Removed the “if shift==0 then shift=1” guard—since we build the table excluding the last pattern byte, zero shifts can’t occur.
  https://github.com/davidesantangelo/krep/releases/tag/v1.0.6
  [-]
  - MattPalmer1086 3 days ago
    Looks better! A couple of other minor comments, since I'm looking at it:
    1. You don't need to assign the last char of the pattern inside the search loop every time around the loop. It will never change (line 1143). Just do it once before the search loop starts.
    2. tc_last is the text aligned with the end of the pattern window (line 1142). You don't need to get it again when getting the shift (lines 1201, 1208). bad char is the last char which you already have.
    [-]
    daviducolo 3 days ago
    I have applied your suggestions. Thank you again.
    https://github.com/davidesantangelo/krep/releases/tag/v1.1.1
    [-]
    MattPalmer1086 2 days ago
    No problem, hope it was useful.
    I've found a lot of real world performance comes from doing as little as possible inside the main search loop.
    UPDATE: It's often the case that a simpler algorithm outperforms a theoretically better one. For example, you are using Boyer Moore Horspool, which is the simpler cousin of the original Boyer Moore. BM can get better shifts than Horspool, but it's often slower in practice.
    MattPalmer1086 2 days ago
    An implication of doing as little as possible in the loop is you could boost performance by creating variants of the search algorithm for different types of search. Then you won't incur the penalty or constantly testing "if it's this kind of search do this else do that" inside the search loop.
daviducolo 3 days ago
I have just released version 1.1.0 which should fix the problems reported by users.
https://github.com/davidesantangelo/krep/releases/tag/v1.1.0
4 days ago
[deleted]