
I was wondering if someone knows a really efficient way to get all unique numbers from a sorted list. Example: if I have list = [1,1,2,3,3,3,6,6,8,10,100,180,180] I want to get a list like this: [1,2,3,6,8,10,100,180]. I am looking for a solution that is more efficient than going through the whole list, i.e. with complexity better than O(n).

My first idea was to check every second number, but it did not work. Here is my code:

solution = []
i = 0
while i < len(values) - 2:
    if values[i] == values[i + 2]:
        if values[i] not in solution:
            solution.append(values[i])
    else:
        if values[i] not in solution:
            solution.append(values[i])

        if values[i + 1] != values[i + 2] \
                and values[i] != values[i + 1]:
            solution.append(values[i + 1])
        solution.append(values[i + 2])
    i += 2

if values[-1] not in solution:
    solution.append(values[-1])
return solution


4 Answers


Let k be the number of distinct elements, and n the total number of elements.

get all unique numbers from the sorted list [with] ... complexity better than O(n).

In the general case, clearly you can't, since k might be some fraction of n or even be as large as n, and merely writing out Θ(n) distinct values already costs Ω(n). There do appear to be at least two degrees of wiggle room here.

counting sort

How did that sorted list come to be, in the first place? Perhaps it was constructed using some comparison sort, with O(n log n) cost. But perhaps we know the elements have restricted range, and we chose to construct the list using counting sort, with O(n) time complexity and O(k) memory complexity.

In which case, preserve those k counters and read out the element values they correspond to, with O(k) time complexity. So rather than accepting just "a list" as input, we also accept the counters data structure.
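A minimal Python sketch of that idea, assuming the values are small non-negative integers (the restricted-range premise of counting sort):

def counting_sort(values, max_value):
    """Counting sort over a known range: O(n + max_value) time,
    O(max_value) memory for the counters."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1
    sorted_list = [v for v, c in enumerate(counts) if c for _ in range(c)]
    return sorted_list, counts

# Keep the counters around, and the unique values fall out later
# without re-scanning the n-element list:
sorted_list, counts = counting_sort([1, 1, 2, 3, 3, 3, 6, 6], max_value=8)
unique = [v for v, c in enumerate(counts) if c]   # [1, 2, 3, 6]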

compression

Almost the same thing: insist that the sorted values arrive in some general-purpose compressed format such as Lempel-Ziv. The PyArrow Apache Parquet binary file format works nicely. With k = 12 and n = 10^6, those million randomly chosen values will be represented in a file of just 675 bytes. And now it's a matter of uncompressing to recover a few (count, value) tuples, rather than recovering a giant number of element values.
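What such a compressor exploits is the run-length structure of sorted data. A toy sketch of the (count, value) view, leaving out the Parquet machinery (building the pairs is itself O(n); the win comes when the data already arrives in this form):

from itertools import groupby

def runs(sorted_values):
    # One (count, value) pair per distinct value in a sorted list.
    return [(len(list(g)), v) for v, g in groupby(sorted_values)]

print(runs([1, 1, 2, 3, 3, 3, 6, 6]))   # [(2, 1), (1, 2), (3, 3), (2, 6)]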

binary search

As we sequentially scan through all n elements we shall encounter k "breaks", where we advance to a brand new element value. We can bisect to locate such breaks.

Consider the degenerate case. Maybe each element corresponds to "female" or "male", and the application was used at a girls' school, so the list simply indicates we have n females. Clearly we can identify this case in O(1) constant time, by examining the first and last elements and relying on monotonicity of the input.

Let's visit a public school and try that again. It develops that first and last elements differ by one, and we will call the start of the list the "first break". Now we need to find a single break. Apply binary search in the usual way, with O(log n) time complexity.

In a more general setting where k << n still holds, we can find k breaks in O(k log n) time, with O(k) memory complexity.


The following Rust code will find breaks in values pretty quickly. It uses a half-open interval.

/// Finds the first index in the half-open range [start, end) at which
/// the element value changes from xs[start]; returns end if it never does.
fn find_break(xs: &[i16], start: usize, end: usize) -> usize {
    assert!(!xs.is_empty());
    let mut i = start;
    let mut j = end;
    assert!(i < j);

    // Invariant: xs[start..i] all equal xs[start]; the first break,
    // if any, lies in [i, j].
    while i < j {
        let mid = i + (j - i) / 2;
        if xs[start] == xs[mid] {
            i = mid + 1;
        } else {
            j = mid;
        }
    }
    assert!(i == end || xs[i - 1] != xs[i]);

    i
}
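The same repeated-jump idea is compact in Python, since bisect_right can skip an entire run in O(log n); this sketch collects all k distinct values in O(k log n):

from bisect import bisect_right

def unique_sorted(s):
    uniq = []
    i = 0
    while i < len(s):
        uniq.append(s[i])
        i = bisect_right(s, s[i], i)  # first index past the current run
    return uniq

print(unique_sorted([1, 1, 2, 3, 3, 3, 6, 6, 8, 10, 100, 180, 180]))
# [1, 2, 3, 6, 8, 10, 100, 180]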

Going left-to-right, finding the end of the current value with exponential search:

from bisect import bisect

def uniq1(s):
    n = len(s)
    i = 0
    uniq = []
    while i < n:
        uniq.append(s[i])
        k = 1
        while i+k < n and s[i+k] == s[i]:
            k *= 2
        i = bisect(s, s[i], i+k//2+1, min(i+k, n))
    return uniq
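Each run of length m costs O(log m) for the doubling phase plus O(log m) for the final bisect, so the whole scan is O(k log(n/k)) in the worst case, slightly better than k independent binary searches over the full list.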

Divide-and-conquer idea from someone's comment:

def uniq2(s):
    uniq = []
    def add(i, j):
        if s[i] == s[j]:
            if s[i] not in uniq[-1:]:
                uniq.append(s[i])
            return
        m = (i + j) // 2
        add(i, m)
        add(m+1, j)
    if s:
        add(0, len(s) - 1)
    return uniq

Rough benchmark with a million elements from a pool of 1000 unique values:

True   4.4 ms  uniq_exponential
True   9.0 ms  uniq_divide
True  64.4 ms  uniq_dict
True  26.0 ms  uniq_groupby_1
True  36.4 ms  uniq_groupby_2
True  29.3 ms  uniq_groupby_3

Test script:

from bisect import bisect

def uniq_exponential(s):
    n = len(s)
    i = 0
    uniq = []
    while i < n:
        uniq.append(s[i])
        k = 1
        while i+k < n and s[i+k] == s[i]:
            k *= 2
        i = bisect(s, s[i], i+k//2+1, min(i+k, n))
    return uniq


def uniq_divide(s):
    uniq = []
    def add(i, j):
        if s[i] == s[j]:
            if s[i] not in uniq[-1:]:
                uniq.append(s[i])
            return
        m = (i + j) // 2
        add(i, m)
        add(m+1, j)
    if s:
        add(0, len(s) - 1)
    return uniq


def uniq_dict(s):
    return list(dict.fromkeys(s))


from itertools import groupby

def uniq_groupby_1(s):
    return [x for x, _ in groupby(s)]

def uniq_groupby_2(s):
    return list(next(zip(*groupby(s))))

from operator import itemgetter

def uniq_groupby_3(s):
    return list(map(itemgetter(0), groupby(s)))


funcs = [uniq_exponential, uniq_divide, uniq_dict, uniq_groupby_1, uniq_groupby_2, uniq_groupby_3]

import random
from time import time

def test(s):
    expect = sorted(set(s))
    for f in funcs:
        t = time()
        print(f(s) == expect, f'{(time()-t)*1e3:5.1f} ms ', f.__name__)
    print()

test([1,1,2,3,3,3,6,6,8,10,100,180,180])
test(sorted(random.choices(random.sample(range(1000000), 1000), k=1000000)))
test(list(range(10000)))


You haven't told us how many unique elements k there are among the n elements, but if k were small relative to n, and n were really large, I'd do something like this.

For a sublist x[low .. < high], we want to know all the runs that start in that sublist.

  • If x[low] == x[high - 1], the whole sublist is a single value. If low > 0 and x[low] == x[low - 1], then no runs start in this sublist. Otherwise, x[low] is the start of a run.
  • If high - low < delta, for some value delta to be determined, just look at the values linearly. There's nothing to be gained by the recursive algorithm.
  • Let s = sqrt(high - low). Break the sublist into about s chunks of s elements each. Perform the algorithm recursively on each one.

I suspect that this will, in most cases, be sublinear. It will quickly identify large blocks of identical elements without looking at each element.

My gut tells me that dividing into s chunks of s elements is the best choice in the third case above. I might also experiment with just splitting the list in two and doing a binary search for run starts that way. Probably easier to code.
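A sketch of that recursion in Python, with a hypothetical cutoff DELTA and math.isqrt for the chunk size (both are illustrative choices, not tuned):

import math

DELTA = 8  # hypothetical cutoff below which a plain linear scan wins

def run_starts(x):
    """Indices in sorted list x where a new run begins."""
    starts = []

    def visit(low, high):  # half-open sublist x[low:high]
        if x[low] == x[high - 1]:
            # Whole sublist is one value; it starts a run only if it
            # differs from the element just before the sublist.
            if low == 0 or x[low - 1] != x[low]:
                starts.append(low)
            return
        if high - low < DELTA:
            for i in range(low, high):  # small sublist: scan linearly
                if i == 0 or x[i - 1] != x[i]:
                    starts.append(i)
            return
        step = math.isqrt(high - low)  # ~sqrt chunks of ~sqrt elements
        for lo in range(low, high, step):
            visit(lo, min(lo + step, high))

    if x:
        visit(0, len(x))
    return starts

x = [1, 1, 2, 3, 3, 3, 6, 6, 8, 10, 100, 180, 180]
print([x[i] for i in run_starts(x)])   # [1, 2, 3, 6, 8, 10, 100, 180]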

from itertools import groupby

sorted_list = [1, 1, 2, 3, 3, 3, 6, 6, 8, 10, 100, 180, 180]

unique_list = [key for key, _ in groupby(sorted_list)]

print(unique_list)
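# prints [1, 2, 3, 6, 8, 10, 100, 180]

This is still an O(n) pass, but itertools.groupby collapses each run of equal adjacent values into a single key, which is exactly the right primitive for deduplicating sorted data.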

