The Sieve of Eratosthenes
To generate all prime numbers, i.e. primes, in a given range, the sieve of Eratosthenes is an old,
but nevertheless still the most efficiently known algorithm. It works as follows:
Put into an array all natural numbers up to a given limit size. Set
the first sieve prime = 2. Then cross out all multiples of the current sieve prime. Next,
look for the next larger, not crossed-out number. It will become the new sieve
prime. Repeat this process with the next not crossed out number until all
numbers are worked off. Here this process is illustrated for the odd numbers < 360:
.
Obviously, crossing out of multiples need to start firstly at the square
of the new chosen sieve prime. So the total effort is about size SUMp
1/p where the sum is taken over all primes less than sqrt(size).
The sum can be estimated rather well to be about ln(ln(size)/2)+0.261497. Hence, for large sieve sizes, the time of the algorithm should be roughly
proportional to the sieve size, because the function ln(ln(x)) increases so slowly.
After seen some "fast" implementations of this algorithm by other people,
I decided to
write my own, really fast and less memory consuming computer program.
There are four main improvement points for
The Art of Prime Sieving
- Dense bit packing for the crossing-out flags
- To use the memory efficiently, it is necessary to mark the numbers not in
a byte or even a word of the computer, but in a bit only. This must be done
very clever, because bit-access is much more expensive in cpu-time than
byte-access or the CPU-prefered word-access!
- Only presieved numbers in the sieve
- With exception of 2, all other primes are odd, so we need only to store the flags for
the odd numbers in the sieve. But furthermore, only the numbers 6k+1
and 6k+5 can be primes (except 2 and 3). So, we can reduce the total amount
of sieve memory by a factor of 3 storing only the flags for these numbers. If we even exclude all multiples of 5, resulting in a factor of 3.75, we need only 8 flags each 30 numbers. This is
really nice, because 1 byte has 8 bits!
- Don't bother with multiples of the smallest primes
- We know, that the primes, except 2 and 3,
can occur only for those numbers which have modulo 6 a remainder 1 or 5.
So we can avoid to cross out all multiples of 2 and 3, saving a
factor of 3 in sieving speed. What is more, this list of smallest primes can
be extended, e.g. including 5 and 7, and we need only to consider 48 numbers out
of 210, achieving a speed-up factor of even 4.375. Each further small prime p will decrease the run time by a factor 1-1/p, but increase the code of
the program by a factor p-1.
- Choose an appropriate sieve size
- Typically, the sieve should fit into the computers main memory and even
better into its cache in modern high speed computers. Therefore, one can't use
a sieve of the size of the sieve limit, but must use far smaller sieve sizes normally. So, one has to choose an appropriate fixed sieve size and sieve the total
range in many parts sequentially. And hence, the dense bit packing in the sieve
pays very well!
With these tricks in mind (and a lot of optimization), I wrote a C-program
which was the fastest sieve-of-Eratosthenes implementation, I ever became aware of. In May 1998, I have further refined the algorithm with an even denser sieve, resulting in access to fixed bit-positions, and a quicker presieving. These improvements gain at least 15% speed-up over the old version.
To give you a feeling of its speed: it generates all primes
less than 1 milliard in less than 1 minute and all primes up to 2^32 in less
than 3.5 minutes on a 133 MHz Pentium CPU (I used sieve size = 8000 Bytes (processor has 8 KB Data-Cache), smallest primes = 2,3,5 and gcc-2.7.2!).
Thanks to Thomas Fiedler, University of Jena, who discovered in May 2003 an important bug (a 1 not a 0 should have been there) when segmentating sieving, I polished the source a bit
and thus you can fetch the latest prime_sieve.c version 2.0c. And here is the corresponding README.
Features
- source code availible
- very fast
- interval sieving possible
- adjustable to CPU cache size
- works up to numbers 264
- user definable macro called for each found prime number
Benchmarks
Limit | Prime Count | CPU1 | CPU2 | CPU3 | CPU4 | CPU5 | CPU6 | CPU7 | CPU8 | CPU9 | Factor |
10^2 | 25 | 0.0 s | 0.0 s | 0.00 s | 0.0 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 1.0955 |
10^3 | 168 | 0.0 s | 0.0 s | 0.00 s | 0.0 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 1.5010 |
10^4 | 1229 | 0.0 s | 0.1 s | 0.00 s | 0.0 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 1.7887 |
2^16 | 6542 | 0.0 s | 0.1 s | 0.01 s / 0.01 s | 0.0 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 1.9744 |
10^5 | 9592 | 0.1 s | 0.1 s | 0.01 s / 0.01 s | 0.0 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 0.00 s | 2.0118 |
10^6 | 78498 | 0.1 s | 0.2 s | 0.02 s / 0.03 s | 0.1 s | 0.01 s | 0.01 s | 0.01 s/ 0.01 s | 0.00 s | 0.00 s | 2.1941 |
10^7 | 664579 | 0.2 s | 1.6 s | 0.14 s / 0.20 s | 0.5 s | 0.13 s | 0.08 s | 0.04 s/ 0.04 s | 0.02 s | 0.01 s | 2.3483 |
10^8 | 5761455 | 1.2 s | 17.0 s | 1.42 s / 2.02 s | 4.9 s | 1.36 s | 0.80 s | 0.36 s/ 0.37 s | 0.18 s | 0.11 s | 2.4818 |
10^9 | 50847534 | 11.3 s | 187.8 s | 16.2 s / 21.7 s | 51.3 s | 15.30 s | 8.84 s | 3.62 s/ 3.68 s | 1.55 s | 1.16 s | 2.5996 |
2^32 | 203280221 | 50.7 s | 889.5 s | 79.6 s / 104.7 s | 249.5 s | 73.85 s | 43.01 s | 15.97 s/ 15.98 s | 7.55 s | 5.14 s | 2.6676 |
10^10 | 455052511 | 120.4 s | | 268.9 s | | 191.35 s | | 38.49 s | 16.72 s | 12.34 s | 2.7050 |
10^11 | 4118054813 | 1268.6 s | | 4122.7 s | | | | 482.77 s | 214.51 s | 143.71 s | 2.8003 |
10^12 | 37607912018 | 15207.7 s | | | | | | 7466.49 s | 3071.43 s | 2074.49 s | 2.8873 |
10^13 | 346065536839 | 249032.9 s | | | | | | | 32327.34 s | 30955.69 s | 2.9673 |
CPU1: HP PA-8000 180MHz with 400 MB RAM (256 KB Data-Cache) running HP-UX 10.20 (sieve size=200KB).
CPU2: MIPS 3000 33MHz with 32 MB RAM (no Cache) running Ultrix V4.4 (sieve size=15KB).
CPU3: AMD K6 233MHz with 64 MB RAM (32 KB Data-Cache) running Linux 2.032 (sieve size=22KB).
The C-source was compiled using gcc 2.7.2.1 for i486-linux one time for 32bit LONG and the other time for 64 bit LONG. Further, for limit > 2^32 one should
increase the sieve size to get shorter running times.
CPU4: Intel Pentium 133MHz with 64 MB RAM (8KB Data-Cache) running Windows 95 (sieve size=8000B).
Compiler: Visual C++ (max speed)
CPU5: DEC Alpha 21164a 400 MHz with 64 MB RAM (8 KB Data-Cache) running OSF1 V4.0 (sieve size=8000)
CPU6: Intel Pentium III 450 MHz with 128 MB RAM (16 KB Data-Cache) running Linux 2.2.14 using gcc 2.7.2.3 (i386 Linux/ELF) (sieve size=16384).
As you see, it is very important how well the code-optimizer and the caching logic of the cpu does. The sieve size are nearly optimal chosen for limits < 10^10.
CPU7: AMD Thunderbird 900 MHz with 256 MB RAM (64 KB 1st level Data-Cache) running Linux 2.2.19 using gcc 2.95.2 (i386 Linux/ELF) (sieve size=64000/64KB).
CPU8: PowerPC 970FX 2.5 GHz with 1.5 GB RAM running Mac OSX 10.3.8 using
compiler IBM XLF 6.0 Advanced Edition (sieve size=32000 for limit <= 10^11 else
the minimal necessary sieve size).
CPU9: AMD Athlon64 Winchester 2 GHz with 3 GB RAM (64 KB 1st level Data-Cache)
running Linux 2.6.9 using gcc 3.4.2 (x86-64 Linux/ELF) (sieve size=65000).
Factor = ln(1/2 ln(Limit))+0.2615
The average access for each bit in the sieve is PROD(1-1/p) (Factor - SUM1/p) whereby the primes p in the sum and the product runs over the smallest not-bother-primes, --- here 2,3,5, resulting in 8/30(Factor -31/30).
BTW: Because the gaps between successive primes are <= 250 up to p=436273009 and
<= 500 up to 304599508537, to hold a list of primes only their differences
need to be stored in a byte.
For interested persons: the frequency of primes
and a little problem for the mathematicians: estimate the magnitude of
SUMp<=n 1/p - ln(ln(n))-0.2614972128....
Hint:
Hardy & Wright tell us O(1/ln(n)), but this seems to be too pessimisticly.
Notice/Note
For intervals larger about 10^9, surely for those > 10^10, the Sieve of Eratosthenes is outperformed
by the Sieve of Atkins and Bernstein which uses irreducible binary quadratic forms. See their paper for background informations as well as paragraph 5 of W. Galway's Ph.D. thesis.
Achim Flammenkamp
created: 1998-06-02 17:30 UTC+2
updated: 2005-05-19 17:10 UTC+2