The Sieve of Eratosthenes

To generate all prime numbers, i.e. primes, in a given range, the sieve of Eratosthenes is an old, but nevertheless still the most efficiently known algorithm. It works as follows:
Put into an array all natural numbers up to a given limit size. Set the first sieve prime = 2. Then cross out all multiples of the current sieve prime. Next, look for the next larger, not crossed-out number. It will become the new sieve prime. Repeat this process with the next not crossed out number until all numbers are worked off. Here this process is illustrated for the odd numbers < 360: .
Obviously, crossing out of multiples need to start firstly at the square of the new chosen sieve prime. So the total effort is about size SUMp 1/p where the sum is taken over all primes less than sqrt(size). The sum can be estimated rather well to be about ln(ln(size)/2)+0.261497. Hence, for large sieve sizes, the time of the algorithm should be roughly proportional to the sieve size, because the function ln(ln(x)) increases so slowly.

After seen some "fast" implementations of this algorithm by other people, I decided to write my own, really fast and less memory consuming computer program. There are four main improvement points for

The Art of Prime Sieving

Dense bit packing for the crossing-out flags
To use the memory efficiently, it is necessary to mark the numbers not in a byte or even a word of the computer, but in a bit only. This must be done very clever, because bit-access is much more expensive in cpu-time than byte-access or the CPU-prefered word-access!
Only presieved numbers in the sieve
With exception of 2, all other primes are odd, so we need only to store the flags for the odd numbers in the sieve. But furthermore, only the numbers 6k+1 and 6k+5 can be primes (except 2 and 3). So, we can reduce the total amount of sieve memory by a factor of 3 storing only the flags for these numbers. If we even exclude all multiples of 5, resulting in a factor of 3.75, we need only 8 flags each 30 numbers. This is really nice, because 1 byte has 8 bits!
Don't bother with multiples of the smallest primes
We know, that the primes, except 2 and 3, can occur only for those numbers which have modulo 6 a remainder 1 or 5. So we can avoid to cross out all multiples of 2 and 3, saving a factor of 3 in sieving speed. What is more, this list of smallest primes can be extended, e.g. including 5 and 7, and we need only to consider 48 numbers out of 210, achieving a speed-up factor of even 4.375. Each further small prime p will decrease the run time by a factor 1-1/p, but increase the code of the program by a factor p-1.
Choose an appropriate sieve size
Typically, the sieve should fit into the computers main memory and even better into its cache in modern high speed computers. Therefore, one can't use a sieve of the size of the sieve limit, but must use far smaller sieve sizes normally. So, one has to choose an appropriate fixed sieve size and sieve the total range in many parts sequentially. And hence, the dense bit packing in the sieve pays very well!
With these tricks in mind (and a lot of optimization), I wrote a C-program which was the fastest sieve-of-Eratosthenes implementation, I ever became aware of. In May 1998, I have further refined the algorithm with an even denser sieve, resulting in access to fixed bit-positions, and a quicker presieving. These improvements gain at least 15% speed-up over the old version. To give you a feeling of its speed: it generates all primes less than 1 milliard in less than 1 minute and all primes up to 2^32 in less than 3.5 minutes on a 133 MHz Pentium CPU (I used sieve size = 8000 Bytes (processor has 8 KB Data-Cache), smallest primes = 2,3,5 and gcc-2.7.2!).
Thanks to Thomas Fiedler, University of Jena, who discovered in May 2003 an important bug (a 1 not a 0 should have been there) when segmentating sieving, I polished the source a bit and thus you can fetch the latest prime_sieve.c version 2.0c. And here is the corresponding README.



Limit Prime Count CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 Factor
10^2 25 0.0 s0.0 s 0.00 s0.0 s 0.00 s 0.00 s 0.00 s 0.00 s 0.00 s1.0955
10^3 168 0.0 s0.0 s 0.00 s0.0 s 0.00 s 0.00 s 0.00 s 0.00 s 0.00 s1.5010
10^4 1229 0.0 s0.1 s 0.00 s0.0 s 0.00 s 0.00 s 0.00 s 0.00 s 0.00 s1.7887
2^16 6542 0.0 s0.1 s 0.01 s / 0.01 s0.0 s 0.00 s 0.00 s 0.00 s 0.00 s 0.00 s1.9744
10^5 9592 0.1 s0.1 s 0.01 s / 0.01 s0.0 s 0.00 s 0.00 s 0.00 s 0.00 s 0.00 s2.0118
10^6 78498 0.1 s0.2 s 0.02 s / 0.03 s0.1 s 0.01 s 0.01 s 0.01 s/ 0.01 s 0.00 s 0.00 s2.1941
10^7 664579 0.2 s1.6 s 0.14 s / 0.20 s0.5 s 0.13 s 0.08 s 0.04 s/ 0.04 s 0.02 s 0.01 s2.3483
10^8 5761455 1.2 s17.0 s 1.42 s / 2.02 s4.9 s 1.36 s 0.80 s 0.36 s/ 0.37 s 0.18 s 0.11 s2.4818
10^9 50847534 11.3 s187.8 s 16.2 s / 21.7 s51.3 s 15.30 s 8.84 s 3.62 s/ 3.68 s 1.55 s 1.16 s2.5996
2^32 203280221 50.7 s889.5 s 79.6 s / 104.7 s249.5 s 73.85 s 43.01 s 15.97 s/ 15.98 s 7.55 s 5.14 s2.6676
10^10 455052511 120.4 s 268.9 s 191.35 s 38.49 s 16.72 s 12.34 s2.7050
10^11 41180548131268.6 s 4122.7 s 482.77 s 214.51 s 143.71 s2.8003
10^12 3760791201815207.7 s 7466.49 s 3071.43 s 2074.49 s2.8873
10^13 346065536839249032.9 s 32327.34 s 30955.69 s2.9673
CPU1: HP PA-8000 180MHz with 400 MB RAM (256 KB Data-Cache) running HP-UX 10.20 (sieve size=200KB).
CPU2: MIPS 3000 33MHz with 32 MB RAM (no Cache) running Ultrix V4.4 (sieve size=15KB).
CPU3: AMD K6 233MHz with 64 MB RAM (32 KB Data-Cache) running Linux 2.032 (sieve size=22KB). The C-source was compiled using gcc for i486-linux one time for 32bit LONG and the other time for 64 bit LONG. Further, for limit > 2^32 one should increase the sieve size to get shorter running times.
CPU4: Intel Pentium 133MHz with 64 MB RAM (8KB Data-Cache) running Windows 95 (sieve size=8000B). Compiler: Visual C++ (max speed)
CPU5: DEC Alpha 21164a 400 MHz with 64 MB RAM (8 KB Data-Cache) running OSF1 V4.0 (sieve size=8000)
CPU6: Intel Pentium III 450 MHz with 128 MB RAM (16 KB Data-Cache) running Linux 2.2.14 using gcc (i386 Linux/ELF) (sieve size=16384).
As you see, it is very important how well the code-optimizer and the caching logic of the cpu does. The sieve size are nearly optimal chosen for limits < 10^10.
CPU7: AMD Thunderbird 900 MHz with 256 MB RAM (64 KB 1st level Data-Cache) running Linux 2.2.19 using gcc 2.95.2 (i386 Linux/ELF) (sieve size=64000/64KB).
CPU8: PowerPC 970FX 2.5 GHz with 1.5 GB RAM running Mac OSX 10.3.8 using compiler IBM XLF 6.0 Advanced Edition (sieve size=32000 for limit <= 10^11 else the minimal necessary sieve size).
CPU9: AMD Athlon64 Winchester 2 GHz with 3 GB RAM (64 KB 1st level Data-Cache) running Linux 2.6.9 using gcc 3.4.2 (x86-64 Linux/ELF) (sieve size=65000).
Factor = ln(1/2 ln(Limit))+0.2615
The average access for each bit in the sieve is PROD(1-1/p) (Factor - SUM1/p) whereby the primes p in the sum and the product runs over the smallest not-bother-primes, --- here 2,3,5, resulting in 8/30(Factor -31/30).
BTW: Because the gaps between successive primes are <= 250 up to p=436273009 and <= 500 up to 304599508537, to hold a list of primes only their differences need to be stored in a byte.

For interested persons: the frequency of primes and a little problem for the mathematicians: estimate the magnitude of
SUMp<=n 1/p - ln(ln(n))-0.2614972128....
Hint: Hardy & Wright tell us O(1/ln(n)), but this seems to be too pessimisticly.


For intervals larger about 10^9, surely for those > 10^10, the Sieve of Eratosthenes is outperformed by the Sieve of Atkins and Bernstein which uses irreducible binary quadratic forms. See their paper for background informations as well as paragraph 5 of W. Galway's Ph.D. thesis.
Achim Flammenkamp
created: 1998-06-02 17:30 UTC+2
updated: 2005-05-19 17:10 UTC+2