PEP 3106 [1] changed the behavior of the dictionary `items` method.
In Python 2, `items` builds a real list of tuples, whereas `iteritems`
returns a lazy iterator. PEP 3106 changes Python 3's `items` method to
return a lightweight view, comparable in cost to Python 2's `iteritems`,
and removes `iteritems` from Python 3 entirely.
This patch switches to `items` on both versions. This could have a
negative impact on performance under Python 2, because `items` there
builds the full list of key/value tuples in memory.
[1] https://www.python.org/dev/peps/pep-3106/
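A minimal sketch of the portable pattern, for illustration only. The
`invert` helper and the sample data are hypothetical and not taken from
the patch:

```python
def invert(mapping):
    """Return a dict that maps each value of `mapping` back to its key."""
    inverted = {}
    # `items` exists on both Python 2 and Python 3. On Python 2 it builds
    # a list of tuples up front; on Python 3 it returns a view.
    for key, value in mapping.items():
        inverted[value] = key
    return inverted


# Hypothetical sample data: code point -> property name.
properties = {0x000D: 'CR', 0x000A: 'LF'}
by_name = invert(properties)
```

The loop body is identical on both interpreters; only the memory cost of
the `items` call differs.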
All strings are sequences of Unicode characters in Python 3. This is
entirely different from Python 2, whose strings are sequences of bytes.
However, Python 2 does have the concept of Unicode strings. This patch
changes the file reader to use the codecs module on Python 2 to
properly read a string into a unicode string. From there the strings
are meant to be equivalent on 2 and 3. The rest of the patch just
updates the code to work natively with unicode strings.
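A minimal sketch of reading a file into unicode strings with the codecs
module, assuming the input is UTF-8. The `read_lines` helper and the
sample line are hypothetical and not the code from the patch:

```python
import codecs
import os
import tempfile


def read_lines(path):
    """Read `path` as UTF-8 and return its lines as unicode strings."""
    # codecs.open decodes while reading, so the result is `unicode` on
    # Python 2 and `str` on Python 3.
    with codecs.open(path, encoding='utf-8', errors='strict') as f:
        return [line.rstrip(u'\n') for line in f]


# Write a small sample file as raw bytes, then read it back decoded.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(u'\u00f7 0020 \u00d7 0308 \u00f7\n'.encode('utf-8'))
lines = read_lines(sample_path)
os.remove(sample_path)
```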
To test the class `GraphemeClusterBreakPropertyTable`:
$ python2 utils/gyb --test \
-DunicodeGraphemeBreakPropertyFile=./utils/UnicodeData/GraphemeBreakProperty.txt \
-DunicodeGraphemeBreakTestFile=./utils/UnicodeData/GraphemeBreakTest.txt \
-DCMAKE_SIZEOF_VOID_P=8 \
-o /tmp/UnicodeExtendedGraphemeClusters.cpp.2.7.tmp \
./stdlib/public/stubs/UnicodeExtendedGraphemeClusters.cpp.gyb
$ python3 utils/gyb --test \
-DunicodeGraphemeBreakPropertyFile=./utils/UnicodeData/GraphemeBreakProperty.txt \
-DunicodeGraphemeBreakTestFile=./utils/UnicodeData/GraphemeBreakTest.txt \
-DCMAKE_SIZEOF_VOID_P=8 \
-o /tmp/UnicodeExtendedGraphemeClusters.cpp.3.5.tmp \
./stdlib/public/stubs/UnicodeExtendedGraphemeClusters.cpp.gyb
$ diff -u /tmp/UnicodeExtendedGraphemeClusters.cpp.2.7.tmp \
/tmp/UnicodeExtendedGraphemeClusters.cpp.3.5.tmp
To test the method `get_grapheme_cluster_break_tests_as_UTF8`:
$ python2 utils/gyb --test \
-DunicodeGraphemeBreakPropertyFile=./utils/UnicodeData/GraphemeBreakProperty.txt \
-DunicodeGraphemeBreakTestFile=./utils/UnicodeData/GraphemeBreakTest.txt \
-DCMAKE_SIZEOF_VOID_P=8 \
-o /tmp/UnicodeGraphemeBreakTest.cpp.2.7.tmp \
./unittests/Basic/UnicodeGraphemeBreakTest.cpp.gyb
$ python3 utils/gyb --test \
-DunicodeGraphemeBreakPropertyFile=./utils/UnicodeData/GraphemeBreakProperty.txt \
-DunicodeGraphemeBreakTestFile=./utils/UnicodeData/GraphemeBreakTest.txt \
-DCMAKE_SIZEOF_VOID_P=8 \
-o /tmp/UnicodeGraphemeBreakTest.cpp.3.5.tmp \
./unittests/Basic/UnicodeGraphemeBreakTest.cpp.gyb
$ diff -u /tmp/UnicodeGraphemeBreakTest.cpp.2.7.tmp \
/tmp/UnicodeGraphemeBreakTest.cpp.3.5.tmp
trie parameters and fix a few bugs
The bugs did not affect the correctness of the particular trie instance
created for the grapheme cluster property, because the trie parameters that
were confused with each other happened to be equal.
Also, fix a trie size bug: we were creating a trie large enough to store
information for 0x200000 code points, but the valid range is only 0..0x10ffff
(0x110000 code points). The fix saved only 15 bytes in the grapheme cluster
trie, because that extra information was compressed together with some
supplementary planes that also had default values. The fix also improved trie
generation time by almost 2x.
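The size difference can be sketched with a little arithmetic. The block size
below (256 code points per block) is an assumption chosen for illustration,
not the parameter the generator actually uses:

```python
MAX_CODE_POINT = 0x10FFFF
CODE_POINT_COUNT = MAX_CODE_POINT + 1   # 0x110000 valid code points
OVERSIZED_COUNT = 0x200000              # range covered before the fix

BLOCK_BITS = 8  # hypothetical: 256 code points per trie block


def block_count(code_points, block_bits):
    """Number of fixed-size blocks needed to cover `code_points` entries."""
    block_size = 1 << block_bits
    return (code_points + block_size - 1) // block_size


needed = block_count(CODE_POINT_COUNT, BLOCK_BITS)
oversized = block_count(OVERSIZED_COUNT, BLOCK_BITS)
```

Under this assumption the oversized trie has to process 8192 blocks instead of
4352, which is in line with the generation-time difference. Blocks that hold
only default values can be shared, which would explain why the stored size
barely changed.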
Swift SVN r19457
algorithm
The implementation uses a specialized trie that has not been tuned to the table
data. I tried guessing parameter values that should work well, but did not do
any performance measurements.
There is no efficient way to initialize arrays with static data in Swift. The
required tables are being generated as C++ code in the runtime library.
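As a rough illustration of generating a static table as C++ source, here is a
sketch in Python. The `format_cpp_array` helper, the `uint8_t` element type
and the array name are hypothetical and not the generator's actual output
format:

```python
def format_cpp_array(name, values, per_line=8):
    """Render `values` as the source text of a static C++ byte array."""
    lines = ['static const uint8_t %s[%d] = {' % (name, len(values))]
    for start in range(0, len(values), per_line):
        chunk = values[start:start + per_line]
        lines.append('  ' + ', '.join('0x%02x' % v for v in chunk) + ',')
    lines.append('};')
    return '\n'.join(lines)


# Hypothetical three-entry table, two entries per line.
source = format_cpp_array('kSampleTable', [1, 2, 3], per_line=2)
```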
rdar://16013860
Swift SVN r19340