Abstract and keywords
Abstract (English):
The article addresses the problem of finding copies of source code that duplicate a given one. To this end, existing approaches to code clone detection based on textual, lexical, syntactic, metric, and semantic analysis are reviewed. Based on a criterion-based comparison of these approaches, a new duplicate search method built on the random walk algorithm is proposed. The method constructs a graph for each of the two source codes (the nodes are the tokens of the text, and the edges are the links between them) and then applies the random walk algorithm to these graphs; the method is described in pseudocode. An experiment evaluates the performance of the method using the following metrics: the Jaccard coefficient; the differences in the number of edges, the number of vertices, and the average clustering of the graphs; the shortest path between their vertices; and the similarity between the graphs. The experimental scenarios compute these metrics for combinations of two source code instances, their union, and a portion of one of them. Conclusions are drawn about the applicability of both the method itself and each of the evaluation metrics.

Keywords:
information security, duplicate search, random walk, source code
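
The abstract describes the method only in outline, so the following Python sketch is an illustration under stated assumptions rather than the authors' pseudocode. It assumes a crude regular-expression tokenizer, treats "links between tokens" as adjacency in the text, and compares the sets of edges visited by two random walks using the Jaccard coefficient. All function names and parameters (tokenize, build_token_graph, random_walk_edges, walk_similarity, steps, seed) are hypothetical.

```python
import random
import re
from collections import defaultdict


def tokenize(source):
    # Assumption: a crude lexer that yields identifiers, integers,
    # and single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


def build_token_graph(tokens):
    # Nodes are distinct tokens. Assumption: an undirected edge links
    # two different tokens that appear next to each other in the text.
    graph = defaultdict(set)
    for token in tokens:
        graph[token]  # register the node even if it gets no edges
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def random_walk_edges(graph, steps, rng):
    # Walk the graph for a fixed number of steps and collect the
    # undirected edges that were traversed.
    visited = set()
    if not graph:
        return visited
    nodes = sorted(graph)
    node = rng.choice(nodes)
    for _ in range(steps):
        neighbours = sorted(graph[node])
        if not neighbours:
            node = rng.choice(nodes)  # restart from a random node
            continue
        nxt = rng.choice(neighbours)
        visited.add(frozenset((node, nxt)))
        node = nxt
    return visited


def jaccard(a, b):
    # Jaccard coefficient |A ∩ B| / |A ∪ B|; two empty sets count as equal.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def graph_size_difference(graph_a, graph_b):
    # Differences in the number of vertices and edges of two graphs.
    def edge_count(g):
        return sum(len(n) for n in g.values()) // 2
    return (abs(len(graph_a) - len(graph_b)),
            abs(edge_count(graph_a) - edge_count(graph_b)))


def walk_similarity(code_a, code_b, steps=5000, seed=0):
    # Compare two source codes by the edges their random walks visit.
    rng = random.Random(seed)
    graph_a = build_token_graph(tokenize(code_a))
    graph_b = build_token_graph(tokenize(code_b))
    edges_a = random_walk_edges(graph_a, steps, rng)
    edges_b = random_walk_edges(graph_b, steps, rng)
    return jaccard(edges_a, edges_b)
```

In this sketch, walk_similarity returns a value between 0 and 1: source codes whose token graphs share no edges score 0, and values closer to 1 indicate that the walks covered largely the same token links. The number of steps is a tuning parameter; too few steps leave parts of a graph unvisited and make the score noisy.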